The most commonly used microarrays for mRNA profiling (Affymetrix) include probe sets, each consisting of a series of perfect match and mismatch probes (typically 22 oligonucleotides per probe set). A growing number of probe set algorithms have been reported; they differ in how they interpret a probe set to derive a single normalized "signal" that represents the expression of each mRNA. These algorithms are known to differ in accuracy and sensitivity, and they have been optimized using a small set of standardized control microarray data.
We hypothesized that different mRNA profiling projects have varying sources and degrees of confounding noise, and that these should influence the choice of a specific probe set algorithm. We also hypothesized that using the Microarray Suite (MAS) 5.0 probe set detection p-value as a weighting function would improve the performance of all probe set algorithms.
Figure: Permutation study framework using unsupervised clustering in HCE2W (the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure). Inputs to the Hierarchical Clustering Explorer are two files: a signal data file and a p-value file. Each column of the two input files holds the values for one sample (or chip), and the known target biological group index is assigned to each column of the signal data file. Success is measured using the F-measure of a dendrogram against the known biological grouping.
Note: HCE 3.0 is a newer version, which has all functions in HCE2W.
For each probe set signal algorithm (e.g., MAS5, dChip, or RMA), you have to prepare two files: a probe set signal file and a probe set detection p-value file. As the figure shows, the probe set detection p-value file from MAS5 can also be used with the signal files generated by probe set signal algorithms other than MAS5.
The two files should be in the same folder. The extension of the detection p-value file should be pvl. Please refer to the following example.
1. Using Excel files
If the signal file name is mah-mas5.xls, the detection p-value file name should be mah-mas5.pvl.xls.
2. Using tab delimited text files
If the signal file name is mah-mas5.exp, the detection p-value file name should be mah-mas5.pvl.
Example: Please take a close look at the small example input files (mah-mas5-small.exp and mah-mas5-small.pvl) in mah-mas5-small.zip. They contain 4808 probe sets and 40 chips, filtered from the PGA Murine Airway Hyperresponsiveness project using a very stringent present call filter.
Please note that the rows and columns of the detection p-value file must be in the same order as in the signal file.
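The naming rule above can be sketched in Python. This is only an illustration: the helper name `pvalue_file_for` is hypothetical and is not part of HCE, and it assumes that any signal file not ending in .xls is a tab delimited text file.

```python
from pathlib import Path


def pvalue_file_for(signal_file: str) -> str:
    """Return the detection p-value file name expected next to a signal file.

    Hypothetical helper that only mirrors the naming rule described above.
    """
    path = Path(signal_file)
    if path.suffix.lower() == ".xls":
        # Excel input: mah-mas5.xls -> mah-mas5.pvl.xls
        return str(path.with_name(path.stem + ".pvl.xls"))
    # Tab delimited text input (assumed for every other extension):
    # mah-mas5.exp -> mah-mas5.pvl
    return str(path.with_name(path.stem + ".pvl"))


print(pvalue_file_for("mah-mas5.xls"))
print(pvalue_file_for("mah-mas5.exp"))
```

Because `with_name` only changes the last path component, the p-value file stays in the same folder as the signal file, as required.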

Affymetrix noise calculations give us two outputs: one is the continuous detection p-value, and the other is a simple detection call (present/absent). Each signal intensity value has a confidence factor, the detection p-value, which contributes to determining the detection call for the corresponding probe set. When the probe set detection p-value reaches a certain level of significance, the probe set is assigned a "present" call, while all probe sets with less robust signal/noise ratios are assigned an "absent" call (see Affymetrix.com, login required, for more detail). This enables the use of a present call threshold noise filter. We reported that a 10% present call noise filter improved the performance of probe set signal algorithms. While such present call-based filtering improves performance, it is clearly an arbitrary threshold method, so potentially important signals conveyed by some probe sets may well be filtered out.
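A present call threshold filter of this kind might look as follows in Python. This is a sketch under stated assumptions, not the filter implemented in HCE: it assumes that a probe set is called present on a chip when its detection p-value is below 0.04 (the usual MAS 5.0 default cutoff), and that a "10% present call filter" keeps probe sets that are present on at least 10% of the chips. The function name `present_call_filter` and its parameters are hypothetical.

```python
def present_call_filter(pvalues, alpha=0.04, min_present_fraction=0.10):
    """Return the indices of probe sets that pass the present call filter.

    pvalues: list of rows, one row per probe set, one detection p-value
             per chip.
    alpha: p-value cutoff for a "present" call (assumed value).
    min_present_fraction: minimum fraction of chips with a present call
             (assumed interpretation of the "10%" filter).
    """
    kept = []
    for index, row in enumerate(pvalues):
        present = sum(1 for p in row if p < alpha)
        if present >= min_present_fraction * len(row):
            kept.append(index)
    return kept


# Three probe sets measured on four chips.
example = [
    [0.001, 0.50, 0.70, 0.90],  # present on one chip: kept
    [0.200, 0.50, 0.70, 0.90],  # never present: dropped
    [0.010, 0.02, 0.03, 0.50],  # present on three chips: kept
]
print(present_call_filter(example))
```

The hard cutoff is what makes this approach arbitrary: a probe set just above the threshold on every chip is discarded entirely, which is the motivation for using the continuous p-value as a weight instead.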
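There are many possible similarity measures for unsupervised clustering methods, and it is also possible to develop weighted versions of most similarity measures. For example, we can derive a weighted Pearson correlation coefficient from the Pearson correlation coefficient that has been widely used in microarray analysis. Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be the vectors representing two arrays to be compared (these values are prepared in the .exp or .xls files), and let $p^x = (p^x_1, \ldots, p^x_n)$ and $p^y = (p^y_1, \ldots, p^y_n)$ be the vectors representing continuous probe set detection p-values for $x$ and $y$ respectively. (These p-values are prepared in the .pvl or .pvl.xls files.) Then the weighted Pearson correlation coefficient is given by

$$r_w(x, y) = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}\,\sqrt{\sum_{i=1}^{n} w_i (y_i - \bar{y}_w)^2}},$$

where

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i},$$

and $w_i$ is the weight of probe set $i$, computed from $p^x_i$ and $p^y_i$ so that confidently detected probe sets (small p-values) count more, for example $w_i = (1 - p^x_i)(1 - p^y_i)$.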
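A minimal Python sketch of a weighted Pearson correlation follows. It assumes that each probe set i receives the weight (1 - px_i) * (1 - py_i), where px_i and py_i are the detection p-values of the two arrays; this weighting is chosen for illustration and is not necessarily the exact one used in HCE2W. The function name `weighted_pearson` is hypothetical.

```python
import math


def weighted_pearson(x, y, px, py):
    """Weighted Pearson correlation of two arrays of probe set signals.

    x, y: signal values of the two arrays.
    px, py: detection p-values matching x and y, in the same order.
    Assumed weighting: w_i = (1 - px_i) * (1 - py_i).
    """
    w = [(1.0 - a) * (1.0 - b) for a, b in zip(px, py)]
    total = sum(w)
    if total == 0:
        raise ValueError("all weights are zero")
    # Weighted means of the two arrays.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances (normalization cancels out).
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("an array has zero weighted variance")
    return cov / math.sqrt(var_x * var_y)


# The fourth probe set is an outlier, but its p-value of 1.0 gives it
# zero weight, so only the first three probe sets contribute.
r = weighted_pearson([1, 2, 3, 10], [1, 2, 3, -10],
                     [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
print(r)
```

When all p-values are 0, every weight is 1 and the function reduces to the ordinary Pearson correlation coefficient.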
We applied the F-measure to the entire hierarchical structure of the clustering results and also to the set of clusters determined by the minimum similarity threshold in HCE2W. Let $C_1, \ldots, C_i, \ldots, C_m$ be the right clusters according to the target biological variable. Let $H_1, \ldots, H_j, \ldots, H_k$ be the clusters from the hierarchical clustering results. In the F-measure, each cluster is considered a query and each class (or each correct cluster) is considered the correct answer of the query. The F-measure of a correct cluster (or a class) $C_i$ and an actual cluster $H_j$ is defined as follows:

$$F(C_i, H_j) = \frac{2\,P(C_i, H_j)\,R(C_i, H_j)}{P(C_i, H_j) + R(C_i, H_j)},$$

where

$$P(C_i, H_j) = \frac{|C_i \cap H_j|}{|H_j|}, \qquad R(C_i, H_j) = \frac{|C_i \cap H_j|}{|C_i|}.$$

The precision values $P(C_i, H_j)$ and recall values $R(C_i, H_j)$ are defined by the information retrieval concepts. The F-measure of a class $C_i$ is given by

$$F(C_i) = \max_{j} F(C_i, H_j).$$

Finally, the F-measure of the entire clustering result is given by

$$F = \sum_{i=1}^{m} \frac{|C_i|}{N}\, F(C_i),$$

where $N$ is the total number of arrays in the experiment.
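The F-measure computation can be sketched in Python as follows, using the standard information retrieval definitions of precision and recall. The function name `f_measure` and the input layout (a dict of classes and a list of clusters, each a set of sample ids) are assumptions made for this illustration. To score a whole dendrogram, `clusters` would presumably hold the set of samples under every node of the tree.

```python
def f_measure(classes, clusters):
    """Overall F-measure of a clustering against known classes.

    classes: dict mapping a class label to the set of sample ids in it.
    clusters: list of sets of sample ids produced by the clustering.
    """
    total = sum(len(members) for members in classes.values())
    overall = 0.0
    for members in classes.values():
        best = 0.0
        for cluster in clusters:
            overlap = len(members & cluster)
            if overlap == 0:
                continue  # precision and recall are both zero
            precision = overlap / len(cluster)
            recall = overlap / len(members)
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Each class contributes its best F value, weighted by its size.
        overall += len(members) / total * best
    return overall


classes = {"A": {0, 1}, "B": {2, 3}}

# A dendrogram that separates the two classes perfectly.
perfect = [{0}, {1}, {2}, {3}, {0, 1}, {2, 3}, {0, 1, 2, 3}]
# A dendrogram that mixes the classes at the first level.
mixed = [{0}, {1}, {2}, {3}, {0, 2}, {1, 3}, {0, 1, 2, 3}]

print(f_measure(classes, perfect))
print(f_measure(classes, mixed))
```

In the mixed case, the best match for each class is the root cluster (precision 0.5, recall 1), so the overall F-measure drops to 2/3.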
In the final clustering result visualization, each sample name is color-coded by its biological class as shown in the figure at the top. Overall F-measure is highlighted with a pink oval. The F-measure distribution is shown, as the distance from the left side, over the dendrogram display as indicated by an arrow mark.
We used HCE 3.0 (HCE2W) to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) to noise (confounding uncontrolled variables). Five probe set algorithms were studied with and without statistical weighting of probe sets using the Microarray Suite (MAS) 5.0 probe set detection p-values. The signal/noise optimization method was tested in two large novel microarray datasets with different levels of confounding noise: a 105-sample U133A human muscle biopsy dataset (11 groups; mutation-defined; extensive noise) and a 40-sample U74A inbred mouse lung dataset (8 groups; little noise). Performance was measured by the ability of each probe set algorithm, with and without detection p-value weighting, to cluster samples into the appropriate biological groups (unsupervised agglomerative clustering with F-measure values).
Probe set detection p-value weighting had the greatest positive effect on the performance of the dChip difference model, ProbeProfiler, and RMA algorithms. Importantly, probe set algorithms did indeed perform differently depending on the specific project, likely due to the degree of confounding noise. Our data indicate that significantly improved analysis of mRNA profiling projects can be achieved by matching the choice of probe set algorithm to the noise levels intrinsic to a project.
The following graph shows the external evaluation results using the F-measure of unsupervised clustering for the human muscular dystrophy data and the mouse lung biopsy data. The "no-wt" bars represent the results without MAS 5.0 detection p-value weighting, and the "wt" bars represent the results with p-value weighting.

For more information, please refer to the following paper.
Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, Eric P Hoffman, "Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays," Bioinformatics, Vol. 20, pp. 2534-2544, 2004.
HCE is a standalone Windows application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (James Poulos, jpoulos@umd.edu).
Register and Download HCE 3.0 version (released on March 29, 2004)
System requirements
Intel® Pentium® processor
Microsoft® Windows 2000® or Windows XP®