The most commonly used microarrays for mRNA profiling (Affymetrix) include probe sets consisting of a series of perfect match and mismatch probes (typically 22 oligonucleotides per probe set). A growing number of published probe set algorithms differ in how they interpret a probe set to derive a single normalized "signal" representing the expression of each mRNA. These algorithms are known to differ in accuracy and sensitivity, and they have been optimized using a small set of standardized control microarray data.
We hypothesized that different mRNA profiling projects have varying sources and degrees of confounding noise, and that these should influence the choice of a specific probe set algorithm. We also hypothesized that using the Microarray Suite (MAS) 5.0 probe set detection p value as a weighting function would improve the performance of all probe set algorithms.
Permutation Study Framework using Unsupervised Clustering in HCE2W (the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure). The inputs to the Hierarchical Clustering Explorer are two files: a signal data file and a p-value file. Each column of the two input files holds the values for one sample (or chip), and the known target biological group index is assigned to each column of the signal data file. Success is measured by the F-measure between the dendrogram and the known biological grouping.
Note: HCE 3.0 is a newer version, which has all the functions of HCE2W.
You have to prepare two files, a probe set signal file and a probe set detection p-value file, for each probe set signal algorithm (e.g., MAS5, dChip, or RMA). As the figure shows, the probe set detection p-value file from MAS5 can be used with the signal files generated by any of the other probe set signal algorithms.
The two files should be in the same folder. The extension of the detection p-value file should be pvl. Please refer to the following examples.
1. Using Excel files
If the signal file name is mah-mas5.xls, the detection p-value file name should be mah-mas5.pvl.xls.
2. Using tab delimited text files
If the signal file name is mah-mas5.exp, the detection p-value file name should be mah-mas5.pvl.
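The naming convention above can be expressed as a small helper. This is only an illustrative sketch in Python; the function name `pvalue_file_for` is hypothetical and is not part of HCE.

```python
from pathlib import Path


def pvalue_file_for(signal_file: str) -> Path:
    """Return the detection p-value file expected next to a signal file.

    Hypothetical helper that mirrors the naming convention described above.
    """
    path = Path(signal_file)
    if path.suffix.lower() == ".xls":
        # Excel input: mah-mas5.xls -> mah-mas5.pvl.xls
        return path.with_name(path.stem + ".pvl.xls")
    # Tab-delimited text input: mah-mas5.exp -> mah-mas5.pvl
    return path.with_suffix(".pvl")
```

Because only the file name is changed, the returned path stays in the same folder as the signal file, as required.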
Example: Please take a close look at these small example input files (mah-mas5-small.exp and mah-mas5-small.pvl) in mah-mas5-small.zip. There are 4808 probe sets and 40 chips. The data were filtered from the PGA Murine Airway Hyperresponsiveness project using a very stringent present call filter.
Please note that the order of rows and columns in the detection p-value file must be the same as in the signal file.
Affymetrix noise calculations give us two outputs: one is the continuous detection p value assignment, and the other is a simple detection call (present/absent). Each signal intensity value has a confidence factor, the detection p-value, which determines the detection call for the corresponding probe set. When the probe set detection p-value reaches a certain level of significance, the probe set is assigned a "present" call, while all probe sets with less robust signal/noise ratios are assigned an "absent" call (see Affymetrix.com, login required, for more detail). This enables the use of a present call threshold noise filter. We reported that a 10% present call noise filter did improve the performance of probe set signal algorithms. While such present call-based filtering improves performance, the threshold is clearly arbitrary, and it is therefore quite possible that potentially important signals conveyed by the probe sets are filtered out.
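A present call filter of this kind can be sketched as follows. This is a minimal illustration, not the HCE implementation. It assumes that a probe set is called "present" on a chip when its detection p-value is below 0.04 (our understanding of the MAS 5.0 default, exposed here as the `present_alpha` parameter), and that a "10% present call filter" keeps probe sets that are present on at least 10% of the chips. The function and parameter names are hypothetical.

```python
def present_call_filter(pvalues, present_alpha=0.04, min_present_fraction=0.10):
    """Return the row indices of probe sets that pass the present call filter.

    pvalues: list of rows, one row per probe set; each row is a list of
             detection p-values, one per chip.
    present_alpha: assumed p-value cutoff below which a call is "present".
    min_present_fraction: minimum fraction of chips with a "present" call.
    """
    kept = []
    for row_index, row in enumerate(pvalues):
        if not row:
            continue  # skip empty rows rather than divide by zero
        present = sum(1 for p in row if p < present_alpha)
        if present / len(row) >= min_present_fraction:
            kept.append(row_index)
    return kept
```

The hard cutoff is what makes this filter arbitrary: a probe set just under the fraction threshold is discarded entirely, which is the motivation for using the continuous p-values as weights instead.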
There are many possible similarity measures for unsupervised clustering methods, and it is also possible to develop weighted versions of most similarity measures. For example, a weighted Pearson correlation coefficient can be derived from the Pearson correlation coefficient, which is widely used in microarray analysis. Let $x$ and $y$ be the vectors representing two arrays to be compared (these values are prepared in the .exp or .xls files), and let $p_x$ and $p_y$ be the vectors representing the continuous probe set detection p-values for $x$ and $y$ respectively (these p-values are prepared in the .pvl or .pvl.xls files). Then the weighted Pearson correlation coefficient is given by

$$r_w(x, y) = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}},$$

where $\bar{x}_w = \sum_i w_i x_i / \sum_i w_i$, $\bar{y}_w = \sum_i w_i y_i / \sum_i w_i$, and $w_i$ is the weight of the $i$-th probe set, computed from the complements $1 - p_{x,i}$ and $1 - p_{y,i}$ of the detection p-values. We use the complement of the detection p-value to calculate the weight for each term because the smaller the p-value is, the more significant the signal is. Other similarity measures such as Euclidean distance, Manhattan distance, and cosine coefficient can be extended to weighted versions in the same way as the weighted Pearson correlation coefficient. In HCE, we can check the option checkbox (highlighted with a red oval in the following figure) to use the MAS 5.0 detection p-values as weights for distance/similarity measures.
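The following Python sketch shows one way a weighted Pearson correlation of this kind could be computed for two arrays. It is an illustration under stated assumptions, not the code used in HCE: in particular, combining the two complements into a single per-probe-set weight as their product, (1 - p_x) * (1 - p_y), is an assumption made here, and the function name is hypothetical.

```python
import math


def weighted_pearson(x, y, px, py):
    """Weighted Pearson correlation between two arrays of signal values.

    x, y:   signal values of the two arrays, one entry per probe set.
    px, py: detection p-values for x and y, in the same probe set order.
    """
    # ASSUMPTION: the per-probe-set weight is the product of the complements
    # of the two detection p-values.
    w = [(1.0 - a) * (1.0 - b) for a, b in zip(px, py)]
    total = sum(w)
    if total == 0:
        raise ValueError("all weights are zero")

    # Weighted means of the two arrays.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total

    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("weighted variance is zero")
    return cov / math.sqrt(var_x * var_y)
```

When every detection p-value is 0, all weights equal 1 and the result reduces to the ordinary Pearson correlation coefficient; a probe set with a p-value of 1 on either array contributes nothing.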
We applied the F-measure to the entire hierarchical structure of clustering results and also to the set of clusters determined by the minimum similarity threshold in HCE2W. Let $C_1, \ldots, C_i, \ldots, C_n$ be the right clusters according to the target biological variable. Let $H_1, \ldots, H_j, \ldots, H_m$ be the clusters from the hierarchical clustering results. In the F-measure, each cluster is considered a query and each class (or each correct cluster) is considered the correct answer of the query. The F-measure of a correct cluster (or a class) $C_i$ and an actual cluster $H_j$ is defined as follows:

$$F(C_i, H_j) = \frac{2 \, P(C_i, H_j) \, R(C_i, H_j)}{P(C_i, H_j) + R(C_i, H_j)},$$

where $P(C_i, H_j) = |C_i \cap H_j| / |H_j|$ and $R(C_i, H_j) = |C_i \cap H_j| / |C_i|$.

The precision values $P$ and recall values $R$ are defined as in information retrieval. The F-measure of a class $C_i$ is given by

$$F(C_i) = \max_j F(C_i, H_j).$$

Finally, the F-measure of the entire clustering result is given by

$$F = \sum_{i=1}^{n} \frac{|C_i|}{N} F(C_i),$$

where $N$ is the total number of arrays in the experiment. The F-measure score is between 0 and 1. The higher the F-measure score is, the better the clustering result is. When we calculate the F-measure for the entire cluster hierarchy, for each external class we traverse the hierarchy recursively and consider each subtree as a cluster. Then the F-measure for an external class is the maximum of the F-measures over all subtrees.
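The recursive traversal described above can be sketched in Python as follows. This is a minimal illustration and not the HCE implementation. It assumes the dendrogram is given as nested tuples whose leaves are sample names, and that `labels` maps each sample name to its known biological class; both representations and the function name are hypothetical.

```python
from collections import Counter


def hierarchy_f_measure(tree, labels):
    """Overall F-measure of a cluster hierarchy against known classes.

    tree:   nested tuples; every non-tuple value is a leaf (sample name).
    labels: dict mapping each sample name to its known class.
    """
    class_sizes = Counter(labels.values())
    n_arrays = sum(class_sizes.values())
    best = {cls: 0.0 for cls in class_sizes}  # best F-measure found per class

    def visit(node):
        # Count how many samples of each class fall under this subtree.
        if isinstance(node, tuple):
            counts = Counter()
            for child in node:
                counts += visit(child)
        else:
            counts = Counter([labels[node]])
        size = sum(counts.values())
        # Treat this subtree as a cluster and score it against every class
        # that it overlaps (classes with no overlap would score 0).
        for cls, overlap in counts.items():
            precision = overlap / size
            recall = overlap / class_sizes[cls]
            f = 2 * precision * recall / (precision + recall)
            best[cls] = max(best[cls], f)
        return counts

    visit(tree)
    # Weight each class's best F-measure by its share of the arrays.
    return sum(class_sizes[cls] / n_arrays * best[cls] for cls in class_sizes)
```

For example, a dendrogram in which each class occupies exactly one subtree scores 1, while a dendrogram that mixes the classes at every level scores lower.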
In the final clustering result visualization, each sample name is color-coded by its biological class as shown in the figure at the top. Overall F-measure is highlighted with a pink oval. The F-measure distribution is shown, as the distance from the left side, over the dendrogram display as indicated by an arrow mark.
We used HCE 3.0 (HCE2W) to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) versus noise (confounding uncontrolled variables). Five probe set algorithms were studied with and without statistical weighting of probe sets using the Microarray Suite (MAS) 5.0 probe set detection p values. The signal/noise optimization method was tested in two large novel microarray datasets with different levels of confounding noise: a 105-sample U133A human muscle biopsy data set (11 groups; mutation-defined; extensive noise), and a 40-sample U74A inbred mouse lung data set (8 groups; little noise). Performance was measured by the ability of the specific probe set algorithm, with and without detection p value weighting, to cluster samples into the appropriate biological groups (unsupervised agglomerative clustering with F-measure values).
Probe set detection p-value weighting had the greatest positive effect on the performance of the dChip difference model, ProbeProfiler, and RMA algorithms. Importantly, probe set algorithms did indeed perform differently depending on the specific project, likely due to the degree of confounding noise. Our data indicate that significantly improved data analysis of mRNA profiling projects can be achieved by matching the choice of probe set algorithm to the noise levels intrinsic to a project.
The following graph shows the external evaluation results using F-measure of unsupervised clustering for the human muscular dystrophy data and the mouse lung biopsy data. "no-wt" bar represents the result without MAS 5.0 detection p-value weighting, and "wt" bar represents the result with p-value weighting.
For more information, please refer to the following paper.
Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, Eric P Hoffman, "Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays," Bioinformatics, Vol. 20, pp. 2534-2544, 2004.
HCE is a standalone Windows application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (James Poulos, email@example.com).
Register and Download HCE 3.0 version (released on March 29, 2004)
Intel® Pentium® processor
Microsoft® Windows 2000® or Windows XP®