Key words: DeCyder EDA • extended data analysis • workspace • base set • differential expression analysis • principal components analysis • pattern analysis • discriminant analysis • Ettan DIGE
DeCyder™ Extended Data Analysis (EDA) Software was used to perform advanced statistics on samples from a study on human ovarian cancer. The known samples could be clearly separated into normal, benign, and malignant, agreeing with classifications made by pathologists. The classifier created from the known samples was used to classify the unknown samples. All unknown spot maps were classified correctly, except for one from a poorly cast gel. The incorrectly classified sample was a duplicate of a sample that had been correctly classified.
The Ettan™ 2-D Difference Gel Electrophoresis (DIGE) system has greatly reduced system variation, to a level that allows the detection of minor expression changes of as little as 10%.
Univariate analysis within DeCyder 2-D Differential Analysis Software (DeCyder 2-D)—Student’s t-test, one-way Analysis Of Variance (ANOVA), and two-way ANOVA—allows the identification of proteins that have altered expression with a given statistical confidence, as well as the interaction of two conditions.
Nevertheless, there might be remaining questions:
• How many groups or classes exist in a given data set?
• Are there proteins or spots that behave similarly to a given protein or spot (i.e. co-regulation)?
• Are there proteins that might be used for the development of noninvasive tests (i.e. diagnostic markers)?
• Are there proteins or protein patterns that might be characteristic of a biological state (e.g. tumor versus normal tissue)?
This application note describes the use of DeCyder Extended Data Analysis Software (DeCyder EDA) to answer these remaining questions, using an example from a human cancer study.
The analysis workflow for DeCyder EDA is shown in Figure 1.
DeCyder EDA uses a set of data for analysis. A set is a group of spot maps with matched spots—a group of spot maps and proteins. A set of data can be displayed in several ways depending on the context, for example as a heat map where each row represents a protein and each column represents a spot map.
The original data set consists of the Biological Variance Analysis (BVA) workspaces imported and linked in the DeCyder EDA workspace. Before any analyses can be performed, an EDA workspace and a base set must be created from the BVA workspaces. Setup consists of three main steps:
1. Workspace: creating an EDA workspace by importing and linking BVA workspaces.
2. Experimental design: assigning experimental groups and conditions for the different samples included in the EDA workspace.
3. Base set creation: creating the base set automatically or manually by filtering and normalization of the data.
After setup is finished and a base set is created, calculations are enabled. The different calculation methods are divided into four main groups:
• Differential expression analysis
• Principal components analysis (1–3)
• Pattern analysis (4–11)
• Discriminant analysis (12–16)
There are a number of subanalyses within each of the calculation groups. The selected calculations can be performed in any order.
Differential expression analysis is performed by applying Student’s t-test or one-way ANOVA for a multiple-comparison test. The results from the differential expression analysis are used to reduce the data set, for example limiting it to proteins that show changes in expression level. In this study we used one-way ANOVA.
Principal components analysis (PCA) helps identify underlying sources of variation and gives a first impression of whether, and how well, groups and classes can be separated. This type of analysis is extremely sensitive to outliers and might therefore help to identify possible mismatches. In this study each patient sample was run in duplicate on separate gels that included dye swapping. The two spot maps from the same patient should appear close to each other in the diagrams.
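The duplicate check described above can be illustrated outside the software with a minimal sketch. It assumes a matrix with one row per spot map and one column per protein, and computes PCA by singular value decomposition in NumPy. The data are synthetic, and the group names, patient counts, and noise levels are invented for illustration; this is not the algorithm used by DeCyder EDA.

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Return PCA scores for the rows of data and the fraction of variance
    explained by each retained component."""
    centered = data - data.mean(axis=0)              # center each protein (column)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # coordinates of each spot map
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic example (all values hypothetical): two groups, four patients each,
# every patient run as two spot maps that differ only by small technical noise.
rng = np.random.default_rng(0)
n_proteins = 50
group_means = {"A": np.zeros(n_proteins),
               "B": np.r_[np.ones(20), np.zeros(30)]}
rows, groups = [], []
for name, mean in group_means.items():
    for _ in range(4):
        patient = mean + rng.normal(0, 0.2, n_proteins)
        for _ in range(2):
            rows.append(patient + rng.normal(0, 0.05, n_proteins))
            groups.append(name)
data = np.array(rows)
groups = np.array(groups)

scores, explained = pca_scores(data)
# Distance between the two spot maps of each patient in the score plot.
dup_dist = np.linalg.norm(scores[0::2] - scores[1::2], axis=1)
# Distance between the two group centroids in the score plot.
centroid_dist = np.linalg.norm(scores[groups == "A"].mean(axis=0)
                               - scores[groups == "B"].mean(axis=0))
print("largest duplicate distance:", dup_dist.max())
print("distance between group centroids:", centroid_dist)
```

A duplicate pair lying far apart in such a score plot would be a candidate for a mismatch or a poor gel.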
Pattern analysis finds patterns in the expression profiles in the EDA data without any prior information about the variables. Items with similar expression profiles—such as proteins, spot maps, and experimental groups—are clustered. In this study we applied two types of unsupervised clustering:
• Hierarchical clustering, which is displayed as a heat map with dendrogram, showing if and how many different classes exist in the data set
• K-means clustering, which shows clusters of proteins with similar expression patterns
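As an illustration of the hierarchical clustering step, the following sketch clusters synthetic spot maps with SciPy. Average linkage with Euclidean distance is an assumption made for this example; the linkage and distance settings used in the study are not stated here, and the data are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data (hypothetical): 5 + 5 spot maps, 30 proteins,
# with the second group shifted upward in every protein.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.2, size=(5, 30))
group_b = rng.normal(1.0, 0.2, size=(5, 30))
spot_maps = np.vstack([group_a, group_b])

# Build the tree (the basis of a dendrogram) from pairwise Euclidean distances.
tree = linkage(spot_maps, method="average", metric="euclidean")

# Cut the tree into at most two clusters and report the label of each spot map.
clusters = fcluster(tree, t=2, criterion="maxclust")
print(clusters)
```

Cutting the same tree at different heights gives a quick view of how many classes the data support.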
Discriminant analysis identifies markers, and creates a classifier. The classifier is used to classify unknowns. This analysis also helps find proteins that might be useful for the development of a noninvasive diagnostic test.
Based on the results from the different calculations, new sets can be created and new calculations and biological interpretation can be performed.
DeCyder Extended Data Analysis Software, one network user license 28-4012-03
DeCyder 2-D Differential Analysis Software v6.5, preinstalled network 28-4012-01
(including PC and single concurrent network user license)
Original data set
The original data set was a BVA workspace from a study on human ovarian cancer. All samples were classified by pathologists into one of the following three groups: normal, benign, or malignant. A subset of samples from 18 patients (three normal, four benign, 11 malignant) was run in duplicate and used to perform all subsequent analysis. For the classifier, nine (one as duplicate) “unknown” patients were used to verify the analysis, as well as the classification made by the pathologists.
We created our EDA workspace by importing the BVA workspace from the cancer study. No statistical values from the BVA workspace were copied into the EDA workspace. Standard spot maps from the master and matched gels (to provide matching information) were copied into the EDA workspace but were not included in the actual analysis.
In the BVA workspace we imported, an experimental design was already defined. This was transferred to the EDA workspace.
Base set creation
The starting material (biopsy tissue) showed large variation, from both inherent biological variation and sample preparation. As a result, only a small number of proteins were present and matched in all spot maps.
For our cancer study we decided to filter out unassigned spot maps (master image and other spot maps not assigned to given groups), as well as proteins present in less than 80% of the spot maps. The remaining spot maps and proteins comprised our base set (Fig 2).
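The presence filter can be expressed in a few lines. This is a minimal sketch that assumes unmatched or missing spots are encoded as NaN in a proteins-by-spot-maps matrix; the function name and the example values are hypothetical.

```python
import numpy as np

def filter_by_presence(abundance, min_fraction=0.8):
    """Keep proteins (rows) that have a value in at least min_fraction of the
    spot maps (columns). Missing matches are assumed to be encoded as NaN."""
    present = ~np.isnan(abundance)
    fraction = present.mean(axis=1)      # share of spot maps containing each protein
    keep = fraction >= min_fraction
    return abundance[keep], keep

abundance = np.array([
    [1.0, 1.2, 0.9, 1.1, 1.0],           # present in 5 of 5 spot maps
    [0.5, np.nan, 0.6, 0.4, 0.5],        # present in 4 of 5 (80%)
    [2.0, np.nan, np.nan, 2.1, 1.9],     # present in 3 of 5 (60%)
])
base_set, keep = filter_by_presence(abundance)
print(keep)
```

Lowering `min_fraction` keeps more proteins at the cost of more missing values in the resulting set.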
From this base set two subsets were created: one containing all known samples, the other containing all unknown samples. For the final classification the set called “unknowns” was used; for all other calculations the set called “base set–unknowns” and its subset “p<0.001” were used.
We further reduced our data set by restricting it to proteins that showed differential expression. From the differential expression calculation routines, we selected one-way ANOVA for multiple-comparison tests. Because all samples were run in duplicate, we found some artificially low p values. For filtering we selected those proteins with p values < 0.001. The remaining 150 proteins formed the new data set “p<0.001”, on which all subsequent analyses except classification were performed.
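The filtering step can be sketched with SciPy's one-way ANOVA applied protein by protein. The data below are synthetic; only the group sizes (spot maps per group, duplicates included) follow the study, and the effect size and noise level are invented. The sketch treats every spot map as an independent observation, which is exactly the situation in which duplicates can make p values look smaller than they should.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical log abundances: rows = proteins, columns = spot maps.
n_proteins = 200
groups = {"normal": 6, "benign": 8, "malignant": 22}
labels = np.repeat(list(groups), list(groups.values()))
data = rng.normal(0.0, 0.2, size=(n_proteins, labels.size))
# Give the first 20 proteins a strong shift in the malignant group.
data[:20, labels == "malignant"] += 1.0

def anova_p_values(data, labels):
    """One-way ANOVA p value for each protein (row) across the groups in labels."""
    names = np.unique(labels)
    return np.array([
        f_oneway(*(row[labels == g] for g in names)).pvalue
        for row in data
    ])

p = anova_p_values(data, labels)
selected = np.flatnonzero(p < 0.001)     # indices of proteins kept for the new set
print(len(selected), "proteins pass p < 0.001")
```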
Question 1: How many groups or classes exist in the data set?
To answer this question we performed two different analyses on the “p<0.001” data set: PCA and hierarchical clustering.
The results from the PCA show that on the expression group level the three valid groups normal, benign, and malignant were well separated. Benign and malignant were close to each other while distant from normal (Fig 3).
At the spot map level, the maps from the malignant group were well separated from the benign and normal maps. Spot maps from the benign and normal groups partly overlapped in the 2-D view but were still well separated in the 3-D view. This first overview analysis showed clearly that the three valid groups are separable, so we should be able to find valid markers for classifying the unknown samples later on.
In the hierarchical clustering we found two main clusters, one consisting of all malignant samples, the other containing all the normal and benign samples (not well separated). The malignant cluster showed three subclusters in partial concordance with the classification from the pathologists (Fig 4).
Question 2: Are there proteins or spots that behave similarly to a given protein or spot?
To answer this question we performed K-means clustering with gap statistics. We found nine clusters, each containing 2–40 different proteins, with very good accuracy (Fig 5).
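The idea behind choosing the number of clusters with the gap statistic (11) can be sketched as follows. The within-cluster dispersion of the data is compared with that of reference data drawn uniformly over the same bounding box, and the smallest k whose gap is within one standard error of the next is selected. The k-means routine is a plain Lloyd iteration (5) with k-means++ style seeding, written here only for illustration; the points, the number of reference sets, and all function names are assumptions and do not reflect the DeCyder EDA implementation.

```python
import numpy as np

def kmeans_wss(points, k, rng, n_init=5, n_iter=50):
    """Within-cluster sum of squares from Lloyd's algorithm (best of n_init runs)."""
    best = np.inf
    for _ in range(n_init):
        # Seeding: later centers favor points far from the centers chosen so far.
        centers = [points[rng.integers(len(points))]]
        for _ in range(k - 1):
            d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
        centers = np.array(centers)
        for _ in range(n_iter):
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            updated = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
            if np.allclose(updated, centers):
                break
            centers = updated
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(points, k_max=6, n_refs=10, seed=0):
    """Return the selected k and the gap value for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_wss(points, k, rng))
        ref = np.array([
            np.log(kmeans_wss(rng.uniform(lo, hi, size=points.shape), k, rng))
            for _ in range(n_refs)
        ])
        gaps.append(ref.mean() - log_w)
        errs.append(ref.std() * np.sqrt(1 + 1 / n_refs))
    gaps, errs = np.array(gaps), np.array(errs)
    # Smallest k with Gap(k) >= Gap(k+1) - s(k+1); fall back to k_max.
    ok = np.flatnonzero(gaps[:-1] >= gaps[1:] - errs[1:])
    best_k = int(ok[0]) + 1 if ok.size else k_max
    return best_k, gaps

# Synthetic example (hypothetical): three well-separated groups of expression profiles.
rng = np.random.default_rng(1)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
points = np.vstack([c + rng.normal(0, 0.5, size=(60, 2)) for c in true_centers])
best_k, gaps = gap_statistic(points)
print("selected k:", best_k)
print("gap values:", np.round(gaps, 2))
```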
These proteins are candidates for further analysis. Identifying those that are co-regulated might help to uncover the biological basis for this co-regulation.
Question 3: Are there proteins that might be used for the development of noninvasive tests?
To answer this question we performed a discriminant analysis using the Partial Least Square Search routine as the search method and the Regularized Discriminant Analysis (alpha+gamma 0.7) routine as the evaluation method, with 10-fold cross validation. We found that a subset of 13 proteins allowed discrimination between the known classes with 100% accuracy (Fig 6).
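The evaluation step can be approximated with a small regularized linear discriminant classifier and cross-validation. This is a stand-in written for illustration, not the Regularized Discriminant Analysis routine of DeCyder EDA: the pooled covariance is shrunk toward a scaled identity with weight 0.7, which merely echoes the setting quoted above, since how the software defines its alpha and gamma parameters is not stated here. Feature selection is omitted. The data are synthetic, with patient counts taken from the study, and the folds are formed over patients so that both spot maps of a patient are always held out together (a design choice of this sketch).

```python
import numpy as np

def fit_rlda(x, y, gamma=0.7):
    """Linear discriminant model with the pooled covariance shrunk toward identity."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    centered = x - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(x) - len(classes))
    p = x.shape[1]
    cov_reg = (1 - gamma) * cov + gamma * (np.trace(cov) / p) * np.eye(p)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, np.linalg.inv(cov_reg), log_prior

def predict_rlda(model, x):
    classes, means, prec, log_prior = model
    # Linear discriminant score: x'Pm - 0.5 m'Pm + log prior, one column per class.
    scores = x @ prec @ means.T - 0.5 * np.sum(means @ prec * means, axis=1) + log_prior
    return classes[scores.argmax(axis=1)]

def grouped_cv_accuracy(x, y, patients, n_folds=10, seed=0):
    """Cross-validated accuracy with whole patients held out in each fold."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(patients))
    hits = 0
    for held_out in np.array_split(ids, n_folds):
        test = np.isin(patients, held_out)
        model = fit_rlda(x[~test], y[~test])
        hits += np.sum(predict_rlda(model, x[test]) == y[test])
    return hits / len(y)

# Synthetic data (hypothetical values): 13 proteins, patients run in duplicate.
rng = np.random.default_rng(42)
counts = {"normal": 3, "benign": 4, "malignant": 11}
x_rows, y_rows, p_rows = [], [], []
pid = 0
for k, (name, n) in enumerate(counts.items()):
    mean = np.zeros(13)
    mean[4 * k:4 * k + 4] = 1.0                  # class-specific marker proteins
    for _ in range(n):
        profile = mean + rng.normal(0, 0.2, 13)  # biological variation
        for _ in range(2):
            x_rows.append(profile + rng.normal(0, 0.05, 13))  # technical variation
            y_rows.append(name)
            p_rows.append(pid)
        pid += 1
x, y, patients = np.array(x_rows), np.array(y_rows), np.array(p_rows)

accuracy = grouped_cv_accuracy(x, y, patients)
print("cross-validated accuracy:", accuracy)
```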
If any of these 13 proteins are highly abundant, they will be good candidates for diagnostic tests.
Question 4: Are there proteins or protein patterns that are characteristic of a biological state?
To answer this question a classifier was calculated using the same method that was applied for feature selection. The confusion matrix from the cross-validation showed no errors for the known samples (Fig 7). The created classifier was then applied to the “Unknown” data set.
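A confusion matrix of the kind referred to above simply counts, for each true class, how often each class was predicted. The sketch below uses an invented set of ten labels to show how a single misclassification appears off the diagonal; the composition of the label lists is hypothetical and is not the study data.

```python
import numpy as np

def confusion_matrix(true, predicted, classes):
    """Rows = true class, columns = predicted class, entries = number of spot maps."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true, predicted):
        matrix[index[t], index[p]] += 1
    return matrix

classes = ["normal", "benign", "malignant"]
# Hypothetical labels: ten spot maps, one malignant map predicted as benign.
true = ["normal"] * 2 + ["benign"] * 3 + ["malignant"] * 5
predicted = ["normal"] * 2 + ["benign"] * 3 + ["benign"] + ["malignant"] * 4

matrix = confusion_matrix(true, predicted, classes)
print(matrix)
```

Column sums give the number of spot maps assigned to each class, so a single error of this kind shows up as one more benign and one less malignant than expected.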
The result was one more benign than expected and one less malignant than expected. Looking at the classification results, we found that the only “unknown” patient who was run as a duplicate came out once correctly as malignant, and once incorrectly as benign (Fig 8).
A detailed analysis of the images showed that both spot maps from the duplicate patient were of very low quality. Both gels were from the same casting batch and showed severe polymerization problems, as well as dust particles and air bubbles.
The purpose of this application note was to demonstrate how the different analysis procedures provided by DeCyder EDA might be used to answer the remaining questions from the DeCyder 2-D software BVA analysis.
It was not necessary to apply all possible calculations; neither would that provide clearer answers. We defined the questions to answer, and then selected the analysis tools that were suitable.
The remaining questions from BVA analysis were the following:
• Q1: How many groups or classes exist in the data set?
• Q2: Are there proteins or spots that behave similarly to a given protein or spot (co-regulation)?
• Q3: Are there proteins that might be used for the development of noninvasive tests?
• Q4: Are there proteins or protein patterns that might be characteristic of a biological state (e.g. tumor versus normal tissue)?
Whether classes existed in our data set, and how many, was partly answered by the results from PCA and from hierarchical clustering. PCA showed that on the expression group level, as well as on the spot map level, the respective spot maps from the known groups were well separated. Hierarchical clustering worked well with the 150 selected proteins, but subclustering of the malignant samples did not follow the classification from the pathologists. If subclassification of the malignant samples were necessary, the experimental design and the analysis procedure would have to be changed.
Proteins that behaved in a similar way were nicely clustered by using K-means clustering. We could identify proteins that were differentially expressed between normal and malignant, normal and benign, and benign and malignant samples.
Discriminant analysis showed that a small subset of 13 proteins allowed us to discriminate with 100% accuracy between the known groups. The top-ranked proteins from this analysis are candidates for a noninvasive diagnostic test for cancer. The high-abundance proteins from this list are ideal candidates for such a test, because they offer a high probability of being detected as leakage proteins in plasma. Data sets could be created from the top-ranked proteins to build classifiers for the classification of the unknown samples.
If samples show a huge variation (either inherently biological or from sample preparation), proteins could be missing in one or more spot maps, which would affect the final classification. In such a case, one should use a less restrictive data set (i.e. more proteins) for discriminant analysis.
The most important question was whether and how well unknown samples could be classified using DeCyder EDA. We were able to demonstrate that all spot maps could be classified correctly, except for one. Nine unknown patients were classified (10 spot maps, one patient in duplicate). The incorrectly classified sample was one of the duplicates. This sample and its correctly classified duplicate were from poorly cast gels; they should be rerun.
A final comment must be made on the experimental design. For this type of study (classification of unknown samples), we highly recommend having a balanced design. In our study we had only four patients in the benign group and three in the normal group, but 11 patients in the malignant group (all patients run as duplicates). There should be similar numbers of samples in each of the known groups; otherwise the discriminant analysis and classification might be affected.
We thank Professor Peter James, University of Lund, Sweden, for the human ovarian cancer study data.
1. Wold, H. Estimation of principal components and related models by iterative least squares, in Multivariate Analysis (Krishnaiah, P. R., ed.), Academic Press, New York, pp. 391–420 (1966).
2. Eriksson, I. et al. Multi- and Megavariate Data Analysis, Umetrics Academy, Umeå, Sweden (2001).
3. Everitt, B. S. et al. Cluster Analysis, 4th edition, Edward Arnold Publishers Ltd., London (2001).
4. Sokal, R. and Mitchener, C. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 38, 1409–1438 (1958).
5. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 128–137 (1982).
6. Kohonen, T. The self-organizing map. Proc. IEEE 78, 1464–1480 (1990).
7. Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907–2912 (1999).
8. Hastie, T. et al. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1, research0003.1–21 (2000). [Online.] http://genomebiology.com/2000/1/2/research/0003/
9. Dunn, J. Well separated clusters and optimal fuzzy partitions. J. Cybernetics 4, 95–104 (1974).
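10. Bolshakova, N. and Azuaje, F. Cluster validation techniques for genome expression data. Signal Processing 83, 825–833 (2003).
11. Tibshirani, R. et al. Estimating the number of clusters in a dataset via the gap statistic. J. R. Stat. Soc. B 63, 411–423 (2001).
12. Webb, A. Statistical Pattern Recognition, 2nd edition, John Wiley and Sons, Inc., New York (2002).
13. Witten, I. H. and Frank, E. Data Mining, Morgan Kaufmann Publishers, San Francisco (2000).
14. Golub, T. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999).
15. Dudoit, S. et al. Comparison of discrimination methods for the classification of tumors using gene expression data, Department of Statistics, University of California, Berkeley, CA (2000). [Online.] http://www.stat.berkeley.edu/tech-reports/index.html, report number 576.
16. Nguyen, D. V. and Rocke, D. M. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, 39–50 (2002).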