The effect of selecting genes on between group eigenanalysis
Background
We investigated the effect of filtering, or pre-selecting, a subset of genes on between group eigenanalysis (BGA). To reduce the ratio of the number of variables (genes) to the number of samples, it is common practice to select a small number of genes from the total and to base the analysis on these only. Typically this is done using 50 or fewer genes (Li and Yang, 2001).
Methods
One then needs a criterion to judge how well each gene discriminates between the groups. Such a criterion should emphasise the difference in expression levels between samples from different groups and minimise the variation within groups. Selection was done using the t-statistic (von Heydebreck et al., 2001; Dudoit et al., 2000b) or the ratio of between-group sums of squares (BSS) to within-group sums of squares (WSS) (Dudoit et al., 2000a). We tested the effect of filtering using both of these statistics on the dataset from Golub, et al. (1999). BGA with correspondence analysis (COA) or principal component analysis (PCA) was performed on different numbers of genes, ranging from 2000 down to 20, from the training data.
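The two gene-level scores described above can be sketched for a single gene as follows. This is a minimal illustration, not the code used for the analysis: the function names, the toy expression values, and the choice of the unpooled (Welch) form of the t-statistic are all assumptions, since the text does not say which variant of the t-statistic was used. Plain Python is used to keep the example self-contained.

```python
from math import sqrt
from statistics import mean, variance


def bss_wss_ratio(values, labels):
    """Ratio of between-group to within-group sums of squares for one gene."""
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    # BSS: squared distance of each group mean from the overall mean,
    # counted once per sample in that group.
    bss = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    # WSS: squared distance of each sample from its own group mean.
    wss = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return bss / wss


def welch_t(values, labels, group_a, group_b):
    """Two-sample t-statistic with unpooled variances (an assumed variant)."""
    a = [v for v, l in zip(values, labels) if l == group_a]
    b = [v for v, l in zip(values, labels) if l == group_b]
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical expression values for one gene across six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
gene = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]

print(bss_wss_ratio(gene, labels))
print(welch_t(gene, labels, "ALL", "AML"))
```

Both scores grow in magnitude as the group means move apart and the within-group spread shrinks. A gene with no within-group variation would make WSS zero, so a real implementation would need to guard against that case.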
Results
Two different metrics were used to select the genes that might be most useful for discriminating between the two groups. Genes were ranked using the t-statistic or the ratio of between-group to within-group sums of squares (BSS/WSS) for the two groups. The top 20, 50, 100, 200, 500, 1000 and 2000 genes were selected from each set of ranked genes in the training data and subjected to BGA using COA or PCA. The results of jack-knife experiments of these gene selections are shown in Table 1.
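The ranking, selection and jack-knife steps can be sketched as below. Several things here are assumptions made for illustration: the jack-knife is taken to mean leave-one-out testing, genes are selected once before the loop (the text does not say whether selection was repeated for each held-out sample), and a simple nearest-centroid rule stands in for BGA. All names, scores and expression values are hypothetical.

```python
from statistics import mean


def rank_genes(scores):
    """Return gene indices ordered from largest to smallest absolute score."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)


def select_top(samples, gene_order, k):
    """Keep only the k best-ranked genes (columns) of each sample."""
    keep = gene_order[:k]
    return [[row[j] for j in keep] for row in samples]


def centroid(rows):
    """Mean expression of each gene over a set of samples."""
    return [mean(col) for col in zip(*rows)]


def nearest_centroid(train, train_labels, sample):
    """Stand-in classifier (not BGA): label of the closest group centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        rows = [r for r, l in zip(train, train_labels) if l == label]
        dist = sum((x - m) ** 2 for x, m in zip(sample, centroid(rows)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(samples, labels):
    """Percentage of samples assigned correctly when each is held out in turn."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid(train, train_labels, samples[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(samples)


# Hypothetical data: six samples (rows) by four genes (columns).
samples = [
    [5.0, 1.0, 9.0, 2.0],
    [1.0, 2.0, 8.0, 7.0],
    [4.0, 1.5, 9.5, 3.0],
    [2.0, 6.0, 3.0, 6.0],
    [5.5, 7.0, 2.0, 2.5],
    [1.5, 6.5, 2.5, 4.0],
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
scores = [0.2, -5.1, 3.3, 0.1]  # hypothetical per-gene scores

order = rank_genes(scores)
selected = select_top(samples, order, 2)
print(order, leave_one_out_accuracy(selected, labels))
```

Ranking on the absolute value suits a signed score such as the t-statistic; for BSS/WSS, which is never negative, it makes no difference.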
Filtering of genes using either metric improved the assignment accuracy of data that had been subjected to BGA using PCA. In contrast, filtering of genes using the BSS/WSS ratio reduced the prediction accuracy of the BGA/COA analysis. Taken at face value, this would suggest that BGA is best performed using PCA and that prior selection of at least some genes is useful. However, repeating the above analyses using blind test data from Golub, et al. (1999) provides a more general and useful test.
The 34 supplementary blind test samples from Golub, et al. (1999) were projected onto the discriminating axis produced from each of the gene selections above, and each sample was assigned as AML or ALL (see Table 2). Using the full set of genes, the prediction accuracies were 82.4% and 88.2% using BGA with PCA and COA, respectively. In the latter case, all AML cases (14/14) and 16/20 ALL cases were correctly assigned, i.e. 30 of 34 samples (88.2%).
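The projection step can be sketched for the two-group PCA case. With only two groups, the single between-group axis obtained from a PCA of the group centroids is parallel to the difference between the two centroids, so the sketch uses that difference directly. The assignment rule (comparing a projection with the midpoint of the two projected centroids), the function names and the toy data are assumptions for illustration. The COA variant, which rescales the data before the eigenanalysis, is not shown.

```python
from statistics import mean


def centroid(rows):
    """Mean expression of each gene over a set of samples."""
    return [mean(col) for col in zip(*rows)]


def between_group_axis(train, labels, group_a, group_b):
    """Unit vector along the difference of the two group centroids."""
    a = centroid([r for r, l in zip(train, labels) if l == group_a])
    b = centroid([r for r, l in zip(train, labels) if l == group_b])
    diff = [x - y for x, y in zip(a, b)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff], a, b


def project(sample, axis):
    """Coordinate of a sample along the axis."""
    return sum(x * w for x, w in zip(sample, axis))


def classify(sample, axis, centroid_a, centroid_b, group_a, group_b):
    """Assign by comparing the projection with the midpoint (assumed rule)."""
    threshold = (project(centroid_a, axis) + project(centroid_b, axis)) / 2
    # The axis points from group_b towards group_a, so larger means group_a.
    return group_a if project(sample, axis) > threshold else group_b


# Hypothetical training data (two selected genes) and blind test samples.
train = [[1.0, 9.0], [2.0, 8.0], [6.0, 3.0], [7.0, 2.0]]
train_labels = ["ALL", "ALL", "AML", "AML"]
test = [[2.0, 7.0], [6.0, 3.5], [5.0, 4.0]]

axis, c_all, c_aml = between_group_axis(train, train_labels, "ALL", "AML")
predictions = [classify(s, axis, c_all, c_aml, "ALL", "AML") for s in test]
print(predictions)
```

The test samples play no part in building the axis; they are only projected onto it, which is what makes the comparison a blind test.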
Although there was a weak trend towards greater accuracy with smaller numbers of genes, the trend was highly variable depending on the ranking metric and the ordination method used. In general, BGA using COA tended to outperform BGA using PCA. Since COA has the further advantage that the genes and samples can be projected along the same axes, we only show results for BGA with COA in the paper.
Tables
Table 1    Percentage of correctly assigned training samples from the Golub, et al. (1999) data set in jackknifed tests when genes were ranked and selected using the t-statistic or BSS/WSS.
| Method | Complete Dataset | Selection criterion | 2000 genes | 1000 genes | 500 genes | 200 genes | 100 genes | 50 genes | 20 genes |
|---|---|---|---|---|---|---|---|---|---|
| PCA | 92.1 | t-statistic | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| PCA | 92.1 | BSS/WSS | 97.4 | 100 | 100 | 94.7 | 94.7 | 94.7 | 94.7 |
| COA | 92.1 | t-statistic | 92.1 | 94.7 | 89.5 | 89.5 | 94.7 | 97.4 | 100 |
| COA | 92.1 | BSS/WSS | 89.5 | 89.5 | 89.5 | 89.5 | 89.5 | 89.5 | 89.5 |
Table 2    Percentage of test samples from the Golub, et al. (1999) data set correctly assigned when genes were ranked and selected using the t-statistic or BSS/WSS ratio.
| Method | Complete Dataset | Selection criterion | 2000 genes | 1000 genes | 500 genes | 200 genes | 100 genes | 50 genes | 20 genes |
|---|---|---|---|---|---|---|---|---|---|
| PCA | 82.4 | t-statistic | 88.2 | 88.2 | 88.2 | 76.5 | 76.5 | 79.4 | 88.2 |
| PCA | 82.4 | BSS/WSS | 91.2 | 94.1 | 94.1 | 91.2 | 91.2 | 91.2 | 91.2 |
| COA | 88.2 | t-statistic | 88.2 | 94.1 | 94.1 | 91.2 | 94.1 | 97.1 | 94.1 |
| COA | 88.2 | BSS/WSS | 88.2 | 88.2 | 88.2 | 94.1 | 91.2 | 88.2 | 94.1 |
References
Dudoit,S., Fridlyand,J., and Speed,T.P. (2000a) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc., 97(457), 77-87.
Dudoit,S., Yang,Y.H., Callow,M.J., and Speed,T.P. (2000b) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1), 111-139.
Li,W., and Yang,Y. (2001) How many genes are needed for a discriminant microarray data analysis? In Lin,S.M., and Johnson,K.F. (eds), Methods of Microarray Data Analysis. Kluwer Academic Publishers, Boston, pp. 137-150.
von Heydebreck,A., Huber,W., Poustka,A., and Vingron,M. (2001) Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics, 17 Suppl 1, S107-S114.