Abstract
We study the performance of principal component analysis (PCA). In particular, we consider the problem of how many training pattern vectors are required to accurately represent the low-dimensional structure of the data. This problem is of particular relevance now that PCA is commonly applied to extremely high-dimensional (N ≃ 5000-30000) real data sets produced from molecular-biology research projects. In these applications the number of patterns p is often orders of magnitude less than the data dimension (p ≪ N). We follow previous work and perform the analysis in the context of p random patterns which are isotropically distributed with the exception of a single symmetry-breaking direction. The standard mean-field theory for the performance of PCA is constructed by considering the thermodynamic limit N → ∞, with α = p/N fixed. For real data sets the strength of the symmetry breaking may increase with N, and therefore one must reconsider the accuracy of the mean-field theory. We show, using simulation results, that the mean-field theory is still accurate even when the strength of the symmetry breaking scales with N, and even for small values of α that are more appropriate to real biological data sets.
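The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a Gaussian model in which each pattern has identity covariance plus one extra variance component of strength `theta` along a unit vector `B` (the symmetry-breaking direction), and it measures PCA performance as the squared overlap between the leading principal component and `B`. The names `simulate_overlap` and `asymptotic_overlap` and the parameter `theta` are hypothetical, chosen for illustration. `asymptotic_overlap` uses the standard large-N expression for this Gaussian spiked-covariance model as a stand-in for the mean-field prediction; the paper's own parametrization may differ.

```python
import numpy as np


def simulate_overlap(N, p, theta, rng):
    """Squared overlap between the top principal component and the true direction.

    Assumed model (not taken from the paper): x = z + sqrt(theta) * s * B,
    with z ~ N(0, I_N) and s ~ N(0, 1), so the covariance is I + theta * B B^T.
    """
    # Random unit vector playing the role of the symmetry-breaking direction.
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)

    # p patterns: isotropic noise plus a one-dimensional signal along B.
    z = rng.standard_normal((p, N))
    s = rng.standard_normal(p)
    X = z + np.sqrt(theta) * s[:, None] * B[None, :]

    # The first right singular vector of X is the leading eigenvector of
    # X^T X / p. No centering is applied because the model mean is zero.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    J = Vt[0]
    return float(np.dot(J, B) ** 2)


def asymptotic_overlap(alpha, theta):
    """Large-N squared overlap for the assumed Gaussian model, alpha = p / N.

    Below the threshold alpha * theta**2 = 1 the overlap vanishes.
    """
    if alpha * theta**2 <= 1.0:
        return 0.0
    return (alpha * theta**2 - 1.0) / (theta * (alpha * theta + 1.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, p = 2000, 200  # p << N, i.e. alpha = 0.1
    alpha = p / N
    for theta in (1.0, 5.0, 20.0):
        sims = [simulate_overlap(N, p, theta, rng) for _ in range(5)]
        print(f"theta={theta:5.1f}  simulated={np.mean(sims):.3f}  "
              f"asymptotic={asymptotic_overlap(alpha, theta):.3f}")
```

Letting `theta` grow with `N` while keeping `alpha` small is one way to probe, in this toy model, the regime the abstract refers to, where the symmetry-breaking strength scales with the data dimension.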
Original language | English |
---|---|
Pages (from-to) | 117-123 |
Number of pages | 6 |
Journal | EPL |
Volume | 62 |
Issue number | 1 |
Publication status | Published - Apr 2003 |