PCA learning for sparse high-dimensional data

D. C. Hoyle, M. Rattray

    Research output: Contribution to journal › Article › peer-review

    Abstract

    We study the performance of principal component analysis (PCA). In particular, we consider the problem of how many training pattern vectors are required to accurately represent the low-dimensional structure of the data. This problem is of particular relevance now that PCA is commonly applied to extremely high-dimensional (N ≃ 5000-30000) real data sets produced from molecular-biology research projects. In these applications the number of patterns p is often orders of magnitude less than the data dimension (p ≪ N). We follow previous work and perform the analysis in the context of p random patterns which are isotropically distributed with the exception of a single symmetry-breaking direction. The standard mean-field theory for the performance of PCA is constructed by considering the thermodynamic limit N → ∞, with α = p/N fixed. For real data sets the strength of the symmetry breaking may increase with N, and therefore one must reconsider the accuracy of the mean-field theory. We show, using simulation results, that the mean-field theory is still accurate even when the strength of the symmetry breaking scales with N, and even for small values of α that are more appropriate to real biological data sets.
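The model described in the abstract (p isotropically distributed patterns in N dimensions, plus a single symmetry-breaking direction) can be simulated directly to observe how well the leading principal component recovers that direction. The sketch below is an illustrative reconstruction of this setup, not code from the paper; the values of N, α, and the symmetry-breaking strength are assumptions chosen for demonstration.

```python
import numpy as np

# Illustrative sketch of the single symmetry-breaking-direction model:
# p isotropic Gaussian patterns in N dimensions, with extra variance
# along one unit direction B. N, alpha, and strength are assumed values.
rng = np.random.default_rng(0)
N, alpha, strength = 2000, 0.2, 10.0   # data dim, alpha = p/N, signal variance
p = int(alpha * N)                      # number of training patterns (p << N)

B = rng.standard_normal(N)
B /= np.linalg.norm(B)                  # unit symmetry-breaking direction

# Each pattern: isotropic noise plus a Gaussian component along B.
X = rng.standard_normal((p, N)) \
    + np.sqrt(strength) * rng.standard_normal((p, 1)) * B

# PCA: leading right singular vector of the centered data matrix
# gives the first principal component.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
overlap = abs(Vt[0] @ B)                # |cos(angle)| between PC1 and B
print(f"overlap = {overlap:.3f}")
```

For strengths well above the retarded-learning threshold the overlap approaches 1 as α grows, while below it the leading component stays essentially uncorrelated with B; varying `alpha` and `strength` in the sketch reproduces this qualitative behaviour.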
    Original language: English
    Pages (from-to): 117-123
    Number of pages: 6
    Journal: EPL
    Volume: 62
    Issue number: 1
    Publication status: Published - Apr 2003
