Automatic PCA dimension selection for high dimensional data and small sample sizes

David C. Hoyle

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been shown to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. © 2008 David C. Hoyle.
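
    For illustration, below is a minimal sketch of the Laplace-approximation evidence criterion (Minka, NIPS 2000) that the paper takes as its starting point, written from the published formula rather than from the paper's code. The function names, the k_max default, and the assumption that the data matrix X is N × d are illustrative choices; the paper's contribution is a corrected evidence whose accuracy, unlike this criterion's, does not degrade as d grows at small N.

        import numpy as np
        from scipy.special import gammaln

        def laplace_log_evidence(eigvals, N, k):
            # Approximate log-evidence for a PCA model with k signal
            # components (Laplace approximation; Minka, NIPS 2000).
            d = len(eigvals)
            lam = np.asarray(eigvals, dtype=float)
            sigma2 = lam[k:].mean()  # noise variance from discarded eigenvalues
            # Retained eigenvalues kept as-is; discarded ones replaced by sigma2.
            lam_hat = np.concatenate([lam[:k], np.full(d - k, sigma2)])
            # log p(U): uniform prior over the principal-axis directions.
            i = np.arange(1, k + 1)
            log_pU = -k * np.log(2) + np.sum(
                gammaln((d - i + 1) / 2) - ((d - i + 1) / 2) * np.log(np.pi))
            m = d * k - k * (k + 1) // 2  # number of free axis parameters
            # log |A|: curvature of the log-posterior at its mode.
            log_A = 0.0
            for a in range(k):
                j = np.arange(a + 1, d)
                log_A += np.sum(np.log(1.0 / lam_hat[j] - 1.0 / lam_hat[a])
                                + np.log(lam[a] - lam[j]) + np.log(N))
            return (log_pU
                    - (N / 2) * np.sum(np.log(lam[:k]))
                    - (N * (d - k) / 2) * np.log(sigma2)
                    + ((m + k) / 2) * np.log(2 * np.pi)
                    - 0.5 * log_A
                    - (k / 2) * np.log(N))

        def select_dimension(X, k_max=None):
            # Choose the k maximising the approximate evidence.
            # When N < d, keep k_max well below the covariance rank to
            # avoid a degenerate noise-variance estimate.
            N, d = X.shape
            S = np.cov(X, rowvar=False, bias=True)          # sample covariance
            lam = np.sort(np.linalg.eigvalsh(S))[::-1]      # descending eigenvalues
            k_max = k_max or min(N, d) - 1
            scores = [laplace_log_evidence(lam, N, k) for k in range(1, k_max + 1)]
            return 1 + int(np.argmax(scores))

    On data drawn from a spiked-covariance model, select_dimension recovers the signal dimension reliably when N is large relative to d; the paper's simulations show that this Laplace criterion becomes increasingly unreliable as d grows at fixed small N, which motivates the improved evidence.
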
    Original language: English
    Pages (from-to): 2733-2759
    Number of pages: 26
    Journal: Journal of Machine Learning Research
    Volume: 9
    Publication status: Published - Dec 2008

    Keywords

    • Bayesian model selection
    • High dimensional inference
    • PCA
    • Random matrix theory
