Factors affecting the errors in the estimation of evolutionary distances between sequences

D. C. Hoyle, P. G. Higgs

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time $t$ since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when $t \sim \tfrac{1}{2}\,|\lambda_{\max}|^{-1} \log N$, where $\lambda_{\max}$ is the eigenvalue of the substitution rate matrix with the largest absolute value and $N$ is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when $t \sim \tfrac{1}{2}\,|\lambda_{1}|^{-1} \log N$, where $\lambda_{1}$ is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.
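
    As an illustration only, the sketch below uses the Jukes-Cantor model, which is not named in the abstract and is chosen here simply because it has the best-known log transform formula and a single nonzero eigenvalue. It assumes the two sequences are related by the transition matrix exp(Qt), with rates scaled so that one unit of t corresponds to one expected substitution per site; the function names are hypothetical. It shows how the log transform estimate diverges once the observed fraction of differing sites reaches 3/4, and evaluates the timescale (1/2) |λ|^-1 log N quoted above.

```python
import math


def jc_expected_p(t, alpha=1.0 / 3.0):
    """Expected fraction of differing sites under Jukes-Cantor.

    Assumption: the sequences are related by exp(Q t), where every
    off-diagonal entry of Q equals alpha, so the single nonzero
    eigenvalue of Q is -4 * alpha.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * alpha * t))


def jc_distance(p):
    """Jukes-Cantor log transform estimate d = -(3/4) ln(1 - 4p/3).

    Returns math.inf when the argument of the logarithm is not
    positive, which is the divergent case for finite sequences.
    """
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0.0:
        return math.inf
    return -0.75 * math.log(arg)


def critical_time(lam, n_sites):
    """Timescale (1/2) * |lam|^-1 * log(N) at which errors become significant."""
    return 0.5 * math.log(n_sites) / abs(lam)


if __name__ == "__main__":
    lam = -4.0 / 3.0  # nonzero eigenvalue for alpha = 1/3
    for n_sites in (100, 1000, 10000):
        print(n_sites, critical_time(lam, n_sites))
    # Sampling noise in p is of order 1/sqrt(N), so a finite sequence
    # can show p >= 3/4 by chance at large t, giving an infinite estimate.
    print(jc_distance(0.74), jc_distance(0.75))
```

    Because the timescale grows only logarithmically with N, lengthening the sequences extends the reliable range of t slowly; under this scaling, a model with a smaller relevant |λ| reaches the threshold proportionally later.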
    Original language: English
    Pages (from-to): 1-9
    Number of pages: 8
    Journal: Molecular Biology and Evolution
    Volume: 20
    Issue number: 1
    Publication status: Published - 1 Jan 2003

    Keywords

    • Distance matrix
    • Evolutionary distances
    • Maximum likelihood
    • Molecular phylogeny
