Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning method

Research output: Contribution to journal › Article › peer-review

Abstract

Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern (VoC). They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta, but not Omicron. Phylogenetic methods provide the gold standard for representing the global diversity of SARS-CoV-2 and for identifying newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges, together with the increasing volumes of genomic data available, provide a motivation to develop complementary methods that can incorporate all of the genetic data available, without down-sampling, to extract meaningful information rapidly and with minimal curation. Here, we demonstrate the utility of using algorithmic approaches based on word-statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not serving as a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary approach to identify and confirm newly emerging variants.
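
As a rough illustration of what a word-statistics representation of a whole sequence can look like, the sketch below turns a nucleotide string into a fixed-length vector of k-mer frequencies and compares two such vectors. The abstract does not specify which word statistics the authors use, so the choice of k-mer counts, the value of k, the Euclidean distance, and the function names kmer_frequencies and euclidean_distance are all assumptions made for this example, not the published method.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalised k-mer ("word") frequencies over the alphabet ACGT.

    Hypothetical helper for illustration only; the k-mer choice is an
    assumption, not the representation used in the article.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed vocabulary so every sequence maps to a vector of length 4**k.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are not in the vocabulary
    # and are therefore ignored.
    total = sum(counts[word] for word in vocabulary)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Because each sequence maps to a vector of fixed length regardless of genome size, vectors of this kind can be passed to standard dimensionality reduction and clustering tools without first building a multiple sequence alignment, which is one plausible reading of the speed and scalability the abstract describes.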

Original language: English
Article number: 2317284121
Journal: Proceedings of the National Academy of Sciences
Volume: 121
Issue number: 12
Early online date: 13 Mar 2024
Publication status: Published - 19 Mar 2024

Keywords

  • SARS-CoV-2
  • Machine Learning
  • Dimensionality reduction
  • Variants of concern
  • Clustering methods
  • Lineages
