Developing Sequence Analysis Pipelines to Characterise Human Genome Variation.

  • Nikita Abramovs

Student thesis: Master of Philosophy

Abstract

The latest genome sequencing platforms generate large catalogues of genomic variants, with individual genomes containing about four to five million variants. Large general population studies also estimate that individuals carry up to 100 loss of function (LoF) variants, with ~20 genes (mostly participating in the immune system) completely inactivated. Deciding which variants are important in disease is a difficult task, and a crucial step in disease candidate gene prioritisation is the comparison of variants in affected and healthy individuals. The purpose of this study is to characterise genes based on variant data in large, apparently healthy populations, and to create datasets which can be integrated into other variant studies and sequence analysis pipelines, or used independently.

There are about 18,000-20,000 protein coding genes in humans, all of which are present in two copies (alleles), except for sex chromosome genes. One or both alleles can be affected by deleterious variants, resulting in dominant or recessive disease respectively. Genes which require both alleles to maintain their functions are called haploinsufficient, but their proportion among all protein coding genes is still unknown. It is also hypothesised that many more genes are nonessential for human survival, and that loss of both alleles of these genes can be tolerated.

The biallelic distribution of variants within genes was analysed in 2,504 individual genomes from the 1000 Genomes Project Phase 3 dataset, and a custom NoSQL database was created from the VCF files. This database can be reused in studies which involve whole genome variant analysis at the individual level, as this information is not publicly available in other variant databases. A dataset of 76,254 rare variant pairs, which affected both gene alleles in some individuals, was produced and can be used for candidate gene prioritisation.

The overall load of variants within 18,225 genes was analysed in 60,706 exomes from the Exome Aggregation Consortium (ExAC) database, to create a dataset of gene haploinsufficiency scores. The scores were calculated by several models based on supervised machine learning algorithms, which were trained and evaluated on known dominant and recessive genes from the Online Mendelian Inheritance in Man (OMIM) database. The scores were called Gene Variant Haploinsufficiency Scores (GVHS), as they were based on six different types of variant statistical data. This approach differs from the existing methods used by ExAC or DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources) to calculate gene haploinsufficiency scores. ExAC used an unsupervised learning algorithm and considered only splicing and nonsense variants, whereas DECIPHER used gene biological properties and ignored variant data completely. The evaluation performed in this study showed that, on average, the performance metrics of the GVHS models were similar to those of ExAC, and that both gave better haploinsufficiency predictions than DECIPHER. However, one of the GVHS models was ~4.5% more precise in detecting haploinsufficient genes and produced more interpretable probabilities, which can be useful for candidate gene prioritisation in disease sequencing studies.
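
As an illustration of the rare variant pair idea described above, the following is a minimal sketch, not the thesis pipeline itself. It assumes phased VCF-style genotype strings (such as "0|1") for one individual within one gene. The function name, the variant identifiers, and the 1% allele frequency cut-off are hypothetical choices made for this example. Homozygous variants, which also affect both alleles, are deliberately left out to keep the sketch short.

```python
from itertools import combinations


def biallelic_pairs(genotypes, allele_freq, max_af=0.01):
    """Find pairs of rare heterozygous variants on opposite haplotypes.

    genotypes:   dict of variant id -> phased genotype string, e.g. "0|1"
    allele_freq: dict of variant id -> population allele frequency
    max_af:      hypothetical rarity threshold (an assumption of this sketch)
    """
    haplotype_of = {}
    for vid, gt in genotypes.items():
        # Skip common variants and calls that are not phased.
        if allele_freq.get(vid, 1.0) > max_af or "|" not in gt:
            continue
        left, right = gt.split("|")
        if "." in (left, right):
            continue  # missing call
        left_alt, right_alt = left != "0", right != "0"
        # Keep only heterozygous calls and note which haplotype carries the variant.
        if left_alt and not right_alt:
            haplotype_of[vid] = 0
        elif right_alt and not left_alt:
            haplotype_of[vid] = 1

    # Two variants hit both gene copies when they sit on different haplotypes.
    return [
        (a, b)
        for a, b in combinations(sorted(haplotype_of), 2)
        if haplotype_of[a] != haplotype_of[b]
    ]


# Hypothetical genotypes for one individual in one gene.
example_genotypes = {"var1": "0|1", "var2": "1|0", "var3": "0|1", "var4": "1|0"}
example_freqs = {"var1": 0.001, "var2": 0.002, "var3": 0.003, "var4": 0.2}
example_pairs = biallelic_pairs(example_genotypes, example_freqs)
```

Here var4 is excluded as too common, and var1 and var3 lie on the same haplotype, so only pairs that combine var2 with var1 or var3 are reported.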
Date of Award: 1 Aug 2017
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Mayada Tassabehji

Keywords

  • Machine Learning
  • MPhil
  • 1000 Genomes
  • ExAC
  • Genes
  • Bioinformatics
