Computational proteogenomics integrates proteomics with genomics and transcriptomics data and has emerged over the past decade as a useful discipline for the functional annotation of genomes. It has been driven by high-throughput technologies, namely mass spectrometry and next-generation sequencing platforms. Both technologies provide empirical evidence for protein-coding genes and genome sequences, which helps to minimise or correct errors that arise during genome annotation. Here we demonstrate the role of proteomics data in characterising the Ensembl chicken genome annotation from a historical perspective, using a database search strategy with multiple search engines across numerous versions of the Ensembl Gallus gallus proteome. We found two significant points at which the genome annotation underpins major changes in the discoverable proteome: Ensembl 41 versus 42 and Ensembl 70 versus 71. Both points were manifestations of significant genome assembly events in the genome's lifetime, which spans more than 10 years. We then investigated a transcriptome-based proteogenomic approach using expressed sequence tag (EST) contig sequences, which enabled the discovery of potential novel genes via peptide sequences predicted from contigs that also map to the chicken genome. This general approach was then applied to the current V3 pineapple genome annotation, where we found 16 novel gene candidates and 334 further cases of single amino acid variants. Overall, the results demonstrate the effectiveness of the proteogenomic approach for both established and novel genomes, providing validation, discovery and improvements to genome annotations.
|Date of Award||1 Aug 2020|
- The University of Manchester
|Supervisor||Simon Hubbard (Supervisor) & Giles Johnson (Supervisor)|
- genome annotation