Using Regression Analyses For the Determination of Protein Structure From FTIR Spectra

  • Kieaibi Wilcox

    Student thesis: PhD


    One of the challenges facing the structural biology community is processing the wealth of protein data being produced today. Computational tools have therefore been adopted to speed up the determination of protein structures and, through them, the understanding of protein function. In this thesis, protein structure was investigated using Multivariate Analysis (MVA) and Fourier Transform Infrared (FTIR) spectroscopy, a form of vibrational spectroscopy. FTIR has been shown to identify the chemical bonds in a protein in solution, and it is rapid and easy to use. The spectra it produces are then analysed qualitatively and quantitatively using MVA methods, which extract non-redundant but important information from them.
High-resolution techniques such as X-ray crystallography and NMR are not always applicable, and FTIR spectroscopy, a widely applicable analytical technique, has great potential to assist structure analysis for a wide range of proteins. Spectral shape and band positions in the Amide I (the region of most intense absorption), Amide II and Amide III regions can be analysed computationally, using multivariate regression, to extract structural information. Partial Least Squares (PLS) regression, a form of MVA, was used to correlate a matrix of FTIR spectra with the known secondary structure content of the corresponding proteins, and so to predict structure (in terms of "helix", "sheet", "3₁₀-helix", "turns" and "other" content) for a selection of 84 non-redundant proteins.
Analysis of the wavenumber range between 1480 and 1900 cm⁻¹ (Amide I and Amide II regions) gives high prediction accuracy, with R² as high as 0.96 for α-helix, 0.95 for β-sheet, 0.92 for 3₁₀-helix, 0.94 for turns and 0.90 for other; the Root Mean Square Error of Calibration (RMSEC) values are between 0.01 and 0.05, and the Root Mean Square Error of Prediction (RMSEP) values are between 0.02 and 0.12. The Amide II region also gave results comparable to those of Amide I, especially for predictions of helix content. We also used Principal Component Analysis (PCA) to classify FTIR protein spectra into their natural groupings: proteins of mainly α-helical structure, proteins of mainly β-sheet structure, and proteins with a mixture of α-helix and β-sheet. We were also able to differentiate between parallel and anti-parallel β-sheet. The developed methods were applied to characterize the secondary structure conformational changes of an unfolding protein as a function of pH, and also to determine the Limit of Quantitation (LoQ).
Our structural analyses compare highly favourably with those in the literature that use machine learning techniques. Our work demonstrates that FTIR spectra, in combination with multivariate analysis methods such as PCA and PLS, can accurately identify and quantify protein secondary structure. The models developed in this research are especially relevant to the pharmaceutical industry, where the therapeutic effect of drugs strongly depends on the stability of the physical or chemical structure of their protein targets; understanding protein structure is therefore very important in the biopharmaceutical world for drug production and formulation. There is also a new class of drugs that are themselves proteins, used to treat infectious and autoimmune diseases.
The use of spectroscopy and multivariate regression analysis in the medical industry to identify biomarkers in diseases has also brought new challenges to the bioinformatics field. These methods may be applicable in food science and academia in general, for the investigation and elucidation of protein structure.
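As a rough illustration of the kind of workflow the abstract describes, the sketch below fits a single-response PLS model (PLS1, via the NIPALS algorithm) to spectra and reports RMSEC, RMSEP and R². It is not the thesis's implementation. The six-channel "spectra", the band shapes, the noise level, the 30/10 calibration/test split and all function names are hypothetical, and R² is computed as 1 - SS_res/SS_tot, which may differ from the definition used in the thesis.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error: RMSEC on calibration data, RMSEP on held-out data."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (an assumed definition)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def pls1_fit(X, y, n_components):
    """Fit PLS1 with NIPALS: X is a list of spectra, y one structure fraction each."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[v - mu for v, mu in zip(row, x_mean)] for row in X]  # centred spectra
    f = [v - y_mean for v in y]                                # centred response
    components = []
    for _ in range(n_components):
        # Weights: the spectral direction with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(a * b for a, b in zip(row, w)) for row in E]             # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        q = sum(a * b for a, b in zip(f, t)) / tt                         # y loading
        # Deflate: remove what this component explains from spectra and response.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, X):
    """Predict by replaying the score/deflation steps on each new spectrum."""
    x_mean, y_mean, components = model
    predictions = []
    for row in X:
        e = [v - mu for v, mu in zip(row, x_mean)]
        y_hat = y_mean
        for w, p, q in components:
            t = sum(a * b for a, b in zip(e, w))
            y_hat += q * t
            e = [a - t * b for a, b in zip(e, p)]
        predictions.append(y_hat)
    return predictions


def synthetic_spectrum(helix, rng, noise=0.01):
    """Hypothetical 6-channel spectrum: a helix/sheet band mixture plus noise."""
    helix_band = [0.1, 0.3, 1.0, 0.4, 0.1, 0.05]
    sheet_band = [0.6, 1.0, 0.3, 0.1, 0.05, 0.02]
    return [helix * h + (1 - helix) * s + rng.gauss(0, noise)
            for h, s in zip(helix_band, sheet_band)]


def demo(seed=0):
    rng = random.Random(seed)
    y = [rng.random() for _ in range(40)]            # made-up helix fractions
    X = [synthetic_spectrum(v, rng) for v in y]
    X_cal, y_cal, X_test, y_test = X[:30], y[:30], X[30:], y[30:]
    model = pls1_fit(X_cal, y_cal, n_components=2)
    test_pred = pls1_predict(model, X_test)
    return {"RMSEC": rmse(y_cal, pls1_predict(model, X_cal)),
            "RMSEP": rmse(y_test, test_pred),
            "R2_pred": r_squared(y_test, test_pred)}


print(demo())
```

In practice one model of this kind would be fitted per structural class (helix, sheet, 3₁₀-helix, turns, other), and the number of components would be chosen by cross-validation rather than fixed at two as it is here.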
    Date of Award: 1 Aug 2015
    Original language: English
    Awarding Institution
    • The University of Manchester
    Supervisors: Ewan Blanch & Andrew Doig


    • iPLS, Spectral pre-processing, Limit of Quantitation
    • Protein Secondary Structure, PLS, PCA, FTIR, Vibrational Spectroscopy, ATR, MVA, Machine Learning, Multivariate Regression Analysis
