Machine learning to the rescue: Enabling novel proteomics workflows with data-driven bioinformatics methods

An introduction to proteomics, mass spectrometry, and machine learning for a lay audience, and a summary of my PhD projects. The full dissertation text is available at doi:10.5281/zenodo.6580035.

Abstract

Proteins are the molecular workhorses of the cell and carry out many functional and structural tasks. In-depth knowledge of the complement of all proteins in a cell or tissue, called the proteome, can provide valuable insights into cellular biology in health and disease. To study the proteome in high throughput, liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) is most often the platform of choice. The result of an LC-MS/MS experiment is a large set of peptide spectra that must be identified with specialized bioinformatics software. However, the confident identification of peptide spectra is not always straightforward, especially when novel, challenging proteomics workflows are used. Examples of such workflows are data-independent acquisition (DIA), proteogenomics, metaproteomics, biopeptidomics, and immunopeptidomics. Fortunately, machine learning (ML) can provide accurate predictions of peptide behavior in LC-MS/MS, allowing more of the LC-MS/MS data to be used in the identification process and resulting in higher sensitivity. In this PhD research, the use of ML to enable novel proteomics workflows is investigated in depth. First, the peptide spectrum predictor MS²PIP is significantly improved and extended to more use cases. Second, a novel paradigm for the proteome-wide identification of DIA data is proposed and developed. Third, a perspective on the current state of peptide LC-MS/MS behavior predictors is given. Fourth, the MS²PIP spectrum predictor is integrated into a fully data-driven post-processing pipeline, which is subsequently applied to the various challenging proteomics workflows mentioned above. Fifth, preliminary results are shown for a novel modification-aware spectrum predictor. In each of these applications, spectrum prediction resulted in a more sensitive scoring function, leading to more confident peptide identifications.
In conclusion, ML proved to be a valuable tool for the identification of peptide mass spectra in challenging proteomics workflows. As proteomics experiments become increasingly demanding, ML is expected to take on a central role in proteomics data analysis workflows.