Oral Presentation 28th Annual Lorne Proteomics Symposium 2023

Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines (#25)

Qing Zhong 1
  1. Children's Medical Research Institute, Westmead, NSW, Australia

Cancer type is determined through tumor morphology, aided by immunohistochemical staining. The development of machine learning (ML) models using histology slides has powered the image-based prediction of the site of origin in cancer of unknown primary (CUP). Here, we used ML on proteomic data to predict cancer types and tissue of origin from a sample cohort consisting of 1,277 human tissue samples spanning 44 cancer types. The training proteome datasets included two independent sets of proteomes acquired from a pan-cancer cell line collection and a subset of the tissue cohort for online ML.

All samples were processed using data-independent acquisition mass spectrometry (DIA-MS). Two proteomic profiles of the pan-cancer cell line cohort were generated using two independent sample preparation methods. These were normalized with ComBat and merged by averaging the protein abundances, yielding a single training set (D1) with 975 cell lines and 9,688 proteins. Similarly, the 1,277 tissue samples were processed by DIA-MS, quantifying 9,501 proteins. Celligner was used to align the cell lines (D1) with the tissue cohort. Half of the tissue proteomes served as a second training set (D2) for online ML, and the other half formed a hold-out test set (T1).
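The merging and splitting steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `merge_by_averaging` and `split_half`, the nested-dictionary layout, the random split with a fixed seed, and the choice to keep only proteins quantified in both preparations are all assumptions made for illustration. Batch correction (ComBat) and cell line to tissue alignment (Celligner) are assumed to have been applied elsewhere and are not reproduced here.

```python
import random


def merge_by_averaging(profile_a, profile_b):
    """Average two {sample: {protein: abundance}} profiles, sample by sample.

    Assumption: only samples and proteins present in BOTH profiles are kept.
    The abstract does not say how missing values were handled.
    """
    merged = {}
    for sample in profile_a.keys() & profile_b.keys():
        a, b = profile_a[sample], profile_b[sample]
        merged[sample] = {
            protein: (a[protein] + b[protein]) / 2.0
            for protein in a.keys() & b.keys()
        }
    return merged


def split_half(sample_ids, seed=0):
    """Shuffle sample IDs reproducibly and split them into two halves.

    Assumption: a simple random split; the abstract does not state whether
    the split was stratified by cancer type.
    """
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]


# Toy data standing in for the two sample preparation methods (hypothetical values).
prep_a = {"line1": {"P1": 10.0, "P2": 4.0}, "line2": {"P1": 8.0}}
prep_b = {"line1": {"P1": 12.0, "P2": 6.0}, "line2": {"P1": 6.0, "P3": 1.0}}
d1 = merge_by_averaging(prep_a, prep_b)

# Split 1,277 placeholder tissue sample IDs into training (D2) and test (T1) halves.
d2_ids, t1_ids = split_half(range(1277))
```

Because 1,277 is odd, the two halves in this sketch differ by one sample; how the original study divided the cohort exactly is not stated in the abstract.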

As a proof of concept, we defined six cancer types (adenocarcinoma, sarcoma, squamous carcinoma, lymphoma, melanoma and small cell carcinoma) and seven adenocarcinoma tissues of origin (breast, colorectal, liver, lung, ovary, stomach/esophagus and pancreas) for an ML experiment. We trained a classifier using the cell lines (D1) as the baseline training set, and then added D2 to D1 in consecutive 10% increments for online ML. We tested the baseline model and each subsequent model on the test set T1. When predicting the six cancer types, performance increased monotonically from 0.89 (baseline) to 0.97 (all of D2 used). We observed an analogous trend when predicting the seven tissues of origin (from 0.64 to 0.84).
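The incremental evaluation scheme above (baseline on D1, then adding D2 in 10% steps and testing each model on T1) can be sketched as follows. This is only an illustration under stated assumptions: the abstract names neither the classifier nor the performance metric, so a nearest-centroid classifier and plain accuracy are swapped in, each model is retrained from scratch on the enlarged training set, and the data are synthetic. All function names (`fit_centroids`, `incremental_evaluation`, `make_synthetic`) are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid 'training': compute the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of samples classified correctly (metric chosen as an assumption)."""
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


def incremental_evaluation(d1, d2, t1, steps=10):
    """Train on D1 plus a growing prefix of D2; score each model on T1.

    d1, d2 and t1 are (X, y) tuples of lists. Returns (n_train, accuracy) pairs,
    starting with the D1-only baseline.
    """
    (X1, y1), (X2, y2) = d1, d2
    results = []
    for k in range(steps + 1):
        n_added = round(len(y2) * k / steps)
        model = fit_centroids(X1 + X2[:n_added], y1 + y2[:n_added])
        results.append((len(y1) + n_added, accuracy(model, *t1)))
    return results


def make_synthetic(n, shift, seed):
    """Synthetic 2-feature, 3-class data; `shift` mimics a cohort-specific offset."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 3
        center = (2.0 * label + shift, -2.0 * label + shift)
        X.append([rng.gauss(c, 1.0) for c in center])
        y.append(label)
    return X, y


d1 = make_synthetic(90, shift=0.5, seed=1)   # stands in for cell lines
d2 = make_synthetic(60, shift=0.0, seed=2)   # stands in for tissue training half
t1 = make_synthetic(60, shift=0.0, seed=3)   # stands in for tissue test half
results = incremental_evaluation(d1, d2, t1)
```

Each entry of `results` pairs a training-set size with the score on the fixed test set, which is the shape of the learning curve reported in the text. The synthetic numbers have no bearing on the reported values.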

Our proteomics-based ML model can predict cancer type and adenocarcinoma tissue of origin in concordance with existing histopathological classification. It can also assign probabilities to multiple candidate tumor types and tissues of origin, potentially enabling the classification of CUP in future work. By adding tissue samples stepwise to the existing model, its predictive performance can be further enhanced. This reflects a real-world knowledgebase that will continue to increase in predictive power as additional data are added.