Mass spectrometry-based proteomics studies conducted on sizable population cohorts have immense power for biomarker discovery. Owing to the cohort size and extended periods of MS acquisition, however, undesirable intra- and inter-batch variation is unavoidable; it can obscure real biological signals and thus lead to false discoveries.
These large cohort studies therefore present unique challenges within the framework of existing data analysis pipelines, which normally cater to moderately sized cohorts. Careful experimental design and selection of an appropriate normalization method are crucial to achieving reliable MS quantification with a low false discovery rate. Even with suitable controls in place, data analysis must reduce variability due to batch effects while preserving true biological variation.
We acquired data from a large plasma cohort (n > 1000), prepared in 7 batches of sample preparation and measured over a two-month period using the diaPASEF acquisition method on a timsTOF Pro instrument. We modified an existing normalization approach [2], taking advantage of the spatial experimental design, spiked-in protein standards, and pooled quality control (QC) samples.
To evaluate its ability to eliminate unwanted variation, we compared our method to standard Loess normalization, commonly used for small-cohort proteomics, and to SERRF (systematic error removal using random forest) [3], a normalization approach developed for large-scale metabolomics data. Our methodology significantly reduces unwanted variation, thereby maximizing the statistical power and clinical potential of large-cohort clinical studies.
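As a simple illustration of the QC-based drift correction underlying such approaches (a minimal sketch, not our exact implementation, and with all function and variable names assumed for illustration), one can fit a Loess curve per protein to the QC pool intensities as a function of injection order and divide out the estimated drift:

```python
import numpy as np

def tricube(d):
    # Tricube kernel weights used by Loess (zero outside |d| >= 1)
    return np.clip(1.0 - np.abs(d) ** 3, 0.0, None) ** 3

def loess_1d(x, y, x_eval, frac=0.6):
    """Minimal local linear (Loess-style) smoother: for each evaluation
    point, fit a weighted straight line to the nearest frac*n points."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    fitted = np.empty(len(x_eval), dtype=float)
    for i, x0 in enumerate(x_eval):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]          # k nearest neighbours
        h = dist[idx].max() or 1.0          # local bandwidth
        w = tricube(dist[idx] / h)
        X = np.vstack([np.ones(k), x[idx]]).T
        W = np.diag(w)
        beta, *_ = np.linalg.lstsq(W @ X, W @ y[idx], rcond=None)
        fitted[i] = beta[0] + beta[1] * x0  # local line evaluated at x0
    return fitted

def qc_loess_normalize(intensity, order, is_qc, frac=0.6):
    """Drift-correct one protein's intensities: estimate the acquisition
    drift from the QC pool samples only, then rescale all samples."""
    drift = loess_1d(order[is_qc], intensity[is_qc], order, frac)
    target = np.median(intensity[is_qc])    # anchor to the QC median
    return intensity / drift * target
```

For example, with a synthetic run of 50 injections carrying a steady intensity drift and a QC pool injected every fifth sample, the corrected QC intensities become nearly constant while biological samples keep their relative differences. Production tools would typically use an established smoother (e.g. `statsmodels` lowess) rather than this hand-rolled one.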
References: