iDIA-QC: AI-empowered data-independent acquisition mass spectrometry-based quality control

Gao, Huanhuan; Zhu, Yi; Wang, Dongxue; Nie, Zongxiang; Wang, He; Wang, Guibin; Liang, Shuang; Xie, Yuting; Sun, Yingying; Jiang, Wenhao; Dong, Zhen; Qian, Liqin; Wang, Xufei; Liang, Mengdi; Chen, Min; Fang, Houqi; Zeng, Qiufang; Tian, Jiao; Sun, Zeyu; Xue, Juan; Li, Shan; Chen, Chen; Liu, Xiang; Lyu, Xiaolei; Guo, Zhenchang; Qi, Yingzi; Wu, Ruoyu; Du, Xiaoxian; Tong, Tingde; Kong, Fengchun; Han, Liming; Wang, Minghui; Zhao, Yang; Dai, Xinhua; He, Fuchu; Guo, Tiannan

doi:10.1038/s41467-024-54871-1

Article
Open access
Published: 21 January 2025

Huanhuan Gao^1,2,3^na1,
Yi Zhu ORCID: orcid.org/0000-0003-0429-0802^1,2,3^na1,
Dongxue Wang^4,5^na1,
Zongxiang Nie⁶^na1,
He Wang^1,2,3^na1,
Guibin Wang⁴,
Shuang Liang⁷,
Yuting Xie^1,2,3,
Yingying Sun^1,2,3,
Wenhao Jiang ORCID: orcid.org/0009-0000-9381-9848^1,2,3,
Zhen Dong ORCID: orcid.org/0000-0002-3480-3421^1,2,3,
Liqin Qian^1,2,3,
Xufei Wang⁸,
Mengdi Liang⁸,
Min Chen⁹,
Houqi Fang⁹,
Qiufang Zeng¹⁰,
Jiao Tian¹⁰,
Zeyu Sun ORCID: orcid.org/0000-0003-0218-2277¹¹,
Juan Xue^12,13,
Shan Li ORCID: orcid.org/0000-0001-8646-3193^12,13,
Chen Chen¹⁴,
Xiang Liu¹⁴,
Xiaolei Lyu¹⁴,
Zhenchang Guo¹⁵,
Yingzi Qi¹⁵,
Ruoyu Wu¹⁶,
Xiaoxian Du¹⁶,
Tingde Tong¹⁵,
Fengchun Kong¹⁴,
Liming Han¹⁶,
Minghui Wang¹⁶,
Yang Zhao ORCID: orcid.org/0000-0003-1444-5927¹⁷,
Xinhua Dai¹⁷,
Fuchu He ORCID: orcid.org/0000-0002-1094-4019^4,5 &
…
Tiannan Guo ORCID: orcid.org/0000-0003-3869-7651^1,2,3

Nature Communications volume 16, Article number: 892 (2025)

Abstract

Quality control (QC) in mass spectrometry (MS)-based proteomics is mainly based on data-dependent acquisition (DDA) analysis of standard samples. Here, we collect 2754 files acquired by data-independent acquisition (DIA) and 2638 paired DDA files from mouse liver digests using 21 mass spectrometers across nine laboratories over 31 months. Our data demonstrate that DIA-based LC-MS/MS-related consensus QC metrics exhibit higher sensitivity than DDA-based QC metrics in detecting changes in LC-MS status. We then prioritize 15 metrics and invite 21 experts to manually assess the quality of 2754 DIA files based on those metrics. We develop an AI model for DIA-based QC using 2110 training files. It achieves AUCs of 0.91 (LC) and 0.97 (MS) in the first validation dataset (n = 528), and 0.78 (LC) and 0.94 (MS) in an independent validation dataset (n = 116). Finally, we develop an offline software tool called iDIA-QC for convenient adoption of this methodology.

Introduction

The rapid advancement of mass spectrometry (MS) fosters the evolution of high-throughput, multi-batch, and large-scale quantitative proteomics^1,2,3. A high-performing and stable liquid chromatography-tandem mass spectrometry (LC-MS/MS) workflow is essential for the precise identification and quantification of proteins^4,5,6. Currently, the assessment of instrument performance relies primarily on the periodic analysis of a quality control (QC) sample. The stability of the instrument is evaluated based on variations of the QC data. In a typical proteomics experiment, the optimal condition results in a satisfactory outcome as reflected by the number of peptide and protein identifications. Therefore, in most cases, proteomics practitioners adopt the number of identified peptides and proteins as a simple and direct measure for quality control of LC-MS. Nonetheless, when the figures fall below the anticipated range, pinpointing the cause is usually challenging, time-consuming and highly empirical. In quantitative experiments, it is not uncommon for the quantitative accuracy and reproducibility to be compromised without any reduction in peptide and protein identifications. Therefore, the development of more sophisticated and elegant parameters involving the LC and MS systems has been proposed. Most QC studies have focused on data-dependent acquisition (DDA) MS experiments using a specific LC-MS system. Rudnick et al.⁷ published the milestone paper in the field, introducing 46 metrics related to the QC of DDA data. The 46 metrics concern chromatography, ion source, MS1, dynamic sampling, MS2, and peptide identification. The Tabb team further refined these parameters and developed QuaMeter for DDA QC of multivendor LC-MS instruments^8,9. Matzke et al.¹⁰ reported a multivariate statistical strategy to identify outlier LC-MS runs. The Payne team manually curated 1150 multivendor DDA-MS raw files and developed machine learning models for QC¹¹. The Laukens team and the Martens team have developed a new file format and software tools to describe and evaluate the quality of DDA-MS files^12,13,14. The MacCoss team has reported the Skyline-based QC software tools SProCoP¹⁵ and the Panorama AutoQC webserver¹⁶ for targeted proteomics experiments, i.e., selected/multiple reaction monitoring (SRM/MRM).

Nevertheless, these QC studies and software tools are exclusively for DDA-MS and/or SRM-based targeted proteomics. To date, there has been no systematic investigation of data-independent acquisition (DIA) QC. This is not surprising, considering that large DIA data sets emerged only recently, thanks to rapid progress in LC and MS systems^2,17,18,19. DIA-MS has arguably become the method of choice for large-scale label-free quantitative proteomics experiments, not only for body fluids^20,21 and bulk cell and tissue specimens^22,23,24, but also for single-cell proteomics^25,26. The success of DIA experiments requires robust LC and MS systems; QC is therefore crucial. However, the existing QC pipelines mentioned above cannot be directly applied to large-scale DIA experiments.

In this work, we establish a computational pipeline demonstrating that DIA-based quality control outperforms DDA-based approaches in detecting changes in LC-MS status across multivendor systems, validated through data from 21 mass spectrometers over 31 months and further supported by an AI model built from expert-inspected QC metrics. Nine laboratories dedicate 21 high-resolution mass spectrometers of eight models, including Orbitrap, TripleTOF, and timsTOF Pro series instruments, for this comprehensive QC study. The same mouse liver digests are injected into these mass spectrometers for both DDA and DIA analyses. The MS operators monitor the performance of these instruments for up to 31 months, and all the maintenance data are recorded. This allows an unbiased comparison of DDA and DIA for LC-MS QC. Our data show that the DIA-based QC outperforms the DDA-based QC in detecting changes in LC-MS status. Twenty-one experts are recruited from 10 laboratories for manual inspection of the 2754 DIA QC files based on 15 prioritized consensus QC metrics. These original manually inspected data are then used to build an AI model, which is further implemented as a standalone software tool for community applications.

Results

Longitudinal monitoring of the performance of 21 mass spectrometers

First, we generated paired DDA and DIA files as QC using a homogeneous mouse liver digest sample prepared from six mice, called Westlake Mouse Liver Digests (WMLD), employing pressure cycling technology (PCT)-assisted sample preparation^27,28 (Fig. 1). The WMLD aliquots, reconstituted using a standardized protocol, were distributed to nine different laboratories. Altogether, 21 mass spectrometers of eight different types were included in this study, including Orbitrap, TripleTOF, and timsTOF Pro series mass spectrometers (Fig. 1; Supplementary Fig. 1). Each instrument was operated to generate a pair of DDA and DIA runs using a 250 to 1000 ng WMLD sample over an effective LC gradient of 30 min, with the corresponding run-to-run time ranging from 40 to 55 min depending on the LC system (Supplementary Data 1). The 21 LC-MS systems from nine laboratories were monitored for a period ranging from 91 to 873 days, with a median duration of 321 days (Fig. 2A). The entire project spanned over 125 weeks. Detailed data acquisition parameters are provided in Supplementary Data 1.

**Fig. 1: Schematic overview of the study.**

**Fig. 2: Longitudinal monitoring of 21 MS instruments.**

A total of 2638 DIA raw files and 2638 paired DDA raw files were obtained, accompanied by a detailed maintenance log (Supplementary Data 2). Throughout this period, we conducted a total of 785 LC-MS maintenance procedures, which could be categorized into three primary groups: LC maintenance, MS front-end maintenance, and MS back-end maintenance (Fig. 2). Additionally, these maintenance procedures were further categorized into 27 subtypes, as listed in Fig. 2. The five most frequently performed subtypes included changing analytical column, changing trap column, calibrating MS accuracy, cleaning quadrupole, and cleaning S-lens (specific to certain Orbitrap instruments). On average, maintenance was carried out 0.6 to 5 times per month across various laboratories. The number of proteins identified varied widely across different instruments. The timsTOF Pro instrument had the highest number of protein identifications, followed by the Orbitrap series, with the TripleTOF series ranking third (Fig. 2C). The quality of these QC files was examined by 21 MS experts from 10 different participating laboratories using 15 LC-MS-related consensus metrics (Fig. 1A).

DIA-based QC is more sensitive than DDA-based QC in detecting LC-MS system faults

We first selected commonly used metrics for both DDA and DIA, encompassing both quantitative and qualitative measures such as peptide number, protein number, and MS signals. For MS signals, we used the MS1 area of targeted peptide precursors for DDA data, while the MS2 intensities of targeted peptide precursors were used for DIA files. We monitored a QE HF-X instrument for 280 days, selecting four representative time periods of 30 to 40 days each. During these periods, we analyzed various maintenance activities (Fig. 3). Each panel in Fig. 3A–D represents a specific time frame. Our data showed that while DDA and DIA metrics were generally consistent, the DIA-based QC metrics were more sensitive in detecting various system faults. Additionally, we observed that MS1 signals from DDA were more sensitive indicators of instrument performance than peptide and protein identifications in DDA QC files. Similarly, MS signals in DIA files were more sensitive than peptide and protein identifications in those files. This is not unexpected, as most MS signals are not interpreted by current software. Overall, our results demonstrate that metrics from DIA files provide superior sensitivity in monitoring fluctuations across multiple LC-MS systems.

**Fig. 3: DIA-based QC metric is more sensitive than DDA-based QC metric in detecting data changes.**

Performance metric prioritization for DIA-based QC

Despite numerous proposed performance metrics in prior articles^8,9,29, relatively few studies have delved into DIA-based QC³⁰. To assemble a comprehensive set of performance metrics for evaluating DIA files, we adopted six metrics from previous publications³⁰ and assembled 135 additional candidate metrics from papers related to DDA and selected reaction monitoring (SRM) (Fig. 4A). More details are provided in Supplementary Data 3. We categorized those 141 metrics into eight main groups based on the concerned sample issue and LC-MS system components: sample preparation, LC system, ion source, MS1 scan, MS2 scan, peptide and protein identification, peptide and protein quantification, and statistical analysis (Supplementary Data 3). We first removed metrics used for evaluating sample preparation, which did not align with the quality control criteria of our study. Then we eliminated redundant metrics⁷, including those unsuitable for DIA files, such as ion injection time, TopN, and identified PSM. We also removed 14 multivariate statistics-related metrics, including MBR alignment²⁹, which is specific to MaxQuant results, PCA analysis⁹, and correlation coefficient analysis^9,31. We reassigned the remaining metrics to five categories (Fig. 4B) and conducted an independent evaluation for each category to ensure both the completeness and simplicity of the metrics. Since most metrics reported in the literature are specific to DDA, we also selected a few metrics from SRM-MS for evaluating DIA data.

**Fig. 4: Selecting metrics for data annotation.**

After the above-mentioned filtering, we focused on 13 metrics. To further refine the selection, we sought input from 21 MS experts, including instrument operators, application engineers, and service engineers from analytical instrument companies including SCIEX, Thermo Fisher Scientific, and Bruker Daltonics. After discussion, the experts proposed adding two metrics, namely precursor ion chromatogram (PIC)³² and ion mobility accuracy, to evaluate the chromatography and ion mobility, the latter specifically for diaPASEF data acquired by timsTOF Pro instruments. We thereby assembled 15 metrics corresponding to five characteristics of the LC-MS system: chromatography, ion source, MS1 scan, MS2 scan, and identification and quantification. More details are provided in Supplementary Data 3 and Fig. 4B.

To evaluate whether these 15 metrics can capture common LC-MS system faults during maintenance, we constructed a Sankey diagram illustrating the relationship between these metrics and common instrument faults. These faults include a decline in MS2 signal with good MS1 signal and poor identification, good MS1 and MS2 signals with poor identification, a similar proportional decline in MS1 and MS2 signals with poor identification, unstable spray, loss of hydrophilic peptides, loss of hydrophobic peptides, retention time shift with poor peptide identification, and poor peak shape, as shown in the 1st node of Fig. 5A. Our data suggested that combinations of the 15 metrics (2nd node, Fig. 5A) can effectively detect all these LC-MS faults. For instance, the combination of F4 (charge state distribution) and F5 (intensity variation of adjacent MS1 scans) could be utilized to detect a raw file acquired with an unstable ion source spray, typically stemming from anomalies in the spray needle and fluctuations in the LC gradient flow rate (4th node, Fig. 5A). Note that this Sankey diagram was derived from the 20 LC instruments (Supplementary Data 1) evaluated in this study. Other types of LC instruments, such as Evosep³³, are not included. We have submitted the 15 metrics to the mzQC³⁴ Format Specification Group of the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) Quality Control working group so that more people are aware of and can use these metrics for quality assessment. The controlled vocabulary terms of those metrics are provided in Supplementary Data 3.

**Fig. 5: Annotation of metrics for 2638 DIA files by 21 experts.**

Manual inspection of 2638 DIA QC data files using the 15 metrics

The 15 metrics were utilized to assess the performance of the 2638 DIA raw files. For reliable annotation results, 21 expert raters were recruited from 10 different institutions, including instrument operators and application engineers from manufacturers. Each DIA file underwent an independent review by four or five experts possessing extensive hands-on experience with their respective MS instruments. More details are provided in Fig. 5B. The DIA raw files were evaluated by raters and categorized as either qualified or unqualified based on the 15 metrics. In addition, raters were requested to evaluate the overall LC and MS status of each raw file, respectively, and make a conclusion of either qualified or unqualified based on their empirical overall judgment. The qualified tag indicates that the performance, although not optimal, is deemed acceptable. The 17 metrics were assessed using an inter-rater algorithm known as “observed agreement”³⁵. The final quality label for each file was determined through a majority voting system³⁶.
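The two annotation statistics named above can be illustrated with a minimal sketch. It assumes one common definition of observed agreement (the fraction of rater pairs assigning the same label to a file) and an arbitrary tie-breaking rule; neither detail is specified in the text, and the file names and ratings are hypothetical.

```python
from collections import Counter
from itertools import combinations


def observed_agreement(labels):
    """Fraction of rater pairs that gave the same label to one file."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two ratings")
    return sum(a == b for a, b in pairs) / len(pairs)


def majority_label(labels, tie="unqualified"):
    """Most frequent label; ties fall back to `tie` (an assumption)."""
    (top, n_top), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == n_top:
        return tie
    return top


# Hypothetical ratings from four or five experts per file.
ratings = {
    "file_A": ["qualified", "qualified", "qualified", "unqualified", "qualified"],
    "file_B": ["qualified", "unqualified", "unqualified", "unqualified"],
}
summary = {
    name: (observed_agreement(r), majority_label(r)) for name, r in ratings.items()
}
```

With an even number of raters a tie is possible, so the fallback label is a design choice that would need to be fixed before annotation begins.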

To assess the reliability of this extensive manual inspection, we randomly chose 11 DIA files from various instruments and, without informing the annotators, replicated these files twice within the entire dataset slated for annotation. This was carried out to assess the precision of the annotators’ file labeling. The median observed agreement for the technical replicates was 100%, indicating high consistency and reproducibility of this data evaluation process (Fig. 5C)^35,37. Afterwards, we compiled annotation results of 2638 DIA files. Our raters demonstrated a strong consensus on metric F4 (charge state distribution), F5 (intensity variation of adjacent MS1 scans), F6 (precursor mass accuracy), F8 (fragment mass accuracy), F11 (number of identified peptides), and F12 (number of identified proteins), with consistency rates from 65% to 83% (blue bars in Fig. 5D, E). Conversely, achieving consensus on F10 (ratio of peak area of MS1 to MS2), F13 (MS1 area of targeted precursors), and F14 (MS2 intensities of targeted precursors) proved to be more challenging. Next, we employed an inter-rater algorithm termed “observed agreement” to gauge the reliability of the evaluation (Supplementary Fig. 2A). The median observed agreement across the 21 instruments ranged from 73% to 90% (Supplementary Fig. 2A and Supplementary Fig. 3), indicating that the manual inspection by 21 experts was reasonably reliable and reproducible. 73–80% of the DIA files were annotated as qualified across the majority of the 17 metrics (Supplementary Fig. 2B). The fewest faults were observed in terms of F4 (charge state distribution), F5 (intensity variation of adjacent MS1 scans), and F6 (precursor mass accuracy), since 88% to 90% of the DIA files were marked as qualified based on these three metrics.

Development of a DIA-based classifier for QC

Next, we sought a concise selection of peptide precursors from the mouse liver digest that exhibit robust identification across various instruments. We generated 221 liver digests as biological replicates from four mice. These were subsequently analyzed in DIA mode in seven cohorts using four types of mass spectrometers. Using DIA-NN and an in-house built spectral library, we identified 176,304 peptide precursors, with missing values ranging from 50% to 66%. Based on the inclusion criteria outlined in Fig. 6A, we narrowed our focus to peptide precursors that were consistently detected across all DIA files, with appropriate length and no missed cleavage. We additionally excluded peptides containing reactive residues such as methionine and N-terminal glutamine, along with those featuring inhibitory motifs for trypsin, such as Lys-Lys and Arg-Arg, following a previous publication³⁸. We also manually inspected each of these peptide precursors using Skyline to verify their quality, particularly in terms of transition peak shape. This process yielded 934 high-quality peptide precursors. The physicochemical properties of these peptide precursors are illustrated in Supplementary Fig. 4. Ninety percent of the peptide precursors consist of 7 to 15 amino acids, with the GRAVY value³⁹ of the 934 peptide precursors ranging from −2.5 to 2.3. Mean retention time of the 934 peptide precursors corresponds to 90% of the elution time across datasets. Intensities of the remaining peptide precursors span 3.5 to 4 orders of magnitude. The 934 peptide precursors were used for the subsequent classifier development.
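The sequence-level inclusion criteria can be sketched as a simple filter. This is an illustration only: the length bounds, the use of flanking residues to detect Lys-Lys and Arg-Arg motifs, and the omission of the proline rule for missed cleavages are all assumptions, and the function name is hypothetical. The consistency across files and the manual Skyline inspection are not modeled.

```python
def passes_precursor_filters(peptide, n_flank="", c_flank="", min_len=7, max_len=30):
    """Return True if a tryptic peptide passes the sequence-level criteria.

    n_flank / c_flank are the protein residues before and after the peptide
    (optional); min_len and max_len are assumed bounds, not values from the text.
    """
    if not (min_len <= len(peptide) <= max_len):
        return False
    # No missed cleavage: K/R allowed only at the C-terminus (proline rule ignored).
    if any(aa in "KR" for aa in peptide[:-1]):
        return False
    # Reactive residues: any methionine, or glutamine at the N-terminus.
    if "M" in peptide or peptide.startswith("Q"):
        return False
    # Dibasic motifs around the cleavage sites can inhibit trypsin.
    context = n_flank[-2:] + peptide + c_flank[:2]
    if "KK" in context or "RR" in context:
        return False
    return True


# Arbitrary example sequences, not taken from the 934-precursor list.
candidates = ["LVNELTEFAK", "MLVNELTEFAK", "QLVNELTEFAK", "LVNKLTEFAK", "LVNEK"]
kept = [p for p in candidates if passes_precursor_filters(p)]
```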

**Fig. 6: Peptide precursor candidates’ selection, classifier development, performance evaluation, and validation in two independent blinded test datasets.**

Subsequently, the 2638 DIA files were analyzed using DIA-NN and an in-house built spectral library for mouse liver, leading to identification of a total of 220,514 peptides from 11,970 proteins (Supplementary Fig. 1B; Fig. 2C). Among the four TripleTOF series instruments, a median of 2916 proteins were identified. The 13 Orbitrap series instruments identified a median of 4381 proteins, while four timsTOF Pro instruments identified a median of 6070 proteins. As expected, the data from each instrument were well resolved in t-distributed stochastic neighbor embedding⁴⁰ (t-SNE) plot (Supplementary Fig. 5). This indicates that relying solely on data dimensionality reduction analysis to differentiate the quality among different instruments poses challenges. Conversely, t-SNE analysis of the data acquired by a particular instrument at various time points exhibited no obvious batch effects, indicating that the data from a specific MS instrument can be considered comparable (Supplementary Fig. 6).

These files were then divided into a training set containing 2110 DIA files, and a validation set (n = 528) (Fig. 6B). Based on the training set, we extracted the precursor-related features (F2, F13 and F14) of the 934 peptide precursors, along with the DIA raw file-based features (F1, F3–F12), prior to applying XGBoost-based machine learning algorithms⁴¹ with five-fold cross-validation. To ensure that our selected features could be practically measured following QC sample acquisition, we constrained the number of features to twenty or fewer. A panel of 19 peptide precursors along with F3 achieved the optimal accuracy in distinguishing qualified and unqualified DIA files at the LC level (Fig. 6C), while another panel composed of 14 peptide precursors, along with F4, F5, F6, F7, F11, and F12, achieved the highest accuracy in segregating qualified and unqualified DIA files at the MS level (Fig. 6D). The contribution of each feature to the LC and MS classifiers was subsequently evaluated using the SHapley Additive exPlanations (SHAP) algorithm (Fig. 6C, D). We further evaluated the performance of the LC and MS models using the validation dataset containing 528 DIA files. The annotations were blinded during feature extraction and machine learning model development, ensuring an unbiased evaluation. The receiver operating characteristic (ROC) plot for this validation set, using the 20 independent features for each of the LC and MS models, yielded an area under the curve (AUC) of 0.91 for the LC model (Fig. 6E) and an AUC of 0.97 for the MS model (Fig. 6F).
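For readers less familiar with the AUC values reported here, the sketch below shows how an AUC can be computed from classifier scores and expert labels using the rank interpretation (the probability that a randomly chosen positive file scores higher than a randomly chosen negative one). It does not reproduce the XGBoost models; the labels and scores are hypothetical, and treating "qualified" as the positive class is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    Tied scores count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical majority-vote labels (1 = qualified) and model scores.
labels = [1, 1, 1, 0, 1, 0, 0, 1]
scores = [0.95, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30, 0.20]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of files, which is acceptable for validation sets of a few hundred files such as those used here.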

To further validate the machine learning features, we additionally used 116 DIA datasets collected by a QE HF instrument from December 2023 to August 2024 for quality control. During these eight months, the QE HF instrument conducted QC twice a week using a 30-minute DIA method with an effective LC gradient during normal operation, and additional QC data were collected before and after any instrument malfunctions. The identified proteins in those 116 DIA files ranged from 1983 to 4794 with a median of 4634 proteins (Supplementary Fig. 7A). We invited three experts to annotate 116 raw files based on the 14 aspects mentioned previously in the article. The observed agreement among each pair of experts ranged from 0.85 to 0.87 (Supplementary Fig. 7B), indicating a good level of consistency. For most metrics, our raters demonstrated a consensus, with consistency rates between 82% and 99% (Supplementary Fig. 7C). To improve the consistency and accuracy of high-quality labels while minimizing the bias or errors of individual reviewers, we applied the principle of majority rule in determining the final labels. Ultimately, 68% to 99% of the files were classified as qualified across most of the 16 metrics (Supplementary Fig. 7D). The ROC plot for these independent test sets, utilizing 20 independent features for the LC and MS systems, yielded an AUC of 0.78 for the LC model (Fig. 6G) and an AUC of 0.94 for the MS model (Fig. 6H). These results demonstrate that the model is highly effective and practical for monitoring LC-MS instruments.

To test the stability of these 33 selected peptide precursors in real-world applications, we conducted additional experiments to monitor them over time. We analyzed a WMLD sample using DIA in a TripleTOF 5600+ over 31 days. The retention time, GRAVY value and coefficient of variation (CV) of peptide precursors are shown in Fig. 6I. The two panels had no overlapped peptide precursors, suggesting each of them captures unique characteristic of the LC and MS system. Their retention time values spanned from 7.67 to 33.07 min, while the GRAVY values ranged from −1.97 to 1.83, indicating a diverse distribution of hydrophilic and hydrophobic properties. All except one peptide precursor exhibited a CV lower than 5.8%. The highest CV was 11.2%. The median CV was 2.9%, while the mean CV was 3.5%. The data strongly indicate that these peptides are consistently detected over time, thus enabling the derivation of reliable QC metrics.
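The two per-precursor quantities discussed above, the GRAVY value and the coefficient of variation, can be computed as in the following sketch. The hydropathy values are the standard Kyte-Doolittle scale; the peptide sequences and repeated measurements are hypothetical and are not taken from the two panels, and the text does not state which measured quantity the CV refers to.

```python
from statistics import mean, median, stdev

# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in peptide) / len(peptide)


def cv_percent(values):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical repeated measurements of two precursors across QC runs.
measurements = {
    "LVNELTEFAK": [10.0, 10.2, 9.8],
    "AEFVEVTK": [5.0, 5.5, 4.5],
}
cvs = {pep: cv_percent(vals) for pep, vals in measurements.items()}
panel_median_cv = median(cvs.values())
```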

iDIA-QC software tool for quality control evaluation of DIA files

Finally, we built a software tool called iDIA-QC to automate the QC and troubleshooting workflow described above. The metric extraction process within the software depends on two independent software tools, namely msConvert and DIA-NN. Specifically, msConvert was used to extract F3 and F5; F15 was obtained from MGF files acquired with the timsTOF Pro instrument; and the remaining metrics were extracted using DIA-NN (Supplementary Fig. 8). Given a 30-min DIA analysis of the mouse liver digest, iDIA-QC outputs a comprehensive report of the 17 metrics based on machine learning models for assessing the LC and MS. The tool not only reports whether the LC and MS systems are qualified, but also provides guidance pinpointing the potential cause of the faults, as exemplified in Fig. 5A.

Discussion

Quality control for quantitative MS-based proteomics is predominantly conducted using DDA and SRM/MRM analyses of certain QC samples. In 2010, Rudnick et al.⁷ proposed 46 system performance metrics, which are currently the most comprehensive scheme for QC and are widely adopted by the community in integrative computational modeling^8,9. However, they lack essential information for evaluating MS2-based quantitation of peptide precursors. This could be attributed to the fact that the paper was published before DIA-MS prevailed in quantitative proteomics¹⁷. Xuan et al.³⁰ discussed six metrics used for evaluating the MS instrument status based on DIA files when they presented the MS1-based DIA workflow. However, the community still lacks a comprehensive investigation of QC for DIA experiments, which require more rigorous performance for both MS1- and MS2-based quantitation of peptide precursors. The SRM-based QC software tools are more rigorous in terms of quantitative proteomics; however, they cannot be directly applied to large-scale quantitative proteomics experiments concerning thousands of peptide precursors^15,16,42.

The first contribution of our study is the demonstration that DIA-MS is a more sensitive and precise indicator of the quality of LC and MS systems. Moreover, we generated what are, to our knowledge, the most comprehensive QC data sets to date, using both DDA-MS and DIA-MS in multivendor LC and MS instruments over a period of 125 weeks in different laboratories. We further recruited 21 LC-MS experts to annotate the resultant 2754 DIA files. In addition, we compiled over 140 potentially informative QC metrics and, through group efforts, narrowed them down to 15 practically applicable metrics. Some of the metrics, for instance the one evaluating ion mobility, have not been reported before.

While various QC samples are being used in different laboratories, we proposed a mouse liver digest sample, which is relatively abundant and cheap, and from it derived two peptide precursor panels that are robustly measurable in all the MS instruments used in our study. With two machine learning models, the performance of LC and MS can be reliably predicted (AUCs of 0.91 and 0.97, respectively, in the first validation set, and 0.78 and 0.94 in the independent validation set). The models are condensed in a convenient software tool called iDIA-QC. To the best of our knowledge, this is the first software tool which not only evaluates the quality of a raw DIA file, but also offers guidance on the potential causes (Fig. 5).

Of the 15 prioritized QC metrics, only three rely on peptide precursors from mouse liver digests, while the remaining 12 are generic metrics independent of the mouse sample. This suggests that the workflow could potentially be applied to various QC and experimental samples, though further investigation is needed to fully generalize its application. While our study covers a broad range of commonly used instrument types in the participating laboratories, it is not possible to include all LC and MS instruments, particularly newly emerging ones. However, we expect that it will not be too difficult to extend the metrics, the standard operating procedures, the computational models, and the software tool, as established here, to evaluate the performance of additional instruments.

In conclusion, we established a strategy to adopt DIA-MS for QC of MS-based quantitative proteomics experiments. All the potentially informative QC metrics have been investigated in a comprehensive manually curated QC dataset. We proposed two machine learning models to evaluate the performance of various LC and MS systems and present a software tool, namely iDIA-QC, for automated QC and troubleshooting of LC-MS instruments.

Methods

Preparation of mouse liver digest

Fresh frozen mouse liver tissues were harvested from six C57BL/6J mice (3 males and 3 females), with the approval of the Animal Care and Ethics Committee of Westlake University (No. 2019-027-GTN). Pressure cycling technology (PCT)^28,43-based methods were used for mouse liver tissue lysis and protein digestion⁴⁴. Unless otherwise noted, all reagents were purchased from Sigma-Aldrich. In brief, 1 mg of each mouse liver sample was weighed and reconstituted in a lysis buffer containing 6 M urea and 2 M thiourea in 100 mM ammonium bicarbonate (ABB) at pH 8.5. The sample was then transferred to a PCT-MicroTube for subsequent PCT-assisted tissue lysis. Next, 5 µL of a 100 mM Tris(2-carboxyethyl)phosphine (TCEP) solution (final concentration 10 mM) and 2.5 µL of an 800 mM iodoacetamide solution (final concentration 40 mM) were added for reduction and alkylation. Following this, 75 µL of a 100 mM ABB solution was added to each PCT-MicroTube. Lys-C was added at an enzyme-to-substrate ratio of 1:80 (wt/wt), and trypsin at a ratio of 1:20 (wt/wt), to initiate protein digestion. Tissue lysis, reduction, alkylation, and protein digestion were conducted using the Barocycler NEP2320-45K (Pressure Biosciences Inc., MA). The tryptic peptides were cleaned using Sep-Pak C18 1cc Vac Cartridges (Waters, MA). Finally, the peptide solutions were pooled, aliquoted into tubes, lyophilized into powder, and stored at −80 °C for subsequent analysis.
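As a quick consistency check on the stated reagent volumes, the dilution relation C1·V1 = C2·V2 can be applied to the TCEP and iodoacetamide additions. The sketch below derives the total reaction volume implied by each stated stock and final concentration; this volume is an inference from the numbers given, not a value reported in the protocol, and the function name is hypothetical.

```python
def implied_total_volume_ul(stock_mM, added_ul, final_mM):
    """Total volume (µL) at which `added_ul` of stock reaches `final_mM` (C1*V1 = C2*V2)."""
    return stock_mM * added_ul / final_mM


# Stated values: 5 µL of 100 mM TCEP to 10 mM; 2.5 µL of 800 mM iodoacetamide to 40 mM.
tcep_volume = implied_total_volume_ul(stock_mM=100, added_ul=5, final_mM=10)
iaa_volume = implied_total_volume_ul(stock_mM=800, added_ul=2.5, final_mM=40)
```

Both additions imply the same total volume, so the two stated final concentrations are mutually consistent.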

DIA library construction

We constructed three mouse liver-specific spectral libraries for two MS vendors and, together with a published mouse spectral library⁴⁵, used them to analyze the 2754 DIA files. About 200 µg of mouse liver digest was first fractionated by reversed-phase LC on a Thermo Scientific UltiMate™ 3000 RSLC LC system into 90 fractions, which were combined into 30 fractions using a specific strategy, evaporated, and resuspended for subsequent analysis. Each fraction was injected and trapped on an Acclaim PepMap RSLC C18 column, followed by elution using a linear ramping gradient. The analysis was conducted using an UltiMate™ 3000 RSLCnano LC system (Thermo Fisher Scientific) coupled with a QE HF-X, and a nanoElute system (Bruker Daltonics) coupled with a timsTOF Pro instrument.

The FragPipe computational platform (version 13), incorporating MSFragger⁴⁶ (version 2.4) and Philosopher⁴⁷ (version 3.2.5), was employed to generate three spectral libraries for FAIMS-DIA, DIA, and diaPASEF⁴⁸ analysis of mouse liver tissues. FragPipe searched all DDA files against a mouse Swiss-Prot FASTA database encompassing 20,269 protein sequences (downloaded on May 21, 2021), with default settings.

For DIA files produced by TripleTOF series instruments, a previously established spectral library containing 110,126 precursors and 10,542 protein groups was utilized⁴⁵. The spectral library used to analyze the data generated from the Orbitrap instruments without FAIMS contains 161,081 peptide precursors from 8513 protein groups, while the libraries for FAIMS-Exploris 480 and timsTOF Pro contain 154,452 peptide precursors from 8593 protein groups and 305,630 peptide precursors from 9488 protein groups, respectively.

DDA-MS analysis

The FragPipe computational platform (version 14), incorporating MSFragger (version 3.1.1), Philosopher (version 3.2.5), and IonQuant (version 1.4.4) components, was utilized for the analysis of the DDA files. The mouse Swiss-Prot FASTA database (downloaded on May 21, 2021), comprising 20,269 protein sequences, was queried, permitting a maximum of two missed cleavages. Cysteine carbamidomethyl was established as a fixed modification, while methionine oxidation was designated as a variable modification. We opted for comprehensive identification and false discovery rate (FDR) analysis, with an FDR threshold set at 1%.

DIA-MS analysis

We used DIA-NN (v1.7.15)⁴⁹ to analyze data acquired through various data-independent acquisition methods, encompassing FAIMS-DIA and DIA files acquired from Orbitrap instruments with and without FAIMS devices, SWATH files from TripleTOF series instruments, and diaPASEF files from timsTOF Pro instruments. We applied a data filtration criterion of 1% FDR at both the peptide precursor and protein levels and set the quantification strategy to robust LC (high accuracy) mode. All 2754 DIA files analyzed in the study have been uploaded to the Proteomics IDEntifications (PRIDE) database.

Description of the 15 metrics

The 15 metrics can be divided into five sections: liquid chromatography, ion source, MS1 scan, MS2 scan, and the identification and quantification of peptides and proteins. The names of these 15 metrics are shown in Fig. 4B.

LC-MS sample analysis was evaluated using three key metrics. The first metric, peak width (FWHM), assesses column performance, with wider peaks indicating decreased performance, especially for hydrophobic peptides. Sharper peaks produce higher signal intensities and improve the signal-to-noise ratio, increasing the number of identified peptides. The second metric, retention time of targeted precursors, measures the elution time and order of targeted peptide precursors, allowing for comparison between raw files. The third metric, precursor ion chromatogram (PIC), provides comprehensive information on peptide elution, with a focus on hydrophilic and hydrophobic peptides. The number of MS1 signals from the 2754 DIA files showed varying distributions due to differences in LC-MS systems. Consequently, the MS1 scan was scaled using linear interpolation⁵⁰, generating 1000 additional data points within the existing range.
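The rescaling step described above can be illustrated with a minimal sketch. It assumes the trace is available as paired arrays of retention times and intensities, and it resamples the trace onto a fixed grid of 1000 evenly spaced points; the function name `resample_trace` and the fixed-grid interpretation are assumptions for illustration, not the iDIA-QC implementation.

```python
import numpy as np


def resample_trace(times, intensities, n_points=1000):
    """Linearly interpolate a chromatographic trace onto an evenly spaced grid.

    Hypothetical helper: the name and the fixed-grid strategy are assumptions.
    """
    times = np.asarray(times, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    # np.interp requires increasing x values, so sort by time first.
    order = np.argsort(times)
    times, intensities = times[order], intensities[order]
    # The grid spans only the observed range, so nothing is extrapolated.
    grid = np.linspace(times[0], times[-1], n_points)
    return grid, np.interp(grid, times, intensities)
```

Resampling every file to the same number of points makes traces from LC-MS systems with different MS1 scan rates directly comparable as fixed-length feature vectors.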

The fourth metric, charge state distribution, is crucial for peptide identification. Charge state distribution can vary based on peptide length, composition, and experimental ionization conditions. By monitoring this distribution, we can assess the quality of peptide identification and identify potential factors affecting result accuracy. The fifth metric, intensity variation of adjacent MS1 scans, is a critical factor in achieving reliable and reproducible mass spectra. It is monitored by analyzing the ratio of MS1 total ion current between adjacent MS1 scans. By tracking the frequency of changes in MS1 intensity, we can detect issues with electrospray stability and take corrective actions to ensure data reliability. Factors such as sample quality, spray needle condition, and the applied voltage of the ion source can impact electrospray stability, leading to inconsistent results.
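As a rough illustration of the fifth metric, the sketch below computes the ratio of MS1 total ion current between adjacent MS1 scans and the fraction of scan pairs whose intensity changes by more than a chosen fold. The function names and the two-fold threshold are hypothetical; the text does not specify the cut-off used in iDIA-QC.

```python
import numpy as np


def adjacent_tic_ratios(tic):
    """Ratio of each MS1 TIC value to the preceding one (NaN if the previous TIC is 0)."""
    tic = np.asarray(tic, dtype=float)
    prev, nxt = tic[:-1], tic[1:]
    ratios = np.full(prev.shape, np.nan)
    valid = prev > 0
    ratios[valid] = nxt[valid] / prev[valid]
    return ratios


def fraction_of_jumps(tic, fold=2.0):
    """Fraction of adjacent scan pairs whose TIC changes by more than `fold`.

    The default fold of 2.0 is an illustrative assumption.
    """
    ratios = adjacent_tic_ratios(tic)
    ratios = ratios[~np.isnan(ratios)]
    if ratios.size == 0:
        return 0.0
    jumps = (ratios > fold) | (ratios < 1.0 / fold)
    return float(jumps.mean())
```

A stable electrospray should give ratios close to 1 for nearly all adjacent scans, so a rising fraction of large jumps would point to spray instability.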

The sixth metric, precursor mass accuracy, was extracted from the DIA-NN mainoutput. Mass shifts can be caused by various factors such as temperature, airflow, or instrument calibration. The seventh metric, TIC MS1 signal, was obtained by integrating the MS1 chromatogram profile. This signal can be influenced by contamination on the MS front-end, retention time shifts, or loss of hydrophilic/hydrophobic peptides. Although comparability between raw files from different instrument types is limited, data from the same instrument can be used to assess consistency with this parameter. Metric 15 evaluates the ion mobility accuracy of a standard sample with an m/z of 622.0290 and a reduced ion mobility coefficient (1/K0) of 0.9915, as extracted from diaPASEF files. The standard sample, ESI-L Low Concentration Tuning Mix from Agilent, was introduced into the ion source filter of the timsTOF Pro instrument. This metric is specifically designed for analyzing raw files generated by timsTOF Pro instruments. The eighth metric evaluates the mass accuracy of identified peptide fragment ions, extracted from the DIA-NN mainoutput.stats.tsv file. The ninth metric quantifies the TIC MS2 signal, obtained by integrating the MS2 chromatogram profile. Several factors, such as contamination at the MS front-end and fragmentation efficiency, can adversely affect the MS2 signal. The tenth metric, the ratio of the peak area of the MS1 signal to that of the MS2 signal, was introduced to distinguish between these two issues. This ratio is calculated by dividing the value of metric seven by that of metric nine. Together, metrics seven, nine, and ten enable a more precise identification of the underlying factors affecting MS performance.
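The relationship between metrics seven, nine, and ten can be sketched as follows. The example integrates a chromatogram profile with the trapezoidal rule and divides the MS1 area by the MS2 area; the trapezoidal rule and the function names are assumptions, since the text does not state which integration scheme is used.

```python
import numpy as np


def integrate_trace(rt, signal):
    """Area under a chromatogram profile by the trapezoidal rule (assumed scheme)."""
    rt = np.asarray(rt, dtype=float)
    signal = np.asarray(signal, dtype=float)
    return float(np.sum((signal[1:] + signal[:-1]) / 2.0 * np.diff(rt)))


def ms1_to_ms2_ratio(ms1_area, ms2_area):
    """Metric ten as described: TIC MS1 area divided by TIC MS2 area."""
    if ms2_area == 0:
        # No MS2 signal at all: the ratio is undefined.
        return float("nan")
    return ms1_area / ms2_area
```

Read together, a drop in both areas with a stable ratio would be consistent with a front-end problem, whereas a drop in the MS2 area alone, and therefore a rising ratio, would be more consistent with reduced fragmentation efficiency.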

Metrics 11 and 12 reflect the number of identified peptides and proteins, respectively. These values are obtained by matching MS2 spectra to those in a spectral library and are extracted from the DIA-NN mainoutput. Several factors can influence the results of these metrics, including the peak width and retention time distribution during chromatography, as well as the TIC MS1 and MS2 signals. Additionally, metrics 13 and 14 provide the MS1 area and MS2 intensities of targeted peptide precursors, respectively, also extracted using DIA-NN. Unlike metric seven and metric nine, metrics 13 and 14 offer detailed insights into changes in targeted peptide precursors based on their varying physicochemical properties and signal characteristics.

Metrics 3 and 5 are extracted from mzXML files, while metric 15 is derived from MGF files from timsTOF Pro instruments. The remaining metrics are sourced from DIA-NN result files. Details on the sources of all metrics can be found in Supplementary Data 3.

Machine learning

The datasets acquired from 21 instruments were used to develop and validate the XGBoost model, including both the discovery and test datasets. Model building comprised three steps: data preprocessing, model training, and testing. During data preprocessing, features with high missing rates (>70% NA) were removed, and the remaining missing values were imputed with the minimum of the respective feature. Data preprocessing was performed using the ‘pandas’ library in Python. During model training, the discovery dataset was used to develop the model, and the test dataset was used for model evaluation. The importance of each peptide feature was determined by summing the importance of the corresponding F2, F13, and F14 metrics of the peptide precursors. The evaluation metric was set to the AUC. Receiver operating characteristic (ROC) curves and AUCs were generated with the Proc package to assess model performance.
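The preprocessing rule described above can be sketched with pandas as follows. This is a minimal illustration that assumes the metrics are held in a DataFrame with one row per file and one column per feature; the function name and the column names in the usage example are hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess_features(df, max_missing=0.70):
    """Drop features with more than `max_missing` NA, then impute with column minima."""
    # Fraction of missing values per feature (column).
    missing_rate = df.isna().mean()
    kept = df.loc[:, missing_rate <= max_missing]
    # fillna with a Series fills each column with its own minimum.
    return kept.fillna(kept.min())


# Usage with made-up values: "f_sparse" is 80% missing and is removed,
# while the single gap in "f_dense" is filled with that column's minimum.
example = pd.DataFrame({
    "f_sparse": [1.0] + [np.nan] * 8 + [3.0],
    "f_dense": [5.0, np.nan, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0, 3.0, 10.0],
})
cleaned = preprocess_features(example)
```

Minimum imputation treats a missing value as a signal at or below the lowest observed level, which suits intensity-like features where missingness often reflects low abundance.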

Development of iDIA-QC software

The complete algorithmic process of iDIA-QC consists of four steps, as depicted in Supplementary Fig. 8. Firstly, the .d/.raw/.wiff files were converted to mzXML format using the msConvert software. Subsequently, the DIA files were analyzed using DIA-NN (v1.7.15) to derive peptide and protein quantification results. Secondly, numerical metrics were extracted from each DIA raw file using the mzXML files and DIA analysis results obtained from DIA-NN. Thirdly, we fed the metric extraction results into a machine learning model, which assigns an evaluation result (qualified or unqualified) to each metric, generating a performance assessment for both LC and MS across all evaluated aspects.

Statistical analysis

The statistical analysis of the data was performed using R software (version 4.1.2), which included the use of heatmap, t-SNE, observed agreement, and R package plot functions. The proteins in the heatmaps were hierarchically clustered using the centroid method for each instrument. The CV was calculated by dividing the standard deviation by the mean.
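The CV definition above can be written as a short sketch. The analysis in the paper was done in R; the Python version below is only an illustration, and the use of the sample standard deviation (n − 1 denominator, which is what R's `sd()` computes) is an assumption about the authors' choice.

```python
import numpy as np


def coefficient_of_variation(values, ddof=1):
    """CV = standard deviation / mean.

    ddof=1 gives the sample standard deviation (assumed default);
    ddof=0 gives the population standard deviation.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    if mean == 0:
        # The CV is undefined when the mean is zero.
        return float("nan")
    return float(values.std(ddof=ddof) / mean)
```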

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

iDIA-QC is developed in Python and is freely available. The latest software version is available at https://github.com/guomics-lab/iDIA-QC. The mass spectrometry raw data and spectral libraries generated and analyzed during the current study have been deposited with the ProteomeXchange Consortium via the PRIDE⁵¹ partner repository. Specifically, the training set and retrospective dataset, which include 2638 files, have been deposited in PRIDE under dataset identifier PXD051878, while the prospective dataset, containing 116 files, is stored under identifier PXD055173. Source data are provided with this paper.

Code availability

Code relevant to data analysis in this study is available at https://github.com/guomics-lab/iDIA-QC.

References

Domon, B. & Aebersold, R. Options and considerations when selecting a quantitative proteomics strategy. Nat. Biotechnol. 28, 710–721 (2010).
Xiao, Q. et al. High-throughput proteomics and AI for cancer biomarker discovery. Adv. Drug Deliv. Rev. 176, 113844 (2021).
Aebersold, R. & Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 537, 347–355 (2016).
Mann, M. Comparative analysis to guide quality improvements in proteomics. Nat. Methods 6, 717–719 (2009).
Bittremieux, W. et al. Quality control in mass spectrometry-based proteomics. Mass Spectrom. Rev. 37, 697–711 (2018).
Bittremieux, W., Valkenborg, D., Martens, L., Laukens, K. Computational quality control tools for mass spectrometry proteomics. Proteomics 17 https://doi.org/10.1002/pmic.201600159 (2017).
Rudnick, P. A. et al. Performance Metrics for Liquid Chromatography-Tandem Mass Spectrometry Systems in Proteomics Analyses. Mol. Cell. Proteom. 9, 225–241 (2010).
Ma, Z.-Q. et al. QuaMeter: Multivendor Performance Metrics for LC MS/MS Proteomics Instrumentation. Anal. Chem. 84, 5845–5850 (2012).
Wang, X. et al. QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics. Anal. Chem. 86, 2497–2509 (2014).
Matzke, M. M. et al. Improved quality control processing of peptide-centric LC-MS proteomics data. Bioinformatics 27, 2866–2872 (2011).
Amidan, B. G. et al. Signatures for Mass Spectrometry Data Quality. J. Proteome Res. 13, 2215–2222 (2014).
Bittremieux, W., Meysman, P., Martens, L., Valkenborg, D. & Laukens, K. Unsupervised Quality Assessment of Mass Spectrometry Proteomics Experiments by Multivariate Quality Control Metrics. J. Proteome Res. 15, 1300–1307 (2016).
Bittremieux, W., Kelchtermans, P., Valkenborg, D., Martens, L. & Laukens, K. jqcML: An Open-Source Java API for Mass Spectrometry Quality Control Data in the qcML Format. J. Proteome Res. 13, 3484–3487 (2014).
Walzer, M. et al. qcML: An Exchange Format for Quality Control Metrics from Mass Spectrometry Experiments. Mol. Cell. Proteom. 13, 1905–1913 (2014).
Bereman, M. S. et al. Implementation of Statistical Process Control for Proteomic Experiments Via LC MS/MS. J. Am. Soc. Mass Spectrom. 25, 581–587 (2014).
Bereman, M. S. et al. An Automated Pipeline to Monitor System Performance in Liquid Chromatography Tandem Mass Spectrometry Proteomic Experiments. J. Proteome Res. 15, 4763–4769 (2016).
Gillet, L. C. et al. Targeted Data Extraction of the MS/MS Spectra Generated by Data-independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis Mol. Cell. Proteomics. 11 https://doi.org/10.1074/mcp.O111.016717 (2012).
Guo, T. et al. Rapid mass spectrometric conversion of tissue biopsy samples into permanent quantitative digital proteome maps. Nat. Med. 21, 407–413 (2015).
Zhu, Y., Aebersold, R., Mann, M. & Guo, T. SnapShot: Clinical proteomics. Cell 184, 4840–4840 e4841 (2021).
Niu, L. et al. Noninvasive proteomic biomarkers for alcohol-related liver disease. Nat. Med. 28, 1277–1287 (2022).
Cai, X. et al. Population serum proteomics uncovers a prognostic protein classifier for metabolic syndrome. Cell Rep. Med 4, 101172 (2023).
Sun, Y. et al. Artificial intelligence defines protein-based classification of thyroid nodules. Cell Discov. 8, 85 (2022).
Sun, R. et al. Proteomic Dynamics of Breast Cancer Cell Lines Identifies Potential Therapeutic Protein Targets. Mol. Cell Proteom. 22, 100602 (2023).
Guo, T. et al. Quantitative Proteome Landscape of the NCI-60 Cancer Cell Lines. iScience 21, 664–680 (2019).
Derks, J. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat. Biotechnol. 41, 50–59 (2023).
Rosenberger, F. A. et al. Spatial single-cell mass spectrometry defines zonation of the hepatocyte proteome. Nat. Methods 20, 1530–1536 (2023).
Gao, H. et al. Accelerated Lysis and Proteolytic Digestion of Biopsy-Level Fresh-Frozen and FFPE Tissue Samples Using Pressure Cycling Technology. J. Proteome Res. 19, 1982–1990 (2020).
Cai, X. et al. High-throughput proteomic sample preparation using pressure cycling technology. Nat. Protoc. 17, 2307–2325 (2022).
Bielow, C., Mastrobuoni, G. & Kempa, S. Proteomics Quality Control: Quality Control Software for MaxQuant Results. J. Proteome Res. 15, 777–787 (2016).
Xuan, Y. et al. Standardization and harmonization of distributed multi-center proteotype analysis supporting precision medicine studies. Nat. Commun. 11, 5248 (2020).
Lu, C. et al. Longitudinal Large-Scale Semiquantitative Proteomic Data Stability Across Multiple Instrument Platforms. J. Proteome Res. 20, 5203–5211 (2021).
Wang, N. & Li, L. Exploring the Precursor Ion Exclusion Feature of Liquid Chromatography Electrospray Ionization Quadrupole Time-of-Flight Mass Spectrometry for Improving Protein Identification in Shotgun Proteome Analysis. Anal. Chem. 80, 4696–4710 (2008).
Bache, N. et al. A Novel LC System Embeds Analytes in Pre-formed Gradients for Rapid, Ultra-robust Proteomics. Mol. Cell. Proteom. 17, 2284–2296 (2018).
Bittremieux, W. et al. The Human Proteome Organization-Proteomics Standards Initiative Quality Control Working Group: Making Quality Control More Accessible for Biological Mass Spectrometry. Anal. Chem. 89, 4474–4479 (2017).
Salminen, J. O., Al-Merekhi, H. A., Dey, P. & Jansen, B. J. Inter-Rater Agreement for Social Computing Studies. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), 80–87 https://doi.org/10.1109/SNAMS.2018.8554744 (2018).
Penrose, L. S. The Elementary Statistics of Majority Voting. J. R. Stat. Soc. 109, 53–57 (1946).
Landis, J. R. & Koch, G. G. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 159–174 (1977).
Hoofnagle, A. N. et al. Recommendations for the Generation, Quantification, Storage, and Handling of Peptides Used for Mass Spectrometry-Based Assays. Clin. Chem. 62, 48–69 (2016).
Gasteiger, E. et al. Protein Identification and Analysis Tools on the ExPASy Server. In The Proteomics Protocols Handbook, Walker, J. M. Ed.; Humana Press, (2005); 571-607.
Song, W., Wang, L., Liu, P. & Choo, K.-K. R. Improved t-SNE based manifold dimensional reduction for remote sensing data processing. Multimed. Tools Appl. 78, 4311–4326 (2019).
Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA; (2016).
Bereman, M. S. Tools for monitoring system suitability in LC MS/MS centric proteomic experiments. PROTEOMICS 15, 891–902 (2015).
Zhu, Y. et al. High-throughput proteomic analysis of FFPE tissue samples facilitates tumor stratification. Mol. Oncol. 13, 2305–2328 (2019).
Glatter, T. et al. Large-Scale Quantitative Assessment of Different In-Solution Protein Digestion Protocols Reveals Superior Cleavage Efficiency of Tandem Lys-C/Trypsin Proteolysis over Trypsin Digestion. J. Proteome Res. 11, 5145–5156 (2012).
Krasny, L. et al. A mouse SWATH-mass spectrometry reference spectral library enables deconvolution of species-specific proteomic alterations in human tumour xenografts. Dis. Models Mechanisms 13, dmm044586 (2020).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry based proteomics. Nat. Methods 14, 513–520 (2017).
da Veiga Leprevost, F. et al. Philosopher: a versatile toolkit for shotgun proteomics data analysis. Nat. Methods 17, 869–870 (2020).
Demichev, V. et al. dia-PASEF data analysis using FragPipe and DIA-NN for deep proteomics of low sample amounts. Nat. Commun. 13, 3944 (2022).
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods 17, 41–44 (2020).
Blu, T., Thavenaz, P. & Unser, M. Linear interpolation revitalized. IEEE Trans. Image Process. 13, 710–719 (2004).
Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, 11033–11033 (2016).

Acknowledgements

This work is supported by grants from the National Key R&D Program of China (Grant No. 2022YFF0608403, 2021YFA1301600, 2022YFF0608404, and 2020YFE0202200), the National Natural Science Foundation of China (82073529, 81903316), the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (2024SSYS0035, 2023C03056), and the Westlake Educational Foundation. Fig. 1 and Supplementary Fig. 8 are created with Biorender.com. We thank the Westlake University Supercomputer Center for their assistance in data analysis and storage. This study is also partly supported by the π-CLIP (π-HuB Clinical Proteomics) workgroup in the π-HuB consortium.

Author information

These authors contributed equally: Huanhuan Gao, Yi Zhu, Dongxue Wang, Zongxiang Nie, He Wang.

Authors and Affiliations

Affiliated Hangzhou First People’s Hospital, State Key Laboratory of Medical Proteomics, School of Medicine, Westlake University, Hangzhou, Zhejiang Province, China
Huanhuan Gao, Yi Zhu, He Wang, Yuting Xie, Yingying Sun, Wenhao Jiang, Zhen Dong, Liqin Qian & Tiannan Guo
Westlake Center for Intelligent Proteomics, Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang province, China
Huanhuan Gao, Yi Zhu, He Wang, Yuting Xie, Yingying Sun, Wenhao Jiang, Zhen Dong, Liqin Qian & Tiannan Guo
Research Center for Industries of the Future, School of Life Sciences, Westlake University, Hangzhou, Zhejiang province, China
Huanhuan Gao, Yi Zhu, He Wang, Yuting Xie, Yingying Sun, Wenhao Jiang, Zhen Dong, Liqin Qian & Tiannan Guo
State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
Dongxue Wang, Guibin Wang & Fuchu He
International Academy of Phronesis Medicine, Guangzhou, Guangdong, China
Dongxue Wang & Fuchu He
Westlake Omics (Hangzhou) Biotechnology Co., Ltd., Hangzhou, China
Zongxiang Nie
State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Zhejiang Academy of Agricultural Sciences, Hangzhou, China
Shuang Liang
State Key Laboratory of Respiratory Disease, Sino-French Hoffmann Institute, School of Basic Medical Science, Guangzhou Medical University, Guangzhou, China
Xufei Wang & Mengdi Liang
Luming Biotechnology Co., Ltd, Shanghai, China
Min Chen & Houqi Fang
Shanghai Applied Protein Technology Co., Ltd, Shanghai, China
Qiufang Zeng & Jiao Tian
State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
Zeyu Sun
Institute of Infection and Immunity, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
Juan Xue & Shan Li
College of Biomedicine and Health, Huazhong Agricultural University, Wuhan, Hubei, China
Juan Xue & Shan Li
SCIEX, Shanghai, China
Chen Chen, Xiang Liu, Xiaolei Lyu & Fengchun Kong
Thermo Fisher Scientific, Shanghai, China
Zhenchang Guo, Yingzi Qi & Tingde Tong
Bruker Daltonics, Shanghai, China
Ruoyu Wu, Xiaoxian Du, Liming Han & Minghui Wang
Technology Innovation Center of Mass Spectrometry for State Market Regulation, Center for Advanced Measurement Science, National Institute of Metrology, Beijing, 100029, China
Yang Zhao & Xinhua Dai

Contributions

T.G., Y.Z., and H.G. designed the project. H.W. completed parts of data analysis. H.G., D.W., G.W., S.L., Y.X., W.J., Z.D., X.W., M.L., M.C., H.F., Q.Z., J.T., Z.S., J.X., S.L., C.C., X.Liu, X.Lyu, Z.G., Y.Q., R.W., X.D., T.T., F.K., L.H., and M.W. generated or annotated the raw files using various mass spectrometers. H.G. and Z.N. designed the iDIA-QC.exe offline software. H.G. and T.G. wrote the manuscript. Y.Z., D.W., L.Q., Y.S., Y.Z., and X.D. modified the manuscript. T.G. and F.H. supervised this project.

Corresponding authors

Correspondence to Yi Zhu, Fuchu He or Tiannan Guo.

Ethics declarations

Competing interests

T.G. and Y. Zhu are shareholders of Westlake Omics Biotechnology Co., Ltd. Three patents related to iDIA-QC technologies have been filed. Two have been granted, with the numbers CN 114858958 B and CN 116106464 B, while the third is currently pending, with the application number CN 202210783026.2. Y.L., Z.N. and Y.L. are employees of Westlake Omics Inc. H.F. and M.C. are employees of Shanghai Luming Biological Technology Inc. Q.F. and J.T. are employees of Shanghai Applied Protein Technology Co., Ltd. C.C., X.L., X.L. and F.K. were employees of SCIEX China during this project. Z.G., Y.Q. and T.T. are employees of Thermo Fisher Scientific China, while R.W., X.D., L.M. and M.W. are employees of Bruker Daltonics China. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Liang Qiao and David Tabb for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Reporting Summary

Description of additional supplementary files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Transparent Peer Review file

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gao, H., Zhu, Y., Wang, D. et al. iDIA-QC: AI-empowered data-independent acquisition mass spectrometry-based quality control. Nat Commun 16, 892 (2025). https://doi.org/10.1038/s41467-024-54871-1

Received: 31 May 2024
Accepted: 22 November 2024
Published: 21 January 2025
DOI: https://doi.org/10.1038/s41467-024-54871-1