Schneider et al. (2025): A Scalable Web-Based Platform for Proteomics Data Processing, Result Storage, and Analysis
ABSTRACT: The exponential increase in proteomics data presents critical challenges for
conventional processing workflows. These pipelines often consist of fragmented software
packages, glued together using complex in-house scripts or error-prone manual workflows
running on local hardware, which are costly to maintain and scale. The MSAID Platform
offers a fully automated, managed proteomics data pipeline, consolidating formerly
disjointed functions into unified, API-driven services that cover the entire process from raw
data to biological insights. Backed by the cloud-native search algorithm CHIMERYS, as well
as scalable cloud compute instances and data lakes, the platform facilitates efficient
processing of large data sets, automation of processing via the command line, systematic result storage, analysis, and visualization.
The data lake supports elastically growing storage and unified query capabilities, facilitating large-scale analyses and efficient reuse of
previously processed data, such as aggregating longitudinally acquired studies. Users interact with the platform via a web interface,
CLI client, or API, providing flexible, automated access. Readily available tools for accessing result data include browser-based
interrogation and one-click visualizations for statistical analysis. The platform streamlines research processes, making advanced and
automated proteomic workflows accessible to a broader range of scientists. The MSAID Platform is globally available via https://platform.msaid.io.
KEYWORDS: proteomics, platform, pipeline, CHIMERYS, compute infrastructure, data processing, cloud, AWS, scalable, SaaS
■ INTRODUCTION
Proteomics is an indispensable technology for the comprehensive identification and quantification of proteins, which are pivotal for understanding cellular functions and disease mechanisms. Over the past decade, there have been substantial advancements in the key components of proteomic workflows, including sample preparation techniques, liquid chromatography (LC), and mass spectrometry (MS) instrumentation.1,2 These improvements, particularly the advent of fast-scanning mass spectrometers, have significantly enhanced the sensitivity, comprehensiveness, and throughput of proteomic analyses.3 Consequently, researchers can now conduct large-scale proteomic studies that generate an unprecedented volume of raw data. However, this surge in data production presents substantial challenges for data processing pipelines generating protein identifications and associated quantitative information. The growing size of raw and result data and sheer number of mass spectrometry measurements that can be performed in a short period of time regularly exceed the capabilities of conventional on-premises compute infrastructure, particularly with respect to the demanded processing power and storage space.4,5

In parallel, recent years have seen a fast-paced development and improvement of software for proteomics data processing, fueled by the introduction of deep learning-based prediction of peptide properties.6−9 Academic software, such as MSFragger10 and rescoring concepts like Prosit,6 MSBooster,11 MS2Rescore,12 EncyclopeDIA,13 DeepDIA,14 AlphaDIA,15 DIA-NN,16 and commercial products like INFERYS,17 Spectronaut18 and CHIMERYS,19 have pushed the boundaries of data extraction, enabling deeper insights into complex proteomics data sets by leveraging fragment ion intensities. Together, the combination of instrumentation and more sensitive algorithms allows researchers to generate protein profiles to an unprecedented depth and throughput.
Figure 1. The MSAID Platform for proteomics comprises a cloud-native, microservices-based architecture, orchestrated by Kubernetes. It is hosted
on Amazon Web Services (AWS), utilizing Elastic Kubernetes Service (EKS). The platform supports multiple interfaces, including a web interface,
a command-line interface (CLI), and application programming interface (API) access for seamless user interaction. Uploaded raw and fasta files
are stored on AWS Simple Storage Service (S3). A relational database (RDS) manages the data lake and meta-attributes for files and processing
jobs. Scalable CHIMERYS workflows for data-dependent acquisition (DDA), data-independent acquisition (DIA) or parallel reaction monitoring
(PRM) data processing can be executed on AWS Elastic Compute Cloud (EC2) instances. The platform offers the option to continuously acquire
and search raw files, while raw-file-overarching postprocessing such as protein grouping is performed later without re-searching the data. Result data are
systematically stored as parquet files and can be interactively explored and visualized in the browser or downloaded via browser, CLI, or API for
further exploration.
However, in contrast to today's streamlined sample workflows in the wet lab leading up to the mass spectrometer, the data flow from the acquired raw data to the extraction of biological insights remains fragmented and often inefficient. Laboratories are confronted with a range of computational challenges as they attempt to process and analyze their data: frequently encountered manual workflows are not only time-consuming but also prone to errors and inconsistencies, particularly when repetitive tasks are involved. A data pipeline might involve the manual transfer of raw files from the acquisition computer to a storage medium, which may be a local personal computer, a laptop, or, in some cases, a network-attached storage (NAS) system, rarely a cloud-based service. Subsequent processing of the data with a proteomic search engine involves the use of local consumer hardware such as personal computers or, in some cases, high-performance servers. This reliance on local hardware inherently limits the scalability of proteomic studies, as the computational demands of large-scale analyses often exceed the capabilities of on-premises infrastructure or the scalability of the software package itself. Once processed, the results are usually manually moved to user- or project-specific directories and shuffled between different storages to avoid disk exhaustion, causing confusion and parallel systems of data organization, limiting accessibility to other researchers or data mining. The level of subsequent data interrogation and interpretation varies drastically depending on the researcher's skill set, with approaches ranging from basic spreadsheet analyses to more sophisticated bioinformatics tools (e.g., Perseus20) or scripting languages. Dedicated statistics and visualization suites like MSstats21 and Mass Dynamics22 can aid non-bioinformaticians in drilling down on their biological question, but also present standalone solutions, adding to a fragmented tooling landscape. The described patchwork of disconnected local infrastructures and applications creates highly redundant work streams, renders it difficult to automate processes, risks loss of data integrity, and hinders the generation of reproducible results. While custom scripts and pipelines might attenuate the manual labor in the process, these are often hastily developed, lack robustness, and are costly to maintain long-term. Bespoke in-house solutions frequently do not scale well with the size of projects, growing infrastructure, or team size, due to the effort of coordinating limited resources in an exponentially growing data environment.

To streamline the proteomic data workflow, we introduce the MSAID Platform, a comprehensive, managed, and cloud-based one-stop shop for proteomics. It facilitates data handling, storage, and analysis, allowing researchers to focus on scientific questions of interest. By leveraging the scalability and flexibility of cloud computing, this platform eliminates the limitations of local hardware, enabling researchers to run experiments at any time, without having to worry about resource limitations, and to process vast data sets automatically, efficiently, and reproducibly. Through its application programming interface (API)-based design, advanced users retain the ability to tailor workflows to their needs, facilitating seamless integration with existing or new tools and providing a "best of both worlds" approach if desired.

■ MATERIALS AND METHODS

The MSAID Platform is designed as a cloud-native solution, employing a microservices architecture orchestrated by Kubernetes to ensure both scalability and flexibility across various computational tasks (Figure 1). In its current inception, it is hosted on Amazon Web Services (AWS) but is compartmentalized for future deployment into other cloud service providers or a local server solution. Platform services and compute resources are deployed using an AWS Elastic Kubernetes Service (EKS) cluster, with automated infrastructure management facilitated by Terraform and Helm. User management, including authentication and authorization, is handled through AWS Cognito, incorporating multifactor authentication to ensure secure access and compliance with data protection protocols. Centralized control of the platform's operations is achieved through an API server, which governs all aspects of data handling, processing, and user interactions.
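As an illustration of the authentication layer described above, the snippet below obtains tokens from an AWS Cognito user pool with boto3. It is a minimal sketch under assumed placeholder values: the region, app client ID, and credentials are not the platform's actual configuration, and the MSAID web interface and CLI perform this flow (including the multifactor challenge) on the user's behalf.

```python
import boto3

# Placeholders for illustration only; the real user pool, app client ID, and region
# used by the MSAID Platform are managed by its own CLI and web login.
REGION = "eu-central-1"
APP_CLIENT_ID = "EXAMPLE_APP_CLIENT_ID"

cognito = boto3.client("cognito-idp", region_name=REGION)

# Username/password flow against a Cognito user pool (USER_PASSWORD_AUTH must be
# enabled on the app client for this flow to work).
response = cognito.initiate_auth(
    ClientId=APP_CLIENT_ID,
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "user@example.org", "PASSWORD": "********"},
)

# With multifactor authentication enforced, Cognito responds with a challenge
# (e.g., SOFTWARE_TOKEN_MFA) that must be answered via respond_to_auth_challenge().
if "ChallengeName" in response:
    raise SystemExit(f"MFA challenge required: {response['ChallengeName']}")

tokens = response["AuthenticationResult"]
print("Received ID and access tokens; expires in", tokens["ExpiresIn"], "s")
```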
Figure 2. (A) The welcome screen of the platform presents key statistics of the user account, including the number of running searches, quick links
to the latest triggered experiments and the available processing quota. (B) Speed comparison of the browser-based (Firefox v130.0) and CLI-based
10 GB raw data upload into the AWS S3 data lake using a Windows Server 2022 server connected with a 1 Gbit/s uplink. Theoretical limit is
determined as the maximum achievable throughput of a 1 Gbit/s uplink. (C) Data management and organization are facilitated by adding free text
tags. Both tags and auto-generated metadata can be used in no-code queries for data retrieval. Images reproduced with permission from MSAID.
Figure 3. (A) The experimental design acts as an additional layer of metadata annotation. Raw files can be annotated as samples for later
visualization. In the case of tandem mass tag (TMT)-labeled samples, the individual channels can be annotated. (B) Runtime comparison of a 1 h Q
Exactive HF-X HeLa file run individually and as copies of the same file run in parallel. Identification, error control, and quantification were performed
across all files. (C) Identification numbers for an offline-fractionated DIA data set acquired with an Orbitrap Astral. Peptide-spectrum matches
(PSMs) are represented at 1% file-local false discovery rate (FDR), all other levels at 1% data set-global FDR. Raw data reprocessed from Serrano et
al.24 (PRIDE data set identifier PXD049028). (D) Comparison of 2 HeLa samples searched together or combined later in postprocessing
demonstrates the result identity of longitudinally processed and later aggregated data. Displayed are PSMs at 1% PSM FDR (Venn diagram),
unique precursors at 1% precursor FDR (bar chart), a scatterplot of mokapot SVM score of all precursors irrespective of FDR with a Pearson
correlation of 1.00 and the delta in precursor quantitation at 1% precursor FDR, all indicating result identity.
folder using freely configurable regular expressions to include and exclude expressions, such as "HeLa" or "QC". Upon completion of raw data acquisition, files matching these expressions are automatically uploaded. This feature is particularly advantageous for longitudinal studies or quality control applications, where data files are generated repeatedly or over a longer period. The CLI client has been optimized to achieve upload speeds of >100 MB/s on a 1 Gbit/s uplink, rendering the upload of even large studies feasible in just a few hours (Figure 2B).

Data security and compliance are central to the platform's design; all hosting is performed on AWS, a provider certified under ISO norms (International Organization for Standardization) and the CSA STAR (Security, Trust, Assurance, and Risk) program of the CSA Group. All data are encrypted in transit and at rest and securely stored on S3, benefiting from its inherent redundancy and recovery features. Stringent access control via Access Control Lists (ACLs) ensures that each user's data is isolated from others, aligning with state-of-the-art security practices and compliance requirements, including the EU General Data Protection Regulation (GDPR).

During and after upload, the platform provides data tagging and metadata management capabilities. Users can tag data with free-text labels, facilitating fully customizable organization and retrieval. This tagging system integrates seamlessly with the platform's table-based data management feature, allowing users to organize their raw or fasta files and construct powerful no-code filtering queries based on various data attributes within the browser (Figure 2C). Uploaded fasta protein databases can be associated with parse rules to ensure proper extraction of protein names, gene names, and organisms to cater for the various sources of fasta files.

Processing of proteomic data is facilitated through an intuitive, multistep wizard that guides users in setting up experiments. This wizard assists browsing, filtering, and selecting input files, making it straightforward for users to initiate their analyses. It also allows recording the experimental design of a study for record keeping and to facilitate later statistical testing and visualization (Figure 3A).

The platform's design is search engine-agnostic, enabling integration with any search engine that can operate within a Docker container. Currently, the platform is powered by CHIMERYS 4,19 with plans to incorporate additional search engines in the future. CHIMERYS is capable of handling data-dependent acquisition (DDA), data-independent acquisition (DIA), and parallel reaction monitoring (PRM) experiments. It operates in a fully spectrum-centric manner, features the deconvolution of chimeric spectra, and incorporates the INFERYS 4 deep-learning model, which provides retention time and fragment ion intensity predictions for the most common post-translational modifications (PTMs) like phosphorylation, acetylation, ubiquitination, cysteine modifications, oxidation, tandem mass tags (TMT), and isotopically labeled amino acids. An in-depth characterization of the CHIMERYS algorithm is available in a separate manuscript.19 The processing pipeline also includes comprehensive postprocessing features, such as MS1- and MS2-based quantification via deconvolution and TMT reporter ion-based quantification. During postprocessing, Mokapot26 and Picked Protein Group FDR25 are employed for rigorous error control.
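To make the automated watch-folder upload described earlier in this section concrete, the sketch below shows the kind of logic such a client could implement: regular-expression include/exclude filters, a simple check that acquisition has finished, and parallel multipart uploads to S3. It is illustrative only; the bucket, prefix, and patterns are placeholders, and the actual MSAID CLI uploads into the platform's managed data lake with its own credentials and transfer tuning.

```python
import re
import time
from pathlib import Path

import boto3
from boto3.s3.transfer import TransferConfig

WATCH_DIR = Path("D:/acquisition")                # folder written by the instrument PC
INCLUDE = re.compile(r"HeLa|QC", re.IGNORECASE)   # free-text include expressions
EXCLUDE = re.compile(r"blank", re.IGNORECASE)     # free-text exclude expressions
BUCKET, PREFIX = "example-raw-data-bucket", "uploads/"  # placeholder destination

# Parallel multipart uploads are what make >100 MB/s on a 1 Gbit/s uplink feasible.
transfer = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=8)
s3 = boto3.client("s3")
uploaded = set()


def acquisition_finished(path: Path, wait_s: float = 5.0) -> bool:
    """Treat a raw file as complete once its size stops changing."""
    size = path.stat().st_size
    time.sleep(wait_s)
    return path.stat().st_size == size


while True:
    for raw in sorted(WATCH_DIR.glob("*.raw")):
        name = raw.name
        if raw in uploaded or not INCLUDE.search(name) or EXCLUDE.search(name):
            continue
        if not acquisition_finished(raw):
            continue
        s3.upload_file(str(raw), BUCKET, PREFIX + name, Config=transfer)
        uploaded.add(raw)
    time.sleep(30)  # poll the acquisition folder twice per minute
```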
Figure 4. (A) Exploration of results directly in the browser, including nested associations of all contributing data levels (PSMs, precursors, modified
peptides, protein groups). (B) Volcano plot on protein group level created within the platform contrasting a CRISPR-Cas9 mitochondrial MGME1
gene knockout (KO) in human HAP1 cells with wildtype (WT) HAP1 cells. Raw data reprocessed from Serrano et al.24 (PRIDE data set identifier
PXD049028). A two-sided t test was performed for all proteins with complete observation on n = 3 replicate single shots for WT and KO.
Benjamini−Hochberg was used to calculate false discovery rate (FDR). CHIMERYS processing yields 639 significantly (q-value ≤ 0.05) regulated
proteins with an absolute fold-change of ≥2.
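The statistical analysis summarized in the Figure 4B caption (a two-sided t test on proteins with complete observations, Benjamini-Hochberg correction, and q-value and fold-change thresholds) can also be reproduced offline on an exported protein-group table. The sketch below assumes a hypothetical TSV layout with three wild-type and three knockout intensity columns; it illustrates the procedure rather than the platform's internal implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical export: one row per protein group, intensity columns wt_1..wt_3, ko_1..ko_3.
df = pd.read_csv("protein_groups.tsv", sep="\t")
wt = np.log2(df[["wt_1", "wt_2", "wt_3"]].to_numpy(dtype=float))
ko = np.log2(df[["ko_1", "ko_2", "ko_3"]].to_numpy(dtype=float))

# Keep protein groups with complete observations in both conditions, as in Figure 4B.
complete = np.isfinite(wt).all(axis=1) & np.isfinite(ko).all(axis=1)
wt, ko, df = wt[complete], ko[complete], df.loc[complete].copy()

# Two-sided t test per protein group, followed by Benjamini-Hochberg correction.
_, pvals = ttest_ind(ko, wt, axis=1)
_, qvals, _, _ = multipletests(pvals, method="fdr_bh")

df["log2_fc"] = ko.mean(axis=1) - wt.mean(axis=1)
df["q_value"] = qvals

# q <= 0.05 and absolute fold change >= 2 (i.e., |log2 fc| >= 1), as in the caption.
regulated = df[(df["q_value"] <= 0.05) & (df["log2_fc"].abs() >= 1)]
print(f"{len(regulated)} significantly regulated protein groups")
```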
Processing templates for experiments can be saved by the user to facilitate setting up standardized experiments. The settings can also be exported to directly submit jobs using the CLI client instead of interacting with the graphical user interface (GUI). At the time of writing, CHIMERYS, and hence the platform, is compatible with all Thermo Scientific mass spectrometers. Compatibility with other vendors and open formats like mzML is expected within the year 2025.

The cloud-native setup allows for the deployment of several hundred compute pods, backed by hundreds of central processing unit (CPU) cores and graphics processing unit (GPU) instances, ensuring that processing remains efficient and fast regardless of the data volume or parallel usage of the platform. Raw files are processed in parallel, with subsequent combination of the results of all searches during postprocessing to optimize the overall analysis runtime. Elastic scaling of the platform is achieved using an autoscaler, which analyzes submitted workloads and dynamically acquires or releases computation resources. This strategy keeps the cluster size appropriate to the scheduled compute tasks and ensures efficient processing from low-activity times to load spikes. Currently, the cluster can simultaneously spawn >1,000 compute instances, and we are working to expand this capacity by an additional order of magnitude and to expand to more than a single datacenter/availability zone to service users across the globe. Performance benchmarks demonstrate the scalability of the platform. Processing a single 1 GB HeLa file including MS1 quantification takes 36 min, while processing 100 files concurrently extends the total time to 108 min (Figure 3B), resulting in 56x faster processing than acquisition time. A published, fractionated Orbitrap Astral DIA data set24 comprising 103 GB in size was processed 2.7x faster than acquisition time (198 min processing, 552 min acquisition time), further highlighting the platform's capability to handle large data sets, even if they are not automatically uploaded and streamed (data not shown). The analysis resulted in 5,859,533 peptide-spectrum matches (PSMs) at 1% file-local FDR, 342,146 precursors, 236,882 peptides and 11,048 protein groups (at 1% data set-global FDR), underlining the exceptional depth of proteomic profiling that can be achieved nowadays from a single biological sample (Figure 3C).

The platform also supports the efficient reuse of existing data via combination of previously generated experiments, benefiting quality control (QC) applications and longitudinal data collection and analysis. Users can process each raw file as it becomes available, also through a fully automated workflow via the CLI client. Once a study is completed, experiments processed with compatible settings can be easily combined through a simple wizard in the browser or the CLI. This combination triggers a rerun of the computationally inexpensive postprocessing steps only, including quantification, FDR roll-up, and picked-protein grouping, allowing users to benefit from the thorough analysis of individual files while also obtaining comprehensive results from the entire study without the need for a full search engine run. Due to the deterministic and reproducible results of the processing step, no difference in results (Pearson correlation of R = 1.00) is observed whether data is processed together or combined later (Figure 3D).

During the execution of experiments, users can monitor their progress in real time via the browser. Once an experiment concludes, the platform provides an overview of identified PSMs, peptides, and protein groups, offering immediate insight into the results.

To allow users to engage with their results, the platform provides a range of interactive tools. The processed data is systematically stored in a data lake, enabling complex queries across potentially thousands of files. A Trino/DuckDB data lake query layer allows users to retrieve or analyze data directly in the browser (Figure 4A). Tab-separated values (TSV) files can be exported and downloaded, providing users complete control over their results for offline storage and processing if desired. File downloads can be fine-tuned with options to apply FDR filtering, formatting, and level selection (PSMs, precursors, modified peptides, peptides, and protein groups). The platform output is evolving to conform to existing standards (e.g., SDRF23) and will soon offer integration with frequently used tools such as Skyline. Additionally, the CLI allows users to download the results of submitted jobs directly, enabling them to upload, process, and download results within their pipelines without requiring any interaction with the GUI.

As an alternative to downloading data, users can explore their results online through a data browser that provides an intuitive tabular overview of the full result set, including advanced filtering and search functionalities backed by Trino's distributed query engine. This data browser allows users to gain valuable insights into their data before committing to a large download, for example, to quickly determine if proteins of interest have been detected. Nested tables link all evidence levels, facilitating detailed examination of the data, such as the quality of all detected PSMs associated with a specific protein.
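Downloaded result files can also be interrogated locally with the same engine that backs the in-browser queries. The sketch below uses DuckDB to aggregate precursor evidence from a set of exported parquet files; the file layout and column names (q_value, protein_group, intensity) are assumptions for illustration and may differ from the platform's actual export schema.

```python
import duckdb

con = duckdb.connect()  # an in-memory database is sufficient for ad hoc queries

# Aggregate FDR-filtered precursor evidence per protein group across many result files.
top_groups = con.execute(
    """
    SELECT protein_group,
           count(*)       AS n_precursors,
           sum(intensity) AS summed_intensity
    FROM read_parquet('results/*.parquet')
    WHERE q_value <= 0.01
    GROUP BY protein_group
    ORDER BY summed_intensity DESC
    LIMIT 20
    """
).fetchdf()

print(top_groups)
```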
In addition to allowing access to fully searchable online results, the data lake structure enables scientists to perform statistical testing and visualization directly in the browser. Each experiment includes an interactive, modifiable, and restorable visualization dashboard (Figure 4B). This dashboard offers simple, one-click creation of a variety of customizable plots, such as bar plots for identification numbers and scatter plots visualizing the correlation between files, as well as common tools like UpSet plots, principal component analysis (PCA), and differential expression analysis with volcano plot visualizations. Both the data points underlying the plot (TSV files) and the plots themselves (vector graphics or Portable Network Graphics [PNGs]) can be downloaded. The plotting capabilities of the platform will expand continuously, aiming to eliminate the need for external analysis tools like R or Python for straightforward data exploration by integrating more functionalities over time.

Overall, we have introduced the first publicly accessible all-in-one Software as a Service (SaaS) platform for proteomics. Our goal was to create an easy-to-use solution for managing and processing proteomic data that can handle swiftly growing volumes of data, while relieving users from the need to buy and manage large compute and storage systems to keep up with the speed of data acquisition. The cloud-native design ensures scalable data upload, management, processing, and result deposition, with features for systematic result exploration and advanced online data interaction directly in the browser. We believe this platform provides a strong foundation, marking the beginning of moving proteomic data processing to the cloud. It empowers researchers by decoupling scientific tasks from the underlying compute, allowing them to focus on solving problems instead of spending time managing infrastructure.

■ DISCUSSION

The MSAID Platform represents a pioneering effort in the field of proteomics, offering a managed proteomic pipeline and storage solution with an intuitive browser-based interface that eliminates the need for individual laboratories to manage their own infrastructure. This approach significantly lowers entry barriers, particularly for smaller laboratories that may lack the resources to establish and maintain complex data processing pipelines. This contrasts our solution with pipelines like quantms,4 which require a self-managed compute environment and only provide a command line interface. By automating the data pipelines from raw data to conclusions, the platform streamlines research processes, making advanced proteomic workflows accessible to a broader range of scientists.

Implementing a cloud-native proteomic workflow addresses a critical need for scalable analyses that keep pace with the rapid growth of data volume fueled by recent developments of faster and more sensitive instruments. A single mass spectrometer running at full efficiency can generate more than a terabyte of raw data within a week, presenting substantial storage and resource challenges that quickly exceed the capacity of local compute clusters, which are not easily scalable. Even batch-processing cloud models struggle with scalability, prolonged transfer times, and a lack of integrated storage.

In contrast, the platform offers a private proteomic data lake, enabling users to store and analyze large data sets without hardware constraints. Integrated online workflows eliminate the need for repeated data uploads and downloads, allowing efficient data reuse. To further streamline the workflow, the platform includes tools for automated raw data uploads and result downloads, simplifying the analysis process for researchers. Future developments will leverage the data lake to provide advanced features, such as generating insights from previous experiments, creating downstream analyses, and producing aggregated data views and additional visualizations. Programming libraries for R and Python will offer direct interaction with the results, enabling custom analysis. Additionally, the API will facilitate programmatic access to both experiment-specific and cross-experimental data, ensuring flexibility and integration into diverse research workflows.

The cloud-based nature of the platform may raise concerns regarding security and associated costs. To address these concerns, the platform follows state-of-the-art data handling, including encryption and strict ACLs. Further reinforcing the commitment to security, we are pursuing ISO 27001 certification, which will make it easier for companies and researchers operating in regulated environments to adopt the platform. To provide scientists with the opportunity of exploring the platform, a generous free processing package is available.

Currently, the SaaS solution is fully managed by us, but we are aware of the demand for additional compliance and access management through alternative deployment options. In response, we plan to offer Virtual Private Cloud (VPC) deployments into user-owned cloud accounts, in turn providing enhanced compliance, access control, and data sovereignty. Initially, this will be available for AWS, with future expansion to other cloud providers. While the platform currently relies on AWS services, the core components of the platform are cloud-native technologies not specific to AWS, enabling adaptation to other Kubernetes environments in the future. For example, the S3 data storage can be replaced with any S3-compatible object storage solution like Google Cloud Storage, Azure Blob Storage, or MinIO with reasonable effort. For organizations with existing high-performance computing (HPC) infrastructure or those preferring on-premises solutions, we are also developing a local server deployment option. This approach offers key advantages, including complete data control, offline access, and tailored cost management, and it provides a highly viable solution for laboratories operating in sensitive environments. In addition, public funding opportunities often favor one-time hardware and software purchases and have yet to fully adapt to supporting recurring compute costs, even though modern software packages, including essential tools like office suites, are transitioning to SaaS.

SaaS allows for continuous feature delivery and improvement and makes it possible to quickly patch critical software exploits. To ensure reproducibility, a two-tiered deprecation strategy is followed: updates to CHIMERYS that change results are released as new minor versions (e.g., 4.1 → 4.2) and remain available for at least one year, whereas critical security patches may replace earlier versions within the same minor release without affecting results (e.g., 4.1.0 → 4.1.1). This approach balances software security with reproducibility of prior data. While the platform is not intended as a permanent data storage solution at this stage, we plan to introduce data archiving options at a fraction of the cost of S3 storage, reducing the need for local backups. We will also focus on simplifying data integration, including importing data from public proteomic repositories to facilitate a neglected data workflow in the proteomic community: the reuse and reanalysis of the wealth of publicly available and previously analyzed data.
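The two-tiered deprecation policy above has a simple operational consequence for pipelines that combine longitudinally processed experiments: results are expected to be identical whenever two CHIMERYS versions share the same major.minor release line. The helper below is an illustrative encoding of that rule, not part of the platform.

```python
def minor_line(version: str) -> tuple:
    """Return the (major, minor) release line of a version string such as '4.1.2'."""
    major, minor, *_ = (int(part) for part in version.split("."))
    return major, minor


def results_reproducible(v1: str, v2: str) -> bool:
    """True if two CHIMERYS versions are expected to produce identical results."""
    return minor_line(v1) == minor_line(v2)


assert results_reproducible("4.1.0", "4.1.1")      # patch release: results unchanged
assert not results_reproducible("4.1.3", "4.2.0")  # minor release: results may change
```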
In addition, we plan to streamline publishing results obtained on the platform, by, e.g., directly uploading data, metadata, and results to repositories like PRIDE or allowing a public "view-only" option for obtained results.

Looking ahead, we aim to enhance the platform's visualization capabilities, driven by user feedback. This includes developing intuitive QC reports and plots and offering diverse views into the underlying MS data (e.g., visualization of quantification traces or a potential Skyline integration), emphasizing our viewpoint that visual inspection of raw data remains crucial and should be taught. While the platform does not yet fully replace offline data analysis, ongoing development aims to close this gap. Our vision is for biologists to focus on interpreting results, such as understanding the volcano plot on a molecular level, rather than worrying about how to juggle terabytes of experiment data in the process of generating it.

Looking further ahead, we consider the platform the foundation for large-scale proteomic result generation and exploitation. After addressing the data processing challenges, our focus will shift to advanced data interrogation. This will involve linking with external resources (like PhosphoSitePlus,27 UniProt28,29 and ProteomicsDB30,31), better utilizing insights already available from other tools, and culminating in integrating large language models (LLMs) for conversational data analysis.32 Ultimately, we aim to help researchers generate hypotheses to follow up on the ever-growing volume of data, made possible by an integrated workflow from raw files to results and the systematic storage provided by the platform.

In summary, we are advancing into the cloud age of proteomics and project that the MSAID Platform will become an essential tool with a low entry barrier for researchers. Our long-term vision is making proteomic research more accessible and efficient for the expert and non-expert proteomic community.

■ ASSOCIATED CONTENT

Data Availability Statement
The platform can be tested free of charge after registering at https://platform.msaid.io.

Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.4c00871. Structure of the MSAID Platform API (Figure S1) (PDF).

■ AUTHOR INFORMATION

Corresponding Author

Authors
Agnes Guevende − MSAID GmbH, Garching b. München 85748, Germany
Alexander Hogrebe − MSAID GmbH, Berlin 13347, Germany; orcid.org/0000-0002-0203-6803
Michelle T. Berger − MSAID GmbH, Garching b. München 85748, Germany
Michael Graber − MSAID GmbH, Garching b. München 85748, Germany
Vishal Sukumar − MSAID GmbH, Garching b. München 85748, Germany
Lizi Mamisashvili − MSAID GmbH, Garching b. München 85748, Germany
Igor Bronsthein − MSAID GmbH, Berlin 13347, Germany
Layla Eljagh − MSAID GmbH, Garching b. München 85748, Germany
Siegfried Gessulat − MSAID GmbH, Berlin 13347, Germany
Florian Seefried − MSAID GmbH, Garching b. München 85748, Germany
Tobias Schmidt − MSAID GmbH, Garching b. München 85748, Germany

Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jproteome.4c00871

Author Contributions
M.S. and D.P.Z. contributed equally. M.S., D.P.Z., and M.F. conceived the study. M.S. conceptualized the cloud-backend infrastructure. D.P.Z., M.F., M.S., and P.S. conceptualized the browser front end. M.S., P.S., S.B.F., D.B., A.H., and T.S. implemented the backend and the CLI, and mediated the front-end interaction. P.S., M.S., M.G., I.B., L.M., V.S., and F.S. integrated the CHIMERYS software. F.S. and S.G. integrated post-processing routines. A.G., A.H., M.T.B., and L.E. evaluated the platform. D.P.Z., M.S., and M.F. wrote the manuscript.

Funding
Work presented in this manuscript was in part funded by the German Federal Ministry of Education and Research (BMBF) with grant no. 13GW0603B.

Notes
The authors declare the following competing financial interest(s): All authors are employees of MSAID GmbH, a commercial entity which develops the software described in the study. M.F., D.P.Z., S.G., and T.S. are co-founders and shareholders of MSAID.
■ ABBREVIATIONS

ISO: International Organization for Standardization
LC: liquid chromatography
LLM: large language model
MS: mass spectrometry
NAS: network-attached storage
PCA: principal component analysis
PNG: portable network graphic, file format
PSM: peptide-spectrum match
QC: quality control
R: R statistical computing language
RAW: raw data, file format
RDS: AWS relational database service
S3: AWS simple storage service
SaaS: software as a service
SDK: software development kit
TMT: tandem mass tag
TSV: tab-separated values, file format
VPC: virtual private cloud
■ REFERENCES

(1) Heil, L. R.; Damoc, E.; Arrey, T. N.; Pashkova, A.; Denisov, E.; Petzoldt, J.; Peterson, A. C.; Hsu, C.; Searle, B. C.; Shulman, N.; Riffle, M.; Connolly, B.; MacLean, B. X.; Remes, P. M.; Senko, M. W.; Stewart, H. I.; Hock, C.; Makarov, A. A.; Hermanson, D.; Zabrouskov, V.; Wu, C. C.; MacCoss, M. J. Evaluating the Performance of the Astral Mass Analyzer for Quantitative Proteomics Using Data-Independent Acquisition. J. Proteome Res. 2023, 22 (10), 3290−3300.
(2) Peters-Clarke, T. M.; Coon, J. J.; Riley, N. M. Instrumentation at the Leading Edge of Proteomics. Anal. Chem. 2024, 96 (20), 7976−8010.
(3) Guzman, U. H.; Martinez-Val, A.; Ye, Z.; Damoc, E.; Arrey, T. N.; Pashkova, A.; Renuse, S.; Denisov, E.; Petzoldt, J.; Peterson, A. C.; Harking, F.; Østergaard, O.; Rydbirk, R.; Aznar, S.; Stewart, H.; Xuan, Y.; Hermanson, D.; Horning, S.; Hock, C.; Makarov, A.; Zabrouskov, V.; Olsen, J. V. Ultra-Fast Label-Free Quantification and Comprehensive Proteome Coverage with Narrow-Window Data-Independent Acquisition. Nat. Biotechnol. 2024, 42, 1855−1866.
(4) Dai, C.; Pfeuffer, J.; Wang, H.; Zheng, P.; Käll, L.; Sachsenberg, T.; Demichev, V.; Bai, M.; Kohlbacher, O.; Perez-Riverol, Y. Quantms: A Cloud-Based Pipeline for Quantitative Proteomics Enables the Reanalysis of Public Proteomics Data. Nat. Methods 2024, 21 (9), 1603−1607.
(5) Perez-Riverol, Y.; Bai, J.; Bandla, C.; García-Seisdedos, D.; Hewapathirana, S.; Kamatchinathan, S.; Kundu, D. J.; Prakash, A.; Frericks-Zipper, A.; Eisenacher, M.; Walzer, M.; Wang, S.; Brazma, A.; Vizcaíno, J. A. The PRIDE Database Resources in 2022: A Hub for Mass Spectrometry-Based Proteomics Evidences. Nucleic Acids Res. 2022, 50 (D1), D543−D552.
(6) Gessulat, S.; Schmidt, T.; Zolg, D. P.; Samaras, P.; Schnatbaum, K.; Zerweck, J.; Knaute, T.; Rechenberger, J.; Delanghe, B.; Huhmer, A.; Reimer, U.; Ehrlich, H.-C.; Aiche, S.; Kuster, B.; Wilhelm, M. Prosit: Proteome-Wide Prediction of Peptide Tandem Mass Spectra by Deep Learning. Nat. Methods 2019, 16 (6), 509−518.
(7) Zhou, X.-X.; Zeng, W.-F.; Chi, H.; Luo, C.; Liu, C.; Zhan, J.; He, S.-M.; Zhang, Z. PDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Anal. Chem. 2017, 89 (23), 12690−12697.
(8) Zeng, W.-F.; Zhou, X.-X.; Willems, S.; Ammar, C.; Wahle, M.; Bludau, I.; Voytik, E.; Strauss, M. T.; Mann, M. AlphaPeptDeep: A Modular Deep Learning Framework to Predict Peptide Properties for Proteomics. Nat. Commun. 2022, 13 (1), 7238.
(9) Meyer, J. G. Deep Learning Neural Network Tools for Proteomics. Cell Rep. Methods 2021, 1 (2), No. 100003.
(10) Yu, F.; Teo, G. C.; Kong, A. T.; Fröhlich, K.; Li, G. X.; Demichev, V.; Nesvizhskii, A. I. Analysis of DIA Proteomics Data Using MSFragger-DIA and FragPipe Computational Platform. Nat. Commun. 2023, 14 (1), 4154.
(11) Yang, K. L.; Yu, F.; Teo, G. C.; Li, K.; Demichev, V.; Ralser, M.; Nesvizhskii, A. I. MSBooster: Improving Peptide Identification Rates Using Deep Learning-Based Features. Nat. Commun. 2023, 14 (1), 4539.
(12) Declercq, A.; Bouwmeester, R.; Hirschler, A.; Carapito, C.; Degroeve, S.; Martens, L.; Gabriels, R. MS2Rescore: Data-Driven Rescoring Dramatically Boosts Immunopeptide Identification Rates. Mol. Cell. Proteom.: MCP 2022, 21 (8), No. 100266.
(13) Searle, B. C.; Pino, L. K.; Egertson, J. D.; Ting, Y. S.; Lawrence, R. T.; MacLean, B. X.; Villén, J.; MacCoss, M. J. Chromatogram Libraries Improve Peptide Detection and Quantification by Data Independent Acquisition Mass Spectrometry. Nat. Commun. 2018, 9 (1), 5128.
(14) Yang, Y.; Liu, X.; Shen, C.; Lin, Y.; Yang, P.; Qiao, L. In Silico Spectral Libraries by Deep Learning Facilitate Data-Independent Acquisition Proteomics. Nat. Commun. 2020, 11 (1), 146.
(15) Wallmann, G.; Skowronek, P.; Brennsteiner, V.; Lebedev, M.; Thielert, M.; Steigerwald, S.; Kotb, M.; Heymann, T.; Zhou, X.-X.; Schwörer, M.; Strauss, M. T.; Ammar, C.; Willems, S.; Zeng, W.-F.; Mann, M. AlphaDIA Enables End-to-End Transfer Learning for Feature-Free Proteomics. bioRxiv 2024, 2024.05.28.596182.
(16) Demichev, V.; Messner, C. B.; Vernardis, S. I.; Lilley, K. S.; Ralser, M. DIA-NN: Neural Networks and Interference Correction Enable Deep Proteome Coverage in High Throughput. Nat. Methods 2020, 17 (1), 41−44.
(17) Zolg, D. P.; Gessulat, S.; Paschke, C.; Graber, M.; Rathke-Kuhnert, M.; Seefried, F.; Fitzemeier, K.; Berg, F.; Lopez-Ferrer, D.; Horn, D.; Henrich, C.; Huhmer, A.; Delanghe, B.; Frejno, M. INFERYS Rescoring: Boosting Peptide Identifications and Scoring Confidence of Database Search Results. Rapid Commun. Mass Spectrom. 2021, No. e9128.
(18) Bruderer, R.; Bernhardt, O. M.; Gandhi, T.; Xuan, Y.; Sondermann, J.; Schmidt, M.; Gomez-Varela, D.; Reiter, L. Optimization of Experimental Parameters in Data-Independent Mass Spectrometry Significantly Increases Depth and Reproducibility of Results. Mol. Cell. Proteom.: MCP 2017, 16 (12), 2296−2309.
(19) Frejno, M.; Berger, M. T.; Tüshaus, J.; Hogrebe, A.; Seefried, F.; Graber, M.; Samaras, P.; Fredj, S. B.; Sukumar, V.; Eljagh, L.; Brohnshtein, I.; Mamisashvili, L.; Schneider, M.; Gessulat, S.; Schmidt, T.; Kuster, B.; Zolg, D. P.; Wilhelm, M. Unifying the Analysis of Bottom-up Proteomics Data with CHIMERYS. bioRxiv 2024, 2024.05.27.596040.
(20) Tyanova, S.; Temu, T.; Sinitcyn, P.; Carlson, A.; Hein, M. Y.; Geiger, T.; Mann, M.; Cox, J. The Perseus Computational Platform for Comprehensive Analysis of (Prote)Omics Data. Nat. Methods 2016, 13 (9), 731−740.
(21) Kohler, D.; Staniak, M.; Tsai, T.-H.; Huang, T.; Shulman, N.; Bernhardt, O. M.; MacLean, B. X.; Nesvizhskii, A. I.; Reiter, L.; Sabido, E.; Choi, M.; Vitek, O. MSstats Version 4.0: Statistical Analyses of Quantitative Mass Spectrometry-Based Proteomic Experiments with Chromatography-Based Quantification at Scale. J. Proteome Res. 2023, 22 (5), 1466−1482.
(22) Bloom, J.; Triantafyllidis, A.; Quaglieri, A.; Ngov, P. B.; Infusini, G.; Webb, A. Mass Dynamics 1.0: A Streamlined, Web-Based Environment for Analyzing, Sharing, and Integrating Label-Free Data. J. Proteome Res. 2021, 20 (11), 5180−5188.
(23) Deutsch, E. W.; Bandeira, N.; Perez-Riverol, Y.; Sharma, V.; Carver, J. J.; Mendoza, L.; Kundu, D. J.; Wang, S.; Bandla, C.; Kamatchinathan, S.; Hewapathirana, S.; Pullman, B. S.; Wertz, J.; Sun, Z.; Kawano, S.; Okuda, S.; Watanabe, Y.; MacLean, B.; MacCoss, M. J.; Zhu, Y.; Ishihama, Y.; Vizcaíno, J. A. The ProteomeXchange Consortium at 10 Years: 2023 Update. Nucleic Acids Res. 2023, 51 (D1), D1539−D1548.
(24) Serrano, L. R.; Peters-Clarke, T. M.; Arrey, T. N.; Damoc, E.; Robinson, M. L.; Lancaster, N. M.; Shishkova, E.; Moss, C.; Pashkova, A.; Sinitcyn, P.; Brademan, D. R.; Quarmby, S. T.; Peterson, A. C.; Zeller, M.; Hermanson, D.; Stewart, H.; Hock, C.; Makarov, A.; Zabrouskov, V.; Coon, J. J. The One Hour Human Proteome. Mol. Cell. Proteom.: MCP 2024, 23 (5), No. 100760.