Seminars in Perinatology
Keywords: Omics; multi’omic integration; metagenomics; metatranscriptomics; proteomics; metabolomics

The application of ‘omic techniques including, but not limited to, genomics/metagenomics, transcriptomics/meta-transcriptomics, proteomics/meta-proteomics, and metabolomics to generate multiple datasets from a single sample has facilitated hypothesis generation, leading to the identification of biological, molecular and ecological functions and mechanisms, as well as associations and correlations. Despite their power and promise, a variety of challenges must be considered in the successful design and execution of a multi-omics study. In this review, various ‘omic technologies applicable to single- and meta-organisms (i.e., host + microbiome) are described, and considerations for sample collection, storage and processing prior to data generation and analysis, as well as approaches to data storage, dissemination and analysis are discussed. Finally, case studies are included as examples of multi-omic applications providing novel insights and a more holistic understanding of biological processes.
© 2021 Elsevier Inc. All rights reserved.

An introduction to ‘omic technologies

‘Omics refers to the comprehensive or global assessment of a collection of features. An ‘omic-based analysis may include genomics or metagenomics (i.e., all genes in an organism or community), transcriptomics or meta-transcriptomics, metabolomics, proteomics, or other features studied in a global, typically high-throughput, manner.1 Although reductionist approaches focused on single genes, proteins, and pathways have successfully identified key features influencing health and disease, rarely is a single entity the underlying cause of a particular phenotype or complex disease. Whereas analyte-by-analyte approaches were common previously, many ‘omics platforms have reached a state where high-throughput, high-resolution, cost-efficient profiling of samples is the norm. The ability to study information-rich ‘omic data sets has been driven, in large part, by the development and increasing availability of array technologies, high-throughput mass spectrometry and sequencing platforms, and the development of data and computer science techniques for the movement, management, and integration of high dimensional data.2

Biological systems rely upon the transfer of information from nucleic acids to proteins and metabolites in order to shape function and phenotype. This is referred to as the ‘omics cascade (Fig. 1), and ‘omic-driven studies are facilitating a new, more holistic understanding of systems biology. This perspective recognizes that many diseases result from complex and/or heterogeneous processes and that a combined ‘lens’ may be more powerful than single ‘omes alone.3–7 Recent studies also suggest that outliers in multi-omic data sets may signal disease progression and could be leveraged as early diagnostic indicators.3,8,9 Such early detection capabilities may help refine approaches for public health screening and diagnosis, as well as pinpoint and personalize interventions or treatments.10 In the following sections, we present commonly studied ‘omes, discuss methods and strategies for their analysis and integration, and highlight multi-omic case studies.

Fig. 1 – ‘Omics cascade. Figure depicts the flow of information and various ‘omic technologies used in the characterization of genome/metagenomes, transcriptome/meta-transcriptomes, proteome/meta-proteomes, and metabolomes in single organisms and meta-organisms (i.e., host + microbiome). Figure also shows the utility of each analyte and their half-lives.

Genomics and metagenomics

Genomics is the characterization of an organism’s entire genetic content. Genomes, typically composed of DNA (RNA in some viruses), consist of coding (i.e., genes) and non-coding regions, and the relative proportions of these vary across the tree of life. Non-coding regions include introns, regulatory elements, and repetitive DNA and tend to be more common in eukaryotes.11 Protein coding sequences represent approximately 1% of the human genome, but 80-90% of most bacterial genomes.12 Genomes are characterized through sequencing, assembly, and comparison with reference genomes. Whole genome sequencing can be used to identify novel genes or gene variants, and previously uncharacterized traits may be identified through genome wide association studies (GWAS), which link genomic variants with disease states or other phenotypes. Multiple different approaches exist within genomics. For example, exome sequencing is a genome sequencing strategy in which exons (i.e., protein coding regions) are targeted instead of whole genomes. Targeted approaches like this can lead to faster, cheaper, more sensitive gene variant discovery.12

In contrast to single genomes, a metagenome is the collective genomic content of a community (typically microbial). The human microbiome is the collection of the microbes that live in and on the body. Metagenomic studies of the human microbiome have focused largely on its bacterial members (i.e., bacteriomics) to date. Despite this, both fungi (mycobiome) and viruses (virome) are part of the human microbiome.

Although human-associated microbes have been studied for centuries, interest in the human microbiome has grown steadily in recent years. Fueled by the recognition that microbes, both pathogenic and commensal, have the potential to influence nutrient acquisition, immune and neurological development, and long-term health,13–15 as well as advances in sequencing technology, metagenomic studies of the human microbiome have become commonplace.

Metagenomic data are generated through untargeted approaches extracting whole community DNA and using shotgun sequencing to generate taxonomic and functional profiles; however, amplicon-based approaches [e.g., 16S ribosomal RNA (rRNA) gene or internal transcribed spacer (ITS) sequencing] can be used to generate bacterial and archaeal (16S rRNA gene) or fungal (ITS) taxonomic profiles.16 Genomic (or metagenomic) data provide information about taxonomy, phylogenetic relationships, potential function, mutations, and allelic variations, which can be used to shape hypotheses, guide experiments, and/or suggest mechanisms underlying the appearance or behavior of a system. Although mutations occur within genomes and metagenomic content can shift in the context of shifts in community composition, genomes are typically considered static in the sense that they do not turn over rapidly and have long half-lives relative to other ‘omes (e.g., RNA, proteins and metabolites). For instance, the half-life of DNA in bones is 521 years, but soil microbial DNA has a half-life varying from 9.1 to 49.7 h.17
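
To make the half-life figures quoted above more concrete, the following is a minimal sketch of how much of an analyte would remain after a given time. It assumes simple first-order (exponential) decay, which is an assumption of this illustration and not a claim made in the cited work; the function name `fraction_remaining` and the elapsed times are hypothetical.

```python
def fraction_remaining(elapsed, half_life):
    """Fraction of an analyte left after `elapsed` time.

    Assumes first-order decay: N(t) / N0 = 0.5 ** (t / t_half).
    `elapsed` and `half_life` must be given in the same units.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    if elapsed < 0:
        raise ValueError("elapsed must be non-negative")
    return 0.5 ** (elapsed / half_life)


# Half-lives quoted in the text: hours for soil microbial DNA, years for bone DNA.
SOIL_DNA_HALF_LIFE_H = (9.1, 49.7)
BONE_DNA_HALF_LIFE_Y = 521

# Illustrative elapsed times (24 h and 100 years) chosen only for this example.
for t_half in SOIL_DNA_HALF_LIFE_H:
    left = fraction_remaining(24, t_half)
    print(f"soil DNA, half-life {t_half} h: {left:.3f} of signal left after 24 h")

left = fraction_remaining(100, BONE_DNA_HALF_LIFE_Y)
print(f"bone DNA, half-life {BONE_DNA_HALF_LIFE_Y} y: {left:.3f} left after 100 years")
```

Under this simplified model, the same calculation can be reused for the RNA, protein, and metabolite half-lives discussed in the following sections, provided the elapsed time and half-life share units.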
Viromics refers to the study of viral communities (i.e., virome) and may include viruses that infect bacterial, archaeal, fungal, plant, animal, and/or human cells. Viral metagenomics typically employs shotgun metagenomic or meta-transcriptomic sequencing to characterize DNA or RNA viruses, respectively. Viral metagenomics is challenging given the small genome and structural sizes of viruses compared to bacteria, their high degree of genome variation (e.g., double stranded or single stranded, DNA or RNA-based genomes), and biases of reference databases toward pathogens and well-characterized bacteriophages. These constraints often require additional isolation, concentration, and annotation steps relative to general metagenomics pipelines.18 Despite these challenges, the importance of the virome in health and disease is increasingly being recognized through viral metagenomics.18 Viral metagenomics may be used to detect pathogenic viruses,19 and when paired with bacterial community profiling, correlations between viruses and bacteria can be used to understand potential cross-kingdom relationships.20–22 Bacteria-virus interactions have also been used to understand horizontal gene transfer mediated by bacteriophages, a process also known as transduction (and transductomics).23

Transcriptomics and meta-transcriptomics

Transcriptomics is the study of expressed RNA. Transcriptomics often focuses on protein-coding RNA [i.e., messenger RNA (mRNA)] but can include non-coding RNAs, which coordinate and tune gene expression. Given that expressed genes represent a fraction of the total genome, transcriptomic signals can provide focus on the genes and potential mechanisms involved in a biological process of interest. Transcriptomic profiling can be performed on host tissue or the microbiome (i.e., meta-transcriptomics), and each provides different information regarding a system’s biology. Transcriptomic approaches can be targeted, focusing on one or a few genes; semi-targeted, using pre-defined arrays (e.g., sequence capture panels and microarrays); or untargeted, using shotgun sequencing.24

Transcriptomic (or meta-transcriptomic) data can be used to confirm or refute (meta)genomic-based hypotheses. However, a transcript’s presence does not guarantee translation into viable protein. RNases are nearly ubiquitously present, and RNA molecules tend to be highly labile. mRNA half-lives range from seconds to minutes in well-characterized bacterial and fungal strains, with this range extending to hours for some human transcripts.25–29 Immediate sample processing or preservation is crucial for ensuring RNA quantity, quality, and signal detection.30,31 Adequate depletion of ribosomal RNA and other non-target molecules is also important for (meta)transcriptomic data generation.31 Additionally, given that transcriptomic data reflect responses to conditions present at a specific point in time, careful experimentation, sample handling, and time series analysis are often required.

Proteomics, the study of expressed proteins, is used to characterize protein identity, abundance, post-translational modifications, and/or the interactions of proteins (or peptides) in a system, sample, or matrix. Meta-proteomics characterizes proteins originating from multiple sources (e.g., host plus microbiome). Protein abundances are the result of upstream transcriptional activity,32,33 translational effects, and post-translational modifications (e.g., phosphorylation, glycosylation, nitrosylation, and/or ubiquitination), which are difficult to predict from transcriptional signals alone.34 Proteomic measurements are typically generated using coupled liquid chromatography (LC) and mass spectrometry (MS), and they can be targeted or untargeted. Unlike RNA, proteins are typically more stable and have longer half-lives (e.g., 10 to 1000 h, depending on origin).35,36 Proteomics technologies also tend to be lower-throughput and more expensive in nature.37

Metabolomics

Metabolomics characterizes the small molecules (i.e., metabolites) present in a sample or matrix, including amino acids, fatty acids, carbohydrates, and other compounds. Metabolites can be derived from host and/or microbiome metabolism, ingested through diet or medication, and/or environmental sources and are not limited to a single source (i.e., some metabolites may be produced by both host and microbe). Metabolites reflect a variety of upstream biological processes and can link genotype with phenotype.38 Metabolomic measurements are made using nuclear magnetic resonance (NMR) or gas or liquid chromatography (GC, LC) paired with mass spectrometry (GC-MS, LC-MS), and metabolomic studies can be performed in a targeted or untargeted manner. Metabolite half-lives vary depending on the metabolite, collection site, sample matrix, and preservation conditions, with the range covering hours39–41 to days.42 Most metabolites are highly labile, necessitating careful handling and preservation to maintain signal integrity.43

Epigenomics

Epigenomics addresses genome-wide characterization of reversible chemical modifications of DNA or DNA-associated proteins impacting gene expression and regulation.44 Epigenomic modifications are associated with several diseases including Prader-Willi syndrome, Angelman syndrome, and certain types of cancer.45,46 Epigenomic modifications can provide information regarding disease status and/or environmental exposure and act as heritable traits. Epigenomic modifications may be characterized through methylome sequencing, a modified genomic sequencing approach which captures methylated genomic regions using restriction enzymes or bisulfite treatment prior to sequencing. Alternatively, chromatin immunoprecipitation coupled with sequencing (ChIP-seq) represents a more targeted approach, employing antibodies specific to the modification(s) of
information. Although a variety of specialized databases exist, some of the most frequently used ones are described below.

Nucleotide and protein sequences

The National Center for Biotechnology Information (NCBI) houses multiple nucleotide and protein databases, including GenBank and the Reference Sequence collection (RefSeq). GenBank is an open access, annotated collection of publicly deposited nucleotide and protein sequences.66 RefSeq is a non-redundant version of GenBank. Other repositories include the Sequence Read Archive (SRA), the European Bioinformatics Institute (EMBL-EBI), and the DNA Database of Japan (DDBJ), each of which contributes to the International Nucleotide Sequence Database Collaboration (INSDC). The SRA is specific for sequence data generated on high-throughput, next-generation sequencing platforms. Specialized databases exist as well, including specific databases for 16S and 18S rRNA gene sequence data, like SILVA,67 the Genome Taxonomy Database (GTDB), which has sought to quantitatively define species via average nucleotide identity (ANI),68 the MetaHIT (METAgenomics of the Human Intestinal Tract) gene catalog,69 and the Integrated Microbial Genomes and Microbiomes (IMG/M) database.70

Metabolomics

Metabolite information can be accessed through a variety of repositories. For instance, the Human Metabolome Database (HMDB) includes predicted MS/MS and GC-MS data, and metabolite structural information.71 A related database, the Human Fecal Metabolome database, focuses specifically on metabolite information from human stool. Each entry contains over 110 data fields that are hyperlinked to other reference databases, and it allows users to browse concentration data and associated diseases.72 Similarly, the Metabolomics Workbench Metabolite Database contains structures and annotations of biologically relevant metabolites, containing approximately 136,000 entries collected from public sources like Chemical Entities of Biological Interest (ChEBI), HMDB, Biological Magnetic Resonance Bank (BMRB), PubChem, and the Kyoto Encyclopedia of Genes and Genomes (KEGG).73 METLIN is a resource for metabolite and other molecule data analysis, representing over 350 chemical classes,74,75 and MetaboLights covers structures, reference spectra, biological roles, and data from metabolic experiments.76

Proteins and proteomics

The Universal Protein Resource (UniProt) provides protein sequence and annotation information, as well as tools including local BLAST, alignments, downloadable releases, and data submission options.77 Other databases aggregate functional information into functional modules and pathways, including KEGG,78 the Clusters of Orthologous Genes (COGs) database,79 and Pfam.80 In addition, multiple easily searchable, curated databases support proteomics research. Among these, the PRoteomics IDEntification Database (PRIDE) contains proteomics datasets that are searchable, with a summary containing the description of the project, sample and processing protocols, as well as contact information.81

Approaches and platforms for multi-omic analysis and data integration

Despite the increased availability of ‘omic data sets and tools for their analysis, multi-omic integration remains challenging.2 Study design, noise, and data interoperability contribute to this, as do varying definitions of what constitutes a multi-omic study. “Multi-omic integration” may be used to describe the analysis of a single ‘ome across multiple studies (e.g., a meta-analysis), as well as the integration of multiple ‘omes generated on the same set of samples (i.e., “vertical integration”).82 Similarly, multi-omic integration can be conceptual, statistical, and/or model-based.57,83 Conceptual integration combines insights obtained from single ‘omes to form a more comprehensive understanding of the biology of a system. Statistical integration focuses on statistical relationships within and across ‘omes, and model-based integration is the layering of ‘omic data onto pre-defined system models to understand molecular organization and function.

In selecting methods for multi-omic data integration, the nature of one’s data and how it should be handled must be considered. Many ‘omic data are noisy, sparse (i.e., contain many zeros), high-dimensional, and contain batch effects. ‘Omic data can be highly heterogeneous, have signals of differing scale across ‘omes, and vary with respect to being quantitative, qualitative, and/or compositional.84 These qualities are computationally challenging and often require pre-processing, including filtering, imputation of missing values, transformation, normalization, and/or scaling prior to downstream analysis. These steps limit impacts of outliers, reduce the number of features considered, and prevent one ‘ome’s signals from overwhelming others.85,86 Additionally, spike-in standards may provide a “ground truth” to facilitate harmonization across ‘omes.56

Interoperability (i.e., the ability of the data sets to ‘speak’ to one another) between ‘omic data sets also represents an analysis obstacle. Although numeric relationships can be considered without explicit links between ‘omes, the ability to leverage knowledge-based metabolic networks, like those published by KEGG,78 MetaCyc,87 Reactome,88 and Ingenuity Pathway Analysis (IPA, QIAGEN Inc., https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/), relies upon one’s ability to map or overlay data onto pre-existing models and frameworks. It also requires that identifiers (e.g., gene, enzyme, compound) be classified using specific ontologies. Although such models and frameworks are incredibly useful, it should be noted that they are not comprehensive and may not fully represent complex systems.

Through conceptual integration, results from one ‘ome are used to build upon signals observed in another.89 Beyond seeking consensus (i.e., maximum agreement), studies may search for complementarity (i.e., useful information found in each ‘ome but not necessarily shared across them) or leverage information contained in one ‘ome to provide context for another. Consensus-seeking approaches can pinpoint
features that contribute to mechanism. Complementarity-maximizing approaches can provide useful information about molecular processes and levels at which they are carried out.3 Leveraging ‘omes versus one another is often done with paired metagenomic and meta-transcriptomic data to determine transcript to gene ratios. In paired metagenomic/meta-proteomic studies, metagenome-assembled genes facilitate protein identification from MS spectra.90,91

With statistical integration, multiple existing approaches and models have been applied to ‘omic data sets,82,86,92 and new algorithms and platforms are introduced regularly. Statistical integration includes model-based (i.e., statistical models), multi-variate, and network-based methods, including many Bayesian approaches. Bayesian approaches leverage a priori assumptions about the data to calculate posterior probability distributions, refining our ability to model and predict outcomes. Bayesian approaches are commonly applied to high-dimensional data sets, including single and multiple ‘omes, for feature selection. Such approaches seek to identify features for which support across multiple ‘omes is collectively high, and they are attractive in that they can incorporate existing biological knowledge.

Statistical integration includes both unsupervised and supervised analyses. Unsupervised analyses seek to identify subgroups within a data set, irrespective of known phenotypes or clinical features. For example, cluster analysis and principal coordinates analysis seek to identify groups such that samples within a group are more like one another than to those in other groups. Typically, these groupings are based on a similarity (or distance) metric calculated among samples in an all versus all manner. Importantly, cluster number and size are not defined a priori. Clustering can be used to identify outliers and sub-groups, or to assess the distribution of known sample characteristics across groups.

Network, or association, analysis is another unsupervised technique used to identify numerical relationships among ‘omic features and samples.84 Analytes and/or samples are represented as nodes and relationships (i.e., Pearson or Spearman correlations) are depicted by edges connecting them. Network analysis can be performed on single or multiple ‘omes. When performed with multiple ‘omes, this is referred to as concatenation, “early integration”,92 or “single block” analysis.2 With this approach, all data are combined into a single matrix prior to downstream analysis, and signal normalization and scaling are particularly important. Network size, degree of connectivity, and overall topology can be evaluated, and potential relationships can be inferred from these results. The numerical relationships identified using correlation-based approaches may not reflect direct, physical relationships, nor do they specifically account for complex interactions.86 Despite the simplicity of these approaches, association analyses can identify potentially co-regulated features93,94 and features regulating one another,95 and can highlight dysregulation in the context of disease.82,96

Supervised analyses attempt to model features that can be used to predict traits (e.g., phenotype or clinical outcome).82 Examples include regression analysis and its multi-variate extensions, as well as some machine learning algorithms (e.g., Random Forests). Variable selection (i.e., feature selection) techniques are an extension of supervised analysis and can be achieved using a variety of methods, including partial least squares regression, canonical correlation analysis (CCA), sparse CCA, and partial least squares discriminant analysis. Feature selection can serve as a discovery tool to identify candidate biomarkers and may provide mechanistic clues related to poorly understood phenomena. Variable selection methods are frequently applied in the analysis of single ‘omic data sets and may be used as a filtering step preceding the integration of multiple ‘omes.82 Additional models and techniques are described in the platforms below and reviewed at length elsewhere.82,86,92

Model-based integration leverages pre-defined system models to understand molecular organization and function. ‘Omics data may be pre-processed or pre-filtered (e.g., retaining differentially abundant or expressed features identified through feature set enrichment analysis or similar approaches) prior to model mapping, or all data may be mapped with the intent of identifying reactions and/or pathways for which evidence exists. Tools such as the Integrated Molecular Pathway-Level Analysis (IMPaLA),97 PaintOmics,98 IPA (Qiagen), and the KEGG mapper78 can be leveraged for this type of analysis. These approaches are appealing due to the highly visual nature of their output, and they are useful in that they can place large amounts of data in biological context. However, as noted above, these models are not fully comprehensive, may not facilitate the integration of all types of ‘omic data, and are often limited in their representation of complex systems.

Software and bioinformatic tools for multi ‘omic

Multi ‘omic data generation has fueled a need for tools and platforms to support analysis, integration, and visualization. A variety of tools have been developed and are becoming available. Some perform correlation analyses, while others facilitate covariance-based and/or multiple co-inertia analyses.99 Other tools support data visualization, increase interpretability, and support data sharing and accessibility.85,100 Several of these tools are described in Table 1, and a more comprehensive list can be found in https://github.com/mikelove/awesome-multi-omics. In addition, multiple commercial platforms are available and will likely continue to emerge (for example, Seven Bridges https://www.seven

Case studies and future directions

Multi-omic approaches have been applied in a variety of contexts to date, including studies of pregnancy and neonatal health and disease. As described above, conceptual, statistical, and model-based integration approaches have been used to gain insight from layers of ‘omic data. Multi-omic applications to date have included biomarker discovery, the characterization of microbial dynamics, and host-microbiome interactions. In most cases, it is anticipated that these discoveries will be translated to therapeutics or distilled down to a
few key analytes for the purposes of diagnosis and/or patient stratification (Fig. 2).

Table 1 – Description of several multi-omics tools for data integration, analysis and/or visualization.
* Although the program has been developed specifically for the analysis of single organisms or microbiome data, both analysis types may be supported.

Multi-omic approaches are being used to understand a variety of biological phenomena across the lifespan, including gestational clocks and preterm birth. Preterm birth is a major cause of neonatal death, and researchers routinely seek to understand the factors contributing to it. Establishing a thorough understanding of the molecular underpinnings and chronological changes occurring during term pregnancy may help identify (molecular) deviations associated with preterm birth and other pregnancy-related pathologies. Leveraging a combination of metabolomics, proteomics, transcriptomics, and statistical models based on “single block” analysis, Ghaemi et al. found that immune cell signaling responses model gestational age better than single ‘omes alone.111 Their results highlight the immunomodulatory capacity of the steroid hormone pregnanolone sulfate as a possible mechanism for maintaining pregnancy.

In the context of microbiome development in the days and weeks following birth, another multi-omic analysis used metagenomics, proteomics, and metabolomics in meconium and early stool. This study, an example of conceptual integration, demonstrated transient versus persistent presence of certain gut microbes, as well as key features reflecting microbial metabolism.112 Understanding this window of development is particularly important as the human gut microbiome develops rapidly following birth and has the potential to impact health and disease outcomes later in life.113 Similarly, meta-proteomics has been used to identify key functional differences in the gut microbiota of preterm infants.114 As with the preterm birth study described above, deviations from “healthy” or “normal” progression in microbial community assembly or metabolism may serve as indicators for poor health outcomes and could implicate key molecular mechanisms associated with them.

Beyond pregnancy and early-life programming, biomarkers of birthweight, a characteristic having implications for health later in life, have been identified using correlation analysis of various ‘omic data sets including DNA methylation profiling, transcriptomics, inflammation-related proteins, cholesterol, and anthropometric measurements.5 Abnormal birthweight is associated with increased mortality, risk of cardiovascular diseases, mental health problems, and certain cancers later in life. Recent work has found that both the metabolome and methylome carry common signatures associated with abnormal birthweight, as well as novel biomarkers, including a macrophage-derived chemokine. Additionally, cholesterol levels in cord blood were correlated with birthweight, such
Fig. 2 – Example of multi-omic applications in health and disease. Multi-omic approaches lend themselves to the discovery of biomarkers and signals contributing to underlying mechanism. Distilled signals based on these discoveries are likely to contribute to the development of therapeutics or become key analytes which serve as the basis for diagnostic tests and/or patient stratification.
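
The “single block” (concatenation) analysis and the correlation-based analyses referred to in the case studies can be illustrated with a small sketch. It scales each feature, combines features from several ‘omes measured on the same samples into one block, and reports feature pairs whose Pearson correlation passes a threshold, which would form the edges of an association network. All function names, the 0.8 threshold, and the toy values are hypothetical and chosen for illustration only; this is not the method used in any of the cited studies.

```python
import math


def zscore(values):
    """Center a feature to mean 0 and scale it to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("feature has zero variance; filter it out first")
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def single_block(omes):
    """Concatenate per-'ome feature tables (same sample order) into one block.

    Each feature is z-scored so that no single 'ome dominates by scale alone.
    """
    block = {}
    for ome_name, features in omes.items():
        for feature_name, values in features.items():
            block[f"{ome_name}:{feature_name}"] = zscore(values)
    return block


def correlation_edges(block, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    names = sorted(block)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(block[a], block[b])
            if abs(r) >= threshold:
                edges.append((a, b, round(r, 3)))
    return edges


# Toy, made-up values for five samples (not data from any study).
omes = {
    "metabolome": {"metabolite_A": [1.0, 2.0, 3.0, 4.0, 5.0]},
    "proteome": {
        "protein_B": [10.0, 19.0, 31.0, 42.0, 50.0],
        "protein_C": [5.0, 1.0, 4.0, 2.0, 3.0],
    },
}
edges = correlation_edges(single_block(omes))
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r}")
```

As noted in the discussion of network analysis, an edge found this way reflects a numerical association only and does not by itself indicate a direct, physical relationship between the two analytes.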
