The Best Practice For Microbiome Analysis Using R
https://doi.org/10.1093/procel/pwad024
Advance access publication 2 May 2023
Review
Protein & Cell
Abstract
With the gradual maturation of sequencing technology, many microbiome studies have been published, driving the emergence and advance
Introduction

Metagenomic analysis is used to study microbial diversity, structure, and function by sequencing, quantifying, annotating, and analyzing DNA and/or RNA sequences of microbial communities or microbiota. The high-throughput sequencing technologies most commonly used in microbiome research are amplicon sequencing and shotgun metagenomic sequencing. Amplicon sequencing, with its low cost, mature analysis system, and simple analysis process, has been widely used in microbiome research. Shotgun metagenomic sequencing provides functional information on microbes and more accurate information on microbial composition, at a higher sequencing cost and with a larger demand for computational resources. The detailed pipelines for both sequencing methods have been systematically summarized in our previous review (Liu et al., 2021). As an important component of biodiversity, microbial communities play a vital role in biology, ecology, biotechnology, agriculture, and medicine. Various bioinformatics methods are required for microbial community analysis, which mainly includes three parts: (i) data preprocessing, (ii) quantification and annotation, and (iii) statistics and visualization (Fig. 1A). In the preprocessing step, the raw data are filtered and quality controlled to ensure data quality. In the quantification and annotation step, tools and databases are used to identify microbial representative sequences and annotate microbial taxonomy and function. The first two parts of microbial community analysis have been well discussed and can be carried out following our previous paper (Liu et al., 2023). Finally, in the statistics and visualization step, various statistical methods are used to explore microbial community diversity, structure, and potential functions.

Figure 1. Microbial community data analysis workflow and related R packages. (A) Overview of the microbial community data analysis workflow. Core files are the feature table (OTU), taxonomy (Taxonomy), sample metadata (Metadata), phylogenetic tree (Tree), and representative sequences (Rep.fa). (B) Detail of the microbial community analysis workflow. First, the raw data can be processed using USEARCH/VSEARCH, QIIME 2, or DADA2. Then, the important files are saved and used for downstream analysis in the R language and RStudio software. Many microbial analysis methods rely on numerous R packages. The font size in the word cloud represents the number of citations of each R package. (C) Commonly used R packages for data cleaning/manipulation and visualization. (D) Classification of R packages into six categories of microbial community analysis.

With the development of high-throughput sequencing technology, many studies have been performed with amplicon sequencing (Thompson et al., 2017; Proctor et al., 2019) and shotgun metagenomic sequencing (Carrión et al., 2019; Li et al., 2022; Paoli et al., 2022), which has led to the development of microbiome analysis methodologies, software, and pipelines, for example, QIIME (Caporaso et al., 2010), Mothur (Schloss et al., 2009), USEARCH (Edgar, 2010), VSEARCH (Rognes et al., 2016), QIIME 2 (Bolyen et al., 2019), Parallel-Meta Suite (Chen et al., 2022), EasyAmplicon (Liu et al., 2023), Kraken (Wood and Salzberg, 2014), MEGAN (Huson et al., 2007), MetaPhlAn2 (Truong et al., 2015), HUMAnN2 (Franzosa et al., 2018), etc. As the most crucial and basic procedure for amplicon sequencing data analysis, OTU (operational taxonomic unit) clustering was popular before 2015, while non-clustering methods have since been gradually developed and widely adopted. Currently, the common non-clustering methods include DADA2 (Callahan et al., 2016), deblur (Amir et al., 2017), and unoise3 (Edgar and Flyvbjerg, 2015).
Using R language in microbiome analysis | 715
One of the most representative non-clustering algorithms among them is DADA2, which was written in R. This gives the R language (Ihaka and Gentleman, 1996) an important position in raw data processing for amplicon sequencing. While many software tools can be used in the upstream steps of microbiota sequencing data analysis, the downstream analysis steps rely heavily on the R language and its various packages. These analyses mainly include: (i) Diversity analysis; (ii) Difference analysis; (iii) Correlation and network analysis; (iv) Biomarker identifica-

Preparing microbiome data analysis

Downstream analysis of microbiome data requires the preparation of five data files: a feature table, a feature annotation file, a sample metadata file, a phylogenetic tree, and representative sequences. For beginners, it is important to understand the format and basic data structure of these files and to learn how to import them into R. Furthermore, different analyses often have different data requirements, and it
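As a minimal sketch of what the five core inputs look like once they are in R, the example below builds toy versions from inline text and checks that their identifiers agree. The sample names (S1 to S3), feature IDs (ASV1 to ASV3), and all values are made up for illustration. Real projects would read files from disk, and dedicated packages (for example ape for trees) are normally used instead of the simple base-R parsing assumed here.

```r
# Toy versions of the five core inputs, given as inline text so the example
# runs without files on disk. All IDs and values are illustrative assumptions.
otu_txt   <- "ID\tS1\tS2\tS3\nASV1\t10\t0\t5\nASV2\t3\t8\t0\nASV3\t0\t2\t7"
tax_txt   <- paste0("ID\tPhylum\tGenus\n",
                    "ASV1\tProteobacteria\tPseudomonas\n",
                    "ASV2\tFirmicutes\tBacillus\n",
                    "ASV3\tActinobacteria\tStreptomyces")
meta_txt  <- "SampleID\tGroup\nS1\tControl\nS2\tControl\nS3\tTreatment"
tree_txt  <- "((ASV1:0.1,ASV2:0.2):0.05,ASV3:0.3);"   # Newick string
fasta_txt <- c(">ASV1", "ACGTACGT", ">ASV2", "ACGTTCGT", ">ASV3", "TCGTACGA")

# Feature table: features in rows, samples in columns.
# With a real file this would be read.delim("path/to/file", row.names = 1, ...).
otu <- as.matrix(read.delim(text = otu_txt, row.names = 1, check.names = FALSE))
taxonomy <- read.delim(text = tax_txt, row.names = 1, stringsAsFactors = FALSE)
metadata <- read.delim(text = meta_txt, row.names = 1, stringsAsFactors = FALSE)

# Representative sequences: turn two-line FASTA records into a named vector.
is_header <- grepl("^>", fasta_txt)
rep_seqs <- setNames(fasta_txt[!is_header], sub("^>", "", fasta_txt[is_header]))

# Tip labels of the Newick tree: every label that is directly followed by ":".
tree_tips <- regmatches(
  tree_txt, gregexpr("[A-Za-z0-9_]+(?=:)", tree_txt, perl = TRUE)
)[[1]]

# Consistency checks: the same samples and features must appear everywhere.
stopifnot(
  setequal(colnames(otu), rownames(metadata)),
  setequal(rownames(otu), rownames(taxonomy)),
  setequal(rownames(otu), names(rep_seqs)),
  setequal(rownames(otu), tree_tips)
)

# Put annotation tables in the same order as the feature table.
metadata <- metadata[colnames(otu), , drop = FALSE]
taxonomy <- taxonomy[rownames(otu), , drop = FALSE]
```

Checking identifiers up front is a cheap way to catch mismatched sample or feature names before any statistics are run.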
R types (vector, list, and data frame) and basically replaces the apply family of functions in the base package. It can easily handle grouped calculations, for example, microbial abundance at different taxonomy levels (Code 1C). The reshape2 (Wickham, 2007) package provides long-wide format transformation during data processing. Because ggplot2 (Wickham, 2011a) plotting functions and most modeling functions, such as lm(), glm(), and gam(), often use long data, while microbiome data are generally stored in wide form, the transformation of microbiome data for plotting can

of important variables (Code 1E).

Visualization in R language

In most cases, microbiome data are displayed with standard graphs such as alpha/beta diversity and taxonomic composition plots. All the visualization-related packages are shown in Fig. 1C. Owing to the widespread use of ggplot2 (Code 2A), many extension packages have emerged that build on ggplot2 and offer a wide range of plotting styles, colors, and themes. These packages mainly include ggtern for ternary graphs (Code 2B; Hamilton and Ferry, 2018), ggraph for network graphs (Code 2C; Si et al., 2022), ggtree for evolutionary trees or cladograms (Code 2D; Xu et al., 2022), the ggalluvial package, the ggVennDiagram package (Code 2E), the ggstatsplot package for pie charts, and the ggpubr package, which provides many themes and colors for output. In addition, the pheatmap and ComplexHeatmap (Gu, 2022) packages, based on the grid graphics system, plot the relative abundance of features in different samples (Code 2F), and the VennDiagram package (Chen and Boutros, 2011) can show the number of features in different samples. The UpSetR package (Conway et al., 2017) draws UpSet plots, a newer form of plot similar to the Venn diagram. The base plotting system is complex and difficult to learn, but it is a good choice for complex graphs, such as those of the circlize package (Gu et al., 2014; Code 2G), which draws chord diagrams of microbiota composition. Additionally, microbiome plotting work often involves combining graphics. At present, many tools in R can combine graphics, such as cowplot, patchwork, and aplot. The patchwork package has the most powerful functions and supports modular assembly of graphics (Code 2H).

Microbial community analysis

We have categorized the analysis of microbiome data into the following six major types in Fig. 1D: diversity analysis, difference analysis, biomarker identification, correlation and network analysis, functional prediction, and other microbiome analyses (including source tracking analysis, community assembly processes, and analysis of associations between microbiota and environmental factors). We then organized, compared, and summarized all relevant R packages.

Diversity analysis

Microbial community diversity mainly includes alpha diversity (Richness, Shannon, Simpson, Chao1, ACE, etc.), rarefaction curves, beta diversity (ordination and clustering analysis), and taxonomic or functional composition. Here we must introduce the vegan package (Oksanen et al., 2007), whose name abbreviates Vegetation Analysis. It was written by nine quantitative ecologists, including Oksanen from Finland, and was initially designed specifically for community ecology data. The package provides a variety of meth-

ing correspondence analysis (CA) and canonical correspondence analysis (CCA), decorana() for conducting detrended correspondence analysis (DCA), and metaMDS() for conducting non-metric multidimensional scaling (NMDS) for microbiome ordination analysis (Code 3B). The prcomp() function in the stats package can be used for principal component analysis (PCA), which is a kind of dimension reduction analysis. The mca() function provided by the MASS package and the MCA() function provided by the FactoMineR package can be used for multiple CA (Code 3B); the ape package provides the pcoa() function for principal coordinate analysis (PCoA); the MASS package provides lda() for linear discriminant analysis (LDA, Code 3C). Before running many ordination operations, it is often necessary to compute distances between communities first. The vegdist() function in the vegan package can calculate Euclidean, Manhattan, Bray, Canberra, and other distances (Code 3B). In addition, distances can also be calculated using dist() in the stats package. The distance matrix can be used for clustering analysis as well as ordination analysis. The hclust() function in the stats package can be used for clustering analysis, and similar results can be achieved with the factoextra package or the kmeans() function (Code 3D). Microbial composition analysis is mainly used to display the abundance of microbes; the dplyr package is needed to organize the data, which are then displayed with ggplot2.

Difference analysis

Difference analysis is divided into community-level analysis and feature-level (any hierarchy of taxonomy and function) analysis. Community-level difference analysis is mainly performed with functions including adonis(), anosim(), and mrpp() in the vegan package, and mantel.test() in the ape package (Code 4A). Feature-level difference analysis of compositional data can use wilcox.test() (Code 4B) and t.test() (Code 4C) in the stats package. Subsequently, data correction algorithms were developed specifically for sequencing data, such as the upper quartile (UQ), trimmed mean of M-values (TMM) (Code 4C), and relative log expression (RLE) methods in the edgeR package (Robinson et al., 2009) (Code 4D), the median of ratios method (MED) in the DESeq2 package (Love et al., 2014) (Code 4E), and the cumulative-sum scaling (CSS) algorithm in the metagenomeSeq package (Code 4F). Furthermore, the ALDEx2 package provides a Dirichlet-multinomial model which can be used to infer feature abundance and calculate feature differences with non-parametric tests, t-tests, or generalized linear models (Code 4G). The ANCOM-BC package attempts to address sample heterogeneity by correcting bias with a log-linear model. In addition, other R packages for microbiome data correction and difference tests include limma (Code 4H), DR, ANCOM (Lin and Peddada, 2020) (Code 4I), corncob (Code 4J), Maaslin2 (Code 4K), etc. Nearing et al. (2022) compared these difference analysis methods and proposed that ALDEx2 and ANCOM-II (ancom_v2.1.R, Code 4L) were the best performers in the difference analysis of microbial communities. As for the significance

SpiecEasi package can use the sparcc method, which is better suited to microbiome data, to calculate the correlation matrix, and can use multiple threads to accelerate the calculation. For network visualization and attribute calculation, the igraph (Code 6E), network, and ggraph packages (Code 6F) can be used. These R packages contain many layout algorithms for network visualization. In addition, networks visualized with network packages combined with ggplot2 are easier to modify. The sna and ggraph packages have many visualization layout algorithms to
phylogenetic distances between clades, and tests for phylogenetic signals within individual phylogenetic bins.

Microbial communities are often analyzed for their correlation with environmental indicators. For example, mantel() provided by the vegan package can be used to examine the correlation between microbial communities and environmental indicators, and wascores() and mantel.correlog() can be used to detect the phylogenetic signal between microbial communities and environmental factors (Code 8C). In addition, the ggClusterNet package can be used

As microbiome sequencing becomes more popular, R packages dedicated to microbiome data processing are gradually emerging (Fig. 2). McMurdie and Holmes (2013) developed the phyloseq package: a comprehensive tool for microbiome data (including feature tables, phylogenetic trees, and feature annotation) clustering, integrating data import, storage, analysis, and output. The package utilizes many tools in R for ecological and phylogenetic analyses (vegan, ade4, ape, and picante) and uses ggplot2 to output high-standard figures. The data storage structure uses an S4-like storage system to store all relevant data as a single experiment-level object, thus making it easier to share data and reproduce the analysis. Subsequently, the packages microbiome, the MicrobiomeAnalystR (Chong et al., 2020), microViz (Barnett et

(the feature table, feature annotation, metadata, representative sequences, and evolutionary tree) to maintain correspondence under the same framework, and provides a variety of filtering functions on microbial features and samples, allowing
[Figure 2 graphic: functions of the integrated R packages grouped by analysis type (difference analysis, biomarker identification, function prediction, community building, association analysis).]
Figure 2. Introduction to the functions of integrated microbial analysis R packages. Microbial community analysis can be divided into diversity
analysis, difference analysis, biomarker identification, correlation and network analysis, functional prediction, and other microbial community analysis
(community building/construction process, association analysis with other indicators).
the five parts of data to be filtered consistently without considering differences among the data. It also provides microbiome analysis through microbial data filtering and normalization, diversity calculation (Fig. S2A and S2B), microbial composition visualization (Fig. S2C and S2D), evolutionary tree visualization, and network analysis (Fig. S2E). The beta diversity function provides more than 30 distance algorithms, far more than those provided by packages such as vegan. Secondly, the phyloseq package uses ggplot2 for graphical visualization (Fig. S2F), which is easier to gener-

evolutionary tree and feature taxonomic and abundance on tree branches and leaves (Fig. S2G), which makes the tree informative and beautiful.

Microbiome data analysis using microbiome

The microbiome package also uses S4 class objects, like phyloseq, and can perform most microbiome analyses (Figs. 2, 3 and S3A–G, Pipeline 2. Microbiome.Rmd). It includes microbial diversity analysis (Fig. S3A–E), and difference analysis (Fig.
[Figure 3 graphic. Package numbers: ① Phyloseq, ② Microbiome, ③ MicrobiomeAnalystR, ④ Animalcules, ⑤ Microeco, ⑥ EasyAmplicon.]
Figure 3. Typical results of integrated microbial community analysis R packages and comparison of similar results. The analysis results of
multiple integrated R packages are grouped according to the major categories of microbial community analysis functions. Each main branch in the tree diagram
represents a type of microbial community analysis, and there are a total of 10 main branches: feature diversity analysis including (i) alpha diversity
analysis, (ii) beta diversity analysis, (iii) features (community taxonomic or functional) composition analysis, (iv) evolutionary or taxonomic tree
analysis; (v) difference analysis; (vi) biomarker identification; (vii) correlation and network analysis; (viii) functional prediction; (ix) community building/
construction process analysis; (x) other analysis, such as association analysis with other indicators. Each leaf (circle) represents a style of the result
displayed in the analysis, and the circle number around the outside of leaf represents the package number of the integrated R package that the analysis
result comes from.
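The alpha and beta diversity branches rest on a few standard formulas that the integrated packages wrap. A minimal base-R sketch follows, assuming a toy feature table with made-up counts. The helper names (`shannon`, `simpson`, `bray_curtis`) are hypothetical and written out only to show the arithmetic; the Shannon index uses the natural logarithm, and principal coordinate analysis is done with `cmdscale()` from the stats package.

```r
# Toy feature table (features in rows, samples in columns); values are made up.
otu <- matrix(
  c(10, 0, 5,
     3, 8, 0,
     0, 2, 7,
     1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:3))
)

# Alpha diversity indices for one sample (vector of counts).
richness <- function(x) sum(x > 0)
shannon  <- function(x) { p <- x[x > 0] / sum(x); -sum(p * log(p)) }
simpson  <- function(x) { p <- x / sum(x); 1 - sum(p^2) }

alpha <- data.frame(
  richness = apply(otu, 2, richness),
  shannon  = apply(otu, 2, shannon),
  simpson  = apply(otu, 2, simpson)
)

# Bray-Curtis dissimilarity between samples (samples must be in rows of m):
# sum of absolute differences divided by the sum of all counts in both samples.
bray_curtis <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
    }
  }
  as.dist(d)
}
bc <- bray_curtis(t(otu))

# Beta diversity: ordination (PCoA) and hierarchical clustering on the distances.
pcoa <- cmdscale(bc, k = 2, eig = TRUE)
hc <- hclust(bc, method = "average")
```

Here `alpha` holds one row per sample, `pcoa$points` gives the sample coordinates for a scatter plot, and `hc` can be passed to `plot()` for a dendrogram.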
720 | Wen et al.
Table 1. Comparison of the advantages and limitations of the six integrated R packages.

Package: Phyloseq
Functions:
1. Diversity analysis including alpha/beta diversity, community composition, and phylogenetic tree analysis
2. Network analysis
Advantages:
1. First to utilize S4 class objects
2. Possesses many analysis functions based on phyloseq objects
3. The network analysis process is simplified (Fig. S2E)
4. Ordination analysis can be applied to arrange the order of samples and microbes on heatmap rows and
Limitations:
1. Introduction to phyloseq objects can be challenging for beginners
2. Statistical tests, including diversity tests and community/feature-level microbial difference analysis, are not well integrated into community analysis
is richer in alpha diversity indicators, providing more than 30 of them. Secondly, it provides core microbial calculation and visualization functions. In general, it can be used as a complement to phyloseq or in conjunction with it.

Microbiome data analysis using MicrobiomeAnalystR

MicrobiomeAnalystR is the R package version of the MicrobiomeAnalyst web server (Figs. 2, 3 and S4A–J, Pipeline 3. MicrobiomeAnalystR.Rmd). Its functions include diversity analysis (Fig. S4A–F), difference analysis (Fig. S4G), biomarker identification (Fig. S4H and S4I), and a sample sequencing library size overview (Fig. S4J), which are more powerful than those of the previous two packages. The visualization combines base graphics, ggplot2 plotting, and interactive plotting. In terms of network analysis, it provides a process for calculating and plotting SparCC networks, which are more suitable for microbiome data. However, the package depends on many R packages from CRAN, Bioconductor, and GitHub, so a complete installation of MicrobiomeAnalystR requires a lot of effort.

Microbiome data analysis using Animalcules

The Animalcules package offers an alternative way to perform analysis on an interactive platform (Figs. 2, 3 and S5A–J, Pipeline 4. Animalcules.Rmd). It can calculate and plot sample statistics as bar plots (Fig. S5A) or interactive pie charts (Fig. S5B), calculate and visualize alpha diversity as dot plots (Fig. S5C), show grouped microbial taxonomic or functional composition as heatmaps and stacked plots (Fig. S5D and S5E), show feature abundance as boxplots (Fig. S5F), draw genus-level Bray distance heatmaps (Fig. S5G), perform ordination analysis (Fig. S5H and S5I), select biomarkers using random forest and logistic regression (Fig. S5J), and run other analyses. The results of these analyses can often be reanalyzed by interactively modifying parameters, and the images can be zoomed in and out, clicked to see details, and otherwise manipulated with the mouse for better pattern discovery. However, the results cannot be exported
as vector format, which does not meet the requirements for publication. Secondly, the analysis content is limited, especially for microbiome network analysis and correlation analysis between the microbiome and other indicators.

Microbiome data analysis using microeco

The microeco package is very powerful and uses an R6 class data structure (Figs. 2, 3 and S6A–L, Pipeline 5. microeco.Rmd). It includes microbial diversity (Fig. S6A and S6B) taxonomic composition

(Fig. 4G), and applied DESeq2 for difference analysis and generated multi-group volcano plots (Fig. 4H). We also used the e1071, caret, randomForest, and ROC packages for various machine learning analyses and generated microbiome weighted plots (Fig. 4I). Furthermore, we used ggClusterNet for microbiome network analysis (Fig. 4J), and constructed network graphs and combined plots to explore the associations between environmental factors and microbial communities (Fig. 4K). Finally, we used the FEAST package to perform community source tracking analysis and con-
Figure 4. Examples of the best practice results of microbial community analysis in R language. The selected results include rarefaction curve (A),
principal coordinate analysis scatter plot (B), Venn network graph (C), evolutionary tree (D), LEfSe cladogram (E), difference analysis extended error bar
plot in STAMP style (F), difference analysis Manhattan plot (G), difference analysis multi-group volcano plot (H), biomarker selection ring-column chart
(I), network graph (J), correlation connection combination graph (K), source tracing analysis pie chart (L).
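The difference-analysis panels rest on a per-feature test followed by multiple-testing correction. The sketch below shows that pattern in base R and produces the two quantities a volcano plot displays (log2 fold change and adjusted p-value). It is an illustration under stated assumptions, not the authors' pipeline: a Wilcoxon rank-sum test on relative abundances stands in for DESeq2, and the simulated counts, group labels, pseudocount, and 0.05 cutoff are all hypothetical.

```r
set.seed(7)
n_per_group <- 10
n_features <- 100
group <- factor(rep(c("Control", "Treatment"), each = n_per_group))

# Simulated count table (features in rows); illustrative values only.
otu <- matrix(
  rpois(n_features * 2 * n_per_group, lambda = 30),
  nrow = n_features,
  dimnames = list(paste0("ASV", seq_len(n_features)),
                  paste0("S", seq_len(2 * n_per_group)))
)
# Make the first three features clearly more abundant in the Treatment group.
otu[1:3, group == "Treatment"] <- otu[1:3, group == "Treatment"] + 100L

# Total-sum scaling to relative abundance per sample.
rel <- sweep(otu, 2, colSums(otu), "/")

is_trt <- group == "Treatment"
mean_trt <- rowMeans(rel[, is_trt])
mean_ctl <- rowMeans(rel[, !is_trt])

res <- data.frame(
  feature = rownames(rel),
  # A small pseudocount avoids division by zero for absent features.
  log2FC = log2((mean_trt + 1e-6) / (mean_ctl + 1e-6)),
  # Wilcoxon rank-sum test per feature (normal approximation, no exact p-value).
  p = apply(rel, 1, function(x) {
    wilcox.test(x[is_trt], x[!is_trt], exact = FALSE)$p.value
  }),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg correction across all features.
res$p_adj <- p.adjust(res$p, method = "BH")
res$significant <- res$p_adj < 0.05
```

Plotting `res$log2FC` against `-log10(res$p_adj)` gives a basic volcano plot. Count-based models such as those in DESeq2 or edgeR replace the test in the middle but leave this overall shape unchanged.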
on Kraken (Wood and Salzberg, 2014), MEGAN (Huson et al., 2007), MetaPhlAn2 (Truong et al., 2015), HUMAnN2 (Franzosa et al., 2018), eggNOG-mapper (Huerta-Cepas et al., 2017), etc. is becoming increasingly important, and the data processed in R have grown from megabytes (MB) to gigabytes (GB). Therefore, faster data-processing R packages, such as data.table, fst, and tidyfst, should be used in the microbiome data analysis process.
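To make the point about faster tooling concrete, the sketch below sums a feature table by genus in three ways and checks that the results agree. The simulated table and the genus labels are made up. `aggregate()` and `rowsum()` are base R functions; the data.table branch is an assumption about the reader's setup and only runs if that package is installed. Timings are captured with `system.time()` but will vary by machine, so no numbers are claimed here.

```r
set.seed(1)
n_features <- 5000
n_samples <- 40

# Simulated feature table and a genus label per feature (illustrative only).
otu <- matrix(
  rpois(n_features * n_samples, lambda = 5),
  nrow = n_features,
  dimnames = list(paste0("ASV", seq_len(n_features)),
                  paste0("S", seq_len(n_samples)))
)
genus <- sample(paste0("Genus", 1:50), n_features, replace = TRUE)

# Baseline: split-apply-combine on a data frame with aggregate().
t_aggregate <- system.time(
  by_aggregate <- aggregate(as.data.frame(otu), by = list(genus = genus), FUN = sum)
)

# Base alternative that works directly on the matrix.
t_rowsum <- system.time(by_rowsum <- rowsum(otu, group = genus))

# Both approaches should give the same genus-by-sample sums.
m_agg <- as.matrix(by_aggregate[, -1])
rownames(m_agg) <- as.character(by_aggregate$genus)
same_base <- isTRUE(all.equal(m_agg[rownames(by_rowsum), ], by_rowsum,
                              check.attributes = FALSE))

# Optional: the same grouped sum with data.table, if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(
    data.frame(genus = genus, otu, check.names = FALSE)
  )
  t_dt <- system.time(agg_dt <- dt[, lapply(.SD, sum), keyby = genus])
  df_dt <- as.data.frame(agg_dt)
  m_dt <- as.matrix(df_dt[, -1])
  rownames(m_dt) <- as.character(df_dt$genus)
  same_dt <- isTRUE(all.equal(m_dt[rownames(by_rowsum), ], by_rowsum,
                              check.attributes = FALSE))
}
```

Comparing `t_aggregate`, `t_rowsum`, and (when available) `t_dt` on a realistically sized table is a quick way to decide whether switching tools is worthwhile for a given project.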
The use of appropriate data structures can accelerate microbiome data processing. At first, we used S4 class objects for microbiome data encapsulation, which can complete a variety of analyses comprehensively and efficiently. The emergence of R6 class objects and other objects has greatly impacted microbiome data processing and largely facilitates it. With the development of the tidy family of R packages, tidy-based data structures have recently emerged for microbiome data mining. For example, the MicrobiotaProcess package (Xu

Consent for publication

All authors agree to publish.

Author contributions

J.Y. and Y.L. conceived and supervised the project; T.W. and G.N. implemented this project and wrote the paper; Y.L., T.C., and Q.S. provided critical comments and revised the paper.

Supplementary information

The online version contains supplementary material available at

References
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26:2460–2461.
Edgar RC, Flyvbjerg H. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 2015;31:3476–3482.
Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen 1936;7:179–188.
Franzosa EA, McIver LJ, Rahnavard G et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods
Metcalf JL, Xu ZZ, Weiss S et al. Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 2016;351:158–162.
Nearing JT, Douglas GM, Hayes MG et al. Microbiome differential abundance methods produce different results across 38 datasets. Nat Commun 2022;13:342.
Nguyen NH, Song Z, Bates ST et al. FUNGuild: an open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecol 2016;20:241–248.
Wickham H. The split-apply-combine strategy for data analysis. J Stat Softw 2011b;40:1–29.
Wirbel J, Zych K, Essex M et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 2021;22:93.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15:1–12.
Xu S, Li L, Luo X et al. Ggtree: a serialized data object for visualization of a phylogenetic tree and annotation data. iMeta 2022;1:e56.
Xu S, Zhan L, Tang W et al. MicrobiotaProcess: a comprehensive R package for deep mining microbiome. Innovation 2023;4:100388.
Zhao Y, Federico A, Faits T et al. animalcules: interactive microbiome analytics and visualization in R. Microbiome 2021;9:1–16.