Protein Cell, 2023, 14, 713–725

https://doi.org/10.1093/procel/pwad024
Advance access publication 2 May 2023
Review
Protein & Cell



The best practice for microbiome analysis using R
Tao Wen1,2,‡, Guoqing Niu2,‡, Tong Chen3, Qirong Shen2, Jun Yuan2,*, Yong-Xin Liu1,*
1 Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
2 The Key Laboratory of Plant Immunity, Jiangsu Provincial Key Lab for Organic Solid Waste Utilization, Jiangsu Collaborative Innovation Center for Solid Organic Waste Resource Utilization, National Engineering Research Center for Organic-based Fertilizers, Nanjing Agricultural University, Nanjing 210095, China
3 National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
‡ These authors contributed equally to this work.
* Correspondence: [email protected] (J. Yuan), [email protected] (Y.-X. Liu)

Abstract
With the gradual maturation of sequencing technology, many microbiome studies have been published, driving the emergence and advancement of related analysis tools. The R language is a widely used platform for microbiome data analysis because of its powerful functions. However, tens of thousands of R packages and numerous similar analysis tools make it challenging for many researchers to explore microbiome data. How to choose suitable, efficient, convenient, and easy-to-learn tools from the numerous R packages has become a problem for many microbiome researchers. We have organized 324 common R packages for microbiome analysis and classified them according to application categories (diversity, difference, biomarker, correlation and network, functional prediction, and others), which could help researchers quickly find relevant R packages for microbiome analysis. Furthermore, we systematically sorted the integrated R packages (phyloseq, microbiome, MicrobiomeAnalystR, Animalcules, microeco, and amplicon) for microbiome analysis and summarized their advantages and limitations, which will help researchers choose the appropriate tools. Finally, we thoroughly reviewed the R packages for microbiome analysis, summarized most of the common analysis content in the microbiome field, and formed the most suitable pipeline for microbiome analysis. This paper is accompanied by hundreds of examples with 10,000 lines of code on GitHub, which can help beginners learn and help analysts compare and test different tools. This paper systematically sorts the applications of R in microbiome research, providing an important theoretical basis and practical reference for the development of better microbiome tools in the future. All the code is available on GitHub at github.com/taowenmicro/EasyMicrobiomeR.

Keywords R package, microbiome, data analysis, visualization, amplicon, metagenome

Introduction
Metagenomic analysis is used to study microbial diversity, structure, and function by sequencing, quantifying, annotating, and analyzing DNA and/or RNA sequences of microbial communities or microbiota. The high-throughput sequencing technologies commonly used in microbiome research are mainly amplicon sequencing and shotgun metagenomic sequencing. Amplicon sequencing, with the advantages of low cost, a mature analysis system, and a simple analysis process, has been widely used in microbiome research. Shotgun metagenomic sequencing provides functional information on microbes and more accurate information on microbial composition, at the price of a higher sequencing cost and a large amount of computational resources. The detailed pipelines for both sequencing methods have been systematically summarized in our previous review (Liu et al., 2021). As an important component of biodiversity, microbial communities play a vital role in biology, ecology, biotechnology, agriculture, and medicine. Various bioinformatics methods are required for microbial community analysis, which mainly includes three parts: (i) data preprocessing, (ii) quantification and annotation, and (iii) statistics and visualization (Fig. 1A). In the preprocessing step, the raw data are filtered and quality controlled to ensure data quality. In the quantification and annotation step, tools and databases are used to identify microbial representative sequences and annotate microbial taxonomy and function. The first two parts of microbial community analysis have been well discussed and can be carried out following our previous paper (Liu et al., 2023). Finally, in the statistics and visualization step, various statistical methods are used to explore microbial community diversity, structure, and potential functions.
With the development of high-throughput sequencing technology, plenty of studies have been performed with amplicon sequencing (Thompson et al., 2017; Proctor et al., 2019) and shotgun metagenomic sequencing (Carrión et al., 2019; Li et al., 2022; Paoli et al., 2022), which led to the development of microbiome analysis methodologies, software, and pipelines, for example, QIIME (Caporaso et al., 2010), Mothur (Schloss et al., 2009), USEARCH (Edgar, 2010), VSEARCH (Rognes et al., 2016), QIIME 2 (Bolyen et al., 2019), Parallel-Meta Suite (Chen et al., 2022), EasyAmplicon (Liu et al., 2023), Kraken (Wood and Salzberg,

Received 28 February 2023; accepted 2 April 2023.


©The Author(s) 2023. Published by Oxford University Press on behalf of Higher Education Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
[Figure 1, panels A to D. (A) Workflow: raw data preprocessing (USEARCH/VSEARCH, QIIME2, DADA2); quantification and annotation, producing the OTU table, Taxonomy, Metadata, Tree, and Rep.fa files; statistical analysis. (B) Word cloud of R packages, with the analysis categories 1 Diversity analysis, 2 Difference analysis, 3 Biomarkers identification, 4 Network analysis, 5 Function prediction, 6 Other analysis. (C) R packages for data cleaning and visualization. (D) R packages grouped into diversity analysis, difference analysis, biomarkers identification, network analysis, function prediction, and other analysis.]

Figure 1. Microbial community data analysis workflow and related R packages. (A) Overview of microbial community data analysis workflow. Core
files are feature table (OTU), Taxonomy, sample metadata (Metadata), phylogenetic tree (Tree), and representative sequences (Rep.fa). (B) Detail of
microbial community analysis workflow. First, the raw data can be processed by using USEARCH/VSEARCH, QIIME 2, DADA2 packages. Then, the
important files are saved and used for downstream analysis in R language and RStudio software. Many microbial analysis methods rely on numerous R
packages developed with R language. The font size in the word cloud represents the number of citations of R packages. (C) Commonly used R packages
for data-cleaning/manipulation and visualization. (D) Classification of R packages for six categories in microbial community analysis.
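
As a minimal sketch of how the five core files named in the caption might be imported, the following uses base R on made-up toy data written to temporary files. The file names, the helper read_fasta_simple(), and all values are hypothetical and are not taken from the authors' Code 1A. The tree is parsed only if the ape package happens to be installed.

```r
# Minimal sketch: import the core files of a microbiome study with base R.
# All file contents below are made-up toy data written to temporary files.
tmp <- tempdir()

# Feature (OTU/ASV) table: rows = features, columns = samples
otu_file <- file.path(tmp, "otutab.txt")
writeLines(c("ID\tS1\tS2\tS3",
             "ASV1\t10\t0\t5",
             "ASV2\t3\t8\t0",
             "ASV3\t0\t2\t7"), otu_file)

# Taxonomy annotation: one row per feature
tax_file <- file.path(tmp, "taxonomy.txt")
writeLines(c("ID\tPhylum\tClass",
             "ASV1\tProteobacteria\tBetaproteobacteria",
             "ASV2\tProteobacteria\tGammaproteobacteria",
             "ASV3\tFirmicutes\tBacilli"), tax_file)

# Sample metadata: one row per sample
meta_file <- file.path(tmp, "metadata.txt")
writeLines(c("SampleID\tGroup",
             "S1\tGroup1",
             "S2\tGroup2",
             "S3\tGroup1"), meta_file)

# Representative sequences in FASTA format (one sequence line per record)
fa_file <- file.path(tmp, "rep.fa")
writeLines(c(">ASV1", "ATCGATCG", ">ASV2", "ATCGATGG", ">ASV3", "TTCGATCG"),
           fa_file)

# Tabular files: the first column becomes the row names
otu  <- read.delim(otu_file,  row.names = 1, check.names = FALSE)
tax  <- read.delim(tax_file,  row.names = 1, stringsAsFactors = FALSE)
meta <- read.delim(meta_file, row.names = 1, stringsAsFactors = FALSE)

# Toy FASTA parser for single-line records (hypothetical helper);
# Biostrings::readDNAStringSet() is the usual choice for real data.
read_fasta_simple <- function(path) {
  lines  <- readLines(path)
  is_hdr <- startsWith(lines, ">")
  stats::setNames(lines[!is_hdr], sub("^>", "", lines[is_hdr]))
}
rep_seqs <- read_fasta_simple(fa_file)

# Phylogenetic tree in Newick format: parsed only if ape is available
newick <- "((ASV1:0.1,ASV2:0.1):0.2,ASV3:0.3);"
tree <- if (requireNamespace("ape", quietly = TRUE)) {
  ape::read.tree(text = newick)
} else {
  NULL
}

# Consistency checks: feature and sample IDs must match across files
stopifnot(identical(rownames(otu), rownames(tax)),
          identical(colnames(otu), rownames(meta)),
          identical(names(rep_seqs), rownames(otu)))
```

Checking that feature and sample identifiers agree across files right after import tends to catch most downstream errors early, which is why the sketch ends with stopifnot().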

2014), MEGAN (Huson et al., 2007), MetaPhlAn2 (Truong et al., 2015), HUMAnN2 (Franzosa et al., 2018), etc. As the most crucial and basic procedure for amplicon sequencing data analysis, the OTU (operational taxonomic unit) clustering method was popular before 2015, while non-clustering methods have gradually been developed and widely adopted in recent years. Currently, the common non-clustering methods include DADA2 (Callahan et al., 2016), deblur (Amir et al., 2017), and unoise3 (Edgar and Flyvbjerg, 2015).

One of the most representative non-clustering algorithms among them is DADA2, which was created with the R language. This gives the R language (Ihaka and Gentleman, 1996) an important position in raw data processing for amplicon sequencing. Whereas many software tools can be used in the upstream steps of microbiota sequencing data analysis, the downstream analysis steps rely heavily on the R language and its various packages. These analyses mainly include: (i) diversity analysis; (ii) difference analysis; (iii) correlation and network analysis; (iv) biomarker identification; (v) functional prediction; (vi) integrative analysis of microbial communities with other indicators (including phylogenetic analysis, multi-omics integration, environmental factor analysis, etc.). In addition to the many kinds of multivariate statistical analysis that can be done in R, there are diverse data-cleaning packages that allow data to be transformed between different analyses.
R is a free, open-source language and environment for statistical data analysis and visualization, which was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand and is now maintained by the “R Development Core Team”. Compared with other analysis tools, such as SPSS, MINITAB, and MATLAB, which are more suitable for the statistics of processed and standardized data, the R language can handle processed data as well as raw data. R can implement almost all analysis methods, and many of the latest methods or algorithms were first implemented in it. Furthermore, R offers excellent data visualization, particularly for complex data. Its powerful and flexible interactive analysis, which also enables visual data exploration, is another advantage. The functionality of the R language relies heavily on thousands of R packages, which provide a wide variety of data processing and analysis strategies, allowing almost any data analysis process to be done in R. The total number of R packages published on CRAN is 18,981, and on Bioconductor 2,183 (as of January 31, 2023). These packages demonstrate the powerful data processing and analysis capability of R.
In recent years, numerous R packages have been developed for the downstream analysis of microbiome data, which have made important contributions to the associated research field. However, the number of downstream analysis R packages has reached a dizzying level (Fig. 1B). In addition, integrated R packages containing a large amount of microbiome analysis content, such as phyloseq (McMurdie and Holmes, 2013), microeco (Liu et al., 2020), and amplicon (Liu et al., 2023), have gradually emerged. This abundance of R packages provides microbiome analysts with more choices, but also makes it difficult to identify the most suitable tool among many similar ones. Furthermore, this plethora of R packages makes it difficult for beginners to embark on a well-organized learning path for microbiome analysis. Therefore, it is urgent to compare similar analysis functions and extract their similarities and differences, in order to select the best process for microbiome analysis and help beginners learn more effectively.
This paper attempts to sort and run the 324 common R packages (Fig. S1), especially the integrated R packages for microbiome analysis, and complete the following three parts: (i) compare different R package analysis processes according to the functional categories of microbiome analysis, analyze the results, and summarize example code; (ii) organize the content of six integrated R packages according to the functional categories of microbiome analysis, compare the analysis results, and generate example code; (iii) based on all R packages, select the optimal analysis approach using the R language and provide example code for researchers' reference and learning.

Preparing microbiome data analysis
Downstream analysis of microbiome data requires the preparation of five data files: a feature table, a feature annotation file, a sample metadata file, a phylogenetic tree, and representative sequences. For beginners, it is important to understand the format and basic data structure of these files and learn how to import them into R. Furthermore, different analyses often have different requirements for data, so it is necessary to learn some data manipulation skills to meet the demands of various functions. Finally, it is necessary to learn the basics of R plotting to facilitate the presentation of results.

Data preparation and cleaning
After sequence data preprocessing, quantification, and annotation, the output files need further processing, including importing the files, cleaning the data, and converting formats, which is required for subsequent microbiome analysis in R. Before statistical analysis, we must master the basic procedures of the R language to cope with the data input requirements of different packages. This section covers importing, organizing, filtering, basic calculations, conversion, normalization, and modification of data. Five forms of data are frequently produced by raw data processing: feature tables (file formats .csv/.txt/.xlsx/.biom; typically taxonomic and functional tables, including OTU/ASV/taxonomy/gene/module/pathway tables), feature annotation (.csv/.txt/.xlsx/.biom), sample metadata (.csv/.txt), evolutionary/phylogenetic trees (.nwk/.tree), and representative sequences (.fasta/.fas/.fa). All the data cleaning-related packages are shown in Fig. 1C. Tabular data input for microbial communities is primarily accomplished using functions such as read.table(), read.delim(), and read.csv() in the utils package (Code 1A, script on GitHub at github.com/taowenmicro/EasyMicrobiomeR). Reading evolutionary tree files depends on functions like read.tree() in the ape/ggtree/treeio packages, or read_tree() in the phyloseq package. For reading representative sequence files, readDNAStringSet() in the Biostrings package (Pages et al., 2016) is typically used. Currently, big data integration of microbiome studies has become a trend, leading to the emergence of R packages that integrate data from multiple studies, such as curatedMetagenomicData (Pasolli et al., 2017). With this package, users only need to load the package to re-analyze the curated data, rather than starting from raw sequencing data.
The basic idea of data organization can be summarized as three steps: splitting the data, processing with functions, and combining the output results into the desired format. The functions of the basic packages in R can be combined to meet most requirements of microbiome data operations. For example, a “for loop” combined with basic statistical functions [sum(), mean(), sd(), etc.] can be used to perform basic statistical analysis and data transformations for microbial relative abundance (Code 1B); base R also provides the apply family of functions, including apply(), sapply(), lapply(), tapply(), and the related aggregate(), which can be used to quickly complete the three stages of data processing. The apply family of functions provides a framework that acts as an alternative to the “for loop” and is generally more concise, and can be faster, than an explicit “for loop” in R (Code 1B). Similarly, the purrr package can be used in place of a “for loop” to perform efficient operations.
The plyr (Wickham, 2011b) package builds on base R and offers a variety of data-wrangling operations for data frames, lists, etc. The plyr package provides the three data processing stages “Split–Apply–Combine” in one function, and the plyr package implements grouping transformations between
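
The split, apply, combine idea described in this section can be sketched with base R alone. The snippet below is a minimal illustration on a made-up feature table; the object names (otu, phylum, rel_loop, rel_apply, by_phyla) are hypothetical and are not taken from the authors' Code 1B or 1C. It computes relative abundance with an explicit for loop and with apply(), then sums relative abundance per phylum in two equivalent ways.

```r
# Toy feature table (made-up counts): rows = features, columns = samples
otu <- matrix(c(10,  0, 5,
                 3,  8, 0,
                 0,  2, 7,
                 7, 10, 8),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("ASV", 1:4), c("S1", "S2", "S3")))

# Phylum assignment of each feature (made up)
phylum <- c(ASV1 = "Proteobacteria", ASV2 = "Proteobacteria",
            ASV3 = "Firmicutes",     ASV4 = "Firmicutes")

# 1) Relative abundance with an explicit for loop over samples
rel_loop <- otu
for (j in seq_len(ncol(otu))) {
  rel_loop[, j] <- otu[, j] / sum(otu[, j])
}

# 2) The same with apply(): the function is applied to every column (MARGIN = 2)
rel_apply <- apply(otu, 2, function(x) x / sum(x))

# 3) Split-apply-combine: relative abundance summed per phylum
#    split the rows by phylum, apply colSums() to each piece, combine with rbind()
groups   <- phylum[rownames(rel_apply)]
pieces   <- split(as.data.frame(rel_apply), groups)
by_phyla <- do.call(rbind, lapply(pieces, colSums))

# rowsum() performs the same grouped sum in a single call
by_phyla2 <- rowsum(rel_apply, group = groups)

print(round(by_phyla, 2))
```

Both grouped sums return one row per phylum and one column per sample; which form to prefer is mostly a matter of readability for tables of this size.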

R types (vector, list, and data frame) and basically replaces the apply family of functions in the base package. It can easily handle grouping calculations, for example, microbial abundance at different taxonomy levels (Code 1C). The reshape2 (Wickham, 2007) package provides long-wide format transformation during data processing. ggplot2 (Wickham, 2011a) plotting functions and most modeling functions, such as lm(), glm(), and gam(), often use long data, whereas microbiome data are generally presented in wide form, so the transformation of microbiome data for plotting can be done using reshape2 (Code 1D).
The dplyr package is a member of the tidyverse family. It innovatively abandons the common form of data storage in R in favor of the tibble format (more powerful than the data.frame format) for data processing, which makes data frame selection, merging, row- and column-wise statistics, and conversion between long and wide formats more efficient. The “%>%” pipe operator can be used to complete more complex data processing. The tibble format can store data during the analysis and modeling process, which is important for data analysis. For example, we demonstrated the use of dplyr and the pipe to run random forest modeling and the selection of important variables (Code 1E).

Visualization in R language
In most cases, standard graphs are used to display microbiome data, such as alpha/beta diversity and taxonomic composition. All the visualization-related packages are shown in Fig. 1C. Due to the widespread use of ggplot2 (Code 2A), many extension packages have emerged that build on ggplot2 and offer a wide range of plotting styles, colors, and themes. These packages mainly include ggtern for ternary graphs (Code 2B; Hamilton and Ferry, 2018), ggraph for network graphs (Code 2C; Si et al., 2022), ggtree for evolutionary trees or cladograms (Code 2D; Xu et al., 2022), the ggalluvial package, the ggVennDiagram package (Code 2E), the ggstatsplot package for pie charts, and the ggpubr package, which provides many themes and color schemes for output. In addition, the pheatmap and ComplexHeatmap packages (Gu, 2022), based on the grid graphics system, plot the relative abundance of features in different samples (Code 2F), and the VennDiagram package (Chen and Boutros, 2011) can show the number of features in different samples. The UpSetR package (Conway et al., 2017) draws UpSet plots, a newer form of plot similar to the Venn diagram. The base plotting system is complex and difficult to learn, but it is a good choice for complex graphs, such as those of the circlize (Gu et al., 2014) package (Code 2G), which draws chord diagrams of microbiota composition. Additionally, microbiome plotting work often involves combining graphics. At present, many tools in R can combine graphics, such as cowplot, patchwork, and aplot. The patchwork package has the most powerful functions and supports modular assembly of graphics (Code 2H).

Microbial community analysis
We have categorized the analysis of microbiome data into the following six major types in Fig. 1D: diversity analysis, difference analysis, biomarker identification, correlation and network analysis, functional prediction, and other microbiome analyses (including source tracking analysis, community assembly processes, and analysis of associations between microbiota and environmental factors). We then organized, compared, and summarized all relevant R packages.

Diversity analysis
Microbial community diversity analysis mainly includes alpha diversity (Richness, Shannon, Simpson, Chao1, ACE, etc.), rarefaction curves, beta diversity (ordination and clustering analysis), and taxonomic or functional composition. Here we must introduce the vegan package (Oksanen et al., 2007), an abbreviation for Vegetation Analysis, written by nine quantitative ecologists, including Oksanen from Finland, and initially designed specifically for community ecology data. The package provides a variety of methods for data standardization and transformation. For example, data used for alpha diversity analysis can be normalized to the same sequencing depth with rrarefy(), and data for ordination analysis can be normalized with decostand() (Code 3A). After the sequencing data have been normalized by subsampling, diversity calculations are more reasonable. In addition, alpha diversity metrics can also be calculated with the ade4 (Dray and Dufour, 2007), adespatial (Dray et al., 2018), and picante (Kembel et al., 2010) packages. For example, phylogenetic diversity can be calculated using pd() in the picante package (Code 3A). Vegan not only allows for alpha diversity analysis, but also provides functions such as rda() for principal component analysis (PCA) and redundancy analysis (RDA), cca() for correspondence analysis (CA) and canonical correspondence analysis (CCA), decorana() for detrended correspondence analysis (DCA), and metaMDS() for non-metric multidimensional scaling (NMDS) for microbiome ordination analysis (Code 3B). The prcomp() function in the stats package can be used for principal component analysis (PCA), which is a kind of dimension reduction analysis. The mca() function provided by the MASS package and MCA() provided by the FactoMineR package can be used for multiple CA (Code 3B); the ape package provides the pcoa() function for principal coordinate analysis (PCoA); the MASS package provides lda() for linear discriminant analysis (LDA, Code 3C). Before running many ordination operations or community clustering, a distance matrix is often needed. The vegdist() function in the vegan package can calculate Euclidean, Manhattan, Bray, Canberra, and other distances (Code 3B). In addition, distance calculation can also be done using dist() in the stats package. The distance matrix can be used for clustering analysis in addition to ordination analysis. The hclust() function in the stats package can be used for clustering analysis; similar functionality is available through the factoextra package and kmeans() (Code 3D). Microbial composition analysis is mainly used to display the abundance of microbes; the data are typically organized with the dplyr package and then displayed with ggplot2.

Difference analysis
Difference analysis is divided into community-level analysis and feature-level (any hierarchy of taxonomy and function) analysis. Community-level difference analysis is mainly performed with functions including adonis(), anosim(), and mrpp() in the vegan package, and mantel.test() in the ape package (Code 4A). Feature-level difference analysis of compositional data can use wilcox.test() (Code 4B) and t.test() (Code 4C) in the stats package. Subsequently, data correction algorithms were developed specifically for sequencing data, such as the upper quartile (UQ), trimmed mean of M-values (TMM) (Code 4C), and relative log expression (RLE) methods in the edgeR package (Robinson et al., 2009) (Code 4D), the median of ratios method (MED) in the DESeq2 package (Love et al., 2014) (Code 4E), and the cumulative-sum scaling (CSS) algorithm in the metagenomeSeq package (Code 4F). Furthermore, the ALDEx2 package provides Dirichlet-multinomial models which can be used to infer feature abundance and calculate feature differences with non-parametric tests, t-tests, or generalized linear models
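
The diversity calculations named in the Diversity analysis section can be sketched in base R to show what the package functions compute. This is a minimal sketch on made-up counts, not the authors' Code 3A to 3D: the hand-written Shannon index and the bray() helper are hypothetical stand-ins for what vegan's diversity() and vegdist() would normally provide, and the square root of relative abundance is used as an assumed pre-treatment before PCA.

```r
# Toy count table (made-up data): rows = samples, columns = features,
# the orientation that community ecology functions usually expect
comm <- matrix(c(10, 10,  0,  0,
                  5,  5,  5,  5,
                  0,  0, 10, 10,
                  1,  2,  8,  9),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("S", 1:4), paste0("ASV", 1:4)))

# Alpha diversity: richness and Shannon index H' = -sum(p * log(p))
richness <- rowSums(comm > 0)
shannon  <- apply(comm, 1, function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
})

# Beta diversity: Bray-Curtis dissimilarity, sum(|x - y|) / sum(x + y)
# (hypothetical helper written out for illustration)
bray <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
    }
  }
  as.dist(d)
}
bc <- bray(comm)

# Hierarchical clustering of samples on the distance matrix
hc       <- hclust(bc, method = "average")
clusters <- cutree(hc, k = 2)

# Ordination: PCA on the square root of relative abundance
hel       <- sqrt(comm / rowSums(comm))
pca       <- prcomp(hel, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(shannon, 3))
print(round(as.matrix(bc), 2))
```

For real data, rarefying or otherwise normalizing sequencing depth before computing these indices matters, as the text notes; the toy table here already has equal depth in every sample.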

(Code 4G). The ANCOM-BC package attempts to address sample heterogeneity by correcting bias with a log-linear model. In addition, other R packages for microbiome data correction and difference tests include limma (Code 4H), DR, ANCOM (Lin and Peddada, 2020) (Code 4I), corncob (Code 4J), Maaslin2 (Code 4K), etc. Nearing et al. (2022) compared these difference analysis methods and proposed that ALDEx2 and ANCOM-II (anchom_v2.1.R, Code 4L) were the best performers in the difference analysis of microbial communities. As for significance testing, different packages use different methods. For example, the Fisher test is used in the edgeR package; the Wald test is used in the DESeq2 and corncob packages; the t-test is used in the limma package. Other methods for significance testing include the Wilcoxon rank-sum test (ALDEx2 and ANCOM-II) and ANOVA (Maaslin2).

Biomarker identification
Characteristic microbial consortia are explored to answer certain questions, such as identifying the biomarkers of the gut in obese or hypertensive populations, or of soil in which Fusarium wilt develops. For microbes selected through difference analysis, it is often impossible to determine whether they represent the main differences of concern. Therefore, weight analysis or machine learning methods are used to further distinguish the feature microbes.
The methods commonly used for weight analysis are linear discriminant analysis effect size (LEfSe), PCA, etc. (Code 5A). LEfSe was developed specifically for microbiome data, and its core functionality is implemented using LDA (Fisher, 1936) and the MASS package (Ripley et al., 2013). By extracting the loading matrix of the PCA ordination, the microbes with the greatest impact on sample variation are identified as biomarkers (Code 5B).
In terms of machine learning, the random forest model, which is widely used in microbiome analysis, is implemented using the randomForest package (Liaw and Wiener, 2002) (Code 5C). There are many other decision tree-based machine learning models: the mboost (Hofner et al., 2014) package provides boosting-based algorithms, the e1071 (Dimitriadou et al.,

SpiecEasi package can use the sparcc method, which is more suitable for microbiome data, to calculate the correlation matrix, and can use multiple threads to accelerate the calculation. Network visualization and attribute calculation can be done with the igraph (Code 6E), network, and ggraph packages (Code 6F). These R packages contain many layout algorithms for network visualization. In addition, networks visualized with network packages combined with ggplot2 are easier to modify. The sna and ggraph packages have many visualization layout algorithms that increase the styles of network visualization. With the increasing use of network analysis in microbiome analysis, more attention is being paid to network modularity and to key groups identified through network modules. The WGCNA package provides a complete framework to quickly complete correlation calculation, network module calculation, module eigenvector calculation, and exploration of other network properties. The recently developed ggClusterNet (Wen et al., 2022) package (Code 6G) provides a unified framework for microbiome networks and implements a variety of unique module-based visualization algorithms to visualize the module relationships in the network.

Functional prediction
The Tax4Fun (Aßhauer et al., 2015) R package (Code 7A) for functional prediction from 16S rDNA was developed to more accurately predict changes in microbial community function using amplicon data. The package has been updated to Tax4Fun2 (Wemheuer et al., 2020). Microeco can implement FAPROTAX (Louca et al., 2016) prediction for bacteria/archaea and FUNGuild (Nguyen et al., 2016) prediction for fungi, which are based on databases of taxonomic functional descriptions curated from published papers. Functional prediction enables the prediction of microbial community function and subsequent statistical analysis. Additionally, vegan can be used for diversity analysis, while the edgeR, DESeq2, and limma packages can be used for difference analysis. For functional enrichment, the clusterProfiler (Code 7B) package can perform GO, KEGG, GSEA and GSVA enrichment, which considers gene/pathway abundance
2008) package provides support vector machines svm() in Code and is recommended. Furthermore, the clusterProfiler package
5D, and plain Bayes naiveBayes(). The xgboost package can inte- provides plot functions based on the ggplot syntax, allowing to
grate many tree models together to form a strong classifier, which plot appealing graphics in a simple manner. Gene/Pathway net-
can prevent overfitting via many strategies, including regulariza- work analysis can be performed using WGCNA for calculation,
tion terms, shrinkage, and column subsampling, etc. In addition, and ggClusterNet for network parameter calculation and visual-
the pROC (Robin et al., 2011) package is used to plot the oper- ization. However, the reliability of functional prediction results,
ating characteristic curve (ROC, Code 5D) to evaluate the effi- particularly for environmental samples, is currently disputed,
ciency of machine learning models. The Caret package provides and therefore, further verification of analysis results is often
cross-validation to determine the number of features (Kuhn, required.
2008). Currently, Wirbel et al. (2021) developed an open-source
R package SIAMCAT, a powerful yet user-friendly computational Other microbiome analysis
machine learning toolkit tailored to the characteristics of micro- Analysis for microbial community formation process commonly
biome data. used in the framework proposed by Stegen et al. (2013) to calculate
βNTI and RC-Bray indices with R packages minpack.lm, picante,
Correlation and network analysis Hmisc, eulerr, FSA, ape, stats4, and others (Code 8A). Ning et al.
Microbial co-occurrence network analysis is used to find microbial (2020) used a phylogenetic binning-based null model analysis to
modules that may have mutualistic relationships. Co-occurrence infer quantitative mechanisms underlying community assembly,
network analysis mainly includes the calculation of correlations, and developed the R package iCAMP (Code 8B). It allows for the
network visualization, and the calculation of network properties. quantitative assessment of the relative importance of different
The common R packages for calculation of correlations are psych ecological processes (e.g., homogenizing selection, heterogeniz-
(Revelle and Revelle, 2015) (Code 6A), WGCNA (Langfelder and ing selection, dispersal, and drift) on both the entire community
Horvath, 2008) (Code 6B), Hmisc (Harrell Jr and Harrell Jr, 2019) and each phylogenetic bin (which is usually composed of taxa
(Code 6C), and SpiecEasi (Kurtz et al., 2015) (Code 6D). Among from a single family or order with distinct ecological characteris-
these R packages, WGCNA has the highest calculation speed, tics). In addition, the package also provides neutral theory mod-
while requiring additional P-value correction; psych can calculate els, phylogenetic and taxonomic null model analyses at both the
correlation with correct P-value, but the speed is very low; the community and clade levels, calculation of niche differences and

phylogenetic distances between clades, and tests for phylogenetic signals within individual phylogenetic bins.
Microbial communities are often analyzed for their correlation with environmental indicators. For example, mantel() provided by the vegan package is used to examine the correlation between microbial communities and environmental indicators, and wascores() and mantel.correlog() are used to detect the phylogenetic signal between microbial communities and environmental factors (Code 8C). In addition, the ggClusterNet package can be used to calculate the co-occurrence relationships between microbes/microbiome and environmental factors, and to generate publication-ready figures (Code 8D).
Knights et al. (2011) proposed the microbiome traceability tool SourceTracker in the R language. Metcalf et al. (2016) predicted the time of death and tracked the source microbes of real cadavers based on microbial communities, after which microbial traceability analysis gradually became popular. Shenhav et al. (2019) proposed a new algorithm in R, FEAST (Code 8E), which makes microbial traceability analysis more efficient and faster, with fewer false positives.

Integrated R packages for microbiome
As microbiome sequencing becomes more popular, R packages dedicated to microbiome data processing are gradually emerging (Fig. 2). McMurdie and Holmes (2013) developed the phyloseq package: a comprehensive tool for microbiome data (including feature tables, phylogenetic trees, and feature annotation) clustering, integrating data import, storage, analysis, and output. The package utilizes many tools in R for ecological and phylogenetic analyses (vegan, ade4, ape, and picante) and uses ggplot2 to output high-standard figures. The data storage structure uses an S4-like storage system to store all relevant data as a single experiment-level object, thus making it easier to share data and reproduce the analysis. Subsequently, the packages microbiome, MicrobiomeAnalystR (Chong et al., 2020), microViz (Barnett et al., 2021), and microbiomeSeq emerged under this framework. Later, the microeco package was developed on the R6 framework and provides more analysis functions. With the need for interactive data analysis, Animalcules (Zhao et al., 2021) emerged. EasyMicroPlot also uses an interactive interface for microbiome data exploration, allowing for rapid downstream analysis of the microbiome (Fig. 3; Table 1).

Microbiome data analysis using phyloseq
Phyloseq, using the S4 class object, is more suitable for object-oriented programming and has had a great impact on microbiome data analysis (Figs. 2, 3 and S2A–G, Pipeline 1. phyloseq.Rmd). Through the S4 class object, phyloseq allows the five parts of data (the feature table, feature annotation, metadata, representative sequences, and evolutionary tree) to maintain correspondence under the same framework, and provides a variety of filtering functions on microbial features and samples, allowing

[Figure 2: diagram mapping the functions of six integrated packages (phyloseq, microeco, microbiome, Animalcules, MicrobiomeAnalystR, and EasyAmplicon) onto the analysis categories: alpha diversity, beta diversity, species composition, evolutionary tree, difference analysis, biomarker identification, network analysis, function prediction, community building, and association analysis. The input data shown are the OTU table, taxonomy, metadata, tree, and representative sequences (Rep.fa).]

Figure 2. Introduction to the functions of integrated microbial analysis R packages. Microbial community analysis can be divided into diversity
analysis, difference analysis, biomarker identification, correlation and network analysis, functional prediction, and other microbial community analysis
(community building/construction process, association analysis with other indicators).
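
The correspondence that phyloseq maintains between the parts of a dataset can be illustrated with a small S4 class. The sketch below is not phyloseq's implementation: the class name MiniMicrobiome and the helper keep_features() are hypothetical, only three of the five parts (feature table, feature annotation, metadata) are modelled, and the counts are made up for illustration.

```r
library(methods)

# Hypothetical container, NOT phyloseq's own class: it only illustrates how an
# S4 validity check can keep several tables in correspondence.
setClass(
  "MiniMicrobiome",
  slots = c(otu = "matrix", tax = "data.frame", meta = "data.frame"),
  validity = function(object) {
    if (!identical(rownames(object@otu), rownames(object@tax))) {
      return("feature IDs of 'otu' and 'tax' must match")
    }
    if (!identical(colnames(object@otu), rownames(object@meta))) {
      return("sample IDs of 'otu' and 'meta' must match")
    }
    TRUE
  }
)

# Hypothetical helper: filter features once, and every feature-level part
# is subset with the same index, so the parts cannot drift apart.
keep_features <- function(x, keep) {
  new("MiniMicrobiome",
      otu  = x@otu[keep, , drop = FALSE],
      tax  = x@tax[keep, , drop = FALSE],
      meta = x@meta)
}

# Toy data (made-up counts, 4 features x 3 samples).
set.seed(1)
otu <- matrix(rpois(12, 20), nrow = 4,
              dimnames = list(paste0("ASV", 1:4), paste0("S", 1:3)))
tax  <- data.frame(Genus = c("A", "B", "A", "C"), row.names = rownames(otu))
meta <- data.frame(Group = c("ctrl", "ctrl", "trt"), row.names = colnames(otu))

mm  <- new("MiniMicrobiome", otu = otu, tax = tax, meta = meta)
sub <- keep_features(mm, mm@tax$Genus == "A")
rownames(sub@otu)
```

Because new() runs the validity function whenever slots are supplied, an object whose tables disagree on feature or sample IDs is rejected at construction time rather than producing silently misaligned results later.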

the five parts of data to be filtered consistently without considering the differences among them. It also provides microbiome analysis through microbial data filtering and normalization, diversity calculation (Fig. S2A and S2B), microbial composition visualization (Fig. S2C and S2D), evolutionary tree visualization, and network analysis (Fig. S2E). The beta diversity function provides more than 30 distance algorithms, far more than those provided by packages such as vegan. Secondly, the phyloseq package uses ggplot for graphical visualization (Fig. S2F), which makes figures easier to generate and modify. Additionally, phyloseq can integrate the evolutionary tree with feature taxonomy and abundance on tree branches and leaves (Fig. S2G), which makes the tree informative and beautiful.

Microbiome data analysis using microbiome
The microbiome package also uses S4 class objects, like phyloseq, and can also perform most microbiome analyses (Figs. 2, 3 and S3A–G, Pipeline 2. Microbiome.Rmd). It includes microbial diversity analysis (Fig. S3A–E) and difference analysis (Fig. S3F and S3G). Compared with phyloseq, the microbiome package

[Figure 3: tree diagram of typical analysis results. Package numbering: ① Phyloseq, ② Microbiome, ③ MicrobiomeAnalystR, ④ Animalcules, ⑤ Microeco, ⑥ EasyAmplicon. Branch labels: alpha diversity, beta diversity, features composition, evolutionary tree, difference analysis, biomarker identification, network analysis, function prediction, community building, other analysis.]

Figure 3. Typical results of integrated microbial community analysis R packages and comparison of similar results. Group the analysis results of
multiple integrated R packages according to the major categories of microbial community analysis functions. Each main branch in the tree diagram
represents a type of microbial community analysis, and there are a total of 10 main branches: feature diversity analysis including (i) alpha diversity
analysis, (ii) beta diversity analysis, (iii) features (community taxonomic or functional) composition analysis, (iv) evolutionary or taxonomic tree
analysis; (v) difference analysis; (vi) biomarker identification; (vii) correlation and network analysis; (viii) functional prediction; (ix) community building/
construction process analysis; (x) other analysis, such as association analysis with other indicators. Each leaf (circle) represents a style of the result
displayed in the analysis, and the circle number around the outside of leaf represents the package number of the integrated R package that the analysis
result comes from.
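
The correlation step that underlies co-occurrence network analysis (pairwise correlation, P-value correction, thresholding into edges) can be sketched in base R. This is a minimal illustration, not the workflow of psych, WGCNA, Hmisc, or ggClusterNet: the abundances are simulated, and the cut-offs (|rho| >= 0.6, adjusted P < 0.05) are assumptions chosen for the example.

```r
# Simulated abundances (NOT real data): taxonA and taxonB share a common
# driver, taxonC and taxonD are independent noise.
set.seed(42)
n_samples <- 30
driver <- rnorm(n_samples)
abund <- cbind(
  taxonA = driver + rnorm(n_samples, sd = 0.2),
  taxonB = driver + rnorm(n_samples, sd = 0.2),
  taxonC = rnorm(n_samples),
  taxonD = rnorm(n_samples)
)

# All unordered taxon pairs.
taxa  <- colnames(abund)
pairs <- t(combn(taxa, 2))
res <- data.frame(from = pairs[, 1], to = pairs[, 2],
                  rho = NA_real_, p = NA_real_,
                  stringsAsFactors = FALSE)

# Spearman correlation and raw P-value for each pair.
for (i in seq_len(nrow(res))) {
  ct <- cor.test(abund[, res$from[i]], abund[, res$to[i]], method = "spearman")
  res$rho[i] <- unname(ct$estimate)
  res$p[i]   <- ct$p.value
}

# Multiple-testing correction (Benjamini-Hochberg), then thresholding.
res$p_adj <- p.adjust(res$p, method = "BH")
edges <- res[abs(res$rho) >= 0.6 & res$p_adj < 0.05, ]

# Symmetric adjacency matrix and a simple network property (node degree).
adj <- matrix(0, length(taxa), length(taxa), dimnames = list(taxa, taxa))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1
degree <- rowSums(adj)
edges
```

The resulting edge list or adjacency matrix is the kind of input that network packages then lay out and summarise; the explicit p.adjust() call corresponds to the additional P-value correction that the text notes is needed after a fast correlation calculation.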

Table 1. Comparison of the advantages and limitations of the six integrated R packages.

Phyloseq
  Function
    1. Diversity analysis including alpha/beta diversity, community composition, and phylogenetic tree analysis
    2. Network analysis
  Advantages
    1. First to utilize S4 class objects
    2. Possesses many analysis functions based on phyloseq objects
    3. The network analysis process is simplified (Fig. S2E)
    4. Ordination analysis can be applied to arrange the order of samples and microbes on heatmap rows and columns (Fig. S2F)
    5. Combines evolutionary trees with microbial abundance to display species richness (Fig. S2G)
    6. Offers over 30 distance algorithms
  Limitations
    1. Introduction to phyloseq objects can be challenging for beginners
    2. Statistical tests, including diversity tests and community/feature-level microbial difference analysis, are not well integrated into community analysis
    3. Network analysis lacks tests and attribute calculation

Microbiome
  Function
    1. Diversity analysis only, including alpha/beta diversity and community composition analysis
  Advantages
    1. Abundant alpha diversity indices
    2. The t-SNE and CAP ordination algorithms
    3. The stacked bar chart for community composition can be sorted by specified microbial features (Fig. S3C)
    4. Visualization of individual microbes (Fig. S3D)
  Limitations
    1. The t-SNE and CAP ordination analyses frequently encounter errors
    2. The statistical tests, including diversity tests and community and feature-level difference tests, are not ideal

MicrobiomeAnalystR
  Function
    1. Diversity analysis including alpha/beta diversity, community composition, and phylogenetic tree analysis
    2. Difference analysis
    3. Biomarker identification
  Advantages
    1. Various functions ranging from data cleaning to visualization
    2. Multiple algorithms to correct sequencing errors, leading to more accurate evaluation of abundance
    3. Machine learning can be utilized to extract feature variables (Fig. S4H)
    4. Difference analysis using multiple methods, such as LEfSe or metagenomeSeq
  Limitations
    1. Difficulties in installing R packages with dependencies
    2. Some functions may not work, including network analysis and difference analysis of relative abundance
    3. Insufficient explanation of parameters and examples

Animalcules
  Function
    1. Diversity analysis
    2. Difference analysis and biomarker identification
  Advantages
    1. SummarizedExperiment package supported
    2. Interactively executed in R (Fig. S5A–J)
    3. A 3D clustering plot can be generated
  Limitations
    1. Unable to save vector graphics and completed tables
    2. Insufficient functionality

Microeco
  Function
    1. Diversity analysis
    2. Difference analysis
    3. Biomarker identification
    4. Network, correlation analysis with other indicators
    5. Functional prediction
  Advantages
    1. R6 classes are more extensible than phyloseq objects
    2. Simple function calling
    3. Rich plots of diversity and difference analysis (Fig. S6A–H)
    4. Unique correlation analysis of other indicators
    5. Network analysis functionality (Fig. S6K)
    6. FAPROTAX and FUNGuild function prediction
  Limitations
    1. New data structures increase the cost of learning time
    2. The many functions and dependencies cause frequent malfunctions

EasyAmplicon
  Function
    1. Diversity analysis
    2. Provides scripts for preparing data for STAMP, LEfSe, PICRUSt 1&2, BugBase, FAPROTAX, iTOL
    3. Provides slide tutorials for many analyses, such as QIIME 2
  Advantages
    1. It can be used in both command-line mode and interactive mode within RStudio
    2. It offers multiple visualization styles, allowing for easy generation of publication-quality figures (Fig. S7)
    3. Its open-source code facilitates reproducible analysis and allows for personalized modifications
  Limitations
    1. Requires the most popular external tools: STAMP, LEfSe, PICRUSt 1&2, BugBase, FAPROTAX, and iTOL
    2. Some functions still need to be developed
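
Alpha diversity indices, which several packages in Table 1 report, can be computed directly from a count table. The sketch below assumes a samples-by-features matrix of made-up counts and implements only three common indices (richness, Shannon, and Simpson as 1 minus the sum of squared proportions); the function name alpha_indices() is hypothetical and is not taken from any of the packages discussed.

```r
# Toy count table (made-up numbers): rows = samples, columns = features.
counts <- matrix(
  c(10, 10, 10, 10,   # perfectly even community
    37,  1,  1,  1,   # one dominant feature
    20, 20,  0,  0),  # only two features present
  nrow = 3, byrow = TRUE,
  dimnames = list(c("even", "uneven", "two_taxa"), paste0("ASV", 1:4))
)

# Hypothetical helper: indices for one sample's count vector.
alpha_indices <- function(x) {
  x <- x[x > 0]          # absent features do not contribute
  p <- x / sum(x)        # relative abundances
  c(richness = length(x),            # number of observed features
    shannon  = -sum(p * log(p)),     # Shannon index (natural log)
    simpson  = 1 - sum(p^2))         # Simpson index (Gini-Simpson form)
}

# One row of indices per sample.
alpha <- t(apply(counts, 1, alpha_indices))
alpha
```

For the even sample the Shannon index equals log(4), its maximum for four features, while the dominated sample has the same richness but much lower Shannon and Simpson values, which is why richness and evenness-sensitive indices are usually reported together.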

is richer in alpha diversity indicators, providing more than 30 of them. Secondly, it provides core microbial calculation and visualization functions. In general, it can be used as a complement to phyloseq or in conjunction with it.

Microbiome data analysis using MicrobiomeAnalystR
MicrobiomeAnalystR is the R package version of the MicrobiomeAnalyst webserver (Figs. 2, 3 and S4A–J, Pipeline 3. MicrobiomeAnalystR.Rmd). Its functions include diversity analysis (Fig. S4A–F), difference analysis (Fig. S4G), biomarker identification (Fig. S4H and S4I), and a sample sequencing library size overview (Fig. S4J), which are more powerful than those of the previous two packages. The visualization combines basic packages, ggplot plotting, and interactive plotting. In terms of network analysis, it provides the process of calculating and plotting SparCC networks that are more suitable for microbiome data. However, the package depends on many R packages from CRAN, Bioconductor, and GitHub, so a complete installation of MicrobiomeAnalystR requires a lot of effort.

Microbiome data analysis using Animalcules
The Animalcules package offers an alternative: analysis on an interactive platform (Figs. 2, 3 and S5A–J, Pipeline 4. Animalcules.Rmd). It can calculate and plot sample statistics in bar plots (Fig. S5A) or interactive pie charts (Fig. S5B), calculate and visualize alpha diversity in dot plots (Fig. S5C), show grouped microbial taxonomic or functional composition in heatmaps and stack plots (Fig. S5D and S5E), show feature abundance in boxplots (Fig. S5F), draw genus-level Bray distance heatmaps (Fig. S5G), perform ordination analysis (Fig. S5H and S5I), select biomarkers using random forest or logistic regression (Fig. S5J), and run other analyses. The results of these analyses can often be reanalyzed by interactively modifying parameters, and the images can be interactively zoomed in and out, clicked to see details, and otherwise manipulated with the mouse for better pattern discovery. However, the results cannot be exported

as vector format, which does not meet the requirements for publication. Secondly, the range of analyses is limited; in particular, microbiome network analysis and correlation analysis between the microbiome and other indicators are lacking.

Microbiome data analysis using microeco
The microeco package is very powerful and uses the R6 class data structure (Figs. 2, 3 and S6A–L, Pipeline 5. microeco.Rmd). It includes microbial diversity (Fig. S6A and S6B), taxonomic composition (Fig. S6C–E), difference (Fig. S6F–H), biomarker (Fig. S6I and S6J), network (Fig. S6K), integrated community structure with environmental factor (Fig. S6L), and phylogenetic diversity analysis. It can complete almost all current microbiome analyses. However, it is not suitable for novices because there is a certain threshold for using R6 class objects. In addition, because the many functions have different requirements for input data, some functions are hard to use.

Microbiome data analysis using amplicon
The package amplicon is an analysis and plotting tool (Figs. 2, 3 and S7A–I, Pipeline 6. Amplicon.Rmd) within the microbiome analysis toolkit EasyMicrobiome (Liu et al., 2023). It enables various diversity analyses, including alpha diversity (Fig. S7A), rarefaction curve (Fig. S7B), clustering distance heatmap (Fig. S7C) and PCoA (Fig. S7D), NMDS, LDA and PCA, taxonomic composition (Fig. S7E and S7F), and difference analysis (Fig. S7G and S7H). It can easily generate high-quality figures such as boxplots and scatter plots for diversity analysis, and stacked bar plots, circlize plots, and map trees for taxonomic or functional composition. One of its notable features is its ability to finely adjust the presentation of figures, resulting in publication-ready figures. Additionally, several tools within the amplicon package are available for microbiome data transformation, facilitating subsequent analysis using tools such as LEfSe and STAMP. However, in its current version, the amplicon package does not provide functions for network analysis, analysis of microbiome–environment interactions, or analysis of community formation processes. The authors provide some scripts in the EasyAmplicon pipeline for these analyses and, as mentioned in the published paper, plan to finish these functions in the future.

The best practice for microbiome data analysis in R
The abundance of R packages can hinder microbiome researchers from efficiently selecting appropriate R packages for microbiome-related analyses. Therefore, we organized and selected efficient, commonly used, and user-friendly functions for microbiome data analysis in six categories (Fig. S8): (i) diversity analysis (Figs. S9A–I and S10A–E), (ii) difference analysis (Figs. S10F–I, S11A and S11B), (iii) biomarker identification (Fig. S11C and S11D), (iv) correlation and network analysis (Figs. S11E–I), (v) functional prediction, (vi) other microbiome analyses (Fig. S12A–I). All the scripts can be found in the file Pipeline.BestPractice.Rmd. This led us to develop a better microbiome data analysis pipeline.
In this pipeline, we used the amplicon package for alpha diversity rarefaction curve (Figs. 4A and S9A) and PCoA analysis (Figs. 4B and S9B), the ggplot2 package for visualization of microbial community composition, ggClusterNet for constructing the Venn network (Chen et al., 2021) (Fig. 4C), ggtree and ggtreeExtra for building evolutionary trees (Fig. 4D), and LEfSe for generating cladograms (Fig. 4E). We employed the stats4, ggplot2, and cowplot packages for difference analysis and generated STAMP plots (Fig. 4F), used edgeR for difference analysis and visualized the results in Manhattan plots (Fig. 4G), and applied DESeq2 for difference analysis and generated multi-group volcano plots (Fig. 4H). We also used the e1071, caret, randomForest, and pROC packages for various machine learning analyses and generated microbiome weighted plots (Fig. 4I). Furthermore, we used ggClusterNet for microbiome network analysis (Fig. 4J), and constructed network graphs and combined plots to explore the associations between environmental factors and microbiome communities (Fig. 4K). Finally, we used the FEAST package to perform community source tracking analysis and constructed pie charts (Fig. 4L). Other analyses included stacked bar charts of microbial community composition (Figs. S9E and S9H), chord diagrams (Fig. S10A), Venn diagrams (Fig. S10C), UpSet diagrams (Fig. S10D), difference analysis volcano plots (Fig. S10F), functional prediction, etc.

Perspective and conclusions
In the past 10 years, the R language and numerous R packages have played an important role in microbiome data analysis. The R language is easy to use and to get started with, and it has attracted many researchers. However, there are still some contradictions between supply and demand in microbiome data analysis. First, it is often difficult to support multi-threading under the Windows system; second, many R packages run relatively slowly, although some R packages are written in other languages as supplements; third, the application in microbiome research still needs further development. For instance, there is a shortage of packages that allow for the exploration of time-series-based microbial compositions, as well as of more robust interactive packages for analyzing complex microbial data. Furthermore, ggplot2 lacks the capability to create complex and combined figures, and so fails to meet the visualization requirements for relationships between multiple intricate indicators and microbial community data. Therefore, developing new R packages that are more suitable for drawing complex and composite figures would be necessary for microbiome data.
With the development of sequencing technology, data analysis methods have advanced along with the development of R packages contributed to the field of microbiome research. These R packages range from classic R packages such as vegan, which has been cited more than 10,000 times, to integrated R packages such as phyloseq, which contain many functions in one package and set a unified data processing framework. These R packages have been able to implement most of the functions of microbiome analysis, including microbial diversity, difference, biomarker identification, correlation and network, and phylogenetic analysis. However, these R packages have some redundant functions; for example, phyloseq, microbiome, and others can all do microbial diversity analysis, and the difference is only in the visualization method and scheme. A similar situation has always existed in microbiome analysis R packages, so we hope that future developments will reduce redundancy by reusing identical or similar functionality, to highlight the distinct advantages of each R package.
Although these R packages provide many functions, they do not perform well enough in some specific analyses; for example, in alpha and beta diversity analysis, the output graphs often do not include difference test results to show the differences in the figures. In addition, there are still some areas that can continue to be developed, such as applying more machine learning methods to microbiome data, together with the evaluation of learning methods, models, and important variables. Secondly, metagenomes are becoming more widely used, and the support of species and functional annotation results based

Figure 4. Examples of the best practice results of microbial community analysis in R language. The selected results include rarefaction curve (A),
principal coordinate analysis scatter plot (B), Venn network graph (C), evolutionary tree (D), LEfSe cladogram (E), difference analysis extended error bar
plot in STAMP style (F), difference analysis Manhattan plot (G), difference analysis multi-group volcano plot (H), biomarker selection ring-column chart
(I), network graph (J), correlation connection combination graph (K), source tracing analysis pie chart (L).
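
The principal coordinate analysis behind a scatter plot such as panel B can be sketched in base R. This is an illustration only, not the code of the amplicon package: the counts are simulated, the function name bray_curtis() is hypothetical, and the Bray-Curtis distance is written out by hand so that the example has no package dependencies.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between the rows of a
# samples-by-features count matrix.
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}

# Simulated counts (NOT real data): two groups of 5 samples whose six
# features have opposite abundance profiles.
set.seed(7)
profile <- c(50, 30, 10, 5, 3, 2)
group_a <- matrix(rpois(30, lambda = rep(profile, each = 5)), nrow = 5)
group_b <- matrix(rpois(30, lambda = rep(rev(profile), each = 5)), nrow = 5)
counts <- rbind(group_a, group_b)
rownames(counts) <- c(paste0("A", 1:5), paste0("B", 1:5))

# PCoA = classical multidimensional scaling of the distance matrix.
d <- bray_curtis(counts)
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# Share of variation on the first two axes, relative to positive eigenvalues
# (Bray-Curtis is not Euclidean, so some eigenvalues can be negative).
explained <- pcoa$eig[1:2] / sum(pcoa$eig[pcoa$eig > 0])
round(explained, 3)
pcoa$points
```

The two columns of pcoa$points are the coordinates that would be plotted as PC 1 and PC 2, with the explained proportions typically shown in the axis labels.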

on Kraken (Wood and Salzberg, 2014), MEGAN (Huson et al., 2007), MetaPhlAn2 (Truong et al., 2015), HUMAnN2 (Franzosa et al., 2018), eggNOG-mapper (Huerta-Cepas et al., 2017), etc. is becoming more and more important, and these make the data processed by R rise from megabytes (MB) to gigabytes (GB). Therefore, faster data processing R packages, such as data.table, fst, and tidyfst, should be used in the microbiome data analysis process.

The use of appropriate data structures can accelerate microbiome data processing. At first, we used S4 class objects for microbiome data encapsulation, which can complete a variety of analyses comprehensively and efficiently. The emergence of R6 class objects and other objects has greatly impacted microbiome data processing and largely facilitates it. With the development of the tidy family of R packages, tidy-based data structures have recently emerged for microbiome data mining, for example, in the MicrobiotaProcess package (Xu et al., 2023). This structure is more suitable for microbiome data mining, machine learning modeling, and other analyses, and can more easily extract the influence of experimental design, time, space, and other factors on microbiome data, to discover the deep-seated patterns. We expect the R language to make microbiome analysis more efficient and help everyone discover more about the role of the microbiome in humans, animals, plants, and the environment, and use it for our benefit to make the world a better place.

Supplementary information
The online version contains supplementary material available at https://doi.org/10.1093/procel/pwad024.

Acknowledgements
We thank all the people who starred and forked this project on GitHub and provided useful feedback. Thanks to Guangchuang Yu, Mingshou Zhang, and Yunyun Gao for their suggestions for revising this article.

Abbreviations
ASV, an amplicon sequence variant; CCA, canonical correspondence analysis; CSS, cumulative-sum scaling; DCA, decision curve analysis; GO, gene ontology; GSEA, gene set enrichment analysis; GSVA, gene set variation analysis; KEGG, Kyoto encyclopedia of genes and genomes; LDA, linear discriminant analysis; LEfSe, linear discriminant analysis effect size; NMDS, non-metric multidimensional scaling; OTU, operational taxonomic unit; PCA, principal components analysis; PCoA, principal coordinate analysis; RLE, relative log expression; ROC, receiver operating characteristic curve; TMM, trimmed mean of M-values; UQ, upper quartile; MED, median of ratios method.

Funding
This study was financially supported by the Agricultural Science and Technology Innovation Program (CAAS-ZDRW202308), the Natural Science Foundation of China (42277297, 42090060, U21A20182), Jiangsu Funding Program for Excellent Postdoctoral Talent (2022ZB325), Scientific and technology innovation project of China Academy of Chinese Medical Sciences (C12021A04115), and the Fundamental Research Funds for the Central public welfare research institutes (ZZ13-YQ-095).

Conflict of interests
The authors declare no competing interests related to the content of this paper.

Consent for publication
All authors agree to publish.

Author contributions
J.Y. and Y.L. conceived and supervised the project; T.W. and G.N. implemented this project and wrote the paper; Y.L., T.C., and Q.S. provided critical comments and revised the paper.

Data availability
No new sequencing data were generated by this project.

Code availability
All the demo data and scripts are available on GitHub: github.com/taowenmicro/EasyMicrobiomeR.

References
Amir A, McDonald D, Navas-Molina JA et al. Deblur rapidly resolves single-nucleotide community sequence patterns. MSystems 2017;2:e00191–e00116.
Aßhauer KP, Wemheuer B, Daniel R et al. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics 2015;31:2882–2884.
Barnett DJ, Arts IC, Penders J. microViz: an R package for microbiome data visualization and statistics. J Open Source Softw 2021;6:3201.
Bolyen E, Rideout JR, Dillon MR et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 2019;37:852–857.
Callahan BJ, McMurdie PJ, Rosen MJ et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–583.
Caporaso JG, Kuczynski J, Stombaugh J et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7:335–336.
Carrión VJ, Perez-Jaramillo J, Cordovez V et al. Pathogen-induced activation of disease-suppressive functions in the endophytic root microbiome. Science 2019;366:606–612.
Chen H, Boutros PC. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinf 2011;12:1–7.
Chen T, Zhang H, Liu Y et al. EVenn: easy to create repeatable and editable Venn diagrams and Venn networks online. J Genet Genom 2021;48:863–866.
Chen Y, Li J, Zhang Y et al. Parallel-Meta Suite: interactive and rapid microbiome data analysis on multiple platforms. iMeta 2022;1:e1.
Chong J, Liu P, Zhou G et al. Using MicrobiomeAnalyst for comprehensive statistical, functional, and meta-analysis of microbiome data. Nat Protoc 2020;15:799–821.
Conway JR, Lex A, Gehlenborg NU. An R package for the visualization of intersecting sets and their properties. Bioinformatics 2017;33:2938–2940.
Dimitriadou E, Hornik K, Leisch F et al. Misc functions of the Department of Statistics (e1071), TU Wien. R Package 2008;1:5–24.
Dray S, Dufour A-B. The ade4 package: implementing the duality diagram for ecologists. J Stat Softw 2007;22:1–20.
Dray S, Blanchet G, Borcard D et al. Package ‘adespatial’. R Package 2018;1:3–8.
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26:2460–2461.
Edgar RC, Flyvbjerg H. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 2015;31:3476–3482.
Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen 1936;7:179–188.
Franzosa EA, McIver LJ, Rahnavard G et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods 2018;15:962–968.
Gu Z. Complex heatmap visualization. iMeta 2022;1:e43.
Gu Z, Gu L, Eils R et al. Circlize implements and enhances circular visualization in R. Bioinformatics 2014;30:2811–2812.
Hamilton NE, Ferry M. ggtern: Ternary diagrams using ggplot2. J Stat Softw 2018;87:1–17.
Harrell Jr FE, Harrell Jr MFE. Package ‘hmisc’. CRAN2018 2019;2019:235–236.
Hofner B, Mayr A, Robinzonov N et al. Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 2014;29:3–35.
Huerta-Cepas J, Forslund K, Coelho LP et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol Biol Evol 2017;34:2115–2122.
Huson DH, Auch AF, Qi J et al. MEGAN analysis of metagenomic data. Genome Res 2007;17:377–386.
Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat 1996;5:299–314.
Kembel SW, Cowan PD, Helmus MR et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 2010;26:1463–1464.
Knights D, Kuczynski J, Charlson ES et al. Bayesian community-wide culture-independent microbial source tracking. Nat Methods 2011;8:761–763.
Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28:1–26.
Kurtz ZD, Müller CL, Miraldi ER et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015;11:e1004226.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf 2008;9:1–13.
Li W, Wang L, Li X et al. Sequence-based functional metagenomics reveals novel natural diversity of functioning CopA in environmental microbiomes. Genom Proteom Bioinform 2022;20:1–12.
Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2:18–22.
Lin H, Peddada SD. Analysis of microbial compositions: a review of normalization and differential abundance analysis. Npj Biofilms Microbiomes 2020;6:1–13.
Liu C, Cui Y, Li X et al. microeco: an R package for data mining in microbial community ecology. FEMS Microbiol Ecol 2020;97:fiaa255.
Liu Y, Qin Y, Chen T et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein Cell 2021;12:315–330.
Liu YX, Chen L, Ma T et al. EasyAmplicon: an easy-to-use, open-source, reproducible, and community-based pipeline for amplicon data analysis in microbiome research. iMeta 2023;2:e83.
Louca S, Parfrey LW, Doebeli M. Decoupling function and taxonomy in the global ocean microbiome. Science 2016;353:1272–1277.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:1–21.
McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 2013;8:e61217.
Metcalf JL, Xu ZZ, Weiss S et al. Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 2016;351:158–162.
Nearing JT, Douglas GM, Hayes MG et al. Microbiome differential abundance methods produce different results across 38 datasets. Nat Commun 2022;13:342.
Nguyen NH, Song Z, Bates ST et al. FUNGuild: an open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecol 2016;20:241–248.
Ning D, Yuan M, Wu L et al. A quantitative framework reveals ecological drivers of grassland microbial community assembly in response to warming. Nat Commun 2020;11:4717.
Oksanen J, Kindt R, Legendre P et al. The vegan package. Community Ecol Package 2007;10:719.
Pages H, Aboyoun P, Gentleman R et al. Biostrings: string objects representing biological sequences, and matching algorithms. R Package Version 2016;2:10.18129.
Paoli L, Ruscheweyh H-J, Forneris CC et al. Biosynthetic potential of the global ocean microbiome. Nature 2022;607:111–118.
Pasolli E, Schiffer L, Manghi P et al. Accessible, curated metagenomic data through ExperimentHub. Nat Methods 2017;14:1023–1024.
Proctor LM, Creasy HH, Fettweis JM et al. The integrative human microbiome project. Nature 2019;569:641–648.
Revelle W, Revelle MW. Package ‘psych’. The Compr R Archive Netw 2015;337:338.
Ripley B, Venables B, Bates DM et al. Package ‘mass’. Cran R 2013;538:113–120.
Robin X, Turck N, Hainard A et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf 2011;12:1–8.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2009;26:139–140.
Rognes T, Flouri T, Nichols B et al. VSEARCH: a versatile open source tool for metagenomics. PeerJ 2016;4:e2584.
Schloss PD, Westcott SL, Ryabin T et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75:7537–7541.
Shenhav L, Thompson M, Joseph TA et al. FEAST: fast expectation-maximization for microbial source tracking. Nat Methods 2019;16:627–632.
Si B, Liang Y, Zhao J et al. GGraph: an efficient structure-aware approach for iterative graph processing. IEEE Trans Big Data 2022;8:1182–1194.
Stegen JC, Lin X, Fredrickson JK et al. Quantifying community assembly processes and identifying features that impose them. ISME J 2013;7:2069–2079.
Thompson LR, Sanders JG, McDonald D et al; Earth Microbiome Project Consortium. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 2017;551:457–463.
Truong DT, Franzosa EA, Tickle TL et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 2015;12:902–903.
Wemheuer F, Taylor JA, Daniel R et al. Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences. Environ Microbiome 2020;15:11.
Wen T, Xie P, Yang S et al. ggClusterNet: an R package for microbiome network analysis and modularity-based multiple network layouts. iMeta 2022;1:e32.
Wickham H. Reshaping data with the reshape package. J Stat Softw 2007;21:1–20.
Wickham H. ggplot2. Wiley Interdiscip Rev Comput Stat 2011a;3:180–185.
Wickham H. The split-apply-combine strategy for data analysis. J Stat Softw 2011b;40:1–29.
Wirbel J, Zych K, Essex M et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 2021;22:93.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15:1–12.
Xu S, Li L, Luo X et al. Ggtree: a serialized data object for visualization of a phylogenetic tree and annotation data. iMeta 2022;1:e56.
Xu S, Zhan L, Tang W et al. MicrobiotaProcess: a comprehensive R package for deep mining microbiome. Innovation 2023;4:100388.
Zhao Y, Federico A, Faits T et al. animalcules: interactive microbiome analytics and visualization in R. Microbiome 2021;9:1–16.
