Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies

Jonathan Thorsen; Asker Brejnrod; Martin Mortensen; Morten A Rasmussen; Jakob Stokholm; Waleed Abu Al-Soud; Søren Sørensen; Hans Bisgaard; Johannes Waage

doi:10.1186/s40168-016-0208-8

Microbiome. 2016 Nov 25;4(1):62. doi: 10.1186/s40168-016-0208-8.

Authors

Jonathan Thorsen¹, Asker Brejnrod² ³, Martin Mortensen², Morten A Rasmussen¹, Jakob Stokholm¹, Waleed Abu Al-Soud², Søren Sørensen², Hans Bisgaard⁴, Johannes Waage⁵

Affiliations

¹ COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark.
² Section of Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
³ Department of Biology, Laboratory of Genomics and Molecular Biomedicine, University of Copenhagen, Copenhagen, Denmark.
⁴ COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark.
⁵ COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark.

Abstract

Background: There is immense scientific interest in the human microbiome and its effects on human physiology, health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16S rRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs). Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statistical approaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. This effort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformity of read counts to theoretical distributions, which are often exacerbated by exploratory and/or unbalanced study designs. Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2) beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16S microbiome datasets from different sources.

Results: Running more than 380,000 full differential relative abundance tests on real datasets with permuted case/control assignments and in silico-spiked OTUs, we identify large differences in method performance on a range of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-in retrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power. For beta-diversity-based sample separation, we show that library size normalization has very little effect and that the distance metric is the most important factor in terms of separation power.
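The permutation and spike-in idea described above can be illustrated with a minimal sketch. It is not the authors' framework: the data are simulated rather than clinical, the Wilcoxon rank-sum test (normal approximation) is used only as a stand-in for whichever method is being benchmarked, and all function names (`rank_sum_pvalue`, `relative_abundance`, `spike_otu`, `permuted_false_positive_rate`, `toy_counts`) are hypothetical. Shuffling case/control labels removes any true association, so the fraction of OTUs called significant estimates the false positive rate; multiplying one OTU's counts in cases creates a known signal whose retrieval can then be checked.

```python
import random
from statistics import NormalDist


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, midranks).

    No tie correction is applied to the variance, so the test is slightly
    conservative on sparse, zero-heavy data.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    if sigma == 0:
        return 1.0
    z = (u - n1 * n2 / 2) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


def relative_abundance(counts):
    """Convert an OTU-by-sample count table to per-sample proportions."""
    n_samples = len(counts[0])
    depth = [sum(otu[s] for otu in counts) for s in range(n_samples)]
    return [[otu[s] / depth[s] if depth[s] else 0.0 for s in range(n_samples)]
            for otu in counts]


def spike_otu(counts, labels, otu_index, fold):
    """Return a copy of the table with one OTU multiplied by `fold` in cases."""
    spiked = [list(otu) for otu in counts]
    spiked[otu_index] = [c * fold if lab == 1 else c
                         for c, lab in zip(counts[otu_index], labels)]
    return spiked


def permuted_false_positive_rate(counts, labels, n_permutations=20,
                                 alpha=0.05, seed=0):
    """Fraction of OTU tests with p < alpha after shuffling the labels."""
    rng = random.Random(seed)
    abundances = relative_abundance(counts)
    shuffled = list(labels)
    hits = tests = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        for otu in abundances:
            cases = [a for a, lab in zip(otu, shuffled) if lab == 1]
            controls = [a for a, lab in zip(otu, shuffled) if lab == 0]
            hits += rank_sum_pvalue(cases, controls) < alpha
            tests += 1
    return hits / tests


def toy_counts(n_otus=30, n_samples=40, sparsity=0.6, seed=1):
    """Simulated sparse count table (assumption: not real 16S data)."""
    rng = random.Random(seed)
    return [[0 if rng.random() < sparsity else rng.randint(1, 100)
             for _ in range(n_samples)] for _ in range(n_otus)]


labels = [1] * 20 + [0] * 20  # 20 cases, 20 controls
counts = toy_counts()
print("False positive rate on permuted labels:",
      permuted_false_positive_rate(counts, labels))

spiked = relative_abundance(spike_otu(counts, labels, otu_index=0, fold=10))
cases = [a for a, lab in zip(spiked[0], labels) if lab == 1]
controls = [a for a, lab in zip(spiked[0], labels) if lab == 0]
print("p-value for the spiked OTU:", rank_sum_pvalue(cases, controls))
```

Swapping `rank_sum_pvalue` for another test, or varying `sparsity` and the case/control split in `labels`, gives a rough way to probe the sensitivities the abstract mentions. A full benchmark would repeat this across many datasets, effect sizes, and methods.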

Conclusions: Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice of method considerably affects analysis outcome. Here, we give recommendations for tools that exhibit low false positive rates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result output from some commonly used methods should be interpreted with caution. We provide an easily extensible framework for benchmarking of new methods and future microbiome datasets.

Keywords: 16S sequencing; Benchmark; Beta-diversity; Differential relative abundance; Microbiome.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bacteria / classification*
Bacteria / genetics*
Base Sequence
Benchmarking*
Case-Control Studies
Computational Biology / methods
False Positive Reactions
High-Throughput Nucleotide Sequencing / methods*
Humans
Microbiota / genetics*
RNA, Ribosomal, 16S / genetics*
Sequence Analysis, RNA / methods

Substances

RNA, Ribosomal, 16S