Assessment of metagenomic assembly using simulated next generation sequencing data
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Assessment of metagenomic assembly using simulated next generation sequencing data. / Mende, Daniel R; Waller, Alison S; Sunagawa, Shinichi; Järvelin, Aino I; Chan, Michelle M; Arumugam, Manimozhiyan; Raes, Jeroen; Bork, Peer.
In: PLoS ONE, Vol. 7, No. 2, 2012, p. e31386.
RIS
TY - JOUR
T1 - Assessment of metagenomic assembly using simulated next generation sequencing data
AU - Mende, Daniel R
AU - Waller, Alison S
AU - Sunagawa, Shinichi
AU - Järvelin, Aino I
AU - Chan, Michelle M
AU - Arumugam, Manimozhiyan
AU - Raes, Jeroen
AU - Bork, Peer
PY - 2012
Y1 - 2012
N2 - Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
KW - Computational Biology
KW - Computer Simulation
KW - Contig Mapping
KW - DNA, Bacterial
KW - Genome, Bacterial
KW - Genomics
KW - Metagenome
KW - Metagenomics
KW - Models, Genetic
KW - Probability
KW - Quality Control
KW - Reproducibility of Results
KW - Sequence Analysis, DNA
KW - Software
U2 - 10.1371/journal.pone.0031386
DO - 10.1371/journal.pone.0031386
M3 - Journal article
C2 - 22384016
VL - 7
SP - e31386
JO - PLoS ONE
JF - PLoS ONE
SN - 1932-6203
IS - 2
ER -