Getting Genetics Done: Comparing Sequence Classification Algorithms for Metagenomics

See on Scoop.itVirology and Bioinformatics from Virology.ca

Metagenomics is the study of DNA collected from environmental samples (e.g., seawater, soil, acid mine drainage, the human gut, sputum, pus, etc.). While traditional microbial genomics typically means sequencing a pure cultured isolate, metagenomics involves taking a culture-free environmental sample and sequencing a single gene (e.g. the 16S rRNA gene), multiple marker genes, or shotgun sequencing everything in the sample in order to determine what’s there.

A challenge in shotgun metagenomics analysis is the sequence classification problem: i.e., given a sequence, what’s it’s origin? I.e., did this sequence read come from E. coli or some other enteric bacteria? Note that sequence classification does not involve genome assembly – sequence classification is done on unassembled reads. If you could perfectly classify the origin of every sequence read in your sample, you would know exactly what organisms are in your environmental sample and how abundant each one is.

The solution to this problem isn’t simply BLAST’ing every sequence read that comes off your HiSeq 2500 against NCBI nt/nr. The computational cost of this BLAST search would be many times more expensive than the sequencing itself. There are many algorithms for sequence classification. This paper examines a wide range of the available algorithms and software implementations for sequence classification as applied to metagenomic data.

See on gettinggeneticsdone.blogspot.pt

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.