The immense amount of metagenomic data produced today requires an automated approach to data processing and analysis. Before any downstream analysis is performed, the datasets should be preprocessed to ensure data quality and prevent erroneous conclusions. One step of your data preprocessing (usually the last) should be to check for sequence contamination (DNA from sources other than the sample). This post will show you how to identify and remove human sequence contamination from metagenomes; the same approach can also be applied to any other type of sequence dataset or contamination.
Monthly Archives: February 2011
Perl subroutine to extract fasta sequences from a file
The following Perl subroutine reads a FASTA-formatted file, parses it, and returns all the sequences as a reference to a hash table.
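The Perl subroutine itself is not reproduced in this excerpt. As a rough illustration of the same idea, here is a minimal sketch in Python rather than Perl (it is not the original code). The function names `parse_fasta` and `read_fasta` are made up for this example, and the sketch assumes that the sequence ID is the first whitespace-delimited word of each header line and that a sequence may be wrapped over several lines.

```python
def parse_fasta(lines):
    """Build a dictionary {sequence_id: sequence} from FASTA-formatted lines.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    If an ID occurs more than once, the last record wins.
    """
    sequences = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            tokens = line[1:].split(None, 1)
            current_id = tokens[0] if tokens else ""
            chunks = []
        elif current_id is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    # Store the final record.
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


def read_fasta(path):
    """Open a FASTA file and return its sequences as a dictionary."""
    with open(path) as handle:
        return parse_fasta(handle)
```

Calling `read_fasta("reads.fasta")` (a hypothetical file name) would return a dictionary keyed by sequence ID, which plays the same role as the hash reference returned by the Perl version.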
2008
Transition in Vibrio spp. correlates with human activity in the Northern Line Islands
Robert Schmieder, Tracy McDole, Elizabeth Dinsdale, Matthew Haynes, Forest Rohwer, Robert Edwards
Presented at: International Coral Reef Symposium (ICRS) 2008
Download PDF file (3.5 MB)

ADAPTdb/ADAPT – A Framework for the Analysis of ARISA Data Sets
Robert Schmieder, Matthew Haynes, Elizabeth Dinsdale, Forest Rohwer, and Robert Edwards
Presented at: Metagenomics 2008
Download PDF file (1.2 MB)
2009
ADAPTdb/ADAPT – A Framework for the Analysis of ARISA Data Sets
Robert Schmieder, Matthew Haynes, Elizabeth Dinsdale, Forest Rohwer, and Robert Edwards
Presented at: ISMB/ECCB and M3 2009
Download PDF file (812 KB)

Deviation of amino acid utilization and correlation with G+C composition in bacterial genome
Sajia Akhter, Hochul K Lee, Barbara Bailey, Peter Salamon, Robert Edwards
Presented at: Applied Computational Science and Engineering Student Support (ACSESS) 2009
Download PDF file (1.6 MB)

Assembler for SOLiD data: by Improving memory management of Velvet assembler
Sajia Akhter and Robert Edwards
Presented at: Rocky Mountain Bioinformatics Conference (Rocky) 2009
Download PDF file (1.0 MB)

Phage Annotation Tools and Methods
Ramy K. Aziz, Bhakti Dwivedi, Joe Anderson, Bonnie Hurwitz, JP Massar, Mya Breitbart, Matthew Sullivan, Jeff Elhai and Robert A. Edwards
Presented at: Rocky Mountain Bioinformatics Conference (Rocky) 2009
Download PDF file (940 KB)
2012
Tools for Detecting Macrolide Resistance in the Human Microbiome
Robert Schmieder, Yan Wei Lim, Anca Segall, Molly Schmid, and Robert Edwards
Presented at: Advances in Genome Biology & Technology (AGBT) 2012
Download PDF file (1.2 MB)

Host prediction for viral metagenomes using oligonucleotide profiles
Michiyo Wellington-Oguri, Robert Schmieder, Barbara Bailey, Robert A. Edwards, and Bas E. Dutilh
Presented at: Student Research Symposium (SRS) 2012
Download PDF file (705 KB)

Database Structure and Visualization Software for Microbial Physiology Data
Nicholas Turner, Haquio Liu, Jeremy Frank, and Robert Edwards
Presented at: Student Research Symposium (SRS) 2012
Download PDF file (176 KB)

Tools for Fast Sequence Alignment
Sajia Akhter and Robert Edwards
Presented at: Student Research Symposium (SRS) 2012
Download PDF file (1.3 MB)
2011
Tools for Quality Control and Preprocessing of Metagenomic Datasets
Robert Schmieder, Yan Wei Lim and Robert Edwards
Presented at: Pacific Symposium on Biocomputing (PSB) 2011
Download PDF file (4.2 MB)

FACIL: fast and accurate genetic code inference and logo
Bas E. Dutilh, Rasa Jurgelenaite, Radek Szklarczyk, Sacha A.F.T. van Hijum, Harry R. Harhangi, Markus Schmid, Bart de Wild, Kees-Jan Françoijs, Hendrik G. Stunnenberg, Marc Strous, Mike S.M. Jetten, Huub J.M. Op den Camp and Martijn A. Huynen
Presented at: SDMG All Day Meeting 2011
Download PDF file (3.6 MB)

Genomic Comparison of Salmonella enterica Serovars Enteritidis and Dublin
D. Matthews, R. Schmieder, J. Busch, N. Cassman, M. Doherty, D. Green, B. Matolock, B. Heffernan, G. Olsen, L. Farris, D. Schiffeli, S. Maloy, E. Dinsdale, and R. Edwards
Presented at: ASM 2011
Download PDF file (561 KB)

PhiSpy: A novel similarity-independent tool for predicting prophages in microbial genomes
Sajia Akhter, Ramy K Aziz, Robert A Edwards
Presented at: Evergreen International Phage Biology Meeting
Download PDF file (1.5 MB)
2010
Real-Time Metagenomics Analysis
Daniel A. Cuevas, Joshua A. Hoffman and Robert A. Edwards
Presented at: ASM 2010
Download PDF file (980 KB)

Investigating the Frequency of Quinolone Resistance Genes in Environmental Samples
Sajia Akhter, Anca M. Segall, Molly Schmid and Robert A. Edwards
Presented at: ASM 2010
Download PDF file (1.5 MB)

Identification of Macrolide Resistance Alleles in Environmental Metagenomes
Robert Schmieder, Anca Segall, Molly Schmid and Robert Edwards
Presented at: ASM 2010
Download PDF file (2.7 MB)

Fast Identification and Removal of Sequence Contaminations from Genomic and Metagenomic Datasets
Robert Schmieder and Robert Edwards
Presented at: Human Microbiome Meeting 2010
Download PDF file (1.2 MB)

PHANTOME: Phage Annotation Tools and Methods
Ramy K. Aziz, Brad Hull, Bhakti Dwivedi, Joe Anderson, Bonnie Hurwitz, JP Massar, Matthew Sullivan, Jeff Elhai, Mya Breitbart, Ross Overbeek and Robert A. Edwards
Presented at: Institut Pasteur Virus of Microbes meeting 2010
Download PDF file (808 KB)

Phages Without Borders: Distribution of Phage Nucleic Acids in 310 Metagenomes
Ramy K. Aziz, Mya Breitbart and Robert A. Edwards
Presented at: ASM 2010
Download PDF file (1.9 MB)

Phage Annotation Tools and Methods
Ramy K. Aziz, Bhakti Dwivedi, Joe Anderson, Bonnie Hurwitz, Brad Hull, JP Massar, Mya Breitbart, Matthew Sullivan, Jeff Elhai and Robert A. Edwards
Presented at: ASM 2010
Download PDF file (924 KB)

Predicting Phage Preferences: Lytic vs. Lysogenic Lifestyle from Genomes
Katelyn McNair, Rob Edwards and Barbara Bailey
Presented at: CSHL meeting 2010
Download PDF file (269 KB)
How to convert FASTQ to FASTA
The following examples show how to convert a FASTQ file to a FASTA file. The commands assume a Unix-based operating system with Perl.
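The original Perl commands are not included in this excerpt. As a stand-in, here is a minimal Python sketch of the same conversion (it is not the original code). It assumes every FASTQ record occupies exactly four lines (header, sequence, `+` separator, qualities), with no wrapped sequences; the function names `fastq_to_fasta` and `convert_file` are made up for this example.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumption: strict four-line records. The line's position within the
    record decides how it is handled, so a quality line that happens to
    start with '@' is never mistaken for a header.
    """
    for index, line in enumerate(fastq_lines):
        position = index % 4
        line = line.rstrip("\r\n")
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(
                    "line %d: expected a header starting with '@'" % (index + 1)
                )
            yield ">" + line[1:]  # swap the leading '@' for '>'
        elif position == 1:
            yield line  # the sequence is copied unchanged
        # positions 2 and 3 (separator and quality string) are dropped


def convert_file(fastq_path, fasta_path):
    """Convert one FASTQ file into a FASTA file."""
    with open(fastq_path) as source, open(fasta_path, "w") as target:
        for fasta_line in fastq_to_fasta(source):
            target.write(fasta_line + "\n")
```

For example, `convert_file("reads.fastq", "reads.fasta")` (hypothetical file names) would write one header line and one sequence line per read. Deciding by line position rather than by the first character is deliberate, because `@` is also a valid quality character.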
Contamination of sequencing data (Pt. 2)
It is amazing how easily sample processing can lead to contaminated data. Something like 22% of sequenced genomes contain AluY elements from the human genome. As noted in the following posting from The Scientist, this alarming discovery could also indicate that sequenced genomes are contaminated by DNA from other sources, such as the commonly used E. coli, which would be problematic when working with other bacterial genomes. Such contamination could have grave consequences for evaluating horizontal gene transfer.
http://www.the-scientist.com/news/display/57990/
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0016410
It should be noted, though, that according to the article in The Scientist, this applies only to female scientists (“But probably the most common contaminant is the scientist herself.”, from paragraph 4).
Project Update: Multi-threading or Cluster Computing?
Recently, I’ve been faced with a problem: my metagenome comparator program runs too slowly. The main reason is that it repeats the same operations many times in a loop. These operations include reading lines from text, creating objects, inserting those objects into a data structure, retrieving them from the data structure, and writing the data structures to disk (just to name a few). The natural suggestion for someone in my position is to parallelize it all, and that’s exactly what I want to do. However, I’ve never written a parallel application of any kind, so I need to do some learning and research on parallel programming. (More of my ramblings after the Read More break)