Monthly Archives: April 2012

Entries in the NCBI Taxonomy Database

The NCBI Taxonomy database currently contains 865,348 taxonomy entries, each of which can be accessed using a unique identifier (the taxid). The taxid is an integer from 1 to 1,154,685, so not every number in that range is assigned to an entry. Taxids identify entries at different taxonomic levels (kingdom, phylum, …). The graphic below summarizes the content of the NCBI Taxonomy database and highlights the phyla or families with the most entries. By domain, most taxonomy entries belong to Eukaryota, followed by Bacteria and Viruses. Unclassified sequences include, for example, entries for metagenomes.

[Figure: ncbi_taxon_entries]

Perl one-liner to extract sequences by their ID from a FASTA file

The first one-liner is useful if you only want to extract a few sequences by their identifier from a FASTA file.

perl -ne 'if(/^>(\S+)/){$c=grep{$_ eq $1}qw(id1 id2)}print if $c' fasta.file

This will extract the two sequences with the sequence identifiers id1 and id2. To extract the sequences you need, change the identifiers within the parentheses and separate them with spaces. Identifiers are compared as plain strings, so characters such as | or . in an ID are matched literally.
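
For readers who find the one-liner hard to follow, here is the same filtering logic written out as a minimal Python sketch (Python rather than Perl, purely for illustration). The function name extract_records and the sample records are made up for this example. The sketch assumes the ID is the first whitespace-delimited token after ">" and compares IDs as plain strings.

```python
def extract_records(lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False  # plays the role of $c in the one-liner
    for line in lines:
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after ">".
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Made-up sample records, used only for this demonstration.
fasta = [
    ">id1 first sequence\n",
    "ACGT\n",
    ">id2\n",
    "GGCC\n",
    ">id3\n",
    "TTAA\n",
]
print("".join(extract_records(fasta, ["id1", "id3"])), end="")
```

The flag is updated only on header lines, so every sequence line that follows a wanted header is printed until the next header appears.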

 

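If you have a large number of sequences that you want to extract, then you most likely have the sequence identifiers in a separate file. Assuming that you have one sequence identifier per line in the file ids.file, you can use this one-liner:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file

While Perl reads ids.file, @ARGV still holds fasta.file, so each line is chomped and stored as a key in the hash %i. Once fasta.file is being read, @ARGV is empty, and a record is printed only if its header ID is a key in %i.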
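
The two-file approach can also be sketched in Python (again only as an illustration, not a replacement for the Perl one-liner). The function name filter_fasta and the in-memory sample data are assumptions made for this example. With real data you would pass open file handles for ids.file and fasta.file instead of the StringIO objects.

```python
import io


def filter_fasta(ids_handle, fasta_handle, out_handle):
    """Copy FASTA records whose ID is listed in ids_handle to out_handle.

    Returns the number of records written.
    """
    # One identifier per line; surrounding whitespace and blank lines are ignored.
    wanted = {line.strip() for line in ids_handle if line.strip()}
    keep = False
    written = 0
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
            if keep:
                written += 1
        if keep:
            out_handle.write(line)
    return written


# In-memory stand-ins for ids.file and fasta.file (made-up sample data).
ids = io.StringIO("seqB\nseqC\n")
fasta = io.StringIO(">seqA\nAAAA\n>seqB desc\nCCCC\nGGGG\n>seqC\nTTTT\n")
out = io.StringIO()
count = filter_fasta(ids, fasta, out)
print(out.getvalue(), end="")
```

Loading the identifiers into a set keeps each lookup at roughly constant cost, which is the same reason the one-liner uses a hash instead of grep.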

 

 

Three easy ways to download multiple sequences from NCBI

There are several ways to download multiple sequences from the NCBI databases in a single request.

 

1) Using the batch Entrez website

http://www.ncbi.nlm.nih.gov/sites/batchentrez

 

2) Using Perl: (copy into your terminal and press return/enter)

perl -e 'use LWP::Simple;getstore("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=".join(",",qw(6701965 6701969 6702094 6702105 6702160)),"seqs.fasta");'

The one-liner takes the IDs, separated by spaces inside qw(...), and the name of the FASTA file to be written (seqs.fasta). To fetch data from a database other than nucleotide, change the db parameter in the URL as well.
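
To make the structure of the request URL explicit, here is a small Python sketch that assembles it from its parts. The base URL and parameter values are copied from the one-liner above. The function names build_efetch_url and download_sequences are hypothetical helpers introduced for this example, and the download helper is defined but not called here. The URL uses http as in the post, and the server may expect https today.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base URL copied from the post.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(ids, db="nucleotide", rettype="fasta", retmode="text"):
    """Return the efetch URL for the given IDs and database."""
    params = {
        "db": db,
        "rettype": rettype,
        "retmode": retmode,
        "id": ",".join(str(i) for i in ids),
    }
    # safe="," keeps the commas between IDs unescaped, as in the post's URL.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


def download_sequences(ids, filename, db="nucleotide"):
    """Hypothetical helper: save the efetch response to filename."""
    urlretrieve(build_efetch_url(ids, db=db), filename)


print(build_efetch_url([6701965, 6701969, 6702094, 6702105, 6702160]))
```

Switching databases then only means passing a different db value, for example db="protein".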

 

3) Using your browser: (paste this to the address field)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=6701965,6701969,6702094,6702105,6702160

This time the IDs are separated by commas. As before, to get data from a different database, change the db parameter in the URL.