NCBI datasets and genome assembly data


Recently, NCBI released their new datasets tool, which may eventually replace NCBI E-utils. At the moment, datasets focuses on genomes, genes, and viruses, but it will no doubt expand over time. [As an aside: I think the name is terrible, and they should use ncbi_datasets (see this tweet)]

Here is a rough guide to extracting some data about genomes using datasets.

First, we need a list of all bacterial genome assemblies. There are currently just over a million genome assemblies, and you can download the latest list:

DATE=`date +%Y%m%d`
curl -Lo assembly_summary_$DATE.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
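A quick sanity check after the download: the summary file starts with '#'-prefixed header lines, so excluding those when counting gives the number of assemblies. Here is a stand-alone sketch using a tiny mock file (point the same grep at your real assembly_summary_$DATE.txt):

```shell
# build a tiny mock of the summary file format (two header lines, two data rows)
printf '## See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt\n' > mock_summary.txt
printf '#assembly_accession\tbioproject\tbiosample\n' >> mock_summary.txt
printf 'GCA_000005825.2\tPRJNA28811\tSAMN02604289\n' >> mock_summary.txt
printf 'GCA_000006175.2\tPRJNA20527\tSAMN02598260\n' >> mock_summary.txt

# count the data rows, i.e. the number of assemblies
grep -vc '^#' mock_summary.txt
# prints 2
```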

Now we want to get just the report data about these genomes, so we can figure out which ones are worth interrogating further. In particular, we are interested in the number of contigs and the overall assembly length, but you might want other fields too. Here is how to get all of that data for a lot of organisms.

Before you begin, you’ll need to install ncbi_datasets, which is easily done with conda (or its faster drop-in replacement, mamba):

mamba create -n ncbi_datasets -c conda-forge ncbi-datasets-cli
conda activate ncbi_datasets

First, we are going to get about 10 accessions to see what happens, and then we’ll build up to getting all the accessions.

We create a variable called $ACC with the accession numbers separated by spaces.

ACC=$(head ../assembly_summary_20220130.txt | grep -v '^#' | cut -f 1 | tr '\n' ' ')
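Here is a stand-alone sketch of what that pipeline produces, using a mock file in place of assembly_summary_20220130.txt (in the real file the accessions are in column 1, below '#'-prefixed header lines):

```shell
# mock summary file: one header line, two data rows
printf '#assembly_accession\tbioproject\n' > mock_assembly_summary.txt
printf 'GCA_000005825.2\tPRJNA28811\n' >> mock_assembly_summary.txt
printf 'GCA_000006175.2\tPRJNA20527\n' >> mock_assembly_summary.txt

# same pipeline as above: drop comment lines, keep column 1, join with spaces
ACC=$(head mock_assembly_summary.txt | grep -v '^#' | cut -f 1 | tr '\n' ' ')
echo "$ACC"
# prints the accessions separated by spaces
```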

Now, we use datasets to get the genome assembly report:

datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq  --filename ncbi_data.zip

This will download three files:

  • README.md: a generic readme file
  • ncbi_dataset/data/assembly_data_report.jsonl: all the genome data, one JSON record per line (JSON Lines format)
  • ncbi_dataset/data/dataset_catalog.json: a JSON summary of what was downloaded
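Because assembly_data_report.jsonl holds one JSON record per line, plain line-oriented tools are enough for a quick peek. A minimal sketch on a mock record (the fields below are illustrative and heavily truncated; real records carry many more):

```shell
# mock one (heavily truncated) line of assembly_data_report.jsonl
printf '{"accession":"GCA_000005825.2","assemblyInfo":{"assemblyLevel":"Complete Genome"}}\n' > mock_report.jsonl

# one record per line, so grep can pull a field without a JSON parser
grep -o '"accession":"[^"]*"' mock_report.jsonl
```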

However, we don’t really need to extract the archive: we can access it directly using another NCBI Datasets tool, dataformat. Let’s extract that data into a tsv file:

dataformat tsv genome --package ncbi_data.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi_data.tsv

Note that we use awk to print only one line per accession (column 18 holds the assembly accession; without this we get many mostly redundant lines per accession).
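The awk idiom is worth unpacking: s[$18]++ counts how many times the column-18 value has been seen, and !s[$18]++ is true only on the first sighting, so only the first line per accession is printed. A stand-alone demo keyed on column 1 of a two-column mock:

```shell
# three rows, two distinct keys; only the first row per key survives
printf 'GCA_1\trowA\nGCA_1\trowB\nGCA_2\trowC\n' | awk -F"\t" '!s[$1]++ {print}'
# keeps GCA_1 rowA and GCA_2 rowC; drops the duplicate-key GCA_1 rowB
```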

This table will have the following columns:

  1. Annotation Info BUSCO Complete
  2. Annotation Info BUSCO Duplicated
  3. Annotation Info BUSCO Fragmented
  4. Annotation Info BUSCO Lineage
  5. Annotation Info BUSCO Missing
  6. Annotation Info BUSCO Single Copy
  7. Annotation Info BUSCO Total Count
  8. Annotation Info BUSCO Version
  9. Annotation Info Count Gene Non-coding
  10. Annotation Info Count Gene Other
  11. Annotation Info Count Gene Protein-coding
  12. Annotation Info Count Gene Pseudogene
  13. Annotation Info Count Gene Total
  14. Annotation Info Name
  15. Annotation Info Release Date
  16. Annotation Info Report URL
  17. Annotation Info Source
  18. Assembly Accession
  19. Assembly BioProject Lineage Accession
  20. Assembly BioProject Lineage Parent Accessions
  21. Assembly BioProject Lineage Title
  22. Assembly BioSample Accession
  23. Assembly BioSample Attribute Name
  24. Assembly BioSample Attribute Value
  25. Assembly BioSample BioProject Accession
  26. Assembly BioSample BioProject Parent Accessions
  27. Assembly BioSample BioProject Title
  28. Assembly BioSample Description Comment
  29. Assembly BioSample Description Organism Common Name
  30. Assembly BioSample Description Organism Organism Name
  31. Assembly BioSample Description Organism Pangolin Classification
  32. Assembly BioSample Description Organism Strain
  33. Assembly BioSample Description Organism Taxonomic ID
  34. Assembly BioSample Description Title
  35. Assembly BioSample Sample Identifiers Database
  36. Assembly BioSample Sample Identifiers Label
  37. Assembly BioSample Sample Identifiers Value
  38. Assembly BioSample Last updated
  39. Assembly BioSample Models
  40. Assembly BioSample Owner Contact Lab
  41. Assembly BioSample Owner Name
  42. Assembly BioSample Package
  43. Assembly BioSample Publication date
  44. Assembly BioSample Status Status
  45. Assembly BioSample Status When
  46. Assembly BioSample Submission date
  47. Assembly Blast URL
  48. Assembly Description
  49. Assembly GenBank Accession
  50. Assembly Level
  51. Assembly Linked Assembly
  52. Assembly Name
  53. Assembly Paired Accession
  54. Assembly RefSeq Accession
  55. Assembly Refseq Category
  56. Assembly Sequencing Tech
  57. Assembly Submission Date
  58. Assembly Submitter
  59. Assembly Type
  60. Assembly UCSC Assembly Name
  61. Assembly Stats Contig L50
  62. Assembly Stats Contig N50
  63. Assembly Stats Gaps Between Scaffolds Count
  64. Assembly Stats GC Count
  65. Assembly Stats Number of Component Sequences
  66. Assembly Stats Number of Contigs
  67. Assembly Stats Number of Scaffolds
  68. Assembly Stats Scaffold L50
  69. Assembly Stats Scaffold N50
  70. Assembly Stats Total Number of Chromosomes
  71. Assembly Stats Total Sequence Length
  72. Assembly Stats Total Ungapped Length
  73. Breed
  74. Common name
  75. Cultivar
  76. Ecotype
  77. Isolate
  78. Organelle Assembly Name
  79. Organelle BioProject Accessions
  80. Organelle Description
  81. Organelle Infraspecific Name
  82. Organelle Submitter
  83. Organelle Total Seq Length
  84. Organism name
  85. Sex
  86. Strain
  87. Taxonomic ID
  88. WGS contigs URL
  89. WGS project accession
  90. WGS URL
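For the contig-count and assembly-length triage mentioned earlier, columns 18 (Assembly Accession), 66 (Number of Contigs), and 71 (Total Sequence Length) are the interesting ones, e.g. cut -f 18,66,71 ncbi_data.tsv. A stand-alone check of the column arithmetic on a mock 90-column row:

```shell
# build one row with 90 tab-separated placeholder columns (col1..col90)
row=$(seq 1 90 | sed 's/^/col/' | tr '\n' '\t')

# pull the accession, contig count, and total length columns
printf '%s\n' "$row" | cut -f 18,66,71
# prints col18, col66 and col71, tab separated
```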

Running on all the accessions

Now we can put that together and run this on all the accessions at NCBI.

Note: before you start, it is imperative that you have an NCBI API key set up.

We can make a simple script to process a lot of accessions at once. From trial and error, it appears that the limit is ~500 accessions, so we set that as our limit.
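How many array tasks do we need? With NUM accessions per chunk, it is a ceiling division over the number of data rows in the summary file. A quick sketch (the hard-coded total stands in for counting the real file):

```shell
NUM=500
TOTAL=1000000   # in practice: TOTAL=$(grep -vc '^#' assembly_summary_20220130.txt)

# ceiling division: round up so a final partial chunk is not dropped
JOBS=$(( (TOTAL + NUM - 1) / NUM ))
echo $JOBS
# prints 2000
```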

Let’s write a little Slurm script that sets this up as an array job:

#!/bin/bash
#SBATCH --time=2-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# How much memory. 4 GB is more than enough for this job
#SBATCH --mem=4G

# bash strict mode
set -euo pipefail

# make sure the output directory exists, and skip chunks we have already done
mkdir -p ncbi
if [ -e ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv ]; then exit 0; fi

# sleep for up to two minutes to stagger concurrent jobs
sleep $((RANDOM%120))

# how many accessions to get at a time?
NUM=500
# the last line of our chunk; the +1 offsets past the header line
END=$(((SLURM_ARRAY_TASK_ID*NUM)+1))

# generate the list of accessions
ACC=$(head -n $END ../assembly_summary_20220130.txt | tail -n $NUM | grep -v '^#' | cut -f 1 | tr '\n' ' ')

# download the data
datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq  --filename ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip

# extract it to a tsv file
dataformat tsv genome --package  ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv

Now you can submit this as an array job, say running at most 10 tasks at once so NCBI doesn’t get too upset (adjust the array range to cover however many chunks of 500 you need):

sbatch --array=1-1000%10 extract_all_info.sh
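Once the array finishes, the per-chunk tsv files can be stitched into one table, keeping a single header line. A stand-alone demo with two mock chunks (tail -q is GNU coreutils; on BSD/macOS loop over the files instead):

```shell
# mock two per-chunk output files, each with its own header
mkdir -p ncbi
printf 'header\nrowA\n' > ncbi/ncbi_1.tsv
printf 'header\nrowB\n' > ncbi/ncbi_2.tsv

head -n 1 ncbi/ncbi_1.tsv > ncbi_all.tsv       # one copy of the header
tail -q -n +2 ncbi/ncbi_*.tsv >> ncbi_all.tsv  # everything else, headerless
cat ncbi_all.tsv
# prints:
# header
# rowA
# rowB
```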