For our SRA search engine, we want to remove the ribosomal RNA operon (not just the 16S gene, the whole operon) before we run the search; otherwise, all our hits are to the rRNA genes!
Here’s how you can use PATRIC to download a genome and remove the rRNA operon region. For the example, we’re going to use a Faecalibacterium prausnitzii genome because, well, why not!
First, we download the genome and convert the GTO to FASTA:
p3-gto 657322.3
rast-export-genome -i 657322.3.gto contig_fasta > 657322.3.fna
Next, we use a couple of helper scripts from the EdwardsLab Git Repo. We start by converting the GTO to a tab-separated file with features and their locations:
python3.7 ~/EdwardsLab/patric/parse_gto.py -f 657322.3.gto -p > 657322.3.tab
Then we can grep through that file for the ribosomal genes:
grep rna 657322.3.tab | grep Subunit
We only find two of the genes:
fig|657322.3.rna.5 Large Subunit Ribosomal RNA; lsuRNA; LSU rRNA FP929046 586941 - 589785 (-)
fig|657322.3.rna.6 Small Subunit Ribosomal RNA; ssuRNA; SSU rRNA FP929046 590567 - 591540 (-)
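If you would rather compute the cut points than read them off by eye, here is a minimal Python sketch. It assumes each hit ends with the contig, start, a dash, end, and the strand in parentheses, as in the two lines shown above; the real column layout of the parse_gto.py output may differ, so treat the pattern as an assumption. The operon_span function and its pad parameter are hypothetical names for this example, not part of the EdwardsLab scripts.

```python
import re

# Tail of each feature line as displayed above: contig, start, "-", end, "(strand)".
# This layout is an assumption based on the two example hits.
LOCATION = re.compile(r"(\S+)\s+(\d+)\s+-\s+(\d+)\s+\(([+-])\)\s*$")


def operon_span(lines, pad=0):
    """Return (contig, lowest - pad, highest + pad) over all feature lines."""
    contigs, coords = set(), []
    for line in lines:
        match = LOCATION.search(line)
        if match is None:
            continue  # skip lines that do not end in a location
        contigs.add(match.group(1))
        coords.extend([int(match.group(2)), int(match.group(3))])
    if len(contigs) != 1:
        raise ValueError(f"expected features on one contig, found {sorted(contigs)}")
    # Never let the padded start run off the beginning of the contig.
    return contigs.pop(), max(min(coords) - pad, 0), max(coords) + pad


hits = [
    "fig|657322.3.rna.5 Large Subunit Ribosomal RNA; lsuRNA; LSU rRNA FP929046 586941 - 589785 (-)",
    "fig|657322.3.rna.6 Small Subunit Ribosomal RNA; ssuRNA; SSU rRNA FP929046 590567 - 591540 (-)",
]
print(operon_span(hits, pad=10000))  # ('FP929046', 576941, 601540)
```

With pad=0 the same call returns the bare span of the two genes, 586941 to 591540.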
Now we can trim out the sequences and keep only the non-rRNA regions. Note that here I trim an extra 10,000 bp off each side of the region (586941 - 10000 = 576941 and 591540 + 10000 = 601540), but you may not wish to do that:
python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -e 576941 -c FP929046 > FP929046.fna
python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -b 601540 -c FP929046 >> FP929046.fna
We run this twice, which is suboptimal, but this is definitely not the most computationally challenging thing we will do with those sequences!
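For anyone who does want a single pass, here is a rough Python sketch of the idea. It is not how trim_fasta.py works; read_fasta, excise, and write_flanks are hypothetical helpers written for this example, and the coordinate convention (1-based, both cut positions removed) is my assumption, so check it against whatever convention your coordinates use before relying on it.

```python
import io


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from an open FASTA file."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first word of the header as the ID.
            seq_id, chunks = line[1:].split(" ")[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def excise(seq, cut_start, cut_end):
    """Return the two flanks left after removing 1-based positions cut_start..cut_end."""
    return seq[:cut_start - 1], seq[cut_end:]


def write_flanks(fasta_in, fasta_out, contig, cut_start, cut_end):
    """Write both flanks of one contig in a single pass over the input."""
    for seq_id, seq in read_fasta(fasta_in):
        if seq_id != contig:
            continue  # only the named contig is written
        left, right = excise(seq, cut_start, cut_end)
        fasta_out.write(f">{seq_id}_left\n{left}\n")
        fasta_out.write(f">{seq_id}_right\n{right}\n")


# Toy demonstration on an in-memory FASTA; with real data, pass open file handles.
toy = io.StringIO(">c1 toy contig\nAAAACCCC\nGGGGTTTT\n>c2\nACGT\n")
result = io.StringIO()
write_flanks(toy, result, "c1", 5, 12)
print(result.getvalue(), end="")
```

The two flanks are written as separate records (with _left and _right suffixes, again just a naming choice for this sketch) so that no artificial junction is created where the operon used to be.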