Monthly Archives: July 2014

Changing the label of paired-end sequences in FASTQ files

Programs such as MIRA require that paired-end reads are labeled accoring to specific rules. If your paired-end reads, for example, have the same name and are not marked with either /1 and /2 or _F and _R, you can add this using this oneliner:

cat file_1.fastq | paste - - | sed 's/^\(\S*\)/\1\/1/' | tr "\t" "\n" > file_1_renamed.fastq

cat file_2.fastq | paste - - | sed 's/^\(\S*\)/\1\/2/' | tr "\t" "\n" > file_2_renamed.fastq

The cat command will print the file content (to STDOUT).

The paste command will join two lines of a FASTQ file into a single line

The sed command will add the /1 or /2 to the sequence identifiers

The tr command will replace the tabs with line breaks, which is basically an undo of the paste command (in a simplified explanation).

The “>” sign will write the renamed output to the file specified after it.


This assumes that you have a sequence identifier for both the sequence and the quality line. If you have an empty header line for the quality entry (just +), then you can use “paste – – – -” instead of “paste – -” to rename your FASTQ files.

Open Source Terminal Applications

I came across this article that describes a few open source productivity/system-monitoring terminal applications. I thought they may be useful to some of us who love to use Terminal for Linux, Unix, or OSX.

I haven’t tried any of them myself but maybe those who do or have in the past can shed some light on them and share their thoughts.

Sorting FASTQ files by their sequence identifiers

In certain cases, you need to sort FASTQ files by their sequence identifiers (e.g. to fix the order of paired-end or mate-pair sequences). There are several ways of sorting the FASTQ files, but the simplest way is usually the best. Here is a one liner to do the job:

cat file.fastq | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" > file_sorted.fastq

The cat command will print the file content (to STDOUT).

The paste command will join the four lines of a FASTQ entry into a single line, each original line separated by a tab.

The sort command will sort each line using everything before the first space (which is our sequence identifer).

The tr command will replace the tabs with line breaks, which is basically an undo of the paste command (in a simplified explanation).

The “>” sign will write the sorted output to the file specified after it.

Model Seed Tips

The model seed provides a suite of tools for generating metabolic models and working with those models. After the readmore we will demonstrate how to use some of those tools.

Setting up the system.

First, you need to get the model seed core infrastructure. It is available from GIT. There are several dependencies that have to be met before the code will run, and I have installed that code on goldeneye. [At the moment it is not running on the central data servers (rambox et al) because of an issue running fork() over NFS that I can’t resolve.]

On goldeneye, the code is in /data/ModelSeed/. You need to start by sourcing the environment variables:

source /data/ModelSeed/ModelSEED/bin/

That should set you up with the appropriate paths to run the model seed code, most of which runs via the ms command. For FBA you need to add a single additional command before you can get started:

ms config MFATK_BIN=/data/ModelSeed/MFAToolkit/bin/mfatoolkit

This sets the path to the MFA toolkit (obviously!)

You should be able to start with a really simple command, like this one:

ms list

Try that, and if it doesn’t work, let Rob know!

The first thing to do is to add a database that you can use for local storage. This will be a place where biochemistry, media, genomes, and other objects are stored into. The commands are:

ms stores add local --type file

And then you should be able to list the stores:

ms stores list -v

Next, we will log in to RAST so you can access your genomes. You can use this command, replacing username with your username:

ms login username

And now we have logged in, we can import some biochemistry (the latest greatest models), and the mappings between biochemistry, reactions, roles, etc.

ms import biochemistry main -v

ms import mapping main -b main -v


At this point, you should peruse some of the ModelSeed tutorial pages on their wiki at github. We will continue with some examples that demonstrate how far we have got!

E. coli model

We will start with importing the E. coli genome annotation:

ms import annotation 83333.1 ecoli -m main -v

(note the -v at the end enables verbose output and prints a bunch of progress messages).

That will allow us to build a new genome model:

ms genome buildmodel ecoli ecoli-model -v

We can have a look at our model and make sure it is there:

$ ms list model | grep -i coli

… and run the model using Argonne’s LB Media formulation:

ms model runfba ecoli-model > out.txt

The output here is a bunch of JSON text which is not as clear as it could be. In a second we will see how to output HTML code.

The objectValue entry shows you the output of the model. In my run, this was 7.75482e-26 which is essentially 0 … i.e. the model didn’t grow!

Since we don’t have growth, we need to gap filll our E. coli model:

ms model gapfill ecoli-model -v -o

In this case, the -o means overwrite the existing model. You can also save the model with a new name. To find out the parameters that this, or any ms command takes, you call the command with an insufficient number of parameters. For example:

ms model gapfill

This has generated one or more solutions (in my case, it just generated a single solution), and so now we need to choose one and integrate it into the model.

My solution was:

New gapfilling solution: GF0.0
Solution cost:3
Gapfilled reactions 3{
        ID Direction Equation
        rxn00313 < (1) H+ + (1) meso-2,6-Diaminopimelate <=> (1) CO2 + (1) L-Lysine
        rxn09345 > (1) Farnesyldiphosphate + (8) Isopentenyldiphosphate <=> (1) Bactoprenyl diphosphate + (8) PPi
        rxn05039 > (1) 5-Amino-6--5-phosphoribitylaminouracil + (1) H2O <=> (1) 4--1-D-Ribitylamino-5-aminouracil + (1) H+ + (1) Phosphate


Since we only have one solution, we can use that. We start with managesol which if we don’t provide any options will just list the possibilities

ms model managesol ecoli-model 

Then we integrate our solution:

ms model managesol ecoli-model -s GF0.0

Now we can re-run the FBA and print the output in HTML format. The output from this run is available online.

model runfba ecoli-model --html > ecoli_gf.html

The ouput is given in this box:

resultObjective Max { bio1_biomassflux } = 2.49999

This suggests that E. coli is growing in complete media.

Citrobacter sedakii model

We have sequenced the Citrobacter sedakii genome, and I’m interested in comparing its model to that of growth in different media. I have annotated the genome in RAST, and so we can download and import that annotation:

ms import annotation 67826.3 C_sedakii -m main -v

again, we build a model:

ms genome buildmodel C_sedakii csedakii -v

and gapfill that model:

 model gapfill csedakii -v -o

and generate an HTML output of growth on ArgonneLBMedia:

model runfba csedakii  --media ArgonneLBMedia --html > csedakii.html

(These are all the same commands that we ran before!)

We can also run a model for each of the media that we have in the database. I created a script, /data/ModelSeed/bin/ that generates a list of all media and its appropriate ID, and then another script, /data/ModelSeed/bin/ that will use each of those media to create a model, gap fill it, and save it with a name. However, at the moment none of these generate models that predict growth under these conditions.