SGE Array Jobs

How to create an array job for the cluster using SGE

 

Often you have many jobs that are all the same, for example when you want to BLAST a series of files against the same database. Here is how to set that up as an array job.

First, you need to know about environment variables. In an array job, each task runs with the environment variable $SGE_TASK_ID set to a unique number. The numbers come from a range that you define, and step through that range by an increment that you also define.

To submit an array job, we use the -t flag in our qsub command:

This will submit an array job where $SGE_TASK_ID takes every value from 1 to 100, in steps of 1: qsub -t 1-100:1

This will submit an array job where $SGE_TASK_ID runs from 1 to 1000 in steps of 10 (1, 11, 21, ..., 991), which gives 100 tasks: qsub -t 1-1000:10
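To check which task IDs a given -t range will produce before submitting anything, the range can be expanded locally with seq. This is only a sketch: the function name preview_task_ids is made up for illustration and is not part of SGE, and it assumes seq is installed.

```shell
#!/bin/bash
# preview_task_ids FIRST LAST [STEP]
# Prints the values $SGE_TASK_ID would take for "qsub -t FIRST-LAST:STEP".
# Hypothetical helper for local checking, not an SGE command.
preview_task_ids() {
    local first=$1 last=$2 step=${3:-1}
    seq "$first" "$step" "$last"
}

# First three task IDs and the total number of tasks for -t 1-1000:10
preview_task_ids 1 1000 10 | head -n 3
preview_task_ids 1 1000 10 | wc -l
```

With a step larger than one, a common pattern is for each task to handle the block of items that starts at $SGE_TASK_ID. Inside the job, SGE also sets $SGE_TASK_FIRST, $SGE_TASK_LAST and $SGE_TASK_STEPSIZE, which a script can use to work out the size of its block.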

The range can be any set of numbers you define. A single array job is limited to 75000 tasks, but you can submit a second array job with task numbers from 75001 onwards.
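If you have more inputs than fit in one array job, the ranges for the follow-up jobs can be worked out with a few lines of bash. The sketch below assumes the 75000 limit mentioned above. The function name array_ranges and the script name myjob.sh are placeholders. It prints the qsub commands rather than running them, so you can inspect them first.

```shell
#!/bin/bash
# array_ranges TOTAL [LIMIT]
# Prints one "first-last" range per line, each covering at most LIMIT tasks.
# Hypothetical helper; LIMIT defaults to the 75000 limit described above.
array_ranges() {
    local total=$1 limit=${2:-75000}
    local first=1 last
    while [ "$first" -le "$total" ]; do
        last=$(( first + limit - 1 ))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "${first}-${last}"
        first=$(( last + 1 ))
    done
}

# Show the submissions needed for 160000 inputs (myjob.sh is a placeholder).
for range in $(array_ranges 160000); do
    echo qsub -S /bin/bash -t "$range" myjob.sh
done
```

Because $SGE_TASK_ID keeps its absolute value in every job (75001 is still 75001 in the second job), the same script can be used for all of the submissions.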

Now all you need is a script that picks out the right input for each task and processes it. There are several ways to do this. One approach is to number all of your input files, then in your script you can replace the number with $SGE_TASK_ID:

#!/bin/bash
# Task N reads N.fasta and writes N.blast
blastn -query "$SGE_TASK_ID.fasta" -db nr -out "$SGE_TASK_ID.blast"

You can also list all the files that you want to process in a file, one per line, and use head and tail to pick out the line for the current task:

#!/bin/bash
# Take line number $SGE_TASK_ID from the list of input files
input=$(head -n "$SGE_TASK_ID" file_of_files | tail -n 1)
blastn -query "$input" -db nr -out "$input.blast"
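One thing to watch with head and tail: if the task ID is larger than the number of lines in the list, head returns the whole file and tail then returns the last line, so the last input is silently processed again. A possible alternative, sketched here with a made-up helper called nth_line, is to use sed to print exactly one line and to fail when that line does not exist.

```shell
#!/bin/bash
# nth_line N FILE
# Prints line N of FILE, or returns an error if that line is missing or empty.
# Hypothetical helper; inside a job N would be $SGE_TASK_ID.
nth_line() {
    local n=$1 file=$2 line
    line=$(sed -n "${n}p" "$file")
    if [ -z "$line" ]; then
        echo "no entry on line $n of $file" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# Demo with a throwaway three-line list.
list=$(mktemp)
printf '%s\n' a.fasta b.fasta c.fasta > "$list"
nth_line 2 "$list"
nth_line 5 "$list" || echo "task 5 has no input"
rm -f "$list"
```

In a job script this could be used as input=$(nth_line "$SGE_TASK_ID" file_of_files) || exit 1, so that a task with no matching line stops instead of repeating work.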

Another way to do it is to have a file with all the commands, one per line, and use head and tail to get the command for the current task:

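#!/bin/bash
# Take line number $SGE_TASK_ID from the list of commands
cmd=$(head -n "$SGE_TASK_ID" file_of_commands | tail -n 1)
# Run the line as a shell command; programs must be on your PATH or given with a path
eval "$cmd"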
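The file of commands itself can be generated with a short loop. The following is a minimal sketch, assuming the inputs are .fasta files in a single directory with no spaces in their names. The function name make_commands is made up for illustration, and the blastn options shown are the BLAST+ ones (-query, -db, -out).

```shell
#!/bin/bash
# make_commands DIR
# Prints one blastn command per .fasta file found in DIR.
# Hypothetical helper; assumes file names contain no spaces.
make_commands() {
    local dir=$1 f
    for f in "$dir"/*.fasta; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        echo "blastn -query $f -db nr -out $f.blast"
    done
}

# Example use (writes the list, then shows how many tasks to request):
#   make_commands /path/to/fasta_dir > file_of_commands
#   wc -l < file_of_commands
```

The number of lines in file_of_commands is the upper end of the range to pass to qsub -t.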

NOTE: All of these examples use bash. Be sure to include -S /bin/bash in your qsub command so that they run under the bash shell.
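As an alternative to typing the options every time, qsub also reads options from lines in the script that start with #$. The sketch below assumes a standard Grid Engine setup. The range 1-100 is only an example and resolve_task_id is a made-up helper. The helper makes the script usable for a quick test outside the queue, where $SGE_TASK_ID is not set (for non-array jobs, SGE is expected to set it to the string "undefined").

```shell
#!/bin/bash
# Lines starting with "#$" are read by qsub as if given on the command line.
# Run the job under bash:
#$ -S /bin/bash
# Array range, same syntax as qsub -t (example value):
#$ -t 1-100
# Start in the directory the job was submitted from:
#$ -cwd

# resolve_task_id: print $SGE_TASK_ID, or 1 when it is unset or "undefined".
# Hypothetical helper so the script can also be run by hand for testing.
resolve_task_id() {
    if [ -z "${SGE_TASK_ID:-}" ] || [ "$SGE_TASK_ID" = "undefined" ]; then
        echo 1
    else
        echo "$SGE_TASK_ID"
    fi
}

task_id=$(resolve_task_id)
echo "processing task $task_id"
```

Options given on the qsub command line take precedence over the ones embedded in the script, so the range can still be changed at submission time.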