Monthly Archives: July 2023

bash

awk use tab as input separator

By default, awk uses any whitespace to separate the input text. You can use the same construct as perl to change the input Field separator:

cat file.tsv | awk -F"\t" '!s[$1]'

The above example will split a file on tabs and then print the unique entries in the first column.

SQLite banner image

Automatically create a SQLlite database from a gzip file

This is a neat trick in a couple of steps to create a SQLite database from a gzip file.

First, make sure your text file is tab-separated, and has a header row that you would like to use as the table columns.

First, we make a FIFO object and stream the gzip file to that. Note that we background the zcat command here.

mkfifo tempfile
zcat sprot.uniprot.proc.gz > tempfile &

This doesn’t actually uncompress your file, it just sets a 0 byte file object that will read the uncompressed file.

Now, we open SQLIte3 with the name of our (new or existing) database.

sqlite3 db.sqlite

And then set tabs mode for the import and read the tempfile filehandle into a new table.

.mode tabs
.import tempfile newtable

SQLite3 will automatically create the table with column names the same as the headers, and the data will be loaded.

You can check by looking at the .schema

Adapter trimming

“plus ça change, plus c’est la même chose “

Jean-Baptiste Alphonse Karr

So its 2023 and the new [sequencing] kid on the block, MGI, hasn’t figured out adapter trimming.

Here is a quick primer [pun intended] on the easiest way to remove primers and filter reads.

Use fastp and give it this file of Illumina adapters to trim against.

I use this command to filter and remove adapters from our sequences

mkdir output
fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o output/$R1 -O output/$R2 --adapter_fasta IlluminaAdapters.fa

If you have a lot of files, you can easily wrap this in a for loop to process all the files in a directory

mkdir fastq_fastp
for R1 in $(find fastq -name \*R1\* -printf "%f\n"); do
    R2=${R1/R1/R2}
    fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o fastq_fastp/$R1 -O fastq_fastp/$R2 --adapter_fasta IlluminaAdapters.fa 
done

In addition to trimming all adapter sequences, this will remove any sequence with an N (yes, one or more), and remove any sequence shorter than 100 bp. fastp will also make sure that your reads are paired as well!