Finding Genes which occur in two lists of data using the comm command.

 

TL;DR

  • Use comm to find overlapping genes
  • Both lists must be sorted beforehand
user@host:~$ comm -1 -2 list1.txt list2.txt

Just a quick note on this as it is a task that crops up quite often. You have two lists of genes (or any data really, but today it's genes); maybe these are genes showing a certain level of expression in two different tissues. You want to know which of these genes occur in both lists. I don't know why I have never come across the comm command before; previously I used grep and awk to achieve this, but comm is much easier.

My lists look like this:

user@host:~$ cat list1.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
AT1G01780
user@host:~$ cat list2.txt
AT1G01225
AT1G01780
AT1G02000
AT1G02570
AT1G02580
AT1G02610
AT1G02640
AT1G02650
AT1G03230
AT1G03270

Using the comm command I can find out which genes occur only in list1, which occur only in list2, and which appear in both files. This is done like so:

user@host:~$ comm list1.txt list2.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
	AT1G01225
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
		AT1G01780
	AT1G02000
	AT1G02570
	AT1G02580
	AT1G02610
	AT1G02640
	AT1G02650
	AT1G03230
	AT1G03270

....that looks confusing. Let me explain. What is happening is that the comm output is three tab-separated columns:

<uniq_to_file1> <uniq_to_file2> <in_both_files>

So what you do is suppress the columns that you don't want:

user@host:~$ comm -1 -2 list1.txt list2.txt
AT1G01780

In this case columns 1 and 2, leaving only the genes that appear in both files.
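The same trick works the other way round; for instance, to get the genes unique to list1, suppress columns 2 and 3 instead (the flags can also be combined into one):

```shell
# genes found only in list1: hide column 2 (unique to list2) and column 3 (common)
comm -2 -3 list1.txt list2.txt
# the combined short form does the same thing
comm -23 list1.txt list2.txt
```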

You need to remember to sort your lists beforehand, otherwise comm will complain.
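If they aren't sorted yet, that's a one-liner to fix; a quick sketch, where the unsorted filenames are just placeholders:

```shell
# comm expects both inputs in sorted order
sort genes_tissue1_unsorted.txt > list1.txt
sort genes_tissue2_unsorted.txt > list2.txt
comm -1 -2 list1.txt list2.txt
```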

BIOINFORMATICS!

 

 

Bowtie1 - RTFM

 

I've never used bowtie1 before, but in order to follow some downstream analysis used in our lab I needed to. With no previous bowtie1 commands to refer to, I set forth and played around with it.

My initial run of bowtie (since no one ever really reads the manual, right?) was the most basic command, just to see what happens.

bowtie --threads 1 -a TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie

The only two flags I cared about at this point, besides -1 and -2 for the paired-end reads, were --threads for the number of CPU cores and -a (--all) to output all valid alignments.

The first run I did returned the following:

Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2314:3394:96646 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7908067); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2315:3400:64037 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7937433); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:2079:14758 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7965462); skipping read
............
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:14696:46094 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7983119); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:13310:65720 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7994229); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:11742:71964 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7997881); skipping read
# reads processed: 8014358
# reads with at least one reported alignment: 143655 (1.79%)
# reads that failed to align: 7870703 (98.21%)
Reported 283478 paired-end alignments to 1 output stream(s)

Okay, that's not good. My first thought was to increase the amount of RAM requested from SLURM from 16GB to 32GB, but that produced the same result. I should have first checked how much RAM was actually being used: less than 200MB. I then found the --chunkmbs flag in the bowtie1 manual and set it to 128, as the default is 64. This reduced the number of "Exhausted best-first chunk memory" warnings but didn't get rid of them completely, so I increased it to 256 and away they went. However, this did not change the alignment results.

# reads processed: 8014358
# reads with at least one reported alignment: 143655 (1.79%)
# reads that failed to align: 7870703 (98.21%)
Reported 283478 paired-end alignments to 1 output stream(s)

So my next thought, on looking at more parameters in the manual, was that bowtie was being too strict in looking for exact matches. The -v flag allows a set number of mismatches, e.g. -v 2 allows a maximum of two. I chose two at this point simply because it was in one of the bowtie examples (what's an appropriate number of mismatches before you're aligning reads to anything?). Regardless, this made no real difference: a few more aligned reads but no change in percentage. I did the same with --tryhard for a more exhaustive search; again, nothing.

I then thought, screw it, google time. I found a great Biostars discussion which mentions really low alignment rates caused by a low maximum insert size. The default --maxins is set very low at 250. In the discussion someone suggests putting in a high value of 1000 and seeing if that helps. So, that's what I did.
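Concretely, that just means adding --maxins to the earlier command; this is a sketch using the same placeholder filenames as before, not the exact line from my history:

```shell
bowtie --chunkmbs 256 --maxins 1000 --threads 1 --all -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie
```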

# reads processed: 8014358
# reads with at least one reported alignment: 4706986 (58.73%)
# reads that failed to align: 3307372 (41.27%)
Reported 6040941 paired-end alignments to 1 output stream(s)

Okay, that's made a significant difference. It does still leave something unanswered for me though: what is the correct insert size for my data? Is an insert size of 1000 on the low side of the insert distribution? Should I set it to 10,000 and see what happens?

WARNING: RANT ALERT

Like all bioinformatics software, even tools that have been around for a long time like bowtie, it's important to know what you're actually trying to do and what the parameters mean. There is a really nice blog post by Keith Bradnam about the dangers of using default parameters:

http://www.acgt.me/blog/2015/4/27/the-dangers-of-default-parameters-in-bioinformatics-lessons-from-bowtie-and-tophat

Coming from the pure computing side (I'm the guy who looks after the big expensive boxes with flashing lights on them), often the biologist just wants me to run the software so that they don't have to touch the command line. The problem here is that just because I know *puters* doesn't mean that I know how the software works, what it is trying to do, what its parameters mean, anything about your data, how it was prepped or sequenced, the organism or treatment, or anything other than what the man page says. Coupled with this, does the biologist even know? I've often asked what parameters are appropriate for their data, and I very rarely get an answer I can work with.

Anyway....

PHEW: RANT OVER

In order to work this out I followed the method outlined in the Biostars discussion, helpfully committed to GitHub. I had to tweak a couple of bits though; here is my SLURM submission.

#!/bin/bash -e
#SBATCH -p short-queue
#SBATCH --mem-per-cpu=1000
#SBATCH --cpus-per-task=4

source samtools-1.3.1
source bowtie-1.1.2
source picard-1.134
source R-3.2.3

srun bowtie --chunkmbs 200 --maxins 1000 --threads 4 --all -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq -S sample.sam
srun samtools view -bS sample.sam | samtools sort -o sample.bam -

##note, someone has written a nice wrapper on our HPC for picard so that you don't have to type the whole java -jar $PATH_TO_PICARD
srun picard CollectInsertSizeMetrics I=sample.bam O=sample.txt H=sample.hist ASSUME_SORTED=true

 

[Figure: insert size distribution histogram produced by Picard's CollectInsertSizeMetrics]

Perfect! With this, one can now set a max and min insert size to cover the majority of this curve, e.g. 200 to 600. While I have been successful in determining sensible insert size settings, there are many, many more settings that I have completely ignored at this point. Maybe the next time I get a near-zero alignment rate I will play with more of them.
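So the follow-up run, assuming the 200 to 600 bracket suggested by the histogram, would pin both ends of the range with --minins and --maxins; again a sketch with the same placeholder filenames, not a command I have verified on this dataset:

```shell
bowtie --chunkmbs 256 --minins 200 --maxins 600 --threads 4 --all -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq -S sample.sam
```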

 
