I'd never used bowtie1 before, but in order to follow some downstream analysis used in our lab I needed to. With no previous bowtie1 commands to refer to, I set forth and played around with it.
My initial run of bowtie (since no one ever really reads the manual first, right?) was the most basic command, just to see what happens.
bowtie --threads 1 -a TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie
The only two flags I cared about at this point, besides -1 and -2 for the paired-end reads, were --threads for the number of CPU cores and --all (-a) to output all valid alignments.
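(A quick aside: TAIR10 here is the basename of a pre-built bowtie1 index, not a FASTA file. If you need to build one yourself it's a one-liner with bowtie-build; TAIR10.fa below is just a placeholder for wherever your genome FASTA lives.)

bowtie-build TAIR10.fa TAIR10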
The first run I did returned the following:
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2314:3394:96646 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7908067); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2315:3400:64037 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7937433); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:2079:14758 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7965462); skipping read
............
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:14696:46094 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7983119); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:13310:65720 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7994229); skipping read
Warning: Exhausted best-first chunk memory for read HWI-D00742:132:C9JLJANXX:7:2316:11742:71964 1:N:0:TCCGGAGAGTACTGAC/1 (patid 7997881); skipping read
# reads processed: 8014358
# reads with at least one reported alignment: 143655 (1.79%)
# reads that failed to align: 7870703 (98.21%)
Reported 283478 paired-end alignments to 1 output stream(s)
Okay, that's not good. My first thought was to increase the amount of RAM I'd requested from SLURM from 16GB to 32GB, but that produced the same result; I should have first checked how much RAM was actually being used, which turned out to be less than 200MB. I then found the --chunkmbs flag in the bowtie1 manual and set it to 128 (the default is 64). This reduced the number of exhausted best-first chunk memory warnings but didn't get rid of them completely, so I increased it to 256 and away they went..... but this did not change the alignment results.
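For reference, the command at this point looked something like this, the --chunkmbs value being the only change from the first run:

bowtie --threads 1 --chunkmbs 256 -a TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie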
# reads processed: 8014358
# reads with at least one reported alignment: 143655 (1.79%)
# reads that failed to align: 7870703 (98.21%)
Reported 283478 paired-end alignments to 1 output stream(s)
So my next thought, on looking at more parameters in the manual, was that the search was too strict in looking for exact matches. Adding the -v flag allows up to X mismatches, e.g. -v 2 allows a maximum of two mismatches. I chose two at this point simply because it was used in one of the bowtie examples; what's an appropriate number of mismatches before you're aligning reads to just anything? Regardless, this made no real difference (a few more aligned reads but no change in percentage). I did the same with --tryhard, which does a more exhaustive search; again, nothing.
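In case it's useful, the variant with those extra flags would have looked roughly like this:

bowtie --threads 1 --chunkmbs 256 -a -v 2 --tryhard TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie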
I then thought, screw it, Google time. I found a great Biostars discussion which mentioned really low alignment rates caused by a low maximum insert size: the default --maxins is set very low at 250. In the discussion someone suggested putting in a high value of 1000 and seeing if that helps. So, that's what I did.
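Concretely, keeping the flags from before, something along these lines:

# raise the maximum paired-end insert size from the default 250 to 1000
bowtie --threads 1 --chunkmbs 256 --maxins 1000 -a -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq > alignment.bowtie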
# reads processed: 8014358
# reads with at least one reported alignment: 4706986 (58.73%)
# reads that failed to align: 3307372 (41.27%)
Reported 6040941 paired-end alignments to 1 output stream(s)
Okay, that made a significant difference. It does still leave something unanswered for me though: what is the correct insert size for my data? Is an insert size of 1000 on the low side of the insert distribution? Should I set it to 10,000 and then see what happens?
WARNING: RANT ALERT
As with all bioinformatics software, even tools that have been around for a long time like bowtie, it's important to know what you're actually trying to do and what the parameters mean. There is a really nice blog post by Keith Bradnam about the dangers of using default parameters.
Coming from the pure computing side (I'm the guy who looks after the big expensive boxes with flashing lights on them), I find the biologist often just wants me to run the software so that they don't have to touch the command line. The problem here is that just because I know *puters* doesn't mean I know how the software works, what it's trying to do, what its parameters mean, anything about your data, how it was prepped or sequenced, the organism, the treatment, or anything beyond what the man page says. Coupled with this, does the biologist even know? I've often asked what parameters are appropriate for their data, and I very rarely get an answer I can work with.
PHEW: RANT OVER
In order to work this out I followed the method outlined in the discussion on Biostars, helpfully committed to GitHub. I had to tweak a couple of bits though; here is my SLURM submission.
#!/bin/bash -e
#SBATCH -p short-queue
#SBATCH --mem-per-cpu=1000
#SBATCH --cpus-per-task=4

source samtools-1.3.1
source bowtie-1.1.2
source picard-1.134
source R-3.2.3

srun bowtie --chunkmbs 200 --maxins 1000 --threads 4 --all -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq -S sample.sam
srun samtools view -bS sample.sam | samtools sort - -o sample.bam
## note, someone has written a nice wrapper on our HPC for picard so that you don't have to type the whole java -jar $PATH_TO_PICARD
srun picard CollectInsertSizeMetrics I=sample.bam O=sample.txt H=sample.hist ASSUME_SORTED=true
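As a side note, CollectInsertSizeMetrics writes the histogram to a chart file (sample.hist above) and the numbers to a text metrics file (sample.txt). If you just want the headline figures without opening the chart, something like this works; it assumes picard's usual metrics layout, where the values sit on the line directly after the column-header row beginning MEDIAN_INSERT_SIZE:

# print the metrics header row plus the line of values that follows it
grep -A 1 "^MEDIAN_INSERT_SIZE" sample.txt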
Perfect! With this, one can now set max and min insert sizes that cover the majority of the curve, e.g. 200 to 600. While I've been successful in determining the correct settings for the insert size, there are many, many more settings that I've completely ignored at this point. Maybe the next time I get a near-zero alignment rate I'll play with more of them.
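For the record, feeding a pair of bounds like that back into bowtie would look something like the following; 200 and 600 are just the example bounds from above, so substitute whatever your own histogram suggests:

# constrain paired-end alignments to inserts between 200 and 600bp
bowtie --threads 4 --chunkmbs 256 --minins 200 --maxins 600 --all -v 2 TAIR10 -1 sample_R1.fq -2 sample_R2.fq -S sample.sam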