So I didn't have to bother installing this as it was already on our HPC, so I simply had to 'source' the software. Before this though, I need to prep my data. I am going to run Kallisto on some Arabidopsis RNAseq data from a previous publication. To begin with I need to prep a 'transcript' fasta file. Using the TAIR10 gff and fa file along with bedtools I am able to create this.

Getting the genes from TAIR10

Since the TAIR10_GFF_genes.gff file contains all of the features;

[[email protected] kallisto_test]$ head TAIR10_GFF3_genes.gff 
Chr1 TAIR10 chromosome 1 30427671 . . . ID=Chr1;Name=Chr1
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1

we only really need the genes. So using awk we can get just the gene features (piped to head for this example,).

[[email protected] kallisto_test]$ awk '{if($3=="gene") print $0}' TAIR10_GFF3_genes.gff >TAIR10_GFF3-gene_only.gff

[[email protected] kallisto_test]$ head TAIR10_GFF3-gene_only.gff
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 gene 5928 8737 . - . ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR10 gene 11649 13714 . - . ID=AT1G01030;Note=protein_coding_gene;Name=AT1G01030
Chr1 TAIR10 gene 23146 31227 . + . ID=AT1G01040;Note=protein_coding_gene;Name=AT1G01040
Chr1 TAIR10 gene 28500 28706 . + . ID=AT1G01046;Note=miRNA;Name=AT1G01046
Chr1 TAIR10 gene 31170 33153 . - . ID=AT1G01050;Note=protein_coding_gene;Name=AT1G01050
Chr1 TAIR10 gene 33379 37871 . - . ID=AT1G01060;Note=protein_coding_gene;Name=AT1G01060
Chr1 TAIR10 gene 38752 40944 . - . ID=AT1G01070;Note=protein_coding_gene;Name=AT1G01070
Chr1 TAIR10 gene 44677 44787 . + . ID=AT1G01073;Note=protein_coding_gene;Name=AT1G01073
Chr1 TAIR10 gene 45296 47019 . - . ID=AT1G01080;Note=protein_coding_gene;Name=AT1G01080

Creating a fasta file from the TAIR10 GFF and fasta file.

This is relatively easy using bedtools;

bedtools getfasta -name -fi TAIR10_reference.fas -bed TAIR10_GFF3-gene_only.gff -fo tair10_transcripts.fa

However the -name flag uses column $3 of the GFF file to label the sequences in the fasta file, which will look like this;

[[email protected] kallisto_test]$ head tair10_transcripts.fa|head -4
>gene
AAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAA
>gene
GAGTAAAGGGTGTTGGCTCAAAGATTCATTTAGCAACCAAAGTTGCATTTGCAAGGAAATGAAAAGGTGTTAACAATGTCACTGCGTAACATGACTTCGATACAAAATCTCAAACAATACATCTTCAACTGTGGATTATGATGGACTTGGGGTTGCAGGTGATATGTCTATAGAAAAACGGGTGGGAATGGAAAATGGCTGAAGAAGAAAGCTCCAACTAAGCATCATAACTGGATTGTTTTAGAGGAGATGAGTCAAAGGAATTCACAGTGGAACTATCTCAAGAACATGATCATTGGCTTCTTATTGTTCATCTCCATCATTGGCTGGATCATTCTGGTTCAAGAGGTCAAATTATATATACATAACG

This isn't ideal. So I will add the gene names to column $3 of the gff file from column $9.

So this one-liner simply splits column $9 by the ";" character and puts it into the array labelled 'a'. Then I simply print out all 9 columns but instead of $3 I replace it with a[1], which contains the gene ID.

[[email protected] kallisto_test]$ cat TAIR10_GFF3-gene_only.gff | awk '{split($9,a,";"); print $1"\t"$2"\t"a[1]"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}' > TAIR10_GFF3-gene_only_withnames.gff
[[email protected] kallisto_test]$ head TAIR10_GFF3-gene_only_withnames.gff
chr1	TAIR10	ID=AT1G01010	3631	5899	.	+	.	ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
chr1	TAIR10	ID=AT1G01020	5928	8737	.	-	.	ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
chr1	TAIR10	ID=AT1G01030	11649	13714	.	-	.	ID=AT1G01030;Note=protein_coding_gene;Name=AT1G01030
chr1	TAIR10	ID=AT1G01040	23146	31227	.	+	.	ID=AT1G01040;Note=protein_coding_gene;Name=AT1G01040
chr1	TAIR10	ID=AT1G01046	28500	28706	.	+	.	ID=AT1G01046;Note=miRNA;Name=AT1G01046
chr1	TAIR10	ID=AT1G01050	31170	33153	.	-	.	ID=AT1G01050;Note=protein_coding_gene;Name=AT1G01050
chr1	TAIR10	ID=AT1G01060	33379	37871	.	-	.	ID=AT1G01060;Note=protein_coding_gene;Name=AT1G01060
chr1	TAIR10	ID=AT1G01070	38752	40944	.	-	.	ID=AT1G01070;Note=protein_coding_gene;Name=AT1G01070
chr1	TAIR10	ID=AT1G01073	44677	44787	.	+	.	ID=AT1G01073;Note=protein_coding_gene;Name=AT1G01073
chr1	TAIR10	ID=AT1G01080	45296	47019	.	-	.	ID=AT1G01080;Note=protein_coding_gene;Name=AT1G01080

Now let's rerun our bedtools line 

[[email protected] kallisto_test]$ bedtools getfasta -name -fi TAIR10_reference.fas -bed TAIR10_GFF3-gene_only_withnames.gff -fo tair10_transcripts.fa

And we get what we need :-)

ID=AT1G01010
AAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAATGTAATAATAATTAGTGCATCGTTTTGTGGTGTAGTTTATATAAATAAAGTGATATATAGTCTTGTATAAG
>ID=AT1G01020
GAGTAAAGGGTGTTGGCTCAAAGATTCATTTAGCAACCAAAGTTGCATTTGCAAGGAAATGAAAAGGTGTTAACAATGTCACTGCGTAACATGACTTCGATACAAAATCTCAAACAATACATCTTCAACTGTGGATTATGATGGACTT

Building Kallisto Index

This is very similar to most bioinformatics packages, e.g. bowtie, blast etc. Since I do most of my work on our HPC and our interactive nodes are capped to 2GB ram otherwise processes get killed, I submitted the following slurm job.

#!/bin/bash

#SBATCH --partition medium
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=4000
#SBATCH --job-name="kallisto-idx"

source source kallisto-0.43.0

srun kallisto index -i tair10_transcripts.idx tair10_transcripts.fa

This took just under 3 mins to run and used 1.7GB RAM (despite our interactive node killing the process earlier).

Running Kallisto

#!/bin/bash

#SBATCH --partition medium
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=8000
#SBATCH --job-name="kallisto-idx"

source source kallisto-0.43.0

srun kallisto quant -t 4 -b 4 -i tair10_transcripts.idx -o Rep1 --single -l 49 -s 20 Rep1_L2_1.fq.gz

And that's it, it's really quick and you get an abundance tsv file. At this point it's time to begin looking at the data. I will write more about this in a later post.

NOTE: To add to this post as I've got other things to do today.

  • The bootstrap flag -b
  • Quickly getting -l and -s information
  • Multiple fasta files for the same sample.