Weird awk error when messing around with making a GFF from a TXT file

 

I hit a strange error while working with a text file created from STATA in Windows. When I used awk to create the 9-column GFF file to use with SignalMap, the output went a little weird. The original data looks like this:

CHR1	101	.17989999
CHR1	151	.083400011
CHR1	301	-.125
CHR1	451	0
CHR1	501	.16670001
CHR1	601	.69999999
CHR1	651	.33329999
CHR1	751	.75
CHR1	801	0
CHR1	901	.25099999

When you try to create the GFF file using awk, the output comes out mangled like this:

$ head sample.txt | awk '{if($3>=0) print $1"\t.\tSAMPLE\t"$2"\t"$2+49"\t"$3"\t.\t.\t."}'
CHR1	.	.AMPLE	.01	150	.17989999
CHR1	.	.AMPLE	.51	200	.083400011
CHR1	.	.AMPLE	.51	500	0
CHR1	.	.AMPLE	.01	550	.16670001
CHR1	.	.AMPLE	.01	650	.69999999
CHR1	.	.AMPLE	.51	700	.33329999
CHR1	.	.AMPLE	.51	800	.75
CHR1	.	.AMPLE	.01	850	0
CHR1	.	.AMPLE	.01	950	.25099999

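After 10 mins of banging my head on the table I realised that it was down to Windows/Unix formatting. The file has Windows line endings, so every line ends in a carriage return that awk keeps as part of the third column. When that column is printed the carriage return sends the cursor back to the start of the line, and the three trailing "." columns get written over the top of what is already there (which is why SAMPLE becomes .AMPLE and 101 becomes .01). Converting the line endings solved it: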

$ dos2unix -n sample.txt sample_new.txt
$ head sample_new.txt | awk '{if($3>=0) print $1"\t.\tSAMPLE\t"$2"\t"$2+49"\t"$3"\t.\t.\t."}'
CHR1	.	SAMPLE	101	150	.17989999	.	.	.
CHR1	.	SAMPLE	151	200	.083400011	.	.	.
CHR1	.	SAMPLE	451	500	0	.	.	.
CHR1	.	SAMPLE	501	550	.16670001	.	.	.
CHR1	.	SAMPLE	601	650	.69999999	.	.	.
CHR1	.	SAMPLE	651	700	.33329999	.	.	.
CHR1	.	SAMPLE	751	800	.75	.	.	.
CHR1	.	SAMPLE	801	850	0	.	.	.
CHR1	.	SAMPLE	901	950	.25099999	.	.	.
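
If dos2unix isn't to hand, the same fix can probably be done inside awk by deleting the trailing carriage return before the fields are used. This is only a sketch: the demo file, its three rows and the variable names are made up for illustration, and it assumes the only Windows artefact is the carriage return at the end of each line.

```shell
# Hypothetical demo file with Windows (CRLF) line endings
demo=$(mktemp)
printf 'CHR1\t101\t.17989999\r\nCHR1\t301\t-.125\r\nCHR1\t451\t0\r\n' > "$demo"

# Detect the problem: count the lines that end in a carriage return
crlf_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$demo")
echo "lines ending in CR: $crlf_lines"

# Strip the CR first (awk re-splits the fields), then build the 9-column GFF
gff=$(awk 'BEGIN { OFS = "\t" }
     { sub(/\r$/, "") }
     $3 >= 0 { print $1, ".", "SAMPLE", $2, $2 + 49, $3, ".", ".", "." }' "$demo")
printf '%s\n' "$gff"

rm -f "$demo"
```

Setting OFS once keeps the print statement readable compared with gluing "\t" between every field, and a non-zero count from the first awk call is a quick way to confirm the file really does have Windows line endings.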

Getting SignalMap GFF files into IGV

You can already load a GFF file directly in IGV.

This will allow you to load your file into IGV; however, it will be slow, especially with many tracks. So you may wish to convert the GFF to a bedgraph file, and from there create a TDF file.

Convert GFF into a bedgraph file

awk '{print $1"\t"$4-1"\t"$5"\t"$6}' input.gff > output.bed
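
The "-1" on the start column is there because GFF coordinates are 1-based and inclusive, while bedgraph coordinates are 0-based with an exclusive end, so only the start needs shifting. A small sketch of the same conversion that also skips "#" header lines; the one-line demo GFF and the variable names are invented for illustration, and whether your SignalMap GFF files carry such header lines is an assumption.

```shell
# Hypothetical one-feature GFF with a header line, just for the demo
gff_demo=$(mktemp)
printf '##gff-version 3\nCHR1\t.\tSAMPLE\t101\t150\t.17989999\t.\t.\t.\n' > "$gff_demo"

# Skip comment lines, split strictly on tabs, shift the start by one
bedgraph=$(awk 'BEGIN { FS = OFS = "\t" }
     !/^#/ { print $1, $4 - 1, $5, $6 }' "$gff_demo")
printf '%s\n' "$bedgraph"

rm -f "$gff_demo"
```

The feature at 101 to 150 comes out as start 100, end 150, with the score carried over untouched.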

and then convert this bed file to a TDF file, which can be done using igvtools from within IGV.

In IGV, go to the 'Tools' -> 'Run igvtools' menu at the top to open the igvtools dialog box.

Ensure the 'Command' is set to "toTDF", set the 'Input File' to your bedfile, the 'Output File' will be filled in automagically (unless you want to change it) and then set the 'Zoom Levels' to 10.

This will create a new tdf file which is much smaller than the bedfile and contains the histogram you would have had in SignalMap.
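
The same conversion can probably be done without the GUI, since igvtools also ships as a command-line tool with a toTDF command. Treat this as an untested sketch: the file names and the genome value are placeholders (the genome argument should be whatever genome ID or genome file you have loaded in IGV), and the -z option for the zoom level should be checked against the igvtools documentation for your version.

```shell
# Placeholders: swap in your own files and the genome you use in IGV
in_file="output.bedgraph"
out_file="output.tdf"
genome="my_genome_id"   # hypothetical genome ID or genome file

# -z 10 is meant to mirror 'Zoom Levels' = 10 in the GUI box
cmd="igvtools toTDF -z 10 $in_file $out_file $genome"

# Only run it if igvtools is on the PATH and the input exists
if command -v igvtools >/dev/null 2>&1 && [ -f "$in_file" ]; then
    $cmd
else
    echo "would run: $cmd"
fi
```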

NOTE: Still to write up

  • Give visuals of signal map/IGV
  • Command-line method of doing this

 

Cheat Sheet

 

My ever-growing list of commands that I use often enough to need to write down, but not often enough that I can remember them. Maybe someone else will find it useful.

Get some data from SRA

##download the data using the stupid SRA toolkit

./prefetch -v SRR3311821

##Extract the fastq data. I always run with --split-files even if I know it's single-end reads

~/bin/sratoolkit.2.8.1-ubuntu64/bin/fastq-dump --split-files -v SRR3311821.sra

Sorting and Indexing bamfiles for IGV

source samtools-1.3.1

samtools sort -@ 4 accepted_hits.bam -o accepted_hits_sorted.bam

samtools index accepted_hits_sorted.bam

 

 

 

Installing Roche SignalMap on Windows 10

 

In our lab we sometimes use a piece of software called SignalMap by Roche to visualise the outputs from some legacy pipelines. This is a very simple GFF viewer which displays the score column (the 6th column) of a GFF file as a histogram. It will also display regular GFF features such as genes, transposons etc. quite nicely.

Despite being a tidy little program, it does not appear to have been updated for quite a while (2013 maybe, as that is the date on the copyright). Needless to say, there hasn't been an update since Windows 10 was released in 2015. If you simply try to run the SignalMap installer exe file you get a bunch of errors. Everyone here is used to this software, so until it becomes 100% necessary to ditch it for bedgraph files in IGV (maybe more in another post), I need to get it to work.

Here was my workaround;

After installing the Java Runtime Environment (JRE) and downloading the SignalMap installer, break out your Windows command-line and type the following:

C:\Users\USER\Desktop> SignalMap_installer.exe LAX_VM "C:\Program Files (x86)\Java\jre1.8.0_101\bin\java.exe" -i GUI

This works under the following assumptions;

  • That the SignalMap_installer.exe file is downloaded to your Desktop and the current working directory is the Desktop directory.
  • That the java.exe file is located at the path given in the command. If you have installed a different version of the JRE then the directory will most likely differ slightly (probably just the version number).

Once you've adjusted the command for those two assumptions, run it; the installer will start and you can proceed as normal.

NOTE:

This method pins SignalMap to a specific version of the JRE. If a newer JRE is installed later, e.g. with security patches, SignalMap will still be tied to the old one, which is a risk. The user in me doesn't care, I just want to look at the pretty histograms in my BS-Seq data, but the Sys-Admin in me hates it and would never deploy it on a shared system. The world of Bioinformatics and Scientific computing....ENJOY!
