Finding Genes which occur in two lists of data using the comm command.

 

TL;DR

  • Use comm to find overlapping genes
  • Both lists must be sorted beforehand
$ comm -1 -2 list1.txt list2.txt

Just a quick note on this, as it is a task that crops up quite often. You have two lists of genes (or any data really, but today it's genes); maybe these are genes showing a certain level of expression in two different tissues. You want to know which genes occur in both lists. I don't know why I had never come across the comm command before; previously I just used grep and awk for this. However, comm is much easier.

My lists look like this:

$ cat list1.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
AT1G01780
$ cat list2.txt
AT1G01225
AT1G01780
AT1G02000
AT1G02570
AT1G02580
AT1G02610
AT1G02640
AT1G02650
AT1G03230
AT1G03270

Using the comm command I can find out which genes occur only in list1.txt, which occur only in list2.txt, and which appear in both files. This is done like so:

$ comm list1.txt list2.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
	AT1G01225
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
		AT1G01780
	AT1G02000
	AT1G02570
	AT1G02580
	AT1G02610
	AT1G02640
	AT1G02650
	AT1G03230
	AT1G03270

That looks confusing, so let me explain. The comm output has three columns, each indented by one more tab than the last:

<uniq_to_file1> <uniq_to_file2> <in_both_files>

So you suppress the columns that you don't want:

$ comm -1 -2 list1.txt list2.txt
AT1G01780

In this case, columns 1 and 2.
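
The other flag combinations work the same way. Here is a small sketch that recreates the two lists above in a scratch directory and pulls out each column in turn; the directory and variable names are only there for illustration.

```shell
# Recreate the two sorted lists from above in a scratch directory
# (workdir and the variable names below are illustrative only)
workdir=$(mktemp -d)
printf '%s\n' AT1G01020 AT1G01070 AT1G01130 AT1G01150 AT1G01390 \
              AT1G01400 AT1G01430 AT1G01448 AT1G01470 AT1G01780 > "$workdir/list1.txt"
printf '%s\n' AT1G01225 AT1G01780 AT1G02000 AT1G02570 AT1G02580 \
              AT1G02610 AT1G02640 AT1G02650 AT1G03230 AT1G03270 > "$workdir/list2.txt"

# -2 -3 hides columns 2 and 3, leaving genes found only in list1
only_in_list1=$(comm -2 -3 "$workdir/list1.txt" "$workdir/list2.txt")
# -1 -3 hides columns 1 and 3, leaving genes found only in list2
only_in_list2=$(comm -1 -3 "$workdir/list1.txt" "$workdir/list2.txt")
# -1 -2 hides columns 1 and 2, leaving genes found in both
in_both=$(comm -1 -2 "$workdir/list1.txt" "$workdir/list2.txt")

echo "Only in list1:"; echo "$only_in_list1"
echo "Only in list2:"; echo "$only_in_list2"
echo "In both:";       echo "$in_both"
```

With these lists there are nine genes unique to each file and one gene, AT1G01780, in both.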

Remember to sort your lists beforehand, otherwise comm will complain that the input is not in sorted order.
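
If the lists are not sorted yet, sort them first. A minimal sketch with two made-up unsorted lists (the file names are hypothetical); sort and comm should be run under the same locale so they agree on the ordering.

```shell
# Two unsorted lists written to a scratch directory (names are illustrative)
workdir=$(mktemp -d)
printf '%s\n' AT1G01780 AT1G01020 AT1G01400 > "$workdir/unsorted1.txt"
printf '%s\n' AT1G03270 AT1G01780 AT1G01225 > "$workdir/unsorted2.txt"

# Sort each list into a new file, then compare the sorted copies
sort "$workdir/unsorted1.txt" > "$workdir/sorted1.txt"
sort "$workdir/unsorted2.txt" > "$workdir/sorted2.txt"
shared=$(comm -1 -2 "$workdir/sorted1.txt" "$workdir/sorted2.txt")
echo "$shared"
```

In bash you can skip the intermediate files with process substitution: comm -1 -2 <(sort list1.txt) <(sort list2.txt)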

BIOINFORMATICS!

 

 

UNIX One Liners .... and the occasional short script.

These are things that I have occasionally found useful.

Looking for last accessed files

This was part of my reporting so that I could try to get people to remove some data from their storage space on our HPC. This one-liner reports, in GB, the amount of storage taken up by files that haven't been accessed in over a year.

NOTE: The -type f test matters here; without it, ls also lists the contents of every matching directory and files get counted more than once.

find . -type f -atime +365 -exec ls -l '{}' \; | awk '{c+=$5} END {print c/1073741824 " GB"}'
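
If GNU find is available, the sizes can be read directly with -printf instead of being parsed out of ls output. This is only a sketch under that assumption, and stale_bytes is a made-up helper name.

```shell
# Sum the sizes (in bytes) of regular files under a directory that have
# not been accessed for more than 365 days.
# Assumes GNU find (for -printf); stale_bytes is a hypothetical helper.
stale_bytes() {
    find "$1" -type f -atime +365 -printf '%s\n' |
        awk '{ total += $1 } END { print total + 0 }'
}
```

To get the figure in GB for the current directory: stale_bytes . | awk '{ printf "%.2f GB\n", $1 / 1073741824 }'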

To get per-user figures, run this as root in the directory containing the users' home directories.

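#!/usr/bin/bash
# One line per home directory: name, then GB in files not accessed for over a year
for i in *; do
[ -d "$i" ] || continue
find "$i" -type f -atime +365 -exec ls -l '{}' \; | \
awk -v user="$i" '{c+=$5} END {print user " " c/1073741824 " GB"}'
done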

Splitting data

This assumes a stream of single-column data that you want to reformat as floats with 3 d.p., and then takes every 266 values and makes them a row. So if there were 266*266 values in the file, you would get a 266*266 matrix.

awk '{printf "%.3f\n", $1}' file | xargs -n 266
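
The same pipeline is easier to see on a tiny input. In this sketch reshape is a hypothetical helper that takes the row width as its argument, and seq stands in for the data file.

```shell
# Format one value per line to 3 decimal places, then pack N values per row.
# reshape is a hypothetical helper name; the row width is its first argument.
reshape() {
    awk '{ printf "%.3f\n", $1 }' | xargs -n "$1"
}

# Six values reshaped into two rows of three
seq 1 6 | reshape 3
```

This prints two rows, "1.000 2.000 3.000" and "4.000 5.000 6.000".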

Merge two files

Assume you have two files where the rows represent the same entries but the columns are split across the two files, so we'd like to join them side by side.

pr -m -t -s" " file1 file2
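
The paste command does the same job and may be easier to remember. A small sketch with two made-up files (the names and contents are purely illustrative):

```shell
# Two small files whose rows line up (contents are made up for illustration)
workdir=$(mktemp -d)
printf '%s\n' AT1G01020 AT1G01070 > "$workdir/ids.txt"
printf '%s\n' 12.5 3.1 > "$workdir/values.txt"

# paste joins line N of each file, separated here by a single space
merged=$(paste -d ' ' "$workdir/ids.txt" "$workdir/values.txt")
echo "$merged"
```

Each output row is the gene ID followed by its value, e.g. "AT1G01020 12.5".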

Check a bunch of servers for almost full partitions

Assumes you have set up the ssh keys so you don't need to enter the password.

#!/usr/bin/bash
THRESHOLD=90
for i in server1.domain.net server2.domain.net server3.domain.net; do
# Logs in as the current user; put user@ in front of "$i" to use another account
ssh -q "$i" 'df -hP' | \
grep -v Filesystem | \
sed 's/%//g' | \
awk -v max="$THRESHOLD" -v host="$i" '$5+0 >= max+0 {print "Warning: Low disk on " host " " $5 "%"}'
done
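
The threshold check can be tried without any servers by feeding it canned df -P output. This is only a sketch: check_usage is a hypothetical helper, the sample figures are made up, and unlike the script above it also prints the mount point.

```shell
# Read `df -P` output on stdin and print a warning for every filesystem
# at or above the given percentage. check_usage is a hypothetical helper:
# first argument is the threshold, second is the host name to report.
check_usage() {
    awk -v max="$1" -v host="$2" '
        NR > 1 {
            used = $5
            sub(/%/, "", used)          # strip the percent sign
            if (used + 0 >= max + 0)
                print "Warning: Low disk on " host " " $6 " " used "%"
        }'
}

# Made-up sample in df -P layout, so the filter can be tried offline
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 100000 91000 9000 91% /
/dev/sdb1 200000 50000 150000 25% /data'

printf '%s\n' "$sample" | check_usage 90 server1.domain.net
```

Against a real host the input would come from ssh instead: ssh -q server1.domain.net 'df -P' | check_usage 90 server1.domain.net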

DVD to iso, and back again

dd if=/dev/dvd of=myiso.iso

cdrecord -v -dao speed=1 dev=/dev/dvd myiso.iso