TL;DR

  • Use comm to find overlapping genes
  • Both lists must be sorted beforehand
$ comm -1 -2 list1.txt list2.txt

Just a quick note on this, as it is a task that crops up quite often. You have two lists of genes (or any data really, but today it's genes); maybe these are genes that show a certain level of expression in two different tissues. You want to know which of these genes occur in both lists. I don't know why I had never come across the comm command before; previously I just used grep and awk to achieve this. However, comm is much easier.

My lists look like this:

$ cat list1.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
AT1G01780
$ cat list2.txt
AT1G01225
AT1G01780
AT1G02000
AT1G02570
AT1G02580
AT1G02610
AT1G02640
AT1G02650
AT1G03230
AT1G03270

Using the comm command, I can find out which genes occur only in list1.txt, which occur only in list2.txt, and which appear in both files. This is done like so:

$ comm list1.txt list2.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
	AT1G01225
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
		AT1G01780
	AT1G02000
	AT1G02570
	AT1G02580
	AT1G02610
	AT1G02640
	AT1G02650
	AT1G03230
	AT1G03270

...that looks confusing. Let me explain. The comm output has three tab-separated columns:

<uniq_to_file1> <uniq_to_file2> <in_both_files>

So you suppress the columns that you don't want, using the -1, -2 and -3 options:

$ comm -1 -2 list1.txt list2.txt
AT1G01780

In this case, columns 1 and 2, which leaves only the genes found in both files.
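
The same trick works for the other two columns. As a rough sketch (the scratch directory and the variable names are just for illustration, the gene IDs are the ones from the lists above), suppressing a different pair of columns gives the genes unique to each file:

```shell
# Recreate the two example lists in a scratch directory (both already sorted).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' AT1G01020 AT1G01070 AT1G01130 AT1G01150 AT1G01390 \
              AT1G01400 AT1G01430 AT1G01448 AT1G01470 AT1G01780 > list1.txt
printf '%s\n' AT1G01225 AT1G01780 AT1G02000 AT1G02570 AT1G02580 \
              AT1G02610 AT1G02640 AT1G02650 AT1G03230 AT1G03270 > list2.txt

# Suppress columns 2 and 3: genes that occur only in list1.txt.
only_in_list1=$(comm -2 -3 list1.txt list2.txt)

# Suppress columns 1 and 3: genes that occur only in list2.txt.
only_in_list2=$(comm -1 -3 list1.txt list2.txt)

# Suppress columns 1 and 2: genes that occur in both files.
in_both=$(comm -1 -2 list1.txt list2.txt)

echo "Only in list1.txt:"; echo "$only_in_list1"
echo "Only in list2.txt:"; echo "$only_in_list2"
echo "In both:";           echo "$in_both"
```

The flags can also be combined, so comm -12 is the same as comm -1 -2.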

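You need to remember to sort your lists beforehand, because comm expects sorted input. Given unsorted lists, GNU comm complains that a file is not in sorted order, and the output cannot be trusted.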
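
If the lists are not sorted yet, something like the following should do it. This is only a sketch: the file names are made up for the example, and setting LC_ALL=C is my assumption to make sure sort and comm agree on the ordering, since both follow the locale's collation rules.

```shell
# Two small unsorted lists (hypothetical file names, for illustration only).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' AT1G01780 AT1G01020 AT1G01400 > unsorted1.txt
printf '%s\n' AT1G03270 AT1G01780 AT1G01225 > unsorted2.txt

# Use the same collation order for sort and comm.
export LC_ALL=C

# Sort each list into a new file, then compare the sorted copies.
sort unsorted1.txt > sorted1.txt
sort unsorted2.txt > sorted2.txt

# Keep only column 3: genes present in both lists.
shared=$(comm -1 -2 sorted1.txt sorted2.txt)
echo "$shared"
```

Writing the sorted copies to new files leaves the original lists untouched.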

BIOINFORMATICS!