- Use comm to find overlapping genes
- Both lists must be sorted beforehand
$ comm -1 -2 list1.txt list2.txt
Just a quick note on this, as it is a task that crops up quite often. You have two lists of genes (or any data really, but today it's genes), perhaps genes that show a certain level of expression in two different tissues. You want to know which of these genes occur in both lists. I don't know why I had never come across the comm command before; previously I just used grep and awk to achieve this. However, comm is much easier.
My lists look like this:
$ cat list1.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
AT1G01780
$ cat list2.txt
AT1G01225
AT1G01780
AT1G02000
AT1G02570
AT1G02580
AT1G02610
AT1G02640
AT1G02650
AT1G03230
AT1G03270
Using the comm command I can find out which genes occur only in list1.txt, which occur only in list2.txt, and which appear in both files. This is done like so:
$ comm list1.txt list2.txt
AT1G01020
AT1G01070
AT1G01130
AT1G01150
	AT1G01225
AT1G01390
AT1G01400
AT1G01430
AT1G01448
AT1G01470
		AT1G01780
	AT1G02000
	AT1G02570
	AT1G02580
	AT1G02610
	AT1G02640
	AT1G02650
	AT1G03230
	AT1G03270
...that looks confusing, so let me explain. The comm output is split into three tab-separated columns:
<uniq_to_file1> <uniq_to_file2> <in_both_files>
So what you do is suppress the columns that you don't want:
$ comm -1 -2 list1.txt list2.txt
AT1G01780
Here -1 and -2 suppress columns 1 and 2, leaving only the genes that appear in both files.
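The same trick works for the other two columns. Below is a minimal sketch that recreates the two example lists in a scratch directory and pulls out each column in turn. The variable names (workdir, only_in_list1 and so on) are just mine for illustration.

```shell
# Recreate the two example lists (already sorted) in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' AT1G01020 AT1G01070 AT1G01130 AT1G01150 AT1G01390 \
    AT1G01400 AT1G01430 AT1G01448 AT1G01470 AT1G01780 > "$workdir/list1.txt"
printf '%s\n' AT1G01225 AT1G01780 AT1G02000 AT1G02570 AT1G02580 \
    AT1G02610 AT1G02640 AT1G02650 AT1G03230 AT1G03270 > "$workdir/list2.txt"

# Suppress columns 2 and 3: genes found only in list1.txt.
only_in_list1=$(comm -2 -3 "$workdir/list1.txt" "$workdir/list2.txt")

# Suppress columns 1 and 3: genes found only in list2.txt.
only_in_list2=$(comm -1 -3 "$workdir/list1.txt" "$workdir/list2.txt")

# Suppress columns 1 and 2: genes found in both files.
in_both=$(comm -1 -2 "$workdir/list1.txt" "$workdir/list2.txt")

echo "Only in list1.txt:"; echo "$only_in_list1"
echo "Only in list2.txt:"; echo "$only_in_list2"
echo "In both:"; echo "$in_both"
```

The flags can also be combined, so comm -12 is the same as comm -1 -2.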
You need to remember to sort your lists beforehand; comm expects sorted input and will otherwise complain or give misleading results.
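If your lists are not sorted yet, something like the following should do it. This is only a sketch: the unsorted files and their contents are made up for the example, and I set LC_ALL=C on both sort and comm on the assumption that you want them to agree on the same ordering.

```shell
# Hypothetical unsorted inputs, made up for this example.
workdir=$(mktemp -d)
printf '%s\n' AT1G01780 AT1G01020 AT1G01400 > "$workdir/unsorted1.txt"
printf '%s\n' AT1G02000 AT1G01780 AT1G01225 > "$workdir/unsorted2.txt"

# Sort each list into a new file, using the same locale that comm will use.
LC_ALL=C sort "$workdir/unsorted1.txt" > "$workdir/sorted1.txt"
LC_ALL=C sort "$workdir/unsorted2.txt" > "$workdir/sorted2.txt"

# Now comm can compare them; keep only the genes present in both.
shared=$(LC_ALL=C comm -1 -2 "$workdir/sorted1.txt" "$workdir/sorted2.txt")
echo "$shared"
```

In bash you can skip the intermediate files with process substitution, e.g. comm -12 <(sort a.txt) <(sort b.txt), but that will not work in a plain POSIX sh.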