This post is the start of my journey into finding out what all this methylation stuff is about. Its aim is to begin defining the key concepts and components needed to understand how to analyse the data I am working with. I expect to post more on specific ideas in detail at a later date.

So, we know about the A, G, C and T bases in our DNA: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). DNA methylation is the addition of a methyl (CH3) group to the DNA, which can then change how genes function. The most common form* of DNA methylation is when the CH3 group attaches to Cytosine at position 5 of the "pyrimidine ring". When this group is added, the Cytosine becomes 5-methylcytosine (5-mC).


*(Up until quite recently it was thought that 5-mC was the only form of DNA methylation, but it was then discovered that there is another form of cytosine methylation: 5-hydroxymethylcytosine (5-hmC).)

Types of Cytosine Methylation

Cytosine methylation can occur in three "sequence contexts":

  • CG - A methylated Cytosine followed by a Guanine (NOTE: CG and CpG mean almost, but not quite, the same thing; see "CG and CpG" below)
  • CHG - A methylated Cytosine followed by either an A, C or T (written H), and then by a Guanine
  • CHH - A methylated Cytosine followed by two bases, neither of which is a Guanine

The amount of methylation in each of these three contexts varies with the cell type. For example, in human DNA, 5-mC methylation occurs almost exclusively in the CG context in somatic cells (any cell that isn't a reproductive cell). In embryonic stem (ES) cells there is an increase in non-CpG methylation (i.e. the CHG and CHH contexts).
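
To make the three contexts concrete, here is a minimal Python sketch that labels the context of a cytosine from the two bases that follow it. The function name `cytosine_context` is my own invention for illustration, and the sketch assumes the sequence is a plain string read 5' to 3' on one strand (cytosines on the opposite strand would show up as G's here and are not handled).

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG', 'CHH', or None for the base at 0-based index i.

    Assumes seq is a single strand written 5' -> 3'.
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None  # not a cytosine, so no methylation context
    downstream = seq[i + 1 : i + 3]  # up to two bases after the C
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or any(b not in "ACGT" for b in downstream):
        return None  # too close to the end, or an ambiguous base such as N
    if downstream[1] == "G":
        return "CHG"  # C, then A/C/T, then G
    return "CHH"  # C followed by two non-G bases


example = "ACGTCAGCTTC"
for pos, base in enumerate(example):
    if base == "C":
        print(pos, cytosine_context(example, pos))
```

Note that the context only describes the sequence around the cytosine; whether that cytosine is actually methylated still has to come from the experimental data.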

CG and CpG

CG means that a Cytosine is followed by a Guanine. CpG also means a Cytosine followed by a Guanine, but it is a little more specific than that. The p in CpG represents the phosphate group that links the two neighbouring nucleotides together. CpG therefore represents not only the linear sequence of bases but also the direction of travel along a single strand. CpG is shorthand for the following:

5' -- C -- p -- G -- 3'

This differs from GpC, which means a Guanine followed by a Cytosine in the 5' -> 3' direction. These distinctions become important when considering the analysis of data from directional/non-directional DNA sequencing technologies. (More on this another day!)
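
As a small illustration of why the reading direction matters, the sketch below counts CpG and GpC dinucleotides in a strand read 5' to 3', and then does the same on its reverse complement. The helper names (`count_dinucleotide`, `reverse_complement`) and the example sequence are made up for this post; it is just one simple way to do it.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_dinucleotide(seq, first, second):
    """Count positions where `first` is immediately followed by `second` (5' -> 3')."""
    seq = seq.upper()
    pair = first + second
    return sum(1 for i in range(len(seq) - 1) if seq[i : i + 2] == pair)


def reverse_complement(seq):
    """Return the opposite strand, also written 5' -> 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


strand = "ACGGCATCG"
print(count_dinucleotide(strand, "C", "G"))  # CpG count: 2
print(count_dinucleotide(strand, "G", "C"))  # GpC count: 1

other = reverse_complement(strand)
print(other)                                 # CGATGCCGT
print(count_dinucleotide(other, "C", "G"))   # CpG count on the other strand: 2
```

The CpG count comes out the same on both strands because the reverse complement of CG is again CG, so every CpG on one strand sits opposite a CpG on the other.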

Methylation Site

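This is a position on the genome where methylation has occurred. Strictly speaking, a CpG site is any position where a CpG dinucleotide occurs, whether or not it is methylated, so if you wish to talk about methylated CpG sites, then you are referencing positions on the genome where CpG methylation has occurred.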
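
As a rough sketch, candidate CpG positions can be listed straight from the sequence. The function name `cpg_sites` is hypothetical, positions are 0-based, and this only finds where a CpG occurs on the given strand; whether each one is actually methylated has to come from the sequencing data.

```python
import re


def cpg_sites(seq):
    """Return the 0-based position of the C in every CG dinucleotide (5' -> 3')."""
    return [match.start() for match in re.finditer("CG", seq.upper())]


print(cpg_sites("ACGGCATCG"))  # [1, 7]
```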

Methylation Islands

While the amount of methylation is important, the distribution of methylation within the genome is probably much more important. Methylation-rich areas in the genome are known as islands. (I've extrapolated a little here from Jones, P. A. 2012, who is specifically discussing CpG islands; however, looking for methylation-rich areas and calling them islands can also apply to non-CpG methylation.)

There is no standard definition of a methylation island, and no one really agrees on one. Jones 2012 talks about genes in vertebrate genomes containing 1 kb CpG-rich regions, which are known as CpG Islands (CGIs).

The more common definition you will find (and one that is covered quite poorly on Wikipedia) is by Gardiner-Garden, M. & Frommer, M. (1987). Takai and Jones 2002 discuss the common definition here:

"The first large-scale computational analysis of CpG islands using vertebrate sequences in GenBank was performed by Gardiner-Garden and Frommer (1), who defined a CpG island as being a 200-bp region of DNA with a high GC content (greater than 50%) and observed CpG/expected CpG ratio(ObsCpG/ExpCpG) of greater or equal to 0.6. The exact definition of what constitutes a CpG island is somewhat arbitrary because the cutoffs for the parameters used to describe them can make significant differences to what sequences are included within the definition."

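The criteria in the quote above can be sketched as a simple check on a window of sequence. This is only an illustration under my own assumptions: the function names are made up, the observed/expected ratio is computed with the commonly used formula (number of CpGs × window length) / (number of C's × number of G's), and the thresholds are passed as parameters because, as the quote says, the cutoffs are somewhat arbitrary.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (n_c * n_g)


def looks_like_cpg_island(window, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three cutoffs from the quoted definition to one window."""
    if len(window) < min_len:
        return False
    return gc_content(window) > min_gc and obs_exp_cpg(window) >= min_ratio


print(looks_like_cpg_island("CG" * 100))  # True: 200 bp, all G/C, CpG-dense
print(looks_like_cpg_island("AT" * 100))  # False: no G or C at all
```

A real analysis would slide a window like this along the genome and merge neighbouring hits, which I have left out here.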


[1] Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009)

[2] Jones, P. A. 2012

[3] Takai, D. & Jones, P. A. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. USA 99(6), 3740–3745 (2002). doi: 10.1073/pnas.052410099

[4] Gardiner-Garden, M. & Frommer, M. (1987) J. Mol. Biol. 196, 261–282