E(h)volution: October 2012

Tuesday, October 23, 2012

ENCODE-mania

As Robert touched upon in the previous post, much has been written about the ENCODE project, both by people who believe the hype and by people who do not. For a good overview of the "ENCODE position", see the blog post by Ewan Birney, one of the lead authors of the project; to get a sense of the alternative positions, take a look at Brendan Maher's summary of the controversy at the Nature News Blog.

Of course, the best way to understand what the hullabaloo is all about is to read the papers yourself. As a consequence, we decided to dedicate a lab meeting to dipping our toes into the flood of papers coming out from the project. After strolling around the pretty threads interface of the ENCODE explorer, my personal choice fell on Djebali et al.'s Landscape of transcription in human cells. Admittedly, my own lack of appreciation of molecular cell biology made me a tad sceptical of its entertainment value. However, after reading the abstract, which promised a "re-definition of the concept of a gene", I found my enthusiasm growing.

At the heart of the authors' approach is the sequencing of RNA from different sub-cellular locations (nucleus, cytosol, etc.) in 15 different cell lines. This approach resulted in a genome-wide catalogue of the identity and character of RNAs. They report several observations, of which I think four were particularly interesting.

First, it has long been known that a given gene may produce several different forms of the same protein and that there are more transcripts than genes. Isoforms, as these different protein forms are called, may be due to SNP differences or variation in start locations or splicing. Here, the authors show that the number of isoforms that a gene has is not linearly correlated with the number that are expressed. Instead, the correlation plateaus around 10-12 expressed isoforms in a given cell.

Second, they revisit the question of RNA editing, that is, the extent to which a transcript can change after transcription. This apparently made a bit of a splash last year when Li et al. published a paper in Science arguing that this was very common in humans. Djebali et al. end up siding with the researchers who attributed Li et al.'s higher number to a failure to apply a decent false discovery rate.

Finally, they show that 74.7% of the human genome is transcribed as at least a primary transcript (62.1% as processed transcript). A high number indeed (probably higher than what I would have guessed), but even more interesting is that no type of cell expressed more than 56.7% of all possible transcripts. In other words, expression is highly cell specific. Moreover, they also found that neighbouring gene regions often overlap and that this overlap often includes loci that traditionally would have been considered to be distinct genes.

The last piece on what constitutes a gene is particularly interesting for those of us engaged in population genomics. Our annotations, and hence our inferences, depend on our definition. However, our theoretical framework was established decades before the double helix. Moreover, many conceptually influential evolutionary biologists, such as George Williams and Richard Dawkins, adopted a rather liberal definition of a "gene" that more molecularly inclined workers found unsatisfactory. To what extent changing the definition of a gene changes our thinking remains to be seen.

Selection and diversity in human regulatory elements

Vernot, B et al. (2012) "Personal and population genomics of human regulatory variation." Genome Research.

Today I tackle one of the ENCODE papers. The ENCODE project was a large project looking at various aspects of many human genomes, with particular interest in identifying biochemically active parts of the genome.

This particular paper looked at diversity and selection in regions that regulate the expression of genes. They identified these regions in several different cell lines, using DNase I activity. DNase I is an enzyme that is known to cleave parts of the genome that are actively binding transcription factors and other regulatory elements (those proteins that "turn on" genes). They identified the locations of the sites that were cleaved by DNase I, then looked at variation at and near these sites in a sample of 53 unrelated individuals.

They compared diversity in the peaks of DNase I activity, in the "footprint" within each peak (the location where a transcription factor actually bound to DNA), and in the exome (all the DNA that makes proteins). They found many more variant sites in peaks than in the other categories, and the fewest in the exome. They also looked at GERP scores around each variant. GERP is a measure of how constrained a site is, or how much negative selection is keeping the site from changing; a higher GERP score means that more constraint is acting on the site. Though there were fewer variant sites in the exome, a higher proportion of those sites had a high GERP score, and the peaks had the lowest proportion. They also looked at variation within each individual sample and consistently found the same patterns described above. They also show, as expected, that the African samples have higher diversity (more variants at DNase I sensitive sites) than non-Africans.

They looked at diversity around specific known regulatory motifs (sequences of DNA where a specific kind of regulation is known to occur). They found that regulatory elements that are used in cell differentiation usually have very low diversity. They also show that regulatory elements with a CpG site (a C followed by a G in the DNA sequence) had higher diversity, probably because these sites have a higher mutation rate.

Finally, they looked at how positive selection had acted in regions around each of their DNase I peaks. They did this by measuring shifts in allele frequencies near the sites of interest. Interestingly, their data show evidence of an inversion on chromosome 17 that is found in some Europeans (more info on the MAPT inversion region can be found here). They looked at gene pathways that were enriched for sites under positive selection in African, Asian, and European populations. They found many pathways that were under positive selection in all populations and, interestingly, showed that the pathway involved in skin pigmentation was under positive selection in Europeans, and that pathways involved in susceptibility to diabetes were enriched for selection in Africans.

Tuesday, October 16, 2012

Eco-evolutionary spatial dynamics

Hanski, I. (2011) "Eco-evolutionary spatial dynamics in the Glanville fritillary butterfly." PNAS

This species of butterfly lives in a series of meadows in the Åland Islands, and the article reviews its dynamics. The butterflies constantly go extinct in individual meadows, which are then recolonized from others. It has been shown that after a series of extinctions, there is a burst of colonizations, leading to a fairly stable population size.

I found it very interesting that a variant of one gene, Pgi, has been shown to affect dispersal propensity. Individuals heterozygous at this locus carry one allele with an A and one with a C, and these individuals are more likely to disperse than the AA homozygotes. Also, the CC homozygotes are very rare; the author indicates that this is probably because the C allele is linked to a recessive lethal mutation, so individuals with two copies of the C allele rarely survive through development. AA homozygotes primarily arise during inbreeding within a meadow, after it has been colonized, and have lower fitness. Therefore, the AC heterozygotes actually have the highest fitness of the three genotypes.

This is an example of heterozygote advantage (or overdominance), where individuals carrying two different alleles for a gene have higher fitness than individuals with two copies of the same allele. A primary example is sickle cell anemia in humans, which is caused by a certain allele of one of the genes that makes hemoglobin. If you have two normal copies of the allele your blood cells are normal, but if you have two copies of the alternate allele your blood cells become sickle shaped and are very bad at delivering oxygen. But if you have one copy of each allele, you have a mix of cell shapes. Normally this is bad, but if someone with a mix lives in an area with a high rate of malaria infection they actually do better.
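To make the idea concrete, here is a minimal sketch in R of the textbook one-locus viability selection model with heterozygote advantage. The selection coefficients `s` and `t` are made-up numbers for illustration (they are not estimates from Hanski's paper); `t = 1` simply mimics a case like Pgi where CC homozygotes rarely survive. Under these assumptions the A allele should settle at a frequency of t / (s + t) instead of fixing.

```r
# One generation of viability selection at a locus with alleles A and C.
# p is the frequency of A; w_* are the relative fitnesses of each genotype.
next_p <- function(p, w_AA, w_AC, w_CC) {
  q <- 1 - p
  w_bar <- p^2 * w_AA + 2 * p * q * w_AC + q^2 * w_CC  # mean fitness
  (p^2 * w_AA + p * q * w_AC) / w_bar                  # frequency of A after selection
}

# Hypothetical selection coefficients (assumptions, not from the paper):
s <- 0.2  # fitness cost of AA relative to AC
t <- 1    # fitness cost of CC relative to AC (1 = lethal)

# Start with A almost fixed and iterate for 500 generations.
p <- 0.99
for (gen in 1:500) {
  p <- next_p(p, w_AA = 1 - s, w_AC = 1, w_CC = 1 - t)
}

# Analytical equilibrium for overdominance.
p_hat <- t / (s + t)

cat("simulated frequency of A:", round(p, 4), "\n")
cat("predicted equilibrium:   ", round(p_hat, 4), "\n")
```

Even with a lethal CC genotype, the C allele is kept in the population in this model because the heterozygotes do best.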

Monday, October 15, 2012

What did Robert do today?

Well, I'm working on a data set of whole genome sequence from 13 Capsella grandiflora individuals. The main goal of this project is to quantify selection across the whole genome of this species (and the closely related selfer C. rubella). My main project today was to calculate pairwise divergence between my samples, so I can see if there is any clustering of individuals. Those scripts are running, and hopefully tomorrow I'll have awesome pictures.
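For readers wondering what "pairwise divergence" means in practice, here is a toy sketch in base R. It is not the actual script used here: the alignment is invented, each individual is treated as a single haploid sequence, and `pairwise_divergence` is a hypothetical helper name. Divergence is taken to be the proportion of sites at which two individuals differ.

```r
# Toy alignment: rows are individuals, columns are sites (made-up data).
aln <- rbind(
  ind1 = c("A", "C", "G", "T", "A", "C"),
  ind2 = c("A", "C", "G", "T", "A", "T"),
  ind3 = c("A", "T", "G", "C", "A", "T"),
  ind4 = c("G", "T", "G", "C", "A", "T")
)

# Proportion of differing sites for every pair of individuals.
pairwise_divergence <- function(aln) {
  n <- nrow(aln)
  d <- matrix(0, n, n, dimnames = list(rownames(aln), rownames(aln)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Fill both triangles so the matrix is symmetric.
      d[i, j] <- d[j, i] <- mean(aln[i, ] != aln[j, ])
    }
  }
  d
}

D <- pairwise_divergence(aln)
print(round(D, 3))
```

Real diploid genotype data would need a rule for heterozygous sites and missing calls, which this sketch ignores.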

I did play around with making neighbor-joining trees in R, so I can plot these data in a meaningful way. It is actually much easier than I expected: R has a library for working with phylogenies (ape) with a handy-dandy function for making neighbor-joining trees based on an input matrix of differences between each sample. There is one thing about the function example in the R docs that confuses me. When they make the input matrix:

x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)
M <- matrix(0, 8, 8)
M[row(M) > col(M)] <- x
M[row(M) < col(M)] <- x

The matrix is not symmetrical. I tried excluding the second line that adds the data to the matrix, and it doesn't seem to affect the resulting tree. So I think I'll just be giving R my data with half the matrix empty.
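A possible explanation, sketched in base R only: R fills a logical-indexed selection in column-major order, so putting the same vector into `row(M) > col(M)` and into `row(M) < col(M)` places the values in positions that are not mirror images of each other. Base R's `as.dist()` reads only the lower triangle of a matrix, so if `nj()` converts its input that way (an assumption about ape's internals, worth checking in its source), the upper triangle would simply be ignored. The names `L` and `S` below are just for illustration.

```r
# The 28 lower-triangle distances for 8 samples, as in the example above.
x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)

# Construction from the docs example: both triangles filled column by column.
M <- matrix(0, 8, 8)
M[row(M) > col(M)] <- x
M[row(M) < col(M)] <- x
isSymmetric(M)         # FALSE: e.g. M[2, 3] is 11 but M[3, 2] is 5

# Fill only the lower triangle, then mirror it to get a symmetric matrix.
L <- matrix(0, 8, 8)
L[lower.tri(L)] <- x
S <- L + t(L)
isSymmetric(S)         # TRUE

# as.dist() keeps only the lower triangle, so all three matrices
# describe the same set of pairwise distances.
all(as.vector(as.dist(M)) == x)
all(as.vector(as.dist(L)) == x)
all(as.vector(as.dist(S)) == x)
```

If that assumption holds, passing a half-empty matrix is harmless, but building the symmetric version `S` avoids relying on it.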

A survey of loss-of-function variants in the human genome

MacArthur DG, et al. (2012) "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes." Science: 335.

Loss-of-function (LOF) variants, or alleles that stop protein activity, are expected to be rare for most genes. These authors looked at whole genome data from a pilot of the 1000 Genomes Project for variants of genes that had some loss of function. They had four categories of interest: 1) nonsense mutations (new stop codons inserted into the gene), 2) splice-site-disrupting single-nucleotide variants (SNVs; sites that disrupt exon splicing), 3) indels expected to disrupt the reading frame, and 4) very large deletions that removed most of a gene's coding sequence.

They found, not surprisingly, that the allele frequencies of LOF variants were shifted towards rare variants, indicating purifying selection is acting strongly on these variants. They also noted that most of the indels and SNVs were clustered around the 3' end of the gene, indicating that mutations toward the end of a gene were less deleterious, and selection on them was weaker. It would have been interesting to see the allele frequency spectrum (AFS) of these as a separate category, however. They also noted a slight peak in these types of mutations toward the beginning (5' end) of the gene sequence, which they suggested was due to alternate start codons leading to relaxed selection. Overall, this just indicated that the 'meat' of a gene, the part you don't want to mess up, is usually toward the middle.

Interestingly, their list of candidate genes that had LOF variants was highly enriched for chemical sensory genes (e.g. those involved in smell and taste). Since a loss of function of one of these alleles isn't immediately fatal, it makes sense that selection to maintain function in these genes is weaker. They did also find several genes that were in regions that show evidence of positive selection. Several of the olfactory genes appear in these regions, and so does one gene that may be involved in brain lipid formation and another in male fertility. These regions could, of course, be positively selected for some other locus, and these deleterious LOF variants were just dragged along.

The most interesting finding of the article is definitely the number of LOF variants per individual. They estimate that most people have about 100 LOF alleles, most of which are heterozygous. They also point out that, since theory predicts we should each only carry about 5 recessive lethal mutations, most of these LOF variants are probably only slightly deleterious.

This article did make me think a bit about how splicing works. Most genes have many exons, which are put together to create the final mRNA that is translated into protein. In some transcripts not all exons are present, however. How many of the possible variants do we see, and what determines if we see them or not? For example, if a gene has 6 exons (the average number of exons on the first C. rubella chromosome is 5.5) then there are 192 possible variants; I'm sure we don't see all of them. What causes this? I'm sure someone knows, maybe Emily can shed some light here.
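How many variants are "possible" depends on what you allow. As a rough sketch in R, assume exons always stay in genomic order and each one is simply included or skipped. Both counting rules below are my own simplifications and neither reproduces the 192 quoted above, which presumably rests on different assumptions (for example alternative splice sites or start sites).

```r
n_exons <- 6

# Rule 1: any non-empty set of exons, kept in genomic order.
any_subset <- 2^n_exons - 1

# Rule 2: first and last exons are always retained,
# so only the internal exons can be skipped.
internal_only <- 2^(n_exons - 2)

# Cross-check rule 1 by listing every include (1) / skip (0) pattern
# and dropping the single pattern with no exons at all.
patterns <- expand.grid(rep(list(c(0, 1)), n_exons))
patterns <- patterns[rowSums(patterns) > 0, ]

cat("non-empty exon subsets:", any_subset, "\n")
cat("enumerated patterns:   ", nrow(patterns), "\n")
cat("internal exons only:   ", internal_only, "\n")
```

Either way the number grows exponentially with exon count, so it is no surprise that only a fraction of the combinations shows up as real transcripts.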