Sunday, November 4, 2012

How much of the variation in gene expression is due to differences in rates of mRNA decay?

Most studies of gene expression variation, including my own, measure expression as a steady-state level. However, a gene's expression level is the result of two dynamic processes: mRNA transcription and mRNA decay. We've talked a bit on this blog about studies that have investigated mechanisms of mRNA transcription (using methods like DNase-seq and ChIP-seq), but so far we've ignored mRNA decay. So I'm going to summarize this paper:


The authors of this study measured relative mRNA decay rates in 70 human cell lines by treating the cells with a chemical that halts transcription and then measuring expression levels at a number of time points after the treatment.
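To make the idea concrete, here is a minimal R sketch of how a decay rate could be estimated from a time course like this. It is not the authors' actual method: it assumes simple first-order (exponential) decay, and the time points, the decay rate k_true, and the expression values are all made up for illustration.

```r
# Hypothetical time course: hours after transcription is halted, and the
# expression of one gene relative to time zero (made-up, noise-free values).
time_h <- c(0, 0.5, 1, 2, 4)
k_true <- 0.35                      # assumed decay rate (per hour)
expr   <- exp(-k_true * time_h)     # first-order decay

# Under first-order decay, log(expression) falls linearly with time,
# so the slope of a linear fit is minus the decay rate.
fit   <- lm(log(expr) ~ time_h)
k_hat <- -unname(coef(fit)["time_h"])

# The half-life follows directly from the decay rate.
half_life <- log(2) / k_hat

cat("estimated decay rate (per hour):", round(k_hat, 3), "\n")
cat("estimated half-life (hours):", round(half_life, 2), "\n")
```

With real data the points would be noisy, and expression after the treatment is usually only known relative to the other genes, which is presumably why the paper talks about relative decay rates.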

First, they did a number of gene-by-gene comparisons using data pooled from all individuals. They classified genes into fast-decaying and slow-decaying categories and found that these categories are associated with a number of features they'd expect (gene length, cis-regulatory elements, etc.). More interestingly, the authors also looked for associations between a gene's rate of decay and its expression level. They expected that genes whose transcripts decay quickly would tend to have low expression and genes with slow-decaying transcripts would tend to have high expression, and they found that this is, indeed, the general pattern. However, they also found a number of genes with the opposite pattern: fast decay and high expression.
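One way to see how both patterns can arise is a toy model in which mRNA is made at a constant rate and each transcript decays at a constant rate. This is my own simplification, not something taken from the paper, and the gene names and rates below are hypothetical. In that model the steady-state level is just the transcription rate divided by the decay rate.

```r
# Toy model (an assumption, not from the paper): mRNA is produced at a
# constant rate alpha and decays at rate k, so dX/dt = alpha - k * X and
# the steady-state level is alpha / k.
steady_state <- function(alpha, k) alpha / k

genes <- data.frame(
  gene  = c("slow_decay", "fast_decay", "fast_decay_high_txn"),  # hypothetical
  alpha = c(10, 10, 100),    # transcription rate (arbitrary units per hour)
  k     = c(0.1, 1.0, 1.0)   # decay rate (per hour)
)
genes$steady <- steady_state(genes$alpha, genes$k)
print(genes)
```

In this sketch the fast-decaying gene with a tenfold higher transcription rate ends up at the same steady-state level as the slow-decaying gene, so fast decay plus high expression is at least possible when transcription is high enough.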

Next, they looked at between-individual variation in decay rate and how that relates to expression variation. They tested for associations between a gene's decay rate and nearby SNPs and found a handful of significant associations. These 'rdQTLs' ('rate of decay QTLs') overlap with a significant number of the eQTLs they were also able to find with this data set. In 55% of the cases where a gene has an rdQTL and an eQTL, the allele that's associated with faster decay is also associated with lower expression, and vice versa. This makes sense: if an allele causes a gene's transcripts to decay faster, it should lower expression. However, 45% of the time the relationship was reversed, which seems pretty strange to me.
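How far a 55/45 split is from a coin flip depends on how many genes it is based on, which this summary doesn't give. As a rough sketch, the R snippet below runs an exact binomial test for a few hypothetical numbers of genes; the counts are invented for illustration and are not from the paper.

```r
# Only percentages are quoted above, so the gene counts here are hypothetical.
concordant_fraction <- 0.55
n_genes <- c(20, 100, 1000)

p_values <- sapply(n_genes, function(n) {
  # Number of genes where faster decay goes with lower expression.
  k <- round(concordant_fraction * n)
  # Two-sided exact binomial test against a 50/50 split.
  binom.test(k, n, p = 0.5)$p.value
})

print(data.frame(n_genes, p_value = signif(p_values, 2)))
```

With a small number of genes, a 55% concordance is hard to tell apart from no relationship at all; it only becomes clearly different from 50% when the number of genes is large.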

Overall, what I think is really interesting about this paper is that it gets at some of the mechanisms which contribute to expression level variation. I tend to treat expression level as a fairly abstract trait, so it's useful to remember that it's the result of multiple processes, which can interact in complex and sometimes strange ways.

Tuesday, October 23, 2012

ENCODE-mania

As Robert touched upon in the previous post, much has been written about the ENCODE project, both by people who believe the hype and by people who do not. For a good overview of the "ENCODE position", see the blog post by Ewan Birney, one of the lead authors of the project. To get a sense of the alternative positions, take a look at Brendan Maher's summary of the controversy at the Nature News Blog.

Of course, the best way to understand what the hullabaloo is all about is to read the papers yourself. As a consequence, we decided to dedicate a lab meeting to dipping our toes into the flood of papers coming out from the project. After strolling around the pretty threads interface of the ENCODE explorer, my personal choice fell on Djebali et al.'s Landscape of transcription in human cells. Admittedly, my own lack of appreciation of molecular cell biology made me a tad sceptical of its entertainment value. However, after reading the abstract, which promised a "re-definition of the concept of a gene", I found my enthusiasm growing.

At the heart of the authors' approach is the sequencing of RNA from different sub-cellular compartments (nucleus, cytosol, etc.) in 15 different cell lines. This approach resulted in a genome-wide catalogue of the identity and character of RNAs. They report several observations, of which I think four were particularly interesting.

First, it has long been known that a given gene may produce several different forms of the same protein, and that there are more transcripts than genes. Isoforms, as these different protein forms are called, may be due to SNP differences or variation in start locations or splicing. Here, the authors show that the number of isoforms a gene has does not scale linearly with the number that are expressed. Instead, the relationship plateaus at around 10-12 expressed isoforms in a given cell.

Second, they revisit the question of RNA editing, that is, the extent to which a transcript's sequence can change after transcription. This apparently made a bit of a splash last year when Li et al. published a paper in Science arguing that editing is very common in humans. Djebali et al. end up siding with the researchers who attributed Li et al.'s high number to a failure to apply a decent false discovery rate.

Finally, they show that 74.7% of the human genome is transcribed as at least a primary transcript (62.1% as processed transcript). A high number indeed (probably higher than what I would have guessed), but even more interesting is that no type of cell expressed more than 56.7% of all possible transcripts. In other words, expression is highly cell specific. Moreover, they also found that transcribed regions often overlap, and that this overlap often includes loci that traditionally would have been considered to be distinct genes.

The last piece, on what constitutes a gene, is particularly interesting for those of us engaged in population genomics. Our annotations, and hence our inferences, depend on our definition. However, our theoretical framework was established decades before the double helix. Moreover, many conceptually influential evolutionary biologists, such as George Williams and Richard Dawkins, adopted a rather liberal definition of a "gene" that more molecularly inclined workers found unsatisfactory. To what extent changing the definition of a gene changes our thinking remains to be seen.



Selection and diversity in human regulatory elements

Vernot, B et al. (2012) "Personal and population genomics of human regulatory variation." Genome Research.

Today I tackle one of the ENCODE papers. The ENCODE project was a large project looking at various aspects of many human genomes, with particular interest in identifying biochemically active parts of the genome.

This particular paper looked at diversity and selection in regions that regulate the expression of genes. They identified these regions in several different cell lines, using DNase I activity. DNase I is an enzyme that is known to cleave parts of the genome that are actively binding transcription factors and other regulatory proteins (those that "turn on" genes). They identified the locations of the sites that were cleaved by DNase I, then looked at variation at and near these sites in a sample of 53 unrelated individuals.

They compared diversity in the peaks of DNase I activity, in the "footprint" within each peak (the location where a transcription factor actually bound to DNA), and in the exome (all the DNA that codes for proteins). They found many more variant sites in peaks than in the other categories, and the fewest in the exome. They also looked at GERP scores around each variant. GERP is a measure of how constrained a site is, or how much negative selection is keeping the site from changing; a higher GERP score means that more constraint is acting on the site. Though there were fewer variant sites in the exome, a higher proportion of those sites had a high GERP score, and the peaks had the lowest proportion. They also looked at variation within each individual sample and consistently found the same patterns described above. They also show, as expected, that the African samples have higher diversity (more variants at DNase I sensitive sites) than the non-African samples.

They looked at diversity around specific known regulatory motifs (sequences of DNA where a specific kind of regulation is known to occur). They found that regulatory elements that are used in cell differentiation usually have very low diversity. They also show that regulatory elements with a CpG site (a C followed by a G in the DNA sequence) had higher diversity, probably because these sites have a higher mutation rate.

Finally, they looked at how positive selection had acted in regions around each of their DNase I peaks. They did this by measuring shifts in allele frequencies near the sites of interest. Interestingly, their data show evidence of an inversion on chromosome 17 that is found in some Europeans (more info on the MAPT inversion region can be found here). They looked at gene pathways that were enriched for sites under positive selection in African, Asian, and European populations. They found many pathways that were under positive selection in all populations. They also showed that the pathway involved in skin pigmentation was under positive selection in Europeans, and that pathways involved in susceptibility to diabetes were enriched for selection in Africans.

Tuesday, October 16, 2012

Eco-evolutionary spatial dynamics

Hanski, I. (2011) "Eco-evolutionary spatial dynamics in the Glanville fritillary butterfly." PNAS

This species of butterfly lives in a series of meadows in the Åland Islands, and the article reviews its dynamics there. The butterflies constantly go extinct in individual meadows, which are then recolonized from others. It has been shown that a series of extinctions is followed by a burst of colonizations, leading to a fairly stable overall population size.

I found it very interesting that a variant of one gene, Pgi, has been shown to affect dispersal propensity. Individuals heterozygous at this locus carry one allele with an A and one with a C, and these individuals are more likely to disperse than the AA homozygotes. The CC homozygotes are very rare; the author indicates that this is probably because the C allele is linked to a recessive lethal mutation, so individuals with two copies of the C allele rarely survive through development. AA homozygotes primarily arise during inbreeding within a meadow, after it has been colonized, and have lower fitness. Therefore, the AC heterozygotes actually have the highest fitness.

This is an example of heterozygote advantage (or overdominance), where individuals carrying two different alleles of a gene have higher fitness than individuals with two copies of the same allele. A classic example is sickle cell anemia in humans, which is caused by a certain allele of one of the genes that makes hemoglobin. If you have two copies of the normal allele your blood cells are normal, but if you have two copies of the alternate allele your blood cells become sickle shaped and are very bad at delivering oxygen. But if you have one copy of each allele you have a mix of cell shapes. Normally this is bad, but people with a mix who live in an area with a high rate of malaria infection actually do better.
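Heterozygote advantage is easy to play with numerically. The R sketch below uses the standard one-locus selection recursion with random mating; the fitness values are hypothetical (CC is treated as fully lethal and AA is given an arbitrary 20% cost), so this illustrates the principle rather than modelling the butterflies. Under these assumptions the C allele settles at a frequency of s_AA / (s_AA + s_CC) instead of being lost.

```r
# Hypothetical fitnesses, chosen only for illustration.
s_AA <- 0.2   # fitness cost of AA relative to the AC heterozygote
s_CC <- 1.0   # CC treated as lethal
w <- c(AA = 1 - s_AA, AC = 1, CC = 1 - s_CC)

# One generation of selection under random mating; q is the C allele frequency.
next_q <- function(q, w) {
  p <- 1 - q
  w_bar <- p^2 * w[["AA"]] + 2 * p * q * w[["AC"]] + q^2 * w[["CC"]]
  (p * q * w[["AC"]] + q^2 * w[["CC"]]) / w_bar
}

q <- 0.01                      # start with C rare
for (gen in 1:500) q <- next_q(q, w)

q_expected <- s_AA / (s_AA + s_CC)   # textbook overdominance equilibrium
cat("simulated C frequency:", round(q, 4), "\n")
cat("expected C frequency:", round(q_expected, 4), "\n")
```

Even with CC lethal, the C allele is maintained because it is mostly found in heterozygotes, which is the same logic as the sickle cell example.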

Monday, October 15, 2012

What did Robert do today?

Well, I'm working on a data set of whole genome sequence from 13 Capsella grandiflora individuals. The main goal of this project is to quantify selection across the whole genome of this species (and the closely related selfer C. rubella). My main project today was to calculate pairwise divergence between my samples, so I can see if there is any clustering of individuals. Those scripts are running, and hopefully tomorrow I'll have awesome pictures.

I did play around with making neighbor-joining trees in R, so I can plot these data in a meaningful way. It is actually much easier than I expected: R has a library for working with phylogenies (ape) with a handy-dandy function for making neighbor-joining trees from an input matrix of differences between each pair of samples. There is one thing about the function example in the R docs that confuses me. When they make the input matrix:
x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)
M <- matrix(0, 8, 8)
M[row(M) > col(M)] <- x
M[row(M) < col(M)] <- x
The matrix is not symmetrical. I tried excluding the second assignment, the one that fills the upper half of the matrix, and it doesn't seem to affect the resulting tree. So I think I'll just be giving R my data with half the matrix empty.
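Here is a small base-R check of the asymmetry, along with one way to build a symmetric version of the same matrix (a sketch using only base functions, not ape itself). It also shows that as.dist() reads only the lower triangle by default. If nj() converts its matrix input with as.dist(), as I believe it does, that would explain why the upper half makes no difference to the tree, but treat that as an assumption and check the function source.

```r
x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)

# The two-assignment version from the example: both triangles are filled
# in column-major order, so the upper triangle is not a mirror image.
M_asym <- matrix(0, 8, 8)
M_asym[row(M_asym) > col(M_asym)] <- x
M_asym[row(M_asym) < col(M_asym)] <- x

# One way to get a symmetric matrix: fill the lower triangle, then add the
# transpose (the diagonal is zero, so nothing is counted twice).
M_sym <- matrix(0, 8, 8)
M_sym[lower.tri(M_sym)] <- x
M_sym <- M_sym + t(M_sym)

print(isSymmetric(M_asym))
print(isSymmetric(M_sym))

# as.dist() keeps only the lower triangle by default, so both matrices
# give the same set of pairwise distances.
same_dist <- all(as.dist(M_asym) == as.dist(M_sym))
print(same_dist)
```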

A survey of loss-of-function variants in the human genome

MacArthur DG, et al. (2012) "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes." Science 335.

Loss-of-function (LOF) variants, or alleles that stop protein activity, are expected to be rare for most genes. These authors looked at whole genome data from a pilot of the 1000 genomes project for variants of genes that had some loss of function. They had four categories of interest: 1) nonsense mutations (new stop codons introduced into the gene), 2) splice-site-disrupting single-nucleotide variants (SNVs; sites that disrupt exon splicing), 3) indels expected to disrupt the reading frame, and 4) very large deletions that removed most of a gene's coding sequence.

They found, not surprisingly, that the allele frequencies of LOF variants were shifted towards rare variants, indicating purifying selection is acting strongly on these variants. They also noted that most of the indels and SNVs were clustered around the 3' end of the gene, indicating that mutations toward the end of a gene were less deleterious, and selection on them was weaker. It would have been interesting to see the allele frequency spectrum (AFS) of these as a separate category, however. They also noted a slight peak in these types of mutations toward the beginning (5' end) of the gene sequence, which they suggested was due to alternate start codons leading to relaxed selection. Overall, this just indicated that the 'meat' of a gene, the part you don't want to mess up, is usually toward the middle.

Interestingly, their list of candidate genes that had LOF variants was highly enriched for chemical sensory genes (e.g. those involved in smell and taste). Since a loss of function of one of these genes isn't immediately fatal, it makes sense that selection to maintain function in these genes is weaker. They did also find several genes that were in regions that show evidence of positive selection. Several of the olfactory genes appear in these regions, and so does one gene that may be involved in brain lipid formation and another in male fertility. These regions could, of course, be positively selected for some other locus, with these deleterious LOF variants just dragged along.

The most interesting finding of the article is definitely the number of LOF variants per individual. They estimate that most people have about 100 LOF alleles, most of which are heterozygous. They also point out that, since theory predicts we should each only carry about 5 recessive lethal mutations, most of these LOF variants are probably only slightly deleterious.

This article did make me think a bit about how splicing works. Most genes have many exons, which are put together to create the final mRNA that is translated into protein. In some transcripts not all exons are present, however. How many of the possible variants do we see, and what determines if we see them or not? For example, if a gene has 6 exons (the average number of exons on the first C. rubella chromosome is 5.5) and any subset of them could be spliced together in order, then there would be 2^6 - 1 = 63 possible variants, and I'm sure we don't see all of them. What causes this? I'm sure someone knows, maybe Emily can shed some light here.
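For a rough sense of scale, here is a small R sketch that counts exon combinations under two simple assumptions of my own: exons always stay in genomic order, and each exon is either kept or skipped. The total depends heavily on which counting rule you pick, so treat these as ballpark figures rather than a statement about real splicing.

```r
n_exons <- 6

# Assumption 1: any non-empty subset of exons, kept in genomic order,
# counts as a possible transcript.
n_any_subset <- 2^n_exons - 1

# Cross-check by adding up the number of subsets of each size.
n_by_size <- sum(choose(n_exons, 1:n_exons))

# Assumption 2: the first and last exons are always kept, and only the
# internal exons can be skipped.
n_fixed_ends <- 2^(n_exons - 2)

cat("any non-empty subset of exons:", n_any_subset, "\n")
cat("first and last exon always kept:", n_fixed_ends, "\n")
```

Real genes add further options (alternative splice sites within an exon, retained introns, alternative start and end sites), so the true space of possible transcripts is not captured by either rule.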

Monday, September 10, 2012

Human regulatory network architecture.

Architecture of the human regulatory network derived from ENCODE data
Gerstein et al. 2012 Nature.

This paper uses ChIP sequencing to identify binding sites for 119 transcription factors in 5 cell lines (from only one human, I think. maybe?). They used this data to construct a network of transcription factors and the genes they regulate. Their overall goal was to describe the architecture of the regulatory network, identify correlations between network position and other genomic properties, and test whether selection acts differently on different places in the network.

They have a lot of results, and a lot of the data presented in the main text feels a bit anecdotal, so instead of providing a laundry list of all of them, I'll just point out things I found interesting with the caveat that I don't really understand most of their methods.

1) They looked at situations where two transcription factors have an overlapping binding site, which they call coassociation. Transcription factors tend to coassociate with different partners in sites that are near a gene ('proximal') and far from a gene ('distal'). However, this conclusion appears to be based on supplementary figure 2C3, which only shows associations between one focal transcription factor and those factors that differ between proximal and distal sites.

2) The researchers constructed a network of associations between transcription factors and their targets and found they could group transcription factors into three levels of hierarchy. Highly connected factors tend to be highly expressed across tissues, which is unsurprising to me.

3) The researchers used diversity data from the 1000 genomes project to measure constraint on target genes and transcription factors. They found the strongest constraint on genes that are regulated by many transcription factors, followed by transcription factors that regulate many genes. They also found that transcription factors at the top level of the network are more constrained than those at the middle and lower levels.

4) They also took a stab at one of my pet interests: allele-specific expression. It's a bit complicated, but what I think is going on is that when transcription factors bind preferentially to an allele, this allele is also more likely to be preferentially expressed downstream. However, this section is really unclear to me because allele-specific expression is generally defined as being any difference in expression level between alleles, not a preference for one allele, so I'm not sure what they mean when they say things like "X% of genes show allele-specific expression from the paternal allele". If my interpretation is right, then this suggests that most allele-specific binding is enhancing expression? But who knows. It's a bit frustrating that with 271 pages of supplement, they can't find the space to clearly define their terms.

5) Finally, the researchers compared diversity in transcription factor binding sites that show allele-specific binding to those that don't. They found that the allele-specific sites have a higher SNP density, suggesting that they're under less constraint than those binding sites without allele-specific binding. The authors think that this result, that allele-specific binding sites are under less constraint, is 'surprising'. I don't find it surprising AT ALL. If the genetic variation that causes allele-specific binding is deleterious and subject to purifying selection (which we think is the case for most variation), then this result makes perfect sense.