Sunday, November 4, 2012

How much of the variation in gene expression is due to differences in rates of mRNA decay?

Most studies of gene expression variation, including my own, measure expression as a steady-state level. However, a gene's expression level is the result of two dynamic processes: mRNA transcription and mRNA decay. We've talked a bit on this blog about studies that have investigated mechanisms of mRNA transcription (like DNase-seq and ChIP-seq), but we've so far ignored mRNA decay. So I'm going to summarize this paper:


The authors in this study measured relative mRNA decay rate in 70 human cell lines by treating the cells with a chemical that halts transcription and measuring expression level at a number of time points after the treatment. 
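To make the idea of a decay rate concrete, here is a minimal sketch of how one could estimate it from this kind of time course. It assumes simple first-order (exponential) decay, which is my simplification and not necessarily the model the authors fit, and the time points and expression values are made up for illustration.

```r
# Hypothetical time course for one gene: hours after transcription is halted
# and expression relative to time zero (made-up numbers, not from the paper).
time_h <- c(0, 0.5, 1, 2, 4)
expr   <- c(1.00, 0.71, 0.50, 0.25, 0.0625)

# Assumed model: first-order decay, expr(t) = expr(0) * exp(-k * t),
# so log(expr) is linear in time with slope -k.
fit <- lm(log(expr) ~ time_h)
k <- -unname(coef(fit)["time_h"])   # decay rate constant (per hour)
half_life <- log(2) / k             # time for expression to drop by half

print(c(k = k, half_life = half_life))
```

A gene with a larger k (shorter half-life) would land in the fast-decaying category; comparing k between individuals is the kind of trait an rdQTL scan would use.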

First, they did a number of gene-by-gene comparisons using data pooled from all individuals. They classified genes into fast-decaying and slow-decaying categories and found that genes in these categories are associated with a number of things that they'd expect (gene length, cis-regulatory elements, etc.). More interestingly, the authors also looked for associations between a gene's rate of decay and expression level. They expected that genes with transcripts that decay quickly would tend to have low expression and genes with slow-decaying transcripts would have high expression, and they found that this is, indeed, the general pattern. However, they also found a number of genes with the opposite pattern: fast decay and high expression.

Next, they looked at between-individual variation in decay rate and how that relates to expression variation. They tested for associations between a gene's decay rate and nearby SNPs and found a handful of significant associations. These 'rdQTLs' ('rate of decay QTLs') overlap with a significant number of the eQTLs they were also able to find with this data set. In 55% of the cases where a gene has an rdQTL and an eQTL, the allele that's associated with faster decay is also associated with lower expression and vice versa. This makes sense: if an allele causes a gene's transcripts to decay faster, it should lower expression. However, 45% of the time the relationship was reversed, which seems pretty strange to me.

Overall, what I think is really interesting about this paper is that it gets at some of the mechanisms which contribute to expression level variation. I tend to treat expression level as a fairly abstract trait, so it's useful to remember that it's the result of multiple processes, which can interact in complex and sometimes strange ways.

Tuesday, October 23, 2012

ENCODE-mania

As Robert touched upon in the previous post, much has been written about the ENCODE project, both by people who believe the hype and by people who do not. For a good overview of the "ENCODE position", see the blog post by Ewan Birney, one of the lead authors of the project. To get a sense of the alternative positions, take a look at Brendan Maher's summary of the controversy at the Nature News Blog.

Of course, the best way to understand what the hullabaloo is all about is to read the papers yourself. As a consequence, we decided to dedicate a lab meeting to dipping our toes into the flood of papers coming out from the project. After strolling around the pretty threads interface of the ENCODE explorer, my personal choice fell on Djebali et al.'s Landscape of transcription in human cells. Admittedly, my own lack of appreciation of molecular cell biology made me a tad sceptical of its entertainment value. However, after reading the abstract, which promised a "re-definition of the concept of a gene", I found my enthusiasm growing.

At the heart of the authors' approach is the sequencing of RNA from different kinds of sub-cellular locations (nucleus, cytosol, etc.) in 15 different cell lines. This approach resulted in a genome-wide catalogue of the identity and character of RNAs. They report several observations, of which I think four were particularly interesting.

First, it has long been known that a given gene may produce several different forms of the same protein and that there are more transcripts than genes. Isoforms, as these different protein forms are called, may be due to SNP differences or variation in start locations or splicing. Here, the authors show that the number of isoforms that a gene has is not linearly correlated with the number that are expressed. Instead, the correlation plateaus around 10-12 expressed isoforms in a given cell.

Second, they revisit the question of RNA editing, that is, the extent to which a transcript can change after transcription. This apparently made a bit of a splash last year when Li et al. published a paper in Science that argued that this was very common in humans. Djebali et al. end up siding with the researchers who attributed Li et al.'s higher number to a failure to apply a decent false discovery rate.

Finally, they show that 74.7% of the human genome is transcribed as at least a primary transcript (62.1% as processed transcript). A high number indeed (probably higher than what I would have guessed), but even more interesting is that no type of cell expressed more than 56.7% of all possible transcripts. In other words, expression is highly cell specific. Moreover, they also found that the intergenic regions often overlap and that this overlap often includes loci that traditionally would have been considered to be distinct genes.

The last piece on what constitutes a gene is particularly interesting for those of us engaged in population genomics. Our annotations, and hence our inferences, depend on our definition. However, our theoretical framework was established decades before the double helix. Moreover, many conceptually influential evolutionary biologists, such as George Williams and Richard Dawkins, adopted a rather liberal definition of a "gene" that more molecularly inclined workers found unsatisfactory. To what extent changing the definition of a gene changes our thinking remains to be seen.



Selection and diversity in human regulatory elements

Vernot, B et al. (2012) "Personal and population genomics of human regulatory variation." Genome Research.

Today I tackle one of the ENCODE papers. The ENCODE project was a large project looking at various aspects of many human genomes, with particular interest in identifying biochemically active parts of the genome.

This particular paper looked at diversity and selection in regions that regulate the expression of genes. They identified these regions in several different cell lines, using DNase I activity. DNase I is an enzyme that is known to cleave parts of the genome that are actively binding transcription factors and other regulatory elements (those proteins that "turn on" genes). They identified the locations of the sites that were cleaved by DNase I, then looked at variation at and near these sites in a sample of 53 unrelated individuals.

They compared diversity in the peaks of DNase I activity, in the "footprint" within each peak (the location where a transcription factor actually bound to DNA), and in the exome (all the DNA that codes for proteins). They found many more variant sites in peaks than in the other categories, and the fewest in the exome. They also looked at GERP scores around each variant. GERP is a measure of how constrained a site is, or how much negative selection is keeping the site from changing; a higher GERP score means that more constraint is acting on the site. Though there were fewer variant sites in the exome, a higher proportion of those sites had a high GERP score, and the peaks had the lowest proportion. They also looked at variation within each individual sample and consistently found the same patterns described above. They also show, as expected, that the African samples have higher diversity (more variants at DNase I sensitive sites) than non-Africans.

They looked at diversity around specific known regulatory motifs (sequences of DNA where a specific kind of regulation is known to occur). They found that regulatory elements that are used in cell differentiation usually have very low diversity. They also show that regulatory elements with a CpG site (a C followed by a G in the DNA sequence) had higher diversity, probably because these sites have a higher mutation rate.

Finally, they looked at how positive selection had acted in regions around each of their DNase I peaks. They did this by measuring shifts in allele frequencies near the sites of interest. Interestingly, their data show evidence of an inversion on chromosome 17 that is found in some Europeans (more info on the MAPT inversion region can be found here). They looked at gene pathways that were enriched for sites under positive selection in African, Asian, and European populations. They found many pathways that were under positive selection in all populations and, interestingly, showed that the pathway involved in skin pigmentation was under positive selection in Europeans, and that pathways involved in susceptibility to diabetes were enriched for selection in Africans.

Tuesday, October 16, 2012

Eco-evolutionary spatial dynamics

Hanski, I. (2011) "Eco-evolutionary spatial dynamics in the Glanville fritillary butterfly." PNAS

This article reviews the dynamics of a butterfly species that lives in a series of meadows in the Åland Islands. The butterflies constantly go extinct in individual meadows, which are then recolonized from others. It has been shown that after a series of extinctions, there is a burst of colonizations, leading to a fairly stable population size.

I found it very interesting that a variant of one gene, Pgi, has been shown to affect dispersal propensity. Individuals heterozygous at this locus carry one allele with an A and one with a C, and these individuals are more likely to disperse than the AA homozygotes. Also, the CC homozygotes are very rare; the author indicates that this is probably because the C allele is linked to a recessive lethal mutation, so individuals with two copies of the C allele rarely survive through development. AA homozygotes primarily arise during inbreeding within a meadow, after it has been colonized, and have lower fitness. Therefore, the AC heterozygotes actually have higher fitness.

This is an example of heterozygote advantage (or overdominance), where individuals carrying two different alleles for a gene have higher fitness than individuals with two copies of the same allele. A primary example is sickle cell anemia in humans, which is caused by a certain allele of one of the genes that makes hemoglobin. If you have two copies of the normal allele your blood cells are normal, but if you have two copies of the alternate allele your blood cells become sickle shaped and are very bad at delivering oxygen. If you have one copy of each allele you have a mix of cell shapes. Normally this is bad, but if someone with a mix lives in an area with a high rate of malaria infection they actually do better.
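Heterozygote advantage is also why a harmful allele can stick around at a stable frequency. Here is a minimal sketch under the textbook one-locus model with random mating. The fitness values are made up for illustration (they are not estimates from the butterfly or from humans), and the C allele is treated as a recessive lethal as described above.

```r
# Assumed (hypothetical) fitnesses: AA = 1 - s_AA, AC = 1, CC = 1 - s_CC.
s_AA <- 0.2   # made-up cost of being an AA homozygote
s_CC <- 1     # CC is lethal

q <- 0.01     # starting frequency of the C allele
for (gen in 1:200) {
  p <- 1 - q
  # mean fitness of the population under random mating
  w_bar <- p^2 * (1 - s_AA) + 2 * p * q + q^2 * (1 - s_CC)
  # frequency of C after one generation of selection
  q <- (p * q + q^2 * (1 - s_CC)) / w_bar
}

# Standard result for overdominance: equilibrium q = s_AA / (s_AA + s_CC)
q_expected <- s_AA / (s_AA + s_CC)
print(c(simulated = q, expected = q_expected))
```

Even though CC individuals never survive in this sketch, the C allele settles at an intermediate frequency because the heterozygotes do best.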

Monday, October 15, 2012

What did Robert do today?

Well, I'm working on a data set of whole genome sequence from 13 Capsella grandiflora individuals. The main goal of this project is to quantify selection across the whole genome of this species (and the closely related selfer C. rubella). My main project today was to calculate pairwise divergence between my samples, so I can see if there is any clustering of individuals. Those scripts are running, and hopefully tomorrow I'll have awesome pictures.

I did play around with making neighbor-joining trees in R, so I can plot these data in a meaningful way. It is actually much easier than I expected: R has a library for working with phylogenies (ape) with a handy-dandy function for making neighbor-joining trees based on an input matrix of differences between each sample. There is one thing about the function example in the R docs that confuses me. When they make the input matrix:
x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)
M <- matrix(0, 8, 8)
M[row(M) > col(M)] <- x
M[row(M) < col(M)] <- x
The matrix is not symmetrical. I tried excluding the second line that adds the data to the matrix, and it doesn't seem to affect the resulting tree. So I think I'll just be giving R my data with half the matrix empty.
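The asymmetry comes from the fill order: R fills both the lower and the upper triangle column by column, so the same vector lands in different relative positions in the two halves. Here is a minimal base-R sketch of one way to build a symmetric version instead. The variable names follow the docs example; mirroring with t() is my own workaround, not something from the ape documentation.

```r
x <- c(7, 8, 11, 13, 16, 13, 17, 5, 8, 10, 13,
       10, 14, 5, 7, 10, 7, 11, 8, 11, 8, 12,
       5, 6, 10, 9, 13, 8)
M <- matrix(0, 8, 8)
M[lower.tri(M)] <- x   # fill the lower triangle, column by column
M <- M + t(M)          # mirror it into the upper triangle (diagonal stays 0)
print(isSymmetric(M))

# as.dist() from the stats package keeps only the lower triangle,
# so the upper half does not change the resulting distances.
d <- as.dist(M)
print(d)
```

If nj() converts a matrix input with as.dist() (my assumption here, I haven't confirmed it in the source), that would explain why dropping the upper-triangle line did not change the tree. It also means the data need to be in the lower triangle, not the upper one.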

A survey of loss-of-function variants in the human genome

MacArthur DG, et al. (2012) "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes." Science: 335.

Loss-of-function (LOF) variants, or alleles that stop protein activity, are expected to be rare for most genes. These authors looked at whole genome data from a pilot of the 1000 Genomes Project for variants of genes that had some loss of function. They had four categories of interest: 1) nonsense mutations (new stop codons inserted into the gene), 2) splice-site-disrupting single-nucleotide variants (SNVs; sites that disrupt exon splicing), 3) indels expected to disrupt the reading frame, and 4) very large deletions that removed most of a gene's coding sequence.

They found, not surprisingly, that the allele frequencies of LOF variants were shifted towards rare variants, indicating purifying selection is acting strongly on these variants. They also noted that most of the indels and SNVs were clustered around the 3' end of the gene, indicating that mutations toward the end of a gene were less deleterious, and selection on them was weaker. It would have been interesting to see the allele frequency spectrum (AFS) of these as a separate category, however. They also noted a slight peak in these types of mutations toward the beginning (5' end) of the gene sequence, which they suggested was due to alternate start codons leading to relaxed selection. Overall, this just indicated that the 'meat' of a gene, the part you don't want to mess up, is usually toward the middle.

Interestingly, their list of candidate genes that had LOF variants was highly enriched for chemical sensory genes (e.g. those involved in smell and taste). Since a loss of function of one of these alleles isn't immediately fatal, it makes sense that selection to maintain function in these genes is weaker. They did also find several genes that were in regions that show evidence of positive selection. Several of the olfactory genes appear in these regions, and so does one gene that may be involved in brain lipid formation and another in male fertility. These regions could, of course, be positively selected for some other locus, and these deleterious LOF variants were just dragged along.

The most interesting finding of the article is definitely the number of LOF variants per individual. They estimate that most people have about 100 LOF alleles, most of which are heterozygous. They also point out that since theory predicts we should each only carry about 5 recessive lethal mutations, most of these LOF variants are probably only slightly deleterious.

This article did make me think a bit about how splicing works. Most genes have many exons, which are put together to create the final mRNA that is translated into protein. In some transcripts not all exons are present, however. How many of the possible variants do we see, and what determines if we see them or not? For example, if a gene has 6 exons (the average number of exons on the first C. rubella chromosome is 5.5) and any non-empty subset of them can be spliced together in genomic order, then there are 2^6 - 1 = 63 possible variants, and I'm sure we don't see all of them. What causes this? I'm sure someone knows, maybe Emily can shed some light here.
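How many variants are "possible" depends entirely on what you allow. A rough sketch of the counting, assuming exons can be skipped but never reordered or duplicated (my assumption; real splicing also has alternative splice sites, retained introns, and so on):

```r
n_exons <- 6

# Each exon is either included or skipped; drop the empty transcript.
any_subset <- 2^n_exons - 1          # 63

# If the first and last exon must always be kept, only internal exons vary.
fixed_ends <- 2^(n_exons - 2)        # 16

print(c(any_subset = any_subset, fixed_ends = fixed_ends))
```

Either way the count grows exponentially with exon number, so even a modest gene has far more conceivable isoforms than anyone is likely to observe.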

Monday, September 10, 2012

Human regulatory network architecture.

Architecture of the human regulatory network derived from ENCODE data
Gerstein et al. 2012 Nature.

This paper uses ChIP sequencing to identify binding sites for 119 transcription factors in 5 cell lines (from only one human, I think. maybe?). They used these data to construct a network of transcription factors and the genes they regulate. Their overall goal was to describe the architecture of the regulatory network, identify correlations between network position and other genomic properties, and test if selection acts differently on different places in the network.

They have a lot of results, and a lot of the data presented in the main text feels a bit anecdotal, so instead of providing a laundry list of all of them, I'll just point out things I found interesting with the caveat that I don't really understand most of their methods.

1) They looked at situations where two transcription factors have an overlapping binding site, which they call coassociation. Transcription factors tend to coassociate with different partners in sites that are near a gene ('proximal') and far from a gene ('distal'). However, this conclusion appears to be based on supplementary figure 2C3, which only shows associations between one focal transcription factor and those factors that differ between proximal and distal sites.

2) The researchers constructed a network of associations between transcription factors and their targets and found they could group transcription factors into three levels of hierarchy. Highly connected factors tend to be highly expressed across tissues, which is unsurprising to me.

3) The researchers used diversity data from the 1000 genomes project to measure constraint on target genes and transcription factors. They found the strongest constraint on genes that are regulated by many transcription factors, followed by transcription factors that regulate many genes. They also found that transcription factors at the top level of the network are more constrained than those at the middle and lower level.

4) They also took a stab at one of my pet interests: allele-specific expression. It's a bit complicated, but what I think is going on is that when transcription factors bind preferentially to an allele, this allele is also more likely to be preferentially expressed downstream. However, this section is really unclear to me because allele-specific expression is generally defined as being any difference in expression level between alleles, not a preference for one allele (so I'm not sure what they mean when they say things like "X% of genes show allele-specific expression from the paternal allele"). If my interpretation is right, then this suggests that most allele-specific binding is enhancing expression? But who knows. It's a bit frustrating that with 271 pages of supplement, they can't find the space to clearly define their terms.

5) Finally, the researchers compared diversity in transcription factor binding sites that show allele-specific binding to those that don't. They found that the allele-specific sites have a higher SNP density, suggesting that they're under less constraint than those binding sites without allele-specific binding. The authors think that this result, that allele-specific binding sites are under less constraint, is 'surprising'. I don't find it surprising AT ALL. If the genetic variation that causes allele-specific binding is deleterious and subject to purifying selection (which we think is the case for most variation), then this result makes perfect sense.

Friday, August 10, 2012

Regulatory networks and phenotypic evolution

Kopp & McIntyre. 2012. Transcriptional network structure has little effect on the rate of regulatory evolution in yeast. MBE

It's expected that network position will affect regulatory evolution because evolutionary changes at nodes with fewer connections are less likely to have deleterious pleiotropic effects. The authors test this prediction by combining cis-regulated expression divergence between S. cerevisiae and S. paradoxus with a number of gene network data sets for S. cerevisiae. In particular, they looked at the number of transcription factors binding to each gene ('incoming connections') and the number of genes regulated by each transcription factor ('outgoing connections'). They found no overall relationship between the number of outgoing connections and cis-regulatory divergence. There was a significant correlation between the number of incoming connections and cis-regulatory expression divergence: genes regulated by many transcription factors have a higher cis-regulatory divergence than those regulated by few transcription factors. The authors claim that the magnitude of this effect is small, but it's hard to tell based on the data they present. Their main explanation for this is that genes that are regulated by many different transcription factors are likely to have more binding sites and thus a larger mutational target. The authors also looked at five smaller data sets made up of condition-dependent subnetworks and found a significant relationship between incoming connection number and divergence in most subnetworks, but only found a relationship between outgoing connections and divergence in the stress response subnetwork.

Method notes: The divergence data come from allele-specific expression measured in F1 hybrids of S. cerevisiae and S. paradoxus in multiple conditions, while the network data come from various chromatin immunoprecipitation (ChIP) experiments. These ChIP experiments quantify binding between candidate transcription factors and genomic regions by hybridizing the DNA pulled down with each transcription factor to tiling microarrays. Since TF binding can be condition specific, this method could miss some true binding sites while finding others which are not biologically significant.

Saturday, August 4, 2012

Em summarizes

Andres et al. 2009. Targets of balancing selection in the human genome. Molecular Biology and Evolution

While genome scans have been successfully used to find the signature of purifying selection, finding evidence of balancing selection is more difficult. Andres et al. approach this problem by scanning human coding sequence for evidence of long-term balancing selection, which should leave narrow regions of excess polymorphism. They used sequence data from 13,400 genes in 39 humans (two populations) to construct a demographic model and then tested for balancing selection in the 4,877 genes which had 10+ polymorphic or divergent sites. They conducted a two-part test: first they used a modified HKA test to detect genes that showed an excess of polymorphism relative to divergence. Second, they looked at each gene's allele frequency spectrum and found genes with an excess of intermediate-frequency alleles. They identified 60 genes that deviated from the expectations set by the neutral demographic model in both tests. An MHC gene which was previously known to be under balancing selection was included in this set, validating their results. They also found that on average these 60 genes had higher LD than the rest of the genome, consistent with there being positive epistasis between sites.

Wednesday, August 1, 2012

Em summarizes

Obbard et al. 2009 Quantifying adaptive evolution in the Drosophila immune system. PLoS Genetics.

Stephen said to read this a while ago, I did, and didn't think much of it. Now, after banging around calculating alpha myself, it seems a lot more interesting ...


Population genetics studies have found that a surprisingly large proportion of changes in Drosophila genomes were fixed by positive selection (this value is measured as α). Obbard and co. explore this result by investigating sequences of immune-related genes, which they expect to have higher rates of adaptive evolution. They resequenced 136 immune genes and 136 nearby non-immune related control genes in 6 populations of Drosophila melanogaster and 2 populations of D. simulans, with 4 individuals pooled per population. They first found that, as expected, α is higher in immune genes (α = 0.65) compared to their controls (α = 0.41). Second, they looked at the distributions of α values and found that this difference is driven by a small subset of their immune-related genes. Third, they classified genes by various pathways and function and found that some of these groups have higher average α values than others (this presumably makes sense for people that understand Drosophila immune systems). Finally, they look at lineage specific divergence and show that a values are correlated in D. simulans and D. melanogaster, suggesting that similar selective pressures are operating in both species.

Overall this paper suggests that, since α is higher in genes that are expected to be under strong positive selection, high α estimates in Drosophila represent reality, not artifacts of some other process. Also interesting, for me, are their calculations of the exact number of variants fixed by positive selection (a), which I'd like to do with my own data. Also, they appear to calculate lineage-specific divergence by using PAML to estimate the ancestral state and then calculate divergence between the inferred ancestor and their sequenced genes, something which the author of PAML says not to do.
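For reference, here is a minimal sketch of the simplest McDonald-Kreitman-style estimates of α and a from counts of divergent and polymorphic sites. This is the basic textbook estimator, not necessarily the method Obbard et al. used (they may well use something more sophisticated), and the counts below are made up.

```r
# Dn, Ds: nonsynonymous and synonymous fixed differences (divergence)
# Pn, Ps: nonsynonymous and synonymous polymorphisms
mk_alpha <- function(Dn, Ds, Pn, Ps) {
  # alpha: estimated proportion of nonsynonymous substitutions that were adaptive
  alpha <- 1 - (Ds * Pn) / (Dn * Ps)
  # a: estimated number of adaptive nonsynonymous substitutions
  a <- Dn - Ds * (Pn / Ps)
  c(alpha = alpha, a = a)
}

# Hypothetical counts for one set of genes
res <- mk_alpha(Dn = 60, Ds = 100, Pn = 20, Ps = 80)
print(res)
```

Note that a is just α multiplied by Dn, so the two estimates carry the same information on different scales; slightly deleterious polymorphisms inflate Pn and bias both downward.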

Wednesday, June 20, 2012

Robert Summarizes

Hvilsom C, Qian Y, et al. (2012) “Extensive X-linked adaptive evolution in central chimpanzees.” PNAS

They analyzed diversity in exons on the autosomes and the X chromosome in chimpanzees. They showed higher diversity in chimpanzees than in humans, in line with earlier studies. Using the DFE-alpha approach they showed that strong selection (Nes > 100) is slightly elevated on the X chromosome (they didn’t test for significance, but the Ne of the X should be lower than that of the autosomes, so this is a strong sign of differences in selection). Using two different McDonald-Kreitman based tests they show no evidence for positive selection on autosomes but some on the X chromosome (38% from DFE-alpha). They saw no evidence for selective sweeps in windows on the X chromosome, however; they suggest the difference in alpha is caused by fixation of novel recessive mutations. They also show greatly reduced coding polymorphism on the X chromosome compared to autosomes, showing faster X evolution caused by positive selection.

Monday, June 18, 2012

Robert Summarizes

Drewell RA, Lo N, et al. (2012) “Kin conflict in insect societies: a new epigenetic perspective.” TREE

Review paper that suggests that imprinting may play a role in conflict between paternal and maternal interests in social insects (i.e. ants, bees, wasps). Since it would be in the paternal interest for workers to produce offspring but this is not in the interest of the queen, genes that regulate fertility in workers may be under strong conflict. Since fathers are more related to their own offspring than to a random worker from the colony (1/2 vs ⅜-0 depending on the matings the queen has), there also may be conflict amongst workers, since relatedness is not equal between all workers. However, there seems to be much less methylation in ants than in mammals (3 orders of magnitude less), which indicates that methylation may not play a large role in imprinting, that imprinting is mediated by a different mechanism, or that methylation is used much more sparingly. They suggest that many genes (primarily developmental genes and those that control sterility) are good candidates for imprinting. Some evidence from the behavior of crosses between European honey bees and Africanized bees does indicate that the direction of the cross affects phenotype.

Doing reciprocal crosses here would be very interesting, but because most queens mate multiply it may complicate matters (or at least make results less biologically relevant if queens are forced to mate only once). I want to think about this one some more...

Glémin S, Bazin E, and Charlesworth D. (2006) “Impact of mating systems on patterns of sequence polymorphism in flowering plants.” Proc R Soc B

They gathered available sequence data from many plant species and families. They analyzed polymorphism (Watterson’s theta and pi) in synonymous, nonsynonymous, and some non-coding (intronic) sites. They performed ANOVAs to test whether mating system contributed more to diversity levels than other life history traits. Mating system did contribute significantly more than any other trait. They also showed that selfing species have lower GC content (possibly due to less biased gene conversion because of low heterozygosity). Additionally, they showed tentative evidence that selection is less effective in selfers than in outcrossers.
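As a reminder to myself of what the two diversity statistics measure, here is a minimal sketch that computes both from a toy alignment. The alignment is made up, and the code ignores gaps, missing data, and sites with more than two alleles, which a real pipeline would have to handle.

```r
# Hypothetical alignment: 4 sequences (rows) by 6 sites (columns)
aln <- rbind(
  c("A", "C", "G", "T", "A", "C"),
  c("A", "C", "G", "T", "A", "T"),
  c("A", "T", "G", "T", "A", "C"),
  c("A", "T", "G", "A", "A", "C")
)
n <- nrow(aln)      # number of sequences
n_sites <- ncol(aln)

# S: number of segregating (polymorphic) sites
S <- sum(apply(aln, 2, function(site) length(unique(site)) > 1))

# Watterson's theta per site: S scaled by the harmonic number a_n
a_n <- sum(1 / seq_len(n - 1))
theta_w <- S / a_n / n_sites

# pi per site: mean number of differences over all pairs of sequences
pairs <- utils::combn(n, 2)
diffs <- apply(pairs, 2, function(p) sum(aln[p[1], ] != aln[p[2], ]))
pi_hat <- mean(diffs) / n_sites

print(c(S = S, theta_w = theta_w, pi = pi_hat))
```

Theta depends only on how many sites vary, while pi also depends on allele frequencies, which is why comparing the two is the basis of tests like Tajima's D.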

Friday, June 15, 2012

Robert Summarizes

Carneiro M, Albert FW, et al. (2012) “Evidence for widespread positive and purifying selection across the European rabbit (Oryctolagus cuniculus) genome.” Mol Biol Evol.

They extracted RNA from the brains of rabbits of 2 species, assembled the transcriptome, and then quantified both positive and negative selection. They used several methods to estimate alpha (including DFE-alpha) and got very high estimates of positive selection: alpha was ~0.6. They found very high estimates of negative selection as well: at least 93% of sites had Nes > 10. They found slightly increased selection on the X chromosome in one species but not in the other, though neither of these comparisons to autosomes was significant. They attribute most of this increased selection to very large Ne in rabbits (~ 800,000 - 1,130,000 for the two species).

He F, Zhang X, et al. (2012) “Genome-wide analysis of cis-regulatory divergence between species in the Arabidopsis genus.” MBE.

The authors looked at expression in F1s between A. thaliana (maternal) and A. lyrata (paternal). They found lots of allele-specific expression (ASE). Most genes that showed ASE (up to ~90%) showed it only from the A. lyrata copy. They found that A. lyrata ASE genes were enriched for a marker associated with repression during development, while the A. thaliana ASE genes were associated with a different marker associated with increased growth during development. They suggest imprinting is unlikely due to the small number of imprinted genes found in A. thaliana. This argument seems weak, since there still could be more imprinted loci we don't know about, and there shouldn't be many in selfers anyway. Another explanation would be accumulation of recessive deleterious alleles in the selfer, so the outcrossing allele is preferentially expressed.

Hufford MD, Xu X, et al. (2012) “Comparative population genomics of maize domestication and improvement.” Nat. Genet.

They compared wild, domesticated, and “improved” maize genetics. They found lots of sweeps in domestication, and in improvement, but with lower signals of sweeps in improvement. The number of sites and genes acted on in improvement seems much smaller. Looks like mostly regulatory changes though. pi ~= .004, domesticated races kept most of wild diversity.

Robert reawakens the blog?

I've decided to get better at summarizing my readings for a few reasons. First, I'm terrible at remembering what I've read, so if I write something about it that should improve. Second, I need to get better at names, and writing authors' names after I read something will help that. Lastly, I think it (or you people) will guilt me into reading more effectively.

I was going to set up a new blog to do this or just use a google doc, but since we aren't using this blog for anything else, and this way you two can bug me to complete this, I figured that I'm gonna use E(h)volution, barring objections. I'm aiming for my summaries to be 100-200 words, with perhaps a bit of opinion about the work. It would be awesome to have little mini discussions about the articles here (think baby journal club), especially if you think I've summarized something incorrectly, or missed the point. I want to do these on a semi-daily basis, usually one or a few articles. You have my permission to yell at me if I haven't done one for a while. Next post with a few from the last couple days follows.

Cheers!
Robert