  • Technical Report

An evolutionary framework for measuring epigenomic information and estimating cell-type-specific fitness consequences

A Publisher Correction to this article was published on 20 February 2019

Abstract

Here we ask the question “How much information do epigenomic datasets provide about human genomic function?” We consider nine epigenomic features across 115 cell types and measure information about function as a reduction in entropy under a probabilistic evolutionary model fitted to human and nonhuman primate genomes. Several epigenomic features yield more information in combination than they do individually. We find that the entropy in human genetic variation predominantly reflects a balance between mutation and neutral drift. Our cell-type-specific FitCons scores reveal relationships among cell types and suggest that around 8% of nucleotide sites are constrained by natural selection.

Fig. 1: Conceptual diagram of the FitCons2 algorithm.
Fig. 2: Decision tree and clusters for the human genome.
Fig. 3: Information and synergy.
Fig. 4: Annotation-specific distributions of FitCons2 scores.
Fig. 5: Genome browser display.

Data availability

All raw data for this study are publicly available from the sources described in the Supplementary Note. The cell-type-specific and integrated FitCons2 scores are available as UCSC genome browser tracks at http://compgen.cshl.edu/fitCons2/. Additional data generated during the course of our analyses can be obtained from the corresponding author upon reasonable request.

References

  1. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  2. Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).

  3. The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).

  4. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

  5. Doolittle, W. F. Is junk DNA bunk? A critique of ENCODE. Proc. Natl Acad. Sci. USA 110, 5294–5300 (2013).

  6. Eddy, S. R. The ENCODE project: missteps overshadowing a success. Curr. Biol. 23, R259–R261 (2013).

  7. Ernst, J. & Kellis, M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28, 817–825 (2010).

  8. Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).

  9. Ritchie, G. R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).

  10. Shihab, H. A. et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).

  11. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  12. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  13. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  14. Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).

  15. Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).

  16. Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).

  17. Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).

  18. Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

  19. Iwasa, Y. Free fitness that always increases in evolution. J. Theor. Biol. 135, 265–281 (1988).

  20. Barton, N. H. & Coe, J. B. On the application of statistical physics to evolutionary biology. J. Theor. Biol. 259, 317–324 (2009).

  21. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).

  22. Taipale, J. Informational limits of biological organisms. EMBO J. 37, e96114 (2018).

  23. Gao, T. et al. EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics 32, 3543–3551 (2016).

  24. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).

  25. Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).

  26. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).

  27. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

  28. Liu, F. et al. The human genomic melting map. PLoS Comput. Biol. 3, e93 (2007).

  29. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).

  30. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  31. The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).

  32. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

  33. Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113, 11901–11906 (2016).

  34. Song, Q. et al. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS ONE 8, e81148 (2013).

  35. Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl Regulatory Build. Genome Biol. 16, 56 (2015).

Acknowledgements

We thank R. Ramani for assistance with browser track development, D. McCandlish for comments on the manuscript, N. Dukler for calculating the number of bits required to encode the reference human genome, and other members of the Siepel laboratory for helpful discussions. This research was supported by US National Institutes of Health grants R01-GM102192 and R35-GM127070 (to A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.

Author information

Contributions

B.G. and A.S. conceived and designed the study. B.G. designed and implemented the FitCons2 method. B.G. and A.S. analyzed the data. A.S. supervised the research. B.G. and A.S. wrote the manuscript.

Corresponding author

Correspondence to Adam Siepel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Estimated decision tree for a random sample of 57 cell types.

The estimated tree is nearly identical to the one estimated from all 115 cell types (Fig. 2). The main changes are in the subtree beneath node 20 (highlighted in gray). Beneath this node, the original and resampled trees are still quite similar, with differences primarily in the order in which decision rules are applied. The impact on the estimated values of ρ is minimal, and the classes for which ρ does change contain few sites (~0.1% of the genome). Four additional trees (not shown) were estimated from random samples of 57 cell types, and all of them were similarly consistent with the tree in Fig. 2.

Supplementary Fig. 2 Reductions in entropy per site due to selection plotted as a function of estimated ρ for the 61 classes.

Reductions in entropy per site due to selection (vertical axis) plotted as a function of estimated ρ (see Supplementary Table 3) for the 61 classes. Sizes of circles reflect numbers of sites. Coding classes (CDS) are shown in blue, and noncoding classes (NCD) are shown in orange. The three classes with negative estimates of the reduction in entropy due to selection are shown in red.

Supplementary Fig. 3 Violin plots of FitCons2 scores for HUVEC and H1hESC cells.

Violin plots of FitCons2 scores similar to the one shown in Fig. 4 for two additional cell types: HUVEC (top) and H1hESC (bottom). Notice the similarity in the annotation-dependent distributions across cell types, despite differences in the genomic locations of the active regions.

Supplementary Fig. 4 Hierarchical clustering of 115 cell types based on FitCons2 scores.

The dendrogram is derived from a ‘Manhattan’ or L1 distance matrix defined such that the distance between every pair of cell types is equal to the sum of the absolute differences of their nucleotide-specific FitCons2 scores (see Methods). Clustering was done using the Ward-D2 clustering method in R. Major groups in the dendrogram correspond to cell types associated with (clockwise from top left) blood and the immune system (brown), internal organs (red), the digestive system (gray), neural tissues (blue), skin and connective tissue (purple), and stem cells (green). Insets show examples of closely related cell types from each group. Notice that the digestive cell types are nested within the internal organ-related cell types. Within the neural tissue cluster, separate groups are evident for embryonic and adult brain tissues (blue inset; embryonic cell types highlighted at bottom). Similarly, fetal cell types form subclusters within the internal organ (red inset, entire group) and digestive system (gray inset, gray background) groups. SUS, sites under selection (see Supplementary Note).
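The L1 distance defined above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the cell-type names and score values are hypothetical toy data, and the function names are chosen here for illustration only. The resulting matrix is the kind of input that a Ward-type hierarchical clustering routine would take.

```python
from itertools import combinations


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors must cover the same sites")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(scores):
    """Symmetric L1 distance matrix for a dict of cell type -> per-site scores."""
    names = sorted(scores)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = l1_distance(scores[n], scores[m])
        dist[n][m] = dist[m][n] = d
    return dist


# Hypothetical per-nucleotide scores for three made-up cell types.
toy_scores = {
    "cell_A": [0.10, 0.40, 0.90, 0.05],
    "cell_B": [0.10, 0.45, 0.85, 0.05],
    "cell_C": [0.60, 0.05, 0.10, 0.70],
}
toy_dist = distance_matrix(toy_scores)
```

In this toy input, cell_A and cell_B differ only slightly at two sites, so their distance is much smaller than either one's distance to cell_C, which is the pattern a dendrogram would group on.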

Supplementary Fig. 5 Genome browser display showing FitCons2 scores in the promoter region and first few exons of the MIER2 gene.

In the red callout at the lower left, FitCons1 highlights a regulatory locus of about 200 bp, upstream of MIER2, via an elevated score surrounding an enhancer-associated ChromHMM feature. FitCons2 refines this locus by identifying a binding site of ~8 bp (AP1, red) with a sharply elevated score (cluster 48, ρ = 0.31). A second component of this locus contains a smRNA signal immediately adjoining a GWAS hit. An elevated LINSIGHT score provides evidence for the importance of the binding site, but does not identify the adjoining variation, suggesting a cell-type-specific effect. The TSS and core promoter are highlighted in a green callout (center left). Active TF binding sites within the promoter are indicated by the green FitCons2 class (16, ρ = 0.43), while elevated scores identify the start codon and more conserved codon positions (blue; 07, ρ = 0.67). At the boundary of the first exon, FitCons2 scores spike, indicating increased selective pressure at both intronic (14, ρ = 0.92) and exonic (04, ρ = 0.93) splice boundaries. At center right, FitCons2 scores regularly alternate with a period of three, reflecting reduced constraint at the third codon position. The scores are elevated at the splice site (orange; 24, ρ = 0.16) and then drop off in the intron.

Supplementary Fig. 6 Genome browser display showing FitCons2 scores at the BCL3 gene and upstream enhancers in E023, a derived adipocyte cell type.

At far left (gold highlight), elevated FitCons2 scores identify individual TFBSs in the enhancer region. Enhancers associated with BCL3 are identified in brown (at top), and both ChromHMM and DNase-seq features support elevated FitCons2 scores. The TSS and core promoter (purple highlight) also show elevated FitCons2 scores and classes indicating promoter activity. Individual binding sites can be observed via the green FitCons2 class 40 (ρ = 0.43). The blue highlight shows elevated selective pressure at an active intronic splice site (red; class 14, ρ = 0.92), followed by a periodic pattern mirroring codon structure (blue; classes 00, ρ = 0.75, and 05, ρ = 0.62). FitCons2 scores here are elevated by the presence of a strong RNA-seq feature (dark green). The central brown highlight shows several areas identified as intronic enhancers for BCL3 (brown at top) including a cluster of TFBSs (gold). In the final detail (dark green), the smRNA-seq feature (turquoise) drives an elevated score that surrounds an 8-bp locus in the 3ʹ UTR of this gene, an annotated microRNA-binding site (BCL3:miR-19).

Supplementary Fig. 7 Genome browser display showing FitCons2 scores for multiple cell types at a super-enhancer on chromosome 13.

a, Genome browser display showing FitCons2 scores for multiple cell types at a cell-type-specific super-enhancer on chromosome 13. Super-enhancer SE33394 appears active, and obtains high scores, in the H1hESC cell type (A) but not in the GM12878 (B) or HUVEC (C) cell types. Although the SE33394 target LECT1 is nearly 3 Mb away, transcription levels at the gene follow FitCons2 enhancer scores in the corresponding cell type. b, LECT1 (which encodes a protein associated with suppression of angiogenesis) is transcribed in H1hESC (A) but not GM12878 (B) or HUVEC (C) cells. Highlighted in gold in a, SE33394 contains five loci identified as distal regulatory modules (blue at top) as well as a FANTOM5 enhancer (green), and is flanked by two GWAS hits associated with blood-related phenotypes (highlighted in red, rs10507601 and rs9527419). Scores at each genomic position are aggregated across cell types to generate an integrated FitCons2 score (bottom). This integrated score identifies elements of the super-enhancer exhibiting the potential for cell-type-specific activity, without requiring epigenomic data from any particular cell type.

Supplementary Fig. 8 Extended legend for browser displays.

Extended legend for the browser displays shown in Fig. 5 and Supplementary Figs. 5–7. Bar notation for ρ indicates an average across multiple FitCons2 clusters.

Supplementary Fig. 9 Extended legend for browser displays.

Extended legend for the browser displays shown in Fig. 5 and Supplementary Figs. 5–7.

Supplementary Fig. 10 Predictive power for genomic function.

a, Sensitivity of various computational prediction methods (see Supplementary Note) for cell-type-specific transcription factor binding sites (TFBSs). Sensitivity is evaluated using 55,024 motif matches for 12 transcription factors in ChIP–seq peaks for H1-hESC cells (ref. 6) (Supplementary Methods). Sensitivity is plotted against total coverage outside of annotated coding regions as the prediction threshold for each method is varied. Results for two sets of FitCons1 and FitCons2 scores are shown: integrated scores across cell types (I) and cell-type-specific scores for H1-hESC cells. For reference, the vertical gray bar shows the expected fraction of the noncoding genome that is under selection according to FitCons2 (that is, the average score in noncoding regions). b, Receiver operating characteristic (ROC) curves for human disease-associated (pathogenic) single-nucleotide variants (SNVs) listed in HGMD (1,495 HGMD SNVs and 15,042 matched negative controls). The same computational methods are shown, but in this case only integrated scores are used for FitCons1 and FitCons2. The area-under-the-curve (AUC) statistic is listed after each label in the key. False positives are assessed using likely benign variants matched by distance to the nearest transcription start site (Supplementary Methods).
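The sensitivity-versus-coverage curve in panel a can be illustrated with a small sketch. It is not the evaluation code used in the study: the scores and TFBS labels below are hypothetical toy values, and the function names are assumptions made for this example. It sweeps the prediction threshold, records the fraction of known sites recovered and the fraction of all sites predicted, and computes the mean score that the legend describes as the expected fraction under selection.

```python
def sensitivity_and_coverage(scores, is_positive, threshold):
    """Return (sensitivity, coverage) when sites scoring >= threshold are predicted."""
    predicted = [s >= threshold for s in scores]
    n_pos = sum(is_positive)
    if n_pos == 0:
        raise ValueError("need at least one positive site")
    true_pos = sum(1 for p, y in zip(predicted, is_positive) if p and y)
    return true_pos / n_pos, sum(predicted) / len(scores)


def sensitivity_coverage_curve(scores, is_positive):
    """One (threshold, sensitivity, coverage) point per distinct score, strictest first."""
    thresholds = sorted(set(scores), reverse=True)
    return [(t, *sensitivity_and_coverage(scores, is_positive, t)) for t in thresholds]


# Hypothetical noncoding sites: a per-site score and whether the site is a known TFBS.
toy_site_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.05, 0.05, 0.05]
toy_is_tfbs = [True, True, False, True, False, False, False, False]

toy_curve = sensitivity_coverage_curve(toy_site_scores, toy_is_tfbs)

# Mean score, read as the expected fraction of these sites under selection.
expected_fraction = sum(toy_site_scores) / len(toy_site_scores)
```

A method performs well on such a plot when sensitivity rises quickly while coverage is still small, that is, when few sites need to be predicted to recover most of the known binding sites.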

Supplementary Fig. 11 Sensitivity for TFBS prediction as a function of total noncoding coverage.

Sensitivity for TFBS prediction as a function of total noncoding coverage (as in Supplementary Fig. 10a) for the K562 cell type.

Supplementary Fig. 12 Precision versus recall for HGMD.

Precision (vertical axis) versus recall (horizontal axis) for HGMD. This plot is based on the same data as Supplementary Fig. 10b.
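For readers less familiar with the two axes, the following sketch computes precision and recall at a single score threshold. The variant scores and labels are hypothetical toy values, not HGMD data, and the convention of reporting a precision of 1.0 when no variant is called is an assumption of this example.

```python
def precision_recall(scores, is_pathogenic, threshold):
    """Return (precision, recall) when variants scoring >= threshold are called."""
    tp = fp = fn = 0
    for score, label in zip(scores, is_pathogenic):
        called = score >= threshold
        if called and label:
            tp += 1
        elif called:
            fp += 1
        elif label:
            fn += 1
    # Assumed convention: precision is 1.0 when nothing is called.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical variants: two pathogenic, four matched controls.
toy_variant_scores = [0.95, 0.70, 0.60, 0.40, 0.20, 0.10]
toy_is_pathogenic = [True, False, True, False, False, False]

strict = precision_recall(toy_variant_scores, toy_is_pathogenic, 0.9)
lenient = precision_recall(toy_variant_scores, toy_is_pathogenic, 0.5)
```

Lowering the threshold from 0.9 to 0.5 recovers the second pathogenic variant but also admits one control, so recall rises while precision falls. Repeating this over all thresholds traces out a precision-recall curve.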

Supplementary Fig. 13 Receiver operating characteristic curves and precision–recall curves for ClinVar.

Receiver operating characteristic curves (left) and precision–recall curves (right) for ClinVar.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 1–3 and Supplementary Note

Reporting Summary

About this article

Cite this article

Gulko, B., Siepel, A. An evolutionary framework for measuring epigenomic information and estimating cell-type-specific fitness consequences. Nat Genet 51, 335–342 (2019). https://doi.org/10.1038/s41588-018-0300-z

  • DOI: https://doi.org/10.1038/s41588-018-0300-z
