An evolutionary framework for measuring epigenomic information and estimating cell-type-specific fitness consequences

Gulko, Brad; Siepel, Adam

doi:10.1038/s41588-018-0300-z

Technical Report
Published: 17 December 2018

Nature Genetics volume 51, pages 335–342 (2019)

A Publisher Correction to this article was published on 20 February 2019

Abstract

Here we ask the question “How much information do epigenomic datasets provide about human genomic function?” We consider nine epigenomic features across 115 cell types and measure information about function as a reduction in entropy under a probabilistic evolutionary model fitted to human and nonhuman primate genomes. Several epigenomic features yield more information in combination than they do individually. We find that the entropy in human genetic variation predominantly reflects a balance between mutation and neutral drift. Our cell-type-specific FitCons scores reveal relationships among cell types and suggest that around 8% of nucleotide sites are constrained by natural selection.

**Fig. 1: Conceptual diagram of the FitCons2 algorithm.**

**Fig. 2: Decision tree and clusters for the human genome.**

**Fig. 4: Annotation-specific distributions of FitCons2 scores.**

Data availability

All raw data for this study are publicly available from the sources described in the Supplementary Note. The cell-type-specific and integrated FitCons2 scores are available as UCSC genome browser tracks at http://compgen.cshl.edu/fitCons2/. Additional data generated during the course of our analyses can be obtained from the corresponding author upon reasonable request.

References

The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Yue, F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
The GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Doolittle, W. F. Is junk DNA bunk? A critique of ENCODE. Proc. Natl Acad. Sci. USA 110, 5294–5300 (2013).
Eddy, S. R. The ENCODE project: missteps overshadowing a success. Curr. Biol. 23, R259–R261 (2013).
Ernst, J. & Kellis, M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28, 817–825 (2010).
Hoffman, M. M. et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012).
Ritchie, G. R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Shihab, H. A. et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Fu, Y. et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480 (2014).
Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).
Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).
Iwasa, Y. Free fitness that always increases in evolution. J. Theor. Biol. 135, 265–281 (1988).
Barton, N. H. & Coe, J. B. On the application of statistical physics to evolutionary biology. J. Theor. Biol. 259, 317–324 (2009).
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Taipale, J. Informational limits of biological organisms. EMBO J. 37, e96114 (2018).
Gao, T. et al. EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics 32, 3543–3551 (2016).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Liu, F. et al. The human genomic melting map. PLoS Comput. Biol. 3, e93 (2007).
Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113, 11901–11906 (2016).
Song, Q. et al. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS ONE 8, e81148 (2013).
Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl Regulatory Build. Genome Biol. 16, 56 (2015).

Acknowledgements

We thank R. Ramani for assistance with browser track development, D. McCandlish for comments on the manuscript, N. Dukler for calculating the number of bits required to encode the reference human genome, and other members of the Siepel laboratory for helpful discussions. This research was supported by US National Institutes of Health grants R01-GM102192 and R35-GM127070 (to A.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.

Author information

Authors and Affiliations

Graduate Field of Computer Science, Cornell University, Ithaca, NY, USA
Brad Gulko
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Brad Gulko & Adam Siepel

Contributions

B.G. and A.S. conceived and designed the study. B.G. designed and implemented the FitCons2 method. B.G. and A.S. analyzed the data. A.S. supervised the research. B.G. and A.S. wrote the manuscript.

Corresponding author

Correspondence to Adam Siepel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Estimated decision tree for a random sample of 57 cell types.

The estimated tree is nearly identical to the one estimated from all 115 cell types (Fig. 2). The main changes are in the subtree beneath node 20 (highlighted in gray). Beneath this node, the original and resampled tree are still quite similar, with differences primarily in the order in which decision rules are applied. The impact on the estimated values of ρ is minimal, and the classes for which ρ does change contain few sites (~0.1% of the genome). Four additional trees (not shown) were estimated from random samples of 57 cell types, and all of them were similarly consistent with the tree in Fig. 2.

Supplementary Fig. 2 Reductions in entropy per site due to selection plotted as a function of estimated ρ for the 61 classes.

Reductions in entropy per site due to selection (vertical axis) plotted as a function of estimated ρ (see Supplementary Table 3) for the 61 classes. Sizes of circles reflect numbers of sites. Coding classes (CDS) are shown in blue, and noncoding classes (NCD) are shown in orange. The three classes with negative estimates of the reduction in entropy due to selection are shown in red.

Supplementary Fig. 3 Violin plots of FitCons2 scores for HUVEC and H1hESC cells.

Violin plots of FitCons2 scores similar to the one shown in Fig. 4 for two additional cell types: HUVEC (top) and H1hESC (bottom). Notice the similarity in the annotation-dependent distributions across cell types, despite differences in the genomic locations of the active regions.

Supplementary Fig. 4 Hierarchical clustering of 115 cell types based on FitCons2 scores.

The dendrogram is derived from a ‘Manhattan’ or L₁ distance matrix defined such that the distance between every pair of cell types is equal to the sum of the absolute differences of their nucleotide-specific FitCons2 scores (see Methods). Clustering was done using the Ward-D2 clustering method in R. Major groups in the dendrogram correspond to cell types associated with (clockwise from top left) blood and the immune system (brown), internal organs (red), the digestive system (gray), neural tissues (blue), skin and connective tissue (purple), and stem cells (green). Insets show examples of closely related cell types from each group. Notice that the digestive cell types are nested within the internal organ-related cell types. Within the neural tissue cluster, separate groups are evident for embryonic and adult brain tissues (blue inset; embryonic cell types highlighted at bottom). Similarly, fetal cell types form subclusters within the internal organ (red inset, entire group) and digestive system (gray inset, gray background) groups. SUS, sites under selection (see Supplementary Note).
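The distance and clustering step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the original analysis was run in R, whereas this sketch assumes Python with NumPy and SciPy, and the function name `cluster_cell_types` and the toy score matrix are hypothetical. SciPy's `ward` linkage is used as the closest analogue of R's Ward-D2 method. SciPy documents Ward linkage as strictly defined only for Euclidean distances, so applying it to L₁ distances should be read as a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_cell_types(scores, n_groups):
    """Cluster cell types from a (cell types x sites) score matrix.

    Returns the condensed L1 distance vector and one group label per cell type.
    """
    scores = np.asarray(scores, dtype=float)
    # L1 ("Manhattan") distance: for each pair of cell types, the sum over
    # sites of the absolute difference in their scores.
    l1 = pdist(scores, metric="cityblock")
    # Agglomerative clustering with Ward linkage on the precomputed distances
    # (assumption: used here as a stand-in for R's ward.D2).
    tree = linkage(l1, method="ward")
    # Cut the dendrogram into at most n_groups flat clusters.
    labels = fcluster(tree, t=n_groups, criterion="maxclust")
    return l1, labels


# Hypothetical toy input: 4 cell types scored at 4 sites.
toy_scores = np.array([
    [0.10, 0.10, 0.80, 0.05],
    [0.12, 0.09, 0.78, 0.05],
    [0.70, 0.60, 0.10, 0.40],
    [0.72, 0.61, 0.08, 0.42],
])
toy_l1, toy_labels = cluster_cell_types(toy_scores, n_groups=2)
```

In this toy input the first two rows and the last two rows have nearly identical profiles, so they fall into separate groups. A genome-scale matrix would need to be processed in chunks rather than held in memory as a dense array.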

Supplementary Fig. 5 Genome browser display showing FitCons2 scores in the promoter region and first few exons of the MIER2 gene.

In the red callout at the lower left, FitCons1 highlights a regulatory locus of about 200 bp, upstream of MIER2, via an elevated score surrounding an enhancer-associated ChromHMM feature. FitCons2 refines this locus by identifying a binding site of ~8 bp (AP1, red) with a sharply elevated score (cluster 48, ρ = 0.31). A second component of this locus contains a smRNA signal immediately adjoining a GWAS hit. An elevated LINSIGHT score provides evidence for the importance of the binding site, but does not identify the adjoining variation, suggesting a cell-type-specific effect. The TSS and core promoter are highlighted in a green callout (center left). Active TF binding sites within the promoter are indicated by the green FitCons2 class (16, ρ = 0.43), while elevated scores identify the start codon and more conserved codon positions (blue; 07, ρ = 0.67). At the boundary of the first exon, FitCons2 scores spike, indicating increased selective pressure at both intronic (14, ρ = 0.92) and exonic (04, ρ = 0.93) splice boundaries. At center right, FitCons2 scores regularly alternate in a period of three, reflecting reduced constraint at the third codon position. The scores are elevated at the splice site (orange; 24, ρ = 0.16) and then drop off in the intron.

Supplementary Fig. 6 Genome browser display showing FitCons2 scores at the BCL3 gene and upstream enhancers in E023, a derived adipocyte cell type.

At far left (gold highlight), elevated FitCons2 scores identify individual TFBSs in the enhancer region. Enhancers associated with BCL3 are identified in brown (at top), and both ChromHMM and DNase-seq features support elevated FitCons2 scores. The TSS and core promoter (purple highlight) also show elevated FitCons2 scores and classes indicating promoter activity. Individual binding sites can be observed via the green FitCons2 class 40 (ρ = 0.43). The blue highlight shows elevated selective pressure at an active intronic splice site (red; class 14, ρ = 0.92), followed by a periodic pattern mirroring codon structure (blue; classes 00, ρ = 0.75, and 05, ρ = 0.62). FitCons2 scores here are elevated by the presence of a strong RNA-seq feature (dark green). The central brown highlight shows several areas identified as intronic enhancers for BCL3 (brown at top) including a cluster of TFBSs (gold). In the final detail (dark green), the smRNA-seq feature (turquoise) drives an elevated score that surrounds an 8-bp locus in the 3ʹ UTR of this gene, an annotated microRNA-binding site (BCL3:miR-19).

Supplementary Fig. 7 Genome browser display showing FitCons2 scores for multiple cell types at a super-enhancer on chromosome 13.

a, Genome browser display showing FitCons2 scores for multiple cell types at a cell-type-specific super-enhancer on chromosome 13. Super-enhancer SE33394 appears active, and obtains high scores, in the H1hESC cell type (A) but not in the GM12878 (B) or HUVEC (C) cell types. Although the SE33394 target LECT1 is nearly 3 Mb away, transcription levels at the gene follow FitCons2 enhancer scores in the corresponding cell type. b, LECT1 (which encodes a protein associated with suppression of angiogenesis) is transcribed in H1hESC (A) but not GM12878 (B) or HUVEC (C) cells. Highlighted in gold in a, SE33394 contains five loci identified as distal regulatory modules (blue at top) as well as a FANTOM5 enhancer (green), and is flanked by two GWAS hits associated with blood-related phenotypes (highlighted in red, rs10507601 and rs9527419). Scores at each genomic position are aggregated across cell types to generate an integrated FitCons2 score (bottom). This integrated score identifies elements of the super-enhancer exhibiting the potential for cell-type-specific activity, without requiring epigenomic data from any particular cell type.

Supplementary Fig. 8 Extended legend for browser displays.

Extended legend for the browser displays shown in Fig. 5 and Supplementary Figs. 5–7. Bar notation for ρ indicates an average across multiple FitCons2 clusters.

Supplementary Fig. 9 Extended legend for browser displays.

Extended legend for the browser displays shown in Fig. 5 and Supplementary Figs. 5–7.

Supplementary Fig. 10 Predictive power for genomic function.

a, Sensitivity of various computational prediction methods (see Supplementary Note) for cell-type-specific transcription factor binding sites (TFBSs). Sensitivity is evaluated using 55,024 motif matches for 12 transcription factors in ChIP–seq peaks for H1-hESC cells⁶ (Supplementary Methods). Sensitivity is plotted against total coverage outside of annotated coding regions as the prediction threshold for each method is varied. Results for two sets of FitCons1 and FitCons2 scores are shown: integrated scores across cell types (I) and cell-type-specific scores for H1-hESC cells. For reference, the vertical gray bar shows the expected fraction of the noncoding genome that is under selection according to FitCons2 (that is, the average score in noncoding regions). b, Receiver operating characteristic (ROC) curves for human disease-associated (pathogenic) single-nucleotide variants (SNVs) listed in HGMD (1,495 HGMD SNVs and 15,042 matched negative controls). The same computational methods are shown, but in this case only integrated scores are used for FitCons1 and FitCons2. The area-under-the-curve (AUC) statistic is listed after each label in the key. False positives are assessed using likely benign variants matched by distance to the nearest transcription start site (Supplementary Methods).
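The two evaluation quantities in this legend can be illustrated with a short sketch. It is a generic illustration under stated assumptions, not the evaluation code used in the study: the function names, the toy scores, and the toy labels are hypothetical, and the input is assumed to be already restricted to noncoding sites so that the called fraction corresponds to noncoding coverage.

```python
import numpy as np


def sensitivity_at_coverage(scores, is_positive, coverage):
    """Fraction of positive sites recovered when the top-scoring
    `coverage` fraction of all sites is called."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    n_called = int(round(coverage * scores.size))
    # Rank sites from highest to lowest score and call the top n_called.
    order = np.argsort(-scores, kind="stable")
    called = order[:n_called]
    return is_positive[called].sum() / is_positive.sum()


def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a
    positive outscores a negative (ties count as one half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())


# Hypothetical toy data: 10 sites, 4 of which are true positives.
toy_scores = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.05, 0.3, 0.15, 0.6, 0.25])
toy_is_positive = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=bool)

toy_sensitivity = sensitivity_at_coverage(toy_scores, toy_is_positive, 0.3)
toy_auc = roc_auc(toy_scores[toy_is_positive], toy_scores[~toy_is_positive])
```

Sweeping `coverage` from 0 to 1 traces out a sensitivity-versus-coverage curve of the kind shown in panel a. The pairwise AUC formula is quadratic in the number of variants, which is acceptable for a few thousand positives and controls but would need a rank-based formulation for much larger sets.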

Supplementary Fig. 11 Sensitivity for TFBS prediction as a function of total noncoding coverage.

Sensitivity for TFBS prediction as a function of total noncoding coverage (as in Supplementary Fig. 10a) for the K562 cell type.

Supplementary Fig. 12 Precision versus recall for HGMD.

Precision (vertical axis) versus recall (horizontal axis) for HGMD. This plot is based on the same data as Supplementary Fig. 10b.
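For readers less familiar with precision–recall plots, a minimal sketch of how the two axes are obtained from scored variants is given below. The function name and toy values are hypothetical, each variant is treated as its own threshold (ties are not merged), and this is not the code used to produce the figure.

```python
import numpy as np


def precision_recall_points(scores, is_positive):
    """Precision and recall after calling the top 1, 2, ..., n variants."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    # Sort variants from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    # Cumulative count of true positives among the called variants.
    hits = np.cumsum(is_positive[order])
    n_called = np.arange(1, scores.size + 1)
    precision = hits / n_called
    recall = hits / is_positive.sum()
    return precision, recall


# Hypothetical toy data: 4 variants, 2 pathogenic and 2 control.
toy_precision, toy_recall = precision_recall_points(
    [0.9, 0.8, 0.4, 0.2], [True, False, True, False]
)
```

Unlike the ROC curve, precision depends on the ratio of positives to negatives. If the plot uses the 1,495 HGMD SNVs and 15,042 matched controls listed for Supplementary Fig. 10b, a method that ranked variants at random would have an expected precision of roughly 9%.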

Supplementary Fig. 13 Receiver operating characteristic curves and precision–recall curves for ClinVar.

Receiver operating characteristic curves (left) and precision–recall curves (right) for ClinVar.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 1–3 and Supplementary Note

Reporting Summary

About this article

Cite this article

Gulko, B., Siepel, A. An evolutionary framework for measuring epigenomic information and estimating cell-type-specific fitness consequences. Nat Genet 51, 335–342 (2019). https://doi.org/10.1038/s41588-018-0300-z

Received: 30 May 2018
Accepted: 30 October 2018
Published: 17 December 2018
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41588-018-0300-z
