  • Letter

GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes

Abstract

With large-scale population sequencing projects gathering pace, there is a need for strategies that advance disease gene prioritization [1,2]. Metrics that provide information about a gene and its ability to tolerate protein-altering variation can aid in clinical interpretation of human genomes and can advance disease gene discovery [1,2,3,4]. Previously reported methods analyzed the total variant load in a gene [1,2,3,4], but did not analyze the distribution pattern of variants within a gene. Using data from 138,632 exome and genome sequences [2], we developed gene variation intolerance rank (GeVIR), a continuous gene-level metric for 19,361 genes that is able to prioritize both dominant and recessive Mendelian disease genes [5], that outperforms missense constraint metrics [3] and that is comparable, but complementary, to loss-of-function (LOF) constraint metrics [2]. GeVIR is also able to prioritize short genes, for which LOF constraint cannot be estimated with confidence [2]. The majority of the most intolerant genes identified here have no defined phenotype and are candidates for severe dominant disorders.

Fig. 1: Correlations among length of VIRs, location of pathogenic variants, and evolutionary conservation.
Fig. 2: GeVIR workflow.
Fig. 3: Comparison of GeVIR gene ranking with gnomAD constraint metrics on 19,361 genes.
Fig. 4: Comparison of GeVIR, LOEUF and VIRLOF performance on the most variant intolerant genes.

Data availability

The GERP++ file can be found at http://mendel.stanford.edu/SidowLab/downloads/gerp/hg19.GERP_scores.tar.gz. The ClinVar files can be found at ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz and ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt. The CCR files can be found at https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.autosomes.v2.20180420.bed.gz and https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.xchrom.v2.20180420.bed.gz. The OMIM genemap2.txt file can be found, after registration, at https://omim.org/downloads. The gnomAD gene constraint metric file can be found at https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz. The gnomAD exomes variants and coverage files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/exomes/gnomad.exomes.r2.0.2.sites.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/coverage/combined_tars/gnomad.exomes.r2.0.2.coverage.all.tar, respectively. The gnomAD genomes variants files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf.bgz. The gnomAD genes, transcripts and exons files can be found at http://broadinstitute.org/~konradk/exac_browser/exac_browser.tar.gz. The Ensembl coding and peptide sequences from build GRCh37/hg19 can be found at https://grch37.ensembl.org/biomart/martview (data set: Human genes (GRCh37.p13); Attributes → Sequences → ‘Coding sequence’ and ‘Peptide’). The homozygous LOF tolerant genes (that is, nulls) can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/homozygous_lof_tolerant_twohit.tsv. 
The cell essential and non-essential genes from CRISPR–Cas experiments can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/CEGv2_subset_universe.tsv and https://github.com/macarthur-lab/gene_lists/blob/master/lists/NEGv1_subset_universe.tsv, respectively. The mouse heterozygous lethal genes can be obtained from http://www.mousemine.org/ by querying the database with the following search terms: path = ‘OntologyAnnotation.ontologyTerm’ type = ‘MPTerm’; path = ‘OntologyAnnotation.subject’ type = ‘SequenceFeature’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject’ type = ‘Genotype’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject.zygosity’ op = ‘ = ’ value = ‘ht’ code = ‘B’; path = ‘OntologyAnnotation.ontologyTerm.name’ op = ‘CONTAINS’ value = ‘lethal’. The human–mouse ortholog mapping file can be found at http://www.informatics.jax.org/downloads/reports/HMD_HumanPhenotype.rpt. The HGNC approved gene symbols can be found at https://www.genenames.org/download/statistics-and-files.
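As a rough illustration of how one of the tab-delimited files listed above might be read, the sketch below extracts a per-gene LOEUF value for canonical transcripts from a gnomAD-style constraint table. The helper names `open_tsv` and `canonical_loeuf`, and the column names `gene`, `canonical` and `oe_lof_upper`, are assumptions made for this example; they should be checked against the header of the downloaded file. This is not the authors' pipeline, which is available in their repository.

```python
import csv
import gzip
import io


def open_tsv(path):
    """Open a plain or gzip-compatible (.gz/.bgz) TSV file in text mode."""
    if path.endswith((".gz", ".bgz")):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def canonical_loeuf(handle, gene_col="gene", canonical_col="canonical",
                    loeuf_col="oe_lof_upper"):
    """Return {gene symbol: LOEUF} for canonical transcripts.

    Column names are assumptions; rows whose LOEUF value is not numeric
    (for example "NA") are skipped.
    """
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[canonical_col].lower() != "true":
            continue  # keep only canonical transcripts
        try:
            result[row[gene_col]] = float(row[loeuf_col])
        except ValueError:
            continue  # non-numeric placeholder such as "NA"
    return result


# Tiny in-memory stand-in for the real file (made-up gene names and values).
demo = io.StringIO(
    "gene\ttranscript\tcanonical\toe_lof_upper\n"
    "GENE_A\tT1\ttrue\t0.25\n"
    "GENE_A\tT2\tfalse\t0.90\n"
    "GENE_B\tT3\ttrue\tNA\n"
)
loeuf = canonical_loeuf(demo)
print(loeuf)
```

With a real download, the same function could be called as `canonical_loeuf(open_tsv(path))`, where `path` points to the local copy of the constraint file.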

Code availability

Code for calculating GeVIR/VIRLOF scores, data analysis and figures can be found at https://github.com/gevirank/gevir. Computed GeVIR/VIRLOF scores are available in Supplementary Table 2.

References

  1. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  2. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at bioRxiv https://doi.org/10.1101/531210 (2019).

  3. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

  4. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).

  5. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).

  6. Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).

  7. Gussow, A. B., Petrovski, S., Wang, Q., Allen, A. S. & Goldstein, D. B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17, 9 (2016).

  8. Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).

  9. Sivley, M. Comprehensive analysis of constraint on the spatial distribution of missense variants in human protein structures. Am. J. Hum. Genet. 102, 415–426 (2018).

  10. Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2018).

  11. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).

  12. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).

  13. Motenko, H., Neuhauser, S. B., O’Keefe, M. & Richardson, J. E. MouseMine: a new data warehouse for MGI. Mamm. Genome 26, 325–330 (2015).

  14. Eppig, J. T. et al. The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease. Nucleic Acids Res. 43, D726–D736 (2015).

  15. Hart, T. et al. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 7, 2719–2727 (2017).

  16. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).

  17. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).

  18. Kobayashi, Y. et al. Pathogenic variant burden in the ExAC database: an empirical approach to evaluating population data for clinical variant interpretation. Genome Med. 9, 13 (2017).

  19. Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).

  20. Steinberg, J., Honti, F., Meader, S. & Webber, C. Haploinsufficiency predictions without study bias. Nucleic Acids Res. 43, e101–e101 (2015).

  21. Yates, B. et al. Genenames.org: the HGNC and VGNC resources in 2017. Nucleic Acids Res. 45, D619–D625 (2017).

  22. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  23. Virtanen, P. et al. SciPy 1.0–fundamental algorithms for scientific computing in Python. Preprint at https://arxiv.org/abs/1907.10121 (2019).

Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council (EP/N509565/1). M.T. was funded by the Newlife Foundation (grant no. 14–15/15). We also acknowledge the support of the Manchester Academic Health Science Centre. We thank gnomAD and the groups that provided exome and genome variant data to this resource. A full list of contributing groups can be found at https://gnomad.broadinstitute.org/about.

Author information

Contributions

N.A., M.T. and A.B. conceived and designed the research. N.A. executed the analysis. N.A. and M.T. performed the primary writing. M.T. and A.B. supervised all aspects of the research, and reviewed and edited the manuscript.

Corresponding author

Correspondence to May Tassabehji.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of top genes ranked by GeVIR with a list of genes sorted by number of CCRs at 95th or greater percentile (7,000 genes).

a, Cumulative number of genes associated exclusively with autosomal dominant (AD) diseases in OMIM (n = 770). b, Cumulative number of genes associated exclusively with autosomal recessive (AR) diseases in OMIM (n = 1,553). c, AD class F1 score calculated at each cumulative subset of top genes, considering AD genes as true positives and AR genes as false positives. d, Protein length of the canonical transcript for genes in each block of one thousand ranked genes (that is, 1–1,000, 1,001–2,000 … 6,001–7,000). Standard notations are used for elements of the box plot (that is, upper or lower hinges: 75th or 25th percentiles; inner segment: median; notches: calculated using a Gaussian-based asymptotic approximation; upper or lower whiskers: extension of the hinges to the largest or smallest value within 1.5 times the interquartile range). Outliers are not shown because genes with extreme protein length in the data set (for example, TTN, ~36,000 amino acids) would distort the figure. Correlation between protein length and gene rank was measured with Spearman’s rank correlation coefficient.
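The cumulative F1 calculation described for panel c can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `cumulative_f1` and the toy gene identifiers are hypothetical, and the sketch assumes that recall is measured against the full set of AD genes in the reference list and that genes labelled neither AD nor AR do not affect the score.

```python
def cumulative_f1(ranked_genes, ad_genes, ar_genes):
    """F1 score for the AD class after each position in a ranked gene list.

    AD genes count as true positives and AR genes as false positives.
    Assumption: false negatives are the AD genes not yet reached in the
    ranking, so recall is TP divided by the total number of AD genes.
    """
    total_ad = len(ad_genes)
    true_pos = 0
    false_pos = 0
    scores = []
    for gene in ranked_genes:
        if gene in ad_genes:
            true_pos += 1
        elif gene in ar_genes:
            false_pos += 1
        called = true_pos + false_pos
        precision = true_pos / called if called else 0.0
        recall = true_pos / total_ad if total_ad else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


# Toy ranking with made-up gene identifiers.
ranked = ["G1", "G2", "G3", "G4"]
ad = {"G1", "G3"}
ar = {"G2"}
f1_curve = cumulative_f1(ranked, ad, ar)
print([round(score, 3) for score in f1_curve])
```

In the toy ranking, the score dips when the AR gene `G2` enters the subset and rises again once the second AD gene is reached; the unlabelled gene `G4` leaves it unchanged.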

Supplementary information

Supplementary Information

Supplementary Figures 1–5, Note and Tables 1, 3, 4, 5, 7 and 8

Reporting Summary

Supplementary Data 1

Supplementary Tables 2 and 6

About this article

Cite this article

Abramovs, N., Brass, A. & Tassabehji, M. GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes. Nat Genet 52, 35–39 (2020). https://doi.org/10.1038/s41588-019-0560-2

  • DOI: https://doi.org/10.1038/s41588-019-0560-2
