GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes

Abramovs, Nikita; Brass, Andrew; Tassabehji, May

doi:10.1038/s41588-019-0560-2

Letter
Published: 23 December 2019

Nature Genetics volume 52, pages 35–39 (2020)

Abstract

With large-scale population sequencing projects gathering pace, there is a need for strategies that advance disease gene prioritization^1,2. Metrics that provide information about a gene and its ability to tolerate protein-altering variation can aid in clinical interpretation of human genomes and can advance disease gene discovery^1,2,3,4. Previously reported methods analyzed the total variant load in a gene^1,2,3,4, but did not analyze the distribution pattern of variants within a gene. Using data from 138,632 exome and genome sequences^2, we developed gene variation intolerance rank (GeVIR), a continuous gene-level metric for 19,361 genes that is able to prioritize both dominant and recessive Mendelian disease genes^5, that outperforms missense constraint metrics^3 and that is comparable, but complementary, to loss-of-function (LOF) constraint metrics^2. GeVIR is also able to prioritize short genes, for which LOF constraint cannot be estimated with confidence^2. The majority of the most intolerant genes identified here have no defined phenotype and are candidates for severe dominant disorders.
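To illustrate the distinction drawn above between total variant load and the distribution of variants within a gene, the following minimal Python sketch compares two hypothetical proteins that carry the same number of protein-altering variants but differ in how those variants are spaced. It is not the GeVIR algorithm (for that, see the repository listed under Code availability); the function name `variant_free_gaps`, the variant positions and the protein length are all invented for illustration.

```python
def variant_free_gaps(variant_positions, protein_length):
    """Return the lengths of stretches with no observed variant.

    Positions are 1-based amino acid indices. Hypothetical helper for
    illustration only; not taken from the GeVIR code base.
    """
    positions = sorted(set(variant_positions))
    # Sentinels at 0 and protein_length + 1 capture gaps at both termini.
    bounds = [0] + positions + [protein_length + 1]
    return [b - a - 1 for a, b in zip(bounds, bounds[1:])]


# Two hypothetical 100-residue proteins, each with five variant positions.
evenly_spread = [10, 30, 50, 70, 90]
clustered = [2, 4, 6, 8, 10]

gaps_even = variant_free_gaps(evenly_spread, 100)
gaps_clustered = variant_free_gaps(clustered, 100)

# Same total load, different distribution.
print(len(evenly_spread), len(clustered))    # 5 5
print(max(gaps_even), max(gaps_clustered))   # 19 90
```

A metric based only on variant counts would treat these two proteins identically, whereas the longest variant-free stretch (19 versus 90 residues) separates them.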

**Fig. 1: Correlations among length of VIRs, location of pathogenic variants, and evolutionary conservation.**

**Fig. 3: Comparison of GeVIR gene ranking with gnomAD constraint metrics on 19,361 genes.**

**Fig. 4: Comparison of GeVIR, LOEUF and VIRLOF performance on the most variant intolerant genes.**

Data availability

The GERP++ file can be found at http://mendel.stanford.edu/SidowLab/downloads/gerp/hg19.GERP_scores.tar.gz. The ClinVar files can be found at ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz and ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt. The CCR files can be found at https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.autosomes.v2.20180420.bed.gz and https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.xchrom.v2.20180420.bed.gz. The OMIM genemap2.txt file can be found, after registration, at https://omim.org/downloads. The gnomAD gene constraint metric file can be found at https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz. The gnomAD exomes variants and coverage files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/exomes/gnomad.exomes.r2.0.2.sites.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/coverage/combined_tars/gnomad.exomes.r2.0.2.coverage.all.tar, respectively. The gnomAD genomes variants files can be found at https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf.bgz and https://storage.googleapis.com/gnomad-public/release/2.0.2/vcf/genomes/gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf.bgz. The gnomAD genes, transcripts and exons files can be found at http://broadinstitute.org/~konradk/exac_browser/exac_browser.tar.gz. The Ensembl coding and peptide sequences from build GRCh37/hg19 can be found at https://grch37.ensembl.org/biomart/martview (data set: Human genes (GRCh37.p13); Attributes → Sequences → ‘Coding sequence’ and ‘Peptide’). The homozygous LOF tolerant genes (that is, nulls) can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/homozygous_lof_tolerant_twohit.tsv. 
The cell essential and non-essential genes from CRISPR–Cas experiments can be found at https://github.com/macarthur-lab/gene_lists/blob/master/lists/CEGv2_subset_universe.tsv and https://github.com/macarthur-lab/gene_lists/blob/master/lists/NEGv1_subset_universe.tsv, respectively. The mouse heterozygous lethal genes can be obtained from http://www.mousemine.org/ by querying the database with the following search terms: path = ‘OntologyAnnotation.ontologyTerm’ type = ‘MPTerm’; path = ‘OntologyAnnotation.subject’ type = ‘SequenceFeature’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject’ type = ‘Genotype’; path = ‘OntologyAnnotation.evidence.baseAnnotations.subject.zygosity’ op = ‘ = ’ value = ‘ht’ code = ‘B’; path = ‘OntologyAnnotation.ontologyTerm.name’ op = ‘CONTAINS’ value = ‘lethal’. The human–mouse ortholog mapping file can be found at http://www.informatics.jax.org/downloads/reports/HMD_HumanPhenotype.rpt. The HGNC approved gene symbols can be found at https://www.genenames.org/download/statistics-and-files.

Code availability

Code for calculating GeVIR/VIRLOF scores, data analysis and figures can be found at https://github.com/gevirank/gevir. Computed GeVIR/VIRLOF scores are available in Supplementary Table 2.

References

1. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
2. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at bioRxiv https://doi.org/10.1101/531210 (2019).
3. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
4. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
5. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
6. Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
7. Gussow, A. B., Petrovski, S., Wang, Q., Allen, A. S. & Goldstein, D. B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17, 9 (2016).
8. Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
9. Sivley, M. Comprehensive analysis of constraint on the spatial distribution of missense variants in human protein structures. Am. J. Hum. Genet. 102, 415–426 (2018).
10. Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2018).
11. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
12. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
13. Motenko, H., Neuhauser, S. B., O’Keefe, M. & Richardson, J. E. MouseMine: a new data warehouse for MGI. Mamm. Genome 26, 325–330 (2015).
14. Eppig, J. T. et al. The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease. Nucleic Acids Res. 43, D726–D736 (2015).
15. Hart, T. et al. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 7, 2719–2727 (2017).
16. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
17. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).
18. Kobayashi, Y. et al. Pathogenic variant burden in the ExAC database: an empirical approach to evaluating population data for clinical variant interpretation. Genome Med. 9, 13 (2017).
19. Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
20. Steinberg, J., Honti, F., Meader, S. & Webber, C. Haploinsufficiency predictions without study bias. Nucleic Acids Res. 43, e101–e101 (2015).
21. Yates, B. et al. Genenames.org: the HGNC and VGNC resources in 2017. Nucleic Acids Res. 45, D619–D625 (2017).
22. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
23. Virtanen, P. et al. SciPy 1.0–fundamental algorithms for scientific computing in Python. Preprint at https://arxiv.org/abs/1907.10121 (2019).

Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council (EP/N509565/1). M.T. was funded by the Newlife Foundation (grant no. 14–15/15). We also acknowledge the support of the Manchester Academic Health Science Centre. We thank gnomAD and the groups that provided exome and genome variant data to this resource. A full list of contributing groups can be found at https://gnomad.broadinstitute.org/about.

Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Manchester, UK
Nikita Abramovs & Andrew Brass
School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
Nikita Abramovs & May Tassabehji
School of Health Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
Andrew Brass
Manchester Centre for Genomic Medicine, St. Mary’s Hospital, Manchester Academic Health Sciences Centre (MAHSC), Manchester, UK
May Tassabehji

Contributions

N.A., M.T. and A.B. conceived and designed the research. N.A. executed the analysis. N.A. and M.T. performed the primary writing. M.T. and A.B. supervised all aspects of the research, and reviewed and edited the manuscript.

Corresponding author

Correspondence to May Tassabehji.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparison of top genes ranked by GeVIR with a list of genes sorted by number of CCRs at 95th or greater percentile (7,000 genes).

a, Cumulative number of genes associated exclusively with AD diseases in OMIM (n = 770). b, Cumulative number of genes associated exclusively with AR diseases in OMIM (n = 1,553). c, AD class F1 score calculated at each subset of top genes (cumulative), considering AD genes as true positives and AR genes as false positives. d, Gene canonical transcript protein length in each thousand ranked genes (that is, 1–1,000, 1,001–2,000 … 6,001–7,000). Standard notations are used for elements of the box plot (that is, upper or lower hinges: 75th or 25th percentiles; inner segment: median; notches are calculated using a Gaussian-based asymptotic approximation; and upper or lower whiskers: extension of the hinges to the largest or smallest value at most 1.5 times the interquartile range from the hinge). Outliers are not shown due to the presence of genes with extreme protein length (for example, TTN, ~36,000 amino acids) in the data set, which would distort the figure. Correlation between protein length and gene rank was measured with Spearman’s rank correlation coefficient.
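The cumulative AD class F1 score in panel c can be sketched as follows. This is a minimal Python illustration, assuming that true positives are AD genes inside the top-k subset, false positives are AR genes inside it, and false negatives are AD genes ranked below k. The false-negative definition is an assumption made here (the legend defines only true and false positives), and the function name and gene labels are hypothetical.

```python
def cumulative_ad_f1(ranked_labels):
    """F1 of the AD class at each top-k cut-off of a ranked gene list.

    ranked_labels: iterable of "AD", "AR" or None (gene in neither set),
    ordered from most to least intolerant.
    """
    labels = list(ranked_labels)
    total_ad = sum(1 for label in labels if label == "AD")
    tp = fp = 0
    scores = []
    for label in labels:
        if label == "AD":
            tp += 1
        elif label == "AR":
            fp += 1
        # AD genes not yet reached are counted as false negatives (assumption).
        fn = total_ad - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical ranking of six genes.
ranking = ["AD", None, "AD", "AR", "AR", "AD"]
scores = cumulative_ad_f1(ranking)
print([round(s, 2) for s in scores])  # [0.5, 0.5, 0.8, 0.67, 0.57, 0.75]
```

Genes in neither set leave the score unchanged, so the curve moves only when an AD or AR gene enters the top-k subset.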

Supplementary information

Supplementary Information

Supplementary Figures 1–5, Note and Tables 1, 3, 4, 5, 7 and 8

Reporting Summary

Supplementary Data 1

Supplementary Tables 2 and 6

About this article

Cite this article

Abramovs, N., Brass, A. & Tassabehji, M. GeVIR is a continuous gene-level metric that uses variant distribution patterns to prioritize disease candidate genes. Nat Genet 52, 35–39 (2020). https://doi.org/10.1038/s41588-019-0560-2

Received: 29 May 2019
Accepted: 22 November 2019
Published: 23 December 2019
Issue Date: January 2020
DOI: https://doi.org/10.1038/s41588-019-0560-2
