Fast and accurate genomic analyses using genome graphs

Rakocevic, Goran; Semenyuk, Vladimir; Lee, Wan-Ping; Spencer, James; Browning, John; Johnson, Ivan J.; Arsenijevic, Vladan; Nadj, Jelena; Ghose, Kaushik; Suciu, Maria C.; Ji, Sun-Gou; Demir, Gülfem; Li, Lizao; Toptaş, Berke Ç.; Dolgoborodov, Alexey; Pollex, Björn; Spulber, Iosif; Glotova, Irina; Kómár, Péter; Stachyra, Andrew L.; Li, Yilong; Popovic, Milos; Källberg, Morten; Jain, Amit; Kural, Deniz

doi:10.1038/s41588-018-0316-4

Technical Report
Published: 14 January 2019

Goran Rakocevic^1,2,na1 (ORCID: orcid.org/0000-0002-3411-0764),
Vladimir Semenyuk^1,2,na1 (ORCID: orcid.org/0000-0002-7461-6153),
Wan-Ping Lee^1 (ORCID: orcid.org/0000-0002-5305-1181),
James Spencer^1,2,
John Browning^1,2 (ORCID: orcid.org/0000-0003-2997-8734),
Ivan J. Johnson^1,2,
Vladan Arsenijevic^1,2,
Jelena Nadj^1,2,
Kaushik Ghose^1,2 (ORCID: orcid.org/0000-0003-2933-1260),
Maria C. Suciu^1,2,
Sun-Gou Ji^1,2,
Gülfem Demir^1,2,
Lizao Li^1,2,
Berke Ç. Toptaş^1,2,
Alexey Dolgoborodov^1,
Björn Pollex^1,2,
Iosif Spulber^1,
Irina Glotova^1,2,
Péter Kómár^1,2,
Andrew L. Stachyra^1,2,
Yilong Li^1,2,
Milos Popovic^1,2,
Morten Källberg^1,
Amit Jain^1,2 &
Deniz Kural^1,2 (ORCID: orcid.org/0000-0002-8085-0771)

Nature Genetics volume 51, pages 354–362 (2019)

Abstract

The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
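
To make the idea of a graph reference concrete, here is a minimal sketch. It is not the Graph Genome Pipeline's implementation, whose source code is not public. The names `GenomeGraph`, `build_toy_graph` and `read_matches`, the node labels and the toy sequences are all assumptions made for illustration. The graph encodes one SNP and one deletion as alternative paths, and a read is accepted if it occurs exactly in any path. Enumerating every path only works for a graph this small; practical graph aligners rely on indexing instead.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeGraph:
    """Toy directed acyclic sequence graph (hypothetical, for illustration)."""
    seqs: dict = field(default_factory=dict)   # node id -> DNA sequence
    edges: dict = field(default_factory=dict)  # node id -> list of successor ids

    def add_node(self, node_id, seq):
        self.seqs[node_id] = seq
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def haplotypes(self, start, end):
        """Yield the sequence spelled by every path from start to end."""
        stack = [(start, self.seqs[start])]
        while stack:
            node, spelled = stack.pop()
            if node == end:
                yield spelled
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, spelled + self.seqs[nxt]))


def build_toy_graph():
    """Reference backbone ACGT-A-TT-CCC-GA with an A>G SNP and a CCC deletion."""
    g = GenomeGraph()
    for node_id, seq in [("left", "ACGT"), ("snp_ref", "A"), ("snp_alt", "G"),
                         ("mid", "TT"), ("del_ref", "CCC"), ("right", "GA")]:
        g.add_node(node_id, seq)
    # SNP: two parallel single-base nodes between "left" and "mid".
    g.add_edge("left", "snp_ref")
    g.add_edge("left", "snp_alt")
    g.add_edge("snp_ref", "mid")
    g.add_edge("snp_alt", "mid")
    # Deletion: an edge that bypasses the deleted bases.
    g.add_edge("mid", "del_ref")
    g.add_edge("del_ref", "right")
    g.add_edge("mid", "right")
    return g


def read_matches(graph, read, start="left", end="right"):
    """Return True if the read occurs exactly in at least one path."""
    return any(read in hap for hap in graph.haplotypes(start, end))


if __name__ == "__main__":
    graph = build_toy_graph()
    linear_reference = "ACGTATTCCCGA"  # the path through the reference alleles
    read = "GTGTTGA"                   # carries both the SNP and the deletion
    print("matches linear reference:", read in linear_reference)
    print("matches graph:", read_matches(graph, read))
```

In this toy case the read carrying both alternate alleles has no exact match on the single linear path but matches one of the four paths through the graph, which is the kind of mapping gain the abstract attributes to graph references.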

**Fig. 1: The graph genome architecture and computational resource requirements.**

**Fig. 2: Read mapping accuracy using BWA-MEM and graph genomes.**

**Fig. 3: Variant calling benchmarking between Graph Genome Pipeline and BWA-GATK.**

**Fig. 4: SV genotyping using graph genomes.**

**Fig. 5: The effect of iteratively augmented graph genomes on variant calling.**

Code availability

Graph Genome Pipeline is freely available to academic users for non-commercial use. Compiled standalone tools and the License of Use can be accessed at https://www.sevenbridges.com/graph-genome-academic-release/. The source code of the Graph Genome Pipeline tools is not publicly available.

Data availability

Raw sequencing data for the 150 Coriell WGS samples (Figs. 1, 4 and 5) can be accessed from the European Nucleotide Archive under accession PRJEB20654. Raw sequencing data for the Qatari samples used (Fig. 5) can be found under NCBI SRA accessions SRP060765, SRP061943 and SRP061463. Genome in a Bottle data (Fig. 3) are available from the NCBI FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data). The Sanger sequencing traces have been deposited in the European Nucleotide Archive under accession PRJEB26700.

Acknowledgements

We are grateful to the members of the GA4GH Data Workgroup, Benchmarking, and Reference variation initiatives, in particular J. Zook, for insightful discussions and ideas. M. Huvet helped refine the treatment and presentation of ideas behind trio-based benchmarking. Research reported in this publication was supported in part by the UK Department of Health grant SBRI Genomics Competition: Enabling Technologies for Genomic Sequence Data Analysis and Interpretation administered by Genomics England.

Author information

These authors contributed equally: Goran Rakocevic, Vladimir Semenyuk.

Authors and Affiliations

Seven Bridges Genomics, Inc, Cambridge, MA, USA
Goran Rakocevic, Vladimir Semenyuk, Wan-Ping Lee, James Spencer, John Browning, Ivan J. Johnson, Vladan Arsenijevic, Jelena Nadj, Kaushik Ghose, Maria C. Suciu, Sun-Gou Ji, Gülfem Demir, Lizao Li, Berke Ç. Toptaş, Alexey Dolgoborodov, Björn Pollex, Iosif Spulber, Irina Glotova, Péter Kómár, Andrew L. Stachyra, Yilong Li, Milos Popovic, Morten Källberg, Amit Jain & Deniz Kural
Totient, Inc, Cambridge, MA, USA
Goran Rakocevic, Vladimir Semenyuk, James Spencer, John Browning, Ivan J. Johnson, Vladan Arsenijevic, Jelena Nadj, Kaushik Ghose, Maria C. Suciu, Sun-Gou Ji, Gülfem Demir, Lizao Li, Berke Ç. Toptaş, Björn Pollex, Irina Glotova, Péter Kómár, Andrew L. Stachyra, Yilong Li, Milos Popovic, Amit Jain & Deniz Kural

Contributions

G.R., V.S., W.-P.L., J.S., A.D., B.P., A.J., and I.S. developed the algorithms and implemented the tools for graph genome alignment. J.B. and I.J.J. developed the algorithms and implemented the tools for variant calling. K.G. implemented the simulation experiments and carried out the benchmarks based on simulated data. V.A., J.N., A.J., and G.R. devised and carried out the experiments with population-specific genome graphs. P.K. and B.C.T. developed the computational tools used for benchmarks based on related genomes, and A.J. and M.C.S. carried out the experiments. S.-G.J., G.D., L.L., and P.K. created the genome graph containing the structural variants and designed and carried out all of the related experiments. M.P. created the machine learning–based variant filters and carried out the related experiments. I.G. and M.K. aided in interpreting the results and worked on the manuscript. Y.L., G.R., and D.K. prepared the manuscript with input from all other authors. D.K. conceived and oversaw the project with assistance from A.J., A.L.S., and M.K.

Corresponding author

Correspondence to Deniz Kural.

Ethics declarations

Competing interests

G.R., J.S., V.A., J.N., M.C.S., G.D., L.L., B.C.T., B.P., I.S., I.G., P.K., A.L.S., Y.L., M.P., W.-P.L., M.K., and D.K. were employed by Seven Bridges Genomics Inc. during the development of the described tools. V.S., J.B., I.J.J., K.G., S.-G.J., A.D., and A.J. are current employees of Seven Bridges Genomics Inc. G.R., V.S., J.S., J.B., I.J.J., V.A., K.G., S.-G.J., L.L., I.S., P.K., A.L.S., Y.L., A.J., M.P. and D.K. hold shares, stock options or restricted stock units in Seven Bridges Genomics Inc. D.K. is co-inventor on 12 patents (issued: 14/016,833; 14/811,057; 15/196,345; 14/041,850; 14/157,759; 14/157,979; published: 14/517,406; 14/517,419; 14/517,513; 14/517,451; 14/744,536; 14/798,686). V.S. is inventor on four patents (pending: 15/061,235; 14/885,192; 15/598,404; 15/597,464). W.-P.L. is co-inventor on three patents (published: 14/994,385; pending: 15/353,105; 15/007,874). B.P., I.S. and A.J. are co-inventors on one patent (pending: 15/452,963). I.J.J. is inventor on one patent (62/630,347). Applicant for patents is Seven Bridges Genomics Inc.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rakocevic, G., Semenyuk, V., Lee, WP. et al. Fast and accurate genomic analyses using genome graphs. Nat Genet 51, 354–362 (2019). https://doi.org/10.1038/s41588-018-0316-4

Received: 02 October 2017
Accepted: 14 November 2018
Published: 14 January 2019
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41588-018-0316-4
