  • Technical Report

Fast and accurate genomic analyses using genome graphs

Abstract

The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
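The idea of aligning against a graph rather than a single haplotype can be illustrated with a toy example. The sketch below is not the Graph Genome Aligner (its source code is not public); it is a minimal illustration under stated assumptions. The node names, sequences, and the helper functions `spell`, `all_paths` and `exact_hits` are all hypothetical, and exhaustive path enumeration is used only because the graph is tiny. A production aligner would rely on an index instead.

```python
# Hypothetical toy genome graph: each node holds a sequence segment and
# each edge lists the nodes that may follow it. One SNP site (n2 vs n3)
# and one deletion (edge n4 -> n6 skipping n5) are encoded as branches.
NODES = {
    "n1": "ACGT",
    "n2": "A",    # reference allele at the SNP site
    "n3": "G",    # alternate allele at the SNP site
    "n4": "TTC",
    "n5": "GGA",  # segment absent from the deletion allele
    "n6": "CAT",
}
EDGES = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": ["n5", "n6"],
    "n5": ["n6"],
    "n6": [],
}
# The linear reference is just one particular path through the graph.
REFERENCE_PATH = ["n1", "n2", "n4", "n5", "n6"]


def spell(path):
    """Concatenate the node sequences along a path."""
    return "".join(NODES[node] for node in path)


def all_paths(start, end):
    """Yield every path from start to end (depth-first, toy-sized graphs only)."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            yield path
            continue
        for successor in EDGES[path[-1]]:
            stack.append(path + [successor])


def exact_hits(read):
    """Return the paths whose spelled sequence contains the read exactly."""
    return [path for path in all_paths("n1", "n6") if read in spell(path)]


# A read carrying both the alternate SNP allele and the deletion.
read = "GTGTTCCA"
print(read in spell(REFERENCE_PATH))  # no exact match on the linear reference
print(exact_hits(read))               # matches the path through n3 that skips n5
```

In this toy setting the read has no exact match on the linear reference but matches one graph path without mismatches or gaps, which is the intuition behind the improved mapping sensitivity described in the abstract.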

Fig. 1: The graph genome architecture and computational resource requirements.
Fig. 2: Read mapping accuracy using BWA-MEM and graph genomes.
Fig. 3: Variant calling benchmarking between Graph Genome Pipeline and BWA-GATK.
Fig. 4: SV genotyping using graph genomes.
Fig. 5: The effect of iteratively augmented graph genomes on variant calling.
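Variant calling benchmarks such as the one named in Fig. 3 compare a call set against a truth set. The sketch below shows how recall and precision are commonly computed from such a comparison. It is an illustration only: the function name `benchmark`, the tuple encoding of variants and the toy call sets are assumptions, not data or code from the paper, and exact tuple matching ignores the variant representation differences that real benchmarking tools must resolve.

```python
def benchmark(calls, truth):
    """Compare called variants with a truth set by exact match.

    Variants are (chrom, pos, ref, alt) tuples; this encoding is an
    assumption made for the sketch.
    """
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "recall": recall, "precision": precision}


# Made-up toy data, not results from the paper.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "CT", "C"),
         ("chr1", 400, "G", "T"), ("chr1", 900, "T", "TAA")]
linear_calls = [("chr1", 100, "A", "G"), ("chr1", 400, "G", "T"),
                ("chr1", 555, "C", "A")]
graph_calls = linear_calls + [("chr1", 250, "CT", "C")]

print(benchmark(linear_calls, truth))
print(benchmark(graph_calls, truth))
```

In the toy data the second call set recovers one additional true variant without adding false positives, so recall rises while the false positive count stays the same.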

Code availability

Graph Genome Pipeline is freely available to academic users for non-commercial use. Compiled standalone tools and the License of Use can be accessed at https://www.sevenbridges.com/graph-genome-academic-release/. The source code of the Graph Genome Pipeline tools is not publicly available.

Data availability

Raw sequencing data for the 150 Coriell WGS samples (Figs. 1, 4 and 5) can be accessed from the European Nucleotide Archive under accession PRJEB20654. Raw sequencing data for the Qatari samples (Fig. 5) can be found under NCBI SRA accessions SRP060765, SRP061943 and SRP061463. Genome in a Bottle data (Fig. 3) are available from the NCBI FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data). The Sanger sequencing traces have been deposited in the European Nucleotide Archive under accession PRJEB26700.

Acknowledgements

We are grateful to the members of the GA4GH Data Workgroup, Benchmarking, and Reference variation initiatives, in particular J. Zook, for insightful discussions and ideas. M. Huvet helped refine the treatment and presentation of ideas behind trio-based benchmarking. Research reported in this publication was supported in part by the UK Department of Health grant SBRI Genomics Competition: Enabling Technologies for Genomic Sequence Data Analysis and Interpretation, administered by Genomics England.

Author information

Contributions

G.R., V.S., W.-P.L., J.S., A.D., B.P., A.J., and I.S. developed the algorithms and implemented the tools for graph genome alignment. J.B. and I.J.J. developed the algorithms and implemented the tools for variant calling. K.G. implemented the simulation experiments and carried out the benchmarks based on simulated data. V.A., J.N., A.J., and G.R. devised and carried out the experiments with population-specific genome graphs. P.K. and B.C.T. developed the computational tools used for benchmarks based on related genomes, and A.J. and M.C.S. carried out the experiments. S.-G.J., G.D., L.L., and P.K. created the genome graph containing the structural variants and designed and carried out all related experiments. M.P. created the machine learning–based variant filters and carried out the related experiments. I.G. and M.K. aided in interpreting the results and worked on the manuscript. Y.L., G.R., and D.K. prepared the manuscript with input from all other authors. D.K. conceived and oversaw the project with assistance from A.J., A.L.S., and M.K.

Corresponding author

Correspondence to Deniz Kural.

Ethics declarations

Competing interests

G.R., J.S., V.A., J.N., M.C.S., G.D., L.L., B.C.T., B.P., I.S., I.G., P.K., A.L.S., Y.L., M.P., W.-P.L., M.K., and D.K. were employed by Seven Bridges Genomics Inc. during the development of the described tools. V.S., J.B., I.J.J., K.G., S.-G.J., A.D., and A.J. are current employees of Seven Bridges Genomics Inc. G.R., V.S., J.S., J.B., I.J.J., V.A., K.G., S.-G.J., L.L., I.S., P.K., A.L.S., Y.L., A.J., M.P. and D.K. hold shares, stock options or restricted stock units in Seven Bridges Genomics Inc. D.K. is co-inventor on 12 patents (issued: 14/016,833; 14/811,057; 15/196,345; 14/041,850; 14/157,759; 14/157,979; published: 14/517,406; 14/517,419; 14/517,513; 14/517,451; 14/744,536; 14/798,686). V.S. is inventor on four patents (pending: 15/061,235; 14/885,192; 15/598,404; 15/597,464). W.-P.L. is co-inventor on three patents (published: 14/994,385; pending: 15/353,105; 15/007,874). B.P., I.S. and A.J. are co-inventors on one patent (pending: 15/452,963). I.J.J. is inventor on one patent (62/630,347). The applicant for the patents is Seven Bridges Genomics Inc.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Note, Supplementary Tables 1–4, 6 and 17, and Supplementary Figures 1–22

Reporting Summary

Supplementary Table 5

Computational resource requirements of the Graph Genome Aligner and BWA-MEM

Supplementary Table 7

Precision FDA Truth Contest results vs. Graph Genome Pipeline

Supplementary Table 8

Variant calling benchmarking against genotyping using SNP arrays

Supplementary Table 9

Variant calling benchmarking results from simulated data

Supplementary Table 10

Genome in a Bottle benchmarking results

Supplementary Table 11

Trio benchmarking: inferred variant calling precision and recall

Supplementary Table 12

Trio benchmarking: Mendelian compliance rates with variant representation resolution

Supplementary Table 13

Trio benchmarking: Mendelian compliance rate without variant representation resolution

Supplementary Table 14

Validation of potentially false false positive variants in GiaB samples

Supplementary Table 15

Structural variation coordinates used in SV genotyping benchmarking experiments

Supplementary Table 16

Variant calling using global graph augmented by population-specific variants
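Supplementary Tables 12 and 13 report Mendelian compliance rates from trio benchmarking. The preview does not define the metric, so the sketch below uses a common definition as an assumption: the fraction of sites at which the child's genotype can be formed by taking one allele from each parent. It assumes diploid autosomal sites with no missing genotypes and no de novo mutations; the function names and the toy genotypes are hypothetical.

```python
def is_mendelian(child, mother, father):
    """True if the child's two alleles can come one from each parent.

    Genotypes are 2-tuples of allele indices (0 = reference allele),
    an encoding assumed for this sketch.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def compliance_rate(trios):
    """Fraction of (child, mother, father) genotype triples that are consistent."""
    trios = list(trios)
    if not trios:
        return 0.0
    consistent = sum(is_mendelian(c, m, f) for c, m, f in trios)
    return consistent / len(trios)


# Made-up genotypes at four sites, ordered (child, mother, father).
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # violation: mother cannot supply allele 1
    ((0, 1), (0, 1), (0, 0)),  # consistent
]
print(compliance_rate(sites))
```

A violation can reflect a calling error in any of the three samples or, rarely, a true de novo mutation, so the rate is an indirect accuracy measure rather than a direct comparison with a truth set.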

About this article

Cite this article

Rakocevic, G., Semenyuk, V., Lee, WP. et al. Fast and accurate genomic analyses using genome graphs. Nat Genet 51, 354–362 (2019). https://doi.org/10.1038/s41588-018-0316-4

  • DOI: https://doi.org/10.1038/s41588-018-0316-4
