Identifying genetic variants underlying phenotypic variation in plants without complete genomes

Voichek, Yoav; Weigel, Detlef

doi:10.1038/s41588-020-0612-7

Analysis
Published: 13 April 2020

Nature Genetics volume 52, pages 534–540 (2020)

Abstract

Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the type of genetic variants detected in GWAS to include major deletions, insertions and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we reanalyzed 2,000 traits in Arabidopsis thaliana, tomato and maize populations. Associations identified with k-mers recapitulate those found with SNPs, but with stronger statistical support. Importantly, we discovered new associations with structural variants and with regions missing from reference genomes. Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows the detection of a wider range of genetic variants responsible for phenotypic variation.

**Fig. 1: Flowering time associations in *A. thaliana*.**

**Fig. 2: SNP- and k-mer-based GWAS on 1,582 *A. thaliana* phenotypes.**

**Fig. 3: Specific cases of k-mer superiority.**

**Fig. 4: SNP- and k-mer-based GWAS in maize.**

**Fig. 5: SNP- and k-mer-based GWAS in tomato.**

**Fig. 6: Kinship matrix estimates with k-mers.**

Data availability

A list of all phenotypes and top SNPs or k-mers passing their corresponding thresholds can be found at https://zenodo.org/record/3701176#.XmX9u5NKhhE.

The authors declare that all other data supporting the findings of this study are available within the Supplementary Information files.

Code availability

Code is available at https://github.com/voichek/kmersGWAS.

References

1. Saxena, R. K., Edwards, D. & Varshney, R. K. Structural variations in plant genomes. Brief. Funct. Genomics 13, 296–307 (2014).
2. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).
3. Salzberg, S. L., Pertea, M., Fahrner, J. A. & Sobreira, N. DIAMUND: direct comparison of genomes to detect mutations. Hum. Mutat. 35, 283–288 (2014).
4. Zielezinski, A. et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 20, 144 (2019).
5. Lees, J. A. et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat. Commun. 7, 12797 (2016).
6. Sheppard, S. K. et al. Genome-wide association study identifies vitamin B₅ biosynthesis as a host specificity factor in Campylobacter. Proc. Natl Acad. Sci. USA 110, 11923–11927 (2013).
7. Lees, J. A. et al. Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. eLife 6, e26255 (2017).
8. Rahman, A., Hallgrímsdóttir, I., Eisen, M. & Pachter, L. Association mapping from sequencing reads using k-mers. eLife 7, e32920 (2018).
9. Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
10. Sun, S. et al. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nat. Genet. 50, 1289–1295 (2018).
11. Minio, A., Massonnet, M., Figueroa-Balderas, R., Castro, A. & Cantu, D. Diploid genome assembly of the wine grape Carménère. G3 9, 1331–1337 (2019).
12. Arora, S. et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 37, 139–143 (2019).
13. 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
14. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
15. Abney, M. Permutation testing in the presence of polygenic variation. Genet. Epidemiol. 39, 249–258 (2015).
16. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
17. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
18. Li, X. et al. Exploiting natural variation of secondary metabolism identifies a gene controlling the glycosylation diversity of dihydroxybenzoic acids in Arabidopsis thaliana. Genetics 198, 1267–1276 (2014).
19. Vetter, M., Karasov, T. L. & Bergelson, J. Differentiation between MAMP-triggered defenses in Arabidopsis thaliana. PLoS Genet. 12, e1006068 (2016).
20. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
21. Morrison, G. D. & Linder, C. R. Association mapping of germination traits in Arabidopsis thaliana under light and nutrient treatments: searching for G×E effects. G3 (Bethesda) 4, 1465–1478 (2014).
22. Zapata, L. et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc. Natl Acad. Sci. USA 113, E4052–E4060 (2016).
23. Bryant, F. M., Hughes, D., Hassani-Pak, K. & Eastmond, P. J. Basic LEUCINE ZIPPER TRANSCRIPTION FACTOR67 transactivates DELAY OF GERMINATION1 to establish primary seed dormancy in Arabidopsis. Plant Cell 31, 1276–1288 (2019).
24. Schnable, P. S. et al. The B73 maize genome: complexity, diversity and dynamics. Science 326, 1112–1115 (2009).
25. Gore, M. A. et al. A first-generation haplotype map of maize. Science 326, 1115–1117 (2009).
26. Springer, N. M. et al. The maize W22 genome provides a foundation for functional genomics and transposon biology. Nat. Genet. 50, 1282–1288 (2018).
27. Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Res. 34, D752–D757 (2006).
28. Bukowski, R. et al. Construction of the third-generation Zea mays haplotype map. Gigascience 7, 1–12 (2018).
29. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).
30. Lin, T. et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 46, 1220–1226 (2014).
31. Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355, 391–394 (2017).
32. Zhu, G. et al. Rewiring of the fruit metabolome in tomato breeding. Cell 172, 249–261 (2018).
33. Tikunov, Y. M. et al. Non-smoky glycosyltransferase1 prevents the release of smoky aroma from tomato fruit. Plant Cell 25, 3067–3078 (2013).
34. Sohn, J.-I. & Nam, J.-W. The present and future of de novo whole-genome assembly. Brief. Bioinform. 19, 23–40 (2018).
35. Pascoe, B. et al. Enhanced biofilm formation and multi-host transmission evolve from divergent genetic backgrounds in Campylobacter jejuni. Environ. Microbiol. 17, 4779–4789 (2015).
36. Schneeberger, K. et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10, R98 (2009).
37. Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
38. Seren, Ü. et al. AraPheno: a public database for Arabidopsis thaliana phenotypes. Nucleic Acids Res. 45, D1054–D1059 (2017).
39. Fordyce, R. F. et al. Digital imaging combined with genome-wide association mapping links loci to plant–pathogen interaction traits. Plant Physiol. 178, 1406–1422 (2018).
40. Chan, E. K. F., Rowe, H. C., Hansen, B. G. & Kliebenstein, D. J. The complex genetic architecture of the metabolome. PLoS Genet. 6, e1001198 (2010).
41. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
42. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
43. Cheng, C.-Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89, 789–804 (2017).
44. Portwood, J. L. 2nd et al. MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Res. 47, D1146–D1154 (2019).
45. Kokot, M., Dlugosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
46. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
47. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
48. Devlin, B. & Risch, N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322 (1995).
49. Togninalli, M. et al. The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Res. 46, D1150–D1156 (2018).
50. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
51. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
52. Robinson, J. T. et al. Integrative Genomics Viewer. Nat. Biotechnol. 29, 24–26 (2011).
53. Ristova, D., Giovannetti, M., Metesch, K. & Busch, W. Natural genetic variation shapes root system responses to phytohormones in Arabidopsis. Plant J. 96, 468–481 (2018).
54. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

Acknowledgements

We thank the many colleagues who have shared A. thaliana phenotypic information with us. We thank in particular G. Zhu and S. Huang for help with tomato genotypic and phenotypic information and C. Romay, R. Bukowski and E. Buckler for help with maize genotypes and phenotypes. We thank K. Swarts, F. Rabanal and I. Soifer for fruitful discussions. This work was supported by the DFG ERA-CAPS 1001 Genomes Plus and the Max Planck Society.

Author information

Authors and Affiliations

Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany
Yoav Voichek & Detlef Weigel

Contributions

Y.V. and D.W. designed the study and wrote the paper. Y.V. conducted the analysis.

Corresponding author

Correspondence to Detlef Weigel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples of well-characterized structural variants tagged by k-mers.

Examples of how k-mers tag well-characterized structural variants²² between the Col-0 reference genome and the fully assembled Ler genome. The two genomes were used to count 31-bp k-mers, and all k-mers unique to one genome and appearing only once in it were plotted in the indicated regions. The positions of the translocation (a), inversion (b) and insertions/deletions (c,d) are indicated by vertical lines and red shading. The k-mers unique to Col-0 are plotted in red in the upper panels, and those unique to Ler in blue in the lower panels. The five positions tagged by k-mers inside the translocation presented in a are either SNPs or 1-bp indels.
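The tagging idea in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy sequences are hypothetical, only the forward strand is considered, and k is set to 4 instead of 31 so the result can be checked by eye.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer on the forward strand of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_single_copy_kmers(genome_a: str, genome_b: str, k: int = 31):
    """Return (position, k-mer) pairs for k-mers that occur exactly once
    in genome_a and never in genome_b."""
    counts_a = kmer_counts(genome_a, k)
    counts_b = kmer_counts(genome_b, k)
    tagged = []
    for i in range(len(genome_a) - k + 1):
        kmer = genome_a[i:i + k]
        if counts_a[kmer] == 1 and kmer not in counts_b:
            tagged.append((i, kmer))
    return tagged


# Toy sequences (hypothetical): they differ by one substitution at index 6.
ref = "ACGTACCTGAAT"
alt = "ACGTACGTGAAT"
ref_tags = unique_single_copy_kmers(ref, alt, k=4)
alt_tags = unique_single_copy_kmers(alt, ref, k=4)
```

In this toy case every tagged position lies within k bases of the substitution. The k-mer ACGT is dropped on the `alt` side because the substitution creates a second copy of it, which mirrors the "appearing only once" condition in the caption.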

Extended Data Fig. 2 Genome-wide evaluation of k-mer potential to detect SVs in well-characterized genomes.

a, For every translocation or inversion previously identified²² between the Col-0 reference genome and the Ler genome, we evaluate whether it is tagged by 31-bp k-mers. Each translocation or inversion affects 4 edges between the translocated fragment and the neighbouring genomic regions (bottom panel). For every previously identified translocation or inversion, the number of edges (0-4) tagged by k-mers unique to one genome was counted. Only 1.1% of these SVs were not tagged by any k-mer unique to one genome (upper panel). b, For every edge tagged by k-mers, as described in a, we plot the number of k-mers unique to one genome that tagged it. The histogram is enriched for edges covered by the maximal number of k-mers, 31. c, Evaluating the potential of k-mers to tag long insertions/deletions between the well-characterized genomes of Col-0 and Ler²². In the genome with the apparent deletion, only the junction between the two fragments is tagged by unique k-mers, whereas in the genome with the apparent insertion, the entire insert is tagged (bottom panel). Only 0.4% of the previously characterized long insertions/deletions are not tagged by unique k-mers.

Extended Data Fig. 3 Pipeline for k-mer-based GWAS.

a, Creating the k-mer presence/absence table: each accession’s genomic DNA sequencing reads are cut into k-mers⁴⁵, and k-mers appearing fewer than two or three times in a sequencing library are filtered out. k-mers are further filtered to retain only those present in at least 5 accessions and found in both forward and reverse-complement form in at least 20% of the accessions in which they appear. All k-mer lists are combined into a k-mer presence/absence table. b, Genome-wide association on the full k-mer table using SNP-based software: the k-mer table is converted into PLINK binary format, which is used as input for SNP-based association mapping software¹⁴,⁴². c, GWA optimized for k-mers: k-mer presence/absence patterns are first associated with the phenotype and its permutations using a linear mixed model (LMM) to account for population structure¹⁶,¹⁷. This first step calculates an approximate score of the exact model. The best k-mers from this first step (for example, 100,000 k-mers) are passed to the second step, in which an exact P value is calculated¹⁴ for both the phenotype and its permutations. A permutation-based threshold is calculated, and all k-mers passing this threshold are checked for their rank in the scoring from the first step. If not all k-mer hits are in the top 50% of the initial scoring, the entire process is rerun from the beginning, passing more k-mers from the first to the second step. This last test is designed to confirm that the approximation in the first step does not remove truly associated k-mers.
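The filtering in panel a can be sketched in plain Python as follows. This is a simplified illustration, not the kmersGWAS code: all function and parameter names are hypothetical, reads are assumed to contain only A, C, G and T, the per-library count is assumed to be taken on canonical k-mers, and k is assumed to be odd so that no k-mer equals its own reverse complement.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def accession_kmers(reads, k, min_count=2):
    """Map each canonical k-mer of one library to True if both orientations
    were seen; k-mers seen fewer than min_count times are dropped."""
    counts = Counter()
    orientations = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            canon = canonical(kmer)
            counts[canon] += 1
            orientations.setdefault(canon, set()).add(kmer)
    return {c: len(orientations[c]) == 2
            for c, n in counts.items() if n >= min_count}


def presence_absence_table(reads_by_accession, k, min_count=2,
                           min_accessions=5, min_both_strand_fraction=0.2):
    """Return (accession names, {k-mer: 0/1 list}) after both filters."""
    per_acc = {name: accession_kmers(reads, k, min_count)
               for name, reads in reads_by_accession.items()}
    names = sorted(per_acc)
    all_kmers = set().union(*per_acc.values())
    table = {}
    for kmer in sorted(all_kmers):
        carriers = [n for n in names if kmer in per_acc[n]]
        both = sum(per_acc[n][kmer] for n in carriers)
        if (len(carriers) >= min_accessions
                and both >= min_both_strand_fraction * len(carriers)):
            table[kmer] = [int(kmer in per_acc[n]) for n in names]
    return names, table


# Toy input (hypothetical), with the accession cutoff lowered to 2.
toy = {"A": ["ACG", "CGT"], "B": ["ACG", "ACG"], "C": ["AAC"]}
names, table = presence_absence_table(toy, k=3, min_accessions=2)
```

Here the k-mer ACG survives because it is present in two accessions and seen in both orientations in one of them (50%, above the 20% cutoff), while AAC is dropped because it appears only once in its library.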

Extended Data Fig. 4 Allele counts for A. thaliana 1001G k-mers.

Histogram of k-mer allele counts: for every N = 1..1008, the number of k-mers that appeared in exactly N accessions is plotted.
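As a small illustration of how such an allele-count histogram could be derived from a presence/absence table, consider the sketch below. The table layout (a dictionary from k-mer to a 0/1 list, one entry per accession) and the function name are assumptions made for this example.

```python
from collections import Counter


def allele_count_histogram(table):
    """Return {N: number of k-mers present in exactly N accessions}.

    table maps each k-mer to a list of 0/1 values, one per accession.
    """
    counts = Counter(sum(row) for row in table.values())
    return dict(sorted(counts.items()))


# Toy table (hypothetical): three k-mers scored in three accessions.
toy_table = {"AAA": [1, 1, 0], "CCC": [1, 0, 0], "GGG": [0, 1, 1]}
histogram = allele_count_histogram(toy_table)
```

For the toy table, one k-mer is present in a single accession and two k-mers are present in two accessions.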

Extended Data Fig. 5 Flowering time-genotype associations in A. thaliana identified with k-mers.

a, LD between SNPs associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. b, LD between k-mers associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. c, Same as Fig. 1d with only SNPs presented. d, Same as Fig. 1d with only k-mers presented, also showing k-mers below the threshold. e, Manhattan plot of SNP and k-mer associations with flowering time at 10 °C, as in Fig. 1d, for k-mers of length 25 bp.

Extended Data Fig. 6 Comparison of SNP- and k-mer-GWAS on phenotypes from 104 studies on A. thaliana accessions.

a, Histogram of the number of identified k-mers versus identified SNPs (in log₂) for A. thaliana phenotypes. Only the 458 phenotypes with both variant types identified were used. b, Histogram of the difference between k-mer and SNP thresholds for all A. thaliana phenotypes. Thresholds were -log₁₀ transformed.
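The thresholds in panel b are on a -log₁₀ scale; for k-mers they come from the permutation step described in Extended Data Fig. 3c. One way such a threshold could be derived is sketched below. The rule used here (take the best P value from each permuted phenotype and pick the value exceeded in a fraction alpha of permutations, with alpha = 0.05) is an assumption for illustration, and the function names are hypothetical.

```python
import math


def permutation_threshold(best_pvalues, alpha=0.05):
    """Return a -log10 threshold from the best P value of each permutation.

    Assumed rule: rank the permutation maxima from strongest to weakest
    and take the one at rank round(alpha * number of permutations).
    """
    scores = sorted((-math.log10(p) for p in best_pvalues), reverse=True)
    rank = max(1, round(alpha * len(scores)))
    return scores[rank - 1]


def passing_hits(pvalues, threshold):
    """Keep variants whose -log10(P) reaches the threshold."""
    return {name: -math.log10(p)
            for name, p in pvalues.items()
            if -math.log10(p) >= threshold}


# Toy input (hypothetical): best P values from 20 permuted phenotypes.
best = [10.0 ** -i for i in range(1, 21)]
threshold = permutation_threshold(best, alpha=0.05)
hits = passing_hits({"kmer_1": 1e-22, "kmer_2": 1e-3}, threshold)
```

Because each permutation contributes only its strongest association, the resulting threshold accounts for the number of variants tested, which is one reason k-mer and SNP thresholds for the same phenotype can differ.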

Extended Data Fig. 7 Specific case studies in which k-mers are superior to SNPs.

a, Results from GWAS on measurements of lesions caused by the Botrytis cinerea UKRazz strain³⁹. An example of k-mers capturing a short variant better than SNPs: 19 k-mers and no SNPs were identified, with all k-mers in complete LD (top row). Sequence reads containing the k-mers mapped to chromosome 3, with a deletion of a single T nucleotide in a stretch of eight Ts, at position 72,017. Manual (middle) and 1001G project (bottom) calls are shown. In the 1001G calls, 57 of 61 accessions have missing values. b, Haplotypes around SNPs associated with xyloside concentrations are not correlated with this trait. All SNPs in positions 870,000 to 874,000 on chromosome 5 were hierarchically clustered (left panel; white marks missing values). The two identified SNPs are marked by arrows and a close-up of their state is shown (middle panel). Phenotypic values are colored according to the two SNPs: TG blue, TT red and CT green (right panel). c-e, Manhattan plots for: c, xyloside percentage; d, seedling growth inhibition by a flg22 variant; e, germination in darkness in low-nutrient conditions. f, Germination phenotype plotted for accessions with the top associated k-mer present or absent. Boxes cover the 25th to 75th percentiles, medians are marked by horizontal lines, and whiskers cover the full range of values.

Extended Data Fig. 8 Comparison of SNP- and k-mer-based GWAS in maize.

a, Histogram of k-mer allele counts for maize accessions. b, Histogram of the difference between threshold values of SNPs and k-mers for maize phenotypes. c, Histogram of the top SNP P value divided by the k-mer-defined threshold, in -log₁₀, for maize phenotypes. Plotted for phenotypes with only SNPs identified (upper panel) or for phenotypes with both SNPs and k-mers identified (lower panel). d, Histogram of the number of identified k-mers versus identified SNPs for maize phenotypes. e, Histogram of the difference between the top -log₁₀ P values of the two methods for maize phenotypes identified by both methods. Plotted as in Fig. 2g. f, Manhattan plot of associations with ear weight (environment 07A). Associated k-mers could not be located in the reference genome and are thus not presented.

Extended Data Fig. 9 Comparison of SNP- and k-mer-based GWAS in tomato.

a, Histogram of k-mer allele counts for tomato accessions. b, Histogram of the difference between threshold values of SNPs and k-mers for tomato phenotypes. c, Histogram of the top SNP P value divided by the k-mer-defined threshold, in -log₁₀, for tomato phenotypes. Plotted for phenotypes with only SNPs identified (upper panel) or for phenotypes with both SNPs and k-mers identified (lower panel). d, Histogram of the difference between the top -log₁₀ P values of the two methods for tomato phenotypes. e, Histogram of the number of identified k-mers versus identified SNPs for tomato phenotypes.

Extended Data Fig. 10 Kinship matrix calculation based on k-mers for tomato accessions.

Identification of pairs of tomato accessions for which relatedness as measured with k-mers is much lower than relatedness as measured with SNPs. For every pair among the 246 accessions, a black square is plotted if the difference in relatedness between SNPs and k-mers is larger than 0.15. Accessions are ordered by the number of black squares in their row/column. Red lines mark the 21 accessions with the most black squares, that is, those for which the k-mer/SNP difference in relatedness is larger than 0.15 for the most pairs.
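The flagging and ordering described in this caption can be sketched as follows. The matrices, the function name, the sign convention (SNP relatedness minus k-mer relatedness) and the descending sort are assumptions made for illustration; the caption itself only specifies the 0.15 cutoff and that accessions are ordered by their number of black squares.

```python
def discordant_pairs(kin_snp, kin_kmer, cutoff=0.15):
    """Flag accession pairs whose SNP-based relatedness exceeds the
    k-mer-based relatedness by more than cutoff.

    kin_snp and kin_kmer are square, symmetric lists of lists.
    Returns (flags, counts, order): the boolean matrix of flagged pairs,
    the number of flagged pairs per accession, and accession indices
    sorted from most to fewest flagged pairs.
    """
    n = len(kin_snp)
    flags = [[i != j and kin_snp[i][j] - kin_kmer[i][j] > cutoff
              for j in range(n)] for i in range(n)]
    counts = [sum(row) for row in flags]
    order = sorted(range(n), key=lambda i: counts[i], reverse=True)
    return flags, counts, order


# Toy kinship matrices (hypothetical values) for three accessions.
kin_snp = [[1.0, 0.5, 0.4], [0.5, 1.0, 0.3], [0.4, 0.3, 1.0]]
kin_kmer = [[1.0, 0.2, 0.35], [0.2, 1.0, 0.1], [0.35, 0.1, 1.0]]
flags, counts, order = discordant_pairs(kin_snp, kin_kmer)
```

In the toy data the second accession is involved in both flagged pairs, so it is listed first, analogous to the accessions marked by red lines in the figure.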

About this article

Cite this article

Voichek, Y., Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52, 534–540 (2020). https://doi.org/10.1038/s41588-020-0612-7

Received: 31 October 2019
Accepted: 10 March 2020
Published: 13 April 2020
Issue Date: May 2020
DOI: https://doi.org/10.1038/s41588-020-0612-7
