Introduction

Pairwise linkage disequilibrium (LD), the statistical association between alleles at two different loci, has applications in genotype imputation (Wen and Stephens 2010), genome-wide association studies (Zhu and Stephens 2018), genomic prediction (Wientjes et al. 2013), population genetics (Slatkin 2008), and many other tasks (Sved and Hill 2018). LD is often estimated from next-generation sequencing data, where the genotypes and haplotypes are not known with certainty (Gerard et al. 2018). Thus, researchers typically use estimated genotypes, such as posterior mean genotypes (Fox et al. 2019), to estimate LD. However, this can bias LD estimates toward zero, implying that loci are less dependent than they are in reality (Gerard 2021).

This bias is particularly strong in polyploids, organisms with more than two complete sets of chromosomes. Unlike diploids, polyploids exhibit multiple levels of heterozygosity. For example, at a biallelic locus with alleles A and a, a heterozygous diploid would have genotype Aa, whereas a heterozygous tetraploid might have genotype Aaaa, AAaa, or AAAa. These multiple levels of heterozygosity make polyploid dosage more difficult to estimate, and exacerbate the impact of data-specific quirks, such as allelic bias and overdispersion, on estimation (Gerard et al. 2018). Together, these factors increase genotype uncertainty in polyploid organisms, which in turn strengthens the attenuation of LD estimates. Therefore, in Gerard (2021) we derived maximum likelihood estimators (MLEs) of LD that have lower bias and are consistent. This approach was particularly helpful for polyploids.

Unfortunately, the MLE approach is prohibitively slow. Researchers typically calculate pairwise LD at genome-wide scales, and the MLE approach takes on the order of a tenth of a second for each pair of SNPs. Thus, for many genome-wide applications, which contain millions of SNPs and hence a number of SNP pairs that grows quadratically, LD estimation using the MLE approach would take years of computation time. This is not conducive to large-scale applications.

Here, we derive scalable approaches to estimate LD that account for genotype uncertainty (“Materials and methods”). Our methods use only the first two moments of the marginal posterior genotype distribution for each individual at each locus, which are often provided or easily obtainable from many genotyping programs. We calculate sample moments from these posterior moments, and use these to multiplicatively inflate naive LD estimates. We show, through simulations (“Simulations”) and real data (“LD estimates for Solanum tuberosum”), that our estimates can reduce attenuation bias and improve LD estimates when genotypes are uncertain. All calculations have computational complexities that are linear in the sample size, and so these estimates are scalable to genome-wide applications.

Materials and methods

In this section, we will define moment-based estimators of the LD coefficient Δ (Lewontin and Kojima 1960), the standardized LD coefficient \({{\Delta }}^{\prime}\) (Lewontin 1964), and the Pearson correlation ρ (Hill and Robertson 1968). There are two types of LD measures considered in the literature, “haplotypic” (called “gametic” in the diploid literature) and “composite.” Haplotypic LD measures are more familiar, representing the association between loci that reside on the same haplotype (Hedrick et al. 1978), whereas composite LD measures aggregate the associations between alleles on all haplotypes between two loci (Cockerham and Weir 1977; Weir 1979). As obtaining estimates of haplotypic LD from unphased genotypes typically requires additional assumptions (such as Hardy–Weinberg equilibrium), we will only consider estimating composite measures of LD. Advantageously, these composite measures are appropriate LD measures for generic autopolyploid, allopolyploid, and segmental allopolyploid populations, even in the absence of Hardy–Weinberg equilibrium (Gerard 2021). We will also only consider biallelic loci, where the genotype for each individual is the dosage (from 0 to the ploidy) of one of the two alleles.

We will now review these composite measures of LD at biallelic loci. Let G = (GA, GB) be the random variable of genotypes of a K-ploid individual at loci A and B, where each Gj is the dosage (from 0 to K) of an allele at locus j ∈ {A, B}. The genotypes of a sample of n individuals, G1, G2, …, Gn, are assumed to be independent and to have the same distribution as G. The composite measure of correlation between loci A and B is just the Pearson correlation,

$$\rho ={{{\rm{cor}}}}({G}_{A},{G}_{B}).$$
(1)

The composite LD coefficient is the covariance divided by the ploidy K,

$${{\Delta }}=\frac{1}{K}{{{\rm{cov}}}}({G}_{A},{G}_{B}).$$
(2)

We divide by the ploidy in Eq. (2) so that, for a population in Hardy–Weinberg equilibrium, the composite LD coefficient equals the well-known haplotypic LD coefficient. The possible values of Δ are bounded, with the size of this bound depending on the allele frequencies at each locus, making it difficult to compare LD across loci. To create a measure of LD that is less dependent on allele frequencies, we have the composite standardized LD coefficient,

$${{\Delta }}^{\prime} ={{\Delta }}/{{{\Delta }}}_{m},\,{{\mbox{where}}}\,$$
(3)
$${{{\Delta }}}_{m}=\left\{\begin{array}{ll}\min \{{{{\rm{E}}}}[{G}_{A}]{{{\rm{E}}}}[{G}_{B}],(K-{{{\rm{E}}}}[{G}_{A}])(K-{{{\rm{E}}}}[{G}_{B}])\}/{K}^{2}&\,{{\mbox{if}}}\,{{\Delta }} \,<\, 0,\,{{\mbox{and}}}\,\\ \min \{{{{\rm{E}}}}[{G}_{A}](K-{{{\rm{E}}}}[{G}_{B}]),(K-{{{\rm{E}}}}[{G}_{A}]){{{\rm{E}}}}[{G}_{B}]\}/{K}^{2}&\,{{\mbox{if}}}\,{{\Delta }} \,>\, 0.\end{array}\right.$$
(4)

One can show that \({{\Delta }}^{\prime}\) is free to vary between −K and K, but is constrained between −1 and 1 for populations in Hardy–Weinberg equilibrium. For further details of these measures see Gerard (2021).
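To make Eqs. (1)–(4) concrete, the following Python sketch computes sample analogues of ρ, Δ, and Δ′ when the dosages are known without error. The function name `composite_ld` and the toy data are illustrative assumptions, and treating Δ = 0 with the Δ > 0 branch is our own convention, since Eq. (4) leaves that case undefined.

```python
import math

def composite_ld(ga, gb, ploidy):
    """Sample analogues of Eqs. (1)-(4) for known dosages ga and gb."""
    n = len(ga)
    mean_a = sum(ga) / n
    mean_b = sum(gb) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ga, gb)) / (n - 1)
    var_a = sum((a - mean_a) ** 2 for a in ga) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in gb) / (n - 1)

    rho = cov / math.sqrt(var_a * var_b)   # Eq. (1)
    delta = cov / ploidy                   # Eq. (2)

    # Eq. (4); the delta == 0 case is sent to the second branch (our assumption).
    if delta < 0:
        delta_m = min(mean_a * mean_b,
                      (ploidy - mean_a) * (ploidy - mean_b)) / ploidy ** 2
    else:
        delta_m = min(mean_a * (ploidy - mean_b),
                      (ploidy - mean_a) * mean_b) / ploidy ** 2
    delta_prime = delta / delta_m          # Eq. (3)
    return rho, delta, delta_prime

# Toy tetraploid sample in which both loci always carry the same dosage.
rho, delta, delta_prime = composite_ld([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], ploidy=4)
print(rho, delta, delta_prime)
```

In this toy sample the dosages are uniformly spread over 0 to 4, which is far from Hardy–Weinberg equilibrium, and Δ′ evaluates to 2.5: inside [−K, K] but outside [−1, 1].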

We wanted to create LD estimators of Eqs. (1)–(3) that account for genotype uncertainty while also being agnostic to the genotyping technology, e.g., microarrays (Fan et al. 2003), next-generation sequencing (Baird et al. 2008; Elshire et al. 2011), or mass spectrometry (Oeth et al. 2009). One way to do this is to use only the genotype posterior distributions for each individual, which are often provided by different genotyping software that analyze data from different genotyping technologies (e.g. Clark et al. 2019; Gerard and Ferrão 2019; Gerard et al. 2018; Serang et al. 2012; Voorrips et al. 2011; Zych et al. 2019). We will thus assume that the user provides the posterior means and variances for the genotypes for each individual at two loci, which can be easily obtained from the full posterior distributions for each individual. An advantage of this approach is its modularity. That is, as genotyping platforms improve and become better calibrated, the approach below will still be usable without having to create a tailor-made method to estimate LD directly from these new genotyping platforms.
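As a minimal sketch of how the two required inputs follow from a full posterior distribution, the Python snippet below computes the posterior mean and variance of the dosage for one individual at one locus. The function name `posterior_moments` and the example probabilities are hypothetical; the input is assumed to be a list whose k-th entry is the posterior probability of dosage k.

```python
def posterior_moments(post):
    """Posterior mean and variance of the dosage.

    post[k] is the posterior probability of dosage k, for k = 0, ..., K.
    """
    mean = sum(k * p for k, p in enumerate(post))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(post))
    return mean, var

# A tetraploid individual whose dosage is probably 2, with some uncertainty.
post_mean, post_var = posterior_moments([0.0, 0.1, 0.8, 0.1, 0.0])
print(post_mean, post_var)
```

Here the posterior mean is 2 and the posterior variance is 0.2; a genotype known with certainty would have a posterior variance of 0.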

To define our estimators of LD, let XiA and XiB be the posterior mean genotypes at loci A and B for individual i ∈ {1, …, n}. Let YiA and YiB be the posterior variances of genotypes at loci A and B for individual i. Our estimators are based entirely on the following sample moments of these posterior moments, which may be calculated in linear time in the sample size, n.

$${u}_{xA}:=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{X}_{iA},\,\,{u}_{xB}:=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{X}_{iB},$$
(5)
$${v}_{xA}:=\frac{1}{n-1}\mathop{\sum }\limits_{i=1}^{n}{({X}_{iA}-{u}_{xA})}^{2},\,\,{v}_{xB}:=\frac{1}{n-1}\mathop{\sum }\limits_{i=1}^{n}{({X}_{iB}-{u}_{xB})}^{2},$$
(6)
$${c}_{x}:=\frac{1}{n-1}\mathop{\sum }\limits_{i=1}^{n}({X}_{iA}-{u}_{xA})({X}_{iB}-{u}_{xB}),$$
(7)
$${u}_{yA}:=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{Y}_{iA},\,{{\mbox{and}}}\,\,{u}_{yB}:=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{Y}_{iB}.$$
(8)
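The sample moments in Eqs. (5)–(8) can be sketched in a few lines of Python. The function name `sample_moments`, the dictionary keys, and the toy inputs are illustrative assumptions; each quantity is a single pass over the n individuals, which is where the linear-time claim comes from.

```python
def sample_moments(xa, xb, ya, yb):
    """Sample moments of posterior means (xa, xb) and posterior variances (ya, yb)."""
    n = len(xa)
    u_xa = sum(xa) / n                                         # Eq. (5)
    u_xb = sum(xb) / n
    v_xa = sum((x - u_xa) ** 2 for x in xa) / (n - 1)          # Eq. (6)
    v_xb = sum((x - u_xb) ** 2 for x in xb) / (n - 1)
    c_x = sum((a - u_xa) * (b - u_xb)
              for a, b in zip(xa, xb)) / (n - 1)               # Eq. (7)
    u_ya = sum(ya) / n                                         # Eq. (8)
    u_yb = sum(yb) / n
    return {"u_xA": u_xa, "u_xB": u_xb, "v_xA": v_xa, "v_xB": v_xb,
            "c_x": c_x, "u_yA": u_ya, "u_yB": u_yb}

# Toy posterior means and variances for three individuals.
moments = sample_moments(xa=[1.0, 2.0, 3.0], xb=[1.0, 3.0, 2.0],
                         ya=[0.2, 0.4, 0.6], yb=[0.1, 0.1, 0.4])
print(moments)
```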

Our LD estimators for a K-ploid species are derived in Section S1 of the Supplementary Material. The estimated LD coefficient is as follows:

$$\hat{{{\Delta }}}:=\left(\frac{{u}_{yA}+{v}_{xA}}{{v}_{xA}}\right)\left(\frac{{u}_{yB}+{v}_{xB}}{{v}_{xB}}\right)\left(\frac{{c}_{x}}{K}\right).$$
(9)

The estimated Pearson correlation is as follows:

$$\hat{\rho }:=\sqrt{\frac{{u}_{yA}+{v}_{xA}}{{v}_{xA}}}\sqrt{\frac{{u}_{yB}+{v}_{xB}}{{v}_{xB}}}\frac{{c}_{x}}{\sqrt{{v}_{xA}{v}_{xB}}}.$$
(10)

Note that \({c}_{x}/\sqrt{{v}_{xA}{v}_{xB}}\) is the sample Pearson correlation between posterior mean genotypes. The estimated standardized LD coefficient is as follows:

$$\hat{{{\Delta }}}^{\prime} :=\hat{{{\Delta }}}/{\hat{{{\Delta }}}}_{m},\,{{\mbox{where}}}\,$$
(11)
$${\hat{{{\Delta }}}}_{m}:=\left\{\begin{array}{ll}\min \big\{{u}_{xA}{u}_{xB},(K-{u}_{xA})(K-{u}_{xB})\big\}/{K}^{2}&\,{{\mbox{if}}}\;{c}_{x} \,<\, 0,{{\mbox{and}}}\,\\ \min \big\{{u}_{xA}(K-{u}_{xB}),(K-{u}_{xA}){u}_{xB}\big\}/{K}^{2}&\,{{\mbox{if}}}\;{c}_{x} \,>\, 0.\end{array}\right.$$
(12)
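A direct transcription of Eqs. (9)–(12) into Python might look like the sketch below. It takes the sample moments as arguments and omits the shrinkage, thresholding, and truncation refinements described in the Supplementary Material, so it is not a substitute for the ldsep implementation. The function name `ld_estimates` is hypothetical, and sending c_x = 0 to the second branch of Eq. (12) is our own assumption.

```python
import math

def ld_estimates(u_xa, u_xb, v_xa, v_xb, c_x, u_ya, u_yb, ploidy):
    """Moment-based estimates of Delta, rho, and Delta' from sample moments."""
    ratio_a = (u_ya + v_xa) / v_xa
    ratio_b = (u_yb + v_xb) / v_xb

    delta = ratio_a * ratio_b * c_x / ploidy                             # Eq. (9)
    rho = math.sqrt(ratio_a) * math.sqrt(ratio_b) * c_x / math.sqrt(v_xa * v_xb)  # Eq. (10)

    # Eq. (12); c_x == 0 falls into the second branch (our assumption).
    if c_x < 0:
        delta_m = min(u_xa * u_xb,
                      (ploidy - u_xa) * (ploidy - u_xb)) / ploidy ** 2
    else:
        delta_m = min(u_xa * (ploidy - u_xb),
                      (ploidy - u_xa) * u_xb) / ploidy ** 2
    delta_prime = delta / delta_m                                        # Eq. (11)
    return {"delta": delta, "rho": rho, "delta_prime": delta_prime}

# Toy moments for a tetraploid: mean dosage 2 at both loci, unit variances.
est = ld_estimates(u_xa=2.0, u_xb=2.0, v_xa=1.0, v_xb=1.0, c_x=0.5,
                   u_ya=0.4, u_yb=0.2, ploidy=4)
print(est)
```

With these toy moments, the sample correlation between posterior means is 0.5, whereas the moment-based estimate of ρ is roughly 0.65.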

We can compare our new estimators to those researchers typically use in practice. Since the population LD parameters (1)–(3) are population moments of the individual genotypes, researchers typically set XiA and XiB as estimates of GiA and GiB, and then use the sample moments of the XiA’s and XiB’s to estimate these population moments. That is,

$${\hat{\rho }}^{(naive)}:=\frac{{c}_{x}}{\sqrt{{v}_{xA}{v}_{xB}}},$$
(13)
$${\hat{{{\Delta }}}}^{(naive)}:=\frac{1}{K}{c}_{x},\,{{\mbox{and}}}\,$$
(14)
$$\hat{{{\Delta }}}{^{\prime} }^{(naive)}:=\frac{{\hat{{{\Delta }}}}^{(naive)}}{{\hat{{{\Delta }}}}_{m}}.$$
(15)

Comparing Eqs. (13)–(15) to Eqs. (9)–(11), we see that our new estimators take the naive estimators most researchers use in practice and inflate these by a multiplicative effect. Such multiplicative effects are sometimes called “reliability ratios” in the measurement error models literature (Fuller 2009).
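The multiplicative structure can be written out explicitly. Using the shorthand \({r}_{j}:=({u}_{yj}+{v}_{xj})/{v}_{xj}\) for j ∈ {A, B} (our notation, introduced only for this illustration), Eqs. (9)–(15) give

$$\hat{{{\Delta }}}={r}_{A}{r}_{B}\,{\hat{{{\Delta }}}}^{(naive)},\quad \hat{{{\Delta }}}^{\prime} ={r}_{A}{r}_{B}\,\hat{{{\Delta }}}{^{\prime} }^{(naive)},\quad \hat{\rho }=\sqrt{{r}_{A}{r}_{B}}\,{\hat{\rho }}^{(naive)}.$$

Because posterior variances are non-negative, \({r}_{j}=1+{u}_{yj}/{v}_{xj}\ge 1\), so each corrected estimate is at least as large in magnitude as its naive counterpart, with equality when every posterior variance is zero. As a worked example with made-up values, if \({u}_{yA}/{v}_{xA}=0.4\) and \({u}_{yB}/{v}_{xB}=0.2\), then \({r}_{A}{r}_{B}=1.4\times 1.2=1.68\), so the naive LD coefficient is inflated by 68% and the naive correlation by a factor of \(\sqrt{1.68}\approx 1.30\).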

Standard errors are important for hypothesis testing (Brown 1975), read-depth suggestions (Maruki and Lynch 2014), and shrinkage (Dey and Stephens 2018). Because estimators (9)–(11) are functions of sample moments, deriving their standard errors can be accomplished by appealing to the central limit theorem, followed by an application of the delta method (Section S2 of the Supplementary Material).

Section S3 of the Supplementary Material contains practical considerations for improving our estimates of LD. We apply hierarchical shrinkage (Stephens 2016) on the log of the reliability ratios to improve estimation performance (Section S3.1). As we have observed unstable behavior when SNPs are mostly monoallelic, we apply a thresholding strategy to mitigate the effects of unusually large reliability ratios (Section S3.2). We also truncate LD estimates when sampling variability causes estimates (9)–(11) to lie outside their theoretical boundaries (Section S3.3). Section S4 of the Supplementary Material contains some theoretical discussions on why our methods perform as well as the MLE in the simulations of “Comparison to the MLE and the standard approach”.

All methods are implemented in the ldsep package on the Comprehensive R Archive Network (https://cran.r-project.org/package=ldsep).

Results

Simulations

Comparison to the MLE and the standard approach

We compared our moment-based estimators (9)–(11) to those of the MLE of Gerard (2021) as well as the naive estimators that calculate the sample covariance and sample correlation between posterior mean genotypes at two loci (13)–(15). In each replication, we generated genotypes for n ∈ {10, 100, 1000} individuals with ploidy K ∈ {2, 4, 6, 8} under Hardy–Weinberg equilibrium at two loci with major allele frequencies (pA, pB) ∈ {(0.5, 0.5), (0.5, 0.75), (0.9, 0.9)} and Pearson correlation ρ ∈ {0, 0.5, 0.9}. We then used updog’s rflexdog() function (Gerard and Ferrão 2019; Gerard et al. 2018) to generate read-counts at read-depths of either 10 or 100, a sequencing error rate of 0.01, an overdispersion value of 0.01, and no allele bias. Updog was then used to generate genotype likelihoods and genotype posterior distributions for each individual at each SNP. These were then fed into ldsep to obtain the MLE, our new moment-based estimator, and the naive estimator. Simulations were replicated 200 times for each unique combination of simulation parameters.
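One way to generate pairs of dosages with a target correlation under Hardy–Weinberg equilibrium is to draw K independent two-locus haplotypes per individual, as in the Python sketch below. This is an illustration under our own assumptions and not necessarily the procedure used for the simulations above; the function name `simulate_dosages` and its arguments are hypothetical, and the read-count step performed by rflexdog() is not reproduced.

```python
import math
import random

def simulate_dosages(n, ploidy, p_a, p_b, rho, seed=1):
    """Draw n dosage pairs (G_A, G_B) by sampling `ploidy` independent haplotypes."""
    # Haplotypic LD coefficient implied by the target correlation.
    d = rho * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    # Frequencies of haplotypes AB, Ab, aB, ab.
    freqs = [p_a * p_b + d,
             p_a * (1 - p_b) - d,
             (1 - p_a) * p_b - d,
             (1 - p_a) * (1 - p_b) + d]
    if min(freqs) < 0:
        raise ValueError("rho is not attainable at these allele frequencies")
    haplotypes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        draws = rng.choices(haplotypes, weights=freqs, k=ploidy)
        # Dosage at each locus is the number of haplotypes carrying the allele.
        sample.append((sum(h[0] for h in draws), sum(h[1] for h in draws)))
    return sample

dosages = simulate_dosages(n=2000, ploidy=4, p_a=0.5, p_b=0.5, rho=0.9)
print(dosages[:5])
```

Because the K haplotypes are drawn independently, the covariance and variances of the dosages are K times those of a single haplotype, so the correlation between dosages equals the haplotypic correlation.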

The accuracy of estimating ρ² when pA = pB = 0.5 at a read-depth of 10 is presented in Fig. 1. The results for other scenarios are similar and may be found in Figs. S5–S21 of the Supplementary Material. We see that the moment-based estimator and the MLE perform comparably, even for small read-depth and sample size. The naive estimator has a strong attenuation bias toward zero. This bias is particularly prominent for higher ploidy levels. For example, for an octoploid species where the true ρ² is 0.81, the naive estimator appears to converge to a ρ² estimate of around 0.25. This bias does not disappear with increasing sample size. Estimated standard errors are reasonably well-behaved when the sample size is moderate to large (n = 100 or 1000) but can be unstable for very small sample sizes (n = 10) (Figs. S1 and S2 of the Supplementary Material). This is not unexpected as the standard errors rely on asymptotic approximations (Section S2).

Fig. 1: Estimate of ρ² (y-axis) for the maximum likelihood estimator (Gerard 2021) (MLE), our new moment-based estimator (Eq. (10)) (MoM), and the naive squared sample correlation coefficient between posterior mean genotypes (Eq. (13)) (Naive).

The x-axis indexes the sample size, the row-facets index the ploidy, and the column-facets index the true ρ², which is also presented by the horizontal dashed red line. These simulations were performed using a read-depth of 10, and major allele frequencies of 0.5 at each locus. The naive estimator presents a strong attenuation bias toward 0, particularly for higher ploidy regimes.

Additional simulation results, exploring our estimators when applied to rare variants, are presented in Section S5 of the Supplementary Material. The conclusions of that section are the same as here: the naive approach performs better at complete linkage equilibrium due to its attenuation bias, but performs worse at larger ploidies and larger levels of LD. However, we note that LD between rare variants is, in general, difficult to estimate.

The effect of using different genotyping strategies

Our new methods rely on accurate genotyping priors, which can be estimated adaptively with empirical Bayes approaches when sufficiently many samples are available. We therefore wished to study the effects of using either a fixed prior or a different genotyping platform. To do this, we generated posterior genotype probabilities under four scenarios: (i) the empirical Bayes approach of estimating the prior implemented by updog (Gerard and Ferrão 2019; Gerard et al. 2018), (ii) the empirical Bayes approach of estimating the prior implemented by polyRAD (Clark et al. 2019), (iii) a Bayesian approach assuming an unrealistic uniform prior on the genotypes, as implemented by updog, and (iv) a Bayesian approach assuming an unrealistic “horseshoe-like” prior on the genotypes that puts most mass on genotypes 0 and K, as implemented by updog. Specifically, for the “horseshoe-like” prior, the prior probability of a dosage of 0 or K was set to 0.45 each and the prior probability of dosages 1, …, (K − 1) was set to 0.1/(K − 1) each.
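The two fixed priors in scenarios (iii) and (iv) are simple enough to write down directly. The Python sketch below builds them as lists indexed by dosage; the function names are our own and are not part of updog.

```python
def uniform_prior(ploidy):
    """Equal prior probability on each dosage 0, ..., K."""
    return [1.0 / (ploidy + 1)] * (ploidy + 1)

def horseshoe_like_prior(ploidy):
    """0.45 on dosages 0 and K; 0.1 / (K - 1) on each of dosages 1, ..., K - 1."""
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2")
    inner = 0.1 / (ploidy - 1)
    return [0.45] + [inner] * (ploidy - 1) + [0.45]

# Octoploid case (K = 8), as in the simulations of this section.
print(uniform_prior(8))
print(horseshoe_like_prior(8))
```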

We ran simulations under the parameter settings of “Comparison to the MLE and the standard approach” where genotyping uncertainty had the greatest effect on LD estimation: higher ploidy species (K = 8) with pA = pB = 0.5 and a Pearson correlation of ρ = 0.9. We simulated n ∈ {10, 100, 1000, 10000} individuals, with a sequencing depth of 5, 10, or 100. As in “Comparison to the MLE and the standard approach”, we generated genotypes and read-counts using the updog software at a sequencing error rate of 0.01, an overdispersion parameter of 0.01, and no allele bias. We then used the above four procedures to generate genotype posterior probabilities. These were fed into ldsep to obtain estimates of ρ. We replicated each simulation setting 200 times.

The results are presented in Fig. 2. There, we find that for larger sequencing depths (e.g., ≈100×), one can essentially use a uniform prior and normalize the genotype likelihoods to be posterior probabilities. The genotype posteriors using this simple approach are close enough to those using adaptive approaches to provide decent LD estimates. However, for smaller read-depths, using a fixed prior has a deleterious effect. In such cases, one should use an adaptive genotyping approach that can consistently estimate the prior for larger sample sizes, even at lower read-depths. Many approaches that accomplish this exist, but for our analyses we found that two designed specifically for sequencing data work well in practice: updog (Gerard and Ferrão 2019; Gerard et al. 2018) and polyRAD (Clark et al. 2019). For non-sequencing data, there exist adaptive methods as well (Serang et al. 2012; Voorrips et al. 2011; Zych et al. 2019).
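To illustrate what "normalizing the genotype likelihoods" amounts to, the Python sketch below computes genotype posteriors from read counts under a fixed prior. It assumes a plain binomial read model with a symmetric sequencing error rate and no overdispersion or allele bias, which is a simplification of the model used by updog; the function name `genotype_posterior` and its arguments are hypothetical.

```python
import math

def genotype_posterior(ref_reads, depth, ploidy, prior=None, seq_error=0.01):
    """Posterior over dosages 0, ..., K from read counts (simplified binomial model)."""
    if prior is None:
        # Uniform prior: the posterior is just the normalized likelihood.
        prior = [1.0 / (ploidy + 1)] * (ploidy + 1)
    weights = []
    for k in range(ploidy + 1):
        p = k / ploidy
        # Probability that a read shows the counted allele, allowing for errors.
        xi = p * (1 - seq_error) + (1 - p) * seq_error
        lik = (math.comb(depth, ref_reads)
               * xi ** ref_reads * (1 - xi) ** (depth - ref_reads))
        weights.append(prior[k] * lik)
    total = sum(weights)
    return [w / total for w in weights]

# Tetraploid with 5 of 10 reads carrying the counted allele, uniform prior.
post = genotype_posterior(ref_reads=5, depth=10, ploidy=4)
print(post)
```

At this depth, dosage 2 is the most probable genotype, but dosages 1 and 3 retain appreciable posterior mass, which is the genotype uncertainty that the moment-based estimators are designed to absorb.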

Fig. 2: Estimates of ρ using Eq. (10) (y-axis) when the true ρ is 0.9 (red dashed line) for different sample sizes (x-axis), different read-depths (facets) and different methods for obtaining the genotype posterior probabilities.

The updog software (Gerard and Ferrão 2019; Gerard et al. 2018) was used either with an empirical Bayes approach to estimate the prior (“updog”), a fixed uniform prior (“uniform”) or a fixed unrealistic “horseshoe-like” prior (“horseshoe”). The polyRAD software (Clark et al. 2019) was also used to obtain posterior genotype probabilities (“polyRAD”).

LD estimates for Solanum tuberosum

We evaluated our methods on the autotetraploid potato (Solanum tuberosum, 2n = 4x = 48) genotyping-by-sequencing data from Uitdewilligen et al. (2013). We used updog (Gerard and Ferrão 2019; Gerard et al. 2018) to obtain the posterior moments for each individual’s genotype at each SNP on a single super scaffold (PGSC0003DMB000000192). To remove monoallelic SNPs, we filtered out SNPs with allele frequencies either >0.95 or <0.05, and filtered out SNPs with a variance of posterior means <0.05. This resulted in 2108 SNPs. We then estimated the squared correlation between each pair of SNPs using either the naive approach of calculating the sample Pearson correlation between posterior means, or using our new moment-based approach (Eq. (10)).

Our estimators are scalable. On a 1.9 GHz quad-core PC running Linux with 32 GB of memory, it took a total of 1.9 s to estimate all pairwise correlations using our new moment-based approach, which is a small increase over the 0.7 s it took to estimate all pairwise correlations using the naive approach. In Gerard (2021), we found that the MLE approach took about 0.1 s for each pair of SNPs in tetraploid individuals. Extrapolating this to all pairs among 2108 SNPs would indicate that the MLE approach would take about 2.5 days of computation time to calculate all pairwise LD estimates on this dataset.
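The extrapolation is simple arithmetic, reproduced in the short Python sketch below. The only inputs are the number of SNPs and the approximate per-pair MLE time quoted above; the variable names are our own.

```python
import math

n_snps = 2108
seconds_per_pair = 0.1  # approximate MLE time per SNP pair

n_pairs = math.comb(n_snps, 2)           # number of unordered SNP pairs
total_seconds = n_pairs * seconds_per_pair
total_days = total_seconds / (60 * 60 * 24)

print(n_pairs, total_days)
```

There are 2,220,778 unordered pairs among 2108 SNPs, which at 0.1 s per pair comes to a little over 2.5 days.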

The histogram of estimated reliability ratios is presented in Fig. 3. We see there that the reliability ratios of most SNPs increase their correlation estimates by <10%. However, a non-negligible portion have reliability ratios that increase the correlation estimates by more than 10%. To evaluate the LD estimates of high reliability ratio SNPs, we calculated the MLEs for ρ² between the twenty SNPs with the largest reliability ratios. A pairs plot for ρ² estimates between the three approaches is presented in Fig. 4. We see there that the MLE and new moment-based approach result in very similar ρ² estimates, while the naive approach using posterior means results in much smaller ρ² estimates.

Fig. 3: Reliability ratio estimates.

Histogram of estimated reliability ratios (S69) using the data from Uitdewilligen et al. (2013).

Fig. 4: Pairs plot for ρ² estimates between the twenty SNPs from Uitdewilligen et al. (2013) with the largest estimated reliability ratios when using either maximum likelihood estimation (MLE) (Gerard 2021), our new moment-based approach (Eq. (10)) (MoM), or the naive approach using just posterior means (Naive).

The dashed line is the y = x line. The MLE and the moment-based approach result in LD estimates that are much more similar to each other than either is to those of the naive approach.

Discussion

It has been known since at least the time of Spearman that the sample correlation coefficient (or, similarly, the ordinary least squares estimator in simple linear regression) is attenuated in the presence of uncertain variables (Spearman 1904). Methods to adjust for this bias include assuming prior knowledge on the measurement variances or the ratio of measurement variances (resulting from, for example, repeated measurements on the same individuals) (Degracie and Fuller 1972; Koopmans 1937), using instrumental variables (Carter and Fuller 1980), and using distributional assumptions (Pal 1980). See Fuller (2009) for a detailed introduction to this vast field. In order to accommodate different data types (Baird et al. 2008; Elshire et al. 2011; Fan et al. 2003; Oeth et al. 2009) and different genotyping programs (Clark et al. 2019; Gerard and Ferrão 2019; Gerard et al. 2018; Serang et al. 2012; Voorrips et al. 2011; Zych et al. 2019), and therefore increase the generality of our methods, we limited ourselves to using just posterior genotype probabilities to calculate LD. This excluded using these previous approaches. Our solution, then, was to use sample moments of marginal posterior moments which, to our knowledge, has never been proposed before.

It is natural to ask if our methods could be used to account for uncertain genotypes in genome-wide association studies. However, the moment-based techniques we used in this manuscript, when applied to simple linear regression with an additive effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the posterior mean as a covariate (Section S6 of the Supplementary Material). This supports using the posterior mean as a covariate in simple linear regression with an additive effects model. This is not to say, however, that using the posterior mean is also appropriate for more complicated models of gene action (Rosyara et al. 2016), or for nonlinear models (Carroll et al. 2006). Developing methods to account for genotype uncertainty in these more complicated settings is a research interest of the author, and a topic for future work.

We would not recommend using our methods to analyze diploid genomes. As seen in the simulations of “Comparison to the MLE and the standard approach,” diploid approaches that do not account for genotype uncertainty perform fine, even at low depths, because genotype uncertainty is much less of an issue for diploids. Furthermore, phasing approaches are well-established and highly effective in the diploid literature (Browning and Browning 2007; Li et al. 2010; Scheet and Stephens 2006; Swarts et al. 2014), and our approach would likely not perform comparatively well against haplotype-aware LD estimation methods that use such phased information. However, in polyploids, haplotype estimation is much harder to achieve (Cheng et al. 2021; Mollinari and Garcia 2019; Shen et al. 2016; Zheng et al. 2016), and so accurate approaches that leverage only read-based information between two SNPs are important.

In this article, we demonstrated that naive LD estimates are typically attenuated toward zero in higher ploidy organisms due to the effects of genotype uncertainty. To correct for this bias, we presented moment-based approaches that perform as well as principled likelihood-based approaches, but only take a fraction of the computation time. Possible future directions include (i) extending our methods to multiallelic loci and (ii) evaluating the downstream consequences of using our improved LD estimates, such as for effective population size estimation (Ragsdale and Gravel 2019; Waples 2006) or admixture estimation (Loh et al. 2013). Our moment-based estimators will allow researchers to use de-biased LD estimators for such tasks at scale.