Introduction

Association studies aim to identify variants associated with a disease, usually focusing on single-nucleotide polymorphisms (SNPs). In contrast to linkage analysis, they are able to detect variants with modest effects, which are implicated in complex diseases.1 In genome-wide association studies (GWAS), the variants considered are tag-SNPs, which capture most common SNPs of the genome through linkage disequilibrium (LD).2 However, an association between a SNP and a disease does not prove a causal link between the two: the association can result from a causal effect of the SNP itself or from LD with another, causal variant. Consequently, a significant association signal points to a set of correlated variants associated with the disease. Discriminating between the causal variant and variants in LD with it using case–control data was addressed by Udler et al.3 Under the hypothesis that the causal variant is among the genotyped SNPs, their method selects a minimal subset of potentially causal SNPs among the disease-associated variants. Family data convey more information than case–control data, and their use can improve the performance of this selection process; moreover, family data make it possible to address a limitation of the case–control discrimination method, namely its assumption that the causal variant is among the genotyped SNPs.

Here, we propose a method exploiting family data to select a minimal subset of associated SNPs and to make inference on putative causal variants in LD with those SNPs. This method uses an association framework that takes advantage of the linkage information available in affected sib-pair (ASP) data.4 The first step is to select a minimal subset of potentially causal SNPs among disease-associated variants; assuming that the causal SNP is among the genotyped SNPs, we compare the performance of this discrimination step with that of the method using case–control data.3 The second step of the method addresses the situation where the causal variant is not directly genotyped, but is in LD with genotyped SNPs. In this situation, case–control data do not convey enough information to distinguish a SNP in strong LD with the unobserved causal variant from a truly causal variant. With a sample of ASPs, the number of alleles shared identical-by-descent (IBD) by the two affected siblings makes it possible to make inference about the causal variant and to estimate its allelic frequencies, its allelic odds ratio (OR) and the LD between it and an observed SNP.

The advantage of the family method is illustrated on a sample of multiple sclerosis (MS) data. MS is a chronic autoimmune neurological disease of the central nervous system, which affects about 1–2 per 1000 people in Europe and North America.5 It manifests as demyelination of nerve fibers in the brain, spinal cord and optic nerve. The disease is progressive and may lead to loss of the ability to walk and eventually to death. It is a multifactorial disease with both environmental and genetic factors. Several associations with genes involved in the immune response have been found. In particular, associations have been reported with various genes in the human leukocyte antigen (HLA) region and with various non-HLA genes, for example, CD58, IL2RA and IL7R.6 Our data set consists of French ASP and control data from a previous study,7 collected through REFGENSEP. It comprises the genotypes of 26 SNPs in IL2RA on chromosome 10. Several studies have found an association between MS and this gene in Caucasian populations.8, 9, 10, 11, 12, 13 The method using case–control data selects a minimal subset of seven associated SNPs, which reduces to three SNPs when using family data. The second step shows that none of these SNPs is causal, and that the association signal is due to at least two different ungenotyped variants in the region.

Materials and methods

In the first two paragraphs below, we give an overview of the method proposed by Udler et al3 to identify causal variants in case–control association studies. The reader is referred to the original paper for details. In the third paragraph, we present the discrimination method that uses family data in two steps: selection of a subset of associated SNPs, and inference about a putative causal variant not genotyped in the sample. In the last paragraph, we describe the MS data set used to illustrate this approach.

Discrimination procedures

Consider n highly correlated variants in a genomic region. Under the hypothesis that one of these variants is causal, the aim is to select a subset of these variants that is likely to contain the causal variant. The method relies on Bayesian principles: if $L_i$, for $i=1,\dots,n$, is the likelihood that the ith variant is the causal variant, the variants selected are those of index i such that

$$\frac{\max_j L_j}{L_i} \le K$$

or, equivalently,

$$2\ln\Big(\max_j L_j\Big) - 2\ln L_i \le k$$

where $k=\ln(K^2)=2\ln K$. Following Udler et al,3 we take K=100 (ie, $k=2\ln 100\approx 9.21$), which is interpreted as excluding variants with odds against them greater than 100:1.
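The selection rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the rule keeps variant i when 2 ln L_max − 2 ln L_i ≤ k, with k = 2 ln K, and the function name and the log-likelihood values are hypothetical.

```python
import math

def select_candidate_variants(log_likelihoods, K=100.0):
    """Return indices of variants kept by the rule 2*(max lnL - lnL_i) <= k."""
    k = 2.0 * math.log(K)  # k = ln(K^2) = 2 ln K, about 9.21 for K = 100
    best = max(log_likelihoods)
    return [i for i, ll in enumerate(log_likelihoods)
            if 2.0 * (best - ll) <= k]

# Hypothetical log-likelihoods for four correlated variants (illustration only).
example_log_likelihoods = [-1000.0, -1003.0, -1004.7, -1010.0]
kept = select_candidate_variants(example_log_likelihoods)
print(kept)
```

With these made-up values, the third variant falls just outside the threshold (2 × 4.7 = 9.4 > 9.21), which shows how sharp the 100:1 cut-off is.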

For example, for two SNPs A and B, SNP B is not retained if $2\ln L_A-2\ln L_B>k$. As asymptotically

$$2\ln L_A - 2\ln L_B \approx Y_A^2 - Y_B^2$$

where $Y_A$ and $Y_B$ are association test statistics corresponding to the likelihoods $L_A$ and $L_B$, such as the Armitage trend test statistic14, 15, 16 for the case–control design, or a score statistic adapted to the family design,4 this is equivalent to not retaining B if

$$Y_A^2 - Y_B^2 > k \qquad (1)$$

Both of these association statistics Y are approximately standard normal, $Y\sim N(0,1)$, under the hypothesis of no association with the disease. Otherwise, Y is approximately normal with a nonzero mean: assuming that the causal variant for the disease is A, with ψ its per-allele OR,

$$Y_A \sim N(\eta, 1)$$

where η is a noncentrality parameter that depends on the sample sizes, on the allele frequencies and on ψ. This parameter also depends on the association test statistic used. Being in LD with variant A, variant B is also associated with the disease. Then,

$$Y_B \sim N(r\eta, 1)$$

where r is the correlation coefficient between the two variants, which measures the strength of the LD. The distribution of the discrimination statistic (equation 1) is approximately

(see details in section 1 of Supplementary Information).
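As a hedged complement, the first two moments of the discrimination statistic can be worked out directly under one assumption that is ours rather than the source's: that $(Y_A, Y_B)$ is jointly normal with means η and rη, unit variances and correlation r. This is a sketch of the calculation, not a transcription of the Supplementary Information.

```latex
\begin{aligned}
E\left(Y_A^2 - Y_B^2\right) &= (1+\eta^2) - (1 + r^2\eta^2) = \eta^2(1-r^2),\\
\operatorname{Var}\left(Y_A^2\right) &= 2 + 4\eta^2, \qquad
\operatorname{Var}\left(Y_B^2\right) = 2 + 4r^2\eta^2,\\
\operatorname{Cov}\left(Y_A^2, Y_B^2\right) &= 2r^2 + 4r^2\eta^2,\\
\operatorname{Var}\left(Y_A^2 - Y_B^2\right) &= 4(1-r^2)(1+\eta^2).
\end{aligned}
```

Under this assumption the mean grows like η² while the standard deviation grows only like η, so the probability of exceeding the threshold k increases with η and decreases as r² approaches 1.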

We can then relate the power of discrimination 1−β to the noncentrality parameter η by

where $z_{1-\beta}$ is the quantile of level 1−β of the standard normal distribution.
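The link between η and the power of discrimination can also be explored numerically. The sketch below estimates P(Y_A² − Y_B² > k) by Monte Carlo simulation. It assumes, as our own simplification, that (Y_A, Y_B) is bivariate normal with means (η, rη), unit variances and correlation r; the function name and the example values of η are hypothetical.

```python
import math
import random

def discrimination_power_mc(eta, r, K=100.0, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(Y_A^2 - Y_B^2 > k), with k = 2 ln K.

    Assumption (not from the source): (Y_A, Y_B) is bivariate normal with
    means (eta, r*eta), unit variances and correlation r.
    """
    k = 2.0 * math.log(K)
    rng = random.Random(seed)
    s = math.sqrt(1.0 - r * r)
    hits = 0
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        y_a = eta + z1
        y_b = r * eta + r * z1 + s * z2  # correlation r with y_a
        if y_a * y_a - y_b * y_b > k:
            hits += 1
    return hits / n_draws

# Power as a function of eta for r = 0.9 (illustrative values only).
for eta in (3.0, 6.0, 12.0):
    print(eta, discrimination_power_mc(eta, r=0.9))
```

A simulation of this kind is a convenient cross-check on any closed-form approximation, since it makes no further approximation beyond the joint normality assumed above.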

Discrimination with case–control data

Here we consider case–control data: the association statistic is the Armitage statistic.14, 15, 16 Udler et al show that in this case

where m and n are the numbers of controls and cases, $f_A$ and $f_a$ are the frequencies of the reference and alternative alleles, A and a, and ψ is the per-allele OR of a. The derivation is given in section 2 of Supplementary Information. Then, if we assume that the numbers of controls and cases are equal, the total sample size needed to achieve power 1−β is

where $\eta^2$ can be computed from β using equation 2.

Family design

Here, we propose a method in two steps. The first step is the selection of a subset of variants that is likely to contain the causal variant, using the same discrimination procedure as in the case–control design but with an association statistic designed for family data. We compute the power of discrimination of this procedure, assuming that the causal SNP is among the genotyped SNPs.

The second step uses the selected variants to make inference on causal variants in LD with them, relying on Bayesian principles. This step makes it possible to retrieve information on a causal variant even if it is not genotyped.

First step: discrimination with family data

First, we use the same discrimination statistic (equation 1) based on a statistic Y, which has been proposed for ASPs and controls.4 The data considered include genotypes of controls, genotypes of the index cases and the number of IBD alleles in each sib-pair. Hereafter, we denote the three possible genotypes by the number of alternative alleles: 0, 1 and 2.

We denote by $n_{ki}$ the number of ASPs in which the index genotype is k and the number of IBD alleles is i, by $m_k$ the number of controls with genotype k, and by n and m the total numbers of ASPs and controls. The association statistic is $Y = U/\sqrt{\widehat{\operatorname{var}}(U)}$, where U is the score

with

the estimator of the alternative allele frequency, and

the estimator of the variance of U under the hypothesis of no association. In the absence of association, the distribution of Y is standard normal, $Y\sim N(0,1)$.

We consider the causal variant A with allele frequencies $f_A$ and $f_a$ and OR ψ. The association statistic is then shifted away from zero: $Y_A\sim N(\eta,1)$, where η is approximately

$$\eta \approx \frac{E(U_A)}{\sqrt{\operatorname{var}(U_A)}}$$

Formulas for $E(U_A)$ and $\operatorname{var}(U_A)$ (depending on $f_A$, $f_a$, ψ and the sample sizes n and m) are given in section 3 of Supplementary Information. We can then calculate the power of discrimination for a given set of parameters, or the total sample size needed to achieve a given power, with equation 2.

Second step: Bayesian inference on the causal variant

Second, we propose to use the SNPs selected by the discrimination step to retrieve information on the causal variants of the region. Assume that a variant B in LD with the causal variant A is observed. In this case, we want to make inferences on A, in particular to estimate the LD between A and B and the OR of A. This is impossible with case–control data, because an OR can always be computed for variant B that fully explains the observations under the hypothesis that B is the causal variant. The advantage of family data lies in the linkage information provided by the IBD state of the sib-pairs, which makes it possible to discriminate between observations made directly at a causal variant A and observations made at a variant B in LD with A.

In section 4 of Supplementary Information, we write a likelihood for the family data, $L_\psi(f_a,f_b,d)$, where ψ is the OR at A, $f_a$ and $f_b$ are the alternative allele frequencies at A and B, and d is the LD between A and B. In section 5 of Supplementary Information, we show that all parameters are identifiable, provided that ψ>1 and d≠0.

This likelihood can be used to define the posterior distribution of the parameters, from which we sample using the Metropolis–Hastings algorithm17 (see section 6 of Supplementary Information for details). In particular, we can estimate the posterior distribution of the disequilibrium $r^2=d^2/(f_a(1-f_a)f_b(1-f_b))$. We also find simultaneous credibility regions for $f_a$ and $f_b$, or for ψ and $r^2$, using the posterior joint density of these parameters as estimated from the values sampled by the Metropolis–Hastings algorithm.
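To make the sampling step concrete, here is a minimal random-walk Metropolis–Hastings sketch together with a helper for the r² formula quoted above. It is not the sampler of the Supplementary Information: the family-data likelihood is not reproduced here, so the target is a deliberately simple stand-in (a binomial likelihood for a single allele frequency with a flat prior), and all function names and counts are hypothetical.

```python
import math
import random

def ld_r2(d, fa, fb):
    """LD measure r^2 = d^2 / (fa (1 - fa) fb (1 - fb))."""
    return d * d / (fa * (1.0 - fa) * fb * (1.0 - fb))

def metropolis_hastings(log_post, start, step, n_iter, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional parameter."""
    rng = random.Random(seed)
    x, lp = start, log_post(start)
    samples = []
    for _ in range(n_iter):
        cand = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        samples.append(x)
    return samples

def toy_log_posterior(f, alt_count=80, n_alleles=400):
    """Stand-in target: binomial likelihood for an allele frequency, flat prior."""
    if not 0.0 < f < 1.0:
        return -math.inf
    return alt_count * math.log(f) + (n_alleles - alt_count) * math.log(1.0 - f)

draws = metropolis_hastings(toy_log_posterior, start=0.5, step=0.03, n_iter=20000)
kept = sorted(draws[2000:])  # discard burn-in, then sort for quantiles
lo, hi = kept[int(0.025 * len(kept))], kept[int(0.975 * len(kept))]
post_mean = sum(kept) / len(kept)
print(round(post_mean, 3), (round(lo, 3), round(hi, 3)))
```

The equal-tailed interval `(lo, hi)` plays the role of a 95% credibility region for a single parameter; joint regions for pairs such as (ψ, r²) would instead be read off a two-dimensional density estimate of the sampled values.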

MS data

These two methods of discrimination are illustrated on MS data described in full detail in Babron et al.7 These data include 26 tag-SNPs in the IL2RA gene for French families with at least one affected child, collected through REFGENSEP. All affected people were reviewed by a board-certified neurologist and diagnosed according to Poser criteria.18 All individuals signed informed consent in accordance with the European Union and Country Laws and the Helsinki Convention. The sample comprises 522 trio families (one affected child with two living parents) and 101 multiplex families (at least two affected sibs).

The trio families are used to create pseudo-control genotypes consisting of the alleles untransmitted by the parents to their affected child. Pseudo-control genotypes are known to represent general population genotypes.19 ASPs are obtained from multiplex families, randomly selecting two affected sibs in each family. The IBD states are calculated using the software Merlin,20 which calculates the probability of each IBD state. Only ASPs for which one IBD state has probability higher than 0.8 are kept, assigning the IBD state with probability exceeding 0.8 to them.
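The two data-preparation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: genotypes are coded as counts of alternative alleles (0, 1 or 2), the IBD probabilities are supplied as a plain triple (no Merlin output format is assumed), and both function names are hypothetical.

```python
def pseudo_control_genotype(father, mother, child):
    """Alternative-allele count of the alleles NOT transmitted to the child.

    Genotypes are coded 0, 1 or 2 (number of alternative alleles).
    """
    # Alleles each parent can transmit, given its genotype.
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    possible = {a + b
                for a in transmissible[father]
                for b in transmissible[mother]}
    if child not in possible:
        raise ValueError("trio genotypes are not Mendelian-consistent")
    return father + mother - child

def assign_ibd_state(ibd_probs, threshold=0.8):
    """Return the IBD state (0, 1 or 2) whose probability exceeds the
    threshold, or None if the sib-pair should be discarded."""
    state = max(range(3), key=lambda s: ibd_probs[s])
    return state if ibd_probs[state] > threshold else None

print(pseudo_control_genotype(father=1, mother=2, child=1))
print(assign_ibd_state((0.05, 0.10, 0.85)))
print(assign_ibd_state((0.30, 0.40, 0.30)))
```

Discarding pairs with no IBD state above the threshold trades sample size for certainty about the IBD state, which is consistent with the drop from 101 multiplex families to 82 ASPs reported below.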

Overall, the data set comprises 522 pseudo-controls and 82 ASPs with index case genotypes and IBD states. In addition to applying the family-based discrimination method to the data set, we also apply the case–control method, using the 82 index sibs as cases and the 522 pseudo-controls as controls.

Results

Power of the family and case–control discrimination methods

The power of the two discrimination methods depends on the value of the noncentrality parameter η (equation 2). In Figure 1, we display η values for an OR ψ varying from 1 to 5. For all ORs, the η parameter in the family design is higher than that in the case–control design.

Figure 1

The parameter η as a function of the OR ψ, for an alternative allele frequency of 0.1, $r^2=0.8$, and 500 cases and 500 controls.

The total sample size required to achieve 90% power to exclude variants at 100:1 odds, assuming an equal number of controls and of unrelated cases or ASPs, is reported in Figure 2 for different values of the alternative allele frequency, the OR ψ and the LD $r^2$. For identical parameters, the family discrimination method needs a smaller sample size than the case–control method. For example, when ψ=3, $r^2=0.9$ and $f_a=0.1$, the case–control method needs the genotypes of 1500 controls and 1500 cases, whereas the family discrimination method needs only the genotypes of 300 controls and 300 sib-pairs (genotype of the index case and IBD state, which can be obtained with low-density genotyping of the second sib).

Figure 2

The sample size needed to achieve 90% power to exclude variants at 100:1 odds, plotted as a function of $r^2$ for various values of ψ and for (a) $f_a=0.1$ and (b) $f_a=0.5$. The numbers of cases and controls are assumed to be equal.

Inference on a causal variant with family data

We simulate data composed of 1000 ASPs and 1000 controls and apply the Metropolis–Hastings algorithm to them. The theoretical distribution used for these simulations is described in section 4 of Supplementary Information. The posterior distributions of $f_a$, $f_b$ and $r^2$, obtained from $10^7$ samples, are displayed in Figures 3 and 4. Each shade of gray represents the credibility region for one level. The lightest gray corresponds to all sampled values.

Figure 3

Posterior distributions with Metropolis–Hastings for a simulated sample of 1000 sib-pairs and 1000 controls. The parameters used for simulation are $r^2=1$, $f_a=f_b=0.2$ and ψ=2.

Figure 4

Posterior distributions with Metropolis–Hastings for a simulated sample of 1000 sib-pairs and 1000 controls. The parameters used for simulation are $r^2=0.8$, $f_a=0.435$, $f_b=0.448$ and ψ=3.

In Figure 3, the data are simulated under a model with complete LD ($r^2=1$), alternative allele frequencies $f_a=f_b=0.2$, and an OR ψ=2 for the causal variant A. The 95% credibility regions of $f_b$, $f_a$, ψ and $r^2$ are ~(0.17,0.215), (0.08,0.29), (1.8,3) and (0.3,1), respectively. They contain the true values of the parameters, and the mode of each distribution is close to the true value. Note that the allele frequency is best estimated at the variant that is directly observed; the larger uncertainty at the causal variant goes together with a certain amount of uncertainty on $r^2$.

In Figure 4, the data are simulated with $r^2=0.8$, $f_a=0.435$, $f_b=0.448$ and ψ=3. The 95% credibility regions of $f_b$, $f_a$, ψ and $r^2$ are (0.415,0.47), (0.3,0.55), (2.5,5) and (0.5,1), respectively. Again, they contain the true values of the parameters. Interestingly, although the causal variant is not directly observed, some inference on its characteristics is possible.

Application to MS data

Discrimination methods

The values of the association statistics and P-values for the two discrimination methods at all SNPs are displayed in Table 1. For the SNPs with the smallest P-values (SNPs 1–4, 21 and 24), the family-based P-values are lower than the case–control ones. However, the use of family data does not decrease the P-value for all SNPs: for example, the SNP reported in the literature21, 22, 23, 24, 25 as associated with MS, rs2104286 (SNP 23 in our numbering), is not associated using case–control data, and adding the IBD information does not decrease its P-value. After Bonferroni correction, the association is significant only with rs3118470 (SNP 24) for both the case–control and family designs, and with rs12359875 (SNP 1) for the family design only.
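For readers who want to reproduce the significance call, the Bonferroni step can be sketched as below. The number of tests (26 SNPs) comes from the text; the 5% family-wise level and the use of a two-sided P-value from a standard normal statistic are our assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist

N_SNPS = 26   # number of SNPs genotyped in IL2RA (from the text)
ALPHA = 0.05  # assumed family-wise level; not stated in the source

def two_sided_p_value(y):
    """Two-sided P-value for a statistic Y ~ N(0,1) under no association."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(y)))

def significant_after_bonferroni(y, n_tests=N_SNPS, alpha=ALPHA):
    """Compare the P-value with the Bonferroni-corrected threshold."""
    return two_sided_p_value(y) < alpha / n_tests

print(ALPHA / N_SNPS)  # per-SNP threshold under these assumptions
```

Under these assumptions a SNP needs a P-value below roughly 0.0019 to be declared significant.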

Table 1 Association test statistics and P-values; discrimination statistics of all SNPs with SNP 24

To apply discrimination methods on these data, we compute the difference of association statistics between the most associated SNP, that is, SNP 24, and others (Table 1). Comparing these values with the threshold k=9.210, we select the set of SNPs 1, 2, 3, 4, 20, 21 and 24 as likely to contain the causal variant, using the case–control discrimination method, whereas the selected set contains only SNPs 1, 4 and 24, for the family discrimination method.

Metropolis–Hastings on SNPs 24 and 1

Applying the Metropolis–Hastings algorithm to the most associated SNP, SNP 24, yields the posterior distributions displayed in Figure 5. The frequency $f_b$ corresponds to SNP 24 and $f_a$ to the hypothetical causal variant. The 95% credibility regions of $f_b$ and of the OR ψ are (0.24,0.31) and (1.4,3.2), respectively. The LD $r^2$ is not well estimated, as its 95% credibility region contains almost all possible values. Finally, the posterior distribution of $f_a$, the disease allele frequency, is bimodal, with two modes near 0.3 and 0.8.

Figure 5

Posterior distributions with Metropolis–Hastings using SNP 24 (rs3118470).

Applying the Metropolis–Hastings algorithm to the second most associated SNP, SNP 1, yields the posterior distributions displayed in Figure 6. Graphically, the 95% credibility regions of $f_b$, $f_a$, ψ and $r^2$ are ~(0.68,0.76), (0.6,0.95), (1.5,5) and (0.1,1), respectively. The mode of $f_a$ is ~0.8.

Figure 6

Posterior distributions with Metropolis–Hastings using SNP 1 (rs12359875).

We have also applied the Metropolis–Hastings algorithm to rs9663421 (SNP 4), which is in the subset of SNPs selected by the family method. The results are similar to those obtained for SNP 1. This is consistent with the observed LD between SNPs 1 and 4 in our data ($r^2=0.85$).

Discussion

Nowadays, research on complex diseases focuses on massive case–control designs, neglecting family designs. However, the joint use of linkage and association information in families allows efficient designs for complex diseases. Using linkage information in association studies results not only in a gain of power in association testing, but also in an increased ability to estimate the risk conferred by the allelic variants, as illustrated in previous papers on rheumatoid arthritis26, 27 and MS.7 The MASC method28 was developed to exploit all information in family data. The association test from Perdry et al4 is built on the same idea.

In this paper, we have shown that sib-pairs provide a gain of power to discriminate between several SNPs associated with a disease. For example, with an OR of 3, $r^2=0.9$ and an alternative allele frequency of 0.1, the family method needs five times fewer individuals than the case–control method to achieve similar power. This illustrates well the gain of information provided by family data, as the sib-pair test simultaneously uses association information, by comparing controls and index cases, and linkage information, through the IBD state. Note that in many cases, the IBD information is already available from previous linkage studies performed on the same sib-pair sample. If it is not available, it can be obtained through low-density genotyping, which has a negligible cost compared with the high-density genotyping of the index cases.

Moreover, Udler et al's method for case–control data assumes that the causal variant is genotyped, which is unlikely to be true when using tag-SNPs. Imputation methods, which have been widely used in GWAS, can help to reach a fine enough mapping scale. Nevertheless, we have shown that with family data, thanks to the IBD information, we can capture information on the unobserved causal variants through the linked observed variants. This was first done formally, by proving the identifiability of the parameters (section 5 of Supplementary Information). In theory, this makes it possible to assess whether the observed variant is the causal variant or is only in LD with it. Sampling from the posterior distribution of the different parameters yields credibility regions for the allele frequencies, the per-allele risk and the LD between the observed and causal variants. We showed on simulated data that these credibility regions are reasonably accurate.

These methods were illustrated on a real data set, consisting of IL2RA genotypes of MS cases and controls, first considered in Babron et al.7 Both discrimination methods provide subsets of SNPs that are likely to contain the causal variants. However, the subset obtained using case–control data contains seven SNPs, whereas the subset obtained using the family method contains only three SNPs. This smaller subset illustrates the increase in discrimination power. In addition, the family data were used to estimate the risk allele frequency and the OR of a putative causal variant in IL2RA: when using rs12359875 (SNP 1), the method infers a causal variant with a risk allele frequency ~0.8, likely different from SNP 1; the LD between SNP 1 and the causal variant and the allelic OR are not well identified. When using rs3118470 (SNP 24), the posterior distribution of the risk allele frequency of the causal variant is bimodal, with one mode ~0.8 and another ~0.3. This argues for the existence of a second causal variant with a risk allele frequency near 0.3, in LD with SNP 24 but not with SNP 1, whereas the first causal variant, with a risk allele frequency near 0.8, is in LD with both SNPs. In the previous study,7 the association signal in IL2RA was best captured by a haplotype of rs2256774 (SNP 22) and rs3118470 (SNP 24). Although we agree that the association signal is not due to a single SNP in the region, it is difficult to compare our results with theirs, as our approach cannot consider several markers at the same time. Also, we considered only a multiplicative risk model, which was not the case in the previous study.7

Our approaches could be extended in these directions: considering haplotypes instead of isolated SNPs, dropping the multiplicative risk hypothesis, and allowing for multiple disease alleles. Although this could improve the disease model inference, a compromise has to be found between the complexity of the model considered and the amount of available information. Considering larger nuclear families or even multiplex families could be a way to improve the precision of the inference. Another interesting strategy would be to use the difference in LD pattern between cases and controls in the vicinity of the causal variant29 together with the IBD information.

Identifying causal variants helps both to improve disease risk prediction and to uncover the biological mechanisms involved in human diseases. Better statistical modeling of the effect of the variants in an associated genomic region is a crucial step toward this goal. In this regard, the family design should not be neglected.

Software

The proposed method is implemented in the R package ASPBay, available on the Comprehensive R Archive Network (CRAN).