Introduction

Hodgkin’s lymphoma (HL) is a malignancy of the lymphatic system with an incidence of 2–3/100 000/year in developed countries.1 It is characterized through malignant Hodgkin and Reed–Sternberg (HRS) cells mixed with a dominant background population of reactive lymphocytes and other inflammatory cells.2 Epstein–Barr virus (EBV) infections may be causally related to a number of cases as well as a personal history of autoimmune diseases.3,4 However, there is limited evidence to the involvement of other specific environmental risk factors, although there is a distinctive pattern of incidence rates and risk profiles by age, race/ethnicity, sex and economic levels.2 Some evidence for inherited genetic influence on susceptibility is provided by the increased familial risk and the very high concordance between monozygotic twins.5 Recently, several genome-wide association studies (GWASs) have identified a couple of loci for HL.6,7 These studies have shown that the risk of HL is highly influenced by the human leukocyte antigen (HLA) genotype variation, but the familial risk is also a consequence of non-HLA genotype variation.6 However, all the genetic variants identified so far only capture a minor percentage of the estimated heritability of the disease.8 Yet, a great deal remains to be understood regarding the remaining heritability.9,10 Some plausible explanation include unidentified gene–gene interactions, unidentified contributions of rare genetic variants or overestimating the total heritability for HL in population-based studies, resulting in a ‘phantom heritability’,9,11,12 that cannot be dissolved on the molecular basis.13,14

Thus, our aim is to provide reliable estimates for the genetic variation of the disease derived either from the quantification of resemblance between close relatives or the dissection of genetic variation from genomic loci.15

The knowledge of the heritability estimates for the susceptibility to HL based on genomic data and on population data will then provide further insights in the proportion of genetic variation hidden so far but still detectable on the genomic level.16

Materials and methods

Genomic data: quality control of SNP genotyping

The study population comprised a total of 2227 individuals, with 1001 cases and 1226 controls. Cases were sampled all over Germany, whereas controls were sampled within the Ruhr area in North Rhine-Westphalia as part of the Heinz Nixdorf Recall Study.17 All individuals were genotyped using the Illumina Human Omni Express 12v1 chip (Illumina, San Diego, CA, USA). (733 202 markers).

To counteract artificial differences in allele frequencies between cases and controls causing spurious genetic variation, GWAS data have undergone a very stringent quality control procedure.16 After checking the gender based on genotypes, 11 individuals were excluded because of inconsistencies. Three individuals were excluded because of low genotype calling rates (<0.99). In total, 27 individuals were excluded, because their heterozygosity was >3 SD apart from the mean heterozygosity of the sample. Principal component analysis indicated a presence of population outliers, especially in the cases, because cases represent a more diverse group than controls. After excluding these 46 outliers, the remaining individuals were genetically well matched. Seventeen individuals having a relatedness score of >0.05 were excluded. The final set consisted of 906 cases and 1217 controls. All data were checked for differential missingness between cases and controls, and single-nucleotide polymorphisms (SNPs) were excluded with P<0.05. SNPs were also excluded after applying the Hardy–Weinberg equilibrium test with P<0.001. A detailed overview of the study material, the identification of samples of non-European origin, the plots of the principal components and the results is given in our previous paper.6 The genome-wide Armitage trend test χ2 values showed minimal inflation of the test statistics proving the absence of substantial cryptic population substructure (genomic control inflation factor λgc=1.09).6

By using PLINK software18 we finally produced two subsets of data with SNPs in cases and controls that had either a minor allele frequency (MAF) of >0.01 or MAF of >0.05.

Statistical analysis on genomic data

For the statistical analyses on the genomic data, the approach of Yang et al19 was used. Their method has been completed by an approach of Speed et al,20 who presented an improved method for the heritability estimation on GWAS data with a new adjustment for linkage disequilibrium (LD). Both methods provide an estimate of the additive genetic variance explained by SNPs but are accounting for LD between the genotyped SNPs and unknown causal variants in different ways (ie, correlations between SNP genotypes).19,20 Both approaches fit a linear mixed model of the form: y=μ+g+e,16,19 whereby y is the vector of the disease status, μ is the mean vector, g is a vector of random additive genetic effects obtained from SNP data and e is a vector of residual effects. The covariance structure fitted in the data is the individual relationship estimated from the SNPs; cov(yj,yk)=Ajkσ2ge2, where Ajk is the genetic relationship between individuals j and k derived from the SNPs, σ2g is the additive genetic variance and σe2 is the residual variance. With this model disease heritability, h02, can be defined as: σ2g2g+σ2e.20 Because phenotypes in case–control studies are measured on the 0–1 scale, the relationship between observations on the observed scale and liabilities on the unobserved continuous scale are modeled through the liability threshold model. The liability for HL on the underlying scale follows a standard normal distribution whereby if liability exceeds a certain threshold, t, then individuals will be affected. The estimate of variance explained by the SNPs on the observed 0–1 scale is linearly transformed to that on the unobserved continuous liability scale such that h12=h02K(1–K)/z2,21 where K is the prevalence of the disease and z is the value of the standard normal probability density function at the threshold t. Given an incidence of 2–3/100 000/year will result in a cumulative risk of ~1 in 1000, which can be considered as an estimate of the prevalence. The relationship between additive genetic variance on the observed 0–1 and unobserved liability scales is extended to account for ascertainment bias in a case–control study.16 Estimation of the additive genetic variance was performed using restricted maximum likelihood (REML) via the genome-wide complex trait analysis (GCTA) software.22 Because the MAF spectrum of the unobserved causal variants may be different from the genotyped SNPs, the estimation of the variance explained by SNPs was performed in two ways. (1) the estimate of the variance explained by SNPs was adjusted to account for missing LD between the genotyped SNPs and unknown causal variants.19 SNPs were randomly assigned into two different groups with one of the groups being treated as representing ‘true’ causal variants. The covariance between both groups is supposed to reflect the true variance of relatedness between individuals, whereas the variance derived from the SNP group equals the variation of relatedness plus estimation error. The prediction error can then be derived by regressing the relationships of the ‘true’ causal variants on the SNPs. (2) In contrast to the method of Yang et al,19 which suggests a uniformly scaling of the usual SNP-based kinship coefficients, Speed et al20 proposed a different adjustment, in which SNPs are weighted according to how well they are tagged by their neighbors. The kinship coefficients corrected for LD are obtained in a two-step procedure: first, weightings for each predictor given the local patterns of correlations are calculated, and second these weightings are used to estimate relatedness values across all pairs of individuals.20 The final estimation of the additive genetic variance was again performed by using REML via the software tool GCTA.22

In addition, the approach of Guan and Stephens23 has been applied to estimate the proportion of phenotypic variance explained (PVE) by the genotypes. It implements the Markov chain Monte Carlo (MCMC)-based inference methods for Bayesian variable-selection regression based on standard normal linear regression with the following model:

relating a response variable y to covariates X. Here y is an n-vector of observations on n individuals, μ is an n-vector with components all equal to the same scalar μ, X is an n by p matrix of covariates, β is a p-vector of regression coefficients, τ denotes the inverse variance of the residual errors, Nn (·, ·) denotes the n-dimensional multivariate normal distribution and In the n by n identity matrix. The variables y and X are observed, whereas μ, β and τ are parameters to be inferred. Because GWAS applications involve binary phenotypes, the probit link function is used and allows direct comparisons to the estimates of variance explained by SNPs on the liability scale. The total proportion of variance in y explained by the relevant covariates Xγ, or PVE, is commonly used to summarize the results of a linear regression.

The PVE is closely related to the ‘heritability’ of the phenotype and reflects the optimal predictive accuracy that could be achieved for a linear combination of the measured genetic variants, whereas heritability reflects the accuracy that could be achieved by all genetic variants.23

Population data: Swedish Family-Cancer Database

To compare the variance explained by SNPs to new estimates of heritability from family-based studies, variance components were estimated for the susceptibility to HL on the basis of the 2010 update of the Swedish Family-Cancer Database that includes all individuals born after 1931 who are residing in Sweden, together with their biological parents, totaling 12.1 million individuals.24 The Database was created in 1996 by combining the Swedish Cancer Registry and the Swedish Multigenerational Register, and has been updated regularly. The Database includes information about the cancers, socioeconomic data and death causes. In total, 7438 individuals (4441 males and 2997 females) have been diagnosed with the HL (ICD-7 code 201). The R-stat package and DmuTrace25,26 were used to extract all related individuals of the patients from the large pedigree file back to the base population as well as all offspring of the patients and their relatives in future generations until the current population. This resulted in a pedigree of 133 544 individuals (67 059 males and 66 485 females). The entire pedigree consisted of 6755 families across 6 generations with a family size ranging from 2 to 490 individuals. The total number of founders was 59 679 with a range of 2 to 203 individuals per family. Each family contained at least one and up to eight cases.

Statistical analysis on population data

A generalized linear mixed effect threshold model with a binary response variable using MCMC Carlo techniques was applied.27 In such a standard threshold (probit) model, the observed binary records (Yij) are assumed fully determined by an underlying liability it), such that: for Yij=0 for λij≤0 and Yij=1 for λij>0 and the threshold value is set to 0. The model can be written as

where λ=vector of all λij, β=vector of ‘fixed’ effects, a=vector of random additive genetic effects of all individuals, e=vector of random residuals and X and Z are the appropriate incidence matrices. Var(a)=Aσa2 and Var(e)=Inσ2e, where A is the additive genetic relationship matrix of all individuals, In is an identity matrix with dimension equal to number of records and σa2 and σ2e are the additive genetic and the residual variances, respectively. As usual for probit threshold models, σ2e is restricted to be 1. Fixed effects included in the model were gender, birth year, country of birth, social economic index and number of children.

Marginal estimates of the genetic parameters were obtained in the underlying scale using Bayesian inference, implemented via the Gibbs sampling procedure and a data augmentation approach. The model included a Gibbs sampling chain of 10 150 000 rounds with a conservative 150 000 iterations as burn-in and 10 million sampling rounds. Every 1000th sample was drawn, resulting in 10 000 samples. For each sample of the Gibbs chain, narrow-sense heritability was calculated after as: h2a2/(σa22e), where σ2e=1 for probit link model.28,29

Posterior marginal means of heritability were derived and the 95% highest marginal posterior density region of the heritability range determined. The algorithm to estimate genetic (co)variance components has been implemented in the Gibbs sampling module of the DMU statistical software package30 that has been developed to handle multivariate genetic analyses including binary, ordered categorical and Gaussian traits. Results have been proven by the R package MCMCglmm31 that has also been used to prove convergence diagnosis and output analysis. Heritability estimates on the liability scale have been transformed to the observed scale by using the linear transformation of Dempster and Lerner.21

Results

Estimates of the variance explained by SNPs

The analysis on genomic data was restricted to the autosomes of the GWAS data set and based on the final set of 906 cases and 1217 controls. For both primary subsets of genomic data with either MAF >0.05 or MAF >0.01, the threshold for SNP missingness was raised stepwise starting from 0.05 up to 0.001. This resulted in a reduced number of SNPs from 583 333 to 442 325 (MAF >0.01). Both the crude proportions of variance and the adjusted estimates only dropped slightly, whereas the transformed estimate was stable across all subsets with different numbers of SNPs (Table 1). We only included adjusted estimates according to Yang et al19 in our tables as they did not show any difference to the adjustment according to Speed et al.20 Standard errors were slightly larger for the adjusted estimates and the transformed estimates. In contrast to stable estimates across different numbers of SNPs, the PVE showed moderate differences when using the Bayesian variable selection approach (Table 2). For data sets with less SNPs, the PVE values decrease and the high posterior density interval is shrinking. This fact is also presented in Figures 1 and 2 that show the posterior distributions of PVE from the MCMC runs for the corresponding MAF and each SNP subset. Values for PVE were larger when compared with the transformed estimate in Table 1.

Table 1 Estimated genetic variances for different setups
Table 2 Proportion of estimated variance (PVE)
Figure 1
figure 1

Proportion of estimated variance for MAF >0.01 with maximum per-SNP missing rate: red=0.001; blue=0.005; black=0.01; green=0.05.

Figure 2
figure 2

Proportion of estimated variance for MAF >0.05 with maximum per-SNP missing rate: red=0.001; blue=0.005; black=0.01; green=0.05.

Although the genomic control inflation factor was still small and mainly influenced by the distribution of cases collected across Germany,6 genomic partitioning was used to check the final influence of the population structure.20,32 Estimates were derived for chromosomes 1–7 and for the remaining chromosomes (8–22). For any threshold shown in Table 1, estimates of the two parts of the genome added up to the total estimates of common variance explained by SNPs as described in Table 1. The influence of the population structure has then also been tested with an additional REML analysis by fitting the first 2, 4 and 10 eigenvectors from the PCA as covariates. The results in Table 3 show little to no difference in the crude estimates of the variance explained by SNPs compared with original results in Table 1.

Table 3 Genetic variances explained by all SNPs by fitting first 2, 4 and 10 principal components (PCs) as covariates in the REML analysis

As many SNPs on chromosome 6 had extremely significant associations, one analysis was performed by excluding chromosome 6, or by analyzing chromosome 6 separately just as in the genome partitioning approach.20 The results in Table 4 show that estimates (crude and adjusted) decreased by 15–20% when excluding chromosome 6 (as compared with Table 1), whereas estimates based on SNPs on chromosome 6 solely account for 5% of the variance explained by SNPs. This analysis shows that a substantial proportion of genetic variation is explained by risk loci on chromosome 6, yet another big proportion of the genetic variation is still explained by SNPs on other chromosomes.

Table 4 Analysis without chromosome 6 or chromosome 6 only

Even though the incidence of 2–3/100 000/year has been stable for a long time,2 any variation over time because of changes in the environment would result in changes of the prevalence. Cutting the used prevalence in half to ~0.5 in 1000 or doubling it to ~2 in 1000 did not have any influence on the estimates of the common variance at any threshold shown in Table 1, but for a prevalence of ~0.5 in 1000, the transformed estimate dropped to 0.33 (SE: 0.04), whereas it increased to 0.40 (SE: 0.05) for a prevalence of ~2 in 1000.

According to Wray et al33 we could interpret our result of the common genetic variance explained by SNPs on the liability scale to mean that we must expect many genetic variants underlying the disease. As the risks of common variants are too small to be used individually as risk predictors, the overall transformed estimate of 0.35 (Table 1) can be translated to a sibling relative risk of 5.58 being associated with common genetic variation.33

Heritability estimates based on population data

Figure 3 shows a trace plot of the heritability values along the iterations. There is no trend in the trace and values spread widely with a reasonable parameter space. The plot clearly illustrates good mixing, and a Gibbs sampler that ‘converges’ fast. The right side of Figure 3 shows the posterior density of the heritability estimate as a result of the model described earlier applied to the data set. Averaged across the 10 000 samples, posterior means of the heritability were 0.40 on the liability scale with a corresponding 95% highest posterior density region (HPD95) ranging from 0.17 to 0.58. The heritability on the observed scale is z2/K(1−K) times the heritability on the underlying normally distributed liability scale that is then 0.24. According to the convergence criteria, the effective sample size for estimating the mean derived from our data set was 7186, representing a sufficient number, given the fact that we had 10 000 samples from the Gibbs sampler.31

Figure 3
figure 3

Trace (left) and posterior density (right) of heritability estimate.

Discussion

Results from the genomic analysis as well as the population-based analysis show the proportion of variance explained by SNPs and heritability values for the susceptibility to HL in an overlapping range from 0.21 to 0.48, whereas estimates on the liability scale are ranging from 0.35 to 0.48 only. Most of these values represent simply the proportion of the total variance that is attributable to the pure additive genetic variance. Estimates of the PVE as a result of the probit-based approach ranged from 0.39 to 0.48, and were somehow higher compared with estimates of the genetic variance explained by SNPs on the liability scale shown in Table 1. This can be explained by a slightly different concept of the PVE compared with the approaches of Yang et al19 and Speed et al20: PVE reflects the optimal predictive accuracy achieved for a linear combination of the measured genetic variants, whereas heritability reflects the accuracy that one can achieve by using all genetic variants.23 Simulations of Guan and Stephens23 have proven that the uncertainty in PVE is greater with a larger number of SNPs, presumably because of the increased difficulty in reliably identifying relevant variants. When data contain no SNPs with strong individual effects, it remains difficult to rule out the possibility that many SNPs may have very small effects that combine to produce an appreciable PVE.23 This uncertainty is also represented by the rather large confidence intervals in the right column of Table 2. It will be even greater when the true PVE is smaller.23 Nonetheless, the range of the posterior on PVE nicely spans the estimates provided by the other methods equally to both sides, proving the ability of the PVE method to quantify uncertainty in multivariate problems by assessing the full joint posterior distribution of the model parameters compared with the current simplistic ‘one SNP at a time’ testing paradigm,34,35 because analyzing all SNPs simultaneously will detect more of the genetic variation because of the identification of multiple causal variants.36

In contrast to the results of Enciso-Mora et al,37 our estimates of the variance are almost constant across the different numbers of SNPs and do not decline after a more stringent exclusion of missing genotypes. This is in agreement with the results by Yang et al19 and Lee et al.38 Thus, our QC has been stringent enough beforehand.

Inflation of the estimated variances explained by SNPs due to population stratification has been investigated by genomic partitioning. Results did not provide any evidence of stratification. Estimates of the two parts of the genome added up to the corresponding estimates on the full set (right column in Table 1). Our stringent QC helped to keep the genomic control inflation factor low.6

A cause for concern in estimating the common genetic variance explained by SNPs is created by LD that can lead to large biases. Contributions to heritability estimates from causal variants might be overestimated because of regions with strong LD or underestimated in regions with low LD.20 We also followed the proposal of Speed et al20 to analyze our data, but we did not detect any perceptible deviations in our estimates compared with the method of Yang et al.19 It seems like any underestimation of contributions to the heritability in low-LD regions is balanced by overestimation elsewhere as shown by Speed et al.20

Trends in cancer prevalence reveal the dynamics of cancers in the population. To test the influence of changes on the estimates of the common genetic variance explained by SNPs, we have halved and doubled the prevalence. Differences to the original prevalence could only be seen in the transformed variance, but standard estimates stayed constant. Thus, variation in the transformed estimates may reflect changes in the environment over time.

The estimates of the common genetic variance explained by SNPs are based on 410 000 to almost 600 000 SNPs. A significant effect on HL is harbored in the MHC region on chromosome 6.7,39 After excluding SNPs mapped to the MHC region (6p21, at 28–33 Mb), Enciso-Mora et al39 remained only with a limited number of loci influencing the risk of HL. Nonetheless, Enciso-Mora et al39 and Frampton et al6 were able to identify suggestive associations on another eight autosomes. After we excluded chromosome 6 from our analysis, the estimates dropped by 20%. A rather high proportion of the variance is thus explained by chromosome 6 only as shown in Table 4, but still a descent proportion of variance is explained by the remaining autosomes. We therefore conclude that many additional loci may contribute to the susceptibility of HL.

The link function for binary data most widely used is the probit link, also known as the threshold model.40 Modeling a random variable by using the probit function assumes that the latent variable (liability) possesses a standard normal distribution. The major advantage is that the liability is treated as a polygenic trait that is determined by many genes with small effects, and therefore heritability of liability is independent of the disease prevalence and can be directly compared.40

Our heritability estimates on the liability scale based on the Swedish population are larger than estimates by Shugart et al,8 who calculated heritability for HL to be 28.4%. A reason for their lower estimate is a bias in Falconer’s method8 that is a result of common familiar environmental factors and ascertainment.41 Generally, by using the extensive pedigree with its whole range of relationships in the population, the so-called animal threshold model provides the most accurate approximation of the heritability.42 Our estimates are larger, but still conservative, because the applied threshold model only estimated the contributions of genes that act additively: the effects of genes with nonadditive effects, such as dominance or epistasis, will not contribute to the heritability estimates reported here. And even though statistical models are available for the estimation of nonadditive genetic variance, much larger sets of suitably structured data would be required to obtain reliable estimates.

Heritability estimates on the liability scale are slightly larger for the population-based data than estimates of the common genetic variance explained by SNPs (0.40 compared with 0.35 for the GCTA approach), but they were in a very similar range to PVE (Table 2) that gave larger estimates because of reasons explained above. Nonetheless, heritability estimated from pedigree data is not the same as the proportion of phenotypic variation explained by all SNPs because the former includes the contribution of all causal variants, but the latter only includes the contribution of causal variants that are in LD with the genotyped SNPs. Thus, we also face the problem of missing heritability known for analysis of GWAS data.11 Our population-based estimates and the estimates from SNP data still do not differ too much as estimates for other traits.16 We therefore conclude that our genotypic data is a well-formed sample to draw sufficient conclusion about the genetic determination of the disease. Even though estimates of the common genetic variance explained by SNPs and the population-based heritability are similar, we must point out that many reasons for missing heritability have been widely accepted in the scientific community. This knowledge is supported by our findings of different estimates of the genetic variance explained by SNPs on chromosome 6 compared with other chromosomes. Genes detected on chromosome 6 in earlier studies do not account for much of the heritability of HL in our study.7,39 The susceptibility to HL is rather a combined effect of multiple genes on several chromosomes than that of a few disease genes on a single chromosome.

The common genetic variance explained by SNPs and the heritability on the liability scale of HL show that a reasonable proportion of the variation observed in both the German and the Swedish population is caused by variation in genotypes, but it also indicates that the environment is still the principal causative role in HL, and susceptibility genes described so far for HL are likely to explain only part of the genetic effects.

An important fact is the interpretation of the sibling relative risk that has be derived through the incidence of the disease and the estimates of common variation explained by SNPs. Based on our results, siblings of people with HL are 5.6 times more likely to develop the disease than others. These results are in agreement with studies showing up to a sevenfold increased risk in people with a parent or sibling diagnosed with HL.43 It therefore seems to be clear that besides the environment, genetic factors have strong influence on the etiology of HL.

In conclusion, there is genetic variation for the susceptibility to HL. Heritability based on the population data is somehow larger than for the genomic data showing the possibility of some missing heritability in the GWAS data. Besides that, there is still major evidence for multiple loci causing HL on chromosomes other than chromosome 6.