Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
Author: TWTCCC
The Wellcome Trust Case Control Consortium*

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10⁻⁵ and 5 × 10⁻⁷) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders.
We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

Despite extensive research efforts for more than a decade, the genetic basis of common human diseases remains largely unknown. Although there have been some notable successes¹, linkage and candidate gene association studies have often failed to deliver definitive results. Yet the identification of the variants, genes and pathways involved in particular diseases offers a potential route to new therapies, improved diagnosis and better disease prevention. For some time it has been hoped that the advent of genome-wide association (GWA) studies would provide a successful new tool for unlocking the genetic basis of many of these common causes of human morbidity and mortality¹.

Three recent advances mean that GWA studies that are powered to detect plausible effect sizes are now possible². First, the International HapMap resource³, which documents patterns of genome-wide variation and linkage disequilibrium in four population samples, greatly facilitates both the design and analysis of association studies. Second, the availability of dense genotyping chips, containing sets of hundreds of thousands of single nucleotide polymorphisms (SNPs) that provide good coverage of much of the human genome, means that for the first time GWA studies for thousands of cases and controls are technically and financially feasible. Third, appropriately large and well-characterized clinical samples have been assembled for many common diseases.

The Wellcome Trust Case Control Consortium (WTCCC) was formed with a view to exploring the utility, design and analyses of GWA studies. It brought together over 50 research groups from the UK that are active in researching the genetics of common human diseases, with expertise ranging from clinical, through genotyping, to informatics and statistical analysis.
Here we describe the main experiment of the consortium: GWA studies of 2,000 cases and 3,000 shared controls for 7 complex human diseases of major public health importance, namely bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). Two further experiments undertaken by the consortium will be reported elsewhere: a GWA study for tuberculosis in 1,500 cases and 1,500 controls, sampled from The Gambia; and an association study of 1,500 common controls with 1,000 cases for each of breast cancer, multiple sclerosis, ankylosing spondylitis and autoimmune thyroid disease, all typed at around 15,000 mainly non-synonymous SNPs.

By simultaneously studying seven diseases with differing aetiologies, we hoped to develop insights, not only into the specific genetic contributions to each of the diseases, but also into differences in allelic architecture across the diseases. A further major aim was to address important methodological issues of relevance to all GWA studies, such as quality control, design and analysis. In addition to our main association results, we address several of these issues below, including the choice of controls for genetic studies, the extent of population structure within Great Britain, sample sizes necessary to detect genetic effects of varying sizes, and improvements in genotype-calling algorithms and analytical methods.

*Lists of participants and affiliations appear at the end of the paper. Nature, Vol 447, 7 June 2007, doi:10.1038/nature05911. ©2007 Nature Publishing Group.

Samples and experimental analyses

Individuals included in the study were living within England, Scotland and Wales ('Great Britain') and the vast majority had self-identified as white Europeans (153 individuals with non-Caucasian ancestry were excluded from final analysis; see below).
The seven conditions selected for study are all common familial diseases of major public health importance both in the UK and globally⁴, and for which suitable nationally representative sample sets were available. The control individuals came from two sources: 1,500 individuals from the 1958 British Birth Cohort (58C) and 1,500 individuals selected from blood donors recruited as part of this project (UK Blood Services (UKBS) controls). See Methods and Supplementary Table 1 for sample recruitment, phenotypes and summary details for each collection.

We adopted an experimental design with 2,000 cases for each disease and 3,000 combined controls. All 17,000 samples were genotyped with the GeneChip 500K Mapping Array Set (Affymetrix chip), which comprises 500,568 SNPs, as described in Methods. The power of this study (estimated from simulations that mimic linkage disequilibrium patterns in the HapMap Caucasian sample (CEU), see Methods) averaged across SNPs with minor allele frequencies (MAFs) above 5% is estimated to be 43% for alleles with a relative risk of 1.3, increasing to 80% for a relative risk of 1.5, for a P-value threshold of 5 × 10⁻⁷ (Supplementary Table 2).

We developed a new algorithm, CHIAMO, which we applied to simultaneously call the genotypes from all individuals (see Methods and Supplementary Information). Cross-platform comparison showed CHIAMO to outperform BRLMM (the standard Affymetrix algorithm) by having an error rate under 0.2% (Supplementary Table 3), and comparison of 10⁸ duplicate genotypes in our study gave a discordance rate of 0.12%.

We excluded 809 samples after checks for contamination, false identity, non-Caucasian ancestry and relatedness (see Methods and Supplementary Table 4); 16,179 individuals remained in the study. Genome-wide, 469,557 SNPs (93.8%) passed our quality control filters (described in Methods), giving an average call rate of 99.63%. Of those, 392,575 have study-wide MAFs > 1% (45,106 have MAFs < 0.1%; see also Supplementary Figs 1 and 2).
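The two per-SNP quantities quoted above, call rate and minor allele frequency, can be illustrated with a minimal sketch. It assumes genotype counts are already available for a single biallelic SNP; the function name `snp_summary` and the example counts are hypothetical, and the consortium's actual quality control filters are those described in Methods, which are not reproduced here.

```python
def snp_summary(n_aa, n_ab, n_bb, n_missing):
    """Return (call_rate, maf) for one biallelic SNP.

    n_aa, n_ab, n_bb: counts of the three called genotypes.
    n_missing: number of samples with no genotype call.
    (Hypothetical helper, for illustration only.)
    """
    called = n_aa + n_ab + n_bb
    if called == 0:
        raise ValueError("no called genotypes for this SNP")
    # Call rate: fraction of samples with a genotype call.
    call_rate = called / (called + n_missing)
    # Each called individual carries two alleles.
    freq_b = (n_ab + 2 * n_bb) / (2 * called)
    # The minor allele is whichever allele is rarer.
    maf = min(freq_b, 1.0 - freq_b)
    return call_rate, maf


# Hypothetical counts, not taken from the study.
call_rate, maf = snp_summary(n_aa=810, n_ab=180, n_bb=10, n_missing=0)
print(f"call rate = {call_rate:.4f}, MAF = {maf:.3f}")
```

A SNP would then be kept or dropped by comparing these values with whatever thresholds a study adopts.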
Initial analyses of the polymorphic SNPs suggest that patterns of linkage disequilibrium in our samples are very similar to those in HapMap (Supplementary Fig. 3). Therefore, we expect genome coverage with the Affymetrix 500K set in this study to be similar to that estimated for the HapMap CEU panel².

All SNPs passing quality control filters were used in the association analyses, although power is very low for SNPs with low MAFs (unless they have unusually large effects). On visual inspection of the cluster plots of SNPs showing apparently strong association, we removed a further 638 SNPs with poor clustering.

Control groups

Our main purpose in using two control groups was to assess possible bias in ascertaining control samples. In addition, noting that DNA sample processing differed between these groups, comparison of control groups also provides a check for effects of differential genotyping errors as a result of differences in DNA collection and preparation. Figure 1a shows the results of 1-d.f. Mantel-extension tests⁵ for differences in allele frequencies of SNPs between subjects from the 58C and UKBS collections, stratified by 12 broad regions of Great Britain (see Supplementary Table 5 and Supplementary Fig. 4 for region definitions). The associated quantile-quantile plot (see Methods for background) in Fig. 1b shows good agreement with the null distribution (similar results are obtained for tests that do not stratify by geography, data not shown). The fact that we see few significant differences between these two control groups, despite the fact that they differ in population groups sampled, DNA processing, and age, indicates that there would be little bias due to use of either sample as a control group for any of the case series, and justifies our combining of the two control groups to form a single group of 3,000 subjects for our main analyses.
One consequence of using a shared control group (for which detailed phenotyping for all traits of interest is not available) relates to the potential for misclassification bias: a proportion of the controls is likely to have the disease of interest (and therefore might meet the criteria for inclusion as a case) and some others will develop it in the future. However, the effect this has on power is modest unless the extent of misclassification bias is substantial; for example, if 5% of controls would meet the definition of cases at the same age, the loss of power is approximately the same as that due to a reduction of the sample size by 10%⁶. Even for the higher prevalence conditions examined by the WTCCC (such as HT, CAD and T2D), the precise ascertainment schemes used here (which enriched for more extreme phenotypes and/or strong family history) will have limited the proportions of controls meeting case criteria to low levels (for example, to < 5%). Although a study design which used 'hypercontrols' (that is, selection of control individuals from the lower extremity of the relevant trait distribution) would generally be the most powerful approach in a study focusing on one disease, the merits of such an approach need to be weighed against the additional costs associated with the need to phenotype and genotype each control sample.

Geographical variation and population structure

An additional cause of false positive findings is hidden population structure. Case and control samples may differ in the distribution of their ancestry, either owing to control sampling effects, as discussed above, or to confounding when different ancestries carry higher disease risk and are, as a result, over-represented in cases. Even after exclusion of individuals with evidence of recent non-European ancestry, the British population is heterogeneous, having been shaped by several waves of immigration from southern and northern Europe.
Whether the differences between these incoming populations are sufficiently large to distort the findings of population-based case-control studies is an open question.

We first examined our samples for non-European ancestry, using multidimensional scaling after 'seeding' our data with those from the three HapMap analysis panels (see Supplementary Fig. 5 and Methods), and excluded 153 individuals on this basis.

Figure 1 | Genome-wide scan for allele frequency differences between controls. a, P values from the trend test for differences between SNP allele frequencies in the two control groups, stratified by geographical region. SNPs have been excluded on the basis of failure in a test for Hardy–Weinberg equilibrium in either control group considered separately, a low call rate, or if minor allele frequency is less than 1%, but not on the basis of a difference between control groups. Green dots indicate SNPs with a P value < 1 × 10⁻⁵. b, Quantile-quantile plots of these test statistics. In this and subsequent quantile-quantile plots, the shaded region is the 95% concentration band (see Methods). (Panel a plots −log₁₀(P) against chromosome; panel b plots the observed test statistic against the expected chi-squared value.)

We next looked for evidence of population heterogeneity by studying allele frequency differences between the 12 broad geographical regions (defined in Supplementary Fig. 4). The results for these 11-d.f. tests and associated quantile-quantile plots are shown in Fig. 2. Widespread small differences in allele frequencies are evident as an increased slope of the line (Fig. 2b); in addition, a few loci show much larger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variation are listed in Table 1, and Supplementary Fig. 7 shows the way in which their allele frequencies vary geographically. The predominant pattern is variation along a NW/SE axis.
The most likely cause for these marked geographical differences is natural selection, most plausibly in populations ancestral to those now in the UK. Variation due to selection has previously been implicated at LCT (lactase) and major histocompatibility complex (MHC)⁷⁻⁹, and within-UK differentiation at 4p14 has been found independently¹⁰, but others seem to be new findings. All but three of the regions contain known genes. Aside from evolutionary interest, genes showing evidence of natural selection are particularly interesting for the biology of traits such as infectious diseases; possible targets for selection include NADSYN1 (NAD synthetase 1) at 11q13, which could have a role in prevention of pellagra, as well as TLR1 (toll-like receptor 1) at 4p14, for which a role in the biology of tuberculosis and leprosy has been suggested¹⁰.

There may be important population structure that is not well captured by current geographical region of residence. Present implementations of strongly model-based approaches such as STRUCTURE¹¹,¹² are impracticable for data sets of this size, and we reverted to the classical method of principal components¹³,¹⁴, using a subset of 197,175 SNPs chosen to reduce inter-locus linkage disequilibrium. Nevertheless, four of the first six principal components clearly picked up effects attributable to local linkage disequilibrium rather than genome-wide structure. The remaining two components show the same predominant geographical trend from NW to SE but, perhaps unsurprisingly, London is set somewhat apart (Supplementary Fig. 8).

The overall effect of population structure on our association results seems to be small, once recent migrants from outside Europe are excluded. Estimates of over-dispersion of the association trend test statistics (usually denoted λ; ref. 15) ranged from 1.03 and 1.05 for RA and T1D, respectively, to 1.08–1.11 for the remaining diseases.
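To make the over-dispersion estimate concrete, the sketch below computes λ with one widely used estimator: the median of the observed 1-d.f. test statistics divided by the median of the chi-squared distribution with 1 d.f. (about 0.4549). This is an assumption for illustration; the text cites ref. 15 but does not state here which estimator was used, and the function name and example statistics are hypothetical.

```python
import statistics

# Median of the chi-squared distribution with 1 d.f. (a known constant).
CHI2_1DF_MEDIAN = 0.4549364


def overdispersion_lambda(trend_stats):
    """Median-based estimate of the over-dispersion factor lambda.

    trend_stats: iterable of 1-d.f. chi-squared test statistics,
    one per SNP. Values near 1 indicate little inflation.
    """
    stats = list(trend_stats)
    if not stats:
        raise ValueError("need at least one test statistic")
    return statistics.median(stats) / CHI2_1DF_MEDIAN


# Hypothetical statistics, not taken from the study.
example_stats = [0.05, 0.20, 0.50, 1.10, 2.70]
lam = overdispersion_lambda(example_stats)
print(f"lambda = {lam:.3f}")
```

Using the median rather than the mean keeps the estimate insensitive to the handful of very large statistics that true associations produce.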
Some of this over-dispersion could be due to factors other than structure, and this possibility is supported by the fact that inclusion of the two ancestry informative principal components as covariates in the association tests reduced the over-dispersion estimates only slightly (Supplementary Table 6), as did stratification by geographical region. This impression is confirmed on noting that P values with and without correction for structure are similar (Supplementary Fig. 9). We conclude that, for most of the genome, population structure has at most a small confounding effect in our study, and as a consequence the analyses reported below do not correct for structure. In principle, apparent associations in the few genomic regions identified in Table 1 as showing strong geographical differentiation should be interpreted with caution, but none arose in our analyses.

Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value < 1 × 10⁻⁵. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1). (Panel a plots −log₁₀(P) against chromosome; panel b plots the observed test statistic against the expected chi-squared value.)

Table 1 | Highly differentiated SNPs

Chromosome | Genes | Region (Mb) | SNP | Position | P value
2q21 | LCT | 135.16–136.82 | rs1042712 | 136,379,576 | 5.54 × 10⁻¹³
4p14 | TLR1, TLR6, TLR10 | 38.51–38.74 | rs7696175 | 38,643,552 | 1.51 × 10⁻¹²
4q28 | (none listed) | 137.97–138.01 | rs1460133 | 137,999,953 | 4.43 × 10⁻⁸
6p25 | IRF4 | 0.32–0.42 | rs9378805 | 362,727 | 5.39 × 10⁻¹³
6p21 | HLA | 31.10–31.55 | rs3873375 | 31,359,339 | 1.07 × 10⁻¹¹
9p24 | DMRT1 | 0.86–0.88 | rs11790408 | 866,418 | 4.96 × 10⁻⁷
11p15 | NAV2 | 19.55–19.70 | rs12295525 | 19,661,808 | 7.44 × 10⁻⁸
11q13 | NADSYN1, DHCR7 | 70.78–70.93 | rs12797951 | 70,820,914 | 3.01 × 10⁻⁸
12p13 | DYRK4, AKAP3, NDUFA9, RAD51AP1, GALNT8 | 4.37–4.82 | rs10774241 | 4,553,727 | 2.73 × 10⁻⁸
14q12 | HECTD1, AP4S1, STRN3 | 30.41–31.03 | rs17449560 | 30,598,823 | 1.46 × 10⁻⁷
19q13 | GIPR, SNRPD2, QPCTL, SIX5, DMPK, DMWD, RSHL1, SYMPK, FOXA3 | 50.84–51.09 | rs3760843 | 50,980,546 | 4.19 × 10⁻⁷
20q12 | (none listed) | 38.30–38.77 | rs2143877 | 38,526,309 | 1.12 × 10⁻⁹
Xp22 | (none listed) | 2.06–2.08 | rs6644913 | 2,061,160 | 1.23 × 10⁻⁷

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are given with details of the SNP with the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for these SNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

Disease association results

We assessed evidence for association in several ways (see Methods for details), drawing on both classical and bayesian statistical approaches. For polymorphic SNPs on the Affymetrix chip, we performed trend tests (1 degree of freedom¹⁶) and general genotype tests (2 degrees of freedom¹⁶, referred to as genotypic) between each case collection and the pooled controls, and calculated analogous Bayes factors. There are examples from animal models where genetic effects act differently in males and females¹⁷, and to assess this in our data we applied a sex-differentiated test which is sensitive to associations of a different magnitude and/or direction in the two sexes.

Our study also allows us to look for loci which may have an effect in more than one disease. To assess this, we compared our common controls with all cases in each of three natural groupings of diseases: CAD+HT+T2D (metabolic and cardiovascular phenotypes with potential aetiological overlap, for example, involving defects in insulin action); RA+T1D (already known to share common loci); and CD+RA+T1D (all autoimmune diseases).

To help to capture putative disease loci not on the Affymetrix chip we used a new multilocus method in which a population genetics model is applied to our genotype data and the HapMap reference samples to simulate, or impute, genotype data at 2,193,483 HapMap SNPs not on the Affymetrix chip. These imputed, or in silico, genotypes are then tested for association in the same ways as SNPs genotyped in the project.

Before detailing the principal results for each disease, we first summarize our main observations. Table 2 details the findings from the WTCCC scan for the 15 variants for which there was strong prior evidence of association with one or more of the diseases studied, based on extensive replication studies. All but two of these show associations in our study, with the magnitude of the evidence generally consistent with their effect sizes as estimated from prior studies. One of the signals for which we failed to obtain evidence of replication (APOE in CAD) is poorly tagged by the Affymetrix 500K chip. The other (INS in T1D) is represented by a single SNP that marginally failed our study-wide quality control filters (overall missingness 5.2%) but which was nonetheless strongly associated with T1D when examined.
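The 1-d.f. trend test used for the single-SNP case-control comparisons can be sketched as follows. This is a minimal illustration assuming the standard Cochran-Armitage form with additive allele scores 0, 1 and 2; the exact variant used in the study is specified in Methods and ref. 16, not in this section, and the function name and genotype counts are hypothetical.

```python
import math


def trend_test(case_counts, control_counts):
    """Cochran-Armitage trend test with additive scores (0, 1, 2).

    case_counts, control_counts: genotype counts (AA, AB, BB).
    Returns (chi2, p_value) for a 1-d.f. chi-squared statistic.
    """
    scores = (0, 1, 2)
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n = sum(totals)                      # all individuals
    n_case = sum(case_counts)
    n_control = sum(control_counts)
    sum_xr = sum(x * r for x, r in zip(scores, case_counts))
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    denom = n_case * n_control * (n * sum_x2n - sum_xn ** 2)
    if denom == 0:
        raise ValueError("test undefined (monomorphic SNP or empty group)")
    chi2 = n * (n * sum_xr - n_case * sum_xn) ** 2 / denom
    # Upper tail of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, not taken from the study.
chi2, p = trend_test(case_counts=(10, 40, 50), control_counts=(50, 40, 10))
print(f"chi2 = {chi2:.2f}, P = {p:.2e}")
```

Because the test uses a single score per genotype it has one degree of freedom, which is what gives it more power than the 2-d.f. genotypic test when risk rises roughly additively with allele count.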
Quantile-quantile plots for the trend test for each of the seven diseases show only very minor deviations from the null distribution, except in the extreme tails which correspond to associations reported below (Fig. 3). The quantile-quantile plots and the results at positive controls (Table 2) give confidence in the quality of our data and the robustness of our analyses.

Our genome-wide results for the trend test are illustrated in Fig. 4. The single-disease trend and genotypic tests for SNPs on the chip identified 21 signals across the 7 diseases that exceeded a threshold of 5 × 10⁻⁷ (Table 3). For each of these SNPs (except those within the MHC), cluster plots are shown in Supplementary Fig. 10 and 'signal plots' in Fig. 5. These signal plots estimate the likely demarcation of the hit region and show the signal at genotyped and imputed SNPs together with local genomic context. Four further strong (with P < 5 × 10⁻⁷) associations were revealed by the other primary analyses described (Table 3). One locus (in RA) was revealed by the sex-differentiated analysis, two through multilocus approaches (both for T1D) and one through an analysis which combined cases from more than one autoimmune disease (signal plots in Supplementary Figs 11, 12 and 13, respectively).

All of these signals were subjected to visual inspection of cluster plots, and in all cases (with one exception noted below) nearby correlated SNPs also showed a strong signal (see signal plots). Thus, genotyping artefacts are unlikely to be responsible for these associations. Indeed, at the time of writing, 12 of these 25 strong signals represent replications of previously reported findings (only those with extensive prior replication are reported in Table 2). Of the remainder, follow-up studies (reported elsewhere) have confirmed all but one of the loci (ten in total) for which replication has been attempted¹⁰,¹⁹⁻²⁴. The other replication study gave equivocal results.
Of the 18 loci implicated in autoimmune diseases, 5 show associations (P < 0.001) to more than 1 condition, leading to a number of further potential new associations, at least one of which has also been replicated¹⁰.

It is likely that further susceptibility genes will be identified through follow-up of other signals for which the evidence from our scan is less conclusive (see below for some specific examples). For example, there are 58 further signals with single-point P values between 10⁻⁵ and 5 × 10⁻⁷ for which inspection of cluster plots verifies CHIAMO calls (Table 4). As described below, analyses which make use of selected case samples to expand the reference group should also provide a useful route to the prioritization of such putative signals for further analysis. For convenience, the strongest association results are presented separately for each disease in Supplementary Table 7.

Box 1 | Significance levels in genome-wide studies

There has been much debate concerning interpretation of significance levels in genome-wide association studies and whether, and how, these should be corrected for multiple testing. Classical multiple testing theory in statistics is concerned with the problem of 'multiple tests' of a single 'global' null hypothesis. This, we would argue, is a problem far removed from that which faces us in genome-wide association studies, where we face the problem of testing 'multiple hypotheses' (for a particular disease, one hypothesis for each SNP, or region of correlated SNPs, in the genome) and we thus do not subscribe to the view that one should correct significance levels for the number of tests performed to obtain 'genome-wide significance levels'. Nonetheless, our aim is to keep the false positive rate within acceptable bounds and this still leads to the view that very low P values are needed for strong evidence of association. But the factor determining the threshold is not the number of tests performed, but the a priori probability that there is likely to be a true association at any specified location in the genome. Of course, we cannot know this prior probability from objective evidence, but we can perhaps estimate an order of magnitude.

There are two linked questions. The first concerns the choice of an appropriate 'threshold' for reporting possible associations as likely to be genuine. Here the mathematics is quite straightforward if we make the simplifying assumption that we have the same power to detect all true associations. Then we have¹⁸

Posterior odds for true association = Prior odds × Power / Significance threshold

That is, for a given significance threshold, the probability of a true association depends on the prior odds and, crucially, the power. A plausible estimate for the prior odds of true association at any specified locus might be of the order of 100,000:1 against, for example, on the basis of 1,000,000 'independent' regions of the genome and an expectation of 10 detectable genes involved in the condition. (Other plausible estimates might vary from this by an order of magnitude or so in either direction.) Then, assuming a power of 0.5 and a significance threshold of 5 × 10⁻⁷, the posterior odds in favour of a 'hit' being a true association would be 10:1. However, if we relax this significance threshold by a factor of ten, or alternatively if the power were lower by a factor of 10, the posterior odds that a 'hit' is a true association would also be reduced by a factor of ten. This simple mathematical analysis is little affected by allowing for the fact that true associations come in various sizes with varying power to detect them; the above formula is simply modified by interpreting 'power' as the mean power.

The above discussion concerns 'average' properties of 'hits' achieving given significance levels. After the association data are available, a related but different question is whether a particular positive finding is likely to be a true one. For that calculation, the prior odds must be multiplied by the Bayes factor, the ratio of the probability of the observed data under the assumption that there is a true association to its probability under the null hypothesis. As in power calculations, the calculation of Bayes factors requires assumptions about effect sizes (see Methods for details).

A key point from both perspectives is that interpreting the strength of evidence in an association study depends on the likely number of true associations, and the power to detect them which, in turn, depends on effect sizes and sample size. In a less-well-powered study it would be necessary to adopt more stringent thresholds to control the false-positive rate. Thus, when comparing two studies for a particular disease, with a hit with the same MAF and P value for association, the likelihood that this is a true positive will in general be greater for the study that is better powered, typically the larger study. In practice, smaller studies often employ less stringent P-value thresholds, which is precisely the opposite of what should occur.

Several general points are relevant to interpretation of these disease-association data. First, replication studies are required to confirm associations from GWAs. For the reasons given in the box, we regard very low P values (say P < 5 × 10⁻⁷) in our comparatively large sample size as strong evidence for association, and indeed all or most of the loci we find at this level are either already known or have now been confirmed by subsequent replication. Such replication studies are also the substrate for efforts to determine the range of associated phenotypes and to identify and characterize pathologically relevant variation.
Second, failure to detect a prominent association signal in the present study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles)²⁵; difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of significance appropriate for genome-wide analysis, variants with modest effect sizes (odds ratio (OR) < 1.2).

Third, whereas the association signals detected can help to define regions of interest, they cannot provide unambiguous identification of the causal genes. Nevertheless, assessments on the basis of positional candidacy carry considerable weight, and, as we show, these already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies, will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved.
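The threshold argument in Box 1 can be checked numerically. The sketch below simply evaluates the stated relation, posterior odds = prior odds × power / significance threshold, with the values quoted in the box (prior odds of 100,000:1 against, power 0.5, threshold 5 × 10⁻⁷). The function names are hypothetical, and the conversion from odds to probability is added only for convenience.

```python
def posterior_odds(prior_odds, power, threshold):
    """Posterior odds that a 'hit' is a true association (Box 1 relation)."""
    return prior_odds * power / threshold


def odds_to_probability(odds):
    """Convert odds in favour into a probability."""
    return odds / (1.0 + odds)


# Values quoted in Box 1: prior odds of 100,000:1 against a true association.
prior = 1 / 100_000

# Power 0.5 at a threshold of 5e-7 gives odds of about 10:1 in favour.
strict = posterior_odds(prior, power=0.5, threshold=5e-7)

# Relaxing the threshold tenfold cuts the posterior odds tenfold.
relaxed = posterior_odds(prior, power=0.5, threshold=5e-6)

# Lowering the power tenfold has the same effect.
low_power = posterior_odds(prior, power=0.05, threshold=5e-7)

print(f"strict: {strict:.1f}:1, relaxed: {relaxed:.1f}:1, "
      f"low power: {low_power:.1f}:1")
print(f"probability at strict threshold = {odds_to_probability(strict):.3f}")
```

For a single observed finding, the box's second calculation replaces the ratio power / threshold with the Bayes factor, so the same multiplication applies with a different second term.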
We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results from the analyses described above, and consideration of an expanded reference group, described below.

Table 2 | Evidence for signal of association at previously robustly replicated loci

Collection | Gene | Chromosome | Reported SNP | WTCCC SNP | HapMap r² | Trend P value | Genotypic P value
CAD | APOE | 19q13 | * | rs4420638 | - | 1.7 × 10⁻¹ | 1.7 × 10⁻¹
CD | NOD2 | 16q12 | rs2066844 | rs17221417 | 0.23 | 9.4 × 10⁻¹² | 4.0 × 10⁻¹¹
CD | IL23R | 1p31 | rs11209026 | rs11805303 | 0.01 | 6.5 × 10⁻¹³ | 5.9 × 10⁻¹²
RA | HLA-DRB1 | 6p21 | * | rs615672 | - | 2.6 × 10⁻²⁷ | 7.5 × 10⁻²⁷
RA | PTPN22 | 1p13 | rs2476601 | rs6679677 | 0.75 | 4.9 × 10⁻²⁶ | 5.6 × 10⁻²⁵
T1D | HLA-DRB1 | 6p21 | | 9270986 | - | 4.0 × 10⁻¹¹⁶ | 2.3 × 10⁻¹²²
T1D | INS | 11p15 | rs689 | † | - | - | -
T1D | CTLA4 | 2q33 | rs3087243 | rs3087243 | 1 | 2.5 × 10⁻⁵ | 1.8 × 10⁻⁵
T1D | PTPN22 | 1p13 | rs2476601 | rs6679677 | 0.75 | 1.2 × 10⁻²⁶ | 5.4 × 10⁻²⁶
T1D | IL2RA | 10p15 | rs706778 | rs2104286 | 0.25 | 8.0 × 10⁻⁶ | 4.3 × 10⁻⁵
T1D | IFIH1 | 2q24 | rs1990760 | rs3788964 | 0.26 | 1.9 × 10⁻³ | 7.6 × 10⁻³
T2D | PPARG | 3p25 | rs1801282 | rs1801282 | 1 | 1.3 × 10⁻³ | 5.4 × 10⁻³
T2D | KCNJ11 | 11p15 | rs5219 | rs5215 | 0.9 | 1.3 × 10⁻³ | 5.6 × 10⁻³
T2D | TCF7L2 | 10q25 | rs7903146 | rs4506565 | 0.92 | 5.7 × 10⁻¹³ | 5.1 × 10⁻¹²

Where information on the strength of association at a particular SNP had been previously published and replicated we tabulated the P value of both the trend and genotype test at the same SNP (if in our study), or the best tag SNP (defined to be the SNP with highest r² with the reported SNP, calculated in the CEU sample of the HapMap project). Positions are in NCBI build-35 coordinates. *Previous reports relate to haplotypes rather than single SNPs. †Not well tagged by SNPs that pass the quality control, see main text.

Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency > 1% and missing data rate < 1%. SNPs that were visually inspected and revealed genotype calling problems were excluded. These filters were chosen to minimize the influence of genotype-calling artefacts. Each quantile-quantile plot shown in black involves around 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and for HT there are no such SNPs). The blue quantile-quantile plots show that departures in the extreme tail of the distribution of test statistics are due to regions with a strong signal for association. (Panels: BD, CAD, CD, HT, RA, T1D, T2D; each plots the observed test statistic against the expected chi-squared value.)

Bipolar disorder (BD). Bipolar disorder (BD; manic depressive illness²⁶) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk²⁷,²⁸. The estimated sibling recurrence risk (λₛ) is 7–10 and heritability 80–90%²⁷,²⁸. The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses.
Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation [29].

Several genomic regions have been implicated in linkage studies [30] and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular, association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin 1) and DTNBP1 (dystrobrevin binding protein 1) [31].

The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10^-8; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene 'disrupted in schizophrenia 1' (DISC1) [32], the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia [33].

Of the four regions showing association at P < 5 × 10^-7 in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10^-7) is KCNC2, which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias and paralyses [34]. It is possible that this may extend to episodic disturbances of mood and behaviour.

[Figure 4: seven panels (Bipolar disorder, Coronary artery disease, Crohn's disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Type 2 diabetes); x axis, chromosome (1–22, X); y axis, -log10(P), 0–15.]

Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, -log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, is plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at -log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
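The significance tiers used in this article, and the -log10(P) scale on which the genome-wide scans of Figure 4 are drawn, can be illustrated with a short sketch. It assumes only the thresholds quoted in the text (P < 5 × 10^-7 for the strongest signals, P < 1 × 10^-5 for the second tier and for the points highlighted in Figure 4, and truncation of the plotted axis at 15). The function names and tier labels are hypothetical and are not taken from the study's software; the three P values are ones quoted in the text for BD.

```python
import math

STRONG = 5e-7     # threshold used in the text for the strongest signals (Table 3)
MODERATE = 1e-5   # P values below this are highlighted in Figure 4 (Table 4 tier)
AXIS_CAP = 15.0   # Figure 4 panels are truncated at -log10(P) = 15


def neg_log10(p: float) -> float:
    """Return -log10(P), the quantity plotted on the y axis of Figure 4."""
    return -math.log10(p)


def plotted_height(p: float) -> float:
    """Height at which a SNP would be drawn once the axis is truncated."""
    return min(neg_log10(p), AXIS_CAP)


def tier(p: float) -> str:
    """Hypothetical label for the evidence tier that a P value falls into."""
    if p < STRONG:
        return "strong"
    if p < MODERATE:
        return "moderate"
    return "below threshold"


# P values quoted in the text for BD (rs420259 is a genotypic-test P value,
# whereas Figure 4 itself plots trend-test P values).
bd_signals = {"rs420259": 6.3e-8, "rs1526805": 2.2e-7, "rs7680321": 6.2e-5}

for snp, p in bd_signals.items():
    print(f"{snp}: -log10(P) = {neg_log10(p):.2f}, tier = {tier(p)}")
```

A marker such as the T1D MHC signal, with a P value far below 10^-15, would be drawn at the cap of 15 by `plotted_height`, which is why the caption notes that some markers exceed the plotted range.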
Amongst the other higher ranked signals in the BD data set (Supplementary Table 7), there is support for the previously suggested importance of GABA neurotransmission (rs7680321 (P = 6.2 × 10^-5) in GABRB1, encoding a ligand-gated ion channel (GABA-A receptor, beta 1)) [35], glutamate neurotransmission (rs1485171 (P = 9.7 × 10^-5) in GRM7 (glutamate receptor, metabotropic 7)) [35] and synaptic function (rs11089599 (P = 7.2 × 10^-5) in SYN3 (synapsin III) [36]).

We note that a broad range of genetic and non-genetic data point to the importance of analyses that use alternative approaches to phenotype definition, including symptom dimensions [31]. Although beyond the scope of the current paper, such analyses will be required to maximize the potential of the current BD data set.

Coronary artery disease (CAD). Coronary artery disease (coronary atherosclerosis) is a chronic degenerative condition in which lipid and fibrous matrix is deposited in the walls of the coronary arteries to form atheromatous plaques [37]. It may be clinically silent or present with angina pectoris or acute myocardial infarction. Pathogenesis is complex, with endothelial dysfunction, oxidative stress and inflammation contributing to development and instability of the atherosclerotic plaque [37].

In addition to lifestyle and environmental factors, genes are important in the aetiology of CAD [38]. For early myocardial infarction, estimates of λs range from ~2 to ~7 (ref. 39). Genetic variation is thought likely to influence risk of CAD both directly and through effects on known CAD risk factors including hypertension, diabetes and hypercholesterolaemia. Genome-wide linkage studies have mapped several loci that may affect susceptibility to CAD/myocardial infarction [40], although for only two of these has the likely gene been identified (ALOX5AP (arachidonate 5-lipoxygenase-activating protein) and LTA4H (leukotriene A4 hydrolase)) [41,42].
Association studies have identified several plausible genetic variants affecting lipids, thrombosis, inflammation or vascular biology but for most the evidence is not yet conclusive [40]. We did not find evidence for strong association at any of these genes within our study (Table 2 and Supplementary Table 10).

The most notable new finding for CAD is the powerful association on chromosome 9p21.3 (Table 3; Fig. 5). Although the strongest signal is seen at rs1333049 (P = 1.8 × 10^-14), associations are seen for SNPs across > 100 kilobases. This region has not been highlighted in previous studies of CAD or myocardial infarction [40,43]. The region of interest contains the coding sequences of genes for two cyclin dependent kinase inhibitors, CDKN2A (encoding p16INK4a) and CDKN2B (p15INK4b), although the most closely associated SNP is some distance removed. Both genes have multiple isoforms, have an important role in the regulation of the cell cycle and are widely expressed [44], with CDKN2B known to be expressed in the macrophages but not the smooth muscle cells of fibrofatty lesions [45,46]. It is of interest that expression of CDKN2B is induced by transforming growth factor beta (TGF-β) and that the TGF-β signalling system is implicated in the pathogenesis of human atherosclerosis [45,46]. Besides CDKN2A and CDKN2B, the only other known gene nearby is MTAP, which encodes methylthioadenosine phosphorylase, an enzyme that contributes to polyamine metabolism and is important for the salvage of both adenine and methionine. MTAP is ubiquitously expressed, including in the cardiovascular system [47]. Further work is required to determine whether the CAD association at this locus is mediated through CDKN2A/B, MTAP or some other mechanism. The same region also shows replicated evidence of association to T2D in the WTCCC and other data sets [19,21,22], though different SNPs seem to be involved.
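As a rough check on the size of the 9p21 effect, the allele frequencies reported for rs1333049 in Table 3 (0.474 in controls and 0.554 in cases, for the C allele that is both the minor and the risk allele) can be converted into a per-allele odds ratio. The sketch below is a back-of-envelope calculation, not the analysis used in the study. The function name is hypothetical, and the multiplicative model used for the two-copy value is an assumption made purely for illustration.

```python
def allelic_odds_ratio(case_freq: float, control_freq: float) -> float:
    """Odds of the allele among case chromosomes divided by the odds among controls."""
    case_odds = case_freq / (1.0 - case_freq)
    control_odds = control_freq / (1.0 - control_freq)
    return case_odds / control_odds


# Frequencies of the rs1333049 C allele as reported in Table 3.
per_allele = allelic_odds_ratio(case_freq=0.554, control_freq=0.474)

# Assumption: a multiplicative model, under which carrying two copies
# squares the per-allele odds ratio.
two_copies = per_allele ** 2

print(f"per-allele odds ratio ~ {per_allele:.2f}")   # ~ 1.38
print(f"two-copy odds ratio   ~ {two_copies:.2f}")   # ~ 1.90
```

The two-copy value is close to the homozygote odds ratio of 1.9 reported in Table 3, whereas the per-allele value is somewhat below the reported heterozygote odds ratio of 1.47, which is what one would expect if the genotype-specific estimates do not follow a strictly multiplicative pattern.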
Table 3 | Regions of the genome showing the strongest association signals

Collection | Chromosome | Region (Mb) | SNP | Trend P value | Genotypic P value | log10(BF), additive | log10(BF), general | Risk allele | Minor allele | Heterozygote odds ratio | Homozygote odds ratio | Control MAF | Case MAF
Standard analysis
BD | 16p12 | 23.3–23.62 | rs420259 | 2.19 × 10^-4 | 6.29 × 10^-8 | 1.96 | 4.79 | A | G | 2.08 (1.60–2.71) | 2.07 (1.6–2.69) | 0.282 | 0.248
CAD | 9p21 | 21.93–22.12 | rs1333049 | 1.79 × 10^-14 | 1.16 × 10^-13 | 11.66 | 11.19 | C | C | 1.47 (1.27–1.70) | 1.9 (1.61–2.24) | 0.474 | 0.554
CD | 1p31 | 67.3–67.48 | rs11805303 | 6.45 × 10^-13 | 5.85 × 10^-12 | 10.07 | 9.41 | T | T | 1.39 (1.22–1.58) | 1.86 (1.54–2.24) | 0.317 | 0.391
CD | 2q37 | 233.92–234 | rs10210302 | 7.10 × 10^-14 | 5.26 × 10^-14 | 11.11 | 11.28 | T | C | 1.19 (1.01–1.41) | 1.85 (1.56–2.21) | 0.481 | 0.402
CD | 3p21 | 49.3–49.87 | rs9858542 | 7.71 × 10^-7 | 3.58 × 10^-8 | 4.24 | 5.22 | A | A | 1.09 (0.96–1.24) | 1.84 (1.49–2.26) | 0.282 | 0.331
CD | 5p13 | 40.32–40.66 | rs17234657 | 2.13 × 10^-13 | 1.99 × 10^-12 | 10.41 | 9.89 | G | G | 1.54 (1.34–1.76) | 2.32 (1.59–3.39) | 0.125 | 0.181
CD | 5q33 | 150.15–150.31 | rs1000113 | 5.10 × 10^-8 | 3.15 × 10^-7 | 5.36 | 5.01 | T | T | 1.54 (1.31–1.82) | 1.92 (0.92–4.00) | 0.067 | 0.098
CD | 10q21 | 64.06–64.31 | rs10761659 | 2.68 × 10^-7 | 1.75 × 10^-6 | 4.69 | 4.13 | G | A | 1.23 (1.05–1.45) | 1.55 (1.3–1.84) | 0.461 | 0.406
CD | 10q24 | 101.26–101.32 | rs10883365 | 1.41 × 10^-8 | 5.82 × 10^-8 | 5.91 | 5.48 | G | G | 1.2 (1.03–1.39) | 1.62 (1.37–1.92) | 0.477 | 0.537
CD | 16q12 | 49.02–49.4 | rs17221417 | 9.36 × 10^-12 | 3.98 × 10^-11 | 8.93 | 8.47 | G | G | 1.29 (1.13–1.46) | 1.92 (1.58–2.34) | 0.287 | 0.356
CD | 18p11 | 12.76–12.91 | rs2542151 | 4.56 × 10^-8 | 2.03 × 10^-7 | 5.42 | 5.00 | G | G | 1.3 (1.14–1.48) | 2.01 (1.46–2.76) | 0.163 | 0.208
RA | 1p13 | 113.54–114.16 | rs6679677 | 4.90 × 10^-26 | 5.55 × 10^-25 | 22.36 | 21.99 | A | A | 1.98 (1.72–2.27) | 3.32 (1.93–5.69) | 0.096 | 0.168
RA | 6 | MHC | rs6457617* | 3.44 × 10^-76 | 5.18 × 10^-75 | 74.84 | 73.18 | T | T | 2.36 (1.97–2.84) | 5.21 (4.31–6.30) | 0.489 | 0.685
T1D | 1p13 | 113.54–114.16 | rs6679677 | 1.17 × 10^-26 | 5.43 × 10^-26 | 23.07 | 22.83 | A | A | 1.82 (1.59–2.09) | 5.19 (3.15–8.55) | 0.096 | 0.169
T1D | 6 | MHC | rs9272346* | 2.42 × 10^-134 | 5.47 × 10^-134 | 141.9 | 142.2 | A | G | 5.49 (4.83–6.24) | 18.52 (12.69–27.03) | 0.387 | 0.150
T1D | 12q13 | 54.64–55.09 | rs11171739 | 1.14 × 10^-11 | 9.71 × 10^-11 | 8.89 | 8.24 | C | C | 1.34 (1.17–1.54) | 1.75 (1.48–2.06) | 0.423 | 0.493
T1D | 12q24 | 109.82–111.49 | rs17696736 | 2.17 × 10^-15 | 1.51 × 10^-14 | 12.53 | 11.88 | G | G | 1.34 (1.16–1.53) | 1.94 (1.65–2.29) | 0.424 | 0.506
T1D | 16p13 | 10.93–11.37 | rs12708716 | 9.24 × 10^-8 | 4.92 × 10^-7 | 5.15 | 4.70 | A | G | 1.19 (0.97–1.45) | 1.55 (1.27–1.89) | 0.350 | 0.297
T2D | 6p22 | 20.63–20.84 | rs9465871 | 1.02 × 10^-6 | 3.34 × 10^-7 | 4.15 | 3.98 | C | C | 1.18 (1.04–1.34) | 2.17 (1.6–2.95) | 0.178 | 0.218
T2D | 10q25 | 114.71–114.81 | rs4506565 | 5.68 × 10^-13 | 5.05 × 10^-12 | 10.14 | 9.43 | T | T | 1.36 (1.2–1.54) | 1.88 (1.56–2.27) | 0.324 | 0.395
T2D | 16q12 | 52.36–52.41 | rs9939609 | 5.24 × 10^-8 | 1.91 × 10^-7 | 5.35 | 5.05 | A | A | 1.34 (1.17–1.52) | 1.55 (1.3–1.84) | 0.398 | 0.453
Multi-locus analysis
T1D | 4q27 | 123.26–123.92 | rs6534347 | 4.48 × 10^-7 | 1.83 × 10^-6 | 5.15 | 4.69 | A | A | 1.30 (1.10–1.55) | 1.49 (1.25–1.78) | 0.351 | 0.402
T1D | 12p13 | 9.71–9.86 | rs3764021 | 7.19 × 10^-5 | 5.08 × 10^-8 | 2.12 | 4.55 | C | T | 1.57 (1.38–1.79) | 1.48 (1.25–1.75) | 0.467 | 0.426
Sex differentiated analysis
RA | 7q32 | 130.80–130.84 | rs11761231 | 3.91 × 10^-7 | 1.37 × 10^-6 | - | - | G | A | 1.44 (1.19–1.75) | 1.64 (1.35–1.99) | 0.375 | 0.327
Combined cases
RA+T1D | 10p15 | 6.07–6.17 | rs2104286 | 5.92 × 10^-8 | 2.52 × 10^-7 | 5.26 | 4.45 | T | C | 1.35 (1.11–1.65) | 1.62 (1.34–1.97) | 0.286 | 0.245

Regions with at least one SNP with a P value of less than 5 × 10^-7 for our primary analyses. The log10 value of the Bayes factor (BF) for the bayesian analysis corresponding to the trend and genotypic tests is also given. Region marks the boundaries of signal defined by recombination and return of test statistics to background levels. The minor allele is defined in the controls and its frequency in that group as well as the case sample is reported. MAF, minor allele frequency. Odds ratios are given with their intervals in parentheses. Cluster plots for each SNP have been inspected visually, and are shown in Supplementary Fig. 10. Positions are in NCBI build-35 coordinates.
* Multiple SNPs in the MHC region are significant; we report the most extreme.

None of the loci showing more modest associations with CAD (Table 4) includes genes hitherto strongly implicated in the pathogenesis of CAD. A potentially interesting association is at rs6922269 (P = 6.3 × 10^-6), an intronic SNP in MTHFD1L, which encodes methylenetetrahydrofolate dehydrogenase (NADP+-dependent) 1-like, the mitochondrial isozyme of C1-tetrahydrofolate (THF) synthase [48,49]. C1-THF synthases interconvert the one carbon units carried by the biologically active form of folic acid, C1-tetrahydrofolate. These are used in a variety of cellular processes including purine and methionine synthesis [48]. Another enzyme in the same pathway, methylene THF reductase (encoded by MTHFR), is subject to a common mutation which influences plasma homocysteine level [50] and has been associated with increased risk of coronary and other atherosclerotic disease [51]. The possibility of a link between variants in MTHFD1L and CAD risk is supported by evidence that MTHFD1L activity also contributes to plasma homocysteine [52] and that defects in the MTHFD1L pathway may increase plasma homocysteine level [48,53].

An intronic SNP in ADAMTS17 (a disintegrin and metalloproteinase with thrombospondin motifs 17), which showed modest association (rs1994016; P = 1.1 × 10^-4) in our primary analysis, showed a much stronger association in the expanded reference group analysis (see below and Supplementary Table 9). Although the specific function of ADAMTS17 has not been determined, other members of the ADAMTS family have been implicated in vascular extracellular matrix degradation, vascular remodelling and atherosclerosis [54,55].

Crohn's disease (CD). Crohn's disease is a common form of chronic inflammatory bowel disease [56]. The pathogenic mechanisms are poorly understood, but probably involve a dysregulated immune response to commensal intestinal bacteria and possibly defects in mucosal barrier function or bacterial clearance [57].
Genetic predisposition to CD is suggested by a λs of 17–35 and by twin studies that contrast monozygotic concordance rates of 50% with only 10% in dizygotic pairs [58,59].

A number of CD-susceptibility loci have previously been defined, and all of these generate strong signals in our data (Table 2). In 2001, positional cloning identified CARD15 (caspase recruitment domain family, member 15; NOD2) as the first confirmed CD-susceptibility gene [60,61]. In the present study, this locus is represented by rs17221417 (P = 9.4 × 10^-12). A second association, on chromosome 5q31 (ref. 62), has been widely replicated, although the identity of the causative gene is disputed owing to extensive regional linkage disequilibrium [63]. Here, the previously described risk haplotype is tagged by rs6596075 (P = 5.4 × 10^-7).

More recent studies have identified four further CD-susceptibility loci, all of which are strongly replicated in the present study. The association between CD and SNPs within IL23R (interleukin 23 receptor) [63] is here represented by a cluster of associated SNPs, including rs11805303 (P = 6.5 × 10^-13). The strongest signal for CD in the present scan (at rs10210302; P = 7.1 × 10^-14) maps to the ATG16L1 (ATG16 autophagy related 16-like 1) gene and is in strong linkage disequilibrium (r² = 0.97) with a non-synonymous SNP (T300A, rs2241880) associated with CD in a German non-synonymous SNP scan [64]. The third is a locus at chromosome 10q21 around rs10761659 (P = 2.7 × 10^-7) and represents a non-coding intergenic SNP mapping 14-kb telomeric to gene ZNF365 and 55-kb centromeric to the pseudogene antiquitin-like 4, a recently detected signal [65]. Finally, strong association with a cluster of SNPs around rs17234657 (P = 2.1 × 10^-13) within a 1.2 Mb gene desert on chromosome 5p13.1 recapitulates the finding of a recent GWA study [66].

[Figure 5: one signal plot per hit region. Axes: chromosomal position (Mb); -log10(P); recombination rate (cM per Mb); cM from hit SNP; tracks for genes and conservation. Panels and the genes listed in each:
BD hit region, chromosome 16: COG7, GGA2, EARS2, UBPH/UBFD1, NDUFAB1, PALB2, DCTN5, PLK1, ERN2
CAD hit region, chromosome 9: CDKN2A, CDKN2B
CD hit region, chromosome 1: IL23R
CD hit region, chromosome 2: ATG16L1
CD hit region, chromosome 3: USP4, GPX1, RHOA, TCTA, AMT, NICN1, DAG1, BSN, APEH, MST1, RNF123, AMIGO3, GMPPB, IHPK1, LOC389118, C3orf54, UBE1L, TRAIP
CD hit region, chromosome 5: no genes listed
CD hit region, chromosome 5: MST150, ZNF300
CD hit region, chromosome 10: ZNF365, C10orf22, EGR2
CD hit region, chromosome 10: NKX2-3
CD hit region, chromosome 16: NKD1, SLIC1, NOD2, CYLD
CD hit region, chromosome 18: PTPN2
RA hit region, chromosome 1: MAGI3, PHTF1, RSBN1, PTPN22, C1orf178, AP4B1, DCLRE1B
T1D hit region, chromosome 1: MAGI3, PHTF1, RSBN1, PTPN22, C1orf178, AP4B1, DCLRE1B
T1D hit region, chromosome 12: SILV, CDK2, RAB5B, SUOX, IKZF4, RPS26, ERBB3, PA2G4, RPL41, ZC3H10, FAM62A, MYL6B, MYL6, SMARCC2, RNF41, OBFC2B, SLC39A5, COQ10A, TMEM4, USP52, IL23A, STAT2, APOF
T1D hit region, chromosome 12: CUTL2, FAM109A, SH2B3, ATXN2, BRAP, ACAD10, ALDH2, MAPKAPK5, TMEM116, ERP29, C12orf30, TRAFD1, C12orf51, RPL6, PTPN11
T1D hit region, chromosome 16: DEXI, KIAA0350, SOCS1, TNP2, PRM3, PRM2, PRM1, C16orf75
T2D hit region, chromosome 6: CDKAL1
T2D hit region, chromosome 10: TCF7L2
T2D hit region, chromosome 16: FTO]

Figure 5 | Regions of the genome showing strong evidence of association. Characteristics of genomic regions 1.25 Mb to either side of 'hit SNPs' (the SNPs with lowest P values). Region boundaries (vertical dotted lines) were chosen to coincide with locations where test statistics returned to background levels and, where possible, recombination hotspots. Upper panel, -log10(P values) for the test (trend or genotypic) with the smallest P value at the hit SNP. Black points represent SNPs typed in the study, and grey points represent SNPs whose genotypes were imputed. SNPs imputed with higher confidence are shown in darker grey. Middle panel, fine-scale recombination rate (centimorgans per Mb) estimated from Phase II HapMap. The purple line shows the cumulative genetic distance (in cM) from the hit SNP. Lower panel, known genes, and sequence conservation in 17 vertebrates. Known genes (orange) in the hit region are listed in the upper right part of each plot in chromosomal order, starting at the left edge of the region. The top track shows plus-strand genes and the middle track shows minus-strand genes. Sequence conservation (bottom track) scores are based on the phylogenetic hidden Markov model phastCons. Highly conserved regions (phastCons score ≥ 600) are shown in blue. Information in middle and lower panels is taken from the UCSC Genome Browser. Positions are in NCBI build-35 coordinates. See Supplementary Information on 'signal plots'.

The current study identifies four further new strong association signals in CD, located on chromosomes 3p21, 5q33, 10q24 and 18p11 (Table 3; Fig. 5). Successful replication for all four loci is reported elsewhere [23].

The first of these includes several SNPs around IRGM (immunity-related guanosine triphosphatase; the human homologue of the mouse Irgm/Lrg47), the strongest signal being at rs1000113 (P = 5.1 × 10^-8). IRGM encodes a GTP-binding protein which induces autophagy and is involved in elimination of intracellular bacteria, including Mycobacterium tuberculosis [67]. Reduced function and/or activity of this gene would be expected to lead to persistence of intracellular bacteria, consistent with existing models of CD pathogenesis [57] and the recent ATG16L1 association [64] (see above).

The second novel CD association is seen at rs9858542 (P = 7.7 × 10^-7), a synonymous coding SNP within the BSN (bassoon) gene on chromosome 3p21.
BSN is thought to encode a scaffold protein expressed in brain and involved in neurotransmitter release; a more plausible regional candidate is MST1 (macrophage stimulating 1), which encodes a protein influencing motile activity and phagocytosis by resident peritoneal macrophages [68].

Table 4 | Regions of the genome showing moderate evidence of association

Collection | Chromosome | Region (Mb) | SNP | Trend P value | Genotypic P value | log10(BF), additive | log10(BF), general | Risk allele | Minor allele | Heterozygote odds ratio | Homozygote odds ratio | Control MAF | Case MAF
BD | 2p25 | 11.94–12.00 | rs4027132 | 1.31 × 10^-5 | 9.68 × 10^-6 | 3.07 | 2.84 | A | G | 1.39 (1.19–1.64) | 1.51 (1.27–1.79) | 0.459 | 0.414
BD | 2q12 | 104.41–104.58 | rs7570682 | 3.11 × 10^-6 | 1.64 × 10^-5 | 3.68 | 3.23 | A | A | 1.23 (1.09–1.40) | 1.64 (1.28–2.12) | 0.214 | 0.255
BD | 2q14 | 115.63–116.11 | rs1375144 | 2.43 × 10^-6 | 1.31 × 10^-5 | 3.80 | 2.92 | A | G | 1.32 (1.07–1.63) | 1.59 (1.29–1.96) | 0.337 | 0.291
BD | 2q37 | 241.23–241.28 | rs2953145 | 1.11 × 10^-5 | 6.57 × 10^-6 | 3.22 | 3.50 | C | G | 1.84 (1.31–2.58) | 2.14 (1.53–2.98) | 0.226 | 0.189
BD | 3p23 | 32.26–32.33 | rs4276227 | 4.57 × 10^-6 | 2.62 × 10^-5 | 3.52 | 3.04 | C | T | 1.20 (0.99–1.46) | 1.49 (1.23–1.81) | 0.371 | 0.326
BD | 3q27 | 184.29–184.40 | rs683395 | 2.30 × 10^-6 | 5.11 × 10^-6 | 3.87 | 3.73 | G | G | 1.47 (1.26–1.71) | 1.30 (0.69–2.46) | 0.080 | 0.109
BD | 6p21 | 42.82–42.86 | rs6458307 | 3.43 × 10^-1 | 4.35 × 10^-6 | -0.80 | 2.84 | T | T | 0.84 (0.75–0.96) | 1.39 (1.13–1.69) | 0.312 | 0.321
BD | 8p12 | 34.22–34.61 | rs2609653 | 6.86 × 10^-6 | - | 3.44 | 3.21 | C | C | 1.43 (1.19–1.71) | 3.62 (1.26–10.44) | 0.052 | 0.074
BD | 9q32 | 114.31–114.39 | rs10982256 | 8.80 × 10^-6 | 4.41 × 10^-5 | 3.23 | 2.37 | T | C | 1.26 (1.08–1.47) | 1.47 (1.24–1.74) | 0.471 | 0.425
BD | 14q22 | 57.17–57.24 | rs10134944 | 3.21 × 10^-6 | 6.89 × 10^-6 | 3.73 | 3.59 | T | T | 1.45 (1.24–1.68) | 1.32 (0.74–2.33) | 0.086 | 0.115
BD | 14q32 | 103.43–103.62 | rs11622475 | 2.10 × 10^-6 | 8.14 × 10^-6 | 3.87 | 3.24 | C | T | 1.13 (0.89–1.44) | 1.47 (1.17–1.86) | 0.300 | 0.256
BD | 16q12 | 51.36–51.50 | rs1344484 | 1.64 × 10^-6 | 1.03 × 10^-5 | 3.94 | 3.41 | T | C | 1.24 (1.03–1.48) | 1.52 (1.27–1.82) | 0.402 | 0.353
BD | 20p13 | 3.70–3.73 | rs3761218 | 4.43 × 10^-5 | 6.71 × 10^-6 | 2.58 | 3.18 | T | C | 0.97 (0.81–1.15) | 1.31 (1.09–1.57) | 0.397 | 0.356
CAD | 1q43 | 236.77–236.85 | rs17672135 | 1.04 × 10^-4 | 2.35 × 10^-6 | 2.36 | 3.88 | T | C | 0.70 (0.61–0.81) | 1.32 (0.79–2.22) | 0.134 | 0.108
CAD | 5q21 | 99.98–100.11 | rs383830 | 5.72 × 10^-6 | 1.34 × 10^-5 | 3.49 | 3.26 | T | A | 1.60 (1.16–2.21) | 1.92 (1.40–2.63) | 0.220 | 0.182
CAD | 6q25 | 151.34–151.42 | rs6922269 | 6.33 × 10^-6 | 1.50 × 10^-5 | 3.38 | 3.14 | A | A | 1.17 (1.04–1.32) | 1.65 (1.32–2.06) | 0.253 | 0.294
CAD | 16q23 | 81.72–81.79 | rs8055236 | 9.73 × 10^-6 | 5.60 × 10^-6 | 3.28 | 3.59 | G | T | 1.91 (1.33–2.74) | 2.23 (1.56–3.17) | 0.198 | 0.162
CAD | 19q12 | 34.74–34.78 | rs7250581 | 9.12 × 10^-6 | 2.50 × 10^-5 | 3.30 | 2.87 | G | A | 1.06 (0.79–1.43) | 1.40 (1.05–1.86) | 0.220 | 0.182
CAD | 22q12 | 25.01–25.06 | rs688034 | 6.90 × 10^-6 | 3.75 × 10^-6 | 3.33 | 3.15 | T | T | 1.11 (0.98–1.25) | 1.62 (1.34–1.95) | 0.310 | 0.355
CD | 1q24 | 169.53–169.67 | rs12037606 | 1.79 × 10^-6 | 1.09 × 10^-5 | 3.89 | 3.35 | A | A | 1.22 (1.07–1.40) | 1.52 (1.28–1.82) | 0.388 | 0.438
CD | 5q23 | 131.40–131.90 | rs6596075 | 5.40 × 10^-7 | 3.19 × 10^-6 | 4.54 | 4.01 | C | G | 1.55 (1.00–2.39) | 2.06 (1.35–3.14) | 0.166 | 0.127
CD | 6p22 | 20.83–20.85 | rs6908425 | 5.13 × 10^-6 | 1.10 × 10^-5 | 3.55 | 3.38 | C | T | 1.63 (1.18–2.25) | 1.95 (1.43–2.67) | 0.230 | 0.190
CD | 6p21 | 32.79–32.91 | rs9469220 | 8.65 × 10^-7 | 2.28 × 10^-6 | 4.19 | 3.92 | A | A | 1.14 (0.98–1.32) | 1.52 (1.28–1.79) | 0.481 | 0.534
CD | 6q23 | 138.06–138.17 | rs7753394 | 4.42 × 10^-6 | 2.59 × 10^-5 | 3.52 | 2.99 | C | C | 1.21 (1.04–1.40) | 1.48 (1.25–1.76) | 0.482 | 0.531
CD | 7q36 | 147.62–147.70 | rs7807268 | 6.89 × 10^-6 | 4.42 × 10^-6 | 3.33 | 3.58 | G | G | 1.38 (1.20–1.60) | 1.47 (1.24–1.74) | 0.462 | 0.509
CD | 10p15 | 38.52–38.57 | rs6601764 | 2.56 × 10^-6 | 8.95 × 10^-6 | 3.74 | 3.01 | C | C | 1.16 (1.01–1.33) | 1.52 (1.28–1.80) | 0.408 | 0.458
CD | 19q13 | 50.89–51.07 | rs8111071 | 6.14 × 10^-6 | 1.75 × 10^-5 | 3.48 | 3.29 | G | G | 1.47 (1.25–1.73) | 1.28 (0.56–2.88) | 0.070 | 0.096
HT | 1q43 | 235.67–235.79 | rs2820037 | 5.76 × 10^-5 | 7.66 × 10^-7 | 2.54 | 3.99 | T | T | 1.54 (1.03–2.31) | 1.09 (0.74–1.62) | 0.141 | 0.171
HT | 8q24 | 140.17–140.35 | rs6997709 | 7.88 × 10^-6 | 4.36 × 10^-5 | 3.32 | 2.60 | G | T | 1.20 (0.94–1.52) | 1.49 (1.18–1.89) | 0.285 | 0.244
HT | 12p12 | 24.86–24.95 | rs7961152 | 7.39 × 10^-6 | 3.03 × 10^-5 | 3.29 | 2.51 | A | A | 1.16 (1.01–1.32) | 1.47 (1.25–1.74) | 0.415 | 0.461
HT | 12q23 | 100.52–100.58 | rs11110912 | 9.18 × 10^-6 | 1.94 × 10^-5 | 3.27 | 3.11 | G | G | 1.33 (1.18–1.51) | 1.34 (0.96–1.86) | 0.165 | 0.200
HT | 13q21 | 66.90–67.04 | rs1937506 | 9.23 × 10^-6 | 4.53 × 10^-5 | 3.25 | 2.85 | G | A | 1.33 (1.04–1.69) | 1.60 (1.26–2.02) | 0.289 | 0.248
HT | 15q26 | 94.60–94.67 | rs2398162 | 7.85 × 10^-6 | 5.67 × 10^-6 | 3.33 | 3.40 | A | G | 0.97 (0.76–1.25) | 1.31 (1.03–1.67) | 0.258 | 0.218
RA | 1p36 | 2.44–2.77 | rs6684865 | 5.37 × 10^-6 | 3.14 × 10^-5 | 3.47 | 2.97 | G | A | 1.27 (1.02–1.56) | 1.54 (1.25–1.90) | 0.338 | 0.294
RA | 1p31 | 80.16–80.36 | rs11162922 | 1.80 × 10^-6 | - | 4.11 | 3.80 | A | G | 1.27 (0.41–4.01) | 2.00 (0.64–6.20) | 0.072 | 0.048
RA | 4p15 | 24.99–25.13 | rs3816587 | 7.65 × 10^-3 | 9.25 × 10^-6 | 0.50 | 2.64 | C | C | 0.91 (0.80–1.04) | 1.35 (1.14–1.59) | 0.406 | 0.434
RA | 6q23 | 138.00–138.06 | rs6920220 | 4.99 × 10^-6 | 1.58 × 10^-5 | 3.49 | 3.17 | A | A | 1.20 (1.06–1.36) | 1.72 (1.33–2.22) | 0.223 | 0.263
RA | 7q32 | 130.80–130.84 | rs11761231 | 1.74 × 10^-6 | 2.65 × 10^-6 | 3.92 | 3.42 | C | T | 1.44 (1.19–1.75) | 1.64 (1.35–1.99) | 0.375 | 0.327
RA | 10p15 | 6.07–6.16 | rs2104286 | 7.02 × 10^-6 | 2.52 × 10^-5 | 3.37 | 2.57 | T | C | 1.41 (1.10–1.81) | 1.68 (1.31–2.14) | 0.286 | 0.244
RA | 13q12 | 19.845–19.855 | rs9550642 | 8.44 × 10^-6 | 3.90 × 10^-5 | 3.35 | 3.02 | A | A | 1.34 (1.15–1.56) | 2.23 (1.21–4.13) | 0.084 | 0.112
RA | 21q22 | 41.430–41.465 | rs2837960 | 3.45 × 10^-2 | 1.68 × 10^-6 | 0.05 | 2.70 | G | G | 0.95 (0.83–1.08) | 2.30 (1.64–3.23) | 0.171 | 0.188
RA | 22q13 | 35.870–35.885 | rs743777 | 7.92 × 10^-6 | 1.15 × 10^-6 | 3.29 | 3.52 | G | G | 1.09 (0.97–1.24) | 1.72 (1.40–2.11) | 0.292 | 0.336
T1D | 1q42 | 221.92–222.17 | rs2639703 | 8.46 × 10^-6 | 1.74 × 10^-5 | 3.25 | 3.06 | C | C | 1.15 (1.02–1.30) | 1.61 (1.31–1.99) | 0.276 | 0.318
T1D | 4q27 | 123.02–123.92 | rs17388568 | 5.01 × 10^-7 | 3.27 × 10^-6 | 4.42 | 3.89 | A | A | 1.26 (1.11–1.42) | 1.58 (1.27–1.95) | 0.260 | 0.307
T1D | 5q14 | 86.20–86.50 | rs2544677 | 8.23 × 10^-6 | 4.43 × 10^-5 | 3.32 | 2.70 | C | G | 1.34 (1.00–1.79) | 1.65 (1.24–2.18) | 0.242 | 0.204
T1D | 5q31 | 132.64–132.67 | rs17166496 | 6.06 × 10^-1 | 5.20 × 10^-6 | -0.97 | 3.25 | C | G | 0.77 (0.68–0.87) | 1.09 (0.92–1.29) | 0.391 | 0.386
T1D | 10p15 | 6.07–6.18 | rs2104286 | 7.96 × 10^-6 | 4.32 × 10^-5 | 3.31 | 2.88 | T | C | 1.30 (1.02–1.65) | 1.57 (1.25–1.99) | 0.286 | 0.245
T1D | 12p13 | 9.71–9.80 | rs11052552 | 1.02 × 10^-4 | 7.24 × 10^-7 | 2.22 | 3.80 | G | T | 1.49 (1.28–1.73) | 1.43 (1.21–1.69) | 0.486 | 0.446
T1D | 18p11 | 12.76–12.91 | rs2542151 | 1.89 × 10^-6 | 1.16 × 10^-5 | 3.91 | 3.52 | G | G | 1.30 (1.15–1.47) | 1.62 (1.17–2.24) | 0.163 | 0.201
T2D | 1p31 | 66.04–66.36 | rs4655595 | 2.68 × 10^-6 | 1.33 × 10^-5 | 3.81 | 3.47 | G | G | 1.37 (1.17–1.59) | 2.33 (1.23–4.42) | 0.080 | 0.108
T2D | 2q24 | 160.90–161.17 | rs6718526 | 2.40 × 10^-6 | 1.16 × 10^-5 | 3.86 | 3.35 | C | T | 1.49 (1.05–2.11) | 1.86 (1.32–2.63) | 0.209 | 0.171
T2D | 3p14 | 55.24–55.32 | rs358806 | 4.77 × 10^-1 | 3.05 × 10^-6 | -0.83 | 2.72 | A | A | 0.86 (0.75–0.97) | 1.78 (1.34–2.36) | 0.198 | 0.204
T2D | 4q27 | 122.92–123.02 | rs7659604 | 2.1 × 10^-2 | 9.42 × 10^-6 | 0.13 | 2.74 | T | T | 1.35 (1.19–1.54) | 1.09 (0.91–1.30) | 0.380 | 0.403
T2D | 10q11 | 43.43–43.63 | rs9326506 | 7.78 × 10^-6 | 2.99 × 10^-5 | 3.27 | 2.92 | C | C | 1.28 (1.11–1.48) | 1.46 (1.24–1.72) | 0.492 | 0.538
T2D | 12q13 | 49.50–49.87 | rs12304921 | 5.37 × 10^-2 | 7.07 × 10^-6 | -0.09 | 2.68 | G | G | 2.50 (1.53–4.09) | 1.94 (1.20–3.15) | 0.145 | 0.159
T2D | 12q15 | 69.58–69.96 | rs1495377 | 1.31 × 10^-6 | 6.52 × 10^-6 | 4.01 | 3.15 | G | G | 1.28 (1.11–1.49) | 1.51 (1.28–1.78) | 0.497 | 0.547
T2D | 15q24 | 72.24–72.50 | rs2930291 | 7.72 × 10^-6 | 4.40 × 10^-5 | 3.30 | 2.42 | G | A | 1.25 (1.04–1.51) | 1.50 (1.24–1.82) | 0.377 | 0.332
T2D | 15q25 | 78.12–78.36 | rs2903265 | 9.57 × 10^-6 | 4.98 × 10^-5 | 3.24 | 2.53 | G | A | 1.18 (0.93–1.49) | 1.47 (1.17–1.86) | 0.284 | 0.243

Regions with at least one SNP with a P value of greater than 5 × 10^-7 and less than 1 × 10^-5 for either the trend or the genotypic test. Columns as for Table 3. Cluster plots for each SNP have been inspected visually. Positions are in NCBI build-35 coordinates. Genotypic P values were not calculated for SNPs with the lowest MAFs owing to low numbers of rare-allele homozygotes and sensitivity to genotype calling errors.

The third novel association involves a cluster of SNPs around rs10883365 (P = 1.4 × 10^-8) on chromosome 10q24.2. The most credible candidate here is the NKX2-3 (NK2 transcription factor related, locus 3) gene, a member of the NKX family of homeodomain-containing transcription factors.
Targeted disruption of the murine homologue of NKX2-3 results in defective development of the intestine and secondary lymphoid organs [69]. Abnormal expression of NKX2-3 may alter gut migration of antigen-responsive lymphocytes and influence the intestinal inflammatory response.

The final novel association, at rs2542151 (P = 4.6 × 10^-8), maps 5.5-kb upstream of PTPN2 (protein tyrosine phosphatase, non-receptor type 2) on chromosome 18p11. PTPN2 encodes the T cell protein tyrosine phosphatase TCPTP, a key negative regulator of inflammatory responses. The same locus also shows strong association with T1D susceptibility (trend test P = 1.9 × 10^-6) and a consistent, though weaker, association with RA (P = 1.9 × 10^-2), supporting the existence of overlapping pathways in the pathogenesis of very distinct inflammatory phenotypes (combined trend test P value for all three diseases = 9 × 10^-8) (Table 3; ref. 10).

Several further loci generating less strong evidence for association are of interest on the basis of their biological candidacy (Table 4). For example, rs9469220 (P = 8.7 × 10^-7), mapping to the human leukocyte antigen (HLA) system class II region, was detected in the 'second tier' of associations (Table 4). This suggests a significant contribution of HLA to CD-susceptibility, though less marked than seen in classical autoimmune conditions such as RA and T1D. Another interesting candidate flagged in Table 4 is TNFAIP3 (TNFα induced protein 3), the closest gene to rs7753394 on chromosome 6q23. The protein product inhibits TNFα-induced NFκB-dependent gene expression by interfering with RIP- or TRAF-2-mediated transactivation signals, hence interacting with the same pathway as CARD15 (NOD2). Markers with lower levels of significance include rs6478108 (P = 9.0 × 10^-5) within TNFSF15 (tumour necrosis factor super family, member 15), previously reported associated with CD [70]; and rs3816769 (P = 3.1 × 10^-5), which maps within STAT3 (signal transducers and activator of transcription, member 3).
On the X chromosome, rs2807261 (P = 1.3 × 10^-7) maps 50-kb from the gene CD40LG (CD40 ligand, previously known as TNF superfamily, member 5), implicated in the regulation of B-cell proliferation, adhesion and immunoglobulin class switching [71]. As described in the section on T1D, a modest association between CD and SNPs in the vicinity of the PTPN11 gene on chromosome 12q24 (P = 1.5 × 10^-3) probably reflects a locus influencing general autoimmune predisposition.

An emerging theme from molecular genetic studies of CD is the importance of defects in autophagy and the processing of phagocytosed bacteria. A number of other specific components within innate and adaptive immune pathways are also highlighted.

Hypertension (HT). Hypertension refers to a clinically significant increase in blood pressure and constitutes an important risk factor for cardiovascular disease (http://www.who.int/whr/2002/en/; ref. 72). Lifestyle exposures that elevate blood pressure, including sodium intake, alcohol and excess weight [73], are well-described risk factors. Genetic factors are also important [74,75]. Estimates of λs are approximately 2.5–3.5. Experimental models have highlighted a number of quantitative trait loci but these have yet to translate into insights into human hypertension [76]. Linkage studies are consistent with susceptibility genes of modest effect size [77] and well-replicated findings have yet to emerge from association approaches.

None of the variants previously associated with HT showed evidence for association in our study, although we note that some, such as those in the promoter of the WNK1 (WNK lysine deficient protein kinase 1) gene [78,79], are not well tagged by the Affymetrix chip.

For HT there were no SNPs with significance below 5 × 10^-7 (Table 3) but the number and distribution of association signals in the range 10^-4 to 10^-7 was similar to that of the other diseases studied (Table 4 and Supplementary Table 7). There are several possible explanations.
First, HT may have fewer common risk alleles of larger effect sizes than some of the other complex phenotypes. If so, then identification of susceptibility variants for HT is likely to be reliant on the synthesis of findings from multiple large-scale studies. Second, the present study may have failed to detect genuine common susceptibility variants of large effect size because they happened to be poorly tagged by the set of SNPs genotyped in the current study. If so, further rounds of genotyping using resources that offer increased density (or complementary SNP sets), and/or improved analytical methods (for example, imputation-based) should facilitate their discovery. Third, study of HT may be more susceptible than other phenotypes to the diluting effects of misclassification bias due to the presence of hypertensive individuals within the control samples. If so, power can be improved in future studies by use of controls specifically screened to exclude individuals with elevated blood pressure.
The most strongly associated SNPs (Table 4) do not identify genes from physiological systems previously implicated by clinical or genetic studies in hypertension. The strongest signal overall is with rs2820037 on 1q43 (genotypic test, P = 7.7 × 10⁻⁷). The closest genes are RYR2 (encoding the ryanodine receptor 2), mutations in which are associated with stress-induced polymorphic ventricular tachycardia and arrhythmogenic right ventricular dysplasia (refs 80, 81); CHRM3, encoding the cholinergic receptor muscarinic 3, a member of the G protein-coupled receptor family (ref. 32); and ZP4, the product of which is zona pellucida glycoprotein 4 (ref. 81). The strong association signals on the X chromosome using an expanded reference group (see below and Supplementary Table 9) are of substantial interest but they do not identify known genes of obvious relevance to HT.
Rheumatoid arthritis (RA).
Rheumatoid arthritis is a chronic inflammatory disease characterized by destruction of the synovial joints resulting in severe disability, particularly in patients who remain refractory to available therapies (ref. 82). Susceptibility to, and severity of, RA are determined by both genetic and environmental factors, with λs estimates ranging from 5 to 10 (ref. 83). An association between RA and alleles of the HLA-DRB1 locus has long been established (ref. 84). Despite extensive linkage (refs 85–87) and association studies, only one other RA susceptibility locus has been convincingly identified in Caucasians. In common with several autoimmune diseases including T1D, carriage of the T allele of the rs2476601 SNP in the PTPN22 (protein tyrosine phosphatase, non-receptor type 22) gene has been reproducibly associated with RA, conferring a genetic relative risk of approximately 1.8 (refs 88, 89). These known associations with HLA-DRB1 and PTPN22 explain around 50% of the familial aggregation of RA.
Both these previous associations emerge strongly here (Table 2). The most associated marker within PTPN22 (rs6679677: chromosome 1p13) is perfectly correlated (HapMap CEU data r² = 1) with the functionally relevant SNP (rs2476601) described previously, and the effect size is consistent with previous estimates (ref. 89). Amongst other putative RA susceptibility genes, two SNPs mapping to CTLA-4 (cytotoxic T-lymphocyte associated 4), rs3087243 and rs11571300, were only nominally significant (P = 0.085 and P = 0.034, respectively) (Supplementary Table 10).
RA was the sole disease for which the sex-differentiated analysis generated a strong signal due to different genetic effects in males and females. The SNP rs11761231 (chromosome 7) generates a P value of 3.9 × 10⁻⁷ for the 2-degrees of freedom (d.f.) sex-differentiated test which combines trend tests in males and females (Table 3). (The trend test ignoring the sex of the individuals has a P value of 1.7 × 10⁻⁶.)
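The 2-d.f. sex-differentiated test can be illustrated with a short sketch. It assumes, as one plausible reading of "combines trend tests in males and females", that the two sex-specific 1-d.f. trend statistics are independent, are summed, and are referred to a chi-square distribution with 2 d.f. The exact construction is the one given in the paper's Methods; the function names and the example statistics below are hypothetical.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 d.f."""
    return math.exp(-x / 2.0)


def sex_differentiated_p(chi2_males: float, chi2_females: float) -> float:
    """Assumed combination: add the two independent sex-specific 1-d.f.
    trend statistics and compare the sum with a 2-d.f. chi-square."""
    return chi2_sf_2df(chi2_males + chi2_females)


if __name__ == "__main__":
    # Hypothetical statistics: no signal in males, strong signal in females.
    chi2_m, chi2_f = 0.2, 29.0
    print("males   P =", chi2_sf_1df(chi2_m))
    print("females P =", chi2_sf_1df(chi2_f))
    print("2-d.f. sex-differentiated P =", sex_differentiated_p(chi2_m, chi2_f))
```

A pooled trend test averages over the two sexes, so an effect confined to one sex is diluted; a 2-d.f. combination keeps it, at the cost of one extra degree of freedom.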
This genotype has no effect on disease status in males, but a strong apparently additive effect in females (P value in a logistic regression model with additive log-odds is 0.68 in males and 6.8 × 10⁻⁸ in females, additive OR for females 1.32), and may represent one of the first sex-differentiated effects in human diseases. Cluster plots for this SNP seem good, but it is surrounded by recombination hotspots and has no other SNPs on the Affymetrix chip with r² > 0.1 (Supplementary Fig. 11). Some caution is therefore required, but this represents a potentially interesting finding which warrants further investigation, particularly given the sex-related prevalence difference characteristic of this condition.
None of the 9 SNPs with nominal P values in the range 10⁻⁵ to 5 × 10⁻⁷ (Table 4) map to loci previously associated with RA. Of particular interest is the association of SNPs mapping close to both the alpha and beta chains of the IL2 receptor (rs2104286 in the case of IL2RA; rs743777 in the case of IL2RB). The IL2 receptor mediates IL2 stimulation of T lymphocytes and is thereby thought to have an important role in preventing autoimmunity. A rare 4-base-pair deletion of IL2RA has been associated with development of severe autoimmune disease (ref. 90), and there is evidence (from previous data, ref. 91, and from this study and its follow-up) that SNPs within the IL2RA gene region are associated with T1D (see also T1D section).
Several of the SNPs with nominal significance in the range 10⁻⁴ to 10⁻⁵ (Supplementary Table 7) map to genes with plausible biological relevance. Examples include SNPs within genes implicated in the TNF pathway (for example, rs2771369 in TNFAIP2 (tumour necrosis factor, alpha-induced protein 2)) or in the regulation of T-cell function (rs854350 in GZMB (granzyme B) and rs4750316 in PRKCQ (protein kinase C, theta)).
The association with rs10786617 in KAZALD1 (Kazal-type serine protease inhibitor domain-containing protein 1 precursor), a gene whose product is known to have a role in bone regeneration after injury, may be relevant to the development of bone erosions in RA.
RA and T1D were already known to have two disease susceptibility genes in common: at the MHC, and at PTPN22. As detailed elsewhere, our study provides data indicating that this list can be extended to include variants around IL2RA (chromosome 10p15), PTPN2 (chromosome 18p11) and the chromosome 12q24 region (Supplementary Table 11), all apparently novel in RA.
Type 1 diabetes (T1D). Type 1 diabetes is a chronic autoimmune disorder with onset usually in childhood (ref. 92). The λs for T1D is ~15 and twin data suggest that over 85% of the phenotypic variance is due to genetic factors (ref. 93). There are six genes/regions for which there is strong pre-existing statistical support for a role in T1D-susceptibility: these are the major histocompatibility complex (MHC), the genes encoding insulin, CTLA-4 (cytotoxic T-lymphocyte associated 4) and PTPN22 (protein tyrosine phosphatase, non-receptor type 22), and the regions around the interleukin 2 receptor alpha (IL2RA/CD25) and interferon-induced helicase 1 genes (IFIH1/MDA5) (ref. 94). However, these signals can explain only part of the familial aggregation of T1D. Five of these previously identified associations were detected in this scan (P ≤ 0.001) (Table 2 and Supplementary Table 10), the exception being the INS gene discussed above.
In this study, single-point analyses revealed three novel regions (on chromosomes 12q13, 12q24 and 16p13) showing strong evidence of association (P < 5 × 10⁻⁷; Table 3). Four further regions attained similar levels of significance either through multilocus analyses (chromosomes 4q27 and 12p13: Table 3, Supplementary Fig. 12), or through the combined analysis of autoimmune cases (chromosomes 18p11 and the 10p15 CD25 region: Table 3, Supplementary Fig. 13).
The associations with T1D for chromosomes 12q13, 12q24, 16p13 and 18p11 have been confirmed in independent and multiple populations (ref. 10).
The two signals on chromosome 12 (at 12q13 and 12q24) map to regions of extensive linkage disequilibrium covering more than ten genes (Fig. 5). Several of these represent functional candidates because of their presumed roles in immune signalling, considered to be a major feature of T1D-susceptibility. These include ERBB3 (receptor tyrosine-protein kinase erbB-3 precursor) at 12q13 and SH2B3/LNK (SH2B adaptor protein 3), TRAFD1 (TRAF-type zinc finger domain containing 1) and PTPN11 (protein tyrosine phosphatase, non-receptor type 11) at 12q24. For these signal regions in particular, extensive resequencing, further genotyping and targeted functional studies will be essential steps in identifying which gene, or genes, are causal (ref. 95). Of those listed, PTPN11 is a particularly attractive candidate given a major role in insulin and immune signalling (ref. 96). It is also a member of the same family of regulatory phosphatases as PTPN22, already established as an important susceptibility gene for T1D and other autoimmune diseases (refs 94, 97). Indeed, the 12q24 variant most associated with T1D also features in both the CD and RA scans, generating a combined signal for all autoimmune cases of 9.3 × 10⁻¹⁰ (Supplementary Table 11).
In contrast, available annotations suggest that the 16p13 region contains only two genes of unknown function, KIAA0350 and dexamethasone-induced transcript (Fig. 5). Also, the region of association identified on 18p11 (Supplementary Fig. 14), which seems to confer susceptibility to all three autoimmune conditions studied (combined trend test P = 9 × 10⁻⁸; P = 4.6 × 10⁻⁸ for CD, 1.9 × 10⁻² for RA, and 1.9 × 10⁻⁶ for T1D: Supplementary Table 11), maps to a single gene, PTPN2 (protein tyrosine phosphatase, non-receptor type 2), a member of the same family as PTPN22 and PTPN11 and involved in immune regulation (ref. 96).
Our scan found associations with SNPs within the chromosome 10p15 region containing CD25, encoding the high-affinity receptor for IL-2. This is consistent with a previous report of associations of this region with T1D (ref. 91). The CD25 region has previously been shown to be associated with Graves' disease (ref. 98) and the present study also provides evidence of association with RA (combined trend test P = 5 × 10⁻⁸, P ≈ 7 × 10⁻⁶ for RA and T1D separately, Supplementary Table 11). This finding has clear biological connections to the evidence of association between T1D and a region of 4q27 revealed by the multilocus analysis (Supplementary Table 12, Supplementary Fig. 12). This region contains the genes encoding both IL-2 and IL-21. Together with studies in the NOD (nonobese diabetic) mouse model of T1D, which have shown that a major non-MHC locus (Idd3) reflects regulatory variation of the Il2 gene (ref. 99), our results point to the primary importance of the IL-2 pathway in T1D and other autoimmune diseases.
One further region deserves comment. In the multilocus analysis, there was increased support for a region on chromosome 12p13 containing several candidate genes, including CD69 (CD69 antigen (p60, early T-cell activation antigen)) and multiple CLEC (C-type lectin domain family) genes. In contrast to the chromosome 4 region where the effect of imputation is to tip an already-strong signal (5.01 × 10⁻⁷ for typed rs17388568, trend test) over the arbitrary threshold of 5 × 10⁻⁷, the 12p13 locus involves a more marked change between imputed and actual (7.2 × 10⁻⁷ for rs11052552, general test). Replication studies of this imputed SNP to date have produced equivocal results (for details see ref. 10).
Type 2 diabetes (T2D). Type 2 diabetes is a chronic metabolic disorder typically first diagnosed in the middle to late adult years (ref. 100). Strongly associated with obesity, the condition features defects in both the secretion and peripheral actions of insulin (ref. 101).
The appreciable familial aggregation of T2D (an estimated λs of ~3.0 in European individuals) (ref. 73) reflects both shared family environment and genetic predisposition. Heritability values vary widely with most estimates between 30 and 70% (ref. 101).
To date, robust, widely replicated associations in non-isolate populations are limited to variants in three genes: PPARG (encoding the peroxisomal proliferative activated receptor gamma; P12A; ref. 102), KCNJ11 (the inwardly-rectifying Kir6.2 component of the pancreatic beta-cell KATP channel; E23K; ref. 103) and TCF7L2 (transcription factor 7-like 2; rs7903146; refs 104, 105).
All three of these signals are detected here with effect-sizes consistent with previous reports (Table 2). A cluster of SNPs on chromosome 10q, within TCF7L2, represented by rs4506565 (trend test, OR 1.36, P = 5.7 × 10⁻¹³) generates the strongest association signal for T2D (Table 3, Fig. 5). Rs4506565 is in tight linkage disequilibrium (r² of 0.92 in the CEU component of HapMap) with rs7903146, the variant with the strongest aetiological claims (refs 104, 106). In fact, our imputation analysis confirms that rs7903146, though unrepresented on the chip, is responsible for the strongest association effect in this region (Fig. 5). TCF7L2 acts within the WNT-signalling pathway, and effects on diabetes risk seem to be mediated predominantly through beta-cell dysfunction (ref. 107).
As expected, given existing effect-size estimates, the signals associated with variants within the other established T2D-susceptibility genes, KCNJ11 (rs5215, r² of 0.9 with rs5219, E23K) and PPARG (rs17036328, r² of 1 with rs1801282, P12A) are less dramatic (trend test, OR 1.15 and 1.23 respectively, both P ≈ 0.001). These examples illustrate how genuine disease-susceptibility variants can generate association signals which would not attract immediate attention for follow-up in the genomewide context.
Apart from TCF7L2, the scan reveals two signals for T2D with P values less than 5 × 10⁻⁷ (Table 3, Fig. 5). The first of these maps within the FTO (fat-mass and obesity-associated) gene on chromosome 16q. Several adjacent SNPs (including rs9939609, rs7193144 and rs8050136) generate signals characterized by a per-allele OR for T2D of ~1.25 and a risk-allele frequency of ~40% in controls. As recently described in follow-up studies prompted by this finding, the effect of these variants on T2D-risk has been replicated and is mediated entirely by their marked effect on adiposity (ref. 24).
The third association signal for T2D (chromosome 6p22) features a cluster of highly associated SNPs (including rs9465871) with risk-allele frequencies between 18 and 35%, mapping to intron 5 of the CDKAL1 (CDK5 regulatory subunit associated protein 1-like 1) gene. Although the function of CDKAL1 is not known, it shares homology at the protein domain level with CDK5 regulatory subunit associated protein 1 (CDK5RAP1). CDK5RAP1 is known to inhibit the activation of CDK5, a cyclin-dependent kinase which has been implicated in the maintenance of normal beta-cell function (ref. 108). Our own follow-up studies, and scans by other groups, have shown strong replication of this finding (refs 19–22). The effect of this variant on T2D-risk shows significant departures from additivity (Supplementary Table 8).
One notable inclusion amongst the variants with more modest association signals is a cluster of SNPs on chromosome 10 including rs10748582 and rs7923866, which generate trend test P values between 10⁻⁴ and 10⁻⁵. This cluster maps in the vicinity of the HHEX (homeobox, hematopoietically expressed) and IDE (insulin-degrading enzyme) genes, in a region recently highlighted in a GWA scan for T2D performed in 1,363 subjects of French origin (ref. 109). The SNPs showing association in our data are proxies for those reported in the French study and generate similar effect-size estimates for T2D.
Of the three other regions highlighted by the French scan (ref. 109), none can be confirmed by our data. The SNP in SLC30A8 associated with T2D in the French report (rs13266634) is poorly correlated with SNPs on the Affymetrix chip (r² < 0.01), and extensive recombination events in the region limit the value of data-imputation methods. Coverage of the LOC387761 and EXT2 signals is considerably better, but, for these, neither genotyped nor imputed SNPs show evidence for association with T2D.
WTCCC data contributed to identification of two additional robustly replicating T2D signals, mapping to the IGF2BP2 gene and CDKN2A/CDKN2B regions (refs 19, 21, 22), although neither generated impressive P values on the primary scan analysis (neither single-point P was < 10⁻⁴). The latter signal maps to the same region as the CAD signal on chromosome 9 though different SNPs are involved. The other SNPs in Table 4 do not map to genes or regions previously implicated in T2D pathogenesis, and replication efforts to date have not identified any confirmed signals (ref. 19).
Expanded reference group analyses. For a fixed number of cases, power of a case-control study can be increased by enlarging the reference group. Our main analyses used a control:case ratio of 1.5:1 for each disease. The availability of the other 6 disease data sets gave us the opportunity to expand the reference group up to a ratio of ~7.5:1, with potential reciprocal benefits for the analysis of each disease. For BD and T2D the expanded reference group comprised the 58C and UKBS controls supplemented by the other 6 disease sets; for CAD and HT this expanded reference group was reduced to exclude HT and CAD respectively; for CD, RA and T1D, the reference group was augmented only by the cases from the non-autoimmune diseases.
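The gain from a larger reference group can be made concrete with a rough power calculation. The sketch below is not the consortium's method: it uses a simple normal approximation for a two-sided comparison of allele frequencies, and the control risk-allele frequency (0.30) and per-allele odds ratio (1.30) are illustrative assumptions. Only the 2,000 cases, the 1.5:1 and ~7.5:1 ratios and the 5 × 10⁻⁷ threshold come from the text; all function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_allele_frequency(p_control: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def allelic_power(n_cases: int, n_controls: int, p_control: float,
                  odds_ratio: float, alpha: float = 5e-7) -> float:
    """Approximate power of a two-sided comparison of allele frequencies
    (normal approximation, unpooled variance, far tail ignored)."""
    p_case = case_allele_frequency(p_control, odds_ratio)
    # Each individual contributes two alleles.
    se = math.sqrt(p_case * (1.0 - p_case) / (2 * n_cases)
                   + p_control * (1.0 - p_control) / (2 * n_controls))
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)
    z_effect = abs(p_case - p_control) / se
    return _STD_NORMAL.cdf(z_effect - z_crit)


if __name__ == "__main__":
    n_cases = 2000
    for ratio in (1.0, 1.5, 7.5):
        power = allelic_power(n_cases, int(ratio * n_cases), 0.30, 1.30)
        print(f"control:case = {ratio}:1 -> approximate power {power:.2f}")
```

The returns diminish as the ratio grows: the control term in the variance shrinks towards zero while the case term is fixed, so power is ultimately bounded by the number of cases.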
The utility of the expanded reference group approach was demonstrated by increased evidence for association at most of the loci that received strongest support from our primary analysis, including many of the signals at loci known to show robust association in T1D, T2D and CD (Supplementary Table 9). Additionally, this analysis elevated several loci with modest levels of statistical significance in the primary analysis to the top tier of statistical significance (P < 5 × 10⁻⁷).
Our data indicate that this approach may be a useful adjunct to conventional analysis and that loci identified as highly significant should be considered for follow-up. There are two important caveats. First, susceptibility genes that influence both the test disease and one or more of the diseases included in the reference group will cause loss of power. Second, a 'mirror-image' effect could occur whereby a strong association within the expanded reference sample (for example, HLA in autoimmune diseases) causes spurious association with the opposite allele in the test disease. Thus, a positive association using an expanded reference group must be interpreted within the context of association findings in the diseases included within the reference group.
Disease models. It is of interest to consider which statistical models best describe the data at and between loci that are strongly associated with disease status. Biological interpretation of these statistical models is not straightforward but they can help in choosing more powerful statistical tools for detecting associations.
First, consider separately each of the 19 non-MHC SNPs showing strong evidence for association on either the trend or genotypic test in Table 3. For four of these 19, the P value on the 2-d.f. genotypic test was smaller than that on the 1-d.f. trend test (Table 3).
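The contrast between the two tests can be sketched as follows. This is a minimal illustration, assuming the 1-d.f. trend test is the Cochran-Armitage test with additive scores (0, 1, 2) and the 2-d.f. genotypic test is the Pearson chi-square on the 2 × 3 table of genotype counts; the function name and the example counts are hypothetical and are not data from the study.

```python
import math


def association_tests(cases, controls):
    """Return (trend_chi2, trend_p, genotypic_chi2, genotypic_p) for
    genotype counts ordered by number of risk alleles (0, 1, 2)."""
    scores = (0, 1, 2)
    col_totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls

    # Cochran-Armitage trend statistic (1 d.f.) with additive scores.
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * n for x, n in zip(scores, col_totals))
    sum_x2n = sum(x * x * n for x, n in zip(scores, col_totals))
    denom = n_cases * n_controls * (n_total * sum_x2n - sum_xn ** 2)
    numer = n_total * (n_total * sum_xr - n_cases * sum_xn) ** 2
    trend_chi2 = numer / denom if denom > 0 else 0.0

    # Pearson chi-square on the 2 x 3 table (2 d.f.).
    genotypic_chi2 = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n_total
            if expected > 0:
                genotypic_chi2 += (observed - expected) ** 2 / expected

    trend_p = math.erfc(math.sqrt(trend_chi2 / 2.0))  # 1-d.f. upper tail
    genotypic_p = math.exp(-genotypic_chi2 / 2.0)     # 2-d.f. upper tail
    return trend_chi2, trend_p, genotypic_chi2, genotypic_p


if __name__ == "__main__":
    # Hypothetical counts whose pattern across genotypes is not linear.
    print(association_tests(cases=(10, 30, 20), controls=(30, 10, 20)))
```

When risk rises linearly with allele count the two statistics are nearly equal and the trend test wins on degrees of freedom; the genotypic test gives the smaller P value only when the pattern across the three genotypes departs markedly from that line.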
When comparing disease models, these were also the four SNPs with evidence for departure from a simple model in which odds of disease increase multiplicatively with the number of copies of the risk allele (Supplementary Table 8). This supports our view that the genotypic test should be carried out in addition to the trend test, although it should perhaps be viewed more cautiously for two reasons: it is more susceptible to genotyping errors; and (on the basis of our findings) experience does not favour strong dominance effects.
A separate question relates to the best models for the way in which different loci combine to affect susceptibility to a disease and, as a consequence, to the extent to which methods explicitly allowing interactions between loci should be employed to detect associations (ref. 110). None of the analyses reported here includes such interactions, so we are not well placed to address the general question. Nonetheless, within each collection with multiple associated regions (CD, T1D and T2D) we considered all pairs of non-MHC SNPs in Table 3 and looked for a departure from the model in which the two loci combine to increase log-odds in an additive fashion. We found suggestive evidence of a departure from multilocus additivity between rs1000113 and rs10761659 in CD (unadjusted P value = 0.002) and between rs9465871 and rs4506565 in T2D (unadjusted P value = 0.004). Further investigation of this question, preferably on unbiased sets of disease loci found through the application of single locus and interaction-based approaches, would seem warranted.
Discussion
We have studied seven common familial diseases by genome-wide association analysis in 16,179 individuals. Our findings inform understanding of the genetic basis of the diseases concerned and provide methodological insights relevant to the pursuit of GWA studies in general.
A simple but important observation is that GWA analysis provides a highly effective approach for exploring the genetic underpinnings of common familial diseases.
Our yield of novel, highly significant association findings is comparable to, or exceeds, the number of those hitherto generated by candidate gene or positional cloning efforts. For many of the compelling signals, replication has already been obtained, including regions on chromosomes 3p21, 5q33, 10q24 and 18p11 for CD (ref. 23), 12q13, 12q24, 16p13 and 18p11 in T1D (ref. 10) and 6p22 and 16q12 in T2D (refs 19–22, 24). For others, replication is required to establish a definitive relationship with disease. Additional findings of particular interest include the identification of several loci that seem to influence susceptibility to multiple autoimmune diseases, and the suggestion of a novel locus for RA which shows sex-specific effects.
Our study enables us to make several general recommendations relevant to GWA studies. The first relates to the importance of careful quality control. In such large data sets, small systematic differences can readily produce effects capable of obscuring the true associations being sought (refs 111, 112). We implemented extensive quality control checks to minimize differences in sample DNA concentration, quality and handling procedures and combined a new genotype-calling algorithm (CHIAMO) with a set of filtering heuristics to select SNPs for further analysis. Given that infallible detection of incorrect genotype calls is not yet possible, the criteria used for SNP exclusion need to strike a compromise between stringency (which may discard true signals or generate spurious positives through differential missingness) and leniency (with the danger that true signals are swamped by spurious findings due to poor genotype calling). As such, systematic visual inspection of cluster plots for SNPs of interest remains an integral part of the quality control process.
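A filtering heuristic of the general kind described above might look like the following sketch. It is an illustration only, not the consortium's pipeline: it flags a SNP on its missing-data rate and on a 1-d.f. chi-square test for departure from Hardy-Weinberg equilibrium, and both thresholds (`max_missing`, `min_hwe_p`) are placeholder values chosen for the example rather than the study's criteria. Function names are hypothetical.

```python
import math


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of the 1-d.f. chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    chi2 = 0.0
    for observed, expected in ((n_aa, n * p * p),
                               (n_ab, 2 * n * p * q),
                               (n_bb, n * q * q)):
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_snp_qc(genotype_counts, n_missing: int,
                  max_missing: float = 0.05,
                  min_hwe_p: float = 1e-6) -> bool:
    """Keep a SNP only if its missing-data rate and HWE P value are acceptable.
    Both thresholds are placeholders for illustration."""
    n_called = sum(genotype_counts)
    total = n_called + n_missing
    if total == 0:
        return False
    if n_missing / total > max_missing:
        return False
    return hwe_chi2_p(*genotype_counts) >= min_hwe_p


if __name__ == "__main__":
    print(passes_snp_qc((25, 50, 25), n_missing=0))   # counts in exact HWE
    print(passes_snp_qc((50, 0, 50), n_missing=0))    # no heterozygotes at all
```

Tightening either threshold discards more poorly called SNPs but also more true signals, which is the stringency versus leniency compromise discussed in the text, and is why cluster plots still need to be inspected for SNPs of interest.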
The potential for population structure to undermine inferences in case-control association studies has long been debated (ref. 113) but limited empirical data have been available to assess the issue. Our study highlighted several loci, some known and some new, which demonstrate substantial geographical variation in allele frequencies across Britain (Table 1), most probably due to natural selection in ancestral populations. Outside these loci, the effects of population structure are relatively minor, and do not represent a major source of confounding, provided that individuals with appreciable non-European ancestry are excluded. Although these conclusions may not generalize to studies in other locations, this finding reinforces the logistical and economic benefits of the case-control design over alternatives (such as family-based association studies).
Our study allowed us to address another important methodological issue: the adequacy, or otherwise, of using a common set of controls, rather than a sample recruited explicitly for use with a defined disease sample. It is often assumed that failure to match cases and controls for socio-demographic variables will lead to substantial inflation of the type I error rate. Our study demonstrates that, within the context of large-scale genetic association studies, for British populations at least, this concern has been overstated. A related argument against use of population controls relates to the perceived impact of misclassification bias when a proportion of controls meet the criteria used to define cases. However, the consequent loss of power is modest unless the trait of interest is very common (ref. 6). Given the above, the present study provides a compelling case for both the suitability and efficiency of the common control design in Britain and warrants its serious consideration elsewhere. Further benefits can be expected from use of this common control genotype data set in future GWA studies in Britain.
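The effect of misclassified controls on an association signal can be illustrated with a simple mixture calculation. The sketch assumes that a fraction of the nominal controls are unrecognised cases, so that the observed control allele frequency is a weighted mix of the true control and case frequencies; for unscreened population controls that fraction would be roughly the disease prevalence. The allele frequency, odds ratio and fractions used here are illustrative assumptions, and the function name is hypothetical.

```python
def observed_odds_ratio(p_control: float, true_odds_ratio: float,
                        misclassified_fraction: float) -> float:
    """Per-allele odds ratio seen when a fraction of the nominal controls
    are in fact unrecognised cases."""
    case_odds = true_odds_ratio * p_control / (1.0 - p_control)
    p_case = case_odds / (1.0 + case_odds)
    # Observed control frequency is a mixture of true controls and cases.
    p_mixed = ((1.0 - misclassified_fraction) * p_control
               + misclassified_fraction * p_case)
    return case_odds / (p_mixed / (1.0 - p_mixed))


if __name__ == "__main__":
    for fraction in (0.0, 0.01, 0.05, 0.30):
        attenuated = observed_odds_ratio(0.30, 1.30, fraction)
        print(f"misclassified fraction {fraction:.2f} -> observed OR {attenuated:.3f}")
```

The observed odds ratio shrinks smoothly towards 1 as the misclassified fraction rises, so the dilution is slight for rare conditions and becomes material only when a large share of controls would meet the case definition.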
Finally, in failing to detect significant differences in performance between the epidemiological sample (58C) and that derived from blood donors (UKBS), we validate the use of the latter samples for cost-effective, large-scale control DNA provision.
In terms of general biological insights, the most profound relate to inferences about the allelic architecture of common traits. The novel variants we have uncovered are characterized by modest effect size (that is, per-allele ORs between 1.2 and 1.5) and even these estimates are likely to be inflated (ref. 114). We identified no additional common variants of very large effect (akin to HLA in T1D: Supplementary Fig. 15). The observed distribution of effect sizes is consistent with models based on theoretical considerations and empirical data from animal models (refs 87, 115, 116) that suggest that, for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk.
There are several important corollaries. Notwithstanding the incomplete coverage afforded by the genotyping reagents employed, most of the susceptibility effects yet to be uncovered for these diseases (at least those attributable to, or tagged by, common SNPs) are likely to have effects of similar or smaller magnitude to those we have highlighted. Beyond the signals with the strongest evidence for association, most of which are likely to be real (and many of which have already been confirmed), there will be many additional susceptibility variants for which the WTCCC provides some evidence, but for which extensive replication will be required to establish validity. PPARG and KCNJ11 provide examples of proven susceptibility genes (for T2D) that generated only modest evidence for association within the WTCCC, and which would only have been revealed by such replication efforts.
Given the likely preponderance of susceptibility variants of small effect, the potential for identifying further loci is limited only by the clinical resources available for replication (assuming suitable study design, accurate genotyping and appropriate analysis and inference). Provided the attribution of a causal relationship with the trait of interest is robust, even variants of very small effect can offer fundamental biological insights.
The patterns of allelic architecture uncovered mean that replication efforts will need to feature comparably large sample sizes: even if one accepts more relaxed significance thresholds given the prior evidence, one has to consider the inflation in effect-size estimates in the primary study. Caution is required in reaching negative conclusions on the basis of a single failed attempt at replication, or any set of replication attempts that are inadequately powered.
One of our major design considerations was sample size. We set out to include samples larger than those previously examined for genome-wide association, and our results suggest that such large sample sizes were necessary. Even with 2,000 cases and 3,000 controls, adequate power is restricted to common variants of relatively large effect (see Supplementary Table 2). We carried out an experiment to see which SNPs showing strong evidence of association in the full data (that is, signals outside MHC with trend test P < 5 × 10⁻⁷) would have been detected at that same threshold in only a subset of our data (Fig. 6). Because it focuses on a particular but arbitrary P-value threshold, some care is needed in interpreting the figure. Nonetheless, for subsamples of 1,000 cases and 1,000 controls, of the 16 loci detected in the full study, we would have been certain of seeing only 2, with an expectation of about 6; for subsamples of 1,500 cases and 1,500 controls, we could expect to have seen about 9. These figures provide stark evidence that the larger the study sample, the more loci can be expected to reach threshold significance values. Indeed, given the likely distribution of effect sizes for most complex traits (see above), there are strong grounds for the prosecution of GWA studies on an even larger scale than ours, and, wherever possible, combining the results from existing GWA scans performed for the same trait. To assist such efforts, individual level data from this study will be widely available through the Consortium's Data Access Committee (follow links from http://www.wtccc.org.uk).
[Figure 6 plot. x axis: case and control sample size (250 to 1,900). y axis: proportion of subsamples exceeding P-value threshold (0.0 to 1.0). Curves are labelled by SNP number.]
Figure 6 | Strong associations in subsamples of our data. For the 16 SNPs in Table 3 (outside the MHC) with P values for the trend test below 5 × 10⁻⁷, we randomly generated 1,000 subsets of our full data set corresponding to case-control studies with different numbers of cases, and the same number of controls (x axis). The y axis gives the proportion of subsamples of a given size in which that SNP achieved a P value for the trend test below 5 × 10⁻⁷. SNPs are numbered according to the row in which they occur in Table 3 (so that, for example, the CAD hit is numbered 2, and the TCF7L2 hit on chromosome 10 for T2D is numbered 20).
In our study, T1D and CD, the conditions showing strongest familial aggregation (as quantified by their sibling relative risks, λs), generated the largest number of highly significant associations. This relationship was not sustained in comparisons between the other five diseases. It is important to recognize that the association signals so far identified account for only a small proportion of overall familiality.
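Why a single locus contributes so little to familial aggregation can be seen from a small worked calculation. The sketch below assumes a multiplicative model at one biallelic locus (relative risks 1, γ and γ² for 0, 1 and 2 risk alleles), under which standard variance-components algebra gives a locus-specific λs = (1 + w/2)², with w = pq(γ − 1)²/(pγ + q)². The odds ratio is used as a stand-in for the relative risk, and the risk-allele frequency of 0.30 is an illustrative assumption rather than a value taken from the study; the function name is hypothetical.

```python
def locus_specific_lambda_s(risk_allele_freq: float, allelic_rr: float) -> float:
    """Sibling relative risk attributable to one biallelic locus under a
    multiplicative model (genotype relative risks 1, g, g**2)."""
    p = risk_allele_freq
    q = 1.0 - p
    w = p * q * (allelic_rr - 1.0) ** 2 / (p * allelic_rr + q) ** 2
    return (1.0 + w / 2.0) ** 2


if __name__ == "__main__":
    # OR 1.36 is the trend-test estimate reported for the TCF7L2 signal;
    # the allele frequency of 0.30 is an assumed, illustrative value.
    print(f"{locus_specific_lambda_s(0.30, 1.36):.3f}")
```

With inputs of this size the result is only a few per cent above 1, the same order as the locus-specific value the paper reports for TCF7L2 and far short of the overall λs estimates quoted for these diseases.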
There is a disparity in scale between the modest locus-specific λs effects attributable to the identified associations (for instance, the prominent TCF7L2 signal for T2D translates into a λs of only 1.03) and the estimates of overall familiality that reflect the combined effects of all genes and shared family environment. These estimates demonstrate the limited potential of the variants thus far identified (singly or in combination) to provide clinically useful prediction of disease (refs 117, 118).
The identification and characterization of the aetiological variants that underlie replicated associations will necessitate extensive fine-mapping and functional validation. We view the WTCCC study and data set as an important first step towards harnessing the powerful molecular genomic tools now available to dissect the biological basis of common disease and translating those findings into improvements in human health.
METHODS SUMMARY
A detailed description of materials and methods is given in Methods. The workflow and organization of the project are given in Supplementary Fig. 16. Case series came from previously established collections with nationally representative recruitment: 2,000 samples were genotyped for each. The control samples came from two sources: half from the 1958 Birth Cohort and the remainder from a new UK Blood Service sample. The latter collection was established specifically for this study and is a UK national repository of anonymized DNA samples from 3,622 consenting blood donors. The vast majority of subjects were self-reported as of European Caucasian ancestry. All DNA samples were requantified and tested for degradation and PCR amplification. Genotyping was performed using GeneChip 500K arrays at the Affymetrix Services Lab (California): arrays not passing the 93% call rate threshold at P = 0.33 with the Dynamic Model algorithm were repeated.
CEL (cell intensity) files were transferred to WTCCC for quantile normalization, and genotypes called using a new genotyping algorithm, CHIAMO, developed for this project. QC/QA measures included sample call rate, overall heterozygosity and evidence of non-European ancestry (809 samples excluded; 16,179 retained for analysis). SNPs were excluded from analysis because of missing data rates, departures from Hardy–Weinberg equilibrium and other metrics (31,011 excluded; 469,557 retained). Standard 1-d.f. and 2-d.f. tests of case-control association were supplemented with bayesian approaches, multilocus methods (data imputation) and analyses with combined data sets, either as additional cases (to detect variants influencing multiple phenotypes) or as an expanded reference group (to increase power). Results for each SNP for all analyses reported will be available from http://www.wtccc.org.uk, as will details allowing other researchers to apply for access to WTCCC genotype data. Software packages developed within the WTCCC are available on request (see Methods for details).
Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.
Received 26 March; accepted 11 May 2007.
1. Hirschhorn, J. N. & Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nature Rev. Genet. 6, 95–108 (2005).
2. Barrett, J. C. & Cardon, L. R. Evaluating coverage of genome-wide association studies. Nature Genet. 38, 659–662 (2006).
3. The International HapMap consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
4. Murray, C. J. & Lopez, A. D. Evidence-based health policy – lessons from the Global Burden of Disease Study. Science 274, 740–743 (1996).
5. Mantel, N. Chi-square tests with one degree of freedom: Extension of the Mantel–Haenszel procedure. J. Am. Stat. Ass. 58, 690–700 (1963).
6. Colhoun, H. M., McKeigue, P. M. & Davey Smith, G. Problems of reporting genetic associations with complex outcomes.
Lancet 361, 865–872 (2003).
7. Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74, 1111–1120 (2004).
8. Coelho, M. et al. Microsatellite variation and evolution of human lactase persistence. Hum. Genet. 117, 329–339 (2005).
9. Sabeti, P. C. et al. Positive natural selection in the human lineage. Science 312, 1614–1620 (2006).
10. Todd, J. A. et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nature Genet. advance online publication, doi:10.1038/ng2068 (6 June 2007).
11. Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).
12. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
13. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).
14. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
15. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
16. Clayton, D. in Handbook of Statistical Genetics (eds Balding, D. J., Bishop, M. & Cannings, C.) 939–960 (Wiley, New York, 2003).
17. Mackay, T. F. & Anholt, R. R. Of flies and man: Drosophila as a model for human complex traits. Annu. Rev. Genomics Hum. Genet. 7, 339–367 (2006).
18. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L. & Rothman, N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434–442 (2004).
19. Zeggini, E. et al. Replication of genome-wide association signals in U.K. samples reveals risk loci for type 2 diabetes.
Science online publication, doi:10.1126/science.1142364 (26 April 2007).
20. Steinthorsdottir, V. et al. A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. Nature Genet. advance online publication, doi:10.1038/ng2043 (26 April 2007).
21. Scott, L. J. et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science online publication, doi:10.1126/science.1142382 (26 April 2007).
22. Diabetes Genetics Institute. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science online publication, doi:10.1126/science.1142358 (26 April 2007).
23. Parkes, M. et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nature Genet. advance online publication, doi:10.1038/ng2061 (6 June 2007).
24. Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).
25. Cohen, J. C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–872 (2004).
26. Muller-Oerlinghausen, B., Berghofer, A. & Bauer, M. Bipolar disorder. Lancet 359, 241–247 (2002).
27. Craddock, N., O'Donovan, M. C. & Owen, M. J. The genetics of schizophrenia and bipolar disorder: dissecting psychosis. J. Med. Genet. 42, 193–204 (2005).
28. McGuffin, P. et al. The heritability of bipolar affective disorder and the genetic relationship to unipolar depression. Arch. Gen. Psychiatry 60, 497–502 (2003).
29. Rice, J. et al. The familial transmission of bipolar illness. Arch. Gen. Psychiatry 44, 441–447 (1987).
30. McQueen, M. B. et al. Combined analysis from eleven linkage studies of bipolar disorder provides strong evidence of susceptibility loci on chromosomes 6q and 8q. Am. J. Hum. Genet. 77, 582–595 (2005).
31. Craddock, N. & Owen, M. J. The beginning of the end for the Kraepelinian dichotomy. Br. J.
Psychiatry 186, 364–366 (2005).
32. Ozeki, Y. et al. Disrupted-in-Schizophrenia-1 (DISC-1): mutant truncation prevents binding to NudE-like (NUDEL) and inhibits neurite outgrowth. Proc. Natl Acad. Sci. USA 100, 289–294 (2003).
33. Blackwood, D. H. et al. Schizophrenia and affective disorders – cosegregation with a translocation at chromosome 1q42 that directly disrupts brain-expressed genes: clinical and P300 findings in a family. Am. J. Hum. Genet. 69, 428–433 (2001).
34. Graves, T. D. & Hanna, M. G. Neurological channelopathies. Postgrad. Med. J. 81, 20–32 (2005).
35. Krystal, J. H. et al. Glutamate and GABA systems as targets for novel antidepressant and mood-stabilizing treatments. Mol. Psychiatry 7 (Suppl. 1), S71–S80 (2002).
36. Vawter, M. P. et al. Reduction of synapsin in the hippocampus of patients with bipolar disorder and schizophrenia. Mol. Psychiatry 7, 571–578 (2002).
37. Libby, P. & Theroux, P. Pathophysiology of coronary artery disease. Circulation 111, 3481–3488 (2005).
38. Yusuf, S. et al. Effect of potentially modifiable risk factors associated with myocardial infarction in 52 countries (the INTERHEART study): case-control study. Lancet 364, 937–952 (2004).
39. Lusis, A. J., Mar, R. & Pajukanta, P. Genetics of atherosclerosis. Annu. Rev. Genomics Hum. Genet. 5, 189–218 (2004).
40. Watkins, H. & Farrall, M. Genetic susceptibility to coronary artery disease: from promise to progress. Nature Rev. Genet. 7, 163–173 (2006).
41. Helgadottir, A. et al. The gene encoding 5-lipoxygenase activating protein confers risk of myocardial infarction and stroke. Nature Genet. 36, 233–239 (2004).
42. Helgadottir, A. et al. A variant of the gene encoding leukotriene A4 hydrolase confers ethnicity-specific risk of myocardial infarction. Nature Genet. 38, 68–74 (2006).
43. Topol, E. J., Smith, J., Plow, E. F. & Wang, Q. K. Genetic susceptibility to myocardial infarction and coronary artery disease. Hum. Mol. Genet. 15 (Spec. No.
2), R117–R123 (2006).
44. Lowe, S. W. & Sherr, C. J. Tumor suppression by Ink4a–Arf: progress and puzzles. Curr. Opin. Genet. Dev. 13, 77–83 (2003).
45. Hannon, G. J. & Beach, D. p15INK4B is a potential effector of TGF-β-induced cell cycle arrest. Nature 371, 257–261 (1994).
46. Kalinina, N. et al. Smad expression in human atherosclerotic lesions: evidence for impaired TGF-β/Smad signaling in smooth muscle cells of fibrofatty lesions. Arterioscler. Thromb. Vasc. Biol. 24, 1391–1396 (2004).
47. Schmid, M. et al. A methylthioadenosine phosphorylase (MTAP) fusion transcript identifies a new gene on chromosome 9p21 that is frequently deleted in cancer. Oncogene 19, 5747–5754 (2000).
48. Prasannan, P., Pike, S., Peng, K., Shane, B. & Appling, D. R. Human mitochondrial C1-tetrahydrofolate synthase: gene structure, tissue distribution of the mRNA, and immunolocalization in Chinese hamster ovary cells. J. Biol. Chem. 278, 43178–43187 (2003).
49. Walkup, A. S. & Appling, D. R. Enzymatic characterization of human mitochondrial C1-tetrahydrofolate synthase. Arch. Biochem. Biophys. 442, 196–205 (2005).
50. Frosst, P. et al. A candidate genetic risk factor for vascular disease: a common mutation in methylenetetrahydrofolate reductase. Nature Genet. 10, 111–113 (1995).
51. Klerk, M. et al. MTHFR 677C→T polymorphism and risk of coronary heart disease: a meta-analysis. J. Am. Med. Assoc. 288, 2023–2031 (2002).
52. Gregory, J. F. III et al. Primed, constant infusion with [2H3]serine allows in vivo kinetic measurement of serine turnover, homocysteine remethylation, and transsulfuration processes in human one-carbon metabolism. Am. J. Clin. Nutr. 72, 1535–1541 (2000).
53. Randak, C. et al. Three siblings with nonketotic hyperglycinaemia, mildly elevated plasma homocysteine concentrations and moderate methylmalonic aciduria. J. Inherit. Metab. Dis. 23, 520–522 (2000).
54. Wight, T. N.
The ADAMTS proteases, extracellular matrix, and vascular disease – Waking the sleeping giant(s)! Arterioscler. Thromb. Vasc. Biol. 25, 12–14 (2005).
55. Jonsson-Rylander, A. et al. The role of ADAMTS-1 in atherosclerosis: Remodeling of carotid artery, immunohistochemistry, and proteolysis of versican. Arterioscler. Thromb. Vasc. Biol. 25, 180–185 (2004).
56. Travis, S. P. et al. European evidence based consensus on the diagnosis and management of Crohn's disease: current management. Gut 55 (Suppl. 1), i16–i35 (2006).
57. Sartor, R. B. Mechanisms of disease: pathogenesis of Crohn's disease and ulcerative colitis. Nature Clin. Pract. Gastroenterol. Hepatol. 3, 390–407 (2006).
58. Tysk, C., Lindberg, E., Jarnerot, G. & Floderusmyrhed, B. Ulcerative-colitis and Crohns-disease in an unselected population of monozygotic and dizygotic twins – a study of heritability and the influence of smoking. Gut 29, 990–996 (1988).
59. Gaya, D. R., Russell, R. K., Nimmo, E. R. & Satsangi, J. New genes in inflammatory bowel disease: lessons for complex diseases? Lancet 367, 1271–1284 (2006).
60. Hugot, J. P. et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature 411, 599–603 (2001).
61. Ogura, Y. et al. A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature 411, 603–606 (2001).
62. Rioux, J. D. et al. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genet. 29, 223–228 (2001).
63. Duerr, R. H. et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314, 1461–1463 (2006).
64. Hampe, J. et al. A genome-wide association scan of nonsynonymous SNPs identifies a susceptibility variant for Crohn disease in ATG16L1. Nature Genet. 39, 207–211 (2007).
65. Rioux, J. D. et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nature Genet. 39, 596–604 (2007).
66.
Libioulle, C. et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. 3, e58 (2007).
67. Singh, S. B., Davis, A. S., Taylor, G. A. & Deretic, V. Human IRGM induces autophagy to eliminate intracellular mycobacteria. Science 313, 1438–1441 (2006).
68. Leonard, E. J. Biological aspects of macrophage-stimulating protein (MSP) and its receptor. Ciba Found Symp. 212, 183–191; discussion 192–197 (1997).
69. Pabst, O., Forster, R., Lipp, M., Engel, H. & Arnold, H. H. NKX2.3 is required for MAdCAM-1 expression and homing of lymphocytes in spleen and mucosa-associated lymphoid tissue. EMBO J. 19, 2015–2023 (2000).
70. Yamazaki, K. et al. Single nucleotide polymorphisms in TNFSF15 confer susceptibility to Crohn's disease. Hum. Mol. Genet. 14, 3499–3506 (2005).
71. Pietravalle, F. et al. Human native soluble CD40L is a biologically active trimer, processed inside microsomes. J. Biol. Chem. 271, 5965–5967 (1996).
72. Battegay, E. J., Lip, G. Y. H. & Badris, G. L. (eds) Hypertension; Principles and Practice (Taylor Francis Group, 2005).
73. Kobberling, J. & Tattersall, R. (eds) The Genetics of Diabetes Mellitus (Academic Press, London, 1982).
74. Dominiczak, A. F. et al. Genetics of hypertension: Lessons learnt from Mendelian and polygenic syndromes. Clin. Exp. Hypertens. 26, 611–620 (2004).
75. Mein, C. A., Caulfield, M. J., Dobson, R. J. & Munroe, P. B. Genetics of essential hypertension. Hum. Mol. Genet. 13, R169–R175 (2004).
76. Hubner, N. et al. Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nature Genet. 37, 243–253 (2005).
77. Caulfield, M. et al. Genome-wide mapping of human loci for essential hypertension. Lancet 361, 2118–2123 (2003).
78. Newhouse, S. J. et al. Haplotypes of the WNK1 gene associate with blood pressure variation in a severely hypertensive population from the British Genetics of Hypertension study. Hum. Mol. Genet. 14, 1805–1814 (2005).
79.
Tobin, M. D. et al. Association of WNK1 gene polymorphisms and haplotypes with ambulatory blood pressure in the general population. Circulation 112, 3423–3429 (2005).
80. Otsu, K. et al. Molecular cloning of cDNA encoding the Ca2+ release channel (ryanodine receptor) of rabbit cardiac muscle sarcoplasmic reticulum. J. Biol. Chem. 265, 13472–13483 (1990).
81. Benkusky, N. A., Farrell, E. F. & Valdivia, H. H. Ryanodine receptor channelopathies. Biochem. Biophys. Res. Commun. 322, 1280–1285 (2004).
82. Worthington, J., Barton, A. & John, S. L. The epidemiology of rheumatoid arthritis and the use of linkage and association studies to identify disease genes (Birkhauser, Basel, 2005).
83. Wordsworth, P. & Bell, J. Polygenic susceptibility in rheumatoid arthritis. Ann. Rheum. Dis. 50, 343–346 (1991).
84. Gregersen, P. K., Silver, J. & Winchester, R. J. The shared epitope hypothesis. An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum. 30, 1205–1213 (1987).
85. Jawaheer, D. et al. A genomewide screen in multiplex rheumatoid arthritis families suggests genetic overlap with other autoimmune diseases. Am. J. Hum. Genet. 68, 927–936 (2001).
86. John, S. et al. Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am. J. Hum. Genet. 75, 54–64 (2004).
87. MacKay, K. et al. Whole-genome linkage analysis of rheumatoid arthritis susceptibility loci in 252 affected sibling pairs in the United Kingdom. Arthritis Rheum. 46, 632–639 (2002).
88. Begovich, A. B. et al. A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am. J. Hum. Genet. 75, 330–337 (2004).
89. Hinks, A., Eyre, S., Barton, A., Thomson, W. & Worthington, J. Investigation of genetic variation across PTPN22 in UK rheumatoid arthritis (RA) patients. Ann. Rheum. Dis. 66, 683–686 (2006).
90. Sharfe, N., Dadi, H. K., Shahar, M. & Roifman, C. M.
Human immune disorder arising from mutation of the α chain of the interleukin-2 receptor. Proc. Natl Acad. Sci. USA 94, 3168–3171 (1997).
91. Vella, A. et al. Localization of a type 1 diabetes locus in the IL2RA/CD25 region by use of tag single-nucleotide polymorphisms. Am. J. Hum. Genet. 76, 773–779 (2005).
92. Devendra, D., Liu, E. & Eisenbarth, G. S. Type 1 diabetes: recent developments. Br. Med. J. 328, 750–754 (2004).
93. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M. & Tuomilehto, J. Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs: a nationwide follow-up study. Diabetes 52, 1052–1055 (2003).
94. Smyth, D. J. et al. A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nature Genet. 38, 617–619 (2006).
95. Todd, J. A. Statistical false positive or true disease pathway? Nature Genet. 38, 731–733 (2006).
96. Mustelin, T., Vang, T. & Bottini, N. Protein tyrosine phosphatases and the immune response. Nature Rev. Immunol. 5, 43–57 (2005).
97. Bottini, N. et al. A functional variant of lymphoid tyrosine phosphatase is associated with type I diabetes. Nature Genet. 36, 337–338 (2004).
98. Brand, O. J. Association of the interleukin-2 receptor alpha (IL-2Rα)/CD25 gene region with Graves' disease using a multilocus test and tag SNPs. Clin. Endocrinol. 66, 508–512 (2007).
99. Yamanouchi, J. et al. Interleukin-2 gene variation impairs regulatory T cell function and causes autoimmunity. Nature Genet. 39, 329–337 (2007).
100. Zimmet, P., Alberti, K. G. & Shaw, J. Global and societal implications of the diabetes epidemic. Nature 414, 782–787 (2001).
101. Stumvoll, M., Goldstein, B. J. & van Haeften, T. W. Type 2 diabetes: principles of pathogenesis and therapy. Lancet 365, 1333–1346 (2005).
102. Altshuler, D. et al. The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes.
Nature Genet. 26, 76–80 (2000).
103. Gloyn, A. L. et al. Large-scale association studies of variants in genes encoding the pancreatic β-cell KATP channel subunits Kir6.2 (KCNJ11) and SUR1 (ABCC8) confirm that the KCNJ11 E23K variant is associated with type 2 diabetes. Diabetes 52, 568–572 (2003).
104. Grant, S. F. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genet. 38, 320–323 (2006).
105. Zeggini, E. & McCarthy, M. I. TCF7L2: the biggest story in diabetes genetics since HLA? Diabetologia 50, 1–4 (2007).
106. Helgason, A. et al. Refining the impact of TCF7L2 gene variants on type 2 diabetes and adaptive evolution. Nature Genet. 39, 218–225 (2007).
107. Saxena, R. et al. Common single nucleotide polymorphisms in TCF7L2 are reproducibly associated with type 2 diabetes and reduce the insulin response to glucose in nondiabetic individuals. Diabetes 55, 2890–2895 (2006).
108. Ubeda, M., Rukstalis, J. M. & Habener, J. F. Inhibition of cyclin-dependent kinase 5 activity protects pancreatic β cells from glucotoxicity. J. Biol. Chem. 281, 28858–28864 (2006).
109. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881–885 (2007).
110. Marchini, J., Donnelly, P. & Cardon, L. R. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genet. 37, 413–417 (2005).
111. Clayton, D. G. et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nature Genet. 37, 1243–1246 (2005).
112. Zondervan, K. T. & Cardon, L. R. The complex interplay among factors that influence allelic association. Nature Rev. Genet. 5, 89–100 (2004).
113. Hutchison, K. E., Stallings, M., McGeary, J. & Bryan, A. Population stratification in the candidate gene study: fatal threat or red herring? Psychol. Bull. 130, 66–79 (2004).
114. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N.
Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genet. 33, 177–182 (2003).
115. Hayes, B. & Goddard, M. E. The distribution of the effects of genes affecting quantitative traits in livestock. Genet. Sel. Evol. 33, 209–229 (2001).
116. Valdar, W. et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genet. 38, 879–887 (2006).
117. Yang, Q., Khoury, M. J., Friedman, J., Little, J. & Flanders, W. D. How many genes underlie the occurrence of common complex diseases in the population? Int. J. Epidemiol. 34, 1129–1137 (2005).
118. Janssens, A. C. et al. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet. Med. 8, 395–400 (2006).
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Acknowledgements The principal funder of this project was the Wellcome Trust. Case collections were funded by: Arthritis Research Campaign, BDA Research, British Heart Foundation, British Hypertension Society, Diabetes UK, Glaxo-Smith Kline Research and Development, Juvenile Diabetes Research Foundation, National Association for Colitis and Crohn's disease, SHERT (The Scottish Hospitals Endowment Research Trust), St Bartholomew's and The Royal London Charitable Foundation, UK Medical Research Council, UK NHS R&D and the Wellcome Trust. Statistical analyses were funded by a Commonwealth Scholarship, EU, EPSRC, Fundação para a Ciência e a Tecnologia (Portugal), National Institutes of Health, National Science Foundation and the Wellcome Trust. We acknowledge the many physicians, research fellows and research nurses who contributed to the various case collections, and the collection teams and senior management of the UK Blood Services responsible for the UK Blood Services Collection.
For the 1958 Birth Cohort, venous blood collection was funded by the UK Medical Research Council and cell-line production, DNA extraction and processing by the Juvenile Diabetes Research Foundation and the Wellcome Trust. We recognize the contributions of: P. Shepherd (1958 Birth Cohort); those at Affymetrix responsible for genotype assay optimization, data production and data delivery (particularly S. Cawley, R. Mei, H. Fakhrai-Rad, H. Francis-Land, R. Pillai); L. Forty, G. Fraser, J. Heron, S. Hyde, A. Massey; F. Oyebode, E. Russell, M. Sinclair, A. Stern, N. Walker and S. Zammitt (recruitment and phenotypic assessment of BD cases); M. Yuille, B. Ollier and the UK DNA Banking Network and members of the BHF Family Heart Study Research Group (CAD case recruitment and DNA provision); S. Goldthorpe, D. Soars and J. Whittaker for CD collections; J. Pembroke, M. Bruce, S. Colville-Stewart, K. Edwards, L. Gatherer, C. Gemmell, K. Gilmour, S. Hampson, S. Hood, J. Hunt, J. Hussein, J. Jamieson, J. Kent, D. Lloyd, K. MacFarlane, S. Mellow, A. Nixon, J. Pheby, D. Picton, F. Porteus, P. Whitworth, K. Witte, A. Zawadzka, C. Mein and the Barts and The London Genome Centre (HT sample collection); H. Withers, the research nurses and the membership of the British Society for Paediatric Endocrinology and Diabetes (T1D case recruitment); and M. Sampson, S. O'Rahilly, S. Howell, M. Murphy and A. Wilson (T2D case recruitment). Essential informatics support was provided by the administration, systems, bioinformatics, data services and DNA teams of the JDRF/WT DIL; the Web System teams at the Sanger Institute (particularly R. Pettitt); D. Holland and R. Vincent. T. Dibling, C. Hind, D. Simpkin, P. Ewels and D. Moore provided genotyping assistance.
Personal support was provided by: Arthritis Research Campaign (A.B., H.Do., S.E., P.G., S.H., A.H., S.J., C.P., A.S., D.S., W.T., J.Wo.); British Heart Foundation (S.G.B., N.J.S., A.Do., C.W.); Cancer Research UK (D.Ea.); Diabetes UK (R.M.F.); Cure Crohn's and Colitis Fund (F.R.C.); CORE (C.M.O.); SIM (G.B.); Leverhulme Trust (A.P.M.); Throne-Holst Foundation (C.M.L.); UK Medical Research Council (D.P.K., M.D.T., J.R.P.); Vandervell Foundation (M.N.W.); and Wellcome Trust (D.G.C., L.R.C., C.M., J.Sat., M.T., A.T.H., E.Z., C.B., S.J.B., A.C., K.D., J.Gh., R.G., S.E.H., A.K., E.K., R.McG., S.P., R.R., P.Wh., D.W., P.De.).
Author Information Affymetrix GeneChip Mapping 500K Set Arrays 250K_Nsp_SNP and 250K_Sty2_SNP are deposited in NCBI GEO under accession numbers GPL3718 and GPL3720, respectively. Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to P.D. (donnelly@stats.ox.ac.uk).
The Wellcome Trust Case Control Consortium
Management Committee: Paul R. Burton 1, David G. Clayton 2, Lon R. Cardon 3, Nick Craddock 4, Panos Deloukas 5, Audrey Duncanson 6, Dominic P. Kwiatkowski 3,5, Mark I. McCarthy 3,7, Willem H. Ouwehand 8,9, Nilesh J. Samani 10, John A. Todd 2 & Peter Donnelly (Chair) 11
Data and Analysis Committee: Jeffrey C. Barrett 3, Paul R. Burton 1, Dan Davison 11, Peter Donnelly 11, Doug Easton 12, David Evans 3, Hin-Tak Leung 2, Jonathan L. Marchini 11, Andrew P. Morris 3, Chris C. A. Spencer 11, Martin D. Tobin 1, Lon R. Cardon (Co-chair) 3 & David G. Clayton (Co-chair) 2
UK Blood Services and University of Cambridge Controls: Antony P. Attwood 5,8, James P. Boorman 8,9, Barbara Cant 8, Ursula Everson 13, Judith M. Hussey 14, Jennifer D. Jolley 8, Alexandra S. Knight 8, Kerstin Koch 8, Elizabeth Meech 15, Sarah Nutland 2, Christopher V. Prowse 16, Helen E. Stevens 2, Niall C. Taylor 8, Graham R. Walters 17, Neil M. Walker 2, Nicholas A.
Watkins 8,9, Thilo Winzer 8, John A. Todd 2 & Willem H. Ouwehand 8,9
1958 Birth Cohort Controls: Richard W. Jones 18, Wendy L. McArdle 18, Susan M. Ring 18, David P. Strachan 19 & Marcus Pembrey 18,20
Bipolar Disorder: Gerome Breen 21, David St Clair 21 (Aberdeen); Sian Caesar 22, Katherine Gordon-Smith 22,23, Lisa Jones 22 (Birmingham); Christine Fraser 23, Elaine K. Green 23, Detelina Grozeva 23, Marian L. Hamshere 23, Peter A. Holmans 23, Ian R. Jones 23, George Kirov 23, Valentina Moskvina 23, Ivan Nikolov 23, Michael C. O'Donovan 23, Michael J. Owen 23, Nick Craddock 23 (Cardiff); David A. Collier 24, Amanda Elkin 24, Anne Farmer 24, Richard Williamson 24, Peter McGuffin 24 (London); Allan H. Young 25 & I. Nicol Ferrier 25 (Newcastle)
Coronary Artery Disease: Stephen G. Ball 26, Anthony J. Balmforth 26, Jennifer H. Barrett 26, D. Timothy Bishop 26, Mark M. Iles 26, Azhar Maqbool 26, Nadira Yuldasheva 26, Alistair S. Hall 26 (Leeds); Peter S. Braund 10, Paul R. Burton 1, Richard J. Dixon 10, Massimo Mangino 10, Suzanne Stevens 10, Martin D. Tobin 1, John R. Thompson 1 & Nilesh J. Samani 10 (Leicester)
Crohn's Disease: Francesca Bredin 27, Mark Tremelling 27, Miles Parkes 27 (Cambridge); Hazel Drummond 28, Charles W. Lees 28, Elaine R. Nimmo 28, Jack Satsangi 28 (Edinburgh); Sheila A. Fisher 29, Alastair Forbes 30, Cathryn M. Lewis 29, Clive M. Onnie 29, Natalie J. Prescott 29, Jeremy Sanderson 31, Christopher G. Mathew 29 (London); Jamie Barbour 32, M. Khalid Mohiuddin 32, Catherine E. Todhunter 32, John C. Mansfield 32 (Newcastle); Tariq Ahmad 33, Fraser R. Cummings 33 & Derek P. Jewell 33 (Oxford)
Hypertension: John Webster 34 (Aberdeen); Morris J. Brown 35, David G. Clayton 2 (Cambridge); G. Mark Lathrop 36 (Evry); John Connell 37, Anna Dominiczak 37 (Glasgow); Nilesh J. Samani 10 (Leicester); Carolina A. Braga Marcano 38, Beverley Burke 38, Richard Dobson 38, Johannie Gungadoo 38, Kate L. Lee 38, Patricia B.
Munroe 38, Stephen J. Newhouse 38, Abiodun Onipinla 38, Chris Wallace 38, Mingzhan Xue 38, Mark Caulfield 38 (London); Martin Farrall 39 (Oxford)
Rheumatoid Arthritis: Anne Barton 40, The Biologics in RA Genetics and Genomics Study Syndicate (BRAGGS) Steering Committee*, Ian N. Bruce 40, Hannah Donovan 40, Steve Eyre 40, Paul D. Gilbert 40, Samantha L. Hider 40, Anne M. Hinks 40, Sally L. John 40, Catherine Potter 40, Alan J. Silman 40, Deborah P. M. Symmons 40, Wendy Thomson 40 & Jane Worthington 40
Type 1 Diabetes: David G. Clayton 2, David B. Dunger 2,41, Sarah Nutland 2, Helen E. Stevens 2, Neil M. Walker 2, Barry Widmer 2,41 & John A. Todd 2
Type 2 Diabetes: Timothy M. Frayling 42,43, Rachel M. Freathy 42,43, Hana Lango 42,43, John R. B. Perry 42,43, Beverley M. Shields 43, Michael N. Weedon 42,43, Andrew T. Hattersley 42,43 (Exeter); Graham A. Hitman 44 (London); Mark Walker 45 (Newcastle); Kate S. Elliott 3,7, Christopher J. Groves 7, Cecilia M. Lindgren 3,7, Nigel W. Rayner 3,7, Nicholas J. Timpson 3,46, Eleftheria Zeggini 3,7 & Mark I. McCarthy 3,7 (Oxford)
Tuberculosis: Melanie Newport 47, Giorgio Sirugo 47 (Gambia); Emily Lyons 3, Fredrik Vannberg 3 & Adrian V. S. Hill 3 (Oxford)
Ankylosing Spondylitis: Linda A. Bradbury 48, Claire Farrar 49, Jennifer J. Pointon 48, Paul Wordsworth 49 & Matthew A. Brown 48,49
Autoimmune Thyroid Disease: Jayne A. Franklyn 50, Joanne M. Heward 50, Matthew J. Simmonds 50 & Stephen C. L. Gough 50
Breast Cancer: Sheila Seal 51, Breast Cancer Susceptibility Collaboration (UK)*, Michael R. Stratton 51,52 & Nazneen Rahman 51
Multiple Sclerosis: Maria Ban 53, An Goris 53, Stephen J. Sawcer 53 & Alastair Compston 53
Gambian Controls: David Conway 47, Muminatou Jallow 47, Melanie Newport 47, Giorgio Sirugo 47 (Gambia); Kirk A. Rockett 3 & Dominic P. Kwiatkowski 3,5 (Oxford)
DNA, Genotyping, Data QC and Informatics: Suzannah J. Bumpstead 5, Amy Chaney 5, Kate Downes 2,5, Mohammed J. R.
Ghori 5, Rhian Gwilliam 5, Sarah E. Hunt 5, Michael Inouye 5, Andrew Keniry 5, Emma King 5, Ralph McGinnis 5, Simon Potter 5, Rathi Ravindrarajah 5, Pamela Whittaker 5, Claire Widden 5, David Withers 5, Panos Deloukas 5 (Wellcome Trust Sanger Institute, Hinxton); Hin-Tak Leung 2, Sarah Nutland 2, Helen E. Stevens 2, Neil M. Walker 2 & John A. Todd 2 (Cambridge)
Statistics: Doug Easton 12, David G. Clayton 2 (Cambridge); Paul R. Burton 1, Martin D. Tobin 1 (Leicester); Jeffrey C. Barrett 3, David Evans 3, Andrew P. Morris 3, Lon R. Cardon 3 (Oxford); Niall J. Cardin 11, Dan Davison 11, Teresa Ferreira 11, Joanne Pereira-Gale 11, Ingileif B. Hallgrímsdóttir 11, Bryan N. Howie 11, Jonathan L. Marchini 11, Chris C. A. Spencer 11, Zhan Su 11, Yik Ying Teo 3,11, Damjan Vukcevic 11 & Peter Donnelly 11 (Oxford)
Primary Investigators: David Bentley 5†, Matthew A. Brown 48,49, Lon R. Cardon 3, Mark Caulfield 38, David G. Clayton 2, Alistair Compston 53, Nick Craddock 23, Panos Deloukas 5, Peter Donnelly 11, Martin Farrall 39, Stephen C. L. Gough 50, Alistair S. Hall 26, Andrew T. Hattersley 42,43, Adrian V. S. Hill 3, Dominic P. Kwiatkowski 3,5, Christopher G. Mathew 29, Mark I. McCarthy 3,7, Willem H. Ouwehand 8,9, Miles Parkes 27, Marcus Pembrey 18,20, Nazneen Rahman 51, Nilesh J. Samani 10, Michael R. Stratton 51,52, John A. Todd 2 & Jane Worthington 40
*See Supplementary Information for details.
Affiliations for participants:
1 Genetic Epidemiology Group, Department of Health Sciences, University of Leicester, Adrian Building, University Road, Leicester LE1 7RH, UK.
2 Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Wellcome Trust/MRC Building, Cambridge CB2 0XY, UK.
3 Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.
4 Department of Psychological Medicine, Henry Wellcome Building, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK.
5 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
6 The Wellcome Trust, Gibbs Building, 215 Euston Road, London NW1 2BE, UK.
7 Oxford Centre for Diabetes, Endocrinology and Medicine, University of Oxford, Churchill Hospital, Oxford OX3 7LJ, UK.
8 Department of Haematology, University of Cambridge, Long Road, Cambridge CB2 2PT, UK.
9 National Health Service Blood and Transplant, Cambridge Centre, Long Road, Cambridge CB2 2PT, UK.
10 Department of Cardiovascular Sciences, University of Leicester, Glenfield Hospital, Groby Road, Leicester LE3 9QP, UK.
11 Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK.
12 Cancer Research UK Genetic Epidemiology Unit, Strangeways Research Laboratory, Worts Causeway, Cambridge CB1 8RN, UK.
13 National Health Service Blood and Transplant, Sheffield Centre, Longley Lane, Sheffield S5 7JN, UK.
14 National Health Service Blood and Transplant, Brentwood Centre, Crescent Drive, Brentwood CM15 8DP, UK.
15 The Welsh Blood Service, Ely Valley Road, Talbot Green, Pontyclun CF72 9WB, UK.
16 The Scottish National Blood Transfusion Service, Ellen's Glen Road, Edinburgh EH17 7QT, UK.
17 National Health Service Blood and Transplant, Southampton Centre, Coxford Road, Southampton SO16 5AF, UK.
18 Avon Longitudinal Study of Parents and Children, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, UK.
19 Division of Community Health Services, St George's University of London, Cranmer Terrace, London SW17 0RE, UK.
20 Institute of Child Health, University College London, 30 Guilford Street, London WC1N 1EH, UK.
21 University of Aberdeen, Institute of Medical Sciences, Foresterhill, Aberdeen AB25 2ZD, UK.
22 Department of Psychiatry, Division of Neuroscience, Birmingham University, Birmingham B15 2QZ, UK.
²³Department of Psychological Medicine, Henry Wellcome Building, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK. ²⁴SGDP, The Institute of Psychiatry, King's College London, De Crespigny Park, Denmark Hill, London SE5 8AF, UK. ²⁵School of Neurology, Neurobiology and Psychiatry, Royal Victoria Infirmary, Queen Victoria Road, Newcastle upon Tyne, NE1 4LP, UK. ²⁶LIGHT and LIMM Research Institutes, Faculty of Medicine and Health, University of Leeds, Leeds LS1 3EX, UK. ²⁷IBD Research Group, Addenbrooke's Hospital, University of Cambridge, Cambridge CB2 2QQ, UK. ²⁸Gastrointestinal Unit, School of Molecular and Clinical Medicine, University of Edinburgh, Western General Hospital, Edinburgh EH4 2XU, UK. ²⁹Department of Medical & Molecular Genetics, King's College London School of Medicine, 8th Floor Guy's Tower, Guy's Hospital, London SE1 9RT, UK. ³⁰Institute for Digestive Diseases, University College London Hospitals Trust, London, NW1 2BU, UK. ³¹Department of Gastroenterology, Guy's and St Thomas' NHS Foundation Trust, London SE1 7EH, UK. ³²Department of Gastroenterology & Hepatology, University of Newcastle upon Tyne, Royal Victoria Infirmary, Newcastle upon Tyne NE1 4LP, UK. ³³Gastroenterology Unit, Radcliffe Infirmary, University of Oxford, Oxford OX2 6HE, UK. ³⁴Medicine and Therapeutics, Aberdeen Royal Infirmary, Foresterhill, Aberdeen, Grampian AB9 2ZB, UK. ³⁵Clinical Pharmacology Unit and the Diabetes and Inflammation Laboratory, University of Cambridge, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2QQ, UK. ³⁶Centre National de Genotypage, 2, Rue Gaston Cremieux, Evry, Paris 91057, France. ³⁷BHF Glasgow Cardiovascular Research Centre, University of Glasgow, 126 University Place, Glasgow G12 8TA, UK. ³⁸Clinical Pharmacology and Barts and The London Genome Centre, William Harvey Research Institute, Barts and The London, Queen Mary's School of Medicine, Charterhouse Square, London EC1M 6BQ, UK.
³⁹Cardiovascular Medicine, University of Oxford, Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK. ⁴⁰arc Epidemiology Research Unit, University of Manchester, Stopford Building, Oxford Rd, Manchester M13 9PT, UK. ⁴¹Department of Paediatrics, University of Cambridge, Addenbrooke's Hospital, Cambridge CB2 2QQ, UK. ⁴²Genetics of Complex Traits, Institute of Biomedical and Clinical Science, Peninsula Medical School, Magdalen Road, Exeter EX1 2LU, UK. ⁴³Diabetes Genetics, Institute of Biomedical and Clinical Science, Peninsula Medical School, Barrack Road, Exeter EX2 5DU, UK. ⁴⁴Centre for Diabetes and Metabolic Medicine, Barts and The London, Royal London Hospital, Whitechapel, London E1 1BB, UK. ⁴⁵Diabetes Research Group, School of Clinical Medical Sciences, Newcastle University, Framlington Place, Newcastle upon Tyne NE2 4HH, UK. ⁴⁶The MRC Centre for Causal Analyses in Translational Epidemiology, Bristol University, Canynge Hall, Whiteladies Rd, Bristol BS2 8PR, UK. ⁴⁷MRC Laboratories, Fajara, The Gambia. ⁴⁸Diamantina Institute for Cancer, Immunology and Metabolic Medicine, Princess Alexandra Hospital, University of Queensland, Woolloongabba, Qld 4102, Australia. ⁴⁹Botnar Research Centre, University of Oxford, Headington, Oxford OX3 7BN, UK. ⁵⁰Department of Medicine, Division of Medical Sciences, Institute of Biomedical Research, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK. ⁵¹Section of Cancer Genetics, Institute of Cancer Research, 15 Cotswold Road, Sutton SM2 5NG, UK. ⁵²Cancer Genome Project, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. ⁵³Department of Clinical Neurosciences, University of Cambridge, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2QQ, UK. †Present address: Illumina Cambridge, Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK.
ARTICLES NATURE | Vol 447 | 7 June 2007 ©2007 Nature Publishing Group

METHODS

BD phenotype description. BD cases were all over the age of 16 yr, living in mainland UK and of European descent. Recruitment was undertaken throughout the UK by teams based in Aberdeen (8% of cases), Birmingham (35% of cases), Cardiff (33% of cases), London (15% of cases) and Newcastle (9% of cases). Individuals who had been in contact with mental health services were recruited if they suffered with a major mood disorder in which clinically significant episodes of elevated mood had occurred. This was defined as a lifetime diagnosis of a bipolar mood disorder according to Research Diagnostic Criteria¹¹⁹ and included the bipolar subtypes that have been shown in family studies to co-aggregate (for example, ref. 29): bipolar I disorder (71% of cases), schizoaffective disorder bipolar type (15% of cases), bipolar II disorder (9% of cases) and manic disorder (5% of cases). After providing written informed consent, all subjects were interviewed by a trained psychologist or psychiatrist using a semi-structured lifetime diagnostic psychiatric interview (in most cases the Schedules for Clinical Assessment in Neuropsychiatry¹²⁰) and available psychiatric medical records were reviewed. Using all available data, best-estimate ratings were made for a set of key phenotypic measures on the basis of the OPCRIT checklist (which covers both psychopathology and course of illness)¹²¹,¹²² and lifetime psychiatric diagnoses were assigned according to the Research Diagnostic Criteria¹¹⁹. The reliability of these methods has been shown to be high¹¹⁹,¹²³,¹²⁴. Further details of clinical methodology can be found in Green, 2005 (ref. 123) and Green, 2006 (ref. 124).

CAD phenotype description. CAD cases had a validated history of either myocardial infarction or coronary revascularization (coronary artery bypass surgery or percutaneous coronary angioplasty) before their 66th birthday. Verification of the history of CAD was required either from hospital records or the primary care physician.
Recruitment was carried out on a national basis in the UK through a direct approach to the public via (1) the media and (2) mailing all general practices (family physicians) with information about the study, as previously described¹²⁵. In an initial pilot phase, potential participants were also identified and approached through local CAD databases in the two lead centres (Leeds and Leicester). Although the majority of subjects had at least one further sib also affected with premature CAD, only one subject from each family was included in the present study.

CD phenotype description. CD cases were attendees at inflammatory bowel disease clinics in and around the five centres that contributed samples to the WTCCC (Cambridge, Oxford, London, Newcastle, Edinburgh). Ascertainment was based on a confirmed diagnosis of Crohn's disease (CD) using conventional endoscopic, radiological and histopathological criteria¹²⁶. We included all subtypes of CD as classified by disease extent and behaviour, and the collection was not specifically enriched for family history or early age of onset. The median age of diagnosis was 26.1 yr and 62% of the collection had undergone CD-related abdominal surgery. A small proportion had previously been recruited as members of multiply affected families but only one affected individual was included per family.

HT phenotype description. HT cases comprised severely hypertensive probands ascertained from families with multiplex affected sibships or as parent–offspring trios. They were of white British ancestry (up to level of grandparents) and were recruited from the Medical Research Council General Practice Framework and other primary care practices in the UK⁷⁷. Each case had a history of hypertension diagnosed before 60 yr of age, with confirmed blood pressure recordings corresponding to seated levels >150/100 mmHg (if based on one reading), or the mean of 3 readings greater than 145/95 mmHg.
These criteria correspond to the threshold for the uppermost 5% of blood pressure distribution in a contemporaneous health screening survey of 5,000 British men and women in 1995 (N. Wald and M. Law, personal communication). We excluded hypertensive individuals who self-reportedly consumed >21 units of alcohol per week and those with diabetes, intrinsic renal disease, a history of secondary hypertension or co-existing illness. Cases did not undergo systematic genetic screening to exclude the (rare) known monogenic causes of HT. We focused on the recruitment of hypertensive individuals with body mass indices <30 kg m⁻². The probands were extensively phenotyped by trained nurses (see http://www.brightstudy.ac.uk for standard operating procedures, additional phenotypes and study questionnaires). Sample selection for WTCCC was based on DNA availability and quality.

RA phenotype description. RA cases were recruited to studies coordinated by the ARC (Arthritis Research Campaign) Epidemiology Unit. All subjects were Caucasian, over the age of 18 yr and satisfied the 1987 American College of Rheumatology Criteria for RA¹²⁷ modified for genetic studies¹²⁸. Of the cases, 404 were recruited as part of the arc National Repository of Family Material¹²⁹: of these, 301 were probands from affected sibling pair families and 103 were cases from trio families, having both parents or one parent and one unaffected sibling available for study. A further 109 cases were recruited from the Norfolk Arthritis Register, a primary care-based inception collection¹³⁰. All other cases (n = 1,348) were recruited from NHS Rheumatology Clinics throughout the UK. Samples for WTCCC were selected from the various studies on the basis of the quality and availability of DNA.

T1D phenotype description. T1D cases were recruited from paediatric and adult diabetes clinics at 150 National Health Service hospitals across mainland UK.
The total T1D case data set (n = ~8,000), from which the WTCCC cases were selected, represents close to half the T1D cases seen in such clinics. Nationwide coverage was achieved through the voluntary efforts of members of the British Society for Paediatric Endocrinology and Diabetes, who recruited about half of cases, the rest coming from peripatetic nurses employed by the JDRF/WT GRID project (http://www-gene.cimr.cam.ac.uk/todd/)¹³¹. To establish a positive diagnosis of T1D (and, in particular, to distinguish it from the more common, but later onset T2D), we required all cases to have an age of diagnosis below 17 yr and insulin dependence since diagnosis (for a period of at least 6 months). However, a very few subjects were subsequently discovered to be suffering from rare monogenic disorders, such as maturity onset diabetes of the young (MODY), and latterly permanent neonatal diabetes (PNDM): these were excluded.

T2D phenotype description. The T2D cases were selected from UK Caucasian subjects who form part of the Diabetes UK Warren 2 repository. In each case, the diagnosis of diabetes was based on either current prescribed treatment with sulphonylureas, biguanides, other oral agents and/or insulin or, in the case of individuals treated with diet alone, historical or contemporary laboratory evidence of hyperglycaemia (as defined by the World Health Organization). Other forms of diabetes (for example, maturity-onset diabetes of the young, mitochondrial diabetes, and type 1 diabetes) were excluded by standard clinical criteria based on personal and family history. Criteria for excluding autoimmune diabetes included absence of first-degree relatives with T1D, an interval of ≥1 yr between diagnosis and institution of regular insulin therapy and negative testing for antibodies to glutamic acid decarboxylase (anti-GAD). Cases were limited to those who reported that all four grandparents had exclusively British and/or Irish origin, by both self-reported ethnicity and place of birth. All were diagnosed between age 25 and 75.
Approximately 30% were explicitly recruited as part of multiplex sibships¹³² and ~25% were offspring in parent–offspring 'trios' or 'duos' (that is, families comprising only one parent complemented by additional sibs)¹³³. The remainder were recruited as isolated cases but these cases were (compared to population-based cases) of relatively early onset and had a high proportion of T2D parents and/or siblings¹³⁴. Cases were ascertained across the UK but were centred around the main collection centres (Exeter, London, Newcastle, Norwich, Oxford). Selection of the samples typed in WTCCC from the larger collections was based primarily on DNA availability and success in passing Diabetes and Inflammation Laboratory (DIL)/Wellcome Trust Sanger Institute (WTSI) DNA quality control.

1958 Birth Cohort Controls (58BC). The 1958 Birth Cohort (also known as the National Child Development Study) includes all births in England, Wales and Scotland during one week in 1958. From an original sample of over 17,000 births, survivors were followed up at ages 7, 11, 16, 23, 33 and 42 yr (http://www.cls.ioe.ac.uk/studies.asp?section=000100020003)¹³⁵. In a biomedical examination at 44–45 yr¹³⁶ (http://www.b58cgene.sgul.ac.uk/followup.php), 9,377 cohort members were visited at home, providing 7,692 blood samples with consent for future Epstein–Barr virus (EBV)-transformed cell lines. DNA samples extracted from 1,500 cell lines of self-reported white ethnicity and representative of gender and each geographical region were selected for use as controls.

UK Blood Services Controls (UKBS). The second set of common controls was made up of 1,500 individuals selected from a sample of blood donors recruited as part of the current project. WTCCC, in collaboration with the UK Blood Services (NHSBT in England, SNBTS in Scotland and WBS in Wales), set up a UK national repository of anonymized samples of DNA and viable mononuclear cells from 3,622 consenting blood donors, age range 18–69 yr (ethical approval 05/Q0106/74).
A set of 1,564 samples was selected from the 3,622 samples recruited, based on sex and geographical region (to reproduce the distribution of the samples of the 1958 Birth Cohort), for use as common controls in the WTCCC study. DNA was extracted as described below with a yield of 3,054 ± 1,207 μg (mean ± 1 s.d.).

Protocol for DNA extraction. White blood cells were isolated from the filters by first pushing 10 ml air through the filter in contra direction to the initial blood flow through the filter, followed by 40 ml PBS, collecting into a 50 ml centrifuge tube, and centrifugation (2,000 r.p.m., 10 min, 20 °C). Cells were lysed by adding 40 ml Lysis buffer (320 mM Sucrose, 1% Triton-X-100, 4.9 mM MgCl₂, 1 mM TRIS-HCl pH 7.4) and pelleted by centrifugation (2,500 r.p.m., 15 min, 4 °C). Pellets were frozen before extraction. Pellets were digested overnight at 37 °C with 5.25 M GuHCl, 490 mM NH₄Ac, 1.25% Na Sarcosyl and 0.125 mg ml⁻¹ Proteinase K and then mixed with 2 ml chloroform to form a white emulsion. The aqueous layer was separated by centrifugation (2,500 r.p.m., 3 min) and DNA was precipitated in ethanol overnight at −20 °C. DNA was further precipitated by rotation (40 r.p.m., 5 min) and then pelleted by centrifugation (3,000 r.p.m., 15 min). Pellets were washed twice by rinsing with 2 ml 70% ethanol, followed by centrifugation (3,000 r.p.m., 5 min). DNA pellets were air-dried before re-suspension in TE buffer (10 mM Tris, 0.1 mM EDTA).

doi:10.1038/nature05911

Sample handling. Each participating sample collection was issued unique WTCCC barcode labels and a spreadsheet with unique sample identifiers for logging information on case/control status, DNA concentration (requested at 100 ng μl⁻¹), DNA extraction method, sex, broad geographical region and age at recruitment. Each collection supplied 10 μg aliquots of anonymized samples in bar-coded, deep 96-well plates.
On receipt, samples had their DNA concentration measured by Picogreen (triplicate measurements), were checked for DNA degradation on a 0.75% agarose gel, and were genotyped with up to 38 SNPs arranged in two multiplex reactions using the MassExtend (hME) and/or iPLEX³⁷ assay. The above SNPs served for obtaining a molecular fingerprint (25 of the 38 SNPs were present on the GeneChip 500K) and experimentally confirming the sex of each sample.

Samples with concentrations ≥50 ng μl⁻¹, showing limited or no degradation, having a minimum of 7/10 (hME reaction) and/or 14/23 (iPLEX reaction) SNPs typed, and having the sex markers in agreement or not violating the supplied information were deemed fit for whole genome genotyping. Note that the hME set was replaced with a second iPLEX reaction in the course of the project to increase marker density. We selected 2,000 and 1,500 samples from each disease and control collection respectively. Selected samples were normalized to 50 ng μl⁻¹ and re-arrayed robotically into 96-well plates so that each plate was composed of 94 samples representing at least two different collections at a ratio of 1:1. For each collection, the selected samples were balanced first for sex and then geographical region (see above).

Genotyping. SNP genotyping was performed with the commercial release of the GeneChip 500K arrays at Affymetrix Services Lab. A modified version of the genotyping assay developed for the 100K Mapping Array¹³⁷ was used. In brief, two aliquots of 250 ng of DNA each are digested with NspI and StyI, respectively, an adaptor is ligated and molecules are then fragmented and labelled. At this stage each enzyme preparation is hybridized to the corresponding SNP array (262,000 and 238,000 on the NspI and StyI array respectively). Samples were processed in 96-well plate format up to the hybridization step; each plate carried a positive and a negative control.
Individual arrays not passing the 93% call rate threshold at P = 0.33 with the Dynamic Model algorithm¹³⁸ were repeated (fresh aliquot of initial end-labelled reaction). Samples failing twice at the hybridization stage were reprocessed using a fresh DNA aliquot. Affymetrix delivered successful samples as those having a Dynamic Model call rate of 93% at P = 0.33 for each array, over 90% concordance for the 50 SNPs that are common to the two arrays, agreement between the two arrays on gender, and over 70% identity to the Sequenom genotypes supplied by WTCCC.

CEL files provided the intensities of the various probes on each chip. Initially, genotypes were called with the Dynamic Model¹³⁸ algorithm. Affymetrix subsequently developed an improved algorithm, BRLMM (Bayesian Robust Linear Model with Mahalanobis distance classifier¹³⁹,¹⁴⁰). This processes batches of samples and uses clustering techniques to call genotypes (the 'mismatch' probe intensities are not used). In Affymetrix's standard protocol it is applied in batches of 96 samples (plates). This is, of course, a very small sample size and, for some SNPs, some clusters will contain few, if any, observations. This might be countered by combining information about cluster location over a large number of SNPs.

Throughout, physical coordinates refer to NCBI build-35 of the human genome. Alleles are expressed in the forward (+) strand of the reference human genome (NCBI build-35).

Power calculations.
We assessed power of the Affymetrix 500K chip using the following simulation experiment. Separately for each SNP with MAF > 5% in the 10 HapMap ENCODE regions, we assumed the SNP was causative and simulated genotype data at all SNPs in the same region as the putative disease SNP in case-control panels of 2,000 cases and 3,000 controls with linkage disequilibrium patterns that match those in HapMap. For controls, these simulations were based on the imputation algorithm described below (with all genotype data initially set to missing in the 3,000 control individuals). For cases, the assumed effect size was first used to calculate genotype frequencies in cases (via Bayes' theorem), and genotypes in cases at the putative SNP were then simulated independently from these calculated frequencies. Genotypes at all other SNPs in the region in cases were then simulated using the imputation algorithm described below (with all data other than the genotypes at the causative SNP initially set to missing in the cases). For each such simulated case-control panel, trend tests were performed at each of the SNPs in the region that are actually on the Affymetrix chip, and if any of these reached the stated P-value threshold the putative disease SNP was deemed to be detected, and otherwise to be undetected. Power estimates are then calculated as the proportion of putative disease SNPs with MAFs > 5% across the HapMap ENCODE regions that are detected at the given P-value threshold. There are various approximations here. Actual numbers of cases and controls for each disease are slightly smaller than the 2,000:3,000 values used in the simulations, but in the other direction, our simulations ignore the possibility that a disease SNP might be detected by a genotyped SNP outside its ENCODE region. The accuracy reported below of the imputation algorithm in imputing genotypes leads us to believe these simulations should be a reasonable proxy for real data.
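The Bayes' theorem step for the case genotype frequencies can be illustrated with a minimal sketch. It assumes Hardy–Weinberg proportions in the general population and a multiplicative per-allele relative risk; both are simplifying assumptions made here for illustration, and the function names and example values (risk allele frequency 0.2, relative risk 1.3) are hypothetical rather than taken from the study. The imputation-based simulation of the surrounding SNPs is not reproduced.

```python
import random


def case_genotype_freqs(risk_allele_freq, relative_risk):
    """P(genotype | case) for genotypes carrying 0, 1, 2 risk alleles.

    Bayes' theorem: P(g | case) is proportional to P(case | g) * P(g).
    Assumed here: P(g) follows Hardy-Weinberg proportions and
    P(case | g) is proportional to relative_risk ** g.
    """
    p = risk_allele_freq
    population = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    weighted = [f * relative_risk ** g for g, f in enumerate(population)]
    total = sum(weighted)
    return [w / total for w in weighted]


def simulate_case_genotypes(freqs, n_cases, seed=0):
    """Draw independent case genotypes (0, 1 or 2) from the given frequencies."""
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=freqs, k=n_cases)


# Hypothetical example values, chosen only to exercise the functions.
freqs = case_genotype_freqs(risk_allele_freq=0.2, relative_risk=1.3)
cases = simulate_case_genotypes(freqs, n_cases=2000)
```

Because the baseline disease risk cancels in the normalization, only the allele frequency and the relative risk are needed to obtain the case frequencies under these assumptions.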
Some such simulation is needed if power calculations are to take account of the fact that any given putative disease SNP could typically be detected by several SNPs on the chip. Exploitation of this simulation approach to assess power across different platforms and SNP chips and for different experimental designs will be reported elsewhere.

CHIAMO. We developed a new genotype calling algorithm, CHIAMO, which is applied after quantile normalization of the data from each sample. A complete description is given in Supplementary Information. We briefly summarize some features here. Normalized intensities for each genotype were mapped to a two-dimensional intensity vector and then we applied CHIAMO, which uses a bayesian hierarchical 4-class mixture model to call genotypes for the whole project. We used optimization based on 12 random starts to find the set of parameters ($\hat{\theta}$) that maximize the posterior distribution of the model. This parameter set was used to calculate the maximum a posteriori estimates of the probabilities of each genotype call, $\Pr(Z_{ij} \mid \text{Data}, \hat{\theta})$, where $Z_{ij} \in \{0, 1, 2, 3\} \equiv \{\text{AA}, \text{AB}, \text{BB}, \text{null}\}$ is the genotype call for individual j in collection i. All CHIAMO genotype calls analysed in this paper were based on an a posteriori probability threshold of 0.9 for making a call, following our analysis of the relationship between concordance and missing data rates (data not shown).
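The thresholding rule can be sketched as follows, taking the four posterior probabilities for one individual at one SNP as given. This is only an illustration of applying a 0.9 threshold; it does not implement the CHIAMO mixture model, the function name is hypothetical, and returning a confidently assigned null class as the label "null" is an assumption about downstream handling.

```python
GENOTYPE_CLASSES = ("AA", "AB", "BB", "null")


def call_genotype(posteriors, threshold=0.9):
    """Return the most probable class label, or None (missing) if no class
    reaches the posterior probability threshold.

    posteriors: four probabilities ordered as GENOTYPE_CLASSES.
    """
    best = max(range(len(GENOTYPE_CLASSES)), key=lambda k: posteriors[k])
    if posteriors[best] >= threshold:
        return GENOTYPE_CLASSES[best]
    return None


# A confident call, and an ambiguous point lying between two clusters.
confident = call_genotype([0.97, 0.02, 0.00, 0.01])
ambiguous = call_genotype([0.55, 0.44, 0.00, 0.01])
```

Raising the threshold trades a higher missing data rate for fewer discordant calls, which is the relationship the text says was examined when choosing 0.9.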
CHIAMO differs from BRLMM in several respects: (1) it uses a different transformation of the CEL files to give the two-dimensional summary for each individual at an SNP, leading to better defined clusters; (2) it makes use of mismatch probe signals; (3) it uses a different method for fitting the clusters; and (4) it allows the data for all samples to be called simultaneously, thus allowing better estimation of cluster location and shape parameters, while making allowance for possible differences in these parameter values between case/control groups that could arise as a result of differences in DNA quality. This is achieved using a hierarchical statistical model that specifies the joint distribution of the three cluster centres, their spread, and likely allele frequencies (using HapMap) and genotype frequencies (centred on Hardy–Weinberg proportions but allowing some variation). CHIAMO improved both call rate and accuracy in comparison to BRLMM, the current standard Affymetrix calling algorithm (Supplementary Table 3): it roughly halved missing data rates and discordance rates with another platform. See Supplementary Information for full details, discussion of some challenges for genotype calling, and example cluster plots (Supplementary Figs 10 and 17).

Quantile-quantile plots. Quantile-quantile (Q-Q) plots are constructed by ranking a set of values of a statistic from smallest to largest (the 'order statistics') and plotting them against their expected values, given the assumption that the values have been sampled from a distribution of known theoretical form (in our case, the chi-squared distribution, usually on one degree of freedom; for example, the distribution of our trend tests under the null hypothesis).
Deviations from the line of equality indicate either that the theoretical distribution is incorrect, or that the sample is contaminated with values generated in some other manner (for example, by a true association). To aid interpretation of such plots we have also calculated 95% 'concentration bands' (shaded grey in all Q-Q plots). These are formed by calculating, for each order statistic, the 2.5th and 97.5th centiles of the distribution of the order statistic under random sampling and the null hypothesis (for details see ref. 141). We should add two notes of caution. First, concentration bands are calculated point by point and, although there are very strong correlations between nearby order statistics, the probability that a real quantile-quantile plot will stray outside the concentration band at some point is somewhat larger than 5%. Second, the theoretical chi-squared distribution is an approximation, valid for large samples; it is not clear whether this approximation continues to hold into the extreme right hand tail of the distribution explored in a GWA study (although the indications are that it is probably not far wrong for a study as large as ours).

Data quality control. Of samples for which Affymetrix returned CEL files, a total of 809 were excluded from the analysis. A complete breakdown by collection is given in Supplementary Table 4. Missing data rate per sample acts as an indicator of low DNA quality. Most samples had very low rates of missing data (study-wide average 0.00925, standard deviation 0.0187) and we chose to exclude 250 samples with >3% missing data across all SNPs (Supplementary Fig. 18, and Supplementary Tables 4 and 13). We also set empirical thresholds on genome-wide heterozygosity (excess heterozygosity in particular may indicate contamination). Six samples with >30% heterozygosity and a further three with <23% heterozygosity were excluded (see Supplementary Fig. 18).
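The two per-sample filters just described (missing data rate and genome-wide heterozygosity) amount to simple threshold checks. A minimal sketch, using the thresholds quoted in the text; the record layout, field names and example samples are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    sample_id: str
    missing_rate: float     # fraction of SNPs with no genotype call
    heterozygosity: float   # fraction of called genotypes that are heterozygous


def exclusion_reasons(sample, max_missing=0.03, het_low=0.23, het_high=0.30):
    """List the per-sample filters that a sample fails (empty list if it passes)."""
    reasons = []
    if sample.missing_rate > max_missing:
        reasons.append("missing_rate")
    if sample.heterozygosity > het_high:
        reasons.append("high_heterozygosity")
    if sample.heterozygosity < het_low:
        reasons.append("low_heterozygosity")
    return reasons


# Hypothetical samples, for illustration only.
samples = [
    SampleStats("S1", missing_rate=0.009, heterozygosity=0.265),
    SampleStats("S2", missing_rate=0.051, heterozygosity=0.262),
    SampleStats("S3", missing_rate=0.012, heterozygosity=0.318),
]
retained = [s.sample_id for s in samples if not exclusion_reasons(s)]
```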
We excluded 16 samples with discrepancies between WTCCC information and external identifying information (such as genotypes from another experiment, blood type or incorrect disease status). We sought to detect individuals with non-Caucasian ancestry using multi-dimensional scaling to provide a two-dimensional projection of the data whose axes represent geographic genetic variation. In the interest of computational efficiency and to avoid confounding of the multi-dimensional scaling by extended linkage disequilibrium we thinned the data to a set of 71,458 SNPs, within which no pair were correlated with r² > 0.2. For this set of nearly independent SNPs we computed genome-wide average identity by state (sum of the number of identical-by-state alleles at each locus divided by twice the number of loci) between each pair of individuals in each sample collection along with the 270 HapMap samples. We converted these identity-by-state relationships to distances by subtracting them from 1, and the matrix of pairwise identity-by-state values was used as input to multi-dimensional scaling. The projection onto the two multi-dimensional scaling axes is shown in Supplementary Fig. 5. We excluded 153 samples that were clearly separate from the main cluster of WTCCC individuals. Exclusion of these individuals resulted in a substantial reduction in estimates of over-dispersion in test statistic distributions (data not shown). We also excluded 295 duplicated (>99% identity) and 86 related (86–98% identity) samples from the analysis.

Filtering out suboptimal markers depends on both the platform and the genotype calling algorithm.
We experimented with various quality metrics for CHIAMO calls, for example, based on the location and/or separation of the clusters, but found that the best indicator of a SNP being difficult to call was the amount of missing data in its calls: CHIAMO consistently marked many individuals missing for SNPs with poorly defined or overlapping clusters, whereas it successfully called genotypes for nearly all individuals on high-quality SNPs (data not shown). We excluded 26,567 SNPs with a study-wide missing data rate >5% (Supplementary Fig. 19), or >1% for SNPs with a study-wide MAF <5%. We additionally excluded 4,351 SNPs with Hardy–Weinberg exact P value < 5.7 × 10⁻⁷ in the combined set of 2,938 controls, and 93 SNPs with P value < 5.7 × 10⁻⁷ for either a one- or two-degree of freedom test of association between the two control groups (corresponding to a 1 d.f. chi-squared statistic of about 25). See Supplementary Fig. 20 and Fig. 1 respectively for the empirical distributions of these statistics used to motivate the thresholds above. Overall, we found that the 809 excluded individuals (which represent 4.8% of the study samples) accounted for 35.6% of the missing data at non-excluded SNPs. In total, 469,557 SNPs passed the quality control filters.

Supplementary Fig. 20 shows the effect of quality control filters, and visual inspection of the cluster plots of SNPs showing apparently strong association, on quantile-quantile plots for one disease (T2D; others are similar), and the success of these filters in excluding poorly performing SNPs. The figure (panel d) also shows the marked effect on the tails of the distribution of test statistics of regions of genuine association (for this disease the three regions removed because of strong evidence of association have all been independently replicated, see main text). The aim in filtering is to exclude poor SNPs but without removing genuine associations. No single criterion will do this.
In order not to exclude possible genuine associations, we chose to apply relatively light quality control filters but then to subject all apparently associated SNPs to visual inspection of cluster plots (see Supplementary Information). Around 100 cluster plots were assessed per disease.

We used X-chromosome SNPs to check for sex discrepancies with the sample files (Supplementary Fig. 21). These were fed back to disease groups for amendment and verification. The ~80 samples where it was not possible to discern the source of the discrepancy were left in the study for analysis, on the grounds that mishandling was considered unlikely to have introduced samples with altogether different phenotypes.

Differences in DNA quality between cases and controls could result in false-positive associations through differential effects on genotype calling¹¹¹. DNAs in our study came from various sources between, and in some cases within, case and control series, but with the combination of centralized sample quality control, simultaneous genotype calling with CHIAMO (which explicitly allows for differences between collections), and inspection of cluster plots for SNPs with very small P values, our study did not experience such difficulties.

Comparing linkage disequilibrium. Two questions which have been raised about the HapMap data are how well it describes linkage disequilibrium in populations other than the ones that were sampled, and whether the sample sizes in HapMap (60 Caucasian individuals, for example) are adequate to describe patterns of linkage disequilibrium. With data on 2,938 controls and 16,179 individuals in total at around 400,000 polymorphic SNPs, we are well placed to address this for the British population. Initial analyses suggest that patterns of linkage disequilibrium in our samples are very similar to those in HapMap. As an example, Supplementary Fig.
3 compares patterns of linkage disequilibrium in HapMap CEU individuals and our 58C sample at SNPs on the Affymetrix chip across 22 × 1 Mb regions of the genome and they seem almost identical. We calculated r² values directly from the phased haplotypes available in HapMap, but using unphased genotype data from our study. Note that visual representations of linkage disequilibrium in this form can be very sensitive to SNP density, so comparison across regions is difficult without correction for SNP density, and direct comparison of linkage disequilibrium patterns at all HapMap SNPs with those at the subset of SNPs on the Affymetrix 500K chip is not straightforward.

Geographical variation and population structure. Principal component analysis was performed as a two-stage process: we formed a matrix of estimated correlations (formally, the inner product measure of similarity) between all pairs of individuals, and then computed the eigenvectors and eigenvalues of that matrix. We estimated the correlation between two individuals as described in ref. 14. We identified components that reflected genome-wide structure in two ways. First, we created two subsets of the data containing SNPs from the odd- and even-numbered chromosomes, repeated the PCA on each of these, and inspected scatter plots of pairs of components between the two subsets of the data. A component which is due to a region of linkage disequilibrium on a chromosome (as opposed to genome-wide structure) will appear only when analysing the data set containing SNPs from that chromosome. Second, we computed the score of every SNP on the components. For a component that is due to a region of linkage disequilibrium, there will be a spike of high SNP scores only in that region. To minimize the contribution from regions of extensive strong linkage disequilibrium, the correlation estimates were based on a subset of 197,175 SNPs that were spaced at least 0.001 cM apart (HapMap estimates) and specifically excluded the MHC region.
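The two-stage structure of this analysis (a matrix of pairwise similarities between individuals, followed by its eigen-decomposition) can be sketched on toy data. The sketch below makes several assumptions that are not from the paper: each SNP is centred and scaled by its sample allele frequency, which may differ in detail from the estimator of ref. 14; only the leading component is extracted, by power iteration, where a real analysis would use a numerical library's symmetric eigensolver; and the genotype matrix is invented.

```python
import math


def standardize(genotypes):
    """Centre and scale each SNP column of an individuals-by-SNPs matrix.

    genotypes[i][j] is the allele count (0, 1 or 2) of individual i at SNP j.
    Monomorphic SNPs carry no information and are dropped.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        p = sum(column) / (2 * n)            # sample allele frequency
        if p == 0.0 or p == 1.0:
            continue
        # Standard deviation of the allele count under Hardy-Weinberg proportions.
        scale = math.sqrt(2 * p * (1 - p))
        columns.append([(g - 2 * p) / scale for g in column])
    # Transpose back to one row per individual.
    return [[column[i] for column in columns] for i in range(n)]


def similarity_matrix(x):
    """Inner-product similarity between every pair of individuals."""
    m = len(x[0])
    return [[sum(a * b for a, b in zip(xi, xj)) / m for xj in x] for xi in x]


def leading_component(k, iterations=200):
    """Largest eigenvalue of k and a unit eigenvector, by power iteration."""
    n = len(k)
    v = [math.sin(i + 1.0) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in w))
        v = [value / norm for value in w]
    kv = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, kv))
    return eigenvalue, v


# Toy data: three individuals from each of two hypothetical subpopulations.
toy = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 1, 1],
    [1, 2, 0, 0, 2],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 1, 1],
    [1, 0, 2, 2, 0],
]
eigenvalue, scores = leading_component(similarity_matrix(standardize(toy)))
```

On this toy matrix the scores of the first three individuals take one sign and those of the last three the other, which is the kind of separation a genome-wide structural component produces. Repeating the same computation on odd- and even-numbered chromosomes separately would follow the first check described in the text.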
To assess the level of over-dispersion in each collection we first created a very clean set of data to ameliorate the effects of over-dispersion due to calling problems and missing data. In addition to the main filters described above, we filtered out all SNPs that had a clear genotype-calling problem revealed by visual inspection, SNPs with a study-wide missing data rate > 1% and SNPs with study-wide minor allele frequency < 1%. Around 360,000 SNPs passed these filters. Estimates of $\lambda$ were calculated using an estimator based on the median test statistic (ref. 15). Estimates of $\lambda$ were also calculated from tests that conditioned on the scores for each individual along the two estimated principal components described above. The tests (1 d.f. and 2 d.f.) were carried out by including the scores as additional covariates in a logistic regression model fit.
Bayes factors. The box in the main text makes the point that understanding the strength of evidence conveyed by a particular P value also requires knowledge of power. In contrast, the Bayes factor (BF) provides a single measure of the strength of the evidence for an association, and we report these in addition to P values (Supplementary Table 14). As for power, calculation of Bayes factors requires assumptions about effect sizes. The assumptions underlying our calculations are given below and in Supplementary Information.
There is broad agreement between the way in which P values and our Bayes factors rank SNPs, except for SNPs with low MAFs (Supplementary Fig. 22). This is intuitive: unless one believed, a priori, that rare causative SNPs have substantially larger effect sizes, there will be reduced power for these SNPs and hence weaker evidence for association than for common SNPs with the same P value.
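Returning briefly to the over-dispersion estimates described earlier in this section, a median-based estimator can be sketched as below. It assumes the usual genomic-control form for 1-d.f. statistics (observed median divided by the median of a 1-d.f. chi-squared distribution); the exact estimator of ref. 15 is not reproduced in the text, and the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a 1-d.f. chi-squared variable: the square of the 75th percentile
# of a standard normal, roughly 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def inflation_factor(chi2_statistics):
    """Median-based over-dispersion estimate for 1-d.f. test statistics."""
    return median(chi2_statistics) / CHI2_1DF_MEDIAN
```

Values near 1 would indicate little over-dispersion. Using the median rather than the mean keeps the estimate from being dominated by the small number of SNPs with genuinely large test statistics.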
One perspective on GWAs is that in practice they will be used to prioritize SNPs for further study or additional typing. In addition to BFs providing a single quantity that can be directly compared between SNPs, it is also straightforward for investigators to give different a priori weights to different classes of SNPs, such as non-synonymous (ns)SNPs, genic SNPs, SNPs in highly conserved regions, or SNPs in linkage disequilibrium with many (or few) other SNPs.
We now describe calculation of the Bayes factors. We use $M_0$ to denote a model of no association, $M_1$ for a model with an additive effect on the log-odds scale and $M_2$ for a general 3 parameter model of association. At each SNP we calculate two Bayes factors: one for the additive model versus the null model, $\mathrm{BF}_1$, and one for the general model versus the null model, $\mathrm{BF}_2$. That is,
$$\mathrm{BF}_1 = \frac{\Pr(\mathrm{Data} \mid M_1)}{\Pr(\mathrm{Data} \mid M_0)}, \qquad \mathrm{BF}_2 = \frac{\Pr(\mathrm{Data} \mid M_2)}{\Pr(\mathrm{Data} \mid M_0)},$$
where
$$\Pr(\mathrm{Data} \mid M_i) = \int \Pr(\mathrm{Data} \mid \theta_i, M_i)\, \Pr(\theta_i \mid M_i)\, d\theta_i,$$
and $\theta_i$ denotes the parameters for the model. For all 3 models we use a logistic regression model for the likelihood $\Pr(\mathrm{Data} \mid \theta_i, M_i)$, where the log-odds for individual $i$ is equal to $\mu$ for model $M_0$, $\mu + \gamma Z_i$ for model $M_1$ and $\mu + \gamma I(Z_i = 1) + \phi\,(2\gamma I(Z_i = 2))$ for model $M_2$. $Z_i$ is the genotype (coded 0, 1 and 2) for individual $i$ and $I(Z_i = m)$ is the indicator function that individual $i$ has the genotype coded as $m$. For each model we choose the priors on the parameters, $\Pr(\theta_i \mid M_i)$, to reflect our belief about the likely effect sizes underlying complex trait loci.
The parameter $\gamma$ in models $M_1$ and $M_2$ is the increase in log-odds of disease for every copy of the allele coded as 1, and $e^{\gamma}$ is the additive model odds ratio. For both models we use a N(0, 0.2) prior on $\gamma$. This prior puts probability 0.31 on odds ratios above 1.2 or below 0.8, and probability 0.02 on odds ratios above 1.5 or below 0.5. The parameter $\mu$ in all three models represents the baseline odds of disease.
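The tail probabilities quoted for the prior on the additive log-odds effect can be checked with a few lines of Python. The sketch assumes that the second argument of N(0, 0.2) is the standard deviation, which is the reading under which the quoted figures are reproduced; the helper name is hypothetical.

```python
from math import log
from statistics import NormalDist

# Assumption: N(0, 0.2) is parameterized by its standard deviation.
prior = NormalDist(mu=0.0, sigma=0.2)

def prob_outside(lower_or, upper_or):
    """Prior probability that the odds ratio is below lower_or or above upper_or."""
    return prior.cdf(log(lower_or)) + (1.0 - prior.cdf(log(upper_or)))

p_modest = prob_outside(0.8, 1.2)
p_large = prob_outside(0.5, 1.5)
print(round(p_modest, 2), round(p_large, 2))  # prints 0.31 0.02
```

Because the prior is placed on the log-odds scale, the thresholds are converted with `log` before the normal distribution function is evaluated.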
In a case-control design the numbers of cases in the sample have been elevated artificially, which will have a large effect on likely values of the baseline parameter $\mu$. Our prior beliefs about the baseline risk of disease must take this into account. For all three models we have used a N(0, 1) prior for $\mu$ and have found that the resulting Bayes factors are relatively insensitive to choice of priors for this parameter as long as the same prior is used for the two models being compared. The parameter $\phi$ in model $M_2$ represents a recessive effect over and above an additive effect. We use a N(1, 1) prior for $\phi$. Combined with the prior on the additive effect $\gamma$, this results in a prior probability of 0.25 on the odds ratios above 1.5 and below 0.5 for the genotype coded as 2. In addition, we note that the evaluation of the Bayes factors will depend on the way the alleles at the SNP have been coded 0 and 1. To account for this we average over the two possible codings of each SNP with equal weight. A fuller description of the priors used can be found in Supplementary Information.
Sex-differentiated tests. We examined the possibility of differential genetic effects in males and females by reapplying the two single-locus analyses (trend test and genotypic test) separately in males and females and combining the results (simply adding the chi-squared statistics for the male and female analyses, and comparing with the 2 d.f. or 4 d.f. null hypothesis; results are shown in Supplementary Table 15). We refer to this as a sex-differentiated test. This test is sensitive to association that is of a different magnitude and/or direction in the two sexes, although it is less powerful than the simple test when the effect size does not vary with sex.
X chromosome analysis. For several reasons the X chromosome needs to be treated differently from the autosomes (note that the Affymetrix chip used does not assay the Y chromosome). First, sample sizes and hence power are different from the autosomes (only one copy of X in males).
Also, because the effective population size on the X chromosome is smaller than for the autosomes, linkage disequilibrium extends further. And unlike the autosomes, there are choices in how to implement even single locus analyses: these relate to the relative weight to be given to males and females in comparisons between cases and controls.
For autosomal SNPs, the 1 d.f. trend test statistic is calculated by dividing the square of the difference between means of the SNP genotypes (scored 0, 1, 2) between cases and controls by an estimate of its variance. The variance estimate used is an empirical estimate that does not assume Hardy–Weinberg equilibrium. The numerator can also be represented as the squared difference in allele frequencies between cases and controls, as in the allele counting test. At first sight, a natural generalization of this test to deal with SNPs on the X chromosome would involve comparing allele frequencies, by allele counting, but using a variance estimate which does not assume Hardy–Weinberg equilibrium in females. However, we took the view that, because most loci on the X chromosome are subject to X chromosome inactivation, it is more logical to treat males as if they were homozygous females. Thus we score female genotypes 0, 1 or 2 and male genotypes 0 or 2, comparing mean scores of cases and controls as before. The variance estimate allows for the different variance of male and female contributions and does not assume Hardy–Weinberg equilibrium in females.
A stratified version of the test is constructed using the same principles by which the trend test is extended to the Mantel extension test; a score that contrasts cases and controls is computed for each stratum together with its variance; these are then summed over strata. The final test is the squared total score divided by the total variance. To extend these tests to a 2 d.f. test, we add a score that compares heterozygosity between cases and controls. Clearly, only females contribute to this component.
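The X-chromosome scoring and the unstratified 1 d.f. trend statistic described above can be sketched as follows. This is a simplified illustration, not the study's code: it assumes genotypes are supplied as counts of the allele coded 1 (0 or 1 for males), and it pools both sexes into a single empirical variance of the scores, which is simpler than the sex-specific variance estimate the text refers to. Function names are hypothetical.

```python
def x_scores(allele_counts, is_male):
    """Score females 0/1/2 and males 0 or 2 (males treated as homozygous females)."""
    return [2 * g if male else g for g, male in zip(allele_counts, is_male)]

def trend_statistic(scores, is_case):
    """1-d.f. score statistic contrasting mean scores of cases and controls.

    The variance uses the empirical spread of the scores, so no
    Hardy-Weinberg assumption is made. Undefined if all scores are equal
    or if the sample contains only cases or only controls.
    """
    n = len(scores)
    case_fraction = sum(is_case) / n
    mean_score = sum(scores) / n
    # Score: proportional to the difference in mean score between cases and controls.
    u = sum((c - case_fraction) * s for s, c in zip(scores, is_case))
    # Null variance of the score, from the observed scores.
    v = case_fraction * (1 - case_fraction) * sum((s - mean_score) ** 2 for s in scores)
    return u * u / v
```

With this form the statistic equals the sample size times the squared correlation between score and case status, so it is 0 when cases and controls have the same mean score.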
Results of these analyses of X chromosome SNPs are shown in Supplementary Table 16.
Multilocus analysis. We use (1) the genotype data of this study, (2) the HapMap data, and (3) a population genetics model, to simulate genotypes at the HapMap SNPs that are not on the Affymetrix 500K chip. Informally, we determine which haplotypes are present in each individual in a region, and then use HapMap to 'fill in' these haplotypes at untyped SNPs (see below for details). These 'in silico' genotypes are then tested for association with the disease as before. This powerful multilocus tool for association studies (ref. 143) has the advantage of using information from all markers in linkage disequilibrium with an untyped SNP, but in a way that decreases with genetic distance. Our imputation method was applied to individuals passing project filters, and used markers which passed the project filters and in addition had MAF > 1%. As a validation we compared our imputed genotypes for 58C individuals with genotypes obtained on an Illumina platform for 10,180 SNPs that are polymorphic in CEU HapMap samples. At these SNPs, for imputed genotypes with posterior call probabilities above 0.95, there was 98.4% agreement with the Illumina genotypes.
In our association analyses we imputed genotypes at 2,139,483 HapMap SNPs, and tested these for association with each disease using the trend test or the genotypic test. We included the results from imputed SNPs in the signal plots (Fig. 5) because they are useful in (1) assessing signal strength within a region; (2) providing a wider range of SNPs for follow up; and (3) indicating possible locations for the causal variant. For example, in the case of TCF7L2 in T2D, there is a substantially stronger signal from rs7903146 than for any of the typed SNPs (see also Supplementary Fig. 12).
To be conservative, stringent quality control filters were applied to genomic regions where imputed SNPs (but not genotyped SNPs) were responsible for a strong signal for association.
These were as follows: (1) any such region was required to contain more than one imputed SNP showing the required level of association with a MAF > 2% and posterior probability for imputed genotypes averaged across the SNP > 0.95 (empirical studies showed imputation at low MAF SNPs to be more prone to error); (2) all cluster plots for genotyped SNPs within 0.3 cM (from HapMap Phase II estimated recombination rates) were checked and where there was evidence of any mis-calling the region was rejected (the major problem with imputation arises around SNPs with genotype calling errors); and (3) if there was no genotyped SNP with a P value < $10^{-4}$ for association on either trend or genotypic test, the region was rejected. Note that accuracy of imputation with these filters applied will be larger than the figure of 98.4% reported above.
We use $H = \{H_1, \ldots, H_N\}$ to denote a set of $N$ known haplotypes where $H_i = \{H_{i1}, \ldots, H_{iL}\}$ is an individual haplotype and $L$ is the number of SNP loci. In practice, we set $H$ to be the 120 CEU haplotypes estimated as part of the HapMap project owing to the expected similarity in haplotype structure between the CEU and UK populations. We let $G = \{G_1, \ldots, G_K\}$ denote the genotype data on the $K$ individuals in the study where $G_i = \{G_{i1}, \ldots, G_{iL}\}$ and $G_{ij} \in \{0, 1, 2, \text{missing}\}$. In this setting, the majority of SNPs will have entirely missing genotypes, because the Affymetrix 500K chip has approximately 1/6th of the number of SNPs in the Phase II HapMap. The missing genotypes are imputed by modelling the distribution of each individual's genotype vector $G_i$ conditional on the known set of haplotypes $H$, $\Pr(G_i \mid H)$. Our model for each individual's genotype vector is a Hidden Markov Model in which the hidden states are a sequence of pairs of the $N$ known haplotypes in the set $H$.
That is,
$$\Pr(G_i \mid H) = \sum_{Z_i^{(1)},\, Z_i^{(2)}} \Pr\big(G_i \mid Z_i^{(1)}, Z_i^{(2)}, H\big)\, \Pr\big(Z_i^{(1)}, Z_i^{(2)}\big),$$
where $Z_i^{(1)} = \{Z_{i1}^{(1)}, \ldots, Z_{iL}^{(1)}\}$ and $Z_i^{(2)} = \{Z_{i1}^{(2)}, \ldots, Z_{iL}^{(2)}\}$ are the two sequences of copying states at the $L$ sites and $Z_{il}^{(j)} \in \{1, \ldots, N\}$. Here, $\Pr(Z_i^{(1)}, Z_i^{(2)})$ defines our prior probability on how the sequences of copying states change along the sequence and $\Pr(G_i \mid Z_i^{(1)}, Z_i^{(2)}, H)$ models how the observed genotypes will be close to but not exactly the same as the haplotypes being copied. The precise form of these terms (described in ref. 142) is based on an approximate population genetics model that makes direct use of the recently estimated fine-scale recombination map across the genome (refs 142, 143). At each of the missing genotypes in the study, we use this model to calculate probabilities for the three possible genotypes. At each imputed SNP, we used these probabilities to calculate the 2 × 3 table of expected genotype counts for cases and controls and used these counts to carry out a standard test of association.
Disease models. To test for deviations from additivity (in log-odds) at a locus we fit a logistic regression model using the function glm in the statistical software R (http://www.r-project.org/). For each region we considered the most significant SNP and compared an additive model to a general 2-d.f. model by fitting a model with an additive sub-model nested in a general model. The additive effect was modelled by a variable encoded 0, 1, or 2 for the effect at the three genotypes and a second term for a general model was included by a variable encoded 1 for heterozygotes and 0 otherwise. We rejected an additive model if the second term was significant and then compared a dominant or recessive model to a general model. For the pairwise interaction analysis, we fixed the marginal model at each locus on the basis of the single locus analysis. We compared the two locus model with these marginals and no interaction terms with a larger model including interactions.
This larger interaction model has 1, 2, or 4 additional parameters depending on whether both marginal models are additive, one is additive and one general, or both general.
Software. Several software packages were developed within the WTCCC for data analysis, data management and simulation studies. We found it necessary to normalize the Affymetrix probe intensity data to minimize chip-to-chip variability. A C++ program was written to carry out this normalization efficiently. To obtain a copy of the software please email Hin-Tak Leung at hin-tak.leung@cimr.cam.ac.uk.
We developed a new genotype calling algorithm, CHIAMO, implemented in C++. CHIAMO uses a hierarchical statistical model, which allows it to simultaneously call genotypes at all data samples. To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.
To perform genome-wide association analysis we developed two software packages: snpMatrix and SNPTEST. snpMatrix is an R package and is freely available from http://www-gene.cimr.cam.ac.uk/clayton/software/. Both quantitative and qualitative phenotypes can be analysed using snpMatrix and flexible association testing functions are provided that control for potential confounding by quantitative and qualitative covariates. SNPTEST is a standalone C++ program that implements both frequentist tests and bayesian analysis of association and allows the user to include quantitative or qualitative covariates. This program works directly with the output of CHIAMO and IMPUTE (see below). To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.
Genotypes at SNPs that are in HapMap but not on the Affymetrix 500K chip were imputed using the C++ program IMPUTE, which makes use of genotype information at neighbouring SNPs. To obtain a copy of the software please email J. L. Marchini at marchini@stats.ox.ac.uk.
119. Spitzer, R. L., Endicott, J. & Robins, E. Research diagnostic criteria: rationale and reliability. Arch. Gen. Psychiatry 35, 773–782 (1978).
120. Wing, J. K. B. T. et al. SCAN. Schedules for Clinical Assessment in Neuropsychiatry. Arch. Gen. Psychiatry 47, 589–593 (1990).
121. Craddock, M. et al. Concurrent validity of the OPCRIT diagnostic system. Comparison of OPCRIT diagnoses with consensus best-estimate lifetime diagnoses. Br. J. Psychiatry 169, 58–63 (1996).
122. McGuffin, P., Farmer, A. & Harvey, I. A polydiagnostic application of operational criteria in studies of psychotic illness. Development and reliability of the OPCRIT system. Arch. Gen. Psychiatry 48, 764–770 (1991).
123. Green, E. K. et al. Operation of the schizophrenia susceptibility gene, neuregulin 1, across traditional diagnostic boundaries to increase risk for bipolar disorder. Arch. Gen. Psychiatry 62, 642–648 (2005).
124. Green, E. K. et al. Genetic variation of brain-derived neurotrophic factor (BDNF) in bipolar disorder: case-control study of over 3000 individuals from the UK. Br. J. Psychiatry 188, 21–25 (2006).
125. Samani, N. J. et al. A genomewide linkage study of 1,933 families affected by premature coronary artery disease: The British Heart Foundation (BHF) Family Heart Study. Am. J. Hum. Genet. 77, 1011–1020 (2005).
126. Lennard-Jones, J. E. Classification of inflammatory bowel disease. Scand. J. Gastroenterol. (Suppl.) 170, 2–6; discussion 6–9 (1989).
127. Arnett, F. C. et al. The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis. Arthritis Rheum. 31, 315–324 (1988).
128. MacGregor, A. J., Bamber, S. & Silman, A. J. A comparison of the performance of different methods of disease classification for rheumatoid arthritis. Results of an analysis from a nationwide twin study. J. Rheumatol. 21, 1420–1426 (1994).
129. Worthington, J. et al. The Arthritis and Rheumatism Council's National Repository of Family Material: pedigrees from the first 100 rheumatoid arthritis families containing affected sibling pairs. Br. J. Rheumatol. 33, 970–976 (1994).
130. Symmons, D. P., Barrett, E. M., Bankhead, C. R., Scott, D. G. & Silman, A. J. The incidence of rheumatoid arthritis in the United Kingdom: results from the Norfolk Arthritis Register. Br. J. Rheumatol. 33, 735–739 (1994).
131. Smyth, D. et al. Replication of an association between the lymphoid tyrosine phosphatase locus (LYP/PTPN22) with type 1 diabetes, and evidence for its role as a general autoimmunity locus. Diabetes 53, 3020–3023 (2004).
132. Wiltshire, S. et al. A genomewide scan for loci predisposing to type 2 diabetes in a U.K. population (the Diabetes UK Warren 2 Repository): analysis of 573 pedigrees provides independent replication of a susceptibility locus on chromosome 1q. Am. J. Hum. Genet. 69, 553–569 (2001).
133. Frayling, T. M. et al. Parent–offspring trios: a resource to facilitate the identification of type 2 diabetes genes. Diabetes 48, 2475–2479 (1999).
134. Groves, C. J. et al. Association analysis of 6,736 U.K. subjects provides replication and confirms TCF7L2 as a type 2 diabetes susceptibility gene with a substantial effect on individual risk. Diabetes 55, 2640–2644 (2006).
135. Power, C. & Elliott, J. Cohort profile: 1958 British birth cohort (National Child Development Study). Int. J. Epidemiol. 35, 34–41 (2006).
136. Strachan, D. P. et al. Lifecourse influences on health among British adults: Effects of region of residence in childhood and adulthood. Int. J. Epidemiol. Advance online publication, doi:10.1093/ije/dyl309 (25 January 2007).
137. Matsuzaki, H. et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods 1, 104–105 (2004).
138. Di, X. et al. Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics 21, 1958–1963 (2005).
139. Rabbee, N. & Speed, T. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics 22, 7–12 (2006).
140. Affymetrix. in Technical Report (2006).
141. Stirling, W. D. Enhancements to Aid Interpretation of Probability Plots. Statistician 31, 211–220 (1982).
142. Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
143. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genet. doi:10.1038/ng2088 (in the press).
doi:10.1038/nature05911 ©2007 Nature Publishing Group"