Introduction

Glaucoma is an optic neuropathy characterized by progressive loss of retinal ganglion cells, often manifesting as changes in the intra-papillary and papillary regions of the optic disc and retinal nerve fiber layer [1]. This disease affects 44 million individuals worldwide, with a projected prevalence of 53 million cases by 2020 and 80 million cases by 2040 [2]. The evaluation of structural changes is central to the diagnosis and management of patients with glaucoma [3]. Although the appearance of the optic nerve was first described nearly 150 years ago [4], analysis of optic nerve head (ONH) features in glaucoma remains challenging.

A number of evaluation schemes have emerged that attempt to characterize the phenotype of the ONH, with the most common being cup-to-disc ratio (CDR). Although very widely used, CDR is limited by a high degree of variability in grading among ophthalmologists [5,6,7]. Many researchers have developed other methods of optic nerve assessment, attempting to improve inter-observer agreement. In 1974, Shiose proposed three patterns of optic disc damage, each with six stages of structural changes [8]. Read and Spaeth, also in 1974, coupled stages of increasing CDR with worsening visual field loss [9]. The Glaucoma and Glaucoma Suspects grading system built upon this work by taking into account worsening CDRs, visual field changes, and neuroretinal rim asymmetry, as well as disc pallor [10, 11]. Nesterov proposed a five-stage system characterized by optic disc excavation, temporal slope, depth, and CDR [12]. The Rim to Disc (R/D) Method shifted the focus away from cupping to the neuroretinal rim [13]. Finally, the Disc Damage Likelihood Scale (DDLS) proposed ten stages of glaucomatous progression based on an estimation of the neuroretinal rim in any position and the circumferential extent of its absence [14].

Although each has its advantages, these grading systems remain less than optimal. The systems do not always allow an examiner to easily recognize a pathologic optic nerve, determine the severity of damage, and monitor for evidence of glaucomatous progression. Furthermore, like CDR, many of the systems yield inconsistent or variable results among examiners [6, 15,16,17,18].

A variety of imaging and automated instruments, such as scanning laser polarimetry [19], scanning laser ophthalmoscopy [20], and optical coherence tomography [21, 22], have also been used to evaluate the ONH in glaucoma. Despite these advances, the subjective method of grading stereo images of the ONH remains important. Automated instruments have been shown to miss both early cases of glaucomatous damage and severe cases of glaucoma; they are also more limited when facing anatomic variation or suboptimal images [23, 24]. For these reasons, the European Optic Disc Assessment Trial recommended that automated devices be used to “support rather than replace skilled clinic examination” [24].

Thus, there remains a need for a reproducible method for grading and classifying optic nerve disc photos. Ideally, this methodology would be more comprehensive and precise than the existing subjective methods described above, while avoiding the pitfalls of automated instruments. Such a grading system could be used to assess progression over time or to categorize phenotypes of the optic nerve in glaucoma for research and genetic studies.

To address this need, we used digital stereo images to develop a new quantitative method of assessing the ONH, which allows computation of many features associated with glaucoma and illustrates the inter-grader reliability of measuring CDR. This method was employed on 2554 images from the Primary Open-Angle African American Glaucoma Genetics (POAAGG) study, which is a 5-year project investigating the genetics of primary open-angle glaucoma (POAG) in African Americans. In this paper, we describe our methodology and report concordance rates among non-physician trained graders.

Methods

Participants

The POAAGG study population consists of self-identified African Americans and individuals of African or Afro-Caribbean descent, over age 35, recruited from the Philadelphia region. Exclusion criteria have been previously published [25]. The examining ophthalmologist recorded the vertical CDR. Fellowship-trained glaucoma specialists determined each subject’s classification as a glaucoma case, control, or suspect based on previously published criteria [25]. For this study, 30-degree stereo disc photos from 2554 eyes of glaucoma cases and controls, taken using the Topcon TRC 50EX retinal camera (Topcon Corp. of America, Paramus, NJ), were analyzed. These images were received by graders between 02/09/2016 and 01/31/2018. The University of Pennsylvania Institutional Review Board approved this study and the informed consent process, and this research followed the tenets of the Declaration of Helsinki.

Non-physician trained graders

Three non-physician graders from the Ophthalmology Reading Center at the University of Pennsylvania were trained by two fellowship-trained glaucoma specialists to grade digital stereo color images of the optic disc. These graders were experienced in grading the retina in other studies (such as the Comparison of Age-Related Macular Degeneration Treatments Trials) [26], but not the ONH. Prior to beginning the study, all graders were tested and found to have stereo vision of 40 seconds of arc. Graders were trained to use the stereo viewer (Screen-Vu stereoscope, Portland, OR). Training sessions occurred weekly, for 2 hours a week, over 5 months. The graders were given “practice” optic nerve images to grade between sessions, and these images were reviewed during the weekly meetings with the glaucoma specialists and the Director of the Reading Center. The outlined parameters were each drawn while the graders were actively using the stereo viewer.

Optic disc analysis

After the graders completed training, each digital optic nerve photograph was analyzed by two of the three trained graders. Graders were masked as to whether the image was from a glaucoma case or a control. The graders were asked to outline three structures on each optic nerve photograph using the Image J/Fiji software (available at http://rsbweb.nih.gov/ij/; Rasband WS, Image J, US National Institutes of Health, Bethesda, MD, 1997–2012) (Fig. 1).

Fig. 1

Examples of stereo disc photos outlined for cup color, cup contour, and disc by two graders. For Image 1, areas were 72,070 (Grader 1) and 94,006 (Grader 2) for cup color; 108,667 (Grader 1) and 98,751 (Grader 2) for cup contour; and 174,421 (Grader 1) and 172,310 (Grader 2) for the disc. CDR was 0.41 (Grader 1) and 0.55 (Grader 2) using cup color, and 0.62 (Grader 1) and 0.57 (Grader 2) using cup contour. For Image 2, areas were 39,859 (Grader 1) and 48,728 (Grader 2) for cup color; 75,195 (Grader 1) and 63,352 (Grader 2) for cup contour; and 135,916 (Grader 1) and 133,663 (Grader 2) for the disc. CDR was 0.29 (Grader 1) and 0.36 (Grader 2) using cup color, and 0.55 (Grader 1) and 0.47 (Grader 2) using cup contour. Area units are unscaled pixels

Outlined structures included:

  1. The optic cup using only color and pallor cues from the photograph (“color cup”).

  2. The optic cup using contour and vascular cues (“contour cup”).

  3. The optic disc, defined as the outer border of the nerve rim and the inner border of the scleral ring, if a scleral ring was present.

The areas within each of these measurements, as well as the height and width of these measurements, were then calculated using the Image J/Fiji software. The software calculated the height and width based on the vertical and horizontal axes, respectively; axes were determined by the software. CDR was then calculated as the ratio of the area, height or width of the optic cup (using either the “color cup” or “contour cup”) to the area, height or width of the optic disc. The graders were not involved in any clinical diagnoses.
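The ratio described above can be illustrated with a minimal sketch. The function name `cup_to_disc_ratio` and the `use_sqrt` flag are hypothetical and are not part of the Image J/Fiji software; the input values are the Grader 1 areas for Image 1 reported in the Fig. 1 caption. The optional square-root variant reflects the transformation described under Statistical analysis for area-based CDR; the CDR values shown in Fig. 1 correspond to the direct area ratio.

```python
import math


def cup_to_disc_ratio(cup, disc, use_sqrt=False):
    """Return the ratio of a cup measurement to the matching disc measurement.

    `cup` and `disc` must be the same kind of measurement (both areas,
    both heights, or both widths). With use_sqrt=True the square roots of
    the two values are compared, which puts an area ratio on a linear scale.
    """
    if disc <= 0:
        raise ValueError("disc measurement must be positive")
    if cup < 0:
        raise ValueError("cup measurement must be non-negative")
    if use_sqrt:
        return math.sqrt(cup) / math.sqrt(disc)
    return cup / disc


# Fig. 1, Image 1, Grader 1 (areas in unscaled pixels)
color_cup_area = 72070
contour_cup_area = 108667
disc_area = 174421

color_cdr = cup_to_disc_ratio(color_cup_area, disc_area)
contour_cdr = cup_to_disc_ratio(contour_cup_area, disc_area)
print(round(color_cdr, 2), round(contour_cdr, 2))  # prints 0.41 0.62
```

The same function applies unchanged to height or width measurements, provided the cup and disc values are of the same type.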

Statistical analysis

Intraclass correlation coefficients and 95% confidence intervals were calculated for the color cup, contour cup, disc, color CDR, and contour CDR measurements. The intraclass correlation coefficient is a measure of reliability that describes the consistency or reproducibility of quantitative measurements (here, optic nerve grading) made by different observers (here, two graders). Reliability ranges from 0 to 1, where 1 indicates that there is no grader error (i.e., no differences in grading between graders) and 0 means that all variability across eyes is attributable to grader error. Higher reliability values indicate less grader error. These coefficients were calculated separately using the area, height, and width of the drawings, and for the area CDR calculations the square roots of the cup and disc areas were used. The CDR value used for comparison to the clinical assessment of CDR was the average of the values from the two graders. All analyses were performed in SAS v9.4 (Cary, NC).
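To make the reliability measure concrete, the following is a minimal sketch of one common form of the intraclass correlation, the one-way random-effects ICC for single ratings. This form is an assumption chosen for illustration: the text does not state which ICC model was used, and this sketch is not the SAS procedure used in the study. The function name `icc_oneway` and the example ratings are hypothetical, and confidence intervals are not computed.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for n images, each with k ratings.

    `ratings` is a list of rows, one row per image, each row holding the
    k measurements made on that image (k = 2 for two graders).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two images")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("each image needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-image mean square: variability of image means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-image mean square: variability of ratings around their image mean.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ratings have no variability")
    return (ms_between - ms_within) / denominator


# Hypothetical CDR values from two graders on five images (illustrative only).
example = [(0.41, 0.55), (0.29, 0.36), (0.62, 0.57), (0.55, 0.47), (0.70, 0.72)]
print(round(icc_oneway(example), 2))
```

With identical ratings from both graders on every image, this estimate equals 1. Unlike the 0-to-1 range described above, the sample estimate can fall below 0 when graders disagree more within images than images differ from one another.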

Results

A total of 2554 digital stereo images were analyzed for this study, including 1984 images from POAG cases and 570 images from controls. The intraclass correlation (95% confidence interval) for agreement between the grading value of CDR height and the clinical value of vertical CDR was 0.72 (0.63, 0.79) by color/pallor cues and 0.71 (0.63, 0.79) by contour.

Among all images, the intraclass correlation (95% confidence interval) for agreement between graders was 0.90 (0.89, 0.90) for the cup area using only color/pallor cues; 0.92 (0.91, 0.92) for the cup area using contour and vascular cues; and 0.99 (0.99, 0.99) for the disc area (Table 1). Intraclass correlations among all subjects were highest when using height (versus area or width). The intraclass correlation for the area of the CDR was 0.74 (0.73, 0.76) when using “color cup” and 0.61 (0.58, 0.63) when using “contour cup.”

Table 1 Intraclass correlations between trained graders based on area, height, and width of measurements

The intraclass correlations were stratified by whether the stereo disc images were from glaucoma cases versus controls (Table 1). The intraclass correlations were slightly higher among cases than controls for the color cup, the contour cup, and the optic disc. The cup-to-disc area ratio using the contour cup also had a slightly higher intraclass correlation among cases (0.57 [0.54, 0.60]) than controls (0.35 [0.27, 0.42]). This lower concordance among controls was also seen when height and width measurements were used in the CDR measurements.

The difference between CDR values assigned by the two graders was calculated using the color cup and contour cup for area, height, and width measurements (Table 2 and Fig. 2). Using color cups, the CDR difference by area between graders was ≤0.1 in 71% of images and ≤0.2 in 94% of images; using contour cups, the difference by area was ≤0.1 in 65% of images and ≤0.2 in 92% of images. Again, height measurements yielded the highest reproducibility (compared with area or width), with a difference between graders of ≤0.1 in 75% of images for color cups and in 68% of images for contour cups. Reproducibility was higher among cases when using contour cups and higher among controls when using color cups (Table 3).
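The tabulation of between-grader differences can be sketched as follows. The function name `agreement_summary` and its output keys are hypothetical; the input values are the color-cup CDRs for the two example images in the Fig. 1 caption, so the resulting proportions describe only those two images and not the study results.

```python
def agreement_summary(cdr_grader_a, cdr_grader_b, thresholds=(0.1, 0.2)):
    """Summarize absolute CDR differences between two graders.

    Returns the mean absolute difference and, for each threshold, the
    proportion of images whose absolute difference is at or below it.
    """
    if len(cdr_grader_a) != len(cdr_grader_b) or not cdr_grader_a:
        raise ValueError("need two non-empty lists of equal length")

    diffs = [abs(a - b) for a, b in zip(cdr_grader_a, cdr_grader_b)]
    n = len(diffs)
    summary = {"mean_abs_diff": sum(diffs) / n}
    for t in thresholds:
        summary[f"within_{t}"] = sum(d <= t for d in diffs) / n
    return summary


# Color-cup CDRs for Image 1 and Image 2 from the Fig. 1 caption.
grader_1 = [0.41, 0.29]
grader_2 = [0.55, 0.36]

result = agreement_summary(grader_1, grader_2)
print(result)
```

For these two images, one of the two differences is at or below 0.1 and both are at or below 0.2.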

Table 2 Differences between CDR values assigned by two trained graders using color and contour cup for area, height, and width measurements, for all patients
Fig. 2

Bland-Altman plots for cup-to-disc ratio for color and contour grading. (a) Bland-Altman plot of the absolute value of the difference between two graders of the cup-to-disc ratio by the mean of the two values when using the area of the cup as defined by color and the area of the disc. The mean absolute difference was 0.078, and the 95th percentile of the differences was 0.193. (b) Bland-Altman plot of the absolute value of the difference between two graders of the cup-to-disc ratio by the mean of the two values when using the area of the cup as defined by contour and the area of the disc. The mean absolute difference was 0.089, and the 95th percentile of the differences was 0.218.

Table 3 Differences between CDR values assigned by two trained graders using color and contour cup for area, height, and width measurements, for cases and controls

Discussion

Our study introduced a new quantitative method of assessing the ONH and deriving CDR using digital stereo images. Non-physician graders trained by glaucoma specialists were capable of grading and measuring ONH parameters with high inter-grader reliability. Graders had very low discordance when outlining the optic cup and disc using color or vascular cues and deriving CDR from these measurements.

When outlining the optic cup, both color/pallor and contour/vascular cues yielded high intraclass correlations between graders. We used area, height, and width to make these measurements, finding that height yielded the highest intraclass correlations. Vertical CDR, which was calculated as the ratio of the height of the optic cup (both by color and vascular cues) to the height of the optic disc, maintained strong intraclass correlations among graders. By area, 94% of images (using color cups) and 92% of images (using contour cups) had a CDR difference of ≤0.2 between graders. This CDR difference (≤0.2) was previously defined as “concordant” by the Ocular Hypertension Treatment Study [27]. The lower intraclass correlation for CDR in controls when compared to cases could be due to discrepancies in smaller-sized cup ratios.

Our results demonstrate the importance of training graders in optic disc evaluation; we show that with training, very high concordance rates can be obtained between graders. Studies that included graders (even glaucoma specialists) who were not trained on standard images show lower rates of concordance. For example, Jampel et al. [15] asked glaucoma specialists to evaluate disc changes over time in a cohort of patients with visual field loss. Despite analyzing the qualities of cup enlargement, focal rim thinning, cup depth, and optic disc hemorrhages concurrently, inter-observer agreement was poor (κ = 0.2). Likewise, Azuara-Blanco et al. [28] reported an inter-observer κ of 0.34 to 0.68 (and intraobserver κ of 0.55 to 0.78) among specialists assessing whether optic discs are compatible with glaucomatous changes. It is likely that even a small amount of standardized training among graders would have increased concordance in these studies. Inter-observer agreement among non-experts in detecting disc changes has been shown to increase after only one training session [29].

Our results also highlight how detailed and specific measurements can lead to high concordance rates among graders. In the Ocular Hypertension Treatment Study, trained graders judged optic discs as deteriorated or not based on thinning of the neuroretinal rim, yielding agreement (kappa) ranging from 0.65 to 0.83 over 5 years [30]. Likewise, the European Optic Disc Assessment Trial, which prompted 243 ophthalmologists to grade stereoscopic optic disc photos of healthy and glaucomatous eyes, found an overall diagnostic accuracy of 80.5% and an average intraobserver agreement of 0.7 [24]. These results are considered an overestimate, as the study was conducted in ideal conditions, with graders having unlimited time to grade and only eyes with a definitive diagnosis used. These studies demonstrate the difficulty in classifying optic nerve damage solely based on subjective disc features, even with established protocols. We believe that using detailed measures to calculate CDR or grade the optic nerve, as in our study, can increase the accuracy of measurements. Meanwhile, this methodology also avoids the pitfalls of automated instruments, such as limitations when faced with anatomic variations or suboptimal images.

The time spent training the readers was extensive; we do not know whether the same degree of concordance would have been obtained with less training. It is also possible that the graders were more effective learners, given their prior familiarity with examining digital images. Additionally, there remains no gold standard for analysis of the ONH and thus no “right answer” to compare against for measurements in this study. Nonetheless, the moderate agreement with the vertical CDR assessed clinically, with intraclass correlations of 0.72 for assessment by color cues and 0.71 for assessment by contour, indicates that the grading assessments are generally consistent with clinical judgment.

In conclusion, this study introduced a new method of quantitatively grading the ONH on stereo disc images, which resulted in high inter-grader reliability. We envision this system being useful for detecting and measuring small differences in the ONH for large-volume glaucoma studies. Going forward, we plan to train the graders to assess other parameters of digital photographs, such as alpha and beta peripapillary atrophy. In addition, we hope to analyze disc asymmetry between eyes and to examine POAG progression by evaluating future stereo disc images of the same patients over time.

Study highlights

What was known before

  • Optic nerve grading for glaucoma has historically been challenging, with multiple evaluation schemes emerging in recent years. These systems are often subjective and not fully reproducible among graders.

  • Automated instruments, such as optical coherence tomography, have also been used to grade the optic nerve. These instruments, though useful for mass grading, have been shown to miss both early and severe cases of glaucoma.

  • Thus, there is a need for a reproducible method for grading and classifying optic nerve disc photos. Ideally, this methodology would be more comprehensive and precise than existing subjective methods, while avoiding the pitfalls of automated instruments.

What this study adds

  • We propose a novel quantitative method for assessing optic nerve stereo disc photographs.

  • This method utilizes color and vascular cues, as well as area, height, and width, to evaluate the optic nerve and determine cup-to-disc ratio, an essential tool for evaluating glaucoma progression.

  • We show that non-physician graders, who assessed 2554 images from African American patients, achieve high concordance rates with this method after receiving training.

  • We believe that our method can be useful for assessing glaucoma progression or categorizing phenotypes for glaucoma research studies.