Introduction

The full-field stimulus threshold test (FST) is a psychophysical measure of whole-field retinal light sensitivity. It was developed as an alternative method to assay residual visual function in patients with inherited retinal disease (IRD), whose severely reduced vision may prohibit standard quantitative follow-up with LogMAR acuity, perimetry and electroretinography (ERG) [1, 2]. In the recent decade as clinical trials for genetic therapies in IRDs have begun to advance into real world applicability, FST garnered momentum as a research outcome measure, and now increasingly as an adjunctive method for the assessment and surveillance of low vision patients in the clinical setting [3, 4].

In brief, the test stimulus is a full-field (‘Ganzfeld’) flash or pulse of light often generated using narrow-band light emitting diodes (LEDs). The participant responds ‘seen’ or ‘not seen’ to stimuli of varying luminance strengths. The stimulus luminance increases or decreases strategically until sampling is sufficient to calculate an estimate of the light perception threshold using a psychometric function. Responses to chromatic (red/blue/green) or achromatic (white) stimuli, presented on a variety of light- or dark-adapted backgrounds, are used to delineate the relative contributions of cone and rod photoreceptors to the sensitivity threshold estimate [1, 3]. Current commercially available FST equipment include the DiagnosysFST programme with the Espion ColorDome™ (Diagnosys LLC, Lowell, MA, USA) and the Metrovision FST programme on the MonCvONE perimeter (Metrovision, Pérenchies, France).

FST in gene therapy trials and wider research applications

Originally FST was a secondary efficacy measure in gene therapy clinical trials for voretigene neparvovec-rzyl (VN), now approved in several global territories for RPE65-mediated retinopathy [5, 6]. This early-onset retinal dystrophy (also called Leber’s congenital amaurosis, LCA2) is characterised by severe progressive visual impairment and rod-cone degeneration from infancy [7, 8].

Early phase VN publications mention FST with chromatic stimuli as a functional outcome measure [9, 10]. The later Phase III and long-term safety and efficacy studies report both chromatic and achromatic FST, with the change in white light FST threshold averaged across both eyes reported as a secondary efficacy outcome [11,12,13]. The primary endpoint in these RPE65 clinical trials remained an improvement in the multi-luminance mobility test (MLMT). Nonetheless, a strong correlation was shown between 1-year change in MLMT and the white light FST response averaged over both eyes (Pearson correlation coefficient r = 0.71), ratifying FST as a feasible surrogate measure where mobility testing may be unavailable [4, 12]. Currently, an active clinical trial in Japanese RPE65 retinal dystrophy patients lists 1-year change in FST as its primary clinical end point (NCT04516369).

The success of the VN trials saw uptake of FST as an outcome measure in other clinical trials for IRDs, including subretinal hMERTK therapy for patients with advanced retinitis pigmentosa (RP) (NCT01482195) (ref. [14]), electronic retinal prosthesis for end-stage RP (NCT02720640) (ref. [15]), and intravitreal antisense oligonucleotide therapy in CEP290-LCA (LCA10; NCT03140969) (ref. [16, 17]). Increasingly FST is being included among the battery of visual function markers in various phenotyping studies, either in anticipation of potential future therapy development, or to monitor natural history of visual decline. These include TRPM1-congenital stationary night blindness [18], CRB1-RP [19, 20], CNM4 Jalili syndrome [21], retinal degeneration in Usher Syndrome ([22]; NCT05158296; NCT04765345), and other forms of LCA including GUCY2D-LCA (LCA1) (ref. [23]) and AIPL1-LCA (LCA4) (ref. [24]). FST has also been used as an alternative to dark adaptometry to phenotype dark adaptation in choroideremia [25, 26], and to quantify cone sensitivity during the cone plateau in Stargardt disease [27]. A further interesting role has been proposed for FST to be a measure of rod inhibition for visual cycle-modifying drugs that block RPE65 function [28, 29].

A given drawback of the full-field nature of FST is the inability to localise threshold improvements at specific retinal loci. As such, the retinal origin of the FST response remains contested. Based on initial comparisons against dark-adapted perimetry, FST responses are commonly assumed to originate from the most sensitive retinal areas [1]. This appeared to be corroborated by another study that found FST blue and white thresholds were inversely correlated with maximum dark-adapted perimetric sensitivity (r = –0.8) at low luminance, though the correlation weakened at higher luminance where thresholds were likely to be influenced by cones [30]. Conversely, other studies correlating FST thresholds with ERG and microperimetry have suggested responses may be generated within the central 20 degrees of the visual field [31], or mediated by spatial summation [32].

FST as a clinical tool

Given the scaling challenges of using MLMT in a clinical setting, FST has become a favoured clinical surrogate follow-up measure of post-treatment efficacy for RPE65 patients [33, 34]. Additionally, FST may be clinically useful for monitoring patients with early-stage retinal disease [22], or as an auxiliary test for patients with low vision and poor central fixation [35]. However, without harmonised reference data or best-practice consensus for FST, protocol variations between institutions currently limit data comparability and interpretation of quality.

Clinical practice guidelines often require that tests are adapted to become feasible and practical for specialised populations [36]. The specific utility of FST in IRD means the clinical population will invariably include children and/or those with complex sensory and developmental needs. Stingl’s group [37] reported a strong correlation between age and three-month FST improvement to blue stimuli in treated RPE65 patients (regression analysis R2 = 0.81), suggesting a higher chance of rescuing rod function with treatment at a younger age. Currently the evidence base for FST protocol modification is unclear, and specific reference data are lacking for paediatric and more specialised populations.

Scoping review aims & objectives

Ahead of the development and formalisation of standardised clinical practice guidance for FST, it will be valuable to understand the scope of practice of how the FST is delivered and reported in different contexts, compare the methodological variability of testing protocols, and identify potential areas for adaptation or optimisation for specific populations. As summative research about FST is limited a scoping review was chosen to systematically survey the current available literature on the FST, with the hope that findings may form the basis for future systematic review and evidence synthesis.

Methods

The scoping review is reported according to the Preferred Reporting Items for Systematic Reviews and Meta-analysis Extension for Scoping Reviews (PRISMA-ScR) checklist, and methodology followed guidance from the Joanna Briggs Institute (JBI) [38]. The research question was to explore FST practice in human participants in research and clinical settings to date. A comprehensive electronic Boolean search was conducted in the Cochrane Library, EMBASE, Pubmed, Scopus and Web of Science, using keyword combinations based on ‘FST’, ‘full-field’, ‘stimulus’, ‘sensitivity’, ‘scotopic’, ‘threshold’, ‘retina’, ‘night vision’ etc, with search refinement guided by a library information specialist (C.L., Aston University Library). The full search query is listed in Appendix Table 1. Electronic databases were last searched in September 2022.

To minimise publication bias and maximise scope, grey sources were located manually from bibliographies of key studies or by searching grey literature repositories and clinical trial registries using selected keyword combinations. Grey literature also included educational and technical documents such as user manuals and commercial symposia presentations, sourced through Google searching or obtained directly from the manufacturer.

All identified citations were collated into Mendeley Desktop Version 1.19.4 (Elsevier, London, UK). Following duplicates removal, title and abstracts of references were screened before selected texts were retrieved and assessed in full against the inclusion criteria (Appendix Table 2). The process was checked by a second reviewer (A.H.) and any discrepancies resolved through discussion. Included items were all sources where full-field luminance stimuli were used to quantify visual sensitivity thresholds in human participants using a psychometric method. Items were excluded if they were in a non-human setting, if there was no appreciable reporting of FST methodology or outcomes, or if the full text was not available.

Data extracted included key characteristics regarding source, participant, test and key results and recorded in a data extraction form in Microsoft Excel. Categories and subcategories of the full data extraction form are listed in Appendix Table 3. Data extraction from non-English language papers was aided by free online translation platforms where appropriate. A web-based application was used to derive numerical data estimates from figures or graphs where relevant (WebPlotDigitizer, https://apps.automeris.io/wpd/).

To present a narrative overview of all sources included, data relating to key source characteristics, participant characteristics, test characteristics and key outcomes were counted and summarised. Concepts were categorised and coded to explore relationships among common themes and all data visualisation was performed using Microsoft Office.

Results

Search and selection of data

The full study selection process is presented as a PRISMA-ScR flow diagram [39] (Fig. 1). After removal of duplicates, 713 unique abstracts were screened against inclusion criteria. First stage screening was checked in parallel by a second reviewer (A.H.) using a randomly selected sample of titles and abstracts and any disagreement was resolved through discussion. In total, 363 items from screening were retrieved in full for further assessment alongside manually sourced grey literature.

Fig. 1: Scoping review flow diagram.
figure 1

PRISMA-ScR flow diagram of source selection process (from Tricco et al., 2018).

The broad search initially sought to capture analogous modalities or techniques to FST, such as the whole-field Scotopic Sensitivity Tester (SST-1) (LKC Technologies, Gaithersburg, MD, USA) which is no longer commercially available, or studies deriving the scotopic final threshold from the dark adaptometry curve. For improved feasibility and relevance of the scope it was agreed with other authors (A.H. and D.T.) that data extraction and synthesis should focus on items pertaining to the current commercially available versions of the FST. 85 total sources were eligible for final inclusion [1,2,3,4, 9,10,11,12, 14,15,16,17,18,19,20,21,22,23,24,25,26,27, 29,30,31,32,33,34,35, 37, 40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94], comprising published studies using FST (n = 80) and product information available from manufacturers (n = 5) (Appendix Table 4). Quantitative summarisation and evidence synthesis will refer to these either separately as ‘FST studies’ and ‘commercial information’, or collectively as ‘sources’. The findings from evidence synthesis are reported under the four broad themes of ‘Scope’, ‘Population and context’, ‘Methodology and reporting’, and ‘Interpretation’.

Scope of FST sources

The general publication characteristics of the included FST sources are summarised in Table 1. Of the 85 total sources available since 2005, the majority (72%) of items are from the most recent five years to date, reflecting escalating interest in FST as an outcome measure after its inclusion in the landmark RPE65 clinical trials (Fig. 2). There was high variability and inconsistency in author self-reporting of study design type, and studies were re-classified for this review based on published definitions [95]. Most included studies were retrospective evaluations of a specific IRD phenotypes (n = 12), longitudinal or natural history studies (n = 11), or explorations/comparison of novel outcome measures in the specified IRD population (n = 8)

Table 1 Summary of source characteristics.
Fig. 2: Summary figure showing timeline of included sources (n = 80) by year of publication, with key stages and publications in full-field stimulus test (FST) development and voretigene neparvovec (VN) clinical trial progression and approval.
figure 2

CADTH Canadian Agency for Drugs and Technologies in Health, EMA European Medicines Agency, FDA Food and Drug Administration, TGA Therapeutic Goods Administration.

Institutional affiliations spanned 20 countries (Fig. 3), with the United States (USA) having the highest number of mentions in included sources (207 mentions) followed by the Netherlands (33 mentions). Specifically in the USA, institutions from Philadelphia were the most frequently mentioned (68 mentions) in author affiliations. This may be consistent with these being sites for the groups involved in the preclinical, clinical and commercialisation work for VN gene therapy [13, 40] and the development of the initial and current iterations of the FST [1, 2, 4].

Fig. 3: Geographic infographic of institutional affiliations of all included sources (n = 85) by country and city (or USA state).
figure 3

Area of circle is proportional to the total frequency count of mentions for the institution as listed within the Author Affiliations across all sources.

There were no included sources affiliated with sites based in the continent of Africa. Two groups in Brazil [31, 41,42,43] represent the only mentioned centres in South America. There were four publications affiliated with two institutions in Sydney. This high representation of published research from North America and Western Europe may reflect the geographic distribution of centres with specialised interest and resource available for genetic eye disease and vision science research. There may also be limitations of the search strategy used in this review, despite effort to search non-English language databases. Moreover, this could also highlight a lack of resourcing, infrastructure and funding for carrying out specialised ophthalmic testing or clinical trials in these lesser represented regions

Population and context

The commercial product information sources (n = 5) carry no relevant patient population data so are omitted from this part of quantitative synthesis and used for reference purposes where appropriate.

Clinical genotype and phenotype

Study populations of included FST studies (n = 80) are mainly patients with IRD and/or visually healthy participants used as controls. This is consistent with the FST having been initially developed expressly for this low vision patient population [1]. Non-IRD populations in which FST has been performed includes diabetic retinopathy [41, 44], age-related macular degeneration [1, 45] and one study in only visually healthy controls [46].

Broadly, in rod and rod-cone retinal dystrophies, the rationale for using FST has been to establish or characterise residual rod photoreceptor function [23], measure natural history of rod-cone functional decline [22], or as a supplementary measure of visual function changes e.g. after gene supplementation [17, 34, 96]. In macular or cone dystrophy populations, FST has been used as an alternative measure of dark-adapted thresholds for patients who have difficulty maintaining the fixation necessary to perform dark-adapted perimetry [27], or used with different chromatic stimuli and backgrounds to interrogate differential photoreceptor sensitivity [47].

The study population data of included studies are tabulated in Table 2 and represented in Fig. 4, categorised by pattern of retinal abnormality, followed by clinical phenotype and genotype, although some studies had mixed or unspecified populations.

Table 2 Studies categorised by phenotype and/or genotype.
Fig. 4: Patient population of included studies (n = 79) categorised by pattern of retinal dysfunction, clinical phenotype and genotype.
figure 4

The area of each segment is proportional to the number of FST studies with this clinical population. One study omitted from figure due to being a narrative review. BBS Bardet-Biedl syndrome, BCM blue-cone monochromacy, CORD cone-rod dystrophy, cCSNB complete congenital stationary night blindness, eAMD exudative age-related macular degeneration, LCA Leber congenital amaurosis, RP retinitis pigmentosa, XLRS X-linked retinoschisis.

FST and patient age

There were nine studies with no available age data. Of the other 71 studies, the age of the total study patient population ranged from 6 months to 87 years old at baseline or time of reporting. 44 of these studies (62%) had a study population that overall included patients aged <18 years, of which 15 studies included children aged ≤5 years (pre-school age). FST was not always achieved nor attempted in all patients in the total study population, with reasons given including patient reported as having no or minimal light perception [48, 49], unavailability of equipment [40, 50, 51], adverse complications [52] or age-related reasons. Specifically reported reasons for not achieving FST in patients of younger age include ‘exhaustion’ [12], results being ‘too variable’ [35], unreliable performance [11] or simply ‘young age’ [20].

For eleven of the 44 studies that included children in their total population, FST was not reported on an individual patient basis and only as a summative statistic so it was not possible to know if FST results were specifically achieved in those under 18 years of age. This was usually due to the study population being large [1, 3, 53, 54] or the source being a conference abstract [55]. The 44 studies are tabulated in Table 3 showing overall age range, number of patients with reportable FST thresholds, and ages of individual children in whom FST results were achieved. Age data from the final column of Table 3 of 148 children from 31 studies are pooled and plotted as a histogram (Fig. 5), with median age 11 and IQR 9–14 years.

Table 3 Characteristics of studies with paediatric populations (n = 44), ordered by overall study population age range (youngest to oldest).
Fig. 5: Ages of individual paediatric patients reported to have achieved results from FST testing (n = 148 children pooled from 31 studies where individual ages were reported).
figure 5

Median age of these children is 11 (Interquartile range 9–14).

A comprehensive summary figure (Fig. 6) visualises all FST studies by overall study age range, with age range for which FST results were achieved or available shown where reported. Demarcation lines show studies in which the overall population included patients aged <18 years (blue dashed line), or ≤5 years (green dashed line). From this figure, it can be appreciated that while there were 15 studies that included children aged 5 or under, only one study reported successful FST results in these younger children. These were two children with RPE65 mutations in the VN Phase III clinical trial [12]. However, for the youngest child who was 4 years old at baseline, FST was achieved only at day 180 and year 1; results were missing at baseline and all other timepoints due to unreliable white light testing and procedural deviations [12, 13]. Subjective testing in infants aged less than 5 years of age can be limited by comprehension and/or cooperation, and there are currently no age-based modifications to the FST.

Fig. 6: Summary graph of FST studies (n = 71)* ordered by total population age range, showing also individual ages of patients with FST results where available.
figure 6

The blue dashed line and grey area demarcates paediatric and adult patients. The green dashed line demarcates age 5 (pre-schoolage). *Nine studies were omitted from graph due to no available data on age.

Methodology and reporting

Nomenclature

While commonly abbreviated to ‘FST’, there remains inconsistency in the full name of the test among sources. While the original studies [1,2,3] refer to the ‘full-field stimulus test’, other terminology used in literature and manufacturer materials include ‘full-field sensitivity test’, ‘full-field light sensitivity threshold’, ‘full-field scotopic threshold’ or similar variations and combinations of terms. Figure 7 shows these variations and number of sources in which they appear. Regardless of the shared abbreviation, inconsistency for the full name introduces additional ambiguity such as for keyword selection when performing systematic literature searching, or when sources may be translated or compared between centres. The World Health Organization have advised on the importance of standardising nomenclature of medical devices. Inconsistencies in the names of medical devices my cause confusion between types of devices, affect traceability, and adversely impact healthcare delivery [97].

Fig. 7: Variations in ‘FST’ full nomenclature used by sources (N = 80).
figure 7

Some sources use more than one form of nomenclature within the same text.

Testing equipment

In 56 (70%) of 80 studies, FST was performed using hardware and software from Diagnosys LLC. This was often the Espion ColorDome™ LED full-field stimulator with the E2 or E3 desktop console Espion software version E6.49 (ref. [56]) or E6.59 (ref. [16]), and/or specific FST software. One group (2019; 2021) refer to a ‘thresholding algorithm built into a computer driven ERG system’ without mention of commercial software or hardware [35, 57]. The original iterations of the FST by Roman et al. were developed using a custom modified Zeiss-Humphrey perimeter [1, 2, 4]. Although the FST is also available on the MonCvONE-CR perimeter-based system by Metrovision [58], no studies have yet been published using this device for FST, although one study [59] mentions testing in one patient was performed using the Metrovision device before acquisition of the Diagnosys equipment. 20 studies (25%) did not specify the software or hardware used to record the FST responses.

The patient response interface for psychometric testing was mentioned only by 17 (21%) of the 80 studies, seven of which specified a ‘two button box’, ‘binary’ or ‘yes-no’ input [3, 27, 56, 59, 60], four studies reported ‘button box’, or ‘button-press’ [33, 48, 61, 62], and four ‘patient response’ [23, 31, 63], The review by Simunovic’s group [64] reported ‘target detection’. The perimeter-based methods with the original modified Humphrey or the MCvONE perimeters are known to use a single-button input [1,2,3, 58]. The remainder 78% of studies did not specify whether the response input was one or two-choice, and this cannot be otherwise inferred, since the commercialised Diagnosys device offers both one or two button input options, or alternatively the operator may also respond on behalf of patient’s verbal response [4, 65,66,67]. These methodological differences may affect comparability and interpretation of results in several ways. A single or dual-choice decision paradigm affects the complexity of the task (which may be particularly relevant when testing young children or patients with learning needs), and may alter the underlying psychophysical algorithm for threshold estimation. Moreover, using an operator response method may introduce additional response delay and require prolonging of the interstimulus interval (ISI), which could introduce fatigue or also alter the psychometric function.

Stimulus characteristics and thresholds

Colour

Regarding the colour of stimuli, 22 (28%) of the 80 studies reported FST thresholds were obtained using white or achromatic ‘6500 K’ stimuli; 24 (29%) studies used red and blue LED stimuli (peak wavelength ranges were 632–642 nm and 450–465 nm, respectively); 19 (24%) used white, red and blue stimuli; and three referred to white red, blue and green (peak 513–530 nm) stimuli, although green light FST data were not used in analysis [46, 54, 68]. Two studies tested FST thresholds using blue stimuli only [9, 69], and stimulus colour was not specified in eight studies. Two studies were categorised as ‘other’ due to being a review [4] or post-hoc analysis of existing data [30] (Fig. 8).

Fig. 8: Colour and wavelength of flash stimuli used for full-field stimulus threshold testing (n = 80 studies).
figure 8

*Assumed to be typographical error (should be 6500 K). †Assumed to be typographical error (should be 638 nm). ‡‘dim’ LED < 0.01 cd∙s/m2, ‘bright’ LED > 0.01 cd∙s/m2.

Of the 51 studies that reported performing FST using chromatic stimuli, 34 cited the purpose was to delineate which photoreceptors mediated the FST response by using the difference in measured sensitivity to ‘blue’ (presumed rod) and ‘red’ (presumed cone) stimuli. Only 18 of these studies specified the blue-red difference criteria for distinguishing rod, cone or mixed photoreceptor mediation. Differences between blue-red FST ranged from ≥19 to 28 dB for rod-mediated thresholds (with more sensitive thresholds to blue compared to red stimuli), ≤0 to 10 dB for cone-mediation (similar sensitivity to blue and red stimuli) and values between these bounds indicating mixed rod-cone mediation (Table 4). The manufacturer information for the Metrovision FST suggests a difference between thresholds to blue and red stimuli of 19 dB characterises rod mediation [58].

Table 4 Differentiating rod and cone contributions to sensitivity thresholds tested using chromatic full-field stimuli.

Birch’s group [22] compared white, red and blue thresholds in the RUSH2A study to phenotype USH2A-associated retinopathy. They noted patients with ≥ 20 dB difference between blue and red thresholds mostly had white thresholds less than -30 dB. This led to the proposal that any white FST threshold dimmer than –30 dB is rod-mediated (reference level 0 dB = 0.1 cd/m2) though some exceptions can be noted in those with long-standing USH2A disease. Zabek’s group adopted this classification of rod-mediation for their white FST thresholds tested in RP patients (reference level 0 dB = 0.1 cd·sm2) [60].

Stimulus presentation order was often not specified for two-colour testing. When three or more colours were tested, one study group specified an order of blue-red-white [46, 68], and another used blue-white-red [33]. Two groups recommended red-blue-white [41, 42, 44], three groups used white-red-blue [2, 27, 48, 51], and for the remainder of studies order was not specified. In the original paper by Roman’s group [1], the stimulus testing sequence was: white, white, blue, red, white, blue, red, white.

Temporal presentation and response time or ISI

For stimulus temporal characteristics, the Diagnosys software allows customisation options of ‘flash’, ‘pulse’ or ‘blink’ [65, 67], while the Metrovision system presents a ‘flash’ every 3 s [58]. 51 (64%) of the 80 studies did not specify stimulus type or duration. Of those that did, 12 studies reported using a 200 millisecond (ms) flash stimulus, eight used a 4 ms flash, three reported a ‘brief full-field flash’, while six stated simply ‘flashes’. The interstimulus interval (ISI) or response time-window was specified by Klein and Birch [3] and Ahuja’s group [70] as 5 s, though the DiagnosysFST software enables customisation of ISI between 1 and 9999 ms [65, 67]. An ‘unconstrained response window’ was used in the sepofarsen clinical trials, when patients with CEP290-LCA were tested using a commercial binary thresholding algorithm. Though one patient with a substantial treatment response (P11) was assessed further using chromatic FST under dark and light-adapted conditions using a 4/2 dB staircase and two response reversals with a ‘limited’ time window, reported to minimise false-positive responses (i.e. any responses not synchronised with stimulus presentation) [16, 56]. Roman et al. [4] suggest the ISI may need to be prolonged for severely affected patients. For inter-session duration, Roman [2] and Ghazi’s [14] groups suggested a ‘short’ pause between threshold determinations to avoid fatigue, Messias’ group [41, 42] specified 5 min interval between sessions for their patients with diabetic retinopathy, while ‘no timeout’ was specified in the study with RP patients by Zabek et al. [60]. The majority (81%) of studies did not specify ISI or duration between threshold determinations or between sessions.

This omission in reporting stimulus temporal characteristics again complicates interpretation and data comparability. It was reported that two patients with CEP290-LCA (P7 and P9) in the sepofarsen clinical trial were erroneously tested using a 4 ms stimulus at several timepoints instead of the protocol-specified 200 ms [16, 17]. In the interim report, one of these patients (P7) was tested under both conditions at 3 months and appeared to show a ~1 log unit better sensitivity to the blue (but not to the red) stimuli for the longer stimulus presentation [16]. This is noteworthy since 1 log unit (or 10 dB) improvement in FST sensitivity is suggested to constitute a clinically significant post-treatment change in RPE65 patients [11, 33, 71]. In the phase 1b/2 report the authors decided to exclude the data from this patient and impute the data for the second patient (P9) who was erroneously tested [17]. The temporal characteristics of the light stimulus (and hence also the ISI) may have consequences for the psychometric function, as well as the conversion of the units of light stimulus between the absolute (cd∙s/m2, cd/m2) or relative luminance units (dB).

Testing Strategy

The original versions of the FST developed by Roman’s group from a modified perimeter used a staircase algorithm initially that varied stimuli in 4 dB steps and then in 2 dB steps with each reversal (4-2-2), with the final threshold estimate as the luminance last seen by the patient [1]. The commercially available DiagnosysFST programme describes a seen/not seen strategy and explores the detection of stimuli within a 10 dB range of a selected starting luminance. If the threshold is not found within this 10 dB range, the software shifts the area of exploration to a 10 dB range up or down according to a proprietary algorithm which includes no stimulus ‘catch’ trials until a threshold is reached. It should be noted that although described in some texts as ‘forced-choice’ [3], this yes/no strategy is distinct from and has less control against response variation bias compared to true two-alternative forced choice methods where a choice is made between two versions of concurrent or sequential stimuli. The final threshold in the DiagnosysFST programme is calculated as the midpoint of the frequency of seeing curve generated using a two parameter Weibull function, accounting for false positives (‘Error Blanks’) and false negatives (‘Error Max’) [3, 64, 65, 67]. The Metrovision MonCvONE perimeter employs an 8-4-2-1 staircase sequence, with occasional tests for patient reliability using no stimulus ‘catch’ trials [58].

Reporting of the psychophysical strategy is inconsistent among the 80 included studies. Of 12 studies that mentioned a ‘staircase’ strategy, 11 referred to using the ‘4-2-2’ strategy despite testing using the commercialised system. One study used a 5-2.5-2.5 staircase [72] and another stated that ‘16 reversals were required for convergence’ [70]. 31 studies did not provide details of FST methodology or strategy but cited previous references to published methodology, most often that of [1,2,3]. Test strategy was not specified or referenced in 19 studies.

For estimation of the final threshold, 12 studies reported this to be determined as the median or 50% probability of detection on the sigmoidal psychometric function, while eight specifically referenced the built-in two parameter Weibull function. Twelve other studies reported that the final threshold was taken as the average of multiple measurements per colour stimuli per eye (usually 2, 3, or 10 measurements per stimuli). Roman et al. [4] advise 6 measurements per session to be the most appropriate compromise for estimating variance and avoiding fatigue – with two sessions of 6 measurements being sufficiently powered (98%) to detect a 5 dB FST change on two-sample 2-sided t-test at 5% significance.

Patient preparation

Where pupillary dilation was specified (22 studies; 28%), this was most frequently done with 1% tropicamide and 2.5% phenylephrine. Testing was nearly always performed monocularly (usually with other eye patched), apart from three studies that performed binocular testing [51, 69, 72] and 15 studies where this was not specified.

In total, 60 (75%) of the 80 total studies reported FST performed under dark adapted (DA) conditions and 9 studies mentioned light-adapted testing, with or without DA testing. Only four studies specified DA methodology and six studies detailed light adaptation (Table 5). For DA methodology, the DiagnosysFST commercial information advises only to ‘set the patient in a dark room and start the adaptation timer’ [65, 67] and allows a choice of duration. The shortest DA time reported was 20 min [19, 73], and the MetrovisionFST also references 20 min of DA [58]. The longest DA duration was ‘overnight’ in a natural history study in RLBP1-RP patients [74].

Table 5 Dark adaptation (DA) and light adaptation (LA) methods for full-field stimulus testing, where available, and ordered by duration of DA duration.

Patients were dark-adapted for 25 min in four studies, 30 or ‘at least 30’ min in fifteen studies, 40 min in two studies, 45 or ‘at least 45’ min in nine studies, and 1 h in one study (Table 5). Sengillo et al. [54] reported 25–40 min DA, while Aleman et al. [35] reported a DA duration of >45 min plus additional 15–20 min for pupillary dilation. There was no appreciable rationale for choice of DA duration (no apparent associations with patient population, VA, age etc), though Stingl’s group [19] chose 20 min based on the minimum ISCEV recommendations for scotopic full-field ERG at the time [98]. 22 studies where dark-adapted thresholds were tested did not specify DA duration. Williams et al. [45] performed chromatic FST without DA, but acknowledged that this limited their ability to interpret their blue FST data as isolated rod-function responses.

In the phase 1 VN gene therapy dose escalation trial, chromatic FST thresholds were tested after both a’standard’ DA time of <2 h and ‘extended’ DA time of >3 h, due to the authors’ prior observation of prolonged rod (but not cone) kinetics on scotopic perimetry in treated RPE65-LCA eyes [99]. Only extended DA blue flashes responses were reported, while extended DA red flash responses were reported only if cone-mediated [10]. The short-term phase 1 trial reported 1 h DA for blue FST [9], which shortened to 40 min DA for chromatic and achromatic stimuli in the phase 1 follow-on trial report [52]. The later VN clinical trial publications did not specify DA method or duration. In their recent review, the original FST developers advise a 45 min protocol only if values are not different to those tested under 2 h DA [4].

Interpretation of FST results

Units

Interstudy comparison and interpretation of FST data is complicated by the variability in units used to represent threshold values of luminance. 51 studies (64%) reported in decibels (dB, a ratio-based scaling of luminance units) while 19 studies (24%) reported in log units of luminance i.e. log(cd∙s/m2) or log(cd/m2). Some studies used a combination of both dB and log units [24, 69] while other units in the literature included log scotopic Trolands [47], LogMAR [43], and a percentage of the ‘maximum threshold needed to elicit the FST response’, which the authors state was done to ‘minimise the ceiling effect and produce a meaningful figure’ [15]. Four studies did not report any units (Table 6).

Table 6 Units used in the reporting and measuring of retinal sensitivity luminance thresholds in full-field stimulus threshold testing sources.

For reference, the definitions of common photometric units are provided in Appendix Table 5. and can be found in the ISCEV calibration guidelines [100]. Candelas per square metre (cd/m2) is the International System of Units (SI) standard unit of luminance. This measures light emitted from a source surface per unit area, and is used for display screens or ganzfeld background luminance. For brief flashes of light, such as the stimuli used in ERG and FST testing, ‘flash strength’ is given as candela-seconds per square metre (cd·s/m2). This weights luminance by the flash duration to account for the temporal integration of the visual system. The MLMT, the mobility test used for the primary endpoint of the VN Phase III clinical trials, measures the levels of the various light conditions in units of illuminance (lux) [12]. Illuminance is a measure of the amount of light falling onto or received by a given surface area, and decreases as the distance between the surface and the light source increases. Trolands are calculated by multiplying luminance of the stimulus by pupillary area, to give an estimate of the effective stimulus at the retina.

Photometric measures are matched to the spectral sensitivity of the light-adapted eye (peaking at 555 nm). Rod stimuli may be more accurately measured in scotopic units that correct for the spectral sensitivity of the dark-adapted eye (peaking ~500 nm), typically using scotopic filter over a photometer, however such photometer filters are not widely available. Hence, the ISCEV calibration standards recommend using photopic units but note that for a short wavelength rod flash, a xenon strobe of 2–3 photopic cd·sm2 is equivalent to 4 scotopic cd·sm2 [100].

For the studies reporting in dB, there was further variation in the 0 dB reference point used for converting between relative (dB) and absolute luminance (i.e. cd∙s/m2) units. Although it is possible to customise this conversion base in the DiagnosysFST programme setup and interconvert between units within the software, the variability in published literature may also point to inaccuracies of reporting. For example, Klein and Birch [3] used a 0 dB = 0.1 cd∙s/m2 reference, reportedly equivalent to 25 cd/m2 presented for 4 ms. Nguyen’s group [20] reported their 0 dB set to 0.01 cd/m2, but also stated that this was equivalent to 25 cd/m2 presented for 4 ms, despite there being no time component in their reported units. An earlier study with the same patient cohort set the 0 dB reference to 0.1 cd∙s/m2, but both studies state the same ‘healthy’ control reference value of -53 dB (assumed to be rod mediated and derived from [1,2,3]) despite the apparent difference in log unit scaling [59].

There were 26 studies that reported FST results in dB but did not specify a 0 dB reference point. This has implications not only on data comparability between centres, comparison with control reference data, but also on interpretation of clinically significant change. For example, a result of -60 dB would correspond to –8 log units if 0 dB was set to 0.01 cd∙s/m2, but to -7 log units if 0 dB was set to 0.1 cd∙s/m2. This means without specification of unit parameters, a patient tested at two different time-points could be erroneously reported as having a -1 log unit clinically significant improvement in retinal sensitivity due to ambiguity of the reference scales used.

For post-treatment RPE65 patients, a clinically significant post-treatment change in retinal sensitivity measured on FST is considered to be >10 dB or >1 log unit improvement in white light threshold (averaged across both eyes) from baseline [11, 12, 52]. Nonetheless, this still represents a relative rather than absolute change in retinal sensitivity (i.e. the retina being sensitive to a ten-fold decrease in stimulus luminance). Test-retest variability was most frequently between 1–3 dB or 0.1 to 0.3 log units, often calculated as the 2 standard deviations from the mean threshold, across 15 studies with available data.

Reference FST data from healthy controls

Reference FST threshold values based on results from visually healthy controls were available in 32 studies, either numerically reported in the text or represented on figures. Direct comparison of reference values between studies is challenging due to the variability in units used (dB, log phot-cd/m2 or log scot-tds) and different conventions in positive or negative scaling of luminance parameters. Moreover, some studies used a reference range and others a mean value with standard deviation or standard error. It may also be considered that reference values may localise to the control population available to each centre. The reference range for DiagnosysFST are suggested to be –6 to –7 log units for white flashes [29], while for the Metrovision-FST, responses from a healthy control participant measured –85 dB, –62 dB and –81 dB for white, red and blue flashes respectively, with 0 dB equal to 318 cd/m2 [58]. Figure 9 presents available healthy control reference data from included sources, with values as originally reported in the text.

Fig. 9: Summary figure showing control reference data for full-field stimulus testing where available.
figure 9

Studies are grouped according to units and 0 dB reference scaling. Note that data are plotted in their original reported values of each study, and so may not be directly comparable between studies that use different 0 dB scales. DA dark adaptation, FST full-field stimulus threshold, LA light adaptation.

Discussion and conclusions

Having demonstrated utility in pivotal gene therapy clinical trials, FST has emerged as an increasingly adopted visual functional outcome measure both in clinical research settings and the eye clinic. The need for wider harmonisation of methodology and reporting is clear [4], and this scoping review has not only summarised characteristics of current FST practice but also highlighted instances where interstudy practice variability may preclude data comparability, interpretation, and meaningful clinical follow-up. Additionally, despite indication for better therapeutic potential for treating IRD patients at a younger age, FST has rarely been achievable in children aged ≤5 years, with protocol modification and age-matched reference data an underdeveloped research area in this topic.

Considerations for methodological standardisation

Although the commercialised FST programmes offer flexibility in test customisation, the lack of formalised standard guidance is reflected by the numerous current permutations in test parameters such as patient interface (e.g. one or two-button response paradigm, verbal or button-press response); stimulus parameters (colour/wavelength, order of presentation, temporal characteristics, response time window, number of reversals etc); threshold calculation strategy (rod/cone mediation criteria; method of calculating final threshold); and patient preparation (method and duration of dark or light adaptation, patient instruction, mydriasis, monocular or binocular testing etc).

Each element of variation introduces additional ambiguity or potential alterations to the underlying psychometric function. This affects comparability, reproducibility and interpretation of data between studies which may have significant consequences for future multicentre clinical trials, evidence synthesis, assessment of quality or agreement between centres, or establishment of regional or national references databases. Investigators and standardisation stakeholders may wish to consider that some test parameters may be specific to the purpose and clinical population being investigated, and generate context-based guidance modifications accordingly.

Considerations for reporting and interpretation

A notable finding from this scoping review is the high proportion of studies that currently omit crucial elements of FST methodology in their reporting, such as characteristics of the light stimulus (4 ms or 200 ms) or 0 dB reference level. This may be due to a need for concise methodology in papers, and sometimes studies sometimes simply reference the original publications of [1, 3] while clearly using an individualised protocol. Arguably, unit differences may be arbitrary if the clinical aim is to compare sensitivity change from baseline for individuals, provided parameters remain consistent for the same patient. Nevertheless, these reporting inconsistencies limit reliable post-hoc analyses such as for systematic evidence synthesis or meta-analyses. As FST becomes more favoured as a visual functional endpoint in clinical trials (such as in NCT04516369), investigators may wish to consider the Delphi approaches to the generation of core outcome sets [101, 102] to assess and prioritise minimum elements of FST methodology that are the most important to report.

What is a clinically significant change in FST

Relatedly, it must be considered that the clinically significant 10 dB or 1 log unit change suggested by the RPE65 clinical trials represent relative improvement proportional to the baseline residual sensitivity. A ten-fold increase in FST threshold from -1 to -2 log units represents improvement by 0.09 cd∙s/m2, but from -4 to -5 log units is a 0.00009 cd∙s/m2 improvement (using 0 dB = 0.1 cd∙s/m2). This requires patients with a higher baseline threshold value to demonstrate a larger step-change in absolute luminance sensitivity from baseline to constitute improvement. This also has implications for infants or patients with severely low vision, in whom it may be challenging obtain an accurate baseline if at all, as demonstrated in [12] and several other studies where FST results were omitted due to unreliable performance or lack of baseline comparison.

Interpretation is additionally complicated by lack of consensus on the source of remnant vision i.e., whether FST is a summative global response or mediated by the most sensitive retinal loci. There is the question of whether intrinsic photosensitive retinal ganglion cells may also contribute to visual perception to flash stimuli in the absence of outer retina [4, 48, 103]. Pupillometry studies have suggested that longer duration (1000 ms) and brighter white or blue (2.4–2.6 log cd/ms2) stimuli favours a melanopsin-mediated transient pupillary light reflex (TPLR) whereas TPLR under shorter flash durations (e.g. 100 ms) is likely more driven by remnant outer retina [75, 104].

Clarification of structure-function associations and FST response origin will be important not only for understanding of disease pathophysiology, but also for predicting treatment potential from retinal structure or other factors [105]. In treated RPE65 eyes, Stingl et al. [37] showed evidence of clinically relevant retinotopic rod function improvement using FST, DA chromatic perimetry and chromatic pupil campimetry, but also found younger age to be a major predictor of degree of functional photoreceptor rescue. Conversely, Gange’s group [53] found despite progressive perifoveal chorioretinal atrophy in 18 eyes of 10 treated RPE65 patients, average FST improvement remained consistent at 3 log units, VA was not significantly altered, and VF improvement was broadly stable on 1 year follow-up. Notably, 9 of the 10 patients described with treatment-related atrophy were aged <18 years. These findings further elaborate the significance of structure-function and paediatric-specific considerations in the treatment and post-treatment monitoring of RPE65 retinal dystrophy, which may be applicable also to other IRDs.

Considerations for adaptation for specific populations

It is readily appreciated that psychophysical testing in younger children can be constrained by cooperation, understanding and cognitive/motor demand, particularly for sequentially presented stimuli requiring a subjective response. Studies have found that lapse rate (incorrect response to ‘catch trials’) for simple psychophysical tasks can be 19% in neurotypically developing children compared to 5% in adults [106]. These factors may be further amplified in children with early visual impairment and/or atypical sensory processing [107, 108]. Non-compliance with scotopic testing may be compounded by unfamiliar clinical settings and the need for mydriasis and lengthy dark adaptation. Accurate paediatric psychophysical measurement requires a careful balance between efficiency (e.g. more reversals reduce variability, but longer procedures may result in higher lapse rate) and the need for procedural standardisation to ensure comparability between participants [106].

Some strategies to account for inattentional bias and high lapse rate include indexing the lapse rate through statistical methods such as checking consistency of reversal points [106], or estimating threshold through post-hoc fitting of psychometric function on QUEST-based procedures rather than taking the staircase reversal average [109]. Lapses may also be physiologically monitored through recording of eye movements or posture [106, 110] or using neurophysiological correlates [111]. Furthermore, investigators may opt to develop novel strategies to minimise likelihood of lapses through improving task engagement by children, such as through gamified approaches e.g. [112, 113].

Although commercialised automation of the FST enables a single threshold estimation to be completed relatively quickly (3 min for Metrovision and 2–8 min for DiagnosysFST), recommendations to perform several runs per stimuli per eye for a more reliable average (Roman et al. [4] recommend 6 repeats) with 5 min breaks between trials [41, 42] can prolong test time. Intuitive strategies to improve test efficiency include selecting an appropriate starting luminance based on manual pre-testing or previous estimates [4, 65], though this is not yet standardised or clearly reported (e.g. how many dB above the pre-test estimate is appropriate to set the starting value). In younger children or patients with learning needs, it may be that investigators consider alternative methods of estimating starting luminance based on visual behaviour gauged through preferential looking techniques or spatial frequency limits [114, 115].

Nevertheless, the largest contributor to total test duration is dark adaptation (DA) time, of which there is no current consensus. The range in included studies is 20 min to 3 h (or overnight in one instance), with the most frequently reported being 30 min. The original test developers recommend 45 min DA, provided values are not different from those tested under 2 h DA [4]. Anecdotally, for young RPE65 patients with additional learning or communicational needs and their families 45 min DA is challenging and sometimes distressing. Alternatives to provide a physiologically meaningful measure are needed for these instances. There appear to be ongoing investigation into chromatic FST performed under ‘mesopic’ conditions (0.1 cd/m2 background illumination) without pupil dilation or dark adaptation [116].

With FST being a test of absolute visual thresholds. Roman et al. [4] caution that rod adaptation kinetics may be particularly prolonged in newly treated RPE65 eyes [99]. It may be that investigators develop condition-specific DA protocols to accommodate this phenomenon in RPE65 patients, while opting for alternative appropriate DA durations for baseline testing or in other retinal conditions, using data derived from literature or empirical research. It might be further argued that for purposes of monitoring or assaying light perception in low vision patients, the priority is to standardise conditions to facilitate longitudinal comparison against baseline or a control reference, regardless of where the test point may fall on the patient’s individual DA curve. Furthermore, as with other tests of visual function, FST is rarely performed in isolation and should be interpreted within the constellation of other structural and functional measures [64, 117].

Future perspectives

FST emerged among several novel specialised approaches to assay vision at the absolute thresholds, driven by the need to develop outcome measures for low vision patients to assess therapeutic benefit in gene therapy clinical trials. As FST can be delivered using existing ganzfeld hardware or a modified perimeter, it has become a widely available and increasingly favoured supplementary measure of retinal sensitivity for both research and clinical monitoring purposes.

This scoping review has summarised some key areas of current variability in FST practice, discussed some impacts of non-standardisation, and highlighted the under-researched space of paediatric-specific considerations in this field. The next steps are for stakeholders to conduct further evidence synthesis or empirical research to establish consensus on a set of coherent methodological guidance (which may need to be specific to the type of retinal condition under investigation), protocol modifications for specialist populations and importantly a minimum set of elements for reporting (for example, 0 dB reference level, response task, and key stimulus parameters such as colour, duration and ISI, length of dark adaptation). In parallel to this, agreement on criteria for quality assurance, stimulus calibration and establishment of local or regional age-matched reference databases will also be crucial to enhance reliability and rigour of measurement [100, 118].