Introduction

Non-muscle invasive bladder cancer (NMIBC) has one of the highest per-patient cancer-related costs due to high recurrence rates and the need for long-term cystoscopic surveillance1. Disease management also profoundly impacts quality of life, especially for patients progressing to more advanced disease2. Intravesical bacillus Calmette-Guérin (BCG) is the current standard of care for adjuvant treatment in intermediate- and high-risk NMIBC; however, up to 40% of patients do not respond to therapy3. These “BCG-unresponsive” patients and those who progress from NMIBC to potentially lethal muscle-invasive disease (MIBC) often require aggressive therapy in the form of radical cystectomy, which carries considerable morbidity and mortality. Therefore, accurate and timely prediction of recurrence and progression remains the cornerstone of management and counselling for NMIBC patients.

Artificial intelligence (AI) has recently emerged as a promising tool in urology, enabling accurate and personalised risk predictions by integrating multimodal data4. However, many AI models in urothelial cancer have been found to carry a high risk of bias5. Indeed, despite the proliferation of AI research, few models have been successfully adopted into clinical practice – underscoring the need for more sophisticated, AI-specific tools to scrutinise these studies. APPRAISE-AI is a quantitative tool we developed to evaluate both methodological and reporting quality in AI studies6. It also provides detailed assessments of data and model quality, making it particularly valuable for comparing AI studies addressing the same clinical question.

This systematic review aims to critically evaluate the robustness of AI models predicting recurrence and progression in NMIBC. We compare the performance of AI and non-AI approaches for these tasks. Using APPRAISE-AI, we assess study quality and identify common methodological and reporting pitfalls. Finally, we provide recommendations to address six key areas: (1) dataset generation, (2) outcome definitions, (3) methodological considerations, (4) model evaluation, (5) reproducibility, and (6) peer review.

Results

Study screening and selection

The initial search identified 7102 studies, of which 5558 underwent title and abstract screening after removal of duplicates. A total of 490 studies proceeded to full-text review, and 475 were excluded (Fig. 1). In all, 15 studies were included, with five studies focusing on recurrence7,8,9,10,11, four on progression12,13,14,15, and six on both outcomes16,17,18,19,20,21. Detailed characteristics of the included studies are summarised in Tables 1 and 2.

Fig. 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart.

Table 1 Study characteristics and performance metrics of studies focused on NMIBC recurrence
Table 2 Study characteristics and performance metrics of studies focused on NMIBC progression

Study characteristics

Eight studies (53%) were published between 2000 and 2010, and seven (47%) between 2015 and 2022. Most studies (60%) were from Europe (five from the United Kingdom and one each from Spain, Poland, the Netherlands, and Italy), followed by Asia (two each from Japan and South Korea, and one from China) and Africa (one from Egypt).

All studies focused on model development using retrospective data, of which four (27%) included multiple institutions. Only one study included non-academic institutions18. Median sample size was 125 (IQR 93−309) and median follow-up was 71 months (IQR 32−93). Median recurrence and progression rates were 50% (IQR 42−62) and 19% (IQR 12−25), respectively.

Patient characteristics

Most studies included all NMIBC risk groups. However, patients varied with respect to prior NMIBC history, with nine studies (60%) including only primary tumours, two (13%) with exclusively recurrent tumours, three (20%) with both, and one (7%) with no details provided. Tumour grading scheme also varied, with nine studies (60%) using the WHO 1973 classification system, five (33%) using WHO 2004/2016, and one (7%) with no details provided. Four studies (27%) explicitly reported use of repeat transurethral resection of bladder tumour (TURBT)10,11,20,21. Eight studies (53%) mentioned administration of intravesical therapy, of which six used both BCG and mitomycin C while two used only BCG.

Outcome definitions

Various definitions of recurrence and progression were described. Seven definitions were used for recurrence, including relapse of: (1) equivalent or lower stage, (2) equivalent or lower stage within six months, (3) any stage, (4) any stage within two years, (5) any stage or papillary formations on cystoscopy, (6) Ta, T1, or CIS, and (7) high-grade, T1, or CIS. For progression, seven definitions were reported, including relapse of: (1) ≥ T2, (2) ≥ T2 or metastases, (3) ≥ T2, metastases, or bladder cancer death, (4) from Ta to T1, (5) from Ta to T1 or T1 to T2, (6) from Ta/CIS to T1, T2, nodal disease, metastases, or from low to high grade, and (7) higher stage or grade.

Model characteristics

The most commonly used AI models were based on neural networks (n = 11, 73%), including shallow neural networks, neuro-fuzzy modelling, deep belief networks, DeepSurv, and convolutional neural networks. Studies differed in how their models were trained and evaluated, with seven studies (47%) using separate training and testing cohorts; four (27%) using separate training, validation, and testing cohorts; one (7%) performing 10-fold cross-validation; and three (20%) using the same cohort for both training and testing. Most models incorporated clinicopathological features (n = 10), while other data types included gene expression profiles (n = 6) and radiomic features (n = 2).

Median c-index was 0.76 (IQR 0.68−0.81) for recurrence and 0.76 (IQR 0.75−0.88) for progression. Three studies (20%) provided calibration plots to assess the reliability of risk estimates, and only one assessed net benefit using decision curve analysis.

Quality of studies

Interrater reliability of APPRAISE-AI was moderate to excellent, with ICCs ranging from 0.60 to 1.00 for item scores and 0.83 to 0.96 for domain scores, and an ICC of 0.98 for overall scores (Supplementary Table 1). The median overall score was 37 (low quality) and ranged from 26 (low quality) to 64 (high quality). From 2000 to 2010, all studies were low quality except for one of moderate quality (Supplementary Fig. 1). From 2010 to 2022, three of seven studies were low quality. Overall study quality improved over time (regression coefficient 0.65, 95% CI 0.08−1.21, p = 0.03). Only one study across the entire study period was high quality21.

The two strongest APPRAISE-AI domains were clinical relevance and reporting quality, while the three weakest were methodological conduct, robustness of results, and reproducibility (Fig. 2). Items achieving greater than 60% of their maximum possible score included title, background, objective and problem, eligibility criteria, ground truth (defining the outcome of interest), model description, cohort characteristics, model specification, critical analysis, implementation into clinical practice, and disclosures (Supplementary Fig. 2). Items achieving less than 40% of their maximum possible score included source of data; data abstraction, cleaning, and preparation; sample size calculation; baseline; hyperparameter tuning (adjusting attributes that influence how models learn from data); clinical utility assessment; bias assessment; error analysis; and transparency. Three studies described how missing data were handled, of which one used complete-case analysis and two imputed missing values using random forests. No studies reported a sample size calculation. Only one study included a publicly accessible repository containing the data and AI models necessary to replicate its findings21.

Fig. 2: APPRAISE-AI domain and overall scores. Box plot of APPRAISE-AI domain (blue) and overall (red) scores for the 15 studies using AI to predict NMIBC recurrence and progression. Each box represents the 25th and 75th percentiles, with the centre line indicating the median and the whiskers extending to the minimum and maximum scores. Each field is presented as a percentage of the maximum possible score for that field (i.e., consensus score/maximum possible score × 100%) to allow comparison between fields, irrespective of the assigned weighting. Overall APPRAISE-AI scores were graded as follows: very low quality, 0−19; low quality, 20−39; moderate quality, 40−59; high quality, 60−79; very high quality, 80−100.

Comparison between AI and non-AI approaches

Seven studies (47%) compared AI models with non-AI approaches. These included regression-based models (logistic or Cox regression, n = 4), existing nomograms (European Organisation for Research and Treatment of Cancer nomogram, n = 2), and clinical experts (n = 1). Most studies found that AI outperformed non-AI methods for both recurrence and progression (Fig. 3). However, two studies, which compared AI versus urologists and Cox regression, found that non-AI approaches were superior for some metrics. The margin of benefit of AI compared to non-AI approaches varied depending on study quality. Median absolute difference in performance between AI and non-AI approaches was 10 for the ten low quality studies, 22 for the four moderate quality studies, and 4 for the one high quality study (Supplementary Fig. 3).

Fig. 3: Differences in performance between AI and non-AI approaches. Absolute difference in reported performance metrics between AI and non-AI approaches, stratified by recurrence or progression prediction task.

Discussion

This systematic review identified 15 studies predicting NMIBC recurrence and progression. A distinguishing feature is the use of APPRAISE-AI to provide a comprehensive summary of the methodological rigour and reporting quality of these studies. While most studies reported good to excellent performance of their AI models, two-thirds were rated as low quality. Only one study in the last two decades was considered high quality21. Although the clinical relevance and reporting quality domains attained the highest scores, methodological conduct, robustness of results, and reproducibility consistently ranked the lowest – a recurring issue among other clinical AI studies5,22. This discrepancy between high reporting quality and poor reproducibility can be explained by the former domain encompassing familiar elements such as cohort characteristics, critical analysis, limitations, and disclosures. These items are well understood and routinely reported by the medical community, and are often mandated by journals. In contrast, the reproducibility domain introduces AI-specific concepts including model description, hyperparameter tuning, model specification, and data/model transparency. These items, unique to AI studies, may not be comprehensively addressed within current reporting practices. Therefore, this review emphasises the need for better methodological and reporting practices tailored to AI studies within urology23,24,25.

Common pitfalls of current studies

Common pitfalls can be categorised into dataset limitations, heterogeneous outcome definitions, methodological flaws, inadequate model evaluation, and reproducibility issues. These concerns may lead to overly optimistic estimates of model performance and limit their potential for clinical use.

Datasets: Most models were trained on retrospective cohorts from single academic institutions and thus may lack generalisability in non-academic settings, such as community hospitals. The median cohort size was 125, which is considered small even for regression-based methods. Models trained on smaller datasets are at risk of instability, defined as volatility in models and their predictions arising from their dependence on the training data and modelling approaches used26. Unstable models may generate unreliable predictions, especially when applied to external cohorts.

Data quality issues were also attributed to substantial heterogeneity in eligibility criteria and in patient and tumour characteristics. Only 20% of models were trained on both primary and recurrent tumours. Studies were also divided in their use of the WHO 1973 or 2004/2016 grading schemes. In addition, standard of care varied – only 27% and 53% of studies reported using repeat TURBT and intravesical therapies, respectively, despite almost all studies including high-risk patients for whom these treatments would be recommended. These findings highlight the need for diverse, representative data that accurately reflect the NMIBC patient population and current standard of care27,28.

Outcome definitions: Despite focusing this review on only two prediction tasks (recurrence and progression), we identified 14 distinct definitions across 15 studies for these outcomes. These variations in outcome definitions substantially limit comparability of studies.

Methods: Methodological errors were common across studies. There was limited clarity on data pre-processing steps, especially regarding the handling of missing data. Similarly, hyperparameter tuning steps, which define how models learn from data, were poorly described. In addition, no sample size calculations were reported, so it is unclear whether there were sufficient events per predictor variable for model training29. These concerns undermine the transparency of datasets and models.

Several studies raised concerns about data leakage – for example, using the same dataset for model training and testing without additional steps to obtain an optimism-corrected estimate of model performance30. Indeed, we found that studies with data leakage reported a median accuracy of 86% (IQR 80−93) compared to 83% (IQR 76−90) for those without this concern. Over half of the studies (8/15) did not compare their AI models with alternative approaches, such as existing nomograms, statistical models, or clinical judgement. Of the remaining studies that provided a comparison, we found that better study quality was associated with a lower margin of benefit of AI models.
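To make the notion of an optimism-corrected estimate concrete, the sketch below illustrates one standard approach, bootstrap optimism correction, using scikit-learn; the logistic regression model and array-based inputs are illustrative assumptions rather than a description of any included study.

```python
# Illustrative bootstrap optimism correction for the c-index (AUROC).
# The model (logistic regression) and inputs (numpy arrays X, y) are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_cindex(X: np.ndarray, y: np.ndarray,
                              n_boot: int = 200, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample patients with replacement
        boot_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        boot_apparent = roc_auc_score(y[idx], boot_model.predict_proba(X[idx])[:, 1])
        boot_original = roc_auc_score(y, boot_model.predict_proba(X)[:, 1])
        optimism.append(boot_apparent - boot_original)
    # Apparent performance minus the average optimism approximates out-of-sample performance.
    return apparent - float(np.mean(optimism))
```

In practice, the same procedure should be wrapped around the entire modelling pipeline, including any pre-processing, so that no information from the full cohort leaks into the bootstrap fits.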

Evaluation: Studies typically reported on accuracy, sensitivity, specificity, and c-index. However, these measures are not always appropriate. Furthermore, measures of statistical significance for performance metrics, calibration plots, and net benefit were rarely reported. Therefore, researchers are encouraged to understand the strengths and limitations of different evaluation metrics to select the most relevant ones for addressing their clinical question31,32,33.

Algorithmic bias refers to disparities in AI performance for clinically relevant subgroups, such as sex, race, and socioeconomic status – which violates the ethical principle of justice28. These inequities underscore the fundamental link between training data and model behaviour. Non-representative data may introduce biases against minority groups, which in turn may perpetuate discriminatory practices within AI models. Indeed, several studies have found that AI models disproportionately affect marginalised patients, including females, individuals of African ancestry, and lower socioeconomic status34,35. Various strategies have been proposed to mitigate algorithmic bias to develop “fair” AI models. For instance, a bias assessment is recommended for examining performance heterogeneity across clinically relevant subgroups, similar to subgroup analyses commonly reported in clinical trials6,23,28,36. However, only two studies conducted some form of bias assessment, highlighting a gap in current evaluation practices.
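As a hedged illustration of such a bias assessment (using hypothetical column names and scikit-learn, not a method drawn from any included study), subgroup performance can be reported as simply as follows.

```python
# Illustrative subgroup bias assessment: discrimination (AUROC) within each
# clinically relevant subgroup. Column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame, outcome: str, prediction: str, subgroup: str) -> pd.Series:
    """Return the AUROC within each level of `subgroup`.

    Note: subgroups containing only one outcome class will raise an error and
    should be flagged rather than silently dropped.
    """
    return df.groupby(subgroup).apply(
        lambda g: roc_auc_score(g[outcome], g[prediction])
    )

# Example usage with a hypothetical cohort table:
# cohort = pd.read_csv("nmibc_cohort.csv")  # columns: recurrence, pred_prob, sex, age_group
# print(subgroup_auroc(cohort, "recurrence", "pred_prob", "sex"))
# print(subgroup_auroc(cohort, "recurrence", "pred_prob", "age_group"))
```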

Reproducibility: Only one study provided publicly accessible datasets and code necessary to replicate their findings. This so-called “reproducibility crisis” is concerning and consistent with other areas of AI in medicine37. Since clinical AI models often involve high-stakes decisions with direct patient consequences, failure to reproduce study findings may erode trust in these models and lead to poor clinical adoption.

Recommendations

Despite notable improvements in study quality, substantial work remains to address the common pitfalls outlined in this review. We provide the following recommendations, summarised in Table 3, to enhance the quality of future AI studies in NMIBC.

Table 3 Summary of recommendations to improve AI studies in NMIBC prognostication

Recommendations for data quality: Datasets should be inclusive of NMIBC patients, regardless of their tumour history, stage, grade, subtype, or divergent differentiation, and should not be restricted to academic institutions. Study cohorts should also reflect standard of care, including use of repeat TURBT and intravesical therapies. For example, the European Association of Urology (EAU) prognostic risk groups were based on primary NMIBC patients who did not receive intravesical BCG38. Consequently, these risk groups were found to overestimate progression risk in contemporary BCG-treated patients39. As there is no international consensus on NMIBC grading, researchers are encouraged to report both WHO 1973 and 2004/2022 grading whenever feasible. This topic remains controversial, although proponents have advocated for a hybrid grading system40.

Adequate sample size is also essential to ensure model stability. A sample size calculation example is provided in Supplementary Note 2.
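Separate from the worked example in Supplementary Note 2, the sketch below shows how a rough minimum cohort size can be estimated from the events-per-predictor rule of thumb; the figure of 15 events per candidate predictor is an assumption, and formal sample size criteria for prediction models should be preferred where feasible.

```python
# Illustrative rule-of-thumb sample size estimate for a binary outcome model.
# Uses an events-per-predictor (EPV) heuristic; formal criteria are preferable.

def minimum_cohort_size(n_candidate_predictors: int,
                        expected_event_rate: float,
                        events_per_predictor: int = 15) -> int:
    """Estimate the minimum number of patients required."""
    required_events = n_candidate_predictors * events_per_predictor
    return int(round(required_events / expected_event_rate))

# Example: 10 candidate predictors and a 19% progression rate (the median rate
# in this review) with 15 events per predictor gives roughly 789 patients.
print(minimum_cohort_size(n_candidate_predictors=10, expected_event_rate=0.19))
```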

Recommendations for outcome definitions: To enhance consistency, researchers are encouraged to refer to definitions outlined by the International Bladder Cancer Group41. Additional patient-centred outcomes include number of invasive procedures administered over a two-year timeframe and need for cystectomy, radiation, or systemic chemotherapy42.

Recommendations for methodology: Researchers are encouraged to refer to relevant AI reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network based on their data types and study context (i.e., model development, validation, or clinical trials). For example, the Standardised Reporting of Machine Learning Applications in Urology (STREAM-URO) framework outlines best practices in reporting AI studies in urology23. These include describing: (1) how datasets were divided into training and testing cohorts, (2) how data were pre-processed or modified, (3) how missing data were handled, and (4) what hyperparameters were tuned and how (i.e., grid search, optimisation metric). To prevent data leakage, it is imperative to isolate the testing cohort prior to any data pre-processing steps such as normalisation or imputation. Studies should also incorporate methods to address model overfitting, such as bootstrapping, internal cross-validation, or external validation33.
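A minimal sketch of points (1) to (4), assuming scikit-learn and hypothetical clinicopathological features, is shown below; the testing cohort is isolated before any pre-processing, imputation and scaling are fitted only within training folds, and hyperparameters are tuned by grid search.

```python
# Minimal sketch: leakage-free training with pre-processing inside a pipeline.
# File name, feature names, and model choice are illustrative assumptions;
# categorical features would additionally require encoding.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("nmibc_cohort.csv")  # hypothetical dataset
X = df[["age", "tumour_size", "tumour_count", "grade", "stage"]]
y = df["recurrence"]

# (1) Isolate the testing cohort before any pre-processing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# (2)-(3) Imputation and scaling are fitted only on training folds via the pipeline.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# (4) Hyperparameter tuning by grid search, optimising the AUROC.
grid = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1, 10]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print("Cross-validated AUROC:", grid.best_score_)
print("Held-out AUROC:", grid.score(X_test, y_test))
```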

Recommendations for evaluation: We recommend that researchers compare AI models with appropriate baselines, such as previously published models or regression-based approaches. These comparators help justify whether the additional complexity and opacity of AI approaches are warranted. Model evaluation should encompass measures of discrimination (c-index), calibration (calibration plot), and net benefit (decision curve analysis). Furthermore, we advocate the use of bias assessments to examine performance heterogeneity across clinically relevant subgroups, such as age group, sex, and ethnicity.
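The following sketch, continuing the same hedged scikit-learn example with synthetic predictions, shows how discrimination, calibration, and net benefit can be reported side by side; the net benefit calculation follows the standard decision curve analysis formulation.

```python
# Illustrative evaluation of discrimination, calibration, and net benefit.
# y_true and y_prob are synthetic placeholders for observed outcomes and predicted risks.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.25 + 0.4 * y_true + rng.normal(0, 0.15, size=500), 0.01, 0.99)

def net_benefit(y_true: np.ndarray, y_prob: np.ndarray, threshold: float) -> float:
    """Net benefit at a given threshold probability (decision curve analysis)."""
    n = len(y_true)
    predicted_positive = y_prob >= threshold
    tp = np.sum(predicted_positive & (y_true == 1))
    fp = np.sum(predicted_positive & (y_true == 0))
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Discrimination: for a binary outcome the c-index is equivalent to the AUROC.
print("c-index:", roc_auc_score(y_true, y_prob))

# Calibration: observed vs predicted risk in quantile bins, as used in a calibration plot.
observed, predicted = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")

# Net benefit across a range of clinically plausible threshold probabilities.
for pt in np.arange(0.05, 0.55, 0.05):
    print(f"threshold {pt:.2f}: net benefit {net_benefit(y_true, y_prob, pt):.3f}")
```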

Recommendations for reproducibility: We recognise that institutional privacy and intellectual property considerations may impose restrictions on data and code sharing. However, researchers are strongly encouraged to disseminate their models via publicly accessible platforms or web applications. This practice is best exemplified by Jobczyk et al., who provided a web application for their model and made their deidentified datasets and code available in a public repository21. Alternatively, data can be securely housed in dedicated environments designed for clinical information, as done for electronic health record data from the Beth Israel Deaconess Medical Center in the Medical Information Mart for Intensive Care43.

Recommendations for reviewers: In line with current journal practices of including statistical reviewers, editorial boards may consider recruiting reviewers with AI expertise to assess technical aspects of these studies. Furthermore, we recommend reviewers pay close attention to common pitfalls identified in this review, including methodological conduct, robustness of results, and reproducibility. APPRAISE-AI may be valuable in providing an overall assessment of study quality and identifying specific concerns that may be clarified with study authors6.

Bridging the gap in the adoption of AI reporting guidelines

Despite the proliferation of AI reporting guidelines in recent years, the methodological and reporting pitfalls outlined in this review were consistent with those identified in other areas of medicine, including medical imaging44,45,46, ophthalmology47, vascular surgery48, neurosurgery49, and oncology50,51. One possible explanation is a translational gap between guideline developers and other researchers conducting AI studies. For instance, Pattathil et al. reviewed randomised controlled trials evaluating AI interventions in ophthalmology based on adherence to the CONSORT-AI checklist, a reporting guideline for AI clinical trials47,52. Although three trials were published following the release of CONSORT-AI, guideline adherence ranged from 37 to 78%. However, none of the trial investigators were involved in the development of this guideline. We recently evaluated AI studies on paediatric hydronephrosis using STREAM-URO and APPRAISE-AI53. Among the three studies published after the introduction of these frameworks, the highest scoring study was authored by the same group that developed these tools. These findings reinforce the need for broader stakeholder engagement during guideline development, stronger collaborations between the medical and AI communities, and, most importantly, mandating the use of appropriate AI reporting guidelines by journals. Recent initiatives, such as TRIPOD-AI (prediction models)54, PRISMA-AI (systematic reviews and meta-analyses)55, and the CANGARU guidelines (generative AI and large language models)56, are notable examples that prioritise these considerations.

Data and practice variation due to the human nature of medicine

Despite best practices in AI, the inherent human nature of medicine may impact model generalisability. Tumour staging and grading – which are fundamental in NMIBC prognostication – are subject to considerable interobserver and intraobserver variability, with kappa scores ranging from 0.42 to 0.60 for staging, 0.003−0.68 for the WHO 1973 grading system, and 0.17−0.70 for the WHO 2004/2016 grading system57,58. Furthermore, the RESECT study has highlighted significant variability in recurrence rates among institutions even after controlling for known risk factors, suggesting that differences in surgical technique and perioperative management may play a role59. These limitations require additional efforts to minimise practice variation to allow AI to achieve its full potential.

Limitations

Our findings should be interpreted within the context of this review's limitations. Importantly, study quality was determined using APPRAISE-AI, which was published after the studies included in this review. Accordingly, best practices in AI may have evolved over time. Nevertheless, APPRAISE-AI is well aligned with established non-AI reporting guidelines such as the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement60. Therefore, improved adherence to these guidelines may be reflected in better APPRAISE-AI scores in recent years. In addition, performance metrics could not be pooled across studies due to inconsistent reporting of these metrics and their confidence intervals, so a more sophisticated comparison between AI and non-AI approaches could not be conducted. Finally, only 15 studies were included given the focused scope of this review. However, we also incorporated studies from non-clinical journals, such as those in the Institute of Electrical and Electronics Engineers (IEEE) family of publications.

In conclusion, this systematic review examined current AI applications to predict recurrence and progression in NMIBC. Despite some progress in study quality, the majority of studies were deemed low quality and likely unsuitable for clinical use. Common pitfalls revolved around dataset limitations, heterogeneous outcome definitions, methodological flaws, suboptimal model evaluation, and reproducibility concerns, notwithstanding limitations due to variability in pathological assessment, surgical technique, and perioperative management. Specific recommendations are provided for researchers and reviewers to ensure best practices in AI are followed. Key stakeholders should prioritise enhancing dataset curation, refining methodological approaches, and improving the transparency and completeness of reporting. These concerted efforts are vital for developing high quality AI models that can be safely integrated into future NMIBC care.

Methods

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines and was prospectively registered on PROSPERO (CRD42022354048). There were no deviations from the PROSPERO analytical plan.

Search strategy

OVID MEDLINE, EMBASE, Web of Science, and Scopus were searched from inception to February 5th, 2024. The search strategy was based on a recent scoping review on AI applications in urothelial cancer, including both bladder cancer and upper tract urothelial carcinoma (search strategy available in Supplementary Note 1)5.

Eligibility criteria

All studies investigating the use of AI to predict recurrence or progression in patients with pathologically confirmed NMIBC were included. AI was defined as the use of a computer system to mimic human cognitive functions for clinical decision support. AI models included tree-based models, support vector machines, artificial neural networks, deep learning, and natural language processing. Recurrence was defined as the first relapse of bladder tumour (any stage) following initial diagnosis of NMIBC, or as defined by study investigators. Progression was defined as the first relapse of bladder tumour invading the muscularis propria (T2) following initial diagnosis of NMIBC, or as defined by study investigators. Only studies written in English were included.

Studies were excluded if AI approaches were not used, or non-bladder cancer neoplasms were described. Studies were also excluded if the primary aim was to detect T2 disease on imaging (i.e., diagnostic study) or to assess risk factors rather than prediction modelling. Reviews, abstracts, and articles without full text were excluded.

Data extraction and synthesis

Four reviewers (JW, SM, NB, KN) independently screened and abstracted eligible studies, with disagreements resolved by consensus. The following data were collected: study demographics, patient and tumour characteristics, definition of recurrence and progression, sample size, types of AI models, training features, performance metrics, and information relevant to the evaluation of study quality.

Quality assessment using APPRAISE-AI

APPRAISE-AI is a scoring tool designed to evaluate methodological and reporting quality of AI studies for clinical decision support6. Articles were scored using a standardised form consisting of 24 items with a maximum overall score of 100 points. Each APPRAISE-AI item was mapped to one of six domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality, and reproducibility. Overall scores were interpreted as follows: 0−19, very low quality; 20−39, low quality; 40−59, moderate quality; 60−79, high quality; and 80−100, very high quality. Collectively, the APPRAISE-AI item, domain, and overall scores provide macro- and micro-level insights on the strengths and weaknesses of each study.

Two reviewers (JCCK, AK) experienced in developing urological AI applications independently evaluated each article. Disagreements were resolved by re-reviewing the article and the APPRAISE-AI item criteria, followed by discussion until consensus was reached. Interrater reliability was measured using intraclass correlation coefficients (ICCs), calculated with two-way random effects, absolute agreement, and single measurement. ICC values less than 0.50 indicated poor reliability, values between 0.50 and 0.75 moderate reliability, values between 0.75 and 0.90 good reliability, and values greater than 0.90 excellent reliability61. Linear regression was used to determine whether overall APPRAISE-AI scores improved over time.
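For readers wishing to reproduce this type of analysis, the sketch below (assuming a long-format table of scores with hypothetical column and file names) computes the two-way random effects, absolute agreement, single measurement ICC with the pingouin package and the linear trend of scores over publication year with SciPy; it is illustrative rather than the exact analysis code used in this review.

```python
# Illustrative reproduction of the reliability and trend analyses.
# Column names and file name are assumptions.
import pandas as pd
import pingouin as pg
from scipy import stats

scores = pd.read_csv("appraise_ai_scores.csv")  # columns: study_id, rater, overall_score, year

# ICC(2,1): two-way random effects, absolute agreement, single measurement.
icc = pg.intraclass_corr(data=scores, targets="study_id",
                         raters="rater", ratings="overall_score")
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])

# Linear trend of overall scores over publication year. In the review, consensus
# scores were reached by discussion; the mean across raters is used here only as
# an illustrative proxy for the consensus score.
consensus = scores.groupby(["study_id", "year"], as_index=False)["overall_score"].mean()
trend = stats.linregress(consensus["year"], consensus["overall_score"])
print(f"slope per year: {trend.slope:.2f}, p = {trend.pvalue:.3f}")
```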

Comparison between AI and non-AI approaches

Performance was compared between AI and non-AI approaches examined within the included studies. Non-AI models included statistical models, clinical judgement, or existing clinical tools, such as the European Organisation for Research and Treatment of Cancer nomogram62. Accuracy, c-index, sensitivity, and specificity were considered for this analysis since these metrics were most commonly reported. If studies reported metrics for multiple cohorts, we selected metrics based on the following hierarchy: external validation, internal validation, and training cohort. For each study, the absolute performance difference between the best AI and non-AI model was recorded separately for recurrence and progression. All analyses were performed using GraphPad PRISM version 8.3.0 and MedCalc version 19.6.3.