Introduction

Important progress has been made in the treatment of high-risk rectal cancer (RC) patients in the past decades1. Prospective clinical trials of neoadjuvant chemoradiotherapy have shown that approximately 15% of patients can reach pathological complete response (pCR) by neoadjuvant chemoradiotherapy (CRT)2 and that response rates can be further increased to about 30% pCR by total neoadjuvant treatment (TNT)3,4,5,6. Organ sparing approaches may be realistic for patients with pCR, and can have major impact on long-term quality after definite treatment However, intensified neoadjuvant treatment is associated with increased frequency and severity of adverse events that can present during or after CRT7. At the same time, responses are heterogeneous with 30–40% of patients presenting with tumor regression of different grades while 7–30% of patients classify as non-responders8,9. Interpretable predictive biomarkers to guide personalized treatment are therefore of utmost importance to further improve patient selection for intensified treatment regimens and treatment outcomes10. The setting for the identification of predictive markers in the clinical pathway of RC patients remains highly challenging: Turnaround time needs to be short and only scarce material from diagnostic biopsies is available for study, limiting the utility of molecular and functional analysis methods.

The Consensus Molecular Subtypes (CMS) define four distinct subtypes of colorectal cancer (CRC) by common patterns of gene expression11. The prognostic value of CMS classes has been reported in several studies12,13, yet their association with treatment outcome is an open research topic11. Recently, Domingo et al.14 have described an association between pCR to neoadjuvant chemoradiotherapy and transcriptional CMS signatures in diagnostic RC biopsies, suggesting that molecular classification may also be used as a predictive biomarker for treatment stratification. Overall, early profiling of CMS classification in the clinical pathway promises to be highly relevant for personalized treatment decisions15. Limitations of transcriptional CMS classification include a high cost and time requirement for RNA sequencing, a high failure rate due to the small amount of tumor material available and difficulty to standardize single samples13,16. In addition, even successful transcriptome generation shows higher frequencies of unclassified cases in biopsies than resections13,17.

Machine learning has been shown to be a promising alternative solution for predicting molecular signatures across multiple cancer types from diagnostic histopathology sections18. In particular, Sirinukunwattana et al.13 showed that deep learning models can predict image-based CMS (imCMS) classes that match transcriptional CMS calls directly from hematoxylin-and-eosin (H&E) stained histopathology whole slide images (WSIs) of CRC tumor specimens, using three independent cohorts. This prior work also highlighted the potential of imCMS to stratify patients for whom transcriptional CMS failed. Yet, evidence of the generalization of imCMS on biopsy cohorts was limited by the need for domain-adversarial training and fine-tuning, requiring labeled data from the target cohort, which is a shortcoming for the implementation of imCMS in the clinic.

Here, we set out to show whether imCMS classifiers can be trained without domain adaptation and perform reproducibly in independent pre-treatment biopsy datasets. By investigation of real-world cohorts sourced from the Medical Research Council (MRC) and Cancer Research UK (CRUK) Stratification in Colorectal cancer (S:CORT), we estimated via simulation experiments how many cancer biopsies are sufficient to achieve near-optimal imCMS classification performance. We then assessed whether imCMS classification is associated with pCR to neoadjuvant long course chemoradiotherapy (LCRT) with single agent fluoropyrimidine treatment in two independent held out diagnostic biopsy cohorts, comprising a total of 169 advanced RC patients with outcome data.

Results

Effect of multi-cohort training on imCMS performance

With the goal of developing a generic imCMS classifier that can perform well on both resection and biopsy specimens (without the need for a domain adaptation procedure as required in imCMSv113), we first assessed the effect of combining resection and biopsy cohorts for training the models, while keeping independent resection and biopsy cohorts held out for evaluation purposes. We designed five experiments with different combinations of cohorts (Fig. 1b). For each experiment, the development datasets were split into five bins (with stratified cohorts and classes at the patient level) to build five folds of training/validation splits by using each bin once as an internal validation set (for model selection purposes), while the remaining four bins are used for training. Classification performances were separately measured for each trained model using the corresponding held out test sets, as well as for the ensemble model that combines the output of the five models (Fig. 1c).

Fig. 1: Study Flow Figure.
figure 1

a Description of the image dataset used in this study. b Table of the cohort-level partitions of the dataset for training and validation of the models compared in this study. c Illustration of the 5-fold experimental setup and ensembling approach investigated in this study. d Flowchart of the 3-stage imCMS classification pipeline investigated in this study. e Table of the cohort-level partitions of the dataset used for training and validation of biopsy simulations (left), and illustration of the biopsy sampling simulation protocol investigated in this study. Scale bars in WSIs represent 2 mm, sampled tiles have sides of size 630 μm).

The macro-average area under the receiver operating characteristic curves (AUROC) of the trained models are reported in (Fig. 2), the detailed ROC curves, and confusion matrices of the best performing trained models are reported in (Supplementary Fig. 2). For each test set, we achieved the best performance when both resection and biopsy images were used jointly for training. The ensemble model resulting from the experiment [E3a] (TCGA: macro-average AUROC .813; ARISTOTLE: macro-average AUROC .798) was used for subsequent analysis of outcome association. To ensure a valid assessment of performance, all the datasets in this study were curated by excluding WSIs with poor overall quality (e.g., tissue folding, out-of-focus images), and image processing was restricted to regions of tumor and microenvironment that were delineated by a board-certified pathologist specialized in gastrointestinal pathology. The exclusion criteria used in this study are listed in (Supplementary Fig. S1). Robustness to heterogeneity of image appearance and to staining variability was addressed by the application of standard data augmentation policies during training. Details on the training procedure are provided in Supplementary Notes.

Fig. 2: Classification performances of imCMS models trained and evaluated using data from different combinations of cohorts.
figure 2

Numbers are macro-averages of the AUROC for ensembles of five models trained and validated with five different folds of the same development set. Each model of an ensemble model was evaluated separately: the mean and standard deviation of the corresponding macro-average AUROCs is reported in parenthesis. The ensemble model resulting of the experiment [E3a] was used for subsequent analysis of association between imCMS and pCR to neoadjuvant LCRT.

Association between imCMS and pCR to neoadjuvant LCRT

To investigate the association between imCMS calls and pCR to neoadjuvant LCRT with single agent fluoropyrimidine, we identified two RC cohorts treated with the same treatment regimen (pelvic irradiation 45–50.4Gy in 25 fractions over 5 weeks, combined with single agent Capecitabine on treatment days) (see Supplementary Notes). We applied the imCMS ensemble model resulting from experiment [E3a] to WSIs of all pre-treatment biopsies of the ARISTOTLE and SALZBURG cohorts with available pCR and pre-treatment T/N stage data. imCMS calls at the case level were defined as the majority vote of the calls of the five different trained models that constitute the ensemble model. Cases with undecided majority were classified as “mixed” (n = 2). The analysis of imCMS calls with clinicopathological data and treatment outcomes was conducted with 114 patients of the ARISTOTLE cohort including 24 patients with pCR to neoadjuvant LCRT, and 55 patients of the SALZBURG cohort with 6 patients with pCR to neoadjuvant LCRT. Patients were split into groups based on their predicted imCMS class. Odds ratios for pCR were calculated separately for each group and adjusted by T-stage, N-stage and cohort. Significant associations (p-value < 0.05) were found for both imCMS4 with lack of pCR (OR 0.25, 95% CI 0.07–0.88, p = 0.03) and imCMS1 with pCR (OR 2.69, 95% CI 1.01–7.17, p = 0.048) (Fig. 3), suggesting that imCMS calls generated from pre-operative biopsy material associate with response of the primary tumor to combined chemoradiotherapy.

Fig. 3: Neoadjuvant LCRT treatment response and association with imCMS.
figure 3

a Illustration of the analysis of association between imCMS groups resulting from the application of an ensemble model (experiment [E3a]) and outcome in the independent holdout cohorts ARISTOTLE and SALZBURG. b Comparison of odds ratio of pCR outcome by neoadjuvant LCRT between different groups of patients of the ARISTOTLE and SALZBURG cohorts based on imCMS calls derived from pre-operative biopsy images. Patients with biopsies classified as imCMS1 were significantly more likely to achieve pCR by LCRT (OR 2.69; 95% CI: 1.01–7.17, p = 0.048) than all other imCMS groups, whereas patients with imCMS4 calls rarely achieved pCR (OR 0.25; 95% CI: 0.07–0.88, p = 0.031). Two patients had mixed imCMS classification calls and were not shown in the odds ratio comparison.

Simulation: effect of biopsy sampling on imCMS classification

To assess how reliable biopsy-based imCMS predictions are in comparison to resection-based imCMS predictions in a control scenario, we estimated the change in classification performance of imCMS as a function of the number of biopsy fragments in a sample. It is theoretically possible to conduct an experiment in which we would generate a dataset with many tissue fragments biopsied from each patient’s tumor site along with paired resection specimens and associated CMS calls, then comparing imCMS classification performance when combining different numbers of fragments. However, since repeated sampling cannot be ethically justified, we argue that such an experimental protocol can be approximated via the generation of virtual biopsy fragments randomly sampled from existing resection images.

We generated 26 simulations of biopsy datasets, each with a specified number m of biopsy fragments and sourced from either of the two datasets of resection specimens, FOCUS or SPINAL. For every WSI in a given resection dataset, we collected 10,000 subsets of m non-overlapping biopsy-fragment-shaped images randomly cropped from the annotated tumor regions in the resection. Cropping was performed by sourcing real shapes of tumor biopsy fragments (Fig. 4a, b) that had been manually annotated in the GRAMPIAN and ARISTOTLE cohorts (total of 1580 possible shapes). Examples of simulated biopsy samples from a source resection specimen are illustrated in (Fig. 4c, d). Thus, each random subset of m fragments simulates a single random biopsy sampling event that an endoscopist could potentially perform, including possibly both superficial and deep samples. Then, all the random subsets generated across all the WSIs of a resection dataset were concatenated to represent a set of random biopsy sampling events (Fig. 4c, d). By restricting the cropping procedure to annotated regions of tumor and microenvironment we aimed at having simulated biopsies that are a good representation of real-world biopsies. We designed these simulated datasets to approximate the characteristics of actual biopsy datasets under the assumption that, (a) sampling locations are uniformly distributed in a tumor region, (b) the shape of the biopsies is random and independent of the sampling location, (c) sampling locations of consecutive biopsy fragments are random and non-overlapping. Consequently, we argue that the distribution of images of such a simulated biopsy from a single resection image closely approximates the distribution of images of actual random biopsy samples excised from the same tumor specimen.

Fig. 4: Biopsy Simulation Experiments.
figure 4

a Example of a WSI with heterogeneous tile-level imCMS classification. b The shapes of simulated biopsy fragments are randomly sampled from annotations of biopsies in the GRAMPIAN and ARISTOTLE cohorts. c Illustration of a biopsy simulation procedure based on a source imaged resection specimen. Biopsy samples are randomly generated by sampling m = 1 biopsy fragments at random in the annotated tumor region of the resection specimen. This random process is repeated 10,000 times for each WSI of a resection dataset to form a simulated biopsy dataset. For each generated biopsy sample, imCMS is applied. The resulting distributions of class probability over a simulated biopsy dataset are shown on the right. d same as c but with m = 3 biopsy fragments. The spatial heterogeneity of tile-level predictions (shown in a) explains the different distributions obtained for each imCMS class across a simulated biopsy dataset. e Classification performance of trained imCMS models on simulated biopsy datasets as a function of the number of tumor biopsy fragments per sample. Dots with lines indicate the mean and standard deviation of the macro-average AUROC obtained by five models for each simulated dataset: [E4a] (left); [E4b] (right). The red dotted line indicates a 3% difference score with the macro-average AUROC obtained on corresponding fully imaged resection specimens. Scale bars represent 2 mm.

To assess the performance of imCMS in the simulated biopsy datasets, we made sure to evaluate models on data not seen during training by using the training and test partitions described in Fig. 1e. Each trained imCMS model resulting from the training procedures [E4a] and [E4b] was applied to all its respective simulated test sets. As a result, we observed a rapid increase of classification performances when the number of fragments in a biopsy sample increases until convergence to the performance level that the models achieved when using the resection data without sampling (Fig. 4e). For both test conditions, we report a AUROC difference lower than 3% of the AUROC achieved with the original resection data when the number of simulated tumor biopsy fragments in a sample is above five, suggesting that reliable imCMS classification can be achieved at diagnosis in a large fraction of cases19.

Stability of distribution of cell types between resection and biopsy samples

As endoscopists excise biopsies in specific targeted regions of tumors, this procedure is subject to variance between operators as well as patient-specific factors. This can potentially induce a variation of morphological information in images of biopsies in comparison to resection specimens. This may particularly affect heterogeneously distributed factors in the tumor microenvironment such as cancer, immune and stromal cell populations. We therefore investigated whether there are systematic differences in microenvironment composition between biopsies and resection specimens of CRC specimens. Specifically, we confirmed the absence of confounding factors with real-world biopsies and investigated whether systematic differences of cell distribution exist between biopsy samples and resection specimens by deconvolution of stromal, immune and tumor-related signatures from transcriptomic data in 529 biopsies and 565 resections from four S:CORT cohorts (FOCUS, SPINAL, GRAMPIAN and ARISTOTLE). First, we generated absolute abundance estimates for key immune and stromal cell types with two different tools (MCPcounter20 and xCell21) using bulk transcriptomic data. With both methods, levels of key immune lineages (T-cells including CD8+ and differentiated cytotoxic T-cell subpopulations, B-cells, NK-cells, monocytes, myeloid dendritic cells and neutrophils) as well as endothelial and stromal cell populations in biopsies and resections were statistically similar across CMS subtypes (all p-values > 0.25, ANOVA) (Fig. 5). This result suggests that biopsies may be representative of the broad abundance of cell types in the tumor microenvironment compared to resection specimens, according to their transcriptomic CMS classes.

Fig. 5: Distributions of cell types across resection and biopsy samples.
figure 5

Comparison of measures of abundance scores for ten representative cell types between 529 biopsy and 565 resection primary CRC samples of the cohorts used in this study (FOCUS, SPINAL, GRAMPIAN and ARISTOTLE) using the transcriptome-based MCP-counter (a, b) and xCell tools (c, d). Cell enrichment scores were measured both by CMS subgroup (b, d) and sample type (a, c). Cell type distributions by CMS-class do not differ between biopsy and resection samples, indicating that transcriptional CMS calls derived from biopsy samples robustly capture the underlying biology of CRC shown on the actual tissue. Distributions in panels a and c are similar (p > 0.25, ANOVA), whereas all p-values in panels b and d are significant (p < 0.05).

Tissue microenvironment patterns related to CMS classes

CMS classification is driven by microenvironment and tumor-related factors. To investigate consistency of imCMS classification in RC biopsy material with underlying biological patterns, we visualized image tiles with the highest predicted probability score for each imCMS class in biopsy material (Fig. 6, see Supplementary Fig. 3 for examples of slide-level classification maps). This approach provides an overall idea of the morphology associated with each CMS class in absence of an established definition of transcriptional CMS at the scale of image patches. In agreement with prior studies, we found a consistent pattern of high stromal content and dissociative tumor growth (tumor budding) in tumor tiles classified as imCMS4 from all three cohorts. imCMS1 tiles more frequently contain lymphocytic infiltration and focal mucinous differentiation, although this feature was less frequent than previously observed in CRC resection specimens (compare TCGA, bottom (Fig. 6)), which is consistent with a lower representation of mucinous differentiation in RC. imCMS2 and imCMS3 features were consistent with previously described patterns, with imCMS2 showing epithelial-rich glandular and cribriform tumor growth with focal comedo-like necrosis while imCMS3 was characterized by glandular differentiation with tubular growth, focal mucin and a minor villiform component. Bioinformatic deconvolution of cell composition supported the observed tile-level associations. In biopsy specimens classified as imCMS4, a significantly higher frequency of fibroblast signatures were observed, while imCMS1 samples showed a tendency towards increased immune signatures, supporting biological interpretability of imCMS predictions at the case level.

Fig. 6: Gallery of image patches classified with high confidence for each CMS class.
figure 6

For each cohort (ARISTOTLE, SALZBURG, TCGA), we selected the 16 image patches (side 630 μm) from different patients, with highest probability score for each imCMS class (based on the ensemble model of experiment [E3a]). Visual interpretation confirms the existence of distinct morphological patterns within each predicted imCMS class. a imCMS1: increased lymphocytic infiltrates, poor tumor differentiation, focal mucin; b imCMS2: glandular differentiation with cribriform growth patterns and comedo-like necrosis associated with imCMS2; c imCMS3: glandular differentiation with ectatic mucin-filled glandular structures in combination with a minor component showing papillary and cribriform morphology; d imCMS4: prominent desmoplastic stromal reaction and dissociative tumor growth (tumor budding).

Discussion

There are currently no clinically established predictive markers for response to neoadjuvant chemoradiotherapy in RC. Decision making for RC patients and strategies for assignment to neoadjuvant treatment protocols are still taken at the cohort level and it is challenging to predict which patients will respond to neoadjuvant chemoradiotherapy. Current pathology assessment of pre-operative biopsies is limited to the confirmation of cancer diagnosis and a limited panel of molecular studies such as testing for mismatch-repair deficiency (MMRd) and the testing for common driver mutations if clinically desired. Recent studies have demonstrated a strong benefit of neoadjuvant treatment of patients with MMRd RC with PD-1 Blockade, but with a frequency of approximately 2-3%, this genotype is infrequent in RC22. Other studies have suggested that the absence of tumor budding23,24, low stromal content24 or the quantification of cytotoxic T-cells25 may aid in identifying patients with favorable prognosis, but these methods are based on subjective visual features and incompletely capture the complex biology related to neoadjuvant chemoradiotherapy response. Better methods to supply a biologically informed and clinically relevant classification of RC biology from pre-operative biopsy material are therefore needed.

Consensus molecular subtyping may aid transcriptome-based staging in both colon and rectal cancer. Based on their distinct biology and clinical behavior, CMS1 and CMS4 subgroups play a key role for prognostication and may support the development and assignment of patients to biologically informed precision treatments11,12. Specifically, the CMS1 subgroup (immune subtype) is highly enriched for MSI cases and the CpG Island Methylathion (CIMP-high) subgroup of colorectal cancers. Gene enrichment analysis has shown high expression of immune response genes, in particular interferon signaling as well as the wound healing signature11. Analysis of the tumor microenvironment has demonstrated dense infiltration by anti-tumoral immune populations, in particular CD8+ cytotoxic T-lymphocytes, which play a key role in mediating the anti-tumoral treatment effect of radiotherapy. These biological features support an improved prognosis, increased propensity for response to radiotherapy and may indicate an increased likelihood of response to immune checkpoint inhibitors. In contrast, CMS4 (mesenchymal subtype) comprises stroma-rich tumors with an activation of TGF-β signaling pathways, evidence of epithelial-mesenchymal transition and invasive growth pattern11,12. These cases are characterized by a poor prognosis and increased resistance to radiotherapy treatment as well as conventional chemotherapy protocols26. Studies have demonstrated an improved response and prolonged survival of CMS4 patients to irinotecan as compared with oxaliplatin-based chemotherapy, but overall improvements were limited26. Novel treatments will be needed (in particular TGF-β-targeting anticancer agents) for subtype-specific interventions in CMS4 and to further improve outcomes.

In an earlier study13, we developed an image-based approach to predict the consensus molecular subtypes from WSIs of clinical CRC specimens and provided the proof of principle for generating image-based molecular calls even with minimum input material. In a clinical setting, such image-based morpho-molecular classifiers have the benefit to offer a level of interpretability linked to known molecular profiles, as opposed to black-box classifiers whose interpretation is limited. The predictive ability of image-based molecular calls for response to chemoradiotherapy remained an open question. For the current study, we re-implemented the imCMS analysis pipeline and trained classifiers using a combination of CRC resection and biopsy cohorts, across different stages. We found that combining these datasets was a viable option to enable generalization to external biopsy cohorts without having to rely on domain adversarial training or fine-tuning as previously required13. This incremental improvement showed that such computational pipelines can directly and accurately classify images in new cohorts without the need for additional data or re-training. Despite distributional shifts between training cohorts (e.g., biopsy sampling artefacts, relative proportions of different tissue morphology) can make models subject to learning cohort-specific features that can limit classification performance, we observed a consistent improvement of performance when combining resection and biopsy modalities for training. This result is the most striking when comparing the performances of the models tested using TCGA: models trained solely with biopsy data (GRAMPIAN or ARISTOTLE) were not able to generalize unless resection data (FOCUS and SPINAL) was used. Furthermore, we conjecture that combining these datasets enables learning of modality-agnostic features that are invariant to inter-cohort variations. This hypothesis is supported by our visual assessment of the image patches with highest probability scores from different patients across the TCGA, ARISTOTLE and SALZBURG cohorts, which illustrates the stability of morphology of top-contributing tiles across independent cohorts. Yet, this visual assessment and the suggested associations observed between local morphology and CMS classes should be further studied in bottom-up approaches and further confirmed in conjunction with the establishment of the CMS classes in small fields of view.

The significant associations between imCMS1 and pCR to neoadjuvant LCRT and between imCMS4 and lack of pCR to neoadjuvant LCRT suggest the existence of predictive morphological features that our models were able to capture. This result paves the way for future work for the validation of such a tool as a support for clinical decision in the neoadjuvant treatment setting. Specifically, imCMS1 calls could identify patients for total neoadjuvant treatment with favorable prognosis. Due to an enrichment for immune-activated cases and MMRd cases, imCMS1 cases could also have a greater likelihood to achieve complete remission with immune-checkpoint blockade27,28, but this association requires further study. In contrast, patients classified as imCMS4 could be selected for clinical trials adding additional chemotherapy or biological agents to CRT, as the likelihood for response to standard LCRT protocols with single agent fluoropyrimidine is low. This result is aligned with current understanding of CMS4-associated biology: based on the high stromal content and TGF-β signatures in this subset, resistance to cytotoxic treatment is common29. For imCMS4 patients, tumor-stroma-based therapeutic targeting may therefore offer a potentially efficacious and biologically informed treatment alternative27,28. Ten Hoorn et al. suggest that prospective studies are required to further establish the CMS taxonomy and confirm treatment efficacy by CMS subtype in clinical practice12. However, CRC subtyping via current RNA sequencing technologies is deficient, in scenarios with low tumor material: other modalities such as biopsy imaging thus constitute a promising, fast and cost-effective alternative. Using an image-based approach, we demonstrate successful imCMS classification for all samples in the present study, compared to technical failure rates of up to 35% using state-of-the-art panel sequencing approaches13.

Current ESGE guidelines19 recommend sampling a minimum of six biopsies for the diagnosis of colorectal carcinomas, to ensure sampling of tumor fragments, and to reliably represent the overall tumor phenotype. Yet whether this number of biopsies is optimal to determine the classification of CRC according to biological subtypes remains an open question. We thus undertook comprehensive simulation experiments on existing fully digitized clinical cohorts to capture the morphology related to transcriptional CMS calls and to address whether clinically established sampling protocols are sufficient to describe tumor heterogeneity at the gene-expression level. As expected and due to the spatially heterogeneous nature of CRC tumors30,31, our results suggest that sampling less than five tumor biopsies is not sufficient to properly predict the CMS call that would be obtained from an equivalent resection specimen of the same tumor. Our results corroborate the 2021 ESGE recommendation as our experiments show on two independent external test datasets of 147 and 266 patients, that five or more standard tumorous biopsy fragments are sufficient to reliably capture the global tumor phenotype needed to achieve CMS classification performance with fidelity close to full resection specimens. Thus, the current sampling protocol established in clinical practice is likely sufficient to capture tumor biology from several cancerous fragments, and to enable informed patient stratification by computational analysis.

A limitation of the current imCMS pipeline is the reliance on manual annotations of tumor regions as a pre-processing step, making the whole pipeline semi-automated. We suggest that future work should investigate strategies to either fully automate this step or to accurately classify WSIs independently of localized tumor regions. The pipeline of imCMSv1.5 was designed as an incremental extension of the work of Sirinukunwattana et al.13, yet with the goal of increasing classification and generalization performance. Although model performance was reported in held out test sets in full transparency, the variation of performance observed across the different test sets suggest the existence of hidden dataset-related factors impacting model classification. The identification of these factors and the improvement of model robustness remain a priority for the development of future versions of imCMS. Other deep learning architectures and training procedures such as recent proposed solutions to biopsy image classification problems32,33 and consideration of model uncertainty beyond ensemble majority voting should be done in future work. A weakness of our simulation experiments is the strong assumption for randomness of the sampling distribution of biopsy fragments and we concede that the distribution of true biopsy fragments may be different from our simulation, e.g., tumor fragments from the lumen are more likely to be sampled in a real-world scenario. The effect of these degrees of approximation should be investigated in future work and across cancer types. We also recognize that the guidelines requiring six biopsies per case is designed to ensure there is a high chance of identifying invasive cancer, rather than slough or non-invasive malignancy. Here, our approach shows five or more biopsy fragments containing invasive tumor are required to achieve optimal imCMS classification. Although we did not observe any significant difference in terms of distribution of cell types between the studied resection and biopsy cohorts, such sampling information should be accounted for in future work. Beyond prediction of CMS classes, other molecular signatures of CRC proposed in the literature34,35 are relevant candidates to identify associations with treatment outcome.

To conclude, we found that deep learning models can automatically capture the morphology associated with transcriptional CMS in imaged biopsies from unseen cohorts. We found that patients stratified according to biopsy-based imCMS respond differently to neoadjuvant LCRT. The results of this study therefore support the development of an inexpensive clinical tool to assign patients to subtype targeted biological interventions in future clinical trials. Beyond CMS in CRC, the on-going development morpho-molecular classification models across cancer types and molecular signatures offer a new type of cost-effective computational tools to support clinical decision making18,36. This surrogate for transcriptional analysis can also be of use for research studies with restricted funding, or to revisit existing trial cohorts by studying associations between image-based CMS classification and clinical variables of interest without the need for extra tissue material.

Methods

Study design

The study design, cohorts and aims are outlined in Fig. 1 and detailed methods for all experiments and statistical analysis are provided in the Supplementary Notes.

All samples in the S:CORT cohorts were obtained following individual informed consent and ethical approval by the National Research Ethics Service in the United Kingdom (ref 15/EE/0241; IRAS reference 169363) consistent with the principles set out in the Declaration of Helsinki. The SALZBURG cohort was reviewed by the ethical board of the provincial government of Salzburg, Austria (415-E/2343/5-2018), although under Austrian law informed consent is not needed for research use and is therefore not available for all cases.

Rectal cancer LCRT treatment protocol

All patients from the GRAMPIAN, ARISTOTLE and SALZBURG cohorts included in the study received a “standard” treatment protocol for advanced RC by pelvic irradiation (45–50.4Gy in 25 fractions over 5 weeks) combined with Capecitabine (825 mg  m−2 BD on treatment days) Detailed information on these cohorts is available in the Supplementary Notes. The primary endpoint for all cases was pCR after completion of LCRT as assessed by histopathological analysis of the surgical RC resection specimen by an expert gastrointestinal pathologist according to established guidelines.

Processing of tissue samples and slide scanning

For the four S:CORT cohorts, serial 5-μm sections were cut from one pathologist-selected representative tumor block for H&E staining followed by up to nine unstained sections for RNA extraction. H&E slides were reviewed by an expert gastrointestinal pathologist and invasive cancer regions were annotated to guide RNA and DNA extraction by macrodissection. Regions of extensive necrosis and non-tumor tissue were excluded according to standard practice for molecular tumor profiling. All H&E slides were scanned on an Aperio scanner at magnification 20 × (0.5 μm  px−1). All digital slides were re-reviewed and a board-certified pathologist annotated tumor regions while excluding areas containing folds or debris. The data filtering procedure, cohort sizes and additional details for each cohort are summarized in (Supplementary Fig. 1).

Transcriptional CMS classification

For the S:CORT cohorts, RNA expression was obtained by microarray (Xcel, Affymetrix), and raw CEL files underwent the robust multiarray average normalization with the Affymetrix package (v1.56.0)37 in R38. Batch-corrected transcriptional CMS calls were derived for each sample with CMSclassifier using the protocol described in the Supplementary Notes. Any variation in the number of patients and slides in the same cohorts used in the study by Sirinukunwattana et al.13 stems from the updated procedure for transcriptional CMS. For TCGA39, the same transcriptional CMS calls were used as previously reported by Sirinukunwattana et al.13.

Transcriptomic immune profiling

The batch-corrected version of the S:CORT transcriptome was used to derive estimates of immune and stromal cell types with MCPcounter20 and Xcell21 by applying the original R packages. Although scores for both signatures are not comparable between cell types, they were scaled from 0 to 1 to facilitate visualization.

Deep learning imCMS classification

To develop a WSI-based CMS classifier, we considered the model proposed by Sirinukunwattana et al.13 as a baseline (imCMSv1), and re-implemented a new version (imCMS v1.5) that facilitates the training procedure, and reproducibility of our experiments while keeping high performance (test macro-average AUROC in held-out TCGA of .813) and same capabilities for the interpretation of classification results.

The version 1.5 of imCMS is based on the three-stage process illustrated in (Fig. 1d). First, manually annotated tumor regions of an input WSI are tiled with patches of size 318 × 318px at magnification 5 × (~2 μm  px−1) with 50% overlap, and all tiles with less than 50% overlap with the annotated tumor regions were excluded. Then, all the extracted patches are fed as input to a trained deep learning model that outputs probability scores for each target CMS class. Third, all tile-level probability scores are averaged to produce slide-level probability scores, and the class with the highest score is considered as the imCMS call prediction for the input WSI.

We kept the first stage identical to imCMSv1 but opted for a fixed magnification for tile extraction at magnification 5 × based on the results of13, suggesting optimal performance in both resection and biopsy images. With the last two stages, we moved from a count-based assumption to a collective assumption for determination of imCMS calls40, thus enabling more fine-grained contributions of each tile to the slide-level predictions. Further, for imCMS1.5, we changed the pre-trained InceptionV3 backbone architecture used in imCMSv1 by a randomly initialized customized ResNet architecture, to ensure that our approach can be re-implemented without having to rely on a pre-training procedure. See Supplementary Notes for more details about the implemented model architecture and training procedure.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.