scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics

Song, Dongyuan; Wang, Qingyang; Yan, Guanao; Liu, Tianyang; Sun, Tianyi; Li, Jingyi Jessica

doi:10.1038/s41587-023-01772-1

Brief Communication
Published: 11 May 2023

Dongyuan Song¹ (ORCID: orcid.org/0000-0003-1114-1215),
Qingyang Wang²,
Guanao Yan²,
Tianyang Liu²,
Tianyi Sun² &
Jingyi Jessica Li¹,²,³,⁴,⁵,⁶ (ORCID: orcid.org/0000-0002-9288-5648)

Nature Biotechnology volume 42, pages 247–252 (2024)

Abstract

We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

**Fig. 1: scDesign3 generates realistic synthetic data of diverse single-cell and spatial omics technologies.**

**Fig. 2: scDesign3 enables comprehensive interpretation of real data.**

Data availability

All datasets used in the study are publicly available. Supplementary Table 2 lists the datasets from 17 published studies (sources included). The preprocessed datasets are available in the Zenodo repository at https://doi.org/10.5281/zenodo.7110761⁵².

Code availability

The scDesign3 package is available at https://github.com/SONGDONGYUAN1994/scDesign3. The comprehensive tutorials are available at https://songdongyuan1994.github.io/scDesign3/docs/index.html. In the tutorials, we described the input and output formats, model parameters and exemplary datasets for each functionality of scDesign3. The source code for reproducing the results is available in the Zenodo repository at https://doi.org/10.5281/zenodo.7110761⁵².

References

1. Tang, F. et al. mRNA-seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
2. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
3. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
4. Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
5. Karemaker, I. D. & Vermeulen, M. Single-cell DNA methylation profiling: technologies and biological applications. Trends Biotechnol. 36, 952–965 (2018).
6. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
7. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
8. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
9. Rao, N., Clark, S. & Habern, O. Bridging genomics and tissue pathology: 10x genomics explores new frontiers with the visium spatial gene expression solution. Genet. Eng. Biotechnol. News 40, 50–51 (2020).
10. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
11. Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).
12. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
13. Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).
14. Cao, Y., Yang, P. & Yang, J. Y. H. A benchmark study of simulation methods for single-cell RNA sequencing data. Nat. Commun. 12, 6911 (2021).
15. Crowell, H. L., Morillo Leonardo, S. X., Soneson, C. & Robinson, M. D. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol. 24, 62 (2023).
16. Sun, T., Song, D., Li, W. V. & Li, J. J. scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biol. 22, 163 (2021).
17. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
18. Crowell, H. L. et al. Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
19. Cannoodt, R., Saelens, W., Deconinck, L. & Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 12, 3942 (2021).
20. Dibaeinia, P. & Sinha, S. Sergio: a single-cell expression simulator guided by gene regulatory networks. Cell Syst. 11, 252–271 (2020).
21. Papadopoulos, N., Gonzalo, P. R. & Söding, J. Prosstt: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics 35, 3517–3519 (2019).
22. Tian, J., Wang, J. & Roeder, K. Esco: single cell expression simulation incorporating gene co-expression. Bioinformatics 37, 2374–2381 (2021).
23. Navidi, Z., Zhang, L. & Wang, B. simATAC: a single-cell ATAC-seq simulation framework. Genome Biol. 22, 74 (2021).
24. Li, W. V. & Li, J. J. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics 35, i41–i50 (2019).
25. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
26. Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11, 166 (2020).
27. Ma, Y. & Zhou, X. Spatially informed cell-type deconvolution for spatial transcriptomics. Nat. Biotechnol. 40, 1349–1359 (2022).
28. Cable, D. M. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 40, 517–526 (2022).
29. Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I. & Heyn, H. SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Res. 49, e50 (2021).
30. Yan, G. & Li, J. J. scReadSim: a single-cell multi-omics read simulator. Preprint at bioRxiv https://doi.org/10.1101/2022.05.29.493924 (2022).
31. Cao, K., Hong, Y. & Wan, L. Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38, 211–219 (2022).
32. Argelaguet, R., Cuomo, A. S. E., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
33. Fang, J. et al. Clustering deviation index (CDI): a robust and accurate internal measure for evaluating scRNA-seq data clustering. Genome Biol. 23, 269 (2022).
34. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1441 (2018).
35. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
36. Ji, Z. & Ji, H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117 (2016).
37. Stasinopoulos, D. M. & Rigby, R. A. Generalized additive models for location scale and shape (GAMLSS) in R. J. Stat. Softw. 23, 1–46 (2008).
38. Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
39. Wood, S. N. Generalized Additive Models: An Introduction with R (Chapman and Hall/CRC, 2006).
40. Kammann, E. E. & Wand, M. P. Geoadditive models. J. R. Stat. Soc. C 52, 1–18 (2003).
41. Czado, C. Analyzing Dependent Data with Vine Copulas (Springer, 2019).
42. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
43. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
44. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
45. Zhu, J., Sun, S. & Zhou, X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 22, 184 (2021).
46. Li, B. et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat. Methods 19, 662–670 (2022).
47. Lütge, A. et al. CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data. Life Sci. Alliance 4, e202001004 (2021).
48. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
49. Zeng, D. et al. IOBR: multi-omics immuno-oncology biological research to decode tumor microenvironment and signatures. Front. Immunol. 12, 687975 (2021).
50. Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat. Methods 18, 1352–1362 (2021).
51. Moriel, N. et al. Novosparc: flexible spatial reconstruction of single-cell gene expression with optimal transport. Nat. Protoc. 16, 4177–4200 (2021).
52. Song, D., Wang, Q. & Li, J. J. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Zenodo https://doi.org/10.5281/zenodo.7110761 (2022).

Acknowledgements

We appreciate the comments and feedback from the members of the Junction of Statistics and Biology at UCLA (http://jsb.ucla.edu). This work was supported by the following grants: National Science Foundation grants no. DBI-1846216 and no. DMS-2113754, NIH/NIGMS grants no. R01GM120507 and no. R35GM140888, Johnson & Johnson WiSTEM2D Award, the Sloan Research Fellowship, the UCLA David Geffen School of Medicine W. M. Keck Foundation Junior Faculty Award and the Chan-Zuckerberg Initiative Single-Cell Biology Data Insights Grant (to J.J.L.). J.J.L. was a fellow at the Radcliffe Institute for Advanced Study at Harvard University in 2022–2023 while she was writing this paper.

Author information

Authors and Affiliations

Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
Dongyuan Song & Jingyi Jessica Li
Department of Statistics, University of California, Los Angeles, CA, USA
Qingyang Wang, Guanao Yan, Tianyang Liu, Tianyi Sun & Jingyi Jessica Li
Department of Human Genetics, University of California, Los Angeles, CA, USA
Jingyi Jessica Li
Department of Computational Medicine, University of California, Los Angeles, CA, USA
Jingyi Jessica Li
Department of Biostatistics, University of California, Los Angeles, CA, USA
Jingyi Jessica Li
Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, USA
Jingyi Jessica Li

Contributions

D.S. and J.J.L. conceived of the study. D.S., Q.W. and J.J.L. wrote the paper. D.S. and Q.W. developed the scDesign3 R package. D.S. and Q.W. performed data analysis with assistance from G.Y. and T.L. D.S. and T.S. discussed the scDesign3 method design at the beginning of the study.

Corresponding author

Correspondence to Jingyi Jessica Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Kin Fai Au and Jean Yang for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Benchmarking scDesign3 against four existing scRNA-seq simulators (scGAN, muscat, SPARSim, and ZINB-WaVE) for generating scRNA-seq data from a single trajectory (mouse pancreatic endocrinogenesis; dataset PANCREAS in Supplementary Table 2).

a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 and the four simulators. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3 and the four simulators. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3 and the four simulators. Colors label cells’ pseudotime values; note that only the synthetic data generated by scDesign3 contain the pseudotime truths. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3 and the four simulators.
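The KS distance reported above each violin plot can be illustrated with a minimal sketch. It assumes the standard two-sample definition (the largest gap between two empirical distribution functions); the per-gene values below are made up for illustration and are not taken from the PANCREAS dataset.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample KS distance: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical per-gene mean expression values (illustrative numbers only).
test_gene_means = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0]
synthetic_gene_means = [0.2, 0.4, 0.6, 1.0, 1.2, 2.1]
shifted_gene_means = [x + 5.0 for x in test_gene_means]

close_fit = ks_distance(test_gene_means, synthetic_gene_means)
poor_fit = ks_distance(test_gene_means, shifted_gene_means)
print(close_fit, poor_fit)
```

A distribution that tracks the test data yields a small distance, while one that does not overlap it at all yields the maximum value of 1.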

Extended Data Fig. 2 Benchmarking scDesign3 against four existing scRNA-seq simulators (scGAN, muscat, SPARSim, and ZINB-WaVE) for generating scRNA-seq data from bifurcating trajectories (myeloid progenitors in mouse bone marrow; dataset MARROW in Supplementary Table 2).

a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 and the four simulators. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3 and the four simulators. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3 and the four simulators. Colors label cells’ pseudotime values in two trajectories; note that only the synthetic data generated by scDesign3 contain the pseudotime truths. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3 and the four simulators.
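The similarity score in panel b compares two gene-gene correlation matrices with a single Pearson coefficient. A minimal sketch follows; comparing only the off-diagonal upper triangles is an assumption of this example (the caption does not say how the matrices are vectorized), and the count matrices are toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(cells_by_genes):
    """Gene-gene correlation matrix from a cells x genes table."""
    genes = list(zip(*cells_by_genes))  # one tuple of values per gene
    return [[pearson(g1, g2) for g2 in genes] for g1 in genes]


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def matrix_similarity(data_a, data_b):
    """Pearson r between the two datasets' gene-gene correlation structures."""
    return pearson(
        upper_triangle(correlation_matrix(data_a)),
        upper_triangle(correlation_matrix(data_b)),
    )


# Toy cells x genes matrices (5 cells, 4 genes); values are illustrative only.
test_cells = [[1, 2, 5, 0], [2, 3, 4, 1], [3, 5, 3, 0], [4, 6, 1, 2], [5, 9, 0, 1]]
synthetic_cells = [[1, 3, 4, 0], [2, 3, 5, 1], [3, 4, 2, 1], [4, 7, 1, 2], [5, 8, 0, 0]]

similarity = matrix_similarity(test_cells, synthetic_cells)
print(similarity)
```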

Extended Data Fig. 3 scDesign3 simulated realistic gene expression patterns in cancer spatial transcriptomics data (datasets OVARIAN and ACINAR in Supplementary Table 2).

Human ovarian cancer (a) and human prostate cancer, acinar cell carcinoma (b). The tissue samples were measured with both H&E (hematoxylin and eosin stain, left) and spatial transcriptomics (right, three cancer-related genes). Large Pearson correlation coefficients (r) represent similar spatial patterns in synthetic data and real (test) data.

Extended Data Fig. 4 scDesign3 simulated 10x Visium spatial transcriptomics data (sagittal mouse brain slices; dataset VISIUM in Supplementary Table 2).

a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 using cell type labels (scDesign3-ideal) and spatial locations (scDesign3-spatial), respectively. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the real data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. Cell types (clusters) are labeled by colors. Since the scDesign3-spatial dataset was based on spatial locations only, it did not contain cell types. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. In summary, scDesign3 realistically simulated 10x Visium data based on spatial locations without needing cell type annotations.

Extended Data Fig. 5 scDesign3 mimicked spatial transcriptomics data so that prediction algorithms had similar prediction performance when trained on real data or scDesign3 synthetic data.

In detail, we first split each of four spatial transcriptomics datasets (VISIUM, SLIDE, OVARIAN, and ACINAR in Supplementary Table 2) into two datasets (training and testing) by randomly splitting the spatial locations into two halves. Second, we used each of the four training datasets to fit scDesign3 and generate the corresponding synthetic dataset. Third, on each pair of training dataset and synthetic dataset (among a total of four pairs), we trained each of three prediction algorithms (gbm: gradient boosting machine; randomForest: random forest; svmRadial: support vector machine with the radial kernel) to predict each gene’s expression at a spatial location (input: spatial location; output: the gene’s log(count+1) expression level at the location), obtaining a pair of prediction models for each gene. Fourth, we applied each pair of prediction models to the corresponding testing dataset and calculated each model’s root-mean-squared error (RMSE) for predicting the corresponding gene, obtaining a pair of RMSEs. As a result, in each panel, we plotted the RMSEs for each prediction algorithm (row) and dataset (column), with each dot in the panel representing a gene. We found all genes’ RMSEs highly similar, indicating that scDesign3’s synthetic data well mimicked real data.
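The four-step comparison can be sketched for a single gene as follows. This is only an illustration under stated assumptions: a k-nearest-neighbour average stands in for the gbm, randomForest and svmRadial models named in the caption, the synthetic counts are hard-coded rather than generated by scDesign3, and all locations and counts are made up.

```python
from math import dist, log, sqrt


def knn_predict(train_xy, train_y, query_xy, k=3):
    """Stand-in predictor: average the k training locations nearest to the query."""
    order = sorted(range(len(train_xy)), key=lambda i: dist(train_xy[i], query_xy))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def rmse(predicted, observed):
    """Root-mean-squared error between predictions and observations."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def rmse_on_test(train_xy, train_counts, test_xy, test_counts, k=3):
    """Train on one dataset (location -> log(count+1)) and score on the test half."""
    train_y = [log(c + 1) for c in train_counts]
    test_y = [log(c + 1) for c in test_counts]
    predictions = [knn_predict(train_xy, train_y, q, k) for q in test_xy]
    return rmse(predictions, test_y)


# Hypothetical spots and counts for one gene (illustrative values only).
train_xy = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
real_counts = [0, 3, 4, 10, 5]        # training half of the real data
synthetic_counts = [1, 2, 5, 9, 4]    # stand-in for scDesign3 output at the same spots
test_xy = [(0, 1), (1, 0), (1, 2), (2, 1)]
test_counts = [1, 2, 6, 7]            # held-out half of the real data

rmse_real = rmse_on_test(train_xy, real_counts, test_xy, test_counts)
rmse_synthetic = rmse_on_test(train_xy, synthetic_counts, test_xy, test_counts)
print(rmse_real, rmse_synthetic)  # one (real, synthetic) pair, i.e. one dot in a panel
```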

Extended Data Fig. 6 The effect of K on scDesign3’s simulation of spatial transcriptomics data (dataset ACINAR in Supplementary Table 2).

The rows represent three cancer-related genes; column 1 represents real test data; columns 2–8 represent scDesign3’s synthetic data generated using varying K, the input basis number. A large Pearson correlation coefficient (r) represents similar spatial patterns in synthetic and test data. The effective degrees of freedom (edf) represents the wiggliness of the fitted surface. With a larger K, scDesign3 can fit more complex patterns. The overfitting issue is accounted for by the automatic smoothness estimation³⁹: when K is sufficiently large, edf (model complexity) and r (model goodness-of-fit) both become stable.

Extended Data Fig. 7 scDesign3 simulated spot-resolution spatial transcriptomics data for benchmarking cell-type deconvolution algorithms (datasets MOB-SP and MOB-SC in Supplementary Table 2).

a, scDesign3’s synthetic spot-resolution data well mimicked real data (top row), showing similar expression patterns for four cell-type marker genes (columns). scDesign3 used three steps to generate the spot-resolution data. Step 1: every gene’s estimated mean expression level at each spot (as a smooth function of spot location) by scDesign3. Step 2: every gene’s predicted expression level at each spot from CIBERSORT’s estimated cell-type proportions at the spot (considered as the ‘true proportions’) and the gene’s cell-type-specific expression levels (from the reference scRNA-seq data). Step 3: every gene’s simulated expression level at each spot by scDesign3 (from the true cell-type proportions at the spot and scDesign3’s synthetic scRNA-seq data). b, Using scDesign3 synthetic data, we benchmarked three spatial cell-type deconvolution algorithms (CARD²⁷, RCTD²⁸, and SPOTlight²⁹). For each of the four cell types (columns), we used two metrics, Pearson correlation (r) and root-mean-square error (RMSE), to compare the proportions estimated by each deconvolution algorithm (rows 2–4) to the true proportions (top row). Large r values represent similar spatial patterns of proportions, while small RMSE values represent similar values of proportions. Although all three algorithms well captured the spatial patterns of each cell type’s proportions (evidenced by large r values), CARD and RCTD outperformed SPOTlight by estimating cell-type proportions more accurately (evidenced by smaller RMSE values).
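Why the two metrics in panel b can disagree is easy to see in a small worked example. The proportions below are hypothetical and do not come from CARD, RCTD or SPOTlight: one estimate preserves the spatial pattern exactly but compresses its range, the other is slightly noisy but on the right scale.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rmse(estimated, truth):
    """Root-mean-square error between estimated and true proportions."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))


# Hypothetical proportions of one cell type at five spots.
true_props = [0.1, 0.3, 0.5, 0.7, 0.9]
compressed = [0.5 * p + 0.25 for p in true_props]  # right pattern, wrong scale
noisy = [0.12, 0.28, 0.53, 0.68, 0.91]             # right scale, small errors

for name, estimate in [("compressed", compressed), ("noisy", noisy)]:
    print(name, round(pearson(estimate, true_props), 3), round(rmse(estimate, true_props), 3))
```

The compressed estimate reaches r = 1 while having the larger RMSE, which is the situation the caption describes: r rewards matching the spatial pattern, whereas RMSE rewards matching the values themselves.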

Extended Data Fig. 8 scDesign3 simulated scATAC-seq data (human PBMCs; dataset ATAC in Supplementary Table 2).

a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 using cell type labels. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the peak-peak correlation matrices in the test data and the synthetic data generated by scDesign3. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3. Cell types are labeled by colors. An mLISI value close to 2 means that the synthetic data resemble the test data well in the low-dimensional space. d, UMAP visualization of the test data and the synthetic data generated by scDesign3.
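The statement that an mLISI close to 2 indicates good mixing can be illustrated with a simplified local inverse Simpson's index. The sketch below is an assumption-laden stand-in: it uses unweighted k nearest neighbours, whereas the published LISI uses perplexity-based neighbour weights, and it averages the per-cell values. The point coordinates are toy values.

```python
from math import dist


def simple_lisi(points, labels, k=4):
    """Mean inverse Simpson's index of labels among each point's k nearest neighbours."""
    scores = []
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(point, points[j]),
        )
        neighbours = others[:k]
        counts = {}
        for j in neighbours:
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        simpson = sum((c / len(neighbours)) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / len(scores)


# Well-mixed case: real and synthetic cells alternate along a line.
mixed_points = [(float(x), 0.0) for x in range(10)]
mixed_labels = ["real" if x % 2 == 0 else "synthetic" for x in range(10)]

# Separated case: the two datasets occupy distant regions.
separated_points = [(float(x), 0.0) for x in range(5)] + [(100.0 + x, 0.0) for x in range(5)]
separated_labels = ["real"] * 5 + ["synthetic"] * 5

mixed_score = simple_lisi(mixed_points, mixed_labels)
separated_score = simple_lisi(separated_points, separated_labels)
print(mixed_score, separated_score)
```

With two labels the index ranges from 1 (every neighbourhood contains a single dataset) to 2 (every neighbourhood is an even mix), which is why values near 2 are read as synthetic cells being interspersed with real ones.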

Extended Data Fig. 9 scDesign3 simulated CITE-seq data (human PBMCs; dataset CITE in Supplementary Table 2).

a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3. The CITE-seq dataset contains simultaneous measurements of each cell’s gene expression and surface protein abundance captured by Antibody-Derived Tags (ADTs). Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene and protein correlation matrices (10 proteins with names starting with ‘ADT’ and their corresponding genes) in the test data and the synthetic data generated by scDesign3. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. scDesign3 preserved the correlations between the RNA and protein expression levels of the 10 surface proteins. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3. Cell types are labeled by colors. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the test data and the synthetic data generated by scDesign3.

Extended Data Fig. 10 scDesign3 provides unsupervised measures of the goodness-of-fit of pseudotime, clusters, and inferred spatial locations.

For visual clarity, we plot the relative BIC or AIC (rBIC or rAIC) by re-scaling scDesign3’s marginal BIC or AIC to [0, 1]. a, The scDesign3 rBIC (unsupervised) is negatively correlated with the R² (supervised). Each R² was calculated between the set of perturbed or inferred pseudotimes and the set of true pseudotimes in each of the eight datasets (the column names). The P value is from the one-sided test of Spearman’s rank correlation ρ. The true pseudotime is the ground truth used for generating the synthetic data. b, Comparison of the scDesign3 rBIC and the Clustering Deviation Index (CDI) rBIC (rescaled to [0, 1])³³. The color scale shows the number of clusters, and the shapes represent clustering algorithms. We found the scDesign3 rBIC (unsupervised) negatively correlated with the ARI (supervised). The P value is from the one-sided test of Spearman’s rank correlation ρ. We also found the scDesign3 rBIC to perform better than or similarly to the CDI on six out of the eight datasets (the column names). c, The scDesign3 rAIC (unsupervised) is negatively correlated with the mean cosine similarity (supervised). The mean cosine similarity was calculated between the set of perturbed or inferred locations and the set of true locations in each of the two spatial datasets (the column names). The P value is from the one-sided test of Spearman’s rank correlation ρ. The true locations are the ground truth used for generating the semi-synthetic data. Due to the high complexity of spatial patterns, the scDesign3 rAIC (left) outperformed the scDesign3 rBIC (right) because it penalizes model complexity less.
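The rescaling and the rank correlation used in panel a can be sketched as follows. Min-max scaling is an assumption here (the caption only says the BIC is rescaled to [0, 1]), and the BIC and R² values are invented for illustration.

```python
from math import sqrt


def rescale_unit(values):
    """Min-max rescaling to [0, 1] (assumed form of the relative BIC/AIC)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical marginal BICs for five candidate pseudotimes, and each
# candidate's R^2 against the true pseudotime (illustrative values only).
bic = [10500.0, 10200.0, 9900.0, 9700.0, 9600.0]
r_squared = [0.20, 0.45, 0.70, 0.90, 0.99]

rbic = rescale_unit(bic)
rho = spearman(rbic, r_squared)
print(rbic, rho)
```

In this toy case the candidate with the lowest rBIC is also the one closest to the truth, so ρ is strongly negative, matching the direction of the relationship reported in the caption.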

Supplementary information

Supplementary Information

Supplementary Methods, Figs. 1–5 and Tables 1–5.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Song, D., Wang, Q., Yan, G. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 42, 247–252 (2024). https://doi.org/10.1038/s41587-023-01772-1

Received: 20 September 2022
Accepted: 30 March 2023
Published: 11 May 2023
Issue Date: February 2024
DOI: https://doi.org/10.1038/s41587-023-01772-1
