Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Analysis
  • Published:

Robust integration of multiple single-cell RNA sequencing datasets using a single reference space

Abstract

In many biological applications of single-cell RNA sequencing (scRNA-seq), an integrated analysis of data from multiple batches or studies is necessary. Current methods typically achieve integration using shared cell types or covariance correlation between datasets, which can distort biological signals. Here we introduce an algorithm that uses the gene eigenvectors from a reference dataset to establish a global frame for integration. Using simulated and real datasets, we demonstrate that this approach, called Reference Principal Component Integration (RPCI), consistently outperforms other methods by multiple metrics, with clear advantages in preserving genuine cross-sample gene expression differences in matching cell types, such as those present in cells at distinct developmental stages or in perturbated versus control studies. Moreover, RPCI maintains this robust performance when multiple datasets are integrated. Finally, we applied RPCI to scRNA-seq data for mouse gut endoderm development and revealed temporal emergence of genetic programs helping establish the anterior–posterior axis in visceral endoderm.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Schematic illustration of data integration.
Fig. 2: Performance of RPCI in integrating simulated heterogeneous datasets.
Fig. 3: RPCI integration of semi-simulated datasets with diverse cell types and multiple perturbations.
Fig. 4: Performance of RPCI in integrating WT and ERR KO snRNA-seq datasets.
Fig. 5: Maintenance of correct cell type relationship in endoderm developmental trajectory in RPCI integration.

Similar content being viewed by others

Data availability

All scRNA-seq datasets in this study were published previously, and their availabilities are described in Supplementary Table 2.

Code availability

The RISC is prepared as an R package and is available for free use via GitHub (https://github.com/bioinfoDZ/RISC). Codes for the analysis (and related source data) are provided in the Code Ocean (https://codeocean.com/capsule/9098032).

References

  1. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).

    Article  CAS  PubMed  Google Scholar 

  2. Nawy, T. Single-cell sequencing. Nat. Methods 11, 18 (2014).

    Article  CAS  PubMed  Google Scholar 

  3. Wang, Y. & Navin, N. E. Advances and applications of single-cell sequencing technologies. Mol. Cell 58, 598–609 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Fan, X. et al. Spatial transcriptomic survey of human embryonic cerebral cortex by single-cell RNA-seq analysis. Cell Res. 28, 730–745 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Wang, J. X. et al. Single-cell gene expression analysis reveals regulators of distinct cell subpopulations among developing human neurons. Genome Res. 27, 1783–1794 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Davie, K. et al. A single-cell transcriptome atlas of the aging Drosophila brain. Cell 174, 982–998 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Lin, Y. et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA 116, 9775–9784 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).

    Article  CAS  PubMed  Google Scholar 

  16. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J. P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  17. Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Wang, T. et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 20, 165 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  20. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Zhang, F., Wu, Y. & Tian, W. A novel approach to remove the batch effect of single-cell data. Cell Discov. 5, 46 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  22. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).

    PubMed  PubMed Central  Google Scholar 

  24. Jolliffe, I. T. Principal Component Analysis (Springer, 2011).

  25. Zhang, X., Xu, C. & Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 10, 2611 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  26. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  27. Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8, 289–317 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  28. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).

    Article  Google Scholar 

  29. Buttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

    Article  PubMed  CAS  Google Scholar 

  30. Villani, A. C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  31. Hu, P. et al. Single-nucleus transcriptomic survey of cell diversity and functional maturation in postnatal mammalian hearts. Genes Dev. 32, 1344–1357 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Liu, Y., Singh, V. K. & Zheng, D. Stereo3D: using stereo images to enrich 3D visualization. Bioinformatics 36, 4189–4190 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Nowotschin, S. et al. The emergent landscape of the mouse gut endoderm at single-cell resolution. Nature 569, 361–367 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Arnold, S. J. & Robertson, E. J. Making a commitment: cell lineage allocation and axis patterning in the early mouse embryo. Nat. Rev. Mol. Cell Biol. 10, 91–103 (2009).

    Article  CAS  PubMed  Google Scholar 

  35. Nowotschin, S., Hadjantonakis, A. K. & Campbell, K. The endoderm: a divergent cell lineage with many commonalities. Development 146, dev150920 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Stuckey, D. W., Di Gregorio, A., Clements, M. & Rodriguez, T. A. Correct patterning of the primitive streak requires the anterior visceral endoderm. PLoS ONE 6, e17620 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

    Article  CAS  PubMed  Google Scholar 

  38. Pepe-Mooney, B. J. et al. Single-cell analysis of the liver epithelium reveals dynamic heterogeneity and an essential role for YAP in homeostasis and regeneration. Cell Stem Cell 25, 23–38 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Hill, M. C. et al. A cellular atlas of Pitx2-dependent cardiac development. Development 146, dev180398 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Gordon, S. R. et al. PD-1 expression by tumour-associated macrophages inhibits phagocytosis and tumour immunity. Nature 545, 495–499 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).

    Article  CAS  PubMed  Google Scholar 

  42. Yost, K. E. et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Ding, J. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 38, 737–746 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Grun, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Segerstolpe, A. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Wang, Y. J. et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65, 3028–3038 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  50. Giraddi, R. R. et al. Single-cell transcriptomes distinguish stem cell state changes and lineage specification programs in early mammary gland development. Cell Rep. 24, 1653–1666 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Maaten, L. V. D. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).

    Google Scholar 

  52. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38 (2018).

    Article  CAS  Google Scholar 

  53. Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Kolde, R. pheatmap: Pretty Heatmaps https://rdrr.io/cran/pheatmap/ (2019).

  55. Zwiener, I., Frisch, B. & Binder, H. Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9, e85150 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  56. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  57. McCarthy, D. J., Campbell, K. R., Lun, A. T. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  58. Lun, A. T., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).

    Article  PubMed  CAS  Google Scholar 

  59. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. R Core Team. R: A Language and Environment for Statistical Computing https://www.R-project.org/ (2019).

  61. Koopmans, L. H., Owen, D. B. & Rosenblatt, J. I. Confidence intervals for the coefficient of variation for the normal and log normal distributions. Biometrika 51, 25–32 (1964).

    Article  Google Scholar 

  62. Ver Hoef, J. M. & Boveng, P. L. Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology 88, 2766–2772 (2007).

    Article  PubMed  Google Scholar 

  63. Gonzalez, I., Déjean, S., Martin, P. & Baccini, A. CCA: an R package to extend canonical correlation analysis. J. Stat. Softw. 23, 14 (2008).

    Article  Google Scholar 

  64. Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  65. Wooldridge, J.M. Introductory Econometrics: A Modern Approach (Cengage, 2018)

  66. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14 (1992).

    Article  Google Scholar 

  67. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster-analysis. J. Comput. Appl. Math. 20, 53–65 (1987).

    Article  Google Scholar 

  68. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. cluster: Cluster Analysis Basics and Extensions https://cran.r-project.org/package=cluster (2019).

  69. Venables, W.N., Ripley, B.D. & Venables, W.N. Modern Applied Statistics with S (Springer, 2002).

  70. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  71. Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  72. Nabavi, S., Schmolze, D., Maitituoheti, M., Malladi, S. & Beck, A. H. EMDomics: a robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics 32, 533–541 (2016).

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank all the research groups that generated and shared the scRNA-seq data used in this study. We thank the members of the Zheng lab for valuable discussions, software testing and comments on the manuscript. We also acknowledge funding support from the National Institutes of Health (grants HL133120 to D.Z. and B.Z., HL153920 to D.Z., HD092944 to D.Z. and B.Z., and HD070454 to D.Z.).

Author information

Authors and Affiliations

Authors

Contributions

Y.L. and D.Z. conceived the algorithm and analysis. Y.L. performed the analyses. T.W. and B.Z. contributed to the methods or discussions. D.Z. supervised the study. All authors wrote the manuscript.

Corresponding author

Correspondence to Deyou Zheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Methods and Supplementary Figs. 1–21.

Reporting Summary

Supplementary Tables 1 and 2

Supplementary Table 1. Metric scores from pairwise integration and full integration of the semi-simulated data. Supplementary Table 2. List of real scRNA-seq datasets.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, Y., Wang, T., Zhou, B. et al. Robust integration of multiple single-cell RNA sequencing datasets using a single reference space. Nat Biotechnol 39, 877–884 (2021). https://doi.org/10.1038/s41587-021-00859-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41587-021-00859-x

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing