Abstract
Although genomic analyses predict many noncanonical open reading frames (ORFs) in the human genome, it is unclear whether they encode biologically active proteins. Here we experimentally interrogated 553 candidates selected from noncanonical ORF datasets. Of these, 57 induced viability defects when knocked out in human cancer cell lines. Following ectopic expression, 257 showed evidence of protein expression and 401 induced gene expression changes. Clustered regularly interspaced short palindromic repeat (CRISPR) tiling and start codon mutagenesis indicated that their biological effects required translation as opposed to RNA-mediated effects. We found that one of these ORFs, G029442—renamed glycine-rich extracellular protein-1 (GREP1)—encodes a secreted protein highly expressed in breast cancer, and its knockout in 263 cancer cell lines showed preferential essentiality in breast cancer-derived lines. The secretome of GREP1-expressing cells has an increased abundance of the oncogenic cytokine GDF15, and GDF15 supplementation mitigated the growth-inhibitory effect of GREP1 knockout. Our experiments suggest that noncanonical ORFs can express biologically active proteins that are potential therapeutic targets.
Data availability
Processed data for CRISPR screens (Figs. 3 and 4d) are available in Supplementary Tables 22 and 27. Raw data are available in the Source data files accompanying this manuscript, as well as through the NCBI Sequence Read Archive at: SRR13126801, SRR13128583, SRR13132373, SRR13142215 and SRR13142421. Mass spectrometry data relating to Fig. 1 are available in Supplementary Table 14. Raw MS spectra are available through the original datasets at: https://cptac-data-portal.georgetown.edu/study-summary/S060 (CPTAC2_BRCA_prosp), https://cptac-data-portal.georgetown.edu/study-summary/S045 (CPTAC2_COAD_prosp), https://cptac-data-portal.georgetown.edu/study-summary/S050 (CPTAC3_ccRCC), https://cptac-data-portal.georgetown.edu/study-summary/S056 (CPTAC3_LUAD), https://cptac-data-portal.georgetown.edu/study-summary/S051 (CPTAC3_PTRC_DP1), https://cptac-data-portal.georgetown.edu/study-summary/S053 (CPTAC3_UCEC), ftp://massive.ucsd.edu/MSV000080527 (HLA_Abelin), ftp://massive.ucsd.edu/MSV000084787 (HLA_Ouspenskaia), ftp://massive.ucsd.edu/MSV000084172/; ftp://massive.ucsd.edu/MSV000080527; ftp://massive.ucsd.edu/MSV000084442/ (HLA_Sarkizova), ftp://massive.ucsd.edu/MSV000082644 (CPTAC Medulloblastoma) and http://www.peptideatlas.org (PeptideAtlas database). L1000 data relating to Fig. 2 and Supplementary Figs. 8 and 9 are available through the NIH LINCS program and at https://clue.io/data. The website lincsproject.org provides information about the LINCS consortium, including data standards. Source data are provided with this paper.
Code availability
L1000 data analysis code and preprocessed data are available via GitHub: https://github.com/cmap/cmapM. Additional information about this database and its tools is available at http://clue.io/connectopedia. L1000 data were analyzed using the ‘tidyverse’ suite36 of R packages (v.1.2.1), the ‘cmapR’ package37 (v.1.0.1) in R v.3.5.0 (R Core Team 2018) and in-house code available through GitHub (https://github.com/johnprensner/smORF_analyses). Mass spectrometry peptides were processed via Spectrum Mill MS Proteomics Workbench v.6.0. Additional computational tools used in this study are available as follows: PhyloCSF (https://github.com/mlin/PhyloCSF/wiki) for 29-mammal alignment, Slncky (https://slncky.github.io), STARS v.1.3 (http://www.broadinstitute.org/rnai/public/software/index) and CERES v.1.0 (https://github.com/cancerdatasci/ceres).
References
Ewing, B. & Green, P. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25, 232–234 (2000).
Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human genome? Nat. Genet. 7, 345–346 (1994).
Liang, F. et al. Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25, 239–240 (2000).
Omenn, G. S. et al. Progress on identifying and characterizing the human proteome: 2018 metrics from the HUPO Human Proteome Project. J. Proteome Res. 17, 4031–4041 (2018).
Ingolia, N. T. et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 8, 1365–1379 (2014).
Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5′UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).
Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).
van Heesch, S. et al. The translational landscape of the human heart. Cell 178, 242–260 (2019).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Dinger, M. E., Pang, K. C., Mercer, T. R. & Mattick, J. S. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput. Biol. 4, e1000176 (2008).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Mudge, J. M. et al. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 29, 2073–2087 (2019).
Banfai, B. et al. Long noncoding RNAs are rarely translated in two human cell lines. Genome Res. 22, 1646–1657 (2012).
Jungreis, I. et al. Nearly all new protein-coding predictions in the CHESS database are not protein-coding. Preprint at bioRxiv https://doi.org/10.1101/360602 (2018).
Bazzini, A. A. et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 33, 981–993 (2014).
Branca, R. M. et al. HiRIEF LC–MS enables deep proteome coverage and unbiased proteogenomics. Nat. Methods 11, 59–62 (2014).
Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
Calviello, L. et al. Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 165–170 (2016).
Gao, X. et al. Quantitative profiling of initiating ribosomes in vivo. Nat. Methods 12, 147–153 (2015).
Gascoigne, D. K. et al. Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes. Bioinformatics 28, 3042–3050 (2012).
Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208 (2015).
Kim, M. S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
Koch, A. et al. A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites. Proteomics 14, 2688–2698 (2014).
Ma, J. et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J. Proteome Res. 13, 1757–1765 (2014).
Mackowiak, S. D. et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biol. 16, 179 (2015).
Ruiz-Orera, J., Messeguer, X., Subirana, J. A. & Alba, M. M. Long non-coding RNAs as a source of new peptides. eLife 3, e03523 (2014).
Schwaid, A. G. et al. Chemoproteomic discovery of cysteine-containing human short open reading frames. J. Am. Chem. Soc. 135, 16750–16753 (2013).
Slavoff, S. A. et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 9, 59–64 (2013).
Sun, H. et al. Integration of mass spectrometry and RNA-seq data to confirm human ab initio predicted genes and lncRNAs. Proteomics 14, 2760–2768 (2014).
Zhang, C. et al. Systematic analysis of missing proteins provides clues to help define all of the protein-coding genes on human chromosome 1. J. Proteome Res. 13, 114–125 (2014).
Vanderperre, B. et al. Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS ONE 8, e70698 (2013).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
Nassa, M. et al. Analysis of human collagen sequences. Bioinformation 8, 26–33 (2012).
Breit, S. N., Tsai, V. W. & Brown, D. A. Targeting obesity and cachexia: Identification of the GFRAL receptor-MIC-1/GDF15 pathway. Trends Mol. Med. 23, 1065–1067 (2017).
Mullican, S. E. & Rangwala, S. M. Uniting GDF15 and GFRAL: therapeutic opportunities in obesity and beyond. Trends Endocrinol. Metab. 29, 560–570 (2018).
Baroni, M. et al. Distinct response to GDF15 knockdown in pediatric and adult glioblastoma cell lines. J. Neurooncol. 139, 51–60 (2018).
Huang, C. Y. et al. Molecular alterations in prostate carcinomas that associate with in vivo exposure to chemotherapy: identification of a cytoprotective mechanism involving growth differentiation factor 15. Clin. Cancer Res. 13, 5825–5833 (2007).
Ratnam, N. M. et al. NF-kappaB regulates GDF-15 to suppress macrophage surveillance during early tumor development. J. Clin. Invest. 127, 3796–3809 (2017).
Corre, J. et al. Bioactivity and prognostic significance of growth differentiation factor GDF15 secreted by bone marrow mesenchymal stem cells in multiple myeloma. Cancer Res. 72, 1395–1406 (2012).
Peake, B. F., Eze, S. M., Yang, L., Castellino, R. C. & Nahta, R. Growth differentiation factor 15 mediates epithelial mesenchymal transition and invasion of breast cancers through IGF-1R-FoxM1 signaling. Oncotarget 8, 94393–94406 (2017).
Martinez, T. F. et al. Accurate annotation of human protein-coding small open reading frames. Nat. Chem. Biol. 16, 458–468 (2020).
Chen, J. et al. Pervasive functional translation of noncanonical human open reading frames. Science 367, 1140–1146 (2020).
Xie, W. et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134–1148 (2013).
Chen, J. et al. Evolutionary analysis across mammals reveals distinct classes of long non-coding RNAs. Genome Biol. 17, 19 (2016).
Liu, S. J. et al. CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells. Science 355, aah7111 (2017).
Petersen, T. N., Brunak, S., von Heijne, G. & Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785–786 (2011).
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).
Domazet-Loso, T., Brajkovic, J. & Tautz, D. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23, 533–539 (2007).
Domazet-Loso, T. et al. No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Mol. Biol. Evol. 34, 843–856 (2017).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: a resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819 (2017).
Yang, X. et al. A public genome-scale lentiviral expression library of human ORFs. Nat. Methods 8, 659–661 (2011).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Ross, Z., Wickham, H. & Robinson, D. Declutter your R workflow with tidy tools. Preprint at PeerJ https://peerj.com/preprints/3180.pdf (2017).
Enache, O. M. et al. The GCTx format and cmap{Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices. Bioinformatics 35, 1427–1429 (2019).
Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184–191 (2016).
Piccioni, F., Younger, S. T. & Root, D. E. Pooled lentiviral-delivery genetic screens. Curr. Protoc. Mol. Biol. 121, 32.1.1–32.1.21 (2018).
Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).
Hart, T., Brown, K. R., Sircoulomb, F., Rottapel, R. & Moffat, J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 10, 733 (2014).
Bae, S., Park, J. & Kim, J. S. Cas-OFFinder: a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics 30, 1473–1475 (2014).
Yu, C. et al. High-throughput identification of genotype-specific cancer vulnerabilities in mixtures of barcoded tumor cell lines. Nat. Biotechnol. 34, 419–423 (2016).
Pinello, L. et al. Analyzing CRISPR genome-editing experiments with CRISPResso. Nat. Biotechnol. 34, 695–697 (2016).
Niknafs, Y. S. et al. MiPanda: a resource for analyzing and visualizing next-generation sequencing transcriptomics data. Neoplasia 20, 1144–1149 (2018).
Shevchenko, A., Wilm, M., Vorm, O. & Mann, M. Mass spectrometric sequencing of proteins from silver-stained polyacrylamide gels. Anal. Chem. 68, 850–858 (1996).
Peng, J. & Gygi, S. P. Proteomics: the move to mixtures. J. Mass Spectrom. 36, 1083–1091 (2001).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
Beausoleil, S. A., Villen, J., Gerber, S. A., Rush, J. & Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 24, 1285–1292 (2006).
Jones, D. T. & Cozzetto, D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31, 857–863 (2015).
Acknowledgements
We thank D. Bondeson, P. Tsvetkov, S. Corsello, U. Ben-David and T. Ouspenskaia for helpful discussions and critical reading of the manuscript. We thank M. Zhong for technical assistance with cloning and Z. Demere for assistance with CRISPR sequencing. We thank D. Nusinow and S. Gygi for insights into identification of small peptides in proteomics datasets. We thank R. Tomaino for assistance with mass spectrometry at the Taplin Biological Mass Spectrometry Facility at Harvard Medical School. We thank J. Chen for assistance with the Slncky algorithm. We thank J. Gould for assistance with gene datasets. We thank I. Cheeseman for provision of DOX-inducible HeLa Cas9 cells. J.R.P. was supported by the Harvard K-12 in Central Nervous System tumors (grant 5K12 CA 90354-18). V.L. and M.W.K. were supported by the National Institutes of Health (grants R01 HD073104 and R01 HD091846 to M.W.K.).
Author information
Contributions
J.R.P. and T.R.G. conceived the project, designed experimental approaches, supervised the study and analyzed data. J.R.P. selected ORFs for screening and developed ORF prioritization methods. J.R.P. and X.Y. designed and generated the ORF cDNA library. J.R.P. performed ORF library screening, in vitro CRISPR experiments, siRNA experiments, immunoblots, cell culture assays and all GREP1 functional experiments. B.F. executed the arrayed ORF screen for L1000. O.M.E. and N.J.L. performed gene expression profiling and analyzed L1000 gene expression data. Z.J. contributed ORF predictions and assisted in analysis of ORF candidates. V.L., A.K., M.K. and J.R.P. performed protein evolutionary analyses and analyzed phylostratigraphy data. K.K., K.R.C. and J.D.J. performed proteomic identification of ORFs from datasets. J.R.P., F.P. and D.E.R. designed and analyzed CRISPR screens. T.G., D.A. and A.B. assisted with sgRNA design. A.G. and Z.K. performed cell line CRISPR screens. L.W., K.S., G.B. and J.A.R. performed pooled CRISPR screening. V.M.W. and J.M.D. analyzed pooled CRISPR screen data. J.M.D. performed comparative analyses of ORF CRISPR data with publicly available CRISPR screens. J.R.P. and T.R.G. wrote the manuscript draft and all authors contributed to editing it.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Generation and validation of a non-canonical ORF cDNA library.
a, Vector design and sequence details for the ORF library. The vector used is a modified version of the plx307 vector developed by the Genomic Perturbation Platform at the Broad Institute. b, Titration analyses of in-cell western experiments. Three ORFs were chosen: eGFP (positive control), LINC00116 (high-expressing ORF), and RP11-539I5 (low-expressing ORF). Increasing amounts of plasmid were transfected into increasing numbers of HEK293T cells as shown. c, Quantification of the in-cell western titration shown in b, demonstrating signal detection over noise and signal plateau. Signal was quantified using pixel density in the 800 nm green color channel. d, Replicate experiments assessing signal-to-noise thresholds for a low-expressing ORF transfected into HEK293T cells with a low DNA plasmid concentration, as well as a high-expressing ORF (eGFP) transfected into HEK293T cells at a high DNA plasmid concentration. e, Example in-cell western data in triplicate experiments for selected ORFs. f, Abrogation of protein translation via mutation of the ORF for selected examples. g, A systematic evaluation of in-cell western signal for wild-type and mutant ORFs for all pairs. ORFs are separated into those with signal above the baseline threshold and those without reproducible signal. h, An immunoblot showing in vitro transcription/translation of selected tag-free ORFs using a wheat germ lysate system. Red arrows indicate the translated ORFs. Results were repeated in two independent experiments.
Extended Data Fig. 2 Analysis of paired wild-type and mutant constructs in L1000 data.
a, A strategy for ORF mutagenesis in which the start codon and downstream methionines were mutated to alanine. The amino acid sequence shown is fictional. b, A pie chart showing the number and percentage of amino acids changed per ORF from the mutagenesis. c, A violin plot showing the number of Perturbational Class (PCL) connections made at the 98th percentile for matched mutant and wild-type constructs (n = 47 for each, all data points are biologically independent experiments). P value by a two-tailed Wilcoxon matched-pairs rank test. d, Left, the overall distribution of PCL connections across all ranks in wild-type and mutant constructs (n = 19,012 independent comparisons for each). Right, an inset showing the distribution of PCL connections at high connectivity, with a bias toward connections made with wild-type compared to mutant constructs (n = 1,920 independent comparisons each). P value by a two-tailed Wilcoxon matched-pairs rank test. e, All PCL connections in wild-type constructs at either the ≥95th percentile or the ≤−95th percentile, with the matched percentile connectivity in the mutant constructs. f, The distribution of percentile connectivity results in wild-type or mutant constructs for the indicated genes. In brief, all ORF L1000 signatures were queried against all PCL classes and a percentile connectivity was generated for each individual cell line and for both wild-type and mutant constructs. Cell line and construct data were then aggregated and ranked from highest to lowest connectivity. The rank positions of wild-type and mutant ORFs were then plotted to reveal a depletion of mutant constructs at high connectivity scores. g, Two example heatmaps for the TINCR and SLC35A4 uORF plasmids showing clustering of PCL connectivity among wild-type constructs that is not shared with mutant constructs. Purple bars denote wild-type ORF experiments and green bars denote mutant ORF experiments. h, L1000 signature replicate reproducibility for all wild-type and mutant pairs across all cell lines. All ORF signatures with at least one reproducible wild-type signature are shown.
Extended Data Fig. 3 Validation of CRISPR hits via manual assays.
a–i, CRISPR assays using doxycycline-inducible Cas9 in HeLa cells. Targets are divided into those that validated and those that did not. For each experiment, the right-set panel is qPCR data of expression 96 hours after induction of Cas9 with doxycycline. a) ZBTB11-AS1 b) HP08474 c) GREP1 d) RP11-54A9.1 e) G083755 f) OLMALINC g) CTD-2270L9.4 h) RP11-277L2.3 i) ASNSD1 uORF. j–k, CRISPR assays using stably-expressing A375 Cas9 cells. j) CTD-2270L9.4 k) ASNSD1 uORF. For all data in this figure, n = 6 technical replicates for each data point. Error bars represent standard deviation. Data were also acquired as three independent biological replicates based on doxycycline dose level (0.2 µg/mL, 1.0 µg/mL and 2.0 µg/mL doxycycline, as well as 0 µg/mL doxycycline). The data shown are for the 1.0 µg/mL dosing level, with similar results observed for the 0.2 µg/mL and 2.0 µg/mL doxycycline dosing levels.
Extended Data Fig. 4 Tiling CRISPR assays to elucidate functional non-canonical ORFs.
a, A heatmap showing log fold change viability loss at Day +21 in the secondary CRISPR screen for the indicated non-canonical ORFs tested by multiple tiling sgRNA regions. b–e, Examples of non-canonical ORFs with a CRISPR tiling phenotype. Graphical representation of tiling CRISPR assays in which each dot represents an individual sgRNA. sgRNAs are mapped to their genomic loci and the genomic region of the tiling assay is shown. The location of the putative non-canonical ORF is shown in the gene annotation above. b) CTD-2270L9.4 c) OLMALINC d) RP11-54A9.1 e) RPP14 dORF / HTD2. f–k, Representative sgRNA log fold change data for the indicated transcripts. Each tiling experiment is classified as indicated. f) LINC00662 g) RP11-195B21.3 h) LYRM4-AS1 i) ESRG j) TCONS_I2_00007040 k) LINC01184.
Extended Data Fig. 5 Specific siRNA knockdown of ZBTB11-AS1 mRNA transcript causes a viability phenotype which is specifically rescued by the wild type ZBTB11-AS1 ORF.
a, A schematic showing the genomic location and sequences for the two siRNAs used for ZBTB11-AS1. b, mRNA expression levels for ZBTB11-AS1 or ZBTB11 transcripts 48 hours after siRNA knockdown of ZBTB11-AS1 in A549 cells. N = 3 independent replicates for all conditions. Barplots represent mean ± standard deviation. c, Relative cell viability of A549 cells treated with ZBTB11-AS1 siRNAs at 72 hours. Parental A549 cells were used along with A549 cells expressing cDNAs for GFP, wild-type ZBTB11-AS1 ORF sequence, or mutant ZBTB11-AS1 ORF lacking translational start sites. Only the wild-type ZBTB11-AS1 ORF sequence rescues the viability phenotype. N = 6 independent replicates for all conditions. Barplots represent mean ± standard deviation. d, DNA and amino acid sequences of the wild-type and mutant ZBTB11-AS1 ORF cDNAs. *P < 0.05, **P < 0.01. n.s., non-significant. For P values: Parental, non-targeting vs siRNA #1 P < 0.0001, non-targeting vs siRNA #2 P < 0.0001; GFP, non-targeting vs siRNA #1 P = 0.0008, non-targeting vs siRNA #2 P < 0.0001; WT ORF, non-targeting vs siRNA #1 P = 0.04, non-targeting vs siRNA #2 P = 0.83; MUT ORF, non-targeting vs siRNA #1 P = 0.001, non-targeting vs siRNA #2 P = 0.02. P values by a two-tailed Student’s t-test.
Extended Data Fig. 6 The GREP1 locus and expression.
a, A schematic representation of the GREP1 gene structure and the annotation of this locus in the indicated databases. The year of release for each database is indicated. b, mRNA expression level of GREP1 across tumor lineages in the Cancer Cell Line Encyclopedia. The Y axis is on a log10 scale. c, mRNA expression of GREP1 across tumor types using TCGA and GTEx data. A two-tailed Student’s t-test was used to calculate the significance of the change between normal and cancer tissues. Cell lineages are grouped according to whether GREP1 expression is specifically modulated in cancer, universally expressed as a lineage gene, or not robustly expressed in the indicated lineage.
Extended Data Fig. 7 GREP1 is implicated in cell proliferation and breast cancer patient outcomes.
a, Cell viability curves following GREP1 knockout in three sensitive and three insensitive cell lines. GREP1 expression in the Cancer Cell Line Encyclopedia is indicated in transcripts per million (TPM). b, A scatter plot showing lineage-specific correlation between cell viability and GREP1 mRNA expression on the X axis with the average GREP1 expression level on the Y axis. c, Overall survival for breast cancer patients in the TCGA database stratified by GREP1 expression. N = 1,036 individual patients. N = 969 GREP1-low and N = 67 GREP1-high patients. Significance by a one-sided log-rank P value. d, Overall survival for colon cancer patients in the TCGA database stratified by GREP1 expression. N = 296 individual patients. N = 38 GREP1-high and N = 258 GREP1-low patients. Significance by a one-sided log-rank P value. e, Immunoblot of V5-tagged GREP1 or GFP in HEK293T cells in both whole cell lysate and conditioned media. A mutant GREP1, in which translational start sites were mutated to alanine, lacks protein translation initiation ability. Results were repeated in three independent experiments. i, Abundance of mass spectrometry peptides detected in the full-length GREP1 or cleavage product GREP1 proteins. Peptide abundance is represented as a fraction of total peptides detected. All error bars represent standard deviation.
Extended Data Fig. 8 GREP1 is associated with the extracellular matrix.
a, Total fraction of amino acid usage in the ORFeome, GENBANK, GREP1, and the Collagen alpha-1 family. Sequence similarities between GREP1 and the collagen family are indicated. b, Predicted disorder score for the GREP1 amino acid sequence. c, Amino acid conservation for detected homologs of GREP1 in the indicated species. d, Non-denaturing native western blot of GREP1 in conditioned media from HEK293T cells expressing V5-tagged GREP1. e, Representative Coomassie-stained gels for immunoprecipitation of GREP1 from the conditioned media of HEK293T cells. Two representative biological replicates are shown. f, Enrichment of extracellular matrix proteins in the IP-MS data for GREP1 compared to IP-MS data for GFP. g, Gene Ontology Cellular Component analysis of proteins ≥2-fold enriched in GREP1 immunoprecipitations compared to GFP immunoprecipitations. h, IP-MS total peptide count for fibronectin shown for three separate experiments. i, Coomassie stain of V5 immunoprecipitation of V5-tagged GFP, GREP1 del_SLS or GREP1 constructs expressed in CAMA-1 cells following fractionation of cell lysate into cytoplasmic, membrane and cell media components. Results were repeated in two independent experiments. j, Western blot of endogenous fibronectin, E-cadherin, beta-actin and GAPDH in cell lysate or cell culture media for CAMA-1 cells expressing GFP, GREP1 del_SLS or GREP1 constructs as in panel i. Results were repeated in two independent experiments. k, IP mass spectrometry data showing the total peptide count for GREP1 and other top-scoring proteins following IP of V5-tagged GREP1 in HEK293T, ZR-75-1, and CAMA-1 cells. N = 4 independent IP-MS experiments. Lines represent median ± interquartile (25–75%) range.
Extended Data Fig. 9 GREP1 regulates GDF15 in vitro and correlates with GDF15 expression in patient tumor tissues.
a, Cytokine profiling in HEK293T cells with transient ectopic GREP1 or GFP overexpression, ZR-75-1 cells with stable GREP1 knockout, or HDQP1 cells with stable GREP1 knockout. The change in signal abundance was calculated for each control/GREP1 pair. To rank cytokines, the average of the absolute values for the individual signal changes was plotted. b, GDF15 abundance by ELISA in ZR-75-1 and CAMA-1 cells overexpressing a GREP1 or GFP cDNA plasmid. N = 3 technical replicates. N = 2 independent experiments performed, with representative results shown. c, Spearman’s rho for the correlation of GREP1 expression with GDF15, EMILIN2, or FN1 in the indicated TCGA datasets. d, P value of the Spearman correlation of GREP1 expression with GDF15, EMILIN2, or FN1 in the indicated TCGA datasets. e–g, Recombinant GDF15 partially rescues GREP1 knockout. CAMA-1, ZR-75-1 or T47D Cas9 cells were infected with the indicated sgRNAs. At 24 hours after infection, cells were treated with vehicle control or increasing concentrations of recombinant human GDF15 as shown. Relative abundance was measured 7 days after infection. N = 5 for all conditions in panel e. N = 6 for all conditions in panel f. N = 5 for all conditions in panel g. All error bars represent standard deviation. Two independent experiments were performed for panels e–g.
Supplementary information
Supplementary Information
Supplementary Figures 1–14, Supplementary Discussion and Supplementary References.
Supplementary Tables 1–17
This file contains Supplementary Tables 1–17, including additional information and source data for Fig. 1.
Supplementary Tables 18–20
This file contains Supplementary Tables 18–20, including additional information and source data for Fig. 2.
Supplementary Tables 21–32
This file contains Supplementary Tables 21–32, including additional information and source data for Fig. 3.
Supplementary Tables 33–38
This file contains Supplementary Tables 33–38, including additional information and source data for Fig. 4.
Source data
Source Data Fig. 1
Unprocessed immunoblot images used in Fig. 1d.
Source Data Fig. 3
Unprocessed immunoblot images used in Fig. 3d.
Source Data Fig. 4
Unprocessed Coomassie image used in Fig. 4i.
Source Data Fig. 4
Unprocessed immunoblot images used to generate cytokine data in Fig. 4k.
Source Data Fig. 3
A table, including the unprocessed sequencing read counts for each gRNA at each time point used in the primary CRISPR screen shown in Fig. 3.
Source Data Fig. 3
A table, including the unprocessed sequencing read counts for each gRNA at each time point used in the secondary CRISPR screen shown in Fig. 3.
Source Data Extended Data Fig. 8
Unprocessed native immunoblot images used in Extended Data Fig. 8d.
Source Data Extended Data Fig. 8
Unprocessed immunoblot images used in Extended Data Fig. 8j.
About this article
Cite this article
Prensner, J.R., Enache, O.M., Luria, V. et al. Noncanonical open reading frames encode functional proteins essential for cancer cell survival. Nat Biotechnol 39, 697–704 (2021). https://doi.org/10.1038/s41587-020-00806-2
DOI: https://doi.org/10.1038/s41587-020-00806-2