Abstract
Sequential and persistent activity models are two prominent models of short-term memory in neural circuits. In persistent activity models, memories are represented in persistent or nearly persistent activity patterns across a population of neurons, whereas in sequential models, memories are represented dynamically by a sequential activity pattern across the population. Experimental evidence for both models has been reported previously. However, it has been unclear under what conditions these two qualitatively different types of solutions emerge in neural circuits. Here, we address this question by training recurrent neural networks on several short-term memory tasks under a wide range of circuit and task manipulations. We show that both sequential and nearly persistent solutions are part of a spectrum that emerges naturally in trained networks under different conditions. Our results help to clarify some seemingly contradictory experimental results on the existence of sequential versus persistent activity-based short-term memory mechanisms in the brain.
Code availability
The code for reproducing the experiments and analyses reported in this article is available at https://github.com/eminorhan/recurrent-memory.
Data availability
The raw simulation data used for generating each figure are available upon request.
Change history
25 January 2019
In the version of this article initially published online, a word was misprinted in the abstract. Extra letters were removed from the word “Experimentalrep” to correct it to “Experimental”. The error has been corrected in the print, PDF and HTML versions of this article.
06 February 2019
The original and corrected figures are shown in the accompanying Publisher Correction.
References
Fuster, J. M. & Alexander, G. E. Neuron activity related to short-term memory. Science 173, 652–654 (1971).
Wang, X. J. Synaptic reverberation underlying mnemonic persistent activity. Trends Neurosci. 24, 455–463 (2001).
Goldman, M. S. Memory without feedback in a neural network. Neuron 61, 621–634 (2009).
Druckmann, S. & Chklovskii, D. B. Neural circuits underlying persistent representations despite time varying activity. Curr. Biol. 22, 2095–2103 (2012).
Murray, J. D. et al. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proc. Natl Acad. Sci. USA 114, 394–399 (2017).
Lundqvist, M., Herman, P. & Miller, E. K. Working memory: delay activity, yes! Persistent activity? Maybe not. J. Neurosci. 38, 7013–7019 (2018).
Constantinidis, C. et al. Persistent spiking activity underlies working memory. J. Neurosci. 38, 7020–7028 (2018).
Funahashi, S., Bruce, C. J. & Goldman-Rakic, P. S. Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiol. 61, 331–349 (1989).
Miller, E. K., Erickson, C. A. & Desimone, R. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neurosci. 16, 5154–5167 (1996).
Romo, R., Brody, C. D., Hernández, A. & Lemus, L. Neural correlates of parametric working memory in the prefrontal cortex. Nature 399, 470–473 (1999).
Goard, M. J., Pho, G. N., Woodson, J. & Sur, M. Distinct roles of visual, parietal, and frontal motor cortices in memory-guided sensorimotor decisions. eLife 5, e13764 (2016).
Guo, Z. V. et al. Maintenance of persistent activity in a frontal thalamocortical loop. Nature 545, 181–186 (2017).
Baeg, E. H. et al. Dynamics of population code for working memory in the prefrontal cortex. Neuron 40, 177–188 (2003).
Fujisawa, S., Amarasingham, A., Harrison, M. T. & Buzsáki, G. Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex. Nat. Neurosci. 11, 823–833 (2008).
MacDonald, C. J., Lepage, K. Q., Eden, U. T. & Eichenbaum, H. Hippocampal ‘time cells’ bridge the gap in memory for discontiguous events. Neuron 71, 737–749 (2011).
Harvey, C. D., Coen, P. & Tank, D. W. Choice-specific sequences in parietal cortex during a virtual-navigation decision task. Nature 484, 62–68 (2012).
Schmitt, L. I. et al. Thalamic amplification of cortical connectivity sustains attentional control. Nature 545, 219–223 (2017).
Scott, B. B. et al. Fronto-parietal cortical circuits encode accumulated evidence with a diversity of timescales. Neuron 95, 385–398 (2017).
Murray, J. D. et al. A hierarchy of intrinsic timescales across cortex. Nat. Neurosci. 17, 1661–1663 (2014).
Runyan, C. A., Piasini, E., Panzeri, S. & Harvey, C. D. Distinct timescales of population coding across cortex. Nature 548, 92–96 (2017).
Sussillo, D., Churchland, M. M., Kaufman, M. T. & Shenoy, K. V. A neural network that finds a naturalistic solution for the production of muscle activity. Nat. Neurosci. 18, 1025–1033 (2015).
Cueva, C. J. & Wei, X. X. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. Preprint at https://arxiv.org/abs/1803.07770 (2018).
Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429–433 (2018).
Wilken, P. & Ma, W. J. A detection theory account of change detection. J. Vis. 4, 1120–1135 (2004).
Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993).
Zucker, R. S. & Regehr, W. G. Short-term synaptic plasticity. Annu. Rev. Physiol. 64, 355–405 (2002).
Mongillo, G., Barak, O. & Tsodyks, M. Synaptic theory of working memory. Science 319, 1543–1546 (2008).
Rose, N. S. et al. Reactivation of latent working memories with transcranial magnetic stimulation. Science 354, 1136–1139 (2016).
Wolff, M. J., Jochim, J., Akyürek, E. G. & Stokes, M. G. Dynamic hidden states underlying working-memory-guided behavior. Nat. Neurosci. 20, 864–871 (2017).
Hinton, G. E. & Plaut, D. C. Using fast weights to deblur old memories. Proc. 9th Annual Conference of the Cognitive Science Society, 177–186 (Erlbaum, 1987).
Sompolinsky, H. & Kanter, I. Temporal association in asymmetric neural networks. Phys. Rev. Lett. 57, 2861–2864 (1986).
Fiete, I. R., Senn, W., Wang, C. Z. H. & Hahnloser, R. H. R. Spike-time-dependent plasticity and heterosynaptic competition organize networks to produce long scale-free sequences of neural activity. Neuron 65, 563–576 (2010).
Klampfl, S. & Maass, W. Emergence of dynamic memory traces in cortical microcircuit models through STDP. J. Neurosci. 33, 11515–11529 (2013).
Krumin, M., Lee, J. J., Harris, K. D. & Carandini, M. Decision and navigation in mouse parietal cortex. eLife 7, e42583 (2018).
Rajan, K., Harvey, C. D. & Tank, D. W. Recurrent network models of sequence generation and memory. Neuron 90, 128–142 (2016).
Orhan, A. E. & Ma, W. J. Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback. Nat. Commun. 8, 138 (2017).
Ganguli, S., Huh, D. & Sompolinsky, H. Memory traces in dynamical systems. Proc. Natl Acad. Sci. USA 105, 18970–18975 (2008).
Clevert, D. A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). Preprint at https://arxiv.org/abs/1511.07289 (2016).
Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc. 14th International Conference on Artificial Intelligence and Statistics (2011).
Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
Yamins, D. L. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).
Wang, J., Narain, D., Hosseini, E. A. & Jazayeri, M. Flexible timing by temporal scaling of cortical responses. Nat. Neurosci. 21, 102–110 (2018).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2015).
Keshvari, S., van den Berg, R. & Ma, W. J. No evidence for an item limit in change detection. PLoS Comput. Biol. 9, e1002927 (2013).
Acknowledgements
This work was supported by grant no. R01EY020958 from the National Eye Institute. We thank the staff at the High-Performance Computing Cluster at New York University, especially S. Wang, for their help with troubleshooting.
Author information
Authors and Affiliations
Contributions
A.E.O. conceived the study and developed the research plan with input from W.J.M. In several iterations, A.E.O. performed the experiments and the analyses. A.E.O. and W.J.M. then discussed the results, which helped refine the experiments and the analyses. A.E.O. wrote the initial draft of the paper. A.E.O. and W.J.M. reviewed and edited later iterations of the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Fig. 1 Initial, untrained network dynamics for different (λ0, σ0) values.
The heat maps show the normalized responses of the recurrent units to a unit pulse delivered at time t = 0 to all units. Here, λ0 takes 10 uniformly spaced values between 0.8 and 0.98 (columns) and σ0 takes 10 uniformly spaced values between 0 and 0.4025 (rows).
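The (λ0, σ0) parameterization suggests initializing the recurrent weight matrix with self-coupling λ0 on the diagonal and random Gaussian off-diagonal couplings of scale σ0. A minimal numpy sketch, assuming a 1/√N scaling for the off-diagonal weights (an illustrative reconstruction; the exact scheme used in the paper may differ):

```python
import numpy as np

def init_recurrent_weights(n_units, lam0, sig0, seed=0):
    """Initial recurrent weights: self-coupling lam0 on the diagonal,
    Gaussian off-diagonal entries of scale sig0 / sqrt(n_units).
    Illustrative only; the paper's exact scaling may differ."""
    rng = np.random.default_rng(seed)
    W = (sig0 / np.sqrt(n_units)) * rng.standard_normal((n_units, n_units))
    np.fill_diagonal(W, lam0)
    return W

# One of the (lam0, sig0) settings used in the supplementary figures
W = init_recurrent_weights(500, lam0=0.96, sig0=0.313)
```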
Supplementary Fig. 2 Normalized responses of the recurrent units in networks trained with strong initial network coupling and no regularization.
Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with λ0 = 0.96, σ0 = 0.313, ρ = 0. After training, all networks shown here achieved a test set performance within 25% of the optimal performance. In Supplementary Figs. 2–5, only the active recurrent units are shown.
Supplementary Fig. 3 Normalized responses of the recurrent units in networks trained with weak initial network coupling and no regularization.
Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with λ0 = 0.96, σ0 = 0.134, ρ = 0. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.
Supplementary Fig. 4 Normalized responses of the recurrent units in networks trained with strong initial network coupling and strong regularization.
Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with λ0 = 0.96, σ0 = 0.313, ρ = 10−3. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.
Supplementary Fig. 5 Normalized responses of the recurrent units in networks trained with weak initial network coupling and strong regularization.
Each plot corresponds to an example trial from one of the six basic tasks. The SIs of the trials are indicated at the top of the plots. Trials are ordered by increasing SI from left to right. All trials shown here are from networks trained with λ0 = 0.96, σ0 = 0.134, ρ = 10−3. After training, all networks shown here achieved a test set performance within 50% of the optimal performance.
Supplementary Fig. 6 Average normalized activity of recurrent units in an example network trained in the 2AFC task.
The network shown here was trained with λ0 = 0.96, σ0 = 0.313, ρ = 0. After training, the network achieved a test set performance within 0.1% of the optimal performance. As in ref. 16, we divided the recurrent units into left-preferring and right-preferring ones based on whether they responded more strongly during correct left choices or during correct right choices. The upper panel shows the average normalized responses of the left-preferring units in the correct left and correct right trials, respectively. Similarly, the lower panel shows the average normalized responses of the right-preferring units in the correct left and correct right trials. As reported in ref. 16, the trained network developed choice-specific sequences in the 2AFC task (cf. Fig. 2c in ref. 16). Only the most active 150 units from each group are shown in this figure; as always, the original network contained 500 recurrent units. This figure also demonstrates that the sequences are consistent from trial to trial, since the sequential activity pattern does not disappear when the responses are averaged over multiple trials.
Supplementary Fig. 7 A simplified model of recurrent dynamics.
A simplified model that only incorporated the ReLU nonlinearity and the mean recurrent connection weight profiles shown in the upper panel (with no fluctuations around the mean) qualitatively captured the difference between the emergent sequential vs. persistent activity patterns (lower panel, left and right plots respectively). The networks simulated here had 500 recurrent units (only the most active 50 units are shown in the lower panel). All recurrent units received a unit pulse input at t = 0. The self-recurrence term in the recurrent connectivity matrix (not shown in the upper panel for clarity) was set to 1 in both cases. In the sequential case, the off-diagonal band was set to 0.09 in the forward direction and 0.01 in the backward direction, that is, W_{i,i−1} = 0.09 and W_{i−1,i} = 0.01. The recurrent units did not have a bias term and they did not receive any direct inputs during the trial other than the unit pulse injected at the beginning of the trial.
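The sequential case of this simplified model can be reconstructed in a few lines of numpy from the parameters given in the caption (self-recurrence 1, forward coupling 0.09, backward coupling 0.01, ReLU, unit pulse at t = 0). This is an illustrative sketch, not the authors' code:

```python
import numpy as np

def simulate(W, n_steps=50):
    """Iterate r(t+1) = ReLU(W r(t)) from a unit pulse to all units
    (no bias, no external input after t = 0)."""
    r = np.ones(W.shape[0])  # unit pulse at t = 0
    rates = [r]
    for _ in range(n_steps):
        r = np.maximum(0.0, W @ r)  # ReLU nonlinearity
        rates.append(r)
    return np.array(rates)

n = 500
W_seq = np.eye(n)                 # self-recurrence set to 1
i = np.arange(1, n)
W_seq[i, i - 1] = 0.09            # forward coupling W_{i,i-1}
W_seq[i - 1, i] = 0.01            # backward coupling W_{i-1,i}
rates = simulate(W_seq)           # (n_steps + 1, n) array of responses
```

Plotting each unit's response normalized by its own maximum, as in the figure, reveals the traveling sequential pattern.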
Supplementary Fig. 8 Results for the clipped ReLU networks.
The clipped ReLU nonlinearity is similar to ReLU except that it is bounded above by a maximum value: that is, f(x) = clip(x, r_min, r_max), where r_min = 0 and r_max = 100. a SI increased significantly with σ0. Linear regression slope: 0.55 ± 0.28, R² = 0.01 (two-sided Wald test, n = 280 experimental conditions, p = 0.049). In a–c, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. b SI decreased significantly with λ0. Linear regression slope: −3.87 ± 0.66, R² = 0.11 (two-sided Wald test, n = 280 experimental conditions, p < 0.001). Note that this result differs from the corresponding result in the case of ReLU networks, where λ0 did not have a significant effect on the SI (Fig. 2c). c SI decreased significantly with ρ. Linear regression slope: −418 ± 64, R² = 0.13 (two-sided Wald test, n = 280 experimental conditions, p < 0.001). d SI as a function of task. Overall, the ordering of the tasks by SI was similar to that obtained with the ReLU nonlinearity (Fig. 3a). However, note that training was substantially more difficult with the clipped ReLU nonlinearity than with the ReLU nonlinearity. Across all tasks and all conditions, ReLU networks had a training success (defined as reaching within 50% of the optimal performance) of ~60%, whereas the clipped ReLU networks had a training success of only ~9.3%. In particular, we were not able to successfully train any networks in the CD task and very few in the 2AFC task. As a consequence, some of the differences between the tasks ended up not being significant in the clipped ReLU case. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in d are reported in Supplementary Table 1. e, f Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.8 and in conditions where SI < 3, respectively. The weights were smaller in magnitude in f, because most of the low SI networks were trained under strong regularization. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.
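The clipped ReLU described in this caption maps directly onto numpy's clip; a minimal sketch:

```python
import numpy as np

def clipped_relu(x, r_min=0.0, r_max=100.0):
    """f(x) = clip(x, r_min, r_max): a ReLU bounded above by r_max."""
    return np.clip(x, r_min, r_max)

out = clipped_relu(np.array([-5.0, 3.0, 250.0]))  # -> [0., 3., 100.]
```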
Supplementary Fig. 9 Changing the amount of input noise.
In these simulations, we set ρ = 0 and varied the gain of the input population(s), g. g = 1 corresponds to the original case reported in the main text; lower and higher values of g correspond to higher and lower amounts of input noise, respectively. a Combined across all noise conditions, SI increased significantly with σ0. Linear regression slope: 0.76 ± 0.08, R² = 0.04 (two-sided Wald test, n = 2239 experimental conditions, p < 0.001). In a–c, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. b λ0 did not have a significant effect on SI (two-sided Wald test, n = 2239 experimental conditions, p = 0.958). c The input gain g slightly increased the SI. Linear regression slope: 0.04 ± 0.02, R² = 0.003 (two-sided Wald test, n = 2239 experimental conditions, p = 0.003). d Again, combined across all input noise levels, the ordering of the tasks by SI was similar to that obtained in the main set of experiments, where g = 1 (Fig. 3a). Error bars represent mean ± standard errors across different hyperparameter settings and noise levels. Exact sample sizes for the derived statistics shown in d are reported in Supplementary Table 1.
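The slopes, R² values, and two-sided Wald-test p-values reported in these captions are the standard outputs of an ordinary least-squares regression, e.g. from scipy.stats.linregress. The sketch below uses synthetic data with hypothetical values, not the paper's results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma0 = rng.uniform(0.0, 0.4, size=200)            # hypothetical predictor values
si = 0.76 * sigma0 + rng.normal(0.0, 0.2, size=200)  # hypothetical SI values

# linregress reports the slope, its standard error, r, and a
# two-sided Wald-test p-value for the null hypothesis slope = 0
res = stats.linregress(sigma0, si)
print(f"slope = {res.slope:.2f} ± {res.stderr:.2f}, "
      f"R² = {res.rvalue**2:.3f}, p = {res.pvalue:.3g}")
```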
Supplementary Fig. 10 Results for the lowest level of input noise (g = 2.5).
a SI increased significantly with σ0. Linear regression slope: 0.76 ± 0.18, R² = 0.05 (two-sided Wald test, n = 365 experimental conditions, p < 0.001). In a, b, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. b λ0 did not have a significant effect on SI (two-sided Wald test, n = 365 experimental conditions, p = 0.253). c The ordering of the tasks by SI was similar to that obtained in the main set of experiments. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in c are reported in Supplementary Table 1. d, e Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.9 and in conditions where SI < 2.8, respectively. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.
Supplementary Fig. 11 Results for the highest level of input noise (g = 0.5).
a SI increased significantly with σ0. Linear regression slope: 0.91 ± 0.21, R² = 0.05 (two-sided Wald test, n = 361 experimental conditions, p < 0.001). In a, b, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression. b λ0 did not have a significant effect on SI (two-sided Wald test, n = 361 experimental conditions, p = 0.457). c The ordering of the tasks by SI was similar to that obtained in the main set of experiments. Error bars represent mean ± standard errors across different hyperparameter settings. Exact sample sizes for the derived statistics shown in c are reported in Supplementary Table 1. d, e Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.6 and in conditions where SI < 2.3, respectively. Solid lines represent mean weights and shaded regions represent standard deviations of weights. Both means and standard deviations are averages over multiple networks.
Supplementary Fig. 12 Schur decomposition of trained and random connectivity matrices.
a Schur mode interaction matrices for the mean recurrent connectivity patterns shown in Fig. 6a–c. Only significant Schur modes with at least one interaction of magnitude greater than 0.04 with another Schur mode are shown here. b The corresponding significant Schur modes. Networks with more sequential activity (SI > 5) have more high-frequency Schur modes than networks with less sequential activity (SI < 2.5). The random networks are close to normal.
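The Schur analysis in this figure can be reproduced in outline with scipy.linalg.schur. In the real Schur form W = Q T Qᵀ, the strictly upper-triangular part of T contains the interactions between Schur modes; for a symmetric (hence normal, real-spectrum) matrix it is ~0, whereas a sequence-generating banded matrix is strongly non-normal. An illustrative sketch with toy matrices, not the trained networks:

```python
import numpy as np
from scipy.linalg import schur

n = 20
rng = np.random.default_rng(0)

# A normal matrix with real spectrum: symmetric random
W_sym = rng.standard_normal((n, n))
W_sym = (W_sym + W_sym.T) / 2

# A non-normal, sequence-generating matrix: identity plus an asymmetric band
W_seq = np.eye(n)
i = np.arange(1, n)
W_seq[i, i - 1] = 0.09   # forward coupling
W_seq[i - 1, i] = 0.01   # backward coupling

def schur_interactions(W):
    """Strictly upper-triangular part of the real Schur form T (W = Q T Q^T):
    the interaction terms between Schur modes."""
    T, Q = schur(W, output='real')
    return np.triu(T, k=1)

print(np.linalg.norm(schur_interactions(W_sym)))  # ~0: (nearly) normal
print(np.linalg.norm(schur_interactions(W_seq)))  # clearly nonzero: non-normal
```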
Supplementary Fig. 13 Results from networks explicitly trained to generate sequential activity as in ref. 35.
a, b are analogous to Fig. 6a, b and show the recurrent weight profiles obtained in trained networks with ReLU and tanh nonlinearities, respectively. c, d show example trials for the corresponding networks (trained with the same initial condition). Only networks with sequentiality index larger than 5.45 were included in the results shown here.
Supplementary Fig. 14 Circuit mechanism that generates sequential vs. persistent activity in networks with alternative activation functions.
This figure is analogous to Fig. 6a, b, but the results shown are for networks with the exponential linear unit (ELU) activation function (a) and networks with the softplus activation function (b). Note that the ELU activation function typically produced larger SIs than softplus, hence slightly different SI thresholds were used in the two cases to determine low and high SI networks.
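For reference, the two activation functions compared here have the standard definitions (with α = 1 for ELU, as in ref. 65):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * np.expm1(x))

def softplus(x):
    """Smooth approximation to ReLU: log(1 + exp(x))."""
    return np.log1p(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(elu(x), softplus(x))
```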
Supplementary information
Supplementary Text and Figures
Supplementary Figs. 1–14 and Supplementary Table 1
Rights and permissions
About this article
Cite this article
Orhan, A.E., Ma, W.J. A diverse range of factors affect the nature of neural representations underlying short-term memory. Nat Neurosci 22, 275–283 (2019). https://doi.org/10.1038/s41593-018-0314-y
This article is cited by
- Low-dimensional encoding of decisions in parietal cortex reflects long-term training history. Nature Communications (2023)
- Dynamical latent state computation in the male macaque posterior parietal cortex. Nature Communications (2023)
- The neuroconnectionist research programme. Nature Reviews Neuroscience (2023)
- Multiplexing working memory and time in the trajectories of neural networks. Nature Human Behaviour (2023)
- Spiking Recurrent Neural Networks Represent Task-Relevant Neural Sequences in Rule-Dependent Computation. Cognitive Computation (2023)