Nested Learning: A new ML paradigm for continual learning
- November 10, 2025

Machine learning (ML) has advanced dramatically over the past decade, driven largely by powerful neural network architectures and the algorithms used to train them. Yet despite the success of large language models (LLMs), a few fundamental obstacles remain. One of them is continual learning: the ability of a model to actively acquire new knowledge and skills over time without forgetting old ones. When it comes to continual learning and self-improvement, the human brain is the gold standard. It adapts through neuroplasticity, a remarkable capacity to change its structure in response to new experiences, memories, and learning. Without this ability, a person is limited to their immediate context, much like someone with anterograde amnesia.
Current LLMs face an analogous limitation: their knowledge is confined to either the immediate context of their input window or the static information acquired during pre-training. The straightforward strategy of continually updating a model's parameters with new data frequently leads to "catastrophic forgetting" (CF), in which mastery of old tasks is sacrificed to learn new ones. Researchers have traditionally combated CF through architectural modifications or better optimization rules. However, for far too long we have treated the model's architecture (the network structure) and the optimization algorithm (the training rule) as two separate things, which prevents us from building a truly unified, effective learning system.

In our paper, "Nested Learning: The Illusion of Deep Learning Architectures", published at NeurIPS 2025, we introduce Nested Learning, which bridges this gap. Nested Learning treats a single ML model not as one continuous process, but as a system of interconnected, multi-level learning problems that are optimized simultaneously. We argue that the model's architecture and the rules used to train it (the optimization algorithm) are fundamentally the same concepts; they are simply different "levels" of optimization, each with its own update rate and internal flow of information (context flow). By recognizing this inherent structure, Nested Learning reveals a new, previously invisible dimension for designing more capable AI, allowing us to build learning components with greater computational depth, which ultimately helps address issues like catastrophic forgetting.
To test and validate Nested Learning, we built a proof-of-concept, self-modifying architecture we call "Hope", which outperforms current state-of-the-art models on language modeling and long-context memory management.
The Nested Learning paradigm
Nested Learning reveals that a complex ML model is really a set of coherent, interconnected optimization problems nested within each other or running in parallel. Each of these internal problems has its own context flow: the stream of information it is trying to learn from. From this perspective, existing deep learning methods essentially work by compressing their internal context flows. Nested Learning, in contrast, exposes a new dimension for model design, letting us build learning components with greater computational depth. To illustrate the paradigm, we turn to the concept of associative memory: the ability to map and recall one thing based on another (like recalling a name when you see a face).
We show that the training process itself, specifically backpropagation, can be modeled as an associative memory: the model learns to map each data point to the value of its local error, a measure of how "surprising" or unexpected that data point was. Similarly, following earlier studies (e.g., Miras), key architectural components such as the attention mechanism in transformers can also be formalized as simple associative memory modules that learn the mapping between tokens in a sequence.
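To make the associative memory view concrete, here is a minimal sketch, assuming an associative memory is just a small parametric map trained to recall values from keys by gradient descent on an L2 objective. The `AssociativeMemory` class and its `update` method are illustrative names of our own, not components from the paper.

```python
# A minimal associative memory: a parametric map trained to recall values
# from keys by taking gradient steps on an L2 objective. Illustrative only.
import torch
import torch.nn as nn


class AssociativeMemory(nn.Module):
    def __init__(self, key_dim: int, value_dim: int):
        super().__init__()
        # A single linear map is the simplest memory; a deeper map would give
        # the module more computational depth (and more capacity).
        self.map = nn.Linear(key_dim, value_dim, bias=False)

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        return self.map(keys)

    def update(self, keys: torch.Tensor, values: torch.Tensor, lr: float = 0.05) -> float:
        """One optimization step: compress the (key, value) pairs into the weights."""
        loss = ((self(keys) - values) ** 2).mean()  # L2 objective
        grads = torch.autograd.grad(loss, list(self.parameters()))
        with torch.no_grad():
            for p, g in zip(self.parameters(), grads):
                p -= lr * g
        return loss.item()


# Toy usage: the recall error shrinks as the pairs are compressed into the weights.
memory = AssociativeMemory(key_dim=16, value_dim=8)
keys, values = torch.randn(32, 16), torch.randn(32, 8)
for _ in range(200):
    err = memory.update(keys, values)
print(f"recall error after training: {err:.4f}")
```

In the paper's framing, attention layers, optimizer state, and even backpropagation itself can all be read as variations of this kind of module, each with its own context flow and update frequency.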
We can organize these interconnected optimization problems into "levels" by their update frequency rate, i.e., how often the weights of each component are adjusted. This ordered set of levels is the core of the Nested Learning paradigm.

Implementing Nested Learning

The Nested Learning perspective immediately gives us principled ways to improve existing algorithms and architectures:

Deep optimizers

Because Nested Learning views optimizers (e.g., momentum-based optimizers) as associative memory modules, we can apply principles from associative memory to their design. We observed that many standard optimizers rely on simple dot-product similarity (a measure of how alike two vectors are, computed by summing the products of their corresponding components), and the resulting update does not account for how different data samples relate to one another. By changing the optimizer's underlying objective to a more standard loss metric, such as L2 regression loss (a common regression loss that sums the squares of the differences between predicted and true values), we derive new formulations for core concepts like momentum, making them more resilient to imperfect data.
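As a toy illustration of this reframing (our own simplification, not the paper's derived optimizer), consider the momentum state as a tiny memory over the incoming gradient stream: classic heavy-ball momentum simply accumulates gradients, whereas taking one gradient step on an L2 regression objective for the memory state yields an EMA-style update that tracks the running average of the gradients. The function names and constants below are chosen purely for the example.

```python
# Toy comparison (illustrative, not the paper's derived optimizer): momentum
# viewed as a small memory over the gradient stream.
import numpy as np


def classic_momentum(m: float, grad: float, beta: float = 0.9) -> float:
    # Heavy-ball momentum: accumulate the raw gradient into the state.
    return beta * m + grad


def l2_memory_momentum(m: float, grad: float, eta: float = 0.1) -> float:
    # One gradient-descent step on the L2 objective 0.5 * (m - grad)**2,
    # i.e. the memory state is "trained" to regress the incoming gradient.
    return m - eta * (m - grad)


# Feed both variants the same noisy gradient stream.
rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=0.1, size=200)

m_classic = m_l2 = 0.0
for g in grads:
    m_classic = classic_momentum(m_classic, g)
    m_l2 = l2_memory_momentum(m_l2, g)

print(f"classic momentum state:   {m_classic:.2f}")  # ~= grad / (1 - beta) ~= 10
print(f"L2-memory momentum state: {m_l2:.2f}")       # tracks the mean gradient ~= 1
```

The point of the exercise is only that changing the memory's internal objective changes the update rule it induces; the paper derives richer formulations along these lines.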
Continuum memory systems
In a standard Transformer, the sequence model acts as short-term memory, while the feedforward neural networks act as long-term memory, storing the knowledge acquired during pre-training. The Nested Learning paradigm extends this idea into what we call a "continuum memory system" (CMS), in which memory is viewed as a spectrum of modules, each updating at its own specific frequency rate. For continual learning, this yields a much richer and more effective memory system.

Hope: A self-modifying architecture with continuum memory

As a proof of concept, we used Nested Learning principles to design Hope, a variant of the Titans architecture. Titans is an architecture whose long-term memory modules prioritize memories according to how surprising they are. Despite their powerful memory management, Titans have only two levels of parameter update, which results in first-order in-context learning. Hope, in contrast, is a self-modifying recurrent architecture that can take advantage of unbounded levels of in-context learning, and it is augmented with CMS blocks so that it can scale to larger context windows. Through a self-referential process, Hope can essentially optimize its own memory, yielding an architecture with, in effect, infinitely many nested learning levels.
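Here is a minimal sketch of the CMS idea, under our own assumptions rather than Hope's actual implementation: a chain of MLP memory blocks in which each block's parameters are only updated every few training steps, so fast blocks adapt to recent context while slow blocks retain longer-lived knowledge. The class name, update periods, and sizes are illustrative.

```python
# Illustrative continuum memory system: a chain of MLP blocks whose parameters
# update at different frequencies (period 1 = fastest, 16 = slowest).
# Names, periods, and sizes are our own choices, not Hope's configuration.
import torch
import torch.nn as nn


class ContinuumMemory(nn.Module):
    def __init__(self, dim: int, update_periods=(1, 4, 16)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in update_periods
        )
        self.update_periods = update_periods

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + block(x)  # residual chain through the memory spectrum
        return x

    def step(self, step_idx: int, loss: torch.Tensor, lr: float = 1e-3) -> None:
        """Apply a gradient update only to blocks whose period divides step_idx."""
        params = [p for block in self.blocks for p in block.parameters()]
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            i = 0
            for block, period in zip(self.blocks, self.update_periods):
                for p in block.parameters():
                    if step_idx % period == 0:
                        p -= lr * grads[i]
                    i += 1


# Toy usage: fit random targets; the slowest block changes only every 16 steps.
cms = ContinuumMemory(dim=32)
x, y = torch.randn(64, 32), torch.randn(64, 32)
for t in range(1, 65):
    loss = ((cms(x) - y) ** 2).mean()
    cms.step(t, loss)
print(f"loss after 64 steps: {((cms(x) - y) ** 2).mean().item():.4f}")
```

Hope combines this kind of multi-frequency memory with a self-modifying recurrence, so the update rules themselves can be learned; the sketch above only captures the frequency dimension.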
Experiments
We conducted experiments to evaluate the effectiveness of our deep optimizers and the performance of Hope on language modeling, long-context reasoning, continual learning, and knowledge incorporation tasks. The full results are available in our paper.
Results
Our experiments confirm the effectiveness of Nested Learning, the design of continuum memory systems, and self-modifying Titans. On a range of public language modeling and common-sense reasoning tasks, the Hope architecture achieves lower perplexity and higher accuracy than both modern recurrent models and standard transformers.