Meta-Learning: The next scaling law
Where billions in compute will flow next: Two scaling laws built the current generation of AI. Each looked obvious only in hindsight. The next two are hiding in plain sight.
Two scaling laws built the current generation of AI. The first—data scaling—was dismissed as impossible until it wasn’t. The second—test-time compute—was seen as a cherry on top rather than a fundamental scaling law until IMO gold medals and superhuman coding proved otherwise. Each was “obvious in hindsight” but required massive infrastructure bets to realize.
The next two are hiding in plain sight.
The Two Scaling Laws That Changed Everything
Pre-2020: The Data/Model Scaling Breakthrough
In 2019, if you suggested that simply making models larger and training them on more unstructured data would lead to transformative capabilities, you’d have been met with skepticism. The conventional wisdom was that we’d hit diminishing returns. Bigger models were expensive, and surely architectural innovations mattered more than brute scale.
But there were hints. GPT-2 showed glimmers—better coherence, surprising few-shot capabilities. These were the early shoots, informal observations that more data seemed to help in ways that weren’t quite understood.
What it took to actually realize this scaling law was enormous: infrastructure to train models with billions of parameters, datasets scraped and curated at unprecedented scale, and most importantly, belief. The groups that made the infrastructure investments—despite the skepticism—unlocked transformative capabilities. GPT-3’s few-shot learning, emergent reasoning, and broad task generalization validated what had been a controversial bet.
Post-2020: The Test-Time Compute Revolution
By 2022, we were seeing another pattern. Chain-of-thought prompting worked surprisingly well. Scratchpads helped. Prompt engineering to guide model “thinking” showed promise. These were the early shoots of the second scaling law.
But it wasn’t until teams formalized reasoning, explicitly trained models to leverage thinking structure, and applied reinforcement learning to test-time computation that we saw the true power. OpenAI’s o1 model demonstrated the scaling law, suddenly acing high school mathematics and coding competitions and performing at remarkably high levels on standardized exams. This was the moment of recognition: test-time compute wasn’t just helpful, it was a scaling law.
The test-time compute scaling law had emerged: given more time to think, properly trained models could solve dramatically harder problems.
The Pattern in the Pattern
These two scaling laws share a crucial structure:
They started as informal observations (bigger datasets help; thinking through problems helps)
They required significant engineering and infrastructure investment to realize
They weren’t obvious until they were (skepticism preceded acceptance)
Each unlocked 1-2 orders of magnitude improvement in capabilities
Most importantly, they both exhibited the defining property of a scaling law: more compute, data, or time reliably translated to better performance in a predictable, monotonic way.
This gives us a methodology for identifying the next scaling laws: Where are today’s informal observations that show promise and look like emergent properties from previous scaling laws but haven’t been systematized? Which of those capabilities improve with more effort but aren’t yet predictable, monotonic scaling laws?
What Makes Something a Scaling Law?
A scaling law isn’t just “a promising research direction.” It has specific properties:
Monotonic improvement: Performance increases predictably with scale
Capital absorption: Can productively use more compute/data/time
Broad applicability: Works across a wide range of tasks
Predictable returns: You can forecast performance gains from additional resources
Data/Model scaling and test-time compute both exhibit these properties. The question is: what’s next?
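One way to make "monotonic improvement" and "predictable returns" concrete is to fit a curve to a few (resource, error) measurements and check how it extrapolates. The sketch below is illustrative only: the measurements are synthetic, and the power-law form error ≈ a · C^(−b) is an assumption borrowed from the data-scaling literature, not something established here for search or adaptation. The function names are hypothetical.

```python
import math

def fit_power_law(compute, error):
    """Least-squares fit of log(error) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # error ~= a * compute ** (-b)

def is_monotonic_improvement(error):
    """True if error strictly decreases as the resource grows."""
    return all(later < earlier for earlier, later in zip(error, error[1:]))

# Synthetic measurements (hypothetical): each 10x of compute multiplies error by 0.7.
compute = [1e18, 1e19, 1e20, 1e21]
error = [0.40, 0.28, 0.196, 0.1372]

a, b = fit_power_law(compute, error)
predicted = a * (1e22) ** (-b)  # forecast for one more order of magnitude
```

If a candidate method fails `is_monotonic_improvement`, or if forecasts like `predicted` miss badly when the next run comes in, it is a promising technique but not yet a scaling law in the sense used here.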
The Stakes
Billions of dollars in training compute are coming online in the next few years. The frontier labs are all making massive infrastructure bets. The question isn’t whether this compute will be deployed—it’s which methods can absorb this capital and translate it into capability gains.
We have two proven scaling laws. We need more to get us to the next level of AI capability.
Reading Today’s Tea Leaves
So where are the early shoots today? What’s working informally that we haven’t systematized? What shows promise but isn’t yet exhibiting the predictable, monotonic scaling we need?
I see two obvious candidates, both of which have moved beyond mere speculation into demonstrated effectiveness:
1. Parallel Search and Planning: We’ve seen impressive results from OpenAI’s systems and Gemini Deep Think in the recent IMO and ICPC performances. These systems use extensive parallel search and reasoning. The results are undeniable, yet we haven’t seen wide-scale adoption across model types and sizes. The question is how to convert this into a true scaling law: one that works consistently and predictably across model types and sizes, and that turns parallelized inference compute into steadily increasing performance.
2. Meta-Learning for Adaptation: Foundation models excel at in-context adaptation on arbitrary unstructured tasks. But when tasks require deep domain specialization, extensive proprietary codebases, or contexts too large to fit, in-context adaptation fails. These scenarios demand fine-tuning: actual parameter updates. Yet fine-tuning is fragile and often takes significant engineering expertise, from designing appropriate tasks and datasets to choosing the right parameterization and optimization methods, before it delivers the performance improvements. The vision: models that reliably adapt their own weights on unstructured data for arbitrary specialized tasks, obtaining monotonic improvements with increasing adaptation compute. This capability exists in fragments, but systematizing it into a scaling law would be transformative.
These candidates are almost too obvious to ignore, yet they remain systematically underexplored as potential scaling laws. Let’s examine each in detail.
I. Parallel Search and Planning
This one is past “early shoots” territory. We have substantial evidence that parallel search and reasoning provides massive performance gains:
DeepMind and OpenAI’s systems achieving IMO and ICPC gold medal performance
GPT-5-Pro, Deep Research and Gemini Deep Think showing strong capabilities on open-ended problems
The effectiveness is no longer in question. The problem is that this isn’t yet a scaling law—performance doesn’t seem to scale reliably and monotonically with increased search compute across model sizes and types. Current systems often plateau.
Turning parallel reasoning and search into a scaling law requires careful algorithmic work beyond the current reasoning paradigm. We need to move from models using external scaffolding for parallel search to models trained to perform parallel reasoning natively. This poses several scientific challenges:
The Scientific Challenges
Let me be specific about what makes this hard and what needs solving. These aren’t impossible problems, but they’re substantial research challenges:
1. Credit Assignment in Hierarchical Exploration
This is perhaps the most theoretically interesting challenge. Consider what happens during parallel search:
The ultimate reward signal comes from task completion (did you solve the problem?).
But effective search requires exploration—trying diverse approaches, some of which won’t immediately lead to solutions.
These two objectives are in direct conflict.
In low-dimensional control tasks, we handle this with simple exploration bonuses or entropy regularization. But those solutions are ill-suited to high-dimensional language modeling settings. The action space is discrete, enormous, and highly structured.
We need novel solutions for credit assignment that can:
Reward exploration even when individual branches fail
Identify promising partial solutions in parallel chains
Assign credit appropriately across hierarchical search structures
This is fundamentally harder than the credit assignment problem in current sequential reasoning models.
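To make the first requirement concrete, here is a minimal sketch of one possible per-branch training signal for a group of parallel branches on the same problem. It combines a group-mean baseline (in the style of group-relative policy-gradient methods) with a novelty bonus. The novelty term and its weight are purely hypothetical illustrations, not a known solution to the credit assignment problem described above.

```python
from collections import Counter

def branch_advantages(answers, rewards, novelty_weight=0.1):
    """Per-branch training signal for one problem's parallel branches.

    answers: final answer produced by each branch
    rewards: task reward for each branch (e.g. 1.0 if correct, else 0.0)
    """
    n = len(answers)
    baseline = sum(rewards) / n  # group-mean baseline
    counts = Counter(answers)
    advantages = []
    for answer, reward in zip(answers, rewards):
        # Hypothetical novelty bonus: rarer answers within the group score higher,
        # so a failed but distinct branch is penalized less than a failed duplicate.
        novelty = 1.0 - counts[answer] / n
        advantages.append(reward - baseline + novelty_weight * novelty)
    return advantages

answers = ["42", "42", "17", "9"]
rewards = [1.0, 1.0, 0.0, 0.0]
adv = branch_advantages(answers, rewards)

# With no successes at all, the distinct branch still ranks above the duplicates.
all_failed = branch_advantages(["1", "1", "2"], [0.0, 0.0, 0.0])
```

This only rewards diversity of final answers. It says nothing about promising partial solutions or hierarchical structure, which is where the harder open questions lie.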
2. Calibration Under Sampling
When you’re taking a large number of samples from a network for parallel search, calibration becomes absolutely critical. An uncalibrated network assigns poor probabilities to its outputs, which has two devastating effects:
At inference time: You can’t effectively select which parallel reasoning chains to pursue or trust
During training: Uncalibrated sampling creates high gradient variance, making RL training unstable
The literature already discusses how difficult it is to maintain calibration through post-training, especially with RL. When you add extensive parallel sampling on top, the problem compounds. We need mechanisms to ensure networks don’t drift too far from calibration during both training and inference. This isn’t just a nice-to-have—it’s fundamental to making parallel search work as a scaling law.
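Calibration drift can at least be measured. A standard diagnostic is expected calibration error (ECE): bin samples by stated confidence and compare each bin's average confidence with its observed accuracy. The sketch below assumes each sampled reasoning chain comes with a confidence score in [0, 1] (for example from the model's own probabilities or a verifier). Where that score comes from is an assumption, not something specified above.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Chains that claim 90% confidence but are right half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False])
# Chains that claim 50% confidence and are right half the time.
calibrated = expected_calibration_error(
    [0.5, 0.5, 0.5, 0.5], [True, False, True, False])
```

Tracking a metric like this across RL training steps is one way to detect the drift described above before it destabilizes branch selection.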
3. Computational Efficiency
Naively training with parallel search is computationally expensive—you’re generating multiple reasoning chains per problem during training. But there are promising directions:
Cheaper sampling techniques: Borrowing from speculative decoding approaches
Diffusion-based language models: Can potentially sample more efficiently
Sample reuse mechanisms: Clever ways to reuse computation across similar search branches
Learned pruning strategies: Train models to efficiently prune unpromising search branches early
The efficiency challenges are solvable, but they require focused engineering effort and architectural innovation.
4. Adaptive Reasoning Allocation
Here’s something we already observe with current reasoning models: different tasks need different types and amounts of reasoning.
With parallel search, this becomes even more pronounced:
Different search patterns: Math problems might need systematic case analysis, while creative writing needs exploratory divergent reasoning
Different search depths: Some problems need extensive exploration; others are solved with shallow search
Different pruning strategies: When to merge parallel chains, when to spawn new ones, when to commit to a solution
The model needs to learn meta-reasoning about its own search process—how much parallelism to deploy, what search strategy to use, when to stop searching.
This adaptive allocation is itself a learned capability that needs to scale with model and compute.
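As a baseline for "when to stop searching," a hand-written rule can already adapt the amount of parallelism to the problem: sample answers in batches and stop once one answer dominates. The sketch below is such a rule, with hypothetical names and thresholds. The argument above is that this policy should eventually be learned by the model rather than hard-coded like this.

```python
import itertools
from collections import Counter

def adaptive_vote(sample_answer, max_samples=32, batch_size=4, agreement=0.75):
    """Sample answers in batches; stop once the leading answer dominates."""
    answers = []
    while len(answers) < max_samples:
        answers.extend(sample_answer() for _ in range(batch_size))
        top_answer, top_count = Counter(answers).most_common(1)[0]
        if top_count / len(answers) >= agreement:
            break
    return top_answer, len(answers)

# An "easy" problem: every sample agrees, so search stops after one batch.
easy = adaptive_vote(lambda: "A")

# A "hard" problem: samples never agree, so the full budget is spent.
coin = itertools.cycle(["A", "B"])
hard = adaptive_vote(lambda: next(coin))
```

Here `sample_answer` stands in for one full reasoning chain. The easy case uses 4 samples and the hard case uses all 32, which is the kind of task-dependent allocation described above.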
The Multi-Agent Formulation
As problems scale in complexity, we’re already hitting computational and context length limits with single centralized agents. The natural solution: treat parallel search as a multi-agent problem—a centralized agent orchestrating multiple sub-agents to explore the solution space.
Anthropic’s recent blog post on multi-agent coding systems provides a concrete example of this pattern in action. The multi-agent formulation actually makes many of the challenges discussed here more apparent and tractable:
Credit assignment: Clear sub-goals for each agent
Gradient stability: Modular training signals
Orchestration logic: Explicit design space
Inference speed: Parallelizable across agents
The multi-agent learning literature provides useful precedents for these challenges, though adapting them to language model reasoning requires novel approaches.
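The orchestration pattern itself is simple to sketch: a central routine fans the problem out to sub-agents, collects their results, and selects among them. In the toy version below, `run_subagent` and `score` are hypothetical stand-ins for a model call and a verifier. Nothing here reflects any particular lab's system.

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(problem, strategies, run_subagent, score):
    """Fan a problem out to sub-agents, then keep the best-scoring result."""
    with ThreadPoolExecutor(max_workers=len(strategies)) as pool:
        futures = {name: pool.submit(run_subagent, problem, name)
                   for name in strategies}
        results = {name: future.result() for name, future in futures.items()}
    best = max(results, key=lambda name: score(results[name]))
    return best, results

def toy_subagent(problem, strategy):
    # Stand-in for a model call: each "strategy" proposes a candidate answer.
    candidates = {"double": problem * 2, "square": problem ** 2, "negate": -problem}
    return candidates[strategy]

target = 9
best, results = orchestrate(
    3, ["double", "square", "negate"], toy_subagent,
    score=lambda candidate: -abs(candidate - target))
```

Each sub-agent's result is scored separately, which is what makes per-agent credit assignment more explicit than in a single monolithic reasoning trace.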
Why Now? Why Solvable?
These are legitimately hard problems. But three factors make them tractable now:
Base model capabilities are ready: Current reasoning models already do sophisticated step-by-step thinking. The jump to parallel reasoning is difficult but not discontinuous.
Infrastructure exists: The RL training infrastructure built for sequential reasoning models provides a foundation. We need to extend it, not build from scratch.
Multi-agent learning literature: Decades of research on multi-agent systems, credit assignment in hierarchical tasks, and exploration vs. exploitation provides theoretical grounding.
The timing is right. The capabilities, infrastructure, and theoretical understanding have converged. When we train models explicitly for calibrated parallel reasoning, for efficient search strategies, for adaptive allocation—the relationship between search compute and performance becomes monotonic, transforming parallel search from a useful technique into a capital-absorbing scaling law.
The Payoff
Problems that benefit from exploring solution spaces could see dramatic improvements:
Scientific reasoning: Exploring hypotheses, designing experiments
Engineering design: Evaluating trade-offs across design choices
Long-form creative work: Exploring narrative branches, plot structures
Strategic planning: Reasoning through scenarios and contingencies
This scaling law could absorb significant compute while delivering performance gains across a broad range of tasks that require exploration and evaluation of possibilities.
II. Meta-Learning: The Deeper Scaling Law
Structured search and parallel reasoning represent a powerful extension of test-time compute. The beauty of this inference-time adaptation is its flexibility—models can tackle arbitrary, unstructured problems without any parameter updates. You can simply throw more compute at a problem and watch performance scale.
But this flexibility has a fundamental ceiling. Test-time compute, which relies on conditioning on context, cannot internalize the deep specialization required for domains with millions of lines of code, expertise built from decades of specialized literature, or stable adaptation across thousands of user interactions where context windows become impractical. For these scenarios, conditioning on context isn’t enough—the knowledge must be embedded into the model’s weights themselves.
This is the domain of fine-tuning. When done well, a fine-tuned model can dramatically outperform its base model on specialized tasks, often appearing a generation ahead on its target domain. The challenge, however, is captured in that critical phrase: “when done well.” Fine-tuning is currently fragile and often requires significant engineering expertise to succeed. Success depends on a host of factors, from designing appropriate datasets to choosing the right parameterization, avoiding catastrophic forgetting, and ensuring safe adaptation.
The root of this brittleness is that we currently treat fine-tunability as an emergent property or a side-effect of pre-training choices, rather than an explicit, trained capability. We rely on heuristics—overparameterization, architectural tricks, and specific regularizers—essentially hoping that these decisions yield a model with the ‘nice’ properties needed for controllable fine-tuning. As it turns out, this hope is often misplaced.
The principled solution is to stop treating adaptability as a happy accident and start training for it explicitly. This is the shift from fine-tuning as an art to meta-learning as a science: training models specifically to be easy to fine-tune.
A scaling law will emerge when we train models explicitly for this adaptability. This transforms fine-tuning from a fragile, labor-intensive art into a reliable, capital-absorbing scaling law. Performance will improve monotonically and predictably with the investment in adaptation: more adaptation compute, more adaptation steps, and larger specialization contexts will reliably translate to better task performance.
Why This Matters
The open-source competition: Currently, the primary moat for open-source models is fine-tunability. Companies choose open-source models over API-based ones precisely because they can fine-tune them effectively to their specific use cases. If frontier models became genuinely easy to fine-tune via API, this competitive dynamic shifts dramatically.
The data moat: Countless organizations have exclusive, valuable datasets but lack deep learning expertise. Easy fine-tunability without requiring specialists would unlock enormous value from proprietary data.
Safety: The current paradigm of fine-tuning is, from a safety perspective, a high-stakes gamble. We take a massively complex, poorly understood artifact (the base model) and perturb its weights, hoping that it aligns with our goals without producing unintended, harmful behaviors. When it fails, the failures are often unpredictable and catastrophic. Easy and controllable fine-tuning would give both base model providers and fine-tuning engineers more control over which behaviors a fine-tuned model exhibits.
The real recursive self-improvement: AGI folklore imagines recursive self-improvement as one model coding improvements to itself, becoming exponentially more capable. That’s a centralized, dramatic vision.
The more practical and immediate manifestation? Model specialization and adaptation. Models that self-adapt to:
Specific users (learning your communication style, preferences, work patterns)
Specific repositories (understanding your codebase’s idioms and architecture)
Specific use cases (adapting to your company’s domain and workflow)
Specific applications (tuning to your product’s unique requirements)
The model uses its plethora of interactions and collected data to continuously adapt to each specific instance. This opens use cases currently blocked by the unavailability of specialized human labor capable of performing those adaptations.
But this requires models that can adapt easily and safely. Which brings us to the principled solution.
From Emergent Capability to Principled Training: Meta-Learning
The solution isn’t to hope for better emergent fine-tunability. It’s to train models explicitly to be easy to fine-tune. That is the problem of meta-learning, i.e., learning to learn.
Meta-learning isn’t new. Researchers have explored it for years. But previous efforts were limited:
Designed for the small model regime: Techniques that worked for models with millions of parameters
Targeted narrow tasks: Learning a base model for few-shot learning, or learning optimizer parameters for a very narrow/simple distribution of tasks.
Well-specified task distributions: Controlled academic benchmarks, not messy real-world adaptation
What we need instead: Meta-learning as the problem of learning an “easy-to-fine-tune algorithmic infrastructure.” Models that can adapt to arbitrary, ill-specified tasks through actual parameter updates—not just in-context learning.
The New Formulation: Three Core Questions
Rather than speculate on specific solutions—which would be premature—let me lay out the design space and provide intuition for thinking about these problems.
Question 1: The Task Distribution
For meta-learning to work, we need to carefully design what tasks the model trains on (meta-train) and what tasks we evaluate on (meta-test).
Meta-test distribution: This is straightforward—the real downstream tasks we care about. Actual user requests, bug fixes, safety requirements. The fine-tuned model’s performance on these tasks is what matters.
Meta-train distribution: This is where it gets interesting. We need extreme flexibility and variety here. We would like the model to be able to fine-tune on very unstructured data: large unsupervised or self-supervised datasets and repositories, structured few-shot example tasks, large supervised fine-tuning datasets, reward optimization tasks, continual learning tasks, and many more. This flexibility is what sets the approach apart from conventional conceptions of meta-learning. It is feasible primarily because large models allow us to use flexible learning mechanisms that adapt in diverse ways (as discussed in the next section).
Concrete example: Repository adaptation for coding
Consider a model adapting to a new codebase. The meta-train tasks might include:
Index the repository: Learn its structure, key abstractions, dependency patterns
Learn from bug fixes: Use previous fixes as few-shot examples of correct patterns
Create unsupervised tasks: Mask out functions and try to complete them, learning the codebase’s idioms
Test generation: Mask out tests and train to write them from scratch, learning verification patterns
Then meta-test on:
User feature requests
Hidden bug fixes (held out from training)
Safety checks (does the adapted model stay aligned?)
Performance under distribution shift
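The "mask out functions and try to complete them" task from the meta-train list above can be generated mechanically from raw source. The sketch below assumes the repository is Python, handles only top-level functions with multi-line bodies, and uses a hypothetical `<MASKED BODY>` placeholder. A real pipeline would need to handle many more cases.

```python
import ast

def make_function_completion_tasks(source):
    """Turn each top-level function into a (prompt, target) completion pair."""
    tree = ast.parse(source)
    lines = source.splitlines()
    tasks = []
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        body_start = node.body[0].lineno - 1  # first body line (0-indexed)
        body_end = node.end_lineno            # slice end: one past the last body line
        target = "\n".join(lines[body_start:body_end])
        masked = lines[:body_start] + ["    # <MASKED BODY>"] + lines[body_end:]
        tasks.append({
            "name": node.name,
            "prompt": "\n".join(masked),  # rest of the file stays as context
            "target": target,             # what the model must reconstruct
        })
    return tasks

example = '''def add(a, b):
    return a + b

def greet(name):
    message = "hi " + name
    return message
'''
tasks = make_function_completion_tasks(example)
```

The same held-out split logic applies here as in the lists above: functions used to build meta-train tasks should not overlap with the bug fixes and feature requests reserved for meta-test.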
Note that these task types aren’t entirely new. We’re already creating similar environments for RL reasoning training and evaluation. The critical difference is in the formulation. The current approach is multi-task learning, where a single model does well on a wide variety of tasks without any adaptation. We instead propose a meta-learning problem, where the model adapts with actual parameter updates per task instance: not just stacking all the information into context, but real adaptation through fine-tuning. This difference is crucial. Current reasoning models try to be generalists. Meta-learned models become specialists: temporarily, adaptively, controllably.
Thus, the infrastructure we’re building for reasoning model training can be repurposed for meta-learning with some additional work. This bootstraps what will eventually become diverse, extensive meta-learning environments covering a wide range of fine-tuning scenarios.
Question 2: The Adaptation Problem
Given the complexity and variety of the fine-tuning problem described above, the conventional meta-learning setups conceived in the literature are clearly too simplistic. The AGI community offers a different framing, ‘recursive self-improvement’, where a super-intelligent model programs itself to keep improving recursively. This is a very centralized conception of the problem, and simplistic in its own way.
A more nuanced framing sits between these two extremes. We formulate it as the problem of training a strong base model to be good at ‘self-improvement’ and adaptation (rather than just simple fine-tuning) on a range of niche, specialized tasks. We call this ‘decentralized recursive self-improvement’, in contrast to the centralized AGI version.
Concrete example: Repository adaptation for coding
The model must figure out:
Indexing strategy and its evolution: How to organize repository knowledge, which abstractions to track. How to refine those abstractions as the model adapts.
Task selection: Which unsupervised/self-supervised learning tasks to use for adaptation, and how to create those tasks from the provided raw data.
Parameter selection: Which parameters to adapt (LoRA, last-layer fine-tuning, ControlNet, full fine-tuning?).
Optimizer selection: Adam, second-order methods, learned optimizers?
Tool setup: What tools are needed and how to configure them
Reasoning strategy: What reasoning approaches fit this repo’s complexity class and likely downstream tasks
And many more
This is a rich, complex adaptation process—far beyond “update parameters on examples.”
Today’s reasoning models already take baby steps toward such an adaptation problem. When given a new task, they already think through the problem structure, the approach to take, and the tools they could use. Meta-adaptation is a natural extension to richer reasoning traces and actual parameter updates. The continuity with current systems suggests this is the right direction.
Question 3: The Algorithm for Meta-Learning
As with the reasoning paradigm, the algorithmic framework here rests on orchestrating existing capabilities into a new, powerful learning paradigm. The foundation is a strong, pre-trained reasoning model, akin to today’s frontier models. This base model must provide the raw cognitive material: the ability to interpret complex instructions, generate and modify its own code, and reflect on its performance—the essential skills for self-improvement.
The core of the algorithm is a two-level process. In the “inner loop,” the model executes the adaptation itself. Presented with a new task—defined by a dataset, instructions, and a target objective—the model initiates a multi-step self-improvement cycle. First, it analyzes the task to formulate a learning strategy, hypothesizing which of its capabilities need modification. Then, it specifies the actual parameter update rules to use, perhaps for a set of LoRA adapters or other efficient structures, effectively writing its own fine-tuning adjustments. It applies these changes, tests them on a validation subset of the task data, and iterates, refining the updates based on the intermediate results. The model is, in essence, performing its own targeted, iterative fine-tuning in a closed loop.
The “outer” meta-learning loop is the meta-training process that teaches the model this skill. Because the inner loop is a complex, temporally extended procedure, direct backpropagation is intractable. Instead, credit assignment must be handled through reinforcement learning. After an inner-loop adaptation cycle concludes, the newly specialized model is evaluated on a held-out test set discussed above. The performance on this test set—a measure of how well it generalized—provides a sparse reward signal. This reward is used to update the base model’s meta-learning faculty, reinforcing the adaptation strategies that lead to successful outcomes. Training this outer loop requires a vast and diverse distribution of tasks, exposing the model to thousands of unique adaptation problems so it can learn the general principles of effective learning.
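The two-level structure can be shown in a deliberately tiny toy. In the sketch below, the "model" being adapted is a single scalar weight, each task is a one-dimensional regression, and the "adaptation strategy" is reduced to choosing a learning rate from a fixed menu. All of these are drastic simplifying assumptions made for illustration. What the toy does preserve is the key property described above: the outer loop never backpropagates through the inner loop, and learns only from a scalar reward computed on held-out data via REINFORCE.

```python
import math
import random

STRATEGIES = [0.001, 0.5, 2.0]  # hypothetical strategy menu: inner-loop learning rates

def inner_loop(slope, learning_rate, steps=20):
    """Adapt a scalar weight w to one task (y = slope * x) by gradient descent."""
    train_x, test_x = [0.5, 1.0], [0.75]
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - slope * x) * x for x in train_x) / len(train_x)
        w -= learning_rate * grad
    # Held-out loss of the adapted model; the outer loop only sees this number.
    return sum((w * x - slope * x) ** 2 for x in test_x) / len(test_x)

def softmax(prefs):
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def outer_loop(num_tasks=2000, meta_lr=0.1, seed=0):
    """REINFORCE over strategy preferences, rewarded by held-out performance."""
    rng = random.Random(seed)
    prefs = [0.0] * len(STRATEGIES)
    baseline = 0.0
    for _ in range(num_tasks):
        probs = softmax(prefs)
        choice = rng.choices(range(len(STRATEGIES)), weights=probs)[0]
        slope = rng.uniform(-2.0, 2.0)  # sample a new task
        # Sparse reward: negative held-out loss, clipped so divergence stays bounded.
        reward = -min(inner_loop(slope, STRATEGIES[choice]), 10.0)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # running-average baseline
        for i in range(len(prefs)):
            indicator = 1.0 if i == choice else 0.0
            prefs[i] += meta_lr * advantage * (indicator - probs[i])
    return softmax(prefs)

strategy_probs = outer_loop()
```

In this toy, the smallest learning rate underfits within the step budget and the largest one diverges, so the outer loop should come to prefer the middle option. The real proposal replaces the three-item menu with open-ended strategies written by the model itself, which is what makes the outer-loop credit assignment so much harder.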
This approach poses significant but solvable challenges. The reward function itself must be sophisticated. Different tasks need different reward structures, likely requiring a meta-adaptive component for the reward function itself that adjusts it based on the task/sub-task. The entire process requires a robust infrastructure capable of managing thousands of parallel self-modification experiments. And as with the other scaling laws, these are no longer questions of just pure science, but of large-scale engineering. The computational cost of running millions of these inner-loop/outer-loop cycles is precisely how this paradigm absorbs compute. By leveraging the infrastructure built for today’s models, we can create a system that learns how to learn—and in doing so, establish the predictable, resource-driven relationship between compute and capability that defines a true scaling law.
The Path to a New Scaling Law
This vision of meta-learning was pure science fiction just a few years ago. Today, the essential ingredients have finally converged. The powerful reasoning capabilities of modern foundation models provide a sufficiently strong starting point, and the large-scale RL infrastructure built to train them gives us the tools to tackle this challenge. The timing is right.
By training models explicitly for adaptability, we can transform fine-tuning from a brittle, labor-intensive art into a reliable, capital-absorbing scaling law. The relationship becomes predictable and monotonic: investing more adaptation compute, using larger specialization contexts, and running more adaptation steps will reliably translate to better performance on specialized tasks. This isn’t just a marginal improvement; it’s a fundamental shift that makes progress predictable.
Realizing this vision requires solving substantial problems, but they are no longer abstract academic curiosities. They are well-posed engineering and scientific challenges: building the data pipelines for diverse meta-training, designing algorithms for adaptive reward and credit assignment, and creating the software infrastructure to support models that dynamically update their own parameters. These are difficult, but tractable, problems.
The payoff for this effort is transformative. Systematizing adaptation will unlock the value of countless proprietary datasets currently inaccessible to non-experts, fundamentally reshape the competitive dynamics with open-source models, and enhance safety by making adaptation more controllable. Most profoundly, it paves the way for a practical form of recursive self-improvement—not a single, centralized AGI, but a decentralized ecosystem of models constantly and safely adapting to specialize for every user, codebase, and domain. This is how we convert the promise of adaptation into the next great scaling law.
III. Concluding thoughts
The current paradigm, built on the twin pillars of data scaling and test-time compute, has been extraordinarily successful. It has produced systems with unprecedented reasoning capabilities and established a clear, if capital-intensive, path to improvement. But as with any mature research program, the returns from simply turning the existing knobs are beginning to follow a predictable, and perhaps diminishing, curve. The most important question for the field is no longer how to refine the current paradigm, but how to define the next one.
The path forward lies in systematizing the powerful but ad-hoc techniques that have emerged in the shadow of the current scaling laws. Today, structured search is a collection of brittle prompting tricks and complex, hand-tuned decoders. Adaptation is an art form, a bespoke process of fine-tuning that is more craft than science. This is precisely the state data scaling and reasoning were in before they were formalized into predictable laws. The informal observations and scattered successes are the signal; the task ahead is to amplify that signal into a robust, scalable methodology.
This formalization is the critical work required to unlock the next wave of compute. The goal is to build a new paradigm where performance scales predictably with resource allocation in two new dimensions. First, where investing more parallel search and reasoning at inference time reliably yields better, more verifiable answers. Second, where investing more compute into a meta-learning phase produces models that can adapt to new domains with ever-greater speed and reliability. This is how we transform these promising research directions into true scaling laws: by creating a predictable relationship between capital investment and capability gain.
The models that emerge from this program will represent a fundamental evolution. They will not be static artifacts of knowledge, but dynamic systems capable of targeted inquiry and autonomous specialization. This is the most direct and promising path to converting the immense computational resources on the horizon into a qualitative leap in AI capability. The labs that successfully engineer this transition—transforming search and adaptation from craft to science—will not only absorb the coming compute, but will define the frontier of artificial intelligence for the next decade.
If you found this interesting and want to discuss these ideas further, or if you’re working on related problems, I’d love to hear from you.
