For a decade, the AI industry ran on a simple formula: more data + more compute = better models. The data was static — scraped web pages, annotated images, expert-curated text. The moat belonged to whoever had the biggest dataset or the deepest labeling pipeline.
That era is ending. The frontier has shifted from learning from human artifacts to learning from action and consequence. In this new paradigm, what matters is not the volume of tokens you can feed a model, but the quality of the environments you can build for it to practice in.
This essay argues that RL environments — the sandboxes, simulators, and instrumented workflows where agents act, fail, and improve — are emerging as the most consequential and investable layer of the AI stack. They are to the "Era of Experience" what ImageNet was to the deep learning revolution: the enabling substrate without which progress stalls.
I. The Shift: From Tokens to Trajectories
The "pre-training on internet text" paradigm hit diminishing returns around 2024. Not because language models stopped improving, but because the marginal value of another trillion tokens became vanishingly small compared to a well-designed RL training run. OpenAI's progression from o1 to o3 told the story: each generation allocated more compute to reinforcement learning and less to static pre-training.
This shift has a clean theoretical grounding. Silver and Sutton's "Era of Experience" paper from DeepMind frames it precisely: the next generation of AI agents will learn from streams of interaction, take grounded actions in real action spaces, receive grounded rewards from consequences, and plan in the currency of experience.
What does "RL compute" actually consume? Not text. It consumes environment interactions — hundreds of thousands of rollouts where an agent attempts a task, receives a reward signal, and updates its policy. The raw material of RL scaling is not a dataset. It is a world the agent can act in.
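The rollout loop described above can be sketched in a few lines. This is a deliberately toy illustration, not any lab's actual algorithm: the action names, the reward rule, and the weight update are all assumptions chosen to make the loop visible.

```python
import random

# Stub task surface: reward 1.0 only when the agent takes the useful action.
# The action space and reward rule are illustrative assumptions.
def run_task(action):
    return 1.0 if action == "fix_bug" else 0.0

# Stub policy: preference weights over a tiny discrete action space.
weights = {"fix_bug": 1.0, "write_poem": 1.0}
rng = random.Random(0)

def sample_action():
    actions, w = zip(*weights.items())
    return rng.choices(actions, weights=w, k=1)[0]

for _ in range(200):                 # the rollout loop
    action = sample_action()         # agent attempts the task
    reward = run_task(action)        # grounded reward from the consequence
    weights[action] += 0.5 * reward  # crude policy update toward reward
```

After a few hundred rollouts the policy's preference mass has shifted almost entirely onto the rewarded action. Nothing in this loop consumes text; everything it consumes is interaction.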
This creates a fundamentally different supply chain. In the token era, data companies sold annotated text. In the experience era, they sell instrumented environments — interactive sandboxes that package a workflow surface, a task distribution, an evaluation function, and full trajectory logging. The product isn't a file. It's a living system.
II. Anatomy of an RL Environment
An RL environment is deceptively simple in concept but ferociously hard to build well. At its core, it has four components: a surface the agent acts on (an application, API, or simulator), a task distribution defining what the agent is asked to do, an evaluation function that scores outcomes, and trajectory logging that records every action and observation. Each one is a source of competitive advantage or failure.
The critical insight is that these four components are deeply interdependent. A beautiful UI surface paired with a weak eval produces agents that look competent in demos and fail in production. A brilliant task library logged with lossy trajectories wastes compute. Building environments is systems engineering, not a wrapper around an existing application.
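A minimal sketch of the four components living in one object helps make the interdependence concrete. The surface here is a trivial arithmetic workflow and every name is illustrative; a real environment wraps an actual application, but the skeleton is the same.

```python
import random

class ToyEnvironment:
    """Toy single-step environment showing the four components: a surface
    (a trivial arithmetic workflow), a task distribution, an eval function,
    and full trajectory logging. Everything here is illustrative."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.trajectories = []   # component 4: trajectory logging

    def sample_task(self):       # component 2: task distribution
        a, b = self.rng.randint(0, 9), self.rng.randint(0, 9)
        return {"prompt": f"{a} + {b} = ?", "target": a + b}

    def reset(self):
        self.task = self.sample_task()
        return self.task["prompt"]   # component 1: the surface the agent sees

    def evaluate(self, answer):  # component 3: eval function
        return 1.0 if answer == self.task["target"] else 0.0

    def step(self, action):
        reward = self.evaluate(action)
        self.trajectories.append((self.task["prompt"], action, reward))
        return reward, True      # single-step episode: reward, done

env = ToyEnvironment(seed=42)
obs = env.reset()
# A "perfect agent" for this surface: parse the prompt and answer it.
answer = sum(int(x) for x in obs.split(" = ")[0].split(" + "))
reward, done = env.step(answer)
```

Swap the arithmetic surface for a browser, an IDE, or an EDA tool and the same skeleton holds; what changes is the engineering cost of keeping the four parts consistent with each other.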
III. Why Environments Are the Bottleneck
Brendan Foody of Mercor makes a provocative claim: "RL is becoming so effective that models will be able to saturate any evaluation." If that's true — and the evidence increasingly supports it — then the binding constraint on AI progress is no longer algorithmic. It's environmental.
Consider the implications. If RL can reliably improve a model on any eval it's trained against, then the frontier of AI capability is defined entirely by the breadth, fidelity, and difficulty of available environments. The labs don't need better optimizers. They need harder gyms.
This is why frontier labs have, in the past 18 months, put dozens of environment vendors into business. They are buying the raw material of RL scaling, and the demand is growing faster than supply.
Environments are harder to commoditize than data
Static datasets can be copied, leaked, or re-created. An environment is a running system — a live surface, a task generator, an eval pipeline, a logging infrastructure. Replicating it requires engineering the entire stack, not just downloading a file. The defensibility is operational, not legal.
Domain expertise creates a compounding advantage
Building a high-fidelity environment for chip design (Phinity), finance (Isidor), or computer use (Fleet AI, Matrices) requires deep domain knowledge — not just of the task, but of the failure modes, edge cases, and difficulty gradients that make RL training productive. This expertise compounds: each iteration reveals what the model finds hard, which informs the next generation of tasks.
The "eval gap" is the real AGI bottleneck
If RL can saturate any eval, then progress reduces to: can we write evals fast enough for everything humans do? This reframes the AGI question. The barrier isn't intelligence — it's instrumentation. Every un-eval'd workflow is a capability the model can't acquire through experience. Environment builders are, in a very real sense, converting human economic activity into AI-legible training signal.
Labs are compute-rich but environment-poor
The asymmetry is structural. Labs have billions in GPU compute and world-class RL researchers. What they lack — and cannot build internally at the rate they need — is coverage across the long tail of real-world workflows. This creates a natural market: labs buy environments the way cloud customers buy SaaS. The vendor ecosystem isn't incidental. It's load-bearing.
IV. The Competitive Landscape
The environment-building market is nascent but structuring quickly into clear verticals. Each vertical reflects a domain where RL training demand is acute and the task surface is complex enough to sustain a standalone company.
| Domain | Key Players | Why It's Hard | Moat Potential |
|---|---|---|---|
| Coding | Preference Model, Proximal, Mechanize, AfterQuery | Requires realistic repos, dependency resolution, test suites that catch subtle bugs — not just syntax | High — code environments improve with real codebase diversity |
| Computer Use | Fleet AI, Matrices, DeepTune | Pixel-level fidelity, latency simulation, handling non-deterministic UI states across OSes | Very High — cross-application workflows are combinatorially complex |
| Chip Design | Phinity AI | EDA tool integration, PPA (power/performance/area) evaluation, multi-step synthesis flows | Very High — extreme domain specialization |
| Finance | Isidor | Regulatory constraints, temporal data leakage, market simulation fidelity | High — compliance requirements create natural barriers |
| General Knowledge Work | Turing, Mercor, Surge AI, Handshake | Breadth vs. depth tradeoff; must cover many workflows without sacrificing fidelity | Medium — breadth is defensible, but depth may matter more |
The incumbents — Turing, Mercor, Surge AI — have the advantage of existing lab relationships and data annotation infrastructure. But the specialized players have something potentially more durable: task design expertise that directly translates into model capability improvement. The question is whether labs will consolidate around a few deep vendors or maintain a broad portfolio.
V. The Sim-to-Real Problem
Every environment company faces the same existential question: does training in our sandbox transfer to the real world? This is the sim-to-real gap, borrowed from robotics, and it is the single largest technical risk in the environment business.
Consider a coding environment. If the tasks are too synthetic — clean functions, isolated logic, perfect test suites — the model learns to ace the gym but stumbles on real codebases with messy dependencies, legacy patterns, and ambiguous requirements. The environment vendor's value is precisely in closing this gap: making their sandbox realistic enough that RL training transfers.
An underappreciated challenge: environments must stay ahead of the model. As RL training improves the agent, the environment must produce harder tasks, or the model saturates and stops learning. This creates a treadmill — environment companies must continuously expand their task distribution, or their product becomes obsolete as models improve.
The best environment companies will build procedural task generators that can scale difficulty programmatically, rather than relying on hand-crafted task libraries that get exhausted.
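One way to picture a procedural generator is a task factory with a difficulty knob plus a curriculum rule that turns the knob whenever the agent saturates. The arithmetic task and the 0.9 pass-rate threshold below are illustrative assumptions.

```python
import random

def generate_task(difficulty, rng):
    """Procedurally generate an arithmetic task; operand count and digit
    width both grow with `difficulty`. Purely illustrative."""
    n_terms = 2 + difficulty
    terms = [rng.randint(10 ** difficulty, 10 ** (difficulty + 1) - 1)
             for _ in range(n_terms)]
    return " + ".join(map(str, terms)) + " = ?", sum(terms)

rng = random.Random(7)
easy_prompt, easy_answer = generate_task(0, rng)   # two single-digit operands
hard_prompt, hard_answer = generate_task(3, rng)   # five four-digit operands

# Curriculum rule: ratchet difficulty up once the agent saturates a level.
def next_difficulty(pass_rate, difficulty):
    return difficulty + 1 if pass_rate > 0.9 else difficulty
```

The point of the sketch is the shape, not the arithmetic: because difficulty is a parameter rather than a hand-authored artifact, the task distribution never gets exhausted.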
Jerry Tworek, an OpenAI research lead, frames the broader concern precisely: "How do those models generalize? How do those models perform outside of what they've been trained for?" If RL environments produce narrow specialists rather than general improvers, the entire value proposition weakens. The counter-argument is that with enough diverse environments — covering enough of the skill surface — generalization emerges from breadth. This is an empirical question, and its answer will determine whether the environment market is a $1B or $100B opportunity.
VI. Beyond Labs: The Enterprise RL Opportunity
Today, the primary buyers of RL environments are frontier labs. But the next wave of demand will come from enterprises and AI-native application companies that want to fine-tune agents for their specific workflows.
Cursor's use of online RL to improve their tab completion model — using real user acceptance/rejection signals as the reward — is a template. Any company with an AI agent in production is sitting on a natural RL environment: the agent acts, the user provides implicit feedback (acceptance, correction, abandonment), and the trajectory is logged. The missing piece is the infrastructure to close the loop — to convert these production traces into RL training runs.
Consider a company operating AI voice bots handling millions of customer calls. Every call is a trajectory: the agent speaks, the customer responds, the call resolves or escalates. Success metrics are clear — resolution rate, customer satisfaction, handle time. The environment is already built; it's the production system itself. What's needed is the RL infrastructure to train on these trajectories and the eval framework to measure improvement without regressing on edge cases.
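Closing the loop amounts to mapping logged production traces into reward-labeled training examples. The log schema, the reward weighting, and the naive per-turn credit assignment below are all illustrative assumptions, not a production design.

```python
def reward_from_outcome(trace):
    """Map implicit production signals to a scalar reward (weights assumed)."""
    r = 0.0
    if trace["resolved"]:
        r += 1.0
    if trace["escalated"]:
        r -= 1.0
    r -= 0.01 * trace["handle_time_s"] / 60  # mild penalty for long calls
    return r

def to_training_examples(traces):
    examples = []
    for trace in traces:
        r = reward_from_outcome(trace)
        for turn in trace["turns"]:
            # Naive credit assignment: every turn inherits the episode reward.
            examples.append((turn["customer"], turn["agent"], r))
    return examples

traces = [{
    "resolved": True, "escalated": False, "handle_time_s": 180,
    "turns": [{"customer": "My order is late.", "agent": "Let me check that."}],
}]
examples = to_training_examples(traces)
```

Everything hard about the real version lives in the parts this sketch waves away: de-noising implicit feedback, assigning credit across turns, and evaluating without regressing on edge cases.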
This is where "RL-as-a-service" companies (Applied Compute, CGFT Labs, Osmosis AI) enter: they bring the RL expertise and tooling to companies that have the environments but lack the ML infrastructure to exploit them.
The enterprise opportunity has different dynamics than the lab market. Labs buy environments for general model improvement. Enterprises need environments that mirror their specific workflows — their CRM, their codebase, their customer demographics. This suggests a market where vertical specialization wins, and where the data generated by production deployment feeds back into environment quality in a flywheel that's very hard to replicate from outside.
VII. Open Questions and Risks
Generalization vs. Narrow Skill
The most important open question in AI: does RL training produce transferable intelligence, or just task-specific muscle memory? If narrow, the environment market fragments into thousands of niche verticals. If broad, a smaller number of high-quality environments might drive general capability — concentrating value in the best gym builders.
Continual Learning
Ilya Sutskever's metaphor of the "superintelligent 15-year-old" points at a gap RL alone may not fill. Current RL produces a model that is frozen after training. Real intelligence requires continual adaptation — learning on deployment, updating beliefs, acquiring new skills in context. If continual learning proves necessary, environment companies will need to evolve from batch-training sandboxes to always-on learning systems.
Reward Hacking at Scale
As RL scales, so does the attack surface for reward hacking — agents finding ways to maximize the eval score without actually solving the intended problem. This is the environment builder's deepest technical challenge: designing evals robust enough that optimization pressure doesn't find shortcuts. The arms race between RL algorithms and eval robustness will define the quality bar for environment companies.
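A standard defense is to require several independent checks per case, so a shortcut that satisfies one check alone still scores zero. A minimal sketch on an assumed toy task (scoring a sorting policy):

```python
import random

def robust_eval(candidate_sort):
    """Score a sorting policy with two independent checks per case, so a
    shortcut that games one check still scores zero. Illustrative."""
    rng = random.Random(0)
    for _ in range(50):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        out = candidate_sort(list(data))
        if any(a > b for a, b in zip(out, out[1:])):
            return 0.0  # check 1: output must be ordered
        if sorted(data) != sorted(out):
            return 0.0  # check 2: output must be a permutation of the input
    return 1.0

score_honest = robust_eval(sorted)                    # genuine solution
score_hack_empty = robust_eval(lambda xs: [])         # "always ordered" hack
score_hack_const = robust_eval(lambda xs: [1, 2, 3])  # constant-answer hack
```

An eval with only check 1 would hand full reward to the empty-list hack; the permutation check closes that shortcut. Real environments need the same redundancy at far higher stakes, against an optimizer far more creative than a lambda.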
Lab Consolidation Risk
If one or two labs pull decisively ahead in RL scaling, they may verticalize — building environments in-house rather than buying from vendors. The environment market's growth depends on RL scaling being a broad industry trend, not a single-lab phenomenon. The emergence of open-source RL infrastructure (Prime Intellect, Thinking Machines) is a positive signal here.
VIII. The Thesis, Compressed
In the experience era, the quality of the environment determines the ceiling of the agent. RL can optimize any objective you give it — which means the binding constraint is having the right objectives, in the right action spaces, at the right difficulty level. Environment builders are the supply-side infrastructure of AI progress.
The most investable positions in this stack are:
Deep-domain environment specialists
Companies that own the task design, eval logic, and fidelity requirements for a specific high-value domain — and whose environments measurably improve model performance on real-world tasks in that domain. The closer the gym to the game, the more valuable the gym.
Environment-building infrastructure
Tools that let anyone create high-quality RL environments — HUD Evals and equivalents that democratize environment construction. As RL moves from labs to enterprises, the demand for "environment IDEs" will explode.
Application companies with natural RL flywheels
Companies where production usage generates trajectories and reward signals that can be fed back into RL training — creating a self-improving loop that widens the gap with competitors who train only on static data. Cursor is the archetype. Every AI-native company should be asking: is my production system also an RL environment?
If Foody is right, then the companies that can convert human economic activity into well-instrumented, well-evaluated RL environments — at the speed the labs demand — are building one of the most durable and consequential businesses in AI. Not the flashiest. Not the one demoing a chatbot. The one building the gym where the chatbot learned to be good.