For a decade, the AI industry ran on a simple formula: more data + more compute = better models. The data was static — scraped web pages, annotated images, expert-curated text. The moat belonged to whoever had the biggest dataset or the deepest labeling pipeline.
That era is ending. The frontier has shifted from learning from human artifacts to learning from action and consequence. In this new paradigm, what matters is not the volume of tokens you can feed a model, but the quality of the environments you can build for it to practice in.
This essay argues that RL environments — the sandboxes, simulators, and instrumented workflows where agents act, fail, and improve — are emerging as the most consequential and investable layer of the AI stack. They are to the "Era of Experience" what ImageNet was to the deep learning revolution: the enabling substrate without which progress stalls.
I. The Shift: From Tokens to Trajectories
The "pre-training on internet text" paradigm hit diminishing returns around 2024. Not because language models stopped improving, but because the marginal value of another trillion tokens became vanishingly small compared to a well-designed RL training run. OpenAI's progression from o1 to o3 told the story: each generation allocated more compute to reinforcement learning and less to static pre-training.
This shift has a clean theoretical grounding. Silver and Sutton's "Era of Experience" paper from DeepMind frames it precisely: the next generation of AI agents will learn from streams of interaction, take grounded actions in real action spaces, receive grounded rewards from consequences, and plan in the currency of experience.
What does "RL compute" actually consume? Not text. It consumes environment interactions — hundreds of thousands of rollouts where an agent attempts a task, receives a reward signal, and updates its policy. The raw material of RL scaling is not a dataset. It is a world the agent can act in.
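The rollout loop described above can be sketched in a few lines. This is a deliberately toy illustration, not any lab's actual algorithm: the action names, the reward rule, and the weight update are all assumptions chosen to make the loop visible.

```python
import random

# Stub task surface: reward 1.0 only when the agent takes the useful action.
# The action space and reward rule are illustrative assumptions.
def run_task(action):
    return 1.0 if action == "fix_bug" else 0.0

# Stub policy: preference weights over a tiny discrete action space.
weights = {"fix_bug": 1.0, "write_poem": 1.0}
rng = random.Random(0)

def sample_action():
    actions, w = zip(*weights.items())
    return rng.choices(actions, weights=w, k=1)[0]

for _ in range(200):                 # the rollout loop
    action = sample_action()         # agent attempts the task
    reward = run_task(action)        # grounded reward from the consequence
    weights[action] += 0.5 * reward  # crude policy update toward reward
```

After a few hundred rollouts the policy's preference mass has shifted almost entirely onto the rewarded action. Nothing in this loop consumes text; everything it consumes is interaction.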
This creates a fundamentally different supply chain. In the token era, data companies sold annotated text. In the experience era, they sell instrumented environments — interactive sandboxes that package a workflow surface, a task distribution, an evaluation function, and full trajectory logging. The product isn't a file. It's a living system.
II. Anatomy of an RL Environment
An RL environment is deceptively simple in concept but ferociously hard to build well. At its core, it has four components: a surface the agent acts on (an application, API, or simulator), a task distribution defining what the agent is asked to do, an evaluation function that scores outcomes, and trajectory logging that records every action and observation. Each one is a source of competitive advantage or failure.
The critical insight is that these four components are deeply interdependent. A beautiful UI surface paired with a weak eval produces agents that look competent in demos and fail in production. A brilliant task library logged with lossy trajectories wastes compute. Building environments is systems engineering, not a wrapper around an existing application.
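A minimal sketch of the four components living in one object helps make the interdependence concrete. The surface here is a trivial arithmetic workflow and every name is illustrative; a real environment wraps an actual application, but the skeleton is the same.

```python
import random

class ToyEnvironment:
    """Toy single-step environment showing the four components: a surface
    (a trivial arithmetic workflow), a task distribution, an eval function,
    and full trajectory logging. Everything here is illustrative."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.trajectories = []   # component 4: trajectory logging

    def sample_task(self):       # component 2: task distribution
        a, b = self.rng.randint(0, 9), self.rng.randint(0, 9)
        return {"prompt": f"{a} + {b} = ?", "target": a + b}

    def reset(self):
        self.task = self.sample_task()
        return self.task["prompt"]   # component 1: the surface the agent sees

    def evaluate(self, answer):  # component 3: eval function
        return 1.0 if answer == self.task["target"] else 0.0

    def step(self, action):
        reward = self.evaluate(action)
        self.trajectories.append((self.task["prompt"], action, reward))
        return reward, True      # single-step episode: reward, done

env = ToyEnvironment(seed=42)
obs = env.reset()
# A "perfect agent" for this surface: parse the prompt and answer it.
answer = sum(int(x) for x in obs.split(" = ")[0].split(" + "))
reward, done = env.step(answer)
```

Swap the arithmetic surface for a browser, an IDE, or an EDA tool and the same skeleton holds; what changes is the engineering cost of keeping the four parts consistent with each other.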
III. Why Environments Are the Bottleneck
Brendan Foody of Mercor makes a provocative claim: "RL is becoming so effective that models will be able to saturate any evaluation." If that's true — and the evidence increasingly supports it — then the binding constraint on AI progress is no longer algorithmic. It's environmental.
Consider the implications. If RL can reliably improve a model on any eval it's trained against, then the frontier of AI capability is defined entirely by the breadth, fidelity, and difficulty of available environments. The labs don't need better optimizers. They need harder gyms.
This is why frontier labs have, in the past 18 months, put dozens of environment vendors into business. They are buying the raw material of RL scaling, and the demand is growing faster than supply.
Environments are harder to commoditize than data
Static datasets can be copied, leaked, or re-created. An environment is a running system — a live surface, a task generator, an eval pipeline, a logging infrastructure. Replicating it requires engineering the entire stack, not just downloading a file. The defensibility is operational, not legal.
Domain expertise creates a compounding advantage
Building a high-fidelity environment for chip design (Phinity), finance (Isidor), or computer use (Fleet AI, Matrices) requires deep domain knowledge — not just of the task, but of the failure modes, edge cases, and difficulty gradients that make RL training productive. This expertise compounds: each iteration reveals what the model finds hard, which informs the next generation of tasks.
The "eval gap" is the real AGI bottleneck
If RL can saturate any eval, then progress reduces to: can we write evals fast enough for everything humans do? This reframes the AGI question. The barrier isn't intelligence — it's instrumentation. Every un-eval'd workflow is a capability the model can't acquire through experience. Environment builders are, in a very real sense, converting human economic activity into AI-legible training signal.
Labs are compute-rich but environment-poor
The asymmetry is structural. Labs have billions in GPU compute and world-class RL researchers. What they lack — and cannot build internally at the rate they need — is coverage across the long tail of real-world workflows. This creates a natural market: labs buy environments the way cloud customers buy SaaS. The vendor ecosystem isn't incidental. It's load-bearing.
IV. The Competitive Landscape
The environment-building market is nascent but structuring quickly into clear verticals. Each vertical reflects a domain where RL training demand is acute and the task surface is complex enough to sustain a standalone company.
| Domain | Key Players | Why It's Hard | Moat Potential |
|---|---|---|---|
| Coding | Preference Model, Proximal, Mechanize, AfterQuery | Requires realistic repos, dependency resolution, test suites that catch subtle bugs — not just syntax | High — code environments improve with real codebase diversity |
| Computer Use | Fleet AI, Matrices, DeepTune | Pixel-level fidelity, latency simulation, handling non-deterministic UI states across OSes | Very High — cross-application workflows are combinatorially complex |
| Chip Design | Phinity AI | EDA tool integration, PPA (power/performance/area) evaluation, multi-step synthesis flows | Very High — extreme domain specialization |
| Finance | Isidor | Regulatory constraints, temporal data leakage, market simulation fidelity | High — compliance requirements create natural barriers |
| General Knowledge Work | Turing, Mercor, Surge AI, Handshake | Breadth vs. depth tradeoff; must cover many workflows without sacrificing fidelity | Medium — breadth is defensible, but depth may matter more |
The incumbents — Turing, Mercor, Surge AI — have the advantage of existing lab relationships and data annotation infrastructure. But the specialized players have something potentially more durable: task design expertise that directly translates into model capability improvement. The question is whether labs will consolidate around a few deep vendors or maintain a broad portfolio.
V. The Sim-to-Real Problem
Every environment company faces the same existential question: does training in our sandbox transfer to the real world? This is the sim-to-real gap, borrowed from robotics, and it is the single largest technical risk in the environment business.
Consider a coding environment. If the tasks are too synthetic — clean functions, isolated logic, perfect test suites — the model learns to ace the gym but stumbles on real codebases with messy dependencies, legacy patterns, and ambiguous requirements. The environment vendor's value is precisely in closing this gap: making their sandbox realistic enough that RL training transfers.
An underappreciated challenge: environments must stay ahead of the model. As RL training improves the agent, the environment must produce harder tasks, or the model saturates and stops learning. This creates a treadmill — environment companies must continuously expand their task distribution, or their product becomes obsolete as models improve.
The best environment companies will build procedural task generators that can scale difficulty programmatically, rather than relying on hand-crafted task libraries that get exhausted.
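One way to picture a procedural generator is a task factory with a difficulty knob plus a curriculum rule that turns the knob whenever the agent saturates. The arithmetic task and the 0.9 pass-rate threshold below are illustrative assumptions.

```python
import random

def generate_task(difficulty, rng):
    """Procedurally generate an arithmetic task; operand count and digit
    width both grow with `difficulty`. Purely illustrative."""
    n_terms = 2 + difficulty
    terms = [rng.randint(10 ** difficulty, 10 ** (difficulty + 1) - 1)
             for _ in range(n_terms)]
    return " + ".join(map(str, terms)) + " = ?", sum(terms)

rng = random.Random(7)
easy_prompt, easy_answer = generate_task(0, rng)   # two single-digit operands
hard_prompt, hard_answer = generate_task(3, rng)   # five four-digit operands

# Curriculum rule: ratchet difficulty up once the agent saturates a level.
def next_difficulty(pass_rate, difficulty):
    return difficulty + 1 if pass_rate > 0.9 else difficulty
```

The point of the sketch is the shape, not the arithmetic: because difficulty is a parameter rather than a hand-authored artifact, the task distribution never gets exhausted.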
Jerry Tworek, an OpenAI research lead, frames the broader concern precisely: "How do those models generalize? How do those models perform outside of what they've been trained for?" If RL environments produce narrow specialists rather than general improvers, the entire value proposition weakens. The counter-argument is that with enough diverse environments — covering enough of the skill surface — generalization emerges from breadth. This is an empirical question, and its answer will determine whether the environment market is a $1B or $100B opportunity.
VI. Beyond Labs: The Enterprise RL Opportunity
Today, the primary buyers of RL environments are frontier labs. But the next wave of demand will come from enterprises and AI-native application companies that want to fine-tune agents for their specific workflows.
Cursor's use of online RL to improve their tab completion model — using real user acceptance/rejection signals as the reward — is a template. Any company with an AI agent in production is sitting on a natural RL environment: the agent acts, the user provides implicit feedback (acceptance, correction, abandonment), and the trajectory is logged. The missing piece is the infrastructure to close the loop — to convert these production traces into RL training runs.
Consider a company operating AI voice bots handling millions of customer calls. Every call is a trajectory: the agent speaks, the customer responds, the call resolves or escalates. Success metrics are clear — resolution rate, customer satisfaction, handle time. The environment is already built; it's the production system itself. What's needed is the RL infrastructure to train on these trajectories and the eval framework to measure improvement without regressing on edge cases.
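Closing the loop amounts to mapping logged production traces into reward-labeled training examples. The log schema, the reward weighting, and the naive per-turn credit assignment below are all illustrative assumptions, not a production design.

```python
def reward_from_outcome(trace):
    """Map implicit production signals to a scalar reward (weights assumed)."""
    r = 0.0
    if trace["resolved"]:
        r += 1.0
    if trace["escalated"]:
        r -= 1.0
    r -= 0.01 * trace["handle_time_s"] / 60  # mild penalty for long calls
    return r

def to_training_examples(traces):
    examples = []
    for trace in traces:
        r = reward_from_outcome(trace)
        for turn in trace["turns"]:
            # Naive credit assignment: every turn inherits the episode reward.
            examples.append((turn["customer"], turn["agent"], r))
    return examples

traces = [{
    "resolved": True, "escalated": False, "handle_time_s": 180,
    "turns": [{"customer": "My order is late.", "agent": "Let me check that."}],
}]
examples = to_training_examples(traces)
```

Everything hard about the real version lives in the parts this sketch waves away: de-noising implicit feedback, assigning credit across turns, and evaluating without regressing on edge cases.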
This is where "RL-as-a-service" companies (Applied Compute, CGFT Labs, Osmosis AI) enter: they bring the RL expertise and tooling to companies that have the environments but lack the ML infrastructure to exploit them.
The enterprise opportunity has different dynamics than the lab market. Labs buy environments for general model improvement. Enterprises need environments that mirror their specific workflows — their CRM, their codebase, their customer demographics. This suggests a market where vertical specialization wins, and where the data generated by production deployment feeds back into environment quality in a flywheel that's very hard to replicate from outside.
VII. Open Questions and Risks
Generalization vs. Narrow Skill
The most important open question in AI: does RL training produce transferable intelligence, or just task-specific muscle memory? If narrow, the environment market fragments into thousands of niche verticals. If broad, a smaller number of high-quality environments might drive general capability — concentrating value in the best gym builders.
Continual Learning
Ilya Sutskever's metaphor of the "superintelligent 15-year-old" points at a gap RL alone may not fill. Current RL produces a model that is frozen after training. Real intelligence requires continual adaptation — learning on deployment, updating beliefs, acquiring new skills in context. If continual learning proves necessary, environment companies will need to evolve from batch-training sandboxes to always-on learning systems.
Reward Hacking at Scale
As RL scales, so does the attack surface for reward hacking — agents finding ways to maximize the eval score without actually solving the intended problem. This is the environment builder's deepest technical challenge: designing evals robust enough that optimization pressure doesn't find shortcuts. The arms race between RL algorithms and eval robustness will define the quality bar for environment companies.
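A standard defense is to require several independent checks per case, so a shortcut that satisfies one check alone still scores zero. A minimal sketch on an assumed toy task (scoring a sorting policy):

```python
import random

def robust_eval(candidate_sort):
    """Score a sorting policy with two independent checks per case, so a
    shortcut that games one check still scores zero. Illustrative."""
    rng = random.Random(0)
    for _ in range(50):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        out = candidate_sort(list(data))
        if any(a > b for a, b in zip(out, out[1:])):
            return 0.0  # check 1: output must be ordered
        if sorted(data) != sorted(out):
            return 0.0  # check 2: output must be a permutation of the input
    return 1.0

score_honest = robust_eval(sorted)                    # genuine solution
score_hack_empty = robust_eval(lambda xs: [])         # "always ordered" hack
score_hack_const = robust_eval(lambda xs: [1, 2, 3])  # constant-answer hack
```

An eval with only check 1 would hand full reward to the empty-list hack; the permutation check closes that shortcut. Real environments need the same redundancy at far higher stakes, against an optimizer far more creative than a lambda.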
Lab Consolidation Risk
If one or two labs pull decisively ahead in RL scaling, they may verticalize — building environments in-house rather than buying from vendors. The environment market's growth depends on RL scaling being a broad industry trend, not a single-lab phenomenon. The emergence of open-source RL infrastructure (Prime Intellect, Thinking Machines) is a positive signal here.
VIII. The Thesis, Compressed
In the experience era, the quality of the environment determines the ceiling of the agent. RL can optimize any objective you give it — which means the binding constraint is having the right objectives, in the right action spaces, at the right difficulty level. Environment builders are the supply-side infrastructure of AI progress.
The most investable positions in this stack are:
Deep-domain environment specialists
Companies that own the task design, eval logic, and fidelity requirements for a specific high-value domain — and whose environments measurably improve model performance on real-world tasks in that domain. The closer the gym to the game, the more valuable the gym.
Environment-building infrastructure
Tools that let anyone create high-quality RL environments — HUD Evals and equivalents that democratize environment construction. As RL moves from labs to enterprises, the demand for "environment IDEs" will explode.
Application companies with natural RL flywheels
Companies where production usage generates trajectories and reward signals that can be fed back into RL training — creating a self-improving loop that widens the gap with competitors who train only on static data. Cursor is the archetype. Every AI-native company should be asking: is my production system also an RL environment?
If Foody is right, then the companies that can convert human economic activity into well-instrumented, well-evaluated RL environments — at the speed the labs demand — are building one of the most durable and consequential businesses in AI. Not the flashiest. Not the one demoing a chatbot. The one building the gym where the chatbot learned to be good.