If you’re evaluating infrastructure for autonomous coding agents or building it yourself - here’s a framework we’ve found useful after 100+ conversations with engineering teams at various stages of adoption.
```mermaid
flowchart TB
  subgraph L4["Layer 4: Harness"]
    H["Cursor, Claude Code, Codex, Custom"]
  end
  subgraph L3["Layer 3: Orchestration"]
    O["Task graphs • Failure handling • Human-in-the-loop • Session management"]
  end
  subgraph L2["Layer 2: Security"]
    S["Network control • Filesystem boundaries • Credential injection • Approval flows"]
  end
  subgraph L1["Layer 1: Compute"]
    C["MicroVMs • Real dev environments • Lifecycle control"]
  end
  L4 --> L3
  L3 --> L2
  L2 --> L1
```
Most infrastructure discussions focus on security - that’s Layer 2. We’ve mapped out four layers, and in our experience the gaps are usually somewhere else, not in the security layer.
The gap between the forefront and everyone else
Here’s what we keep seeing: teams at the forefront are already running multiplayer. Multiple agents picking up tickets in parallel, working overnight, PRs ready for review in the morning. Stripe’s minions merge 1,000+ PRs per week. Some teams we’ve talked to run 5-10 agents continuously.
But even for them, there’s a lot of hand-holding:
- Running multiple agents in parallel using git worktrees, manually checking they don’t conflict
- Copying CI error logs back and forth when builds fail
- Babysitting token spend to catch runaway retries
- Spending more time fixing agent output on legacy code than it would take to write it themselves
The forefront teams are making it work through brute force - custom scripts, manual coordination, constant supervision. They’ve built their own infrastructure because nothing off-the-shelf handles the complexity.
Everyone else is stuck in single-player mode. One engineer, one agent, constant supervision. It works - genuinely works - but it doesn’t scale.
Startups doing greenfield projects can try multiplayer mode sooner: simple stack, small team, low consequences for mistakes. Enterprises hit the wall much harder. Complex build systems, private registries, compliance requirements, legacy code that no single person fully understands - the gaps show up immediately, and most enterprises don’t have a 100-person platform team to build custom infrastructure the way Stripe does.
The Four Layers
Layer 1: Compute
The environment where agents run. VMs, containers, or MicroVMs that provide CPU, memory, filesystem, and network.
This seems simple until you try to scale it. A solo developer can get away with running agents locally - maybe using git worktrees to isolate parallel work, maybe just letting the agent modify their main checkout and hoping for the best.
But the moment you want agents running autonomously 24/7, you need dedicated compute.
What agents actually need:
- Real dev environments. This is where most sandbox providers fail. A stripped-down container that can run Python scripts is useless for real development work. Agents need to run Docker (yes, Docker inside the sandbox), spin up databases, use browser devtools, execute build systems. MicroVMs beat containers here because you get a real kernel, not just namespace isolation.
- Boot time doesn’t matter (for coding agents). This trips people up because they’ve heard about Firecracker’s ~125ms cold-start and assume that’s the goal. It’s not - at least not for coding agents.
  Coding agents boot once and run for hours. A 30-second startup is irrelevant when the agent will spend 4 hours refactoring a module. What matters is environment fidelity (does it match production?), persistence (can I resume where I left off?), and capability (can I run Docker, databases, browsers?).
  Fast boot matters for a different use case entirely: production agents that execute untrusted code per-request (code interpreters, user plugins, etc.). That’s not what we’re talking about here. Most sandbox providers optimize for boot time because they’re targeting those runtime use cases - that’s where the cloud provider money is - but coding agents need the opposite tradeoff. Slow boot is fine; limited environments are not.
- Lifecycle control. You need fast stop (don’t pay for idle compute), fast resume (pick up where you left off), and snapshot/restore (branch an environment, try something risky, roll back if it fails). An agent working on a bug might want to checkpoint before attempting a risky fix - if it breaks everything, restore and try a different approach.
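The checkpoint-before-risky-fix workflow can be sketched in a few lines. This is a toy, in-memory stand-in - the `Sandbox` class and its `snapshot`/`restore` methods are illustrative, not a real provider SDK:

```python
import copy


class Sandbox:
    """Minimal in-memory stand-in for a sandbox with snapshot support."""

    def __init__(self):
        self.state = {"files": {}}
        self._snapshots = {}

    def snapshot(self, name: str) -> None:
        # Capture a deep copy of the current state under a label.
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore(self, name: str) -> None:
        # Roll the environment back to a previously captured label.
        self.state = copy.deepcopy(self._snapshots[name])


def attempt_risky_fix(sandbox: Sandbox) -> bool:
    # Pretend the agent's refactor broke the build.
    sandbox.state["files"]["core.py"] = "broken refactor"
    return False


sb = Sandbox()
sb.state["files"]["core.py"] = "working code"
sb.snapshot("before-fix")          # checkpoint before the risky change
if not attempt_risky_fix(sb):
    sb.restore("before-fix")       # roll back and try a different approach
assert sb.state["files"]["core.py"] == "working code"
```

Real snapshot/restore works at the disk and memory level rather than on a Python dict, but the control flow an agent follows is exactly this: label, attempt, verify, restore on failure.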
In this layer, the biggest hurdle we’ve seen is matching the agent’s environment to the team’s actual dev environment. Just like the “historical” cloud developer environments that never achieved wide adoption, the problem is making environments easy to work with so they don’t break.
This sounds obvious, but it’s where the “15% productivity vs 10x” gap comes from. If your CI requires Docker Compose to spin up five services before tests pass, but the agent sandbox can’t run Docker-in-Docker, the agent will fail - or take a very long time to set everything up. With AI agents, time is literally money: money for tokens.
Signs you have a layer 1 problem:
- PRs from agents fail in CI but passed in the agent’s environment
- Agents can’t access private registries or internal tools
- You’re maintaining separate “agent-compatible” test suites
- Engineers say “it works when I run it, but not when the agent does”
Who’s building here: E2B (Firecracker-based sandboxes), Daytona (fast spin-up dev environments), Modal (serverless compute), Namespace (cloud dev environments), Sprites. Most optimize for boot time because they’re targeting runtime use cases.
Layer 2: Security
This layer is about isolation and access control: preventing agents from accessing unauthorized systems, running untrusted commands, or leaking credentials.
An autonomous agent with access to your codebase, your cloud credentials, and your production systems is an incident waiting to happen. Not because the agent is malicious - but because it’s confidently wrong in ways that humans rarely are.
I’ve seen agents:
- Delete tests because “they were flaky” (technically fixing the flakiness)
- Spin up cloud resources in regions nobody uses (for “optimization”)
- Commit secrets to version control (because the .env file was right there)
- `rm -rf` directories that seemed “unused” (they weren’t)
None of these were malicious. All of them were the agent doing exactly what it thought was right. The problem is that agents don’t always have the institutional knowledge that makes a human pause and think “wait, that seems dangerous.”
What you actually need:
- Network control. Your agent should only be able to reach the endpoints it needs for its work; everything else is denied by default. This limits the blast radius when (not if) something goes wrong. And not just a domain allowlist - full URL paths.
- Filesystem boundaries. The agent can modify the codebase, sure. But can it read ~/.ssh? Can it write to /etc? Can it access other projects on the same machine? These boundaries need to be explicit.
- Credential injection. Agents need secrets to do useful work - API keys, database passwords, cloud credentials. But those secrets shouldn’t live in the sandbox where the agent can exfiltrate them. The pattern that works: a proxy outside the sandbox intercepts requests, recognizes sentinel values like `__GITHUB_TOKEN__`, and injects real credentials at runtime. The agent never sees the actual secret. A few open source packages are emerging in this area.
- Approval flows. Some operations should require human approval. Merging to main, deploying to production, deleting data, creating cloud resources. The security layer needs to intercept these operations, pause execution, notify a human, and wait for approval before continuing.
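The credential-injection pattern above is mechanically simple. Here is a minimal sketch of the substitution step a proxy would perform outside the sandbox - the sentinel name and the in-memory secret store are illustrative assumptions, not a real package:

```python
# Real secrets live outside the sandbox, e.g. in the proxy's vault.
# "ghp_realsecret" is a placeholder value for illustration only.
SENTINELS = {"__GITHUB_TOKEN__": "ghp_realsecret"}


def inject_credentials(headers: dict) -> dict:
    """Rewrite sentinel placeholders in outgoing request headers.

    Runs in the proxy, outside the sandbox, so the agent never
    observes the real secret.
    """
    injected = {}
    for key, value in headers.items():
        for sentinel, secret in SENTINELS.items():
            value = value.replace(sentinel, secret)
        injected[key] = value
    return injected


# Inside the sandbox, the agent only ever sees the placeholder:
agent_request = {"Authorization": "Bearer __GITHUB_TOKEN__"}
outgoing = inject_credentials(agent_request)
assert outgoing["Authorization"] == "Bearer ghp_realsecret"
```

A production proxy would also handle request bodies, scope each sentinel to specific destination hosts, and log every substitution for the audit trail.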
Signs you have a layer 2 problem:
- Security team is blocking agent adoption entirely
- You’re nervous about what agents might access
- No audit trail of what agents actually did
This layer is becoming commoditized faster than most people realize. The patterns are documented. We’ve talked to multiple enterprises who’ve built “good enough” sandboxing in a few weeks using existing kernel tools, or by using the harness vendors’ enterprise plans. If your platform team is competent, you can probably build this. The question is whether you should spend their time on it.
Who’s building here: Most compute providers include basic isolation. Many teams roll their own using seccomp, AppArmor, or sandbox-exec.
Layer 3: Orchestration
This layer is about coordination, failure handling, and human-in-the-loop workflows. Making multiple agents & humans work together without stepping on each other.
The dream: a PRD goes in, tickets come out, agents pick up tasks, write code, create PRs, respond to review feedback, and merge when approved. The human role shifts from writing code to reviewing output and making product decisions.
This is where most teams underestimate the complexity. Running one agent on one task is straightforward, but running multiple agents, picking up work from the ticketing system, with proper failure handling, cost controls, and human oversight - that’s a different problem entirely.
What orchestration needs to handle:
- Task graphs, not task lists. The insight that took us longer to learn: you can’t just throw tickets at agents and expect coherent output. If Task B depends on Task A’s API changes, and an agent picks up Task B before Task A is merged, it will hallucinate the interface. You need dependency tracking that prevents agents from starting work that isn’t ready yet. Most teams we’ve talked to currently do this manually.
- Dynamic prompts. The prompt for “start this task” is different from “CI failed, fix it” which is different from “reviewer requested changes” which is different from “you’ve been stuck for 20 minutes, try a different approach.” Orchestration needs to understand where a task is in its lifecycle and provide appropriate context.
- CI/CD integration. When CI fails, orchestration routes the failure back to an agent with the error logs and instructions to fix. When a PR gets approved, orchestration triggers the merge. When a deploy succeeds, orchestration marks the task complete.
- Session management. You need to be able to monitor what an agent is doing in real-time, “teleport” into a session and take over the keyboard when needed, pause and resume work, and capture everything for later replay.
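The task-graph idea reduces to a small readiness check: an agent may only pick up a task whose dependencies are already merged. A sketch, with illustrative task names and statuses:

```python
# Each task declares its dependencies and a lifecycle status.
tasks = {
    "A": {"deps": [], "status": "merged"},
    "B": {"deps": ["A"], "status": "open"},   # needs A's API changes
    "C": {"deps": ["B"], "status": "open"},   # blocked until B merges
}


def ready_tasks(task_graph: dict) -> list:
    """Tasks an agent can safely start: open, with every dependency merged."""
    return [
        name
        for name, task in task_graph.items()
        if task["status"] == "open"
        and all(task_graph[dep]["status"] == "merged" for dep in task["deps"])
    ]


# Only B is dispatchable; C stays blocked, so no agent
# hallucinates an interface that doesn't exist yet.
assert ready_tasks(tasks) == ["B"]
```

Everything else in the orchestration layer builds on this gate: when B’s PR merges, its status flips and C becomes dispatchable on the next scheduling pass.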
Signs you have a Layer 3 problem:
- Agents running in parallel produce conflicting changes
- You’ve been surprised by large token bills
- Agents have made changes that should have required human approval
- Engineers don’t trust agent output because “sometimes it does weird things”
Dependency tracking across agents, cost circuit breakers, human-in-the-loop gates, stuck detection - these are product problems, not just infrastructure problems. We are seeing many open source orchestration frameworks emerging for this layer.
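A cost circuit breaker, for example, is little code once you have per-task token accounting. This sketch (the budget figure is illustrative) trips when cumulative spend crosses the limit, halting runaway retries:

```python
class CostBreaker:
    """Trips once a task's cumulative token spend exceeds its budget."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.spent = 0
        self.tripped = False

    def record(self, tokens: int) -> bool:
        """Record usage; return False once the budget is exhausted."""
        self.spent += tokens
        if self.spent > self.budget:
            self.tripped = True
        return not self.tripped


breaker = CostBreaker(budget_tokens=100_000)
assert breaker.record(60_000)        # within budget: agent keeps going
assert not breaker.record(50_000)    # 110k > 100k: trip, halt the agent
assert breaker.tripped
```

The orchestration layer’s job is to wire this into the loop: every model call reports its token count, and a tripped breaker pauses the session and pages a human instead of retrying.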
Who’s building here: OpenAI Symphony turns project work into isolated, autonomous implementation runs. Linear is moving into this space, connecting their issue tracker directly to agent workflows. Warp Oz for terminal-native agent orchestration. Ona, Claude Code, and Cursor are all adding orchestration features.
Layer 4: Harness
We’ve talked to enterprises whose codebases have years of technical debt, hundreds of contributors, undocumented business logic, weird edge cases from customers who left five years ago, and tests that nobody trusts. When they try coding agents on this code, the agents hallucinate interfaces, miss context, and produce changes that are technically correct but wrong for the system.
Joaco Diaz wrote a thoughtful piece arguing that human code review isn’t the last frontier - the real frontier is the deep context that lives in engineers’ heads. The reviewer who knows that this ugly hack exists for a reason, that another team depends on undocumented behavior, that this small refactor touches something fragile.
He’s right about the current state - agents today struggle with legacy codebases. They don’t have the institutional knowledge that makes a senior engineer pause and think “wait, I remember why this looks weird.”
But here’s the paradox: the only way agents get better at legacy code is if people run them on legacy code.
If enterprises wait until agents are “ready” for legacy systems before deploying them, they’ll wait forever. The tooling, the patterns, the skills, the context systems - none of that gets built unless people are actually trying to make agents work on messy codebases. The AI labs aren’t going to figure out your 20-year-old monorepo for you. The ecosystem evolves because innovators push on it.
The enterprises running agents on legacy code today - even when it’s painful, even when the agents make mistakes, even when humans have to intervene constantly - are pushing the entire ecosystem toward a future where this works. They’re discovering the failure modes, building the workarounds, developing the skills and context systems that make agents useful on real codebases.
The enterprises waiting for agents to be “ready” will find themselves years behind when that readiness arrives. Because it won’t arrive for everyone simultaneously. It will arrive first for the teams who invested in the infrastructure, learned what breaks, and built the institutional knowledge of how to make agents work on their specific codebase.
Signs you have a Layer 4 problem:
- Agents hallucinate interfaces or miss undocumented dependencies
- Output is “technically correct” but wrong for your system
- Engineers spend more time fixing agent output than they would writing it themselves
- You’re waiting for agents to “get better” before trying them on real code
Where to focus
If you’re evaluating where to invest, here’s the pattern we’ve seen:
Layer 2 (Security) is the first thing teams worry about, but it’s becoming commoditized. The patterns are known. You can build it or buy it, but either way it’s increasingly table stakes.
Layer 1 (Compute) is often the actual blocker, even though teams don’t frame it that way. “The agent isn’t working well” is often really “the agent’s environment doesn’t match our real environment.”
Layer 3 (Orchestration) is where to invest if you want to push the boundaries in your organization. It’s also where most teams underestimate the complexity.
Layer 4 (Harness): you probably already have a harness, or several. If your team hasn’t fully adopted it yet, invest the time to make it work - even on your legacy codebases.
Questions to ask yourself
- Can an agent in your sandbox run the exact same build/test commands your engineers run? If not, you have a Layer 1 problem.
- Could an agent in your sandbox exfiltrate credentials if it tried? If yes (or “I’m not sure”), you have a Layer 2 problem.
- If you ran five agents on related tickets right now, would they coordinate or conflict? If conflict, you have a Layer 3 problem.
- Do agents understand the weird undocumented stuff in your codebase? If they keep missing context that any senior engineer would know, you have a Layer 4 problem.
What we’re building
At Islo, we started by building Layer 2 - we thought security was the primary blocker. We’ve since learned it’s becoming a commodity.
We’re now focused on Layer 1 and Layer 3: compute that actually matches enterprise dev environments, with orchestration that makes autonomous agents reliable at scale.
We’re not trying to own Layer 4 - we’re agnostic to which harness you use: Anthropic, Cursor, Codex. It’s your choice.
If you’re building or evaluating agent infrastructure and want to compare notes, I’m at adam@islo.dev.