The missing layer of the agent development stack

· Adam Gold

We are building AI agents that talk to GitHub and Slack. Well, it’s not exactly “we” - our AI agents built AI agents that talk to GitHub and Slack. Weird, I know. Anyway, ten agents running in parallel, each hitting the same endpoints over and over while debugging. GitHub’s 5,000-requests-per-hour limit evaporated fast, and every test run left garbage PRs we had to close manually (or by script). Webhooks required ngrok and couldn’t be replayed.

If you’re building something that talks to a database, you don’t test against prod. But for third-party APIs - GitHub, Slack, Stripe - everyone just… hits the real thing? Writes throwaway mocks? Fights rate limits and webhook plumbing?

We couldn’t keep doing that, so we built fake servers that act like the real APIs, keep state, and work with the official SDKs. The more we used them, the more we wondered: why doesn’t this exist already? So we open sourced it.

The gap in the development stack

If you need to test against a database, you spin up Postgres in a container - solved problem, everyone does it. Cloud services? MinIO for S3, LocalStack for AWS, Azurite for Azure - also solved. These things exist because developers needed them and someone built them.

Third-party APIs? Nothing.

Want to test against GitHub without hitting rate limits? Write your own mock. Stripe? Mock it. Slack? Mock it. Every team writes the same throwaway mocks, they drift from reality immediately, and nobody shares them because they’re all slightly different and barely maintained.

For humans, this is annoying but survivable - you hit the API manually a few times, maybe you write a mock that covers the happy path, you move on. But with agents, this annoyance becomes a blocker. Agents run unattended, in parallel, experimenting with variations, learning what works. They need something else. They need a surface that behaves like the real thing, responds to their actions, maintains state across calls, and doesn’t rate limit them after 20 requests.

doubleagent is the missing layer of the local development stack - the one for third-party APIs.

I’ll try to keep this post focused on the design decisions we made along the way.

What it looks like

Start the GitHub fake:

$ doubleagent start github
Starting github fake on http://localhost:8080
Health endpoint: http://localhost:8080/_doubleagent/health
Ready.

Write a test using the official PyGithub SDK:

from github import Github

def test_full_issue_lifecycle():
    client = Github(base_url="http://localhost:8080")
    
    repo = client.get_user().create_repo("test-repo")
    issue = repo.create_issue(title="Bug", body="Something broke")
    issue.create_comment("Looking into this")
    issue.edit(state="closed")
    
    fetched = repo.get_issue(1)
    assert fetched.state == "closed"
    assert fetched.comments == 1
    assert list(fetched.get_comments())[0].body == "Looking into this"

Run it:

$ pytest test_github.py -v
========================= test session starts =========================
test_github.py::test_full_issue_lifecycle PASSED                 [100%]
========================= 1 passed in 0.03s ===========================

You just ran a full integration test against a stateful GitHub API, locally, in milliseconds, zero network, zero rate limits, zero garbage repos or issues to clean up afterward.

Decision 1: Agents can contribute

This is the weird one, and honestly the one I’m most excited about.

We built doubleagent to be used by AI agents. So we figured: why not let AI agents add to it?

The repo has a skills/ directory that teaches agents how to contribute:

skills/
├── add-service.md          # Create new service from scratch
├── add-endpoint.md         # Add endpoint to existing service  
├── write-contract-test.md  # Tests that run against real + fake
└── debug-drift.md          # Fix when fake stops matching real API

Point Claude or Cursor at skills/add-service.md with an API’s docs, and it follows a pipeline:

  1. Scan the API docs and pick the canonical SDK for that platform
  2. Identify the 5-10 core usage patterns developers actually need
  3. Translate those patterns into test scenarios
  4. Run the tests against the real API first to establish ground truth
  5. Implement the fake server to match that behavior
  6. Verify the same tests pass against the fake
  7. Enrich with edge cases and iterate until contract tests pass against both

We pointed Claude Code at skills/add-service.md with Stripe’s API docs. 45 minutes later it opened a PR with 12 contract tests and a working fake covering payment intents, customers, and refunds. The PR passed review and got merged.

There are hundreds of APIs that agents need to talk to. We’re not going to build fakes for all of them ourselves - that’s impossible. But if agents can add their own following the same patterns, the thing grows on its own. More people build agents, more services get faked, easier to build more agents. It’s a flywheel that doesn’t require us to be the bottleneck.

Decision 2: State machines, not response templates

Most mocking tools break the moment your agent creates something, then reads it back:

# Agent creates a PR
pr = repo.create_pull(title="Feature", head="feature", base="main")

# Then tries to read it
fetched = repo.get_pull(pr.number)  # Mock has no idea this PR exists

The mock returned a canned response for create_pull, but it didn’t actually create anything. The next call fails or returns stale data.

We needed fakes that keep state. When you create a PR, it exists. When you read it back, you get what you created. When you merge it, the state changes - PR closes, branch deletes, commit shows up in target branch history.

# This actually works with doubleagent
repo = client.get_user().create_repo("test-repo")
issue = repo.create_issue(title="Bug", body="Something broke")
issue.create_comment("Looking into this")
issue.edit(state="closed")

fetched = repo.get_issue(1)
assert fetched.state == "closed"
assert fetched.comments == 1
assert list(fetched.get_comments())[0].body == "Looking into this"
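Under the hood, the idea reduces to a shared in-memory store that every endpoint reads and writes. A minimal sketch of the concept (illustrative only - not doubleagent’s actual implementation):

```python
# A stateful fake in miniature: create, comment, close, and read
# all operate on the same store, so reads reflect earlier writes.
class FakeIssues:
    def __init__(self):
        self._issues = {}   # issue number -> issue dict
        self._next = 1

    def create(self, title, body):
        issue = {"number": self._next, "title": title, "body": body,
                 "state": "open", "comments": []}
        self._issues[self._next] = issue
        self._next += 1
        return issue

    def comment(self, number, text):
        self._issues[number]["comments"].append(text)

    def close(self, number):
        self._issues[number]["state"] = "closed"

    def get(self, number):
        return self._issues[number]

fake = FakeIssues()
issue = fake.create("Bug", "Something broke")
fake.comment(issue["number"], "Looking into this")
fake.close(issue["number"])
assert fake.get(1)["state"] == "closed"
assert fake.get(1)["comments"] == ["Looking into this"]
```

The fake’s HTTP layer just translates REST routes into calls like these; because every call mutates the same store, a create followed by a read behaves the way the real API does.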

Decision 3: Contract tests against real APIs

Fakes drift - GitHub adds a field or Slack changes an error message. Your fake doesn’t know, so now you’re testing against something that doesn’t match reality.

Testing against a lie is worse than not testing. You ship code that passes, then it breaks in prod because the real API behaves differently.

So every behavior we fake has a test that runs against both the fake and the real API:

@pytest.mark.parametrize("client", [real_github_client, fake_github_client])
def test_create_issue_returns_correct_structure(client):
    repo = client.get_repo("test-org/test-repo")
    issue = repo.create_issue(title="Test", body="Body")
    
    assert issue.number > 0
    assert issue.title == "Test"
    assert issue.state == "open"
    assert issue.created_at is not None

@pytest.mark.parametrize("client", [real_github_client, fake_github_client])
def test_invalid_repo_returns_404(client):
    with pytest.raises(GithubException) as exc:
        client.get_repo("nonexistent/repo")
    assert exc.value.status == 404

CI runs these against fakes on every commit. We run against real APIs daily. When a test passes on the fake but fails on real GitHub, we know something changed upstream. When it fails on the fake but passes on real GitHub, we broke something.
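The daily real-API run can be a plain scheduled CI job. A sketch of what that might look like as a GitHub Actions workflow - the file name, marker, and `--target` flag here are assumptions for illustration, not doubleagent’s actual config:

```yaml
# .github/workflows/contract-drift.yml (hypothetical)
on:
  schedule:
    - cron: "0 6 * * *"   # once a day, against the real APIs
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest -m contract --target=real   # hypothetical flag selecting real clients
        env:
          GITHUB_TOKEN: ${{ secrets.CONTRACT_TEST_TOKEN }}
```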

Decision 4: Language agnostic

We thought about building one big framework in Python - easiest for us to maintain, consistent patterns, all that. But that would make us the bottleneck for every new service.

So we made each fake an independent HTTP server. Write it in whatever language you want:

services/
├── github/     # Python
├── slack/      # Python  
├── stripe/     # TypeScript
├── auth0/      # Python
└── descope/    # Python

The only rule: implement three endpoints.

GET  /_doubleagent/health   # 200 if up
POST /_doubleagent/reset    # Clear state
POST /_doubleagent/seed     # Load fixtures

Independent HTTP servers mean each fake is a natural library for its ecosystem - you can test, deploy, and version them independently, run only what you need, and the interface is dead simple. We use Rust for the CLI because we wanted fast startup. Python for most fakes because it’s faster to prototype. Someone could write the Jira fake in TS and it would work fine.

Did we try monolithic first? Yes. It didn’t scale.
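To show how small the contract is, here is a minimal server that implements just the three control endpoints using Python’s stdlib - a sketch, not one of the shipped fakes, which also serve the mimicked API’s own routes:

```python
# Minimal fake implementing doubleagent's three control endpoints.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE = {}  # the fake's in-memory state

class ControlHandler(BaseHTTPRequestHandler):
    def _reply(self, code, body=b"{}"):
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/_doubleagent/health":
            self._reply(200, b'{"status": "ok"}')   # 200 if up
        else:
            self._reply(404)

    def do_POST(self):
        if self.path == "/_doubleagent/reset":
            STATE.clear()                            # clear state
            self._reply(200)
        elif self.path == "/_doubleagent/seed":
            length = int(self.headers.get("Content-Length", 0))
            STATE.update(json.loads(self.rfile.read(length) or b"{}"))
            self._reply(200)                         # load fixtures
        else:
            self._reply(404)

    def log_message(self, *args):
        pass  # keep test output quiet

# To run standalone:
# HTTPServer(("127.0.0.1", 8080), ControlHandler).serve_forever()
```

Anything that answers these three routes plugs into the CLI, regardless of language.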

Decision 5: Webhooks that actually work

The bot we started with was webhook-driven. GitHub fires an event, our agent responds. Testing this against real GitHub is painful:

  1. Expose a public endpoint (ngrok, localtunnel, whatever)
  2. Actually do the thing that triggers the webhook
  3. Hope the timing works out
  4. Can’t replay, can’t control order

With doubleagent, webhooks fire when state changes:

# Your webhook handler receives "pull_request.opened" when this runs:
pr = repo.create_pull(title="Feature", head="feature", base="main")

# And "pull_request.closed" with merged=true when this runs:
pr.merge()

No ngrok, no public endpoints. Payloads are signed with HMAC-SHA256, same as real GitHub, so signature validation works.
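For reference, GitHub’s scheme puts `"sha256=" + HMAC-SHA256(secret, raw_body)` in the X-Hub-Signature-256 header, so the same verification code works against both the fake and the real thing:

```python
# Verify a GitHub-style webhook signature (X-Hub-Signature-256 header).
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, header)

secret = b"my-webhook-secret"
body = b'{"action": "opened"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
assert verify_signature(secret, body, header)
assert not verify_signature(secret, b'{"action": "tampered"}', header)
```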

The real power: testing edge cases. What happens when two webhooks arrive out of order? When a PR gets merged while you’re still processing the review event? You can trigger webhooks in whatever order you want, at whatever timing, and test that your agent handles it.

Why we don’t fake everything

Google’s “Software Engineering at Google” has a chapter on test doubles that says something we keep coming back to:

“A fake should be exactly as high-fidelity as the client requires, but not more.”

We don’t try to fake every endpoint or every edge case upfront. We fake what people actually need, and when someone needs more, the system grows to meet them. The loop looks like this:

  1. Someone encounters an unsupported scenario
  2. They report it (or describe the user story they need)
  3. An agent writes tests against the real API for that scenario
  4. The agent implements the fake behavior to match
  5. The same tests pass against both

It’s not “we’ll eventually support everything.” It’s “we support what matters, and when something new matters, we add it.” The contract tests keep it honest, and agents do the work.

What’s next

The agent development stack is still mostly unbuilt. This is one piece of it.

Check out the repo: github.com/islo-labs/doubleagent. Start a fake, run a test, see if it works for what you’re building. If something’s missing, open an issue - or point an agent at the skills and let it add what you need.