An autonomous developer-tooling product
Red-teaming an autonomous coding agent
A fixed-scope adversarial assessment of an agent that could read, write, and execute against customer code: what breaks it, and what the fixes cost.
Delivered a ranked finding set with reproductions; the highest-severity class was a planning-loop hijack, not a prompt injection.
Context
A time-boxed red-team engagement against a shipping autonomous coding agent. Goal: find the failures that matter before an adversary does.
Problem and constraint
The interesting attacks on agents are not the prompt-injection demos. They are the ones that turn the agent's own planning loop against the user. The constraint was a fixed scope and a fixed clock: maximize severity-weighted coverage, not raw finding count.
Approach and key decisions
- Modeled the agent as an untrusted distributed system, not a chatbot. Threat surface: tool inputs, tool outputs fed back into context, repo content the agent reads, and the reflection step itself.
- Prioritized loop-integrity attacks over content attacks. The decision that paid off: spend the budget where a single success compromises a whole session, not where it produces one bad message.
- Wrote every finding as a deterministic reproduction, not a screenshot. A finding the team cannot replay is a finding they will not fix.
Outcome (sanitized)
The top finding was a planning-loop hijack: attacker-controlled repo content steered the agent's plan across steps without ever appearing as an obvious injection. Findings were ranked by severity with reproductions and suggested boundaries. Severity distribution shared as a shape, not as client-identifying detail.
What I would do differently
Start from the reflection step and work outward. That is where the highest severity lived, and I reached it later than I should have.