What red-teaming an agent actually looks like
May 14, 2026 · 1 min read · #red-teaming #ai-security #process
A useful agent red-team is not a contest to find the funniest jailbreak. It is an engineering engagement with a clock and a budget, and the output is fixes, not screenshots.
How I run one
Scope as a system, not a chatbot. The first hour is spent mapping the real surface: tool inputs, tool outputs fed back into context, ambient content the agent reads, and the reflection step. If I am only testing the chat box, I am testing the least valuable thing.
Spend the budget by severity, not by count. Twenty low findings read well in a report and change nothing. One reproducible planning-loop hijack changes the architecture. I optimize for severity-weighted coverage and say so up front.
Every finding ships as a reproduction. Deterministic, replayable, with a suggested boundary. A finding the team cannot replay is a finding they will not fix, and an unfixed finding was not worth finding.
The honest part
The best findings usually come from the reflection step, and I usually reach them later than I would like. Red-teaming agents is still a young discipline. I would rather tell you that than sell you a checklist.