Lenny's PodcastSander Schulhoff: Why AI guardrails fail every red team test
How prompt injection and jailbreaks bypass state-of-the-art guardrails; agents that send emails or touch databases turn every bypass into real damage.
At a glance
WHAT IT’S REALLY ABOUT
AI guardrails are failing, exposing a looming agent-driven security crisis
- Sander Schulhoff argues that today’s AI security stack—especially guardrails and automated red teaming—is fundamentally ineffective against determined attackers. Because large language models have an effectively infinite prompt (attack) space, claims like “99% protection” are statistically meaningless, and humans routinely bypass state-of-the-art defenses in minutes. The real risk emerges not from chatbots alone but from AI agents, browsers, and robots that can take real actions (send emails, touch production systems, control hardware) and can be manipulated via prompt injection and jailbreaks. Schulhoff urges companies to refocus from buying guardrail products to combining classical cybersecurity with AI expertise, constraining permissions (e.g., CAMEL-style approaches), and educating teams before deploying powerful agents.
IDEAS WORTH REMEMBERING
5 ideasDo not rely on AI guardrails for real security
Guardrails (LLM classifiers around your model) can only be tested on a vanishingly small fraction of the effectively infinite prompt space, and human red-teamers consistently bypass them in 10–30 attempts. They tend to create dangerous overconfidence without meaningfully reducing risk for determined attackers.
Differentiate harmless chatbots from agents with real-world powers
If your AI is a read-only FAQ bot with no actions and no privileged data, your primary risk is reputational, not systemic. Once an AI can send emails, modify databases, browse the web, or control robots, any action it is permitted to take can be coerced by a prompt injection or jailbreak.
Treat LLMs as unpatchable brains, not traditional software
Classical bugs can be patched with high confidence; LLM behavior cannot. Even after fine-tuning or patching specific failure modes, the underlying model remains vulnerable to countless unseen variations of the same attack pattern.
Focus on permissioning and architecture, not just model behavior
Combine classical cybersecurity with AI awareness: sandbox code execution, strictly scope what each agent can read or write, and assume any accessible data can be exfiltrated. Proper data and action permissioning can neutralize entire classes of prompt injection, as in containerizing model-written code instead of running it on the main app server.
Use CAMEL-style task-based permission restriction where possible
Frameworks like CAMEL infer the minimal permissions needed for a given user request (e.g., only ‘send email’ but not ‘read inbox’ for a simple outbound message) and grant only those. This can block many prompt injection attacks that appear later in untrusted content, though it cannot solve cases where dangerous capabilities (read + write) are legitimately required together.
WORDS WORTH SAVING
5 quotesAI guardrails do not work. I'm gonna say that one more time. Guardrails do not work.
— Sander Schulhoff
You can patch a bug, but you can't patch a brain.
— Sander Schulhoff
The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure.
— Alex Komoroske, quoted by Lenny and Sander
For a model like GPT-5, the number of possible attacks is one followed by a million zeros… it’s basically an infinite attack space.
— Sander Schulhoff
The reason I'm here is 'cause I think this is about to start getting serious… We're starting to see agents and robotics deployed that can do damage.
— Sander Schulhoff
High quality AI-generated summary created from speaker-labeled transcript.
Get more out of YouTube videos.
High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.
Add to Chrome