Lenny's Podcast

Sander Schulhoff: Why AI guardrails fail every red team test

How prompt injection and jailbreaks bypass state-of-the-art guardrails; agents that send emails or touch databases turn every bypass into real damage.

Sander SchulhoffguestLenny Rachitskyhost

Dec 20, 20251h 32mWatch on YouTube ↗

WHAT IT’S REALLY ABOUT

AI guardrails are failing, exposing a looming agent-driven security crisis

Sander Schulhoff argues that today’s AI security stack—especially guardrails and automated red teaming—is fundamentally ineffective against determined attackers. Because large language models have an effectively infinite prompt (attack) space, claims like “99% protection” are statistically meaningless, and humans routinely bypass state-of-the-art defenses in minutes. The real risk emerges not from chatbots alone but from AI agents, browsers, and robots that can take real actions (send emails, touch production systems, control hardware) and can be manipulated via prompt injection and jailbreaks. Schulhoff urges companies to refocus from buying guardrail products to combining classical cybersecurity with AI expertise, constraining permissions (e.g., CAMEL-style approaches), and educating teams before deploying powerful agents.

IDEAS WORTH REMEMBERING

5 ideas

Do not rely on AI guardrails for real security

Guardrails (LLM classifiers around your model) can only be tested on a vanishingly small fraction of the effectively infinite prompt space, and human red-teamers consistently bypass them in 10–30 attempts. They tend to create dangerous overconfidence without meaningfully reducing risk for determined attackers.

Differentiate harmless chatbots from agents with real-world powers

If your AI is a read-only FAQ bot with no actions and no privileged data, your primary risk is reputational, not systemic. Once an AI can send emails, modify databases, browse the web, or control robots, any action it is permitted to take can be coerced by a prompt injection or jailbreak.

Treat LLMs as unpatchable brains, not traditional software

Classical bugs can be patched with high confidence; LLM behavior cannot. Even after fine-tuning or patching specific failure modes, the underlying model remains vulnerable to countless unseen variations of the same attack pattern.

Focus on permissioning and architecture, not just model behavior

Combine classical cybersecurity with AI awareness: sandbox code execution, strictly scope what each agent can read or write, and assume any accessible data can be exfiltrated. Proper data and action permissioning can neutralize entire classes of prompt injection, as in containerizing model-written code instead of running it on the main app server.

Use CAMEL-style task-based permission restriction where possible

Frameworks like CAMEL infer the minimal permissions needed for a given user request (e.g., only ‘send email’ but not ‘read inbox’ for a simple outbound message) and grant only those. This can block many prompt injection attacks that appear later in untrusted content, though it cannot solve cases where dangerous capabilities (read + write) are legitimately required together.

WORDS WORTH SAVING

5 quotes

AI guardrails do not work. I'm gonna say that one more time. Guardrails do not work.

— Sander Schulhoff

You can patch a bug, but you can't patch a brain.

— Sander Schulhoff

The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure.

— Alex Komoroske, quoted by Lenny and Sander

For a model like GPT-5, the number of possible attacks is one followed by a million zeros… it’s basically an infinite attack space.

— Sander Schulhoff

The reason I'm here is 'cause I think this is about to start getting serious… We're starting to see agents and robotics deployed that can do damage.

— Sander Schulhoff

Fundamental weaknesses of AI guardrails and their attack surfaceJailbreaking vs. prompt injection and real-world examples of eachAdversarial robustness, attack success rate (ASR), and evaluation challengesWhy agents, AI-powered browsers, and robots dramatically raise the stakesLimitations of the current AI security industry and market dynamicsPractical mitigation strategies: permissioning, CAMEL, and classical securityFrontier lab research gaps and long-term directions for AI robustness

High quality AI-generated summary created from speaker-labeled transcript.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.