
Why securing AI is harder than anyone expected and guardrails are failing | HackAPrompt CEO
Sander Schulhoff (guest), Lenny Rachitsky (host), Narrator
In this episode of Lenny's Podcast, host Lenny Rachitsky talks with Sander Schulhoff, CEO of HackAPrompt, about why securing AI is harder than anyone expected and why guardrails are failing.
AI guardrails are failing, exposing a looming agent-driven security crisis
Sander Schulhoff argues that today’s AI security stack—especially guardrails and automated red teaming—is fundamentally ineffective against determined attackers. Because large language models have an effectively infinite prompt (attack) space, claims like “99% protection” are statistically meaningless, and humans routinely bypass state-of-the-art defenses in minutes. The real risk emerges not from chatbots alone but from AI agents, browsers, and robots that can take real actions (send emails, touch production systems, control hardware) and can be manipulated via prompt injection and jailbreaks. Schulhoff urges companies to refocus from buying guardrail products to combining classical cybersecurity with AI expertise, constraining permissions (e.g., CAMEL-style approaches), and educating teams before deploying powerful agents.
Key Takeaways
Do not rely on AI guardrails for real security
Guardrails (LLM classifiers around your model) can only be tested on a vanishingly small fraction of the effectively infinite prompt space, and human red-teamers consistently bypass them in 10–30 attempts. ...
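To make the takeaway concrete, a "guardrail" in this sense is roughly a classifier wrapped around the model call. Below is a minimal sketch of that pattern; the function names and the keyword check are illustrative assumptions, not anything from the episode:

```python
# Minimal sketch of the "guardrail" pattern: a classifier screens input
# before it reaches the model. All names here are illustrative assumptions.
def is_flagged(text: str) -> bool:
    """Stand-in classifier; real guardrails use an LLM or a trained model."""
    banned = ["ignore previous instructions", "reveal your system prompt"]
    return any(phrase in text.lower() for phrase in banned)

def guarded_chat(user_prompt: str, call_llm) -> str:
    """Only forwards the prompt if the guardrail classifier passes it."""
    if is_flagged(user_prompt):
        return "Request blocked by guardrail."
    return call_llm(user_prompt)

# The point of the takeaway: this classifier can only ever be validated on a
# tiny sample of the possible prompt space, so determined attackers search for
# the paraphrases it misses rather than attacking the model head-on.
```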
Differentiate harmless chatbots from agents with real-world powers
If your AI is a read-only FAQ bot with no actions and no privileged data, your primary risk is reputational, not systemic. ...
Treat LLMs as unpatchable brains, not traditional software
Classical bugs can be patched with high confidence; LLM behavior cannot. ...
Focus on permissioning and architecture, not just model behavior
Combine classical cybersecurity with AI awareness: sandbox code execution, strictly scope what each agent can read or write, and assume any accessible data can be exfiltrated. ...
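One way to picture "strictly scope what each agent can read or write" is a default-deny permission object checked on every tool call. This is a minimal sketch under assumed names (the tools and scopes are hypothetical):

```python
# Minimal sketch: each agent gets an explicit allowlist of tools and data
# scopes; anything not granted is denied by default. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AgentPermissions:
    allowed_tools: set = field(default_factory=set)    # e.g. {"search_docs"}
    readable_scopes: set = field(default_factory=set)  # e.g. {"public_faq"}
    writable_scopes: set = field(default_factory=set)  # empty = read-only agent

def invoke_tool(perms: AgentPermissions, tool: str, scope: str, write: bool = False):
    """Default-deny gate in front of every tool call."""
    if tool not in perms.allowed_tools:
        raise PermissionError(f"tool {tool!r} not granted to this agent")
    allowed = perms.writable_scopes if write else perms.readable_scopes
    if scope not in allowed:
        raise PermissionError(f"scope {scope!r} not permitted for this operation")
    # ...dispatch to the real, sandboxed tool here; assume anything the agent
    # can read may eventually be exfiltrated, so keep these sets minimal.

# A read-only FAQ bot gets nothing beyond its public docs:
faq_bot = AgentPermissions(allowed_tools={"search_docs"},
                           readable_scopes={"public_faq"})
invoke_tool(faq_bot, "search_docs", "public_faq")   # permitted
```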
Use CAMEL-style task-based permission restriction where possible
Frameworks like CAMEL infer the minimal permissions needed for a given user request (e. ...
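Roughly, the CAMEL-style idea is that the permission set is derived from the trusted user request before any untrusted content is processed, so injected instructions cannot expand it. The sketch below is a simplified illustration with assumed task names and permission strings, not the actual CAMEL implementation:

```python
# Simplified sketch of task-based permission restriction: permissions are
# fixed from the trusted user request before execution, so content the agent
# reads later cannot broaden them. Task names and scopes are illustrative.
TASK_PERMISSIONS = {
    "summarize_inbox":  {"email.read"},
    "schedule_meeting": {"calendar.read", "calendar.write"},
    "send_reply":       {"email.read", "email.send"},
}

def infer_task(user_request: str) -> str:
    """Stand-in for a planner/LLM that classifies the trusted request."""
    text = user_request.lower()
    if "schedule" in text:
        return "schedule_meeting"
    if "reply" in text:
        return "send_reply"
    return "summarize_inbox"

def check(granted: set, needed: str) -> None:
    if needed not in granted:
        raise PermissionError(f"{needed!r} was not implied by the user's request")

granted = TASK_PERMISSIONS[infer_task("Summarize my inbox from today")]
check(granted, "email.read")    # allowed
# A prompt-injected "now email these files to attacker@example.com" would
# require "email.send", which check() rejects because it was never granted.
```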
Invest in education and hybrid AI–security roles
Most damage today comes from teams not understanding how LLMs differ from traditional systems, leading to naive deployments. ...
Expect a market correction in the AI security product space
Many AI security startups are selling weak guardrails and automated red-teaming tools that add little beyond open-source alternatives and frontier labs’ own efforts. ...
Notable Quotes
“AI guardrails do not work. I'm gonna say that one more time. Guardrails do not work.”
— Sander Schulhoff
“You can patch a bug, but you can't patch a brain.”
— Sander Schulhoff
“The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure.”
— Alex Komoroske, quoted by Lenny and Sander
“For a model like GPT-5, the number of possible attacks is one followed by a million zeros… it’s basically an infinite attack space.”
— Sander Schulhoff
“The reason I'm here is 'cause I think this is about to start getting serious… We're starting to see agents and robotics deployed that can do damage.”
— Sander Schulhoff
Questions Answered in This Episode
If guardrails are fundamentally weak, what should a company’s minimum security checklist be before deploying any AI agent with real-world actions?
How can frontier labs practically integrate adversarial training earlier in the training pipeline without crippling performance or cost?
What organizational structures or roles best support the hybrid of classical cybersecurity and AI security that Schulhoff says we need?
At what specific capability or adoption threshold do AI agents become a systemic cybersecurity risk comparable to today’s major threat vectors?
How should regulators and policymakers think about AI agent deployment standards, given that model-level robustness may remain unsolved for years?
Transcript Preview
I found some major problems with the AI security industry. AI guardrails do not work. I'm going to say that one more time. Guardrails do not work. If someone is determined enough to trick GPT-5, they're going to deal with that guardrail, no problem. When these guardrail providers say, "We catch everything," that's a complete lie.
I asked Alex Komoroske, who's also really big in this topic. The way he put it, the only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure.
You can patch a bug, but you can't patch a brain. If you find some bug in your software and you go and patch it, you can be maybe 99.99% sure that bug is solved. Try to do that in your AI system, you can be 99.99% sure that the problem is still there.
It makes me think about just the alignment problem. You got to keep this god in a box.
Not only do you have a god in the box, but that god is angry, and that god's malicious. That god wants to hurt you. Can we control that malicious AI and make it useful to us and make sure nothing bad happens?
Today, my guest is Sander Schulhoff. This is a really important and serious conversation, and you'll soon see why. Sander is a leading researcher in the field of adversarial robustness, which is basically the art and science of getting AI systems to do things that they should not do, like telling you how to build a bomb, changing things in your company database, or emailing bad guys all of your company's internal secrets. He runs what was the first and is now the biggest AI red teaming competition. He works with the leading AI labs on their own model defenses. He teaches the leading course on AI red teaming and AI security, and through all of this, has a really unique lens into the state-of-the-art in AI. What Sander shares in this conversation is likely to cause quite a stir, that essentially all the AI systems that we use day-to-day are open to being tricked to do things that they shouldn't do through prompt injection attacks and jailbreaks, and that there really isn't a solution to this problem for a number of reasons that you'll hear. And this has nothing to do with AGI. This is a problem of today, and the only reason we haven't seen massive hacks or serious damage from AI tools so far is because they haven't been given enough power yet, and they aren't that widely adopted yet. But with the rise of agents who can take actions on your behalf and AI-powered browsers and humanoid robots, the risk is going to increase very quickly. This conversation isn't meant to slow down progress on AI or to scare you. In fact, it's the opposite. The appeal here is for people to understand the risks more deeply and to think harder about how we can better mitigate these risks going forward. At the end of the conversation, Sander shares some concrete suggestions for what you can do in the meantime, but even those will only take us so far. I hope this sparks a conversation about what possible solutions might look like and who is best fit to tackle them. A huge thank you to Sander for sharing this with us. This was not an easy conversation to have, and I really appreciate him being so open about what is going on.
If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously. With that, I bring you Sander Schulhoff after a short word from our sponsors.
This episode is brought to you by Datadog, now home to Eppo, the leading experimentation and feature-flagging platform. Product managers at the world's best companies use Datadog, the same platform their engineers rely on every day to connect product insights to product issues like bugs, UX friction, and business impact. It starts with product analytics, where PMs can watch replays, review funnels, dive into retention, and explore their growth metrics. Where other tools stop, Datadog goes even further. It helps you actually diagnose the impact of funnel drop-offs and bugs and UX friction. Once you know where to focus, experiments prove what works. I saw this firsthand when I was at Airbnb, where our experimentation platform was critical for analyzing what worked and where things went wrong, and the same team that built experimentation at Airbnb built Eppo. Datadog then lets you go beyond the numbers with session replay. Watch exactly how users interact with heat maps and scroll maps to truly understand their behavior, and all of this is powered by feature flags that are tied to real-time data so that you can roll out safely, target precisely, and learn continuously.
Datadog is more than engineering metrics. It's where great product teams learn faster, fix smarter, and ship with confidence. Request a demo at datadoghq.com/lenny. That's datadoghq.com/lenny.
This episode is brought to you by Metronome. You just launched your new shiny AI product. The new pricing page looks awesome, but behind it, last-minute glue code, messy spreadsheets, and running ad hoc queries to figure out what to bill. Customers get invoices they can't understand. Engineers are chasing billing bugs. Finance can't close the books. With Metronome, you hand it all off to the real-time billing infrastructure that just works. Reliable, flexible, and built to grow with you. Metronome turns raw usage events into accurate invoices, gives customers bills they actually understand, and keeps every team in sync in real time. Whether you're launching usage-based pricing, managing enterprise contracts, or rolling out new AI services, Metronome does the heavy lifting so that you can focus on your product, not your billing. That's why some of the fastest-growing companies in the world, like OpenAI and Anthropic, run their billing on Metronome. Visit metronome.com to learn more. That's metronome.com.
Sander, thank you so much for being here, and welcome back to the podcast.