a16z Big Ideas 2024: AI Interpretability: From Black Box to Clear Box with Anjney Midha
At a glance
WHAT IT’S REALLY ABOUT
AI interpretability shifts from research to scalable engineering for control
- Interpretability is framed as “reverse engineering” AI models so practitioners can answer why models produce certain outputs and how to control them.
- A key shift post-2023 is moving from neuron-by-neuron explanations to “features,” patterns across neurons that map more cleanly to human-meaningful concepts.
- Anthropic’s dictionary-learning work is cited as evidence that feature-level decomposition can separate concepts (e.g., religion vs. biology) that neuron-level analysis cannot disentangle.
- The biggest near-term bottleneck is scaling mechanistic interpretability from toy models to frontier models, turning the challenge primarily into engineering (compute, tooling, and workflows).
- Better interpretability is positioned as a prerequisite for high-stakes deployment (healthcare, finance, defense) and for more empirically grounded AI policy debates rather than worst-case speculation.
IDEAS WORTH REMEMBERING
5 ideas
Feature-level analysis is the new “unit” of interpretability.
Instead of trying to interpret single neurons (which activate in many unrelated contexts), researchers increasingly focus on “features,” consistent activation patterns across many neurons that correspond to clearer concepts.
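To make the idea of “finding features” concrete, here is a minimal dictionary-learning sketch in the style of a sparse autoencoder, assuming PyTorch. The layer width, expansion factor, data, and hyperparameters below are illustrative placeholders, not details from the episode or from Anthropic’s published code.

```python
# Minimal sparse-autoencoder sketch: decompose neuron activations into a larger,
# sparse set of "feature" directions. All sizes and hyperparameters are hypothetical.
import torch
import torch.nn as nn

d_neurons = 512           # width of the model layer being studied (hypothetical)
d_features = 512 * 8      # overcomplete feature dictionary (8x expansion here)

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_neurons, d_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_neurons)   # feature coefficients -> reconstruction

    def forward(self, acts):
        coeffs = torch.relu(self.encoder(acts))  # non-negative, encouraged to be sparse
        recon = self.decoder(coeffs)
        return recon, coeffs

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_neurons)  # stand-in for recorded neuron activations

for _ in range(100):
    recon, coeffs = sae(acts)
    # Reconstruction loss keeps the features faithful to the original activations;
    # the L1 penalty pushes each input to be explained by only a few features,
    # which is what makes the resulting features easier to interpret than raw neurons.
    loss = (recon - acts).pow(2).mean() + 1e-3 * coeffs.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is the shape of the method: instead of asking what one neuron means, you learn an overcomplete dictionary and ask what each sparsely activating feature means.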
2023 marked a step-change: interpretability looks more tractable than before.
Midha characterizes the field as “pre-2023 vs. post-2023,” arguing that recent results provide a concrete path (mechanistic interpretability) rather than many competing, unproven hypotheses.
Toy models are the proving ground for methods that might scale.
Because frontier models are too complex to dissect directly, interpretability research validates approaches on small models first, then attempts to scale once there’s evidence the approach works in principle.
Scaling interpretability is now largely an engineering and compute problem.
Midha argues the core approach is visible at small scale; the remaining work is to industrialize it—build tooling, pipelines, and infrastructure to run these methods on much larger networks.
Two hard scaling bottlenecks are autoencoder expansion and interaction reasoning.
Researchers must scale the “autoencoder” used to surface features (Midha mentions a ~100× expansion factor) and must also interpret the combinatorial interactions among many features when prompts mix sensitive or complex concepts.
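A back-of-the-envelope sketch of why a ~100× expansion factor becomes a compute and engineering problem: the layer width below is a hypothetical figure; only the ~100× expansion factor comes from the discussion.

```python
# Rough arithmetic sketch of the cost of a ~100x feature expansion on one layer.
# The layer width is hypothetical; only the 100x figure is taken from the episode.
layer_width = 10_000             # hypothetical hidden size of a frontier-scale layer
expansion = 100                  # ~100x expansion factor mentioned by Midha
n_features = layer_width * expansion

# Encoder plus decoder weights for an autoencoder of that size:
params = 2 * layer_width * n_features
print(f"features: {n_features:,}")                            # 1,000,000
print(f"autoencoder weights: {params:,}")                     # 20,000,000,000
print(f"memory at 4 bytes/param: {params * 4 / 1e9:.0f} GB")  # ~80 GB for one layer
```

And that is one layer of one model, before training cost or the harder task of reasoning about how a million features interact with each other.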
WORDS WORTH SAVING
5 quotes
AI interpretability... is just a complex way of saying reverse engineering AI models.
— Anjney Midha
As these models begin to be deployed in real-world situations, the big question on everyone's mind is why?
— Anjney Midha
You can break the world of interpretability down into a pre-2023 and a post-2023 world... because there's been such a massive breakthrough.
— Anjney Midha
Interpretability is now an engineering problem as opposed to an open-ended research problem.
— Anjney Midha
If you can explain why the kitchen does something, then you can control what it does, and that makes it much more reliable.
— Anjney Midha