a16z Big Ideas 2024: AI Interpretability: From Black Box to Clear Box with Anjney Midha
At a glance
WHAT IT’S REALLY ABOUT
AI interpretability shifts from research to scalable engineering for control
- Interpretability is framed as “reverse engineering” AI models so practitioners can answer why models produce certain outputs and how to control them.
- A key shift post-2023 is moving from neuron-by-neuron explanations to “features,” patterns across neurons that map more cleanly to human-meaningful concepts.
- Anthropic’s dictionary-learning work is cited as evidence that feature-level decomposition can separate concepts (e.g., religion vs. biology) that neuron-level analysis cannot disentangle.
- The biggest near-term bottleneck is scaling mechanistic interpretability from toy models to frontier models, turning the challenge primarily into engineering (compute, tooling, and workflows).
- Better interpretability is positioned as a prerequisite for high-stakes deployment (healthcare, finance, defense) and for more empirically grounded AI policy debates rather than worst-case speculation.
IDEAS WORTH REMEMBERING
5 ideas
Feature-level analysis is the new “unit” of interpretability.
Instead of trying to interpret single neurons (which activate in many unrelated contexts), researchers increasingly focus on “features,” consistent activation patterns across many neurons that correspond to clearer concepts.
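To make the idea of “finding features” concrete, here is a minimal dictionary-learning sketch in the style of a sparse autoencoder, assuming PyTorch. The layer width, expansion factor, data, and hyperparameters below are illustrative placeholders, not details from the episode or from Anthropic’s published code.

```python
# Minimal sparse-autoencoder sketch: decompose neuron activations into a larger,
# sparse set of "feature" directions. All sizes and hyperparameters are hypothetical.
import torch
import torch.nn as nn

d_neurons = 512           # width of the model layer being studied (hypothetical)
d_features = 512 * 8      # overcomplete feature dictionary (8x expansion here)

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_neurons, d_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_neurons)   # feature coefficients -> reconstruction

    def forward(self, acts):
        coeffs = torch.relu(self.encoder(acts))  # non-negative, encouraged to be sparse
        recon = self.decoder(coeffs)
        return recon, coeffs

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, d_neurons)  # stand-in for recorded neuron activations

for _ in range(100):
    recon, coeffs = sae(acts)
    # Reconstruction loss keeps the features faithful to the original activations;
    # the L1 penalty pushes each input to be explained by only a few features,
    # which is what makes the resulting features easier to interpret than raw neurons.
    loss = (recon - acts).pow(2).mean() + 1e-3 * coeffs.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is the shape of the method: instead of asking what one neuron means, you learn an overcomplete dictionary and ask what each sparsely activating feature means.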
2023 marked a step-change: interpretability looks more tractable than before.
Midha characterizes the field as “pre-2023 vs. post-2023,” arguing that recent results provide a concrete path (mechanistic interpretability) rather than many competing, unproven hypotheses.
Toy models are the proving ground for methods that might scale.
Because frontier models are too complex to dissect directly, interpretability research validates approaches on small models first, then attempts to scale once there’s evidence the approach works in principle.
Scaling interpretability is now largely an engineering and compute problem.
Midha argues the core approach is visible at small scale; the remaining work is to industrialize it—build tooling, pipelines, and infrastructure to run these methods on much larger networks.
Two hard scaling bottlenecks are autoencoder expansion and interaction reasoning.
Researchers must scale the “autoencoder” used to surface features (Midha mentions a ~100× expansion factor) and must also interpret the combinatorial interactions among many features when prompts mix sensitive or complex concepts.
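A back-of-the-envelope sketch of why a ~100× expansion factor becomes a compute and engineering problem: the layer width below is a hypothetical figure; only the ~100× expansion factor comes from the discussion.

```python
# Rough arithmetic sketch of the cost of a ~100x feature expansion on one layer.
# The layer width is hypothetical; only the 100x figure is taken from the episode.
layer_width = 10_000             # hypothetical hidden size of a frontier-scale layer
expansion = 100                  # ~100x expansion factor mentioned by Midha
n_features = layer_width * expansion

# Encoder plus decoder weights for an autoencoder of that size:
params = 2 * layer_width * n_features
print(f"features: {n_features:,}")                            # 1,000,000
print(f"autoencoder weights: {params:,}")                     # 20,000,000,000
print(f"memory at 4 bytes/param: {params * 4 / 1e9:.0f} GB")  # ~80 GB for one layer
```

And that is one layer of one model, before training cost or the harder task of reasoning about how a million features interact with each other.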
WORDS WORTH SAVING
5 quotes
AI interpretability... is just a complex way of saying reverse engineering AI models.
— Anjney Midha
As these models begin to be deployed in real-world situations, the big question on everyone's mind is why?
— Anjney Midha
You can break the world of interpretability down into a pre-2023 and a post-2023 world... because there's been such a massive breakthrough.
— Anjney Midha
Interpretability is now an engineering problem as opposed to an open-ended research problem.
— Anjney Midha
If you can explain why the kitchen does something, then you can control what it does, and that makes it much more reliable.
— Anjney Midha