a16z Big Ideas 2024: AI Interpretability: From Black Box to Clear Box with Anjney Midha
CHAPTERS
Big Ideas 2024 overview and why interpretability matters now
Steph Smith frames a16z’s Big Ideas 2024 list and positions “AI moving from black box to clear box” as a key theme. The episode sets up why interpretability is becoming crucial as AI moves from demos into real-world workflows.
Defining AI interpretability: reverse engineering models for ‘why’ and control
Anjney Midha defines interpretability as reverse engineering AI models—understanding why they produce specific outputs. He argues the core questions are why outputs happen, why prompts work differently, and how to control behavior.
The ‘kitchen and cooks’ analogy for black-box behavior
To explain the black box problem, Anjney compares a model to a kitchen with many cooks whose internal debates are invisible to outsiders. You can observe the meal (output) but not the process (internal reasoning), making ‘why’ hard to answer.
From cooks to head chefs: organizing behavior into interpretable concepts
Anjney proposes the breakthrough framing: train ‘head chefs’ that represent meaningful, higher-level concepts (like cuisines) to organize low-level activity. This doesn’t control every unit, but it reveals and influences the major decision drivers.
Pre-2023 vs post-2023: the shift from neurons to features
Anjney describes a step-change in interpretability: moving away from interpreting individual neurons toward “features,” or patterns of activation across many neurons. Features align more consistently with concepts than single neurons do.
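For readers who want a concrete picture of the neuron-vs-feature distinction, here is a minimal sketch (not from the episode; all sizes and values are illustrative): a feature is treated as a direction in activation space, so its activation is a projection across many neurons rather than the reading of any single one.

```python
import numpy as np

# Hypothetical activations for one token across 512 neurons (illustrative).
neuron_acts = np.random.randn(512)

# A "feature" is a direction in activation space, not a single neuron.
# Here we use a random unit-norm direction as a stand-in for one learned
# by a dictionary-learning method.
feature_direction = np.random.randn(512)
feature_direction /= np.linalg.norm(feature_direction)

# The feature's activation is the projection of the neuron activations
# onto that direction; many neurons each contribute a little.
feature_activation = neuron_acts @ feature_direction
print(f"feature activation: {feature_activation:.3f}")
```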
Mechanistic interpretability and Anthropic’s dictionary-learning example
The conversation highlights Anthropic’s paper “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning” as a key milestone. Using small ‘toy’ models as experimental testbeds, researchers identified concrete features that weren’t separable at the neuron level.
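The core mechanism behind that paper is a sparse autoencoder trained on a model’s internal activations. The following is a minimal sketch of that setup, not Anthropic’s actual code; the dimensions and the L1 coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of dictionary learning on activations: encode into an
    overcomplete feature basis, keep it sparse, reconstruct the input."""

    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)   # stand-in for a batch of toy-model activations

recon, features = sae(acts)
l1_coeff = 1e-3               # sparsity penalty strength (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()               # one optimization step would follow
```

The L1 term pushes most feature activations to zero on any given input, which is what makes individual features interpretable as distinct concepts.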
What a ‘feature’ looks like in practice: the ‘God feature’ example
Anjney gives a tangible example: a feature that reliably activates on religious concepts (a “God feature”), distinct from biology/DNA-related features. The point is that neuron-level signals can be mixed, while feature-level analysis separates concepts cleanly.
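In practice, a feature like this is identified by scanning a dataset for the texts that activate it most strongly. A toy sketch of that lookup, with entirely made-up activation values for illustration:

```python
# Hypothetical per-text activations of one learned feature (illustrative).
texts = [
    "The pastor preached about God and faith.",
    "DNA polymerase copies the genome.",
    "She prayed at the temple every morning.",
]
feature_acts = [3.2, 0.1, 2.8]  # illustrative values, not real measurements

# Rank texts by activation: religious passages dominate, biology does not.
for score, text in sorted(zip(feature_acts, texts), reverse=True):
    print(f"{score:4.1f}  {text}")
```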
What the breakthrough unlocks: engineering focus, controllability, reliability
Anjney outlines three implications of feature-based interpretability: it turns interpretability into a scaling/engineering problem, enables more precise control, and increases reliability. Better reliability also supports more grounded governance and reduces speculative fear-based debate.
Why scaling is hard: from toy kitchens to frontier-model complexity
They discuss the gap between small-model demonstrations and frontier systems like GPT-4/Claude-class models with hundreds of billions of parameters. Direct interpretation at that scale is currently intractable, so the challenge becomes scaling proven methods without exploding cost and complexity.
Engineering challenge #1: scaling the autoencoder (massive expansion factors)
Anjney identifies scaling the autoencoder—the component used to extract interpretable features—as a primary hurdle. Moving from small experiments to frontier models may require expansion factors on the order of 100x, which is compute-intensive and demands new efficiency strategies.
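A back-of-envelope calculation shows why. Assuming a hypothetical frontier-scale hidden width (the episode does not give exact numbers), a 100x expansion factor means the autoencoder alone carries tens of billions of parameters:

```python
# Illustrative arithmetic: cost of a ~100x expansion factor.
d_model = 12_288                 # hypothetical frontier-model hidden width
expansion = 100                  # expansion factor discussed in the episode
d_dict = d_model * expansion     # size of the learned feature dictionary

# Encoder + decoder weight matrices of the autoencoder alone:
params = 2 * d_model * d_dict
print(f"dictionary size: {d_dict:,} features")
print(f"autoencoder params: {params / 1e9:.1f}B")  # ~30.2B
```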
Engineering challenge #2: interpreting feature interactions and combinatorial complexity
Beyond extracting features, the next hurdle is understanding how features interact when prompts combine sensitive or complex concepts. As the number of features grows, their interactions create nonlinear complexity that must be interpreted to answer real-world ‘why’ questions.
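To make the combinatorial point concrete, even counting only pairwise interactions (ignoring higher-order ones), the number of combinations grows quadratically with the feature count:

```python
from math import comb

# Illustrative: pairwise interaction count vs. dictionary size.
for n_features in (1_000, 100_000, 1_000_000):
    pairs = comb(n_features, 2)
    print(f"{n_features:>9,} features -> {pairs:,} pairwise interactions")
```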
2024 outlook: rising emphasis on explainability for mission-critical adoption
Anjney expects 2024 to bring more attention, talent, and investment into interpretability and explainability. The motivation is unlocking deployment beyond forgiving consumer use cases into domains like healthcare and finance that require predictability and control.
Closing: building toward ‘clear box’ AI and where to find more Big Ideas
Steph closes by reinforcing the broader Big Ideas series and teasing upcoming topics. The episode ends with a call to explore the full Big Ideas 2024 list and to build.