a16z Big Ideas 2024: AI Interpretability: From Black Box to Clear Box with Anjney Midha
- 0:00 – 1:39
Big Ideas in Tech 2024
- Steph Smith
Precision delivery of medicine, entertainment franchise games absolutely exploding, small modular reactors and the nuclear renaissance, plus AI moving into very complex workflows. Now, these were just a few of the major tech innovations that partners at a16z predicted last year. And our partners are back, and we just dropped our list of over forty plus big ideas for 2024, a compilation of critical advancements across all our verticals, from smart energy grids to crime-detecting computer vision to democratizing miracle drugs like GLP-1s or even AI moving from black box to clear box. You can find the full list of forty plus builder-worthy pursuits at a16z.com/bigideas2024, or you can click the link in our description below. But on deck today, you will hear directly from one of our partners as we dive even more deeply into their big idea. What's the why now? What opportunities and what challenges are on the horizon? And how can you get involved? Let's dive in. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
- Anjney Midha
[on-hold music]
- 1:39 – 2:21
AI Interpretability: From Black Box to Clear Box
- Anjney Midha
My name is Anjney Midha. I'm a general partner here at a16z, and I'm talking to you today about AI interpretability, which is just a complex way of saying reverse engineering AI models. Over the last few years, AI's been dominated by scaling, which is a quest to see what was possible if you threw a ton of compute and data at training these large models. But now, as these models begin to be deployed in real-world situations, the big question on everyone's mind is why? Why do these models say the things they do? Why do some prompts produce better results than others? And perhaps most importantly, how do we control them?
- 2:21 – 4:23
What we do and don't understand about LLM black boxes and interpretability
- Steph Smith
Anjney, I feel like most people don't need convincing that this is a worthwhile endeavor for us to understand these models a little better. But maybe you could share where we're at in that journey. What do we and don't we understand about these LLM black boxes and their interpretability?
- Anjney Midha
You know, it might help to reason by analogy here. If you pretend one of these AI models is like a big kitchen with hundreds of cooks, then when you ask the kitchen to make something, each cook knows how to make certain foods.
- Steph Smith
Mm-hmm.
- Anjney Midha
And when you give the kitchen ingredients and you say, "Hey, go cook a meal," all the different cooks debate about what to make, and eventually they come to an agreement on a meal to prepare based on these ingredients. Now, the problem, where we are in the industry right now, is that from the outside we can't really see what's happening in these kitchens. So you have no idea how they made that decision on the meal.
- Steph Smith
You just get the cake or the taco or whatever it might be. Yeah.
- Anjney Midha
Right. And so if you ask the kitchen, "Hey, why did you choose to make lasagna?" it's really hard to get a straight answer, because the individual cooks don't actually represent a clear concept like a dish or a cuisine. And so the big idea here is: what if you could train a team of head chefs to oversee these groups of cooks, and each head chef would specialize in one cuisine? So you'd have the Italian head chef who controls all the pasta and pizza cooks, and then you have the baking head chef in charge of cakes and pies. And now when you ask why lasagna, the Italian head chef raises his hand and says, "I instructed the cooks to make a hearty Italian meal." And these head chefs represent clear, interpretable concepts inside the neural network.
- Steph Smith
Mm.
- Anjney Midha
And so this breakthrough is like finally understanding all the cooks in that messy kitchen by training these head chefs to organize them into tidy sort of cuisine categories. And we can't control every individual cook, but now we can get insights into the bigger, more meaningful decisions that determine what meal the AI chooses to make.
- Steph Smith
Yeah.
- Anjney Midha
Does that make sense?
- 4:23 – 6:43
Research in interpretability
- Steph Smith
It does, but are you saying that we do actually have a sense now of those, like, head chefs or the people responsible for parts of what might be happening within the AI? Obviously, it's not people in this case, but have we actually unlocked some of that information with some of the new releases or new papers that have come out?
- Anjney Midha
We have. We have. And you can break the world of interpretability down into a pre-2023 and a post-2023 world, in my opinion, because there's been such a massive breakthrough in that specific domain of understanding which cook's doing what. More specifically, what's happening is that these models are made up of neurons, right? A neuron refers to an individual node in the neural network, and it's just a single computational unit. And historically, the industry tried to analyze and interpret and explain these models by trying to understand what each neuron was doing, what each cook was doing in that situation. A feature, on the other hand, which is the new atomic unit the industry is proposing now as an alternative to the neuron, refers to a specific pattern of activations across multiple neurons. And so while a single neuron might activate in all kinds of unrelated contexts, like whether you're asking for lasagna or you're asking for a pastry, a feature represents a specific concept that consistently activates a particular set of neurons. And so to explain the difference using the cooking analogy, a neuron is like an individual cook in the kitchen. Each one knows how to make certain dishes, but doesn't represent a clear concept. A feature would be like a cuisine specialty controlled by a head chef. So for example, the Italian cuisine feature is active whenever the Italian head chef and all the cooks they oversee are working on an Italian dish. And that feature has a consistent interpretation, which in this case is Italian food, while individual cooks do not. And so in summary, these neurons are individual computational units that don't map neatly to concepts. These features are patterns of activations across multiple neurons that do represent clear, interpretable concepts. And so the breakthrough here was that now we've learned how to decompose a neural network into these interpretable features, when previous approaches focused on interpreting single neurons. And so the short
- 6:43 – 8:16
Features represented in the outputs from LLMs
- Anjney Midha
answer is yes, we have a massive breakthrough where we actually now know how to trace what was happening in the kitchen.
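For a concrete sense of what "decomposing a neural network into interpretable features" can look like in practice, here is a minimal sketch of one common approach: dictionary learning with a sparse autoencoder, written in PyTorch. It assumes you already have a batch of activations captured from one layer of a language model; the class names, the ReLU-plus-L1 setup, and the hyperparameters are illustrative assumptions, not the implementation from any particular paper or codebase.

```python
# Minimal sketch of dictionary learning with a sparse autoencoder (SAE).
# The SAE learns to reconstruct a layer's activations through a larger,
# sparsely activating hidden layer; those hidden units are the candidate
# "features" (the head chefs in the analogy). All names are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int = 8):
        super().__init__()
        d_features = d_model * expansion_factor  # overcomplete dictionary
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zeros
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction keeps the dictionary faithful to the original model;
    # the L1 penalty pushes each input to use only a few active features.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

# Usage sketch: `activations` would be a (batch, d_model) tensor captured
# from one layer of a language model while it processes text.
sae = SparseAutoencoder(d_model=512)
activations = torch.randn(64, 512)
reconstruction, features = sae(activations)
loss = sae_loss(reconstruction, activations, features)
```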
- Steph Smith
And maybe could you give an example that's specific to these LLMs when we're talking about a feature? I know there's still so much research to be done, but what's an example of a feature that you'd actually see represented in the outputs from an LLM?
- Anjney Midha
Yeah, that's a great question. So I think if you actually look at the paper that moved the industry forward a bunch earlier this year, it's a paper called Decomposing Language Models with Dictionary Learning. This came out of Anthropic. You know, interpretability is a large field, but this paper took a specific approach called mechanistic interpretability. And the paper has a number of examples of features that they discovered in a very small, almost toy-like model, because smaller models prove to be very useful petri dishes for these experiments. And I think an example of one of these features was a God feature, where, when you talk to the model about religious concepts, a specific God feature fired over and over again. And they found that when they talked to the model about a different type of concept, like biology or DNA, a different feature that was unrelated to the God feature fired, whereas the same neurons were firing for both of those concepts. And so the feature-level analysis allowed them to decompose and break apart the idea
- 8:16 – 11:49
Unlocks in interpretability
- Anjney Midha
or the concept of religion from biology, which is something that wasn't possible to tease apart in the neuron world.
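To make that neuron-versus-feature distinction concrete in code: with a trained autoencoder like the sketch earlier, you could ask which features fire most strongly for two different prompts and check whether they separate, the way the God feature and the DNA feature did. This is only an illustration; the helper get_layer_activations below is hypothetical and stands in for whatever model hooks you actually use, and the prompts are made up.

```python
# Illustrative feature comparison across prompts, reusing the SparseAutoencoder
# sketch above. `get_layer_activations` is a hypothetical helper that runs a
# prompt through some language model and returns one layer's activations as a
# (tokens, d_model) tensor; substitute your own hooks for it.
import torch

@torch.no_grad()
def top_features(sae, activations, k=5):
    _, features = sae(activations)          # (tokens, d_features)
    mean_activation = features.mean(dim=0)  # average over token positions
    values, indices = mean_activation.topk(k)
    return list(zip(indices.tolist(), values.tolist()))

religion_acts = get_layer_activations("The scripture describes God's covenant.")
biology_acts = get_layer_activations("DNA polymerase copies the genome.")

# At the neuron level these prompts light up overlapping units; at the feature
# level you hope to see largely disjoint top features, e.g. one that fires on
# religious text and a separate one that fires on DNA and biology.
print(top_features(sae, religion_acts))
print(top_features(sae, biology_acts))
```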
- Steph Smith
Yeah. Yeah. And maybe you could speak a little bit more to why this is helpful. I mean, maybe it's obvious for folks listening, but now that we have these concepts that we see and maybe can also link pretty intuitively, like, "Oh, okay. I understand biology. Oh, I understand religion as a concept that's coming out of these LLMs." Now that we understand these linkages a little more, what does that mean? Why does this now open things up? Are we in a new environment? You kind of said pre some of these unlocks, and now we're post. What does post look like?
- Anjney Midha
Yeah, this is a great question. So I think there are three big "so whats" from this breakthrough. The first is that interpretability is now an engineering problem as opposed to an open-ended research problem. And that's a huge sea change for the industry, because up until now, there were a number of hypotheses on how to interpret how these models were behaving and explain why, but it wasn't quite concrete. It wasn't quite understood which one of those approaches would work best to actually explain how these models work at very large scale, at frontier model scale. But I think this mechanistic interpretability approach and this paper that came out earlier this year show that the relationships are so easily observable at a small scale that the bulk of the challenge now is to scale up this approach, which is an engineering challenge. And I think that's massive, because engineering is largely a function of the resources and the investment that go into scaling these models, whereas research can be fairly open-ended. And so one big conclusion or takeaway from 2023 is that interpretability has gone from being a research area to being an engineering area. I think the second is that if we actually can get this approach to work at scale, then we can control these models. In the same way that if you understood how a kitchen made a dish and you wanted a different outcome, now you can go to the Italian chef and say, "Could you please make that change next time around?" And so that allows controllability. And that's really important, because as these models get deployed in really important, mission-critical situations like healthcare and finance and defense applications, you need to be able to control these models very, very precisely, which unfortunately today just isn't the case. We have very blunt tools to control these models, but nothing precise enough for those mission-critical situations. So I think controllability is a big piece that this unlocks. And the third is a byproduct of having controllability, which is that once you can control these models, you can rely on them more. And increased reliability means not only good things for the customers and the users and developers using these models, but also, from a policy and regulatory perspective, we can now have a very concrete, grounded debate about which models are safe and which are not, how to govern them, and how to make sure the space develops in a concrete, empirically grounded way, as opposed to reasoning about these models in the abstract without a lot of evidence. I think one of the problems we've had as an industry is that because there hasn't been a concrete way to show or demonstrate that we understand
- 11:49 – 14:10
The engineering challenges
- Anjney Midha
these black boxes and how they work, a lot of the policy work and policy thinking around them is sort of worst-case analysis. And worst-case analysis can be fairly open to fear-mongering and a ton of FUD. I think instead now we have an empirical basis to say, "Here are the real risks of these models, and here's how policy should address them." And I think that's a big improvement, a big advance, for us all.
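The controllability point above can also be sketched in code. If a feature direction has been identified, one could in principle intervene on a layer's activations at inference time and clamp that feature before the model keeps generating. This is only a sketch of the general idea under the same SAE assumptions as before; the hook mechanics and the layer path are simplified placeholders, not any vendor's steering API.

```python
# Sketch: steering a model by clamping a single dictionary feature in a
# layer's output. Assumes an SAE (encoder/decoder) like the earlier sketch and
# a PyTorch module whose output is a plain tensor; all names are illustrative,
# and real transformer blocks often return tuples that need unpacking first.
import torch

def make_steering_hook(sae, feature_index: int, new_value: float):
    def hook(module, inputs, output):
        features = torch.relu(sae.encoder(output))
        # Pin one (hopefully interpretable) feature, e.g. the "Italian
        # cuisine" head chef in the analogy, to a chosen activation level.
        features[..., feature_index] = new_value
        # Returning a tensor from a forward hook replaces the module's output.
        return sae.decoder(features)
    return hook

# Usage sketch (model, layer path, and feature index are placeholders):
# handle = model.transformer.h[10].register_forward_hook(
#     make_steering_hook(sae, feature_index=1234, new_value=5.0))
# ...generate text and observe the change in behavior...
# handle.remove()
```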
- Steph Smith
Totally. I mean, it's huge. And it's kind of interesting, because we don't know every little piece of physics, but we're able to deploy it in extremely effective ways and build all of the things around us through that understanding that has grown over time. And so it's really exciting that these early building blocks are getting in place. Maybe you can just speak to that engineering challenge, or the flippening that you said happened, where we previously had a research challenge, which was somewhat TBD. When is this gonna be unlocked? How is it going to be unlocked? And now we have, again, those early building blocks where we're now talking about scale. And I'll just read out a quick tweet from Chris, who I believe is on the Anthropic team, and he said, "If you asked me a year ago, superposition would've been by far the reason I was most worried that mechanistic interpretability would hit a dead end. I'm now very optimistic. I'd go as far as saying it's now primarily an engineering problem, hard, but less fundamental risk." So I think that captures what you were just mentioning, but maybe you can speak a little bit more to the papers and the size, or the scope, that they've done this feature analysis within, and what the steps would be to do this when we're talking about those much, much larger foundational models.
- Anjney Midha
Yeah, that's a great question. So I think, stepping back, the way science in this space is often done is, you know, you start with a small, almost toy-like model of your problem, see if some solution is promising, and then you decide to scale it up to a bigger and bigger level, because if you can't get it to work at a really small scale, rarely do these systems work at large scale. And so while of course the holy grail challenge with interpretability is explaining how frontier models work, the GPT-4s
- 14:10 – 17:27
Scaling mechanistic interpretability research
- Anjney Midha
and Claude 2s and Bards of the world, which are several hundred billion parameters in scale, I think that one of the challenges with trying to attack interpretability of those models directly is that they're so large and such complex systems that it is very, very intractable to try to tease apart all the different neurons in these models at that scale. Now, I should be clear, it's not easy. And there are a ton of unsolved problems in the scaling part of this journey as well.
- Steph Smith
Yeah. If I could just interrupt real quick. I mean, you mentioned the scaling laws, and those have continued to scale, but we didn't necessarily know if that would be the case. It has, of course, proven to be the case as we move forward, but what are the challenges that you see that might be outstanding as we look to scale up some of this mechanistic interpretability research? What open challenges do you see on that path?
- Anjney Midha
Ah, okay. So yeah. To borrow our analogy of the kitchen from earlier, I think we as an industry now have a model of what's going on, and some proof of what's going on with these features, in a kitchen which has, let's say, three or four chefs.
- Steph Smith
Yeah.
- Anjney Midha
And so to figure out if this would work at frontier scale, where you have thousands and thousands of chefs in each kitchen, and in the case of a model, you have, you know, billions of parameters-
- Steph Smith
Mm-hmm.
- Anjney Midha
... I think there are two big open problems that need to be solved in order for this approach to work at scale. The first is scaling up the autoencoder, which conceptually you can think of as the model that makes sense of what's going on with each feature.
- Steph Smith
Mm-hmm.
- Anjney Midha
And the autoencoder here is pretty small in the paper that came out in October, and so I think there's a big challenge where the researchers in the space have to figure out how to scale up the autoencoder on the order of almost a 100X expansion factor, and that's a lot.
- Steph Smith
Yeah.
- Anjney Midha
And that's pretty difficult, because training the underlying base model itself often requires hundreds of millions, and often billions, of dollars' worth of compute. And so I do think it's a fairly difficult and compute-intensive challenge to scale the autoencoder. Now, I think there are a ton of promising approaches on how to do that scaling without needing tons and tons of compute, but those are pretty open-ended engineering problems. I think the second is to actually scale the interpretation of these-
- Steph Smith
Mm.
- Anjney Midha
... of these networks. And so as an example, you know, if you find all the neurons and all the features related to, let's say, pasta or Italian cuisine, and then you have a separate set of features that map to pastries, right? Now the question is how do you answer a complex query? Say you ask the AI a provocative question about whether people of a certain ethnicity enjoy Italian cuisine or not, right?
- 17:27 – 22:00
A new focus on explainability
- Anjney Midha
You need to figure out how those two features actually interact with each other at some meaningful scale.
- Steph Smith
Yeah. Mm-hmm.
- Anjney Midha
And that is a pretty difficult challenge to reason about too, and I think that's the second big open problem that the researchers call out in their work. And so I think the combinatorial complexity of each of those sets of features interacting with each other at increasing scales is a nonlinear increase in complexity that has to be interpreted. And so these are sort of the two-
- Steph Smith
Yeah.
- Anjney Midha
... big, at least at the moment, these are the two clear engineering problems that need to be solved: scaling up the autoencoder and scaling up the interpretation. But there's probably a long tail of questions as well that I'm not addressing here; those are sort of the two big ones.
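To give a rough sense of why these two problems bite at frontier scale, here is a back-of-the-envelope sketch. The model width, the 100X expansion factor applied per layer, and the pairwise-interaction framing are illustrative assumptions, not figures from the conversation or from any paper.

```python
# Back-of-the-envelope sketch of the two scaling problems, with assumed numbers.

d_model = 12_288        # residual-stream width of a large frontier model (assumed)
expansion_factor = 100  # roughly the ~100X dictionary expansion discussed above
d_features = d_model * expansion_factor

# Problem 1: the autoencoder itself gets big. Encoder plus decoder weights:
sae_params = 2 * d_model * d_features
print(f"SAE parameters per layer: {sae_params:,}")  # ~30 billion per layer

# Problem 2: interpreting how features interact. Even just the pairwise
# combinations of features grow quadratically with the number of features:
pairwise = d_features * (d_features - 1) // 2
print(f"Pairwise feature interactions per layer: {pairwise:,}")  # ~7.5e11
```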
- Steph Smith
How does this change the game? And maybe you could speak to what you're excited for specifically coming into 2024 as it relates to mechanistic interpretability.
- Anjney Midha
Yeah. So to be clear, I'm excited about all kinds of interpretability, or explainability. I'm broadly very excited about 2024 as the first time, or at least the year, when the most interest and attention is being paid to explainability. You know, the last few years the attention was all on the how and the what. People are just incredulous at the capabilities of these models. Can we get them to be smarter? Can we get them to reason about entirely new topics that maybe weren't in the original pre-training data set? And that's been totally reasonable. But I think the why of these models, explaining how they work, has been the big blocker on these models getting deployed outside of just a few consumer use cases, where the costs of the model not being as reliable or as steerable are low. And so low-precision environments, consumer use cases where people are more forgiving and tolerant of mistakes by the model and so on, are largely where the bulk of the value has been generated in AI today. But I think if you wanna see these models take over some of the most impactful parts of our lives that they currently aren't deployed in, things like healthcare, those mission-critical situations require a lot more reliability and predictability, and that's what interpretability ultimately unlocks. If you can explain why the kitchen does something, then you can control what it does, and that makes it much more reliable, and therefore it's gonna be used in more and more situations, in more use cases, and in more and more impactful customer journeys, where today a lot of the models don't actually make the cut.
- Steph Smith
Yeah. No, it's so true. Actually, something that also just dawned on me as you were talking is that almost everything in this world has a margin for error, right? There is error inherent in most things. However, if you can understand, if you can explain that error and constrain it to something that other people can get behind, it's just much more likely that people will wanna engage with that thing, because they can at least understand what is coming out of it. And so, yeah, I feel like that picture is very compelling, and I hope we can get there.
- Anjney Midha
I hope so too. I think, you know, to be clear, we're not there yet, but we've got the glimmers now of approaches that might work. And what I'm excited about for 2024 is a lot more investment, a lot more energy, a lot more of the best researchers in this space spending their time on interpretability.
- Steph Smith
Yeah. Well, we have some of the smartest people in the world working on AI, and we saw how quickly things moved in 2022 and 2023. So hopefully in 2024, some of this interpretability work moves just as quickly.
- Anjney Midha
I hope so. I've got my fingers crossed. [chuckles]
- Steph Smith
All right. I hope you enjoyed this Big Idea. We do have a lot more on the way, including a new age of maritime exploration that takes advantage of AI and computer vision, plus AI versus games that never end, and whether voice-first apps may finally be having their moment. By the way, if you wanna see our full list of forty-plus Big Ideas today, you can head on over to a16z.com/bigideas2024. It's time to build. [upbeat music]