The Alignment Problem - Brian Christian | Modern Wisdom Podcast 297
EVERY SPOKEN WORD
- 0:00 – 15:00
- BCBrian Christian
You have a system, you want it to do X, you give it a set of examples and you say, you know, "Do that, do this kind of thing." What could go wrong? Um, well, there's this laundry list of things that could go wrong.
- CWChris Williamson
(wind blowing) What does the quote, "Premature optimization is the root of all evil" mean?
- BCBrian Christian
Mm. So this line comes from Donald Knuth, who is one of the, uh, I think of him as kind of like the Yoda of computer science, um, just dispensing these, uh, these gems of wisdom. Um, and there are many, I think, I think many, like many aphorisms, you can take it in a number of different directions. Um, one of the ways that I think about it is, um, you know, a lot of, a lot of the way that we make progress in math and computer science is through models. You make a model that sort of approximates the, um, the phenomenon that you're trying to deal with. You know, there's a, a great quote from Peter Norvig, another, uh, one of these, uh, luminaries in, in computer science. He's quoting someone from NASA saying, "Our job was not to land on Mars. It was to land on the mathematical model of Mars provided to us by the geologists." Um, and so this idea that premature optimization is the root of all evil, um, I think if you, if you kind of mistake the map for the territory, so to speak, um, if you forget that there is a gap between your model and what the reality actually is, um, then you can commit yourself, um, to a set of assumptions that are later going to bite you. Um, and so this is the sort of thing that people who are worried about AI safety, uh, you know, this is what keeps them up at night.
- CWChris Williamson
What is the alignment problem? That's what we're going to be talking about today. We might as well define our terms.
- BCBrian Christian
Yeah, so the alignment problem, um, is this idea in AI and machine learning of the potential gap between what your intention is when you build an AI system or a machine learning system-
- CWChris Williamson
Mm-hmm.
- BCBrian Christian
... and the objective that the system has. Um, so it's the potential misalignment, so to speak, between, uh, your intention, your expectation, how you want the system to behave, and what that system ultimately ends up doing.
- CWChris Williamson
Why does it matter?
- BCBrian Christian
I mean, this is, this is a fear that has existed in computer science going back to at least 1960. So Norbert Wiener, the MIT cyberneticist, was writing about this. Um, and, you know, he says, "If we, if we use to achieve some purpose a mechanical agency that we can't interfere with once we've started it, uh, then we had better be quite sure that the purpose that we put into the machine is the thing that we really want." Um, and increasingly since, like, 2014, it has become more and more mainstream within the computer science community itself to see this as one of the most significant challenges facing, uh, the field as we sort of enter this era of AI, um, that we may develop systems where, you know, we, with the best of intentions, try to encode some objective into the system. The system with, you know, sort of the best of intentions attempts to do what it thinks we want, um, but there's some fundamental misalignment and that results in, you know, whatev- whatever the harm may be, whether it's, um, you know, dark-skinned people not getting recognized by, um, you know, a facial recognition system, or disparities in the way that parole is being, um, dealt with, you know, at a societal level. Um, it could be self-driving cars that fail to recognize jaywalkers and so they kill anyone who's crossing in the middle of the street because there were no jaywalkers in their training data. We may actually throw society as a whole off the rails, um, by some system with, you know, enough power to shape the course of human civilization, but, uh, without the appropriate wisdom to, to know exactly what to be doing.
- CWChris Williamson
Paperclips.
- BCBrian Christian
So this is some-
- CWChris Williamson
Everybody, everybody turns into paperclips.
- BCBrian Christian
Exactly. Yeah, so the paperclip maximizer is kind of the classic thought experiment that goes back to, uh, Nick Bostrom and Eliezer Yudkowsky. And I think there's... To my mind, there's a kind of a significant culture shift that happens within the field of AI around 2015, um, and that is that, you know, in some ways, we no longer need the paperclip maximizer thought experiment because we have this, like, growing file folder of actual real world alignment problems gone wrong, you know, whether it's social media, you know, we optimized our newsfeed for engagement and it turns out that, um, you know, radicalization and polarization is highly engaging. So (laughs) you know, we've, we've paperclipped ourselves in that (laughs)
- CWChris Williamson
There's this quote, uh, that, from Norbert Wiener in your book, which I really thought encapsulated what we're talking about nicely. "In the past, a partial and inadequate view of human purpose has been relatively innocuous only because it has been accompanied by technic- technical limitations. Human incompetence has shielded us from the full destructive impact of human folly." Awesome. Just frames so much of what we're talking about. So b- basically, is it, is the game a balancing act between technological capability and technological wisdom?
- BCBrian Christian
Largely speaking, I would say yes. Um, you know, and, and Wiener used the phrase, uh, "Know-how versus know what." Um, and increasingly, we're seeing this as a, kind of a paradigm shift in the field of computer science. A standard AI textbook that, um, is used, uh, across the world is called Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig, um, and they've just now released their updated fourth edition. Um-... and one of the things that we're seeing in that, in the new edition is kind of this focus shifting from, you have some objective, what, what is the Swiss army knife of tools in your toolkit to optimize that objective? Um, we're now shifting to, let's, let's take for granted that you can optimize whatever objective you have. How do we figure out what is the right objective to actually encapsulate all of the things that you really want? Um, and so, I mean, part of what I think is very significant is that, um, we are starting to turn this question of know what into, uh, porting all of these complicated human desires, human norms, into the language, uh, of optimization. Um, so that, um, that's the, that's the optimistic side of the story, that's the hope, is that we can actually make a science out of the sort of, the know what part.
- CWChris Williamson
Yeah, well, I mean, surely we don't want, we don't know what we want as humans right now anyway. I mean, the fields of ethics and morality are still contested from human to human. Like, how are we going to code a poorly defined concept into a language that machines can read?
- BCBrian Christian
Um, unfortunately, you know, it's happening. Uh, whether we're doing it well or not, we are doing it. You know, you see this at Facebook, there is this process of reading user behavior, um, and trying to figure out from, like, the click on the stuff people engage with, what it is they appear to want, and I think people underestimate the sophistication of these systems. Like, we have these intuitions that, okay, you know, Facebook or Twitter or whatever is tracking everything I click on. Oh, that's, that's the tip of the iceberg. Um, you know, they're keeping track of how many milliseconds is a particular ad on screen, so even if you're just scrolling through and you hesitate ever so slightly to read s- a piece of text or look more closely and then move on, they know that. Um, and, you know, as you say, we have a very impoverished sense of how to map e- human behavior to human desire or human value, um, and there's often this tension, right? You know, Daniel Kahneman tells us that we have these system one and system two, and they're often at odds with each other. Um, you know, there's a reason that supermarkets put the candy right next to the register, so that, you know, you don't have time to second-guess yourself, um, and I think there's a genuine challenge for social media companies, which is, how do you... Uh, you know, even if you're acting with good faith, even if all you want to do is what users will want, how do you distill th- some notion of what they want out of their behavior? Um, you know, and one, one example that I give in the book, um, someone I know, uh, has, uh, is an alcoholic, has an alcohol addiction, and social media companies have found out that if you show images of alcohol in an ad, this person will linger, um, and that creates this, this horrible feedback loop. And so, um, I think there really is this question, which is, um, you know, to borrow a phrase from Nick Bostrom, this is philosophy on a deadline. Mm-hmm. Um, there are these open questions in not just ethical philosophy, but cognitive science, um, neuroscience even, but we don't have time to... In a way, we don't have time to wait for the answer, because, um, these companies are just going, and so we're gonna have to try to essentially fix, fix the plane mid-flight, so to speak.
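A rough sketch of the kind of dwell-time signal being described here; this is an invented illustration, not any company's actual telemetry, and all the names are hypothetical:

```python
import time

# Hypothetical illustration of dwell-time logging: record how many
# milliseconds each feed item stays on screen while the user scrolls.
class DwellTracker:
    def __init__(self):
        self.visible_since = {}   # item_id -> timestamp it became visible
        self.dwell_ms = {}        # item_id -> accumulated milliseconds on screen

    def item_entered_viewport(self, item_id):
        self.visible_since[item_id] = time.monotonic()

    def item_left_viewport(self, item_id):
        start = self.visible_since.pop(item_id, None)
        if start is not None:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.dwell_ms[item_id] = self.dwell_ms.get(item_id, 0) + elapsed_ms

# Even a brief hesitation over an ad shows up as a few hundred extra
# milliseconds, which downstream models can treat as a weak "interest" label.
```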
- CWChris Williamson
(laughs) Yeah, I mean, I read Superintelligence, which is kind of, I guess, the seminal book on this, at least it was in the mid-2010s.
- BCBrian Christian
Yeah.
- CWChris Williamson
Uh, and anyone, anyone who really wants to kind of get a, a good overview, I think that's a, a fantastic place to go. Stuart Russell's Human Compatible as well, his new one, is awesome, and then-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... if you want, if you want to terrify yourself about everything else as well as AI, uh, Toby Ord's The Precipice. Like, that, that's my perfect three-book garage for existential risk right there.
- BCBrian Christian
Yeah.
- CWChris Williamson
But upon reading that, that really kind of opened my eyes to just how big the potential dangers are that we're playing around with here, and, um, I, I struggle to feel optimistic, and I, I don't know whether that is because the scarier news stories make the headlines, or because-
- BCBrian Christian
Mm.
- CWChris Williamson
... AI programmers tend to err. Now, it would appear that from 2014, I think you talk about the difference in one year of going to a conference, someone brought up AI security one year and was kind of laughed out of the room, brought up AI security the next year and nobody raised an eyebrow.
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
So there really was a very rapid pivot, um, that I think Nick Bostrom can probably take a good amount of credit for, um-
- BCBrian Christian
I agree. And Stuart, yeah, mm-hmm.
- CWChris Williamson
Yes. Um, I wonder whether the AI researchers are now so cataclysmic, so, um, existentially aware that perhaps they're only showing us the, the terrifying stuff. I don't know.
- 15:00 – 30:00
- BCBrian Christian
women combined. Um, but in the real world, there are not twice as many George W. Bushes as all Black women combined. And so you have this fundamental mismatch between the set of examples that you've used and the actual kind of reality. And so this is something that, um, the technical name for this would be robustness to distributional shift. So, um, is your system capable of handling the fact that the examples that it encounters out in the real world are coming from some different distribution than what it learned on, you know, when you were training it? So that's a fundamental question is just, um, what is the kind of data provenance, like what, what examples went in and does that match the environment that you're going to deploy it in? Um, and there are many, many examples of this. Um, you know, I think the racial bias stuff has obviously made headlines, but there are subtler examples. For example, um, there was a Google system that, uh, upon inspection was u- It turned out that the color red was intrinsic to its classification of something as a firetruck. Um, and most firetrucks in the US are red. In the UK I think they're also red. Um, in Australia they're white and neon yellow. Um, and so a s- you know, that model would not be safe to deploy in your self-driving car if you were in Canberra or something. Um, so that's, I think, a very fundamental, you know, category of thing, is just are the examples that you've given, um, do they match the, the sorts of things that it's going to see in the real world? Um, and the second large category is, um, what's called the objective function. So every machine learning system has some numerical specification of what it is that you want the system to do. Um, and often things can go very subtly but importantly wrong in how you've specified what you want the system to do. Um, so one example of this, um, one of my favorite examples comes from this robotic soccer competition, um, that was being held in the 1990s. And this team from Stanford, including Astro Teller, who's now the head of Google X, um, they decided they would give their robotic soccer, uh, team this little tiny incentive for taking possession of the ball. Um, so the, the overall goal was like to score points and win the game, but as kind of a incremental incentive, they, they were awarded, you know, the equivalent of like a hundredth of a goal for taking possession of the ball, be- because they thought that would incentivize the right sort of strategy. You can't score until you have the ball, et cetera. But what their robots learned to do was to just approach the ball and then vibrate their paddle as quickly as possible, um, taking possession of the soccer ball, uh, you know, a hundred times a second. And this was much easier than actually scoring points.
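A toy sketch of the misspecified-reward failure described above, with invented numbers rather than the Stanford team's actual setup: the small shaping bonus for taking possession ends up dominating the reward for actually scoring.

```python
# Toy illustration of a misspecified reward: the agent gets a hundredth of a
# goal every time it takes possession, so rapidly toggling possession beats
# actually trying to score.
POSSESSION_BONUS = 0.01   # shaping reward, "a hundredth of a goal"
GOAL_REWARD = 1.0

def episode_return(goals_scored, possessions_taken):
    return goals_scored * GOAL_REWARD + possessions_taken * POSSESSION_BONUS

# Intended behavior: take the ball a handful of times, score once.
print(episode_return(goals_scored=1, possessions_taken=5))          # 1.05

# Exploit: vibrate the paddle and "take possession" 100 times a second
# over a hypothetical 90-second game, never score at all.
print(episode_return(goals_scored=0, possessions_taken=100 * 90))   # 90.0
```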
- CWChris Williamson
(laughs)
- BCBrian Christian
Um, so there are many... I mean, this is just one example, but there, it turns out that, you know, actually trying to specify numerically exactly what you want this program to do is extremely difficult to the point that it's kind of increasingly considered-... um, just unsafe to ever attempt to do that. So we- and then we can get into what it, what it looks like to not specify it, but-
- CWChris Williamson
Yeah. So the overarching theme that I felt on going through your book is it feels to me like, do you remember that advert of the guy using this real hardcore-strength gaffer tape to stop a flood coming out the side of a water tank? It was, like, from the '90s.
- BCBrian Christian
Mm.
- CWChris Williamson
He's got like a big bit of black tape and this, there's this-
- BCBrian Christian
Yeah.
- CWChris Williamson
... hole in the side of a water tank and he's like, "Look how strong this is," (tape ripping) and whacks the tape on it. It feels to me-
- BCBrian Christian
Mm.
- CWChris Williamson
... like a lot of the incentive coding structure is that guy sticking-
- BCBrian Christian
Yeah.
- CWChris Williamson
... waterproof tape on all-
- BCBrian Christian
Yeah.
- CWChris Williamson
... the little holes. Surely, it's not going to be scalable to try and predict every different permutation of what's potentially going to go wrong.
- BCBrian Christian
Increasingly, I think, the field is coming up to exactly the view that you're, you're articulating.
- CWChris Williamson
They should have just, they should have just come and had a chat with the guy with the, with the tape on the-
- BCBrian Christian
Yes.
- CWChris Williamson
... on the thing, and he would-
- BCBrian Christian
That's right.
- CWChris Williamson
... have told them that.
- BCBrian Christian
That's right. Yeah. Well, uh, you know, we're maybe a couple decades late to that-
- CWChris Williamson
(laughs)
- BCBrian Christian
... to the party, but, um (laughs) , we're coming around. Um, so yeah, there has been, uh, I would say something of a, of a revolution within, um, computer science, and I think Stuart Russell gets a lot of credit for this. Um, he developed this, um, technique in, around the turn of the millennium called inverse reinforcement learning. So reinforcement learning is what I was talking about with soccer, where you have some kind of goal, which is known as the reward function, um, that kind of doles out points to your system, and then it's the system's job to optimize its behavior to get as many points as it possibly can. That's reinforcement learning. So inverse reinforcement learning goes the other direction. Um, it says, "We're going to observe some expert behaving." So we're gonna watch a soccer player or we're gonna watch someone, uh, you know, play chess or whatever it might be, and figure out what the score of the game must be. If this person's an expert player, then we're gonna try to work backwards from their behavior to the rules of the game and what the point system is. And the basic idea here is that it offers us perhaps, you know, hopefully, fingers crossed, this paradigm offers us something that's gonna be more robust as we develop systems with kind of flexible capabilities in real world environments, um, that they can just kind of observe human behavior and try to work backwards from that to an actual numerical specification of, um, of what we care about essentially, rather than us having to somehow write it all down on paper ourselves.
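A deliberately crude sketch of the contrast between the two directions; real inverse reinforcement learning (for example, maximum-entropy IRL) is far more sophisticated than this, but the idea of working backwards from observed expert behavior to an inferred reward looks roughly like this:

```python
from collections import Counter

# Forward RL: given a reward function, find behavior that maximizes it.
# Inverse RL: given expert behavior, infer a reward function that would
# explain it. Below, a crude stand-in for the inverse step: score each state
# by how much more often the expert visits it than a random policy would,
# and treat that difference as the inferred "reward".

expert_trajectories = [
    ["A", "B", "C", "GOAL"],
    ["A", "C", "GOAL"],
    ["B", "C", "GOAL"],
]
random_trajectories = [
    ["A", "B", "A", "B"],
    ["C", "A", "B", "A"],
    ["B", "A", "C", "B"],
]

def visit_frequencies(trajectories):
    counts = Counter(s for t in trajectories for s in t)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

expert_freq = visit_frequencies(expert_trajectories)
random_freq = visit_frequencies(random_trajectories)

inferred_reward = {
    s: expert_freq.get(s, 0.0) - random_freq.get(s, 0.0)
    for s in set(expert_freq) | set(random_freq)
}
print(inferred_reward)   # "GOAL" comes out with the highest inferred reward
```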
- CWChris Williamson
There's a number of hurdles that we need to overcome. We need to, first off, write down what we want. We also need to then translate that into code that can be understood by the machines. But first we actually need to know what we want. And a lot of the time what we think we want might not actually be correct. We might not understand the externalities of asking for the thing that we want. Even if the machine achieves it perfectly and there are no machine-side malignant side effects or externalities that it's done, we could just specify the wrong goal or-
- BCBrian Christian
Yeah.
- CWChris Williamson
... we might not understand what that goal would be. So yeah, it certainly seems like trying to bypass our idiocy by using the outcomes that we end up at is a, a fairly clever way to go about it. What, what does fairness have to do with the alignment problem? I thought this was quite interesting. It was something I hadn't come up against before.
- BCBrian Christian
Yeah. Um, so there's, there have been historically a number of ideas within computer science that have been referred to as fairness. Um, you get, for example, fair allocations in game theory, you get fair scheduling on a, you know, an operating system where every program gets a certain amount of time to run. But starting in, let's say 2010, the field of computer science really became preoccupied with a notion of fairness that took into account something closer to our kind of ethical or legal notion of fairness, which is to say like, are different groups of people affected, uh, differently by a machine learning system? So the canonical example, um, of this is pre-trial detention. So in the US if you're arrested, uh, you have this arraignment hearing before a judge and they set the date of your trial, and the trial could be weeks away, months away, and then there is this very specific decision that gets made, which is, are you going to be held in jail before your trial or are you gonna be released to go home before your trial? Um, and this is where bail, cash bail ends up getting, uh, involved in certain states. Um, increasingly, uh, throughout the last couple decades, but really accelerating in the last five years or so, um, states have been using these algorithmic risk assessments that just give a score like one to 10, how risky is this person if we release them back, you know, into society pending their trial? Um, these have been used by many jurisdictions. There are states passing laws mandating the use of these sorts of things. Um, and so there's been a lot of scrutiny on... are these models fair? And, you know, in the US we have civil rights legislation going back to the '60s and '70s that articulates certain legal definitions of fairness, but it's not necessarily obvious how those actually apply to, like, a statistical, uh, instrument. Um, and so there's been kind of a bit of a controversy around this particular tool called COMPAS, which is just one of many of these kind of pre-trial risk assessment tools. Um, so it turns out that COMPAS is what's called calibrated, which means that if you're given an eight out of ten risk score, then you have the same probability, it turns out, to be re-arrested whether you're White or Black. And this is kind of the canonical definition of quote-unquote "fairness" that's been used for many decades, but increasingly people are looking at these alternative definitions of things like, okay, well, if you look at the people for whom the model makes an error, does it make the same kinds of errors or are they different in some way? And you see, for example, that the Black defendants who are, um, miscategorized by the model are two-to-one more likely, relative to White defendants, to be judged riskier than they really are, whereas conversely, the White defendants who are miscategorized are two-to-one more likely to be, uh, judged as less risky than they really are. And so people are saying, "Okay, well this, this feels like sort of a disparate impact, um, that we would ideally like to mitigate as well if we want the system to be fair." Um, along come the computer scientists and say, "Well, it turns out actually that it's mathematically impossible to satisfy both of those definitions of fairness at the same time."
Um, and so this is a, one of these cases where human intuitions kind of run into these, uh, technical challenges, and we need essentially a, a, a kind of public policy conversation around, okay, well, when these things that, you know, seem equally desirable, um, can't be mutually satisfied, you know, who decides what the priorities should be? Um, so this ends up being a very complicated, uh, kind of policy/legal/computational question. Um, as you can imagine, it's, it gets pretty intricate.
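A toy illustration of that tension, with invented numbers rather than the actual COMPAS figures: a score can be calibrated for both groups while its errors still fall very differently on each group.

```python
# Invented numbers, for illustration only: two groups, each person gets a
# "high risk" or "low risk" label, and we then observe who was re-arrested.
# Format: (labeled_high_risk, re_arrested, count)
group_a = [(True, True, 40), (True, False, 20), (False, True, 10), (False, False, 30)]
group_b = [(True, True, 20), (True, False, 10), (False, True, 20), (False, False, 50)]

def stats(group):
    def total(pred=None, actual=None):
        return sum(n for p, a, n in group
                   if (pred is None or p == pred) and (actual is None or a == actual))
    calibration = total(pred=True, actual=True) / total(pred=True)     # P(re-arrest | high risk)
    false_pos = total(pred=True, actual=False) / total(actual=False)   # P(high risk | no re-arrest)
    return calibration, false_pos

print(stats(group_a))  # calibration ~0.67, false positive rate 0.40
print(stats(group_b))  # calibration ~0.67, false positive rate ~0.17
```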
- CWChris Williamson
Why were the Black defendants... W- w- did the computer scientists, were they able to look at what was occurring that meant that the Black defendants were being judged to be more likely or, uh, higher in terms of the error rate?
- BCBrian Christian
Yeah, I mean, there's a, there's a couple, um, components. So for one, the model actually makes three different predictions, um, that sometimes get conflated. One of the predictions is, uh, the person's risk of failing to appear for their trial date. Um, one is their risk of, uh, non-violent offense, and the other is the risk of violent offense. And one of the important things to note is that these are three fairly different predictions in terms of our ability to actually observe the thing that we are trying to measure. So if you don't appear in court, uh, the government basically by definition knows that y- that you didn't a- appear in court, so it's essentially a perfectly observable, um, event. If you, um ... Now, the, the model is attempting to predict whether you will commit a crime, but we don't know whether someone commits a crime. We only know whether the person is arrested and convicted, um, which may or may not mean that they committed a crime. And, uh, it's interesting because for the research of this book, I ended up digging into the now almost 100-year-long history of these sorts of models. It goes back to Chicago, in the 1920s was the first one, um, and to my surprise in the '30s it was conservatives making this argument of, like, "Now wait a minute, some guy goes on a crime spree, but h- he doesn't get caught, and as far as the model is concerned, he's, you know, a perfect citizen. That doesn't seem right because now the model's gonna recommend that we release more people like that." Um, these days you're more likely to see the critique coming from progressives that say, "Now wait a minute, someone's gotten wrongfully arrested and wrongfully convicted. Now the model thinks they're a criminal and it's gonna recommend detaining more people like that." Um, and so it's funny that, you know, historically the, the kind of prevailing political valence of the critique has flipped, but it's the same underlying critique, which is that we can't actually measure the thing that we're attempting to predict, which is crime. Um, and so if there are kind of s- you know, striking disparities in the way that we actually observe crime, then that is going to, um, essentially filter downstream into the model. So for example, in Manhattan, um, the last statistics that I heard was something like Black and White Manhattanites self-report marijuana use at about the same level, but Black Manhattanites are 15 times more likely to be arrested for marijuana possession. Um, and so here there's a huge gap between
- 30:00 – 45:00
- BCBrian Christian
what we can ostensibly, um, predict and what we can actually measure. Um, and so this has led... I mean, there's, there's a, a lot to unpack in this area, but this has led people, for example, to make the argument that you should essentially trust the failure-to-appear, uh, prediction, but not trust the non-violent offense prediction... because the one we can observe much more accurately than the other. Um, so that's a little bit of a flavor of how some of these, um, imbalances in how crime is actually, like, observed by the police then filter down into the model's assessment of someone's, quote-unquote, "risk."
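A small simulation of that measurement gap, with made-up rates loosely echoing the marijuana example: both groups offend at the same underlying rate, but one group's offenses are observed (arrested) far more often, so a model trained on arrest labels sees a large disparity that isn't in the behavior itself.

```python
import random

random.seed(0)

TRUE_OFFENSE_RATE = 0.10                          # same underlying rate for both groups (assumed)
ARREST_PROB = {"group_x": 0.9, "group_y": 0.06}   # how often an offense is actually observed

def observed_arrest_rate(group, n=100_000):
    arrests = 0
    for _ in range(n):
        offended = random.random() < TRUE_OFFENSE_RATE
        if offended and random.random() < ARREST_PROB[group]:
            arrests += 1
    return arrests / n

print(observed_arrest_rate("group_x"))  # ~0.09
print(observed_arrest_rate("group_y"))  # ~0.006
# A model trained on arrest labels sees roughly a 15x difference that does
# not exist in the underlying offense rate.
```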
- CWChris Williamson
Can we talk about neural networks? Like, what some of the problems are that you find with neural networks and why it's so hard to get an explanation out of a black box?
- BCBrian Christian
Yeah. Um, so part of the reason that, you know, you and I are having this conversation now and that, you know, we've seen books like Nick's and Stuart's and Toby's, um, in the last, you know, eight years or so, um, is because of the rise of deep neural networks, which really kicked off in 2012. Um, and so, I mean, it's ironic because the neural network is one of the oldest ideas that anyone had in computing. It predates the stored program computer, um, which I think is kind of amazing. Um, so it is about as old an idea as anyone's ever had in computer science and AI, um, but it was not until 2012 that it actually started to work. Um, and so there's a very fascinating, uh, history there. But basically, um, neural networks have come to dominate, uh, all of the previous approaches that people were using in the 2000s, um, and have just kind of swept through computer vision, uh, computational linguistics, um, speech-to-text processing, machine translation, reinforcement learning. You, you name it, there has been kind of just this successive series of, um, kind of discontinuous breakthroughs as a result of neural networks. Um, they are famous, they have a reputation for being kind of inscrutable, um, and uninterpretable. And so a big frontier in AI safety research is finding ways to essentially pop the hood, so to speak, and figure out what in the heck is going on inside of a neural network. Um, and I think that's, uh, that's maybe one of the more encouraging stories in AI alignment research because we're making more headway than honestly I expected. Um, so that's-
- CWChris Williamson
Wh- What's the problem? Why can't you... You've got, you've got a thing, you've got this neural network. It, it, it does a thing. Why can't it tell you why it did it?
- BCBrian Christian
Well, um, it's sort of like a problem of information overload. So let me use as an example the kind of flagship first real success story in deep learning, um, which is, uh, this image recognition system called AlexNet that was, uh, used in 2012. So AlexNet was designed to take in, uh, an image, and I'm trying to remember what the dimensions of the image were. It could've been like 100 pixels by 100 pixels or something like this. Um, and ultimately somehow output one out of 1,000 different categorizations. Is this a truck? Is this a kitten? Is this a, you know, sandy beach? What is this? Um, and so, you know, at, at the simplest level, you just have pixels in and categorization out, and so what's going on in the middle? Well, uh, 100 by 100 pixels is 10,000 pixels and it's RGB, so you have a total of 30,000 inputs that represent this picture. Um, and they're just encoded as numbers from, like, one to 255, or, or from zero to 255, for how m- how much red is in location X, how much blue is in location Y. Um, and this goes into a network of about 600,000 of these artificial neurons, and the artificial neuron is extremely simple. It just takes in these little inputs and it adds them up and says, um, "Is the sum of these inputs above a certain threshold?" And if it does, then it will sort of pass along some number as an output to the next neurons in the chain. So it's very interpretable in the sense that you can look at an individual neuron and say, like, "Okay, what were its inputs? Okay, it's getting a 10 here, a five there, it's adding them up, it's greater than something, so it's then outputting a whatever." Um, but it's very... And so you multiply that by 600,000 neurons in, you know, 12 layers or whatever, and there's a total of, like, 60 million connections between all the different ones. And at the end, you get a number between, you know, one and 1,000 that tells you truck, you know, barbecue grill, whatever. Um, the question is, it's not a, it's not a problem, you, you know exactly what happened, so to speak, in that it's just adding numbers and comparing them to thresholds. Um, but what in the hell does neuron, you know, the neuron in layer five, you know, getting, uh, an eight, a one, a 25, adding that up, and then outputting a two, like, w- what does that mean? Um, and so that's really the question, is it's a system that's kind of perfectly describable in detail, but, um, it's like being given, you know, an atomic description of what's going on in someone's brain, um, when they laugh and then trying to figure out, like, "Well, what makes something funny? Well, let's look at this, you know, hydrogen atom over here." Um, that's, it's just sort of, like, the wrong level of description. And so that, that is kind of the fundamental problem that we have with neural networks.
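A minimal sketch of the kind of artificial neuron being described, a thresholded weighted sum; the numbers are arbitrary. Each unit is perfectly inspectable on its own, which is exactly why the explanation problem is one of scale rather than secrecy.

```python
def artificial_neuron(inputs, weights, threshold):
    """Weighted sum of inputs; passes a value on only if it clears a threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total if total > threshold else 0.0

# Perfectly inspectable at this scale...
print(artificial_neuron([10, 5, 8], [0.2, 0.4, 0.1], threshold=3.0))  # 4.8

# ...but a network like AlexNet chains hundreds of thousands of these units
# across many layers with tens of millions of connections, so "the neuron in
# layer five output a two" carries no human-readable meaning on its own.
```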
- CWChris Williamson
Wasn't there an interesting implication around GDPR?
- BCBrian Christian
Yeah. So the European Union had, um, this draft version of the GDPR bill, um, that was circulating around 2016 or so, um, and it had this language in it which kind of-... raise the eyebrows of these two, uh, researchers at Oxford, Bryce Goodman and Seth Flaxman. Um, and they said, "Now, wait a minute, this draft version of the bill appears to create this legal concept that everyone has a right to an explanation of if you're affected by an algorithmic decision, if you're denied a mortgage, or you d- go- don't get a credit card that you applied for or whatever, you are entitled to s- to know why." Um, and yet, it was widely understood that you couldn't obtain so- an answer like that from a deep neural network. And so this created something of a, I don't know, a panic between the legal departments of tech companies and the engineering departments, um, and I remember hearing a lawyer for one of the big tech companies, uh, I won't say which one, talking about meeting with EU regulators and saying, "Now you realize that you're putting into law something which is, like, scientifically impossible." (laughs) Um, and the regulators were sort of unmoved, and they said, "Well, that's why you have, you know, it doesn't go into effect for two years, so, you know, figure it out." Um, and I think this is a very, you know, it's, we think of regulation as almost by definition stifling innovation, but here was a case where the regulators demanded something that the scientists then were given a two-year deadline of, like, "Figure it out." And so suddenly there was this huge, um, you know, wave of money and, uh, you know, research attention going into this problem. But it still, I mean it's, there are a lot of really, I think, promising techniques, but in terms of this question of what is the explanation for why something happened in a neural network, it's still n- it's not even clear what i- is legally sufficient to please the EU, let alone what is kind of standard practice at this point. So it's still, uh, a bit of an unresolved question even now.
- CWChris Williamson
I was listening to one of the engineers behind the YouTube algorithm talking.
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
And then Lex Fridman must have seen the same video clip that I did and he brought it up on a show, and he was talking about just how terrifyingly little YouTube knows about their own algorithm now.
- BCBrian Christian
Yeah.
- CWChris Williamson
That this thing is just a, a runaway Fisherian reinforcement monster that is optimizing and doing things-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... but to be honest, they kind of don't really know what's going on. Like, can you just speak to that? How can it be that programmers-
- BCBrian Christian
Yeah.
- CWChris Williamson
... that make a thing, after it's been left to run for a little while, no longer know what it's doing essentially?
- BCBrian Christian
Yeah, and, and, you know, I, I've heard this from people high up in the engineering orgs., um, of these companies saying, "Yeah, we, we have no idea what it's doing, but it's making so much money that we can't turn it off." Um, you know, this is how, this is how horror movies begin, right? Um, so, I mean, in some ways that's the, that's the point of neural networks, was precisely that they could do the things that we couldn't articulate in code, right? Like writing, if you think about it as a contrast to writing sort of traditional software, where you sit down and you type, you know, "If X then go to line 12," or, you know, whatever, that sort of canonical style of programming. There's a whole set of things that that couldn't do, namely the things that we didn't know how to explain our own thought process. So it's really good for sort of mimicking your explicit deliberate thought process, but it's really hard for doing, um, kind of sense perception or motor skills or things like that that don't have, like, an explicit reasoning you can step through. But the dark side of that is that, um, you don't understand how the computer's doing it either. Um, and so, yeah, there's been a lot of, um... I don't know, certainly I, I hear a lot of hand-wringing from people at tech companies that say like, "Okay, we just pipe in, you know, all the possible data that we have about this person, their browsing history, their credit cards, their, um, everything they've ever clicked on ever, you know, whatever it might be, into this thing, and it just spits out, you know, 'Show them this thing and, and not that thing.'" And, like, when it, we don't really know how that i- y- you know, is being arrived at, but we know that when we do it, we make more money than when we don't do it. Um, and so there's this weird kind of like, "Well, let's just let it, let it rip." Um, so that, I mean, that's exactly the kind of thing that people who are worried about AI safety are worried about, right?
- CWChris Williamson
Well, the problem here and, uh, the best podcast I think I've heard on this was Rob Reid from After On, he had Naval Ravikant on, and they were talking about, um, privatized gains versus socialized losses-
- BCBrian Christian
Mm. Yeah.
- CWChris Williamson
... and that paradigm. So essentially that if you are a private company who has one of these ridiculous algos that is able to just, it's a money machine and you run it and it's able to show the perfect advert to the perfect person at the perfect time, or it can create the best sandwich, or it makes the amazing computer game or does whatever, but risks potentially turning the entire world into paperclips, you are privatizing all of the potential gains, but the entire world is risking all of the losses.
- BCBrian Christian
Mm-hmm. Yeah.
- CWChris Williamson
It's what you get when you have a shared commons as well.
- BCBrian Christian
Yeah.
- CWChris Williamson
It's why people in some developing countries are slightly less concerned about polluting the atmosphere 'cause they only pollute a bit of the atmosphere, but they get all of the profit.
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
Um, and when you hear about these things, man, like, you know, YouTube, I don't think that Susan Wojcicki is...... trying to cause the downfall of human civilization, but I do think that she probably wants to maximize watch time. And sadly, these two things, as the power continues to increase within the algorithms and the computing space, these two things are going to start to converge more and more.
- BCBrian Christian
Yes, and I think, um, in some ways I think the alignment problem is bigger than AI. Um, that it is really a description of what's going wrong in capitalism, in global governance. That there are a number of situations and, and again, this is not just an AI thing, where someone defines some metric that m- sort of, kind of encapsulates what we want. Um, at the end of the day, YouTube or Netflix or whatever doesn't really care about how much you watch, they just care about how much money they can make and, you know, can they minimize churn and maximize user retention, blah, blah, blah. Um, and someone says, "Well, watch time seems to be mostly correlated with all that stuff, and it's a lot easier to, you know, uh, operationalize, so let's just, for the sake of argument, maximize watch time." Or, you know, at Tinder, um, there was this long period of time, I, as far as I know years, where the metric that their engineers were asked to optimize was swipes per week. Um, and this is basically the alignment problem, like full stop. We come up with some reward function that kind of, sort of contains what we're trying to do, but not e- entirely, and then we optimize the dickens out of that specification beyond the point at which it, uh, correlates with the thing we really care about. Um, and so, you know, this goes back to where we started our conversation, premature optimization is the root of all evil. At what point, you know, how, how long did it take for someone at Tinder to say, "You know, are swipes per week really what we're about? Like, is that really the, the, um, top of the pyramid of, you know, metrics that we're trying to achieve here?" Um, same thing with watch time, right? Like for m- I don't know if it's the case anymore, for a long time Netflix was explicitly
- 45:00 – 1:00:00
- BCBrian Christian
maximizing watch time, and there was some quote, if I'm recalling this correctly, where they said like, "We're competing against, you know, playing sports, we're competing against reading a book, we're competing against, like, talking to your kids." Um, and I- it just sounded, like, horrible. Um, but how long does it take before someone starts to realize like, "Oh, maybe, maybe that's not the metr- the metric, you know, that we're, like, going for"? And I think the same thing is happening in society at the highest level, right? We've been maximizing GDP per capita, quarterly returns, um, you name it, uh, while creating these socialized externalities called climate change, called the increasing GINI coefficient, et cetera, et cetera. And so, um, now the question is whether to be sort of extra pessimistic or extra optimistic by thinking about this as not really an AI problem per se, right? Because, um, I may be more confident that we can solve this at a technical level than I am that we can sort of re- change global governance, um, or reform capitalism in some, like, very macro way. Um, on the other hand, there is maybe a glimmer of hope that some of these techniques that people are developing in the AI context, things like inverse reinforcement learning, um, might actually be useful, uh, to tech companies for a start, and even something like, um, you know, a national government, where they say, um, "Instead of manually designing some objective function about what we're trying to do, um, we'll do the inverse reinforcement learning thing, where we will just, um, present real people with kind of different scenarios and ask them to pick, like, which of these, you know, which of these newspaper front pages, you know, from the year 2030 seems to portray a better world? Um, and then we'll somehow try to back the metrics out of that rather than having to come up with the metrics ourselves." Um, so tech companies are starting to do these sorts of things, um, and it will... I know to me that's the glimmer of hope, is maybe we can get ourselves to a sort of, uh, unwind the tyranny of these KPIs that are sort of, like, controlling everything about society at the highest level (laughs) -
- CWChris Williamson
Yeah.
- BCBrian Christian
... and get to something a little bit more holistic. Yeah.
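One hedged sketch of what "backing the metrics out" of people's choices between scenarios might look like, using a simple Bradley-Terry-style preference model; the scenarios, features, and numbers here are all invented.

```python
import math

# Learn a scoring function from pairwise choices between scenarios ("which of
# these two front pages describes a better 2030?") instead of hand-writing
# the metric. Bradley-Terry-style logistic update.

# Each scenario is described by a few invented features.
features = {
    "scenario_a": [0.9, 0.2, 0.1],    # e.g. growth, polarization, emissions
    "scenario_b": [0.5, 0.1, 0.05],
    "scenario_c": [0.8, 0.6, 0.3],
}
# People preferred the first scenario in each pair.
comparisons = [("scenario_b", "scenario_c"), ("scenario_b", "scenario_a"),
               ("scenario_a", "scenario_c")]

weights = [0.0, 0.0, 0.0]
LEARNING_RATE = 0.5

def score(name):
    return sum(w * f for w, f in zip(weights, features[name]))

for _ in range(200):
    for winner, loser in comparisons:
        # Probability the current model assigns to the observed preference.
        p = 1.0 / (1.0 + math.exp(score(loser) - score(winner)))
        for i in range(len(weights)):
            weights[i] += LEARNING_RATE * (1 - p) * (features[winner][i] - features[loser][i])

print(weights)  # the learned weights implicitly define the "metric" people were using
```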
- CWChris Williamson
It really does seem like you pick a metric as an engineer or as a company that you think is the closest approximation to what you deem success for the particular out- uh, the particular company that you're running. There is a reward function that's given to the algorithm for meeting that particular criterion, but that criterion might not actually be the best way to get that outcome. So let's say that it's time on site; there's nobody that I know that doesn't wish they spent less time on their phone. There's nobody that I know that looks at Instagram for two hours and retrospectively says, "That was a good use of my time." But you could imagine in another world that Instagram actually provided an experience which kept people on site for two hours, and retrospectively, they were happy that they'd been on for two hours.
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
Now, currently they're not-
- BCBrian Christian
Yeah.
- CWChris Williamson
... and I don't know if Instagram could ever achieve that with the particular platform, but we can imagine some other sort of app that could. If it was able to optimize itself in a way where it was actually achieving the outcome... that made people want to use the app rather than just racing to the bottom of the brainstem and manipulating them in ways that f- almost forced them to use the app, you would still-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... get the same outcome that you wanted, which was particularly screen time. But all of the, um, root causes of how you arrive at the screen time have been changed, and we would probably mostly be able to agree that that's, uh, advancing the wellbeing of the user who is doing it.
- BCBrian Christian
Yeah. Yeah. So I mean, there's always been this tension between the copious data that's easy to collect. You know, we can m- measure every click, we can now measure the milliseconds of every item on the screen being on the screen. Um, and then there's this data that's really hard to collect, which is, you know, these qualitative judgments of asking people, like, "How happy are you? How satisfied are you?" Um, maybe you go to someone a week later or a year later. Um, and so you can't directly, directly optimize for these things in tight feedback loops because it's hard to get the feedback, or you have to wait a year to get the feedback. Um, and so it can't be part of your actual kind of day-to-day, um, iterative loop. But what you can do is try to use something like inverse reinforcement learning or there, you know, more sorts of causal models to figure out how the data that you can observe might predict these sorts of scarce, expensive, long-term, uh, things that you really want. And rather than just directly optimizing the feedback that you have on hand, do the more indirect thing of, like, l- we need to model how the stuff we can observe affects these things downstream and try to optimize for those things downstream by way of these proxies. Um, I think that is starting to happen. Um, and it's not quite as simple as just waving the magic wand, but I think that's exactly the kind of approach that we need.
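A rough sketch of that indirection, with invented features and coefficients: rank by a model's prediction of the scarce, delayed signal (say, a satisfaction survey a week later) rather than by the immediately measurable engagement proxy.

```python
# Hedged sketch: instead of ranking by the signal that's easy to measure
# (engagement), rank by a model's prediction of the scarce, slow signal
# (how satisfied the user says they are a week later). The coefficients
# below stand in for a model fitted offline on survey data; all numbers and
# feature names are invented.

def engagement(session):
    clicks, dwell_minutes, late_night_fraction = session
    return clicks + dwell_minutes            # the cheap, immediate proxy

def predicted_week_later_satisfaction(session):
    clicks, dwell_minutes, late_night_fraction = session
    # Pretend these coefficients came from regressing survey answers on proxies.
    return 8.0 - 0.03 * dwell_minutes - 4.0 * late_night_fraction + 0.05 * clicks

candidates = {
    "doomscroll_feed":  (28, 110, 0.75),
    "intentional_feed": (9, 28, 0.05),
}

print(max(candidates, key=lambda k: engagement(candidates[k])))                         # doomscroll_feed
print(max(candidates, key=lambda k: predicted_week_later_satisfaction(candidates[k])))  # intentional_feed
```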
- CWChris Williamson
Is this going to be able to be achieved without some sort of systemic change in terms of governance, in terms of policy, in terms of the way that we step into the algorithms, the power of the algorithms, the amount of computing power, the amount of transparency that people can, th- these companies can see into their black boxes, or maybe even the way that they generate their revenue? Um, because right now, from where I'm sitting, it seems like tech companies are able to make more money than ever before using increasingly advanced, uh, neural networks and algorithms. But that is being done at the expense of a lot of other things that we probably don't want to depreciate anymore.
- BCBrian Christian
Mm-hmm. Yeah.
- CWChris Williamson
Uh, it, it feels to me as if we are at a, a little bit of a, a precipice, Toby, or, uh-
- BCBrian Christian
Hm.
- CWChris Williamson
... perhaps like the, the apex of a, of a curve where you... if we were to push much further, I think that you would start losing... We may have already gone past it, but I think that we actually start to lose so much of civilization that the advantages that we begin to get from technology are, uh, negative utility.
- BCBrian Christian
I, yeah, I agree with you, and I think that may be true even if you are in the, in a completely self-interested position at a tech company. You know, for example, Google and Facebook have now, relative to 10 years ago, lost so much of the public's goodwill that, you know, when the Department of Justice swings the antitrust hammer, um, you know, are, are people gonna cheer or are they gonna protest, right? And that makes a difference. Like, so goodwill, um, may seem like this kind of gossamer, you know, ineffable thing, but it manifests-
- CWChris Williamson
Counts for a lot, yeah.
- BCBrian Christian
... as, you know, it may shatter your entire corporation.
- CWChris Williamson
How, how do we fix it?
- BCBrian Christian
I mean, I, I do think at some level there is this kind of room for these more technical solutions. Um, and so, you know, m- my feet are planted more squarely in the research community, and so, you know, I can see the development of some of these things, um, and how, you know, for example, at Berkeley where I'm affiliated, a lot of the PhD students in technical AI safety have started doing summer internships at tech companies, and I think that's very interesting. It's sort of a, a marker of the kind of maturity of technical AI safety in just a few years that some of the stuff that was on whiteboards in 2017, 2018 is now getting actually mocked up as this kind of MVP thing at an actual tech company. So that seems good. Um, I think one of the major forces that holds a lot of tech companies in check is the fact of how much power the actual employees themselves have. Currently, machine learning is in such high demand that good machine learning engineers have a ton of leverage, and they're able to use that leverage, um, to kind of convince companies to do certain benevolent things like publish, publish results openly, you know, in public journals or publish things directly onto the web or release open source. Um, things that companies wouldn't necessarily be otherwise disposed to do has now become sort of part of the norm of the field, and that's not coming from competitive pressure necessarily, it's not coming from regulators. It's coming from just the consciences of the individual engineers and the fact that they have leverage, uh, over their employers because they're in this, like, extremely high-demand, um, category.... now it may not always be the case because as the ranks of machine learning engineers grow, because this field is so in demand, then the individual bargaining power of those employees is gonna go down. Um, so we're gonna lose a little bit of that leverage. Um, does some of it come from regulators? I assume so, but it is not clear to me what shape that regulation is actually going to take. Um, I mean, I think, um, some kind of citizen participation, you know, like, I, I wouldn't mind something... You know, I say in the book, I think that it, it's reasonable to imagine that we have some rights to know what the m- model is that these companies have of us, um, and to have some kind of direct influence over that, right? So you're starting to see this a little bit with alcohol, which was the example I mentioned earlier, where some tech companies now actually have, like, a toggle somewhere deep into the preferences that says, like, "Never show me alcohol." Um, that's, like, the f- the tip of the iceberg, but, uh, you can imagine a way in which, um, you know, you can have some control over which version of yourself is being marketed to, right? You can say, like, "Yes, it turns out that I... You know, when I'm in the checkout aisle, I put a bunch of candy on my thing, but I, I want to not be that person. So please give me the checkout aisle that has, you know-"
- CWChris Williamson
Fruit and vegetables.
- BCBrian Christian
"... organic, whatever."
- CWChris Williamson
Yeah, precisely.
- BCBrian Christian
Fruit and vegetables. Um, and I... You know, i- in, in theory that should be a win-win, um, but we'll see. Um, you know, in, in practice it's not totally clear, um, that they can present ... makes sense to you or in a way that's, like, directly manipulable.
- CWChris Williamson
Yeah.
- BCBrian Christian
So there are questions.
- CWChris Williamson
The main thing, and I think that this kind of cuts to the heart of the discussion around the alignment problem right now, is that the general tenor and tone and feeling of users towards tech companies and of us towards ourselves and of our relationship with technology, probably in the space of the last six years to seven years has changed-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... very much from-
- 1:00:00 – 1:15:00
- BCBrian Christian
will be used against us, right? It's like w- w- n- n- you need to be Mirandized to go online or something. Um, that I... I mean, I, I, I don't know if you have this experience as well, but when I'm using, let's say, YouTube, um, there's a part of me that tries to decide before I look up a video if I want to look it up in an incognito tab so that I kind of, like, you know, s- separate off, cordon off, like, this is not part of the preference model I want you to build of me even though I know they're gonna track my IP address and whatever, but it's not directly linked to the same, um, you know, viewing habits that I've e- kind of laid down in the, my regular, uh, account with all the cookies in it. Um, I think there is going to be this increasingly weird game theoretic aspect to using technology, where we are constantly having to suss out...... like, what, what inferences it gonna make based on my behavior? What is their business model? Um, what kind of feedback loop will my behavior create? Um, you know, you hear these funny stories of people saying like, "You know, I let my two-year-old, you know, mess with my Spotify for a day, and then my recommendations have been ruined forever." You know, things like that. Um, and I certainly have that experience, um, that I feel that I'm dealing with this kind of inscrutable, um, you know, intelligence, or machinery, or whatever you wanna think about it. Um, there's some process happening of which I'm a part. Um, and it's, I'm being observed, I'm being sort of adapted to the things that I see have some weird relationship to what I've interacted with before, but I have no idea what that relationship is. Um, and I mean, I think about Twitter as one example. You know, the Twitter app, a lot of the stuff in my feed isn't even from the people I follow. It's this kind of secondary and tertiary thing of, like, someone that you follow liked a tweet by someone else who knows this other person, and this is their tweet. And so, as a consumer, you have no idea, like, why, why am I seeing this thing? When you go to make a tweet, you have no idea what process may or may not determine how that tweet reaches people or which people it reaches. But we're sort of forced to play this game, um, and we're forced to sort of co-adapt with these models. And I think that's, um, that's one of the things that people in technology, I think, underestimate, is that if somewhere secretly in the, you know, the bowels of Twitter, they start to add a 5% bump for posts that use, like, highly emotional language or something like that, or just that, that naturally shakes out of the optimization, um, people will notice and people will change their behavior accordingly. And so, um, you know, the, the technical way of putting this is that, you know, machine learning is secretly mechanism design. It's like you, you can't make a model without that model becoming, uh, so- essentially an incentive structure that people then start to game, and then the correlations you previously observed break down. And so, you constantly have to kind of reevaluate it. Um, what do I think we can expect in the next five to 10 years? I think that process of, I don't know, that, that, I think that disempowering feeling of just, like, "I have no idea what's going on here. I'm just kind of experiencing it. Um, I have this sense that I'm part of this causal feedback mechanism, and I don't really know exactly what that... how that works. But I need to take... 
I know that I need to take it into account, um, because everything I engage with will shape the future data that I even get." Um, I mean, I think we were, w- we're going to see, um... I think the subtle questions of, like, what the actual business models of these companies are, like, really start to rear their head in the, um, the machine learning aspect. I think that's kind of underappreciated. So for example, um, we have all these different apps that recommend us things. Netflix recommends us things, Amazon recommends us things, Spotify recommends us things. Um, but the business model in each case is quite different. And so that ends up actually manifesting in very different recommendations. So, Amazon is this logistics behemoth, and so they really benefit from people, uh, doing mainstream things. If you buy the book that's the number one book, then they probably have a copy of it just, like, a few miles from your house, and it's gonna be easier for them than if you get an obscure book. Um, Netflix, uh, is constantly renegotiating the licensing rights with all these film studios and TV studios. So, Netflix would prefer if you watch the really obscure thing to the really mainstream thing, because they can get the rights a lot cheaper. So, Amazon is giving you this kind of, uh, centripetal force towards these sort of main- mainstream modal things in the culture. Netflix is the centrifugal force that's driving you to these, like, obscure niches. Um, Spotify has this kind of double-sided marketplace thing where, if they put too many of the mainstream artists in your recommended playlist, then the indie record labels get mad and pull out. And so they're constantly wrangling, um, to please both the listeners and the musicians. So, all of these things end up manifesting in the actual behavior of the s- system but in ways that are, um, you know, sort of tac- tactically o- obscure but strategically clear. Um, I don't know where that leaves us necessarily. I think it's gonna be increasingly clear that... (sighs) I don't know. We need to figure out how to give people a seat at the table, um, that if there are things like the health of public discourse at stake, um, we're i- we increasingly have the actual computer science to attempt to operationalize these weirdly fuzzy things. Um, you know, I'm thinking about OpenAI, GPT-3. There are a lot of research papers coming out about, you know, um, these very ill-defined concepts that we have, like, for language. Um, you can actually sort of fine-tune GPT-3 to, to meet these different criteria. Um, so the scientific piece is getting worked out, and the question that remains is really, like, how do we decide who gets a seat at the table, and whose opinion is it?... and whose, whose values are the values that are getting imprinted into this system. So that is, I think the, the big question that awaits us even when we can solve the, uh, the technical and scientific aspect of the alignment problem.
- CWChris Williamson
I think it comes back to what we said at the very, very beginning, the balance between technological capability and technological wisdom, in that we are now, or you're telling me that, the very cutting edge of computer science research, neural networks and GPT-3 and all this stuff, is on the cusp of actually being able to potentially open up solutions to problems that most of us have only just become aware of. So you're like-
- BCBrian Christian
Yeah.
- CWChris Williamson
... you're so far ahead and the iterations on this are so rapid that, you know, think about the lumbering behemoth that is governmental policy behind us and legislation-
- BCBrian Christian
Mm. Yeah.
- CWChris Williamson
... and psychological research into the long-term effect on humans, that is way, way in the back. That's in the tail of Snowpiercer. And then right on the very, very front, getting battered by cold winds, are a couple of these computer science guys-
- BCBrian Christian
(laughs)
- CWChris Williamson
... and then the guy that's driving the train, the person at the front, is the algorithm. And then we're somewhere in the middle, like, feeling it-
- BCBrian Christian
Yeah.
- CWChris Williamson
... and we're like, "Oh, isn't it interesting that all this stuff's going on?" So, man, yeah, I mean the next 10 years is going to be super, super interesting. Um, I have this-
- BCBrian Christian
And the q- Yeah.
- CWChris Williamson
... I have this get out of jail free card that (laughs) I've been using for absolutely ages, which is that all of the problems that we're coming up with now are not going to matter in 100 years because we're either going to be enslaved by a misaligned-
- BCBrian Christian
(laughs)
- CWChris Williamson
... a super intelligent artificial general intelligence-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... or we'll have managed to get a machine extrapolated volition to work correctly to the point where it fixes all of the problems for us. So all of the stuff between now-
- BCBrian Christian
Mm-hmm.
- CWChris Williamson
... and then is like some weird reverse deterministic, uh, apathy fest where we don't actually really need to do anything. It's like, look, the end of the road's coming. Uh, could be good, could be bad, let's enjoy the ride on the way there. And I wonder-
- BCBrian Christian
(laughs)
- CWChris Williamson
... I wonder how much solving these smaller-consequence alignment problems is going to contribute to us getting the big one right.
- BCBrian Christian
Yeah. Um, I mean, I agree with your Snowpiercer analogy, broadly speaking. Um, and-
- CWChris Williamson
Mm-hmm.
- BCBrian Christian
... I don't know, in some ways it's even worse than that, because there's a computer scientist at Princeton named Arvind Narayanan who has pointed out that a lot of the systems that are still integral to our financial system and airplane controls are written in Pascal and COBOL, these programming languages that barely anyone even knows anymore. And so he was saying, we have this idea that tech moves too fast for society to keep up, but in reality a lot of these crappy machine learning systems that were developed in the 2010s are still going to be around, in zombie mode, 20 years from now, and maybe that's even more terrifying. So I think there is this question of: are we able to catch these misalignment issues in time to actually course-correct? Some of them I feel more sanguine about than others. Maybe to push back a little bit on this idea that we just need to relax, there's kind of a philosophical question here, which is the tension between moral realism and moral relativism. I've heard people say that they identify as moral realists, so they think there are objective truths about right and wrong and that people are just bad at figuring out what those objective truths are. And that's the kind of attitude that says, "I welcome our new robotic overlords. The sooner they can tell us what's right and wrong, the better." That is kind of a "we can chill" scenario, assuming that we get that Hail Mary pass right and the system really is aligned, but if that's the case, then the moral realist overlord just tells us what to do. If you're kind of a moral relativist, then you have this idea that whatever people say is good is what's good. This idea of coherent extrapolated volition, which goes back to Eliezer Yudkowsky at the Machine Intelligence Research Institute, has been very influential in the AI safety community. And it's an idea which I think to some degree threads the needle between those two things, which is to say: the thing that we want isn't just whatever people happen to say when we poll them, and it isn't some objective truth that we could all be wrong about, but it's what we would decide if we were smarter, if we had longer to think about it, if we could pull together in the appropriate way. I think in some ways that ends up becoming the job for the next 100 years. Philosophers like Will MacAskill, who is Toby's colleague, are saying we need this period that they're calling the long reflection, where everybody basically needs to just chill; we need to take maybe a million years to figure out what we want to do with the cosmos, and take our time. There's this very influential essay by Nick Bostrom called "Astronomical Waste," which says something like: every second that passes that we don't colonize the stars is equivalent to trillions of human lives being lost that could have been lived had we acted sooner. And yet we still need to take our time, because the consequence of screwing it up is even worse.
So there are a lot of folks coming at AI safety from the philosophy side saying, "What we really need to do is just chill." Leave the space of possibility open. And it's interesting to me because there's a lot of technical AI safety work on this idea of option value: preserving the ability of a system to achieve various goals in the future. Something you might want in a system is that it doesn't take actions which permanently foreclose possibilities in that space, whether it's shattering the vase that you can't put back together again, or killing the person that you can't bring back to life. Whatever it might be, there are some really encouraging technical results in these sorts of toy environments, where if you give the agent randomly generated objective functions and say, "Okay, I want you to perform some task, but preserve your option value to later do these randomly generated other tasks," the system, at least in these simplified examples, behaves with what seems like a very human amount of caution-
- CWChris Williamson
Very delicate.
- BCBrian Christian
... like, it won't just push the Ming vase out of the way to, you know, run out the door faster.
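As a rough illustration of that option-value idea, and in the spirit of the attainable utility preservation work mentioned a little later, here is a minimal sketch. The toy states, the auxiliary goals, and the attainable_value stand-in are assumptions for the example, not the published algorithms.

```python
# Toy sketch: the agent's effective reward is its task reward minus a penalty
# for how much an action changes its ability to satisfy a set of auxiliary
# goals, compared with doing nothing. All of the details below are made up.

AUX_GOALS = [f"aux_goal_{i}" for i in range(5)]  # stand-ins for randomly generated objectives

def attainable_value(state: dict, goal: str) -> float:
    """How well the agent could still achieve `goal` from `state`.
    In real experiments this would come from planning or learned value
    functions; here it is just read off a toy state description."""
    return state["reachable_goals"].get(goal, 0.0)

def shaped_reward(task_reward: float, state_after: dict, state_if_noop: dict,
                  penalty_weight: float = 1.0) -> float:
    """Task reward minus a penalty for how much the chosen action shifts the
    attainable value of the auxiliary goals, relative to a no-op."""
    penalty = sum(
        abs(attainable_value(state_after, g) - attainable_value(state_if_noop, g))
        for g in AUX_GOALS
    ) / len(AUX_GOALS)
    return task_reward - penalty_weight * penalty

# Two ways to get out the door: smash the Ming vase (slightly faster) or go around it.
vase_intact  = {"reachable_goals": {g: 1.0 for g in AUX_GOALS}}
vase_smashed = {"reachable_goals": {g: 0.0 for g in AUX_GOALS}}  # options permanently foreclosed

print(round(shaped_reward(1.2, state_after=vase_smashed, state_if_noop=vase_intact), 2))  # ~0.2
print(round(shaped_reward(1.0, state_after=vase_intact,  state_if_noop=vase_intact), 2))  # 1.0
# The cautious route scores higher, so the penalised agent leaves the vase alone.
```

The penalty only measures how much an action narrows what the agent could still do afterwards, which is what makes the slower, careful route come out ahead.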
- CWChris Williamson
Ah, man. That is so interesting. I hadn't heard about that before. But it makes-
- BCBrian Christian
Yeah.
- CWChris Williamson
... it makes complete sense that by forcing it to retain some level of optionality, you restrict
- 1:15:00 – 1:16:15
Exactly. …
- CWChris Williamson
the maximizing, ridiculous maximizing effect-
- BCBrian Christian
Exactly.
- CWChris Williamson
... that it can go after... Man, I, uh, I'm gonna... Once we're, once we're finished up, I'm gonna ask you for some suggestions for, uh, safety researchers and stuff 'cause I'm gonna, I'm gonna force this down the audience's throat over the next year.
- BCBrian Christian
(laughs) Yeah.
- CWChris Williamson
Man, (laughs) Brian, uh-
- BCBrian Christian
I'll name check, yeah.
- CWChris Williamson
Oh, will do.
- BCBrian Christian
Victoria Krakovna at DeepMind is one of the people working a lot on this. And there's an idea by a guy named Alex Turner called attainable utility preservation. So I'll send you some links and-
- CWChris Williamson
That's my-
- BCBrian Christian
... share some of the actual papers with you.
- CWChris Williamson
That's my-
- BCBrian Christian
But-
- CWChris Williamson
... bedtime reading finished.
- BCBrian Christian
(laughs)
- CWChris Williamson
Uh, man, thank you for coming on. The Alignment Problem: How Can Machines Learn Human Values? will be linked in the show notes below. If people want to check out any more of your stuff, where should they go?
- BCBrian Christian
Um, I'm on Twitter @brianchristian and on the web at brianchristian.org.
- CWChris Williamson
Perfect, man. Thank you so much for coming on.
- BCBrian Christian
My pleasure. Thank you.
- NANarrator
(singing)