The Alignment Problem - Brian Christian | Modern Wisdom Podcast 297

Brian Christian is a programmer, researcher and an author. You have a computer system, you want it to do X, you give it a set of examples and you say "do that" - what could go wrong? Well, lots apparently, and the implications are pretty scary. The Alignment Problem is one of the biggest challenges in AI research.

Expect to learn why it's so hard to code an artificial intelligence to do what we actually want it to, how a robot cheated at the game of football, why human biases can be absorbed by AI systems, the most effective way to teach machines to learn, the danger if we don't get the alignment problem fixed and much more...

Sponsors:
Get 20% discount on the highest quality CBD Products from Pure Sport at https://puresportcbd.com/modernwisdom (use code: MW20)
Get perfect teeth 70% cheaper than other invisible aligners from DW Aligners at http://dwaligners.co.uk/modernwisdom

Extra Stuff:
Buy The Alignment Problem - https://amzn.to/3ty6po7
Follow Brian on Twitter - https://twitter.com/brianchristian
Get my free Ultimate Life Hacks List to 10x your daily productivity → https://chriswillx.com/lifehacks/
To support me on Patreon (thank you): https://www.patreon.com/modernwisdom

#alignmentproblem #artificialintelligance #machinelearning

Listen to all episodes online. Search "Modern Wisdom" on any Podcast App or click here:
iTunes: https://apple.co/2MNqIgw
Spotify: https://spoti.fi/2LSimPn
Stitcher: https://www.stitcher.com/podcast/modern-wisdom

Get in touch in the comments below or head to...
Instagram: https://www.instagram.com/chriswillx
Twitter: https://www.twitter.com/chriswillx
Email: modernwisdompodcast@gmail.com

Guest: Brian Christian
Host: Chris Williamson

Mar 20, 2021 · 1h 16m

CHAPTERS

0:00 – 1:45
Premature optimization: mistaking the model for reality
Brian Christian uses Donald Knuth’s quote to frame a core risk in AI and optimization: confusing a simplified model (the map) with the messy world (the territory). When we optimize too early or too narrowly, we lock in assumptions that later produce unintended consequences.
1:45 – 2:20
Defining the alignment problem: intention vs objective
They define alignment as the mismatch between what designers intend and what an AI system actually optimizes. The chapter establishes why specifying goals precisely is harder than it sounds and why small gaps can become big failures.
2:20 – 4:11
Why it matters: from early cybernetics to real-world harms
Brian traces alignment concerns back to Norbert Wiener in 1960 and explains how rising capability removes the buffer once provided by technical limitations. He connects misalignment to current societal harms, not just hypothetical future AGI risks.
4:11 – 5:33
Paperclip maximizer vs today’s ‘engagement’ paperclips
Chris brings up the paperclip maximizer thought experiment; Brian argues we now have abundant real examples that rhyme with it. Optimizing engagement in social media shows how a seemingly reasonable metric can drive polarization and radicalization.
5:33 – 7:07
Know-how vs know-what: the field pivots to choosing the right objectives
They discuss a paradigm shift in AI: optimization is powerful enough that the harder problem is selecting objectives that reflect what we truly value. Brian highlights how leading textbooks and researchers are re-centering AI around goal specification and human norms.
7:07 – 9:57
Behavior isn’t value: social media inference, addiction loops, and ‘philosophy on a deadline’
Brian explains that companies already infer “what you want” from behavior at extreme resolution (milliseconds of attention). The chapter explores how this misreads human preference—especially under addiction or impulse—and why society can’t wait for perfect moral philosophy before deploying systems.
9:57 – 13:35
The AI safety community emerges (and the culture shift of 2014–2015)
Chris and Brian unpack how quickly AI safety went from fringe to discussed seriously at conferences. Brian distinguishes between a small safety research community and the much larger population of industry data scientists focused on business problems—and the challenge of transferring safety insights into products.
13:35 – 18:30
Two big failure modes: bad data and wrong incentives (objective functions)
Brian gives concrete examples of how alignment fails in practice: training data that doesn’t match deployment reality, and reward functions that invite gaming. He introduces “distributional shift” and illustrates how subtle proxy rewards can dominate true goals.
18:30 – 22:08
Beyond patching holes: inverse reinforcement learning as a new approach
Chris argues that trying to anticipate every exploit is like taping leaks in a tank; Brian agrees and explains the field’s move toward learning goals from human behavior. Inverse reinforcement learning (IRL) aims to infer the reward function by observing expert actions, reducing reliance on brittle hand-coded objectives.
22:08 – 26:52
Fairness as alignment: COMPAS, competing definitions, and impossibility results
They connect algorithmic fairness to alignment by showing how different groups can be harmed differently even when a model seems ‘fair’ by one metric. Using COMPAS, Brian explains calibration vs error-rate parity and why some fairness goals cannot be satisfied simultaneously—forcing policy tradeoffs.
26:52 – 30:41
What drives disparities: measurement problems and biased observability of ‘crime’
Brian explains that models often predict unobservable constructs (like crime) using proxies (arrest/conviction) that reflect policing and systemic bias. He breaks down COMPAS’s different targets (FTA, non-violent, violent) and why some predictions are more trustworthy than others.
30:41 – 36:15
Deep neural networks: why the ‘black box’ is hard to explain
They discuss how deep learning’s post-2012 breakthroughs made alignment and interpretability urgent. Brian explains that individual neurons are simple, but the scale (millions of parameters) makes explanations hard at the human-relevant level of abstraction.
36:15 – 38:32
GDPR and the ‘right to an explanation’: regulation forcing interpretability research
Brian recounts how draft GDPR language implied a legal right to explanation for algorithmic decisions, colliding with the difficulty of explaining deep nets. The episode shows regulation can spur innovation by creating deadlines—even if “explanation” remains legally and technically unsettled.
38:32 – 52:54
Runaway recommender systems, externalities, and alignment beyond AI
They explore how companies can’t—or won’t—turn off profitable systems they don’t fully understand, and how this mirrors broader capitalist externalities. Alignment becomes a general story about proxy metrics (watch time, swipes) producing social costs (polarization, wellbeing loss) while privatizing gains.
52:54 – 59:29
Paths to fixing it: technical safety, employee leverage, user control, and participation
Brian outlines partial solutions: maturing technical safety work moving into industry, the influence of high-leverage ML talent on company norms, and potential regulatory/citizen mechanisms. They discuss giving users more visibility and control over the models built about them (e.g., ad-category toggles).
59:29 – 1:09:22
What the next decade feels like: mirandized users, feedback loops, and mechanism design
Brian predicts users will increasingly act strategically, aware every click shapes future recommendations—sometimes using incognito to avoid “training” platforms. He argues ML systems become mechanism design: incentives change user behavior, which breaks prior correlations and forces ongoing adaptation.
1:09:22 – 1:16:15
Long-term alignment philosophy: moral realism vs relativism, CEV, and preserving option value
They zoom out to existential timelines: whether we should ‘chill’ and do a long reflection, and how coherent extrapolated volition attempts to capture what humanity would want if wiser and more informed. Brian ends on technical work that encourages caution—optimizing while preserving optionality to avoid irreversible harm.
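
The fairness chapter (22:08 – 26:52) says that calibration and error-rate parity cannot always be satisfied together. A minimal sketch of why, using invented counts rather than any real COMPAS data: both hypothetical groups below receive perfectly calibrated scores, yet because their base rates differ, their false positive and false negative rates diverge. The `Bucket` type, the group sizes, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    score: float    # predicted risk assigned to everyone in this bucket
    positives: int  # people in the bucket for whom the outcome occurred
    negatives: int  # people in the bucket for whom it did not


def observed_rate(bucket):
    """Fraction of the bucket for whom the outcome actually occurred."""
    return bucket.positives / (bucket.positives + bucket.negatives)


def error_rates(buckets, threshold=0.5):
    """Return (false positive rate, false negative rate) for one group."""
    false_pos = sum(b.negatives for b in buckets if b.score >= threshold)
    false_neg = sum(b.positives for b in buckets if b.score < threshold)
    total_neg = sum(b.negatives for b in buckets)
    total_pos = sum(b.positives for b in buckets)
    return false_pos / total_neg, false_neg / total_pos


# Hypothetical groups: same score meanings, different base rates.
group_a = [Bucket(0.8, 40, 10), Bucket(0.2, 10, 40)]  # base rate 50%
group_b = [Bucket(0.8, 16, 4), Bucket(0.2, 16, 64)]   # base rate 32%

for name, group in (("A", group_a), ("B", group_b)):
    calibrated = all(abs(observed_rate(b) - b.score) < 1e-9 for b in group)
    fpr, fnr = error_rates(group)
    print(f"group {name}: calibrated={calibrated} FPR={fpr:.3f} FNR={fnr:.3f}")
```

In this toy setup a score of 0.8 means the same thing in both groups, but a person in group A who would not go on to have the outcome is more likely to be flagged than a comparable person in group B. Equalizing those error rates would require breaking calibration, which is the policy tradeoff the chapter describes.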
