The Twenty Minute VC

Andrew Feldman, Cerebras Co-Founder and CEO: The AI Chip Wars & The Plan to Break Nvidia's Dominance

Andrew Feldman is the Co-Founder and CEO of Cerebras, the fastest AI inference + training platform in the world. In Sept 2024 the company filed to go public off the back of a rumoured $1BN deal with G42 in the UAE. Andrew is the leading expert on all things inference.

In Today’s Episode We Discuss:

(00:00) Intro
(00:56) Where Was the AI Landscape in 2015 When Cerebras Was Founded
(02:34) NVIDIA’s Biggest Strength Has Become Their Biggest Weakness
(04:07) What Happens to the Cost of Inference?
(06:24) Why Are AI Algorithms So Inefficient?
(20:58) Why is it Total BS That We Have Hit Scaling Laws?
(25:26) What Will Be the Ratio of Synthetic to Human Data Used in 5 Years?
(36:50) What Specifically Was So Impressive About DeepSeek?
(37:16) Why is Distillation Not Wrong and OpenAI Need to Look in the Mirror?
(38:07) Where Will Value Accrue in a World of AI?
(40:13) How Will NVIDIA’s Market Position Change Over the Next Five Years?
(48:18) Why is the CUDA Lock-In for NVIDIA BS? What is Their Weakness?
(49:11) Why is Trump Better for Business than Biden?
(01:01:22) Do We Underestimate China in a World of AI?
(01:05:23) Quick-Fire Round

Subscribe on Spotify: https://open.spotify.com/show/3j2KMcZTtgTNBKwtZBMHvl?si=85bc9196860e4466
Subscribe on Apple Podcasts: https://podcasts.apple.com/us/podcast/the-twenty-minute-vc-20vc-venture-capital-startup/id958230465
Follow Harry Stebbings on X: https://twitter.com/HarryStebbings
Follow Andrew Feldman on X: https://twitter.com/andrewdfeldman
Follow 20VC on Instagram: https://www.instagram.com/20vchq
Follow 20VC on TikTok: https://www.tiktok.com/@20vc_tok
Visit our Website: https://www.20vc.com
Subscribe to our Newsletter: https://www.thetwentyminutevc.com/contact

Andrew Feldman (guest), Harry Stebbings (host)
Mar 24, 2025 · 1h 14m

CHAPTERS

  1. 0:27 – 2:34

    Cerebras’ founding thesis: AI as a new workload (2015)

    Andrew explains what the Cerebras founders saw in 2015: the emergence of AI as a fundamentally new computing workload. He frames it as a computer-architecture opportunity driven less by novel math and more by new bottlenecks in memory bandwidth and communication.

    • AI created a new “problem to solve” for computer architects
    • The market proved far larger than Feldman initially expected
    • AI workloads stress memory bandwidth and interconnects more than raw math
    • Cerebras was founded on the belief that purpose-built systems could outperform GPUs
  2. 2:34 – 4:16

    What AI chips really do: compute is easy, moving data is hard

    Feldman reduces chip design to two jobs—compute and data movement—and argues AI’s core math (matrix multiply) is straightforward. The real difficulty is shuttling massive intermediate results and model weights between compute, memory, and across many chips.

    • Chips mainly compute, move data, and sometimes store it
    • AI’s primitive operations are simple; the scale is the challenge
    • Training/inference require moving huge volumes of weights and activations
    • Data movement and communication dominate performance and power
  3. 4:16 – 6:11

    Training vs inference: why generative inference is brutally bandwidth-bound

    The conversation narrows to why generative inference has different constraints than training/fine-tuning. Feldman describes how each generated token requires reloading weights, making memory bandwidth the limiting factor—especially on GPU architectures built around off-chip memory.

    • Training-from-scratch and fine-tuning are computationally similar
    • Generative inference repeatedly moves weights for each token
    • Example: 70B params at 16-bit implies ~140GB moved per token
    • GPU off-chip memory architecture becomes a fundamental inference limiter
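
The 70B example above is simple arithmetic, and a minimal Python sketch makes it explicit. It assumes a dense model generating one token at a time, so every weight is read once per token. The function names and the 3 TB/s bandwidth figure are hypothetical, chosen only for illustration, and do not come from the episode.

```python
def weight_bytes_per_token(num_params: float, bits_per_param: int) -> float:
    """Bytes of weights read to generate one token (dense model, one sequence)."""
    return num_params * bits_per_param / 8


def max_tokens_per_second(bandwidth_bytes_per_s: float, bytes_per_token: float) -> float:
    """Upper bound on decode speed if memory bandwidth were the only limit."""
    return bandwidth_bytes_per_s / bytes_per_token


# 70 billion parameters stored at 16 bits each.
bytes_per_token = weight_bytes_per_token(70e9, 16)
print(f"{bytes_per_token / 1e9:.0f} GB moved per token")  # 140 GB moved per token

# Hypothetical memory bandwidth (3 TB/s), an assumption for illustration only.
HYPOTHETICAL_BANDWIDTH_BYTES_PER_S = 3e12
ceiling = max_tokens_per_second(HYPOTHETICAL_BANDWIDTH_BYTES_PER_S, bytes_per_token)
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the token rate is capped by bandwidth divided by model size, which is why faster memory raises the ceiling directly.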
  4. 6:11 – 10:16

    Wafer-scale + SRAM: Cerebras’ bet to beat HBM-centric GPUs

    Feldman contrasts HBM (high capacity, slower) with SRAM (very fast, lower capacity) and explains why wafer-scale enables enough SRAM to hold large models efficiently. He argues this reduces chip-to-chip sprawl and the operational complexity of running giant models across thousands of devices.

    • HBM offers high capacity but is slower; SRAM is very fast but usually too small
    • Wafer-scale allows large SRAM capacity plus speed advantages
    • Avoids needing thousands of chips to host very large models
    • Operational simplicity improves when models fit on fewer devices/wafers
  5. 10:16 – 11:04

    Cost and power trade-offs: why keeping data on-chip matters

    Harry probes whether wafer-scale implies higher cost; Feldman responds by emphasizing power efficiency. He claims much of chip power is spent on I/O, so minimizing off-chip movement lowers energy use and improves cost-per-inference over time.

    • Wafer-scale is a trade-off decision, not automatically “more expensive”
    • I/O (off-chip movement) is among the most power-hungry operations
    • Keeping data in the silicon domain reduces power draw
    • Performance-per-watt becomes central to inference economics
  6. 11:04 – 14:44

    Yield, defects, and the tile-based breakthrough that made wafer-scale viable

    Feldman explains semiconductor yield using an accessible analogy and why large dies historically fail more often. He describes Cerebras’ core innovation: building the processor from many identical tiles with redundancy so defects can be bypassed rather than scrapping the wafer.

    • Yield: probability a manufactured chip is functional despite random defects
    • Bigger chips traditionally mean lower yield and more wasted silicon
    • Cerebras uses many identical tiles and redundancy to route around flaws
    • This approach mirrors memory-industry redundancy techniques adapted to compute
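
The yield argument above can be sketched with the standard Poisson yield model, in which a die of area A with defect density D is defect-free with probability exp(-D·A). This is a textbook approximation, not a description of Cerebras' actual method. The defect density, areas, tile count, and spare count below are all hypothetical.

```python
import math


def poisson_yield(defect_density: float, area: float) -> float:
    """Probability that a region of the given area contains zero defects."""
    return math.exp(-defect_density * area)


def yield_with_spares(n_tiles: int, spares: int, tile_yield: float) -> float:
    """Probability that at most `spares` of `n_tiles` identical tiles are defective."""
    p_bad = 1.0 - tile_yield
    return sum(
        math.comb(n_tiles, k) * p_bad**k * tile_yield ** (n_tiles - k)
        for k in range(spares + 1)
    )


# Hypothetical numbers: 0.1 defects per cm^2, 400 cm^2 of silicon in total.
DEFECT_DENSITY = 0.1
TOTAL_AREA_CM2 = 400.0
N_TILES = 1000

monolithic = poisson_yield(DEFECT_DENSITY, TOTAL_AREA_CM2)
tile = poisson_yield(DEFECT_DENSITY, TOTAL_AREA_CM2 / N_TILES)
tiled = yield_with_spares(N_TILES, spares=60, tile_yield=tile)

print(f"One giant die, no redundancy: {monolithic:.2e}")
print(f"1000 tiles with 60 spares:    {tiled:.4f}")
```

With no redundancy a single defect scraps the whole part, so yield collapses as area grows. Tolerating a modest number of bad tiles restores it in this toy model.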
  7. 14:44 – 17:43

    What customers optimize for: latency vs batch, and why “milliseconds matter”

    They discuss how performance priorities change by application: accuracy and speed in high-stakes or interactive settings versus cost in batch jobs. Feldman argues fast inference unlocks entirely new product categories, drawing analogies to Netflix’s evolution as bandwidth improved.

    • Optimization depends on use case: accuracy, speed, efficiency, or cost
    • Interactive AI experiences are highly latency-sensitive
    • Faster inference enables new applications, not just cheaper existing ones
    • Historical analogy: broadband transformed media distribution and product design
  8. 17:43 – 20:40

    The inference growth equation and why demand can explode

    Feldman offers a simple model for inference market size: users × usage frequency × compute per use. He argues all three are rising simultaneously, creating exceptional growth, and claims AI shifted from novelty to daily workflow utility in late 2024.

    • Training creates AI; inference is how AI is consumed
    • Inference demand scales with users, frequency, and compute per interaction
    • All three drivers are currently increasing at once
    • AI crossed from “cool” to “useful” in everyday workflows around late 2024
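
Because the three drivers multiply, their growth rates compound. A minimal sketch, with hypothetical function names and growth factors that are not figures from the episode:

```python
def inference_demand(users: float, uses_per_user: float, compute_per_use: float) -> float:
    """Total inference compute demanded: users x usage frequency x compute per use."""
    return users * uses_per_user * compute_per_use


def combined_growth(user_growth: float, frequency_growth: float, compute_growth: float) -> float:
    """Overall demand multiplier when each driver grows by its own factor."""
    return user_growth * frequency_growth * compute_growth


# Hypothetical scenario: every driver doubles over the same period.
print(combined_growth(2, 2, 2))  # 8
```

If each driver doubles, demand grows eightfold rather than sixfold, which is the sense in which simultaneous growth in all three makes the market expand so quickly.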
  9. 20:40 – 24:38

    Energy, datacenters, and infrastructure constraints: power is abundant but misallocated

    Harry raises whether the world can meet inference’s energy footprint; Feldman says the industry must deliver commensurate societal value. He argues the U.S. has power in the wrong places and that local regulation and permitting slow data center and grid buildout, while regions like Texas benefit from fewer constraints.

    • AI compute is power-intensive; societal value must justify the cost
    • U.S. power availability isn’t aligned with optimal data center locations
    • Local regulation/permitting can block large infrastructure projects
    • Some regions accelerate builds by reducing bureaucratic friction
  10. 24:38 – 26:19

    Why inference will get cheaper: hardware, datacenter efficiency, and better algorithms

    Feldman breaks inference cost into data center OpEx and hardware CapEx, then adds a third lever: algorithmic efficiency. He argues today’s inference often achieves very low GPU utilization, implying large headroom for improved utilization and lower costs via software and architectural changes.

    • Inference cost drivers: power/space (OpEx) + hardware (CapEx)
    • Each hardware generation's performance gains reduce cost if throughput rises
    • Algorithmic efficiency can materially increase utilization
    • Claim: inference on GPUs is often ~5–7% utilized, implying large waste
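
To see why utilization matters so much, here is a minimal sketch of cost per token as fixed hourly cost divided by useful throughput. The function name and all dollar and throughput figures are hypothetical. Only the roughly 5% utilization level echoes the claim above.

```python
def cost_per_million_tokens(
    hourly_capex: float,
    hourly_opex: float,
    peak_tokens_per_s: float,
    utilization: float,
) -> float:
    """Cost of one million tokens when hardware runs at a fraction of peak throughput."""
    tokens_per_hour = peak_tokens_per_s * utilization * 3600
    return (hourly_capex + hourly_opex) / tokens_per_hour * 1e6


# Hypothetical system: $2.00/h amortized hardware, $0.50/h power and space,
# 10,000 tokens/s at peak. None of these figures come from the episode.
low = cost_per_million_tokens(2.0, 0.5, 10_000, utilization=0.05)
high = cost_per_million_tokens(2.0, 0.5, 10_000, utilization=0.50)

print(f"At  5% utilization: ${low:.2f} per million tokens")
print(f"At 50% utilization: ${high:.2f} per million tokens")
print(f"Cost ratio: {low / high:.0f}x")
```

The hardware and data center bills are the same in both cases, so in this model cost per token falls in direct proportion to any utilization gained through better software or algorithms.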
  11. 26:19 – 30:07

    “Scaling laws are not over”: sparsity, MoE, and moving past transformers

    Feldman rejects the idea that scaling laws are exhausted and points to inference-time scaling (more compute, better answers). He highlights inefficiencies like all-to-all connectivity and suggests increased sparsity and new architectures beyond transformers will reduce compute while improving capability.

    • Senior ML thinkers expect major algorithmic improvements ahead
    • Inference-time scaling can still improve quality (more compute → better answers)
    • MoE and sparsity reduce unnecessary weight usage per token
    • Transformers have known weaknesses; new model classes likely emerge
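
The sparsity point can be illustrated with a mixture-of-experts (MoE) layer that routes each token to only the top k of its experts. The sketch below counts the parameters actually used per token. The model sizes are hypothetical, and the formula assumes equally sized experts.

```python
def active_params_per_token(
    shared_params: float,
    total_expert_params: float,
    num_experts: int,
    top_k: int,
) -> float:
    """Parameters touched per token when only top_k of num_experts are active."""
    return shared_params + total_expert_params * top_k / num_experts


# Hypothetical model: 10B shared parameters plus 60B spread over 8 experts,
# with each token routed to 2 experts.
total = 10e9 + 60e9
active = active_params_per_token(10e9, 60e9, num_experts=8, top_k=2)

print(f"Total parameters:  {total / 1e9:.0f}B")
print(f"Active per token:  {active / 1e9:.0f}B")
print(f"Fraction used:     {active / total:.0%}")
```

Fewer active weights per token means less data to move and fewer operations, which is how sparsity can cut compute without shrinking the model's total capacity.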
  12. 30:07 – 34:45

    Synthetic data’s future: ‘almost all’ training data and why it can be higher value

    Asked about data, Feldman predicts training will rely overwhelmingly on synthetic data within five years. He argues synthetic data is powerful because it can oversample rare, high-learning-value scenarios—like flight simulators training emergency conditions rather than routine cruising.

    • Prediction: training data becomes “almost all synthetic”
    • Synthetic data can target rare/high-value learning situations
    • Analogy: pilot simulators emphasize takeoffs, landings, and failures
    • Synthetic generation fills gaps where real-world data is scarce or expensive
  13. 34:45 – 38:07

    DeepSeek, open source shockwaves, and the distillation debate

    Feldman praises DeepSeek as a focused engineering achievement that delivered measurable improvements without massive headcount. He argues distillation isn’t inherently wrong and notes the unusually fast, industry-wide impact of DeepSeek’s open release.

    • DeepSeek impressed him as disciplined engineering, not “model intellectualism”
    • Open-source release created immediate, broad technical impact
    • Distillation framed as analogous to summarization; not inherently unethical
    • Consistency matters: critiques of distillation vs critiques of training data usage
  14. 38:07 – 48:18

    Where value accrues: hardware defensibility, CUDA lock-in skepticism, and Nvidia’s future share

    The discussion turns to moats and defensibility: Feldman argues dominant market share itself is a moat, using Intel as an example. He dismisses CUDA lock-in for inference, credits frameworks like PyTorch with disintermediating CUDA, and predicts Nvidia’s share will fall but remain majority over five years.

    • Enduring value requires both current advantage and a durable trajectory
    • Market-share leadership can be a powerful, under-discussed moat
    • Claim: no meaningful CUDA lock-in in inference; switching is easy at the service layer
    • Prediction: Nvidia declines from “almost all” share to ~50–60% in five years
  15. 48:18 – 56:19

    Cerebras business mechanics: cashflow positivity, G42 concentration, and why go public

    Feldman explains cashflow positivity as a function of real differentiation showing up in gross margins, criticizing negative-gross-margin “commodity” businesses. He addresses the revenue concentration risk with G42, framing the relationship as experience in building strategic partnerships that Cerebras can repeat, and gives reasons for pursuing public-company status.

    • Gross margins signal differentiation; negative gross margin implies commoditization
    • G42 concentration is both a risk and a beachhead for repeatable partnerships
    • Operational learning: deploying huge clusters, hardening software, scaling manufacturing
    • Reasons to go public: organizational readiness, enterprise preference, and category leadership
  16. 56:19 – 1:05:23

    Geopolitics: export controls, Trump vs Biden on AI, and China’s underestimated capabilities

    Feldman argues hardware compliance is more tractable than software control but warns policy has unintended consequences that can accelerate adversaries’ capabilities. He says the current administration is better for AI, explains Cerebras’ choice not to sell to China on ethical grounds, and strongly warns against underestimating China’s engineering and industrial policy strength.

    • Export controls are difficult; unintended consequences can spur local substitution
    • Hardware is easier to track than software/open-source diffusion
    • He views the current (Trump) administration as more supportive of AI and big tech
    • China is ‘100%’ underestimated: infrastructure, talent pipeline, policy execution
  17. 1:05:23 – 1:14:36

    Quick-fire: contrarian beliefs, Nvidia’s key threat, and lessons from leadership mistakes

    In rapid Q&A, Feldman shares a contrarian geopolitical view about Middle East stability and reiterates that Nvidia’s inference weakness is architectural. He also names a key personal mistake (resisting water cooling), broadens it into a lesson on decision-making humility, and comments on where investment is underappreciated (edge inference near sensors).

    • Contrarian belief: Middle East peace may be closer due to business-focused states
    • Underrated Nvidia threat: GPU off-chip-memory architecture for inference
    • Personal lesson: fought water cooling early; later became industry standard
    • Underappreciated market: tiny, sub-milliwatt edge inference chips for sensors/robotics
