The Twenty Minute VC
Andrew Feldman, Cerebras Co-Founder and CEO: The AI Chip Wars & The Plan to Break Nvidia's Dominance
At a glance
WHAT IT’S REALLY ABOUT
Cerebras CEO on Beating Nvidia, AI Inference, and Future Architectures
- Andrew Feldman, co-founder and CEO of Cerebras, explains how AI’s explosive growth is reshaping chip design, data centers, and the balance of power in the semiconductor industry. He details why current GPU architectures—especially their reliance on off-chip HBM—are fundamentally suboptimal for large-scale inference and how Cerebras’ wafer-scale SRAM-based design aims to outcompete Nvidia on speed, efficiency, and cost. Feldman argues that we are still early across compute, algorithms, and data, with vast gains ahead from better architectures, more efficient models, and synthetic data. He also discusses export controls, China’s capabilities, the economics of AI infrastructure, and why he believes hardware providers will ultimately be more valuable than model companies over a five-year horizon.
IDEAS WORTH REMEMBERING
5 ideas
GPU architecture is fundamentally misaligned with large-scale inference needs.
GPUs were built around off-chip HBM, which is high-capacity but relatively slow; generative inference requires repeatedly moving massive model weights from memory to compute, making memory bandwidth and off-chip I/O a hard bottleneck that wastes power and limits utilization.
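A back-of-the-envelope sketch of why this bottleneck bites: when every generated token must stream all of a dense model's weights from memory, single-stream decode speed is capped by bandwidth divided by weight bytes. The model size and bandwidth figures below are illustrative assumptions, not numbers from the episode.

```python
# Memory-bound ceiling for autoregressive decoding: each token requires
# streaming the full set of (active) weights from memory to compute, so
# tokens/sec <= memory_bandwidth / weight_bytes. Figures are assumptions.

def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       mem_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return mem_bandwidth_tb_s * 1e12 / weight_bytes

# Assumed: a 70B-parameter dense model in fp16 (2 bytes/param) on ~3.35 TB/s HBM.
print(f"~{max_tokens_per_sec(70, 2, 3.35):.0f} tokens/s ceiling")  # ~24 tokens/s
```

No matter how many FLOPs the chip offers, the decode loop cannot beat this ceiling without batching, quantization, or keeping the weights somewhere faster.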
Wafer-scale chips with on-chip SRAM can dramatically improve inference performance and efficiency.
By using an entire wafer packed with fast SRAM and many identical tiles, with redundant tiles routed around defects to overcome yield problems, Cerebras keeps far more data on-chip, cuts off-chip I/O power, and can run large models on far fewer devices, unlocking speed and power advantages over GPU clusters.
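Plugging approximate public bandwidth figures (treated here as assumptions, and ignoring interconnect, batching, and kernel efficiency) into the same memory-bound ceiling shows the scale of the gap Feldman is pointing at:

```python
# Same tokens/sec ceiling as above, contrasting off-chip HBM with aggregate
# on-chip SRAM. Bandwidths are rough public figures, used only for scale;
# a 70B fp16 model would be sharded across devices in either case.

HBM_TB_S = 3.35       # assumed HBM bandwidth of a modern datacenter GPU, TB/s
SRAM_TB_S = 21_000.0  # assumed aggregate SRAM bandwidth of a wafer-scale chip, TB/s

weight_bytes = 70e9 * 2  # 70B parameters at 2 bytes each (fp16)

for name, tb_s in [("off-chip HBM", HBM_TB_S), ("wafer-scale SRAM", SRAM_TB_S)]:
    print(f"{name:>16}: ~{tb_s * 1e12 / weight_bytes:,.0f} tokens/s ceiling")
```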
AI algorithms and utilization are still highly inefficient, leaving large headroom for improvement.
Current inference workloads often leave GPUs 93–95% idle, and most models are dense (all-to-all connected), touching every weight for every token; algorithmic advances such as sparsity, MoEs, and better architectures can both cut compute waste and improve quality, driving cost per token down significantly over time.
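The MoE point can be made concrete with toy arithmetic: per-token weight traffic (and thus the memory-bound cost above) tracks the parameters a token actually activates, not the model's total size. The architecture numbers below are hypothetical.

```python
# Mixture-of-experts: each token is routed to a few experts, so only the
# shared weights plus the chosen experts move per token. Hypothetical sizes.

def moe_params(shared_b: float, n_experts: int,
               experts_per_token: int, expert_size_b: float):
    """Return (total, active) parameter counts in billions."""
    total = shared_b + n_experts * expert_size_b
    active = shared_b + experts_per_token * expert_size_b
    return total, active

total, active = moe_params(shared_b=8, n_experts=64,
                           experts_per_token=2, expert_size_b=2.0)
print(f"total {total:.0f}B, active {active:.0f}B "
      f"-> ~{total / active:.1f}x less weight traffic per token")  # ~11.3x
```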
Inference will dwarf training as AI shifts from novelty to daily workflow.
Feldman frames inference demand as the number of users × usage frequency × compute per use; all three are rising simultaneously as AI moves into mainstream workflows, meaning inference capacity and efficiency will dominate infrastructure spending in the coming years.
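That framing is literally a three-factor product, so growth in each factor compounds multiplicatively. A minimal sketch with made-up growth rates:

```python
# Feldman's inference-demand framing: demand = users x usage frequency x
# compute per use. Growth rates below are invented to show the compounding.

def inference_demand(users: float, uses_per_day: float,
                     tokens_per_use: float) -> float:
    """Tokens of inference demanded per day."""
    return users * uses_per_day * tokens_per_use

base  = inference_demand(1e8,     2,     1e3)        # assumed starting point
later = inference_demand(1e8 * 3, 2 * 4, 1e3 * 10)   # all three factors rise

print(f"~{later / base:.0f}x demand growth")  # 3 * 4 * 10 = 120x
```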
Synthetic data will likely become the dominant source for training.
He expects most training data to be synthetic within five years, particularly for rare or dangerous scenarios (like edge cases in flying or driving), where simulators can generate high-value examples more efficiently than collecting real-world data.
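The efficiency argument reduces to rare-event arithmetic: if an edge case occurs with probability p in the real world, collecting N examples of it means logging roughly N/p real samples, while a simulator can emit the targeted cases directly. The numbers below are illustrative assumptions.

```python
# Cost of collecting rare edge cases from real-world data versus generating
# them in a simulator. Both figures are assumptions for illustration.

p = 1e-6         # assumed real-world frequency of the edge case
n_needed = 1e4   # assumed number of examples needed for training

print(f"real-world logging: ~{n_needed / p:.0e} samples to capture the event")  # ~1e+10
print(f"simulator: ~{n_needed:.0e} targeted examples generated directly")       # ~1e+04
```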
WORDS WORTH SAVING
5 quotes
Our AI algorithms today are not particularly efficient. In a GPU, most of the time it's doing inference, it's 5 or 7% utilized.
— Andrew Feldman
The fundamental architecture of the GPU with off-chip memory is not great for inferencing… it can be beaten and I think they know it.
— Andrew Feldman
We won't be as dependent on transformers in three years or five years as we are now, 100%.
— Andrew Feldman
There are no examples in compute in 50 years in which by making things cheaper and faster, the market got smaller.
— Andrew Feldman
In five years from now Nvidia's gonna have somewhere between 50 and 60% of the market. Right now they have approximately all of it.
— Andrew Feldman