No Priors Ep. 77 | With Foundry CEO and Founder Jared Quincy Davis

In this episode of No Priors, hosts Sarah and Elad are joined by Jared Quincy Davis, former DeepMind researcher and the Founder and CEO of Foundry, a new AI cloud computing service provider. They discuss the research problems that led him to start Foundry, the current state of GPU cloud utilization, and Foundry's approach to improving cloud economics for AI workloads. Jared also touches on his predictions for the GPU market and the thinking behind his recent paper on designing compound AI systems.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @jaredq_

Show Notes:
00:00 Introduction
02:42 Foundry background
03:57 GPU utilization for large models
07:29 Systems to run a large model
09:54 Historical value proposition of the cloud
14:45 Sharing cloud compute to increase efficiency
19:17 Foundry’s new releases
23:54 The current state of GPU capacity
29:50 GPU market dynamics
36:28 Compound systems design
40:27 Improving open-ended tasks

Sarah Guo (host) · Jared Quincy Davis (guest) · Elad Gil (host)

Aug 22, 2024 · 42m

CHAPTERS

0:00 – 2:41
Why Foundry exists: democratizing “DeepMind/OpenAI-level” compute leverage
Sarah and Elad introduce Jared Quincy Davis and set up Foundry’s core motivation: making the computational leverage behind breakthroughs like AlphaFold 2 and ChatGPT accessible beyond a few well-capitalized labs. Jared frames the opportunity as lowering the cost and friction of AI compute so more teams can produce breakthrough results.
- AlphaFold 2 and ChatGPT as examples of small teams with massive compute leverage
- The “David vs. Goliath” narrative breaks down when you consider compute budgets and platform advantages
- Foundry’s mission: reimagine cloud infrastructure end-to-end for AI workloads
- Goal of driving orders-of-magnitude cost reductions over time to increase breakthrough frequency
2:41 – 3:56
Foundry’s product offering: an AI-native public cloud with 12–20× economics gains
Jared explains what Foundry sells and how it differs from both hyperscalers and newer GPU clouds. The pitch centers on redesigning the stack from first principles for AI to improve price/performance, reliability, security, and elasticity.
- Public cloud built specifically for AI workloads
- Re-architecting components ‘from first principles’ rather than incremental tweaks
- Claimed 12–20× improved economics vs. legacy clouds and some GPU clouds
- Core offering: infrastructure-as-a-service plus operational tooling (security, reliability, elasticity)
3:56 – 5:58
GPU utilization reality check: why even top training runs often sit below 80%
Elad asks about utilization across different GPU “owners,” and Jared focuses first on the best-case scenario: large pretraining. Even there, utilization can be surprisingly low due to failures, operational buffers, and gaps between workloads.
- Even sophisticated pretraining runs can be sub-80% utilization
- Teams reserve 10–20%+ of GPUs as ‘healing buffer’ to handle failures
- Bad hardware batches and frequent failures can drive utilization below 50%
- Idle time also comes from intermissions between major training runs
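The arithmetic behind these figures can be sketched as follows. This is a rough model, not something stated in the episode: it assumes utilization is simply the share of GPUs outside the healing buffer multiplied by the share of time those GPUs are busy. The function name and the example fractions are illustrative assumptions.

```python
def effective_utilization(buffer_fraction: float, idle_fraction: float) -> float:
    """Fraction of purchased GPU-hours that do useful work.

    buffer_fraction: share of GPUs held back as a healing buffer (assumed idle).
    idle_fraction:   share of time the remaining GPUs sit idle, e.g. during
                     failure recovery or gaps between training runs.
    """
    if not (0.0 <= buffer_fraction < 1.0 and 0.0 <= idle_fraction < 1.0):
        raise ValueError("fractions must be in [0, 1)")
    return (1.0 - buffer_fraction) * (1.0 - idle_fraction)


if __name__ == "__main__":
    # Hypothetical scenarios: a healthy run, a typical run, a bad hardware batch.
    for buffer, idle in [(0.10, 0.05), (0.15, 0.10), (0.20, 0.40)]:
        util = effective_utilization(buffer, idle)
        print(f"buffer={buffer:.0%} idle={idle:.0%} -> utilization={util:.1%}")
```

Under these assumptions, a 15% buffer plus 10% idle time already lands below 80%, and a 20% buffer with 40% idle time lands below 50%.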
5:58 – 7:41
H100s aren’t ‘chips’—they’re complex systems, and failures scale with cluster size
Jared argues that common mental models of GPUs as simple chips are outdated. Modern GPU platforms (DGX/HGX-class) are heavy, component-dense systems; at supercomputer scale, failure probability compounds across millions of components.
- H100-class boxes are 70–80 pounds with tens of thousands of components
- NVIDIA compresses ‘data center-like’ infrastructure into a single system
- At thousands-to-hundreds-of-thousands of nodes, failures become inevitable
- Tooling to manage these failure modes is still immature, encouraging over-buffering
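Why failures become inevitable at scale can be seen with a small calculation. The sketch below assumes every component fails independently with the same probability over some time window, which is a simplification, and the per-component failure rate used in the example is a made-up illustrative number, not a figure from the episode.

```python
import math


def p_any_failure(n_components: int, p_fail: float) -> float:
    """P(at least one of n independent components fails) = 1 - (1 - p)^n."""
    if n_components < 0 or not (0.0 <= p_fail < 1.0):
        raise ValueError("need n_components >= 0 and 0 <= p_fail < 1")
    # log1p/expm1 keep precision when p_fail is tiny and n_components is huge.
    return -math.expm1(n_components * math.log1p(-p_fail))


if __name__ == "__main__":
    p = 1e-6  # hypothetical chance that one component fails in a given window
    for n in (10_000, 1_000_000, 10_000_000):
        print(f"{n:>10,} components -> P(any failure) = {p_any_failure(n, p):.4f}")
```

With these assumed numbers, a failure somewhere in the cluster goes from rare at ten thousand components to near certain at ten million, even though each individual part is very reliable.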
7:41 – 9:47
What counts as a ‘large model’: when orchestration becomes a distributed systems problem
Jared defines the “large regime” as the point where even holding model weights requires multiple state-of-the-art GPUs/nodes. From there, training and sometimes inference become synchronized distributed computations, increasing sensitivity to component and network failures.
- “Large” begins when one GPU/node can’t contain the model weights
- Cluster orchestration is required for a single synchronized computation
- Distributed systems constraints become central to model scaling
- Networking (e.g., InfiniBand/Mellanox) is a key enabler and a key failure surface
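A back-of-the-envelope check for when a model enters this “large regime” might look like the sketch below. It counts weights only (no activations, optimizer state, or KV cache, all of which push the real requirement higher) and assumes 16-bit weights and 80 GB of memory per GPU, as on an H100. The parameter counts are illustrative examples, not figures from the episode.

```python
import math


def min_gpus_for_weights(n_params: float,
                         bytes_per_param: int = 2,
                         gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPUs needed just to hold the weights in memory."""
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / gpu_memory_gb)


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
        gpus = min_gpus_for_weights(n_params)
        print(f"{name:>5} params at 16-bit -> at least {gpus} x 80 GB GPU(s)")
```

Under these assumptions a 7B model fits on one GPU, a 70B model needs at least two, and a 405B model exceeds the 640 GB available in a single 8-GPU node, so the computation has to span nodes and the network.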
9:47 – 16:09
Cloud’s original promise vs. today’s AI cloud: elasticity is missing, so it’s ‘colo-like’
Sarah raises the CapEx/OpEx and depreciation mismatch in GPU compute; Jared broadens it into a critique of current AI cloud. He argues AI cloud often behaves like co-location/hosting with long-term reservations, not true elastic cloud computing.
- Current AI cloud often resembles co-location/hosting more than classic cloud
- Historical cloud arc: AWS origins (S3/EC2) and gradual enterprise adoption
- Key cloud insight: “fast is free” when elasticity lets you scale up then release resources
- AI workloads often forced into multi-year fixed reservations, misaligned with real demand
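The “fast is free” point can be made concrete with a small worked example. The sketch assumes a perfectly parallel job and a flat pay-per-hour price, both idealizations; the price and job size are hypothetical.

```python
def job_cost(machine_hours_needed: float,
             n_machines: int,
             price_per_machine_hour: float) -> tuple[float, float]:
    """Return (wall_clock_hours, total_cost) for a perfectly parallel job
    on an elastic cloud that bills only for the hours actually used."""
    wall_clock_hours = machine_hours_needed / n_machines
    total_cost = n_machines * wall_clock_hours * price_per_machine_hour
    return wall_clock_hours, total_cost


if __name__ == "__main__":
    work = 1000.0   # hypothetical job size in machine-hours
    price = 2.0     # hypothetical price per machine-hour
    for n in (1, 10, 1000):
        hours, cost = job_cost(work, n, price)
        print(f"{n:>5} machines: {hours:>7.1f} h wall clock, cost ${cost:,.0f}")
```

The total cost is the same whether the job runs on one machine for 1,000 hours or 1,000 machines for one hour. That equivalence only holds when capacity can be released afterwards; under a fixed multi-year reservation the machines are paid for whether or not they are in use.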
16:09 – 19:22
Why long-term GPU commitments are risky: supply-chain bottlenecks and no hedging markets
Sarah contrasts pre-cloud infrastructure models with today’s AI compute reality; Jared agrees the ecosystem sits between colo and hosting. He highlights how upfront commitments and immature financial/market mechanisms increase risk for AI builders.
- Developers are re-encountering hardware supply-chain constraints
- Upfront capital and long-duration contracts create catastrophic downside if demand shifts
- No mature analogs to commodities hedging (options/futures) for GPU capacity
- Foundry seeks technical + business-model leverage without taking undue GPU inventory risk
19:22 – 22:52
New releases: ‘AI cloud is a parking lot business’ and the case for GPU spot usability
Jared introduces Foundry’s releases via a parking-lot analogy: on-demand is expensive and unreliable; reserved is cheaper but underused. Foundry aims to let others safely use reserved capacity (spot) with automation that makes preemption practical at scale.
- Two regimes: pricey on-demand vs discounted-but-underutilized reservations
- Vision: enable pay-as-you-go use of others’ reserved capacity (win-win-win economics)
- Core need: automation so reserved users aren’t disrupted when they reclaim capacity
- Spot is harder with GPUs, especially at scale and with interconnect, so usability features matter
22:52 – 23:46
Spot in practice: enabling training, batch inference, and better utilization of buffers
Jared explains how improved spot mechanisms expand what workloads can run preemptibly. He emphasizes batch inference and synthetic data generation as increasingly important workloads that are more horizontally scalable and often less interconnect-dependent than giant pretraining.
- Companies use spot not just for inference but increasingly for training too
- Batch inference is a strong fit: horizontally parallel and interruption-tolerant
- Spot and reliability tooling interact: buffers can be ‘packed’ with preemptible nodes
- Workload mix is shifting as compound/agentic workflows become more common
23:46 – 28:13
Where the world’s GPU capacity actually is: it’s not mostly in hyperscalers
Sarah asks about global GPU capacity and the shortage; Jared counters common assumptions with anecdotes and back-of-the-envelope comparisons. He argues there’s far more compute in the world than people think, but much of it is inaccessible, unnetworked, or poorly utilized.
- Major clouds own only a tiny fraction of global GPU compute by some measures
- GPT-3 training example: 10,000 V100s for ~14.6 days on Azure
- Ethereum at peak implied ~10–20 million V100-equivalent GPUs running 24/7
- Even high-end H100 utilization can be ~20–25% per some datasets; accessibility is the bottleneck
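The comparison between the GPT-3 and Ethereum figures above can be worked through directly. The sketch takes the numbers quoted in the episode at face value and ignores interconnect, memory, and software differences, so it measures raw GPU-days only, not whether that capacity could actually train such a model.

```python
GPT3_GPUS = 10_000               # V100s, as quoted in the episode
GPT3_DAYS = 14.6                 # approximate training duration, as quoted
ETH_GPUS_LOW = 10_000_000        # V100-equivalents at Ethereum's peak (low estimate)
ETH_GPUS_HIGH = 20_000_000       # V100-equivalents at Ethereum's peak (high estimate)


def gpt3_runs_per_day(eth_gpus: int) -> float:
    """GPT-3-sized compute budgets consumed per day by eth_gpus running 24/7."""
    gpt3_gpu_days = GPT3_GPUS * GPT3_DAYS
    return eth_gpus / gpt3_gpu_days


if __name__ == "__main__":
    print(f"GPT-3 budget: {GPT3_GPUS * GPT3_DAYS:,.0f} V100-days")
    for eth_gpus in (ETH_GPUS_LOW, ETH_GPUS_HIGH):
        runs = gpt3_runs_per_day(eth_gpus)
        print(f"{eth_gpus:,} GPUs -> about {runs:.0f} GPT-3 budgets per day")
```

On these figures, the Ethereum network at its peak was spending roughly 68 to 137 GPT-3 training budgets of raw compute every day.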
28:13 – 29:50
MARS: Monitoring, Alerting, Resiliency & Security to reduce perceived failures and downtime
Jared describes MARS, an internal tool suite productized to increase GPU availability and operational reliability. He connects MARS to spot strategy: maintaining aggressive healing buffers improves user experience, and spot can monetize/optimize those buffers.
- MARS improves uptime via monitoring, alerting, resiliency, and security
- Healing buffers allow automatic replacement when GPUs fail without user disruption
- Spot helps make reserved ‘buffer’ capacity economically efficient
- Third parties can expose their healing buffers through Foundry to offset cluster cost
29:50 – 31:58
GPU market dynamics and the scaling future: big clusters are scarce, but paradigms are shifting
Sarah asks how market dynamics shape Foundry’s strategy; Jared says large, state-of-the-art interconnected clusters will remain valuable, yet they’re constrained by power/space/interconnect. He argues progress will increasingly come from shifting computation across training/inference and using techniques that don’t always require giant synchronized clusters.
- Severe scarcity of very large, tightly interconnected clusters
- Scaling laws require continued 2×/10× compute to get incremental gains; it gets hard fast
- Emerging innovations include training across facilities/data centers
- Shift toward approaches that reduce dependence on massive interconnect-heavy training runs
31:58 – 36:18
Compound AI systems: synthetic data, distillation, and massive parallel inference as the new lever
Jared lays out examples—Phi-3, LLaMA 3.1 distillation, AlphaCode, AlphaGeometry—where performance comes from system design rather than only bigger monolithic models. These workflows often trade synchronized training for huge amounts of parallelizable inference and filtering, reshaping infrastructure needs.
- Phi-3: small model + high-quality data curation as an alternative path
- LLaMA 3.1: synthetic data + distillation from a larger model into smaller variants
- AlphaCode-style: generate massive candidate sets (e.g., ~1M) and filter/verify
- Life-cycle cost framing: sometimes ‘overtrain’ small models to reduce inference cost later
36:18 – 40:16
The paper: verifiability-driven system design (best-of-K + strong verifiers) to boost accuracy
Elad prompts discussion of Jared’s paper on compound AI system design. Jared explains the key idea: for verifiable tasks—where checking is cheaper than generating—you can generate many candidates in parallel and use verifiers/judges (models, tests, simulators) to select better answers, yielding large performance gains.
- Core principle: exploit task verifiability (cheap checking vs expensive generation)
- Embarrassingly parallel candidate generation + best-of-K selection
- Empirical gains: prime factorization jump (3.7%→36.6%); ~3% MMLU lift in technical domains
- Verifier can be a model, unit tests, a simulator, or other checking mechanisms
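The generate-then-verify pattern described above can be illustrated with a toy factorization task. This is a minimal sketch of the general idea, not the method or code from Jared's paper: `noisy_factorizer` is a hypothetical stand-in for a model call that guesses at random, the loop runs sequentially where a real system would generate candidates in parallel, and the verifier is a single multiplication.

```python
import random
from typing import Optional


def noisy_factorizer(n: int, rng: random.Random) -> tuple[int, int]:
    """Stand-in for an expensive, unreliable generator (e.g. a model call)."""
    a = rng.randint(2, n - 1)
    return a, n // a


def verify(n: int, candidate: tuple[int, int]) -> bool:
    """Cheap check: a non-trivial factor pair must multiply back to n."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def best_of_k(n: int, k: int, rng: random.Random) -> Optional[tuple[int, int]]:
    """Draw up to k candidates and return the first one the verifier accepts."""
    for _ in range(k):
        candidate = noisy_factorizer(n, rng)
        if verify(n, candidate):
            return candidate
    return None


def success_rate(n: int, k: int, trials: int, seed: int = 0) -> float:
    """Empirical fraction of trials in which best-of-k finds a valid answer."""
    rng = random.Random(seed)
    hits = sum(best_of_k(n, k, rng) is not None for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for k in (1, 4, 16):
        print(f"k={k:>2}: success rate = {success_rate(15, k, trials=2000):.2f}")
```

If a single candidate is correct with probability p and candidates are independent, best-of-K with a perfect verifier succeeds with probability 1 - (1 - p)^K, which is why accuracy climbs quickly with K when checking is cheap and reliable.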
40:16 – 42:41
Applying these ideas to open-ended tasks: ensembles, multi-stage pipelines, and ‘networks of networks’
Sarah asks about extending verifiability and compound design to open-ended problems. Jared predicts large multi-stage systems that call multiple models (Claude/Gemini/GPT-4), plus heuristics and simulators, using repeated generation-and-selection to improve outcomes—especially in code and other partially verifiable domains.
- Future systems may use many stages, each with its own best-of-K loops
- Ensembling multiple frontier models to exploit complementary strengths
- Hybridizing with classical heuristics, simulators, and test suites for verification
- Expectation: more batch inference and synthetic data generation, reducing reliance on mega-clusters
