Inference, Diffusion, World Models, and More | YC Paper Club

Even if you're a current PhD student, it's hard to keep up with the latest AI research. That's why we started YC Paper Club, a small group of researchers, engineers, and founders who will meet every two weeks this summer to present and discuss new papers together. This recording is from our very first discussion group on May 20th, 2026, at the YC office in Mountain View, CA. Thanks to the following presenters:

0:12 - Intro from YC Visiting Partner Francois Chaubard
3:49 - Tanishq Kumar: Speculative Speculative Decoding (https://arxiv.org/abs/2603.03251)
18:33 - Guangyao (Stannis) Zhou: Diffusion-MPC (https://arxiv.org/abs/2410.05364)
30:26 - Isaac Ward: LeWorldModeling (https://arxiv.org/abs/2603.19312)
43:54 - Akshay Vegesna: Deep Learning is Not So Mysterious or Different (https://arxiv.org/abs/2503.02113)
51:24 - Konwoo Kim: Pretraining Under Infinite Compute (https://arxiv.org/pdf/2509.14786)

Francois Chaubard (host), Tanishq Kumar (guest), Guangyao (Stannis) Zhou (guest), Isaac Ward (guest), Akshay Vegesna (guest), Konwoo Kim (guest)

May 28, 2026 · 1h 7m

WHAT IT’S REALLY ABOUT

YC Paper Club debuts: faster inference, diffusion control, and scaling laws

The event frames inference speed as a future capability bottleneck, not just a cost issue, because higher tokens-per-second throughput enables more test-time “thinking” and therefore higher delivered intelligence.
Speculative Speculative Decoding (SSD) extends speculative decoding by parallelizing drafting and verification across hardware, predicting likely verification outcomes to hide draft latency and materially increase token throughput.
Diffusion-MPC uses diffusion models for both multi-step action proposals and multi-step dynamics modeling, improving long-horizon planning, reducing compounding error, and enabling adaptation to new rewards and changed dynamics at test time.
LeWorldModeling (a JEPA-style approach) proposes SigReg, a Gaussian/isotropy regularizer in latent space to prevent representational collapse, enabling efficient latent-space world modeling, MPC-style planning, and uncertainty/surprise detection.
Two high-level papers argue that (1) deep learning generalization can be explained with classical tools like PAC-Bayes/compression/flat minima, and (2) when data is fixed but compute is abundant, aggressive regularization, ensembling, and distillation yield predictable scaling-law asymptotes and large data-efficiency gains.

IDEAS WORTH REMEMBERING

5 ideas

Inference throughput may directly cap deployable intelligence.

The talk argues that as models rely more on test-time compute (e.g., longer deliberation, RL-as-inference wrappers), tokens-per-second becomes a capability constraint: faster inference allows more “thinking” within latency budgets.
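
The arithmetic behind this claim can be sketched in a few lines. This is only an illustration: the function name, the fixed answer length, and the example numbers are assumptions chosen for the demo, not figures from the talk.

```python
def thinking_token_budget(tokens_per_second: float,
                          latency_budget_s: float,
                          answer_tokens: int = 0) -> int:
    """Tokens left for deliberation inside a fixed latency budget.

    Hypothetical helper: total tokens the model can emit in the budget,
    minus the tokens reserved for the final answer.
    """
    total_tokens = int(tokens_per_second * latency_budget_s)
    return max(total_tokens - answer_tokens, 0)


# Same 10-second budget and 100-token answer, two decoding speeds.
slow = thinking_token_budget(50, 10, answer_tokens=100)
fast = thinking_token_budget(200, 10, answer_tokens=100)
print(slow, fast)
```

Under these made-up numbers, a 4x faster decoder leaves well over 4x the deliberation tokens, because the answer itself takes a fixed share of the budget.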

Speculative decoding works because verifying multiple tokens is parallelizable.

A small “draft” model proposes tokens sequentially, while the large “target” model verifies them in one forward pass; if a token is rejected, the target can often sample an additional “bonus token” at the rejection point without extra passes.
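
A minimal sketch of one draft-then-verify loop follows, under simplifying assumptions: decoding is greedy (the sampling version uses a rejection-sampling acceptance rule instead of an equality check), and `target_next` and `draft_next` are hypothetical toy functions standing in for real models. The loop over positions in step 2 is what a real system would do in a single batched forward pass of the target.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # token prefix -> greedy next token


def target_next(prefix: List[int]) -> int:
    # Hypothetical "large" model: an arbitrary deterministic toy rule.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when
    # the prefix length is a multiple of 4.
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def speculative_step(prefix: List[int], k: int,
                     draft: Model, target: Model) -> List[int]:
    # 1. The draft model proposes k tokens one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))
    # 2. The target scores all k + 1 positions (one pass in a real system).
    target_choices = [target(prefix + drafted[:i]) for i in range(k + 1)]
    # 3. Accept the longest run of drafted tokens the target agrees with.
    n_accept = 0
    while n_accept < k and drafted[n_accept] == target_choices[n_accept]:
        n_accept += 1
    # 4. Append the target's own token at the first disagreement, or after
    #    all k drafts if every one was accepted (the "bonus token").
    return drafted[:n_accept] + [target_choices[n_accept]]


def generate(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(out, k, draft_next, target_next)
        rounds += 1
    return out[:len(prompt) + n_new], rounds


def greedy(prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


tokens, rounds = generate([1, 2, 3], 20)
print(tokens == greedy([1, 2, 3], 20), rounds)
```

In this greedy setting the output matches ordinary target decoding token for token; the saving is that each verification round yields at least one token and often several.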

SSD’s core win is hiding draft latency by predicting verification outcomes.

SSD starts drafting the next round before verification finishes by branching on likely accept-length/bonus-token outcomes; high hit rates (~80–90% cited) let it keep the verifier fed and improve both latency and throughput.
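
The branching idea can be sketched as a small cache keyed by the verification outcome. This is an illustration under stated assumptions, not the paper's implementation: `toy_draft`, `prefetch`, and `next_draft` are hypothetical names, the list of guessed outcomes is supplied by hand, and everything runs sequentially here, whereas the summary describes drafting and verification running in parallel on separate hardware.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[List[int]], int]
Outcome = Tuple[int, int]  # (number of accepted draft tokens, bonus token)


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft model: an arbitrary deterministic toy rule.
    return (sum(prefix) + 1) % 10


def draft_k(prefix: List[int], k: int, draft: Model) -> List[int]:
    out: List[int] = []
    for _ in range(k):
        out.append(draft(prefix + out))
    return out


def prefetch(prefix: List[int], drafted: List[int], guesses: List[Outcome],
             k: int, draft: Model) -> Dict[Outcome, List[int]]:
    # While the verifier is still busy, draft the NEXT round for each
    # guessed outcome of the CURRENT round.
    cache: Dict[Outcome, List[int]] = {}
    for n_accept, bonus in guesses:
        next_prefix = prefix + drafted[:n_accept] + [bonus]
        cache[(n_accept, bonus)] = draft_k(next_prefix, k, draft)
    return cache


def next_draft(prefix: List[int], drafted: List[int], actual: Outcome,
               cache: Dict[Outcome, List[int]], k: int,
               draft: Model) -> Tuple[List[int], bool]:
    # Cache hit: the next draft is already waiting, so draft latency is hidden.
    if actual in cache:
        return cache[actual], True
    # Cache miss: fall back to ordinary drafting after verification.
    n_accept, bonus = actual
    return draft_k(prefix + drafted[:n_accept] + [bonus], k, draft), False


prefix, k = [1, 2], 3
drafted = draft_k(prefix, k, toy_draft)
# Guess 1: everything is accepted and the bonus token is the draft's own guess.
# Guess 2: an arbitrary partial-accept outcome, for illustration only.
guesses = [(k, toy_draft(prefix + drafted)), (2, 0)]
cache = prefetch(prefix, drafted, guesses, k, toy_draft)
print(next_draft(prefix, drafted, guesses[0], cache, k, toy_draft)[1])
print(next_draft(prefix, drafted, (1, 7), cache, k, toy_draft)[1])
```

The fraction of rounds that land in the cache is the hit rate; how the guesses are chosen and how much drafting each one receives is the allocation question.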

Practical SSD hinges on smart compute allocation and cache-miss strategy.

Because predictions of verification outcomes can fail, the system must decide when to fall back to ordinary speculation and how to distribute draft compute across candidate prefixes; naïve equal allocation is suboptimal and affects both hit rate and draft quality.

Diffusion-MPC improves long-horizon planning by modeling sequences, not steps.

Using diffusion for multi-step action proposals and multi-step dynamics reduces compounding error and allows simple sampling-based planning to compete strongly, while preserving MPC’s ability to swap reward functions at test time.
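
The sampling-based planning loop, including the reward swap at test time, can be sketched as follows. The two diffusion models are replaced by loudly labeled stand-ins: `propose_actions` draws uniform random action sequences instead of sampling from a diffusion proposal, and `predict_states` is an exact 1-D toy dynamics function instead of a learned multi-step dynamics model. All names and the toy environment are assumptions made for illustration.

```python
import random
from typing import Callable, List


def propose_actions(rng: random.Random, n_samples: int,
                    horizon: int) -> List[List[float]]:
    # Stand-in for a diffusion action proposal: whole action SEQUENCES
    # are sampled at once (here simply uniform noise in [-1, 1]).
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
            for _ in range(n_samples)]


def predict_states(state: float, actions: List[float]) -> List[float]:
    # Stand-in for a multi-step dynamics model: maps a full action
    # sequence to a full state sequence (toy rule: x_next = x + a).
    states = []
    for a in actions:
        state = state + a
        states.append(state)
    return states


def plan(state: float, reward_fn: Callable[[float], float],
         rng: random.Random, n_samples: int = 64,
         horizon: int = 5) -> List[float]:
    # Score every sampled sequence under the current reward, keep the best.
    candidates = propose_actions(rng, n_samples, horizon)
    return max(candidates,
               key=lambda acts: sum(reward_fn(s)
                                    for s in predict_states(state, acts)))


def run(reward_fn: Callable[[float], float], steps: int = 15,
        seed: int = 0) -> float:
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        best = plan(state, reward_fn, rng)
        state += best[0]  # execute only the first action, then replan
    return state


# The reward is an argument of the planner, so it can be swapped at
# test time without retraining the proposal or dynamics stand-ins.
final_right = run(lambda s: -abs(s - 3.0))
final_left = run(lambda s: -abs(s + 3.0))
print(final_right, final_left)
```

Because the reward only enters at scoring time, the same samplers steer the state in opposite directions under the two reward functions.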

WORDS WORTH SAVING

5 quotes

So the claim I'm gonna make, and maybe this is the one thing to take away from the message I'm trying to send in this talk, is that inference today is seen as a sort of like cost or convenience lever. But, uh, in one, two, or three years, inference is gonna be seen as a capability.

— Tanishq Kumar

If you have a method, an algorithm, a system where its performance scales with the amount of thinking it does- Then fundamentally, the speed at which you can do inference, the tokens per second, is exactly the peak intelligence that you can deliver.

— Tanishq Kumar

The sort of key asymmetry here, the reason that speculation works, is that it is easier to verify than to generate.

— Tanishq Kumar

I wanted to communicate to you all that this is not a new idea at all. It's really just kinda new advertising or packaging on an old idea.

— Isaac Ward

So part of the motivation for this paper is just the fact that over the past, uh, six or seven years, pre-training has continued to improve model capabilities in pretty surprising ways.

— Konwoo Kim

Inference as capability vs cost lever
Speculative decoding and verification bonus tokens
Parallel draft/verify and cache-miss handling in SSD
Diffusion models for multi-step action and dynamics in MPC
Runtime adaptation to new rewards and altered dynamics
World models, representational collapse, and SigReg regularization
Infinite-compute / data-constrained pretraining: ensembling and distillation asymptotes

High quality AI-generated summary created from speaker-labeled transcript.
