Y Combinator
Inference, Diffusion, World Models, and More | YC Paper Club
At a glance
WHAT IT’S REALLY ABOUT
YC Paper Club debuts: faster inference, diffusion control, and scaling laws
- The event frames inference speed as a future capability bottleneck—not just a cost issue—because faster tokens-per-second enable more test-time “thinking” and higher delivered intelligence.
- Speculative Speculative Decoding (SSD) extends speculative decoding by parallelizing drafting and verification across hardware, predicting likely verification outcomes to hide draft latency and materially increase token throughput.
- Diffusion-MPC uses diffusion models for both multi-step action proposals and multi-step dynamics modeling, improving long-horizon planning, reducing compounding error, and enabling adaptation to new rewards and changed dynamics at test time.
- LeWorldModeling (a JEPA-style approach) proposes SigReg, a Gaussian/isotropy regularizer in latent space to prevent representational collapse, enabling efficient latent-space world modeling, MPC-style planning, and uncertainty/surprise detection.
- Two high-level papers argue that (1) deep learning generalization can be explained with classical tools like PAC-Bayes/compression/flat minima, and (2) when data is fixed but compute is abundant, aggressive regularization, ensembling, and distillation yield predictable scaling-law asymptotes and large data-efficiency gains.
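The collapse-prevention idea in the SigReg bullet above can be illustrated with a much cruder penalty. The summary does not give SigReg's actual form, so the sketch below substitutes a plain moment-matching term (batch mean near zero, batch covariance near identity). The function name `isotropy_penalty` and the toy batches are assumptions made for illustration, not the paper's method.

```python
import random

def isotropy_penalty(batch):
    """Simplified stand-in for a Gaussian/isotropy regularizer (an assumption,
    not SigReg itself): squared norm of the batch mean plus the squared
    Frobenius distance between the batch covariance and the identity."""
    n, d = len(batch), len(batch[0])
    mean = [sum(z[j] for z in batch) / n for j in range(d)]
    cov = [[sum((z[i] - mean[i]) * (z[j] - mean[j]) for z in batch) / n
            for j in range(d)] for i in range(d)]
    mean_term = sum(m * m for m in mean)
    cov_term = sum((cov[i][j] - (1.0 if i == j else 0.0)) ** 2
                   for i in range(d) for j in range(d))
    return mean_term + cov_term

rng = random.Random(0)
# Collapsed latents: every input maps to the same point.
collapsed = [[1.0, 1.0] for _ in range(500)]
# Spread-out latents: roughly isotropic Gaussian samples.
spread = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]

print(isotropy_penalty(collapsed), isotropy_penalty(spread))
```

The collapsed batch is penalized heavily because its covariance is zero, while the spread-out batch scores close to zero. That gap is the pressure a regularizer of this kind applies against representational collapse.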
IDEAS WORTH REMEMBERING
5 ideas
Inference throughput may directly cap deployable intelligence.
The talk argues that as models rely more on test-time compute (e.g., longer deliberation, RL-as-inference wrappers), tokens-per-second becomes a capability constraint: faster inference allows more “thinking” within latency budgets.
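The arithmetic behind this claim fits in a few lines. The 30-second latency budget and the two decoding speeds below are illustrative assumptions, not figures from the talk.

```python
def thinking_budget(tokens_per_second: float, latency_budget_s: float) -> int:
    """Tokens a model can generate before the latency budget runs out."""
    return int(tokens_per_second * latency_budget_s)

# Hypothetical numbers, chosen only to show how the budget scales with speed.
slow = thinking_budget(tokens_per_second=50, latency_budget_s=30)
fast = thinking_budget(tokens_per_second=200, latency_budget_s=30)
print(slow, fast)
```

With the latency budget held fixed, a fourfold increase in tokens per second gives four times as many tokens to spend on deliberation.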
Speculative decoding works because verifying multiple tokens is parallelizable.
A small “draft” model proposes tokens sequentially, while the large “target” model verifies them in one forward pass; if a token is rejected, the target can often sample an additional “bonus token” at the rejection point without extra passes.
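The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical deterministic functions standing in for real models, and acceptance means exact agreement with the target's token rather than the sampling-based acceptance rule used in practice.

```python
from typing import List, Tuple

def target_next(prefix: List[int]) -> int:
    """Hypothetical toy 'target' model: a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 7

def draft_next(prefix: List[int]) -> int:
    """Hypothetical toy 'draft' model: agrees with the target most of the time."""
    t = target_next(prefix)
    return t if len(prefix) % 4 else (t + 1) % 7

def speculative_round(prefix: List[int], k: int) -> Tuple[List[int], int]:
    # 1) The draft proposes k tokens one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))
    # 2) The target scores all k+1 positions. A real system does this in one
    #    batched forward pass; the loop here only emulates the result.
    target_tokens = [target_next(prefix + drafted[:i]) for i in range(k + 1)]
    # 3) Accept the longest prefix of draft tokens that matches the target.
    accepted = 0
    while accepted < k and drafted[accepted] == target_tokens[accepted]:
        accepted += 1
    # 4) The target's own token at the stopping point comes from the same pass.
    return drafted[:accepted] + [target_tokens[accepted]], accepted

def generate(prompt: List[int], n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        new_tokens, _ = speculative_round(out, k)
        out.extend(new_tokens)
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds

def greedy(prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

tokens, rounds = generate([1, 2, 3], 20)
```

In this greedy setting the output is identical to decoding with the target alone, but it takes fewer verification rounds than tokens generated. Each round always yields at least one token.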
SSD’s core win is hiding draft latency by predicting verification outcomes.
SSD starts drafting the next round before verification finishes by branching on likely accept-length/bonus-token outcomes; high hit rates (~80–90% cited) let it keep the verifier fed and improve both latency and throughput.
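One way to picture the branching is a cache keyed by the guessed verification outcome. The sketch below is an assumption-laden illustration, not the paper's implementation: `toy_draft`, `predraft`, and `next_round_draft` are hypothetical names, and the guessed outcomes are hard-coded rather than predicted by a model.

```python
from typing import Dict, List, Tuple

Outcome = Tuple[int, int]  # (accept_length, bonus_token)

def toy_draft(prefix: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for a draft model: deterministic next-token rule."""
    out = list(prefix)
    for _ in range(k):
        out.append((out[-1] * 3 + 1) % 11)
    return out[len(prefix):]

def predraft(prefix: List[int], drafted: List[int],
             guesses: List[Outcome], k: int) -> Dict[Outcome, List[int]]:
    """While the verifier is busy, draft the next round for each guessed outcome."""
    cache: Dict[Outcome, List[int]] = {}
    for accept_len, bonus in guesses:
        next_prefix = prefix + drafted[:accept_len] + [bonus]
        cache[(accept_len, bonus)] = toy_draft(next_prefix, k)
    return cache

def next_round_draft(cache: Dict[Outcome, List[int]], prefix: List[int],
                     drafted: List[int], actual: Outcome,
                     k: int) -> Tuple[List[int], bool]:
    """On a hit, reuse the pre-drafted tokens; on a miss, draft after the fact."""
    if actual in cache:
        return cache[actual], True
    accept_len, bonus = actual
    return toy_draft(prefix + drafted[:accept_len] + [bonus], k), False

prefix = [1, 2]
drafted = toy_draft(prefix, 3)
cache = predraft(prefix, drafted, guesses=[(3, 5), (2, 7)], k=3)

hit_tokens, hit = next_round_draft(cache, prefix, drafted, (3, 5), 3)
miss_tokens, miss_hit = next_round_draft(cache, prefix, drafted, (1, 4), 3)
```

On a hit, the next round's draft is already available when verification returns, so the draft latency is hidden. On a miss, the system pays the ordinary drafting cost.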
Practical SSD hinges on smart compute allocation and cache-miss strategy.
Because predictions of verification outcomes can fail, the system must decide when to fall back to ordinary speculation and how to distribute draft compute across candidate prefixes; naïve equal allocation is suboptimal and affects both hit rate and draft quality.
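A toy comparison shows why equal allocation can lose. The outcome probabilities below are invented for illustration, and ranking branches by estimated probability is only one plausible allocation rule, not necessarily the one the paper uses. It also ignores draft quality, which the summary says allocation affects as well.

```python
from typing import Dict, List, Tuple

Branch = Tuple[int, int, float]  # (accept_length, bonus_token, estimated probability)

# Hypothetical estimates: accept_length -> {bonus_token: probability}.
probs: Dict[int, Dict[int, float]] = {
    0: {7: 0.05, 3: 0.03},
    1: {4: 0.06, 9: 0.02},
    2: {5: 0.10, 1: 0.04},
    3: {8: 0.55, 2: 0.10, 6: 0.05},
}

def uniform_allocation(p: Dict[int, Dict[int, float]], budget: int) -> List[Branch]:
    """Give every accept length the same number of cached branches."""
    per_len = budget // len(p)
    chosen: List[Branch] = []
    for accept_len, cands in p.items():
        top = sorted(cands.items(), key=lambda kv: kv[1], reverse=True)[:per_len]
        chosen += [(accept_len, tok, pr) for tok, pr in top]
    return chosen

def greedy_allocation(p: Dict[int, Dict[int, float]], budget: int) -> List[Branch]:
    """Spend the whole budget on the most probable outcomes overall."""
    flat = [(a, tok, pr) for a, cands in p.items() for tok, pr in cands.items()]
    return sorted(flat, key=lambda b: b[2], reverse=True)[:budget]

def hit_rate(chosen: List[Branch]) -> float:
    """Probability that the actual outcome is one of the cached branches."""
    return sum(pr for _, _, pr in chosen)

uniform = hit_rate(uniform_allocation(probs, budget=4))
greedy = hit_rate(greedy_allocation(probs, budget=4))
print(uniform, greedy)
```

With the same budget of four branches, concentrating on likely outcomes covers more probability mass than spreading branches evenly across accept lengths.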
Diffusion-MPC improves long-horizon planning by modeling sequences, not steps.
Using diffusion for multi-step action proposals and multi-step dynamics reduces compounding error and allows simple sampling-based planning to compete strongly, while preserving MPC’s ability to swap reward functions at test time.
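The sampling-based planning loop and the test-time reward swap can be sketched as follows. Both learned components are replaced by loudly labeled placeholders: Gaussian noise stands in for the diffusion action proposal, and a 1-D integrator stands in for the multi-step dynamics model. All function names here are hypothetical.

```python
import random
from typing import Callable, List

def sample_action_sequences(n: int, horizon: int, rng: random.Random) -> List[List[float]]:
    """Placeholder for a diffusion action proposal: Gaussian noise sequences."""
    return [[rng.gauss(0, 1) for _ in range(horizon)] for _ in range(n)]

def rollout(state: float, actions: List[float]) -> List[float]:
    """Placeholder for a learned multi-step dynamics model: a 1-D integrator.
    Returns the whole predicted trajectory for the action sequence."""
    traj = []
    for a in actions:
        state = state + a
        traj.append(state)
    return traj

def plan(state: float, candidates: List[List[float]],
         reward: Callable[[List[float]], float]) -> List[float]:
    """Score every candidate sequence and return the highest-reward one."""
    return max(candidates, key=lambda acts: reward(rollout(state, acts)))

def reach_plus(traj: List[float]) -> float:
    return -abs(traj[-1] - 3.0)   # end near +3

def reach_minus(traj: List[float]) -> float:
    return -abs(traj[-1] + 3.0)   # end near -3

rng = random.Random(0)
candidates = sample_action_sequences(n=256, horizon=5, rng=rng)

# Same proposals and dynamics, two different rewards chosen at planning time.
plan_a = plan(0.0, candidates, reach_plus)
plan_b = plan(0.0, candidates, reach_minus)
final_a = rollout(0.0, plan_a)[-1]
final_b = rollout(0.0, plan_b)[-1]
```

In an MPC loop only the first action of the chosen sequence would be executed before replanning from the new state. Swapping `reach_plus` for `reach_minus` changes the plan without retraining either placeholder model.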
WORDS WORTH SAVING
5 quotes
So the claim I'm gonna make, and maybe this is the one thing to take away from the message I'm trying to send in this talk, is that inference today is seen as a sort of like cost or convenience lever. But, uh, in one, two, or three years, inference is gonna be seen as a capability.
— Tanishq Kumar
If you have a method, an algorithm, a system where its performance scales with the amount of thinking it does, then fundamentally, the speed at which you can do inference, the tokens per second, is exactly the peak intelligence that you can deliver.
— Tanishq Kumar
The sort of key asymmetry here, the reason that speculation works, is that it is easier to verify than to generate.
— Tanishq Kumar
I wanted to communicate to you all that this is not a new idea at all. It's really just kinda new advertising or packaging on an old idea.
— Isaac Ward
So part of the motivation for this paper is just the fact that over the past, uh, six or seven years, pre-training has continued to improve model capabilities in pretty surprising ways.
— Konwoo Kim
High-quality AI-generated summary created from speaker-labeled transcript.