Y Combinator
Inference, Diffusion, World Models, and More | YC Paper Club
CHAPTERS
- 0:07 – 3:49
YC Paper Club kickoff: building a founder–researcher community at Pioneer
Francois Chaubard opens the first YC Paper Club, emphasizing the goal of connecting researchers and founders and revitalizing the Pioneer space. He highlights the caliber of attendees (citations, fundraising) and the motivation for creating a South Bay AI community hub.
- •Paper Club mission: bridge founders and researchers
- •Attendee ‘show of hands’ on citations and fundraising as a proxy for room’s talent
- •Personal context: Pioneer’s role in past YC/AI history (early OpenAI era)
- •Geographic rationale: convene AI talent beyond San Francisco
- •Preview of five papers to be presented
- 3:49 – 7:51
Why inference is becoming a capability (not just a cost): setting up SSD
Tanishq Kumar motivates inference work as a near-term driver of model capability, not merely economics. He frames faster inference as translating directly into more ‘thinking’ at test time and demonstrates speed differences between decoding approaches.
- •Inference costs dominate at scale; RL training increasingly becomes ‘inference-heavy’
- •Core claim: inference speed will define peak delivered intelligence
- •Demo: baseline autoregressive vs speculative decoding vs SSD implementation
- •Goal: sample from a large model faster using algorithmic changes
- •Roadmap: explain speculative decoding, then speculative speculative decoding
- 7:51 – 10:53
Vanilla speculative decoding: draft, verify, accept/reject, and the ‘bonus token’
Tanishq explains how speculative decoding accelerates sampling from a target (large) model using a draft (small) model. The key is that verifying a block of tokens is parallelizable in transformers, unlike generating tokens autoregressively.
- •Draft model generates a candidate token block sequentially
- •Target model verifies the whole block in one forward pass (parallel across positions)
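- •Accept drafted tokens while they remain plausible under the target model; reject from the first implausible token onward
- •Important nuance: the verification pass already yields the target distribution at the rejection point, so a ‘bonus token’ can be sampled there at no extra cost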
- •Speculation as exchanging extra FLOPs for lower latency
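The draft, verify, accept/reject loop above can be sketched with toy distributions. This is a minimal illustration of the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the clipped difference of the two distributions), not the implementation shown in the talk. The four-token vocabulary and the functions `draft_probs` and `target_probs` are hypothetical stand-ins for a small and a large model.

```python
import random

VOCAB = [0, 1, 2, 3]

def draft_probs(prefix):
    # Hypothetical small model: mildly prefers repeating the last token.
    last = prefix[-1]
    return [0.4 if t == last else 0.2 for t in VOCAB]

def target_probs(prefix):
    # Hypothetical large model: prefers the cyclically next token.
    nxt = (prefix[-1] + 1) % len(VOCAB)
    return [0.55 if t == nxt else 0.15 for t in VOCAB]

def sample(weights, rng):
    # random.choices accepts unnormalized weights.
    return rng.choices(VOCAB, weights=weights, k=1)[0]

def speculative_step(prefix, k, rng):
    """One round: draft k tokens, verify them, return the tokens to emit."""
    # 1) Draft sequentially with the small model.
    drafted, q_list, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_list.append(q)
        ctx.append(tok)
    # 2) Verify: target distributions at all k+1 positions. A transformer
    #    would compute these in one forward pass; here it is a plain loop.
    p_list = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]
    # 3) Accept/reject left to right.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Rejection: resample this position from max(0, p - q).
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out
    # Every draft accepted: one more token from the target, for free.
    out.append(sample(p_list[k], rng))
    return out
```

Each round emits between 1 and k+1 tokens, and the emitted tokens follow the target model's distribution even though most of the sequential work was done by the draft model.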
- 10:53 – 14:57
Speculative Speculative Decoding (SSD): parallelizing draft and verification
SSD tackles the sequential dependency between drafting and verifying by running them concurrently on separate hardware. The draft side predicts likely verification outcomes and pre-drafts continuations so the verifier rarely waits.
- •Bottleneck in vanilla: draft for round t+1 depends on verification result of round t
- •SSD runs drafting and verification at the same time on separate hardware (the two models are not co-located)
- •Draft predicts likely verification outcomes immediately after sending draft tokens
- •Precompute next-round drafts conditional on likely acceptance lengths + bonus token
- •Cache-miss handling becomes central when predictions are wrong
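The pre-drafting idea in this chapter can be sketched as a lookup table keyed by predicted verification outcomes. This is only an illustration of the bookkeeping, assuming an outcome is the pair (number of accepted tokens, bonus token). The names `predraft`, `next_draft`, and `toy_draft` are hypothetical, and the on-demand fallback on a miss is the simplest choice rather than the one the talk recommends under batching.

```python
def toy_draft(prefix, k):
    # Hypothetical deterministic draft model: counts upward modulo 10.
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def predraft(prefix, drafted, likely_outcomes, draft_fn, k):
    """Draft the next round for each predicted verification outcome.

    In a real system this would run on the draft hardware while the
    verifier is still checking `drafted`.
    """
    cache = {}
    for accepted_len, bonus in likely_outcomes:
        new_prefix = prefix + drafted[:accepted_len] + [bonus]
        cache[(accepted_len, bonus)] = draft_fn(new_prefix, k)
    return cache

def next_draft(cache, prefix, drafted, outcome, draft_fn, k):
    """Return (next draft, cache_hit) once the real outcome is known."""
    if outcome in cache:
        return cache[outcome], True  # verifier does not wait
    accepted_len, bonus = outcome
    # Cache miss: draft on demand, as in vanilla speculation.
    return draft_fn(prefix + drafted[:accepted_len] + [bonus], k), False

prefix = [3]
drafted = toy_draft(prefix, 4)  # [4, 5, 6, 7]
cache = predraft(prefix, drafted, [(4, 8), (2, 9)], toy_draft, 4)
hit = next_draft(cache, prefix, drafted, (4, 8), toy_draft, 4)
miss = next_draft(cache, prefix, drafted, (1, 2), toy_draft, 4)
```

How many outcomes to pre-draft, and how long each pre-draft should be, is the compute-allocation tradeoff the next chapter discusses.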
- 14:57 – 18:33
SSD design tradeoffs and results: cache hits, batching, and throughput/latency wins
Tanishq outlines practical complexities: how to allocate draft compute across predicted outcomes, and how to recover when predictions miss. He closes with performance comparisons showing SSD improving both latency and throughput against strong inference engines.
- •Prediction accuracy can reach ~80–90%, sufficient for large gains
- •Use draft token distributions to propose bonus-token candidates
- •Tradeoff: compute allocation vs draft quality vs cache-hit rate
- •Naive fallback to standard speculation isn’t always optimal under batching
- •Benchmarking vs SGLang/vLLM-style baselines; SSD delivers higher tokens/sec
- 18:33 – 21:05
Diffusion-MPC overview: diffusion models for action proposals and dynamics
Guangyao (Stannis) Zhou introduces model predictive control (MPC) and explains how Diffusion-MPC uses diffusion models to improve both planning and model accuracy. The approach learns multi-step action proposals and multi-step dynamics to reduce compounding error.
- •MPC basics: propose actions, roll out with dynamics model, optimize objective
- •Key challenges: accurate dynamics and strong planners
- •Diffusion-MPC learns multi-step action proposals + multi-step dynamics
- •Benefits: reduced compounding errors; simpler sampling-based planning
- •Factorization enables adaptation to new rewards/dynamics at test time
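The MPC loop described above (propose actions, roll out with a dynamics model, keep the best sequence) can be sketched with a sampling-based planner. This is a toy under stated assumptions: `propose` draws Gaussian action sequences where Diffusion-MPC would sample from a learned diffusion proposal, `dynamics` is a hand-written 1D model where the paper learns multi-step dynamics, and the squared-distance cost is invented for illustration.

```python
import random

def dynamics(state, action):
    # Hypothetical 1D dynamics: position moves by the action, clipped to [-1, 1].
    return state + max(-1.0, min(1.0, action))

def rollout_cost(state, actions, goal):
    # Sum of squared distances to the goal along the imagined trajectory.
    cost = 0.0
    for a in actions:
        state = dynamics(state, a)
        cost += (state - goal) ** 2
    return cost

def propose(horizon, rng):
    # Stand-in for a learned multi-step action proposal.
    return [rng.gauss(0.0, 1.0) for _ in range(horizon)]

def mpc_action(state, goal, rng, n_samples=256, horizon=3):
    # Sample candidate sequences, score each by rollout, keep the best,
    # and execute only its first action (receding horizon).
    candidates = (propose(horizon, rng) for _ in range(n_samples))
    best = min(candidates, key=lambda acts: rollout_cost(state, acts, goal))
    return best[0]

def run(start, goal, steps, seed=0):
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        state = dynamics(state, mpc_action(state, goal, rng))
    return state

final_state = run(start=0.0, goal=3.0, steps=10)
```

Because the objective enters only through `rollout_cost`, swapping in a different cost at run time changes the behavior without touching the proposal or the dynamics, which is the kind of test-time adaptation the factorization is meant to enable.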
- 21:05 – 27:08
Positioning Diffusion-MPC among diffusion agents and control paradigms
Stannis situates Diffusion-MPC in a broader taxonomy of model-free, model-based, and joint modeling approaches. He contrasts diffusion policy, Diffuser-style joint trajectory modeling, observation-only decision diffusion, and the Diffusion-MPC factorized approach.
- •Unified view: all methods model a joint distribution over states/actions, but factorize differently
- •Tradeoffs: runtime planning, adapting to new rewards/dynamics, leveraging non-expert or video-only data
- •Diffusion policy: strong control but typically needs expert demonstrations
- •Decision Diffuser: can learn from observation-only or video-only data, recovering actions with an inverse dynamics model
- •Diffusion-MPC: runtime adaptation advantages via proposal + dynamics separation
- 27:08 – 30:18
Diffusion-MPC results: runtime reward/dynamics adaptation and ablations
Stannis highlights empirical findings: competitive performance on standard tasks and strong adaptation behavior when rewards or dynamics change at inference time. Ablations attribute gains to diffusion proposals and multi-step modeling on both action and dynamics.
- •Competitive in fixed-reward single-task settings (MuJoCo-style)
- •Runtime reward changes yield new behaviors without retraining the full agent
- •Dynamics adaptation: update dynamics model using play data (e.g., ‘broken ankle’ walker)
- •Factorization helps recover performance under changed environment dynamics
- •Ablations: diffusion proposals, multi-step action modeling, and multi-step dynamics each contribute
- 30:18 – 35:51
LeWorldModeling: what world models are and why they matter now
Isaac Ward introduces world models as predictors of environment dynamics conditioned on actions, connecting the idea to decades-old RL formulations. He contrasts model-free and model-based agents and frames the current ‘world model moment’ as strategically important.
- •World model definition: predict next observation/state given current observation and action
- •Capabilities: imagination rollouts, model-based control, surprise/uncertainty quantification
- •Historical lineage (Sutton-era formulations) despite new branding
- •Model-free vs model-based: brittleness, interpretability, and uncertainty tradeoffs
- •Why this matters: foundational bet behind large investments in world models
- 35:51 – 39:53
LeWorldModeling method: JEPA-style latent dynamics + SigReg to prevent collapse
Isaac explains the paper’s core training challenge, representation collapse when learning latent dynamics, and the proposed fix. LeWorldModeling predicts future latent embeddings (not pixels) and regularizes the embedding distribution toward an isotropic Gaussian using SigReg.
- •Challenge: co-learn representation + dynamics; trivial ‘all states same’ collapse is a local minimum
- •JEPA-style setup: encode observation → predict next latent conditioned on action → (optional) decode
- •SigReg: enforce a ‘healthy’ latent distribution by requiring 1D projections of the embeddings to look Gaussian, so the full distribution is close to isotropic
- •Framed as a simpler alternative to many bespoke anti-collapse tricks
- •Resulting model is small (~50M params) and efficient because it operates in latent space
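The projection idea behind the anti-collapse regularizer can be illustrated with a much cruder statistic than the one in the paper. The sketch below projects embeddings onto random unit directions and penalizes any projection whose mean is not 0 or whose variance is not 1. This moment-matching penalty is a simplified stand-in chosen for illustration, not SigReg itself, and the function names are hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def projection_penalty(embeddings, n_dirs=32, seed=0):
    """Average over random directions of mean^2 + (variance - 1)^2.

    Near 0 for isotropic standard-Gaussian embeddings; about 1 or more
    when every embedding is identical (collapsed, so variance is 0).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_dirs):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_dirs

data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
collapsed = [[0.5] * 8 for _ in range(1000)]
healthy_penalty = projection_penalty(healthy)
collapsed_penalty = projection_penalty(collapsed)
```

Added to the latent prediction loss, a penalty of this shape makes the ‘all states same’ solution expensive, since a constant embedding has zero variance along every direction.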
- 39:53 – 43:54
Using world models: MPC planning in latent space and quantifying surprise
The talk covers how LeWorldModeling supports goal-conditioned planning by searching for actions that move the latent state from start to goal. Isaac also emphasizes ‘surprise’ detection (spikes in model error under perturbations) as a key safety and robustness feature in real deployments.
- •Open-loop prediction demos (PushT/PushCube): imagined rollouts track ground truth
- •MPC: encode current + goal; optimize action sequence to reach goal in latent space
- •Performance: strong on 2D tasks; results are mixed on 3D tasks, where models built on pretrained foundation backbones excel
- •Speed: large runtime advantage vs competitors due to latent-space computation
- •Surprise quantification: model error spikes under distribution shifts (color change/teleport)
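Surprise quantification reduces to comparing the model's predicted next latent with the latent actually observed. The sketch below uses a scalar latent and a hypothetical `predict` function in place of a learned latent dynamics model. The trajectory and threshold are made up, with a ‘teleport’ inserted between steps 3 and 4.

```python
def predict(latent, action):
    # Hypothetical stand-in for learned latent dynamics.
    return latent + action

def surprise_trace(latents, actions):
    # Prediction error at each step: |predicted next latent - observed next latent|.
    return [abs(predict(latents[t], actions[t]) - latents[t + 1])
            for t in range(len(actions))]

def flag_surprises(trace, threshold):
    # Indices of steps whose error exceeds the threshold.
    return [t for t, s in enumerate(trace) if s > threshold]

latents = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]  # jump from 3.0 to 9.0 is the teleport
actions = [1.0, 1.0, 1.0, 1.0, 1.0]
trace = surprise_trace(latents, actions)       # [0.0, 0.0, 0.0, 5.0, 0.0]
flagged = flag_surprises(trace, threshold=1.0)  # [3]
```

In practice the threshold would be calibrated on prediction errors from unperturbed data, so that only shifts like the color change or teleport stand out.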
- 43:54 – 51:24
Generalization isn’t mysterious: PAC-Bayes, compression, and soft inductive bias
Akshay Vegesna presents Andrew Gordon Wilson’s argument that modern deep learning phenomena can be explained with classical generalization theory. He uses PAC-Bayes and the idea of compressible/flat solutions to reframe overparameterization and benign overfitting.
- •Motivation: scaling improves generalization; need mechanistic understanding
- •PAC-Bayes: test loss bounded by training loss + compression term; past bounds were often vacuous
- •Overparameterization: larger models can fit better and find more compressible (flatter) solutions
- •Flat minima occupy more volume in high dimensions and correlate with compressibility
- •Benign overfitting intuition via regularized polynomials: flexible models + soft bias toward simpler structure
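For reference, the ‘training loss + compression term’ shape mentioned above appears in a standard textbook bound for a countable hypothesis class. It is given here as background, not necessarily in the exact form used in the talk. Assumptions: a prior $P$ fixed before seeing the data, a loss bounded in $[0, 1]$, and $n$ i.i.d. training examples.

```latex
% With probability at least 1 - \delta over the training sample,
% simultaneously for every hypothesis h:
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}}
% R(h): expected (test) loss, \hat{R}(h): empirical (training) loss.
% If P(h) = 2^{-\ell(h)} for a prefix-free code of length \ell(h) bits,
% then \ln\frac{1}{P(h)} = \ell(h)\,\ln 2 .
```

The second term shrinks when the learned solution has a short description, which is why a larger model that lands on a more compressible solution can have a tighter bound than a smaller model that does not.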
- 51:24 – 57:31
Pretraining under infinite compute: data-constrained scaling recipes and asymptotes
Konwoo Kim addresses the regime where data is fixed but compute grows, motivated by compute scaling outpacing internet text growth. The paper uses scaling laws to evaluate recipes by their power-law behavior and asymptotic loss under infinite compute.
- •Motivation: compute per datapoint rising rapidly; future pretraining becomes data-limited
- •Experimental setup: fix small token budget (e.g., 200M tokens) and scale model/recipe
- •Standard recipe (epoching + early stopping) eventually overfits as models grow
- •Aggressive regularization (large weight decay) yields clean power-law scaling with measurable asymptote
- •Asymptotes used as an evaluation tool for ‘best possible’ performance under infinite compute
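Reading an asymptote off a scaling curve can be sketched with a saturating power law, L(N) = E + A * N^(-alpha), where E is the loss approached as scale N grows without bound. This functional form is a common choice and is assumed here rather than taken from the paper. The helper `fit_asymptote` is hypothetical and solves for the three parameters exactly from three losses measured at geometrically spaced scales N, r*N, r^2*N. The loss values are synthetic.

```python
import math

def fit_asymptote(n, losses, ratio):
    """Solve L = E + A * N**(-alpha) from losses at N, ratio*N, ratio**2*N."""
    l1, l2, l3 = losses
    d1, d2 = l1 - l2, l2 - l3       # successive improvements
    rho = d2 / d1                   # equals ratio**(-alpha)
    alpha = -math.log(rho) / math.log(ratio)
    asymptote = l3 - d2 * d2 / (d1 - d2)
    amplitude = d1 / (n ** (-alpha) * (1.0 - rho))
    return asymptote, amplitude, alpha

def power_law(n, asymptote, amplitude, alpha):
    return asymptote + amplitude * n ** (-alpha)

# Synthetic curve with made-up parameters E=3.0, A=50.0, alpha=0.5.
scales = [1e6, 4e6, 16e6]
losses = [power_law(n, 3.0, 50.0, 0.5) for n in scales]
est_asymptote, est_amplitude, est_alpha = fit_asymptote(scales[0], losses, ratio=4.0)
```

With noisy measurements one would fit many points by least squares instead, but the three-point version shows why a clean power law is needed before the asymptote can be estimated: if the curve turns upward from overfitting, d1 and d2 no longer shrink geometrically and the extrapolation breaks down.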
- 57:31 – 1:06:22
Ensembling, joint scaling, and distillation: turning compute into data efficiency
Konwoo shows that ensembling provides a lower asymptotic loss than single-model regularization in the data-constrained regime, and that combining both yields further gains. Distillation (including self-distillation) can preserve much of the benefit while keeping inference models small, and trends carry to downstream tasks and continued pretraining.
- •Ensembling beats single large models when data constrained; produces lower asymptotic loss
- •Joint scaling: combine regularization (scale model size) + ensembling (scale number of models) via a double-limit analysis
- •Data scaling laws quantify ‘effective extra tokens’ (e.g., ~5× data efficiency for joint recipe)
- •Distillation compresses ensemble benefits into a small dense model; self-distillation can further improve loss
- •Application to continued pretraining: large data-efficiency wins on restricted-domain token subsets
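One reason averaging models helps can be shown in a few lines: by Jensen's inequality, the log loss of the averaged prediction is never worse than the average log loss of the members. The sketch below checks this on made-up probabilities that three hypothetical models assign to the correct token. It does not reproduce the paper's asymptote comparison, which is an empirical result.

```python
import math

def nll(prob):
    # Negative log-likelihood of the correct token.
    return -math.log(prob)

def ensemble_vs_members(member_probs):
    """Return (NLL of the averaged prediction, mean NLL of the members)."""
    k = len(member_probs)
    ensemble_nll = nll(sum(member_probs) / k)
    mean_member_nll = sum(nll(p) for p in member_probs) / k
    return ensemble_nll, mean_member_nll

# Made-up probabilities assigned to the correct token by three models.
ensemble_nll, mean_member_nll = ensemble_vs_members([0.2, 0.5, 0.8])
```

The averaged distribution is also what a student would be trained to match during distillation, which is how the ensemble's benefit can be carried into a single small model.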
- 1:06:22 – 1:07:18
Wrap-up: growing the Paper Club community and logistics
Francois closes by reflecting on the event’s success and inviting attendees to help shape future sessions. He shares participation norms and points everyone to join the Slack, then ends with boba and informal networking.
- •Paper Club envisioned as a community-driven recurring forum
- •Call for ideas and active participation from attendees
- •Ground rules: respect and engagement
- •Slack onboarding for continued discussion and coordination
- •Event closes with refreshments (boba)