No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute
CHAPTERS
Meet Arc Institute + Vevo and the mission: Tahoe-100 and virtual cell models
Sarah Guo introduces leaders from Vevo Therapeutics and the Arc Institute to discuss Tahoe-100, a massive drug-perturbation single-cell dataset. The group frames the bigger goal: moving AI in biology beyond proteins toward predictive, cell-level models that can eventually drive therapies.
- Guests and roles across Vevo (data generation) and Arc (modeling/atlas)
- Tahoe-100 positioned as the largest single-cell drug-perturbed dataset
- Why virtual cell models matter alongside protein structure/language models
- Focus on AI-for-bio progress and when it may translate to treatments
What Tahoe-100 is and why it’s an ImageNet-like moment for cells
The team explains Tahoe-100 as a landmark dataset intended to unlock new machine-learning capabilities in biology. They compare it to pivotal datasets in AI history (e.g., ImageNet) that triggered nonlinear progress once scale and standardization arrived.
- Tahoe-100 described as the world’s biggest scRNA-seq dataset for drug perturbations
- Dataset meant to catalyze a new AI-driven style of drug discovery
- Analogy: foundational datasets (ImageNet) enable step-changes in model capability
- Shift from protein-level understanding to cell-level response modeling
Why we need virtual cell models in addition to protein language/structure models
Arc leaders describe how protein models answer binding/structure questions, but cells require a different abstraction: dynamic state and response. They use a computer analogy (DNA as ROM, RNA as RAM, model as CPU) to explain how perturbations map to transcriptomic changes and enable inverse design.
- DNA vs RNA vs cellular response: ROM/RAM/CPU framing
- Virtual cell models aim to predict transcriptomic response to perturbations
- Inverse problem: choose drug/gene edits to push diseased state toward healthy
- Compute-limited areas (DNA) vs data-limited areas (cell state models)
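The inverse problem described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-gene expression vectors, the drug names, and the `predicted_shifts` table, which stands in for the output of a virtual cell model and is hard-coded here. Euclidean distance is used only as a simple placeholder for "closeness to healthy"; the episode does not specify a metric.

```python
import math

def distance(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def apply_shift(state, shift):
    """Add a predicted per-gene expression change to a cell state."""
    return [s + d for s, d in zip(state, shift)]

def pick_perturbation(diseased, healthy, predicted_shifts):
    """Return (name, distance) for the perturbation whose predicted
    post-treatment state lands closest to the healthy state."""
    best_name, best_dist = None, float("inf")
    for name, shift in predicted_shifts.items():
        d = distance(apply_shift(diseased, shift), healthy)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist

# Toy, made-up data: three genes, three hypothetical drugs.
healthy = [1.0, 0.0, 2.0]
diseased = [3.0, 1.0, 2.0]
predicted_shifts = {            # placeholder for model predictions
    "drug_A": [-2.0, -1.0, 0.0],
    "drug_B": [-1.0, 0.0, 0.0],
    "drug_C": [0.0, 0.0, 1.0],
}

best, dist = pick_perturbation(diseased, healthy, predicted_shifts)
print(best, dist)
```

In this toy setup the forward model is a lookup table, so the search is trivial. The point is only the shape of the problem: predict the response to each candidate perturbation, then rank candidates by how close the predicted state is to the target.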
Why perturbational data matters: moving from correlation to causation
The discussion contrasts observational single-cell atlases with perturbational experiments that reveal causal dynamics. They explain that diverse perturbations help models learn the manifold of possible cell states and generalize to unseen conditions.
- Observational data is largely correlational; perturbations support causal inference
- Perturbation responses help map the latent manifold of cell states
- Need breadth across cell types/tissues and disease contexts to generalize
- Public perturbational single-cell data was tiny relative to Tahoe-100
What prior single-cell data looked like (and why it’s hard to use)
They outline why past public single-cell datasets have limited utility for foundation models: fragmentation, weak labeling, and severe batch effects. Tahoe-100’s promise is consistency, depth, and breadth across patient-derived cancer models and many drug treatments.
- Academic/industry datasets are small, fragmented, and inconsistently annotated
- Batch effects can dominate: data from the same lab can look different from one day to the next
- Tahoe-100: ~50 patient cancer models and ~1,200 drug treatments
- Claimed as a first dataset that can truly enable ML at the cell-response layer
How big is “big enough”? Tokens, scaling laws, and information content
They translate single-cell scale into the language-model notion of tokens and discuss early scaling intuitions. They emphasize that raw cell count isn’t enough—diversity and informative variation (often from perturbations) determine real learning signal.
- LLM inspiration: ~1T tokens as a comfortable reference point
- Counting ‘tokens’ in cells via genes/expression; Tahoe approximated as hundreds of billions of tokens
- Key uncertainty: which fraction of tokens is truly informative
- Downsampling evidence suggests some existing datasets are information-poor
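The token arithmetic can be made concrete with a back-of-the-envelope sketch in Python. The inputs are assumptions, not figures from the episode: a cell count of 100 million for Tahoe-100, and roughly 2,000 detected genes per cell, with each detected gene counted as one token. The informative fractions are arbitrary illustrations of the open question raised above.

```python
def estimate_tokens(n_cells, genes_per_cell):
    """Rough token count if each detected gene in a cell is one token."""
    return n_cells * genes_per_cell

def effective_tokens(total, informative_fraction):
    """Tokens left if only a fraction carries real learning signal."""
    return round(total * informative_fraction)

N_CELLS = 100_000_000              # assumption: ~100M cells
GENES_PER_CELL = 2_000             # assumption: detected genes per cell
LLM_REFERENCE = 1_000_000_000_000  # the ~1T-token reference point

total_tokens = estimate_tokens(N_CELLS, GENES_PER_CELL)
fraction_of_reference = total_tokens / LLM_REFERENCE

print(f"total tokens: {total_tokens:,}")
print(f"share of 1T reference: {fraction_of_reference:.0%}")
for frac in (0.25, 0.5):           # hypothetical informative fractions
    print(f"informative at {frac:.0%}: {effective_tokens(total_tokens, frac):,}")
```

Under these assumptions the estimate lands in the "hundreds of billions" range mentioned in the discussion, but the effective number depends heavily on the informative fraction, which is exactly the uncertainty the speakers flag.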
Choosing perturbations and designing coverage: biology space + chemical space
Vevo explains how perturbations are selected to match the biological questions (e.g., core cancer pathways) while still learning transferable biology. The team describes expanding coverage across patient diversity and chemical diversity, with ML guiding what gaps to fill next.
- Perturbations chosen around cancer-relevant genes/pathways and drug mechanisms
- Conserved pathways may transfer to other domains (immunity, neuroscience)
- Strategy: maximize information content via patient diversity and coverage
- As scale rises, the approach becomes more unbiased/hypothesis-light
Vevo’s Mosaic platform: pooled patient tumors enabling massive, reproducible screening
They explain the key experimental innovation: pooling cells from many patients into a ‘mosaic tumor’ and screening many drugs efficiently. The approach massively increases experimental throughput and standardization versus one-model-at-a-time testing.
- Mosaic tumors combine many patient-derived models to capture genetic variation
- Each mouse/drug treatment yields patient-specific response signals at scale
- Enables screening hundreds/thousands of drugs across many models efficiently
- Framing: more ‘tokens per experiment’ changes how biology is done
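To show how a pooled experiment can still yield patient-specific signals, here is a minimal Python sketch. It assumes each sequenced cell has already been assigned to its source model (the episode summary does not describe how that assignment is done) and reduces each cell to a single expression value for one gene. The `DMSO` control label, model names, drug name, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def per_model_response(cells, control="DMSO"):
    """Group pooled cells by (drug, model) and return the mean expression
    change of each treated group relative to the same model's control."""
    groups = defaultdict(list)
    for drug, model, value in cells:
        groups[(drug, model)].append(value)
    means = {key: mean(values) for key, values in groups.items()}

    responses = {}
    for (drug, model), treated_mean in means.items():
        if drug == control:
            continue
        baseline = means.get((control, model))
        if baseline is None:       # no control cells for this model
            continue
        responses[(drug, model)] = treated_mean - baseline
    return responses

# Toy pooled data: (drug, source model, expression of one gene).
cells = [
    ("DMSO", "model_1", 10.0), ("DMSO", "model_1", 12.0),
    ("DMSO", "model_2", 8.0),
    ("drug_X", "model_1", 5.0), ("drug_X", "model_1", 7.0),
    ("drug_X", "model_2", 8.0),
]

responses = per_model_response(cells)
for key, delta in sorted(responses.items()):
    print(key, delta)
```

Because every model in the pool sees the same treatment in the same experiment, one condition produces a response estimate per model, which is the sense in which pooling raises the information gained per experiment.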
Open-sourcing Tahoe-100 and building the Arc Virtual Cell Atlas (SC Basecamp)
Vevo explains why a venture-backed startup would open-source: to set a new community baseline and recruit a broad ecosystem to model and critique the data. Arc describes pairing Tahoe-100 with SC Basecamp, an agent-curated observational corpus, to create a large, standardized atlas for training virtual cell models.
- Motivations: raise the field’s ambition, enable community iteration, keep teams lean
- Arc Virtual Cell Atlas launches with Tahoe-100 as a centerpiece dataset
- SC Basecamp: agent ‘crawler/index’ that mines and uniformly processes public scRNA data
- Combined resource: ~330M cells for pretraining + perturbation fine-tuning
Data collection realities: batch effects, consistency, and surprising leverage
They highlight how much biological data is “infected” by analytical and experimental variability across time, tools, and labs. Tahoe-100’s consistency is underscored by the fact that a small number of people generated it quickly, reducing variability from many ‘hands.’
- Public archives are messy: changing tool versions, genome builds, inconsistent processing
- Arc’s approach reduces analytical batch effects by reprocessing uniformly
- Tahoe-100 generated by ~4 people in a short window, improving consistency
- Standardization is framed as essential for reliable downstream ML
How to evaluate a virtual cell model: prediction, benchmarks, and today’s limits
Arc discusses evaluating models by their ability to predict differential gene expression after perturbations. They note that current models perform poorly and that the field lacks robust shared benchmarks—an area where these new datasets could accelerate progress.
- Primary metric: predict differentially expressed genes after a perturbation
- Current predictive performance cited as very low (~10% for best models)
- No widely accepted benchmark suite yet; industry would benefit from one
- Hypothesis: data quality/consistency is a major limiting factor, not just model architecture
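One way to score a model on differentially expressed (DE) genes is sketched below in Python. The episode does not define the metric behind the ~10% figure, so this is an assumed formulation: a gene counts as DE when its absolute log fold change meets a threshold, and the prediction is scored by precision and recall against the observed DE set. Gene names, values, and the threshold of 1.0 are made up.

```python
def de_genes(log_fold_changes, threshold=1.0):
    """Genes whose absolute log fold change meets the threshold."""
    return {g for g, lfc in log_fold_changes.items() if abs(lfc) >= threshold}

def de_overlap(predicted, observed, threshold=1.0):
    """Precision and recall of predicted DE genes against observed ones."""
    pred = de_genes(predicted, threshold)
    obs = de_genes(observed, threshold)
    hits = pred & obs
    precision = len(hits) / len(pred) if pred else 0.0
    recall = len(hits) / len(obs) if obs else 0.0
    return precision, recall

# Toy log fold changes for one perturbation (hypothetical values).
observed = {"GENE_A": 2.1, "GENE_B": -1.5, "GENE_C": 0.2, "GENE_D": 1.2}
predicted = {"GENE_A": 1.8, "GENE_B": 0.1, "GENE_C": 1.4, "GENE_D": 0.3}

precision, recall = de_overlap(predicted, observed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A shared benchmark would need to fix choices like the threshold, the held-out perturbations, and the cell contexts, which is part of why the lack of a standard suite makes reported numbers hard to compare.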
What virtual cell models unlock: faster science, in silico experiments, and better drugs
They argue virtual cell models matter because biology is slow and expensive; accurate simulation could parallelize experimentation and speed iteration. The discussion connects model utility to target selection, toxicity/specificity, and ultimately improving the low clinical success rate in drug development.
- Biology constrained by real time (e.g., aging studies take years)
- In silico simulation only helps if it’s accurate; otherwise it’s ‘noise’
- Drug discovery use case: predict cell response to new chemical entities across patients
- Improve target choice and chemical design; address the ~90% clinical failure rate
Beyond single cells: right abstraction, context dependence, and future modalities
They address how single-cell transcriptomics can still capture environmental context filtered through cellular state, and how models can ladder up to multicellular systems. The group points to organoids/spheroids, in vivo contexts, and adding spatial data as natural next steps.
- Transcriptome proposed as the right abstraction layer for early virtual cell models
- Context dependence: environment signals are reflected in the cell state
- Extension path: spheroids/organoids and more realistic in vivo settings
- Future enrichment: incorporate spatial data and multicellular interactions
Hot takes: platform vs single-hypothesis biotechs, China’s rise, and why “AI for drug discovery” may work now
In closing, they contrast platform companies that generate many hypotheses with traditional single-asset biotechs that can become wedded to one bet. They discuss Chinese biotechs’ cost and speed advantages, and argue the field is nearing an inflection point as data, compute, and model maturity converge—while acknowledging validation will still take years due to clinical timelines.
- Platform biotech aims for hypothesis generation/selection at scale vs single-hypothesis execution
- China: competitive cost basis, fast pipelines; US needs intentional innovation in biotech
- Organizational ethos: small elite teams, rapid building, less bureaucracy, more experimentation
- Biology’s ‘GPT moment’ framing: proteins ahead; virtual cell models around GPT-1/2 today; proof still gated by long clinical cycles