No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute

On this week’s episode of No Priors, Sarah Guo is joined by leading members of the teams at Vevo Therapeutics and the Arc Institute: Nima Alidoust, CEO/Co-Founder at Vevo Therapeutics; Johnny Yu, CSO/Co-Founder at Vevo Therapeutics; Patrick Hsu, CEO/Co-Founder at Arc Institute; Dave Burke, CTO at Arc Institute; and Hani Goodarzi, Core Investigator at Arc Institute. Predicting protein structure (AlphaFold 3, Chai-1, Evo 2) was a big AI/biology breakthrough. The next big leap is modeling entire human cells: how they behave in disease, or how they respond to new therapeutics. The same way LLMs needed enormous text corpora to become truly powerful, Virtual Cell Models need massive, high-quality cellular datasets to train on. In this episode, the teams discuss the groundbreaking release of the Tahoe-100M single cell dataset, Arc Atlas, and how these advancements could transform drug discovery.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @Nalidoust | @IAmJohnnyYu | @PDHsh | @Davey_Burke | @Genophoria

Show Notes:
0:00 Introduction
1:40 Significance of Tahoe-100M dataset
4:22 Where we are with virtual cell models and protein language models
10:26 Significance of perturbational data
17:39 Challenges and innovations in data collection
24:42 Open sourcing and community collaboration
33:51 Predictive ability and importance of virtual cell models
35:27 Drug discovery and virtual cell models
44:27 Platform vs. single hypothesis companies
46:05 Rise of Chinese biotechs
51:36 AI in drug discovery

Hosts: Sarah Guo, Elad Gil
Guests: Johnny Yu, Nima Alidoust, Patrick Hsu, Dave Burke, Hani Goodarzi
Feb 25, 2025 · 57m

CHAPTERS

  1. Meet Arc Institute + Vevo and the mission: Tahoe-100 and virtual cell models

    Sarah Guo introduces leaders from Vevo Therapeutics and the Arc Institute to discuss Tahoe-100, a massive drug-perturbation single-cell dataset. The group frames the bigger goal: moving AI in biology beyond proteins toward predictive, cell-level models that can eventually drive therapies.

    • Guests and roles across Vevo (data generation) and Arc (modeling/atlas)
    • Tahoe-100 positioned as the largest single-cell drug-perturbed dataset
    • Why virtual cell models matter alongside protein structure/language models
    • Focus on AI-for-bio progress and when it may translate to treatments
  2. What Tahoe-100 is and why it’s an ImageNet-like moment for cells

    The team explains Tahoe-100 as a landmark dataset intended to unlock new machine-learning capabilities in biology. They compare it to pivotal datasets in AI history (e.g., ImageNet) that triggered nonlinear progress once scale and standardization arrived.

    • Tahoe-100 described as the world’s biggest scRNA-seq dataset for drug perturbations
    • Dataset meant to catalyze a new AI-driven style of drug discovery
    • Analogy: foundational datasets (ImageNet) enable step-changes in model capability
    • Shift from protein-level understanding to cell-level response modeling
  3. Why we need virtual cell models in addition to protein language/structure models

    Arc leaders describe how protein models answer binding/structure questions, but cells require a different abstraction: dynamic state and response. They use a computer analogy (DNA as ROM, RNA as RAM, model as CPU) to explain how perturbations map to transcriptomic changes and enable inverse design.

    • DNA vs RNA vs cellular response: ROM/RAM/CPU framing
    • Virtual cell models aim to predict transcriptomic response to perturbations
    • Inverse problem: choose drug/gene edits to push diseased state toward healthy
    • Compute-limited areas (DNA) vs data-limited areas (cell state models)
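
    The inverse problem described above can be illustrated with a toy sketch. It assumes a virtual cell model has already predicted an expression shift for each candidate perturbation; the three-gene vectors, the drug names, and the `pick_perturbation` helper are all hypothetical and do not come from the episode.

```python
import math

def pick_perturbation(diseased, healthy, predicted_shifts):
    """Return (name, distance) for the perturbation whose predicted
    post-treatment state lies closest to the healthy state."""
    best_name, best_dist = None, float("inf")
    for name, shift in predicted_shifts.items():
        # Predicted state = diseased expression + model-predicted shift.
        predicted_state = [d + s for d, s in zip(diseased, shift)]
        dist = math.dist(predicted_state, healthy)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist

# Made-up three-gene expression vectors, for illustration only.
diseased = [5.0, 1.0, 3.0]
healthy = [2.0, 2.0, 3.0]

# Hypothetical model outputs: predicted expression shift per candidate drug.
predicted_shifts = {
    "drug_A": [-1.0, 0.0, 0.0],
    "drug_B": [-3.0, 1.0, 0.5],
    "drug_C": [0.0, 1.0, -2.0],
}

best, dist = pick_perturbation(diseased, healthy, predicted_shifts)
print(best, dist)
```

    Euclidean distance is used here only for simplicity; a real system would need a distance that reflects biology, and the quality of the choice is bounded by the accuracy of the forward predictions.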
  4. Why perturbational data matters: moving from correlation to causation

    The discussion contrasts observational single-cell atlases with perturbational experiments that reveal causal dynamics. They explain that diverse perturbations help models learn the manifold of possible cell states and generalize to unseen conditions.

    • Observational data is largely correlational; perturbations support causal inference
    • Perturbation responses help map the latent manifold of cell states
    • Need breadth across cell types/tissues and disease contexts to generalize
    • Public perturbational single-cell data was tiny relative to Tahoe-100
  5. What prior single-cell data looked like (and why it’s hard to use)

    They outline why past public single-cell datasets have limited utility for foundation models: fragmentation, weak labeling, and severe batch effects. Tahoe-100’s promise is consistency, depth, and breadth across patient-derived cancer models and many drug treatments.

    • Academic/industry datasets are small, fragmented, and inconsistently annotated
    • Batch effects can dominate—same lab, different days can look different
    • Tahoe-100: ~50 patient cancer models and ~1,200 drug treatments
    • Claimed to be the first dataset large and consistent enough to truly enable ML at the cell-response layer
  6. How big is “big enough”? Tokens, scaling laws, and information content

    They translate single-cell scale into the language-model notion of tokens and discuss early scaling intuitions. They emphasize that raw cell count isn’t enough—diversity and informative variation (often from perturbations) determine real learning signal.

    • LLM inspiration: ~1T tokens as a comfortable reference point
    • Counting ‘tokens’ in cells via genes/expression; Tahoe approximated as hundreds of billions of tokens
    • Key uncertainty: which fraction of tokens is truly informative
    • Downsampling evidence suggests some existing datasets are information-poor
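
    The token arithmetic can be made concrete with a back-of-the-envelope sketch. The cell count comes from the dataset name (Tahoe-100M) and the ~1T reference from the discussion above; the genes-per-cell values and the `informative_fraction` parameter are assumptions chosen for illustration, not figures from the episode.

```python
TAHOE_CELLS = 100_000_000                 # the "100M" in Tahoe-100M
LLM_REFERENCE_TOKENS = 1_000_000_000_000  # ~1T tokens, the LLM reference point

def estimate_tokens(n_cells, genes_per_cell, informative_fraction=1.0):
    """Rough token count if each detected gene in a cell counts as one token.

    informative_fraction is a hypothetical knob for the open question of
    how many of those tokens carry real learning signal.
    """
    return int(n_cells * genes_per_cell * informative_fraction)

# Assumed range of detected genes per cell (illustrative values only).
for genes_per_cell in (1_000, 2_000, 5_000):
    tokens = estimate_tokens(TAHOE_CELLS, genes_per_cell)
    share = tokens / LLM_REFERENCE_TOKENS
    print(f"{genes_per_cell} genes/cell: {tokens / 1e9:.0f}B tokens "
          f"({share:.0%} of the 1T reference)")
```

    Under these assumptions the estimate lands between 100B and 500B tokens, which matches the "hundreds of billions" framing; lowering `informative_fraction` shows how quickly the effective count shrinks if much of the data is redundant.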
  7. Choosing perturbations and designing coverage: biology space + chemical space

    Vevo explains how perturbations are selected to match the biological questions (e.g., core cancer pathways) while still learning transferable biology. The team describes expanding coverage across patient diversity and chemical diversity, with ML guiding what gaps to fill next.

    • Perturbations chosen around cancer-relevant genes/pathways and drug mechanisms
    • Conserved pathways may transfer to other domains (immunity, neuroscience)
    • Strategy: maximize information content via patient diversity and coverage
    • As scale rises, the approach becomes more unbiased/hypothesis-light
  8. Vevo’s Mosaic platform: pooled patient tumors enabling massive, reproducible screening

    They explain the key experimental innovation: pooling cells from many patients into a ‘mosaic tumor’ and screening many drugs efficiently. The approach massively increases experimental throughput and standardization versus one-model-at-a-time testing.

    • Mosaic tumors combine many patient-derived models to capture genetic variation
    • Each mouse/drug treatment yields patient-specific response signals at scale
    • Enables screening hundreds/thousands of drugs across many models efficiently
    • Framing: more ‘tokens per experiment’ changes how biology is done
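
    The throughput argument can be checked with simple arithmetic. This sketch uses the approximate figures quoted for Tahoe-100 (~50 patient models, ~1,200 drug treatments, 100M cells) and assumes, purely for illustration, that every pool contains all models, that each drug is applied once, and that cells are spread evenly; none of these design details are stated in the episode.

```python
N_MODELS = 50          # ~50 patient-derived cancer models
N_DRUGS = 1_200        # ~1,200 drug treatments
N_CELLS = 100_000_000  # the "100M" in Tahoe-100M

def samples_one_at_a_time(n_models, n_drugs):
    """Treated samples needed when each model is screened separately."""
    return n_models * n_drugs

def samples_pooled(n_drugs):
    """Treated samples needed when every pool already contains all models."""
    return n_drugs

separate = samples_one_at_a_time(N_MODELS, N_DRUGS)
pooled = samples_pooled(N_DRUGS)
fold_reduction = separate // pooled

# Average cells per (model, drug) pair under the even-spread assumption.
cells_per_pair = N_CELLS // separate

print(f"one-at-a-time: {separate} treated samples")
print(f"pooled:        {pooled} treated samples ({fold_reduction}x fewer)")
print(f"about {cells_per_pair} cells per model-drug pair")
```

    The reduction factor equals the number of models in the pool, which is why pooling scales so well as patient diversity grows.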
  9. Open-sourcing Tahoe-100 and building the Arc Virtual Cell Atlas (scBaseCamp)

    Vevo explains why a venture-backed startup would open-source: to set a new community baseline and recruit a broad ecosystem to model and critique the data. Arc describes pairing Tahoe-100 with scBaseCamp, an agent-curated observational corpus, to create a large, standardized atlas for training virtual cell models.

    • Motivations: raise the field’s ambition, enable community iteration, keep teams lean
    • Arc Virtual Cell Atlas launches with Tahoe-100 as a centerpiece dataset
    • scBaseCamp: agent ‘crawler/index’ that mines and uniformly processes public scRNA data
    • Combined resource: ~330M cells for pretraining + perturbation fine-tuning
  10. Data collection realities: batch effects, consistency, and surprising leverage

    They highlight how much biological data is “infected” by analytical and experimental variability across time, tools, and labs. Tahoe-100’s consistency is underscored by the fact that a small number of people generated it quickly, reducing variability from many ‘hands.’

    • Public archives are messy: changing tool versions, genome builds, inconsistent processing
    • Arc’s approach reduces analytical batch effects by reprocessing uniformly
    • Tahoe-100 generated by ~4 people in a short window, improving consistency
    • Standardization is framed as essential for reliable downstream ML
  11. How to evaluate a virtual cell model: prediction, benchmarks, and today’s limits

    Arc discusses evaluating models by their ability to predict differential gene expression after perturbations. They note that current models perform poorly and that the field lacks robust shared benchmarks—an area where these new datasets could accelerate progress.

    • Primary metric: predict differentially expressed genes after a perturbation
    • Current predictive performance cited as very low (~10% for best models)
    • No widely accepted benchmark suite yet; industry would benefit from one
    • Hypothesis: data quality/consistency is a major limiting factor, not just model architecture
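
    One simple way to score a prediction of differentially expressed (DE) genes is set overlap between predicted and observed gene lists. The episode does not say how the ~10% figure was computed, so the metric choice, the `de_overlap` helper, and the example gene lists below are assumptions for illustration only.

```python
def de_overlap(predicted, observed):
    """Compare predicted vs. observed DE gene sets for one perturbation."""
    predicted, observed = set(predicted), set(observed)
    hits = predicted & observed
    union = predicted | observed
    return {
        # Fraction of predicted DE genes that were actually DE.
        "precision": len(hits) / len(predicted) if predicted else 0.0,
        # Fraction of truly DE genes that the model recovered.
        "recall": len(hits) / len(observed) if observed else 0.0,
        # Overlap relative to everything either list contains.
        "jaccard": len(hits) / len(union) if union else 0.0,
    }

# Made-up gene lists for a single hypothetical drug treatment.
predicted_de = ["MYC", "TP53", "EGFR", "CDKN1A"]
observed_de = ["TP53", "CDKN1A", "MDM2", "BAX", "GADD45A"]

scores = de_overlap(predicted_de, observed_de)
print(scores)
```

    A shared benchmark would also have to fix how DE genes are called in the first place (thresholds, statistical test, held-out perturbations), since those choices change the score as much as the model does.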
  12. What virtual cell models unlock: faster science, in silico experiments, and better drugs

    They argue virtual cell models matter because biology is slow and expensive; accurate simulation could parallelize experimentation and speed iteration. The discussion connects model utility to target selection, toxicity/specificity, and ultimately improving the low clinical success rate in drug development.

    • Biology constrained by real time (e.g., aging studies take years)
    • In silico simulation only helps if it’s accurate—otherwise it’s ‘noise’
    • Drug discovery use case: predict cell response to new chemical entities across patients
    • Improve target choice and chemical design; address the ~90% clinical failure rate
  13. Beyond single cells: right abstraction, context dependence, and future modalities

    They address how single-cell transcriptomics can still capture environmental context filtered through cellular state, and how models can ladder up to multicellular systems. The group points to organoids/spheroids, in vivo contexts, and adding spatial data as natural next steps.

    • Transcriptome proposed as the right abstraction layer for early virtual cell models
    • Context dependence: environment signals are reflected in the cell state
    • Extension path: spheroids/organoids and more realistic in vivo settings
    • Future enrichment: incorporate spatial data and multicellular interactions
  14. Hot takes: platform vs single-hypothesis biotechs, China’s rise, and why “AI for drug discovery” may work now

    In closing, they contrast platform companies that generate many hypotheses with traditional single-asset biotechs that can become wedded to one bet. They discuss Chinese biotechs’ cost and speed advantages, and argue the field is nearing an inflection point as data, compute, and model maturity converge—while acknowledging validation will still take years due to clinical timelines.

    • Platform biotech aims for hypothesis generation/selection at scale vs single-hypothesis execution
    • China: competitive cost basis, fast pipelines; US needs intentional innovation in biotech
    • Organizational ethos: small elite teams, rapid building, less bureaucracy, more experimentation
    • Biology’s ‘GPT moment’ framing: proteins ahead; virtual cell models around GPT-1/2 today; proof still gated by long clinical cycles
