No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute

On this week’s episode of No Priors, Sarah Guo is joined by leading members of the teams at Vevo Therapeutics and the Arc Institute: Nima Alidoust, CEO/Co-Founder at Vevo Therapeutics; Johnny Yu, CSO/Co-Founder at Vevo Therapeutics; Patrick Hsu, CEO/Co-Founder at Arc Institute; Dave Burke, CTO at Arc Institute; and Hani Goodarzi, Core Investigator at Arc Institute.

Predicting protein structure (AlphaFold 3, Chai-1, Evo 2) was a big AI/biology breakthrough. The next big leap is modeling entire human cells: how they behave in disease, or how they respond to new therapeutics. The same way LLMs needed enormous text corpora to become truly powerful, Virtual Cell Models need massive, high-quality cellular datasets to train on. In this episode, the teams discuss the groundbreaking release of the Tahoe-100M single cell dataset, Arc Atlas, and how these advancements could transform drug discovery.

Sign up for new podcasts every week. Email feedback to show@no-priors.com.

Follow us on Twitter: @NoPriorsPod | @Saranormous | @Nalidoust | @IAmJohnnyYu | @PDHsh | @Davey_Burke | @Genophoria

Show Notes:
0:00 Introduction
1:40 Significance of Tahoe-100M dataset
4:22 Where we are with virtual cell models and protein language models
10:26 Significance of perturbational data
17:39 Challenges and innovations in data collection
24:42 Open sourcing and community collaboration
33:51 Predictive ability and importance of virtual cell models
35:27 Drug discovery and virtual cell models
44:27 Platform vs. single hypothesis companies
46:05 Rise of Chinese biotechs
51:36 AI in drug discovery

Sarah Guo (host), Johnny Yu (guest), Nima Alidoust (guest), Patrick Hsu (guest), Dave Burke (guest), Hani Goodarzi (guest), Elad Gil (host)
Feb 25, 2025 | 57m

At a glance

WHAT IT’S REALLY ABOUT

Tahoe 100 launches virtual cell era, redefining AI-driven drug discovery

  1. The episode features leaders from Vevo Therapeutics and the Arc Institute announcing Tahoe 100 (released as Tahoe-100M), a 100‑million‑cell single‑cell RNA sequencing dataset, paired with Arc’s 230‑million‑cell SC Basecamp to form a 330‑million‑cell Virtual Cell Atlas.
  2. They argue that biology is entering its “virtual cell” moment, analogous to ImageNet and GPT in vision and language, enabling models that predict how cells respond to genetic and chemical perturbations rather than just protein structures.
  3. The discussion covers why prior single‑cell data were too small, noisy, and observational, how large perturbational datasets unlock causal and predictive modeling, and why open‑sourcing Tahoe 100 is strategically important.
  4. They outline how accurate virtual cell models could transform drug discovery, improve target selection, and reduce clinical failure rates, while reshaping how biotech companies, platforms, and global competition (including China) evolve.

IDEAS WORTH REMEMBERING

5 ideas

Perturbational single-cell data is the missing foundation for causal models in biology.

Most historic single‑cell data are small, observational, and focused on healthy tissue, which limits models to correlations; Tahoe 100 massively expands drug-perturbed, disease-relevant data, enabling models that learn how interventions cause state changes in cells.

Virtual cell models complement, not replace, protein structure and language models.

Protein models capture binding and structural biology, but many drug failures stem from targeting the wrong pathways in complex cellular contexts; virtual cell models aim to learn the ‘system-level’ transcriptomic response of cells, bridging from molecular binding to organism-level outcomes.

Data quality and diversity matter as much as raw scale for training foundation models.

Early single‑cell foundation models barely degraded when trained on only ~1% of prior public data, revealing redundancy and narrow biological coverage; Tahoe 100 focuses on rich perturbations across 50 cancer models and 1,200 drugs with minimal batch effects to maximize information content.

Open-sourcing Tahoe 100 is a strategic move to amplify impact with a small team.

By making the dataset public and combining it with Arc’s SC Basecamp, Vevo and Arc catalyze an ecosystem of external researchers building virtual cell models, effectively multiplying their R&D capacity without scaling headcount.

AI agents can already perform valuable “plumbing” for biology by cleaning and unifying data.

Arc’s SC Basecamp uses an AI agent to crawl the Sequence Read Archive, re‑process heterogeneous datasets with uniform pipelines, and reduce analytical batch effects, demonstrating how agents can automate dry‑lab workflows and create higher‑quality inputs for models.

WORDS WORTH SAVING

5 quotes

“Tahoe 100 is the world's biggest single-cell RNA sequencing dataset… We actually think it's the first dataset that's going to enable machine learning in this space.”

Johnny Yu (Vevo Therapeutics)

“This is the domain we are talking about… the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains in this domain.”

Nima Alidoust (Vevo Therapeutics)

“In biology we have treated humans as the foundation models that ingest information and come up with hypotheses… now we actually want to go beyond that.”

Nima Alidoust (Vevo Therapeutics)

“How do we go from a discipline that primarily respects experiments today to something more like physics, where theory drives a lot of progress? These virtual cell models are a core wedge in making that happen.”

Patrick Hsu (Arc Institute)

“I think it's morning in bio… We should be playing a different kind of game here.”

Nima Alidoust (Vevo Therapeutics)

Tahoe 100: design, scale, and significance of the largest perturbational single‑cell dataset
From protein structure models to virtual cell models and systems biology
Data quality, batch effects, and the importance of perturbational vs observational datasets
The Arc Virtual Cell Atlas and SC Basecamp, including AI agents to curate public data
Scaling laws, token analogies, and where AI for biology sits on the GPT timeline
Virtual cells in drug discovery: target selection, chemistry search space, and clinical impact
Strategic choices: open‑sourcing, platform vs single‑asset biotechs, and Chinese biotech competition

High quality AI-generated summary created from speaker-labeled transcript.
