No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute

Name: No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute
Uploaded: 2025-02-25T12:00:00Z
Duration: 57 min 40 s
Description: The episode features leaders from Vevo Therapeutics and the Arc Institute announcing Tahoe 100, a 100‑million–cell single‑cell RNA sequencing dataset, paired with Arc’s 230‑million–cell SC Basecamp to form a 330‑million–cell Virtual Cell Atlas.

No Priors | Feb 25, 2025 | 57m

Sarah Guo (host), Johnny Yu (guest), Nima Alidoust (guest), Patrick Hsu (guest), Dave Burke (guest), Hani Goodarzi (guest), Elad Gil (host)

Tahoe 100: design, scale, and significance of the largest perturbational single‑cell dataset
From protein structure models to virtual cell models and systems biology
Data quality, batch effects, and the importance of perturbational vs observational datasets
The Arc Virtual Cell Atlas and SC Basecamp, including AI agents to curate public data
Scaling laws, token analogies, and where AI for biology sits on the GPT timeline
Virtual cells in drug discovery: target selection, chemistry search space, and clinical impact
Strategic choices: open‑sourcing, platform vs single‑asset biotechs, and Chinese biotech competition

In this episode of No Priors, featuring Sarah Guo and Johnny Yu, leaders from Vevo Therapeutics and the Arc Institute explore how the launch of Tahoe 100 opens a virtual cell era and redefines AI-driven drug discovery.

Tahoe 100 launches virtual cell era, redefining AI-driven drug discovery

The episode features leaders from Vevo Therapeutics and the Arc Institute announcing Tahoe 100, a 100‑million–cell single‑cell RNA sequencing dataset, paired with Arc’s 230‑million–cell SC Basecamp to form a 330‑million–cell Virtual Cell Atlas.

They argue that biology is entering its “virtual cell” moment, analogous to ImageNet and GPT in vision and language, enabling models that predict how cells respond to genetic and chemical perturbations rather than just protein structures.

The discussion covers why prior single‑cell data were too small, noisy, and observational, how large perturbational datasets unlock causal and predictive modeling, and why open‑sourcing Tahoe 100 is strategically important.

They outline how accurate virtual cell models could transform drug discovery, improve target selection, and reduce clinical failure rates, while reshaping how biotech companies, platforms, and global competition (including China) evolve.

Key Takeaways

Perturbational single-cell data are the missing foundation for causal models in biology.

Most historic single‑cell data are small, observational, and focused on healthy tissue, which limits models to correlations; Tahoe 100 massively expands drug-perturbed, disease-relevant data, enabling models that learn how interventions cause state changes in cells.

Virtual cell models complement, not replace, protein structure and language models.

Protein models capture binding and structural biology, but many drug failures stem from targeting the wrong pathways in complex cellular contexts; virtual cell models aim to learn the ‘system-level’ transcriptomic response of cells, bridging from molecular binding to organism-level outcomes.

Data quality and diversity matter as much as raw scale for training foundation models.

Early single‑cell foundation models barely degraded when trained on only ~1% of prior public data, revealing redundancy and narrow biological coverage; Tahoe 100 focuses on rich perturbations across 50 cancer models and 1,200 drugs with minimal batch effects to maximize information content.

Open-sourcing Tahoe 100 is a strategic move to amplify impact with a small team.

By making the dataset public and combining it with Arc’s SC Basecamp, Vevo and Arc catalyze an ecosystem of external researchers building virtual cell models, effectively multiplying their R&D capacity without scaling headcount.

AI agents can already perform valuable “plumbing” for biology by cleaning and unifying data.

Arc’s SC Basecamp uses an AI agent to crawl the Sequence Read Archive, re‑process heterogeneous datasets with uniform pipelines, and reduce analytical batch effects, demonstrating how agents can automate dry‑lab workflows and create higher‑quality inputs for models.

Virtual cell models could dramatically reshape drug discovery economics and strategy.

If models can accurately predict how novel compounds move diseased cells toward healthy states across diverse patient contexts, they can narrow target space, prioritize better chemical matter in silico, and potentially cut clinical failure rates, shifting value from brute‑force screening to model‑driven design.

Biotech must shift from slow, hypothesis-heavy models to scalable, hypothesis-light platforms.

The guests argue that falling sequencing and compute costs, plus large-scale platforms like Vevo’s mosaic tumors, make it feasible to generate massive, unbiased datasets, letting models surface hypotheses instead of anchoring entire companies on a single, fragile idea.

Notable Quotes

“Tahoe 100 is the world's biggest single-cell RNA sequencing dataset… We actually think it's the first dataset that's going to enable machine learning in this space.”
— Johnny (Vevo Therapeutics)

“This is the domain we are talking about… the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains in this domain.”
— Nima (Vevo Therapeutics)

“In biology we have treated humans as the foundation models that ingest information and come up with hypotheses… now we actually want to go beyond that.”
— Nima (Vevo Therapeutics)

“How do we go from a discipline that primarily respects experiments today to something more like physics, where theory drives a lot of progress? These virtual cell models are a core wedge in making that happen.”
— Patrick Hsu (Arc Institute)

“I think it's morning in bio… We should be playing a different kind of game here.”
— Nima (Vevo Therapeutics)

Questions Answered in This Episode

How will the field agree on robust benchmarks to evaluate virtual cell models’ predictive power, and what metrics beyond differential gene expression will matter most?

What specific early applications (e.g., target prioritization in certain cancers) are most likely to produce the first clear ‘wins’ from virtual cell models in the next 3–5 years?

How might integrating spatial, multi-omic, and organoid data further improve virtual cell models beyond transcriptomics alone?

What governance and access models are needed to ensure that open datasets like Tahoe 100 don’t just empower big tech and pharma, but also smaller labs and startups globally?

Given the rise of Chinese biotechs with strong execution and lower costs, how should US and European biotech ecosystems adapt their strategies around innovation, AI, and platform building?

Transcript Preview

Sarah Guo

Hi, listeners. Welcome back to No Priors. Today, we're here with the CEO, CTO, and core investigator of the Arc Institute, as well as the co-founders of Vevo to talk about their release of the Tahoe 100, the largest single-cell drug-perturbed dataset ever created, as well as where we are in AI for biology, why we need a virtual cell model and not just protein structure prediction models, and when we should finally expect to see treatments from this growth of use of machine learning in bio.

Johnny Yu

Hi, I'm Johnny, and I work on single-cell RNA sequencing at Vevo.

Nima Alidoust

I'm Nima. Uh, I'm one of the founders together with Johnny. Um, I'm a quantum chemist by background, but I've converted to be, being a computational chemist that loves playing with biological data. And we are building Vevo to, to really do that, to predict how chemicals interact with cells in different biological contexts. Some people call it the virtual cell. That's, that's basically what we're working on.

Patrick Hsu

I'm Patrick Hsu, one of the founders at the Arc Institute, which is working at the interface of biology and machine learning to try to understand and, uh, one day treat, uh, complex human diseases, which are most of the major killers.

Dave Burke

I'm Dave, uh, CTO at Arc Institute, focused on computational biology and building novel AI models, uh, for biology.

Hani Goodarzi

I'm Hani. I'm a core investigator at Arc. I work very closely with Dave and Patrick to push our, you know, virtual cell initiative.

Sarah Guo

Congratulations, everyone. It's a big day. Uh, let's jump right into it. What is the Tahoe 100, and what is the significance of it?

Johnny Yu

So, Tahoe 100 is the world's biggest single-cell RNA sequencing dataset, and it enables basically a ton of machine learning applications, including things like the virtual cell, but it also enables a lot of drug discovery applications. And broadly, in the context of where I think we are as a field, it's kinda the beginning of a different way of doing drug discovery, of basically understanding how to build medicines, and basically bringing AI and machine learning people into the mix.

Nima Alidoust

And maybe something I would add, uh, there as well, um, over the last 20 years or so, people have accumulated massive amounts of, you know, data points when it comes to protein structures, uh, protein function, how drug molecules interact with proteins. But one thing that we haven't had as much is how, uh, how different cells behave in different contexts and how different genes whi- within each of those cells actually functions in the presence of the other genes, so, you know, in, in, in these different biological contexts. Um, this, we believe this is the era for that right now, and you have seen the emergence of protein language models built on the datasets that have been accumulated over the last two decades. But now is the era for actually having data on cells, how they function, how they interact with drug molecules. And exactly what Johnny's saying, Tahoe is really a landmark dataset there that allows us to really measure how drugs interact with different cells from different patient models, um, and that gives us the ability to build similar models that we built in protein language models, but in the, um, in the cellular kinda context.
