
No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute
Sarah Guo (host), Johnny Yu (guest), Nima Alidoust (guest), Patrick Hsu (guest), Dave Burke (guest), Hani Goodarzi (guest), Elad Gil (host)
Tahoe 100 launches virtual cell era, redefining AI-driven drug discovery
The episode features leaders from Vevo Therapeutics and the Arc Institute announcing Tahoe 100, a 100‑million–cell single‑cell RNA sequencing dataset, paired with Arc’s 230‑million–cell SC Basecamp to form a 330‑million–cell Virtual Cell Atlas.
They argue that biology is entering its “virtual cell” moment, analogous to ImageNet and GPT in vision and language, enabling models that predict how cells respond to genetic and chemical perturbations rather than just protein structures.
The discussion covers why prior single‑cell data were too small, noisy, and observational, how large perturbational datasets unlock causal and predictive modeling, and why open‑sourcing Tahoe 100 is strategically important.
They outline how accurate virtual cell models could transform drug discovery, improve target selection, and reduce clinical failure rates, while reshaping how biotech companies, platforms, and global competition (including China) evolve.
Key Takeaways
Perturbational single-cell data is the missing foundation for causal models in biology.
Most historic single‑cell data are small, observational, and focused on healthy tissue, which limits models to correlations; Tahoe 100 massively expands drug-perturbed, disease-relevant data, enabling models that learn how interventions cause state changes in cells.
Virtual cell models complement, not replace, protein structure and language models.
Protein models capture binding and structural biology, but many drug failures stem from targeting the wrong pathways in complex cellular contexts; virtual cell models aim to learn the ‘system-level’ transcriptomic response of cells, bridging from molecular binding to organism-level outcomes.
Data quality and diversity matter as much as raw scale for training foundation models.
Early single‑cell foundation models barely degraded when trained on only ~1% of prior public data, revealing redundancy and narrow biological coverage; Tahoe 100 focuses on rich perturbations across 50 cancer models and 1,200 drugs with minimal batch effects to maximize information content.
Open-sourcing Tahoe 100 is a strategic move to amplify impact with a small team.
By making the dataset public and combining it with Arc’s SC Basecamp, Vevo and Arc catalyze an ecosystem of external researchers building virtual cell models, effectively multiplying their R&D capacity without scaling headcount.
AI agents can already perform valuable “plumbing” for biology by cleaning and unifying data.
Arc’s SC Basecamp uses an AI agent to crawl the Sequence Read Archive, re‑process heterogeneous datasets with uniform pipelines, and reduce analytical batch effects, demonstrating how agents can automate dry‑lab workflows and create higher‑quality inputs for models.
Virtual cell models could dramatically reshape drug discovery economics and strategy.
If models can accurately predict how novel compounds move diseased cells toward healthy states across diverse patient contexts, they can narrow target space, prioritize better chemical matter in silico, and potentially cut clinical failure rates, shifting value from brute‑force screening to model‑driven design.
Biotech must shift from slow, hypothesis-heavy models to scalable, hypothesis-light platforms.
The guests argue that falling sequencing and compute costs, plus large-scale platforms like Vevo’s mosaic tumors, make it feasible to generate massive, unbiased datasets, letting models surface hypotheses instead of anchoring entire companies on a single, fragile idea.
Notable Quotes
“Tahoe 100 is the world's biggest single-cell RNA sequencing dataset… We actually think it's the first dataset that's going to enable machine learning in this space.”
— Johnny (Vevo Therapeutics)
“This is the domain we are talking about… the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains in this domain.”
— Nima (Vevo Therapeutics)
“In biology we have treated humans as the foundation models that ingest information and come up with hypotheses… now we actually want to go beyond that.”
— Nima (Vevo Therapeutics)
“How do we go from a discipline that primarily respects experiments today to something more like physics, where theory drives a lot of progress? These virtual cell models are a core wedge in making that happen.”
— Patrick Hsu (Arc Institute)
“I think it's morning in bio… We should be playing a different kind of game here.”
— Nima (Vevo Therapeutics)
Questions Answered in This Episode
How will the field agree on robust benchmarks to evaluate virtual cell models’ predictive power, and what metrics beyond differential gene expression will matter most?
What specific early applications (e.g., target prioritization in certain cancers) are most likely to produce the first clear ‘wins’ from virtual cell models in the next 3–5 years?
How might integrating spatial, multi-omic, and organoid data further improve virtual cell models beyond transcriptomics alone?
What governance and access models are needed to ensure that open datasets like Tahoe 100 don’t just empower big tech and pharma, but also smaller labs and startups globally?
Given the rise of Chinese biotechs with strong execution and lower costs, how should US and European biotech ecosystems adapt their strategies around innovation, AI, and platform building?
Transcript Preview
Hi, listeners. Welcome back to No Priors. Today, we're here with the CEO, CTO, and core investigator of the Arc Institute, as well as the co-founders of Vevo to talk about their release of the Tahoe 100, the largest single-cell drug-perturbed dataset ever created, as well as where we are in AI for biology, why we need a virtual cell model and not just protein structure prediction models, and when we should finally expect to see treatments from this growth of use of machine learning in bio.
Hi, I'm Johnny, and I work on single-cell RNA sequencing at Vevo.
I'm Nima. Uh, I'm one of the founders together with Johnny. Um, I'm a quantum chemist by background, but I've converted to be, being a computational chemist that loves playing with biological data. And we are building Vevo to, to really do that, to predict how chemicals interact with cells in different biological contexts. Some people call it the virtual cell. That's, that's basically what we're working on.
I'm Patrick Hsu, one of the founders at the Arc Institute, which is working at the interface of biology and machine learning to try to understand and, uh, one day treat, uh, complex human diseases, which are most of the major killers.
I'm Dave, uh, CTO at Arc Institute, focused on computational biology and building novel AI models, uh, for biology.
I'm Hani. I'm a core investigator at Arc. I work very closely with Dave and Patrick to push our, you know, virtual cell initiative.
Congratulations, everyone. It's a big day. Uh, let's jump right into it. What is the Tahoe 100, and what is the significance of it?
So, Tahoe 100 is the world's biggest single-cell RNA sequencing dataset, and it enables basically a ton of machine learning applications, including things like the virtual cell, but it also enables a lot of drug discovery applications. And broadly, in the context of where I think we are as a field, it's kinda the beginning of a different way of doing drug discovery, of basically understanding how to build medicines, and basically bringing AI and machine learning people into the mix.
And maybe something I would add, uh, there as well, um, over the last 20 years or so, people have accumulated massive amounts of, you know, data points when it comes to protein structures, uh, protein function, how drug molecules interact with proteins. But one thing that we haven't had as much is how, uh, how different cells behave in different contexts and how different genes whi- within each of those cells actually functions in the presence of the other genes, so, you know, in, in, in these different biological contexts. Um, this, we believe this is the era for that right now, and you have seen the emergence of protein language models built on the datasets that have been accumulated over the last two decades. But now is the era for actually having data on cells, how they function, how they interact with drug molecules. And exactly what Johnny's saying, Tahoe is really a landmark dataset there that allows us to really measure how drugs interact with different cells from different patient models, um, and that gives us the ability to build similar models that we built in protein language models, but in the, um, in the cellular kinda context.