Artie: Real Time Data Streaming For The AI Age

In this episode of Founder Firesides, YC Managing Partner Jared Friedman talks to the founders of Artie (S23), Jacqueline Cheong and Robin Tang, who have just announced their Series A. Artie is a real-time data streaming platform for cutting-edge companies that keeps data up to date and reliable as it moves between systems.

Jared Friedman (host) · Jacqueline Cheong (guest) · Robin Tang (guest)

Jan 26, 2026 · 26m

CHAPTERS

0:05 – 1:00
Artie overview and Series A announcement
Jared opens by introducing Artie founders Robin Tang and Jacqueline Cheong and congratulating them on their newly announced Series A. Jacqueline explains Artie’s core mission: moving production data between systems in real time as changes happen.
- •Artie is a real-time data streaming platform for moving data across systems
- •Example use case: streaming Postgres changes into Snowflake continuously
- •Series A details: $12M, led by Dalton Caldwell with Paul Buchheit and Bryan Berg of Standard Capital
- •Sets the stage for why “real-time” matters for modern data stacks
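To make the Postgres-to-Snowflake example concrete, here is a minimal sketch of what applying a stream of change events to a destination can look like. The event shape (`op`, `key`, `row`) and the `apply_change` helper are hypothetical and chosen for illustration only; this is not Artie's API, and a plain dict stands in for the warehouse table.

```python
from typing import Any

# Hypothetical change-event shape: op is "c" (create), "u" (update) or "d" (delete).
Event = dict[str, Any]


def apply_change(table: dict[int, dict[str, Any]], event: Event) -> None:
    """Apply one change event to an in-memory stand-in for a destination table."""
    op, key = event["op"], event["key"]
    if op in ("c", "u"):
        # Treat creates and updates as upserts so that replays are idempotent.
        table[key] = event["row"]
    elif op == "d":
        # Deleting a missing key is a no-op, again so that replays are safe.
        table.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Assumed sample events, in the order the source database committed them.
events: list[Event] = [
    {"op": "c", "key": 1, "row": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "key": 2, "row": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "key": 1, "row": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "key": 2},
]

destination: dict[int, dict[str, Any]] = {}
for ev in events:
    apply_change(destination, ev)
```

After the loop, `destination` holds only row 1 with its updated email, which is the state the source table would be in. Treating every write as an upsert is a design choice that makes retries harmless, at the cost of needing a stable primary key.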
1:00 – 2:00
Origin story: the persistent pain of “data isn’t fresh enough”
Robin traces the company’s roots to repeated frustrations across multiple roles where teams wanted fresher, faster data for experimentation and operations. He found existing tools didn’t fit production-database needs and internal builds were costly and slow.
- •Growth/ops use cases needed near-real-time data, but data teams couldn’t prioritize it
- •Managed batch ETL tools worked for SaaS apps but not production databases
- •Buyer mismatch: business-user tools vs infra/platform engineer needs
- •Attempting to build internally proved expensive and still failed to reach production quality
2:00 – 3:24
Why building CDC connectors in-house is a trap (and why Artie had to exist)
The founders argue that it is irrational for companies to spend 1–2 years building basic change data capture (CDC) pipelines such as Postgres-to-Snowflake. Jared agrees that this work is rarely a company’s core competency, which is the gap that motivated Robin to build a productized solution.
- •In-house CDC projects can take years and still not be production-ready
- •Even “simple” connectors hide distributed-systems complexity
- •Opportunity: package hard-won reliability into a product
- •Founding motivation: if it doesn’t exist, build it
3:24 – 4:17
Early build timeline and YC-era “fake it till self-serve” onboarding
Robin explains it took ~6 months to build early infrastructure (faster now with better AI tooling), and much longer to make the product truly self-serve. They describe scrappy onboarding tactics, including manual workflows behind a self-serve facade.
- •Initial build: ~6 months; could be 2–3 months if rebuilt today
- •Early onboarding tracked via Google Sheets; manual backfills kicked off by founders
- •During YC, UI actions triggered founder Slack pings to do work manually
- •Took ~10 months total to become genuinely self-serve
4:17 – 5:23
What existed at YC application time: infrastructure first, sales skills later
Jacqueline describes that by the start of YC the core infrastructure already existed, but usability and UI were still developing. Much of YC was spent learning customer conversations, selling, and iterating on what the market needed.
- •Idea crystallized 6–8 months before YC
- •Core infrastructure was built pre-YC; UI and productization came next
- •YC focus: learning sales, customer discovery, and go-to-market motion
- •Transition from “it works” to “people can use it”
5:23 – 8:47
Landing Substack: mission-critical first customer via cold email + POC
They recount how Substack became their first major customer, despite the high stakes of deploying unproven infrastructure. Confidence came from a rigorous proof of concept (POC) that pushed huge volumes under strict constraints, and from the urgency of the customer’s need.
- •Real-time pipelines are inherently mission critical and high risk to adopt
- •Substack needed low-load extraction from a massive Postgres DB into Snowflake
- •They won trust through a demanding POC (billions of rows, tight constraints)
- •The lead came from a cold email; the head of data replied within an hour
8:47 – 10:18
Why growth is lumpy for infra: long gaps between ‘big bets’
Jared highlights that infrastructure adoption doesn’t produce easy early hockey-stick growth because each deployment is a major bet. The founders confirm it took many months to land another Substack-scale customer and compare this to Substack’s own early trajectory.
- •Infra deployments require replacing/risking foundational systems
- •Customer acquisition is slower due to high diligence and trust requirements
- •They got 7–8 customers during YC, mostly early adopters
- •It took ~9 months after Substack to land another similar-scale customer
10:18 – 13:12
Reaching $1M ARR with a tiny team: disciplined hiring and founder-led sales
Jared notes Artie hit $1M ARR with only four people, far smaller than typical pre-AI SaaS benchmarks. Jacqueline attributes this to strict adherence to YC hiring discipline and keeping customer-facing sales tightly founder-driven to speed iteration.
- •At $1M ARR: two founders + two engineers
- •Hiring philosophy: avoid ‘bad hires,’ hire only for the biggest constraint
- •Founder-led sales built trust and accelerated feedback loops
- •Fast product iteration because founders directly handled objections and requests
13:12 – 16:49
Married co-founders: deciding to jump in and how it changes the work dynamic
The conversation shifts to their personal partnership: they are married and co-founded Artie together after evaluating whether their conflict style would work in business. Jacqueline explains how working together increased closeness and reduced communication friction.
- •Jacqueline was initially skeptical; conviction grew through engineer interviews
- •They assessed how they “fight” and whether conflict is productive
- •They quit jobs at the same time to start the company
- •Marriage reduces filtering—issues get discussed early, improving speed
16:49 – 17:27
Work-life boundaries (or lack thereof) when building high-stakes infra
They discuss whether they maintain boundaries between work and life, concluding that during this phase they largely don’t. The intensity of Artie’s mission-critical operations dominates day-to-day life.
- •They debated boundaries but decided strict separation isn’t realistic now
- •Company-building intensity leaves little else to discuss
- •Acknowledgement that it’s hard and all-consuming
- •Reflects typical early-stage infra-founder workload
17:27 – 20:53
Battle scars of real-time data: backfills, edge cases, and undocumented ‘right ways’
Robin and Jacqueline describe the technical reality: production data pipelines fail in countless messy ways that don’t show up locally. They cover online backfills while streaming, performance constraints at massive scale, and hard connector work like SQL Server CDC.
- •Production introduces unknown unknowns beyond volume: complexity and messy data
- •Examples of weird data: invalid dates/months in MongoDB, schema evolution challenges
- •Online backfill + concurrent streaming pattern (Kafka accumulation/drain)
- •Scale pressures: 10B-row tables force constant efficiency improvements
- •SQL Server CDC requires an undocumented approach; trigger-based CDC is too slow for enterprises
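The online backfill pattern mentioned above (changes accumulate while the snapshot is copied, then get drained) can be sketched in a few lines. This is a toy model under stated assumptions, not Artie's implementation: a dict stands in for the production table, a `deque` stands in for the Kafka topic, and the `write` helper is hypothetical. The one property that matters is that change capture starts before the backfill does.

```python
from collections import deque

source = {1: "a", 2: "b", 3: "c", 4: "d"}   # stand-in for the production table
change_log: deque = deque()                  # stand-in for a Kafka topic
destination: dict[int, str] = {}


def write(key, value):
    """Mutate the source and record the change, as CDC would (None = delete)."""
    if value is None:
        source.pop(key, None)
    else:
        source[key] = value
    change_log.append((key, value))


# 1. Backfill in chunks of two keys while writes keep arriving between chunks.
keys = sorted(source)
for start in range(0, len(keys), 2):
    for k in keys[start:start + 2]:
        if k in source:                      # row may have been deleted mid-backfill
            destination[k] = source[k]
    if start == 0:
        # Simulated concurrent traffic, landing after the first chunk is copied.
        write(1, "a2")      # update to an already-copied row
        write(4, None)      # delete of a not-yet-copied row
        write(5, "e")       # insert that the snapshot's key list never saw

# 2. Drain the accumulated changes in order; upserts and deletes are idempotent.
while change_log:
    key, value = change_log.popleft()
    if value is None:
        destination.pop(key, None)
    else:
        destination[key] = value
```

Once the drain finishes, `destination` matches `source` even though three writes raced with the copy. The sketch works because every buffered change is safe to replay on top of whatever the snapshot already copied; a real pipeline also has to preserve ordering per key and handle schema changes during the backfill.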
20:53 – 22:19
Owning the whole stack: Kafka SDK ordering bugs and customer expectations
They explain how reliability demands extend beyond their code—third-party library bugs become Artie’s responsibility. A Kafka client rebalancing/ordering issue caused out-of-order reads, forcing deep debugging and a vendor switch, plus stronger customer guardrails.
- •Kafka consumer rebalancing can create ordering hazards under load
- •They discovered an SDK bug that violated the semantics defined in a Kafka Improvement Proposal (KIP)
- •Out-of-order consumption leads to incorrect destination state—integrity is paramount
- •Customers don’t care where the bug lives; ‘their bug is your bug’
- •Need for guardrails even when customers misconfigure systems
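One common kind of guardrail against out-of-order delivery is to track the last applied position per key and drop anything older. The sketch below assumes each event carries a monotonically increasing position (for example a Kafka offset within a partition, or a source log sequence number); the `apply_in_order` function and the sample events are hypothetical, and the episode does not say this is how Artie handles it.

```python
def apply_in_order(events):
    """Apply (offset, key, value) events, ignoring any that arrive stale."""
    state: dict[str, str] = {}
    last_offset: dict[str, int] = {}
    skipped = []
    for offset, key, value in events:
        if offset <= last_offset.get(key, -1):
            # Older than (or a duplicate of) what was already applied for this key.
            skipped.append(offset)
            continue
        state[key] = value
        last_offset[key] = offset
    return state, skipped


# Offsets 0..3 were produced in order, but offset 1 is delivered late.
delivered = [
    (0, "user:1", "free"),
    (2, "user:1", "pro"),
    (1, "user:1", "trial"),
    (3, "user:2", "free"),
]
state, skipped = apply_in_order(delivered)
```

Without the offset check, the late `"trial"` event would overwrite `"pro"` and the destination would silently disagree with the source. With it, `user:1` stays at `"pro"` and offset 1 is recorded as skipped, which also gives operators a signal that the consumer saw reordering.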
22:19 – 23:16
Where Artie is now: 700B+ rows processed and scaling toward trillions
Jacqueline shares current scale metrics and how demand is rising with real-time AI workloads and agentic use cases. They plan to increase reliability and scalability as volume grows by an order of magnitude.
- •Processed 700B+ rows over the last 12 months (~12x YoY)
- •Early baseline: Substack at a few billion rows/month
- •Market pull from AI/agentic use cases needing real-time production data
- •Focus: reliability and scalability from hundreds of billions to trillions
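For a sense of scale, the figures above can be turned into rough averages. This is back-of-the-envelope arithmetic only: it treats "700B+" as exactly 700 billion (a lower bound), assumes a 365-day year, and ignores the fact that real traffic is bursty rather than evenly spread.

```python
ROWS_LAST_12_MONTHS = 700e9      # "700B+" from the episode, used as a lower bound
GROWTH_MULTIPLE = 12             # "~12x YoY"
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Average sustained throughput if the rows were spread evenly over the year.
avg_rows_per_second = ROWS_LAST_12_MONTHS / SECONDS_PER_YEAR

# Volume in the preceding 12 months implied by the growth multiple.
implied_prior_year_rows = ROWS_LAST_12_MONTHS / GROWTH_MULTIPLE

print(f"average throughput: ~{avg_rows_per_second:,.0f} rows/s")
print(f"implied prior 12 months: ~{implied_prior_year_rows / 1e9:.0f}B rows")
```

That works out to roughly 22,000 rows per second on average, and an implied prior-year volume of a little under 60 billion rows. Peak load is what a pipeline has to be sized for, so the average is only a floor.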
23:16 – 24:23
Team growth plan and founder advice: move fast, don’t overthink
They outline aggressive hiring plans—tripling the team—and the roles needed to support scale. Jacqueline closes with tactical advice: execute, observe, and iterate rather than getting stuck in analysis.
- •Hiring plan: triple headcount this year
- •Roles: engineering, sales, marketing, ops, sales engineers; recruiter is critical
- •Hiring capacity itself becomes a bottleneck at this stage
- •Advice: try things quickly, learn from reality, and adjust
24:23 – 26:33
Roadmap: beyond CDC into events APIs and new real-time destinations
They describe expanding from database CDC into an events API and more sources/destinations where low latency matters. Examples include sub-second warehouse queryability and future destinations like Elasticsearch for real-time indexing/search.
- •Started with CDC from production databases due to volume/complexity and real-time need
- •Launched an events API streaming into Snowflake/Databricks/Redshift
- •Latency goal: events queryable in warehouses within ~100–200ms
- •Riding new native streaming capabilities (Snowflake streaming, BigQuery Storage Write API)
- •Future destinations: Elasticsearch for real-time indexing/search use cases
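A sub-second latency target like the one described above usually implies flushing small batches on a timer rather than waiting for large ones. The sketch below shows that general idea with a size cap and an age cap. The `MicroBatcher` class, its parameters, and the 100 ms default are assumptions for illustration; it calls no warehouse API and is not a description of how Artie's events API works.

```python
import time
from typing import Callable, Optional


class MicroBatcher:
    """Buffer events and flush when the batch is full or too old."""

    def __init__(self, flush: Callable[[list], None], max_size: int = 500,
                 max_age_s: float = 0.1,
                 clock: Callable[[], float] = time.monotonic):
        self.flush_fn = flush
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer: list = []
        self.oldest: Optional[float] = None   # arrival time of the first buffered event

    def add(self, event) -> None:
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer.append(event)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the size cap or the age cap has been reached."""
        if not self.buffer:
            return
        too_old = self.clock() - self.oldest >= self.max_age_s
        if len(self.buffer) >= self.max_size or too_old:
            self.flush_fn(self.buffer)
            self.buffer = []


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
flushed: list = []
batcher = MicroBatcher(flushed.append, max_size=3, max_age_s=0.1, clock=lambda: now[0])

batcher.add("e1")
batcher.add("e2")
now[0] = 0.15            # 150 ms pass with no new events
batcher.maybe_flush()    # age cap flushes the partial batch
for e in ("e3", "e4", "e5"):
    batcher.add(e)       # size cap flushes on the third add
```

The age cap bounds how long any single event waits before it is written, while the size cap keeps batches efficient under load. In a real service `maybe_flush` would be driven by a background timer rather than called by hand.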