Y Combinator

Tokenmaxxing: How Top Builders Use AI To Do The Work Of 400 Engineers

Garry Tan on tokenmaxxing workflows: thin harness, fat skills, 400x output gains.

Garry Tan (guest) · Jared Friedman (host)
May 8, 2026 · 41m · Watch on YouTube ↗
Chapters:

  1. Returning to building after an investing/leadership hiatus
  2. Rebuilding Postrous/Posthaven as Garry’s List
  3. Agentic journalism: retrieval, citations, cross-source analysis
  4. Tokenmaxxing and “boil the ocean” research mindset
  5. GStack skills repo and meta-prompting (CEO Plan, ENG review)
  6. QA automation with Playwright and browser-based agent testing
  7. Thin harness, fat skills; tests and reliability in AI-generated code
  8. Claude Code vs Codex as complementary agents
  9. Personal AI vs corporate-controlled AI feeds
  10. Lines-of-code controversy and measuring AI-driven productivity
AI-generated summary based on the episode transcript.

In this episode of Y Combinator, Jared Friedman talks with Garry Tan about tokenmaxxing workflows: thin harness, fat skills, and 400x output gains. Garry describes returning to coding after 13 years and rebuilding a full-featured blog and agentic research system in days by directing AI agents rather than writing most of the code himself.

At a glance

WHAT IT’S REALLY ABOUT

Tokenmaxxing workflows: thin harness, fat skills, 400x output gains

  1. Garry Tan describes returning to coding after 13 years and rebuilding a full-featured blog and agentic research system in days by directing AI agents rather than writing most code himself.
  2. He introduces “tokenmaxxing” as intentionally spending more tokens to “boil the ocean,” gather broader context, cross-check sources, and increase software quality and completeness.
  3. He shares the practical workflow patterns behind his output—structured plans, ASCII diagrams, iterative reviews, and high test coverage—plus the need for humans to stay in the loop to provide goals, taste, and judgment.
  4. He explains the “thin harness, fat skills” philosophy: reuse standardized agent harnesses while investing creativity in markdown prompt “skills” that encode process and decision-making.
  5. He argues the near-future will be “personal AI,” where individuals control their own data and prompts, versus corporate-controlled systems that shape outputs via opaque incentives.

IDEAS WORTH REMEMBERING

5 ideas

AI makes “builder mode” accessible again—if you learn to direct it.

Tan attributes his comeback to treating Claude Code like a high-powered tool: he provides intent, constraints, and review loops while the model executes. The leverage comes from orchestration—running many parallel “agents”—more than typing speed.

Tokenmaxxing is a strategy: pay for completeness, not just speed.

Instead of minimizing token spend, he advocates spending more to broaden retrieval (e.g., 20 sources vs 1), reconcile disagreements, and produce more reality-grounded outputs—whether for investigative writing or engineering decisions.

Structured prompting (plans + diagrams) reduces “vibe coding slop.”

Having the model first generate ASCII diagrams of data flows, state machines, and error paths forces explicit architecture. This loads critical context up front and yields more correct, complete implementations than jumping straight to code.

High test coverage is the antidote to brittle AI code.

Tan says early “it works for 80%” failures pushed him toward aggressive automated testing; he now targets roughly 80–90% coverage (not necessarily 100%) plus integration/E2E checks to make AI-written changes production-safe.
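The episode doesn’t specify Tan’s tooling, but for a Python project using pytest-cov, a CI gate like the following is one way to enforce that 80–90% target (`myapp` and the `e2e` marker are hypothetical names):

```shell
# Fail the build if line coverage drops below 85% (midpoint of an 80–90% target).
pytest --cov=myapp --cov-fail-under=85

# Run integration/E2E checks separately so slow browser tests
# don't block fast unit-test feedback.
pytest tests/e2e -m e2e
```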

Build with a “thin harness, fat skills” mental model.

The harness is the reusable execution loop (tool calls, running commands, applying diffs); the differentiator is the markdown “skills” that encode review processes, product taste, and decision heuristics. Trying to force nuanced judgment into brittle deterministic code often backfires.
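A “thin harness” in this sense is little more than a loop that routes a model’s tool calls to real executors. The sketch below uses a stubbed model (the episode names no specific LLM API), but it shows where the boundary sits: the loop is generic and reusable, while all the judgment would live in the skill markdown passed in as plain text.

```python
# Minimal agent harness sketch: the loop is thin and reusable; the "skill"
# (a markdown prompt) carries the process. call_model is a stub — any real
# LLM client could be dropped in behind the same interface.
import subprocess

TOOLS = {
    # One tool: run a shell command and capture its stdout.
    "run": lambda arg: subprocess.run(arg, shell=True, capture_output=True,
                                      text=True).stdout,
}

def call_model(skill_md, transcript):
    # Stub: a real harness would send skill_md + transcript to an LLM and
    # parse its reply. Here we just echo once to show the control flow.
    if not transcript:
        return {"tool": "run", "arg": "echo hello"}
    return {"done": True, "answer": transcript[-1].strip()}

def harness(skill_md):
    """Thin loop: ask the model for an action, execute it, repeat until done."""
    transcript = []
    while True:
        action = call_model(skill_md, transcript)
        if action.get("done"):
            return action["answer"]
        transcript.append(TOOLS[action["tool"]](action["arg"]))

print(harness("# ENG review skill\nCheck tests, architecture, error handling."))
# → prints "hello"
```

Swapping in a different skill file changes the agent’s behavior without touching the loop, which is the point of keeping the harness thin.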

WORDS WORTH SAVING

5 quotes

I think that's like the defining question. Like, will you have control over your own tools or will your tools have control over you?

Garry Tan

And then in this case, it took about $200, which was my Claude Code Max account, and probably five days. Full-featured blog platform, does everything you want...

Garry Tan

You pay more money and you might be token maxing, but you should token max.

Garry Tan

I feel like using OpenClaw these days is like driving a Ferrari, and it's, like, exhilarating. It's insane... but then it's also like a Ferrari in that you better be a mechanic.

Garry Tan

Like, you could buy millions of years of- consciousness- of machine consciousness... Now I can be a time billionaire. It's not, you know, my own time. It's the time of a machine- like, doing work for me.

Garry Tan

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

In Garry’s List, what exact retrieval pipeline did you use (chunking strategy, hybrid RRF settings, citation generation), and what failure modes did you hit early?

Garry Tan describes returning to coding after 13 years and rebuilding a full-featured blog and agentic research system in days by directing AI agents rather than writing most code himself.

When you say “tokenmax,” how do you decide the point of diminishing returns—what metrics (accuracy, latency, cost per task) tell you to stop adding context?

He introduces “tokenmaxxing” as intentionally spending more tokens to “boil the ocean,” gather broader context, cross-check sources, and increase software quality and completeness.

Can you share a concrete example of an ASCII data-flow diagram prompt that noticeably reduced bugs or rework in a real feature?

He shares the practical workflow patterns behind his output—structured plans, ASCII diagrams, iterative reviews, and high test coverage—plus the need for humans to stay in the loop to provide goals, taste, and judgment.

What does your ENG review skill actually check (testing, architecture, security, error handling), and which parts had the biggest impact on reliability?

He explains the “thin harness, fat skills” philosophy: reuse standardized agent harnesses while investing creativity in markdown prompt “skills” that encode process and decision-making.

Why is 80–90% test coverage the sweet spot for agent-written code in your experience—what kinds of tests do you intentionally avoid at that level?

He argues the near-future will be “personal AI,” where individuals control their own data and prompts, versus corporate-controlled systems that shape outputs via opaque incentives.

Chapter Breakdown

Control vs. dependency: personal tools in the age of AI

The conversation opens with a framing question: will people control their AI tools, or will the tools control them? Garry uses the “Ferrari” metaphor to capture both the power and brittleness of modern agentic coding systems—fast and exhilarating, but requiring skill to maintain and debug.

Returning to coding after 13 years—why now

Jared sets the premise: Garry returned from a long break and suddenly shipped an extraordinary volume of code while running YC. Garry describes the surprise of reawakening his builder identity and attributes the change to the new capabilities of tools like Claude Code.

Rebuilding Postrous/Posthaven as Garry’s List—at token-scale cost and speed

Garry explains the origin of Garry’s List: building a site to organize people around California policy issues that matter to him. He recounts rebuilding his old blogging platform a third time—this time in days and for roughly the cost of an AI subscription—while adding modern agentic research capabilities.

Software that thinks like a journalist: agentic research, sourcing, and synthesis

The product evolves beyond a publishing tool into a system that performs investigative-journalism-like work. Garry describes agentic retrieval: crawling, cross-referencing sources, extracting quotables, and producing deeply sourced reports using multiple APIs and toolchains.

The rise of “tokenmaxxing”: spending tokens to buy completeness and truth

Garry introduces tokenmaxxing as a philosophy: if more tokens yield more complete, higher-quality work, spend them. He argues this will extend to nearly all knowledge work, with humans supplying goals, values, and agency while machines handle the heavy lifting.

Accidental productization: how GStack emerged from repeated prompts

Garry explains that GStack wasn’t planned—it emerged from noticing repeated interactions and converting them into reusable ‘skills.’ He describes techniques like forcing models to produce ASCII diagrams of data flows and user flows before coding, which improves completeness and reduces confusion.

The 400x workflow: queued PRs, plan-first execution, and multi-agent reviews

Garry walks through his day-to-day system for shipping: using plan mode, batching features, and running a sequence of specialized reviews before execution. The workflow is designed to scale output while preserving quality through automated checks and human-in-the-loop decision points.

QA automation and the birth of a browser harness (Playwright → Browse/QA)

Manual testing became the bottleneck, so Garry tried to automate QA using browser tooling. Slow MCP browser control pushed him to wrap Playwright, which evolved into a long-lived daemon with a CLI (‘Browse’) and a QA mode that tests UI/data mutations based on branch context.

Thin Harness, Fat Skills: where code ends and ‘latent space’ begins

Garry crystallizes his philosophy: don’t rebuild harnesses; invest in high-leverage skills (often in markdown) that guide agents effectively. He contrasts deterministic code (brittle, exact) with LLM ‘latent space’ (contextual, adaptable), and argues the art is choosing the right boundary between them.

Agents are like Ferraris: brittle systems, mechanics mindset, and self-healing loops

The group returns to the Ferrari metaphor: the tools are powerful but can fail in surprising ways. A key shift is that brittleness matters less when another agent (or Claude Code) can continuously repair and maintain the system, accelerating iteration despite instability.

Gbrain and OpenClaw: building a personal AI with better memory than grep

Garry describes moving beyond Claude Code into OpenClaw-based workflows and building ‘Gbrain’ to improve how agents use personal context. He notes that naive approaches (grep over markdown) waste context, and he reuses RAG learnings from Garry’s List—chunking, embeddings, and hybrid retrieval—into a personal system.
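Hybrid retrieval of the kind mentioned here is commonly implemented with reciprocal rank fusion (RRF), which merges a keyword ranking and a vector-similarity ranking without needing comparable scores. The episode doesn’t give Tan’s exact settings, so this is a generic sketch with the conventional default constant k=60:

```python
# Generic reciprocal rank fusion (RRF): merge several ranked lists of doc IDs.
# Each document scores sum(1 / (k + rank)) over the lists it appears in.

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first). Returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked moderately by BOTH retrievers beats a doc one list ranks alone.
keyword_hits = ["a", "b", "c"]   # e.g. BM25 over markdown chunks
vector_hits  = ["b", "d", "a"]   # e.g. embedding nearest neighbors
print(rrf_fuse([keyword_hits, vector_hits]))
# → ['b', 'a', 'd', 'c']
```

The appeal over grep is exactly what Garry notes: the fused list surfaces documents that are consistently relevant across signals instead of burning context on one noisy match.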

400x output and the lines-of-code controversy: what it measures and what it doesn’t

Garry addresses backlash over citing lines of code as a productivity metric. He argues LoC is imperfect but can be normalized (logical lines), and that AI-directed coding changes incentives versus human LoC ‘padding’; the bigger point is that the tools raise the ceiling on what a capable builder can ship.
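“Logical lines” isn’t precisely defined in the episode; one common normalization is to count only non-blank, non-comment lines, as in this sketch (the `comment_prefix` handling is a simplification that ignores block comments and strings):

```python
# Count "logical" lines of code: skip blank lines and comment-only lines.
# A rough normalization — Tan's exact definition isn't specified.

def logical_loc(source, comment_prefix="#"):
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            count += 1
    return count

sample = """
# a comment
x = 1

y = x + 1  # trailing comments still count as code
"""
print(logical_loc(sample))  # → 2
```

Even this crude filter removes the easiest forms of padding (blank lines, comment banners), which is the direction of Garry’s “normalize it” argument.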

The future of personal AI and buying back time with tokens

The episode closes on a vision: everyone will have a personal AI with their own data, integrations, and prompts—or else rely on corporate-controlled feeds and opaque incentives. Token spend becomes a way to ‘buy time’ by borrowing machine labor, turning builders into “time billionaires” through scalable machine work.
