Reducto: Making Human Data LLM-Ready With State-of-the-Art Accuracy

Reducto just raised $24.5M in Series A funding to help enterprises unlock unstructured data with near-perfect accuracy. AI teams today are bottlenecked by messy, real-world documents, so Reducto built the most accurate parsing pipeline in the industry. By combining vision-language models with agentic workflows, Reducto turns complex PDFs and scanned documents into structured, LLM-ready data.

Now trusted by companies like Scale AI, Vanta, and top AI teams, Reducto has parsed over 250 million pages and is expanding into full end-to-end pipelines: document splitting, classification, structured extraction, and more. With their new Agentic OCR framework, they’re pushing toward human-level accuracy, automating in seconds what used to take teams days.

YC Partner Diana Hu recently sat down with the Reducto founders to talk about how they got here, their founding story, and the kind of company they are building.

Learn more about Reducto at https://reducto.ai.
Apply to Y Combinator: https://ycombinator.com/apply

Chapters (Powered by ChapterMe)
00:00 - Data-driven AI for large enterprises
01:17 - Document management
03:04 - Simplify PDF processing for companies
03:59 - Aha moment for PDF extraction, interesting approach
05:02 - NLP-based PDF extraction for enterprise apps
06:56 - Great data, exciting use cases
08:10 - Best places for customer approaches
08:48 - Closing a Fortune 25 deal in just two months
11:21 - Data-driven AI for high-quality documents
13:19 - Reducto’s AI-focused infrastructure attracts top companies
15:18 - Quality of data, results, support

Diana Hu, host

May 1, 2025 · 15 min

CHAPTERS

Reducto’s mission: turning complex documents into LLM-ready structured data
The founders explain that Reducto converts messy, high-stakes enterprise documents (claims, health records, financial statements) into clean structured outputs. The primary goal is to make downstream LLM use cases—like RAG and summarization—work reliably with real-world inputs.
Early enterprise adoption and why this became a bottleneck worth solving
Reducto quickly found traction with sophisticated customers, including named startups and unnamed trillion-dollar enterprises. The team frames document ingestion as a core bottleneck preventing AI applications from reaching production quality.
From long-term LLM memory to a surprise “marketing stunt” that became the product
The company didn’t set out to build a document processing platform; it emerged while building a different LLM product. A quick-and-dirty segmentation model shared as a blog/experiment drew strong inbound interest and revealed unmet demand.
The schlep-blindness insight: PDFs are hard, everyone suffers, nobody wants to own it
They discuss how PDF/document extraction is a known pain point but avoided because it’s perceived as unglamorous. Reducto leaned into this “boring but essential” infrastructure gap—similar to the early opportunity Stripe exploited in payments.
A vision-first approach: reading documents like humans instead of rules and heuristics
Reducto’s technical pivot was to treat document understanding as a computer vision problem rather than a pile of PDF-specific rules. Layout cues—spacing, indentation, hierarchy—become first-class signals for building robust, general parsing.
Customer impact: big jumps in downstream LLM accuracy and less post-processing work
Customers often see large improvements in end-to-end LLM performance simply by switching ingestion providers. Reducto also reduces the need for manual cleanup steps like chunking and post-processing, letting teams focus on product logic and reasoning.
New capabilities unlocked: scanned docs, handwriting, checkboxes, and messy real-world artifacts
They highlight cases where previously impossible workflows become feasible, including scanned documents without metadata and forms with handwriting. By combining deterministic computer vision with the strengths of modern vision-language models (VLMs) and OCR, Reducto handles edge cases like highlights, circled numbers, and instruction-guided extraction.
How they closed a Fortune 25 enterprise deal rapidly: credibility through performance
The founders recount an intense evaluation process that began during YC and culminated in a full enterprise contract. Their main competitor was the customer’s internal document processing team, making it a high bar—yet they won by proving speed and quality gains.
Reaching “SOTA” in document extraction: data engine, benchmarks, and iteration speed
Reducto found that public datasets/benchmarks weren’t sufficient for measuring real enterprise complexity. They invested in high-quality data pipelines and diverse sampling to evaluate rigorously, iterate quickly, and set a higher standard for extraction quality.
Expanding beyond PDFs: a unified ingestion layer across file types
Customer demand pulled Reducto into supporting more formats to avoid forcing teams to maintain multiple pipelines. They now handle spreadsheets, images, documents, and slides—while aiming to preserve the same accuracy bar.
Why Reducto is becoming core AI infrastructure for agents and enterprise apps
They argue that teams build ingestion in-house only because quality demands force them to—until a better external option exists. Reducto aims to be that foundational layer so teams can adopt new models faster and focus on higher-level reasoning and product improvements.
Hiring and culture: scrappy builders obsessed with data quality details
Following their Series A, the team discusses hiring across engineering and ML roles. They emphasize scrappiness and a deep care for detail—because data quality directly determines customer outcomes, and building SOTA requires hands-on, meticulous validation.

