YC Root Access

Reducto: Making Human Data LLM-Ready With State-of-the-Art Accuracy

Reducto just raised $24.5M in Series A funding to help enterprises unlock unstructured data with near-perfect accuracy. AI teams today are bottlenecked by messy, real-world documents, so Reducto built the most accurate parsing pipeline in the industry. By combining vision-language models with agentic workflows, Reducto turns complex PDFs and scanned documents into structured, LLM-ready data.

Now trusted by companies like Scale AI, Vanta, and top AI teams, Reducto has parsed over 250 million pages and is expanding into full end-to-end pipelines: document splitting, classification, structured extraction, and more. With their new Agentic OCR framework, they’re pushing toward human-level accuracy, automating in seconds what used to take teams days.

YC Partner Diana Hu recently sat down with the Reducto founders to talk about how they got here, their founding story, and the kind of company they are building.

Learn more about Reducto at https://reducto.ai.
Apply to Y Combinator: https://ycombinator.com/apply

Chapters (Powered by ChapterMe)

00:00 - Data-driven AI for large enterprises
01:17 - Document management
03:04 - Simplify PDF processing for companies
03:59 - Aha moment for PDF extraction, interesting approach
05:02 - NLP-based PDF extraction for enterprise apps
06:56 - Great data, exciting use cases
08:10 - Best places for customer approaches
08:48 - Closing a Fortune 25 deal in just two months
11:21 - Data-driven AI for high-quality documents
13:19 - Reducto's AI-focused infrastructure attracts top companies
15:18 - Quality of data, results, support

Diana Hu, host
May 1, 2025 · 15 min

CHAPTERS

  1. Reducto’s mission: turning complex documents into LLM-ready structured data

    The founders explain that Reducto converts messy, high-stakes enterprise documents (claims, health records, financial statements) into clean structured outputs. The primary goal is to make downstream LLM use cases—like RAG and summarization—work reliably with real-world inputs.

    • Transforms complex documents into structured data
    • Targets LLM workflows: RAG, summarization, and more
    • Handles enterprise-grade document types (insurance, healthcare, finance)
    • Positions ingestion/structuring as the critical first step for AI apps
  2. Early enterprise adoption and why this became a bottleneck worth solving

    Reducto quickly found traction with sophisticated customers, ranging from startups the founders name to trillion-dollar enterprises they leave unnamed. The team frames document ingestion as a core bottleneck preventing AI applications from reaching production quality.

    • Adoption by teams like Vanta and large unnamed enterprises
    • Rapid growth within ~1 year post-YC
    • Document ingestion is a fundamental blocker for AI app quality
    • Customers want reliability across a wide variety of documents
  3. From long-term LLM memory to a surprise “marketing stunt” that became the product

    The company didn’t set out to build a document processing platform; the product emerged while the team was building a different LLM product. A quick-and-dirty segmentation model, shared publicly as an experiment, drew strong inbound interest and revealed unmet demand.

    • Original work: long-term memory for LLMs
    • User request: manage uploaded files alongside chat history
    • Off-the-shelf tools failed to meet needs
    • Weekend segmentation demo outperformed incumbents and drove demand
    • Teams asked for an API and paid product immediately
  4. The schlep-blindness insight: PDFs are hard, everyone suffers, nobody wants to own it

    They discuss how PDF/document extraction is a known pain point but avoided because it’s perceived as unglamorous. Reducto leaned into this “boring but essential” infrastructure gap—similar to the early opportunity Stripe exploited in payments.

    • Founders were surprised the problem wasn’t already solved
    • Community signal: many AI founders struggled with ingestion
    • “Boring” infrastructure work can create massive leverage
    • Analogy to Paul Graham’s “schlep blindness” and the Stripe-like opportunity
  5. A vision-first approach: reading documents like humans instead of rules and heuristics

    Reducto’s technical pivot was to treat document understanding as a computer vision problem rather than a pile of PDF-specific rules. Layout cues—spacing, indentation, hierarchy—become first-class signals for building robust, general parsing.

    • Shift from rule-based parsing to computer-vision-driven understanding
    • Uses human-like interpretation of layout semantics
    • Background in CV/ML research influenced the approach
    • Focus on generality: not just invoices, but the “long tail” of documents
  6. Customer impact: big jumps in downstream LLM accuracy and less post-processing work

    Customers often see large improvements in end-to-end LLM performance simply by switching ingestion providers. Reducto also reduces the need for manual cleanup steps like chunking and post-processing, letting teams focus on product logic and reasoning.

    • Swapping ingestion can materially boost LLM accuracy
    • Reported improvements can reach 30% or more on hard documents
    • Reduces customer burden: chunking and post-processing handled upstream
    • Reframes the problem: once data is clean, new applications become possible
  7. New capabilities unlocked: scanned docs, handwriting, checkboxes, and messy real-world artifacts

    They highlight cases where previously impossible workflows become feasible, including scanned documents without metadata and forms with handwriting. By combining deterministic CV and modern VLM/OCR strengths, Reducto handles edge cases like highlights, circled numbers, and instruction-guided extraction.

    • Enables features that previously weren’t feasible for customers
    • Handles scanned docs lacking metadata
    • Handwriting extraction improved via VLM-era techniques
    • Robust to real-world artifacts (checkboxes, highlighting, circling)
    • Supports plain-text instructions to guide extraction behavior
  8. How they closed a Fortune 25 enterprise deal rapidly: credibility through performance

    The founders recount an intense evaluation process that began during YC and culminated in a full enterprise contract. Their main competitor was the customer’s internal document processing team, making it a high bar—yet they won by proving speed and quality gains.

    • The enterprise relationship began with a successful Launch YC demo
    • Competitor was the company’s internal document team (strong benchmark)
    • Long, multi-stakeholder evaluation with deep technical scrutiny
    • Onsite grilling session with many stakeholders, including competitors
    • Outcome: fully signed deal and ongoing usage
  9. Reaching “SOTA” in document extraction: data engine, benchmarks, and iteration speed

    Reducto found that public datasets/benchmarks weren’t sufficient for measuring real enterprise complexity. They invested in high-quality data pipelines and diverse sampling to evaluate rigorously, iterate quickly, and set a higher standard for extraction quality.

    • Document AI lacked enough high-quality public datasets
    • Built in-house data pipelines and a “data engine”
    • Prioritized diversity of documents to cover edge cases
    • Vision-first perspective enabled novel methods over legacy heuristics
    • Benchmarking and iteration became a core competence
  10. Expanding beyond PDFs: a unified ingestion layer across file types

    Customer demand pulled Reducto into supporting more formats to avoid forcing teams to maintain multiple pipelines. They now handle spreadsheets, images, documents, and slides—while aiming to preserve the same accuracy bar.

    • Support expanded to spreadsheets, images, documents, slides
    • Customer pain: maintaining separate ingestion pipelines
    • Strategy: broaden endpoints while maintaining accuracy standards
    • Positions product as a unified ingestion layer for enterprises
  11. Why Reducto is becoming core AI infrastructure for agents and enterprise apps

    They argue that teams build ingestion in-house only because quality demands force them to—until a better external option exists. Reducto aims to be that foundational layer so teams can adopt new models faster and focus on higher-level reasoning and product improvements.

    • Core value: unblock product quality without building ingestion in-house
    • Helps teams move faster with new models and workflows
    • Applicable to startups, scale-ups, and enterprises
    • Acts as foundational infra for AI apps and agentic systems
  12. Hiring and culture: scrappy builders obsessed with data quality details

    Following their Series A, the team discusses hiring across engineering and ML roles. They emphasize scrappiness and a deep care for detail—because data quality directly determines customer outcomes, and building SOTA requires hands-on, meticulous validation.

    • Hiring across product engineering and ML roles
    • Seeks scrappy, startup-minded engineers and founders
    • Strong emphasis on detail-oriented craftsmanship
    • Example: researchers manually reviewing thousands of pages
    • Core mantra: data quality determines end results and customer success
