Making agentic workflows trustworthy and verifiable with a custom DSL

The system design of an agentic research assistant built unconventionally: one component outputs a plan in a custom, Turing-incomplete programming language, another interprets it, and a quiver of models executes the concrete tasks. The architectural choices are presented as concrete instantiations of company values.

May 22, 2026 · 29m

CHAPTERS

0:15 – 2:16
Why identical outputs can deserve different levels of trust
James Brady opens by arguing that the internal mechanism producing an answer can matter as much as the answer itself. He illustrates that two systems with the same output may warrant different trust depending on model quality, tool use, and verification steps.
- Trust depends on the process, not just the final output
- Static-analysis-style example: same verdict, different confidence depending on how it was generated
- An older model and a state-of-the-art model with tool use and critique produce different “objects”, even when the output matches
- Mechanism choice is a product/design decision, not a universal best practice
2:16 – 3:17
Design trade-offs: speed vs rigor and Elicit’s reliability goals
He frames workflow design as contextual, emphasizing trade-offs between fast responses and rigorous, defensible work. Elicit’s brand centers on reliability and provenance, which pushes the team toward more structured, auditable mechanisms.
- No single canonical internal workflow for all domains or users
- Speed-versus-rigor is a central product trade-off
- Elicit optimizes for reliability and high-quality provenance
- These priorities shape how agentic workflows should be built
3:17 – 5:49
Three desiderata for a trustworthy research agent
Elicit distilled their needs into three requirements: make the agent’s process legible, preserve fidelity as users iterate, and ensure the system follows the agreed process. These desiderata motivated adopting a DSL approach.
- Process must be legible to humans and other agents (spot-checkable)
- Iteration should retain fidelity and avoid drift during refinement
- The system must follow the validated process faithfully (no silent deviations)
- These requirements naturally point toward an executable DSL
5:49 – 6:50
Introducing ÆSHPL: a constrained, research-focused DSL
James introduces Elicit’s custom language, ÆSHPL, and explains its defining constraints and philosophy. By keeping the language Turing-incomplete and functional, Elicit improves predictability and memoization while embedding domain-specific primitives.
- ÆSHPL is Turing-incomplete: no loops, recursion, or mutation
- Purely functional and reactive for predictability
- Opinionated subset of Python (familiar syntax, constrained features)
- Adds research-native primitives (papers, clinical trials, retrieval)
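The talk does not show how ÆSHPL enforces its restrictions, so the following is only a minimal sketch of the general idea, assuming a deny-list check over Python's standard `ast` module. The function name `check_subset` and the specific rules are illustrative assumptions, not Elicit's implementation.

```python
import ast

# Hypothetical deny-list: node types a "no loops, no mutation" subset might reject.
FORBIDDEN = (
    ast.For, ast.AsyncFor, ast.While,   # loops
    ast.AugAssign, ast.Delete,          # in-place mutation
    ast.Global, ast.Nonlocal,           # rebinding outer state
)


def check_subset(source: str) -> list[str]:
    """Return a list of violations; an empty list means the program passes."""
    tree = ast.parse(source)
    problems: list[str] = []
    bound: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN):
            problems.append(f"line {node.lineno}: {type(node).__name__} is not allowed")
        elif isinstance(node, ast.FunctionDef):
            # Only direct self-calls are detected; mutual recursion is not.
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == node.name):
                    problems.append(f"line {inner.lineno}: direct recursion in {node.name}")
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if not isinstance(target, ast.Name):
                    problems.append(f"line {node.lineno}: only plain names may be assigned")
                elif target.id in bound:
                    # Conservative: names are tracked globally, ignoring scopes.
                    problems.append(f"line {node.lineno}: {target.id} is rebound")
                else:
                    bound.add(target.id)
    return problems


print(check_subset("x = 1\ny = x + 1"))  # → []
```

A real checker would more likely use an allow-list of permitted node types, which fails closed when the host language gains new syntax.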
6:50 – 7:51
What ÆSHPL looks like in practice (typed, Python-like workflows)
He shows an example program that resembles Python and uses typing to catch errors and enable fast redrafts. The snippet represents a concrete workflow (e.g., competitive/landscape analysis) with searches, joins, enrichment, and filtering steps.
- Python-like syntax makes it model- and developer-friendly
- Static types support quick feedback loops (type errors → redraft)
- Workflows encode multi-step research processes (search, join, enrich, screen)
- Focus is on explicit, executable plans rather than implicit prompting
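To make the "type errors → redraft" point concrete, here is a small sketch in plain Python (not ÆSHPL) of checking a workflow before running it. The step names (`search_papers`, `extract_orgs`, `screen_orgs`) and the type labels are hypothetical; the actual primitives and type system are not described in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    takes: str   # declared input type, as a label
    gives: str   # declared output type, as a label


# Hypothetical research primitives with declared types.
SEARCH = Step("search_papers", takes="Query", gives="list[Paper]")
ENRICH = Step("extract_orgs", takes="list[Paper]", gives="list[Org]")
SCREEN = Step("screen_orgs", takes="list[Org]", gives="list[Org]")


def typecheck(steps: list[Step]) -> list[str]:
    """Report every adjacent pair whose output and input types disagree."""
    errors = []
    for prev, nxt in zip(steps, steps[1:]):
        if prev.gives != nxt.takes:
            errors.append(
                f"{nxt.name} expects {nxt.takes} but {prev.name} produces {prev.gives}"
            )
    return errors


print(typecheck([SEARCH, ENRICH, SCREEN]))  # → []
print(typecheck([SEARCH, SCREEN]))
# → ['screen_orgs expects list[Org] but search_papers produces list[Paper]']
```

The check runs without executing any search or model call, which is what makes this kind of feedback cheap.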
7:51 – 8:51
The core loop: write ÆSHPL → interpret → redraft
Elicit’s session engine repeatedly generates ÆSHPL, runs it, and rewrites the program based on results or errors. This creates an iterative, program-driven form of agentic progress instead of ad-hoc conversational drift.
- Curator component writes ÆSHPL programs
- Interpreter executes ÆSHPL in Python
- Errors (e.g., type errors) are cheaply fed back for program correction
- System continuously extends/redrafts the program to make progress
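The write → interpret → redraft loop can be sketched as below. This is a toy under stated assumptions: `redraft_loop`, `ProgramError`, and the stand-in writer and interpreter are all hypothetical, and the "language" here is just integers joined by `+`, chosen so the example runs without a model.

```python
from typing import Callable, Optional


class ProgramError(Exception):
    """Raised by the interpreter when a program is rejected."""


def redraft_loop(
    write: Callable[[Optional[str]], str],
    interpret: Callable[[str], object],
    max_rounds: int = 3,
):
    """Ask the writer for a program, run it, and feed errors back until it succeeds."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        program = write(feedback)
        try:
            return program, interpret(program)
        except ProgramError as err:
            feedback = str(err)  # the error text becomes input to the next draft
    raise RuntimeError(f"no valid program after {max_rounds} rounds: {feedback}")


def toy_interpret(program: str) -> int:
    """Stand-in interpreter: sums integers separated by '+'."""
    total = 0
    for token in program.split("+"):
        token = token.strip()
        if not token.isdigit():
            raise ProgramError(f"expected an integer, got {token!r}")
        total += int(token)
    return total


def toy_writer(feedback: Optional[str]) -> str:
    """Stand-in for the curator: the first draft is wrong, the redraft is fixed."""
    return "1 + two" if feedback is None else "1 + 2"


print(redraft_loop(toy_writer, toy_interpret))  # → ('1 + 2', 3)
```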
8:51 – 10:24
System architecture: UI, event log, Python service, sandbox, and curator
James explains how ÆSHPL fits into Elicit’s overall system: a browser UI emits events into an append-only log, a Python service brokers and interprets programs, and a sandboxed curator writes and revises ÆSHPL. The architecture is designed for traceability and safe execution.
- UI interaction is captured as events (event-sourcing pattern)
- Append-only event log is the backbone for distributed state
- Sandbox writes ÆSHPL; Python service interprets it
- Back-and-forth between curator and interpreter implements the redraft loop
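As a rough illustration of the event-sourcing pattern mentioned above, the sketch below keeps an in-memory append-only log and derives current state by replaying it. The event kinds (`user_message`, `program_drafted`) and the class names are assumptions for the example; the notes do not describe Elicit's event schema or storage.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    seq: int                  # position in the log, assigned on append
    kind: str
    payload: dict[str, Any]


class EventLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, payload: dict[str, Any]) -> Event:
        event = Event(seq=len(self._events), kind=kind, payload=dict(payload))
        self._events.append(event)
        return event

    def replay(self) -> tuple[Event, ...]:
        return tuple(self._events)


def current_program(log: EventLog) -> str:
    """Derive state by folding over the log: the latest drafted program wins."""
    program = ""
    for event in log.replay():
        if event.kind == "program_drafted":
            program = event.payload["source"]
    return program


log = EventLog()
log.append("user_message", {"text": "map the landscape"})
log.append("program_drafted", {"source": "v1"})
log.append("program_drafted", {"source": "v2"})
print(current_program(log))  # → v2
```

Because state is always recomputed from the log, earlier drafts stay available for audit.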
10:24 – 11:58
Operational hardening: wrapper, model harness swapping, and credential isolation
He highlights infrastructure added around the curator to support different SDKs/harnesses and to keep secrets safe. A gateway centralizes LLM access so user input can’t coerce the system into leaking credentials or environment data.
- Wrapper layer allows swapping harnesses (Agent SDK, Py + Claude, experiments)
- Priority is using the best available models/harnesses over time
- Gateway centralizes LLM calls and protects API keys
- Security threat model includes prompt-injection attempts to exfiltrate secrets
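One way to picture the credential isolation described here: the sandbox gets a scrubbed environment, and only the gateway ever attaches the API key. The sketch below assumes that shape; `sandbox_env`, `Gateway`, the allow-listed variable names, and the injected `send` transport are all hypothetical and stand in for whatever Elicit actually runs.

```python
from typing import Callable, Mapping

# Hypothetical allow-list of variables the sandboxed curator may see.
ALLOWED_VARS = {"PATH", "LANG", "GATEWAY_URL"}


def sandbox_env(parent_env: Mapping[str, str]) -> dict[str, str]:
    """Build the sandbox environment from an allow-list, so secrets never enter it."""
    return {k: v for k, v in parent_env.items() if k in ALLOWED_VARS}


class Gateway:
    """Holds the provider key outside the sandbox and attaches it per request."""

    def __init__(self, api_key: str, send: Callable[[dict[str, str], str], str]) -> None:
        self._api_key = api_key
        self._send = send  # injected transport, so the sketch needs no network

    def complete(self, prompt: str) -> str:
        headers = {"Authorization": f"Bearer {self._api_key}"}
        return self._send(headers, prompt)


print(sandbox_env({"PATH": "/bin", "LLM_API_KEY": "secret-key"}))  # → {'PATH': '/bin'}
```

With this split, a prompt-injected instruction to print the environment can only reveal what the allow-list let through.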
11:58 – 14:30
Execution pipeline: parsing, typechecking, AST interpretation, and caching
James details the interpreter pipeline: parse and validate ÆSHPL, typecheck, build an AST-like structure, then interpret in Python. A content-addressed store enables aggressive caching/memoization so full-program re-execution is fast despite constant redrafting.
- Parser/validator catches syntax issues early
- Typechecking enables cheap corrective loops
- Interpreter walks the program tree (closures, special forms, primitives)
- Content-addressed store caches expression results for memoization
- Full program is re-run on each revision, relying on caching to stay fast
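The caching idea can be sketched with a toy expression tree: each expression is keyed by a hash of its own content, so re-running an extended program recomputes only the new nodes. The tuple encoding, the `Evaluator` class, and the `lit`/`add`/`mul` operations are assumptions for illustration, not the ÆSHPL interpreter. The approach relies on expressions being pure, which is why the no-mutation rule matters.

```python
import hashlib
import json


def digest(expr) -> str:
    """Content address: a hash of the expression's canonical JSON form."""
    return hashlib.sha256(json.dumps(expr).encode("utf-8")).hexdigest()


class Evaluator:
    def __init__(self) -> None:
        self.store: dict[str, object] = {}  # content address -> cached result
        self.computed = 0                   # how many nodes were actually evaluated

    def eval(self, expr):
        key = digest(expr)
        if key in self.store:
            return self.store[key]          # cache hit: skip the whole subtree
        op, *args = expr
        if op == "lit":
            value = args[0]
        elif op == "add":
            value = self.eval(args[0]) + self.eval(args[1])
        elif op == "mul":
            value = self.eval(args[0]) * self.eval(args[1])
        else:
            raise ValueError(f"unknown operation: {op}")
        self.computed += 1
        self.store[key] = value
        return value


ev = Evaluator()
v1 = ("add", ("lit", 1), ("lit", 2))
print(ev.eval(v1), ev.computed)   # → 3 3

v2 = ("mul", v1, ("lit", 10))     # a "redraft" that extends v1
print(ev.eval(v2), ev.computed)   # → 30 5
```

The second run evaluates only the two new nodes; the `v1` subtree is served from the store.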
14:30 – 15:30
Demo setup: Elicit’s rigor-first positioning and saved research landscape session
Transitioning to a demo, he notes Elicit differentiates on rigor rather than raw speed. He opens a saved session mapping organizations investing in biology foundation models, since the full workflow can take hours for deep research.
- Elicit sits toward the rigor end of the speed–rigor spectrum
- Product offers templates (tables, slides, reports) but demo focuses on a landscape workflow
- Saved session used because realistic runs can be lengthy
- Goal: show how ÆSHPL underpins trustworthy, inspectable research outputs
15:30 – 18:02
Demo walkthrough: iterative analysis steps driven by executable ÆSHPL
He shows how Elicit clarifies the user’s request, then runs successive stages of searches, retrieval, enrichment, and screening—each encoded in ÆSHPL. The emphasis is that the plan is not just displayed; it is executed, enabling faithful adherence to the workflow.
- System asks clarifying questions to scope the landscape
- Analysis blocks include paper/web search, full-text fetch, and filtering/screening
- Every stage is represented as executable ÆSHPL (not a non-binding plan)
- Deep, multi-stage workflows are visible and auditable during execution
18:02 – 21:07
Artifacts and provenance: tables generated with inspectable code and a derived graph view
The demo culminates in an “artifact” table of organizations and extracted attributes. He then inspects the exact ÆSHPL used to generate it and shows a more ergonomic graph visualization derived directly from the same program structure.
- Artifacts: structured outputs (tables) with extracted fields and sources
- Users can view the exact ÆSHPL program that produced an artifact
- Legibility supports spot checks and agent-based critiques (missed searches, scope gaps)
- Graph view is derived from ÆSHPL (not a decorative visualization)
- Visualization helps quickly assess whether the process seems skewed or incomplete
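A graph view can be derived mechanically from a program when each statement binds one name. The sketch below does this for Python-like source using the standard `ast` module; the function `dependency_edges` and the primitive names in the sample program (`search_papers`, `search_web`, `extract_orgs`, `screen`) are hypothetical, and the sample is never executed, only parsed.

```python
import ast


def dependency_edges(source: str) -> list[tuple[str, str]]:
    """Return (upstream, downstream) edges between named results in the program."""
    tree = ast.parse(source)
    defined: set[str] = set()
    edges: list[tuple[str, str]] = []
    for stmt in tree.body:
        is_simple_binding = (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        )
        if not is_simple_binding:
            continue
        target = stmt.targets[0].id
        used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        # Only earlier bindings count as dependencies; function names are ignored.
        for name in sorted(used & defined):
            edges.append((name, target))
        defined.add(target)
    return edges


program = '''
papers = search_papers("biology foundation models")
pages = search_web("biology foundation models")
orgs = extract_orgs(papers, pages)
table = screen(orgs)
'''
print(dependency_edges(program))
# → [('pages', 'orgs'), ('papers', 'orgs'), ('orgs', 'table')]
```

Since the edges come from the same source that is executed, the picture cannot drift from what actually ran.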
21:07 – 24:40
Long-running iterative builds: joining artifacts, reinterpreting whole programs, and avoiding drift
He demonstrates extending the session with new layers (open vs closed strategies, GTM, oversight bodies) and then joining tables through natural language requests. The resulting ÆSHPL grows to ~1000 lines, yet Elicit reinterprets the entire program each time—made feasible by caching and chosen to reduce drift and increase cohesion guarantees.
- User adds successive investigative layers without restarting the workflow
- Natural-language instruction triggers new ÆSHPL that joins prior artifacts
- Top of the program remains identical; new logic appends as the session evolves
- System reinterprets the whole program for each artifact to preserve coherence
- Caching/memoization makes full re-execution practical
- Full-program execution helps reduce iterative drift and supports stronger correctness guarantees
24:40 – 29:38
When a DSL is worth it: practical build checklist and closing thesis
In closing, James argues DSLs aren’t universally necessary, but are powerful when your product demands legibility, fidelity across iteration, and faithful execution. He lists key engineering investments beyond the DSL itself—interrupt handling, session rehydration, message plumbing, event sourcing, and especially evaluation—then returns to the core point: mechanism matters for trust.
- DSLs are hard but worthwhile when aligned with product desiderata
- Base DSL on a familiar language to leverage training data and reduce syntax learning
- Most effort is surrounding systems engineering, not the DSL core
- Needed components: harness swapping, interrupt handling, session rehydration, credential isolation, message handling
- Event sourcing is powerful but non-trivial
- Eval investment is crucial for program-writing/executing agents
- Final claim: trustworthy outputs require trustworthy mechanisms, not just impressive results
