Making agentic workflows trustworthy and verifiable with a custom DSL

System design of an agentic research assistant built unconventionally: one component outputs a plan in a custom Turing-incomplete programming language, another interprets it, and a quiver of models executes the concrete tasks. The architectural choices are presented as concrete instantiations of company values.

May 21, 2026 · 29 min

WHAT IT’S REALLY ABOUT

Elicit’s DSL makes agent workflows legible, faithfully executed, and cacheable

The talk argues that identical outputs should not be equally trusted because the internal mechanism (model choice, tooling, critique, token spend) materially changes reliability and risk.
Elicit chose a DSL to satisfy three product goals: a legible process users/agents can spot-check, iteration that avoids drifting from intent, and faithful execution of the agreed workflow.
ÆSHPL is a typed, purely functional, reactive, Turing-incomplete subset of Python with domain primitives for scientific research tasks like searching papers, fetching full text, screening, and extraction.
The system repeatedly writes, type-checks, interprets, and rewrites the entire ÆSHPL program, using content-addressed memoization to keep it fast while maintaining coherence.
A demo shows Elicit exposing both the executable ÆSHPL and a graph derived from it so users can audit how research “artifacts” (tables) were produced and extend the workflow safely over time.
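
As a rough illustration of how a graph could be derived from such a program, the sketch below parses a small Python-subset plan and records which earlier bindings each step depends on. The primitive names (`search_papers`, `fetch_full_text`, `screen`, `extract`) and the plan itself are hypothetical, chosen only to echo the tasks named above; the talk summary does not show Elicit's actual syntax or graph format.

```python
import ast

# Hypothetical plan text. It is only parsed, never executed,
# so the primitives do not need to exist.
PLAN = '''
papers = search_papers("creatine and cognition")
texts = fetch_full_text(papers)
relevant = screen(texts, "randomized controlled trial")
table = extract(relevant, ["sample size", "effect size"])
'''

def dependency_graph(source: str) -> dict[str, list[str]]:
    """Map each assigned name to the earlier assigned names it reads."""
    graph: dict[str, list[str]] = {}
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            target = stmt.targets[0].id
            # Every bare name read on the right-hand side (includes function names).
            used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
            # Keep only names bound by earlier steps of the plan.
            graph[target] = sorted(used.intersection(graph))
    return graph

graph = dependency_graph(PLAN)
for step, deps in graph.items():
    print(step, "<-", deps)
```

Because the plan is plain source text with one binding per step, the graph falls out of the syntax tree without running anything, which is one reason a restricted language lends itself to auditing.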

IDEAS WORTH REMEMBERING

5 ideas

Trustworthiness depends on the process, not just the answer.

Brady’s core claim is that two identical outputs warrant different trust depending on the internal workflow—e.g., older model vs SOTA model with tools, critique, and redrafting.

A DSL can make an agent’s workflow auditable and enforceable.

By encoding the plan as executable ÆSHPL, Elicit can show exactly what steps were intended and ensure the system actually ran those steps rather than improvising.

Constraining the language reduces drift and improves reliability.

ÆSHPL is Turing-incomplete (no loops/recursion/mutation) and purely functional, which limits unpredictable behavior and supports consistent iteration without losing the original intent.
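
One way such restrictions could be enforced on a Python subset is a syntax-tree pass that rejects disallowed constructs before anything runs. The sketch below is an assumption about how that might look, not Elicit's rule set: it bans loops, bans user-defined functions (a blunt way to rule out recursion), bans in-place mutation statements, and requires each name to be bound only once.

```python
import ast

# Assumed rule set for illustration; the real ÆSHPL rules are not given in the source.
FORBIDDEN = (
    ast.For, ast.AsyncFor, ast.While,                    # loops
    ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda,   # no user functions, so no recursion
    ast.AugAssign, ast.Delete,                           # in-place mutation
    ast.Global, ast.Nonlocal,                            # rebinding outer state
)

def check_subset(source: str) -> list[str]:
    """Return a list of violations; an empty list means the program is accepted."""
    tree = ast.parse(source)
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN):
            errors.append(f"line {node.lineno}: {type(node).__name__} is not allowed")
    # Single assignment: each top-level name may be bound once.
    bound: set[str] = set()
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if not isinstance(target, ast.Name):
                    errors.append(f"line {stmt.lineno}: only plain names may be assigned")
                elif target.id in bound:
                    errors.append(f"line {stmt.lineno}: {target.id!r} is bound more than once")
                else:
                    bound.add(target.id)
    return errors

print(check_subset("papers = search_papers('x')\ntable = extract(papers)"))
print(check_subset("for p in papers:\n    pass"))
```

The first call prints an empty list and the second reports the `For` node. A program that passes a check like this is a straight-line sequence of bindings, which is what makes it easy to read and to re-run.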

Typing enables cheap, tight feedback loops for agent program repair.

The Python service parses and type-checks ÆSHPL; errors are sent back to the “curator” to redraft, making many failures fast to detect and fix before expensive execution.
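
The check-and-redraft cycle can be sketched as a small loop. Everything here is a stand-in: `SIGNATURES` is a hypothetical table of primitives, a simple arity check substitutes for real type checking, and `redraft` is a stub playing the role of the curator model.

```python
import ast
from typing import Callable

# Hypothetical primitives: name -> number of positional arguments.
SIGNATURES = {"search_papers": 1, "screen": 2, "extract": 2}

def check(source: str) -> list[str]:
    """Parse the draft and run a cheap static check (arity only, as a stand-in for types)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name not in SIGNATURES:
                errors.append(f"line {node.lineno}: unknown primitive {name!r}")
            elif len(node.args) != SIGNATURES[name]:
                errors.append(
                    f"line {node.lineno}: {name} expects {SIGNATURES[name]} "
                    f"argument(s), got {len(node.args)}"
                )
    return errors

def repair_loop(draft: str, redraft: Callable[[str, list[str]], str], max_rounds: int = 3):
    """Send errors back to the drafter until the program checks clean or rounds run out."""
    errors = check(draft)
    rounds = 0
    while errors and rounds < max_rounds:
        draft = redraft(draft, errors)
        rounds += 1
        errors = check(draft)
    if errors:
        raise ValueError("program still fails checks: " + "; ".join(errors))
    return draft, rounds

# Stub curator that fixes the one mistake in the first draft.
def stub_redraft(source: str, errors: list[str]) -> str:
    return source.replace("screen(papers)", "screen(papers, 'rct')")

first_draft = "papers = search_papers('sleep')\nhits = screen(papers)"
program, rounds = repair_loop(first_draft, stub_redraft)
print(rounds, program)
```

The point of the design is that every failure caught here costs only a parse, not a search or a model call.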

Re-running the whole program each iteration can be practical with memoization.

Elicit rewrites and re-interprets the entire program each cycle to preserve coherence, but uses a content-addressed store to cache evaluated expressions so most work is reused.
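
A minimal sketch of content-addressed memoization, under the assumption that expressions are pure: each expression is keyed by a hash of its operation and its arguments' keys, so re-running an extended program reuses every sub-result that has not changed. The nested-tuple expression format, the `ops` table, and the in-memory `store` are illustrative inventions, not Elicit's implementation.

```python
import hashlib

def content_key(expr) -> str:
    """Hash an expression by its content: a literal, or ("op", *argument_exprs)."""
    if isinstance(expr, tuple):
        op, *args = expr
        payload = "call:" + op + "(" + ",".join(content_key(a) for a in args) + ")"
    else:
        payload = "lit:" + repr(expr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def evaluate(expr, ops, store, log):
    """Evaluate expr, reusing any sub-result already in the store."""
    if not isinstance(expr, tuple):
        return expr
    key = content_key(expr)
    if key in store:
        return store[key]          # cache hit: no work is redone
    op, *args = expr
    values = [evaluate(a, ops, store, log) for a in args]
    log.append(op)                 # record operations that actually ran
    store[key] = ops[op](*values)
    return store[key]

# Toy stand-ins for expensive research primitives.
ops = {
    "search": lambda q: [f"{q}-paper-{i}" for i in range(3)],
    "screen": lambda papers, kw: [p for p in papers if kw in p],
    "count": len,
}

store, log = {}, []
v1 = ("screen", ("search", "sleep"), "paper-1")
first = evaluate(v1, ops, store, log)     # runs search, then screen
v2 = ("count", v1)                        # the whole program, rewritten with one new step
second = evaluate(v2, ops, store, log)    # only count runs; the rest is served from the store
print(first, second, log)
```

This only works because the language is purely functional: if an expression could mutate state or depend on hidden inputs, identical content would no longer guarantee an identical result.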

WORDS WORTH SAVING

5 quotes

So let's say that two systems produce identical output. Do you trust them equally? And the answer is, of course, well, it depends. It depends on what went on inside of those systems to produce that output.

— James Brady

I would say that the, the, the mechanism, the how of how a, a how an answer is produced is as important and important in a different way compared to just the, the final output itself.

— James Brady

So firstly, the research agent's process must be legible. It needs to be legible to the user, and also, by the way, it needs to be legible to other agents.

— James Brady

AshPL is not just a representation of a plan, it is literally the plan which is executable, you know.

— James Brady

My pitch is that you should care a lot about the mechanism, um, because the mechanism, the mechanism matters.

— James Brady

Mechanism vs output trust
Three desiderata: legibility, fidelity under iteration, faithful execution
ÆSHPL design: Python subset, typed, pure, no loops/recursion
Domain primitives for research workflows
Architecture: UI + event log + Python service + sandboxed curator
Type checking, AST interpretation, and sandboxing
Content-addressed caching/memoization for full-program reruns
Operational needs: interrupts, session rehydration, credential isolation, evals

High quality AI-generated summary created from speaker-labeled transcript.
