Making agentic workflows trustworthy and verifiable with a custom DSL

The system design of an agentic research assistant built unconventionally: one component outputs a plan in a custom Turing-incomplete programming language, another interprets it, and a quiver of models executes the concrete tasks. The architectural choices are presented as concrete instantiations of company values.

May 22, 2026 · 29m

CHAPTERS

  1. Why identical outputs aren’t equally trustworthy: mechanism matters

    James Brady opens by arguing that trust depends not just on the final answer but on the process used to produce it. He illustrates how two identical outputs can deserve different levels of confidence depending on model quality, tool use, and internal checks.

  2. Three requirements for Elicit’s research agent: legibility, fidelity, faithful execution

    He outlines the core desiderata that pushed Elicit toward a domain-specific language. The agent’s workflow must be inspectable, stable under iteration, and actually executed as written.

  3. Introducing ÆSHPL: a constrained, opinionated Python subset for research workflows

    James introduces Elicit’s DSL, ÆSHPL, designed to encode agentic workflows in a controlled way. It is intentionally limited to improve predictability and enable verification and caching.

  4. What ÆSHPL code looks like and what it represents

    He shows an example ÆSHPL program that resembles Python and encodes a competitive analysis workflow. The key idea: the plan is not just documentation—it’s executable.

  5. The core execution loop: write ÆSHPL → interpret → redraft → re-interpret

    He explains Elicit’s internal cycle: a model component writes the program, the system executes it, and the program is iteratively revised based on errors and results. This becomes the engine of progress inside a session.

  6. System architecture: UI + event log + Python service + sandboxed curator

    James maps the DSL workflow into a production system architecture. Event sourcing connects user actions to the evolving ÆSHPL program and its execution.

  7. Operational components: wrapper, model gateway, and credential isolation

    He details supporting infrastructure that makes the approach secure and flexible. These layers allow swapping model harnesses while protecting secrets from prompt injection or exfiltration.

  8. From code to execution: parsing, type-checking, AST interpretation, and caching

    He walks through the interpreter pipeline: parse and validate the program, then interpret it via an AST walker in Python. A content-addressed store enables memoization critical to performance and iteration.

  9. Why re-run the whole program each time: avoiding drift while staying fast

    Elicit reinterprets the full ÆSHPL program after each redraft, rather than patch-executing snippets. This design improves coherence and allows stronger guarantees, with memoization keeping it responsive.

  10. Demo setup: Elicit’s “research landscape” workflow and rigor-first positioning

    James transitions into a demo, positioning Elicit on the “rigor” end of the speed–quality spectrum. He uses a saved session that mapped organizations investing in foundation models for biology.

  11. Demo walkthrough: layered searches, enrichment, screening, and artifact generation

    He shows the stepwise analysis blocks: multiple web/paper searches, full-text fetching, and filtering. The output becomes structured “artifacts” (tables) with extracted attributes and provenance.

  12. Inspectability: view the ÆSHPL and a derived graphical workflow view

    He demonstrates that each artifact can be traced back to the exact ÆSHPL program that generated it. For usability, Elicit also provides a graph visualization derived directly from the same program.

  13. Extending the session: joins, oversight bodies, and long programs with caching

    He shows iterative expansion: adding comparisons (open vs. closed), commercialization and go-to-market (GTM) strategy, mapping oversight institutions, and finally joining datasets. The final program becomes much longer but remains efficient due to caching.

  14. When a DSL is worth it: engineering checklist and evaluation investment

    He closes with guidance: a DSL isn’t for everyone, but it fits when trust, robustness, and provenance matter. Most effort is not the DSL syntax itself but the surrounding system and rigorous evaluation.

  15. Closing thesis: same table, different object—mechanism changes trust

    James returns to the opening question: identical-looking outputs can carry different trust levels. Elicit’s differentiator is a visible, executable, repeatable process that users can inspect and endorse.
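
The pipeline described in chapters 8 and 9 (parse, validate, walk the AST, memoize in a content-addressed store, and re-run the whole program after each redraft) can be illustrated with a small Python sketch. This is not Elicit's implementation and it does not use real ÆSHPL syntax, which this summary does not show. The node whitelist, the `Interpreter` class, and the `search` and `screen` tools are all hypothetical names chosen for illustration, and the cache is a plain in-memory dict standing in for a real store.

```python
import ast
import hashlib
import json

# Hypothetical whitelist: the only syntax this toy subset permits.
# Loops, imports, function definitions, attribute access, etc. are rejected.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.Expr, ast.Call, ast.Name,
    ast.Constant, ast.Load, ast.Store, ast.keyword,
)


def validate(tree):
    """Raise ValueError if the program uses any construct outside the whitelist."""
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed construct: {type(node).__name__}")


class Interpreter:
    """AST walker that memoizes tool calls by a hash of their inputs."""

    def __init__(self, tools, cache=None):
        self.tools = tools                      # name -> Python callable
        self.cache = {} if cache is None else cache
        self.calls_executed = 0                 # counts cache misses only

    def run(self, source):
        tree = ast.parse(source)
        validate(tree)
        env = {}
        for stmt in tree.body:
            if isinstance(stmt, ast.Assign):
                target = stmt.targets[0]
                if len(stmt.targets) != 1 or not isinstance(target, ast.Name):
                    raise ValueError("only single-name assignment is supported")
                env[target.id] = self.eval(stmt.value, env)
            else:  # ast.Expr: a bare call evaluated for its effect
                self.eval(stmt.value, env)
        return env

    def eval(self, node, env):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in env:
                raise NameError(f"undefined variable: {node.id}")
            return env[node.id]
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in self.tools:
                raise ValueError("calls must target a registered tool")
            if any(k.arg is None for k in node.keywords):
                raise ValueError("**kwargs unpacking is not supported")
            args = [self.eval(a, env) for a in node.args]
            kwargs = {k.arg: self.eval(k.value, env) for k in node.keywords}
            # Content address: tool name plus the evaluated argument values.
            # Assumes every value is JSON-serializable.
            payload = json.dumps([node.func.id, args, kwargs], sort_keys=True)
            key = hashlib.sha256(payload.encode()).hexdigest()
            if key not in self.cache:
                self.calls_executed += 1
                self.cache[key] = self.tools[node.func.id](*args, **kwargs)
            return self.cache[key]
        raise ValueError(f"cannot evaluate {type(node).__name__}")


# Hypothetical tools standing in for model-backed steps.
def search(query):
    return [f"{query} result {i}" for i in range(3)]


def screen(items, keyword):
    return [item for item in items if keyword in item]


interp = Interpreter({"search": search, "screen": screen})

draft_1 = 'hits = search("protein models")\n'
draft_2 = draft_1 + 'kept = screen(hits, keyword="result 1")\n'

interp.run(draft_1)        # executes search
env = interp.run(draft_2)  # full re-run: search is a cache hit, only screen executes
print(env["kept"], interp.calls_executed)
```

Because each cache key covers the evaluated argument values, a redraft that changes an upstream step changes the keys of everything downstream of it, so stale results are not reused, while untouched steps are served from the cache. In this sketch that is what lets a full re-run stay cheap.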
