YC Root Access

Using LongMemEval to Improve Agent Memory

Sam Bhagwat, co-founder of Mastra and author of Principles of Building AI Agents, shares how his team has been pushing the limits of agent memory. He explains the LongMemEval benchmark, breaks down why memory matters for reasoning across long conversations, and shows how simple changes (tailored templates, targeted updates, and better data structures) led to state-of-the-art results.

Chapters:
00:12 - Overview of the LongMemEval Benchmark
01:15 - Understanding Memory in AI Agents
01:59 - Information Extraction in Memory
02:30 - Multi-Session Reasoning
03:27 - Temporal Reasoning
04:10 - Knowledge Updates in Memory
05:13 - Handling Missing Information
05:46 - Types of Memory in Mastra Agents
05:58 - Semantic Recall Explained
06:51 - Working Memory and Templates
07:12 - Initial Benchmark Results
07:43 - Improving Memory Implementation
11:08 - Configuration Matters
12:14 - Future Steps and Conclusion

Sam Bhagwat (host)
Aug 25, 2025 · 13m

CHAPTERS

  1. 0:12 – 1:15

    Why benchmark agent memory: Mastra’s motivation and framework mindset

    Sam Bhagwat introduces himself, Mastra (a TypeScript agent framework), and the goal: use a rigorous benchmark loop to measure memory performance and iteratively improve it. He frames memory improvements as framework-level primitives that many teams can reuse rather than bespoke app logic.

    • Mastra aims to encode opinionated, reusable agent infrastructure
    • Talk focus: optimizing agent memory using a benchmark-driven loop
    • Goal is to ship improvements into the framework’s memory layer
    • Benchmarking is positioned as the key to systematic iteration
  2. 1:15 – 1:59

    LongMemEval and what “memory” means for agents

    Sam defines agent memory as compressing and searching over long chat histories to retrieve the right information into the context window. He describes LongMemEval (released in late 2024, the year before this talk) as a memory benchmark with ~500 questions across multiple memory skills.

    • Memory = compression + meaningful search over a queue of messages
    • Retrieval’s purpose is getting the right facts into the context window
    • LongMemEval benchmarks agent memory across many scenarios
    • Dataset size: roughly 500 categorized questions
  3. 1:59 – 2:30

    Task 1 — Information extraction within a single session

    The first LongMemEval subtask is straightforward retrieval: can the system accurately pull specific facts stated earlier in the same session. Success depends on precise extraction and correct inclusion in the prompt context.

    • Single-session lookup of what the user/assistant said
    • Accuracy depends on retrieving the exact relevant snippet
    • Represents the simplest memory competency in the benchmark
    • Failure usually means the needed fact never reaches the LLM context
  4. 2:30 – 3:27

    Task 2 — Multi-session reasoning across long histories

    Sam explains multi-session reasoning as retrieving relevant details spanning multiple sessions or long-running histories. The challenge is compressing and selecting across many interactions so the model can perform correct reasoning once the right items are surfaced.

    • Long-running chat histories require useful compression
    • Need to retrieve facts that may be spread across sessions
    • Correctness hinges on assembling the right cross-session evidence
    • If retrieval misses, reasoning degrades even if the LLM is capable
  5. 3:27 – 4:10

    Task 3 — Temporal reasoning over time-based events

    Temporal reasoning evaluates whether memory retrieval preserves time relationships so the model can reason about sequences and recency. Sam highlights that retrieval must supply time-anchored information correctly for the LLM to answer questions about when events occurred.

    • Questions depend on ordering/recency of events
    • Retrieval must include the right time-linked messages
    • Temporal cues are necessary for correct downstream reasoning
    • Time handling becomes a common source of errors
  6. 4:10 – 5:13

    Task 4 — Knowledge updates and overwriting stale attributes

    This category checks whether memory properly updates user attributes as new information arrives, rather than clinging to outdated facts. Sam shares a story about ChatGPT confusing his identity due to incorrect attribute updates, illustrating the real-world stakes.

    • Memory should overwrite/update user attributes when new facts appear
    • Stale memories can cause incorrect persona/attribute assumptions
    • Order and recency of facts must be reflected in retrieval
    • Benchmark includes scenarios requiring correct “latest state” behavior
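
One way to get this "latest state" behavior is a keyed store in which each new fact overwrites only its own attribute. The sketch below is a hedged illustration, not Mastra's API: `AttributeUpdate`, `WorkingMemory`, and `applyUpdate` are hypothetical names, and the sample facts are invented.

```typescript
// Hypothetical sketch, not Mastra's API. Attributes are keyed, and each
// update touches only one key, so unrelated facts are never rewritten.
interface AttributeUpdate {
  key: string;        // attribute name, e.g. "city"
  value: string;      // the newly stated fact
  observedAt: string; // ISO 8601 timestamp of the message that stated it
}

type WorkingMemory = Map<string, AttributeUpdate>;

function applyUpdate(memory: WorkingMemory, update: AttributeUpdate): void {
  const existing = memory.get(update.key);
  // Ignore the update if the stored value was observed more recently.
  if (existing && Date.parse(existing.observedAt) > Date.parse(update.observedAt)) {
    return;
  }
  memory.set(update.key, update);
}

// Invented sample facts: the user moves from Boston to Denver.
const memory: WorkingMemory = new Map();
applyUpdate(memory, { key: "city", value: "Boston", observedAt: "2024-01-10T09:00:00Z" });
applyUpdate(memory, { key: "job", value: "nurse", observedAt: "2024-01-10T09:05:00Z" });
applyUpdate(memory, { key: "city", value: "Denver", observedAt: "2024-03-02T18:30:00Z" });
```

After these calls the store holds "Denver" for `city` and still holds "nurse" for `job`. Comparing `observedAt` values means an older fact that arrives late cannot clobber a newer one.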
  7. 5:13 – 5:46

    Task 5 — Handling missing information (knowing what you don’t know)

    The final category tests whether the system can detect absence: when the needed information isn’t in history, the model should not hallucinate. Retrieval and prompting must make it clear that the relevant fact is missing so the LLM can respond appropriately.

    • System must recognize when required info isn’t present
    • Benchmark includes queries where no correct item exists in memory
    • Goal is to avoid confident fabrication
    • Requires retrieval/prompting that supports “insufficient info” responses
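
A minimal sketch of prompting that supports an "insufficient info" response: when no retrieved item clears a relevance threshold, the context says so explicitly instead of being left empty. The `Hit` type, the `buildContext` function, the threshold, and the wording of the note are all assumptions for illustration; the talk does not describe this mechanism.

```typescript
// Hypothetical sketch: make absence explicit in the context handed to the LLM.
interface Hit {
  text: string;
  score: number; // similarity score, assumed to be in [0, 1]
}

function buildContext(hits: Hit[], minScore: number): string {
  const relevant = hits.filter((h) => h.score >= minScore);
  if (relevant.length === 0) {
    // Tell the model plainly that memory has nothing relevant.
    return "No relevant messages were found in memory. If the answer is not in the conversation, say that you do not know.";
  }
  return relevant.map((h) => `- ${h.text}`).join("\n");
}
```

The threshold trades off two failure modes: set too high, real evidence is dropped; set too low, weak matches reach the model and can invite fabrication.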
  8. 5:46 – 5:58

    Mastra’s memory primitives: semantic recall vs. working memory

    Sam outlines Mastra’s two primary memory types used for this benchmark: semantic recall and working memory. He positions them as standard, practical capabilities that map well to LongMemEval’s requirements.

    • Two main memory types: semantic recall and working memory
    • These are the most relevant Mastra features for LongMemEval
    • Both can be combined to answer benchmark questions
    • Design aims to stay domain-agnostic as a general framework
  9. 5:58 – 6:51

    How semantic recall works: embeddings, vector DB, top‑K, and context range

    Semantic recall embeds messages into a vector database (e.g., pgvector, Chroma) and retrieves the top‑K most similar messages. Mastra also pulls a message window around hits (like grep context) to preserve surrounding context for better answers.

    • Messages are embedded and stored in a vector database
    • Retrieval uses a top‑K similarity search
    • A message range/window adds surrounding context around matches
    • Parameters like top‑K and window size directly affect quality
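
The retrieval steps above can be sketched in memory, without a real vector database. This is a hedged illustration, not Mastra's implementation: `StoredMessage`, `cosineSimilarity`, and `recallIndices` are hypothetical names, and the two-dimensional embeddings are toy values standing in for the output of an embedding model.

```typescript
// Hypothetical sketch of top-K recall with a context window; not Mastra's API.
interface StoredMessage {
  text: string;
  embedding: number[]; // assumed to come from some embedding model
}

// Cosine similarity; both vectors are assumed to have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((ai, i) => {
    const bi = b[i] ?? 0;
    dot += ai * bi;
    normA += ai * ai;
    normB += bi * bi;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns the indices of the topK most similar messages plus `range`
// neighbors on each side, deduplicated and in chronological order.
function recallIndices(
  history: StoredMessage[],
  query: number[],
  topK: number,
  range: number
): number[] {
  const ranked = history
    .map((m, index) => ({ index, score: cosineSimilarity(m.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);

  const selected = new Set<number>();
  ranked.forEach(({ index }) => {
    const start = Math.max(0, index - range);
    const end = Math.min(history.length - 1, index + range);
    for (let i = start; i <= end; i++) selected.add(i);
  });
  return Array.from(selected).sort((x, y) => x - y);
}

// Toy history with made-up embeddings.
const history: StoredMessage[] = [
  { text: "Hi there", embedding: [0, 1] },
  { text: "I adopted a dog named Rex", embedding: [1, 0] },
  { text: "He is a beagle", embedding: [0.1, 1] },
  { text: "What is the weather today?", embedding: [0, 1] },
  { text: "Thanks", embedding: [0, 1] },
];
const windowed = recallIndices(history, [1, 0], 1, 1);
```

Here `windowed` is `[0, 1, 2]`: the single best match (index 1) plus one neighbor on each side. The neighbor at index 2 ("He is a beagle") would not be retrieved by similarity alone at K=1, which is the case the window is meant to cover.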
  10. 6:51 – 7:12

    Working memory templates: why structure and attributes matter

    Working memory uses templates that define which attributes to track (e.g., different needs for personal training vs. legal research). Sam emphasizes that the template choice strongly influences what gets stored and updated, impacting accuracy.

    • Templates specify which user/agent attributes to extract/store
    • Different applications need different attribute schemas
    • Template quality affects working memory usefulness
    • Structure is a controllable lever for better memory performance
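
To make the idea concrete, here is a sketch of two templates for the two applications mentioned above. The Markdown-like layout, the attribute names, and the `templateFields` helper are assumptions for illustration, not a format confirmed by the talk.

```typescript
// Hypothetical templates: layout and attribute names are invented.
const personalTrainingTemplate = [
  "# Client Profile",
  "- Fitness goal:",
  "- Injuries:",
  "- Weekly schedule:",
].join("\n");

const legalResearchTemplate = [
  "# Matter Profile",
  "- Jurisdiction:",
  "- Case type:",
  "- Key deadlines:",
].join("\n");

// Extracts the attribute names a template asks the agent to track.
function templateFields(template: string): string[] {
  return template
    .split("\n")
    .filter((line) => line.startsWith("- ") && line.endsWith(":"))
    .map((line) => line.slice(2, -1));
}

const trainingFields = templateFields(personalTrainingTemplate);
const legalFields = templateFields(legalResearchTemplate);
```

The two field lists share nothing, which is the point: the template decides what the agent bothers to remember, so a mismatch between template and task shows up directly as missing facts.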
  11. 7:12 – 7:43

    Baseline benchmark results and the gap to state of the art

    Mastra ran its existing memory implementation on LongMemEval and found performance below the best published results. Combined working memory + semantic recall reached ~67% versus ~72% state-of-the-art, motivating targeted improvements.

    • Initial combined approach scored ~67% accuracy
    • State-of-the-art results were ~72%
    • Benchmark exposed concrete headroom for improvement
    • Prompted a focused effort to upgrade both memory components
  12. 7:43 – 11:08

    Iterating on implementation: tailored templates, targeted overwrites, and bug fixes

    Sam describes a series of changes that improved results: generating question-specific working-memory templates, switching from full rewrites to targeted attribute overwrites, and fixing issues in how timestamps/dates were represented. These changes helped reach and then surpass state-of-the-art performance.

    • Generated tailored templates per question to improve working memory
    • Replaced full working-memory rewrites with targeted updates to reduce errors
    • Fixed date/timestamp mismatches that harmed temporal reasoning
    • Improved formatting/structure (timestamps + grouping by day/hour) to aid reasoning
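
The formatting change in the last bullet can be sketched as follows: normalize every timestamp to one representation, sort, and group messages under a day heading with a time prefix. The `TimedMessage` type, the `formatByDay` function, the `## YYYY-MM-DD` heading, and the choice of UTC are assumptions; the talk states only that timestamps and grouping by day/hour helped.

```typescript
// Hypothetical sketch of presenting retrieved messages grouped by day.
interface TimedMessage {
  role: "user" | "assistant";
  text: string;
  timestamp: string; // ISO 8601 string
}

function formatByDay(messages: TimedMessage[]): string {
  // Copy before sorting so the caller's array is not mutated.
  const sorted = messages
    .slice()
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));

  const lines: string[] = [];
  let currentDay = "";
  sorted.forEach((m) => {
    const iso = new Date(m.timestamp).toISOString(); // normalized to UTC
    const day = iso.slice(0, 10);   // YYYY-MM-DD
    const time = iso.slice(11, 16); // HH:MM
    if (day !== currentDay) {
      lines.push(`## ${day}`);
      currentDay = day;
    }
    lines.push(`[${time}] ${m.role}: ${m.text}`);
  });
  return lines.join("\n");
}

// Invented sample messages, deliberately out of order.
const formatted = formatByDay([
  { role: "user", text: "I ran 5 km.", timestamp: "2024-05-02T08:15:00Z" },
  { role: "user", text: "I signed up for a race.", timestamp: "2024-05-01T19:40:00Z" },
  { role: "assistant", text: "Nice work!", timestamp: "2024-05-02T08:16:00Z" },
]);
```

Normalizing through `toISOString()` keeps every message in the same time zone and format, which guards against the kind of date mismatch described above.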
  13. 11:08 – 12:14

    Configuration matters: top‑K sensitivity and per-category behavior

    Results showed that retrieval configuration (especially top‑K) significantly affects overall accuracy—too small (e.g., K=2) performs poorly, while larger values (e.g., 5 to 20) improve outcomes. Category breakdowns reveal that some tasks (like single-session) are less sensitive to semantic recall changes, reinforcing the value of grouped eval reporting.

    • Top‑K set too low sharply degrades results
    • Increasing top‑K (e.g., 5 → 20) can materially improve accuracy
    • Not all categories move together; single-session tasks may be less affected
    • Category-level reporting helps diagnose what changes actually helped
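
Category-level reporting can be sketched as a small aggregation over per-question results. `EvalResult` and `accuracyByCategory` are hypothetical names, and the sample rows are toy values, not benchmark numbers.

```typescript
// Hypothetical sketch of grouped eval reporting.
interface EvalResult {
  category: string; // e.g. "temporal-reasoning"
  correct: boolean;
}

function accuracyByCategory(results: EvalResult[]): Map<string, number> {
  const totals = new Map<string, { correct: number; total: number }>();
  results.forEach((r) => {
    const entry = totals.get(r.category) ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (r.correct) entry.correct += 1;
    totals.set(r.category, entry);
  });

  const accuracy = new Map<string, number>();
  totals.forEach((t, category) => accuracy.set(category, t.correct / t.total));
  return accuracy;
}

// Toy results for one configuration (made up, not from LongMemEval).
const report = accuracyByCategory([
  { category: "single-session", correct: true },
  { category: "single-session", correct: true },
  { category: "temporal-reasoning", correct: true },
  { category: "temporal-reasoning", correct: false },
  { category: "temporal-reasoning", correct: false },
  { category: "temporal-reasoning", correct: false },
]);
```

Running this once per top‑K setting and comparing the maps side by side shows which categories a configuration change actually moved, which a single overall accuracy number would hide.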
  14. 12:14 – 13:39

    Next steps: reranking, episodic memory, and the eval-driven improvement loop

    Sam closes with future enhancements—adding episodic memory and reranking—to push results further. He emphasizes the broader lesson: build an eval scaffold, iterate repeatedly, and use data to uncover surprising bugs and presentation tweaks that improve performance.

    • Planned additions: episodic memory and retrieval reranking
    • Reranking is noted as used by others but not yet in Mastra
    • Core method: write evals, iterate relentlessly, measure every change
    • Data-driven debugging reveals weird edge cases and representation issues
