Using LongMemEval to Improve Agent Memory

Sam Bhagwat, co-founder of Mastra and author of Principles of Building AI Agents, shares how they’ve been pushing the limits of agent memory. He explains the Long Mem Eval benchmark, breaks down why memory matters for reasoning across long conversations, and shows how simple changes—like tailored templates, targeted updates, and better data structures—led to state-of-the-art results. Chapters: 00:12 - Overview of the Long Mem Eval Benchmark 01:15 - Understanding Memory in AI Agents 01:59 - Information Extraction in Memory 02:30 - Multi-Session Reasoning 03:27 - Temporal Reasoning 04:10 - Knowledge Updates in Memory 05:13 - Handling Missing Information 05:46 - Types of Memory in Masra Agents 05:58 - Semantic Recall Explained 06:51 - Working Memory and Templates 07:12 - Initial Benchmark Results 07:43 - Improving Memory Implementation 11:08 - Configuration Matters 12:14 - Future Steps and Conclusion

Sam Bhagwathost

Aug 25, 202513mWatch on YouTube ↗

CHAPTERS

0:12 – 1:15
Why benchmark agent memory: Masra’s motivation and framework mindset
Sam Bhagwat introduces himself, Masra (a TypeScript agent framework), and the goal: use a rigorous benchmark loop to measure memory performance and iteratively improve it. He frames memory improvements as framework-level primitives that many teams can reuse rather than bespoke app logic.
1:15 – 1:59
LongMemEval and what “memory” means for agents
Sam defines agent memory as compressing and searching over long chat histories to retrieve the right information into the context window. He describes LongMemEval (released late last year) as a memory benchmark with ~500 questions across multiple memory skills.
1:59 – 2:30
Task 1 — Information extraction within a single session
The first LongMemEval subtask is straightforward retrieval: can the system accurately pull specific facts stated earlier in the same session. Success depends on precise extraction and correct inclusion in the prompt context.
2:30 – 3:27
Task 2 — Multi-session reasoning across long histories
Sam explains multi-session reasoning as retrieving relevant details spanning multiple sessions or long-running histories. The challenge is compressing and selecting across many interactions so the model can perform correct reasoning once the right items are surfaced.
3:27 – 4:10
Task 3 — Temporal reasoning over time-based events
Temporal reasoning evaluates whether memory retrieval preserves time relationships so the model can reason about sequences and recency. Sam highlights that retrieval must supply time-anchored information correctly for the LLM to answer questions about when events occurred.
4:10 – 5:13
Task 4 — Knowledge updates and overwriting stale attributes
This category checks whether memory properly updates user attributes as new information arrives, rather than clinging to outdated facts. Sam shares a story about ChatGPT confusing his identity due to incorrect attribute updates, illustrating the real-world stakes.
5:13 – 5:46
Task 5 — Handling missing information (knowing what you don’t know)
The final category tests whether the system can detect absence: when the needed information isn’t in history, the model should not hallucinate. Retrieval and prompting must make it clear that the relevant fact is missing so the LLM can respond appropriately.
5:46 – 5:58
Masra’s memory primitives: semantic recall vs. working memory
Sam outlines Masra’s two primary memory types used for this benchmark: semantic recall and working memory. He positions them as standard, practical capabilities that map well to LongMemEval’s requirements.
5:58 – 6:51
How semantic recall works: embeddings, vector DB, top‑K, and context range
Semantic recall embeds messages into a vector database (e.g., pgvector, Chroma) and retrieves the top‑K most similar messages. Masra also pulls a message window around hits (like grep context) to preserve surrounding context for better answers.
6:51 – 7:12
Working memory templates: why structure and attributes matter
Working memory uses templates that define which attributes to track (e.g., different needs for personal training vs. legal research). Sam emphasizes that the template choice strongly influences what gets stored and updated, impacting accuracy.
7:12 – 7:43
Baseline benchmark results and the gap to state of the art
Masra ran its existing memory implementation on LongMemEval and found performance below the best published results. Combined working memory + semantic recall reached ~67% versus ~72% state-of-the-art, motivating targeted improvements.
7:43 – 11:08
Iterating on implementation: tailored templates, targeted overwrites, and bug fixes
Sam describes a series of changes that improved results: generating question-specific working-memory templates, switching from full rewrites to targeted attribute overwrites, and fixing issues in how timestamps/dates were represented. These changes helped reach and then surpass state-of-the-art performance.
11:08 – 12:14
Configuration matters: top‑K sensitivity and per-category behavior
Results showed that retrieval configuration (especially top‑K) significantly affects overall accuracy—too small (e.g., K=2) performs poorly, while larger values (e.g., 5 to 20) improve outcomes. Category breakdowns reveal that some tasks (like single-session) are less sensitive to semantic recall changes, reinforcing the value of grouped eval reporting.
12:14 – 13:39
Next steps: reranking, episodic memory, and the eval-driven improvement loop
Sam closes with future enhancements—adding episodic memory and reranking—to push results further. He emphasizes the broader lesson: build an eval scaffold, iterate repeatedly, and use data to uncover surprising bugs and presentation tweaks that improve performance.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.

Add to Chrome

Why benchmark agent memory: Masra’s motivation and framework mindset

LongMemEval and what “memory” means for agents

Task 1 — Information extraction within a single session

Task 2 — Multi-session reasoning across long histories

Task 3 — Temporal reasoning over time-based events

Task 4 — Knowledge updates and overwriting stale attributes

Task 5 — Handling missing information (knowing what you don’t know)

Masra’s memory primitives: semantic recall vs. working memory

How semantic recall works: embeddings, vector DB, top‑K, and context range

Working memory templates: why structure and attributes matter

Baseline benchmark results and the gap to state of the art

Iterating on implementation: tailored templates, targeted overwrites, and bug fixes

Configuration matters: top‑K sensitivity and per-category behavior

Next steps: reranking, episodic memory, and the eval-driven improvement loop

Get more out of YouTube videos.