YC Root Access

Using LongMemEval to Improve Agent Memory

Sam Bhagwat, co-founder of Mastra and author of Principles of Building AI Agents, shares how they've been pushing the limits of agent memory. He explains the LongMemEval benchmark, breaks down why memory matters for reasoning across long conversations, and shows how simple changes (tailored templates, targeted updates, and better data structures) led to state-of-the-art results.

Chapters:
00:12 - Overview of the LongMemEval Benchmark
01:15 - Understanding Memory in AI Agents
01:59 - Information Extraction in Memory
02:30 - Multi-Session Reasoning
03:27 - Temporal Reasoning
04:10 - Knowledge Updates in Memory
05:13 - Handling Missing Information
05:46 - Types of Memory in Mastra Agents
05:58 - Semantic Recall Explained
06:51 - Working Memory and Templates
07:12 - Initial Benchmark Results
07:43 - Improving Memory Implementation
11:08 - Configuration Matters
12:14 - Future Steps and Conclusion

Sam Bhagwat, host
Aug 24, 2025 · 13m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Benchmark-driven improvements to AI agent memory using LongMemEval framework

  1. LongMemEval evaluates agent memory across five subskills—information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and handling missing information.
  2. Mastra’s initial combined working-memory + semantic-recall approach scored below state-of-the-art, prompting targeted changes guided by benchmark feedback.
  3. Tailoring working-memory templates to each question and performing targeted attribute overwrites (instead of full rewrites) materially improved accuracy.
  4. Fixing timestamp/date handling and restructuring retrieved messages (timestamps plus grouping by day/session) improved temporal and session-based reasoning.
  5. Retrieval configuration—especially semantic-recall top‑K—significantly affected results, underscoring that “configuration matters” as much as algorithms.
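The retrieval knobs called out above can be sketched as a simple context-budget calculation. `SemanticRecallConfig`, `topK`, and `messageRange` are illustrative names based on the parameters the talk mentions (semantic-recall top‑K and message-range context), not Mastra's actual API:

```typescript
// Hypothetical retrieval-configuration shape: topK controls how many
// semantically similar messages are recalled; messageRange pulls in
// neighboring messages around each hit for conversational context.
interface SemanticRecallConfig {
  topK: number;          // number of similar messages to retrieve
  messageRange: number;  // neighbors included on each side of a hit
}

function contextBudget(cfg: SemanticRecallConfig): number {
  // Worst case, each hit brings itself plus messageRange neighbors
  // on each side, so recalled messages grow multiplicatively.
  return cfg.topK * (1 + 2 * cfg.messageRange);
}
```

Seeing the budget grow this way makes the "configuration matters" point concrete: doubling top‑K with a wide message range can flood the context window as easily as it can fill in missing facts.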

IDEAS WORTH REMEMBERING

5 ideas

Treat “memory” as retrieval + formatting, not just storage.

The talk frames memory as compressing and searching message history to place the right facts into the context window; presentation quality (structure, timestamps) directly impacts reasoning accuracy.
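A minimal sketch of that framing, with naive keyword overlap standing in for vector similarity and all names illustrative: rank stored messages against the query, then render the top hits with timestamps so the model sees structured, time-anchored facts rather than raw transcript.

```typescript
// "Memory = retrieval + formatting": score history against a query,
// keep the best matches, and format them with timestamps for the LLM.
interface StoredMessage { ts: string; text: string }

// Toy similarity: count shared words (a real system would use embeddings).
function overlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/));
  return b.toLowerCase().split(/\s+/).filter(w => wa.has(w)).length;
}

function recall(history: StoredMessage[], query: string, topK: number): string {
  return history
    .map(m => ({ m, score: overlap(query, m.text) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ m }) => `[${m.ts}] ${m.text}`) // timestamped, structured output
    .join("\n");
}
```

The formatting step is the part the talk stresses: the same retrieved facts answer temporal questions far better once each line carries its date.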

Benchmark categories map to distinct failure modes.

LongMemEval’s five categories help isolate whether you’re failing at simple extraction, cross-session aggregation, time ordering, updating facts, or correctly admitting missing information.
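To use the categories as a diagnostic, it helps to break accuracy out per category rather than report one number. A hedged sketch (category names follow the talk; the result shape is hypothetical):

```typescript
// Per-category accuracy breakdown over eval results, so a weak
// category points at a specific failure mode to fix.
type Category =
  | "information-extraction"
  | "multi-session-reasoning"
  | "temporal-reasoning"
  | "knowledge-updates"
  | "abstention"; // handling missing information

interface EvalResult { category: Category; correct: boolean }

function accuracyByCategory(results: EvalResult[]): Map<Category, number> {
  const totals = new Map<Category, { ok: number; n: number }>();
  for (const r of results) {
    const t = totals.get(r.category) ?? { ok: 0, n: 0 };
    t.n += 1;
    if (r.correct) t.ok += 1;
    totals.set(r.category, t);
  }
  return new Map([...totals].map(([c, t]) => [c, t.ok / t.n]));
}
```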

Working-memory templates are a controllable lever—and should be task-shaped.

Mastra improved working-memory accuracy by generating templates tailored to each question, reinforcing that the “fields you track” should match what the user is asking (or your app’s domain).
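A question-shaped template might be sketched like this; the topic-to-fields mapping and field names are invented for illustration, not Mastra's:

```typescript
// Derive which working-memory fields to track from the kind of
// question being asked, instead of one generic user profile.
const TEMPLATE_FIELDS: Record<string, string[]> = {
  travel: ["destinations_visited", "upcoming_trips", "airline_preferences"],
  health: ["conditions", "medications", "appointments"],
  default: ["name", "location", "interests"],
};

function buildTemplate(topic: string): string {
  const fields = TEMPLATE_FIELDS[topic] ?? TEMPLATE_FIELDS["default"];
  // Render as blank markdown-style fields for the LLM to fill in.
  return fields.map(f => `- ${f}:`).join("\n");
}
```

The design choice is that the template, not the model, decides what is worth remembering; for a travel question the model never wastes slots on irrelevant attributes.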

Prefer targeted working-memory overwrites over full rewrites.

Having the LLM rewrite the entire working-memory state introduced errors; updating only the specific attribute that changed reduced drift and improved benchmark performance.
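The difference is easy to see in code. A targeted update patches one attribute and copies the rest untouched, whereas a full rewrite regenerates every field and can silently corrupt ones that didn't change (the shapes below are illustrative):

```typescript
// Targeted working-memory update: overwrite only the changed
// attribute; every other field is preserved byte-for-byte.
type WorkingMemory = Record<string, string>;

function applyUpdate(
  memory: WorkingMemory,
  update: { attribute: string; value: string },
): WorkingMemory {
  // Copy-and-patch: no LLM rewrite of unrelated state, so no drift.
  return { ...memory, [update.attribute]: update.value };
}
```

Constraining the model to emit `{ attribute, value }` pairs instead of the whole state is what removed the drift the talk describes.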

Timestamp correctness is essential for temporal reasoning.

A bug that mismatched “benchmark date” vs “run date” caused wrong temporal inferences; fixing dates yielded measurable gains on time-based tasks.
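Grouping retrieved messages by calendar day, as the summary describes, might look like the sketch below; it uses each message's own timestamp (the benchmark date) rather than the run date, which is the distinction the bug blurred:

```typescript
// Group recalled messages by calendar day so temporal questions
// ("what did I say last Tuesday?") see explicit date anchors.
interface Msg { ts: string; text: string } // ts is an ISO 8601 timestamp

function groupByDay(msgs: Msg[]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const m of msgs) {
    const day = m.ts.slice(0, 10); // "YYYY-MM-DD" prefix of the ISO string
    (out[day] ??= []).push(m.text);
  }
  return out;
}
```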

WORDS WORTH SAVING

5 quotes

Memory is the compression of a queue of chat messages. It's the ability to search over those messages in a meaningful way, in response to user queries, in order to get the right things into the context window so that the LLM is able to answer.

Sam Bhagwat

When I ask ChatGPT who I am, it responds that I'm a five-year-old girl who loves Squishmallows, because I give it to my daughter and my son. I guess it updated who I was with my daughter, and so it thinks I'm a five-year-old girl who loves Squishmallows.

Sam Bhagwat

We realized that in our implementation of working memory, when we got a new piece of information, we were asking the LLM to rewrite the whole working memory, rather than just overwrite that one specific part of working memory.

Sam Bhagwat

We all know time is hard, and it turned out that we were putting in the wrong dates.

Sam Bhagwat

I think the lesson, if we look at the full results, is that configuration matters.

Sam Bhagwat

- LongMemEval benchmark structure (500 questions, 5 categories)
- Definition of agent memory as compressed, searchable chat history
- Information extraction vs multi-session retrieval
- Temporal reasoning and timestamp correctness
- Knowledge updates and overwriting user attributes
- Missing-information detection (knowing what you don’t know)
- Mastra memory system: semantic recall (vector DB) + working memory (templates)
- Template design and question-specific templates
- Targeted working-memory updates vs full rewrites
- Semantic recall parameters: top‑K and message-range context
- Message formatting/data structures (flat list vs grouped)
- Config sensitivity and category-wise eval breakdown
- Potential next steps: reranking and episodic memory

High-quality AI-generated summary created from a speaker-labeled transcript.
