At a glance
WHAT IT’S REALLY ABOUT
Benchmark-driven improvements to AI agent memory, measured with the LongMemEval benchmark
- LongMemEval evaluates agent memory across five subskills—information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and handling missing information.
- Mastra’s initial combined working-memory + semantic-recall approach scored below state-of-the-art, prompting targeted changes guided by benchmark feedback.
- Tailoring working-memory templates to each question and performing targeted attribute overwrites (instead of full rewrites) materially improved accuracy.
- Fixing timestamp/date handling and restructuring retrieved messages (timestamps plus grouping by day/session) improved temporal and session-based reasoning.
- Retrieval configuration—especially semantic-recall top‑K—significantly affected results, underscoring that “configuration matters” as much as algorithms.
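The semantic-recall top‑K point can be made concrete with a minimal sketch. This is not Mastra's actual API — the types and function below are invented for illustration — but it shows the mechanism the talk is tuning: rank stored messages by embedding similarity to the query and keep only the K best, so `topK` directly controls how much (and which) history reaches the context window.

```typescript
// Hypothetical sketch of semantic recall; names are illustrative,
// not Mastra's actual implementation.
type StoredMessage = { text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the topK stored messages most similar to the query embedding.
function semanticRecall(
  query: number[],
  store: StoredMessage[],
  topK: number,
): StoredMessage[] {
  return [...store]
    .sort((m1, m2) => cosine(query, m2.embedding) - cosine(query, m1.embedding))
    .slice(0, topK);
}
```

Raising `topK` surfaces more context but also more noise; the benchmark result suggests this knob is worth sweeping rather than defaulting.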
IDEAS WORTH REMEMBERING
Treat “memory” as retrieval + formatting, not just storage.
The talk frames memory as compressing and searching message history to place the right facts into the context window; presentation quality (structure, timestamps) directly impacts reasoning accuracy.
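A minimal sketch of the formatting half of that claim (invented helper, not the talk's actual code): the same retrieved messages reason very differently depending on whether they arrive as an unstructured blob or with explicit timestamps, grouped by day.

```typescript
// Illustrative formatter: sort recalled messages chronologically, stamp each
// with an ISO timestamp, and group them under per-day headings so the LLM
// can see ordering and session boundaries.
type Msg = { role: string; content: string; at: Date };

function formatByDay(msgs: Msg[]): string {
  const byDay = new Map<string, Msg[]>();
  for (const m of [...msgs].sort((a, b) => a.at.getTime() - b.at.getTime())) {
    const day = m.at.toISOString().slice(0, 10); // e.g. "2024-01-01"
    if (!byDay.has(day)) byDay.set(day, []);
    byDay.get(day)!.push(m);
  }
  return [...byDay.entries()]
    .map(([day, ms]) =>
      `## ${day}\n` +
      ms.map((m) => `[${m.at.toISOString()}] ${m.role}: ${m.content}`).join("\n"),
    )
    .join("\n\n");
}
```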
Benchmark categories map to distinct failure modes.
LongMemEval’s five categories help isolate whether you’re failing at simple extraction, cross-session aggregation, time ordering, updating facts, or correctly admitting missing information.
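For reference, the five subskills as a checklist constant — useful when tagging failures during an eval run. The names paraphrase this summary, not necessarily the benchmark's exact identifiers.

```typescript
// LongMemEval subskills, each isolating a distinct failure mode.
const LONGMEMEVAL_SUBSKILLS = [
  "information extraction",          // simple fact lookup
  "multi-session reasoning",         // cross-session aggregation
  "temporal reasoning",              // time ordering
  "knowledge updates",               // revising stale facts
  "handling missing information",    // admitting what isn't known
] as const;
```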
Working-memory templates are a controllable lever—and should be task-shaped.
Mastra improved working-memory accuracy by generating templates tailored to each question, reinforcing that the “fields you track” should match what the user is asking (or your app’s domain).
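One way to picture a task-shaped template (a hypothetical illustration — the question categories and field names below are invented, not Mastra's): choose which working-memory fields to track from the shape of the question itself.

```typescript
// Hypothetical template selector: temporal questions get date-oriented
// fields, preference questions get preference fields, everything else
// gets a generic fallback.
function templateFor(question: string): string[] {
  if (/when|date|last time/i.test(question)) {
    return ["event", "date", "relative_order"];
  }
  if (/prefer|favorite|like/i.test(question)) {
    return ["user_preferences", "recent_changes"];
  }
  return ["key_facts", "open_questions"];
}
```

In practice the talk describes having an LLM generate the template per question, but the principle is the same: the schema follows the task.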
Prefer targeted working-memory overwrites over full rewrites.
Having the LLM rewrite the entire working-memory state introduced errors; updating only the specific attribute that changed reduced drift and improved benchmark performance.
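The targeted-overwrite idea can be sketched in a few lines (illustrative names, assuming working memory is a flat key-value record): only the attribute that changed is replaced, so every other field is preserved verbatim and cannot drift the way a full LLM rewrite can.

```typescript
// Sketch of a targeted overwrite: merge a single-attribute update into
// working memory without regenerating the rest of the state.
type WorkingMemory = Record<string, string>;

function applyUpdate(
  memory: WorkingMemory,
  update: { attribute: string; value: string },
): WorkingMemory {
  // Untouched fields are copied as-is; only the named attribute changes.
  return { ...memory, [update.attribute]: update.value };
}
```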
Timestamp correctness is essential for temporal reasoning.
A bug that mismatched “benchmark date” vs “run date” caused wrong temporal inferences; fixing dates yielded measurable gains on time-based tasks.
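The class of bug described here is easy to reproduce in a sketch (function names invented for illustration): relative-time answers must be computed against the benchmark's reference date, not the wall-clock date the evaluation happens to run on.

```typescript
// Whole days between two dates (positive when `later` is after `earlier`).
function daysBetween(earlier: Date, later: Date): number {
  return Math.round((later.getTime() - earlier.getTime()) / 86_400_000);
}

// "How many days ago did X happen?" — anchored to an explicit reference
// date. The bug being described is equivalent to passing `new Date()`
// (the run date) as `referenceDate` instead of the benchmark's date.
function daysAgo(eventDate: Date, referenceDate: Date): number {
  return daysBetween(eventDate, referenceDate);
}
```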
WORDS WORTH SAVING
Memory is the compression of a queue of chat messages, right? It's the ability to search over those messages in a meaningful way, in response to user queries, in order to meaningfully respond and get the right things into the context window, so that the LLM is able to answer.
— Sam Bhagwat
When I ask ChatGPT who I am, it responds that I'm a five-year-old girl who loves Squishmallows, because I give it to my daughter and my son, and I guess it updated who I was with my daughter, and so it thinks I'm a five-year-old girl who loves Squishmallows.
— Sam Bhagwat
We realized that in our implementation of working memory, when we got a new piece of information, we were asking the LLM to rewrite the whole working memory, rather than just overwrite that one specific part of working memory.
— Sam Bhagwat
We all know time is hard, and so it turned out that we were putting in the wrong dates.
— Sam Bhagwat
I think the lesson, if we look at the full results, is kind of configuration matters.
— Sam Bhagwat
High-quality AI-generated summary created from a speaker-labeled transcript.