10 min read · 2,465 words
- 0:00 – 0:12
Intro
- Sam Bhagwat
I'm Sam, the co-founder and CEO of Mastra, a TypeScript agent framework. I'm also the author of the book that was on all the seats, Principles of Building AI Agents.
- 0:12 – 1:15
Overview of the Long Mem Eval Benchmark
- Sam Bhagwat
Today I'm going to talk about a specific benchmark that we spent some time optimizing our agent memory around, called the LongMemEval benchmark. I'm mostly going to frame this as the loop we've been talking about: understanding your system's performance, then optimizing that performance. The work we did here got built into Mastra's memory layer. Before starting Mastra, I was a Gatsby co-founder, and most of the team is ex-Gatsby folks. The great thing about frameworks is that they encode a lot of opinionated decisions, so you don't have to reinvent the wheel, and hopefully a lot of teams will get to use the work we did here. So, what is LongMemEval? Let's start off
- 1:15 – 1:59
Understanding Memory in AI Agents
- Sam Bhagwat
with this. We'll start with memory itself, because I want to start with a definition. Memory is the compression of a queue of chat messages: the ability to search over those messages in a meaningful way in response to user queries, in order to get the right things into the context window so the LLM is able to answer. LongMemEval came out late last year. It's a benchmark for agent memory: five hundred questions, categorized in various ways we're about to talk through. It turns out that, if you think about
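As a mental model, that definition can be sketched in a few lines of TypeScript. The type names here are illustrative, not Mastra's actual API, and the keyword match is a stand-in for real semantic search:

```typescript
// Sketch of "memory = compression of a queue of chat messages".
// Names are illustrative, not Mastra's real API.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
  timestamp: Date;
}

class NaiveMemory {
  // The full queue of everything said so far.
  history: ChatMessage[] = [];

  add(msg: ChatMessage): void {
    this.history.push(msg);
  }

  // Given a query, return the small subset worth putting in the
  // context window. A keyword filter stands in for semantic search.
  recall(query: string, limit: number): ChatMessage[] {
    return this.history
      .filter((m) => m.content.toLowerCase().includes(query.toLowerCase()))
      .slice(0, limit);
  }
}
```

The point is only the shape of the problem: a growing queue on one side, a bounded context window on the other, and recall as the compression step between them.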
- 1:59 – 2:30
Information Extraction in Memory
- Sam Bhagwat
how memory works, it actually breaks down into a few different subtasks. The first is information extraction, which is the simplest task you can have: within a single session, can I see what the user said, or what I as the assistant said? Can I accurately grab that information and put it into context?
- 2:30 – 3:27
Multi-Session Reasoning
- Sam Bhagwat
Okay, cool. The second subcategory of memory, of this compression, is multi-session reasoning. We all have these long-running chat histories, and we need to make useful compressions of them. Can I extract information across multiple sessions in response to a query, and pull it all into the context window so the LLM can do correct reasoning over it? In this case, we might have to pull information about musical instruments. If you pull it correctly, you can answer the question correctly; if not, you're in trouble.
- 3:27 – 4:10
Temporal Reasoning
- Sam Bhagwat
The third one is temporal reasoning. The user is giving the LLM information about time-based events in the chat history. Can you pull the right information into the context window as you do your retrieval, so the LLM can reason correctly about it? In this case, you want to be able to say, "I went on this behind-the-scenes tour today, but my last museum visit was five months ago."
- 4:10 – 5:13
Knowledge Updates in Memory
- Sam Bhagwat
Knowledge updates are interesting. I'll share a personal story on this one first. When I ask ChatGPT who I am, it responds that I'm a five-year-old girl who loves Squishmallows, because I hand it to my daughter and my son, and I guess it updated who I was based on my daughter. But realistically, memory has to handle updates. Working memory stores user attributes, and those attributes may get overwritten as new information arrives. So you need to fetch the right information in the right order from that queue of messages and put it into the context window so the LLM can reason appropriately, in this case about a vacation. And then, lastly, the last subtask:
- 5:13 – 5:46
Handling Missing Information
- Sam Bhagwat
you have to be able to pull information in such a way that the LLM knows when it does not have the information you need. This example is about a ten-gallon tank and a twenty-gallon tank; we're in fact looking for a thirty-gallon tank, and the model knows it doesn't have one. So, Mastra agents, and I think these are fairly standard,
- 5:46 – 5:58
Types of Memory in Masra Agents
- Sam Bhagwat
these are fairly standard categories of functionality. Mastra agents have two main memory types: they have the ability to do semantic recall, and they have the ability to do
- 5:58 – 6:51
Semantic Recall Explained
- Sam Bhagwat
working memory. Those are the most important ones for this particular benchmark. So how does semantic recall work? You take all of your messages, embed them, and put them in a vector database, such as pgvector or Chroma or whatever you'd like to use, and then you search over them. Typically you retrieve the top K matches; that's the topK parameter in semantic recall. We also have a message range around that, which is similar to using grep -C: when you search a file, you want to grab the lines around each match. So there's a message range where you grab the messages around each matching message, giving you the surrounding context you need.
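That retrieval flow can be sketched in a few lines. The embeddings here are toy vectors standing in for a real embedding model and vector store, and `semanticRecall`, `topK`, and `range` are illustrative names, not Mastra's exact API:

```typescript
// Sketch of semantic recall: top-K vector search plus a surrounding
// message range, like `grep -C`. Toy vectors stand in for a real
// embedding model and a store like pgvector or Chroma.
type Embedded = { text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the topK most similar messages, plus `range` messages on
// each side of every hit for surrounding context.
function semanticRecall(
  messages: Embedded[],
  query: number[],
  topK: number,
  range: number
): string[] {
  const ranked = messages
    .map((m, i) => ({ i, score: cosine(m.vector, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  const keep = new Set<number>();
  for (const hit of ranked) {
    const lo = Math.max(0, hit.i - range);
    const hi = Math.min(messages.length - 1, hit.i + range);
    for (let j = lo; j <= hi; j++) keep.add(j);
  }
  return Array.from(keep)
    .sort((a, b) => a - b)
    .map((i) => messages[i].text);
}
```

Note that the range expansion happens after ranking, so neighbors get pulled in even when their own similarity scores are low, which is exactly the point: the match tells you where to look, the range tells you how much context to bring along.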
- 6:51 – 7:12
Working Memory and Templates
- Sam Bhagwat
The interesting thing about working memory is that the template you use for memory matters. The template specifies the attributes you're specifically interested in. You can imagine that if you're building a personal trainer AI, you're interested in slightly different things than if you're building a legal research AI application,
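For illustration, two hypothetical templates along those lines. The attribute lists are made up; they just show how a template pins down what working memory should track for a given domain:

```typescript
// Illustrative working-memory templates for two different applications.
// The attributes are hypothetical examples, not a prescribed schema.
const personalTrainerTemplate = `
# User Profile
- Name:
- Fitness goals:
- Injuries / limitations:
- Weekly schedule:
`;

const legalResearchTemplate = `
# Matter Profile
- Client:
- Jurisdiction:
- Practice area:
- Key deadlines:
`;
```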
- 7:12 – 7:43
Initial Benchmark Results
- Sam Bhagwat
right? So I want to talk about what happened when we ran our existing memory implementation on LongMemEval. We weren't really happy with the results. The state-of-the-art results started at about seventy-two percent, and our initial run got sixty-seven percent when we combined working memory and semantic recall. We thought, "Okay, it looks like our implementation could have been better, and
- 7:43 – 11:08
Improving Memory Implementation
- Sam Bhagwat
so what can we do to improve it?" Well, good thing we have a benchmark. So we started off and asked what we could do to improve, as measured by the benchmark. The first thing we did was generate tailored templates for each question. What does that mean? The template should be specific to the question being asked, so we scripted an LLM to generate a template for each question. It turned out that made working memory, used by itself, a lot more accurate. But most of the weight of the results was on semantic recall, so we were happy working memory got better, but okay, let's make semantic recall better too. Actually, the next thing we did was still on working memory. We realized that in our implementation, when we got a new piece of information, we were asking the LLM to rewrite the whole working memory, rather than overwrite just that one specific part. Sometimes those full rewrites were incorrect, and we got better results when we did more targeted updates. When we did that, we found our overall accuracy increased to state-of-the-art. We were pretty happy, but not completely happy, so we kept working on this. We found some specific bugs in how the messages were being presented. There was some confusion because the benchmark conversations were dated at one specific point in time, but we were running them at a different date.
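The difference between a full rewrite and a targeted update can be sketched like this. This is a toy patch function, not Mastra's implementation:

```typescript
// Sketch of a targeted working-memory update: patch one attribute
// instead of asking the LLM to regenerate the whole memory document.
type WorkingMemory = Record<string, string>;

// Apply a single-field update; every other attribute is untouched,
// so an incorrect update can only damage one field, not all of them.
function applyUpdate(
  memory: WorkingMemory,
  field: string,
  value: string
): WorkingMemory {
  return { ...memory, [field]: value };
}
```

With a full rewrite, the LLM re-emits every attribute and can silently corrupt ones that didn't change; with a targeted patch, the blast radius of a bad update is one field.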
And we all know time is hard. It turned out we were putting in the wrong dates, so the LLM was trying to reason over the information but was being presented with the wrong dates. When we put in the right dates, it was able to do these temporal reasoning tasks better, and we got another couple percent on the benchmark. The fourth thing is that data structure matters. We were initially presenting the messages as a flat list. We not only added timestamps next to the messages, we also grouped them by day, with hour timestamps within each day. Having that particular data structure again helped the LLM do temporal and session-based reasoning better than it had previously. These were all framework-level, general changes; we weren't making any domain- or industry-specific updates, because we're building a framework that works across domains and industries. But these things were within our control as a framework, and we were able to get significant improvement from them. These are
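A rough sketch of that grouping, assuming ISO timestamps; the exact output format here is illustrative:

```typescript
// Sketch: instead of a flat message list, group messages by day and
// show hour-level timestamps within each day, which helps the LLM
// with temporal and session-based reasoning.
type TimedMessage = { role: string; content: string; at: Date };

function formatGrouped(messages: TimedMessage[]): string {
  const byDay = new Map<string, TimedMessage[]>();
  for (const m of messages) {
    const day = m.at.toISOString().slice(0, 10); // YYYY-MM-DD
    const bucket = byDay.get(day);
    if (bucket) bucket.push(m);
    else byDay.set(day, [m]);
  }
  const lines: string[] = [];
  byDay.forEach((msgs, day) => {
    lines.push(`## ${day}`);
    for (const m of msgs) {
      const hhmm = m.at.toISOString().slice(11, 16); // HH:MM
      lines.push(`[${hhmm}] ${m.role}: ${m.content}`);
    }
  });
  return lines.join("\n");
}
```

The day headers make session boundaries explicit, so the model doesn't have to infer them from raw timestamps alone.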
- 11:08 – 12:14
Configuration Matters
- Sam Bhagwat
the full results. The lesson, if we look at the full results, is that configuration matters. Take topK, for example: if we set topK unrealistically low, something like two, you get much worse results than if you set it more reasonably. And between five and twenty, we did see a significant improvement. At the beginning of the talk I mentioned those five categories; you can see them down at the bottom, with the results broken down by category. You'll notice that single-session preference, because it's single-session, isn't very tied to semantic recall, so its performance didn't improve as we increased topK. I think that's another interesting lesson: when you're writing evals, make sure to group them by category, because some of your changes may move some categories without moving others. The last bit here is that
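That per-category, per-configuration loop might look roughly like this. The eval cases below are stand-ins, not real LongMemEval questions:

```typescript
// Toy sketch of the "configuration matters" loop: score the same eval
// set at several topK values and compare accuracy per category.
type EvalCase = { category: string; passes: (topK: number) => boolean };

function accuracyByCategory(
  cases: EvalCase[],
  topK: number
): Map<string, number> {
  const totals = new Map<string, { pass: number; total: number }>();
  for (const c of cases) {
    const t = totals.get(c.category) ?? { pass: 0, total: 0 };
    t.total++;
    if (c.passes(topK)) t.pass++;
    totals.set(c.category, t);
  }
  const result = new Map<string, number>();
  totals.forEach((v, k) => result.set(k, v.pass / v.total));
  return result;
}
```

Breaking scores out by category is what reveals that a knob like topK moves recall-bound categories while leaving single-session ones flat.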
- 12:14 – 13:38
Future Steps and Conclusion
- Sam Bhagwat
there are other things we could have done. There's another company whose benchmark result used re-ranking; we don't currently use re-ranking in our memory system. So the next steps are: ship episodic memory and see how the benchmarks change, add the ability to re-rank and see how the results change, et cetera. But I think the TL;DR here is that the loop that leads to improvement is: write an eval scaffold, then iterate a lot. If you keep iterating, that will lead to the results you want. And looking at the data will let you zone in on where the weird bugs are, and on the odd little things you can do to present the information in a clearer, easier-to-understand format. So thanks. [audience applauding] [upbeat music]
Episode duration: 13:39