Managing Memory for AI Agents
Memory, Models, and the Architecture of Adaptability
Benjamin Labaschin, Jim Allen Wallace, Andrew Brookins, and Manvinder Singh
Managing Memory for AI Agents
by Benjamin Labaschin, Jim Allen Wallace, Andrew Brookins, and Manvinder Singh
Copyright © 2025 O’Reilly Media, Inc. All rights reserved.
Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Acquisitions Editor: Aaron Black
Development Editor: Jill Leonard
Production Editor: Jonathon Owen
Copyeditor: Shannon Turlington
Cover Designer: Susan Thompson
Cover Illustrator: Ellie Volckhausen
Interior Designer: David Futato
Interior Illustrator: Kate Dullea
October 2025: First Edition
Revision History for the First Edition
2025-10-03: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing Memory for AI Agents, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Redis. See our statement of editorial independence. 979-8-341-66124-0 [LSI]
Introduction

Have you ever had a conversation with a manager that, however pleasant it started, devolved into frustration because they simply couldn’t remember a critical detail from last week? Maybe you spent an hour explaining a new project, your concerns about the timeline, the specific approaches you’d decided on—only to have them ask a week later, “Wait, what are you working on again?” It’s all you can do not to walk away in exasperation.

For those of us working with AI agents, this scenario feels painfully familiar. The agent you’re collaborating with today might be brilliant at understanding your current request, even maintaining perfect context throughout a long conversation. But mention that project from last week—the one you meticulously outlined, with all its requirements and constraints—and you’re met with the digital equivalent of a blank stare.

This isn’t just an inconvenience; it’s a fundamental limitation that shapes how we work with these systems. New sessions can feel like starting from scratch, forcing you to reexplain context that should be remembered and rebuild understanding that should persist. The agent’s amnesia forces us into a loop of recontextualization—turning what should be an ongoing collaboration into a series of disconnected encounters.

The fact is that agentic memory is simply not as expansive as it should be. While these systems can process vast amounts of information within a single conversation, their ability to retain and meaningfully recall that information across time remains frustratingly limited.

The good news is that this limitation hasn’t gone unnoticed. Researchers and engineers are working tirelessly to expand memory capacity, improve storage and retrieval mechanisms, and create systems that can maintain context not just for hours, but for weeks and months. They’re tackling the problem from multiple angles—from fundamental improvements to attention mechanisms to clever workarounds that simulate longer-term memory.

This book is about understanding those efforts and, more importantly, what they mean for how we’ll work with agents in the future. Because the
organization or individual who figures out how to give their agents effective long-term memory won’t just have a better tool—they’ll have a true collaborator.

Agent Memory: It’s Just Data…Until It Isn’t

At its core, agent memory is exactly what any software engineer would expect: data, storage, and retrieval. When you hear “memory management” for AI agents, you’d be right to think “data management.” We have decades of experience with databases, caching systems, and storage architectures. So why do we need entire books about agent memory?

The answer is deceptively simple: while data storage may be a familiar topic, how agents use that data is fundamentally different from any system that engineers have built before.

Consider a traditional database. When you query for customer records, you get exactly those records, every time. The database doesn’t decide that yesterday’s query about customer #12345 is less important today. It doesn’t compress old transactions into vague summaries. It certainly doesn’t retrieve different results based on how you phrase your query.

But agents? They’re nondeterministic by nature. The same query might pull different information based on subtle changes in phrasing. What gets stored isn’t just data—it’s embedded into vectors where bank (financial) and bank (riverside) might live in completely different neighborhoods of meaning. Retrieval isn’t a precise SELECT statement; it’s a fuzzy search through semantic space where relevance is calculated, not guaranteed.

This is why we borrow so heavily from human memory as a metaphor. Like humans, agents must work within constraints: limited context windows instead of infinite storage. Agents weigh recent information more heavily than old. They make decisions about what to retain, what to compress, and what to forget. They retrieve information based on similarity and context, not perfect matches.
Yet even this comparison has limits. As we’ll explore, agents don’t learn continuously like humans do. They can’t update their core knowledge. They’re frozen in time, relying on clever engineering to simulate the kind of dynamic memory we take for granted.

This tension—between data management as we know it and the strange new requirements of nondeterministic systems—defines everything about how we build agent memory. It’s why a simple conversation can become a complex dance of embeddings, vector databases, importance scoring, and semantic caching. It’s why giving an agent the same task twice may yield different results, even with identical memories.

What You’ll Get from This Book

This book takes you on a journey from fundamental decisions to practical implementation, always keeping memory at the center of the conversation. Here’s what each chapter delivers:

Chapter 1
The technical foundation of this report, this chapter introduces agent memory management. Agents are nondeterministic by nature, so how can we teach them to retain and use important user information? Using strategies such as importance scoring, cascading memory systems, and checkpointing, we build conceptual foundations for the rest of the report.

Chapter 2
Provides a deep dive into agent memory systems. This is the technical heart of the book, discussing the difference between episodic, semantic, and procedural memory. It explains how
agent memory actually works—from context windows to persistence strategies, importance scoring to semantic caching. This chapter threads the needle from theory to industry practice, covering common tools and frameworks like Redis and LangGraph.

Chapter 3
Introduces the economics of agents, model usage, and selection. Memory isn’t free. Every token stored costs money. Learn why the economics of memory management keep shifting and what that means for your architecture.

Chapter 4
Covers navigating agent tradeoffs—custom builds, frameworks, and hosted solutions. Should you build from scratch, adopt a framework, or use a hosted solution? Your choice shapes everything about how you’ll manage memory.

Chapter 5
Covers collective memory—how teams and organizations share knowledge through AI agents. This chapter spans individual agents to organizational intelligence. It covers how Transactive Memory Systems work in practice, why novices benefit most from shared knowledge, and how platforms like Zep and MCP connect organizational memory.

To close things out, we’ll cover the future of agent memory. In the conclusion, you’ll learn why memory is so impactful—and what it means
for organizations to get agentic memory right.

Throughout this book, we’ll return to a fundamental truth: the most important agent will always be the human agent. We are the conductors guiding these powerful orchestras. The systems we’ll explore—from vector databases to semantic caching, from importance scoring to collective memory platforms—are instruments in our hands. They can store vast amounts of data, retrieve it in clever ways, even make intelligent decisions about what to keep and what to forget. But it’s we who decide what constitutes success, what memories matter most, and how these systems should serve our goals.

This book will teach you not just what’s important in agent memory management but why it’s important. Because in the end, the organizations and individuals who thrive won’t be those with the biggest context windows or the fastest retrieval algorithms. They’ll be those who understand how to conduct these systems—instructing them in just the right ways to retain what matters, forget what doesn’t, and transform raw data into genuine intelligence.

Let’s begin.
Chapter 1. A Deep Dive into Agent Memory Systems

An agent’s memory is, at its core, synonymous with data, storage, and retrieval. Anytime you hear “memory management,” you would be right to interpret this as “data management.” And data management, it turns out, is a concept we have a great deal of knowledge about in the world of software engineering. So why is it then that we need entire reports on memory management in AI agents?

The answer, it turns out, is that while data management is a well-understood discipline, the way AI agents use data is fundamentally different from anything that engineers have encountered before. Recall that agents are nondeterministic systems, programmed with the ability to use tools and constraints that generally guide them—but nondeterministic all the same. To these agents, some data is more relevant than others, all data takes up space, and all agents have a limited amount of space—their context windows—in which to process it. This fundamental insight shapes everything about how we design, implement, and manage agent memory systems.

Because the way in which agents use data differs fundamentally from traditional software programs, memory is often used as a stand-in for human cognition. The thinking is that if agents use information to generate nondeterministically—weighing recent information more heavily than older information, taking into account user context and preferences, and adapting to new inputs dynamically—then perhaps memory is the appropriate analogy for how agents work.

This tension between humanlike behavior and fundamentally different architectures defines the current state of agent memory systems. In this chapter, we’ll explore how the industry is navigating these challenges through short-term memory management, long-term persistence strategies,
emerging technologies, and enhancement techniques like named entity recognition.

Understanding Agent Memory Systems

No two agent memory systems are exactly alike, and they are certainly going to change in the future—they’re changing as we speak! That’s why focusing on specific architectures is much less important than understanding the higher-level concepts. Although agents aren’t classically programmable, they follow classic computer science principles when it comes to memory management.

Think of agent memory as being like RAM in a computer. Even as context windows expand, it’s generally true that the more applicable and concise the information, and the more direct the query, the better the results will be. This isn’t just about efficiency; it’s about the fundamental nature of language paired with the systems that process this information.

Just as there are different architectures for agent systems, there are also different classifications of memory. Some distinguish between sensory memory (ingesting information like images, audio, and haptic feedback), short-term or working memory (an active memory buffer of conversation history), and long-term memory (storage relevant to the agent’s or user’s life or work).1 Others focus on splitting between short-term and long-term memory, with subcategories for long-term memory including episodic (specific past events), procedural (contextual working knowledge and learned skills), and semantic (general world knowledge).2 Much of this thinking is influenced by the field of psychology, which categorizes human memory similarly.

These distinctions help tailor how agents interact with humans, with systems, and even with other agents. In particular, they influence how memory is stored and retrieved.
Memory Storage and Representation

Almost all memory is embedded into vectors of continuous numbers meaningful to specific large language models (LLMs) and then stored in vector databases. Some of this information can be considered knowledge (embedded documents or working context), while other information is memory in the traditional sense (user preferences, standing instructions, or relevant past answers). Then there are episodic memories that don’t need to be kept, such as random interactions or questions that don’t appear to have lasting relevance.

The process of deciding what constitutes each type is where the complexity begins. Unlike traditional databases, where you explicitly define schemas and relationships, agent memory systems must make these determinations dynamically, often with imperfect information.

The Challenge of Nondeterministic Systems

Those of us who work with agents every day know the frustration: instructions explicitly given in the past that are no longer retained in a new session or, worse, instructions neglected during a longer session with the same agent. But even if memories are retained, that doesn’t mean the agent will act as you’d expect. Because agents are not programmable in the traditional sense, giving one the same task twice may yield markedly different results. Consider a research assistant that pulls different sources for the same question.

This variability extends to coding agents, too. After all, asking a coding agent to scan and iteratively improve a large legacy codebase is a far more difficult task than asking one to build a new repository from scratch. Instructing an agent to “alter the search bar” in a system that could have many search bars leaves significant room for interpretation. What agents retrieve for you largely depends on factors such as the size of the context the agent must ingest, the complex interplay of its embeddings and similarity metrics, and the specific phrasing of your query.
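To see why relevance is calculated rather than guaranteed, consider a minimal sketch of similarity-based retrieval. The two-dimensional vectors below are invented for illustration; a real system would use high-dimensional embeddings produced by a model:

```python
import numpy as np

# Toy embeddings (invented for illustration; real embeddings have
# hundreds or thousands of dimensions and come from a model).
memories = {
    "deposited my paycheck at the bank": np.array([0.9, 0.1]),
    "fished from the river bank at dawn": np.array([0.1, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray) -> str:
    # Relevance is a score, not a guarantee: the "best" match is simply
    # whichever stored vector lies closest in semantic space.
    return max(memories, key=lambda m: cosine_similarity(query_vec, memories[m]))

# A query phrased around finance lands near the financial sense of "bank"...
print(retrieve(np.array([0.8, 0.2])))  # -> the paycheck memory
# ...while a rephrasing toward nature surfaces a different memory entirely.
print(retrieve(np.array([0.2, 0.8])))  # -> the river memory
```

Two phrasings of “the same” question land in different neighborhoods of semantic space and surface different memories—exactly the fuzziness a SELECT statement never exhibits.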
Storage and Retrieval: The Core Challenge

This bears repeating because it is the most crucial aspect of memory management in AI agent systems: the retention of knowledge is dynamic and stochastic, not just on the storage side but on the retrieval side as well. How do you decide what should be stored? What instructions are comprehensive enough to give a system? Do you store everything? Conversations can range from a few sentences to dozens of pages of text, depending on the user and use case. When storage gets tight, how do you flush the system?

There are many strategies to address these challenges. Some popular methods include:

Importance scoring
Calculating memory importance based on recency, frequency of reference, user engagement metrics, and keyword relevance3 (a simple scorer is sketched after this list)

Cascading memory systems
Allowing the agent itself to choose what to promote to long-term storage and what to retrieve4

Intelligent compression
Using specialized models to condense conversation history into key details, events, and decisions5

Vector store offloading
Moving older messages from short-term memory into vector stores, often with summarization6

Engineers typically compress information by instructing an LLM to summarize to the best of its ability. But summaries are not the same thing as the original. There is, by definition, loss of information.
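Here is that sketch: a hedged illustration of importance scoring that blends recency, reference frequency, and keyword relevance into one number. The signals and weights are assumptions chosen for clarity, not the formula of any particular framework:

```python
import math
import time

def importance_score(
    last_accessed: float,      # Unix timestamp of the memory's last retrieval
    reference_count: int,      # how often the memory has been recalled
    keyword_overlap: float,    # 0.0-1.0 relevance to tracked keywords
    half_life_hours: float = 24.0,
) -> float:
    """Blend recency, frequency, and relevance into a single score.

    The weights below are illustrative assumptions; a production system
    would tune them against user engagement metrics.
    """
    age_hours = (time.time() - last_accessed) / 3600
    # Recency decays by half every half_life_hours.
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    # Frequency saturates: the tenth recall matters less than the second.
    frequency = 1 - math.exp(-0.5 * reference_count)
    return 0.5 * recency + 0.3 * frequency + 0.2 * keyword_overlap

# Memories scoring below a threshold become candidates for compression
# or offloading to a vector store.
score = importance_score(time.time() - 48 * 3600, reference_count=3, keyword_overlap=0.6)
print(f"{score:.2f}")
```

A scheduler might run such a scorer periodically, promoting high scorers to long-term storage and queuing low scorers for compression or offloading.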
The Imprecision of Retrieval

Retrieval makes everything more complicated. Information retrieval is often based on similarity between texts, and because language is imprecise, retrieval will be fuzzy, too. The classic example is bank: it can be a financial institution or an aspect of a river. Depending on the model choice and embeddings, different algorithms may be appropriate: cosine similarity, Euclidean distance, or even older methods like term frequency–inverse document frequency (TF-IDF). There’s no single way to search and retrieve information, and there are plenty of trade-offs between speed and accuracy. We’re certainly improving in the vector database space—local options like ChromaDB, Redis, PostgreSQL with pgvector, and Qdrant exist—but there’s still plenty of room for further improvement.

Managing Context Window Limitations

All models have context windows containing the information provided as context for generating answers. There are different methods for managing these limitations. With FIFO (first in, first out), the earliest information received over a long conversation is the first to be evicted, meaning that more recent information gets prioritized. Strategies to address this include intelligent pruning, where a model removes superfluous information. But there are consequences to this. Consider summarized information about a legal text: you might get the broad strokes of the legal argument and topics but lose critical information, like a negation or case reference that completely changes the content. Summarization by definition means losing detail and taking a more abstract perspective.

Compare this to larger context windows like those in Gemini 2.5, which can handle millions of tokens. We can stuff more information into a model, but we may not have effective recall for the earliest-passed information relative to
the last. It’d be nice if we lived in a world where models had perfect recall, but the architecture of transformers’ self-attention mechanisms requires quadratically more processing as context increases. Algorithms like FlashAttention attempt to work around this, but a more direct approach might be retrieval-augmented generation (RAG), which limits the corpus of documents and forces the LLM to return sourced information rather than stuffing superfluous information into the context.

An even more refined approach—one I believe will become more popular—is semantic caching. What if we retained the relative context of an information retrieval system over time by processing the semantics of the content being passed? Frequently retrieved information gets prioritized. For systems like internal LLMs or RAG deployments where many users talk to the same corpus of information, it may be both more computationally efficient and cost-effective to semantically cache that information. Semantic caching isn’t without its drawbacks—it works exceptionally well for single-shot questions but breaks down in multiturn conversations.

These methods will continue to be refined, but the fundamental problems will remain the same: how can we most efficiently and effectively store and retrieve the information we desire?

Persistence via Checkpointing

Checkpointing is a crucial step for agents, especially those that engage in multiturn conversations. As agents interact with a user or a system, they periodically save their internal states (their memory) in order to persist information across sessions or long conversations. Different organizations, products, and systems approach this process differently. The key insight is that checkpointing isn’t just about saving state—it’s about making that state retrievable and actionable in the dynamic, nondeterministic world of agent interactions.

There are many ways to handle checkpointing, and different teams use different tools. For example, Redis is a popular choice because it’s fast and works well for real-time applications. Some setups use Redis to save
conversation threads or agent state, making it easy to restore context across sessions or even share memory between different parts of a system. Features like automatic cleanup (time to live) help keep things tidy, so you’re not stuck with a bunch of old, irrelevant data.7

Basically, checkpointing is about making sure agents don’t lose their place and that their memory is both persistent and practical in the unpredictable world of AI interactions. (A minimal sketch of this pattern follows the chapter notes below.)

1 Michael Lanham, AI Agents in Action (Manning Publications, 2024), 200.
2 Manvinder Singh and Andrew Brookins, “Build Smarter AI Agents: Manage Short-Term and Long-Term Memory with Redis,” Redis Blog, April 29, 2025, https://redis.io/blog/build-smarter-ai-agents-manage-short-term-and-long-term-memory-with-redis/.
3 Singh and Brookins, “Build Smarter AI Agents.”
4 “Memory for Agents,” LangChain Blog, October 19, 2024, https://blog.langchain.com/memory-for-agents.
5 “How to Migrate to LangGraph Memory,” LangChain Documentation, https://python.langchain.com/docs/versions/migrating_memory.
6 Singh and Brookins, “Build Smarter AI Agents.”
7 Brian Sam-Bodden, “LangGraph & Redis: Build Smarter AI Agents with Memory & Persistence,” Redis Blog, March 28, 2025, https://redis.io/blog/langgraph-redis-build-smarter-ai-agents-with-memory-persistence/; Redis Official Documentation, Redis, https://redis.io/docs/.
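As promised above, here is a minimal sketch of session checkpointing with Redis and a TTL for automatic cleanup. It illustrates the underlying pattern only—the key names, state shape, and TTL are assumptions for illustration, not the API of LangGraph’s Redis checkpointer:

```python
import json
import redis  # requires the redis-py package and a running Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def checkpoint(session_id: str, state: dict, ttl_seconds: int = 86400) -> None:
    # Persist the agent's state under a session-scoped key; the TTL acts
    # as automatic cleanup, so stale sessions expire on their own.
    r.set(f"agent:checkpoint:{session_id}", json.dumps(state), ex=ttl_seconds)

def restore(session_id: str) -> dict | None:
    raw = r.get(f"agent:checkpoint:{session_id}")
    return json.loads(raw) if raw else None

# Save mid-conversation state, then recover it in a later session.
checkpoint("session-42", {"messages": ["hi"], "facts": {"name": "Ada"}})
print(restore("session-42"))
```

Because the key is scoped to the session, a later process can rebuild the agent’s context simply by reading it back, while abandoned sessions quietly expire.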
Chapter 2. Long-Term Memory: Building Persistent Learning Agents

There’s no single, universal definition for the different types of memory in agent systems—every company seems to have its own take. For example, Anthropic, OpenAI, and Google all use slightly different terminology and approaches. But the more interesting question is: how does an agent actually decide what counts as procedural, semantic, or episodic memory? Or, put another way, what gets treated as short-term, long-term, or contextual memory?

Along these lines, semantic caching might play a big role. Sometimes short-term memories can be promoted to long-term if they’re accessed frequently enough. Conversely, long-term memories that aren’t used much might get summarized, become less detailed, or even be dropped from the system altogether. These are the kinds of trade-offs and decisions that go into designing agent memory. Ultimately, it all comes down to how the system is built to manage and retain information. Since it’s nearly impossible to program an agent to handle every scenario, we rely on constraints and parameters to guide what gets treated as episodic, semantic, or procedural memory.

Types of Long-Term Memory

The industry has converged on three primary types of long-term memory, although implementations vary significantly:

Episodic memory
Stores specific past experiences and events, functioning like human autobiographical memory. Companies typically implement this through RAG systems on conversation histories, extracting relevant chunks instead of keeping the full history.1 The common approach uses few-shot example prompting, where agents learn from past sequences, with key events, actions, and outcomes logged in structured formats2 (a sketch of such a record appears at the end of this section).

Semantic memory
Maintains structured factual knowledge—facts, definitions, rules—implemented through knowledge bases, symbolic AI, or vector embeddings. LLMs extract information from conversations, storing it as user or entity profiles that are retrieved and inserted into system prompts to influence future responses.

Procedural memory
The least common area, but a growing one, which stores skills, rules, and learned behaviors for automatic task performance. This combines LLM weights, agent code, and system prompts, with some agents updating their own prompts through “reflection” or metaprompting.

The field of agentic AI is moving away from rigid definitions of memory toward more flexible hybrid approaches, where memories can transition between types based on usage patterns and importance scoring. As Rowan Trollope, CEO of Redis, observes, this mirrors human memory consolidation. Just as our REM cycle compresses information from short-term to long-term memory as we sleep, the goal is to build agents that engage in similar processes.3
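Here is that sketch of a structured episodic record. The field names are illustrative assumptions, not a standard schema:

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class EpisodicRecord:
    """One logged agent experience: what happened, what was done, how it ended."""
    event: str          # what the user asked or what occurred
    actions: list[str]  # steps the agent took in response
    outcome: str        # result, reusable later as a few-shot example
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = EpisodicRecord(
    event="User asked to summarize the Q3 sales report",
    actions=["retrieved report", "generated summary", "confirmed with user"],
    outcome="summary accepted on first attempt",
)
# Serialized records can be embedded and stored, then retrieved as
# few-shot examples when a similar request arrives.
print(asdict(record))
```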
Long-Term Memory in Frameworks

The practical implementation of long-term memory varies dramatically across frameworks, with each taking a different philosophical approach to the challenge:

LangGraph Stores
From the creators of LangChain, a framework for LLM and agent development, this system organizes memory in namespaces as JSON documents with unique identifiers, supporting semantic facts, user preferences, episodic examples, and procedural system prompts. The LangMem SDK adds tools for extracting information from conversations, optimizing prompts, and maintaining persistent memory.4

Mem0 (Memory-Zero)
This framework extracts key facts from interactions and updates long-term memory selectively. Rather than storing complete chat histories, it maintains concise entries to reduce memory usage and improve retrieval speed. Users can add a graph-based extension, Mem0g, that maps relationships between entities for additional context.5

Redis Semantic Caching (LangCache)
This service addresses the repetitive nature of agent queries through semantic caching. The implementation includes configurable search criteria, a REST API, and user-specific security features. At the time of writing, LangCache is in private preview, but it should soon be available to the general public.6

ADK MemoryService (Google’s Agent Development Kit)
This framework provides a BaseMemoryService interface with two primary functions: adding completed sessions to storage and searching stored information. Developers can choose between InMemoryService for RAM-based keyword searches (nonpersistent) and VertexAIMemoryBankService for production environments with persistent semantic search.

Technologies and Solutions for Advanced Memory Management

The landscape of agent memory technologies reflects a fundamental tension: the need for both sophistication and simplicity, performance and flexibility, innovation and reliability. Each solution reflects a different philosophy about how agents should remember, and different priorities:

LangGraph
Seeks simplicity with its document store approach. Easy integration and namespace organization make it ideal for rapid prototyping and standard agent workflows, though it lacks the sophistication of specialized solutions.

ADK MemoryService
Provides enterprise-grade reliability with clear interfaces and Vertex AI integration. The trade-off is its limitation to the Google ecosystem—perfect for Google Cloud deployments but less flexible for multicloud architectures.

Vector database alternatives
Each serves a different level of demand. Pinecone offers managed services that are excellent for scale. Weaviate provides open source software with hybrid search capabilities. Qdrant focuses on performance with advanced filtering.
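To ground the namespaced document-store approach described above, here is a minimal sketch using LangGraph’s in-memory store. The namespace, keys, and values are invented for illustration, and the method names reflect recent versions of the langgraph package, so check them against the version you have installed:

```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Memories live in namespaces (here, per user) as JSON-like documents
# with unique keys--semantic facts, preferences, and episodes all fit.
namespace = ("memories", "user-123")
store.put(namespace, "preferences", {"tone": "concise", "language": "en"})
store.put(namespace, "fact-1", {"text": "Works on the billing service"})

# Retrieve a single document by key...
item = store.get(namespace, "preferences")
print(item.value)  # {'tone': 'concise', 'language': 'en'}

# ...or list what is stored under the namespace.
for stored in store.search(namespace):
    print(stored.key, stored.value)
```

Because LangGraph’s persistent store backends expose the same put/get/search surface, a prototype written against the in-memory store can, in principle, later be pointed at durable storage.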