Memory · Production · Frameworks

Memory Management in Agentic AI: From STM to LTM in Production

Why your demo agent feels brilliant and your production agent feels like a goldfish — a beginner-to-advanced field guide to short-term and long-term memory, the strategies that actually work, and how to wire them up in CrewAI, LangChain, LangGraph, OpenAI Agents SDK, and Strands.

AgentSwarms Authors
June 2, 2026 · 18 min read

A user told your agent, in the very first message, that they live in Bangalore and prefer answers in metric. Twelve turns later they ask about the weather, and the agent confidently quotes Fahrenheit for somewhere in Texas. Nothing crashed. The model is fine. The agent simply forgot — because nobody told it how to remember.

Memory is the unglamorous engineering that separates a chatbot that performs well in a demo from an agent that feels like it knows you in production. It is also where most teams quietly under-invest. They pick a framework, ship the default ConversationBufferMemory, and discover at scale that 'just keep the whole conversation in the prompt' is neither cheap nor correct.

This guide is the field manual we wish we'd had — beginner-friendly enough to start from zero, deep enough to make real production decisions. We'll define the memory types properly, walk the strategies you'll actually choose between, show how to test memory inside AgentSwarms, and finish with concrete, working snippets for CrewAI, LangChain, LangGraph, OpenAI Agents SDK, and Strands.

Memory is not one thing

Cognitive scientists distinguish between working memory (what you're holding in mind right now) and long-term memory (what you can recall later). Agentic AI inherited this distinction almost verbatim — and then split long-term into three useful sub-types. The reason that matters: each type has a different cost, a different lifetime, and a different mechanism. Treating them as one bucket is how you end up paying gateway prices to remember someone's favorite color.

[Interactive figure: the four memory types. Example card: short-term (working) memory, the current conversation held in the prompt window ("You just told me your name is Priya, 2 turns ago."); span: minutes; cost: tokens per turn. STM lives in the prompt window. Long-term splits into episodic (what happened), semantic (stable facts), and procedural (learned how-tos). Each costs and behaves differently.]
  • Short-term (working) memory — the recent conversation, held inside the model's context window. Free to write, expensive to keep large, vanishes when the session ends.
  • Episodic memory — durable records of what happened: past conversations, decisions made, tickets filed. Stored outside the model, recalled on demand.
  • Semantic memory — stable facts: the user's name, their preferences, your company's VAT rate. Small, hot, often injected into the system prompt verbatim.
  • Procedural memory — learned routines and tool-use patterns. Usually expressed as updated system prompts, few-shot examples, or skill libraries that the agent reaches for automatically.
The line that helps most teams

If a fact is true for the rest of this conversation, it's STM. If it's true the next time this user opens the app, it's LTM. Knowing which one you're building decides the storage, the recall, and the bill.

Why pure short-term memory fails at scale

The simplest possible memory is to shove the whole transcript back into the prompt every turn. It works beautifully in a demo and falls over in production for two boring reasons: context windows are finite, and tokens cost money. Even a 200K-token window fills surprisingly fast once you add a system prompt, tool definitions, retrieved documents, and a verbose multi-agent dialogue. And every token of history is a token you pay for, every turn.
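To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It assumes every turn is the same size and that a fixed overhead (system prompt, tool definitions) is resent each time; the function names and all numbers are illustrative, not measurements.

```python
def full_buffer_tokens(turns: int, tokens_per_turn: int, overhead: int = 0) -> int:
    """Total input tokens billed when every turn resends the whole transcript."""
    total = 0
    for turn in range(1, turns + 1):
        # Turn k resends the k-1 earlier turns plus the new one.
        total += overhead + turn * tokens_per_turn
    return total


def sliding_window_tokens(
    turns: int, tokens_per_turn: int, window: int, overhead: int = 0
) -> int:
    """Total input tokens billed when only the last `window` earlier turns are kept."""
    total = 0
    for turn in range(1, turns + 1):
        kept = min(turn - 1, window)  # earlier turns still in the prompt
        total += overhead + (kept + 1) * tokens_per_turn
    return total


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(
            n,
            full_buffer_tokens(n, 200, overhead=1500),
            sliding_window_tokens(n, 200, window=6, overhead=1500),
        )
```

Under these assumptions a 100-turn conversation bills about 1.16M input tokens with a full buffer and about 286K with a six-turn window: the full buffer grows quadratically with conversation length, the window grows linearly.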

[Interactive figure: conversation-length slider. On turn 2 the user says "my flight number is BA117." With the conversation at 6 turns, the question "what time does my flight land?" still gets "BA117 lands at 18:40 local time." Past a certain length, naïve sliding-window STM drops the flight number and the agent has to ask again. Memory is what keeps "BA117" alive after turn 16.]

There is also a quality problem that papers like Lost in the Middle made famous: when the prompt is long, models pay disproportionate attention to the beginning and the end, and quietly under-weight the middle. So even when the relevant fact is technically in the window, longer context can make recall worse, not better. The fix is the same in both cases: stop trying to remember everything in the prompt.

Short-term memory: the four strategies you'll actually choose between

Every framework's STM offerings are variations on the same four ideas. Pick deliberately — the right one depends on conversation length, latency budget, and how much you mind paying for tokens you'll never use again.

[Interactive figure: the four STM strategies. Selected card: Hybrid, the last N turns verbatim plus a summary of everything before; cost: $$ (balanced); recall: best general-purpose default. Hybrid (window + rolling summary) is the production default for a reason: it's cheap, predictable, and degrades gracefully.]
  • Full buffer — keep every turn verbatim. Use only for short flows (<10 turns) or evaluation runs where you want zero loss.
  • Sliding window — keep the last N turns, drop older ones. Predictable token cost; risks dropping the one fact the user mentioned on turn 2.
  • Rolling summary — periodically compress older turns into a paragraph. Lossy but durable; pay for one extra LLM call to save many.
  • Hybrid (window + summary) — the default in serious systems. Recent turns verbatim, everything before rolled into a running summary. Cheap, robust, and the model still gets the important details from earlier.
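The hybrid strategy from the list above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `HybridMemory` and `naive_summarize` are hypothetical names, and the summarizer is a stand-in that only concatenates text where a real system would call a small LLM.

```python
from collections import deque
from typing import Callable

# (previous_summary, evicted_turns) -> new_summary
Summarizer = Callable[[str, list[str]], str]


class HybridMemory:
    """Keep the last `window` turns verbatim; fold older turns into a summary."""

    def __init__(self, window: int, summarize: Summarizer):
        self.window = window
        self.summarize = summarize
        self.summary = ""
        self.recent: deque[str] = deque()

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        evicted = []
        while len(self.recent) > self.window:
            evicted.append(self.recent.popleft())
        if evicted:
            # One summarizer call per eviction batch, not per prompt.
            self.summary = self.summarize(self.summary, evicted)

    def render(self) -> str:
        """Text to place in the prompt: summary first, then recent turns."""
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation: " + self.summary)
        parts.extend(self.recent)
        return "\n".join(parts)


def naive_summarize(previous: str, evicted: list[str]) -> str:
    # Stand-in for a cheap LLM call; it only concatenates.
    return " | ".join(filter(None, [previous, *evicted]))


if __name__ == "__main__":
    memory = HybridMemory(window=2, summarize=naive_summarize)
    for turn in ["user: my flight is BA117", "assistant: noted", "user: when does it land?"]:
        memory.add(turn)
    print(memory.render())
```

The design choice worth noticing is that the summarizer only runs when turns fall out of the window, so its cost scales with evictions rather than with every request.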
Use a small, cheap model for summaries

Your STM summarizer doesn't need GPT-5. A flash-tier model produces perfectly good rolling summaries at a fraction of the cost. AgentSwarms uses gemini-3-flash-preview for exactly this — it is the difference between a memory subsystem you'd ship and one you'd quietly disable.

Long-term memory: extract, store, recall, inject

Long-term memory is a small data pipeline glued to the end of every turn. It runs in the background, costs almost nothing per turn, and is what makes the difference between “an assistant” and “my assistant.”

[Interactive figure: the LTM loop. Turn ends → Extract → Embed + store → Recall → Inject. Example step, "Turn ends": the user said something durable, such as a preference, a fact, or a decision.]
The five-step loop every production LTM implementation runs. Most platforms, including AgentSwarms, wire this up for you; understanding it is what lets you tune it.
  1. Extract. After each assistant turn, a small structured-output prompt scans the exchange for durable items (facts, preferences, decisions, instructions) and emits a list. Skip greetings, restatements, and anything that looks like raw PII.
  2. Store. Embed each item and write it to a per-user, per-agent table with metadata (kind, score, usage_count, created_at). Keep a hard cap (a few hundred items per agent is plenty).
  3. Recall. On the next user prompt, tokenize the query, retrieve top-K relevant items by embedding similarity (or hybrid keyword + vector), and rank by overlap + recency + usage_count.
  4. Inject. Prepend a === WHAT YOU REMEMBER ABOUT THIS USER === block to the system prompt. Keep it short; the model treats it as context, not as an instruction to recite.
  5. Feedback. Bump usage_count and last_used_at on items you actually surfaced. Frequently-useful facts rise; stale ones sink and get pruned.
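As a rough illustration of the "picky" part of the Extract step, the sketch below filters candidate items after the extraction prompt has produced them. The patterns are deliberately crude assumptions (a short greeting list, email-shaped strings, long digit runs), and `is_durable` is a hypothetical helper, not part of any framework.

```python
import re

# Crude, illustrative patterns; a real deployment would use a proper PII detector.
GREETING = re.compile(r"^(hi|hello|hey|thanks|thank you|ok|okay)\b[\s!.,]*$", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"(?:\d[ -]?){9,}")  # phone- or card-like digit runs


def is_durable(candidate: str) -> bool:
    """Return True if an extracted item looks worth storing."""
    text = candidate.strip()
    if len(text) < 8 or GREETING.match(text):
        return False  # too short, or just a greeting/acknowledgement
    if EMAIL.search(text) or LONG_DIGITS.search(text):
        return False  # PII-shaped: keep it out of the store
    return True


if __name__ == "__main__":
    candidates = [
        "user prefers metric units",
        "thank you!!",
        "reach me at priya@example.com",
        "last refund order id is #4412",
    ]
    print([c for c in candidates if is_durable(c)])
```

A filter like this runs after the model's structured output and before the store, so a lenient extraction prompt cannot push junk into long-term memory on its own.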
// The shape of a production-grade LTM item. Note the metadata — it's what
// makes recall, ranking, decay, and audit possible.
type MemoryItem = {
  id: string;
  user_id: string;       // scope: never leak across users
  agent_id: string;      // scope: an agent only remembers what it learned
  kind: "fact" | "preference" | "episodic" | "instruction";
  content: string;       // "user prefers metric units"
  embedding: number[];   // for semantic recall
  keywords: string[];    // for hybrid recall + fast filter
  score: number;         // human/model assigned importance
  usage_count: number;   // bumped on every recall
  last_used_at: string | null;
  created_at: string;
};
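The Recall and Inject steps can be sketched on top of items shaped like the one above. This is an illustrative sketch under stated assumptions: the weights (0.5 / 0.3 / 0.1 / 0.1), the 30-day half-life, and the names `Item`, `relevance`, `recall`, and `render_block` are all made up for the example, and embeddings are plain lists rather than output from a real embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    content: str
    embedding: list[float]
    keywords: set[str]
    usage_count: int
    age_days: float


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(item: Item, query_vec: list[float], query_terms: set[str]) -> float:
    similarity = cosine(item.embedding, query_vec)
    overlap = len(item.keywords & query_terms) / len(query_terms) if query_terms else 0.0
    recency = 0.5 ** (item.age_days / 30.0)   # halves every 30 days (assumed)
    usage = min(item.usage_count, 10) / 10    # capped so it cannot dominate
    return 0.5 * similarity + 0.3 * overlap + 0.1 * recency + 0.1 * usage


def recall(items: list[Item], query_vec: list[float], query_terms: set[str], k: int = 3) -> list[Item]:
    ranked = sorted(items, key=lambda it: relevance(it, query_vec, query_terms), reverse=True)
    return ranked[:k]


def render_block(items: list[Item]) -> str:
    """Build the text that gets prepended to the system prompt."""
    lines = ["=== WHAT YOU REMEMBER ABOUT THIS USER ==="]
    lines += [f"- {item.content}" for item in items]
    return "\n".join(lines)


if __name__ == "__main__":
    store = [
        Item("user prefers metric units", [1.0, 0.0], {"metric", "units"}, 3, 2.0),
        Item("last refund order id is #4412", [0.0, 1.0], {"order", "4412"}, 0, 40.0),
    ]
    print(render_block(recall(store, [0.9, 0.1], {"units"}, k=1)))
```

The keyword term is what rescues exact identifiers: when the query embedding is ambiguous, a literal match on a token such as an order number still lifts the right item to the top.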
Scope memory tightly

Memory is per (user, agent). Never share an LTM store across users — that's a privacy incident waiting to happen. Cross-agent sharing is fine only when the agents are part of the same product surface; otherwise namespace by agent id.

Best practices that separate working memory from broken memory

  • Decide the lifetime first. STM vs LTM is not a tooling question — it's a product question. Will this still be true tomorrow?
  • Default to hybrid STM. Last N turns verbatim + rolling summary. Resist the urge to ship a pure buffer.
  • Make extraction picky. Greetings, acknowledgements, and PII-shaped strings should never reach the store. Use a strict JSON schema and a low temperature.
  • Cap and decay. Hard cap per agent (e.g. 200 items). Prune by score × recency × usage_count — old, never-used items should die.
  • Hybrid recall beats pure vector. Combine keyword overlap with embedding similarity; rerank with recency. Pure cosine search loses to user IDs, SKUs, and proper nouns.
  • Treat memory as state, not magic. Version it, snapshot it, let users view and delete it. GDPR's right to erasure is not optional.
  • Evaluate it. Maintain a small golden set of recall questions. Gate releases on it like any other regression test.
  • Log every injection. When a memory item lands in the prompt, log which one and why. The first time the agent says something weird, you'll want that trace.
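The "cap and decay" practice above might look like the following sketch. The retention formula follows the score × recency × usage_count idea from the list, with two assumptions added for the example: recency is an exponential decay with a 30-day half-life, and usage is counted as `1 + usage_count` so that a brand-new item is not scored zero. The dictionary keys are hypothetical.

```python
def retention(item: dict, half_life_days: float = 30.0) -> float:
    """Higher means more worth keeping."""
    recency = 0.5 ** (item["age_days"] / half_life_days)
    return item["score"] * recency * (1 + item["usage_count"])


def prune(items: list[dict], cap: int) -> list[dict]:
    """Keep at most `cap` items, dropping the lowest-retention ones first."""
    return sorted(items, key=retention, reverse=True)[:cap]


if __name__ == "__main__":
    items = [
        {"id": "a", "score": 0.9, "age_days": 1, "usage_count": 5},
        {"id": "b", "score": 0.9, "age_days": 400, "usage_count": 0},  # old, never used
        {"id": "c", "score": 0.4, "age_days": 10, "usage_count": 1},
    ]
    print([item["id"] for item in prune(items, cap=2)])
```

Running this on a schedule, per (user, agent), keeps the store bounded without anyone having to decide by hand which memories to drop.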
Memory is a privacy contract

Anything an agent remembers, the user can ask you to forget. Build the delete-by-user flow on day one, not after the first support ticket. AgentSwarms exposes this in the agent settings; in your own stack, the minimum is an endpoint, also callable from a cron job, that runs DELETE FROM memory_items WHERE user_id = $1.
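A minimal sketch of that delete-by-user flow, using an in-memory SQLite database so it runs anywhere. The table layout and the `forget_user` and `demo_connection` names are assumptions for illustration; note that SQLite uses `?` placeholders where the Postgres-style statement above uses `$1`.

```python
import sqlite3


def forget_user(conn: sqlite3.Connection, user_id: str) -> int:
    """Delete every memory item for one user; return how many rows were removed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM memory_items WHERE user_id = ?", (user_id,)
        )
    return cursor.rowcount


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database with a hypothetical memory_items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memory_items (id INTEGER PRIMARY KEY, user_id TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO memory_items (user_id, content) VALUES (?, ?)",
        [
            ("priya", "prefers metric units"),
            ("priya", "lives in Bangalore"),
            ("sam", "works in IST"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print("deleted:", forget_user(conn, "priya"))
```

Returning the row count gives you something to log and to show the user, which is useful evidence when an erasure request needs to be confirmed.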

Testing memory inside AgentSwarms

The fastest way to build intuition for memory is to watch it work — and watch it fail. AgentSwarms exposes the whole pipeline so you can prod it without writing infrastructure.

  1. Open the Playground and pick (or create) an agent. In the agent's settings, expand Memory, toggle the STM strategy (window / summary / hybrid), and enable Long-term memory.
  2. Have a multi-turn conversation. Tell the agent two or three things about yourself: a preference, a fact, a small task you want it to remember. Send a few unrelated turns after.
  3. Watch the trace. The request inspector on the right shows, for each request, the system prompt the model actually saw, including any [WHAT YOU REMEMBER] block and the rolling summary. If a fact is there, the model has it. If it isn't, that's your bug.
  4. Open a fresh conversation. Ask the agent something that requires one of the earlier facts. Recall should fire, the trace should show the injected item, and the answer should land.
  5. Try to break it. Tell the agent contradictory things. Use throwaway facts. Ask after a long delay. The Failure-Mode Labs include a context-loss lab specifically for this.
  6. Wipe and re-run. Clearing memory from the agent settings should make the next turn forget: a one-click verification that nothing is leaking from a hidden cache.
  • What units does the user prefer? Expects: metric. Result: MISS
  • What's their last refund order id? Expects: #4412. Result: MISS
  • Which timezone do they work in? Expects: IST (UTC+5:30). Result: PASS
Recall@1: 33%. Gate the deploy on this.
A tiny golden recall set. In the interactive version, flipping 'after memory tuning' shows Recall@1 jump. Gate your deploys on this, the same way you gate code on tests; silent recall regressions are how memory subsystems quietly rot.
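A golden-set gate of this kind can be sketched as follows. Everything here is illustrative: `recall_at_1`, `passes_gate`, the 0.9 threshold, and the stubbed `stub_top_memory` function are assumptions, and a real harness would call your actual recall pipeline instead of the stub.

```python
from typing import Callable

# (question, substring the top recalled memory must contain)
GOLDEN_SET = [
    ("What units does the user prefer?", "metric"),
    ("What's their last refund order id?", "#4412"),
    ("Which timezone do they work in?", "IST"),
]


def recall_at_1(golden: list[tuple[str, str]], top_memory: Callable[[str], str]) -> float:
    """Fraction of questions whose top recalled memory contains the expected text."""
    hits = sum(
        1 for question, expected in golden
        if expected.lower() in top_memory(question).lower()
    )
    return hits / len(golden)


def passes_gate(score: float, threshold: float = 0.9) -> bool:
    return score >= threshold


def stub_top_memory(question: str) -> str:
    # Stand-in for the real recall step; it only knows one fact.
    canned = {"Which timezone do they work in?": "User works in IST (UTC+5:30)"}
    return canned.get(question, "")


if __name__ == "__main__":
    score = recall_at_1(GOLDEN_SET, stub_top_memory)
    print(f"Recall@1: {score:.0%}", "PASS" if passes_gate(score) else "BLOCK DEPLOY")
```

With the stub above only one of three questions is answered, which reproduces the 33% shown in the example set and fails the gate.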

Production patterns by framework

Every serious agent framework now ships memory primitives, but they make very different defaults. The one-line summary for each is what you'd tell a teammate in code review.

[Interactive figure: memory model by framework. Example card, LangGraph: short-term is the checkpointer (thread-scoped state); long-term is the Store API (cross-thread, namespaced by user); summary: first-class persistent state, the production pick.]
LangGraph's Store + checkpointer split is the most production-ready out of the box; CrewAI's defaults get you running fastest; LangChain gives you the most levers; OpenAI Agents SDK and Strands lean on you to wire the LTM yourself.

LangChain — explicit, composable, batteries-mostly-included

LangChain's modern memory story lives in the LangGraph integration; the classic memory classes (ConversationBufferMemory, ConversationSummaryMemory, ConversationBufferWindowMemory, VectorStoreRetrieverMemory) are deprecated in current releases, but on versions that still ship them they remain a clear way to learn the moving parts.

# Hybrid STM (window + summary) with vector-backed LTM in classic LangChain.
# Assumes `user_id` is defined by your application (e.g. from your auth layer).
from langchain.memory import (
    ConversationSummaryBufferMemory,
    VectorStoreRetrieverMemory,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

llm = ChatOpenAI(model="gpt-5-mini", temperature=0)

# STM: keep recent turns verbatim, summarize older ones above a token budget.
stm = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1500,
    memory_key="chat_history",
    return_messages=True,
)

# LTM: per-user vector store of durable facts.
store = Chroma(
    collection_name=f"ltm_user_{user_id}",
    embedding_function=OpenAIEmbeddings(),
)
ltm = VectorStoreRetrieverMemory(
    retriever=store.as_retriever(search_kwargs={"k": 5}),
    memory_key="long_term_memory",
)

# Use both: inject `{chat_history}` and `{long_term_memory}` into your prompt.

LangGraph — first-class persistent memory

LangGraph splits memory cleanly: a checkpointer persists per-thread state (your STM) and a Store persists cross-thread, namespaced facts (your LTM). This is the architecture closest to what you'd build yourself for production.

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.store.postgres import PostgresStore
from langgraph.prebuilt import create_react_agent

# POSTGRES_URL and user_id are assumed to come from your application config.
# from_conn_string() returns a context manager, so open both inside `with`.
with PostgresSaver.from_conn_string(POSTGRES_URL) as checkpointer, \
     PostgresStore.from_conn_string(POSTGRES_URL) as store:
    # Create the backing tables; only needed on first run.
    checkpointer.setup()
    store.setup()

    # Checkpointer = STM (thread-scoped conversation state).
    # Store = LTM (cross-thread, namespaced by user).
    agent = create_react_agent(
        "openai:gpt-5-mini",
        tools=[...],
        checkpointer=checkpointer,
        store=store,
    )

    # Namespacing is how you scope memory:
    namespace = ("user", user_id, "preferences")
    store.put(namespace, "units", {"value": "metric"})

    # In a tool the agent can call to recall (semantic ranking by `query`
    # needs the store to be configured with an embedding index):
    hits = store.search(namespace, query="what units does the user prefer", limit=3)

CrewAI — memory behind one flag

CrewAI's pragmatic choice is to put memory behind a single flag and give you knobs. A crew with memory=True automatically uses short-term memory for the current run plus long-term memory across runs, with an entity store for proper nouns.

from crewai import Crew, Agent, Task

# researcher, writer, research_task, write_task and user_id are assumed to be
# defined elsewhere in your application.

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    memory=True,  # turns on STM + LTM + entity memory
    memory_config={
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
        "user_memory": {"user_id": user_id},  # scope LTM per user
    },
    embedder={"provider": "openai", "config": {"model": "text-embedding-3-small"}},
)

# Inspect / reset between runs:
crew.reset_memories(command_type="short")  # or "long", "entity", "all"

OpenAI Agents SDK — sessions for STM, BYO for LTM

import asyncio

from agents import Agent, Runner, SQLiteSession

agent = Agent(name="Assistant", instructions="...")


async def main(user_id: str) -> None:
    # Sessions = STM. Same session_id ⇒ same conversation across calls.
    session = SQLiteSession(session_id=f"user_{user_id}", db_path="sessions.db")

    await Runner.run(agent, "I prefer metric.", session=session)
    result = await Runner.run(agent, "What units do I prefer?", session=session)
    # The SDK rehydrates the conversation from the session, so the reply
    # should mention "metric".
    print(result.final_output)

    # LTM is on you: wire a vector store and inject recalled facts into
    # `instructions` before each Runner.run().


asyncio.run(main(user_id="u_123"))  # "u_123" is a placeholder user id

Strands — conversation managers + memory tools

from strands import Agent
from strands.agent.conversation_manager import SummarizingConversationManager
from strands_tools import memory  # built-in long-term memory tool

agent = Agent(
    # A plain string is read as a Bedrock model ID; use one enabled in your account.
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    # STM: keep last N turns, summarize older ones.
    conversation_manager=SummarizingConversationManager(
        summary_ratio=0.3,
        preserve_recent_messages=10,
    ),
    # LTM: tool-call into a vector backend.
    tools=[memory],
    system_prompt="Use the memory tool to recall facts about the user.",
)

agent("I live in Bangalore and prefer metric units.")
# → agent calls memory.store(...) on its own.
agent("What's the weather where I live?")
# → agent calls memory.retrieve(...) and answers in metric.

A production checklist

  1. STM strategy chosen deliberately (default: window + rolling summary).
  2. LTM extraction prompt is picky and rejects PII-shaped strings.
  3. Every memory item carries user_id, agent_id, kind, score, usage_count, timestamps.
  4. Hard cap per (user, agent); pruning runs on a schedule.
  5. Hybrid recall (keyword overlap + embedding similarity + recency).
  6. System prompt injection is logged per turn, so you can replay what the model saw.
  7. Golden recall set runs on every deploy; regressions block the release.
  8. User-facing memory viewer + one-click wipe; delete-by-user endpoint exists.
  9. Token costs for STM summarization and LTM extraction are tracked separately in your observability.
  10. Failure-mode lab in your dev loop: an agent with memory disabled is a useful baseline.

The takeaway

Memory is what makes an agent feel less like a tool and more like a colleague — and it is one of the highest-leverage subsystems you will build. The mechanics are not exotic: a small pipeline, a hybrid recall step, a strict scope, and an honest eval. Pick the right type, ship the right strategy, gate it on a golden set, and treat what the agent remembers with the same seriousness you treat anything else you store about a user.

Open the Playground, turn memory on, and have a real conversation with your agent. The first time it answers a question by remembering something you said three days ago, you'll understand why this is the work worth doing.

