Multi-Agent · Memory · Frameworks

The Agent That Remembers: Inside Hermes and Self-Improving Agents

Most agents wake up with amnesia every session and a system prompt frozen at write-time. Hermes — Nous Research's open-source, self-improving agent — does the opposite: it carries memory across sessions, turns its own successful runs into portable skills, and fans work out to sandboxed sub-agents. Here's how each piece actually works, and whether it's ready for your stack.

AgentSwarms Authors
June 3, 2026 · 17 min read

Watch someone use a coding assistant for a week and you'll notice a quiet tax. Every morning they re-explain the same things: this is the stack, this is the convention, no, don't be verbose, here's the project we discussed yesterday. The model is brilliant and amnesiac — it solves the problem in front of it and then forgets you exist. Tomorrow you start over. The thing that should be getting better at working with you simply can't, because nothing it learns survives the session.

That's the gap Hermes is built to close. Released open-source by Nous Research in early 2026 — and improbably popular, north of 50,000 GitHub stars within weeks — it's an agent designed around a single thesis: an agent should accumulate. It should remember across sessions, turn the things that worked into reusable skills, and parallelize hard work across sandboxed helpers. This post is a tour of how it does each of those, with an honest read on where it's genuinely useful and where you should keep your guard up.

Self-improving, concretely

Strip away the marketing and a 'self-improving agent' is one with a loop: it observes what happened, notices a pattern worth keeping, writes that pattern down where its future self will find it, and is faster next time. Hermes is interesting because it makes all three steps — observe, persist, retrieve — first-class and automatic.

The two ceilings: static prompts and stateless sessions

Two design choices quietly cap what most agents can become. The first is the static system prompt: a block of instructions frozen the moment you wrote it. It can't notice that you always want TypeScript over JavaScript, or that last week's deploy needed a specific incantation. The second is the stateless session: when the conversation ends, the agent's working knowledge evaporates. Together they guarantee a cold start, forever.

The same three sessions, two ways: a static-prompt agent is a stranger every time, while one with evolving memory picks up where you left off. With evolving memory, the week looks like this:
  • Monday: learns your name, stack (TanStack + Supabase), and that you hate verbose replies.
  • Wednesday: recalls the stack and reuses a deploy skill it wrote on Monday, with no re-explaining.
  • Friday: knows the project history and your preferences, and picks up mid-thought.

Evolving memory compounds: each session starts where the last one ended, so the agent gets more useful the longer you work with it.

Angle 1 — Evolving memory vs static system prompts

The obvious objection to 'just remember everything' is that you can't. Context windows are finite and expensive; dumping every past transcript into the prompt is both impossible past a point and ruinous before it. Hermes' answer is that memory isn't one thing — it's a small stack of layers, each with a different job and a different loading strategy.

Four tiers, one principle: keep the always-on context tiny, and fetch everything else only when it's actually needed. Only the small top layer is always in context. The session archive, for example, is searched on demand (~10ms over 10k+ docs), and only an LLM-summarized hit is injected, so the window stays lean. The four layers are:
  • Prompt memory (MEMORY.md + USER.md) — a tiny, always-loaded brief capped at roughly 3,575 characters across both files. The cap is the point: it forces the agent to keep only what's genuinely durable, like a sticky note rather than a diary.
  • Session archive (SQLite + FTS5) — every session is written to a local database and full-text indexed. Rather than re-reading old chats, the agent searches them in about 10ms and pulls back only what's relevant.
  • Skills (~/.hermes/skills/) — procedural memory as markdown files (more on these next).
  • User model (Honcho, optional) — a passive, opt-in model of your preferences and style built up over time.
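
To make the session-archive layer concrete, here is a minimal sketch of "search instead of load" using Python's built-in sqlite3 module and an FTS5 virtual table. The table name, column names, and helper functions are hypothetical and chosen for illustration; this is not Hermes' actual schema. It also assumes the local SQLite build has FTS5 compiled in, which is common but not guaranteed.

```python
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a session archive with an FTS5 full-text index."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: session_id is stored but not indexed for search.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS sessions "
        "USING fts5(session_id UNINDEXED, content)"
    )
    return conn


def archive_session(conn: sqlite3.Connection, session_id: str, content: str) -> None:
    """Write one finished session into the archive."""
    conn.execute(
        "INSERT INTO sessions (session_id, content) VALUES (?, ?)",
        (session_id, content),
    )
    conn.commit()


def search_archive(conn: sqlite3.Connection, query: str, limit: int = 3):
    """Return (session_id, excerpt) pairs for the best-ranked matches.

    snippet() returns a short excerpt of column 1 (content) with the
    matched terms wrapped in square brackets.
    """
    rows = conn.execute(
        "SELECT session_id, snippet(sessions, 1, '[', ']', '...', 12) "
        "FROM sessions WHERE sessions MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Only the short excerpts returned by `search_archive` would be candidates for the prompt; in the flow the post describes, an LLM would condense them further before injection. Note that FTS5 has its own query syntax, so raw user text may need quoting before it is passed to `MATCH`.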

Who decides what's worth keeping? The agent does.

The clever bit is curation. At intervals during a session, Hermes fires a periodic nudge: an internal, system-level prompt that asks the agent to look back at what just happened and decide whether anything is worth persisting. Memory isn't a transcript dump; it's an editorial act the agent performs on itself. When a conversation grows long, a separate sentinel triggers compression: an auxiliary model extracts the keep-worthy facts into that tight character budget and summarizes the middle of the conversation, while a lineage chain in SQLite preserves traceability.

# MEMORY.md  (always loaded · ~3,575 char budget shared with USER.md)

## Project: agentswarms
- Stack: TanStack Start + Supabase; deploys via Lovable git push.
- Conventions: run prettier + tsc before pushing; never commit routeTree.gen.ts.

## Decisions
- 2026-06-01: chose Strands SDK export over a custom runtime.

## Preferences
- Terse answers. Show the diff, skip the preamble.

The real innovation is restraint

It would be easy to build an agent that hoards everything. The hard, useful thing is an agent that keeps almost nothing in always-on context and gets very good at fetching the rest. The character cap isn't a limitation to route around — it's the mechanism that keeps the agent's 'working memory' sharp.
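
The cap is easy to reason about in code. The sketch below checks whether the two prompt-memory files fit the shared budget of roughly 3,575 characters cited above. The function name is hypothetical, and counting with `len()` on the decoded text is an assumption; the post does not say exactly how Hermes measures the budget.

```python
PROMPT_MEMORY_BUDGET = 3575  # approximate combined cap for MEMORY.md + USER.md


def check_budget(memory_text: str, user_text: str,
                 budget: int = PROMPT_MEMORY_BUDGET) -> tuple[bool, int]:
    """Return (fits, remaining) for the combined prompt memory.

    `remaining` is negative when the two files together exceed the budget,
    which is the signal to prune or compress before saving.
    """
    used = len(memory_text) + len(user_text)
    return used <= budget, budget - used
```

A negative `remaining` is the cue to drop or condense an entry rather than to raise the limit, which is the restraint the cap is meant to enforce.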

Persistence without a bloated context window

This is the question that matters for anyone who's watched their token bill climb: how do you remember a month of work without paying for a month of tokens on every call? Hermes' answer is three mechanisms working together — search instead of load (FTS5 over the archive), summarize before inject (an LLM condenses a hit before it enters context), and progressive disclosure for skills (only short summaries load until one is actually needed). The result is a context footprint that stays nearly flat as your history grows.
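
A back-of-the-envelope model shows why the footprint stays flat. The two functions below are a simplified sketch, not Hermes' accounting: the parameter names are hypothetical, and the token counts you pass in are your own estimates.

```python
def naive_footprint(n_sessions: int, tokens_per_session: int, always_on: int) -> int:
    """Load everything: every past session is pasted into the prompt."""
    return always_on + n_sessions * tokens_per_session


def retrieval_footprint(n_sessions: int, always_on: int,
                        hits: int, tokens_per_summary: int) -> int:
    """Search, then inject: only a few summarized hits enter the prompt."""
    loaded = min(hits, n_sessions)  # cannot load more hits than sessions exist
    return always_on + loaded * tokens_per_summary
```

With an assumed 3,000 tokens per session, a 1,000-token always-on brief, and three 500-token summaries per call, twelve sessions cost 37,000 tokens the naive way and 2,500 with retrieval. At a hundred sessions the naive figure is 301,000, while the retrieval figure is still 2,500.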

An example from the comparison, with 12 past sessions plus skills to recall against a 64.0K context window:
  • Load everything (naive): 42.4K / 64.0K
  • Hermes (search + summarize): 5.5K / 64.0K

The naive approach grows linearly with history and eventually overflows the window. Hermes stays nearly flat by searching the archive, summarizing the relevant hits, and loading skills only on demand: persistence without the token bill.

Angle 2 — The agentskills.io standard

Remembering facts is half the story. The bigger lever is remembering how to do things. When Hermes works through a gnarly task — scrape this site, reshape the data, recover from that rate-limit — the sequence of steps that finally worked is valuable. Most agents discard it. Hermes captures it as a skill.

Skills follow agentskills.io, an open standard for packaging agent capabilities that Anthropic published in late 2025 and that's since been adopted across 20+ platforms. A skill is just a folder with a SKILL.md file: YAML frontmatter that says what this is and when to use it, followed by a markdown playbook that says what to actually do. Optionally it carries scripts/, references/, and assets/ alongside.

From trajectory to tool: the agent completes a non-trivial workflow across several tool calls (say, scraping a site and reshaping the data), and the successful trajectory isn't thrown away. It's distilled into a searchable, portable SKILL.md that the agent, or any agentskills.io-compatible agent, can reuse.

Capture isn't indiscriminate. A skill is created when the run clears a meaningful bar — five or more tool calls, a recovery from an error, a user correction, or a non-obvious workflow that worked. Those are precisely the moments where hard-won procedure is worth saving. Importantly, Hermes captures both code trajectories and browser trajectories, so 'how I automated that web form' becomes as reusable as 'how I parsed that log format.'
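
The capture bar can be expressed as a small predicate. This is a sketch of the criteria as stated above; the `RunSummary` type and its field names are hypothetical, and how Hermes actually detects a "non-obvious workflow" is not specified in this post.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one finished agent run."""
    tool_calls: int
    recovered_from_error: bool = False
    user_corrected: bool = False
    non_obvious_workflow: bool = False


def should_capture_skill(run: RunSummary, min_tool_calls: int = 5) -> bool:
    """True when any one of the four capture criteria is met."""
    return (
        run.tool_calls >= min_tool_calls
        or run.recovered_from_error
        or run.user_corrected
        or run.non_obvious_workflow
    )
```

Any single criterion is enough, so a two-call run that needed a user correction qualifies, while a two-call run that simply worked does not.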

---
name: scrape-and-reshape
description: Scrape an HTML table and emit clean, typed CSV. Use when the user
  wants tabular data off a web page.
version: 1.0.0
platforms: [linux, macos]
metadata:
  hermes:
    tags: [web, data, automation]
    category: scraping
    requires_toolsets: [terminal, browser]
---

## Steps
1. Open the URL with the browser tool; wait for the table to render.
2. Extract rows; coerce numeric columns, strip thousands separators.
3. Validate row count against the page's stated total before writing.
4. Emit CSV to ./out/ and report the path.

Edits prefer patches, not rewrites

When a skill needs to improve, Hermes defaults to a targeted patch — swapping an old string for a new one — rather than regenerating the whole file. It's safer (less chance of clobbering a working step), cheaper in tokens, and leaves a cleaner history of how the skill evolved.
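
A targeted patch can be as simple as a guarded string replacement. The helper below is a sketch of the idea under one added assumption: it insists that the old string appears exactly once, so an ambiguous patch fails loudly instead of editing the wrong step. The function name is hypothetical, and this is not Hermes' actual patching code.

```python
def patch_skill(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in a skill file."""
    if not old:
        raise ValueError("the old string must not be empty")
    count = text.count(old)
    if count != 1:
        # Zero matches means the skill drifted; several mean the patch is ambiguous.
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(old, new, 1)
```

Everything outside the matched string is returned untouched, which is where the safety and token savings come from.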

Two properties make this more than a private cache. First, portability: because the format is a shared standard, a skill Hermes wrote can run unchanged in other compatible agents — Claude, Codex, Cursor, and the rest — with no conversion step. Your procedural memory isn't trapped in one tool. Second, progressive disclosure: only the skill's name and one-line summary sit in context by default; the full body loads only when the agent judges it relevant. That's why a library of a hundred-plus skills doesn't translate into a hundred-plus skills' worth of tokens on every turn.
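
Progressive disclosure is straightforward to sketch: read only `name` and `description` from the frontmatter for the always-on index, and fetch the body later. The parser below is a deliberately minimal, standard-library-only illustration with hypothetical function names. It handles flat keys and indented continuation lines, not full YAML, so a real implementation would use a proper YAML parser.

```python
def read_skill_summary(skill_md: str) -> dict[str, str]:
    """Return only `name` and `description` from a SKILL.md frontmatter block."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    summary: dict[str, str] = {}
    current_key = None
    for line in lines[1:]:
        if line.strip() == "---":  # closing frontmatter line: stop before the body
            break
        if line[:1] in (" ", "\t"):
            # Indented line: either a wrapped value or a nested key. Keep it
            # only when it continues a key we are collecting.
            if current_key in summary and line.strip():
                summary[current_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        current_key = key.strip()
        if sep and current_key in ("name", "description"):
            summary[current_key] = value.strip()
    return summary


def skill_body(skill_md: str) -> str:
    """Return the markdown playbook after the frontmatter (loaded on demand)."""
    parts = skill_md.split("\n---\n", 1)
    return parts[1].strip() if len(parts) == 2 else ""
```

An index built from `read_skill_summary` costs a line or two per skill, and `skill_body` is called only for the skill the agent decides to use.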

A skill is executable instructions — treat shared ones as code

Hermes ships a security scanner that checks community skills for exfiltration, injection, and supply-chain tricks, and that's a good baseline. But a skill is a set of instructions an agent will follow with real tools. Review third-party skills before enabling them, exactly as you'd review a dependency you're about to npm install.

Angle 3 — Parallel, sandboxed sub-agents

Some tasks are naturally several tasks. 'Research these five competitors,' 'run this pipeline across four datasets,' 'check each of these endpoints' — a single thread plods through them one at a time, and the wall-clock is the sum. Hermes can instead spin up isolated sub-agents, each with its own conversation, its own sandboxed terminal, and its own Python RPC session, dispatch the sub-tasks in parallel, and aggregate the results when they return.
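
The dispatch-and-aggregate shape can be sketched with asyncio. Here each "sub-agent" is a stand-in coroutine that just sleeps; real sub-agents would make model and tool calls inside their own sandboxes. All names are hypothetical, and this illustrates the concurrency pattern only, not Hermes' orchestrator.

```python
import asyncio
import time


async def run_subtask(name: str, seconds: float) -> str:
    """Stand-in for one sub-agent: pretend to work, then report back."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def fan_out(tasks: dict[str, float]) -> list[str]:
    """Dispatch every sub-task concurrently and gather results in input order."""
    return await asyncio.gather(
        *(run_subtask(name, seconds) for name, seconds in tasks.items())
    )


def timed_fan_out(tasks: dict[str, float]) -> tuple[list[str], float]:
    """Run the fan-out and return (results, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fan_out(tasks))
    return results, time.perf_counter() - start
```

Because the stand-ins wait concurrently, the elapsed time tracks the longest sub-task rather than the sum of all of them, which is the wall-clock property described above.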

Sequential vs parallel, for 3 sub-tasks: run sequentially, the wall-clock is the sum of every step; with sub-agents it is bounded by the slowest sub-task (4 units in the example). The main thread dispatches, then aggregates, and never blocks on one long step.

The 'isolated' part is doing real work. Each sub-agent's execution runs in a Docker sandbox with a read-only root filesystem, dropped Linux capabilities, namespace isolation, and PID limits — so one helper running untrusted code can't trample the host or its siblings. The main thread becomes a coordinator: it delegates, the children grind in isolation, and the orchestrator stitches the answers back together without ever stalling on a single long step.
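
For a sense of what that hardening looks like in practice, the sketch below builds a `docker run` command line that uses standard Docker flags for a read-only root filesystem, dropped capabilities, and a PID limit. It only constructs the argument list and does not run anything. The function name, image, and limit value are illustrative assumptions, and this is not the invocation Hermes itself uses.

```python
def sandbox_command(image: str, command: list[str], pids_limit: int = 128) -> list[str]:
    """Build an argv for a hardened, throwaway container (illustrative only)."""
    return [
        "docker", "run", "--rm",
        "--read-only",                    # read-only root filesystem
        "--cap-drop", "ALL",              # drop all Linux capabilities
        "--pids-limit", str(pids_limit),  # cap the number of processes
        image,
        *command,
    ]
```

Namespace isolation needs no flag here because containers get their own namespaces by default. The resulting list could be handed to `subprocess.run` on a machine where Docker is installed.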

Parallelism isn't free — fan out on purpose

Every sub-agent is its own set of model calls, so a wide fan-out multiplies cost and adds coordination overhead. The win shows up when sub-tasks are genuinely independent and individually slow. For a quick three-step chain, a single thread is cheaper and simpler — reach for sub-agents when the work is embarrassingly parallel, not by reflex.

Putting it together: the flywheel

Memory, skills, and sub-agents aren't three separate features — they're one loop. A request comes in, relevant history and skills are retrieved, the agent reasons and acts (fanning out if it helps), a periodic nudge prompts it to write down what it learned, and the result is persisted so the next pass starts smarter. That's the self-improvement flywheel.

The loop starts at Receive: a request arrives through any connected channel (CLI, Telegram, Discord…). The Document stage is the one most agents skip, and it's the one that compounds; it is what makes the loop self-improving.

The honest numbers

Nous reports an agent using its own accumulated skills completing repeated research tasks roughly 40% faster than a fresh instance, with no prompt tuning. The reflection machinery isn't free: expect on the order of 15–25% extra tokens for the privilege. Whether that trade pays off is a direct function of how repetitive your workload is.
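
The trade can be roughed out with two lines of arithmetic. The sketch below rests on two assumptions that the reported figures do not settle: "40% faster" is read as 40% less wall-clock time on repeated task classes only, and the reflection overhead is applied to every run. The function names are hypothetical.

```python
def relative_time(repeat_fraction: float, speedup: float = 0.40) -> float:
    """Wall-clock time relative to a fresh agent (1.0 means no change).

    `repeat_fraction` is the share of your workload that repeats a task
    class the agent has already built skills for.
    """
    if not 0.0 <= repeat_fraction <= 1.0:
        raise ValueError("repeat_fraction must be between 0 and 1")
    return 1.0 - speedup * repeat_fraction


def relative_tokens(overhead: float) -> float:
    """Token spend relative to a fresh agent, given the reflection overhead."""
    return 1.0 + overhead
```

Under these assumptions, a workload that is half repetitive runs in about 80% of the baseline time while spending 115% to 125% of the baseline tokens; a workload with no repetition pays the overhead and gets nothing back.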

Where it earns its keep — use cases

  • Recurring research & monitoring — multi-week topic tracking where the agent's skills and memory sharpen with each pass; the 40% speed-up lands hardest here.
  • Scheduled automation — a built-in cron scheduler drives daily summaries, data pulls, or CI/CD notifications that benefit from cross-session knowledge.
  • A genuinely personal assistant — connect Telegram, Discord, or WhatsApp and the agent maintains context across channels; start on your phone, continue at your desk.
  • Pipeline fan-out — independent, slow sub-tasks (per-dataset processing, per-competitor research) handled by parallel sub-agents.
  • Training-data generation — batch trajectory generation (via the Atropos RL framework) for fine-tuning, turning the agent's runs into datasets.

Is it ready for the enterprise? An honest scorecard

This is where excitement meets procurement. Hermes has real structural advantages for serious use — and real gaps you'd have to engineer around. Pretending otherwise helps no one, so here's the balanced view.

Enterprise feasibility, dimension by dimension: green where it shines (self-hosting and data control, sandboxed isolation, portability with no lock-in), amber-to-red where you'll do extra work (auditability, stability, cross-domain transfer). On the first of those, Hermes is MIT-licensed and runs on your own machine, VPS, or Docker. Overall it is a capable tool to adopt with eyes open, not a turnkey enterprise platform yet.

The green column is legitimately strong. It's MIT-licensed and self-hostable, so your data and the agent's memory can stay entirely inside your perimeter — a rare and valuable property for regulated teams. Execution is sandboxed by default with sensible Docker hardening. And because skills ride the open agentskills.io standard, you're not locked in: the procedural memory you accumulate is portable.

The cautions are just as real. Memory is opaque — it's hard to audit exactly what the agent has learned or why it did something, which is a problem for compliance-heavy contexts until you wrap it in your own logging. Improvement is domain-specific: gains on one task class don't transfer to another, so plan for per-domain skill libraries rather than one omniscient agent. The codebase is young and fast-moving (several minor versions in its first couple of months), so pin versions and expect churn. And community skills, while scanned, are still executable instructions that warrant review.

The verdict

Hermes is a capable tool to adopt with eyes open, not a turnkey enterprise platform. If your workload is repetitive, your data needs to stay in-house, and you have the engineering maturity to add auditing and version discipline around it, the memory-and-skills model can pay for itself. If you need an SLA, deep auditability, and frozen APIs today, treat it as a preview of where the field is heading.

Best practices if you run it

  1. Curate MEMORY.md like a sticky note, not a diary: the character cap is a feature; respect it.
  2. Let skills accumulate on real work, then prune ruthlessly. A bloated, contradictory skill library hurts retrieval quality.
  3. Review every community/third-party skill before enabling it. It's executable instruction, full stop.
  4. Fan out to sub-agents only for genuinely independent, slow sub-tasks; a single thread is cheaper for short chains.
  5. Budget for the reflection overhead (~15–25% tokens) and measure whether your workload is repetitive enough to earn it back.
  6. Pin the version. The project moves fast; reproducibility beats living on the edge.
  7. Wrap it in your own logging/observability if you need auditability; don't rely on the built-in memory being inspectable.
  8. Keep skills organized by domain, since learnings don't transfer across task types.

Where this lands in AgentSwarms

The patterns Hermes productizes are exactly the ones we teach hands-on. Memory layering, procedural skills, and sub-agent fan-out are design decisions you can prototype on the swarm canvas before committing to a runtime — wire an orchestrator that dispatches to parallel workers, see where context accumulation bites in the Failure-Mode Labs, and export the architecture to a framework when the shape is right. The point isn't to rebuild Hermes; it's to understand the moving parts well enough to choose them deliberately.

A note on scope

AgentSwarms is a learning and prototyping platform, not a production agent runtime. We don't run Hermes for you — this post is here so that when you do reach for a self-improving agent, you know what its memory is actually doing, why its skills are portable, and what you're signing up for operationally.

The era of the amnesiac assistant is ending. The agents that matter next won't just be smarter in the moment — they'll be the ones that remember the last moment, keep what worked, and arrive at tomorrow's problem already knowing something about it. Hermes is an early, opinionated, refreshingly open bet on that future. Run it with curiosity and a little caution, and it'll show you what an agent that accumulates actually feels like.

