
How to Give Gemma 4 Memory Between Sessions

With Gemma 4's 256k-token window, the real challenge isn't the context size itself but surviving the end of a session without losing facts. For practical AI implementation, relying solely on summaries is a mistake. The working solution is a hybrid: local RAG, a memory structure, and concise summaries.

Technical Context

I'd immediately discard the idea of "just keeping everything in 256k." That only looks good on paper. For a gaming assistant that needs facts from old sessions, this setup breaks the moment a new game starts or the old story no longer fits.

I've seen the same thing repeatedly in AI implementation projects: summarization saves the context window but gradually kills accuracy. By the third or fourth session, the model remembers not the history, but a summary of a summary. And that's where silent amnesia begins.

Practically speaking, I would build memory in three layers. The first layer is the "hot" context of the current session. The second is a compact state-summary: characters, quests, inventory, unfinished branches, world rules. The third is a local RAG over raw events from past sessions, not just a single markdown file cobbled together.
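The three layers can be sketched as a single container, one field per layer. This is a minimal illustration, not a production design; the field names and sample values are my own assumptions:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three memory layers:
# hot context, compact state-summary, and raw event log for RAG.
@dataclass
class SessionMemory:
    hot_context: list = field(default_factory=list)    # layer 1: current session turns
    state_summary: dict = field(default_factory=dict)  # layer 2: characters, quests, rules
    event_log: list = field(default_factory=list)      # layer 3: raw events, later indexed

mem = SessionMemory()
mem.hot_context.append("Player enters the Iron Keep.")
mem.state_summary["quests"] = ["Find the lost amulet"]
mem.event_log.append(
    {"session_id": 1, "npc": "blacksmith", "event": "sold the player a sword"}
)
```

The point of the split is that each layer has a different lifetime: the hot context dies with the session, the summary is rewritten, and the event log only ever grows.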

So, it's not about "exporting to md and chunking it somehow," but about a proper event-based memory. Every significant event is written as a separate entry: who did what, where, when, and with what consequences. Then I would index this using embeddings and add standard metadata filters: session_id, npc, location, quest, item.
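A minimal sketch of such an event store, with metadata filtering before similarity ranking. A real system would use an embedding model and a vector database; here a toy bag-of-words cosine stands in for the embeddings, and all names (`EventStore`, `embed`, the sample events) are assumptions for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EventStore:
    def __init__(self):
        self.events = []  # one entry per significant event

    def add(self, text, **meta):
        # Metadata keys: session_id, npc, location, quest, item, ...
        self.events.append({"text": text, "vec": embed(text), **meta})

    def search(self, query, top_k=3, **filters):
        # Filter by metadata first, then rank survivors by similarity.
        pool = [e for e in self.events
                if all(e.get(k) == v for k, v in filters.items())]
        qv = embed(query)
        return sorted(pool, key=lambda e: cosine(qv, e["vec"]), reverse=True)[:top_k]

store = EventStore()
store.add("The blacksmith repaired the broken sword",
          session_id=1, npc="blacksmith", location="forge")
store.add("The player accepted the amulet quest",
          session_id=2, quest="amulet", location="tavern")

hits = store.search("who fixed the sword", npc="blacksmith")
```

Filtering on metadata before ranking is what makes queries like "what did this NPC do in session 1" cheap and precise, instead of hoping the embedding alone surfaces the right session.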

Summaries are still necessary, but not as the sole source of truth. I'd update the summary at around 70-80% of the window capacity but keep it short and strictly structured. Not a literary retelling, but almost a JSON-brain: goals, facts, relationships, world changes.
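The refresh trigger and the "JSON-brain" shape might look like this. The threshold constant and the summary keys are illustrative assumptions, not a fixed schema:

```python
WINDOW_TOKENS = 256_000  # Gemma 4's advertised context window
REFRESH_AT = 0.75        # refresh the summary at ~70-80% capacity (assumed midpoint)

def needs_summary_refresh(used_tokens):
    # True once the hot context approaches the chosen threshold.
    return used_tokens >= WINDOW_TOKENS * REFRESH_AT

# Not a literary retelling: strictly structured state.
summary = {
    "goals": ["recover the amulet"],
    "facts": {"sword": "repaired by the blacksmith"},
    "relationships": {"blacksmith": "friendly"},
    "world_changes": ["forge reopened"],
}
```

Keeping the summary as fixed keys rather than free prose is what prevents the "summary of a summary" drift: each refresh rewrites values in place instead of compressing earlier compressions.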

If the infrastructure allows, it's better to run Gemma 4 through vLLM or a similar runtime with paged attention. This doesn't solve long-term memory on its own, but it greatly simplifies life with long context and the KV cache, especially if you have more than one active session.
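As a sketch, a vLLM launch for long-context serving might look like the following; the model name and context length are assumptions, not confirmed values:

```shell
# Hypothetical serve command; adjust the model ID and limits to your setup.
vllm serve google/gemma-4-27b-it \
  --max-model-len 262144 \
  --enable-prefix-caching
```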

What This Changes for Business and Automation

The main win here isn't that "the model got smarter," but that it stops forgetting critical details. For gaming assistants, support agents, CRM agents, and internal copilots, this is no longer a cosmetic fix but the foundation of AI automation.

Who wins? Those who need accuracy regarding past events: gaming projects, service teams, products with a long user lifecycle. Who loses? Those who hope to solve everything with a single summary and then wonder why their agent is confidently hallucinating.

I would do it this way: a summary for continuity, RAG for precise facts, and a separate state-store for entities and rules. These are exactly the kinds of things we build for clients at Nahornyi AI Lab when they need a live AI integration without memory gaps, not just a demo.

If your agent has already started to "forget" clients, tasks, or game states, don't try to fix it with another long prompt. It's better to layer the memory and build an AI solution development plan for your specific scenario. If you'd like, my team at Nahornyi AI Lab can help design it so the system remembers what's important, works locally, and doesn't fall apart after a couple of sessions.

Understanding how other local AI assistants address memory challenges provides valuable insights into overcoming LLM amnesia. For example, we've examined how Rust LocalGPT delivers persistent memory for a local assistant.