Delta-mem tackles a really annoying problem with how current LLMs deal with long contexts. Usually, when we want an agent or assistant to remember things over a long conversation, we just shove all the past text into the prompt. The problem is that standard attention gets quadratically more expensive as the context grows, and models often suffer from context rot, where they forget or ignore the stuff in the middle anyway. Other approaches like RAG or LoRA edits either bring in noisy retrieval steps or lock the memory into static weights that do not update well on the fly.

The authors built something called delta-mem, which keeps the main LLM completely frozen and bolts on a tiny dynamic memory state. Instead of saving raw text, it compresses the history into a really small 8x8 matrix acting as an associative memory. As new tokens come in, it updates this matrix with a delta learning rule: it checks how well the current memory already predicts the new information and writes only the residual difference into the state. There is also a forget gate so stale information decays naturally. When the model generates a response, it reads from this compressed state to tweak the query and output of the standard attention mechanism. It's a clever way to inject memory directly into the forward pass without touching the core weights.
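To make the delta rule concrete, here is a minimal PyTorch sketch of the write and read steps. The 8x8 state, the residual-based write, and the forget gate come from the description above; everything else (the class and layer names, the exact gating formulas, the key normalization) is my own assumption, not the paper's actual implementation.

```python
import torch

class DeltaMemory(torch.nn.Module):
    """Tiny associative memory state, updated with a delta rule (sketch)."""

    def __init__(self, d_model: int, d_mem: int = 8):
        super().__init__()
        # Small projections from the frozen backbone's hidden size down to the
        # memory dimension; these adapters are the only trainable pieces here.
        self.proj_k = torch.nn.Linear(d_model, d_mem, bias=False)
        self.proj_v = torch.nn.Linear(d_model, d_mem, bias=False)
        self.write_gate = torch.nn.Linear(d_model, 1)   # how strongly to write
        self.forget_gate = torch.nn.Linear(d_model, 1)  # how fast old content decays

    def update(self, M: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        """Write one hidden state h (d_model,) into the d_mem x d_mem matrix M."""
        k = torch.nn.functional.normalize(self.proj_k(h), dim=-1)  # key
        v = self.proj_v(h)                                         # value
        beta = torch.sigmoid(self.write_gate(h))                   # write strength in (0, 1)
        alpha = torch.sigmoid(self.forget_gate(h))                 # forget gate in (0, 1)
        pred = M @ k                    # what the memory already predicts for this key
        resid = v - pred                # only the unpredicted part gets written
        return alpha * M + beta * torch.outer(resid, k)

    def read(self, M: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        """Read from the compressed state; in practice this would be projected
        back up to d_model and added into the attention query/output."""
        return M.T @ torch.nn.functional.normalize(self.proj_k(h), dim=-1)
```

Normalizing the key and squashing both gates through sigmoids keeps the tiny state bounded even over long streams, which is roughly why the residual-style write can run for an entire conversation without blowing up.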

They also tested a few ways to write to this memory. You can update it token by token, which is great for local details but prone to noise. You can average a whole message segment and write that, which smooths things out and works well for stronger models. Or you can split the memory into multiple parallel states so facts and task progress do not overwrite each other, which turned out to be especially helpful for smaller backbones.
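A rough sketch of how those three write strategies might look on top of the update function above. The segment boundaries, the routing function, and all of the function names are hypothetical simplifications for illustration, not the paper's code.

```python
def write_per_token(mem, M, hidden_states):
    # Token-by-token writes: captures fine local detail, but every noisy token
    # lands in the state.
    for h in hidden_states:
        M = mem.update(M, h)
    return M

def write_per_segment(mem, M, hidden_states):
    # Average a whole message segment and do a single, smoother write.
    return mem.update(M, hidden_states.mean(dim=0))

def write_multi_state(mems, Ms, hidden_states, route):
    # Parallel states (e.g. one for facts, one for task progress) so that
    # writes to one slot do not clobber the other.
    for h in hidden_states:
        i = route(h)  # hypothetical router picking the slot for this token
        Ms[i] = mems[i].update(Ms[i], h)
    return Ms
```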

They tested it on Qwen models and it bumped the average scores significantly, especially on memory-heavy benchmarks like LoCoMo and Memory Agent Bench. The coolest finding is the context recovery test: they deleted the explicit textual history from the prompt, and the model could still answer multi-hop questions using just the compressed 8x8 state. That suggests we might not need massive million-token context windows if we can figure out how to compress and stream memory directly into the attention layers efficiently. Plus, the parameter overhead is microscopic, at roughly 0.12 percent of the backbone size.