Developer Tools – blogs.jutsu.ai

Every time your agent calls an LLM, it quietly resends the full conversation history. Turn 20 includes turns 1–19. Turn 50 includes turns 1–49. It’s invisible, automatic, and expensive.
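To make the growth concrete, here's a rough back-of-the-envelope sketch. It assumes every turn adds the same number of tokens (the 500 below is an illustrative figure, not a measurement), so request t carries t turns' worth of tokens.

```go
package main

import "fmt"

// cumulativeTokens returns the total tokens sent across `turns` requests when
// each request replays every earlier turn. tokensPerTurn is an assumed
// constant average; real turns vary in size.
func cumulativeTokens(turns, tokensPerTurn int) int {
	total := 0
	for t := 1; t <= turns; t++ {
		total += t * tokensPerTurn // request t carries turns 1..t
	}
	return total
}

func main() {
	for _, n := range []int{10, 20, 50} {
		fmt.Printf("%d turns: %d tokens sent in total\n", n, cumulativeTokens(n, 500))
	}
}
```

Under that assumption the total grows with the square of the session length: doubling the number of turns roughly quadruples the tokens sent.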

I noticed this while building Trooper—a Go proxy that sits between agents and LLMs. Watching token counts climb over a long debugging session made it clear: the agent kept replaying the same context. Most of it was noise.

The model didn’t need a transcript. It needed state.

What “state” actually means

After a few turns, what matters in a session usually fits into four buckets:

Decisions made — what was chosen and why
Constraints locked — what cannot change
Open loops — what still needs to be resolved
Ruled out — what was tried and rejected

That’s it. The back-and-forth, verbose explanations, and repeated context are replay. The model doesn’t need them again.

The SITREP

I added structured session memory to Trooper. After enough turns, Trooper’s local Llama model generates a SITREP—a situation report—from the user messages in the session.

It looks like this:

INTENT: Build a RAG pipeline with ChromaDB and nomic-embed-text
DECISIONS: Use cosine similarity over MMR — focused queries not broad;
           Chunk size 256, overlap 30 — locked;
           Pure vector search — ChromaDB no hybrid support;
           Top k set to 5
CONSTRAINTS: Node 18 locked — platform team constraint, no exceptions;
             Re-ranking ruled out — latency jumped 200ms to 800ms
OPEN: Poor recall on technical queries — nomic-embed-text struggles with domain jargon;
      Evaluating bge-small as alternative
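One way to hold a report like this in memory is a small struct with one field per label. This is a sketch, not Trooper's actual types: the SITREP struct, its field names, and the Render method are all hypothetical and simply mirror the labels in the example above.

```go
package main

import (
	"fmt"
	"strings"
)

// SITREP is a hypothetical in-memory form of the report shown above.
// The fields mirror the example's labels, not Trooper's real types.
type SITREP struct {
	Intent      string
	Decisions   []string
	Constraints []string
	Open        []string
}

// Render formats the report as labelled lines, joining items with "; ".
// Empty sections are skipped so the output stays compact.
func (s SITREP) Render() string {
	var b strings.Builder
	if s.Intent != "" {
		fmt.Fprintf(&b, "INTENT: %s\n", s.Intent)
	}
	sections := []struct {
		label string
		items []string
	}{
		{"DECISIONS", s.Decisions},
		{"CONSTRAINTS", s.Constraints},
		{"OPEN", s.Open},
	}
	for _, sec := range sections {
		if len(sec.items) > 0 {
			fmt.Fprintf(&b, "%s: %s\n", sec.label, strings.Join(sec.items, "; "))
		}
	}
	return b.String()
}

func main() {
	r := SITREP{
		Intent:      "Build a RAG pipeline with ChromaDB and nomic-embed-text",
		Decisions:   []string{"Chunk size 256, overlap 30", "Top k set to 5"},
		Constraints: []string{"Node 18 locked"},
	}
	fmt.Print(r.Render())
}
```

Keeping the sections as separate slices means one item can be added or dropped without re-parsing the rendered text.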

From that point forward, every request to the LLM sends:

Anchor (first 2 turns verbatim)
+ SITREP (structured state)
+ Tail (last N turns verbatim)

Instead of the full history.

The numbers

From a real 15-turn session:

Full history:    10,820 tokens per request
With Trooper:     1,157 tokens per request
Reduction:             89%

The dashboard shows this reduction live as the session runs.
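The percentage is simple arithmetic on the two token counts above. A small sketch, where reductionPercent is a hypothetical helper name:

```go
package main

import "fmt"

// reductionPercent returns how much smaller `after` is than `before`,
// as a percentage of `before`. A non-positive `before` yields 0.
func reductionPercent(before, after int) float64 {
	if before <= 0 {
		return 0
	}
	return float64(before-after) / float64(before) * 100
}

func main() {
	// Token counts from the 15-turn session above.
	fmt.Printf("%.1f%%\n", reductionPercent(10820, 1157)) // prints 89.3%
}
```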

Does the LLM still answer correctly?

This is the part that matters. Token savings are worthless if the model loses coherence.

To test it, I took the auto-generated SITREP, opened a completely fresh chat with no history, and asked questions about decisions made in the original session.

Questions:

What is the chunk size?
Why did we rule out hybrid search?
What retrieval method did we choose and why?
What is still open?

Result: All four were answered correctly. The model worked entirely from the SITREP. No history. No context bleed.

That’s the claim: structured state is sufficient for the model to continue reasoning correctly—and it costs 89% less to send.

How it works

Trooper is a Go proxy—one binary, no SDK, no instrumentation. Point your existing agent at it by changing one URL.

# Before
export ANTHROPIC_BASE_URL=https://api.anthropic.com

# After
export ANTHROPIC_BASE_URL=http://localhost:3000

Nothing else changes. Trooper intercepts every request, maintains session state, and when the SITREP is ready, rewrites the messages array before forwarding to the LLM.

The SITREP is built by a local Llama 3.1 8b model running via Ollama—fast, private, no cloud cost. The extraction happens asynchronously in the background. The main request path is not blocked.
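A common way to keep a slow extraction off the request path in Go is to run it in a goroutine and guard the result with a mutex. The sketch below assumes that shape. The Session type, RefreshAsync, and the Extractor function are hypothetical stand-ins, and the extractor here is a stub rather than a real Ollama call.

```go
package main

import (
	"fmt"
	"sync"
)

// Extractor stands in for the local model call that builds a SITREP.
type Extractor func(userMessages []string) string

// Session holds the latest SITREP and tracks any extraction in progress.
type Session struct {
	mu     sync.Mutex
	sitrep string
	done   chan struct{} // non-nil while an extraction is running
}

// SITREP returns the latest report, or "" if none is ready yet.
func (s *Session) SITREP() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.sitrep
}

// RefreshAsync starts an extraction in the background and returns at once.
// The returned channel is closed when the running extraction finishes; if one
// is already in flight, its channel is returned instead of starting another.
func (s *Session) RefreshAsync(userMessages []string, extract Extractor) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.done != nil {
		return s.done
	}
	done := make(chan struct{})
	s.done = done
	snapshot := append([]string(nil), userMessages...) // copy: caller may keep appending
	go func() {
		result := extract(snapshot) // slow call runs outside the lock
		s.mu.Lock()
		s.sitrep = result
		s.done = nil
		s.mu.Unlock()
		close(done)
	}()
	return done
}

func main() {
	s := &Session{}
	done := s.RefreshAsync([]string{"use chunk size 256"}, func(m []string) string {
		return "DECISIONS: " + m[0]
	})
	<-done // only the demo waits; a request handler would carry on immediately
	fmt.Println(s.SITREP())
}
```

Until the first extraction finishes, SITREP() returns an empty string, so a caller can keep forwarding full history in the meantime.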

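// GetTripleAnchor assembles what gets sent to the LLM:
// the anchor turns, the SITREP as a system message, then the tail turns.
func (s *SessionStore) GetTripleAnchor(sessionID string) []map[string]string {
    // Assumption: sessions live in a map on the store keyed by session ID;
    // the original snippet used `state` without showing this lookup.
    state, ok := s.sessions[sessionID]
    if !ok {
        return nil
    }
    // Copy the anchor so appending never mutates the stored slice.
    payload := append([]map[string]string{}, state.Anchor...)
    if state.SITREP != "" {
        payload = append(payload, map[string]string{
            "role":    "system",
            "content": fmt.Sprintf("[STATE_SITREP: %s]", state.SITREP),
        })
    }
    return append(payload, state.Tail...)
}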

The dashboard reports compression live:

HISTORY COMPRESSED    89%
TOKENS SAVED          459
CONFIDENCE            100%

Why this is different from conversation summarisation

Most summarisation tools compress what was said. The SITREP extracts what matters for the next action.

Copilot’s context compaction summarises the full conversation—useful for humans in long chats. The SITREP is structured specifically for agents: decisions, constraints, open loops, ruled-out paths. Not a narrative summary. A state snapshot.

The result: subsequent turns stay coherent on intent without replaying noise. This is especially relevant for agents running repeated structured workflows, more than for general chat.

The limitation

The SITREP works best for structured agentic workflows—debugging sessions, research pipelines, multi-step build tasks. For open-ended creative work where tangential context might matter later, you’ll want a larger tail window or higher-fidelity compression.

The tail window is configurable. You can keep more raw context for less structured sessions.

What else Trooper does

The compression is the latest addition. Trooper also:

Falls back to local Ollama when the cloud quota runs out, with context preserved across the switch
Routes simple turns to Ollama automatically, so the cloud is never contacted
Keeps sensitive requests local via x_force_local (privacy routing)
Serves a live dashboard showing intent, open loops, completed steps, and transcript
Supports subagent recovery: /recovery/{session_id} tells you exactly where to resume

All from one URL change.
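The three routing behaviours in that list reduce to one decision per turn. The sketch below is only a guess at how such a decision could be ordered. The Turn struct, its fields, and chooseBackend are hypothetical, and the way Trooper actually reads x_force_local or judges a turn "simple" isn't shown here.

```go
package main

import "fmt"

// Backend names the destination for a turn.
type Backend string

const (
	Local Backend = "local-ollama"
	Cloud Backend = "cloud"
)

// Turn holds the hypothetical inputs to a routing decision.
type Turn struct {
	ForceLocal     bool // assumed to come from the request's x_force_local flag
	QuotaExhausted bool // assumed to be set once the cloud reports a quota error
	Simple         bool // assumed result of some complexity heuristic
}

// chooseBackend applies the rules in priority order: privacy first,
// then quota fallback, then the cheap path for simple turns.
func chooseBackend(t Turn) Backend {
	switch {
	case t.ForceLocal:
		return Local // sensitive request: never leaves the machine
	case t.QuotaExhausted:
		return Local // cloud unavailable: fall back
	case t.Simple:
		return Local // simple turn: skip the cloud
	default:
		return Cloud
	}
}

func main() {
	fmt.Println(chooseBackend(Turn{ForceLocal: true}))
	fmt.Println(chooseBackend(Turn{Simple: true}))
	fmt.Println(chooseBackend(Turn{}))
}
```

Checking the privacy flag first means no later rule can send a sensitive request to the cloud.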

The bigger question

We often treat conversation history as memory. But a transcript is a log. Memory is state.

Humans don’t replay every prior conversation before deciding. They carry forward conclusions, constraints, unresolved questions, and relevant context—a structured snapshot, not a full transcript.

Long-running agents may need to do the same. Not just to save tokens—though that helps—but because state is a better abstraction for agent memory than history.

The SITREP is an experiment in that direction.

github.com/shouvik12/trooper — Go, MIT, zero dependencies beyond Ollama.

Cut Agent Token Usage by 89%—Without Touching the Agent