Most LLM cost is waste — context that didn’t need to be there, models too big for the task, reasoning that ran longer than it should. Here’s how to fix it, grounded in 2025 research, with a concrete open-source stack at the end.
The problem
Token cost and context bloat are the same problem: no mechanism deciding what information is worth keeping. It shows up three ways:
- Uncontrolled output. Thinking tokens are usually hidden from the response but billed as output tokens, at 3–5x the input price. Without a budget cap, the overbilling is silent.
- Context bloat. Every API call resends the full conversation from scratch. History accumulates, and quality degrades well before the window limit.
- No routing. Simple classification tasks go to frontier models at frontier prices.
Fix these in order — output, then input structure, then context management, then routing. The savings compound.
Context degrades faster than you think
Models advertise 128K–1M token windows. Effective context, the range where accuracy stays near baseline, is often only 50–65% of that.
| Benchmark | Finding | Venue |
|---|---|---|
| NoLiMa | 11/13 models fall below 50% of their short-context baseline at 32K | ICML 2025 |
| RULER (NVIDIA) | Effective = 50–65% of advertised | COLM 2024 |
| Context Rot (Chroma) | 18/18 models degrade; shuffled text beats coherent | 2025 |
| Lost in the Middle | >30% drop mid-context | TACL 2024 |
Don’t stuff documents in and hope. Treat context as a finite resource with non-linear quality decay.
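As a rough planning aid, the 50–65% figure can be turned into a token budget. This is a minimal sketch: `effective_budget` is a hypothetical helper, and the percentages are the rule of thumb from this section, not a property of any specific model.

```python
def effective_budget(advertised_tokens: int, percent: int = 50) -> int:
    """Conservative planning budget: a percentage of the advertised window."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return advertised_tokens * percent // 100


for window in (128_000, 1_000_000):
    low, high = effective_budget(window, 50), effective_budget(window, 65)
    print(f"{window:,} advertised: plan for roughly {low:,} to {high:,} tokens")
```

Measuring the real figure on your own task is still better than trusting a fixed percentage.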
Control what the model writes
Output tokens cost 3–5x more than input (sequential generation vs. parallel input processing). This is the highest-leverage place to start.
Cap max_tokens by task type. Classification: 16–32. JSON extraction: 128–256. Summarization: 256–512. Analysis: 512–1,024.
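One way to enforce these caps is a lookup table consulted before every call. A minimal sketch, assuming the upper end of each range above; the task names, `DEFAULT_MAX_TOKENS`, and the request shape are illustrative, not tied to a particular SDK.

```python
# Upper ends of the ranges listed above; tune per workload.
MAX_TOKENS_BY_TASK = {
    "classification": 32,
    "json_extraction": 256,
    "summarization": 512,
    "analysis": 1024,
}
DEFAULT_MAX_TOKENS = 512  # assumption: fallback for unlisted task types


def max_tokens_for(task_type: str) -> int:
    """Return the output cap for a task type."""
    return MAX_TOKENS_BY_TASK.get(task_type, DEFAULT_MAX_TOKENS)


def build_request(task_type: str, prompt: str) -> dict:
    """Keyword arguments for a chat-completion style call (shape is illustrative)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens_for(task_type),
    }
```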
Control thinking budgets. DeepSeek-R1 averages ~12K thinking tokens per math question. Without a cap, reasoning pipelines cost 10–30x what you’d expect from visible output alone.
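The arithmetic behind that multiplier is simple once thinking tokens are counted as billed output. A sketch under stated assumptions: the 500-token visible answer and the 2,000-token budget are made-up numbers, and only the ~12K thinking figure comes from the text.

```python
def cost_multiplier(visible_tokens: int, thinking_tokens: int) -> float:
    """How many times more output you pay for than you actually see."""
    return (visible_tokens + thinking_tokens) / visible_tokens


def capped_thinking(requested: int, budget: int) -> int:
    """Thinking tokens actually spent under a hard budget."""
    return min(requested, budget)


# Assumed 500-token visible answer with ~12K thinking tokens, uncapped.
print(cost_multiplier(500, 12_000))                          # 25.0
# Same answer with thinking capped at an assumed 2,000 tokens.
print(cost_multiplier(500, capped_thinking(12_000, 2_000)))  # 5.0
```

A cap can cost accuracy on hard problems, so it is worth checking quality at each budget level before fixing one.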
Use structured output. JSON schema reduces output ~15% and eliminates parsing retries.
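A minimal sketch of a schema-constrained request follows. The `response_format` shape follows the OpenAI-style `json_schema` convention, which is an assumption about your server, so check its documentation. The ticket schema and helper names are hypothetical.

```python
import json

# Hypothetical schema for a support-ticket classifier.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["billing", "technical", "other"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["intent", "urgent"],
    "additionalProperties": False,
}


def structured_request(user_text: str) -> dict:
    """Request body asking the server to constrain output to TICKET_SCHEMA."""
    return {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 64,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "ticket", "schema": TICKET_SCHEMA, "strict": True},
        },
    }


def parse_ticket(raw: str) -> dict:
    """Parse a reply and check required keys (a sanity check, not full validation)."""
    data = json.loads(raw)
    missing = [key for key in TICKET_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

Keeping `max_tokens` small still matters here: a reply truncated by the cap can be invalid JSON even when decoding is constrained.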
Research: thinking budget techniques
TALE (ACL 2025) auto-estimates per-task budget: 67% fewer output tokens, 59% cost reduction, 80% accuracy preserved.
Chain of Draft (arXiv 2025) needs one added sentence in your prompt: “Think step by step, but only keep a minimum draft for each thinking step.” The paper reports as little as 7.6% of standard CoT tokens at comparable accuracy, with zero infrastructure change.
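Applying Chain of Draft can be as small as appending that sentence to an existing system prompt. A minimal sketch; `with_chain_of_draft` is a hypothetical helper, and the instruction text is the sentence quoted above.

```python
COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each thinking step."
)


def with_chain_of_draft(system_prompt: str) -> str:
    """Append the Chain of Draft instruction once, so the prompt stays stable."""
    if COD_INSTRUCTION in system_prompt:
        return system_prompt
    return f"{system_prompt.rstrip()}\n\n{COD_INSTRUCTION}"


print(with_chain_of_draft("You are a math tutor."))
```

Because the function is idempotent, the resulting prompt is identical on every call, which keeps it friendly to prefix caching.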
SGLang’s constrained decoding (with the XGrammar backend) guarantees output that conforms to your JSON schema, so there are no retries for malformed output.
Structure prompts for prefix caching
The model has no memory between calls. Every request sends the entire conversation as one flat token sequence — system prompt, all previous turns, new message. Every token gets KV-computed before the model can output anything.
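Prefix caching reuses KV vectors when the first N tokens match a previous request. Discounts on cached input tokens vary by provider and change over time; at the time of writing, Anthropic gives 90% off cache reads (cache writes carry a surcharge), OpenAI 50%, and Google 75%. You just need to structure prompts so the static prefix is as large as possible.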
The rule: static content first (system prompt, schema, few-shot examples), dynamic content last (conversation history, new message). Everything above the cache boundary must be byte-for-byte identical across requests.
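To see why the size of the static prefix matters, here is a rough per-request cost estimate. It is a sketch with assumed numbers: the $3 per million input tokens price and the 8,000/2,000 token split are hypothetical, and cache-write surcharges and cache expiry are ignored.

```python
def input_cost_usd(static_tokens: int, dynamic_tokens: int,
                   usd_per_million_input: float,
                   cached_discount: float = 0.0) -> float:
    """Input cost of one request; the static prefix is discounted on a cache hit."""
    static_rate = usd_per_million_input * (1.0 - cached_discount)
    total = static_tokens * static_rate + dynamic_tokens * usd_per_million_input
    return total / 1_000_000


PRICE = 3.0  # hypothetical USD per million input tokens
cold = input_cost_usd(8_000, 2_000, PRICE)                       # no cache hit
warm = input_cost_usd(8_000, 2_000, PRICE, cached_discount=0.9)  # 90% off the prefix
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}  saved: {1 - warm / cold:.0%}")
```

With these assumed numbers the warm request costs about 72% less. The saving shrinks as the dynamic tail grows, which is another reason to keep history compact.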
Code example: cache-hostile vs cache-friendly
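```python
from datetime import datetime

# Placeholder inputs so the example runs on its own.
user_id = "u-123"
user_message = "Summarize my last order."
conversation_history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]

# ❌ Cache-hostile: dynamic content at the top of the prompt
hostile_messages = [
    {
        "role": "system",
        "content": (
            f"Current time: {datetime.now()}\n"  # breaks the cache on every call
            f"User ID: {user_id}\n"              # breaks the cache per user
            "You are a helpful assistant...\n"
            "[800 tokens of static instructions]"
        ),
    },
    *conversation_history,
    {"role": "user", "content": user_message},
]

# ✓ Cache-friendly: static prefix locked, dynamic content at the bottom.
# In Anthropic's Messages API the system prompt is a top-level `system`
# parameter, and `cache_control` is set on a content block.
STATIC_SYSTEM = (
    "You are a helpful assistant.\n"
    "[Instructions, rules, schema: never changes]\n"
    "[Few-shot examples: never changes]"
)
friendly_request = {
    "system": [
        {
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # Anthropic API
        }
    ],
    "messages": [
        *conversation_history,
        {"role": "user", "content": user_message},
    ],
}
```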
What breaks the cache: timestamps or user IDs above the boundary, rotating few-shot examples, dynamic personalization in the system prompt. If you need personalization, put it in a user-turn message below the boundary.
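A quick way to catch these mistakes before they reach production is to compare two rendered prompts and see how much of the prefix they share. This is a character-level proxy, not a tokenizer-accurate measure, and `shared_prefix_ratio` is a hypothetical helper.

```python
import os


def shared_prefix_ratio(prompt_a: str, prompt_b: str) -> float:
    """Fraction of the shorter prompt covered by the common leading characters."""
    shortest = min(len(prompt_a), len(prompt_b))
    if shortest == 0:
        return 0.0
    # os.path.commonprefix compares strings character by character.
    common = len(os.path.commonprefix([prompt_a, prompt_b]))
    return common / shortest


STATIC = "You are a helpful assistant. " * 20

# Timestamp at the top: the prompts diverge almost immediately.
hostile = shared_prefix_ratio("Time: 10:00\n" + STATIC, "Time: 10:01\n" + STATIC)
# Static block first: only the final user text differs.
friendly = shared_prefix_ratio(STATIC + "User: hi", STATIC + "User: bye")
print(f"hostile: {hostile:.2f}  friendly: {friendly:.2f}")
```

A low ratio between two requests that should share a template usually means something dynamic has crept above the cache boundary.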
Skill templates take this further. For recurring tasks (reports, deployments, code reviews), the model spends thinking tokens rewriting the user’s request into a form it can reason about, every time. A pre-built template eliminates this translation cost. Combined with prefix caching, a skill template hits the cache from the second call: the template is billed at the cached rate (10% of the input price with a 90% discount) and thinking overhead drops to near zero.
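A skill template can be as simple as a fixed system prompt keyed by skill ID, with the request-specific text kept in the user turn. A minimal sketch; the skill names and template wording are hypothetical.

```python
# Hypothetical templates; each string must stay byte-for-byte identical
# across calls so the prefix cache can reuse it.
SKILL_TEMPLATES = {
    "code_review": (
        "You are a code reviewer. Check correctness, tests, and naming. "
        "Reply as a bullet list."
    ),
    "weekly_report": (
        "You write weekly status reports with sections: Done, Blocked, Next."
    ),
}


def build_messages(skill_id: str, user_input: str) -> list[dict]:
    """Static template first (cacheable), request-specific text last."""
    template = SKILL_TEMPLATES[skill_id]  # KeyError on unknown skills is intentional
    return [
        {"role": "system", "content": template},
        {"role": "user", "content": user_input},
    ]


first = build_messages("code_review", "Review PR A")
second = build_messages("code_review", "Review PR B")
print(first[0] == second[0])  # True: the cacheable prefix is identical
```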
Manage context over time
Even with perfect prompt structure, context grows. Active management keeps quality high and costs linear.
Compaction. Summarize history into a compact block and reset. Proactive compaction beats automatic compaction at the limit: while the model’s recall is still intact, you can guide the summary toward what matters. Write state to a JSON file, then spawn a fresh context that reads it.
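The write-state-then-restart step might look like the sketch below. The state fields (`summary`, `decisions`, `open_tasks`) and function names are assumptions. In practice the summary would come from the model itself while its recall is still good.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, summary: str, decisions: list[str],
               open_tasks: list[str]) -> None:
    """Persist the compacted conversation state as JSON."""
    state = {"summary": summary, "decisions": decisions, "open_tasks": open_tasks}
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def fresh_context(path: Path, system_prompt: str) -> list[dict]:
    """Start a new message list seeded only with the saved state."""
    state = json.loads(path.read_text(encoding="utf-8"))
    briefing = (
        f"Summary so far: {state['summary']}\n"
        f"Decisions: {'; '.join(state['decisions'])}\n"
        f"Open tasks: {'; '.join(state['open_tasks'])}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": briefing},
    ]


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    save_state(state_file, "Migrating auth to OAuth", ["Use PKCE"], ["Write tests"])
    messages = fresh_context(state_file, "You are a coding assistant.")
    print(messages[1]["content"])
```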
Subgoal-based chunking. HiAgent (ACL 2025) compresses completed subgoal history into short summaries. Result: 2x success rate on long-horizon tasks, 35% context reduction.
Hierarchical memory. Three layers: working memory (current window) → session memory (today’s summary) → long-term memory (persistent facts). The Letta/MemGPT framework validates this — the LLM self-manages context through function calls to page data in and out.
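The three layers can be pictured as a small data structure. This is an illustrative sketch, not Letta's API: the class, the eviction rule, and the truncation used as a stand-in for summarization are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    working: list[str] = field(default_factory=list)         # current window
    session: list[str] = field(default_factory=list)         # today's summaries
    long_term: dict[str, str] = field(default_factory=dict)  # persistent facts
    working_limit: int = 4  # assumption: max turns kept verbatim

    def add_turn(self, turn: str) -> None:
        """Add a turn; page the oldest turns out to session memory."""
        self.working.append(turn)
        while len(self.working) > self.working_limit:
            evicted = self.working.pop(0)
            # Placeholder "summary": a real system would ask a model to summarize.
            self.session.append(evicted[:40])

    def remember(self, key: str, fact: str) -> None:
        """Store a persistent fact in long-term memory."""
        self.long_term[key] = fact

    def render(self) -> str:
        """Most stable layer first, most recent turns last."""
        facts = "; ".join(f"{k}={v}" for k, v in self.long_term.items())
        return (
            f"Facts: {facts}\n"
            f"Earlier today: {' | '.join(self.session)}\n"
            f"Recent turns: {' | '.join(self.working)}"
        )
```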
KG-RAG for history. Instead of keeping all history in context, use HippoRAG (NeurIPS 2024) to build a knowledge graph across sessions and retrieve by semantic query. The paper reports its single-step retrieval as 10–30x cheaper than iterative retrieval.
When to reset a conversation
| Utilization | Action |
|---|---|
| <50% | Keep going — near-baseline accuracy |
| 50–70% | Monitor; compact at a natural break |
| 70–85% | Compact now — measurable quality loss |
| >85% | Reset — severe degradation |
Signals that mean reset now: the model asks for information already provided, generates code that contradicts earlier decisions, suggests previously rejected solutions, or enters a fix-break-fix loop.
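The utilization table translates directly into a check you can run before each turn. A minimal sketch: the function name and action labels are hypothetical, and the overlapping boundaries in the table (exactly 70% and 85%) are resolved here toward compacting.

```python
def context_action(used_tokens: int, window_tokens: int) -> str:
    """Map context utilization to the action from the table above."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    utilization = used_tokens / window_tokens
    if utilization < 0.50:
        return "continue"
    if utilization < 0.70:
        return "monitor"
    if utilization <= 0.85:
        return "compact"
    return "reset"


for used in (40_000, 60_000, 80_000, 90_000):
    print(used, context_action(used, 100_000))
```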
Route intelligently
Routing isn’t just about sending easy queries to cheap models. It directly improves cache efficiency across the entire system.
Why routing multiplies savings
When a router dispatches to specialized sub-agents, each sub-agent gets a stable, task-specific prompt template and a fresh context window. Because every request to “Agent A” shares the exact same system prompt prefix, the KV cache is computed once and reused — cache hit rates approach 100% for the system prompt portion. The router effectively pre-warms cached templates for all downstream agents.
Two independent savings: cheaper models for simple tasks and better cache utilization for all tasks.
Tencent ADP: proof at scale
Tencent’s Agent Development Platform ships an Intent Classifier node (意图识别节点) as a first-class building block. User message enters → intent classifier (LLM with constrained prompt) produces a label → conditional branching routes to a specialized sub-graph with its own system prompt, tools, and knowledge base. Deployed across WeChat Pay and QQ Music customer service. The same pattern appears in ByteDance’s Coze, Baidu’s AppBuilder, and Dify — intent-based routing is now a standard production primitive.
The cache benefit is structural: each branch has a stable prompt template, so prefix caching kicks in automatically from the second request per intent category.
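The classify-then-branch pattern can be sketched in a few lines. This is not Tencent ADP's implementation: the branch names, prompts, and tool names are hypothetical, and a keyword matcher stands in for the LLM classifier so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Branch:
    system_prompt: str            # stable per intent, so it caches well
    tools: tuple[str, ...] = ()


BRANCHES = {
    "refund": Branch("You handle refund requests. Ask for the order ID first.",
                     ("lookup_order",)),
    "playlist": Branch("You help users manage playlists.", ("search_tracks",)),
    "fallback": Branch("You are a general support assistant."),
}


def keyword_classifier(message: str) -> str:
    """Stand-in for the LLM intent classifier."""
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "playlist" in text:
        return "playlist"
    return "unknown"


def dispatch(message: str, classify: Callable[[str], str]) -> tuple[str, list[dict]]:
    """Classify, then build the messages for the matching branch."""
    label = classify(message)
    if label not in BRANCHES:
        label = "fallback"  # unknown labels never reach a specialized branch
    branch = BRANCHES[label]
    return label, [
        {"role": "system", "content": branch.system_prompt},
        {"role": "user", "content": message},
    ]
```

Every request routed to the same label starts with the same system prompt, which is what lets the prefix cache take effect from the second request.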
Building a router: four options
| Approach | How | Latency / Cost |
|---|---|---|
| Embedding similarity | Semantic Router — define intents with example utterances, route via cosine similarity. No LLM call. | <5ms / free |
| Small local LLM | Run Qwen2.5-1.5B or Phi-3-mini via Ollama. System prompt lists intents, model returns a label. Same pattern as Tencent ADP’s intent classifier. | 50–150ms / free |
| Learned router | RouteLLM (ICLR 2025), trained on Chatbot Arena preference data. Reports 95% of GPT-4 quality with up to 85% cost reduction (on MT-Bench). | ~50ms / free |
| Cascading | AutoMix (NeurIPS 2024): try cheapest model first, self-verify confidence, escalate if uncertain. Reports over 50% cost reduction at comparable performance. | Varies |
Start with: embedding similarity for high-confidence matches + local LLM for ambiguous cases + frontier API only when both are uncertain.
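The first tier of that setup, with escalation on low confidence, can be sketched as follows. It uses a toy bag-of-words vector in place of a real embedding model and does not use the Semantic Router library's API. The intents, examples, and 0.5 threshold are assumptions.

```python
import math
from collections import Counter

INTENT_EXAMPLES = {
    "billing": ["refund my payment", "why was I charged twice"],
    "technical": ["the app crashes on login", "error when uploading a file"],
}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real router would use a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def route(query: str, threshold: float = 0.5) -> tuple[str, str]:
    """Return (tier, intent); low-confidence queries escalate to the next tier."""
    query_vec = embed(query)
    best_intent, best_score = "unknown", 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = cosine(query_vec, embed(example))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score >= threshold:
        return "embedding", best_intent
    return "escalate_to_local_llm", "unknown"


print(route("refund my payment please"))
print(route("tell me a joke about cats"))
```

The same escalation step would repeat between the local LLM and the frontier API, using the local model's own confidence signal.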
Open-source stack
User request
│
▼
Qdrant semantic cache ─── hit? → return cached response
│ miss
▼
Ollama (Qwen2.5-1.5B) ─── intent classification → skill_id
│
▼
LiteLLM proxy
├── budget check (per-key caps)
├── inject skill template ← prefix cached after first call
├── attach memory context ← HippoRAG / Letta retrieval
│
├── simple task → SGLang (self-hosted 7B)
├── medium task → RouteLLM decides
└── complex task → frontier API (Anthropic / OpenAI)
│
▼
Letta memory manager
├── update session memory
├── extract to long-term store
└── trigger compaction at 70% utilization
Start with LiteLLM + Ollama + Qdrant. That handles routing, semantic caching, and budget control. Add Letta and HippoRAG once you have multi-turn agents that need persistent memory.
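To make the budget-check step concrete, here is a minimal in-memory version of per-key caps. It illustrates the idea only and is not LiteLLM's API. The class, key names, and dollar amounts are hypothetical.

```python
class BudgetGuard:
    """Tracks spend per API key and refuses calls that would exceed the cap."""

    def __init__(self, caps_usd: dict[str, float]):
        self.caps = dict(caps_usd)
        self.spent = {key: 0.0 for key in caps_usd}

    def allow(self, api_key: str, estimated_cost_usd: float) -> bool:
        """Check a request against the remaining budget before sending it."""
        if api_key not in self.caps:
            return False  # unknown keys get no budget
        return self.spent[api_key] + estimated_cost_usd <= self.caps[api_key]

    def record(self, api_key: str, actual_cost_usd: float) -> None:
        """Add the real cost once the response has been billed."""
        self.spent[api_key] += actual_cost_usd


guard = BudgetGuard({"team-a": 1.00})
first_ok = guard.allow("team-a", 0.60)
guard.record("team-a", 0.60)
second_ok = guard.allow("team-a", 0.60)
print(first_ok, second_ok)  # True False
```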
Full tool list
Routing: LiteLLM (MIT) for a unified proxy, caching, and budget caps. RouteLLM (Apache 2.0) as a learned router. Semantic Router (MIT) for embedding-based routing.
Inference: SGLang (Apache 2.0) with RadixAttention, up to 6.4x throughput. Ollama (MIT) for local models. vLLM (Apache 2.0) with PagedAttention.
Compression: LLMLingua (MIT), up to 20x compression with <1.5% accuracy loss.
Memory: Letta (Apache 2.0) for hierarchical memory. HippoRAG (Apache 2.0) for KG-RAG. Qdrant (Apache 2.0) as the vector DB. Chroma (Apache 2.0) as a simpler vector store.
What to do, in order
| # | Action | Effort | Savings |
|---|---|---|---|
| 1 | Cap max_tokens + Chain of Draft | Low | 59–92% output cost |
| 2 | Thinking budget on reasoning models | Low | 10–30x on reasoning calls |
| 3 | Prompt structure: static first, dynamic last | Low | 60–90% input cost |
| 4 | Intent routing via Ollama + LiteLLM | Medium | 3–10x model cost |
| 5 | Subgoal chunking + proactive compaction | Medium | 35%+ context reduction |
| 6 | KG-RAG for conversation history | Medium | >90% vs full-context |
| 7 | Hierarchical memory via Letta | High | Long-horizon task quality |
All combined: up to 90–95% savings vs. a naive single-model pipeline.
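How the individual savings stack depends on how your bill splits between input and output. The sketch below uses entirely hypothetical numbers (40% of spend on input, a 70% input saving, a 60% output saving, a 3x routing factor) to show the arithmetic, not to predict a result.

```python
def remaining_cost(input_share: float, input_saving: float,
                   output_saving: float, routing_factor: float = 1.0) -> float:
    """Fraction of the original bill left after the optimizations.

    input_share    -- fraction of the original bill spent on input tokens
    input_saving   -- fractional saving on input cost (e.g. prefix caching)
    output_saving  -- fractional saving on output cost (e.g. caps, short drafts)
    routing_factor -- how many times cheaper the average routed model is
    """
    output_share = 1.0 - input_share
    after_token_savings = (input_share * (1.0 - input_saving)
                           + output_share * (1.0 - output_saving))
    return after_token_savings / routing_factor


without_routing = remaining_cost(0.4, 0.7, 0.6)
with_routing = remaining_cost(0.4, 0.7, 0.6, routing_factor=3.0)
print(f"saved without routing: {1 - without_routing:.0%}")
print(f"saved with routing:    {1 - with_routing:.0%}")
```

With these assumed inputs the combined saving is about 64% before routing and about 88% after, which shows why the steps are worth stacking.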
The most important thing: output tokens cost 3–5x more than input, so controlling what the model writes is the highest-leverage place to start.
References
- TALE — Han et al., ACL 2025, arXiv:2412.18547
- Chain of Draft — Xu et al., arXiv:2502.18600, 2025
- Unified Routing & Cascading — Dekoninck et al., ICLR + ICML 2025, arXiv:2410.10347
- RouteLLM — Ong et al., ICLR 2025, arXiv:2406.18665
- HiAgent — Hu et al., ACL 2025
- NoLiMa — Adobe Research, ICML 2025, arXiv:2502.05167
- LongBench v2 — Tsinghua, ACL 2025, arXiv:2412.15204
- HippoRAG — Gutiérrez et al., NeurIPS 2024, arXiv:2405.14831
- Context Rot — Chroma Research, 2025
- Lost in the Middle — Liu et al., TACL 2024, arXiv:2307.03172
- MemGPT / Letta — Packer et al., arXiv:2310.08560
- SGLang — Zheng et al., NeurIPS 2024, arXiv:2312.07104
- Effective Context Engineering for AI Agents — Anthropic Engineering Blog, 2025
- AutoMix — Madaan et al., NeurIPS 2024