LLM Cost Optimization — Implementation Guide for Teams

Companion to "7 Levers That Cut Our LLM Costs by 80%". That post covers the what and the why. This one covers what good looks like for each lever.

Tasks

Read and follow the guidelines below for every lever you touch
Log every LLM call, then replay the logged prompts on cheaper models and score the results with LLM-as-judge as the quality metric
“Important task” pass — allow flagged tasks to bypass routing and go directly to the best model
Thread-based context management — fork, summarize, retrieve
Router: cap output per task type, check model tier, implement semantic matching for routing
LoRA: train on Cocos2dx JS and the team’s database schema

Log every LLM call at the client layer — prompt, output, model, tokens, latency, user accept/reject signal
Replay logged prompts through cheaper models, score against the expensive model’s accepted output using LLM-as-judge
Accepted pairs double as LoRA training data — logging and fine-tuning work best as one initiative
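The logging and replay steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed schema: the record fields, the `sink` list (standing in for a real log store), and the `cheap_model` and `judge` callables are all hypothetical names. The judge is assumed to return a score between 0 and 1 given the prompt, the accepted reference output, and the candidate output.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List, Optional


@dataclass
class LLMCallRecord:
    """One logged call; field names are illustrative, not a fixed schema."""
    prompt: str
    output: str
    model: str
    prompt_tokens: int
    output_tokens: int
    latency_ms: float
    accepted: Optional[bool] = None  # user accept/reject signal, if known


def log_call(record: LLMCallRecord, sink: List[str]) -> None:
    """Append the record as one JSON line (sink stands in for a log store)."""
    sink.append(json.dumps(asdict(record)))


def replay_pass_rate(
    sink: List[str],
    cheap_model: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    threshold: float = 0.8,  # assumed pass mark, tune per task type
) -> float:
    """Replay accepted prompts through cheap_model and return the share of
    outputs the judge scores at or above threshold against the reference."""
    records = [json.loads(line) for line in sink]
    accepted = [r for r in records if r["accepted"]]
    if not accepted:
        return 0.0
    passed = 0
    for r in accepted:
        candidate = cheap_model(r["prompt"])
        if judge(r["prompt"], r["output"], candidate) >= threshold:
            passed += 1
    return passed / len(accepted)
```

The same accepted records (prompt plus accepted output) can be exported unchanged as fine-tuning pairs, which is why one log format serves both purposes.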

2 tiers (cheap/strong) is a good starting point — add more only when misroute data supports it
Classify 1,000+ real queries before building the router
Keyword overrides for high-confidence terms first, semantic matching for the fuzzy middle
“Important task” pass: some tasks (production incidents, security reviews) benefit from always going to the best model — a simple flag or keyword override handles this cleanly
Shadow mode for 1 week before going live; misroute rates below 5% tend to give the best ROI
Passing context by reference (file paths) rather than inline keeps orchestrator costs down
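One way the routing order above could look in code: the important flag first, keyword overrides second, semantic matching for whatever is left. The keyword lists, the `embed` callable, and the labelled `exemplars` are placeholders for illustration; real ones would come from the classified query set. The fallback to the strong tier when no exemplars exist is an assumption, chosen because a misroute to the cheap tier is the costlier mistake.

```python
import math
from typing import Callable, List, Sequence, Tuple

CHEAP, STRONG = "cheap", "strong"

# Hypothetical keyword lists; derive real ones from classified queries.
STRONG_KEYWORDS = ("incident", "security review", "outage")
CHEAP_KEYWORDS = ("rename", "format", "typo")

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(
    query: str,
    embed: Callable[[str], Vector],
    exemplars: List[Tuple[Vector, str]],
    important: bool = False,
) -> str:
    """Pick a tier: important flag, then keywords, then nearest exemplar."""
    if important:
        return STRONG
    text = query.lower()
    if any(k in text for k in STRONG_KEYWORDS):
        return STRONG
    if any(k in text for k in CHEAP_KEYWORDS):
        return CHEAP
    if not exemplars:
        return STRONG  # assumed safe default when nothing matches
    vec = embed(query)
    best_tier, _ = max(
        ((tier, cosine(vec, ex_vec)) for ex_vec, tier in exemplars),
        key=lambda pair: pair[1],
    )
    return best_tier


def misroute_rate(decisions: List[Tuple[str, str]]) -> float:
    """Share of shadow-mode (routed, reviewed) pairs that disagree."""
    if not decisions:
        return 0.0
    wrong = sum(1 for routed, expected in decisions if routed != expected)
    return wrong / len(decisions)
```

In shadow mode, `route` would run alongside the existing path without affecting it, and `misroute_rate` over the reviewed decisions gives the number to compare against the 5% target. Plain substring keywords can over-match (for example "format" inside "information"), so word-boundary matching may be worth the extra effort.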

Track context utilization (tokens_used / context_window), alert at 60%
Thread-based: one task per thread, fork subtopics into sub-threads with only the relevant context
Summarize at milestones (~10 turns) with a cheap model; replace history with summary + last 2 messages
Keep full transcripts in storage — summaries serve the window, originals serve everything else
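A rough sketch of the thread mechanics above, under simplifying assumptions: messages are plain strings, `summarize` stands in for a call to a cheap model, and `count_tokens` stands in for the real tokenizer. The class and method names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ALERT_UTILIZATION = 0.60  # alert threshold from the guideline above
SUMMARIZE_EVERY = 10      # roughly one milestone per 10 turns
KEEP_LAST = 2             # messages kept verbatim after a summary


@dataclass
class Thread:
    context_window: int
    messages: List[str] = field(default_factory=list)    # what the model sees
    transcript: List[str] = field(default_factory=list)  # full history, kept
    turns_since_summary: int = 0

    def add(self, message: str, summarize: Callable[[List[str]], str]) -> None:
        """Record a turn; at each milestone, swap history for a summary."""
        self.messages.append(message)
        self.transcript.append(message)
        self.turns_since_summary += 1
        if self.turns_since_summary >= SUMMARIZE_EVERY:
            older = self.messages[:-KEEP_LAST]
            recent = self.messages[-KEEP_LAST:]
            self.messages = [summarize(older)] + recent
            self.turns_since_summary = 0

    def utilization(self, count_tokens: Callable[[str], int]) -> float:
        """tokens_used / context_window for the current window contents."""
        used = sum(count_tokens(m) for m in self.messages)
        return used / self.context_window

    def needs_alert(self, count_tokens: Callable[[str], int]) -> bool:
        return self.utilization(count_tokens) >= ALERT_UTILIZATION

    def fork(self, relevant: List[str]) -> "Thread":
        """Start a sub-thread seeded only with the relevant context."""
        return Thread(self.context_window, list(relevant), list(relevant))
```

The `transcript` list is never rewritten, so audits, retrieval, and later fine-tuning can still work from the originals while the model only sees the compacted `messages`.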

Static content first in prompts, dynamic last: this alone enables prefix caching, which providers typically discount at 50-90% off the normal input price for the cached tokens
No timestamps or session IDs in system prompts — they break the cache
Semantic caching: start with cosine > 0.95, lower gradually after validating quality
TTLs on every entry; skip caching for queries whose answers are non-deterministic (for example time-sensitive or deliberately varied output)
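The semantic caching rules above could be sketched like this. It is an in-memory illustration only: `embed` is a placeholder for a real embedding call, the linear scan would be replaced by a vector index at any real volume, and the one-hour default TTL is an assumed value, not a recommendation from the source.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(
        self,
        embed: Callable[[str], Vector],
        threshold: float = 0.95,      # starting point from the guideline
        ttl_seconds: float = 3600.0,  # assumed default; set per use case
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: List[Tuple[Vector, str, float]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the most similar unexpired response above the threshold."""
        now = self.clock()
        self._entries = [e for e in self._entries if e[2] > now]  # drop expired
        vec = self.embed(query)
        best, best_sim = None, self.threshold
        for ex_vec, response, _ in self._entries:
            sim = cosine(vec, ex_vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best

    def put(self, query: str, response: str, cacheable: bool = True) -> None:
        """Store a response with a TTL; callers pass cacheable=False to skip."""
        if not cacheable:
            return
        self._entries.append((self.embed(query), response, self.clock() + self.ttl))
```

Lowering `threshold` trades hit rate against the risk of serving an answer to a subtly different question, so each step down is worth validating against logged accept/reject signals.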

Per-task-type caps, not global — classification (30 tokens) and analysis (600 tokens) are different tasks
Prompt instructions (“category name only, no explanation”) produce better output than hard truncation
max_tokens at 1.5x P90 output length works well as a safety net
“Answer directly, do not deliberate” reduces hidden thinking tokens on simple tasks
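The 1.5x P90 safety net above can be derived per task type from logged output lengths. The sketch below assumes a nearest-rank percentile and rounds the cap up; the task names and sample lengths in the usage note are made up for illustration.

```python
import math
from typing import Dict, List


def p90(values: List[int]) -> int:
    """Nearest-rank 90th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one observed output length")
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def max_tokens_by_task(
    observed: Dict[str, List[int]], multiplier: float = 1.5
) -> Dict[str, int]:
    """Per-task max_tokens cap: multiplier times the P90 output length."""
    return {
        task: math.ceil(multiplier * p90(lengths))
        for task, lengths in observed.items()
    }
```

For example, if classification outputs are consistently 20 tokens and analysis outputs have a P90 of 400, the caps come out at 30 and 600. The cap only catches runaways; the prompt instruction is what keeps typical outputs short.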

Good candidates: output is near-deterministic, process has been stable for 4+ weeks, no reasoning needed
Let the LLM write its own replacement from 50 representative examples
Shadow mode 1-2 weeks (>95% equivalence), keep LLM as fallback, review quarterly
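A minimal sketch of the shadow-mode check and the LLM fallback described above. It assumes equivalence means an exact match after whitespace trimming, which is a simplification; structured outputs would need a comparison suited to their format. The function names and the convention that a rule returns `None` when it cannot answer are illustrative choices.

```python
from typing import Callable, List, Optional, Tuple


def equivalence_rate(
    pairs: List[Tuple[str, str]],
    normalize: Callable[[str], str] = str.strip,
) -> float:
    """Share of shadow-mode (llm_output, code_output) pairs that match."""
    if not pairs:
        return 0.0
    same = sum(1 for llm_out, code_out in pairs
               if normalize(llm_out) == normalize(code_out))
    return same / len(pairs)


def ready_to_promote(pairs: List[Tuple[str, str]], minimum: float = 0.95) -> bool:
    """True once the replacement exceeds the equivalence bar."""
    return equivalence_rate(pairs) > minimum


def answer(
    query: str,
    rule: Callable[[str], Optional[str]],
    llm: Callable[[str], str],
) -> str:
    """Try the deterministic rule; fall back to the LLM when it declines."""
    result = rule(query)
    return result if result is not None else llm(query)
```

Logging how often `answer` falls through to the LLM gives a concrete number to look at in the quarterly review.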

500+ accepted pairs per category before starting — LoRA benefits from patience
Our priority domains: Cocos2dx JS (lifecycle hooks, scene transitions, component patterns) and the team’s database schema (table relationships, query patterns, migration conventions)
Quality gate: within 5% of the expensive model on a held-out test set
Canary rollout: 5% → 25% → 50% → 100% over 2 weeks
Monthly retraining, weekly drift checks
Fix prompts first if the expensive model underperforms: LoRA amplifies the quality it is trained on, it does not fix it
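The readiness check, quality gate, and canary stages above might be wired up along these lines. Two assumptions are worth flagging: "within 5%" is read here as a relative margin on a higher-is-better score, and canary assignment uses a hash of a hypothetical request ID so that the same ID stays in the canary as the percentage grows.

```python
import hashlib
from typing import Dict, List

CANARY_STAGES = (5, 25, 50, 100)  # percent of traffic, per the rollout plan


def categories_ready(pair_counts: Dict[str, int], minimum: int = 500) -> List[str]:
    """Categories with enough accepted pairs to start training."""
    return [cat for cat, n in pair_counts.items() if n >= minimum]


def passes_quality_gate(
    candidate_score: float, reference_score: float, tolerance: float = 0.05
) -> bool:
    """Assumes higher is better and a relative 5% margin."""
    return candidate_score >= reference_score * (1 - tolerance)


def in_canary(request_id: str, percent: int) -> bool:
    """Deterministically place a request in the canary slice.

    Hashing keeps assignment stable: an ID in the 5% slice is also in the
    25%, 50%, and 100% slices.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

With counts such as `{"cocos2dx_js": 620, "db_schema": 340}`, only the first category would clear the 500-pair bar, so training on the schema data would wait until more accepted pairs accumulate.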

Tag every call as realtime or async — most teams find 20-40% qualify for batching
Provider batch APIs (Anthropic, OpenAI, Google) give 50% off with the same quality, in exchange for asynchronous turnaround
Fallback to synchronous for urgent requests; keep fallback rate below 5%
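A small sketch of the tagging and fallback accounting above. It deliberately stops short of any provider call: `batch_queue` just collects requests that would later be submitted to a batch API, and the `Request` and `Dispatcher` names and fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    prompt: str
    mode: str            # "realtime" or "async", tagged at the call site
    urgent: bool = False


@dataclass
class Dispatcher:
    batch_queue: List[Request] = field(default_factory=list)
    sync_sent: List[Request] = field(default_factory=list)
    fallbacks: int = 0
    async_total: int = 0

    def submit(self, req: Request) -> str:
        """Return which path the request took: "sync" or "batch"."""
        if req.mode == "realtime":
            self.sync_sent.append(req)
            return "sync"
        self.async_total += 1
        if req.urgent:
            # Async-tagged but urgent: fall back to the synchronous path.
            self.fallbacks += 1
            self.sync_sent.append(req)
            return "sync"
        self.batch_queue.append(req)
        return "batch"

    def fallback_rate(self) -> float:
        """Share of async-tagged requests that fell back to synchronous."""
        return self.fallbacks / self.async_total if self.async_total else 0.0

    def batch_share(self) -> float:
        """Share of all requests that ended up in the batch queue."""
        total = len(self.sync_sent) + len(self.batch_queue)
        return len(self.batch_queue) / total if total else 0.0
```

`fallback_rate` is the number to hold under 5%, and `batch_share` shows whether the workload lands in the 20-40% range the guideline mentions.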