Guide

Companion to 7 Levers That Cut Our LLM Costs by 80%. That post covers what and why. This one covers what good looks like for each lever. 📋 Copy as Markdown Tasks Read and follow the guidelines below for every lever you touch Log every LLM call and test on lower models with LLM-as-judge for loss function “Important task” pass — allow flagged tasks to bypass routing and go directly to the best model Thread-based context management — fork, summarize, retrieve Router: cap output per task type, check model tier, implement semantic matching for routing LoRA: train on Cocos2dx JS and the team’s database schema Guidelines Log Everything Log every LLM call at the client layer — prompt, output, model, tokens, latency, user accept/reject signal Replay logged prompts through cheaper models, score against the expensive model’s accepted output using LLM-as-judge Accepted pairs double as LoRA training data — logging and fine-tuning work best as one initiative Route by Difficulty 2 tiers (cheap/strong) is a good starting point — add more only when misroute data supports it Classify 1,000+ real queries before building the router Keyword overrides for high-confidence terms first, semantic matching for the fuzzy middle “Important task” pass: some tasks (production incidents, security reviews) benefit from always going to the best model — a simple flag or keyword override handles this cleanly Shadow mode for 1 week before going live; misroute rates below 5% tend to give the best ROI Passing context by reference (file paths) rather than inline keeps orchestrator costs down Manage Context Track context utilization (tokens_used / context_window), alert at 60% Thread-based: one task per thread, fork subtopics into sub-threads with only the relevant context Summarize at milestones (~10 turns) with a cheap model; replace history with summary + last 2 messages Keep full transcripts in storage — summaries serve the window, originals serve everything else Cache at Every Layer Static content first in prompts, dynamic last — this alone enables prefix caching (50-90% off) No timestamps or session IDs in system prompts — they break the cache Semantic caching: start with cosine > 0.95, lower gradually after validating quality TTLs on every entry; skip caching for non-deterministic queries Cap the Output Per-task-type caps, not global — classification (30 tokens) and analysis (600 tokens) are different tasks Prompt instructions (“category name only, no explanation”) produce better output than hard truncation max_tokens at 1.5x P90 output length works well as a safety net “Answer directly, do not deliberate” reduces hidden thinking tokens on simple tasks Materialize Known Solutions Good candidates: output is near-deterministic, process stable 4+ weeks, no reasoning needed Let the LLM write its own replacement from 50 representative examples Shadow mode 1-2 weeks (>95% equivalence), keep LLM as fallback, review quarterly Fine-Tune with LoRA 500+ accepted pairs per category before starting — LoRA benefits from patience Our priority domains: Cocos2dx JS (lifecycle hooks, scene transitions, component patterns) and the team’s database schema (table relationships, query patterns, migration conventions) Quality gate: within 5% of the expensive model on held-out test set Canary rollout: 5% → 25% → 50% → 100% over 2 weeks Monthly retraining, weekly drift checks Fix prompts first if the expensive model underperforms — LoRA amplifies quality, not fixes it Batch Async Work Tag every call as realtime or async — most teams find 20-40% qualify for batching Provider batch APIs (Anthropic, OpenAI, Google) give 50% off, same quality Fallback to synchronous for urgent requests; keep fallback rate below 5%