Ai | nmnhut

SFT Data Design for LLMs: Quality, Diversity, and Resisting Rote Memorization

Archival notebook on data design for supervised fine-tuning (SFT), the step that teaches a base LLM to follow instructions or perform a specific task. Synthesized from public papers from 2023–2025. Written so that rereading it later does not mean redoing the research from scratch. 0. First of all: what does SFT actually teach? The most common misconception is treating SFT as “cramming knowledge” into the model. It is not. ...

Multi-Agent Reinforcement Learning From the Ground Up: PPO, GRPO, CTDE, and MAPPO

Multi-agent reinforcement learning (MARL) studies how several agents learn to act in a shared environment. This post is a background survey of the cooperative MARL literature, written so that a reader with no reinforcement-learning background can follow it from start to finish. It is organized in four parts. Part 1 builds the single-agent foundations — the vocabulary, then the formulas, with the purpose of every symbol. Part 2 covers the two policy-update algorithms everything else rests on, PPO and GRPO, each with a worked example. Part 3 moves to many agents: the CTDE paradigm, the methods that established it (DDPG, MADDPG, COMA, VDN, QMIX), and MAPPO. Part 4 describes the benchmark environments. A summary and a glossary close the post. ...

User Game Lifecycle (Part 1): Learning from What Players Don’t Do

Most recommender systems learn from what you do. This paper from Tencent also pays attention to what you stop doing, and I find that idea rather charming. This post is bilingual (English and Vietnamese). The Vietnamese version is below. This is Part 1 of a two-part series. Part 2 builds a trial pipeline. English A small side-quest that led to a paper This whole thing started as a search for a dataset: logs of people playing and browsing games, the raw clicks of a session. The search itself is worth a few lines, because it sets up why the paper we ended up at felt interesting. ...

User Game Lifecycle (Part 2): Building a Trial Pipeline

Part 1 was the idea. This part is the assembly line: eight small blocks that turn raw clicks into a player representation a production model can use. This post is bilingual (English and Vietnamese). The Vietnamese version is below. This is Part 2 of a two-part series. Part 1 explains the method. English Part 1 was about why Tencent’s User Game Lifecycle works: fill out a too-short history by gathering behavior from four places and adding in going quiet (lost and silence actions), then keep the long-tail from being ignored with Inverse Probability Masking. This part is the how: the shape of a pipeline that turns those ideas into something you can train. We’ll keep it gentle: pseudocode and data shapes, no real framework code, so the structure stays easy to see. ...

MAGMA: Teaching AI to Remember Like Humans Do

Your AI has amnesia, and the fix isn’t more memory; it’s better memory. This post is bilingual (English and Vietnamese). The Vietnamese version is below. English Every time you have a long conversation with an AI assistant, something embarrassing happens behind the scenes. Around the 30-minute mark, the system quietly starts forgetting the beginning of your conversation. Not because it decided that stuff was unimportant, but because it ran out of room. These systems don’t have memory. They have a sliding window, a fixed-size buffer that moves forward as the conversation grows, dropping older context off the back end. That brilliant setup you laid out in the first five minutes? Gone. ...

Patterns for Agent Diversity in Multi-Agent LLM Systems

When you run multiple LLM agents together, the most likely outcome is collapse: every agent converges on the same answer, echoes the others, and you end up paying for expensive consensus that a single call could have produced. This happens because the default pressure in most multi-agent setups is agreement. Agents read each other’s outputs and optimize for coherence. Without an opposing force, there’s no reason for any agent to disagree. ...

LLM Cost Optimization — Implementation Guide for Teams

Companion to 7 Levers That Cut Our LLM Costs by 80%. That post covers what and why. This one covers what good looks like for each lever.

Tasks
- Read and follow the guidelines below for every lever you touch
- Log every LLM call and test on lower models with LLM-as-judge for loss function
- “Important task” pass: allow flagged tasks to bypass routing and go directly to the best model
- Thread-based context management: fork, summarize, retrieve
- Router: cap output per task type, check model tier, implement semantic matching for routing
- LoRA: train on Cocos2dx JS and the team’s database schema

Guidelines

Log Everything
- Log every LLM call at the client layer: prompt, output, model, tokens, latency, user accept/reject signal
- Replay logged prompts through cheaper models, score against the expensive model’s accepted output using LLM-as-judge
- Accepted pairs double as LoRA training data, so logging and fine-tuning work best as one initiative

Route by Difficulty
- 2 tiers (cheap/strong) is a good starting point; add more only when misroute data supports it
- Classify 1,000+ real queries before building the router
- Keyword overrides for high-confidence terms first, semantic matching for the fuzzy middle
- “Important task” pass: some tasks (production incidents, security reviews) benefit from always going to the best model; a simple flag or keyword override handles this cleanly
- Shadow mode for 1 week before going live; misroute rates below 5% tend to give the best ROI
- Passing context by reference (file paths) rather than inline keeps orchestrator costs down

Manage Context
- Track context utilization (tokens_used / context_window), alert at 60%
- Thread-based: one task per thread, fork subtopics into sub-threads with only the relevant context
- Summarize at milestones (~10 turns) with a cheap model; replace history with summary + last 2 messages
- Keep full transcripts in storage: summaries serve the window, originals serve everything else

Cache at Every Layer
- Static content first in prompts, dynamic last; this alone enables prefix caching (50-90% off)
- No timestamps or session IDs in system prompts, because they break the cache
- Semantic caching: start with cosine > 0.95, lower gradually after validating quality
- TTLs on every entry; skip caching for non-deterministic queries

Cap the Output
- Per-task-type caps, not global: classification (30 tokens) and analysis (600 tokens) are different tasks
- Prompt instructions (“category name only, no explanation”) produce better output than hard truncation
- max_tokens at 1.5x P90 output length works well as a safety net
- “Answer directly, do not deliberate” reduces hidden thinking tokens on simple tasks

Materialize Known Solutions
- Good candidates: output is near-deterministic, process stable 4+ weeks, no reasoning needed
- Let the LLM write its own replacement from 50 representative examples
- Shadow mode 1-2 weeks (>95% equivalence), keep LLM as fallback, review quarterly

Fine-Tune with LoRA
- 500+ accepted pairs per category before starting; LoRA benefits from patience
- Our priority domains: Cocos2dx JS (lifecycle hooks, scene transitions, component patterns) and the team’s database schema (table relationships, query patterns, migration conventions)
- Quality gate: within 5% of the expensive model on held-out test set
- Canary rollout: 5% → 25% → 50% → 100% over 2 weeks
- Monthly retraining, weekly drift checks
- Fix prompts first if the expensive model underperforms; LoRA amplifies quality, not fixes it

Batch Async Work
- Tag every call as realtime or async; most teams find 20-40% qualify for batching
- Provider batch APIs (Anthropic, OpenAI, Google) give 50% off, same quality
- Fallback to synchronous for urgent requests; keep fallback rate below 5%
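
The "Cap the Output" guideline (per-task-type caps, with max_tokens at 1.5x the P90 output length) can be sketched roughly as below. This is only an illustration under assumptions: the `(task_type, output_tokens)` log format, the function names `p90` and `output_caps`, and the nearest-rank percentile definition are all hypothetical choices, not something the guide specifies.

```python
import math
from collections import defaultdict


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    # Integer form of ceil(0.9 * n), which avoids floating-point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def output_caps(logged_calls, safety_factor=1.5):
    """Return a max_tokens cap per task type: safety_factor x P90 output length.

    logged_calls is an iterable of (task_type, output_tokens) pairs,
    a hypothetical shape for the call log described in "Log Everything".
    """
    lengths = defaultdict(list)
    for task_type, output_tokens in logged_calls:
        lengths[task_type].append(output_tokens)
    return {
        task_type: math.ceil(safety_factor * p90(tokens))
        for task_type, tokens in lengths.items()
    }


# Toy log: ten classification calls and four analysis calls.
toy_log = [("classification", n) for n in range(1, 11)]
toy_log += [("analysis", n) for n in (100, 200, 300, 400)]
caps = output_caps(toy_log)
```

The cap acts only as a safety net here; the guide's point that prompt instructions shape output better than hard truncation still applies.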

7 Levers That Cut Our LLM Costs by 80%

Most teams send every question to their most expensive model. That’s like routing every patient to the head surgeon — stitches or heart transplant, same price. When we audited query logs across several production workloads, we found a consistent pattern: 60–70% of queries were simple lookups or classifications that a cheap model handled just as well. The expensive model was doing busywork. Seven levers, applied in order. Together they drove 80–95% cost reduction in our case — your mileage will vary by workload. ...

Build a Token Router with Embeddings and Prompt Templates

Skip the training pipeline and the GPU — embeddings, cosine similarity, and structured prompts are enough to cut your LLM bill by 80%. The idea Every query has a shape — topic, complexity, expected output format. You can detect that shape in <5ms using embeddings, then: Pick a prompt template — pre-built system prompt with format constraints, cached by the provider Pick a model — cheap for easy queries, strong for hard ones Cap output tokens — templates define expected length All of this works with pure geometry in embedding space — no model training, no preference data required. ...
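
As a rough sketch of the idea in this excerpt, a router can compare a query embedding against one reference vector per route and return the matching model and output cap. Everything concrete below is an assumption for illustration: the two-dimensional toy vectors stand in for real embeddings, and the route names, model names, and caps are hypothetical placeholders rather than the post's actual configuration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical routes: each has a reference vector, a model tier, and an output cap.
ROUTES = {
    "lookup": {"centroid": [1.0, 0.0], "model": "cheap-model", "max_tokens": 30},
    "analysis": {"centroid": [0.0, 1.0], "model": "strong-model", "max_tokens": 600},
}


def route(query_vec):
    """Pick the route whose reference vector is most similar to the query vector."""
    name = max(ROUTES, key=lambda n: cosine(query_vec, ROUTES[n]["centroid"]))
    chosen = ROUTES[name]
    return name, chosen["model"], chosen["max_tokens"]


# A query vector that points mostly toward the "lookup" reference vector.
decision = route([0.9, 0.1])
```

In practice the query vector would come from whatever embedding model the team already uses, and each route would also carry its prompt template.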

Two Roads to AI Agents: Code or Markdown?

The same task, two radically different approaches. One says “write code to orchestrate the LLM.” The other says “write markdown to teach it.” Both produce agents that reason, use tools, and complete complex work. Knowing when to reach for which is the skill that matters in 2026. The SDK Way: Agents as Code Agent SDKs — OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI — let you build agents programmatically. You register tools as functions, write instructions, and the SDK runs the loop: prompt the LLM, execute tool calls, feed results back, repeat. ...