User Game Lifecycle (Part 1): Learning from What Players Don’t Do

Most recommender systems learn from what you do. This Tencent paper also pays attention to what you stop doing, and I find that idea rather charming. This post is bilingual (English and Vietnamese); the Vietnamese version is below. This is Part 1 of a two-part series. Part 2 prototypes the pipeline. English: A small side-quest that led to a paper. This whole thing started as a search for a dataset: logs of people playing and browsing games, the raw clicks of a session. The search itself is worth a few lines, because it sets up why the paper we ended up at felt interesting. ...

June 7, 2026 · 16 min · Minh-Nhut Nguyen

User Game Lifecycle (Part 2): Prototyping the Pipeline

Part 1 was the idea. This part is the assembly line: eight small blocks that turn raw clicks into a player representation a production model can use. This post is bilingual (English and Vietnamese); the Vietnamese version is below. This is Part 2 of a two-part series. Part 1 explains the method. English: Part 1 was about why Tencent’s User Game Lifecycle works: fill out a too-short history by gathering behavior from four places and adding in going quiet (lost and silence actions), then keep the long tail from being ignored with Inverse Probability Masking. This part is the how: the shape of a pipeline that turns those ideas into something you can train. We’ll keep it gentle: pseudocode and data shapes, no real framework code, so the structure stays easy to see. ...

June 7, 2026 · 14 min · Minh-Nhut Nguyen

MAGMA: Teaching AI to Remember Like Humans Do

Your AI has amnesia, and the fix isn’t more memory; it’s better memory. This post is bilingual (English and Vietnamese); the Vietnamese version is below. English: Every time you have a long conversation with an AI assistant, something embarrassing happens behind the scenes. Around the 30-minute mark, the system quietly starts forgetting the beginning of your conversation. Not because it decided that stuff was unimportant, but because it ran out of room. These systems don’t have memory. They have a sliding window, a fixed-size buffer that moves forward as the conversation grows, dropping older context off the back end. That brilliant setup you laid out in the first five minutes? Gone. ...
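The sliding-window behavior described in this excerpt can be illustrated with a tiny sketch. It assumes the window is counted in messages rather than tokens (real systems typically budget tokens), and the names `make_window` and `window` are hypothetical, not taken from the post.

```python
from collections import deque


def make_window(max_messages):
    """Return a fixed-size buffer; once full, each append drops the oldest item."""
    return deque(maxlen=max_messages)


# A window that holds only the three most recent turns.
window = make_window(3)
for turn in ["setup", "question 1", "question 2", "question 3"]:
    window.append(turn)

# The earliest turn ("setup") has fallen off the back of the window.
print(list(window))
```

Nothing in the buffer decides that "setup" was unimportant; it is dropped purely because capacity ran out, which is the failure mode the excerpt describes.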

April 10, 2026 · 9 min · Minh-Nhut Nguyen

Patterns for Agent Diversity in Multi-Agent LLM Systems

When you run multiple LLM agents together, the most likely outcome is collapse: every agent converges on the same answer, echoes the others, and you end up paying for expensive consensus that a single call could have produced. This happens because the default pressure in most multi-agent setups is agreement. Agents read each other’s outputs and optimize for coherence. Without an opposing force, there’s no reason for any agent to disagree. ...

April 7, 2026 · 6 min · Minh-Nhut Nguyen

LLM Cost Optimization — Implementation Guide for Teams

Companion to 7 Levers That Cut Our LLM Costs by 80%. That post covers what and why. This one covers what good looks like for each lever.

Tasks
- Read and follow the guidelines below for every lever you touch
- Log every LLM call and test on lower models with LLM-as-judge as the loss function
- “Important task” pass: allow flagged tasks to bypass routing and go directly to the best model
- Thread-based context management: fork, summarize, retrieve
- Router: cap output per task type, check model tier, implement semantic matching for routing
- LoRA: train on Cocos2dx JS and the team’s database schema

Guidelines

Log Everything
- Log every LLM call at the client layer: prompt, output, model, tokens, latency, user accept/reject signal
- Replay logged prompts through cheaper models, score against the expensive model’s accepted output using LLM-as-judge
- Accepted pairs double as LoRA training data, so logging and fine-tuning work best as one initiative

Route by Difficulty
- 2 tiers (cheap/strong) is a good starting point; add more only when misroute data supports it
- Classify 1,000+ real queries before building the router
- Keyword overrides for high-confidence terms first, semantic matching for the fuzzy middle
- “Important task” pass: some tasks (production incidents, security reviews) benefit from always going to the best model; a simple flag or keyword override handles this cleanly
- Shadow mode for 1 week before going live; misroute rates below 5% tend to give the best ROI
- Passing context by reference (file paths) rather than inline keeps orchestrator costs down

Manage Context
- Track context utilization (tokens_used / context_window), alert at 60%
- Thread-based: one task per thread, fork subtopics into sub-threads with only the relevant context
- Summarize at milestones (~10 turns) with a cheap model; replace history with summary + last 2 messages
- Keep full transcripts in storage: summaries serve the window, originals serve everything else

Cache at Every Layer
- Static content first in prompts, dynamic last; this alone enables prefix caching (50-90% off)
- No timestamps or session IDs in system prompts, because they break the cache
- Semantic caching: start with cosine > 0.95, lower gradually after validating quality
- TTLs on every entry; skip caching for non-deterministic queries

Cap the Output
- Per-task-type caps, not global: classification (30 tokens) and analysis (600 tokens) are different tasks
- Prompt instructions (“category name only, no explanation”) produce better output than hard truncation
- max_tokens at 1.5x P90 output length works well as a safety net
- “Answer directly, do not deliberate” reduces hidden thinking tokens on simple tasks

Materialize Known Solutions
- Good candidates: output is near-deterministic, process stable 4+ weeks, no reasoning needed
- Let the LLM write its own replacement from 50 representative examples
- Shadow mode 1-2 weeks (>95% equivalence), keep LLM as fallback, review quarterly

Fine-Tune with LoRA
- 500+ accepted pairs per category before starting; LoRA benefits from patience
- Our priority domains: Cocos2dx JS (lifecycle hooks, scene transitions, component patterns) and the team’s database schema (table relationships, query patterns, migration conventions)
- Quality gate: within 5% of the expensive model on held-out test set
- Canary rollout: 5% → 25% → 50% → 100% over 2 weeks
- Monthly retraining, weekly drift checks
- Fix prompts first if the expensive model underperforms; LoRA amplifies quality, it does not fix it

Batch Async Work
- Tag every call as realtime or async; most teams find 20-40% qualify for batching
- Provider batch APIs (Anthropic, OpenAI, Google) give 50% off, same quality
- Fallback to synchronous for urgent requests; keep fallback rate below 5%
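Two of the numeric rules in the guidelines above (alert at 60% context utilization, and max_tokens at 1.5x the P90 output length) can be sketched as follows. This is a minimal illustration, not the team's implementation: the function names and the sample lengths are hypothetical, and the nearest-rank percentile is an assumption, since the post does not say how P90 is computed.

```python
import math


def context_utilization(tokens_used, context_window):
    """Fraction of the context window currently in use."""
    return tokens_used / context_window


def should_alert(tokens_used, context_window, threshold=0.60):
    """True once utilization reaches the alert threshold (60% by default)."""
    return context_utilization(tokens_used, context_window) >= threshold


def p90(values):
    """Nearest-rank 90th percentile (assumed method), using integer math."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n)
    return ordered[rank - 1]


def output_cap(observed_lengths, factor=1.5):
    """max_tokens safety net: factor times the P90 of observed output lengths."""
    return math.ceil(factor * p90(observed_lengths))


# Hypothetical output lengths (in tokens) logged for one task type.
observed = [10, 12, 14, 16, 18, 20, 22, 24, 26, 40]
print(output_cap(observed))          # cap derived from the logged lengths
print(should_alert(70_000, 100_000)) # utilization check for one thread
```

Computing the cap per task type from logged lengths keeps the cap a safety net rather than a truncation tool, which matches the guideline that prompt instructions should do the real shortening.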

March 27, 2026 · 3 min · Minh-Nhut Nguyen

7 Levers That Cut Our LLM Costs by 80%

Most teams send every question to their most expensive model. That’s like routing every patient to the head surgeon — stitches or heart transplant, same price. When we audited query logs across several production workloads, we found a consistent pattern: 60–70% of queries were simple lookups or classifications that a cheap model handled just as well. The expensive model was doing busywork. Seven levers, applied in order. Together they drove 80–95% cost reduction in our case — your mileage will vary by workload. ...

March 23, 2026 · 8 min · Minh-Nhut Nguyen

Build a Token Router with Embeddings and Prompt Templates

Skip the training pipeline and the GPU: embeddings, cosine similarity, and structured prompts are enough to cut your LLM bill by 80%. The idea: every query has a shape (topic, complexity, expected output format). You can detect that shape in <5ms using embeddings, then: (1) pick a prompt template, a pre-built system prompt with format constraints, cached by the provider; (2) pick a model, cheap for easy queries, strong for hard ones; (3) cap output tokens, since templates define expected length. All of this works with pure geometry in embedding space, with no model training and no preference data required. ...
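The routing idea in this excerpt can be sketched with plain cosine similarity. Everything concrete here is an assumption for illustration: the three-dimensional vectors stand in for real embeddings, and the route names, model names, and token caps in `ROUTES` are hypothetical placeholders rather than values from the post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical routes: each has a centroid in embedding space,
# a model tier, and an output cap taken from its prompt template.
ROUTES = {
    "lookup": {"centroid": [1.0, 0.0, 0.1], "model": "cheap-model", "max_tokens": 60},
    "analysis": {"centroid": [0.1, 1.0, 0.3], "model": "strong-model", "max_tokens": 600},
}


def route(query_embedding):
    """Pick the route whose centroid is most similar to the query embedding."""
    name = max(ROUTES, key=lambda r: cosine(query_embedding, ROUTES[r]["centroid"]))
    return name, ROUTES[name]["model"], ROUTES[name]["max_tokens"]


# Toy query embeddings; a real system would get these from an embedding model.
print(route([0.9, 0.1, 0.0]))
print(route([0.0, 0.8, 0.4]))
```

Because the decision is a nearest-centroid lookup, adding a new route only requires a new centroid and template, with no retraining step.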

March 23, 2026 · 7 min · Minh-Nhut Nguyen

Two Roads to AI Agents: Code or Markdown?

The same task, two radically different approaches. One says “write code to orchestrate the LLM.” The other says “write markdown to teach it.” Both produce agents that reason, use tools, and complete complex work. Knowing when to reach for which is the skill that matters in 2026. The SDK Way: Agents as Code. Agent SDKs (OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI) let you build agents programmatically. You register tools as functions, write instructions, and the SDK runs the loop: prompt the LLM, execute tool calls, feed results back, repeat. ...

March 23, 2026 · 4 min · Minh-Nhut Nguyen

Workshop: Build an AI Video Pipeline — Skills vs SDK in Practice

Part 2 of Two Roads to AI Agents. This time we apply the framework to something real. In Part 1, we drew the line between Agent SDKs (code orchestration) and Agent Skills (markdown knowledge). Now let’s see where that line falls in practice. We’ll walk through a pipeline that turns any article into a narrated MP4 — and at each step, I’ll label whether it belongs in SDK code or Skill knowledge, and why. ...

March 23, 2026 · 4 min · Minh-Nhut Nguyen

Building an AI Video Pipeline: From Text to Narrated MP4 with Remotion and ElevenLabs

I wanted a way to turn my blog posts into narrated videos without spending hours in video editors. What I ended up building was a full pipeline: give it an article, a URL, or a PowerPoint file — get back a 1080p MP4 with animated slides, syntax-highlighted code blocks, and an AI voiceover in any language. The whole thing is open source: github.com/nmnhut-it/educational-video-pipeline. This post walks through how it works and how you can build your own. ...

March 7, 2026 · 7 min · Minh-Nhut Nguyen