This post explains two terms you will run into often in multi-agent reinforcement learning: CTDE and MAPPO. It also covers a practical question that does not get written down much — which of these papers you can actually reproduce, and whether you need a GPU.
I will keep the wording simple and stay close to what the source papers say.
The setting: more than one agent
In standard reinforcement learning, a single agent takes actions in an environment and learns from rewards. Multi-agent reinforcement learning (MARL) is the same idea, but with several agents acting at the same time. Examples: several robots that must cooperate, or several units in a game that share a goal.
Having more than one agent creates a specific difficulty. Each agent usually sees only its own local information. But during learning, you often have access to everything — every agent’s observation, the full state of the environment, and what the other agents are doing. The question is how to use that extra information without breaking the rule that, when the system actually runs, each agent can only act on what it can see.
This is exactly what CTDE addresses.
CTDE: Centralized Training, Decentralized Execution
CTDE stands for Centralized Training with Decentralized Execution. It is a paradigm — a way of organizing the learning — not a single algorithm.
The idea splits each agent’s life into two phases:
- Training (centralized): You are allowed to use any information you have, including the global state and the other agents’ observations, actions, and policies. This is fine because training typically happens in a simulator or another controlled setting where that information is available.
- Execution (decentralized): Once trained, each agent runs on its own, using only the information available to it locally. No global state. No communication required between agents.
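To make the two phases concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers: the `Agent` class, the linear scoring rule, and `centralized_value` are hypothetical names I made up for illustration. The only point is which inputs each function is allowed to see.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical decentralized actor: one linear score per action."""
    weights: list  # weights[action][feature]

    def act(self, local_obs):
        # Execution-time rule: only this agent's own observation is used.
        scores = [sum(w * x for w, x in zip(row, local_obs))
                  for row in self.weights]
        return max(range(len(scores)), key=scores.__getitem__)


def centralized_value(global_state, joint_obs, critic_weights):
    """Hypothetical centralized critic, used during training only.

    It sees the global state plus every agent's observation.
    """
    features = list(global_state) + [x for obs in joint_obs for x in obs]
    return sum(w * x for w, x in zip(critic_weights, features))


agents = [Agent(weights=[[1.0, 0.0], [0.0, 1.0]]) for _ in range(2)]
joint_obs = [[0.2, 0.9], [0.7, 0.1]]
global_state = [1.0, -1.0, 0.5]

# Decentralized execution: each agent gets its own observation, nothing else.
actions = [agent.act(obs) for agent, obs in zip(agents, joint_obs)]

# Centralized training: the critic is given everything.
critic_weights = [0.1] * (len(global_state) + 4)
value = centralized_value(global_state, joint_obs, critic_weights)
```

After training you would throw the critic away and ship only the actors, which is why the critic is free to use inputs that do not exist at execution time.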
A 2024 tutorial by Amato is a good single reference for the definition: “An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning” (arXiv:2409.03052).
That tutorial divides MARL approaches into three types:
- CTE — Centralized Training and Execution.
- CTDE — Centralized Training, Decentralized Execution.
- DTE — Decentralized Training and Execution.
CTDE is the most common of the three. The reasons given: it can use centralized information during training, it scales better than fully centralized methods, it needs no communication at execution time, and it fits cooperative problems well.
Where CTDE comes from
CTDE is usually traced back to three papers from 2017–2018. If you are writing a prior-work section, these are the standard citations for the origin of the paradigm:
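- MADDPG: Lowe et al., “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments” (2017), arXiv:1706.02275. Uses a centralized critic with decentralized actors.
- COMA: Foerster et al., “Counterfactual Multi-Agent Policy Gradients” (2017), arXiv:1705.08926. Uses a centralized critic with a counterfactual baseline to assign credit to individual agents.
- QMIX: Rashid et al., “QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning” (2018), arXiv:1803.11485. Factorizes the joint action-value into per-agent values under CTDE.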
A common way to cite: use Amato 2024 for the definition, and MADDPG / COMA / QMIX as the origin of the paradigm.
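Since value decomposition is the least obvious of the three ideas, here is a rough sketch of the monotonicity trick behind QMIX. This is a deliberately simplified one-layer mixer, not the two-layer mixing network from the paper, and `mix_monotonic`, `hyper_w`, and `hyper_b` are names I invented. The property it illustrates is the real one: the mixing weights are forced to be non-negative, so raising any single agent’s Q-value can never lower the joint value.

```python
def mix_monotonic(agent_qs, state, hyper_w, hyper_b):
    """Simplified mixer: Q_tot = sum_a |w_a(state)| * Q_a + b(state).

    hyper_w has one row per agent; each row maps the global state to
    that agent's mixing weight. The absolute value keeps every weight
    non-negative, which makes Q_tot monotonic in each agent's Q-value.
    """
    weights = [abs(sum(h * s for h, s in zip(row, state))) for row in hyper_w]
    bias = sum(h * s for h, s in zip(hyper_b, state))
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias


state = [1.0, 2.0]
hyper_w = [[0.5, -1.0], [1.0, 0.0]]  # arbitrary illustrative numbers
hyper_b = [0.1, 0.1]

q_low = mix_monotonic([1.0, 2.0], state, hyper_w, hyper_b)
q_high = mix_monotonic([2.0, 2.0], state, hyper_w, hyper_b)  # agent 0 improves
```

Because of that monotonicity, each agent picking the action with its own highest Q-value also maximizes the joint value, which is what allows decentralized execution.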
MAPPO: PPO applied to many agents
MAPPO stands for Multi-Agent PPO. PPO (Proximal Policy Optimization) is a well-known single-agent algorithm. MAPPO is PPO with a small set of changes that make it work for multiple agents.
The canonical paper is “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games” (Yu, Velu, Vinitsky, Gao, Wang, Bayen, Wu, 2021):
- arXiv: https://arxiv.org/abs/2103.01955
- Project page: https://sites.google.com/view/mappo
- Official code: https://github.com/marlbenchmark/on-policy
What the paper claims
The paper claims that PPO, an on-policy single-agent method, reaches strong performance on three popular MARL benchmarks with only a few simple modifications, and that its sample efficiency is similar to popular off-policy methods in most cases. The benchmarks are:
- MPE — the particle-world environments.
- SMAC — the StarCraft Multi-Agent Challenge.
- Hanabi — the Hanabi card-game challenge.
The structure fits CTDE directly: MAPPO uses a centralized value function (critic) during training while keeping decentralized actors for execution.
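As a rough illustration of how the two halves meet, the sketch below shows the standard PPO clipped policy loss. In a MAPPO-style setup the advantages would come from the centralized critic, while the log-probabilities come from each actor evaluated on its local observation. The function name and the default clip value of 0.2 are my assumptions, not values taken from the paper or the official code.

```python
import math


def ppo_clipped_policy_loss(new_logp, old_logp, advantages, clip=0.2):
    """Standard PPO clipped surrogate, averaged over samples.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: one estimate per
    sample (here assumed to come from a centralized critic).
    Expects non-empty, equal-length lists.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        # Take the more pessimistic of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)  # negated because optimizers minimize


# The policy doubled the probability of a good action; clipping caps the gain.
loss = ppo_clipped_policy_loss([math.log(2.0)], [0.0], [1.0])
```

The clipping is what keeps each update close to the policy that collected the data, which matters even more when several agents are changing at once.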
The five factors
The paper points out five implementation and hyperparameter choices that matter a lot for MAPPO’s performance:
- Value normalization.
- Value function inputs (what you feed the critic).
- Training data usage.
- Policy and value clipping.
- Death masking (how you handle agents that have died in an episode).
This list matters in practice: if your numbers do not match the paper, one of these five is a likely cause.
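Two of the five factors are easy to show in a few lines. The sketch below is my own simplification, not the official implementation. For value normalization I use a plain running mean and variance (Welford’s algorithm) as a stand-in for whatever running statistics the official code keeps. For death masking I follow my reading of the paper’s suggestion: a dead agent’s critic input becomes a zero vector plus its agent ID. All names are hypothetical.

```python
import math


class RunningNormalizer:
    """Running mean/std of value targets (Welford's algorithm, a stand-in)."""

    def __init__(self, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, targets):
        for x in targets:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def std(self):
        var = self.m2 / self.count if self.count > 0 else 1.0
        return math.sqrt(var + self.eps)

    def normalize(self, x):
        # The critic is regressed onto normalized targets.
        return (x - self.mean) / self.std()

    def denormalize(self, y):
        # Critic outputs are mapped back before computing advantages.
        return y * self.std() + self.mean


def death_masked_input(state, agent_id, n_agents, alive):
    """Critic input for one agent: zeroed state plus agent ID when dead."""
    one_hot = [1.0 if i == agent_id else 0.0 for i in range(n_agents)]
    features = list(state) if alive else [0.0] * len(state)
    return features + one_hot


norm = RunningNormalizer()
norm.update([10.0, 20.0, 30.0])
scaled = norm.normalize(20.0)  # a target at the running mean maps to 0.0
dead_input = death_masked_input([0.3, 0.7], agent_id=1, n_agents=3, alive=False)
```

If your curves look unstable on environments with large or shifting reward scales, value normalization is the first of these two I would check.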
Can you reproduce these papers? Do you need a GPU?
This is the practical question. The short version: MAPPO is the most reproducible on modest hardware, but CPU-only is realistic only for the simplest environments.
MAPPO
The original experiments ran on a single desktop: 256 GB RAM, one 64-core CPU, and one RTX 3090 GPU. The “single desktop, one GPU” setup was a deliberate point of the paper — it was meant to be reproducible without a cluster. The official code is maintained and widely used.
How hard it is depends heavily on the environment:
- MPE: small networks, low-dimensional. This genuinely runs on CPU in a reasonable time. It is the best starting point if you have no GPU.
- SMAC: needs the StarCraft II binary, which is heavy. It runs on CPU but slowly; a GPU helps a lot. Full results take many millions of steps per map.
- Hanabi: very sample-hungry; the paper reports billions of timesteps for full results. This is the hardest to reproduce cheaply, regardless of hardware.
- GRF (Google Research Football): not among the three benchmarks listed above, but later versions of the paper add it. It is also heavy to run.
One honest caveat from the authors: they suggest double-checking several factors before a run — rollout threads, episode length, PPO epochs, mini-batches, the clip term, and so on. MAPPO is sensitive to these, so “reproduce” can mean some tuning, even with the official code.
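These settings interact, which is part of why they are worth checking together. The sketch below does the arithmetic for one plausible way of counting samples, where every agent contributes one sample per step. The `RunConfig` fields are my own names rather than the official command-line flags, the numbers are purely illustrative, and the official code may count samples differently depending on the policy type.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical container for the settings named above."""
    rollout_threads: int
    episode_length: int
    n_agents: int
    ppo_epochs: int
    num_mini_batches: int
    clip: float


def describe(cfg):
    """Derive the effective batch sizes implied by a configuration."""
    samples = cfg.rollout_threads * cfg.episode_length * cfg.n_agents
    if samples % cfg.num_mini_batches != 0:
        raise ValueError("samples per update must divide evenly into mini-batches")
    return {
        "samples_per_update": samples,
        "mini_batch_size": samples // cfg.num_mini_batches,
        "gradient_steps_per_update": cfg.ppo_epochs * cfg.num_mini_batches,
    }


# Illustrative values only, not the paper's settings.
cfg = RunConfig(rollout_threads=8, episode_length=400, n_agents=3,
                ppo_epochs=5, num_mini_batches=2, clip=0.2)
summary = describe(cfg)
```

Halving the rollout threads to fit a smaller machine halves the samples per update, so a run that “uses the paper’s hyperparameters” on fewer threads is not really the same experiment.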
The CTDE papers (MADDPG / COMA / QMIX)
These papers define the paradigm, but they do not share a single codebase, so each has its own reproduction path:
- MADDPG: its original environment (MPE) is light and CPU-feasible. The original code is old (TensorFlow 1) and somewhat bit-rotted, but reimplementations exist in libraries such as EPyMARL.
- QMIX / COMA: in current practice mostly evaluated on SMAC, so the same StarCraft-binary overhead applies. The standard reference implementation is PyMARL (https://github.com/oxwhirl/pymarl), which most people use instead of re-implementing.
A practical recommendation
If you want CPU-only and a high chance of matching published numbers, start with MAPPO on MPE or MADDPG on MPE. They are small and fast, and the classic MADDPG-versus-MAPPO comparison (off-policy versus on-policy, both CTDE) lives there. Move to SMAC only once you have GPU access and patience for long runs.
A final, honest note: reproducibility in MARL is uneven. Seed variance is high, and small implementation details (the five MAPPO factors, parameter sharing, and so on) shift results noticeably. Budget time for tuning rather than expecting a single run to match a published curve.
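One cheap habit that helps here is to run several seeds and compare the spread, not a single curve, against the published number. The sketch below uses only the standard library. The function names, the two-standard-deviation threshold, and the returns in the example are placeholders I made up, not results from any paper.

```python
import statistics


def summarize_seeds(final_returns):
    """Mean and sample standard deviation of per-seed final returns."""
    mean = statistics.mean(final_returns)
    std = statistics.stdev(final_returns) if len(final_returns) > 1 else 0.0
    return mean, std


def within_spread(published, final_returns, n_std=2.0):
    """Rough check: is the published value within n_std of our seed mean?"""
    mean, std = summarize_seeds(final_returns)
    return abs(published - mean) <= n_std * std


# Placeholder numbers for three seeds of the same configuration.
my_returns = [10.0, 12.0, 14.0]
mean, std = summarize_seeds(my_returns)
plausible_match = within_spread(15.0, my_returns)
```

This is a sanity check, not a statistical test. With only a handful of seeds the standard deviation itself is noisy, so treat a borderline result as a reason to run more seeds.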
Summary
- CTDE is a paradigm: use all information during training, but let each agent act on local information only at execution time. Cite Amato 2024 for the definition; MADDPG / COMA / QMIX for the origin.
- MAPPO is PPO adapted for multiple agents, with a centralized critic and decentralized actors. It works well on MPE, SMAC, and Hanabi. Five implementation factors drive its performance.
- Reproducibility: MPE-based experiments are CPU-friendly. SMAC and Hanabi are heavy and want a GPU. Start small, expect some tuning.