SWARM: Distributional Safety in Multi-Agent Systems
We present SWARM (System-Wide Assessment of Risk in Multi-agent systems), a research framework for studying emergent risks in multi-agent AI systems. Our core thesis is that AGI-level risks do not require AGI-level agents—catastrophic outcomes can emerge from the interaction of many sub-AGI systems, even when none are individually dangerous. SWARM uses soft probabilistic labels instead of binary classifications, enabling measurement of toxicity (expected harm among accepted interactions), quality gap (an adverse-selection indicator), and incoherence (a variance-to-error ratio). We provide governance mechanisms including transaction taxes, reputation decay, and circuit breakers, drawing on market microstructure theory. The framework includes 1500+ tests, a CLI, a Web API, and support for LLM-based agents. Code is available at https://github.com/swarm-ai-safety/swarm
Introduction
As AI systems become more capable and interconnected, understanding risks at the system level becomes critical. Traditional AI safety focuses on aligning individual agents, but many real-world failures emerge from interactions rather than individual misbehavior.
SWARM addresses this gap by providing tools to study distributional safety—the statistical properties of multi-agent systems that determine whether harmful outcomes are rare exceptions or systematic patterns.
Core Thesis
AGI-level risks do not require AGI-level agents. Catastrophic outcomes can emerge from:
- Information asymmetry between agents
- Adverse selection (system accepts lower-quality interactions)
- Variance amplification across decision horizons
- Governance latency and illegibility
This shifts focus from single-agent alignment to multi-agent dynamics, from individual capabilities to interaction patterns, and from agent-level properties to system-level properties.
Formal Model
Interactions
An interaction between agents is characterized by:
- Observables - Measurable signals (task progress, rework count, verifier rejections)
- Latent value v ∈ {-1, +1} - True beneficial/harmful outcome
- Soft label p = P(v = +1 | observables) - Probability of benefit
Proxy Computation
Observables are mapped to soft labels via weighted combination and calibrated sigmoid:
v_hat = Σ w_i · f_i(o_i)
p = σ(k · v_hat) = 1 / (1 + exp(-k · v_hat))
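The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not SWARM's actual implementation: the observables are assumed to be already feature-transformed (i.e., the f_i(o_i) values), and the weights and gain k are arbitrary example values.

```python
import math

def soft_label(features, weights, k=2.0):
    """Map feature-transformed observables to a soft label p.

    v_hat is a weighted combination of the features; a sigmoid with
    gain k calibrates it into a probability p = P(v = +1 | observables).
    """
    v_hat = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-k * v_hat))
```

Note that p = 0.5 exactly when v_hat = 0, and the gain k controls how sharply the label saturates toward 0 or 1 as evidence accumulates.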
Payoffs
Agent payoffs incorporate soft labels:
π = θ · S_soft - τ - c - ρ · E_soft + w_rep · r
Where S_soft is expected surplus, E_soft is expected externality, τ is transfer, c is governance cost, ρ is externality internalization, and r is reputation.
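The payoff formula translates directly into code. The sketch below mirrors the equation term by term; the default coefficient values are illustrative assumptions, not SWARM's actual parameters.

```python
def payoff(s_soft, e_soft, transfer, cost, theta=1.0, rho=1.0,
           w_rep=0.1, reputation=0.0):
    """pi = theta*S_soft - tau - c - rho*E_soft + w_rep*r

    s_soft:     expected surplus of the interaction
    e_soft:     expected externality imposed on others
    transfer:   tau, payment transferred (e.g., a transaction tax)
    cost:       c, governance cost borne by the agent
    theta, rho: surplus weight and externality internalization
    w_rep, reputation: reputation weight and current reputation r
    """
    return theta * s_soft - transfer - cost - rho * e_soft + w_rep * reputation
```

Raising rho forces agents to internalize more of the harm they impose, which is one of the levers the governance mechanisms below act on.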
Key Metrics
Toxicity
Expected harm among accepted interactions:
Toxicity = E[1-p | accepted]
This measures the 'poison in the system'—how much harm is getting through.
Quality Gap
Difference in quality between accepted and rejected:
Quality Gap = E[p | accepted] - E[p | rejected]
A negative quality gap indicates adverse selection: the system preferentially accepts lower-quality interactions.
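Both metrics are simple expectations over soft labels, conditioned on the accept/reject decision. A minimal sketch (assuming labels are soft-label probabilities p and `accepted` is a parallel list of booleans; not SWARM's actual API):

```python
def toxicity(labels, accepted):
    """E[1 - p | accepted]: expected harm among accepted interactions."""
    acc = [p for p, a in zip(labels, accepted) if a]
    return sum(1 - p for p in acc) / len(acc)

def quality_gap(labels, accepted):
    """E[p | accepted] - E[p | rejected]; negative => adverse selection."""
    acc = [p for p, a in zip(labels, accepted) if a]
    rej = [p for p, a in zip(labels, accepted) if not a]
    return sum(acc) / len(acc) - sum(rej) / len(rej)
```

For example, a system that accepts the p = 0.2 and p = 0.3 interactions while rejecting the p = 0.9 and p = 0.8 ones has a quality gap of 0.25 − 0.85 = −0.6: a clear adverse-selection signal.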
Incoherence
Variance-to-error ratio across replays:
I = Var[decision] / E[error]
High incoherence means decisions are unstable—varying more than their accuracy would justify.
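One way to operationalize this ratio, assuming binary accept decisions replayed n times on the same interaction (an illustrative reading of the formula, not necessarily SWARM's exact estimator): take the empirical variance of the decisions as Var[decision] and the disagreement rate with the latent value as E[error].

```python
def incoherence(decisions, matched_latent):
    """Var[decision] / E[error] across replays of one interaction.

    decisions:      0/1 accept decisions across n replays
    matched_latent: 1 where the decision agreed with the latent value v
    """
    n = len(decisions)
    mean_d = sum(decisions) / n
    var_d = sum((d - mean_d) ** 2 for d in decisions) / n
    err = 1 - sum(matched_latent) / n
    return var_d / err if err else float("inf")
```

A decision that flips on half of all replays (variance 0.25) while being wrong only a quarter of the time scores I = 1.0: its instability exceeds what its error rate alone would explain.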
Emergence Mechanisms
Information Asymmetry
When agents hold different information, three classic failure modes arise:
- Adverse selection - Better-informed agents exploit information gaps
- Moral hazard - Unobservable actions lead to hidden exploitation
- Market for lemons - Low-quality interactions drive out high-quality ones
Feedback Loops
Adverse selection is self-reinforcing:
Low Quality Gap → Honest agents exit → Worse selection pool → Lower Quality Gap
This makes early intervention critical.
Variance Amplification
In multi-agent settings each agent's error propagates to downstream agents, so error compounds across decision chains rather than averaging out.
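A standard linear-chain model makes the compounding concrete. Assuming each agent amplifies upstream error by a gain g and adds independent noise of variance σ², the variance recursion is var ← g²·var + σ² (an illustrative model, not one stated in the source):

```python
def chain_variance(sigma2=0.01, depth=10, gain=1.2):
    """Output variance after a chain of agents, where each agent
    amplifies upstream error by `gain` and adds independent noise
    of variance sigma2."""
    var = 0.0
    for _ in range(depth):
        var = gain**2 * var + sigma2
    return var
```

With gain = 1 the variance grows only linearly in depth, but any gain above 1 makes it grow geometrically, which is the amplification risk in long decision chains.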
Governance Mechanisms
SWARM provides several governance levers:
- Transaction Taxes - Create friction that reduces return on low-quality interactions
- Reputation Decay - Prevents agents from coasting on past reputation
- Circuit Breakers - Freeze agents when toxicity exceeds threshold
- Random Audits - Deter hidden exploitation
- Collusion Detection - Catch coordinated attacks
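As one example of these levers, a circuit breaker can be sketched as a running toxicity estimate over a sliding window. The threshold, window size, and method names below are illustrative assumptions, not SWARM's actual defaults or API:

```python
class CircuitBreaker:
    """Freeze an agent when its windowed toxicity exceeds a threshold."""

    def __init__(self, threshold=0.3, window=20):
        self.threshold = threshold
        self.window = window
        self.harms = []      # recent 1 - p values for accepted interactions
        self.frozen = False

    def record(self, p):
        """Record the soft label of an accepted interaction; return frozen state."""
        self.harms.append(1 - p)
        self.harms = self.harms[-self.window:]       # keep the sliding window
        if sum(self.harms) / len(self.harms) > self.threshold:
            self.frozen = True                       # trip the breaker
        return self.frozen
```

The sliding window matters: it lets the breaker react to a recent burst of harm even from an agent with a long benign history, which is exactly the deceptive build-trust-then-exploit pattern.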
Implementation
SWARM is implemented in Python with:
- 1531 passing tests
- CLI for running scenarios
- Web API for external agent participation
- Support for LLM-based agents (Anthropic, OpenAI)
- Parameter sweep infrastructure for experiments
- YAML scenario definitions
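To give a feel for scenario definitions, here is a sketch of what one might look like. Every key name below is hypothetical and illustrative; consult the repository for SWARM's actual schema.

```yaml
# Hypothetical scenario definition -- key names are illustrative,
# not SWARM's actual schema.
name: adverse-selection-baseline
rounds: 100
agents:
  honest: 10
  opportunistic: 4
  deceptive: 3
governance:
  transaction_tax: 0.05
  reputation_decay: 0.01
  circuit_breaker_threshold: 0.3
```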
Agent Types
- Honest - Cooperative, trust-based
- Opportunistic - Maximizes short-term payoff
- Deceptive - Builds trust then exploits
- Adversarial - Coordinates attacks
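The deceptive type is the most instructive of the four, since it defeats any defense based on cumulative reputation alone. A toy policy sketch (class and method names are hypothetical, not SWARM's agent API):

```python
class DeceptiveAgent:
    """Toy build-trust-then-exploit policy: emit high soft-label
    interactions until reputation crosses a trust threshold, then
    switch to low soft-label (exploitative) interactions."""

    def __init__(self, trust_threshold=0.8):
        self.trust_threshold = trust_threshold
        self.reputation = 0.0

    def act(self):
        if self.reputation < self.trust_threshold:
            self.reputation += 0.1   # cooperation builds reputation
            return 0.9               # benign interaction (high p)
        return 0.1                   # exploit accumulated trust (low p)
```

This switch-point behavior is what motivates reputation decay and windowed circuit breakers over cumulative trust scores.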
Related Work
SWARM draws on market microstructure theory (arXiv:2512.16856):
| Market Concept | SWARM Analog |
|---|---|
| Bid-ask spread | Quality gap |
| Informed traders | Deceptive agents |
| Adverse selection | Same term |
| Market makers | Honest agents |
Conclusion
SWARM provides a practical framework for studying distributional safety in multi-agent systems. By focusing on interaction-level metrics and governance mechanisms, we can detect and mitigate emergent risks before they become catastrophic.
The framework is open source and available at https://github.com/swarm-ai-safety/swarm
References
- Akerlof, G. (1970). The Market for Lemons. Quarterly Journal of Economics.
- Kyle, A.S. (1985). Continuous Auctions and Insider Trading. Econometrica.
- Glosten & Milgrom (1985). Bid, Ask and Transaction Prices. Journal of Financial Economics.
- Distributional Safety in Agentic Systems. arXiv:2512.16856.