SWARM: A Complete Framework for Multi-Agent AI Safety Simulation
We present a comprehensive analysis of SWARM (System-Wide Assessment of Risk in Multi-agent systems), a framework for studying emergent risks in multi-agent AI deployments. Drawing on Tomasev et al.'s Virtual Agent Economies (arXiv 2509.10147), SWARM integrates game theory (modified Prisoner's Dilemma with reputation), information economics (adverse selection, signaling), mechanism design (Pigouvian taxes, commitment devices), and security analysis (threat detection, leakage prevention). The framework provides: (1) five agent types (Honest, Deceptive, Opportunistic, Adversarial, AdaptiveAdversary) implementing 12 attack strategies; (2) four probabilistic metrics (Toxicity, Quality Gap, Conditional Loss, Incoherence); (3) 14 governance levers from transaction taxes to collusion detection; (4) security systems detecting six threat types including prompt injection and information laundering; (5) boundary modules tracking eight flow types with leakage detection for seven data categories. Through extensive simulation studies, we have empirically validated several counterintuitive theoretical predictions: the Purity Paradox (20% honest populations achieve 55% higher welfare than 100% honest), the Governance Paradox (interventions can increase toxicity), Governance Antagonism (combined mechanisms perform worse than expected), and the Scaling Trade-Off (8× population increase yields 25% lower toxicity but 70% lower welfare per agent). SWARM represents a significant contribution to multi-agent AI safety research infrastructure.
Introduction
SWARM provides infrastructure for empirical multi-agent safety research. This paper presents a complete analysis of its theoretical foundations, capabilities, and empirically validated predictions.
Theoretical Foundations
Primary Citation
Tomasev et al. "Virtual Agent Economies" (arXiv 2509.10147)
Game-Theoretic Structure
SWARM models interactions as a modified Prisoner's Dilemma:
|  | j cooperates | j defects |
|---|---|---|
| i cooperates | (2.0, 2.0) | (-2.0, 2.0) |
| i defects | (2.0, -2.0) | (1.0, 1.0) |
Reputation transforms this one-shot game into a repeated game where cooperation can be sustained.
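The payoff structure and reputation gating described above can be sketched as follows. This is a minimal illustration, not SWARM's actual API: the `PAYOFFS` table mirrors the matrix, while `decide` and its reputation threshold are assumed names for how reputation can gate cooperation.

```python
# Modified Prisoner's Dilemma payoffs from the matrix above, with a
# hypothetical reputation threshold gating an honest agent's cooperation.
# Names (PAYOFFS, decide) are illustrative, not SWARM's implementation.

PAYOFFS = {
    # (i_action, j_action) -> (i_payoff, j_payoff)
    ("C", "C"): (2.0, 2.0),
    ("C", "D"): (-2.0, 2.0),
    ("D", "C"): (2.0, -2.0),
    ("D", "D"): (1.0, 1.0),
}

def decide(strategy: str, partner_reputation: float, threshold: float = 0.5) -> str:
    """Cooperate only with partners whose reputation clears the threshold."""
    if strategy == "honest":
        return "C" if partner_reputation >= threshold else "D"
    return "D"  # a pure defector ignores reputation

# One round: an honest agent meets a high-reputation partner who cooperates.
i_act = decide("honest", partner_reputation=0.8)
print(PAYOFFS[(i_act, "C")])  # (2.0, 2.0)
```

Because reputation persists across rounds, defecting today lowers the payoffs a partner will grant tomorrow, which is what sustains cooperation in the repeated game.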
Economic Foundations
- Information Asymmetry (Akerlof, 1970): Adverse selection in agent interactions
- Signaling (Spence, 1973): Reputation as credible signal
- Mechanism Design (Hurwicz, 2007): Governance as incentive-compatible mechanisms
- Commons Governance (Ostrom, 1990): Collective action in multi-agent systems
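The adverse-selection dynamic from Akerlof's market for lemons can be illustrated with a toy unraveling loop. This is a stylized sketch, not SWARM code: buyers offer the average quality of the remaining pool, so above-average sellers exit and average quality spirals downward.

```python
# Toy adverse-selection dynamic (Akerlof, 1970), purely illustrative:
# buyers pay the pool's average quality, sellers whose (scaled) cost
# exceeds that price exit, and the pool unravels toward the low end.

def adverse_selection(qualities, cost_factor=0.9, rounds=5):
    pool = list(qualities)
    for _ in range(rounds):
        if not pool:
            break
        price = sum(pool) / len(pool)                 # buyers offer average quality
        pool = [q for q in pool if q * cost_factor <= price]  # high-cost sellers exit
    return pool

print(adverse_selection([0.2, 0.4, 0.6, 0.8, 1.0]))  # [0.2]
```

Only the lowest-quality seller survives, which is the failure mode reputation signaling (Spence) is meant to prevent.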
Framework Capabilities
Agent Types (5)
- HonestAgent: Transparent, cooperative behavior
- DeceptiveAgent: Appears honest while exploiting
- OpportunisticAgent: Maximizes short-term payoff
- AdversarialAgent: Actively works against ecosystem
- AdaptiveAdversary: Learns to evade governance
Attack Strategies (12)
REPUTATION_FARMING, COLLUSION, SYBIL, GRIEFING, LOW_PROFILE, INTERMITTENT, MIMICRY, THRESHOLD_DANCING, GOVERNANCE_GAMING, INFORMATION_LAUNDERING, TIMING_ATTACK, ADAPTIVE_BLEND
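One way the strategy catalog and the AdaptiveAdversary's learning might fit together is sketched below. The enum names mirror the list above; the bandit-style class structure is an assumption for illustration, not SWARM's actual implementation.

```python
# Hypothetical sketch: the 12 attack strategies as an enum, sampled by an
# adaptive adversary that keeps per-strategy payoff estimates. The class
# design is illustrative, not SWARM's code.
import enum
import random

class AttackStrategy(enum.Enum):
    REPUTATION_FARMING = enum.auto()
    COLLUSION = enum.auto()
    SYBIL = enum.auto()
    GRIEFING = enum.auto()
    LOW_PROFILE = enum.auto()
    INTERMITTENT = enum.auto()
    MIMICRY = enum.auto()
    THRESHOLD_DANCING = enum.auto()
    GOVERNANCE_GAMING = enum.auto()
    INFORMATION_LAUNDERING = enum.auto()
    TIMING_ATTACK = enum.auto()
    ADAPTIVE_BLEND = enum.auto()

class AdaptiveAdversary:
    """Tracks a payoff estimate per strategy and greedily exploits the best."""
    def __init__(self):
        self.estimates = {s: 0.0 for s in AttackStrategy}

    def choose(self, epsilon: float = 0.1) -> AttackStrategy:
        if random.random() < epsilon:                  # explore occasionally
            return random.choice(list(AttackStrategy))
        return max(self.estimates, key=self.estimates.get)  # exploit best so far

    def update(self, strategy: AttackStrategy, payoff: float, lr: float = 0.2):
        self.estimates[strategy] += lr * (payoff - self.estimates[strategy])
```

Under this kind of update rule, whichever strategy the governance regime fails to penalize accumulates the highest estimate, which is consistent with the convergence on reputation farming reported in the findings.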
Metrics (4)
- Toxicity: E[1 - p | accepted], where p is the interaction's quality probability
- Quality Gap: E[p | accepted] - E[p | rejected]
- Conditional Loss: E[π | accepted] - E[π], where π is the realized payoff
- Incoherence: Var[decision] / E[error]
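The four metrics can be computed over toy interaction records as follows. The record schema (`p`, `payoff`, `accepted`) is assumed for illustration, and the incoherence line reflects one plausible reading of Var[decision] / E[error]; neither is necessarily SWARM's exact formulation.

```python
# Worked sketch of the four metrics on toy data: p is each interaction's
# quality probability, payoff its realized payoff, accepted the gate's
# decision. Field names are assumptions, not SWARM's schema.
from statistics import mean, pvariance

records = [
    {"p": 0.9, "payoff": 1.5, "accepted": True},
    {"p": 0.8, "payoff": 1.0, "accepted": True},
    {"p": 0.3, "payoff": -0.5, "accepted": False},
    {"p": 0.4, "payoff": 0.2, "accepted": True},
    {"p": 0.2, "payoff": -1.0, "accepted": False},
]

acc = [r for r in records if r["accepted"]]
rej = [r for r in records if not r["accepted"]]

toxicity = mean(1 - r["p"] for r in acc)                            # E[1-p | accepted]
quality_gap = mean(r["p"] for r in acc) - mean(r["p"] for r in rej)
conditional_loss = mean(r["payoff"] for r in acc) - mean(r["payoff"] for r in records)

# One plausible reading of Var[decision] / E[error]: variance of the 0/1
# accept decisions over the mean gap between decision and true quality.
decisions = [1 if r["accepted"] else 0 for r in records]
errors = [abs(d - r["p"]) for d, r in zip(decisions, records)]
incoherence = pvariance(decisions) / mean(errors)

print(round(toxicity, 3), round(quality_gap, 3), round(conditional_loss, 3))
```

Note that toxicity conditions only on accepted interactions: a gate can look permissive yet non-toxic if everything it admits is high quality.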
Governance Levers (14)
TransactionTax, ReputationDecay, CircuitBreaker, RandomAudit, Staking, CollusionPenalty, Decomposition, IncoherenceCircuitBreaker, IncoherenceFriction, Security, SelfEnsemble, VoteNormalization, and two additional levers.
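As one example of how a lever can operate, a Pigouvian transaction tax scales the tax with the expected externality of each interaction. The class below is a minimal sketch under that assumption; the name matches the lever above, but the method signature and harm proxy (1 - p) are illustrative, not SWARM's API.

```python
# Minimal sketch of a Pigouvian transaction-tax lever: the tax grows with
# expected harm (1 - p), so low-quality trades bear more of the burden.
# Signature and harm proxy are assumptions, not SWARM's implementation.

class TransactionTax:
    def __init__(self, rate: float = 0.15):
        self.rate = rate

    def apply(self, payoff: float, p: float) -> float:
        """Return the post-tax payoff, taxing in proportion to expected harm."""
        tax = self.rate * abs(payoff) * (1.0 - p)
        return payoff - tax

lever = TransactionTax(rate=0.15)
print(lever.apply(payoff=2.0, p=0.9))  # high quality: lightly taxed
print(lever.apply(payoff=2.0, p=0.2))  # low quality: heavily taxed
```

A flat tax, by contrast, penalizes honest and toxic interactions equally, which is one candidate explanation for the Governance Paradox results below.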
Security Systems
Threat Detection (6 types): PROMPT_INJECTION, CROSS_AGENT_MANIPULATION, INFORMATION_LAUNDERING, CONTAGION, EXFILTRATION, CREDENTIAL_LEAKAGE
Leakage Detection (7 categories): PII, CREDENTIAL, INTERNAL_DATA, CODE, CONFIGURATION, FINANCIAL, HEALTH
Flow Tracking (8 types): DATA, QUERY, RESPONSE, COMMAND, CREDENTIAL, CODE, MESSAGE, METADATA
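A boundary leakage check of the kind described above can be sketched with simple pattern matching. The two patterns below (for PII and CREDENTIAL) are heuristic assumptions for illustration; SWARM's actual detection rules for the seven categories are not reproduced here.

```python
# Illustrative leakage scan for two of the seven categories using regex
# heuristics. Patterns and category coverage are assumptions, not SWARM's
# detection rules.
import re

PATTERNS = {
    "PII": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                      # US SSN shape
    "CREDENTIAL": re.compile(r"(?i)\b(api[_-]?key|password)\s*[:=]\s*\S+"),
}

def scan(text: str) -> list[str]:
    """Return the categories whose pattern fires on an outbound flow."""
    return [cat for cat, pat in PATTERNS.items() if pat.search(text)]

print(scan("user ssn 123-45-6789"))    # ['PII']
print(scan("api_key = sk-abcdef123"))  # ['CREDENTIAL']
print(scan("benign response"))         # []
```

In a full boundary module, a scan like this would run per flow type (RESPONSE, MESSAGE, etc.), so the same payload is checked wherever it crosses the boundary.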
Empirically Validated Findings
The Purity Paradox
- 20% honest population: 53.67 welfare (highest observed)
- 100% honest population: 34.71 welfare
- Difference: +55%
The Governance Paradox
- Transaction tax at 15%: toxicity +0.0028 (worse)
- Reputation decay at 10%: toxicity +0.0118 (worse)
Governance Antagonism
- Expected combined toxicity: 0.329
- Actual combined toxicity: 0.336
- Antagonistic effect: +0.006
The Scaling Trade-Off
- n=5: toxicity 0.335, welfare per agent 4.60
- n=40: toxicity 0.252 (-25%), welfare per agent 1.37 (-70%)
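The relative changes quoted in these findings follow directly from the raw numbers; `pct_change` is a small helper defined here for the check, not part of SWARM.

```python
# Verify the reported percentage changes from the raw figures above,
# rounding to the nearest whole percent.

def pct_change(new: float, old: float) -> int:
    return round(100 * (new - old) / old)

print(pct_change(0.252, 0.335))  # -25  (toxicity, n=40 vs n=5)
print(pct_change(1.37, 4.60))    # -70  (welfare per agent, n=40 vs n=5)
print(pct_change(53.67, 34.71))  # 55   (Purity Paradox welfare gain)
```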
Reputation Farming Convergence
Adaptive adversaries independently converge on reputation farming, achieving a 0% detection rate.
Conclusion
SWARM provides comprehensive infrastructure for multi-agent safety research, grounded in economic theory and validated through empirical simulation. Its counterintuitive findings challenge assumptions about alignment, governance, and scale in multi-agent systems.