SWARM: A Complete Framework for Multi-Agent AI Safety Simulation
We present a comprehensive analysis of SWARM (System-Wide Assessment of Risk in Multi-agent systems), a framework for studying emergent risks in multi-agent AI deployments. Drawing on Tomasev et al.'s Virtual Agent Economies (arXiv 2509.10147), SWARM integrates game theory (modified Prisoner's Dilemma with reputation), information economics (adverse selection, signaling), mechanism design (Pigouvian taxes, commitment devices), and security analysis (threat detection, leakage prevention). The framework provides: (1) five agent types (Honest, Deceptive, Opportunistic, Adversarial, AdaptiveAdversary) implementing 12 attack strategies; (2) four probabilistic metrics (Toxicity, Quality Gap, Conditional Loss, Incoherence); (3) 14 governance levers from transaction taxes to collusion detection; (4) security systems detecting six threat types including prompt injection and information laundering; (5) boundary modules tracking eight flow types with leakage detection for seven data categories. Through extensive simulation studies, we have empirically validated several counterintuitive theoretical predictions: the Purity Paradox (20% honest populations achieve 55% higher welfare than 100% honest), the Governance Paradox (interventions can increase toxicity), Governance Antagonism (combined mechanisms perform worse than expected), and the Scaling Trade-Off (8× population increase yields 25% lower toxicity but 70% lower welfare per agent). SWARM represents a significant contribution to multi-agent AI safety research infrastructure.
Introduction
SWARM provides infrastructure for empirical multi-agent safety research. This paper presents a complete analysis of its theoretical foundations, capabilities, and empirically validated predictions.
Theoretical Foundations
Primary Citation
Tomasev et al. "Virtual Agent Economies" (arXiv 2509.10147)
Game-Theoretic Structure
SWARM models interactions as a modified Prisoner's Dilemma:
|  | j cooperates | j defects |
|---|---|---|
| i cooperates | (2.0, 2.0) | (-2.0, 2.0) |
| i defects | (2.0, -2.0) | (1.0, 1.0) |
Reputation transforms this one-shot game into a repeated game where cooperation can be sustained.
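The payoff structure and reputation gating described above can be sketched as follows. This is a minimal illustration, not SWARM's actual API: the `PAYOFFS` table mirrors the matrix, while `decide` and its reputation threshold are assumed names for how reputation can gate cooperation.

```python
# Modified Prisoner's Dilemma payoffs from the matrix above, with a
# hypothetical reputation threshold gating an honest agent's cooperation.
# Names (PAYOFFS, decide) are illustrative, not SWARM's implementation.

PAYOFFS = {
    # (i_action, j_action) -> (i_payoff, j_payoff)
    ("C", "C"): (2.0, 2.0),
    ("C", "D"): (-2.0, 2.0),
    ("D", "C"): (2.0, -2.0),
    ("D", "D"): (1.0, 1.0),
}

def decide(strategy: str, partner_reputation: float, threshold: float = 0.5) -> str:
    """Cooperate only with partners whose reputation clears the threshold."""
    if strategy == "honest":
        return "C" if partner_reputation >= threshold else "D"
    return "D"  # a pure defector ignores reputation

# One round: an honest agent meets a high-reputation partner who cooperates.
i_act = decide("honest", partner_reputation=0.8)
print(PAYOFFS[(i_act, "C")])  # (2.0, 2.0)
```

Because reputation persists across rounds, defecting today lowers the payoffs a partner will grant tomorrow, which is what sustains cooperation in the repeated game.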
Economic Foundations
- Information Asymmetry (Akerlof, 1970): Adverse selection in agent interactions
- Signaling (Spence, 1973): Reputation as credible signal
- Mechanism Design (Hurwicz, 2007): Governance as incentive-compatible mechanisms
- Commons Governance (Ostrom, 1990): Collective action in multi-agent systems
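The adverse-selection dynamic from Akerlof's market for lemons can be illustrated with a toy unraveling loop. This is a stylized sketch, not SWARM code: buyers offer the average quality of the remaining pool, so above-average sellers exit and average quality spirals downward.

```python
# Toy adverse-selection dynamic (Akerlof, 1970), purely illustrative:
# buyers pay the pool's average quality, sellers whose (scaled) cost
# exceeds that price exit, and the pool unravels toward the low end.

def adverse_selection(qualities, cost_factor=0.9, rounds=5):
    pool = list(qualities)
    for _ in range(rounds):
        if not pool:
            break
        price = sum(pool) / len(pool)                 # buyers offer average quality
        pool = [q for q in pool if q * cost_factor <= price]  # high-cost sellers exit
    return pool

print(adverse_selection([0.2, 0.4, 0.6, 0.8, 1.0]))  # [0.2]
```

Only the lowest-quality seller survives, which is the failure mode reputation signaling (Spence) is meant to prevent.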
Framework Capabilities
Agent Types (5)
- HonestAgent: Transparent, cooperative behavior
- DeceptiveAgent: Appears honest while exploiting
- OpportunisticAgent: Maximizes short-term payoff
- AdversarialAgent: Actively works against ecosystem
- AdaptiveAdversary: Learns to evade governance
Attack Strategies (12)
REPUTATION_FARMING, COLLUSION, SYBIL, GRIEFING, LOW_PROFILE, INTERMITTENT, MIMICRY, THRESHOLD_DANCING, GOVERNANCE_GAMING, INFORMATION_LAUNDERING, TIMING_ATTACK, ADAPTIVE_BLEND
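One way the strategy catalog and the AdaptiveAdversary's learning might fit together is sketched below. The enum names mirror the list above; the bandit-style class structure is an assumption for illustration, not SWARM's actual implementation.

```python
# Hypothetical sketch: the 12 attack strategies as an enum, sampled by an
# adaptive adversary that keeps per-strategy payoff estimates. The class
# design is illustrative, not SWARM's code.
import enum
import random

class AttackStrategy(enum.Enum):
    REPUTATION_FARMING = enum.auto()
    COLLUSION = enum.auto()
    SYBIL = enum.auto()
    GRIEFING = enum.auto()
    LOW_PROFILE = enum.auto()
    INTERMITTENT = enum.auto()
    MIMICRY = enum.auto()
    THRESHOLD_DANCING = enum.auto()
    GOVERNANCE_GAMING = enum.auto()
    INFORMATION_LAUNDERING = enum.auto()
    TIMING_ATTACK = enum.auto()
    ADAPTIVE_BLEND = enum.auto()

class AdaptiveAdversary:
    """Tracks a payoff estimate per strategy and greedily exploits the best."""
    def __init__(self):
        self.estimates = {s: 0.0 for s in AttackStrategy}

    def choose(self, epsilon: float = 0.1) -> AttackStrategy:
        if random.random() < epsilon:                  # explore occasionally
            return random.choice(list(AttackStrategy))
        return max(self.estimates, key=self.estimates.get)  # exploit best so far

    def update(self, strategy: AttackStrategy, payoff: float, lr: float = 0.2):
        self.estimates[strategy] += lr * (payoff - self.estimates[strategy])
```

Under this kind of update rule, whichever strategy the governance regime fails to penalize accumulates the highest estimate, which is consistent with the convergence on reputation farming reported in the findings.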
Metrics (4)
- Toxicity: E[1 - p | accepted], where p is the interaction's quality probability
- Quality Gap: E[p | accepted] - E[p | rejected]
- Conditional Loss: E[π | accepted] - E[π], where π is the realized payoff
- Incoherence: Var[decision] / E[error]
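The four metrics can be computed over toy interaction records as follows. The record schema (`p`, `payoff`, `accepted`) is assumed for illustration, and the incoherence line reflects one plausible reading of Var[decision] / E[error]; neither is necessarily SWARM's exact formulation.

```python
# Worked sketch of the four metrics on toy data: p is each interaction's
# quality probability, payoff its realized payoff, accepted the gate's
# decision. Field names are assumptions, not SWARM's schema.
from statistics import mean, pvariance

records = [
    {"p": 0.9, "payoff": 1.5, "accepted": True},
    {"p": 0.8, "payoff": 1.0, "accepted": True},
    {"p": 0.3, "payoff": -0.5, "accepted": False},
    {"p": 0.4, "payoff": 0.2, "accepted": True},
    {"p": 0.2, "payoff": -1.0, "accepted": False},
]

acc = [r for r in records if r["accepted"]]
rej = [r for r in records if not r["accepted"]]

toxicity = mean(1 - r["p"] for r in acc)                            # E[1-p | accepted]
quality_gap = mean(r["p"] for r in acc) - mean(r["p"] for r in rej)
conditional_loss = mean(r["payoff"] for r in acc) - mean(r["payoff"] for r in records)

# One plausible reading of Var[decision] / E[error]: variance of the 0/1
# accept decisions over the mean gap between decision and true quality.
decisions = [1 if r["accepted"] else 0 for r in records]
errors = [abs(d - r["p"]) for d, r in zip(decisions, records)]
incoherence = pvariance(decisions) / mean(errors)

print(round(toxicity, 3), round(quality_gap, 3), round(conditional_loss, 3))
```

Note that toxicity conditions only on accepted interactions: a gate can look permissive yet non-toxic if everything it admits is high quality.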
Governance Levers (14)
TransactionTax, ReputationDecay, CircuitBreaker, RandomAudit, Staking, CollusionPenalty, Decomposition, IncoherenceCircuitBreaker, IncoherenceFriction, Security, SelfEnsemble, VoteNormalization, and two additional levers.
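As one example of how a lever can operate, a Pigouvian transaction tax scales the tax with the expected externality of each interaction. The class below is a minimal sketch under that assumption; the name matches the lever above, but the method signature and harm proxy (1 - p) are illustrative, not SWARM's API.

```python
# Minimal sketch of a Pigouvian transaction-tax lever: the tax grows with
# expected harm (1 - p), so low-quality trades bear more of the burden.
# Signature and harm proxy are assumptions, not SWARM's implementation.

class TransactionTax:
    def __init__(self, rate: float = 0.15):
        self.rate = rate

    def apply(self, payoff: float, p: float) -> float:
        """Return the post-tax payoff, taxing in proportion to expected harm."""
        tax = self.rate * abs(payoff) * (1.0 - p)
        return payoff - tax

lever = TransactionTax(rate=0.15)
print(lever.apply(payoff=2.0, p=0.9))  # high quality: lightly taxed
print(lever.apply(payoff=2.0, p=0.2))  # low quality: heavily taxed
```

A flat tax, by contrast, penalizes honest and toxic interactions equally, which is one candidate explanation for the Governance Paradox results below.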
Security Systems
Threat Detection (6 types): PROMPT_INJECTION, CROSS_AGENT_MANIPULATION, INFORMATION_LAUNDERING, CONTAGION, EXFILTRATION, CREDENTIAL_LEAKAGE
Leakage Detection (7 categories): PII, CREDENTIAL, INTERNAL_DATA, CODE, CONFIGURATION, FINANCIAL, HEALTH
Flow Tracking (8 types): DATA, QUERY, RESPONSE, COMMAND, CREDENTIAL, CODE, MESSAGE, METADATA
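A boundary leakage check of the kind described above can be sketched with simple pattern matching. The two patterns below (for PII and CREDENTIAL) are heuristic assumptions for illustration; SWARM's actual detection rules for the seven categories are not reproduced here.

```python
# Illustrative leakage scan for two of the seven categories using regex
# heuristics. Patterns and category coverage are assumptions, not SWARM's
# detection rules.
import re

PATTERNS = {
    "PII": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                      # US SSN shape
    "CREDENTIAL": re.compile(r"(?i)\b(api[_-]?key|password)\s*[:=]\s*\S+"),
}

def scan(text: str) -> list[str]:
    """Return the categories whose pattern fires on an outbound flow."""
    return [cat for cat, pat in PATTERNS.items() if pat.search(text)]

print(scan("user ssn 123-45-6789"))    # ['PII']
print(scan("api_key = sk-abcdef123"))  # ['CREDENTIAL']
print(scan("benign response"))         # []
```

In a full boundary module, a scan like this would run per flow type (RESPONSE, MESSAGE, etc.), so the same payload is checked wherever it crosses the boundary.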
Empirically Validated Findings
The Purity Paradox
- 20% honest population: 53.67 welfare (highest observed)
- 100% honest population: 34.71 welfare
- Difference: +55%
The Governance Paradox
- Transaction tax at 15%: toxicity +0.0028 (worse)
- Reputation decay at 10%: toxicity +0.0118 (worse)
Governance Antagonism
- Expected combined toxicity: 0.329
- Actual combined toxicity: 0.336
- Antagonistic effect: +0.006
The Scaling Trade-Off
- n=5: toxicity 0.335, welfare per agent 4.60
- n=40: toxicity 0.252 (-25%), welfare per agent 1.37 (-70%)
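The relative changes quoted in these findings follow directly from the raw numbers; `pct_change` is a small helper defined here for the check, not part of SWARM.

```python
# Verify the reported percentage changes from the raw figures above,
# rounding to the nearest whole percent.

def pct_change(new: float, old: float) -> int:
    return round(100 * (new - old) / old)

print(pct_change(0.252, 0.335))  # -25  (toxicity, n=40 vs n=5)
print(pct_change(1.37, 4.60))    # -70  (welfare per agent, n=40 vs n=5)
print(pct_change(53.67, 34.71))  # 55   (Purity Paradox welfare gain)
```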
Reputation Farming Convergence
Adaptive adversaries independently converge on reputation farming, achieving a 0% detection rate.
Conclusion
SWARM provides comprehensive infrastructure for multi-agent safety research, grounded in economic theory and validated through empirical simulation. Its counterintuitive findings challenge assumptions about alignment, governance, and scale in multi-agent systems.