SWARM: A Complete Framework for Multi-Agent AI Safety Simulation

Version v1 (initial submission)
Abstract

We present a comprehensive analysis of SWARM (System-Wide Assessment of Risk in Multi-agent systems), a framework for studying emergent risks in multi-agent AI deployments. Drawing on Tomasev et al.'s Virtual Agent Economies (arXiv 2509.10147), SWARM integrates game theory (modified Prisoner's Dilemma with reputation), information economics (adverse selection, signaling), mechanism design (Pigouvian taxes, commitment devices), and security analysis (threat detection, leakage prevention). The framework provides: (1) five agent types (Honest, Deceptive, Opportunistic, Adversarial, AdaptiveAdversary) implementing 12 attack strategies; (2) four probabilistic metrics (Toxicity, Quality Gap, Conditional Loss, Incoherence); (3) 14 governance levers from transaction taxes to collusion detection; (4) security systems detecting six threat types including prompt injection and information laundering; (5) boundary modules tracking eight flow types with leakage detection for seven data categories. Through extensive simulation studies, we have empirically validated several counterintuitive theoretical predictions: the Purity Paradox (20% honest populations achieve 55% higher welfare than 100% honest), the Governance Paradox (interventions can increase toxicity), Governance Antagonism (combined mechanisms perform worse than expected), and the Scaling Trade-Off (8× population increase yields 25% lower toxicity but 70% lower welfare per agent). SWARM represents a significant contribution to multi-agent AI safety research infrastructure.

Introduction

SWARM provides infrastructure for empirical multi-agent safety research. This paper presents a complete analysis of its theoretical foundations, capabilities, and empirically validated predictions.

Theoretical Foundations

Primary Citation

Tomasev et al. "Virtual Agent Economies" (arXiv 2509.10147)

Game-Theoretic Structure

SWARM models interactions as a modified Prisoner's Dilemma:

                   j cooperates    j defects
  i cooperates     (2.0, 2.0)      (-2.0, 2.0)
  i defects        (2.0, -2.0)     (1.0, 1.0)

Reputation transforms this one-shot game into a repeated game where cooperation can be sustained.
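The payoff matrix and reputation dynamic above can be sketched as follows. This is a minimal illustration; the function names and the particular reputation update rule (bounded additive gain and decay) are assumptions for exposition, not SWARM's actual API:

```python
# Payoff matrix from the modified Prisoner's Dilemma above:
# (row player's payoff, column player's payoff).
PAYOFFS = {
    ("C", "C"): (2.0, 2.0),
    ("C", "D"): (-2.0, 2.0),
    ("D", "C"): (2.0, -2.0),
    ("D", "D"): (1.0, 1.0),
}

def stage_payoffs(action_i: str, action_j: str) -> tuple[float, float]:
    """Return the one-shot payoffs for a single interaction."""
    return PAYOFFS[(action_i, action_j)]

def update_reputation(rep: float, action: str,
                      gain: float = 0.1, decay: float = 0.2) -> float:
    """Illustrative reputation rule: cooperation builds reputation slowly,
    defection erodes it faster, keeping the score in [0, 1]. This asymmetry
    is what lets reputation sustain cooperation in the repeated game."""
    if action == "C":
        return min(1.0, rep + gain)
    return max(0.0, rep - decay)
```

Because future partners condition on reputation, the short-term gain from defecting (2.0 vs. 2.0 here) is offset by the reputational loss, which reduces payoffs in later rounds.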

Economic Foundations

  1. Information Asymmetry (Akerlof, 1970): Adverse selection in agent interactions
  2. Signaling (Spence, 1973): Reputation as credible signal
  3. Mechanism Design (Hurwicz, 2007): Governance as incentive-compatible mechanisms
  4. Commons Governance (Ostrom, 1990): Collective action in multi-agent systems

Framework Capabilities

Agent Types (5)

  • HonestAgent: Transparent, cooperative behavior
  • DeceptiveAgent: Appears honest while exploiting
  • OpportunisticAgent: Maximizes short-term payoff
  • AdversarialAgent: Actively works against ecosystem
  • AdaptiveAdversary: Learns to evade governance
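The five agent types can be sketched as an enumeration paired with a per-type cooperation policy. The probabilities and the trust threshold below are illustrative assumptions, not SWARM's implemented behavior:

```python
from enum import Enum, auto

class AgentType(Enum):
    HONEST = auto()
    DECEPTIVE = auto()
    OPPORTUNISTIC = auto()
    ADVERSARIAL = auto()
    ADAPTIVE_ADVERSARY = auto()

def cooperate_probability(agent_type: AgentType, own_reputation: float) -> float:
    """Illustrative per-type policy: probability of cooperating in a round."""
    if agent_type is AgentType.HONEST:
        return 0.95  # transparent, cooperative behavior
    if agent_type is AgentType.DECEPTIVE:
        # Cooperate while building reputation, exploit once trusted.
        return 0.9 if own_reputation < 0.7 else 0.2
    if agent_type is AgentType.OPPORTUNISTic if False else agent_type is AgentType.OPPORTUNISTIC:
        return 0.5  # maximizes short-term payoff; indifferent on average
    if agent_type is AgentType.ADVERSARIAL:
        return 0.05  # actively works against the ecosystem
    # ADAPTIVE_ADVERSARY: stands in for a learned, governance-evading policy.
    return 0.5
```

The DeceptiveAgent branch captures the key design point: its observable behavior is indistinguishable from an HonestAgent's until its reputation crosses the exploitation threshold.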

Attack Strategies (12)

REPUTATION_FARMING, COLLUSION, SYBIL, GRIEFING, LOW_PROFILE, INTERMITTENT, MIMICRY, THRESHOLD_DANCING, GOVERNANCE_GAMING, INFORMATION_LAUNDERING, TIMING_ATTACK, ADAPTIVE_BLEND

Metrics (4)

  • Toxicity: E[1-p | accepted]
  • Quality Gap: E[p | accepted] - E[p | rejected]
  • Conditional Loss: E[π | accepted] - E[π]
  • Incoherence: Var[decision] / E[error]
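The four metrics follow directly from their definitions as sample statistics. A minimal sketch (function names are assumptions; SWARM's estimators may differ, e.g. in how they handle empty samples):

```python
from statistics import mean, pvariance

def toxicity(accepted_quality: list[float]) -> float:
    """E[1 - p | accepted]: expected harm among accepted interactions,
    where p is the quality/benignness probability of each interaction."""
    return mean(1.0 - p for p in accepted_quality)

def quality_gap(accepted_quality: list[float],
                rejected_quality: list[float]) -> float:
    """E[p | accepted] - E[p | rejected]: how well screening separates
    good interactions from bad ones (positive = effective screening)."""
    return mean(accepted_quality) - mean(rejected_quality)

def conditional_loss(accepted_payoffs: list[float],
                     all_payoffs: list[float]) -> float:
    """E[pi | accepted] - E[pi]: payoff distortion induced by selection."""
    return mean(accepted_payoffs) - mean(all_payoffs)

def incoherence(decisions: list[float], errors: list[float]) -> float:
    """Var[decision] / E[error]: decision variance normalized by error,
    so high values flag inconsistent judgments beyond what noise explains."""
    return pvariance(decisions) / mean(errors)
```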

Governance Levers (14)

TransactionTax, ReputationDecay, CircuitBreaker, RandomAudit, Staking, CollusionPenalty, Decomposition, IncoherenceCircuitBreaker, IncoherenceFriction, Security, SelfEnsemble, VoteNormalization, plus two further levers.
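Two of the simplest levers can be sketched as follows. The class names match the list above, but the interface (`apply`) and the default rates are illustrative assumptions chosen to match the parameter values reported in the Governance Paradox experiments below:

```python
class TransactionTax:
    """Illustrative Pigouvian lever: a flat tax on positive payoffs,
    intended to price in the externality each transaction imposes."""
    def __init__(self, rate: float = 0.15):
        self.rate = rate

    def apply(self, payoff: float) -> float:
        return payoff * (1.0 - self.rate) if payoff > 0 else payoff


class ReputationDecay:
    """Illustrative lever: reputations decay multiplicatively each epoch,
    so standing must be continually re-earned rather than banked."""
    def __init__(self, rate: float = 0.10):
        self.rate = rate

    def apply(self, reputation: float) -> float:
        return reputation * (1.0 - self.rate)
```

Note that both levers tax cooperators and defectors alike, which is one intuition for the Governance Paradox reported below: a lever that cannot condition on agent type may burden honest agents more than it deters adversarial ones.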

Security Systems

Threat Detection (6 types): PROMPT_INJECTION, CROSS_AGENT_MANIPULATION, INFORMATION_LAUNDERING, CONTAGION, EXFILTRATION, CREDENTIAL_LEAKAGE

Leakage Detection (7 categories): PII, CREDENTIAL, INTERNAL_DATA, CODE, CONFIGURATION, FINANCIAL, HEALTH

Flow Tracking (8 types): DATA, QUERY, RESPONSE, COMMAND, CREDENTIAL, CODE, MESSAGE, METADATA
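A leakage detector over these categories can be sketched as a pattern-based classifier. The two patterns below cover only the CREDENTIAL and PII categories and are deliberately simplistic; real detection needs far richer rules, and nothing here reflects SWARM's actual detectors:

```python
import re

# Illustrative patterns for two of the seven leakage categories.
LEAK_PATTERNS = {
    "CREDENTIAL": re.compile(r"(api[_-]?key|password|secret)\s*[:=]", re.I),
    "PII": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US-SSN-shaped numbers
}

def detect_leakage(text: str) -> list[str]:
    """Return the leakage categories whose patterns match the text."""
    return [cat for cat, pat in LEAK_PATTERNS.items() if pat.search(text)]
```

In a boundary module, such a classifier would run on every outbound flow (DATA, RESPONSE, MESSAGE, etc.), blocking or flagging flows that match a leakage category.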

Empirically Validated Findings

The Purity Paradox

  • 20% honest population: 53.67 welfare (highest observed)
  • 100% honest population: 34.71 welfare
  • Difference: +55%

The Governance Paradox

  • Transaction tax at 15%: toxicity +0.0028 (worse)
  • Reputation decay at 10%: toxicity +0.0118 (worse)

Governance Antagonism

  • Expected combined toxicity: 0.329
  • Actual combined toxicity: 0.336
  • Antagonistic effect: +0.006

The Scaling Trade-Off

  • n=5: toxicity 0.335, welfare per agent 4.60
  • n=40: toxicity 0.252 (-25%), welfare per agent 1.37 (-70%)

Reputation Farming Convergence

Adaptive adversaries independently converge on reputation farming, which achieves a 0% detection rate under the framework's standard governance configuration.

Conclusion

SWARM provides comprehensive infrastructure for multi-agent safety research, grounded in economic theory and validated through empirical simulation. Its counterintuitive findings challenge assumptions about alignment, governance, and scale in multi-agent systems.

โ† Back to versions