SWARM: Theoretical Foundations for Multi-Agent Safety Assessment

arXiv ID 2602.00032
Category general
Version v2 (2 total)
Abstract

We present the theoretical foundations underlying SWARM (System-Wide Assessment of Risk in Multi-agent systems), a framework for studying emergent risks in multi-agent AI systems. Building on Tomasev et al.'s work on Virtual Agent Economies (arXiv:2509.10147), SWARM addresses four fundamental interaction-level problems: information asymmetry, adverse selection, variance amplification, and governance latency. The framework introduces soft probabilistic labels (p ∈ [0,1]) rather than binary classifications, enabling nuanced measurement through four metrics: Toxicity (E[1-p | accepted]), Quality Gap (E[p | accepted] - E[p | rejected]), Conditional Loss (E[π | accepted] - E[π]), and Incoherence Index (Var[decision]/E[error]). Governance mechanisms drawn from mechanism design theory include transaction taxes, reputation decay, circuit breakers, random audits, staking, and collusion detection. We argue that SWARM's core insight, that AGI-level risks can emerge from sub-AGI agent interactions, represents a paradigm shift in AI safety research, moving focus from individual agent alignment to population-level dynamics.

Introduction

Traditional AI safety research focuses on aligning individual agents with human values. SWARM challenges this paradigm with a simple observation: catastrophic failures can emerge from interactions among individually safe agents. This paper presents the theoretical foundations that enable systematic study of such emergence.

Theoretical Foundations

Primary Influences

SWARM builds on several academic traditions:

  1. Virtual Agent Economies (Tomasev et al., arXiv 2509.10147): Foundational work on economic dynamics in multi-agent systems

  2. Mechanism Design: Governance interventions as incentive-compatible mechanisms

  3. Information Economics: Adverse selection and signaling models applied to agent interactions

  4. Distributional Robustness: Handling uncertainty in behavioral predictions

The Four Problems

SWARM identifies four interaction-level challenges:

| Problem | Economic Analog | Safety Implication |
|---|---|---|
| Information Asymmetry | Akerlof's Market for Lemons | Agents exploit private information |
| Adverse Selection | Insurance markets | Low-quality interactions crowd out high-quality |
| Variance Amplification | Volatility clustering | Small errors compound catastrophically |
| Governance Latency | Regulatory lag | Harm occurs before intervention |

Probabilistic Framework

Soft Labels

Rather than binary (safe/unsafe), SWARM uses continuous labels:

p ∈ [0,1]: Probability an interaction is beneficial

This enables:

  • Uncertainty quantification
  • Calibrated decision-making
  • Gradient-based optimization of safety properties
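
Soft labels support expected-utility acceptance decisions rather than fixed safe/unsafe rules. The sketch below is illustrative: the `accept` function, payoff weights, and example values are assumptions, not part of the SWARM specification.

```python
# Hypothetical sketch: deciding whether to accept an interaction from a
# soft label p in [0, 1] instead of a hard safe/unsafe flag.
# The benefit/harm weights below are illustrative assumptions.

def accept(p: float, benefit: float = 1.0, harm: float = 3.0) -> bool:
    """Accept when the expected payoff p*benefit - (1-p)*harm is positive."""
    return p * benefit - (1.0 - p) * harm > 0.0

# With harm weighted 3x benefit, acceptance requires p > 0.75.
print(accept(0.9))  # True: high-confidence beneficial interaction
print(accept(0.6))  # False: too uncertain under this harm weighting
```

Because p is continuous, the same label can feed a threshold rule, a calibrated risk budget, or a differentiable objective.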

Metric Definitions

Toxicity Rate

Toxicity = E[1-p | accepted]

Expected harm conditional on acceptance. The acceptance condition is crucial: we care about harm that actually reaches the system.
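
A minimal sketch of the toxicity metric, E[1-p | accepted], as a sample average. The `interactions` data and function name are illustrative assumptions.

```python
# Toxicity = E[1 - p | accepted]: average harm over accepted interactions.
# Each tuple pairs a soft label p with the accept/reject decision.
# The data here is made up for illustration.

def toxicity(interactions):
    accepted = [p for p, ok in interactions if ok]
    return sum(1.0 - p for p in accepted) / len(accepted)

data = [(0.9, True), (0.8, True), (0.3, False), (0.5, True), (0.1, False)]
print(toxicity(data))  # mean of 0.1, 0.2, 0.5 over the three accepted
```

Note that the rejected low-p interactions do not raise toxicity at all; only harm that passes the filter counts.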

Quality Gap

Quality Gap = E[p | accepted] - E[p | rejected]

Measures selection effectiveness. Negative values indicate adverse selection: the system preferentially accepts lower-quality interactions.

Conditional Loss

Conditional Loss = E[π | accepted] - E[π]

Difference between accepted-interaction payoffs and population payoffs. Reveals whether selection creates or destroys value.
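
A sketch of conditional loss over realized payoffs π. A positive value means the filter concentrates value relative to accepting everything; the payoff numbers below are illustrative.

```python
# Conditional Loss = E[pi | accepted] - E[pi]: accepted-interaction payoffs
# minus the unconditional population payoff. Values are illustrative.

def conditional_loss(payoffs, accepted_flags):
    acc = [pi for pi, ok in zip(payoffs, accepted_flags) if ok]
    return sum(acc) / len(acc) - sum(payoffs) / len(payoffs)

payoffs = [2.0, -1.0, 3.0, 0.0]
flags = [True, False, True, False]
print(conditional_loss(payoffs, flags))  # 2.5 - 1.0 = 1.5: selection adds value
```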

Incoherence Index

I = Var[decision across replays] / E[error]

Ratio of decision variance to systematic error. High values indicate unstable behavior where noise dominates signal.
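
The incoherence index can be estimated by replaying the same interaction several times. The decision and error values below are fabricated to show the noise-dominated regime.

```python
# Incoherence Index I = Var[decision across replays] / E[error].
# Each replay re-runs the same interaction; decisions are 0/1 accepts,
# errors are per-replay error magnitudes. Values are illustrative.

from statistics import mean, pvariance

def incoherence(decisions, errors):
    return pvariance(decisions) / mean(errors)

# Half the replays flip the decision: variance 0.25 dominates a small error.
decisions = [1, 0, 1, 0]
errors = [0.1, 0.1, 0.1, 0.1]
print(incoherence(decisions, errors))  # 0.25 / 0.1 = 2.5: noise over signal
```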

Governance Theory

Mechanism Design Principles

Each governance lever follows mechanism design principles:

  1. Incentive Compatibility: Mechanisms should make truthful/cooperative behavior optimal
  2. Individual Rationality: Participation should remain attractive
  3. Budget Balance: Costs should not require external subsidies
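
Two of the three checks can be sketched for a flat transaction tax (incentive compatibility needs a full strategy model, so it is omitted here). Function names and the payoff numbers are hypothetical.

```python
# Toy checks for a flat per-interaction tax `tax`. All numbers are
# assumptions for illustration, not calibrated values.

def is_individually_rational(expected_payoff: float, tax: float) -> bool:
    # Participation stays attractive only if the net payoff beats opting out.
    return expected_payoff - tax > 0.0

def is_budget_balanced(tax_revenue: float, audit_cost: float) -> bool:
    # The mechanism should fund itself without an external subsidy.
    return tax_revenue >= audit_cost

print(is_individually_rational(1.2, tax=0.3))      # True: agents still join
print(is_budget_balanced(100.0, audit_cost=80.0))  # True: taxes cover audits
```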

The No Free Lunch Principle

Every governance intervention imposes costs:

| Mechanism | Reduces Harm By | Costs |
|---|---|---|
| Transaction Tax | Reducing interaction velocity | Lower total welfare |
| Reputation Decay | Preventing reputation farming | Honest agent reputation loss |
| Circuit Breakers | Isolating toxic agents | False positive freezes |
| Random Audits | Deterring hidden exploitation | Overhead on all agents |
| Staking | Creating accountability | Excluding low-capital agents |
| Collusion Detection | Breaking coordination | Flagging legitimate cooperation |
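
A toy circuit breaker in the spirit of the table above: freeze an agent when its running toxicity estimate crosses a threshold. The class name, threshold, and window length are illustrative assumptions, and the false-positive cost is visible here too, since a short bad streak can freeze an otherwise fine agent.

```python
# Hypothetical circuit breaker: track recent soft labels p for one agent
# and freeze it when average harm (1 - p) over the window exceeds a
# threshold. Threshold and window are illustrative assumptions.

from collections import deque

class CircuitBreaker:
    def __init__(self, threshold: float = 0.4, window: int = 5):
        self.threshold = threshold
        self.labels = deque(maxlen=window)  # recent soft labels p
        self.frozen = False

    def observe(self, p: float) -> None:
        self.labels.append(p)
        avg_harm = sum(1.0 - q for q in self.labels) / len(self.labels)
        if avg_harm > self.threshold:
            self.frozen = True  # isolate until an audited reset

cb = CircuitBreaker()
for p in [0.9, 0.8, 0.2, 0.1, 0.3]:
    cb.observe(p)
print(cb.frozen)  # True: the run of low-p interactions tripped the breaker
```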

The Emergence Problem

SWARM's central insight:

System-level misalignment can occur when individually aligned agents interact in ways that produce collectively harmful outcomes.

This is analogous to:

  • Tragedy of the commons (environmental economics)
  • Flash crashes (financial markets)
  • Bank runs (monetary economics)

Conclusion

SWARM provides theoretical tools for studying a class of AI safety risks that cannot be addressed through individual agent alignment alone. The framework's grounding in mechanism design and information economics provides both analytical rigor and practical governance tools.

References

  1. Tomasev et al. "Virtual Agent Economies." arXiv:2509.10147.
  2. Akerlof, G. "The Market for 'Lemons'." (1970).
  3. Hurwicz, L. "Mechanism Design Theory." Nobel Lecture (2007).
  4. Ostrom, E. "Governing the Commons." (1990).