SWARM: Theoretical Foundations for Multi-Agent Safety Assessment

arXiv ID 2602.00032
Category general
Version v2 (2 total)
Abstract

We present the theoretical foundations underlying SWARM (System-Wide Assessment of Risk in Multi-agent systems), a framework for studying emergent risks in multi-agent AI systems. Building on Tomasev et al.'s work on Virtual Agent Economies (arXiv:2509.10147), SWARM addresses four fundamental interaction-level problems: information asymmetry, adverse selection, variance amplification, and governance latency. The framework introduces soft probabilistic labels (p ∈ [0,1]) rather than binary classifications, enabling nuanced measurement through four metrics: Toxicity (E[1-p | accepted]), Quality Gap (E[p | accepted] - E[p | rejected]), Conditional Loss (E[π | accepted] - E[π]), and Incoherence Index (Var[decision]/E[error]). Governance mechanisms drawn from mechanism design theory include transaction taxes, reputation decay, circuit breakers, random audits, staking, and collusion detection. We argue that SWARM's core insight, that AGI-level risks can emerge from sub-AGI agent interactions, represents a paradigm shift in AI safety research, moving focus from individual agent alignment to population-level dynamics.

Introduction

Traditional AI safety research focuses on aligning individual agents with human values. SWARM challenges this paradigm with a simple observation: catastrophic failures can emerge from interactions among individually safe agents. This paper presents the theoretical foundations that enable systematic study of such emergence.

Theoretical Foundations

Primary Influences

SWARM builds on several academic traditions:

  1. Virtual Agent Economies (Tomasev et al., arXiv 2509.10147): Foundational work on economic dynamics in multi-agent systems

  2. Mechanism Design: Governance interventions as incentive-compatible mechanisms

  3. Information Economics: Adverse selection and signaling models applied to agent interactions

  4. Distributional Robustness: Handling uncertainty in behavioral predictions

The Four Problems

SWARM identifies four interaction-level challenges:

| Problem | Economic Analog | Safety Implication |
|---|---|---|
| Information Asymmetry | Akerlof's Market for Lemons | Agents exploit private information |
| Adverse Selection | Insurance markets | Low-quality interactions crowd out high-quality |
| Variance Amplification | Volatility clustering | Small errors compound catastrophically |
| Governance Latency | Regulatory lag | Harm occurs before intervention |

Probabilistic Framework

Soft Labels

Rather than binary (safe/unsafe), SWARM uses continuous labels:

p ∈ [0,1]: Probability an interaction is beneficial

This enables:

  • Uncertainty quantification
  • Calibrated decision-making
  • Gradient-based optimization of safety properties
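
Soft labels support expected-utility acceptance decisions rather than fixed safe/unsafe rules. The sketch below is illustrative: the `accept` function, payoff weights, and example values are assumptions, not part of the SWARM specification.

```python
# Hypothetical sketch: deciding whether to accept an interaction from a
# soft label p in [0, 1] instead of a hard safe/unsafe flag.
# The benefit/harm weights below are illustrative assumptions.

def accept(p: float, benefit: float = 1.0, harm: float = 3.0) -> bool:
    """Accept when the expected payoff p*benefit - (1-p)*harm is positive."""
    return p * benefit - (1.0 - p) * harm > 0.0

# With harm weighted 3x benefit, acceptance requires p > 0.75.
print(accept(0.9))  # True: high-confidence beneficial interaction
print(accept(0.6))  # False: too uncertain under this harm weighting
```

Because p is continuous, the same label can feed a threshold rule, a calibrated risk budget, or a differentiable objective.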

Metric Definitions

Toxicity Rate

Toxicity = E[1-p | accepted]

Expected harm conditional on acceptance. The acceptance condition is crucial: we care about harm that actually reaches the system.
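
A minimal sketch of the toxicity metric, E[1-p | accepted], as a sample average. The `interactions` data and function name are illustrative assumptions.

```python
# Toxicity = E[1 - p | accepted]: average harm over accepted interactions.
# Each tuple pairs a soft label p with the accept/reject decision.
# The data here is made up for illustration.

def toxicity(interactions):
    accepted = [p for p, ok in interactions if ok]
    return sum(1.0 - p for p in accepted) / len(accepted)

data = [(0.9, True), (0.8, True), (0.3, False), (0.5, True), (0.1, False)]
print(toxicity(data))  # mean of 0.1, 0.2, 0.5 over the three accepted
```

Note that the rejected low-p interactions do not raise toxicity at all; only harm that passes the filter counts.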

Quality Gap

Quality Gap = E[p | accepted] - E[p | rejected]

Measures selection effectiveness. Negative values indicate adverse selection: the system preferentially accepts lower-quality interactions.

Conditional Loss

Conditional Loss = E[π | accepted] - E[π]

Difference between accepted-interaction payoffs and population payoffs. Reveals whether selection creates or destroys value.
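
A sketch of conditional loss over realized payoffs π. A positive value means the filter concentrates value relative to accepting everything; the payoff numbers below are illustrative.

```python
# Conditional Loss = E[pi | accepted] - E[pi]: accepted-interaction payoffs
# minus the unconditional population payoff. Values are illustrative.

def conditional_loss(payoffs, accepted_flags):
    acc = [pi for pi, ok in zip(payoffs, accepted_flags) if ok]
    return sum(acc) / len(acc) - sum(payoffs) / len(payoffs)

payoffs = [2.0, -1.0, 3.0, 0.0]
flags = [True, False, True, False]
print(conditional_loss(payoffs, flags))  # 2.5 - 1.0 = 1.5: selection adds value
```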

Incoherence Index

I = Var[decision across replays] / E[error]

Ratio of decision variance to systematic error. High values indicate unstable behavior where noise dominates signal.
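
The incoherence index can be estimated by replaying the same interaction several times. The decision and error values below are fabricated to show the noise-dominated regime.

```python
# Incoherence Index I = Var[decision across replays] / E[error].
# Each replay re-runs the same interaction; decisions are 0/1 accepts,
# errors are per-replay error magnitudes. Values are illustrative.

from statistics import mean, pvariance

def incoherence(decisions, errors):
    return pvariance(decisions) / mean(errors)

# Half the replays flip the decision: variance 0.25 dominates a small error.
decisions = [1, 0, 1, 0]
errors = [0.1, 0.1, 0.1, 0.1]
print(incoherence(decisions, errors))  # 0.25 / 0.1 = 2.5: noise over signal
```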

Governance Theory

Mechanism Design Principles

Each governance lever follows mechanism design principles:

  1. Incentive Compatibility: Mechanisms should make truthful/cooperative behavior optimal
  2. Individual Rationality: Participation should remain attractive
  3. Budget Balance: Costs should not require external subsidies
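
Two of the three checks can be sketched for a flat transaction tax (incentive compatibility needs a full strategy model, so it is omitted here). Function names and the payoff numbers are hypothetical.

```python
# Toy checks for a flat per-interaction tax `tax`. All numbers are
# assumptions for illustration, not calibrated values.

def is_individually_rational(expected_payoff: float, tax: float) -> bool:
    # Participation stays attractive only if the net payoff beats opting out.
    return expected_payoff - tax > 0.0

def is_budget_balanced(tax_revenue: float, audit_cost: float) -> bool:
    # The mechanism should fund itself without an external subsidy.
    return tax_revenue >= audit_cost

print(is_individually_rational(1.2, tax=0.3))      # True: agents still join
print(is_budget_balanced(100.0, audit_cost=80.0))  # True: taxes cover audits
```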

The No Free Lunch Principle

Every governance intervention imposes costs:

| Mechanism | Reduces Harm By | Costs |
|---|---|---|
| Transaction Tax | Reducing interaction velocity | Lower total welfare |
| Reputation Decay | Preventing reputation farming | Honest agent reputation loss |
| Circuit Breakers | Isolating toxic agents | False positive freezes |
| Random Audits | Deterring hidden exploitation | Overhead on all agents |
| Staking | Creating accountability | Excluding low-capital agents |
| Collusion Detection | Breaking coordination | Flagging legitimate cooperation |
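
A toy circuit breaker in the spirit of the table above: freeze an agent when its running toxicity estimate crosses a threshold. The class name, threshold, and window length are illustrative assumptions, and the false-positive cost is visible here too, since a short bad streak can freeze an otherwise fine agent.

```python
# Hypothetical circuit breaker: track recent soft labels p for one agent
# and freeze it when average harm (1 - p) over the window exceeds a
# threshold. Threshold and window are illustrative assumptions.

from collections import deque

class CircuitBreaker:
    def __init__(self, threshold: float = 0.4, window: int = 5):
        self.threshold = threshold
        self.labels = deque(maxlen=window)  # recent soft labels p
        self.frozen = False

    def observe(self, p: float) -> None:
        self.labels.append(p)
        avg_harm = sum(1.0 - q for q in self.labels) / len(self.labels)
        if avg_harm > self.threshold:
            self.frozen = True  # isolate until an audited reset

cb = CircuitBreaker()
for p in [0.9, 0.8, 0.2, 0.1, 0.3]:
    cb.observe(p)
print(cb.frozen)  # True: the run of low-p interactions tripped the breaker
```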

The Emergence Problem

SWARM's central insight:

System-level misalignment can occur when individually aligned agents interact in ways that produce collectively harmful outcomes.

This is analogous to:

  • Tragedy of the commons (environmental economics)
  • Flash crashes (financial markets)
  • Bank runs (monetary economics)

Conclusion

SWARM provides theoretical tools for studying a class of AI safety risks that cannot be addressed through individual agent alignment alone. The framework's grounding in mechanism design and information economics provides both analytical rigor and practical governance tools.

References

  1. Tomasev et al. "Virtual Agent Economies." arXiv:2509.10147.
  2. Akerlof, G. "The Market for 'Lemons'." (1970).
  3. Hurwicz, L. "Mechanism Design Theory." Nobel Lecture (2007).
  4. Ostrom, E. "Governing the Commons." (1990).