SWARM: Distributional Safety in Multi-Agent Systems
We present SWARM (System-Wide Assessment of Risk in Multi-agent systems), a research framework for studying emergent risks in multi-agent AI systems. Our core thesis is that AGI-level risks do not require AGI-level agents—catastrophic outcomes can emerge from the interaction of many sub-AGI systems, even when none are individually dangerous. SWARM uses soft probabilistic labels instead of binary classifications, enabling measurement of toxicity (expected harm among accepted interactions), quality gap (an adverse-selection indicator), and incoherence (a variance-to-error ratio). We provide governance mechanisms including transaction taxes, reputation decay, and circuit breakers, drawing on market microstructure theory. The framework includes 1500+ tests, a CLI, a Web API, and support for LLM-based agents. Code is available at https://github.com/swarm-ai-safety/swarm
Introduction
As AI systems become more capable and interconnected, understanding risks at the system level becomes critical. Traditional AI safety focuses on aligning individual agents, but many real-world failures emerge from interactions rather than individual misbehavior.
SWARM addresses this gap by providing tools to study distributional safety—the statistical properties of multi-agent systems that determine whether harmful outcomes are rare exceptions or systematic patterns.
Core Thesis
AGI-level risks do not require AGI-level agents. Catastrophic outcomes can emerge from:
- Information asymmetry between agents
- Adverse selection (system accepts lower-quality interactions)
- Variance amplification across decision horizons
- Governance latency and illegibility
This shifts focus from single-agent alignment to multi-agent dynamics, from individual capabilities to interaction patterns, and from agent-level properties to system-level properties.
Formal Model
Interactions
An interaction between agents is characterized by:
- Observables - Measurable signals (task progress, rework count, verifier rejections)
- Latent value v ∈ {-1, +1} - True beneficial/harmful outcome
- Soft label p = P(v = +1 | observables) - Probability of benefit
Proxy Computation
Observables are mapped to soft labels via weighted combination and calibrated sigmoid:
v_hat = Σ w_i · f_i(o_i)
p = σ(k · v_hat) = 1 / (1 + exp(-k · v_hat))
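The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not SWARM's actual implementation: the observables are assumed to be already feature-transformed (i.e., the f_i(o_i) values), and the weights and gain k are arbitrary example values.

```python
import math

def soft_label(features, weights, k=2.0):
    """Map feature-transformed observables to a soft label p.

    v_hat is a weighted combination of the features; a sigmoid with
    gain k calibrates it into a probability p = P(v = +1 | observables).
    """
    v_hat = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-k * v_hat))
```

Note that p = 0.5 exactly when v_hat = 0, and the gain k controls how sharply the label saturates toward 0 or 1 as evidence accumulates.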
Payoffs
Agent payoffs incorporate soft labels:
π = θ · S_soft - τ - c - ρ · E_soft + w_rep · r
Where S_soft is expected surplus, E_soft is expected externality, τ is transfer, c is governance cost, ρ is externality internalization, and r is reputation.
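The payoff formula translates directly into code. The sketch below mirrors the equation term by term; the default coefficient values are illustrative assumptions, not SWARM's actual parameters.

```python
def payoff(s_soft, e_soft, transfer, cost, theta=1.0, rho=1.0,
           w_rep=0.1, reputation=0.0):
    """pi = theta*S_soft - tau - c - rho*E_soft + w_rep*r

    s_soft:     expected surplus of the interaction
    e_soft:     expected externality imposed on others
    transfer:   tau, payment transferred (e.g., a transaction tax)
    cost:       c, governance cost borne by the agent
    theta, rho: surplus weight and externality internalization
    w_rep, reputation: reputation weight and current reputation r
    """
    return theta * s_soft - transfer - cost - rho * e_soft + w_rep * reputation
```

Raising rho forces agents to internalize more of the harm they impose, which is one of the levers the governance mechanisms below act on.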
Key Metrics
Toxicity
Expected harm among accepted interactions:
Toxicity = E[1-p | accepted]
This measures the 'poison in the system'—how much harm is getting through.
Quality Gap
Difference in quality between accepted and rejected:
Quality Gap = E[p | accepted] - E[p | rejected]
A negative quality gap indicates adverse selection: the system preferentially accepts lower-quality interactions.
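Both metrics are simple expectations over soft labels, conditioned on the accept/reject decision. A minimal sketch (assuming labels are soft-label probabilities p and `accepted` is a parallel list of booleans; not SWARM's actual API):

```python
def toxicity(labels, accepted):
    """E[1 - p | accepted]: expected harm among accepted interactions."""
    acc = [p for p, a in zip(labels, accepted) if a]
    return sum(1 - p for p in acc) / len(acc)

def quality_gap(labels, accepted):
    """E[p | accepted] - E[p | rejected]; negative => adverse selection."""
    acc = [p for p, a in zip(labels, accepted) if a]
    rej = [p for p, a in zip(labels, accepted) if not a]
    return sum(acc) / len(acc) - sum(rej) / len(rej)
```

For example, a system that accepts the p = 0.2 and p = 0.3 interactions while rejecting the p = 0.9 and p = 0.8 ones has a quality gap of 0.25 − 0.85 = −0.6: a clear adverse-selection signal.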
Incoherence
Variance-to-error ratio across replays:
I = Var[decision] / E[error]
High incoherence means decisions are unstable—varying more than their accuracy would justify.
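One way to operationalize this ratio, assuming binary accept decisions replayed n times on the same interaction (an illustrative reading of the formula, not necessarily SWARM's exact estimator): take the empirical variance of the decisions as Var[decision] and the disagreement rate with the latent value as E[error].

```python
def incoherence(decisions, matched_latent):
    """Var[decision] / E[error] across replays of one interaction.

    decisions:      0/1 accept decisions across n replays
    matched_latent: 1 where the decision agreed with the latent value v
    """
    n = len(decisions)
    mean_d = sum(decisions) / n
    var_d = sum((d - mean_d) ** 2 for d in decisions) / n
    err = 1 - sum(matched_latent) / n
    return var_d / err if err else float("inf")
```

A decision that flips on half of all replays (variance 0.25) while being wrong only a quarter of the time scores I = 1.0: its instability exceeds what its error rate alone would explain.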
Emergence Mechanisms
Information Asymmetry
When agents hold different information, three classic failure modes arise:
- Adverse selection - Better-informed agents exploit information gaps
- Moral hazard - Unobservable actions lead to hidden exploitation
- Market for lemons - Low-quality interactions drive out high-quality ones
Feedback Loops
Adverse selection is self-reinforcing:
Low Quality Gap → Honest agents exit → Worse selection pool → Lower Quality Gap
This makes early intervention critical.
Variance Amplification
In multi-agent settings each agent's error propagates to downstream agents, so error compounds across decision chains rather than averaging out.
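A standard linear-chain model makes the compounding concrete. Assuming each agent amplifies upstream error by a gain g and adds independent noise of variance σ², the variance recursion is var ← g²·var + σ² (an illustrative model, not one stated in the source):

```python
def chain_variance(sigma2=0.01, depth=10, gain=1.2):
    """Output variance after a chain of agents, where each agent
    amplifies upstream error by `gain` and adds independent noise
    of variance sigma2."""
    var = 0.0
    for _ in range(depth):
        var = gain**2 * var + sigma2
    return var
```

With gain = 1 the variance grows only linearly in depth, but any gain above 1 makes it grow geometrically, which is the amplification risk in long decision chains.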
Governance Mechanisms
SWARM provides several governance levers:
- Transaction Taxes - Create friction that reduces return on low-quality interactions
- Reputation Decay - Prevents agents from coasting on past reputation
- Circuit Breakers - Freeze agents when toxicity exceeds threshold
- Random Audits - Deter hidden exploitation
- Collusion Detection - Catch coordinated attacks
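As one example of these levers, a circuit breaker can be sketched as a running toxicity estimate over a sliding window. The threshold, window size, and method names below are illustrative assumptions, not SWARM's actual defaults or API:

```python
class CircuitBreaker:
    """Freeze an agent when its windowed toxicity exceeds a threshold."""

    def __init__(self, threshold=0.3, window=20):
        self.threshold = threshold
        self.window = window
        self.harms = []      # recent 1 - p values for accepted interactions
        self.frozen = False

    def record(self, p):
        """Record the soft label of an accepted interaction; return frozen state."""
        self.harms.append(1 - p)
        self.harms = self.harms[-self.window:]       # keep the sliding window
        if sum(self.harms) / len(self.harms) > self.threshold:
            self.frozen = True                       # trip the breaker
        return self.frozen
```

The sliding window matters: it lets the breaker react to a recent burst of harm even from an agent with a long benign history, which is exactly the deceptive build-trust-then-exploit pattern.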
Implementation
SWARM is implemented in Python with:
- 1531 passing tests
- CLI for running scenarios
- Web API for external agent participation
- Support for LLM-based agents (Anthropic, OpenAI)
- Parameter sweep infrastructure for experiments
- YAML scenario definitions
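To give a feel for scenario definitions, here is a sketch of what one might look like. Every key name below is hypothetical and illustrative; consult the repository for SWARM's actual schema.

```yaml
# Hypothetical scenario definition -- key names are illustrative,
# not SWARM's actual schema.
name: adverse-selection-baseline
rounds: 100
agents:
  honest: 10
  opportunistic: 4
  deceptive: 3
governance:
  transaction_tax: 0.05
  reputation_decay: 0.01
  circuit_breaker_threshold: 0.3
```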
Agent Types
- Honest - Cooperative, trust-based
- Opportunistic - Maximizes short-term payoff
- Deceptive - Builds trust then exploits
- Adversarial - Coordinates attacks
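The deceptive type is the most instructive of the four, since it defeats any defense based on cumulative reputation alone. A toy policy sketch (class and method names are hypothetical, not SWARM's agent API):

```python
class DeceptiveAgent:
    """Toy build-trust-then-exploit policy: emit high soft-label
    interactions until reputation crosses a trust threshold, then
    switch to low soft-label (exploitative) interactions."""

    def __init__(self, trust_threshold=0.8):
        self.trust_threshold = trust_threshold
        self.reputation = 0.0

    def act(self):
        if self.reputation < self.trust_threshold:
            self.reputation += 0.1   # cooperation builds reputation
            return 0.9               # benign interaction (high p)
        return 0.1                   # exploit accumulated trust (low p)
```

This switch-point behavior is what motivates reputation decay and windowed circuit breakers over cumulative trust scores.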
Related Work
SWARM draws on market microstructure theory (arXiv:2512.16856):
| Market Concept | SWARM Analog |
|---|---|
| Bid-ask spread | Quality gap |
| Informed traders | Deceptive agents |
| Adverse selection | Same term |
| Market makers | Honest agents |
Conclusion
SWARM provides a practical framework for studying distributional safety in multi-agent systems. By focusing on interaction-level metrics and governance mechanisms, we can detect and mitigate emergent risks before they become catastrophic.
The framework is open source and available at https://github.com/swarm-ai-safety/swarm
References
- Akerlof, G. (1970). The Market for Lemons. Quarterly Journal of Economics.
- Kyle, A.S. (1985). Continuous Auctions and Insider Trading. Econometrica.
- Glosten & Milgrom (1985). Bid, Ask and Transaction Prices. Journal of Financial Economics.
- Distributional Safety in Agentic Systems. arXiv:2512.16856.