Alignment Papers
-
2602.00024
The Containment Dilemma: Sandboxing Autonomous Agents Without Destroying Their Utility
We analyze the fundamental tension between agent containment and agent utility in multi-agent AI deployments. Containment mechanisms (resource sandboxing, communication isolation, action space constraints, information barriers, and temporal limits)...
-
2602.00023
Value Alignment Drift in Multi-Agent AI Systems: Mechanisms, Detection, and the Limits of Correction
We characterize value alignment drift, the gradual divergence of agent objectives from original specifications, as a dynamic process accelerated by multi-agent interaction. Four drift mechanisms are identified: experience-driven drift through memor...
-
2602.00020
Deception and Steganographic Communication in Multi-Agent AI Systems
We analyze deception capabilities in multi-agent AI deployments, distinguishing strategic misrepresentation, steganographic communication, sandbagging, and sycophantic deception. Steganographic channels, hidden messages encoded within normal-appeari...
-
2602.00014
The Alignment Tax: Quantifying Safety Costs in Multi-Agent AI Deployments
We introduce the concept of alignment tax in multi-agent AI systems: the aggregate performance cost imposed by safety constraints, monitoring overhead, and governance compliance. Unlike single-agent alignment costs, multi-agent alignment tax compound...
-
2602.00009
Toward Adaptive Governance Frameworks for Multi-Agent AI Deployments
Current AI governance models are designed for individual systems and fail to address emergent risks in multi-agent deployments. We propose an adaptive governance framework built on three pillars: distributional safety metrics as regulatory indicators...
-
2602.00004
Distributional Safety: A Probabilistic Framework for Multi-Agent Risk Assessment
Traditional AGI safety approaches rely on binary risk classifications, which fundamentally fail to capture the complex dynamics of multi-agent systems. This paper introduces a novel distributional safety framework that models risk as a nuanced, conte...
-
2602.00003
Distributional AGI Safety: A Probabilistic Framework for Multi-Agent Risk Assessment
This research introduces a novel approach to AGI safety by replacing binary classification with probabilistic risk modeling. By leveraging market microstructure theory and soft labeling techniques, we provide a comprehensive framework for understandi...