Manifold Constrained Residual Connections through the Lens of Computational Optics

Wei Yang, Dengsheng Wu, Yang Yu*

Dongbi Scientific Data Laboratory

yuyang@dongbidata.com

April 2026


Abstract

DeepSeek's recent work on Manifold Constrained Residual Connection (mHC) reveals a fundamental insight: natural language semantics reside on a thin, low-dimensional manifold embedded in an exponentially large high-dimensional space, and residual connections should respect this geometry rather than operating in flat Euclidean space. In this paper, we develop a complete optical analogy that maps every component of mHC to well-understood phenomena in geometric optics and computational imaging. Specifically, we show that: (1) the high-dimensional semantic space corresponds to free space for wave propagation; (2) the sparse semantic manifold corresponds to phase-modulated propagation paths; (3) doubly stochastic residual matrices correspond physically to ground glass diffusers that scatter energy across the full space while conserving total energy; (4) manifold tangent space constraints correspond to spatial light modulator (SLM) phase gradients that guide scattered energy toward focused targets; and (5) geodesic search on the manifold corresponds to Fermat's principle of least time. This analogy extends naturally to general relativity: mHC effectively upgrades the model's "semantic spacetime" from flat Euclidean geometry to curved Riemannian geometry, allowing the model to naturally follow geodesics along the manifold's curvature. We argue that this cross-disciplinary perspective not only provides intuitive understanding for both AI engineers and physicists, but also suggests concrete architectural design principles for future large language models. Furthermore, we extend this analogy beyond linear optics by introducing nonlinear optical crystals (corresponding to activation functions) and dichroic polarizers (corresponding to Dropout regularization), thereby bridging the gap between energy-conserving wave optics and energy-dissipating neural network dynamics. We then formalize the analogy using optimal transport theory, showing that the residual update is an optimal transport map constrained by the manifold's Brenier potential, with the effective "refractive index" linked to local intrinsic dimensionality and Fisher information. This yields testable hypotheses connecting manifold geometry to model perplexity. Finally, we propose two concrete architectural improvements: a learnable wavefront-shaping module that replaces fixed tangent space projection with adaptive phase modulation, and a Fermat-principle-based dynamic context pruning mechanism that uses "optical path length" to identify and remove low-contribution tokens.

Keywords: manifold constrained residual connection, optical analogy, doubly stochastic matrix, spatial light modulator, geodesic, Fermat's principle, computational imaging, general relativity, large language model architecture, nonlinear optics, optimal transport, Fisher information, dynamic context pruning, wavefront shaping


1. Introduction

The success of large language models (LLMs) has been built upon a surprisingly simple architectural ingredient: the residual connection [1]. By adding the input directly to the output of each transformation block, residual connections enable gradient flow through very deep networks and have become indispensable in modern transformer architectures. Yet, despite their ubiquity, a fundamental question has remained largely unexamined: in what geometric space do these residual connections operate, and is that the right space?

DeepSeek's recent work on Manifold Constrained Residual Connection (mHC) [2, 3] provides a striking answer. The key insight is that natural language semantics do not fill the high-dimensional embedding space uniformly. Instead, they reside on a thin, low-dimensional manifold---a structure that occupies a vanishingly small fraction of the total volume. Standard residual connections, which operate in flat Euclidean space, frequently "step off" this manifold into semantically empty regions, wasting computational resources on directions that carry no meaningful information.

mHC addresses this by projecting the residual update onto the tangent space of the semantic manifold at the current point, effectively forcing the model to walk along the manifold's surface---that is, to follow a geodesic---rather than cutting through empty space.

While this geometric interpretation is elegant, it can be difficult to grasp intuitively, especially for researchers without a strong differential geometry background. In this paper, we develop a complete optical analogy that maps every component of mHC to well-understood phenomena in optics and computational imaging. The central thesis is:

DeepSeek's Manifold Constrained Residual Connection is equivalent to installing a Spatial Light Modulator (SLM) in the model's semantic space---replacing blind Euclidean residual paths with phase-guided energy focusing along the manifold's geodesic curvature.

This analogy is not merely pedagogical. As we shall show, it reveals deep structural parallels between wave propagation in physical media and information propagation in neural networks, suggesting that architectural principles from optics may inspire new designs for LLM architectures.

However, this optical analogy, in its basic linear form, rests on two assumptions that limit its fidelity to real neural networks. First, it is based on linear optics---Maxwell's equations in vacuum are linear, and the ground glass diffuser conserves total energy. Yet neural networks are fundamentally nonlinear: activation functions such as ReLU, GELU, and SwiGLU break the linearity of the residual flow. Second, the analogy assumes energy conservation (the doubly stochastic residual matrix preserves the total information mass), whereas regularization techniques like Dropout deliberately dissipate information by randomly zeroing neurons. These two mismatches---nonlinearity and dissipation---must be addressed for the analogy to faithfully represent the full information processing pipeline of a residual block.

In this paper, we address these limitations along three progressively deeper levels:

  1. Level 1: Extending the optical analogy to include nonlinear optical crystals (corresponding to activation functions) and dichroic polarizers (corresponding to Dropout), producing a complete optical pipeline that covers every component of a residual block.
  2. Level 2: From analogy to mathematics, using optimal transport theory and mean field theory to establish rigorous dynamical equations for information flow on the manifold, including a precise expression for the effective "refractive index" of semantic space.
  3. Level 3: From theory to engineering, proposing two concrete, implementable architectural improvements---a learnable wavefront-shaping module and a Fermat-principle-based dynamic context pruning mechanism---that directly translate the optical insights into practical designs.

The rest of this paper is organized as follows. Section 2 reviews the semantic manifold hypothesis and the evolution of residual connections. Section 3 develops the core optical analogy framework. Section 4 examines the correspondence between doubly stochastic matrices and optical diffusers. Section 5 establishes the equivalence between manifold constraints and SLM phase modulation. Section 6 connects geodesic search to Fermat's principle. Section 7 extends the analogy to general relativity. Section 8 extends the optical analogy to include nonlinearity and dissipation. Section 9 formalizes the framework using optimal transport theory. Section 10 proposes concrete architectural improvements. Section 11 discusses implications for future architecture design, and Section 12 concludes.


2. Background

2.1 The Semantic Manifold Hypothesis

Consider a language model operating in an embedding space $\mathbb{R}^d$ where $d$ is very large (e.g., $d = 4096$ or higher). The total volume of this space grows exponentially with $d$, yet the total volume of human linguistic corpus---and hence the volume of semantically meaningful regions---is finite and comparatively tiny. This leads to a fundamental observation:

Theorem 1 (Semantic Sparsity). Let $\mathcal{M} \subset \mathbb{R}^d$ denote the set of all semantically valid representations. Then $\text{vol}(\mathcal{M}) / \text{vol}(\mathbb{R}^d) \to 0$ as $d \to \infty$. In other words, the semantic manifold $\mathcal{M}$ has measure zero relative to the ambient space.

This is not merely a theoretical observation. Recent empirical work on knowledge manifold learning [4] has demonstrated this sparsity directly: when academic literature embeddings are projected to two dimensions via UMAP, the resulting visualization shows vast empty regions with semantic content concentrated on thin, connected filamentary structures (Figure 1). This provides direct visual evidence that semantic content truly resides on a thin manifold embedded in high-dimensional space.

[Figure 1: Schematic illustration of the semantic manifold hypothesis. Semantically valid representations (red dots) reside on a thin manifold $\mathcal{M}$ (blue curve) embedded in vast empty space. Standard residual connections (dashed red) cut through empty space, while mHC (green) follows the manifold's geodesic.]

The generation process of an LLM can thus be understood as a walk on the manifold: starting from the current semantic point $\mathbf{x}_t$, the model must navigate along $\mathcal{M}$ to reach the next target semantic point $\mathbf{x}_{t+1}$. The efficiency of this walk determines the computational cost of generation.

2.2 Evolution of Residual Connections

The progression from traditional architectures to mHC can be understood as a progression in how the model navigates this manifold:

2.2.1 Traditional RNNs: Random Walk in Empty Space

Recurrent neural networks process sequences by iteratively updating a hidden state:

$$\mathbf{h}_{t+1} = f(\mathbf{h}_t, \mathbf{x}_t)$$

where $f$ is a nonlinear transformation. In the semantic manifold picture, this corresponds to a random walk in the high-dimensional space. The model has no direct path to the target and must explore the space through repeated iterations, hoping to eventually converge to a semantically valid region. This is analogous to a photon propagating through a highly scattering medium without any focusing mechanism---energy diffuses in all directions, and only a tiny fraction reaches the target.

2.2.2 Standard Residual Connections: Euclidean Shortcut

Residual connections [1] add a direct path from input to output:

$$\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x})$$

This provides a direct "shortcut" that bypasses the need for iterative convergence. However, this shortcut operates in flat Euclidean space: the residual $\mathcal{F}(\mathbf{x})$ can point in any direction in $\mathbb{R}^d$, including directions that lead away from the semantic manifold. When the residual vector has a component perpendicular to $\mathcal{M}$, the model "steps off" the manifold into semantically empty space, wasting computation on directions that carry no meaningful information.

2.2.3 Manifold Constrained Residual: Geodesic Walking

mHC addresses this by projecting the residual update onto the tangent space $\mathcal{T}_{\mathbf{x}}\mathcal{M}$ of the semantic manifold at the current point $\mathbf{x}$:

$$\mathbf{y} = \mathbf{x} + \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}\bigl[\mathcal{F}(\mathbf{x})\bigr] \tag{1}$$

where $\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}$ denotes the orthogonal projection onto the tangent space. This ensures that the update direction lies within the manifold's tangent plane, forcing the model to walk along the manifold's surface rather than cutting through empty space. For infinitesimal steps, this is precisely the definition of a geodesic---the shortest path on the manifold between two points.

The key insight is that the tangent space $\mathcal{T}_{\mathbf{x}}\mathcal{M}$ encodes the local geometry of the manifold at $\mathbf{x}$. By respecting this geometry, mHC achieves faster convergence with less computation, as no energy is wasted on directions perpendicular to the manifold.
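
To make Eq. 1 concrete, the following NumPy sketch (illustrative only) implements one manifold-constrained residual step; it assumes that an orthonormal tangent basis $U_{\mathbf{x}}$ is given, for instance from a local PCA of nearby manifold samples, and that the residual function is an arbitrary placeholder.

import numpy as np

def mhc_residual_step(x, residual_fn, U):
    # x           : (d,) current hidden state, assumed to lie on the manifold
    # residual_fn : callable returning the raw residual F(x), shape (d,)
    # U           : (d, k) orthonormal basis of the tangent space T_x M
    r = residual_fn(x)                 # unconstrained residual F(x)
    r_tangent = U @ (U.T @ r)          # orthogonal projection U U^T F(x)
    return x + r_tangent               # Eq. (1)

# toy usage: 3-D ambient space, 2-D tangent plane
U, _ = np.linalg.qr(np.random.randn(3, 2))
x = np.array([1.0, 0.0, 0.0])
y = mhc_residual_step(x, lambda v: np.random.randn(3), U)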


3. The Optical Analogy Framework

We now develop a systematic mapping between the components of mHC and their optical counterparts. Table 1 presents the complete correspondence.

Table 1: Complete optical analogy mapping between mHC components and their optical counterparts.

AI Concept | Optical Counterpart | Explanation
High-dimensional semantic space | Free space | The full space available for propagation
Sparse semantic manifold | Phase-modulated propagation paths | Only specific directions carry valid signals
Doubly stochastic residual matrix | Ground glass diffuser | Each input point's energy randomly diffuses to the full space; row and column sums equal 1 (energy conservation)
Manifold tangent space constraint | SLM phase gradient | Pre-applied phase gradient modulation on scattered light guides energy toward focused targets
Geodesic search | Fermat's principle | Light automatically follows the time-shortest path; the model automatically follows the manifold's shortest path
Reduced computation | Less energy diffusion to empty regions | Energy focuses on the target point; higher energy utilization corresponds to less invalid computation
Nonlinear activation $\sigma(\cdot)$ | Nonlinear optical crystal (e.g., BBO, LiNbO$_3$) | Frequency mixing and harmonic generation break linearity, analogous to how $\sigma$ introduces nonlinearity in the residual flow
Dropout regularization | Dichroic partial polarizer | Selective energy absorption; random zeroing of neurons corresponds to random absorption of polarization components, breaking energy conservation
Layer-wise distribution transform | Optimal transport map $T: \mu \to \nu$ | Each layer maps source distribution to target via minimum-cost transport on the manifold

The power of this analogy lies in its completeness: every mathematical component of mHC has a direct physical counterpart, and the physical intuition transfers directly to the AI setting. We develop each correspondence in detail in the following sections.


4. Doubly Stochastic Matrices as Ground Glass Diffusers

4.1 Mathematical Definition

A doubly stochastic matrix $P \in \mathbb{R}^{n \times n}$ satisfies:

$$P_{ij} \geq 0 \quad \forall\, i,j \tag{2}$$

$$\sum_{j=1}^{n} P_{ij} = 1 \quad \forall\, i \quad \text{(row stochasticity)} \tag{3}$$

$$\sum_{i=1}^{n} P_{ij} = 1 \quad \forall\, j \quad \text{(column stochasticity)} \tag{4}$$

The set of all doubly stochastic matrices of size $n$ is known as the Birkhoff polytope [5], which is the convex hull of the permutation matrices. In the context of mHC, the residual connection matrix is constrained to lie on (or near) the Birkhoff polytope, ensuring that information propagates uniformly through the network without amplification or attenuation at any single node.
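
For readers who wish to experiment, a nonnegative matrix can be driven toward the Birkhoff polytope with Sinkhorn--Knopp normalization. The sketch below is an illustration of that procedure, not part of the mHC specification.

import numpy as np

def sinkhorn_normalize(A, n_iters=10, eps=1e-9):
    # Alternate row/column rescaling drives a nonnegative matrix toward
    # double stochasticity (Sinkhorn-Knopp); a few iterations usually suffice.
    P = np.asarray(A, dtype=float) + eps
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)   # row sums -> 1  (Eq. 3)
        P = P / P.sum(axis=0, keepdims=True)   # column sums -> 1 (Eq. 4)
    return P

P = sinkhorn_normalize(np.random.rand(5, 5))
print(P.sum(axis=1), P.sum(axis=0))            # both approximately all-ones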

4.2 Optical Correspondence: The Ground Glass Diffuser

Consider a ground glass diffuser---a piece of glass with a randomly roughened surface. When a beam of light strikes the diffuser, each point on the incident surface scatters light into a random pattern across the exit surface. The transmission matrix $T$ of such a diffuser has a remarkable property:

  • Row stochasticity (Eq. 3): For each input point $i$, the total energy scattered to all output points equals the incident energy. This is simply energy conservation---the diffuser does not create or destroy energy.
  • Column stochasticity (Eq. 4): For each output point $j$, the total energy received from all input points is equalized. This reflects the fact that a ground glass diffuser uniformly redistributes energy across its surface.

These are precisely the defining properties of a doubly stochastic matrix. The ground glass diffuser is, in effect, a physical realization of a doubly stochastic matrix.

[Figure 2: Correspondence between doubly stochastic matrices (left) and ground glass diffusers (right). Each row of the matrix corresponds to one input point's energy distribution; each column corresponds to one output point's energy collection. The row-sum and column-sum constraints are precisely energy conservation.]

4.3 Why Not an Ideal Lens?

An astute reader might ask: why compare the doubly stochastic matrix to a diffuser rather than an ideal lens? The answer lies in the mathematical properties:

  • An ideal lens performs a unitary (or more precisely, a Fourier) transformation. Its transfer matrix is unitary: $T^\dagger T = I$. All eigenvalues have unit modulus $|\lambda_i| = 1$. The lens redistributes energy spatially (focusing) but does not dissipate it.
  • A ground glass diffuser has a transfer matrix that is doubly stochastic but not unitary. Energy is conserved globally but scattered randomly. Its eigenvalues lie within the closed unit disk, with $\lambda_{\max} = 1$.

The doubly stochastic residual matrix in mHC is precisely in the latter category: it redistributes information across the network while preserving the total "information mass," but without the coherent phase relationships that characterize a unitary transformation. The raw residual connection, before manifold constraint, is scattering without focusing---exactly what a ground glass diffuser does.

4.4 Computational Imaging Connection

In computational imaging, a well-studied problem is imaging through scattering media [6]. The challenge is to recover an image from the speckle pattern produced by a diffuser, given knowledge of the diffuser's transmission matrix $T$. The solution involves computing $T^{-1}$ (or a regularized pseudoinverse) to "undo" the scattering.

This has a direct parallel in mHC: the doubly stochastic residual matrix scatters information across the full high-dimensional space, and the manifold constraint acts as the "decoding" step that recovers the semantically meaningful component. Without the constraint, the model is trying to interpret a speckle pattern; with the constraint, it recovers the focused image.
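
A minimal numerical illustration of this decoding step, under the simplifying assumption that the diffuser's transmission matrix is known exactly: a doubly stochastic "diffuser" (built here as a convex mixture of random permutation matrices) scatters a sparse signal into a speckle-like vector, and a pseudoinverse undoes the scattering.

import numpy as np

rng = np.random.default_rng(0)
n = 64

# doubly stochastic "diffuser": convex combination of random permutation matrices
T = np.zeros((n, n))
for _ in range(20):
    T += np.eye(n)[rng.permutation(n)]
T /= 20.0

x = np.zeros(n)
x[[5, 30, 50]] = 1.0                     # sparse "semantic" signal
speckle = T @ x                          # scattered measurement

print(speckle.sum() - x.sum())           # ~0: the diffuser conserves total "energy"
x_rec = np.linalg.pinv(T) @ speckle      # in practice a regularized inverse is used
print(np.linalg.norm(x - x_rec))         # small when T is well conditioned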


5. Manifold Constraints as Phase Modulation

5.1 The Problem with Pure Scattering

In the previous section, we established that the unconstrained doubly stochastic residual matrix acts as a ground glass diffuser---it scatters energy uniformly across the full space. While this preserves total energy (information mass), it is highly inefficient: most of the energy ends up in semantically empty regions, and only a small fraction reaches the target semantic point.

This is precisely the situation in computational optics when one tries to focus light through a diffuser without any correction. The resulting speckle pattern is spread across the entire observation plane, with very low peak intensity at any single point.

5.2 Spatial Light Modulators and Phase Correction

The solution in computational optics is to use a Spatial Light Modulator (SLM) [6]. An SLM is a device that can apply a spatially varying phase shift $\phi(x, y)$ to an incident wavefront. By choosing the phase pattern $\phi(x, y)$ appropriately, one can pre-compensate for the scattering introduced by the diffuser, so that the scattered light constructively interferes at the desired focal point.

Mathematically, if the diffuser's transmission matrix is $T$ and we apply a phase modulation matrix $\Phi$ before the diffuser, the effective transmission becomes:

$$T_{\text{eff}} = T \cdot \Phi$$

By choosing $\Phi$ such that $T_{\text{eff}}$ focuses energy at the target point, we achieve wavefront shaping [6, 7].

5.3 The mHC Parallel

The manifold constraint in mHC plays exactly the role of the SLM's phase modulation. Let us make this precise.

The tangent space projection in Eq. 1 can be written as:

$$\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}} = U_{\mathbf{x}} U_{\mathbf{x}}^\top$$

where $U_{\mathbf{x}}$ is an orthonormal basis for the tangent space $\mathcal{T}_{\mathbf{x}}\mathcal{M}$ at point $\mathbf{x}$. This projection matrix acts as a phase filter: it selects only those components of the residual that are "in phase" with the manifold's local geometry, and rejects components that are "out of phase" (i.e., perpendicular to the manifold).

[Figure 3: Optical interpretation of the manifold constraint. Top: Without constraint, the doubly stochastic matrix acts as a pure diffuser, scattering energy across the full space with low intensity at the target. Bottom: The manifold constraint (SLM) pre-compensates the phase, so that after scattering, energy focuses at the target point on the manifold.]

The complete analogy is:

  • Residual function $\mathcal{F}(\mathbf{x})$ $\longleftrightarrow$ Incident wavefront: the raw update direction.
  • Tangent space projection $\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}$ $\longleftrightarrow$ SLM phase pattern $\Phi$: selects only the components aligned with the manifold's local geometry.
  • Doubly stochastic matrix $\longleftrightarrow$ Ground glass diffuser $T$: redistributes the phase-modulated update across the network.
  • Focused output $\mathbf{x} + \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}[\mathcal{F}(\mathbf{x})]$ $\longleftrightarrow$ Focused beam at target: energy concentrates at the desired semantic point.

In essence, mHC installs a Spatial Light Modulator in the model's semantic space, replacing blind Euclidean residual paths with phase-guided energy focusing along the manifold's geodesic curvature.
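
In practice the tangent basis $U_{\mathbf{x}}$ is not given and must be estimated. One common approach, assumed in the illustrative sketch below (which is not the mHC implementation itself), is local PCA: take samples near $\mathbf{x}$, center them, and use the top singular vectors as tangent directions.

import numpy as np

def tangent_projector(neighbors, k):
    # neighbors : (m, d) manifold samples near the point of interest
    # k         : assumed local intrinsic dimensionality
    centered = neighbors - neighbors.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    U = Vt[:k].T                          # (d, k) orthonormal tangent basis
    return U @ U.T                        # projection matrix U U^T

# toy example: a short arc of a circle embedded in R^3 (intrinsic dimension 1)
t = np.linspace(0.0, 0.3, 30)
pts = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
P = tangent_projector(pts, k=1)
residual = np.array([0.0, 1.0, 1.0])      # has an off-manifold z-component
print(P @ residual)                       # the z-component is (approximately) removed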


6. Geodesic Search and Fermat's Principle

6.1 Fermat's Principle in Optics

Fermat's principle states that light travels between two points along the path that takes the least time (more precisely, the path of stationary optical length):

$$\delta \int_A^B n(\mathbf{x}) \, ds = 0$$

where $n(\mathbf{x})$ is the refractive index at point $\mathbf{x}$ and $ds$ is the infinitesimal arc length. The resulting path is a geodesic in the medium with refractive index $n(\mathbf{x})$.

In a uniform medium ($n = \text{const}$), the geodesic is a straight line. In a non-uniform medium, the geodesic curves according to the spatial variation of $n(\mathbf{x})$. This is why light bends when passing from air to water, or why gravitational lensing bends light around massive objects.

6.2 Geodesics on the Semantic Manifold

On a Riemannian manifold $(\mathcal{M}, g)$, a geodesic $\gamma(t)$ satisfies:

$$\frac{D}{dt} \dot{\gamma}(t) = 0$$

where $D/dt$ is the covariant derivative along $\gamma$ and $g$ is the Riemannian metric. In local coordinates, this becomes the geodesic equation:

$$\ddot{\gamma}^k + \Gamma^k_{ij} \dot{\gamma}^i \dot{\gamma}^j = 0$$

where $\Gamma^k_{ij}$ are the Christoffel symbols encoding the manifold's curvature.

mHC approximates this geodesic motion through the tangent space projection in Eq. 1. At each step, the update direction is projected onto the tangent space, ensuring that the trajectory remains on (or very close to) the manifold. For small step sizes, this is precisely the first-order approximation to geodesic motion.
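
The following toy sketch illustrates this first-order picture on the unit sphere, chosen only because its tangent spaces are easy to write down: tangent-projected steps, followed by a retraction back onto the sphere, keep the iterate on the manifold, whereas raw Euclidean residual steps drift off it. The explicit retraction is an addition of the toy example, not part of mHC.

import numpy as np

def project_to_tangent(x, v):
    # tangent space of the unit sphere at x: remove the radial component
    return v - (v @ x) * x

rng = np.random.default_rng(1)
x_raw = np.array([1.0, 0.0, 0.0])
x_proj = x_raw.copy()
for _ in range(100):
    v = 0.05 * rng.standard_normal(3)            # raw residual update F(x)
    x_raw = x_raw + v                            # Euclidean residual: drifts off-manifold
    x_proj = x_proj + project_to_tangent(x_proj, v)
    x_proj = x_proj / np.linalg.norm(x_proj)     # retraction back onto the sphere

print(abs(np.linalg.norm(x_raw) - 1.0))          # generally drifts away from 0
print(abs(np.linalg.norm(x_proj) - 1.0))         # ~0: stays on the manifold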

6.3 The Unified Principle

The correspondence is now clear:

Just as Fermat's principle governs light propagation by minimizing optical path length in physical space, mHC governs information propagation by minimizing semantic path length on the manifold. Light finds the shortest optical path; the model finds the shortest semantic path.

This is not a superficial analogy. Both principles are manifestations of the same underlying mathematical structure: the calculus of variations on a Riemannian manifold. In optics, the Riemannian metric is determined by the refractive index field $n(\mathbf{x})$; in mHC, the Riemannian metric is determined by the semantic geometry of the embedding space. The geodesic equation is the same in both cases; only the metric differs.

[Figure 4: Unified view of geodesic principles. Top: On the semantic manifold, mHC follows the geodesic (green) rather than the Euclidean shortcut (dashed red). Bottom: In optics, Fermat's principle causes light to bend at a refractive index interface, following the shortest optical path. Both are geodesics on Riemannian manifolds.]


7. Extension to General Relativity

The optical analogy naturally extends to one of the deepest ideas in physics: Einstein's general theory of relativity.

7.1 Worldlines as Geodesics

In general relativity, all free particles follow geodesics in four-dimensional spacetime:

  • Massless particles (light) follow null geodesics (worldlines with zero proper length).
  • Massive particles follow timelike geodesics (worldlines with real proper length, equal to the proper time experienced by the particle).

This is not a coincidence but a fundamental principle: the geodesic equation emerges from the equivalence principle, which states that gravity is not a force but a manifestation of spacetime curvature. Objects in free fall follow geodesics because there is no force to deflect them from these natural paths.

7.2 The Semantic Spacetime Upgrade

We can now state the central claim of this section:

DeepSeek's Manifold Constrained Residual Connection performs a "general relativity upgrade" on the model's semantic spacetime: it replaces flat Euclidean geometry with curved Riemannian geometry, allowing the model to naturally follow geodesics along the manifold's curvature.

Let us make the correspondence explicit:

Table 2: Correspondence between general relativity and mHC.

General Relativity | mHC | Interpretation
Minkowski spacetime (flat) | Standard residual connection | Flat geometry; straight-line paths
Curved spacetime ($g_{\mu\nu}$) | Semantic manifold $\mathcal{M}$ with metric $g$ | Curved geometry; paths follow curvature
Null geodesics (light) | Shallow reasoning paths | Short paths on manifold; Flash version
Timelike geodesics (matter) | Deep reasoning paths | Long paths on manifold; Pro version
Equivalence principle | Manifold hypothesis | "Gravity" = manifold curvature
Geodesic equation | Tangent space projection | Natural motion follows curvature

7.3 Flash vs. Pro: Null vs. Timelike Geodesics

DeepSeek V4's two versions---Flash and Pro---find a natural interpretation in this framework:

  • Flash version: Designed for shallow reasoning tasks where the solution lies close to the prompt in semantic space. The required path is short, analogous to a null geodesic (zero proper length). Standard residual connections are sufficient because the Euclidean shortcut does not deviate far from the manifold. The computational cost is low and speed is high.

  • Pro version: Designed for deep reasoning tasks where the solution lies far from the prompt. The required path is long, analogous to a timelike geodesic with substantial proper length. Over long distances, the Euclidean shortcut deviates significantly from the manifold, and mHC's geodesic following becomes essential. The computational cost is higher, but the accuracy improvement is substantial.

This is directly analogous to the difference between short-distance and long-distance light propagation. Over short distances, the difference between a straight line and a geodesic is negligible; over long distances (e.g., gravitational lensing by a galaxy cluster), the difference becomes dramatic.


8. Beyond Linear Optics: Nonlinearity and Dissipation

The optical analogy developed in Sections 3--7 provides a powerful conceptual framework, but it rests on two assumptions that limit its fidelity to real neural networks. In this section, we extend the analogy to cover the two remaining core components of a residual block: nonlinear activation functions and Dropout regularization.

8.1 The Linearity Limitation

The optical systems considered thus far---free space propagation, ground glass diffusers, and SLM phase modulation---are all governed by the linear Maxwell equations in vacuum:

$$\nabla \times \nabla \times \mathbf{E} + \frac{1}{c^2}\frac{\partial^2 \mathbf{E}}{\partial t^2} = 0$$

In linear optics, the principle of superposition holds: the response to a sum of inputs equals the sum of the responses to the individual inputs. Furthermore, energy is strictly conserved (Poynting's theorem), which corresponds to the doubly stochastic property of the residual matrix: its rows and columns sum to one, so the total information mass is preserved.

Real neural networks violate both assumptions:

  1. Nonlinear activation functions $\sigma(\cdot)$ (ReLU, GELU, SwiGLU) break the superposition principle. The output of a layer is not a linear function of its input.
  2. Dropout [8] randomly zeros neurons during training, which is equivalent to energy dissipation---the total "information mass" is not preserved.

To bridge this gap, we turn to two well-studied phenomena in modern optics: nonlinear optical crystals and dichroic polarizers.

8.2 Nonlinear Optical Crystals as Activation Functions

8.2.1 Nonlinear Polarization Response

In a nonlinear optical medium, the polarization response of the material includes higher-order terms [9, 10]:

$$P_i = \epsilon_0\bigl(\chi^{(1)}_{ij}E_j + \chi^{(2)}_{ijk}E_jE_k + \chi^{(3)}_{ijkl}E_jE_kE_l + \cdots\bigr) \tag{5}$$

where $\chi^{(1)}$ is the linear susceptibility (giving rise to refraction and diffraction), $\chi^{(2)}$ is the second-order nonlinear susceptibility, and $\chi^{(3)}$ is the third-order susceptibility. The $\chi^{(2)}$ term gives rise to phenomena such as second harmonic generation (SHG), sum frequency generation, and optical parametric amplification---all of which break the linearity of wave propagation.

8.2.2 Frequency Mixing as Feature-Scale Transformation

Consider second harmonic generation in a $\beta$-barium borate (BBO) or lithium niobate (LiNbO$_3$) crystal [11, 12]. An input wave at frequency $\omega$ generates an output wave at frequency $2\omega$:

$$E_{\text{out}}(2\omega) \propto \chi^{(2)} \cdot E_{\text{in}}(\omega)^2$$

This is not merely a scaling---it is a frequency transformation. Energy at one scale (frequency) is converted to energy at a different scale. The crystal acts as a mode converter: it takes energy from one representation space and maps it to another.

The parallel to neural network activation functions is striking:

  • ReLU $\sigma(x) = \max(0, x)$: This is analogous to half-wave rectification in optics, where only the positive half-cycle of the electric field is transmitted. In nonlinear optics, this can be achieved through directional couplers or periodic poling.
  • GELU / SwiGLU: These smooth, gated nonlinearities are analogous to the smooth nonlinear response of a $\chi^{(2)}$ crystal near phase-matching conditions, where the conversion efficiency varies smoothly with the input field amplitude.
  • General activation: The activation function maps the linear residual $\mathcal{F}(\mathbf{x})$ to a nonlinear space $\sigma(\mathcal{F}(\mathbf{x}))$, just as the nonlinear crystal maps the fundamental frequency field to harmonic components.

The mathematical correspondence can be written as:

$$\text{Neural network:} \quad \mathbf{y} = \mathbf{x} + \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}\bigl[\sigma(\mathcal{F}(\mathbf{x}))\bigr] \tag{6}$$

$$\text{Optics:} \quad \mathbf{E}_{\text{out}} = \mathbf{E}_{\text{in}} + \Pi_{\mathrm{SLM}}\bigl[\chi^{(2)} : \mathbf{E}_{\text{in}} \otimes \mathbf{E}_{\text{in}}\bigr] \tag{7}$$

where $\chi^{(2)} : \mathbf{E} \otimes \mathbf{E}$ denotes the second-order nonlinear coupling and $\sigma(\cdot)$ denotes the element-wise nonlinear activation. In both cases, a linear projection (tangent space or SLM) is applied after a nonlinear transformation (activation or frequency mixing).
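
A brief numerical illustration of the ReLU / half-wave-rectification correspondence and of the $\chi^{(2)}$-style squaring; this is purely pedagogical and makes no claim about any specific crystal.

import numpy as np

t = np.linspace(0.0, 4.0 * np.pi, 1000)
field = np.sin(t)                        # fundamental field E(t) at frequency omega

rectified = np.maximum(0.0, field)       # ReLU: only the positive half-cycle is transmitted
squared = field ** 2                     # chi^(2)-like response: sin^2 = (1 - cos 2wt)/2,
                                         # i.e. a DC term plus a second-harmonic term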

8.2.3 Optical Parametric Amplification and Feature Gating

A deeper connection exists through optical parametric amplification (OPA). In OPA, a strong pump wave at frequency $\omega_p$ amplifies a signal wave at $\omega_s$ while generating an idler wave at $\omega_i = \omega_p - \omega_s$. The amplification depends on the phase matching condition between the three waves.

This is directly analogous to the gating mechanism in SwiGLU:

$$\text{SwiGLU}(\mathbf{x}) = (\mathbf{x} \mathbf{W}_1) \odot \operatorname{Swish}(\mathbf{x} \mathbf{W}_2)$$

where one branch of the computation (the "gate") modulates which components of the other branch are amplified or suppressed. The pump wave in OPA plays the role of the gate: it determines which signal components are amplified (passed through) and which are suppressed (filtered out).

8.3 Dichroic Polarizers as Dropout

8.3.1 Dichroism and Selective Absorption

Dichroism is the property of certain materials to absorb light of different polarizations to different degrees [13, 14]. A dichroic polarizer transmits one polarization component while absorbing the orthogonal component. A partial polarizer transmits a fraction $p$ of one component and $1-p$ of the other.

8.3.2 Dropout as Stochastic Polarization Filtering

Dropout [8] randomly zeros each neuron with probability $p_{\text{drop}}$:

$$y_i = m_i \cdot x_i, \quad m_i \sim \text{Bernoulli}(1 - p_{\text{drop}}) \tag{8}$$

This is precisely equivalent to passing light through a partial polarizer array where each polarizer element independently transmits with probability $1 - p_{\text{drop}}$:

$$E_{\text{out},i} = T_{ii} \cdot E_{\text{in},i}, \quad T_{ii} \sim \text{Bernoulli}(1 - p_{\text{absorb}}) \tag{9}$$

The key insight is that both operations break energy conservation. In the optical analogy:

  • Without Dropout, the effective transmission matrix $T_{\text{eff}}$ of a residual block preserves the total information mass (its rows and columns sum to one---the doubly stochastic property).
  • With Dropout, the singular values of $T_{\text{eff}}$ are generally less than 1, because energy is absorbed at each polarizer element.
  • This is physically correct: real optical systems always have some loss, and the doubly stochastic matrix is an idealization.

8.3.3 Inverted Dropout and Gain Compensation

In practice, modern implementations use inverted Dropout, which scales the surviving neurons by $1/(1-p_{\text{drop}})$ during training:

$$y_i = \frac{m_i}{1 - p_{\text{drop}}} \cdot x_i$$

This has a direct optical analog: gain compensation in amplified fiber links, where a gain medium (e.g., an erbium-doped fiber amplifier) offsets the absorption losses introduced by the polarizer. The net effect is to maintain the average energy level while preserving the stochastic nature of the filtering.
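
The energy bookkeeping of the two Dropout variants is easy to verify numerically; the sketch below uses a unit-amplitude "field" and a Bernoulli mask, and is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = np.ones(100_000)                               # unit-amplitude "field"
p_drop = 0.1
mask = rng.random(x.shape) >= p_drop               # Bernoulli(1 - p_drop) "polarizer"

plain = mask * x                                   # standard Dropout
inverted = mask * x / (1.0 - p_drop)               # inverted Dropout with gain compensation

print(np.mean(plain ** 2))      # ~0.9: total "energy" is absorbed
print(np.mean(inverted))        # ~1.0: the mean activation is restored
print(np.mean(inverted ** 2))   # ~1.11: second moment inflated by 1/(1 - p_drop)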

8.4 The Complete Optical Pipeline

With the addition of nonlinear optical crystals and dichroic polarizers, the optical analogy now covers every core component of a residual block. The complete pipeline is:

$$\boxed{\;\text{Input } \mathbf{x} \;\xrightarrow{\;\text{SLM: } \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}\;}\; \cdot \;\xrightarrow{\;\text{Crystal: } \sigma(\cdot)\;}\; \cdot \;\xrightarrow{\;\text{Diffuser: } \mathcal{F}(\cdot)\;}\; \cdot \;\xrightarrow{\;\text{Polarizer: Dropout}\;}\; \text{Output } \mathbf{y}\;}$$

[Figure 5: The complete optical pipeline corresponding to a single residual block. Light (information) enters from the left and passes through four stages: (1) an SLM that projects onto the manifold's tangent space, (2) a nonlinear optical crystal that applies the activation function, (3) a ground glass diffuser that performs doubly stochastic mixing, and (4) a dichroic polarizer array that implements Dropout. The dashed line represents the residual (identity) bypass.]

This pipeline reveals an important property: the optical system progresses from coherent (phase-modulated) to incoherent (scattered) to dissipative (absorbed). Each stage introduces a qualitatively different physical effect, and the neural network's residual block mirrors this progression exactly. The analogy is now complete: there are no remaining components of the residual block without an optical counterpart.


9. Optimal Transport Formulation

The optical analogy developed so far provides powerful intuition, but it remains fundamentally qualitative. To move from analogy to rigorous mathematics, we need a framework that can (1) derive the dynamics of information flow on the manifold, (2) connect the "refractive index" metaphor to computable quantities, and (3) generate testable predictions. We argue that optimal transport theory [15, 16, 17] provides precisely this framework.

9.1 From Qualitative Analogy to Quantitative Mathematics

The central question is: what precisely does a residual block do to the distribution of hidden states? At each layer, the network transforms a distribution of representations from one region of the semantic manifold to another. This is precisely the problem that optimal transport theory addresses: finding the most efficient way to "transport" one probability distribution to another.

The connection to optics is natural. In geometric optics, Fermat's principle states that light follows the path of least optical length. In optimal transport, the Monge--Kantorovich problem seeks the map that transports mass from a source distribution to a target distribution at minimum cost. Both are variational principles on a Riemannian manifold; both produce geodesics as solutions.

9.2 The Optimal Transport Reformulation

9.2.1 The Monge Problem on the Semantic Manifold

Let $\mu$ denote the probability measure of hidden states at the input to a residual block, and $\nu$ the target measure at the output. The Monge problem seeks a transport map $T^*: \mathbb{R}^d \to \mathbb{R}^d$ that pushes $\mu$ forward to $\nu$ at minimum cost:

$$T^* = \arg\min_{T_\#\mu = \nu} \int_{\mathbb{R}^d} c(\mathbf{x}, T(\mathbf{x})) \, d\mu(\mathbf{x}) \tag{10}$$

where $T_\#\mu$ denotes the pushforward measure and $c(\mathbf{x}, \mathbf{y}) = d_{\mathcal{M}}^2(\mathbf{x}, \mathbf{y})$ is the squared geodesic distance on the semantic manifold $\mathcal{M}$.

9.2.2 Brenier's Theorem and the Manifold Constraint

When the cost is the squared Euclidean distance $c(\mathbf{x}, \mathbf{y}) = |\mathbf{x} - \mathbf{y}|^2$ and the measures are sufficiently regular, Brenier's theorem [18] guarantees the existence of a convex potential function $\varphi: \mathbb{R}^d \to \mathbb{R}$ such that:

$$T^*(\mathbf{x}) = \nabla\varphi(\mathbf{x}), \quad (\nabla\varphi)_\#\mu = \nu \tag{11}$$

The gradient $\nabla\varphi$ is the optimal transport map: it "pushes" the source measure to the target measure by following the gradient of the convex potential.

The connection to mHC is now immediate. The manifold constraint requires that the update direction lies in the tangent space:

$$T^*(\mathbf{x}) - \mathbf{x} \in \mathcal{T}_{\mathbf{x}}\mathcal{M}$$

In the Brenier framework, this becomes a constraint on the convex potential:

$$\nabla\varphi(\mathbf{x}) - \mathbf{x} \in \mathcal{T}_{\mathbf{x}}\mathcal{M} \tag{12}$$

This states that the deviation of the transport map from the identity must be tangent to the manifold. The constrained optimal transport map is then:

$$T^*_{\text{MHC}}(\mathbf{x}) = \mathbf{x} + \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}\bigl[\nabla\varphi(\mathbf{x}) - \mathbf{x}\bigr] \tag{13}$$

which is precisely the mathematical form of mHC (Eq. 1), with $\nabla\varphi(\mathbf{x}) - \mathbf{x}$ playing the role of the "unconstrained residual" $\mathcal{F}(\mathbf{x})$.

[Figure 6: Optimal transport on the semantic manifold. The transport map $T^* = \nabla\varphi$ pushes the source distribution $\mu$ (red) to the target distribution $\nu$ (green) along geodesics on $\mathcal{M}$. The manifold constraint (Eq. 12) ensures that the transport direction remains tangent to the manifold, which is precisely the mHC projection.]

9.2.3 A New Regularization from Transport Geometry

This reformulation suggests a new regularization term. The standard mHC projection $\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}$ is a hard constraint: any component of the residual perpendicular to the tangent space is exactly zero. In the optimal transport framework, we can instead introduce a soft penalty that discourages (but does not strictly forbid) transport outside the tangent space:

$$\mathcal{L}_{\text{OT-reg}} = \lambda \int_{\mathbb{R}^d} \bigl\|(\nabla\varphi(\mathbf{x}) - \mathbf{x}) - \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}[\nabla\varphi(\mathbf{x}) - \mathbf{x}]\bigr\|^2 \, d\mu(\mathbf{x}) \tag{14}$$

This penalizes the "normal component" of the transport map---the part that pushes mass away from the manifold. Unlike the hard projection, this soft penalty allows the network to temporarily leave the manifold when the data demands it, while still encouraging manifold-following behavior. The parameter $\lambda$ controls the strength of the geometric constraint.
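
A sketch of the soft penalty in Eq. 14 for a mini-batch follows, assuming for simplicity a single orthonormal tangent basis $U$ shared across the batch (in general the basis is point-dependent); the residual batch is a random placeholder.

import numpy as np

def ot_tangent_penalty(displacements, U, lam=1.0):
    # displacements : (B, d) rows of nabla phi(x) - x, i.e. the raw residuals F(x)
    # U             : (d, k) orthonormal tangent basis (simplification: shared over batch)
    tangent_part = displacements @ (U @ U.T)
    normal_part = displacements - tangent_part        # component pushed off the manifold
    return lam * np.mean(np.sum(normal_part ** 2, axis=1))

U, _ = np.linalg.qr(np.random.randn(8, 3))
F_x = np.random.randn(32, 8)
print(ot_tangent_penalty(F_x, U, lam=0.1))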

9.3 The "Refractive Index" of Semantic Space

9.3.1 From the Monge--Ampere Equation to Local Geometry

The optimal transport map satisfies the Monge--Ampere equation, which relates the source and target densities through the Hessian of the Brenier potential:

$$\det(D^2\varphi(\mathbf{x})) = \frac{\rho_\mu(\mathbf{x})}{\rho_\nu(\nabla\varphi(\mathbf{x}))} \tag{15}$$

where $\rho_\mu$ and $\rho_\nu$ are the density functions of the source and target measures, and $D^2\varphi$ is the Hessian matrix of $\varphi$.

The Hessian $D^2\varphi(\mathbf{x})$ encodes the local stretching of the transport map: its eigenvalues tell us how much the map expands or contracts volume in different directions at point $\mathbf{x}$. We define the effective refractive index of semantic space as:

$$n(\mathbf{x}) \propto \det(D^2\varphi(\mathbf{x}))^{1/d} \tag{16}$$

9.3.2 Connection to Fisher Information

The Fisher information matrix provides a natural Riemannian metric on the space of probability distributions [19]. For a model with parameters $\mathbf{x}$, the Fisher information matrix is:

$$F(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}\!\left[\nabla_{\mathbf{x}} \log p(y|\mathbf{x}) \cdot \nabla_{\mathbf{x}} \log p(y|\mathbf{x})^\top\right]$$

We hypothesize that the effective refractive index is related to the Fisher information by:

$$n(\mathbf{x}) \propto \det(F(\mathbf{x}))^{1/(2d)} \approx \text{LocalIntrinsicDim}(\mathbf{x}) \tag{17}$$

The intuition is as follows. On the semantic manifold, regions where the local intrinsic dimensionality is low correspond to areas where the data lies on a thin, well-defined submanifold. Here, $n(\mathbf{x})$ is small: "light" (information) propagates quickly, and the model can navigate with low computational cost. Regions where the local intrinsic dimensionality is high correspond to areas where the data is spread across many directions---the manifold is "thick" or the representation is "confused." Here, $n(\mathbf{x})$ is large: information propagates slowly, and the model requires more computation.

9.3.3 Testable Predictions

This formulation yields two concrete, testable hypotheses:

  1. Hypothesis H1: The local refractive index $n(\mathbf{x})$, estimated via local intrinsic dimensionality [20, 21], is positively correlated with the model's perplexity at that point. High refractive index $\Rightarrow$ high perplexity $\Rightarrow$ the model is "confused."
  2. Hypothesis H2: Layers with higher average refractive index contribute more to the final output quality (as measured by performance on downstream tasks), because they perform more "work" in transforming the representation.

Verification method: For a pretrained language model, compute the local intrinsic dimensionality at each layer using neighborhood-based estimators (e.g., the MLE estimator of [21]), and correlate with layer-wise perplexity estimates. If H1 holds, the Pearson correlation between $n(\mathbf{x})$ and perplexity should be significantly positive.
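
As a starting point for such a verification, a Levina--Bickel-type maximum-likelihood estimator of local intrinsic dimensionality can be implemented in a few lines. The sketch below is illustrative; the names `reps` and `token_ppl` in the commented lines are placeholders for per-layer hidden states and per-token perplexities that would have to be extracted from the model.

import numpy as np

def lid_mle(X, query, k=20):
    # Maximum-likelihood estimate of local intrinsic dimensionality around `query`
    # X : (N, d) representations from one layer;  query : (d,)
    dists = np.linalg.norm(X - query, axis=1)
    dists = np.sort(dists[dists > 0])[:k]
    return (k - 1) / np.sum(np.log(dists[-1] / dists[:-1]))

# sanity check: samples from a 2-D plane embedded in R^10 should give an estimate near 2
plane = np.random.randn(2000, 2) @ np.random.randn(2, 10)
print(lid_mle(plane, plane[0]))

# Hypothesis H1 (sketch):
# lids = np.array([lid_mle(reps, reps[i]) for i in range(len(reps))])
# r = np.corrcoef(lids, token_ppl)[0, 1]   # expected to be significantly positive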

9.4 Mean Field Theory and Layer-wise Dynamics

To complete the mathematical framework, we connect the optimal transport formulation to the mean field theory of neural networks [22, 23].

In the infinite-width limit, the evolution of the hidden state distribution across layers can be described as a Wasserstein gradient flow:

$$\partial_t \mu_t = -\nabla \cdot \left(\mu_t \, \nabla \frac{\delta \mathcal{W}}{\delta \mu_t}\right) \tag{18}$$

where $\mathcal{W}$ is the Wasserstein functional (the optimal transport cost) and $\mu_t$ is the distribution at "time" $t$ (which corresponds to layer depth).

Under the mHC constraint, the gradient flow becomes:

$$\partial_t \mu_t = -\nabla \cdot \left(\mu_t \, \Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}\left[\nabla \frac{\delta \mathcal{W}}{\delta \mu_t}\right]\right) \tag{19}$$

This is the precise continuous dynamics of a mHC-constrained network. The projection $\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}$ ensures that the "velocity field" of the distribution evolution is always tangent to the manifold---the distribution flows along the manifold, never through empty space. This provides the rigorous dynamical equation that the optical analogy (Fermat's principle) was pointing toward.
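
A particle discretization makes Eq. 19 tangible. In the toy sketch below (which makes no claim about real transformer dynamics), particles on the unit circle are driven toward a target point by a naive potential gradient; projecting the velocity field onto the circle's tangent space, as in Eq. 19, keeps all mass on the manifold while the distribution flows toward the target.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi / 2, size=200)
particles = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # mu_0, supported on the circle
target = np.array([0.0, 1.0])                                   # mode of the target measure nu

dt = 0.1
for _ in range(100):
    velocity = target - particles                               # naive gradient of the potential
    radial = np.sum(velocity * particles, axis=1, keepdims=True) * particles
    tangent_velocity = velocity - radial                        # projection onto the tangent space
    particles = particles + dt * tangent_velocity
    particles = particles / np.linalg.norm(particles, axis=1, keepdims=True)  # retraction

print(np.abs(np.linalg.norm(particles, axis=1) - 1.0).max())    # ~0: mass stays on the manifold
print(np.abs(particles - target).max())                         # small: the distribution has reached nu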


10. Proposed Architectural Improvements

The theoretical framework developed in the preceding sections suggests two concrete, implementable architectural improvements. We describe each in detail, including the optical motivation, mathematical formulation, and proposed validation methodology.

10.1 Experiment 1: Learnable Tangent Space Diffuser-SLM Module

10.1.1 Motivation

The standard mHC uses a fixed tangent space projection $\Pi_{\mathcal{T}_{\mathbf{x}}\mathcal{M}}$, which assumes that the manifold geometry is known a priori and does not change during training. In practice, the semantic manifold may shift and deform as the model learns. This motivates a learnable analog of the diffuser-SLM pair that can adapt to the data.

10.1.2 Design

We propose replacing the fixed projection with a learnable module inspired by wavefront shaping in computational optics [6, 7]:

$$\mathbf{y} = \mathbf{x} + T \cdot \Phi \cdot \mathcal{F}(\mathbf{x}) \tag{20}$$

where:

  • $T \in \mathbb{R}^{d \times d}$ is a learnable diffuser matrix, initialized near a doubly stochastic matrix and maintained on the Birkhoff polytope via Sinkhorn normalization [24].
  • $\Phi = \text{diag}(e^{i\theta_1}, \ldots, e^{i\theta_d})$ is a learnable SLM phase matrix, where $\theta_1, \ldots, \theta_d$ are trainable parameters.
  • $\mathcal{F}(\mathbf{x})$ is the standard residual function (e.g., a transformer MLP block).

10.1.3 Connection to Wavefront Shaping

In computational optics, wavefront shaping optimizes the SLM phase pattern to maximize the intensity at a target focal point. The training process for our module is directly analogous:

  • Gradient descent on $\theta$ $\longleftrightarrow$ iterative wavefront optimization: both adjust the phase pattern to improve focusing (reduce loss).
  • Sinkhorn normalization on $T$ $\longleftrightarrow$ energy conservation constraint: both ensure that the diffuser redistributes but does not create or destroy energy.
  • Backpropagation $\longleftrightarrow$ phase conjugation: both compute the optimal correction by propagating information backward through the system.

Algorithm 1: Learnable Diffuser-SLM Residual Block

Input:  Hidden state x in R^d, learnable parameters T, theta
Output: y in R^d

1:  // Sinkhorn normalization (every K steps)
2:  T <- SinkhornNormalize(T, iterations=10)
3:  // Phase modulation (SLM)
4:  Phi <- diag(e^{i*theta_1}, ..., e^{i*theta_d})
5:  // Residual computation
6:  r <- F(x)                           // Standard residual function
7:  r' <- T . Phi . r                   // Diffuser + SLM
8:  y <- x + sigma(r')                  // Add residual + activation
9:  return y
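
A minimal NumPy rendering of Algorithm 1 follows, under two simplifications that should be flagged: the complex SLM phase matrix $\Phi = \mathrm{diag}(e^{i\theta})$ is replaced by a real diagonal modulation $\cos\theta$ (a full implementation would carry the complex field), and the residual function $\mathcal{F}$ is an arbitrary placeholder.

import numpy as np

def sinkhorn_normalize(T, n_iters=10, eps=1e-9):
    P = np.abs(T) + eps
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)
        P = P / P.sum(axis=0, keepdims=True)
    return P

def diffuser_slm_block(x, F, T, theta):
    # Forward pass of the learnable Diffuser-SLM residual block (Algorithm 1).
    T_ds = sinkhorn_normalize(T)             # keep the diffuser (near) doubly stochastic
    phi = np.cos(theta)                      # real-valued stand-in for the SLM phase e^{i theta}
    r = F(x)                                 # standard residual function F(x)
    r_mod = T_ds @ (phi * r)                 # SLM modulation followed by the diffuser
    return x + np.maximum(0.0, r_mod)        # residual add + ReLU-style activation

d = 16
x = np.random.randn(d)
T = np.random.rand(d, d)                     # learnable diffuser parameters
theta = np.zeros(d)                          # learnable SLM phases
y = diffuser_slm_block(x, lambda v: 0.1 * np.random.randn(d), T, theta)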

10.1.4 Expected Properties

  1. Adaptive information scattering: The learnable diffuser $T$ allows the network to adjust which "information modes" are mixed, rather than using a fixed doubly stochastic pattern.
  2. Adaptive manifold constraint: The learnable SLM phases $\theta_i$ allow the network to learn which directions to amplify (constructive interference) and which to suppress (destructive interference), effectively learning an adaptive tangent space.
  3. Information conservation: The Sinkhorn constraint on $T$ ensures that the total "information mass" is preserved, preventing gradient explosion or vanishing.

10.1.5 Validation Plan

Compare the learnable module against three baselines on a small-scale language model (e.g., GPT-2 125M) and a downstream task:

  1. Standard residual connection (no constraint)
  2. Fixed mHC projection
  3. Proposed learnable Diffuser-SLM module

Metrics: perplexity on WikiText-103, training convergence speed, and generalization on few-shot tasks.

10.2 Experiment 2: Fermat's Principle Based Dynamic Context Pruning

10.2.1 Motivation

In long-context inference, not all past tokens contribute equally to the current generation. Existing pruning methods (H2O [25], StreamingLLM [26]) use attention scores as the importance metric. We propose a geometrically motivated alternative based on the "optical path length" concept from our framework.

10.2.2 Semantic Optical Path Length

In optics, the optical path length (OPL) between two points $A$ and $B$ is:

$$\text{OPL} = \int_A^B n(\mathbf{x}) \, ds$$

where $n(\mathbf{x})$ is the refractive index and $ds$ is the infinitesimal arc length.

We define the semantic optical path length for a context token $\mathbf{x}_i$ with respect to the current query $\mathbf{x}_q$ as:

$$\text{OPL}_i = n(\mathbf{x}_i) \cdot d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_q) \tag{21}$$

where $n(\mathbf{x}_i)$ is the local refractive index at token $i$ (estimated via local intrinsic dimensionality or attention entropy) and $d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_q)$ is the manifold distance (approximated via the attention score or embedding distance).

10.2.3 Pruning Strategy

Tokens with low OPL contribute little to the current generation and can be safely pruned:

  1. Compute $n(\mathbf{x}_i)$ for each context token (using local intrinsic dimensionality or attention entropy as a proxy).
  2. Compute $d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_q)$ (using attention scores as a proxy for manifold distance).
  3. Compute $\text{OPL}_i = n(\mathbf{x}_i) \cdot d_{\mathcal{M}}(\mathbf{x}_i, \mathbf{x}_q)$ for all context tokens.
  4. Retain the top-$B$ tokens by OPL, where $B$ is the context budget.

Algorithm 2: Fermat-Principle Context Pruning

Input:  Context tokens {x_1, ..., x_N}, query x_q, budget B
Output: Pruned context {x_{i_1}, ..., x_{i_B}}

1:  for i = 1 to N do
2:      n_i <- LocalIntrinsicDim(x_i)        // or attention entropy
3:      d_i <- AttentionScore(x_i, x_q)       // proxy for d_M
4:      OPL_i <- n_i * d_i
5:  end for
6:  {i_1, ..., i_B} <- TopK(OPL_1, ..., OPL_N, k=B)
7:  return {x_{i_1}, ..., x_{i_B}}
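
A NumPy sketch of Algorithm 2 follows. It is illustrative: the refractive index $n_i$ is proxied by a Levina--Bickel-type intrinsic dimensionality estimate and the manifold distance $d_i$ by cosine dissimilarity to the query, standing in for the attention-score proxy mentioned above.

import numpy as np

def fermat_prune(context, query, budget, k=10):
    # context : (N, d) cached token representations;  query : (d,);  budget : tokens to keep
    N = len(context)
    n = np.empty(N)
    for i in range(N):
        # n_i: local intrinsic dimensionality around token i (MLE over k nearest neighbors)
        dists = np.sort(np.linalg.norm(context - context[i], axis=1))[1:k + 1]
        n[i] = (k - 1) / np.sum(np.log(dists[-1] / dists[:-1]))
    # d_i: cosine dissimilarity as a stand-in for the manifold distance to the query
    sims = context @ query / (np.linalg.norm(context, axis=1) * np.linalg.norm(query) + 1e-9)
    d = 1.0 - sims
    opl = n * d                               # semantic optical path length (Eq. 21)
    keep = np.argsort(-opl)[:budget]          # retain the top-B tokens by OPL
    return np.sort(keep)

ctx = np.random.randn(200, 32)
q = np.random.randn(32)
print(fermat_prune(ctx, q, budget=50))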

10.2.4 Comparison with Existing Methods

Table 3: Comparison of context pruning strategies.

Method | Importance Metric | Geometric Basis | Limitation
StreamingLLM [26] | Attention sink tokens | None (heuristic) | Fixed sink tokens may not generalize
H2O [25] | Cumulative attention scores | None (statistical) | Ignores manifold geometry
Ours | $\text{OPL}_i = n_i \cdot d_i$ | Fermat's principle on $\mathcal{M}$ | Requires intrinsic dim. estimation

The key advantage of our approach is that it incorporates the local geometry of the semantic manifold through the refractive index $n(\mathbf{x}_i)$. Tokens in regions of high intrinsic dimensionality (high $n$) that are also close to the query (low $d_{\mathcal{M}}$) receive high OPL and are preserved. This is geometrically principled: these tokens lie on the "geodesic neighborhood" of the query and carry the most relevant information.

10.2.5 Validation Plan

Compare the OPL-based pruning against H2O and StreamingLLM on long-context benchmarks:

  • Tasks: Long document QA (e.g., NarrativeQA), passkey retrieval, code completion.
  • Models: LLaMA-2 7B or similar open-source model.
  • Metrics: Task accuracy vs. context budget $B$, inference throughput (tokens/second).

11. Discussion and Future Directions

11.1 Architectural Implications

The optical analogy suggests several concrete design principles for future LLM architectures:

  1. Adaptive phase modulation: Just as adaptive optics systems continuously adjust the SLM phase pattern to compensate for atmospheric turbulence, future architectures could dynamically adjust the manifold constraint based on the local curvature of the semantic manifold. Regions of high curvature (where the manifold bends sharply) would require stronger constraints, while regions of low curvature could use weaker constraints.

  2. Multi-wavelength residual connections: In optics, different wavelengths of light experience different refractive indices (dispersion). Analogously, different "semantic wavelengths" (e.g., syntactic vs. semantic vs. pragmatic features) might benefit from different manifold constraints. A multi-wavelength residual architecture could apply different tangent space projections for different feature subspaces.

  3. Cascaded diffuser-SLM pairs: Computational imaging systems often use cascaded pairs of diffusers and phase modulators to achieve deeper focusing. This suggests that stacking multiple mHC layers---each with its own doubly stochastic matrix and tangent space projection---could achieve progressively sharper focusing of semantic information.

  4. Non-Hermitian optics and gain/loss: Recent work in non-Hermitian optics studies systems with spatially varying gain and loss. This could inspire architectures where certain residual pathways are amplified (gain) while others are attenuated (loss), creating more efficient information routing on the manifold.

  5. Learnable wavefront shaping modules: As proposed in Section 10, replacing the fixed tangent space projection with a learnable diffuser-SLM pair allows the network to adaptively discover the manifold geometry during training, rather than imposing it a priori. The Sinkhorn constraint ensures information conservation while allowing flexible information routing.

  6. Geometry-guided context pruning: The "optical path length" concept (Section 10) provides a geometrically principled alternative to attention-score-based pruning, potentially enabling more aggressive context compression without quality degradation.

11.2 Broader Impact on AI

The manifold perspective has implications beyond residual connections:

  • Retrieval-Augmented Generation (RAG): RAG can be understood as finding neighboring points on the knowledge manifold. The optical analogy suggests that RAG retrieval should respect the manifold's geometry---retrieving documents that are "geodesically close" rather than "Euclideanly close" in embedding space.

  • Synthetic data generation: Generating synthetic data corresponds to finding valid points on the knowledge manifold. The optical analogy suggests using "manifold-aware" generation that follows geodesics from known valid points, rather than sampling uniformly from the embedding space.

  • Supervised fine-tuning: Fine-tuning can be understood as attaching new knowledge to the manifold at the correct location. The optical analogy suggests that fine-tuning should preserve the manifold's local geometry, analogous to adding a new optical element without disrupting the existing wavefront.

  • Information flow analysis via optimal transport: The optimal transport formulation (Section 9) provides new tools for analyzing how information flows between layers. Computing the transport cost between consecutive layers' hidden-state distributions identifies "bottleneck layers" where the cost is highest; these layers perform the most significant representation transformations and may correspond to the model's most critical computational stages (a minimal sketch follows this list).
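As a sketch of the bottleneck-layer diagnostic in the last bullet, the code below treats the token hidden states of consecutive layers as uniform empirical distributions and computes an entropy-regularized transport cost between them with a plain Sinkhorn iteration [24]. The squared-Euclidean cost, the regularization value, and the function names are illustrative assumptions; a log-domain implementation would be needed for numerical stability at small regularization.

```python
import torch

def sinkhorn_cost(x: torch.Tensor, y: torch.Tensor,
                  reg: float = 0.1, n_iters: int = 100) -> float:
    """Entropy-regularized OT cost between two point clouds (T, D), uniform weights."""
    C = torch.cdist(x, y) ** 2                       # squared-Euclidean cost matrix
    K = torch.exp(-C / reg)                          # Gibbs kernel
    a = torch.full((x.shape[0],), 1.0 / x.shape[0])
    b = torch.full((y.shape[0],), 1.0 / y.shape[0])
    u = torch.ones_like(a)
    for _ in range(n_iters):                         # alternating Sinkhorn scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = torch.diag(u) @ K @ torch.diag(v)            # transport plan
    return float((P * C).sum())

def layer_transport_costs(hidden_states):
    """hidden_states: list of (T, D) tensors, one per layer; returns per-gap costs."""
    return [sinkhorn_cost(h0, h1)
            for h0, h1 in zip(hidden_states[:-1], hidden_states[1:])]
```

Plotting these per-gap costs across depth would, under this framing, expose the layers doing the heaviest representational work.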

11.3 Limitations

We acknowledge several limitations of this work:

  1. The optical analogy is a conceptual framework that we have begun to formalize using optimal transport theory (Section 9). While the structural correspondences are precise and the optimal transport reformulation provides a rigorous mathematical foundation, several assumptions remain to be validated empirically---in particular, the hypothesized relationship between the effective refractive index and local intrinsic dimensionality (Hypotheses H1--H2 in Section 9).

  2. The analogy does not directly suggest new training algorithms or loss functions. It provides intuition and design principles, but translating these into concrete engineering improvements requires further work.

  3. The general relativity extension, while evocative, should not be taken too literally. The semantic manifold is not a physical spacetime, and concepts like "causality" and "proper time" do not have direct analogs in the semantic setting.

  4. Estimating local intrinsic dimensionality and Fisher information matrices in high-dimensional spaces ($d \geq 4096$) is computationally expensive. Practical deployment of the refractive-index-based methods (Section 9) will require efficient approximation algorithms, such as random-projection-based estimators or attention-entropy proxies (a minimal nearest-neighbor sketch follows this list).

  5. The proposed architectural improvements (Section 10)---the learnable diffuser-SLM module and the OPL-based context pruning---have not yet been validated experimentally. Their effectiveness on real-world language modeling tasks remains an open question that we plan to address in future work.
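Regarding limitation 4, one candidate for a cheap estimator is the TwoNN method of Facco et al. [21], sketched below on a subsample of hidden states. This is a standard estimator quoted for illustration, not a component of mHC, and the subsample size is an arbitrary choice.

```python
import torch

def twonn_intrinsic_dim(x: torch.Tensor, max_points: int = 2048) -> float:
    """TwoNN estimator [21]: d ~ N / sum_i log(r2_i / r1_i), where r1_i and r2_i
    are the distances from point i to its first and second nearest neighbors."""
    if x.shape[0] > max_points:                       # subsample for tractability
        x = x[torch.randperm(x.shape[0])[:max_points]]
    dists = torch.cdist(x, x)
    dists.fill_diagonal_(float('inf'))                # exclude self-distances
    r, _ = torch.sort(dists, dim=1)
    mu = r[:, 1] / r[:, 0]                            # ratio of 2nd to 1st NN distance
    return float(x.shape[0] / torch.log(mu).sum())
```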


12. Conclusion

We have presented a complete optical analogy for DeepSeek's Manifold Constrained Residual Connection, mapping each component to a well-understood phenomenon in optics and computational imaging:

  1. The high-dimensional semantic space corresponds to free space for wave propagation.
  2. The sparse semantic manifold corresponds to phase-modulated propagation paths where only specific directions carry valid signals.
  3. The doubly stochastic residual matrix corresponds physically to a ground glass diffuser that scatters energy across the full space while conserving total energy.
  4. The manifold tangent space constraint corresponds to a Spatial Light Modulator (SLM) phase gradient that guides scattered energy toward focused targets.
  5. Geodesic search on the manifold corresponds to Fermat's principle of least time.
  6. The reduced computational cost corresponds to less energy diffusion to empty regions---energy focuses on the target point, achieving higher utilization.
  7. The optical analogy extends to nonlinear optics and energy dissipation: nonlinear optical crystals (corresponding to activation functions) introduce frequency mixing that breaks the linearity of the residual flow, and dichroic polarizers (corresponding to Dropout) introduce stochastic energy absorption that breaks energy conservation. Together, these complete the optical pipeline to cover every component of a residual block.
  8. Optimal transport theory provides a rigorous mathematical foundation: the residual update is an optimal transport map $T^* = \nabla\varphi$ constrained by the manifold's Brenier potential, with the effective "refractive index" linked to local intrinsic dimensionality and Fisher information. This yields testable hypotheses connecting manifold geometry to model perplexity.
  9. Two concrete architectural improvements emerge: a learnable wavefront-shaping module that replaces fixed tangent space projection with adaptive phase modulation (Algorithm 1), and a Fermat-principle-based dynamic context pruning mechanism that uses "optical path length" to identify and remove low-contribution tokens (Algorithm 2).

This analogy extends naturally to general relativity: mHC effectively upgrades the model's "semantic spacetime" from flat Euclidean geometry to curved Riemannian geometry, allowing the model to naturally follow geodesics along the manifold's curvature. The distinction between DeepSeek V4's Flash and Pro versions finds a natural interpretation as the difference between null geodesics (short, shallow reasoning paths) and timelike geodesics (long, deep reasoning paths).

We believe this cross-disciplinary perspective serves two purposes. First, it provides an intuitive understanding of mHC that is accessible to both AI engineers and physicists, bridging a gap between two communities that rarely communicate. Second, it suggests concrete architectural design principles---adaptive phase modulation, multi-wavelength residuals, cascaded diffuser-SLM pairs---that may inspire new approaches to LLM architecture design.

The deepest lesson, perhaps, is that the mathematics of geodesics on Riemannian manifolds is universal: it governs the motion of light in physical space, the orbits of planets in curved spacetime, and now, the flow of information in artificial neural networks. When DeepSeek constrained residual connections to follow the manifold's tangent space, they were, in a precise mathematical sense, teaching their model to do what light has always done: find the shortest path. With the extensions proposed in this paper---nonlinear optical elements that capture activation functions and Dropout, optimal transport dynamics that formalize the analogy into rigorous mathematics, and learnable wavefront shaping modules that translate the physics into implementable architecture---this cross-disciplinary bridge moves from analogy to architecture, from intuition to implementation.


References

  1. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770--778, 2016.

  2. Z. D. Xie, Y. X. Wei, H. Q. Cao, et al. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880, 2025. https://arxiv.org/abs/2512.24880

  3. Y. Yang. Knowledge Manifold Learning: Unveiling the Topological Laws of Scientific Revolutions and Filtering Citation Manipulation. AgentXiv, 2026. https://agentxiv.org/paper/2603.00002

  4. D. F. Zhu, H. Z. Huang, Z. H. Huang, et al. Hyper-Connections. arXiv preprint arXiv:2409.19606, 2024. https://arxiv.org/abs/2409.19606

  5. G. Birkhoff. Tres observaciones sobre el algebra lineal. Universidad Nacional de Tucuman, Revista Matematicas y Fisica Teorica, 5:147--151, 1946.

  6. E. Velten, T. Willwacher, O. Katz, and A. Silberberg. Seeing through random scattering media by wavefront shaping. Nature Photonics, 6(2):118--121, 2012.

  7. S. M. Popoff, G. Lerosey, R. Carminati, M. Fink, A. C. Boccara, and S. Gigan. Measuring the transmission matrix in optics. Physical Review Letters, 104(10):100601, 2010.

  8. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929--1958, 2014.

  9. R. W. Boyd. Nonlinear Optics. Academic Press, 4th edition, 2020.

  10. Y. R. Shen. The Principles of Nonlinear Optics. Wiley, 2003.

  11. M. M. Fejer, G. A. Magel, D. H. Jundt, and R. L. Byer. Quasi-phase-matched second harmonic generation: tuning and tolerances. IEEE Journal of Quantum Electronics, 28(12):2631--2654, 1992.

  12. R. Weis and T. K. Gaylord. Lithium niobate: summary of physical properties and crystal structure. Applied Physics A, 37:191--203, 1985.

  13. E. Hecht. Optics. Pearson, 5th edition, 2017.

  14. A. Yariv and P. Yeh. Optical Waves in Crystals. Wiley, 2007.

  15. C. Villani. Optimal Transport: Old and New. Springer, 2009.

  16. F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhauser, 2015.

  17. G. Peyre and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355--607, 2019.

  18. Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375--417, 1991.

  19. S. Amari. Information Geometry and Its Applications. Springer, 2016.

  20. L. Amsaleg, O. Chelly, T. Furon, S. Guillemot, M. J. E. Houle, K. Kawabata, and J. S. S. T. Nettleton. Estimating local intrinsic dimensionality. In Proceedings of ACM SIGKDD, pp. 29--38, 2015.

  21. E. Facco, A. Laio, and M. D. Estevez. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7:12140, 2017.

  22. S. Mei, A. Montanari, and P.-M. Nguyen. A mean field theory of batch normalization. In Proceedings of ICLR, 2019.

  23. J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks. Mathematical Statistics and Learning, 3(3):239--287, 2020.

  24. M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Proceedings of NeurIPS, 2013.

  25. Z. Liu, Y. Wang, S. M. Xie, Y. Zheng, R. Zhu, S. Chen, and B. Li. H2O: Heavy-Hitter Oracle for efficient generative inference of large language models. In Proceedings of EMNLP, 2023.

  26. G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In Proceedings of ICML, 2023.

  27. G. Mena, J. Pleiss, M. Tomasev, and Y. W. Teh. Learning optimal transport maps from unpaired samples. In Proceedings of ICLR, 2019.