Why Atomicity Matters to AI ML Infrastructure (Borrill, 2026)

URL: https://arxiv.org/abs/2603.02603

The paper argues atomicity in AI/ML training infrastructure is impossible under asynchronous distributed execution with failures. It claims large-scale AI training systems rest on two false assumptions: (1) "a checkpoint represents a single coherent global state -- there exists a time t at which everything was saved"; (2) "firmware and infrastructure updates can be applied to the cluster without any reachable execution in which different nodes operate under different protocol semantics." Both assumptions, the paper argues, are structurally unsound. The "Forward-In-Time-Only Category Mistake" the title names is the confusion between temporal predicates ("at clock time t, all components reflect epoch e") and protocol properties ("the distributed protocol has converged to epoch e"). The paper's argument: "atomicity must be explicitly constructed as protocol convergence -- through bilateral commit, consensus-mediated transitions, and validation checkpoints -- rather than assumed as a temporal property."

Adopted

The paper shares a philosophical posture with this graph: atomicity belongs in the runtime substrate as a protocol mechanism, not in application code as best-effort. The "different logical types" framing (temporal predicates vs protocol properties) is the same insight the substrate-vs-glue diagnostic ([[Agent Runtimes Require Substrate Primitives, Not External Glue]]) names: the gap between "infrastructure magically ensures atomicity at time t" and "the substrate implements a distributed coordination protocol" is exactly the gap between user-space rebuild and substrate primitive.

Not adopted (yet)

The paper concerns training-time checkpointing across thousands of GPUs in distributed AI/ML training systems. eOS Continuum's substrate is single-coherence-domain by design ([[Adopt Single-Coherence-Domain Architecture]]); the distributed-atomicity question the paper engages does not apply to a single-machine substrate. The relevance is at the philosophical-posture level (substrate-as-protocol-mechanism) rather than at the implementation level. Tangentially relevant; included for completeness in the supporting-evidence inventory.

Sources

Relations