OpenAI Is Treating the Network as Part of the Model

May 10, 2026

Frontier AI training now depends on the network as much as the chips. MRC is OpenAI's answer to that dependency.

OpenAI has published MRC — Multipath Reliable Connection — a new networking protocol designed for large AI training clusters, co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The protocol is already deployed across OpenAI's largest NVIDIA GB200 supercomputers, including Oracle's Abilene site and Microsoft's Fairwater systems. OpenAI also released the specification through the Open Compute Project, framing this less as a private optimization and more as a proposed industry standard.

The problem MRC solves is real and expensive. A frontier training run keeps hundreds of thousands of GPUs operating in tight synchrony. If one switch, one link, or one route slows down, the entire job can stall while the compute sits idle. At the scale of modern training clusters, even small network inefficiencies translate directly into wasted GPU time, and GPU time is currently one of the most expensive resources in AI development.
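
To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The GPU count, stall length, and hourly rate are all illustrative assumptions, not OpenAI figures:

```python
# Rough cost of one network stall in a synchronous training job.
# Every number here is an illustrative assumption, not an OpenAI figure.

gpus = 100_000            # GPUs blocked while a collective waits on a slow path
stall_seconds = 30        # one brief stall from a degraded link or route
cost_per_gpu_hour = 2.00  # assumed blended $/GPU-hour

wasted_gpu_hours = gpus * stall_seconds / 3600
wasted_dollars = wasted_gpu_hours * cost_per_gpu_hour

print(f"{wasted_gpu_hours:,.0f} GPU-hours idle")   # ~833 GPU-hours
print(f"${wasted_dollars:,.0f} wasted per stall")  # ~$1,667
```

A handful of stalls like that per day, across a fleet of this size, adds up quickly. That is the inefficiency MRC is aimed at.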

MRC addresses this by spreading data across many network paths simultaneously and routing around failures in microseconds. Instead of relying on a single path that becomes a bottleneck, the protocol dynamically balances load and recovers from disruptions before the training job notices them.
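
The OCP specification is the authoritative description of how MRC does this. As a rough intuition only, here is a minimal Python sketch of the multipath idea: a sender sprays packets round-robin across several paths and simply skips any path currently marked unhealthy. The `Path` class and its health flag are invented for illustration and are not MRC's actual mechanics:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Path:
    """One of several routes between two endpoints (illustrative)."""
    name: str
    healthy: bool = True
    sent: int = 0

def spray(packets, paths):
    """Round-robin packets across whichever paths are currently healthy."""
    rr = itertools.cycle(range(len(paths)))
    for _packet in packets:
        for _ in range(len(paths)):      # consider each path at most once
            path = paths[next(rr)]
            if path.healthy:
                path.sent += 1
                break
        else:
            raise RuntimeError("all paths down")

paths = [Path("spine-a"), Path("spine-b"), Path("spine-c"), Path("spine-d")]
paths[2].healthy = False                 # simulate a failed route
spray(range(1_000), paths)
for p in paths:
    print(f"{p.name}: {p.sent} packets")  # traffic rebalances onto the survivors
```

The point of the sketch is the failure mode it avoids: a single-path sender has nowhere to go when its one route degrades, while a multipath sender keeps the job moving on whatever capacity remains.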

What multipath networking actually changes

The shift here is not just faster links. It is a different philosophy about where reliability lives in a cluster. Traditional networking design tries to make individual paths reliable. MRC assumes paths will sometimes fail or degrade and builds resilience into the protocol instead. That mirrors how distributed software systems moved from "reliable machines" to "fault-tolerant architectures" — the same logic applied to the physical layer underneath the models.
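
To see that philosophy in miniature, consider receiver-side reassembly, sketched below under invented assumptions (this is not the MRC design): packets carry sequence numbers, arrive out of order from whichever paths delivered them, and the receiver restores order itself, so no individual path has to behave perfectly for the transport to be correct:

```python
import heapq

class Reassembler:
    """Hypothetical receiver for a multipath transport.

    Packets may arrive out of order because they traveled different
    paths; the receiver delivers them in sequence regardless, so
    reliability lives in the protocol, not in any single path.
    """
    def __init__(self):
        self.next_seq = 0
        self.pending = []            # min-heap of (seq, payload)

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        delivered = []
        while self.pending and self.pending[0][0] == self.next_seq:
            _, data = heapq.heappop(self.pending)
            delivered.append(data)
            self.next_seq += 1
        return delivered             # in-order data ready for the application

r = Reassembler()
# Arrival order scrambled by multipath routing: 0, 2, 1, 3
print(r.receive(0, "a"))   # ['a']
print(r.receive(2, "c"))   # []  (hole at seq 1, held back)
print(r.receive(1, "b"))   # ['b', 'c']
print(r.receive(3, "d"))   # ['d']
```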

For Ethernet-based AI networking specifically, this matters. Ethernet has been competing against specialized interconnects like InfiniBand for AI cluster workloads. MRC is evidence that Ethernet-based networks can be engineered toward the reliability and performance that large training runs require, without depending on proprietary hardware. The Dell'Oro Group noted that MRC reinforces Ethernet's expanding role in AI back-end networks — a meaningful signal given how much infrastructure investment is flowing into this space.

Open specification through OCP

Publishing MRC through the Open Compute Project is a deliberate choice. OpenAI could have kept the protocol internal or licensed it privately. Instead, the specification is available to anyone building AI infrastructure. That benefits OpenAI — broader adoption means more ecosystem testing, more vendor support, and more pressure on partners to optimize for MRC — but it also benefits the wider industry.

Open infrastructure standards have historically mattered most when systems become too large and complex for single-vendor stacks to manage alone. AI clusters are reaching that point. A training run that spans hundreds of thousands of GPUs across multiple vendors, sites, and hardware generations is not a problem any one company can solve in isolation. Open protocols give different vendors a shared target to optimize for.

Infrastructure is now a competitive layer

The deeper implication of MRC is that AI progress has a new constraint surface. For years, the dominant variables were model architecture and training data. Then scale — more chips, more compute — became the differentiator. Now the invisible infrastructure around the chips is becoming a competitive layer too: networking, power, routing, recovery, cooling, and orchestration.

Better networks mean fewer failed training runs, lower cost per training cycle, and faster iteration between model versions. That compounds. A lab that can run more experiments per month, with fewer wasted runs, will produce better models faster — not because their architecture is smarter, but because their infrastructure is more reliable.

This is why OpenAI is publishing a networking protocol at all. It is not a research paper about model design. It is a systems engineering artifact — a sign that winning frontier AI requires owning the full stack down to the routing layer.

The SunMarc takeaway

For builders and product teams, MRC is a useful reminder that "AI capability" is not a single variable. It is an output of many interdependent systems: model quality, training infrastructure, inference efficiency, and increasingly, the reliability of the networking underneath all of it.

The practical implication is not that every builder needs to care about supercomputer networking protocols. It is that the AI tools reaching users — the APIs, the models, the inference services — are getting faster, cheaper, and more capable partly because the infrastructure running them is being engineered at every layer. When OpenAI ships a better model, some of that improvement comes from the network that made a longer training run feasible.

For anyone building on top of AI APIs, understanding that infrastructure reliability is part of the capability curve helps set realistic expectations about where the next improvements will come from — and why frontier labs continue to invest heavily in systems that most users will never see.
