This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises and AI clouds, failure-driven restarts are inevitable, making this a major barrier to scaling AI’s impact.
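
As a rough illustration of what those figures mean day to day, the sketch below converts a mean time to failure into an expected number of interruptions per day. The two MTTF values are the Meta FAIR figures quoted above; the function name and the assumption of a steady average failure rate are ours, not part of the cited research.

```python
HOURS_PER_DAY = 24.0


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average interruptions per day, assuming a steady failure rate of 1/MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF figures quoted in the text (hours), keyed by cluster size in GPUs.
quoted_mttf = {1_024: 7.9, 16_384: 1.8}

for gpus, mttf in quoted_mttf.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: ~{rate:.1f} failures/day")
```

Under that assumption, the quoted figures work out to roughly 3 interruptions per day at 1,024 GPUs and roughly 13 per day at 16,384 GPUs.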

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three migration scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
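
The three scenarios above can be pictured as a simple mapping from a triggering event to a migration mode. The sketch below is purely illustrative: the event names, the `MigrationMode` enum, and the `classify` function are hypothetical and do not represent the TorchPass API or its detection logic.

```python
from enum import Enum


class MigrationMode(Enum):
    UNPLANNED = "unplanned"      # sudden hard failure, state rebuilt from healthy replicas
    PREEMPTIVE = "pre-emptive"   # early warning signal, controlled move before a hard failure
    PLANNED = "planned"          # operator-initiated, no fault involved


# Hypothetical event names, grouped by the scenarios described in the text.
HARD_FAILURES = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "rebalancing"}


def classify(event: str) -> MigrationMode:
    """Map a triggering event to the migration scenario it falls under."""
    if event in HARD_FAILURES:
        return MigrationMode.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationMode.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationMode.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


for event in ("gpu_fault", "ecc_memory_error", "patching"):
    print(f"{event}: {classify(event).value} migration")
```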

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
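
The arithmetic behind that claim can be checked in a few lines. The sketch below uses only the two figures quoted above and treats "under ten minutes" as an upper bound of ten minutes; the function name is ours.

```python
MINUTES_PER_DAY = 24 * 60


def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage reduction in lost time going from before_min to after_min."""
    return 100.0 * (before_min - after_min) / before_min


lost_before = 3 * 60  # "approximately three hours per day", in minutes
lost_after = 10       # "under ten minutes", taken here as the upper bound

print(f"reduction in lost time: {reduction_pct(lost_before, lost_after):.1f}%")
print(f"share of day lost before: {100 * lost_before / MINUTES_PER_DAY:.1f}%")
print(f"share of day lost after:  {100 * lost_after / MINUTES_PER_DAY:.1f}%")
```

At the ten-minute upper bound the reduction is about 94.4%, and any figure under ten minutes pushes it toward the quoted 95%. In terms of wall-clock time, lost progress falls from 12.5% of each day to under 1%.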

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

“When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

“Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
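
For readers who want to sanity-check a savings figure of this kind against their own costs, a minimal calculator is sketched below. The release does not state the exact number of GPU-hours recovered or the price per GPU-hour, so both inputs here are placeholders chosen only so that their product matches the $6 million headline; substitute real values before drawing conclusions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def annual_gpu_hours(num_gpus: int) -> int:
    """Total GPU-hours a cluster offers in a year if run continuously."""
    return num_gpus * HOURS_PER_YEAR


def savings_usd(recovered_gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Value of GPU-hours that would otherwise be lost to restarts."""
    return recovered_gpu_hours * usd_per_gpu_hour


capacity = annual_gpu_hours(2_048)  # 17,940,480 GPU-hours per year

# Placeholder inputs, NOT taken from the release.
recovered = 500_000   # hypothetical GPU-hours recovered per year
rate = 12.0           # hypothetical all-in cost per GPU-hour, in USD

print(f"annual capacity:   {capacity:,} GPU-hours")
print(f"recovered share:   {100 * recovered / capacity:.1f}% of capacity")
print(f"estimated savings: ${savings_usd(recovered, rate):,.0f}")
```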

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

