On a Saturday night in mid-September, a senior Google engineer shared some rough news with more than 50 colleagues. Part of the company’s cloud services offering was failing Anthropic, a darling AI startup and key strategic customer, and they’d have to work overtime to fix it.
To repair the faulty part of its service — an underperforming and unstable NVIDIA H100 cluster — Google Cloud leadership initiated a seven-day-per-week sprint for the next month. The downside of not making it work, the senior engineer said, was “too large, for Anthropic — most importantly, for Google Cloud, and for Google,” according to documents reviewed by Big Technology.