At the Google Cloud Next conference, the company joined forces with NVIDIA to unveil a sweeping hardware and software roadmap. The goal? To dramatically cut the cost of running AI models at scale, especially for inference. This is the moment where the rubber meets the road for generative AI, and both companies are betting big on a tightly integrated stack.
The centerpiece is the new A5X bare-metal instance, powered by NVIDIA’s Vera Rubin NVL72 rack-scale systems. This isn’t just a faster GPU; it is a rethinking of how data moves. Through careful co-design, the architecture promises a tenfold reduction in inference cost per token compared with earlier generations, along with a tenfold increase in token throughput per megawatt. For any team paying the cloud bill for a large language model, those numbers sound like music.
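To make the headline claim concrete, here is a minimal back-of-the-envelope sketch in Python. The baseline price and monthly token volume are entirely hypothetical; the announcement gives only the ratio (ten times cheaper per token), not absolute prices.

```python
# Hypothetical figures for illustration only; the announcement states
# a 10x per-token cost reduction, not absolute prices.
baseline_cost_per_million_tokens = 2.00  # USD, assumed baseline
improvement_factor = 10                  # claimed reduction

new_cost_per_million_tokens = baseline_cost_per_million_tokens / improvement_factor

monthly_tokens = 5_000_000_000  # 5B tokens/month, assumed workload
old_bill = monthly_tokens / 1_000_000 * baseline_cost_per_million_tokens
new_bill = monthly_tokens / 1_000_000 * new_cost_per_million_tokens

print(f"old: ${old_bill:,.0f}/mo  new: ${new_bill:,.0f}/mo")
```

Under those assumed numbers, a $10,000 monthly inference bill drops to $1,000; the absolute savings scale linearly with the workload.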
Bandwidth at Hyperscale: The ConnectX-9 and Virgo Pairing
Connecting thousands of processors is a nightmare without massive bandwidth; in a distributed system, communication stalls can kill performance. To solve this, the A5X instances pair NVIDIA ConnectX-9 SuperNICs with Google’s internal Virgo networking technology. The combination lets a single cluster scale to 80,000 Rubin GPUs at one site, and for truly massive projects, a multi-site deployment can reach 960,000 GPUs.
At that scale, workload management becomes the real challenge. Routing data across nearly a million parallel processors requires exact synchronization. Any idle compute time is wasted money, so the software layer has to be just as smart as the silicon. Mark Lohmeyer, VP and GM of AI and Computing Infrastructure at Google Cloud, framed it as a matter of customer empowerment. “At Google Cloud, we believe the next decade of AI will be shaped by customers’ ability to run their most demanding workloads on a truly integrated, AI-optimised infrastructure stack,” he said. “By combining Google Cloud’s scalable infrastructure and managed AI services with NVIDIA’s industry-leading platforms, systems and software, we’re giving customers flexibility to train, tune, and serve everything from frontier and open models to agentic and physical AI workloads while optimising for performance, cost, and sustainability.”
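A quick arithmetic sketch, using the scale figures above plus an assumed per-GPU-hour price, shows why even a sliver of idle time is expensive at this scale:

```python
# Scale limits stated in the announcement.
gpus_per_site = 80_000
multisite_gpus = 960_000
sites_at_full_scale = multisite_gpus // gpus_per_site  # 12 sites

# Hypothetical price for illustration; real rates vary by instance type.
usd_per_gpu_hour = 3.00
idle_fraction = 0.01  # just 1% of fleet time spent idle

hourly_waste = multisite_gpus * usd_per_gpu_hour * idle_fraction
print(f"{sites_at_full_scale} sites; 1% idle burns ${hourly_waste:,.0f}/hour")
```

At the assumed $3/GPU-hour, one percent idle time across the full fleet wastes $28,800 every hour, which is why the scheduling software matters as much as the silicon.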
Tackling Sovereign Data Governance and Cloud Security
Raw compute power is only part of the equation. Data governance remains a massive hurdle for enterprise adoption. Regulated sectors like finance and healthcare often stall machine learning initiatives because of data sovereignty requirements and the risk of exposing proprietary information. No one wants their secret sauce leaked to the cloud.
To address this, Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs are entering preview on Google Distributed Cloud. This deployment model keeps frontier models entirely within a customer’s controlled environment, right next to their most sensitive data stores. The architecture uses NVIDIA Confidential Computing, hardware-based security that keeps data encrypted even while in use, covering prompts and fine-tuning data alike. Even the cloud provider cannot view or alter the underlying data. For public cloud environments, a preview of Confidential G4 VMs with NVIDIA RTX PRO 6000 Blackwell GPUs brings these same protections to multi-tenant setups. This is the first cloud-based confidential computing offering for NVIDIA Blackwell GPUs, giving regulated industries a path to high-performance hardware without violating privacy standards.
Managing the Operational Overhead of Agentic AI
Building multi-step agentic systems is a heavy engineering lift. It involves connecting large language models to complex APIs, maintaining vector database sync, and mitigating hallucinations during execution. To streamline this, NVIDIA Nemotron 3 Super is now available on the Gemini Enterprise Agent Platform. This gives developers tools to customize and deploy reasoning models specifically for agentic tasks.
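As a rough illustration of what "agentic" means in practice, here is a minimal tool-dispatch loop in Python. It is a generic sketch, not the Gemini Enterprise or Nemotron API: the model (stubbed here) emits tool calls, the runtime routes them to functions, and the loop continues until the model returns a final answer.

```python
from typing import Callable

# Registry of tools the agent may call (both functions are stand-ins).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_invoice": lambda arg: f"invoice {arg}: $120.00",
    "send_email": lambda arg: f"email sent to {arg}",
}

def stub_model(history: list[str]) -> dict:
    """Stand-in for a reasoning model; a real agent would call an LLM here."""
    if not any("invoice" in h for h in history):
        return {"tool": "lookup_invoice", "arg": "INV-7"}
    return {"final": "Invoice INV-7 totals $120.00."}

def run_agent(max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        step = stub_model(history)
        if "final" in step:                        # model decided it is done
            return step["final"]
        result = TOOLS[step["tool"]](step["arg"])  # execute the tool call
        history.append(result)                     # feed the observation back
    raise RuntimeError("agent exceeded step budget")

print(run_agent())
```

The engineering lift the article describes lives in the parts stubbed out here: reliable tool schemas, vector-store retrieval feeding the history, and guardrails against hallucinated tool calls.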
Training these models at scale also introduces heavy operational overhead. Managing cluster sizing and handling hardware failures during long reinforcement learning cycles is a slog. Google Cloud and NVIDIA introduced Managed Training Clusters on the platform, which includes a managed reinforcement learning API built with NVIDIA NeMo RL. This system automates cluster sizing, failure recovery, and job execution. Data science teams can focus on model quality instead of low-level infrastructure management. CrowdStrike is already using NVIDIA NeMo open libraries to generate synthetic data and fine-tune models for cybersecurity, running on Managed Training Clusters with Blackwell GPUs to accelerate threat detection.
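The automation described above can be pictured as a checkpoint-and-retry loop. The following is a generic sketch of that pattern with simulated faults, not the Managed Training Clusters or NeMo RL API:

```python
# Steps where a (simulated) hardware fault hits exactly once each.
FAULT_STEPS = {13, 37}

def run_job(total_steps: int) -> int:
    """Run a training job, resuming from the last checkpoint on failure.

    Returns the number of automatic recoveries that were needed.
    """
    seen_faults: set[int] = set()
    checkpoint = 0   # last completed step; would live in persistent storage
    recoveries = 0
    while checkpoint < total_steps:
        try:
            for step in range(checkpoint, total_steps):
                if step in FAULT_STEPS and step not in seen_faults:
                    seen_faults.add(step)
                    raise RuntimeError(f"simulated fault at step {step}")
                checkpoint = step + 1   # commit progress after each step
        except RuntimeError:
            recoveries += 1   # the managed layer restarts from the checkpoint
    return recoveries

print(f"job finished after {run_job(50)} automatic restarts")
```

Because progress is committed per step, each fault costs only the in-flight step rather than the whole run, which is exactly the overhead a managed training service is meant to absorb.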
Bridging Legacy Architecture and Physical Simulations
Integrating machine learning into heavy industry and manufacturing presents a different kind of challenge. Connecting digital models to physical factory floors requires high-fidelity simulation, massive compute power, and standardization across legacy data formats. NVIDIA’s AI infrastructure and physical AI libraries are now generally available on Google Cloud, providing the foundation for organizations to simulate and automate real-world workflows.
Major industrial software providers like Cadence and Siemens have made their solutions available on Google Cloud, accelerated by NVIDIA infrastructure. These tools power the engineering of heavy machinery, aerospace platforms, and autonomous vehicles. But many manufacturing firms still run on decades-old product lifecycle management systems. Translating geometry and physics data from those systems is a pain. By using NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework via the Google Cloud Marketplace, developers can bypass some of those translation headaches. They can construct physically accurate digital twins and train robotics simulation pipelines before deploying anything on a physical factory floor. This is where AI meets gravity, and the results could reshape how we build everything from jet engines to assembly lines.
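The heart of any digital twin is a loop that steps simulated physical state forward in time and validates it against known physics. Real pipelines do this through Omniverse and Isaac Sim; the toy sketch below shows the core idea with a falling part, checked against the analytic free-fall time:

```python
# A toy stand-in for a physics stepping loop, not the Isaac Sim API:
# integrate simulated state forward in time, then validate against
# the closed-form answer.
G = 9.81    # gravitational acceleration, m/s^2
DT = 0.001  # integration timestep, s

def simulate_drop(height_m: float) -> float:
    """Return the simulated time for an object to fall from height_m."""
    y, v, t = height_m, 0.0, 0.0
    while y > 0.0:
        v += G * DT   # semi-implicit Euler: update velocity first,
        y -= v * DT   # then position, for better stability
        t += DT
    return t

t_sim = simulate_drop(2.0)
t_analytic = (2 * 2.0 / G) ** 0.5   # t = sqrt(2h/g)
print(f"simulated {t_sim:.3f}s vs analytic {t_analytic:.3f}s")
```

The same validate-against-ground-truth discipline, scaled up to full CAD geometry and contact dynamics, is what makes a digital twin trustworthy enough to train robots against before anything touches the factory floor.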