[Photo: scientist Vera Rubin]

Nvidia GTC 2025 Roundup: What’s Most Significant?

We’ll meet again

Last week Nvidia held its annual GTC developer conference, where it previewed upcoming hardware products. Below, we analyze the most significant disclosures and their implications.

Vera Arm-Compatible Processor

Vera succeeds Grace, Nvidia’s Arm-based general-purpose processor. Whereas Grace employs Arm’s Neoverse V2 cores (similar to the Cortex-X3 CPU), Vera will use Nvidia’s homegrown Olympus CPU. It’s the company’s first custom CPU since the Project Denver cores in its Tegra SoCs. (Denver originated with a design from Stexar, a company Nvidia acquired; several ex-Stexar engineers remain at Nvidia.) From the Arm vs. Qualcomm trial, we knew the company still held an active architecture license.

Nvidia likely developed its own CPU to achieve greater performance than it could obtain from Arm’s designs, particularly within Vera’s claimed 50 W power budget. The newest, fastest Arm CPUs deliver about 30% greater instruction throughput (IPC) than the Neoverse V2, an increase Olympus could plausibly match. Factoring in Vera’s additional cores would raise per-chip performance to about 60% above Grace. Nvidia, however, claims Vera delivers a 2× speedup.
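The arithmetic behind our 60% estimate is simple; the sketch below spells it out. The core counts are our assumptions based on widely reported figures (72 Neoverse V2 cores for Grace, 88 Olympus cores for Vera), not numbers from this announcement.

```python
# Back-of-the-envelope estimate of Vera's per-chip gain over Grace.
# Core counts are assumptions from widely reported figures, not confirmed specs.
grace_cores = 72    # assumed: Grace's Neoverse V2 core count
vera_cores = 88     # assumed: Vera's Olympus core count
ipc_gain = 1.30     # ~30% IPC uplift of the newest Arm cores over Neoverse V2

per_chip_gain = ipc_gain * (vera_cores / grace_cores)
print(f"Estimated gain over Grace: {per_chip_gain:.2f}x")  # ~1.6x, i.e., ~60%
```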

Given the execution width of Arm’s high-end cores, Vera’s gains could come from a faster clock and simultaneous multithreading (SMT), which Arm has eschewed in its high-end CPUs. SMT’s benefits are nonobvious; the technique’s gains vary by workload, CPU design, and system. We assume Nvidia’s performance analysis indicates SMT would benefit AI workloads, of which the company has intimate knowledge. Assuming SMT raises per-CPU throughput by 20%, a small clock-rate boost is all that’s needed for Vera to double Grace’s performance.
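Extending the earlier sketch shows just how small that boost would need to be, again treating the 20% SMT uplift as an assumption rather than a disclosed figure:

```python
# Continuing the estimate: how much clock-rate headroom closes the gap to 2x?
base_gain = 1.59    # IPC x core-count estimate from the sketch above
smt_gain = 1.20     # assumed 20% throughput uplift from SMT

clock_boost = 2.0 / (base_gain * smt_gain)
print(f"Clock boost needed for 2x: {(clock_boost - 1) * 100:.0f}%")  # ~5%
```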

Conventional wisdom is that HPC programmers disable SMT when it is available. An issue is that SMT increases throughput variance. Because HPC workloads employ many parallel threads and the slowest thread gates overall forward progress, increased variance can slow the whole system. On the other hand, SMT is good at hiding latencies external to the CPU. For example, when blocked on a data transfer (a common occurrence in AI processing), an SMT-enabled CPU can allocate its resources to another thread.
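A toy simulation (ours, not Nvidia’s analysis) illustrates why variance matters at a barrier: even when SMT makes the average thread faster, the slowest of many threads can finish later. All numbers below are illustrative assumptions.

```python
import random

# Toy model: N threads run to a barrier; the step completes when the
# slowest thread does. Assumed effect of SMT: mean per-thread time drops
# 10%, but its spread widens 5x.
random.seed(0)
N, TRIALS = 1024, 200

def barrier_time(mean, stdev):
    """Average completion time of the slowest of N threads."""
    return sum(max(random.gauss(mean, stdev) for _ in range(N))
               for _ in range(TRIALS)) / TRIALS

print("SMT off:", round(barrier_time(mean=1.00, stdev=0.02), 3))  # ~1.06
print("SMT on :", round(barrier_time(mean=0.90, stdev=0.10), 3))  # ~1.22, slower
```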

We expect the same CPU to appear in the PC processors Nvidia is developing. Were it not for this second use, Nvidia might have been tempted to adopt the less mature RISC-V architecture. Nonetheless, Olympus is a small strike against Arm: the company has few customers for its highest-performing designs (the Neoverse V and Cortex-X cores) among which to amortize its development costs.

Copackaged Optics

Nvidia has vaulted to the forefront of copackaged optics (CPO), disclosing its roadmap for CPO-based InfiniBand and Ethernet switches. Due in 2H25, Nvidia’s first CPO product will be a radix-144 InfiniBand switch delivering 28.8 Tbps peak throughput. In 2H26, the company plans to offer a 512-port Ethernet switch.

The technology connects fiber directly to chips instead of terminating fiber at a system’s faceplate and running copper from there to the chip. Broadcom has been furthest along with CPO, demonstrating multiple CPO-enabled Tomahawk switches and shipping the Tomahawk 5 Bailly design. Nvidia has the advantage of vertical integration: engineering both systems and integrated circuits enables it to move rapidly from development to production. It has also been aided by its collaboration with TSMC.

Power is the problem Nvidia seeks to address with CPO, citing the megawatts avoided by employing the technology. Data throughput will also become a limiter. Nvidia’s first silicon photonics engine operates at 1.6 Tbps. Although the industry has a roadmap to faster rates with copper, copper’s gains will plateau within a few generations. While the first CPO chips are switches, we expect Nvidia to apply the technology to its GPUs.

Data-Center GPU Roadmap

Nvidia also discussed its upcoming GPUs: Blackwell Ultra, Rubin, Rubin Ultra, and Feynman. A message to competitors is that Nvidia has a deep new-chip pipeline. For example, while Intel talks about Jaguar Shores, its next attempt to enter the market, Nvidia has four chips queued up. Even if Jaguar surpasses Rubin on some metrics, Rubin Ultra is around the corner.

Due by the end of this year, Blackwell Ultra revamps the Nvidia Blackwell GPU that propelled the company’s Q4 revenue. The new GPU does more than increase HBM capacity by 50%. It raises 4-bit floating-point (FP4) throughput by 50%. New instructions and greater softmax throughput double the speed of attention operations at the heart of major AI models. Because the original Blackwell was reticle limited, these performance gains come at the expense of other capabilities. Blackwell Ultra sacrifices INT8 and FP64 function units to make room for the added FP4 and attention hardware. Nvidia expects FP4 to replace INT8 for inference. But HPC customers have no alternative to FP64, rendering Ultra unsuitable for them.

Slated to come late next year, Rubin doubles Blackwell Ultra’s FP4 and FP8 throughput. It includes the same HBM capacity but upgrades it from HBM3e to the faster HBM4. Interconnect throughput also doubles. Like Blackwell, Rubin comprises eight HBM sites and two reticle-size compute dice. Nvidia will pair it with the Vera processor, much like it has mated Hopper and Blackwell GPUs to Grace. The company will feature 72 Vera Rubin modules in a chassis based on the NVL72 Grace Blackwell rack-scale system.

Changing nomenclature to reflect the number of computing dice, Nvidia designates the new system Vera Rubin NVL144. The change’s motivation becomes clear with Rubin Ultra, which integrates four computing dice per socket along with 16 HBM sites. Targeted for 2H27, Rubin Ultra promises greater performance and faster interfaces. The company featured it in a new rack-scale system (NVL576 Kyber) that doubles the number of GPU packages to 144 and requires a shocking 600 kW. The design has vertical blades and a communications midplane like a core router.
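As we read it, the naming arithmetic works out as follows, using the socket and die counts described above:

```python
# NVL names now count compute dice, not GPU packages.
systems = {
    "Vera Rubin NVL144": (72, 2),    # 72 packages x 2 dice each
    "Rubin Ultra NVL576": (144, 4),  # 144 packages x 4 dice each
}
for name, (packages, dice_per_package) in systems.items():
    print(f"{name}: {packages * dice_per_package} dice")
```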

The shift toward marketing data-center AI processing as a system plays to Nvidia’s advantage because the company offers both AI accelerators (GPUs) and networking technologies, owing to its Mellanox acquisition. Intel attempted a system-level approach for servers, acquiring switch-chip companies such as Fulcrum and Barefoot, but ultimately had little impact. The company says it will take a system-level approach with Jaguar, but the required investment may be more than it can sustain. AMD acquired ZT Systems for its systems expertise, but it’s too early for that deal to show benefits.

Spark/Station

At CES, Nvidia showed Digits, a small-form-factor workstation. Now called Spark, it’s offered alongside a larger system named Station, which is based on a full-size PC-like motherboard and features a GB300 (a Grace CPU paired with a Blackwell Ultra GPU). During his keynote, CEO Jensen Huang mentioned that Station also includes a PCIe slot for a graphics card, highlighting AI-targeted GPUs’ woeful graphics capabilities.

At the conference, Asus showed off its Spark implementation, which lists for $3,000, $1,000 less than its Nvidia-branded counterpart. By ensuring that developers and data scientists have low-cost access to its technology, Nvidia is helping to keep these people on Team Green, even as alternatives emerge.

Other

  • Autonomous vehicles are a major AI application and, therefore, a natural Nvidia market. The company announced one partnership with GM and another with Nexar. Nexar’s dashcams have produced one of the biggest real-world driving data sets, which Nvidia can use to train its models. Nvidia recognizes its success depends on delivering hardware, software, models, and data, and it’s well ahead of its semiconductor competitors.
  • Personal robots are the next big AI application. In a partnership with DeepMind (the makers of the Go-winning AI model) and Disney, the company helped make an anthropomorphic (but not creepily so) robot. Nvidia also released the Groot N1 model for human-like skills. Here, too, Nvidia’s strategy is integration, raising the barrier for competitors while jump-starting customers’ projects.
  • Development tools and models also starred at GTC. As the cost of implementing AI functions decreases, AI hardware will be less concentrated in data centers and more distributed among enterprises. The software and models Nvidia is developing will help customers use Nvidia hardware and enable the applications that will propel further company growth.
  • Blackwell RTX Pro 6000 is the professional version of the company’s GeForce RTX 5090 gaming GPU. Targeting applications such as engineering and design, it enables more on-chip computational units and adds memory.
  • Quantum computing came up. Let us at XPU.pub know if that’s interesting to you. Otherwise, we’ll studiously ignore this technology, leaving it for future grandchildren to examine.
