Nvidia SM core

Researchers Reveal GPU Architecture’s Warped World

She’s filled with secrets

For those accustomed to microprocessors, GPUs are weird. Among the oddities is GPUs’ opaque architecture. Not only do vendors withhold microarchitecture details, they even withhold the instruction-set architecture (ISA). Nvidia, for example, expects developers to program using its high-level Cuda code. The adventuresome (like DeepSeek) can use PTX, which is like assembly language, but Sass is Nvidia GPUs’ actual assembly language—and the company doesn’t fully document it.
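For readers who want to peek at each layer, Nvidia's own toolchain exposes all three. A hedged sketch (`kernel.cu` is a placeholder file name, and output details vary by toolkit version):

```shell
# Compile Cuda C++ to PTX, the documented virtual ISA
nvcc -ptx kernel.cu -o kernel.ptx

# Compile to a cubin holding real machine code for one architecture
# (sm_86 is the Ampere target for the RTX A6000)
nvcc -cubin -arch=sm_86 kernel.cu -o kernel.cubin

# Disassemble the cubin into Sass, the undocumented native assembly
cuobjdump -sass kernel.cubin
```

Because PTX is virtual, the driver can recompile it for future chips; Sass, by contrast, is tied to one hardware generation.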

To reveal how Nvidia GPUs work, a Universitat Politècnica de Catalunya group has reverse-engineered the Nvidia RTX A6000, an Ampere-generation GPU. (Ampere came two generations before Blackwell.) They crafted microbenchmarks from hand-written Sass code to unveil microarchitecture details. Their findings reveal techniques that could be employed in CPUs but aren't, and they should help GPU programmers write better software.

The researchers discovered that GPUs take to heart the old saw that RISC stands for "relegate the interesting stuff to the compiler," a view that successful CPUs never embraced. For example, the Ampere architecture lets the compiler control register dependencies and manage a register-file cache. Hardware assists: the compiler sets control bits in the instruction stream that load stall and dependence counters, which count down as conflicts resolve; when a counter reaches zero, execution may proceed. A reuse bit associated with a source operand signals the hardware to cache that register, obviating a subsequent full-fat register-file read. (A register file so big that it must be cached is another concept foreign to CPU folks.)
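As a hedged illustration of the idea (the instruction sequence is invented, and Nvidia doesn't document the exact control-field encoding), a compiler-scheduled Sass fragment might look like the following, with the control information spelled out as comments:

```
LDG.E R6, [R8] ;        // long-latency load; compiler assigns it write barrier 0
FMUL  R4, R2.reuse, R0 ; // reuse bit on R2: hardware holds it in the register-file cache
FMUL  R5, R2, R1 ;       // second consumer of R2 reads the cached copy, not the register file
FADD  R10, R6, R4 ;      // wait mask includes barrier 0: counter must hit zero before issue
```

The `.reuse` suffix does appear in real `cuobjdump` disassembly; the barrier assignments above are our guesses at how a compiler might schedule this sequence.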

The university team discerned that the GPU likely has a simple stream buffer to prefetch instructions. On the data side, they measured load/store-queue sizes and memory bandwidth, and they discovered cases where shared memory structures can become a bottleneck. They also determined which registers lead to faster address calculation, reducing load/store latency.
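The classic tool for this kind of sizing is a pointer-chasing microbenchmark: each load's address depends on the previous load's result, so latencies can't overlap, and the per-iteration cycle count exposes the memory hierarchy. The researchers worked in hand-written Sass; the Cuda-level sketch below merely illustrates the technique, and all names in it are ours:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Chase pointers through buf; each load depends on the previous one,
// so elapsed cycles / iters approximates a single load's latency.
__global__ void chase(const unsigned *buf, int iters,
                      unsigned long long *cycles, unsigned *sink) {
    unsigned idx = 0;
    for (int i = 0; i < iters; ++i)   // warm-up pass to populate caches
        idx = buf[idx];
    unsigned long long start = clock64();
    for (int i = 0; i < iters; ++i)
        idx = buf[idx];
    *cycles = clock64() - start;
    *sink = idx;                      // keep the compiler from deleting the loads
}

int main() {
    const int n = 1 << 14, stride = 32, iters = 4096;
    unsigned *buf, *sink;
    unsigned long long *cycles;
    cudaMallocManaged(&buf, n * sizeof(unsigned));
    cudaMallocManaged(&sink, sizeof(unsigned));
    cudaMallocManaged(&cycles, sizeof(unsigned long long));
    for (int i = 0; i < n; ++i)       // fixed-stride chain wrapping around buf
        buf[i] = (i + stride) % n;
    chase<<<1, 1>>>(buf, iters, cycles, sink);  // one thread: pure latency, no overlap
    cudaDeviceSynchronize();
    printf("~%llu cycles per load\n", *cycles / iters);
    return 0;
}
```

Varying the stride and buffer size moves the chain in and out of each cache level; the steps in measured latency mark the capacity boundaries.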

The researchers applied their findings to updating a GPU simulator and improving its accuracy, which will help academia and industry alike in developing better GPU code. For those accustomed to CPUs, the effort required to document a GPU architecture and build a simulator is foreign. At the same time, CPU designers must envy their GPU counterparts, whose customers program at a higher abstraction level, freeing GPU designs from intergenerational binary compatibility.

