
DeepSeek-R1 Resets AI-Development Expectations

We’re talking away. I don’t know what I’m to say. I’ll say it anyway.

DeepSeek-R1 slipped into the world on January 20, a day before Sam Altman and friends kicked off their Stargate OPM incinerator. DeepSeek soon overshadowed Stargate, playing down its costs while Stargate did the opposite. Nvidia’s stock tumbled on the belief that DeepSeek showed that significant AI advancements required a tiny fraction of the investment previously assumed. Although $NVDA hasn’t recovered, the AI industry has been energized by DeepSeek’s accomplishments. What will their effect be on the AI chip (NPU/GPU) industry?

What is DeepSeek-R1?

DeepSeek-R1 is a so-called reasoning model, a neural-network type that gained prominence when Altman’s OpenAI previewed its o1 model. Having learned reasoning patterns, these models break a problem into steps and check their work along the way. A large language model (LLM), by contrast, has learned text patterns and sequentially generates responses. An LLM, therefore, will struggle with some simple problems that require multistep logic. Having another potential customer developing such leading-edge models is good for chip suppliers.
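The contrast can be sketched in miniature. Counting letters in a word is a classic task that pattern-matching LLMs fumble; an explicitly stepwise approach enumerates the occurrences and then verifies them, loosely mirroring how a reasoning model shows and checks its work. (Plain-Python analogy only; no model computes this way.)

```python
# Illustrative analogy: explicit stepwise counting with a self-check,
# loosely like a reasoning model's "show your work, then verify" behavior.

def count_letter_stepwise(word: str, letter: str) -> int:
    steps = [(i, ch) for i, ch in enumerate(word) if ch == letter]  # show work
    for i, _ in steps:                  # check the work, step by step
        assert word[i] == letter
    return len(steps)

print(count_letter_stepwise("strawberry", "r"))  # → 3
```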

To train a preliminary R1 version (R1-Zero), DeepSeek applied reinforcement learning (RL) without supervised fine-tuning (SFT), achieving good reasoning performance but with some drawbacks. Adding SFT stages and another RL stage built on the DeepSeek-V3 model to the training pipeline addressed the problems and led to the released R1 version. Reinforcement learning is the technique that DeepMind employed to train AlphaGo, the Go-playing program that developed its superhuman skills by playing itself instead of studying past games. In successfully applying RL, DeepSeek broke through the data wall, the impending barrier to LLM improvement (and, therefore, to AI-accelerator purchases) caused by a lack of additional training data.
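Notably, DeepSeek’s R1-Zero rewards were rule-based rather than model-based: an accuracy reward for a correct final answer plus a format reward for wrapping the reasoning in designated tags. A toy sketch of that idea (the `<think>` tags match DeepSeek’s report; the weights and the answer-extraction regex are illustrative assumptions):

```python
import re

# Toy sketch of rule-based RL rewards in the style DeepSeek describes for
# R1-Zero: accuracy (right final answer) plus format (reasoning in tags).
# The 0.5 weight and the "Answer:" extraction are illustrative assumptions.

def format_reward(completion: str) -> float:
    # Reward outputs that wrap their reasoning in <think>...</think> tags.
    return 1.0 if re.search(r"<think>.*</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    # Extract whatever follows "Answer:" and compare with the reference.
    m = re.search(r"Answer:\s*(\S+)", completion)
    return 1.0 if m and m.group(1) == gold_answer else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    return accuracy_reward(completion, gold_answer) + 0.5 * format_reward(completion)

sample = "<think>7 * 6 = 42</think> Answer: 42"
print(total_reward(sample, "42"))  # → 1.5
```

Because the rewards are checkable rules rather than a learned reward model, the RL loop needs no additional human-labeled data, which is what lets this approach sidestep the data wall.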

To develop smaller reasoning models, DeepSeek employed distillation, using R1 to fine-tune smaller LLMs such as Alibaba’s Qwen-7B and endow them with reasoning. Useful small models should reduce the energy and NPU capital required to deliver acceptable reasoning performance. If Jevons’ Paradox holds, the total quantity of NPUs demanded will more than offset the efficiency savings, growing the market. Energy use will scale with NPU deployments; the energy wall, therefore, remains a barrier to AI progress. Energy demand, however, will be distributed instead of concentrated in a few massive data centers, making it easier to service.
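The distillation recipe is conceptually simple: sample step-by-step solutions from the big model, then fine-tune the small model on those traces as ordinary supervised data. A minimal sketch of the data-preparation step, with a stand-in `teacher_generate` in place of real R1 inference:

```python
# Minimal sketch of the distillation data pipeline DeepSeek describes:
# the large reasoning model (teacher) generates reasoning traces, which
# become supervised fine-tuning records for a small student model.
# `teacher_generate` is a placeholder, not a real R1 API call.

def teacher_generate(prompt: str) -> str:
    # Stand-in for sampling a reasoning trace from R1.
    return f"<think>working through: {prompt}</think> Answer: ..."

def build_sft_dataset(prompts):
    # Each record pairs a prompt with the teacher's full trace, so the
    # student learns the reasoning style, not just final answers.
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_sft_dataset(["What is 12 * 12?", "Factor x^2 - 1."])
print(len(dataset))  # → 2
```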

Was Only $6 Million Required to Develop R1?

Triggering the $NVDA selloff was the erroneous notion that DeepSeek spent only $5.6 million to develop R1—something the company never claimed. As DeepSeek documents, this amount accounts only for the “official” training cost of the V3 (not the R1) model at an assumed rate of $2.00 per GPU-hour. It does not include the cost of obtaining the required GPUs, and the company specifically states that it excludes other research and development expenses.
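The arithmetic behind the headline figure is straightforward: DeepSeek’s V3 report cites roughly 2.788 million H800 GPU-hours of official training, priced at the assumed $2.00 rental rate.

```python
# Reproduce the widely quoted figure from DeepSeek's stated inputs.
# 2.788M H800 GPU-hours is the on-task training total DeepSeek reports for
# V3; $2.00/GPU-hour is their assumed rental rate. Neither covers hardware
# purchase or broader R&D.
GPU_HOURS = 2_788_000    # official V3 training GPU-hours (DeepSeek's figure)
RATE_PER_HOUR = 2.00     # assumed rental cost, USD per GPU-hour

official_cost = GPU_HOURS * RATE_PER_HOUR
print(f"${official_cost / 1e6:.3f} million")  # → $5.576 million
```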

Lending credence to the $5.6 million figure, Berkeley researchers have replicated R1-Zero training in a limited case for only $30 in GPU time. (That’s $30, not $30 million.) As with DeepSeek’s V3 training, that figure covers only on-task GPU hours. Nonetheless, DeepSeek’s techniques are far more economical than those of its Western rivals and will likely be widely adopted. Again, if Jevons’ Paradox holds, improved efficiency will ultimately benefit chipmakers.

It’s reasonable to believe DeepSeek has access to much more hardware than only the cluster of 2,048 H800s (nerfed Hopper H100s for China) employed in V3 training. Hardware, nonetheless, is scarce, and the company wrote low-level PTX code and developed a pipeline-parallelism algorithm called DualPipe to improve training throughput. Other companies, by contrast, write only high-level CUDA code and rely on standard frameworks and libraries.
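DualPipe’s specifics are in DeepSeek’s technical report, but the overhead it attacks is easy to quantify. In a naive GPipe-style schedule with p pipeline stages and m micro-batches, stages sit idle for a “bubble” fraction of (p − 1)/(m + p − 1) of each step; better schedules shrink or hide that idle time by overlapping computation and communication. A quick illustration of the baseline formula (the configurations below are illustrative, not DeepSeek’s):

```python
# Why pipeline-parallel scheduling matters: with p stages and m micro-batches,
# a naive GPipe-style schedule idles for (p - 1) / (m + p - 1) of step time.
# This is the baseline overhead that schedules like DualPipe attack; the
# formula describes the naive case, not DualPipe itself.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches shrink the idle fraction for a fixed pipeline depth.
for m in (8, 32, 128):
    print(f"16 stages, {m:4d} micro-batches: {bubble_fraction(16, m):.2%} idle")
```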

Viewed abstractly, DeepSeek substituted labor (expert programmers) for capital (big GPU fleets) but may have sacrificed code maintainability and development time. Much has been made of Nvidia’s “CUDA moat,” its leading software ecosystem, but even for DeepSeek, for whom this moat is less relevant, Nvidia was the preferred supplier.

A further unaccounted-for R1 cost was that of Llama and Qwen development. By distilling them, DeepSeek reaped value from these open models, but their billion-dollar development costs were borne by Meta and Alibaba. Thus, the notion that all AI development can be as economical as DeepSeek’s approach is akin to demanding a free refill without having bought the first cup. To OpenAI’s chagrin, DeepSeek may also have used OpenAI’s models to create R1.

What is DeepSeek V3?

Released in December, DeepSeek V3 is the company’s most recent LLM. Like R1, it’s an open-source model built using unusual new techniques to increase training and inference efficiency. Containing 671 billion parameters, it’s large, but as a mixture-of-experts (MoE) model like its predecessor, V2, it activates relatively few parameters (37 billion) for each token. Also like V2, it adopts multihead latent attention (MLA), which reduces some vectors’ dimensionality to save space when caching them and maps them back to high-dimensional space for calculations.
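The MLA caching trick can be sketched with plain linear algebra: down-project each token’s hidden state to a small latent, cache only the latent, and up-project to full-size keys and values when attention runs. The dimensions below are illustrative, not V3’s actual sizes:

```python
import numpy as np

# Sketch of the multihead latent attention (MLA) caching idea: cache a
# low-dimensional latent per token instead of full key/value vectors,
# and up-project on demand. Dimensions are illustrative only.

d_model, d_latent = 1024, 64
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_latent)

h = rng.standard_normal(d_model)          # hidden state for one token
latent = W_down @ h                       # cache this (64 floats) ...
k, v = W_up_k @ latent, W_up_v @ latent   # ... and rebuild K/V on demand

# Cache shrinkage vs. storing full k and v (2 * d_model floats) per token:
cache_savings = 1 - d_latent / (2 * d_model)
print(cache_savings)  # → 0.96875
```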

To further improve V3 training efficiency, DeepSeek was among the first to employ FP8 data instead of the more common FP16. This required developing a new quantization method to mitigate over- and underflows. Various other approaches further reduce memory requirements.
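The overflow problem is concrete: FP8 E4M3 tops out at a maximum value of 448, so tensor blocks must be scaled into range before casting and rescaled afterward (DeepSeek uses fine-grained block/tile-wise scaling). A toy simulation of the scale-and-clip step, without real FP8 encoding:

```python
import numpy as np

# Toy illustration of why FP8 needs careful scaling: E4M3 FP8 has a max
# value of 448, so each block is scaled into range, clipped, and later
# rescaled with its stored scale. This simulates only the scale/clip step,
# not actual FP8 bit encoding (which would add rounding error).

FP8_E4M3_MAX = 448.0

def quantize_block(x: np.ndarray):
    scale = np.abs(x).max() / FP8_E4M3_MAX    # fit the block into FP8 range
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX).astype(np.float32)
    return q, scale                           # quantized values + block scale

def dequantize_block(q: np.ndarray, scale: float):
    return q * scale

x = np.array([1e-3, 5.0, -1200.0, 3e4])       # would overflow FP8 unscaled
q, s = quantize_block(x)
assert np.abs(q).max() <= FP8_E4M3_MAX        # everything now in range
x_back = dequantize_block(q, s)
print(np.allclose(x, x_back))  # → True
```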

Bottom Line

Already blazing, interest in AI flared further with the DeepSeek-R1 release and its relatively low hardware requirements. Researchers, hobbyists, and companies are experimenting with it, running it on local machines including AI PCs and the Raspberry Pi. In so doing, they indicate that Jevons’ Paradox will hold and that efficiency will drive adoption. AI development won’t be constrained to a single company with unprecedented resources but will be shared among innovators. Because DeepSeek’s models are open and its approaches documented, other modelmakers will build on the company’s innovations, further advancing AI technology—a trend that will benefit NPU/GPU suppliers.

