
Reliability Plagues Intel as Raptor Defects Emerge

It’s all about the Pentiums.

While much of the world was reacting to CrowdStrike crashing systems worldwide, PC enthusiasts were coping with failing Intel processors. Alderon Games reports a nearly 100% failure rate for Raptor Lake (13th and 14th Generation Core) desktop processors and states that the problems extend to contemporaneous server and mobile (laptop) processors. Others, such as Epic Games, have seen similar problems.

Intel admits to instability issues with its desktop Raptors and claims to have found the root cause, which it will address with a microcode patch. If the patch isn’t effective or the problem indeed extends beyond desktop models, Intel will suffer reputational loss. Even if it works, the company will face warranty claims for potentially millions of chips that failed before the fix could be applied.

Gaming Accelerates Processor Aging, Revealing Defects

Advanced digital semiconductors like Intel processors are complex, not just in their logic but also in their physical design, and they are at the frontier of material science. Moreover, Intel has compensated for lagging technology by extending its processors’ power limits so that they can raise voltage and increase clock rates to boost performance. These factors and the accompanying higher temperatures will accelerate aging, revealing defects sooner.

Running their machines under higher loads for longer than other groups, avid gamers will be the first to experience aging-induced reliability problems. Because so much gameplay has an online component, companies like Alderon can collect data on thousands—if not millions—of machines, generating hard-to-refute statistical evidence that a problem exists.
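To illustrate why a large telemetry sample is hard to refute, here is a minimal sketch that computes a Wilson score confidence interval for an observed failure rate. The counts are hypothetical and chosen only for illustration; they are not Alderon's data, and the function name is my own. The point is simply that the same observed rate is far more tightly bounded when it comes from thousands of machines than from a handful.

```python
import math

def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts for illustration only (not real telemetry).
small = wilson_interval(5, 10)          # a handful of machines
large = wilson_interval(5000, 10000)    # a fleet-sized sample

print("10 machines:     %.3f to %.3f" % small)
print("10,000 machines: %.3f to %.3f" % large)
```

Both samples show the same observed failure rate, but the interval for the larger sample is much narrower, which is what makes fleet-wide crash data persuasive.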

Identifying Reliability Problems Is Difficult, but Potentially Unnecessary

Finding the root cause is much harder. Many things can go wrong. For example, a wire inside a chip could be so thin that repeated switching causes it to burn out like a fuse, or a via could delaminate from its sidewalls as heat stresses the chip. The billions of design elements comprising a processor are billions of potential failure points.

However, finding the exact failure may not be required if the behavior leading to it can be avoided. Intel’s microcode patch updates voltage-control algorithms, which could suffice to address Raptor’s problem, even if it doesn’t truly tackle the root cause. (By microcode, I assume Intel means firmware. Raptor probably has a small system-management CPU that monitors the chip and dynamically adjusts voltage and other parameters. I can’t believe the state machine that expands complex x86 instructions manages the system in its spare time.)

An Earlier Intel Defect, FDIV, Provides Lessons

Handled well, a crisis will subside. Alternatively, the handling itself becomes a second, bigger crisis. Intel learned this lesson firsthand 30 years ago when the FDIV bug plagued Pentiums. After weeks of sullying its reputation by downplaying the FDIV problem, the company turned the situation around by replacing defective chips on request. In so doing, it showed that it stood behind its products and the Intel Inside campaign meant something, turning a public-relations fiasco into a success. Raptor’s reliability crisis is still nascent, giving Intel the opportunity to go directly to success, bypassing the fiasco stage and the risk that the public in 2024 won’t be as forgiving as it was in 1995.

