Computers don’t actually run a dozen programs at once but instead rapidly switch among them to create a simulacrum of simultaneity. The GPUs and NPUs executing AI models, by contrast, present no such illusion, running only a single model at a time. Startup InferX is betting that this limitation will become a problem as AI hosts such as Google and OpenAI offer customers multiple models. The problem will worsen as agents chain multiple models to complete a task. Dedicating a GPU/NPU to a single model provides availability but idles hardware when a workload calls on a different model.
InferX (not to be confused with the Analog Devices, née Flex Logix, software of the same name) is developing software for rapidly and securely loading and unloading models, breaking the one-to-one link between model and hardware, facilitating resource sharing, and reducing overall capacity requirements. By enabling inference to cold start in less than two seconds, InferX allows a single node to run any one of dozens of models on demand.
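InferX has not published its implementation, but the general pattern its description implies can be illustrated with a minimal sketch: a node-level pool that keeps a few models resident, loads a requested model on demand, and evicts the least recently used one to make room. All names below (ModelPool, loader, unloader, the example model ID) are hypothetical, and the loader placeholder stands in for the hard part InferX claims to accelerate, namely moving gigabytes of weights into accelerator memory quickly and securely.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelPool:
    """Keep at most `capacity` models resident on one accelerator node.

    Requests for a non-resident model trigger a cold start (loader call);
    the least recently used resident model is evicted to free memory.
    Illustrative only; loader/unloader are placeholders for real weight
    transfer and memory-release logic.
    """

    def __init__(self, loader: Callable[[str], Any],
                 unloader: Callable[[Any], None], capacity: int = 4):
        self.loader = loader        # e.g., deserialize weights into GPU memory
        self.unloader = unloader    # e.g., free the evicted model's memory
        self.capacity = capacity
        self.resident: "OrderedDict[str, Any]" = OrderedDict()

    def infer(self, model_id: str, request: Any) -> Any:
        if model_id in self.resident:
            self.resident.move_to_end(model_id)   # mark as most recently used
        else:
            if len(self.resident) >= self.capacity:
                _, evicted = self.resident.popitem(last=False)  # evict LRU model
                self.unloader(evicted)
            self.resident[model_id] = self.loader(model_id)     # cold start
        return self.resident[model_id](request)   # model object assumed callable


# Toy usage: the "model" is just a function that tags its input.
pool = ModelPool(loader=lambda name: (lambda x: f"{name}:{x}"),
                 unloader=lambda model: None, capacity=2)
print(pool.infer("example-model-8b", "hello"))
```

The sketch captures only the scheduling logic; the value of a system like InferX's lies in making the loader step fast enough (under two seconds, per the company) and in isolating tenants' models from one another, neither of which this toy code addresses.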
The small startup is still developing its software, focusing on Nvidia GPUs. Because it addresses only inference workloads and the issues around cold-starting models, its scope is narrower than that of Run AI, which Nvidia acquired a year ago. As the InferX code matures and initial customers put it into production, we expect it to garner additional interest.