OpenAI bypasses Nvidia with unusually fast coding model on board-sized chips


But 1,000 tokens per second is actually modest by Cerebras standards. The company has measured 2,100 tokens per second on Llama 3.1 70B and reported 3,000 tokens per second on OpenAI’s open-weight gpt-oss-120B model, suggesting that Codex-Spark’s comparatively slower speed reflects the overhead of a larger or more complex model.
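To put those throughput numbers in perspective, a quick back-of-the-envelope sketch shows what each reported speed means in wall-clock time. The 2,000-token output size is an assumption chosen for illustration (roughly a few hundred lines of code), not a figure from the article:

```python
def seconds_for(output_tokens: int, tokens_per_second: int) -> float:
    """Wall-clock seconds to stream a completion at a given throughput."""
    return output_tokens / tokens_per_second

# Throughputs reported in the article, in tokens per second
speeds = {
    "Codex-Spark": 1_000,
    "Llama 3.1 70B on Cerebras": 2_100,
    "gpt-oss-120B on Cerebras": 3_000,
}

# Hypothetical 2,000-token completion
for name, tps in speeds.items():
    print(f"{name}: {seconds_for(2_000, tps):.1f} s")
```

Even at the "slower" 1,000 tokens per second, the whole completion streams in about two seconds, which is why latency at this scale changes how iterating with a coding agent feels.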

AI coding agents have had a banner year, with tools like OpenAI’s Codex and Anthropic’s Claude Code reaching a new level of usefulness for quickly building prototypes, interfaces, and boilerplate code. OpenAI, Google, and Anthropic have been competing to offer ever more capable coding agents, and latency has become a key differentiator: a model that codes faster lets the developer iterate faster.

With fierce competition from Anthropic, OpenAI has been iterating on its Codex lineup at a rapid pace, releasing GPT-5.2 in December after CEO Sam Altman issued an internal “code red” memo about competitive pressure from Google, and then shipping GPT-5.3-Codex just a few days ago.

Diversifying away from Nvidia

The hardware behind Codex-Spark may be more consequential than its benchmark scores. The model runs on Cerebras’ Wafer Scale Engine 3, a plate-sized chip that Cerebras has built its business on since at least 2022. OpenAI and Cerebras announced their partnership in January, and Codex-Spark is the first product to come out of it.

OpenAI spent the past year systematically reducing its dependence on Nvidia. The company signed a massive multi-year deal with AMD in October 2025, struck a $38 billion cloud computing deal with Amazon in November, and has been designing its own custom AI chip for eventual manufacturing by TSMC.

Meanwhile, a planned $100 billion infrastructure deal with Nvidia has so far failed to materialize, although Nvidia has since committed to a $20 billion investment. Reuters reported that OpenAI was dissatisfied with the speed of some Nvidia chips for inference tasks, which is exactly the type of workload OpenAI designed Codex-Spark for.

Regardless of what chip is under the hood, speed matters, although it may come at the cost of accuracy. For developers who spend their days inside a code editor waiting for AI suggestions, 1,000 tokens per second may feel less like carefully guiding a jigsaw and more like wielding a circular saw. Just be careful what you’re cutting.
