After nearly four years and hundreds of billions of dollars burned building smarter and more capable models, folks understandably would like to see them do something more than run a chatbot.
In this respect, OpenClaw served as blood in the water: despite its seemingly endless supply of security flaws, it demonstrated that LLMs really can be used to automate complex tasks. Since then, you’ve probably noticed the term “harness” coming up more frequently to describe agentic AI frameworks, and for good reason.
You don’t need a harness to interact with a chatbot – local tools like Ollama let you fire API calls directly at the model – but for today’s more advanced work, a harness is essential.
On the face of it, AI harnesses are just a bit of code that wraps an LLM’s API endpoint, orchestrates tool calls, and manages context. OpenClaw, Claude Code, Codex, and Pi Coding Agent are all examples of code-focused harnesses you may already be familiar with.
As simple as all this sounds, harnesses are changing the way we think about everything from how new models are trained to how we build and run them at scale.
LLM inference on its own is pretty dumb – not the models so much as the way we interact with them. The OpenAI-compatible API calls that have become the de facto standard are transactional. With most early chatbots, you made a request and the API would supply a response.
A harness, by comparison, orchestrates those API calls, breaking down one request into multiple.
If you were to ask a code agent to build an app that parses logs, the harness might make one request to plan things out, another to review the log directory, a third to generate code and execute it in an interpreter, and a fourth to debug and fix any errors. This multi-step loop continues until the work is done or the harness cuts it short to ask for user input.
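For the curious, here’s a rough sketch of what that loop looks like under the hood. It assumes an OpenAI-compatible endpoint of the sort Ollama exposes locally, plus a single made-up run_shell tool – the URL, model name, and tool are illustrative, not any particular harness’s actual API:

```python
import json
import subprocess
import requests

# Assumed local OpenAI-compatible endpoint (e.g. Ollama); adjust to taste.
API_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "qwen2.5-coder"  # swap in whatever model your server actually hosts

# One illustrative tool the model is allowed to call.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]


def run_shell(command: str) -> str:
    # Real harnesses sandbox this; here we just capture stdout/stderr.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = requests.post(API_URL, json={
            "model": MODEL, "messages": messages, "tools": TOOLS,
        }).json()
        msg = resp["choices"][0]["message"]
        messages.append(msg)
        # No tool calls means the model thinks it's finished.
        if not msg.get("tool_calls"):
            return msg.get("content") or ""
        # Execute each requested tool and feed the result back into context.
        for call in msg["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": run_shell(args["command"]),
            })
    return "Step limit reached; handing back to the user."


print(agent_loop("Summarise the errors in ./logs"))
```

Real harnesses layer on sandboxing, permission prompts, context compaction, and far richer toolsets, but the skeleton is the same: call the model, run whatever tools it asks for, feed the results back in, repeat.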
At least for coding, these harnesses are getting good enough to be useful. In fact, a harness may have a bigger impact on a code assistant’s success than the model itself does. Even Qwen3.6-27B, a small-to-medium-sized LLM, proved to be a surprisingly effective alternative to larger paid models when paired with harnesses like Anthropic’s Claude Code or Cline. And yes, if you didn’t know, Claude Code works with any model you like.
In fact, the realization that small models with well-designed harnesses can now automate complex tasks has contributed to a shortage of Mac Minis, as AI enthusiasts race to self-host OpenClaw and LLMs on them.
Changing the way we build models
Training dominated the first two years of the AI boom. OpenAI, Google, Microsoft and others raced to build smarter models using as much data as they could harvest.
But by the end of 2024, the payoff from building ever larger models started to taper off, as the extra parameters yielded only small gains in intelligence.
DeepSeek R1 brought “reasoning” models and test-time scaling to the mainstream. To be clear, these models don’t actually reason, but instead trade time and tokens for higher quality answers and a lower propensity to make stuff up (aka “hallucinate,” although we at El Reg try to avoid anthropomorphizing AI).
It wasn’t the first. OpenAI’s o1 beat it to the punch, but R1 was the first widely adopted open-weights model to use reinforcement learning (RL) to teach the model new skills, like chain-of-thought reasoning.
Over the past year, agentic code assistants have steadily gained traction. Consequently, people are increasingly using RL to teach models to use the tools and resources that agent harnesses expose to them.
If you look at many of the recent model releases on Hugging Face, you’ll notice a strong emphasis on agentic tool calling and long-context reasoning. If you want a model to work effectively with an agent harness, it needs to execute tool calls reliably. And since those tool calls can return large quantities of information, you also need the model not to lose track of that information.
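That second point is easy to underestimate. In a harness, every tool result gets appended to the model’s context, so the harness has to budget tokens or start trimming. Here’s a minimal sketch of that bookkeeping – the 32K window and the four-characters-per-token heuristic are assumptions; real harnesses use a proper tokenizer and smarter summarization:

```python
# Illustration of why context length matters for agentic models: every tool
# result lands back in the message history, so the harness has to budget.
MAX_CONTEXT_TOKENS = 32_768   # assumed context window for the model
RESERVED_FOR_REPLY = 4_096    # headroom left for the model's next answer


def rough_token_count(text: str) -> int:
    # Crude heuristic (~4 characters per token); a real harness uses a tokenizer.
    return len(text) // 4


def append_tool_result(messages: list, tool_call_id: str, output: str) -> None:
    budget = MAX_CONTEXT_TOKENS - RESERVED_FOR_REPLY
    used = sum(rough_token_count(m.get("content") or "") for m in messages)
    remaining = max(budget - used, 0)
    if rough_token_count(output) > remaining:
        # Truncate rather than silently blow past the context window.
        output = output[: remaining * 4] + "\n[...output truncated by harness...]"
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": output,
    })
```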
While these qualities make for better agentic models, they also demand a very different mix of hardware.
CPUs take center stage
Compute to run these agent harnesses is in high demand. After living in the shadow of high-end GPUs and AI accelerators for the past few years, CPUs are back in the limelight.
Intel Xeon processors are selling faster than Intel can make them. Meta is buying up every chip it can get from Arm and Nvidia, and renting boatloads of Amazon’s Graviton CPUs while it awaits delivery.
This is happening because agent harnesses run on CPUs, not GPUs. But even with enough CPU cores to execute these tasks at scale, the sheer volume of requests the harnesses generate is reshaping the way we run the models themselves.
If you haven’t noticed, inference costs have been on the rise. OpenAI recently raised the price of GPT-5.5, Microsoft moved GitHub Copilot to a purely usage-based pricing model, and Anthropic could soon force Claude Code users onto its pricier “Max” subscriptions.
Some of this is because of increased demand. Like it or not, vibe coding is catching on and probably isn’t going away. However, we suspect some of it may be down to the fact that these models are running on hardware that was originally built for training and is now having to pull double duty for inference.
Only in the last year and a half have we started to see inference-optimized systems like Nvidia’s NVL72 racks hit the market. AWS, AMD, and others are now racing to catch up with rack-scale compute platforms of their own.
But it turns out that even these systems aren’t enough on their own. If agentic code harnesses are making dozens of requests, each generating hundreds of lines of code, inference performance becomes a major bottleneck. In the early days of ChatGPT, it might have been enough to churn out tokens faster than the average person could read. Remove the meatbag from the equation and speed becomes everything.
GPUs are incredibly compute-dense parallel processors, but their memory isn’t well suited to the auto-regressive decoding demanded by the large models these harnesses are being saddled with.
Groq and Cerebras get their moment under the AI sun
Faced with these challenges, infrastructure providers have adopted new compute architectures that combine GPUs with specialized AI accelerators.
Nvidia’s acquihire of Groq is a prime example. Late last year, Nvidia dropped $20 billion to license the AI chipmaker’s language processing unit (LPU) chip tech and hire away its engineering staff.
As we wrote at the time, Nvidia could have built its own SRAM-heavy decode accelerator if it wanted to, but it was faster to use someone else’s.
By combining its compute-heavy GPUs with Groq’s high-bandwidth LPUs, Nvidia was able to churn out more tokens faster and, in theory, improve the economics for AI agents.
Higher interactivity is key for agentic workloads because it means more requests can be served in the same amount of time, or the model can “think” for longer about the information it’s been given.
We explored Nvidia’s new Groq-based LPXs back at GTC, as well as the market dynamics behind the multi-rack architecture.
AWS is using wafer-scale AI accelerators from the recently public Cerebras Systems in much the same way, while Intel is now working with SambaNova on a disaggregated compute architecture of its own.
The pendulum swings
Given the sheer amount of compute these agent harnesses require, there’s a good chance we’ll start to see hyperscalers cut costs by offloading some of the work onto client devices.
Because of the way these harnesses work, simpler requests, such as planning, could be handled by small models running locally on the user’s PC.
In fact, Google appears to be doing just that.
As we reported earlier this month, Google quietly began shipping a small LLM as part of Chrome that eats up 4 GB of disk space, and presumably just as much memory when in operation. The model appears to power basic features like “help me write,” scam detection, and other AI-assisted functions that have steadily invaded our browsers of late.
It’s not hard to imagine code agents doing something similar. A small local model could be used to draft and test code snippets while the larger cloud-hosted model is used to debug and correct errors, shifting much of the load off datacenters and onto client devices.
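A back-of-the-envelope sketch of that split might look like the following – the endpoints, model names, and escalation logic are purely illustrative, not a description of any shipping product:

```python
# Hypothetical local/cloud split: draft and test on-device, escalate failures
# to a larger hosted model. URLs and model names here are made up.
import subprocess
import sys
import requests

LOCAL_URL = "http://localhost:11434/v1/chat/completions"   # small on-device model
CLOUD_URL = "https://api.example.com/v1/chat/completions"  # large hosted model


def complete(url: str, model: str, prompt: str) -> str:
    resp = requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).json()
    return resp["choices"][0]["message"]["content"]


def build_snippet(task: str) -> str:
    # Cheap first pass on the local model.
    code = complete(LOCAL_URL, "small-local-model", f"Write Python for: {task}")
    # Note: executing model-generated code like this belongs in a sandbox.
    test = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    if test.returncode == 0:
        return code  # the local draft was good enough; the datacenter never sees it
    # Escalate only the failure to the big cloud-hosted model.
    return complete(CLOUD_URL, "big-cloud-model",
                    f"This code failed with:\n{test.stderr}\nFix it:\n{code}")
```

The appeal for providers is obvious: the cheap, frequent calls stay on the customer’s silicon, and the datacenter only sees the requests the local model couldn’t handle.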
For that to work, we’re going to need systems with a whole lot more high-speed memory, which poses a bit of a problem in light of the DRAM and NAND shortage.
While user-facing agent harnesses could be used to shove some of the computational load onto customer devices, many still want to see agents carrying out entire departments’ worth of work. Take the human out of the loop, and these agents wouldn’t be constrained by the limitations of their fleshy masters and could work orders of magnitude faster given enough compute resources.
So, just as the rise of PCs didn’t spell the end of mainframes, local AI is unlikely to end investors’ obsession with ever hotter and more power-hungry bit barns any time soon. ®