M.G. Siegler

The AI Overkill Era

As AI shifts agentic and expensive, might we see more hybrid local/cloud arrangements?

There have been a number of narratives over the past few weeks that I think you can tie together with the same string. That is, we may be slowly shifting into a world where frontier AI isn't needed for most workflows. I mean, technically that has always been the case. But we're starting to see real action and movement around that notion to turn it into a reality for many.

First came the narrative around costs spiraling out of control at many companies trying to leverage AI in day-to-day operations. Some of this is obviously related to the "tokenmaxxing" trend, but it's also seemingly just around "normal" AI deployment too. With usage growing and the shift towards consumption-based pricing...

Then came the news of NVIDIA 'RTX Spark' systems. The company's shift into CPUs for PCs is all about local inference. Why? At least in part to be able to cut AI costs and workloads in the cloud, while also potentially boosting speed if you simply don't need the biggest and best models for everything.

Next we saw Apple at WWDC showcase 'Siri AI' – for real this time – which is heavily predicated on their on-device models. While much was made about the Google Gemini partnership, and rightfully so, the real key may be their routing system, which is able to quickly know when to use those local models versus when to call to their models running in the cloud – Google's Cloud, running NVIDIA chips, no less. For a lot of Siri queries, you won't need that. And for a lot of your personal content, you may not want that. And if you do need it, Apple may charge you more for it.

Speaking of, this week also brought Claude Fable 5 – yes, Anthropic's 'Mythos' model meant for the masses, at last. Beyond the security concerns, there are major cost concerns too. To the point where Anthropic has a big disclaimer on the model picker that customers will have access to Fable through June 22, at which point it's likely to be an extra-cost add-on. Read: credits will be burned. And the reality here is that Fable is undoubtedly overkill for many "regular" AI workflows.

And that's my word to tie all of these things together: it feels like we're entering the Era of AI Overkill.

Famous last words and all that, but while everyone has spent the past couple of years debating if and when frontier model development would slow down, the real story may be reaching a sort of equilibrium where end-users (and businesses) are realizing that they simply don't need the latest and greatest model for every task. And certainly, given the rising costs with rising usage, they may not want to pay for the best every time.

Again, on paper, this isn't anything new. It's the reason why we've long had the model picker drop-downs within the various AI products. They always implied that maybe you don't need the biggest and best model for your task – and, in fact, this other model may be far faster. But money has a way of talking that time doesn't. If you're telling someone that using this other model will be not just faster, but far cheaper, they're more likely to listen. Provided the end result is still good. And again, for many queries/tasks, "good enough" is now just that.

Countering this is the ongoing push towards agentic workflows and yes, coding. But as Apple made more clear this week, those are probably not going to be the AI workflows that the masses use any time soon. Over time, sure, agents will continue to creep into our lives. But for a lot of tasks, the on-device agentic work that Apple is offering will be more than enough to surprise and delight the masses.

The power users like you and me will undoubtedly continue to use – and pay for – the Claudes and the ChatGPTs at the bleeding edge. But again, it's impossible to argue with free.

What will be truly interesting will be watching how enterprises strike this balance. In light of the rising AI bills, do they start to buy up RTX Spark machines to take a lot of AI work offline as it were? "Unmetered intelligence" is a term that Microsoft and others would really like to make happen. Spending a few thousand dollars up-front versus tens or hundreds of thousands per employee over time is obviously a pretty easy equation. Again, provided the output is still good!
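To make that "easy equation" concrete, here's a back-of-the-envelope sketch of the break-even math. Everything in it is an assumption for illustration: the function name, the idea of an "offload fraction" (the share of cloud spend a local machine could absorb), and especially the numbers in the example, which are placeholders and not real prices for any hardware or service.

```python
def breakeven_months(hardware_cost: float,
                     monthly_cloud_spend: float,
                     offload_fraction: float) -> float:
    """Months until a one-time hardware purchase pays for itself.

    Hypothetical model: a local machine absorbs `offload_fraction` of the
    current cloud AI bill, and that avoided spend is the only saving.
    It ignores power, maintenance, and any difference in output quality.
    """
    if monthly_cloud_spend <= 0:
        raise ValueError("monthly_cloud_spend must be positive")
    if not 0 < offload_fraction <= 1:
        raise ValueError("offload_fraction must be in (0, 1]")
    monthly_savings = monthly_cloud_spend * offload_fraction
    return hardware_cost / monthly_savings


# Placeholder figures only: a 4,000 machine, 1,000 per month of cloud
# spend, and half of that work moved local.
months = breakeven_months(4000, 1000, 0.5)
```

With those placeholder inputs the machine pays for itself in eight months. The interesting variable is the offload fraction, which is really just the "provided the output is still good" caveat expressed as a number.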

Do we enter a world of 'haves and have-nots' in business where the biggest businesses pay up for the frontier AI all the time while smaller businesses "settle" for cheaper AI?

Hybridization seems inevitable here as well. And that will require good routing systems to know what should be done locally, for free, versus what needs to be handled at the frontier in the cloud. The buzzword you're actually looking for here is "orchestration", again, not new. But it will be increasingly important. Not just to route between different models from different providers, and different models from the same provider, but also local versus cloud models.
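For a rough sense of what that routing decision looks like, here's a minimal sketch. To be clear, this is not how Apple or anyone else actually does it. The tiers, the `Task` fields, and the rules themselves are all hypothetical, and a real orchestration layer would presumably lean on a trained classifier or a small model rather than a word count.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"        # on-device model, no per-token cost
    CLOUD = "cloud"        # mid-tier hosted model
    FRONTIER = "frontier"  # most capable, most expensive hosted model


@dataclass
class Task:
    prompt: str
    has_personal_data: bool = False  # content the user may not want leaving the device
    needs_tools: bool = False        # agentic, multi-step work


def route(task: Task, local_word_limit: int = 200) -> Tier:
    """Pick a tier for a task using simple, made-up rules."""
    # Privacy first: personal content stays on-device in this sketch.
    if task.has_personal_data:
        return Tier.LOCAL
    # Assume agentic work is where the frontier model earns its cost.
    if task.needs_tools:
        return Tier.FRONTIER
    # Short prompts go local; longer ones go to the cheaper cloud tier.
    if len(task.prompt.split()) <= local_word_limit:
        return Tier.LOCAL
    return Tier.CLOUD


choice = route(Task("who let the dogs out?"))
```

Under these rules, that last query stays local, which is rather the point. The design choice worth noticing is the ordering: privacy is checked before capability, so a personal task never leaves the device even if a bigger model would do it better.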

How the model makers themselves handle this will also be interesting to watch. Obviously they've been trying to take on more of the work of picking the right model for you, but this has also prompted some backlash and backtracking. At the same time, it was never going to be tenable to have a Microsoft Office-like toolbar system of drop-downs, and instead, most people will want what Apple is now offering with Siri AI: one prompt box to rule them all.

Yes, yes, here too power users will want to decide and pick their own. But also undoubtedly not always. It's just far too big of a cognitive burden to have to think about which model you want to use for each query or task. And as workflows move more towards audio and perhaps to different types of AI devices, the UI to handle such model picking quickly goes away out of necessity.

And as agentic workflows become more automated, the system will necessarily need to pick the right models for your task. Interestingly, while much of this work started local with OpenClaw and the like, this will probably increasingly move the other way: to the cloud as more devices get incorporated into such workflows. But there still might be a home "hub" for local model work to save cost/time? Even if it's a smartphone?

Again, if nothing else, people may feel more comfortable running their personal information locally. Certainly Apple believes this! And going forward, does our personal AI morph into our professional AI, or do they stay fully separate?

Anyway, yes, it's fun to try Fable to see if it can answer "who let the dogs out?", but it's also increasingly insanely expensive and stupid to do that. I mean, it was always stupid to do that. But now money talks. And barks back.

As we enter the Era of AI Overkill, are the cheaper and local models ready to answer the call? And is the orchestration layer ready to route to them?