This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.” Read more from this special issue.
AI has become the holy grail of modern companies. Whether it’s customer service or something as niche as pipeline maintenance, organizations in every domain are now implementing AI technologies, from foundation models to VLAs, to make things more efficient. The goal is straightforward: automate tasks to deliver outcomes more efficiently while saving money and resources at the same time.
However, as these projects transition from the pilot to the production stage, teams encounter a hurdle they hadn’t planned for: cloud costs eroding their margins. The sticker shock is so bad that what once felt like the fastest path to innovation and competitive edge quickly becomes an unsustainable budgetary black hole.
This prompts CIOs to rethink everything, from model architecture to deployment models, to regain control over financial and operational aspects. Sometimes, they even shutter projects entirely and start over from scratch.
But here’s the fact: while cloud can take costs to unbearable levels, it’s not the villain. You just have to understand what type of vehicle (AI infrastructure) to choose for which road (the workload).
The cloud story, and where it works
The cloud is a lot like public transport (your subways and buses). You get on board with a simple rental model, and it instantly gives you all the resources, right from GPU instances to fast scaling across various geographies, to take you to your destination, all with minimal work and setup.
The fast and easy access via a service model ensures a seamless start, paving the way to get the project off the ground and do rapid experimentation without the big upfront capital expenditure of acquiring specialized GPUs.
Most early-stage startups find this model lucrative, as they need fast turnaround more than anything else, especially when they are still validating the model and determining product-market fit.
“You make an account, click a few buttons, and get access to servers. If you need a different GPU size, you shut down and restart the instance with the new specs, which takes minutes. If you want to run two experiments at once, you initialise two separate instances. In the early stages, the focus is on validating ideas quickly. Using the built-in scaling and experimentation frameworks provided by most cloud platforms helps reduce the time between milestones,” Rohan Sarin, who leads voice AI product at Speechmatics, told VentureBeat.
The cost of “ease”
While cloud makes perfect sense for early-stage usage, the infrastructure math becomes grim as the project transitions from testing and validation to real-world volumes. The scale of workloads makes the bills brutal, so much so that costs can surge over 1,000% overnight.
This is particularly true in the case of inference, which not only has to run 24/7 to ensure service uptime but also has to scale with customer demand.
On most occasions, Sarin explains, inference demand spikes when other customers are also requesting GPU access, increasing the competition for resources. In such cases, teams either keep reserved capacity to make sure they get what they need, which leads to idle GPU time during non-peak hours, or suffer from latencies that hurt the downstream experience.
Christian Khoury, the CEO of AI compliance platform EasyAudit AI, described inference as the new “cloud tax,” telling VentureBeat that he has seen companies go from $5K to $50K/month overnight, just from inference traffic.
It’s also worth noting that inference workloads involving LLMs, with their token-based pricing, can trigger the steepest cost increases. This is because these models are non-deterministic and can generate different outputs when handling long-running tasks (involving large context windows). With continuous updates, it gets really difficult to forecast or control LLM inference costs.
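To see why those bills are so hard to budget for, here is a minimal back-of-the-envelope sketch of token-based cost estimation. The per-token prices, traffic volume and token counts are illustrative assumptions, not any provider’s actual rates.

```python
# Hypothetical token-based pricing; real rates vary by provider and model.
INPUT_PRICE_PER_1K = 0.003   # $ per 1K input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # $ per 1K output tokens (assumed)

def monthly_inference_cost(requests_per_day, avg_input_tokens, avg_output_tokens, days=30):
    """Estimate a monthly bill from average token counts per request."""
    total_requests = requests_per_day * days
    input_cost = total_requests * avg_input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = total_requests * avg_output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

# The same request volume produces a very different bill once outputs grow:
# longer answers, bigger context windows, multi-step agent runs.
print(monthly_inference_cost(50_000, avg_input_tokens=1_000, avg_output_tokens=300))    # ~$11,250
print(monthly_inference_cost(50_000, avg_input_tokens=4_000, avg_output_tokens=1_200))  # ~$45,000
```

Because output tokens are both the more expensive and the less predictable term, small shifts in model behavior can swing the total far more than changes in request count.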
Training these models, for its part, happens to be “bursty” (occurring in clusters), which does leave some room for capacity planning. However, even in these cases, especially as growing competition forces frequent retraining, enterprises can run up massive bills from idle GPU time, stemming from overprovisioning.
“Training credits on cloud platforms are expensive, and frequent retraining during fast iteration cycles can escalate costs quickly. Long training runs require access to large machines, and most cloud providers only guarantee that access if you reserve capacity for a year or more. If your training run only lasts a few weeks, you still pay for the rest of the year,” Sarin explained.
And it’s not just this. Cloud lock-in is very real. Suppose you’ve made a long-term reservation and bought credits from a provider. In that case, you’re locked into their ecosystem and have to use whatever they have on offer, even when other providers have moved to newer, better infrastructure. And, finally, when you do get the ability to move, you may have to bear massive egress fees.
“It’s not just compute cost. You get…unpredictable autoscaling, and insane egress fees if you’re moving data between regions or vendors. One team was paying more to move data than to train their models,” Sarin emphasized.
So, what’s the workaround?
Given the constant infrastructure demand of scaling AI inference and the bursty nature of training, enterprises are moving toward splitting the workloads: taking inference to colocation or on-prem stacks, while leaving training to the cloud with spot instances.
This isn’t just theory; it’s a growing movement among engineering leaders trying to put AI into production without burning through runway.
“We’ve helped teams shift to colocation for inference using dedicated GPU servers that they control. It’s not sexy, but it cuts monthly infra spend by 60–80%,” Khoury added. “Hybrid’s not just cheaper; it’s smarter.”
In one case, he said, a SaaS company reduced its monthly AI infrastructure bill from roughly $42,000 to just $9,000 by moving inference workloads off the cloud. The switch paid for itself in under two weeks.
Another team requiring consistent sub-50ms responses for an AI customer support tool discovered that cloud-based inference latency was insufficient. Moving inference closer to users via colocation not only solved the performance bottleneck, it also halved the cost.
The setup typically works like this: inference, which is always-on and latency-sensitive, runs on dedicated GPUs either on-prem or in a nearby data center (colocation facility). Meanwhile, training, which is compute-intensive but sporadic, stays in the cloud, where you can spin up powerful clusters on demand, run them for a few hours or days, and shut them down.
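In practice, the cloud half of that split can be as simple as launching spot GPU instances for each training run and terminating them when it finishes, while inference keeps pointing at a fixed colo endpoint. The sketch below assumes AWS via boto3; the AMI, instance type, and endpoint URL are placeholders rather than anything the teams above actually use.

```python
import boto3

# Always-on inference stays on hardware you control (placeholder address).
INFERENCE_ENDPOINT = "http://gpu-colo.internal:8000/v1/completions"

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_training_cluster(count=4):
    """Spin up spot GPU instances in the cloud for a bursty training run."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                # placeholder AMI
        InstanceType="p4d.24xlarge",                    # example GPU instance type
        MinCount=count,
        MaxCount=count,
        InstanceMarketOptions={"MarketType": "spot"},   # spot pricing for burst work
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "workload", "Value": "training"}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]

def shut_down(instance_ids):
    """Terminate the cluster as soon as the run ends so nothing sits idle."""
    ec2.terminate_instances(InstanceIds=instance_ids)
```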
Broadly, it’s estimated that renting from hyperscale cloud providers can cost three to four times more per GPU hour than working with smaller providers, with the difference being even more significant compared to on-prem infrastructure.
The other big bonus? Predictability.
With on-prem or colocation stacks, teams also have full control over the number of resources they want to provision or add for the expected baseline of inference workloads. This brings predictability to infrastructure costs and eliminates surprise bills. It also brings down the aggressive engineering effort needed to tune scaling and keep cloud infrastructure costs within reason.
Hybrid setups also help reduce latency for time-sensitive AI applications and enable better compliance, particularly for teams operating in highly regulated industries like finance, healthcare, and education, where data residency and governance are non-negotiable.
Hybrid complexity is real, but rarely a dealbreaker
As has always been the case, the shift to a hybrid setup comes with its own ops tax. Setting up your own hardware or renting a colocation facility takes time, and managing GPUs outside the cloud requires a different kind of engineering muscle.
However, leaders argue that the complexity is often overstated and is usually manageable in-house or through external support, unless one is operating at an extreme scale.
“Our calculations show that an on-prem GPU server costs about the same as six to nine months of renting the equivalent instance from AWS, Azure, or Google Cloud, even with a one-year reserved rate. Since the hardware typically lasts at least three years, and often more than five, this becomes cost-positive within the first nine months. Some hardware vendors also offer operational pricing models for capital infrastructure, so you can avoid upfront payment if cash flow is a concern,” Sarin explained.
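A quick back-of-the-envelope version of that payback math, with every figure below an illustrative assumption rather than a quote from any vendor’s price list:

```python
# Assumed figures for an 8-GPU server; swap in your own quotes and cloud rates.
CLOUD_COST_PER_GPU_HOUR = 3.00    # $/GPU-hour at a one-year reserved rate (assumed)
GPUS = 8
HOURS_PER_MONTH = 730

monthly_cloud = CLOUD_COST_PER_GPU_HOUR * GPUS * HOURS_PER_MONTH   # ~ $17,500/month

ON_PREM_SERVER_COST = 130_000     # assumed purchase price of a comparable server
MONTHLY_OPEX = 1_500              # assumed power, space and maintenance

breakeven_months = ON_PREM_SERVER_COST / (monthly_cloud - MONTHLY_OPEX)
print(round(breakeven_months, 1))   # ~ 8.1 months, inside the 6-9 month window Sarin cites

# Over a three-year hardware life, the remaining months become savings.
three_year_savings = 36 * (monthly_cloud - MONTHLY_OPEX) - ON_PREM_SERVER_COST
print(round(three_year_savings))    # ~ $446,700
```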
Prioritize by need
For any company, whether a startup or an enterprise, the key to success when architecting (or re-architecting) AI infrastructure lies in working according to the specific workloads at hand.
If you’re unsure about the load of different AI workloads, start with the cloud and keep a close eye on the associated costs by tagging every resource with the responsible team. You can share these cost reports with all managers and do a deep dive into what they are using and its impact on resources. This data will then give clarity and help pave the way for driving efficiencies.
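As a rough illustration of that tagging discipline, here is a small sketch assuming AWS via boto3; the “team” tag key, instance ID, and date range are placeholders, and the tag has to be activated as a cost allocation tag before it appears in billing reports.

```python
import boto3

# 1) Tag each resource with the team that owns it (here, an EC2 instance).
ec2 = boto3.client("ec2")
ec2.create_tags(
    Resources=["i-0abc123def4567890"],              # placeholder instance ID
    Tags=[{"Key": "team", "Value": "voice-inference"}],
)

# 2) Pull a monthly cost report grouped by that tag to share with managers.
ce = boto3.client("ce")  # Cost Explorer
report = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-05-01", "End": "2025-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)
for group in report["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])
```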
That said, remember that it’s not about ditching the cloud entirely; it’s about optimizing its use to maximize efficiencies.
“Cloud is still great for experimentation and bursty training. But if inference is your core workload, get off the rent treadmill. Hybrid isn’t just cheaper… It’s smarter,” Khoury added. “Treat cloud like a prototype, not the permanent home. Run the math. Talk to your engineers. The cloud will never tell you when it’s the wrong tool. But your AWS bill will.”