The inference trap: How cloud providers are eating your AI margins


This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.” Read more from this special issue.

AI has become the holy grail of modern companies. Whether it’s customer service or something as niche as pipeline maintenance, organizations in every domain are now implementing AI technologies, via foundation models, to make things more efficient. The goal is simple: automate tasks to deliver outcomes more efficiently while saving money and resources at the same time.

However, as these projects move from pilot to production, teams run into a roadblock they hadn’t planned for: cloud costs eroding their margins. The sticker shock is so severe that what once felt like the fastest path to innovation and competitive edge becomes an unsustainable budgetary sinkhole, in no time.

This is pushing CIOs to rethink everything, from model architecture to deployment models, to regain control over the financial and operational aspects. Sometimes they even shut projects down entirely and start over from scratch.

But here’s the fact: while the cloud can take costs to unbearable levels, it is not the villain. You just have to understand what type of vehicle (AI infrastructure) to choose for which road (the workload).

The cloud story, and where it works

The cloud is a lot like public transport (your subways and buses). You get on board with a simple rental model, and it instantly gives you all the resources, from GPU instances to rapid scaling across various geographies, to take you to your destination, all with minimal work and setup.

The quick and easy access via a service model ensures a seamless start, paving the way to get the project off the ground and do rapid experimentation without the huge upfront capital expenditure of acquiring specialized GPUs.

Most early-stage startups find this model lucrative because they need fast turnaround more than anything else, especially when they are still validating the model and determining product-market fit.

“You create an account, click a few buttons and get access to servers. If you need a different GPU size, you shut down and restart the instance with the new specs, which takes minutes. If you want to run two experiments at once, you spin up two separate instances. In the early stages, it’s mostly about validating ideas quickly,” Sarin of Speechmatics told VentureBeat.

The cost of “ease”

While the cloud makes perfect sense for early-stage use, the infrastructure math turns grim as the project moves from testing and validation to real-world volumes. The scale of the workloads makes the bills brutal, so much so that costs can jump by more than 1,000% overnight.

This is particularly true for inference, which not only has to run 24/7 to ensure service availability but also scale with customer demand.

On most occasions, Sarin explains, inference demand spikes when other customers are also requesting GPU access, increasing the competition for resources. In such cases, teams either keep reserved capacity to make sure they get what they need, which leads to idle GPU time during off-peak hours, or suffer latency, which hurts the downstream experience.

Christian Khoury, CEO of AI compliance platform EasyAudit AI, described inference as the new “cloud tax,” telling VentureBeat that he has seen companies go from $5,000 to $50,000 per month overnight, purely from inference traffic.

It’s also worth noting that LLM-based inference workloads, with their token-based pricing, can trigger the steepest cost increases. That’s because these models are non-deterministic and can generate different-length outputs when handling long-running tasks (involving large context windows). With continuous updates, it gets really difficult to forecast or control LLM inference costs.
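
To see why token-based pricing is so hard to forecast, here is a minimal sketch that estimates monthly inference spend from assumed per-token rates and request volumes. The prices, volumes and token counts below are hypothetical placeholders, not any provider’s published figures.

```python
# Rough monthly cost estimate for LLM inference under token-based pricing.
# All rates and volumes below are hypothetical placeholders.

def monthly_inference_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    price_in_per_1k: float,   # assumed $ per 1K input tokens
    price_out_per_1k: float,  # assumed $ per 1K output tokens
    days: int = 30,
) -> float:
    per_request = (
        avg_input_tokens / 1000 * price_in_per_1k
        + avg_output_tokens / 1000 * price_out_per_1k
    )
    return per_request * requests_per_day * days

# If average output length drifts from 300 to 900 tokens (longer tasks,
# bigger context windows), the bill nearly doubles with no change in traffic.
for out_tokens in (300, 900):
    cost = monthly_inference_cost(
        requests_per_day=50_000,
        avg_input_tokens=2_000,
        avg_output_tokens=out_tokens,
        price_in_per_1k=0.003,
        price_out_per_1k=0.015,
    )
    print(f"avg output {out_tokens} tokens -> ~${cost:,.0f}/month")
```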

Training of these models, for its part, tends to be “bursty” (occurring in clusters), which leaves some room for capacity planning. However, even then, especially as growing competition forces frequent retraining, enterprises can end up with massive bills from idle GPU time, the result of overprovisioning.

“Training credits on cloud platforms are expensive, and frequent retraining during fast iteration cycles can escalate costs quickly. Long training runs require access to large machines, and most cloud providers guarantee that access only if you reserve capacity for a year or more. If your training run lasts only a few weeks, you still pay for the rest of the year,” Sarin said.

And that’s not all. Cloud lock-in is very real. Suppose you’ve made a long-term reservation and bought credits from one provider. In that case, you’re locked into their ecosystem and have to use whatever they offer, even when other providers have moved to newer, better infrastructure. And, finally, when you do get the chance to move, you may have to bear massive egress fees.

“It’s not just compute cost. You get … unpredictable autoscaling, and insane egress fees if you move data between regions or vendors. One team paid more to move data than to train their models,” Sarin said.

So, what’s the workaround?

Given the constant demand of scaled AI inference infrastructure and the sporadic nature of training, companies are moving toward splitting workloads: shifting inference to colocation or on-prem stacks.

This isn’t just theory; it’s a growing movement among engineering leaders trying to put AI into production without burning through their runway.

“We’ve helped teams shift to colocation for inference using dedicated GPU servers they control. It’s not sexy, but it cuts infra spend by 60 to 80%,” Khoury added. “Hybrid isn’t just cheaper, it’s smarter.”

In one case, he said, a SaaS company reduced its monthly AI infrastructure bill from roughly $42,000 to just $9,000 by moving inference workloads off the cloud. The switch paid for itself in under two weeks.

Another team, which required consistent sub-50ms responses for an AI customer support tool, found that cloud-based inference latency wasn’t up to the task. Shifting inference closer to users via colocation not only fixed the performance bottleneck, it cut costs in half.

The setup typically works like this: inference, which is always-on and latency-sensitive, runs on dedicated GPUs either on-prem or in a nearby data center (a colocation facility). Meanwhile, training, which is compute-intensive but sporadic, stays in the cloud, where you can spin up powerful clusters on demand, run them for a few hours or days, and shut them down.

In general, it’s estimated that renting from hyperscale cloud providers can cost three to four times more per GPU-hour than working with smaller providers, with the difference being even more significant when compared against on-prem infrastructure.
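
To make that per-hour gap concrete, here is a quick sketch that turns assumed hourly GPU rates into the monthly bill for a single always-on inference server; the rates are illustrative stand-ins, not published prices.

```python
# Monthly cost of one GPU serving inference 24/7 at different hourly rates.
# Hourly rates are illustrative assumptions, not vendor quotes.
HOURS_PER_MONTH = 24 * 30

assumed_rates = {
    "hyperscaler, on-demand": 4.00,  # assumed $/GPU-hour
    "smaller GPU provider": 1.20,    # assumed, roughly 3-4x cheaper per hour
}

for provider, rate in assumed_rates.items():
    print(f"{provider}: ~${rate * HOURS_PER_MONTH:,.0f} per GPU per month")
```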

The other big bonus? Predictability.

With on-prem or colocation stacks, teams also have full control over the amount of resources they want to provision or add for the expected baseline of inference workloads. This brings predictability to infrastructure costs and eliminates surprise bills. It also cuts down the aggressive engineering effort otherwise spent tuning scaling and keeping cloud infrastructure costs within reasonable limits.

Hybrid setups also help reduce latency for time-sensitive AI applications and enable better compliance, particularly for teams operating in highly regulated industries like finance, healthcare and education, where data residency and governance are non-negotiable.

Hybrid complexity is real, but rarely a dealbreaker

As is always the case, moving to a hybrid setup comes with its own ops tax. Setting up your own hardware or renting a colocation facility takes time, and managing GPUs outside the cloud requires a different kind of engineering muscle.

However, leaders argue that the complexity is often overstated and is usually manageable in-house or with external support, unless you are operating at an extreme scale.

“Our calculations show that an on-prem GPU server costs about the same as six to nine months of renting the equivalent instance from AWS, Azure or Google Cloud, even with a one-year reserved rate. Since the hardware typically lasts at least three years, and often more than five …,” said Sarin.
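
A back-of-the-envelope version of that break-even math might look like the sketch below; the purchase price and rental rate are assumed figures chosen to land in the six-to-nine-month range Sarin describes, not real quotes.

```python
# Break-even estimate: buying an on-prem GPU server vs. renting the
# equivalent cloud instance. All figures are illustrative assumptions.
server_purchase_price = 40_000.0  # assumed upfront hardware cost ($)
cloud_monthly_rent = 5_500.0      # assumed monthly cost of the equivalent reserved instance ($)
hardware_lifetime_years = 4       # article cites at least 3, often 5+ years

break_even_months = server_purchase_price / cloud_monthly_rent
rent_over_lifetime = cloud_monthly_rent * 12 * hardware_lifetime_years

print(f"Break-even after ~{break_even_months:.1f} months of cloud rent")
print(f"Renting for {hardware_lifetime_years} years instead: ~${rent_over_lifetime:,.0f}")
```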

Prioritize

For any company, whether a startup or an enterprise, the key to success when architecting, or re-architecting, AI infrastructure lies in working according to the specific workloads at hand.

If you’re unsure about the load of different AI workloads, start with the cloud and keep a close eye on the associated costs by tagging every resource with the team responsible. You can share these cost reports with all managers and take a deep dive into what they are using and its impact on resources. This data will then provide clarity and help pave the way for driving efficiencies.
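
As a minimal sketch of that tag-and-report loop, assuming a simple exported list of billing records (the field names and amounts here are made up, not any provider’s export format), monthly spend could be rolled up by team like this:

```python
# Roll up a hypothetical cloud billing export by team tag.
# Record layout and amounts are illustrative assumptions.
from collections import defaultdict

billing_records = [
    {"resource": "gpu-inference-01", "team": "support-ai", "cost_usd": 12_400.0},
    {"resource": "gpu-train-spot-07", "team": "research", "cost_usd": 8_900.0},
    {"resource": "gpu-inference-02", "team": "support-ai", "cost_usd": 11_800.0},
    {"resource": "egress-us-to-eu", "team": "research", "cost_usd": 3_200.0},
]

cost_by_team = defaultdict(float)
for record in billing_records:
    cost_by_team[record["team"]] += record["cost_usd"]

for team, total in sorted(cost_by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${total:,.0f} this month")
```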

That said, remember that it’s not about ditching the cloud entirely; it’s about optimizing your usage of it to maximize efficiency.

“The cloud is still great for experimentation and bursty training. But if inference is your core workload, get off the treadmill. Hybrid isn’t just cheaper … it’s smarter,” Khoury added. “Treat the cloud like a prototype, not the permanent home. Run the math. Talk to your engineers. The cloud will never tell you when it’s the wrong tool. But your AWS bill will.”


