by David Linthicum

Machine learning operations don’t belong with cloudops

analysis

Aug 23, 2019

3 mins

Giving systems enabled with machine learning to the cloud operations team to manage is not only a mistake, it’s dangerous

It’s Monday morning, and after a long weekend of system trouble the cloud operations team is discussing what happened. Several systems tied to a new, very advanced inventory management system enabled with machine learning had issues over the weekend. The postmortem concluded the following:

The batch process that moved raw data from the operational database to the training database failed, as did the auto-recovery process. An ops team member who was working over the weekend attempted to resubmit the job but caused not one but four partial updates, which left the training database in an unstable state.

This caused the knowledge models in the machine learning systems to train with bad data and required that the new information in the knowledge base be removed and the models rebuilt.

Also, several outside data feeds, such as pricing and tax data, were loaded into the training database at the same time. Although those loads worked fine, they too had to be backed out of the knowledge base because the operational data was not in a good state.

The system was unavailable for two days and the company lost $4 million, counting lost productivity, customer reactions, and PR issues.
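The "four partial updates" failure in the postmortem is the kind of problem an atomic, idempotent load is meant to prevent. The following is a minimal sketch of that idea, not the system described above: it assumes a SQLite database, and the table names (`loaded_batches`, `training_rows`), column names, and the `load_batch` helper are all hypothetical. Each batch is written in a single transaction, so a failed load leaves no rows behind, and a batch ID that has already been loaded is skipped on resubmission.

```python
import sqlite3


def make_training_db():
    """Create an in-memory stand-in for the training database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE loaded_batches (batch_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE training_rows ("
        " batch_id TEXT NOT NULL,"
        " sku TEXT NOT NULL,"
        " quantity INTEGER NOT NULL CHECK (quantity >= 0))"
    )
    return conn


def load_batch(conn, batch_id, rows):
    """Load one batch atomically.

    Returns True if the batch was loaded, False if this batch_id was
    already loaded (so a manual resubmit is a harmless no-op).
    Any database error rolls back the whole batch and is re-raised.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        already_loaded = conn.execute(
            "SELECT 1 FROM loaded_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already_loaded:
            return False
        conn.executemany(
            "INSERT INTO training_rows (batch_id, sku, quantity) VALUES (?, ?, ?)",
            [(batch_id, sku, quantity) for sku, quantity in rows],
        )
        conn.execute(
            "INSERT INTO loaded_batches (batch_id) VALUES (?)", (batch_id,)
        )
    return True


def count_rows(conn):
    """Number of rows currently visible in the training table."""
    return conn.execute("SELECT COUNT(*) FROM training_rows").fetchone()[0]
```

With this shape, a batch containing a bad row raises an error and leaves the table unchanged, and resubmitting a batch that already succeeded returns `False` instead of writing duplicates. A production pipeline would need the same two guarantees from whatever database and scheduler it actually uses.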

This is not 2025; this is today. As enterprises find more uses for “cheap and good” cloud-based machine learning systems, we’re finding that the systems that leverage machine learning are complex to operate. Ops teams do not expect this degree of difficulty and complexity, and are finding themselves undertrained, understaffed, and underfunded.

The assumption was that cloud operations teams could handle cloud-based databases, cloud-based storage, and cloud-based compute with a fairly easy transition. For the most part that’s been the case, considering that cloud-based systems are similar to traditional systems.

However, most operations teams have not yet seen systems based on machine learning. These systems have specialized purposes, as well as specialized components, such as databases and knowledge engines, that have to be monitored and managed in certain ways. This is where the current operations teams are failing.

The fix is pretty easy to understand, but most enterprises are not going to like it, considering it means either spending more dollars on ML cloudops or abandoning ML cloudops. Machine learning systems are technological chainsaws. If used carefully, they are highly effective. If mishandled, they can be dangerous. Failures can go undetected, and if the system automatically uses the resulting bad knowledge, you could end up with huge issues that may not be discovered until much damage is done. More risk than reward, it seems.
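One way to keep a failed load from silently feeding a model is a gate that runs before training and refuses to proceed when the data looks wrong. The sketch below is illustrative only: the `BatchStats` fields, the `training_gate` function, and the threshold defaults are assumptions, not anything prescribed by a particular ML platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchStats:
    """Summary of the most recent load into the training database (hypothetical)."""
    row_count: int
    null_count: int
    load_complete: bool


def training_gate(stats, expected_rows, max_null_rate=0.01, tolerance=0.05):
    """Return the reasons training should be blocked; an empty list means proceed."""
    problems = []
    if not stats.load_complete:
        problems.append("load did not complete")
    # Flag row counts that drift too far from what the upstream system reported.
    if expected_rows > 0:
        drift = abs(stats.row_count - expected_rows) / expected_rows
        if drift > tolerance:
            problems.append(f"row count off by {drift:.0%}")
    # Flag batches with an unusually high share of missing values.
    if stats.row_count > 0:
        null_rate = stats.null_count / stats.row_count
        if null_rate > max_null_rate:
            problems.append(f"null rate {null_rate:.1%} exceeds limit")
    return problems
```

A scheduler could call `training_gate` after every load and page a human instead of retraining whenever the returned list is non-empty. The thresholds here are placeholders that would have to be tuned for each data feed.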

Cloud Computing, Machine Learning, Analytics, Software Development

David S. Linthicum is an internationally recognized industry expert and thought leader. Dave has authored 13 books on computing, the latest of which is An Insider’s Guide to Cloud Computing. Dave’s industry experience includes tenures as CTO and CEO of several successful software companies, and upper-level management positions in Fortune 100 companies. He keynotes leading technology conferences on cloud computing, SOA, enterprise application integration, and enterprise architecture. Dave writes the Cloud Insider blog for InfoWorld. His views are his own.
