Machine learning operations don’t belong with cloudops

analysis
Aug 23, 2019 | 3 mins

Giving systems enabled with machine learning to the cloud operations team to manage is not only a mistake, it’s dangerous

Image credit: Shawn Carpenter

It’s Monday morning, and after a long weekend of system trouble, the cloud operations team is discussing what happened. Several systems tied to a new, very advanced inventory management system enabled with machine learning had issues over the weekend. The postmortem concluded the following:

  • The batch process that moves raw data from the operational database to the training database failed, as did the auto-recovery process. An ops team member working over the weekend attempted to resubmit the batch but caused not one, but four partial updates that left the training database in an unstable state.
  • As a result, the knowledge models in the machine learning systems trained on bad data. The new information had to be removed from the knowledge base and the models rebuilt.
  • Several outside data feeds, such as pricing and tax data, were loaded into the training database at the same time. Although those loads worked fine, they too had to be backed out of the knowledge base because the operational data was not in a good state.
  • The system was unavailable for two days and the company lost $4 million, considering lost productivity, customer reactions, and PR issues.
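One common guard against the kind of partial updates described in the postmortem is to load each batch inside a single database transaction and to record which batches have already been applied, so that a resubmit either completes fully or changes nothing. The sketch below is a minimal illustration using Python’s built-in sqlite3 module. The table names (applied_batches, training_rows), the columns, and the create_schema and load_batch functions are hypothetical; they are not taken from the system in the story.

```python
import sqlite3


def create_schema(conn):
    """Create the two hypothetical tables used by this sketch."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS applied_batches (batch_id TEXT PRIMARY KEY)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS training_rows ("
            "batch_id TEXT NOT NULL, sku TEXT NOT NULL, quantity INTEGER NOT NULL)"
        )


def load_batch(conn, batch_id, rows):
    """Load one batch atomically.

    Returns True if the batch was applied, False if it had already been
    applied (so a resubmit is a harmless no-op). Any error while inserting
    rolls back the whole batch, including the applied_batches marker.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        already = conn.execute(
            "SELECT 1 FROM applied_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already:
            return False
        conn.execute(
            "INSERT INTO applied_batches (batch_id) VALUES (?)", (batch_id,)
        )
        conn.executemany(
            "INSERT INTO training_rows (batch_id, sku, quantity) VALUES (?, ?, ?)",
            [(batch_id, sku, qty) for sku, qty in rows],
        )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    print(load_batch(conn, "2019-08-17", [("sku-1", 5), ("sku-2", 7)]))
    # Resubmitting the same batch changes nothing.
    print(load_batch(conn, "2019-08-17", [("sku-1", 5), ("sku-2", 7)]))
```

A real training pipeline would likely span several systems rather than one database, so this only shows the principle: a failed or repeated submit should never leave the training data half updated.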

This is not 2025; this is today. As enterprises find more uses for “cheap and good” cloud-based machine learning systems, we’re finding that the systems that leverage machine learning are complex to operate. Ops teams do not expect this degree of difficulty and complexity, and they are finding themselves undertrained, understaffed, and underfunded.

The assumption was that cloud operations teams could handle cloud-based databases, cloud-based storage, and cloud-based compute with a fairly easy transition. For the most part that’s been the case, because cloud-based systems are similar to traditional systems.

However, most operations teams have not yet seen systems based on machine learning. These systems have specialized purposes, as well as specialized components, such as databases and knowledge engines, that have to be monitored and managed in particular ways. This is where current operations teams are failing.

The fix is easy to understand, but most enterprises are not going to like it: it means either spending more on ML cloudops or abandoning ML cloudops. Machine learning systems are technological chainsaws. Used carefully, they are highly effective. Mishandled, they can be dangerous. Failures can go undetected, and if the system automatically uses the resulting bad knowledge, you could end up with huge issues that are not discovered until much damage is done. It seems there is more risk than reward.
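One way to keep bad data from being used automatically is to put a validation gate in front of training, so that a suspect batch is reported to a person instead of being learned from. The sketch below is only an illustration of that idea. The function names (validate_batch, train_if_clean), the row fields (sku, quantity), and the specific checks are assumptions; a real system would need checks chosen by people who understand its data.

```python
def validate_batch(rows, expected_min_rows):
    """Return a list of problems found; an empty list means the batch passed."""
    problems = []
    if len(rows) < expected_min_rows:
        problems.append(
            f"only {len(rows)} rows, expected at least {expected_min_rows}"
        )
    for i, row in enumerate(rows):
        if row.get("sku") in (None, ""):
            problems.append(f"row {i}: missing sku")
        qty = row.get("quantity")
        if not isinstance(qty, (int, float)) or qty < 0:
            problems.append(f"row {i}: bad quantity {qty!r}")
    return problems


def train_if_clean(rows, train, expected_min_rows=1):
    """Call train(rows) only if validation finds no problems.

    Returns (trained, problems) so the caller can alert a human
    rather than silently training on bad data.
    """
    problems = validate_batch(rows, expected_min_rows)
    if problems:
        return False, problems
    train(rows)
    return True, []


if __name__ == "__main__":
    batch = [{"sku": "sku-1", "quantity": 5}, {"sku": "", "quantity": -2}]
    trained, problems = train_if_clean(batch, train=lambda rows: None)
    print(trained, problems)
```

Checks like these catch only the failures someone thought to test for, which is part of why operating these systems takes specialized skills.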

David Linthicum

David S. Linthicum is an internationally recognized industry expert and thought leader. Dave has authored 13 books on computing, the latest of which is An Insider’s Guide to Cloud Computing. Dave’s industry experience includes tenures as CTO and CEO of several successful software companies, and upper-level management positions in Fortune 100 companies. He keynotes leading technology conferences on cloud computing, SOA, enterprise application integration, and enterprise architecture. Dave writes the Cloud Insider blog for InfoWorld. His views are his own.
