How new tools and techniques are extracting business insights from massive data sets

You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech, and classify images. Very cool, you think, but how does that relate to your business? Well, consider how these companies use machine learning today:

- A payments processing company detects fraud hidden among more than a billion transactions in real time, reducing losses by $1 million per month.
- An auto insurer predicts losses from insurance claims using detailed geospatial data, enabling them to model the business impact of severe weather events.
- Working with data produced by vehicle telematics, a manufacturer uncovers patterns in operational metrics and uses them to drive proactive maintenance.

Two themes unify these success stories. First, each application depends on big data: a large volume of data, in a variety of formats and at high velocity. Second, in each case, machine learning uncovers new insights and drives value.

The technical foundations of machine learning are more than 50 years old, but until recently few people outside of academia were aware of its capabilities. Machine learning requires a lot of computing power; early adopters simply lacked the infrastructure to make it cost-effective. Several converging trends contribute to the recent surge of interest and activity in machine learning:

- Moore's Law radically reduced computing costs; massive computing power is now widely available at minimal cost.
- New and innovative algorithms provide faster results.
- Data scientists have accumulated the theory and practical knowledge to apply machine learning effectively.

Above all, the tsunami of big data creates analytic problems that simply cannot be solved with conventional statistics. Necessity is the mother of invention, and old methods of analysis no longer work in today's business environment.
Machine learning techniques

There are hundreds of different machine learning algorithms. A recent paper benchmarked more than 150 algorithms for classification alone. This overview covers the key techniques that data scientists use to drive value today.

Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require prior knowledge of an outcome. For example, if we work with historical data from a marketing campaign, we can classify each impression by whether or not the prospect responded, or we can determine how much they spent. Supervised techniques provide powerful tools for prediction and classification.

Frequently, however, we do not know the "ultimate" outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event. In this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual and flag them for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome but still want to extract useful insights from the data.

The most widely used supervised learning techniques include the following:

- Generalized linear models (GLM) — an advanced form of linear regression that supports different probability distributions and link functions, enabling the analyst to model the data more effectively. Enhanced with a grid search, GLM is a hybrid of classical statistics and the most advanced machine learning.
- Decision trees — a supervised learning method that learns a set of rules that split a population into progressively smaller segments that are homogeneous with respect to the target variable.
- Random forests — a popular ensemble learning method that trains many decision trees, then averages across the trees to develop a prediction. This averaging process produces a more generalizable solution and filters out random noise in the data.
- Gradient boosting machine (GBM) — a method that produces a prediction model by training a sequence of decision trees, where successive trees adjust for prediction errors in previous trees.
- Deep learning — an approach that models high-level patterns in data as complex multilayered networks. Because it is the most general way to model a problem, deep learning has the potential to solve the most challenging problems in machine learning.

Key techniques for unsupervised learning include the following:

- Clustering — a technique that groups objects into segments, or clusters, that are similar to one another on many metrics. Customer segmentation is an example of clustering in action. There are many different clustering algorithms; the most widely used is k-means.
- Anomaly detection — the process of identifying unexpected events or outcomes. In fields like security and fraud, it is not possible to exhaustively investigate every transaction; we need to systematically flag the most unusual transactions. Deep learning, a technique discussed previously under supervised learning, can also be used for anomaly detection.
- Dimension reduction — the process of reducing the number of variables being considered. As organizations capture more data, the number of possible predictors (or features) available for prediction expands rapidly. Simply identifying which data provides information value for a particular problem is a significant task. Principal components analysis (PCA) evaluates a set of raw features and reduces them to indices that are independent of one another.

While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model.
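The progression from single decision trees to boosted ensembles is easier to see in code. Below is a minimal, hypothetical sketch of a gradient boosting machine built from one-variable decision stumps in plain Python: each new stump is fit to the residual errors left by the trees before it. The function names, parameters, and tiny data set are illustrative assumptions, not from any particular library; a real project would use a tuned implementation.

```python
def fit_stump(xs, residuals):
    """Find the one-variable split whose piecewise means minimize squared error."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue  # degenerate split: all points fall on one side
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def gradient_boost(xs, ys, rounds=50, rate=0.3):
    """Train a sequence of stumps; each one fits the current residuals."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + rate * stump(x) for p, x in zip(preds, xs)]
    # The ensemble prediction is the scaled sum of all stump outputs
    return lambda x: sum(rate * s(x) for s in stumps)
```

For example, `gradient_boost([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])` recovers a step-shaped target: predictions come out close to 1 below the step and close to 5 above it. Training many independent trees and averaging them, instead of chaining them on residuals, would give the random forest variant.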
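The k-means algorithm mentioned under clustering is likewise simple enough to sketch in a few lines of plain Python. This hypothetical one-dimensional version alternates between assigning each point to its nearest center and recomputing each center as the mean of its cluster; the naive seeding and toy data are assumptions for illustration, and production systems use library implementations with better initialization.

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm for k-means on one-dimensional data."""
    centers = list(points[:k])  # naive seeding: the first k points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

Run on two well-separated groups, such as `[1.0, 1.2, 0.8, 9.0, 9.5, 8.5]` with `k=2`, the centers converge near the two group means. In a customer segmentation setting, the "points" would be multi-dimensional feature vectors and the distance a Euclidean norm, but the loop is the same.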
For this reason, high performance is essential: it enables the data scientist to try more options in less time.

Machine learning in action

Across industries and business disciplines, businesses use machine learning to increase revenue or reduce costs by performing tasks more efficiently than humans can unaided. The seven examples below demonstrate the versatility and wide applicability of machine learning.

Preventing fraud. With more than 150 million active digital wallets and more than $200 billion in annual payments, PayPal leads the online payments industry. At that volume, even low rates of fraud can be very costly; early in its corporate history, the company was losing $10 million per month to fraudsters. To address the problem, PayPal built a top team of researchers, who used state-of-the-art machine learning techniques to build models that can identify fraudulent payments in real time.

Targeting digital display. Ad-tech company Dstillery uses machine learning to help companies like Verizon and Williams-Sonoma target digital display advertising on real-time bidding platforms. Using data collected about an individual's browsing history, visits, clicks, and purchases, Dstillery runs predictions thousands of times per second while handling hundreds of campaigns at a time; this enables the company to significantly outperform human marketers at targeting ads for optimal impact per dollar spent.

Recommending content. For customers of its X1 interactive TV service, Comcast provides personalized real-time recommendations for content based on each customer's prior viewing habits. Working with billions of history records, Comcast uses machine learning techniques to develop a unique taste profile for each customer, then groups customers with common tastes into clusters. For each cluster of customers, Comcast tracks and displays the most popular content in real time, so customers can see what content is currently trending. The net result: better recommendations, higher utilization, and more satisfied customers.

Building better cars. New cars built by Jaguar Land Rover have 60 onboard computers that produce 1.5GB of data every day across more than 20,000 metrics. Engineers at the company use machine learning to distill the data and understand how customers actually use the vehicle. By working with true usage data, designers can predict part failure and potential safety issues; this helps them engineer vehicles appropriately for expected conditions.

Targeting best prospects. Marketers use "propensity to buy" models to determine the best sales and marketing prospects and the best products to offer. With a vast array of products to offer, from routers to cable TV boxes, Cisco's marketing analytics team trains 60,000 models and scores 160 million prospects in a matter of hours. By experimenting with a range of techniques from decision trees to gradient-boosted machines, the team has greatly improved the accuracy of the models. That translates into more sales, fewer wasted sales calls, and more satisfied sales reps.

Optimizing media. NBC Universal stores hundreds of terabytes of media files for international cable TV distribution; efficient management of this online resource is necessary to support distribution to international clients. The company uses machine learning to predict future demand for each item based on a combination of measures, and it moves media with low predicted demand to low-cost offline storage. These predictions are far more effective than arbitrary rules based on single measures, such as file age. As a result, NBC Universal reduces its overall storage costs while maintaining client satisfaction.

Improving health care delivery. For hospitals, patient readmission is a serious matter, and not simply out of concern for the patient's health and welfare.
Medicare and private insurers penalize hospitals with a high readmission rate, so hospitals have a financial stake in making sure they discharge only those patients who are well enough to stay healthy. The Carolinas Healthcare System (CHS) uses machine learning to construct risk scores for patients, which case managers factor into their discharge decisions. This system enables better utilization of nurses and case managers, prioritizing patients according to the risk and complexity of each case. As a result, CHS has lowered its readmission rate from 21 percent to 14 percent.

Machine learning software requirements

Software for machine learning is widely available, and organizations seeking to develop a capability in this area have many options. Consider the following requirements when evaluating machine learning software:

- Speed
- Time to value
- Model accuracy
- Easy integration
- Flexible deployment
- Usability
- Visualization

Let's review each of these in turn.

Speed. Time is money, and fast software makes your highly paid data scientists more productive. Practical data science is often iterative and experimental; a project may require hundreds of tests, so small differences in speed translate to dramatic improvements in efficiency. Given today's data volumes, high-performance machine learning software must run on a distributed platform, so you can spread the workload over many servers.

Time to value. Runtime performance is only one part of total time to value. The key metric for your business is the amount of time needed to complete a project, from data ingestion to deployment. In practical terms, this means that your machine learning software should integrate with popular Hadoop and cloud formats, and it should export predictive models as code that you can deploy anywhere in your organization.

Model accuracy. Accuracy matters, especially when the stakes are high. For applications like fraud detection, small improvements in accuracy can produce millions of dollars in annual savings. Your machine learning software should empower your data scientists to use all of your data, rather than forcing them to work with samples.

Easy integration. Your machine learning software must co-exist with a complex stack of big data software in production. Ideally, look for machine learning software that runs on commodity hardware and does not require specialized HPC machines or exotic hardware such as GPUs.

Flexible deployment. Your machine learning software should support a range of deployment options, including co-location in Hadoop or in a freestanding cluster. If cloud is part of your architecture, look for software that runs on a variety of cloud platforms, such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

Usability. Data scientists use many different software tools to perform their work, including analytic languages like R, Python, and Scala. Your machine learning platform should integrate easily with the tools your data scientists already use. In addition, well-designed machine learning algorithms include time-saving features such as the following:

- Ability to treat missing data
- Ability to transform categorical data
- Regularization techniques to manage complexity
- Grid search capability for automated test and learn
- Automatic cross-validation (to avoid overfitting)

Visualization. Successful predictive modeling requires collaboration between the data scientist and business users. Your machine learning software should provide business users with tools to visually evaluate the quality and characteristics of the predictive model.

Introducing H2O

H2O is a scalable machine learning platform for data scientists and business analysts. Unlike conventional software, H2O provides a combination of extraordinary math and high performance in a free and open source platform.

H2O is fast.
H2O runs on a distributed in-memory framework. Your machine learning workload runs entirely in memory, avoiding the disk I/O bottleneck. Plus, you can distribute the workload across as many servers as necessary for the performance you need.

H2O is easy to implement. H2O runs on a variety of platforms: Windows, Mac, and Linux clusters; on Cloudera, MapR, and Hortonworks Hadoop under YARN; on Apache Spark; and on Amazon EC2, Google Compute Engine, and Microsoft Azure. To deploy H2O, you simply download the software to your preferred platform and install it.

H2O builds better models. H2O's machine learning algorithms detect complex interactions that would be difficult to find using conventional methods, such as linear regression. Since H2O is horizontally scalable, you can perform analysis with all of your data in a single pass. One of the leading insurance carriers in the United States used to perform retention analysis in SAS/STAT. To fit the data into SAS, they had to run the analysis separately for each state, which took an entire weekend. With H2O, they run the analysis only once. By modeling their entire book of business at once, they identify patterns that cannot be detected from state-level analysis. The result: more accurate models and a more effective retention program.

H2O integrates easily with your big data stack. H2O is open source software, which means you can examine the source code and, if necessary, modify it to work in your environment. H2O works with the leading Hadoop distributions, and it runs under YARN. For example, PayPal uses H2O because it works seamlessly with other big data frameworks, including Hadoop distributions and open source languages. Integral Ad Science uses H2O as part of a complex stack of applications — including Cloudera Hadoop, Spark, HBase, MySQL, Kafka, Storm, Hive, Impala, Pig, Java, JavaScript, Python, and R — to understand how consumers interact with digital advertising. And Comcast uses H2O together with Spark to deliver personalized recommendations for video content to its subscribers; the system updates program recommendations every 20 seconds through Spark Streaming.

H2O puts insight into production. Predictive models provide value to the organization when they drive operational decisions. Unfortunately, commercial software bottles up those insights in a proprietary package that can take months to put into production. H2O exports POJOs — Plain Old Java Objects — that are easy to integrate into an operational pipeline.

H2O simplifies machine learning. Machine learning used to require a lot of custom programming — even building algorithms from scratch. In addition to prebuilt and pretested algorithms, H2O includes many other features that save the data scientist valuable time, including missing value treatments, categorical data handling, regularization capability, automated grid search, and automatic cross-validation.

H2O is true open source software. H2O is an open source project of H2O.ai, which distributes the software under an Apache license. There are no gimmicks, such as stripped-down "community editions" or "freemium" software you have to pay for after an evaluation. H2O.ai offers commercial support to enterprises seeking a defined SLA, private JIRA, access to H2O.ai's team of data scientists, H2O Quick Start, and H2O DevOps. (If you're interested in taking part in our mission to bring machine learning to the masses, please check out our GitHub repositories.)

At H2O, we believe that machine learning will become as ubiquitous, easy to use, and powerful as search. Google, Yahoo, and others helped unleash the power of the Web for ordinary users by making it easy to find relevant results from a seemingly limitless number of pages. Similarly, machine learning will allow businesses of all kinds to tap into the power of modern data sets by making it easy to get to valuable insights.
However, we’re obviously not there yet. Getting there will require further investments — both from machine learning developers like H2O, and from business users whose volumes of data and needs for analysis outstrip conventional methods. SriSatish Ambati is co-founder and CEO of H2O.ai. New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com. Data ManagementAnalyticsPredictive Analytics