Familiarize yourself with data mining functions and algorithms

Data mining has its origins in conventional artificial intelligence, machine learning, statistics, and database technologies, and much of its terminology and concepts derive from these fields. This article, an excerpt from Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, and Sunil Venkayala (Morgan Kaufmann, 2007), introduces data mining concepts for those new to the field and familiarizes data mining experts with the terminology and capabilities specific to the Java Data Mining API (JDM). It details and expands the concepts associated with mining functions and algorithms by example. Although we discuss higher-level details of the algorithms to give some intuition about how each works, a detailed discussion of data mining algorithms is beyond the scope of this article.

Note: This excerpt was printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2007. Java Data Mining: Strategy, Standard, and Practice by Mark F. Hornick, Erik Marcade, Sunil Venkayala. For more information about this title and other similar books, please visit www.mkp.com.

This article explores data mining concepts in financial services using a business problem faced by a hypothetical consumer bank called ABCBank. ABCBank provides banking and financial services for individuals and corporate businesses, with branch offices throughout the country and online banking services for its customers. ABCBank offers products such as checking, savings, and certificate accounts, many types of credit cards, loans, and other financial services. Its customer base is diverse and nationally distributed; customer demographics vary widely in income level, education and professional qualifications, ethnic background, age group, and family status.
This article introduces a business problem faced by ABCBank, its solution, and the concepts associated with the related mining function. While developing a solution, we discuss the concepts related to the data mining technique used to solve it. We follow a common description pattern: problem definition, solution approach, data description, available settings for tuning the solution, and an overview of relevant algorithms. For supervised functions, we also describe how to evaluate a model's performance and apply a model to obtain prediction results. For unsupervised functions, we describe model content and how to use models to solve the problem.

Problem definition: How to reduce customer attrition

ABCBank is losing customers to its competitors and wants to gain a better understanding of the type of customers who are closing their accounts. ABCBank also wants to be proactive in retaining existing customers by taking appropriate measures to improve customer satisfaction. This is commonly known as the customer attrition problem in the financial services industry.

Solution approach: Predict customers who are likely to attrite

ABCBank can use customer data collected in its transactional and analytical databases to find the patterns associated with customers likely, or unlikely, to attrite. Using the data mining classification function, ABCBank can predict customers who are likely to attrite and understand the characteristics, or profiles, of such customers. Gaining a better understanding of customer behavior enables ABCBank to develop business plans to retain customers.

Classification is used to assign cases, such as customers, to discrete values, called classes or categories, of the target attribute. The target is the attribute whose values are predicted using data mining. In this problem, the target is the attribute attrite, with two possible values: Attriter and Non-attriter.
When referring to the model build dataset, the value Attriter indicates that the customer closed all accounts, and Non-attriter indicates that the customer has at least one account at ABCBank. When referring to the prediction in the model apply dataset, the value Attriter indicates that the customer is likely to attrite, and Non-attriter indicates that the customer is not likely to attrite. The prediction is often associated with a probability indicating how likely the customer is to attrite. When a target attribute has only two possible values, the problem is referred to as a binary classification problem; when it has more than two possible values, the problem is known as a multiclass classification problem.

Data specification: CUSTOMERS dataset

An important step in any data mining project is to collect related data from enterprise data sources. Identifying which attributes should be used for data mining is one of the challenges faced by the data miner and relies on appropriate domain knowledge of the data. In this example, we introduce a subset of possible customer attributes, as listed in Table 1. In real-world scenarios, there may be hundreds or even thousands of customer attributes available in enterprise databases.

Table 1 lists physical attribute details of the CUSTOMERS dataset: name, data type, and description. The attribute name refers to either a column name of a database table or a field name of a flat file. The attribute data type refers to the allowed type of values for that attribute. JDM defines integer, double, and string data types, which are commonly used data types for mining; JDM conformance rules allow a vendor to add more data types if required. The attribute description can be used to explain the meaning of the attribute or describe the allowed values. In general, physical data characteristics are captured by database metadata.

Table 1. Customers Table physical attribute details

  Attribute name    Data type  Attribute description
  CUST_ID           INTEGER    Unique customer identifier
  NAME              STRING     Name of the customer
  ADDRESS           STRING     Address of the customer
  CITY              STRING     City of residence
  COUNTY            STRING     County
  STATE             STRING     State
  EDU               STRING     Educational level, e.g., diploma, bachelor's, master's, Ph.D.
  MAR_STATUS        STRING     Marital status, e.g., married, single, widowed, divorced
  OCCUPATION        STRING     Occupation of the customer, e.g., clerical, manager, sales
  INCOME            DOUBLE     Annual income in thousands of dollars
  ETHNIC_GROUP      STRING     Ethnic group
  AGE               DOUBLE     Age
  CAP_GAIN          DOUBLE     Current capital gains or losses
  SAV_BALANCE       DOUBLE     Average monthly savings balance
  CHECK_BALANCE     DOUBLE     Average monthly checking balance
  RETIRE_BALANCE    DOUBLE     Current retirement account balance
  MORTGAGE_AMOUNT   DOUBLE     Current mortgage/home loan balance
  NAT_COUNTRY       STRING     Native country
  CREDIT_RISK       STRING     Relative credit risk, e.g., high, medium, low
  ATTRITE           STRING     The target attribute indicating whether a customer will attrite; values are "attriter" and "non-attriter"

Users may also specify logical attribute characteristics specific to data mining. For example, physical attribute names in the table or file can be cryptic, such as HHSIZE for household size, the number of people living as one family. Users can map physical names to logical names that are more descriptive and hence easier to understand. Logical data characteristics also include the specification of data mining attribute type, attribute usage type, and data preparation type, which indicate how these attributes should be interpreted in data mining operations. Table 2 lists the logical data specification details for the CUSTOMERS dataset shown in Table 1.

The attribute type indicates the attribute data characteristics, such as whether the attribute should be treated as numerical, categorical, or ordinal. Numerical attributes are those whose values should be treated as continuous numbers.
Categorical attributes are those whose values correspond to discrete, nominal categories. Ordinal attributes also have discrete values, but their order is significant. In Table 2, the attribute type column specifies attributes such as city, county, state, education, and marital status as categorical. The attribute capital gains is numerical, as it has continuous data values, such as $12,500.94. The attribute credit risk is ordinal, as it has high, medium, and low as ordered relative values.

The attribute usage type specifies whether an attribute is active (used as input to mining), inactive (excluded from mining), or supplementary (brought forward with the input values but not used explicitly for mining). In Table 2, the usage type column specifies the attributes customer ID, name, and address as inactive because these attributes are identifiers or will not generalize to predict whether a customer is an attriter. All other attributes are active and used as input for data mining. In this example, we have not included supplementary attributes. However, consider a derived attribute computed as the capital gains divided by the square of age, called ageCapitalGainRatio. From the user perspective, if the derived attribute ageCapitalGainRatio appears in a model rule, it may be difficult to interpret the underlying values as they relate to the business. In such a case, the model can reference supplementary attributes, for example, age and capital gains. Although these supplementary attributes are not directly used in the model build, they can be presented in model details to facilitate rule understanding using the corresponding values of age and capital gains.

Table 2. Customers Table logical data specification

  Attribute name    Logical name           Attribute type  Usage type  Preparation
  CUST_ID           Customer ID                            Inactive
  NAME              Name                                   Inactive
  ADDRESS           Address                                Inactive
  CITY              City                   Categorical     Active      Prepared
  COUNTY            County                 Categorical     Active      Prepared
  STATE             State                  Categorical     Active      Prepared
  EDU               Education              Categorical     Active      Prepared
  MAR_STATUS        Marital status         Categorical     Active      Prepared
  OCCUPATION        Occupation             Categorical     Active      Prepared
  INCOME            Annual income level    Numerical       Active      Not prepared
  ETHNIC_GROUP      Ethnic group           Categorical     Active      Prepared
  AGE               Age                    Numerical       Active      Not prepared
  CAP_GAIN          Capital gains          Numerical       Active      Not prepared
  SAV_BALANCE       Avg. savings balance   Numerical       Active      Not prepared
  CHECK_BALANCE     Avg. checking balance  Numerical       Active      Not prepared
  RETIRE_BALANCE    Retirement balance     Numerical       Active      Not prepared
  MORTGAGE_AMOUNT   Home loan balance      Numerical       Active      Not prepared
  NAT_COUNTRY       Native country         Categorical     Active      Prepared
  CREDIT_RISK       Credit risk            Ordinal         Active      Prepared
  ATTRITE           Attrite                                Target

In addition to the usual ETL (Extraction, Transformation, and Loading) operations used for loading and transforming data, data mining can involve algorithm-specific data preparation, such as binning and normalization. One may choose to prepare data manually to leverage domain-specific knowledge or to fine-tune data to improve results. The data preparation type is used to indicate whether data is manually prepared. In Table 2, the preparation column lists which attributes are already prepared for model building. (Note: ETL is the process of extracting data from operational or external data sources; transforming the data, which includes cleansing, aggregation, summarization, integration, and other transformations; and loading the data into a data mart or data warehouse.)

Specify settings: Fine-tune the solution to the problem

After exploring attribute values in the CUSTOMERS dataset, the data miner found some oddities in the data.
The capital gains attribute has some extreme values that are out of range from the general population. Figure 1 illustrates the distribution of capital gains values in the data. Note that very few customers have capital gains greater than $1,000,000; in this example, such values are treated as outliers. Outliers are values of a given attribute that are unusual compared to the rest of that attribute's values. For example, if customers have capital gains over 1 million dollars, these values could skew mining results involving the capital gains attribute.

In this example, the capital gains attribute has a valid range of $2,000 to $1,000,000 based on the value distribution shown in Figure 1. In JDM, we use outlier identification settings to specify the valid range, or interval, used to identify outliers for the model building process. Some data mining engines (DMEs) automatically identify and treat outliers as part of the model building process. JDM allows data miners to specify an outlier treatment option per attribute to inform algorithms how to treat outliers in the build data. The outlier treatment specifies whether attribute outlier values are handled as missing values (asMissing) or as the original values (asIs). Based on the problem requirements and vendor-specific algorithm implementations, data miners can either explicitly choose the outlier treatment or leave it to the DME.

In assessing the data, the data miner noticed that the state attribute has some invalid entries. All ABCBank customers who are U.S. residents must have a state value that is a two-letter abbreviation of one of the 50 states or the District of Columbia. To indicate valid attribute values to the model build, a category set can be specified in the logical data specification. The category set characterizes the values found in a categorical attribute.
In this example, the category set for the state attribute contains the values {AL, AK, AZ, …, WY}. State values that are not in this set are considered invalid during the model build and may be treated as missing values or cause execution to terminate.

Our CUSTOMERS dataset has a disproportionate number of Non-attriters: 20 percent of the cases are Attriters and 80 percent are Non-attriters. To build an unbiased model, the data miner balances the input dataset, using stratified sampling, to contain an equal number of cases for each target value. In JDM, prior probabilities are used to represent the original distribution of attribute values. The prior probabilities should be specified when the original target value distribution is changed, so that the algorithm can account for them appropriately. However, not all algorithms support prior probability specification, so you will need to consult a given tool's documentation.

ABCBank management informed the data miner that it is more expensive when an attriter is misclassified, that is, predicted as a Non-attriter. This is because losing an existing customer and acquiring a new one costs much more than trying to retain an existing customer. For this, JDM allows the specification of a cost matrix to capture the costs associated with possible false predictions. A cost matrix is an N x N table that defines the cost associated with incorrect predictions, where N is the number of possible target values. In this example, the data miner specifies a cost matrix indicating that predicting a customer will not attrite when in fact he will is three times costlier than predicting the customer will attrite when he actually will not. The cost matrix for this problem is illustrated in Figure 2.

In this example, we are more interested in the customers who are likely to attrite, so the Attriter value is considered the positive target value, that is, the value we are interested in predicting.
The positive target value is necessary when computing the lift and ROC test metrics; the Non-attriter value is considered the negative target value. This allows us to use the terminology false positive and false negative. A false positive (FP) occurs when a case is known to have the negative target value but the model predicts the positive target value. A false negative (FN) occurs when a case is known to have the positive target value but the model predicts the negative target value. True positives are the cases where the predicted and actual positive target values agree, and true negatives are the cases where the predicted and actual negative target values agree. In Figure 2, note that the false negative cost is $150, the false positive cost is $50, and the diagonal elements all have cost 0, because there is no cost for correct predictions.

Select algorithm: Find the best fit algorithm

Because JDM defines algorithm selection as an optional step, most data mining tools provide a default or preselected algorithm for each mining function. Some data mining tools automate finding the most appropriate algorithm and its settings based on the data and user-specified problem characteristics. If the data miner does not specify the algorithm to be used, the JDM implementation chooses one.

If the JDM implementation does not select the algorithm automatically, or the data miner wants control over the algorithm settings, the user can explicitly select the algorithm and specify its settings. Selecting the right algorithm and settings benefits from data mining expertise, knowledge of the available algorithms, and often experimentation to determine which algorithm best fits the problem. Data miners will often try different algorithms and settings, then inspect the resulting models and test results to select the best combination.
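The four outcome counts just defined can be tallied directly by comparing predicted and actual labels. A minimal Java sketch (the class and method names are illustrative, not part of the JDM API):

```java
// Tally true/false positives and negatives for a binary classifier.
// "Attriter" would be the positive target value in this article's example.
public class OutcomeCounter {
    public final int tp, fp, tn, fn;

    public OutcomeCounter(String[] actual, String[] predicted, String positive) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean actualPos = actual[i].equals(positive);
            boolean predPos = predicted[i].equals(positive);
            if (actualPos && predPos) tp++;        // true positive
            else if (!actualPos && predPos) fp++;  // false positive
            else if (!actualPos && !predPos) tn++; // true negative
            else fn++;                             // false negative
        }
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    // Fraction of cases predicted correctly.
    public double accuracy() {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
}
```

These four counts are exactly the cells of the confusion matrix discussed later in this article.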
This section provides a high-level overview of the algorithms supported by JDM for classification problems: decision tree, naïve Bayes (NB), support vector machine (SVM), and feed-forward neural networks.

Decision tree

The decision tree algorithm is one of the most popular algorithms because it is easy to understand how it makes predictions. A decision tree produces rules that not only explain how or why a prediction was made, but are also useful in segmenting a population, that is, showing which groupings of cases produce a certain outcome. Decision trees are widely used for classification, and some implementations also support regression. In this section, we give an overview of the decision tree algorithm and discuss the concepts behind its settings as defined in JDM.

Overview

Decision tree models are a lot like the game 20 Questions, in which a player asks a series of questions of a person concealing the name of an object. These questions allow the player to keep narrowing the space of possible objects; when the space is sufficiently constrained, a guess can be made about the name of the object. In playing 20 Questions, we rely on a vast range of experience acquired over many years to know which questions to ask and what the likely outcome is. With decision trees, the algorithm looks over a constrained set of experience, that is, the dataset, and determines which questions can be asked to produce the right answer, that is, to classify each case correctly.

In this example, let us assume the input dataset has only three active attributes from the CUSTOMERS dataset introduced earlier (age, capital gains, and average savings account balance) and 10 customer cases. Each case has a known target value, as shown in Table 3. Note that 5 out of 10 customers attrite; hence there is a 50 percent chance that a randomly selected customer will attrite.
Using the attribute details in this dataset, a decision tree algorithm can learn data patterns and build a tree as shown in Figure 3.

Figure 3. Decision tree for customer attrition.

In a decision tree, each node split is based on an attribute condition that partitions the data. In this example, the tree root node, node-1, shown in Figure 3, represents all 10 customers in the dataset. From these 10 customer cases, the algorithm learns that customers whose age is greater than 36 are likely to attrite. So node-1 splits the data into node-2 and node-3 based on the customer's age. Node-3 further splits its data into node-4 and node-5 based on the customer's savings account balance.

Each tree node has an associated rule that predicts the target value with a certain confidence and support. The confidence value is a measure of the likelihood that the tree node will correctly predict the target value; it is the ratio between the number of cases with correct predictions in the node and the total number of cases assigned to that node. The support value is a measure of how many cases were assigned to that node from the build dataset; it can be expressed as a count or as the ratio between the number of cases in the node and the total number of cases in the build dataset.

Table 3. Customer attrition build data

  Customer ID  Age  Capital gain  Avg. savings balance  Attrite
  1            41   $4,500        $11,500               Attriter
  2            35   $15,000       $3,000                Non-attriter
  3            26   $3,400        $21,500               Attriter
  4            37   $6,100        $36,000               Attriter
  5            32   $14,500       $7,000                Non-attriter
  6            40   $2,500        $15,000               Attriter
  7            30   $11,000       $6,000                Non-attriter
  8            21   $4,100        $2,000                Non-attriter
  9            28   $10,000       $5,500                Non-attriter
  10           27   $7,500        $31,500               Attriter

Table 4 lists tree node details, such as node ID, rule, prediction, the number of cases that belong to the node, and the confidence and support of the rule. For example, node-2 has three cases (1, 4, and 6) that satisfy the predicate age > 36, and all of them are attriters; hence this node's confidence value is 3/3 = 1, or 100 percent.
However, only 3 out of 10 cases support the rule defined by node-2; hence the support value is 3/10 = 0.3. Because node-2 has a confidence value of 1, it is called a pure node, and no further splits can be made. Node-3 can be split further because its confidence value is less than 1 (5/7 = 0.71), and confidence can be improved by splitting on the average savings balance attribute, as shown in Table 4. In this tree, nodes 2, 4, and 5 are called leaf nodes because they have no child nodes.

Table 4. Tree node details table

  Node  Rule                                     Prediction    #Cases  Confidence   Support
  1     (root)                                   Attriter      10      5/10 = 0.5   10/10 = 1.0
  2     Age > 36                                 Attriter      3       3/3 = 1.0    3/10 = 0.3
  3     Age <= 36                                Non-attriter  7       5/7 = 0.71   7/10 = 0.7
  4     Age <= 36 and savings balance < 21,500   Non-attriter  5       5/5 = 1.0    5/10 = 0.5
  5     Age <= 36 and savings balance >= 21,500  Attriter      2       2/2 = 1.0    2/10 = 0.2

Algorithm settings

Algorithm settings allow users to exert finer control over the algorithm to attain better results during the build process. Decision tree models can be extremely accurate on the build data if allowed to overfit it, which happens when the algorithm builds deeper trees with rules specific to even individual cases. Overfit models give very good accuracy on the build data but do not generalize well to new data, resulting in decreased predictive accuracy.

To avoid overfitting, users can apply stopping criteria and pruning techniques. Algorithms typically iterate over the build data, learning the patterns that exist in the data or making finer distinctions, and some could continue this iteration practically indefinitely. As such, algorithms often provide stopping criteria, which tell the algorithm when to stop building the model. In the case of a decision tree algorithm, stopping criteria are used to avoid model overfitting and to control tree size.
Decision tree stopping criteria include the maximum depth of the tree, to avoid deep trees with too many predicates; the minimum leaf node size, to avoid tree nodes with low support; the maximum confidence, to avoid pure nodes; and the minimum decrease in impurity, to avoid node splits that yield only a minimal increase in predictive accuracy. Users can specify one or more of these stopping criteria, and the tree will grow until the first stopping criterion is met.

Pruning is the process of removing the less significant tree nodes, for example, those with insufficient support. There are two types of pruning: pre-pruning and post-pruning. Pre-pruning avoids insignificant node splits while building the tree by measuring the goodness of each split. Post-pruning removes the insignificant nodes after building a fully grown tree. Different measures, called tree homogeneity metrics, are used to define the goodness of a node split, such as Gini, entropy, mean absolute deviation, mean square error, and misclassification ratio. Tree homogeneity metrics are also known as information gain metrics.

Naïve Bayes

The naïve Bayes algorithm is one of the fastest classification algorithms. It produces results comparable to, and often better than, those of other classification algorithms, and it works well with large volumes of data.

Overview

Naïve Bayes is based on Bayes' Theorem and assumes that the predictor attributes are conditionally independent of each other with respect to the target attribute. This assumption significantly reduces the number of computations required to predict a target value, which is why the naïve Bayes algorithm performs well with large volumes of data. The naïve Bayes algorithm involves computing the probability of each combination of target and predictor attribute values. To control the number of such combinations, attributes that have either continuous values or a high number of distinct values are typically binned.
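The binning step just mentioned can be as simple as mapping each continuous value to the interval it falls in. A hypothetical Java sketch (the class name and the choice of bin boundaries are illustrative):

```java
// Map a continuous value to a bin index, given ascending upper bounds.
// Values above the last bound fall into the final, open-ended bin.
public class Binner {
    private final double[] upperBounds; // ascending, e.g. {35} for two age bins

    public Binner(double[] upperBounds) {
        this.upperBounds = upperBounds.clone();
    }

    public int binOf(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) return i;
        }
        return upperBounds.length; // open-ended top bin
    }
}
```

With an upper bound of 35, ages 25 and 41 fall into bins 0 and 1 respectively, matching the two-bin age scheme used in this article's example.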
In this example, to simplify the description of the naïve Bayes algorithm, consider two attributes, age and savings balance, from the CUSTOMERS dataset (Table 3). To simplify further, each attribute is binned into two values: for age, bin-1 contains values less than or equal to 35 and bin-2 contains values greater than 35; for savings balance, bin-1 contains values less than or equal to $20,000 and bin-2 contains values greater than $20,000. The naïve Bayes algorithm computes the probability of a target value for a given attribute value using the cases in the build dataset. In this example, we have two attributes with two binned values each and a binary target.

Listing 1 shows the eight probabilities that are computed as part of the naïve Bayes model build. Using these probability values, the naïve Bayes algorithm computes the most probable target value for a new case. For a new customer whose age = 25 and savings balance = $13,300, the scores for Attriter and Non-attriter are computed as shown in Listing 2. Note that in Listing 2, P(Attriter) and P(Non-attriter) are the prior probabilities of the target values, specified as input to the model build. For this new customer, the Non-attriter score (0.31) is greater than the Attriter score (0.03), and hence the model predicts this customer to be a Non-attriter.

Listing 1. Naïve Bayes algorithm computation of probabilities using build data (SB = savings balance)

  P(age <= 35 | Attriter)        = 2/6 = 0.33
  P(age <= 35 | Non-attriter)    = 4/6 = 0.67
  P(age > 35 | Attriter)         = 3/4 = 0.75
  P(age > 35 | Non-attriter)     = 1/4 = 0.25
  P(SB <= 20,000 | Attriter)     = 3/7 = 0.43
  P(SB <= 20,000 | Non-attriter) = 4/7 = 0.57
  P(SB > 20,000 | Attriter)      = 3/3 = 1.00
  P(SB > 20,000 | Non-attriter)  = 0/3 = 0.00

Listing 2. Naïve Bayes algorithm computing probability for a new case

  Score that the customer is an Attriter, given age = 25 and SB = $13,300:
  P(Attriter | age = 25 and SB = $13,300)
      = P(age <= 35 | Attriter) x P(SB <= 20,000 | Attriter) x P(Attriter)
      = 0.33 x 0.43 x 0.2 = 0.03

  Score that the customer is a Non-attriter, given age = 25 and SB = $13,300:
  P(Non-attriter | age = 25 and SB = $13,300)
      = P(age <= 35 | Non-attriter) x P(SB <= 20,000 | Non-attriter) x P(Non-attriter)
      = 0.67 x 0.57 x 0.8 = 0.31

Algorithm settings

In JDM, the naïve Bayes algorithm has two settings, singleton threshold and pairwise threshold, which define which predictor attribute values or predictor-target value pairs should be ignored. When a naïve Bayes model is built, a given value of a predictor attribute is ignored unless there are enough occurrences of that value: the frequency of occurrences in the build data must equal or exceed the fraction specified by the singleton threshold.
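Before turning to threshold examples, note that the arithmetic in Listing 2 is simple enough to reproduce in code. A hypothetical Java sketch, not part of the JDM API, that scores each target value as its prior times the product of the per-attribute conditional probabilities:

```java
// Score each target value under the naive Bayes conditional-independence
// assumption: score = prior * product of P(observed bin | target).
// The target with the larger score is the prediction.
public class NaiveBayesScorer {
    // priors[t]      = prior probability of target value t
    // condProbs[t][a] = P(attribute a's observed bin | target t)
    public static int predict(double[] priors, double[][] condProbs) {
        int best = 0;
        double bestScore = -1.0;
        for (int t = 0; t < priors.length; t++) {
            double score = priors[t];
            for (double p : condProbs[t]) score *= p;
            if (score > bestScore) { bestScore = score; best = t; }
        }
        return best; // index of the most probable target value
    }
}
```

For the Listing 2 case, with target 0 = Attriter and target 1 = Non-attriter, predict(new double[]{0.2, 0.8}, new double[][]{{0.33, 0.43}, {0.67, 0.57}}) returns 1, i.e., Non-attriter.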
For example, when a singleton threshold of 0.001 is specified and age = 15 occurs only 10 times out of 100,000 cases, then age = 15 is ignored, because 10/100,000 = 0.0001 < 0.001. Similarly, a pair of values between a predictor and target attribute is ignored unless there are enough occurrences of that pair in the build data: the frequency of occurrences must equal or exceed the fraction specified by the pairwise threshold. For example, when a pairwise threshold of 0.01 is specified and the pair age = 25 and target value Attriter occurs 2,000 times out of 100,000 cases, then age = 25 is used by the model, because 2,000/100,000 = 0.02 > 0.01.

Support vector machine

The support vector machine (SVM) algorithm is a relatively new, and now very popular, supervised algorithm. SVM has been shown to give highly accurate results in complex classification problems, such as gene expression analysis, in which the number of known cases is small but the number of attributes can be quite large. SVM is gaining greater acceptance in solving traditional data mining problems, including being a preferred alternative to neural networks.

Overview

The SVM algorithm creates a hyperplane that separates the target values with a maximum margin. A hyperplane divides a space into two half-spaces. For example, in two-dimensional space, as shown in Figure 4(a), the line that divides the target values Attriter and Non-attriter is a hyperplane. In general, a hyperplane is defined by an equation over the N attribute dimensions, where N is the number of predictor attributes. To understand the concept of support vectors, we again look at two-dimensional space. In Figure 4(b), the data points closest to the hyperplane that separates Attriters from Non-attriters, that is, the points the margin pushes up against, are called support vectors.
The margin is the minimal distance between the data points and the hyperplane that divides Attriters and Non-attriters.

SVM allows the selection of a kernel function. Kernel functions map data into a high-dimensional vector space and look for relations in that space. Many kernel functions have been introduced in the data mining community; JDM includes kLinear, kGaussian, hypertangent, polynomial, and sigmoid.

Feed-forward neural networks

Multilayer feed-forward neural networks with a back propagation learning algorithm are among the most popular neural network techniques used for supervised learning. Although neural networks often take longer to build than other algorithms and do not provide interpretable results, they are popular for their predictive accuracy and high tolerance of noisy data.

Overview

A neural network is an interconnected group of simulated neurons that represents a computational model for information processing. A simulated neuron is a mathematical model that takes one or more inputs and produces one output. The output is calculated by multiplying each input by a corresponding weight and combining the products, and the result may be subject to an activation function. An activation function is a transformation applied to this combined value; a simple example specifies a threshold above which the output is 1, and 0 otherwise. Figure 5(a) illustrates a neuron that takes x1, x2, and x3 as input values and w1, w2, and w3 as input weights to produce output value y.

Back propagation is the most common neural network learning algorithm. It learns by iteratively processing the build data, comparing the network's prediction for each case with the actual known target value from the validation data. (Validation data is a kind of test data used during model building, which the algorithm may automatically create by partitioning the build data.
Validation data allows the algorithm to determine how well the model is learning the patterns in the data. JDM allows users to provide an evaluation dataset explicitly in a build task, if desired.) For each case, the weights are adjusted in the direction that reduces the error between the network's prediction and the actual target value.

Figure 5(b) illustrates a back propagation neural network that consists of three types of layers: input, hidden, and output. The input layer has one neuron per input attribute, and the output layer has one neuron per target value. The number of hidden layers and the number of neurons in each hidden layer can be determined by the algorithm or specified explicitly by the data miner. In general, adding a hidden layer can allow the network to learn more complex patterns, but it can also adversely affect model build and apply performance. For each neural layer, JDM allows specifying an activation function that computes the activation state of each neuron in that layer.

Figure 5. Neural networks: (a) neuron representation, (b) back propagation neural network.

Evaluate model quality: Compute classification test metrics

It is important to evaluate the quality of supervised models before using them to make predictions in a production system. To test supervised models, the historical data is split into two datasets: one for building the model and the other for testing it. Test dataset cases are not used to build the model, in order to give a true assessment of the model's predictive accuracy.

JDM supports four popular test metrics for classification models: prediction accuracy, confusion matrix, receiver operating characteristics (ROC), and lift. These metrics are computed by comparing predicted and actual target values.
This section discusses these test metrics in the context of ABCBank's customer attrition problem.

In the customer attrition problem, assume that the test dataset has 1,000 cases and the classification model predicted 910 cases correctly and 90 cases incorrectly. The accuracy of the model on this dataset is 910/1,000 = 0.91, or 91 percent.

Consider that of the 910 correct predictions, 750 customers are Non-attriters and the remaining 160 are Attriters. Of the 90 wrong predictions, 60 are predicted as Attriters when they are actually Non-attriters, and 30 are predicted as Non-attriters when they are actually Attriters. This is illustrated in Figure 6. To represent this, we use a matrix called a confusion matrix. A confusion matrix is a two-dimensional, N x N table that indicates the number of correct and incorrect predictions a classification model made on specific test data, where N is the number of target attribute values. It is called a confusion matrix because it points out where the model gets confused, that is, where it makes incorrect predictions.

Figure 6. Confusion matrix

The structure of this table looks similar to the cost matrix illustrated in Figure 2, but the cells of the confusion matrix hold the model's correct and incorrect prediction counts. If we consider Attriter as the positive target value, the false-positive (FP) prediction count is 60 and the false-negative (FN) prediction count is 30.

Although the confusion matrix measures misclassification of target values, in our example false negatives are three times costlier than false positives. To assess model quality from a business perspective, we need to measure cost in addition to accuracy. The total cost of false predictions is 3 × 30 + 1 × 60 = 150. If a different model yields 40 false positives and 40 false negatives, its overall accuracy is better (920/1,000 = 92 percent), but its total cost is higher: 3 × 40 + 1 × 40 = 160.
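The cost comparison above is plain arithmetic, and can be sketched as follows. The class and method names are our own, not JDM API; the per-error costs would come from the user-specified cost matrix.

```java
/** Sketch of cost-weighted model comparison. The per-error costs are
 *  taken from a cost matrix; illustrative only, not the JDM API. */
public class CostSketch {

    /** Total cost of a model's misclassifications, given the number of
     *  false positives and false negatives and the cost of each kind of error. */
    public static double totalCost(int falsePositives, int falseNegatives,
                                   double falsePositiveCost, double falseNegativeCost) {
        return falsePositives * falsePositiveCost
             + falseNegatives * falseNegativeCost;
    }
}
```

With the article's numbers, `totalCost(60, 30, 1.0, 3.0)` gives 150 for the first model and `totalCost(40, 40, 1.0, 3.0)` gives 160 for the more accurate but costlier second model.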
If a cost matrix is specified, it is important to use the cost values to measure performance and to select the model with the lowest total cost.

Receiver operating characteristics (ROC) is another way to compare classification model quality. An ROC graph places the false positive rate on the X-axis and the true positive rate on the Y-axis, as shown in Figure 7. Here, the false positive rate is the ratio of the number of false positives to the total number of actual negatives; similarly, the true positive rate is the ratio of the number of true positives to the total number of actual positives.

To plot the ROC graph, the test task determines the false positive and true positive rates at different probability thresholds. Here, the probability threshold is the level above which the probability of the predicted positive target value is considered a positive prediction. Different probability threshold values result in different false positive and true positive rates. For example, when the Attriter prediction probability is 0.4 and the probability threshold is set to 0.3, the customer is predicted as an Attriter; if the probability threshold is instead 0.5, the customer is predicted as a Non-attriter, as illustrated in Figure 7(a).

Figure 7(b) illustrates the ROC curves of two classification models plotted at different probability thresholds. The models perform better at different false positive rates; for example, at a false positive rate of 0.1, Model B has a better true positive rate than Model A, whereas at false positive rates of 0.3 and above, Model A outperforms Model B. Based on the acceptable false positive rate, users can select a model and its probability threshold. The area under the ROC curve is another measure of a classification model's overall performance: the larger the area under the curve, generally, the better the model performs.

Figure 7. Receiver operating characteristics.
In the ROC graph, the point (0,1) represents the perfect classifier: it classifies all positive and negative cases correctly. (Note: A classification model is also referred to as a classifier, since it classifies cases among the possible target values.) The point is (0,1) because the false positive rate is 0 (none) and the true positive rate is 1 (all). The point (0,0) represents a classifier that predicts all cases to be negative, while the point (1,1) corresponds to a classifier that predicts every case to be positive. Point (1,0) is the classifier that is incorrect for all classifications.

Lift and cumulative gain are also popular metrics for assessing the effectiveness of a classification model. Lift is the ratio between the results obtained using the classification model and those obtained using random selection. Cumulative gain is the percentage of positive responses captured by the model across quantiles of the data. Cases are typically divided into 10 or 100 quantiles against which lift and cumulative gain are reported, as illustrated later in Table 5. Lift charts and cumulative gains charts are often used as visual aids for assessing model performance. Understanding how cumulative lift and cumulative gain are computed helps in reading the charts illustrated in Figure 8.

Figure 8. Lift charts: (a) cumulative lift, (b) cumulative gain.
Following are the typical high-level steps used to compute cumulative lift and gain values:

1. Compute the positive target value probability for all test dataset cases.
2. Sort the cases in descending order of positive target value probability.
3. Split the sorted test dataset cases into n groups, also known as quantiles.
4. Compute the lift for each quantile: the ratio of the cumulative number of positive targets found using the model to the cumulative number of positive targets expected at random.
5. Compute the cumulative gain for each quantile: the ratio of the cumulative number of positive targets predicted using the model to the total number of positive targets in the test dataset.

Table 5 details the lift and cumulative gain calculations for our customer dataset example. Each row of the table represents a quantile containing 100 customer cases. The test dataset has 1,000 cases, of which 190 are known to be attriters. Hence, picking a customer at random from this dataset, there is a 19 percent probability that the customer is an attriter.

Using the classification model, the first quantile contains the top 100 customers predicted to be attriters. Comparing the predictions against the known actual values, we find that the model was correct for 70 of these 100 customers. The lift for the first quantile is therefore 70/19 = 3.684, where 70 is the number of attriters found using the classification model and 19 is the number of attriters that would have been found in a random selection of 100 customers.

Table 5.
Lift computations table

Quantile number | Customers likely to attrite | Cumulative customers likely to attrite | Cumulative quantile lift | Cumulative gain
1  | 70 | 70  | 70/19 = 3.684   | 70/190 = 36.8%
2  | 40 | 110 | 110/38 = 2.895  | 110/190 = 57.9%
3  | 25 | 135 | 135/57 = 2.368  | 135/190 = 71.1%
4  | 15 | 150 | 150/76 = 1.974  | 150/190 = 78.9%
5  | 12 | 162 | 162/95 = 1.705  | 162/190 = 85.3%
6  | 8  | 170 | 170/114 = 1.491 | 170/190 = 89.5%
7  | 7  | 177 | 177/133 = 1.331 | 177/190 = 93.2%
8  | 5  | 183 | 183/152 = 1.204 | 183/190 = 96.3%
9  | 5  | 188 | 188/171 = 1.099 | 188/190 = 98.9%
10 | 3  | 190 | 190/190 = 1.000 | 190/190 = 100%

In this example, suppose that ABCBank wants to launch a customer retention campaign with a limited budget and wants to reach at least 50 percent of the likely attriters. The user can select the 200 customers in the first two quantiles, which have a cumulative gain of 57.9 percent and a lift of 2.895.

Suppose there are two attrition models, Model A and Model B, built using different algorithms or with different settings for the same algorithm. Figure 8 illustrates the cumulative lift and cumulative gain charts plotted for these models to compare the results. Note that Model A outperforms Model B in the first two quantiles, whereas Model B outperforms Model A from the third quantile onward. A user can pick Model A when the budget allows for contacting at most 20 percent of customers, and Model B when more than 20 percent of customers are budgeted.

Apply model: Obtain prediction results

After evaluating model performance using the test data, the user selects the best model for the problem and applies it to predict target values for an apply dataset. As noted for the decision tree algorithm, some algorithms may use only a subset of the input attributes in the final model. This subset of attributes is called the model signature, and it can be retrieved from the model to determine which attributes are required to apply the model.

In this section, we take a simple decision tree model to illustrate the model apply operation.
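Before walking through the apply example, note that the lift and cumulative gain arithmetic behind Table 5 can be reproduced directly. This is a sketch with class and method names of our own, not JDM API code; it assumes equal-sized quantiles already sorted by predicted probability.

```java
/** Sketch of the cumulative lift and gain arithmetic behind Table 5.
 *  Assumes equal-sized quantiles sorted by descending predicted
 *  probability of the positive target. Not part of the JDM API. */
public class LiftSketch {

    /** Cumulative lift per quantile: cumulative positives found by the
     *  model divided by the positives expected under random selection. */
    public static double[] cumulativeLift(int[] positivesPerQuantile,
                                          int quantileSize, int totalCases,
                                          int totalPositives) {
        double[] lift = new double[positivesPerQuantile.length];
        int cumPositives = 0;
        for (int q = 0; q < lift.length; q++) {
            cumPositives += positivesPerQuantile[q];
            // Positives a random selection of the same size would find.
            double expectedAtRandom =
                (double) totalPositives * (q + 1) * quantileSize / totalCases;
            lift[q] = cumPositives / expectedAtRandom;
        }
        return lift;
    }

    /** Cumulative gain per quantile: fraction of all positives captured so far. */
    public static double[] cumulativeGain(int[] positivesPerQuantile,
                                          int totalPositives) {
        double[] gain = new double[positivesPerQuantile.length];
        int cumPositives = 0;
        for (int q = 0; q < gain.length; q++) {
            cumPositives += positivesPerQuantile[q];
            gain[q] = (double) cumPositives / totalPositives;
        }
        return gain;
    }
}
```

Running these methods on the per-quantile attriter counts {70, 40, 25, 15, 12, 8, 7, 5, 5, 3} with 1,000 cases and 190 total attriters reproduces the lift and gain columns of Table 5.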
The decision tree model has three input attributes (age, capital gain, and average savings balance, as shown in Table 3), but it uses only two of them, age and average savings balance, as shown in Figure 9. These two attributes form the model signature. Consequently, the apply dataset for this model need only contain cases with age and average savings balance attribute values. To understand the apply process, consider an apply dataset with two customer cases, for customers Jones and Smith, as shown in Table 6.

Table 6. Customer apply table

ID | Customer name | Address        | City    | State | ZIP Code | Age | Average savings account balance
A1 | Jones         | 10 Main St.    | Boston  | MA    | 02101    | 41  | $11,500
A2 | Smith         | 120 Beacon St. | Buffalo | NY    | 14201    | 35  | $3,000

Figure 9 illustrates how the decision tree model predicts whether these customers are attriters. Jones is older than 36, so from the root node he is assigned to node-2, which predicts him as an Attriter. Smith is younger than 36, so he is assigned to node-3, which splits further on average savings balance (SB). Because Smith's average savings balance is less than $21,500, he is assigned to node-4, which predicts him as a Non-attriter.

The classification apply operation can generate prediction results with various types of content: the predicted category (the target value predicted by the model); the probability (the probability, according to the model, that the prediction is correct); the cost (the cost associated with the model's prediction, computed only when a cost matrix is specified); and the node id (the node or rule identifier used to make the prediction, applicable only to models, such as decision trees, that can report the tree node or rule used for a prediction). In JDM, apply prediction results can be presented in various forms, such as the top prediction with its details, the top-n or bottom-n predictions, the probabilities associated with all target values, or the probability of predicting a specified target value or values.
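The tree walk for Jones and Smith described above can be sketched as plain Java. This is illustrative only, not how JDM apply tasks are invoked; the split values follow the article's example, and the "node-5" label for the savings-balance branch not detailed in the example is our assumption.

```java
/** Hand-coded walk of the Figure 9 decision tree. Split values
 *  (age > 36, savings balance < $21,500) follow the article's example;
 *  "node-5" is an assumed label for the branch the example does not show.
 *  Illustrative only; not how JDM apply tasks are invoked. */
public class TreeApplySketch {

    /** Returns the leaf node reached and its prediction for one case. */
    public static String applyNode(int age, double avgSavingsBalance) {
        if (age > 36) {
            return "node-2: Attriter";          // root split on age
        }
        // node-3 splits on average savings balance (SB)
        if (avgSavingsBalance < 21500.0) {
            return "node-4: Non-attriter";
        }
        return "node-5";                        // branch not shown in the example
    }
}
```

Applying this to the Table 6 cases, Jones (age 41, SB $11,500) reaches node-2 and is predicted an Attriter, while Smith (age 35, SB $3,000) reaches node-4 and is predicted a Non-attriter, matching the walk-through above.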
Selection of the prediction results depends on the problem requirements and the type of information a user wants to see. In this example, we produce the top prediction value with its corresponding probability and cost to identify the attriters.

Although applying a model to a whole dataset is common, predictions and probabilities are likely to change when customer attributes change. For example, when a customer calls the bank to transfer a large sum from his savings account to another bank, the call center application can display a precomputed prediction that the customer is likely to attrite. That prediction, however, was based on the customer's previous account balance, and the funds transfer may change the model's prediction for this customer. It is therefore useful to rescore customers in real time based on current data, which can be achieved using the JDM single-record apply capability, designed to provide real-time response.

Mark Hornick has led the Java Data Mining (JSR 73) expert group since its inception in July 2000 and now leads the JSR-247 expert group working toward JDM 2.0. Hornick brings nearly 20 years of experience in the design and implementation of advanced distributed systems, including in-database data mining, distributed object management, and Java APIs. He is a senior manager in Oracle's Data Mining Technologies group.

With more than 17 years of experience in the neural network industry, Erik Marcade, founder and chief technical officer of KXEN, is responsible for software development and information technologies. Before founding KXEN, Marcade developed real-time software expertise at Cadence Design Systems, where he was accountable for advancing real-time software systems as well as managing "system-on-a-chip" projects. Before joining Cadence, Marcade spearheaded a project to restructure the marketing database of the largest French automobile manufacturer for Atos, a leading European information technology services company.
Sunil Venkayala is a principal member of the technical staff in Oracle's Data Mining Technologies group and served as an expert group member for the JDM standard developed under JSR 73. Venkayala has more than five years of experience developing applications using the predictive technologies available in the Oracle Database and more than seven years of experience working with Java and Internet technologies.