Basic Data Mining Tutorial
Welcome to the Microsoft Analysis Services Basic Data Mining Tutorial. Microsoft SQL Server provides an integrated environment for creating and working with data mining models. In this Basic Data Mining Tutorial, you will complete a scenario for a targeted mailing campaign in which you create three models for analyzing customer purchasing behavior and targeting potential buyers. The tutorial demonstrates how to use the data mining algorithms, mining model viewers, and data mining tools that are included in Microsoft SQL Server Analysis Services. The fictitious company, Adventure Works Cycles, is used for all examples. When you are comfortable using the data mining tools, we recommend that you also complete the Intermediate Data Mining Tutorial, which demonstrates how to use forecasting, market basket analysis, time series, association models, nested tables, and sequence clustering.
Tutorial Scenario
In this tutorial, you are an employee of Adventure Works Cycles who has been tasked with learning more about the company's customers based on historical purchases, and then using that historical data to make predictions that can be used in marketing. The company has never done data mining before, so you must create a new database specifically for data mining and set up several data mining models.
What You Will Learn
This tutorial teaches you how to create and work with several different types of data mining models. It also teaches you how to create a copy of a mining model, and apply a filter to the mining model. You then process the new model and evaluate the model using a lift chart. After the model is complete, you use drillthrough to retrieve additional data from the underlying mining structure. In SQL Server 2008, Microsoft introduced several new features that help you develop custom data mining models and use the results more effectively.
Holdout Test Sets - When you create a mining structure, you can now divide the data in the mining structure into training and testing sets.
Mining model filters - You can now attach filters to a mining model, and apply the filter during both training and testing.
Drillthrough to Structure Cases and Structure Columns - You can now easily move from the general patterns in the mining model to actionable detail in the data source.
Microsoft Corporation 2008
In this lesson, you will learn how to create a new Analysis Services database, add a data source and data source view, and prepare the new database to be used with data mining.
provides a more secure authentication method than SQL Server Authentication. However, SQL Server Authentication is provided for backward compatibility. For more information about authentication methods, see Database Engine Configuration - Account Provisioning.
7. In the Select or enter a database name list, select , and then click OK.
8. Click Next.
9. On the Impersonation Information page, click Use the service account, and then click Next. On the Completing the Wizard page, notice that, by default, the data source is named Adventure Works DW2008R2.
10. Click Finish. The new data source, Adventure Works DW2008R2, appears in the Data Sources folder in Solution Explorer.
If you want to create a data source, click New Data Source to start the Data Source Wizard.
4. On the Select Tables and Views page, select the following objects, and then click the right arrow to include them in the new data source view:
o ProspectiveBuyer (dbo) - table of prospective bike buyers
o vTargetMail (dbo) - view of historical data about past bike buyers
5. Click Next.
6. On the Completing the Wizard page, by default the data source view is named Adventure Works DW2008R2. Change the name to Targeted Mailing, and then click Finish. The new data source view opens in the Targeted Mailing.dsv [Design] tab.
In this lesson, you will learn how to create a mining model structure that can be used as part of a targeted mailing scenario.
Lesson 3: Adding and Processing Models
In this lesson you will learn how to add models to a structure. The models you create are built with the following algorithms:
Lesson 4: Exploring the Targeted Mailing Models (Basic Data Mining Tutorial)
In this lesson you will learn how to explore and interpret the findings of each model using the Viewers.
Lesson 5: Testing Models (Basic Data Mining Tutorial)
In this lesson, you make a copy of one of the targeted mailing models, add a mining model filter to restrict the training data to a particular set of customers, and then assess the viability of the model.
Lesson 6: Creating and Working with Predictions (Basic Data Mining Tutorial)
In this final lesson of the Basic Data Mining Tutorial, you use the model to predict which customers are most likely to purchase a bike. You then drill through to the underlying cases to obtain contact information.
Requirements
Microsoft SQL Server 2008 R2
Microsoft SQL Server Analysis Services
The AdventureWorksDW2008R2 database
To enhance security, the sample databases are not installed with SQL Server. To install the official databases for Microsoft SQL Server, visit the Microsoft SQL Sample Databases page and select SQL Server 2008R2.
The Marketing department of Adventure Works Cycles wants to increase sales by targeting specific customers for a mailing campaign. The company's database, AdventureWorksDW2008R2, contains a list of past customers and a list of potential new customers. By investigating the attributes of previous bike buyers, the company hopes to discover patterns that they can then apply to potential customers. They hope to use the discovered patterns to predict which potential customers are most likely to purchase a bike from Adventure Works Cycles. In this lesson you will use the Data Mining Wizard to create the targeted mailing structure. After you complete the tasks in this lesson, you will have a mining structure with a single model. Because there are many steps and important concepts involved in creating a structure, we have separated this process into the following three tasks: Creating a Targeted Mailing Mining Model Structure (Basic Data Mining Tutorial)
If you get a warning that no data mining algorithms can be found, the project properties might not be configured correctly. This warning occurs when the project attempts to retrieve a list of data mining algorithms from the Analysis Services server and cannot find the server. By default, BI Development Studio will use localhost as the server. If you are using a different instance, or a named instance, you must change the project properties. For more information, see Creating an Analysis Services Project (Basic Data Mining Tutorial).
5. Click Next.
6. On the Select Data Source View page, in the Available data source views pane, select Targeted Mailing. You can click Browse to view the tables in the data source view and then click Close to return to the wizard.
7. Click Next.
8. On the Specify Table Types page, select the check box in the Case column for vTargetMail to use it as the case table, and then click Next. You will use the ProspectiveBuyer table later for testing; ignore it for now.
9. On the Specify the Training Data page, you will identify at least one predictable column, one key column, and one input column for your model. Select the check box in the Predictable column in the BikeBuyer row.
Note
Notice the warning at the bottom of the window. You will not be able to navigate to the next page until you select at least one Input and one Predictable column.
10. Click Suggest to open the Suggest Related Columns dialog box. The Suggest button is enabled whenever at least one predictable attribute has been selected. The Suggest Related Columns dialog box lists the columns that are most closely related to the predictable column, and orders the attributes by their correlation with the predictable attribute. Columns with a significant correlation (confidence greater than 95%) are automatically selected to be included in the model. Review the suggestions, and then click Cancel to ignore the suggestions.
Note
If you click OK, all listed suggestions will be marked as input columns in the wizard. If you agree with only some of the suggestions, you must change the values manually.
11. Verify that the check box in the Key column is selected in the CustomerKey row.
Note
If the source table from the data source view indicates a key, the Data Mining Wizard automatically chooses that column as a key for the model.
12. Select the check boxes in the Input column in the following rows. You can check multiple columns by highlighting a range of cells and pressing CTRL while selecting a check box.
o Age
o CommuteDistance
o EnglishEducation
o EnglishOccupation
o Gender
o GeographyKey
o HouseOwnerFlag
o MaritalStatus
o NumberCarsOwned
o NumberChildrenAtHome
o Region
o TotalChildren
o YearlyIncome
13. On the far left column of the page, select the check boxes in the following rows.
o AddressLine1
o AddressLine2
o DateFirstPurchase
o EmailAddress
o FirstName
o LastName
Ensure that these rows have checks only in the left column. These columns will be added to your structure but will not be included in the model. However, after the model is built, they will be available for drillthrough and testing. For more information about drillthrough, see Using Drillthrough on Mining Models and Mining Structures (Analysis Services - Data Mining).
14. Click Next.
Specifying the Data Type and Content Type (Basic Data Mining Tutorial)
Review and modify content type and data type for each column
1. On the Specify Columns' Content and Data Type page, click Detect to run an algorithm that determines the default data and content types for each column.
2. Review the entries in the Content Type and Data Type columns and change them if necessary, to make sure that the settings are the same as those listed in the following table. Typically, the wizard will detect numbers and assign an appropriate numeric data type, but there are many scenarios where you might want to handle a number as text instead. For example, the GeographyKey should be handled as text, because it would be inappropriate to perform mathematical operations on this identifier.

Column                     Content Type    Data Type
Address Line1              Discrete        Text
Address Line2              Discrete        Text
Age                        Continuous      Long
Bike Buyer                 Discrete        Long
Commute Distance           Discrete        Text
CustomerKey                Key             Long
DateLastPurchase           Continuous      Date
Email Address              Discrete        Text
English Education          Discrete        Text
English Occupation         Discrete        Text
FirstName                  Discrete        Text
Gender                     Discrete        Text
Geography Key              Discrete        Text
House Owner Flag           Discrete        Text
Last Name                  Discrete        Text
Marital Status             Discrete        Text
Number Cars Owned
Number Children At Home
Region
Total Children
Yearly Income

3. Click Next.
Specifying a Testing Data Set for the Structure (Basic Data Mining Tutorial)
In the final few screens of the Data Mining Wizard you will split your data into a testing set and a training set. You will then name your structure and enable drillthrough on the model.
Specifying a Testing Set
Separating data into training and testing sets when you create a mining structure makes it possible to immediately assess the accuracy of the mining models that you create later. For more information on testing sets, see Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining).
Drillthrough can be enabled on models and on structures. The check box in this window enables drillthrough on the named model and enables you to retrieve detailed information from the model cases that were used to train the model. If the underlying mining structure has also been configured to allow drillthrough, you can retrieve detailed information from the model cases and the mining structure, including columns that were not included in the mining model. For more information, see Using Drillthrough on Mining Models and Mining Structures (Analysis Services - Data Mining).
The mining structure that you created in the previous lesson contains a single mining model that is based on the Microsoft Decision Trees algorithm. In order to identify customers for the targeted mailing, you will create two additional models, then process and deploy the models. In this lesson, you will create a set of mining models that will suggest the most likely customers from a list of potential customers. To complete the tasks in this lesson, you will use the Microsoft Clustering Algorithm and the Microsoft Naive Bayes Algorithm. This lesson contains the following tasks:
Adding New Models to the Targeted Mailing Structure (Basic Data Mining Tutorial)
The new model now appears in the Mining Models tab of Data Mining Designer. This model, built with the Microsoft Clustering algorithm, groups customers with similar characteristics into clusters and predicts bike buying for each cluster. Although you can modify the column usage and properties for the new model, no changes to the TM_Clustering model are necessary for this tutorial.
Processing Models in the Targeted Mailing Structure (Basic Data Mining Tutorial)
Before you can browse or work with the mining models that you have created, you must deploy the Analysis Services project and process the mining structure and mining models. Deploying sends the project to a server and creates any objects in that project on the server. Processing is the step, or series of steps, that populates Analysis Services objects with data from relational data sources. Models cannot be used until they have been deployed and processed.
Ensuring Consistency with HoldoutSeed
When you deploy a project and process the structure and models, individual rows in your data structure are randomly assigned to the training and testing set based on a random number seed. Typically, the random number seed is computed based on attributes of the data structure. For the purposes of this tutorial, in order to ensure that your results are the same as described here, we will arbitrarily assign a fixed holdout seed of 12. The holdout seed is used to initialize random sampling and ensures that the data is partitioned in roughly the same way for all mining structures and their models. This value does not affect the number of cases in the training set; instead, it ensures that the partition can be repeated. For more information on holdout seed, see Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining).
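The effect of a fixed holdout seed can be illustrated outside Analysis Services. The following Python sketch is only an analogy under stated assumptions: the `split_holdout` function, the 30 percent holdout fraction, and the customer keys are hypothetical and do not reproduce the sampling procedure that Analysis Services uses internally. It shows that the same seed reproduces the same partition, and that the seed does not change the size of either set.

```python
import random

def split_holdout(case_ids, holdout_fraction, seed):
    """Assign cases to training and testing sets with a seeded shuffle."""
    rng = random.Random(seed)            # private generator: same seed, same sequence
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * holdout_fraction)
    testing = sorted(shuffled[:n_test])
    training = sorted(shuffled[n_test:])
    return training, testing

cases = list(range(1, 101))              # hypothetical customer keys 1..100

train_a, test_a = split_holdout(cases, 0.30, seed=12)
train_b, test_b = split_holdout(cases, 0.30, seed=12)
train_c, test_c = split_holdout(cases, 0.30, seed=99)

print(test_a == test_b)                  # the same seed repeats the partition
print(len(test_a), len(test_c))          # the seed does not affect the set sizes
```

If the seed is omitted or changes between runs, the rows that land in each set can differ, which is why results obtained without a fixed seed may not match the ones described in this tutorial.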
In Data Mining Designer, you can process a mining structure, a specific mining model that is associated with a mining structure, or the structure and all the models that are associated with that structure. For this task, we will process the structure and all the models at the same time.
If you made changes to the structure, you will be prompted to build and deploy the project before processing the models. Click Yes.
2. Click Run in the Processing Mining Structure - Targeted Mailing dialog box. The Process Progress dialog box opens to display the details of model processing. Model processing might take some time, depending on your computer.
3. Click Close in the Process Progress dialog box after the models have completed processing.
4. Click Close in the Processing Mining Structure - <structure> dialog box.
There are multiple ways to process a model and structure. For more information, see the following topics:
To process all the mining models that are associated with the structure
1. On the Mining Models tab of Data Mining Designer, select Process Mining Structure and All Models from the Mining Model menu.
2. If you made changes to the mining structure, you will be prompted to redeploy the structure before processing the models. Click Yes.
3. In the Processing Mining Structure - <structure> dialog box, click Run.
4. The Process Progress dialog box opens to show the details of model processing.
5. After the models have successfully completed processing, click Close in the Process Progress dialog box.
6. Click Close in the Processing <model> dialog box.
The mining structure and all the associated mining models have been processed.
Lesson 4: Exploring the Targeted Mailing Models (Basic Data Mining Tutorial)
After the models in your project are processed, you can explore them in Business Intelligence Development Studio to look for interesting trends. Because the results of mining models are complex and can be difficult to understand in a raw format, visually investigating the data is often the easiest way to understand the rules and relationships that the algorithms have discovered within the data. Exploring also helps you to understand the behavior of the model and discover which model performs best before you deploy it.
Each model you created is listed in the Mining Model Viewer tab in Data Mining Designer. Each algorithm that you used to build a model in Analysis Services returns a different type of result. Therefore, Analysis Services provides a separate viewer for each algorithm. Analysis Services also provides a generic viewer that works for all model types. The Generic Content Tree Viewer displays detailed model content information that varies depending on the algorithm that was used. For more information, see Viewing Model Details with the Microsoft Generic Content Tree Viewer.
In this lesson you will look at the same data using your three models. Each model type is based on a different algorithm and provides different insights into the data. The Decision Tree model tells you about factors that influence bike buying. The Clustering model groups your customers by attributes that include their bike buying behavior and other selected attributes. The Naive Bayes model enables you to explore the relationship between different attributes. Finally, the Generic Content Tree Viewer reveals the structure of the model and provides richer detail including formulas, patterns that were extracted, and a count of cases in a cluster or a particular tree.
Click on the following topics to explore the mining model viewers.
The Microsoft Decision Trees algorithm predicts which columns influence the decision to purchase a bike based upon the remaining columns in the training set. The Microsoft Decision Tree Viewer provides the following tabs for use in exploring decision tree mining models:
Decision Tree
Dependency Network
The following sections describe how to select the appropriate viewer and explore the other mining models.
Exploring the Clustering Model (Basic Data Mining Tutorial)
Exploring the Naive Bayes Model (Basic Data Mining Tutorial)
On the Decision Tree tab, you can examine all the tree models that make up a mining model.
Because the targeted mailing model in this tutorial project contains only a single predictable attribute, Bike Buyer, there is only one tree to view. If there were more trees, you could use the Tree box to choose another tree. Reviewing the TM_Decision_Tree model in the Decision Tree viewer reveals that age is the single most important factor in predicting bike buying. Interestingly, once you group the customers by age, the next branch of the tree is different for each age node. By exploring the Decision Tree tab we can conclude that purchasers age 34 to 40 with one or no cars are very likely to purchase a bike, and that single, younger customers who live in the Pacific region and have one or no cars are also very likely to purchase a bike.
Alternatively, place your cursor over any node in the tree to see the condition that is required to reach that node from the node that comes before it. You can also view this same information in the Mining Legend.
6. Click on the node for Age >=34 and < 41. The histogram is displayed as a thin horizontal bar across the node and represents the distribution of customers in this age range who previously did (pink) and did not (blue) purchase a bike. The Viewer shows us that customers between the ages of 34 and 40 with one or no cars are likely to purchase a bike. Taking it one step further, we find that the likelihood of purchasing a bike increases if the customer is actually age 38 to 40.
Because you enabled drillthrough when you created the structure and model, you can retrieve detailed information from the model cases and mining structure, including those columns that were not included in the mining model (e.g., EmailAddress, FirstName). For more information, see Using Drillthrough on Mining Models and Mining Structures (Analysis Services - Data Mining).
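The histogram on a tree node is, in essence, a count of Bike Buyer outcomes among the training cases that satisfy the node's condition. The Python sketch below illustrates that idea only; the six cases and the `node_distribution` helper are invented for illustration and are not the tutorial data or an Analysis Services API.

```python
from collections import Counter

# Hypothetical training cases; the values are invented for illustration.
cases = [
    {"Age": 35, "NumberCarsOwned": 0, "BikeBuyer": 1},
    {"Age": 39, "NumberCarsOwned": 1, "BikeBuyer": 1},
    {"Age": 36, "NumberCarsOwned": 2, "BikeBuyer": 0},
    {"Age": 40, "NumberCarsOwned": 1, "BikeBuyer": 1},
    {"Age": 52, "NumberCarsOwned": 0, "BikeBuyer": 0},
    {"Age": 38, "NumberCarsOwned": 1, "BikeBuyer": 0},
]

def node_distribution(cases, condition):
    """Count Bike Buyer outcomes among the cases that satisfy a node's condition."""
    return Counter(case["BikeBuyer"] for case in cases if condition(case))

# Node "Age >= 34 and < 41", then its child restricted to one or no cars.
age_node = node_distribution(cases, lambda c: 34 <= c["Age"] < 41)
child_node = node_distribution(
    cases, lambda c: 34 <= c["Age"] < 41 and c["NumberCarsOwned"] <= 1
)

print(age_node[1], age_node[0])      # buyers and non-buyers in the age node
print(child_node[1], child_node[0])  # buyers and non-buyers in the child node
```

Each additional condition narrows the set of cases, so a child node's counts are always a subset of its parent's counts.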
The Microsoft Clustering algorithm groups cases into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions. The Microsoft Cluster Viewer provides the following tabs for use in exploring clustering mining models:
Cluster Diagram
Cluster Profiles
Cluster Characteristics
Cluster Discrimination
The following sections describe how to select the appropriate viewer and explore the other mining models.
Exploring the Decision Tree Model (Basic Data Mining Tutorial)
Exploring the Naive Bayes Model (Basic Data Mining Tutorial)
The Cluster Diagram tab displays all the clusters that are in a mining model. The lines between the clusters represent "closeness" and are shaded based on how similar the clusters are. The actual color of each cluster represents the frequency of the variable and the state in the cluster.
marketing department might want to combine similar clusters together when determining the best method for delivering the targeted mailing.
Cluster Profiles Tab
The Cluster Profiles tab provides an overall view of the TM_Clustering model. The Cluster Profiles tab contains a column for each cluster in the model. The first column lists the attributes that are associated with at least one cluster. The rest of the viewer contains the distribution of the states of an attribute for each cluster. The distribution of a discrete variable is shown as a colored bar with the maximum number of bars displayed in the Histogram bars list. Continuous attributes are displayed with a diamond chart, which represents the mean and standard deviation in each cluster.
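The summary that the diamond chart conveys for a continuous attribute, a mean and a standard deviation per cluster, can be computed by hand for a small example. This Python sketch rests on assumptions: the cluster labels and ages are invented, and it uses the population standard deviation, which may differ from the exact statistic the viewer reports.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (cluster, age) pairs; not taken from the tutorial data.
cases = [
    ("Cluster 1", 30), ("Cluster 1", 34), ("Cluster 1", 38),
    ("Cluster 2", 50), ("Cluster 2", 60),
]

# Group the continuous attribute by cluster.
ages_by_cluster = defaultdict(list)
for cluster, age in cases:
    ages_by_cluster[cluster].append(age)

# One (mean, standard deviation) pair per cluster, as a diamond chart would summarize.
profile = {
    cluster: (mean(ages), pstdev(ages))
    for cluster, ages in ages_by_cluster.items()
}

for cluster, (avg, spread) in profile.items():
    print(f"{cluster}: mean={avg:.1f}, std={spread:.2f}")
```

A cluster with a small standard deviation is tightly grouped around its mean for that attribute, which is what a short diamond indicates in the viewer.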
With the Cluster Characteristics tab, you can examine in more detail the characteristics that make up a cluster. Instead of comparing the characteristics of all of the clusters (as in the Cluster Profiles tab), you can explore one cluster at a time. For example, if you select Bike Buyers High from the Cluster list, you can see the characteristics of the customers in this cluster. Though the display is different from the Cluster Profiles viewer, the findings are the same.
Note
Unless you set an initial value for HoldoutSeed, results will vary each time you process the model. For more information, see HoldoutSeed Element.
Cluster Discrimination Tab
With the Cluster Discrimination tab, you can explore the characteristics that distinguish one cluster from another. After you select two clusters, one from the Cluster 1 list, and one from the Cluster 2 list, the viewer calculates the differences between the clusters and displays a list of the attributes that distinguish the clusters most.
The Microsoft Naive Bayes algorithm provides several methods for displaying the interaction between bike buying and the input attributes. The Microsoft Naive Bayes Viewer provides the following tabs for use in exploring Naive Bayes mining models:
Dependency Network
Attribute Profiles
Attribute Characteristics
Attribute Discrimination
The following sections describe how to explore the other mining models.
Exploring the Decision Tree Model (Basic Data Mining Tutorial)
Exploring the Clustering Model (Basic Data Mining Tutorial)
Dependency Network
The Dependency Network tab works in the same way as the Dependency Network tab for the Microsoft Tree Viewer. Each node in the viewer represents an attribute, and the lines between nodes represent relationships. In the viewer, you can see all the attributes that affect the state of the predictable attribute, Bike Buyer.
The Attribute Profiles tab describes how different states of the input attributes affect the outcome of the predictable attribute.
With the Attribute Characteristics tab, you can select an attribute and value to see how frequently values for other attributes appear in the selected value cases.
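The idea behind this tab can be sketched as a conditional frequency count: restrict the cases to those with the selected attribute value, then measure how often each state of every other attribute occurs. The Python below is an illustration with invented cases and a hypothetical `characteristics` function; it is not how the viewer is implemented.

```python
from collections import Counter

# Hypothetical cases; the values are invented for illustration.
cases = [
    {"BikeBuyer": 1, "NumberCarsOwned": 0, "Region": "Pacific"},
    {"BikeBuyer": 1, "NumberCarsOwned": 0, "Region": "Europe"},
    {"BikeBuyer": 1, "NumberCarsOwned": 1, "Region": "Pacific"},
    {"BikeBuyer": 0, "NumberCarsOwned": 2, "Region": "North America"},
    {"BikeBuyer": 0, "NumberCarsOwned": 2, "Region": "Europe"},
    {"BikeBuyer": 0, "NumberCarsOwned": 0, "Region": "North America"},
]

def characteristics(cases, attribute, value):
    """Relative frequency of every other (attribute, state) pair among the selected cases."""
    selected = [case for case in cases if case[attribute] == value]
    counts = Counter(
        (name, state)
        for case in selected
        for name, state in case.items()
        if name != attribute
    )
    return {pair: n / len(selected) for pair, n in counts.items()}

buyers = characteristics(cases, "BikeBuyer", 1)
non_buyers = characteristics(cases, "BikeBuyer", 0)

# Most frequent characteristics first, as the tab would list them.
for pair, freq in sorted(buyers.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(freq, 2))
```

In this invented data, owning no cars is a frequent characteristic of buyers and owning two cars is a frequent characteristic of non-buyers; the real model's findings depend on the actual training data.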
With the Attribute Discrimination tab, you can investigate the relationship between two discrete values of bike buying and other attribute values. Because the TM_NaiveBayes model has only two states, 1 and 0, you do not have to make any changes to the viewer. In the viewer, you can see that people who do not own cars tend to buy bicycles, and people who own two cars tend not to buy bicycles.
Lesson 5: Testing Models (Basic Data Mining Tutorial)
Now that you have processed the model by using the targeted mailing scenario training set, you will test your models against the testing set. Because the data in the testing set already contains known values for bike buying, it is easy to determine whether the model's predictions are correct. The model that performs the best will be used by the Adventure Works Cycles marketing department to identify the customers for their targeted mailing campaign. In this lesson you will first test your models by making predictions against the testing set. Next, you will test your models on a filtered subset of the data. Analysis Services provides a variety of methods to determine the accuracy of mining models. In this lesson we will take a look at a lift chart. Validation is an important step in the data mining process. Knowing how well your targeted mailing mining models perform against real data is important before you deploy the models into a production environment. For more information about how model validation fits into the larger data mining process, see Data Mining Concepts (Analysis Services - Data Mining). This lesson contains the following tasks: Testing Accuracy with Lift Charts (Basic Data Mining Tutorial)
On the Mining Accuracy Chart tab of Data Mining Designer, you can calculate how well each of your models makes predictions, and compare the results of each model directly against the results of the other models. This method of comparison is referred to as a lift chart. Typically, the predictive accuracy of a mining model is measured by either lift or classification accuracy. For this tutorial we will use the lift chart only. For more information about lift charts and other accuracy charts, see Tools for Charting Model Accuracy (Analysis Services - Data Mining). In this topic, you will perform the following tasks:
Choosing Input Data
Selecting the Models, Predictable Columns, and Values
The first step in testing the accuracy of your mining models is to select the data source that you will use for testing. You will test how well the models perform against your testing data and then you will use them with external data.
The next step is to select the models that you want to include in the lift chart, the predictable column against which to compare the models, and the value to predict.
Note
The mining model columns in the Predictable Column Name list are restricted to columns that have the usage type set to Predict or Predict Only and have a content type of Discrete or Discretized.
2. In the Predictable Column Name column, verify that Bike Buyer is selected for each model.
3. In the Show column, select each of the models. By default, all the models in the mining structure are selected. You can decide not to include a model, but for this tutorial leave all the models selected.
4. In the Predict Value column, select 1. The same value is automatically filled in for each model that has the same predictable column.
5. Select the Lift Chart tab to display the lift chart. When you click the tab, a prediction query runs against the server and database for the mining structure and the input table or test data. The results are plotted on the graph. When you enter a Predict Value, the lift chart plots a Random Guess Model as well as an Ideal Model. The mining models you created will fall between these two extremes, a random guess and a perfect prediction. Any improvement from the random guess is considered to be lift.
6. Use the legend to locate the colored lines representing the Ideal Model and the Random Guess Model. You'll notice that the TM_Decision_Tree model provides the greatest lift, outperforming both the Clustering and Naive Bayes models. For an in-depth explanation of a lift chart similar to the one created in this lesson, see Lift Chart (Analysis Services - Data Mining).
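The three curves on a lift chart can be approximated by ranking the test cases by predicted probability and counting how many actual buyers fall in the top portion of the ranking. The Python sketch below is a simplified illustration: the ten scored cases and the `cumulative_gain` function are hypothetical, and the exact calculation that Analysis Services performs may differ.

```python
def cumulative_gain(scored_cases, percentages):
    """Share of all target cases captured in the top N% of cases, ranked by score.

    scored_cases is a list of (score, is_target) pairs.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    total_targets = sum(1 for _, is_target in ranked if is_target)
    gains = []
    for pct in percentages:
        top_n = len(ranked) * pct // 100
        captured = sum(1 for _, is_target in ranked[:top_n] if is_target)
        gains.append(captured / total_targets)
    return gains

# Hypothetical test cases: (predicted probability of Bike Buyer = 1, actual outcome).
scored = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.50, False),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]
percentages = [20, 40, 60, 80, 100]

model_curve = cumulative_gain(scored, percentages)
# Ideal model: every actual buyer is ranked ahead of every non-buyer.
ideal_curve = cumulative_gain(
    [(1.0 if is_target else 0.0, is_target) for _, is_target in scored],
    percentages,
)
# Random guess: contacting N% of the population finds N% of the buyers.
random_curve = [pct / 100 for pct in percentages]

for pct, m, i, r in zip(percentages, model_curve, ideal_curve, random_curve):
    print(f"top {pct:3d}%: model={m:.2f} ideal={i:.2f} random={r:.2f}")
```

The gap between the model curve and the random curve at a given percentage is the lift; the closer the model curve runs to the ideal curve, the better the model ranks likely buyers.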
Testing a Filtered Model (Basic Data Mining Tutorial)
Now that you have determined that the TM_Decision_Tree model is the most accurate, you should evaluate the model in the context of the Adventure Works Cycles targeted mailing campaign. The Marketing department wants to know if there is a difference in the characteristics of male bike buyers and female bike buyers. This information will help them decide which magazines to use for advertising and which products to feature in their mailings. In this lesson, we will create a model that is filtered on gender. You can then easily make a copy of that model, and change just the filter condition to generate a new model based on a different gender.
For more information on filters, see Creating Filters for Mining Models (Analysis Services - Data Mining).
Using Filters
Filtering enables you to easily create models built on subsets of your data. The filter is applied only to the model and does not change the underlying data source. For information on applying filters to nested tables, see Intermediate Data Mining Tutorial (Analysis Services - Data Mining).
Filters on Case Tables
5. Click the Value text box, and type M.
6. Click the next row in the grid.
7. Click OK to close the Model Filter. The filter displays in the Properties window. Alternatively, you can launch the Model Filter dialog from the Properties window.
8. Repeat the above steps, but this time name the model TM_Decision_Tree_Female and type F in the Value text box.
You now have two new models displayed in the Mining Models tab.
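Conceptually, each filtered model trains on only the cases that satisfy its filter condition, while the cases in the mining structure remain unchanged. The Python sketch below illustrates that behavior with invented data; `filter_cases` and `structure_cases` are hypothetical names, not part of Analysis Services.

```python
def filter_cases(cases, column, value):
    """Return the subset of cases a filtered model would train on.

    A new list is built, so the source cases are left untouched.
    """
    return [case for case in cases if case[column] == value]

# Hypothetical structure cases; the values are invented for illustration.
structure_cases = [
    {"CustomerKey": 1, "Gender": "M", "BikeBuyer": 1},
    {"CustomerKey": 2, "Gender": "F", "BikeBuyer": 0},
    {"CustomerKey": 3, "Gender": "F", "BikeBuyer": 1},
    {"CustomerKey": 4, "Gender": "M", "BikeBuyer": 0},
    {"CustomerKey": 5, "Gender": "M", "BikeBuyer": 1},
]

male_cases = filter_cases(structure_cases, "Gender", "M")
female_cases = filter_cases(structure_cases, "Gender", "F")

print(len(structure_cases), len(male_cases), len(female_cases))
```

Because the two filters partition the cases on a single column, copying a model and changing only the filter value is enough to produce the second model.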
Process the Filtered Models
Models cannot be used until they have been deployed and processed. For more information on processing models, see Processing Models in the Targeted Mailing Structure (Basic Data Mining Tutorial).
View the results and assess the accuracy of the filtered models in much the same way as you did for the previous three models. For more information, see: Exploring the Decision Tree Model (Basic Data Mining Tutorial) Testing Accuracy with Lift Charts (Basic Data Mining Tutorial)
of the same characteristics as the unfiltered bike buyers but all three have interesting differences as well. This is useful information that Adventure Works Cycles can use to develop their marketing campaign.
Lesson 6: Creating and Working with Predictions (Basic Data Mining Tutorial)
You have trained, tested, and explored the data mining models you created. Now you are ready to use the models to identify recipients for the Adventure Works Cycles targeted mailing campaign. In this lesson you will create a query to predict which customers are most likely to purchase a bike. You will also retrieve the probability that the prediction is correct, so that you can decide whether to present the recommendation to the marketing department. Once you have identified customers with a high probability of purchasing a bike, you will drill through to the details of the cases in the mining model to retrieve names and contact information for these customers. This lesson contains the following topics: Creating Predictions (Basic Data Mining Tutorial)
After you have tested the accuracy of your mining models and decided that you are satisfied with them, you can then create Data Mining Extensions (DMX) prediction queries by using the Prediction Query Builder on the Mining Model Prediction tab in the Data Mining Designer. The Prediction Query Builder has three views. With the Design and Query views, you can build and examine your query. You can then run the query and view the results in the Result view. For more information about how to use the Prediction Query Builder, see Creating DMX Prediction Queries.
Creating the Query
The first step in creating a prediction query is to select a mining model and input table.
After you select the input table, Prediction Query Builder creates a default mapping between the mining model and the input table, based on the names of the columns. At least one column from the structure must match a column in the external data.
Important
The data that you use to determine the accuracy of the models must contain a column that can be mapped to the predictable column.
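The name-based default mapping can be sketched in Python as follows. This is an illustration under stated assumptions, not the Prediction Query Builder's actual matching logic: the helper default_mapping, the rule of ignoring case and spaces, and the model column names other than Bike Buyer are all hypothetical.

```python
from typing import Dict, List


def default_mapping(model_columns: List[str], input_columns: List[str]) -> Dict[str, str]:
    """Pair each model column with an input column that has the same name.

    Assumption for this sketch: names match if they are equal after
    removing spaces and ignoring case.
    """
    def normalize(name: str) -> str:
        return name.replace(" ", "").lower()

    input_by_name = {normalize(col): col for col in input_columns}
    mapping = {
        col: input_by_name[normalize(col)]
        for col in model_columns
        if normalize(col) in input_by_name
    }
    # At least one column must match, otherwise no mapping is possible.
    if not mapping:
        raise ValueError("No model column matches a column in the input table.")
    return mapping


# Hypothetical column lists for the model and the input table.
model_columns = ["Age", "Gender", "Marital Status", "Bike Buyer"]
input_columns = ["ProspectiveBuyerKey", "Gender", "MaritalStatus", "calcAge"]

mapping = default_mapping(model_columns, input_columns)
unmapped = [col for col in model_columns if col not in mapping]
```

In this sketch, columns left in unmapped (for example, a model column with no same-named input column) would have to be mapped by hand.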
3. In the Prediction Function row, in the Field column, select PredictProbability.
4. From the Mining Model window above, select and drag [Bike Buyer] into the Criteria/Argument cell. When you let go, [TM_Decision_Tree].[Bike Buyer] appears in the Criteria/Argument cell. This specifies the target column for the PredictProbability function. For more information about functions, see Data Mining Extensions (DMX) Function Reference.
5. Click the next empty row in the Source column, and then select TM_Decision_Tree.
6. In the TM_Decision_Tree row, in the Field column, select Bike Buyer.
7. In the TM_Decision_Tree row, in the Criteria/Argument column, type =1.
8. Click the next empty row in the Source column, and then select ProspectiveBuyer.
9. In the ProspectiveBuyer row, in the Field column, select ProspectiveBuyerKey. This adds the unique identifier to the prediction query so that you can identify who is and who is not likely to buy a bicycle.
10. Add five more rows to the grid. For each row, select ProspectiveBuyer as the Source and then add the following columns in the Field cells:
   o calcAge
   o LastName
   o FirstName
   o AddressLine1
   o AddressLine2

Finally, run the query and browse the results.
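The shape of the result that this grid produces can be approximated with a short Python sketch. It is not DMX and does not call Analysis Services; the function likely_buyers, the rule in toy_predict, and the sample rows are invented for illustration. The sketch assumes that the model returns a predicted Bike Buyer value together with the probability of that prediction, and that the =1 criterion keeps only rows predicted as buyers.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Prediction = Tuple[int, float]  # (predicted Bike Buyer value, probability)


def likely_buyers(
    prospects: List[Row],
    predict: Callable[[Row], Prediction],
    fields: List[str],
) -> List[Row]:
    """Return one output row per prospect that is predicted to buy a bike."""
    results = []
    for row in prospects:
        label, probability = predict(row)
        if label == 1:  # mirrors the "=1" criterion on Bike Buyer
            out = {"Probability": probability, "Bike Buyer": label}
            out.update({name: row[name] for name in fields})
            results.append(out)
    return results


def toy_predict(row: Row) -> Prediction:
    """Hypothetical stand-in for the trained TM_Decision_Tree model."""
    return (1, 0.8) if row["calcAge"] < 45 else (0, 0.7)


# Invented sample rows with the columns selected in the grid.
prospects = [
    {"ProspectiveBuyerKey": 101, "calcAge": 38, "LastName": "Example",
     "FirstName": "A", "AddressLine1": "1 Sample St", "AddressLine2": ""},
    {"ProspectiveBuyerKey": 102, "calcAge": 61, "LastName": "Example",
     "FirstName": "B", "AddressLine1": "2 Sample St", "AddressLine2": ""},
    {"ProspectiveBuyerKey": 103, "calcAge": 29, "LastName": "Example",
     "FirstName": "C", "AddressLine1": "3 Sample St", "AddressLine2": "Apt 4"},
]
fields = ["ProspectiveBuyerKey", "calcAge", "LastName", "FirstName",
          "AddressLine1", "AddressLine2"]

results = likely_buyers(prospects, toy_predict, fields)
```

Each output row carries the probability, the predicted value, the key, and the contact columns, which is the information a marketing list would need.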
Using Drillthrough on Structure Data (Basic Data Mining Tutorial)
As part of their advertising campaign, Adventure Works Cycles is sending a mailer to potential customers in the 34-40 age demographic. The marketing department has decided that they would also like to send the mailer to the customers who purchased bikes from Adventure Works Cycles more than five years ago. In this lesson you will identify customers with older bikes and retrieve their contact information. This information is not included in the model, but is included in the structure. To retrieve the contact information, you will first ensure that drillthrough is enabled for the structure, and then you will use drillthrough to reveal the names and addresses of the customers with older bikes. For information on how to drill through to model cases, see Using Drillthrough on Structure Data (Basic Data Mining Tutorial).
intermediate data mining tutorial, which demonstrates how to create models for forecasting, market basket analysis, and sequence clustering.
Microsoft Analysis Services provides an integrated environment for creating and working with data mining models. You can easily bind to data sources, create and test multiple models on the same data, and deploy models for use in predictive analysis. In the Basic Data Mining Tutorial, you learned how to use Business Intelligence Development Studio to create a data mining solution, and you built three models to support a targeted mailing campaign for analyzing customer purchasing behavior and for targeting potential buyers. To complete the following tutorial, you should be familiar with the data mining tools and with the mining model viewers that were introduced in the Basic Data Mining Tutorial. This intermediate tutorial builds on that experience and introduces several new scenarios, including forecasting and market basket analysis. You will learn how to create a time series model, an association model, and a sequence clustering model. You will also learn how to use nested tables in a model, and how to create filters on nested tables. All scenarios use the AdventureWorksDW2008R2 data source, but you will create different data source views for different scenarios. You can do the lessons in any order as long as you create the data source first. The lessons are independent and can be completed separately.
Lesson Scenarios
After your success with the targeted mailing campaign, you have been asked to apply your knowledge of data mining to develop several new models for use in business planning. These include the following new model types:
Time series models, to forecast the sales of products in different regions around the world. You will develop individual models for each region and also a general model that can be used for cross-prediction.
Association model, to analyze groupings of products that are purchased during visits to the Adventure Works Cycles e-commerce site. Based on this market basket model, you might recommend products to customers.
Sequence clustering model, to analyze the order in which customers buy products. Based on this model, you can plan changes in Web site design or new product offerings.
Neural network and logistic regression models, to perform exploratory analysis of call center data. Based on the insights from the preliminary model, you will create a model to identify possible strategies for improving customer experience with the call center.
This tutorial teaches you how to create and work with several types of data mining algorithms. This tutorial also introduces the following concepts:
Using nested tables to build models
Choosing a nested table key, time series key, or sequence key
Filtering nested tables when creating models or making predictions
Determining whether you have enough data to support a model
Creating a general model and applying it to multiple data sets
In this lesson, you will create a new project based on the AdventureWorksDW2008R2 database, to support several new data source views and many more mining models.
Lesson 2: Building a Forecasting Scenario (Intermediate Data Mining Tutorial)
In this lesson, you will create a mining model that can be used as part of a forecasting scenario. You will also explore mining models that are built with the Microsoft Time Series algorithm. You will build models for individual regions, and then build a general model that can be used for cross-prediction.
Lesson 3: Building a Market Basket Scenario (Intermediate Data Mining Tutorial)
Microsoft Corporation 2008
In this lesson, you will add a new data source view and learn how to work with nested tables and keys. Based on this data, you will create a mining model that can be used as part of a market basket scenario. You will also explore mining models that are built with the Microsoft Association algorithm.
Lesson 4: Building a Sequence Clustering Scenario (Intermediate Data Mining Tutorial)
In this lesson, you will create a mining model that can be used as part of a sequence clustering scenario. You will also learn how to explore mining models that are built with the Microsoft Sequence Clustering algorithm.
Lesson 5: Building Neural Network and Logistic Regression Models (Intermediate Data Mining Tutorial)
In this lesson, you will create several related mining models, using the Microsoft Neural Network and Microsoft Logistic Regression algorithms. You will also learn to work with data source views to explore data underlying the models.
Requirements
Microsoft SQL Server 2008 R2
Microsoft SQL Server Analysis Services
SQL Server with the AdventureWorksDW2008R2 database.
To enhance security, the sample databases are not installed by default. To install the official databases for Microsoft SQL Server, visit the Microsoft SQL Sample Databases page and select SQL Server 2008 R2.
Note
When you are working through a tutorial, you might find it easier to move back and forth between the steps if you add the Next topic and Previous topic buttons to the document viewer toolbar. For more information, see Adding Next and Previous Buttons to Help.