Data Mining with Microsoft SQL Server 2008
Table of Contents
Data Mining with Microsoft SQL Server 2008
Exercise 1 Creating Data Mining Models
Exercise 2 Viewing Mining Accuracy Charts
Exercise 3 Creating a Prediction Query
Exercise 4 Creating a Time Series Model
Scenario
Adventure Works Cycles, a bicycle manufacturing company, uses business analytics to better understand its customer base. The company plans to analyze and improve the performance of its bicycle retail sector. Over time, the company has collected information about past customers and sales. It now wants to use this information to gain insights about its customers.
90 Minutes
SQL2008
The password for the Administrator account on all computers in this lab is: Pa$$w0rd
Tasks and Detailed Steps
Complete the following tasks on: SQL2008

1. Create views in the AdventureWorksDW database
a. Click Start, and then click Computer.
b. Browse to the C:\SQLHOLS\Data Mining\Starter folder.
c. Double-click Setup.cmd.
d. Wait for the Command Prompt window to close before proceeding to the next procedure.

2. Create an Analysis Services project
a. Click Start, point to All Programs, click Microsoft SQL Server 2008, right-click SQL Server Business Intelligence Development Studio, and click Run as administrator. When prompted, click Continue.
b. If prompted to choose a default environment setting, choose Business Intelligence Settings, and then click Start Visual Studio.
c. On the File menu, point to New, and then click Project.
d. In the New Project dialog box, in the Project Types pane, click Business Intelligence Projects.
e. In the Templates pane, click Analysis Services Project.
f. In the Name box, type DM Exercise 1
g. In the Location box, type C:\SQLHOLS\Data Mining\Starter\
h. Clear the Create directory for Solution checkbox, and then click OK.
Note: The project is created in a new solution. A solution is the largest unit of management in the Business Intelligence Development Studio environment. Each solution contains one or more projects. An Analysis Services project is a group of related files that contain the XML code for all of the objects in an Analysis Services database. You can view the solution and its projects in the Solution Explorer window on the right-hand side in Business Intelligence Development Studio. If Solution Explorer is not visible, you can view it by selecting the View, Solution Explorer menu item (or the keyboard shortcut CTRL+ALT+L).
i. In Solution Explorer, right-click the DM Exercise 1 project, and then click Properties.
j. In the DM Exercise 1 Property Pages dialog box, under Configuration Properties, click Deployment.
k. In the right pane, in the Deployment Mode drop-down list, click Deploy All, and then click OK.
h. Because CustomerKey is the primary key of the source table, the Data Mining Wizard has automatically selected it as the key. The key identifies the cases in the mining model. The attributes selected as Input are analyzed to determine their relationship to, and influence on, the attribute selected as Predictable.
Note: The CustomerKey, FirstName, and LastName columns must not be selected as Input or Predictable columns.
i. On the Specify Columns Content and Data Type page, review the Content Type column for all numeric rows, and then click Detect.
j. When the detection is complete, notice that the NumberCarsOwned and NumberChildrenAtHome fields have been changed from Continuous to Discrete, and then click Next.
Note: The Content and Data Type page shows the Data Type determined from the source data. When you click Detect, Analysis Services scans numeric fields to determine if they are continuous or discrete data. After the detection has occurred, the interface provides you with the flexibility to manually edit both the Data Type and Content Type fields.
k. On the Split data into training and testing sets page, ensure that Percentage of testing data is set to 30%, and then click Next.
Note: SQL Server 2008 Analysis Services enables you to partition your input data into training and testing sets of data. The training data will be used by the mining model algorithm to determine patterns and relations. A randomly selected portion of the data will be held to test the accuracy of the data mining models created by comparing model predictions to actual values. All mining models associated with this mining structure will use the training and testing sets defined in this step.
l. On the Completing the Wizard page, in the Mining Structure Name box, type Customers, select the Allow drill through check box, and then click Finish. The Mining Structure designer will open.
Note: A data mining structure can contain multiple data mining models. Each data mining model uses a subset of the data referenced by the data mining structure. When the data mining structure is processed, the source data is queried once and then all of
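The note above about training and testing sets can be illustrated with a small sketch. This is not how Analysis Services implements the split internally; it only shows the idea of randomly holding out 30% of the cases. The function name split_holdout, the seed, and the sample CustomerKey values are hypothetical.

```python
import random

def split_holdout(case_keys, test_fraction=0.30, seed=42):
    """Randomly hold out a fraction of cases for testing; the rest are used for training."""
    keys = list(case_keys)
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    rng.shuffle(keys)
    n_test = round(len(keys) * test_fraction)
    return keys[n_test:], keys[:n_test]   # (training, testing)

# Hypothetical CustomerKey values; the real keys come from the source view.
customer_keys = range(11000, 11100)
training, testing = split_holdout(customer_keys)
print(len(training), len(testing))
```

Every model built on the same structure would then be trained on the same training cases and scored against the same held-out cases, which is what makes their accuracy figures comparable.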
Note: The Mining Legend window on the right side of the display may be relocated and resized to improve the display of the decision tree. If you accidentally close the Mining Legend window, click the Refresh icon next to the Viewer box and the Mining Legend window will return.
f. On the Show Level slider, drag the pointer to the left so that only one level of the decision tree is displayed.
g. Click the All node. This node contains a histogram with blue representing bike buyers and red representing non-bike buyers.
h. Information about all customers is displayed in the Mining Legend window. Review the values showing how many customers are bike buyers. Compare the percentage of bike buyers to non-bike buyers. (You may need to widen the Mining Legend window to be able to see the percentages.)
i. On the Show Level slider, drag the pointer to the right so that two levels of the decision tree are displayed. Note that Number Cars Owned is most predictive of a customer's bike buying behavior.
j. Click each node of level 2. The Mining Legend window displays detailed information for each node.
k. In the Background drop-down list, click Yes.
l. The shade of each node indicates the concentration of the value in the Background drop-down list. The dark blue color tells you that the greatest number of bike buyers have no cars.
m. Click the + to the right of the Number Cars Owned = 0 box. Again, the dark blue coloring shows that most bike buyers are under the age of 43. Expand and contract nodes in the diagram to investigate the predicting factors for each group.

12. View the Decision Trees mining model dependency network
a. In the Mining Model Viewer, click the Dependency Network tab.
b. The Dependency Network viewer displays the strength of the relationships between the attributes in a decision tree model.
c. On the links slider, drag the pointer to the bottom. As the threshold for links becomes higher, dependencies are removed from the chart.
Note: The strongest link is shown to be Number Cars Owned as previously shown on the Decision Tree tab.
d. In the Dependency Network diagram, click the Bike Buyer node.
e. The color of each node indicates that attribute's relationship to the Bike Buyer
Tasks and Detailed Steps
Complete the following tasks on: SQL Server 2008 HOLs

1. Open and deploy an existing project
a. Open Business Intelligence Development Studio as Administrator if it is not already open.
b. On the File menu, point to Open, and then click Project/Solution.
c. In the Open Project dialog box, browse to C:\SQLHOLS\Data Mining\Starter\DM Exercise 2, click DM Exercise 2.sln, and then click Open.
d. On the Build menu, click Deploy DM Exercise 2.
e. Observe the deployment progress shown in the Deployment Progress pane. The Deployment Progress pane gives you detailed information about what happens during deployment.
Note: Analysis Services might take a while to process the data mining models.
f. When deployment is complete, you can close the Deployment Progress window if you want.

2.
a. In Solution Explorer, expand the Mining Structures folder, and then double-click Customers.dmm.
Note: If Solution Explorer is not visible, click View, and then click Solution Explorer.
b. Click the Mining Accuracy Chart tab.
c. Verify that the Synchronize Prediction Columns and Values box is selected.
Note: You should only clear the Synchronize Prediction Columns and Values box if you know that two mining structure columns derive from the same underlying relational or multidimensional source and the columns contain the same states or have been discretized in the same way. In this scenario, you will enable Analysis Services to synchronize the columns and values because all columns may not be discretized in the same way.
d. Verify that the Show check box is selected for both the Customers DT and Customers NB mining models.
e. In the Predictable Column Name column, verify that Bike Buyer is selected for both mining models.
Note: In the Predictable Column Name drop-down lists, the mining model column names are restricted to columns that have their usage type set to Predict or Predict Only.
f. In the Select data set to be used for Accuracy Chart area, select Specify a different data set, and then click the ellipsis (...). The Specify Column Mapping page opens. You will use the Column Mapping page to design a Prediction Query that will be run to compare the mining model's predicted values with the validation data set's actual values.
g. On the Column Mapping page, in the Select Input Table(s) window, click Select
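The accuracy chart in Exercise 2 compares each model's predicted values with the actual values in a validation data set. As a rough illustration of that comparison, the sketch below ranks cases by predicted probability and reports what share of the actual bike buyers is captured in each top slice of the population. The function name lift_curve and the sample data are hypothetical, and this is not presented as the calculation Analysis Services performs.

```python
def lift_curve(actual, probability, steps=10):
    """Return (population_fraction, fraction_of_buyers_captured) points.

    actual: 1 for a bike buyer, 0 otherwise.
    probability: the model's predicted probability that the case is a buyer.
    Assumes at least one actual buyer is present.
    """
    ranked = [a for _, a in sorted(zip(probability, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    total_buyers = sum(ranked)
    points = []
    for step in range(1, steps + 1):
        cutoff = len(ranked) * step // steps   # size of the top slice
        captured = sum(ranked[:cutoff])        # buyers found in that slice
        points.append((step / steps, captured / total_buyers))
    return points

# Hypothetical validation cases: actual outcome and predicted probability.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
probability = [0.9, 0.2, 0.8, 0.7, 0.6, 0.1, 0.4, 0.3, 0.5, 0.05]
points = lift_curve(actual, probability, steps=5)
for population, captured in points:
    print(f"top {population:.0%} of customers -> {captured:.0%} of buyers")
```

A model that ranks customers no better than chance would capture about 20% of buyers in the top 20% of customers; the further a model's curve sits above that diagonal, the more useful its ranking is for targeting.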
Tasks and Detailed Steps
Complete the following tasks on: SQL Server 2008 HOLs

1. Open and deploy an existing project
a. Open Business Intelligence Development Studio if it is not already open.
b. On the File menu, point to Open, and then click Project/Solution.
c. In the Open Project dialog box, browse to C:\SQLHOLS\Data Mining\DM Exercise 3, click DM Exercise 3.sln, and then click Open.
Note: The solution used in Exercise 3 is different from the solution created in Exercise 2.
d. On the Build menu, click Deploy DM Exercise 3.
e. Observe the deployment progress shown in the Deployment Progress window. The Deployment Progress pane gives you detailed information about what happens during deployment.
Note: Analysis Services may take a while to process the data mining models.
f. When deployment is complete, you can close the Deployment Progress window if you want.

2. Select the mining model and input table for a prediction query
a. In Solution Explorer, in the Mining Structures folder, double-click Customers.dmm.
Note: If Solution Explorer is not visible, click View, and then click Solution Explorer.
b. In the Customers.dmm designer, click the Mining Model Prediction tab.
c. In the Mining Model window, click Select Model.
d. In the Select Mining Model dialog box, expand Customers, click Customers DT, and then click OK.
e. In the Select Input Table(s) window, click Select Case Table.
f. In the Select Table window, in the Data Source box, ensure that Customers is selected, click the vDMLabCustomerPredict table, and then click OK.
Note: Relationships between the mining structure and the input table are automatically created between columns with the same name. Relationships can be added or deleted by the user.
3.
a. Enter the following values into the first row of the table at the bottom of the designer.
   Source: vDMLabCustomerPredicts
   Field: CustomerKey
Note: You can resize the columns of the table by dragging the dividing line between the column headings.
b. Enter the following values into the second row of the table.
   Source: vDMLabCustomerPredicts
   Field: FirstName
   Show: (Checked)
c. Enter the following values into the third row of the table.
   Source: vDMLabCustomerPredicts
   Field: LastName
   Show: (Checked)
Note: Adding the customer's name to the report will make the report more readable and useful for users. If the view or table defined in the Data Source View contained additional information such as phone number or e-mail address, these columns could also be added to the report, giving the company a report created in a single step that included all contact information required by the marketing department.
d. Enter the following values into the fourth row of the table.
   Source: Customers DT mining model
   Field: Bike Buyer
   Show: (Checked)
Note: The value in the Source column will change from Customers DT mining model to Customers DT.
e. Enter the following values into the fifth row of the table.
   Source: Prediction Function
   Field: PredictProbability
   Alias: Confidence
   Show: (Checked)
   Criteria/arguments: [Customers DT].[Bike Buyer]
f. On the Mining Model menu, click Query to view the Data Mining Extensions to SQL language (DMX) syntax for the query that you defined in the previous steps. DMX is designed to create, train, modify, and query data mining models, providing a simple and familiar language for embedding prediction in applications.

4. Display the Decision Tree query results
a. On the Mining Model menu, click Result to view the results of the query that you defined in the previous steps.
b. The results of the prediction query are displayed as follows: The CustomerKey column identifies each record from the input table.
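To give a feel for the DMX that the Query view displays, the following sketch assembles a query of the same general shape from the five rows entered in the designer. It is an approximation, not the designer's exact output: the designer writes out an explicit ON clause for each column mapping, whereas this sketch uses NATURAL PREDICTION JOIN, which maps input columns to model columns by name. The helper name build_prediction_query, the dbo schema, and the use of Customers as the OPENQUERY data source name are assumptions.

```python
def build_prediction_query(model, data_source, input_view, input_columns,
                           predict_column, alias="Confidence"):
    """Assemble DMX text resembling the prediction query defined in the designer."""
    # Columns passed through from the input table (rows 1-3 in the designer).
    passthrough = ",\n".join(f"  t.[{column}]" for column in input_columns)
    return (
        "SELECT\n"
        f"{passthrough},\n"
        # Row 4: the predicted value from the mining model.
        f"  [{model}].[{predict_column}],\n"
        # Row 5: the prediction function with its alias.
        f"  PredictProbability([{model}].[{predict_column}]) AS [{alias}]\n"
        f"FROM [{model}]\n"
        "NATURAL PREDICTION JOIN\n"
        f"  OPENQUERY([{data_source}], 'SELECT * FROM {input_view}') AS t"
    )

query = build_prediction_query(
    model="Customers DT",
    data_source="Customers",                      # assumed data source name
    input_view="[dbo].[vDMLabCustomerPredict]",   # dbo schema is an assumption
    input_columns=["CustomerKey", "FirstName", "LastName"],
    predict_column="Bike Buyer",
)
print(query)
```

Comparing the printed text with what the Query view actually shows is a quick way to see which parts come from the rows you entered and which parts the designer generates for the column mappings.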
Tasks and Detailed Steps
Complete the following tasks on: SQL Server 2008 HOLs

1. Create an Analysis Services project
a. Open Business Intelligence Development Studio as Administrator if it is not already open.
b. On the File menu, point to New, and then click Project.
c. In the New Project dialog box, in the Project Types pane, click the Business Intelligence Projects folder.
d. In the Templates pane, click Analysis Services Project.
e. In the Name box, type DM Exercise 4
f. In the Location box, enter C:\SQLHOLS\Data Mining\Starter\
g. Clear the Create directory for Solution checkbox.
h. Click OK.
i. In Solution Explorer, right-click the DM Exercise 4 project, and then click Properties.
j. In the DM Exercise 4 Property Pages dialog box, under Configuration Properties, click Deployment.
k. In the right pane, in the Deployment Mode drop-down list, click Deploy All, and then click OK.

2.
a. In the Solution Explorer window, under the DM Exercise 4 project, right-click the Data Sources folder, and on the shortcut menu, click New Data Source.
b. In the Data Source Wizard dialog box, on the Welcome to the Data Source Wizard page, click Next.
Note: If the Data connections pane already includes (local).AdventureWorksDW, move to step 10.
c. On the Select how to define the connection page, ensure the Create a data source based on an existing or new connection option is selected, and then click New.
d. In the Connection Manager dialog box, click the SqlClient Data Provider from the .Net Providers folder in the Provider drop-down list at the top of the page and click OK.
e. In the Server name box, type (local)
f. Under Log on to the server, click Use Windows Authentication.
g. In the Select or enter a database name drop-down list, click AdventureWorksDW.
h. Click Test Connection, and then click OK to dismiss the message box.
k. Set the PERIODICITY_HINT Value to {12} and then click OK. The data in the data source is organized into monthly results. The PERIODICITY_HINT Value tells the algorithm that a pattern repeats itself every 12 periods (or months in this case).

6. Deploy the Analysis Services solution
a. On the Build menu, click Deploy DM Exercise 4.
b. Observe the deployment progress shown in the Deployment Progress pane. The Deployment Progress pane gives you detailed information about what happens during deployment.
Note: Analysis Services might take a while to process the data mining models.

7.
a. Click the Mining Model Viewer tab.
b. Hide Solution Explorer and any other windows that are blocking the chart information by clicking the Auto Hide icon (the pushpin in the upper-right corner).
c. The chart shows the quantity and sales amount information for bike sales by region. The values from July 2001 to March 2004 are actual values and are shown as solid lines. The values for April 2004 and beyond are predicted values and are shown as dotted lines.
d. Point to any of the lines on the chart. A tooltip will appear and display information relevant to the location on the line.
e. Click in the chart where the chart background color changes to a darker shade of grey. The mining legend now shows the values related to that point in time. The timestamp at the top of the legend tells you the month that you are looking at. For example, Timestamp: 200406 is June 2004. The most recent data represented in the mining model input data is for June of 2004. The dashed data lines to the right of the line you clicked represent predicted future Sales Amounts and Units Sold.
f. To compare Sales Amount for three different product models in Europe, under Prediction steps, in the drop-down list, select only M200 Europe: Sales Amount, R250 Europe: Sales Amount, and T1000 Europe: Sales Amount. Then click OK.
g. Click in the chart where the chart background color changes to a darker shade of grey.
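The yyyymm timestamps and the PERIODICITY_HINT value of {12} described above can be made concrete with a short sketch. It decodes a timestamp such as 200406 and finds the observation exactly one cycle earlier, which is the point a 12-period seasonal pattern would be expected to resemble. Both function names are hypothetical, and the sketch says nothing about how the time series algorithm uses the hint internally.

```python
def parse_timestamp(key):
    """Split a yyyymm key such as 200406 into (year, month)."""
    year, month = divmod(int(key), 100)
    if not 1 <= month <= 12:
        raise ValueError(f"not a yyyymm timestamp: {key}")
    return year, month

def same_period_last_cycle(key, periodicity=12):
    """Return the yyyymm key that lies `periodicity` monthly periods earlier."""
    year, month = parse_timestamp(key)
    index = year * 12 + (month - 1) - periodicity   # months counted from year 0
    return (index // 12) * 100 + (index % 12) + 1

print(parse_timestamp(200406))          # year and month for the legend example
print(same_period_last_cycle(200406))   # the same month one cycle earlier
```

With monthly data and a periodicity of 12, the comparison point for June 2004 is June 2003; a different hint would shift that comparison point accordingly.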