
A Tutorial for Text Classification using SQL Server 2005 Beta2 Data Mining

Peter Pyungchul Kim
SQL Business Intelligence, Microsoft Corporation

Introduction
This tutorial presents detailed steps for performing a typical text classification task using SQL Server 2005 Beta2. The sample dataset is obtained from http://www2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html. It is a small subset of USENET newsgroup postings belonging to five different groups, and the task is to build a mining model that classifies each posting into its group. This tutorial is distributed together with an import-ready data file, NGArticles.txt (or NGArticles.zip).

Create a database
1.1 In SQL Server Management Studio, connect to the local SQL Server instance (localhost).
1.2 Create a new database and name it TDM.
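
If you prefer a script to clicking through the UI, the following T-SQL, run against the localhost instance, creates the same database (the name TDM matches the one used throughout the rest of the tutorial):

    -- Create the working database used by the rest of the tutorial
    CREATE DATABASE TDM;
    GO
    USE TDM;
    GO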

Import News Group Articles to the database


2.1 Right-click the database TDM and select Tasks > Import Data.
    Source: NGArticles.txt (Flat File, unzipped from the NGArticles.zip provided)
    Header row delimiter: @@@@
    Check "Column names in the first data row"
    Row delimiter: @@@@
    Column delimiter: &&&&
    Column property for ArticleText: change DataType to DT_NTEXT
    Destination:
        Server: local SQL Server (localhost)
        Database: TDM
        Table: NGArticles
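
Once the import finishes, a quick sanity check in Management Studio confirms that the postings and their group labels arrived intact. The queries below assume the table landed in the dbo schema with the columns ID, NewsGroup, and ArticleText used later in this tutorial:

    -- Count the imported postings per newsgroup (the sample set contains 5 groups)
    SELECT NewsGroup, COUNT(*) AS ArticleCount
    FROM dbo.NGArticles
    GROUP BY NewsGroup
    ORDER BY NewsGroup;

    -- Spot-check that the article text was imported as DT_NTEXT and not truncated
    SELECT TOP 5 ID, NewsGroup, DATALENGTH(ArticleText) AS TextBytes
    FROM dbo.NGArticles
    ORDER BY ID;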

Build a dictionary
3.1 Start Business Intelligence Development Studio with a new Integration Services project called TextDataMining. This creates a solution and an Integration Services project inside it, both named TextDataMining.
3.2 Rename the Integration Services project to PrepareArticles, just for convenience.
3.3 Create a new DTS (SSIS) package.
3.4 Rename the package to BuildDictionary.dtsx.
3.5 Go to the Data Flow tab and add a new Data Flow task.
3.6 In the data flow task, add an OLE DB Source transform.
    Connection: create a new connection for localhost.TDM
    Table: NGArticles
    Columns: ArticleText only
3.7 Add a Term Extraction transform and connect it from the OLE DB Source transform.
    Term Type: Noun and Noun Phrase
    Score Type: TFIDF
    Parameters: Frequency=10, Length=2
3.8 Add a Sort transform and connect it.
    Sort Term in ascending order
    Don't pass through the Score column
3.9 Add an OLE DB Destination transform and connect it.
    Use the connection: localhost.TDM
    Click New and name the new table Dictionary
    In Mappings, connect the column Term
3.10 Execute the package. It automatically enters debugging mode and may take a few minutes.
3.11 Stop debugging.
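
Before moving on, it is worth checking what the Term Extraction transform produced. A minimal check, assuming the Dictionary table contains only the Term column mapped in step 3.9:

    -- How many distinct terms survived the Frequency/Length settings?
    SELECT COUNT(*) AS TermCount
    FROM dbo.Dictionary;

    -- Browse a sample of the extracted nouns and noun phrases
    SELECT TOP 20 Term
    FROM dbo.Dictionary
    ORDER BY Term;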

Build term vectors


4.1 Create a new DTS (SSIS) package.
4.2 Rename the package to BuildTermVectors.dtsx.
4.3 Go to the Data Flow tab and add a new Data Flow task.
4.4 In the data flow task, add an OLE DB Source transform.
    Connection: create a new connection for localhost.TDM
    Table: NGArticles
    Columns: ID, ArticleText only
4.5 Add a Term Lookup transform and connect it from the previous transform.
    Reference table: Dictionary
    Pass-through column: ID
    Lookup input column: ArticleText
4.6 Add a Sort transform and connect it.
    Sort ID in ascending order, then Term in ascending order; remove rows with duplicate sort values
4.7 Add an OLE DB Destination transform and connect it.
    Use the connection: localhost.TDM
    Click New and name the new table TermVectors
    In Mappings, make sure to connect all columns: Term, Frequency, ID
4.8 Execute the package. It automatically enters debugging mode and may take a few minutes.
4.9 Stop debugging.

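A quick look at the TermVectors table shows the bag-of-words representation the mining models will consume. The queries below assume the column names ID, Term, and Frequency mapped in step 4.7:

    -- Average number of distinct dictionary terms found per article
    SELECT COUNT(*) * 1.0 / COUNT(DISTINCT ID) AS AvgTermsPerArticle
    FROM dbo.TermVectors;

    -- The 20 terms that occur most often across the whole corpus
    SELECT TOP 20 Term, SUM(Frequency) AS TotalFrequency
    FROM dbo.TermVectors
    GROUP BY Term
    ORDER BY TotalFrequency DESC;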

Prepare train/test samples


5.1 Create a new DTS (SSIS) package.
5.2 Rename the package to PrepareSamples.dtsx.
5.3 Go to the Data Flow tab and add a new Data Flow task.
5.4 In the data flow task, add an OLE DB Source transform.
    Connection: create a new connection for localhost.TDM
    Table: NGArticles
    Columns: ID, NewsGroup only
5.5 Add a Percentage Sampling transform and connect it from the OLE DB Source transform.
    Sampling rate: 70%
    Selected rows: Train sample (70%)
    Unselected rows: Test sample (30%)
5.6 Add two OLE DB Destination transforms and connect them from the Percentage Sampling transform (one from the Train sample output, the other from the Test sample output).
    Use the connection: localhost.TDM
    Click New and name the new tables TrainArticles and TestArticles, respectively
    In Mappings, make sure to connect all columns: ID, NewsGroup
5.7 Execute the package. It automatically enters debugging mode.
5.8 Stop debugging.
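
Because the Percentage Sampling transform splits rows randomly, it is worth verifying that the two output tables are disjoint and roughly 70%/30% in size. A minimal check:

    -- Relative sizes of the two samples (expect roughly a 70/30 split)
    SELECT 'Train' AS SampleName, COUNT(*) AS Articles FROM dbo.TrainArticles
    UNION ALL
    SELECT 'Test', COUNT(*) FROM dbo.TestArticles;

    -- No article ID should appear in both samples; this query should return 0
    SELECT COUNT(*) AS Overlap
    FROM dbo.TrainArticles tr
    JOIN dbo.TestArticles te ON tr.ID = te.ID;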

Build/Test/Refine data mining models


6.1 Add a new Analysis Services project and name it DataMining.
6.2 Create a Data Source that refers to the database TDM on the local SQL Server.
6.3 Create a Data Source View using the data source TDM. Add the following tables to the DSV: TrainArticles, TestArticles, and TermVectors.

6.4 Create a Mining Structure as follows:
    Algorithm: Microsoft_Decision_Trees
    DSV to use: TDM
    Case table: TrainArticles
    Nested table: TermVectors
    Column usage: ID as the case key, NewsGroup as the predictable column, and Term as the key of the nested TermVectors table (input)
    Name the structure NGArticlesDM and the model NGArticlesDM_DT.
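
For reference, the model the wizard builds corresponds roughly to the DMX definition below. This is only a sketch: it shows the column usage (ID as key, NewsGroup as predictable, the nested Term column as input), assumes ID is an integer key, and omits the binding of the structure to the TrainArticles and TermVectors tables that the wizard performs for you.

    -- Rough DMX equivalent of the wizard-created decision tree model (definition only)
    CREATE MINING MODEL NGArticlesDM_DT
    (
        ID          LONG KEY,
        NewsGroup   TEXT DISCRETE PREDICT,  -- the class label to learn
        TermVectors TABLE                   -- nested table of terms per article
        (
            Term TEXT KEY
        )
    )
    USING Microsoft_Decision_Trees;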

6.5 Right-click the model NGArticlesDM_DT and select New Mining Model to add the following two additional models:
    NGArticlesDM_NB with the Microsoft_Naive_Bayes algorithm
    NGArticlesDM_NN with the Microsoft_Logistic_Regression algorithm

6.6 Right-click each model and set the algorithm parameters as follows:
    NGArticlesDM_DT: disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NB: disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NN: disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
6.7 Deploy the project by pressing F5. It may take several minutes to train all three models.
6.8 Select the Mining Accuracy Chart tab and build a lift chart using TestArticles and TermVectors to compare the classification accuracy of the three trained models.
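
The same parameter can also be supplied in DMX when a model is added to the structure. The sketch below adds the Naive Bayes model with feature selection disabled; it assumes the structure columns created by the wizard in step 6.4 keep the relational column names.

    -- Add a Naive Bayes model to the structure with automatic feature selection turned off
    ALTER MINING STRUCTURE NGArticlesDM
    ADD MINING MODEL NGArticlesDM_NB
    (
        ID,
        NewsGroup PREDICT,
        TermVectors
        (
            Term
        )
    )
    USING Microsoft_Naive_Bayes (MAXIMUM_INPUT_ATTRIBUTES = 0);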

6.9 Browse the models. Note that browsing the model content may take a considerable time because of the complexity of the models; NGArticlesDM_NB and NGArticlesDM_NN, for example, each involve more than 5,000 attributes (scores/coefficients). Browsing NGArticlesDM_NN took 3 minutes on a PC with a 3 GHz Xeon CPU and 2 GB of memory.
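
Beyond the lift chart, the trained models can be queried directly with DMX from Management Studio when connected to the Analysis Services database. The nested PREDICTION JOIN below is a sketch; it assumes the Analysis Services data source created in step 6.2 is named TDM and that the structure columns keep the relational column names.

    -- Predict the newsgroup of each test article with the decision tree model
    SELECT
        t.ID,
        Predict(NGArticlesDM_DT.NewsGroup) AS PredictedGroup
    FROM
        NGArticlesDM_DT
    PREDICTION JOIN
        SHAPE
        {
            OPENQUERY([TDM], 'SELECT ID FROM TestArticles ORDER BY ID')
        }
        APPEND
        (
            { OPENQUERY([TDM], 'SELECT Term, ID FROM TermVectors ORDER BY ID') }
            RELATE ID TO ID
        ) AS TermVectors AS t
    ON
        NGArticlesDM_DT.TermVectors.Term = t.TermVectors.Term;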

Deploying data mining models


Not covered in this tutorial at the moment.
