Text Mining Tutorial
Introduction
This tutorial presents detailed steps for performing a typical text classification task using SQL Server 2005 Beta2. The sample dataset is obtained from http://www2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html. The dataset is a small subset of USENET newsgroup postings that belong to 5 different groups. The task is to build a mining model that classifies each posting into its group. This tutorial document should be available together with an import-ready file, NGArticles.txt (or NGArticles.zip).
Create a database
1.1 In SQL Server Management Studio, connect to the local SQL server (localhost).
1.2 Create a new database and name it TDM.
Row delimiter: @@@@
Column delimiter: &&&&
Column property for ArticleText: Change DataType to DT_NTEXT
Destination:
  Server: local SQL server (localhost)
  Database: TDM
  Table: NGArticles
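To make the file layout concrete, the following is a minimal Python sketch of how text using these two delimiters could be split into rows and columns. The column order (ID, NewsGroup, ArticleText), the function name parse_articles, and the sample postings are assumptions for illustration only; the tutorial does not state the field order inside NGArticles.txt.

```python
ROW_DELIMITER = "@@@@"     # row delimiter from the import settings
COLUMN_DELIMITER = "&&&&"  # column delimiter from the import settings

# Assumed column order; the tutorial does not confirm it.
COLUMNS = ("ID", "NewsGroup", "ArticleText")


def parse_articles(raw_text):
    """Split delimited text into a list of dicts, one per article."""
    articles = []
    for row in raw_text.split(ROW_DELIMITER):
        if not row.strip():
            continue  # skip the empty chunk after a trailing row delimiter
        fields = row.split(COLUMN_DELIMITER)
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        articles.append(dict(zip(COLUMNS, (f.strip() for f in fields))))
    return articles


# Made-up sample data, not taken from the real dataset.
sample = (
    "1&&&&sci.space&&&&The shuttle launch was delayed.@@@@"
    "2&&&&rec.autos&&&&My car needs new brakes.@@@@"
)
rows = parse_articles(sample)
print(rows[0]["NewsGroup"], len(rows))
```

Multi-character delimiters such as these are presumably chosen because they are unlikely to occur inside the body of a posting, which a single comma or newline would not guarantee.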
Build a dictionary
1.2 Start Business Intelligence Development Studio with a new Integration Services project called TextDataMining. This creates a solution and an Integration Services project in it, both named TextDataMining.
1.3 Rename the Integration Services project to PrepareArticles, just for convenience.
1.4 Create a new DTS (SSIS) package.
1.5 Rename the package to BuildDictionary.dtsx.
1.6 Go to the Data Flow tab and add a new Data Flow task.
1.7 In the data flow task, add an OLE DB Source transform.
  Connection: create a new one for localhost.TDM
  Table: NGArticles
  Columns: ArticleText only
1.8 Add a Term Extraction transform and connect it from the OLE DB Source transform.
  Term Type: Noun and Noun Phrase
  Score Type: TFIDF
  Parameters: Frequency=10, Length=2
1.9 Add a Sort transform and connect it.
  Sort Term in ascending order
  Do not pass through the Score column
1.10 Add an OLE DB Destination transform and connect it.
  Use the connection: localhost.TDM
  Click New and name it Dictionary
  In Mappings, connect the column Term
1.11 Execute the package.
  It automatically enters debugging mode
  It may take a few minutes
1.12 Stop debugging.
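As a rough illustration of what the dictionary-building data flow computes, here is a simplified Python sketch of TFIDF-based term scoring. It is not the Term Extraction transform itself: it uses plain word n-grams instead of noun and noun-phrase detection, it reads Frequency=10 as a minimum occurrence count and Length=2 as a maximum number of words per term (my interpretation of the parameters), and it scores each term as frequency * log(total rows / rows containing the term). The function name build_dictionary and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def build_dictionary(texts, min_frequency=10, max_length=2):
    """Return (term, score) pairs sorted by term, scored as TF * log(N / DF)."""
    term_frequency = Counter()      # total occurrences of each term
    document_frequency = Counter()  # number of texts that contain each term
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        terms = []
        for n in range(1, max_length + 1):
            for i in range(len(words) - n + 1):
                terms.append(" ".join(words[i:i + n]))
        term_frequency.update(terms)
        document_frequency.update(set(terms))
    total = len(texts)
    scored = [
        (term, freq * math.log(total / document_frequency[term]))
        for term, freq in term_frequency.items()
        if freq >= min_frequency  # drop terms below the frequency threshold
    ]
    return sorted(scored)  # ascending by term, as the Sort step does


# Made-up sample texts: one phrase repeated often, one seen only three times.
texts = ["space shuttle launch"] * 10 + ["car engine"] * 3
dictionary = build_dictionary(texts)
print([term for term, _ in dictionary])
```

In this sample, the terms from "car engine" fall below the threshold of 10 occurrences and are left out, which is the role the frequency parameter plays in keeping the dictionary small.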
1.17-1.21
Add a Sort transform and connect it.
  Sort ID in ascending order, then Term in ascending order, no duplicates
Add an OLE DB Destination transform and connect it.
  Use the connection: localhost.TDM
  Click New and name it TermVectors
  In Mappings, make sure to connect all columns: Term, Frequency, ID
Execute the package.
  It automatically enters debugging mode
  It may take a few minutes
Stop debugging.
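The TermVectors table written by these steps holds one row per (Term, Frequency, ID) combination. A minimal Python sketch of how such rows could be produced from articles and a dictionary of terms follows. The upstream steps that generate these columns are not shown in this section, so the counting rule here (every occurrence of every dictionary term is counted, including overlapping terms) is an assumption, and build_term_vectors and the sample data are hypothetical.

```python
import re
from collections import Counter


def build_term_vectors(articles, dictionary_terms):
    """Return (Term, Frequency, ID) rows sorted by ID, then Term."""
    dictionary = set(dictionary_terms)
    # Only n-gram lengths that actually occur in the dictionary are checked.
    lengths = sorted({len(term.split()) for term in dictionary})
    rows = []
    for article_id, text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter()
        for n in lengths:
            for i in range(len(words) - n + 1):
                candidate = " ".join(words[i:i + n])
                if candidate in dictionary:
                    counts[candidate] += 1
        rows.extend((term, freq, article_id) for term, freq in counts.items())
    # Mirrors the Sort step: ID ascending, then Term ascending.
    return sorted(rows, key=lambda row: (row[2], row[0]))


# Made-up sample data.
articles = [
    (2, "Brakes and more brakes"),
    (1, "The space shuttle left; space is big"),
]
vectors = build_term_vectors(articles, ["brakes", "space", "space shuttle"])
print(vectors)
# [('space', 2, 1), ('space shuttle', 1, 1), ('brakes', 2, 2)]
```

Because each (ID, Term) pair is aggregated into a single count, the output contains no duplicate rows, which matches the "no duplicates" setting of the Sort step.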
(Note that the picture does not include the Derived Column transform built in step 4.5.)
1.24 Go to the Data Flow tab and add a new Data Flow task.
1.25 In the data flow task, add an OLE DB Source transform.
  Connection: create a new one for localhost.TDM
  Table: NGArticles
  Columns: ID, NewsGroup only
1.26 Add a Percentage Sampling transform and connect it from the OLE DB Source transform.
  Sampling rate: 70%
  Selected rows: Train sample (70%)
  Unselected rows: Test sample (30%)
1.27 Add two OLE DB Destination transforms and connect them from the Percentage Sampling transform (one from Train sample, the other from Test sample).
  Use the connection: localhost.TDM
  Click New and name them TrainArticles and TestArticles respectively
  In Mappings, make sure to connect all columns: ID, NewsGroup
1.28 Execute the package.
  It automatically enters debugging mode
1.29 Stop debugging.
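The train/test split performed by the steps above can be pictured with a short Python sketch. It assumes that each row is selected independently with probability equal to the sampling rate, so the split is only approximately 70/30; whether the Percentage Sampling transform works exactly this way is an assumption, and percentage_sample, the seed value, and the sample rows are hypothetical.

```python
import random


def percentage_sample(rows, rate=0.70, seed=0):
    """Split rows into (selected, unselected), selecting about `rate` of them."""
    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    selected, unselected = [], []
    for row in rows:
        if rng.random() < rate:
            selected.append(row)    # goes to the Train sample
        else:
            unselected.append(row)  # goes to the Test sample
    return selected, unselected


# Made-up (ID, NewsGroup) rows spread over five placeholder groups.
articles = [(i, "group%d" % (i % 5)) for i in range(1, 101)]
train, test = percentage_sample(articles)
print(len(train), len(test))
```

Every article lands in exactly one of the two outputs, so no posting used for training is also used when the models are evaluated.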
1.31 Create a Data Source that refers to the database TDM on the local SQL server.
1.32 Create a Data Source View using the data source TDM.
  Add the following tables to the DSV: TrainArticles, TestArticles, and TermVectors.
1.33 Create a Mining Structure as follows:
  Algorithm: Microsoft_Decision_Trees
  DSV to use: TDM
  Case table: TrainArticles
  Nested table: TermVectors
  Columns usage:
1.34 Right-click the model NGArticlesDM_DT and select New Mining Model to add the following two additional models:
  NGArticlesDM_NB with the Microsoft_Naive_Bayes algorithm
  NGArticlesDM_NN with the Microsoft_Logistic_Regression algorithm
1.35 Right-click each model and set the algorithm parameters as follows:
  NGArticlesDM_DT: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
  NGArticlesDM_NB: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
  NGArticlesDM_NN: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
1.36 Deploy the project by pressing F5. It may take several minutes to train all three models.
1.37 Select the Mining Accuracy tab to see the lift chart, which uses TestArticles and TermVectors to compare the classification accuracy of the three trained models.
1.38 Browse the models. Note that browsing model content may take a considerably long time because of the complexity of the models. For example, NGArticlesDM_NB and NGArticlesDM_NN involve more than 5,000 attributes (scoring/coefficients). For instance, browsing NGArticlesDM_NN took 3 minutes on a PC with a 3 GHz Xeon CPU and 2 GB of memory.
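The accuracy comparison in step 1.37 pairs each test article (the case) with its term rows (the nested table) and checks the predicted group against the actual one. The following Python sketch shows that idea as a plain fraction of correct predictions; it is not the lift chart computation, and build_cases, accuracy, toy_predict, and the sample rows are all hypothetical stand-ins for a trained model and real data.

```python
from collections import defaultdict


def build_cases(articles, term_vectors):
    """Attach each article's nested term rows to its (ID, NewsGroup) case."""
    nested = defaultdict(dict)
    for term, frequency, article_id in term_vectors:
        nested[article_id][term] = frequency
    return [
        (article_id, group, nested.get(article_id, {}))
        for article_id, group in articles
    ]


def accuracy(cases, predict):
    """Fraction of cases whose predicted group equals the actual group."""
    if not cases:
        return 0.0
    correct = sum(1 for _, group, terms in cases if predict(terms) == group)
    return correct / len(cases)


def toy_predict(terms):
    """Stand-in for a trained model: one hand-written rule on a single term."""
    return "sci.space" if "space" in terms else "rec.autos"


# Made-up test rows in the shapes of TestArticles and TermVectors.
test_articles = [(1, "sci.space"), (2, "rec.autos"), (3, "rec.autos")]
term_vectors = [("space", 2, 1), ("brakes", 1, 2), ("space", 1, 3)]
score = accuracy(build_cases(test_articles, term_vectors), toy_predict)
print(round(score, 3))  # 0.667: article 3 is misclassified
```

Running the same function with a different predict callable for each model would give one number per model, which is a coarser view than the lift chart but easy to compare.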