Text Mining Tutorial

This document provides step-by-step instructions for building several text classification models using SQL Server 2005 Data Mining tools. It describes importing newsgroup articles data, building a term dictionary and term vectors, preparing training and test samples, and building decision tree, naive bayes, and neural network models to classify newsgroup articles. The models are evaluated based on their accuracy on the test data.

Uploaded by

Ngan Nguyen


A Tutorial for Text Classification using SQL Server 2005 Beta2 Data Mining

Peter Pyungchul Kim SQL Business Intelligence Microsoft Corporation

Introduction
This tutorial presents detailed steps for performing a typical text classification task using SQL Server 2005 Beta2. The sample dataset is obtained from http://www2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html. The dataset is a small subset of USENET newsgroup postings that belong to 5 different groups. The task is to build a mining model that classifies each posting into its group. This tutorial document should be distributed together with an import-ready file, NGArticles.txt (or NGArticles.zip).

Create a database
1.1 In SQL Server Management Studio, connect to the local SQL Server (localhost).
1.2 Create a new database and name it TDM.

Import News Group Articles to the database

1.1 Right-click the database, TDM, and select Task, then Import.
    Source: NGArticles.txt (Flat File, unzipped from the provided NGArticles.zip)
        Header row delimiter: @@@@
        Check "Column names in the first data row"
        Row delimiter: @@@@
        Column delimiter: &&&&
        Column property for ArticleText: change DataType to DT_NTEXT
    Destination:
        Server: local SQL Server (localhost)
        Database: TDM
        Table: NGArticles
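The delimiter settings above imply a simple record layout, which can be sanity-checked outside the import wizard. The following is a minimal sketch, assuming the file is plain text in which @@@@ ends every row (including the header row) and &&&& separates columns. The function name parse_articles and the two sample rows are invented for illustration and are not taken from NGArticles.txt.

```python
ROW_DELIM = "@@@@"   # row delimiter from the import settings
COL_DELIM = "&&&&"   # column delimiter from the import settings

def parse_articles(raw: str) -> list[dict[str, str]]:
    """Split raw text into records; the first row supplies the column names."""
    rows = [r.strip() for r in raw.split(ROW_DELIM) if r.strip()]
    header = [name.strip() for name in rows[0].split(COL_DELIM)]
    records = []
    for row in rows[1:]:
        values = [v.strip() for v in row.split(COL_DELIM)]
        records.append(dict(zip(header, values)))
    return records

# Hypothetical sample rows (not real data from the tutorial's file).
sample = (
    "ID&&&&NewsGroup&&&&ArticleText@@@@"
    "1&&&&group_a&&&&The launch was delayed.@@@@"
    "2&&&&group_b&&&&My engine stalls,\nespecially when cold.@@@@"
)
articles = parse_articles(sample)
print(len(articles), articles[0]["NewsGroup"])
```

Multi-character delimiters are presumably used because article text can contain commas, tabs, and line breaks; the second sample row keeps its embedded newline intact.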

Build a dictionary
1.2 Start Business Intelligence Development Studio with a new Integration Services project called TextDataMining. This creates a solution and an Integration Services project in it, both named TextDataMining.
1.3 Rename the Integration Services project to PrepareArticles, just for convenience.
1.4 Create a new DTS (SSIS) package.
1.5 Rename the package to BuildDictionary.dtsx.
1.6 Go to the Data Flow tab and add a new Data Flow task.
1.7 In the data flow task, add an OLE DB Source transform.
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ArticleText only
1.8 Add a Term Extraction transform and connect it from the OLE DB Source transform.
    Term Type: Noun and Noun Phrase
    Score Type: TFIDF
    Parameters: Frequency=10, Length=2
1.9 Add a Sort transform and connect it.
    Sort Term in ascending order
    Do not pass through the Score column
1.10 Add an OLE DB Destination transform and connect it.
    Use the connection: localhost.TDM
    Click New and name it Dictionary
    In Mappings, connect the column, Term
1.11 Execute the package.
    It automatically enters debugging mode
    It may take a few minutes
1.12 Stop debugging.
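To make the dictionary-building step more concrete, here is a rough sketch of what the data flow computes. It is not the Term Extraction transform itself: it assumes that Frequency is a minimum occurrence count and Length is the maximum number of words per term, it uses plain word n-grams instead of noun and noun-phrase detection, and it scores terms as frequency times log(number of articles / number of articles containing the term). The helper names candidate_terms and build_dictionary are hypothetical.

```python
import math
import re
from collections import Counter

def candidate_terms(text: str, max_len: int = 2) -> list[str]:
    """Return all word n-grams of up to max_len words (no part-of-speech filtering)."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms

def build_dictionary(texts, min_frequency=10, max_len=2):
    """Return (term, score) pairs sorted by term, keeping only frequent terms."""
    tf = Counter()   # total occurrences of each term
    df = Counter()   # number of texts containing each term
    for text in texts:
        terms = candidate_terms(text, max_len)
        tf.update(terms)
        df.update(set(terms))
    n_docs = len(texts)
    scored = {
        term: count * math.log(n_docs / df[term])
        for term, count in tf.items()
        if count >= min_frequency
    }
    return sorted(scored.items())   # ascending by term, like the Sort step

# Tiny hypothetical corpus; a low threshold is used so something survives.
dictionary = build_dictionary(["red car", "red bike", "blue car"], min_frequency=2)
print([term for term, _ in dictionary])
```

Only the term itself is written to the Dictionary table in the tutorial; the score is kept here just to show where the TFIDF setting would enter.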

Build term vectors

1.13 Create a new DTS (SSIS) package.
1.14 Rename the package to BuildTermVectors.dtsx.
1.15 Go to the Data Flow tab and add a new Data Flow task.
1.16 In the data flow task, add an OLE DB Source transform.
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, ArticleText only
1.17 Add a Term Lookup transform and connect it from the previous transform.
    Reference table: Dictionary
    PassThru column: ID
    Lookup input column: ArticleText
1.18 Add a Sort transform and connect it.
    Sort ID in ascending order, then Term in ascending order, no duplicates
1.19 Add an OLE DB Destination transform and connect it.
    Use the connection: localhost.TDM
    Click New and name it TermVectors
    In Mappings, make sure to connect all columns: Term, Frequency, ID
1.20 Execute the package.
    It automatically enters debugging mode
    It may take a few minutes
1.21 Stop debugging.

(Note that the picture doesn't include the Derived Column transform built in step 4.5.)
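The shape of the TermVectors table can be illustrated with a short sketch: for every article, count how often each dictionary term occurs and emit one (ID, Term, Frequency) row per match, sorted by ID and then Term. This is an approximation under stated assumptions, not the Term Lookup transform: it matches lowercase word n-grams of up to two words and counts overlapping terms independently, which may differ from how the transform treats a term contained in a longer term. The function name term_vectors and the sample data are hypothetical.

```python
import re
from collections import Counter

def term_vectors(articles, dictionary, max_len=2):
    """articles: iterable of (article_id, text). Returns sorted (ID, Term, Frequency) rows."""
    vocab = set(dictionary)
    rows = []
    for article_id, text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                term = " ".join(words[i:i + n])
                if term in vocab:
                    counts[term] += 1
        rows.extend((article_id, term, freq) for term, freq in counts.items())
    return sorted(rows)   # ID ascending, then Term ascending

# Hypothetical articles and dictionary terms.
vectors = term_vectors(
    [(2, "Red car, red bike"), (1, "blue car")],
    ["car", "red", "red car"],
)
print(vectors)
```

Each article contributes only the terms it actually contains, so the table is a sparse representation of the term vectors and is used later as a nested table keyed by ID.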

Prepare train/test samples

1.22 Create a new DTS (SSIS) package.
1.23 Rename the package to PrepareSamples.dtsx.
1.24 Go to the Data Flow tab and add a new Data Flow task.
1.25 In the data flow task, add an OLE DB Source transform.
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, NewsGroup only
1.26 Add a Percentage Sampling transform and connect it from the OLE DB Source transform.
    Sampling rate: 70%
    Selected rows: Train sample (70%)
    Unselected rows: Test sample (30%)
1.27 Add two OLE DB Destination transforms and connect them from the Percentage Sampling transform (one from Train sample, the other from Test sample).
    Use the connection: localhost.TDM
    Click New and name them TrainArticles and TestArticles respectively
    In Mappings, make sure to connect all columns: ID, NewsGroup
1.28 Execute the package.
    It automatically enters debugging mode
1.29 Stop debugging.
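The 70/30 split can be pictured as a random draw per row, with selected rows going to the train sample and unselected rows to the test sample. The sketch below assumes independent per-row draws, so the resulting sizes are only approximately 70% and 30%; whether the Percentage Sampling transform works exactly this way is an assumption. The function name percentage_split and the seed parameter are illustrative.

```python
import random

def percentage_split(rows, rate=0.70, seed=0):
    """Route each row to the selected or unselected output with probability `rate`."""
    rng = random.Random(seed)   # fixed seed makes the split repeatable
    selected, unselected = [], []
    for row in rows:
        (selected if rng.random() < rate else unselected).append(row)
    return selected, unselected

# Hypothetical article IDs standing in for the rows of NGArticles.
train_ids, test_ids = percentage_split(range(1000), rate=0.70)
print(len(train_ids), len(test_ids))
```

Because every row lands in exactly one output, the train and test samples never overlap, which keeps the later accuracy comparison honest.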

Build/Test/Refine data mining models

1.30 Add a new Analysis Services project and name it DataMining.
1.31 Create a Data Source that refers to the database, TDM, in the local SQL Server.
1.32 Create a Data Source View using the data source, TDM. Add the following tables to the DSV: TrainArticles, TestArticles, and TermVectors.

1.33 Create a Mining Structure as follows:
    Algorithm: Microsoft_Decision_Trees
    DSV to use: TDM
    Case table: TrainArticles
    Nested table: TermVectors
    Columns usage:
    Name the structure NGArticlesDM and the model NGArticlesDM_DT

1.34 Right-click the model, NGArticlesDM_DT, and select New Mining Model to add the following two additional models:
    NGArticlesDM_NB with the Microsoft_Naive_Bayes algorithm
    NGArticlesDM_NN with the Microsoft_Logistic_Regression algorithm
1.35 Right-click each model and set the algorithm parameters as follows:
    NGArticlesDM_DT: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NB: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NN: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
1.36 Deploy the project by pressing F5. It may take several minutes to train all three models.
1.37 Select the Mining Accuracy tab to see the lift chart, using TestArticles and TermVectors to compare the classification accuracy of the three trained models.

1.38 Browse the models. Note that browsing the model content may take a considerable amount of time because of the complexity of the models. For example, NGArticlesDM_NB and NGArticlesDM_NN each involve more than 5,000 attributes (scores/coefficients). Browsing NGArticlesDM_NN took 3 minutes on a PC with a 3GHz Xeon CPU and 2GB of memory.
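For readers who want to see the idea behind training on TrainArticles and scoring on TestArticles, the sketch below classifies term vectors with a textbook multinomial naive Bayes using add-one smoothing and reports plain accuracy. It is not the Microsoft_Naive_Bayes algorithm and does not reproduce the lift chart; the function names, labels, and tiny data set are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(cases):
    """cases: list of (label, {term: frequency}) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, vector in cases:
        class_counts[label] += 1
        term_counts[label].update(vector)
        vocab.update(vector)
    return class_counts, term_counts, vocab

def predict_nb(model, vector):
    """Return the label with the highest log-posterior for one term vector."""
    class_counts, term_counts, vocab = model
    total_cases = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_cases in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_cases / total_cases)          # class prior
        for term, freq in vector.items():
            if term not in vocab:
                continue                                  # ignore unseen terms
            p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += freq * math.log(p)                   # add-one smoothed likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, cases):
    """Fraction of cases whose predicted label matches the true label."""
    correct = sum(predict_nb(model, vector) == label for label, vector in cases)
    return correct / len(cases)

# Hypothetical train and test cases (label, term vector).
train = [
    ("group_a", {"car": 2, "engine": 1}),
    ("group_a", {"car": 1}),
    ("group_b", {"orbit": 2, "launch": 1}),
    ("group_b", {"orbit": 1}),
]
test = [("group_a", {"engine": 1, "car": 1}), ("group_b", {"launch": 2})]
model = train_nb(train)
print(accuracy(model, test))
```

Evaluating on held-out test cases rather than the training cases is what makes the comparison between the three models meaningful.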

Deploy data mining models

Not covered in this tutorial at this moment.
