Lec1 PDF

This document introduces a course on business analytics and text mining using Python. It begins by noting that previous courses focused on structured numeric data using R, while this course will process unstructured text data. It discusses how text can be transformed into numeric values to apply machine learning algorithms. Key differences between text mining and data mining are identified, such as text mining working with large collections of documents rather than structured data. Machine learning techniques can be applied to text by creating a tabular format with words as attributes and documents as records.

Uploaded by

Arvind Sarvesh


Business Analytics & Text Mining Modeling Using Python

INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES


• This course follows my earlier courses in the data science area
– “Business Analytics & Data Mining Modeling Using R”
– “Business Analytics & Data Mining Modeling Using R Part II”
• In these two courses, we used numeric data for predictive analytics
– Mainly ‘structured numeric data’ was processed using data mining techniques
– Categorical variables were also processed using numeric codes


• Structured Numeric Data
– Uniform measurements are taken for all the observations in the sample
• In this course, we progress towards processing unstructured data
– Text is typically described as unstructured data
– We model prediction problems using unstructured text data


• Machine learning algorithms can be employed to model prediction problems using data which could be
– Structured numerical measurements or
– Unstructured text
• This is possible because
– Text and documents can be transformed into measured values
• Where the ‘presence’ or ‘absence’ of words (the columns of the tabular format) can be indicated against the various documents (the rows)
– This leads to the common representation used in data mining techniques for numerical data


• Central themes in Text Mining and Data Mining are similar, with the following key differences
– Evaluation techniques
• Chronological order of publication
• Alternative measures of error
– Data are text and documents
• Specialized techniques may be preferred
– Techniques must be modified to work with high-dimensional data
• Tens of thousands of words and documents


• In the related domains of ‘Natural Language Processing’ and ‘Search Engine Technology’
– Focus is on linguistic techniques
• Essence of language understanding
– These domains are becoming closer to the generic machine learning paradigm
• Learning from data, whether numerical or text
• Main theme in Text Mining is
– Empirical in nature
• Mine for recurring word patterns in large text collections, or large collections of digital documents


• How is text mining different?
– A progression from applying analytics on large data to ‘big data’
– Nowadays, most data originate in digital form due to the pervasive use of computers
• For example, the following activities are now performed electronically
– Stock trading
– Writing a book
– Buying a product online
– Digital transactions (many paper-based transactions have been replaced by paperless digital alternatives)


• Data Mining vs Text Mining

– Both are about finding valuable patterns in data
– Data mining domain
• In its maturity phase
– No significant development is expected
– Incremental development will continue
• No longer an emerging technology
• Techniques are highly developed
• Requires highly structured numeric data
– Involves extensive data preparation
• Lacks universal applicability


• Data Mining vs Text Mining

– Both are about learning from samples of past experience or examples
– Text mining domain
• An emerging area
• Works with large collections of documents
– Contents are readable and meaningful

– Numbers vs text
– Analytics tasks are formulated differently
• Even though many techniques are similar


• Structured data (for data mining)
– Requires data preparation involving data transformation steps
– Data collection effort might be based on careful prior design for mining
– Measurements are well-defined and recorded uniformly for every observation in the sample
– Types of variable measurements
• Continuous variables (interval, ratio) and categorical variables (nominal, ordinal)
– Finally, described in a highly structured tabular/matrix format


• Structured data (for data mining)

– A row in the tabular format is a complete example of past experience
– A column is one measurement taken uniformly for all the rows
– Creates a structured world for applications of data mining techniques
• We can operate in a typical mathematical fashion

• Unstructured Data (for text mining)

– Initial presentation is a variant of the XML format
– Text is transformed into numerical data, leading to the tabular format used in data mining


• Unstructured Data (for text mining)
– For text, a row represents a document (an example of prior experience)
– A column represents measurements taken to indicate the presence or absence of a word for all the rows
• Each row represents a document and each column a word
• Cells are filled with 1s & 0s


• Unstructured Data (for text mining)

– This is why techniques similar to data mining can be used in text mining
• These techniques have been found to be very successful, even without understanding specific properties of text such as
– The concepts of grammar or
– The meaning of words

– Example: A binary spreadsheet of words in documents


Company  Income  Job  Overseas
0        1       0    1
1        0       1    1
1        1       1    0
0        0       0    1
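
A spreadsheet of this kind can be produced from raw text in a few lines of Python. The sketch below is only an illustration: the four documents are invented so that they reproduce the 1s and 0s shown above, and the tokenizer (lowercase alphabetic words) and the helper names `tokenize` and `binary_matrix` are assumptions, not part of the course material.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_matrix(documents, vocabulary):
    """Return one row per document; a cell is 1 if the word occurs, else 0."""
    rows = []
    for doc in documents:
        words = set(tokenize(doc))
        rows.append([1 if term in words else 0 for term in vocabulary])
    return rows

# Hypothetical documents, invented for illustration only.
documents = [
    "Income rose sharply for the overseas division",
    "The company posted a new job overseas",
    "The company says income from each job is stable",
    "Overseas travel resumed this year",
]
# Column order follows the spreadsheet header.
vocabulary = ["company", "income", "job", "overseas"]

matrix = binary_matrix(documents, vocabulary)
for row in matrix:
    print(row)
```

Each printed row corresponds to one document and each position to one word, which is the same layout a data mining technique expects for numeric data.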


• Text Mining
– Words are attributes/predictors and documents are cases/records
– Together these form a sample of data that can feed our well-known learning methods
– Machine learning techniques can be used to work with this format and process large amounts of data
• Machine learning techniques
– Can be described as statistical techniques without prior knowledge
– Unlike classical statistical techniques, they typically make no assumptions about the data


• Machine learning techniques
– By contrast, a statistical technique such as multiple linear regression assumes a linear relationship between Y (target variable) and the Xs (predictors)
– Machine learning techniques counterbalance the lack of such prior knowledge with massive processing of data
• Finding patterns in word combinations that are recurring and predictive


• Understanding text characteristics
– Given a collection of documents
• Set of attributes will be the total set of ‘unique words’ in the collection
– Called the dictionary
– For thousands or even millions of documents
• Dictionary will converge to a smaller number of words
– Technical documents with alphanumeric terms may lead to very large dictionaries
• Tabular layout can become too big in size to be practical
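
As a minimal sketch of how a dictionary might be collected, the code below gathers the unique words of a small collection. The documents, the helper name `build_dictionary`, and the choice to treat lowercase letter-and-digit runs as words are all assumptions made for illustration.

```python
import re

def build_dictionary(documents):
    """Collect the sorted set of unique lowercase tokens across all documents."""
    unique_words = set()
    for doc in documents:
        # Letters and digits are kept together so terms like "x200" survive.
        unique_words.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return sorted(unique_words)

# Hypothetical documents, invented for illustration only.
documents = [
    "The company reported income growth",
    "The company opened a job overseas",
    "Part X200 replaces part X100",
]

dictionary = build_dictionary(documents)
print(len(dictionary), dictionary)
```

Words shared by the first two documents ("the", "company") are counted once, while the alphanumeric part codes in the third document each add a new entry, which hints at why technical documents can inflate the dictionary.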


• Text mining problems
– Information Retrieval
• Business Problem: Document matcher (online or device)
– Given a large collection of documents, finding relevant documents
– Analytics Component
» Task is to retrieve the relevant documents based on the best matches of input document with the collection of documents
» New document is compared to all the other rows (documents), and the most similar rows and their associated documents are the answers
• Similar to a search engine function
– A few words are presented, and these words are matched to others
– Best matches are presented as the responses
• Based on measuring similarity as in nearest-neighbor methods
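
The matching step described above can be sketched as a nearest-neighbor lookup over word sets. The slides do not name a specific similarity measure, so Jaccard similarity (shared words divided by all distinct words) is used here as an assumption; the collection, the query, and the helper names are likewise invented for illustration.

```python
import re

def word_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, collection, k=2):
    """Compare the query to every document and return the top k (score, index) pairs."""
    q = word_set(query)
    scored = [(jaccard(q, word_set(doc)), i) for i, doc in enumerate(collection)]
    # Highest similarity first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

# Hypothetical collection and query, invented for illustration only.
collection = [
    "stock trading volumes rose today",
    "buying a product online is easy",
    "online stock trading platforms",
]

results = retrieve("online stock trading", collection, k=2)
for score, index in results:
    print(round(score, 3), collection[index])
```

The query is compared to every row, as in the slide, and the most similar documents are returned first. Other similarity measures could be swapped in without changing the overall structure.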

Key References

• Fundamentals of Predictive Text Mining
– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
– By Wes McKinney (2017)

Thanks…
