
Business Analytics & Text Mining Modeling Using Python

INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES

• This course follows on from my earlier courses in the Data Science area
– “Business Analytics & Data Mining Modeling Using R”
– “Business Analytics & Data Mining Modeling Using R Part II”
• In these two courses, we used numeric data for predictive analytics
– Mainly ‘structured numeric data’ was processed using data mining techniques
– Categorical variables were also processed using numeric codes
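
As a reminder of what "processed using numeric codes" can look like, here is a minimal Python sketch. The records, the field names and the choice of alphabetical integer codes are assumptions made for illustration; they are not taken from the earlier courses.

```python
# Hypothetical sample: each record mixes a numeric and a categorical measurement.
records = [
    {"income": 52000.0, "education": "graduate"},
    {"income": 31000.0, "education": "school"},
    {"income": 78000.0, "education": "postgraduate"},
    {"income": 45000.0, "education": "graduate"},
]


def build_codes(values):
    """Assign an integer code to each distinct category, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


education_codes = build_codes(r["education"] for r in records)

# Replace the category label with its numeric code in every record.
encoded = [[r["income"], education_codes[r["education"]]] for r in records]

print(education_codes)
for row in encoded:
    print(row)
```

The alphabetical ordering of the codes is arbitrary. A method that treats the codes as magnitudes would read an order into them that the categories may not have.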

• Structured Numeric Data
– Uniform measurements are taken for all the observations in the sample
• In this course, we progress towards processing unstructured data
– Text is typically described as unstructured data
– We model prediction problems using unstructured text data

• Machine learning algorithms can be employed to model prediction problems using data which could be
– Structured numerical measurements or
– Unstructured text
• This is possible because
– Text and documents can be transformed into measured values
• Each word becomes a column and each document a row of a table; a cell indicates the ‘presence’ or ‘absence’ of that word in that document
– This leads to the common representation used in data mining techniques for numerical data
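
The transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sentences are invented, and the tokenizer (lowercase alphabetic words) is one simple choice among many.

```python
import re

# Hypothetical mini-collection; the sentences are invented for illustration.
documents = [
    "Income rose overseas",
    "The company posted a job",
    "Company income and job growth",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in documents]

# Columns: every unique word in the collection, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Rows: one per document; a cell is 1 if the word is present, else 0.
matrix = []
for tokens in tokenized:
    present = set(tokens)
    matrix.append([1 if word in present else 0 for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length, so the result has the same shape as a table of numeric measurements.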

• Central themes in Text Mining and Data Mining are similar, with the following key differences
– Evaluation techniques
• Chronological order of publication
• Alternative measures of error
– Data are text and documents
• Specialized techniques may be preferred
– Techniques must be modified to work with high-dimensional data
• Tens of thousands of words and documents
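
One possible reading of the "chronological order of publication" bullet is that a model is trained on older documents and evaluated on newer ones, rather than on a random split. The sketch below assumes that reading. The documents, dates, labels and the `chronological_split` helper are all hypothetical.

```python
from datetime import date

# Hypothetical labelled documents: (publication date, text, label).
documents = [
    (date(2016, 3, 1), "company reports income", 1),
    (date(2015, 7, 9), "job fair announced", 0),
    (date(2017, 1, 20), "overseas income grows", 1),
    (date(2016, 11, 5), "new job overseas", 0),
    (date(2017, 6, 30), "company hires overseas", 0),
]


def chronological_split(docs, cutoff):
    """Train on documents published before the cutoff, test on the rest."""
    ordered = sorted(docs, key=lambda d: d[0])
    train = [d for d in ordered if d[0] < cutoff]
    test = [d for d in ordered if d[0] >= cutoff]
    return train, test


train, test = chronological_split(documents, cutoff=date(2017, 1, 1))

print("train:", [d[0].isoformat() for d in train])
print("test: ", [d[0].isoformat() for d in test])
```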

• In the related domains of ‘Natural Language Processing’ and ‘Search Engine Technology’
– Focus is on linguistic techniques
• Essence of language understanding
– These domains are moving closer to the generic machine learning paradigm
• Learning from data, whether numerical or text
• The main theme in Text Mining is
– Empirical in nature
• Mine for recurring word patterns in large text collections, or large collections of digital documents
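
As a rough illustration of mining for recurring word patterns, the sketch below counts adjacent word pairs and keeps those that appear in at least two documents. The sentences, the use of adjacent pairs as the "pattern" and the threshold of two are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini-collection; the sentences are invented for illustration.
documents = [
    "Stock trading moved online and stock trading volumes grew",
    "Online stock trading attracts new investors",
    "Buying a product online is now routine",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(tokens):
    """Return the adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


# Count each pair once per document, so the count is a document frequency.
pair_doc_freq = Counter()
for doc in documents:
    pair_doc_freq.update(set(bigrams(tokenize(doc))))

# Keep only the pairs that recur in at least two documents.
recurring = {pair: n for pair, n in pair_doc_freq.items() if n >= 2}
print(recurring)
```

No grammar or word meaning is used here; the pattern is found purely by counting.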

• How is text mining different?
– A progression from applying analytics on large data to ‘big data’
– Nowadays, most data originate in digital form due to the pervasive use of computers
• For example, the following activities are now performed electronically
– Stock trading
– Writing a book
– Buying a product online
– Digital transactions (many paper-based transactions have been replaced by paperless digital alternatives)

• Data Mining vs Text Mining
– Both are about finding valuable patterns in data
– Data mining domain
• In its maturity phase
– No significant development is expected
– Incremental development will continue
• No longer an emerging technology
• Techniques are highly developed
• Requires highly structured numeric data
– Involves extensive data preparation
• Lacks universal applicability

– Both are about learning from samples of past experience or examples
– Text mining domain
• An emerging area
• Works with large collections of documents
– Contents are readable and meaningful
– Numbers vs text
– Analytics tasks are formulated differently
• Even though many techniques are similar

• Structured data (for data mining)
– Requires data preparation involving data transformation steps
– Data collection effort might be based on careful prior design for mining
– Measurements are well-defined and recorded uniformly for every observation in the sample
– Types of variable measurements
• Continuous variables (interval, ratio) and categorical variables (nominal, ordinal)
– Finally, described in a highly structured tabular/matrix format

– A row in the tabular format is a complete example of past experience
– A column is one measurement taken uniformly for all the rows
– Creates a structured world for applications of data mining techniques
• We can operate in a typical mathematical fashion

• Unstructured Data (for text mining)
– Initial presentation is a variant of XML format
– Text is transformed into numerical data, leading to the tabular format used in data mining

– For text, a row represents a document (an example of prior experience)
– A column represents measurements taken to indicate the presence or absence of a word for all the rows
• Each row represents a document and each column a word
• Cells are filled with 1s & 0s

– This is why techniques similar to data mining can be used in text mining
• These techniques have been found to be very successful
• They work without understanding specific properties of text, such as
– The concepts of grammar or
– The meaning of words
– Example: A binary spreadsheet of words in documents

Company   Income   Job   Overseas
   0         1      0       1
   1         0      1       1
   1         1      1       0
   0         0      0       1
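
The path from an XML-style presentation of documents to rows of such a spreadsheet might look like the sketch below. The slides do not specify a schema, so the `<collection>`, `<doc>`, `<title>` and `<body>` tags and the two sample documents are hypothetical. Only the four column words are taken from the spreadsheet above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout: these tag names are invented for this sketch;
# the slides only say the initial presentation is "a variant of XML".
raw_xml = """
<collection>
  <doc id="d1"><title>Job report</title><body>Company income rose.</body></doc>
  <doc id="d2"><title>Overseas</title><body>New job overseas.</body></doc>
</collection>
"""

# The four column words come from the binary spreadsheet example.
columns = ["company", "income", "job", "overseas"]

root = ET.fromstring(raw_xml.strip())

rows = {}
for doc in root.iter("doc"):
    # itertext() walks the element and its children, yielding their text.
    text = " ".join(doc.itertext())
    present = set(re.findall(r"[a-z]+", text.lower()))
    rows[doc.get("id")] = [1 if word in present else 0 for word in columns]

for doc_id, row in rows.items():
    print(doc_id, row)
```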

• Text Mining
– Words are attributes/predictors and documents are cases/records
– Together these form a sample of data that can feed our well-known learning methods
– Machine learning techniques can be used to work with this format and process large amounts of data
• Machine learning techniques
– Can be described as statistical techniques without prior knowledge
– Unlike statistical techniques, they typically make no assumptions about the data

– For example, multiple linear regression (a statistical technique) assumes a linear relationship between Y (target variable) and the Xs (predictors)
– In machine learning, the lack of such prior assumptions is counterbalanced by massive processing of data
• Finding patterns in word combinations that are recurring and predictive
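
A toy version of "finding word combinations that are recurring and predictive" is sketched below. It reuses the four-document binary spreadsheet shown earlier. The class labels, the `pair_statistics` helper and the minimum support of two documents are hypothetical additions, and four documents are far too few for any real conclusion.

```python
from itertools import combinations

words = ["Company", "Income", "Job", "Overseas"]

# Rows of the binary spreadsheet example (one row per document).
rows = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

# Hypothetical target labels, invented for this sketch (1 = positive class).
labels = [0, 1, 1, 0]


def pair_statistics(rows, labels, words, min_support=2):
    """For each word pair seen together in at least `min_support` documents,
    return the fraction of those documents that carry the positive label."""
    stats = {}
    for i, j in combinations(range(len(words)), 2):
        matched = [label for row, label in zip(rows, labels) if row[i] and row[j]]
        if len(matched) >= min_support:
            stats[(words[i], words[j])] = sum(matched) / len(matched)
    return stats


stats = pair_statistics(rows, labels, words)
print(stats)
```

The search assumes nothing about the form of the relationship; it simply processes every pair, which is feasible only because counting is cheap.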

• Understanding text characteristics
– Given a collection of documents
• Set of attributes will be the total set of ‘unique words’ in the collection
– This set is called the dictionary
– For thousands or even millions of documents
• Dictionary will converge to a smaller number of words
– Technical documents with alphanumeric terms may lead to very large dictionaries
• Tabular layout can become too big in size to be practical
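
The growth of the dictionary, and the size of the resulting table, can be tracked with a short sketch like the one below. The four documents are invented, and the last one deliberately contains made-up alphanumeric part codes to mimic a technical document.

```python
import re

# Hypothetical mini-collection; the last document mimics technical text.
documents = [
    "stock trading is electronic",
    "buying a product online is electronic",
    "writing a book is electronic",
    "part A7X9 fits model B2200",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


dictionary = set()
sizes = []  # dictionary size after each document is added
for doc in documents:
    dictionary.update(tokenize(doc))
    sizes.append(len(dictionary))

# A full table stores one cell per (document, word) combination,
# but only the cells for words actually present hold a 1.
dense_cells = len(documents) * len(dictionary)
nonzero_cells = sum(len(set(tokenize(doc))) for doc in documents)

print("dictionary size after each document:", sizes)
print("cells in full table:", dense_cells)
print("cells equal to 1:", nonzero_cells)
```

Most cells are 0, so storing only the words present in each document is one common way to keep a large table manageable.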

• Text mining problems
– Information Retrieval
• Business Problem: Document matcher (online or device)
– Given a large collection of documents, finding relevant documents
– Analytics Component
» Task is to retrieve the relevant documents based on the best matches of the input document with the collection of documents
» The new document is compared to all the other rows (documents), and the most similar rows and their associated documents are the answers
• Similar to a search engine function
– A few words are presented, and these words are matched to others
– Best matches are presented as the responses
• Based on measuring similarity as in nearest-neighbor methods
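
The document matcher can be sketched as a nearest-neighbor lookup over word sets. The slides do not name a similarity measure, so the Jaccard measure used here is an assumption, as are the collection, the query and the `retrieve` helper.

```python
import re

# Hypothetical collection; the texts are invented for illustration.
collection = {
    "doc1": "company income rose this quarter",
    "doc2": "job openings overseas for engineers",
    "doc3": "overseas income of the company",
}


def word_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, collection, top_k=2):
    """Compare the query with every document and return the best matches."""
    query_words = word_set(query)
    scored = [
        (jaccard(query_words, word_set(text)), doc_id)
        for doc_id, text in collection.items()
    ]
    # Highest similarity first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


results = retrieve("company income overseas", collection)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Comparing the query against every row is the simplest form of the idea; it becomes slow as the collection grows.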

Key References

• Fundamentals of Predictive Text Mining
– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
– By Wes McKinney (2017)

Thanks…
