
Business Analytics & Text Mining Modeling Using Python: Dr. Gaurav Dixit

This document provides an introduction to modeling using Python for business analytics and text mining. It discusses prediction and evaluation techniques for text mining, focusing on topics like topic assignment. It introduces Python as a suitable platform for data science and analytics due to libraries like NumPy, pandas, and matplotlib. The course will use Python and Jupyter Notebook for text mining tasks like predictive modeling, data preparation, retrieval, clustering, and information extraction.


Business Analytics & Text Mining

Modeling Using Python


INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES


• Prediction and Evaluation


– The text mining modeling process is similar to the data mining modeling process
• Models are built from prior cases (the training partition)
• The built model is then used to predict unseen cases (the test partition)
– Model success is evaluated
• Based on performance on the test partition, which takes no part in the model building process
– This mechanism works well for most text mining scenarios
• However, there may be a few special scenarios
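
The build-then-test mechanism above can be sketched in plain Python. The tiny corpus, its labels, and the word-overlap scoring rule are all illustrative assumptions rather than the course's method; the point is only that the model is built from the training partition and scored on cases it has never seen.

```python
from collections import Counter, defaultdict

# Hypothetical labeled corpus: (text, topic) pairs invented for illustration.
docs = [
    ("stocks rally as markets close higher", "financial"),
    ("bank reports record quarterly profit", "financial"),
    ("team wins the championship final", "sports"),
    ("striker scores twice in league match", "sports"),
    ("markets fall as bank shares slide", "financial"),
    ("team loses final match on penalties", "sports"),
]

# Hold out the last two documents as the test partition.
train, test = docs[:4], docs[4:]

def build_model(cases):
    """Count how often each word appears under each topic."""
    counts = defaultdict(Counter)
    for text, topic in cases:
        counts[topic].update(text.split())
    return counts

def predict(model, text):
    """Pick the topic whose training words overlap most with the text."""
    words = text.split()
    return max(sorted(model), key=lambda topic: sum(model[topic][w] for w in words))

model = build_model(train)  # built from the training partition only
predictions = [predict(model, text) for text, _ in test]
correct = sum(p == topic for p, (_, topic) in zip(predictions, test))
accuracy = correct / len(test)  # measured on the unseen test partition
print(predictions, accuracy)
```

Because the test documents are never passed to `build_model`, the accuracy reflects performance on unseen cases rather than memorized ones.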

– Example: Topic assignment
• Assigning topics to news stories, such as financial or sports stories
• However, news stories might change over time
– News stories for the test partition should be selected with this sensitivity to publication dates in mind
» Since the model training process typically won’t account for changes over time
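
One common way to respect publication dates is to split chronologically rather than at random, so that every test story is newer than every training story. This is a minimal sketch, assuming each story carries a publication date; the records and the 75/25 cutoff are hypothetical.

```python
from datetime import date

# Hypothetical news stories: (publication date, headline, topic).
stories = [
    (date(2019, 3, 1), "bank raises interest rates", "financial"),
    (date(2019, 1, 15), "team wins opening match", "sports"),
    (date(2019, 4, 20), "markets react to budget", "financial"),
    (date(2019, 2, 10), "striker signs new contract", "sports"),
]

# Sort by publication date, then train on the oldest stories
# and test on the newest ones.
stories.sort(key=lambda story: story[0])
cutoff = int(len(stories) * 0.75)  # assumed 75/25 split
train, test = stories[:cutoff], stories[cutoff:]

latest_train_date = max(story[0] for story in train)
earliest_test_date = min(story[0] for story in test)
print(latest_train_date, earliest_test_date)
```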

– Measurement of error
• Classical measures of accuracy typically work well if all errors are to be weighted equally
• However, as in the topic assignment problem, not all errors are weighted equally
– Measures such as “recall” and “precision” are especially important in such scenarios
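
To make the difference concrete, the sketch below computes accuracy alongside precision and recall for a single topic. The actual and predicted labels are invented for illustration, and "sports" is arbitrarily treated as the topic of interest.

```python
# Hypothetical actual and predicted topics for six news stories.
actual = ["sports", "financial", "sports", "financial", "financial", "sports"]
predicted = ["sports", "sports", "sports", "financial", "financial", "financial"]

def precision_recall(actual, predicted, topic):
    """Precision and recall with one topic treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == topic and p == topic for a, p in pairs)  # true positives
    fp = sum(a != topic and p == topic for a, p in pairs)  # false positives
    fn = sum(a == topic and p != topic for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision, recall = precision_recall(actual, predicted, "sports")
print(accuracy, precision, recall)
```

Precision is the share of stories labeled "sports" that really are sports stories; recall is the share of true sports stories that the model found.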

– Other tasks, such as clustering and extraction, are
• Exploratory in nature
• Performed using unsupervised methods
• Evaluated less objectively than prediction and classification tasks


• Further Comments on Text Mining


– Text mining techniques, just like data mining techniques, borrow heavily from statistical approaches
– Selection of learning methods depends on
• Data preparation
• Experience with text and data science methods, which gives us direction
– The focus of this course is on prediction aspects


• Python as a Data Science Platform


– A general-purpose programming language
– One of the most popular interpreted programming languages
– Currently among the fastest-growing programming languages in the world, owing to
• Ease of learning
• Use in data science and artificial intelligence (AI)
• A large and active developer community

– A suitable language
• Not only for research, prototyping, and testing new ideas
• But also for building production systems
• An advantage over SAS and R, where porting to a larger production system might be required
– Expected to overtake R to become the most preferred platform for data science

– Jupyter Notebook will be used for the Python programming required in this course
– Jupyter Notebook
• An open-source web application
• Used to create and share documents that contain live code, equations, visualizations, and narrative text
• Used primarily for:
– Data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more


• Python
– This course focuses on using the Python programming language and its data-oriented library ecosystem for analytics
– Suitable for application development (a higher-productivity language)
• Due to it being an interpreted programming language
• However, Python code runs substantially slower than code in compiled languages like Java or C++

– Not well suited to highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads
• Due to the global interpreter lock (GIL) mechanism
– Prevents the interpreter from executing more than one Python instruction at a time
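
The effect of the GIL can be observed with a small timing experiment. This is only a sketch: the loop size is arbitrary, exact timings vary by machine, and builds of CPython that run without the GIL may behave differently. On a standard build, the two-thread run typically takes about as long as running the same work back to back.

```python
import threading
import time

def count_down(n):
    """A purely CPU-bound loop with no I/O."""
    while n > 0:
        n -= 1

N = 2_000_000  # arbitrary amount of work

# Run the work twice, one call after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential_seconds = time.perf_counter() - start

# Run the same two calls in two threads.
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
```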

• Python data ecosystem


– Important library packages
• NumPy, pandas, and matplotlib


• NumPy
– Short for Numerical Python
– For numerical computing in Python
– Contains
• Arrays for storing data (used as the primary data structure) and functions for manipulating data
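
A minimal sketch of a NumPy array and a few functions that manipulate it. The word-count matrix is hypothetical: rows stand for documents and columns for vocabulary terms.

```python
import numpy as np

# Hypothetical word counts: rows are documents, columns are vocabulary terms.
counts = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 0],
])

doc_lengths = counts.sum(axis=1)  # total words per document
term_totals = counts.sum(axis=0)  # total occurrences per term

# Divide each row by its document length (broadcasting) to get term frequencies.
term_freq = counts / doc_lengths[:, None]

print(counts.shape)
print(doc_lengths, term_totals)
print(term_freq)
```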

• pandas
– Name derived from panel data, an econometrics term
– For working with tabular or structured data

– Contains
• DataFrame
– A tabular, column-oriented data structure with both row and column labels
• Series
– A one-dimensional labeled array object
• Functionality to reshape, slice and dice, perform aggregations, and select subsets of data
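
The DataFrame, Series, subset selection, and aggregation described above can be sketched as follows. The table of news stories and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical table of news stories with row labels and column labels.
stories = pd.DataFrame(
    {
        "topic": ["financial", "sports", "financial", "sports"],
        "words": [120, 80, 200, 100],
    },
    index=["story1", "story2", "story3", "story4"],
)

words = stories["words"]  # a Series: one labeled column
long_stories = stories[stories["words"] > 90]  # select a subset of rows
mean_words = stories.groupby("topic")["words"].mean()  # aggregate by topic

print(type(words).__name__)
print(long_stories.index.tolist())
print(mean_words.to_dict())
```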

• matplotlib
– For producing plots and other two-dimensional data visualizations
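
A minimal plotting sketch, assuming hypothetical story counts per topic. The `Agg` backend is selected only so the code runs without a display; in Jupyter Notebook that line can be dropped and the figure appears inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical number of news stories per topic.
topics = ["financial", "sports", "politics"]
story_counts = [12, 7, 9]

fig, ax = plt.subplots()
bars = ax.bar(topics, story_counts)  # one bar per topic
ax.set_xlabel("Topic")
ax.set_ylabel("Number of stories")
ax.set_title("Stories per topic")
```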


• Python
– Other key library packages
• SciPy for scientific computing
• scikit-learn for machine learning (prediction-focused)
• statsmodels for classical statistics and econometrics (focused on statistical inference)
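
As an illustration of the prediction focus of scikit-learn, the sketch below trains a small topic classifier on bag-of-words counts. The training texts are hypothetical, and the choice of `CountVectorizer` with `MultinomialNB` is an assumption made for brevity, not necessarily the model used later in the course.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training stories and their topics.
train_texts = [
    "stocks rally as markets close higher",
    "bank reports record quarterly profit",
    "team wins the championship final",
    "striker scores twice in league match",
]
train_topics = ["financial", "financial", "sports", "sports"]

# Convert text to word counts, then fit a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_topics)

# Predict the topic of an unseen story.
predicted = model.predict(["markets fall as bank shares slide"])
print(predicted[0])
```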


• Python: Other considerations


– Integrated Development Environments (IDEs) and Text Editors
• Spyder (free), an IDE currently shipped with Anaconda
– Similar to RStudio, which was used in previous courses

– In this course, we shall be using Python 3.7 or later versions


• Course Roadmap
– Module I: General Overview of Text Mining
– Module II: Python for Analytics
– Module III: Data Preparation
– Module IV: Predictive Models for Text
– Module V: Retrieval and Clustering of Documents
– Module VI: Information Extraction
– Module VII: Conclusion

Key References

• Fundamentals of Predictive Text Mining


– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
– By Wes McKinney (2017)

Thanks…
