Text Mining Using Python

The document describes the steps for effective text data cleaning using Python. It outlines 10 steps for cleaning a sample tweet related to consumer opinions on the iPhone. The steps include escaping HTML characters, decoding data, apostrophe lookup, removing stop words, removing punctuation, removing expressions, splitting attached words, slang lookup, standardizing words, and removing URLs. The document also briefly discusses advanced cleaning techniques like grammar checking and spelling correction.

Uploaded by

Pablo Rivera


Effective Text Data Cleaning using Python

Benefits of mining for a brand?
You can do sentiment analysis to discover customers’ sentiment for a brand

You can measure brand popularity using the actively engaged tweeters

It is used to identify the pain points of customers, i.e. customer relationship management

It is widely used for predictions and forecasting

The Business Problem

Let’s say we want to find the features of the Apple iPhone that are most
popular among fans on Twitter.

What to do next?
We’ve extracted all the tweets related to consumer opinions of the iPhone.
Here’s a sample tweet on which we’ll perform data cleaning:

TWEET
“I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

Steps for Data Cleaning

STEP 01 Escaping HTML characters

Code

# Python 3: html.unescape replaces the old HTMLParser.HTMLParser().unescape
import html
tweet = html.unescape(original_tweet)

Output
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

STEP 02 Decoding data

Code

# Python 3: tweet is a str, so drop non-ASCII characters and decode back to str
tweet = tweet.encode("ascii", "ignore").decode("ascii")

Output
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome,
sooo happppppy :) http://www.apple.com”

STEP 03 Apostrophe Lookup

Code

import re
APOSTROPHES = {"'s": " is", "'re": " are"}  ## Need a huge dictionary; add further entries here
# Match contraction suffixes inside words ("you're"), not whole whitespace tokens
pattern = re.compile("|".join(re.escape(key) for key in APOSTROPHES))
tweet = pattern.sub(lambda match: APOSTROPHES[match.group(0)], tweet)

Outcome

>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

STEP 04 Removal of Stop-Words

When the analysis needs to be data driven at the word level, the
commonly occurring words (stop-words) should be removed.
One can either create a long list of stop-words or use
predefined language-specific libraries.
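
A minimal sketch of this step, assuming a small hand-written stop-word set. The names STOP_WORDS and remove_stop_words are illustrative and not from the original; a real project would more likely load a full list from a language library.

```python
# Illustrative stop-word set (an assumption); a real list would be much longer.
STOP_WORDS = {"i", "my", "you", "are", "is", "a", "an", "the", "and", "so"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form, ignoring edge punctuation, is a stop-word."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation only for the comparison, not the output.
        core = token.strip(".,!?;:\"'").lower()
        if core not in stop_words:
            kept.append(token)
    return " ".join(kept)


print(remove_stop_words("I love my iphone and you are awesome apple."))
```

The comparison is case-insensitive, so "I" and "i" are treated alike, while the kept tokens retain their original case and punctuation.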

STEP 05 Removal of Punctuation

Punctuation marks should be handled according to their priority.
For example, “.”, “,” and “?” are important punctuation marks
that should be retained, while the others should be removed.
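
One possible way to sketch this, assuming that only ASCII punctuation matters and that ".", "," and "?" are the marks to keep. KEEP and remove_punctuation are hypothetical names. Because characters such as ":", ")", "<" and "/" are also punctuation, this would strip emoticons and break URLs, so it is probably best applied after those have been dealt with.

```python
import string

# Punctuation marks treated as important enough to keep (taken from the example above).
KEEP = {".", ",", "?"}


def remove_punctuation(text, keep=KEEP):
    """Delete every ASCII punctuation character except those in `keep`."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_delete))


print(remove_punctuation("Wow!! Is this (really) the new iphone? Yes, it is."))
```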

STEP 06 Removal of Expressions

Textual data (usually speech transcripts) may contain human
expressions like [laughing], [Crying], [Audience paused]. These
expressions are usually not relevant to the content of the speech and
hence need to be removed.
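
A short sketch of how this could be done with a regular expression, assuming the expressions are always wrapped in square brackets as in the examples above. EXPRESSION and remove_expressions are illustrative names.

```python
import re

# Matches a bracketed annotation such as "[laughing]" plus any whitespace before it.
EXPRESSION = re.compile(r"\s*\[[^\[\]]*\]")


def remove_expressions(text):
    """Remove bracketed expressions from a transcript."""
    return EXPRESSION.sub("", text).strip()


print(remove_expressions("Thank you all [laughing] for coming. [Audience paused]"))
```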

STEP 07 Split Attached Words

Code
import re
tweet = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tweet)  # add a space between a lowercase letter and a following capital

Outcome
>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo
happppppy :) http://www.apple.com”

STEP 08 Slang lookup

Code
tweet = _slang_lookup(tweet)  # helper not shown here; maps slang words to their standard forms

Outcome

>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, sooo happppppy :) http://www.apple.com”
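
The slang lookup helper is not defined in this document. A minimal sketch of what such a helper might look like, assuming a plain dictionary of slang terms (SLANG and slang_lookup are hypothetical names, and the two entries are only the ones needed for the sample tweet):

```python
# Hypothetical slang dictionary; a real one would hold many more entries.
SLANG = {"luv": "love", "awsm": "awesome"}


def slang_lookup(text, slang=SLANG):
    """Replace each whitespace-separated token found in `slang` (case-insensitive)."""
    return " ".join(slang.get(token.lower(), token) for token in text.split())


print(slang_lookup("I luv my iphone & you are awsm apple."))
```

Tokens with punctuation attached (for example "awsm,") would not match in this sketch, so punctuation handling may need to come first.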

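STEP 09 Standardizing words

Code

import itertools
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))

The code caps each run of a repeated character at two, so “happppppy”
becomes “happy” and “sooo” becomes “soo”; reducing it to “so” takes a
further dictionary lookup.

Outcome

>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, so happy :) http://www.apple.com”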
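
Applied to the whole string, a character-level squeeze like this also shortens runs inside URLs ("www" would become "ww"). A hedged sketch of a token-level variant that leaves URL-like tokens untouched; squeeze_repeats and standardize are illustrative names, and the URL test is a simple prefix check rather than a full URL parser:

```python
import itertools


def squeeze_repeats(token, max_run=2):
    """Cap every run of one repeated character at `max_run` characters."""
    return "".join("".join(run)[:max_run] for _, run in itertools.groupby(token))


def standardize(text):
    """Squeeze repeated characters in each token, leaving URL-like tokens untouched."""
    out = []
    for token in text.split():
        is_url = token.startswith(("http://", "https://", "www."))
        out.append(token if is_url else squeeze_repeats(token))
    return " ".join(out)


print(standardize("sooo happppppy :) http://www.apple.com"))
```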

STEP 10 Removal of URLs

URLs and hyperlinks in text data like comments, reviews, and tweets
should be removed.
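
A rough sketch of URL removal with a regular expression. The pattern is an assumption that covers only tokens starting with "http://", "https://" or "www."; URL and remove_urls are illustrative names.

```python
import re

# Rough pattern: anything starting with http://, https:// or www. up to the next whitespace.
URL = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Delete URL-like substrings and collapse the whitespace left behind."""
    return " ".join(URL.sub("", text).split())


print(remove_urls("so happy :) http://www.apple.com"))
```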

Final cleaned tweet:

>> “I love my iphone & you are awesome apple. Display Is Awesome, so
happy!” , <3 , :)

Advanced Data Cleaning

Grammar checking
Grammar checking is mostly learning based: models are trained on
large amounts of well-formed text.
Many online tools are available for grammar correction.

Spelling correction
Natural-language text often contains misspelled words.
One can use algorithms like Levenshtein distance and
dictionary lookup, or other modules and packages, to fix these
errors.
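
To illustrate how Levenshtein distance and dictionary lookup can work together, here is a small sketch. The vocabulary, the max_distance threshold and the function names are all assumptions made for the example; production spell checkers use much larger dictionaries and word frequencies.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical vocabulary; a real dictionary would be far larger.
VOCABULARY = ["awesome", "display", "happy", "iphone", "love"]


def correct(word, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest vocabulary word within max_distance edits, else the word itself."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word


print(correct("awsome"))
print(correct("hapy"))
```

Words that are too far from every dictionary entry are returned unchanged, which avoids "correcting" names or unknown terms into unrelated words.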

Your Next Steps…

Now that the data (the tweet) is cleaned, you are ready to practice and learn the
following text mining techniques (in no particular order):

1. Framework to build a niche dictionary for text mining

http://bit.ly/1eetMw6

2. Step by Step guide to extract insights from free text

http://bit.ly/1JjslYe

3. 2014 FIFA World Cup Prediction using Twitter Mining

http://bit.ly/1kLeYSk

4. Text Mining Hack using Google API

http://bit.ly/1LDPF6c

For more resources on analytics/data science, visit

www.analyticsvidhya.com
