TWEET PREPROCESSING!

Basic Tweet Preprocessing in Python

Learn how to preprocess tweets using Python

Parthvi Shah · May 19 · 5 min read

Image: https://hdqwalls.com/astronaut-hanging-on-moon-wallpaper

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions in this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Just to give you a little background as to why I am preprocessing tweets: given the current situation as of May 2020, I am interested in the political discourse of the US Governors with respect to the ongoing pandemic. I would like to analyse how the two parties, Republican and Democratic, reacted to the given situation, COVID-19. What were their main goals at this time? Who focused more on what? What did they care about the most?

After collecting tweets from all the state Governors, starting from day one of each state's first COVID-19 case, we merged them into a DataFrame (How to merge various JSON files into a DataFrame) and performed preprocessing.

We had a total of ~30,000 tweets. A tweet contains a lot of opinion about the data it represents. Raw tweets without preprocessing are highly unstructured and contain redundant information. To overcome these issues, we preprocessed the tweets in multiple steps.

Almost every social media site is known for the topics it represents in the form of hashtags. Hashtags played a particularly important part in our case, since we were interested in #Covid19, #Coronavirus, #StayHome, #InThisTogether, etc. Hence, the first step was forming a separate feature based on the hashtag values and segmenting them.

1. Hashtag Extraction using Regex


A list of all hashtags is added to a new column as a new feature 'hashtag':

import re

tweets['hashtag'] = tweets['tweet_text'].apply(lambda x:
                        re.findall(r"#(\w+)", x))
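
For a quick sanity check on a single string (the tweet below is made up):

sample = "Wear a mask and stay safe! #StayHome #InThisTogether https://t.co/abc"
print(re.findall(r"#(\w+)", sample))
# ['StayHome', 'InThisTogether']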

After Hashtag Extraction

However, hashtags with more than one word had to be segmented. We segmented those hashtags into n words using the ekphrasis library.

#installing ekphrasis
!pip install ekphrasis

After installation, I selected a segmenter built on a Twitter corpus:

from ekphrasis.classes.segmenter import Segmenter

#segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")
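
A quick look at what the segmenter does (outputs shown as comments; exact segmentations depend on the Twitter word statistics ekphrasis ships with):

print(seg_tw.segment("InThisTogether"))  # in this together
print(seg_tw.segment("StayHome"))        # stay home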

The most relevant tweet preprocessor I found was tweet-preprocessor, a tweet preprocessing library in Python. It deals with:

URLs

Mentions

Reserved words (RT, FAV)

Emojis

Smileys

#installing tweet-preprocessor
!pip install tweet-preprocessor
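
To see what the library does before wiring it into the DataFrame, a minimal sketch (the tweet is made up; note that with default options p.clean() also strips hashtags, which is fine here because we already extracted them into their own column):

import preprocessor as p

print(p.clean("Wear a mask! #StayHome https://t.co/abc 😷"))
# 'Wear a mask!'

If you only want specific elements removed, p.set_options(p.OPT.URL, p.OPT.MENTION) restricts cleaning to those.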

2. Text Cleaning (URLs, Mentions, etc.)

The cleaned tweets (after removal of URLs and mentions) are added to a new column as a new feature 'text'. Cleaning is done using the tweet-preprocessor package.

import preprocessor as p

#forming a separate feature for cleaned tweets
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)
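
Equivalently, and a bit more idiomatic in pandas, the loop can be replaced with a single apply (a sketch, assuming the raw text lives in 'tweet_text' as above):

tweets['text'] = tweets['tweet_text'].apply(p.clean)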

3. Tokenization, Removal of Digits, Stop Words and Punctuation

Further preprocessing of the new feature 'text'. NLTK (Natural Language Toolkit) is one of the best libraries for preprocessing text data.

#important libraries for preprocessing using NLTK
import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

Remove digits and lowercase the text (this makes it easier to deal with):

data = data.astype(str).str.replace(r'\d+', '', regex=True)
lower_text = data.str.lower()

Remove punctuation:

def remove_punctuation(words):
    new_words = []
    for word in words:
        # keep only word characters and whitespace
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
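
For example, on an already-tokenized list (made-up tokens):

print(remove_punctuation(['stay', 'home', '!', "it's", 'day-1']))
# ['stay', 'home', 'its', 'day1']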

Lemmatization + tokenization, using the built-in TweetTokenizer():

lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = TweetTokenizer()

def lemmatize_text(text):
    # tokenize with the tweet-aware tokenizer, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
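
A quick check (WordNetLemmatizer defaults to noun lemmas, so plurals are reduced while verb forms like 'issued' pass through unchanged):

print(lemmatize_text("the governors issued new orders"))
# ['the', 'governor', 'issued', 'new', 'order']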

The last preprocessing step is removing stop words. There is a predefined stop-word list for English; however, you can modify it, for example by simply adding your own words to it (see the sketch after this snippet).

stop_words = set(stopwords.words('english'))

tweets['text'] = tweets['text'].apply(lambda x: [item for item in x
                                                 if item not in stop_words])
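
Since stop_words is a plain Python set, extending it with domain-specific noise words is one line (the words below are just examples):

stop_words.update(['amp', 'rt', 'via'])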

4. Word Cloud

Frequency distribution of the segmented hashtags. After the preprocessing steps, we excluded all place names and abbreviations from the tweets, because they acted as leakage variables, and then computed a frequency distribution of the most frequently occurring hashtags and created a word cloud:

This was quite expected.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

#Frequency of words
fdist = FreqDist(tweets['Segmented#'])

#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
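
Note that FreqDist here counts each full segmented-hashtag string as one item. If you would rather count individual words across all segmented hashtags, a small sketch:

from itertools import chain

word_freq = FreqDist(chain.from_iterable(
    s.split() for s in tweets['Segmented#'].dropna()))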

The final dataset:

The final code:

import pandas as pd
import numpy as np
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re, string, unicodedata
import nltk
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

!pip install ekphrasis
!pip install tweet-preprocessor

import preprocessor as p

# 1. hashtag extraction
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x:
                        re.findall(r"#(\w+)", x))

# 2. text cleaning (URLs, mentions, etc.)
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)

# 3. tokenization, removal of digits and punctuation
def preprocess_data(data):
    # removes numbers
    data = data.astype(str).str.replace(r'\d+', '', regex=True)
    lower_text = data.str.lower()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    w_tokenizer = TweetTokenizer()

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w
                in w_tokenizer.tokenize(text)]

    def remove_punctuation(words):
        new_words = []
        for word in words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        return new_words

    words = lower_text.apply(lemmatize_text)
    words = words.apply(remove_punctuation)
    return pd.DataFrame(words)

pre_tweets = preprocess_data(tweets['text'])
tweets['text'] = pre_tweets

# stop-word removal
stop_words = set(stopwords.words('english'))
tweets['text'] = tweets['text'].apply(lambda x: [item for item in x
                                                 if item not in stop_words])

# hashtag segmentation
from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")
for i in range(len(tweets)):
    if tweets['hashtag'][i] != []:
        listToStr1 = ' '.join([str(elem) for elem in tweets['hashtag'][i]])
        tweets.loc[i, 'Segmented#'] = seg_tw.segment(listToStr1)

# 4. word cloud
#Frequency of words
fdist = FreqDist(tweets['Segmented#'])
#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Hope I helped y’all.

Text classification in general works better if the text is preprocessed well. Do give it some extra time; it will all be worth it in the end.

