TWEET PREPROCESSING!

Basic Tweet Preprocessing in Python

Learn how to preprocess tweets using Python

Parthvi Shah · May 19 · 5 min read

Image: https://hdqwalls.com/astronaut-hanging-on-moon-wallpaper

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions in this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Just to give you a little background as to why I am preprocessing tweets: given the current situation as of May 2020, I am interested in the political discourse of the US Governors with respect to the ongoing pandemic. I would like to analyse how the two parties, Republican and Democratic, reacted to the given situation, COVID-19. What were their main goals at this time? Who focused more on what? What did they care about the most?

After collecting tweets from all the state Governors, starting from day one of each state's first COVID-19 case, we merged them into a DataFrame (How to merge various JSON files into a DataFrame) and performed preprocessing.

We had a total of ~30,000 tweets. A tweet contains a lot of opinion about the data it represents. Raw tweets without preprocessing are highly unstructured and contain redundant information. To overcome these issues, we preprocessed the tweets in multiple steps.

Almost every social media site is known for the topics it represents in the form of hashtags. Hashtags played a particularly important part in our case, since we were interested in #Covid19, #Coronavirus, #StayHome, #InThisTogether, etc. Hence, the first step was forming a separate feature based on the hashtag values and segmenting them.

1. Hashtag Extraction using Regex


A list of all hashtags is added to a new column as a new feature 'hashtag':

import re

tweets['hashtag'] = tweets['tweet_text'].apply(lambda x:
                        re.findall(r"#(\w+)", x))
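
For a quick sanity check on a single string (the tweet below is made up):

sample = "Wear a mask and stay safe! #StayHome #InThisTogether https://t.co/abc"
print(re.findall(r"#(\w+)", sample))
# ['StayHome', 'InThisTogether']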

After Hashtag Extraction

However, hashtags with more than one word had to be segmented. We segmented those hashtags into n words using the ekphrasis library.

#installing ekphrasis
!pip install ekphrasis

After installation, I selected a segmenter built on a Twitter corpus:

from ekphrasis.classes.segmenter import Segmenter

#segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")
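
A quick look at what the segmenter does (outputs shown as comments; exact segmentations depend on the Twitter word statistics ekphrasis ships with):

print(seg_tw.segment("InThisTogether"))  # in this together
print(seg_tw.segment("StayHome"))        # stay home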

The most relevant tweet preprocessor I found was tweet-preprocessor, a tweet preprocessing library in Python. It deals with:

URLs

Mentions

Reserved words (RT, FAV)

Emojis

Smileys

#installing tweet-preprocessor
!pip install tweet-preprocessor
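
To see what the library does before wiring it into the DataFrame, a minimal sketch (the tweet is made up; note that with default options p.clean() also strips hashtags, which is fine here because we already extracted them into their own column):

import preprocessor as p

print(p.clean("Wear a mask! #StayHome https://t.co/abc 😷"))
# 'Wear a mask!'

If you only want specific elements removed, p.set_options(p.OPT.URL, p.OPT.MENTION) restricts cleaning to those.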

2. Text Cleaning (URLs, Mentions, etc.)

The cleaned tweets (after removal of URLs and mentions) are added to a new column as a new feature 'text'. Cleaning is done using the tweet-preprocessor package.

import preprocessor as p

#forming a separate feature for cleaned tweets
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)
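
Equivalently, and a bit more idiomatic in pandas, the loop can be replaced with a single apply (a sketch, assuming the raw text lives in 'tweet_text' as above):

tweets['text'] = tweets['tweet_text'].apply(p.clean)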

3. Tokenization, Removal of Digits, Stop Words and Punctuation

Further preprocessing of the new feature 'text'. NLTK (Natural Language Toolkit) is one of the best libraries for preprocessing text data.

#important libraries for preprocessing using NLTK
import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

Remove digits and lowercase the text (this makes it easier to deal with):

data = data.astype(str).str.replace(r'\d+', '', regex=True)
lower_text = data.str.lower()

Remove punctuation:

def remove_punctuation(words):
    new_words = []
    for word in words:
        # keep only word characters and whitespace
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
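
For example, on an already-tokenized list (made-up tokens):

print(remove_punctuation(['stay', 'home', '!', "it's", 'day-1']))
# ['stay', 'home', 'its', 'day1']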

Lemmatization + tokenization, using the built-in TweetTokenizer():

lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = TweetTokenizer()

def lemmatize_text(text):
    # tokenize with the tweet-aware tokenizer, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
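
A quick check (WordNetLemmatizer defaults to noun lemmas, so plurals are reduced while verb forms like 'issued' pass through unchanged):

print(lemmatize_text("the governors issued new orders"))
# ['the', 'governor', 'issued', 'new', 'order']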

The last preprocessing step is removing stop words. There is a predefined stop-word list for English; however, you can modify it, for example by simply adding your own words to it (see the sketch after this snippet).

stop_words = set(stopwords.words('english'))

tweets['text'] = tweets['text'].apply(lambda x: [item for item in x
                                                 if item not in stop_words])
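
Since stop_words is a plain Python set, extending it with domain-specific noise words is one line (the words below are just examples):

stop_words.update(['amp', 'rt', 'via'])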

4. Word Cloud

Frequency distribution of the segmented hashtags. After the preprocessing steps, we excluded all place names and abbreviations from the tweets, because they acted as leakage variables, and then computed a frequency distribution of the most frequently occurring hashtags and created a word cloud:

This was quite expected.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

#Frequency of words
fdist = FreqDist(tweets['Segmented#'])

#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
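
Note that FreqDist here counts each full segmented-hashtag string as one item. If you would rather count individual words across all segmented hashtags, a small sketch:

from itertools import chain

word_freq = FreqDist(chain.from_iterable(
    s.split() for s in tweets['Segmented#'].dropna()))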

The final dataset:

The final code:

import pandas as pd
import numpy as np
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re, string, unicodedata
import nltk
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

!pip install ekphrasis
!pip install tweet-preprocessor

import preprocessor as p

# 1. hashtag extraction
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x:
                        re.findall(r"#(\w+)", x))

# 2. text cleaning (URLs, mentions, etc.)
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)

# 3. tokenization, removal of digits and punctuation
def preprocess_data(data):
    # removes numbers
    data = data.astype(str).str.replace(r'\d+', '', regex=True)
    lower_text = data.str.lower()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    w_tokenizer = TweetTokenizer()

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w
                in w_tokenizer.tokenize(text)]

    def remove_punctuation(words):
        new_words = []
        for word in words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        return new_words

    words = lower_text.apply(lemmatize_text)
    words = words.apply(remove_punctuation)
    return pd.DataFrame(words)

pre_tweets = preprocess_data(tweets['text'])
tweets['text'] = pre_tweets

# stop-word removal
stop_words = set(stopwords.words('english'))
tweets['text'] = tweets['text'].apply(lambda x: [item for item in x
                                                 if item not in stop_words])

# hashtag segmentation
from ekphrasis.classes.segmenter import Segmenter

# segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")
for i in range(len(tweets)):
    if tweets['hashtag'][i] != []:
        listToStr1 = ' '.join([str(elem) for elem in tweets['hashtag'][i]])
        tweets.loc[i, 'Segmented#'] = seg_tw.segment(listToStr1)

# 4. word cloud
#Frequency of words
fdist = FreqDist(tweets['Segmented#'])
#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Hope I helped y’all.

Text classification in general works better if the text is preprocessed well. Do give it some extra time; it will all be worth it in the end.

