Basic Tweet Preprocessing in Python
After collecting tweets from the Governors of all the states, starting from Day 1 of each state's first COVID-19 case, we merged them into a single DataFrame (see: How to merge various JSON files into a DataFrame) and performed the preprocessing described below.
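A minimal sketch of that merge step, assuming one JSON file of tweets per governor under a tweets/ directory (the paths and file layout here are hypothetical):

import glob
import pandas as pd

# Hypothetical layout: one JSON file of tweets per governor
files = glob.glob('tweets/*.json')

# Read each file and concatenate everything into a single DataFrame
tweets = pd.concat((pd.read_json(f) for f in files), ignore_index=True)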
1. Hashtag Extraction Using Regex

Almost every social media site is known for the topics its users discuss in the form of hashtags. Hashtags played a particularly important part in our case, since we were interested in #Covid19, #Coronavirus, #StayHome, #InThisTogether, etc. Hence, the first step was forming a separate feature from the hashtag values and segmenting them into words.
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x: re.findall(r"#(\w+)", x))
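As a quick illustration on a made-up tweet, the pattern captures everything after each # up to the next non-word character:

import re

sample = "Stay safe, everyone! #Covid19 #StayHome"
print(re.findall(r"#(\w+)", sample))  # ['Covid19', 'StayHome']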
To split multi-word hashtags such as #StayHome into their component words, we used the ekphrasis library:

#installing ekphrasis
!pip install ekphrasis
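A minimal sketch of the segmentation step using ekphrasis's Segmenter; the 'Segmented#' column name matches the one used in the word-cloud step below, but treat the exact wiring as an assumption rather than the article's verbatim code:

from ekphrasis.classes.segmenter import Segmenter

# Word segmenter backed by word statistics from a Twitter corpus
seg = Segmenter(corpus="twitter")

print(seg.segment("StayHome"))        # e.g. 'stay home'
print(seg.segment("InThisTogether"))  # e.g. 'in this together'

# Segment every extracted hashtag for each tweet
tweets['Segmented#'] = tweets['hashtag'].apply(
    lambda tags: [seg.segment(t) for t in tags])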
2. Cleaning Tweets Using the tweet-preprocessor Library

This library deals with:

URLs
Mentions
Emojis
Smileys

#installing tweet-preprocessor
!pip install tweet-preprocessor

import preprocessor as p
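A small sketch of the cleaning call, based on tweet-preprocessor's documented p.clean API (the sample tweet is made up; by default the library strips URLs, mentions, hashtags, emojis, smileys, and reserved words):

# Made-up example tweet
sample = "Stay safe @everyone 👍 #StayHome https://example.com"
print(p.clean(sample))  # -> 'Stay safe'

# Apply to the whole tweet column (column name assumed)
tweets['tweet_text'] = tweets['tweet_text'].apply(p.clean)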
3. Removing Digits, Punctuation, and Stop Words

Remove digits and lowercase the text (this makes it easier to work with), then remove punctuation:
def remove_punctuation(words):
    new_words = []
    for word in words:
        # Drop every character that is not alphanumeric or whitespace
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
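For instance, on a hypothetical token list:

print(remove_punctuation(['stay', 'home', '!!', "we're"]))
# -> ['stay', 'home', 'were']  ('!!' collapses to '' and is dropped)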
Tokenize the tweets with NLTK's TweetTokenizer and lemmatize each token:

lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = TweetTokenizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
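For example (outputs assume WordNet's default noun-mode lemmatization):

print(lemmatize_text("cases are rising in several states"))
# -> ['case', 'are', 'rising', 'in', 'several', 'state']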
Finally, build the stop-word list used to filter the tokens (the filtering itself appears in the full code below):

stop_words = set(stopwords.words('english'))
4. Word Cloud

Frequency Distribution of the Segmented Hashtags

After the preprocessing steps, we excluded all place names and abbreviations from the tweets, because they acted as leakage variables, and then computed a frequency distribution of the most frequently occurring hashtags and created a word cloud:
#Frequency of words
fdist = FreqDist(tweets['Segmented#'])

#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
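One caveat: FreqDist counts the values of whatever iterable it is given, so if 'Segmented#' holds a list of segmented hashtags per tweet (as the extraction step above suggests), the lists need to be flattened into individual strings first. A hedged sketch:

from itertools import chain

# Flatten per-tweet hashtag lists into one stream of strings
all_tags = chain.from_iterable(tweets['Segmented#'])
fdist = FreqDist(all_tags)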
Putting it all together, here is the complete code:

import pandas as pd
import numpy as np
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re, string, unicodedata
import nltk
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download('wordnet')
nltk.download('stopwords')
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x: re.findall(r"#(\w+)", x))
def preprocess_data(data):
    # Remove numbers
    data = data.astype(str).str.replace(r'\d+', '', regex=True)
    # Lowercase the text
    lower_text = data.str.lower()

    lemmatizer = nltk.stem.WordNetLemmatizer()
    w_tokenizer = TweetTokenizer()

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

    def remove_punctuation(words):
        new_words = []
        for word in words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        return new_words

    words = lower_text.apply(lemmatize_text)
    words = words.apply(remove_punctuation)
    return pd.DataFrame(words)
pre_tweets = preprocess_data(tweets['text'])
tweets['text'] = pre_tweets

stop_words = set(stopwords.words('english'))
tweets['text'] = tweets['text'].apply(
    lambda x: [item for item in x if item not in stop_words])
#Frequency of words
fdist = FreqDist(tweets['Segmented#'])

#WordCloud
wc = WordCloud(width=800, height=400,
               max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()