Complete Beginner's Guide To Processing Whatsapp Data With Python
Bobby Muljono
Dec 26, 2019 · 7 min read
Free-Text Goldmine
From texting loved ones and sending memes to professional use, Whatsapp has
been dominating the messenger market worldwide with 1.5 billion monthly active
users.
https://towardsdatascience.com/complete-beginners-guide-to-processing-whatsapp-data-with-python-781c156b5f0b
21/8/2020 Complete Beginner’s Guide to Processing Whatsapp Data with Python | by Bobby Muljono | Towards Data Science
When it comes to complex NLP modelling, free text is black gold.
NLP for businesses provides an enhanced user experience, ranging from spell-checks
and feedback analysis to virtual assistants.
In certain situations, small businesses may create Whatsapp chat groups to relay
information between members as a low-cost alternative to setting up systems to log
data. A rule-based system for how the information is to be disseminated is agreed at
the start of the chat. Consider the following example:
. . .
Methodology
There are many great resources online to convert Whatsapp data into a pandas
dataframe. Most, if not all, make use of Python’s regex library (re) as a fairly complicated
solution to split the text file into columns of the dataframe.
However, my objective here is to target Python users who are beginners in string
manipulation. As beginners learning Python, we are more familiar with basic
Python methods that do not come from external libraries. In this article, we will be
using a lot of the basic methods in processing Whatsapp data into a pandas dataframe.
Here is what we will be covering:
1. 2 libraries (pandas for dataframe and datetime to detect datetime objects)
2. A lot of .split() methods
3. List comprehensions
4. Error-handling
Otherwise, the easiest way to extract the Whatsapp .txt file is the following
method:
3. Tap on the ‘…’ > Select ‘More’ > Select ‘Export chat’ without media and send it to
your personal e-mail
Once you’re done, your text file should look something like this:
Step 2: Importing the data into your Python IDE
The first thing we want to do is to make sure we know the location of your text file.
Once we know its destination, we can set our working directory to the file’s location:
import os
os.chdir('C:/Users/Some_Directory/...')
Once that is out of the way, we want to define a function to read your text file into a
Python variable with the following method:
def read_file(file):
    '''Reads Whatsapp text file into a list of strings'''
    # Open the text file; the with-block closes it once reading is done
    with open(file, 'r', encoding='utf-8') as x:
        y = x.read()  # By now it is one huge string that we need to separate line by line
    content = y.splitlines()  # splitlines() converts the string into a list, one element per line
    return content

chat = read_file('test_chat.txt')
The above function converts our text file into a list of strings that allows us to make use
of .split() methods later on. But for now, there is some cleaning you need to do.
We can observe that ‘Some random text’ does not have the same usual format that
every line of Whatsapp text should have. To handle such elements, let’s first look at the
pattern of Whatsapp text messages.
Ignoring everything else after the date, it is obvious that unwanted elements do not
have date objects in them. So we begin removing them by checking whether they contain
a date before the first ‘,’. We do this using a basic error-handling technique.
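As a minimal sketch of this date check (not necessarily the exact code used in the article), the block below tries to parse the text before the first ',' and treats a ValueError as "no date here". The sample lines, the helper name starts_with_date and the date format '%d/%m/%Y' are assumptions for illustration; Whatsapp exports use different date formats depending on phone settings, so adjust the format to match your own file.

```python
from datetime import datetime

# Hypothetical sample lines; in practice this list comes from read_file().
sample_chat = [
    "26/12/2019, 9:15 pm - Alice: Check out this link",
    "Some random text",
    "26/12/2019, 9:16 pm - Bob: Thanks!",
]

def starts_with_date(line, date_format="%d/%m/%Y"):
    """Return True if the text before the first ',' parses as a date.

    The default date_format is an assumption; change it to suit your export.
    """
    try:
        datetime.strptime(line.split(",")[0], date_format)
        return True
    except ValueError:
        # strptime raises ValueError when the text does not match the format
        return False

# Keep only the lines that begin with a valid date
clean_chat = [line for line in sample_chat if starts_with_date(line)]
print(len(sample_chat) - len(clean_chat), "line(s) removed")
```

Here the line 'Some random text' is dropped because it has no date in front of a ','.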
As you can see, we have removed about 100 elements that may pose a hindrance to
feature extraction later on. In casual texting, most of us simply do not write
multi-line texts unless we are sharing links with captions with our buddies!
The first feature we would like to extract is the date. Remember that the date string
occurs right before the first ‘,’. So we extract the element using the .split(',') method at
index 0. We can write this beautifully using Python’s list comprehension.
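A sketch of that list comprehension, written with an index and range(), could look like the following. The two sample lines are hypothetical and assume the 'date, time - name: message' layout described in this article.

```python
# Hypothetical cleaned lines, assumed to follow 'date, time - name: message'.
chat = [
    "26/12/2019, 9:15 pm - Alice: Check out this link",
    "26/12/2019, 9:16 pm - Bob: Thanks!",
]

# For every line, take the piece before the first ','
date = [chat[i].split(",")[0] for i in range(len(chat))]
print(date)
```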
Do note that I came from an R background and I am very used to using ‘i’ in for loops.
Another way you can write the above code without using the range() function is the
following:
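One possible version without range() loops over the lines directly instead of over their indices. As before, the sample lines are made up for illustration.

```python
# Hypothetical cleaned lines, assumed to follow 'date, time - name: message'.
chat = [
    "26/12/2019, 9:15 pm - Alice: Check out this link",
    "26/12/2019, 9:16 pm - Bob: Thanks!",
]

# Loop over each line itself rather than over an index i
date = [line.split(",")[0] for line in chat]
print(date)
```

Both styles build the same list; looping over the lines directly is usually considered the more idiomatic Python.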
In contrast, this is what is required using the Regex method just to check whether the
string pattern is a date.
Credits: Samir Sheri
All that just to identify the date feature?! (Photo by Ben White on Unsplash)
With that out of the way, we may proceed with the same logic when extracting both
the time and name of the sender. Take note of the following pattern:
1. Time string occurs right after the first ‘,’ and right before the first ‘-’
2. Name string occurs right after the first ‘-’ and right before the next ‘:’, so it sits at index 0 once we split on ‘:’
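The two patterns above can be sketched with chained .split() calls. This is an illustration under assumptions rather than the article's exact code: the sample lines are hypothetical, and the approach assumes neither the date nor the time contains a '-'.

```python
# Hypothetical cleaned lines, assumed to follow 'date, time - name: message'.
chat = [
    "26/12/2019, 9:15 pm - Alice: Check out this link",
    "26/12/2019, 9:16 pm - Bob: Thanks!",
]

# Time: the piece after the first ',' and before the first '-'
time = [line.split(",")[1].split("-")[0].strip() for line in chat]

# Name: the piece after the first '-' and before the next ':'
name = [line.split("-")[1].split(":")[0].strip() for line in chat]

print(time)
print(name)
```

The .strip() calls remove the spaces left around each piece after splitting.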
Finally we want to extract the content of the message. This is a little bit tricky because
certain lines do not contain any messages. Instead, they are system-generated
messages depicted by the following:
chat.split(":")
You may choose to remove elements with ‘Missing Text’ later on.
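A hedged sketch of this step is shown below. It assumes the time is written with exactly one ':' (for example '9:15 pm'), so the message is whatever follows the second ':' in the line. The helper name get_content and the sample lines are hypothetical. Passing 2 as the maxsplit argument stops splitting after two colons, so a message that itself contains ':' (such as a link) is kept whole.

```python
# Hypothetical lines: one system-generated message and one normal message.
chat = [
    '26/12/2019, 9:00 pm - Alice created group "Family"',
    "26/12/2019, 9:15 pm - Alice: Check out this link: https://example.com",
]

def get_content(line):
    """Return the message text, or 'Missing Text' for system-generated lines."""
    try:
        # Split on at most two ':' so the message itself stays in one piece
        return line.split(":", 2)[2].strip()
    except IndexError:
        # Fewer than two ':' means there is no 'name: message' part
        return "Missing Text"

content = [get_content(line) for line in chat]
print(content)
```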
And voila! Your data frame is ready for post-analysis! Notice the system-generated
messages that appear in the name column. You can conditionally remove rows with
system-generated messages with the following code:
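One way this filtering could be done is sketched below, assuming the data frame was built from the four lists extracted earlier and that system-generated rows carry 'Missing Text' in the content column. The column names and sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical columns, shaped like the lists produced by the earlier steps.
df = pd.DataFrame({
    "date": ["26/12/2019", "26/12/2019", "26/12/2019"],
    "time": ["9:00 pm", "9:15 pm", "9:16 pm"],
    "name": ['Alice created group "Family"', "Alice", "Bob"],
    "content": ["Missing Text", "Check out this link", "Thanks!"],
})

# Keep only the rows that hold a real message, then renumber the index
df = df[df["content"] != "Missing Text"].reset_index(drop=True)
print(df)
```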
. . .
There are many ways you can make use of processed Whatsapp text data to conduct
your analysis, from recreating yourself as a bot and using NLP for sentiment analysis
to just plain simple analytics. Making use of Whatsapp data is great practice for any
complex NLP projects to come.
Basic string manipulation is enough to convert a text file into a pandas dataframe as
shown above. If you are a newbie with Python (like me), it is better to get used to the
basics than to try out new techniques that may prove a little overwhelming at first.
However, Python’s regex library is still an important tool for intermediate to advanced
uses of text mining and data validation.
Here is a great article explaining the concepts of the Regex library in Python along with
its potential uses for data analytics and data science:
towardsdatascience.com
Happy coding!