Complete Beginner's Guide To Processing Whatsapp Data With Python

This document provides a 3-step guide to processing WhatsApp chat data with Python: 1. Import the WhatsApp chat text file and pandas library. Read the file into a list of strings using a custom function. 2. Handle multi-line messages that do not follow the standard format by looking at the pattern of WhatsApp text messages. 3. Split each line of the text list into columns using basic string methods like split. This prepares the data to be loaded into a pandas dataframe for analysis.

21/8/2020 Complete Beginner’s Guide to Processing Whatsapp Data with Python | by Bobby Muljono | Towards Data Science

Complete Beginner’s Guide to Processing Whatsapp Data with Python
Making use of basic Python methods to process text data instead of Regex

Bobby Muljono
Dec 26, 2019 · 7 min read

Photo by Rachit Tank on Unsplash

Free-Text Goldmine
From texting loved ones and sending memes to professional use, Whatsapp has
been dominating the messenger market worldwide with 1.5 billion active monthly
users. When it comes to complex NLP modelling, free text is black gold.

NLP gives businesses ways to enhance the user experience, ranging from spell-checks
and feedback analysis to virtual assistants.

12 NLP Examples: How Natural Language Processing is Used


With NLP, autocomplete isn't the only way businesses can upgrade their on-
site search. Klevu is a smart search provider…
www.wonderflow.co

In certain situations, small businesses may create Whatsapp chat groups to relay
information between members as a low-cost alternative to setting up systems to log
data. A rule-based format for how the information is to be shared is agreed at
the start of the chat. Consider the following example:

21/09/2019, 14:04 — Salesperson A: Item B/Branch C/Sold/$1900


21/09/2019, 16:12 — Salesperson X: Item Y/Branch Z/Refund/$1600,
defect found in product, not functioning

We can immediately recognize a pattern in the sales orders from different
salespeople, with fields separated by common delimiters such as ‘/’ and ‘,’. With a
simple system like this (though prone to human spelling errors), we can analyze the
sales patterns of different products and different locations using Whatsapp.
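
As a rough illustration of how such a rule-based message could be taken apart, here is
a minimal sketch. The function name parse_sale and the field names are hypothetical,
and it assumes every message has exactly four '/'-separated fields, optionally followed
by a free-form remark after the first ','.

```python
def parse_sale(message):
    """Split 'Item/Branch/Status/Amount[, remark]' into a dict (hypothetical helper)."""
    # Everything after the first ',' is treated as a free-form remark.
    fields, _, remark = message.partition(',')
    # Assumes exactly four '/'-separated fields; anything else raises ValueError.
    item, branch, status, amount = fields.split('/')
    return {
        'item': item.strip(),
        'branch': branch.strip(),
        'status': status.strip(),
        'amount': amount.strip(),
        'remark': remark.strip(),
    }

sale = parse_sale('Item B/Branch C/Sold/$1900')
refund = parse_sale('Item Y/Branch Z/Refund/$1600, defect found in product, not functioning')
```

A message that breaks the agreed format (for example an amount written with a comma)
would not parse cleanly, which is the human-error weakness mentioned above.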

. . .

Methodology
There are many great resources online to convert Whatsapp data into a pandas
dataframe. Most, if not all, make use of Python’s Regex library as a fairly complicated
solution to split the text file into columns of the dataframe.

However, my objective here is to target Python users who are beginners in string
manipulation. As beginners learning Python, we are more familiar with basic
Python methods that do not have to be imported from a library. In this article, we
will use many of these basic methods to process Whatsapp data into a pandas dataframe.

Here is what we will be covering:

1. 2 libraries (pandas for dataframe and datetime to detect datetime objects)

2. A lot of .split() methods

3. List comprehensions

4. Error-handling

Step 1: Getting the data


If exporting messages directly from your phone is not your jam, you can try the
following method:

Read, Extract WhatsApp Messages backup on Android, iPhone, Blackberry
WhatsApp is undoubtedly no.1 Messaging service on mobile devices having
its presence across Android, iOS, Blackberry…
geeknizer.com

Otherwise, the easiest way to extract a Whatsapp .txt file is the following
method:

1. Open your Whatsapp application

2. Select a chat of your interest

3. Tap on the ‘…’ > Select ‘More’ > Select ‘Export chat’ without media and send it to
your personal e-mail

Once you’re done, your text file should look something like this:

21/09/2019, 23:03 — Friend: my boss dont like filter


21/09/2019, 23:03 — Friend: he likes everything on a page
21/09/2019, 23:03 — Me: so basically you need to turn your data into
ugly first then come out pivot table
21/09/2019, 23:03 — Me: haha
21/09/2019, 23:04 — Me: pivot table all in 1 page what
21/09/2019, 23:05 — Me: but ya i hate this kinda excel work sia
21/09/2019, 23:05 — Me: haha
21/09/2019, 23:05 — Friend: as in
21/09/2019, 23:05 — Me: hope to transition to data scientist asap

Step 2: Importing the data into your Python IDE

The first thing we want to do is to make sure we know the location of your text file.
Once we know its destination, we can set our working directory to the file’s location:


import os
os.chdir('C:/Users/Some_Directory/...')

Once that is out of the way, we want to define a function to read your text file into a
Python variable with the following method:

def read_file(file):
    '''Reads Whatsapp text file into a list of strings'''
    # Open the text file; the with-block closes it again once we are done reading
    with open(file, 'r', encoding='utf-8') as x:
        y = x.read()  # The whole file is now one long string
    content = y.splitlines()  # splitlines() turns that string into a list with one element per line
    return content

chat = read_file('test_chat.txt')


The above function converts our text file into a list of strings that allows us to make use
of .split() methods later on. But for now, there is some cleaning you need to do.

Step 3: Handling multi-line messages


Sometimes the data you extract may not be in perfect format due to multi-line texts.
Consider the following situation using the same salesperson example from above that
is already converted into a list:

21/09/2019, 14:04 — Salesperson A: Item B/Branch C/Sold/$1900

'Some random text formed by new line from Salesperson A'

21/09/2019, 16:12 — Salesperson X: Item Y/Branch Z/Refund/$1600,


defect found in product, not functioning

We can observe that ‘Some random text’ does not have the same usual format that
every line of Whatsapp text should have. To handle such elements, let’s first look at the
pattern of Whatsapp text messages.

Ignoring everything after the date, it is clear that the unwanted elements do not
contain date objects. So we begin removing them by checking whether each element
contains a date before the first ‘,’. We do this with a basic error-handling technique.
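
A minimal sketch of this check is shown below. It is not the original code: the
function name starts_with_date is hypothetical, the sample lines use a plain ' - '
separator, and the '%d/%m/%Y' format is an assumption based on the example dates
(the format of your own export depends on your phone settings).

```python
from datetime import datetime

def starts_with_date(line, date_format='%d/%m/%Y'):
    """Return True if the text before the first ',' parses as a date (hypothetical helper)."""
    try:
        # strptime raises ValueError when the string does not match the format
        datetime.strptime(line.split(',')[0], date_format)
        return True
    except ValueError:
        return False

# Assumed sample data, modelled on the salesperson example
sample_chat = [
    '21/09/2019, 14:04 - Salesperson A: Item B/Branch C/Sold/$1900',
    'Some random text formed by new line from Salesperson A',
    '21/09/2019, 16:12 - Salesperson X: Item Y/Branch Z/Refund/$1600,',
    'defect found in product, not functioning',
]

# Keep only the elements that begin with a valid date
cleaned = [line for line in sample_chat if starts_with_date(line)]
```

Note that this approach discards the continuation lines rather than merging them back
into the message they belong to.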

As you can see, we have removed about 100 elements that may pose a hindrance to
feature extraction later on. Most casual texting does not use multi-line
messages, unless we are sharing links with captions with our buddies!

Step 4: Feature extraction


Now this is where you will be using your basic Python skills to extract features from the
list that you will parse into a dataframe later on. First, we need to revisit the string
pattern from the Whatsapp data.

The first feature we would like to extract is the date. Remember that the date string
occurs right before the first ‘,’. So we extract the element using the .split(‘,’) method at
index 0. We can write this beautifully using Python’s list comprehension.
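
An index-based list comprehension for this step might look like the following sketch.
The two-line chat list is assumed sample data, written with a plain ' - ' separator.

```python
# Assumed sample data standing in for the cleaned chat list
chat = [
    '21/09/2019, 14:04 - Salesperson A: Item B/Branch C/Sold/$1900',
    '21/09/2019, 16:12 - Salesperson X: Item Y/Branch Z/Refund/$1600,',
]

# For every index i, take the part of the line before the first ','
date = [chat[i].split(',')[0] for i in range(len(chat))]
```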

Do note that I came from an R background and I am very used to using ‘i’ in for loops.
Another way to write the date extraction, without using the range() function, is the
following:

date = [text.split(',')[0] for text in chat]

In contrast, this is what is required using the Regex method just to check whether the
string pattern is date.

Credits: Samir Sheri

All that just to identify the date feature?! (Photo by Ben White on Unsplash)

With that out of the way, we may proceed with the same logic when extracting both
the time and name of the sender. Take note of the following pattern:

1. Time string occurs right after the first ‘,’ and right before the first ‘-’

2. Name string occurs right after the first ‘-’ and right before the next ‘:’, so it sits
at index 0 once the text after the ‘-’ is split on ‘:’
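
The two rules above can be sketched with chained .split() calls. This is an
illustration under assumptions: the sample lines use a plain ' - ' separator, and
sender names are assumed not to contain '-' or ':'.

```python
# Assumed sample data standing in for the cleaned chat list
chat = [
    '21/09/2019, 14:04 - Salesperson A: Item B/Branch C/Sold/$1900',
    '21/09/2019, 16:12 - Salesperson X: Item Y/Branch Z/Refund/$1600,',
]

# Time: the text after the first ',' and before the first '-'
time = [text.split(',')[1].split('-')[0].strip() for text in chat]

# Name: the text after the first '-' and before the next ':'
name = [text.split('-')[1].split(':')[0].strip() for text in chat]
```

For a system-generated line with no ':' after the name, the same expression returns
the whole notice text as the name.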

Finally we want to extract the content of the message. This is a little bit tricky because
certain lines do not contain any messages. Instead, they are system-generated
messages depicted by the following:

21/09/2019, 11:03 — Salesperson A created the group "Dummy Chat"


21/09/2019, 11:03 — Salesperson A added Salesperson B
21/09/2019, 14:04 — Salesperson A: Item B/Branch C/Sold/$1900
21/09/2019, 16:12 — Salesperson X: Item Y/Branch Z/Refund/$1600,
defect found in product, not functioning

Notice that in the system-generated messages there is no additional ‘:’ after the
first one, which occurs in the time string. To put this into perspective, consider
the following .split(‘:’) method:

chat[2].split(":")  # chat[2] is the 14:04 line in the example above

#['21/09/2019, 14', '04 — Salesperson A', ' Item B/Branch C/Sold/$1900']

The element at index 2 is of interest to us. However, since system-generated messages
do not contain the second ‘:’, extracting information at index 2 will produce an error.
Therefore we will proceed with our second error-handling technique.
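
A sketch of this second error-handling step is shown below; it is not the original
code. The sample lines are assumed data with a plain ' - ' separator. As a small
deviation from a bare .split(':'), the sketch passes a maximum of 2 splits so that
any ':' inside the message itself (a link, for instance) is left intact.

```python
# Assumed sample data: one system-generated line and two ordinary messages
chat = [
    '21/09/2019, 11:03 - Salesperson A added Salesperson B',
    '21/09/2019, 14:04 - Salesperson A: Item B/Branch C/Sold/$1900',
    '21/09/2019, 16:12 - Salesperson X: check https://example.com',
]

content = []
for text in chat:
    try:
        # Index 2 holds the message; split at most twice so later ':' stay in the message
        content.append(text.split(':', 2)[2].strip())
    except IndexError:
        # System-generated lines have no second ':', so index 2 does not exist
        content.append('Missing Text')
```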

You may choose to remove elements with ‘Missing Text’ later on.

Final step: Concatenating everything into a dataframe


Now that we have 4 lists of features, we can finally create a pandas dataframe with a
single line of code!
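
One way this could look is sketched below. The four short lists are assumed sample
values, and the column names other than 'Content' are assumptions chosen for
illustration.

```python
import pandas as pd

# Assumed sample values standing in for the four feature lists
date = ['21/09/2019', '21/09/2019']
time = ['11:03', '14:04']
name = ['Salesperson A added Salesperson B', 'Salesperson A']
content = ['Missing Text', 'Item B/Branch C/Sold/$1900']

# zip() pairs the lists element by element, giving one row per message
df = pd.DataFrame(list(zip(date, time, name, content)),
                  columns=['Date', 'Time', 'Name', 'Content'])
```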

And voila! Your data frame is ready for post-analysis! Notice the system-generated
messages that appear in the name column. You can conditionally remove rows with
system-generated messages with the following code:

df = df[df['Content'] != 'Missing Text']


. . .

Final Thoughts

There are many ways to use processed Whatsapp text data in your analysis, from
recreating yourself as a bot and using NLP for sentiment analysis to just plain
simple analytics. Making use of Whatsapp data is great practice for any
complex NLP projects to come.

Basic string manipulation is enough to convert a text file into a pandas dataframe as
shown above. If you are a newbie with Python (like me), it is better to get used to the
basics than to try out new techniques that may prove a little overwhelming at first.
However, Python’s regex library is still an important tool for intermediate to advanced
uses of text mining and data validation.

Here is a great article explaining the concepts of the Regex library in Python along with
its potential uses for data analytics and data science:

The Ultimate Guide to using the Python regex module


The original Pattern Finder

towardsdatascience.com

Happy coding!
