Text Data and Analysis (Slides)
Overview
01. Introduction
07. The CONCATENATE function
Introduction
Text data and analysis functions in spreadsheets are a set of built-in functions that allow
users to manipulate and analyze text data within cells. Advantages of using these functions
include:
● Time-saving
They allow quick and easy manipulation of large datasets, which saves time compared to manually
editing each cell.
● Consistency
● Flexibility
They allow users to extract specific information or manipulate text data in various ways,
depending on their needs. This flexibility allows users to create customized solutions for
different types of data.
Spreadsheet functions
Data overview
To investigate how spreadsheet functions can be used to analyze text data, we will use a
"Tweets on climate change" dataset that has 100 rows and the following columns:
1. ID
A numeric string that is associated with and uniquely identifies a single Tweet within the
dataset. It makes it possible to access and interact with a specific Tweet.
2. Text
An aggregated Tweet pertaining to climate change.
The SUBSTITUTE function is used to replace a specific character or string of characters in a
cell with a different character or string.
SUBSTITUTE and REGEXREPLACE are similar, but SUBSTITUTE is preferred over REGEXREPLACE when
the text being replaced is in multiple columns.
● text_to_search – The text within which to search and replace.
● search_for – The string to search for within text_to_search.
search_for will match parts of words as well as whole words; therefore, a search for "vent"
will also replace text within "eventual".
● replace_with – The string that will replace search_for.
● occurrence_number – [OPTIONAL] The instance of search_for within text_to_search to
replace with replace_with. If occurrence_number is specified, only the indicated instance of
search_for is replaced.
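The argument behavior described above can be checked outside a spreadsheet with a small Python sketch. The function name substitute and the sample sentence are illustrative assumptions, not part of the slides; the sketch only mirrors the documented arguments.

```python
def substitute(text_to_search, search_for, replace_with, occurrence_number=None):
    """Rough Python stand-in for the spreadsheet SUBSTITUTE function (a sketch)."""
    if occurrence_number is None:
        # No occurrence given: every match is replaced, including parts of words.
        return text_to_search.replace(search_for, replace_with)
    # Walk to the requested (1-based) occurrence and replace only that one.
    start = -1
    for _ in range(occurrence_number):
        start = text_to_search.find(search_for, start + 1)
        if start == -1:
            return text_to_search  # fewer matches than requested: leave unchanged
    end = start + len(search_for)
    return text_to_search[:start] + replace_with + text_to_search[end:]


# "vent" also matches inside "eventual".
replaced_all = substitute("an eventual vent", "vent", "X")
replaced_second = substitute("an eventual vent", "vent", "X", 2)
print(replaced_all)     # an eXual X
print(replaced_second)  # an eventual X
```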
On Twitter, a mention is a way to tag or reference another user in a Tweet by including their
username in the Tweet. Mentions are commonly used to start a conversation with someone, to
acknowledge someone in a Tweet, or to give credit to someone for their work.
Mentions are prepended with the “@” symbol.
● We will use OR logic since a Tweet can contain both a URL and a mention, only one of the
two, or neither.
● The REGEXEXTRACT function will be used to identify and extract the URLs and mentions,
then the SUBSTITUTE function will be used to replace them with a blank string.
● To remove the URLs and mentions extracted by the REGEXEXTRACT function, we will set the
replace_with argument of SUBSTITUTE to an empty string ("").
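The extract-then-replace idea can be sketched in Python as follows. The regular expression, the function name, and the sample strings are assumptions chosen for illustration; the slides do not show the pattern actually used.

```python
import re

# Assumed pattern: a URL OR a mention ("|" expresses the OR logic).
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def remove_url_or_mention(text):
    """Extract the first URL or mention, then replace it with an empty string."""
    match = URL_OR_MENTION.search(text)   # plays the role of REGEXEXTRACT
    if match is None:
        return text                       # nothing to remove
    return text.replace(match.group(0), "")  # plays the role of SUBSTITUTE


no_url = remove_url_or_mention("Gotta love the facts. https://example.com/abc")
no_mention = remove_url_or_mention("@someuser thanks for this")
print(repr(no_url))      # 'Gotta love the facts. '
print(repr(no_mention))  # ' thanks for this'
```

Only the first match is handled here, and removing it can leave a stray leading or trailing space behind.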
02. Replicate the formula to the other rows by dragging the fill handle down.
ID | Text | Result
1028954403129184256 | Gotta love the facts. https://t.co/bZ2G8AZuo9 | Gotta love the facts.
1028954810781814784 | You send me crap It's 5 minutes to midnight for a mute https://t.co/FFhYHCitKb | You send me crap It's 5 minutes to midnight for a mute
The TRIM function is used to remove leading, trailing, and repeated spaces in text, while the
CLEAN function returns the text with the non-printable ASCII characters removed.
Spreadsheets do not show non-printable characters in the user interface, so using the CLEAN
function will typically not result in any visible changes.
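A Python sketch of the two behaviors is given below. It assumes that "non-printable ASCII characters" means character codes 0 to 31; the function names and sample strings are illustrative. In a spreadsheet the two are often nested, for example =TRIM(CLEAN(B2)), where the cell reference B2 is an assumption.

```python
import re


def trim(text):
    """Drop leading and trailing spaces and collapse repeated spaces to one."""
    return re.sub(r" +", " ", text.strip(" "))


def clean(text):
    """Drop control characters (assumed here to be character codes 0-31)."""
    return "".join(ch for ch in text if ord(ch) > 31)


trimmed = trim("  Gotta   love the facts.  ")
cleaned = clean("Gotta\x07 love\x00 the facts.")
print(trimmed)  # Gotta love the facts.
print(cleaned)  # Gotta love the facts.
```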
02. Replicate the formula to the other rows by dragging the fill handle down.
The SEARCH and FIND functions both return the position at which a string is first found within
text. SEARCH, however, ignores case while FIND is case-sensitive.
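The difference in case handling can be illustrated with a Python sketch. The function names and the sample Tweet are assumptions; positions are 1-based as in spreadsheets, and a Python ValueError stands in for the spreadsheet error value.

```python
def find(search_for, text_to_search):
    """Case-sensitive lookup; returns a 1-based position or raises ValueError."""
    return text_to_search.index(search_for) + 1


def search(search_for, text_to_search):
    """Case-insensitive lookup; returns a 1-based position or raises ValueError."""
    return text_to_search.lower().index(search_for.lower()) + 1


tweet = "Act now #ClimateChange"
search_position = search("#climatechange", tweet)
print(search_position)  # 9

try:
    find("#climatechange", tweet)
    find_failed = False
except ValueError:
    find_failed = True  # the case-sensitive lookup does not match "#ClimateChange"
print(find_failed)      # True
```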
● We will use SEARCH so that we can identify all relevant hashtags regardless of letter case.
● Since SEARCH will return an error if a Tweet does not contain the hashtag, we will use the IFERROR
function to replace the error value with 0.
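The two bullets above can be sketched in Python as shown below. One possible spreadsheet form is =IFERROR(SEARCH("#climatechange", B2), 0), where the cell reference B2 is an assumption; the function name and sample Tweets in the sketch are also illustrative.

```python
def hashtag_position(text, hashtag="#climatechange"):
    """Case-insensitive 1-based position of the hashtag, or 0 when it is absent."""
    try:
        return text.lower().index(hashtag.lower()) + 1
    except ValueError:
        return 0  # plays the role of IFERROR(..., 0)


tweets = ["Act now #ClimateChange", "Gotta love the facts."]
positions = [hashtag_position(t) for t in tweets]
print(positions)  # [9, 0]
```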
02. Replicate the formula to the other rows by dragging the fill handle down.
ID | Text | #climatechange
The SPLIT function divides text around a specified character or string and puts each
fragment into a separate cell in the row.
● We will use the space character as the delimiter since words are separated by spaces.
● It is advisable to enter the SPLIT formula in the first empty column after the last column of
data, since its results fill cells horizontally to the right and need empty cells to expand into.
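Splitting on spaces can be sketched in Python as follows, with each list element standing for one cell in the row. The sketch assumes that empty fragments produced by repeated spaces are dropped, and the function name is illustrative. One possible spreadsheet form is =SPLIT(B2, " "), where B2 is an assumed cell reference.

```python
def split(text, delimiter=" "):
    """Divide text around the delimiter, dropping empty fragments (an assumption)."""
    return [part for part in text.split(delimiter) if part != ""]


words = split("Gotta love the facts.")
words_with_gap = split("You send me  crap")  # note the double space
print(words)           # ['Gotta', 'love', 'the', 'facts.']
print(words_with_gap)  # ['You', 'send', 'me', 'crap']
```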
02. Replicate the formula to the other rows by dragging the fill handle down.
● We will use the CONCATENATE function with a colon followed by a space as the delimiter between the ID and text.
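Joining the ID and text with ": " can be sketched in Python as shown below. The overview names the CONCATENATE function; the Python function here is an assumed stand-in for it. One possible spreadsheet form is =CONCATENATE(A2, ": ", B2), where the cell references A2 and B2 are assumptions.

```python
def concatenate(*parts):
    """Append the given values to one another in order."""
    return "".join(str(part) for part in parts)


tweet_id = "1028954403129184256"   # kept as a string so no digits are lost
tweet_text = "Gotta love the facts."
combined = concatenate(tweet_id, ": ", tweet_text)
print(combined)  # 1028954403129184256: Gotta love the facts.
```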
02. Replicate the formula to the other rows by dragging the fill handle down.
ID | Text | Result
1028954403129184256 | Gotta love the facts. https://t.co/bZ2G8AZuo9 | 1028954403129184256: Gotta love the facts. https://t.co/bZ2G8AZuo9
The UPPER function converts a specified string to uppercase, LOWER converts a specified
string to lowercase, and PROPER capitalizes each word in a specified string.
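The three conversions have close Python counterparts, sketched below on an illustrative string. Python's str.title is used as an approximation of PROPER; the two may differ on edge cases such as apostrophes.

```python
text = "gotta LOVE the facts"

upper_text = text.upper()    # counterpart of UPPER
lower_text = text.lower()    # counterpart of LOWER
proper_text = text.title()   # approximate counterpart of PROPER

print(upper_text)   # GOTTA LOVE THE FACTS
print(lower_text)   # gotta love the facts
print(proper_text)  # Gotta Love The Facts
```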
● Since the FIND function is case-sensitive, we can start by converting the text to uppercase or
lowercase before applying the FIND function.
● It is common practice to convert text to lowercase during analysis, so we will use the LOWER function.
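Lowercasing before a case-sensitive lookup can be sketched in Python as follows. One possible spreadsheet form is =FIND("#climatechange", LOWER(B2)), where B2 is an assumed cell reference; the function name and sample Tweets are illustrative, and None stands in for the spreadsheet error value.

```python
def find_in_lowercase(text, hashtag="#climatechange"):
    """Lowercase the text, then do a case-sensitive lookup for the hashtag.

    Returns a 1-based position, or None where the spreadsheet FIND
    would return an error.
    """
    index = text.lower().find(hashtag)
    return None if index == -1 else index + 1


found = find_in_lowercase("Act now #ClimateChange")
missing = find_in_lowercase("Gotta love the facts.")
print(found)    # 9
print(missing)  # None
```

The hashtag passed in must itself be lowercase for this to match.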
02. Replicate the formula to the other rows by dragging the fill handle down.