
Chapter 5. Text Analysis

In the last two chapters, we explored applications of dates and numbers with
time series analysis and cohort analysis. But data sets are often more than just
numeric values and associated timestamps. From qualitative attributes to free
text, character fields are often loaded with potentially interesting information.
Although databases excel at numeric calculations such as counting, summing,
and averaging things, they are also quite good at performing operations on text
data.
I’ll begin this chapter by providing an overview of the types of text analysis
tasks that SQL is good for, and of those for which another programming
language is a better choice. Next, I’ll introduce our data set of UFO sightings.
Then we’ll get into coding, covering text characteristics and profiling, parsing
data with SQL, making various transformations, constructing new text from
parts, and finally finding elements within larger blocks of text, including with
regular expressions.

Why Text Analysis with SQL?


Among the huge volumes of data generated every day, a large portion consists of
text: words, sentences, paragraphs, and even longer documents. Text data used
for analysis can come from a variety of sources, including descriptors populated
by humans or computer applications, log files, support tickets, customer surveys,
social media posts, or news feeds. Text in databases ranges from structured
(where data is in different table fields with distinct meanings) to semistructured
(where the data is in separate columns but may need parsing or cleaning to be
useful) or mostly unstructured (where long VARCHAR or BLOB fields hold
arbitrary length strings that require extensive structuring before further analysis).
Fortunately, SQL has a number of useful functions that can be combined to
accomplish a range of text-structuring and analysis tasks.

What Is Text Analysis?


Text analysis is the process of deriving meaning and insight from text data.
There are two broad categories of text analysis, which can be distinguished by
whether the output is qualitative or quantitative. Qualitative analysis, which may
also be called textual analysis, seeks to understand and synthesize the meaning
from a single text or a set of texts, often bringing in outside knowledge or
drawing unique conclusions. This work is often done by journalists, historians, and user
experience researchers. Quantitative analysis of text also seeks to synthesize
information from text data, but the output is quantitative. Tasks include
categorization and data extraction, and analysis is usually in the form of counts
or frequencies, often trended over time. SQL is much more suited to quantitative
analysis, so that is what the rest of this chapter is concerned with. If you have the
opportunity to work with a counterpart who specializes in the first type of text
analysis, however, do take advantage of their expertise. Combining the
qualitative with the quantitative is a great way to derive new insights and
persuade reluctant colleagues.
Text analysis encompasses several goals or strategies. The first is text extraction,
where a useful piece of data must be pulled from surrounding text. Another is
categorization, where information is extracted or parsed from text data in order
to assign tags or categories to rows in a database. Another strategy is sentiment
analysis, where the goal is to understand the mood or intent of the writer on a
scale from negative to positive.
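To make the first two strategies concrete, here is a minimal sketch in
PostgreSQL-style SQL. The support_tickets table and its ticket_text column are
illustrative assumptions, not part of this chapter's data set. The query extracts
the text before the first colon and assigns a category based on keywords:

    -- Text extraction: pull the piece of text before the first colon
    -- Categorization: tag each ticket based on keywords it contains
    SELECT
      split_part(ticket_text, ':', 1) AS product,
      CASE
        WHEN ticket_text ILIKE '%refund%' THEN 'billing'
        WHEN ticket_text ILIKE '%crash%' THEN 'bug'
        ELSE 'other'
      END AS category,
      count(*) AS tickets
    FROM support_tickets
    GROUP BY 1, 2
    ;

Sentiment analysis, the third strategy, is much harder to do well with rules
alone, as we'll discuss later in this chapter.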
Although text analysis has been around for a while, interest and research in this
area have taken off with the advent of machine learning and the computing
resources that are often needed to work with large volumes of text data. Natural
language processing (NLP) has made huge advances in recognizing, classifying,
and even generating brand-new text data. Human language is incredibly
complex, with different languages and dialects, grammars, and slang, not to
mention the thousands and thousands of words, some of which have overlapping
meanings or subtly modify the meaning of other words. As we’ll see, SQL is
good at some forms of text analysis, but for other, more advanced tasks, there are
languages and tools that are better suited.

Why SQL Is a Good Choice for Text Analysis


There are a number of good reasons to use SQL for text analysis. One of the
most obvious is when the data is already in a database. Modern databases have a
lot of computing power that can be leveraged for text tasks in addition to the
other tasks we’ve discussed so far. Moving data to a flat file for analysis with
another language or tool is time consuming, so doing as much work as possible
with SQL within the database has advantages.
If the data is not already in a database, for relatively large data sets, moving the
data to a database may be worthwhile. Databases are more powerful than
spreadsheets for processing transformations on many records. SQL is less error-
prone than spreadsheets, since no copying and pasting is required, and the
original data stays intact. Data could potentially be altered with an UPDATE
command, but this is hard to do accidentally.
SQL is also a good choice when the end goal is quantification of some sort.
Counting how many support tickets contain a key phrase and parsing categories
out of larger text that will be used to group records are good examples of when
SQL shines. SQL is good at cleaning and structuring text fields. Cleaning
includes removing extra characters or whitespace, fixing capitalization, and
standardizing spellings. Structuring involves creating new columns from
elements extracted or derived from other fields or constructing new fields from
parts stored in different places. String functions can be nested or applied to the
results of other functions, allowing for almost any manipulations that might be
needed.
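As a brief illustration of this nesting, consider the following sketch; the
customers table and its city field are hypothetical stand-ins:

    -- Trim stray whitespace, fix capitalization, and then
    -- standardize a known misspelling, all in one nested expression
    SELECT
      replace(initcap(trim(city)), 'Sna Francisco', 'San Francisco') AS city_clean,
      count(*) AS customers
    FROM customers
    GROUP BY 1
    ;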
SQL code for text analysis can be simple or complex, but it is always rule based.
In a rule-based system, the computer follows a set of rules or instructions—no
more, no less. This can be contrasted with machine learning, in which the
computer adapts based on the data. Rules are good because they are easy for
humans to understand. They are written down in code form and can be checked
to ensure they produce the desired output. The downside of rules is that they can
become long and complicated, particularly when there are a lot of different cases
to handle. This can also make them difficult to maintain. If the structure or type
of data entered into the column changes, the rule set needs to be updated. On
more than one occasion, I’ve started with what seemed like a simple CASE
statement with 4 or 5 lines, only to have it grow to 50 or 100 lines as the
application changed. Rules might still be the right approach, but keeping in sync
with the development team on changes is a good idea.
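A sketch of what such a rule set looks like in practice (the page_views table
and referrer column are hypothetical); every new referrer that appears in the
data means another WHEN line to write and maintain:

    -- Rule-based categorization: the query follows these rules
    -- exactly, no more and no less
    SELECT
      CASE
        WHEN lower(referrer) LIKE '%google%' THEN 'search'
        WHEN lower(referrer) LIKE '%bing%' THEN 'search'
        WHEN lower(referrer) LIKE '%facebook%' THEN 'social'
        WHEN lower(referrer) LIKE '%twitter%' THEN 'social'
        ELSE 'other'
      END AS channel,
      count(*) AS visits
    FROM page_views
    GROUP BY 1
    ;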
Finally, SQL is a good choice when you know in advance what you are looking
for. There are a number of powerful functions, including regular expressions,
that allow you to search for, extract, or replace specific pieces of information.
“How many reviewers mention ‘short battery life’ in their reviews?” is a
question SQL can help you answer. On the other hand, “Why are these
customers angry?” is not going to be as easy.
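The battery question might be answered with a query along these lines, assuming
a hypothetical reviews table with a review_text column:

    -- Count reviews containing the phrase; ILIKE makes the match
    -- case insensitive in PostgreSQL
    SELECT count(*) AS reviews_mentioning_phrase
    FROM reviews
    WHERE review_text ILIKE '%short battery life%'
    ;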

When SQL Is Not a Good Choice


SQL essentially allows you to harness the power of the database to apply a set of
rules, albeit often powerful rules, to a set of text to make it more useful for
analysis. SQL is certainly not the only option for text analysis, and there are a
number of use cases for which it’s not the best choice. It’s useful to be aware of
these.
The first category encompasses use cases for which a human is more
appropriate. When the data set is very small or very new, hand labeling can be
faster and more informative. Additionally, if the goal is to read all the records
and come up with a qualitative summary of key themes, a human is a better
choice.
The second category is when there’s a need to search for and retrieve specific
records that contain text strings with low latency. Tools like Elasticsearch or
Splunk have been developed to index strings for these use cases. Performance
will often be an issue with SQL and databases; this is one of the main reasons
that we usually try to structure the data into discrete columns that can more
easily be searched by the database engine.
The third category comprises tasks in the broader NLP category, where machine
learning approaches and the languages that run them, such as Python, are a better
choice. Sentiment analysis, used to analyze ranges of positive or negative
feelings in texts, can be handled only in a simplistic way with SQL. For
example, “love” and “hate” could be extracted and used to categorize records,
but given the range of words that can express positive and negative emotions, as
well as all the ways to negate those words, it would be nearly impossible to
create a rule set with SQL to handle them all. Part-of-speech tagging, where
words in a text are labeled as nouns, verbs, and so on, is better handled with
libraries available in Python. Language generation, or creating brand-new text
based on learnings from example texts, is another example best handled in other
tools. We will see how we can create new text by concatenating pieces of data
together, but SQL is still bound by rules and won’t automatically learn from and
adapt to new examples in the data set.
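To see just how simplistic the rule-based approach to sentiment is, consider a
sketch like the following (the comments table is hypothetical); it misses
negation ("don't love"), synonyms, and sarcasm entirely:

    -- Naive sentiment tagging; a comment containing both words
    -- is labeled by whichever rule matches first
    SELECT
      CASE
        WHEN comment_text ILIKE '%love%' THEN 'positive'
        WHEN comment_text ILIKE '%hate%' THEN 'negative'
        ELSE 'unknown'
      END AS sentiment,
      count(*) AS comments
    FROM comments
    GROUP BY 1
    ;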
Now that we’ve discussed the many good reasons to use SQL for text analysis,
as well as the types of use cases to avoid, let’s take a look at the data set we’ll be
using for the examples before launching into the SQL code itself.

The UFO Sightings Data Set


For the examples in this chapter, we’ll use a data set of UFO sightings compiled
by the National UFO Reporting Center. The data set consists of approximately
95,000 reports posted between 2006 and 2020. Reports come from individuals
who can enter information through an online form.
The table we will work with is ufo, and it has only two columns. The first is a
composite column called sighting_report that contains information about
when the sighting occurred, when it was reported, and when it was posted. It
also contains metadata about the location, shape, and duration of the sighting
event. The second column is a text field called description that contains the
full description of the event. Figure 5-1 shows a sample of the data.
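A quick way to get a feel for the raw records, assuming the ufo table is loaded
as described, is to sample a few rows:

    -- Peek at a handful of raw sighting records
    SELECT sighting_report, description
    FROM ufo
    LIMIT 5
    ;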
