
Chapter 5. Text Analysis

In the last two chapters, we explored applications of dates and numbers with
time series analysis and cohort analysis. But data sets are often more than just
numeric values and associated timestamps. From qualitative attributes to free
text, character fields are often loaded with potentially interesting information.
Although databases excel at numeric calculations such as counting, summing,
and averaging things, they are also quite good at performing operations on text
data.
I’ll begin this chapter by providing an overview of the types of text analysis
tasks that SQL is good for, and of those for which another programming
language is a better choice. Next, I’ll introduce our data set of UFO sightings.
Then we’ll get into coding, covering text characteristics and profiling, parsing
data with SQL, making various transformations, constructing new text from
parts, and finally finding elements within larger blocks of text, including with
regular expressions.

Why Text Analysis with SQL?


Among the huge volumes of data generated every day, a large portion consists of
text: words, sentences, paragraphs, and even longer documents. Text data used
for analysis can come from a variety of sources, including descriptors populated
by humans or computer applications, log files, support tickets, customer surveys,
social media posts, or news feeds. Text in databases ranges from structured
(where data is in different table fields with distinct meanings) to semistructured
(where the data is in separate columns but may need parsing or cleaning to be
useful) or mostly unstructured (where long VARCHAR or BLOB fields hold
arbitrary length strings that require extensive structuring before further analysis).
Fortunately, SQL has a number of useful functions that can be combined to
accomplish a range of text-structuring and analysis tasks.

What Is Text Analysis?


Text analysis is the process of deriving meaning and insight from text data.
There are two broad categories of text analysis, which can be distinguished by
whether the output is qualitative or quantitative. Qualitative analysis, which may
also be called textual analysis, seeks to understand and synthesize the meaning
from a single text or a set of texts, often bringing in outside knowledge or
drawing unique conclusions. This work is often done by journalists, historians, and user
experience researchers. Quantitative analysis of text also seeks to synthesize
information from text data, but the output is quantitative. Tasks include
categorization and data extraction, and analysis is usually in the form of counts
or frequencies, often trended over time. SQL is much more suited to quantitative
analysis, so that is what the rest of this chapter is concerned with. If you have the
opportunity to work with a counterpart who specializes in the first type of text
analysis, however, do take advantage of their expertise. Combining the
qualitative with the quantitative is a great way to derive new insights and
persuade reluctant colleagues.
Text analysis encompasses several goals or strategies. The first is text extraction,
where a useful piece of data must be pulled from surrounding text. Another is
categorization, where information is extracted or parsed from text data in order
to assign tags or categories to rows in a database. Another strategy is sentiment
analysis, where the goal is to understand the mood or intent of the writer on a
scale from negative to positive.
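To make the first two strategies concrete, here is a minimal sketch in
PostgreSQL-style SQL. The support_tickets table and its ticket_text column are
illustrative assumptions, not part of this chapter's data set. The query extracts
the text before the first colon and assigns a category based on keywords:

    -- Text extraction: pull the piece of text before the first colon
    -- Categorization: tag each ticket based on keywords it contains
    SELECT
      split_part(ticket_text, ':', 1) AS product,
      CASE
        WHEN ticket_text ILIKE '%refund%' THEN 'billing'
        WHEN ticket_text ILIKE '%crash%' THEN 'bug'
        ELSE 'other'
      END AS category,
      count(*) AS tickets
    FROM support_tickets
    GROUP BY 1, 2
    ;

Sentiment analysis, the third strategy, is much harder to do well with rules
alone, as we'll discuss later in this chapter.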
Although text analysis has been around for a while, interest and research in this
area have taken off with the advent of machine learning and the computing
resources that are often needed to work with large volumes of text data. Natural
language processing (NLP) has made huge advances in recognizing, classifying,
and even generating brand-new text data. Human language is incredibly
complex, with different languages and dialects, grammars, and slang, not to
mention the thousands and thousands of words, some of which have overlapping
meanings or subtly modify the meaning of other words. As we’ll see, SQL is
good at some forms of text analysis, but for other, more advanced tasks, there are
languages and tools that are better suited.

Why SQL Is a Good Choice for Text Analysis


There are a number of good reasons to use SQL for text analysis. One of the
most obvious is when the data is already in a database. Modern databases have a
lot of computing power that can be leveraged for text tasks in addition to the
other tasks we’ve discussed so far. Moving data to a flat file for analysis with
another language or tool is time consuming, so doing as much work as possible
with SQL within the database has advantages.
If the data is not already in a database, for relatively large data sets, moving the
data to a database may be worthwhile. Databases are more powerful than
spreadsheets for processing transformations on many records. SQL is less error-
prone than spreadsheets, since no copying and pasting is required, and the
original data stays intact. Data could potentially be altered with an UPDATE
command, but this is hard to do accidentally.
SQL is also a good choice when the end goal is quantification of some sort.
Counting how many support tickets contain a key phrase and parsing categories
out of larger text that will be used to group records are good examples of when
SQL shines. SQL is good at cleaning and structuring text fields. Cleaning
includes removing extra characters or whitespace, fixing capitalization, and
standardizing spellings. Structuring involves creating new columns from
elements extracted or derived from other fields or constructing new fields from
parts stored in different places. String functions can be nested or applied to the
results of other functions, allowing for almost any manipulations that might be
needed.
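As a brief illustration of this nesting, consider the following sketch; the
customers table and its city field are hypothetical stand-ins:

    -- Trim stray whitespace, fix capitalization, and then
    -- standardize a known misspelling, all in one nested expression
    SELECT
      replace(initcap(trim(city)), 'Sna Francisco', 'San Francisco') AS city_clean,
      count(*) AS customers
    FROM customers
    GROUP BY 1
    ;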
SQL code for text analysis can be simple or complex, but it is always rule based.
In a rule-based system, the computer follows a set of rules or instructions—no
more, no less. This can be contrasted with machine learning, in which the
computer adapts based on the data. Rules are good because they are easy for
humans to understand. They are written down in code form and can be checked
to ensure they produce the desired output. The downside of rules is that they can
become long and complicated, particularly when there are a lot of different cases
to handle. This can also make them difficult to maintain. If the structure or type
of data entered into the column changes, the rule set needs to be updated. On
more than one occasion, I’ve started with what seemed like a simple CASE
statement with 4 or 5 lines, only to have it grow to 50 or 100 lines as the
application changed. Rules might still be the right approach, but keeping in sync
with the development team on changes is a good idea.
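A sketch of what such a rule set looks like in practice (the page_views table
and referrer column are hypothetical); every new referrer that appears in the
data means another WHEN line to write and maintain:

    -- Rule-based categorization: the query follows these rules
    -- exactly, no more and no less
    SELECT
      CASE
        WHEN lower(referrer) LIKE '%google%' THEN 'search'
        WHEN lower(referrer) LIKE '%bing%' THEN 'search'
        WHEN lower(referrer) LIKE '%facebook%' THEN 'social'
        WHEN lower(referrer) LIKE '%twitter%' THEN 'social'
        ELSE 'other'
      END AS channel,
      count(*) AS visits
    FROM page_views
    GROUP BY 1
    ;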
Finally, SQL is a good choice when you know in advance what you are looking
for. There are a number of powerful functions, including regular expressions,
that allow you to search for, extract, or replace specific pieces of information.
“How many reviewers mention ‘short battery life’ in their reviews?” is a
question SQL can help you answer. On the other hand, “Why are these
customers angry?” is not going to be as easy.
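The battery question might be answered with a query along these lines, assuming
a hypothetical reviews table with a review_text column:

    -- Count reviews containing the phrase; ILIKE makes the match
    -- case insensitive in PostgreSQL
    SELECT count(*) AS reviews_mentioning_phrase
    FROM reviews
    WHERE review_text ILIKE '%short battery life%'
    ;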

When SQL Is Not a Good Choice


SQL essentially allows you to harness the power of the database to apply a set of
rules, albeit often powerful rules, to a set of text to make it more useful for
analysis. SQL is certainly not the only option for text analysis, and there are a
number of use cases for which it’s not the best choice. It’s useful to be aware of
these.
The first category encompasses use cases for which a human is more
appropriate. When the data set is very small or very new, hand labeling can be
faster and more informative. Additionally, if the goal is to read all the records
and come up with a qualitative summary of key themes, a human is a better
choice.
The second category is when there’s a need to search for and retrieve specific
records that contain text strings with low latency. Tools like Elasticsearch or
Splunk have been developed to index strings for these use cases. Performance
will often be an issue with SQL and databases; this is one of the main reasons
that we usually try to structure the data into discrete columns that can more
easily be searched by the database engine.
The third category comprises tasks in the broader NLP category, where machine
learning approaches and the languages that run them, such as Python, are a better
choice. Sentiment analysis, used to analyze ranges of positive or negative
feelings in texts, can be handled only in a simplistic way with SQL. For
example, “love” and “hate” could be extracted and used to categorize records,
but given the range of words that can express positive and negative emotions, as
well as all the ways to negate those words, it would be nearly impossible to
create a rule set with SQL to handle them all. Part-of-speech tagging, where
words in a text are labeled as nouns, verbs, and so on, is better handled with
libraries available in Python. Language generation, or creating brand-new text
based on learnings from example texts, is another example best handled in other
tools. We will see how we can create new text by concatenating pieces of data
together, but SQL is still bound by rules and won’t automatically learn from and
adapt to new examples in the data set.
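To see just how simplistic the rule-based approach to sentiment is, consider a
sketch like the following (the comments table is hypothetical); it misses
negation ("don't love"), synonyms, and sarcasm entirely:

    -- Naive sentiment tagging; a comment containing both words
    -- is labeled by whichever rule matches first
    SELECT
      CASE
        WHEN comment_text ILIKE '%love%' THEN 'positive'
        WHEN comment_text ILIKE '%hate%' THEN 'negative'
        ELSE 'unknown'
      END AS sentiment,
      count(*) AS comments
    FROM comments
    GROUP BY 1
    ;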
Now that we’ve discussed the many good reasons to use SQL for text analysis,
as well as the types of use cases to avoid, let’s take a look at the data set we’ll be
using for the examples before launching into the SQL code itself.

The UFO Sightings Data Set


For the examples in this chapter, we’ll use a data set of UFO sightings compiled
by the National UFO Reporting Center. The data set consists of approximately
95,000 reports posted between 2006 and 2020. Reports come from individuals
who can enter information through an online form.
The table we will work with is ufo, and it has only two columns. The first is a
composite column called sighting_report that contains information about
when the sighting occurred, when it was reported, and when it was posted. It
also contains metadata about the location, shape, and duration of the sighting
event. The second column is a text field called description that contains the
full description of the event. Figure 5-1 shows a sample of the data.
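A quick way to get a feel for the raw records, assuming the ufo table is loaded
as described, is to sample a few rows:

    -- Peek at a handful of raw sighting records
    SELECT sighting_report, description
    FROM ufo
    LIMIT 5
    ;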
