Introduction to Data Mining and Analytics with Machine Learning in R and Python
Production Credits
Director of Product Management: Laura Pagluica
Product Manager: Edward Hinman
Product Assistant: Melissa Duffy
Product Coordinator: Paula Yuan-Gregory
Tech Editor: Robert Shimonski
Senior Project Specialist: Dan Stone
Digital Project Specialist: Angela Dooley
Marketing Manager: Michael Sullivan
Product Fulfillment Manager: Wendy Kilborn
Project Management: codeMantra U.S., LLC
Cover Design: Scott Moden
Text Design: Scott Moden
Senior Media Development Editor: Troy Liston
Rights Specialist: Rebecca Damon
Cover Image (Title Page, Part Opener, Chapter
Opener):
© Patra Kongsirimongkolchai / EyeEm / Getty
Images
Printing and Binding: LSC Communications
LCCN: 2019955670
6048
Printed in the United States of America
24 23 22 21 20 10 9 8 7 6 5 4 3 2 1
DEDICATION
Debbie,
You are the love of my life…
BRIEF CONTENTS
Preface
About the Author
Contributors
1 Data Mining and Analytics
2 Machine Learning
3 Databases and Data Warehouses
4 Data Visualization
5 Keep Excel in Your Toolset
6 Keep SQL in Your Toolset
7 NoSQL Data Analytics
8 Programming Data Mining and Analytic Solutions
9 Data Preprocessing and Cleansing
10 Data Clustering
11 Classification
12 Predictive Analytics
13 Data Association
14 Mining Text and Images
15 Big Data Mining
16 Planning and Launching a Data-Mining and Data-Analytics Project
GLOSSARY
INDEX
TABLE OF CONTENTS
Preface
About the Author
Contributors
1 Data Mining and Analytics
Data Visualization
Visual Programming
Business Intelligence
Data Clustering
Data Classification
Predictive Analytics
Data Association
2 Machine Learning
Machine Learning Versus Artificial Intelligence
3 Databases and Data Warehouses
Database Models
Database Schemas
A Word on Normalization
Data Warehouses
Data Marts
Revisiting Schemas
Data Lakes
Lucidchart
4 Data Visualization
Visualization Best Practices
Line Chart
Multiline Charts
Area Chart
Radar Chart
Combo Chart
Diff Chart
Waterfall Chart
Composition Charts
Pie Chart
Donut Chart
Sunburst Chart
Treemap Chart
Funnel Chart
Pyramid Chart
Correlation Charts
Scatter Chart
Bubble Chart
Dashboard Charts
Gauge Chart
Calendar Chart
Candlestick Chart
Distribution Charts
Histogram Chart
Geocharts
Big Number
5 Keep Excel in Your Toolset
Filtering Data
Conditional Formatting
Excel Files
Leveraging Excel Statistical Functions
Using AVERAGEA
AVERAGEIFS
Using TRIMMEAN
DEVSQ
Using CORREL
Using COVARIANCE.S and COVARIANCE.P
Determining the Slope and Intercept of a Line That Best Fits the Data Using SLOPE and INTERCEPT
LOGEST
Using FREQUENCY
6 Keep SQL in Your Toolset
Using AS
Operations
7 NoSQL Data Analytics
JSON Is Self-Describing
Databases
MongoDB Database
Sets
CouchDB
Redis
Amazon DynamoDB
Cassandra
HBase
RocksDB
Charts
8 Programming Data Mining and Analytic Solutions
Python Operators
Relational Operators
Python Lists
Indentation
Logical Operators
Iterative Processing
Structures
Calculations
Creating Variables in R
Comments in R
Using R’s Built-in Functions
Operators within R
Logical Operators
Conditional Operators in R
Vector
Using R Packages
10 Data Clustering
Common Clustering Approaches
Using K-Means++
Hierarchical Clustering
Using Solver
11 Classification
Applying the K-Nearest Neighbors (KNN) Classification
Algorithm
12 Predictive Analytics
Understanding Linear Regression
Regression
K-Nearest-Neighbors Regression
Polynomial Regression
Hands-on: RapidMiner
13 Data Association
Understanding Support, Confidence, and Lift
14 Mining Text and Images
Handwriting Classification
15 Big Data Mining
Enter MapReduce
MongoDB
Others
16 Planning and Launching a Data-Mining and Data-Analytics Project
Hands-on: Jupyter Notebook
GLOSSARY
INDEX
ADDITIONAL
RESOURCES
Cloud Desktop
This textbook is accompanied by a Navigate 2
Premier Course that includes access to a Cloud
Desktop with lab exercises. Cloud Desktops are
browser-based lab environments where students
have a chance to extend their learning beyond the
textbook with real software on live virtual
machines. Visit go.jblearning.com/JamsaData to
learn more.
Preface
Data analysts are dealing with more data today
than ever before, and conservatively, that amount
of data will continue to double every two years for
many years to come. With ever-growing amounts
of data come many opportunities for discovery.
However, as the volume of data increases, so too
do the challenges of discovering trends and
patterns, as well as expressing such findings in a
meaningful way to others. To store, analyze, and
visualize data, analysts use SQL, NoSQL, and
graph databases; Excel; data-mining tools; and
machine-learning as well as big-data tools. This
book examines these concepts and tools,
introducing each and then using hands-on
instruction to develop the skills you will need to
perform real-world data-mining and analytic
operations. Machine-learning and data-mining
operations normally make extensive use of Python
and R programming. As such, this book uses
each. If you have not programmed before, relax;
this book presents the programming skills you will
need and introduces Visual Programming
environments, such as Weka, Orange, and
RapidMiner, which let you perform data mining and
machine learning without having to write code!
Chapter 1 examines data mining, the process of
identifying patterns that exist within data, and
machine learning, the use of data pattern
recognition algorithms, which allow a program to
solve problems. With the data-mining patterns in
hand, data analysts can apply them to other data
sets. Think of the actual “mining” as the search for
the data patterns, as opposed to the subsequent
use of the patterns. The data-mining process may
involve the use of statistics, database queries,
visualization tools, traditional programming, and
machine learning. Machine learning is the use of pattern-recognition algorithms to solve problems, such as clustering, classification, predictive analysis, and
data association. As you will learn in Chapter 1,
machine-learning solutions solve complex
problems by using data to drive discovery, using
only a few lines of code.
CONTRIBUTORS
Many people have contributed valuable assistance
in the development of this book. The author and
publisher would like to thank the reviewers whose
feedback helped shape the text in many ways:
CHAPTER 1
Data Mining and
Analytics
Chapter Goals and Objectives
▶ Define and describe data mining.
▶ Define and describe machine
learning.
▶ Define and describe data
visualization.
▶ Locate, search, and use common
data-set repositories.
▶ Define and describe data quality.
▶ Define and describe the common
data-mining and machine-learning
applications: clustering, classification,
predictive analytics, and association.
Clustering algorithms differ with respect to the following:
▶ Performance
▶ Memory use
▶ Hardness or softness
▶ Data-set size
▶ Need for the analyst to specify the
starting number of clusters
▶ And more
Chapter 10, “Data Clustering,” examines cluster
operations in detail. As shown in FIGURE 1.18, a
common use of clustering is infectious disease
control. Through cluster analysis, doctors and
researchers can often isolate the source of an
infectious disease.
FIGURE 1.18 Using data clustering to identify
the source of an infectious disease.
Used with permission of RapidMiner
▶ Iris setosa
▶ Iris virginica
▶ Iris versicolor
The data set has 50 records for each variety.
▶ Data clustering
▶ Data classification
▶ Predictive analytics
▶ Data association
CHAPTER 2
Machine Learning
Chapter Goals and Objectives
▶ Perform key data-mining and
machine-learning operations.
▶ Compare and contrast supervised
and unsupervised learning.
▶ Compare and contrast training and
testing data sets.
▶ Define and describe dimensionality
reduction.
▶ Define and describe principal-component analysis.
▶ Know when and how to apply data-
set standard scaling.
▶ Data classification
▶ Data clustering
▶ Predictive analysis
▶ Data association
▶ Supervised learning
▶ Unsupervised learning
▶ Reinforced learning
▶ Deep learning
Supervised learning uses labeled data to build a training data set from which an algorithm can learn to identify patterns. Common solutions that use supervised learning include data classification (Chapter 11, “Classification,” examines this in detail).
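To make the idea concrete, the following minimal Python sketch (using the scikit-learn library and its bundled Iris data, not an example from this text) trains a classifier on labeled training data and then scores it against a held-out testing set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # labeled data: measurements and known classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # learn from the training set
print(model.score(X_test, y_test))                      # accuracy on the testing set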
Common solutions that use unsupervised learning include the following:
▶ Data clustering
▶ Data association
Chapter 10, “Data Clustering,” examines cluster
operations in detail. Chapter 13, “Data
Association,” examines data association.
▶ Performance
▶ Memory use
▶ Hardness or softness
▶ Data set size
▶ Need for the analyst to specify the
starting number of clusters
CHAPTER 3
Databases and Data
Warehouses
Chapter Goals and Objectives
▶ Define database and describe the
role of databases in data analytics.
▶ Create an entity relationship diagram
(ERD) that represents entities and
their relationships.
▶ Compare and contrast the
conceptual, logical, and physical data
models.
▶ Compare and contrast databases,
data warehouses, data marts, and
data lakes.
▶ Explain the purpose of data
normalization and understand the
processing required to achieve third-
normal form (3NF).
▶ Compare and contrast relational,
NoSQL, object-oriented, and graph
databases.
▶ Tables
▶ Fields
▶ Views
▶ Indexes
▶ Relationships
▶ Stored procedures and functions
▶ And more
The ERDs previously shown visually represent
several schema components (tables, fields,
relationships, and primary and foreign keys). SQL
developers, when asked to provide a database
schema, will often provide the CREATE TABLE
queries used to create tables, the query code for
stored procedures, functions, and views, as well
as index definitions, and so on.
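As a minimal illustration, the following Python sketch uses the built-in sqlite3 module to create a hypothetical two-table schema with primary keys, a foreign key, and an index (all table and field names here are illustrative, not from this text):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (
  DeptID INTEGER PRIMARY KEY,
  Name   TEXT NOT NULL
);
CREATE TABLE Employees (
  EmpID    INTEGER PRIMARY KEY,
  LastName TEXT NOT NULL,
  DeptID   INTEGER REFERENCES Departments(DeptID)  -- foreign key
);
CREATE INDEX idx_lastname ON Employees(LastName);   -- index definition
""")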
▶ Neo4j
▶ OrientDB
▶ Amazon Neptune
▶ GraphDB
CHAPTER 4
Data Visualization
Chapter Goals and Objectives
▶ Define and describe data
visualization.
▶ Compare and contrast chart types
and the appropriate use of each.
▶ Create a variety of charts using
Excel.
▶ Create HTML-based charts on the
web.
▶ Use best practices when creating
charts.
▶ A visualization is a representation of
an object. A chart, for example, is a
visualization of a data set. A goal of
visualizing data is to improve
communication.
▶ To create a visualization, data
analysts have many chart types from
which they can choose. Each chart
type is well suited to represent a
particular type of data.
▶ To help analysts select the correct
chart type, analysts often group charts
by the following:
Time-based comparison charts
that represent how a variable’s value
changes over time.
Category-based comparison
charts that represent one or more
categories of values.
Composition charts that represent
how a value relates to the whole.
Distribution charts that represent
the distribution frequency of values
within a data set.
Correlation charts that represent
how two or more variables relate.
Dashboard charts that represent
key performance indicators (KPIs),
normally on a company dashboard.
Geocharts that chart data against a
map.
▶ Regardless of the chart type you are
creating, you should follow best-
practice guidelines to improve the
effectiveness of your chart.
▶ Analysts have a wide variety of
visualization tools they can use to
create charts, ranging from Excel to
Tableau and beyond. Factors to
consider when selecting a visualization
toolset include price, learning curve,
support for web-based charts, and
dashboard support.
▶ To integrate charts into a Hypertext
Markup Language (HTML) web page,
developers can leverage tools such as
Google Charts.
Visualization Tools

Visualization Tool   Website
Domo                 www.domo.com
Sisense              www.sisense.com
Tableau              www.tableau.com
Qlik                 www.qlik.com
Yellowfin            www.yellowfinbi.com
Visualization Best Practices
Creating quality visualizations is both an art and a
science. A quality chart uses color effectively,
integrates complementary fonts, and highlights
points of interest to communicate a message—the
art of visualization. Likewise, quality visualizations
use the correct chart for the right purpose—the
science of visualization. Creating quality charts
takes time and effort. Be prepared to make
revisions to your charts based on feedback you
receive from others. Over time, you will learn your audience’s preferences.
▶ Scatter chart
▶ Bubble chart
Scatter Chart
Often during data exploration, analysts will
compare one variable to another to determine if a
relationship exists. To visualize the data, analysts
will plot the values on a Cartesian coordinate
plane to create a scatter chart, as shown in
FIGURE 4.50.
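A minimal matplotlib sketch of such a plot, using hypothetical values for the two variables:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]                # hypothetical values for the first variable
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.0]   # corresponding values for the second

plt.scatter(x, y)        # plot each (x, y) pair on the Cartesian plane
plt.xlabel("Variable 1")
plt.ylabel("Variable 2")
plt.show()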
▶ x-coordinate
▶ y-coordinate
▶ size
▶ color
▶ Histogram chart
▶ Box and whisker chart
Histogram Chart
Analysts often track the frequency of occurrence of
values over time, such as the number of website
visits by time of day, orders per day, and so on. A
histogram, which looks like a bar chart in that it
uses rectangular bars, charts such distributions of
values. Using the histogram’s frequency counts (or
percentages), analysts can estimate future events
based on the chart’s probability distribution. In
other words, the histogram shows the probability
of an event’s future occurrence. The histogram
groups values into bins, such as the number of
sales orders for 1–5 products, 6–10 products, and
so on.
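The following matplotlib sketch groups hypothetical order sizes into bins of five, much as the text describes:

import matplotlib.pyplot as plt

orders = [2, 3, 3, 4, 7, 8, 8, 9, 9, 9, 12, 13, 14, 18, 19]  # hypothetical products per order

# Group the values into bins (1-5, 6-10, 11-15, 16-20).
plt.hist(orders, bins=[1, 6, 11, 16, 21], edgecolor="black")
plt.xlabel("Products per order")
plt.ylabel("Frequency")
plt.show()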
CHAPTER 5
Keep Excel in Your
Toolset
Chapter Goals and Objectives
▶ Sort and filter data using Excel.
▶ Create charts to visualize data using
Excel.
▶ Apply conditional formatting to
highlight key values.
▶ Compare and contrast spreadsheet
file formats.
▶ Use pivot tables to analyze data and
to produce reports.
▶ Perform “what if” processing within
Excel.
Function   Notes
VAR.P      Population variance
STDEV.P    Population standard deviation
▶ CORREL.RHO
▶ CORREL.TAO
▶ PEARSON
FIGURE 5.39 illustrates the use of these functions.
FIGURE 5.39 Using additional Excel
correlation functions.
Used with permission from Microsoft
Determining the Covariance
Using COVARIANCE.S and
COVARIANCE.P
As you have learned, variance provides a measure
of how much data-set values differ from the mean.
Covariance is a similar measure, but it examines
two variables to determine if there is a relationship
between the greater values of one variable and the
greater or lesser values of another variable. If the
large values of one variable align with the large
values of a second variable, the covariance is
positive. If, instead, the large values of one
variable align with the small values of the other
variable, the covariance is negative.
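Outside of Excel, you can compute the same measures in Python with NumPy; in this sketch the two variables are hypothetical, and the ddof argument selects the population (COVARIANCE.P) or sample (COVARIANCE.S) calculation:

import numpy as np

x = [2, 4, 6, 8, 10]   # hypothetical first variable
y = [1, 3, 7, 9, 12]   # its large values align with large x values

print(np.cov(x, y, ddof=0)[0, 1])   # population covariance; positive here
print(np.cov(x, y, ddof=1)[0, 1])   # sample covariance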
When you work with large data sets, you will often
want to highlight data that match a specific
condition. Using Excel’s conditional formatting, you
can quickly highlight the cells matching the criteria
you select using a specific font or color. Using
conditional formatting, for example, you can define
rules that let you highlight outlier data.
CHAPTER 6
Keep SQL in Your
Toolset
Chapter Goals and Objectives
▶ Define and describe the components
of a relational database.
▶ Compare and contrast DCL, DDL,
and DML queries.
▶ Perform complex SQL queries.
▶ Compare and contrast SQL JOIN
operations.
▶ Use SQL aggregation functions and
query techniques to group data for
reporting.
Operation   Query
Create      CREATE
Read        SELECT
Update      UPDATE
Delete      DELETE
When you are done with the shell, you can type
Quit at the MySQL prompt, or you can simply
close the window.
Spend Time with the W3Schools SQL
Tutorial
Database   Website
MySQL      www.mysql.com
Oracle     www.oracle.com
SQLite     sqlite.org
Firebird   firebirdsql.org
Actian     www.actian.com/data-management/actian-x-hybrid-rdbms/
Employees, and if you are using Oracle, you will use the
RowNum <= 5.
Copyright Oracle and its affiliates. Used with permission
Sorting Your Query Results
Often, the first way database developers analyze
data is simply to sort the data from highest to
lowest (descending) or lowest to highest
(ascending) order based on the value of one or
more fields. To sort the results of a SELECT query,
you use the ORDER BY clause. The following
SELECT query, for example, sorts the Employees table by last name.
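A minimal version of such a query, assuming a LastName field and run here through Python's built-in sqlite3 module (with hypothetical sample rows) so you can experiment without a database server:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (FirstName TEXT, LastName TEXT);
INSERT INTO Employees VALUES ('Ann', 'Zane'), ('Bob', 'Able');
""")

# ORDER BY sorts in ascending order by default; append DESC to descend.
for row in conn.execute("SELECT * FROM Employees ORDER BY LastName;"):
    print(row)   # ('Bob', 'Able') first, then ('Ann', 'Zane')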
WHERE Region = 'West' AND SALES >= 100000

WHERE GPA >= 3.0 OR Instructor = 'Smith'

WHERE State NOT IN ('New York', 'Seattle')
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo (remainder)
| Bitwise OR
^ Bitwise Exclusive OR
+= Addition
-= Subtraction
*= Multiplication
/= Division
%= Modulo
|= Bitwise OR
^= Bitwise Exclusive OR
SQL Arithmetic Functions
When you perform data-analytic operations using
SQL, there may be times when you must perform
more complex arithmetic operations beyond the
SUM, AVG, and STDDEV previously discussed. To
help you perform such operations, SQL provides
the arithmetic functions listed in TABLE 6.12.
Chapter 5, “Keep Excel in Your Toolset,” defines and describes these arithmetic functions in detail.
TABLE 6.12 The SQL Arithmetic
Functions
Function Operation
MySQL does not support the FULL OUTER JOIN, which would return matching records, all records from the left table, as well as all records from the right table. Many SQL databases do. To simulate the FULL OUTER JOIN, you can take the UNION of the RIGHT OUTER JOIN with the LEFT OUTER JOIN.
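The following Python sketch demonstrates the simulation with the built-in sqlite3 module; because older SQLite versions also lack RIGHT JOIN, the sketch unions two LEFT OUTER JOINs with the table order swapped, which yields the same rows (the tables and data are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER, name TEXT);
CREATE TABLE B (id INTEGER, dept TEXT);
INSERT INTO A VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO B VALUES (2, 'Sales'), (3, 'IT');
""")

# Matched rows, plus unmatched rows from both tables (a FULL OUTER JOIN).
sql = """
SELECT a.id, a.name, b.dept FROM A a LEFT JOIN B b ON a.id = b.id
UNION
SELECT b.id, a.name, b.dept FROM B b LEFT JOIN A a ON a.id = b.id
"""
for row in conn.execute(sql):
    print(row)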
Copyright Oracle and its affiliates. Used with permission
Used with permission from Microsoft
▶ Data Migrator
▶ Microsoft SQL Server Integration
Services (SSIS)
▶ Oracle Warehouse Builder (OWB)
▶ Cognos Data Manager
▶ CloverETL
Finally, the following queries illustrate the use of
SQL to perform a simple ETL operation. The
SELECT query that follows the USE extracts data
from the Employees table into the EmployeeTemp
table. Then, using the EmployeeTemp table, the
UPDATE queries change the gender value of 1 to
male and the gender value of 0 to female. Finally,
the fifth query loads the updated data into an
EmployeeNew table, and the sixth deletes the
temporary table EmployeeTemp, which was used for the transform.
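A minimal sketch of this ETL flow, rendered with Python's built-in sqlite3 module and hypothetical sample data (SQLite has no USE query, so that step is omitted):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (name TEXT, gender TEXT);
INSERT INTO Employees VALUES ('Ann', '0'), ('Bob', '1');

-- Extract: copy the source data into a temporary work table.
CREATE TABLE EmployeeTemp AS SELECT * FROM Employees;

-- Transform: replace the coded gender values.
UPDATE EmployeeTemp SET gender = 'male'   WHERE gender = '1';
UPDATE EmployeeTemp SET gender = 'female' WHERE gender = '0';

-- Load: move the transformed rows into the destination table.
CREATE TABLE EmployeeNew AS SELECT * FROM EmployeeTemp;
DROP TABLE EmployeeTemp;
""")
print(conn.execute("SELECT * FROM EmployeeNew").fetchall())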
Note: Remember, if you want to move data to and from a table that resides in a different database, you can precede the table name with the database name and a period, such as database.table.
BIGINT(size): −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, where size specifies the maximum number of digits.
CHAPTER 7
NoSQL Data Analytics
Chapter Goals and Objectives
▶ Compare and contrast relational and
NoSQL databases.
▶ Compare and contrast NoSQL
database management systems.
▶ Perform NoSQL query operations.
▶ Understand the role of JSON within
NoSQL solutions.
▶ Define and describe managed
database services.
Self-describing
Human readable
Language independent
Lightweight
▶ MongoDBManager
▶ MongoVue
▶ Studio 3T (previously known as Robo
3T and RoboMongo)
▶ MongoVision
▶ PhpMoAdmin
▶ MongoExplorer
▶ mViewer
FIGURE 7.11 illustrates the Robo 3T GUI, which
makes it very easy to issue all MongoDB query
types within a visual environment.
FIGURE 7.11 Using the Robo 3T GUI to
interact with MongoDB.
Used with permission of 3T Software Labs
Querying a MongoDB Collection
A MongoDB collection is similar to a table within a
relational database, in that it groups data objects—
which MongoDB refers to as documents. A
MongoDB database may store many different
collections, such as Customers, Products, and
Orders. To refer to a specific collection within the
current database, you use the notation
db.collectionName, where db represents the
current database. For example, the statement
db.Employees.find() displays all the documents
(records) within the Employees collection, as
shown in FIGURE 7.12.
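From Python, the pymongo driver exposes the same operation; this sketch assumes a MongoDB server running on localhost and a hypothetical company database:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["company"]                 # the current database

for document in db.Employees.find():   # the same query as db.Employees.find()
    print(document)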
Note: When you issue MongoDB queries, keep in mind that collection and field names are case sensitive.
$eq Equality
+ Addition
− Subtraction
* Multiplication
/ Division
Operation   MongoDB Function
Create      insert()
Read        find()
Update      update()
Delete      delete()
Revisiting MySQL and
JSON
As you learned in Chapter 6, “Keep SQL in Your
Toolset,” MySQL is a SQL-based relational
database. Because of the widespread use of
JSON, MySQL has expanded its query capabilities
to support a JSON data type that you can use to
store a JSON object. In addition, MySQL provides
functions you can use to query and manipulate
JSON data. For example, if a record contains a
JSON Students field, using MySQL’s JSON
capabilities, you can construct queries that
manipulate specific items within that field. For
specifics on the MySQL JSON functions, refer to
the MySQL documentation, as shown in FIGURE
7.39.
FIGURE 7.39 The MySQL JSON functions.
Copyright Oracle and its affiliates. Used with permission
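As a rough illustration of the idea, the sketch below uses SQLite's JSON1 functions (available in most modern SQLite builds) as a stand-in for MySQL's analogous JSON_EXTRACT and related functions; the table and JSON content are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Classes (name TEXT, Students TEXT)")  # JSON stored as text
conn.execute("""INSERT INTO Classes VALUES
  ('CS101', '{"count": 2, "names": ["Ann", "Bob"]}')""")

# json_extract pulls a value out of the stored JSON, much as MySQL's
# JSON_EXTRACT does for a JSON column.
for row in conn.execute("SELECT name, json_extract(Students, '$.count') FROM Classes"):
    print(row)   # ('CS101', 2)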
Using an Index to Improve
MongoDB Query
Performance
If you identify one or more queries that you (or
others) will execute on a regular basis in the
future, you should optimize the query to improve
its performance. One way database developers
improve performance is to index key fields within a
MongoDB collection.
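With the pymongo driver, creating such an index takes a single call; this sketch assumes the same hypothetical Employees collection:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["company"]

# Index the LastName field so queries that filter on it can use the
# index rather than scanning every document in the collection.
db.Employees.create_index("LastName")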
▶ Studio 3T
▶ NoSQLBooster
▶ Stitch Query Translator
▶ Enhanced PolyBase
▶ SalesByCategory
▶ SalesByMonth
CHAPTER 8
Programming Data
Mining and Analytic
Solutions
Chapter Goals and Objectives
▶ Use Python to perform common
machine-learning and data-mining
operations.
▶ Use R to perform common machine-
learning and data-mining operations.
▶ Compare and contrast Python and R
solutions.
Throughout this text, you will make extensive
use of the Python and R programming
languages to perform data-mining and machine-
learning operations. Both Python and R are very
powerful programming languages with a wide
range of capabilities and features. This chapter
introduces the fundamental concepts you need
to know in order to understand the programs this
text presents. The chapter’s goal is to get you
up and running with each quickly. By the time
you finish this chapter, you will understand the
following key concepts:
libraries for each. If you are using this text’s Cloud Desktop, you do not need to install them; the software is preinstalled.
Operator Meaning
** Exponentiation
+, − Addition, subtraction
Operator Purpose
== Equality test
!= Not equal
As you can see, the first for loop displays the row
values, and the second for loop displays the
individual values.
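A minimal sketch of the two loop styles just described, using a small hypothetical data set:

rows = [[1, 2, 3], [4, 5, 6]]   # hypothetical data set

for row in rows:        # first loop: display each row
    print(row)

for row in rows:
    for value in row:   # second, nested loop: display each individual value
        print(value)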
Package Functionality
Name
matplotlib Contains functions you will use to plot and chart data
Comments in R
R, like Python, uses the pound sign (#) to indicate the start of a comment.
Operator Purpose
* Multiplication
/ Division
^ or ** Exponentiation
%% Modulo (remainder)
+ Addition
− Subtraction
Operator Meaning
== Equality
!= Not equal
The program first loads the data set from the .csv file and then uses the head function to display the dataframe’s starting data. Then the program uses the summary function to display a summary of the dataframe’s fields. When you execute this program, it will display the dataframe’s first rows, followed by summary statistics for each field.
After you load a data set, using the head and
summary functions can provide you with insights
into the dataframe’s contents.
Using Dataframe Column Names
When you work with dataframes, you may find it convenient to refer to columns using names. In such cases, you can use the names function to assign the column names.
learning-databases/water-treatment/water-treatment.data
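For comparison, the equivalent pandas operation in Python assigns a list to the dataframe's columns attribute (the file and column names here are hypothetical):

import pandas as pd

df = pd.read_csv("data.csv", header=None)   # a file without a header row
df.columns = ["Age", "Height", "Weight"]    # assign the column names
print(df.head())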
Summary
Throughout this text, you will make extensive use
of the Python and R programming languages to
perform machine-learning and data-mining
operations. As you learned, Python is one of the
world’s most popular programming languages and
is used to create solutions ranging from websites to data mining, machine learning, and visualization. Python is open-source
software, which users can download, install, and
use for free.
Python is an interpreted language, as opposed to a compiled language: the Python interpreter executes one statement at a time.
Developers can interactively execute one
statement at a time via the Python interpreter’s
prompt, or developers can group statements into a
text file, called a Python script, which they then
direct Python to execute.
Key Terms
Arithmetic operators
Compiler
Conditional processing
Interpreter
Iterative processing
Logical operator
Module
Operator precedence
Package
pandas
PIP
PyPI
Relational operators
Script
sklearn
Syntax
Ternary operator
Zero-based indexing
Review
1. Describe the Python features that make it
well suited for data mining and machine
learning.
2. Describe the R features that make it well
suited for data mining and machine learning.
3. Compare and contrast a compiled versus an
interpreted programming language.
4. Build and execute the Python code examples
this chapter presents.
5. Build and execute the R code examples this
chapter presents.
6. Using Python, load the Titanic data set as a
data frame and then use the summary
function to display specifics about the data
set. You can download the Titanic data set
from this text’s catalog page at
go.jblearning.com/DataMining.
7. Using R, load the Titanic data set as a data
frame and then use the summary function to
display specifics about the data set. You can
download the Titanic data set from this text’s
catalog page at
go.jblearning.com/DataMining.
8. Using a Python for loop, display the contents
of the Titanic data set.
9. Using an R for loop, display the contents of
the Titanic data set.
10. Using Visual Studio, create and run the
projects presented in this chapter.
CHAPTER 9
Data Preprocessing
and Cleansing
Chapter Goals and Objectives
▶ Define and describe data cleansing.
▶ Describe data-quality attributes.
▶ Define and describe data
governance.
▶ Describe the role of a data-quality
assessment framework (DQAF).
Measure        Score
Accuracy       97
Completeness   99
Consistency    90
Conformity     88
divided by zero. As you audit your data, you may also want to
search for fields with the NaN value. MySQL represents such
values as NULL.
should not contain fields that are calculable, such as a total that can be derived from the quantities sold and their prices, or, in this case, the age of a customer for whom you know the birth date. A stored Age field is redundant and threatens data consistency.
▶ Data stakeholders
▶ Data owners
▶ Data stewards
▶ Data custodians
▶ OpenRefine
▶ R programming language
▶ Python and pandas, discussed in Chapter 8
▶ DataWrangler
▶ Accuracy
▶ Completeness
▶ Consistency
▶ Conformity
Using these four measures as a starting point, you
can create ways to analyze and, optionally, to
cleanse your data.
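As a starting point, the following pandas sketch computes quick indicators for several of these measures against a hypothetical customers.csv file:

import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical data set

print(df.isna().sum())              # completeness: missing values per field
print(df.duplicated().sum())        # consistency: duplicate records
print((df["Age"] < 0).sum())        # validity: impossible ages (hypothetical field)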
CHAPTER 10
Data Clustering
Chapter Goals and Objectives
▶ Define and describe data clustering.
▶ Compare and contrast hard and soft
clustering.
▶ Compare and contrast different
clustering algorithms.
▶ Describe the purpose of a
dendrogram.
▶ Visually represent cluster
assignments.
Clustering algorithms differ with respect to the following:
▶ Performance
▶ Memory use
▶ Hardness or softness
▶ Data-set size
▶ Need for the analyst to specify the
starting number of clusters
This chapter examines several commonly used
clustering algorithms readily available in both
Python and R. By the time you finish this
chapter, you will understand the following key
concepts:
▶ Time-series analysis
▶ Text-data mining
▶ Predictive analysis
▶ Data classification
▶ K-means and hierarchical clustering
In this section, you will use Solver to perform K-
means and hierarchical clustering.
K-Means Clustering Using Solver
To start, download the Excel file HandsOn.xlsx
from this text’s catalog page at
go.jblearning.com/DataMining. Excel will display
the file’s contents, as shown in FIGURE 10.19.
▶ Cluster centers
▶ Cluster sizes
▶ Cluster average distances
▶ Cluster assignments
To view the cluster assignments, click on the
KMC_Clusters sheet, as shown in FIGURE 10.24.
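For comparison with the Solver results, the following minimal scikit-learn sketch (hypothetical two-dimensional points) reports the same kinds of outputs, the cluster centers and the cluster assignments:

from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # the cluster centroids
print(kmeans.labels_)            # the cluster assignment for each point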
Key Terms
Agglomerative
Centroid
Clustering
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN)
Euclidean distance
Hierarchical clustering
K-means clustering
Outlier
Soft clustering algorithms
Sum of squares
Review
1. Define and describe data clustering.
2. Describe the K-means clustering algorithm.
3. Describe the hierarchical clustering
algorithm.
4. Describe the DBSCAN clustering algorithm.
5. Describe the expectation maximization clustering algorithm.
6. Perform each of the Python clustering
applications this chapter presents using
different cluster sizes. Describe your results.
7. Perform each of the R clustering applications
this chapter presents using different cluster
sizes. Describe your results.
8. Perform the Solver operations this chapter
presents using different cluster sizes.
Describe your findings.
9. Define and describe a data outlier.
10. Describe different approaches to handling
data outliers.
CHAPTER 11
Classification
Chapter Goals and Objectives
▶ Define and describe data
classification.
▶ Compare and contrast binary and
multiclass classification.
▶ Compare and contrast classification
algorithms.
▶ Define and describe the role of
training and testing data sets.
▶ Describe the steps to perform the
classification process.
▶ Iris setosa
▶ Iris versicolor
▶ Iris virginica
You can download the Iris data set from this text’s
catalog page at go.jblearning.com/DataMining.
When you do so, Notepad will display the file’s
contents, as shown in FIGURE 11.2. Save the file
to a folder on your disk.
FIGURE 11.2 The Iris data set.
Used with permission from Microsoft
Within the file, you will find that each record has
measured sepal and petal lengths and widths, as
well as the resulting flower classifications. The
learning algorithm will use part of the data set for
training and part for testing. You can then provide
values you want to classify.
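A minimal scikit-learn sketch of that train/test/classify cycle, using the copy of the Iris data bundled with scikit-learn rather than the downloaded file:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # sepal/petal measurements and flower classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # train
print(knn.score(X_test, y_test))            # accuracy against the testing data
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))  # classify new measurement values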
▶ Handicapped infants
▶ Water project cost sharing
▶ Adoption of the budget resolution
▶ Physician fee freeze
▶ El Salvador aid
▶ Religious groups in schools
▶ Anti-satellite test ban
▶ Aid to Nicaraguan Contras
▶ MX missile
▶ Immigration
▶ Synfuels corporation cutback
▶ Education spending
▶ Superfund right to sue
▶ Crime
▶ Duty-free exports
▶ Export administration act for South
Africa
The data set contains voting records for 435
members of Congress (267 Democrats and 168
Republicans). When you download the data set,
you will find that it contains categorical data, as
shown in FIGURE 11.18.
FIGURE 11.18 Categorical data within the
Congressional Voting Records data set.
Dua, D., and Graff, C. 2019. UCI Machine Learning Repository.
CHAPTER 12
Predictive Analytics
Chapter Goals and Objectives
▶ Define and describe predictive
analysis.
▶ Compare and contrast predictive and
prescriptive analysis.
▶ Define and describe the regression
process.
▶ Define and describe regression
techniques.
▶ Compare and contrast regression
algorithms.
X Y
0 2
1 3
2 4
3 5
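The first data set is perfectly linear (each Y value equals X + 2), so a linear regression recovers the relationship exactly; a minimal scikit-learn sketch:

from sklearn.linear_model import LinearRegression

X = [[0], [1], [2], [3]]   # the X values from the table above
y = [2, 3, 4, 5]           # the corresponding Y values

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # slope 1.0, intercept 2.0
print(model.predict([[10]]))              # predicts 12.0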
X Y
0 2
1 3
2 9
3 13
4 27
5 84
6 105
7 169
If you plot the data, you will find that the data are
not linear, as shown in FIGURE 12.3.
FIGURE 12.3 Plotting a simple nonlinear data
set.
Used with permission of Python Software Foundation
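A straight line fits data such as these poorly; one option the chapter explores is polynomial regression. The following NumPy sketch fits a third-degree polynomial (the degree is an illustrative choice) to the values plotted in Figure 12.3:

import numpy as np

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [2, 3, 9, 13, 27, 84, 105, 169]   # the nonlinear Y values from the table above

coeffs = np.polyfit(x, y, deg=3)      # fit a third-degree polynomial
poly = np.poly1d(coeffs)
print(poly(8))                        # extrapolate one step beyond the data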
X Y
0 2
1 3
2 9
3 12
4 15
5 18
6 19
7 20
▶ Ease of use
▶ Data-set appropriateness
▶ Performance
▶ Memory use
CHAPTER 13
Data Association
Chapter Goals and Objectives
▶ Define and describe data association.
▶ Define and describe market-basket
analysis.
▶ Define and describe support,
confidence, conviction, and lift.
▶ Use visual programming to
implement machine-learning and
data-mining solutions.
Close the File dialog box. From the Data
collection, drag the Data Table on to the
workspace. Then drag your mouse between the
File and Data Table widgets, connecting the
widgets with a line, as shown in FIGURE 13.20.
Performing Predictive Analysis
within Orange
Select the Orange File menu and choose New to
create a new workspace. Then, from the Data
collection, drag a Data Set widget on to the
workspace. Double-click on the Data Set widget.
Orange will open a window listing its built-in data sets, as shown in FIGURE 13.25.
FIGURE 13.25 Displaying the data sets built
into Orange.
Orange Software. Used with permission from University of
Ljubljana
From the Orange Help menu, you can also find
several videos that walk you through the process
of building advanced workflows.
Summary
In this chapter, you examined data association, the
process of identifying patterns (relationships)
between variables in a data set. As you learned,
market-basket analysis is the use of data
association to analyze consumer data for patterns,
such as the products a shopper adds to their
shopping cart (called consequent products) based
upon an item that already resides in the cart
(called the antecedent product). Throughout this
chapter, you performed several analyses of
shopping carts using Python, R, and RapidMiner.
As you learned, data association uses four
measures to identify patterns:
▶ Support: a ratio measure that provides the relative frequency with which an item appears across baskets.
▶ Confidence: a measure that compares the number of times the antecedent and the consequent were purchased together to the number of times the antecedent was purchased.
▶ Lift: the ratio of the confidence to the
expected confidence. Lift values close to
1 do not show an association, but rather,
more likely indicate coincidence.
▶ Conviction: a measure that examines
the frequency with which the consequent
occurs in spite of the absence of the
antecedent.
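In Python, libraries such as mlxtend (one of several association-rule packages; the baskets below are hypothetical) compute these measures directly from one-hot encoded transactions:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded baskets: True if the item is in the cart.
baskets = pd.DataFrame({
    "diapers": [True, True, False, True],
    "beer":    [True, True, False, False],
    "milk":    [False, True, True, True],
})

frequent = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])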
Visual programming is the process of creating a
program by dragging and dropping objects on to a
workspace, as opposed to writing programming
language statements. In this chapter, you used the
RapidMiner and Orange environments to perform
visual programming to create data-mining and
machine-learning solutions.
Key Terms
Antecedent
Association
Confidence
Consequent
Conviction
Item set
Lift
Market-basket analysis
Support
Visual programming
Review
1. Compare and contrast association and
correlation.
2. Using Python, re-create the examples shown
in this chapter, presenting your results.
3. Using R, re-create the examples shown in
this chapter, presenting your results.
4. Use RapidMiner to perform the market-
basket analysis presented in this chapter.
5. Using Orange, create a histogram showing
the number of samples in the iris.tab file
provided by Orange.
6. Using Python, load the Breast Cancer data
set from go.jblearning.com/DataMining and
display a data-set summary.
7. Using R, load the Breast Cancer data set
from go.jblearning.com/DataMining and
display a data-set summary.
8. Using Python, load the Breast Cancer data
set from go.jblearning.com/DataMining and
display the correlation between variables.
9. Using R, load the Breast Cancer data set
from go.jblearning.com/DataMining and
display the correlation between variables.
CHAPTER 14
Mining Text and Images
Chapter Goals and Objectives
▶ Perform text sentiment analysis and
categorization.
▶ Perform facial recognition.
▶ Perform image classification.
▶ Understand that text and image
mining use the same data-mining
techniques you have used throughout
this text.
▶ http://cmake.org/download
▶ https://www.boost.org/users/downlo
ad/
After you install these two applications, use PIP to install the face_recognition and dlib modules (pip install dlib face_recognition).
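Once installed, locating faces takes only a few lines; this sketch assumes a hypothetical photo.jpg in the current directory:

import face_recognition

image = face_recognition.load_image_file("photo.jpg")  # hypothetical image file
boxes = face_recognition.face_locations(image)         # one (top, right, bottom, left) box per face
print(f"Found {len(boxes)} face(s): {boxes}")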
Summary
This chapter introduced text and image mining. To
perform text and image processing, you used
many of the data-mining techniques you have
used throughout this text. To do so, you first
converted the text or image into a numeric
representation the algorithms require. Across the
web, you can find many text- and image-
processing libraries.
Key Terms
Computer vision
Facial recognition
Image mining
Natural-language processing
Sentiment analysis
Text mining
Review
1. Define and describe text mining.
2. Define and describe image mining.
3. Modify the text sentiment script
DecisionTreeText.py this chapter presents to
use your own text. Describe how modifying
your words or format influences your results.
4. Modify the text-clustering script
ClusterText.py this chapter presents to use a
different clustering algorithm.
5. Modify the Python script digitsClassify.py this
chapter presents to use a different clustering
algorithm.
6. Implement the Python face recognition
program this chapter presents.
7. Modify the Python script BoxFace.py this
chapter presents to draw a circle around
each face.
CHAPTER 15
Big Data Mining
Chapter Goals and Objectives
▶ Define and describe big data.
▶ Define and describe common data
capacities such as megabytes,
terabytes, and petabytes.
▶ Describe the role of Hadoop in big
data processing.
▶ Define and describe the MapReduce
process.
▶ Volume
▶ Velocity
▶ Variety
▶ Veracity
▶ Value
Company     Servers
Amazon      1+ million
Microsoft   1+ million
Google      1+ million
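A few lines of Python can illustrate the MapReduce process the chapter goals describe, here counting words across hypothetical documents:

from collections import defaultdict

documents = ["big data big ideas", "big data tools"]  # hypothetical input, split across nodes

# Map phase: emit a (key, value) pair for each word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Group by key, then reduce: combine the values for each key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)   # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}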
CHAPTER 16
Planning and
Launching a Data-
Mining and Data-
Analytics Project
Chapter Goals and Objectives
▶ Define and describe data
governance.
▶ Describe and calculate a return on
investment (ROI).
▶ Describe and perform a SWOT
analysis.
▶ Define and describe the PDCA
process.
Power BI Free $0
Total $10,000
Business people often use the ROI metric to
determine whether they should pursue an
opportunity. To calculate the ROI, you divide the
potential gain by the cost:
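ROI = potential gain ÷ cost. Using the numbers from review question 6 at the end of this chapter (a $5 price increase enabled by a $3 memory chip), and reading the gain as the net gain, a quick Python check:

cost = 3.0                         # the $3 memory chip
gain = 5.0 - cost                  # the $5 price increase, net of the chip's cost
print(f"ROI: {gain / cost:.0%}")   # ROI: 67%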
Key Terms
Plan, Do, Check, and Act (PDCA)
Return on Investment (ROI)
Role-based security
SMART (Specific, Measurable, Achievable,
Realistic, and Time-based)
Strengths, Weaknesses, Opportunities, and Threats (SWOT)
Review
1. Describe common tasks and responsibilities
of the data governance board.
2. Perform a SWOT analysis that provides
insight into whether you should choose the R
or Python programming language as your
data-analytics programming tool.
3. Assume you are asked to use census data to
create a dashboard for potential healthcare
customers. List the questions you would ask
about the data source.
4. Describe role-based security and how you
would use it for a data-analytics project.
5. Describe the SMART acronym and how and
when you might use it.
6. Assume a computer company can increase
the price at which it sells computers by $5 a
system by adding a $3 memory chip. What is
the potential ROI?
7. Describe how you would use the Plan, Do,
Check, and Act methodology for a new
business intelligence dashboard.
8. Using Jupyter Notebook, create the notebook
presented in the Hands-On section of this
chapter.
GLOSSARY
Agglomerative:
A synonym for merging. A hierarchical clustering
algorithm is agglomerative in that it merges related
clusters to form a larger cluster.
Aggregate:
A group. Structured Query Language (SQL)
provides several aggregation queries that group
values. SQL also provides aggregation functions,
such as COUNT, SUM, and AVG, which perform
operations on a group of records.
Antecedent:
Something that existed before another. In the case
of market-basket analysis, the antecedent
corresponds to an item that existed in the cart prior
to the selection of another.
API:
An acronym for application programming interface,
which describes code (often a library) a program
can call (use) to accomplish a specific task. To
create charts, Python programmers use the plot.ly
API.
Area chart:
A chart, similar to a line chart, that represents how
values change over time. The area chart fills the
area under the line that connects the data points
with a solid color.
Arithmetic mean:
The sum of the data values divided by the number of values.
Arithmetic operator:
Within a programming language, an arithmetic
operator is a symbol that represents addition (+),
subtraction (−), multiplication (*), division (/), and
so on.
Artificial intelligence:
The science of making intelligent machines that
can perceive visual items, recognize voices, make
decisions, and more. Machine learning is an
application of artificial intelligence.
Association:
In market-basket analysis, association is the process of identifying relationships between variables, such as items that a customer frequently purchases together (diapers and beer, for example).
Bar chart:
A chart that represents data using horizontal bars,
the lengths of which are proportional to the
underlying data value.
Bayes Theorem:
A theorem that produces a probability based upon
known or related events.
Big data:
The term to describe a data set the size of which
exceeds the amount of data with which our
common applications can work.
Binary classification:
A classification that assigns data to one of two
classes, such as a loan being approved or
disapproved.
Bubble chart:
A chart that represents three dimensions of data
using x and y coordinates and the size of the
bubble, which is proportional to the underlying
data value.
Business intelligence:
The use of tools (data mining, machine learning,
and visualization) to convert data into actionable
business insights and recommendations.
Business rule:
Defines or constrains a specific business
operational process. A business rule, for example,
might specify that employee overtime cannot
exceed 10 hours per week.
Byte:
The amount of data required to store an ASCII character: 1,024⁰ bytes, that is, a single byte.
Candlestick chart:
A chart used for financial data that represents a
stock’s open, close, high, and low values, with the
color of the box representing whether the stock
closed positive or negative.
Cardinality:
With respect to database entities, describes a
relationship type. Common cardinalities include
one-to-one, one-to-many, and many-to-many.
Centroid:
The center of a cluster. The centroid does not
have to correspond to a data point within the data
set.
Classification:
A supervised machine-learning solution that
assigns data items to specific categories.
Cluster:
In database management, a cluster is a collection
of notes within a distributed database. In data
mining, a cluster is a group of related data.
Clustering:
The process of grouping related data. Clustering is
an unsupervised learning process in that it works
with unlabeled data.
Collection:
A group of documents within a NoSQL database,
similar to a table within a relational database.
Column chart:
A chart that represents data using vertical bars,
the lengths of which are proportional to the
underlying data value.
Combo chart:
A chart that combines a bar chart with a line chart,
often for comparative purposes.
Comma-delimited file:
A file that contains field values separated by commas. Database developers often use comma-delimited files, also called comma-separated values (CSV) files, to move data between a database and an application, such as a spreadsheet.
Compiler:
A program that examines a source code file for
syntax errors and, if none exist, creates an
executable file.
Composition chart:
A chart that represents how one or more values
relate to a larger whole. Common composition
charts include pie charts and sunburst charts.
Computer vision:
The ability for a software application to analyze
and understand images.
Conceptual model:
A high-level database model that shows the
entities that make up a system and their
relationships, but not the attributes that make up
the entities.
Conditional formatting:
The process of highlighting (using fonts and
colors) cells in a spreadsheet that satisfy a specific
condition.
Conditional processing:
In programming, conditional processing identifies a
set of instructions (statements) a program
executes when a specified condition is true or
false. The Python and R programming languages
implement conditional processing using an if-else
statement.
Confidence:
In market-basket analysis, confidence compares
the number of times the pair was purchased to the
number of times one of the items in the pair was
purchased.
Confidence interval:
A range of values, together with the probability that a value will fall within that range.
Consequent:
Something that occurred based on the presence of
another. In the case of market-basket analysis, the
consequent is the item that a customer places in
the cart based on the presence of another item—
the antecedent.
Conviction:
In market-basket analysis, conviction is a measure
that examines the frequency with which a
consequent occurs in spite of the absence of the
antecedent.
Correlation:
A measure that shows the relationship between
two variables. Variables with a correlation
approaching 1 have a strong correlation—as you
increase the value of one of the variables, the
value of the second will also increase. Likewise, if
you decrease the value of one of the variables, the
value of the second will decrease. Variables with a
correlation approaching −1 are inversely
correlated. As you increase the value of one
variable, the other variable’s value will decrease.
Variables with a correlation approaching 0 are not
correlated.
Correlation chart:
A chart that represents how two or more variables
relate. A common correlation chart is the scatter
chart.
Covariance:
A measure of similarity between two variables. If
the large values of one variable align with the large
values of a second variable, the covariance is
positive. If instead, the large values of one variable
align with the small values of the other variable,
the covariance is negative.
Cross join:
A join operation that returns a row for each combination of a row in the left table with a row in the right table.
Crow’s foot:
Within an entity relationship diagram, a crow’s foot
is a symbol used to represent a “many”
relationship (such as one-to-many). The symbol is
so named because it resembles the shape a
crow’s foot might leave in dirt as the crow walks
across it.
CRUD:
An acronym for create, read, update, and delete,
which correspond to common database
operations. Structured Query Language (SQL) and
MongoDB provide specific queries for each of
these operations.
Dashboard:
A visual (and often interactive) collection of charts
and graphs that correspond to the metrics for a
business’s key performance indicators.
Dashboard chart:
A chart that represents key performance indicators
(KPIs) that companies use to track initiatives.
Common dashboard charts include the gauge
chart and calendar chart.
Data accuracy:
The degree to which data correctly represent
underlying real-world values.
Data anomalies:
Inconsistencies that occur within data due to
duplicate or unnormalized data. Data anomalies
normally occur during insert, update, or delete
operations.
Data association:
The process of identifying relationships between
variables. A well-known data association problem
is market-basket analysis for which the items in a
customer’s shopping cart are examined in order to
determine relationships between the items that
influence the shopper’s behavior.
Data cleansing:
The process of detecting, correcting, and removing
errors and inconsistences from data.
Data completeness:
The degree to which data represent all required
values.
Data conformity:
The degree to which data values align with
business rules.
Data consistency:
The degree to which similar or related values align
throughout the data set.
Data custodian:
A technical expert who administers and manages
the data. The data custodian would implement the
tables (or, in the case of NoSQL, the collections)
that store the data. The data custodian might
implement the queries, stored procedures, and
database solutions that implement the business
rules and security controls.
Data integrity:
The degree to which data are accurate, complete,
and truly representative of the real-world data.
Data lake:
A collection of data outside of a database stored in
a more natural form, such as a binary or text file.
Data mart:
A specialized data warehouse that contains
information for a specific group, such as sales or
manufacturing.
Data munging:
A synonym for data wrangling—the steps a
database developer performs to transform and
cleanse data.
Data owner:
Often an executive who oversees the business
unit associated with the data. The data owner
serves as the final arbitrator and decision maker
for issues that affect the data. He or she, however,
would not play a day-to-day, hands-on role with
the data.
Data quality:
The measure of data’s suitability for use.
Data steward:
An individual who makes the day-to-day decisions
for the data with respect to quality, content, and
metadata. The data steward has great knowledge
of the data, its producers and consumers, and the
business rules that guide data operations.
Data timeliness:
The degree to which data arrive in a favorable time
frame.
Data validity:
The degree to which data are logically correct.
Data visualization:
The process of using charts and graphs (visuals)
to represent data.
Data warehouse:
A database optimized for reporting, decision
support, and other analytical operations.
Data wrangling:
The steps a database developer performs to
transform and cleanse data.
Database:
A collection of data organized within a database
management system (DBMS) for fast storage and
retrieval.
DBMS:
An acronym for database management system—
the software that surrounds and operates the
database. Common DBMS software includes
Oracle, Microsoft SQL, and MySQL, as well as
NoSQL DBMS software such as MongoDB.
DBSCAN:
An acronym for density-based spatial clustering of
applications with noise. The DBSCAN clustering
algorithm creates clusters by grouping points
based on their proximity to dense regions within
the data’s coordinate space.
Decision tree:
A graph-based data structure that contains
decision points. By following paths through the
decision points, a decision tree classifier can
determine to which class it should assign data.
Decision tree classifier:
A classification technique that creates a decision
tree, which it applies to assign data to a class.
Decision tree regression:
A regression technique that produces an
expression that relates predictor variables to the
dependent variable.
Deep learning:
A hierarchically structured process that leverages
layers of machine learning for which the output of
one layer becomes the input to the next.
Delimiter:
A value used to separate values within a file. Common delimiters include the comma and tab.
Dependent variable:
In data classification, the dependent variable is the
class to which the algorithm will assign the data.
Descriptive analytics:
The use of statistics and data mining to explain (describe) what happened based on historical data.
Diff chart:
A chart that compares two data sets by
representing the differences between them.
Distributed database:
A database that runs on multiple servers located
(distributed) within a network or the cloud.
Distribution chart:
A chart that represents the frequency of values
within a data set. The common distribution chart is
the histogram.
Document:
A data item within a NoSQL database, similar to a
record within a relational database.
DQAF:
An acronym for data quality assessment
framework, a guide that presents 48 factors to
consider when measuring data quality.
Dynamic chart:
A web-based chart, the contents of which update
automatically when the underlying data set values
change.
Entity:
A term for a thing or object.
Entity relationship diagram (ERD):
A drawing that represents the entities (things) that
make up a system, along with the relationships
between entities.
ETL:
An acronym for extract, transform, and load.
Database developers commonly perform ETL
operations. The extract operation retrieves data
from one or more database tables. The transform
operation changes the data in some way. The load
operation stores the transformed data into a
different destination.
Euclidean distance:
The straight-line distance between two points, calculated as the square root of the sum of the squared differences between the points’ coordinates: d = √((x₂ − x₁)² + (y₂ − y₁)²).
Exabyte (EB):
1,024⁶ bytes.
Export:
The process of extracting data from an application
or database to another destination.
Facial recognition:
Analyzing patterns within photos to identify and
recognize faces.
Filter:
A query operation performed on a subset of
records based upon a condition specified within a
WHERE or HAVING clause.
Filtering:
The process of displaying only selected values.
Foreign key:
A field in one table that corresponds to a primary-
key field in a different table. Foreign keys exist to
support databases’ referential integrity. A database
enforces many of the same rules for a foreign key
that it would for the primary key, such as a foreign
key cannot have a NULL value.
Framework:
A structure (often a document) users can follow to
achieve a desired result.
Funnel chart:
A chart, the shape of which resembles a funnel, for
which the top of the funnel represents 100% of the
whole and the area of each section below
represents the section’s underlying value.
Gauge chart:
A chart that appears similar to a dial on an
automobile dashboard that represents a value.
Analysts often display gauge charts on a key
performance indicator (KPI) dashboard.
Geochart:
A chart that represents how the values from one
location compare to values in a different location.
Geomap chart:
A chart that includes a map.
Geometric mean:
A mean calculated by taking the nth root of the product of n values, as opposed to the arithmetic mean, which divides the sum of the values by n.
Gigabyte (GB):
1,024³ bytes.
Graph database:
A database designed to store the many
relationships an entity may have. Graph databases
store entities as nodes and the relationships
between entities as edges. Graph databases are
NoSQL databases.
Graphical user interface:
A visual environment within which a user interacts
with an application. The MySQL Workbench
provides a graphical user interface within which a
user can execute SQL queries.
GUI:
An acronym for graphical user interface. MongoDB
provides the Compass GUI.
Hadoop:
An open-source distributed framework for
processing big data.
Hard clustering algorithm:
A clustering algorithm that restricts each point to
residing in only one cluster.
Harmonic mean:
The number of values divided by the sum of the reciprocals of the values. The harmonic mean may reduce the impact of outliers.
HDFS:
An acronym for Hadoop Distributed File System, a
file system that distributes big data files across
multiple nodes, which combine to create an HDFS
cluster.
Hierarchical clustering:
An agglomerative clustering algorithm that creates
clusters by merging related (nearby) clusters.
Histogram chart:
A chart that represents a data set’s frequency
distribution, which shows a count (the frequency)
of occurrence of values within defined ranges,
called bins.
Horizontal scaling:
Scaling a database or application across additional
servers.
Image mining:
The application of data-mining techniques to
image files. Image mining includes image
recognition (computer vision), image clustering,
and image classification.
Import:
The process of loading data into an application or
database from another source.
Index:
A value that points to a specific record used to
improve database performance. An index normally
contains a field value and a pointer to the
corresponding record within a table. To assign an
index to a field, database developers use the
CREATE INDEX query.
Inner join:
A join operation that returns rows that include
fields from the left and right tables for which the
tables have matching values on the specified field.
Interpreter:
A program that examines a programming
language statement for syntax errors and, if none
exist, executes the statements. Unlike a compiler
that converts an entire source code file into an
executable file at one time, an interpreter executes
one statement at a time. R and Python use an
interpreter.
IoT:
An acronym for Internet of Things, which describes
the collection of billions of devices on the internet.
Item set:
A collection of items within a market basket.
Iterative processing:
In programming, iterative processing identifies a
set of instructions (statements) a program
executes as long as a specific condition is true.
Programmers refer to such statements as a loop.
When the condition becomes false, the program
continues its execution at the first statement that
follows the loop. To implement iterative
processing, Python provides the for and while
loops, and R provides the for, while, and repeat
loops.
JOIN:
A Structured Query Language (SQL) query
operation that temporarily combines two tables
based upon a related field.
JSON:
An acronym for JavaScript Object Notation, a
format used to describe an object. NoSQL
databases make extensive use of JSON to store
data.
Kilobyte (KB):
1,024¹ bytes.
K-means clustering:
A clustering technique that groups points based on
minimizing the average distance of each point
from its cluster’s center (centroid).
K-nearest-neighbor classifier:
A data classification technique that assigns data to
the class it most closely resembles.
KPIs:
Key performance indicators—a measurable value
that indicates how well a company is achieving a
specific business objective. Data KPIs are
measures the company can use to determine the
effectiveness of their data-quality objectives. Data
analysts often display KPIs within a dashboard to
make the metrics easily accessible by others.
Left join:
A join operation that returns the matching rows
from the left and right tables, as well as rows from
the left table. Structured Query Language (SQL)
will assign the null value to fields that do not have
matching values in the right table.
Lift:
In market-basket analysis, lift is the ratio of the
actual confidence to the expected confidence. Lift
values close to 1 do not show an association, but
rather, more likely indicate coincidence.
Lightweight:
Less overhead.
Line chart:
A chart that uses line segments to connect data
point markers, ideally to better visualize trends.
Linear regression:
A regression technique that produces a linear equation relating one or more predictor variables to the dependent variable.
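A minimal scikit-learn sketch that fits a line to illustrative data:

from sklearn.linear_model import LinearRegression

x = [[1], [2], [3], [4]]  # predictor values
y = [3, 5, 7, 9]          # dependent values (y = 2x + 1)
model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0
print(model.predict([[5]]))           # approximately [11.]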
Logical model:
A database model that represents the things
(entities) that make up a system and their
relationships. A logical model will include attribute
(field) names, but not the underlying field data
types.
Logical operator:
Within a programming language, a logical operator
allows programmers to combine two or more
conditions. R and Python provide the AND, OR,
and NOT logical operators.
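In Python, for example, the operators are written and, or, and not (a minimal sketch with illustrative values):

age = 25
income = 50000
print(age > 21 and income > 40000)  # True: both conditions hold
print(age > 30 or income > 40000)   # True: at least one condition holds
print(not age > 30)                 # True: negates a false condition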
Logit:
A function used in logistic regression that
determines the probability that data belong to a
class.
Machine learning:
The use of data pattern recognition algorithms to
solve problems. There are two primary forms of
machine learning: supervised and unsupervised.
In supervised machine learning, the algorithm
examines a training data set to learn how to
identify patterns. Unsupervised learning, in
contrast, does not use a training set.
Managed database server:
A cloud-based database for which the cloud
provider manages the database software,
performing the administrative tasks, such as
applying updates and patches. A managed
database server can also be configured to scale
on demand. MongoDB provides the Atlas
managed database server.
MapReduce:
A two-phase process that applications use to
perform big data analytics. During phase one, the
mapping phase, the code maps (identifies) the
data of interest and groups the data based on the
key portion of a key–value pair. During the
reduction phase, the code aggregates (combines)
the results for the reduced groups.
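A minimal pure-Python sketch of the two phases, counting words (illustrative only; real MapReduce frameworks distribute this work across many servers):

from collections import defaultdict

records = ["big data", "big analytics", "data analytics"]

# Map phase: emit a key-value pair for each word.
pairs = [(word, 1) for record in records for word in record.split()]

# Reduce phase: aggregate the values for each key.
counts = defaultdict(int)
for word, value in pairs:
    counts[word] += value
print(dict(counts))  # {'big': 2, 'data': 2, 'analytics': 2}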
Marker:
The symbol used to represent a data point on a
chart. The marker might be a circle, an X, or some
other meaningful shape.
Market-basket analysis:
The process of analyzing items in a shopping cart
to determine items that a consumer purchases
together, such as diapers and beer.
Master data:
The key data within a business, such as its
customers, vendors, employees, and so on. Using
master data management (MDM), database
designers and administrators try to have only one
definition/schema for these key data.
Median:
The middle value in a list of sorted numbers.
Megabyte (MB):
1,024² bytes.
MLP:
An acronym for multilayer perceptron. The MLP
classifier is a neural network solution that uses
multiple layers of perceptrons to assign data to
groups.
Module:
In Python, a module is a source file that contains
the data structure definitions and functions to
perform a specific task.
Multiclass classification:
A classification that assigns data to one of many classes, such as classifying a wine as white, red, or rosé.
Nested query:
A query within another query, such as a SELECT
query within an UPDATE query that returns values
that tell UPDATE which records to modify. You
place the nested query within parentheses.
Database developers also refer to a nested query
as a subquery.
Network port:
A numeric value that corresponds to the port that a
network application listens to for connections and
commands. Web servers listen to port 80.
MySQL, by default, listens to port 3306. The
network port follows an Internet Protocol (IP)
address or domain name, separated from each
with a colon (:), such as 127.0.0.1:3306 or
DataMiningAndAnalysis.com:3306.
Neural network:
A machine learning algorithm that simulates the
activities of the brain and nervous system. Behind
the scenes, neural networks use mathematical
functions (called perceptrons).
Normalization:
A process of refining a relational database table,
often by decomposing data into one or more
tables, to achieve specific data conditions that
correspond to a normal form. By normalizing
database tables, database designers reduce the
possibility of data anomalies that occur during
insert, update, and delete operations.
NoSQL:
An acronym for not only SQL. A NoSQL database
does not store data within tables, but rather files
(often in JavaScript Object Notation [JSON] or
Extensible Markup Language [XML]) and does not
use Structured Query Language (SQL) as its
primary query language.
Object-oriented database:
A database that allows database developers to
store and retrieve data using objects.
OLAP:
An acronym for online analytical processing. A
data warehouse that exists for reporting, decision
support operations, and other analytics uses
OLAP.
OLTP:
An acronym for online transaction processing. A
database that stores and retrieves data that
describe a transaction, such as a customer order
or an employee timecard update, uses OLTP.
On-premise server:
A server that resides within a company’s local data
center.
OODBMS:
An acronym for object-oriented database
management system. A database management
system that provides support for storing and
retrieving objects and their attributes.
Operator precedence:
In a programming language, operator precedence
specifies the order in which the language will
perform arithmetic operations when an expression
contains more than one operator. In Python and R,
for example, the multiplication operator has a
higher precedence than addition, so the language
will evaluate the following expression to 17: 2 + 3
× 5.
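You can confirm this behavior, and override it with parentheses, directly in Python:

print(2 + 3 * 5)    # 17: multiplication occurs first
print((2 + 3) * 5)  # 25: parentheses force the addition first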
Outer join:
A join operation that returns the matching rows
from the left and right tables, as well as the unmatched rows from the left or right table, as specified by LEFT OUTER JOIN or RIGHT OUTER JOIN. Structured Query Language (SQL) will assign the null value to fields that do not have matching values in the other table.
Outlier:
A value that falls well outside a data set's clusters or overall distribution.
Overfitting data:
With respect to the K-nearest-neighbor’s algorithm
to classify data, if you specify a value of K that is
too small, you may “overfit” the model, meaning
the model may start to treat noise or errant data as
valid training data.
Package:
In Python, a package is a directory (folder) of
Python module files.
Pandas:
The library Python developers import into their
program that provides data structure definitions
and functions for data analytic operations. The
pandas library also defines the dataframe object,
into which Python programs often load data sets.
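A minimal sketch (the file name Sales.csv is illustrative):

import pandas as pd

df = pd.read_csv("Sales.csv")  # load a data set into a dataframe
print(df.head())               # the first five rows
print(df.describe())           # summary statistics for each column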
PDCA:
An acronym for Plan, Do, Check, Act—an iterative
process of reevaluating a solution or process to
determine if improvements can be made.
Perceptron:
A linear function used in neural networks. Because
many problems are not linear, they must be further
decomposed into linear models by creating
additional layers of perceptrons, creating a
multilayer perceptron (MLP) solution.
Petabyte (PB):
1,024⁵ bytes.
Physical model:
A database model that represents the entities that
make up a system and their relationships that
include each entity’s attribute name and data type.
Using the physical model, a database developer
can quickly create entities using the CREATE
TABLE query.
Pie chart:
A chart that represents 100% of the whole as a
circle and the values of the components as
proportional slices.
PIP:
A command-line utility program that Python
developers use to download and install packages
on their computer.
Pivot table:
A capability provided by Excel that lets users
group data, identify relationships, and format data
for reporting.
Polynomial:
A mathematical expression that consists of
variables, coefficients, and exponents.
Polynomial degree:
The highest exponent within the polynomial.
Polynomial regression:
A regression technique that produces a polynomial
expression that represents the relationships
between variables.
Predictive analytics:
The use of statistics, data mining, and machine
learning to analyze historical data to predict what
will happen in the future.
Predictor variable:
A data set value used by a classification algorithm
to predict the class to which the data should be
assigned.
Prescriptive analytics:
Analytics that builds upon descriptive and predictive analytics to recommend the best choice among available options.
Primary key:
A field within a database table that uniquely
identifies each record. Examples of primary keys
include CUSTOMER_ID or ORDER_ID.
PyPI:
An acronym for Python Package Index, a website
that features over 180,000 installable packages.
Pyramid chart:
A chart, the shape of which resembles a triangle,
for which the bottom of the pyramid represents
100% of the whole and the area of each section
represents the section’s underlying value.
Quartile:
One of three values that divide a sorted list into four equal parts, each containing 25% of the data. The second quartile is also the median, the number below which 50% of the values fall and above which the other 50% reside.
Query:
A command that directs a database to perform a
specific task, such as creating a table or retrieving
specific records from a table.
Radar chart:
A chart that represents multiple variables across
two or more axes, the origin of which is the center
of the chart.
Referential integrity:
A database concept that specifies that related attributes in different tables must be treated consistently.
Reinforced machine learning:
The use of feedback loops to reward correct
predictions and to punish mistakes.
Relational database:
A database that stores entities (the things that
make up the system) in individual tables for which
the rows correspond to records and the columns to
the attributes (fields) that make up the record. A
relational database also stores information about
the relationships between tables. Relational
databases are often called SQL databases
because developers make extensive use of
Structured Query Language (SQL) to manipulate
them.
Relational operator:
In a programming language, a relational operator
lets programs test the relationship between two
values, such as whether the values are equal, one
value is greater than the other, and so on.
Remote database:
A database that does not reside on the current
computer, but rather, on a network or cloud-based
server.
Replica set:
A collection of MongoDB databases that back up
(replicate) other databases.
Right join:
A join operation that returns the matching rows
from the left and right tables, as well as the unmatched rows from the right table. SQL will assign the null value to
fields that do not have matching values in the left
table.
ROI:
An acronym for return on investment—a metric
businesspeople use to determine whether they
should proceed with an opportunity.
Role-based security:
A security approach that identifies common roles
within a system (such as administrator, power
user, user, report/visualization creator), maps
users to specific roles, and grants access levels to
each role.
Scaling:
The process of increasing (scaling up) or
decreasing (scaling down) computing resources
based on demand.
Scatter chart:
A chart that represents the x–y relationships
between two data sets as markers.
Schema:
A representation of something in an outline or
model form. Database designers often use entity
relationship diagrams to visually represent a
database schema.
Script:
A file containing programming language
statements that an interpreter will execute.
Self-describing:
An item, the contents of which not only provide
values but also specify the item’s structure.
JavaScript Object Notation (JSON) objects consist
of field–value pairs that specify structure and
values.
Sentiment analysis:
The analysis of text to determine the underlying
attitude, such as positive, neutral, or negative.
Shard:
A partition of a database within a collection (cluster) of related databases, distributed to improve performance and reliability.
Sklearn:
The package name Python developers use to
import the scikit-learn library that provides
functions to perform classification, clustering,
regression, and more.
SMART:
An acronym for specific, measurable, achievable,
realistic, and time-based—an approach to
improving goals and processes.
Snowflake schema:
A schema commonly used to represent the fact
and dimension tables that make up a data
warehouse. A snowflake schema differs from a
star schema in that it includes levels of dimension
tables to create a structure that resembles a
snowflake in appearance.
Soft-clustering algorithm:
A clustering algorithm that allows a point to reside
within multiple clusters.
Stacked-area chart:
A chart that displays two or more area charts on
the same graph, often for comparative purposes or
to represent each area’s percentage of the whole.
Stacked-bar chart:
A bar chart for which the primary (aggregate)
rectangular bar is composed of two or more
component rectangles, the lengths of which are
proportional to the component value.
Stacked-column chart:
A column chart for which the primary (aggregate)
rectangular column bar is composed of two or
more component rectangles, the lengths of which
are proportional to the component value.
Standard deviation:
A measure of how much a data set's values vary around the mean. A large standard deviation indicates
values are dispersed, whereas a small standard
deviation indicates that the values are closely
grouped.
Star schema:
A data warehouse schema so named because it
resembles the shape of a star. At the center of the
star is a fact table that contains specific data
metrics, surrounded by dimension tables that
contain supporting data.
Static chart:
A chart that does not automatically update based
upon changes to the underlying data set. A static
chart might be a graphics image within a Hypertext
Markup Language (HTML) file. To change the
values the chart displays, a developer must
replace the graphic.
Subquery:
A synonym for nested query.
Sum of squares:
A measure that summarizes the variance within a data set: each value's distance from the mean is squared and added to a running total. The
smaller the sum of the squares, the more closely
the values align with the mean.
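A minimal Python sketch of the calculation (values are illustrative):

values = [2, 4, 4, 6]
mean = sum(values) / len(values)  # 4.0
sum_of_squares = sum((v - mean) ** 2 for v in values)
print(sum_of_squares)  # 8.0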
Sunburst chart:
A chart that displays hierarchically related data
using a multilayer circular shape, similar to
multiple layers of pie charts.
Supervised learning:
The use of an algorithm that uses labeled data to
produce a training data set from which the
algorithm can learn. Machine learning solutions
that use supervised learning include classification.
Support:
In market-basket analysis, support is a ratio
measure that provides the relative frequency with which an item (or item set) appears across all baskets.
SVC:
An acronym for support vector classifier, a
classification technique that creates classes using
a series of lines (vectors) that divide the classes.
SVM:
An acronym for support vector machine, a
machine learning algorithm. When applied to
classification, the term support vector classifier
(SVC) is often used.
SWOT:
An acronym for strengths, weaknesses,
opportunities, and threats—an analysis technique
for evaluating opportunities.
Syntax:
The grammar rules for a language. In Python, for
example, the syntax specifies that variable names
start with a letter or an underscore. If you violate
the language syntax, the interpreter will display a
syntax error message and will not execute the
statement.
Terabyte (TB):
1,024⁴ bytes.
Ternary operator:
In a programming language, a ternary operator is
a conditional operator that returns one of two
values based upon a tested condition being true or
false.
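Python expresses the ternary operator as a conditional expression (a minimal sketch with illustrative values):

score = 72
result = "pass" if score >= 60 else "fail"  # returns one of two values
print(result)  # pass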
Test data set:
A data set with predictor variables and correct
results for the dependent variable that is used by a
machine learning algorithm to test the accuracy of
a model.
Underfitting data:
With respect to the K-nearest-neighbor’s
algorithm, if you specify a value of K that is too
large, you may “underfit” the model, which means
the model is not capable of correctly modeling the
training data.
Unsupervised learning:
Machine learning that does not use labeled data,
and hence, does not use a training data set.
Unsupervised learning algorithms discover their
solutions. Common unsupervised learning
solutions include clustering and data association.
Variance:
A measure of the dispersion of a data set's values around the mean, equal to the square of the standard deviation.
Vertical scaling:
Scaling a database or application server by adding
processing power, random-access memory (RAM),
and/or disk storage capacity.
Visual programming:
The process of creating a program by dragging
and dropping objects onto a workspace as
opposed to writing programming language
statements. The RapidMiner and Orange
environments allow data analysts to use visual
programming to create data-mining and machine
learning solutions.
Visualization:
The visual representation of data with the goal of
improving communication.
Waterfall chart:
A chart that represents the sequential impact of
applying positive (increments) and negative
(decrements) values to a starting value.
Wildcard:
A symbol used within a query operation to match
values. Structured Query Language (SQL)
supports the percent sign (%) wildcard, which matches zero or more characters, and the underscore (_) wildcard, which matches exactly one character. Database developers
use wildcard characters with the LIKE clause.
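A minimal sketch using Python's built-in sqlite3 module (the table and values are illustrative):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (name TEXT)")
db.executemany("INSERT INTO customers VALUES (?)",
               [("Smith",), ("Smythe",), ("Jones",)])

# The % wildcard matches zero or more characters.
rows = db.execute("SELECT name FROM customers WHERE name LIKE 'Sm%'")
print(rows.fetchall())  # [('Smith',), ('Smythe',)]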
Yottabyte (YB):
1,024⁸ bytes.
Zero-based indexing:
In a programming language, zero-based indexing
specifies that the first element of an array will
reside at the index location zero. Python uses
zero-based indexing; R does not. In R, the first
element of an array resides at index location 1.
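A minimal Python sketch:

colors = ["red", "green", "blue"]
print(colors[0])  # red: Python's first element is at index 0
# In R, colors[1] would return the first element instead.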
Zettabyte (ZB):
1,024⁷ bytes.
INDEX
Note: Page numbers followed by f and t indicate
material in figures and tables, respectively.
A
Actian, 222t
agglomerative hierarchical clustering
algorithm, 456
Amazon Alexa, 592
Amazon Aurora, 215, 222t
Amazon DynamoDB, 332
Amazon Web Services (AWS), 332
antecedent product, 563, 564, 589
application programming interface (API), 106,
327, 424
apriori function, 28, 567, 571
area chart, 118–119, 118f, 119f
arithmetic functions, 247–250, 248t–249t,
249f–250f
arithmetic mean, 175, 176f
arithmetic operations, 245–247, 245f–246f,
246t–247t
bitwise operators, 246, 247t
compound operators, 247, 247t
artificial intelligence (AI), 46, 47f
association, 3, 563, 589–590
antecedent, 563, 564, 589
apriori function, 567, 571
calculating, 566–568
confidence, 564, 565, 589
consequent, 563, 564, 589
conviction, 564, 566, 590
definition, 27, 563
FP-Growth, 571–574, 572f–575f, 577f
item sets, 567, 568
lift, 564, 566, 590
Python script, 27–29
real-world data, 568–571, 569f
shopping cart data set, 564f
shopping-cart problem, 563
support, 564, 565, 589
Weka, 40–41, 41f
asynchronous JavaScript (AJAX), 111
Auto-MPG.csv data set, 160
AVEDEV function, 188, 189f
AVERAGE function, 175, 176f
AVERAGEIF function, 177, 177f
AVERAGEIFS function, 178
AVG function, 422, 422f
B
bad data, risks of, 411
bar charts, 120–121, 120f, 121f
Bayes theorem, 487, 502
bent-elbow method, 449, 449f
big data, 609, 611, 626. See also Hadoop;
Hadoop Distributed File System (HDFS)
data mining, 29–30, 30f
servers, 613, 613t
size of data, 611, 612, 612t, 626
sources, 609–610, 610f
V’s of, 613
binary classification, 486, 487
bottom-up hierarchical clustering algorithm,
456, 457f
box and whisker charts, 145–146
Breast Cancer data set
dimensionality reduction, 59, 59f, 60
KNN classification, 500–502, 500f
bubble chart, 137–138, 138f
built-in aggregate functions
AVG and STDDEV functions, 244, 244f
COUNT function, 244, 244f
SUM, MIN, and MAX functions, 244, 245f
business intelligence, 2, 15–17
business rules, 412
bytes, 612, 612t. See also specific bytes
C
calendar chart, 139–141, 140f
candlestick chart, 141–142, 141f
cardinality, 76, 78, 78f, 79, 98
Cascading Style Sheets (CSS), 221
Cassandra, 333, 333f, 334f
category-based comparison charts, 4
bar and column charts, 120–121, 120f,
121f
clustered bar and column charts, 121–
122, 122f
combo chart, 122, 123f
diff chart, 122–125, 124f
radar chart, 122, 123f
waterfall chart, 125, 125f
Census data set, 64–65, 64f
centroid, 448f, 448t
CFO. See chief financial officer (CFO)
ChartData.R program, 10–11, 10f, 11f
charting data with Excel, 164–167, 165f–167f
Chen diagram, 79, 79f, 80f, 82
chief financial officer (CFO), 429
classification, 23–25, 485–529, 531
algorithms, 45, 47, 48
decision-tree classifier, 511–514, 511f,
515f
definition of, 485, 528
handwriting, 604–606, 604f, 605f
KNN classifier (See K-nearest-
neighbors (KNN) classifier)
logistic regression classifier, 506–509
Naïve Bayes classifier, 502–506, 503t
neural networks, 509–511
random-forest classifier, 514, 516–517
real-world data sets, 521–528
steps in, 487
supervised learning, 486, 528
SVM classifier, 517–521, 517f–519f
test data set, 486, 528
text, 592
training data set, 486, 528
Weka, 36, 39, 39f
cloud services, 633
cloud-based databases, 267, 267f
clustered bar and column charts, 121–122,
122f
clustering, 20–23, 20f, 446–484
algorithms, 447, 448, 448t
centroid, 448f
cluster assignments, 468–470
concept of, 446
definition of, 2, 35
Euclidean distances, 448f
hard, 446, 447
hierarchical, 447, 448t
bottom-up algorithm, 456, 457f
cluster chart, 457, 458f
dendrograms, 456, 457f, 459f
Python script, 458–459, 461–462,
462f
R program, 460, 460f
top-down algorithm, 456
K-means, 447
bent-elbow method, 449, 449f
cluster formation, 448, 449f, 450f
description of, 448t
Python script, 450–452, 452f
R program, 452, 453, 453f
steps in, 451
K-means++, 448t, 453
Python script, 453–455
R program, 455–456, 456f
shapes and forms, 467, 467f
soft, 446, 447
stock data, 56–57, 57f
text, 592, 598–601
Weka, 35, 36, 37f, 38f
clusters, 332, 611, 614
column charts, 120–121, 120f, 121f
combo chart, 122, 123f
comma-separated values (CSV) files, 30, 30f
compiler, 347, 378
composition charts, 4
donut chart, 129–130, 130f
funnel chart, 135–136, 135f
pie chart, 126–129, 126f–129f
pyramid chart, 136, 136f
stacked area chart, 132–133, 133f
stacked bar and column charts, 130–132,
132f
sunburst chart, 130, 131f
treemap chart, 134, 134f
computer vision, 592, 606
conceptual database model, 76, 80–81, 81f
conditional formatting, 160, 167
in Excel, 168f
to highlight specific values, 167–169,
168f
steps in, 167
conditional operators, 385–386
conditional processing, 357–358
confidence, in market-basket analysis, 564,
565, 589
confidence interval, 198
confusion matrix, 487, 494–497
Congressional Voting Records data set, 525–
528, 526f
consequent product, 563, 564, 589
conviction, in market-basket analysis, 564,
566, 590
“correct” chart, 105
CORREL function, 188–190, 190f
correlation charts, 4, 188, 190f
bubble chart, 137–138, 138f
scatter chart, 136–137, 137f
correlation, data-set, 576, 577f, 578, 579f
CouchDB, 327–330, 328f–331f
COUNT function, 184, 185f
covariance, 190, 192f
COVARIANCE.S and COVARIANCE.P
function, 190–191, 192f
cross join, 261, 261f
crow’s foot diagram, 79, 80f, 81f
CRUD operation, 214, 215t, 318, 318t
CURDATE function, 420
Cypher, 89
D
dashboard charts, 4, 5f, 6, 31
calendar chart, 139–141, 140f
candlestick chart, 141–142, 141f
gauge chart, 138–139, 139f
data accuracy, 2, 19, 42, 412, 419, 632
data analysis, Excel
forecast sheet, 198–200, 199f–201f
Goal Seek dialog box, 197f
pivot tables, 200–208, 202f–208f
what-if processing, 196–198, 197f, 198f
data anomalies, 77, 99
data buckets, 630, 631f
data centers
Facebook, 613, 614f
in HDFS clusters, 614, 615f
data cleansing, 19, 411–444
correcting errant data, 426–427
data governance, 427–430
data quality assessment framework, 436,
437f
data validation (See data validation
techniques)
data-quality KPIs, 438f
definition of, 411
deleting errant data, 426
ETL process, 430, 430f
OpenRefine, 439–443, 439f–443f
data completeness, 412, 414, 436, 632
data conformity, 2, 19, 42, 412, 632
data consistency, 2, 19, 42, 423, 436, 632.
See also data inconsistences
data controls and logging, 634
data custodian, 429, 630
data governance
data lifecycle and, 429, 430f
definition of, 427
DGI Data Governance Framework, 427,
428f
operations, 427, 428f
stakeholders, owners, stewards, and
custodians in, 429
data inconsistences, 423–425, 424f, 425f
data integrity, 436
data lakes, 76, 95, 620
data lifecycle, 429, 430f
data marts, 78, 93
data mining
definition of, 530
future of, 31
text and photo, 19–20
data munging, 430
data owner, 429, 629, 630
data quality, 436, 444
accuracy, 2, 19, 42, 412, 419, 632
attribute measurements, 413, 414t
conformity, 2, 19, 42, 412, 632
consistency, 2, 19, 42, 423, 436, 632
definition of, 2
improving, 412, 414f, 444
key performance indicators, 438f
multiple data sources vs., 414, 415f
understanding, 412
data quality assessment framework (DQAF),
436, 437f
Data Quality Services (DQS), 431f
data science, 4
data sets, 18, 18f, 19f. See also specific data
sets
data sources, 631, 632f
data stakeholder, 429, 630
data steward, 429
data timeliness, 436
data toolsets, 633
data validation techniques, 414–426
data inconsistences, 423–425, 424f, 425f
duplicate values, 417–418, 418f, 419f
field’s mean and standard deviation, 422–
423, 422f, 423f
leading and trailing spaces, 425, 426f
nonexistent records, fields, and null
values, 415–416, 416f, 417f
synchronizing data time stamps, 425,
426, 426f
value-pair consistency, 420, 421f
value-range compliance, 419–420, 419f,
420f
data validity, 436
data visualization, 2, 4–11, 5f
best practices, 104
category-based comparison charts
bar and column charts, 120–121,
120f, 121f
clustered bar and column charts,
121–122, 122f
combo chart, 122, 123f
diff chart, 122–125, 124f
radar chart, 122, 123f
waterfall chart, 125, 125f
composition charts
donut chart, 129–130, 130f
funnel chart, 135–136, 135f
pie chart, 126–129, 126f–129f
pyramid chart, 136, 136f
stacked area chart, 132–133, 133f
stacked bar and column charts, 130–
132, 132f
sunburst chart, 130, 131f
treemap chart, 134, 134f
“correct” chart, 105
correlation charts
bubble chart, 137–138, 138f
scatter chart, 136–137, 137f
dashboard charts
calendar chart, 139–141, 140f
candlestick chart, 141–142, 141f
gauge chart, 138–139, 139f
distribution charts
box and whisker charts, 145–146
histogram chart, 142–145, 142f–145f
geocharts, 146–149, 148f
Google Charts, 6, 7f, 41
plotting data, 149
Python script, 8–9, 9f
Tableau dashboard, 106, 152–156, 156f
Tableau website, 150–152, 151f
time-based comparison charts
area chart, 118–119, 118f, 119f
dual y-axis chart, 116–118, 117f
line chart, 111–112, 112f
multiline charts, 113, 113f
smoothing line chart data, 115–116,
115f
top x-axis line chart, 113–115, 113f
web solutions
real-time dynamic chart, 111, 111f
static chart, 107–111, 107f
data volume, 613
data warehouses, 76
analytics data into, 91f
data lakes, 95
data marts, 93
Lucidchart drawing environment, 96–98,
96f
normalization process, 93
relational schema, 93–95
snowflake schema, 95, 95f
Snowflake website, 92, 92f
star schema, 94
data wrangling, 19, 412, 430
in Python, 431–435, 432f
tools, 430
data-analytics project, 628–629, 643
budget factors, 634, 634f
data governance board, 629–631
data sources, 631, 632f
data toolsets, 633
defining the question, 629, 630, 630f
integrating data solutions, 629, 630f
Jupyter Notebook documentation, 638–
642, 639f–642f
organizational data buckets, 630, 631f
PDCA methodology, 629, 637, 637f
quality expectations, 632
ROI analysis, 629, 636
security considerations, 633
self-service solutions, 637, 638
SMART approach, 634
SWOT analysis, 629, 637, 638f
visual representation, 636, 636f
database administrator, 633
database as a service (DBaaS), 2
database developers, 411
database management systems (DBMS), 76,
99
databases
entity relationship diagrams, 79
graph, 89–90, 90f
models
conceptual database model, 80–81,
81f
logical database model, 81–82, 81f
physical database model, 82–84, 83f
normalization, 85–87
NoSQL, 88
object-oriented, 88–89
relational, 87–88, 88f, 89f
role of, 17–18
schemas, 85
data-governance board, 427, 444, 629–631
data-set correlation, 576, 577f, 578, 579f
data-set summaries, 574–576, 578
DBSCAN. See Density-Based Spatial
Clustering of Applications with Noise
(DBSCAN)
decision tree, 487, 511
decision tree regression, 543–546, 544f
decision-tree classifier, 487, 511–514, 511f,
515f
deep machine learning, 45, 47
degree of a polynomial, 550
DELETE operation, 274–275, 275f
dendrograms, 456, 457f, 459f
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN), 447, 448t,
484
border point, core point and noise, 462,
464f
identifying outliers using, 472–473
using Python, 463, 464, 465f
using R, 464, 465, 466f
dependent variable, 506
descriptive analytics, 25, 530, 531, 561
DEVSQ function, 188, 189f
DGI Data Governance Framework, 427, 428f
DiapersAndBeer data set, 40, 40f
diff chart, 122–125, 124f
dimensionality reduction
linear discriminant analysis, 63–64
primary component analysis, 60–63
distributed database system
horizontal scaling, 323, 324f
vertical scaling, 323, 324f
distribution charts, 6
box and whisker charts, 145–146
histogram chart, 142–145, 142f–145f
DLib modules, 601
donut chart, 129–130, 130f
Dow Jones Stocks data, 56–57, 56f–57f
DQAF. See data quality assessment
framework (DQAF)
DQS. See Data Quality Services (DQS)
DROP operation, 275–276, 276f
dual y-axis chart, 116–118, 117f
duplicate values, testing for, 417–418, 418f,
419f
E
e-commerce system, 78, 78f, 80, 81f
Chen diagram, 82f
conceptual view of, 80, 81f
fact table, 94f
logical view of, 81–82, 81f, 82f
physical model of, 82–84, 83f
“elbow method,” for clustering, 449, 449f
enterprise data ecosystem, 629, 630f
entity relationship diagram (ERD), 76, 78, 79
Chen diagram, 79, 79f, 80f, 82
crow’s foot, 79, 80f, 81f
labeling relationships within, 79f
primary- and foreign-key fields, 84, 84f
school ERD, entities, 98f
for school’s registration system, 97f
errors and inconsistences, 411. See also data
cleansing
ETL process. See extract, transform, and
load (ETL) process
Euclidean distances, 448f, 490
exabyte (EB), 612, 612t
Excel
charting data, 164–167
concepts of, 159–160
conditional formatting, 167–169
data analysis
forecast sheet, 198–200, 199f–201f
Goal Seek dialog box, 197f
pivot tables, 200–208, 202f–208f
what-if processing, 196–198, 197f,
198f
file formats
comma-separated value files, 170,
170f
JavaScript object notation files, 172–
174, 173f
markup language files, 172, 172f
open document specification files,
170–171, 171f
portable document format files, 171–
172, 171f
.xlsx file extension, 174, 174f
filtering data, 161–164
sorting data values, 160–161
statistical functions
AVEDEV function, 188, 189f
AVERAGE function, 175, 176f
AVERAGEIF and AVERAGEIFS
function, 177–178, 177f
CORREL function, 188–190, 190f
COUNT function, 184, 185f
COVARIANCE.S and
COVARIANCE.P function, 190, 192f
DEVSQ function, 188, 189f
FREQUENCY function, 194–196,
196f
GEOMEAN function, 179–180, 180t
HARMEAN function, 179–180, 180t
LARGE and SMALL functions, 184,
184f
LINEST function, 191, 194f
LOGEST function, 193–194, 195f
MAX and MIN functions, 181–182,
182f
MEDIAN function, 180–181, 181f
QUARTILE function, 183, 183f
SLOPE and INTERCEPT functions,
191, 193f
TRIMMEAN function, 178–179, 179f
VAR and STDEV function, 185–187, 186f–
187f, 186t–187t
executing queries
command-line shell interface
SHOW DATABASES query, 218,
219f
using Windows, 218, 218f
workbench
built-in databases, 220, 221f
chapter’s creation, 223, 224f
companion databases, 224, 224f
lightning-bolt icon, 219, 220f
SHOW DATABASES query, 219
using Windows, 219, 219f
export operations
comma-separated file, 264, 264f
MongoDB
index, 322, 322f
noSQL databases, 322, 322f
spreadsheet program, 264
Extensible Markup Language (XML), 294, 620
extract, transform, and load (ETL) process,
266, 430, 430f
F
Facebook, 613, 614f
facial recognition, 592, 601–602, 603f
definition, 601
OpenCV, 606–607, 606f, 607f
Python script, 602, 607
fashion-mnist data set, 71f
file formats, in Excel
comma-separated value files, 170, 170f
JavaScript object notation files, 172–174,
173f
markup language files, 172, 172f
open document specification files, 170–
171, 171f
portable document format files, 171–172,
171f
.xlsx file extension, 174, 174f
file systems, 611, 614, 626. See also Hadoop
Distributed File System (HDFS)
filters, 303
filtering data, in Excel, 161–164, 162f–164f
Firebird, 222t
first-normal form (1NF), 77, 85–86, 85f
forecast worksheet, 198, 200f
foreign key, 77, 83–84
FP-Growth data association, 571–574, 572f–
575f, 577f
framework
definition of, 427
DGI Data Governance Framework, 427,
428f
Resource Description Framework, 89
FREQUENCY function, 194–196, 196f
funnel chart, 135–136, 135f
G
gauge chart, 138–139, 139f
geocharts, 146–149, 148f
GEOMEAN function, 179–180, 180t
geometric mean, 179–180, 180t
gigabyte (GB), 612, 612t
Goal Seek tool, 196–198, 197f
Google, 613
Google Charts, 6, 7f, 41, 106–107, 106f–108f
Google Colab, 69
Google “crash course” in machine learning,
68, 68f
governance, 427. See also data governance
grade point average (GPA), 229
graph database, 77, 89–90, 90f
graphical user interface (GUI), 216
MongoDB
arithmetic functions, 309, 310t
arithmetic operator, 309, 310t
compass GUI, 301, 302f
grouping operations, 311–314, 312f–
314f
insert method, 314–315, 315f
limit method, 309, 311f
logical operator, 306–307, 307f–308f,
307t
relational operators, 304–306, 305t,
306f
sort method, 308, 309f
specific collections, 302–304, 303f–
304f
third-party vendors, 302, 303f
Groceries data set, 568–571, 569f
GROUP BY clause
data groups, 250, 251f
displaying records, 250, 251f
ROLLUPs
COALESCE statement, 253, 254f
grand total, 252, 253f
product summary, 254, 254f
subtotals, 252, 253f
two fields, 255, 255f
syntax error message, 251, 251f
H
Hadoop, 611, 626
“bring the process to the data,” 614, 616,
617f
example of, 617, 619
MapReduce process (See MapReduce
process)
Hadoop Distributed File System (HDFS)
clusters in, 611, 614
definition, 611, 626
master/slave model, 611, 614, 616f, 626
use of different data centers, 614, 615f
handwriting classification, 604–606, 604f, 605f
hard clustering, 446, 447
HARMEAN function, 179–180, 180t
harmonic mean, 179–180, 180t
HBase, 333–334, 334f, 625, 625f
HDFS. See Hadoop Distributed File System
(HDFS)
Heart Disease data set, 523, 524f, 525
hierarchical clustering, 22, 22f, 447, 448t, 484
bottom-up algorithm, 456, 457f
cluster chart, 457, 458f
dendrograms, 456, 457f, 459f
top-down algorithm, 456
using Python, 458–459, 461–462, 462f
using R, 460, 460f
using Solver, 474–478, 475f–478f
histogram chart, 142–145, 142f–145f
horizontal scaling, 323, 324f
hot encoding, 24, 487, 525
HTML file
calendar chart, 140–141
candlestick chart, 141–142
diff chart, 124
dual y-axis chart, 116–118
gauge chart, 138–139
geochart, 148–149
line chart, 114
smoothing line chart data, 115
static bar chart, 108
static pie chart, 107
hyperplane, 487, 517, 517f
hypertext markup language (HTML), 221
I
IBM DB2, 215, 222t
image mining
applications, 601
definition, 592, 601
facial recognition, 601–602, 603f
handwriting classification, 604–606, 604f,
605f
import operations
MongoDB
index, 321, 321f
NoSQL databases, 321, 321f
one table into another, 264–265
spreadsheet program, 264
tab-delimited files, 263, 264f
index
MongoDB
business-intelligence tools, 326–327
distributed database system, 323–
325, 324f
export operations, 322, 322f
import operations, 321, 321f
real-time replications, 325–326,
325f–326f
third-party tools, 322–323, 323f
SQL
performance improvement, 277, 278t
time-consuming process, 277, 278t
inner join, 258, 258f
INSERT operation, 271–272, 272f
insert/update documents, MongoDB
creating collection within database, 317,
318f
CRUD operations, 318, 318t
deleteOne/deleteMany method, 316–317,
317f
dropping collection and database, 317,
318f
Insurance.csv data set, 26
INTERCEPT function, 191, 193f
Internet of Things (IoT), 18, 29, 31, 610
interpreter, 344
INTERSECT operation, 262, 263f
IoT. See Internet of Things (IoT)
IPython
download and installation, 377, 377f
Jupyter Notebook, 377, 377f
Iris data set, 21, 35, 36, 488–490, 488f, 489f
IronPython, 401, 402f
item sets, 567, 568
J
Java Database Connectivity (JDBC), 333
JavaScript Object Notation (JSON), 620
JOIN operation
advantages, 297
component of, 297, 298f
cross join, 261, 261f
inner join, 258, 258f
left join, 258–259, 258f–259f
MySQL documentation, 319, 319f
nested objects, 295
orders and customers tables, 255, 256f
relational database table, 296, 297f
right join, 260, 260f
self-describing objects, 295–296
send and receive data, 292, 293f
storing, 295
temporary tables, 256, 256f
validating JSON content, 296, 297f
Windows Notepad accessory, 296, 296f
JSON files, 172–174, 173f
Jupyter Notebook, 377, 377f
documentation, 638–642, 639f–642f
K
Kaggle website, 18, 19f, 41, 523, 524f
key performance indicators (KPIs), 427
data-quality, 438f
values, 281
kilobyte (KB), 611, 612t
K-means clustering, 21–22, 22f, 447, 484
bent-elbow method, 449, 449f
description of, 448t
formation of clusters, 448, 449f, 450f
steps in, 451
text clustering, 598–601
using Python, 450–452, 452f
using R, 452, 453, 453f
using Solver, 474–478, 475f–478f
K-means++ clustering, 453
description of, 448t
using Python, 453–455
using R, 455–456, 456f
K-nearest-neighbors (KNN) classifier, 24–25,
487
accuracy score, 494–497
assigning data point to class, 490, 490f
Breast Cancer data set, 500–502, 500f
confusion matrix, 494–497
Euclidean distance, 490
for handwritten digits, 605–606
Iris data set, 488–490, 488f, 489f
training and testing data sets, 489
value of K, 493–494, 494f
Wine data set, 497–499, 497f
Zoo data set, 522, 523
K-nearest-neighbors (KNN) regression, 548–
550
KNN classifier. See K-nearest-neighbors
(KNN) classifier
KPIs. See key performance indicators
(KPIs)
L
LARGE function, 184, 184f
LDA. See linear discriminant analysis
(LDA)
leading and trailing spaces, elimination of,
425, 426f
left join, 258–259, 258f–259f
lift, in market-basket analysis, 564, 566, 590
lightweight, JSON, 292, 294
line chart, 111–112, 112f
linear discriminant analysis (LDA), 63–64
linear regression, 531, 561
definition, 532
example of, 533–537, 533f, 535f–537f
linear equation, 532, 532f, 561
multiple regression, 531, 532, 539–543,
562
simple regression, 531, 532, 537–539
LINEST function, 193, 194f
LOGEST function, 193–194, 195f
logical database model, 76, 81–82, 81f
logical operators
AND, 238, 238t
NOT, 238t, 239, 239f
OR, 238, 239, 238t, 239f
Python script, 357–359
R program, 384–385
logistic regression classifier, 487, 506–509
logit, 487, 506
LTRIM function, 425
Lucidchart drawing environment, 96–98, 96f
M
machine learning
algorithms, 47–48
vs. artificial intelligence, 46, 47f
clustering stock data, 56–57, 57f
concept of, 44
data clustering, 20–23, 20f
vs. data mining, 3
deep learning, 45, 47
definition of, 2, 3
dimensionality reduction
linear discriminant analysis, 63–64
primary component analysis, 60–63
fashion-mnist data set, 71f
Google “crash course” in, 68f
mapping categorical variables, 64–68
operation tools, 3
programming data mining, 17
reinforced learning, 45, 47, 48f
scaling data-set values, 57–59
spam data set, 51f, 53–55, 54f
supervised learning, 45, 46, 50f
accuracy model, 52, 53
machine-learning model, 52–53
perform steps, 48
Python program, 51, 52
training and testing data sets, 49–52,
51f
TensorFlow website, 69–71, 70f
for desktop, web, and mobile
solutions, 69f
downloadable software and details,
70f
Google Colab tutorial on, 70f
unsupervised learning, 45, 47
managed database server, 268, 335
managed service provider (MSP), 92, 92f
managed services, 633
mapping categorical variables, 64–68
MapReduce process, 611, 626
example of, 617–619, 618f
using MongoDB, 621–625, 622f
MariaDB, 215
marker, 104, 111
market-basket analysis, 563, 589
markup language files, 172, 172f
master data, 429
master data management (MDM), 436
matplotlib package, 372, 374, 374f
MAX function, 181, 182f
mean
arithmetic, 175, 176f
examining the field’s values, 422–423,
422f, 423f
geometric, 179–180, 180t
MEDIAN function, 180–181, 181f
median value, 180–181, 181f
megabyte (MB), 611, 612t
Microsoft Access, 222t
Microsoft MS SQL Server, 222t
Microsoft Power BI
accuracy value, 285, 285f
canvas, 284, 284f
data source, 282, 282f
data window, 283, 283f
data-quality charts, 285, 285f
data-quality KPIs, 281, 281f
Fields section, 284, 284f
gauge chart, 286, 286f
Modeling tab, 283, 283f
Power Query Editor, 282, 282f
state-of-the-art visualization, 279, 280f
trial version, 280, 280f
MIN function, 181, 182f
MLP classifier. See multilayer perceptron
(MLP) classifier
MLP solution, 509, 510f
Modified National Institute of Standards and Technology (MNIST) data set, 604, 604f
MongoDB
collections, 298, 299f, 300f
command-line shell, 298, 300f
download and installation, 298, 299f
graphical user interface
arithmetic functions, 309, 310t
arithmetic operator, 309, 310t
compass GUI, 301, 302f
grouping operations, 311–314, 312f–
314f
insert method, 314–315, 315f
limit method, 309, 311f
logical operator, 306–307, 307f–308f,
307t
relational operator, 304–306, 305t,
306f
sort method, 308, 309f
specific collection, 302–304, 303f–
304f
third-party vendors, 302, 303f
index
business-intelligence tools, 326–327
distributed database system, 323–
325, 324f
export operations, 322, 322f
import operations, 321, 321f
real-time replications, 325–326,
325f–326f
third-party tools, 322–323, 323f
insert/update documents
creating collection within database,
317, 318f
CRUD operations, 318, 318t
deleteOne/deleteMany method, 316–
317, 317f
dropping collection and database,
317, 318f
managed database server, 335–338,
336f–338f
MapReduce operations, 621–625, 622f
text’s catalog page, 300–301
visualization tool, 338–339, 339f–340f
multiclass classification, 486
multilayer perceptron (MLP) classifier, 487
multiline charts, 113, 113f
multiple linear regression, 531, 532, 539–543,
562
MySQL
cloud-based managed database service,
268, 269f
command-line shell interface
chapter’s creation, 225, 225f
SHOW DATABASES query, 218,
219f
using Windows, 218, 218f
database creation, 268–271, 270t–271t
Downloads page, 215, 216f
implementation of, 215
installation process, 215, 216f
licenses, 215
phpMyAdmin, 267–268, 267f–268f
stores data and client programs, 216,
217f
workbench
built-in databases, 220, 221f
chapter’s creation, 221, 224f
companion databases, 225, 225f
lightning-bolt icon, 219, 220f
SHOW DATABASES query, 219
using Windows, 219, 219f
N
Naïve Bayes classifier, 487, 502–506, 503t
natural language processing, 31, 592
Natural Language Toolkit (NLTK) data set,
597
Neo4j website, 89, 90f
nested query, 276–277
network port, 212
neural network(s), 487
classification using, 509–511
MLP solution, 509, 510f
perceptrons, 487, 509, 510f
Neutrino API, 424
NLTK data set. See Natural Language
Toolkit (NLTK) data set
non-numeric data, filtering, 525–528
normalization, 85–87, 93
NoSQL databases, 75, 77, 87–89, 99
Amazon DynamoDB, 332
Cassandra, 333, 333f, 334f
CouchDB, 327–330, 328f–331f
JSON
advantages, 297
component of, 297, 298f
MySQL documentation, 319, 319f
nested objects, 295
relational database table, 296, 297f
self-describing objects, 295–296
send and receive data, 292, 293f
storing, 295
validating content, 296, 297f
Windows Notepad accessory, 296,
296f
MongoDB
arithmetic functions, 309, 310t
arithmetic operator, 309, 310t
business-intelligence tools, 326–327
collections, 298, 299f, 300f
command-line shell, 298, 300f
compass GUI, 301, 302f
create a collection, 317, 318f
CRUD operations, 318, 318t
deleteOne/deleteMany method, 316–
317, 317f
distributed database system, 323–
325, 324f
download and installation, 298, 299f
dropping a collection, 317, 318f
export operations, 322, 322f
grouping operations, 311–314, 312f–
314f
import operations, 321, 321f
insert method, 314–315, 315f
limit method, 309, 311f
logical operator, 306–307, 307f–308f,
307t
managed database server, 335–338,
336f–338f
real-time replications, 325–326,
325f–326f
relational operator, 304–306, 305t,
306f
sort method, 308, 309f
specific collections, 302–304, 303f–
304f
text’s catalog page, 300–301
third-party tools, 322–323, 323f
third-party vendors, 302, 303f
visualization tool, 338–339, 339f–
340f
Redis, 330–332, 332f, 333f
RocksDB, 335, 335f
null/not-a-number (NaN) values, 2
numerical calculations
matplotlib package, 372, 374, 374f
numpy package, 374–375, 375f
plot.ly package, 372, 373f
numpy package, 374–375, 375f
O
object-oriented database, 77, 88–89
object-oriented database management
systems (OODBMSs), 88
Online Analytical Processing (OLAP), 77, 90,
90f, 100
Online Transaction Processing (OLTP), 77,
90, 90f, 100
Open Document Specification (.ods) file
format, 170, 171f
OpenCV, facial recognition, 606–607, 606f,
607f
OpenRefine, data cleansing, 439–443, 439f–
443f
Oracle, 211, 215, 222, 222t, 229, 268, 287
Oracle Warehouse Builder (OWB), 266
Orange, 45
data mining, 579–586, 580f–586f
predictive analysis, 586–589, 587f–589f
outer join, 260
outliers, 471–473
P
pandas, 346
PC hard drive sizes, 609, 610f
PDCA methodology. See Plan, Do, Check,
and Act (PDCA) methodology
perceptrons, 487, 509, 510f
petabyte (PB), 612, 612t
physical database model, 76, 82–84, 83f
pie chart, 126–129, 126f–129f
PIP command, 346
pivot tables
data preparation, 201–202, 202f
filter data using slicers, 207–208, 208f
pivot table creation, 202–205, 204f
pivoting data, 205–207, 207f
Plan, Do, Check, and Act (PDCA)
methodology, 629, 637, 637f
plot.ly package, 372, 373f
plotting data, 149
polynomial, definition of, 550
polynomial regression, 550–553, 551f, 552f
portable document format (.pdf) files, 171–
172, 171f
PostgreSQL, 222t
predictive analysis with Orange, 586–589,
587f–589f
predictive analytics, 33–35, 530–562
definition of, 3
linear regression, 25, 25f
multivariate regression, 25, 26f
Python script, 26–27
predictor variable, 487
prescriptive analytics, 531, 561
primary component analysis (PCA), 60–63
primary key, 77, 83–84
pyramid chart, 136, 136f
Python
association measures, 27–29, 566–570
break statements, 361–362
built-in functions, 363–364, 364f
classification algorithms, 24–25
decision-tree classifier, 512–514
KNN classifier, 490–492, 495–496,
497–501
logistic regression classifier, 507–
508
Naïve Bayes classifier, 503–505
neural network-based classifier, 509–
511
random-forest classifier, 514, 516
SVM classifier, 518–521
clustering algorithms, 21, 22f
handwriting, 604–606
hierarchical clustering, 457–458,
461–462, 462f
K-means clustering, 450–452, 452f
K-means++ clustering, 453–455
text clustering, 598–601
comment symbol, 355
conditional process, 357–358
continue statement, 361–362
creating, 351–352, 352f
data visualization, 8–9, 9f
data wrangling, 431–435, 432f
dataframe object, 369–370, 371f
data-set summaries and correlation, 574–
576
deleting variables, 350–351
download and install, 348, 348f
equals-sign assignment operator, 350
ETL operations, 266
execute, 351–352
facial recognition, 602, 607
group related statements, 356–357
importing packages, 366–368, 367f, 367t
iterative process, 358–360, 361f
launching, 348, 349f
logical operator, 358
machine learning
Breast Cancer data set, 60
clustering stock data, 56–57
linear discriminant analysis, 63–64
mapping categorical variables, 65–
68
primary component analysis, 61–63
scaling data-set values, 58–59
spam data set, 53–55
supervised machine learning, 51, 52
module, 365–366
numerical calculations
matplotlib package, 372, 374, 374f
numpy package, 374–375, 375f
plot.ly package, 372, 373f
numerical data structures, 370, 371, 372
object-oriented programming, 364–365
operator precedence, 353, 354t
PyPI, 368, 368f
regression algorithms
decision tree regression, 543–545
KNN regression, 548–549
linear regression, 536
multiple linear regression, 539–542
polynomial regression, 552, 553
random forest regression, 546–547
simple linear regression, 534, 537–
538
relational operator, 353, 354t
sklearn package, 375–376, 376f
supports functions, 360, 362, 363
syntax requires, 357
ternary operator, 357–358
text processing, 594–597
value list, 354–356, 355f
variables, 355
visual programming, 12
W3Schools, 350
Python Package Index (PyPI), 368, 368f
Q
QUARTILE function, 146, 147f, 183, 183f
R
R program
additional support in, 382, 383f
association measures, 568, 571
break statements, 388
built-in functions, 381, 382f
classification algorithms
decision-tree classifier, 512–513
KNN classifier, 492, 496, 499, 501–
502
logistic regression classifier, 508–
509
Naïve Bayes classifier, 505–506
random-forest classifier, 516–517
clustering algorithms, 23
hierarchical clustering, 460, 460f
K-means clustering, 452, 453, 453f
K-means++ clustering, 455–456,
456f
comments in, 381
common operations, 382, 383
conditional operators, 385–386
creating variables, 380–381, 381t
data visualization, 10–11, 10f, 11f
dataframe objects, 393–395
data-set summaries and correlation, 578
download and installation, 377, 378f
interactive development, 377, 378f
logical operators, 384–385
machine learning, spam data set, 53–55
multidimensional collection, 391–393,
392f
next statements, 388
object-oriented programming, 398, 399
one-dimensional collection, 389–390
own functions, 395–397
packages, 397–398, 399f
regression algorithms
decision tree regression, 545–546
KNN regression, 549–550
multiple linear regression, 540, 541,
543
random forest regression, 548
simple linear regression, 538–539
repeating statement, 386–388
sentiment analysis, 593–594
radar chart, 122, 123f
random forest regression, 546–548
random-forest algorithm, 36, 39, 39f
random-forest classifier, 487, 514, 516–517
RapidMiner, 12, 13, 553–554
accuracy of prediction models, 554, 557f,
558, 560f
Auto Model option, 554, 555f, 561
auto model wizard, 13, 14f
data prediction, 13, 15f
FP-Growth data association, 571–574,
572f–575f, 577f
Naïve Bayes predictions, 558, 558f
new project window, 13, 14f
operations within, 554, 556f, 558, 559f
prediction algorithms, 557f, 560f
prediction variables within, 554, 556f,
558, 559f
predictive models in, 13, 16f
random forest prediction, 561f
Start Layout screen, 554, 555f
Titanic data set, 13, 15f, 17f
website, 554, 554f
RBAC. See role-based access controls
(RBAC)
real-time dynamic chart, 111, 111f
real-world data association, 568–571
real-world data sets, 487, 521–529
Redis, 330–332, 332f, 333f
referential integrity, 77
regression, 531, 561
decision tree, 543–546, 544f
K-nearest-neighbors, 548–550
linear, 531, 561
definition, 532
example of, 533–537, 533f, 535f–
537f
linear equation, 532, 532f, 561
multiple regression, 531, 532, 539–
543, 562
simple regression, 531, 532, 537–
539
polynomial, 550–553, 551f, 552f
random forest, 546–548
reinforced machine learning, 45, 47, 48f
relational database, 77, 87–88, 88f, 89f
relational operator
BETWEEN, 233t, 234, 235f
IN, 233t, 234, 235f
greater than or equal, 232, 233t, 234f
less than, 233t, 234, 234f
Python, 353, 354t
R program, 382, 383–384, 383t, 384t
relational schema, 93–95
remote database, 212
remote dictionary server, 330
replication sets, 325, 326f
Resource Description Framework (RDF), 89
return-on-investment (ROI) analysis, 629, 636
right join, 260, 260f
RocksDB, 335, 335f
ROI analysis. See return-on-investment (ROI) analysis
role-based access controls (RBAC), 633
role-based security, 629, 633
ROUND function, 250, 250f
RTRIM function, 425
S
scaling data-set values, 57–59
scatter chart, 136–137, 137f
Seattle.csv data set file, 33–34
second-normal form (2NF), 77, 86–87, 86f
security considerations, 629, 633
SELECT statement, 420
self-describing objects, 292, 295–296
self-service solutions, 637, 638
sentiment analysis, 592–594
servers, big data, 613
simple linear regression, 531, 532, 537–539,
561–562
SimpleKMeans algorithm, 37f
Siri, 592
sklearn package, 375–376, 376f
Slicer button, filter data, 207–208, 208f
SLOPE and INTERCEPT functions, 191, 193f
SMALL function, 184, 184f
SMART approach, 634, 643
smoothing line chart data, 115–116, 115f
snowflake schema, 95, 95f
Social Security number, 277
soft clustering, 446, 447
Solver, 473–474, 474f
hierarchical clustering, 479–483, 479f–
483f
K-means clustering, 474–478, 475f–478f
Sort Options field, 160–161, 160f, 161f
spam data set, 51f, 53–55, 54f
Spark, 620, 620f, 621f
SQL DCL queries, 214, 214t
SQL DDL queries, 213, 213t
SQL DML queries, 214, 214t
SQLite, 222t
stacked area chart, 132–133, 133f
stacked bar and column charts, 130–132,
132f
standard deviation
data-set values, 186, 187f, 187t
examining the field’s values, 422–423,
422f, 423f
star schema, 77, 94
static chart, 106–111, 107f
statistical functions, Excel
AVEDEV function, 188, 189f
AVERAGE function, 175, 176f
AVERAGEIF and AVERAGEIFS
functions, 177–178, 177f
CORREL function, 188–190, 190f
COUNT function, 184, 185f
COVARIANCE.S and COVARIANCE.P
functions, 190, 192f
DEVSQ function, 188, 189f
FREQUENCY function, 194–196, 196f
GEOMEAN function, 179–180, 180t
HARMEAN function, 179–180, 180t
LARGE and SMALL functions, 184, 184f
LINEST function, 191, 194f
LOGEST function, 193–194, 195f
MAX and MIN functions, 181–182, 182f
MEDIAN function, 180–181, 181f
QUARTILE function, 183, 183f
SLOPE and INTERCEPT functions, 191,
193f
TRIMMEAN function, 178–179, 179f
VAR and STDEV functions, 185–187,
186f–187f, 186t–187t
statistics vs. data mining, 4
STDDEV function, 422, 422f
Stocks.json database, 621, 622f
strengths, weaknesses, opportunities, and
threats (SWOT) analysis, 629, 637, 638f
Structured Query Language (SQL), 75, 87
arithmetic operations, 247t, 248–249,
248t–249t, 249f–250f
bitwise operators, 246, 247t
compound operators, 247, 247t
multiplication operator, 245, 245f
built-in aggregate functions
AVG and STDDEV functions, 244,
244f
COUNT function, 244, 244f
SUM, MIN, and MAX functions, 244,
245f
cloud-based databases, 267, 267f
CRUD operation, 214, 215t
database vendors, 222–225, 222t, 223f–
225f
DCL queries, 214, 214t
DDL queries, 213, 213t
DELETE operation, 274–275, 275f
DML queries, 214, 214t
DROP operation, 275–276, 276f
ETL operation, 266
exporting data
comma-separated file, 264, 264f
spreadsheet program, 266
GROUP BY clause
data groups, 250, 251f
displaying records, 250, 251f
ROLLUPs, 252–255, 252f–255f
syntax error message, 251, 251f
importing data
one table into another, 264–265
spreadsheet program, 264
tab-delimited files, 263, 264f
index, 277, 278t
INSERT operation, 271–272, 272f
INTERSECT operation, 262, 263f
JOIN operation
cross join, 261, 261f
inner join, 258, 258f
left join, 258–259, 258f–259f
orders and customers tables, 255,
256f
right join, 260, 260f
temporary table, 256, 256f
LIMIT keyword, 229, 229f
nested query, 276–277
notepad accessory, 223, 223f
retrieving rows (records)
multiple field values, 228, 228f
relational database, 227, 227f
SHOW DATABASES query, 225,
226f
SHOW TABLES query, 226, 226f
single field, 227, 227f
wildcard character, 228, 228f
single query exists, semicolon, 221
sorting records
ASC keyword, 229, 229f
DESC keyword, 229, 230f
multiple fields, 230, 231f
UNION operation, 262, 262f
UPDATE operation, 272–274, 273f–274f
WHERE clause condition
asterisk wildcard, 231, 232f, 233f
AS keyword, 241–243, 242f–243f
LIKE operator, 235, 236t, 236f–237f
logical operators, 237–238, 238t,
238f–239f
relational operators, 232–237, 233t,
234f–235f
remote server, connection
information, 240, 241f
uppercase and lowercase letters,
240, 240f
W3Schools tutorial, 221–222
subquery, 263, 277
sum of squares, 449, 463
summaries, data-set, 574–576, 578
sunburst chart, 130, 131f
supervised learning, 486, 528, 531
supervised machine learning, 23, 36, 41, 45,
50f
accuracy model, 52, 53
machine-learning model, 52–53
Python program, 51, 52
training and testing data sets, 49–52, 51f
support, in market-basket analysis, 564, 565,
589
support vector classifier (SVC). See support
vector machine (SVM) classifier
support vector machine (SVM) classifier, 487,
517–521, 517f–519f
SWOT analysis. See strengths,
weaknesses, opportunities, and threats
(SWOT) analysis
T
tab-delimited files, 263, 264f
Tableau dashboard, 152–156, 156f
Tableau website, 150–152, 151f
table-based relational databases, 18
TensorFlow website
for desktop, web, and mobile solutions,
69, 69f
downloadable software and details, 70f
Fashion-mnist data set, 71f
Google Colab tutorial on, 69, 70f
terabyte (TB), 612, 612t
test data set, 486, 528
testing data set, 36, 48–52, 51f, 56
text classification, 592
text clustering, 592, 598–601
text mining, 592
sentiment analysis, 592–594
text clustering, 598–601
text processing, 594–597
using NLTK data set, 597
text processing, 594–597
text-based categorical data, 525–528
TextBlob library, 594
third-normal form (3NF), 77, 87, 87f
time-based comparison charts, 4
area chart, 118–119, 118f, 119f
dual y-axis chart, 116–118, 117f
line chart, 111–112, 112f
multiline charts, 113, 113f
smoothing line chart data, 115–116, 115f
top x-axis line chart, 113–115, 113f
time-consuming process, 277
timeliness, 436
TIMESTAMPDIFF function, 420
TitanicFields data set, 12
top x-axis line chart, 113–115, 113f
top-down hierarchical clustering algorithm,
456
trailing spaces, elimination of, 425, 426f
training data set, 2, 23, 36, 42, 45, 49–52, 51f,
486, 528
transaction control language (TCL) queries,
214
treemap chart, 134, 134f
TRIMMEAN function, 178–179, 179f
Twitter, 613
U
UCI data-set repository, 18, 18f, 41, 521, 521f,
525, 537
underfitting data, 493
UNION operation, 262, 262f
unsupervised machine learning, 2, 20, 41, 45
UPDATE operation, 272–274, 273f–274f
V
value-pair consistency, 420, 421f
value-range compliance, 419–420, 419f, 420f
VAR and STDEV function, 185–187, 186f–
187f, 186t–187t
variance, data-set values, 185–186, 186f,
186t
vertical scaling, 323
virtual data warehouses, 92
visual programming, 11–15, 564, 590
definition of, 2
environment, 12, 13f
Visual Studio
classification results, 403, 404f
cluster program, 403, 404f
Cookiecutter box, 403, 403f
Create new project option, 401, 402f
IronPython, 400, 400f
Microsoft, 399, 400f
project templates, 401, 401f
regression program, 405, 406f
visualization tools, 103, 103t
V’s of big data, 613
W
waterfall chart, 125, 125f
Weka (Waikato Environment for Knowledge
Analysis)
data association, 40–41, 41f
data classification, 36, 39, 39f
data cleansing, 31–33
data clustering, 35, 36, 37f, 38f
data visualization, 33, 34f, 35f
download and installation, 31, 32f
predictive analytics, 33–35
running, 31, 32f
website, 32f
WHERE clause condition
asterisk wildcard, 231, 232f, 233f
AS keyword, 241–243, 242f–243f
LIKE operator
percent sign wildcard, 235, 236t,
236f, 237f
underscore wildcard, 235, 236t, 237,
237f
logical operators
AND, 238, 238t
NOT, 238t, 239, 239f
OR, 238, 239, 238t, 239f
relational operators
BETWEEN, 233t, 234, 235f
IN, 233t, 234, 235f
greater than or equal, 232, 233t, 234f
less than, 233t, 234, 234f
remote server, connection information,
240, 241f
uppercase and lowercase letters, 240,
240f
wildcards, 228
Wine data set, 497–499, 497f
workbench
built-in databases, 220, 221f
chapter’s creation, 223, 224f
companion database, 225, 225f
lightning-bolt icon, 219, 220f
SHOW DATABASES query, 219
using Windows, 219, 219f
W3Schools
CSS, 221
HTML, 221
JavaScript, 221
SQL tutorial, 221, 222
X
XML. See Extensible Markup Language
(XML)
XML file grouping, 172, 172f–174f
Y
yottabyte (YB), 612, 612t
YouTube, 613
Z
zero-based indexing, 345
zettabyte (ZB), 612, 612t
Zoo data set, 521–523, 522f