Module 1 - SQL For Analytics Introduction

SQL for Analytics
Learn SQL by application! Realistic end-to-end case
studies, examples, and challenges teach you SQL the way it is
meant to be used.

Preface
SQL was initially created to be the language for generating,
manipulating, and retrieving data from relational databases, which
have been around for more than 40 years. Over the past decade or
so, however, other data platforms such as Hadoop, Spark, and
NoSQL have gained a great deal of traction, eating away at the
relational database market. As will be discussed in the last few
chapters of this book, however, the SQL language has been evolving
to facilitate the retrieval of data from various platforms, regardless
of whether the data is stored in tables, documents, or flat files.

SQL is one of the easiest skills to learn and an essential one for every
aspiring data scientist. This course is designed for anyone who may be
experienced with data analysis but new to SQL, experienced with SQL but
new to data analysis, or new to both topics entirely. We learn SQL only for
the purpose of data analysis and exclude concepts that relate to data
engineering and in-depth database management.
MODULE 01

A Little Background
• Introduction to Database
• Relational Database, Primary Key & Foreign Key
• SQL as Part of the Data Analysis Workflow
• Database Data Types
Contents
Introduction to Database
1.1. Data Infrastructure
1.2. Relational Database Systems
1.3. SQL Constraints
PRIMARY KEY Constraint
FOREIGN KEY Constraint
Referencing Columns in Another Table
1.4. Database Structure
1.5. Four Sublanguages of SQL
SQL for Analytics
2.1. What Is Data Analysis?
2.2. SQL as Part of the Data Analysis Workflow
Database Data Types
3.1. Types of Data
1. Structured Versus Unstructured
2. Quantitative Versus Qualitative Data
3. Sparse Data
3.2. Database Data Types
Introduction to Database | Module 1

SECTION 1

Introduction to Database
A database is nothing more than a set of related information. A
telephone book, for example, is a database of the names, phone
numbers, and addresses of all people living in a particular
region. While a telephone book is certainly a universal and
frequently used database, it suffers from the following:

• Finding a person’s telephone number can be time
consuming.
• A telephone book is indexed only by last/first names, so
finding the names of the people living at a particular
address is not practical.
• From the moment the telephone book is printed, the
information becomes less and less accurate.

The same drawbacks attributed to telephone books can be
applied to any manual data storage system. Because a
computerised database system stores data electronically, it is
able to retrieve data more quickly, index data in multiple ways,
and deliver up-to-the-minute information.

1.1. Data Infrastructure


A database is a set of data stored in a computer. This data is
usually structured in a way that makes the data easily
accessible. Databases aren’t the only way data can be stored,
and there is an increasing variety of options for storing data
needed for analysis and powering applications. File storage
systems, NoSQL databases and search-based data stores are
alternative data storage systems that offer low latency for
application development and searching log files. Although not
typically part of the analysis process, they are increasingly part
of organizations’ data infrastructure. NoSQL is a technology that
allows for data modelling that is not strictly relational. It allows
for very low latency storage and retrieval, critical in many online
applications. Examples of these data stores that you might hear
about in your organization are Cassandra, Couchbase,
DynamoDB, Memcached, Giraph, and Neo4j.

1.2. Relational Database Systems


A relational database uses a structure that allows us to identify
and access data in relation to another piece of data in the
database. Often, data in a relational database is organized into
tables. Each row in the table is considered as a record. Every
record is broken down into fields that represent single items of
data describing a specific thing. For example, you can store
information about a collection of book data inside a database.
Information pertaining to the books themselves can be stored in
a table called Books. Each book record can be stored in one table
row with each specific piece of data such as book title, author,
or price, stored into a separate field.
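
The Books example above can be sketched in SQL. This is only an illustration: the column names, sizes, and sample values are assumptions made for this example, not part of the course data.

```sql
-- Hypothetical Books table: one row per book record, one column per field.
CREATE TABLE Books (
    book_id INTEGER PRIMARY KEY,
    title   VARCHAR(200),
    author  VARCHAR(100),
    price   DECIMAL(8, 2)
);

-- Each row in VALUES becomes one record; the values are made up.
INSERT INTO Books (book_id, title, author, price)
VALUES (1, 'Example Title A', 'Author One', 19.99),
       (2, 'Example Title B', 'Author Two', 24.50);

-- Retrieve two fields from every record in the table.
SELECT title, price
FROM Books;
```

Each inserted row is one record, and title, author, and price are its fields.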

[Figure: a database contains one or more tables; a table contains a number of records; a record contains one or more fields (Field 1, Field 2, Field 3, Field 4).]

Databases are usually associated with software that allows for
the data to be updated and queried. The software that manages
the database is called a Relational Database Management
System (RDBMS). These systems make storing data and
returning results easier and more efficient by allowing different
questions and commands to be posed to the database. Popular
RDBMS software includes

When working with databases, we may participate in the design,
maintenance and administration of the database that supplies
data to our website or application. To do this, we need to access
that data and also automate the process so that other users can
retrieve and perhaps even modify data without technical
knowledge. To achieve this, we need to communicate with the
database in a language it can interpret. Structured Query
Language (SQL) allows us to communicate directly with
databases and is thus the subject of this course. SQL is
composed of commands that enable users to create database
and table structures, perform various types of data manipulation
and data administration, and query the database in order to
extract useful information.

🗒️ Is SQL a Programming Language?

SQL isn’t a general-purpose language in the way that C or
Python are. SQL without a database and data in tables is just a
text file. SQL can’t build a website, but it is powerful for
working with data in databases. On a practical level, what
matters most is that SQL can help you get the job of data
analysis done.


1.3. SQL Constraints


In a database table, we can add rules to a column known
as constraints. These rules control the data that can be stored
in a column. For example, if a column has NOT NULL constraint,
it means the column cannot store NULL values. The constraints
used in SQL are:

Constraint     Description
NOT NULL       Values cannot be NULL.
UNIQUE         Values cannot duplicate a value already stored in the column.
PRIMARY KEY    Uniquely identifies each row.
FOREIGN KEY    References a row in another table.
CHECK          Validates a condition for each new value.
DEFAULT        Sets a default value when none is supplied.
CREATE INDEX   Not a constraint in the strict sense; a statement that builds an index to speed up reads.
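
To see several of these rules side by side, here is a minimal sketch. The Employees table, its columns, and the sample values are hypothetical and chosen only for illustration; the syntax follows Postgres and should be similar in most systems.

```sql
-- Hypothetical table combining several constraints.
CREATE TABLE Employees (
    employee_id INTEGER PRIMARY KEY,            -- unique and not NULL
    email       VARCHAR(100) NOT NULL UNIQUE,   -- required, no duplicates
    age         INTEGER CHECK (age >= 18),      -- condition tested on each new value
    country     VARCHAR(50) DEFAULT 'Unknown'   -- used when no value is supplied
);

-- An index is created with a separate statement.
CREATE INDEX idx_employees_country ON Employees (country);

-- country is omitted here, so it falls back to the default 'Unknown'.
INSERT INTO Employees (employee_id, email, age)
VALUES (1, 'a@example.com', 30);

-- Each of the following statements would be rejected by the database:
-- INSERT INTO Employees (employee_id, email, age) VALUES (2, NULL, 25);            -- violates NOT NULL
-- INSERT INTO Employees (employee_id, email, age) VALUES (3, 'a@example.com', 40); -- violates UNIQUE
-- INSERT INTO Employees (employee_id, email, age) VALUES (4, 'b@example.com', 12); -- violates CHECK
```

The rejected statements are left as comments so that the block runs from top to bottom without errors.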

PRIMARY KEY Constraint

In SQL, the PRIMARY KEY constraint is used to uniquely identify
rows. It is a combination of the NOT NULL and UNIQUE constraints,
i.e., it cannot contain duplicate or NULL values.

-- create Colleges table with primary key college_id
CREATE TABLE Colleges (
    college_id INT,
    college_code VARCHAR(20) NOT NULL,
    college_name VARCHAR(50),
    CONSTRAINT CollegePK PRIMARY KEY (college_id)
);

Here, the college_id column is the PRIMARY KEY. This means
that the values of this column must be unique, and it cannot
contain NULL values.

FOREIGN KEY Constraint

The FOREIGN KEY constraint is used to create a relationship
between two tables. A foreign key is defined using the FOREIGN
KEY and REFERENCES keywords.

-- this table doesn’t contain foreign keys
CREATE TABLE Customers (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100),
    age INTEGER
);

-- create another table named Products
-- add foreign key to customer_id column
-- the foreign key references the id column of the Customers table
CREATE TABLE Products (
    customer_id INTEGER,
    name VARCHAR(100),
    FOREIGN KEY (customer_id)
        REFERENCES Customers(id)
);

Here, the customer_id column in the Products table references the
id column in the Customers table.

Referencing Columns in Another Table

The FOREIGN KEY constraint in SQL establishes a relationship
between two tables by linking columns in one table to those in
another. For example:

[Figure: the customer_id column of the Orders table pointing to the customer_id column of the Customers table.]

Here, the customer_id field in the Orders table is a FOREIGN
KEY that refers to the customer_id field in the Customers table.
This means that every value of customer_id in the Orders table
must be a value from the customer_id column of the
Customers table.

🗒️ Note: In most database systems, a foreign key can reference
any column (or set of columns) in the parent table that has a
PRIMARY KEY or UNIQUE constraint. In practice, a foreign key
usually references the primary key of the parent table.
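
The Orders and Customers relationship described above can be sketched as follows. Only the customer_id columns come from the text; the other columns and the sample values are assumptions for illustration. The sketch assumes a system that enforces foreign keys, as Postgres does.

```sql
-- Parent table: each customer is identified by customer_id.
CREATE TABLE Customers (
    customer_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(100)
);

-- Child table: customer_id must match an existing row in Customers.
CREATE TABLE Orders (
    order_id    INTEGER PRIMARY KEY,
    product     VARCHAR(100),
    customer_id INTEGER,
    FOREIGN KEY (customer_id) REFERENCES Customers (customer_id)
);

INSERT INTO Customers (customer_id, first_name) VALUES (1, 'Asha');

-- Accepted: customer 1 exists in the parent table.
INSERT INTO Orders (order_id, product, customer_id) VALUES (100, 'Keyboard', 1);

-- Would be rejected: there is no customer with customer_id 99.
-- INSERT INTO Orders (order_id, product, customer_id) VALUES (101, 'Mouse', 99);
```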


1.4. Database Structure


SQL is used to access, manipulate, and retrieve data from objects in
a database. Databases can have one or more schemas, which provide
the organization and structure and contain other objects. Within a
schema, the objects most commonly used in data analysis are tables,
views, and functions. Tables contain fields, which hold the data.
Tables may have one or more indexes; an index is a special kind of
data structure that allows data to be retrieved more efficiently.
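
As a rough sketch of how these objects relate, the following creates a schema containing a table, an index, and a view. The names analytics, sales, and daily_sales are hypothetical, and the syntax assumes Postgres.

```sql
-- A schema groups related objects.
CREATE SCHEMA analytics;

-- A table inside the schema; its fields hold the data.
CREATE TABLE analytics.sales (
    sale_id   INTEGER PRIMARY KEY,
    sale_date DATE,
    amount    DECIMAL(10, 2)
);

-- An index on sale_date can make lookups by date more efficient.
CREATE INDEX idx_sales_date ON analytics.sales (sale_date);

-- A view stores a query under a name so it can be queried like a table.
CREATE VIEW analytics.daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM analytics.sales
GROUP BY sale_date;

-- Objects are referenced as schema_name.object_name.
SELECT sale_date, total_amount
FROM analytics.daily_sales;
```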

1.5. Four Sublanguages of SQL


To communicate with databases, SQL has four sublanguages for
tackling different jobs, and these are mostly standard across
database types.

1. DQL, or data query language, is what this course is mainly
about. It’s used for querying data, which you can think of
as using code to ask questions of a database. The central
DQL command is SELECT, used together with clauses such as
FROM, WHERE, and JOIN.
SQL queries can access a single table (or view), can
combine data from multiple tables through the use of
joins, and can also query across multiple schemas in the
same database.
2. DDL, or data definition language, is used to create and
modify tables, views, users, and other objects in the
database. It affects the structure but not the contents.
There are three common commands: CREATE, ALTER, and
DROP. CREATE is used to make new objects. ALTER
changes the structure of an object, such as by adding a
column to a table. DROP deletes the entire object and its
structure.
3. DCL, or data control language, is used for access control.
Commands include GRANT and REVOKE, which give
permission and remove permission, respectively. In an
analysis context, GRANT might be needed to allow a
colleague to query a table you created. You might also
encounter such a command when someone has told you
a table exists in the database but you can’t see it;
permissions might need to be GRANTed to your user.
4. DML, or data manipulation language, is used to act on the
data itself. The commands are INSERT, UPDATE, and
DELETE. INSERT adds new records and is essentially the
“load” step in extract, transform, load (ETL). UPDATE
changes values in a field, and DELETE removes rows.
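
The four sublanguages can be seen together in one short script. This is a sketch only: the page_visits table and the analyst_role role are hypothetical, and the GRANT and REVOKE statements assume that such a role already exists in the database.

```sql
-- DDL: create an object, then change its structure.
CREATE TABLE page_visits (
    visit_id INTEGER PRIMARY KEY,
    page     VARCHAR(100),
    seconds  INTEGER
);
ALTER TABLE page_visits ADD COLUMN device VARCHAR(20);

-- DML: act on the data itself.
INSERT INTO page_visits (visit_id, page, seconds, device)
VALUES (1, 'home', 12, 'mobile'),
       (2, 'pricing', 45, 'desktop');
UPDATE page_visits SET seconds = 50 WHERE visit_id = 2;
DELETE FROM page_visits WHERE visit_id = 1;

-- DQL: ask a question of the data.
SELECT page, seconds
FROM page_visits
WHERE seconds > 30;

-- DCL: give and remove permission (assumes the role analyst_role exists).
GRANT SELECT ON page_visits TO analyst_role;
REVOKE SELECT ON page_visits FROM analyst_role;

-- DDL again: remove the object and its structure.
DROP TABLE page_visits;
```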

SQL for Analytics | Module 1

SECTION 2

SQL for Analytics


Before we start talking to the database, we’ll begin with a
discussion of what data analysis is and then move on to SQL:
what it is, why it’s so popular, and how it fits into data
analysis.

2.1. What Is Data Analysis?


Data analysis is part data discovery, part data interpretation, and
part data communication. Very often the purpose of data
analysis is to improve decision-making, by humans and
increasingly by machines through automation.

Mining historical data helps us understand the characteristics
and behaviour of customers, suppliers, and processes.
Historical data can help us develop informed estimates and
predicted ranges of outcomes, which will sometimes be wrong
but quite often will be right. Past data can point out gaps,
weaknesses, and opportunities. It allows organizations to
optimize, save money, and reduce risk and fraud. It can also help
organizations find opportunity and it can become the building
blocks of new products that delight customers.

2.2. SQL as Part of the Data Analysis Workflow

An analysis workflow is the series of steps that an analyst
follows to achieve the desired outcome. It always starts with a
question and ends with a presentation or visual dashboard that
communicates the outcome of the analysis to stakeholders.

1. The first step of the analysis workflow is ‘Framing the Question’,
which may be about how many new customers have been
acquired, how sales are trending, or why some users stick
around for a long time while others try a service and never
return.
2. Once the question is framed, we consider where the data
originated. Data is generated by ‘Source Systems’, a
term that includes any human or machine process that
generates data of interest. Data can be generated by
people by hand, such as when someone fills out a form or
takes notes during a doctor’s visit. Data can also be
machine-generated, such as when an application
database records a purchase, an event-streaming system
records a website click or a marketing management tool
records an email open.
3. The next step is moving the data and storing it in a
database for analysis. I will use the terms ‘Data
Warehouse’, which is a database that consolidates data
from across an organization into a central repository, and
data store, which refers to any type of data storage
system that can be queried.
Usually, a person or team is responsible for getting data
into the data warehouse. This process is called ETL
(Extract, Transform, and Load). Extract pulls the data
from the source system. Transform optionally changes
the structure of the data, performs data quality cleaning,
or aggregates the data. Load puts the data into the
database. You might also hear the terms source and
target in the context of ETL. The source is where the data
comes from, and the target is the destination, i.e., the
database and the tables within it.
4. Once the data is in a database, the next step is
‘Performing Queries and Analysis’. In this step, SQL is
applied to explore, profile, clean, shape, and analyze the
data. Exploring the data involves becoming familiar with
the topic, where the data was generated, and the
database tables in which it is stored. Profiling involves
checking the unique values and distribution of records in
the data set. Cleaning involves fixing incorrect or
incomplete data, adding categorization and flags, and
handling null values. Shaping is the process of arranging
the data into the rows and columns needed in the result
set. Finally, analysing the data involves reviewing the
output for trends, conclusions, and insights.
5. ‘Presentation of the Data’ into a final output form is the
last step in the overall workflow. Businesspeople won’t
appreciate receiving a file of SQL code; they expect you to
present graphs, charts, and insights. Communication is
key to having an impact with analysis, and for that, we
need a way to share the results with other people.
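
To make step 4 more concrete, here is a small sketch of profiling, cleaning, and shaping on a hypothetical signups table. The table, its columns, and the sample rows are invented for illustration; the INSERT stands in for the load step of ETL.

```sql
-- Hypothetical target table, filled by a stand-in for the ETL load step.
CREATE TABLE signups (
    signup_id   INTEGER PRIMARY KEY,
    signup_date DATE,
    channel     VARCHAR(30),
    plan_type   VARCHAR(30)
);

INSERT INTO signups (signup_id, signup_date, channel, plan_type)
VALUES (1, '2024-01-05', 'email',  'free'),
       (2, '2024-01-05', NULL,     'paid'),
       (3, '2024-01-06', 'search', 'free');

-- Profiling: check the unique values of a field and how records are distributed.
SELECT channel, COUNT(*) AS records
FROM signups
GROUP BY channel;

-- Cleaning: handle null values and add a flag.
SELECT signup_id,
       COALESCE(channel, 'unknown') AS channel,
       CASE WHEN plan_type = 'paid' THEN 1 ELSE 0 END AS is_paid
FROM signups;

-- Shaping: arrange the data into the rows and columns needed in the result set,
-- here one row per day with the number of new signups.
SELECT signup_date,
       COUNT(*) AS new_signups,
       SUM(CASE WHEN plan_type = 'paid' THEN 1 ELSE 0 END) AS paid_signups
FROM signups
GROUP BY signup_date
ORDER BY signup_date;
```

The last query answers a framed question such as "how many new customers have been acquired per day", and its output would then be presented as a chart or dashboard in step 5.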

Module 1 | Database Data Types

SECTION 3

Database Data Types


By one often-quoted estimate, data scientists spend about 60% of
their time cleaning and organizing
data in order to prepare it for analysis or modelling work.
Preparing data is such a common task that terms have sprung up
to describe it, such as data munging, data wrangling, and data
prep. Data preparation is easier when a data set has a data
dictionary, a document or repository that has clear descriptions
of the fields, possible values, how the data was collected, and
how it relates to other data. Unfortunately, this is frequently not
the case. Documentation often isn’t prioritized, even by people
who see its value, or it becomes out-of-date as new fields and
tables are added or the way data is populated changes. Even
when a data dictionary exists, you will still likely need to do data
prep work as part of the analysis.

3.1. Types of Data


Data is the foundation of analysis, and all data has a database
data type and also belongs to one or more categories of data.
Having a firm grasp of the many forms data can take will help you
be a more effective data analyst.

1. Structured Versus Unstructured


Data is often described as structured or unstructured. Most
databases were designed to handle structured data, where each
attribute is stored in a column, and instances of each entity are
represented as rows. For example, an address table might have
fields for street address, city, state, and postal code. Each row
would hold a particular customer’s address. Each field has a
data type and allows only data of that type to be entered.
Structured data is easy to query with SQL.
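
The address example might look like this in SQL. The table name, column sizes, and the sample row are assumptions for illustration.

```sql
-- Each attribute is a column with a data type; each row is one customer's address.
CREATE TABLE customer_addresses (
    customer_id    INTEGER PRIMARY KEY,
    street_address VARCHAR(200),
    city           VARCHAR(100),
    state          CHAR(2),
    postal_code    VARCHAR(10)
);

-- Made-up sample row.
INSERT INTO customer_addresses (customer_id, street_address, city, state, postal_code)
VALUES (1, '123 Example St', 'Springfield', 'IL', '62701');

-- Because the structure is known in advance, questions are easy to express.
SELECT city, COUNT(*) AS customers
FROM customer_addresses
WHERE state = 'IL'
GROUP BY city;
```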

Unstructured data is the opposite of structured data. There is
no predetermined structure, data model, or data type.
Unstructured data is often the “everything else” that isn’t
database data. Documents, emails, and web pages are
unstructured. They don’t fit into the traditional data types, and
thus they are more difficult for relational databases to store
efficiently and for SQL to query.

2. Quantitative Versus Qualitative Data

Quantitative data is numeric. It comes with numeric information
such as price, quantity, or visit duration. Counts, sums,
averages, or other numeric functions are applied to the data.
Qualitative data is usually text-based. Temperature and
humidity levels are quantitative, while descriptors like “hot and
humid” are qualitative. The price a customer paid for a product
is quantitative; whether they like or dislike it is qualitative.
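
A short sketch of the difference, using a hypothetical product_reviews table with made-up values: numeric functions are applied to the quantitative field, while the qualitative field is typically used for grouping and counting.

```sql
CREATE TABLE product_reviews (
    review_id  INTEGER PRIMARY KEY,
    price_paid DECIMAL(8, 2),   -- quantitative
    sentiment  VARCHAR(10)      -- qualitative: 'like' or 'dislike'
);

INSERT INTO product_reviews (review_id, price_paid, sentiment)
VALUES (1, 10.00, 'like'),
       (2, 20.00, 'like'),
       (3, 30.00, 'dislike');

-- Quantitative data: sums and averages make sense.
SELECT SUM(price_paid) AS total_paid,
       AVG(price_paid) AS average_paid
FROM product_reviews;

-- Qualitative data: count how often each category occurs.
SELECT sentiment, COUNT(*) AS reviews
FROM product_reviews
GROUP BY sentiment;
```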

3. Sparse Data

Sparse data occurs when there is a small amount of information
within a larger set of empty or unimportant information. Sparse
data might show up as many nulls and only a few values in a
particular column. JSON is one approach that has been
developed to deal with sparse data from a writing and storage
perspective, as it stores only the data that is present and omits
the rest. This is in contrast to a row-store database, which has to
reserve space for a field even if there is no value in it.
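
One simple way to spot a sparse column is to compare the total row count with the count of non-null values. The user_profiles table and its values below are hypothetical.

```sql
CREATE TABLE user_profiles (
    user_id       INTEGER PRIMARY KEY,
    referral_code VARCHAR(20)   -- filled in for only a few users
);

INSERT INTO user_profiles (user_id, referral_code)
VALUES (1, NULL),
       (2, NULL),
       (3, 'FRIEND10'),
       (4, NULL);

-- COUNT(*) counts every row; COUNT(column) counts only non-null values.
-- With the rows above, total_rows is 4 and filled_values is 1.
SELECT COUNT(*)             AS total_rows,
       COUNT(referral_code) AS filled_values
FROM user_profiles;
```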

3.2. Database Data Types


Fields in database tables all have defined data types. You don’t
necessarily need to be an expert on the nuances of data types to
be good at analysis, but later in the course, we’ll encounter
situations in which considering the data type is important, so this
section will cover the basics. These are based on Postgres but
are similar across most major database types.

String data types are the most versatile. These can hold letters,
numbers, and special characters, including unprintable
characters like tabs and newlines. String fields can be defined to
hold a fixed or variable number of characters. A CHAR field could
be defined to hold exactly two characters for, say, US state
abbreviations, whereas a field storing the full names of
states would need to be a VARCHAR to allow a variable number
of characters.
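
The state example can be sketched as follows. The states table and the VARCHAR length of 50 are assumptions for illustration.

```sql
CREATE TABLE states (
    state_code CHAR(2),       -- fixed length: always two characters
    state_name VARCHAR(50)    -- variable length: up to 50 characters
);

INSERT INTO states (state_code, state_name)
VALUES ('NY', 'New York'),
       ('CA', 'California');

-- String functions such as LENGTH can be applied to string fields.
SELECT state_code,
       state_name,
       LENGTH(state_name) AS name_length
FROM states;
```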

Numeric data types are all the ones that store numbers, both
positive and negative. Mathematical functions and operators
can be applied to numeric fields. Numeric data types include the
INT types as well as FLOAT, DOUBLE (DOUBLE PRECISION in
Postgres), and DECIMAL types that allow decimal places. Integer
data types are often used because they use less memory than
their decimal counterparts.
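
A brief sketch of numeric fields in use, with a hypothetical order_lines table and made-up values: an integer quantity, a decimal price, and a few operators and functions applied to them.

```sql
CREATE TABLE order_lines (
    line_id    INTEGER PRIMARY KEY,
    quantity   INTEGER,          -- whole numbers only
    unit_price DECIMAL(10, 2)    -- allows two decimal places
);

INSERT INTO order_lines (line_id, quantity, unit_price)
VALUES (1, 3, 9.99),
       (2, 1, 120.00);

-- Operators can be applied to numeric fields.
SELECT line_id,
       quantity * unit_price AS line_total
FROM order_lines;

-- So can aggregate and mathematical functions.
SELECT SUM(quantity)             AS units_sold,
       ROUND(AVG(unit_price), 2) AS average_unit_price
FROM order_lines;
```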


The logical data type is called BOOLEAN. It has values of TRUE
and FALSE and is an efficient way to store information where
these options are appropriate. Operations that compare two
fields return a BOOLEAN value as a result. This data type is often
used to create flags, that is, fields that summarize the presence
or absence of a property in the data.
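
The following sketch shows a stored flag and a comparison that returns a BOOLEAN. The subscriptions table and its values are hypothetical, and the syntax assumes Postgres; some other systems lack a BOOLEAN type or do not allow a comparison directly in the SELECT list.

```sql
CREATE TABLE subscriptions (
    subscription_id INTEGER PRIMARY KEY,
    monthly_fee     DECIMAL(8, 2),
    is_active       BOOLEAN        -- a stored flag
);

INSERT INTO subscriptions (subscription_id, monthly_fee, is_active)
VALUES (1, 0.00,  TRUE),
       (2, 15.00, TRUE),
       (3, 15.00, FALSE);

-- The comparison monthly_fee > 0 returns a BOOLEAN for each row,
-- and the stored flag can be used directly as a filter.
SELECT subscription_id,
       monthly_fee > 0 AS is_paid_plan
FROM subscriptions
WHERE is_active;
```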

The datetime types include DATE, TIMESTAMP, and TIME. Date
and time data should be stored in a field of one of these database
types whenever possible since SQL has a number of useful
functions that operate on them. Timestamps and dates are very
common in databases and are critical to many types of analysis,
particularly time series analysis and cohort analysis.
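
As an illustration of why the datetime types are useful, here are a few date functions applied to a hypothetical events table with made-up timestamps. DATE_TRUNC is Postgres syntax; other databases offer similar functions under different names.

```sql
CREATE TABLE events (
    event_id    INTEGER PRIMARY KEY,
    occurred_at TIMESTAMP
);

INSERT INTO events (event_id, occurred_at)
VALUES (1, '2024-03-01 09:15:00'),
       (2, '2024-03-01 17:40:00'),
       (3, '2024-04-12 08:05:00');

-- Pull parts out of a timestamp.
SELECT event_id,
       CAST(occurred_at AS DATE)       AS event_date,
       EXTRACT(MONTH FROM occurred_at) AS event_month
FROM events;

-- Group events by month, a common first step in time series analysis.
SELECT DATE_TRUNC('month', occurred_at) AS month_start,
       COUNT(*)                         AS events_in_month
FROM events
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY month_start;
```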

Other data types, such as JSON and geographical types, are
supported by some but not all databases.
