Data Analysis Notes
/!\ Tips :
- Use aliases ! (AS name)
- Never use JOIN without specifying the type (INNER/LEFT/RIGHT/…)
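The two tips above can be tried locally. This is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery; the customers and orders tables and their contents are hypothetical and only serve to show how aliases and an explicit join type change the result.

```python
import sqlite3

# Hypothetical in-memory tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, revenue REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 30.0);
""")

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute("""
    SELECT c.name AS customer_name, SUM(o.revenue) AS total_revenue
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# LEFT JOIN keeps every customer; customers without orders get NULL (None).
left_rows = conn.execute("""
    SELECT c.name AS customer_name, SUM(o.revenue) AS total_revenue
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(inner_rows)
print(left_rows)
```

Writing the join type explicitly makes it obvious to the reader whether customers without orders are expected in the output.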
Datastudio introduction
Figure 1 : Metrics on a table
Session 2 (30/01 Afternoon)
Question 1)
Session 3 (31/01 Morning)
Question 2)
Objective
Get an overview of the key concepts and tools used in data analysis work.
Hardware basics
Figure 6 : Historical cost of computer memory and storage
The main challenge of Big Data is not how to store the information but how
to process it.
Rise of cloud computing
- When you load data, GCP will, depending on the volume and if needed,
optimize the partitioning of your data across multiple servers. It will
also ensure redundancy/replication to avoid data loss in case of an
infrastructure incident.
- When you query data, GCP will, depending on the volume to be queried,
select the right number of “computers” to run your query using a
distributed computing framework.
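As a toy illustration of the idea of splitting work across several “computers”, the sketch below partitions a list, sums each partition in a separate worker thread, and combines the partial results. It is only an analogy: the function names and the number of partitions are assumptions, and it says nothing about how GCP actually schedules a query.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, partition_count):
    """Spread the values over partition_count partitions (round-robin)."""
    return [values[i::partition_count] for i in range(partition_count)]


def partial_sum(partition):
    """Work done independently by one worker on its own partition."""
    return sum(partition)


revenues = list(range(1, 101))            # hypothetical data: 1, 2, ..., 100
partitions = split_into_partitions(revenues, 4)

# Each worker processes one partition; results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)
print(partial_results, total)
```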
High-level vs low-level classification depends on what you are comparing :
- BigQuery is high-level in comparison to Spark
- Spark is high-level in comparison to Hadoop/MapReduce
What is SQL ?
Initially, it was associated with RDBMS-type databases such as MySQL (created
in 1995) and SQL Server (created in 1989), which had a similar operating mode
in the background. Distributed computing frameworks such as BigQuery or
Apache Spark adopted it later on as well.
Session 4 (31/01 Afternoon)
Business case
1) How to start ?
2) BigQuery
3) Datastudio
Lecture : Data Quality on BigQuery
Objective
Get an overview of the data quality stakes in a data team.
A relational database (RDB) is a way of structuring information in tables,
rows, and columns. An RDB has the ability to establish links (or
relationships) between information by joining tables, which makes it easy to
understand and gain insights about the relationship between various data
points.
It allows you to set hard constraints on the database : uniqueness of a value
(or a set of values) through a primary key.
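A minimal sketch of such a hard constraint, using Python's built-in sqlite3 module as a stand-in for a relational database like MySQL. The customers table is hypothetical; the point is only that the database itself rejects a second row with the same primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")

# Inserting a second row with the same primary key violates the constraint.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(duplicate_rejected, row_count)
```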
                  MySQL                            BigQuery
Use case          OLTP-like                        Analytical
                  (Online Transaction Processing)  (OLAP-like)
Type of queries   Big number of small queries      Small number of huge queries
Manipulation      Row level (delete, update)       Bulk load,
                                                   update and delete limited
Constraints       Primary keys can enforce         Schema
                  schema constraints across
                  tables
SQL support       Yes                              Yes
Scalability       Vertical                         Horizontal
                                                   (i.e. distributed)
Data folks are importing good practices from software engineering to improve
quality management :
- Modular data modelling
- Documentation best practices
There is a whole ecosystem around dbt. It has been valued at 4 billion $
and start-ups are built around it.
(https://www.castordoc.com/, https://www.siffletdata.com/)
Example :
https://medium.com/partoo/partoo-migrates-from-mongodb-to-postgresql-43c60854bebb
The document format is however very popular and is also supported by
relational databases :
- PostgreSQL supports documents (without schema)
- BigQuery supports documents :
with schema (arrays) or without schema (JSON functions)
Session 5 (01/02 Morning)
BigQuery shares public datasets, such as the Google Trends dataset, so people
can practise on BigQuery.
https://console.cloud.google.com/bigquery?hl=fr&project=sql-class-mines-nancy-2023&ws=!1m4!1m3!3m2!1sbigquery-public-data!2swikipedia
# union operator : write either UNION ALL (keeps duplicate rows)
# or UNION DISTINCT (removes duplicate rows)
SELECT order_date, order_id, revenue
FROM command1
UNION ALL
(SELECT CAST(datetime AS DATE) AS order_date,
        orderID AS order_id,
        revenue
 FROM command2)
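The difference between the two variants can be checked on a small example. This sketch uses Python's built-in sqlite3 module, where the keyword for the deduplicating variant is plain UNION (BigQuery spells it UNION DISTINCT). The table names follow the query above; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE command1 (order_date TEXT, order_id INTEGER, revenue REAL);
    CREATE TABLE command2 (datetime TEXT, orderID INTEGER, revenue REAL);
    INSERT INTO command1 VALUES ('2023-02-01', 1, 20.0);
    -- Order 1 appears in both tables, so it is a duplicate after the cast.
    INSERT INTO command2 VALUES ('2023-02-01 09:30:00', 1, 20.0),
                                ('2023-02-01 14:00:00', 2, 35.0);
""")

# UNION ALL keeps every row from both sides, duplicates included.
union_all = conn.execute("""
    SELECT order_date, order_id, revenue FROM command1
    UNION ALL
    SELECT DATE(datetime) AS order_date, orderID AS order_id, revenue
    FROM command2
""").fetchall()

# UNION (UNION DISTINCT in BigQuery) removes duplicate rows.
union_distinct = conn.execute("""
    SELECT order_date, order_id, revenue FROM command1
    UNION
    SELECT DATE(datetime) AS order_date, orderID AS order_id, revenue
    FROM command2
    ORDER BY order_id
""").fetchall()

print(len(union_all), union_distinct)
```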
Summary
- On BigQuery, you are billed on the amount of data your query processes
- When you run a query on a table, it scans the full table (every row)
but reads only the columns needed.
- BigQuery offers support to partition your table in order to optimize
it (saving money and not wasting server resources)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
Indexes are a common way to enhance database performance. An index allows the
database server to find and retrieve specific rows much faster than it could
do without an index. But indexes also add overhead to the database system as
a whole, so they should be used sensibly.
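A minimal sketch of the effect of an index, using Python's built-in sqlite3 module and its EXPLAIN QUERY PLAN statement. The orders table, its contents and the index name are hypothetical, and the exact wording of the plan depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 100, 10.0) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"


def query_plan(connection, sql):
    """Return the human-readable plan SQLite would use for the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the textual description.
    return " | ".join(row[-1] for row in rows)


# Without an index, the database has to scan the whole table.
plan_without_index = query_plan(conn, query)

# With an index on customer_id, it can search the matching rows directly.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, query)

print(plan_without_index)
print(plan_with_index)
```

The overhead mentioned above comes from the fact that every insert or update on the table now also has to maintain the index.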
https://cloud.google.com/bigquery/docs/loading-data#loading_denormalized_nested_and_repeated_data
Syntax tips
- Date
https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions
WITH ga_running_sum AS (
  /* STEP 1 : Compute Running Sum */
  SELECT {...}
), ga_rank AS (
  /* STEP 2 : Compute Rank */
  SELECT * {...}
  FROM ga_running_sum
  WHERE date_session > DATE_ADD(CURRENT_DATE(), INTERVAL -3 DAY)
)
SELECT /* STEP 3 : Compute Evolution */ * {...}
FROM ga_rank
WHERE date_session = (SELECT MAX(date_session)
                      FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`)
- Regex
https://www.dataquest.io/blog/regex-cheatsheet/ (syntax)
https://pythex.org/ (to test my regex)
CAST(REPLACE(REPLACE(SAFE.REGEXP_EXTRACT(title,
'\\([0-9]{4}\\)'),'(',''),')','') AS INT64) AS m_released_year
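The same extraction can be prototyped in Python before writing it in SQL. This is a sketch with the standard re module; the function name and the sample titles are hypothetical. It uses a capturing group to keep only the four digits, which avoids the two REPLACE calls.

```python
import re


def extract_release_year(title):
    """Return the 4-digit year found between parentheses, or None."""
    match = re.search(r"\(([0-9]{4})\)", title)
    return int(match.group(1)) if match else None


print(extract_release_year("Toy Story (1995)"))
print(extract_release_year("Untitled project"))
```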
Session 6 (01/02 Afternoon)
How to do a rank ?
For each customer, we create an order rank that increases with the date of
the order (this pattern is similar to an aggregation function) :
- GROUP BY is replaced by PARTITION BY
- ORDER BY is necessary for using a ranking function
- OVER() declares the use of an analytics function
An analytics function is the only way to use functions that require an ORDER
BY clause.
To finish with this example, beware of the difference between RANK() and
ROW_NUMBER() :
- RANK() : if 2 orders have been placed on the same date, they will have
the same rank.
- ROW_NUMBER() : they will have a different row number (arbitrary one).
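The difference can be observed on a small example. This sketch runs the two functions with Python's built-in sqlite3 module, assuming the bundled SQLite is recent enough to support window functions (3.25 or later). The orders table and its rows are made up: orders 101 and 102 share the same date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 100, '2023-01-30'),
        (1, 101, '2023-01-31'),
        (1, 102, '2023-01-31'),
        (1, 103, '2023-02-01');
""")

rows = conn.execute("""
    SELECT order_id,
           RANK() OVER (PARTITION BY customer_id
                        ORDER BY order_date) AS order_rank,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date) AS order_row_number
    FROM orders
    ORDER BY order_id
""").fetchall()

# Tied orders share a rank (and the next rank is skipped),
# while row numbers are always distinct.
ranks = [row[1] for row in rows]
row_numbers = sorted(row[2] for row in rows)
print(ranks, row_numbers)
```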
You can use it as well to compute aggregation functions that do not require
ordering (SUM()/AVG()/…) to simplify your query :
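For instance, the sketch below attaches each customer's total revenue to every order row without a separate GROUP BY query and join. It uses Python's built-in sqlite3 module (assuming SQLite 3.25 or later for window functions); the table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_id INTEGER, revenue REAL);
    INSERT INTO orders VALUES (1, 100, 20.0), (1, 101, 30.0), (2, 102, 50.0);
""")

# No ORDER BY inside OVER(): the sum covers the whole partition.
rows = conn.execute("""
    SELECT customer_id,
           order_id,
           revenue,
           SUM(revenue) OVER (PARTITION BY customer_id) AS customer_revenue
    FROM orders
    ORDER BY order_id
""").fetchall()

print(rows)
```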
General syntax
- PARTITION BY : breaks up the input rows into separate partitions, over
which the window function is independently evaluated.
- ORDER BY : defines how rows are ordered within a partition. This clause
is optional in most situations but is required in some cases for
navigation functions.
- WINDOW_FRAME_CLAUSE : (for aggregate analytics functions) defines the
window frame within the current partition. The window frame determines
what to include in the window. If this clause is used, ORDER BY is
required except for fully unbounded windows.
We will work with ROWS : it computes the window frame based on physical
offsets from the current row. For example, you could include two rows before
and after the current row.
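The "two rows before and after" example can be sketched as follows, with Python's built-in sqlite3 module (assuming SQLite 3.25 or later). The sessions table and its five daily values are hypothetical. At the edges of the partition the frame simply contains fewer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (date_session TEXT, organic_entrances INTEGER);
    INSERT INTO sessions VALUES
        ('2023-02-01', 10),
        ('2023-02-02', 20),
        ('2023-02-03', 30),
        ('2023-02-04', 40),
        ('2023-02-05', 50);
""")

rows = conn.execute("""
    SELECT date_session,
           SUM(organic_entrances) OVER (
               ORDER BY date_session
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
           ) AS entrances_window
    FROM sessions
    ORDER BY date_session
""").fetchall()

window_sums = [row[1] for row in rows]
print(window_sums)
```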
Query steps :
1. Compute the sessions from the last X days
(as a convention, name them : entrances_1_2, entrances_1_7,…)
2. From there, compute the rank()
3. Run the difference
4. Filter on something
WITH ga_running_sum AS
( /* STEP 1 */
  SELECT
    *
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 1 PRECEDING AND 0 PRECEDING) as entrances_1_2
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 6 PRECEDING AND 0 PRECEDING) as entrances_1_7
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 29 PRECEDING AND 0 PRECEDING) as entrances_1_30
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 3 PRECEDING AND 2 PRECEDING) as entrances_3_4
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) as entrances_7_14
    ,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
       ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) as entrances_30_60
  FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`
),ga_rank AS
( /* STEP 2 : Compute Rank */
  SELECT
    *
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_2 DESC)
       as rank_entrances_1_2
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_7 DESC)
       as rank_entrances_1_7
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_30 DESC)
       as rank_entrances_1_30
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_3_4 DESC)
       as rank_entrances_3_4
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_7_14 DESC)
       as rank_entrances_7_14
    ,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_30_60 DESC)
       as rank_entrances_30_60
  FROM ga_running_sum
  WHERE date_session > DATE_ADD(CURRENT_DATE(),INTERVAL -3 DAY)
)
SELECT
  /* STEP 3 : Compute Evolution */
  *
  ,ROUND(rank_entrances_3_4 - rank_entrances_1_2)
     as evol_rank_entrances_3_4_vs_rank_entrances_1_2
  ,ROUND(rank_entrances_7_14 - rank_entrances_1_7)
     as evol_rank_entrances_7_14_vs_rank_entrances_1_7
  ,ROUND(rank_entrances_30_60 - rank_entrances_1_30)
     as evol_rank_entrances_30_60_vs_rank_entrances_1_30
  ,ROUND(rank_entrances_1_7 - rank_entrances_1_2)
     as evol_rank_entrances_1_7_vs_rank_entrances_1_2
  ,ROUND(rank_entrances_1_30 - rank_entrances_1_2)
     as evol_rank_entrances_1_30_vs_rank_entrances_1_2
FROM ga_rank
WHERE date_session = (SELECT MAX(date_session)
                      FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`)
The document format you discovered in your MongoDB exercises is a very popular
one. You will find it as well in Python (cf. the dictionary data structure).
It has since been implemented in BigQuery and in relational databases like
PostgreSQL. MongoDB sacrifices data consistency for more horizontal
scalability combined with very good latency.
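To make the link with the Python dictionary concrete, here is a minimal sketch of a document with a nested object and a nested array, serialized to JSON and then flattened into table-like rows. The order document and its field names are hypothetical.

```python
import json

# A document: nested object (customer) and nested array (items).
order_document = {
    "order_id": 1,
    "customer": {"name": "Alice", "city": "Nancy"},
    "items": [
        {"sku": "A", "quantity": 2},
        {"sku": "B", "quantity": 1},
    ],
}

# Documents are usually exchanged and stored as JSON text.
serialized = json.dumps(order_document)
restored = json.loads(serialized)

# Flattening the array gives one row per item, as in a relational table.
flat_rows = [
    (restored["order_id"], restored["customer"]["name"],
     item["sku"], item["quantity"])
    for item in restored["items"]
]
print(flat_rows)
```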
What is an API ?
- It is an abstract term to describe the idea of a communication protocol
with a machine.
- There are various ways to do it : an API can return data in various
standard formats (XML, CSV files, …), and JSON is one of the most
popular.
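As a small illustration of those formats, the sketch below parses the same two records from a JSON payload and from a CSV payload using only the Python standard library. Both payloads are made up and no real API is called.

```python
import csv
import io
import json

# Hypothetical API responses carrying the same data in two formats.
json_payload = '[{"order_id": 1, "revenue": 20.0}, {"order_id": 2, "revenue": 35.0}]'
csv_payload = "order_id,revenue\n1,20.0\n2,35.0\n"

# JSON keeps types (numbers stay numbers).
records_from_json = json.loads(json_payload)

# CSV values are plain text and have to be converted explicitly.
records_from_csv = [
    {"order_id": int(row["order_id"]), "revenue": float(row["revenue"])}
    for row in csv.DictReader(io.StringIO(csv_payload))
]

print(records_from_json == records_from_csv)
```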
Examples :
- In Webedia : https://rivery.io/
- Another tool very popular among start up : https://airbyte.com/
Versioning
Git is a free and open source distributed version control system designed to
handle everything from small to very large projects with speed and efficiency.
It works by tracking differences : which lines are added and which lines
are removed.
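The idea of tracking added and removed lines can be sketched with Python's standard difflib module. This is only an illustration of a line-based diff, not Git's own implementation; the two file versions and the file labels are hypothetical.

```python
import difflib

old_version = ["SELECT order_id", "FROM command1"]
new_version = ["SELECT order_id, revenue", "FROM command1"]

# A unified diff marks removed lines with "-" and added lines with "+".
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="query.sql (before)",
        tofile="query.sql (after)",
        lineterm="",
    )
)

# Skip the "---" / "+++" header lines when collecting the changes.
removed = [l[1:] for l in diff_lines if l.startswith("-") and not l.startswith("---")]
added = [l[1:] for l in diff_lines if l.startswith("+") and not l.startswith("+++")]

print("\n".join(diff_lines))
```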
On top of it, private companies offer services to store git “repositories” and
interact with them :
- https://github.com/ is one of the most popular ones (it was bought by
Microsoft for 7.5 billion dollars in 2018).
- Another example : https://about.gitlab.com/
No code objective
How ?
Reusing components/macros already coded by others allows you to spend more
time focusing on your problem and its impact, rather than on “logistics”.
By reducing the number of areas of expertise needed for a project, you can :
- Focus on hiring people that fit your specific issues
- Diminish the number of stakeholders in the project, easing
communication and decision making.
Gojob example
https://gojob.com/
In collaboration with the ops team, they develop and test new processes that
involve changes to tools, building them with no-code solutions.
Strapi
Open-source trend
A lot of recent tools are available in an open-source version and are
monetized through an enterprise plan on top.
Examples of recent tools funded by VC : dbt, strapi.
Why go for the “make” ?
- Nothing fits your needs
- Too expensive / you already have the expertise to develop the project
- Too strategic for the company : it should be a strategic advantage vs
the rest of the market