
Data Analysis Notes

Uploaded by

Romain Osanno

DATA ANALYSIS

Session 1 (30/01 Morning)

➢ Introduction to SQL syntax on Google Cloud Platform (BigQuery)

# Write the result of the query in a new table


CREATE OR REPLACE TABLE z_romain.ratings AS
SELECT *
FROM `sql-class-mines-nancy-2023.movielens.ratings`
LIMIT 1000

# 1) Get the number of ratings by movie id


SELECT movieId, COUNT(*) AS nb_ratings_by_movie
FROM `z_romain.ratings`
GROUP BY 1

# 2) Link it with the title from the movies table

SELECT m.title, COUNT(*) AS nb_ratings_by_movie
FROM `z_romain.ratings` AS r INNER JOIN `movielens.movies` AS m
ON r.movieId = m.movieId
GROUP BY 1

# 3) Keep only the top 10 movies released in 2010


SELECT m.title, COUNT(*) AS nb_ratings_by_movie
FROM `z_romain.ratings` AS r INNER JOIN `movielens.movies` AS m
ON r.movieId = m.movieId
WHERE m.title LIKE '%(2010)%' # or REGEXP_CONTAINS(m.title, r'\(2010\)')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

/!\ Tips :
- Use aliases ! (AS <name>)
- Never use JOIN without specifying its type (INNER/LEFT/RIGHT/…)

➢ Datastudio introduction

# creation of a new table to add to the report


CREATE OR REPLACE TABLE `z_romain.datastudtio` AS
SELECT m.title, COUNT(*) as nb_ratings_by_movie
FROM `sql-class-mines-nancy-2023.movielens.ratings` AS r
INNER JOIN `sql-class-mines-nancy-2023.movielens.movies` AS m
ON r.movieId = m.movieId
GROUP BY 1

Figure 1 : Metrics on a table

Figure 2 : Movie distribution of ratings

Figure 3 : Movie distribution of ratings – group

Session 2 (30/01 Afternoon)

➢ Exercise on BigQuery and Datastudio

To have a look at the MovieLens interface :
https://movielens.org/home
mail : [email protected]
password : mines2023

Question 1)

# First, we identify the Disney tags


SELECT tag, COUNT(*) as nb_tag, COUNT(DISTINCT movieId) as nb_distinct_movieId
FROM `sql-class-mines-nancy-2023.movielens.tags`
WHERE regexp_contains(lower(tag),'disney')
GROUP BY 1
ORDER BY 2 DESC

# Optional (for visualization) : count the tags


SELECT COUNT(*) as nb_tag, COUNT(DISTINCT movieId) as nb_distinct_movieId
FROM `sql-class-mines-nancy-2023.movielens.tags`
WHERE regexp_contains(lower(tag),'disney')

# Then, we map it with the ratings table


WITH disney_movieId AS (SELECT DISTINCT movieId
FROM `sql-class-mines-nancy-2023.movielens.tags`
WHERE regexp_contains(lower(tag),'disney'))
SELECT movies.title, COUNT(*) as nb_ratings
FROM `sql-class-mines-nancy-2023.movielens.ratings` as ratings
INNER JOIN `sql-class-mines-nancy-2023.movielens.movies` as movies
ON ratings.movieID = movies.movieId
WHERE ratings.movieId IN (SELECT * FROM disney_movieId)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

Figure 4 : Top 10 Disney movies with the most ratings

Session 3 (31/01 Morning)

➢ Exercise on BigQuery and Datastudio (continued)

Question 2)

# Data preparation query


CREATE OR REPLACE TABLE z_romain.ratings_by_users AS
SELECT userId, COUNT(*) as nb_ratings_by_user
FROM `sql-class-mines-nancy-2023.movielens.ratings`
GROUP BY 1

Figure 5 : Movie distribution running sum – group

➢ Lecture : Big data & Cloud Module

Objective
Get an overview of the key concepts and tools used in data analysis work.

➢ Hardware basics

Reading and writing information on storage takes time (relative to the
processor speed). This speed depends on the type of storage (SSD vs magnetic).

Storage pricing on Google :
- Standard storage = $46 per month for 1 TB
- SSD = $190 per month

Figure 6 : Historical cost of computer memory and storage

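➢ Rise of distributed computing

The main challenge of Big Data is not how to store the information but how
to process it.

Example : Hadoop, first released in 2006 as an open-source implementation of
the MapReduce model that Google had described in a 2004 paper.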


Apache Hadoop is a collection of open-source software utilities that
facilitates using a network of many computers to solve problems involving
massive amounts of data and computation. It provides a software framework for
distributed storage and processing of big data using the MapReduce programming
model.
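The MapReduce model mentioned above can be sketched on a single machine. This is only an illustration of the three phases (map, shuffle, reduce) on a word count; the function names are made up for the example and nothing here uses the actual Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word of one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values of each key independently."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    # On a real cluster each document (or file block) would be mapped on a
    # different machine; here the "machines" are just loop iterations.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data big cloud", "cloud data"]))
```

Because each map call and each reduce key is independent of the others, the work can be spread over many computers, which is the point of the model.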

Figure 7 : Hadoop MapReduce

➢ Rise of cloud computing

Cloud computing is the on-demand availability of computer system resources,
especially data storage and computing power, without direct active management
by the user. Large clouds often have functions distributed over multiple
locations, each of which is a data centre.

Figure 8 : Market share

➢ What is BigQuery ? – high level vs low level language

BigQuery is a massively parallel processing (MPP) database, serverless and
fully managed, queried with SQL.

- When you load data, GCP optimizes if needed, depending on the volume, the
  partitioning of your data across multiple servers. It also ensures
  redundancy/replication to avoid data loss in case of an infrastructure
  incident.

- When you query data, GCP selects, depending on the volume to be queried,
  the right number of “computers” to run your query using a distributed
  computing framework.

High-level languages are programming languages that are designed to allow
humans to write computer programs and interact with a computer system without
having to have specific knowledge of the processor or hardware that the
program will run on. They use command words and syntax which reflects everyday
language, which makes them easier to learn and use. They also offer the
programmer development tools such as libraries and built-in functions.

High-level vs low-level classification depends on what you are comparing :
- BigQuery is high-level in comparison to Spark
- Spark is high-level in comparison to Hadoop/MapReduce

➢ What is SQL ?

SQL stands for Structured Query Language; it was invented in 1974. It is a
declarative language : you write code that describes what you want, but not
how to get it (as opposed to an imperative language). It is a widespread
standard in the data industry.
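To make the declarative / imperative contrast concrete, here is a minimal sketch that counts ratings per movie in both styles. It assumes a tiny made-up list of (movieId, rating) rows and uses an in-memory SQLite database as a stand-in for BigQuery.

```python
import sqlite3

# Made-up sample rows: (movieId, rating)
ratings = [(1, 4.0), (1, 5.0), (2, 3.0)]


def count_imperative(rows):
    """Imperative style: spell out HOW to loop and accumulate."""
    counts = {}
    for movie_id, _rating in rows:
        counts[movie_id] = counts.get(movie_id, 0) + 1
    return counts


def count_declarative(rows):
    """Declarative style: state WHAT is wanted, the engine decides how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (movieId INTEGER, rating REAL)")
    conn.executemany("INSERT INTO ratings VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT movieId, COUNT(*) FROM ratings GROUP BY movieId"
    ).fetchall()
    conn.close()
    return dict(result)
```

Both functions return the same counts; only the SQL version leaves the execution strategy to the engine, which is what lets a distributed engine parallelize the same statement without any change to the query.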

Initially, it was associated with RDBMS-type databases such as MySQL (created
in 1995) and SQL Server (created in 1989), which worked in a similar way
in the background. Distributed computing frameworks such as BigQuery or
Apache Spark adopted it later on.

Some databases were labelled “NoSQL” databases in opposition to RDBMS
databases, which were originally the only type of database with SQL support.

➢ Cloud provider value proposition

Software as a service (SaaS) is a software licensing and delivery model in
which software is licensed on a subscription basis and is centrally hosted.
Most French unicorns have business models characterised as SaaS-based
(Contentsquare, Deezer, Mirakl, Payfit, Qonto, Spendesk).

Figure 9 : Comparison of cloud service models

IaaS : Infrastructure as a service
PaaS : Platform as a service
SaaS : Software as a service

Session 4 (31/01 Afternoon)

➢ Business case

1) How to start ?

Identify the metrics we want to analyse :
- We want to look at macro statistics
  → Which indicators do we want ?
    Number of ratings, average rating value and standard deviation
- We want to look at granular values (movie level)
  → Just look at example movies one by one, applying the previous metrics
- We want to have an idea of the distribution
  → Number of users by rating value

Identify the analysis axes that could be discriminating :
- Movie related : genre, release date, number of ratings by movie
- User related : rating value, rating timing*, number of ratings by
  user, first rating date

(*in comparison to the release date)

2) BigQuery

# Data preparation query : join ratings and movies


CREATE OR REPLACE TABLE z_romain.ta_ratings_movies AS
SELECT CAST(TIMESTAMP_SECONDS(CAST(timestamp AS INT64)) AS DATE) AS rating_date,
title AS m_title,
genres AS m_genres,
COUNT(*) OVER (PARTITION BY movieId) AS nb_ratings_by_movie,
CAST(REPLACE(REPLACE(SAFE.REGEXP_EXTRACT(title,'\\([0-9]{4}\\)'),
'(',''),')','') AS INT64) AS m_released_year,
rating,
userId,
COUNT(*) OVER (PARTITION BY userId) AS nb_ratings_by_user
FROM `sql-class-mines-nancy-2023.movielens.ratings`
INNER JOIN `sql-class-mines-nancy-2023.movielens.movies` AS movie
USING(movieId)

3) Datastudio

Figure 10 : Datastudio report on the movie distribution – average

Figure 11 : Datastudio report on the movie distribution – standard deviation

➢ Lecture : Data Quality on BigQuery

Objective
Get an overview of the data quality stakes in a data team.

➢ Relational database – key takeaway for data quality

Figure 12 : Database tables in a normalized manner

Figure 12 : Database tables in a denormalized manner

Relational databases were designed especially for transaction management.

Example : e-commerce website

You want to support multiple concurrent queries (in high volume) :
- List the seats available by category (SELECT)
- Place an “order” (INSERT)
- Update the availability of a seat once it has been sold (UPDATE)
You do not want to sell the same seat twice !
If a failure occurs (during a data update), you want to be able to roll back
to a previous state.
You need accurate data and a “solid” data model.
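The requirements of this example (no double sale, revert on failure) can be sketched with a transaction. This is a minimal illustration using an in-memory SQLite database; the `seats` and `orders` tables, their columns and the `crash_midway` flag are assumptions made up for the example, not part of the course material.

```python
import sqlite3


def sell_seat(conn, seat_id, customer, crash_midway=False):
    """Try to sell one seat; return True on success, False otherwise."""
    try:
        with conn:  # commits if the block succeeds, rolls back on any error
            updated = conn.execute(
                "UPDATE seats SET available = 0 "
                "WHERE seat_id = ? AND available = 1",
                (seat_id,),
            ).rowcount
            if updated == 0:
                raise ValueError("seat already sold")
            if crash_midway:
                raise RuntimeError("simulated failure during the update")
            conn.execute(
                "INSERT INTO orders (seat_id, customer) VALUES (?, ?)",
                (seat_id, customer),
            )
        return True
    except (ValueError, RuntimeError):
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seats (seat_id INTEGER PRIMARY KEY, available INTEGER)")
    conn.execute("CREATE TABLE orders (seat_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("INSERT INTO seats VALUES (1, 1)")
    conn.commit()

    crashed = sell_seat(conn, 1, "alice", crash_midway=True)
    # The UPDATE of the failed attempt was rolled back: the seat is still free.
    still_available = conn.execute(
        "SELECT available FROM seats WHERE seat_id = 1"
    ).fetchone()[0]
    first = sell_seat(conn, 1, "alice")
    second = sell_seat(conn, 1, "bob")
    orders = conn.execute("SELECT seat_id, customer FROM orders").fetchall()
    conn.close()
    return crashed, still_available, first, second, orders
```

The primary key on `orders.seat_id` is a second safety net: even if the availability check were skipped, the database would refuse a second order for the same seat.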

A relational database (RDB) is a way of structuring information in tables,
rows, and columns. An RDB has the ability to establish links (or
relationships) between information by joining tables, which makes it easy to
understand and gain insights about the relationship between various data
points.

It allows you to set hard constraints on the database : uniqueness of a value
(or of a set of values) through a primary key.

➢ BigQuery vs Relational Database

                 MySQL                            BigQuery
Use case         OLTP-like                        Analytical
                 (Online Transaction Processing)  (OLAP-like)
Type of queries  Big number of small queries      Small number of huge queries
Manipulation     Row level (delete, update)       Bulk load,
                                                  update and delete limited
Constraints      Primary keys can enforce schema  Schema
                 constraints across tables
SQL support      Yes                              Yes
Scalability      Vertical                         Horizontal
                                                  (i.e. distributed)

Figure 13 : MySQL vs BigQuery

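➢ Data Quality in BigQuery

BigQuery does not enforce primary keys, so people started to replicate this
constraint by writing a “primary key” test on a set of columns : the query
below returns every (userId, movieId) pair that appears more than once.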

SELECT userId, movieId, COUNT(*) as n
FROM `sql-class-mines-nancy-2023.movielens.ratings`
GROUP BY 1, 2
HAVING n > 1
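The same uniqueness check can be wrapped in a small reusable test. The sketch below runs against an in-memory SQLite database with made-up rows instead of the BigQuery table; `find_duplicate_keys` is a hypothetical helper name, not a function of dbt or BigQuery.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_columns):
    """Return (key..., n) for every key that appears more than once."""
    cols = ", ".join(key_columns)
    # Table and column names are interpolated as-is: they must come from
    # trusted code, never from user input.
    query = (
        f"SELECT {cols}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?)",
        # The last row repeats the (userId=1, movieId=10) key on purpose.
        [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (1, 10, 2.0)],
    )
    duplicates = find_duplicate_keys(conn, "ratings", ["userId", "movieId"])
    conn.close()
    return duplicates
```

An empty result means the set of columns behaves like a primary key; any returned row is a data quality alert.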

➢ Data Build Tool (dbt) : a booming tool in data teams

Framework for testing : https://www.getdbt.com/product/data-testing/

Data folks are importing good practices from software engineering to improve
quality management :
- Modular data modelling
- Documentation best practices

There is a whole ecosystem around dbt. The company has been valued at
4 billion dollars and start-ups are being built around it
(https://www.castordoc.com/, https://www.siffletdata.com/).

➢ NoSQL database and document format

NoSQL databases sacrifice data consistency for more horizontal scalability
combined with very good latency. Because NoSQL databases do not enforce
schemas and relations, they become a nightmare in terms of data quality if
their design is not well managed.

Example :
https://medium.com/partoo/partoo-migrates-from-mongodb-to-postgresql-43c60854bebb

The document format is however very popular and is also supported by other
databases, including relational ones :
- PostgreSQL supports documents (without schema)
- BigQuery supports documents :
  with schema (arrays) or without schema (JSON functions)

Figure 14 : example of BigQuery document support with schema

Session 5 (01/02 Morning)

➢ Lecture : BigQuery optimization and data loading

➢ Data partitioning in BigQuery and cost optimization

BigQuery hosts public datasets, such as the Google Trends dataset, so that
people can practise on BigQuery.

https://console.cloud.google.com/bigquery?hl=fr&project=sql-class-mines-nancy-2023&ws=!1m4!1m3!3m2!1sbigquery-public-data!2swikipedia

# union operator : UNION ALL keeps duplicate rows, UNION DISTINCT removes them
SELECT order_date, order_id, revenue
FROM command1
UNION ALL
SELECT CAST(datetime AS DATE) AS order_date,
orderID AS order_id,
revenue
FROM command2

Summary
- On BigQuery, you are billed on the amount of data your query processes
- When you run a query on a table, BigQuery scans the full table (every
  row) but reads only the columns the query needs.
- BigQuery lets you partition your tables so that a query scans only the
  partitions it needs (saving money and not wasting server resources)
  → https://cloud.google.com/bigquery/docs/querying-partitioned-tables

➢ How does indexing work in a relational database ?

An indexing strategy is similar to the index (or the table of contents) of a
book. If you are looking for a concept in a book, instead of reading the full
book, you read the index to get the page number. Indexing takes space
(physical storage) and time (the index is built at creation and maintained
on every data load or update).

Indexes are a common way to enhance database performance. An index allows the
database server to find and retrieve specific rows much faster than it could
do without an index. But indexes also add overhead to the database system as
a whole, so they should be used sensibly.
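The book analogy can be sketched in a few lines. The example below uses a plain dictionary as the index; real database engines typically rely on B-tree structures, so this only illustrates the trade-off (extra storage and build time in exchange for fast lookups), and the row layout is made up.

```python
def full_scan(rows, movie_id):
    """Without an index: read every row to find the matches."""
    return [row for row in rows if row["movieId"] == movie_id]


def build_index(rows, column):
    """Build the index once: map each value to the positions of its rows.

    This costs time (one full pass) and space (the dictionary itself),
    and it has to be updated whenever rows are added or changed.
    """
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], []).append(position)
    return index


def indexed_lookup(rows, index, movie_id):
    """With an index: jump straight to the matching positions."""
    return [rows[position] for position in index.get(movie_id, [])]


if __name__ == "__main__":
    rows = [
        {"movieId": 1, "rating": 4.0},
        {"movieId": 2, "rating": 3.0},
        {"movieId": 1, "rating": 5.0},
    ]
    index = build_index(rows, "movieId")
    print(indexed_lookup(rows, index, 1))
```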

➢ How to load data into BigQuery ?

https://cloud.google.com/bigquery/docs/loading-data#loading_denormalized_nested_and_repeated_data

➢ Syntax tips

- Date

https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions

Example : creating a moving date filter (here, the last 3 days)

WHERE date_session > DATE_ADD(CURRENT_DATE(),INTERVAL -3 DAY)

WITH ga_running_sum AS (
  /* STEP 1 : Compute Running Sum */
  SELECT {...}
), ga_rank AS (
  /* STEP 2 : Compute Rank */
  SELECT * {...}
  FROM ga_running_sum
  WHERE date_session > DATE_ADD(CURRENT_DATE(),INTERVAL -3 DAY)
)
SELECT /* STEP 3 : Compute Evolution */ * {...}
FROM ga_rank
WHERE date_session = (SELECT MAX(date_session)
                      FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`)

- Regex

https://www.dataquest.io/blog/regex-cheatsheet/ (syntax)
https://pythex.org/ (to test a regex)

Example : extract the year from the title

We use the BigQuery REGEXP_EXTRACT function :
1) With pythex, you identify the regex that matches
2) Then, you implement it in BigQuery

CAST(REPLACE(REPLACE(SAFE.REGEXP_EXTRACT(title,
'\\([0-9]{4}\\)'),'(',''),')','') AS INT64) AS m_released_year
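Since pythex tests Python regular expressions, the same extraction can be checked locally before moving it to BigQuery. This is a minimal sketch: `extract_release_year` is a hypothetical helper, and it uses a capture group instead of the two REPLACE calls to drop the parentheses.

```python
import re

# Four digits between literal parentheses; the group keeps only the digits.
YEAR_IN_PARENTHESES = re.compile(r"\(([0-9]{4})\)")


def extract_release_year(title):
    """Return the first 4-digit year found in parentheses, or None."""
    match = YEAR_IN_PARENTHESES.search(title)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for title in ["Toy Story (1995)", "Blade Runner 2049 (2017)", "Untitled"]:
        print(title, "->", extract_release_year(title))
```

Returning None when nothing matches mirrors the behaviour of the SQL expression, where a title without a year yields NULL.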

Session 6 (01/02 Afternoon)

➢ SQL – Window function module

How to do a rank ?

# We want to rank each customer's orders by order date
SELECT *, RANK() OVER (PARTITION BY customer_id ORDER BY date ASC) as order_number
FROM orders

For each customer, we create an order rank that increases with the order
date (this pattern is similar to an aggregation function) :
- GROUP BY is replaced by PARTITION BY
- ORDER BY is necessary for using a ranking function
- OVER() declares the use of an analytic function

An analytic function is the only way to use functions that require an ORDER
BY operator.

To finish with this example, beware of the difference between RANK() and
ROW_NUMBER() :
- RANK() : if 2 orders have been placed on the same date, they will have
  the same rank.
- ROW_NUMBER() : they will have different row numbers (assigned arbitrarily).
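The difference between RANK() and ROW_NUMBER() is easy to observe on a tie. The sketch below uses an in-memory SQLite database (window functions need SQLite 3.25 or later) as a stand-in for BigQuery; the `orders` rows and the `order_date` column name are made up for the example.

```python
import sqlite3


def rank_orders(orders):
    """Return (customer_id, order_date, RANK, ROW_NUMBER) for each order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        """
        SELECT customer_id,
               order_date,
               RANK()       OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rank,
               ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_row
        FROM orders
        ORDER BY customer_id, order_date, order_row
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = [
        (1, "2023-01-30"),
        (1, "2023-01-30"),  # same date as the previous order: a tie
        (1, "2023-02-01"),
        (2, "2023-01-31"),
    ]
    for row in rank_orders(sample):
        print(row)
```

For customer 1, the two orders placed on the same date share rank 1 and the next order gets rank 3 (rank 2 is skipped), whereas the row numbers run 1, 2, 3.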

You can use it as well to compute aggregation functions that do not require
ordering (SUM()/AVG()/…) to simplify your query :

CREATE OR REPLACE TABLE z_romain.ta_ratings_movies AS
SELECT title as m_title,
genres as m_genres,
COUNT(*) OVER (PARTITION BY movieId) AS nb_ratings_by_movie,
rating,
userId,
COUNT(*) OVER (PARTITION BY userId) AS nb_ratings_by_user
FROM `sql-class-mines-nancy-2023.movielens.ratings`
INNER JOIN `sql-class-mines-nancy-2023.movielens.movies` as movie
USING(movieId)

# Focus on the running_sum operator


SELECT item, purchases, category, SUM(purchases)
OVER (
PARTITION BY category
ORDER BY purchases
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS total_purchases
FROM Produce

General syntax
- PARTITION BY : breaks up the input rows into separate partitions, over
which the window function is independently evaluated.
- ORDER BY : defines how rows are ordered within a partition. This clause
is optional in most situations but is required in some cases for
navigation functions.
- WINDOW_FRAME_CLAUSE : (for aggregate analytics functions) defines the
window frame within the current partition. The window frame determines
what to include in the window. If this clause is used, ORDER BY is
required except for fully unbounded windows.

We will work with ROWS : it computes the window frame based on physical
offsets from the current row. For example, you could include two rows before
and after the current row.

ROWS BETWEEN A AND A'
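A ROWS frame can be emulated in plain Python to see exactly which rows it covers. This sketch assumes the values are already sorted and belong to a single partition; `rows_between` is a hypothetical helper that mimics SUM(...) OVER (ROWS BETWEEN a PRECEDING AND b PRECEDING), where 0 PRECEDING is the current row.

```python
def rows_between(values, preceding_start, preceding_end=0):
    """Sum, for each row, the rows from `preceding_start` rows back
    to `preceding_end` rows back (both bounds included)."""
    sums = []
    for i in range(len(values)):
        start = max(0, i - preceding_start)  # the frame is clipped at the first row
        end = i - preceding_end              # inclusive upper bound of the frame
        if end < 0:
            sums.append(None)                # empty frame: SQL SUM returns NULL
        else:
            sums.append(sum(values[start:end + 1]))
    return sums


if __name__ == "__main__":
    daily = [10, 20, 30, 40]
    print(rows_between(daily, 1))            # current row + previous row
    print(rows_between(daily, 3, 2))         # rows 3 and 2 positions back
    print(rows_between(daily, len(daily)))   # running sum over all previous rows
```

With `daily = [10, 20, 30, 40]`, the first call gives `[10, 30, 50, 70]` and the second `[None, None, 10, 30]`: the first rows of a partition have incomplete (or empty) frames, which is worth remembering when reading the first days of a time series.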

# Let’s discover the data


SELECT * FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga` LIMIT 1000

How to do it step by step ?

Metrics we need to compute :
- Sessions 2 days : sessions from the last two days
- Rank 30 days
- Rank 7 days
- Progression rank 30 days = rank 2 days – rank 30 days
- Progression rank 7 days = rank 2 days – rank 7 days

Query steps :
1. Compute the sessions from the last X days
(as a convention, name them : entrances_1_2, entrances_1_7,…)
2. From there, compute the rank()
3. Compute the difference
4. Filter (here, on the latest date_session)

One possible way to code it

WITH ga_running_sum AS
( /* STEP 1 */
SELECT
*
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 1 PRECEDING AND 0 PRECEDING) as entrances_1_2
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 6 PRECEDING AND 0 PRECEDING) as entrances_1_7
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 29 PRECEDING AND 0 PRECEDING) as entrances_1_30
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 3 PRECEDING AND 2 PRECEDING) as entrances_3_4
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) as entrances_7_14
,SUM(organic_entrances) OVER (PARTITION BY entity_theme,entity_id ORDER BY date_session
    ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) as entrances_30_60
FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`
),ga_rank AS
( /* STEP 2 : Compute Rank */
SELECT
*
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_2 DESC)
    as rank_entrances_1_2
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_7 DESC)
    as rank_entrances_1_7
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_1_30 DESC)
    as rank_entrances_1_30
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_3_4 DESC)
    as rank_entrances_3_4
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_7_14 DESC)
    as rank_entrances_7_14
,RANK() OVER (PARTITION BY entity_theme,date_session ORDER BY entrances_30_60 DESC)
    as rank_entrances_30_60
FROM ga_running_sum
WHERE date_session > DATE_ADD(CURRENT_DATE(),INTERVAL -3 DAY)
)
SELECT
/* STEP 3 : Compute Evolution */
*
,ROUND(rank_entrances_3_4 - rank_entrances_1_2 ) as evol_rank_entrances_3_4_vs_rank_entrances_1_2
,ROUND(rank_entrances_7_14 - rank_entrances_1_7 ) as evol_rank_entrances_7_14_vs_rank_entrances_1_7
,ROUND(rank_entrances_30_60 - rank_entrances_1_30 ) as evol_rank_entrances_30_60_vs_rank_entrances_1_30
,ROUND(rank_entrances_1_7 - rank_entrances_1_2 ) as evol_rank_entrances_1_7_vs_rank_entrances_1_2
,ROUND(rank_entrances_1_30 - rank_entrances_1_2 ) as evol_rank_entrances_1_30_vs_rank_entrances_1_2
FROM ga_rank
WHERE date_session = (SELECT MAX(date_session)
    FROM `sql-class-mines-nancy-2023.a_bq_window_function.stats_ga`)

➢ Lecture : Useful software engineering knowledge for data

➢ Document format and NoSQL database

The document format you discovered in your MongoDB exercises is a very popular
one. You will find it as well in Python (cf. the dictionary data structure).
It has since been implemented in BigQuery and in relational databases like
PostgreSQL. MongoDB sacrifices data consistency for more horizontal
scalability combined with very good latency.

➢ Rest API and json response

The document format is a widely used standard. It is very popular in Rest
APIs :
- Popular tool to analyse API responses : https://www.postman.com/
- The requests module in Python lets you query an API (and load the data
  into a dictionary, for example)

What is an API ?
- It is an abstract term describing the idea of a communication protocol
  with a machine.
- There are various ways to do it : Rest APIs are very popular and can
  return data in various formats (XML, CSV files, …), JSON being one of
  the most popular.
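The link between a JSON response and a Python dictionary can be shown without calling any real API. In this sketch the response body is a made-up string with hypothetical field names; with the requests library, `response.json()` performs the same parsing step on the body returned by the server.

```python
import json

# Hypothetical body of an API response; the field names are made up.
response_body = (
    '{"movieId": 1, "title": "Toy Story (1995)", '
    '"genres": ["Animation", "Comedy"]}'
)

# JSON object -> Python dict, JSON array -> Python list
movie = json.loads(response_body)

# Nested values are reached like in any dictionary / list.
first_genre = movie["genres"][0]

# The reverse operation turns the dictionary back into a JSON string.
round_trip = json.loads(json.dumps(movie))

if __name__ == "__main__":
    print(type(movie).__name__, movie["title"], first_genre)
```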

➢ Companies use SaaS connectors to query APIs

Using a SaaS connector to query APIs lets you reuse scripts already
developed by others.

Examples :
- At Webedia : https://rivery.io/
- Another tool very popular among start-ups : https://airbyte.com/

➢ Versioning

Versioning lets you track the successive modifications of a code base. It is
especially useful when :
- Several people work on the same project.
- You want to analyse the impact of some modifications later on.

Git is a free and open-source distributed version control system designed to
handle everything from small to very large projects with speed and efficiency.
It works by tracking differences : which lines are added and which lines
are removed.
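The idea of tracking added and removed lines can be illustrated with Python's standard difflib module. This is not how Git is implemented internally; it is only a sketch of what a line-level difference between two versions of a file looks like, and `line_changes` is a hypothetical helper.

```python
import difflib


def line_changes(old_text, new_text):
    """Return (added, removed) lines between two versions of a file."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    added, removed = [], []
    for line in diff:
        if line.startswith("+ "):      # line only present in the new version
            added.append(line[2:])
        elif line.startswith("- "):    # line only present in the old version
            removed.append(line[2:])
        # lines starting with "  " are unchanged, "? " lines are hints
    return added, removed


if __name__ == "__main__":
    old = "SELECT *\nFROM ratings\nLIMIT 10"
    new = "SELECT *\nFROM ratings\nLIMIT 1000"
    print(line_changes(old, new))
```

A modified line shows up as one removed line plus one added line, which is also how it appears in a Git diff.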

On top of it, private companies offer services to host Git repositories and
interact with them :
- https://github.com/ is one of the most popular (it was bought by
  Microsoft for 7.5 billion dollars in 2018).
- Another example : https://about.gitlab.com/

➢ Lecture : No code / low code and tool selection

➢ No code objective

The objective of no code is to enable a faster development cycle by using
tools accessible through a high-level interface (Datastudio and BI tools).

How ?
Reusing components/macros already coded by others allows you to spend more
time focusing on your problem and its impact, rather than on “logistics”.

By reducing the number of areas of expertise needed for a project, you can :
- Focus on hiring people that fit your specific issues
- Reduce the number of stakeholders in the project, easing
  communication and decision making.

➢ Gojob example

https://gojob.com/

Gojob is a temporary staffing (“intérim”) company. It invested a lot in tech
to increase its operational efficiency by optimizing sourcing and matching
between companies and job seekers.

In collaboration with the ops team, they develop and test new processes,
building the tool changes these require with no-code solutions.

What Gojob is looking for in its no-coders :
- Product skills : understand the problem and target the impact.
- Data skills : being able to access company data (understanding how it
  is organized, being able to query it).
- Maker : configure no-code tools.
- Maker : configure no code tools.

➢ Strapi

Strapi is a no code / low code tool that enables you to create :
- A database
- A user interface to feed the database
- A Rest API to query the database

To do so, you configure a relational model through a no code interface. You
need to define the objects, their schemas, and the relations between them.

To test it in the cloud : https://strapi.io/demo

➢ Open-source trend

A lot of recent tools are available in an open-source version and are
monetized through an enterprise plan on the side.
Examples of recent tools funded by VC : dbt, Strapi.

➢ Make or buy trade-off and some criteria to select a vendor

Why go for the “buy” ?
- The feature is a commodity, so it is not strategic
- Speed/time to market

Why go for the “make” ?
- Nothing fits your need
- Too expensive / you already have the expertise to develop the project
- Too strategic for the company : it should be a strategic advantage vs
  the rest of the market

Checklist to consider for this trade-off at Webedia
1. Customization allows us to tweak the tool to Webedia’s specific extra
   needs :
   o Usually 80% of our needs are shared with other customers, we need
     to be able to adapt the tool to solve the remaining 20%
   o Usually customization = tool open through an API (<> open source)
2. Data ownership : the data belongs to Webedia and is easily accessible
3. Pricing considerations
   o Costs should not explode if usage grows
   o Avoid the vendor lock-in effect : be in a position to leave if the
     price is too high
   To illustrate it with Google Cloud Platform : there are competitors
   with similar services, and GCP follows market standards in terms of
   product design.
4. Viability of the vendor allows for a long-term collaboration
5. External strategic factors
   o Retailers do not want to be Amazon Web Services customers
   o Webedia went for Google Cloud Platform because of more native
     integrations with Google data products, and because Webedia is big
     enough to be considered in negotiations (but too small to be
     considered as a competitor)
