Data Science: Part 2 - SQL
Index
Part 2/4 - SQL (Part 1 - Python, Part 3 - P&S, Part 4 - Deep into Data Science)
01 Let's understand first: as a data scientist, why do we need SQL?
02 Case study 1: (PAYPAL interview)
03 Case study 3: Joins
04 Case study 4: Analyzing telecom data
05 GENERAL SQL QUESTIONS
Let's understand first: as a data scientist, why do we need SQL?
When we learn machine learning academically, we use datasets from Kaggle or similar
websites. Many of those datasets are ready-made (we directly get a CSV file with x
rows and y columns). In an industrial setting, we have to prepare datasets from multiple
data sources (tables), build hypotheses and test them. Multiple SQL queries may run
in the backend to prepare just one column (feature) of your dataset, and these may
involve aggregation, ordering, windowing, joins and many other SQL operations.
A machine learning model's performance depends largely on the quality of the features
it has been trained on. While a project is in the development phase, we have to do a lot
of experimentation with hyperparameters and features, trying new sets of features
for improvements. That is where at least a basic understanding of SQL comes in handy.
Later, when pilots are successful, these data extraction pipelines will be automated
by data engineers, which is where you need advanced knowledge to optimize
workflows.
Also, for general-purpose analytics to gain more insights from data, we need SQL, as
Excel has its limitations when it comes to big data.
Some examples of features that you will use as inputs to your machine learning
model in various industries are:
Telecom: How many times the customer recharged after expiry of their prepaid plan,
average MRP of the last 3 recharges
Finance: Sum/avg of the customer's top 3 high-value transactions, days passed since
the most recent transaction, creditworthiness
Ecommerce: Tag customers who made a purchase of > $50 in their 2nd transaction
(high-value repeat customers)
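As a rough illustration of how one such feature could be built, here is a minimal sketch of the Ecommerce example using a window function. The table name orders, its columns, and the sample rows are hypothetical (the source does not define a schema), and the SQL is run through Python's built-in sqlite3 module, which is assumed to bundle SQLite 3.25 or newer for window-function support.

```python
import sqlite3

# Hypothetical schema and data, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-05", 20.0),
        (1, "2023-02-10", 75.0),  # 2nd purchase above 50 -> tagged
        (2, "2023-01-07", 90.0),
        (2, "2023-03-01", 30.0),  # 2nd purchase below 50 -> not tagged
        (3, "2023-02-02", 60.0),  # only one purchase -> not tagged
    ],
)

# Number each customer's transactions by date, then keep the 2nd one if it exceeds 50.
query = """
WITH numbered AS (
    SELECT customer_id,
           amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS txn_no
    FROM orders
)
SELECT customer_id
FROM numbered
WHERE txn_no = 2 AND amount > 50
ORDER BY customer_id
"""
high_value_repeaters = [row[0] for row in conn.execute(query)]
print(high_value_repeaters)
```

The resulting list of customer IDs can then be joined back to the training dataset as a 0/1 flag column.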
In the following sections you will find 2 types of question sets: the first is case-study
based, of the kind mostly asked in interviews, and the second covers fundamental
questions. At the end we also have an interview checklist for SQL and some useful links
to learn and practice SQL.
Table 1 (daily transaction data) columns : Pymt_ID, Pymt_Date, Sndr_ID, Rcvr_ID, Amt
Table Name: Employee_MST (keeps record of active employee salary and dept)
Table Name: Employee_DTL (keeps record of all employees associated with company)
Q. Refer to the tables above and write a query which gives the output below.
Employee_dtl
Q1. HOW MANY TOTAL RECHARGES HAS EACH SUBSCRIBER DONE IN THE MONTH OF JUNE?
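One possible way to approach this question is a GROUP BY with a date filter. This is only a sketch: the table recharges, its columns, the sample rows, and the year 2023 are all assumptions, since the telecom table definition is not shown here.

```python
import sqlite3

# Hypothetical telecom recharge table; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recharges (subscriber_id INTEGER, recharge_date TEXT, mrp REAL)")
conn.executemany(
    "INSERT INTO recharges VALUES (?, ?, ?)",
    [
        (101, "2023-06-02", 199.0),
        (101, "2023-06-20", 249.0),
        (101, "2023-07-01", 199.0),  # July, excluded
        (102, "2023-06-15", 99.0),
        (103, "2023-05-30", 399.0),  # May, excluded
    ],
)

# Count recharges per subscriber whose date falls inside June.
query = """
SELECT subscriber_id, COUNT(*) AS june_recharges
FROM recharges
WHERE recharge_date >= '2023-06-01' AND recharge_date < '2023-07-01'
GROUP BY subscriber_id
ORDER BY subscriber_id
"""
june_counts = conn.execute(query).fetchall()
print(june_counts)
```

Note that a subscriber with no June recharge (103 here) does not appear in the result at all; showing them with a count of 0 would need a LEFT JOIN from a subscriber master table.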
Q5. HOW MANY CUSTOMERS IN THE SYSTEM HAVE NOT DONE ANY RECHARGE
IN THE LAST 35 DAYS?
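A minimal sketch of one way to answer this: take each customer's latest recharge date and keep those older than the 35-day cutoff. The table recharges, its columns, and the sample data are hypothetical, and "today" is pinned to an assumed date (2023-07-31) so the example is reproducible.

```python
import sqlite3

# Hypothetical recharge history; names, data, and the reference date are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recharges (subscriber_id INTEGER, recharge_date TEXT)")
conn.executemany(
    "INSERT INTO recharges VALUES (?, ?)",
    [
        (101, "2023-06-10"),
        (101, "2023-07-20"),  # recent recharge -> active
        (102, "2023-06-15"),  # last recharge older than the cutoff -> inactive
        (103, "2023-05-30"),  # inactive
        (104, "2023-06-28"),  # within 35 days of the reference date -> active
    ],
)

# Latest recharge per subscriber, compared against (reference date - 35 days).
query = """
SELECT COUNT(*)
FROM (
    SELECT subscriber_id
    FROM recharges
    GROUP BY subscriber_id
    HAVING MAX(recharge_date) < date(:today, '-35 days')
) AS inactive
"""
inactive_count = conn.execute(query, {"today": "2023-07-31"}).fetchone()[0]
print(inactive_count)
```

This only counts customers who have recharged at least once. Customers who never recharged would have to come from a customer master table, which is not assumed here.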
ISNULL is used when we want null values replaced (imputed) with a value we specify in the final table.
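The sketch below shows the idea of replacing nulls in a result. The two-argument ISNULL(column, value) form is SQL Server syntax; because this example runs on SQLite through Python's sqlite3 module, it uses COALESCE, which plays the same role here. The table names echo the employee tables mentioned earlier, but the columns and rows are hypothetical and this is not meant as the answer to that exercise.

```python
import sqlite3

# Hypothetical columns and rows; only the table names come from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_dtl (emp_id INTEGER, emp_name TEXT)")
conn.execute("CREATE TABLE employee_mst (emp_id INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee_dtl VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)
conn.executemany(
    "INSERT INTO employee_mst VALUES (?, ?)",
    [(1, 50000.0), (3, 65000.0)],  # emp_id 2 has no active salary record
)

# The LEFT JOIN yields NULL salary for emp_id 2; COALESCE replaces it with 0.0.
query = """
SELECT d.emp_name, COALESCE(m.salary, 0.0) AS salary
FROM employee_dtl AS d
LEFT JOIN employee_mst AS m ON m.emp_id = d.emp_id
ORDER BY d.emp_id
"""
salaries = conn.execute(query).fetchall()
print(salaries)
```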
Q2. What are the different SQL commands? (As a data scientist we mainly deal with
DDL, DML, DQL.)
But now, if I want to see each student's height compared to their class's average height, we will use the
PARTITION BY clause as below.
Output :
Now it is more informative for me to see each student's height as well as the class average.
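A small runnable sketch of this kind of PARTITION BY query follows. The table students, its columns, and the sample heights are hypothetical, and the SQL is executed with Python's sqlite3 module (assuming a bundled SQLite with window-function support, 3.25 or newer).

```python
import sqlite3

# Hypothetical student table; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class TEXT, height_cm REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("A", "10A", 150.0),
        ("B", "10A", 160.0),
        ("C", "10B", 170.0),
        ("D", "10B", 180.0),
    ],
)

# AVG(...) OVER (PARTITION BY class) keeps every row and attaches the class average,
# unlike GROUP BY, which would collapse each class into a single row.
query = """
SELECT name,
       class,
       height_cm,
       AVG(height_cm) OVER (PARTITION BY class) AS class_avg_height
FROM students
ORDER BY class, name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```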
Dense_rank(): it works in a similar way to rank(), but it does not skip rank values when it finds duplicates
in the window.
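The difference is easiest to see side by side. In this sketch the table scores and its rows are hypothetical, and the query is run through Python's sqlite3 module (again assuming SQLite 3.25 or newer).

```python
import sqlite3

# Hypothetical scores with one tie (B and C both have 90).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 95), ("B", 90), ("C", 90), ("D", 85)],
)

# RANK leaves a gap after the tie; DENSE_RANK continues with the next integer.
query = """
SELECT name,
       score,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores
ORDER BY score DESC, name
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

With these rows, D gets rank 4 from RANK() because ranks 2 and 3 are both consumed by the tie, while DENSE_RANK() gives D rank 3.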
Q9.
Grouping Data and Using Aggregate Functions
Before the interview, you should have solved at least a few problems that contain the
following SQL clauses:
Group by, Order by, having, window functions, is null, rank, dense_rank, row_number,
min, max, avg, stdev, count, all types of joins, like, wildcards.
https://www.w3schools.com/sql/
https://www.codecademy.com/courses/learn-sql/lessons/aggregate-functions/exercises/intro
https://www.hackerrank.com/domains/sql
@cantilever_labs
www.cantileverlabs.com