100% found this document useful (1 vote)
51 views

Data Science: Part 2 - SQL

SQL is an important skill for data scientists to extract and prepare data from multiple sources for machine learning models. Some key reasons include: (1) In industry, datasets often need to be prepared from multiple tables using SQL queries involving joins, aggregations, etc. (2) As a machine learning engineer experiments with different features, SQL is useful to try new feature extractions. (3) SQL is needed for general analytics on big data beyond the limitations of tools like Excel. The document then provides examples of SQL case studies and questions often asked in interviews.

Uploaded by

Himanshu Patidar
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
51 views

Data Science: Part 2 - SQL

SQL is an important skill for data scientists to extract and prepare data from multiple sources for machine learning models. Some key reasons include: (1) In industry, datasets often need to be prepared from multiple tables using SQL queries involving joins, aggregations, etc. (2) As a machine learning engineer experiments with different features, SQL is useful to try new feature extractions. (3) SQL is needed for general analytics on big data beyond the limitations of tools like Excel. The document then provides examples of SQL case studies and questions often asked in interviews.

Uploaded by

Himanshu Patidar
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Data Science

Part 2 - SQL

CANTILEVER LABS IS OFFICIAL TRAINING PARTNER OF

IIT BOMBAY | IIT MADRAS | IIT KHARGPUR | IIT HYDERABAD

BITS PILANI | BITS HYDERABAD | NIT ROURKELA | SYMBIOSIS

JNTU HYDERABAD | SREENIDHI | MAHINDRA UNIVERSITY | GITAM srm


chennai | gnits | cmr-cet | GEETHANJALI | CHITKARA

& many more

Index
Part 2/4 - SQL
Part 1 - Python Part 3 - P&S Part 4 - Deep into Data Science

01
Lets understand first, As a Data scientist why do we need SQL?

02
Case study 1: (PAYPAL interview)

03
Case study 3 : Joins

04
Case study 4 : Analyzing telecom data

05
GENERAL SQL QUESTIONS
Lets understand first, As a Data scientist why do we need SQL?

When we learn machine learning academically, we use datasets from Kaggle or other
such websites. Many of those datasets are readymade (we directly get a csv file with x
rows and y columns). In industrial setting, we have to prepare datasets from multiple
data sources(tables), build hypothesis and test them. There can be multiple SQL queries
running at backend to prepare just one column(feature) in your dataset, which may
involve aggregation, ordering, windowing, joins and many such SQL operations.

As machine learning models performance majorly depends on quality features its been
trained on, So while project is in development phase, we have to do lot of
experimentation with hyperparameters and quality features, for which we have to try
new set of features for improvements, that’s where at least basic understanding of SQL
comes handy. Later when pilots are successful, these data extraction pipelines will be
automated by data engineers where you need advanced knowledge to optimize
workflows.

Also for general purpose analytics to gain more insights from data we need SQL as
Excel has its limitations when it comes to big data.

Some examples of features that you will be using as input dataset of your machine
learning model in various industries are,

Telecom : How many times customer did recharge after expiry of his prepaid plan, Avg
of last 3 recharges MRP

Finance : Sum/avg of top 3 high value transactions of customer, days passed since
recent transaction, creditworthiness

Manufacturing : Number of times maintenance activity performed, days between each


maintenance or breakdowns of machine

Ecommerce : Tag customers who did > 50$ purchase in their 2nd transaction (high
value repeating customers)

In following sections you will find 2 types of question sets, one will be case study
based, which are mostly asked in interviews, second will be fundamental questions. At
the end we also have interview checklist for SQL and some useful links to learn and
practice sql.

01 Data Science | Part 2 - SQL


Case study 1: (PAYPAL interview)

Table 1 (daily transaction data) columns : Pymt_ID, Pymt_Date, Sndr_ID, Rcvr_ID, Amt

Table 2 columns : Rcvr_ID, Rcvr_name, Rcvr_Industry

Table 3 columns : Sndr_id, Sndr_name, Sndr_age

Q. Which industry has 3rd highest total receiving amount.

Case study 2 : Window functions

Table Name: Employee_MST (keeps record of active employee salary and dept)

Table Name: Employee_DTL (keeps record of all employees associated with company)

Q . Refer Above tables and Write a Query which gives below output,

02 Data Science | Part 2 - SQL


A. Output table has only employee names which have joined recently. Here concept
used is first get department wise ranking using window function with descending order
of dates and use that table with alias and then get the data which has recent date rank
(row_number) to get only recently joined employee

Case study 3 : Joins


Employee_name

Employee_dtl

Q. Write a Query which gives below Output.

03 Data Science | Part 2 - SQL


A:

Case study 4 : Analyzing telecom data

Table : 1 year data of recharges done by subscribers

Q1. HOW MANY TOTAL RECHARGES EACH SUBSCRIBER HAS DONE IN JUNE MONTH

Q2. WHICH RECHARGE PLAN MRP IS SUBSCRIBED MOST

Q3. EXTRACT CUSTOMERS WHO HAVE DONE MORE THAN 15 RECHARGES

04 Data Science | Part 2 - SQL


Q4. GET TOTAL, AVG AND MAXIMUM OF RECENT 3 RECHARGE AMOUNT OF
SUBSCRIBER

Q5. THERE ARE HOW MANY SUCH CUSTOMERS IN SYSTEM, WHO HAVE NOT DONE
ANY RECHARGE FOR LAST 35 DAYS

Q6. GET RECENT RECHARGE OF SUBSCRIBERS

GENERAL SQL QUESTIONS


Q1) What is the difference between ISNULL and COALESCE?

ISNULL is used when we want null values as imputed by our specified value in final table.

COALESCE returns first non null entry

Q2. What are different SQL commands : ( As a data scientist we majorly deal with
DDL, DML,DQL )

05 Data Science | Part 2 - SQL


Q3 . Types of joins in SQL :

Q4 . Data types in SQL :

Q5. What is the difference between Delete, Truncate and Drop ?


Delete : We can delete all rows or targeted rows based on condition

Truncate : We can delete all rows from table at once

Drop : We can delete entire table from database

06 Data Science | Part 2 - SQL


Q6. How is “PARTITION BY” different from “GROUP BY”?
PARTITION BY gives aggregated columns with each record in the specified table. If we have 15 records in
the table, the query output SQL PARTITION BY also gets 15 rows. On the other hand, GROUP BY gives
one row per group in result set.

E.g. : Suppose we have below table of student heights in class A and B,

We want to know avg. height of students from class A and B,

Group by clause will give below output

But, Now if I want to see each students height compared to their class avg. height, we will use partition
by clause as below.

Output :

Now its more informative for me to see each student height as well as class avg.

Q7. What is order of each SQL clause

07 Data Science | Part 2 - SQL


Q8. What is the difference between RANK() ,ROW_NUMBER() and DENSE_RANK() ?
Rank() : it is used in window function, it ranks the data as per order given in window. It skips the ranking
if it finds similar record for that window

Dense_rank() : it works in similar way as of rank(), but it does not skip ranking if it finds duplicate in
window

Row_num() : it returns simply row number of record in window function.

Q9.
Grouping Data and Using Aggregate Functions

Ordering Data Results

Selecting Data from Multiple Tables ( Joins )

Q10. Different types of aggregate functions


COUNT()
SUM()
MIN()
MAX()
AVG()
STDEV()
VAR()

08 Data Science | Part 2 - SQL


Q11. What are Constraints in SQL?
NOT NULL - Restricts NULL value from being inserted into a column.
CHECK - Verifies that all values in a field satisfy a condition.
DEFAULT - Automatically assigns a default value if no value has been specified for the field.
UNIQUE - Ensures unique values to be inserted into the field.
INDEX - Indexes a field providing faster retrieval of records.
PRIMARY KEY - Uniquely identifies each record in a table.
FOREIGN KEY - Ensures referential integrity for a record in another table.

Q12 . What are ACID properties?


Atomicity: This property ensures that the transaction is completed in all-or-nothing way.
Consistency: This ensures that updates made to the database is valid and follows rules and
restrictions.
Isolation: This property ensures integrity of transaction that are visible to all other transactions.
Durability: This property ensures that the committed transactions are stored permanently in the
database

Q13 . How to find the 5th highest salary in SQL?

Q14. What is cte in SQL


CTEs are Common Table Expressions that are used to create temporary result tables from which data
can be retrieved/ used.

Interview checklist for SQL :

Before interview, you should have at least solved problems that contain following SQL
clauses.

Group by, Order by, having, window functions, is null, rank, dense_rank, row_number,
min, max, avg, stdev, count, all types of joins, like, wildcards,.

09 Data Science | Part 2 - SQL


Useful links :

https://www.w3schools.com/sql/

https://www.codecademy.com/courses/learn-sql/lessons/aggregate-functions/exercises/
intro\
https://www.hackerrank.com/domains/sql\

09 Data Science | Part 2 - SQL


Part 1/4 - Python

Part 2 - SQL

Next Part 3 - P&S


Part 4 - Deep into Data Science

@cantilever_labs

@cantilever labs

@cantilever labs

www.cantileverlabs.com

Data Science | Part 1 - Python

You might also like