Top SQL Queries for Data Scientists

Last Updated : 23 Jul, 2025

SQL (Structured Query Language) is one of the core tools for data manipulation and analysis. Data scientists rely on SQL queries to select, modify, and analyze large data sets efficiently, and well-written queries improve the quality of the findings drawn from that data.

This article covers the SQL queries that every data scientist should be comfortable with, from filtering, aggregation, and joins to window functions, CTEs, and data modification.

Basic SQL Queries

Retrieving Data with SELECT

The SELECT statement is fundamental for retrieving data from a database. For example, to retrieve all columns from a table named employees:

SELECT * FROM employees;

Filtering Data with WHERE

The WHERE clause allows you to filter data based on specific conditions. To find employees in the 'Sales' department:

SELECT * FROM employees WHERE department = 'Sales';

Sorting Data with ORDER BY

The ORDER BY clause sorts the result set. To sort employees by their salary in descending order:

SELECT * FROM employees ORDER BY salary DESC;

Limiting Results with LIMIT

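The LIMIT clause restricts the number of rows returned. It is supported by MySQL, PostgreSQL, and SQLite; SQL Server uses TOP instead, and the SQL standard spells it FETCH FIRST. To get the top 5 highest-paid employees: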

SELECT * FROM employees ORDER BY salary DESC LIMIT 5;
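The four clauses above are usually combined in one query. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the sample rows are hypothetical, and only the employees table and its column names come from the examples above.

```python
import sqlite3

# Hypothetical sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 62000),
        ("Ben", "Sales", 48000),
        ("Chen", "Engineering", 91000),
        ("Dita", "Engineering", 85000),
        ("Eli", "Marketing", 53000),
        ("Fay", "Marketing", 39000),
    ],
)

# WHERE filters rows, ORDER BY sorts them, LIMIT caps how many come back.
# The ? placeholder passes the department as a bound parameter.
top_sales = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE department = ? ORDER BY salary DESC LIMIT 1",
    ("Sales",),
).fetchall()

# The three highest-paid employees across all departments.
top_three = conn.execute(
    "SELECT name FROM employees ORDER BY salary DESC LIMIT 3"
).fetchall()

print(top_sales)
print(top_three)
```

Without ORDER BY, LIMIT returns an arbitrary subset of rows, so the two clauses are almost always used together for "top N" questions.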

Aggregation and Grouping

Using Aggregate Functions

Aggregate functions such as SUM, AVG, COUNT, MIN, and MAX compute a single value from multiple rows. For example, to get the total salary expense:

SELECT SUM(salary) FROM employees;

Grouping Data with GROUP BY

The GROUP BY clause groups rows that have the same values in the listed columns so that aggregates are computed once per group. To find the average salary by department:

SELECT department, AVG(salary) FROM employees GROUP BY department;

Filtering Groups with HAVING

The HAVING clause filters groups based on aggregate conditions. To find departments with an average salary above 50,000:

SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000;
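To see how the three aggregation queries relate, the sketch below runs them against a small in-memory SQLite table through Python's sqlite3 module. The sample rows and the avg_salary alias are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 60000),
        ("Ben", "Sales", 30000),
        ("Chen", "Engineering", 90000),
        ("Dita", "Engineering", 80000),
    ],
)

# One aggregate over the whole table.
total = conn.execute("SELECT SUM(salary) FROM employees").fetchone()[0]

# One aggregate per department.
by_dept = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

# HAVING filters after grouping, so it can test the aggregate itself.
high_avg = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary FROM employees "
    "GROUP BY department HAVING AVG(salary) > 50000"
).fetchall()

print(total)
print(by_dept)
print(high_avg)
```

WHERE cannot express the last condition because it is evaluated before rows are grouped, when no average exists yet.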

Advanced Filtering Techniques

Using Subqueries in WHERE Clause

Subqueries can be used within a WHERE clause to filter data. To find employees who earn more than the average salary:

SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

Correlated Subqueries

A correlated subquery references columns from the outer query, so it is evaluated once for each outer row. To find employees who have the highest salary in their department:

SELECT * FROM employees e1 WHERE salary = (SELECT MAX(salary) FROM employees e2 WHERE e1.department = e2.department);
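The difference between the two subquery styles is easier to see on concrete rows. This is a small sketch with hypothetical data, run through Python's sqlite3 module: the first subquery produces one value for the whole table, while the correlated one is evaluated per department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 60000),
        ("Ben", "Sales", 30000),
        ("Chen", "Engineering", 90000),
        ("Dita", "Engineering", 80000),
    ],
)

# Uncorrelated: the inner query runs once and yields the overall average.
above_avg = conn.execute(
    "SELECT name FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees) "
    "ORDER BY name"
).fetchall()

# Correlated: the inner query depends on e1.department of the current outer row.
top_per_dept = conn.execute(
    "SELECT e1.name, e1.department FROM employees e1 "
    "WHERE e1.salary = (SELECT MAX(e2.salary) FROM employees e2 "
    "                   WHERE e2.department = e1.department) "
    "ORDER BY e1.department"
).fetchall()

print(above_avg)
print(top_per_dept)
```

If two employees share the top salary in a department, the correlated query returns both of them.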

Using CASE Statements for Conditional Logic

The CASE statement allows for conditional logic. To categorize employees based on their salary:

SELECT name, salary,
    CASE
        WHEN salary > 70000 THEN 'High'
        WHEN salary BETWEEN 50000 AND 70000 THEN 'Medium'
        ELSE 'Low'
    END AS salary_category
FROM employees;
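The boundaries of the CASE expression are worth checking on edge values. The sketch below (hypothetical rows, Python's sqlite3 module) includes salaries exactly at 70000 and 50000 plus a missing salary; BETWEEN is inclusive at both ends, and a NULL salary matches neither WHEN branch, so it falls through to ELSE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Asha", 80000),
        ("Ben", 70000),   # upper boundary of BETWEEN
        ("Chen", 50000),  # lower boundary of BETWEEN
        ("Dita", 49999),
        ("Eli", None),    # NULL salary
    ],
)

rows = conn.execute(
    """
    SELECT name,
        CASE
            WHEN salary > 70000 THEN 'High'
            WHEN salary BETWEEN 50000 AND 70000 THEN 'Medium'
            ELSE 'Low'
        END AS salary_category
    FROM employees
    ORDER BY name
    """
).fetchall()

categories = dict(rows)
print(categories)
```

If unknown salaries should not be reported as 'Low', an extra branch such as WHEN salary IS NULL THEN 'Unknown' can be placed before the others.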

Joins and Unions

Understanding Different Types of Joins

Joins combine rows from two or more tables. An INNER JOIN returns only matching rows:

SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;

A LEFT JOIN returns all rows from the left table, and matching rows from the right table:

SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;
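A quick way to compare the two join types is to run both against the same data. This sketch uses Python's sqlite3 module with hypothetical rows, including one employee who has no department, which is the row the two joins treat differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering');
    INSERT INTO employees VALUES ('Asha', 1), ('Ben', 2), ('Cara', NULL);
    """
)

# INNER JOIN keeps only employees with a matching department.
inner_rows = conn.execute(
    "SELECT e.name, d.department_name FROM employees e "
    "INNER JOIN departments d ON e.department_id = d.id "
    "ORDER BY e.name"
).fetchall()

# LEFT JOIN keeps every employee; unmatched department columns come back as NULL.
left_rows = conn.execute(
    "SELECT e.name, d.department_name FROM employees e "
    "LEFT JOIN departments d ON e.department_id = d.id "
    "ORDER BY e.name"
).fetchall()

print(inner_rows)
print(left_rows)
```

In Python, SQL NULL is returned as None, so the unmatched employee appears with None as the department name.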

Combining Results with UNION and UNION ALL

The UNION operator combines the result sets of two queries, removing duplicates:

SELECT name FROM employees
UNION
SELECT name FROM contractors;

The UNION ALL operator includes duplicates:

SELECT name FROM employees
UNION ALL
SELECT name FROM contractors;
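The effect of duplicate removal shows up as soon as one name appears in both tables. A minimal sketch with hypothetical rows, using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT);
    CREATE TABLE contractors (name TEXT);
    INSERT INTO employees VALUES ('Asha'), ('Ben');
    INSERT INTO contractors VALUES ('Ben'), ('Cara');
    """
)

# UNION removes the duplicate 'Ben'.
union_rows = conn.execute(
    "SELECT name FROM employees UNION SELECT name FROM contractors ORDER BY name"
).fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute(
    "SELECT name FROM employees UNION ALL SELECT name FROM contractors ORDER BY name"
).fetchall()

print(union_rows)
print(union_all_rows)
```

UNION ALL skips the deduplication step, so it is the cheaper choice when duplicates are impossible or wanted.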

Handling NULL Values in Joins

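NULL values can affect join results. In a LEFT JOIN, employees without a matching department get NULL in every column taken from departments. Filtering with WHERE d.department_name IS NOT NULL would drop those rows and reduce the query to an INNER JOIN, so to keep them, substitute a default value with COALESCE:

SELECT e.name, COALESCE(d.department_name, 'Unassigned') AS department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;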
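There are two usual treatments for unmatched rows in a LEFT JOIN: filter them out with IS NOT NULL, or keep them and label them with COALESCE. The sketch below contrasts the two on hypothetical data through Python's sqlite3 module; the 'Unassigned' label is an assumption chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Sales');
    INSERT INTO employees VALUES ('Asha', 1), ('Ben', NULL);
    """
)

# Filtering on IS NOT NULL discards the unmatched employee,
# which gives the same rows an INNER JOIN would.
filtered = conn.execute(
    """
    SELECT e.name, d.department_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.id
    WHERE d.department_name IS NOT NULL
    ORDER BY e.name
    """
).fetchall()

# COALESCE keeps the unmatched employee and replaces NULL with a label.
labelled = conn.execute(
    """
    SELECT e.name, COALESCE(d.department_name, 'Unassigned') AS department_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.id
    ORDER BY e.name
    """
).fetchall()

print(filtered)
print(labelled)
```

Which treatment is right depends on the question: head counts per department usually want the filter, while a full staff list usually wants the label.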

Advanced SQL Functions

String Functions

String functions manipulate text data. For example, to concatenate first and last names:

SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM employees;

Date and Time Functions

Date functions handle date and time data. To get the current date and time:

SELECT NOW();

Numeric Functions

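Numeric functions perform operations on numbers. To round salaries to the nearest thousand (a negative second argument rounds to the left of the decimal point; MySQL, SQL Server, and PostgreSQL on numeric columns support this, while SQLite does not):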

SELECT ROUND(salary, -3) FROM employees;
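Function names differ between database engines, so the three examples above may need adjusting. As a hedged sketch, here are equivalents that run in SQLite through Python's sqlite3 module: the standard || operator for concatenation, CURRENT_TIMESTAMP for the current date and time, and a divide-round-multiply expression for rounding to the nearest thousand. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Rao", 61700), ("Ben", "Ortiz", 48200)],
)

rows = conn.execute(
    """
    SELECT first_name || ' ' || last_name AS full_name,      -- string concatenation
           ROUND(salary / 1000.0) * 1000 AS rounded_salary   -- nearest thousand
    FROM employees
    ORDER BY full_name
    """
).fetchall()

# CURRENT_TIMESTAMP returns the current UTC date and time as text in SQLite.
now_utc = conn.execute("SELECT CURRENT_TIMESTAMP").fetchone()[0]

print(rows)
print(now_utc)
```

When a query has to run on more than one engine, it is worth checking each function against that engine's documentation before relying on it.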

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them into a single output row the way GROUP BY does. To assign a row number to each employee:

SELECT name, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num FROM employees;

Using ROW_NUMBER, RANK, and DENSE_RANK

These functions assign ranks to rows. ROW_NUMBER gives every row a unique number, even when salaries tie:

SELECT name, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num FROM employees;

RANK gives tied rows the same rank and leaves a gap after them:

SELECT name, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;

DENSE_RANK also gives tied rows the same rank but leaves no gaps:

SELECT name, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;
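The three ranking functions only differ when values tie, so the sketch below uses hypothetical data with two equal salaries. It runs through Python's sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The name column is added to the ROW_NUMBER ordering as a tie-breaker so the numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Asha", 80000), ("Ben", 70000), ("Chen", 90000), ("Dita", 80000)],
)

rows = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num,
           RANK()       OVER (ORDER BY salary DESC)       AS salary_rank,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS dense_salary_rank
    FROM employees
    ORDER BY salary DESC, name
    """
).fetchall()

for row in rows:
    print(row)
```

Asha and Dita tie on salary: ROW_NUMBER still numbers them 2 and 3, RANK gives both 2 and then jumps to 4, and DENSE_RANK gives both 2 and continues with 3.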

Aggregating Data with OVER Clause

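The OVER clause defines the window for aggregate functions. To calculate a running total of salaries (the explicit ROWS frame and the name tie-breaker keep employees with equal salaries from sharing one cumulative value):

SELECT name, salary, SUM(salary) OVER (ORDER BY salary, name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM employees;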
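Running totals behave differently when the ordering column has ties. With only ORDER BY in the OVER clause, the default frame includes all rows that tie with the current one, so tied rows share the same total. The sketch below (hypothetical data, Python's sqlite3 module, SQLite 3.25 or later assumed) compares that default with an explicit ROWS frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Asha", 50000), ("Ben", 60000), ("Chen", 60000), ("Dita", 70000)],
)

rows = conn.execute(
    """
    SELECT name,
           -- Default frame: tied salaries are added in together.
           SUM(salary) OVER (ORDER BY salary) AS range_total,
           -- Explicit ROWS frame with a tie-breaker: one row at a time.
           SUM(salary) OVER (
               ORDER BY salary, name
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total
    FROM employees
    ORDER BY salary, name
    """
).fetchall()

for row in rows:
    print(row)
```

Ben and Chen both show 170000 in the first total, while the second total steps through 110000 and then 170000.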

Common Table Expressions (CTEs)

Basics of CTEs

CTEs define temporary result sets. To define and use a CTE:

WITH HighSalaryEmployees AS (
    SELECT * FROM employees WHERE salary > 70000
)
SELECT * FROM HighSalaryEmployees;

Recursive CTEs for Hierarchical Data

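Recursive CTEs handle hierarchical data by combining an anchor query with a query that refers back to the CTE itself. PostgreSQL, MySQL 8.0 and later, and SQLite use WITH RECURSIVE; SQL Server uses plain WITH. To list an employee hierarchy starting from employees who have no manager:

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT id, name, manager_id FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id FROM employees e
    INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.id
)
SELECT * FROM EmployeeHierarchy;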
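A common extension of the hierarchy query is to track how deep each employee sits. The sketch below adds a level column, which is not part of the query above and is introduced here only for illustration. It runs on hypothetical rows through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "Asha", None),  # top of the hierarchy
        (2, "Ben", 1),
        (3, "Chen", 1),
        (4, "Dita", 2),
    ],
)

rows = conn.execute(
    """
    WITH RECURSIVE EmployeeHierarchy(id, name, level) AS (
        -- Anchor: employees with no manager start at level 0.
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive step: direct reports sit one level below their manager.
        SELECT e.id, e.name, eh.level + 1
        FROM employees e
        INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.id
    )
    SELECT id, name, level FROM EmployeeHierarchy ORDER BY level, id
    """
).fetchall()

for row in rows:
    print(row)
```

The recursion stops on its own once a step finds no further reports; data containing a management cycle would loop, so real tables may need a depth limit.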

Using CTEs for Complex Queries

CTEs simplify complex queries by naming an intermediate result that later parts of the query can reuse. To calculate the total and average salary for each department:

WITH DepartmentSalaries AS (
    SELECT department, SUM(salary) AS total_salary, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT * FROM DepartmentSalaries;
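The benefit of a CTE becomes clearer when the named result is joined back to another table. As a sketch, the query below reuses DepartmentSalaries to list employees who earn more than their department's average; the sample rows and the diff_from_avg alias are hypothetical, and the query runs through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "Sales", 60000),
        ("Ben", "Sales", 30000),
        ("Chen", "Engineering", 90000),
        ("Dita", "Engineering", 80000),
    ],
)

rows = conn.execute(
    """
    WITH DepartmentSalaries AS (
        SELECT department, SUM(salary) AS total_salary, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    )
    -- Join each employee to the summary row for their department.
    SELECT e.name, e.department, e.salary - ds.avg_salary AS diff_from_avg
    FROM employees e
    JOIN DepartmentSalaries ds ON e.department = ds.department
    WHERE e.salary > ds.avg_salary
    ORDER BY e.name
    """
).fetchall()

for row in rows:
    print(row)
```

The same result could be written with a correlated subquery, but the CTE keeps the per-department summary in one named, readable place.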

Data Modification Queries

Inserting Data with INSERT

The INSERT statement adds new rows to a table. To insert a new employee:

INSERT INTO employees (name, department, salary) VALUES ('John Doe', 'Sales', 60000);

Updating Data with UPDATE

The UPDATE statement modifies existing data. To give all employees in 'Sales' a 10% raise:

UPDATE employees SET salary = salary * 1.10 WHERE department = 'Sales';

Deleting Data with DELETE

The DELETE statement removes rows from a table. To delete employees with a salary below 30000:

DELETE FROM employees WHERE salary < 30000;
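Because these statements change data, they are normally run inside a transaction and with bound parameters rather than values pasted into the SQL string. The sketch below shows one way to do that with Python's sqlite3 module; the sample rows are hypothetical, and ROUND is added to the raise so the stored salaries stay whole numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Sales", 50000), ("Chen", "Engineering", 90000), ("Fay", "Marketing", 25000)],
)

# "with conn" commits if the block succeeds and rolls back if it raises,
# so the three changes are applied together or not at all.
with conn:
    conn.execute(
        "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
        ("John Doe", "Sales", 60000),
    )
    updated = conn.execute(
        "UPDATE employees SET salary = ROUND(salary * 1.10) WHERE department = ?",
        ("Sales",),
    ).rowcount
    deleted = conn.execute(
        "DELETE FROM employees WHERE salary < ?", (30000,)
    ).rowcount

remaining = conn.execute(
    "SELECT name, salary FROM employees ORDER BY name"
).fetchall()

print(updated, deleted)
print(remaining)
```

Checking rowcount after an UPDATE or DELETE is a cheap safeguard: an unexpectedly large number usually means the WHERE clause was wrong or missing.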

Merging Data with MERGE (Upserts)

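The MERGE statement combines insert and update operations in a single statement. It is available in SQL Server, Oracle, and PostgreSQL 15 and later, with syntax that varies slightly between them; MySQL and SQLite offer INSERT ... ON DUPLICATE KEY UPDATE and INSERT ... ON CONFLICT instead. To insert or update employee records from a staging table named new_employees:

MERGE INTO employees AS target
USING new_employees AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET name = source.name, salary = source.salary
WHEN NOT MATCHED THEN
    INSERT (id, name, salary) VALUES (source.id, source.name, source.salary);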
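Not every engine implements MERGE. As a hedged sketch of the same upsert in SQLite, the example below uses INSERT ... ON CONFLICT DO UPDATE through Python's sqlite3 module, assuming SQLite 3.24 or later. The sample rows are hypothetical, and the WHERE true clause is there because SQLite needs it to parse ON CONFLICT after an INSERT ... SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
    CREATE TABLE new_employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES (1, 'Asha', 50000), (2, 'Ben', 40000);
    INSERT INTO new_employees VALUES (2, 'Ben', 45000), (3, 'Chen', 70000);
    """
)

with conn:
    conn.execute(
        """
        INSERT INTO employees (id, name, salary)
        SELECT id, name, salary FROM new_employees WHERE true
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,      -- "excluded" is the row that failed to insert
            salary = excluded.salary
        """
    )

rows = conn.execute("SELECT id, name, salary FROM employees ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Employee 2 already exists and is updated, employee 3 is new and is inserted, and employee 1 is left untouched.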

Conclusion

SQL is an essential part of a data scientist's toolkit because it supports efficient data extraction, manipulation, and analysis. Data scientists need both basic and advanced SQL to manage varied data sets and extract useful information from them. SELECT, WHERE, and JOIN are the foundation for retrieving data, while window functions and CTEs extend what can be calculated and reported in a single query. Applied together, these queries make day-to-day analysis faster and less error-prone and support better-informed decisions across domains.

