SQLV Unitwise Notes

The document provides comprehensive notes on SQL and database design, covering topics such as data warehouses, ER diagrams, OLAP vs OLTP, and SQL commands including DDL and DML. It explains key concepts like entity, referential, and semantic constraints, as well as various SQL functions and operators. Additionally, it details different schema types, joins, nested queries, and the use of views in SQL.

Uploaded by

Anand Manickam

UNIT-WISE NOTES

BHARATH INSTITUTE OF HIGHER EDUCATION AND RESEARCH


SCHOOL OF COMPUTING

SQL AND VISUALIZATION – U20CSST62

UNIT – 1 (DATABASE DESIGN AND INTRODUCTION TO MYSQL)


Data Warehouse, ERD, Star and Snowflake Schemas, OLAP vs OLTP, Entity Constraints, Referential
Constraints, Semantic Constraints, Comprehension: ERD, Introduction to SQL, DDL Statements, DML
Statements; SQL Basic Statements and Operators, Aggregate and Inbuilt Functions, String and Date-
Time Functions and Ordering, Regular Expressions, Nested Queries, Views; Venn Diagrams and Inner
and Outer Joins, Left and Right Join, Cross Join, Views with Join, Intersect, Minus, Union and Union
all.

Data Warehouse:
A Data Warehouse refers to a centralized repository that
stores integrated data from multiple sources, designed to
support analytical reporting and decision-making.
It's not a database like a traditional Online Transaction
Processing (OLTP) system; instead, it's a system optimized
for reporting, analysis, and data-driven decision-making.

Concept | Description
--- | ---
ETL | Extract, Transform, Load – process of collecting data from sources, transforming it into a usable format, and loading it into the warehouse.
OLAP | Online Analytical Processing – supports complex queries and analysis.
Schema Types | Star Schema, Snowflake Schema, Galaxy/Fact Constellation Schema.
Fact Table | Central table containing quantitative data (measurable events).
Dimension Table | Contains descriptive attributes related to dimensions (e.g., time, product, region).

ERD – Entity Relationship Diagram:


An entity relationship diagram (ER diagram or ERD) is a visual
representation of how items in a database relate to each other.
ERDs are a specialized type of flowchart that conveys the
relationship types between different entities within a system.
By defining the entities, their attributes, and showing the relationships between them,
an ER diagram can illustrate the logical structure of databases. This is useful for
engineers hoping to either document a database as it exists or sketch out a design of a new
database.
Styles of Cardinality:
Cardinality in the mathematical sense simply means the number of values in a set. In
relation to databases and ERDs, cardinality specifies how many instances of an entity relate to one
instance of another entity. Ordinality is also closely linked to cardinality. While cardinality specifies the
occurrences of a relationship, ordinality describes the relationship as either mandatory or optional. In
other words, cardinality specifies the maximum number of relationships and ordinality specifies the
absolute minimum number of relationships.

Here are some best practice tips for constructing an ERD:


• Identify the entities. The first step in making an ERD is to identify all the entities you will use.
An entity is nothing more than a rectangle with a description of something that your system
stores information about. This could be a customer, a manager, an invoice, a schedule, etc.
Draw a rectangle for each entity you can think of on your page. Keep them spaced out a bit.
• Identify relationships. Look at two entities: are they related? If so, draw a solid line connecting
the two entities and add a diamond between them with a brief description of how they are
related.
• Add attributes. Any key attributes of entities should be added using oval-shaped symbols.
• Complete the diagram. Continue to connect the entities with lines, adding diamonds to
describe each relationship, until all relationships have been described. Some entities may
have no relationships; others may have several.

Star and Snowflake Schemas:


A schema in SQL is a collection of database objects associated with a database. The database user
who owns those objects is called the schema owner (owner of logically grouped structures of data). A
schema always belongs to a single database, whereas a database can have one or more schemas.
• Star Schema is a type of multidimensional model used for data warehouses. It contains fact
tables and dimension tables, and queries need fewer foreign-key joins than in a snowflake
schema. It forms a star structure with a central fact table connected to the surrounding
dimension tables.

• Snowflake Schema is also a type of multidimensional model used for data warehouses. It
contains fact tables, dimension tables and sub-dimension tables: dimension tables are
normalized into further tables, which gives the structure its snowflake shape.
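The star structure described above can be sketched with a small runnable example. This is only an illustration: the table names (fact_sales, dim_product, dim_region) and the data are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so the SQL can be run without a server.

```python
import sqlite3

# Hypothetical star schema: one central fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region_name  TEXT);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL                      -- the measurable event
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 1000.0), (2, 1, 2, 1200.0), (3, 2, 1, 500.0), (4, 2, 1, 300.0);
""")

# A typical analytical query: join the fact table to one dimension and aggregate.
sales_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_region r ON f.region_id = r.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(sales_by_region)
```

Each dimension is one join away from the fact table. In a snowflake variant, dim_region might itself reference a separate country table, adding one more join to the same query.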

Choice of selecting schemas:


• If simplicity and speed are our priorities, the Star Schema is a better fit.
• If we need to handle complex data with frequent updates while minimizing storage, the
Snowflake Schema is more suitable.
OLAP vs OLTP:
OLTP (Online Transaction Processing) is the kind of system you interact with during everyday
activities. For example, when you book a movie ticket, transfer money through online banking, or
place an order on Amazon — you’re using an OLTP system. These systems handle a large number of
short transactions like insert, update, or delete operations. The focus is on speed and accuracy, because
users expect the system to respond immediately and reflect the latest changes. The data stored here is
current and usually highly structured, often using a normalized database design to avoid repetition.
OLAP (Online Analytical Processing) is used for analyzing large amounts of historical data to help in
decision-making. For example, a sales manager might want to analyze sales trends over the last year or
compare performance across regions. OLAP systems are optimized for reading data rather than writing
or updating it. They handle fewer queries compared to OLTP, but those queries are complex and
involve summarization, aggregation, and filtering. The data is often stored in denormalized forms,
like star or snowflake schemas, to make analysis faster and easier.
To put it simply:
• OLTP is like a diary you write in every day – quick entries, small data, always up to
date.
• OLAP is like a report card or business review – detailed, covers long periods, and used to
understand performance or plan ahead.
So, businesses use OLTP systems to run their daily operations, and they use OLAP systems to
understand how the business is doing and what decisions to take next.

Entity, Referential, and Semantic Constraints:


Constraints in a database are rules applied to columns or tables to ensure that the data stored is
accurate, valid, and reliable. Constraints are important:
• To enforce data integrity
• To avoid duplicate, null, or invalid entries
• To maintain relationships between tables

Entity Constraints:
Entity constraints ensure that each row (or entity) in a table is unique and identifiable. These
constraints are mainly enforced through Primary Keys. For example, every record in a table must be
distinct and must have a unique identifier — just like every person has a unique ID or Aadhaar
number.
CREATE TABLE Student (
    student_id INT PRIMARY KEY, -- entity constraint
    name VARCHAR(50),
    age INT
);
Referential Constraints:
Referential constraints ensure that a value in one table must match a value in another table. These are
enforced using a foreign key. If a table refers to another table’s data, the referenced data must exist.
It helps keep relationships between tables valid and consistent.
CREATE TABLE Marks (
mark_id INT PRIMARY KEY,
student_id INT,
score INT,
FOREIGN KEY (student_id) REFERENCES Student(student_id)
);

Semantic Constraints:
Semantic constraints are rules that enforce the meaning and logic of data beyond just structure. They
ensure the data makes sense in the real-world context or business rules. These constraints check if the
data follows business logic or valid conditions — not just format or uniqueness.
For example:
• A person's age must be greater than 0.
• An employee’s salary cannot be negative.
• Order date should not be in the future.
CREATE TABLE Employee (
emp_id INT PRIMARY KEY,
salary DECIMAL(10, 2),
CHECK (salary >= 0)
);
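To see the three kinds of constraint reject bad data, here is a minimal sketch that reuses the Student and Marks tables from above. It runs on Python's built-in sqlite3 module rather than MySQL (an assumption made so the example is self-contained); the helper is_rejected and the CHECK on age are added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Student (
    student_id INTEGER PRIMARY KEY,        -- entity constraint
    name TEXT,
    age INTEGER CHECK (age > 0)            -- semantic constraint
);
CREATE TABLE Marks (
    mark_id INTEGER PRIMARY KEY,
    student_id INTEGER,
    score INTEGER,
    FOREIGN KEY (student_id) REFERENCES Student(student_id)  -- referential constraint
);
INSERT INTO Student VALUES (1, 'Alice', 20);
""")

def is_rejected(sql):
    """Hypothetical helper: True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_key = is_rejected("INSERT INTO Student VALUES (1, 'Bob', 22)")    # same primary key
missing_parent = is_rejected("INSERT INTO Marks VALUES (1, 99, 80)")        # student 99 does not exist
negative_age = is_rejected("INSERT INTO Student VALUES (2, 'Carol', -5)")   # fails the CHECK
valid_row = is_rejected("INSERT INTO Marks VALUES (1, 1, 80)")              # satisfies all rules
print(duplicate_key, missing_parent, negative_age, valid_row)
```

Only the last insert succeeds; the first three are refused by the database itself, so no application code has to repeat the checks.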

Introduction to SQL:
SQL (Structured Query Language) is the standard language used to communicate with and manage
relational databases. It helps you create, read, update, and delete data stored in tables.
Basic components of SQL:
1. Data Definition Language (DDL): commands like CREATE, ALTER, and DROP that define or modify database structure.
2. Data Manipulation Language (DML): commands like INSERT, UPDATE, DELETE to manage data.
3. Data Query Language (DQL): the SELECT command to retrieve data.
4. Data Control Language (DCL): commands like GRANT and REVOKE to control access.
What is SQL used for?
• Creating databases and tables
• Inserting, updating, and deleting data
• Querying data to fetch specific information
• Managing database security and permissions
• Defining data constraints and relationships

DDL Statements:
DDL statements are used to define and manage the structure of database objects like tables, indexes,
and schemas. They deal with creating, modifying, and deleting these objects.
CREATE
Used to create a new database object (table, index, view, etc.).
CREATE TABLE Students (
student_id INT PRIMARY KEY,
name VARCHAR(50),
age INT);
ALTER
Used to modify an existing database object, like adding or dropping columns.
ALTER TABLE Students ADD email VARCHAR(100);
DROP
Used to delete an existing database object completely.
DROP TABLE Students;
TRUNCATE
Removes all rows from a table but keeps the structure intact (faster than DELETE).
TRUNCATE TABLE Students;
RENAME
Changes the name of a database object.
RENAME TABLE Students TO Learners;

DML Statements:
DML statements are used to manage the data stored inside database tables. They help you insert,
update, delete, and retrieve data.
INSERT
Adds new rows (records) into a table.
INSERT INTO Students (student_id, name, age) VALUES (1, 'Alice', 20);
UPDATE
Modifies existing data in a table.
UPDATE Students SET age = 21 WHERE student_id = 1;
DELETE
Removes rows from a table based on a condition.
DELETE FROM Students WHERE student_id = 1;
SELECT
Retrieves data from one or more tables. (SELECT is also classed separately as DQL, as in the list of SQL components above.)
SELECT * FROM Students;

Operator Type | Operators | Purpose | Example
--- | --- | --- | ---
Arithmetic | +, -, *, /, % | Perform basic math operations | SELECT salary * 1.1 FROM Employees;
Comparison | =, !=, <, >, <=, >= | Compare values | SELECT * FROM Students WHERE age > 18;
Logical | AND, OR, NOT | Combine multiple conditions | SELECT * FROM Students WHERE age > 18 AND grade = 'A';
Pattern Matching | LIKE, %, _ | Match patterns in text | SELECT * FROM Students WHERE name LIKE 'A%';
Range Operator | BETWEEN ... AND | Check if a value is within a range | SELECT * FROM Products WHERE price BETWEEN 100 AND 500;
Set Operator | IN | Check if a value exists in a given set | SELECT * FROM Students WHERE grade IN ('A', 'B');
Null Check | IS NULL, IS NOT NULL | Check for presence or absence of a value | SELECT * FROM Orders WHERE delivery_date IS NULL;

Aggregate and Inbuilt Functions:


Aggregate functions perform calculations on a group of rows and return a single result.

Function | Purpose | Example | Result
--- | --- | --- | ---
COUNT() | Counts number of rows | SELECT COUNT(*) FROM Students; | Total number of students
SUM() | Adds up values in a column | SELECT SUM(marks) FROM Students; | Total marks
AVG() | Calculates average value | SELECT AVG(marks) FROM Students; | Average marks
MAX() | Finds the highest value | SELECT MAX(age) FROM Students; | Oldest student’s age
MIN() | Finds the lowest value | SELECT MIN(age) FROM Students; | Youngest student’s age

Inbuilt functions work on individual values (one row at a time), and return one result per row.
(a) String Functions

Function | Purpose | Example | Output
--- | --- | --- | ---
UPPER() | Converts text to uppercase | SELECT UPPER(name) FROM Students; | 'ALICE'
LOWER() | Converts text to lowercase | SELECT LOWER(name) FROM Students; | 'alice'
LENGTH() | Returns length of a string | SELECT LENGTH(name) FROM Students; | 5
SUBSTRING() | Extracts part of a string | SELECT SUBSTRING(name, 1, 3) FROM Students; | 'Ali'
CONCAT() | Combines strings | SELECT CONCAT(name, ' ', city) FROM Students; | 'Alice Delhi'

• Aggregate functions work with GROUP BY for grouped results.
• Inbuilt functions are often used in SELECT, WHERE, and ORDER BY clauses.
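The difference between the two kinds of function can be sketched as follows. The Students rows are made up for illustration, and Python's built-in sqlite3 module is assumed in place of MySQL; COUNT, AVG, and UPPER behave the same way in both for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT, city TEXT, marks INTEGER);
INSERT INTO Students VALUES
    (1, 'Alice', 'Delhi', 80), (2, 'Bob', 'Delhi', 60), (3, 'Chitra', 'Chennai', 90);
""")

# Aggregate functions with GROUP BY: one result row per city.
per_city = conn.execute("""
    SELECT city, COUNT(*) AS n, AVG(marks) AS avg_marks
    FROM Students
    GROUP BY city
    ORDER BY city
""").fetchall()

# Inbuilt (scalar) function: one result per input row.
upper_names = conn.execute(
    "SELECT UPPER(name) FROM Students ORDER BY student_id"
).fetchall()

print(per_city)
print(upper_names)
```

Three input rows produce two rows from the aggregate query (one per city) but three rows from the scalar query (one per student).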

(b) Date Functions

Function | Purpose | Example | Output
--- | --- | --- | ---
NOW() | Returns current date & time | SELECT NOW(); | '2025-06-05 22:30:00'
CURDATE() | Returns current date | SELECT CURDATE(); | '2025-06-05'
YEAR() | Extracts year from date | SELECT YEAR(dob) FROM Students; | 2003
MONTH() | Extracts month from date (as a number, 1 to 12) | SELECT MONTH(dob) FROM Students; | 6

Ordering:
Ordering is done using the ORDER BY clause in SQL. It helps you sort the result of a query based on
one or more columns, either in ascending (ASC) or descending (DESC) order.
SELECT column1, column2
FROM table_name
ORDER BY column1 [ASC|DESC], column2 [ASC|DESC];

Regular Expressions:
Regular Expressions (RegEx) are patterns used to match character combinations in strings. In SQL,
they allow powerful searching, filtering, and validation of textual data.

Pattern | Meaning | Matches
--- | --- | ---
^abc | Starts with "abc" | abc123, abcdef
abc$ | Ends with "abc" | 123abc, xyzabc
a.b | 'a' followed by any one char, then 'b' | acb, aab, a9b
[abc] | Matches one of a, b, or c | a, b, c
[^abc] | Matches any char except a, b, or c | d, x, 1
[a-z] | Any lowercase letter | a to z
[0-9] | Any digit | 0 to 9
a* | 0 or more of 'a' | "", a, aa, aaaa
a+ | 1 or more of 'a' | a, aa, aaa
a{3} | Exactly 3 'a's | aaa
a{2,4} | Between 2 and 4 'a's | aa, aaa, aaaa
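In MySQL these patterns are used with the REGEXP operator, as in WHERE name REGEXP '^A'. The sketch below tries a few patterns from the table against sample strings. It assumes Python's built-in sqlite3 module, which has no regex engine of its own, so a REGEXP function backed by Python's re module is registered first. The Codes table and the matching helper are hypothetical, and Python's matching is case-sensitive, which may differ from a MySQL column's collation.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    return 1 if value is not None and re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.executescript("""
CREATE TABLE Codes (code TEXT);
INSERT INTO Codes VALUES ('abc123'), ('xyzabc'), ('a9b'), ('aaaa'), ('hello');
""")

def matching(pattern):
    """Return the codes that match the pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT code FROM Codes WHERE code REGEXP ? ORDER BY code", (pattern,)
    )
    return [row[0] for row in rows]

starts_with_abc = matching("^abc")     # anchored at the start
ends_with_abc = matching("abc$")       # anchored at the end
a_any_b = matching("a.b")              # 'a', any one character, 'b'
two_to_four_a = matching("^a{2,4}$")   # the whole string is 2 to 4 'a's
print(starts_with_abc, ends_with_abc, a_any_b, two_to_four_a)
```

Without the ^ and $ anchors, a pattern matches anywhere inside the string, which is why the last pattern anchors both ends.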

Nested Queries:
A nested query, also called a subquery, is a query inside another query. It's used when the result of one
query depends on the output of another. The subquery is written inside parentheses (). For example:
SELECT column FROM table
WHERE column = (SELECT column FROM another_table WHERE condition);
Use nested queries when:
• The condition depends on a calculated or filtered result.
• You need to compare rows with aggregated data.
• You want to check for the existence or non-existence of related data.
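Two of the cases listed above can be sketched with made-up data: comparing rows with an aggregate, and checking for non-existence of related rows. The Students and Enrollment tables are hypothetical, and Python's built-in sqlite3 module is assumed in place of MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (student_id INTEGER PRIMARY KEY, name TEXT, marks INTEGER);
CREATE TABLE Enrollment (student_id INTEGER, course TEXT);
INSERT INTO Students VALUES (1, 'Alice', 80), (2, 'Bob', 60), (3, 'Chitra', 90), (4, 'Dev', 50);
INSERT INTO Enrollment VALUES (1, 'SQL'), (3, 'SQL');
""")

# Compare each row with an aggregate computed by the inner query (average is 70).
above_average = [row[0] for row in conn.execute("""
    SELECT name FROM Students
    WHERE marks > (SELECT AVG(marks) FROM Students)
    ORDER BY name
""")]

# Check for the non-existence of related rows with a correlated subquery.
not_enrolled = [row[0] for row in conn.execute("""
    SELECT name FROM Students s
    WHERE NOT EXISTS (SELECT 1 FROM Enrollment e WHERE e.student_id = s.student_id)
    ORDER BY name
""")]

print(above_average, not_enrolled)
```

The first subquery runs once and yields a single value. The second is correlated: it refers to the outer row (s.student_id), so it is logically evaluated for each student.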
Views:
A view is a virtual table based on the result of an SQL query. It doesn’t store data itself but displays
data from one or more tables. Think of it like a saved query that behaves like a table.
CREATE VIEW view_name AS SELECT column1, column2
FROM table_name WHERE condition;
Features of Views:
• A view does not store data; it shows data from the underlying tables
• You can query a view like a table: SELECT * FROM view_name;
• Useful to hide certain columns (e.g., salary) from users
• Simple views (on a single table, with no joins or aggregates) can sometimes be updated
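The following sketch shows two of these features: a view that hides the salary column, and the fact that a view reflects base-table changes because it stores no data. The view name EmployeePublic and the rows are hypothetical, and Python's built-in sqlite3 module is assumed in place of MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL);
INSERT INTO Employees VALUES (1, 'Alice', 50000), (2, 'Bob', 42000);
-- The view exposes only the non-sensitive columns.
CREATE VIEW EmployeePublic AS SELECT emp_id, name FROM Employees;
""")

cursor = conn.execute("SELECT * FROM EmployeePublic ORDER BY emp_id")
view_columns = [col[0] for col in cursor.description]
rows_before = cursor.fetchall()

# The view stores no data: a new base-table row appears in it immediately.
conn.execute("INSERT INTO Employees VALUES (3, 'Chitra', 61000)")
rows_after = conn.execute("SELECT * FROM EmployeePublic ORDER BY emp_id").fetchall()

print(view_columns, rows_before, rows_after)
```

A user who is granted access only to EmployeePublic can list employees without ever seeing the salary column.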

SQL Joins, Set Operators, and Views – Complete Table:

Concept | Description | Venn Logic | Use Case | SQL Syntax Example
--- | --- | --- | --- | ---
INNER JOIN | Returns rows common to both tables based on condition | A ∩ B | Employees with valid departments | SELECT * FROM Employees E INNER JOIN Departments D ON E.dept_id = D.id;
LEFT JOIN | All rows from left + matched rows from right; unmatched → NULL | A ⟕ B | All customers, even those without orders | SELECT * FROM Customers C LEFT JOIN Orders O ON C.id = O.customer_id;
RIGHT JOIN | All rows from right + matched rows from left; unmatched → NULL | A ⟖ B | All orders, even if customer info is missing | SELECT * FROM Customers C RIGHT JOIN Orders O ON C.id = O.customer_id;
FULL OUTER JOIN | All rows from both tables; unmatched rows → NULL | A ∪ B | Combine all data from both tables | SELECT * FROM Customers C LEFT JOIN Orders O ON C.id = O.customer_id UNION SELECT * FROM Customers C RIGHT JOIN Orders O ON C.id = O.customer_id;
CROSS JOIN | Cartesian product (every row of A with every row of B) | A × B | Generate all color-size combinations | SELECT * FROM Colors CROSS JOIN Sizes;
VIEW with JOIN | Virtual table created from join result | Based on JOIN type | Save employee-department list for reuse | CREATE VIEW EmpDept AS SELECT E.name AS employee_name, D.name AS department_name FROM Employees E JOIN Departments D ON E.dept_id = D.id;
UNION | Combines results of 2 queries; removes duplicates | A ∪ B (no repeats) | Merge unique cities from two tables | SELECT city FROM BranchA UNION SELECT city FROM BranchB;
UNION ALL | Combines results of 2 queries; keeps duplicates | A ∪ B (with repeats) | Combine sales data from multiple sources | SELECT city FROM BranchA UNION ALL SELECT city FROM BranchB;
INTERSECT | Returns only common rows from both queries | A ∩ B | Customers who are also suppliers | SELECT name FROM Customers INTERSECT SELECT name FROM Suppliers;
MINUS / EXCEPT | Rows in first query not in second | A − B | Employees who didn’t submit timesheets | SELECT name FROM Employees MINUS SELECT name FROM Timesheets;
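Several rows of the table can be checked against small sample data. This is a sketch using Python's built-in sqlite3 module in place of MySQL; the rows are made up, the query helper is hypothetical, and EXCEPT is used where the table shows MINUS (the Oracle spelling). RIGHT and FULL joins are left out because older SQLite versions lack them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO Customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Carl');
INSERT INTO Orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
CREATE TABLE BranchA (city TEXT);
CREATE TABLE BranchB (city TEXT);
INSERT INTO BranchA VALUES ('Chennai'), ('Delhi');
INSERT INTO BranchB VALUES ('Delhi'), ('Mumbai');
""")

def query(sql):
    return conn.execute(sql).fetchall()

inner_rows = query("""SELECT C.name, O.order_id FROM Customers C
                      INNER JOIN Orders O ON C.id = O.customer_id
                      ORDER BY O.order_id""")
left_rows = query("""SELECT C.name, O.order_id FROM Customers C
                     LEFT JOIN Orders O ON C.id = O.customer_id
                     ORDER BY C.id, O.order_id""")
union_rows = query("SELECT city FROM BranchA UNION SELECT city FROM BranchB ORDER BY city")
union_all_rows = query("SELECT city FROM BranchA UNION ALL SELECT city FROM BranchB ORDER BY city")
intersect_rows = query("SELECT city FROM BranchA INTERSECT SELECT city FROM BranchB")
except_rows = query("SELECT city FROM BranchA EXCEPT SELECT city FROM BranchB")

print(inner_rows, left_rows, union_rows, union_all_rows, intersect_rows, except_rows)
```

Carl has no orders, so he is absent from the inner join but appears in the left join with a NULL (None in Python) order_id. Delhi is in both branches, so it appears once under UNION and twice under UNION ALL.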
Sample Questions:

1. What is a Snowflake Schema? Illustrate with a simple diagram.


2. List and explain any four built-in SQL functions.
3. How are wildcards used with the LIKE operator in SQL? Explain with examples.
4. Differentiate between UNION, UNION ALL, and INTERSECT in SQL. Provide suitable
examples.
5. How do entity constraints and referential integrity constraints help maintain data accuracy in
relational databases?
6. Design an Entity-Relationship Diagram (ERD) for an Employee Management System. Clearly
explain all associated entity constraints, referential constraints, and semantic rules applied
within the model.
__________________________________________________________________________________

UNIT – 2 (DATA MODELLING)


Introduction to Data Modelling, A Data Model vs A Floor Model, Database Design - Creation -
Manipulation Cycle, Relational Schemas, Relational vs Non-Relational Schemas, Database Design,
DDL Statements Syntax, Database Creation; DML Statements Syntax, Database Manipulation,
Database Querying, IMDb Schema Creation, IMDb Solution.

Introduction to Data Modelling:


Data Modelling is the process of creating a blueprint or structure for how data will be stored, accessed,
and related in a database.
Why is Data Modelling Important?
• Ensures data consistency and accuracy
• Helps in designing databases that are efficient and scalable
• Makes it easy to understand relationships between data
• Acts as a guide for developers, analysts, and DBAs

Type | Description | Purpose / Use Case | Example
--- | --- | --- | ---
Conceptual | High-level view of data; focuses on what data is needed, not how it is stored. | Used during early planning; shared with stakeholders. | ER Diagram with entities like Student, Course
Logical | Adds details like attributes, primary/foreign keys; platform-independent. | Used by database designers to define structure. | Tables: Student(StudentID, Name)
Physical | Specifies how data is stored in a specific DBMS (e.g., MySQL, Oracle). | Used by developers/DBAs for implementation. | SQL tables with data types, indexes

• Conceptual: What the system contains (ERD level)
• Logical: How the system is logically structured (tables, keys, types)
• Physical: How it is actually implemented in the DBMS (schemas, storage)

Database Design - Creation - Manipulation Cycle:


This cycle explains how a database moves from design to implementation and then to usage. It consists of three
main phases:

Database Design
This is the planning phase where you decide what data you need, how it should be organized, and how tables
relate.
Steps:
• Requirement Analysis
• Create ER Model (Entities, Attributes, Relationships)
• Normalize Data (remove redundancy)
• Convert to Relational Schema
Example: Designing tables like Student(StudentID, Name) and Course(CourseID, Title) with a relationship table
Enrollment.

Database Creation
This is the implementation phase where the designed model is translated into SQL code using DDL (Data
Definition Language) – CREATE, ALTER, DROP.

Database Manipulation
This is the operation phase where data is added, updated, deleted, and queried using DML (Data Manipulation
Language) – INSERT, UPDATE, DELETE, and SELECT for querying.

Design → Create → Manipulate → Maintain & Improve


This cycle is iterative, meaning updates to the design may require re-creation and re-manipulation.


A Data Model vs A Floor Model:


A data model defines how data is organized and structured, similar to a blueprint for a database. A
floor model, on the other hand, is a physical or virtual representation of a building's layout, often used
for planning and visualizing a space. While a data model focuses on data relationships and
organization within a system, a floor model focuses on the spatial arrangement of a physical
environment.
• A Data Model is to data what a Floor Model is to physical space.
• Both are design tools used to plan complex structures before actual implementation.

Aspect | Data Model | Floor Model
--- | --- | ---
Definition | A blueprint for how data is structured and related in a database. | A layout plan for how space (e.g., building/floor) is organized physically.
Domain | Used in database design and information systems. | Used in architecture, interior design, or building planning.
Purpose | To plan how information is stored, accessed, and connected. | To plan how rooms, walls, furniture, or structures are arranged.
Represents | Entities, attributes, relationships among data. | Physical spaces like rooms, doors, pathways.
Example | ER Diagram showing Student → Enrolls → Course | Blueprint showing classrooms, labs, offices on a college floor.
Output Format | Diagrams, schemas, SQL definitions. | 2D/3D layout drawings or architectural blueprints.

Relational Schema:


A relational schema is a set of relational tables and associated items that are related to one another. All
of the base tables, views, indexes, domains, user roles, stored modules, and other items that a user
creates to fulfil the data needs of a particular enterprise or set of applications belong to one schema.
A relational schema defines the structure of a relational database table. It describes the table name, its
attributes (columns), and the types of data each column can hold.
For example, in the schema Student(StudentID INT, Name VARCHAR, Age INT, CourseID INT):
• Student → table name
• StudentID, Name, Age, CourseID → attributes (columns)
• INT, VARCHAR → data types
It may also include:
• Primary Key: uniquely identifies a row (e.g., StudentID)
• Foreign Key: links to another table (e.g., CourseID refers to Course table)

Relational vs Non-Relational Schemas:

Feature | Relational Schema | Non-Relational Schema
--- | --- | ---
Data Model | Tabular (tables, rows, columns) | Document, key-value, graph, or wide-column
Schema Type | Fixed schema – predefined structure | Flexible schema – dynamic, can vary per item
Storage Format | Structured data in rows and columns | JSON, BSON, XML, key-value pairs, etc.
Normalization | Often normalized to reduce redundancy | Usually denormalized for faster access
Joins | Supports complex joins between tables | Limited or no joins
Examples | MySQL, PostgreSQL, Oracle | MongoDB (document)

A relational schema organizes data into structured tables (like Excel sheets) with predefined columns
and rows. Each table has a schema (blueprint) that defines the column names, data types, and
relationships with other tables using keys (Primary Key, Foreign Key).
A non-relational schema (NoSQL) is more flexible. It stores data in formats like JSON, key-value
pairs, graphs, or documents instead of structured tables. It does not require a fixed schema.
Sample Questions:
1. Explain the difference between a data model and a floor model with examples.
2. What does DDL stand for? Mention two commands that come under DDL.
3. Explain the importance of data modeling in database design.
4. Differentiate DML statements. Include their syntax examples.
5. Explain how data modelling assists in designing a database that reflects real-world scenarios.
6. Compare relational and non-relational schemas by discussing their use cases, advantages, and
disadvantages.
__________________________________________________________________________________

UNIT – 3 (ADVANCED SQL AND BEST PRACTICES & SQL ASSIGNMENT)


Rank Functions, Partitioning, Frames, Lead and Lag Functions, Case Statements, UDFs, Stored
Procedures, Cursors, Best Practices, Indexing, Clustered vs Non-Clustered Indexing, Order of Query
Execution, Joins vs Nested Queries; Profitability Analysis, Profitable Customers, Customers Without
Orders, Fraud Detection, Problem Statement, Solution.
Rank Functions:
Rank functions assign a rank or position number to rows in a result set based on the values in one or
more columns. They are commonly used for sorting, ranking, and pagination.

Function | Description | Behavior | Example Syntax
--- | --- | --- | ---
RANK() | Assigns ranks with gaps if there are ties (same rank for equal values). | Rows with the same value get the same rank; next rank skips accordingly. | RANK() OVER (ORDER BY Salary DESC)
DENSE_RANK() | Like RANK(), but without gaps in ranking sequence. | Rows with same value share rank; next rank is next consecutive number (no gaps). | DENSE_RANK() OVER (ORDER BY Salary DESC)
ROW_NUMBER() | Assigns unique sequential row numbers, even if there are ties. | Each row gets a unique number; no ties. | ROW_NUMBER() OVER (ORDER BY Salary DESC)

SELECT
    EmployeeID,
    Salary,
    RANK() OVER (ORDER BY Salary DESC) AS SalaryRank, -- RANK is a reserved word in MySQL 8, so it is not used as an alias
    DENSE_RANK() OVER (ORDER BY Salary DESC) AS DenseRank,
    ROW_NUMBER() OVER (ORDER BY Salary DESC) AS RowNum
FROM Employees;
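To see how the three functions differ on tied values, the query can be run on a small made-up table in which two employees share a salary. This sketch assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions) in place of MySQL 8. EmployeeID is added as a tie-breaker for ROW_NUMBER() so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Salary INTEGER);
INSERT INTO Employees VALUES (1, 900), (2, 800), (3, 800), (4, 700);
""")

ranked = conn.execute("""
    SELECT EmployeeID, Salary,
           RANK()       OVER (ORDER BY Salary DESC) AS SalaryRank,
           DENSE_RANK() OVER (ORDER BY Salary DESC) AS DenseRank,
           ROW_NUMBER() OVER (ORDER BY Salary DESC, EmployeeID) AS RowNum
    FROM Employees
    ORDER BY RowNum
""").fetchall()

for row in ranked:
    print(row)
```

Employees 2 and 3 tie on 800. RANK() gives both rank 2 and then jumps to 4, DENSE_RANK() gives both 2 and continues with 3, and ROW_NUMBER() numbers the rows 1 to 4 regardless of the tie.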
Partitioning (Window Functions):
Partitioning in SQL is used with window functions to divide the result set into groups (partitions), so
calculations are done within each group separately.
What does Partition mean?
• Think of partitioning as splitting your data into chunks based on one or more columns.
• Window functions (like RANK(), ROW_NUMBER(), SUM(), etc.) operate independently within each
partition.
SELECT EmployeeID, Department, Salary,
    RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS DeptRank
FROM Employees;
• Here, ranking resets for each Department because of PARTITION BY Department.
• Employees are ranked within their own department, not across the entire table.
Why use Partition?
• To perform calculations group-wise without collapsing each group into a single row (unlike GROUP
BY).
• Useful for running totals, rankings, moving averages, etc., within subsets of data.
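The contrast with GROUP BY can be sketched on a made-up Employees table. GROUP BY returns one row per department, while PARTITION BY keeps every employee row and attaches the department-level result to it. Python's built-in sqlite3 module (SQLite 3.25 or later) is assumed in place of MySQL 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Department TEXT, Salary INTEGER);
INSERT INTO Employees VALUES
    (1, 'HR', 500), (2, 'HR', 400), (3, 'IT', 900), (4, 'IT', 700), (5, 'IT', 600);
""")

# GROUP BY collapses each department into a single row.
grouped = conn.execute("""
    SELECT Department, SUM(Salary) AS DeptTotal
    FROM Employees
    GROUP BY Department
    ORDER BY Department
""").fetchall()

# PARTITION BY keeps all rows and computes per-department values alongside them.
windowed = conn.execute("""
    SELECT EmployeeID, Department, Salary,
           SUM(Salary) OVER (PARTITION BY Department) AS DeptTotal,
           RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS DeptRank
    FROM Employees
    ORDER BY EmployeeID
""").fetchall()

print(grouped)
for row in windowed:
    print(row)
```

The grouped query returns two rows; the windowed query returns all five, with the rank restarting at 1 in each department.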
Lead and Lag Functions:

Lead and Lag are window functions that allow you to access data from following or previous rows relative to
the current row without using a self-join.
What do they do?
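• LEAD(column, offset, default) returns the value from the row that is offset rows after the current row
(offset defaults to 1); default is returned when no such row exists.
• LAG(column, offset, default) returns the value from the row that is offset rows before the current row
(offset defaults to 1); default is returned when no such row exists.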
SELECT Month, Sales,
LAG(Sales, 1, 0) OVER (ORDER BY Month) AS PrevMonthSales,
LEAD(Sales, 1, 0) OVER (ORDER BY Month) AS NextMonthSales
FROM SalesData;
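The query above can be run on three made-up months to see which values LAG and LEAD pick up, along with a month-over-month change column added for illustration (ChangeFromPrev is a hypothetical name). Python's built-in sqlite3 module (SQLite 3.25 or later) is assumed in place of MySQL 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SalesData (Month INTEGER, Sales INTEGER);
INSERT INTO SalesData VALUES (1, 100), (2, 150), (3, 120);
""")

trend = conn.execute("""
    SELECT Month, Sales,
           LAG(Sales, 1, 0)  OVER (ORDER BY Month) AS PrevMonthSales,
           LEAD(Sales, 1, 0) OVER (ORDER BY Month) AS NextMonthSales,
           Sales - LAG(Sales, 1, 0) OVER (ORDER BY Month) AS ChangeFromPrev
    FROM SalesData
    ORDER BY Month
""").fetchall()

for row in trend:
    print(row)
```

The first month has no previous row and the last has no next row, so the default 0 is returned there. Note that the default also makes the first month's change equal to its full sales figure; omitting the default yields NULL instead, which is often the safer choice for difference calculations.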

UDFs and Stored Procedures:


User-Defined Functions (UDFs) in SQL are routines created by users that perform specific
calculations or operations and return a value—either a single scalar value or a table. UDFs are
typically used within SQL queries, such as in the SELECT or WHERE clauses, to simplify complex
expressions or reuse logic. They are designed to be side-effect free, meaning they do not modify the
database data but only compute and return results based on the inputs provided.
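-- SQL Server (T-SQL) syntax
CREATE FUNCTION getDiscount(@price DECIMAL(10, 2))
RETURNS DECIMAL(10, 2) -- explicit precision; a bare DECIMAL has scale 0 and would round away the fraction
AS
BEGIN
    RETURN @price * 0.9; -- price after a 10% discount
END;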

On the other hand, Stored Procedures are a set of precompiled SQL statements that perform a
sequence of operations, which can include querying, inserting, updating, or deleting data. Stored
procedures are executed explicitly using commands like EXEC or CALL. Unlike UDFs, stored
procedures can have side effects because they can modify the database and control transactions with
commits or rollbacks. They can also accept input and output parameters, allowing more complex and
flexible database operations, such as batch processing or business logic implementation.
-- SQL Server (T-SQL) syntax
CREATE PROCEDURE UpdateEmployeeSalary
    @EmpID INT,
    @NewSalary DECIMAL(10, 2) -- explicit precision so fractional salaries are kept
AS
BEGIN
    UPDATE Employees
    SET Salary = @NewSalary
    WHERE EmployeeID = @EmpID;
END;

Feature | User-Defined Functions (UDFs) | Stored Procedures
--- | --- | ---
Purpose | Perform calculations and return a single value or table | Perform a sequence of SQL operations (tasks)
Return Type | Must return a value (scalar, table, or aggregate) | May or may not return values; can return multiple results
Usage | Used in SQL statements like SELECT, WHERE | Executed explicitly with EXEC or CALL commands
Side Effects | Should not modify database state (no data changes) | Can modify data: insert, update, delete
Parameters | Can accept input parameters | Can accept input/output parameters
Transaction Control | Cannot control transactions | Can include transaction control (commit, rollback)
Example | CREATE FUNCTION getTotal(@id INT) RETURNS INT AS ... | CREATE PROCEDURE updateSalary(@id INT, @newSal INT) AS ...

Cursors:
A cursor is a database object used to retrieve, manipulate, and navigate through rows of a result set one
row at a time. Unlike regular SQL queries that operate on entire sets of rows at once, cursors allow
row-by-row processing, which is useful when you need to perform operations on individual rows
sequentially.
Typical uses:
• When you need to process each row separately (e.g., complex calculations, conditional
updates).
• When operations can’t be done in a single set-based SQL query.
How does a cursor work?
1. Declare the cursor with a SQL SELECT statement.
2. Open the cursor to establish the result set.
3. Fetch rows one by one from the cursor.
4. Perform required operations on each fetched row.
5. Close the cursor when done.
6. Deallocate the cursor to release resources.
-- SQL Server (T-SQL) syntax
DECLARE @StudentID INT;
DECLARE student_cursor CURSOR FOR
    SELECT StudentID FROM Students WHERE Grade = 'A';
OPEN student_cursor;
FETCH NEXT FROM student_cursor INTO @StudentID;
WHILE @@FETCH_STATUS = 0 -- 0 means the last fetch returned a row
BEGIN
    PRINT 'Top student ID: ' + CAST(@StudentID AS VARCHAR);
    FETCH NEXT FROM student_cursor INTO @StudentID;
END;
CLOSE student_cursor;
DEALLOCATE student_cursor;
Cursors can be slow and resource-heavy compared to set-based operations, so use them only when
necessary.

Indexing - Clustered vs Non-Clustered:


Indexing is a technique used in databases to speed up the retrieval of rows by creating a data structure
that allows quick search, much like an index in a book. Without indexes, the database must scan the
entire table to find matching rows, which can be slow for large datasets.
• A clustered index determines the physical order of data rows in a table based on the indexed column(s). This means the actual data is stored in the sequence of the clustered index key, making data retrieval for range queries or sorting very fast. Because the table’s data can only be sorted one way, a table can have only one clustered index, which is often created on the primary key by default.
• A non-clustered index does not affect the physical order of the data. Instead, it creates a separate structure that stores the indexed key values along with pointers to the actual data rows. This allows quick lookup of rows based on the indexed column without rearranging the data itself. A table can have multiple non-clustered indexes, which are useful for speeding up searches on columns frequently used in queries but not for sorting the data physically.
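The effect of a secondary (non-clustered style) index can be observed from a query plan. The sketch below is an assumption-laden illustration, not SQL Server or MySQL behaviour: it uses Python's sqlite3 module, where rows are stored by the INTEGER PRIMARY KEY (roughly the clustered case) and CREATE INDEX builds a separate lookup structure (roughly the non-clustered case). The table, the index name idx_emp_dept, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# EmployeeID is the INTEGER PRIMARY KEY, so SQLite keeps the rows ordered by it.
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees (Name, Department) VALUES (?, ?)",
    [(f"emp{i}", f"dept{i % 10}") for i in range(1000)],
)

def query_plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

sql = "SELECT * FROM Employees WHERE Department = 'dept3'"
plan_before = query_plan(sql)   # no index on Department yet: full table scan

# Secondary index: a separate structure of Department values plus row pointers.
conn.execute("CREATE INDEX idx_emp_dept ON Employees (Department)")
plan_after = query_plan(sql)    # the planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
conn.close()
```

The exact plan wording differs between SQLite versions, but the first plan reports a scan of the table and the second names the index.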

Order of Query Execution:


When you write an SQL query, it may look like it runs from top to bottom, but internally, the database
executes the clauses in a specific logical order. Understanding this order helps in writing efficient and
correct queries.
Logical Order of SQL Query Execution:
1. FROM — Identify and gather the data source tables, including any joins.
2. WHERE — Filter rows based on conditions.
3. GROUP BY — Group the filtered rows by specified columns.
4. HAVING — Filter groups based on aggregate conditions.
5. SELECT — Choose the columns or expressions to return.
6. ORDER BY — Sort the final result set.
7. LIMIT / OFFSET — Limit the number of rows returned (if used).
For Example:
SELECT Department, COUNT(*) AS EmployeeCount
FROM Employees
WHERE Salary > 50000
GROUP BY Department
HAVING COUNT(*) > 5
ORDER BY EmployeeCount DESC;
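To make the logical order concrete, the sketch below runs the same query in stages so the intermediate results are visible. It assumes Python's sqlite3 module and a made-up Employees table; the department names and salaries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
rows = (
    [(f"s{i}", "Sales", 60000) for i in range(6)]      # 6 rows pass WHERE
    + [(f"h{i}", "HR", 70000) for i in range(3)]       # 3 rows pass WHERE
    + [(f"i{i}", "IT", 40000) for i in range(8)]       # 0 rows pass WHERE
)
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?)", rows)

# Steps 1-2: FROM + WHERE keep only rows with Salary > 50000.
after_where = conn.execute(
    "SELECT COUNT(*) FROM Employees WHERE Salary > 50000"
).fetchone()[0]

# Step 3: GROUP BY counts the surviving rows per department.
after_group = conn.execute(
    "SELECT Department, COUNT(*) FROM Employees "
    "WHERE Salary > 50000 GROUP BY Department ORDER BY Department"
).fetchall()

# Steps 4-6: HAVING drops small groups, SELECT names the count, ORDER BY sorts.
final = conn.execute(
    "SELECT Department, COUNT(*) AS EmployeeCount FROM Employees "
    "WHERE Salary > 50000 GROUP BY Department "
    "HAVING COUNT(*) > 5 ORDER BY EmployeeCount DESC"
).fetchall()
conn.close()

print(after_where, after_group, final)
```

IT has the most employees but never reaches GROUP BY because WHERE removes its rows first; HR is grouped but then removed by HAVING.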

Joins vs Nested Queries:


Joins combine rows from two or more tables based on related columns, returning a single result set
with columns from all tables involved. Joins are efficient for retrieving related data side-by-side,
especially when you want to merge tables on common keys. Common types include INNER JOIN,
LEFT JOIN, RIGHT JOIN, and FULL JOIN. For example, to get employee names with their
department names, a join between Employees and Departments on DepartmentID is used.

Nested Queries (or subqueries) are queries written inside another query, often in the WHERE or
FROM clause. They allow you to filter or manipulate data using results from another query. Nested
queries are useful when you need to perform stepwise filtering or calculations, like finding employees
whose salary is above the average salary computed by a subquery.
Key Differences:
• Joins combine tables horizontally, placing related columns from different tables in the same row.
• Nested queries run one query inside another; the outer query depends on the result of the inner query.
• Joins are generally faster and more readable for combining tables.
• Nested queries are handy for filtering or conditional logic but can be less efficient if overused.
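The two examples mentioned above (employee names with department names, and employees paid above the average) can be sketched as follows. This is a minimal illustration assuming Python's sqlite3 module; the table contents are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Employees (Name TEXT, Salary INTEGER, DepartmentID INTEGER);
INSERT INTO Departments VALUES (1, 'Sales'), (2, 'IT');
INSERT INTO Employees VALUES ('Asha', 40000, 1), ('Ravi', 60000, 2), ('Meena', 80000, 2);
""")

# Join: combine columns from both tables side by side.
joined = conn.execute(
    "SELECT e.Name, d.DepartmentName "
    "FROM Employees e INNER JOIN Departments d ON e.DepartmentID = d.DepartmentID "
    "ORDER BY e.Name"
).fetchall()

# Nested query: the inner SELECT computes the average salary (60000) first,
# then the outer query filters on that single value.
above_avg = conn.execute(
    "SELECT Name FROM Employees "
    "WHERE Salary > (SELECT AVG(Salary) FROM Employees) ORDER BY Name"
).fetchall()
conn.close()

print(joined)
print(above_avg)
```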

Sample Questions:
1. Write an SQL query using PARTITION BY to compute the average score within each class.
2. Use the RANK() function in a query to find the top 3 most profitable customers.
3. Construct a query using LEAD() to display each order's amount alongside the following order’s
amount.
4. Demonstrate the use of RANK(), DENSE_RANK(), and ROW_NUMBER() on a sales table
and explain how their outputs differ.
5. Explain the logical sequence in which an SQL query is executed, with a practical example.
6. Develop a profitability analysis dashboard using advanced SQL techniques such as views and
user-defined functions (UDFs), intended for business executives.

__________________________________________________________________________________

UNIT – 4 (DATA VISUALISATION IN PYTHON)

Visualisations – Some Examples, Case Study Overview, Data Handling and Cleaning, Sanity Checks,
Outliers Analysis with Boxplots, Histograms; Distribution Plots, Styling Options, Pie - Chart and Bar
Chart, Scatter Plots, Pair Plots, Bar Graphs and Box Plots, Heatmaps, Line Charts, Stacked Bar Charts,
Plotly.

Visualizations – Some Examples:


Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays communicate complex data relationships and data-driven insights in a way that is easy to understand. Below are commonly used chart types with typical use cases:
1. Line chart: Trend over time (e.g., stock prices).
2. Bar chart: Comparison (e.g., sales by region).
3. Histogram: Distribution (e.g., age distribution).
4. Scatter plot: Relationship between two variables.
5. Heatmap: Correlation matrix.

Data Handling and Cleaning:


Data handling and cleaning are essential steps before performing any analysis or visualization. This
ensures data integrity and quality.
Importing the Dataset:
import pandas as pd
df = pd.read_csv('data.csv') # Also supports Excel, JSON, SQL
Understanding Data:
df.shape       # (rows, columns)
df.head()      # First 5 rows
df.tail()      # Last 5 rows
df.info()      # Data types, non-null counts
df.describe()  # Summary stats for numerical columns
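Handling Missing Values:
df.isnull().sum()   # Count missing values per column
# Option 1: drop rows that contain missing values
df = df.dropna()
# Option 2: fill missing values instead (a constant, or the mean/median/mode)
df = df.fillna(0)
df['Age'] = df['Age'].fillna(df['Age'].mean())   # assign back; inplace=True on a single column is unreliable in recent pandas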
Handling Duplicates:
df.duplicated().sum() # Check Duplicates
df.drop_duplicates(inplace=True) # Drop Duplicates
Datatype Conversion:
df.dtypes # Check types
# Convert types
df['Date'] = pd.to_datetime(df['Date'])
df['Age'] = df['Age'].astype(int)   # raises an error if Age still contains missing values
Renaming Columns:
df.rename(columns={'OldName': 'NewName'}, inplace=True)
Filtering and Subsetting:
df[df['Salary'] > 50000] # Conditional filtering
df[['Name', 'Age']] # Selecting specific columns
Replacing Values:
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
Creating New Columns:
df['Revenue'] = df['Units_Sold'] * df['Price']
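The individual steps above can be combined into one short cleaning pass. The sketch below is an illustration on a small made-up DataFrame (the names and values are hypothetical); it assumes pandas is installed and assigns results back rather than using inplace=True.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age, dates stored as text.
raw = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meena"],
    "Age": [24.0, 30.0, 30.0, None],
    "Gender": ["F", "M", "M", "F"],
    "Date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15"],
    "Units_Sold": [3, 5, 5, 2],
    "Price": [100.0, 50.0, 50.0, 200.0],
})

df = raw.drop_duplicates().reset_index(drop=True)   # 4 rows -> 3 rows
df["Age"] = df["Age"].fillna(df["Age"].mean())      # mean of 24 and 30 is 27
df["Age"] = df["Age"].astype(int)                   # safe now: no missing values left
df["Date"] = pd.to_datetime(df["Date"])
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})
df["Revenue"] = df["Units_Sold"] * df["Price"]

print(df)
```

The order matters: duplicates are removed before the mean is computed so the repeated row does not count twice, and the integer conversion comes after the missing age is filled.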

Sanity Checks:
Sanity checks help validate the correctness, completeness, and consistency of data before analysis or
modeling. These are quick, essential checks to avoid misleading results.
1. Shape of the Dataset
2. Column Names and Types
3. Missing Values
4. Duplicates
5. Unique Values
6. Value Ranges
7. Logical Consistency
8. Date Validity (format)
9. Unnecessary Columns
10. Summary Statistics
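Several of these checks can be gathered into one helper. The function below is a hypothetical sketch (sanity_report and its range_checks parameter are names invented for this example, not part of pandas); it covers shape, types, missing values, duplicates, unique values, and value ranges.

```python
import pandas as pd

def sanity_report(df, range_checks=None):
    """Collect basic sanity-check results for a DataFrame in a dict.

    range_checks maps a column name to an allowed (min, max) pair.
    """
    report = {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "unique": df.nunique().to_dict(),
    }
    out_of_range = {}
    for col, (low, high) in (range_checks or {}).items():
        # between() is inclusive; missing values are reported separately above
        bad = df[col].notna() & ~df[col].between(low, high)
        out_of_range[col] = int(bad.sum())
    report["out_of_range"] = out_of_range
    return report

# Hypothetical data: one missing age, an impossible age, and one duplicate row.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meena"],
    "Age": [25, 230, 230, None],
})
report = sanity_report(df, range_checks={"Age": (0, 120)})
print(report)
```

Logical consistency and date validity depend on the dataset, so they are left out of this generic sketch.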

Outliers Analysis with Boxplots:


Outliers are data points that differ significantly from other observations. They can skew analysis and
model performance. Boxplots are a simple and effective way to detect outliers.

What is a Boxplot?
A boxplot is a visual summary of a data distribution:
• Median (Q2)
• Quartiles (Q1: 25th percentile, Q3: 75th percentile)
• IQR = Q3 − Q1
• Whiskers: extend to the most extreme data points within 1.5 × IQR below Q1 and above Q3
• Outliers: points outside the whiskers
Boxplot using Seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
# Boxplot of a single numeric column
sns.boxplot(x=df['Salary'])
plt.title("Boxplot of Salary")
plt.show()
# Boxplot by category
sns.boxplot(x='Department', y='Salary', data=df)
plt.title("Salary by Department")
plt.show()
Identifying Outliers Programmatically:
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find outliers
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]

Dealing with Outliers:


Remove:
df_no_outliers = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
Cap (clip values to the bounds):
df['Salary'] = df['Salary'].clip(lower_bound, upper_bound)
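A small worked example makes the IQR rule easier to follow. The sketch below uses only the Python standard library so the arithmetic is visible; the salary values are hypothetical, and statistics.quantiles with method="inclusive" is used as a stand-in for the pandas quantile call (for this data it gives the same quartiles as linear interpolation).

```python
from statistics import quantiles

salaries = [30, 32, 35, 36, 38, 40, 42, 120]   # hypothetical values, in thousands

# method="inclusive" interpolates linearly between neighbouring data points
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1                                   # 40.5 - 34.25 = 6.25
lower_bound = q1 - 1.5 * iqr                    # 34.25 - 9.375 = 24.875
upper_bound = q3 + 1.5 * iqr                    # 40.5 + 9.375 = 49.875

# Find, remove, or cap the outliers
outliers = [s for s in salaries if s < lower_bound or s > upper_bound]
removed = [s for s in salaries if lower_bound <= s <= upper_bound]
capped = [min(max(s, lower_bound), upper_bound) for s in salaries]

print(q1, q3, iqr, outliers)
```

Only 120 falls outside the bounds; removing it leaves seven values, while capping replaces it with the upper bound of 49.875.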
Histograms and Distribution Plots:
Both histograms and distribution plots are used to visualize the frequency distribution of numerical
data.
Histogram: Shows how often values fall into a range of bins.
import matplotlib.pyplot as plt
df['Age'].plot(kind='hist', bins=10, edgecolor='black')
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
Use Cases:
• Identify skewness
• Spot gaps or spikes
• Detect multimodal distributions
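Distribution Plot: Combines a histogram with a Kernel Density Estimate (KDE) curve. Seaborn's older distplot function is deprecated; histplot with kde=True is the current equivalent.
import seaborn as sns
import matplotlib.pyplot as plt
# stat='density' scales the bars so the y-axis matches the "Density" label
sns.histplot(df['Age'], kde=True, bins=10, stat='density')
plt.title("Age Distribution with KDE")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()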
Tips:
• Use bins to control granularity.
• Use KDE to smooth the distribution shape.
• For skewed data, consider log-transforming before plotting.
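What a histogram counts can be shown without any plotting library. The sketch below is a simplified illustration using equal-width bins over the data range; the ages are hypothetical, and the bin arithmetic is an assumption for teaching purposes rather than the exact code of any plotting library.

```python
ages = [18, 21, 22, 25, 29, 31, 34, 38, 45, 58]   # hypothetical sample
n_bins = 4

lo, hi = min(ages), max(ages)
width = (hi - lo) / n_bins                        # (58 - 18) / 4 = 10.0

counts = [0] * n_bins
for age in ages:
    # Index of the bin the value falls into; the maximum goes into the last bin.
    idx = min(int((age - lo) / width), n_bins - 1)
    counts[idx] += 1

edges = [lo + i * width for i in range(n_bins + 1)]
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

The counts fall steadily from the first bin to the last, which is the long right tail that indicates right skew. Changing n_bins changes the granularity of the picture.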

Styling Options for Data Visualizations in Python:

Styling enhances the readability and presentation of your plots. Python libraries like Matplotlib,
Seaborn, and Plotly offer flexible customization.
1. Matplotlib Styling:
import matplotlib.pyplot as plt
# Global style; other options: 'bmh', 'classic', 'seaborn-v0_8' (Matplotlib 3.6 or later)
plt.style.use('ggplot')
# x and y are assumed to be equal-length sequences of numbers
plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2)
plt.title("Title", fontsize=16, color='navy')
plt.xlabel("X-axis", fontsize=12)
plt.ylabel("Y-axis", fontsize=12)
plt.grid(True)
plt.legend(["Label"])
plt.xticks(rotation=45)
plt.show()
2. Seaborn Styling:
import seaborn as sns
sns.set_style("whitegrid") # Options: whitegrid, darkgrid, white, dark, ticks
# Context (for scaling fonts etc)
sns.set_context("talk") # Options: paper, notebook, talk, poster
# Palette
sns.set_palette("pastel") # Other options: deep, muted, bright, dark, colorblind
3. Plotly Styling (interactive):
import plotly.express as px
fig = px.bar(df, x='Department', y='Revenue',
color='Region', title='Revenue by Department')
fig.update_layout(title_font_size=20, plot_bgcolor='lightgray')
fig.show()

Pie Chart, Bar Chart, and Scatter Plots:


These are foundational visualizations for analysing proportions, comparisons, and relationships
between variables.
Pie Chart: Visualizes each category's share of a whole (100%). Avoid it when there are many categories; use a bar chart instead.
import matplotlib.pyplot as plt
sizes = df['Category'].value_counts()
labels = sizes.index

plt.pie(sizes, labels=labels, autopct='%1.1f%%',
        startangle=90, colors=['skyblue', 'lightgreen', 'coral'])
plt.axis('equal') # Equal aspect ratio ensures pie is circular
plt.title("Category Distribution")
plt.show()
Bar Chart: Compare quantities across categories.
# Matplotlib Example
category_sales = df.groupby('Category')['Sales'].sum()
category_sales.plot(kind='bar', color='steelblue', edgecolor='black')
plt.title("Sales by Category")
plt.xlabel("Category")
plt.ylabel("Total Sales")
plt.xticks(rotation=45)
plt.show()
# Seaborn Example
import seaborn as sns
# errorbar=None hides the error bars (seaborn 0.12 or later; older versions use ci=None)
sns.barplot(x='Category', y='Sales', data=df, estimator=sum, errorbar=None)
plt.show()
Scatter Plot: Show relationship or correlation between two numeric variables.
sns.scatterplot(x='Experience', y='Salary', data=df)
plt.title("Experience vs Salary")
# with hue (color by category)
sns.scatterplot(x='Age', y='SpendingScore', hue='Gender', data=df)
Tips:
• Pie charts work best with five or fewer categories.
• Bar charts are ideal for comparing counts or sums.
• Use scatter plots to detect linear trends, clusters, or outliers.
Pair Plots:
A pair plot visualizes pairwise relationships between multiple numeric variables. It’s particularly useful for exploratory data analysis (EDA) and correlation detection. A pair plot shows:
• Scatter plots for each pair of numeric variables
• Histograms or distributions on the diagonal
• Optional hue to differentiate categories
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
plt.show()
# Pair Plot with Hue (Categorical Grouping)
sns.pairplot(df, hue='Species')
# Selecting specific columns
sns.pairplot(df[['SepalLength', 'SepalWidth', 'PetalLength', 'Species']], hue='Species')
# Styling options
sns.pairplot(df, hue='Species', palette='Set2', markers=["o", "s", "D"])

Heatmaps:
A heatmap is a graphical representation of data where values are depicted by color. It’s especially
useful for visualizing correlation matrices, pivot tables, or any 2D data structure.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix Heatmap")

plt.show()
# Styling options
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='RdBu')
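Each cell of a correlation heatmap holds one correlation coefficient between two columns. The sketch below computes the Pearson coefficient by hand for made-up columns so the meaning of the +1 and -1 ends of the colour scale is clear; the helper name pearson and the data are hypothetical, and the code is a pure-Python illustration rather than what df.corr runs internally.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical columns: salary rises with experience, leave balance falls.
columns = {
    "Experience": [1, 2, 3, 4, 5],
    "Salary": [30, 35, 40, 45, 50],
    "LeaveBalance": [5, 4, 3, 2, 1],
}

# Build the full matrix that a heatmap would colour, one row per column.
names = list(columns)
matrix = [[round(pearson(columns[a], columns[b]), 6) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(f"{name:<12}", row)
```

The diagonal is always 1 because every column correlates perfectly with itself, which is why heatmaps of correlation matrices show a solid diagonal.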

Stacked Bar Chart:


A stacked bar chart displays totals broken down into subgroups, stacking each subgroup on top of the
previous in one bar per category. It’s ideal for visualizing part-to-whole relationships across categories.
Using Pandas + Matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
# Example Data
data = {'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
'Product A': [20, 35, 30, 35],
'Product B': [25, 32, 34, 20]}
df = pd.DataFrame(data)
# Set index for stacking
df.set_index('Quarter').plot(kind='bar', stacked=True, color=['skyblue', 'salmon'])
plt.title("Quarterly Sales by Product")
plt.ylabel("Sales")
plt.xticks(rotation=0)
plt.legend(title='Product')
plt.show()
Using Plotly (interactive):
import plotly.graph_objects as go
fig = go.Figure(data=[
go.Bar(name='Product A', x=df['Quarter'], y=df['Product A']),
go.Bar(name='Product B', x=df['Quarter'], y=df['Product B'])
])
fig.update_layout(barmode='stack', title="Quarterly Sales by Product")

fig.show()

Plotly for Interactive Data Visualization in Python:


Plotly is a powerful library for creating interactive and publication-quality charts in Python. It supports
line charts, bar plots, pie charts, scatter plots, box plots, heatmaps, and more — all interactive by
default.
Line Charts:
import plotly.express as px
fig = px.line(df, x='Date', y='Revenue', title='Revenue Over Time')
fig.show()
Bar Charts:
fig = px.bar(df, x='Region', y='Sales', color='Product', barmode='group')
fig.show()
Pie Charts:
fig = px.pie(df, names='Category', values='Sales', title='Sales Distribution')
fig.show()
Scatter Plot:
fig = px.scatter(df, x='Experience', y='Salary', color='Department', size='Bonus')
fig.show()
Box Plot:
fig = px.box(df, x='Department', y='Salary', color='Department')
fig.show()

Why Use Plotly?


1. Interactive by Default: Great for exploratory analysis and presentations. Zoom, pan, hover
tooltips, and legend toggling built-in.

2. Wide Range of Charts: Ideal for both simple and complex visualizations. Supports line, bar,
pie, scatter, box, heatmaps, and even 3D plots, maps, and gauges.
3. Web Integration: Visualizations can be exported as HTML or embedded in web apps (e.g.,
Dash).
4. Customization and Styling: Fully customizable layouts, colors, tooltips, and annotations.
Professional-grade formatting for reports and dashboards.
5. Easy to Learn: plotly.express offers a high-level API, very similar to Seaborn. Fast to prototype
visualizations.
Plotly is ideal when you need interactivity, presentation-ready visuals, or plan to build a dashboard or
web app. It complements Matplotlib and Seaborn well for modern data visualization needs.

Sample Questions:
1. Use Matplotlib to create a bar graph that displays and compares sales figures.
2. Conduct basic sanity checks on a dataset using Pandas, focusing on identifying and reporting
missing values.
3. Create an interactive line chart using Plotly to visualize trends over time.
4. Demonstrate how to clean a customer dataset by handling missing data and outliers using
appropriate data cleaning techniques.
5. Generate a distribution plot with Seaborn and explain how to assess skewness and kurtosis
from the visualization.
6. Describe the complete workflow in Python for loading, cleaning, and visualizing a dataset,
including code examples and key considerations at each step.

__________________________________________________________________________________

UNIT – 5 (DATA VISUALIZATION USING TABLEAU)


What is Data Analytics? Why Data Visualization? What is Tableau? Why Tableau? Tableau vs Excel
and Power BI, Exploratory vs Explanatory Analysis, Getting Started with Tableau, Bar Charts, Line
Charts and Filters, Area Charts, Box Plots and Pivoting, Maps and Hierarchies, Pie Charts. Treemaps
and Grouping, Dashboards - I, Joins and Splits, Numeric and String Functions, Logical and Date
Functions, Histograms and Parameters, Scatter Plots, Dual Axis Charts, Top N parameters and
Calculated Fields, Stacked Bar Charts, Dashboards - II and Filter Actions, Storytelling. Donut Charts,
Pareto Charts, Packed Bubble Charts, Highlight Tables, Control Charts, LOD Expressions: Include and
Fixed, LOD Expressions: Exclude, Motion Charts, Bullet Charts, Gantt Charts, Likert Scale Charts,
Hexbin Charts, Best Practices.

What is Data Analytics?


Data Analytics is the science of analyzing raw data to uncover meaningful insights, patterns, and
trends that support decision-making. It involves several steps such as:
1. Data Collection – Gathering data from various sources.
2. Data Cleaning – Removing inconsistencies or missing values.
3. Data Exploration – Understanding data distributions and relationships.
4. Data Visualization – Using charts/graphs to represent insights.
5. Modeling/Prediction – Applying statistical or ML techniques (in advanced analytics).
6. Decision-Making – Using insights for strategy or operational improvements.

Aspect | Data Analysis | Data Analytics
Definition | The process of inspecting, cleaning, and interpreting data | A broader field that includes data analysis plus tools, technologies, and processes for decision-making
Scope | Focused on understanding past data | Includes descriptive, diagnostic, predictive, and prescriptive methods
Goal | To extract useful information and summarize it | To solve problems, make predictions, and support decisions
Tools Used | Excel, SQL, Tableau, Python (Pandas, Matplotlib) | Same tools + advanced ones like ML models, Big Data platforms
Complexity | Generally simpler and descriptive | More complex; may involve automation and real-time processing

Data Analysis = "What happened?"


Data Analytics = "What happened + Why + What will happen + What should we do?"

Why Data Visualization?


Data Visualization is crucial because it transforms raw data into visual formats (charts, graphs, maps)
that are easy to understand and interpret. Here's why it's important:

1. Simplifies Complex Data: Makes large datasets more understandable. Converts numbers into
visuals for faster comprehension.
2. Reveals Patterns and Trends: Identifies relationships, correlations, and outliers. Helps spot
trends over time or across categories.

3. Aids Decision-Making: Helps stakeholders make data-driven decisions. Visual summaries are
more persuasive than tables.
4. Enhances Communication: Makes storytelling with data more effective. Allows clear
presentation to both technical and non-technical audiences.
5. Interactive Exploration: Enables dynamic filtering and zooming (especially with tools like
Tableau). Encourages deeper insights through user-driven analysis.

What is Tableau? Why Tableau?


Tableau is a powerful data visualization and business intelligence (BI) tool that helps users analyze,
visualize, and share data insights interactively and efficiently. It uses a drag-and-drop interface to
create charts, dashboards, and reports from structured data without needing programming skills.

Reasons to choose Tableau:

1. User-Friendly Interface: Drag-and-drop functionality for building charts and dashboards. Ideal for
users with little or no coding background.

2. Powerful Data Connectivity: Connects to various data sources: Excel, SQL, cloud databases,
Google Sheets, etc.

3. Real-Time Analysis: Supports live and extract data connections for up-to-date insights.

4. Interactive Dashboards: Allows filtering, highlighting, and drill-down features for dynamic
storytelling.

5. Rich Visualizations: Offers a wide variety of charts: bar, line, pie, maps, scatter, treemaps, and
more.

6. Integration with Other Tools: Works with R, Python, and other BI tools for advanced analytics.

7. Sharing and Collaboration: Dashboards can be shared securely via Tableau Server, Tableau
Public, or Tableau Online.

Exploratory vs Explanatory Analysis:

Both are essential—exploratory to understand the data, and explanatory to communicate insights
effectively.
Exploratory Analysis asks: "What does the data tell us?"
Explanatory Analysis answers: "What do we want to show others from the data?"
Aspect | Exploratory Analysis | Explanatory Analysis
Purpose | Discover patterns, trends, and relationships in data | Communicate specific findings or insights
When used | Early stage of analysis (data discovery) | Later stage (after insights are found)
Audience | Analysts, data scientists | Business stakeholders, general audience
Tools/Approach | Interactive tools (Tableau, Python – Seaborn, Pandas) | Focused visuals and storytelling tools
Nature | Open-ended, iterative, and dynamic | Focused, refined, and narrative-driven
Examples | Using scatter plots to detect correlation | Presenting a bar chart to show revenue growth

Tableau vs Excel and Power BI:


Feature | Tableau | Excel | Power BI
Primary Use | Data visualization and dashboards | Spreadsheet-based data analysis | Business intelligence and data visualization
Ease of Use | User-friendly drag-and-drop | Familiar to most users | Intuitive but needs some learning curve
Visualization | Advanced, interactive, and visually rich | Basic charts and graphs | Interactive visuals with AI features
Data Handling | Handles large datasets efficiently | Limited to moderate-size data | Efficient with large and diverse data sources
Interactivity | High (filters, actions, tooltips, stories) | Low | High (cross-filtering, slicers, Q&A)
Integration | R, Python, Salesforce, SQL, etc. | MS Office, VBA, limited external integration | Seamless with Microsoft ecosystem (Azure, Excel)
Cost | Paid (free Tableau Public available) | One-time purchase or part of MS Office | Freemium with Pro/Enterprise licensing
Learning Curve | Moderate | Easy | Moderate
Best For | Analysts and BI professionals | General users and small data analysis tasks | Organizations using Microsoft stack

• Tableau: Best for advanced, interactive dashboards and professional data storytelling.
• Excel: Great for quick analysis and calculations, limited in visuals.
• Power BI: Ideal for Microsoft users, balances ease, integration, and visualization.

Getting Started with Tableau


• Tableau is a powerful data visualization tool used to convert raw data into understandable dashboards and visualizations.
• The interface includes the Data Pane, Shelves (Rows, Columns), Marks Card, Show Me panel, and the Worksheet, Dashboard, and Story tabs.
• It supports drag-and-drop to create charts, maps, and dashboards.

Bar Charts, Line Charts, and Filters:


• Bar Charts: Compare categorical data; created by dragging a dimension to Rows and a measure to Columns.
• Line Charts: Show trends over time; typically use a date/time dimension on the x-axis.
• Filters: Limit the data shown in a view; can be added to the Filters shelf or shown as interactive controls.

Figure 1: Filters in Bar Chart


Area Charts, Box Plots and Pivoting:
• Area Charts: Like line charts but with filled areas; show cumulative totals over time.
• Box Plots: Display distribution, median, quartiles, and outliers.
• Pivoting: Reshape data by converting rows to columns or vice versa, often in data preparation.

Figure 2: Area Chart with Filters

Figure 3: Pivoting
Maps and Hierarchies:
• Maps: Visualize geographic data using latitude/longitude or geographic roles (e.g., country, city).
• Hierarchies: Allow drill-down in dimensions (e.g., Year > Quarter > Month).

Figure 4: Maps
Pie Charts, Treemaps and Grouping:
• Pie Charts: Show parts of a whole; best for few categories.
• Treemaps: Use nested rectangles to show proportions in a hierarchy.
• Grouping: Combine multiple dimension members into a single group.

Figure 5: Pie Chart

Figure 6: Treemaps
Dashboards – I:
• Combine multiple visualizations on a single screen.
• Add interactivity using filters, actions, and legends.

Figure 7: Sample Dashboard - I


Joins and Splits:
• Joins: Combine data from multiple tables based on keys (Inner, Left, Right, Full Outer).
• Splits: Separate a single field into multiple fields (e.g., split a full name into first and last name).

Figure 8: Table Joins and Splits


Numeric and String Functions:
• Numeric: ABS, ROUND, CEILING, FLOOR, etc.
• String: LEFT, RIGHT, MID, LEN, FIND, UPPER, LOWER, etc.
Logical and Date Functions:
• Logical: IF, CASE, AND, OR, NOT.
• Date: NOW(), TODAY(), DATEADD(), DATEDIFF(), DATENAME(), etc.
Histograms and Parameters:
• Histograms: Show distribution by dividing data into bins.
• Parameters: Dynamic input controls to change measures, dimensions, or filters.

Figure 9: Histograms (using bins)

Figure 10: Parameters


Scatter Plots:
• Show relationships between two numerical variables.
• Add a categorical field to Detail, Shape, or Color on the Marks Card for more insight.

Figure 11: Scatter Plot


Dual Axis Charts:
• Overlay two measures with different axes (e.g., bar and line combo).
• Useful for comparing trends or different units.

Figure 12: Dual Axis Chart


Top N Parameters and Calculated Fields:
• Top N: Display top items by creating a parameter and using it in a filter.
• Calculated Fields: Custom expressions for derived metrics or logic.

Figure 13: Calculated Field


Stacked Bar Charts:
• Show sub-category contribution to a total across categories.
• Use color to separate stacks.

Figure 14: Stacked Bar Chart


Dashboards - II and Filter Actions
• Advanced dashboard design with interactivity.
• Filter Actions: Filter views based on selections in other charts.

Figure 15: Sample Dashboard - II


Storytelling
• Combine multiple dashboards/views into a narrative.
• Use captions to explain insights and guide users.
Donut Charts, Pareto Charts
• Donut Charts: A pie chart with a hole in the middle; created by overlaying two pie charts on a dual axis.
• Pareto Charts: Combine bars and a line to show individual and cumulative contributions.

Figure 16: Donut Chart

Figure 17: Pareto Chart
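The numbers behind a Pareto chart are a sorted list of contributions (the bars) and a running cumulative percentage (the line). The sketch below shows that calculation in Python rather than in Tableau; the complaint categories and counts are hypothetical, and the 80% cut-off is the common rule-of-thumb threshold, assumed here for illustration.

```python
# Hypothetical complaint counts per category (not from the notes).
complaints = {"Late delivery": 50, "Damaged item": 25, "Wrong item": 15,
              "Billing error": 7, "Other": 3}

total = sum(complaints.values())                 # 100
# Bars: categories sorted from largest to smallest contribution.
ordered = sorted(complaints.items(), key=lambda kv: kv[1], reverse=True)

pareto = []          # (category, count, cumulative percentage)
running = 0
for category, count in ordered:
    running += count
    pareto.append((category, count, 100 * running / total))

for category, count, cum_pct in pareto:
    print(f"{category:<14} {count:>3}  {cum_pct:5.1f}%")

# Categories needed to reach 80% of the total.
vital_few = []
for category, _, cum_pct in pareto:
    vital_few.append(category)
    if cum_pct >= 80:
        break
```

Here three of the five categories account for 90% of complaints, which is the kind of concentration a Pareto chart is meant to expose.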


Packed Bubble Charts, Highlight Tables
• Packed Bubble: Use bubbles to represent measures; size indicates value.
• Highlight Table: Color-coded table to highlight values.

Figure 18: Packed Bubble Chart

Figure 19: Highlight Table


Motion Charts
• Animated scatter plots showing changes over time.
• Require a date field placed on the Pages shelf to animate over time.

Sample Questions:
1. Explain the difference between exploratory and explanatory data analysis in Tableau, using
practical examples.

2. How do filters and parameters differ in Tableau, and what roles do they serve in dashboards?
3. Describe how joins are used in Tableau for combining datasets, and illustrate with a real-world
use case.
4. Discuss why stacked bar charts may be more effective than standard bar charts for displaying
category-wise comparisons within a group.
5. Explain how implementing filter actions in Tableau dashboards improves user experience and
decision-making.
6. What is a Pareto chart, and why is it valuable in business data analysis? Detail the process of
creating one in Tableau and its key applications.
