Presentation Session 1 - Practical Data Science Final

Uploaded by

SHANMUGAM

Session 1

03rd Aug 2024 [5 PM – 8 PM]

Practical Data Science

Topics:
1. Introduction to Data Science
2. Data characteristics
3. Descriptive Statistics
4. Inferential Statistics
Data Science – Why it is needed?
Data Growth – IDC-Seagate Study

“Data science is an interdisciplinary field of
scientific methods, processes, algorithms
and systems to extract knowledge or
insights from data in various forms, either
structured or unstructured”

Dr Vengadeswaran S, IIIT Kottayam Practical Data Science and Advanced ML 2


Data Science – Life Cycle
[Diagram: Data Science life cycle with the stages Data Acquisition, Data Preparations, Data Mining, Modelling Building and Testing, Visualisation]
1. Data Acquisition: collect data from all its raw sources (databases and
flat files), integrate and transform it into a homogeneous format, and
collect it into a “data warehouse” using ETL tools
2. Data Preparations:
• Data Cleaning (remove bad data and null values, handle missing values)
• Data Transformation: takes raw data and turns it into the desired
output by normalizing (min-max, z-score)
• Handling Outliers
• Data Reduction

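The two normalizations named under Data Transformation can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the choice of the population standard deviation for the z-score are assumptions, and the input column is illustrative.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

weights = [60, 55, 85, 90, 70, 65, 70, 45]  # illustrative raw column
scaled = min_max_scale(weights)        # every value now lies in [0, 1]
standardized = z_score(weights)        # mean 0, standard deviation 1
```

Min-max scaling keeps the shape of the data but is sensitive to extreme values, while the z-score expresses each value as a number of standard deviations from the mean.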
Data Science – Life Cycle
3. Data Mining: uncover patterns and relationships in the data to support
better business decisions. It is a discovery process for finding hidden
and useful knowledge, commonly known as exploratory data analysis.
4. Model Building and Testing:
• Modelling is the heart of data analysis. It takes organized data as
input and gives output.
• Suitable ML/DL models are built for the data and the problem, using a
training data set, to gain deeper insights and predict outcomes
• Models are tested against predetermined test data to assess the
accuracy of the results
• Models are fine-tuned to improve the results

Data Science – Life Cycle
5. Visualisation:
• Communicate insights from data through visual representation
• Explain the process of operationalisation
• Communicate results
• Highlight the findings, correlations, etc.


Data Science – Benefits
• Make faster and better business decisions
• Develop insights that are beyond human capabilities
• Act at the right time and take advantage of opportunities
• Innovate new products and solutions
• Apply risk analysis practices to make informed business decisions
• Measure performance, etc.


Data Science – Applications

Recommendation
Systems


Data Science – What we cover?

Session 1: Introduction to Data Science - Data characteristics,
Descriptive and Inferential Statistical Analysis

Session 2: Types of data and dataset, Different pre-processing Techniques:
Finding Missing Data and handling them, Encoding Categorical Data, Data
Transformation and Normalization, Feature scaling, Indexing and slicing,
Filtering data, Outlier identification and removal. Hands-on in Python

Session 3: Complex Merging and Concatenating, Reshaping Data,
Grouping and Aggregation, Advanced Grouping Techniques, training and
test split, Cross validation Techniques. Practical exercise focused on
pre-processing a complex real-world dataset


Data Visualization – Why it is needed?
Example 1
Sales of jackets and sales of socks over the course of the previous year

Visualization:
• Improves insights
• Enables faster decision making


Data Visualization – Why it is needed?
Example 2


Data Visualization – Why it is needed?

Data visualization representation:
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards

• Graphical representation of information and data.
• Communicate insights from data through visual representation
• Goal → turn large datasets into visual graphics → complex relationships
within the data become easy to understand.

Data Visualization – What we cover?

Session 1 - Overview of Data Visualisation - Data abstraction, Scalar
and vector visualisation - Data visualisation for Numerical and
categorical data; Histogram and Bar Chart.

Session 2 - Boxplot, Line Plots, Pie Charts, Scatter Plots, Heatmaps for
Correlation Analysis, Text visualisation. Hands On - Matplotlib for creating
multiple plots

Session 3 - Dashboard creation using visualization tools for the use cases:
Finance/marketing/healthcare (any one) etc.


2. Data Characteristics

Outline:
• Data types

• Measurements of Data

• Dataset types

• Semantics



Data Types
• Different types of data → Statistical techniques & Visualization
• Classification is essential


Qualitative/Categorical data

• Categories or groups
• Answers such as Yes or No
• Qualitative data can be separated into different categories
that are distinguished by some nonnumeric characteristic.
• E.g.: Genders (male/female) of professional athletes.
• Expressed in terms of natural language descriptions
• Sometimes categorical data can take numerical values, but
those numbers do not have mathematical meaning.
• E.g.: Birthdate; calculating the average of such values is not meaningful.

Quantitative data

• Information that can be measured and written down with
numbers and not in any language or descriptive form
• Quantitative data: your height, your shoe size, and other
numbers representing counts or measurements.
• Example: total count of your employees


Classify as qualitative or quantitative
• Colors of cars in a dealer’s showroom.
• Number of seats in movie theaters.
• Classification of patients based on nursing care needed
(complete, partial, or self care)
• Lengths of newborn cats of a certain species.
• Number of complaint letters received by an airline per month.


Working with Quantitative Data

Quantitative data can further be distinguished between
Discrete and Continuous types.


Discrete
• Data result when the number of possible values is either a
finite number or a ‘countable’ number of possible values.
• Discrete data consist of distinct and separate values, e.g.
0, 1, 2, 3
Example:
• The number of eggs that a hen lays
• Marks obtained by a student


Continuous
• A continuous variable is a variable which can take an
uncountable (infinite) set of values, such as any value within a range.
• Values can be integers or decimals
Example:
• The amount of milk that a cow produces, e.g. 2.3431 gallons per day.
• Height of individuals: any value within a specific range, possibly with a
fractional component (e.g., 5.7 feet).
• Temperature: can be measured with decimal values,
taking any value within a range.

Classify as discrete or continuous
• Number of cartons of milk manufactured each day.
• Temperatures of airplane wings.
• Incomes of college students on work-study programs.


Levels of Measurement

Another way to classify data is to use levels of measurement.
Four levels of measurement
• Nominal
• Ordinal
• Interval
• Ratio


Why Is Level of Measurement Important?

• Helps you decide what statistical analysis is appropriate on
the values that were assigned
• Helps you decide how to interpret the data from that variable


1. Nominal level of measurement

• Characterized by data that consist of names, labels,
or categories only.
• The data cannot be arranged in an ordering scheme
(such as low to high)
• Example: survey responses yes, no, undecided


2. Ordinal level of measurement

• Involves data that may be arranged in some order,
but differences between data values either cannot
be determined or are meaningless
• Example: Course grades A, B, C, D, or F


3. Interval level of measurement
• Like the ordinal level, with the additional property
that the difference between any two data values is
meaningful. However, there is no natural zero
starting point (where none of the quantity is present)
Example: Years 1000, 2000, 1776, and 1492;
temperature in °C or °F


4. Ratio level of measurement
• Interval level modified to include the natural zero starting
point (where zero indicates that none of the quantity is
present).
• For values at this level, differences and ratios are
meaningful.
Example: Prices of college textbooks ($0 represents no cost);
weight of a person.


Summary - Levels of Measurement
• Nominal - categories only
• Ordinal - categories with some order
• Interval - differences but no natural starting point
• Ratio - differences and a natural starting point
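The practical difference between the interval and ratio levels can be checked numerically. The sketch below is an illustration with hypothetical values: a ratio of two Celsius temperatures changes when the same temperatures are expressed in Fahrenheit (no natural zero), while a ratio of two weights is the same in kilograms and pounds (natural zero).

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval level: temperature in degrees Celsius (the zero point is arbitrary).
t1_c, t2_c = 10.0, 20.0
t1_f, t2_f = celsius_to_fahrenheit(t1_c), celsius_to_fahrenheit(t2_c)

ratio_c = t2_c / t1_c   # 20 / 10
ratio_f = t2_f / t1_f   # 68 / 50, a different number for the same temperatures

# Differences remain comparable: both units agree that the gap 10 -> 20
# is the same size as the gap 20 -> 30.
gap_ratio_c = (20.0 - 10.0) / (30.0 - 20.0)
gap_ratio_f = (celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(10.0)) / (
    celsius_to_fahrenheit(30.0) - celsius_to_fahrenheit(20.0)
)

# Ratio level: weight (zero means no weight), so ratios survive a unit change.
KG_TO_LB = 2.20462  # approximate conversion factor
w1_kg, w2_kg = 40.0, 80.0
ratio_kg = w2_kg / w1_kg
ratio_lb = (w2_kg * KG_TO_LB) / (w1_kg * KG_TO_LB)
```

Because `ratio_c` and `ratio_f` disagree, a statement like "twice as hot" is not meaningful for Celsius or Fahrenheit values, whereas "twice as heavy" holds in any unit.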


Classify each as nominal, ordinal, interval, or ratio

• Horsepower of motorcycle engines.
• Ratings of corporations in Houston (poor, fair, good, excellent)
• Salaries of the top 5 CEOs in the United States
• Marital status of respondents to a survey of savings accounts.


The Hierarchy of Levels

Ratio: Absolute zero
Interval: Distance is meaningful
Ordinal: Attributes can be ordered
Nominal: Attributes are only named; weakest


Nominal Scale & Ordinal Scale



Questionnaire



Reflection Spot
Understand the data



Dataset types



Dataset types
Table
• Data represented in rows and columns.
• Cell: the combination of a row and a column (an item and
an attribute); it contains a value for that pair


Dataset types
Networks
• Used when there is some kind of relationship between
two or more items.
• An item in a network is often called a node.
• A link is a relation between two items.
• Tables can represent networks
  - Many-to-many relationships
  - Can also be stored in dedicated graph databases or files
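As an illustration of how a table can represent a network, the sketch below stores one link per row and derives each node's neighbours from it. The link table, the names in it, and the function name are hypothetical, and links are assumed to be undirected.

```python
from collections import defaultdict

# Hypothetical link table: one row per link, two columns naming the linked items.
links = [
    ("Lakshan", "Ruhan"),
    ("Lakshan", "Thejas"),
    ("Ruhan", "Thejas"),
    ("Thejas", "Basil"),
]

def build_adjacency(link_rows):
    """Turn a table of links into a mapping from each node to its neighbours."""
    adjacency = defaultdict(set)
    for a, b in link_rows:
        adjacency[a].add(b)
        adjacency[b].add(a)  # assumption: links are undirected
    return dict(adjacency)

adjacency = build_adjacency(links)

# Number of links attached to each node.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
```

The same item can appear in many rows, which is how a two-column table expresses a many-to-many relationship.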


Dataset types
Trees
• Networks with hierarchical structure are more
specifically called trees.
• In contrast to a general network, trees do not
have cycles: each child node has only one
parent node pointing to it
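The tree property can be checked directly from a list of (parent, child) links. The following is a minimal sketch under stated assumptions: the function name and the sample nodes are hypothetical, and a tree is taken to have exactly one root.

```python
def is_tree(nodes, links):
    """Return True if the (parent, child) links form a single rooted tree.

    Checks that every child has exactly one parent, that there is exactly
    one root (a node without a parent), and that following parents from
    any node reaches the root without revisiting a node (no cycles).
    """
    parent = {}
    for p, c in links:
        if c in parent:
            return False  # a child with a second parent
        parent[c] = p

    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        return False
    root = roots[0]

    for start in nodes:
        seen = set()
        node = start
        while node != root:
            if node in seen or node not in parent:
                return False  # cycle, or a chain that never reaches the root
            seen.add(node)
            node = parent[node]
    return True

nodes = ["A", "B", "C", "D"]
tree_links = [("A", "B"), ("A", "C"), ("B", "D")]
network_links = tree_links + [("C", "D")]  # D now has two parents
```

Here `is_tree(nodes, tree_links)` holds, while adding the extra link turns the structure into a general network.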


Dataset types
Field
• The field dataset type also contains attribute values
associated with cells.
• Each cell in a field contains measurements or calculations
from a continuous domain
• There are infinitely many values that you might measure, so
you could always take a new measurement between any
two existing ones.
• Temperature, pressure, speed, force, and density;
mathematical functions can also be continuous.

Dataset types
Fields

Scalar Fields, Vector Fields, Tensor Fields
Each point in space has an associated...


Dataset types
Spatial field
• Continuous data often takes the form of a spatial field, where the cell
structure of the field comes from sampling at spatial positions.
• For example, in medical imaging, suspected tumours are identified by
their distinctive shapes or densities.


Dataset types
Spatial Data Example: MRI
• Medical scan of a human body containing measurements
indicating the density of tissue at many sample points


Dataset types
Grid Field
• When a field contains data created by sampling at
completely regular intervals, the cells form a uniform grid.
• No need to explicitly store the grid geometry in terms of its
location in space, or the grid topology



Dataset types
Grid Fields
• Grids are necessary to sample continuous data:
• A rectilinear grid supports non-uniform
sampling, allowing efficient storage of
information that has high complexity in
some areas and low complexity in others,
at the cost of storing some information
about the geometric location of each row.
[Figure: a uniform grid and a rectilinear grid]
• Interpolation: “how to show values between the sampled points in
ways that do not mislead”
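One common way to show values between sampled points is linear interpolation. The sketch below is an illustration only: the slide does not name an interpolation method, the sample positions and values are hypothetical, and the positions are assumed to be strictly increasing but not evenly spaced, as on one axis of a rectilinear grid.

```python
from bisect import bisect_right

def interpolate(xs, ys, x):
    """Linearly interpolate a value at x from samples (xs[i], ys[i]).

    xs must be strictly increasing; the spacing may be non-uniform,
    which is why the positions have to be stored explicitly.
    """
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range")
    i = bisect_right(xs, x) - 1      # index of the sample at or just left of x
    if i == len(xs) - 1:
        return ys[-1]                # x is exactly the last sample position
    x0, x1 = xs[i], xs[i + 1]
    t = (x - x0) / (x1 - x0)         # fraction of the way from x0 to x1
    return ys[i] + t * (ys[i + 1] - ys[i])

# Hypothetical samples: denser between 1 and 2, sparse elsewhere.
xs = [0.0, 1.0, 1.5, 2.0, 5.0]
ys = [10.0, 12.0, 15.0, 13.0, 7.0]

mid_dense = interpolate(xs, ys, 1.25)   # halfway between the samples at 1.0 and 1.5
mid_sparse = interpolate(xs, ys, 3.5)   # halfway between the samples at 2.0 and 5.0
```

In the sparse region the interpolated value rests on much less evidence than in the dense region, which is the kind of situation where a display can mislead if it hides where the real samples are.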


Dataset types
Geometry
• The geometry dataset type specifies information about the
shape of items with explicit spatial positions.
• The items could be points, or one-dimensional lines or curves,
or 2D surfaces or regions, or 3D volumes.



Dataset types
Dataset Availability
• The default approach to visualization assumes that the entire dataset
is available all at once, as a static file.
• One kind of dynamic change is to add new items or delete
previous items. Another is to change the values of existing
items.

Semantics matter
Refer to the following data:

Task 1:
Basil, 7, S, Pear
What do you infer? Any guess?

Task 2:
Lakshan 1111 89 90 92
Ruhan 2222 78 67 90
Thejas 3333 82 98 88
What do you infer? Any guess?

Student Name   Regno   Data Visualization   Computer Graphics   Human Computer Interaction
Lakshan        1111    89                   90                  92
Ruhan          2222    78                   67                  90
Thejas         3333    82                   98                  88


Semantics matter
• Semantics: real-world meaning of the data
  • Does a word represent a city or a fruit? Does a number represent a day
    of the month, or an age, or a measurement of height, or a unique code
    for a specific person, or a postal code for a neighbourhood, or a
    position in space, etc.?
• Type: structural or mathematical interpretation
  • data level - item, link, attribute?
  • dataset level - table, tree, field of sampled values, etc.?
• Both often require metadata
  - Sometimes we can infer some of this information


3. Descriptive Statistics
Outline:
• Mean
• Median
• Mode
• Range
• Variance
• Standard Deviation
• Percentile
• Interquartile Range



Central Tendency of Data
Single value that attempts to describe the whole data
using a central point or central location of the data.
• Mean
• Median
• Mode



Mean
• Arithmetic average
• Sum of the values, divided by the number of values in the data
Example: Observations for weight of students in a class
(60, 55, 85, 90, 70, 65, 70, 45)
Average weight of student = 540/8 = 67.5
Trim parameter
• The values in the vector are sorted, and
• The required number of observations is dropped from each end before
calculating the mean.
NA parameter (NA = Not Available)
• If there are missing values, the mean function returns NA unless they
are removed first.
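The mean, the trim idea, and the handling of missing values can be sketched in Python as follows. The helper names `trimmed_mean` and `mean_skipping_missing` are hypothetical, and representing a missing value as `None` or NaN is an assumption of this sketch.

```python
import math
from statistics import mean

weights = [60, 55, 85, 90, 70, 65, 70, 45]
average = mean(weights)  # 540 / 8

def trimmed_mean(values, trim=0.0):
    """Sort the values, drop a fraction `trim` from each end, then average."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)          # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

def mean_skipping_missing(values):
    """Average only the values that are present (not None and not NaN)."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return mean(present)

trimmed = trimmed_mean(weights, trim=0.25)   # drops 45, 55 and 85, 90
with_gap = weights + [float("nan")]          # one missing observation
naive = mean(with_gap)                       # NaN propagates into the result
cleaned = mean_skipping_missing(with_gap)
```

With a missing value in the data, the plain mean is itself missing (NaN here), which mirrors the NA behaviour described above; removing the missing entries first restores the expected average.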


Median
• Middle value of the sorted list (ascending)

Example: Observations for weight of students in a class
Sample 1: (60, 55, 85, 90, 70, 65, 70) → Total number is odd
Ascending order (55, 60, 65, 70, 70, 85, 90) → Median = 70
Sample 2: (60, 55, 85, 65, 70, 45) → Total number is even
Ascending order (45, 55, 60, 65, 70, 85) → Median = Mean(60, 65) = 62.5
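The odd and even cases can be reproduced with a short sketch. The function `median_by_hand` is a hypothetical helper that follows the sort-then-pick-the-middle procedure; the standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median

sample1 = [60, 55, 85, 90, 70, 65, 70]   # odd number of observations
sample2 = [60, 55, 85, 65, 70, 45]       # even number of observations

def median_by_hand(values):
    """Sort ascending, then take the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

m1 = median_by_hand(sample1)
m2 = median_by_hand(sample2)
```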


Mode
• Value that has the highest number of occurrences in a set of data.
• Sample 1: (2,1,2,3,1,2,3,4,1,5,5,3,2)
• Mode = 2
• Applies to both numeric and character data.
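A minimal sketch of finding the mode, using `collections.Counter` to count occurrences and `statistics.multimode` (Python 3.8+) to handle ties. The character data and the tied list are hypothetical examples.

```python
from collections import Counter
from statistics import multimode

sample1 = [2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2]

counts = Counter(sample1)                           # occurrences of each value
mode_value, mode_count = counts.most_common(1)[0]   # most frequent value and its count

# The same idea works for character data (hypothetical example).
letters = ["a", "b", "a", "c", "b", "a"]
letter_modes = multimode(letters)

# When several values tie for the highest count, all of them are modes.
tied_modes = multimode([1, 1, 2, 2, 3])
```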


Impact of Outlier in Central tendency

• An outlier is a data point that differs significantly
from other observations.
• It may be due to variability in the measurement, or it may
be the result of experimental error
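The effect of a single outlier on the mean and the median can be seen with a small sketch. The value 450 is a hypothetical data-entry error added to the weight observations used earlier; it is not from the slides.

```python
from statistics import mean, median

weights = [60, 55, 85, 90, 70, 65, 70, 45]
with_outlier = weights + [450]   # hypothetical erroneous observation

mean_before, median_before = mean(weights), median(weights)
mean_after, median_after = mean(with_outlier), median(with_outlier)

mean_shift = mean_after - mean_before        # large change
median_shift = median_after - median_before  # small change
```

The mean moves from 67.5 to 110, while the median only moves from 67.5 to 70, which is why the median is often preferred when outliers are suspected.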


Summary of Central tendency



Measures of dispersion
How widely is the set of data spread out?
• Range
• Variance
• Standard Deviation
• Percentile
• Interquartile range


Range
Let us consider two sets of observations:

          Sample 1                Sample 2
Data      (-10, 0, 10, 20, 30)    (8, 9, 10, 11, 12)
Mean      10                      10
Median    10                      10
Range     40                      4

To differentiate → Range = Max – Min
Range considers only extreme values.
Hence there is a need for other metrics:
• Variance
• Standard deviation

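The comparison can be reproduced with a short sketch. `sample3` is a hypothetical third set, added here only to show that two samples can share the same range while being spread differently in between.

```python
from statistics import mean, median

def value_range(values):
    """Range = Max - Min; only the two extreme values matter."""
    return max(values) - min(values)

sample1 = [-10, 0, 10, 20, 30]
sample2 = [8, 9, 10, 11, 12]
sample3 = [-10, 10, 10, 10, 30]   # hypothetical: same extremes as sample1

summary = {
    name: (mean(s), median(s), value_range(s))
    for name, s in [("sample1", sample1), ("sample2", sample2), ("sample3", sample3)]
}
```

`sample1` and `sample3` have the same mean, median, and range even though most of `sample3` sits at the centre, so the range alone cannot describe how the values in between are spread.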
Variance
• The average of the squared differences from the mean
• Sample 1: (-10, 0, 10, 20, 30) → Mean = 10
• $\sigma^2 = \frac{(-10-10)^2 + (0-10)^2 + (10-10)^2 + (20-10)^2 + (30-10)^2}{5} = \frac{1000}{5} = 200$
Steps to find the variance:
Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points n (population variance), or by
n - 1 (sample variance).

                             Sample 1                Sample 2
Data                         (-10, 0, 10, 20, 30)    (8, 9, 10, 11, 12)
Population variance (÷ n)    200                     2
Sample variance (÷ (n - 1))  250                     2.5

Linear measurements → square measurements: the variance is in squared units.
Hence the value should be normalized back to the original units.

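The four steps can be sketched directly in Python. The helper `variance_by_steps` and its `sample` flag are hypothetical names; the standard library's `pvariance` (divide by n) and `variance` (divide by n - 1) are used as cross-checks.

```python
from statistics import mean, pvariance, variance

def variance_by_steps(values, sample=False):
    """Follow the four steps: mean, squared distances, sum, divide."""
    mu = mean(values)                                   # Step 1
    squared = [(v - mu) ** 2 for v in values]           # Step 2
    total = sum(squared)                                # Step 3
    divisor = len(values) - 1 if sample else len(values)
    return total / divisor                              # Step 4

sample1 = [-10, 0, 10, 20, 30]
sample2 = [8, 9, 10, 11, 12]

pop1, pop2 = variance_by_steps(sample1), variance_by_steps(sample2)
smp1 = variance_by_steps(sample1, sample=True)
smp2 = variance_by_steps(sample2, sample=True)
```

Dividing by n gives 200 and 2; dividing by n - 1 gives 250 and 2.5. Which divisor is appropriate depends on whether the data are the whole population or a sample drawn from it.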
Standard Deviation
• Standard deviation (σ) measures the spread of a data distribution.
• SD is the square root of the variance
• Taking the square root gives normalized values in the original units
• The more spread out a data distribution is, the greater its standard deviation.
• SD close to 0 indicates that the data points tend to be close to the mean.
• SD cannot be negative.
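A brief sketch of the relationship between variance and standard deviation, using the population versions from the standard library. The list `constant` is a hypothetical example of data with no spread.

```python
import math
from statistics import pstdev, pvariance

sample1 = [-10, 0, 10, 20, 30]
sample2 = [8, 9, 10, 11, 12]
constant = [10, 10, 10, 10, 10]   # hypothetical: no spread at all

# SD is the square root of the variance, so it is back in the original units.
sd1 = math.sqrt(pvariance(sample1))
sd2 = math.sqrt(pvariance(sample2))
sd_constant = pstdev(constant)
```

The more widely spread `sample1` has the larger standard deviation (about 14.14 against about 1.41), and data with no spread have a standard deviation of 0.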


Percentile
• Percentile describes how a score compares to other scores from the same set.
• The percentage of values in a set of data scores that fall below a given value.

Step 1: Arrange the scores in ascending order
Step 2: Position = Percentile × (Total observations + 1) / 100


Percentile
Let us find the 80th percentile for 15 observations:
Position = 80 × (15 + 1) / 100 = 0.8 × 16 = 12.8 ≈ 13
The 13th observation is the 80th percentile.
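The position formula can be sketched as below. The 15 scores are hypothetical (the slide does not list its data), and rounding the position to the nearest whole observation is an assumption based on the worked example, where 12.8 becomes the 13th observation.

```python
def percentile_position(p, n):
    """Position of the p-th percentile among n sorted observations: p * (n + 1) / 100."""
    return p * (n + 1) / 100

def percentile_value(scores, p):
    """Return the observation at the (rounded) percentile position."""
    ordered = sorted(scores)                     # Step 1: ascending order
    position = percentile_position(p, len(ordered))   # Step 2
    rank = round(position)                       # assumption: nearest whole observation
    rank = max(1, min(rank, len(ordered)))       # keep the rank inside the data
    return ordered[rank - 1]                     # ranks start at 1, indexes at 0

# Hypothetical scores for 15 students.
scores = [35, 42, 47, 51, 55, 58, 62, 66, 70, 73, 77, 81, 85, 90, 96]

position_80 = percentile_position(80, len(scores))
score_80 = percentile_value(scores, 80)
```

Other conventions interpolate between the 12th and 13th observations instead of rounding, so library functions may return a slightly different value for the same data.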


Interquartile Range
• Interquartile range represents the difference between the 1st Quartile
(25th percentile) and the 3rd Quartile (75th percentile)
• Spread of the middle 50% of the data
• Obtaining Quartiles
1. Order the data
2. Find the median
3. Look at the lower half of the data set and find the “median” of this lower half: Q1
4. Look at the upper half of the data set and find the “median” of this upper half: Q3
5. Inter-Quartile Range (IQR) = Q3 - Q1


Interquartile Range
Consider these 10 ages:
05 11 21 24 27 28 30 42 50 52
Median = average of 27 and 28 = 27.5 (between the 5th and 6th values)
Bottom half: 05 11 21 24 27 → median of the bottom half (Q1) = 21
Top half: 28 30 42 50 52 → median of the top half (Q3) = 42
Inter-Quartile Range (IQR) = Q3 - Q1 = 42 - 21 = 21


Interquartile Range
Example 2: Quartiles, n = 53 (sorted values, read down each column)
100 124 148 170 185 215
101 125 150 170 185 220
106 127 150 172 186 260
106 128 152 175 187
110 130 155 175 192
110 130 157 180 194
119 133 165 180 195
120 135 165 180 203
120 139 165 180 210
123 140 170 185 212

L(M) = (53 + 1) / 2 = 27 → Median (Q2) = 165
Bottom half has n* = 26 → L(Q1) = (26 + 1) / 2 = 13.5 from the bottom
→ Q1 = avg(127, 128) = 127.5
Top half has n* = 26 → L(Q3) = 13.5 from the top
→ Q3 = avg(185, 185) = 185

"5 point summary" = {Min, Q1, Median, Q3, Max}
= {100, 127.5, 165, 185, 260}

Inter-Quartile Range (IQR) = Q3 - Q1
= 185 – 127.5
= 57.5 (“spread of middle 50%”)
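The median-of-halves procedure used in the two interquartile range examples can be sketched as one function. The name `five_number_summary` is hypothetical; when the number of observations is odd, the overall median is left out of both halves, as in Example 2.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # bottom half (excludes the median when n is odd)
    upper = ordered[n - half:]    # top half
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

example2 = [
    100, 101, 106, 106, 110, 110, 119, 120, 120, 123,
    124, 125, 127, 128, 130, 130, 133, 135, 139, 140,
    148, 150, 150, 152, 155, 157, 165, 165, 165, 170,
    170, 170, 172, 175, 175, 180, 180, 180, 180, 185,
    185, 185, 186, 187, 192, 194, 195, 203, 210, 212,
    215, 220, 260,
]

_, q1_ages, _, q3_ages, _ = five_number_summary(ages)
iqr_ages = q3_ages - q1_ages

summary2 = five_number_summary(example2)
iqr2 = summary2[3] - summary2[1]
```

Library routines for quartiles often use interpolation rules that differ from this textbook method, so their Q1 and Q3 can differ slightly on the same data.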


4. Inferential Statistics

Outline:
• Normal Distribution
• Correlation
• Covariance
• Central Limit Theorem
• Hypothesis testing
