
Dsa 2

The document discusses statistical concepts related to data exploration and analysis, highlighting the importance of understanding data objects, attribute types, and descriptive measures for both categorical and numerical variables. It emphasizes exploratory data analysis (EDA) as a method for summarizing raw data, discovering patterns, and checking for errors. Additionally, it categorizes data sets into various types, including record, graph, ordered, and document data.

Uploaded by

Raj
Data Science and Analytics 2
Statistical Concepts

Chittaranjan Pradhan
School of Computer Engineering,
KIIT University

Data Exploration
Data Objects and Attribute
    Attribute (or Variable) Types
    Properties of Attribute Values
    Types of Data Sets
Descriptive Measures for Categorical Variables
Descriptive Measures for Numerical Variables
    Measure of Central Tendency
    Measure of Variability
    Measure of Shape
    Outliers and Missing Values
Relationships Among Variables

Data Exploration
• Data exploration refers to the initial step in data analysis, in which data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to better understand the nature of the data
• Raw data is typically reviewed with a combination of manual workflows and automated data-exploration techniques to visually explore data sets, look for similarities, patterns and outliers, and identify the relationships between different variables
• This is also sometimes referred to as exploratory data analysis, which is a statistical technique employed to analyze raw data sets in search of their broad characteristics
• Starting with data exploration helps users make better decisions on where to dig deeper into the data and gain a broad understanding of the business when asking more detailed questions later

Exploratory Data Analysis
• Raw data are not very informative. Exploratory Data Analysis (EDA) is how we make sense of the data by converting them from their raw form to a more informative one
• EDA consists of:
    • organizing and summarizing the raw data,
    • discovering important features and patterns in the data and any striking deviations from those patterns, and then
    • interpreting our findings in the context of the problem
• EDA can be useful for:
    • describing the distribution of a single variable (center, spread, shape, outliers)
    • checking data (for errors or other problems)
    • checking assumptions for more complex statistical analyses
    • investigating relationships between variables

Important Features of Exploratory Data Analysis
• Examining Distributions: exploring data one variable at a time (univariate)
• Examining Relationships: exploring data two variables at a time (bivariate)
• In Exploratory Data Analysis, our exploration of data will always consist of two elements: visual displays, supplemented by numerical measures

Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents an entity
    • Ex: customer in sales database
• An attribute is a data field, representing a characteristic of a data object
    • Ex: Cust_ID
• Attribute values are numbers or symbols assigned to an attribute

Attribute (or Variable) Types
• Nominal Attributes
    • Nominal means "relating to names". The values of a nominal attribute are symbols or names of things
    • Ex: Hair_color -> black, grey, white
      Marital_status -> single, married, divorced
    • The values do not have any meaningful order
• Binary Attributes
    • A binary attribute is a nominal attribute with only two states: 0 or 1
    • Ex: Patient smokes (1) or doesn't smoke (0)
    • A binary attribute is symmetric if both of its states are equally important
      Ex: male or female states in gender attribute
    • A binary attribute is asymmetric if the outcomes of the states are not equally important
      Ex: positive or negative of medical tests

Attribute (or Variable) Types...
• Ordinal Attributes
    • An ordinal attribute has possible values with a meaningful order or ranking among them, but the magnitude between successive values is not known
    • Ex: tall, medium, short values of height attribute
      assistant, associate and professor in professional ranks
    • The central tendency of an ordinal attribute can be represented by its mode and its median (the middle value in an ordered sequence), but the mean cannot be defined
• Numeric Attributes
    • A numeric attribute is quantitative. It can be interval-scaled or ratio-scaled
    • Interval-scaled attributes
        • measured on a scale of equal-size units
        • No true zero-point
        • Ex: calendar dates
    • Ratio-scaled attributes
        • a value can be expressed as a multiple of another value
        • Inherent zero-point
        • Ex: years of experience of an employee
          frequency of words in a document

Attribute (or Variable) Types...
• Numerical variables can be classified as discrete or continuous
• Discrete vs. Continuous Attributes
    • Discrete Attribute
        • a finite or countably infinite set of values
        • often represented as integer variables
        • Ex: zip codes; profession
    • Continuous Attribute
        • infinite number of states
        • typically represented as floating-point variables
        • Ex: height; rainfall
• A variable is numerical if meaningful arithmetic can be performed on it. Otherwise, the variable is categorical
• Data sets can also be categorized as cross-sectional or time series
    • Cross-sectional data are the data on a cross section of a population at a distinct point in time
    • Time series data are the data collected over time

Properties of Attribute Values
• Distinctness : =, ≠
  Order : <, >
  Addition : +, - (differences are meaningful)
  Multiplication : *, / (ratios are meaningful)
• Nominal attribute : distinctness
• Ordinal attribute : distinctness & order
• Interval attribute : distinctness, order & addition
• Ratio attribute : all 4 properties

Types of Data Sets
• Record Data
    • Relational records
    • Data matrix, ex: numerical matrix
    • Document data
    • Transaction data
• Graph Data
    • World wide web
    • Social and information network
    • Molecular structures
• Ordered Data
    • Video data
    • Temporal data: time-series
    • Sequential data: transaction sequences
    • Genetic sequence data
• Spatial and Multimedia
    • Spatial data: map
    • image data

Types of Data Sets...
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such data can be represented by an m × n matrix, with m rows (one per object) and n columns (one per attribute)

Types of Data Sets...
Document Data
Each document becomes a term vector
• each term is a component (attribute) of the vector
• the value of each component is the number of times the corresponding term occurs in the document

Transaction Data
A special type of record data, where each record (transaction) involves a set of items

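The term-vector idea can be sketched in a few lines of Python. This is only an illustration: the three sample documents, the whitespace tokenization, and the helper name `term_vectors` are assumptions, not part of the slides.

```python
from collections import Counter


def term_vectors(documents):
    """Turn each document into a term-count vector over a shared vocabulary."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in the document.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical documents, used only to show the shape of the result.
docs = ["team play ball", "ball score ball", "team win"]
vocabulary, vectors = term_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", vector)
```

Each row is one document and each column one term, so the result is also a data matrix in the sense described earlier.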
Types of Data Sets...
Graph Data
Ex: Generic graph and HTML links

Chemical Data

Types of Data Sets...
Ordered Data

Descriptive Measures for Categorical Variables
• Descriptive statistics are the first pieces of information used to understand and represent a dataset. The goal is to describe the main features of numerical and categorical information with simple summaries
• Frequencies
    • Frequencies are presented in contingency tables, which give counts for each category or combination of categories of the categorical variables
    • Ex: we may want to get the total count of female and male customers
      If we want to understand the number of married and single female and male customers, we can produce a cross-classification table
• Proportions
    • Contingency tables that present the proportions (percentages) of each category or combination of categories
    • Ex: Percentages for gender by marital status

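A minimal sketch of frequencies and proportions, using only the Python standard library. The customer records below are made up for illustration; only the gender and marital-status categories come from the example above.

```python
from collections import Counter

# Hypothetical customer records: (gender, marital_status).
customers = [
    ("female", "married"), ("female", "single"), ("female", "married"),
    ("male", "single"), ("male", "married"), ("male", "single"),
    ("female", "single"), ("male", "single"),
]

# One-way frequencies: total count of female and male customers.
gender_counts = Counter(gender for gender, _ in customers)

# Two-way frequencies: one count per (gender, marital_status) cell.
cross_counts = Counter(customers)

# Proportions: each cell count divided by the total number of customers.
total = len(customers)
cross_proportions = {cell: n / total for cell, n in cross_counts.items()}

for cell, n in sorted(cross_counts.items()):
    print(cell, n, f"{100 * cross_proportions[cell]:.1f}%")
```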
Descriptive Measures for Categorical Variables...
• Marginals
    • Marginals show the total counts or percentages across columns or rows in a contingency table
    • We can compute the marginal frequencies and the percentages for these marginal frequencies

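Marginal frequencies and their percentages can be sketched as row and column sums of a contingency table. The table values here are hypothetical and the nested-dictionary layout is just one convenient assumption.

```python
from collections import defaultdict

# Hypothetical contingency table: table[gender][marital_status] = count.
table = {
    "female": {"married": 2, "single": 2},
    "male": {"married": 1, "single": 3},
}

# Row marginals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column marginals: sum down the rows of each column.
column_totals = defaultdict(int)
for cells in table.values():
    for column, n in cells.items():
        column_totals[column] += n

# Marginal percentages are taken relative to the grand total.
grand_total = sum(row_totals.values())
row_percent = {r: 100 * n / grand_total for r, n in row_totals.items()}
column_percent = {c: 100 * n / grand_total for c, n in column_totals.items()}

print(row_totals, dict(column_totals), grand_total)
```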
Descriptive Measures for Categorical Variables...
• Visualization
    • Bar charts are most often used to visualize categorical variables

Descriptive Measures for Numerical Variables
• Measure of central tendency
• Measure of variability
• Measure of shape

Measure of Central Tendency
Measure of central tendency measures the location of the middle or center of a data distribution

Mean
• Average of all values in a data set
• The mean of x_1, x_2, ..., x_N can be computed as
\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
• If each x_i is associated with a weight w_i, then the weighted mean is
\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} \]

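A short sketch of the mean and the weighted mean. The sample values and weights are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


values = [2, 4, 6]   # hypothetical observations
weights = [1, 1, 2]  # hypothetical weights: the last value counts double

print(mean(values))                    # (2 + 4 + 6) / 3 = 4.0
print(weighted_mean(values, weights))  # (2 + 4 + 12) / 4 = 4.5
```

With equal weights the weighted mean reduces to the ordinary mean.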
Mean...
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it trims any outliers. Outliers can affect the mean (especially if there are just one or two very large values), so a trimmed mean can often be a better fit for data sets with erratic high or low values or for extremely skewed distributions
• Ex: the mean salary at a company may be substantially pushed up by that of a few highly paid managers
• Ex: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99
  Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle three values: 81, 83, 91
  Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85

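The two steps of the trimmed mean can be sketched as below. The function name and the choice to round the number of trimmed values down are assumptions; other conventions exist.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    # Number of values trimmed from each end, rounded down (an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion trims away every value")
    return sum(kept) / len(kept)


scores = [60, 81, 83, 91, 99]
print(trimmed_mean(scores, 0.2))  # keeps 81, 83, 91 and prints 85.0
```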
Median
• It is the middle value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower half
• If two middle numbers are present, then take the mean of the two numbers
• Ex: Median of 4, 1, 7 is 4

Mode
• Mode for a set of data is the value that occurs most frequently in the set
• Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal
• In general, a data set with two or more modes is multimodal; whereas if each data value occurs only once, then there is no mode
• For unimodal data that are moderately skewed, the empirical relation mean − mode ≈ 3 × (mean − median) holds

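A minimal sketch of the median and the mode. `statistics.median` is from the Python standard library; the `modes` helper is a hypothetical function written to follow the convention above that data with no repeated value has no mode.

```python
import statistics
from collections import Counter


def modes(values):
    """Return all most frequent values, or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, n in counts.items() if n == highest)


print(statistics.median([4, 1, 7]))      # sorted: 1, 4, 7 -> 4
print(statistics.median([4, 1, 7, 10]))  # mean of 4 and 7 -> 5.5
print(modes([52, 52, 70, 70, 30]))       # bimodal -> [52, 70]
print(modes([1, 2, 3]))                  # no mode -> []
```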
Frequency, Relative Frequency and Cumulative Relative Frequency
• Frequency (or event) recording is a way to measure the number of times a behavior occurs within a given period
• Relative frequency distribution shows the proportion of the total number of observations associated with each value or class of values and is related to a probability distribution
• Cumulative relative frequency is the running sum of the relative frequencies up to and including a given value or class

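The three quantities can be tabulated together. This is a sketch with made-up observations; the function name `frequency_table` is an assumption.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Rows of (value, frequency, relative frequency, cumulative relative frequency)."""
    counts = Counter(values)
    total = len(values)
    keys = sorted(counts)
    relative = [counts[k] / total for k in keys]
    cumulative = list(accumulate(relative))  # running sum of relative frequencies
    return [(k, counts[k], r, c) for k, r, c in zip(keys, relative, cumulative)]


# Hypothetical observations.
table = frequency_table([2, 3, 3, 4, 4, 4, 4, 5])
for row in table:
    print(row)
```

The last cumulative relative frequency is always 1, since it covers every observation.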
Midrange
• It is the average of the largest and smallest values in the set
• Ex: Data values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
  Mean = 58, Median = 54, Mode: Bimodal (52 and 70), Midrange = 70

Mean, Median & Mode from Grouped Frequencies
• Data: 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
  Mean = 61.38095, Median = 61, Mode = 62

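The figures in the first example can be checked with the standard `statistics` module. This is just a verification sketch of the slide's numbers.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)            # 696 / 12
median = statistics.median(data)        # mean of the two middle values, 52 and 56
modes = statistics.multimode(data)      # all values with the highest count
midrange = (min(data) + max(data)) / 2  # average of smallest and largest value

print(mean, median, modes, midrange)
```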
Estimating Mean from Grouped Data
• 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
• The groups (51-55, 56-60 etc.), also called class intervals, are of width 5
• Mean can be estimated by using midpoints
• The midpoints are in the middle of each class: 53, 58, 63 and 68

Estimating Mean from Grouped Data...
• 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
• With the midpoints, the data looks like:
  53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
• Estimated mean = 61.333

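The midpoint estimate can be sketched as follows. The class limits 51-55, 56-60, 61-65 and 66-70 follow the groups named above; the helper names are assumptions.

```python
data = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62,
        68, 65, 56, 59, 68, 61, 67]
classes = [(51, 55), (56, 60), (61, 65), (66, 70)]


def group_frequencies(values, classes):
    """Count how many values fall inside each (low, high) class, limits inclusive."""
    return [sum(1 for v in values if low <= v <= high) for low, high in classes]


def estimated_mean(classes, frequencies):
    """Replace every value by its class midpoint, then take the mean."""
    midpoints = [(low + high) / 2 for low, high in classes]
    return sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)


frequencies = group_frequencies(data, classes)
grouped_mean = estimated_mean(classes, frequencies)
print(frequencies, round(grouped_mean, 3))
```

The estimate (about 61.333) is close to, but not the same as, the mean of the raw data (about 61.381), because grouping discards the exact values.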
Estimating Median from Grouped Data
• The median is the middle value (61), which is in the 61-65 group. So, the median group is 61-65
• We call it 61-65, but it really includes values from 60.5 up to 65.5. At 60.5, we already have 9 runners, and by the next boundary at 65.5 we have 17 runners
• By drawing a straight line in between, we can pick out where the median frequency of n/2 runners is:
\[ \text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w \]
where,
L -> lower class boundary of the group containing the median
n -> total number of values
B -> cumulative frequency of the groups before the median group
G -> frequency of the median group
w -> group width

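A sketch of the interpolation formula. The class boundaries (50.5, 55.5, 60.5, 65.5) and frequencies (2, 7, 8, 4) are assumptions derived from the 21 data values and the width-5 groups; they agree with the counts of 9 and 17 runners quoted above.

```python
def estimated_median(lower_boundaries, frequencies, width):
    """Interpolate the median inside the class that contains the n/2-th value."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # B: cumulative frequency of the groups before the current one
    for L, G in zip(lower_boundaries, frequencies):
        if cumulative + G >= n / 2:
            return L + (n / 2 - cumulative) / G * width
        cumulative += G
    raise ValueError("median class not found")


lower_boundaries = [50.5, 55.5, 60.5, 65.5]
frequencies = [2, 7, 8, 4]
median_estimate = estimated_median(lower_boundaries, frequencies, width=5)
print(median_estimate)  # 60.5 + (10.5 - 9) / 8 * 5 = 61.4375
```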
Estimating Mode from Grouped Data
• Modal group (the group with the highest frequency) is 61-65
• Mode can be estimated as:
\[ \text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w \]
where,
L -> lower class boundary of the modal group
f_m -> frequency of the modal group
f_{m-1} -> frequency of the group before the modal group
f_{m+1} -> frequency of the group after the modal group
w -> group width

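The formula can be sketched as below. As with the grouped median, the boundaries and the frequencies (2, 7, 8, 4) are assumptions derived from the 21 data values; treating a missing neighbouring group as frequency 0 is also an assumption.

```python
def estimated_mode(lower_boundaries, frequencies, width):
    """Estimate the mode from the modal group and its two neighbours."""
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_m = frequencies[m]
    # A missing neighbour (modal group at either end) is treated as frequency 0.
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    denominator = (f_m - f_prev) + (f_m - f_next)
    if denominator == 0:
        raise ValueError("modal group is not higher than its neighbours")
    return lower_boundaries[m] + (f_m - f_prev) / denominator * width


lower_boundaries = [50.5, 55.5, 60.5, 65.5]
frequencies = [2, 7, 8, 4]
mode_estimate = estimated_mode(lower_boundaries, frequencies, width=5)
print(mode_estimate)  # 60.5 + (8 - 7) / ((8 - 7) + (8 - 4)) * 5 = 61.5
```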
Measure of Variability
• Measures of variability give a sense of how spread out the response values are
• The range, standard deviation and variance each reflect different aspects of spread
• Percentiles and quartiles certainly tell you something about variability
• The second quartile is equal to the median by definition

Range
• The range of the set is the difference between the largest and smallest values
• Ex: Data set: 1, 3, 5, 6, 7
  Range = 7 - 1 = 6
• Range for grouped data = Upper limit of the last class interval - Lower limit of the first class interval

Quantiles
• Let X be numeric data sorted in increasing order. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
• Ex: The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median

Quartiles
• The 4-quantiles are the three data points that split the data distribution into four equal parts, commonly referred to as quartiles
• They divide an ordered data set into four equal parts

Percentiles
The 100-quantiles are more commonly referred to as percentiles

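Quartiles and percentiles can be sketched with `statistics.quantiles` from the Python standard library. Note that several interpolation conventions exist for cut points, so other tools may report slightly different quartiles for the same data; the sample data here is only illustrative.

```python
import statistics

data = [1, 3, 5, 6, 7]

# n=4 returns the three cut points Q1, Q2, Q3 (default "exclusive" method).
quartiles = statistics.quantiles(data, n=4)

# n=100 returns the 99 percentile cut points.
percentiles = statistics.quantiles(data, n=100)

print(quartiles)
print(statistics.median(data))  # the 2-quantile, equal to the second quartile
```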
Interquartile Range
• Distance between the first (25th percentile) and third (75th percentile) quartiles is called the interquartile range (IQR)
  IQR = Q3 − Q1
• IQR gives us the width of the box in a box plot. A small width means more consistent data values, since it indicates less variation in the data or that data values are closer together

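A small sketch of IQR = Q3 − Q1, again using `statistics.quantiles` with its default method (an assumption; other quartile conventions give slightly different values). Both sample data sets are hypothetical.

```python
import statistics


def interquartile_range(values):
    """IQR = Q3 - Q1, using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


tight = [48, 49, 50, 51, 52]  # values close together
wide = [10, 30, 50, 70, 90]   # same median, much more spread

print(interquartile_range(tight))
print(interquartile_range(wide))
```

Both sets share the median 50, but the second has a far larger IQR, which is what a wider box in a box plot would show.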
Variance
• Variance is a statistical measure that quantifies the spread or dispersion of a set of data points
• It indicates how much the individual data points in a dataset differ from the mean of the dataset
• Low variance means the data points are close to the mean and to each other; high variance means the data points are spread out from the mean and from each other
• Variance of N observations, x_1, x_2, ..., x_N, for a numeric attribute X is
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
  where \bar{x} is the mean of the observations
• Ex: Data: 4, 6, 8, 10
  Mean = 7, Variance = 5

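The example can be reproduced with a direct implementation of the population variance (dividing by N); the function name is an assumption.

```python
def population_variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [4, 6, 8, 10]
variance = population_variance(data)
print(variance)  # deviations -3, -1, 1, 3 -> (9 + 1 + 1 + 9) / 4 = 5.0
```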
Standard Deviation
• It measures the degree of dispersion of the data points around their mean value
• Standard deviation, σ, of the observations is the square root of the variance
• A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values


Bessel's Correction
• Standard deviation is sensitive to extreme values. A single very extreme value can increase the standard deviation and misrepresent the dispersion
• For two data sets with the same mean, the one with the larger standard deviation is the one in which the data are more spread out from the center
• In statistics, Bessel's correction is the use of n-1 instead of n in the formula for the sample standard deviation, where n is the number of observations in a sample
• This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation
• However, the correction often increases the mean squared error in these estimations
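The effect of dividing by n-1 instead of n can be sketched in Python as follows. The sample reuses the earlier 4, 6, 8, 10 data, and the value 100 appended at the end is a hypothetical outlier added only to show the sensitivity to extreme values.

```python
from statistics import pvariance, stdev, variance

sample = [4, 6, 8, 10]
n = len(sample)

biased = pvariance(sample)    # divides the squared deviations by n
corrected = variance(sample)  # divides by n - 1 (Bessel's correction)

# The two estimates differ exactly by the factor n / (n - 1).
ratio = corrected / biased

# One hypothetical extreme value inflates the standard deviation considerably.
with_outlier = sample + [100]

print("divide by n:    ", biased)
print("divide by n - 1:", corrected)
print("std dev without / with outlier:", stdev(sample), stdev(with_outlier))
```

The gap between the two variance estimates shrinks as n grows, because n / (n - 1) approaches 1.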

Measure of Shape

Symmetrical distribution
• When a distribution is symmetrical, the mode, median and mean are all in the middle of the distribution
[Figure: histogram showing the frequency of retirement age for different age groups]

Positive Skewed distribution
• A distribution is said to be positively or right skewed when the tail on the right side of the distribution is longer than the left side
• In a positively skewed distribution it is common for the mean to be 'pulled' toward the right tail of the distribution
• Generally most of the values, including the median value, tend to be less than the mean value

Negative Skewed distribution
• A distribution is said to be negatively or left skewed when the tail on the left side of the distribution is longer than the right side
• In a negatively skewed distribution, it is common for the mean to be 'pulled' toward the left tail of the distribution
• Generally most of the values, including the median value, tend to be greater than the mean value

Symmetric vs. Skewed Data
[Figure: median, mean and mode of (a) symmetric, (b) positively skewed and (c) negatively skewed data]
• Data in most real applications are not symmetric (a). They may instead be either positively skewed (b), where the mode occurs at a value that is smaller than the median, or negatively skewed (c), where the mode occurs at a value greater than the median
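The ordering of mode, median and mean under the three shapes can be checked with a short Python sketch. All three data sets are hypothetical; the left-skewed one is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
left_skewed = [16 - x for x in right_skewed]  # mirror image of the right-skewed data

def summarize(data):
    """Return (mode, median, mean) for a list of numbers."""
    return mode(data), median(data), mean(data)

for name, data in (("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)):
    print(name, summarize(data))
```

In this sketch the right-skewed data give mode < median < mean, and the mirrored data reverse that order, in line with the bullets above.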

Pearson's Coefficient of Skewness
• Most frequently used skewness measurement
• Skewness $= \frac{\sum (x - \bar{x})^3}{(n-1) \cdot S^3}$
  where $S$ is the standard deviation and $\bar{x}$ is the mean
• Using the mode, Pearson's first coefficient $= \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
• Using the median, Pearson's second coefficient $= \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$

Kurtosis
• Kurtosis gives a measure of the flatness of a distribution
• Kurtosis $= \frac{\sum (x - \bar{x})^4}{(n-1) \cdot S^4}$
  where $S$ is the standard deviation and $\bar{x}$ is the mean
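The skewness and kurtosis formulas can be sketched directly in Python. This is a minimal illustration that assumes S is the sample standard deviation (the n-1 form), which the slides do not state explicitly; the function names and data sets are hypothetical.

```python
from statistics import mean, median, stdev

def moment_skewness(data):
    """Sum of cubed deviations divided by (n - 1) * S**3."""
    n, m, s = len(data), mean(data), stdev(data)  # assumption: S uses n - 1
    return sum((x - m) ** 3 for x in data) / ((n - 1) * s ** 3)

def kurtosis(data):
    """Sum of fourth-power deviations divided by (n - 1) * S**4."""
    n, m, s = len(data), mean(data), stdev(data)
    return sum((x - m) ** 4 for x in data) / ((n - 1) * s ** 4)

def pearson_second(data):
    """Pearson's second coefficient: 3 * (mean - median) / S."""
    return 3 * (mean(data) - median(data)) / stdev(data)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

print("skewness:", moment_skewness(symmetric), moment_skewness(right_skewed))
print("Pearson's second coefficient:", pearson_second(right_skewed))
print("kurtosis:", kurtosis(symmetric))
```

Statistical libraries often apply slightly different normalizations, so their results may not match these formulas digit for digit.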

[Figure: measures of sample skewness and kurtosis]

Outliers
• The extreme values in a dataset are called outliers
• Types of Outliers
  • Global Outliers: data points whose values are far outside everything else in the dataset
  • Contextual Outliers: values that deviate considerably from the rest of the data points in the same context; in a different context, the same value may not be an outlier at all
  • Collective Outliers: a group of data points that, taken as a whole, deviates from the rest of the dataset. Individually these points may not be global or contextual outliers, but they behave as outliers when aggregated together

Tukey Fences
• When there are no outliers in a sample, the mean and standard deviation are used to summarize a typical value and the variability in the sample, respectively
• When there are outliers in a sample, the median and interquartile range are used to summarize a typical value and the variability in the sample, respectively
• The Tukey fence method is used to find outliers: outliers are the values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR
• Upper Fence = Q3 + 1.5·IQR = Q3 + 1.5(Q3 - Q1)
  Lower Fence = Q1 - 1.5·IQR = Q1 - 1.5(Q3 - Q1)
• The data points beyond the upper and lower fences in a box plot are referred to as outliers
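A minimal Python sketch of the fence computation follows. The sample is hypothetical (it contains one deliberately extreme value), and the choice of `method="inclusive"` for the quartiles is an assumption, since quartile conventions differ between tools and the slides do not fix one here.

```python
from statistics import quantiles

# Hypothetical sample with one extreme value (60).
data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 60]

# Quartile conventions vary; "inclusive" is one common choice (Python 3.8+).
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Because the fences are built from quartiles rather than the mean, the extreme value itself has little influence on where they fall.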

Boxplot Analysis
• Boxplots are a popular way of visualizing a distribution through quantiles and detecting outliers
• A boxplot incorporates the five-number summary: Minimum, Q1, Median, Q3, Maximum
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
• The median is marked by a line within the box
• Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations
• Points beyond a specified outlier threshold are plotted individually
• Boxplots often provide information about the shape of a data set
• If most of the observations are concentrated on the low end of the scale, the distribution is skewed right; and vice versa
• If a distribution is symmetric, the observations will be evenly split at the median
• Ex: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
• Median -> 32 (the average of the two middle values, 30 and 34)
• First quartile is the median of the data points to the left of the median: 25, 28, 29, 29, 30. So, Q1 -> 29
• Third quartile is the median of the data points to the right of the median: 34, 35, 35, 37, 38. So, Q3 -> 35
• Min -> 25 and Max -> 38
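The five-number summary of this example can be reproduced with a short Python sketch that follows the same median-of-halves rule for the quartiles. The function name `five_number_summary` is illustrative, and the handling of odd-length data (dropping the middle value from both halves) is an assumption, since the example only covers an even number of points.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    # For odd-length data the middle value is excluded from both halves (assumption).
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # height of the box

print("min:", minimum, "Q1:", q1, "median:", med, "Q3:", q3, "max:", maximum)
print("IQR:", iqr)
```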

Z-Score for Outlier Detection
• A Z-score is a standard score that tells you how many standard deviations away from the mean an individual value (x) lies:
  • A positive Z-score means that your x value is greater than the mean
  • A negative Z-score means that your x value is less than the mean
  • A Z-score of zero means that your x value is equal to the mean
• Z-Score $= \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}}$
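Z-score based flagging can be sketched in Python as below. The data are hypothetical, the population standard deviation is assumed, and the cutoff of 2.5 is an illustrative assumption; the slides do not prescribe a threshold, and values such as 2 or 3 are also commonly used.

```python
from statistics import mean, pstdev

# Hypothetical sample with one extreme value (60).
data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 60]

mu = mean(data)
sigma = pstdev(data)  # assumption: population standard deviation

# Z-score = (observation - mean) / standard deviation
z_scores = [(x - mu) / sigma for x in data]

THRESHOLD = 2.5  # illustrative cutoff, not taken from the slides
flagged = [x for x, z in zip(data, z_scores) if abs(z) > THRESHOLD]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
print("flagged:", flagged)
```

Since the mean and standard deviation are themselves pulled by extreme values, this approach can be less robust than the fence method on small samples.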

Missing Values
• Reasons for missing values
  • Information is not collected (ex: people decline to give their age and weight)
  • Attributes may not be applicable to all cases (ex: annual income is not applicable to children)
• Handling missing values
  • Eliminate data objects or variables
  • Estimate missing values (Ex: census results)
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
• Types of Missing Values: Some definitions are based on representation: Missing data is the lack of a recorded answer for a particular field
  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR)
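Two of the handling strategies listed above, eliminating incomplete objects and estimating the missing values, can be sketched in Python. The `ages` list is hypothetical, `None` stands for a missing entry, and mean imputation is used only as one simple example of estimation.

```python
from statistics import mean

# Hypothetical attribute with two missing entries (None).
ages = [23, None, 31, 27, None, 39]

# Strategy 1: eliminate the data objects whose value is missing.
observed = [a for a in ages if a is not None]

# Strategy 2: estimate each missing value, here with the mean of the observed values.
fill_value = mean(observed)
imputed = [a if a is not None else fill_value for a in ages]

print("observed only:", observed)
print("fill value:", fill_value)
print("imputed:", imputed)
```

Mean imputation keeps the sample size and the mean unchanged but shrinks the variance, so it is a convenience rather than a neutral choice.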

Types of Missing Values

Missing Completely at Random (MCAR)
• If a person has missing data, then it is completely unrelated to the other information in the data. The missingness on the variable is completely unsystematic
• Missingness of a value is independent of attributes
• Fill in values based on the attribute
• Analysis may be unbiased overall
• Ex: when we take a random sample of a population, where each member has the same chance of being included in the sample

Missing at Random (MAR)
• Missingness is related to other variables
• Fill in values based on other values
• Almost always produces a bias in the analysis
• Ex: when we take a sample from a population, where the probability to be included depends on some known property

Missing Not at Random (MNAR) - Nonignorable
• Missingness is related to unobserved measurements
• The missing values on a variable are related to the values of that variable itself, even after controlling for other variables
• MNAR means that the probability of being missing varies for reasons that are unknown to us

Definitions
• Let X be the matrix of the data we expect to have; $X = \{X_o, X_m\}$, where $X_o$ is the observed data and $X_m$ is the missing data
• Let R be a matrix with the same dimensions as X, where $R_{i,j} = 1$ if the datum is missing, 0 otherwise

MCAR: $P(R \mid X_o, X_m) = P(R)$
MAR: $P(R \mid X_o, X_m) = P(R \mid X_o)$
MNAR: no simplification
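The difference between the MCAR and MAR conditions can be made concrete with a small simulation. This is only a sketch under stated assumptions: `ages` plays the role of a fully observed variable, the mask entries play the role of R for a second variable (say, income), and the probabilities 0.3, 0.1 and 0.5 as well as the age cutoff of 45 are arbitrary illustrative choices.

```python
import random

random.seed(0)
n = 10_000

# Fully observed variable (part of the observed data).
ages = [random.randint(20, 69) for _ in range(n)]

# R[i] = 1 if the second variable is missing for object i, 0 otherwise.
# MCAR: the probability of being missing is the same for everyone.
r_mcar = [1 if random.random() < 0.3 else 0 for _ in range(n)]
# MAR: the probability of being missing depends only on the observed age.
r_mar = [1 if random.random() < (0.5 if age >= 45 else 0.1) else 0 for age in ages]

def missing_rate(mask, keep):
    """Share of missing entries among the objects selected by `keep`."""
    selected = [r for r, k in zip(mask, keep) if k]
    return sum(selected) / len(selected)

older = [age >= 45 for age in ages]
younger = [not o for o in older]

print("MCAR  younger / older:", missing_rate(r_mcar, younger), missing_rate(r_mcar, older))
print("MAR   younger / older:", missing_rate(r_mar, younger), missing_rate(r_mar, older))
```

Under MCAR the missing rate should be roughly the same in both age groups, while under MAR it differs by group yet remains explainable from the observed variable alone.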

Relationships Among Variables
• Looking closely at variables one at a time is an important first step in any exploratory data analysis, but it is almost never the last step
• The primary interest is usually in relationships between variables
  • Categorical vs. Categorical
  • Categorical vs. Numerical
  • Numerical vs. Numerical

Relationships Among Categorical Variables
• The most meaningful way to describe a categorical variable is with counts, possibly expressed as percentages of totals, and corresponding charts of the counts
• Consider a data set with at least two categorical variables, Smoking and Drinking
  • Smoking: Non Smoker (NS), Occasional Smoker (OS), Heavy Smoker (HS)
  • Drinking: Non Drinker (ND), Occasional Drinker (OD), Heavy Drinker (HD)
• It is customary to display all such counts in a table called a crosstabs (for crosstabulations). This is also sometimes called a contingency table
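A crosstab of the Smoking and Drinking variables can be built with nothing more than a counter. The eight records below are hypothetical, invented only to show the mechanics; the category codes (NS, OS, HS, ND, OD, HD) are the ones introduced earlier.

```python
from collections import Counter

# Hypothetical (smoking, drinking) observations.
records = [
    ("NS", "ND"), ("NS", "OD"), ("OS", "OD"), ("HS", "HD"),
    ("OS", "ND"), ("NS", "ND"), ("HS", "OD"), ("HS", "HD"),
]

smoking_levels = ["NS", "OS", "HS"]
drinking_levels = ["ND", "OD", "HD"]

# Count each (smoking, drinking) combination; absent pairs count as 0.
counts = Counter(records)
table = {s: {d: counts[(s, d)] for d in drinking_levels} for s in smoking_levels}
total = sum(counts.values())

# Print the contingency table with counts and percentages of the grand total.
print("     " + "".join(f"{d:>12}" for d in drinking_levels))
for s in smoking_levels:
    cells = "".join(f"{table[s][d]:>5} ({table[s][d] / total:4.0%})" for d in drinking_levels)
    print(f"{s:<5}{cells}")
```

With a data-frame library the same table is usually a one-line call, but the counting logic is the same.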

Relationships Among Categorical and Numerical Variables
• A very common situation is one where the goal is to break down a numerical variable such as salary by a categorical variable such as gender
• This general problem, typically referred to as the comparison problem, is one of the most important problems in data analysis
• It occurs whenever you want to compare a numerical measure across two or more subpopulations
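The comparison problem amounts to grouping a numerical variable by the levels of a categorical one and summarizing each group. A minimal Python sketch follows; the salary figures are hypothetical and the choice of count, mean and median as summaries is illustrative.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (gender, salary) observations.
rows = [
    ("F", 52000), ("M", 48000), ("F", 61000),
    ("M", 55000), ("F", 58000), ("M", 50000),
]

# Break the numerical variable down by the categorical variable.
groups = defaultdict(list)
for gender, salary in rows:
    groups[gender].append(salary)

# Summarize each subpopulation separately.
summary = {
    gender: {"n": len(values), "mean": mean(values), "median": median(values)}
    for gender, values in groups.items()
}

for gender, stats in summary.items():
    print(gender, stats)
```

Side-by-side boxplots of the groups are the usual graphical counterpart of this table.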

Relationships Among Numerical and Numerical Variables
• To study relationships among numeric variables, a new type of chart, called a scatterplot, and two new summary measures, correlation and covariance, are used
• Scatter Plot
  • A scatterplot is a scatter of points, where each point denotes the values of an observation for two selected variables
  • It is a graphical method for detecting relationships between two numerical variables
  • The two variables are often labeled generically as X and Y, so a scatterplot is sometimes called an X-Y chart
  • The purpose of a scatterplot is to make a relationship (or the lack of it) apparent
• Correlation and Covariance
  • Correlation and covariance measure the strength and direction of a linear relationship between two numerical variables (bivariate measures)
  • The relationship is 'strong' if the points in a scatterplot cluster tightly around some straight line
  • The two numerical variables must be 'paired' variables
  • Covariance: is essentially an average of products of deviations from means
    $\mathrm{CoVar}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
    where $X_i$ and $Y_i$ are the paired values for observation i, and n is the number of observations
  • Covariance has a serious limitation as a descriptive measure because it is very sensitive to the units in which X and Y are measured
  • Correlation: is a unitless quantity that is unaffected by the measurement scale
    $\mathrm{Correl}(X, Y) = \frac{\mathrm{CoVar}(X, Y)}{\mathrm{StdDev}(X) \cdot \mathrm{StdDev}(Y)}$
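The unit sensitivity of covariance, and the lack of it in correlation, can be seen in a short Python sketch. The paired data are hypothetical, the function names are illustrative, and the sample standard deviation (n-1 form) is assumed so that it matches the n-1 divisor in the covariance formula.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations (unitless)."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# The same x values expressed in a unit 100 times smaller (e.g. metres -> centimetres).
x_rescaled = [100 * v for v in x]

print("covariance: ", covariance(x, y), covariance(x_rescaled, y))
print("correlation:", correlation(x, y), correlation(x_rescaled, y))
```

Rescaling X multiplies the covariance by the same factor, while the correlation stays where it was.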

Pearson's Correlation Coefficient
• The correlation coefficient is calculated as
  $r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left(n \sum X^2 - (\sum X)^2\right) \cdot \left(n \sum Y^2 - (\sum Y)^2\right)}}$
  where,
  n -> number of data points or observations
  $\sum XY$ -> sum of the product of x-value and y-value for each point in the data set
  $\sum X$ -> sum of the x-values in the data set
  $\sum Y$ -> sum of the y-values in the data set
  $\sum X^2$ -> sum of the squares of the x-values in the data set
  $\sum Y^2$ -> sum of the squares of the y-values in the data set
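The computational formula for r translates almost term by term into Python. The paired data below are hypothetical and the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw sums n, sum(XY), sum(X), sum(Y), sum(X^2), sum(Y^2)."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print("r =", round(r, 4))
```

For these data the sums are n = 5, ΣXY = 66, ΣX = 15, ΣY = 20, ΣX² = 55 and ΣY² = 86, so r = 30 / √1500. This form is algebraically the same as covariance divided by the product of the standard deviations, because the n-1 factors cancel.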
