Dsa 2
Chittaranjan Pradhan
Descriptive Measures
for Categorical
Variables
Descriptive Measures
for Numerical Variables
Measure of Central
Tendency
Measure of Variability
Measure of Shape
Outliers and Missing Values
Relationships Among
Variables
Chittaranjan Pradhan
School of Computer Engineering,
KIIT University
Statistical Concepts
Data Exploration
• Data exploration refers to the initial step in data analysis, in which data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to better understand the data
• This is also sometimes referred to as exploratory data analysis (EDA): converting data from their raw form to a more informative one
• EDA consists of:
  • organizing and summarizing the raw data,
  • discovering important features and patterns in the data and any striking deviations from those patterns, and then
  • interpreting our findings in the context of the problem
• EDA can be useful for:
Data Objects and Attribute Types
• An attribute is a data field representing a characteristic of a data object
• Ex: Cust_ID
Attribute (or Variable) Types
• Ordinal attributes
  • An attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known
• Interval-scaled attributes
  • Measured on a scale of equal-size units
  • No true zero-point
  • Ex: calendar dates
• Ratio-scaled attributes
  • A value can be expressed as a multiple of another value
  • Inherent zero-point
  • Ex: years of experience of an employee, frequency of words in a document
• Numerical variables can be classified as discrete or continuous
• Discrete vs. Continuous Attributes
  • Discrete Attribute
Properties of Attribute Values
• Multiplication: *, / (ratios are meaningful)
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties (distinctness, order, addition & multiplication)
Types of Data Sets
Record Data
Data that consists of a collection of records
Document Data
Each document becomes a term vector
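The term-vector idea can be sketched as follows. This is a minimal illustration, assuming a whitespace tokenizer and two made-up documents; neither comes from the slides.

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One component per vocabulary term; absent terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical documents, used only for illustration
docs = ["team play ball score", "ball ball game win"]
vocabulary, vectors = term_vectors(docs)
```

Each position in a vector corresponds to one term of the sorted vocabulary, so documents of different lengths become vectors of the same length.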
Transaction Data
Graph Data
Ex: Generic graph and HTML links
Ordered Data
Descriptive Measures for Categorical Variables
• Descriptive statistics are the first pieces of information used to understand and represent a dataset. The goal is to describe the main features of numerical and categorical information with simple summaries
• Frequencies
  • Contingency tables give the counts for each combination of categorical variables
  • Ex: we may want to get the total count of female and male customers
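A frequency count of this kind might look like the sketch below. The gender column is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical gender column of a customer table
genders = ["F", "M", "F", "F", "M", "F"]

counts = Counter(genders)                     # absolute frequencies
total = sum(counts.values())
# Relative frequencies: share of each category in the total
relative = {g: c / total for g, c in counts.items()}
```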
Descriptive Measures for Numerical Variables
• Measure of central tendency
Measure of Central Tendency
• A measure of central tendency measures the location of the center of a distribution
Mean
• Ex: Find the 20% trimmed mean for the following test scores: 60, 81, 83, 91, 99
  Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle three values: 81, 83, 91
  Step 2: Find the mean of the remaining values. The mean is (81 + 83 + 91) / 3 = 85
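The two steps of the trimmed-mean example can be sketched in code. The function name `trimmed_mean` is a hypothetical helper, and truncating the number of trimmed values toward zero is an assumption about how fractional counts are handled.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values cut from each end
    kept = ordered[k:len(ordered) - k]   # assumes at least one value remains
    return sum(kept) / len(kept)

scores = [60, 81, 83, 91, 99]
result = trimmed_mean(scores, 0.20)      # trims 60 and 99, averages 81, 83, 91
```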
Median
• It is the middle value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower half
• If two middle numbers are present, then take the mean of the two
• The mode of a set of data is the value that occurs most frequently in the set
• Cumulative frequency represents the running sum of the relative frequencies
Midrange
• It is the average of the largest and smallest values in the data set
• Ex: Data values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
  Mean = 58, Median = 54, Mode: Bimodal (52 and 70), Midrange = 70
• Data: 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
  Mean = 61.38095, Median = 61, Mode = 62
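The figures in both examples can be checked with Python's standard `statistics` module. This is a sketch only; `midrange` is a small helper defined here because the module has no such function.

```python
import statistics

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

data1 = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
data2 = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62,
         68, 65, 56, 59, 68, 61, 67]

# multimode returns every most-frequent value, so it handles the bimodal case
summary1 = (statistics.mean(data1), statistics.median(data1),
            statistics.multimode(data1), midrange(data1))
summary2 = (statistics.mean(data2), statistics.median(data2),
            statistics.multimode(data2))
```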
Estimating Mean from Grouped Data
• 59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62, 68, 65, 56, 59, 68, 61, 67
• The groups (51-55, 56-60 etc.) are also called class intervals
• The mean can be estimated by using midpoints
• The midpoints are in the middle of each class: 53, 58, 63 and 68
• Sorted data: 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
• Each value replaced by the midpoint of its class: 53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
• Estimated mean = 61.333
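The estimate can be reproduced by weighting each class midpoint by its frequency. In this sketch the class limits 61-65 and 66-70 are inferred from the midpoints 63 and 68, and the frequencies are counted from the 21 data values.

```python
# Class intervals (lower, upper) and their frequencies for the 21 values
groups = {(51, 55): 2, (56, 60): 7, (61, 65): 8, (66, 70): 4}

total = sum(groups.values())
# Each class contributes its midpoint once per observation in the class
weighted_sum = sum(((lo + hi) / 2) * freq for (lo, hi), freq in groups.items())
estimated_mean = weighted_sum / total
```

The result is close to, but not equal to, the exact mean of 61.38095, because every value is treated as if it sat at the midpoint of its class.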
• Median can be estimated as:
$$\text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w$$
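A minimal sketch of the grouped-median estimate follows. The symbol meanings are the usual ones for this formula and are stated here as assumptions, since the slide text defining them is not available: L is the lower class boundary of the group containing the median, n the total number of values, B the cumulative frequency of the groups before the median group, G the frequency of the median group, and w the group width. The boundary 60.5 is likewise an assumption (halfway between 60 and 61).

```python
def estimated_median(L, n, B, G, w):
    """L + ((n / 2) - B) / G * w for grouped data."""
    return L + ((n / 2) - B) / G * w

# Median group 61-65: boundary 60.5, 9 values in earlier groups,
# 8 values in the group, group width 5
median_estimate = estimated_median(L=60.5, n=21, B=9, G=8, w=5)
```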
• Modal group (the group with the highest frequency) is 61 - 65
• Mode can be estimated as:
$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$
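The mode estimate can be sketched the same way. As assumptions: L is taken as the lower class boundary of the modal group (60.5), f_m is the modal group's frequency, f_{m-1} and f_{m+1} are the frequencies of the neighbouring groups, and w is the group width. The parameter names are hypothetical.

```python
def estimated_mode(L, f_prev, f_modal, f_next, w):
    """L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) * w for grouped data."""
    d1 = f_modal - f_prev   # excess over the preceding group
    d2 = f_modal - f_next   # excess over the following group
    return L + d1 / (d1 + d2) * w

# Modal group 61-65 with frequency 8; neighbours have frequencies 7 and 4
mode_estimate = estimated_mode(L=60.5, f_prev=7, f_modal=8, f_next=4, w=5)
```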
Measure of Variability
• Measures of variability give a sense of how spread out the response values are
• The range, standard deviation and variance each reflect different aspects of spread
• Percentiles and quartiles also tell you something about variability
• The second quartile is equal to the median by definition
Range
• Ex: The 2-quantile is the data point dividing the lower and upper halves of the data distribution
Quartiles
• The 4-quantiles are the three data points that split the data distribution into four equal parts, commonly referred to as quartiles
• They divide an ordered data set into four equal parts
Percentiles
• The 100-quantiles are more commonly referred to as percentiles
Interquartile Range
• The distance between the first quartile (25th percentile) and the third quartile (75th percentile) is called the interquartile range (IQR)
  IQR = Q3 − Q1
• IQR gives us the width of the box in a boxplot. A small width means more consistent data values, since it indicates less variation in the data, i.e. the data values are closer together
Variance
Data Exploration
• Variance is a statistical measure that quantifies the spread Data Objects and
Attribute
or dispersion of a set of data points Attribute (or Variable) Types
Properties of Attribute
• It indicates how much the individual data points in a Values
Types of Data Sets
dataset differ from the mean of the dataset
Descriptive Measures
for Categorical
• Low variance means the data points are close to the mean Variables
and to each other; high variance means the data points Descriptive Measures
for Numerical Variables
are spread out from the mean and from each other Measure of Central
Tendency
• Variance of N observations, x1 , x2 , ..., xN , for a numeric Measure of Variability
Measure of Shape
attribute X is Outliers and Missing Values
Relationships Among
Variables
Ex: Data: 4, 6, 8, 10
Mean= 7, Variance= 5
2.36
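The example can be verified with a short sketch. It assumes the variance of 5 is the population variance (dividing by N), which matches the numbers; the sample variance, which divides by N − 1, is shown for comparison.

```python
import statistics

data = [4, 6, 8, 10]
mean = statistics.mean(data)

# Population variance: average squared deviation from the mean (divide by N)
population_variance = sum((x - mean) ** 2 for x in data) / len(data)

# Sample variance divides by N - 1 instead of N
sample_variance = statistics.variance(data)

# Standard deviation is the square root of the variance
population_std = statistics.pstdev(data)
```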
Standard Deviation
• It is defined as the degree of dispersion of the data points from their mean value
• The standard deviation, σ, of the observations is the square root of the variance
• A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values
• A very extreme value can increase the standard deviation
• For two data sets with the same mean, the one with the larger standard deviation is the one in which the data is more spread out
Bessel’s Correction
• Bessel’s correction is the use of n − 1 instead of n in the formula for the sample standard deviation, where n is the number of observations in a sample
Measure of Shape
Symmetrical distribution
• Ex: Histogram showing the frequency of retirement age
Right-skewed distribution
• The mean is 'pulled' toward the right tail of the distribution
• Generally most of the values, including the median value, tend to be less than the mean value
Left-skewed distribution
• The mean is 'pulled' toward the left tail of the distribution
• Generally most of the values, including the median value, tend to be greater than the mean value
• Using mode,
$$\text{Pearson's First Coefficient} = \frac{\bar{x} - \text{Mode}}{S}$$
• Using median,
$$\text{Pearson's Second Coefficient} = \frac{3(\bar{x} - \text{Median})}{S}$$
where S -> standard deviation, x̄ -> mean
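Both Pearson coefficients can be computed with the standard library, as in the sketch below. It reuses the 21 data values from the central tendency example and assumes S is the sample standard deviation; the function name is hypothetical.

```python
import statistics

def pearson_skewness(values):
    """Return (first, second) Pearson skewness coefficients."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)      # sample standard deviation (assumption)
    first = (mean - statistics.mode(values)) / s
    second = 3 * (mean - statistics.median(values)) / s
    return first, second

data = [59, 65, 61, 62, 53, 55, 60, 70, 64, 56, 58, 58, 62, 62,
        68, 65, 56, 59, 68, 61, 67]
first, second = pearson_skewness(data)
```

For this data the mean (61.38) lies below the mode (62) but above the median (61), so the two coefficients have opposite signs. The first coefficient is only meaningful when the data has a single clear mode.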
Kurtosis
Measures of Sample Skewness and Kurtosis
Outliers
• The extreme values in a dataset are called outliers
• Types of Outliers
  • Global outliers: the data point or points whose values are far outside everything else in the dataset
  • Contextual outliers: values that deviate considerably from the rest of the data points in the same context; in a different context, the same value may not be an outlier at all
• When there are outliers in a sample, the median and interquartile range are used to summarize a typical value and the variability in the sample, respectively
• The Tukey fence method is used to find outliers
  • Upper Fence = Q3 + 1.5 × IQR = Q3 + 1.5 × (Q3 − Q1)
  • Lower Fence = Q1 − 1.5 × IQR = Q1 − 1.5 × (Q3 − Q1)
• The data points beyond the upper and the lower fences in a box plot are referred to as outliers
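The fence rule can be sketched as below. The quartiles are passed in by the caller rather than computed, and the list of observations is hypothetical; both function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values lying strictly beyond either fence."""
    lower, upper = tukey_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Fences for Q1 = 29 and Q3 = 35, then hypothetical observations checked
# against those fences
lower, upper = tukey_fences(29, 35)
outliers = find_outliers([25, 28, 29, 30, 34, 35, 37, 60], 29, 35)
```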
Boxplot Analysis
• Boxplots are a popular way of visualizing a distribution through quantiles and detecting outliers
• Boxplots often provide information about the shape of a data set
• If most of the observations are concentrated on the low end of the scale, the distribution is skewed right; and vice versa
• If a distribution is symmetric, the observations will be evenly split at the median
• Ex: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
• Median -> 32
• First quartile is the median of the data points to the left of the median: 25, 28, 29, 29, 30. So, Q1 -> 29
• Third quartile is the median of the data points to the right of the median: 34, 35, 35, 37, 38. So, Q3 -> 35
• Min -> 25 and Max -> 38
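The five values a boxplot needs can be computed with the same "median of each half" rule used in the example. This is a sketch of that particular rule; other quartile conventions (including the default of `statistics.quantiles`) can give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max; quartiles are medians of the two halves.

    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]      # skips the middle value when n is odd
    return (ordered[0], statistics.median(lower_half),
            statistics.median(ordered), statistics.median(upper_half),
            ordered[-1])

data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
summary = five_number_summary(data)
```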
•
$$Z\text{-Score} = \frac{\text{Observation} - \text{Mean}}{\text{Standard Deviation}}$$
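A z-score computation might look like the following sketch. It assumes the population standard deviation is intended and uses a small illustrative data set.

```python
import statistics

def z_scores(values):
    """(observation - mean) / standard deviation for every observation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation (assumption)
    return [(v - mean) / std for v in values]

scores = z_scores([4, 6, 8, 10])
```

Observations with a large absolute z-score lie far from the mean relative to the spread of the data, which is why the score is used to flag candidate outliers.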
Missing Values
• Reasons for missing values
  • Information is not collected (ex: people decline to give their age and weight)
  • Attributes may not be applicable to all cases
Types of Missing Values
Missing at Random (MAR)
• Ex: when we take a sample from a population, where the probability of being included depends on some known property
Missing not at Random (MNAR) - Nonignorable
• X = {X_o, X_m}, where X_o -> observed data and X_m -> missing data
• Let R -> matrix with the same dimensions as X that indicates which entries of X are missing
• MCAR: P(R | X_o, X_m) = P(R)
• MAR: P(R | X_o, X_m) = P(R | X_o)
• MNAR: no simplification
Relationships Among Variables
Relationships Among Categorical Variables
• The most meaningful way to describe a categorical variable is with counts, possibly expressed as percentages of totals
• Consider a data set with at least two categorical variables, Smoking and Drinking
  • Smoking: Non Smoker (NS), Occasional Smoker (OS), Heavy Smoker (HS)
  • Drinking: Non Drinker (ND), Occasional Drinker (OD), ...
• It is customary to display all such counts in a table called a crosstabs (for crosstabulations). This is also sometimes called a contingency table
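A crosstab of two categorical variables can be sketched with a counter keyed by pairs of categories. The records below are hypothetical and use only the category codes named in the list above.

```python
from collections import Counter

# Hypothetical (smoking, drinking) observations
records = [("NS", "ND"), ("NS", "OD"), ("OS", "OD"),
           ("HS", "OD"), ("NS", "ND"), ("OS", "ND")]

crosstab = Counter(records)               # count per (smoking, drinking) pair
total = len(records)
# The same counts expressed as percentages of the grand total
percentages = {pair: 100 * c / total for pair, c in crosstab.items()}
```

Combinations that never occur, such as ("HS", "ND") here, simply have a count of zero when looked up in the counter.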
Relationships Among Categorical and Numerical Variables
• It describes a very common situation where the goal is to break down a numerical variable such as salary by a categorical variable such as gender
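Breaking a numerical variable down by a categorical one amounts to grouping and then summarizing each group. The sketch below uses made-up (gender, salary) records and the mean as the summary; any other descriptive measure could be substituted.

```python
from collections import defaultdict
import statistics

# Hypothetical (gender, salary) records
employees = [("F", 52000), ("M", 48000), ("F", 61000),
             ("M", 50000), ("F", 58000)]

# Collect the salaries belonging to each category
salaries_by_gender = defaultdict(list)
for gender, salary in employees:
    salaries_by_gender[gender].append(salary)

# Summarize each group separately
mean_salary = {g: statistics.mean(s) for g, s in salaries_by_gender.items()}
```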
Relationships Among Numerical and Numerical Variables
• To study relationships among numeric variables, a new type of chart, called a scatterplot, and two new summary measures, correlation and covariance, are used
• A scatterplot is a scatter of points, where each point denotes the values of an observation for two selected variables
• It is a graphical method for detecting relationships between two numerical variables
• The two variables are often labeled generically as X and Y, so a scatterplot is sometimes called an X-Y chart
• Correlation and Covariance
  • Correlation and covariance measure the strength and direction of a linear relationship between two numerical variables (bivariate measures)
where,
n -> number of data points or observations
ΣXY -> sum of the product of x-value and y-value for each point in the data set
ΣX -> sum of the x-values in the data set
ΣY² -> sum of the squared y-values in the data set
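The sums listed above are the ingredients of the usual computational form of the Pearson correlation coefficient, r = (nΣXY − ΣXΣY) / sqrt((nΣX² − (ΣX)²)(nΣY² − (ΣY)²)). That this is the formula the slide intends is an assumption, as is the use of the sample covariance (dividing by n − 1). The paired observations are hypothetical.

```python
import math

def covariance_and_correlation(xs, ys):
    """Sample covariance and Pearson correlation from running sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    # Sample covariance (divides by n - 1; an assumption)
    cov = (sum_xy - sum_x * sum_y / n) / (n - 1)
    # Pearson correlation in computational form
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return cov, r

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
cov, r = covariance_and_correlation(xs, ys)
```

Covariance depends on the units of X and Y, while the correlation is unitless and always lies between −1 and 1, which makes it easier to compare across pairs of variables.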