Week 2

Uploaded by

sainathgunda99

Know your data

Week 2

This file is meant for personal use by [email protected] only.


Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Learning objectives
By the end of this module, you will be able to:
• List the different types of attributes.
• Compute basic descriptive statistics of a dataset.
• Create and read graphic plots that display descriptive statistics.
• Explain statistical hypothesis testing.
• Compute object similarity and dissimilarity.
Data objects and attribute types

Agenda
In this session, we will learn about:
• Types of datasets
• Data objects
• Attributes and their types
• Relationship of attributes
• Need for the absolute “0”
• Discrete vs. continuous attributes
Types of data sets
• Record
○ Relational records
○ Data matrix: numerical matrix, crosstabs
○ Document data: text documents, term-frequency vectors
○ Transactional data
• Graph and network
○ World Wide Web
○ Social or information networks
○ Molecular structures
• Ordered
○ Video data: sequence of images
○ Temporal data: time series
○ Sequential data: transaction sequences
○ Genetic sequence data
• Spatial, image, and multimedia
○ Spatial data: maps
○ Image data
○ Video data
Illustration - Record type datasets
Document data (term-frequency vectors)
Documents   Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document1   3     0      5     0     2      6     0    2     0        2
Document2   0     7      0     2     1      0     0    3     0        0
Document3   0     1      0     0     1      2     2    0     3        0
Transactional data
TID  Items
1    Bread, coke, milk
2    Beer, bread
3    Beer, coke, diaper, milk
4    Beer, bread, diaper, milk
5    Coke, diaper, milk
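A term-frequency vector like the rows of the document-data table can be sketched as follows. This is a minimal illustration, not the method used to build the table: the vocabulary is copied from the table's column headers, while the sample text, the whitespace tokenization, and the function name `term_frequency_vector` are assumptions made for the example.

```python
from collections import Counter

# Vocabulary taken from the column headers of the document-data table.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in `text`."""
    # Assumption: lowercase the text and split on whitespace.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical document text, invented for illustration only.
sample_text = "Team play ball play game win game"
vector = term_frequency_vector(sample_text)
print(vector)  # [1, 0, 2, 1, 0, 2, 1, 0, 0, 0]
```

Each document becomes one record (row) whose attributes are the term counts.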
Data objects
• Data sets are made up of data objects.
• A data object represents an entity (an observation).
• Examples:
○ sales database: customers, store items, sales
○ medical database: patients, treatments
○ university database: students, professors, courses
• Data objects are also called observations, samples, examples, instances, data points, objects, and tuples.
• Data objects are described by their attributes.
• Database table rows -> data objects; columns -> attributes
Fitness of a dataset
Data sets need to be as complete as possible for the questions/phenomena to be studied.
• Are data objects/observations in your data set representative of the population under study?
○ Areas of the aircraft most likely to be damaged in the war.
○ Identifying tanks in the forest.
• Are relevant attributes comprehensively included in your data set?
○ Most militarized country in the world?
○ Body height and total earnings?
Attributes
• An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
• An attribute vector (feature vector) is the set of attributes used to describe a given data object.
• Types:
○ Categorical: qualitative
─ Nominal, Binary, Ordinal
○ Numeric: quantitative
─ Interval-scaled, Ratio-scaled
Attribute types: Qualitative

• Nominal: categories, states, or “names of things”
○ Hair_color = {auburn, black, blond, brown, grey, red, white}
○ Marital status, occupation, ID numbers, zip codes
• Binary
○ Nominal attribute with only 2 states (0 and 1)
○ Symmetric binary: both outcomes are equally important.
─ e.g., biological gender
○ Asymmetric binary: outcomes are not equally important.
─ e.g., medical test (positive vs. negative)
─ Convention: assign 1 to the more important outcome (e.g., positive)
• Ordinal
○ Values have a meaningful order (ranking), but the magnitude between successive values is unknown.
○ Size = {small, medium, large}, letter grades, army rankings
Numeric attribute types: Quantitative
• Quantity (integer or real-valued)
• Interval
○ Measured on a scale of equal-sized units
○ Values have order
─ E.g., temperature in Celsius or Fahrenheit, calendar dates
○ No true zero-point
• Ratio
○ Inherent zero-point
○ We can speak of a value as being a multiple (or ratio) of another value (10 kelvins is twice as high as 5 kelvins).
─ E.g., temperature in Kelvin, length, counts, monetary quantities
Relationship of attributes
• All ratio attributes are interval attributes.
• All interval attributes are ordinal attributes.
• All ordinal attributes are nominal attributes.

[Figure: nested attribute scales, from weakest to strongest]
• Nominal: attributes are only named (weakest)
• Ordinal: attributes can be ordered
• Interval: distance is meaningful
• Ratio: absolute zero
The need for the absolute “0”
• Temperature measured at 20 ˚C is not twice that at 10 ˚C because 0 ˚C is not the absolute 0.
• Ratios cannot be defined reliably on a scale with an arbitrary 0.
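The point about the absolute zero can be checked numerically. The sketch below converts the two Celsius readings from the slide to Kelvin and compares the ratios; the helper name `celsius_to_kelvin` is chosen for this example.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (a scale with an absolute zero)."""
    return celsius + 273.15

# The naive ratio on the Celsius scale suggests "twice as hot".
ratio_celsius = 20 / 10
# The ratio on the Kelvin scale shows the real relationship.
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 4))  # 1.0353
```

On the ratio scale, 20 ˚C is only about 3.5% warmer than 10 ˚C, not twice as warm.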
Discrete vs continuous attribute
• Discrete attribute
○ Values are distinct and separate (unconnected values).
○ Can have integer values (e.g., age) or binary values.
○ The set of values can be infinite but must be countable (each value in the set has a corresponding integer).
• Continuous attribute
○ Has real numbers as attribute values, e.g., temperature, height, or weight.
○ Can take on any value within a finite or infinite interval.
○ Continuous attributes are typically represented as floating-point variables.
Summary
Here is a quick recap:
• We discussed various types of datasets, such as record, graph and network, ordered, and spatial data, along with data objects, each of which represents an entity.
• We looked at the different types of attributes and the relationship between them, e.g., all ratio attributes are also interval attributes.
• We talked about the need for the absolute “0”, which is what defines a ratio attribute.
• We learned the difference between discrete and continuous attributes: discrete attributes have distinct, countable values (e.g., age or binary values), whereas continuous attributes can take on any value within a finite or infinite interval.
Basic statistical descriptions of data
(Part 1)
Agenda
In this session, we will learn about:
● The basic statistical description of data
● Measures of central tendency
● Weighted arithmetic mean
● Symmetric vs. Skewed data
● Properties of a normal distribution curve
Basic statistical descriptions of data
● Motivation
○ To understand the data better.
● Central tendency
○ Mode, median, mean, midrange
● Dispersion of the data
○ Range, quartiles, interquartile range, five-number summary, boxplots
○ Variance and standard deviation
● Graphs to present data summaries and distributions
○ Quantile plots, histograms, scatter plots, etc.
Measuring the central tendency
● Mean (algebraic measure) (sample vs. population):
○ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
○ Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Note: n is the sample size, and N is the population size.
○ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
○ Trimmed mean: the mean after chopping extreme values
● Median:
○ Middle value if there is an odd number of ordered values, or the average of the middle two values otherwise
○ Estimated by interpolation (for grouped data)
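The measures above can be sketched in a few lines of Python. The data set `salaries` and the weights are hypothetical values invented for illustration, and the trimmed mean here drops a fixed count `k` from each end, which is only one of several possible trimming conventions.

```python
from statistics import median

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Each value contributes in proportion to its weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k trims away every value")
    return arithmetic_mean(ordered[k:len(ordered) - k])

# Hypothetical data (e.g., salaries in thousands), for illustration only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(arithmetic_mean(salaries))        # 58.0
print(median(salaries))                 # 54.0 (average of 52 and 56)
print(trimmed_mean(salaries, 1))        # 55.6 (30 and 110 removed)
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```

The single extreme value 110 pulls the mean above the median; trimming it moves the mean closer to the median.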
Measuring the central tendency (Contd.)
● Mode
○ The value that occurs most frequently for a variable.
○ A data set can be unimodal, bimodal, or trimodal, or have no mode if all values occur the same number of times.
○ Empirical relationship between mean, mode, and median on moderately skewed data: mean − mode ≈ 3 × (mean − median)
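A short sketch of finding modes with the standard library, plus the empirical relationship rearranged to estimate the mode. The data and the helper name `estimate_mode` are hypothetical, and the estimate is only a rough rule for unimodal, moderately skewed data.

```python
from statistics import multimode

def estimate_mode(mean, median):
    """Rearranged empirical rule: mode ≈ mean − 3 × (mean − median)."""
    return mean - 3 * (mean - median)

# Hypothetical data: 52 and 70 each occur twice, so the data are bimodal.
print(multimode([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))  # [52, 70]

# Every value occurs once: multimode returns all of them, i.e. "no mode".
print(multimode([1, 2, 3]))  # [1, 2, 3]

# Hypothetical summary statistics of a moderately skewed data set.
print(estimate_mode(50, 48))  # 44
```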
Weighted arithmetic mean

Estimated median by interpolation

[Figure: frequency table of grouped data, N = 3194]
$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
• $L_1$ = 21: lower boundary of the median interval
• N = 3194: total observations, so N/2 = 1597
• $(\sum \text{freq})_l$ = 200 + 450 + 300 = 950: sum of the frequencies of all intervals below the median interval
• $\text{freq}_{\text{median}}$ = 1500: frequency of the median interval
• width = 30: width of the median interval
• median = 21 + ((1597 − 950) / 1500) × 30 = 33.94
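The interpolation can be reproduced directly from the numbers on the slide. The function and parameter names below are chosen for this sketch; only the numeric inputs come from the slide.

```python
def interpolated_median(lower_bound, total, freq_below, freq_median, width):
    """Estimate the median of grouped data by linear interpolation
    inside the interval that contains the middle observation."""
    return lower_bound + (total / 2 - freq_below) / freq_median * width

estimate = interpolated_median(
    lower_bound=21,              # L1
    total=3194,                  # N
    freq_below=200 + 450 + 300,  # frequencies below the median interval
    freq_median=1500,            # frequency of the median interval
    width=30,                    # width of the median interval
)
print(round(estimate, 2))  # 33.94
```

The middle observation (the 1597th) lies 647 observations into an interval of 1500, so the estimate sits 647/1500 of the way across that interval.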
Making sense of the estimation

Symmetric vs Skewed data

[Figure: three distribution curves, labeled symmetric, positively skewed, and negatively skewed]
Median, mean, and mode of symmetric, positively skewed, and negatively skewed data.
Measuring the dispersion of data
● Range, quartiles, outliers, and boxplots
○ Range: the difference between the largest and smallest values in a set
■ Midrange: the average of the largest and smallest values in a data set
○ Quantiles: points taken at regular intervals of a data distribution
■ Quartiles and percentiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile)
○ Interquartile range: IQR = Q3 – Q1
○ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
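A minimal sketch of these dispersion measures, using a hypothetical data set. Quartile conventions differ between tools; this example relies on the default ("exclusive") method of Python's `statistics.quantiles`, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# Hypothetical data, invented for illustration only.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

value_range = max(values) - min(values)   # largest minus smallest
midrange = (max(values) + min(values)) / 2

q1, q2, q3 = quantiles(values, n=4)       # quartiles (default method)
iqr = q3 - q1

# The usual 1.5 x IQR rule for flagging outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(value_range, midrange)  # 80 70.0
print(q1, q2, q3, iqr)        # 47.75 54.0 68.25 20.5
print(outliers)               # [110]
```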
Measuring the dispersion of data (Contd.)

● Variance and standard deviation (sample: s, population: σ)
○ Variance s² (or σ²): algebraic, scalable computation
─ Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
─ Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
○ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
○ Mean absolute deviation (MAD): $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$
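The sample and population versions differ only in the divisor (n − 1 versus N). A small sketch with hypothetical data; `mean_absolute_deviation` is a helper written for this example.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical data, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

def mean_absolute_deviation(data):
    """Average absolute distance of each value from the mean."""
    center = mean(data)
    return sum(abs(x - center) for x in data) / len(data)

print(pvariance(values))              # population variance: divides by N
print(pstdev(values))                 # population standard deviation
print(round(variance(values), 4))     # sample variance: divides by n - 1
print(round(stdev(values), 4))        # sample standard deviation
print(mean_absolute_deviation(values))
```

For this data the squared deviations sum to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7.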
Properties of Normal Distribution Curve

[Figure: normal curves with the intervals μ ± σ, μ ± 2σ, and μ ± 3σ shaded]
• From μ–σ to μ+σ: contains about 68% of the observations
• From μ–2σ to μ+2σ: contains about 95% of the observations
• From μ–3σ to μ+3σ: contains about 99.7% of the observations
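These coverage figures can be verified from the normal cumulative distribution. The sketch below uses the identity P(|Z| < k) = erf(k / √2) for a standard normal variable; the function name `coverage_within` is chosen for this example.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```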
Summary
Here is a quick recap:
• We looked at the basic statistical description of data.
• We discussed the various measures of central tendency, such as the mean, weighted arithmetic mean, median, and mode.
• We understood that a distribution is symmetric when the data are distributed evenly around the mean, median, and mode, and skewed when the tail of the distribution is longer on one side than on the other.
• We discussed the various measures of dispersion of the data, such as variance and standard deviation.
• We discussed the properties of a normal distribution curve.
Basic statistical descriptions of data
(Part 2)
Agenda
In this session, we will learn about:
● Box plot
● Quantile plot
● Quantile-Quantile (Q-Q) plot
● Scatter plot
● Positive and negative correlation
● Non-correlated data
Graphic displays of basic statistical descriptions
• Boxplot: the graphic display of a five-number summary.
• Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $f_i \times 100\%$ of the data are ≤ $x_i$.
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• Bar chart: x-axis presents values, y-axis frequency/count
• Histogram: x-axis values, y-axis frequency/density
• Scatter plot: graphs bi/tri-variate data as points in a 2/3-D space
Box plot

• Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
• Boxplot
○ Data is represented with a box.
○ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
○ The median is marked by a line within the box.
○ Whiskers: two lines outside the box, extending to the smallest and largest observations that lie within 1.5 × IQR of the quartiles.
○ Outliers: points beyond a specified outlier threshold, plotted individually.
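The numbers a boxplot draws can be computed without any plotting library. This is a sketch under stated assumptions: the data are hypothetical, quartiles come from the default method of `statistics.quantiles` (plotting tools may use a different convention), and the function name `boxplot_stats` is made up for this example.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Five-number summary plus whisker ends and outliers (1.5 x IQR rule)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "min": ordered[0], "q1": q1, "median": med, "q3": q3,
        "max": ordered[-1],
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

# Hypothetical data, invented for illustration only.
stats = boxplot_stats([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(stats)
```

Here the upper whisker stops at 70 rather than at the maximum, because 110 lies beyond the upper fence and is plotted as an individual outlier.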
Quantile plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
• Plots quantile information of a univariate distribution
○ Data are sorted in increasing order. For a value $x_i$, $f_i$ indicates that approximately $f_i \times 100\%$ of the data are below or equal to the value $x_i$.
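The (f_i, x_i) pairs of a quantile plot can be sketched as below. The slide does not say how f_i is computed; this example assumes the common convention f_i = (i − 0.5) / n for the i-th smallest value, and the data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n.
    Assumption: this plotting-position formula is one common convention."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical data, invented for illustration only.
points = quantile_plot_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot.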
Quantile-Quantile (Q-Q) plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another, as points ($x_i$, $y_i$).
• Shows whether two variables follow the same distribution.
• The example shows the unit price of items sold at branch 1 vs. branch 2 for each quantile. Unit prices of items sold at branch 1 tend to be lower than those at branch 2.
[Figure: q-q plot of unit prices at branch 1 vs. branch 2]
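A q-q plot pairs matching quantiles from two samples. The sketch below computes the quartile pairs for two hypothetical sets of unit prices (the branch data from the slide's figure are not available), using the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical unit prices, invented for illustration only.
branch1 = [40, 50, 60, 70, 80]
branch2 = [45, 58, 66, 78, 90]

# Matching quartiles from each sample form the (x_i, y_i) points.
q_branch1 = quantiles(branch1, n=4, method="inclusive")
q_branch2 = quantiles(branch2, n=4, method="inclusive")
qq_points = list(zip(q_branch1, q_branch2))
print(qq_points)  # [(50.0, 58.0), (60.0, 66.0), (70.0, 78.0)]

# Every point lies above the line y = x: branch 1 prices tend to be lower.
print(all(x < y for x, y in qq_points))  # True
```

If both samples followed the same distribution, the points would fall close to the line y = x.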
Bar Chart vs Histogram
• Histogram: shows the distribution of a given variable by depicting the frequencies/densities of observations occurring in certain ranges of values.
• It differs from a bar chart:
Attribute         Histogram                Bar chart
Variable type     Continuous variables     Discrete variables
Gap between bars  Adjacent bars            Space-separated bars
Bar width         Could have varied width  Equal width
Bar order         Matters                  Does not matter
Histograms often tell more than boxplots

• The two histograms shown may have the same boxplot representation.
• They have the same values for min, Q1, median, Q3, and max, but rather different data distributions.
[Figure: two histograms with identical five-number summaries]
Scatter plot
• Presents bivariate numerical data to reveal clusters of points, outliers, correlations, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
Positive and negative correlation

[Figure: two scatter plots side by side]
• The left panel shows a positive linear correlation.
• The right panel shows a negative linear correlation.
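The direction of a linear relationship can be quantified with a correlation coefficient. The slides do not name a specific coefficient; the sketch below assumes Pearson's r, and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # changes in the same direction as x
falling = [10, 8, 6, 4, 2]  # changes in the opposite direction

print(round(pearson_r(x, rising), 6))   # 1.0
print(round(pearson_r(x, falling), 6))  # -1.0
```

Values near +1 or −1 indicate a strong positive or negative linear correlation; values near 0 indicate no linear correlation.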
Non-correlated data

Summary

Here is a quick recap:
• We looked at various charts and plots, such as the box plot, histogram, quantile plot, Q-Q plot, and scatter plot.
• We learned that a positive correlation describes a relationship between two variables that change in the same direction, and a negative correlation describes a relationship between two variables that change in opposite directions.
Hypothesis testing
Agenda

In this session, we will learn about:


• Hypothesis testing
• One-tailed and two-tailed tests
• Steps to perform hypothesis testing
Hypothesis testing

• Hypothesis testing is a scientific process of testing whether a hypothesis is plausible or not.
• A test includes a null hypothesis and an alternative hypothesis, for example:
○ Null hypothesis: $\bar{x} = \mu$; alternative hypothesis: $\bar{x} > \mu$
○ Null hypothesis: variable A and variable B are independent; alternative hypothesis: they are correlated
• The goal is to determine which hypothesis is more likely to be true at a given confidence level.
Hypothesis testing: comparison of means
• One-sample Z test
○ Degrees of freedom: not applicable
○ Application: testing the difference of a sample mean, $\bar{x}$, with a known population mean, $\mu$
○ Assumptions: normal distribution, known population σ
○ Test statistic: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
• One-sample t test
○ Degrees of freedom: n−1
○ Application: testing the difference of one sample mean, $\bar{x}$, with a given mean, $\mu$
○ Assumptions: normal distribution, population standard deviation σ is unknown
○ Test statistic: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
• Two-sample t test
○ Degrees of freedom: n1+n2−2
○ Application: testing the difference of two sample means when the population variances are unknown but considered equal
○ Assumptions: normal distribution
○ Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, where $s_p$ is the pooled standard deviation
[Figure: rejection area under the distribution curve]
Hypothesis testing
• Paired t test
○ Degrees of freedom: n−1
○ Application: testing two sample means when their respective population standard deviations are unknown but considered equal, and the data are recorded in pairs, each pair having a difference, d
○ Assumptions: normal distribution, two dependent samples, always a two-tailed test
○ Test statistic: $t = \frac{\bar{d}}{S_d / \sqrt{n}}$, where $S_d$ is the standard deviation of the differences
• One-way ANOVA
○ Degrees of freedom: n1−1 & n2−1
○ Application: testing the difference of three or more sample means
○ Assumptions: normal distribution
Confidence level and significance level
• Significance level: the risk we are willing to take of rejecting the null hypothesis when it is actually true.
○ Typical: 5% or 1%
• Confidence level = 1 – significance level
○ Typical: 95% or 99%
• Example: $\bar{x}$ = 110, $\mu$ = 100, and say our Z = 2.5. For a one-tailed test at the 5% significance level, the rejection area is Z > 1.645, since P(Z > 1.645) = 0.05. Because 2.5 > 1.645, the null hypothesis is rejected.
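The link between a significance level and its critical value can be checked with the standard library's `NormalDist`. This sketch assumes a one-tailed (upper-tail) Z test, as in the slide's example with Z = 2.5.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Critical values: the Z beyond which the upper tail holds alpha.
for alpha in (0.05, 0.01):
    critical = standard_normal.inv_cdf(1 - alpha)
    print(alpha, round(critical, 3))
# 0.05 1.645
# 0.01 2.326

# Upper-tail probability of the observed statistic from the slide.
z_observed = 2.5
p_value = 1 - standard_normal.cdf(z_observed)
print(round(p_value, 4))  # 0.0062
```

Because the p-value (about 0.006) is below both 0.05 and 0.01, the null hypothesis would be rejected at either significance level in this one-tailed setting.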
One-tailed and two-tailed tests
● One-tailed test
1. Null hypothesis: $\bar{x} = \mu$
2. Alternative hypothesis: $\bar{x} > \mu$, where $\mu$ is the hypothesized mean
● Two-tailed test
1. Null hypothesis: $\bar{x} = \mu$
2. Alternative hypothesis: $\bar{x} \neq \mu$, where $\mu$ is the hypothesized mean
Hypothesis testing: steps

• Steps in hypothesis testing:
1. State the hypotheses (null and alternative).
2. Identify the test statistic and its probability distribution.
3. Specify the significance level.
4. Collect the data and perform the calculations.
5. Make the statistical decision.
6. Make the business decision.
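Steps 1 to 5 can be sketched for a one-sample Z test as follows. The sample mean 110 and hypothesized mean 100 echo the earlier example; the population standard deviation (20) and sample size (25) are assumptions chosen so that Z comes out to 2.5, and the function name `one_sample_z_test` is made up for this sketch.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha, two_tailed):
    """Return (z, p_value, reject_null) for a one-sample Z test.
    The one-tailed variant tests the alternative: mean > mu0."""
    # Step 2: the test statistic follows a standard normal distribution.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    upper_tail = 1 - NormalDist().cdf(abs(z) if two_tailed else z)
    p_value = 2 * upper_tail if two_tailed else upper_tail
    # Step 5: reject the null hypothesis when p is below alpha (step 3).
    return z, p_value, p_value < alpha

# Step 1: H0: mean = 100; H1: mean != 100 (two-tailed).
# Step 4: hypothetical data summary (sigma and n are assumed values).
z, p, reject = one_sample_z_test(110, 100, sigma=20, n=25,
                                 alpha=0.05, two_tailed=True)
print(z, round(p, 4), reject)  # 2.5 0.0124 True
```

With the same data, a two-tailed test at the 1% level would not reject the null hypothesis, which is why the significance level has to be fixed before looking at the result. Step 6, the business decision, depends on context outside the calculation.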
Summary

Here is a quick recap:
• We discussed hypothesis testing, a scientific process of testing whether or not a hypothesis is plausible.
• We understood that one-tailed tests test for an effect in one direction, while two-tailed tests test for an effect in both directions, positive and negative.
• We looked at various tests for checking the null and alternative hypotheses, such as the one-sample Z test and the two-sample t test.
• We looked at the steps to conduct a hypothesis test.
Measuring data similarity and dissimilarity
Agenda
In this session, we will learn about:
● Measuring data similarity and dissimilarity
● Proximity measures for nominal, ordinal, and binary attributes
● Proximity measures for numerical attributes and normalization
● Compute dissimilarity with mixed type variables
● Cosine similarity
Similarity and Dissimilarity
● Similarity
○ Numerical measure of how alike two data objects are.
○ Value is higher when objects are more alike.
○ Often falls in the range [0,1].
● Dissimilarity (e.g., distance)
○ Numerical measure of how different two data objects are.
○ Lower when objects are more alike.
○ Minimum dissimilarity is often 0.
○ Upper limit varies.
● Proximity may refer to either similarity or dissimilarity.
Data matrix and (dis)similarity matrix
● Data matrix
○ n data points with p dimensions
○ Two modes: object + feature
● Distance/similarity matrix
○ n data points, but registers only the distance/similarity between pairs of points
○ Is often a symmetric matrix
○ Single mode: (dis)similarity
Proximity measure for nominal attributes
A nominal attribute can take two or more states/values (a generalization of a binary attribute); e.g., color can be red, yellow, blue, or green.
Method 1: Simple matching

The observations cat1 and cat2 are described by the nominal attributes color, size, and sleep time.

Objects   Color    Size     Sleep time
cat1      yellow   small    <5 hours
cat2      yellow   medium   5-8 hours

$$d(i, j) = \frac{p - m}{p}$$

d(i, j): distance between objects i and j
m: number of attributes with the same value in both objects
p: total number of attributes
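
The simple matching distance can be sketched in a few lines, assuming the usual ratio d(i, j) = (p - m) / p. The dictionary layout and the function name `simple_matching_distance` are illustrative choices, not part of the slides.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if obj_i.keys() != obj_j.keys():
        raise ValueError("both objects must have the same attributes")
    p = len(obj_i)                                              # total attributes
    m = sum(1 for attr in obj_i if obj_i[attr] == obj_j[attr])  # matching attributes
    return (p - m) / p

cat1 = {"color": "yellow", "size": "small", "sleep_time": "<5 hours"}
cat2 = {"color": "yellow", "size": "medium", "sleep_time": "5-8 hours"}

# Only color matches, so m = 1 and p = 3.
d = simple_matching_distance(cat1, cat2)
print(round(d, 3))  # 0.667
```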
Proximity measure for binary attributes
Method 2: One-hot encoding
● In this method, each nominal attribute is converted to a set of binary attributes.
● Creating a new binary attribute for each nominal state is called “one-hot encoding”; it yields the binary attribute table shown below.
● A proximity measure for binary attributes is then used to measure the similarity between the objects.

Objects   color-yellow   color-…   size-small   size-medium   sleep time <5 hours   sleep time 5-8 hours
cat1      1              0         1            0             1                     0
cat2      1              0         0            1             0                     1
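
A minimal sketch of one-hot encoding for the two cats follows. The helper `one_hot` is hypothetical and only creates columns for states that actually occur in the data, so it produces no column for the unobserved colors that the table abbreviates as "color-…".

```python
def one_hot(objects, attributes):
    """Return (columns, rows) with one binary column per observed attribute state."""
    columns = []
    for attr in attributes:
        states = []
        for obj in objects:
            if obj[attr] not in states:   # keep states in first-seen order
                states.append(obj[attr])
        columns.extend((attr, state) for state in states)
    rows = [[1 if obj[attr] == state else 0 for attr, state in columns]
            for obj in objects]
    return columns, rows

cats = [
    {"color": "yellow", "size": "small", "sleep_time": "<5 hours"},
    {"color": "yellow", "size": "medium", "sleep_time": "5-8 hours"},
]
columns, rows = one_hot(cats, ["color", "size", "sleep_time"])
print(columns)
print(rows)  # [[1, 1, 0, 1, 0], [1, 0, 1, 0, 1]]
```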

Proximity measure for binary attributes

For a pair of observations (e.g., cat1 and cat2 in the previous table), the attributes on which they agree or disagree are counted from the binary attribute table, giving the contingency table shown below:
                      Object j
                 1       0       sum
Object i   1     q       r       q+r
           0     s       t       s+t
           sum   q+s     r+t     p

q is the number of attributes that equal 1 for both objects
r is the number of attributes that equal 1 for object i but 0 for object j
s is the number of attributes that equal 0 for object i but 1 for object j
t is the number of attributes that equal 0 for both objects
p = q + r + s + t is the total number of attributes
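
The four counts can be obtained with a short sketch such as the one below, applied to the one-hot rows of cat1 (object i) and cat2 (object j) from the earlier table. The function name `binary_contingency` is an illustrative choice.

```python
def binary_contingency(x, y):
    """Count q, r, s, t for two equal-length 0/1 vectors (x = object i, y = object j)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # i is 1, j is 0
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # i is 0, j is 1
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    return q, r, s, t

cat1 = [1, 0, 1, 0, 1, 0]
cat2 = [1, 0, 0, 1, 0, 1]
q, r, s, t = binary_contingency(cat1, cat2)
print(q, r, s, t)  # 1 2 2 1
```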
Proximity measure for binary attributes
Using the contingency table created previously, the distance and similarity measures are calculated as follows:

Distance measure for symmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

Distance measure for asymmetric binary variables (the negative matches t are ignored):

$$d(i, j) = \frac{r + s}{q + r + s}$$

Jaccard coefficient (similarity measure for asymmetric binary variables):

$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$$

Compute dissimilarity using binary variables

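Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

● Gender is a nominal attribute.
● The remaining attributes are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.
● Only the asymmetric binary attributes are considered here.
● With d(i, j) = (r + s) / (q + r + s):
○ d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
○ d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
○ d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75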
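
The pairwise distances for the three patients can be checked with a short sketch, assuming the asymmetric binary distance d(i, j) = (r + s) / (q + r + s), which ignores attributes that are 0 for both objects. Returning 0.0 when q + r + s = 0 is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))  # 0.33 0.67 0.75
```

Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.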

Distance on numeric data: Minkowski distance
● Minkowski distance: a popular distance measure

$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm).

● Properties
○ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
○ d(i, j) = d(j, i) (symmetry)
○ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
● A distance that satisfies these properties is a metric.
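
The following sketch implements the Minkowski distance of order h and spot-checks the three metric properties on a few hypothetical 2-dimensional objects. The points and the function name `minkowski` are assumptions chosen for illustration, and a check on three points is of course not a proof.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if h < 1:
        raise ValueError("this sketch only covers orders h >= 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

# Hypothetical objects, chosen only for illustration.
i, j, k = (1.0, 2.0), (4.0, 6.0), (2.0, 0.0)

for h in (1, 2, 3):
    d_ij = minkowski(i, j, h)
    assert d_ij > 0 and minkowski(i, i, h) == 0              # positive definiteness
    assert d_ij == minkowski(j, i, h)                        # symmetry
    assert d_ij <= minkowski(i, k, h) + minkowski(k, j, h)   # triangle inequality
    print(h, round(d_ij, 3))
```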

Special cases of Minkowski distance
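● h = 1: Manhattan (city block, L1 norm) distance

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

○ E.g., the Hamming distance: the number of bits that differ between two binary vectors

● h = 2: Euclidean (L2 norm) distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$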
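
As a small illustration of the Hamming remark, the sketch below shows that on 0/1 vectors the Manhattan distance coincides with the count of differing bits. The two bit vectors are hypothetical.

```python
def manhattan(x, y):
    """L1 distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two equal-length vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical binary vectors differing in three positions.
u = [1, 0, 1, 1, 0, 0, 1]
v = [1, 1, 0, 1, 0, 1, 1]

print(manhattan(u, v), hamming(u, v))  # 3 3
```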

Special cases of Minkowski distance (Contd.)
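● Chebyshev distance:
○ When h → ∞: the “supremum” (Lmax norm, L∞ norm) distance.
■ This is the maximum difference between any component (attribute) of the vectors:

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$

● When h → -∞:
○ This is the minimum difference between any component (attribute) of the vectors. Unlike the cases above, this limit is not a metric, because it is 0 whenever two distinct objects agree on a single attribute.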
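
The convergence of the Minkowski distance to the supremum distance as h grows can be observed numerically. This is a sketch with hypothetical vectors whose component differences are 3, 4, and 7; the orders tried are arbitrary.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """Chebyshev (L-infinity) distance: the largest component difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical vectors; the component differences are 3, 4, and 7.
x = (1.0, 2.0, 10.0)
y = (4.0, 6.0, 3.0)

orders = (1, 2, 4, 8, 16, 32, 64)
distances = [minkowski(x, y, h) for h in orders]
for h, d in zip(orders, distances):
    print(h, round(d, 4))
print("supremum", supremum(x, y))
```

The printed values shrink from 14.0 (the Manhattan distance) toward 7.0, the largest single difference.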

Example: Minkowski distance

[Figure: worked example comparing the Manhattan (L1), Euclidean (L2), and supremum distances]
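
To make the comparison concrete, the sketch below computes all pairwise Manhattan, Euclidean, and supremum distances for four 2-dimensional points. The points are hypothetical stand-ins, since the data from the original figure is not available here.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """L-infinity distance: the largest component difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical points, not taken from the original figure.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def pairwise(dist):
    """Map every ordered pair of point names to its distance."""
    names = list(points)
    return {(a, b): dist(points[a], points[b]) for a in names for b in names}

manhattan = pairwise(lambda x, y: minkowski(x, y, 1))
euclidean = pairwise(lambda x, y: minkowski(x, y, 2))
sup = pairwise(supremum)

for pair in [("x1", "x2"), ("x1", "x3"), ("x2", "x4")]:
    print(pair, manhattan[pair], round(euclidean[pair], 2), sup[pair])
```

For every pair, the supremum distance is at most the Euclidean distance, which is at most the Manhattan distance.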
Normalization of numerical values
● Values measured on different scales cannot be compared directly.
Student A: SAT = 1800
Student B: ACT = 24
Which student performed better relative to other test-takers?
● Normalization is widely used with multi-dimensional datasets involving different scales, e.g., in clustering, multidimensional scaling, and principal component analysis.

                     SAT    ACT
Mean                 1500   21
Standard deviation   300    5

Standardizing numeric data
● Z-score

$$z = \frac{x - \mu}{\sigma}$$

○ x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
○ z is the distance between the raw score and the population mean in units of the standard deviation
○ z is negative when the raw score is below the mean and positive when it is above

● An alternative: calculate the mean absolute deviation (MAD),

$$s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)$$

where

$$m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)$$

Standardized measure:

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

● MAD is more robust to outliers than the standard deviation because, in the former, the differences from the mean are not squared.
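
The sketch below applies the z-score, z = (x - μ) / σ, to the SAT/ACT question from the normalization slide (SAT: mean 1500, standard deviation 300; ACT: mean 21, standard deviation 5), and then shows the mean-absolute-deviation variant on a small hypothetical sample with one outlier. The sample values and function names are assumptions for illustration.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

z_sat = z_score(1800, 1500, 300)  # Student A
z_act = z_score(24, 21, 5)        # Student B
print(z_sat, z_act)               # 1.0 0.6

def mean_absolute_deviation(values):
    """s_f: the average absolute difference from the mean m_f."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

def mad_standardize(values):
    """z_if = (x_if - m_f) / s_f for every value of one attribute."""
    m = sum(values) / len(values)
    s = mean_absolute_deviation(values)
    return [(v - m) / s for v in values]

scores = [10.0, 12.0, 11.0, 13.0, 54.0]  # hypothetical sample with one outlier
print(mean_absolute_deviation(scores))   # approximately 13.6
print([round(z, 2) for z in mad_standardize(scores)])
```

Student A is 1.0 standard deviations above the SAT mean and Student B is 0.6 above the ACT mean, so Student A performed better relative to other test-takers.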
Proximity measure for ordinal variables
● Order is important, e.g., rank
Math grade: A, B, C, D, E.
● Map ordinal values to values between 0 and 1 (i.e., treat them as interval-scaled):
1. Replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of ordered states of variable f.
2. Map the range of each variable onto [0, 1] by replacing the i-th value in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

3. Compute the dissimilarity using the methods for numerical variables, applied to the $z_{if}$ values.
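
The three steps can be sketched for the math-grade example, assuming the usual rank mapping z = (r - 1) / (M - 1) with ranks r in 1..M. Ranking the grades from E (lowest, rank 1) to A (highest, rank 5) is an assumption made here; the reverse order gives the same pairwise distances.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] via z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    r = ordered_states.index(value) + 1   # step 1: rank in 1..M
    return (r - 1) / (M - 1)              # step 2: rescale onto [0, 1]

grades = ["E", "D", "C", "B", "A"]        # assumed order: lowest to highest
z = {g: ordinal_to_unit_interval(g, grades) for g in grades}
print(z)  # {'E': 0.0, 'D': 0.25, 'C': 0.5, 'B': 0.75, 'A': 1.0}

# Step 3: treat the mapped values as numeric, e.g. with an absolute difference.
d_A_C = abs(z["A"] - z["C"])
d_A_B = abs(z["A"] - z["B"])
print(d_A_C, d_A_B)  # 0.5 0.25
```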

Dissimilarity between objects with mixed-type attributes
● A database may contain all attribute types.
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects.
● Distance between objects i and j over p attributes, indexed by f:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

● The indicator $\delta_{ij}^{(f)}$ is
○ 0 if f is missing for either object i or j, or if f for i and j are both 0 and f is an asymmetric binary attribute
○ 1 otherwise
● The contribution $d_{ij}^{(f)}$ depends on the type of f:
○ f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
○ f is numeric: use the normalized distance
○ f is ordinal: convert to numeric
Compute dissimilarity with mixed variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

● Gender is a nominal attribute; others are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.

Compute dissimilarity with mixed variables (Contd.)

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Considering all attributes:

d(Jack, Mary) = (1*1+0*1+0*0+0*1+0*0+1*1+0*0)/(1+1+1+1) = 2/4 = 0.5

Each numerator term is d*δ for one attribute; the denominator sums the non-zero δ values:

Gender: nominal: d=1, δ=1
Fever: asymmetric: d=0, δ=1
Cough: asymmetric, both 0: d=0, δ=0
Test-1: asymmetric: d=0, δ=1
Test-2: asymmetric, both 0: d=0, δ=0
Test-3: asymmetric: d=1, δ=1
Test-4: asymmetric, both 0: d=0, δ=0
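
The weighted combination can be sketched for this table as below. The function `mixed_dissimilarity` and the `kinds` mapping are hypothetical, and the sketch handles only nominal and asymmetric binary attributes; numeric and ordinal attributes would need their own normalized contributions.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds):
    """Sum of delta * d over attributes, divided by the sum of delta."""
    numerator = 0.0
    denominator = 0.0
    for attr, kind in kinds.items():
        xi, xj = obj_i.get(attr), obj_j.get(attr)
        if xi is None or xj is None:
            continue                      # delta = 0: value missing
        if kind == "asymmetric" and xi == 0 and xj == 0:
            continue                      # delta = 0: joint absence
        numerator += 0.0 if xi == xj else 1.0   # d = 0 on a match, 1 otherwise
        denominator += 1.0                      # delta = 1
    return numerator / denominator if denominator else 0.0

kinds = {"gender": "nominal", "fever": "asymmetric", "cough": "asymmetric",
         "test1": "asymmetric", "test2": "asymmetric",
         "test3": "asymmetric", "test4": "asymmetric"}

jack = {"gender": "M", "fever": 1, "cough": 0, "test1": 1, "test2": 0, "test3": 0, "test4": 0}
mary = {"gender": "F", "fever": 1, "cough": 0, "test1": 1, "test2": 0, "test3": 1, "test4": 0}
jim = {"gender": "M", "fever": 1, "cough": 1, "test1": 0, "test2": 0, "test3": 0, "test4": 0}

d_jack_mary = mixed_dissimilarity(jack, mary, kinds)
d_jack_jim = mixed_dissimilarity(jack, jim, kinds)
d_jim_mary = mixed_dissimilarity(jim, mary, kinds)
print(d_jack_mary, d_jack_jim, d_jim_mary)  # 0.5 0.5 0.8
```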
Cosine similarity: similarity between vectors of numerical values
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
Document    Team  Coach  Hockey  Baseball  Soccer  Penalty  Score  Win  Loss  Season
Document1   5     0      3       0         2       0        0      2    0     0
Document2   3     0      2       0         1       1        0      1    0     1
Document3   0     7      0       2         1       0        0      3    0     0
Document4   0     1      0       0         1       2        2      0    3     0

• The angle between any two vectors (documents) can be used as a measure of the similarity between the two documents: the smaller the angle, the larger its cosine and the more similar the documents.

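Similarity between vectors of numerical values (Contd.)

• If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 ∙ d2) / (||d1|| × ||d2||),

where ∙ indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.

• For vectors with non-negative entries, such as term-frequency vectors, cosine similarity is in [0, 1]; for general real-valued vectors it is in [-1, 1].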


Sharing or publishing the contents in part or full is liable for legal action.
Example: Cosine similarity

• cos(d1, d2) = (d1 ∙ d2) /(||d1|| ||d2||) ,


where ∙ indicates vector dot product, ||d||: the length of vector d

• Example: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 ∙ d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481*4.123) = 0.94
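
The same calculation can be reproduced with a short sketch that covers all four documents from the term-frequency table. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made here.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Term-frequency vectors from the document table.
docs = {
    "Document1": (5, 0, 3, 0, 2, 0, 0, 2, 0, 0),
    "Document2": (3, 0, 2, 0, 1, 1, 0, 1, 0, 1),
    "Document3": (0, 7, 0, 2, 1, 0, 0, 3, 0, 0),
    "Document4": (0, 1, 0, 0, 1, 2, 2, 0, 3, 0),
}

sim_12 = cosine_similarity(docs["Document1"], docs["Document2"])
print(round(sim_12, 2))  # 0.94

# Remaining pairs, for comparison.
names = list(docs)
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        print(names[a], names[b], round(cosine_similarity(docs[names[a]], docs[names[b]]), 2))
```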

Summary

Here is a quick recap:


● We learned how to measure data similarity and dissimilarity.
● We looked into distance measures for numeric data, such as the Minkowski distance, and into standardization.
● We learned proximity measures for nominal, ordinal, and binary attributes.
● We also learned how to compute dissimilarity with mixed variables.
● We discussed cosine similarity with an example.
Learning Outcomes
Having reached the end of this module, you should now be able to:
• Differentiate between the types of attributes: nominal, binary, ordinal, interval-scaled, and ratio-scaled.
• Evaluate basic descriptive statistics of a dataset: central tendency and dispersion.
• Illustrate and interpret graphic plots that display descriptive statistics.
• Summarize statistical hypothesis testing.
• Evaluate object similarity and dissimilarity in mixed-type datasets.
• Summarize cosine similarity using an example.