Know your data


Week 2

This file is meant for personal use by [email protected] only.

Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Learning objectives
By the end of this module, you will be able to:
• List the different types of attributes.
• Compute basic descriptive statistics of a dataset.
• Create and read graphic plots that display descriptive statistics.
• Explain statistical hypothesis testing.
• Compute object similarity and dissimilarity.
Agenda
In this session, we will learn about:
• Types of datasets
• Data objects
• Attributes and their types
• Relationship of attributes
• Need for the absolute “0”
• Discrete vs. continuous attributes
Types of data sets
Type of dataset, with examples:
• Record
○ Relational records
○ Data matrix: numerical matrix, crosstabs
○ Document data: text documents, term-frequency vectors
○ Transactional data
• Graph and network
○ World Wide Web
○ Social or information networks
○ Molecular structures
• Ordered
○ Video data: sequence of images
○ Temporal data: time series
○ Sequential data: transaction sequences
○ Genetic sequence data
• Spatial, image and multimedia
○ Spatial data: maps
○ Image data
○ Video data
Illustration - Record type datasets
Document data
Documents Team Coach Play Ball Score Game Win Lost Timeout Season
Document1 3 0 5 0 2 6 0 2 0 2
Document2 0 7 0 2 1 0 0 3 0 0
Document3 0 1 0 0 1 2 2 0 3 0
Transactional data
TID Items
1 Bread, coke, milk
2 Beer, bread
3 Beer, coke, diaper, milk
4 Beer, bread, diaper, milk
5 Coke, diaper, milk
Data objects
• Data sets are made up of data objects.
• A data object represents an entity (an observation).
• Examples:
○ sales database: customers, store items, sales
○ medical database: patients, treatments
○ university database: students, professors, courses
• Data objects are also called observations, samples, examples, instances, data points, objects, and
tuples.
• Data objects are described by their attributes.
• In a database table, rows correspond to data objects and columns correspond to attributes.
Fitness of a dataset
Data sets need to be as complete as possible for the questions/phenomena to be studied.
• Are data objects/observations in your data set representative of the population under study?
○ Areas of the aircraft most likely to be damaged in the war.
○ Identifying tanks in the forest.
• Are relevant attributes comprehensively included in your data set?
○ Most militarized country in the world?
○ Body height and total earnings?
Attributes
• An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
• An attribute vector (feature vector) is the set of attribute values that describes a single data object.
• Types:
o Categorical: qualitative
─ Nominal, binary, ordinal
o Numeric: quantitative
─ Interval-scaled, ratio-scaled

• Nominal: categories, states, or “names of things”

○ Hair_color = {auburn, black, blond, brown, grey, red, white}
○ marital status, occupation, ID numbers, zip codes
• Binary
○ Nominal attribute with only 2 states (0 and 1)
○ Symmetric binary: both outcomes are equally important.
─ e.g., biological gender
○ Asymmetric binary: outcomes are not equally important.
─ e.g., medical test (positive vs. negative)
─ Convention: assign 1 to the more important outcome (e.g., positive)
• Ordinal
○ Values have a meaningful order (ranking), but the magnitude
between successive values is unknown.
○ Size = {small, medium, large}, letter grades, army rankings
Numeric attribute types: Quantitative
• Quantity (integer or real-valued)
• Interval
○ Measured on a scale of equal-sized units
○ Values have order
─ E.g., the temperature in Celsius or Fahrenheit units, calendar
dates
○ No true zero-point
• Ratio
○ Inherent zero-point
○ We can speak of a value as being a multiple of another value (10 kelvins is twice as high as 5 kelvins).
─ E.g., the temperature in Kelvin, length, counts, monetary quantities.
Relationship of attributes
• All ratio attributes are interval attributes.
• All interval attributes are ordinal attributes.
• All ordinal attributes are nominal attributes.

Attribute types, from the weakest level upward:
• Nominal (weakest): attributes are only named
• Ordinal: attributes can be ordered
• Interval: distance is meaningful
• Ratio: absolute zero
The need for the absolute “0”
• Temperature measured at 20 ˚C is not twice that at 10 ˚C because 0 ˚C is not the absolute 0.
• Ratios cannot be defined reliably on an arbitrary 0.
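The point about the absolute zero can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the helper name `celsius_to_kelvin` is an assumption introduced for illustration. It compares the naive ratio of two Celsius readings with the ratio of the same temperatures expressed in kelvins.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to kelvins, a scale with an absolute zero."""
    return celsius + 273.15


# Naive ratio on the interval scale (Celsius): suggests "twice as hot".
naive_ratio = 20 / 10

# Ratio on the ratio scale (Kelvin): the two temperatures differ by only a few percent.
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(naive_ratio)
print(round(true_ratio, 3))
```

The Celsius ratio depends on where the arbitrary zero was placed, while the Kelvin ratio does not, which is why ratios are only meaningful for ratio-scaled attributes.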
Discrete vs continuous attribute
• Discrete attribute
○ Values are distinct and separate (unconnected values).
○ Can have integer values, e.g., age or binary values.
○ Can be infinite but must be countable (each value in the set has a corresponding integer).

• Continuous attribute
○ Has real numbers as attribute values, e.g., temperature, height, or weight.
○ Can take on any value within a finite or infinite interval.
○ Continuous attributes are typically represented as floating-point variables.
Summary
Here is a quick recap:
• We discussed various types of datasets, such as record and ordered datasets, along with data objects, each of which represents an entity.
• We looked at the different types of attributes and the relationships between them, for example that all ratio attributes are also interval attributes.
• We talked about the need for the absolute “0”, which is what defines a ratio attribute.
• We learned the difference between discrete and continuous attributes: discrete attributes can have integer values (for example, age) or binary values, whereas continuous attributes can take on any value within a finite or infinite interval.
Agenda
In this session, we will learn about:
● The basic statistical description of data
● Measures of central tendency
● Weighted arithmetic mean
● Symmetric vs. Skewed data
● Properties of a normal distribution curve
Basic statistical descriptions of data
● Motivation
○ To understand the data better.
● Central tendency
○ Mode, median, mean, midrange
● Dispersion of the data
○ Range, quartiles, interquartile range, five-number summary, boxplots
○ Variance and standard deviation
● Graphs to present data summaries and distributions
○ Quantile plots, histograms, scatter plots, etc.
Measuring the central tendency
● Mean (algebraic measure) (sample vs. population):
○ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum x}{N}$
Note: n is the sample size, and N is the population size.
○ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
○ Trimmed mean: the mean computed after chopping off extreme values
● Median:
○ Middle value if there is an odd number of ordered values, or the average of the middle two values otherwise
○ Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
Measuring the central tendency (Contd.)
● Mode
○ Values that occur most frequently for a variable.
○ Unimodal, bimodal, trimodal, or no mode if all values occur equally often
○ Empirical relationship between mean, mode, and median on moderately skewed data: mean − mode ≈ 3 × (mean − median)
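The measures of central tendency above can be sketched with Python's standard `statistics` module. This is an illustrative example only: the data values, the weights, and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the sketch and do not come from the slides.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data set with one extreme value (110).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(values)
median_value = statistics.median(values)   # even count: average of the middle two
modes = statistics.multimode(values)       # every value tied for the highest frequency
trimmed = trimmed_mean(values, k=1)        # drops 30 and 110

print(mean_value, median_value, modes, trimmed)
print(weighted_mean([80, 90], [1, 3]))     # the second score counts three times as much
```

Because this data set has two values tied for the highest frequency, it is bimodal, and the trimmed mean is lower than the plain mean because the extreme value 110 is removed.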

Worked example: median of grouped data by interpolation
N = 3194 (total observations), so N/2 = 3194/2 = 1597
L1 = 21 (lower boundary of the median interval)
(Σ freq)_l = 200 + 450 + 300 = 950 (sum of the frequencies of the intervals below the median interval)
freq_median = 1500
width = 30
median = 21 + ((1597 − 950) / 1500) × 30 = 33.94
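The interpolation for the median of grouped data, median = L1 + ((N/2 − sum of lower frequencies) / freq_median) × width, can be sketched in Python using the numbers given on the slide. The function and parameter names are assumptions introduced for this illustration.

```python
def grouped_median(lower_bound, n_total, freq_below, freq_median, width):
    """Approximate the median of grouped data by linear interpolation.

    lower_bound: lower boundary of the median interval (L1)
    n_total:     total number of observations (N)
    freq_below:  sum of the frequencies of all intervals below the median interval
    freq_median: frequency of the median interval
    width:       width of the median interval
    """
    return lower_bound + (n_total / 2 - freq_below) / freq_median * width


# Values taken from the slide: L1 = 21, N = 3194, 200 + 450 + 300 = 950,
# freq_median = 1500, width = 30.
median = grouped_median(21, 3194, 200 + 450 + 300, 1500, 30)
print(round(median, 2))  # 33.94
```

The estimate assumes that observations are spread evenly inside the median interval, which is why it is only an approximation.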
Making sense of the estimation

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.]
Measuring the dispersion of data
● Range, quartiles, outliers, and boxplots
○ Range: the difference between the largest and smallest values in a set
■ Midrange: the average of the largest and smallest values in a data set
○ Quantiles: points taken at regular intervals of a data distribution
■ Quartiles and percentiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile)
○ Interquartile range: IQR = Q3 − Q1
○ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
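The dispersion measures on this slide can be sketched as follows. This is a minimal example under stated assumptions: the data are hypothetical, the helper name `dispersion_summary` is ours, and quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics


def dispersion_summary(data):
    """Range, midrange, quartiles, IQR, and 1.5 x IQR outliers for a data set."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
    upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers
    return {
        "range": max(data) - min(data),
        "midrange": (max(data) + min(data)) / 2,
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical data set with one extreme value.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = dispersion_summary(values)
print(summary)
```

For this data only the value 110 lies beyond the upper fence, so it is the single flagged outlier.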
Measuring the dispersion of data (Contd.)

● Variance and standard deviation (sample: s, population: σ)
○ Variance s² (or σ²): algebraic, scalable computation
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
○ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
○ Mean absolute deviation (MAD): $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$
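As a small illustration of the sample versus population distinction, the sketch below uses Python's `statistics` module, where `variance` divides by n − 1 and `pvariance` divides by N. The data values are hypothetical, and the mean absolute deviation is computed by hand because it is not assumed to be available as a library call.

```python
import statistics

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by N
population_std = statistics.pstdev(data)     # square root of the population variance

# Mean absolute deviation: average absolute distance from the mean.
mean_value = statistics.fmean(data)
mad = sum(abs(x - mean_value) for x in data) / len(data)

print(sample_var, population_var, population_std, mad)
```

The sample variance is slightly larger than the population variance for the same numbers because of the n − 1 denominator.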

● Properties of the normal distribution curve
○ From μ − σ to μ + σ: contains about 68% of the observations
○ From μ − 2σ to μ + 2σ: contains about 95% of the observations
○ From μ − 3σ to μ + 3σ: contains about 99.7% of the observations
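The coverage of a normal distribution within one, two, and three standard deviations of the mean can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage_within` is an assumption introduced for illustration.

```python
from statistics import NormalDist


def coverage_within(k: float) -> float:
    """Probability that a normal variable falls within k standard deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
```

The three printed values are roughly 68%, 95%, and 99.7%, so the interval from μ − 3σ to μ + 3σ covers nearly all observations of a normally distributed variable.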
Summary
Here is a quick recap:
• We looked at the basic statistical description of the data.
• We discussed the various measures of central tendency, such as the mean, weighted arithmetic mean, median, and mode.
• We understood that a distribution is symmetric when the data are evenly distributed around the mean, median, and mode, and skewed when the tail of the distribution is longer on the left-hand side than on the right-hand side or vice versa.
• We discussed the various measures of dispersion of the data, such as variance and standard deviation.
• We discussed the properties of a normal distribution curve using various distribution graphs.
Agenda
In this session, we will learn about:
● Box plot
● Quantile plot
● Quantile-Quantile (Q-Q) plot
● Scatter plot
● Positive and negative correlation
● Non-correlated data

• Quantile plot: each value xi is paired with fi, indicating that approximately fi × 100% of the data are ≤ xi.
• Quantile-quantile (Q-Q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
• Bar chart: x-axis presents values, y-axis frequency/count
• Histogram: x-axis values, y-axis frequency/density
• Scatter plot: graphs bivariate or trivariate data as points in a 2-D or 3-D space
Box plot

• Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
• Boxplot
○ Data is represented with a box.
○ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
○ The median is marked by a line within the box.
○ Whiskers: two lines outside the box that extend to the smallest and largest observations lying within 1.5 × IQR of the quartiles.
○ Outliers: points beyond a specified outlier threshold, plotted individually.
Quantile plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual
occurrences).
• Plots quantile information of a univariate distribution
○ Data are sorted in increasing order. For a value xi, fi indicates that approximately fi × 100% of the data are below or equal to the value xi.
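The pairs (fi, xi) behind a quantile plot can be sketched as below. The slides do not state how fi is computed; this example assumes the common convention fi = (i − 0.5) / n for the i-th smallest value, and the function name `quantile_plot_pairs` is hypothetical.

```python
def quantile_plot_pairs(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (an assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical observations; each pair is (f_i, x_i).
pairs = quantile_plot_pairs([40, 10, 30, 20])
for f, x in pairs:
    print(f, x)
```

Plotting fi on the x-axis against xi on the y-axis gives the quantile plot; every observation appears, so unusual values remain visible.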
Quantile-Quantile (Q-Q) plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another, as points (xi, yi).
• Shows whether two variables follow the same distribution.
• The example shows the unit price of items sold at branch 1 vs. branch 2 for each quantile. Unit prices of items sold at branch 1 tend to be lower than those at branch 2.
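A Q-Q plot pairs matching quantiles of two samples. The following sketch computes those pairs with `statistics.quantiles`; the branch prices are made-up numbers rather than the data behind the slide's figure, and the helper name `qq_pairs` is an assumption.

```python
import statistics


def qq_pairs(sample_x, sample_y, n_quantiles=4):
    """Return (x-quantile, y-quantile) pairs taken at the same cut points."""
    qx = statistics.quantiles(sample_x, n=n_quantiles, method="inclusive")
    qy = statistics.quantiles(sample_y, n=n_quantiles, method="inclusive")
    return list(zip(qx, qy))


# Hypothetical unit prices at two branches.
branch1 = [40, 45, 52, 60, 68, 75, 80, 95, 110]
branch2 = [44, 50, 58, 66, 74, 82, 90, 104, 120]

pairs = qq_pairs(branch1, branch2)
for x_q, y_q in pairs:
    print(x_q, y_q)
```

If the two samples followed the same distribution, the pairs would lie close to the line y = x. Here every branch 2 quantile is larger than the matching branch 1 quantile, which corresponds to points above that line.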
Bar Chart vs Histogram
• Histogram: shows the distribution of a given variable by depicting the frequencies/density of observations occurring in certain ranges of values.
• Differs from a bar chart:
Attribute | Histogram | Bar chart
Variable type | Continuous variables | Discrete variables
Gap between bars | Adjacent bars | Space-separated bars
Bar width | Could have varied width | Equal width
Bar order | Matters | Does not matter

• The two histograms shown may have the same boxplot representation.
• They share the same values for min, Q1, median, Q3, and max, yet have rather different data distributions.
Scatter plot
• Present bivariate numerical data to see clusters of points, outliers, correlations etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

• The left half fragment is positively linearly correlated.
• The right half is negatively linearly correlated.
Summary

Here is a quick recap:

• We covered various charts and plots, such as the box plot, histogram, quantile plot, Q-Q plot, and scatter plot.
• We learned that a positive correlation describes a relationship between two variables that change in the same direction, and a negative correlation describes a relationship between two variables that change in opposite directions.
Agenda

In this session, we will learn about:

• Hypothesis testing
• One-tailed and two-tailed tests
• Steps to perform hypothesis testing

• Hypothesis testing is a scientific process of testing whether a hypothesis is plausible or not.
• A test includes a null hypothesis and an alternative hypothesis, for example:
Null hypothesis: x̄ = μ; alternative hypothesis: x̄ > μ
Null hypothesis: variable A and variable B are independent; alternative hypothesis: they are correlated
• The goal is to determine which hypothesis is likely to be true at a given confidence level.
Hypothesis testing: comparison of means
Comparison of means | Degrees of freedom | Application | Assumptions | Test statistic
One-sample Z test | Not applicable | Testing the difference between a sample mean, x̄, and a known population mean, μ | Normal distribution, known population σ | $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
One-sample t test | n − 1 | Testing the difference between one sample mean, x̄, and a given mean, μ | Normal distribution, population standard deviation σ is unknown | $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
Two-sample t test | n1 + n2 − 2 | Testing the difference of two sample means when the population variances are unknown but considered equal | Normal distribution | $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

[Figure: rejection area]
Hypothesis testing
Comparison of means | Degrees of freedom | Application | Assumptions | Test statistic
Paired t test | n − 1 | Testing two sample means when their respective population standard deviations are unknown but considered equal, with data recorded in pairs and each pair having a difference, d | Normal distribution, two dependent samples, always a two-tailed test | $t = \frac{\bar{d}}{s_d / \sqrt{n}}$, where $s_d$ is the standard deviation of the differences
One-way ANOVA | k − 1 and N − k (k groups, N observations in total) | Testing the difference of three or more sample means | Normal distribution | $F = \frac{MS_{between}}{MS_{within}}$
Confidence level and significance level
• Significance level: the risk we are willing to take of rejecting the null hypothesis when it is actually true.
• Typical: 5% or 1%
• Confidence level = 1 − significance level
• Typical: 95% or 99%
[Figure: one-tailed rejection area P(Z > 1.645) at the 5% significance level, with an example where x̄ = 110, μ = 100, and Z = 2.5.]
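The slide's example (x̄ = 110, μ = 100, Z = 2.5, compared against the one-tailed critical value 1.645) can be reproduced with a short sketch. The slide does not give the population standard deviation or the sample size; the values σ = 20 and n = 25 below are assumptions chosen only because they yield Z = 2.5, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n, alpha=0.05):
    """Right-tailed one-sample Z test (H0: mean = pop_mean, H1: mean > pop_mean)."""
    z = (sample_mean - pop_mean) / (pop_std / sqrt(n))
    critical_value = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    p_value = 1 - NormalDist().cdf(z)                 # area to the right of z
    reject_null = z > critical_value
    return z, critical_value, p_value, reject_null


# pop_std=20 and n=25 are assumed values, not taken from the slide.
z, critical_value, p_value, reject_null = one_sample_z_test(110, 100, pop_std=20, n=25)
print(z, round(critical_value, 3), round(p_value, 4), reject_null)
```

Because Z = 2.5 falls in the rejection area beyond 1.645, the null hypothesis is rejected at the 5% significance level under these assumptions.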
One-tailed and two-tailed tests
● One-tailed test
1. Null hypothesis: x̄ = μ
2. Alternative hypothesis: x̄ > μ, where μ is the hypothesized mean
● Two-tailed test
1. Null hypothesis: x̄ = μ
2. Alternative hypothesis: x̄ ≠ μ, where μ is the hypothesized mean

• Steps in hypothesis testing:
1. State the hypotheses (null and alternative).
2. Identify the test statistic and its probability distribution.
3. Specify the significance level.
4. Collect the data and perform the calculations.
5. Make the statistical decision.
6. Make the business decision.
Summary

Here is a quick recap:

• We discussed hypothesis testing, a scientific process of testing whether or not a hypothesis is plausible.
• We understood that one-tailed tests allow the testing of an effect in one direction, while two-tailed tests allow the testing of an effect in two directions, positive and negative.
• We looked at various tests to check the null and alternative hypotheses, such as the one-sample Z test, the two-sample t test, etc.
• We looked at the steps to conduct a hypothesis test.
Agenda
In this session, we will learn about:
● Measuring data similarity and dissimilarity
● Proximity measures for nominal, ordinal, and binary attributes
● Proximity measures for numerical attributes and normalization
● Compute dissimilarity with mixed type variables
● Cosine similarity
Similarity and Dissimilarity
● Similarity
○ Numerical measure of how alike two data objects are.
○ Value is higher when objects are more alike.
○ Often falls in the range [0,1].
● Dissimilarity (e.g., distance)
○ Numerical measure of how different two data objects are.
○ Lower when objects are more alike.
○ Minimum dissimilarity is often 0.
○ Upper limit varies.
● Proximity may refer to either similarity or dissimilarity.

● Distance/similarity matrix
○ n data points, but registers only the distance/similarity
○ Is often a symmetric matrix
○ Single mode: (dis)similarity
Proximity measure for nominal attributes
Nominal attributes can take two or more states/values, e.g., color can be red, yellow, blue, green, etc.
(generalization of a binary attribute).
Method 1: Simple matching

Observations ( cat1 and cat2 ) are described by nominal values of color, size, sleep time.

Objects | Color | Size | Sleep time
cat1 | yellow | small | <5 hours
cat2 | yellow | medium | 5-8 hours

d(i, j): distance between i and j
$d(i, j) = \frac{p - m}{p}$
m: number of attributes with the same values
p: total number of attributes
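Simple matching can be sketched in a few lines of Python using the cat1 and cat2 objects from the table, with d(i, j) = (p − m) / p, where m is the number of attributes with the same values and p is the total number of attributes. The dictionary keys and the function name are naming choices made for this illustration.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by the same nominal attributes."""
    p = len(obj_i)                                                 # total number of attributes
    m = sum(1 for name in obj_i if obj_i[name] == obj_j[name])     # attributes with equal values
    return (p - m) / p


cat1 = {"color": "yellow", "size": "small", "sleep_time": "<5 hours"}
cat2 = {"color": "yellow", "size": "medium", "sleep_time": "5-8 hours"}

distance = simple_matching_distance(cat1, cat2)
print(round(distance, 2))
```

The two cats match on color only, so m = 1 and p = 3, giving a distance of 2/3.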
Proximity measure for binary attributes
Method 2:
● In this method, nominal attributes are converted to binary attributes.
● Creating a new binary attribute for each of the nominal states is called “one-hot encoding”; it yields a binary attribute table as shown below.
● A proximity measure for binary attributes is then used to measure the similarity between the objects.

Objects | color-yellow | color-… | size-small | size-medium | sleep time <5 hours | sleep time 5-8 hours
cat1 | 1 | 0 | 1 | 0 | 1 | 0
cat2 | 1 | 0 | 0 | 1 | 0 | 1

The number of attributes having the same/different values for two observations (e.g., cat1 and cat2 in the previous table) is counted from the binary attribute table, forming the contingency table shown below:

 | Object j = 1 | Object j = 0 | sum
Object i = 1 | q | r | q + r
Object i = 0 | s | t | s + t
sum | q + s | r + t | p

q is the number of attributes for which both objects have the value 1
r is the number of attributes for which object i has the value 1 and object j has the value 0
s is the number of attributes for which object i has the value 0 and object j has the value 1
t is the number of attributes for which both objects have the value 0
p is the total number of attributes
Proximity measure for binary attributes
Using the contingency table created previously, the distance and similarity measures are calculated as follows:
Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M | Y | N | P | N | N | N
Mary | F | Y | N | P | N | P | N
Jim | M | Y | P | N | N | N | N

● Gender is a nominal attribute.
● The remaining attributes are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.
● The distance is computed considering only the asymmetric attributes.
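The patient table can be turned into binary vectors (Y and P as 1, N as 0, gender left out) and compared with the binary proximity measures: the symmetric distance (r + s) / (q + r + s + t), the asymmetric distance (r + s) / (q + r + s), and the Jaccard similarity q / (q + r + s). This is a minimal sketch; the function names are assumptions, and the asymmetric measures assume that at least one of the two objects has a 1 somewhere.

```python
def contingency_counts(vec_i, vec_j):
    """Return (q, r, s, t) for two equal-length binary vectors."""
    q = sum(1 for a, b in zip(vec_i, vec_j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(vec_i, vec_j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(vec_i, vec_j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(vec_i, vec_j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_binary_distance(vec_i, vec_j):
    q, r, s, t = contingency_counts(vec_i, vec_j)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_distance(vec_i, vec_j):
    # Negative matches (t) are ignored; assumes q + r + s > 0.
    q, r, s, _ = contingency_counts(vec_i, vec_j)
    return (r + s) / (q + r + s)


def jaccard_similarity(vec_i, vec_j):
    q, r, s, _ = contingency_counts(vec_i, vec_j)
    return q / (q + r + s)


# Columns: Fever, Cough, Test-1, Test-2, Test-3, Test-4 (as given in the table).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

d_jack_mary = asymmetric_binary_distance(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_distance(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_distance(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under this measure Jack and Mary are the closest pair, while Jim and Mary are the most dissimilar. For any pair, the Jaccard similarity equals 1 minus the asymmetric distance.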

Minkowski distance
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
● Properties
○ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
○ d(i, j) = d(j, i) (symmetry)
○ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
● A distance that satisfies these properties is a metric.
Special cases of Minkowski distance
● h = 1: Manhattan (city block, L1 norm) distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
○ E.g., the Hamming distance: the number of bits that are different between two binary vectors
● h = 2: Euclidean (L2 norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
Special cases of Minkowski distance (Contd.)
● Chebyshev distance:
○ When h → ∞: “supremum” (Lmax norm, L∞ norm) distance.
■ This is the maximum difference between any component (attribute) of the vectors: $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
○ When h → −∞:
■ This is the minimum difference between any component (attribute) of the vectors.
[Figure: Manhattan (L1), Euclidean (L2), and supremum distance examples.]
Normalization of numerical values
● Values measured on different scales cannot be compared directly.
Student A: SAT = 1800
Student B: ACT = 24
Which student performed better relative to other test-takers?
● Normalization is widely used with multi-dimensional datasets involving different scales: clustering, multidimensional scaling, principal component analysis, etc.

                     SAT    ACT
Mean                 1500   21
Standard deviation   300    5

Standardizing numeric data
● Z-score: z = (x - μ) / σ
○ x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
○ the distance between the raw score and the population mean in units of the standard deviation
○ negative when the raw score is below the mean, positive when above

● An alternative way: calculate the mean absolute deviation (MAD),
   sf = (1/n)(|x1f - mf| + |x2f - mf| + … + |xnf - mf|)
   where
   mf = (1/n)(x1f + x2f + … + xnf)
   Standardized measure:
   zif = (xif - mf) / sf

● MAD is more robust to outliers than the standard deviation because, in the former, the deviations from the mean are not squared.
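
As a worked illustration, the sketch below applies the z-score to the SAT/ACT question using the means and standard deviations from the table (SAT: 1500 and 300, ACT: 21 and 5). The MAD-based helper and its sample values are assumptions added for illustration only.

```python
def z_score(x, mean, std):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mean) / std


# Student A: SAT = 1800, Student B: ACT = 24.
z_a = z_score(1800, 1500, 300)
z_b = z_score(24, 21, 5)


def mad_standardize(values):
    """Standardize with the mean absolute deviation instead of the standard deviation."""
    n = len(values)
    m = sum(values) / n                       # mean of the attribute
    s = sum(abs(v - m) for v in values) / n   # mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical; MAD is zero")
    return [(v - m) / s for v in values]


# Hypothetical attribute values, not taken from the slides.
sample = [2, 4, 4, 6]
sample_standardized = mad_standardize(sample)

print(z_a, z_b, sample_standardized)
```

Student A is 1.0 standard deviations above the SAT mean and Student B is 0.6 standard deviations above the ACT mean, so Student A performed better relative to other test-takers.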
Proximity measure for ordinal variables
● Order is important, e.g., rank
Math grade: A, B, C, D, E.

● Map ordinal values to values between 0 and 1 (to interval-scaled)

1. Replace each xif by its rank rif ∈ {1, …, Mf}, where Mf is the number of ordered states of variable f
2. Map the range of each variable onto [0, 1] by replacing the i-th value in the f-th variable by
   zif = (rif - 1) / (Mf - 1)
3. Compute the dissimilarity using methods for numerical variables
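
The three steps can be sketched for the math-grade example. The ordering from E (lowest) to A (highest) is an assumption made here, as are the helper and variable names.

```python
# Assumption: grades ordered from lowest (rank 1) to highest (rank 5).
GRADE_ORDER = ["E", "D", "C", "B", "A"]


def ordinal_to_unit_interval(value, ordered_states):
    """Steps 1 and 2: rank the value, then map the rank onto [0, 1]."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1   # step 1: rank in 1..M
    return (rank - 1) / (m - 1)              # step 2: z = (r - 1) / (M - 1)


z = {grade: ordinal_to_unit_interval(grade, GRADE_ORDER) for grade in GRADE_ORDER}

# Step 3: treat the mapped values as numeric, e.g. absolute difference.
d_a_c = abs(z["A"] - z["C"])
d_a_e = abs(z["A"] - z["E"])

print(z, d_a_c, d_a_e)
```

With five states the mapped values are evenly spaced at 0, 0.25, 0.5, 0.75, and 1, so adjacent grades are always 0.25 apart.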

Dissimilarity between objects with mixed-type attributes
● A database may contain all attribute types.
○ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
● One may use a weighted formula to combine their effects.
● Distance between objects i and j over all features/attributes f:

   d(i, j) = Σf δij(f) dij(f) / Σf δij(f)

The indicator δij(f) is
• 0 if f is missing for either object i or j, or if f for i and j are both 0 and f is an asymmetric binary attribute
• 1 otherwise

The contribution dij(f) depends on the type of f:
• f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal: convert to numeric
Compute dissimilarity with mixed variables

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

● Gender is a nominal attribute; others are asymmetric binary.
● Let the values Y and P be 1, and the value N be 0.

Considering all attributes:

d(Jack, Mary) = (1×1 + 0×1 + 0×0 + 0×1 + 0×0 + 1×1 + 0×0) / (1 + 1 + 1 + 1) = 2/4 = 0.5

Each term in the numerator is d × δ for one attribute; the denominator sums the δ values:

Gender: nominal: d=1, δ=1
Fever: asym.: d=0, δ=1
Cough: asym., both 0: d=0, δ=0
Test-1: asym.: d=0, δ=1
Test-2: asym., both 0: d=0, δ=0
Test-3: asym.: d=1, δ=1
Test-4: asym., both 0: d=0, δ=0
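
A minimal sketch of the same calculation for all three pairs, covering only the two attribute types that occur in this table (nominal and asymmetric binary). Numeric and ordinal attributes are not handled here. The data layout and function name are assumptions for illustration.

```python
# (attribute name, attribute type) in table order.
ATTRIBUTES = [
    ("Gender", "nominal"),
    ("Fever", "asymmetric_binary"),
    ("Cough", "asymmetric_binary"),
    ("Test-1", "asymmetric_binary"),
    ("Test-2", "asymmetric_binary"),
    ("Test-3", "asymmetric_binary"),
    ("Test-4", "asymmetric_binary"),
]

RECORDS = {
    "Jack": ["M", "Y", "N", "P", "N", "N", "N"],
    "Mary": ["F", "Y", "N", "P", "N", "P", "N"],
    "Jim":  ["M", "Y", "P", "N", "N", "N", "N"],
}

POSITIVE = {"Y", "P"}  # encoded as 1; N is encoded as 0


def mixed_dissimilarity(a, b):
    """Sum of delta * d over attributes, divided by the sum of delta."""
    numerator, denominator = 0, 0
    for (_, kind), xa, xb in zip(ATTRIBUTES, a, b):
        if xa is None or xb is None:
            continue  # delta = 0: value missing for either object
        if kind == "asymmetric_binary":
            va, vb = int(xa in POSITIVE), int(xb in POSITIVE)
            if va == 0 and vb == 0:
                continue  # delta = 0: both 0 on an asymmetric binary attribute
            d = 0 if va == vb else 1
        else:  # nominal
            d = 0 if xa == xb else 1
        numerator += d      # delta = 1 for every attribute that reaches here
        denominator += 1
    return numerator / denominator if denominator else 0.0


d_jack_mary = mixed_dissimilarity(RECORDS["Jack"], RECORDS["Mary"])
d_jack_jim = mixed_dissimilarity(RECORDS["Jack"], RECORDS["Jim"])
d_mary_jim = mixed_dissimilarity(RECORDS["Mary"], RECORDS["Jim"])

print(d_jack_mary, d_jack_jim, d_mary_jim)
```

The first value reproduces d(Jack, Mary) = 0.5 from the worked example; the other two pairs follow from applying the same rules to the table.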
Cosine similarity: similarity between vectors of numerical values
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.

Document   Team  Coach  Hockey  Baseball  Soccer  Penalty  Score  Win  Loss  Season
Document1  5     0      3       0         2       0        0      2    0     0
Document2  3     0      2       0         1       1        0      1    0     1
Document3  0     7      0       2         1       0        0      3    0     0
Document4  0     1      0       0         1       2        2      0    3     0

• The angle between any two vectors (documents) can be used as a measure of the similarity between
the two documents:

• Cosine similarity is in [0, 1] for vectors with non-negative entries, such as term-frequency vectors (in general it ranges over [-1, 1]). If d1 and d2 are two vectors, then

cos(d1, d2) = (d1 ∙ d2) / (||d1|| × ||d2||),

where ∙ indicates the vector dot product and ||d|| is the length of vector d.

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 ∙ d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) = 0.94
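
The calculation can be checked with a short script. This is a plain-Python sketch with a hypothetical function name; a numerical library would normally be used for long vectors.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Term-frequency vectors of Document1 and Document2 from the table.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```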


Here is a quick recap:

● We learned how to measure data similarity and dissimilarity.
● We looked into distance measures, such as the Minkowski distance, and into the standardization of numeric data.
● We learned proximity measures for nominal, ordinal, and binary attributes.
● We also learned how to compute dissimilarity with mixed variables.
● We discussed cosine similarity with an example.
Learning Outcomes
Coming to the end of this module, you should now be able to:
• Differentiate between the types of attributes: nominal, binary, ordinal, interval-scaled, and ratio-scaled.
• Evaluate basic descriptive statistics of a dataset: central tendency and dispersion.
• Illustrate and interpret graphic plots that display descriptive statistics.
• Summarize statistical hypothesis testing.
• Evaluate object similarity and dissimilarity in mixed-type datasets.
• Summarize cosine similarity using an example.