Post Graduate Program in Data Analytics
Data Science: Participant Manual
Contents
Design and Analysis of Experiment
    Data and Data Collection
    Data Collection Techniques
    What is Design of Experiment?
    DoE: Some Terminology
    DoE: Examples of Experiments from Daily Life
    Design of Experiment: Cake Baking
    Guidelines for Designing Experiments
    Baking the Cake: Steps in Design of Experiments
    DoE for Cake Baking: Factor Levels
    Strategy of Experimentation
    History of Design of Experiment
    Four Eras of DOE
    Some Major Players in DOE
    Why Design of Experiment?
    Building Blocks of DoE
    One-factor-at-a-time Experiments (OFAT)
    Factorial Designs
    Factorial Designs with Several Factors
    Factorial v/s OFAT
    Example: Effect of Re and k/D on friction factor f
    Central Composite Design
    Randomized Design and ANOVA
    Some Terminology
    ANOVA
    Some Useful Quantities
    How to Compare Means of the Sample?
    Null Hypothesis
    Results of ANOVA
    Inferences from ANOVA
Probability Theory
    Brief Overview: Population
    Sample
    Sample Space
    Types of Data
    Random Variables
If the factors can be controlled, the data are experimental. The values, or levels, of a factor (or combination of factors) are called treatments. The purpose of most experiments is to compare and estimate the effects of the different treatments on the response variable. Experiments can be designed in many different ways to collect this information.
Strategy of Experimentation
1 Best guess approach (trial and error)
Can continue indefinitely
Cannot guarantee best solution has been found
These are full factorial designs. Factorial designs work well when the number of variables is small; consider, by contrast, a problem with 100 variables and 20 outputs, typical of "simplified" jet engine models.
Knowledge of statistics applicable:
Samples, mean, variances
Equivalence of means and variance of samples
The focus is on three very useful and important classes of experimental designs:
OFAT
Factorial design
Randomized design
Randomization:
“Average out” effects of extraneous factors
Reduce bias and systematic errors
OFAT experiments are often presumed to provide the optimum combination of the factor levels, but this presumption can be shown to be false except under very special circumstances. The key reasons why OFAT should not be conducted except under very special circumstances are:
They do not provide adequate information on interactions
They do not provide efficient estimates of the effects
Factorial                                     | OFAT
2 factors: 4 runs (3 effects)                 | 2 factors: 6 runs (2 effects)
3 factors: 8 runs (7 effects)                 | 3 factors: 16 runs (3 effects)
5 factors: 32 or 16 runs (31 or 15 effects)   | 5 factors: 96 runs (5 effects)
7 factors: 128 or 64 runs (127 or 63 effects) | 7 factors: 512 runs (7 effects)
Responses: (1) = 0.0311, b = 0.0327, ab = 0.0200
The presence of interactions implies that one cannot satisfactorily describe the effects of
each factor using main effects.
[Figure: Design-Expert output for the friction factor example — an interaction plot of Log10(f) versus A: Re at the low (B−) and high (B+) levels of B: k/D, a contour plot of Log10(f) over A: Re and B: k/D, and a predicted-versus-actual plot of Log10(f).]
Some Terminology
In order to collect data in an experiment, the different treatments are assigned to objects
(people, cars, animals, or the like) that are called experimental units. For example: Super
ANOVA
In addition to graphical analysis, Analysis of Variance (ANOVA) is a tool to analyse data. It is used to study the influence of each individual parameter (principal variable) and the effect of two parameters taken at a time (interaction).
Sample Means
To measure the within-treatment variability, the quantity defined is Error Sum of Squares.
If there were no variability within each sample, the error sum of squares would be equal to 0.
The more the values within the samples vary, the larger will be SSE.
The variability in the observed values of the response must come from one of two sources:
1. The between-treatment variability
2. The within-treatment variability
The sum of squares that measures the total amount of variability in the observed values of the response is the total sum of squares (SSTO). It follows that the total sum of squares equals the sum of the treatment sum of squares and the error sum of squares; the treatment sum of squares (SST) and the error sum of squares (SSE) are said to partition the total sum of squares.
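As a quick illustration of this partition, here is a minimal Python sketch (with made-up sample data, not taken from the manual) that computes SSTO, SST and SSE directly and confirms that SSTO = SST + SSE:

    import numpy as np

    # Hypothetical data: three treatments with four observations each (not from the manual)
    samples = [np.array([12.0, 14.0, 13.0, 15.0]),
               np.array([18.0, 17.0, 19.0, 20.0]),
               np.array([11.0, 10.0, 12.0, 11.0])]

    all_obs = np.concatenate(samples)
    grand_mean = all_obs.mean()

    # Total sum of squares: variability of every observation around the grand mean
    ssto = ((all_obs - grand_mean) ** 2).sum()

    # Treatment (between-treatment) sum of squares
    sst = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)

    # Error (within-treatment) sum of squares
    sse = sum(((s - s.mean()) ** 2).sum() for s in samples)

    print(ssto, sst + sse)  # the two numbers agree: SSTO = SST + SSE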
Null Hypothesis
The null hypothesis H0 states that no interaction exists between factors 1 and 2, versus the alternative hypothesis Ha that interaction does exist. Reject H0 in favour of Ha at level of significance α if
F_int = MS(interaction) / MSE
is greater than the F_α value based on (a − 1)(b − 1) numerator and ab(m − 1) denominator degrees of freedom.
Step 2: Calculate SS(1), which measures the amount of variability due to the different
levels of factor 1:
Step 3: Calculate SS(2), which measures the amount of variability due to the different
levels of factor 2:
Step 4: Calculate SS(interaction), which measures the amount of variability due to the
interaction between factors 1 and 2:
Step 5: Calculate SSE, which measures the amount of variability due to the error:
Conclusion: Little or no interaction exists between shelf display height and shelf display
width.
Little or no interaction exists between factors 1 and 2. These can be (separately) tested for
their significance – testing the significance of the main effects.
For F(1): Reject H_0 at the .05 level of significance. There is strong evidence that at least
two of the bottom, middle, and top display heights have different effects on mean monthly
demand.
For F(2): Cannot reject 𝐻_0 at the .05 level of significance. No strong evidence that the
regular display width and the wide display have different effects on mean monthly demand.
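The two-way ANOVA table behind these F tests can be reproduced in Python. The sketch below uses statsmodels on hypothetical shelf-display data (the column names and demand values are illustrative assumptions, not the manual's dataset); the anova_lm table reports the sums of squares for each factor, their interaction and the error, together with the F statistics:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical demand data: two replicates for each height x width combination
    data = pd.DataFrame({
        "height": ["bottom"] * 4 + ["middle"] * 4 + ["top"] * 4,
        "width":  ["regular", "regular", "wide", "wide"] * 3,
        "demand": [58, 60, 62, 61, 73, 75, 76, 78, 52, 54, 55, 53],
    })

    # Two-way ANOVA with interaction: the table lists the sums of squares for
    # factor 1 (height), factor 2 (width), their interaction, and the residual (SSE),
    # together with the corresponding F statistics and p-values
    model = smf.ols("demand ~ C(height) * C(width)", data=data).fit()
    print(sm.stats.anova_lm(model, typ=2))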
Probability Theory
Brief Overview: Population
We often have questions concerning large populations. If we want to know the average weight of all 20-year-olds in India, then the population is all individuals who are 20 years old and living in India. If we want to know the proportion of middle-aged men who do not have a heart attack after taking a certain drug, then the population is the set of all middle-aged men.
A population is the entire set of possible observations in which we are interested. Gathering information
from the entire population is not always possible due to barriers such as time, accessibility,
or cost. Usually populations are so large that a researcher cannot examine the entire
population. Therefore, a subset of the population is selected to represent the population in a
research study.
Sample
A sample is a subset of the population from which inferences are drawn about the population. Sample data provide only limited information about the population. As a result, sample statistics are generally imperfect representatives of the corresponding population parameters.
The goal is to use the results obtained from the sample to help answer questions about the
population.
Sample Space
The Sample Space is the set of all possible outcomes of an experiment. An Event is a subset of the sample space, that is, a set of basic outcomes.
Probability of event A:
P(A) = n(A) / n(S)
where n(A) = the number of elements in the event A and n(S) = the number of elements in the sample space S.
A sample space S (also written Ω) is the set of all possible outcomes of a (conceptual or physical) random experiment; it can be finite or infinite.
Examples:
1. S may be the set of all possible outcomes of a die roll: S = {1, 2, 3, 4, 5, 6}
2. Number of hours people sleep: S = {h : h ≥ 0 hours}
3. Temperature recorded in Mumbai for the last 10 years: S = {T : T ∈ [5°C, 41°C]}
4. Do you brush teeth everyday? S = {yes, no}
Types of Data
Quantitative data are called Discrete if the sample space contains a finite or countably infinite number of values, and Continuous if the sample space contains an interval or continuous span of real numbers.
Random Variables
A random variable assigns a numerical value to each possible outcome of an experiment. A random variable can be discrete or continuous. A discrete random variable can assume at most a countable number of values. A continuous random variable is one which can take an infinite number of possible values; continuous random variables are usually measurements.
Example: In a sample of 100 Mumbai citizens we record the proportion of citizens with a car or two-wheeler. If the actual population proportion is 60%, how likely is it that we would get a sample proportion of 0.69?
Events
Event A is a subset of the sample space S: A ⊆ S.
The Rule of Union: The probability of the union of two events in terms of probability of two
events and the probability of their intersection
Probability of the union of A & B is determined by adding the probability of the events A & B
and then subtracting the probability of the intersection of the events A & B.
The symbol ∩ (intersection) means that A and B both occur simultaneously.
Independent Events
The probability that one event occurs in no way affects the probability of the other event occurring. If A and B are independent events, then the probability of both occurring is P(A ∩ B) = P(A) × P(B).
The probability of getting any number on the die in no way influences the probability of getting a head or tail on the coin.
When you flip a fair coin, you either get a head or a tail but not both. We can show that these events are mutually exclusive by adding their probabilities:
Dependent Events
The outcome of one event affects the outcome of the other. If A and B are dependent events, then the probability of both occurring is P(A ∩ B) = P(A) × P(B|A).
Suppose we have 5 blue marbles and 5 red marbles in a bag. We pull out one marble, which may be blue or red. What is the probability that the second marble will be red? If the first marble was red, then the bag is left with 4 red marbles out of 9, so the probability of drawing a red marble on the second draw is 4/9. But if the first marble we pull out of the bag is blue, then there are still 5 red marbles in the bag, and the probability of pulling a red marble out of the bag is 5/9.
Two events A and B are said to be mutually non-exclusive events if both the events A and B have at least one common outcome between them. For example, when tossing two coins, let A be the event of getting at least one head and B the event of getting at least one tail. Both occur if I get HT or TH. They're also not identical: if I get HH, A occurs but B doesn't.
Algebra of Sets
Since events and sample spaces are just sets, let's review the algebra of sets:
"I think there is a 50% chance that the world's oil reserves will be depleted by the year
2100."
"I think there is a 1% chance that the men's basketball team will end up in the Final Four
sometime this decade."
Frequency Approach
What is the probability that you will get Heads when you toss a coin? Estimate it by the relative frequency of heads over many repeated tosses.
Classical Approach
As long as the outcomes in the sample space are equally likely, the probability of event A is:
P(A) = N(A) / N(S)
where N(A) is the number of outcomes in A and N(S) is the total number of outcomes in the sample space S.
Probability
It is a measurement of likelihood that a particular event will occur.
Tossing a Coin: When a coin is tossed, these are two possible outcomes: Head and Tail.
Throwing Dice: When a single die is thrown, there are 6 possible outcomes: 1, 2, 3, 4, 5, 6.
The probability of any one of them is 1/6.
Probability is a (real-valued) set function P that assigns to each event A in the sample
space S a number P(A), called the probability of the event A.
P(S) = 1: the event A = S, the entire sample space, is certain to occur.
Corollary: P(∅) = 0. When the event A contains no outcomes, A = ∅, its probability is zero.
For mutually exclusive events A1, A2, …:
P(⋃i Ai) = Σi P(Ai)
Events in A or B but not in both: (A ⋃ B) − (A ⋂ B)
Corollaries:
P(A ⋃ B) = P(A) + P(B) − P(A ⋂ B)
P(¬A) = 1 − P(A)
P(D) = 67/137
If a person has renal disease, what is the probability that he/she tests positive?
P(T+|D) = N(T+ ⋂ D) / N(D) = 44/67 = P(T+ ⋂ D) / P(D)
The conditional probability of an event A given that an event B has occurred is written:
P(A|B) and is calculated using P(A|B) = P(A ∩ B) / P(B), as long as P(B) > 0.
P(A | B) ≥ 0
P(B | B) = 1
P(A1 ∪ A2 ∪ ... ∪ Ak| B) = P(A1 | B) + P(A2 | B) + ... + P( Ak| B) and likewise for infinite unions.
Independent Events: Events A and B are independent events if the occurrence of one of
them does not affect the probability of the occurrence of the other.
P(B|A) = P(B), (provided that P(A) > 0) or P(A|B) = P(A), (provided that P(B) > 0)
Bayes Rule
Bayes rule is important for reverse conditioning:
P(A|B) = P(B|A) P(A) / P(B)
Example: Let A be the event that a person has the disease, with P(A) = 0.001, and let T be the event that the test is positive. The test always detects the disease, P(T|A) = 1, while for healthy people P(T|A^c) = 1% = 0.01. Then
P(A|T) = P(T|A) P(A) / P(T)
       = P(T|A) P(A) / P((T ∩ A) ∪ (T ∩ A^c))
       = P(T|A) P(A) / [P(T|A) P(A) + P(T|A^c) P(A^c)]
       = (1 × 0.001) / (1 × 0.001 + 0.01 × 0.999)
       ≈ 0.091
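A minimal Python check of this calculation, using the prevalence and error rates assumed in the example above:

    # Bayes rule: P(A|T) = P(T|A) P(A) / [P(T|A) P(A) + P(T|A^c) P(A^c)]
    p_A = 0.001           # prior probability of having the disease
    p_T_given_A = 1.0     # the test is positive whenever the disease is present
    p_T_given_Ac = 0.01   # false-positive rate for healthy people

    posterior = (p_T_given_A * p_A) / (p_T_given_A * p_A + p_T_given_Ac * (1 - p_A))
    print(round(posterior, 3))  # about 0.091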
Bayesian Learning
Random Variables
A random variable assigns a numerical value to each possible outcome of an experiment. A random variable can be discrete or continuous; a discrete random variable can assume at most a countable number of values. A real-valued random variable is a function of the outcome of the randomised experiment:
X: Ω → ℝ
Example: If you roll a die, the outcome is random (not fixed) and there are 6 possible
outcomes, each of which occur with probability one-sixth.
A probability distribution describes how frequently we expect different outcomes to occur if we repeat the experiment over and over. In the case of a discrete variable, the probability distribution is a real-valued function that maps the possible values of x to their respective probabilities of occurrence, p(x); for a continuous variable, probabilities are described by a density function.
Let p(win $1) be the probability of winning $1: p(win $1) = 0.4737, so p(lose $1) = 0.5263. The expected value per dollar bet is 0.4737 − 0.5263 = −0.0526, so on average bettors lose about 5 cents for each dollar they put down on a bet like this. (These are among the best bets for patrons.)
Depending upon the process of selecting a sample, the type of sample space and purpose of
sampling we get different discrete probability distributions:
Binomial: Yes/no outcomes (dead/alive, treated/untreated, sick/well). Samples are drawn with replacement.
Hypergeometric: Sampling without replacement
Poisson: Counts (e.g., how many cases of disease in a given area)
When a Bernoulli experiment is conducted n times, the total number of successes is binomially distributed with parameters n and p. The probability of success is p; the probability of failure is 1 − p. The trials are independent, which means that the outcome of one trial does not affect the outcome of any other trial.
Example:
Let us consider the probability of getting 1 when rolling a die 20 times. Here n = 20 and p = 1/6: success is rolling a one and failure is getting any other number. On the other hand, if we consider rolling an even number, then p = 1/2.
q=1–p
n = number of trials
x = number of successes
𝑛
𝑛!
Note: ) = 𝑟!(𝑛−𝑟)!
(
𝑟
Example: Say 40% of the class is female. What is the probability that 6 of the first 10
students walking in will be female?
Variance 𝜎2 = 𝑛 𝑝 𝑞
Standard Deviation = 𝜎 = √𝑛 𝑝 𝑞
The actual probability of getting exactly 500 heads out of 1000 flips is just over 2.5%, but
the probability of getting between 484 and 516 heads (that is, within one standard
deviation of the mean) is about 68%.
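Both of these binomial statements are easy to verify with scipy; a short sketch using the numbers quoted above:

    from scipy.stats import binom

    # Classroom example: exactly 6 of the first 10 students are female, p = 0.4
    print(binom.pmf(6, n=10, p=0.4))          # ~0.111

    # Coin example: exactly 500 heads in 1000 flips
    print(binom.pmf(500, n=1000, p=0.5))      # ~0.025 (just over 2.5%)

    # ... and between 484 and 516 heads (about one standard deviation around the mean)
    print(binom.cdf(516, n=1000, p=0.5) - binom.cdf(483, n=1000, p=0.5))
    # ~0.70, close to the empirical-rule figure of about 68%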
Example: A deck of cards contains 20 cards: 6 red cards and 14 black cards. 5 cards are
drawn randomly without replacement. What is the probability that exactly 4 red cards are
drawn?
The hypergeometric random variable is the number of successes, x, drawn from the r
available in the n selections.
P(x) = [C(r, x) · C(N − r, n − x)] / C(N, n)
μ = n r / N
σ² = n r (N − r)(N − n) / [N² (N − 1)]
Where:
N = the total number of elements
r = number of successes in the N elements
n = number of elements drawn
X = the number of successes in the n elements
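The card-deck question above can be answered directly with scipy's hypergeometric distribution. In scipy's parameterization, M is the population size, n the number of successes in the population and N the number of draws:

    from scipy.stats import hypergeom

    M, n, N = 20, 6, 5   # 20 cards in total, 6 of them red, 5 cards drawn
    # Probability of drawing exactly 4 red cards without replacement
    print(hypergeom.pmf(4, M, n, N))   # ~0.0135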
Example: Suppose a customer at a pet store wants to buy two hamsters for his daughter, but he wants two males or two females (i.e., he wants to still have only two hamsters in a few months), so he wants two hamsters of the same sex. There are ten hamsters, five male and five female. What is the probability of drawing two of the same sex? (With hamsters, it's virtually a random selection.)
P(M = 2) = P(F = 2) = [C(5, 2) · C(10 − 5, 2 − 2)] / C(10, 2) = 10/45 ≈ 0.22
P(M = 2 ∪ F = 2) = P(M = 2) + P(F = 2) = 2 × 0.22 = 0.44
P(X = x) = λ^x e^(−λ) / x!, where x = 0, 1, 2, 3, …
Example: The number of errors in a new edition book is Poisson distributed with mean 1.5
per 100 pages and this varies from book to book. What is the probability that there are no
typographical errors in a randomly selected 100 pages of a new book?
P(x) = e^(−λ) λ^x / x!
P(0) = (2.71828)^(−1.5) × (1.5)^0 / 0! = 0.2231
λ = mean number of occurrences in the given unit of time, area, volume, etc.
e = 2.71828….
µ = λ
σ² = λ
P(x) = λ^x e^(−λ) / x!
Say in a given stream there are an average of 3 striped trout per 100 yards. What is the
probability of seeing 5 striped trout in the next 100 yards, assuming a Poisson distribution?
P(x = 5) = λ^x e^(−λ) / x! = 3^5 e^(−3) / 5! = 0.1008
How about in the next 50 yards, assuming a Poisson distribution? Since the distance is only
half as long, λ is only half as large.
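A quick numerical check of the trout example with scipy's Poisson distribution (λ = 3 for 100 yards, λ = 1.5 for 50 yards):

    from scipy.stats import poisson

    # Next 100 yards: lambda = 3 striped trout on average
    print(poisson.pmf(5, mu=3))     # ~0.1008

    # Next 50 yards: the distance is halved, so lambda = 1.5
    print(poisson.pmf(5, mu=1.5))   # ~0.0141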
(1) f(x) is positive everywhere in the support S, that is, f(x) > 0, for all x in S
(2) The area under the curve f(x) in the support S is 1, that is:
∫𝑆 𝑓(𝑥) 𝑑𝑥 = 1
(3) If f(x) is the p.d.f. of x, then the probability that x belongs to A, where A is some interval,
is given by the integral of f(x) over that interval, that is:
𝑃(𝑋 ∈ 𝐴) = ∫ 𝑓(𝑥) 𝑑𝑥
𝐴
The Cumulative Distribution Function ("c.d.f.") of a continuous random variable X is defined as:
F(x) = ∫ from −∞ to x of f(t) dt
Example: f(x) = 3x², 0 < x < 1. Then F(x) = ∫ from 0 to x of 3t² dt = [t³] from 0 to x = x³.
Uniform Distribution (minimum a, maximum b):
f(x) = 1 / (b − a) for a ≤ x ≤ b
μ = (b + a) / 2
σ² = (b − a)² / 12
No shape parameter.
Triangular Distribution
Parameters: minimum a, maximum b, most likely c
Symmetric or skewed in either direction
a is the location parameter
(b − a) is the scale parameter
c is the shape parameter
μ = (a + b + c) / 3
σ² = (a² + b² + c² − ab − bc − ac) / 18
Used as rough approximation of other distributions.
F(x) = 0                                  if x ≤ a
F(x) = (x − a)² / [(b − a)(c − a)]        if a < x ≤ c
F(x) = 1 − (b − x)² / [(b − a)(b − c)]    if c < x < b
F(x) = 1                                  if x ≥ b
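The piecewise CDF above can be coded directly. The sketch below uses illustrative parameter values (a = 2, b = 10, c = 4, chosen only for demonstration) and cross-checks the result against scipy's triangular distribution, whose shape parameter is the mode expressed as a fraction of the range:

    from scipy.stats import triang

    def triangular_cdf(x, a, b, c):
        """CDF of the triangular distribution with minimum a, maximum b, mode c."""
        if x <= a:
            return 0.0
        if x <= c:
            return (x - a) ** 2 / ((b - a) * (c - a))
        if x < b:
            return 1.0 - (b - x) ** 2 / ((b - a) * (b - c))
        return 1.0

    a, b, c = 2.0, 10.0, 4.0   # illustrative values only
    x = 6.0
    print(triangular_cdf(x, a, b, c))
    # scipy's parameterization: shape = (c - a)/(b - a), loc = a, scale = b - a
    print(triang.cdf(x, c=(c - a) / (b - a), loc=a, scale=b - a))  # same value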
Normal Distribution
Normal Distribution represents the distribution of many random variables as a symmetrical
bell-shaped graph.
Empirical Rule:
If a data set has an approximately bell-shaped relative frequency histogram, then, 68% of the
data will fall within one standard deviation of the mean, 95% of the data will fall within two
standard deviations of the mean. Almost all (99.7%) of the data will fall within three standard
deviations of the mean.
Example: The daily demand for gasoline at a gas station is normally distributed with mean of
1000 gallons and standard deviation of 100 gallons. There is exactly 1100 gallons of gasoline
in storage on a particular day. What is the probability that there will be enough gasoline to
satisfy the customer’s demand on that day?
Here, Demand for gasoline = X and we want to find the probability of X < 1100.
The variable X has been transformed to z, but this does not change the area. The value of z specifies the location of the corresponding value of X: z = (1100 − 1000)/100 = 1. Using the z table we find the probability P(Z < 1) = 0.8413.
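The same probability can be computed in one line with scipy (mean 1000 gallons, standard deviation 100 gallons):

    from scipy.stats import norm

    # P(demand < 1100) for X ~ N(1000, 100^2), which equals P(Z < 1)
    print(norm.cdf(1100, loc=1000, scale=100))   # 0.8413
    print(norm.cdf(1.0))                         # same answer via the z value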
Does each distribution need its own table? Infinitely many normal distributions would mean infinitely many tables to look up. Instead, any normal variable is standardized to Z, and somebody calculated all the integrals for the standard normal and put them in one table. Manual integration is no longer required, because computers now do all the integration.
What's the probability of getting a math SAT score of 575 or less, with μ = 500 and σ = 50?
68% of students will have scores between 450 and 550
95% will be between 400 and 600
99.7% will be between 350 and 650
What's the probability of getting a math SAT score of 575 or less, with μ = 500 and σ = 50?
Z = (575 − 500) / 50 = 1.5
Looking up Z = 1.5 in the standard normal chart (or entering it into SAS/R/Python) gives 0.9332; the area is 93.32%.
Exponential Distribution
Exponential distribution is used to model the time elapsed between two events.
Examples:
The length of time between telephone calls
The length of time between arrivals at a service station
The life time of electronic components
Mean = 1/λ
Example: The lifetime of a battery is exponentially distributed with λ = 0.5. What is the
probability that the battery will last for more than 20 hours?
What is the probability that the battery will last between 10 to 15 hours?
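A minimal Python sketch for this battery example (λ = 0.5 per hour; note that scipy parameterizes the exponential by scale = 1/λ):

    from scipy.stats import expon

    lam = 0.5
    battery = expon(scale=1 / lam)   # scipy uses scale = 1/lambda

    # P(lifetime > 20 hours)
    print(battery.sf(20))

    # P(10 < lifetime < 15 hours)
    print(battery.cdf(15) - battery.cdf(10))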
Models:
Time between customer arrivals to a service system
The time to failure of machines
What is the probability the next death occurs in less than one year, i.e. t = 1?
Introduction to Statistics
What is statistics?
Statistics is a branch of mathematics that transforms numbers into useful information for decision makers. It provides methods for processing and analyzing numbers, and for helping reduce the uncertainty inherent in decision making.
Why statistics?
Knowledge of Statistics allows you to make better sense of the ubiquitous use of numbers.
Types of Statistics
1. Descriptive Statistics: Collect, Organize, Characterize & Present Data. Descriptive statistics are methods for organizing and summarizing data. For example, tables or graphs are used to organize data, and descriptive values such as the average score are used to summarize data. A descriptive value for a population is called a parameter and a descriptive value for a sample is called a statistic.
2. Inferential Statistics: Methods that use sample data to make generalizations (inferences) about a population with a calculated degree of certainty.
Types of Variables
1. Discrete Variable
A variable that can take only a fixed (countable) number of values, such as class size; the values consist of indivisible categories.
E.g.: If you roll a die, you can get 1, 2, 3, 4, 5, or 6; you cannot get 1.2 or 0.1. If it is a fair die, the probability distribution will be 1/6, 1/6, 1/6, 1/6, 1/6, 1/6.
2. Continuous variable
A continuous distribution is appropriate when the variable can take on an infinite
number of values.
E.g.: Length, Mass, Height and Weight.
Distribution
The general term for any organized set of data. We organize data so we can see the pattern
they form. We need to know how many total scores were sampled (N). We are also concerned
with how often each different score occurs in the data. How often a score occurs is
symbolized f for frequency.
There are three methods for computing the median, depending on the distribution of
scores.
If you have an odd number of scores pick the middle score. 1 4 6 7 12 14 18. Median
is 7.
If you have an even number of scores, take the average of the middle two. 1 4 6 7 8
12 14 16. Median is (7+8)/2 = 7.5
If you have several scores with the same value in the middle of the distribution use
the formula for percentiles (not found in your book).
For a population: μ = ΣX / N
For a sample: X̄ = ΣX / n
Solution:
ΣX₁ can be computed by multiplying M₁ by the sample size (ΣX₁ = M₁ × n₁ = 18 × 23 = 414).
n_total = n₁ + n₂ = 23 + 34 = 57
Practical Example
Sales (in millions) of 10 randomly selected stores: 8, 12, 6, 16, 10, 20, 22, 25, 47, 55. Eight of the store sales are 25 or less, and two store sales are greater than 45.
Median = 18.0 (in millions)
Mean = 22.1 (in millions)
Which is more accurate regarding generalization to the ‘typical store Sales’? One that
includes:
Stores having less sales than Average Sales?
Outlier Store sales?
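The two measures quoted above can be reproduced in a couple of lines of Python using the sales figures from the example:

    import numpy as np

    sales = np.array([8, 12, 6, 16, 10, 20, 22, 25, 47, 55])   # in millions

    print(np.median(sales))   # 18.0 - not pulled up by the two outlier stores
    print(sales.mean())       # 22.1 - inflated by the 47 and 55 values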
Semi Interquartile-Range
A quartile is a division of a distribution of scores. The 1st quartile refers to the 25th
percentile, the 2nd quartile refers to the 50th percentile (or median) and the 3rd quartile
refers to the 75th percentile. Interquartile range refers to the distance between the 1st and
3rd quartile.
Mean Deviation
Mean Deviation is also known as average deviation. Here, the deviation is taken from an average, usually the Mean, Median or Mode. While taking deviations, we ignore the negative signs and treat all deviations as positive.
MD = Σ|d| / N (deviations taken from the mean)
MD = Σ|m| / N (deviations taken from the median)
MD = Σ|z| / N (deviations taken from the mode)
Standard Deviation
This is the most useful and most commonly used of the measures of variability. The standard
deviation looks to find the average distance that scores are away from the mean.
s = √[ Σ(X − X̄)² / (n − 1) ]
where Σ = sum (sigma) and X = score for each point in the data.
Standard Deviation = 32.01. The scores of students 12 and 14 are skewing this calculation (indirectly, through the mean).
Other Measures
There are other measures that are frequently used to analyze a collection of data:
Skewness
Kurtosis
Coefficient of Variation
Box Plot
Scattered Plot
Skewness
Skewness is the lack of symmetry of the data. For grouped data:
Kurtosis
Kurtosis provides information regarding the shape of the population distribution (the
peakedness or heaviness of the tails of a distribution). For grouped data:
Example, Stock B: average price last year = $100, standard deviation = $5, so CV = (5/100) × 100% = 5%.
Box Plot
A box-plot is a visual description of the distribution based on:
Minimum
Q1
Median
Q3
Maximum
A box plot is a graphical representation of statistical data based on the minimum, first
quartile, median, third quartile, and maximum. Mainly used to identify outliers.
A boxplot can show information about the distribution, variability, and center of a data set.
Consider the following 25 exam scores: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78,
79, 85, 87, 88, 89, 93, 95, 96, 98, 99, and 99. The five-number summary for these exam scores
is 43, 68, 77, 89, and 99, respectively.
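The five-number summary can be computed with numpy; a sketch for the exam scores above (with the default linear interpolation the percentiles match the hand-computed summary, although other quartile conventions can differ slightly):

    import numpy as np

    scores = np.array([43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
                       78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99])

    five_num = [scores.min(),
                np.percentile(scores, 25),   # Q1
                np.median(scores),
                np.percentile(scores, 75),   # Q3
                scores.max()]
    print(five_num)   # [43, 68.0, 77.0, 89.0, 99]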
Some statistical software adds asterisk signs (*) to show numbers in the data set that are
considered to be outliers — numbers determined to be far enough away from the rest of the
data to be noteworthy.
It’s easy to misinterpret a boxplot by thinking the bigger the box, the more data. Remember
each of the four sections shown in the boxplot contains an equal percentage (25%) of the
data. A bigger part of the box means there is more variability (a wider range of values) in
that part of the box, not more data. You can’t even tell how many data values are included in
a boxplot — it is totally built around percentages.
Scatter Plot
Displays the relationship between two continuous variables. Useful in the early stage of analysis when exploring data and determining whether a linear regression analysis is appropriate. May show outliers in your data.
A graph that contains plotted points that show the relationship between two variables. A
scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series
of dots. Each dot on the scatterplot represents one observation from a data set. The position
of the dot on the scatterplot represents its X and Y values.
Dispersion Measures
Outliers
An outlier is an observation which does not appear to belong with the other data. Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc. An outlier might be indicative of a sub-population; e.g., an abnormally low or high value in a medical test could indicate the presence of an illness in the patient.
Outliers are different from the noise data. Noise is random error or variance in a measured
variable. Noise should be removed before outlier detection.
Types of Outliers
1. Global Outlier: A data point is considered a global outlier if its value is far outside the
entirety of the data set in which it is found.
2. Contextual Outliers: A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context.
3. Collective Outliers: A subset of data points within a data set is considered anomalous
if those values as a collection deviate significantly from the entire data set.
Causes of Outliers
Data Entry Errors: Human errors such as errors caused during data collection,
recording or entry can cause outliers in data.
Measurement Error: When the measurement instrument used turns out to be faulty.
Intentional Outlier: Intentional Outlier is found in self-reported measures that
involves sensitive data.
Data Processing Error: When we extract data from multiple sources, it is possible that some manipulation or extraction errors may lead to outliers in the dataset.
Impact of Outliers
Outliers can drastically change the results of data analysis and statistical modeling. They increase the error variance and reduce the power of statistical tests. If the outliers are non-
randomly distributed, they can decrease normality. They can bias or influence estimates that
may be of substantive interest. They can also impact the basic assumption of Regression,
ANOVA and other statistical model assumptions.
Inferential Statistics
Hypothesis Testing
Hypothesis testing is a method of making an inference about a population parameter based on sample data. It is a statistical analysis used to determine whether the difference observed in samples is a true difference rather than a random occurrence.
Key Terms
Three key terms that you need to understand in Hypothesis Testing are:
Confidence Interval: Measure for reliability of an estimate; sample is used for
estimating a population parameter so we need to know the reliability of that
estimate
Degrees of Freedom: Number of values that are free to vary in a study
P-value: Probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming that the Null Hypothesis is true
Confidence Interval
A confidence interval describes the reliability of an estimate. It is a range of values (lower and upper boundary) within which the population parameter is included. The width of the interval indicates the uncertainty associated with the estimate. The confidence level is the probability associated with the confidence interval.
Degrees of Freedom
Degrees of Freedom is the measure of number of values in a study that are free to vary. For
example, if you have to take ten different courses to graduate, and only ten different courses
are offered, then you have nine degrees of freedom. In nine semesters, you will be able to
choose which class to take. In the tenth semester, there will only be one class left to take –
there is no choice.
P-Value
P-value is the probability of obtaining a test statistic at least as extreme as the one that was
actually observed, assuming that the Null Hypothesis is true. When P-value is less than a
certain significance level (often 0.05), you "reject the null hypothesis". This result indicates
that the observed result is not due to a random occurrence but a true difference.
Decision vs. True State (criminal trial analogy):
                      True State: Innocent    True State: Guilty
Decision: Acquit      Correct decision        Type II error
Decision: Convict     Type I error            Correct decision
Goodness of Fit
When you toss a coin, you have an equal probability of getting a head or a tail. So, if you
toss the coin 100 times, the expected distribution is: head = 50 & tail = 50. Take a scenario
where you get this result: head = 57 & tail = 43.
How do you know if the coin is biased or if you toss another 100 times, you will get the
expected distribution?
Goodness of Fit test helps to establish if the observed distribution fits the expected
distribution.
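A sketch of this Goodness of Fit test in Python, with the observed 57 heads and 43 tails tested against the expected 50/50 split:

    from scipy.stats import chisquare

    observed = [57, 43]
    expected = [50, 50]

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(stat, p_value)   # chi-square ~1.96, p ~0.16: no strong evidence the coin is biased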
Test of Independence
An ice cream vendor conducts a survey to capture the relation between ice cream flavor
preference and gender.
Based on the above data, how can the vendor establish the relation between gender and ice
cream flavor preference?
Determine the Degrees of Freedom of that statistic: Number of frequencies reduced by the
number of parameters of the fitted distribution. Compare the Chi-Square to the critical
value from the Chi-Square distribution.
The Chi-Square from the test is greater than the critical Chi-Square and the P-value is less than the significance level, so the Null Hypothesis is rejected. Sarah can conclude that having children has an association with student enrollment into part-time courses.
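For a Test of Independence, scipy's chi2_contingency works directly on a table of observed counts. The gender-by-flavor counts below are purely illustrative assumptions, since the manual's survey table is not reproduced in the text:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: gender (male, female); columns: preferred flavor (vanilla, chocolate, strawberry)
    observed = np.array([[30, 45, 25],
                         [35, 30, 35]])

    stat, p_value, dof, expected = chi2_contingency(observed)
    print(stat, p_value, dof)
    # If p_value < 0.05, reject the null hypothesis that gender and flavor preference are independent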
Normality Tests
Normality test establishes if the data approximately follows a normal distribution. Tests for
hypothesis are different for normal and non-normal data hence the need to first check the
distribution type.
Shapiro-Wilk’s W Test
Commonly used test for checking the normality of data.
H0: The sample x1, ..., xn came from a normally distributed population.
Reject Null Hypothesis if the “W” (test statistic) is below a predetermined threshold or reject
Null Hypothesis if the P-value is less than the alpha level.
Kolmogorov-Smirnov Test
It is a non-parametric test. Used to compare two samples, or, a sample with a given
distribution. While testing for normality, samples are standardized and compared with a
standard normal distribution. Less powerful for testing normality than the Shapiro–Wilk test
or Anderson–Darling test.
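Both normality tests are available in scipy; a short sketch on simulated data (illustrative only):

    import numpy as np
    from scipy.stats import shapiro, kstest

    rng = np.random.default_rng(0)
    x = rng.normal(loc=50, scale=5, size=100)   # simulated sample

    # Shapiro-Wilk: a small p-value means reject H0 (normality)
    w_stat, p_shapiro = shapiro(x)

    # Kolmogorov-Smirnov against the standard normal, after standardizing the sample
    z = (x - x.mean()) / x.std(ddof=1)
    ks_stat, p_ks = kstest(z, "norm")

    print(p_shapiro, p_ks)   # both large here, so normality is not rejected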
Two-Tailed Tests
A test where the region of rejection is on both sides of the sampling distribution.
Example: The speed limit on a freeway is 60–80 mph (the acceptable range of values). The region of rejection consists of numbers on both sides of the distribution, that is, both <60 and >80 are defects.
z-Test
Z-test is a statistical test where normal distribution is applied and is basically used for
dealing with problems relating to samples when n ≥ 30. The z measure is calculated as:
z = (X̄ − μ) / SE
where X̄ is the sample mean to be standardized, μ (mu) is the population mean and SE is the standard error of the mean:
SE = σ / √n
Example: Null Hypothesis = All sales managers are meeting the quarterly target of 1M.
Example:
Null Hypothesis: Production times in the plants in Pune and Xian are the same.
Null Hypothesis: Student scores are the same before and after the introduction of video-based learning.
ANOVA
ANOVA is used for comparing means of more than 2 samples.
Null Hypothesis = Means of all the samples are equal
Alternative hypothesis = Mean of at least one of the samples is different
Variance of all samples is assumed to be similar.
Example:
Null Hypothesis: Query response time same across all five query categories.
Null Hypothesis: No difference in student performance across the 6 modules of the
Analytics course.
Homogeneity of Variance
The F-test is used for testing whether the variances of two samples are similar. The test statistic F (ratio of two sample variances) = S²_X / S²_Y. The test statistic F follows an F-distribution with n − 1 and m − 1 Degrees of Freedom if the Null Hypothesis is true; otherwise it has a non-central F-distribution. The Null Hypothesis is rejected if F is either too large or too small. The F-test does not work on non-
normal distributions. Other tests for testing equality of two variances are Levene's test,
Bartlett's test, or the Brown-Forsythe test.
Non-Normality
Reasons for Non-Normality
Extreme Values
Overlap of Two or More Processes
Insufficient Data Discrimination
Sorted Data
Values close to zero or a natural limit
Data follows a Different Distribution
Sorted Data
If the data available is only a subset of the data produced by a process, i.e., data from within specific limits of the process, it might not follow a normal distribution.
Non-parametric Tests
Mood’s Median Test
Mood's Median Test is a non-parametric test that does not make assumptions about the distribution of the data. It tests the equality of medians. The Y variable is continuous, discrete-ordinal or discrete-count; the X variable is discrete with two or more attributes.
Example: Comparing the Medians of the monthly satisfaction ratings (Y) of six customers (X)
over the last two years. Comparing the Medians of the number of calls per week (Y) at a
service hotline separated by four different call types (X = complaint, technical question,
positive feedback, or product information) over the last six months.
Sign Test
Non-parametric equivalent of the One Sample T-test
Can also be used for paired data by calculating the difference between the two
samples
Compares median of sample to median of population
Y variable is continuous or discrete-ordinal or discrete-count
Looks at the number of observations above and below the median.
Regression analysis identifies the relationship between an outcome variable and one or
more explanatory or input variables in the form of an equation. Correlation test helps to
establish association or relation between two continuous variables and Regression analysis
provides the magnitude of the relation.
Regression is used in estimating the magnitude of the relation between variables. Relation is
defined as Variable1= function (Variable2). Based on this relation, value of one variable
corresponding to a particular value of the other variable can be estimated.
Correlation does not imply causation, but a relation. Correlation can also be coincidental.
Examples:
Analysis of Student Grades in Mathematics and English: Use Correlation to determine if the
students who are good at Mathematics tend to be equally good at English. Use Regression to
determine whether the marks in English can be predicted for given marks in Mathematics.
Analysis of Home Runs and Batting: Use Correlation to determine the relationship between
the number of home runs that a major league baseball team hits and its team batting average.
Use Regression to determine the number of home runs for a given batting average.
Correlation Coefficient
Correlation Coefficient (also called Pearson Correlation Coefficient) is a measure of strength
and direction of a linear relation between two variables. Correlation Coefficient r or R is
defined as covariance of variables divided by product of Standard Deviations of the variables.
Examples: A positive correlation between height of a child and age: As the child grows his or
her height increases almost linearly. A negative correlation between temperature and time
babies take to crawl: Babies take longer to learn to crawl in cold months (when they are
bundled in clothes that restrict their movement), than in warmer months.
Regression
Regression analysis finds the “line of best fit” for one response variable (continuous) based
on one or more explanatory variables. Statistical methods to assess the Goodness of Fit of
the model. Regression analysis for two variables X and Y estimates the relation as Y=f(X). If
it is a linear relation, the relation should be defined as Y = a + bx (simple linear regression)
Example: Age and cholesterol level are correlated. The regression equation based on
sample was estimated as: Cholesterol level= 156.3+0.65*Age.
Descriptive Statistics helps in identifying potential data problems such as errors, outliers,
and extreme values, identifying process issues and selection of appropriate statistical test
for understanding the underlying relationships/patterns.
Deal #10 is of significantly higher value than all the other deals and impacts the average
calculation
Central Tendency
A measure of Central Tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. In other words, the Central Tendency
computes the “center” around which the data is distributed.
Timing for the Men’s 500-meter Speed Skating event in Winter Olympics is tabulated. The
Central Tendency measures are computed below:
Dispersion Measures
Measures of Dispersion describe the data spread or how far the measurements are from
the center.
Range
Variance/standard deviation
Mean absolute deviation
Interquartile range
The inter-quartile range (IQR) is a measure that indicates the extent to which the central
50% of values within the dataset are dispersed.
IQR = Q3 – Q1
Ignores Outliers. Only two points used in estimation.
Shape of a Distribution
The shape of a distribution is described by the following characteristics. Skewness is a
measure of symmetry:
Hypothesis Testing
Important Basic Terms
Population = All possible values
Sample = A portion of the population
Inferential Statistics = Generalizing from a sample to a population with a calculated degree of certainty.
Parameter = A characteristic of population
E.g.: Population mean (µ)
Statistic = Calculated from data in that sample,
E.g.: Sample mean
A hypothesis is a tentative explanation for certain behaviors, phenomena or events that have occurred or will occur. A statistical hypothesis is an assertion concerning one or more populations: an educated guess, claim or statement about a property of a population.
To prove that a hypothesis is true, or false, with absolute certainty, we would need to examine the entire population (which is practically not possible). Instead, hypothesis testing concerns how to use a random sample to judge whether the evidence (data in the sample) supports the hypothesis about a parameter or not.
A criminal trial is an example of hypothesis testing without the statistics. In a trial a jury must
decide between two hypotheses. The null hypothesis is H0: The defendant is innocent. The
alternative hypothesis or research hypothesis is H1: The defendant is guilty. The jury does
not know which hypothesis is true. They must make a decision on the basis of evidence
presented.
Make statement(s) regarding unknown population parameter values based on sample data. Begin with the assumption that the null hypothesis is true, similar to the notion of innocent until proven guilty. The null hypothesis refers to the status quo or historical value, always contains an "=", "≤" or "≥" sign, and may or may not be rejected.
The hypothesis we want to test is whether HA is "likely" true. There are two possible outcomes:
1. Reject H0 and accept HA because of sufficient evidence in the sample in favour of HA.
2. Not reject H0 because of insufficient evidence to support HA.
Failure to reject H0 does not mean the null hypothesis is true. There is no formal outcome
that says “accept H0." We say, we “failed to reject H0”.
Suppose the sample mean age was X = 20. This is significantly lower than the claimed mean
population age of 50. If the null hypothesis were true, the probability of getting such a
different sample mean would be very small, so you reject the null hypothesis. Getting a
sample mean of 20 is so unlikely if the population mean was 50, you conclude that the
population mean must not be 50.
Sampling Distribution of the Test Statistic. “Too Far Away” From Mean of Sampling
Distribution.
The test statistic is a value computed from the sample data, and it is used in making the
decision about the rejection of the null hypothesis.
z-test: Test statistic for a mean (σ known or large sample)
t-test: Test statistic for a mean (σ unknown)
Chi-square: Test statistic for a standard deviation (variance)
The confidence coefficient (1-α) is the probability of not rejecting H0 when it is true. The
confidence level of a hypothesis test is (1-α)*100%. The power of a statistical test (1-β) is
the probability of rejecting H0 when it is false.
β increases when the difference between hypothesized parameter and its true value
decreases.
Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not
reject H0.
If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0. If
the test statistic falls into the rejection region, reject the null hypothesis. Express the
managerial conclusion in the context of the problem.
Since ZSTAT = -2.0 < -1.96, reject the null hypothesis. Conclude there is sufficient evidence
that the mean number of TVs in US homes is not equal to 3.
If the p-value is < α then reject H0, otherwise do not reject H0. State the managerial conclusion
in the context of the problem.
How likely is it to get a ZSTAT of -2 (or something further from the mean (0), in either
direction) if H0 is true?
Example:
≤ μ ≤ 2.9968
Since this interval does not contain the hypothesized mean (3.0), we reject the null hypothesis at α = 0.05.
Type I error
Erroneously rejecting the null hypothesis. Your result is significant (p < .05), so you reject
the null hypothesis, but the null hypothesis is actually true.
Type II error
Erroneously accepting the null hypothesis. Your result is not significant (p > .05), so you
don’t reject the null hypothesis, but it is actually false.
Example
H0: μ = 2.7, HA: μ > 2.7. Random sample of students: n = 36, s = 0.6; compute X̄.
Decision Rule
Set significance level α = 0.05. If p-value < 0.05, reject null hypothesis.
Let’s consider what our conclusion is based upon different observed sample means.
In other words, someone is claiming that the mean time is 350 units and we want to check this claim to see if it appears reasonable. We can rephrase this request as a test of the hypothesis H0: μ = 350. The research hypothesis becomes H1: μ ≠ 350.
Recall: The standard deviation [σ] was assumed to be 75, the sample size [n] was 25, and the sample mean was calculated to be 370.16.
Example:
When trying to decide whether the mean is not equal to 350, a large value of X̄ (say, 600) would provide enough evidence. If X̄ is close to 350 (say, 355), we could not say that this provides a great deal of evidence to infer that the population mean is different from 350.
Do not say that the null hypothesis is accepted when a statistician is around.
The testing procedure begins with the assumption that the null hypothesis is true. Until there is further statistical evidence, we continue to assume H0: μ = 350 (assumed to be TRUE). The next step is to determine the sampling distribution of the sample mean X̄ assuming the true mean is 350.
It depends on what you define as the "guts" of the sampling distribution. If we define the guts as the center 95% of the distribution [this means α = 0.05], then the critical values that define the guts will be 1.96 standard deviations of X̄ on either side of the mean of the sampling distribution [350], i.e., 350 ± 1.96 × 15, giving a lower critical value of 320.6 and an upper critical value of 379.4.
Second Way
Standardized test statistic: The "guts" of the sampling distribution is defined to be the center 95% [α = 0.05]. If the Z-score for the sample mean X̄ is greater than 1.96, we know that X̄ will be in the rejection region on the right side. If the Z-score for the sample mean is less than −1.96, we know that X̄ will be in the rejection region on the left side.
Since this is a two-tailed test, this area should be doubled for the p-value: p-value = 2 × (0.0901) = 0.1802.
Since we defined the guts as the center 95% [α = 0.05], the rejection region is the other 5%. Since the sample mean X̄ lies in the 18.02% region, it cannot be in the 5% rejection region [α = 0.05].
Unstandardized Test Statistic: Since LCV (320.6) < X̄ (370.16) < UCV (379.4), we fail to reject the null hypothesis at a 5% level of significance.
Standardized Test Statistic: Since −Z_α/2 (−1.96) < Z (1.344) < Z_α/2 (1.96), we fail to reject the null hypothesis at a 5% level of significance.
P-value: Since the p-value (0.1802) > 0.05 [α], we fail to reject the null hypothesis at a 5% level of significance.
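The three equivalent decision rules can be verified numerically; a Python sketch with the example's numbers (μ0 = 350, σ = 75, n = 25, X̄ = 370.16):

    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, xbar = 350, 75, 25, 370.16
    se = sigma / sqrt(n)                          # 15.0

    z = (xbar - mu0) / se                         # ~1.344
    lcv, ucv = mu0 - 1.96 * se, mu0 + 1.96 * se   # 320.6 and 379.4
    p_value = 2 * (1 - norm.cdf(z))               # ~0.179 (0.1802 in the text, which rounds z to 1.34)

    print(z, (lcv, ucv), p_value)
    # xbar lies between LCV and UCV, |z| < 1.96 and p-value > 0.05: fail to reject H0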
Null hypothesis H0: μ = 170 (“no difference”) Alternative hypothesis can be either. Ha : μ >
170 (one-sided test) or Ha : μ ≠ 170 (two-sided test). The rejection region is split equally
between the two tails.
For the illustrative example, μ0 = 170. We know σ = 40. Take an SRS of n = 64. Therefore,
Standard Error of Mean =40/√64 = 5.
Let α ≡ probability of erroneously rejecting H0. Set α threshold (e.g., let α = .10, .05, or
whatever). Reject H0 when P ≤ α, Retain H0 when P > α
Interpretation
Conventions*
P > 0.10 non-significant evidence against H0
Examples:
P =.27 non-significant evidence against H0
Do not reject H0: insufficient evidence that true mean cost is different than $168.
One-Tail Tests
In many cases, the alternative hypothesis focuses on a particular direction.
H0: μ ≥ 3 H1: μ < 3. This is a lower-tail test since the alternative hypothesis is focused on
the lower tail below the mean of 3.
H0: μ ≤ 3 H1: μ > 3. This is an upper-tail test since the alternative hypothesis is focused on
the upper tail above the mean of 3.
Lower-Tail Tests
There is only one critical value, since the rejection area is in only one tail.
Upper-Tail Tests
There is only one critical value, since the rejection area is in only one tail.
H1: μ > 52 the average is greater than $52 per month (i.e., sufficient evidence exists to
support the manager’s claim)
Example: Decisions
Reach a decision and interpret the result.
Do not reject H0 since tSTAT = 0.55 ≤ 1.318, there is not sufficient evidence that the mean
bill is over $52.
Data
Null Hypothesis µ= 52.00
Level of Significance 0.1
Sample Size 25
Sample Mean 53.10
Sample Standard Deviation 10.00
Intermediate Calculations
Standard Error of the Mean 2.00 =B8/SQRT(B6)
Degrees of Freedom 24 =B6-1
t test statistic 0.55 =(B7-B4)/B11
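The same intermediate calculations can be done in Python; scipy supplies the one-tail critical value and p-value (numbers taken from the table above):

    from math import sqrt
    from scipy.stats import t

    mu0, n, xbar, s, alpha = 52.00, 25, 53.10, 10.00, 0.10

    se = s / sqrt(n)               # 2.00
    t_stat = (xbar - mu0) / se     # 0.55
    df = n - 1                     # 24

    t_crit = t.ppf(1 - alpha, df)  # ~1.318 for the upper-tail test
    p_value = t.sf(t_stat, df)     # upper-tail p-value, ~0.29

    print(t_stat, t_crit, p_value)  # t_stat < t_crit, so do not reject H0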
Proportions
Hypothesis Tests for Proportions
Involves categorical variables. Two possible outcomes:
1. Possesses characteristic of interest
2. Does not possess characteristic of interest
The sampling distribution of p is approximately normal, so the test statistic is a ZSTAT value:
Example
A marketing company claims that it receives 8% responses from its mailing. To test this claim, a random sample of 500 people were surveyed, with 25 responses. Test at the α = 0.05 significance level.
Check:
n π = (500)(.08) = 40 and n(1 − π) = (500)(.92) = 460; both are at least 5, so the normal approximation applies.
Conclusion: There is sufficient evidence to reject the company’s claim of 8% response rate.
p-Value Solution
Calculate the p-value and compare it to α. (For a two-tail test the p-value is always two-tail.)
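A sketch of the two-tail proportion test in Python, using the 8% claimed response rate and the 25 responses out of 500 from the example:

    from math import sqrt
    from scipy.stats import norm

    pi0, n, x = 0.08, 500, 25
    p_hat = x / n                          # 0.05

    se = sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se                 # ~ -2.47
    p_value = 2 * norm.cdf(-abs(z))        # two-tail p-value, ~0.013

    print(z, p_value)   # p_value < 0.05: reject the claim of an 8% response rate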
Beta
Calculating β
Suppose n = 64, σ = 6, and α = .05.
Conclusions:
A one-tail test is more powerful than a two-tail test
An increase in the level of significance (α) results in an increase in power
An increase in the sample size results in an increase in power
Regression Analysis
Scatter Plot
A scatter plot can be used either when one continuous variable is under the control of the experimenter and the other depends on it, or when both continuous variables are independent. If a parameter exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily
by the other, it is called the control parameter or independent variable and is customarily
plotted along the horizontal axis. The measured or dependent variable is customarily plotted
along the vertical axis. If no dependent variable exists, either type of variable can be plotted
on either axis or a scatter plot will illustrate only the degree of correlation (not causation)
between two variables.
A scatter plot can suggest various kinds of correlations between variables with a
certain confidence interval. For example, weight and height, weight would be on y axis and
height would be on the x axis. Correlations may be positive (rising), negative (falling), or null
(uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a
positive correlation between the variables being studied. If the pattern of dots slopes from
upper left to lower right, it indicates a negative correlation. A line of best fit (alternatively
called 'trendline') can be drawn in order to study the relationship between the variables. An
equation for the correlation between the variables can be determined by established best-fit
procedures. For a linear correlation, the best-fit procedure is known as linear regression and
is guaranteed to generate a correct solution in a finite time. No universal best-fit procedure
is guaranteed to generate a correct solution for arbitrary relationships. A scatter plot is also
very useful when we wish to see how two comparable data sets agree to show nonlinear
relationships between variables. The ability to do this can be enhanced by adding a smooth
line such as LOESS. Furthermore, if the data are represented by a mixture model of simple
relationships, these relationships will be visually evident as superimposed patterns.
Examples:
Correlation Coefficient
The population correlation coefficient ρ (rho) measures the strength of the association
between the variables. The sample correlation coefficient r is an estimate of ρ and is used to
measure the strength of the linear relationship in the sample observations.
Where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Features of ρ and r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
Calculation Example:
Test statistic:
t = r / √[ (1 − r²) / (n − 2) ]   (with n − 2 degrees of freedom)
Example: Is there evidence of a linear relationship between tree height and trunk diameter
at the .05 level of significance?
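Before the worked solution, the r statistic and its t test can be sketched in Python; the tree heights and trunk diameters below are hypothetical stand-ins, since the manual's data table is not reproduced here:

import numpy as np
from scipy import stats

# hypothetical data: trunk diameter (x, inches) and tree height (y, feet)
x = np.array([8.3, 10.5, 11.1, 12.9, 14.0, 16.3, 18.0, 21.0])
y = np.array([35.0, 49.0, 45.0, 60.0, 64.0, 70.0, 72.0, 80.0])

r, p_value = stats.pearsonr(x, y)              # sample correlation and two-tail p-value
n = len(x)
t_stat = r / np.sqrt((1 - r**2) / (n - 2))     # same test statistic as the formula above
print(round(r, 3), round(t_stat, 2), round(p_value, 4))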
Test Solution:
β̂0 = ȳ − β̂1 x̄ ….(1)
The least squares estimates are:
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² ,  b0 = ȳ − b1 x̄
Algebraic equivalent:
b1 = [ Σxy − (Σx)(Σy)/n ] / [ Σx² − (Σx)²/n ] ,  b0 = ȳ − b1 x̄
Interpretation
𝜷𝟎 : 𝛽0 is the estimated average value of 𝑦 when the value of 𝑥 is zero. Traditionally
it is the “bias” of the model.
𝜷𝟏 : 𝛽1 is the estimated change in the average value of 𝑦 as a result of a one-unit
change in 𝑥. A sensitivity measure, “Slope” or “rate” of the model
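A minimal numpy sketch of these least squares formulas; the square footage and price values are hypothetical, in the spirit of the house price example that follows:

import numpy as np

# hypothetical data: house size in square feet (x) and price in $1000s (y)
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
b0 = y.mean() - b1 * x.mean()                                                # intercept
print(round(b0, 3), round(b1, 5))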
Excel Output
Graphical Presentation
House price model: Scatter plot and regression line.
Interpretation b1
b1 measures the estimated change in the average value of Y as a result of a one-unit change
in X. Here, b1 = .10977 tells us that the average value of a house increases by: 0.10977($1000)
= $109.77, on average, for each additional one square foot of size.
The simple regression line always passes through the mean of the y variable and the mean
of the x variable. The least squares coefficients are unbiased estimates of β0 and β1.
Where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value
SST = total sum of squares: Measures the variation of the yi values around their mean ȳ
SSR = regression sum of squares: Variation explained by the relationship between x and y
SSE = error sum of squares: Variation attributable to factors other than the relationship
between x and y
(SST = SSR + SSE)
Coefficient of Determination, R2
The coefficient of determination is the portion of the total variation in the dependent
variable that is explained by variation in the independent variable. The coefficient of
determination is also called R-squared and is denoted as R2.
R² = SSR / SST ,  where 0 ≤ R² ≤ 1
Note:
In the single independent variable case, the coefficient of determination is: R2 = r2
Where:
R2 = Coefficient of determination
r = Simple correlation coefficient
0 < R² < 1: Weaker linear relationship between x and y. Some but not all of the variation in
y is explained by variation in x.
R² = 0: No linear relationship between x and y. The value of y does not depend on x (none of
the variation in y is explained by variation in x).
Output
The standard error of the estimate is
sε = √[ SSE / (n − k − 1) ]
Where:
SSE = sum of squares error
n = sample size
k = number of independent variables in the model
sε = sample standard error of the estimate
For simple regression (k = 1) this reduces to sε = √[ SSE / (n − 2) ].
Excel Output
Test statistic:
t = (b1 − β1) / s_b1 ,  d.f. = n − 2
Where:
b1 = Sample regression slope coefficient
β1 = Hypothesized slope
s_b1 = Standard error of the slope
Conclusion: There is sufficient evidence that square footage affects house price.
At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858). Since
the units of the house price variable are $1000s, we are 95% confident that the average
impact on sales price is between $33.70 and $185.80 per square foot of house size. This
95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square feet at the
.05 level of significance.
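The slope t test and its confidence interval can be sketched in Python with scipy.stats.linregress, reusing the hypothetical house data from the earlier sketch:

import numpy as np
from scipy import stats

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

res = stats.linregress(x, y)               # slope, intercept, r, two-tail p-value, stderr of slope
t_crit = stats.t.ppf(0.975, df=len(x) - 2) # 95% two-tail critical value
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(round(res.slope, 5), round(res.pvalue, 4), tuple(round(v, 4) for v in ci))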
Confidence interval for the mean of y, given x_p:
ŷ ± t_{α/2} · sε · √[ 1/n + (x_p − x̄)² / Σ(x − x̄)² ]
Prediction interval for an individual y, given x_p:
ŷ ± t_{α/2} · sε · √[ 1 + 1/n + (x_p − x̄)² / Σ(x − x̄)² ]
The extra term (the added 1 under the square root) widens the interval to reflect the added
uncertainty of predicting an individual case.
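Both intervals translate directly into numpy; this sketch reuses the hypothetical house data and an arbitrary x_p = 2000 square feet:

import numpy as np
from scipy import stats

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))     # standard error of the estimate

x_p = 2000.0
y_hat = b0 + b1 * x_p
t_crit = stats.t.ppf(0.975, df=n - 2)
h = 1 / n + (x_p - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
ci = (y_hat - t_crit * s_eps * np.sqrt(h), y_hat + t_crit * s_eps * np.sqrt(h))          # mean of y
pi = (y_hat - t_crit * s_eps * np.sqrt(1 + h), y_hat + t_crit * s_eps * np.sqrt(1 + h))  # individual y
print(ci, pi)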
Logistic Regression
Review of Linear Estimation
It is known how to handle linear estimation models of the type: Y = β0 + β1X1 + … + βnXn + ε ≡ Xβ + ε
Here, the terms in the model are functions of the n predictors. For the classic
multiple regression model: E(Y|X) = β0 + β1X1 + … + βnXn
The regression coefficients 𝛽𝑖 represent the estimated change in the mean of the response Y
associated with a unit change in 𝑋𝑖 while the other predictors are held constant. They
measure the association between Y and Xi adjusted for the other predictors in the model.
Non-linear Estimation
In all these models Y, the dependent variable, was continuous. Independent variables could
be dichotomous (dummy variables), but not the dependent variable. Non-linear estimation
comes into play when the Y variable is dichotomous.
Example:
Success/Failure, Remission/No Remission
Survived/Died, CHD/No CHD, Low Birth Weight/Normal Birth Weight
Link Functions
A link function is a function linking the actual Y to the estimated Y in an econometric model.
Example: Logs
Start with Y = Xβ + ε
Then change to log(Y) ≡ Y′ = Xβ + ε
Run this like a regular OLS equation
Then you have to “back out” the results
What function F(Y) goes from the [0, 1] interval to the real line?
At least one function is known that goes the other way around. That is, given any real value
it produces a number (probability) between 0 and 1. This is the cumulative normal
distribution Φ. That is, given any Z-score, Φ ∈ [0,1].
Y = Φ(Xβ + ε)
Φ⁻¹(Y) = Xβ + ε
Y′ = Xβ + ε
The link function F(Y) = Φ⁻¹(Y) is known as the Probit link. This term was coined in the
1930s by biologists studying the dosage-cure rate link; it is short for “probability unit”.
In a Probit model, the value of Xβ is taken to be the z-value of a normal distribution. Higher
values of Xβ mean that the event is more likely to happen. One has to be careful about the
interpretation of estimation results here: a one-unit change in Xi leads to a βi change in the
z-score of Y. The estimated curve is an S-shaped cumulative normal distribution.
This fits the data much better than the linear estimation. Always lies between 0 and 1. Can
estimate, for instance, the BVAP at which Pr(Y=1) = 50%. This is the “point of equal
opportunity”
The odds ratio is always non-negative. As a final step, then, take the log of the odds ratio.
Logit Function
logit(Y) = log[O(Y)] = log[y/(1-y)]
Why is it needed?
At first, this was computationally easier than working with normal distributions. It has some
properties that can be investigated with multinomial dependent variable. The density
function associated with it is very close to a standard normal distribution.
Latent Variables
One way to state what’s going on is to assume that there is a latent variable Y* such that
Y* = Xβ + ε ,  ε ~ N(0, σ²)
Logistic Regression
Logistic regression is similar to linear regression, with two main differences: Y (outcome or
response) is categorical (Yes/No, Approve/Reject, or Responded/Did not respond), and the
result is expressed as a probability of being in either group.
Note: Pr(Y = 1 | X) = e^(Xβ) / (1 + e^(Xβ))
where:
“exp” or “e” is the exponential function (e=2.71828…)
p is probability that the event y occurs given x, and can range between 0 and 1
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
Coronary Heart Disease (CHD) and Age: sampled individuals were examined for signs of CHD
(present = 1 / absent = 0), and the potential relationship between this outcome and their age
(in years) was considered.
This is a portion of the raw data for the 100 subjects who participated in the study.
How can we analyse these data?
The mean age of the individuals with some signs of coronary heart disease is 51.28 years
vs. 39.18 years for individuals without signs (t = 5.95, p < .0001).
The smooth regression estimate is “S-shaped” but what does the estimated mean value
represent?
Answer: P(CHD | Age)
We can group individuals into age classes and look at the percentage/proportion showing
signs of coronary heart disease. Notice the “S-shape” to the estimated proportions vs. age.
Logit Transformation
The logistic regression model is given by:
logit(P) = ln[ P / (1 − P) ] = β0 + β1X
Dichotomous Predictor
Consider a dichotomous predictor (X) which represents the presence of risk (1 = present).
The odds ratio associated with risk presence is OR = e^β1.
The estimated regression coefficient associated with a 0-1 coded dichotomous predictor is
the natural log of the OR associated with risk presence: ln(OR) = β1.
ln[ P(Y|X) / (1 − P(Y|X)) ] = ln[ P / (1 − P) ] = β0 + β1X
P / (1 − P) = e^(β0 + β1X)
Consider a dichotomous predictor (X) which represents the presence of risk, coded X = 1
(present) and X = -1 (absent):
Disease (Y)      Risk Present (X = 1)        Risk Absent (X = -1)
Yes (Y = 1)      P(Y = 1 | X = 1)            P(Y = 1 | X = -1)
No (Y = 0)       1 − P(Y = 1 | X = 1)        1 − P(Y = 1 | X = -1)
With this 1/-1 coding, the odds ratio associated with risk presence is OR = e^(2β1), so taking
the natural logarithm, ln(OR) = 2β1.
Is it possible to assume that the increase in risk associated with a c-unit increase is constant
throughout one’s life? Is the increase in going from 20 to 30 years of age the same as going
from 50 to 60 years? If that assumption is not reasonable, then one must be careful when
discussing risk associated with a continuous predictor.
β̂0 = 2.183
β̂1 = 0.607
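A short Python sketch of fitting such a logistic regression of CHD status on age; the ages and outcomes are synthetic stand-ins for the 100-subject study (whose raw data are not reproduced here), and the coefficients used to generate them are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=100)                     # synthetic ages
p_true = 1 / (1 + np.exp(-(-5.3 + 0.11 * age)))         # assumed underlying logit, for simulation only
chd = rng.binomial(1, p_true)                           # synthetic 0/1 CHD outcomes

model = LogisticRegression(C=1e6)                       # large C: effectively no regularization
model.fit(age.reshape(-1, 1), chd)
print(model.intercept_[0], model.coef_[0][0])           # estimated beta0 and beta1
print(model.predict_proba([[50]])[0, 1])                # estimated P(CHD | age = 50)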
Time-Series Data
Numerical data obtained at regular time intervals. The time intervals can be annual,
quarterly, monthly, weekly, daily, hourly, etc.
Example:
Trend Component
Long-run increase or decrease over time (overall upward or downward movement). Data
taken over a long period of time. Trend can be upward or downward. Trend can be linear
or non-linear.
(Example plots: Sales vs. Time showing a downward linear trend and an upward nonlinear trend.)
Seasonal Component
Short-term regular wave-like patterns. Observed within 1 year. Often monthly or quarterly.
Cyclical Component
Long-term wave-like patterns. Regularly occur but may vary in length. Often measured
peak to peak or trough to trough.
Smoothing Methods
A time series plot helps to figure out whether there is a trend component. Often it helps if
one can “smooth” the time series data. Two popular smoothing methods are:
1. Moving Averages: Calculate moving averages to get an overall impression of the
pattern of movement over time. Averages of consecutive time series values for a
chosen period of length L
Moving Averages
Used for smoothing
A series of arithmetic means over time
Result dependent upon choice of L (length of period for computing means)
Last moving average of length L can be extrapolated one period into future for a short term
forecast
Examples: For a 5 year moving average, L = 5. For a 7 year moving average, L = 7, etc.
First average: MA(5) = (Y1 + Y2 + Y3 + Y4 + Y5) / 5
Second average: MA(5) = (Y2 + Y3 + Y4 + Y5 + Y6) / 5
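A minimal Python sketch of an L = 5 moving average (the sales values are illustrative):

import numpy as np

sales = np.array([23, 40, 25, 27, 32, 48, 33, 37, 37, 50, 40], dtype=float)   # illustrative annual data
L = 5
ma5 = np.convolve(sales, np.ones(L) / L, mode="valid")   # MA(5): means of 5 consecutive values
print(np.round(ma5, 2))   # first value = (Y1+...+Y5)/5, second = (Y2+...+Y6)/5, ...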
(Chart: annual sales data, Sales vs. Year, for years 1–11.)
Exponential Smoothing
Used for smoothing and short term forecasting (one period into the future). A weighted
moving average whose weights decline exponentially; the most recent observation is weighted most.
E1 = Y1
Ei = W·Yi + (1 − W)·Ei−1   for i = 2, 3, 4, …
Where:
Ei = exponentially smoothed value for period i
Yi = observed value in period i
W = weight (smoothing coefficient), with 0 < W < 1
Example
Suppose we use weight W = 0.2
(Chart: original Sales and the exponentially smoothed series plotted against Time Period 1–10.)
The smoothed value in the current period (i) is used as the forecast value for the next period
(i + 1): Ŷi+1 = Ei
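The recursion above can be coded directly; a sketch with W = 0.2 and illustrative sales values:

import numpy as np

sales = np.array([23, 40, 25, 27, 32, 48, 33, 37, 37, 50], dtype=float)   # illustrative data
W = 0.2
E = np.empty_like(sales)
E[0] = sales[0]                               # E1 = Y1
for i in range(1, len(sales)):
    E[i] = W * sales[i] + (1 - W) * E[i - 1]  # Ei = W*Yi + (1 - W)*E(i-1)
print(np.round(E, 2))
print("forecast for next period:", round(E[-1], 2))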
Ŷ = b0 + b1X
Compare adj. r2 and standard error to that of linear model to see if this is an improvement.
Can try other functional forms to get best fit.
b1 = estimate of log(β1)
Interpretation: (β̂1 − 1) × 100% is the estimated annual compound growth rate (in %).
Use a quadratic trend model if the second differences are approximately constant
Use an exponential trend model if the percentage differences are approx. constant
[(Y2 − Y1)/Y1] × 100% , [(Y3 − Y2)/Y2] × 100% , … , [(Yn − Yn−1)/Yn−1] × 100%
Autoregressive Modeling
Used for forecasting. Takes advantage of autocorrelation
1st order - correlation between consecutive values
2nd order - correlation between values 2 periods apart
Units: 4, 3, 2, 3, 2, 2, 4, 6
Develop the 2nd order table. Use Excel or Minitab to estimate a regression model.
Measuring Errors
Choose the model that gives the smallest measuring errors.
Sum of squared errors (SSE): SSE = Σ_{i=1}^{n} (Yi − Ŷi)². Sensitive to outliers.
Mean Absolute Deviation (MAD): MAD = (1/n) Σ_{i=1}^{n} |Yi − Ŷi|.
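Both error measures are one-liners in numpy; the forecasts here are illustrative values paired with the Units series quoted above:

import numpy as np

y = np.array([4, 3, 2, 3, 2, 2, 4, 6], dtype=float)         # actual values (the Units series)
y_hat = np.array([3.5, 3.2, 2.4, 2.8, 2.3, 2.5, 3.6, 5.1])  # illustrative model forecasts
sse = np.sum((y - y_hat) ** 2)      # sum of squared errors, sensitive to outliers
mad = np.mean(np.abs(y - y_hat))    # mean absolute deviation
print(round(sse, 3), round(mad, 3))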
Principle of Parsimony
Suppose two or more models provide a good fit for the data. Select the simplest model.
βi provides the multiplier for the ith quarter relative to the 4th quarter (i = 2, 3, 4)
Index Numbers
Index numbers allow relative comparisons over time. Index numbers are reported relative
to a Base Period Index. Base period index = 100 by definition. Used for an individual item or
group of items.
Ii = (Pi / Pbase) × 100
Where
Ii = index number for year i
Pi = price for year i
Pbase= price for the base year
Example:
Airplane ticket prices from 1995 to 2003.
Prices in 2000 were 100% of base year prices (by definition, since 2000 is the base year)
I2000 = (P2000 / P2000) × 100 = (320 / 320) × 100 = 100
Unweighted aggregate price index:
I_U^(t) = [ Σ_{i=1}^{n} Pi^(t) / Σ_{i=1}^{n} Pi^(0) ] × 100
Where:
i = item
t = time period
n = total number of items
Σ_{i=1}^{n} Pi^(t) = sum of the prices for the group of items at time t
Σ_{i=1}^{n} Pi^(0) = sum of the prices for the group of items in time period 0
Example:
I^(2004) = (ΣP^(2004) / ΣP^(2001)) × 100 = (410 / 345) × 100 = 118.8
Unweighted total expenses were 18.8% higher in 2004 than in 2001.
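A Python sketch of both index calculations; the 320/320 ticket prices and the 345 and 410 totals come from the examples above, while the individual item prices used for the aggregate index are illustrative values chosen to match those totals:

def simple_index(price_i, price_base):
    # I_i = (P_i / P_base) * 100
    return price_i / price_base * 100

def unweighted_aggregate_index(prices_t, prices_0):
    # I_U(t) = (sum of group prices at time t / sum at time period 0) * 100
    return sum(prices_t) / sum(prices_0) * 100

print(simple_index(320, 320))                                  # base year 2000 -> 100.0
p_2001 = [100, 120, 125]                                       # illustrative item prices, total 345
p_2004 = [130, 135, 145]                                       # illustrative item prices, total 410
print(round(unweighted_aggregate_index(p_2004, p_2001), 1))    # 118.8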
Machine Learning
Algorithms and techniques used for data analytics. Machine learning studies how to
automatically learn to make accurate predictions based on past observations. It is
programming computers to optimize a performance criterion by tuning a set of parameters.
These tuned programs then perform the same task on unseen data.
Diagrammatic Representation
Applications
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Bioinformatics: Motifs, alignment
Supervised Learning
We are given attributes X and targets y by a knowledgeable external supervisor. Typical
supervised methods include:
Regression
Classification
Decision trees
Random forest
(Diagram: Training Data, Feature Extraction, ML Algorithm, ML Model, Performance Metric.)
Regression: Examples
Reading your mind: Happiness state is related to brain region intensities.
Predicting stock prices depends on: Recent Stock Prices, News Events and Related
commodities.
Classification: Examples
Credit Scoring: Differentiating between low-risk and high-risk customers from their income
and savings
Classification: Applications
These applications are also known as Pattern Recognition.
Face Recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style.
Character recognition: Different handwriting styles.
Speech recognition: Temporal dependency. Use of a dictionary or the syntax of the
language.
Sensor fusion: Combine multiple modalities; Example: visual (lip image) and
acoustic for speech.
Medical diagnosis: From symptoms to illnesses.
Web Advertising: Predict if a user clicks on an ad on the Internet.
Learning Associations
Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X
and Y are products/services. Example: P(chips | beer) = 0.7
Market-Basket Transactions
Reinforcement Learning
Mimics an intelligent system
Observes the interaction of the environment and system actions
Optimizes goals/rewards and leads to continuous self-learning
It is not a method but a process as a whole to build knowledge
Takes corrective action even when the system sees a new situation
What is Clustering?
Attach a label to each observation or data point in a set. This comes under “unsupervised
classification”. Clustering is alternatively called “grouping”: the aim is to assign the
same label to data points that are “close” to each other. Clustering algorithms rely on a
distance metric between data points. Sometimes, the distance metric is more important than
the clustering algorithm.
Why Clustering?
Understanding: Group related documents for browsing, group genes and proteins
that have similar functionality, or group stocks with similar price fluctuations
Summarization: Reduce the size of large data sets and feature selection pool
Data compression
Types of Clustering
A clustering is a set of clusters
Partitional Clustering
For example: Two clusters for the shop near my house and my friend’s house. This
philosophy is implemented in “K-Means Clustering Algorithm”.
K-means Overview
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
Algorithm: k-means
1. Decide on a value for k (number of classes).
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the M objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers.
5. If none of the M objects changed membership in the last iteration, exit. Otherwise go to step 3.
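The same algorithm is available off the shelf; a minimal scikit-learn sketch on synthetic 2-D points (k = 3 is an assumption that matches how the toy data are generated):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# three synthetic blobs of 2-D points
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)    # the re-estimated cluster centers
print(km.labels_[:10])        # cluster membership of the first 10 points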
What to find?
The best 256 colours among all 16 million colours such that the image using only the 256
colours in the palette looks as close as possible to the original image. Colour quantization is
used to map from high to lower resolution. Can always quantize uniformly, but this wastes
the colour map by assigning entries to colours not existing in the image, or would not assign
extra entries to colours frequently used in the image.
For example, if the image is a seascape, we expect to see many shades of blue and maybe no
red. So the distribution of the colour map entries should reflect the original density.
Steps
Collect data from a true-colour image
Perform K-means clustering to obtain cluster centers as the indexed colours
Compute the compression rate
Before: m × n × 3 × 8 bits
After: m × n × log2(c) + c × 3 × 8 bits
Compression ratio: before / after = (24 m n) / (m n log2(c) + 24 c) ≈ 24 / log2(c) when m n ≫ c
MATLAB Code
X = imread('annie19980405.jpg');          % load a true-colour image
image(X)
[m, n, p] = size(X);
index = reshape(1:m*n*p, m*n, 3)';        % indices that pick out the R, G, B planes
data = double(X(index));                  % 3-by-(m*n) matrix of pixel colours
maxI = 6;
for i = 1:maxI
    centerNum = 2^i;                      % try 2, 4, ..., 64 palette colours
    fprintf('i=%d/%d: no. of centers=%d\n', i, maxI, centerNum);
    % kMeansClustering and distPairwise are helper functions supplied with the
    % course toolbox; they are not built-in MATLAB functions.
    center = kMeansClustering(data, centerNum);
    distMat = distPairwise(center, data);
    [minValue, minIndex] = min(distMat);  % nearest palette colour for each pixel
    X2 = reshape(minIndex, m, n);         % indexed image
    map = center'/255;                    % palette scaled to [0, 1]
    figure; image(X2); colormap(map); colorbar; axis image;
end
Choice of K
Can WK(C), i.e., the within cluster distance as a function of K serve as any indicator?
W_K = Σ_{k=1}^{K} N_k Σ_{x_i ∈ C_k} d²(x_i, z_k)
Note that WK(C) decreases monotonically with increasing K; that is, the within-cluster
scatter decreases as the number of centroids increases. Instead, look for a gap in the
successive differences of WK(C):
{W_K − W_{K+1} : K < K*} ≫ {W_K − W_{K+1} : K ≥ K*}
K-means: Limitations
K is a user input; Alternatively BIC (Bayesian information criterion) or MDL (minimum
description length) can be used to estimate K. K-means converges, but it finds a local
minimum of the cost function. It is an approximation to an NP-hard combinatorial
optimization problem. Works only for numerical observations. Outliers can cause considerable
trouble for K-means. K-means has problems when clusters are of differing sizes, densities, and
non-globular shapes.
One solution is to use many clusters. Find parts of clusters, but need to put together.
K-medoids Clustering
K-means is appropriate when one can work with Euclidean distances. K-means can
work only with numerical, quantitative variable types. Euclidean distances do not work
well in at least two situations
1. Some variables are categorical
2. Outliers can be potential threats
A general version of K-means algorithm called K-medoids can work with any distance
measure. K-medoids clustering is computationally more intensive.
K-medoids Algorithm
Step 1: For a given cluster assignment C, find the observation in the cluster minimizing the
total distance to the other points in that cluster:
i*_k = argmin_{ i : C(i) = k } Σ_{ C(j) = k } d(x_i, x_j)
Step 2: Assign m_k = x_{i*_k} , k = 1, 2, …, K
Step 3: Given the current set of cluster centers {m1, …, mK}, minimize the total error by
assigning each observation to the closest (current) cluster center:
C(i) = argmin_{ 1 ≤ k ≤ K } d(x_i, m_k)
Iterate steps 1 to 3 until the assignments do not change.
Generalized K-means
Computationally much costlier than K-means
Apply when dealing with categorical data
Apply when data points are not available, but only pair-wise distances are available
Converges to local minimum.
Dysplastic cells have undergone precancerous changes. They generally have longer and
darker nuclei, and they have a tendency to cling together in large clusters. Mildly dysplastic
cells have enlarged and bright nuclei. Moderately dysplastic cells have larger and darker
nuclei. Severely dysplastic cells have large, dark, and often oddly shaped nuclei. The
cytoplasm is dark, and it is relatively small.
Possible Features
Nucleus and cytoplasm area
Nucleus and cyto brightness
Nucleus shortest and longest diameter
Cyto shortest and longest diameter
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
z_k = (1 / N_k) Σ_{x_i ∈ C_k} x_i
K-Nearest Neighbor
Different Learning Methods
Eager Learning
Explicit description of target function on the whole training set
Instance-based Learning
Learning=storing all training instances
Classification=assigning target function to a new instance
Referred to as “Lazy” learning
Classification
Predict the category of the given instance that is rationally consistent with the dataset.
Instance-Based Learning
K-Nearest Neighbor Algorithm
Weighted Regression
Case-based reasoning
Features
All instances correspond to points in an n-dimensional Euclidean space.
Classification is delayed till a new instance arrives.
Classification done by comparing feature vectors of the different points.
Target function may be discrete or real-valued.
“Closeness” is defined in terms of the Euclidean distance between two examples. The
Euclidean distance between X=(x1, x2, x3,…xn) and Y =(y1,y2, y3,…yn) is defined as:
D(X, Y) = √[ Σ_{i=1}^{n} (x_i − y_i)² ]
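A sketch of k-NN in Python, first with the Euclidean distance written out and then with scikit-learn's classifier; the data are synthetic:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean(x, y):
    # D(X, Y) from the formula above
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

# synthetic training data: two features, two classes
X_train = np.array([[1.0, 1.2], [0.8, 0.9], [1.1, 0.7], [5.1, 5.3], [4.9, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(euclidean([1.0, 1.2], [5.1, 5.3]))    # distance between two examples
print(knn.predict([[4.5, 4.7]]))            # predicted class of a new instance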
Normalization of Variables
Example: Married
Non-Numeric Data
Feature values are not always numbers.
Boolean values: Yes or no, presence or absence of an attribute
Categories: Colors, educational attainment, gender
Non-binary characterizations
Use natural progression when applicable; e.g., educational attainment: GS, HS,
College, MS, PHD => 1,2,3,4,5
Assign arbitrary numbers but be careful about distances; e.g., color: red, yellow, blue
=> 1,2,3
k-NN Variations
Value of k
o Larger k increases confidence in prediction
o Note that if k is too large, decision may be skewed
Weighted evaluation of nearest neighbors
o Plain majority may unfairly skew decision
o Revise algorithm so that closer neighbors have greater “vote weight”
Other distance measures
o City-block distance (Manhattan dist)
Add absolute value of differences
o Cosine similarity
This is prohibitively expensive for a large number of samples, but we need a large number of
samples for k-NN to work well.
Remarks
Advantages
Can be applied to the data from any distribution
Very simple and intuitive
Good classification if the number of samples is large enough
Disadvantages
Classification Task
Given: Expression profile of a new patient + a learned model
Tennis Example
r = |wᵀxi + b| / ‖w‖
After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each
support vector xs and the hyperplane is:
r = ys(wᵀxs + b) / ‖w‖ = 1 / ‖w‖
Then the margin can be expressed through the (rescaled) w and b as:
ρ = 2r = 2 / ‖w‖
Find w and b such that ρ = 2/‖w‖ is maximized and for all (xi, yi), i = 1..n : yi(wᵀxi + b) ≥ 1
Find w and b such that Φ(w) = ||w||2=wTw is minimized and for all (xi, yi), i=1..n : yi (wTxi
+ b) ≥ 1.
Each non-zero αi indicates that corresponding xi is a support vector. Then the classifying
function is (note that we don’t need w explicitly):
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x and the support vectors xi;
we will return to this later. Also keep in mind that solving the optimization problem involved
computing the inner products xiᵀxj between all training points.
Parameter C can be viewed as a way to control overfitting: it “trades off” the relative
importance of maximizing the margin and fitting the training data.
Solution:
Dual problem is identical to separable case (would not be identical if the 2-norm penalty
for slack variables CΣξi2 was used in primal objective, we would need additional Lagrange
multipliers for slack variables):
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
Again, xi with non-zero αi will be support vectors. Solution to the dual problem is:
w =Σαiyixi
b= yk(1- ξk) - ΣαiyixiTxk for any k s.t. αk>0
No need to compute w explicitly for classification: f(x) = ΣαiyixiTx + b.
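In practice the quadratic program above is handled by library solvers; a minimal scikit-learn sketch with a linear kernel and the soft-margin parameter C (the data are synthetic):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C trades off margin width against training error
print(clf.support_vectors_.shape)             # the xi with non-zero alpha_i
print(clf.coef_, clf.intercept_)              # w and b recovered from the dual solution
print(clf.predict([[0.5, -0.2]]))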
Theoretical Justification for Maximum Margins
What has Vapnik proved?
The class of optimal linear separators has VC dimension h bounded as:
h ≤ min( ⌈D² / ρ²⌉ , m0 ) + 1
Where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the
training examples, and m0 is the dimensionality.
Non-linear SVMs
Datasets that are linearly separable with some noise work out great:
The original feature space can always be mapped to some higher-dimensional feature
space where the training set is separable.
A kernel function is a function that is equivalent to an inner product in some feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)². We need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1 xi1² √2 xi1xi2 xi2² √2xi1 √2xi2]ᵀ [1 xj1² √2 xj1xj2 xj2² √2xj1 √2xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1 x1² √2 x1x2 x2² √2x1 √2x2]
A kernel function implicitly maps data to a high-dimensional space (without the need to
compute each φ(x) explicitly).
Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p. Mapping Φ: x → φ(x), where φ(x) has
C(d + p, p) dimensions for d-dimensional inputs.
SVM locates a separating hyperplane in the feature space and classifies points in that space. It
does not need to represent the space explicitly, simply by defining a kernel function. The
kernel function plays the role of the dot product in the feature space.
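The kernel identity for the quadratic example above can be checked numerically; a short numpy sketch comparing K(x, z) = (1 + xᵀz)² with the explicit feature map φ:

import numpy as np

def phi(v):
    # explicit 6-dimensional feature map for the quadratic kernel on 2-D inputs
    x1, x2 = v
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2, np.sqrt(2) * x1, np.sqrt(2) * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.5])
k_direct = (1 + x @ z) ** 2    # kernel evaluated in the input space
k_mapped = phi(x) @ phi(z)     # inner product in the feature space
print(k_direct, k_mapped)      # both equal 2.25 for these inputs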
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets: only support vectors are
used to specify the separating hyperplane
Ability to handle large feature spaces
Overfitting can be controlled by soft margin approach
Nice math property: A simple convex optimization problem which is guaranteed to
converge to a single global solution
Feature Selection
Weakness of SVM
It is sensitive to noise: A relatively small number of mislabeled examples can dramatically
decrease the performance. It only considers two classes:
How to do multi-class classification with SVM?
Answer: With m classes, learn m SVMs, each separating one class from the rest (one-versus-rest).
To predict the output for a new input, just predict with each SVM and find out which one
puts the prediction the furthest into the positive region.
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time
be?
Choosing Attributes
The previous experience decision table showed 4 attributes: hour, weather, accident and
stall. But the decision tree only showed 3 attributes: hour, accident and stall. Why is that?
Methods for selecting attributes (which will be described later) show that weather is not a
discriminating attribute.
We use the principle of Occam’s Razor: Given a number of competing hypotheses, the simplest
one is preferable. The basic structure of creating a decision tree is the same for most
decision tree algorithms. The difference lies in how we select the attributes for the tree. We
will focus on the ID3 algorithm (Iterative Dichotomiser 3) developed by Ross Quinlan in
1975. There is an extension of ID3, referred to as C4.5, that accounts for unavailable values,
continuous attribute value ranges, pruning of decision trees, rule derivation, and so on.
How did we know to split on “Leave At” and then on “Stall” and “Accident” and not
“Weather”?
ID3 Heuristic
To determine the best attribute, we look at the ID3 heuristic. ID3 splits attributes based on
their entropy. Entropy is a measure of uncertainty (lack of information).
Entropy
Calculation of Entropy
E(S) = Σ_{i=1}^{n} −(|Si| / |S|) · log2(|Si| / |S|)
𝑆 = set of examples
𝑆𝑖 = subset of S with value vi under the target attribute
n = size of the range of the target attribute
Entropy is minimized when all values of the target attribute are the same. If we know that
commute time will always be short, then entropy = 0. Entropy is maximized when there is
an equal chance of all values for the target attribute (i.e. the result is random). If commute
time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is
maximized.
Example
Suppose S has 25 examples, 15 positive and 10 negatives [15+, 10-]. Then the entropy of S
relative to this classification is:
E(S) = −(15/25) log2(15/25) − (10/25) log2(10/25) ≈ 0.971
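The same number follows from a small Python function implementing E(S):

import math

def entropy(counts):
    # E(S) = sum over classes of -(|Si|/|S|) * log2(|Si|/|S|)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([15, 10]), 3))   # the [15+, 10-] example, about 0.971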
ID3
ID3 splits on attributes with the lowest entropy. We calculate the entropy for all values of
an attribute as the weighted sum of subset entropies as follows:
Σ_{i=1}^{n} (|Si| / |S|) · E(Si)
Where n is the range of the attribute we are testing. We can also measure information gain
(which is inversely related to this weighted entropy) as a measure of the expected reduction
in entropy from selecting a particular attribute for the split.
Information Gain
Information gain measures the expected reduction in entropy, or uncertainty.
Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · E(Sv)
Where Values(A) is the set of all possible values for attribute A, and
Sv = {p ∈ S | A(p) = v}
The first term in the equation for Gain is just the entropy of the original collection S and the
second term is the expected value of the entropy after S is partitioned using attribute A. Given
our commute time sample set, we can calculate the entropy of each attribute at the root node.
Examples
Before partitioning, the entropy is E(10/20, 10/20) = −10/20 log2(10/20) − 10/20 log2(10/20) = 1.
Expected entropy after partitioning: 4/20 × E(1/4, 3/4) + 12/20 × E(9/12, 3/12) + 4/20 ×
E(0/4, 4/4) = 0.65, so the information gain is 1 − 0.65 = 0.35.
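The partitioning arithmetic can be verified with a short sketch reusing the same entropy function:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_before = entropy([10, 10])                                 # 1.0
subsets = [(4, [1, 3]), (12, [9, 3]), (4, [0, 4])]           # subset sizes and class splits from the example
e_after = sum(size / 20 * entropy(counts) for size, counts in subsets)   # about 0.65
print(round(e_before, 2), round(e_after, 2), round(e_before - e_after, 2))  # gain about 0.35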
Decision
Knowing the “when” attribute values provides larger information gain than “where”.
Therefore the “when” attribute should be chosen for testing prior to the “where” attribute.
Similarly, we can compute the information gain for other attributes. At each node, choose the
attribute with the largest information gain.
Evaluation
Training accuracy
How many training instances can be correctly classified based on the available data? Training
accuracy is high when the tree is deep/large, or when there is little conflict in the training
instances. However, higher training accuracy does not mean good generalization.
Testing accuracy
Given a number of new instances, how many of them can we correctly classify? Cross
validation.
Continuous Attribute
Each non-leaf node is a test, its edge partitioning the attribute into subsets (easy for
discrete attribute). For continuous attribute:
Partition the continuous value of attribute A into a discrete set of intervals
Create a new boolean attribute 𝐴𝑐 , looking for a threshold c,
Ac = True if A < c, False otherwise
How to choose c?
Pruning Trees
There is another technique for reducing the number of attributes used in a tree – pruning.
Two types of pruning:
1. Pre-pruning (forward pruning)
2. Post-pruning (backward pruning)
Prepruning: In prepruning, we decide during the building process when to stop adding
attributes (possibly based on their information gain). However, this may be problematic –
Why? Sometimes attributes individually do not contribute much to a decision, but
combined, they may have a significant impact.
Postpruning: Postpruning waits until the full decision tree has been built and then prunes the
attributes. Two techniques:
1. Subtree Replacement
2. Subtree Raising
Node 6 replaced the subtree. Generalizes tree a little more, but may increase accuracy.
Subtree Raising
Entire subtree is raised onto another node. This was not discussed in detail as it is not clear
whether this is really worthwhile (as it is very time consuming).
A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such
that h has smaller error than h′ over the training examples, but h′ has smaller error than h
over the entire distribution of instances.
Random Forests
An ensemble of decision trees
1. Split the learning data into a number of samples: use bootstrap sampling to generate a
large number of bootstrap samples.
2. Generate a decision tree for each bootstrap sample.
3. All trees vote to produce a final answer. The majority vote is considered.
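A scikit-learn sketch of this bagging-and-voting recipe on synthetic data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))                  # synthetic data with 4 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic 0/1 target

# each tree is grown on a bootstrap sample and splits on a random subset of features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)
print(rf.predict(X[:5]))                       # majority vote of the 100 trees
print(round(rf.score(X, y), 3))                # training accuracy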
Why do this?
It was found that optimal cut points can depend strongly on the training set used. [High
variance]. This led to the idea of using multiple trees to vote for a result. Averaging the
outputs of trees reduces overfitting to noise. Pruning is not needed. For the use of multiple
trees to be most effective, the trees should be as independent as possible. Splitting using a
random subset of features hopefully achieves this.
Typically 5 – 100 trees are used. Often only a few trees are needed. Results seem fairly
insensitive to the number of random attributes that are tested for each split. A common
default is to use the square root of the number of attributes. Trees are fast to generate because
fewer attributes have to be tested for each split and no pruning is needed. Memory needed
to store the trees can be large.
Probabilistic Classification
Establishing a probabilistic model for classification.
Discriminative model
Generative classification with the MAP rule: Apply Bayesian rule to convert them into
posterior probabilities:
P(C = ci | X = x) = P(X = x | C = ci) · P(C = ci) / P(X = x)
∝ P(X = x | C = ci) · P(C = ci)   for i = 1, 2, …, L
Naïve Bayes
Bayes classification
Naïve Bayes classification makes the assumption that all input features are conditionally
independent:
P(X1, X2, …, Xn | C) = P(X1 | X2, …, Xn, C) · P(X2, …, Xn | C)
= P(X1 | C) · P(X2, …, Xn | C)
= P(X1 | C) · P(X2 | C) ⋯ P(Xn | C)
Assign x to class c* if
[P(x1 | c*) ⋯ P(xn | c*)] · P(c*) > [P(x1 | c) ⋯ P(xn | c)] · P(c) for all c ≠ c*, c = c1, …, cL
For the day <sunny, cool, high, strong>, what’s the play prediction?
For each of the four external factors, we calculate the conditional probability table.
Learning Phase
Example
Test Phase
Given a new instance x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong),
predict its label. Look up the tables obtained in the learning phase.
Learning Phase:
P(C = ci) ,  i = 1, …, L
Relevant Issues
Violation of Independence Assumption. For many real world tasks,
P(X1, …, Xn | C) ≠ P(X1 | C) ⋯ P(Xn | C)
Nevertheless, naïve Bayes works surprisingly well anyway! A second issue is the zero
conditional probability problem, which is handled with the m-estimate below.
P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)
nc : number of training examples for which Xj = ajk and C = ci
n : number of training examples for which C = ci
p : prior estimate (usually, p = 1/t for t possible values of Xj)
m : weight given to the prior (number of “virtual” examples, m ≥ 1)
Example: Suppose a dataset (for a given class) has 1000 tuples, with income = low in 0 tuples,
income = medium in 990, and income = high in 10.
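A sketch of the m-estimate for this example in Python, using p = 1/3 and m = 3 (i.e. Laplace smoothing) as illustrative choices; the three counts are those quoted above:

counts = {"low": 0, "medium": 990, "high": 10}   # counts of income values within the class
n = sum(counts.values())                          # 1000
t = len(counts)                                   # 3 possible values of income
m, p = t, 1 / t                                   # Laplace smoothing expressed as an m-estimate

smoothed = {v: (nc + m * p) / (n + m) for v, nc in counts.items()}
print(smoothed)   # low ~ 0.001, medium ~ 0.988, high ~ 0.011 -- no zero probabilities remain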
Naïve Bayes is often a good choice if you don’t have much training data!
Prehistory
W.S. McCulloch & W. Pitts (1943). “A logical calculus of the ideas immanent in nervous
activity”, Bulletin of Mathematical Biophysics, 5, 115-137. This seminal paper pointed out
that simple artificial “neurons” could be made to perform basic logical operations such as
AND, OR and NOT.
Could computers built from these simple units reproduce the computational power of
biological brains? Were biological neurons performing logical operations?
Linear Neurons
The neuron has a real-valued output which is a weighted sum of its inputs. The aim of
learning is to minimize the discrepancy between the desired output and the actual output:
How do we measure the discrepancies?
Do we update the weights after every training case?
Why don’t we solve it analytically?
A Motivation Example
Each day you get lunch at the cafeteria. Your diet consists of fish, chips, and a drink. You get
several portions of each. The cashier only tells you the total price of the meal. After several
days, you should be able to figure out the price of each portion.
Each meal price gives a linear constraint on the prices of the portions:
price = x_fish·w_fish + x_chips·w_chips + x_drink·w_drink
The obvious approach is just to solve a set of simultaneous linear equations, one per meal.
But we want a method that could be implemented in a neural network. The prices of the
portions are like the weights of a linear neuron.
Cashier’s Brain
The “delta rule” for adjusting a weight: Δwi = ε · xi · (y − ŷ), where ε is a learning rate.
Define the error as the squared residuals summed over all training cases:
E = ½ Σ_n (y_n − ŷ_n)²
Now differentiate to get error derivatives for weights:
∂E/∂wi = ½ Σ_n (∂ŷ_n/∂wi) · (dE_n/dŷ_n) = −Σ_n x_{i,n} · (y_n − ŷ_n)
The batch delta rule changes the weights in proportion to their error derivatives summed
over all training cases:
Δwi = −ε · ∂E/∂wi
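A numpy sketch of the batch delta rule applied to the cafeteria story; the true portion prices, the number of meals, and the learning rate ε are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([150.0, 50.0, 100.0])             # assumed true prices of fish, chips, drink
X = rng.integers(1, 5, size=(20, 3)).astype(float)  # portions bought over 20 meals
y = X @ true_w                                      # observed total meal prices

w = np.zeros(3)                                     # initial guesses for the prices
eps = 0.0005                                        # learning rate
for _ in range(2000):
    grad = -(X.T @ (y - X @ w))                     # dE/dw for E = 1/2 * sum of squared residuals
    w -= eps * grad                                 # batch delta rule: w <- w - eps * dE/dw
print(np.round(w, 1))                               # approaches the assumed true prices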
The Error Surface
The error surface lies in a space with a horizontal axis for each weight and one vertical axis
for the error.
For a linear neuron, it is a quadratic bowl.
Vertical cross-sections are parabolas.
Horizontal cross-sections are ellipses.
(Figure: error surface over the weight space (w1, w2); each training case contributes a constraint line.)
Adding Biases
A linear neuron is a more flexible model if we include a bias. We can avoid having to figure
out a separate learning rule for the bias by using a trick: a bias is exactly equivalent to a
weight on an extra input line that always has an activity of 1.
ŷ = b + Σ_i xi·wi
(Figure: neuron with weights b, w1, w2 on inputs 1, x1, x2.)
Transfer Functions
Determines the output from a summation of the weighted inputs of a neuron. Maps any real
numbers into a domain normally bounded by 0 to 1 or -1 to 1, i.e. squashing functions. Most
common functions are sigmoid functions: Oj = fj( Σ_i wij·xi )
Logistic: f(x) = 1 / (1 + e^(−x))
Hyperbolic tangent: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Activation Functions
The activation function is generally non-linear. Linear functions are limited because the
output is simply proportional to the input.
Neuron Models
The choice of activation function determines the neuron model.
Examples:
Step function: φ(v) = a if v < c ; b if v ≥ c
Ramp function: φ(v) = a if v < c ; b if v > d ; a + ((v − c)(b − a)/(d − c)) otherwise
Sigmoid function: φ(v) = z + 1 / (1 + exp(−x·v + y))
Gaussian function: φ(v) = (1/√(2π)) · exp(−v²/2)
Sigmoid Function
Gaussian Function
The Gaussian function is the probability function of the normal distribution. Sometimes
also called the frequency curve.
Network Architectures
Three different classes of network architectures:
Single-layer feed-forward
Multi-layer feed-forward
Recurrent
The architecture of a neural network is linked with the learning algorithm used to train the
network. The output of node i is typically computed as Yi = 1 / (1 + e^(−A·Vi)),
where A > 0, Vi = Σj Wij · Yj, such that Wij is a weight of the link from node i to node j and
Yj is the output of node j.
Back propagation consists of the repeated application of the following two passes:
1. Forward pass: In this step, the network is activated on one example and the error
of (each neuron of) the output layer is computed.
2. Backward pass: In this step the network error is used for updating the weights. The
error is propagated backwards from the output layer through the network layer by
layer. This is done by recursively computing the local gradient of each neuron.
Back propagation adjusts the weights of the NN in order to minimize the network total
mean squared error.
Consider a network of three layers. Let us use i to represent nodes in input layer, j to
represent nodes in hidden layer and k represent nodes in output layer. wij refers to weight
of connection between a node in input layer and node in hidden layer. The following
equation is used to derive the output value Yj of node j.
Yj = 1 / (1 + e^(−Vj))
The network error is the sum of the squared errors of the output neurons:
E(n) = Σ_k e_k²(n)
The total mean squared error is the average of the network errors over the training examples:
E_AV = (1/N) Σ_{n=1}^{N} E(n)
Iteration of the Backprop algorithm is usually terminated when the sum of squares of
errors of the output values for all training data in an epoch is less than some threshold such
as 0.01.
wij ← wij + Δwij ,  Δwij = −η · ∂E/∂wij   (η = learning rate)
Stopping Criterions
Total mean squared error change: Back-prop is considered to have converged
when the absolute rate of change in the average squared error per epoch is
sufficiently small (in the range [0.1, 0.01]).
Generalization based criterion: After each epoch, the NN is tested for
generalization. If the generalization performance is adequate then stop. If this
stopping criterion is used then the part of the training set used for testing the
network generalization will not be used for updating the weights.
Applications
Healthcare Applications of ANNs
Predicting/confirming myocardial infarction, heart attack, from EKG output waves.
Physicians had a diagnostic sensitivity and specificity of 73.3% and 81.1% while
ANNs performed 96.0% and 96.0%
Identifying dementia from EEG patterns, performed better than both Z statistics and
discriminant analysis; better than LDA for (91.1% vs. 71.9%) in classifying with
Alzheimer disease.
Papnet: A Pap smear screening system by Neuromedical Systems, approved for use by the US FDA
Predict mortality risk of preterm infants, screening tool in urology, etc.
Neural networks:
Neural networks learn from examples
No requirement of an explicit description of the problem.
No need for a programmer.
The neural computer adapts itself during a training period, based on examples of
similar problems even without a desired solution to each problem. After sufficient
training the neural computer is able to relate the problem data to the solutions,
inputs to outputs, and it is then able to offer a viable solution to a brand new
problem.
Able to generalize or to handle incomplete data.
NNs vs. Computers
Digital Computers | Neural Networks
Deductive reasoning: we apply known rules to input data to produce output. | Inductive reasoning: given input and output data (training examples), we construct the rules.
Computation is centralized, synchronous, and serial. | Computation is collective, asynchronous, and parallel.
Memory is packeted, literally stored, and location addressable. | Memory is distributed, internalized, short term, and content addressable.
Not fault tolerant: one transistor goes and it no longer works. | Fault tolerant: redundancy and sharing of responsibilities.
Exact. | Inexact.
Static connectivity. | Dynamic connectivity.
Applicable if well-defined rules with precise input data. | Applicable if rules are unknown or complicated, or if data are noisy or partial.