
EDA Reviewer

Uploaded by

Dianne Rose Nava

EDA

INTRODUCTION TO STATISTICS

❖ Statistics – the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
■ Variable – a characteristic or attribute that can assume different values.
■ Data – the values (measurements or observations) that the variables can assume.

Examples:

Variable: Ages in a Certain Family
Data: {17, 22, 26, 52, 53}

Variable: Biological Sex
Data: {Male, Female}

As in mathematics, the list of data is called a data set, and each element is called a data value.

TYPES OF STATISTICS
❖ Descriptive Statistics – consists of the collection, organization, summarization, and presentation of data.
■ uses the measures of central tendency (the Triple M, a.k.a. mean, median, mode)

❖ Inferential Statistics – consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.

Example:
Graduating Age in my 4th Year Section: {16, 16, 16, 15, 16, 15, 16, 15, 16, 15, 15, 16, 17, 16, 15, 16, 18, 16, 15, 16, 16, 17, 16, 16, 15, 16, 16, 16, 15, 15, 16, 16, 16, 16, 15, 16, 16, 15, 16, 16, 16}
Mean: 15.81 ≈ 16, Median: 16, Mode: 16

Saying that the average graduating age of the section is 16 years old falls under descriptive statistics.
But saying that the average graduating age of the succeeding batches is 16 years old falls under inferential statistics.

TYPES OF VARIABLES
❖ Qualitative variables – variables that can be placed into distinct categories, according to some characteristic or attribute.

Examples: gender, sex, school graduated, blood type, movies watched, series finished, etc.

❖ Quantitative variables – numerical and can be ordered or ranked.

Examples: height, weight, age, BMI, grade equivalents, population, number of dogs, number of movies watched, etc.
■ Discrete variables – can have data values that can be counted.
■ Continuous variables – can have an infinite number of data values between any two specific values. They are obtained by measuring. They often include fractions and decimals.

Examples (D = discrete, C = continuous):
Grade Equivalents   D
Population          D
Height              C
BMI                 C

Variables can also be classified by measurement scale. The four common levels of measurement are:

❖ Nominal level of measurement
■ classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Some examples include eye color, religion, degree program, and nationality.

❖ Ordinal level of measurement
■ classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Some examples include letter grades, clothes sizes, and level of educational attainment.

❖ Interval level of measurement
■ ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Some examples include temperature, dates in a certain month, and calendar years.

❖ Ratio level of measurement
■ possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
The simple quantitative variables you usually think of are under ratio measurements: height, weight, age, volume and so on.

TYPES OF SAMPLING
❖ Random sampling
❖ Systematic sampling
❖ Stratified sampling
❖ Cluster sampling

TYPES OF STATISTICAL STUDIES
❖ Observational studies – the researcher merely observes what is happening or what has happened in the past and tries to draw conclusions based on these observations.
❖ Experimental studies – the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables.
■ Explanatory variable
● the variable that is manipulated
● also called independent variable
■ Outcome variable
● the variable affected by the manipulated variable
● also called dependent variable

PROBABILITY & COUNTING RULES

❖ Probability – quantifies the chance of an event or outcome occurring. It says how likely (or unlikely) something is to happen in a given trial.
■ A trial is a chance process that leads to well-defined results called outcomes.

❖ Sample Space – the set of all possible outcomes in a trial.

Examples:
■ Flipping a Coin              S = { H, T }
■ Rolling a Die                S = { 1, 2, 3, 4, 5, 6 }
■ Flipping two coins in a row  S = { HH, TT, HT, TH }

■ A tree diagram may be used to determine the sample space for sequences of events.
Example: Biological Sex for two Children in a Family
1st child   2nd child
F           M
            F
M           M
            F
S = { FM, FF, MM, MF }
Cardinality: n(S) = 4

There are two common ways in which probabilities are determined:
❖ Classical Probability – uses sample spaces to determine the numerical probability that an event will happen. It assumes that all outcomes in the sample space are equally likely to occur.
❖ Empirical Probability – uses actual observations to determine the frequency of an event occurring.
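As a rough illustration of sample spaces and classical probability, the sketch below lists the outcomes of a multi-stage trial (the leaves of a tree diagram) and computes a probability as the ratio n(E) / n(S). The helper names `sample_space` and `classical_probability` are made up for this example, and the code assumes every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product


def sample_space(*stages):
    """List every outcome sequence, one choice per stage (tree-diagram leaves)."""
    return ["".join(outcome) for outcome in product(*stages)]


def classical_probability(event, space):
    """P(E) = n(E) / n(S), assuming all outcomes in S are equally likely."""
    return Fraction(len(event), len(space))


# Two coins in a row, and biological sex for two children in a family.
two_coins = sample_space("HT", "HT")
two_children = sample_space("FM", "FM")

# Rolling a die: the event "rolling an even number".
die = list(range(1, 7))
even = [x for x in die if x % 2 == 0]

print(two_coins)                          # the four outcomes HH, HT, TH, TT
print(len(two_children))                  # cardinality n(S) = 4
print(classical_probability(even, die))   # 1/2
```

Using `Fraction` keeps the result exact (1/2 instead of 0.5), which matches how the probabilities are written in these notes.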
Law of Large Numbers – the empirical probability should approach the classical probability, if available, assuming the trials are fair, as the number of trials becomes large.

❖ Event – consists of a desired set of outcomes of a trial.
■ The set of outcomes of an event is a subset of the sample space of the trial.
■ An event is a simple event if it only has one outcome.
■ Otherwise, it is a compound event, which consists of two or more outcomes.

Example:
Rolling a Die
E1: Rolling a 2             E1 = { 2 }         n(E1) = 1   simple event
E2: Rolling a number > 5    E2 = { 6 }         n(E2) = 1   simple event
E3: Rolling an even number  E3 = { 2, 4, 6 }   n(E3) = 3   compound event
E4: Rolling a prime number  E4 = { 2, 3, 5 }   n(E4) = 3   compound event

■ Mutually exclusive events – events whose sets of outcomes have nothing in common.

For classical probability, each outcome is equally likely to occur.

P(E) = n(E) / n(S) = (number of outcomes in E) / (total number of outcomes in the sample space)

Example:
Probability of getting a tails on a coin flip
P(E) = n(E) / n(S) = 1/2 = 0.5 = 50%

Are complementary events always mutually exclusive? Yes
Are mutually exclusive events always complementary? No

COUNTING RULES

❖ Permutation – used for counting arrangements where order matters.
❖ Combination – used for counting arrangements where order does not matter.
■ Fundamental Counting Rule – if the events are independent, then the total number of outcomes is simply the product of the number of outcomes of each successive event.

DISCRETE PROBABILITY DISTRIBUTIONS

❖ A random variable is a variable whose values are determined by chance.
■ For every possible value of the random variable, there is a corresponding probability.
■ If the probabilities for each value of the random variable are plotted, then what we get is a probability distribution.
■ If the variable can assume discrete data values, then what we get is a discrete probability distribution.

Examples:

1. Consider the trial of flipping a coin:
Let X = the number of heads that appear
a) What are the possible values of X?
b) What is the probability P(X) for each X?
S = { H, T }, n(S) = 2
X = { 0, 1 }
P(X) = n(X) / n(S)

X      0    1
P(X)   1/2  1/2

2. Now, consider the trial of flipping two coins:
Let X = the number of heads that appear
a) What are the possible values of X?
b) What is the probability P(X) for each X?
c) What is the probability of getting at most one head, i.e. P(X ≤ 1)?
S = { HH, TT, HT, TH }
X = { 0, 1, 2 }

X      0    1    2
P(X)   1/4  1/2  1/4

To answer questions such as (c), the sum of several values of P(X) is obtained. This gives an alternative distribution, called the cumulative probability distribution.
❖ The cumulative probability distribution is obtained by adding probabilities, and each sum is plotted versus the random variable values.

Examples:
Consider the trial of rolling a die:
Let X = the number that appears
S = { 1, 2, 3, 4, 5, 6 }
n(S) = 6
X = { 1, 2, 3, 4, 5, 6 }

a) What is its probability distribution?

X      1    2    3    4    5    6
P(X)   1/6  1/6  1/6  1/6  1/6  1/6

b) What is its cumulative probability distribution?

X      1    2    3    4    5    6
CP(X)  1/6  1/3  1/2  2/3  5/6  1

c) What is the probability of rolling at most a 4?
CP(X = 4) = P(X ≤ 4) = 2/3

As a consequence of being a discrete probability distribution, there are two requirements:
❖ The probabilities for each value of the random variable in the sample space must add up to 1.
■ ∑ P(X) = 1
❖ The probability for each value of the random variable in the sample space must be between 0 and 1.
■ 0 ≤ P(X) ≤ 1

MEAN, VARIANCE & STANDARD DEVIATION
❖ Mean
■ It describes the "center" of all possible values of a random variable.
■ can sometimes be interpreted as the value that the variable assumes on average.
■ μ ≡ mean = ∑ [X × P(X)]

❖ Variance
■ describes how far the values of the random variable spread out from the mean.
■ σ² ≡ variance = ∑ [X² × P(X)] − μ² = ∑ [(X − μ)² × P(X)]

❖ Standard Deviation
■ σ ≡ standard deviation = √(σ²)

Rounding Rule: When reporting the mean, variance and standard deviation, round them to one more decimal place than the values of X.

EXPECTED VALUE
❖ The expected value is the single value that we expect the mean to approach as we obtain more samples.
■ if the mean is based on an ideal probability distribution (e.g. for coin flips, dice rolls), this mean is exactly the expected value: after a lot of rolls, we expect to get the mean value on average.

CONTINUOUS PROBABILITY DISTRIBUTIONS AND THE NORMAL DISTRIBUTION

Figure 1: normal distribution

Properties
❖ Bell-shaped
■ bell curve
■ normal curve
■ Gaussian distribution (after Carl Friedrich Gauss)

❖ Mean = Median = Mode
■ Median = value of x which divides the area under the curve into two equal parts
■ Mode = value of x that corresponds to the "highest point"
■ Mean = calculated in a similar way as for discrete variables
● ∑ [X × P(X)]

❖ Unimodal

❖ Symmetric about the mean

❖ Continuous
■ never touches the x-axis

❖ Total Area = 1

❖ Empirical Rule (68, 95, 99.7 rule)
■ one standard deviation above and below the mean = 68% of total area
■ two standard deviations above and below the mean = 95% of total area
■ three standard deviations above and below the mean = 99.7% of total area

NORMAL DISTRIBUTION - EMPIRICAL RULE

Example:
The heights in a school's population were determined to have a mean of 66 inches, with a standard deviation of 3 inches. Assume that height is a normally distributed variable.
In a batch of 1000 students, estimate:

a. how many students are between 63 and 69 inches tall
   P(63 < x < 69) = 0.68
   1000 × 0.68 = 680 students
b. how many students are between 60 and 72 inches tall
   P(60 < x < 72) = 0.95
   1000 × 0.95 = 950 students
c. how many students are between 66 and 72 inches tall
   P(66 < x < 72) = 0.95 / 2 = 0.475
   1000 × 0.475 = 475 students
d. how many students are shorter than 69 inches
   P(X < 69) = 0.5 + 0.68 / 2 = 0.84
   1000 × 0.84 = 840 students
e. how many students are taller than 69 inches
   P(X > 69) = 0.5 − 0.68 / 2 = 0.16
   1000 × 0.16 = 160 students
f. how many students are between 57 and 69 inches tall
   P(57 < x < 69) = 0.997 / 2 + 0.68 / 2 = 0.8385
   1000 × 0.8385 ≈ 839 students
g. how many students are shorter than 62 inches
   ?

INTRODUCING: z-values
❖ z ≡ number of standard deviations from the mean
❖ z = (X − μ) / σ

STANDARD NORMAL DISTRIBUTION
❖ is a normal distribution with μ = 0 and σ = 1

CENTRAL LIMIT THEOREM
❖ Sample Mean (x̄)
■ denoted by x̄, is obtained not from the entire population, but only from a sample.
■ the best available estimate of the population mean.
■ x̄ = ∑x / n, where sample size = n

Example:
x = { 1, 2, 3, 4, 5 }
n = 5
x̄ = ∑x / n = (1 + 2 + 3 + 4 + 5) / 5 = 3

❖ Sample Standard Deviation (s)
■ s = √[ ∑(x − x̄)² / (n − 1) ]

Example:
x = { 1, 2, 3, 4, 5 }
n = 5
x̄ = 3
x − x̄ = { 1 − 3, 2 − 3, 3 − 3, 4 − 3, 5 − 3 }
x − x̄ = { −2, −1, 0, 1, 2 }
(x − x̄)² = { (−2)², (−1)², 0², 1², 2² }
∑(x − x̄)² = 10
s = √[ 10 / (5 − 1) ] ≈ 1.6

The Central Limit Theorem states that the sample mean will be normally distributed (under the conditions listed below). This leads to some useful applications.

❖ Considering all possible samples of the same sample size n, the mean of the sample means is equal to the population mean.
■ μ_x̄ = μ

❖ Considering all possible samples of the same sample size n, the standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size n:
■ σ_x̄ = σ / √n

❖ The sample mean will exhibit a normal distribution if either X is normally distributed, or n ≥ 30.

Example:
Consider the problem from earlier on kids' TV habits. Determine the mean and standard deviation of the sample mean if it is known that μ = 25.000, σ = 3.000, and n = 10.
μ_x̄ = μ = 25.000
σ_x̄ = σ / √n = 3.000 / √10 = 0.949

CONFIDENCE INTERVAL
❖ Recall the empirical rule for normal distributions (i.e. 68-95-99.7 rule).
P(−1 < z < 1) ≈ 0.68
P(−2 < z < 2) ≈ 0.95
P(−3 < z < 3) ≈ 0.997

Finding a from a given area with NORM.S.INV:
P(z > a) = Area        a = NORM.S.INV(1 − Area)
P(z < a) = Area        a = NORM.S.INV(Area)
P(−a < z < a) = Area   −a = NORM.S.INV(0.5 − Area / 2)

Formulas:
❖ x̄ − a × σ / √n < μ < x̄ + a × σ / √n
❖ x̄ − z_α/2 × σ / √n < μ < x̄ + z_α/2 × σ / √n
❖ where z_α/2 × σ / √n is called the maximum error of the estimate (margin of error)

The constant a will now be referred to as z_α/2. This is to signify that it is a z-value obtained from the standard normal distribution.

Notice also that the confidence interval is two-sided – the population mean is bounded on both sides, i.e. it is of the form L < μ < U.

Lastly, this formulation of the confidence interval assumes that the population standard deviation σ is already known.

HYPOTHESIS TESTING
Three Methods
❖ the traditional method
❖ the p-value method
❖ the confidence interval method

Statistical Hypothesis – a conjecture about a population parameter.
❖ Null Hypothesis (H0) – states that there is no difference between two parameters
❖ Alternative Hypothesis (H1) – states that a difference exists between two parameters

Two-tailed test   Right-tailed test   Left-tailed test
H0: μ = k         H0: μ = k           H0: μ = k
H1: μ ≠ k         H1: μ > k           H1: μ < k

STATISTICAL INFERENCE
❖ Statistical test – uses the data obtained from a sample to make a decision about whether to reject the null hypothesis
❖ Test Value – the numerical value obtained from a statistical test
❖ Type 1 error – occurs if you reject the null hypothesis when it is true
❖ Type 2 error – occurs if you do not reject the null hypothesis when it is false

❖ Testing the Difference Between Two Means
■ z test
■ t test
● t test for independent samples with equal variances
● t test for independent samples with unequal variances
● t test for dependent samples

❖ Testing the Difference Between Two Proportions
■ z test

❖ Testing the Difference Between Two Variances
■ F test
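The normal-distribution calculations in these notes (the heights example, z-values, the sample mean and sample standard deviation, the standard error σ / √n, and the confidence interval for μ with σ known) can be sketched with Python's standard-library `statistics.NormalDist`. This is only a sketch under stated assumptions: `inv_cdf` stands in for the spreadsheet function NORM.S.INV, the helper names are invented here, and the sample mean of 25.0 passed to `confidence_interval` is a hypothetical value, since the notes do not give one.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Heights example: mean 66 inches, standard deviation 3 inches, batch of 1000.
heights = NormalDist(mu=66, sigma=3)


def expected_count(low, high, batch=1000):
    """Expected number of students whose height lies between low and high."""
    return batch * (heights.cdf(high) - heights.cdf(low))


# Item a: the exact area is about 0.6827, which the empirical rule rounds to 0.68.
between_63_and_69 = expected_count(63, 69)

# Item g (left open in the notes): use z = (X - mu) / sigma and the standard normal.
z = (62 - 66) / 3
p_shorter_than_62 = NormalDist().cdf(z)

# Sample mean and sample standard deviation (n - 1 in the denominator).
x = [1, 2, 3, 4, 5]
x_bar = mean(x)
s = stdev(x)

# Standard deviation of the sample mean: sigma / sqrt(n).
standard_error = 3.0 / sqrt(10)


def confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided interval for mu, assuming the population sigma is known."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * sigma / sqrt(n)   # maximum error of the estimate
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample mean of 25.0 with sigma = 3.0 and n = 10.
low, high = confidence_interval(25.0, 3.0, 10)

print(round(between_63_and_69, 1))    # about 682.7
print(round(p_shorter_than_62, 3))    # about 0.091
print(x_bar, round(s, 2))             # 3 and about 1.58
print(round(standard_error, 3))       # 0.949
print(round(low, 2), round(high, 2))  # about 23.14 and 26.86
```

Calling `inv_cdf(0.5 + confidence / 2)` returns the positive critical value directly; `inv_cdf(0.5 - confidence / 2)` would return the same number with a negative sign.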


X      2  3  5  6  9
P(X)   given: 1/7, 1/4, 1/14, 1/2; one entry is the unknown "?"

∑ P(X)
CP(X)
P(X ≥ 3)
P(X ≤ 5)
P(X is odd)

X      1  2  5  6  8
P(X)   given: 1/14, 1/2, 1/28, 1/7; one entry is the unknown "?"

∑ P(X)
CP(X)
P(X ≥ 2)
P(X ≤ 5)
P(X is odd)

X      3.0   4.0   7.0   7.1   8.2   8.3
P(X)   1/15  1/2   1/12  1/20  1/10  1/5

μ
σ²
σ

p
q

X      2.5   6.6   6.9   8.3   8.7   9.2
P(X)   1/10  1/6   2/5   3/20  1/20  2/15

μ
σ²
σ

p
q

X      2  3  7  9  10
P(X)   given: 1/2, 1/28, 1/14, 1/4; one entry is the unknown "?"

∑ P(X)
CP(X)
P(X ≥ 3)
P(X ≤ 7)
P(X is odd)

X      1.8   2.1   3.0   3.5   7.0   8.7
P(X)   7/30  1/20  1/15  5/12  1/30  1/5

μ
σ²
σ

p
q
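Exercises of this kind rely on the two requirements for a discrete probability distribution and on the formulas for CP(X), μ, σ² and σ. The sketch below is one possible way to check such work in Python, shown on the two-coin and die distributions from the notes and not on the exercise tables. The function names are invented for this example, and the probabilities passed to `missing_probability` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate
from math import sqrt


def is_valid(dist):
    """Check both requirements: 0 <= P(X) <= 1 for each X, and sum of P(X) = 1."""
    probs = dist.values()
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1


def cumulative(dist):
    """CP(X): running sums of P(X) taken over the values of X in increasing order."""
    xs = sorted(dist)
    return dict(zip(xs, accumulate(dist[x] for x in xs)))


def mean(dist):
    """mu = sum of X * P(X)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """sigma^2 = sum of X^2 * P(X), minus mu^2."""
    return sum(x * x * p for x, p in dist.items()) - mean(dist) ** 2


def missing_probability(known_probs):
    """The one unknown P(X), using the requirement that all P(X) add up to 1."""
    return 1 - sum(known_probs)


two_coins = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
die = {x: Fraction(1, 6) for x in range(1, 7)}

print(is_valid(two_coins))                  # True
print(cumulative(two_coins)[1])             # P(X <= 1) = 3/4
print(cumulative(die)[4])                   # P(X <= 4) = 2/3
print(mean(two_coins))                      # 1
print(variance(two_coins))                  # 1/2
print(round(sqrt(variance(two_coins)), 2))  # 0.71

# Hypothetical known probabilities 1/2 and 1/3 leave 1/6 for the unknown entry.
print(missing_probability([Fraction(1, 2), Fraction(1, 3)]))
```

Exact fractions avoid floating-point round-off when testing whether the probabilities add up to exactly 1.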
