Data Science 01 - Basics
Introduction to Statistical Analysis
Inas A. Yassine, PhD
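Basic Probability:
Definitions
§ Axioms of Probability:
1. P(S) = 1, where S is the sample space;
2. 0 ≤ P(E) ≤ 1 for any event E;
3. given mutually exclusive events E1 and E2 (E1 ∩ E2 = Φ), P(E1 ∪ E2) = P(E1) + P(E2).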
§ These axioms imply:
◦ P(E ') = 1− P(E)
◦ P(Φ) = 0
◦ P(E1∪ E2) = P(E1) + P(E2) − P(E1∩ E2)
§ Random Variable (RV): a function or rule that assigns a
numerical value to each outcome in the sample space of a
random experiment.
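The identities implied by the axioms can be checked on a small experiment. The sketch below assumes a fair six-sided die with equally likely outcomes; the events E1 and E2, the helper `prob`, and the indicator variable `X` are illustrative names, not part of the lecture.

```python
from fractions import Fraction

# Sample space of one roll of a fair six-sided die (assumed toy experiment).
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

E1 = {2, 4, 6}  # "the roll is even"
E2 = {4, 5, 6}  # "the roll is greater than 3"

p_S = prob(S)                                        # axiom 1
p_complement = prob(S - E1)                          # P(E1')
p_empty = prob(set())                                # P(Φ)
p_union = prob(E1 | E2)                              # P(E1 ∪ E2)
p_incl_excl = prob(E1) + prob(E2) - prob(E1 & E2)    # right-hand side

# A random variable is a function on outcomes, e.g. an "is even" indicator.
def X(outcome):
    return 1 if outcome % 2 == 0 else 0

x_values = {outcome: X(outcome) for outcome in S}

print(p_S, p_complement, p_empty, p_union, p_incl_excl)
```

Exact fractions are used instead of floats so the equalities hold without rounding tolerance.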
Statistical Data
Analysis
1. BASIC PROBABILITY
2. DATA COLLECTION AND DESCRIPTION
3. DATA PRESENTATION
Data Collection and Description:
Types of statistical studies
§ Designed Studies
◦ Collect observations of the resulting system output data.
◦ Example: study the effectiveness of a new drug.
◦ Understand the population characteristics.
◦ Use a control group (placebo).
§ Retrospective Study
◦ Uses either all or a sample of the historical process data.
◦ Example: analyze the number of website accesses during the past month, or the $ exchange rate.
◦ You cannot study conditions that did not occur during the data sampling interval (e.g., the number of accesses during Ramadan).
§ Observational Study
◦ Observe a (manufacturing) process or population, e.g., monitoring of social behavior.
◦ Usually conducted over a short time period.
◦ Can include some sophisticated measurements that are not usually taken.
Data Collection and Description:
Basic concepts
§ Population vs. Sample
§ Population and Sample characteristics
◦ Sample statistics: x̄, s, ...
◦ Population parameters: µ, σ, ...
Data Description
Population Mean and Variance
§ Population Mean:
$$\mu = E(x) = \sum_{i=1}^{N} x_i \, P(x_i)$$
§ Population Variance:
$$\sigma^2 = E\big((x - \mu)^2\big) = \sum_{i=1}^{N} (x_i - \mu)^2 \, P(x_i)$$
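As a small illustration of the two population formulas, the sketch below evaluates them on an assumed probability mass function (the number of heads in two fair coin tosses). The dictionary `pmf` and the function names are hypothetical.

```python
from fractions import Fraction

# Assumed PMF: X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def population_mean(pmf):
    """mu = sum over i of x_i * P(x_i)."""
    return sum(x * p for x, p in pmf.items())

def population_variance(pmf):
    """sigma^2 = sum over i of (x_i - mu)^2 * P(x_i)."""
    mu = population_mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mu = population_mean(pmf)
var = population_variance(pmf)
print(mu, var)
```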
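Data Description:
Higher order statistics
§ Skewness
$$SK = \frac{E\big((x - \mu)^3\big)}{\sigma^3} \approx \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$$
§ Kurtosis
$$KRT = \frac{E\big((x - \mu)^4\big)}{\sigma^4} \approx \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4}$$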
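The sample approximations for skewness and kurtosis can be written out directly. The sketch below follows the slide formulas as given, with s taken as the sample standard deviation (n − 1 in the denominator). The function names and test data are illustrative. Note that some software reports excess kurtosis instead, subtracting 3(n − 1)² / ((n − 2)(n − 3)) from the kurtosis term, so results may differ from a library call.

```python
import math

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation s, using n - 1 in the denominator."""
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def sample_skewness(xs):
    """n / ((n-1)(n-2)) * sum(((x_i - xbar) / s)^3); needs n >= 3."""
    n = len(xs)
    m, s = sample_mean(xs), sample_std(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def sample_kurtosis(xs):
    """n(n+1) / ((n-1)(n-2)(n-3)) * sum(((x_i - xbar) / s)^4); needs n >= 4."""
    n = len(xs)
    m, s = sample_mean(xs), sample_std(xs)
    factor = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return factor * sum(((x - m) / s) ** 4 for x in xs)

symmetric = [1, 2, 3, 4, 5]      # symmetric about its mean
right_skewed = [1, 1, 1, 2, 10]  # long tail to the right

print(sample_skewness(symmetric), sample_skewness(right_skewed))
print(sample_kurtosis(symmetric))
```

Symmetric data gives a skewness of zero, while a long right tail gives a positive value.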
Source: http://openi.nlm.nih.gov/
Data Description:
Higher order statistics
Example:
Probability density functions that have the same variance (= 1) but different excess kurtosis (eKr):
◦ D: Laplace dist., eKr = 3
◦ S: hyperbolic secant dist., eKr = 2
◦ N: normal dist., eKr = 0
◦ C: raised cosine dist., eKr = −0.59
◦ W: Wigner semicircle dist., eKr = −1
◦ U: uniform dist., eKr = −1.2
[Figure: two plots of Probability versus Value of X, for X = 0 to 14; left axis range 0.00 to 0.25, right axis range 0.00 to 1.00]
Assignment 1:
DS1 = EuStockMarkets$DAX
DS2 = EuStockMarkets$SMI
1. Given DS1, estimate the mean and variance of the absolute change in the daily stock price (use for-loops).
2. Repeat for DS2.
3. Estimate (in an efficient way) the overall mean (as given in the lecture) and variance of the combined dataset DS1 ∪ DS2.
4. Calculate the skewness and kurtosis of DS1.
5. Plot the histogram and box plot of the combined dataset.
Hint: For the variance, you can use the following formula,
Statistical Analysis
More on Data Description
1. MARGINAL, JOINT, AND CONDITIONAL PROBABILITY
2. EXPECTED VALUES
3. LINEAR TRANSFORMATION OF RV
Data Description:
Marginal, Joint, and Conditional Prob.
§ Marginal Prob.: P(X=x)
§ Joint Prob.: P(X=x , Y=y) ; P(Y=y , X=x)
Data Description:
Marginal, Joint, and Conditional Prob.
§ Partitions and the total probability theorem (for events Bi that partition the sample space):
◦ P(A) = Σi P(A, Bi)
◦ P(X=x) = Σi P(X=x, Y=yi)
§ Conditional:
◦ P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
◦ Or, P(X=x, Y=y) = P(X=x | Y=y) · P(Y=y)
◦ Notice (total prob.): P(X=x) = Σi P(X=x | Y=yi) P(Y=yi)
§ Independent Variables:
◦ P(X=x, Y=y) = P(X=x) · P(Y=y)
◦ Hence P(X=x | Y=y) = P(X=x) · P(Y=y) / P(Y=y) = P(X=x)
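The relations between marginal, joint, and conditional probabilities can be traced on a small joint table. The sketch below assumes a made-up joint PMF over two binary variables; the table values and helper names are hypothetical and chosen so that X and Y are not independent.

```python
from fractions import Fraction

# Assumed joint PMF P(X=x, Y=y), keyed by (x, y). Values are illustrative.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0) + p
    return out

def conditional_x_given_y(joint, x, y):
    """P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)."""
    return joint.get((x, y), 0) / marginal(joint, 1)[y]

def independent(joint):
    """True if P(X=x, Y=y) = P(X=x) P(Y=y) for every pair (x, y)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
p_x0_given_y0 = conditional_x_given_y(joint, 0, 0)

# Total probability: P(X=0) = sum over y of P(X=0 | Y=y) P(Y=y)
total = sum(conditional_x_given_y(joint, 0, y) * p_y[y] for y in p_y)

print(p_x, p_y, p_x0_given_y0, total, independent(joint))
```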
Data Description:
Marginal, Joint, and Conditional Prob.
§ Example 1:
Box 1 (B1) holds 20 red + 20 yellow balls and is chosen with P(B1) = 0.7.
Box 2 (B2) holds 10 red + 20 yellow balls and is chosen with P(B2) = 0.3.
If a yellow ball is picked, what is the probability that it came from Box 2?
§ Example 2:
Suppose a drug test is 99% sensitive (true positive rate) and 99% specific (true negative rate), and that 0.5% of people are drug users. If a randomly selected individual tests positive, what is the probability that he or she really is a drug user?
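Both examples are applications of the conditional and total probability rules, which combine into Bayes' rule. The sketch below is one possible way to compute them; the function `posterior` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction

def posterior(prior, likelihood, hypothesis):
    """P(hypothesis | evidence) over a partition of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(evidence | h)
    """
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # total probability
    return prior[hypothesis] * likelihood[hypothesis] / evidence

# Example 1: which box did the yellow ball come from?
box_prior = {"B1": Fraction(7, 10), "B2": Fraction(3, 10)}
p_yellow = {"B1": Fraction(20, 40), "B2": Fraction(20, 30)}
p_box2_given_yellow = posterior(box_prior, p_yellow, "B2")

# Example 2: is a person who tested positive really a user?
user_prior = {"user": Fraction(5, 1000), "non_user": Fraction(995, 1000)}
p_positive = {"user": Fraction(99, 100),      # sensitivity
              "non_user": Fraction(1, 100)}   # 1 - specificity
p_user_given_positive = posterior(user_prior, p_positive, "user")

print(p_box2_given_yellow, float(p_box2_given_yellow))
print(p_user_given_positive, float(p_user_given_positive))
```

Under these inputs the answers are 4/11 (about 0.36) for Example 1 and 99/298 (about 0.33) for Example 2. The second result shows how a low prior keeps the posterior modest even for an accurate test.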
Data Description:
Marginal, Joint, and Conditional Prob.
§ Useful Definitions/Terminology:
§ What is E(a statistic)?
$$E(\bar{x}) = E\Big(\frac{1}{n} \sum_{i=1}^{n} x_i\Big)$$
$$E(S^2) = E\Big(\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\Big)$$
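The expected value of a statistic can be computed exactly for a tiny case by enumerating every possible sample. The sketch below assumes independent draws from a made-up three-point PMF with n = 3; under that assumption the enumeration shows E(x̄) equal to µ and E(S²) equal to σ². All names and values are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed population PMF and sample size.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
n = 3

mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

def expected_statistic(statistic, pmf, n):
    """E(statistic) over all samples of n independent draws from pmf."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]  # independence: probabilities multiply
        total += statistic(sample) * prob
    return total

def xbar(sample):
    return Fraction(sum(sample), len(sample))

def s2(sample):
    """Sample variance with n - 1 in the denominator."""
    m = xbar(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

e_xbar = expected_statistic(xbar, pmf, n)
e_s2 = expected_statistic(s2, pmf, n)
print(mu, e_xbar, sigma2, e_s2)
```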
Data Description
Expected Value
§ The ‘Expected value’ can be thought of as an operator that
estimates the mean value of a quantity given all its possible
values.
§ The E(.) operator is linear
§ Example: E(5X −Y + 3) = 5E(X )− E(Y ) + 3
§ Two random variables are said to be uncorrelated if
E(X·Y) = E(X)·E(Y)
§ Any two independent RVs are also uncorrelated.
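Linearity and the uncorrelated condition can both be checked on one small joint distribution. The sketch below assumes X uniform on {−1, 0, 1} and Y = X², a standard illustration in which the variables are uncorrelated but not independent, so the converse of the last statement does not hold in general. The names are hypothetical.

```python
from fractions import Fraction

# Assumed joint PMF of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
joint = {(-1, 1): Fraction(1, 3), (0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3)}

def expect(g, joint):
    """E(g(X, Y)) = sum of g(x, y) * P(X=x, Y=y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

E_X = expect(lambda x, y: x, joint)
E_Y = expect(lambda x, y: y, joint)
E_XY = expect(lambda x, y: x * y, joint)

# Linearity: E(5X - Y + 3) = 5 E(X) - E(Y) + 3
lhs = expect(lambda x, y: 5 * x - y + 3, joint)
rhs = 5 * E_X - E_Y + 3

uncorrelated = (E_XY == E_X * E_Y)

# Independence would need P(X=0, Y=0) = P(X=0) P(Y=0); here 1/3 != 1/9.
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
independent_at_00 = (joint[(0, 0)] == p_x0 * p_y0)

print(lhs, rhs, uncorrelated, independent_at_00)
```

Linearity holds here even though X and Y are dependent, since it does not require independence.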