Data Science 01 - Basics

The document provides an introduction to statistical analysis. It discusses that statistical analysis deals with collecting, presenting, describing, and analyzing data. The objectives of statistical analysis include describing and understanding phenomena as well as predicting outcomes. Some key concepts covered include statistics, probability, data collection and description methods like studies and basic concepts, describing data through measures like the mean, variance, skewness and kurtosis, probability functions, and methods of data presentation including stem-and-leaf diagrams, histograms, box plots, and time series plots.

Uploaded by

lomeroaia
Introduction to Statistical Analysis
Inas A. Yassinee, PhD

Recommended Reference:
“Applied Statistics and Probability for Engineers,” by Douglas C. Montgomery and George C. Runger; Publisher: Wiley.
Statistical Analysis
§ Deals with Data
§ Collection,
§ Presentation,
§ Description, and
§ Analysis
§ The objectives include:
§ describe/understand a phenomenon,
§ predict outcomes,
Statistical Analysis
§ A statistic: is a calculated numerical value that characterizes some
aspect of a sample set of data.
§ Why statistics? Because of randomness.
§ Applications:
§ Analyzing traffic patterns
§ Predicting stock market changes
§ Weather Forecasts/Analysis
§ Monitoring effectiveness of design, medication, solution, …
§ Assessment of the quality of a product, service, person, …
§ …
Overview…
§ In statistics, we deal with randomness…
§ Mean/Average is not sufficient!
§ Variance.
§ Other useful descriptors: Skewness and Kurtosis
§ More descriptors: Marginal, Joint, and conditional
Probabilities.
Overview…
Given a website (www.NU.edu), let's ask some direct questions:
• What is the number of visitors to the website?
• What is the maximum number of visitors? The minimum?
Overview…
Now, let’s ask more advanced questions:
• In July, we started a marketing campaign, was it effective?
• What is the expected number of visitors next month?
• Compared to www.AUC.edu, which site has more visitors?
• Which day of the week has the largest number of visitors?
Statistical Data
Analysis
1. BASIC PROBABILITY
2. DATA COLLECTION AND DESCRIPTION
3. DATA PRESENTATION
Basic Probability:
Definitions
§ Random Experiment (e.g. tossing a coin or a die)
§ Sample Space: all possible outcomes of the experiment.
§ Event is a subset of the sample space (e.g. odd outcomes)
§ Venn Diagrams, Union, Intersection.

§ Axioms of Probability:
1. P(S) = 1;
2. 0 ≤ P(E) ≤ 1;
3. given mutually exclusive events E1 and E2 (E1 ∩ E2 = ∅), P(E1 ∪ E2) = P(E1) + P(E2).
Basic Probability:
Definitions
§ These axioms imply:
◦ P(E′) = 1 − P(E)
◦ P(∅) = 0
◦ P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
§ Random Variable (RV): a function or rule that assigns a
numerical value to each outcome in the sample space of a
random experiment.
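The definitions above can be checked on a small case. The following is a minimal sketch, assuming a fair six-sided die with equally likely outcomes; the events and the helper `prob` are illustrative names, not part of the slides.

```python
from fractions import Fraction

# Sample space of one roll of a fair six-sided die (an assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space), outcomes equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

odd = {1, 3, 5}        # event E1: the outcome is odd
at_least_5 = {5, 6}    # event E2: the outcome is 5 or 6

# Complement rule: P(E') = 1 - P(E)
p_complement = prob(sample_space - odd)

# Union rule: P(E1 u E2) = P(E1) + P(E2) - P(E1 n E2)
p_union = prob(odd | at_least_5)
p_union_by_rule = prob(odd) + prob(at_least_5) - prob(odd & at_least_5)

print(prob(sample_space), prob(set()), p_complement, p_union, p_union_by_rule)
```

Exact fractions are used so that the identities hold without floating-point rounding.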
Statistical Data
Analysis
1. BASIC PROBABILITY
2. DATA COLLECTION AND DESCRIPTION
3. DATA PRESENTATION
Data Collection and Description:
Types of statistical studies
§ Designed Studies
§ Collect the observations of the resulting system output data.
§ Example: study effectiveness of a new drug.
§ Understand the population characteristics
§ Control (placebo)
§ Retrospective Study
§ Uses either all or a sample of the historical process data.
§ Example: analyze the number of website accesses during the past month; the $ exchange rate.
§ You cannot study conditions that did not occur during the data sampling interval (e.g. the number of accesses during Ramadan).
§ Observational Study
§ Observe a (manufacturing) process or population; monitoring of social behavior
§ Usually conducted for short time period.
§ Can include some sophisticated measurements that are not usually measured
Data Collection and Description:
Basic concepts
§ Population vs. Sample
§ Population and Sample characteristics

Sample statistics: x̄, s, …

Population parameters: µ, σ, …
Data Description
Population Mean and Variance
§ Population Mean:

$$\mu = E(x) = \sum_{i=1}^{N} x_i \, P(x_i)$$

§ Population Variance:

$$\sigma^2 = E\big((x - \mu)^2\big) = \sum_{i=1}^{N} (x_i - \mu)^2 \, P(x_i)$$

Where N is the total number of instances in the population.


Data Description:
Sample Mean and Variance
§ Given a sample of size n, the population mean can be estimated by the sample average*:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

§ The population variance can also be estimated by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

§ The values depend on the selection of the sample; i.e., a different sample gives a different estimate.
§ What are the “mean” and “variance” of these estimators?
*Is there an efficient way to estimate the mean of a stream of data samples?
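One possible answer to the streaming question is to update the estimates one sample at a time instead of storing the whole stream. The sketch below uses Welford's online update, which is a common choice and not necessarily the method intended in the lecture; the class name `RunningStats` and the sample stream are illustrative.

```python
import statistics

class RunningStats:
    """Running mean and sample variance, updated one sample at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # new mean from old mean and x
        self.m2 += delta * (x - self.mean)   # uses both old and new mean

    def variance(self):
        # Unbiased estimate with the n - 1 divisor; undefined for n < 2.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

# Illustrative stream of samples.
stream = [88, 99, 95, 89, 63, 99, 100, 89, 98, 100]
stats = RunningStats()
for x in stream:
    stats.update(x)

print(stats.mean, stats.variance())
print(statistics.mean(stream), statistics.variance(stream))  # batch check
```

Only three numbers are kept in memory (count, mean, and the sum of squared deviations), regardless of how long the stream is.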
Data Description:
Higher order statistics
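§ Why are the mean and variance so commonly used? (Normal distribution)
§ Higher order statistics?
§ Skewness

$$SK = \frac{E\big((x - \mu)^3\big)}{\sigma^3} \approx \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$$

§ Kurtosis

$$KRT = \frac{E\big((x - \mu)^4\big)}{\sigma^4} \approx \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4}$$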
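The sample estimators for skewness and kurtosis can be written directly from the formulas on this slide. This is a minimal sketch; the function name `sample_moments` and the test data are assumptions. Note that the leading factors approach 1/n only for large n, so small samples can give inflated kurtosis values.

```python
import math

def sample_moments(data):
    """Return (mean, variance, skewness, kurtosis) using the slide's sample formulas."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 samples for the kurtosis factor")
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    if var == 0:
        raise ValueError("all samples are equal; skewness and kurtosis are undefined")
    s = math.sqrt(var)
    # SK ~ n / ((n-1)(n-2)) * sum((x_i - mean)^3) / s^3
    skew = n / ((n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in data) / s ** 3
    # KRT ~ n(n+1) / ((n-1)(n-2)(n-3)) * sum((x_i - mean)^4) / s^4
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
            * sum((x - mean) ** 4 for x in data) / s ** 4)
    return mean, var, skew, kurt

mean, var, skew, kurt = sample_moments([1, 2, 3, 4, 5])
print(mean, var, skew, kurt)
```

For the symmetric data used here the skewness is zero, as expected.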
Data Description:
Higher order statistics

Source: http://openi.nlm.nih.gov/
Data Description:
Higher order statistics
Example:
Probability functions that have the same variance (= 1) but different kurtosis:
§ D: Laplace dist., eKr = 3
§ S: hyperbolic secant dist., eKr = 2
§ N: normal dist., eKr = 0
§ C: raised cosine dist., eKr = −0.59
§ W: Wigner semicircle dist., eKr = −1
§ U: uniform dist., eKr = −1.2

*eKr: excess kurtosis (kurtosis minus 3, the kurtosis of a Gaussian). Source: http://wikipedia.org/


Data Description:
Probability Functions
§ A probability function provides a complete description of the randomness of the sample/population.
§ Probability Function:
§ PDF/PMF: non-negative values; a PMF sums to 1 and a PDF integrates to 1.
§ CDF: integration/summation of the PDF/PMF; monotonically non-decreasing; goes from 0 to 1.
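The relation between a PMF and its CDF can be illustrated with a small discrete case. The sketch below assumes the sum of two fair dice as the random variable; that choice is for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Count how often each sum occurs over the 36 equally likely outcomes.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

values = sorted(counts)                         # 2, 3, ..., 12
pmf = [Fraction(counts[v], 36) for v in values]
cdf = list(accumulate(pmf))                     # running sum of the PMF

for v, p, c in zip(values, pmf, cdf):
    print(f"X={v:2d}  PMF={float(p):.3f}  CDF={float(c):.3f}")
```

The printed CDF column rises monotonically and ends at 1, which matches the properties listed above.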
Data Description:
Probability Functions

[Figure: an illustrative PDF (Probability Density Function), left, and the corresponding CDF (Cumulative Distribution Function), right. Horizontal axes: Value of X (0 to 14); vertical axes: Probability.]
Data Description:
Probability Functions
§ Example 1: Normal Distribution

§ Example 2: Uniform Distribution


Data Description:
Probability Functions

§ Example 3: Poisson Distribution (model of arrival)


Data Description:
Probability Functions

§ Example 4: Exponential Distribution
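For reference, the four example distributions named on these slides can be evaluated with their standard textbook forms. This is a sketch under that assumption; the function and parameter names (`mu`, `sigma`, `a`, `b`, `lam`) are illustrative choices.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x, a=0.0, b=1.0):
    """Uniform density on the interval [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def poisson_pmf(k, lam):
    """Poisson probability of k arrivals when the mean number of arrivals is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def exponential_pdf(x, lam):
    """Exponential density with rate lam (defined for x >= 0)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

print(normal_pdf(0.0), uniform_pdf(0.5), poisson_pmf(2, 3.0), exponential_pdf(1.0, 2.0))
```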


Statistical Data
Analysis
1. BASIC PROBABILITY
2. DATA COLLECTION AND DESCRIPTION
3. DATA PRESENTATION
Data Presentation
1. STEM-AND-LEAF DIAGRAMS
2. HISTOGRAMS
3. BOX PLOTS
4. TIME SERIES PLOTS
5. MULTIVARIATE DATA
Data Presentation:
Stem and Leaf Diagram
Data Presentation:
Histogram Plot
§ Use the horizontal axis to represent the measurement scale for the data.
§ Use the vertical axis to represent the counts, or frequencies.
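The counting behind a histogram can be sketched as follows, assuming equal-width bins between the smallest and largest sample. The helper `histogram` and the data are illustrative; a plotting library would normally draw the bars.

```python
def histogram(data, num_bins):
    """Return (bin_edges, counts) for equal-width bins spanning min(data)..max(data)."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all samples are equal; cannot form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The largest sample falls on the last edge, so clamp it into the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

samples = [88, 99, 95, 89, 63, 99, 100, 89, 98, 100]
edges, counts = histogram(samples, 4)

# Text rendering: measurement scale on the left, counts shown as bars.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * c}")
```

Changing `num_bins` changes the apparent shape of the data, which is why it is worth trying several values.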
Data Presentation:
Box Plot
§ Describes several features of a data set, such as center, spread, departure
from symmetry, and identification of odd observations.
§ Odd observations are called “outliers.”
§ The box encloses the interquartile range (IQR) with left at the first quartile,
q1, and the right at the third quartile, q3.
§ A line, or whisker, extends from each end of the box.
§ The lower whisker extends to the smallest data point within 1.5 interquartile ranges of the first quartile.
§ The upper whisker extends to the largest data point within 1.5 interquartile ranges of the third quartile.
Data Presentation:
Box Plot
Example
§ Data samples: 88, 99, 95, 89, 63, 99, 100, 89, 98, 100
§ Sorted samples: 63, 88, 89, 89, 95, 98, 99, 99, 100, 100
§ Lower quartile: Q1 = 89 (the lowest 25% of the samples lie below it).
§ Upper quartile: Q3 = 99 (the largest 25% of the samples lie above it).
§ Hence the box extends from 89 to 99 and the interquartile range is IQR = 99 − 89 = 10.
§ An outlier is any data point that is more than 1.5 times the IQR from either end of the box.
§ 1.5 times the IQR is 1.5 × 10 = 15, so at the upper end an outlier is any data point greater than 99 + 15 = 114.
§ There are no data points larger than 114, so there are no outliers at the upper end.
§ At the lower end an outlier is any data point less than 89 − 15 = 74. There is one data point, 63, which is less than 74, so 63 is an outlier.
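The box plot example can be reproduced numerically. The sketch below assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which yields the values used in the example; other quartile conventions can give slightly different numbers. The function name `box_plot_summary` is illustrative.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, IQR, 1.5*IQR fences, whisker ends, and outliers of a data set."""
    xs = sorted(data)
    n = len(xs)
    q1 = statistics.median(xs[: n // 2])        # median of the lower half
    q3 = statistics.median(xs[(n + 1) // 2:])   # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

samples = [88, 99, 95, 89, 63, 99, 100, 89, 98, 100]
summary = box_plot_summary(samples)
print(summary)
```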
Stem-And-Leaf Diagrams
§ A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set.
§ Each number consists of at least two digits.
§ Steps for constructing:
1. Divide each number into two parts: a stem and a leaf.
2. List the stem values in a vertical column.
3. Record the leaf for each observation beside its stem.
4. Write the units for stems and leaves on the display.
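The construction steps above can be sketched in a few lines, assuming non-negative integer data where the stem is the tens part and the leaf is the units digit. The function name `stem_and_leaf` and the data are illustrative.

```python
def stem_and_leaf(data):
    """Map each stem (tens part) to the sorted list of its leaves (units digits)."""
    stems = {}
    for x in sorted(data):
        stem, leaf = divmod(x, 10)   # step 1: split into stem and leaf
        stems.setdefault(stem, []).append(leaf)
    # Step 2: list every stem between the smallest and largest, even if it has no leaves.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

samples = [88, 99, 95, 89, 63, 99, 100, 89, 98, 100]
diagram = stem_and_leaf(samples)

# Steps 3 and 4: record the leaves and state the units.
print("Stem: tens, Leaf: units")
for stem, leaves in diagram.items():
    print(f"{stem:3d} | {' '.join(str(leaf) for leaf in leaves)}")
```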
Example
Example
Time Series Plots
Multivariate Data
The corrected sum of cross-products
Scatter Diagrams
§ A scatter diagram is a simple descriptive tool for multivariate data.
§ The diagram is useful for examining the pairwise (two variables at a time) relationships between the variables.
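A numerical companion to the scatter diagram is the sample correlation coefficient, which can be computed from corrected sums. The sketch below assumes the usual definitions S_xy = Σ(x_i − x̄)(y_i − ȳ), S_xx, and S_yy, with r = S_xy / sqrt(S_xx · S_yy); the data are made up for illustration.

```python
import math

def corrected_sums(xs, ys):
    """Return (S_xy, S_xx, S_yy), the corrected sums of cross-products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy

def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xy, s_xx, s_yy = corrected_sums(xs, ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # perfectly increasing with xs
ys_down = [5, 4, 3, 2, 1]    # perfectly decreasing with xs

print(correlation(xs, ys_up), correlation(xs, ys_down))
```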
Scatter diagrams
LAB 1: (DS1=beaver1; DS2=beaver2)
1. Given a large data sample DS1 containing N1 data points, estimate the mean and variance (use a simple for-loop over the samples).
2. Given another data sample DS2 containing N2 data points, estimate the mean and variance (use a simple for-loop over the samples).
3. Describe a method to efficiently estimate the mean of the combined dataset DS1+DS2.

Hint: For the variance, you can use the following formula,
Assignment 1:
DS1=EuStockMarkets$DAX;
DS2=EuStockMarkets$SMI
1. Given DS1, estimate the mean and variance of the absolute change in
the daily stock price (use for-loops)
2. Repeat for DS2
3. Estimate (in an efficient way) the overall mean (as given in the lecture) and variance of the combined dataset DS1 ∪ DS2
4. Calculate the Skewness and Kurtosis of DS1
5. Plot Histogram and Box Plot of the combined dataset
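One way to combine two datasets efficiently is to merge their per-dataset summaries rather than revisit the raw samples. The sketch below uses a standard pairwise merge of count, mean, and sum of squared deviations; it is offered as one possible approach and is not necessarily the formula given in the lecture. The function names and the small placeholder datasets are assumptions.

```python
import statistics

def summarize(data):
    """Return (n, mean, m2), where m2 is the sum of squared deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data)
    return n, mean, m2

def combine(a, b):
    """Merge two (n, mean, m2) summaries into the summary of the combined dataset."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# Placeholder data standing in for DS1 and DS2.
ds1 = [88, 99, 95, 89, 63]
ds2 = [99, 100, 89, 98, 100]

n, mean, m2 = combine(summarize(ds1), summarize(ds2))
variance = m2 / (n - 1)   # unbiased sample variance of the combined data
print(n, mean, variance)
```

The merge costs a constant number of operations, so the combined statistics are available without another pass over DS1 and DS2.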
Statistical Analysis
More on Data Description
1. MARGINAL, JOINT, AND CONDITIONAL
PROBABILITY
2. EXPECTED VALUES
3. LINEAR TRANSFORMATION OF RV
Data Description:
Marginal, Joint, and Conditional Prob.
§ Marginal Prob.: P(X=x)
§ Joint Prob.: P(X=x , Y=y) ; P(Y=y , X=x)
Data Description:
Marginal, Joint, and Conditional Prob.
§ Partitions and the total probability theorem:
§ P(A) = Σi P(A, Bi)
§ P(X=x) = Σi P(X=x, Y=yi)

§ Conditional:
§ P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
§ Or, P(X=x, Y=y) = P(X=x | Y=y) · P(Y=y)
§ Notice (total prob.): P(X=x) = Σi P(X=x | Y=yi) P(Y=yi)

§ Bayes’ Theorem: P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)

§ Independent Variables:
§ P(X=x, Y=y) = P(X=x) · P(Y=y)
§ P(X=x | Y=y) = P(X=x) · P(Y=y) / P(Y=y) = P(X=x)
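These relations can be checked on a small joint probability table. The table below is hypothetical and chosen only for illustration, as are the helper names.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X=x, Y=y) for binary X and Y.
joint = {(0, 0): F(3, 10), (0, 1): F(2, 10), (1, 0): F(1, 10), (1, 1): F(4, 10)}

def marginal_x(x):
    """P(X=x): sum the joint probabilities over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """P(Y=y): sum the joint probabilities over all values of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def x_given_y(x, y):
    """P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)."""
    return joint[(x, y)] / marginal_y(y)

def y_given_x(y, x):
    """P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)."""
    return joint[(x, y)] / marginal_x(x)

posterior = x_given_y(1, 1)
via_bayes = y_given_x(1, 1) * marginal_x(1) / marginal_y(1)
independent = all(joint[(x, y)] == marginal_x(x) * marginal_y(y) for (x, y) in joint)

print(marginal_x(1), marginal_y(1), posterior, via_bayes, independent)
```

In this table the joint probabilities do not factor into the marginals, so X and Y are dependent.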
Data Description:
Marginal, Joint, and Conditional Prob.
§ Example 1:
Box 1 (B1) is chosen with P(B1) = 0.7 and Box 2 (B2) with P(B2) = 0.3.
B1: 20 Red + 20 Yellow
B2: 10 Red + 20 Yellow
If a Yellow ball is picked, what is the probability that it came from Box 2?

§ Example 2:
Suppose a drug test is 99% sensitive (true positive rate) and 99% specific (true negative rate). Suppose that 0.5% of people are users of the drug. If a randomly selected individual tests positive, what is the probability that he or she really is a drug user?
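Both examples follow from the total probability and Bayes relations. The sketch below works them out with exact fractions; the helper name `posterior` is illustrative.

```python
from fractions import Fraction as F

def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) from P(A), P(B | A), and P(B | not A), using Bayes and total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Example 1: A = "ball came from Box 2", B = "ball is Yellow".
p_box2_given_yellow = posterior(
    prior=F(3, 10),                  # P(B2)
    likelihood=F(20, 30),            # P(Yellow | B2)
    likelihood_given_not=F(20, 40),  # P(Yellow | B1)
)

# Example 2: A = "person is a user", B = "test is positive".
p_user_given_positive = posterior(
    prior=F(5, 1000),                # 0.5% of people are users
    likelihood=F(99, 100),           # sensitivity
    likelihood_given_not=F(1, 100),  # 1 - specificity
)

print(p_box2_given_yellow, float(p_box2_given_yellow))
print(p_user_given_positive, float(p_user_given_positive))
```

Even with a 99% accurate test, the low prior means only about one in three positive results belongs to an actual user.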
Data Description:
Marginal, Joint, and Conditional Prob.
§ Useful Definitions/Terminology:

In Bayes’ theorem: P(A | B) = P(B | A) · P(A) / P(B)

P(A), the prior, is the initial degree of belief in A.
P(A | B), the posterior, is the degree of belief having accounted for B.
P(B | A) / P(B) represents the support B provides for A.
Data Description
Expected Value
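§ The expected value E(X) of a discrete random variable is the sum of all X-values (there can be infinitely many) weighted by their respective probabilities:

$$E(X) = \sum_{i} x_i \, P(x_i)$$

§ Similarly, the variance is given by E((X−μ)²), that is,

$$\sigma^2 = E\big((X - \mu)^2\big) = \sum_{i} (x_i - \mu)^2 \, P(x_i)$$

§ What is E(a statistic)?

$$E(\bar{x}) = E\Big(\frac{1}{n} \sum_{i=1}^{n} x_i\Big)$$

$$E(s^2) = E\Big(\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\Big)$$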
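The question about the expected value of a statistic can be explored exactly on a tiny population. The sketch below assumes a fair die as the population and enumerates every possible sample of size 3 drawn with replacement; the setup is illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Population mean and variance of a fair die.
mu = Fraction(sum(faces), 6)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

n = 3
samples = list(product(faces, repeat=n))   # all 6**3 equally likely samples

def sample_mean(s):
    return Fraction(sum(s), n)

def sample_var(s):
    m = sample_mean(s)
    return sum((x - m) ** 2 for x in s) / (n - 1)   # divisor n - 1

# Averaging a statistic over all equally likely samples gives its expected value.
expected_mean = sum(sample_mean(s) for s in samples) / len(samples)
expected_var = sum(sample_var(s) for s in samples) / len(samples)

print(mu, expected_mean)       # the two values agree
print(sigma2, expected_var)    # the two values agree
```

In this enumeration both estimators are unbiased: their expected values equal the population mean and variance.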
Data Description
Expected Value
§ The ‘Expected value’ can be thought of as an operator that
estimates the mean value of a quantity given all its possible
values.
§ The E(.) operator is linear
§ Example: E(5X −Y + 3) = 5E(X )− E(Y ) + 3
§ Two random variables are said to be uncorrelated, if,
E(X.Y ) = E(X ).E(Y )
§ Any two independent RV are also uncorrelated.

§ In general, uncorrelated RVs are not necessarily independent.
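A small counterexample shows why uncorrelated does not imply independent. The sketch assumes X takes the values −1, 0, 1 with equal probability and Y = X²; this pair is a standard illustration and is not taken from the slides.

```python
from fractions import Fraction

x_values = [-1, 0, 1]
p = Fraction(1, 3)   # each value of X is equally likely

# Y = X**2 is completely determined by X.
e_x = sum(p * x for x in x_values)
e_y = sum(p * x ** 2 for x in x_values)
e_xy = sum(p * x * x ** 2 for x in x_values)

uncorrelated = (e_xy == e_x * e_y)

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0_y0 = p          # X=0 forces Y=0
p_x0 = p
p_y0 = p             # Y=0 happens only when X=0
independent = (p_x0_y0 == p_x0 * p_y0)

print(e_x, e_y, e_xy, uncorrelated, independent)
```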


Data Description:
Linear transformation
§ A linear transformation of a random variable X is performed
by multiplying by a constant; or adding a constant or a RV.
§ Examples:
§ Y = a X + b
§ Z = X + Y
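The effect of the two example transformations on the mean and variance can be checked by exact enumeration. The sketch assumes X is a fair die, takes a = 2 and b = 3, and uses a second independent fair die as the added RV; it relies on the standard results E(aX+b) = aE(X)+b and Var(aX+b) = a²Var(X), and on variances adding for independent RVs.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def expectation(values):
    """Mean of equally likely values, as an exact fraction."""
    values = list(values)
    return sum(map(Fraction, values)) / len(values)

def variance(values):
    """Population variance of equally likely values."""
    values = list(values)
    m = expectation(values)
    return expectation([(v - m) ** 2 for v in values])

a, b = 2, 3
x_vals = list(faces)
y_vals = [a * x + b for x in x_vals]                    # Y = aX + b
z_vals = [x + w for x, w in product(faces, repeat=2)]   # Z = X + W, W an independent die

print(expectation(y_vals), a * expectation(x_vals) + b)
print(variance(y_vals), a ** 2 * variance(x_vals))
print(expectation(z_vals), variance(z_vals))
```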
Data Description:
Linear transformation
§ In general,
Data Description:
Linear transformation
§ For RV with normal distributions,
LAB 2:
1. Generate an array of 1000 zero-mean Gaussian random samples.
2. Calculate the Mean, Variance, Skewness, and Kurtosis.
3. Plot the Histogram (using different numbers of bins) and the Box Plot.

1. Generate an array of 1000 Poisson random samples.
2. Calculate the Mean, Variance, Skewness, and Kurtosis.
3. Plot the Histogram (using different numbers of bins) and the Box Plot.
LAB #3
1. Given the attached data table (first 4 columns of “iris”
dataset),
1. Calculate Mean, Variance, STD, Skewness, Kurtosis, of each attribute
2. Present the data using Box Plot of all attributes
3. Present each attribute using Histogram
4. Determine the correlation coefficient between every pair of attributes
(use table to summarize the results).
5. Scatter plot the attributes with maximum and minimum correlation.
LAB #4
1. Simulate 1000 measurements of a Gaussian random variable, X ~N(4,2). Present these samples using a Box Plot.
2. Simulate another 1000 measurements of another Gaussian random variable, Y ~N(4,2). Display the scatter plot of X and Y.
3. Define a RV, Z = X − 0.1 Y.
◦ Display the scatter plots of Z vs X and of Z vs Y.
◦ Determine the correlation between Z and X, and between Z and Y.
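A possible starting point for the simulation part of this lab is sketched below. It assumes N(4,2) means a mean of 4 and a variance of 2 (the notation could also mean a standard deviation of 2), fixes a random seed for repeatability, and leaves the plots to a plotting tool of your choice.

```python
import math
import random

random.seed(0)   # fixed seed so the run is repeatable (an arbitrary choice)

MEAN = 4.0
STD = math.sqrt(2.0)   # assumption: the second parameter of N(4,2) is the variance

x = [random.gauss(MEAN, STD) for _ in range(1000)]
y = [random.gauss(MEAN, STD) for _ in range(1000)]
z = [xi - 0.1 * yi for xi, yi in zip(x, y)]   # Z = X - 0.1 Y

def correlation(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    s_aa = sum((p - a_bar) ** 2 for p in a)
    s_bb = sum((q - b_bar) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)

r_zx = correlation(z, x)
r_zy = correlation(z, y)
print(r_zx, r_zy)
```

Since Z is dominated by X, the correlation with X should come out close to 1 and the correlation with Y small and typically negative.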
