Statistics For Data Science

Statistics is the study of collecting, organizing, summarizing, and interpreting data. It involves techniques for sampling a subset of a population to make inferences about the whole population while accounting for sampling error. There are two main types of sampling - random sampling which allows for generalizing to the population, and non-random sampling which does not. Descriptive statistics are used to summarize and describe data through measures of central tendency like the mean and median, and measures of spread like range, variance, and standard deviation.


Statistics (a branch of applied Mathematics)

Techniques and procedures for collecting, presenting, analysing, and interpreting data, and for making decisions based on data.

Population:
• Set of all units of interest for a particular study
• Needs to be clearly identified
• Real or theoretical
• Vary in size
• Characterized by parameters (census)
Sample (representative vs. biased)
• Subset of units selected from a population
• Represent the population
• Vary in size
• Characterized by statistics

Sampling
Draw inferences about some population(s) of interest based on observations of just a subset or sample from the whole
population.

Sampling Error
• Sample statistics are used as a basis for drawing conclusions about population parameters
• Sample provides limited information about population
• A sample is not expected to give a perfectly accurate picture of the whole population
• Usually there is a discrepancy between a sample statistic and the corresponding population parameter called
SAMPLING ERROR

Types of Sampling
Random (Probability): can generalize beyond sample. data may be difficult to obtain and expensive to collect
• Simple
allows generalizations (ensure representativeness)
rarely used (no access to the whole population list, all selected samples may not participate)
sample may not capture enough elements of subgroups
suitable for small populations
• Systematic (Marketing)
easier than random (no good for heterogeneous population)
offers approximation of random sampling
• Stratified (Geographical)
ensure subgroup representation
• Cluster
often conducted in multiple stages

Non−random (Non-Probability): knowledge in preparation for a later random sample. information cannot be generalized
beyond the sample.
• Convenience
o Handpicked
selected with particular purpose in mind
selected cases meet particular criteria (typical, show wide variance, represent expertise, etc.)
o Snowball
used for populations not easily identified or assessed
sample is built through referrals
no guarantee of representativeness
• Purposive
o Volunteer
highly convenient
not likely to be representative
no guarantee of representativeness
However can credibly represent population if:
1. selection is done with the goal of representativeness in mind
2. strategies are used to ensure samples match population characteristics.

Data Types (Measurement Scales)


Qualitative:
• Nominal(Categorical): classified without numeric meaning
o Dichotomous:
▪ symmetric = same value (male vs. female)
▪ asymmetric = different value (score vs. not)
• Ordinal: classified, ordered scale, no constant scale
Quantitative (Scale or Continuous):
• Interval: classified, ordered, equidistant unit, but no natural zero (Temperature, dates, personality measures)
• Ratio: classified, ordered, constant scale and a natural zero=no negative value (Age, weight, length, percentage)

Data Management
• Data dictionary/Codebook
o First step in data management and preparation
o Describes the content, structure and format for each item in the collected data and conversion rules for
items into variables
• Data validation rules
Answers to related questions should match each other
• Recoding data
Continuous into categorical
Conceptual: based on meaningful thresholds (90−100=A+, A, A−….)
Improving the properties of the data: when the distribution is extremely skewed like number of specialist visit (to
normalize data)
Categorical into new categorical
Conceptual reasons: too many categories, or the variable is used as a proxy for a variable with a smaller number of categories
Improving the properties of the data: usually is needed when some categories have few responses
• Computing total scale scores
When you need to compute total scores for a scale, ALWAYS conduct internal consistency analysis first
This is your evidence for reliability of these scores
Statistic used for this purpose: Cronbach’s alpha
Rule for interpretation: Cronbach’s alpha>.70 indicates that reliable total scores can be created from a set of
variables
• Handling missing data
How much missing data does each case have?
How much missing data does each variable have?
What are the reasons for missing data?
Decisions (need to be explained and justified):
Eliminate cases that have too much missing data
Eliminate variables that have too much missing data
Imputation = creating substitute (fake) data
Does not rescue the analysis when the data is garbage!
Methods of different complexity
Work well when there is not too much missing (<10% per variable) and when the missing pattern can be
considered random
Allow to maximize the amount of data used in the analyses
List−wise deletion –analytic strategy when the case that has a missing value on one of the variables is removed
from all analyses
Case−wise deletion –analytic strategy when the case that has a missing value on one of the variables is removed
only from the analyses that involve this variable
Both methods use reduced sample size in the analyses
• Checking for outliers
Unusual data points located far from other values of the variable
Should be investigated carefully
May indicate faulty data, erroneous procedures, areas where a certain theory might not be valid
Often contain valuable information about the process under investigation
ALWAYS identify and discuss
NEVER ignore
TRY to understand
How to deal with outliers
Data Transformation: Performing a mathematical operation on scores to reduce the impact of outliers. For
example, taking the square root, reciprocal, or logarithm of all the scores has this effect.
shrink larger values more than smaller values
may affect interpretation
change the relationship between the variables
often require non-negative data
Winsorizing: Replacing a fixed number of extreme scores with the score that is closest to them in the tail of the
distribution.
Trimming: Removing a fixed percentage of extreme scores from each of the tails of a distribution (e.g., omitting
the two highest and two lowest scores).
Deletion: use only if you find outliers are legitimate errors that can't be corrected
use only as a last resort
use when they lie so far outside the range of the remainder of the data that they distort statistical inferences
when in doubt, you can report model results both with and without outliers to see how much they change
Accommodation: use methods that are robust in the presence of outliers (nonparametric)
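A minimal Python sketch of two of the outlier treatments described above (winsorizing and trimming), plus a log transformation, assuming NumPy and SciPy are available; the data values are invented for illustration.

    import numpy as np
    from scipy.stats import mstats, trim_mean

    scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])   # 95 is a clear outlier

    # Winsorizing: replace the top and bottom 10% with the nearest remaining score
    winsorized = np.asarray(mstats.winsorize(scores, limits=[0.10, 0.10]))

    # Trimming: drop 10% of the scores from each tail before averaging
    trimmed_mean = trim_mean(scores, proportiontocut=0.10)

    # Log transformation (requires non-negative data): shrinks large values more than small ones
    log_scores = np.log(scores)

    print(winsorized, trimmed_mean, np.round(log_scores, 2))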

Descriptive Statistics
Methods to organize, summarize, describe and understand data
Univariate Descriptive Statistics
Purposes:
describe distributions of scores
• shape
• central tendency
• spread

Measures of Central Tendency


Determine a single score that defines the center of a distribution or most representative of the entire distribution
Arithmetic Mean: affected by outliers; suited to continuous and symmetric data (normal distribution). x̄ = Σxᵢ/n
Median: measure of position, resistant against outliers, skewed or ordinal data
Mode: measure of repetition, nominal data
Excel
• AVERAGE(B3:K6)
• count(B4:E9)
Weighted Mean: x̄=Ʃxf(x)/Ʃf(x)
Mean of Grouped Frequencies: Ʃ(class midpoint*frequency)/total no of observations=Ʃ(m*f)/n
Geometric Mean: x̄g = ⁿ√(x₁·x₂·…·xₙ); used to calculate compound annual growth rate (CAGR) or the average yearly return of an investment.
Harmonic Mean: x̄ₕ = n/Σ(1/xᵢ); e.g., average speed over a fixed distance.
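A small Python sketch of the central-tendency measures above, using the standard-library statistics module (geometric_mean needs Python 3.8+) and NumPy for the weighted mean; the values are invented.

    import statistics as st
    import numpy as np

    x = [2, 3, 3, 5, 7, 10]

    print(st.mean(x))             # arithmetic mean, sensitive to outliers
    print(st.median(x))           # middle value, resistant to outliers
    print(st.mode(x))             # most frequent value (useful for nominal data)
    print(st.geometric_mean(x))   # e.g., average growth rate (CAGR)
    print(st.harmonic_mean(x))    # e.g., average speed over a fixed distance

    # Weighted mean: x̄ = Σ x·f(x) / Σ f(x)
    values = [60, 70, 80]
    weights = [1, 3, 6]
    print(np.average(values, weights=weights))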

Measures of spread (dispersion)


Measure degree to which scores in a distribution are spread out or clustered together
Measure how well an individual score represents the entire distribution
Provide information about how much error to expect if you are using the sample to represent the population
Range: R= xmax – xmin, sensitive to outliers
Interquartile Range: IQR=Q3-Q1 less sensitive to outliers
Excel
• quartile.inc() → median is included
Deviation scores: difference between each score and the mean
Variance: variation from x̄. σ² = Σ(xᵢ-µ)²/N, s² = Σ(xᵢ-x̄)²/(n-1)
Standard Deviation (σ, s): the square root of the variance; roughly the typical deviation of scores from the mean.
Excel
• stdev.s(B3:k6)
Variance of grouped frequencies: Σ(m-x̄)²f/n-1
Mean Absolute Deviation: MAD = Σ|xᵢ-x̄|/n
Coefficient of Variation: CV = s/x̄ × 100%. Puts the SD on the scale of the mean (relative risk). The average variability of ….. is about …% of
the mean
Used for comparison between 2 distributions with different measurement units
Standard Error of the Mean: SEM=s/√n. the standard deviation of the sampling distribution of x̄ or deviation of x̄ from
actual µ. Measure of precision (large sample size → less uncertainty in sample mean).
Confidence Interval: x̄±SE(x̄)*t0.975,n-1
We are 95% confident that the true mean lies between … and ….
Excel
• T.INV(0.975,df)
Standard Error of the Sample Proportion: SE(p) = √(p(1-p)/n)
Confidence Interval of the Sample Proportion: p±SE(p)*z0.975
Excel: =NORM.S.INV(0.975)
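A Python sketch of the spread measures and the two confidence intervals above, using SciPy for the t and z critical values (the Python analogues of the Excel T.INV and NORM.S.INV calls); the sample data and the 17-of-30 proportion are illustrative numbers only.

    import numpy as np
    from scipy import stats

    x = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 16.0, 13.0, 17.0])
    n = len(x)

    data_range = x.max() - x.min()                       # range
    iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range
    s2 = x.var(ddof=1)                                   # sample variance
    s = x.std(ddof=1)                                    # sample standard deviation
    cv = s / x.mean() * 100                              # coefficient of variation (%)
    sem = s / np.sqrt(n)                                 # standard error of the mean

    # 95% CI for the mean: x̄ ± t(0.975, n-1) * SEM
    t_crit = stats.t.ppf(0.975, df=n - 1)
    ci_mean = (x.mean() - t_crit * sem, x.mean() + t_crit * sem)

    # 95% CI for a sample proportion, e.g. p̂ = 17/30
    p_hat, m = 17 / 30, 30
    se_p = np.sqrt(p_hat * (1 - p_hat) / m)
    z_crit = stats.norm.ppf(0.975)
    ci_prop = (p_hat - z_crit * se_p, p_hat + z_crit * se_p)

    print(data_range, iqr, s2, s, cv, sem, ci_mean, ci_prop)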

Tools:
numerical (mean, median, mode, range, variance, SD)
graphical (histogram, bar graph)
tabular (frequency table): a summary table in which the data are arranged into classes and frequencies.

Qualitative: Analyzed by frequency table, illustrated by bar chart


• Nominal(Categorical): mode
• Ordinal: median, IQR, range
Quantitative (Scale or Continuous): Analyzed by Mean, Median, Mode, Quantile, STD, Min, Max, illustrated by Histogram
and distribution (if skewed distribution use Median)
• Interval
• Ratio

Why look at distribution?


1. Screening for outliers –not representative values
2. Getting better descriptive statistics (median and range instead of mean and standard deviation if distribution is
skewed)
3. Choosing appropriate inferential statistics

Bivariate Distributions
To find the relationship between 2 variables
Methods (depend on measurement scale of the two variables)
• Tabular tool
Crosstabulations: 2 categorical
• Graphical tool
Clustered bar graph: 2 categorical
Boxplot or histogram: 1 categorical and 1 continuous
Scatter plot: 2 continuous (look at relationship, linear, direction, outlier)
• Numerical (measures of association)
Correlation Coefficients
Statistical technique that is used to measure and describe relationship between two variables.
Measures three aspects of relationship:
Direction (identified by sign)
• positive
• negative
Form (linear vs. nonlinear)
Strength of relationship (for linear correlation –how well the data points fit a straight line)
Variables | Correlation Coefficient
Continuous, normally distributed and linearly related | Pearson's r correlation
Continuous, skewed and/or not linearly related | Spearman's correlation
Ordinal | Kendall's Tau B
Nominal with 2 categories | Phi
Nominal with 2+ categories | Cramer's V

Nominal × Nominal: Crosstabulation; Phi; Cramer's V
Nominal × Ordinal: Crosstabulation; Clustered bar graph; Phi; Cramer's V
Nominal × Scale: Boxplot; Point Biserial (0/1)
Ordinal × Ordinal: Crosstabulation; Clustered bar graph; Kendall's Tau B
Ordinal × Scale: Boxplot; Scatter plot; Spearman's rho
Scale × Scale: Scatter plot; Spearman's rho (skewed); Pearson's r (normal)

Data Visualization
• Gain insight into an information space by mapping data onto graphical primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative analysis
• Provide a visual proof of computer representations derived
Categorization of visualization methods
Boxplot: graphic display of five-number summary
Histogram: a graphical representation of the frequency distribution in which the x-axis represents the classes and the y-
axis represents the frequencies in bars.
Ogive (Cumulative Frequency Distribution):
Quantile plot: displays all of the data (unlike a boxplot, which shows only the five-number summary)
each value x is paired with a fraction f indicating that approximately 100·f % of the data are ≤ x
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding
quantiles of another (a bivariate quantile plot)
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Moments
• First Moment: Σx/n → average distance from 0; indicates the mean.
• Second Moment: Σx²/n → average squared distance from 0; indicates the variance.
• Third Moment: Σx³/n → average cubed distance from 0; indicates skewness.
Skewness = (1/N)·Σ(x−µ)³/σ³  −or−  [n/((n−1)(n−2))]·Σ(x−x̄)³/s³
• Fourth Moment: Σx⁴/n → average fourth power of the distance from 0; indicates kurtosis.
Kurtosis = (1/N)·Σ(x−µ)⁴/σ⁴  −or−  [n(n+1)/((n−1)(n−2)(n−3))]·Σ(x−x̄)⁴/s⁴ − 3(n−1)²/((n−2)(n−3))
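A short sketch of the raw moments and of SciPy's skewness and excess-kurtosis functions (with bias=False they apply the sample-adjusted formulas shown above); the data are invented.

    import numpy as np
    from scipy.stats import skew, kurtosis

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

    m1 = np.mean(x)        # first raw moment (mean)
    m2 = np.mean(x**2)     # second raw moment
    m3 = np.mean(x**3)     # third raw moment
    m4 = np.mean(x**4)     # fourth raw moment

    g1 = skew(x, bias=False)                    # sample skewness
    g2 = kurtosis(x, fisher=True, bias=False)   # sample excess kurtosis (normal = 0)

    print(m1, m2, m3, m4, g1, g2)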
Entropy: the level of impurity or uncertainty in the data
Information Gain: the amount of information that each variable/feature gives us about the final outcome. Suitable to
find the root node
Decision Tree: root node, branches,
Confusion Matrix: used to evaluate the performance of a classification model; accuracy = (TP+TN)/(TP+TN+FP+FN), i.e. the true results over everything (FP is a Type I error, FN a Type II error)

Probability
The likelihood that an event occurs: number of desired outcomes / number of possible outcomes
Random Experiment: an experiment whose outcome cannot be predicted with certainty
Event: an outcome of an experiment
Sample space: all possible outcomes after one or more trials; with x outcomes per trial and n trials there are xⁿ possibilities (e.g., head or tail)
Mutually exclusive: only one of the possible events can take place at a time (head or tail)
Collectively exhaustive list: a list of all the possible events that can result from an experiment
∈=element → 6∈N, 7∉N
⊆=subset → M⊆N=all elements in M are also in N plus equivalent subset M=N
Fundamental Counting Principle (FCP)
Permutation: the number of different ways of arranging sets of x out of a larger group of n (order matters). P(n,x) = (n P x) = n!/(n-x)!
→ 4!/(4−2)! = 4!/2! = 4·3 = 12
Combination: the number of ways of selecting sets of size x from a larger group of n (order does not matter). C(n,x) = (n C x) = n!/(x!(n-x)!)
→ 4!/(2!·(4−2)!) = (4·3)/2 = 6
Examples: 14 marbles (4red, 4green, 3blue, 3yellow)
Probability of having sets of 4 marbles out of 14: C(14, 4)=1001
Probability of having 4 marbles of different colors out of 14:
C(4,1)*C(4,1)*C(3,1)*C(3,1)=4*4*3*3=144 out of 1001
Probability of having 4 marbles (at least 2 red) out of 14 (2red or 3red or 4red):
C(4,2)*C(10,2)+C(4,3)*C(10,1)+C(4,4)=270+40+1=311 out of 1001
Probability of having 4 marbles (none red but at least 1 green) out of 14 (because no red=10 marbles)
Total probability: C(10,4)=210
C(4,1)*C(6,3)+C(4,2)*C(6,2)+C(4,3)*C(6,1)+C(4,4)=80+90+24+1=195 out 210
C(6,4)=15 if consider no green and red -> 210-15=195 (reverse way of calculation)
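The permutation/combination rules and the marble counts above can be checked with math.comb and math.factorial (Python 3.8+); this sketch just reproduces the arithmetic.

    from math import comb, factorial

    # Arrangements and selections of 2 items out of 4
    print(factorial(4) // factorial(4 - 2))   # P(4,2) = 12 (order matters)
    print(comb(4, 2))                         # C(4,2) = 6  (order does not matter)

    # 14 marbles: 4 red, 4 green, 3 blue, 3 yellow
    total = comb(14, 4)                                              # 1001 possible sets of 4
    one_of_each = comb(4, 1) * comb(4, 1) * comb(3, 1) * comb(3, 1)  # 144 sets with 4 different colors
    at_least_2_red = comb(4, 2) * comb(10, 2) + comb(4, 3) * comb(10, 1) + comb(4, 4)  # 311
    no_red_some_green = comb(10, 4) - comb(6, 4)                     # 210 - 15 = 195

    print(total, one_of_each, at_least_2_red, no_red_some_green)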
Types of Events
Independent Events
The event A does not affect the occurrence of event B
Disjoint Events (A∩B = ∅)
probability of A or B (union) = P(A∪B) = P(A) + P(B) → P of Heart or Diamond = 13/52 + 13/52
probability of A and B (intersection) = P(A∩B) = P(A)·P(B) → P of an A grade in Stat and in Math
Non−Disjoint Events (A∩B ≠ ∅)
probability of A or B = P(A∪B) = P(A) + P(B) – P(A∩B) → P of Queen or Heart = 4/52 + 13/52 – 1/52
Conditional or Dependent Events
probability of A and B = P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A) → P of 2 Hearts in a row
P(A|B) = probability of A given that B has happened
Types of Probability
Marginal = P(A)
Joint = P(A∩B)
Conditional = P(A|B)
P(A|B) = P(A∩B)/P(B)

Company Electronics Mechanical Computer total


Science
A 22 28 18 68
B 34 25 30 89
C 19 32 21 72
Total 75 85 69 229

P of Electronics = 75/229
P of Mechanical and company A = 28/229
P of Electronics and company A or C = 22/229 + 19/229
P of Computer Science or company B = 69/229 + 89/229 – 30/229
P of Electronics given placed in company A = 22/68
P of red ball in second draw if 2 black and 3 red = P of black in first draw * P of red in second draw + P of red in first draw
* P of red in second draw = (2/5*3/4) + (3/5*2/4) = 6/20 + 6/20 = 60%

Bayes’ Theorem:
P(A|B) = P(B|A)·P(A)/P(B) → posterior = likelihood × prior / P(B)
P(B) = P(A¹∩B) + P(A²∩B) + P(A³∩B)
P(B) = P(B|A¹)·P(A¹) + P(B|A²)·P(A²) + P(B|A³)·P(A³)
P(B) = Σ P(B|Aᵢ)·P(Aᵢ)
P(Aᵢ|B) = P(B|Aᵢ)·P(Aᵢ) / Σ P(B|Aᵢ)·P(Aᵢ)   (the generalized form of Bayes' theorem)

Example
Prior Probabilities: P(A¹) = 30% (low risk), P(A²) = 60% (medium risk), P(A³) = 10% (high risk)
Conditional Probabilities: P(defaulter|A¹) = 1%, P(defaulter|A²) = 10%, P(defaulter|A³) = 18%

Posterior Probability: P of an A¹ rating given the defaulter pool = P(A¹|defaulter)

P(A¹|defaulter) = (0.01×0.30)/(0.01×0.30 + 0.10×0.60 + 0.18×0.10) = 0.003/0.081 = 0.037 = 3.7%
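A minimal sketch of the defaulter example, applying Bayes' theorem with the priors and conditional probabilities given above.

    # Prior probabilities of the three risk ratings
    prior = {"A1": 0.30, "A2": 0.60, "A3": 0.10}
    # Conditional probabilities P(defaulter | rating)
    likelihood = {"A1": 0.01, "A2": 0.10, "A3": 0.18}

    # Total probability of a default: P(B) = Σ P(B|Ai) * P(Ai)
    p_default = sum(likelihood[r] * prior[r] for r in prior)

    # Posterior: P(A1 | defaulter) = P(defaulter | A1) * P(A1) / P(defaulter)
    posterior_a1 = likelihood["A1"] * prior["A1"] / p_default

    print(p_default)      # 0.081
    print(posterior_a1)   # ~0.037, i.e. 3.7%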

Correlation Analysis
Shows the relationship between 2 variables (provides information on direction & strength: -1 to +1). Correlation is not necessarily
causation.
• Nominal Data (Chi Square Test)
• Numeric Data (Pearson Correlation coefficient)

Bivariate relationship
The value of the dependent variable (target variable) is a function of the independent variable(s) (predictor variables): y = f(x)
Linear algebra equation (slope-intercept form of a line)
y = mx + b [x: independent variable, m: slope = rise/run, b: y-intercept, where the line crosses the y-axis at x = 0, i.e. the point (0, b)].
Simple Linear Regression Model
y = β₀ + β₁x + Є (β₀=y-intercept of population parameter, β₁=slope of population parameter, Є=error term or unexplained
variation in y)
Simple linear regression equation
E(y) = β₀ + β₁x
Simple linear regression is not perfect and E(Y) is the mean of a distribution of ys (ŷ)
Because most of the time population parameter are not available (β₀ & β₁), we use sample data.
Sample linear regression equation
ŷ = b₀ + b₁x (y-hat is the mean value of y for a given value of x and a point estimator of E(y))
We always compare our best linear regression model with the baseline model that uses only the dependent variable [the mean, i.e. slope = 0 → ŷ = b₀ + 0x
→ ŷ = b₀]. If the sum of squares residual (SSE) decreases significantly, the model is good
Least squares criterion/line:
1. Make a scatter plot of data points
2. Look for a visual line
3. Find correlation: +, - , strong or weak
4. Descriptive statistics/centroid: the cross point between the mean of each variable [the best-fit regression line
must pass through the centroid.
5. Calculation: ŷ = b₀ + b₁x
Slope: b₁=Σ(xᵢ-x̄)(yᵢ-ȳ)/Σ(xᵢ-x̄)² x̄: mean of independent variable, xᵢ: value of independent
ȳ: mean of dependent variable, yᵢ: value of dependent variable

intercept: b₀ =ȳ-b₁x̄
Least squares method
The goal is to minimize the sum of squared residual (SSE)
We always compare our best regression model (Least squared line) with 1 dependent variable model (mean)
SST= Σ (yᵢ - ȳ)²
SSR= Σ (ŷᵢ - ȳ)²
SSE=Σ (yᵢ - ŷᵢ)²
[yᵢ: observed value of dependent variable, ŷᵢ: estimated value of dependent variable, Residual = yᵢ - ŷᵢ]
With only the dependent variable: SST = SSE
With 2 variables: SST = SSR + SSE (SST stays the same, so SSE decreases significantly)
Coefficient of Determination
How well does the estimated Linear Regression equation fit our data.
Coefficient of Determination = r² =SSR/SST →..% of total sum of squares can be explained by the estimated regression
model
Standardized Score regression
Not commonly used. Useful when different scales and variances.
Make a standardized Z score for each variable and make a scatter plot on the x and y axes. The centroid is (0, 0), the centre
of the axes. The intercept is 0 and the slope equals the correlation coefficient (r).
Raw data correlation and standardized value correlation are the same.
Original Regression Slope=r*sy/sx
Standardized Slope=r=b₁*sx/sy
Regression Model Error
SSE=Σ(yᵢ - ŷᵢ)²
MSE = s² = SSE/df is an estimate of σ², the variance of the error (how spread out the data points are from the regression line)
[df = n−2]
s (standard error of estimate) = √MSE (root mean square error) = √(SSE/(n−2)), roughly the average distance that observations fall from the
regression line
r²=SSR/SST
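A sketch of the least-squares quantities above using scipy.stats.linregress; the x/y values are invented, so the printed numbers are only illustrative.

    import numpy as np
    from scipy.stats import linregress

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

    res = linregress(x, y)                  # slope b1, intercept b0, r, p-value, SE of the slope
    y_hat = res.intercept + res.slope * x

    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares

    r_squared = ssr / sst                   # coefficient of determination (= res.rvalue ** 2)
    mse = sse / (len(x) - 2)                # estimate of the error variance
    s = np.sqrt(mse)                        # standard error of estimate

    print(res.slope, res.intercept, r_squared, s, res.stderr)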

Confidence Interval of Slope


b₁ ± tα/₂·sb₁ = point estimator of the slope ± margin of error
sb₁: standard deviation of the slope = s/√Σ(xᵢ-x̄)²
We are 95% confident that the interval (…, …) contains the true slope of the regression line.
Does the interval contain 0? If not, reject the null hypothesis that the slope is 0 (H₀: β₁=0, Hα: β₁≠0)
T test for significance
t=b₁/sb₁ vs. tcᵣᵢ: if t>tcᵣᵢ → t statistics is significant or reject the null
Confidence Interval of ŷ*
ŷ* = b₀ + b₁x*
ŷ* ± tα/₂·sŷ* = point estimator of ŷ* ± margin of error
sŷ* = s·√(1/n + (x*-x̄)²/Σ(xᵢ-x̄)²)
x*: the value of interest of the independent variable x
xᵢ: the observed values of the independent variable
y*: the possible value of y when x = x*
E(y*): the expected value or mean of y when x = x*
Prediction interval
Is always wider than CI because PI is about individual values while CI is about mean values.
ŷ* ± tα/₂·s(pred)
s²(pred) = s² + s²(ŷ*)
the estimate variance of ŷ* is at its minimum at the mean of independent variable [x*=x̄] or the most precise estimate of
the mean value of y is when x*=x̄. As x* gets further from the x̄, the confidence interval and prediction interval become
wider.
Residual
Difference between the observed value of dependent variable and predicted value by regression model
y=B₀+B₁x+Є

r=cov(x,y)/sₓ*sᵧ
If r ≥ 2/√n → a relationship exists (rule of thumb)
r>0.9 = strong correlation, 0.7<r<0.9 = moderate, r<0.7 = weak
In Excel -> r=CORREL(array1, array2)

First plot the data

Crosstabulation
Table summary of 2 variables (show the relationship between 2 variables)
In Excel: pivot table
Smartphone TVs Gaming Appliances Computers Total Average σ
North $1336 $1128 $1859 $1656 $1583 $7,562 $302 $110
South $6,822 $273 $116
East $7,727 $309 $104
West $7,587 $303 $118
Total $5714 $5824 $6512 $5945 $5703 $29,698
Average $286 $291 $326 $297 $285 $297
σ $118 $102 $134 $107 $97 111$

We can compare each region or segment with others


Also percent and ranking

Distributions
Chebyshev Rule
Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of
the mean (for k > 1)
Discrete Probability Density Function (pdf)
The distribution of a discrete number of outcomes over several independent experiments/trials.
Total: ΣP(x) = 1
Range: 0 ≤ P(x) ≤ 1
Expected value: E(X) = Σx·P(x)
Variance: σ² = Σ(x − E(X))²·P(x)
Uniform discrete probability distribution
Rolling a die or flipping a coin produces a uniform distribution
P(x)=1/no of outcomes
Binomial probability (Bernoulli Process)
• Each trial is independent.
• Only 2 possible outcomes/events over each experiment/trial.
• Probability of success (p) is the same across all trials. Probability of failure (q)=1-P
If the probability of colorblindness in men is 8% and pick 10 men:
The probability of having 10 men with colorblindness? (0.08)¹⁰
The probability of having 10 men without colorblindness? (0.92)¹⁰
The chance of having 2 men with colorblindness out of 10 men?
P(X=x) = (n C x)·pˣ·qⁿ⁻ˣ = n!/(x!(n-x)!)·pˣqⁿ⁻ˣ = 10!/(2!·8!)·(0.08)²(0.92)⁸ → pdf
Binomial Probability Density Function (pdf)
The distribution of binomial probability over several independent experiments/trials.
Expected number of color blind men in the sample: E(x)=np=x̄
SD(x): s=√npq
As n→∞, the binomial distribution moves towards the normal distribution if np>5 & nq>5
Excel: =BINOM.DIST(x,n,p,FALSE)
TI-83 -> 2nd,Vars,Enter,scroll down, binom pdf, (n,p,x),enter
Cumulative binomial distribution(cdf)
Addition of desired pdfs or deduction of the rest from 1 [1-P(x)]
The probability of having at least 2 men with colorblindness? P(x≥2)
Excel: P(x≤1)=BINOM.DIST(1,10,0.08,TRUE)
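The colorblindness example above, sketched with scipy.stats.binom (the Python analogue of Excel's BINOM.DIST):

    from scipy.stats import binom

    n, p = 10, 0.08                  # 10 men, 8% probability of colorblindness

    print(binom.pmf(2, n, p))        # exactly 2 colorblind men, like BINOM.DIST(2,10,0.08,FALSE)
    print(binom.cdf(1, n, p))        # at most 1, like BINOM.DIST(1,10,0.08,TRUE)
    print(1 - binom.cdf(1, n, p))    # at least 2: P(x >= 2)

    print(binom.mean(n, p))          # expected count E(x) = n*p = 0.8
    print(binom.std(n, p))           # SD(x) = sqrt(n*p*q)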
Continuous Probability Distribution Function
If n moves towards ∞, a discrete distribution turns into a continuous distribution
The probability of any single specific outcome is zero: P(a) = 0
P(a,b) = area under curve between a and b (Just cdf)
Uniform continuous probability distribution
Width = b − a
f(x) = 1/(b−a)
E(x) = (a+b)/2
σ² = (b−a)²/12
σ = (b−a)/√12
P(x₁ ≤ x ≤ x₂) = (x₂−x₁)/(b−a)
Normal (Gaussian) Distribution
Data around the mean occurs more frequently than data away from the mean
Mean = center of the distribution
SD = spread (width) of the distribution; a larger SD gives a flatter, wider curve
Z score: z = (x − x̄)/s
0 to 1 SD=34.13%
1 to 2 SD=13.59%
2 to 3 SD=2.14%
+3SD=0.13%
Positive skewed (Right skewed) → Mode < Median < Mean

Poisson Distribution (quality control)


The probability of x events over a fixed continuum or interval (time, distance, location) given the mean rate λ.
Bounded between 0 and ∞. P(x) = e⁻λ·λˣ/x!
If the average is λ = 3 customers per hour, the probability of having x = 5 customers in an hour:
Excel: =POISSON.DIST(x,λ,FALSE)
When n is large (e.g., n > 20) and p is small, the binomial distribution can be approximated by the Poisson distribution with λ = np
If continuum change: calculate the mean based on new continuum (cross multiply ratio)
cdf: cumulative probability of an event (x) over a continuum or interval given the mean.
Probability of less than 5 calls in 2 hrs if average calls are 8 per 2 hrs.
cdf: P(x≤4) = e⁻λ·Σ[λˣ/x!] → P(x≤4) = e⁻⁸[8⁰/0! + 8¹/1! + … + 8⁴/4!]
Excel: =POISSON.DIST(4,8,TRUE)
Probability of more than 5 calls in 2 hrs if average calls are 8 per 2 hrs. 1-cdfP(x≤5)
E(x)=λ can be a continuous number
SD(x)=√λ
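A sketch of the Poisson examples above with scipy.stats.poisson (analogous to Excel's POISSON.DIST):

    from scipy.stats import poisson

    print(poisson.pmf(5, 3))         # exactly 5 customers when λ = 3 per hour, like POISSON.DIST(5,3,FALSE)

    # 8 calls per 2 hours on average:
    print(poisson.cdf(4, 8))         # fewer than 5 calls in 2 hours, like POISSON.DIST(4,8,TRUE)
    print(1 - poisson.cdf(5, 8))     # more than 5 calls: P(x > 5)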

Hypergeometric Distribution
N: total population size
A: total items of interest in population
n: sample size
Probability of 2 spades in a 5-card poker hand?
N=52, A=13, n=5, x=2
P(x) = C(A,x)·C(N−A, n−x) / C(N,n)
Excel: =HYPGEOM.DIST(x,n,A,N,FALSE)
Probability of at least 2 spades in a 5-card poker hand?
Excel: P(x≥2) = 1−HYPGEOM.DIST(1,n,A,N,TRUE)
If you already hold 2 spades, what is the probability of a flush among the next 5 cards?
N=50, A=11, n=5, x=3,4,5
Excel: P(x≥3 spades) = 1−HYPGEOM.DIST(2,5,11,50,TRUE) +
P(x=5 hearts) = HYPGEOM.DIST(5,5,13,50,FALSE) +
P(x=5 diamonds) = HYPGEOM.DIST(5,5,13,50,FALSE) +
P(x=5 clubs) = HYPGEOM.DIST(5,5,13,50,FALSE) = 0.065821
E(x) = n·A/N
SD(x) = √[ n·(A/N)·((N−A)/N)·((N−n)/(N−1)) ]
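The spade example above with scipy.stats.hypergeom; note SciPy's argument order is (k, population size, successes in population, sample size):

    from scipy.stats import hypergeom

    N, A, n = 52, 13, 5                    # population, spades in the deck, cards drawn

    print(hypergeom.pmf(2, N, A, n))       # P(exactly 2 spades)
    print(1 - hypergeom.cdf(1, N, A, n))   # P(at least 2 spades)

    print(hypergeom.mean(N, A, n))         # E(x) = n*A/N
    print(hypergeom.std(N, A, n))          # SD(x)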

Exponential Distribution
The time between events in Poisson process (inverse of Poisson)
Number of cars passing a tollgate in one hour vs. Number of hours between car arrivals
Events per single unit of time vs. Time per single event
If P(0≤x≤1)=0.05 → P(1≤x≤2)=0.95*0.05 → P(2≤x≤3)=0.95²*0.05 ……….
Probability that the next visitor arrives within 10 min? make the time unit the same
P(x ≤ 10 min) = 1 − e^(−x/µ) or 1 − e^(−λx)
Excel: =EXPON.DIST(x,λ,TRUE); with FALSE it returns the pdf instead
Probability that the next visitor arrives after 30 min?
Excel: = 1-EXPON.DIST(x,λ,TRUE)
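A sketch of the exponential examples with scipy.stats.expon; the arrival rate of 6 visitors per hour is an assumed number, used only to make the time units concrete (SciPy's scale parameter is 1/λ):

    from scipy.stats import expon

    lam = 6 / 60                              # assumed rate: 6 visitors per hour = 0.1 per minute

    print(expon.cdf(10, scale=1 / lam))       # P(next visitor within 10 minutes) = 1 - e^(-λx)
    print(1 - expon.cdf(30, scale=1 / lam))   # P(next visitor after 30 minutes)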

Inferential Statistics
Makes conclusions or predictions about a population based on a sample. There is always some error around the
estimate (the margin of error).
Central limit theorem
The distribution of the sample mean of a random variable tends toward the normal distribution as the sample size
increases (≥ 30), regardless of the distribution from which we are sampling.
Point Estimation
The mean of the sampling distribution will be a good estimate of the population mean. The sample size should be > 30 and not
more than 10% of the population.
Finding The Estimates
1. Method of moment
2. Maximum of Likelihood
3. Bayes’ Estimator
4. Best Unbiased Estimator
5. Interval Estimate: a range of values between upper and lower confidence limit, used to estimate a population
parameter
Margin of Error
Margin of error is the maximum distance between the estimate and population parameter based on a certain level of
confidence (zσ/√n). As the sample size increases up to diminishing return point, the margin of error decreases (Better
estimation of µ).
Confidence Interval (CI)
a range between upper and lower confidence limit that true population mean lies within them.
CI calculation: point estimate ± margin of error → CI = x̄ ± z·σ/√n
Confidence Level
The probability that the interval estimate contains the population parameter.
We are 95% confident that the interval of …. to …. contains the true population mean or population mean lies within the
interval.
Confidence level + significance level (α) = 100%
Excel
Zcrit: =NORM.S.INV(0.95) for 1−tail or (0.975) for 2−tail
CI 90% 95% 99%
1−tail test ±1.28 ±1.645 ±2.33
2−tail test ±1.645 ±1.96 ±2.576
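The critical values in the table above, and a confidence interval for a mean with known σ, sketched with scipy.stats.norm (the analogue of NORM.S.INV); the sample numbers are invented:

    import numpy as np
    from scipy.stats import norm

    # Critical z values, mirroring Excel's NORM.S.INV
    print(norm.ppf(0.95))     # 1.645  (95% one-tail / 90% two-tail)
    print(norm.ppf(0.975))    # 1.960  (95% two-tail)
    print(norm.ppf(0.995))    # 2.576  (99% two-tail)

    # 95% CI for the mean when σ is known (invented numbers)
    x_bar, sigma, n = 102.5, 15.0, 36
    moe = norm.ppf(0.975) * sigma / np.sqrt(n)    # margin of error = z·σ/√n
    print(x_bar - moe, x_bar + moe)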

Sample Size Calculation


√n = zα/₂·σ/E → n = (zα/₂)²·σ²/E²
To have 95% of sample means contain µ, minimum sample size should be ….

Find σ
1. Estimate from previous studies using the same population
2. Conduct a pilot study to select a preliminary sample
3. Guess: data range/4

Hypothesis Formulation
Null/status quo hypothesis (H₀) is opposite of researcher/alternative hypothesis (Hₐ) and they are mutually exclusive.
Null says nothing new or different and always contains equality (=,≤,≥) while alternative hypothesis says that there is
something new or different. We compare μ (existing mean) against μ₀ (hypothesised mean) to find if they come from
the same population or different one (μ=μ₀ or μ≠μ₀)

Hypothesis testing procedure


1. Start with clear, well-developed research problem or question
2. Establish hypothesis, both null and alternative
3. Determine appropriate statistical test and sampling distribution
4. Define α (the acceptable level of significance). As α increases, rejecting the null gets easier
5. State the decision rule (rejection region boundary = µ ± zcritical·σ/√n)
6. Gather and visualize your data from sample
7. Calculate test statistics
8. State the statistical conclusion (is x̅ within µ ± zcri or not?)
9. Make decision or inference based on conclusion. Data support, indicate or infer …….

P-value
The probability that the observed relationship occurred by pure chance (i.e., assuming the null hypothesis is true); the strength of evidence against the null
hypothesis, computed from the test statistic (Z or T)
If P-value<α -> H₀ is rejected
If P-value>α -> H₀ is failed to be rejected

Reality
Conclusion | H₀ = false | H₀ = true
Reject null | True positive | False positive, Type 1 error (α)
Fail to reject | False negative, Type 2 error (β) | True negative

Probability of type 1 error (α): reject the null while we shouldn’t (level of significance)
Probability of type 2 error (β): fail to reject the null while we should
As α decreases, β increases
1–α =Confidence Level = non rejection area
1–β =Power = the probability of rejecting a false null or the ability of a model to detect the difference
If we reject the null, we conclude that the data support the alternative; if we fail to reject the null, that still does not prove the null.

One−sample hypothesis testing


Does the sample come from the same population? Or sample mean is between µ±CI?
µ=existing mean vs. µ₀=hypothesised mean
Two-tailed hypothesis: H₀: µ=µ₀, Hα: µ≠µ₀
Right-tailed hypothesis (upper): H₀: µ≤µ₀, Hα: µ>µ₀
Left-tailed hypothesis (lower): H₀: µ≥µ₀, Hα: µ<µ₀
If σ is known or n>30 -> use Z distribution
zcri = µ ± zα/₂·σ/√n if 2-tailed
zcri = µ + z·σ/√n if right-tailed
zcri = µ − z·σ/√n if left-tailed
Z-test statistic: z = (x̅−µ₀)/(σ/√n) → if the z value falls beyond zcri, the null is rejected
Excel: p−value = 1−NORM.S.DIST(z,TRUE) if 1−tail
= (1−NORM.S.DIST(z,TRUE))*2 if 2−tail (e.g., z = 1.886)
If σ is unknown or n≤30 -> use T distribution
T-test statistic: t = (x̅−µ₀)/(s/√n) → if the t value falls beyond tcri, or if the p-value < α, the null is rejected, which means the difference
between the sample mean and the hypothesised population mean is significant.
Excel: p−value = 1−T.DIST(t,df,TRUE) for an upper tail [t = test statistic]; with FALSE, T.DIST returns the probability density (the height of the
curve) instead
tcrit=T.INV(α,df)
there is not enough evidence at 95% CI (5% level of significance) to infer that the population average ….. is more/less
than …….
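A sketch of a one-sample test with SciPy; the sample values and hypothesised mean µ₀ are invented. ttest_1samp gives the two-tailed p-value, and the manual lines mirror the formulas above:

    import numpy as np
    from scipy import stats

    sample = np.array([12.1, 11.8, 12.6, 12.9, 11.5, 12.3, 12.8, 12.0])
    mu0 = 12.0                                   # hypothesised mean

    # σ unknown and n small → one-sample t test (two-tailed)
    t_stat, p_value = stats.ttest_1samp(sample, mu0)
    print(t_stat, p_value)

    # The same result by hand: t = (x̄ - µ0) / (s/√n), p from the t distribution
    n = len(sample)
    t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
    p_manual = 2 * (1 - stats.t.cdf(abs(t_manual), df=n - 1))
    print(t_manual, p_manual)                    # reject H0 if p < α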

Type 2 Error (β)


1. Find alternative mean (µα) or mean difference
2. Calculate x̅cri based on α
3. Find the Z value location of x̅cri on the µα distribution: Z = (x̅cri − µα)/(σ/√n)
4. Find the probability under the Z value curve= β
1–β =Power = the probability of rejecting a false null or the ability of a model to detect the difference
As µα moves away from µ₀ (the difference increases) type 2 error approaches 0 (↓)and power approaches 1 (↑). Also as
sample size increases, the power increases. and we are assured that rejection is correct
Controlling Type 2 Error
The goal is to align the α and β regions in the µ₀ and µα distributions respectively. Since the population means, σ and
critical values are set, we can manipulate the sample size to generate a standard error that brings α and β into alignment
at the critical value C. This new n will create a new x̅cri value for our decision rule.
Z=(x̅-µ₀)/σ/√n & µ±zσ/√n
µ₀±zασ/√n= µα±zβσ/√n -> n=(zα+zβ)²σ²/(µ₀-µα)²
Once the Type 2 error is controlled, we can use the phrases "reject the null" or "fail to reject the null". We have to redo our study with the new
sample size to control β at the desired level.

Independent−samples z test
To understand the role of random chance between 1 categorical variable and 1 quantitative variable
Comparing 2 population via their independent samples mean. Less power than paired sample t test.
D₀=µ₁-µ₂ (samples difference)
2 tailed test: H₀: θ=0 (there is no statistically significant difference between the two samples)
, Hα: θ≠0
1 tailed upper test: H₀: θ≤0, Hα: θ>0
1 tailed lower test: H₀: θ≥0, Hα: θ<0

| One sample Z test | Two samples Z test
Margin of error | zσ/√n | z√(σ₁²/n₁+σ₂²/n₂)
Confidence interval | x̅ ± z(σ/√n) | d̄ ± z√(σ₁²/n₁+σ₂²/n₂)
Z−value | Z = (x̅−µ₀)/(σ/√n) | Z = ((x̅₁−x̅₂)−D₀)/√(σ₁²/n₁+σ₂²/n₂)
Compare the result of Z-statistics against Z−critical to see if null is rejected or no.

Independent−samples t test
If σ is unknown, use the T distribution with df = (s₁²/n₁+s₂²/n₂)² / [ (s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1) ] (always round down ↓), or df = n₁+n₂−2 when equal variances are assumed
Take one sample from each populations n₁ & n₂
Calculate the difference of sample means x̅₁-x̅₂=d₁ an estimate of µ₁-µ₂
Calculate the mean of the differences = d̄ and the standard error of the difference = Sd̄ = S(x̅₁−x̅₂) = √(s₁²/n₁ + s₂²/n₂)
APA Style
the groups do not differ significantly, t(14)=1.59, p=0.14, d=.79, 95% CI [−.44, 2.94]. The clicker−training group (M=6.38,
SD=1.41) was not significantly different than the food reward−training group (M=5.13, SD=1.73). These findings do not
support the idea that clicker training is more effective than traditional food reward training.

Matched/paired−samples T test (pre and post-test)


Compare the means of 2 dependent samples
dᵢ=pre-test variable – post-test variable
d̅ =mean of differences
df=n-1
Compare t against t critical (use the t table)

| 1 sample T test | Independent samples T test | Matched samples T test
Estimated standard error | Sx̅ = s/√n | Sd̄ = S(x̅₁−x̅₂) = √(s₁²/n₁+s₂²/n₂) | Sd̄ = Sᵈ/√n
Margin of Error | t·Sx̄ = t(s/√n) | t·Sd̄ = t√(s₁²/n₁+s₂²/n₂) | t·Sd̄ = t(Sᵈ/√n)
Interval | x̅ ± t(s/√n) | d̄ ± t√(s₁²/n₁+s₂²/n₂) | d̅ ± t(Sᵈ/√n)
T−value | T = (x̅−µ₀)/(s/√n) | T = signal/noise = ((x̅₁−x̅₂)−D₀)/√(s₁²/n₁+s₂²/n₂) | T = (d̅−µᵈ)/(Sᵈ/√n)

Compare the result of T-statistics against interval estimate to see if null is rejected or no.
How to interpret the result:
There is a 95% probability that the interval contain the true mean difference d̄ between µ₁ and µ₂.
If the 95% interval contains 0 → there is not enough/sufficient evidence to conclude that there is a difference between the
means of the 2 populations (the difference may just be sampling error)
If the interval does not contain 0 → there is enough/sufficient evidence to conclude that there is a difference between the
means of the 2 populations (the difference is statistically significant rather than sampling error)
Comparing the T statistic against t critical is equivalent to comparing the p−value against α.
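A sketch of the independent-samples and matched-samples t tests with SciPy; the scenario borrows the clicker/food-reward wording from the APA example above, but all numbers are invented:

    import numpy as np
    from scipy import stats

    clicker = np.array([6.0, 7.5, 5.5, 8.0, 6.5, 7.0, 4.5, 6.0])
    food = np.array([5.0, 6.5, 4.0, 6.0, 5.5, 4.5, 5.0, 4.5])

    # Independent-samples t test (Welch's version, which does not assume equal variances)
    t_ind, p_ind = stats.ttest_ind(clicker, food, equal_var=False)

    # Matched/paired-samples t test (e.g., pre-test vs. post-test on the same subjects)
    pre = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0])
    post = np.array([12.0, 13.0, 10.0, 15.0, 11.5, 14.0])
    t_rel, p_rel = stats.ttest_rel(pre, post)

    print(t_ind, p_ind, t_rel, p_rel)    # reject H0 if p < α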

Population Proportion Hypothesis testing


17 out of 30 students said "yes" (binary nominal variable). Sample proportion: p̂ = x/n = 17/30
p = population proportion
if np>5 & nq>5 → turn to normal distribution (z)
Standard Error of proportion
Infinite population or n/N ≤ 0.05: seᵨ = √(pq/n); for 2 populations: √(p₁q₁/n₁ + p₀q₀/n₀)
Finite population or n/N > 0.05: multiply by the correction factor → seᵨ = √(pq/n)·√((N−n)/(N−1))
CI of proportion
CIᵨ=p±seᵨzcrit, θ±seᵨzcrit if 2 population (θ=p₁−p₀)
One−sample
z = (p̂ − p)/√(pq/n)
Excel: zcrit=NORM.S.INV(α)
p−value=NORM.S.DIST(z,TRUE)
two−sample
z = (p̂₁ − p̂₂) / (√(pq)·√(1/n₁+1/n₂)), where p is the pooled proportion of the two samples
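A sketch of the one- and two-sample proportion z tests following the formulas above; the hypothesised proportion p₀ = 0.50 and the two-sample counts are assumed for illustration:

    import numpy as np
    from scipy.stats import norm

    # One-sample: 17 of 30 students said "yes"; test H0: p = 0.50 (assumed p0)
    x, n, p0 = 17, 30, 0.50
    p_hat = x / n
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    print(z, 2 * (1 - norm.cdf(abs(z))))          # two-tailed p-value

    # Two-sample: pooled proportion under H0: p1 = p2 (invented counts)
    x1, n1, x2, n2 = 45, 100, 30, 90
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    z2 = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    print(z2, 2 * (1 - norm.cdf(abs(z2))))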

Chi-square test (for goodness of fit)


To determine if the difference between categorical percentages in each cell are significant or not.
To understand the role of random chance between 2 categorial variables
H₀: the observed pattern fits the given/expected distribution
Hα: the observed pattern does not fit the given/expected distribution
Compare observed data with expected data.
Chi−square statistics: ꭕ²=Σ(Oᵢ-Eᵢ)²/Eᵢ
Chi-square critical value based on α & df=number of categories – 1
If ꭕ²>critical value → reject null
(For independence)
H₀: 2 categorial variables are independent (by chance).
Hα: 2 categorial variables are not independent (related).

| Variable 1: Category x | Variable 1: Category y | Total
Variable 2: Category a | #1 | #2 | #1+#2
Variable 2: Category b | #3 | #4 | #3+#4
Total | #1+#3 | #2+#4 | #1+#2+#3+#4
Proportion of category x of variable 1 = (#1+#3)/(#1+#2+#3+#4)
Expected value for each cell: row total × column total / grand total on the contingency table
Chi−square statistics: ꭕ²=Σ(Oᵢ-Eᵢ)²/Eᵢ
df=(r-1)(c−1)
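A sketch of both chi-square tests with SciPy; the observed counts and the 2×2 table are invented:

    import numpy as np
    from scipy.stats import chisquare, chi2_contingency

    # Goodness of fit: do the observed counts fit the expected distribution?
    observed = np.array([18, 22, 20, 40])
    expected = np.array([25, 25, 25, 25])
    chi2_stat, p_gof = chisquare(f_obs=observed, f_exp=expected)
    print(chi2_stat, p_gof)                  # reject H0 if p < α

    # Test of independence on a 2x2 contingency table
    table = np.array([[30, 10],
                      [20, 40]])
    chi2_ind, p_ind, dof, expected_counts = chi2_contingency(table)
    print(chi2_ind, p_ind, dof)
    print(expected_counts)                   # row total * column total / grand total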

Samples Variance (Quality Control & Six Sigma)


Variance(risk): σ²= Σ(xᵢ-µ)²/N, s²= Σ(xᵢ-x̄)²/n-1 (average squared deviation from the mean)
Chi-square distribution (ꭕ²)
The distribution of sample variance (s²) of many samples of the same size from a normal population follows Chi-square
distribution, with a df=v=(n-1).
Different types of chi-square distributions based on the df=(n-1) like T distribution.
In the chi-square table the cumulative (upper-tail) probability runs from 0 on the right to 1 on the left. The curve never touches the x-axis.
Critical values: ꭕ².025(right) & ꭕ².975(left).
Use chi-square table to find critical values based on df=n-1 → ꭕ²₁≤ꭕ²≤ꭕ²₂ = ꭕ².975≤ꭕ²≤ꭕ².025
Test Statistics: ꭕ²=(n-1)s²/σ² → ꭕ².975≤(n-1)s²/σ²≤ꭕ².025 → ꭕ².975σ²≤(n-1)s²≤ꭕ².025σ²→
(n-1)s²/ꭕ².025≤σ²≤(n-1)s²/ꭕ².975

If n=12 →df=11
Critical Values: ꭕ².975≤ꭕ²≤ꭕ².025 → 3.82≤ꭕ²≤21.92
If s² → GE=25.89 & APL=73.30
Variance Test Statistics: APL: 36.78≤σ²≤211.07
GE: 12.99≤σ²≤74.55
Extra information:
GE: s = 5.09, 3.6% ≤ σ ≤ 8.63%
APL: s = 8.56, 6.06% ≤ σ ≤ 14.53%

Hypothesis tests for the variance


2 tailed test: H₀:σ²=σ₀², Hα:σ²≠σ₀² → H₀:ꭕ²₁ˍα/₂≤ꭕ²≤ꭕ²α/₂ = ꭕ².975≤ꭕ²≤ꭕ².025
Upper/right tailed test: H₀:σ²≤σ₀², Hα:σ²>σ₀² → H₀:ꭕ²≤ꭕ²α = ꭕ²≤ꭕ².05
Lower/left tailed test: H₀:σ²≥σ₀², Hα:σ²<σ₀² → H₀:ꭕ²≥ꭕ²₁ˍα = ꭕ²≥ꭕ².95
If ꭕ² in rejection area, null is rejected
The sample data does not offer sufficient evidence to conclude……..
The larger the sample size (≥10), the smaller the variance of the estimate and the narrower the confidence interval (up to a point of diminishing returns)

F-ratio test for 2 equal variance & F-distribution


The distribution of ratio of 2 random samples variance taken from 2 normal population follows F-distribution (only
right−tailed). It compares 2 sample variances against each other.
H₀:σ²ₓ=σ²ᵧ variance in population X is equal to variance of population Y, Hα:σ²ₓ≠σ²ᵧ
F-ratio Test statistics: F=S²ₓ/S²ᵧ or ratio of 2 chi−square objects
Larger variance on top, as a result always upper/right tailed
Use F-table, Excel or android app to find critical F-value by α, df₁=n-1 & df₂=n-1.
If test statistics larger than critical F-value (rejection area), null is rejected.
There is enough evidence that 2 sample variances are not equal with 95% CI.
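A sketch of the F-ratio test for two variances using scipy.stats.f; the two samples are invented:

    import numpy as np
    from scipy.stats import f

    x = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.7, 10.4, 11.0])
    y = np.array([10.0, 13.5, 8.2, 14.1, 9.0, 12.8, 7.9, 13.9])

    s2_x, s2_y = x.var(ddof=1), y.var(ddof=1)

    # Larger variance on top, so the test is always right-tailed
    F = max(s2_x, s2_y) / min(s2_x, s2_y)
    df1 = df2 = len(x) - 1
    F_crit = f.ppf(0.975, df1, df2)              # upper critical value for α = 0.05, two-sided
    p_value = min(2 * f.sf(F, df1, df2), 1.0)

    print(F, F_crit, p_value)                    # reject H0 of equal variances if F > F_crit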

ANOVA (>2 population)


Analysing the equality of means (≥ 3 populations) or subgroups by partitioning the source of variance (error vs.
conditions).
total variability = unexplained variability (error=SSE) + explained variability (conditions=SSC)
MSE: differences among scores within the groups (mean of sample variances). always estimates population variances
(σ²)
MSB: differences among sample means (mean of sample means*n). Estimate σ² only when the population means are
equal.
If MSB ≈ MSE (F ≈ 1) → population means are the same
If MSB > MSE → population means are not equal, but by how much?
F−ratio = MSB/MSE → calculate the p−value

One way ANOVA/Completely randomized design


1 variable (column)
H₀: µ₁=µ₂=….µₖ
There is no statistically significant difference between population means in all conditions.
Each data point = x; observations in each group = n
Group mean = x̄; total observations = N
Grand mean = x̿
Source | Sum of squares (SS) | df | Mean square | F-stat
Between | SSC = Σnᵢ(x̄ᵢ−x̿)² | dfᴄ = C−1 | MSC = SSC/dfᴄ | F = MSC/MSE
Within | SSE = Σ(xᵢ₁−x̄₁)² + Σ(xᵢ₂−x̄₂)² + … | dfᴇ = C(n−1) = N−C | MSE = SSE/dfᴇ |
Total | SST = SSC+SSE = Σ(x−x̿)² | dfᴛ = N−1 = dfᴄ+dfᴇ | |
Fcrit is based on the desired α, dfᴄ, and dfᴇ
F-ratio = variance between the groups / variance within the groups (among vs. around); values well above 1 point toward rejecting the null
If F-stat > Fcrit, the null is rejected → significant difference in the means of the samples
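A sketch of a one-way ANOVA with scipy.stats.f_oneway, plus the same F computed by hand from SSC and SSE as in the table above; the three groups are invented:

    import numpy as np
    from scipy.stats import f_oneway

    g1 = np.array([23.0, 25.0, 21.0, 24.0, 22.0])
    g2 = np.array([28.0, 30.0, 27.0, 29.0, 31.0])
    g3 = np.array([22.0, 24.0, 23.0, 25.0, 21.0])

    F_stat, p_value = f_oneway(g1, g2, g3)
    print(F_stat, p_value)                     # reject H0 (equal means) if p < α

    # The same F by hand: MSC / MSE
    groups = [g1, g2, g3]
    all_x = np.concatenate(groups)
    grand_mean = all_x.mean()
    ssc = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between-groups SS
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within-groups SS
    msc = ssc / (len(groups) - 1)              # df_C = C - 1
    mse = sse / (len(all_x) - len(groups))     # df_E = N - C
    print(msc / mse)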

2 ways/factors ANOVA without replication


2 variables(rows/blocks variation, columns/treatments variation)
Completely randomized block design
SST=SSC+SSE+SSB
Source | Sum of squares (SS) | df | Mean square | F-stat
Columns/groups | SSC = Σ(x̄−x̿)²·B | dfᴄ = C−1 | MSC = SSC/dfᴄ | F = MSC/MSE
Rows/Blocks | SSB = Σ(x̄−x̿)²·C | dfв = B−1 | MSB = SSB/dfв | F = MSB/MSE
Within/error | SSE = SST−SSC−SSB | dfᴇ = (C−1)(B−1) | MSE = SSE/dfᴇ |
Total | SST = SSC+SSB+SSE = Σ(x−x̿)² | dfᴛ = N−1 = dfᴄ+dfв+dfᴇ | MST = SST/dfᴛ |
If F-stat>Fcrit or P-value<α → reject the null (significant difference)

2 ways/factors ANOVA with replication:


Multiple measurement for each factor (row or column)
data visualization:
Always look for a significant interaction on the marginal means graph first. If there is an interaction, the main factors cannot be analyzed
separately.
Interaction: when the effect of one factor changes for different levels of the other factor → cross on marginal mean
graph (independent variable on horizontal line and dependent variable on vertical line)
Marginal mean graph interpretation (watch playlist 11 video 8)

Post Hoc Analysis (Fisher’s LSD Procedure [least square difference])


To know where the differences are located if the ANOVA F test is significant. In other words, it tells us which of the pairwise
comparisons contain significant differences.
Population pairs: ABCD → AB, AC, AD, BC, BD, CD [C(4,2)]
Always visualize your data first (box plot)
H₀: µᵢ=µⱼ, Hα: µᵢ≠µⱼ
T = (x̄ᵢ−x̄ⱼ)/√(MSE(1/nᵢ+1/nⱼ)), df = nᴛ−C, where nᴛ = total number of observations across the groups
x̄ᵢ−x̄ⱼ = difference matrix, √(MSE(1/nᵢ+1/nⱼ)) = standard error of the differences
P-value based on the t-stat and df → if < α → reject the null
Tcritical based on α/₂ and df → if |T| > Tcritical → reject the null
If the interval does not contain 0 → significant
Least significant difference = LSD = Tα/₂·√(MSE(1/nᵢ+1/nⱼ)): if |x̄ᵢ−x̄ⱼ| ≥ LSD → reject the null

Data Science
The process of inspecting, cleansing, transforming and modeling data to uncover patterns and trends, extract
knowledge and insight, and support decision−making using scientific methods
• Better & faster decision making
• Cost reduction
• Better marketing and product analysis
• Organization analysis
Business Analyst
Examine large and different types of data to uncover hidden patterns, correlations and other insights to make better
decisions.
Data Science Pathway
Planning
Define the objective of the problem
Organize resources like data, software, people
Coordinate people
Schedule project
Data Preparation
Get data
Clean data
Explore data
Refine data
Modeling
Create model
Validate model
Evaluate model
Refine model
Follow Up
Present model
Deploy model
Revisit model
Archive assets

Machine Learning
Provides machines with the ability to learn automatically and improve from experience without being explicitly programmed
Why ML
1. Analyse large amounts of data
2. Improve decision making
3. Uncover trends and patterns in data
4. Solve complex problems
Algorithm
A set of rules and statistical methods used to learn patterns from data
Target Variable = Dependent Variable
Predictor Variables = Independent Variables
Data Life cycle
1. Business Understanding: (problem, objectives, variables to be predicted and input data)
2. Data Acquisition: (types, source, how to obtain, how to store)
3. Data processing: (cleaning)
4. Exploratory Data Analysis: (understand the pattern, retrieve useful insight, form hypotheses)
5. Modeling
6. Model Evaluation & Optimization
7. Deployment & Prediction
Simple Linear Regression
With only 1 variable, the best prediction for next measurement is the mean[slope=0, ȳ=b₀].
With 1 dependent variable, the only sum of squares is due to the residuals (the distance between each observation and the mean):
Σ(yᵢ − ȳ)² = SSE = SST
With 2 variables, SST stays the same but SSE should decrease significantly. SST = SSE + SSR (regression)
The goal of linear regression is to create a linear model to minimize sum of squares residual (SSE)
We always compare our best linear regression models with 1 dependent variable model.
Uses of LR
1. Determining the strength of predictors
2. Forecasting an effect
3. Trend forecasting

Natural language Processing


Image Recognition
Voice Recognition
Recommendation Engine
Neural Network

How do data scientists use statistics?


Data scientists work closely with business stakeholders to understand their goals and determine how data can be used
to achieve those goals. They design data modeling processes, create algorithms and predictive models to extract the data
the business needs, then help analyze the data and share insights with peers. While each project is different, the process
for gathering and analyzing data generally follows the below path:
Ask the right questions to begin the discovery process.
Acquire data.
Process and clean the data.
Integrate and store data.
Initial data investigation and exploratory data analysis.
Choose one or more potential models and algorithms
Apply data science methods and techniques, such as machine learning, statistical modeling, and artificial intelligence.
Measure and improve results.
Present final results to stakeholders.
Make adjustments based on feedback.
Repeat the process to solve a new problem.

Most data scientists use the following core skills in their daily work:
Statistical analysis: Identify patterns in data. This includes having a keen sense of pattern detection and anomaly
detection.
Machine learning: Implement algorithms and statistical models to enable a computer to automatically learn from data.
Computer science: Apply the principles of artificial intelligence, database systems, human/computer interaction,
numerical analysis, and software engineering.
Programming: Write computer programs and analyze large datasets to uncover answers to complex problems. Data
scientists need to be comfortable writing code working in a variety of languages such as Java, R, Python, and SQL.
Data storytelling: Communicate actionable insights using data, often for a non-technical audience.
Business intuition: Connect with stakeholders to gain a full understanding of the problems they’re looking to solve.
Analytical thinking. Find analytical solutions to abstract business issues.
Critical thinking: Apply objective analysis of facts before coming to a conclusion.
Inquisitiveness: Look beyond what’s on the surface to discover patterns and solutions within the data.
Interpersonal skills: Communicate across a diverse audience across all levels of an organization.

Important instances where data scientists use statistics


Design Experiments to Inform Product Decisions.
Data scientists use Frequentist Statistics and experimental design to determine whether or not the difference in the
performance of two types of products is significant enough to take action. This application helps data scientists to understand the
experimental results especially when there are multiple metrics being measured.
Models to Predict the Signal
Using Regression, Classification, Time series analysis, and causal analysis, data scientists can tell the reason behind a
change in the rate of sales. They use these techniques to predict the sales of upcoming months and point out the
relevant trends to be careful of.
Turning Big Data Into Big Picture
Consider a large group of customers buying products. The data about each person’s shopping list is worthless if it stays
like that. Data scientists can label each customer and put similar ones to a group and understand the buying pattern. It
helps to identify how each group of people affect the business development. Statistic techniques such as clustering,
latent variable analysis, and dimensionality reduction are used to achieve this.
Understand User Engagement, Retention, Conversion and Leads
It is known that many customers are lost between the sign-up stage and the actual regular-use stage. Data science uses
techniques such as regression, latent variable analysis, causal effect analysis, and survey design to find out the reason
behind this loss. It also identifies the successful leads the company is using to engage more customers.
Predicting Customer Needs
Statistical techniques such as latent variable analysis, predictive modeling, clustering, and dimensionality reduction help
data scientists to predict the items a customer might need next. A matrix of users and their interactions with the
company product is all that is needed to obtain this.
Telling the story with Data
It is the end product of all the operations of data scientists. The data scientist acts as the ambassador between the company and the data.
All the findings from the data should be properly communicated to the rest of the company without losing any fidelity.
Rather than summarizing the numbers, a data scientist has to explain why each number is significant. To do that
properly, data visualization techniques from statistics are used.

Data visualisation
Maximize how quickly and accurately people decode information from graphics.
Aims
Pre−attentive cognition: color coding
Accuracy: bar chart vs. pie chart
Pre−requisites
Engagement
Understanding
Memorability
Emotional Connection

Frequency Table
Frequency Bar Chart & Relative Frequency = frequency/n -> 28/100=0.28
Grouped Data: No actual data
Frequency Histogram: X=variable of interest, Y=frequency or relative frequency, no gaps between bars
(the frequency of values over certain intervals called bins)
Stem and leaf plot
Box and whisker plot
P-P plot: compare cumulative probability of our empirical data with an ideal distribution (fall on a straight line=normal
distribution)
Q-Q plot: compare quantile of our empirical data with an ideal distribution(fall on a straight line=normal distribution)
Types: skewed, symmetric, bimodal, uniform, no pattern

Big Data
Variety
Velocity
Volume
STEPS OF DATA EXPLORATION AND PREPARATION
Remember that the quality of your inputs decides the quality of your output. So, once you have got your business hypothesis
ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.
Below are the steps involved to understand, clean and prepare your data for building your predictive model:
Variable Identification
Univariate Analysis
Bi-variate Analysis
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Finally, we will need to iterate over steps 4 - 7 multiple times before we come up with our refined model. Let's now
study each stage in detail:
Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.
Let's understand this step more clearly by taking an example.
Example:- Suppose, we want to predict, whether the students will play cricket or not (refer below data set). Here you
need to identify predictor variables, target variable, data type of variables and category of variables.

Below, the variables have been defined in different categories:


Univariate Analysis
At this stage, we explore variables one by one. Method to perform uni-variate analysis will depend on whether the
variable type is categorical or continuous. Let's look at these methods and statistical measures for categorical and
continuous variables individually:
Continuous Variables:- In case of continuous variables, we need to understand the central tendency and spread of the
variable. These are measured using various statistical metrics and visualization methods as shown below:

Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will
look at methods to handle missing and outlier values. To know more about these methods, you can refer
course descriptive statistics from Udacity.
Categorical Variables:- For categorical variables, we'll use frequency table to understand distribution of each category.
We can also read it as the percentage of values under each category. It can be measured using two
metrics, Count and Count%, against each category. A bar chart can be used for visualization.
Bi-variate Analysis
Bivariate Analysis finds out the relationship between two variables. Here, we look for association and dissociation
between variables at a pre-defined significance level. We can perform bivariate analysis for any combination of
categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous and
Continuous & Continuous. Different methods are used to tackle these combinations during analysis process.
Let's understand the possible combinations in detail:
Continuous & Continuous: While doing bivariate analysis between two continuous variables, we should look at scatter
plot. It is a nifty way to find out the relationship between two variables. The pattern of scatter plot indicates the
relationship between variables. The relationship can be linear or nonlinear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between
them. To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
-1: perfect negative linear correlation
+1:perfect positive linear correlation and
0: No correlation
Correlation can be derived using following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have functions or functionality to identify correlation between variables. In Excel, the function CORREL() is used
to return the correlation between two variables and SAS uses the procedure PROC CORR to identify the correlation. These
functions return the Pearson correlation value to identify the relationship between two variables:
Covariance
Show the linear relationship of 2 variable (just provide information on direction: positive, negative or zero, no value on
strength)
Sample: cov(x,y) = sₓᵧ = Σ(xᵢ-x̄)(yᵢ-ȳ)/(n−1)
Population: cov(x,y) = σₓᵧ = Σ(xᵢ-µₓ)(yᵢ-µᵧ)/N
Excel uses the population formula while SPSS uses the sample formula; multiply the Excel result by N/(N−1) to convert.

In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
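A minimal NumPy sketch of the covariance and correlation formulas above (these invented x/y values give a different r than the 0.65 quoted from the original example):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([1.5, 3.0, 5.5, 7.0, 10.5])

    # Sample covariance (np.cov uses the n-1 denominator by default)
    cov_xy = np.cov(x, y)[0, 1]

    # Correlation = Covariance(X,Y) / sqrt(Var(X) * Var(Y)), like Excel's CORREL
    r = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
    print(cov_xy, r, np.corrcoef(x, y)[0, 1])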
Categorical & Categorical: To find the relationship between two categorical variables, we can use following methods:
Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows
represent the categories of one variable and the columns represent the categories of the other variable. We show the count
or count% of observations available in each combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of Two-way table.
Chi-Square Test: This test is used to derive the statistical significance of relationship between the variables. Also, it tests
whether the evidence in the sample is strong enough to generalize that the relationship for a larger population as well.
Chi-square is based on the difference between the expected and observed frequencies in one or more categories in the
two-way table. It returns probability for the computed chi-square distribution with the degree of freedom.
Probability of 0: It indicates that both categorical variables are dependent
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The
chi-square test statistic for a test of independence of two categorical variables is found by:

where O represents the observed frequency and E is the expected frequency under the null
hypothesis, computed by E = (row total × column total) / sample size. From the previous two-way table, the expected count for
product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for
Product category (2) and then dividing by the sample size (81). This procedure is conducted for each cell. Statistical
measures used to analyze the power of the relationship are:
Cramer's V for Nominal Categorical Variable
Mantel-Haenszel Chi-Square for ordinal categorical variable.
Different data science languages and tools have specific methods to perform the chi-square test. In SAS, we can use the CHISQ option with PROC FREQ to perform this test.
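Outside SAS, a minimal sketch of the same test can be written in Python with scipy.stats.chi2_contingency; the two-way table of counts below is hypothetical, not the table from the example above:
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical two-way table: rows = product categories, columns = size categories
observed = np.array([[20, 15, 10],
                     [10, 16, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # p_value < 0.05 -> relationship is significant at 95% confidence
print(expected)             # expected counts under the null hypothesis of independence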

Categorical & Continuous: While exploring the relation between a categorical and a continuous variable, we can draw box plots for each level of the categorical variable. If the levels are small in number, the plots will not show statistical significance. To assess statistical significance we can perform a Z-test, t-test or ANOVA.
Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other or not. If the probability (p-value) of the test statistic is small, then the difference between the two averages is significant. The t-test is very similar to the z-test, but it is used when the number of observations for both categories is less than 30.
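As a small illustration (with two made-up groups rather than data from the text), a two-sample t-test can be run in Python with scipy.stats.ttest_ind; for large samples the t-test and z-test give practically the same answer:
from scipy.stats import ttest_ind

# hypothetical continuous measurements for two levels of a categorical variable
group_a = [23.1, 25.4, 22.8, 26.0, 24.3, 23.9]
group_b = [27.2, 28.1, 26.5, 29.0, 27.8, 28.4]

t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)   # a small p-value means the two group means differ significantly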
ANOVA: It assesses whether the averages of more than two groups are statistically different.
Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by comparing the weights of the 5 groups of 4 men each, as in the sketch below.
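A minimal sketch of this exercise example in Python, assuming made-up weights for the 5 groups of 4 men, uses scipy.stats.f_oneway for a one-way ANOVA:
from scipy.stats import f_oneway

# hypothetical recorded weights for the five exercise groups
g1 = [72, 75, 71, 74]
g2 = [78, 80, 79, 77]
g3 = [69, 70, 68, 71]
g4 = [74, 73, 75, 76]
g5 = [81, 83, 80, 82]

f_stat, p_value = f_oneway(g1, g2, g3, g4, g5)
print(f_stat, p_value)   # a small p-value means at least one group mean differs from the others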
Till here, we have understood the first three stages of data exploration: variable identification, univariate analysis and bivariate analysis. We also looked at various statistical and visual methods to identify the relationships between variables. Now, we will look at the methods of missing value treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.

MISSING VALUE TREATMENT


Why is missing value treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analysed the behavior and relationships with the other variables correctly. It can lead to wrong predictions or classifications.

In the example above, the left scenario shows the data before the missing values are treated: the inference from that data set is that the chance of playing cricket is higher for males than for females. The second table shows the data after treatment of the missing values (based on gender); there we can see that females have a higher chance of playing cricket compared to males.

Why does my data have missing values?


We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons for the occurrence of these missing values. They may occur at two stages:
Data extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
Missing completely at random: This is the case when the probability of a value being missing is the same for all observations. For example: respondents in a data collection process decide to declare their earnings only after tossing a fair coin. If a head occurs, the respondent declares his / her earnings, otherwise not. Here each observation has an equal chance of having a missing value.
Missing at random: This is the case when a value is missing at random but the missing ratio varies for different values / levels of other input variables. For example: when collecting data on age, females have a higher rate of missing values compared to males.
Missing that depends on unobserved predictors: This is the case when the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic test causes discomfort, there is a higher chance of dropping out of the study. This missing value is not at random unless we have included "discomfort" as an input variable for all patients.
Missing that depends on the missing value itself: This is the case when the probability of a missing value is directly correlated with the missing value itself. For example: people with very high or very low incomes are more likely not to report their earnings.

What are the methods to treat missing values?


Deletion: It is of two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
In pair-wise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis; the disadvantage is that it uses a different sample size for different variables.


Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
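A minimal pandas sketch of the two deletion strategies on a hypothetical data frame (column names are illustrative only): dropna() implements list-wise deletion, while corr() works pair-wise, using every row where both columns are present:
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [50, 62, np.nan, 75],
                   "score": [3.2, 4.1, 3.8, np.nan]})

listwise = df.dropna()                  # drop every row that has any missing value
pairwise_corr = df.corr(min_periods=2)  # each pair of variables uses its own sample size
print(listwise)
print(pairwise_corr)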
Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or the mode (qualitative attribute) of all known values of that variable. It can be of two types:
Generalized imputation: In this case, we calculate the mean or median of all non-missing values of that variable and then replace the missing values with it. In the table above, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing value with it.
Similar case imputation: In this case, we calculate the average of the non-missing values for gender "Male" (29.75) and "Female" (25) separately and then replace the missing values based on gender. For "Male", we replace missing values of Manpower with 29.75 and for "Female" with 25.
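A short pandas sketch of both imputation types, using hypothetical "Gender" / "Manpower" values (the made-up numbers will not reproduce the 28.33 / 29.75 / 25 figures quoted above):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Female", "Male"],
                   "Manpower": [30, np.nan, 25, np.nan, 29]})

# Generalized imputation: overall mean of the non-missing values
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: mean computed within each gender group
df["Manpower_sim"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean"))
print(df)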
Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set, with missing values, is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to do this. There are two drawbacks to this approach:
The model-estimated values are usually more well-behaved than the true values
If there is no relationship between the other attributes in the data set and the attribute with missing values, the model will not be precise at estimating the missing values.
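A rough sketch of this idea with scikit-learn's LinearRegression, assuming a hypothetical data frame in which "target" has missing values; any other regression or classification model could be substituted:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6],
                   "x2": [2, 1, 4, 3, 6, 5],
                   "target": [10, 12, np.nan, 18, np.nan, 24]})

train = df[df["target"].notna()]   # rows where the value is present -> training set
test = df[df["target"].isna()]     # rows where the value is missing -> "test" set

model = LinearRegression().fit(train[["x1", "x2"]], train["target"])
df.loc[df["target"].isna(), "target"] = model.predict(test[["x1", "x2"]])
print(df)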
KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using a given number of neighbours, i.e. the observations that are most similar to the observation whose value is missing. The similarity of two observations is determined using a distance function. The method has certain advantages and disadvantages.
Advantages:
k-nearest neighbour can predict both qualitative and quantitative attributes
Creation of a predictive model for each attribute with missing data is not required
Attributes with multiple missing values can be easily treated
The correlation structure of the data is taken into consideration
Disadvantage:
The KNN algorithm is very time-consuming when analyzing large databases: it searches through the entire dataset looking for the most similar instances.
The choice of the k-value is very critical. A higher value of k would include neighbours that are significantly different from the observation we need, whereas a lower value of k means missing out on significant neighbours.
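One possible implementation is scikit-learn's KNNImputer; the sketch below uses a hypothetical data frame and k = 2:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"height": [160, 172, np.nan, 181, 168],
                   "weight": [55, 70, 65, np.nan, 61]})

imputer = KNNImputer(n_neighbors=2)   # the choice of k is critical, as noted above
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)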
After dealing with missing values, the next task is to deal with outliers. Often, we tend to neglect outliers while building models; this is a discouraged practice, because outliers tend to skew your data and reduce accuracy. Let's learn more about outlier treatment.
TECHNIQUES OF OUTLIER DETECTION AND TREATMENT
What is an Outlier?
Outlier is a term commonly used by analysts and data scientists because outliers need close attention, otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find out that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than the rest of the population, so these two observations will be seen as outliers.

What are the types of Outliers?


Outliers can be of two types: univariate and multivariate. Above, we have discussed an example of a univariate outlier. These outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; in order to find them, you have to look at distributions in multiple dimensions.
Let us understand this with an example. Say we are studying the relationship between height and weight and have both the univariate and the bivariate distributions for height and weight. Looking at the box plot, we do not have any outlier (using the most common rule of 1.5*IQR above the third or below the first quartile). Looking at the scatter plot, however, we have two values below and one above the average in a specific segment of weight and height.

What causes Outliers?


Whenever we come across outliers, the ideal way to tackle them is to find out the reason for having these outliers. The method to deal with them then depends on the reason for their occurrence. Causes of outliers can be classified into two broad categories:
Artificial (error) / non-natural
Natural
Let's understand the various causes of outliers in more detail:
Data Entry Errors: Human errors, such as errors caused during data collection, recording or entry, can cause outliers in data. For example: the annual income of a customer is $100,000 and the data entry operator accidentally puts an additional zero in the figure. The income becomes $1,000,000, which is 10 times higher. Evidently, this will be an outlier value when compared with the rest of the population.
Measurement Error: This is the most common source of outliers. It is caused when the measurement instrument turns out to be faulty. For example: there are 10 weighing machines, 9 of which are correct and 1 faulty. Weights measured on the faulty machine will be higher / lower than those of the rest of the people in the group and can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example: in a 100m sprint of 7 runners, one runner missed concentrating on the 'Go' call, which caused him to start late. Hence his run time was longer than that of the other runners, and his total run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example: teens would typically under-report the amount of alcohol they consume, and only a fraction of them would report the actual value. Here the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that
some manipulation or extraction errors may lead to outliers in the dataset.
Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players
in the sample. This inclusion is likely to cause outliers in the dataset.
Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: in my last assignment with one of the renowned insurance companies, we noticed that the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.

What is the impact of Outliers on a dataset?


Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:
They increase the error variance and reduce the power of statistical tests
If the outliers are non-randomly distributed, they can decrease normality
They can bias or influence estimates that may be of substantive interest
They can also violate the basic assumptions of regression, ANOVA and other statistical models
To understand the impact more deeply, let's take an example to check what happens to a data set with and without outliers. Example:

As you can see, the data set with the outlier has a significantly different mean and standard deviation. In the first scenario, the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, like the box plot, histogram and scatter plot (above, we have used the box plot and scatter plot). Some analysts also use various rules of thumb to detect outliers. Some of them are:
Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
Data points three or more standard deviations away from the mean are considered outliers
Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding
Bivariate and multivariate outliers are typically measured using an index of influence, leverage or distance. Popular indices such as Mahalanobis' distance and Cook's D are frequently used to detect outliers.
In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.
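A small Python sketch (an alternative to the SAS procedures above) of the 1.5 x IQR rule and the 3-standard-deviation rule, on made-up values:
import numpy as np

values = np.array([3, 4, 4, 5, 5, 5, 6, 6, 6, 6,
                   7, 7, 7, 7, 8, 8, 8, 9, 9, 60])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) >= 3]

print(iqr_outliers)   # flagged by the IQR rule
print(z_outliers)     # flagged by the 3-standard-deviation rule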
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the decision tree algorithm deals with outliers well because it bins the variable. We can also use the process of assigning weights to different observations.
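A brief sketch of two of these treatments on a made-up income variable: a log transformation to damp the extreme value, and capping (winsorizing) at the 5th / 95th percentiles:
import numpy as np

income = np.array([28, 31, 35, 40, 42, 45, 50, 400], dtype=float)

log_income = np.log(income)                  # valid only for positive values
low, high = np.percentile(income, [5, 95])
capped_income = np.clip(income, low, high)   # values outside the range are capped

print(np.round(log_income, 2))
print(capped_income)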

Imputing: As with the imputation of missing values, we can also impute outliers, using mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the rest of the data as two different groups, build an individual model for each group and then combine the output.
Till here, we have learnt about the steps of data exploration, missing value treatment and techniques of outlier detection and treatment. These three stages will make your raw data better in terms of information availability and accuracy. Let's now proceed to the final stage of data exploration: feature engineering.
THE ART OF FEATURE ENGINEERING
What is Feature Engineering?
Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any
new data here, but you are actually making the data you already have more useful.
For example, let's say you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may not be able to extract meaningful insights from the data. This is because footfall is less affected by the day of the month than by the day of the week. This information about the day of the week is implicit in your data; you need to bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.
What is the process of Feature Engineering?
You perform feature engineering once you have completed the first five steps of data exploration: variable identification, univariate analysis, bivariate analysis, missing value imputation and outlier treatment. Feature engineering itself can be divided into two steps:
Variable transformation
Variable / feature creation
These two techniques are vital in data exploration and have a remarkable impact on the power of prediction. Let's understand each of these steps in more detail.
What is Variable Transformation?
In data modelling, transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square / cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.
Let’s look at the situations when variable transformation is useful.

When should we use Variable Transformation?


Below are the situations where variable transformation is a requisite:
When we want to change the scale of a variable or standardize its values for better understanding. While this transformation is a must if you have data on different scales, it does not change the shape of the variable's distribution.
When we want to transform complex non-linear relationships into linear relationships. The existence of a linear relationship between variables is easier to comprehend than a non-linear or curved relation, and such transformations also improve prediction. A scatter plot can be used to find the relationship between two continuous variables. Log transformation is one of the techniques commonly used in these situations.

A symmetric distribution is preferred over a skewed distribution, as it is easier to interpret and to generate inferences from. Some modeling techniques require a normal distribution of variables, so whenever we have a skewed distribution we can use transformations that reduce skewness. For a right-skewed distribution, we take the square / cube root or logarithm of the variable, and for a left-skewed distribution, we take the square / cube or exponential of the variable.
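A small sketch with made-up right-skewed data, showing how the square root and log transformations reduce skewness (measured here with scipy.stats.skew):
import numpy as np
from scipy.stats import skew

x = np.array([1, 2, 2, 3, 3, 4, 5, 8, 15, 40], dtype=float)

print(skew(x))            # strongly positive: right skewed
print(skew(np.sqrt(x)))   # smaller skew after the square root
print(skew(np.log(x)))    # usually the strongest reduction for right skew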

Variable transformation is also done from an implementation point of view (human involvement). Let's understand it more clearly. In one of my projects on employee performance, we found that age has a direct correlation with the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint, launching an age-based programme might present a challenge. However, categorizing the sales agents into three age buckets of <30 years, 30-45 years and >45 years and then formulating three different strategies for each group is a judicious approach. This categorization technique is known as binning of variables.
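A minimal sketch of this age-bucket example with pandas.cut; the ages and bin edges below are illustrative:
import pandas as pd

ages = pd.Series([24, 28, 33, 41, 47, 52])
age_group = pd.cut(ages, bins=[0, 30, 45, 120], labels=["<30", "30-45", ">45"])
print(age_group)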

What are the common methods of Variable Transformation?


There are various methods used to transform variables. As discussed, some of them include square root, cube root, logarithm, binning, reciprocal and many others. Let's look at these methods in detail by highlighting the pros and cons of each transformation method.
Logarithm: Taking the log of a variable is a common transformation method used to change the shape of the variable's distribution on a distribution plot. It is generally used for reducing the right skewness of variables. However, it cannot be applied to zero or negative values.
Square / Cube root: The square and cube root of a variable have a sound effect on the variable's distribution, although not as significant an effect as the logarithmic transformation. The cube root has its own advantage: it can be applied to negative values as well as zero. The square root can be applied to positive values including zero.
Binning: It is used to categorize variables. It is performed on original values, percentiles or frequencies, and the choice of categorization technique is based on business understanding. For example, we can categorize income into three categories, namely high, average and low. We can also perform co-variate binning, which depends on the values of more than one variable.

What is Feature / Variable Creation & its Benefits?


Feature / variable creation is a process of generating new variables / features based on existing variable(s). For example, say we have date (dd-mm-yy) as an input variable in a data set. We can generate new variables like day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight the hidden relationships in a variable:
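As a sketch, assuming a hypothetical "date" column, pandas can derive such variables directly:
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-01-04", "2021-02-14", "2021-03-21"])})

df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["weekday"] = df["date"].dt.day_name()
print(df)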

There are various techniques to create new features. Let's look at some of the commonly used methods:
Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods. Let's look at it through the "Titanic - Kaggle competition". In this data set, the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we decide which variables to create? Honestly, this depends on the business understanding of the analyst, his curiosity and the set of hypotheses he might have about the problem. Methods such as taking the log of variables, binning variables and other methods of variable transformation can also be used to create new variables.
Creating dummy variables: One of the most common applications of dummy variables is converting a categorical variable into numerical variables. Dummy variables, also called indicator variables, make it possible to use a categorical variable as a predictor in statistical models. A dummy variable takes the values 0 and 1. Take the variable 'gender': we can produce two variables, namely "Var_Male" with value 1 (male) and 0 (not male), and "Var_Female" with value 1 (female) and 0 (not female). We can also create dummy variables for a categorical variable with more than two classes, using n or n-1 dummy variables.
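A short sketch of the gender example with pandas.get_dummies (the data frame is hypothetical); passing drop_first=True keeps n-1 dummies when that is preferred for modelling:
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

dummies_all = pd.get_dummies(df["gender"], prefix="Var")           # Var_Female, Var_Male
dummies_n_minus_1 = pd.get_dummies(df["gender"], drop_first=True)  # n-1 columns
print(dummies_all)
print(dummies_n_minus_1)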

Why companies invest in AI & ML


More connected devices
Low cost of storing data
Low cost of computation
