Statistics For Data Science
Techniques and procedures for collecting, presenting, analysing, and interpreting data, and for making decisions based on data.
Population:
• Set of all units of interest for a particular study
• Needs to be clearly identified
• Real or theoretical
• Vary in size
• Characterized by parameters (census)
Sample (representative vs. biased)
• Subset of units selected from a population
• Represent the population
• Vary in size
• Characterized by statistics
Sampling
Draw inferences about some population(s) of interest based on observations of just a subset or sample from the whole
population.
Sampling Error
• Sample statistics are used as a basis for drawing conclusions about population parameters
• Sample provides limited information about population
• A sample is not expected to give a perfectly accurate picture of the whole population
• Usually there is a discrepancy between a sample statistic and the corresponding population parameter called
SAMPLING ERROR
Types of Sampling
Random (Probability): allows generalization beyond the sample; data may be difficult to obtain and expensive to collect
• Simple
allows generalizations (ensure representativeness)
rarely used (no access to the whole population list, all selected samples may not participate)
sample may not capture enough elements of subgroups
suitable for small populations
• Systematic (Marketing)
easier than random (no good for heterogeneous population)
offers approximation of random sampling
• Stratified (Geographical)
ensure subgroup representation
• Cluster
often conducted in multiple stages
Non-random (Non-Probability): builds knowledge in preparation for a later random sample; information cannot be generalized
beyond the sample.
• Convenience
o Handpicked
selected with particular purpose in mind
selected cases meet particular criteria (typical, show wide variance, represent expertise, etc.)
o Snowball
used for populations not easily identified or assessed
sample is built through referrals
no guarantee of representativeness
• Purposive
o Volunteer
highly convenient
not likely to be representative
no guarantee of representativeness
However, it can credibly represent the population if:
1. selection is done with the goal of representativeness in mind
2. strategies are used to ensure samples match population characteristics.
Data Management
• Data dictionary/Codebook
o First step in data management and preparation
o Describes the content, structure and format for each item in the collected data and conversion rules for
items into variables
• Data validation rules
Answers to related questions should be consistent with each other
• Recoding data
Continuous into categorical
Conceptual: based on meaningful thresholds (90−100=A+, A, A−….)
Improving the properties of the data: when the distribution is extremely skewed, like number of specialist visits (to
normalize the data)
Categorical into new categorical
Conceptual reasons: too many categories, or the variable is used as a proxy for one with a smaller number of categories
Improving the properties of the data: usually needed when some categories have few responses
• Computing total scale scores
When you need to compute total scores for a scale, ALWAYS conduct internal consistency analysis first
This is your evidence for reliability of these scores
Statistic used for this purpose: Cronbach’s alpha
Rule for interpretation: Cronbach’s alpha>.70 indicates that reliable total scores can be created from a set of
variables
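A minimal Python sketch of the Cronbach's alpha calculation (α = k/(k−1)·(1 − Σ item variances / total-score variance)); the sample responses below are made up for illustration:
```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a 2-D array (rows = respondents, columns = scale items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 3-item scale answered by 5 respondents
scores = [[4, 5, 4], [2, 3, 2], [5, 4, 5], [3, 3, 4], [4, 4, 4]]
print(cronbach_alpha(scores))  # > .70 suggests reliable total scores can be created
```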
• Handling missing data
How much missing data does each case have?
How much missing data does each variable have?
What are the reasons for missing data?
Decisions (need to be explained and justified):
Eliminate cases that have too much missing data
Eliminate variables that have too much missing data
Imputation = Creating fake data
Does not help when the data is garbage!
Methods of different complexity
Work well when there is not too much missing (<10% per variable) and when the missing pattern can be
considered random
Allows maximizing the amount of data used in the analyses
List-wise deletion – analytic strategy in which a case that has a missing value on any of the variables is removed
from all analyses
Case-wise (pair-wise) deletion – analytic strategy in which a case that has a missing value on one of the variables is removed
only from the analyses that involve that variable
Both methods reduce the sample size used in the analyses
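A short pandas sketch of these checks and decisions; the small data frame and its column names are hypothetical:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 61_000]})

print(df.isna().mean())           # fraction of missing values per variable
print(df.isna().mean(axis=1))     # fraction of missing values per case

complete = df.dropna()            # list-wise deletion: drop any case with a missing value
imputed = df.fillna(df.median())  # simple median imputation; only sensible when little is missing
```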
• Checking for outliers
Unusual data points located far from other values of the variable
Should be investigated carefully
May indicate faulty data, erroneous procedures, areas where a certain theory might not be valid
Often contain valuable information about the process under investigation
ALWAYS identify and discuss
NEVER ignore
TRY to understand
How to deal with outliers
Data Transformation: Performing a mathematical operation on scores to reduce the impact of outliers. For
example, taking the square root, reciprocal, or logarithm of all the scores has this effect.
shrink larger values more than smaller values
may affect interpretation
change the relationship between the variables
often require non-negative data
Winsorizing: Replacing a fixed number of extreme scores with the score that is closest to them in the tail of the
distribution.
Trimming: Removing a fixed percentage of extreme scores from each of the tails of a distribution (e.g., omitting
the two highest and two lowest scores).
Deletion: use only if you find outliers are legitimate errors that can't be corrected
use only as a last resort
use when they lie so far outside the range of the remainder of the data that they distort statistical inferences
when in doubt, you can report model results both with and without outliers to see how much they change
Accommodation: use methods that are robust in the presence of outliers (nonparametric)
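A brief sketch of transformation, winsorizing, and trimming using NumPy and SciPy's winsorize; the sample values (with 42 as the obvious outlier) are made up:
```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([3, 4, 4, 5, 5, 6, 6, 7, 42])      # 42 is an obvious outlier

logged = np.log(x)                      # transformation: shrinks large values more than small ones
wins = winsorize(x, limits=[0.1, 0.1])  # winsorizing: replace the extreme 10% in each tail
trimmed = np.sort(x)[1:-1]              # trimming: drop the single lowest and highest score

print(logged.round(2), np.asarray(wins), trimmed)
```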
Descriptive Statistics
Methods to organize, summarize, describe and understand data
Univariate Descriptive Statistics
Purposes:
describe distributions of scores
• shape
• central tendency
• spread
Tools:
numerical (mean, median, mode, range, variance, SD)
graphical (histogram, bar graph)
tabular (frequency table): a summary table in which data are arranged into classes and frequencies.
Bivariate Distributions
To find the relationship between 2 variables
Methods (depend on measurement scale of the two variables)
• Tabular tool
Crosstabulations: 2 categorical
• Graphical tool
Clustered bar graph: 2 categorical
Boxplot or histogram: 1 categorical and 1 continuous
Scatter plot: 2 continuous (look at relationship, linear, direction, outlier)
• Numerical (measures of association)
Correlation Coefficients
Statistical technique that is used to measure and describe the relationship between two variables.
Measures three aspects of relationship:
Direction (identified by sign)
• positive
• negative
Form (linear vs. nonlinear)
Strength of relationship (for linear correlation –how well the data points fit a straight line)
Variables → Correlation Coefficient
Continuous, normally distributed, and linearly related → Pearson's r
Continuous, skewed and/or not linearly related → Spearman's correlation
Ordinal → Kendall's Tau B
Nominal with 2 categories → Phi
Nominal with 2+ categories → Cramer's V
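A quick sketch of computing these coefficients with scipy.stats; the two arrays are illustrative only:
```python
import numpy as np
from scipy import stats

x = np.array([2, 4, 5, 7, 9, 11, 13])
y = np.array([1, 3, 4, 6, 9, 12, 20])

r, p_r = stats.pearsonr(x, y)       # continuous, roughly normal, linearly related
rho, p_s = stats.spearmanr(x, y)    # continuous but skewed / monotonic rather than linear
tau, p_t = stats.kendalltau(x, y)   # ordinal data
print(round(r, 3), round(rho, 3), round(tau, 3))
```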
Data Visualization
• Gain insight into an information space by mapping data onto graphical primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative analysis
• Provide a visual proof of computer representations derived
Categorization of visualization methods
Boxplot: graphic display of five-number summary
Histogram: a graphical representation of the frequency distribution in which the x-axis represents the classes and the y-
axis represents the frequencies in bars.
Ogive (Cumulative Frequency Distribution): a line graph of cumulative frequencies plotted against the upper class boundaries
Quantile plot: displays all of the data; each value x is paired with a fraction f indicating that approximately 100·f% of
the data are ≤ x
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding
quantiles of another (a bivariate quantile plot)
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Moments
• First Moment: Σx/n → average distance from 0; an indication of the mean.
• Second Moment: Σx²/n → average squared distance from 0; an indication of variance.
• Third Moment: Σx³/n → average cubed distance from 0; an indication of skewness
Skewness = (1/N)·Σ(x−µ)³/σ³  −or−  [n/((n−1)(n−2))]·Σ((x−x̄)/s)³
• Fourth Moment: Σx⁴/n → average fourth power of the distance from 0; an indication of kurtosis
Kurtosis = (1/N)·Σ(x−µ)⁴/σ⁴  −or−  [n(n+1)/((n−1)(n−2)(n−3))]·Σ((x−x̄)/s)⁴ − 3(n−1)²/((n−2)(n−3))
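A small sketch of the mean, variance, skewness, and kurtosis using NumPy and scipy.stats on a made-up right-skewed sample:
```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 3, 4, 5, 5, 6, 18])    # a right-skewed sample

print(x.mean())                      # first moment: the mean
print(np.mean((x - x.mean()) ** 2))  # second central moment: (population) variance
print(stats.skew(x))                 # standardized third moment: skewness > 0 (right skew)
print(stats.kurtosis(x))             # standardized fourth moment: excess kurtosis
```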
Entropy: the level of impurity or uncertainty in the data
Information Gain: the amount of information that each variable/feature gives us about the final outcome. Suitable to
find the root node
Decision Tree: root node, branches, leaf nodes
Confusion Matrix: used to evaluate the performance of a classification model. Accuracy = (TP+TN)/(TP+TN+FP+FN), where FP is a
Type I error and FN is a Type II error.
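A minimal sketch of entropy, information gain, and confusion-matrix accuracy; the labels, splits, and counts are hypothetical:
```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels, i.e. the impurity measure above."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    """Parent entropy minus the weighted entropy of the child splits."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# Hypothetical class labels split by a candidate root-node feature
parent = ["yes"] * 9 + ["no"] * 5
left, right = ["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4
print(information_gain(parent, [left, right]))   # higher gain -> better root-node candidate

# Accuracy from a confusion matrix: (TP + TN) / (TP + TN + FP + FN)
tp, tn, fp, fn = 40, 45, 5, 10
print((tp + tn) / (tp + tn + fp + fn))
```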
Probability
The likelihood that an event occurs: desired outcomes / total possible outcomes
Random Experiment: an experiment whose outcome cannot be predicted with certainty
Event: an outcome of an experiment
Sample space: the set of all possible outcomes after one or more trials; its size is xⁿ, where x is the number of outcomes
per trial and n is the number of trials (e.g., head or tail → 2ⁿ)
Mutually exclusive: only one of the possible events can take place at a time (head or tail)
Collectively exhaustive list: a list of all the possible events that can result from an experiment
∈ = element → 6∈N, 7∉N
⊆ = subset → M⊆N means every element of M is also in N (including the case M = N)
Fundamental Counting Principle (FCP)
Permutation: the number of different ways of arranging sets of x out of a larger group of n (order matters). P(n,x) = (n P x) = n!/(n−x)!
→ P(4,2) = 4!/(4−2)! = 4!/2! = 4·3 = 12
Combination: the number of ways of selecting sets of size x from a larger group of n (order does not matter). C(n,x) = (n C x) = n!/(x!(n−x)!)
→ C(4,2) = 4!/(2!·(4−2)!) = 24/4 = 6
Examples: 14 marbles (4 red, 4 green, 3 blue, 3 yellow)
Number of ways to choose 4 marbles out of 14: C(14,4) = 1001
Probability of drawing 4 marbles of different colors out of 14:
C(4,1)·C(4,1)·C(3,1)·C(3,1) = 4·4·3·3 = 144 out of 1001
Probability of drawing 4 marbles with at least 2 red out of 14 (2 red or 3 red or 4 red):
C(4,2)·C(10,2) + C(4,3)·C(10,1) + C(4,4) = 270 + 40 + 1 = 311 out of 1001
Probability of drawing 4 marbles with no red but at least 1 green (with no red, 10 marbles remain):
Total ways with no red: C(10,4) = 210
C(4,1)·C(6,3) + C(4,2)·C(6,2) + C(4,3)·C(6,1) + C(4,4) = 80 + 90 + 24 + 1 = 195 out of 210
C(6,4) = 15 counts the sets with no green and no red → 210 − 15 = 195 (the complementary way of calculating it)
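The counts above can be verified with Python's math.comb, for example:
```python
from math import comb

total = comb(14, 4)                                               # 1001 possible sets of 4 marbles
all_colors = comb(4, 1) * comb(4, 1) * comb(3, 1) * comb(3, 1)    # 144: one of each color
at_least_2_red = comb(4, 2) * comb(10, 2) + comb(4, 3) * comb(10, 1) + comb(4, 4)  # 311
no_red = comb(10, 4)                                              # 210 sets with no red marble
no_red_at_least_1_green = no_red - comb(6, 4)                     # 195: subtract sets with no green either

print(total, all_colors / total, at_least_2_red / total, no_red_at_least_1_green / no_red)
```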
Types of Events
Independent Events
Event A does not affect the occurrence of event B
Disjoint Events (A∩B = ∅)
probability of A or B (union) = P(A∪B) = P(A) + P(B) → P(Heart or Diamond) = 13/52 + 13/52
probability of A and B (intersection) = P(A∩B) = P(A)·P(B) for independent events → P(an A grade in Stat and in Math)
Non-Disjoint Events (A∩B ≠ ∅)
probability of A or B = P(A∪B) = P(A) + P(B) − P(A∩B) → P(Queen or Heart) = 4/52 + 13/52 − 1/52
Conditional or Dependent Events
probability of A and B = P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A) → P(2 Hearts in a row)
P(A|B) = probability of A given B has happened
Types of Probability
Marginal = P(A)
Joint = P(A∩B)
Conditional = P(A|B)
P(A|B) = P(B∩A)/P(B)
P(Electronics) = 75/229
P(Mechanical and company A) = 28/229
P(Electronics and company A or C) = 22/229 + 19/229
P(Computer Science or company B) = 69/229 + 89/229 − 30/229
P(Electronics given placed in company A) = 22/68
P(red ball in second draw, given 2 black and 3 red) = P(black first)·P(red second | black first) + P(red first)·P(red second | red first)
= (2/5·3/4) + (3/5·2/4) = 6/20 + 6/20 = 60%
Bayes’ Theorem:
P(A|B) = P(B|A)·P(A)/P(B) → posterior = likelihood × prior / P(B)
P(B) = P(A₁∩B) + P(A₂∩B) + P(A₃∩B)
P(B) = P(B|A₁)·P(A₁) + P(B|A₂)·P(A₂) + P(B|A₃)·P(A₃)
P(B) = Σᵢ P(B|Aᵢ)·P(Aᵢ)
P(Aᵢ|B) = P(B|Aᵢ)·P(Aᵢ) / Σⱼ P(B|Aⱼ)·P(Aⱼ) (the generalized form of Bayes’ Theorem)
Example
Prior Probabilities               Conditional Probabilities
P(A₁) = 30% (low risk)            P(defaulter|A₁) = 1%
P(A₂) = 60% (med risk)            P(defaulter|A₂) = 10%
P(A₃) = 10% (high risk)           P(defaulter|A₃) = 18%
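A short sketch applying Bayes' theorem to the table above (total probability of a defaulter, then the posterior for each risk group):
```python
# Priors for the three risk groups and P(defaulter | group), taken from the table above
priors = {"low": 0.30, "med": 0.60, "high": 0.10}
p_default_given = {"low": 0.01, "med": 0.10, "high": 0.18}

# Total probability of a defaulter: P(B) = sum of P(B|Ai) * P(Ai)
p_default = sum(p_default_given[g] * priors[g] for g in priors)   # 0.081

# Posterior P(group | defaulter) via Bayes' theorem
posteriors = {g: p_default_given[g] * priors[g] / p_default for g in priors}
print(p_default, posteriors)   # e.g. P(high risk | defaulter) = 0.018/0.081 ≈ 0.222
```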
Correlation Analysis
Shows the relationship between 2 variables (provides information on direction & strength: −1 to +1). Correlation is not necessarily
causation.
• Nominal Data (Chi Square Test)
• Numeric Data (Pearson Correlation coefficient)
Bivariate relationship
The value of the dependent variable (target variable) is a function of the independent variable (predictor variable): y = f(x)
Linear algebra equation (slope-intercept form of line)
y = mx + b [x: independent variable, m: slope = rise/run, b: y-intercept, where the line crosses the y-axis at x = 0, i.e. the point (0, b)]
Simple Linear Regression Model
y = β₀ + β₁x + ε (β₀ = population y-intercept, β₁ = population slope, ε = error term, the unexplained
variation in y)
Simple linear regression equation
E(y) = β₀ + β₁x
Simple linear regression is not perfect; E(y) is the mean of the distribution of y values for a given x (estimated by ŷ).
Because the population parameters (β₀ & β₁) are usually not available, we use sample data.
Sample linear regression equation
ŷ = b₀ + b₁x (y-hat is the mean value of y for a given value of x and a point estimator of E(y))
We always compare our best linear regression model with the 1-variable (mean-only) model [mean, or slope = 0 → ŷ = b₀ + 0·x
→ ŷ = b₀]. If the sum of squared residuals (SSE) decreases significantly, the model is good.
Least squares criterion/line:
1. Make a scatter plot of data points
2. Look for a visual line
3. Find correlation: +, - , strong or weak
4. Descriptive statistics/centroid: the point where the means of the two variables cross [the best-fit regression line
must pass through the centroid]
5. Calculation: ŷ = b₀ + b₁x
Slope: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   [x̄: mean of the independent variable, xᵢ: value of the independent variable;
ȳ: mean of the dependent variable, yᵢ: value of the dependent variable]
Intercept: b₀ = ȳ − b₁x̄
Least squares method
The goal is to minimize the sum of squared residual (SSE)
We always compare our best regression model (Least squared line) with 1 dependent variable model (mean)
SST= Σ (yᵢ - ȳ)²
SSR= Σ (ŷᵢ - ȳ)²
SSE=Σ (yᵢ - ŷᵢ)²
[yᵢ: observed value of dependent variable, ŷᵢ: estimated value of dependent variable, Residual = yᵢ - ŷᵢ]
1 dependent variable: SST=SSE
2 variables: SST = SSR + SSE (SST stays the same, so SSE should decrease significantly)
Coefficient of Determination
How well does the estimated Linear Regression equation fit our data.
Coefficient of Determination = r² = SSR/SST → the proportion (r²·100%) of the total sum of squares that can be explained by the
estimated regression model
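A minimal NumPy sketch of the least squares slope/intercept formulas and the SST/SSR/SSE decomposition; the x and y values are made up:
```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)
r_squared = ssr / sst               # share of total variation explained by the regression
print(round(b1, 3), round(b0, 3), round(r_squared, 3))
```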
Standardized Score regression
Not commonly used. Useful when different scales and variances.
Make a standardized Z score for each variable and make a scatter plot on the x and y axes. The centroid is (0, 0), the centre
of the x, y axes. The intercept is 0 and the slope equals the correlation coefficient (r).
Raw data correlation and standardized value correlation are the same.
Original regression slope: b₁ = r·sᵧ/sₓ
Standardized slope: r = b₁·sₓ/sᵧ
Regression Model Error
SSE=Σ(yᵢ - ŷᵢ)²
MSE = s² = SSE/df is an estimate of σ², the variance of the error (how spread out the data points are around the regression line)
[df = n − 2]
s (standard error of the estimate) = √MSE (root mean square error) = √(SSE/(n − 2)), the average distance that observations fall
from the regression line
r²=SSR/SST
r = cov(x,y)/(sₓ·sᵧ)
If |r| ≥ 2/√n → a relationship exists
|r| > 0.9 = strong correlation, 0.7 < |r| < 0.9 = moderate, |r| < 0.7 = weak
In Excel -> r=CORREL(array1, array2)
Crosstabulation
Table summary of 2 variables (show the relationship between 2 variables)
In Excel: pivot table
           Smartphone   TVs      Gaming   Appliances   Computers   Total      Average   σ
North      $1,336       $1,128   $1,859   $1,656       $1,583      $7,562     $302      $110
South                                                              $6,822     $273      $116
East                                                               $7,727     $309      $104
West                                                               $7,587     $303      $118
Total      $5,714       $5,824   $6,512   $5,945       $5,703      $29,698
Average    $286         $291     $326     $297         $285        $297
σ          $118         $102     $134     $107         $97         $111
Distributions
Chebyshev Rule
Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of
the mean (for k > 1)
Discrete Probability Distribution Function (pdf)
the distribution of a discrete number of outcomes over several independent experiments/trials.
Total: ΣP(x) = 1
Range: 0 ≤ P(x) ≤ 1
Expected value: E(X) = Σx·P(x)
Variance: σ² = Σ(x − µ)²·P(x)
Uniform discrete probability distribution
Rolling a die or flipping a coin produces a uniform distribution
P(x) = 1/number of outcomes
Binomial probability (Bernoulli Process)
• Each trial is independent.
• Only 2 possible outcomes/events over each experiment/trial.
• Probability of success (p) is the same across all trials. Probability of failure: q = 1 − p
If the probability of colorblindness in men is 8% and we pick 10 men:
The probability that all 10 men are colorblind: (0.08)¹⁰
The probability that none of the 10 men are colorblind: (0.92)¹⁰
The probability that exactly 2 of the 10 men are colorblind:
P(X=x) = (n C x)·pˣ·qⁿ⁻ˣ = n!/(x!(n−x)!)·pˣ·qⁿ⁻ˣ = 10!/(2!·8!)·(0.08)²(0.92)⁸ ≈ 0.1478 → pdf
Binomial Probability Density Function (pdf)
The distribution of binomial probability over several independent experiments/trials.
Expected number of colorblind men in the sample: E(x) = np (the mean of the distribution)
SD(x) = √(npq)
As n → ∞, the binomial distribution approaches the normal distribution (a common rule of thumb: np > 5 & nq > 5)
Excel: =BINOM.DIST(x,n,p,FALSE)
TI-83 -> 2nd,Vars,Enter,scroll down, binom pdf, (n,p,x),enter
Cumulative binomial distribution (cdf)
Sum the desired pdf values, or subtract the complement from 1 [1 − P(x)]
The probability of having at least 2 colorblind men: P(x ≥ 2) = 1 − P(x ≤ 1)
Excel: P(x ≤ 1) = BINOM.DIST(1,10,0.08,TRUE), so P(x ≥ 2) = 1 − BINOM.DIST(1,10,0.08,TRUE)
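The same colorblindness example can be sketched with scipy.stats.binom:
```python
from scipy.stats import binom

n, p = 10, 0.08                           # 10 men, 8% probability of colorblindness

print(binom.pmf(2, n, p))                 # P(exactly 2 colorblind men) ≈ 0.1478
print(binom.cdf(1, n, p))                 # P(at most 1)
print(1 - binom.cdf(1, n, p))             # P(at least 2) = 1 - P(x <= 1)
print(binom.mean(n, p), binom.std(n, p))  # E(x) = np, SD = sqrt(npq)
```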
Continuous Probability Distribution Function
As N → ∞, a discrete distribution turns into a continuous distribution
The probability of any single specific outcome is zero: P(a) = 0
P(a ≤ x ≤ b) = area under the curve between a and b (use the cdf)
Uniform continuous probability distribution
Width = b − a
f(x) = 1/(b − a)
E(x) = (a + b)/2
σ² = (b − a)²/12
σ = (b − a)/√12
P(x₁ ≤ x ≤ x₂) = (x₂ − x₁)/(b − a)
Normal (Gaussian) Distribution
Data around the mean occurs more frequently than data away from the mean
Mean = center of distribution
SD = spread (width) of the distribution: a larger SD gives a flatter, wider curve
Z score: z = (x − x̄)/s
0 to 1 SD=34.13%
1 to 2 SD=13.59%
2 to 3 SD=2.14%
+3SD=0.13%
Positive skewed (Right skewed) → Mode < Median < Mean
Hypergeometric Distribution
N: total population size
A: total items of interest in population
n: sample size
Probability distribution of 2 spades in a 5 cards poker hand?
N=52 A=13 n=5 x=2
P(x) = C(A, x)·C(N−A, n−x) / C(N, n) → P(2 spades) = C(13,2)·C(39,3)/C(52,5) ≈ 0.274
Excel: =HYPGEOM.DIST(x,n,A,N,FALSE)
Probability of at least 2 spades in a 5-card poker hand?
Excel: P(x ≥ 2) = 1 − HYPGEOM.DIST(1,5,13,52,TRUE)
If you have 2 spade, probability of flush out of next 5 cards?
N=50 A=11 n=5 x=3,4,5
Excel: P(x ≥ 3 spades) = 1 − HYPGEOM.DIST(2,5,11,50,TRUE) +
P(x = 5 hearts) = HYPGEOM.DIST(5,5,13,50,FALSE) +
P(x = 5 diamonds) = HYPGEOM.DIST(5,5,13,50,FALSE) +
P(x = 5 clubs) = HYPGEOM.DIST(5,5,13,50,FALSE) = 0.065821
E(x) = n·A/N
SD(x) = √[n·(A/N)·((N−A)/N)·((N−n)/(N−1))]
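A short sketch of the spades example with scipy.stats.hypergeom (note that SciPy's argument order differs from the N/A/n notation above):
```python
from scipy.stats import hypergeom

N, A, n = 52, 13, 5     # population size, spades in the deck, cards drawn

# SciPy's order is (k, M, n, N) = (successes drawn, population size, successes in population, draws)
print(hypergeom.pmf(2, N, A, n))       # P(exactly 2 spades in a 5-card hand) ≈ 0.274
print(1 - hypergeom.cdf(1, N, A, n))   # P(at least 2 spades)
print(hypergeom.mean(N, A, n))         # E(x) = n*A/N = 1.25
```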
Exponential Distribution
The time between events in Poisson process (inverse of Poisson)
Number of cars passing a tollgate in one hour vs. Number of hours between car arrivals
Events per single unit of time vs. Time per single event
If P(0≤x≤1)=0.05 → P(1≤x≤2)=0.95*0.05 → P(2≤x≤3)=0.95²*0.05 ……….
Probability that the next visitor arrives within 10 min? (make the time units the same)
P(x ≤ 10 min) = 1 − e^(−x/µ) = 1 − e^(−λx)
Excel: =EXPON.DIST(x,λ,TRUE); with FALSE it returns the pdf
Probability that the next visitor arrives after 30 min?
Excel: =1-EXPON.DIST(x,λ,TRUE)
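A small sketch with scipy.stats.expon; the 15-minute mean time between visitors is an assumed value, since the notes do not give one:
```python
from scipy.stats import expon

mean_gap = 15                              # assumed mean time between visitors, in minutes
lam = 1 / mean_gap                         # rate λ

print(expon.cdf(10, scale=mean_gap))       # P(next visitor within 10 min) = 1 - e^(-10/15)
print(1 - expon.cdf(30, scale=mean_gap))   # P(next visitor after 30 min) = e^(-30/15)
print(expon.cdf(10, scale=1 / lam))        # same result, parameterized by the rate
```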
Inferential Statistics
Makes conclusions or predictions about a population based on a sample. There is always some error around the
estimate (the margin of error).
Central limit theorem
The distribution of the sample mean of a random variable tends toward the normal distribution as the sample size
increases, regardless of the distribution from which we are sampling (commonly n ≥ 30).
Point Estimation
The mean of the sampling distribution will be a good estimate of the population mean. The sample size should be > 30 and not
more than 10% of the population.
Finding The Estimates
1. Method of moment
2. Maximum of Likelihood
3. Bayes’ Estimator
4. Best Unbiased Estimator
5. Interval Estimate: a range of values between upper and lower confidence limit, used to estimate a population
parameter
Margin of Error
The margin of error is the maximum distance between the estimate and the population parameter at a certain level of
confidence (zσ/√n). As the sample size increases, up to a point of diminishing returns, the margin of error decreases (better
estimation of µ).
Confidence Interval (CI)
a range between a lower and an upper confidence limit within which the true population mean is expected to lie.
CI calculation: point estimate ± margin of error → CI = x̄ ± z_crit·σ/√n
Confidence Level
The probability that the interval estimate contains the population parameter.
We are 95% confident that the interval of …. to …. contains the true population mean or population mean lies within the
interval.
Confidence level + significance level (α) = 100%
Excel
Zcrit: =NORM.S.INV(0.95) for 1−tail or (0.975) for 2−tail
CI 90% 95% 99%
1−tail test ±1.28 ±1.645 ±2.33
2−tail test ±1.645 ±1.96 ±2.576
Find σ
1. Estimate from previous studies using the same population
2. Conduct a pilot study to select a preliminary sample
3. Guess: data range/4
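A minimal sketch of the CI calculation (point estimate ± z·σ/√n, with σ estimated by s); the sample values are hypothetical:
```python
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # hypothetical sample
confidence = 0.95

x_bar, s, n = x.mean(), x.std(ddof=1), len(x)
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for a 2-tailed 95% interval
margin = z_crit * s / np.sqrt(n)                    # margin of error = z * σ/√n
print(x_bar - margin, x_bar + margin)               # CI = point estimate ± margin of error
```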
Hypothesis Formulation
Null/status quo hypothesis (H₀) is opposite of researcher/alternative hypothesis (Hₐ) and they are mutually exclusive.
Null says nothing new or different and always contains equality (=,≤,≥) while alternative hypothesis says that there is
something new or different. We compare μ (existing mean) against μ₀ (hypothesised mean) to find if they come from
the same population or different one (μ=μ₀ or μ≠μ₀)
P-value
The probability that the observed relationship occurred by pure chance; it measures the strength of evidence against the null
hypothesis and is computed from the test statistic (Z or T).
If P-value<α -> H₀ is rejected
If P-value>α -> H₀ is failed to be rejected
                          Reality
Conclusion                H₀ = false                           H₀ = true
Reject null               True positive                        False positive, Type I error (α)
Fail to reject            False negative, Type II error (β)    True negative
Probability of type 1 error (α): reject the null while we shouldn’t (level of significance)
Probability of type 2 error (β): fail to reject the null while we should
As α decreases, β increases
1 − α = confidence level = non-rejection area
1 − β = power = the probability of rejecting a false null, i.e. the ability of a test to detect a difference
If we reject the null, we conclude that the data support the alternative; if we fail to reject the null, that still does not prove the null.
Independent−samples z test
To understand the role of random chance in the relationship between 1 categorical variable and 1 quantitative variable.
Compares 2 populations via their independent sample means. Has less power than a paired-samples t test.
D₀=µ₁-µ₂ (samples difference)
2-tailed test: H₀: θ = 0 (there is no statistically significant difference between the two populations), Hα: θ ≠ 0
1 tailed upper test: H₀: θ≤0, Hα: θ>0
1 tailed lower test: H₀: θ≥0, Hα: θ<0
Independent−samples t test
If σ is unknown, use the t distribution with df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1) ] (always round down),
or df = n₁ + n₂ − 2 for the pooled test
Take one sample from each population: n₁ & n₂
Calculate the difference of sample means x̄₁ − x̄₂ = d̄, an estimate of µ₁ − µ₂
Standard error of the difference: s_d̄ = s_(x̄₁−x̄₂) = √(s₁²/n₁ + s₂²/n₂)
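A brief sketch of an independent-samples t test with scipy.stats.ttest_ind; the two groups' scores are made up and are not the data behind the APA example below:
```python
import numpy as np
from scipy import stats

clicker = np.array([6, 8, 5, 7, 6, 8, 4, 7])   # hypothetical scores, group 1
food = np.array([5, 4, 6, 5, 7, 4, 5, 5])      # hypothetical scores, group 2

# Welch's t test (does not assume equal variances); equal_var=True gives the pooled version
t_stat, p_value = stats.ttest_ind(clicker, food, equal_var=False)
print(round(t_stat, 2), round(p_value, 3))     # reject H0 if p < α
```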
APA Style
the groups do not differ significantly, t(14)=1.59, p=0.14, d=.79, 95% CI [−.44, 2.94]. The clicker-training group (M=6.38,
SD=1.41) was not significantly different from the food reward-training group (M=5.13, SD=1.73). These findings do not
support the idea that clicker training is more effective than traditional food reward training.
Compare the t statistic against the interval estimate (or critical value) to see whether the null is rejected or not.
How to interpret the result:
We are 95% confident that the interval contains the true mean difference between µ₁ and µ₂.
If the 95% interval contains 0 → there is not enough evidence to conclude that there is a difference between the
means of the 2 populations (the difference is not statistically significant; it may be due to sampling error)
If the interval does not contain 0 → there is sufficient evidence to conclude that there is a difference between the
means of the 2 populations
Compare the t statistic with the t critical value, and the p-value with α; the two comparisons lead to the same decision.
                              Variable 1
                              Category x        Category y        Total
Variable 2    Category a      #1                #2                #1+#2
              Category b      #3                #4                #3+#4
Total                         #1+#3             #2+#4             #1+#2+#3+#4
Proportion of category x of variable 1 = (#1+#3)/(#1+#2+#3+#4)
Expected value for each cell: (row total × column total)/grand total of the contingency table
Chi-square statistic: χ² = Σ(Oᵢ − Eᵢ)²/Eᵢ
df = (r − 1)(c − 1)
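A quick sketch of the chi-square test of independence with scipy.stats.chi2_contingency on a hypothetical 2×2 table (Yates correction disabled so the result matches the raw formula above):
```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (#1..#4 in the layout above)
observed = np.array([[20, 30],
                     [25, 25]])

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(chi2, p, df)    # df = (r-1)(c-1) = 1
print(expected)       # expected counts: row total * column total / grand total
```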
Chi-square interval estimate for a variance: (n−1)s²/χ²_upper ≤ σ² ≤ (n−1)s²/χ²_lower
If n = 12 → df = 11
Critical values: χ².975 ≤ χ² ≤ χ².025 → 3.82 ≤ χ² ≤ 21.92
If s² = 25.89 (GE) and s² = 73.30 (APL):
APL: 36.78 ≤ σ² ≤ 211.07
GE: 12.99 ≤ σ² ≤ 74.55
Extra information:
GE:  s = 5.09,  3.6% ≤ σ ≤ 8.63%
APL: s = 8.56,  6.06% ≤ σ ≤ 14.53%
Data Science
The process of inspecting, cleansing, transforming and modeling data to uncover patterns and trends, extract
knowledge and insight, and support decision-making, using scientific methods
• Better & faster decision making
• Cost reduction
• Better marketing and product analysis
• Organization analysis
Business Analyst
Examine large and different types of data to uncover hidden patterns, correlations and other insights to make better
decisions.
Data Science Pathway
Planning
Define the objective of the problem
Organize resources like data, software, people
Coordinate people
Schedule project
Data Preparation
Get data
Clean data
Explore data
Refine data
Modeling
Create model
Validate model
Evaluate model
Refine model
Follow Up
Present model
Deploy model
Revisit model
Archive assets
Machine Learning
Provides machines the ability to learn automatically and improve from experience without being explicitly programmed
Why ML
1. Analyse large amounts of data
2. Improve decision making
3. Uncover trends and patterns in data
4. Solve complex problems
Algorithm
A set of rules and statistical methods used to learn patterns from data
Target Variable = Dependent Variable
Predictor Variables = Independent Variables
Data Life cycle
1. Business Understanding: (problem, objectives, variables to be predicted and input data)
2. Data Acquisition: (types, source, how to obtain, how to store)
3. Data processing: (cleaning)
4. Exploratory Data Analysis: (understand the pattern, retrieve useful insight, form hypotheses)
5. Modeling
6. Model Evaluation & Optimization
7. Deployment & Prediction
Simple Linear Regression
With only 1 variable, the best prediction for the next measurement is the mean [slope = 0, ŷ = b₀ = ȳ].
With 1 dependent variable, the only sum of squares is due to residuals (the distance between each observation and the mean):
Σ(yᵢ − ȳ)² = SSE = SST
With 2 variables, SST stays the same but SSE should decrease significantly: SST = SSE + SSR (regression)
The goal of linear regression is to create a linear model that minimizes the sum of squared residuals (SSE)
We always compare our best linear regression model with the 1-variable (mean-only) model.
Uses of LR
1. Determining the strength of predictors
2. Forecasting an effect
3. Trend forecasting
Most data scientists use the following core skills in their daily work:
Statistical analysis: Identify patterns in data. This includes having a keen sense of pattern detection and anomaly
detection.
Machine learning: Implement algorithms and statistical models to enable a computer to automatically learn from data.
Computer science: Apply the principles of artificial intelligence, database systems, human/computer interaction,
numerical analysis, and software engineering.
Programming: Write computer programs and analyze large datasets to uncover answers to complex problems. Data
scientists need to be comfortable writing code in a variety of languages such as Java, R, Python, and SQL.
Data storytelling: Communicate actionable insights using data, often for a non-technical audience.
Business intuition: Connect with stakeholders to gain a full understanding of the problems they’re looking to solve.
Analytical thinking. Find analytical solutions to abstract business issues.
Critical thinking: Apply objective analysis of facts before coming to a conclusion.
Inquisitiveness: Look beyond what’s on the surface to discover patterns and solutions within the data.
Interpersonal skills: Communicate across a diverse audience across all levels of an organization.
Data visualisation
Maximize how quickly and accurately people decode information from graphics.
Aims
Pre−attentive cognition: color coding
Accuracy: bar chart vs. pie chart
Pre−requisites
Engagement
Understanding
Memorability
Emotional Connection
Frequency Table
Frequency Bar Chart & Relative Frequency = frequency/n -> 28/100=0.28
Grouped Data: No actual data
Frequency Histogram: X=variable of interest, Y=frequency or relative frequency, no gaps between bars
(the frequency of values over certain intervals called bins)
Stem and leaf plot
Box and whisker plot
P-P plot: compares the cumulative probabilities of our empirical data with an ideal distribution (points falling on a straight
line indicate a normal distribution)
Q-Q plot: compares the quantiles of our empirical data with an ideal distribution (points falling on a straight line indicate a normal distribution)
Types: skewed, symmetric, bimodal, uniform, no pattern
Big Data
Variety
Velocity
Volume
STEPS OF DATA EXPLORATION AND PREPARATION
Remember, the quality of your inputs decides the quality of your output. So, once you have your business hypothesis
ready, it makes sense to spend a lot of time and effort here. In my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.
Below are the steps involved to understand, clean and prepare your data for building your predictive model:
Variable Identification
Univariate Analysis
Bi-variate Analysis
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Finally, we will need to iterate over steps 4 - 7 multiple times before we come up with our refined model. Let's now
study each stage in detail:-
Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.
Let's understand this step more clearly by taking an example.
Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you
need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables.
Note: Univariate analysis is also used to highlight missing and outlier values. We will look at methods to handle missing and
outlier values below. To know more about these methods, you can refer to the Descriptive Statistics course from Udacity.
Categorical Variables:- For categorical variables, we'll use a frequency table to understand the distribution of each category.
We can also read it as the percentage of values under each category. It can be measured using two
metrics, Count and Count%, against each category. A bar chart can be used for visualization.
Bi-variate Analysis
Bivariate Analysis finds out the relationship between two variables. Here, we look for association and dissociation
between variables at a pre-defined significance level. We can perform bivariate analysis for any combination of
categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous and
Continuous & Continuous. Different methods are used to tackle these combinations during analysis process.
Let's understand the possible combinations in detail:
Continuous & Continuous: While doing bivariate analysis between two continuous variables, we should look at scatter
plot. It is a nifty way to find out the relationship between two variables. The pattern of scatter plot indicates the
relationship between variables. The relationship can be linear or nonlinear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between
them. To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
-1: perfect negative linear correlation
+1:perfect positive linear correlation and
0: No correlation
Correlation can be derived using following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns
the correlation between two variables, and SAS uses the procedure PROC CORR to identify the correlation. These
functions return the Pearson correlation value to identify the relationship between two variables:
Covariance
Shows the linear relationship of 2 variables (it provides information only on direction: positive, negative or zero, and says
nothing about strength)
Sample: cov(x,y) = sₓᵧ = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1)
Population: cov(x,y) = σₓᵧ = Σ(xᵢ − µₓ)(yᵢ − µᵧ)/N
Excel uses the population formula while SPSS uses the sample formula; multiply the Excel result by N/(N − 1) to convert
In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
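A small NumPy sketch contrasting the sample and population covariance formulas; the x and y values are illustrative:
```python
import numpy as np

x = np.array([2.0, 4, 6, 8, 10])
y = np.array([1.0, 3, 7, 9, 12])

sample_cov = np.cov(x, y, ddof=1)[0, 1]   # sample formula (divide by n-1), SPSS-style
pop_cov = np.cov(x, y, ddof=0)[0, 1]      # population formula (divide by N), Excel-style
print(sample_cov, pop_cov, pop_cov * len(x) / (len(x) - 1))  # last value converts population -> sample
```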
Categorical & Categorical: To find the relationship between two categorical variables, we can use following methods:
Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows
represent the categories of one variable and the columns represent the categories of the other variable. We show the count
or count% of observations available in each combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of Two-way table.
Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests
whether the evidence in the sample is strong enough to generalize the relationship to a larger population as well.
Chi-square is based on the difference between the expected and observed frequencies in one or more categories in the
two-way table. It returns probability for the computed chi-square distribution with the degree of freedom.
Probability of 0: it indicates that both categorical variables are dependent
Probability of 1: it shows that both variables are independent
Probability less than 0.05: it indicates that the relationship between the variables is significant at 95% confidence. The
chi-square test statistic for a test of independence of two categorical variables is found by χ² = Σ(O − E)²/E,
where O represents the observed frequency and E is the expected frequency under the null
hypothesis, computed as E = (row total × column total)/sample size. From the previous two-way table, the expected count for
product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for
Product category (2) and then dividing by the sample size (81). This procedure is conducted for each cell. Statistical
measures used to analyze the power of the relationship are:
Cramer's V for Nominal Categorical Variable
Mantel-Haenszel Chi-Square for ordinal categorical variable.
Different data science language and tools have specific methods to perform chi-square test. In SAS, we can use Chisq as
an option with Proc freq to perform this test.
Categorical & Continuous: While exploring relation between categorical and continuous variables, we can draw box plots
for each level of categorical variables. If levels are small in number, it will not show the statistical significance. To look at
the statistical significance we can perform Z-test, T-test or ANOVA.
Z-Test/T-Test:- Either test assesses whether the means of two groups are statistically different from each other or not.
If the probability of Z is small, then the difference between the two averages is more significant. The T-test is very similar to the Z-test,
but it is used when the number of observations for both categories is less than 30.
ANOVA:- It assesses whether the averages of more than two groups are statistically different.
Example: Suppose, we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type
of exercise to 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of
these exercises on them is significantly different or not. This can be done by comparing the weights of the 5 groups of 4
men each.
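A minimal sketch of that one-way ANOVA with scipy.stats.f_oneway; the group weights are invented for illustration:
```python
from scipy.stats import f_oneway

# Hypothetical weights (kg) of the 5 exercise groups, 4 men each
g1 = [82, 85, 88, 80]
g2 = [78, 75, 80, 77]
g3 = [90, 92, 88, 91]
g4 = [84, 83, 85, 86]
g5 = [79, 81, 78, 80]

f_stat, p_value = f_oneway(g1, g2, g3, g4, g5)
print(round(f_stat, 2), round(p_value, 4))   # p < 0.05 -> at least one group mean differs
```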
Till here, we have understood the first three stages of Data Exploration, Variable Identification, Uni-Variate and Bi-
Variate analysis. We also looked at various statistical and visual methods to identify the relationship between variables.
Now, we will look at the methods of Missing values Treatment. More importantly, we will also look at why missing
values occur in our data and why treating them is necessary.
Notice the missing values in the example above: in the left scenario, we have not treated missing values. The
inference from this data set is that the chance of playing cricket is higher for males than for females. On the other hand, if
you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that
females have a higher chance of playing cricket than males.
As you can see, a data set with outliers has a significantly different mean and standard deviation. In the first scenario, we
would say that the average is 5.45. But with the outlier, the average soars to 30. This would change the estimate completely.
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, like the box
plot, histogram, and scatter plot (above, we have used the box plot and scatter plot for visualization). Some analysts also use
various rules of thumb to detect outliers (see the sketch after this list). Some of them are:
Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
Capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
Data points three or more standard deviations away from the mean are considered outliers
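A short NumPy sketch of the IQR and percentile rules; the sample (with 42 as the outlier) is made up:
```python
import numpy as np

x = np.array([3, 4, 4, 5, 5, 6, 6, 7, 42])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the 1.5 x IQR fences
print(x[(x < low) | (x > high)])             # flagged outliers -> [42]

p5, p95 = np.percentile(x, [5, 95])          # capping rule: values outside the 5th-95th percentile range
print(x[(x < p5) | (x > p95)])
```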
Outlier detection is merely a special case of the examination of data for influential data points and it also depends on
the business understanding
Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance.
Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at
statistical measures like STUDENT, COOKD, RSTUDENT and others.
How to remove Outliers?
Most of the ways to deal with outliers are similar to the methods for missing values, like deleting observations,
transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here,
we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier
observations are very few in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a variable reduces the
variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm handles
outliers well because it bins variables. We can also assign weights to different observations.
Imputing: Like the imputation of missing values, we can also impute outliers. We can use mean, median, or mode imputation
methods. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can impute
values. We can also use a statistical model to predict the values of outlier observations and then impute the
predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model.
One approach is to treat them as two different groups, build an individual model for each group, and
then combine the output.
Till here, we have learnt about steps of data exploration, missing value treatment and techniques of outlier detection
and treatment. These 3 stages will make your raw data better in terms of information availability and accuracy. Let's
now proceed to the final stage of data exploration. It is Feature Engineering.
THE ART OF FEATURE ENGINEERING
What is Feature Engineering?
Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any
new data here, but you are actually making the data you already have more useful.
For example, let's say you are trying to predict foot fall in a shopping mall based on dates. If you try and use the dates
directly, you may not be able to extract meaningful insights from the data. This is because the foot fall is less affected by
the day of the month than it is by the day of the week. Now this information about day of week is implicit in your data.
You need to bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.
What is the process of Feature Engineering ?
You perform feature engineering once you have completed the first 5 steps in data exploration - Variable
Identification, Univariate, Bivariate Analysis, Missing Values Imputation and Outliers Treatment. Feature engineering
itself can be divided in 2 steps:
Variable transformation.
Variable / Feature creation.
These two techniques are vital in data exploration and have a remarkable impact on the power of
prediction. Let's understand each of these steps in more detail.
What is Variable Transformation?
In data modelling, transformation refers to the replacement of a variable by a function. For instance, replacing a variable
x by its square / cube root or logarithm is a transformation. In other words, transformation is a process that changes
the distribution or relationship of a variable with others.
Let’s look at the situations when variable transformation is useful.
A symmetric distribution is preferred over a skewed distribution, as it is easier to interpret and generate inferences from. Some
modeling techniques require a normal distribution of variables. So, whenever we have a skewed distribution, we can use
transformations which reduce skewness. For a right-skewed distribution, we take the square / cube root or logarithm of the
variable, and for a left-skewed one, we take the square / cube or exponential of the variable.
Variable transformation is also done from an implementation point of view (human involvement). Let's understand it
more clearly. In one of my projects on employee performance, we found that age has a direct correlation with
the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint,
launching an age-based programme might present an implementation challenge. However, categorizing the sales agents into
three age buckets of <30 years, 30-45 years and >45 years, and then formulating three different strategies for each
group, is a judicious approach. This categorization technique is known as binning of variables.
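A brief pandas sketch of binning a continuous variable into those buckets (plus a log transform for skew); the ages are hypothetical:
```python
import numpy as np
import pandas as pd

ages = pd.Series([24, 29, 33, 41, 47, 52, 38, 26])

# Binning: recode the continuous age variable into the three buckets described above
buckets = pd.cut(ages, bins=[0, 30, 45, 120], labels=["<30", "30-45", ">45"])
print(buckets.value_counts())

# Log transform: a common way to reduce right skew before modeling
log_ages = np.log(ages)
```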
There are various techniques to create new features. Let's look at some of the commonly used methods:
Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or
different methods. Let's look at it through the "Titanic - Kaggle competition". In this data set, the variable age has missing
values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we
decide which variable to create? Honestly, this depends on the analyst's business understanding, curiosity, and the
set of hypotheses they might have about the problem. Methods such as taking the log of variables, binning variables, and other
methods of variable transformation can also be used to create new variables.
Creating dummy variables: One of the most common applications of dummy variables is to convert a categorical variable
into numerical variables. Dummy variables are also called indicator variables. They are useful for taking a categorical variable as a
predictor in statistical models. A dummy variable takes the values 0 and 1. Let's take a variable 'gender'. We can
produce two variables, namely "Var_Male" with values 1 (Male) and 0 (not male), and "Var_Female" with values 1
(Female) and 0 (not female). We can also create dummy variables for a categorical variable with more than two classes,
using n or n-1 dummy variables.
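A minimal pandas sketch of dummy coding with pd.get_dummies; the tiny gender column is illustrative:
```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# n dummy columns: Var_Female and Var_Male, each coded 0/1
print(pd.get_dummies(df["gender"], prefix="Var").astype(int))

# n-1 columns: drop one category to avoid redundancy when dummies are used as regression predictors
print(pd.get_dummies(df["gender"], prefix="Var", drop_first=True).astype(int))
```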