Computational Stats Aiml Notes
INDEX
---------------------------------------------------------------------------------------------------------------------
3. HYPOTHESIS TESTS AND STATISTICAL TESTS
3.1 Typical Analysis Procedures
3.2 Hypothesis Concepts
3.3 Errors
3.4 The p-value
3.5 Sample Size
3.6 Confusion Matrix
3.7 Sensitivity and Specificity
3.8 ROC-AUC Curves
3.9 Tests for Numerical Data
3.9.1 Distribution of a Sample Mean
3.9.2 Comparison of Two Groups
3.9.3 Comparison of Multi Groups
---------------------------------------------------------------------------------------------------------------------
4. STATISTICAL METHODS
Example: Let us compare the weight of two groups of subjects. The null hypothesis is that there is no difference in weight between the two groups. If a statistical comparison of the weights produces a p-value of 0.03, this means that, if the null hypothesis were true, the probability of observing a difference at least as large as the one found is 0.03, or 3%. Since this probability is quite low, we say that there is a significant difference between the weights of the two groups.
3.3 Errors
Types of Error
In hypothesis testing, two types of errors can occur:
Type I errors
These are errors where you get a significant result despite the fact that the null hypothesis is true. The likelihood of a Type I error is commonly indicated with α, and it is set before you start the data analysis. For example, assume that the population of young Austrian adults has a mean IQ of 105 (i.e. we are smarter than the rest) and a standard deviation of 15. We now want to check if the average FH student in Linz has the same IQ as the average Austrian, and we select 20 students. We set α = 0.05, i.e. we set our significance level to 5% (equivalently, a confidence level of 95%). Let us now assume that the average student has in fact the same IQ as the average Austrian. If we repeat our study 20 times, we will on average find once that our sample mean is significantly different from the Austrian average IQ. Such a finding would be a false result, despite the fact that our assumption is correct, and would constitute a Type I error.
The p-value is often incorrectly interpreted as the probability that the null hypothesis is true, or, worse, as the posterior probability (i.e. after the data have been collected) that the hypothesis is true. As an example, take the case where the alternative hypothesis is that the mean is just a fraction of one standard deviation larger than the mean under the null hypothesis: in that case, a sample that produces a p-value of 0.05 may just as likely be produced if the alternative hypothesis is true as if the null hypothesis is true! Researchers who have investigated this question in detail recommend using a "calibrated p-value" to estimate the probability of making a mistake when rejecting the null hypothesis, when the data produce a p-value p:
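p(calibrated) = 1 / (1 + 1 / (−e · p · ln p)), valid for p < 1/e
(This is the Sellke–Bayarri–Berger calibration that this passage appears to paraphrase; for p = 0.05 it gives roughly 0.29, i.e. about a 29% chance that rejecting the null hypothesis is a mistake.)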
Remember, p only indicates the likelihood of obtaining a certain value for the test statistic if the
null hypothesis is true - nothing else! And keep in mind that improbable events do happen, even if
not very frequently. For example, back in 1980 a woman named Maureen Wilcox bought tickets for both the Rhode Island lottery and the Massachusetts lottery, and she got the correct numbers for both lotteries. Unfortunately for her, she picked all the correct numbers for Massachusetts on her Rhode Island ticket, and all the right numbers for Rhode Island on her Massachusetts ticket. Seen statistically, the p-value for such an event would be extremely small - but it did happen anyway.
- True Positive (TP): the number of cases where both the predicted and the actual value are Dog.
- True Negative (TN): the number of cases where both the predicted and the actual value are Not Dog.
- False Positive (FP): the number of cases where the prediction is Dog while the actual value is Not Dog.
- False Negative (FN): the number of cases where the prediction is Not Dog while the actual value is Dog.
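A minimal sketch of these quantities in Python (the label arrays below are hypothetical, not from the notes); it also computes the sensitivity and specificity discussed in 3.7:

from sklearn.metrics import confusion_matrix

y_true = ["Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Not Dog", "Dog", "Not Dog"]
y_pred = ["Dog", "Not Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Dog", "Not Dog"]

# labels fixes the row/column order: rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=["Dog", "Not Dog"])
tp, fn = cm[0, 0], cm[0, 1]    # actual Dog
fp, tn = cm[1, 0], cm[1, 1]    # actual Not Dog

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(cm, sensitivity, specificity)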
- If you only care about comparing two levels (like when the response variable is binary), conduct a proportion difference z-test or a Fisher exact test.
- If you want to compare the joint frequency counts to expected frequency counts under the
independence model (the model of independent explanatory variables), conduct a Pearson’s
chi-squared independence test, or a G-test.
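A hedged sketch of two of these tests on a made-up 2x2 contingency table (the counts are illustrative, not from the notes):

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[20, 30],    # group A: success / failure counts
                  [35, 15]])   # group B: success / failure counts

chi2, p_chi2, dof, expected = chi2_contingency(table)   # Pearson chi-squared independence test
odds_ratio, p_fisher = fisher_exact(table)              # Fisher exact test (2x2 tables only)
print(p_chi2, p_fisher)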
4.2 Normalization
In statistics and applications of statistics, normalization can have a range of meanings. In the
simplest cases, normalization of ratings means adjusting values measured on different scales to a
notionally common scale, often prior to averaging. In more complicated cases, normalization may
refer to more sophisticated adjustments where the intention is to bring the entire probability
distributions of adjusted values into alignment. In the case of normalization of scores in educational
assessment, there may be an intention to align distributions to a normal distribution. A different
approach to normalization of probability distributions is quantile normalization, where the
quantiles of the different measures are brought into alignment.
Feature scaling brings all features onto a comparable scale, so that each contributes fairly to the model and no single feature dominates solely because of its larger values. Feature scaling becomes
necessary when dealing with datasets containing features that have different ranges, units of
measurement, or orders of magnitude. In such cases, the variation in feature values can lead to
biased model performance or difficulties during the learning process. There are several common
techniques for feature scaling, including standardization, normalization, and min-max scaling.
These methods adjust the feature values while preserving their relative relationships and
distributions. By applying feature scaling, the dataset’s features can be transformed to a more
consistent scale, making it easier to build accurate and effective machine learning models. Scaling
facilitates meaningful comparisons between features, improves model convergence, and prevents
certain features from overshadowing others based solely on their magnitude.
This scaling (sometimes called mean normalization) subtracts the mean of the column from each value and then divides by the range, i.e. max(x) − min(x). It works very well in cases where the standard deviation is very small, or where the data do not follow a Gaussian distribution.
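A minimal numpy sketch (assumed example data) contrasting the scalings mentioned above, standardization, min-max scaling, and mean normalization:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])   # hypothetical feature column

standardized = (x - x.mean()) / x.std()                 # z-scores: mean 0, std 1
min_max      = (x - x.min()) / (x.max() - x.min())      # rescaled to [0, 1]
mean_norm    = (x - x.mean()) / (x.max() - x.min())     # subtract mean, divide by range
print(standardized, min_max, mean_norm)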
4.3 Bias
Bias is the error that arises because the model cannot fully capture the true relationship, so there is a systematic difference between the model's predicted values and the actual values. These differences between actual (or expected) values and the predicted values are known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
Let Y be the true value of a parameter, and let ^Y be an estimator of Y based on a sample of data. Then the bias of the estimator ^Y is given by:
Bias(^Y) = E(^Y) − Y
where E(^Y) is the expected value of the estimator ^Y. Bias measures how well the model fits the data.
Low Bias: A low bias value means fewer assumptions are taken to build the target function. In this case, the model will closely match the training dataset.
High Bias: A high bias value means more assumptions are taken to build the target function. In this case, the model will not match the training dataset closely. A high-bias model cannot capture the trend in the dataset; it underfits the data and has a high error rate, typically because the algorithm is too simple.
For example, a linear regression model may have a high bias if the data has a non-linear
relationship.
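A small illustration of this point (synthetic data, a sketch rather than anything from the notes): a linear model fit to clearly non-linear, quadratic data underfits it.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.5, size=100)   # quadratic relationship + noise

model = LinearRegression().fit(x, y)
print("R^2 on the training data:", model.score(x, y))  # near 0 -> the model underfits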
4.4 Variance
Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, variance describes how sensitive the model is to the particular subset of the training data it was fitted on, i.e. how much its predictions change when it is fitted to a different subset.
Let Y be the actual values of the target variable, and ^Y be the predicted values. Then the variance of a model can be measured as the expected value of the squared difference between the predicted values and the expected value of the predicted values:
Variance(^Y) = E[(^Y − E(^Y))²]
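A rough sketch of this idea (hypothetical data and model): train the same model on different random training subsets and measure how much its prediction at one fixed point varies across subsets.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=500)

x0 = np.array([[1.0]])                      # fixed query point
predictions = []
for _ in range(50):                         # 50 random training subsets
    idx = rng.choice(len(x), size=100, replace=False)
    model = DecisionTreeRegressor().fit(x[idx], y[idx])
    predictions.append(model.predict(x0)[0])

print("variance of the prediction at x0:", np.var(predictions))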
4.5 Regularization
In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that changes the resulting answer to be "simpler". It is often used
to obtain results for ill-posed problems or to prevent overfitting. Although regularization
procedures can be divided in many ways, the following delineation is particularly helpful:
- Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
- Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).
In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more faithful to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. It can also be physically motivated by common sense or intuition. In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score of the trained model on the evaluation set and not the training data. One of the earliest uses of regularization is Tikhonov regularization, related to the method of least squares.
A regularization term (or regularizer) R(f) is added to a loss function:
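min over f of [ Σᵢ V(f(xᵢ), yᵢ) + λ R(f) ]
where V is a loss function measuring the error between the prediction f(xᵢ) and the target yᵢ, R(f) is the regularization term, and λ ≥ 0 controls the trade-off between fitting the data and keeping the solution simple. (This is the standard general form; the specific loss and regularizer depend on the problem.)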
1. Linear regression model: LASSO regression starts with the standard linear regression model, which assumes a linear relationship between the independent variables (features) and the dependent variable (target). The linear regression equation can be represented as follows:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
where y is the dependent variable (target), β₀, β₁, β₂, ..., βₚ are the coefficients (parameters) to be estimated, x₁, x₂, ..., xₚ are the independent variables (features), and ε represents the error term.
2. L1 regularization: LASSO regression introduces an additional penalty term based on the absolute values of the coefficients. The L1 regularization term is the sum of the absolute values of the coefficients multiplied by a tuning parameter λ:
L₁ = λ * (|β₁| + |β₂| + ... + |βₚ|)
where λ is the regularization parameter that controls the amount of regularization applied, and β₁, β₂, ..., βₚ are the coefficients.
3. Objective function: The objective of LASSO regression is to find the values of the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, while also minimizing the L1 regularization term:
Minimize: RSS + L₁
where RSS is the residual sum of squares, which measures the error between the predicted values and the actual values.
4. Shrinking coefficients: By adding the L1 regularization term, LASSO regression can shrink
the coefficients towards zero. When λ is sufficiently large, some coefficients are driven to
exactly zero. This property of LASSO makes it useful for feature selection, as the variables
with zero coefficients are effectively removed from the model.
5. Tuning parameter λ: The choice of the regularization parameter λ is crucial in LASSO
regression. A larger λ value increases the amount of regularization, leading to more
coefficients being pushed towards zero. Conversely, a smaller λ value reduces the
regularization effect, allowing more variables to have non-zero coefficients.
6. Model fitting: To estimate the coefficients in LASSO regression, an optimization algorithm
is used to minimize the objective function. Coordinate Descent is commonly employed,
which iteratively updates each coefficient while holding the others fixed.
LASSO regression offers a powerful framework for both prediction and feature selection,
especially when dealing with high-dimensional datasets where the number of features is large. By
striking a balance between simplicity and accuracy, LASSO can provide interpretable models
while effectively managing the risk of overfitting. It’s worth noting that LASSO is just one type of
regularization technique, and there are other variants such as Ridge regression (L2 regularization)
and Elastic Net.
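A brief sklearn sketch of this workflow on synthetic data (the data, the alpha value playing the role of λ, and the variable names are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_beta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative features
y = X @ true_beta + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)                                # most coefficients are exactly 0
print("selected features:", np.flatnonzero(lasso.coef_))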
4.8.1 K-fold
In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we repeat
the train-test method k times such that each time one of the k subsets is used as a test set and the
rest k−1 subsets are used together as a training set. Finally, we estimate the model's performance by averaging the scores over the k trials. For example, let's suppose that we have a dataset S = {x₁, x₂, x₃, x₄, x₅, x₆} containing 6 samples and that we want to perform a 3-fold cross-validation.
Then, we train and evaluate our machine-learning model 3 times. In our example, each fold contains two samples; each time, two of the folds together form the training set, while the remaining fold acts as the test set. Finally, the overall performance is the average of the model's performance scores on those three test sets.
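A short sketch of 3-fold cross-validation on a tiny 6-sample dataset mirroring the example above (the sample values are made up):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(6).reshape(-1, 1)            # stand-ins for x_1 ... x_6
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.9])

kf = KFold(n_splits=3, shuffle=False)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)

# Overall performance = average of the 3 fold scores (R^2 by default).
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores.mean())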
4.8.2 LOOCV
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is our dataset's size. Each time, only one sample is used as the test set while the rest are used to train the model. LOO is therefore the extreme case of k-fold cross-validation where k = n. If we apply LOO to the previous example, we'll have 6 test subsets:
S₁ = {x₁}, S₂ = {x₂}, S₃ = {x₃}, S₄ = {x₄}, S₅ = {x₅}, S₆ = {x₆}
Iterating over them, we use S \ Sᵢ (all samples except Sᵢ) as the training data in iteration i = 1, 2, ..., 6 and evaluate the model on Sᵢ. The final performance estimate is the average of the six individual scores.
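A matching sketch of leave-one-out cross-validation on the same made-up 6-sample dataset; each iteration trains on 5 samples and tests on 1:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(6).reshape(-1, 1)
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.9])

loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")   # R^2 is undefined on a single sample
print("LOO estimate (MSE):", -scores.mean())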
Cross-validation can be used to estimate this generalization performance, and therefore to choose the set of hyperparameter values that maximizes it.
Cross-validation is commonly employed when an initial evaluation (like Mean Squared Error on a single split) demonstrates reasonably satisfactory performance and you want to obtain a more reliable
estimate of the model’s generalization ability. It helps assess the model’s performance across
multiple subsets of the data and provides a more robust evaluation by mitigating the potential bias
introduced by a single train-test split. The cross-validation error provides a more reliable estimate of a model's performance than a single train-test split does. It evaluates the model on several subsets of
the data, resolving the issue of variability in the training and validation data and leading to a more
robust performance estimation. It aids in model selection, hyperparameter tuning, and comparing
different models.
Feature selection methods are commonly grouped into three categories:
1. Filter
2. Wrapper
3. Embedded
The chi-square score for a feature is computed as χ² = Σ (observed frequency − expected frequency)² / expected frequency, where the observed frequency is the number of observations of a class, and the expected frequency is the number of observations of the class that would be expected if there were no relationship between the feature and the target.
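As a rough illustration (an sklearn sketch with an arbitrary dataset and an arbitrary k, not taken from the notes), chi-square scores can be used as a filter-style feature selector:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # non-negative features, as chi2 requires
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi2 scores per feature:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))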
1. Measurement errors
2. Data entry or processing errors
3. Unrepresentative sampling
In practice, it can be difficult to tell different types of outliers apart. While you can use calculations
and statistical methods to detect outliers, classifying them as true or false is usually a subjective
process.
Methods:
1. Sorting method: You can sort quantitative variables from low to high and scan for
extremely low or extremely high values. Flag any extreme values that you find. This is a simple
way to check whether you need to investigate certain data points before using more sophisticated
methods.
2. Using visualizations: You can use software to visualize your data with a box plot, or box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data. Many computer programs highlight outliers on such a chart with an asterisk, and these will lie beyond the whiskers of the plot.
3. Statistical outlier detection: Statistical outlier detection involves applying statistical tests
or procedures to identify extreme values. You can convert extreme data points into z scores that
tell you how many standard deviations away they are from the mean. If a value has a high enough
or low enough z score, it can be considered an outlier. As a rule of thumb, values with a z score
greater than 3 or less than –3 are often determined to be outliers.
4. Using the interquartile range: The interquartile range (IQR) tells you the range of the
middle half of your dataset. You can use the IQR to create “fences” around your data and then
define outliers as any values that fall outside those fences. This method is helpful if you have a
few values on the extreme ends of your dataset, but you aren’t sure whether any of them might
count as outliers.
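A short numerical sketch (made-up data) of the z-score and IQR rules described above:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, size=50), [120.0]])   # one injected extreme value

# z-score rule: flag |z| > 3
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < low) | (data > high)])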
5.4 Resampling-Random
A resampling method is a statistical method that is used to generate new data points for a dataset by randomly picking data points from the existing dataset. It helps in creating new synthetic datasets for training machine learning models and in estimating the properties of a dataset when its distribution is unknown, difficult to estimate, or when the sample size is small. Two common methods of resampling are:
1. Cross Validation
2. Bootstrapping
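A rough bootstrapping sketch (hypothetical sample): resample with replacement many times to estimate the sampling distribution of the mean.

import numpy as np

rng = np.random.default_rng(42)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.7, 4.4])

boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(1000)]
print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))
print("95% interval:", np.percentile(boot_means, [2.5, 97.5]))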
A correlation coefficient is a descriptive statistic. That means that it summarizes sample data
without letting you infer anything about the population. A correlation coefficient is a bivariate
statistic when it summarizes the relationship between two variables, and it’s a multivariate statistic
when you have more than two variables.
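A tiny sketch (hypothetical data) of computing a bivariate correlation coefficient:

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p_value = pearsonr(x, y)       # Pearson's r and the p-value for testing r = 0
print(r, p_value)
print(np.corrcoef(x, y))          # the full 2x2 correlation matrix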
The residuals of this model (the difference between the observed values and the predicted values)
will be small, which means the residual standard error will also be small. Conversely, a regression
model that has a large residual standard error will have data points that are more loosely scattered
around the fitted regression line:
The residuals of this model will be larger, which means the residual standard error will also be
larger. The following example shows how to calculate and interpret the residual standard error of
a regression model in R.
1. Fit the regression line to the data to obtain its equation, ŷ = a + bx.
2. Insert the X values into the equation found in step 1 in order to get the respective predicted Y values.
3. Now subtract the new Y values from the original Y values. The resulting values are the error terms (residuals), i.e. the vertical distances of the given points from the regression line.
4. Square the errors found in step 3.
5. Sum up all the squares.
6. Divide the value found in step 5 by the total number of observations.
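A Python sketch of these steps on made-up data; note that the steps above yield the mean squared error, while the conventional residual standard error divides the sum of squares by the degrees of freedom (n − 2 for simple linear regression) and takes the square root.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

slope, intercept = np.polyfit(x, y, deg=1)   # step 1: fit the regression line
y_hat = intercept + slope * x                # step 2: predicted Y values
errors = y - y_hat                           # step 3: residuals (vertical distances)
squared = errors ** 2                        # step 4: square the errors
sse = squared.sum()                          # step 5: sum of squared errors
mse = sse / len(x)                           # step 6: divide by the number of observations

rse = np.sqrt(sse / (len(x) - 2))            # conventional residual standard error
print(mse, rse)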
The lower the RMSE, the better a given model is able to "fit" a dataset. The formula to find the root mean square error, often abbreviated RMSE, is as follows:
RMSE = √( Σ(Pᵢ − Oᵢ)² / n )
where:
Σ is a fancy symbol that means "sum"
Pᵢ is the predicted value for the i-th observation in the dataset
Oᵢ is the observed value for the i-th observation in the dataset
n is the sample size
The following example shows how to interpret RMSE for a given regression model.
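A quick sketch of this formula with hypothetical predicted (P) and observed (O) values:

import numpy as np

P = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # predicted values
O = np.array([2.8, 5.4, 2.0, 7.9, 4.2])   # observed values

rmse = np.sqrt(np.mean((P - O) ** 2))
print(rmse)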
- How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
- The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
The formula for a multiple linear regression is:
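y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
(standard form, matching the notation used in the LASSO section above) where y is the dependent variable, X₁ … Xₙ are the independent variables, β₀ is the intercept, β₁ … βₙ are the regression coefficients, and ε is the error term.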
Adding polynomial input features often leads to better predictions and is common for regression predictive modeling tasks and, generally, tasks that have numerical input variables. Typically, linear algorithms, such as linear regression and logistic regression, respond well to the use of polynomial input variables.
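A brief sklearn sketch (hypothetical data) of adding polynomial input variables before a linear model:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.5, size=100)     # non-linear relationship

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
print("R^2 with polynomial features:", model.score(x, y))   # much higher than a plain linear fit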
1. If we move towards a negative gradient, i.e. away from the gradient of the function at the current point, we will reach a local minimum of that function.
2. Whenever we move towards a positive gradient, i.e. towards the gradient of the function at the current point, we will reach a local maximum of that function.
Moving against the gradient in this way is known as gradient descent, which is also called steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
1. Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
2. Moves in the direction opposite to the gradient, i.e. takes a step from the current point of size alpha times the gradient, where alpha is the learning rate: a tuning parameter in the optimization process which helps to decide the length of the steps.
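A minimal sketch of these two steps in plain Python, minimizing the hypothetical cost function f(w) = (w − 3)²:

def gradient(w):
    return 2 * (w - 3)           # step 1: first-order derivative of the cost

w = 0.0                          # starting point
alpha = 0.1                      # learning rate
for _ in range(100):
    w = w - alpha * gradient(w)  # step 2: move against the gradient

print(w)                         # converges towards the minimum at w = 3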
- Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
- The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
- Logistic Regression is similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
- In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
- The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
- Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
- Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification.
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as “cat”, “dogs”, or “sheep”
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as "Low", "Medium", or "High".
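A short sklearn sketch of binomial logistic regression on synthetic data (everything below is illustrative, not from the notes); the model outputs probabilities between 0 and 1 via the S-shaped logistic function:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # binary (0/1) target

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))                          # predicted class labels (0 or 1)
print(clf.predict_proba(X[:3]))                    # predicted probabilities for each class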
The probability of the evidence P(B) can be calculated using the law of total probability. If {A₁, A₂, …, Aₙ} is a partition of the sample space (the set of all outcomes of an experiment), then
P(B) = P(B | A₁) P(A₁) + P(B | A₂) P(A₂) + … + P(B | Aₙ) P(Aₙ) = Σᵢ P(B | Aᵢ) P(Aᵢ)
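A tiny worked example of this law (the numbers are made up): three machines A₁, A₂, A₃ produce 50%, 30%, and 20% of all items, with defect rates of 1%, 2%, and 3%; the overall probability of a defect B is the weighted sum over the partition.

p_A = [0.5, 0.3, 0.2]               # P(A_i): the partition
p_B_given_A = [0.01, 0.02, 0.03]    # P(B | A_i)

p_B = sum(pa * pb for pa, pb in zip(p_A, p_B_given_A))
print(p_B)                          # 0.5*0.01 + 0.3*0.02 + 0.2*0.03 = 0.017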