Computational Stats AIML Notes


INDEX
---------------------------------------------------------------------------------------------------------------------
3. HYPOTHESIS TESTS AND STATISTICAL TESTS
3.1 Typical Analysis Procedures
3.2 Hypothesis Concepts
3.3 Errors
3.4 The p-value
3.5 Sample Size
3.6 Confusion Matrix
3.7 Sensitivity and Specificity
3.8 ROC-AUC Curves
3.9 Tests for Numerical Data
3.9.1 Distribution of a Sample Mean
3.9.2 Comparison of Two Groups
3.9.3 Comparison of Multiple Groups
---------------------------------------------------------------------------------------------------------------------
4. STATISTICAL METHODS

4.1 Standard Deviation


4.2 Normalization
4.2.1 Feature Scaling
4.2.2 Min-Max Scaling
4.3 Bias
4.4 Variance
4.5 Regularization
4.6 Ridge Regression
4.7 Lasso Regression
4.8 Cross Validation Techniques
4.8.1 K-Fold Method
4.8.2 LOOCV
4.8.3 Stratified K-Fold
4.8.4 Grid Search
4.9 Cross Validation Error
---------------------------------------------------------------------------------------------------------------------
5. STATISTICAL PROCESSING
5.1 Dimensionality Reduction Techniques
5.1.1 Principal Component Analysis
5.1.2 Discriminant Analysis
5.2 Feature Selection


5.2.1 Chi-Square Method


5.2.2 Variance Threshold
5.2.3 Recursive Feature Elimination
5.3 Outliers Detection Method
5.4 Resampling-Random
5.4.1 Random Over-resampling
5.4.2 Random Under-resampling
---------------------------------------------------------------------------------------------------------------------
6. STATISTICAL MODELING
6.1 Linear Regression Models
6.2 Correlation Coefficient
6.3 Rank Correlation
6.4 Residual Error
6.5 Mean Square Error
6.6 Root Mean Square Error (RMSE)
6.7 Multi-Linear Regression
6.8 Polynomial Features
6.9 Gradient Descent
6.10 Logistic Regression
6.11 Bayesian Statistics
6.12 Bayes’ Theorem
6.13 Monte Carlo Method
---------------------------------------------------------------------------------------------------------------------


Chapter 3 HYPOTHESIS TESTS AND STATISTICAL TESTS

3.1 Typical Analysis Procedures


In “the old days” (before computers with almost unlimited computational power were available),
the statistical analysis of data was typically restricted to hypothesis tests: you formulate a
hypothesis, collect your data, and then accept or reject the hypothesis. The advent of powerful
computers changed the game. Nowadays, the analysis of statistical data is (or at least should be) a
highly interactive process: you look at the data, generate hypotheses and models, check these
models, modify them to improve the correspondence between models and data; when you are
happy, you calculate the confidence interval for your model parameters, and form your
interpretation based on these values.
In either case, you should start off with the following steps:

1. Visually inspect your data.


2. Find outliers and check them carefully.
3. Determine the datatype of your values.
4. If you have continuous data, check whether or not they are normally distributed.
5. Select and apply the appropriate test or start with the model-based analysis of your data.

3.2 Hypothesis Concepts


A statistical hypothesis test is a method of statistical inference using data from a scientific study.
In statistics, a result is called statistically significant if it is unlikely to have occurred "by chance alone", where "by chance alone" is judged with respect to a pre-determined threshold probability, the significance level. These tests are used to determine which outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance.
The name null hypothesis derives from the fact that, according to this hypothesis, some quantity of interest (for example, a difference or an effect) is null, i.e. zero. The critical region of a hypothesis test is the set of all outcomes which cause the null
hypothesis to be rejected in favor of the alternative hypothesis.
A typical approach is as follows:

1. State your hypothesis.


2. Decide which value you want to test your hypothesis on, which is called the test statistic.
3. Compute from the observations the observed value of the test statistic.
4. Calculate the p-value. This is the probability, under the null hypothesis, of sampling a test
statistic at least as extreme as that which was observed.
5. Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the p-value
is less than the significance level (the selected probability) threshold. If p<0.05, the
difference between your sample and the value that you check is significant. If p<0.001, we
speak of a highly significant difference.


Example: Let us compare the weight of two groups of subjects. The null hypothesis is that there is no difference in weight between the two groups. If a statistical comparison of the weights produces a p-value of 0.03, this means that, if the null hypothesis were true, the probability of observing a difference at least as large as the one found is 0.03, or 3%. Since this probability is quite low, we say that there is a significant difference between the weights of the two groups.

3.3 Errors
Types of Error
In hypothesis testing, two types of errors can occur.

Type I errors
These are errors where you get a significant result despite the fact that the null hypothesis is true. The likelihood of a Type I error is commonly indicated with α, and is set before you start the data analysis. For example, assume that the population of young Austrian adults has a mean IQ of 105 (i.e. we are smarter than the rest) and a standard deviation of 15. We now want to check if the average FH student in Linz has the same IQ as the average Austrian, and we select 20 students. We set α = 0.05, i.e. a significance level of 5% (corresponding to a 95% confidence level). Let us now assume that the average student has in fact the same IQ as the average Austrian. If we repeated our study 20 times, we would on average find in one of those 20 repetitions that our sample mean is significantly different from the Austrian average IQ. Such a finding would be a false result, despite the fact that our assumption is correct, and would constitute a Type I error.

Type II errors and Test Power


If we want to answer the question "How much chance do we have to reject the null hypothesis when the alternative is in fact true?", or in other words, "What is the probability of detecting a real effect?", we are faced with a different problem. To answer these questions, we need an alternative hypothesis. For the example given above, an alternative hypothesis could be: "We assume that our population has a different mean IQ, for example 110 instead of 105." A Type II error is an error where you do not get a significant result, despite the fact that the null hypothesis is false. The probability of this type of error is commonly indicated with β. The power of a statistical test is defined as (1 − β) · 100%, and is the chance of correctly rejecting the null hypothesis when the alternative hypothesis is true. Note that for finding the power of a test, you need an alternative hypothesis.

3.4 The p-value


p-values are often used to measure evidence against a hypothesis. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the hypothesis, or, even worse, as the posterior probability (i.e. after the data have been collected) that the hypothesis is true. As an example, take the case where the alternative hypothesis is that the mean is just a fraction of one standard deviation larger than the mean under the null hypothesis: in that case, a sample that produces a p-value of 0.05 may just as likely be produced if the alternative hypothesis is true as if the null hypothesis is true! Sellke, Bayarri, and Berger have investigated this question in detail, and recommend the use of a "calibrated p-value" to estimate the probability of making a mistake when rejecting the null hypothesis, when the data produce a p-value p:

α(p) = 1 / (1 + 1 / (−e · p · ln p)),  which holds for p < 1/e

Remember, p only indicates the likelihood of obtaining a certain value for the test statistic if the
null hypothesis is true - nothing else! And keep in mind that improbable events do happen, even if
not very frequently. For example, back in 1980 a woman named Maureen Wilcox bought tickets
for both the Rhode Island lottery and the Massachusetts lottery. And she got the correct numbers
for both lotteries. Unfortunately for her, she picked all the correct numbers for Massachusetts on
her Rhode Island ticket, and all the right numbers for Rhode Island on her Massachusetts ticket.
Seen statistically, the p-value for such an event would be extremely small - but it did happen
anyway.

3.5 Sample Size


The power of a statistical test depends on four factors:

- α, the probability for Type I errors


- β, the probability for Type II errors (the power of the test is 1 − β)
- d, the effect size, i.e. the magnitude of the investigated effect relative to σ, the standard
deviation of the sample
- n, the sample size
Only 3 of these 4 parameters can be chosen, the 4th is then automatically fixed. The size of the
absolute difference D between mean treatment outcomes that will answer the clinical question
being posed is often called clinical significance or clinical relevance.


3.6 Confusion Matrix


A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance. The matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) produced by the model on the test data. For binary classification, the matrix is a 2×2 table; for multi-class classification with n classes, the matrix has shape n×n. Consider a 2×2 confusion matrix for an image-recognition task in which each image is either a Dog image or a Not-Dog image (a small computational sketch follows the list below).


- True Positive (TP): the count of cases where both the predicted and the actual value are Dog.
- True Negative (TN): the count of cases where both the predicted and the actual value are Not Dog.
- False Positive (FP): the count of cases where the prediction is Dog while the actual value is Not Dog.
- False Negative (FN): the count of cases where the prediction is Not Dog while the actual value is Dog.
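A minimal sketch of this Dog / Not-Dog example, assuming scikit-learn is available; the data values are made up for illustration.

# Build the confusion matrix and read off TP, TN, FP, FN.
from sklearn.metrics import confusion_matrix

y_actual = ["Dog", "Dog", "Dog", "Not Dog", "Not Dog", "Dog", "Not Dog", "Dog"]
y_pred   = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Dog"]

# labels=... fixes the row/column order: index 0 = "Not Dog" (negative), 1 = "Dog" (positive)
cm = confusion_matrix(y_actual, y_pred, labels=["Not Dog", "Dog"])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")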

3.7 Sensitivity and Specificity


Sensitivity and specificity are statistical measures of the performance of a binary classification test. Sensitivity (the true positive rate) measures the proportion of positives that are correctly identified, Sensitivity = TP / (TP + FN), while specificity (the true negative rate) measures the proportion of negatives that are correctly identified, Specificity = TN / (TN + FP). Both are expressed as probabilities or percentages. There is a trade-off between sensitivity and specificity, as increasing one may decrease the other.

3.8 ROC-AUC Curves


Closely related to sensitivity and specificity is the receiver operating characteristic (ROC) curve. This is a graph displaying the relationship between the true positive rate (on the vertical axis) and the false positive rate (on the horizontal axis). The technique comes from the field of engineering, where it was developed to find the predictor which best discriminates between two given distributions. In the ROC curve, this point is given by the value with the largest distance to the diagonal. The area under the ROC curve (AUC) summarizes the whole curve in a single number: an AUC of 0.5 corresponds to random guessing, while an AUC of 1.0 corresponds to a perfect classifier.
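A minimal sketch, assuming scikit-learn, of computing an ROC curve and its AUC from predicted scores; y_score stands in for any classifier's probability output and the values are made up.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rates per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print("AUC =", auc)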


3.9 Tests for Numerical Data


3.9.1 Distribution of a Sample Mean
The distribution of the sample mean is the distribution of the random variable that consists of sample means, i.e. the probabilistic spread of all the means of samples of a fixed size drawn randomly from a particular population. The distribution of the sample mean is approximately normal, with mean equal to the population mean and standard error equal to the population standard deviation divided by the square root of the sample size. This is a consequence of the Central Limit Theorem.

3.9.2 Comparison of Two Groups


Use independent-samples tests either to describe a difference in a variable's frequency or central tendency between two independent groups, or to compare the difference to a hypothesized value. If the data-generating process produces continuous outcomes (interval or ratio) and the outcomes are symmetrically distributed, the difference in the sample means, d̂ = x̄ − ȳ, is a random variable centered at the population difference, d = μx − μy. You can use a theoretical distribution (normal or Student t) to estimate a 95% confidence interval (CI) around d̂, or compare d̂ to a hypothesized population difference, d0. If the Central Limit Theorem (CLT) conditions hold, you can assume the random variable is normally distributed and use the z-test; otherwise assume the random variable has a Student t distribution and use the t-test. If the data-generating process produces continuous outcomes that are not symmetrically distributed, use a non-parametric test like the Mann-Whitney U test. If the data-generating process produces discrete outcomes (counts), the sample count, x, is a random variable from a Poisson, binomial, normal, or multinomial distribution, or a random variable from a theoretical outcome. For two independent samples, the data can be organized into a two-way table - a frequency table for two categorical variables. If you have a single categorical predictor variable, you can test whether the joint frequency counts differ from the expected frequency counts in the saturated model. You can analyze a two-way table in one of two ways (see the sketch after this list).

- If you only care about comparing two levels (like when the response variable is binary),
conduct a proportion-difference z-test or a Fisher exact test.


- If you want to compare the joint frequency counts to expected frequency counts under the
independence model (the model of independent explanatory variables), conduct a Pearson’s
chi-squared independence test, or a G-test.
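A minimal sketch, assuming SciPy, of the two-group tests named above: a t-test for symmetric continuous outcomes, a Mann-Whitney U test for non-symmetric ones, and a chi-squared independence test for a two-way table of counts. All data values are made up.

import numpy as np
from scipy import stats

group_x = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2])
group_y = np.array([4.6, 4.9, 4.4, 4.7, 4.5, 4.8, 4.6])

t_stat, p_t = stats.ttest_ind(group_x, group_y)        # difference in means (Student t)
u_stat, p_u = stats.mannwhitneyu(group_x, group_y)     # non-parametric alternative

table = np.array([[30, 10],                            # two-way frequency table
                  [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(p_t, p_u, p_chi2)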

3.9.3 Comparison of Multiple Groups


Comparison tests look for differences among group means. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.
T-tests are used when comparing the means of precisely two groups (e.g. the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g. the average heights of children, teenagers, and adults). These tests relate a quantitative response variable to a categorical predictor (in model-formula notation, Quantitative ~ Categorical).
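A minimal sketch, assuming SciPy, of a one-way ANOVA comparing the mean height of three groups; f_oneway returns the F statistic and its p-value, and the heights are made up.

from scipy.stats import f_oneway

children  = [120, 125, 130, 128, 122]
teenagers = [160, 165, 158, 170, 162]
adults    = [172, 168, 175, 180, 170]

f_stat, p_value = f_oneway(children, teenagers, adults)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")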


Chapter 4 STATISTICAL METHODS

4.1 Standard Deviation


In statistics, variance and standard deviation are related to each other: the standard deviation of a data set is the square root of its variance. Below are the definitions of variance and standard deviation.
Variance is the measure of how widely a collection of data is spread out. If all the data values are identical, the variance is zero. All non-zero variances are positive. A small variance indicates that the data points are close to the mean, and to each other, whereas a high variance indicates that the data points are highly spread out from the mean and from one another. In short, the variance is defined as the average of the squared distances from each point to the mean.
Standard deviation is a measure which shows how much variation (spread or dispersion) from the mean exists. The standard deviation indicates a "typical" deviation from the mean. It is a popular measure of variability because it is expressed in the original units of measure of the data set. Like the variance, it is small when the data points are close to the mean, and large when the data points are highly spread out from the mean. Standard deviation calculates the extent to which the values differ from the average. Standard deviation, the most widely used measure of dispersion, is based on all values; therefore a change in even one value affects the value of the standard deviation. It is independent of origin but not of scale. It is also useful in certain advanced statistical problems.
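A small sketch, assuming NumPy, computing the variance and standard deviation of a made-up sample; ddof=1 gives the sample (n−1) versions rather than the population ones.

import numpy as np

data = np.array([4, 8, 6, 5, 3, 7, 9, 5])
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)
print(variance, std_dev, np.isclose(np.sqrt(variance), std_dev))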

4.2 Normalization
In statistics and applications of statistics, normalization can have a range of meanings. In the
simplest cases, normalization of ratings means adjusting values measured on different scales to a
notionally common scale, often prior to averaging. In more complicated cases, normalization may
refer to more sophisticated adjustments where the intention is to bring the entire probability
distributions of adjusted values into alignment. In the case of normalization of scores in educational
assessment, there may be an intention to align distributions to a normal distribution. A different
approach to normalization of probability distributions is quantile normalization, where the
quantiles of the different measures are brought into alignment.

4.2.1 Feature Scaling


Feature scaling is a data preprocessing technique used to transform the values of features or
variables in a dataset to a similar scale. The purpose is to ensure that all features contribute equally


to the model and to avoid the domination of features with larger values. Feature scaling becomes
necessary when dealing with datasets containing features that have different ranges, units of
measurement, or orders of magnitude. In such cases, the variation in feature values can lead to
biased model performance or difficulties during the learning process. There are several common
techniques for feature scaling, including standardization, normalization, and min-max scaling.
These methods adjust the feature values while preserving their relative relationships and
distributions. By applying feature scaling, the dataset’s features can be transformed to a more
consistent scale, making it easier to build accurate and effective machine learning models. Scaling
facilitates meaningful comparisons between features, improves model convergence, and prevents
certain features from overshadowing others based solely on their magnitude.

4.2.2 Min-Max Scaling


Min-Max Scaler is one of the most popular scaling algorithms. It transforms features by scaling each feature to a given range, which is generally [0, 1], or [-1, 1] in the case of negative values.
For each feature, the Min-Max Scaler follows the formula:

x_scaled = (x − min(x)) / (max(x) − min(x))

It subtracts the minimum of the column from each value and then divides by the range, i.e. max(x) − min(x). This scaling algorithm works very well in cases where the standard deviation is very small, or in cases where the data do not have a Gaussian distribution.
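A minimal sketch, assuming scikit-learn, applying Min-Max scaling; each column is mapped to [0, 1] using the (x − min) / (max − min) formula learned from the data, and the matrix is made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 600.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)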

4.3 Bias
Bias is simply defined as the inability of the model to capture the true relationship in the data, because of which there is some difference or error between the model's predicted value and the actual value. These differences between actual or expected values and the predicted values are known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based on a sample of data. Then, the bias of the estimator Ŷ is given by:

Bias(Ŷ) = E(Ŷ) − Y

where E(Ŷ) is the expected value of the estimator Ŷ. Bias measures how well the model fits the data.
Low Bias: A low bias value means fewer assumptions are made about the form of the target function. In this case, the model will closely match the training dataset.


High Bias: A high bias value means more assumptions are made about the form of the target function. In this case, the model will not match the training dataset closely. A high-bias model is not able to capture the trend in the dataset; this is the underfitting case, which has a high error rate and is usually due to an overly simplified algorithm.
For example, a linear regression model may have a high bias if the data has a non-linear relationship.

4.4 Variance
Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, variance describes how sensitive the model is to a different subset of the training dataset, i.e. how much it adjusts when trained on a new subset of the training data.
Let Y be the actual values of the target variable, and Ŷ the predicted values of the target variable. Then the variance of a model can be measured as the expected value of the squared difference between the predicted values and the expected value of the predicted values:

Variance = E[(Ŷ − E[Ŷ])²]

where E[Ŷ] is the expected value of the predicted values. Here the expectation is taken over all the training data. Variance errors are either low-variance or high-variance errors.
Low variance: Low variance means that the model is less sensitive to changes in the training data and produces consistent estimates of the target function across different subsets of data from the same distribution. Combined with high bias, this is the underfitting case, where the model fails to generalize on both training and test data.
High variance: High variance means that the model is very sensitive to changes in the training data and can produce significantly different estimates of the target function when trained on different subsets of data from the same distribution. This is the overfitting case, where the model performs well on the training data but poorly on new, unseen test data: it fits the training data so closely that it fails on new data.

4.5 Regularization
In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that changes the answer of a problem to be "simpler". It is often used to obtain results for ill-posed problems or to prevent overfitting. Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
- Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
- Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient-boosted trees).
In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both with Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more faithful to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. It can also be physically motivated by common sense or intuition. In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score of the trained model on the evaluation set and not on the training data. One of the earliest uses of regularization is Tikhonov regularization, related to the method of least squares.
A regularization term (or regularizer) R(f) is added to a loss function, so that the regularized objective to be minimized over f becomes:

Σi V(f(xi), yi) + λ · R(f)

where V is the loss measuring the fit to the data and λ ≥ 0 controls the strength of the regularization.

4.6 Ridge Regression


Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful for mitigating the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias (see bias-variance tradeoff). The theory was first introduced by Hoerl and Kennard in 1970 in their Technometrics papers "RIDGE regressions: biased estimation of nonorthogonal problems" and "RIDGE regressions: applications in nonorthogonal problems". This was the result of ten years of research into the field of ridge analysis. Ridge regression was developed as a possible solution to the imprecision of least-squares estimators when linear regression models have some multicollinear (highly correlated) independent variables, by creating a ridge regression estimator (RR). This provides a more precise estimate of the ridge parameters, as its variance and mean square estimator are often smaller than those of the least-squares estimators previously derived. Analogous to the ordinary least-squares estimator, the simple ridge estimator is then given by:

β̂_ridge = (XᵀX + λI)⁻¹ Xᵀ y

where X is the design matrix, y the response vector, I the identity matrix, and λ ≥ 0 the regularization (ridge) parameter; λ = 0 recovers the ordinary least-squares estimator.
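A minimal sketch, assuming scikit-learn, of fitting ridge regression on made-up data with one nearly collinear feature; alpha plays the role of the regularization parameter λ above.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)      # nearly collinear feature
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)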

4.7 Lasso Regression


Lasso regression is a regularization technique. It is used over plain regression methods for a more accurate prediction. This model uses shrinkage: data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination. Lasso regression uses the L1 regularization technique (discussed below). It is useful when we have many features, because it automatically performs feature selection. Here is a step-by-step explanation of how LASSO regression works:

1. Linear regression model: LASSO regression starts with the standard linear regression model, which assumes a linear relationship between the independent variables (features) and the dependent variable (target). The linear regression equation can be represented as follows:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

where y is the dependent variable (target); β₀, β₁, β₂, ..., βₚ are the coefficients (parameters) to be estimated; x₁, x₂, ..., xₚ are the independent variables (features); and ε represents the error term.
2. L1 regularization: LASSO regression introduces an additional penalty term based on the absolute values of the coefficients. The L1 regularization term is the sum of the absolute values of the coefficients multiplied by a tuning parameter λ:

L₁ = λ · (|β₁| + |β₂| + ... + |βₚ|)

where λ is the regularization parameter that controls the amount of regularization applied, and β₁, β₂, ..., βₚ are the coefficients.
3. Objective function: The objective of LASSO regression is to find the values of the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, while also minimizing the L1 regularization term:

Minimize: RSS + L₁

where RSS is the residual sum of squares, which measures the error between the predicted values and the actual values.
4. Shrinking coefficients: By adding the L1 regularization term, LASSO regression can shrink the coefficients towards zero. When λ is sufficiently large, some coefficients are driven to exactly zero. This property of LASSO makes it useful for feature selection, as the variables with zero coefficients are effectively removed from the model.
5. Tuning parameter λ: The choice of the regularization parameter λ is crucial in LASSO regression. A larger λ value increases the amount of regularization, leading to more coefficients being pushed towards zero. Conversely, a smaller λ value reduces the regularization effect, allowing more variables to have non-zero coefficients.
6. Model fitting: To estimate the coefficients in LASSO regression, an optimization algorithm is used to minimize the objective function. Coordinate descent is commonly employed, which iteratively updates each coefficient while holding the others fixed.

LASSO regression offers a powerful framework for both prediction and feature selection,
especially when dealing with high-dimensional datasets where the number of features is large. By
striking a balance between simplicity and accuracy, LASSO can provide interpretable models
while effectively managing the risk of overfitting. It’s worth noting that LASSO is just one type of
regularization technique, and there are other variants such as Ridge regression (L2 regularization)
and Elastic Net.
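A minimal sketch, assuming scikit-learn, of LASSO on made-up data with 10 features of which only 3 matter; alpha corresponds to λ, and coefficients driven exactly to zero indicate features dropped by the L1 penalty.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 10 features, only 3 truly relevant
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(lasso.coef_, 3))
print("selected features:", np.flatnonzero(lasso.coef_ != 0))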

4.8 Cross Validation Techniques


Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one of
these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance. The main purpose of cross validation is to prevent overfitting, which occurs when a
model is trained too well on the training data and performs poorly on new, unseen data. By
evaluating the model on multiple validation sets, cross validation provides a more realistic estimate
of the model’s generalization performance, i.e., its ability to perform well on new, unseen data.
There are several types of cross-validation techniques, including k-fold cross-validation, leave-one-out cross-validation, and stratified cross-validation. The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modeling problem.


4.8.1 K-Fold Method
In k-fold cross-validation, we first divide our dataset into k equally sized subsets. Then, we repeat the train-test procedure k times, such that each time one of the k subsets is used as the test set and the remaining k−1 subsets are used together as the training set. Finally, we estimate the model's performance by averaging the scores over the k trials. For example, let us suppose that we have a dataset S = {x1, x2, x3, x4, x5, x6} containing 6 samples and that we want to perform a 3-fold cross-validation.

First, we divide S into 3 subsets randomly. For instance:

S1 = {x1, x2}, S2 = {x3, x4}, S3 = {x5, x6}

Then, we train and evaluate our machine-learning model 3 times. Each time, two subsets form the training set, while the remaining one acts as the test set. In our example:

- Trial 1: train on S2 ∪ S3, test on S1 (score1)
- Trial 2: train on S1 ∪ S3, test on S2 (score2)
- Trial 3: train on S1 ∪ S2, test on S3 (score3)

Finally, the overall performance is the average of the model's performance scores on those three test sets:

score = (score1 + score2 + score3) / 3
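A minimal sketch, assuming scikit-learn, of 3-fold cross-validation on a small made-up regression problem; cross_val_score handles the splitting, fitting, and scoring.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.arange(12, dtype=float).reshape(6, 2)     # 6 samples, 2 features
y = X[:, 0] + 2 * X[:, 1]

cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores, scores.mean())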

4.8.2 LOOCV
In leave-one-out (LOO) cross-validation, we train our machine-learning model n times, where n is equal to our dataset's size. Each time, only one sample is used as the test set, while the rest are used to train the model. LOO is thus an extreme case of k-fold cross-validation where k = n. If we apply LOO to the previous example, we will have 6 test subsets:

S1 = {x1}, S2 = {x2}, S3 = {x3}, S4 = {x4}, S5 = {x5}, S6 = {x6}

Iterating over them, we use S \ Si as the training data in iteration i = 1, 2, ..., 6, and evaluate the model on Si. The final performance estimate is the average of the six individual scores.

4.8.3 Stratified K-Fold


Stratified k-fold cross-validation works like ordinary k-fold cross-validation, except that it uses stratified sampling instead of purely random sampling: each fold is constructed so that it preserves the class proportions of the full dataset. This is especially useful for imbalanced classification problems.

4.8.4 Grid Search


In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned. The same kind of machine learning model can require different constraints, weights, or learning rates to generalize different data patterns. These measures are called hyperparameters and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Grid search is the simplest such strategy: the user specifies a finite set of candidate values for each hyperparameter, and every combination in the resulting grid is trained and evaluated. Cross-validation is often used to estimate this generalization performance, and therefore to choose the set of hyperparameter values that maximize it.
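A minimal sketch, assuming scikit-learn, of a grid search with cross-validation over the ridge regularization strength alpha on made-up data; best_params_ holds the winning combination.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=80)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)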

4.9 Cross Validation Error


In machine learning, evaluating the performance of a model is crucial to ensure its ability to generalize well to unseen data. One commonly used technique for this purpose is cross-validation, which provides an estimate of model performance on new, unseen data. Here we explore cross-validation error, its calculation, and its significance in assessing model performance. Before diving into this metric, it is necessary to understand a few concepts, such as Mean Squared Error, Mean Absolute Error, and other ways to get insight into the quality of a model when you are testing with only one data set.

Cross-validation is commonly employed when the initial evaluation (for instance, the Mean Squared Error on a single split) demonstrates reasonably satisfactory performance and you want to obtain a more reliable estimate of the model's generalization ability. It helps assess the model's performance across multiple subsets of the data and provides a more robust evaluation by mitigating the potential bias introduced by a single train-test split. Cross-validation error therefore provides a more reliable estimate of a model's performance than a single train-test split: it evaluates the model on several subsets of the data, reducing the variability due to a particular choice of training and validation data and leading to a more robust performance estimate. It aids in model selection, hyperparameter tuning, and comparing different models.


Chapter 5 STATISTICAL PROCESSING

5.1 Dimensionality Reduction Techniques


Dimensionality reduction is a technique used to reduce the number of features in a dataset while
retaining as much of the important information as possible. In other words, it is a process of
transforming high-dimensional data into a lower-dimensional space that still preserves the essence
of the original data. In machine learning, high-dimensional data refers to data with a large number
of features or variables. The curse of dimensionality is a common problem in machine learning,
where the performance of the model deteriorates as the number of features increases. This is
because the complexity of the model increases with the number of features, and it becomes more
difficult to find a good solution. In addition, high-dimensional data can also lead to overfitting,
where the model fits the training data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the complexity of the
model and improving its generalization performance. There are two main approaches to
dimensionality reduction: feature selection and feature extraction.
Feature Selection: Feature selection involves selecting a subset of the original features that are
most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while
retaining the most important features. There are several methods for feature selection, including
filter methods, wrapper methods, and embedded methods. Filter methods rank the features based
on their relevance to the target variable, wrapper methods use the model performance as the criteria
for selecting features, and embedded methods combine feature selection with the model training
process.
Feature Extraction: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.


5.1.1 Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that when data in a higher-dimensional space are mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximal.

It involves the following steps:

1. Construct the covariance matrix of the data.


2. Compute the eigenvectors of this matrix.
3. Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data loss
in the process. But, the most important variances should be retained by the remaining eigenvectors.
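A minimal sketch, assuming scikit-learn, of PCA reducing made-up 4-dimensional data (with one redundant dimension) to the 2 principal components that retain most of the variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)       # redundant dimension

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                                # (100, 2)
print(pca.explained_variance_ratio_)                  # variance retained per component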


5.1.2 Discriminant Analysis


Discriminant analysis (DA) is a multivariate technique which is utilized to divide two or more
groups of observations (individuals) premised on variables measured on each experimental unit
(sample) and to discover the impact of each parameter in dividing the groups. In addition, the
prediction or allocation of newly defined observations to previously specified groups may be
examined using a linear or quadratic function for assigning each individual to existing groups. This
can be done by determining which group each individual belongs to. A system for determining
membership in a group may be constructed using discriminant analysis. The method comprises a
discriminant function (or, for more than two groups, a set of discriminant functions) that is
premised on linear combinations of the predictor variables that offer the best discrimination
between the groups. If there are more than two groups, the model will consist of a set of discriminant functions. After the functions have been constructed using a sample of instances for which the group membership is known, they may be applied to fresh cases that contain measurements for the predictor variables but whose group membership is unknown.
Linear and quadratic discriminant analysis are the two varieties of a statistical technique known as
discriminant analysis.
#1 – Linear Discriminant Analysis: Often known as LDA, this is a supervised approach that attempts to predict the class of the dependent variable by utilizing a linear combination of the independent variables. It is predicated on the hypothesis that the independent variables are normally distributed (continuous and numerical) and that each class has the same variance and covariance. Both classification and dimensionality reduction may be accomplished with the assistance of this method.

#2 – Quadratic Discriminant Analysis: QDA is a variant of discriminant analysis that uses quadratic combinations of the independent variables to predict the class of the dependent variable. The assumption of normally distributed predictors is maintained, but it does not presume that the classes have equal covariance. QDA therefore produces a quadratic decision boundary.

5.2 Feature Selection


In this, we try to find a subset of the original set of variables, or features, to get a smaller subset
which can be used to model the problem. It usually involves three ways:

1. Filter
2. Wrapper
3. Embedded


5.2.1 Chi-Square Method


The chi-square test is used for categorical features in a dataset. We calculate the chi-square statistic between each feature and the target, and select the desired number of features with the best chi-square scores. It determines whether the association between two categorical variables in the sample reflects a real association in the population. The chi-square score is given by:

χ² = Σ (Observed frequency − Expected frequency)² / Expected frequency

where the observed frequency is the number of observations of a class, and the expected frequency is the number of observations of that class expected if there were no relationship between the feature and the target.
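A minimal sketch, assuming scikit-learn, of chi-square feature selection on the iris dataset (whose features are non-negative, as the chi2 score requires); the two features with the best scores are kept.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)                     # chi-square score per feature
print(X_new.shape)                          # (150, 2)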

5.2.2 Variance Threshold


The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables, or between feature and target variables, into account, which is one of the drawbacks of filter methods. In scikit-learn, the selector's get_support() method returns a Boolean vector in which True marks the features whose variance exceeds the threshold and which are therefore kept.
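A minimal sketch, assuming scikit-learn, of the variance-threshold filter on a made-up matrix; the first column is constant (zero variance) and is removed with the default threshold of 0.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1, 2.0, 0.1],
              [1, 3.5, 0.4],
              [1, 1.0, 0.3],
              [1, 4.2, 0.2]])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(selector.get_support())   # [False  True  True] -> constant column dropped
print(X_reduced.shape)          # (4, 2)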

5.2.3 Recursive Feature Elimination


Recursive Feature Elimination (RFE) is a feature selection algorithm that is used to select a subset
of the most relevant features from a dataset. It is a recursive process that starts with all the features
in the dataset and then iteratively removes the least essential features until the desired number of
features is reached. The main logic behind RFE is that the most relevant features will have the
highest impact on the target variable, and thus will be more useful for predicting the target. RFE
uses a model (such as a linear regression or support vector machine) to evaluate the importance of
each feature, and the features with the lowest importance are eliminated in each iteration. The role
of recursion in RFE is to repeatedly perform the feature selection process until the desired number
of features is reached. In each iteration, the algorithm removes the least important features and
then refits the model with the remaining features. This process is repeated until the desired number
of features is reached or the performance of the model no longer improves. RFE is a useful
algorithm for feature selection because it is simple to implement and can be applied to a variety of
models. It is especially useful for datasets with a large number of features, as it can help to reduce
the dimensionality of the dataset and improve the performance of the model.
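A minimal sketch, assuming scikit-learn, of Recursive Feature Elimination on a synthetic regression dataset, using a linear model to rank features and keeping the 3 most relevant ones.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)    # True for the features that were kept
print(rfe.ranking_)    # 1 = selected; larger values were eliminated earlier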


5.3 Outliers Detection Method


Outliers are values at the extreme ends of a dataset. Some outliers represent true values from natural
variation in the population. Other outliers may result from incorrect data entry, equipment
malfunctions, or other measurement errors. An outlier isn’t always a form of dirty or incorrect data,
so you have to be careful with them in data cleansing. What you should do with an outlier depends
on its most likely cause.
- True outliers should always be retained in your dataset because these just represent natural
variations in your sample. True outliers are also present in variables with skewed distributions
where many data points are spread far from the mean in one direction. It’s important to select
appropriate statistical tests or measures when you have a skewed distribution or many outliers.
- Outliers that don’t represent true values can come from many possible sources:

1. Measurement errors
2. Data entry or processing errors
3. Unrepresentative sampling
In practice, it can be difficult to tell different types of outliers apart. While you can use calculations
and statistical methods to detect outliers, classifying them as true or false is usually a subjective
process.
Methods:
1. Sorting method: You can sort quantitative variables from low to high and scan for
extremely low or extremely high values. Flag any extreme values that you find. This is a simple
way to check whether you need to investigate certain data points before using more sophisticated
methods.
2. Using visualizations: You can use software to visualize your data with a box plot, or box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data. Many computer programs highlight outliers on such a chart with an asterisk or a dot lying beyond the whiskers of the plot.
3. Statistical outlier detection: Statistical outlier detection involves applying statistical tests or procedures to identify extreme values. You can convert extreme data points into z-scores that tell you how many standard deviations away they are from the mean. If a value has a high enough or low enough z-score, it can be considered an outlier. As a rule of thumb, values with a z-score greater than 3 or less than −3 are often determined to be outliers (see the sketch after this list).
4. Using the interquartile range: The interquartile range (IQR) tells you the range of the
middle half of your dataset. You can use the IQR to create “fences” around your data and then
define outliers as any values that fall outside those fences. This method is helpful if you have a
few values on the extreme ends of your dataset, but you aren’t sure whether any of them might
count as outliers.
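A minimal sketch, assuming NumPy, of the z-score and IQR rules described above on simulated data: |z| > 3 flags extreme points, and the IQR "fences" are Q1 − 1.5·IQR and Q3 + 1.5·IQR.

import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), [95.0])   # 95 is a planted outlier

z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
iqr_outliers = data[(data < fences[0]) | (data > fences[1])]

print(z_outliers, iqr_outliers)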


5.4 Resampling-Random
The resampling method is a statistical method used to generate new data points for a dataset by randomly picking data points from the existing dataset. It helps in creating new synthetic datasets for training machine learning models, and in estimating the properties of a dataset when that dataset is unknown, difficult to estimate, or small in sample size. Two common methods of resampling are:

1. Cross Validation
2. Bootstrapping

5.4.1 Random Over-resampling


It aims to balance the class distribution by randomly increasing the number of minority-class examples by replicating them.
For example:
Total Observations: 100
Positive (majority) Dataset: 90
Negative (minority) Dataset: 10
Event Rate: 10/100 = 10%
We replicate the Negative Dataset 15 times:
Positive Dataset: 90
Negative Dataset after Replicating: 150
Total Observations: 240
Event Rate: 150/240 ≈ 62.5%


5.4.2 Random Under-Resampling


It aims to balance the class distribution by randomly eliminating majority-class examples.
For example:
Total Observations: 100
Positive (majority) Dataset: 90
Negative (minority) Dataset: 10
Event rate: 10/100 = 10%
We take a 10% sample of the Positive Dataset and combine it with the Negative Dataset:
Positive Dataset after Random Under-Sampling: 10% of 90 = 9
Total observations after combining with the Negative Dataset: 10 + 9 = 19
Event Rate after Under-Sampling: 10/19 ≈ 53%
When instances of two different classes are very close to each other, we remove the instances of the
majority class to increase the spaces between the two classes. This helps in the classification process.
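A minimal sketch, assuming NumPy, of random over- and under-sampling on the 90/10 example above, done by resampling indices of the minority and majority classes.

import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 90 + [0] * 10)            # 90 positive (majority), 10 negative (minority)
pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]

# Random over-sampling: replicate minority indices (with replacement) up to 150.
over_idx = np.concatenate([pos_idx, rng.choice(neg_idx, size=150, replace=True)])

# Random under-sampling: keep only a 10% sample of the majority class.
under_idx = np.concatenate([rng.choice(pos_idx, size=9, replace=False), neg_idx])

print(len(over_idx), len(under_idx))          # 240 and 19 observations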

Chapter 6 STATISTICAL MODELING


6.1 Linear Regression Models


Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features. When there is a single independent feature, it is known as univariate (simple) linear regression; with more than one feature, it is known as multivariate (multiple) linear regression. The goal of the algorithm is to find the best linear equation that can predict the value of the dependent variable based on the independent variables. The equation provides a straight line that represents the relationship between the dependent and independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s). Linear regression is used in many different fields, including finance, economics, and psychology, to understand and predict the behavior of a particular variable. For example, in finance, linear regression might be used to understand the relationship between a company's stock price and its earnings, or to predict the future value of a currency based on its past performance. Regression is one of the most important supervised learning tasks. In regression, a set of records is present with X and Y values, and these values are used to learn a function, so that if you want to predict Y from an unknown X, this learned function can be used. In regression we have to find the value of Y, so a function is required that predicts a continuous Y given X as independent features. Here Y is called the dependent or target variable and X is called the independent variable, also known as the predictor of Y. There are many types of functions or modules that can be used for regression; a linear function is the simplest. Here, X may be a single feature or multiple features representing the problem.
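A minimal sketch, assuming scikit-learn, of fitting a univariate linear regression y = a·x + b to made-up data and reading off the fitted slope and intercept.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape(-1, 1)
y = np.array([2.1, 4.1, 5.8, 8.2, 9.9, 12.1])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=7:", model.predict([[7.0]])[0])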

6.2 Correlation Coefficient


A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of
a relationship between variables. In other words, it reflects how similar the measurements of two
or more variables are across a dataset.


A correlation coefficient is a descriptive statistic. That means that it summarizes sample data
without letting you infer anything about the population. A correlation coefficient is a bivariate
statistic when it summarizes the relationship between two variables, and it’s a multivariate statistic
when you have more than two variables.

6.3 Rank Correlation


Sometimes there doesn’t exist a marked linear relationship between two random variables but a
monotonic relation (if one increases, the other also increases or instead, decreases) is clearly
noticed. Pearson’s Correlation Coefficient evaluation, in this case, would give us the strength and
direction of the linear association only between the variables of interest. Herein comes the
advantage of the Spearman Rank Correlation methods, which will instead give us the strength and
direction of the monotonic relation between the connected variables. This can be a good starting
point for further evaluation.
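A minimal sketch, assuming SciPy, comparing Pearson and Spearman correlation on a monotonic but non-linear relationship (y = x cubed); the Spearman rank correlation detects the monotonic relation perfectly.

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11, dtype=float)
y = x ** 3

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {rho_spearman:.3f}")   # rho = 1.0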


6.4 Residual Error


The residual standard error is used to measure how well a regression model fits a dataset. In simple terms, it measures the standard deviation of the residuals in a regression model. It is calculated as:

Residual standard error = √( Σ(y − ŷ)² / df )

where y is an observed value, ŷ is the corresponding predicted value, and df is the degrees of freedom, calculated as the total number of observations minus the total number of model parameters.
The smaller the residual standard error, the better a regression model fits a dataset. Conversely, the higher the residual standard error, the worse a regression model fits a dataset. A regression model that has a small residual standard error will have data points that are closely packed around the fitted regression line.

The residuals of this model (the difference between the observed values and the predicted values)
will be small, which means the residual standard error will also be small. Conversely, a regression
model that has a large residual standard error will have data points that are more loosely scattered
around the fitted regression line:


The residuals of this model will be larger, which means the residual standard error will also be larger.

6.5 Mean Square Error


The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the average of the squared errors, i.e. the average squared difference between the estimated values and the true values. It is a risk function, corresponding to the expected value of the squared error loss. It is always non-negative, and values close to zero are better. The MSE is the second moment of the error (about the origin) and thus incorporates both the variance of the estimator and its bias.
Steps to find the MSE:

1. Find the equation for the regression line.
2. Insert the X values into the equation found in step 1 in order to get the respective predicted Y values (ŷ).
3. Now subtract the predicted Y values from the original Y values. The resulting values are the error terms (residuals), also known as the vertical distance of each point from the regression line.
4. Square the errors found in step 3.
5. Sum up all the squares.
6. Divide the value found in step 5 by the total number of observations.

6.6 Root Mean Square Error (RMSE)


Regression analysis is a technique we can use to understand the relationship between one or more predictor variables and a response variable. One way to assess how well a regression model fits a dataset is to calculate the root mean square error, a metric that tells us the average distance between the predicted values from the model and the actual values in the dataset. The lower the RMSE, the better a given model is able to "fit" a dataset. The formula for the root mean square error, often abbreviated RMSE, is as follows:

RMSE = √( Σ(Pi − Oi)² / n )

where Σ is a symbol that means "sum", Pi is the predicted value for the i-th observation in the dataset, Oi is the observed value for the i-th observation in the dataset, and n is the sample size.
A short computational sketch of the MSE and RMSE is given below.
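A minimal sketch, assuming NumPy, computing MSE and RMSE for a handful of made-up observed values (Oi) and model predictions (Pi).

import numpy as np

observed  = np.array([34, 37, 44, 47, 48, 48, 46, 43, 32, 27, 26, 24], dtype=float)
predicted = np.array([37, 40, 46, 44, 46, 50, 45, 44, 34, 30, 22, 23], dtype=float)

mse = np.mean((predicted - observed) ** 2)
rmse = np.sqrt(mse)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")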

6.7 Multi-Linear Regression


Regression models are used to describe relationships between variables by fitting a line to the
observed data. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change. Multiple linear regression is used to estimate the relationship
between two or more independent variables and one dependent variable. You can use multiple
linear regression when you want to know:

- How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
- The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
The formula for a multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn + ε
where y is the dependent variable, β0 is the intercept, β1 … βn are the coefficients of the independent variables X1 … Xn, and ε is the model error (residual).
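A minimal sketch of fitting such a model with scikit-learn; the rainfall, fertilizer, and yield numbers are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# columns: rainfall (mm), fertilizer (kg); the numbers are purely illustrative
X = np.array([[60, 10], [80, 12], [100, 15], [120, 18], [140, 20], [160, 24]])
y = np.array([2.1, 2.6, 3.2, 3.9, 4.3, 5.0])      # crop yield (t/ha)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)              # estimated b0 and [b1, b2]
print(model.predict([[110, 16]]))                 # predicted yield for a new observation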

6.8 Polynomial Features


Polynomial features are those features created by raising existing features to an exponent. For
example, if a dataset had one input feature X, then a polynomial feature would be the addition of
a new feature (column) where values were calculated by squaring the values in X, e.g., X^2. This
process can be repeated for each input variable in the dataset, creating a transformed version of
each. As such, polynomial features are a type of feature engineering, e.g., the creation of new input
features based on the existing features. The “degree” of the polynomial is used to control the
number of features added, e.g., a degree of 3 will add two new variables for each input variable.
Typically, a small degree is used such as 2 or 3. It is also common to add new variables that
represent the interaction between features, e.g., a new column that represents one variable
multiplied by another. This too can be repeated for each input variable creating a new “interaction”
variable for each pair of input variables. A squared or cubed version of an input variable will change
the probability distribution, separating the small and large values, a separation that is increased
with the size of the exponent. This separation can help some machine learning algorithms make
better predictions and is common for regression predictive modeling tasks and generally tasks that
have numerical input variables. Typically, linear algorithms, such as linear regression and logistic
regression, respond well to the use of polynomial input variables.
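As a rough sketch, scikit-learn's PolynomialFeatures can generate these squared and interaction terms; the input values here are arbitrary examples:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3], [3, 5], [4, 7]])            # two input features, arbitrary values
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                    # adds x0^2, x0*x1, x1^2 columns

print(poly.get_feature_names_out())               # names of the generated features (scikit-learn >= 1.0)
print(X_poly)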

6.9 Gradient Descent


Gradient Descent is known as one of the most commonly used optimization algorithms to train
machine learning models by means of minimizing errors between actual and expected results.
Further, gradient descent is also used to train Neural Networks. In mathematical terminology,
Optimization algorithm refers to the task of minimizing/maximizing an objective function f(x)
parameterized by x. Similarly, in machine learning, optimization is the task of minimizing the cost
function parameterized by the model's parameters. The main objective of gradient descent is to
minimize the convex function using iteration of parameter updates. Once these machine learning
models are optimized, these models can be used as powerful tools for Artificial Intelligence and
various computer science applications. In this tutorial on Gradient Descent in Machine Learning,
we will learn in detail about gradient descent, the role of cost functions specifically as a barometer
within Machine Learning, types of gradient descents, learning rates, etc. The best way to define
the local minimum or local maximum of a function using gradient descent is as follows:

1. If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.
2. Whenever we move towards a positive gradient or towards the gradient of the function at
the current point, we will get the local maximum of that function.

Moving towards the positive gradient is called Gradient Ascent; gradient descent itself is also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:

1. Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
2. Move in the direction opposite to the gradient, i.e. update the current point by subtracting the gradient multiplied by alpha, where alpha is defined as the Learning Rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.
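A minimal sketch of these two steps on the simple convex function f(x) = (x − 3)²; the starting point and learning rate are arbitrary choices:

def gradient(x):
    return 2.0 * (x - 3.0)        # first-order derivative of f(x) = (x - 3)^2

x = 10.0                          # arbitrary starting point
alpha = 0.1                       # learning rate (step-length tuning parameter)
for _ in range(100):
    x = x - alpha * gradient(x)   # step in the direction opposite to the gradient

print(round(x, 4))                # converges towards the minimum at x = 3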

6.10 Logistic Regression


Logistic regression is a supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Although it is used for classification, it is referred to as regression because it takes the output of a linear regression function as input and passes it through a sigmoid function to estimate the probability of the given class. The difference between linear regression and logistic regression is that linear regression outputs a continuous value that can be anything, while logistic regression predicts the probability that an instance belongs to a given class or not. It is used for predicting a categorical dependent variable using a given set of independent variables.

- Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value.
- It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as
0 and 1, it gives the probabilistic values which lie between 0 and 1.
- Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
- In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
- The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
- Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
- Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification.
Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as “low”, “Medium”, or “High”.
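A minimal sketch of binomial logistic regression with scikit-learn, on a tiny made-up dataset (feature values and labels are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]])  # single feature
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                  # binary class labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.2]]))          # predicted class (0 or 1)
print(clf.predict_proba([[2.2]]))    # probabilities between 0 and 1 for each class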


6.11 Bayesian Statistics


Bayesian statistics is a theory in the field of statistics based on the Bayesian interpretation of
probability where probability expresses a degree of belief in an event. The degree of belief may be
based on prior knowledge about the event, such as the results of previous experiments, or on
personal beliefs about the event. This differs from a number of other interpretations of probability,
such as the frequentist interpretation that views probability as the limit of the relative frequency of
an event after many trials. Bayesian statistical methods use Bayes' theorem to compute and update
probabilities after obtaining new data. Bayes' theorem describes the conditional probability of an
event based on data as well as prior information or beliefs about the event or conditions related to
the event. For example, in Bayesian inference, Bayes' theorem can be used to estimate the
parameters of a probability distribution or statistical model. Since Bayesian statistics treats
probability as a degree of belief, Bayes' theorem can directly assign a probability distribution that
quantifies the belief to the parameter or set of parameters.
Bayes' theorem is used in Bayesian methods to update probabilities, which are degrees of belief,
after obtaining new data. Given two events A and B, the conditional probability of A given that B
is true is expressed as follows:
P(A|B) = P(B|A) P(A) / P(B)
The probability of the evidence P(B) can be calculated using the law of total probability. If {A1, A2, …, An} is a partition of the sample space, which is the set of all outcomes of an experiment, then
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An)

6.12 Baye’s Theorem


Bayes theorem is also known as the Bayes Rule or Bayes Law. It is used to determine the
conditional probability of event A when event B has already happened. The general statement of
Bayes’ theorem is “The conditional probability of an event A, given the occurrence of another
event B, is equal to the product of the event of B, given A and the probability of A divided by the
probability of event B.” i.e.
P(A|B) = P(B|A)P(A) / P(B)
where,
P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A when event B happens.
P(B|A) is the probability of event B when A happens.


Bayes’ Theorem for a set of n events is defined as follows:


Let E1, E2,…, En be a set of events associated with the sample space S, in which all the events E1,
E2,…, En have a non-zero probability of occurrence. All the events E1, E2,…, En form a partition
of S. Let A be an event from space S for which we have to find probability, then according to
Bayes’ theorem,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
for k = 1, 2, 3, …, n
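A small numerical sketch of this formula, with hypothetical priors P(Ei) and likelihoods P(A|Ei):

priors      = [0.3, 0.5, 0.2]       # P(E1), P(E2), P(E3) -- hypothetical values
likelihoods = [0.02, 0.05, 0.10]    # P(A|E1), P(A|E2), P(A|E3)

evidence = sum(p * l for p, l in zip(priors, likelihoods))            # P(A), law of total probability
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # P(Ei|A) for each i
print(posteriors)                    # the posterior probabilities sum to 1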

6.13 Monte Carlo Method


Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms
that rely on repeated random sampling to obtain numerical results. The underlying concept is to
use randomness to solve problems that might be deterministic in principle. They are often used in
physical and mathematical problems and are most useful when it is difficult or impossible to use
other approaches. Monte Carlo methods are mainly used in three problem classes: optimization,
numerical integration, and generating draws from a probability distribution. In physics-related
problems, Monte Carlo methods are useful for simulating systems with many coupled degrees of
freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures (see
cellular Potts model, interacting particle systems, McKean–Vlasov processes, kinetic models of
gases). Other examples include modeling phenomena with significant uncertainty in inputs such
as the calculation of risk in business and, in mathematics, evaluation of multidimensional definite
integrals with complicated boundary conditions. In application to systems engineering problems
(space, oil exploration, aircraft design, etc.), Monte Carlo–based predictions of failure, cost
overruns and schedule overruns are routinely better than human intuition or alternative "soft"
methods.
In principle, Monte Carlo methods can be used to solve any problem having a probabilistic
interpretation. By the law of large numbers, integrals described by the expected value of some
random variable can be approximated by taking the empirical mean (a.k.a. the 'sample mean') of
independent samples of the variable. When the probability distribution of the variable is
parameterized, mathematicians often use a Markov chain Monte Carlo (MCMC) sampler. The
central idea is to design a judicious Markov chain model with a prescribed stationary probability
distribution. That is, in the limit, the samples being generated by the MCMC method will be
samples from the desired (target) distribution. By the ergodic theorem, the stationary distribution
is approximated by the empirical measures of the random states of the MCMC sampler.


Monte Carlo methods vary, but tend to follow a particular pattern:

1. Define a domain of possible inputs.


2. Generate inputs randomly from a probability distribution over the domain.
3. Perform a deterministic computation on the inputs.
4. Aggregate the results.
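A minimal sketch of this four-step pattern, estimating pi by sampling points in the unit square (the sample size is an arbitrary choice):

import random

n = 100_000                                    # 1. domain: n points in the unit square
inside = 0
for _ in range(n):
    x, y = random.random(), random.random()    # 2. generate inputs randomly
    if x * x + y * y <= 1.0:                   # 3. deterministic test: inside the quarter circle?
        inside += 1

print(4 * inside / n)                          # 4. aggregate: approximates pi (about 3.14)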
