
MACHINE LEARNING

ML is the field where a machine/program learns from data using an algorithm, without a programmer
explicitly writing out the rules for the system. There are three types of learning –
• Supervised Learning – This is where the machine is fed labelled data and it is trained on the
labelled data. For example, Regression and Classification.
• Unsupervised Learning – This is where the data provided is not labelled and the machine finds
its own patterns in the data. It is mainly used for clustering data points based on their
properties. For example, K-means clustering.
• Reinforcement Learning – This is where the machine learns by interacting with the
environment. It gets positive or negative reinforcement based on its actions and learns
to maximize positive reinforcement. This is mainly used in robotics.

NOTE – There are cases where both supervised and unsupervised learning techniques are utilized. This
is called semi-supervised.

LINEAR REGRESSION
We have values of independent ($x$) and dependent ($y$) variables. We try to fit a line (best-fit line) to
minimize the error (deviation of a data point from the line) using the training data set. Then, we can
predict the value of $y$ based on the value of $x$ from the testing data set. The line equation will be –

$$Y = MX + B$$

$$M = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}$$

$$B = \bar{y} - M\bar{x}$$
For example, let us take the dataset below –

𝒙 𝒚 𝒙𝒚 𝒙𝟐
1 2 2 1
2 4 8 4
3 5 15 9
4 7 28 16
5 10 50 25
6 11 66 36

Here, we get –

$$\bar{x} = \frac{\sum x}{n} = \frac{21}{6} = 3.5$$

Similarly, $\bar{y} = 6.5$ ; $\overline{x^2} = 15.167$ ; $\overline{xy} = 28.167$

Therefore, we get –

$$M = \frac{28.167 - 22.75}{15.167 - 12.25} = 1.86$$

$$B = 6.5 - (1.86 \times 3.5) = -0.01$$

Therefore, we get the best fit line as –

$$Y = 1.86x - 0.01$$
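As a quick sanity check, here is a minimal Python sketch (not part of the original notes) that reproduces the closed-form slope and intercept for this dataset:

```python
# Closed-form simple linear regression for the example table above.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 7, 10, 11]

def mean(v):
    return sum(v) / len(v)

x_bar, y_bar = mean(x), mean(y)
xy_bar = mean([a * b for a, b in zip(x, y)])
x2_bar = mean([a * a for a in x])

M = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)  # slope ≈ 1.857 (notes round to 1.86)
B = y_bar - M * x_bar                                 # intercept ≈ 0 (the -0.01 above is rounding)
print(M, B)
```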
Now that we have the line, it will not pass through all the points. Hence, there will be an error. We can
define 3 types of errors –

$$\text{Absolute error} = \frac{1}{n}\sum_{i=1}^{n} |Y(x_i) - y_i|$$

$$\text{Mean Square Error (MSE)} = \frac{1}{n}\sum_{i=1}^{n} (Y(x_i) - y_i)^2$$

$$\text{Root MSE (RMSE)} = \sqrt{MSE}$$

Now, we also define the Cost Function of the model as follows –

$$J = \frac{1}{n}\sum_{i=1}^{n} (Y(x_i) - y_i)^2$$

Now that we have the cost function, we can see that it changes with the equation of the line. More
specifically, if we change the slope of the line, then the errors will change and hence the cost function
will change. So, we plot a graph with the slope on the x-axis and the corresponding cost function on
the y-axis, which gives a U-shaped curve.

This curve is the one gradient descent walks down. The slope value at the minimum of the curve gives
the best-fit line.
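To make this concrete, here is a small hedged sketch of gradient descent on the cost function above; the learning rate and iteration count are illustrative choices, not values from the notes:

```python
# Gradient descent for J = (1/n) * sum((M*x_i + B - y_i)^2) on the same data.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 7, 10, 11]
n = len(x)

M, B, lr = 0.0, 0.0, 0.01
for _ in range(20000):
    # Partial derivatives of J with respect to M and B
    dM = (2 / n) * sum((M * xi + B - yi) * xi for xi, yi in zip(x, y))
    dB = (2 / n) * sum((M * xi + B - yi) for xi, yi in zip(x, y))
    M -= lr * dM
    B -= lr * dB

print(M, B)  # converges towards the closed-form M ≈ 1.857, B ≈ 0
```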

MULTIPLE LINEAR REGRESSION


Till now, the LR models are using a single independent variable to predict the dependent variable. If
we use multiple variables instead, then it is called Multiple Linear Regression. Here, the best fit line is
given by –
$$y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$$

Let us take the case where we have 2 independent variables with the following values (readable off the
matrices below): $(x_1, x_2, y) = (1, 4, 1),\ (2, 5, 6),\ (3, 8, 8)$. Then, we create 2 matrices as follows –

$$X = \begin{bmatrix} 1 & 1 & 4 \\ 1 & 2 & 5 \\ 1 & 3 & 8 \end{bmatrix} \qquad Y = \begin{bmatrix} 1 \\ 6 \\ 8 \end{bmatrix}$$

(the first column of ones corresponds to the intercept $b$). Now, to get the slope matrix, we use the
following formula –

$$M = ((X^T X)^{-1} X^T) Y$$
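A short numpy sketch of this normal equation for the example above (numpy is an assumption; the notes do not prescribe a library):

```python
# Normal equation M = (XᵀX)⁻¹XᵀY; M comes out as [b, m1, m2].
import numpy as np

X = np.array([[1, 1, 4],
              [1, 2, 5],
              [1, 3, 8]], dtype=float)
Y = np.array([1, 6, 8], dtype=float)

M = np.linalg.inv(X.T @ X) @ X.T @ Y
print(M)
```

In practice `np.linalg.lstsq(X, Y, rcond=None)` is the numerically safer way to solve the same problem, but the explicit inverse mirrors the formula above.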

BIAS AND VARIANCE


Bias is defined as the difference between the expected (predicted) value and the actual value of the
dependent variable. Variance is the average spread of the predictions around their mean. Bias
and variance are usually used to characterize the training and testing behaviour of the model respectively.
Variance indicates how sensitive the model is to changes in the data.
If a model has high bias and low variance, then it means that the model is neither working well during
training nor is it able to adapt and follow the testing data points. Thus, this is the case of the
underfitting model.
On the other hand, if the model has low bias but high variance, then it means that the model is
following each point as closely as possible and during testing it doesn’t perform as well as expected.
This is called the overfitting model.
The best fit model among these will be the one which has low bias and low variance.

Multi-collinearity – When we have a multi-variable regression and the independent variables are highly
correlated, it is termed multi-collinearity. This causes problems in selecting variables, and a
good model needs to avoid this condition.

CORRELATION COEFFICIENT
Correlation shows the strength of a relationship between two variables and is expressed as the
correlation coefficient. The values of correlation coefficient can range from -1 to 1. If a variable
increases and the other variable increases at the same rate, then it is a perfect positive correlation
and is denoted by +1. On the other hand, if a variable decreases when the other increases (both at the
same rate), then it is called a perfect negative correlation and is denoted by -1. If there is no
correlation, then the value of the coefficient will be 0.
At this point, we can also define the Pearson's Correlation Coefficient ($r$) as follows –

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}} = \frac{Cov(x, y)}{\sqrt{Var(x) \cdot Var(y)}}$$

Where $n$ is the number of data points.

REGULARIZATION
It is the technique used to overcome the problem of overfitting and feature selection. There are three
main types of regularizations –
• LASSO (L1) regularization
• Ridge (L2) regularization
• Elastic Net (L1+L2) regularization

Least Absolute Shrinkage & Selection Operator (LASSO) Regularization


This is the technique used to perform feature selection. It adds an additional absolute penalty to the
cost function as shown below –

$$CF = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} |Slope_i|$$

Where,
𝑛 – Number of test data points
𝑚 – Number of features/independent variables
𝜆 – Hyperparameter

Ridge Regularization
In this, we add the squared magnitude of the coefficients to the cost function. The cost function is given
by –

$$CF = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} (Slope_i)^2$$

This regularization reduces the overfitting problem.

Elastic Net Regularization


This combines both LASSO and Ridge regularization, which reduces overfitting and helps in feature
selection. The cost function becomes –

$$CF = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda_1 \sum_{i=1}^{m} |Slope_i| + \lambda_2 \sum_{i=1}^{m} (Slope_i)^2$$

(in general the two penalties carry separate hyperparameters $\lambda_1$ and $\lambda_2$).

NOTE – These techniques are also referred to as LASSO Regression, Ridge Regression and Elastic Net Regression respectively.
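For reference, a hedged scikit-learn sketch of the three penalties (scikit-learn and the synthetic data are assumptions, not from the notes); its `alpha` parameter plays the role of $\lambda$:

```python
# Fit the same data with all three penalties and compare coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

for model in (Lasso(alpha=0.1), Ridge(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# LASSO typically drives the irrelevant coefficients to exactly 0
# (feature selection); Ridge only shrinks them towards 0.
```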

LOGISTIC REGRESSION
This is a classification algorithm. It is based on probability and performs classification of the data items.
Logistic regression can be of 2 types –
• Binary (2 classes)
• Multi-variable (more than 2 classes)
Logistic regression uses probability, hence the output values will be between 0 and 1. Also, it uses the
Sigmoid Function as the activation function. The sigmoid function is given below –

$$Y(x) = \frac{1}{1 + e^{-x}}$$

For Logistic regression, we use the following version of the sigmoid formula –

$$Y(x) = \frac{1}{1 + e^{-(mx + b)}}$$
Let us take an example. Suppose we have a relation between the number of hours studied and the
passing or failing of a class –
HOURS STUDIED PASS/FAIL
29 Fail
15 Fail
33 Pass
28 Pass
39 Pass

In this case, let us assign Pass and Fail as 1 and 0 respectively. Also, it is given that the slope and
intercept are 2 and −64 respectively. Now, the sigmoid function becomes –

$$Y(x) = \frac{1}{1 + e^{64 - 2x}}$$

Now, suppose we are asked for the probability that a person will pass given they have studied
for 33 hours. In that case, we have –

$$Y(33) = \frac{1}{1 + e^{-2}} = 0.88$$

Similarly, if we want to find out how many hours one has to study to have a 95% chance of passing, we
get –

$$0.95 = \frac{1}{1 + e^{64 - 2x}} \implies x = 33.47$$
The cost function of Logistic Regression (the log loss) looks as follows –

$$F = -\frac{1}{m}\sum_{i=1}^{m} \left[ y_i \log(\hat{y}(x_i)) + (1 - y_i)\log(1 - \hat{y}(x_i)) \right]$$
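A small Python sketch reproducing the worked example above; the slope $m = 2$ and intercept $b = -64$ are the values given in the notes:

```python
import math

def sigmoid(x, m=2.0, b=-64.0):
    return 1.0 / (1.0 + math.exp(-(m * x + b)))

print(round(sigmoid(33), 2))          # ≈ 0.88, probability of passing

# Inverting for a 95% pass probability: 0.95 = 1/(1 + e^{-(2x - 64)})
p = 0.95
x = (math.log(p / (1 - p)) + 64) / 2  # ≈ 33.47 hours
print(round(x, 2))
```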

K – NEAREST NEIGHBOURS (K-NN)


This is an algorithm that can be used for both – classification and regression. Let us take the case of
classification first.

K-NN Classification
In this case, let us assume we have a bunch of training points given and they have been divided into 2
classes (binary classification) – class 1 and class 2. Now, a new point has come. We need to determine
which class it belongs to. To do so, we follow the steps mentioned below –
• Find the Euclidean distance from the point to all the other points.
• Find the K nearest points. (K will be mentioned in the question)
• Now, out of the K nearest neighbours, if class 1 points > class 2 points, then the new point belongs
to class 1. Else, it belongs to class 2.
For example, take the points as shown below –

Now, the question is to find the class of a point (𝟔. 𝟓, 𝟏𝟒). To do this, we first find the distances
between the points –
𝒙𝟏 𝒙𝟐 𝑫𝒊𝒔𝒕𝒂𝒏𝒄𝒆 𝑪𝒍𝒂𝒔𝒔
8 16 2.5 0
6 17 3.041 0
7 15 1.118 1
9 18 4.717 1

Assume that 𝑲 = 𝟑. Then, if we take the points, we can see that 2 belong to class 0 and 1 to class 1.
Hence, the new point belongs to class 0.

K-NN Regression
The process is mostly the same. We need to find the K nearest neighbors using distance and get their
outputs. However, instead of taking the majority of the outputs, we take the average of the outputs.
In our previous case, the outputs were {0, 0, 1}. Then, the output of the current point would be 𝟏. 𝟔𝟕.
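A minimal K-NN sketch for the worked example (classifying (6.5, 14) with K = 3), plus the regression average:

```python
import math

points = [((8, 16), 0), ((6, 17), 0), ((7, 15), 1), ((9, 18), 1)]
query, K = (6.5, 14), 3

# Sort all points by Euclidean distance to the query, keep the K nearest
nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:K]
labels = [label for _, label in nearest]

print(max(set(labels), key=labels.count))   # classification -> 0 (majority vote)
print(sum(labels) / K)                      # regression -> 0.33 (average)
```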

CONFUSION MATRIX
This is a matrix used to calculate the accuracy and precision of the model. Basically, we plot a matrix
between the predicted and actual values as follows –

                      Actual Positive     Actual Negative
Predicted Positive    TP (True Pos.)      FP (False Pos.)
Predicted Negative    FN (False Neg.)     TN (True Neg.)

Here, we define the values as –

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$
Accuracy is the amount of true predictions out of the total number of predictions. Precision is the
number of true positive predictions out of the entire stock of positive predictions. Recall is the number
of true positives captured out of the total number of actual positive cases.

NOTE – FP and FN are referred to as Type 1 and Type 2 errors

F1 Score
When we try to improve the precision of a model, the recall goes down and vice-versa (general trend).
This trend is captured via the F1 score –
$$F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = HM(Precision, Recall)$$
F-beta Score
The F1 score gives a relation between the two params, but it doesn't indicate which param is being
given more preference. Hence, an additional weight of $\beta$ is added to the above formula –

$$F_\beta = \left(1 + \frac{1}{\beta^2}\right) \cdot \frac{1}{\frac{1}{\beta^2 \cdot Precision} + \frac{1}{Recall}}$$

(equivalently, $F_\beta = (1+\beta^2)\,\frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$). When $\beta = 1$, equal weightage is given to precision
and recall, making the F-$\beta$ score the same as the F1 score.
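A quick sketch computing these metrics from raw confusion-matrix counts; the numbers here are made up for illustration:

```python
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 / (1 / precision + 1 / recall)   # harmonic mean of P and R

def f_beta(beta):
    # (1 + beta^2) * P * R / (beta^2 * P + R), equivalent to the form above
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, precision, recall, f1, f_beta(2))  # beta > 1 favours recall
```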

NAÏVE BAYES CLASSIFIER


It is one of the fastest and most efficient classification algorithms. It depends on the Bayes theorem
and it assumes that each feature is independent of each other. Let us assume we have 𝑛 independent
variables {𝑥1 , 𝑥2 , … , 𝑥𝑛 } and 𝑦 is the dependent variable. Then, we can write –
$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(x_1|y) \cdot P(x_2|y) \cdot P(x_3|y) \cdots P(x_n|y) \cdot P(y)}{P(x_1) \cdot P(x_2) \cdot P(x_3) \cdots P(x_n)}$$
Let us take an example –

Find the probability that the person will play tennis when the outlook is Sunny, the temperature is
cool, humidity is high and the wind is strong. Now for this case, we can write –
$$P(Yes \mid Sunny, Cool, High, Strong) = \frac{P(Sunny|Yes) \cdot P(Cool|Yes) \cdot P(High|Yes) \cdot P(Strong|Yes) \cdot P(Yes)}{P(Sunny) \cdot P(Cool) \cdot P(High) \cdot P(Strong)}$$

$$P(Yes \mid Sunny, Cool, High, Strong) = 0.25$$

Similarly, we can write –

$$P(No \mid Sunny, Cool, High, Strong) = 0.627$$

Now, we can finally write –

$$P(Playing\ tennis) = \frac{0.25}{0.25 + 0.627} = 0.285$$

SUPPORT VECTOR MACHINE (SVM)


This can be used for both regression and classification problems.

SVM Classification
SVM basically creates a separator line, also called a hyperplane, between the two groups of data points
to differentiate them. The best hyperplane is the one in the middle of the two sets, for which the
margin between the hyperplane and the nearest points is maximum.

For the hyperplane, we have a general equation –

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

$$Y = \beta^T X \quad (\text{in matrix form})$$

So, for a 1-D hyperplane (aka a line), we have $n = 1$. Thus, we get –

$$y = \beta_0 + \beta_1 x$$
Which is the equation of a straight line. Now, of course, there is a chance that some points will get
misclassified by the SVM model. If the model allows for misclassification, then it is called a soft
margin model. We can also define the degree of tolerance as the amount of misclassification the
model can allow. To define this, we add a penalty term C – the higher the value, the more penalty is
experienced for misclassification. The cost function of the SVM is written as –

$$CF = \frac{\|\beta\|^2}{2} + C \sum_{i=1}^{n} \zeta_i$$

Here, $C$ is the penalty coefficient and $\zeta_i$ is the distance of the $i$-th misclassified point from its
margin. Our job is to minimize the CF. Now, let us understand this with a problem. Assume the points
below –
In the figure, we assume that the blue points belong to class −1 and the red points to class 1. From
intuition, we know that the hyperplane should be at $x_1 = 2$. Thus, we choose the 3 points nearest to the
line – $(1,0), (3,-1), (3,1)$ – and augment each with a bias entry of 1, giving the support vectors
$s_1 = (1,0,1)$, $s_2 = (3,-1,1)$, $s_3 = (3,1,1)$. We write the general equation as –

$$a_1 (s_1 \cdot s_j) + a_2 (s_2 \cdot s_j) + a_3 (s_3 \cdot s_j) = class(s_j)$$

Now, let's take the case of point $(1,0)$, i.e. $s_j = s_1$. We get –

$$2a_1 + 4a_2 + 4a_3 = -1$$

Similarly, we write the same for the other two points and get –

$$4a_1 + 11a_2 + 9a_3 = 1$$
$$4a_1 + 9a_2 + 11a_3 = 1$$

Solving the three equations, we get –

$$a_1 = -3.5 \; ; \; a_2 = 0.75 \; ; \; a_3 = 0.75$$

Now, we can finally find the hyperplane as –

$$w = a_1 s_1 + a_2 s_2 + a_3 s_3 = \begin{bmatrix} 1 \\ 0 \\ -2 \end{bmatrix}$$

Hence, we get the hyperplane $x_1 - 2 = 0$, i.e. the vertical line $x_1 = 2$.
ID3 DECISION TREES
These are also used for regression and classification and are among the most widely used tools for both.
A tree structure is created wherein the internal nodes denote tests on attributes, the edges
represent the outcomes and the leaf nodes represent the classes. In short, each internal node is
an attribute, each edge represents a decision/rule and the leaf nodes are the outcomes.
The factor used to decide how the decision tree needs to be split is called Entropy. Entropy is the
measure of impurity or uncertainty and helps to divide the decision tree. It is given as –

$$S(y) = -\sum p(X) \log(p(X))$$

In addition to Entropy, we also define another term called Information Gain as follows –

$$IG(attribute) = S(y) - \left[\text{Weighted avg.} \times \text{Entropy(each feature value)}\right]$$
The attribute with the highest gain will be used to split the tree. Let us take an example –

For this example –

$$S(Play\ tennis) = -\left[\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right] = -(-0.41 - 0.53) = 0.94$$

$$S(Outlook = Sunny) = -\left[\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right] = -(-0.53 - 0.44) = 0.97$$

Similarly, we get –

$$S(Outlook = Overcast) = 0$$
$$S(Outlook = Rain) = 0.97$$

Now, we can find the gain of the Outlook attribute as follows –

$$IG(Outlook) = 0.94 - \left[\frac{5}{14} \times 0.97 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.97\right] = 0.25$$
Similarly, we can calculate the gain for the other attributes and then we can create the tree –
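Assuming the standard play-tennis counts used above (9 Yes / 5 No overall; Outlook splits Sunny 2/3, Overcast 4/0, Rain 3/2), a small sketch of the entropy and information-gain calculation:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

S_root = entropy([9, 5])                        # ≈ 0.94
# Outlook = Sunny (2 Yes / 3 No), Overcast (4/0), Rain (3/2)
subsets = [([2, 3], 5), ([4, 0], 4), ([3, 2], 5)]
weighted = sum(n / 14 * entropy(c) for c, n in subsets)
print(round(S_root - weighted, 2))              # IG(Outlook) ≈ 0.25
```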

CART DECISION TREE


This is the same as the ID3 decision tree, but we use the Gini Index to determine which attribute to split
on. The Gini Index is used to identify the impurity of the attributes. It is defined as –

$$G(x) = 1 - \sum_{i=1}^{n} (p_i)^2$$

Let us take the case of Outlook. We get –

$$G(Outlook = Sunny) = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 0.48$$

$$G(Outlook = Rainy) = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 0.48$$

$$G(Outlook = Overcast) = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0$$

$$G(Outlook) = \left(\frac{5}{14} \times 0.48\right) + \left(\frac{5}{14} \times 0.48\right) + \left(\frac{4}{14} \times 0\right) = 0.343$$
Similarly, we can calculate the Gini Index of the other attributes. After this, we split based on the
attribute with the minimum Gini Index.

DECISION TREE REGRESSION


Here, we use two parameters –
$$\text{Standard Deviation } (S) = \sqrt{\frac{\sum(x - \bar{x})^2}{n}}$$

$$\text{Coefficient of Variation } (CV) = \frac{S}{\bar{x}} \times 100\%$$
SD is used to split the nodes and CV acts as a hyper-parameter that is used to stop the splitting process.
The entire process with example is given here - https://saedsayad.com/decision_tree_reg.htm
When the hyperparameter is too high, there will be minimal splitting of the tree which thus results in
underfitting. On the other hand, if the hyperparameter is too low, then the tree will keep splitting for
a high depth and this results in overfitting. We can solve this by applying a depth limit to the tree. This
is called pruning. If the depth limit is set before the tree construction, it is called pre-pruning while if
the depth limit is applied after the tree construction, then it is called post-pruning.

BIAS-VARIANCE TRADEOFF
For underfitting, we have high bias and low variance. For overfitting, we have high variance and low
bias. Thus, to have a functioning model, we need to have bias – variance tradeoff.

CROSS VALIDATION
Normally, we have training and testing datasets. However, for better model fitting, we take a part of
the training dataset and use it as a validation dataset. This helps us validate the working of the model
before we use it on the testing dataset.
Cross validation is an extension of this technique where machine learning models are trained on
subsets of the available input data.

Leave – P out Cross Validation


In this case, out of the 𝑁 total training data records 𝑃 data records are taken as validation set and the
remaining 𝑁 − 𝑃 data records are taken for training. However, this process is highly exhaustive since
it has to perform the cross validation for each combination of 𝑃 data records. It is very time intensive
as well.
A more time-efficient version of this is Leave-one-out cross validation, where we keep only a single
record for validation. Thus, the number of iterations will be equal to N.
Hold out Cross Validation
This is a much simpler approach. In this case, we are not leaving out 𝑃 records, but rather we randomly
take a chunk of training data and assign it as validation data.

K – Fold Cross Validation


In this case, we create 𝐾 groups of 𝑁/𝐾 data points. Then, we take each block as validation dataset
and then perform the validation process.

In the above case, K = 5. We can observe that we need to perform K iterations for cross
validation. Now, let us assume we have N = 1000 and, out of these data points, 990 belong to the
1st class and 10 to the 2nd class. In this case, there is an imbalance of data and hence it is likely that
the validation and training sets will mostly contain points of the 1st class. Hence, the model can overfit
when the data is imbalanced, though the overfitting will be much less than with the Leave-P-out
technique.

Stratified K – fold cross validation


In this case, we create the folds in such a way that each fold preserves the class proportions of the
full dataset. So if we have N = 1000 ; C1 = 100 ; C2 = 900 ; k = 10, we create validation sets of size
100, each containing 10 points from C1 and 90 points from C2. This keeps every fold representative of
the data and overcomes the drawback of standard K-fold cross validation.
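A hedged scikit-learn sketch (scikit-learn and the toy data are assumptions): `StratifiedKFold` keeps the class ratio in every fold, unlike plain `KFold`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 900)        # imbalanced: 100 vs 900

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # each validation fold has ~10 points of class 0 and ~90 of class 1
    print(np.bincount(y[val_idx]))
    break
```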

ARTIFICIAL NEURAL NETWORK (ANN)


This is a model that tries to emulate animal neural networks – wherein we have a central "brain"
and a bunch of "organs" interconnected with "neurons". In ANN, the role of neurons is mimicked by the
Perceptron. The perceptron model is given as follows –

$$y = \sum_{i=1}^{n} w_i x_i + Bias$$

There are 4 steps to building an ANN model –


1. First, we apply the predictor function.
2. Then we pass the output through the activation function to get the model output
3. Then, we calculate the error function and find the error associated with the current model
4. Using the errors, we change the weights to reduce the error.
The formula to update the weights is given as –

$$(w_i)_{new} = (w_i)_{old} + \alpha\, e\, x_i$$

Where $\alpha$ is the learning factor and $e$ is the error. Let us assume a single-perceptron model with
weights $w_1 = 1.2$, $w_2 = 0.6$ and bias 0, as follows –

We need to use this model to implement an AND gate. To do this, we first need to calculate the
predictor function –
𝑍 = 1.2 ∗ 𝑋1 + 0.6 ∗ 𝑋2 + 0
Thus, we get –
𝑍(0,0) = 0
𝑍(0,1) = 0.6
𝑍(1,0) = 1.2
𝑍(1,1) = 1.8
Now, we pass it to the activation function and get –
𝑌(0,0) = 0
𝑌(0,1) = 0
𝑌(1,0) = 1
𝑌(1,1) = 1
We can see that for input (1,0), the output is 1 when the correct output is 0. Hence, we need
to correct this error and adjust the weights. Taking the learning factor as 0.5, we get –

$$W_1^{new} = 1.2 + 0.5 \times (0 - 1) \times 1 = 0.7$$
$$W_2^{new} = 0.6 + 0.5 \times (0 - 1) \times 0 = 0.6$$
Now, we re-train the model again –
𝑍(0,0) = 0
𝑍(0,1) = 0.6
𝑍(1,0) = 0.7
𝑍(1,1) = 1.3
Now, we pass it to the activation function again to get –
𝑌(0,0) = 0
𝑌(0,1) = 0
𝑌(1,0) = 0
𝑌(1,1) = 1
We can see that the model is now functioning effectively as an AND gate.
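A sketch of the same perceptron update rule in code, starting from the weights given above; the threshold of 1 is inferred from the activation outputs in the example (an assumption, since the notes do not state it explicitly):

```python
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND gate
w, bias, alpha, threshold = [1.2, 0.6], 0.0, 0.5, 1.0        # threshold assumed

for _ in range(10):                      # a few epochs are enough here
    for (x1, x2), target in data:
        z = w[0] * x1 + w[1] * x2 + bias
        y = 1 if z >= threshold else 0   # threshold activation
        e = target - y
        w[0] += alpha * e * x1           # (w_i)new = (w_i)old + α·e·x_i
        w[1] += alpha * e * x2

print(w)  # [0.7, 0.6] after the single correction described above
```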

NOTE – The activation function for classification (as seen in the above case) usually marks the values
above a certain threshold as 1 and below it as 0. Hence, it is often called the Threshold Function.
We can also have different types of activation functions –
TYPE OF FUNCTION     REPRESENTATION
Linear/Identity      $F(x) = x$
Binary Step          $F(x) = \begin{cases} 1, & x \geq k \\ 0, & x < k \end{cases}$
Bipolar Step         $F(x) = \begin{cases} 1, & x \geq k \\ -1, & x < k \end{cases}$
Ramp                 $F(x) = \begin{cases} 1, & x > k \\ 0, & -k \leq x \leq k \\ -1, & x < -k \end{cases}$
Binary Sigmoid       $F(x) = \dfrac{1}{1 + e^{-\lambda x}}$
Bipolar Sigmoid      $F(x) = \dfrac{2}{1 + e^{-\lambda x}} - 1$

MULTI – LAYER PERCEPTRON FEED – FORWARD NETWORK


This is a feed-forward neural network that has at least 1 hidden layer.
Questions on this are asked less often, since the calculation and weight adjustment at each layer is a long process.

CLUSTERING
This is a form of unsupervised learning that is used to group similar data points into separate clusters.
Basically, classification on unlabelled data is called clustering.

K-means Clustering
It is used to cluster the data points into K clusters. The step-by-step process is: pick K initial
centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its
cluster, and repeat until the clusters stop changing.
Let us take an example –


POINTS 𝒙 𝒚
P1 2 8
P2 3 5
P3 4 9
P4 3 6
P5 5 6
It is given in the question that 𝐾 = 2 and centroids are 𝐶1 = 𝑃1 and 𝐶2 = 𝑃3. With this information,
we find the distances between the points and the centroids –
POINTS 𝑫𝒊𝒔𝒕 𝒕𝒐 𝑪𝟏 𝑫𝒊𝒔𝒕 𝒕𝒐 𝑪𝟐
P1 0 2.236
P2 3.162 4.123
P3 2.236 0
P4 2.236 3.162
P5 3.606 3.162
Based on the distance, we get –
𝐶1 = {𝑃1, 𝑃2, 𝑃4}
𝐶2 = {𝑃3, 𝑃5}
Now, we find the centroids of 𝐶1 and 𝐶2 again –
$$C1_{centroid} = \left(\frac{2+3+3}{3}, \frac{8+5+6}{3}\right) = \left(\frac{8}{3}, \frac{19}{3}\right)$$

$$C2_{centroid} = \left(\frac{9}{2}, \frac{15}{2}\right) = (4.5, 7.5)$$
Now, we calculate the distances to the points again –
POINTS 𝑫𝒊𝒔𝒕 𝒕𝒐 𝑪𝟏 𝑫𝒊𝒔𝒕 𝒕𝒐 𝑪𝟐
P1 1.795 2.55
P2 1.374 2.916
P3 5.164 1.581
P4 0.471 2.121
P5 4.082 1.581
Based on these distances, we get –
𝐶1 = {𝑃1, 𝑃2, 𝑃4}
𝐶2 = {𝑃3, 𝑃5}
Since there has been no change in the clusters, this is our final cluster.

NOTE – K-means clustering is great for large datasets and is easy to implement. However, we need to
choose the optimal value of 𝐾 and centroids otherwise we may have a long process before we find the
clusters. At the same time, the model is sensitive to outliers.

K – MEDOIDS CLUSTERING
This follows the same concept as K-means clustering, but instead of centroids we use medoids. In
K-means, we calculated the centroids by simply taking the mean of the points in a cluster. Here, we
instead compute the total distance from each point of a cluster to the other points in it; the point with
the minimum total distance becomes the medoid, the representative against which we measure.
HIERARCHICAL CLUSTERING
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the
unlabelled datasets into a cluster and also known as hierarchical cluster analysis or HCA. In this
algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram. Hierarchical clustering is of two types –
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with
taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-down
approach.

Agglomerative Clustering

Let us take an example below


First, we find the distances between points –

P1 P2 P3 P4 P5 P6
P1 0 1.118 1 3.605 5 4.123
P2 1.118 0 1.118 3.905 5.315 4.031
P3 1 1.118 0 2.828 4.243 3.162
P4 3.604 3.905 2.828 0 1.414 1.414
P5 5 5.315 4.243 1.414 0 2
P6 4.123 4.031 3.162 1.414 2 0

We can see that the shortest distance is between P1 and P2. We cluster these points together. We then
perform the same process again while clustering more and more points till all points are clustered. The
process is defined below –

This is the dendrogram. Since we are moving in a bottom-up fashion, this is an agglomerative
clustering. The matrix we obtained is also called the Proximity matrix.

Divisive Hierarchical Clustering


Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In
divisive hierarchical clustering, all the data points start in one single cluster, and in every
iteration, the data points that are not similar are separated from the cluster. The separated data points
are treated as individual clusters. Finally, we are left with N clusters.

LINKAGES
In hierarchical clustering, we need to find the distances between clusters to find the next similar point
to be added to the cluster. But, a cluster has a bunch of points. How do we actually find the distance
then? Here, the concept of linkage comes into play. Some of the linkages methods are shown below –
Single Linkage
It takes the distance between clusters as the distance between the closest points of the clusters.
Complete Linkage
It takes the distance between the farthest points in the clusters.

Centroid Linkage
It takes the distance between the centroids of the clusters.

Average Linkage
Here, we take the sum of the distances between all pairs of data points in the two clusters. Then, we
divide it by the number of pairs to get the average distance between the clusters. Since this linkage
gives a more complete picture, it is usually preferred.

DIMENSIONALITY REDUCTION
This is the process of transforming the dataset from higher dimensions to lower dimensions for
better analysis and visualization. Basically, we extract the features that are necessary for the
model and drop the features that are not strongly related to the target. If we ignore those
features, it will not affect our model much.

Principal Component Analysis (PCA)


This is done on unsupervised learning models. The step-by-step process is given below –
1. First, we calculate the mean of the attributes.
2. Then, we calculate the covariance matrix.
3. After that, we calculate the eigen vectors and eigen values of the covariance matrix.
4. Then, we normalize the eigen vector matrix
5. Finally, we select the principal components.
Let us take an example –
𝒙𝟏 𝒙𝟐
4 11
8 4
13 5
7 14
First, we find the means.

$$\bar{x_1} = \frac{4 + 8 + 13 + 7}{4} = 8$$

$$\bar{x_2} = \frac{11 + 4 + 5 + 14}{4} = 8.5$$
Now, we find the covariance matrix. The matrix is given as –

$$S = \begin{bmatrix} Cov(x1,x1) & Cov(x1,x2) \\ Cov(x2,x1) & Cov(x2,x2) \end{bmatrix}$$

We have –

$$Cov(x1,x1) = \frac{1}{N-1}\sum_{i=1}^{N}(x1_i - \bar{x1})^2 = \frac{1}{3}(16 + 0 + 25 + 1) = 14$$

Similarly,

$$Cov(x1,x2) = Cov(x2,x1) = \frac{1}{3}(-10 + 0 - 17.5 - 5.5) = -11$$

$$Cov(x2,x2) = \frac{1}{3}(6.25 + 20.25 + 12.25 + 30.25) = 23$$

Thus, we get the covariance matrix as –

$$S = \begin{bmatrix} 14 & -11 \\ -11 & 23 \end{bmatrix}$$
Now, we calculate the eigenvalues and eigenvectors from the characteristic equation –

$$\det\begin{bmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{bmatrix} = 0$$

We get $\lambda_1 = 30.38$ and $\lambda_2 = 6.6$. For each eigenvalue –

$$\begin{bmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{bmatrix}\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Solving, we get –

$$\begin{bmatrix} V_1 \\ V_2 \end{bmatrix} = \begin{bmatrix} 11 \\ -16.38 \end{bmatrix} \ \text{or} \ \begin{bmatrix} 11 \\ 7.4 \end{bmatrix}$$

The first eigenvector corresponds to the larger eigenvalue ($\lambda_1 = 30.38$), so we take it as the
Principal Component. Now, we normalize it –

$$V_1 = \frac{11}{\sqrt{11^2 + (-16.38)^2}} = 0.5575$$

$$V_2 = \frac{-16.38}{\sqrt{11^2 + (-16.38)^2}} = -0.8297$$
Now that we have a principal component, we can find the transformed attribute values as follows –

$$X_k = e^T \begin{bmatrix} x1_k - \bar{x1} \\ x2_k - \bar{x2} \end{bmatrix}$$

$$X_1 = \begin{bmatrix} 0.5575 & -0.8297 \end{bmatrix}\begin{bmatrix} -4 \\ 2.5 \end{bmatrix} = -4.304$$

Thus, we get –

𝒙𝟏 𝒙𝟐 PC
4 11 -4.304
8 4 3.733
13 5 5.69
7 14 -5.12
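A numpy sketch of the same PCA steps (numpy is an assumption); `eigh` returns eigenvalues in ascending order, so the last eigenvector is the principal component:

```python
import numpy as np

X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]], dtype=float)
Xc = X - X.mean(axis=0)                 # centre the data (means 8, 8.5)

S = np.cov(Xc, rowvar=False)            # [[14, -11], [-11, 23]]
eigvals, eigvecs = np.linalg.eigh(S)
pc = eigvecs[:, -1]                     # eigenvector of λ ≈ 30.38

# projections ≈ [-4.30, 3.73, 5.69, -5.12] (sign of pc is arbitrary)
print(Xc @ pc)
```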

LINEAR DISCRIMINANT ANALYSIS (LDA)


This is the dimensionality reduction algorithm for supervised learning. The standard process is similar
to the PCA process –

https://www.geeksforgeeks.org/ml-linear-discriminant-analysis/
ARTIFICIAL INTELLIGENCE
These are a set of technologies that enable computers to perform a variety of advanced functions like
data gathering, analysis, prediction, recommendation etc. without human intervention. The ultimate
goal is to simulate human intelligence.

AGENTS

In AI, an agent is a computer program that is designed to perceive its environment, make decisions
and take actions to achieve a certain goal. Agents don’t need human intervention. They can be of many
types –

Simple Reflex Agent

These agents follow pre-determined rules to make decisions. They operate in the present – which
means they don’t use past data and are not concerned about the future outcomes. They work on
condition-action rules which are just glorified if-else statements.

Model – Based Reflex Agent

It works by finding a rule whose conditions match the current situation, which is done by keeping
track of an internal state. The internal state is adjusted by each percept, and this agent can handle
a partially observable world.
Goal – based Agents

These agents take decisions based on how far they are from their goal. Their aim is to reduce the
distance towards the goal at each and every step.

Utility – based agents

These agents use a utility function – a measure of how desirable each state is – as their building
block. When there are multiple possible alternatives, utility-based agents decide which one is best
by choosing actions based on a preference (utility) for each state.

Learning Agent

It is a type of agent that can learn from its past experiences and has a learning capability. It has mainly
4 components –

• Learning Element – It is responsible for making improvements by learning from the environment.


• Critic – This gives feedback on how well the agent is performing.
• Performance Element – It selects the external actions
• Problem Generator – It is responsible for suggesting actions that lead to new experiences.

TYPES OF AI

• Narrow AI – These are designed to perform specific tasks. Siri and Alexa come under this.
• General AI – These are designed to learn, adapt and improve to perform a wide range of tasks
that mimic human intelligence. There are no true General AI models yet, but Generative Pre-
trained Transformers (GPTs) aim to move towards that direction.
• Reactive machines – They operate based on pre-coded conditions and rules. They don’t have
the ability to learn or adapt.
• Limited Memory AI – It can learn from historic data and adapt, but the scope is limited.
• Theory of Mind AI – These are literally Ultron…understands emotions, takes decisions etc.

AGENT KNOWLEDGE REPRESENTATION

• Atomic State Rep – The state is considered an atomic entity that can't be decomposed further.
• Propositional State Rep – State is represented as a collection of propositional statements.
• Relational State Rep – Represents the states as a set of relations or predicates.
• First Order Rep – Represents the states using first-order logic.

SEARCHING IN AI

AI is built to achieve certain tasks and most of the agents accomplish these tasks using searching
algorithms. These search algorithms consist of 4 main components –

• State Space – The set of all possible states.


• Start State
• End State
• Decisions – The edges of the search tree.

Searching algorithms can be of various types –


Uninformed Search techniques are also called Blind Search techniques, mainly because these
techniques have no additional information apart from the start and goal states; they go in blind
when creating a plan. On the other hand, Informed Search techniques have information about the goal
state and use heuristic functions to determine how close they are to it. A heuristic can be
something like Euclidean distance, Manhattan distance etc.

To understand searching algorithms, we need to also understand the following terms –

• Completeness – If a search algorithm will always return a goal/solution whenever one exists, then it
is termed complete.
• Optimality – If the solution found by the algorithm is guaranteed to be the one with the lowest
cost, then it is called an optimal solution.

Depth First Search (DFS)

In this case, the algo starts at a root node and moves as far down as possible. Then, it backtracks to
reach the other nodes. Since it is LIFO, it is implemented using a stack. It is not optimal, and it is
complete only when the search space is finite.

Breadth First Search (BFS)

In this case, the algo starts with a root node, expands all the possible neighbours at the same level and
then moves on to the next level. This is implemented using a queue. It is complete, and optimal when
all edges have equal cost.

Uniform Cost Search (UCS)

In this case, each edge has a cost. The goal is to reach the goal node from the source node in the
minimum possible cost.

Depth Limited Search (DLS)

This is the same as DFS except there is a limit to the max depth the algorithm is allowed to go to. It
prevents any infinite loop situations.
Iterative Deepening DFS (IDDFS)

This is a combination of BFS and DFS. It performs a series of depth-limited DFS runs while increasing
the depth limit until the goal is reached. In short, it has the memory efficiency of DFS with the
optimality of BFS.

Greedy Search (Best First Search)

This is the first of the informed search algorithms where we use heuristic measure to find the shortest
path. Each node is assigned a heuristic value which indicates their distance from the goal node. The
lower the value, the closer the node is to the goal node and hence we should choose that node. For
example –

The source and goal nodes are 𝑆 and 𝐺 respectively. We start with 𝑆 and can move to either 𝐴 or 𝐷.
We choose 𝐷 as it has a lower heuristic value. We follow similar procedure throughout the graph and
we get the path as –

𝑆→𝐷→𝐸→𝐺
This works well in most cases, but has a tendency to devolve into DFS. Greedy algorithm is neither
complete nor optimal.

A* tree search

This combines UCS and Greedy search algorithm. Each edge has a weight and each node has a heuristic
value. The cost function is a sum of both –

𝑓(𝑥) = 𝑈𝐶𝑆 + 𝐺𝑟𝑒𝑒𝑑𝑦 = 𝑔(𝑥) + ℎ(𝑥)


Here, $g(x)$ is called the backward cost and $h(x)$ the forward cost. The defining property of A* search is
that the forward cost (heuristic) of a node never overestimates the actual cost of reaching the goal from
that node. This is called the admissibility property –

$$0 \leq h(x) \leq h^*(x)$$
Let us take an example as follows –
From 𝑆, we can go to either 𝐴 or 𝐷. We have 𝑓(𝐴) = 3 + 9 = 12 and 𝑓(𝐷) = 2 + 5 = 7. Thus, we
choose 𝐷 and move ahead in a similar manner. Here, the final path will be –

𝑆→𝐷→𝐵→𝐶→𝐺
𝑆→𝐷→𝐵→𝐸→𝐺

Consistent A* Graph Search

Unlike A* tree search, here we have a graph instead of a tree. So, just being admissible is not sufficient.
Here, we need the heuristic to be consistent, which holds if it satisfies the following equation –

$$h(n) \leq cost(n, n') + h(n')$$

Where $n'$ is the successor of $n$.

NOTE – A* algorithm is both complete and optimal.

Iterative Deepening A*

https://www.javatpoint.com/iterative-deepening-a-algorithm

Basically, we set a limit/bound to the cost of reaching the goal. If a path reaches the goal within that
limit, the limit gets updated to the new cost. However, if the path has not found goal node and the
cost has crossed the limit, then terminate that path and go for a new path. Don’t waste time and
resources.

Weighted A* Algorithm

In this case, the cost function is given as follows –

𝑓(𝑥) = 𝑔(𝑥) + 𝑤 ∗ ℎ(𝑥)


Here, we are multiplying the forward cost by a weight. This means that the number of node
expansions will reduce. If $w = 1$, then the algorithm becomes the regular A* algorithm. On the other
hand, $w = 0$ makes the algorithm the same as UCS. Another factor is admissibility. For an
algorithm to be admissible, we have –

$$h(n) \leq h^*(n)$$

However, now that we are multiplying by a weight factor, the admissibility condition need not be
satisfied –

$$w \cdot h(n) \not\leq h^*(n)$$

Due to inadmissibility, the solution may not be optimal.

LOCAL SEARCH

Local Search refers to a family of optimization algorithms that focus on the goal itself and not on the
path taken to reach it. Basically, here we choose a solution and then look at its neighbours to
see if any of them is more efficient. We stop when we find the most efficient solution. Since we
focus on a locality of neighbours, this approach is great for cases where the entire space is too large
to search completely.

1. Start with a general solution that can be either picked randomly or by heuristics.
2. Evaluate the quality of solution using some metric. It can be heuristics as well since we just
need how close to the goal it is.
3. Look for neighbours by making “moves” or minor changes to the current solution
4. If there is an improvement, then choose the neighbour.
5. Repeat steps 2-4 till we either reach the goal or a termination condition.

Hill Climbing

This is basically the standard local search algorithm. We start with a solution and then check its
neighbours. If a neighbour provides a better solution, then we move. If not, then we return the current
one as the solution. Now, this is simple enough, but it has a few problems.
Suppose the graph above plots the objective and our job is to find the maxima. We start at the
beginning and, if the neighbour reaches a higher point, we move. We continue like this till we reach a
crest. The neighbour would be at a lower point, so we declare the crest the solution. Here is the list of problems –

• What if the neighbour has the same value as the current node? Basically, what if we encounter
a shoulder? In this case, the standard hill climbing algorithm will return the current node and
call it a day. But we know that this is not the correct answer. So, we need to overcome this.
We can simply change the condition to say that if the neighbour is greater than or equal
to the current node value, then move. Now, the algorithm will move across flat surfaces as
well.
• Now, we reach the first local maxima. Here, the neighbours of the point of local maxima will
have lesser value than it. In short, the algorithm returns the local maxima as the solution when
a global maxima exists. We can solve this problem using two approaches –
o Random Walk – In this case, when we approach a local maxima, we randomly move
to one of the neighbours or sibling nodes and then start again. This may lead to finding
the global maxima.
o Random Restart – In this case, when we approach a local maxima, we randomly
restart the whole algorithm from another random node.
• Finally, in case of a flat area of the graph, there is a possibility of an infinite loop. This is because
all the points have the same output value and hence there can be re-tracing of neighbours. To
avoid this, we need to have a small array in the memory to store the nodes already visited.

If suppose we have a probability of 𝑃 to get a successful output in a run, then the number of expected
restarts required would be around 1/𝑃.

Local Beam Search

Local beam search represents a parallelized adaptation of hill climbing, designed specifically to
counteract the challenge of becoming ensnared in local optima. Instead of starting with a single initial
solution, local beam search begins with multiple solutions, maintaining a fixed number (the "beam
width") simultaneously. The algorithm explores the neighbours of all these solutions and selects the
best solutions among them. For example, let us take the following question –

Here, we are assuming a beam width of 2. So, we take the 2 best neighbours.

[𝑎, 𝑑, 𝑐]
Now, we take the node 𝑑. Thus, the neighbours are ℎ and 𝑖.

[𝑑, 𝑐, 𝑖, ℎ]

Out of 𝑐, 𝑖, ℎ, we know that 𝑖 is the lowest cost. Thus, we take 𝑖 next. Similarly, we continue and we get

𝑎→𝑑→𝑖→ℎ→𝑘→𝑔
If we had 𝑘 = 10, then the solution would have been –

𝑎→𝑑→𝑖→ℎ→𝑐→𝑔

GAME PLAYING

This is a concept in AI which is exactly what the name suggests – we are playing games. Basically, here
we are trying to achieve a result, but there are other agents against us who will unpredictably
change the state of the system to prevent us from reaching our goal. For example, chess, checkers etc.

Adversarial Search

It is a searching technique where we examine the problem which arises when we plan ahead and other
agents are planning against us. It is mainly used in game playing applications. Since adversarial
searches have two or more players with conflicting goals trying to explore the same sample space,
these are also termed as Games.

TYPES OF GAMES

• Perfect Info game – It is a game wherein the agent has all the information about the space
and the game. The entire board is visible to the agent.
• Imperfect Info game – It is a game wherein one agent has access to only their local space.
They are not aware of the rest of the board’s conditions and can’t see what the other agents
are doing.
• Deterministic Games – These are games which have an algorithm, a method and process of
play.
• Non-Deterministic Games – These are games where there is a chance of luck involved as well
and there is no deterministic way to know the next move.
• Zero-Sum Game – This is a case of pure competition. Here, the loss/gain of one agent is
perfectly balanced by gain/loss of the other agent.
To proceed to a solution, we need to formalize the problem –

• Initial State – The position in which the game begins


• Players – The number of players
• Ply – Each change made by a player is called a Ply.
• Move – In 1 move, first P1 makes a ply and then P2 makes a ply. Basically, 𝑛 plies make a move
if we have 𝑛 players.
• Terminal test – This defines if the game is over or not. Defines the final state.
• Utility (s,p) – It is the numeric value for a game that ends in terminal states 𝑠 for player 𝑝. For
example, if we win, lose or draw in chess, the utility values can be +1, -1 or 0 respectively.

MIN – MAX ALGORITHM

This is a recursive back-tracking algorithm that is used in game theory. We assume that there are two
players – MAX and MIN. The job of MAX is to choose a path to maximize utility value while MIN is to
choose a path that minimizes the utility value. Hence, these two agents are operating against each
other and they are assumed to operate in an optimal way. This algorithm performs a depth first search
of the complete game tree.

Let us take an example –

We can have two cases here – root node is MAX node or root node is MIN node. Let us take the case
of MAX node. Then we get –
Now, we know that the 2nd last layer is MAX. So it will choose the max of the two child nodes. And the
layer above that is MIN, which means that it will choose the minimum of the child nodes. Finally, the
graph will become –

Similarly, if we take the root node as MIN, then we get –

NOTE – These numbers at the nodes are basically the utility values.

ALPHA – BETA PRUNING

In this case, each node will have 2 values associated with it –

• Alpha – The max value we have found so far for the MAX node. It is initialized to −∞.
• Beta – The min value we have found so far for the MIN node. It is initialized to ∞

Now, for any node, if 𝛼 ≥ 𝛽, then we perform pruning, which means that for that node, we no longer
need to check any of the other branches for the node as we will get the same answer we already have.
This saves a lot more time when compared to the min-max method. Here is a detailed working of the
same –

https://www.javatpoint.com/ai-alpha-beta-pruning
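A compact sketch of min-max with alpha-beta pruning over a leaf-value tree; the tree shape and utility values here are made up for illustration:

```python
import math

def alphabeta(node, is_max, alpha=-math.inf, beta=math.inf):
    if isinstance(node, (int, float)):       # leaf: return its utility value
        return node
    best = -math.inf if is_max else math.inf
    for child in node:
        v = alphabeta(child, not is_max, alpha, beta)
        if is_max:
            best, alpha = max(best, v), max(alpha, v)
        else:
            best, beta = min(best, v), min(beta, v)
        if alpha >= beta:                    # prune the remaining siblings
            break
    return best

tree = [[3, 5], [6, [9, 1]], [1, 2]]         # nested lists = internal nodes
print(alphabeta(tree, is_max=True))          # 6; leaf 1 under [9, 1] and leaf 2 get pruned
```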
KNOWLEDGE REPRESENTATION AND REASONING

These are a set of techniques and methodologies used to represent data about the world in
such a way that an agent can use the data to perform its actions. We have three
representations as a part of KRR –

• Propositional Logic
• Predicate Logic
• Fuzzy Logic

Both Propositional and Predicate logic have been explored in GA-EM notes. Here, we will solve a couple
of problems.

Question

Which of the following propositions are equivalent?

1. 𝑃 ⋁ ~𝑄
2. ~(~𝑃 ∧ 𝑄)
3. (𝑃 ∧ 𝑄) ∨ (𝑃 ∧ ~𝑄) ∨ (~𝑃 ∧ ~𝑄)
4. (𝑃 ∧ 𝑄) ∨ (𝑃 ∧ ~𝑄) ∨ (~𝑃 ∧ 𝑄)

Answer

𝑷 𝑸 Prop 1 Prop 2 Prop 3 Prop 4


0 0 1 1 1 0
0 1 0 0 0 1
1 0 1 1 1 1
1 1 1 1 1 1

Hence, Propositions 1, 2 and 3 are equivalent.

Entails

This is an operator in Propositional Logic and is represented by ⊨ symbol. Basically, 𝑆1 ⊨ 𝑆2 means


that whenever 𝑆1 is True, then 𝑆2 will also be True. For example –

𝑆1 ∶ 𝑥 ≥ 10
𝑆2 ∶ 𝑥 ≥ 9
Now, if 𝑆1 is True, it means that the value of 𝑥 is greater than equal to 10. This also means that 𝑥 will
be greater than or equal to 9. Hence, if 𝑆1 is True, then 𝑆2 will always be True. Thus, 𝑺𝟏 ⊨ 𝑺𝟐.

Question
Answer

Option B

Question

Answer

Option C

Question

Answer

4 statements are implied

Question
Answer

Option B

Question

Answer

Option D

Question

Answer

Option A

Question

Answer

Option D
UNCERTAINTY

Uncertainty refers to the condition wherein we have a lack of information or ambiguity in a situation.
Noisy systems that cause uncertainty are one of the most common challenges in real-world AI
systems.

FUZZY LOGIC

The word Fuzzy means that things are not clear or are vague. Sometimes, it is not easy to find the
absolute true or false value of the problem. In such cases, this logic provides flexibility by providing a
range of values. It is used to represent uncertainty.

The steps for fuzzy logic is as follows –

First, we take a crisp input and pass it through the Fuzzification module. This module will transform
the input into fuzzy inputs. Then, it is passed to the inference engine where we apply the rules from
the Rules base. This will produce a fuzzy output. That fuzzy output will go through the Defuzzification
module and get converted back to crisp output.

Let us take a fuzzy membership function –

$$f(x) = \begin{cases} 0 & ; \; x < 20 \\ \dfrac{x - 20}{10} & ; \; 20 \leq x < 30 \\ 1 & ; \; x \geq 30 \end{cases}$$
With this function, we can write –

𝐹 = [(10,0), (20,0), (24,0.4), (26,0.6), (28,0.8), (30,1), (35,1)]


As we can see, the output values can be anywhere between 0 and 1.

SET OPERATIONS

The set operations are similar to the regular set operations, but with a slight twist.

OPERATION           SYMBOL          DEFINITION
Union               $A \cup B$      $\max(f_1(x), f_2(x))$
Intersection        $A \cap B$      $\min(f_1(x), f_2(x))$
Complement          $A'$            $f'(x) = 1 - f(x)$
Bold Union          $A \oplus B$    $\min(1, f_1(x) + f_2(x))$
Bold Intersection   $A \odot B$     $\max(0, f_1(x) - f_2(x))$

Let us take the examples of two fuzzy functions as follows –

𝐹 = [(10,0), (20,0), (24,0.4), (26,0.6), (28,0.8), (30,1), (35,1)]


𝐺 = [(10,0), (24,0.2), (28,1), (26,0.6), (35,1), (40,1)]
We get –

𝐹 ∪ 𝐺 = [(10,0), (20,0), (24,0.4), (26,0.6), (28,1), (30,1), (35,1), (40,1)]


𝐹 ∩ 𝐺 = [(10,0), (24,0.2), (26,0.6), (28,0.8), (35,1)]
𝐹′ = [(10,1), (20,1), (24,0.6), (26,0.4), (28,0.2), (30,0), (35,0)]
𝐹⨁𝐺 = [(10,0), (24,0.6), (26,1), (28,1), (35,1)]
𝐹⨀𝐺 = [(10,0), (24,0.2), (26,0), (28,0), (35,0)]
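A small sketch of these operations in code, applied to membership dictionaries like the F and G examples above (treating an element missing from a set as having membership 0 in it):

```python
F = {10: 0, 20: 0, 24: 0.4, 26: 0.6, 28: 0.8, 30: 1, 35: 1}
G = {10: 0, 24: 0.2, 26: 0.6, 28: 1, 35: 1, 40: 1}

def combine(a, b, op):
    # apply op pointwise over the union of both domains
    return {x: op(a.get(x, 0), b.get(x, 0)) for x in sorted(set(a) | set(b))}

union        = combine(F, G, max)
intersection = combine(F, G, min)
complement_F = {x: 1 - m for x, m in F.items()}
bold_union   = combine(F, G, lambda p, q: min(1, p + q))
bold_inter   = combine(F, G, lambda p, q: max(0, p - q))  # as defined in the table

print(union[28], intersection[24], complement_F[26])  # 1, 0.2, 0.4
```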

BAYESIAN NETWORK

A Bayesian network is basically a DAG where the nodes are random variables and the arcs/edges are
dependencies. For example, let us take a graph as follows –
In the graph, we can see that 𝐶 is dependent on 𝐴 and likewise for the other nodes. Thus, we write –

𝑁𝑜𝑑𝑒 𝐴 → 𝑃(𝐴)
𝑁𝑜𝑑𝑒 𝐵 → 𝑃(𝐵)
𝑁𝑜𝑑𝑒 𝐶 → 𝑃(𝐶|𝐴)
𝑁𝑜𝑑𝑒 𝐷 → 𝑃(𝐷|𝐴, 𝐵)
𝑁𝑜𝑑𝑒 𝐸 → 𝑃(𝐸|𝐷)
This helps us to understand the situation about various variables and their dependent probabilities.
The number of parameters needed to represent a node will be $2^n$, where $n$ is the number of parents
of the node. Hence, for the above graph, we have –

$$Min\ params = A + B + C + D + E = 1 + 1 + 2 + 4 + 2 = 10$$
Here is an example of a problem using Bayesian Network

https://www.youtube.com/watch?v=hEZjPZ-Ze0A https://www.youtube.com/watch?v=iz7Kl2gcmlk

EXACT INFERENCE USING VARIABLE ELIMINATION

As we can see from the previous example, there is a lot of calculation involved. To reduce the
calculations, we apply the exact inference using variable elimination. These are the steps involved –

1. First, we construct a factor for each conditional probability distribution.
2. Then we restrict the variables using the evidence.
3. Then, we eliminate each hidden variable in a fixed order, by multiplying the factors containing it and summing it out.
4. We then multiply the remaining factors.
5. Finally, we perform normalization.

These steps are hard to grasp in the abstract, so we need an example to understand each of them.

Restricting Variables

Let us assume the following variables –


X Y Z Prob
F F F 0.1
F F T 0.1
F T F 0.2
F T T 0.1
T F F 0.05
T F T 0.15
T T F 0.1
T T T 0.2

Now suppose we need to check only the conditions where X is True. In such cases, we can perform
variable restriction as follows –

X Y Z Prob
T F F 0.05
T F T 0.15
T T F 0.1
T T T 0.2

Sum – Out variables

Let us suppose we are not concerned with the Y variable at all and need to calculate probabilities with
respect to the X and Z variables. Then, we get –

X Z Prob
F F 0.3
F T 0.2
T F 0.15
T T 0.35
We have added the probabilities for the respective rows.

Multiplication Factor

Let us assume we have 2 tables now –

X Y Prob
F F 0.2
F T 0.3
T F 0.2
T T 0.3

Y Z Prob
F F 0.15
F T 0.25
T F 0.3
T T 0.3

Now, if we want to find 𝐹(𝑋 = 𝑇 ; 𝑌 = 𝐹 ; 𝑍 = 𝐹), then we can find –


𝐹(𝑋 = 𝑇 ; 𝑌 = 𝐹 ; 𝑍 = 𝐹) = 𝐹(𝑋 = 𝑇 ; 𝑌 = 𝐹) ∗ 𝐹(𝑌 = 𝐹 ; 𝑍 = 𝐹) = 𝟎. 𝟎𝟑

Normalization

Suppose, after calculations we have –

𝑃(𝑋) = 0.6 ; 𝑃(𝑌) = 0.2


We can see that these probabilities don't add up to 1. Hence, we normalize them as follows –

$$P_{norm}(X) = \frac{0.6}{0.6 + 0.2} = 0.75$$

$$P_{norm}(Y) = \frac{0.2}{0.6 + 0.2} = 0.25$$

Now that we have some idea about the process in each step, we can proceed with an example –

Using the above Bayesian network, find probability of B occurring given A has NOT occurred. Here, we
have –

• Query variable – B
• Evidence variable – A
• Hidden variables – E,W

We can write a simple expression as follows –

𝑃(𝐵 | ~𝐴) = 𝑃(𝐵) ∗ 𝑃(𝐸) ∗ 𝑃(~𝐴 | 𝐵, 𝐸) ∗ 𝑃(𝑊 | ~𝐴)


The general equation is quite intuitive. Since 𝐵, 𝐸 are independent variables, there is no dependency.
On the other hand, 𝐴 depends on both 𝐵, 𝐸 and 𝑊 depends on 𝐴. Since 𝐸, 𝑊 are hidden variables, we
can write the equation as –

$$P(B \mid \neg A) = P(B) \cdot \sum_{E} P(E) \cdot P(\neg A \mid B, E) \cdot \sum_{W} P(W \mid \neg A)$$

Let us take the case of 𝑃(𝑊 | ~𝐴) –


𝑾 𝑨 Probability
F F 0.6
F T 0.2
T F 0.4
T T 0.8

𝑃(𝑊 | ~𝐴) = 𝑃(𝑊 = 𝑇 ; 𝐴 = 𝐹) + 𝑃(𝑊 = 𝐹 ; 𝐴 = 𝐹) = 𝟏


Thus, we get –

$$P(B \mid \neg A) = P(B) \cdot \sum_{E} P(E) \cdot P(\neg A \mid B, E)$$

Now, we look at the further cases.

𝑨 𝑬 𝑩 Probability
F F F 0.9
F F T 0.3
F T F 0.8
F T T 0.2
T F F 0.1
T F T 0.7
T T F 0.2
T T T 0.8

𝑬 Probability
F 0.9
T 0.1

Since we are focused only on cases where A = F, we restrict the table to –

𝑨 𝑬 𝑩 Probability
F F F 0.9
F F T 0.3
F T F 0.8
F T T 0.2

Now, we can create a table for 𝑃(𝐸) ∗ 𝑃(~𝐴 | 𝐵, 𝐸) as follows –

𝑬 𝑩 Probability
F F 0.81
F T 0.27
T F 0.08
T T 0.02
Thus, we can finally draw the table for 𝐹 = ∑𝐸 𝑃(𝐸) ∗ 𝑃(~𝐴 | 𝐵, 𝐸).

𝑩 Probability
F 0.89
T 0.29
Now, we can draw the table for 𝑃(𝐵 | ~𝐴) = 𝑃(𝐵) ∗ 𝐹

𝑩 Probability
F 0.623
T 0.087
Therefore,

$$P(B \mid \neg A) = \frac{0.087}{0.087 + 0.623} = 0.122$$
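The same query can be answered by brute-force enumeration over the hidden variable, using the CPT numbers from the tables above and an assumed prior P(B = T) = 0.3 (the value implied by the final table):

```python
P_B = {True: 0.3, False: 0.7}          # prior on B (assumed, see above)
P_E = {True: 0.1, False: 0.9}
P_notA = {  # P(A = F | B, E), keyed by (B, E)
    (False, False): 0.9, (True, False): 0.3,
    (False, True): 0.8,  (True, True): 0.2,
}

score = {}
for b in (True, False):
    # unnormalized P(B = b, A = F), summing out E
    score[b] = P_B[b] * sum(P_E[e] * P_notA[(b, e)] for e in (True, False))

total = sum(score.values())
print(round(score[True] / total, 3))   # ≈ 0.122
```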

D – SEPARATION

This is a method to check for independence and reachability of nodes from each other. To check for
this, we need to define 2 conditions – active and inactive triplets. These are shown below –

In Case 1, if 𝐵 is an evidence/observed variable, then the path is inactive. Else, the path is active.

In Case 2, if 𝐴 is an observed node or the successor of 𝐴 is an observed node, then the path is active.
Else, it is inactive.

In Case 3, if 𝐴 is observed, then the path is inactive. Else, it is active.

Now, here is the step-by-step process to check for D-separation –

• First, we check all the paths between the two nodes whose dependency we need to check.
• Then, we form the triplets along each path and check their active/inactive status.
• A path is active only if all of its triplets are active. If even one path between the nodes is active,
the variables are dependent; if every path is inactive, they are independent.

Let us take an example –


Let us check the independence of 𝑉 and 𝑍. The path between 𝑉 and 𝑍 will be –

This is Case 1 (which is also called Causal chain btw) and the node 𝑇 is not observed. Hence, this path
is active and that means the variables are dependent.

Let us now check for independence of 𝑈 and 𝑉. The paths that exist between 𝑈 and 𝑉 are –

𝑈−𝑊−𝑉
𝑈−𝑊−𝑌−𝑋−𝑉
The triplets formed are –

Since none of the nodes are observed, the causal chains will all be active. Now, the other two triplets
are of Case 2. This implies that both the paths are inactive. Now, if a path has multiple triplets and
even one of these triplets is inactive, the entire path is inactive. Therefore, both the paths are inactive.
Hence, the variables 𝑈 and 𝑉 are independent.

APPROXIMATE INFERENCE USING SAMPLING

This is a much cheaper way of estimating probabilities compared to exact inference. Basically, we
draw samples of the possible outcomes and then estimate the probability of the required
condition from the sample counts. To improve the process, we can do the following –

• Reject samples that are not matching our requirement


GENERAL POINTS

As mentioned, a searching algo is considered complete if it can find the goal state whenever one exists.
If it always finds the lowest-cost path, then the algo is also considered optimal.

ALGO          COMPLETE                                   OPTIMAL                             TIME COMP.      SPACE COMP.
BFS           Yes                                        Yes                                 $O(b^d)$        $O(b^d)$
DFS           Yes                                        No                                  $O(b^d)$        $O(bd)$
DLS           Yes, if solution depth ≤ depth limit       No                                  $O(b^l)$        $O(bl)$
UCS           Yes                                        Yes                                 $O(b^{1+n})$    $O(b^{1+n})$
IDDFS         Yes                                        Yes                                 $O(b^d)$        $O(bd)$
Greedy Algo   No                                         No                                  $O(b^d)$        $O(b^d)$
A*            Yes, if branching factor is finite         Yes, if admissible and consistent   $O(b^d)$        $O(b^d)$
              and costs are constant
Min-Max       Yes                                        Yes                                 $O(b^d)$        $O(bd)$

• $d$ is the depth of the shallowest solution.
• $b$ is the branching factor (the number of successors of each state).
• $n$ is the number of steps in the solution.

RANDOM VARIABLES
It is a variable that assigns a real value to every element in the sample space. It can be either
discrete or continuous.

𝐸𝑥𝑝𝑒𝑐𝑡𝑎𝑡𝑖𝑜𝑛(𝑜𝑟 𝑀𝑒𝑎𝑛), 𝐸[𝑋] = ∑ 𝑋 𝑃(𝑋)

𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒, 𝜎 2 = 𝐸[𝑋 2 ] − (𝐸[𝑋])2

𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 = √𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒

CUMULATIVE DISTRIBUTION FUNCTION (CDF)


A CDF is given as –
𝐹(𝑥) = 𝑃(𝑋 ≤ 𝑥)
For every case, CDF satisfies the following properties –
• 0 ≤ 𝐹(𝑥) ≤ 1
• 𝐹(−∞) = 0
• 𝐹(∞) = 1
• 𝑃(𝑎 ≤ 𝑥 ≤ 𝑏) = 𝐹(𝑏) − 𝐹(𝑎)
• 𝑃(𝑥 > 𝑏) = 1 − 𝐹(𝑏)
PROBABILITY DENSITY FUNCTION (PDF)
We can define PDF as follows –

$$f(x) = \frac{d}{dx}[F(x)]$$

$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$

For every case, PDF satisfies the following properties –

• $f(x) \geq 0$ (note that a PDF may exceed 1, unlike a probability)
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$
• $P(a \leq x \leq b) = \int_{a}^{b} f(x)\,dx$

PROBABILITY DISTRIBUTIONS
DISTRIBUTION   PDF                                                                        MEAN               VARIANCE
Binomial       $C^n_x\, p^x q^{n-x}$                                                      $np$               $npq$
Poisson        $\dfrac{e^{-\lambda}\lambda^x}{x!}$                                        $\lambda$          $\lambda$
Uniform        $K$ (constant)                                                             $\dfrac{a+b}{2}$   $\dfrac{(a-b)^2}{12}$
Exponential    $a e^{-ax}$                                                                $\dfrac{1}{a}$     $\dfrac{1}{a^2}$
Normal         $\dfrac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-\frac{(x-\mu_x)^2}{2\sigma_x^2}}$  $\mu_x$            $\sigma_x^2$

CONDITIONAL PDF
Let us assume we need to calculate the probability that random variable $X \in [a,b]$ given that another
random variable $Y = y$. Then, we can represent this as follows –

$$P(X \in [a,b] \mid Y = y) = \int_a^b f_{X|Y=y}(x)\,dx$$

The integrand is the conditional PDF. To calculate the conditional PDF, we need the joint PDF, which is
represented as $f_{XY}(x,y)$. Given this, we can write –

$$f_{X|Y=y}(x) = \frac{f_{XY}(x,y)}{f_Y(y)} = \frac{f_{XY}(x,y)}{\int_{-\infty}^{\infty} f_{XY}(x,y)\,dx}$$

Question
Given the joint PDF, find the conditional PDF.
Answer
$$f_Y(y) = \begin{cases} \frac{1}{2} & \text{if } y \in [3,5] \\ 0 & \text{otherwise} \end{cases}$$

$$f_{X|Y=y}(x) = 2 f_{XY}(x,y)$$

CONFIDENCE INTERVAL
This is the interval in which we can expect the value of the parameter to lie. It is given as –

$$CI = \bar{x} \pm z\frac{s}{\sqrt{n}}$$

Where CI is the confidence interval, $\bar{x}$ is the sample mean, $z$ is the z-value for the chosen
confidence level, $s$ is the SD and $n$ is the sample size.

CENTRAL TENDENCIES
Data can either be grouped or un-grouped. We can conclude that –
NOTE
For a data collection, we can write –
𝑀𝑜𝑑𝑒 = 3 ∗ 𝑀𝑒𝑑𝑖𝑎𝑛 − 2 ∗ 𝑀𝑒𝑎𝑛
If the distribution is symmetric, then we can write –
𝑀𝑒𝑎𝑛 = 𝑀𝑒𝑑𝑖𝑎𝑛 = 𝑀𝑜𝑑𝑒
If the distribution is not symmetric, then the distribution is said to be skewed.
𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑆𝑘𝑒𝑤 → 𝑀𝑒𝑎𝑛 > 𝑀𝑒𝑑𝑖𝑎𝑛 > 𝑀𝑜𝑑𝑒
𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝑆𝑘𝑒𝑤 → 𝑀𝑒𝑎𝑛 < 𝑀𝑒𝑑𝑖𝑎𝑛 < 𝑀𝑜𝑑𝑒

STANDARD DEVIATION
We can define –

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$

We can use SD to also define –

$$\text{Skewness Coefficient} = \frac{Mean - Mode}{\sigma}$$

Question
Find all the central tendencies.
Answer
Here, the class midpoints are $x_i = \{3, 9, 15, 21, 27, 33, 39\}$ and (from the sums below) the
frequencies are $f_i = \{6, 11, 25, 35, 18, 12, 6\}$. Thus, we get –

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{(3 \times 6) + (9 \times 11) + (15 \times 25) + (21 \times 35) + (27 \times 18) + (33 \times 12) + (39 \times 6)}{6 + 11 + 25 + 35 + 18 + 12 + 6} = 20.73$$
For the Median, we need to find the Median Class – the class containing the middle element – which is
the 18–24 class. Thus, we can define –

$$L = 18 \; ; \; K = 6 \; ; \; f = 35$$

$$F = \sum f_i = 113 \; ; \; C = 6 + 11 + 25 = 42$$

Thus, we can write –

$$M = L + \left(\frac{\frac{F}{2} - C}{f}\right) K = 20.486$$

Now, we can calculate the Mode. To do so, we need to find the Modal Class, which is the class with
max frequency. This is also 18–24. Thus, we get –

$$l = 18 \; ; \; F = 35 \; ; \; F_1 = 18 \; ; \; F_{-1} = 25 \; ; \; K = 6$$

Thus, we can write –

$$Mode = l + \left(\frac{F - F_{-1}}{2F - F_{-1} - F_1}\right) K = 20.22$$

HYPOTHESIS TESTING

Hypothesis testing is basically a method to make decisions using experimental data. We can define two
types of hypotheses –

• Null Hypothesis – It is the general and default position. It is represented as 𝐻0


• Alternate Hypothesis – This is the hypothesis that is contrary to the Null hypothesis. It is
represented as 𝐻1

We then define the p-value. It is the probability of observing a result at least as extreme as the one
found, given that $H_0$ is true. If the p-value is lower than the predetermined significance level (also
called alpha or the threshold level), then we reject the null hypothesis.
Chi – Squared Test

T – test

NOTE – A set of vectors is said to be independent if the only solution to their homogeneous equation is
the trivial one, i.e. all zeros; there is no non-zero solution. For example, let us take the vectors
$(1,3,5)$, $(2,5,9)$ and $(-3,9,3)$ as the columns below. Now, we write the equation as follows –

$$\begin{bmatrix} 1 & 2 & -3 \\ 3 & 5 & 9 \\ 5 & 9 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

If we try, we can see that $x_1 = 11$ ; $x_2 = -6$ ; $x_3 = -1/3$ is a non-trivial solution. Hence, the vectors
are linearly dependent.
• For any two random variables 𝑋 and 𝑌 having probability density functions represented by 𝑓𝑥
and 𝑓𝑦 respectively. If their joint probability density function is represented by multiplication
of their individual probability density functions, then the random variables are necessarily
independent.
• Bagging and boosting are two techniques that are part of ensemble learning. Both involve
training multiple models on subsets of the given data. If the models are trained in parallel, it is
referred to as Bagging; if trained sequentially, it is referred to as Boosting. Bagging reduces
variance (overfitting) and Boosting reduces bias (underfitting).
• In SVM, the support vectors are the points in the training dataset that determine the optimal
hyperplane. If we remove non-support-vector points from the dataset, the optimal hyperplane
does not change; removing support vectors, however, can change it. The SVM hyperplane need
not be orthogonal to the support vectors.

TAYLOR AND MACLAURIN SERIES

For a function $f(x)$, we can write –

$$f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots \text{ to } \infty$$

When $a = 0$ (in the question, they would mention "around $x = 0$"), we get the Maclaurin Series –

$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \dots \text{ to } \infty$$

PYTHON 𝒊𝒕𝒆𝒓() FUNCTION

https://www.programiz.com/python-programming/methods/built-in/iter
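A quick illustration alongside the link above: `iter()` returns an iterator, which `next()` consumes one element at a time; the two-argument form calls a function repeatedly until a sentinel value appears.

```python
nums = iter([1, 2, 3])
print(next(nums))        # 1
print(next(nums))        # 2

import random
random.seed(1)
roll = iter(lambda: random.randint(1, 6), 6)  # stop once a 6 is rolled
print(list(roll))        # all rolls before the first 6
```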

ONE-HOT ENCODING IN ML

In ML, we usually have categorical data that are not integers or numbers, which makes such data tough
to work with. Thus, we perform one-hot encoding upon this categorical data to turn it into numeric
data as a part of data pre-processing.

For a category with $n$ distinct values, we need to add $n - 1$ dummy columns to perform (drop-first)
one-hot encoding.
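A hedged pandas sketch (pandas and the toy column are assumptions): `get_dummies` with `drop_first=True` produces the n − 1 dummy columns mentioned above, while `drop_first=False` gives all n.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["colour"], drop_first=True))
```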
