Notes Machine Learning

Uploaded by

arjun.jadhaw

What is Machine Learning?

Machine learning is a subset of artificial intelligence (AI) that involves the development of
algorithms and statistical models that enable computers to perform specific tasks without being
explicitly programmed. Instead of following predefined rules, machine learning systems learn
patterns and make decisions based on data. Here’s a detailed overview:
Key Concepts in Machine Learning
1. Training Data:
o Data used to train machine learning models. The quality and quantity of this data
significantly impact the model's performance.
2. Algorithms:
o Methods and techniques used to build machine learning models. Examples
include decision trees, support vector machines, and neural networks.
3. Model:
o A mathematical representation derived from training data used to make
predictions or decisions.
4. Features:
o Individual measurable properties or characteristics of the data used as input to the
model.
5. Labels:
o The target variable or output the model aims to predict (used in supervised
learning).
Types of Machine Learning
1. Supervised Learning:
o Definition: The model is trained on labeled data, where the input data and
corresponding output labels are provided.
o Examples: Classification (e.g., spam detection), Regression (e.g., predicting
house prices).
o Algorithms: Linear Regression, Logistic Regression, Decision Trees, Support
Vector Machines.
2. Unsupervised Learning:
o Definition: The model is trained on unlabeled data and must find patterns and
relationships within the data.
o Examples: Clustering (e.g., customer segmentation), Association (e.g., market
basket analysis).
o Algorithms: K-means, Hierarchical Clustering, Apriori Algorithm.
3. Semi-Supervised Learning:
o Definition: The model is trained on a combination of labeled and unlabeled data.
o Examples: Useful when acquiring labeled data is expensive or time-consuming.
o Algorithms: Semi-Supervised Support Vector Machines, Co-training.
4. Reinforcement Learning:
o Definition: The model learns by interacting with an environment and receiving
feedback in the form of rewards or penalties.
o Examples: Robotics, Game AI, Autonomous driving.
o Algorithms: Q-Learning, Deep Q-Networks, Policy Gradients.
Key Steps in Machine Learning
1. Data Collection:
o Gathering relevant data from various sources for training the model.
2. Data Preprocessing:
o Cleaning and transforming the data to make it suitable for analysis. This includes
handling missing values, normalizing data, and encoding categorical variables.
3. Feature Engineering:
o Creating new features or modifying existing ones to improve the model’s
performance.
4. Model Training:
o Using algorithms to learn patterns from the training data and build the model.
5. Model Evaluation:
o Assessing the model's performance using metrics such as accuracy, precision,
recall, and F1 score.
6. Model Deployment:
o Implementing the model in a real-world application where it can make predictions.
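
The evaluation metrics named in step 5 can be computed directly from the true and predicted labels. The following is a minimal sketch for binary classification; the function name classification_metrics and the sample labels are illustrative assumptions, not part of the original notes.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading when one class is rare, which is why precision, recall and F1 are usually reported alongside it.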

Designing a learning system
The formal definition of machine learning, as discussed in the
previous blogs of the Machine Learning series, is: “A computer
program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.”

One of the examples discussed was learning the game of checkers.
The parameters T, E, and P for this example are:
T -> Play the checkers game.
P -> Percentage of games won against the opponent.
E -> Playing practice games against itself.

Steps to design a learning system:

To get a successful learning system we need a proper design,
and to get a proper design we follow certain steps. Here,
designing a learning system is a five-step process. The steps are:
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
5. The Final Design

Let’s have a brief look at them.

1. Choosing the Training Experience

The type of training experience chosen has a considerable
impact on our algorithm. The characteristics of the training
data need to be similar to those of the data set as a whole.

To choose the right training experience for your algorithm,
consider these three attributes:

a) Type of Feedback: Check whether the training experience
provides direct or indirect feedback to the algorithm about
the choices made by the performance system.
With direct feedback, you get feedback on each choice
immediately. With indirect feedback, you get only a
sequence of moves and the final outcome of that sequence
of actions.

b) Degree: The degree of a training experience refers to the
extent to which the learner can control the sequence of
training.
For example, the learner might rely on constant feedback
about the moves played, or it might itself propose a
sequence of actions and only ask for help when in need.

c) Representation: The third crucial attribute is how well the
training experience represents the distribution of samples
across which performance will be tested.
This basically means that the more diverse and representative
the training experience is, the better the performance can get.
2. Choosing the target function:
The next design decision is to determine exactly what kind of
knowledge will be learned and how the performance
program will put it to use.

Let’s take the classic example of the checkers game to
understand this better. The program only needs to learn how to
select the best move from among the legal moves (the set of all
moves permitted in the current board state).

The choice of the target function is a key decision in
designing the entire system. The target function is V: B -> R.
This notation denotes that V maps any legal board state
from the set B to a real value.

The values of the target function in a checkers game are
assigned as follows:

1. V(b) = 100 if b is the final board state that is won.
2. V(b) = -100 if b is the final board state that is lost.
3. V(b) = 0 if b is the final board state that is drawn.
4. V(b) = V(b’) if b is not a final state, where b’ is the best final board
state that can be achieved starting from b and playing optimally until
the end of the game.

3. Choosing Representation for Target function:

Once the target function is chosen, we have to choose a
representation for it. Given a complete list of all permitted
moves, the learning program can pick the best one using a
target function represented in any of several forms, such
as linear equations, hierarchical graph representation,
tabular form, and so on.
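
As one illustration of the "linear equations" option, the target function can be approximated by a weighted sum of board features. The sketch below is only an assumption about what such a representation might look like: the four features (own pieces, opponent pieces, own kings, opponent kings) and the weight values are hypothetical and are not given in the notes.

```python
def v_hat(features, weights):
    """Linear approximation of V(b): w0 + w1*x1 + ... + wn*xn."""
    bias, feature_weights = weights[0], weights[1:]
    return bias + sum(w * x for w, x in zip(feature_weights, features))


# Hypothetical feature order: (own pieces, opponent pieces,
# own kings, opponent kings). The weights are made up for illustration.
weights = [0.0, 1.0, -1.0, 3.0, -3.0]

opening_board = (12, 12, 0, 0)
winning_board = (8, 5, 1, 0)

print(v_hat(opening_board, weights))  # balanced position
print(v_hat(winning_board, weights))  # position favouring the learner
```

Learning then reduces to choosing the weights, which is far cheaper than storing a separate value for every board state in a table.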

What are Perceptrons?

A neural network’s fundamental unit is a single-layer
perceptron. A perceptron is made up of input values, weights,
a bias, a weighted sum, and an activation function.
In this blog, we’ll have a look at what perceptrons are and
how we represent them.

The basis of an ANN system is a unit known as a perceptron.

The perceptron takes a real-valued vector as input and
calculates a linear combination of it. It outputs 1 if the
result exceeds some threshold and -1 otherwise.

The perceptron computes the output o(x1, …, xn) given the
inputs x1 through xn:

\[ o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases} \]

where each wi is a real-valued constant, or weight, that
specifies the contribution of input xi to the perceptron output.

The quantity (-w0) is a threshold that must be exceeded by
the weighted combination of inputs w1x1 + … + wnxn for the
perceptron to output a 1.

Imagine an extra constant input x0 = 1 to simplify notation,
enabling us to write the aforementioned inequality as

\[ \sum_{i=0}^{n} w_i x_i > 0 \]

or in vector form as

\[ \vec{w} \cdot \vec{x} > 0 \]

For convenience, we will sometimes write the perceptron
function as

\[ o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x}), \qquad \operatorname{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases} \]

Choosing values for the weights w0, …, wn is part of learning
a perceptron. As a result, the set of all potential real-valued
weight vectors is the space H of candidate hypotheses
examined in perceptron learning.
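
The perceptron output described above can be sketched in a few lines. This is a minimal illustration, not a training procedure: the weights below are hand-picked (an assumption made for this example) so that the unit behaves like a logical AND on inputs of 0 and 1.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    # Prepend the constant input x0 = 1 so that w0 acts as the
    # (negated) threshold term.
    extended = (1.0,) + tuple(inputs)
    weighted_sum = sum(w * x for w, x in zip(weights, extended))
    return 1 if weighted_sum > 0 else -1


# Hand-picked weights (w0, w1, w2): the threshold -w0 = 1.5 is only
# exceeded when both inputs are 1.
and_weights = (-1.5, 1.0, 1.0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(and_weights, (x1, x2)))
```

Each choice of weight vector is one hypothesis in the space H; learning a perceptron means searching that space for weights that classify the training examples correctly.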

Simple Linear Regression
Simple linear regression is a well-known statistical method
for obtaining a formula to predict values of one variable
from another variable when there is a linear relationship
between the two variables.

Linear regression models are used to predict one variable
from another, or to describe the relationship between them.

Simple linear regression fits a straight line, which is most
commonly written as
y = mx + c (or) y = a + bx

The simple linear regression form is usually written with betas:
y = b0 + b1x
• y is the dependent variable or response.
• x is the independent variable or predictor.
• b0 is the intercept and b1 is the slope of the line.
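
The coefficients b0 and b1 can be estimated with ordinary least squares. The sketch below assumes that estimator (the notes do not name one) and uses made-up data points that lie exactly on the line y = 1 + 2x.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Illustrative data generated from y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)

# Predict the response for a new predictor value.
prediction = b0 + b1 * 6
print(prediction)
```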

Multiple Linear Regression
Multiple linear regression, abbreviated as MLR and also
referred to simply as multiple regression, is a
statistical technique which uses several explanatory
variables to predict the outcome of a response variable
(dependent variable). MLR is used to express the linear
relationship between the independent (explanatory)
variables and the dependent (response) variable.

Multiple linear regression is described by the following
formula:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon \]

where, for i = 1, …, n observations,
yi = dependent variable
xi1, …, xip = explanatory variables
β0 = y-intercept (constant term)
β1, …, βp = slope coefficients for each explanatory variable
ϵ = the model’s error term (residual)
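
One common way to estimate the coefficients is to solve the normal equations (XᵀX)β = Xᵀy. The sketch below assumes that approach and uses a small hand-written Gauss-Jordan solver; the function names and the noise-free data (generated from y = 1 + 2·x1 + 3·x2) are illustrative assumptions.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_linear_regression(rows, ys):
    """Return [beta0, beta1, ..., betap] from the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Illustrative data: y = 1 + 2*x1 + 3*x2 with no error term.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
betas = fit_multiple_linear_regression(rows, ys)
print(betas)
```

With real, noisy data the recovered coefficients would only approximate the underlying ones, and the residuals would estimate the error term.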

Derivation of Back Propagation Rule
Backpropagation’s purpose is to find the partial derivatives
of the cost function C with respect to every weight w and
bias b in the network. It is a supervised learning algorithm
used for Multilayer Perceptrons (Artificial Neural Networks).

In this blog, we’ll have a look at the Backpropagation rule
and its derivation.

Once we have these partial derivatives, we update each weight
and bias in the network by subtracting the product of some
constant alpha (the learning rate) and the partial derivative
of the cost function with respect to that weight or bias.
This is the well-known gradient descent method.

The vector of partial derivatives (the gradient) points in the
direction of steepest ascent. As a result, we take a small
step in the opposite direction, the direction of steepest
descent, which leads us toward a local minimum of the cost
function.

What is Backpropagation, and how does it work?

Using a concept known as the delta rule or gradient descent,
the Backpropagation algorithm searches for the minimum value
of the error function in weight space. The weights that
minimize the error function are then regarded as the solution
to the learning problem.
Steps:
• We start by assigning a random value to ‘W’ and propagating
forward.
• We then find that there is an error. We propagate backward and
increase the value of ‘W’ to reduce the error.
• We then find that the error has increased, so we conclude that
we should not increase the value of ‘W’.
• As a result, we propagate backward once more and lower the value
of ‘W’.
• We now find that the error has decreased.
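
The update loop described in these steps can be sketched for the simplest possible case: a single linear unit with one weight. Everything specific here is an assumption made for illustration: the cost C = ½(target − output)², the input and target values, the learning rate, and a fixed (rather than random) starting weight so that the run is reproducible.

```python
def train_single_weight(x, target, w=0.5, alpha=0.1, steps=50):
    """Gradient descent on C = 0.5 * (target - w*x)**2 for one weight."""
    errors = []
    for _ in range(steps):
        output = w * x                        # forward pass
        errors.append(0.5 * (target - output) ** 2)
        gradient = -(target - output) * x     # dC/dw via the chain rule
        w -= alpha * gradient                 # step against the gradient
    return w, errors


# Illustrative problem: the ideal weight is target / x = 2.0.
w, errors = train_single_weight(x=2.0, target=4.0)
print(w)
print(errors[0], errors[-1])
```

The sign of the gradient tells the algorithm whether to raise or lower the weight, so it never has to guess a direction and then undo it. In a multilayer network, backpropagation applies the same chain rule layer by layer to obtain the gradient for every weight.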
Evaluating Hypotheses: Basics of Sampling Theory
Statistical methods are applied to estimate hypothesis
accuracy. In this blog, we’ll have a look at evaluating
hypotheses and the basics of sampling theory.

Let’s have a look at the terminology involved and what it
means.

Random Variable:
A random variable may be thought of as the name of a
probabilistic experiment. Its value is the outcome of the
experiment.

When we don’t know the outcome of an experiment for
certain, we describe it with a random variable.

The outcome of a coin flip is a good illustration of a random
variable. The outcomes of a random event are not always
equally likely to occur, as the following example shows.

If the random variable Y is the number of heads we get from
tossing two coins, then Y might be 0, 1, or 2. On a two-coin
toss, this means that we could get no heads, one head, or
two heads.

The two coins, on the other hand, can land in four different
patterns: TT, HT, TH, and HH. As a result, P(Y=0) = 1/4
because only one pattern gives no heads (two tails, TT).
Similarly, getting two heads (HH) has a 1/4 chance of
happening. There are two patterns in which exactly one head
appears, HT and TH, so P(Y=1) = 2/4 = 1/2.
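
The two-coin distribution can be checked by enumerating the four equally likely patterns and counting heads. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely patterns for two fair coins: HH, HT, TH, TT.
patterns = list(product("HT", repeat=2))

# Y = number of heads in each pattern.
head_counts = Counter(pattern.count("H") for pattern in patterns)

# P(Y = y) = (patterns with y heads) / (total patterns).
distribution = {
    y: Fraction(count, len(patterns)) for y, count in head_counts.items()
}

for y in sorted(distribution):
    print(f"P(Y={y}) = {distribution[y]}")
```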

Probability Distribution:
A probability distribution is a statistical function that
specifies all possible values of a random variable, and their
probabilities, within a given range.

This range is bounded by the minimum and maximum
possible values, but where a possible value falls on the
probability distribution is determined by a variety of
factors.

These factors include the mean (average), standard
deviation, skewness, and kurtosis of the distribution.

For a random variable Y, the probability distribution
specifies the probability Pr(Y = yi) that Y will take on the
value yi, for each possible value yi.

Expected Value:
In investing, the expected value (EV) is the value that an
investment is predicted to have at some time in the future.

In statistics and probability analysis, the expected value is
computed by multiplying each possible outcome by the
likelihood that it will occur and then summing all of those
values.

By calculating expected values, investors can choose the
scenario that is most likely to provide the desired result.
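
As a worked example, the expected number of heads in a two-coin toss follows directly from this definition. The sketch below assumes the distribution P(Y=0) = 1/4, P(Y=1) = 1/2, P(Y=2) = 1/4 for two fair coins; the function name is illustrative.

```python
from fractions import Fraction


def expected_value(distribution):
    """Sum of value * probability over all possible values."""
    return sum(value * prob for value, prob in distribution.items())


# Number of heads when tossing two fair coins.
two_coin_heads = {
    0: Fraction(1, 4),
    1: Fraction(1, 2),
    2: Fraction(1, 4),
}

ev = expected_value(two_coin_heads)
print(ev)  # 0*(1/4) + 1*(1/2) + 2*(1/4)
```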

Radial Basis Functions

Learning using radial basis functions is a function
approximation method that is strongly connected to
distance-weighted regression and artificial neural networks.
In this blog, we’ll have a look at Radial Basis Functions.

This is the learned hypothesis function:

\[ \hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x)) \]

where each xu is an instance from X and the kernel function
Ku(d(xu, x)) is defined so that it decreases as the distance
d(xu, x) increases. The user-supplied constant k specifies the
number of kernel functions to be included.

Even though the learned function is a global approximation to
f(x), the contribution from each of the Ku(d(xu, x)) terms is
confined to an area around the point xu.

It is common to choose each function Ku(d(xu, x)) to be a
Gaussian function centered at the point xu with some
variance σu²:

\[ K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)} \]

The function may be thought of as defining a two-layer
network, with the first layer computing the values of the
different Ku(d(xu, x)) and the second layer computing a
linear combination of these first-layer unit values.

• A Gaussian function centered at some instance xu determines the
activation of each hidden unit. As a result, unless the input x is close to
xu, its activation will be close to zero.

• The hidden unit activations are combined in a linear fashion by the
output unit. Although the network described here has only one output,
it is possible to incorporate numerous output units.
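
The two-layer view can be sketched as a prediction function: a first layer of Gaussian kernel activations and a second layer that combines them linearly. The centers, weights and shared sigma below are made-up values for illustration; in practice they would be learned from data, which this sketch does not attempt.

```python
import math


def gaussian_kernel(center, x, sigma):
    """Gaussian of the squared Euclidean distance between center and x."""
    squared_distance = sum((c - xi) ** 2 for c, xi in zip(center, x))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def rbf_predict(x, centers, weights, w0=0.0, sigma=1.0):
    """w0 plus the weighted sum of the hidden-unit activations."""
    activations = [gaussian_kernel(c, x, sigma) for c in centers]  # layer 1
    return w0 + sum(w * a for w, a in zip(weights, activations))   # layer 2


# Hypothetical network with two hidden units far apart.
centers = [(0.0, 0.0), (10.0, 10.0)]
weights = [2.0, 5.0]

near_first = rbf_predict((0.0, 0.0), centers, weights)
near_second = rbf_predict((10.0, 10.0), centers, weights)
print(near_first, near_second)
```

Because each activation falls off quickly with distance, the prediction near a center is dominated by that center's weight, which is the localized behaviour described above.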

Dimensionality Reduction
Dimensionality reduction is the process of decreasing the
number of dimensions of the feature set. In machine learning,
the final classification often depends on many factors or
variables. These variables are called features.
The more features there are, the harder it is to
visualize the training set. Many features may also be
correlated with one another and hence redundant. This is where
dimensionality reduction plays a key role, by reducing the
number of random variables under consideration. It can be
divided into feature selection and feature extraction.
Curse of Dimensionality
The curse of dimensionality refers to problems that arise when
we work with high-dimensional data and that do not occur in
low dimensions. As the number of features increases, the
number of samples needed also increases rapidly, because the
samples must represent the different combinations of feature
values. Due to this, certain algorithms struggle to train
effective models. This is known as the curse of
dimensionality.

As the number of features increases, the model also becomes
more complex and tends to overfit. This results in
poor performance on unseen data. Therefore, to avoid
overfitting, we employ dimensionality reduction.
Components of Dimensionality Reduction
Dimensionality reduction has two main components. They
are
• Feature Selection
• Feature Extraction
Let us study them in more detail.
Feature Selection
Feature selection means finding a subset of the
original set of features that is used
to model the problem. It filters out irrelevant
features from the dataset. There are three kinds of
methods:
• Filter
• Wrapper
• Embedded
Feature Extraction
Feature extraction maps the data from a high-dimensional
space to a lower-dimensional space by reducing the number of
dimensions. It is employed to create a new, smaller set of
features that captures most of the useful information.
The main difference between feature selection and feature
extraction is that feature extraction creates new features,
whereas feature selection keeps a subset of the original
features.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine
learning algorithm which can be used for both
classification and regression problems. Nevertheless, it is
mostly used for classification. In SVM
classification, we plot each data item as a point in n-
dimensional space (where n is the number of features), with
the value of each feature being the value of a specific
coordinate. Then, classification is performed by finding the
hyperplane that best separates the two classes.
Hyperplanes are decision boundaries which help in the
classification of the data points. Data points falling on either
side of the hyperplane are assigned to
different classes. A hyperplane is a boundary that divides the
input variable space; in two dimensions it is a line. In SVM, a
hyperplane is selected to separate the points in the input
variable space by their class, either class 0 or class 1.
Correspondingly, the dimension of the hyperplane depends
upon the number of features. If the number of input features
is 2, then the hyperplane is just a line (a line is
one-dimensional). If the number of input features is 3, then
the hyperplane becomes a two-dimensional plane.
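
Once a hyperplane has been found, classifying a point only requires checking which side of it the point falls on. The sketch below shows that decision step for two features; the weight vector and bias are hand-picked assumptions, and the code does not show how an SVM would learn them.

```python
def hyperplane_side(weights, bias, point):
    """Return class 1 if w.x + b >= 0, otherwise class 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


# Hypothetical separating line in 2D: x1 - x2 = 0.
weights = (1.0, -1.0)
bias = 0.0

below_line = hyperplane_side(weights, bias, (3.0, 1.0))
above_line = hyperplane_side(weights, bias, (1.0, 3.0))
print(below_line, above_line)
```

With three features the same function works unchanged; the weights then describe a two-dimensional plane instead of a line.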

Decision Tree Learning

Decision tree learning is a method for approximating
discrete-valued target functions, under the category of
supervised learning. Decision trees can be used to address
problems involving regression and classification.
In this blog, we’ll have a look at an introduction to decision
tree learning and its representation.
Using a decision tree, we can express any boolean
function of discrete attributes. To increase human
readability, learned trees can also be represented as sets of
if-then rules.
Representation:
Decision trees classify instances by sorting them down the
tree from the root to a leaf node, which provides the
classification.

Every node in the decision tree represents a test of some
attribute of the instance, and each branch descending from that
node corresponds to one of the attribute’s possible values.

An instance is classified by starting at the root node of the
tree, testing the attribute indicated by this node, and then
moving down the tree branch corresponding to the value of the
attribute in the given example.

This process is then repeated for the subtree rooted at the
new node.

Let’s have a look at an example in which the target attribute
EnjoySport, which can have the values yes or no on different
Saturday mornings, is predicted based on other attributes of
the morning.

Here’s a decision tree for the concept of EnjoySport. An
instance is classified by sorting it through the tree to the
appropriate leaf node and then returning the classification
associated with this leaf node, which can be yes or no.
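
The root-to-leaf sorting procedure can be sketched with a tree stored as nested dictionaries. The tree below is hypothetical: the notes do not give the tree structure, so the attributes (Outlook, Humidity, Wind) and their values are assumptions chosen only to illustrate the traversal.

```python
# Internal nodes test an attribute; leaves are the strings "Yes" / "No".
# This particular tree is a made-up example for EnjoySport.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}


def classify(node, instance):
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


saturday = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, saturday))
```

Each root-to-leaf path corresponds to one if-then rule, for example "if Outlook is Sunny and Humidity is High then No".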

Classification models
Classification is the process of predicting the class of given
data. Classification predictive modeling is the task of
approximating a mapping function from
input variables to discrete output variables. It belongs to
supervised learning, where the targets
are provided along with the input data.
Classification can be applied to structured and unstructured
data. The main aim of a classification model is to identify
the category or class to which new data belongs.
Classes are also called targets, labels or categories.
Given one or more inputs, a classification
model tries to predict the values of one or
more outcomes. The outcomes are labels that can
be applied to a dataset.
Classification models have applications in many
domains, such as medical diagnosis and targeted marketing.

Bagging and Boosting

BOOTSTRAPPING
Bootstrap means random sampling with replacement. It
allows us to better understand the bias and the variance
of estimates made from the dataset. Bootstrapping selects
small subsets of the data from the original dataset, and the
same data point may be selected more than once. In this way
we can estimate quantities such as the mean and standard
deviation of the dataset more reliably.

To get an estimate of the mean, we take a
sample of ‘n’ values (x):
mean(x) = 1/n * sum(x)

If our sample is small, this mean will have error in it. We can
improve the estimate of the mean by using the bootstrap
procedure:
1. Create many random sub-samples of our dataset with
replacement, so that the same value can be selected more than
once.
2. Compute the mean of each sub-sample.
3. Calculate the average of all of our collected means and use
that as our estimated mean for the data.
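
The three-step bootstrap procedure can be sketched as follows. The number of resamples, the fixed seed (used only to make the run reproducible) and the sample values are assumptions made for illustration; each sub-sample here has the same size as the original sample.

```python
import random


def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Average of the means of many resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Step 1: draw n values with replacement (repeats are allowed).
        resample = [rng.choice(data) for _ in range(n)]
        # Step 2: compute the mean of this sub-sample.
        means.append(sum(resample) / n)
    # Step 3: average the collected means.
    return sum(means) / len(means), means


# Small illustrative sample.
sample = [2, 4, 4, 5, 7, 9]
estimate, resample_means = bootstrap_mean(sample)
print(estimate)
print(min(resample_means), max(resample_means))
```

The spread of the collected means is often as useful as their average, since it indicates how much the estimate would vary from sample to sample.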

BAGGING
Bootstrap Aggregation, also called Bagging, is a simple yet
powerful ensemble method. It is an application of
the bootstrap procedure to a high-variance machine
learning algorithm, typically decision trees.

In Bagging, several subsets of the data are created from the
training sample, chosen randomly with replacement.
Each subset is used to train its own decision tree.

Decision trees suffer from bias and variance:
Simple trees suffer from large bias.
Complex trees suffer from large variance.
Several decision trees are combined to get a more
reliable result than a single decision tree would give.

Bagging is used to reduce the variance of a decision tree.

K NEAREST NEIGHBOURS
K-Nearest Neighbors (KNN) is one of the most important
classification algorithms in Machine Learning. It belongs to
the supervised learning field and has powerful applications
in pattern recognition, data mining and intrusion
detection.
The k-nearest-neighbor algorithm is a “lazy learner”: it
does not build a model from the training set until a query
is made.
The KNN algorithm can be used for both classification and
regression problems. However, it is more widely used in
classification problems. When building a KNN model there
are only a few parameters that need to be chosen to
improve the performance of the model.
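
A KNN classifier can be sketched in a few lines, which also shows why it is called lazy: all the work happens at query time. The sketch assumes Euclidean distance and a simple majority vote, and the training points are made up for illustration. It uses math.dist, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training_data, query, k=3):
    """Label the query by majority vote among its k nearest neighbours."""
    # Sort the stored examples by distance to the query point.
    by_distance = sorted(
        training_data, key=lambda item: math.dist(item[0], query)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative training set: (features, label) pairs.
training_data = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_classify(training_data, (1.5, 1.5)))
print(knn_classify(training_data, (6.5, 6.5)))
```

The main parameters to choose are k and the distance measure; an odd k avoids ties when there are two classes.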

K Means Clustering
K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms. Clustering is one of
the most common exploratory data analysis techniques used to
get an intuition about the structure of the data. It is defined
as the task of identifying subgroups in the data such that data
points in the same subgroup or cluster are very similar while
data points in different clusters are very different.

Clustering analysis can be done on the basis of features, where
we try to find subgroups of samples based on features, or on
the basis of samples, where we try to find subgroups of
features based on samples.
In contrast to supervised learning, clustering is considered
an unsupervised learning method because we don’t have
ground-truth labels against which to compare the output of the
clustering algorithm and evaluate its performance. We
only want to find the structure of the data by grouping
the data points into different subgroups. Generally,
unsupervised algorithms like K-means clustering make
inferences from datasets using only input
vectors, without referring to known or labelled outcomes.

A cluster is a collection of data points grouped
together because of certain similarities. We have
to define a target number k, which is the number of
centroids we need in the dataset. A centroid is the imaginary
or real location which represents the center of a cluster.

Each data point is allocated to a cluster so as to
reduce the in-cluster sum of squares. Put simply,
the K-means algorithm identifies k centroids and
then allocates every data point to the nearest cluster,
keeping the distances within each cluster as small as
possible. The “means” in K-means refers to averaging the
data, that is, finding the centroid.

To process the learning data, the K-means algorithm
starts with a first group of randomly selected centroids,
which are used as the starting points
for the clusters, and then performs iterative
calculations to optimize the positions of the
centroids.

It stops creating and optimizing clusters when either:

• The centroids have stabilized, meaning their values no longer
change because the clustering has converged.
• The defined number of iterations has been reached.
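
The iterative procedure and its two stopping conditions can be sketched as below. One deliberate difference from the description is flagged as an assumption: the initial centroids are passed in explicitly instead of being chosen at random, so that the example is reproducible. The data points are made up. The sketch uses math.dist, which requires Python 3.8 or later.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):          # stop: iteration limit
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:       # stop: centroids stabilized
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of illustrative points, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

With random initialization the result can depend on the starting centroids, so the algorithm is commonly run several times and the clustering with the lowest in-cluster sum of squares is kept.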

Principal Component Analysis
Principal Component Analysis is usually abbreviated as PCA. It
is an unsupervised learning technique, as it
does not use labels but concentrates only on the
variation in the data in order to reduce the dimensions. In
practice, data is often so large that it needs to be reduced in
order to avoid overfitting and other problems during
model prediction. As the number of dimensions
increases, visualizing the data and performing calculations
on it become more difficult. Hence, we
decrease the dimensions of the data by using
dimensionality reduction techniques.
Principal component analysis is one of the best-known
dimensionality reduction techniques. The main idea behind
PCA is to reduce the dimensionality of a data set whose
variables may be correlated with each other, either heavily
or lightly, while retaining as much as possible of the
variation present in the dataset.

Advantages of Principal Component Analysis

1. It removes correlated features.
2. It can improve the performance of the algorithm.
3. It can reduce overfitting.
4. It makes the data easier to visualize.

Disadvantages of Principal Component Analysis

1. The independent variables become less interpretable.
2. Data standardization must be performed before PCA.
3. Some information is lost by using this method.
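
To make the idea concrete, the sketch below finds the first principal component of a small 2D dataset and projects the data onto it, reducing two dimensions to one. It is only an illustration under stated assumptions: it uses power iteration on the covariance matrix (one of several ways to obtain the leading eigenvector), it centers the data but does not standardize it because both made-up features are on the same scale, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the leading eigenvector of the covariance matrix and the
    1D projections of the centered data onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projections = [sum(c * vi for c, vi in zip(r, v)) for r in centered]
    return v, projections


# Illustrative data in which the two features are strongly correlated.
rows = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]
component, projections = first_principal_component(rows)
print(component)
print(projections)
```

Because the two features move together, a single component captures almost all of the variation, which is why one coordinate per point is nearly enough to describe this dataset.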
Linear Discriminant Analysis
Linear Discriminant Analysis is usually abbreviated as LDA. It
is a dimensionality reduction technique which is generally used
as a preprocessing step for supervised learning problems.
For example, given data plotted on two axes X and Y, LDA
creates a new axis and projects the data onto it in such a
way as to maximize the separation of the two categories,
thereby reducing the 2D graph to a 1D graph. In general, the
goal of Linear Discriminant Analysis is to project features
from a higher-dimensional space onto a lower-dimensional space.

Genetic Algorithms: Motivation and Genetic Algorithm-Representing
Genetic algorithm- Motivation
Genetic algorithms are preferred for solving optimization
problems because they can yield a good result in a reasonable
time and are relatively fast.

The need for the genetic algorithm is as follows:

1. Gradient-based method failures

In gradient-based methods, traditional calculus is used. The
method starts at a random point and moves in the direction of
the gradient until it reaches the top. Though this method is
effective for single-peaked objective functions such as the
cost function in linear regression, it is much less useful in
real life, where problems have more complex landscapes
made of many peaks and valleys. On such landscapes these
methods tend to get stuck at a local optimum.

2. Solving difficult problems with ease.

In computer science, there are many problems that even the
most powerful computers take ages to solve. In this
situation, a genetic algorithm comes in handy, as it can give
approximate solutions in a reasonable time.

3. Time-efficient.
Many problems, such as the Travelling Salesperson Problem
(TSP), have practical uses such as pathfinding and VLSI
design. GA can give good results in a reasonable time. For
example, imagine how much trouble it would be if our GPS took
hours to give us the path from our source to our destination.
With GA involved we can get a good result in a short time.
Genotype representing
Genotype representation is very important during the
implementation of a genetic algorithm, as it directly impacts
the performance of the genetic algorithm. A proper
representation is one in which the mappings between the
phenotype and genotype spaces are properly defined.

The most common representation methods for GA are as
follows:

1. Binary representation
Binary representation is one of the most common
representations used in GA.

The genotype in this method consists of bit strings. In the
case of a Knapsack problem, where the solution space
consists of Boolean decision variables, the binary
representation is natural.

For other problems, which may deal with numbers, we
represent the numbers by their binary representation. The
drawback in this situation is that different bits have
different significance, which results in undesired
consequences for the mutation and crossover operators.

2. Integer representation
In the case of discrete-valued genes, we cannot always limit
the values to binary, so instead we use integer
representation. For example, if we had to encode the four
directions, we could encode them as {1,2,3,4} and
use integer representation.
3. Permutation representation
Whenever the solutions are represented by an order of
elements, we can use permutation representation. Take
the same example of TSP: the person has
to travel to all the cities, visiting each city once, and then
come back to the source city. The solution to the TSP is
naturally an ordering of the cities, that is, a permutation,
and therefore we can use the permutation representation.

4. Real valued representation

In problems where we need to define the genes using
continuous variables, we use real-valued representation.
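
The choice of representation determines which crossover and mutation operators make sense. The sketch below shows one common operator for bit strings and one for permutations; the operator choices (single-point crossover, bit-flip mutation, swap mutation) and all function names are assumptions made for illustration, since the notes do not specify any operators.

```python
import random


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two bit strings at the given cut point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def bit_flip_mutation(genotype, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genotype]


def swap_mutation(tour, rng):
    """Swap two positions; the result is still a valid permutation."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(0)  # fixed seed, only for reproducibility

# Binary representation, e.g. which items go into a knapsack.
children = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], point=2)
print(children)
print(bit_flip_mutation([1, 0, 1, 0], rate=0.25, rng=rng))

# Permutation representation, e.g. the order of cities in a TSP tour.
print(swap_mutation([0, 1, 2, 3, 4], rng))
```

Applying bit-flip mutation to a TSP tour could repeat or drop a city, which is why permutation encodings use operators such as swaps that keep every element exactly once.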
Reinforcement Learning
Reinforcement Learning is a type of machine learning model
which enables the model to learn in an interactive
environment by using trial and error method making use of
feedback from its own actions and experiences. It is the
training of machine learning models to create a sequence of
decisions.
In reinforcement learning, an artificial intelligence faces a
situation which is similar to Game. Artificial Intelligence gets
either rewards or penalties of the actions it performs to
make the machine do whatever the programmers wants to.
Goal of reinforcement learning is to maximize the total
reward.

Although both supervised and reinforcement learning use a
mapping between input and output, reinforcement learning
uses rewards and punishments as signals for positive and
negative behavior, unlike supervised learning, where the
feedback given to the agent is the correct set of actions
for accomplishing a task.
Compared with unsupervised learning, reinforcement learning
differs in its goal. In unsupervised learning, the goal is to
find similarities and differences between data points. In
reinforcement learning, the goal is to find a suitable action
model that maximizes the total cumulative reward of the
agent.

Reinforcement learning is currently one of the most effective
ways to bring out a machine’s creativity by leveraging the
power of search and many trials. Although the programmer
defines the reward policy (the rules of the game), they give
the model no hints or suggestions on how to solve the
problem or game. The model itself decides how to accomplish
the task so as to maximize the reward. It begins with random
trials and finishes with sophisticated techniques.

Key Points in Reinforcement Learning
• The input to a reinforcement learning model should be an initial state
from which the model will begin.
• There are many possible outputs, as there is a variety of possible
solutions to a specific problem.
• Training is based on the input: the model returns a state, and the user
decides whether to reward or punish the model according to its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
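The points above can be sketched with tabular Q-learning on a tiny corridor world. Q-learning is one specific reinforcement learning algorithm that the notes do not name, and the environment, reward values, and hyperparameters below are all assumptions chosen for illustration. Here the step function plays the role of the party that rewards or punishes the model.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # illustrative hyperparameters

def step(state, action):
    """Environment: reward +1 on reaching the goal, a small penalty otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.01
    return next_state, reward, done

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False                  # each episode begins in the initial state
        while not done:
            if rng.random() < EPSILON:          # explore: a random trial
                a = rng.randrange(2)
            else:                               # exploit: best action known so far
                a = max((0, 1), key=lambda i: q[state][i])
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy action learned for each non-goal state.
policy = [ACTIONS[max((0, 1), key=lambda i: row[i])] for row in q_table[:-1]]
```

The model starts with essentially random moves and, after enough episodes, settles on the action with the highest estimated cumulative reward in each state.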
Types of Reinforcement
Positive Reinforcement
Positive reinforcement occurs when an event that follows a
specific behavior increases the strength and frequency of
that behavior. Put simply, it has a positive effect on the
behavior.
Merits of positive reinforcement
• It maximizes the performance of the model.
• It can sustain a change for a long period of time.
Demerits of positive reinforcement
• Too much reinforcement can lead to an overload of states, which can
diminish the results.

Negative Reinforcement
Negative reinforcement is the strengthening of a behavior
because a negative condition is stopped or avoided.
Merits of negative reinforcement
• It increases the desired behavior of the model.
• It pushes back against falling below a minimum standard of performance.
Demerits of negative reinforcement
• It provides only enough to meet the minimum.

Evaluating Hypotheses: Basics of Sampling Theory
Statistical methods are applied to estimate the accuracy of
a hypothesis. This section looks at evaluating hypotheses and
the basics of sampling theory.

Let’s look at the terminology involved and what each term
means.
Random Variable:
A random variable may be thought of as the name of a
probabilistic experiment. Its value is the outcome of the
experiment.

Whenever the outcome of an experiment is not known for
certain in advance, it can be described by a random variable.

The outcome of a coin flip is a good illustration of a random
variable. The values of a random variable need not all be
equally likely, as the following two-coin example shows.

If the random variable Y is the number of heads we get from
tossing two coins, then Y can be 0, 1, or 2: on a two-coin
toss we can get no heads, one head, or two heads.

The two coins, however, can land in four equally likely
patterns: TT, HT, TH, and HH. As a result, P(Y=0) = 1/4,
because only one pattern (two tails, TT) gives no heads.

Similarly, getting two heads (HH) has a 1/4 chance of
happening. One head can appear in two patterns, HT and TH,
so P(Y=1) = 2/4 = 1/2.
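The two-coin probabilities can be checked by enumerating the four patterns. This is a small sketch assuming fair coins; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Y = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)
distribution = {y: Fraction(n, len(outcomes)) for y, n in sorted(counts.items())}
```

The resulting dictionary maps each value of Y to its probability, and the probabilities sum to 1.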

Probability Distribution:
A probability distribution is a statistical function that
specifies all possible values of a random variable within a
given range, together with their probabilities.
This range is bounded by the minimum and maximum possible
values, but where a possible value falls on the probability
distribution is determined by a number of factors.

These factors include the mean (average), standard deviation,
skewness, and kurtosis of the distribution.

For a random variable Y, the probability distribution
specifies the probability Pr(Y = yi) that Y will take on the
value yi, for each possible value yi.

Expected Value:
The expected value (EV) of a random variable is the average
value it would take over many repetitions of the experiment.
In finance, it is the value an investment is predicted to
have at some time in the future.

In statistics and probability analysis, the expected value is
computed by multiplying each possible outcome by the
probability that it will occur and then summing all of those
values.

By comparing expected values, investors can choose the
scenario most likely to provide the desired result.
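As a short sketch of that computation, the function below multiplies each value by its probability and sums the results. It reuses the two-coin distribution from the random variable example; the function name is an illustrative choice.

```python
from fractions import Fraction

def expected_value(distribution):
    """E[Y] = sum over all values y of y * Pr(Y = y)."""
    return sum(y * p for y, p in distribution.items())

# Number of heads in two fair coin tosses.
two_coin_heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
ev = expected_value(two_coin_heads)    # 0*(1/4) + 1*(1/2) + 2*(1/4) = 1
```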

Variance of a Random Variable:
In statistics, variance measures how far a set of values
spreads out from its mean. It is calculated as the
probability-weighted average of squared deviations from the
expected value.
As a result, the greater the variance, the further the
set’s numbers lie from the mean. A smaller variance, on the
other hand, indicates that the numbers in the set are closer
to the mean.

The variance of a random variable Y is defined as
Var(Y) = E[(Y - E[Y])^2]

Standard Deviation:
The standard deviation is a statistic that measures the
dispersion of a dataset relative to its mean. It is
calculated as the square root of the variance.

To compute it, take each data point’s deviation from the
mean, average the squared deviations to get the variance, and
then take the square root.

When data points are further from the mean, there is more
variation within the data set; as a result, the larger the
standard deviation, the more spread out the data is.

The standard deviation of Y is sqrt(Var(Y)). It is usually
represented using the symbol σ_Y.
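A small worked sketch, using the usual definition Var(Y) = E[(Y - E[Y])^2] and the two-coin distribution from earlier in these notes. The function and variable names are illustrative.

```python
import math
from fractions import Fraction

def variance(distribution):
    """Var(Y) = E[(Y - E[Y])^2] for a discrete distribution {value: probability}."""
    mean = sum(y * p for y, p in distribution.items())
    return sum(p * (y - mean) ** 2 for y, p in distribution.items())

two_coin_heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
var_y = variance(two_coin_heads)      # (1/4)*1 + (1/2)*0 + (1/4)*1 = 1/2
sigma_y = math.sqrt(var_y)            # standard deviation = square root of variance
```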

The Binomial Distribution:


Under a given set of parameters or assumptions, the binomial
distribution describes the number of successes in a fixed
number of trials, where each trial has exactly two possible
outcomes (for example, success or failure).
The binomial distribution assumes that each trial has only
two mutually exclusive outcomes, that every trial has the
same probability of success, and that the trials are
independent of one another.

It gives the probability of observing r heads in a series of
n independent coin tosses if the probability of heads in a
single toss is p.
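The standard binomial formula, P(r) = C(n, r) * p^r * (1 - p)^(n - r), can be sketched as follows. The function name is an illustrative choice.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r heads in n independent tosses with P(heads) = p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Two fair coins: the same 1/4, 1/2, 1/4 values as the two-coin example.
probs = [binomial_pmf(r, 2, 0.5) for r in range(3)]
```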

Normal Distribution:
The normal distribution, also known as the Gaussian
distribution, is a probability distribution that is symmetric
about the mean, indicating that data near the mean occur more
often than data far from the mean. On a graph, the normal
distribution appears as a bell curve.

This bell-shaped probability distribution describes many
natural phenomena.
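The bell shape can be seen from the density function itself. The sketch below implements the standard normal density formula; the parameter names mu and sigma are conventional but are not taken from the notes.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The density peaks at the mean and is symmetric around it.
at_mean = normal_pdf(0.0)
one_sd_left, one_sd_right = normal_pdf(-1.0), normal_pdf(1.0)
```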

Central Limit Theorem:


The central limit theorem is a statistical result about the
means of samples drawn from a population with finite
variance. By the law of large numbers, the mean of a large
enough sample will be approximately equal to the mean of the
entire population.

The central limit theorem adds that the distribution of these
sample means approaches a normal distribution as the sample
size grows, with a variance roughly equal to the population
variance divided by the sample size.
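A quick simulation can illustrate this behavior. The sketch below assumes a uniform(0, 1) population (mean 1/2, variance 1/12); the sample size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

def sample_means(sample_size, n_samples, rng):
    """Means of repeated samples drawn from a uniform(0, 1) population."""
    return [
        statistics.fmean(rng.random() for _ in range(sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(0)
means = sample_means(sample_size=50, n_samples=2000, rng=rng)

mean_of_means = statistics.fmean(means)       # should be close to 0.5
var_of_means = statistics.pvariance(means)    # should be close to (1/12) / 50
```

A histogram of the values in means would look roughly bell-shaped even though the underlying population is flat.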
Estimator:
An estimator is a random variable Y used to estimate some
parameter p of an underlying population.

The estimand is the quantity being estimated (i.e., the one
you wish to know). For example, suppose you need to find the
average height of pupils at a 1000-student school.

You measure a group of 30 children and find that their
average height is 56 inches. The sample mean is your
estimator, and 56 inches is the estimate it produces. Using
it, you estimate the population mean (your estimand) to be
around 56 inches.

The Estimation Bias:


The estimation bias of Y as an estimator for p is the quantity
(E[Y] – p). An unbiased estimator is one for which the bias is
zero.
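As an illustration of bias, the sketch below compares two estimators of variance on fair-coin samples: dividing the squared deviations by n (biased) and by n - 1 (unbiased). The example is an assumption added for illustration. It computes E[Y] exactly by enumerating every equally likely sample rather than by simulation.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimate(sample_size, ddof):
    """Exact E[estimator] over all equally likely fair-coin samples (values 0/1)."""
    samples = list(product((0, 1), repeat=sample_size))
    total = Fraction(0)
    for sample in samples:
        mean = Fraction(sum(sample), sample_size)
        squared_dev = sum((x - mean) ** 2 for x in sample)
        total += squared_dev / (sample_size - ddof)
    return total / len(samples)

true_variance = Fraction(1, 4)     # variance of a single fair coin toss
bias_divide_by_n = expected_variance_estimate(3, ddof=0) - true_variance
bias_divide_by_n_minus_1 = expected_variance_estimate(3, ddof=1) - true_variance
```

With samples of size 3, dividing by n underestimates the variance on average, while dividing by n - 1 has zero bias.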

N% confidence interval:
An N% confidence interval estimate for parameter p is an
interval that includes p with probability N%.
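One common way to build such an interval for a proportion (for example, an error rate measured on a test set) uses the normal approximation. The sketch below assumes that approximation is adequate, takes z = 1.96 for N = 95, and uses hypothetical counts.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Approximate interval for a proportion; z = 1.96 corresponds to N = 95."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical data: 12 errors observed on 40 test examples.
low, high = proportion_confidence_interval(12, 40)
```

The approximation is usually considered reasonable only when the sample is not too small and the proportion is not too close to 0 or 1.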

Bayesian Learning:
Introduction
Bayesian machine learning is a subset of probabilistic
machine learning approaches (for other probabilistic models,
see Supervised Learning). This section gives a brief
introduction to Bayesian learning.
In Bayesian learning, model parameters are treated as
random variables, and parameter estimation entails
constructing posterior distributions for these random
variables based on observed data.

Why Bayesian Learning Algorithms?


Bayesian learning approaches are relevant to machine learning
for two reasons.
• First, Bayesian learning algorithms compute explicit probabilities
for hypotheses.
• Second, they provide a useful perspective for understanding many learning
methods that do not explicitly manipulate probabilities.

Features of Bayesian learning methods include:
• Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct.
This is more flexible than methods that completely eliminate a
hypothesis if it is found to be inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the
final probability of a hypothesis.
• Bayesian approaches can accommodate hypotheses that make probabilistic
predictions (e.g., hypotheses such as “this pneumonia patient has a 93
percent chance of complete recovery”).

Bayesian estimation calculates the validity of a proposition.
The proposition’s validity is determined by two factors:
i). The prior estimate
ii). New evidence that is relevant.
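Combining a prior estimate with new evidence is what Bayes' rule does. The sketch below shows how each observation shifts the probability of a hypothesis a little; the prior and the likelihood values are hypothetical numbers chosen for illustration.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: P(h | evidence) from a prior and the two likelihoods."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

# Hypothetical numbers: prior belief 0.5; each observation is twice as likely
# if the hypothesis is true (0.8) as if it is false (0.4).
belief = 0.5
history = [belief]
for _ in range(3):
    belief = posterior(belief, 0.8, 0.4)
    history.append(belief)
```

Each supporting observation raises the estimate gradually (0.5, then about 0.67, 0.8, and 0.89) rather than accepting or rejecting the hypothesis outright.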
