Machine Learning Notes

The document provides an overview of machine learning, covering key concepts such as algorithms, feature selection, and types of learning including supervised and unsupervised. It discusses various machine learning methods like K-Nearest Neighbour, Naive Bayes, decision trees, and regression techniques, explaining their algorithms and applications. Additionally, it introduces neural networks, specifically the perceptron model, highlighting its components and types.

Module 1

Introduction to Machine Learning


Machine Learning (ML) is a field of artificial intelligence that focuses on
developing algorithms that allow computers to learn patterns from data and
make decisions or predictions without being explicitly programmed.

How Do Machines Learn?


Machines learn by using algorithms to process data, identify patterns, and
make decisions based on that data. Learning can be supervised (with labeled
data), unsupervised (with unlabeled data), or semi-supervised.

Selecting the Right Features


Feature selection involves identifying the most relevant variables (features) in
your data that contribute most significantly to the prediction or classification
task. Good feature selection improves model accuracy and reduces
computational cost.
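
As an illustration, a library like scikit-learn can score features and keep only the most informative ones. A minimal sketch (the iris dataset and the ANOVA F-test scorer are illustrative choices, not prescribed by these notes):

```python
# Feature selection sketch: score each feature and keep the best k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)           # 4 features, 3 classes

# Score each feature with the ANOVA F-test and keep the best 2.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("original shape:", X.shape)           # (150, 4)
print("reduced shape: ", X_reduced.shape)   # (150, 2)
print("scores per feature:", selector.scores_)
```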

Understanding Data
Numeric Variables
- **Mean**: The average value of a set of numbers.
- **Median**: The middle value when the numbers are sorted in ascending
order.
- **Mode**: The most frequently occurring value in a set of numbers.

Measuring Spread
- **Range**: The difference between the maximum and minimum values.
- **Variance**: The average of the squared differences from the mean.
- **Standard Deviation**: The square root of the variance, representing how
spread out the numbers are from the mean.
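
A quick sketch of these summary statistics, using only Python's standard statistics module on a made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:  ", statistics.mean(data))       # 5.0
print("median:", statistics.median(data))     # 4.5
print("mode:  ", statistics.mode(data))       # 4 (most frequent value)
print("range: ", max(data) - min(data))       # 7
print("variance (population):", statistics.pvariance(data))  # 4.0
print("std dev (population): ", statistics.pstdev(data))     # 2.0
```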

Review of Distribution
Uniform Distribution
A distribution where all outcomes are equally likely. Each value in the range has
the same probability of occurring.

Normal Distribution
A bell-shaped distribution that is symmetric about the mean. Most of the data
points cluster around the mean, with probabilities tapering off equally on both
sides.
Categorical Variables
These are variables that represent categories or groups. They can be nominal
(no natural order) or ordinal (natural order). Examples include gender, race, or
levels of satisfaction.

Dimensionality Reduction - Principal Component Analysis (PCA)


PCA is a technique used to reduce the dimensionality of data by transforming
the original variables into a new set of variables (principal components) that
are uncorrelated and that capture the maximum variance in the data. This
helps in simplifying the model, reducing computation time, and eliminating
multicollinearity.
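
A minimal PCA sketch with scikit-learn, reducing the 4-dimensional iris data to 2 principal components (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to the scale of each feature.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("reduced shape:", X_pca.shape)                      # (150, 2)
print("variance captured:", pca.explained_variance_ratio_)
```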
Module 2

Lazy Learning
Lazy learning is a type of machine learning where the model generalizes the
training data only when a query is made, rather than during an initial
training phase. The algorithm does not explicitly construct a model; instead,
it stores the training data and performs its computations at prediction time.

Classification using K-Nearest Neighbour (KNN) Algorithm


KNN is a simple, non-parametric, and lazy learning algorithm used for
classification and regression.

- **How KNN Works**:


1. **Store Training Data**: During the training phase, the algorithm stores all
the training examples.
2. **Compute Distance**: When a query instance is given, it calculates the
distance between the query instance and all the training examples. Common
distance metrics include Euclidean, Manhattan, and Minkowski distances.
3. **Find Nearest Neighbours**: Select the k training examples that are
closest to the query instance.
4. **Predict Class**: For classification, the algorithm assigns the class that is
most common among the k nearest neighbours. For regression, it averages the
values of the k nearest neighbours.

- **Example**:
If you want to classify a new data point, the algorithm looks at the k nearest
data points in the training set and assigns the most common class among
them.
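
A from-scratch sketch of the four steps above, using Euclidean distance and a majority vote (the toy points and labels are made up):

```python
from collections import Counter
import math

def knn_classify(train_X, train_y, query, k=3):
    # 1-2. Compute the distance from the query to every stored training example.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # 3. Take the k closest training examples.
    neighbours = sorted(distances)[:k]
    # 4. Majority vote among the k neighbours' class labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_X, train_y, query=(2, 2)))  # -> "A"
```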
Probabilistic Learning - Naive Bayes Classifier
Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with
strong (naive) independence assumptions between the features.

Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge
of conditions that might be related to the event. The formula is:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:
- P(A|B) is the posterior probability of class A given predictor B.
- P(B|A) is the likelihood of predictor B given class A.
- P(A) is the prior probability of class A.
- P(B) is the prior probability of predictor B.

Naive Bayes Algorithm


1. **Calculate Prior Probabilities**: Compute the prior probabilities for each
class.
2. **Calculate Likelihood**: For each feature, calculate the likelihood of the
feature given the class.
3. **Calculate Posterior Probability**: Use Bayes' theorem to compute the
posterior probability for each class.
4. **Predict Class**: Assign the class with the highest posterior probability to
the query instance.
- **Example**:
In spam email detection, the algorithm calculates the probability of an email
being spam given the presence of certain words and classifies the email based
on the highest probability.
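
A minimal sketch of the spam example with scikit-learn's multinomial Naive Bayes (the tiny corpus is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "cheap money offer", "meeting at noon",
          "project meeting schedule"]
labels = ["spam", "spam", "ham", "ham"]

# Likelihoods are estimated from word counts per class; priors from class frequencies.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["cheap money meeting"])
print(model.predict(test))         # predicted class
print(model.predict_proba(test))   # posterior probability per class
```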

Joint Probability
The joint probability of two events A and B is the probability that both
events occur. It is denoted as P(A ∩ B) or P(A, B).

- **Example**:
If the probability of it raining (A) is 0.3 and the probability of it being
windy (B) is 0.4, the joint probability P(A ∩ B) depends on the relationship
between A and B; if the two events are independent, it is simply
P(A) × P(B) = 0.3 × 0.4 = 0.12.

Conditional Probability
The conditional probability of an event A given that another event B has
occurred is denoted as P(A|B).

- **Example**:
The probability that it will rain today given that it rained yesterday can be
calculated if the two events are dependent.

Summary
- **Lazy Learning**: Stores data and delays generalization until query time
(e.g., KNN).
- **K-Nearest Neighbour (KNN)**: Classifies based on the majority class of the
nearest k neighbours.
- **Probabilistic Learning (Naive Bayes)**: Uses Bayes' theorem with the
assumption of feature independence to classify data.
- **Bayes' Theorem**: Calculates posterior probabilities to update predictions.
- **Joint Probability**: The probability of two events occurring together.
- **Conditional Probability**: The probability of one event occurring given that
another event has occurred.

Classification using Decision Trees and Rules

Decision trees are a popular method for classification and regression tasks.
They use a tree-like model of decisions and their possible consequences,
including chance event outcomes, resource costs, and utility. Decision trees are
constructed using a divide-and-conquer strategy.

Divide and Conquer Strategy


The divide and conquer strategy in decision trees involves recursively
partitioning the data into subsets based on feature values. Each split
corresponds to a decision node in the tree, which branches out to further splits
or to a leaf node representing a class label or a continuous value in the case of
regression.

Decision Tree Algorithm

Steps in Building a Decision Tree

1. **Select the Best Feature**:


- At each node, the algorithm selects the feature that best splits the data into
subsets. This is typically done using criteria like Gini impurity, information gain,
or Chi-square.
2. **Split the Data**:
- Divide the data into subsets based on the selected feature. Each subset
represents a branch of the tree.

3. **Create Decision Nodes and Leaf Nodes**:


- Decision nodes correspond to the selected features and their conditions.
- Leaf nodes represent the final output (class labels or values).

4. **Recursive Partitioning**:
- Repeat the process for each subset until one of the stopping conditions is
met, such as:
- All instances in a subset belong to the same class.
- No remaining features to split on.
- A pre-defined maximum tree depth is reached.

5. **Pruning** (Optional):
- Pruning is used to reduce the size of the tree and prevent overfitting by
removing branches that have little importance.
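
These steps are what library implementations automate. A minimal sketch with scikit-learn's DecisionTreeClassifier, using entropy (information gain) as the split criterion and a depth limit as a pre-pruning stopping condition (the iris dataset is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain;
# max_depth acts as a pre-pruning stopping condition.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # readable view of the learned decision rules
```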

Example

Consider a dataset with features like "Weather" (Sunny, Rainy) and
"Temperature" (Hot, Cold), and a target variable "Play" (Yes, No).

1. **Root Node**:
- Calculate entropy for the entire dataset.
- Calculate information gain for each feature.
- Choose the feature with the highest information gain, e.g., "Weather".

2. **Split Data**:
- Split data based on "Weather".
- Create branches for "Sunny" and "Rainy".

3. **Sub-Nodes**:
- For each branch, repeat the process:
- Calculate entropy for the subset.
- Calculate information gain for remaining features.
- Choose the best feature, e.g., "Temperature".

4. **Leaf Nodes**:
- Continue until all subsets are pure (only contain one class) or another
stopping criterion is met.
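
A from-scratch sketch of the entropy and information-gain calculations this example walks through, on a made-up toy version of the Weather/Temperature/Play data:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions in this subset.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    base = entropy(labels)
    # Group labels by the value of the chosen feature.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return base - remainder

# Columns: (Weather, Temperature); target: Play
rows   = [("Sunny", "Hot"), ("Sunny", "Cold"), ("Rainy", "Hot"), ("Rainy", "Cold")]
labels = ["Yes", "Yes", "No", "Yes"]

print("gain(Weather):    ", information_gain(rows, labels, 0))
print("gain(Temperature):", information_gain(rows, labels, 1))
```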

Summary

- **Decision Trees**: Use a tree structure to make decisions and classify data
based on features.
- **Divide and Conquer**: Recursively split data into subsets based on the best
feature at each step.
- **Decision Nodes**: Represent decisions based on features.
- **Leaf Nodes**: Represent class labels or values.
- **Algorithms**: ID3, C4.5, CART are common decision tree algorithms.
- **Pruning**: Can be applied to prevent overfitting by removing less
important branches.

Decision trees are intuitive and easy to interpret, making them a valuable tool
for both classification and regression tasks in machine learning.

Regression Methods
Regression analysis is a statistical method used to examine the relationship
between a dependent variable and one or more independent variables. It helps
in understanding how the dependent variable changes when any one of the
independent variables is varied, while the other independent variables are held
fixed.

Simple Linear Regression


Simple linear regression is a method used to model the relationship between a
single independent variable (X) and a dependent variable (Y) by fitting a linear
equation to the observed data.

y = a0 + a1x + ε

Where:
- a0 = the intercept of the regression line (the value of y when x = 0)
- a1 = the slope of the regression line, which tells whether the line is
increasing or decreasing
- ε = the error term (negligible for a good model)


Ordinary Least Squares (OLS) Estimation
The OLS method estimates the parameters of the linear regression model by
minimizing the sum of the squared differences between the observed values
and the values predicted by the linear model. In mathematical terms, OLS
solves:

Minimize Σ (yi − ŷi)²

where yi is the actual value and ŷi is the predicted value. A linear
regression model used for determining the value of the response variable, ŷ,
can be represented as the following equation:

y = b0 + b1x1 + b2x2 + … + bnxn + e

where:

• y is the dependent variable
• b0 is the intercept
• b1, b2, …, bn are the regression coefficients of the independent variables
x1, x2, …, xn
• e is the error term

The goal of the OLS method is to estimate the unknown parameters (b0, b1, …,
bn) by minimizing the sum of squared residuals (SSR), also termed the sum of
squared errors (SSE). This method is also known as the least-squares method
for regression.
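
A minimal sketch of OLS estimation with NumPy on synthetic data, solving for the coefficients that minimize the sum of squared residuals:

```python
import numpy as np

# Synthetic data: y = 2 + 3x plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=0.5, size=x.size)

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ b = y.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept b0:", b[0])   # close to 2
print("slope b1:    ", b[1])   # close to 3
```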
Correlation

Correlation measures the strength and direction of the linear relationship
between two variables. The correlation coefficient (r) ranges from -1 to 1.

- **Positive Correlation**: As one variable increases, the other variable
tends to increase (r > 0).
- **Negative Correlation**: As one variable increases, the other variable
tends to decrease (r < 0).
- **No Correlation**: There is no linear relationship between the variables
(r = 0).

The Pearson correlation coefficient is the most commonly used measure of
correlation. It expresses the linear relationship between two variables in
numerical terms and is written as "r":

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² × Σ(yi − ȳ)² ]

where:
• r: correlation coefficient
• xi: i-th value of the first dataset X
• x̄: mean of the first dataset X
• yi: i-th value of the second dataset Y
• ȳ: mean of the second dataset Y
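
A from-scratch sketch of the Pearson formula on a small made-up sample:

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Numerator: sum of products of deviations from each mean.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: geometric mean of the two sums of squared deviations.
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                    * sum((y - y_bar) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pearson_r(xs, ys))  # ~0.775: a fairly strong positive correlation
```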
Multiple Linear Regression

Multiple linear regression models the relationship between a dependent
variable and two or more independent variables. The model is represented as:

ŷ = b0 + b1x1 + b2x2 + … + bnxn + e

where:

• ŷ = the predicted value of the dependent variable
• b0 = the y-intercept (the value of y when all independent variables are 0)
• b1 = the regression coefficient of the first independent variable x1
(a.k.a. the effect that increasing the value of that independent variable has
on the predicted y value)
• … = the same for however many independent variables you are testing
• bn = the regression coefficient of the last independent variable xn
• e = model error (a.k.a. how much variation there is in our estimate of ŷ)
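
A minimal sketch of multiple linear regression with scikit-learn on synthetic data with two independent variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1.5 + 2*x1 - 0.5*x2 plus a little noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))          # columns are x1, x2
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)       # close to 1.5
print("coefficients b1, b2:", model.coef_)     # close to [2.0, -0.5]
print("prediction at x1=4, x2=2:", model.predict([[4, 2]]))
```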

Summary

- **Simple Linear Regression**: Models the relationship between one
independent variable and one dependent variable using a linear equation.
- **Ordinary Least Squares (OLS)**: Estimates the parameters of the linear
regression model by minimizing the sum of squared residuals.
- **Correlation**: Measures the strength and direction of the linear
relationship between two variables.
- **Multiple Linear Regression**: Models the relationship between a
dependent variable and multiple independent variables using a linear
equation.

These regression methods are foundational tools in statistical analysis and
machine learning, used to predict outcomes and understand relationships
between variables.
Module 3

Neural Networks

Neural networks are computational models inspired by the human brain. They
are designed to recognize patterns, make decisions, and predict outcomes
based on input data.

Biological Motivation

Neural networks are inspired by the structure and function of the human brain,
which consists of interconnected neurons. Each neuron receives input signals,
processes them, and transmits output signals to other neurons. Similarly,
artificial neural networks consist of interconnected nodes (neurons) that
process information in a layered structure.

Perceptron

The perceptron is the simplest type of artificial neural network and serves as
the building block for more complex networks. It consists of a single neuron
with adjustable weights and a bias.

Basic Components of Perceptron


Frank Rosenblatt invented the perceptron model as a binary classifier which
contains three main components. These are as follows:
o Input Nodes or Input Layer:

This is the primary component of the perceptron, which accepts the initial data into
the system for further processing. Each input node contains a real numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between units.
Weight is directly proportional to the influence of the associated input neuron in
deciding the output. The bias can be thought of as the intercept in a linear
equation.

o Activation Function:

This final component determines whether the neuron will fire or not. In the basic
perceptron, the activation function is a step function. Activation functions
introduce non-linearity into the network, enabling it to learn complex patterns.

Types of Activation Functions:

o Sign function
o Step function
o Sigmoid function
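
A from-scratch sketch of a single perceptron with a step activation, trained with the classic perceptron update rule on the (linearly separable) logical AND data:

```python
def step(z):
    # Step activation: fire (1) if the weighted sum crosses the threshold.
    return 1 if z >= 0 else 0

def train_perceptron(samples, labels, lr=0.1, epochs=10):
    w = [0.0, 0.0]   # one weight per input node
    b = 0.0          # bias term (the "intercept" of the decision boundary)
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            y = step(w[0] * x[0] + w[1] * x[1] + b)   # forward pass
            error = target - y
            # Perceptron update rule: nudge weights toward the target.
            w = [w[0] + lr * error * x[0], w[1] + lr * error * x[1]]
            b += lr * error
    return w, b

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels  = [0, 0, 0, 1]                     # logical AND: linearly separable
w, b = train_perceptron(samples, labels)
print([(x, step(w[0] * x[0] + w[1] * x[1] + b)) for x in samples])
```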
Perceptrons are a type of artificial neural network used in machine learning for
binary classification tasks. They are the simplest form of a neural network,
consisting of a single layer of weights that connect input features to the output.
There are several types of perceptrons, each with different characteristics and
applications:

1. **Single-Layer Perceptron**:
- **Definition**: A single-layer perceptron is the most basic form of a neural
network. It consists of a single layer of weights connecting input features to the
output.
- **Applications**: It is used for linearly separable problems, meaning
problems where a single hyperplane can separate the data into two classes.

2. **Multi-Layer Perceptron (MLP)**:


- **Definition**: An extension of the single-layer perceptron, MLPs contain
one or more hidden layers of neurons between the input and output layers.
- **Applications**: MLPs can solve more complex, non-linearly separable
problems by using multiple layers and non-linear activation functions. They are
used for tasks such as image and speech recognition.

3. **Binary Perceptron**:
- **Definition**: A binary perceptron is a type of single-layer perceptron that
outputs binary results (0 or 1).
- **Applications**: It is used for binary classification tasks where the goal is
to categorize data into two distinct classes.

4. **Multi-Class Perceptron**:
- **Definition**: A perceptron that has been adapted to handle multi-class
classification problems. This is often achieved using a technique such as one-vs-
all (OvA) or one-vs-one (OvO) to extend the binary perceptron.
- **Applications**: Used for classification tasks with more than two classes,
such as categorizing types of animals in an image.

5. **Probabilistic Perceptron**:
- **Definition**: A probabilistic perceptron incorporates probabilistic
methods, such as using a sigmoid or softmax function for the output layer to
provide a probability distribution over possible output classes.
- **Applications**: Used in scenarios where probabilistic interpretation of
the output is beneficial, such as in probabilistic decision-making systems.

6. **Kernel Perceptron**:
- **Definition**: An extension of the perceptron that uses kernel functions to
map input features into a higher-dimensional space, allowing for the
classification of non-linearly separable data.
- **Applications**: Useful in scenarios where the data is not linearly
separable in its original space, similar to the application of support vector
machines (SVMs) with kernel tricks.

Understanding these different types of perceptrons helps in choosing the
appropriate model for specific machine learning tasks and problems.

Network Models

Neural networks are composed of multiple layers:

- **Input Layer**: Receives the input data.
- **Hidden Layers**: Perform intermediate computations and feature
extraction.
- **Output Layer**: Produces the final output.

Common types of neural networks include:

- **Feedforward Neural Network**: Information flows in one direction from
input to output.
- **Convolutional Neural Network (CNN)**: Specialized for image and spatial
data processing.
- **Recurrent Neural Network (RNN)**: Handles sequential data by
maintaining information in memory.

Cost Function

The cost function measures the difference between the predicted output and
the actual output. It guides the training process by quantifying the error.
Common cost functions include:

1. Regression Cost Function
2. Binary Classification Cost Function
3. Multi-class Classification Cost Function

1. Regression Cost Function


Regression models are used to make predictions for continuous variables, such
as house prices, weather values, or loan amounts. When a cost function is used
with regression, it is known as the "Regression Cost Function." Here, the cost
is calculated as an error based on the distance between the actual and
predicted outputs:

Error = Actual Output − Predicted Output

There are three commonly used regression cost functions, which are as follows:

a. Mean Error

In this type of cost function, the error is calculated for each training
example, and then the mean of all the error values is taken. It is one of the
simplest approaches possible. However, the errors on the training data can be
either negative or positive, so when taking the mean they can cancel each
other out and produce a zero mean error for the model; it is therefore not a
recommended cost function. It does, however, provide a base for the other
regression cost functions.

b. Mean Squared Error (MSE)

Mean squared error is one of the most commonly used cost functions. It
addresses the drawback of the mean error cost function by squaring the
difference between the actual and predicted values; because of the square,
negative errors cannot cancel out positive ones. The formula for calculating
MSE is:

MSE = (1/N) × Σ (yi − ŷi)²

c. Mean Absolute Error (MAE)

Mean absolute error also overcomes the issue of the mean error cost function,
by taking the absolute difference between the actual and predicted values. The
formula for calculating MAE is:

MAE = (1/N) × Σ |yi − ŷi|

The mean absolute error cost function is also known as L1 loss. It is less
affected by noise and outliers, hence it gives better results when the dataset
contains noise or outliers.
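
A quick sketch comparing the three regression costs on made-up predictions:

```python
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mean_error = sum(errors) / n              # signed errors can cancel out
mse = sum(e ** 2 for e in errors) / n     # squaring removes the sign
mae = sum(abs(e) for e in errors) / n     # absolute value removes the sign

print(mean_error, mse, mae)   # -0.5, 0.875, 0.75
```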

2. Binary Classification Cost Functions

Classification models are used to make predictions for categorical variables,
such as 0 or 1, cat or dog, etc. The cost function used in a classification
problem is known as the classification cost function, and it is different from
the regression cost function. One of the most commonly used loss functions for
classification is cross-entropy loss.

Binary cross-entropy is a special case of categorical cross-entropy where
there is only a single output variable, for example classification between red
and blue. To better understand it, suppose there is only a single output
variable Y:

Cross-entropy = −y × log(p)           when y = 1
Cross-entropy = −(1 − y) × log(1 − p) when y = 0

The error in binary classification is calculated as the mean of the
cross-entropy over all N training examples:

Binary Cross-Entropy = (Sum of Cross-Entropy for N data) / N
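
A minimal sketch of binary cross-entropy over N examples, using the case-by-case form above (the labels and probabilities are made up):

```python
import math

def binary_cross_entropy(y_true, p_pred):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # -log(p) when y = 1, -log(1 - p) when y = 0
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)

y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.1, 0.8, 0.3]   # predicted probability of class 1
print(binary_cross_entropy(y_true, p_pred))  # ~0.198: low loss, good predictions
```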


3. Multi-class Classification Cost Function

A multi-class classification cost function is used in classification problems
where instances are allocated to one of more than two classes. As in the
binary case, cross-entropy (categorical cross-entropy) is the most commonly
used cost function. It is designed for multi-class classification with target
classes 0, 1, …, n. In a multi-class classification problem, cross-entropy
generates a score that summarizes the mean difference between the actual and
predicted probability distributions.

Backpropagation Algorithm

Backpropagation is the algorithm used to train neural networks by updating the
weights to minimize the cost function.

1. **Forward Pass**: Calculate the output of the network for a given input.
2. **Compute Loss**: Calculate the error using the cost function.
3. **Backward Pass**: Propagate the error backward through the network to
compute gradients of the cost function with respect to the weights.
4. **Update Weights**: Adjust the weights using the gradients to minimize the
error (often using gradient descent).
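
A from-scratch sketch of this loop for a tiny 2-4-1 network learning XOR with NumPy (MSE loss and sigmoid activations are illustrative choices, not prescribed by the notes; results vary with initialization):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 1.0

for _ in range(5000):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. Compute loss (mean squared error)
    loss = np.mean((out - y) ** 2)
    # 3. Backward pass: chain rule gives gradients layer by layer
    d_out = 2 * (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 4. Update weights by gradient descent
    W2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h) / len(X)
    b1 -= lr * d_h.mean(axis=0, keepdims=True)

print("final loss:", round(float(loss), 4))
print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```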
Introduction to Deep Learning

Deep learning is a subset of machine learning that involves neural networks
with many layers (deep neural networks). It is particularly effective at
processing large amounts of data and extracting high-level features. Deep
learning has led to significant advancements in areas such as:

- Image and speech recognition
- Natural language processing
- Autonomous vehicles

Deep learning models, such as CNNs, RNNs, and transformers, are used to
solve complex tasks by learning hierarchical representations of data.

Summary

- **Neural Networks**: Inspired by the brain, used for pattern recognition and
prediction.
- **Perceptron**: Basic building block of neural networks.
- **Activation Functions**: Introduce non-linearity (e.g., Sigmoid, Tanh, ReLU).
- **Network Models**: Comprise input, hidden, and output layers.
- **Cost Function**: Measures error (e.g., MSE, Cross-Entropy).
- **Backpropagation**: Algorithm for training neural networks.
- **Deep Learning**: Utilizes deep neural networks for complex tasks.
