Machine Learning Notes
Understanding Data
Numeric Variables
- **Mean**: The average value of a set of numbers.
- **Median**: The middle value when the numbers are sorted in ascending
order.
- **Mode**: The most frequently occurring value in a set of numbers.
Measuring Spread
- **Range**: The difference between the maximum and minimum values.
- **Variance**: The average of the squared differences from the mean.
- **Standard Deviation**: The square root of the variance, representing how
spread out the numbers are from the mean.
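As a quick illustration, all of these statistics can be computed with Python's standard-library `statistics` module (the sample values below are arbitrary):

```python
import statistics

# Illustrative data; any list of numbers works.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:    ", statistics.mean(values))       # average value -> 5
print("median:  ", statistics.median(values))     # middle value when sorted -> 4.5
print("mode:    ", statistics.mode(values))       # most frequent value -> 4
print("range:   ", max(values) - min(values))     # max minus min -> 7
print("variance:", statistics.pvariance(values))  # mean squared deviation -> 4.0
print("std dev: ", statistics.pstdev(values))     # square root of variance -> 2.0
```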
Review of Distribution
Uniform Distribution
A distribution where all outcomes are equally likely. Each value in the range has
the same probability of occurring.
Normal Distribution
A bell-shaped distribution that is symmetric about the mean. Most of the data
points cluster around the mean, with probabilities tapering off equally on both
sides.
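To make the contrast concrete, here is a small sampling sketch (the distribution parameters are chosen arbitrarily): uniform draws spread evenly across the range, while normal draws cluster around the mean.

```python
import random

random.seed(0)  # reproducible illustration

# Uniform: every value in [0, 10) is equally likely.
uniform_samples = [random.uniform(0, 10) for _ in range(10_000)]

# Normal: values cluster around the mean (5) and taper off symmetrically.
normal_samples = [random.gauss(5, 1) for _ in range(10_000)]

for name, samples in [("uniform", uniform_samples), ("normal", normal_samples)]:
    mean = sum(samples) / len(samples)
    print(f"{name}: mean ~ {mean:.2f}, min ~ {min(samples):.2f}, max ~ {max(samples):.2f}")
```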
Categorical Variables
These are variables that represent categories or groups. They can be nominal
(no natural order, e.g., gender or race) or ordinal (natural order, e.g., levels
of satisfaction).
Lazy Learning
Lazy learning is a type of machine learning where the model generalizes the
training data only when a query is made, rather than during the initial training
phase. The algorithm does not explicitly construct a model; instead, it stores the
training data and performs computations at prediction time.
- **Example**:
If you want to classify a new data point, the algorithm looks at the k nearest
data points in the training set and assigns the most common class among
them; this is the k-nearest neighbours (KNN) approach, sketched below.
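A minimal KNN sketch in Python, with an invented two-cluster dataset:

```python
import math
from collections import Counter

# Toy 2-D training set: (x, y) points with class labels. Values are invented.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

def knn_classify(query, k=3):
    """Return the majority class among the k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))
    top_k_labels = [label for _, label in neighbours[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

print(knn_classify((2.0, 2.0)))  # expected: "A" (close to the first cluster)
```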
Probabilistic Learning - Naive Bayes Classifier
Naive Bayes is a probabilistic classifier based on applying Bayes' theorem with
strong (naive) independence assumptions between the features.
Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge
of conditions that might be related to the event. The formula is:
P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
- P(A|B) is the posterior probability of class A given predictor B.
- P(B|A) is the likelihood of predictor B given class A.
- P(A) is the prior probability of class A.
- P(B) is the prior probability of predictor B.
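As a numeric sketch of the formula (all probabilities below are made up for illustration):

```python
# Hypothetical numbers: P(A) = prior, P(B|A) = likelihood, P(B) = evidence.
p_a = 0.01          # prior probability of class A
p_b_given_a = 0.9   # likelihood of predictor B given class A
p_b = 0.05          # prior probability of predictor B

# Bayes' theorem: posterior = likelihood * prior / evidence.
p_a_given_b = (p_b_given_a * p_a) / p_b
print(p_a_given_b)  # 0.18
```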
Joint Probability
The joint probability of two events A and B is the probability that both events
occur. It is denoted as P(A ∩ B) or P(A, B).
- **Example**:
If the probability of it raining (A) is 0.3 and the probability of it being windy (B)
is 0.4, the joint probability P(A ∩ B) depends on the relationship between A and B;
if the two events are independent, P(A ∩ B) = P(A) × P(B) = 0.3 × 0.4 = 0.12.
Conditional Probability
The conditional probability of an event A given that another event B has
occurred is denoted as P(A|B) and is defined as P(A|B) = P(A ∩ B) / P(B).
- **Example**:
The probability that it will rain today given that it rained yesterday can be
calculated if the two events are dependent.
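Tying these pieces together, here is a minimal Naive Bayes sketch; the toy weather dataset and its feature values are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy training data: (weather, windy) -> play. All values are invented.
data = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("rainy", "yes"), "no"),
    (("sunny", "no"), "yes"),
]

# Prior counts P(class) and likelihood counts P(feature value | class).
class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)  # (feature_index, class) -> Counter of values
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    """Pick the class maximizing P(class) * product of P(value | class)."""
    best_class, best_score = None, -1.0
    total = sum(class_counts.values())
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # Naive independence assumption: multiply per-feature likelihoods.
            score *= feature_counts[(i, label)][value] / count
        if score > best_score:
            best_class, best_score = label, score
    return best_class

print(predict(("sunny", "no")))  # expected: "yes"
```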
Summary
- **Lazy Learning**: Stores data and delays generalization until query time
(e.g., KNN).
- **K-Nearest Neighbour (KNN)**: Classifies based on the majority class of the
nearest k neighbours.
- **Probabilistic Learning (Naive Bayes)**: Uses Bayes' theorem with the
assumption of feature independence to classify data.
- **Bayes' Theorem**: Calculates posterior probabilities to update predictions.
- **Joint Probability**: The probability of two events occurring together.
- **Conditional Probability**: The probability of one event occurring given that
another event has occurred.
Decision trees are a popular method for classification and regression tasks.
They use a tree-like model of decisions and their possible consequences,
including chance event outcomes, resource costs, and utility. Decision trees are
constructed using a divide-and-conquer strategy:
1. **Select the Best Feature**:
- Evaluate each candidate feature (for example, by information gain) and
choose the one that best separates the classes.
2. **Create a Decision Node**:
- Add a node that tests the chosen feature.
3. **Split the Data**:
- Partition the dataset into subsets, one for each value (or range) of the
chosen feature.
4. **Recursive Partitioning**:
- Repeat the process for each subset until one of the stopping conditions is
met, such as:
- All instances in a subset belong to the same class.
- No remaining features to split on.
- A pre-defined maximum tree depth is reached.
5. **Pruning** (Optional):
- Pruning is used to reduce the size of the tree and prevent overfitting by
removing branches that have little importance.
### Example
1. **Root Node**:
- Calculate entropy for the entire dataset.
- Calculate information gain for each feature.
- Choose the feature with the highest information gain, e.g., "Weather".
2. **Split Data**:
- Split data based on "Weather".
- Create branches for "Sunny" and "Rainy".
3. **Sub-Nodes**:
- For each branch, repeat the process:
- Calculate entropy for the subset.
- Calculate information gain for remaining features.
- Choose the best feature, e.g., "Temperature".
4. **Leaf Nodes**:
- Continue until all subsets are pure (only contain one class) or another
stopping criterion is met.
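Since the example leans on entropy and information gain, here is a minimal sketch of both, using the standard definitions Entropy = −Σ p·log2(p) and Gain = parent entropy − weighted child entropy; the toy data mirrors the "Weather" split above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, feature):
    """Entropy of the parent minus the weighted entropy of each subset."""
    parent = entropy(labels)
    weighted = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return parent - weighted

# Toy data mirroring the example: weather -> play? (values invented)
rows = [{"Weather": "Sunny"}, {"Weather": "Sunny"},
        {"Weather": "Rainy"}, {"Weather": "Rainy"}]
labels = ["Yes", "Yes", "No", "No"]
print(information_gain(rows, labels, "Weather"))  # 1.0: a perfect split
```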
Summary
- **Decision Trees**: Use a tree structure to make decisions and classify data
based on features.
- **Divide and Conquer**: Recursively split data into subsets based on the best
feature at each step.
- **Decision Nodes**: Represent decisions based on features.
- **Leaf Nodes**: Represent class labels or values.
- **Algorithms**: Common decision tree algorithms include ID3, C4.5, and CART.
- **Pruning**: Can be applied to prevent overfitting by removing less
important branches.
Decision trees are intuitive and easy to interpret, making them a valuable tool
for both classification and regression tasks in machine learning.
Regression Methods
Regression analysis is a statistical method used to examine the relationship
between a dependent variable and one or more independent variables. It helps
in understanding how the dependent variable changes when any one of the
independent variables is varied, while the other independent variables are held
fixed.
y = a0 + a1x + ε
Where:
- a0 is the intercept of the regression line (the value of y when x = 0).
- a1 is the slope of the regression line, which tells whether the line is
increasing or decreasing.
- ε is the error term.
The strength of the linear relationship between the two variables can be
measured with the Pearson correlation coefficient:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]
where:
- r: correlation coefficient
- xᵢ: i-th value of the first dataset X
- x̄: mean of the first dataset X
- yᵢ: i-th value of the second dataset Y
- ȳ: mean of the second dataset Y
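A short sketch, with made-up data, showing how a0, a1, and r fall out of the sums above:

```python
import math

# Arbitrary example data with a roughly linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.2, 5.9, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope a1 and intercept a0 for y = a0 + a1*x.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
a1 = sxy / sxx
a0 = y_bar - a1 * x_bar

# Pearson correlation coefficient r from the same sums.
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

print(f"y = {a0:.2f} + {a1:.2f}x, r = {r:.4f}")  # a strong positive correlation
```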
Multiple Linear Regression
Multiple linear regression extends simple linear regression to two or more
independent variables:
y = a0 + a1x1 + a2x2 + ... + anxn + ε
Summary
- **Regression Analysis**: Examines how a dependent variable changes as the
independent variables vary.
- **Simple Linear Regression**: Fits a straight line y = a0 + a1x + ε to the data.
- **Correlation Coefficient (r)**: Measures the strength and direction of the
linear relationship between two variables.
- **Multiple Linear Regression**: Predicts the dependent variable from several
independent variables.
Neural Networks
Neural networks are computational models inspired by the human brain. They
are designed to recognize patterns, make decisions, and predict outcomes
based on input data.
Biological Motivation
Neural networks are inspired by the structure and function of the human brain,
which consists of interconnected neurons. Each neuron receives input signals,
processes them, and transmits output signals to other neurons. Similarly,
artificial neural networks consist of interconnected nodes (neurons) that
process information in a layered structure.
Perceptron
The perceptron is the simplest type of artificial neural network and serves as
the building block for more complex networks. It consists of a single neuron
with adjustable weights and a bias.
The main components of a perceptron are:
o Input Nodes:
These accept the initial data into the system for further processing. Each input
node contains a real numerical value.
o Weight and Bias:
A weight represents the strength of the connection between units and is
directly proportional to how strongly the associated input neuron influences
the output. The bias can be thought of as the intercept in a linear equation.
o Activation Function:
This final component determines whether the neuron will fire or not. The
activation function can be considered primarily as a step function. Activation
functions introduce non-linearity into the network, enabling it to learn
complex patterns. Common choices include:
o Sign function
o Step function
o Sigmoid function
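A minimal single-neuron perceptron sketch with a step activation; the AND-gate data, learning rate, and epoch count are illustrative choices, and the update is the classic perceptron learning rule:

```python
def step(z):
    """Step activation: fire (1) if the weighted sum exceeds 0, else 0."""
    return 1 if z > 0 else 0

def predict(weights, bias, inputs):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Train on the AND function using the perceptron update rule.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # a few epochs is plenty for this tiny problem
    for inputs, target in data:
        error = target - predict(weights, bias, inputs)
        # Nudge each weight in proportion to its input and the error.
        weights = [w + lr * error * x for w, x in zip(weights, inputs)]
        bias += lr * error

print([predict(weights, bias, x) for x, _ in data])  # expected [0, 0, 0, 1]
```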
Perceptrons are a type of artificial neural network used in machine learning for
binary classification tasks. They are the simplest form of a neural network,
consisting of a single layer of weights that connect input features to the output.
There are several types of perceptrons, each with different characteristics and
applications:
1. **Single-Layer Perceptron**:
- **Definition**: A single-layer perceptron is the most basic form of a neural
network. It consists of a single layer of weights connecting input features to the
output.
- **Applications**: It is used for linearly separable problems, meaning
problems where a single hyperplane can separate the data into two classes.
2. **Multi-Layer Perceptron (MLP)**:
- **Definition**: A perceptron with one or more hidden layers between the
input and output layers, allowing it to learn non-linear decision boundaries.
- **Applications**: Used for problems that are not linearly separable, such as
the XOR problem.
3. **Binary Perceptron**:
- **Definition**: A binary perceptron is a type of single-layer perceptron that
outputs binary results (0 or 1).
- **Applications**: It is used for binary classification tasks where the goal is
to categorize data into two distinct classes.
4. **Multi-Class Perceptron**:
- **Definition**: A perceptron that has been adapted to handle multi-class
classification problems. This is often achieved using a technique such as one-vs-
all (OvA) or one-vs-one (OvO) to extend the binary perceptron.
- **Applications**: Used for classification tasks with more than two classes,
such as categorizing types of animals in an image.
5. **Probabilistic Perceptron**:
- **Definition**: A probabilistic perceptron incorporates probabilistic
methods, such as using a sigmoid or softmax function for the output layer to
provide a probability distribution over possible output classes.
- **Applications**: Used in scenarios where probabilistic interpretation of
the output is beneficial, such as in probabilistic decision-making systems.
6. **Kernel Perceptron**:
- **Definition**: An extension of the perceptron that uses kernel functions to
map input features into a higher-dimensional space, allowing for the
classification of non-linearly separable data.
- **Applications**: Useful in scenarios where the data is not linearly
separable in its original space, similar to the application of support vector
machines (SVMs) with kernel tricks.
Cost Function
The cost function measures the difference between the predicted output and
the actual output. It guides the training process by quantifying the error.
Common cost functions fall into two groups: regression cost functions and
classification cost functions.
Regression Cost Functions
There are three commonly used regression cost functions, which are as follows:
a. Mean Error (ME)
In this type of cost function, the error is calculated for each training example
and the mean of all these errors is taken. The errors from the training data can
be either negative or positive, so while finding the mean they can cancel each
other out and result in a zero mean error even for a poorly performing model.
For this reason, Mean Error is rarely used on its own.
b. Mean Squared Error (MSE)
Mean Squared Error is one of the most commonly used cost function methods.
It improves on the drawback of the Mean Error cost function, as it calculates
the square of the difference between the actual value and the predicted value.
Because of the squaring, errors cannot cancel each other out, and larger errors
are penalized more heavily. MSE is also known as L2 loss:
MSE = (1/N) Σ (yᵢ − ŷᵢ)²
where yᵢ is the actual value and ŷᵢ the predicted value.
c. Mean Absolute Error (MAE)
Mean Absolute Error also overcomes the issue of the Mean Error cost function
by taking the absolute difference between the actual value and the predicted
value; it is also known as L1 loss. It is not strongly affected by noise or
outliers, hence giving better results if the dataset has noise or outliers:
MAE = (1/N) Σ |yᵢ − ŷᵢ|
Classification Cost Functions
Classification models predict discrete outputs, for example 0 or 1, or cat or
dog. The cost function used in a classification problem is known as the
classification cost function, and it differs from the regression cost functions
above. One of the most commonly used loss functions for classification is
cross-entropy loss.
The binary cost function is a special case of categorical cross-entropy, where
there is only one output class, for example classification between red and blue.
To better understand it, suppose there is only a single output variable Y taking
the value 0 or 1, and the model predicts a probability ŷ. The error in binary
classification is calculated as the mean of the cross-entropy over all N training
examples, which means:
Binary cross-entropy = −(1/N) Σ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Categorical cross-entropy is used when instances are allocated to one of more
than two classes. It is computed similarly to the binary cost function and is
designed so that it can be used with multi-class classification where the target
values are one-hot encoded.
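These losses translate directly into code; this sketch uses invented example values:

```python
import math

def mse(actual, predicted):
    """Mean Squared Error (L2 loss): average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean Absolute Error (L1 loss): average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def binary_cross_entropy(actual, predicted):
    """Mean cross-entropy for labels in {0, 1} and predicted probabilities."""
    return -sum(
        a * math.log(p) + (1 - a) * math.log(1 - p)
        for a, p in zip(actual, predicted)
    ) / len(actual)

# Example values (invented): regression targets and binary labels.
print(mse([3.0, 5.0], [2.5, 5.5]))               # 0.25
print(mae([3.0, 5.0], [2.5, 5.5]))               # 0.5
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # ~ 0.164
```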
Backpropagation Algorithm
1. **Forward Pass**: Calculate the output of the network for a given input.
2. **Compute Loss**: Calculate the error using the cost function.
3. **Backward Pass**: Propagate the error backward through the network to
compute gradients of the cost function with respect to the weights.
4. **Update Weights**: Adjust the weights using the gradients to minimize the
error (often using gradient descent).
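A minimal sketch of these four steps on a single sigmoid neuron; the data and hyperparameters are invented, and a real network applies the same chain rule layer by layer:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(0.0, 0.0), (1.0, 1.0)]  # (input, target) pairs
w, b, lr = 0.5, 0.0, 1.0

for _ in range(1000):
    for x, target in data:
        # 1. Forward pass: compute the neuron's output.
        y = sigmoid(w * x + b)
        # 2. Compute loss (squared error); its gradient drives the update.
        # 3. Backward pass: chain rule through the loss and the sigmoid.
        dloss_dy = 2 * (y - target)
        dy_dz = y * (1 - y)            # derivative of the sigmoid
        dz_dw, dz_db = x, 1.0
        # 4. Update weights opposite to the gradient (gradient descent).
        w -= lr * dloss_dy * dy_dz * dz_dw
        b -= lr * dloss_dy * dy_dz * dz_db

# Outputs approach the targets [0, 1] as training proceeds.
print([round(sigmoid(w * x + b), 2) for x, _ in data])
```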
Introduction to Deep Learning
Deep learning models, such as convolutional neural networks (CNNs), recurrent
neural networks (RNNs), and transformers, are used to solve complex tasks by
learning hierarchical representations of data.
Summary
- **Neural Networks**: Inspired by the brain, used for pattern recognition and
prediction.
- **Perceptron**: Basic building block of neural networks.
- **Activation Functions**: Introduce non-linearity (e.g., Sigmoid, Tanh, ReLU).
- **Network Models**: Comprise input, hidden, and output layers.
- **Cost Function**: Measures error (e.g., MSE, Cross-Entropy).
- **Backpropagation**: Algorithm for training neural networks.
- **Deep Learning**: Utilizes deep neural networks for complex tasks.