Module 1: Introduction to Machine Learning
By Sneha Sureddy
Machine Learning, a term coined by Arthur Samuel in 1959, allows a machine to learn from examples and experience without being explicitly programmed.
Machine Learning is the study of algorithms that improve their performance at some task with experience.
Machine Learning and Deep Learning
How To Make A Machine Learn
Supervised Learning
[Figure: an apple and an orange separated by a learned decision function / hypothesis; companion diagrams contrast supervised classification and unsupervised classification, each built around a decision function / hypothesis.]
AlphaGo: machine vs. human (match against Ke Jie)
Learned Model Parameters
[Figure: a training set of pairs (x1, y1), (x2, y2), ..., (xN, yN), where each input xi consists of features xi1 ... xiM, is fed into a mathematical model to produce the learned parameters.]
Logistic Regression: Can be applied
when data is linearly separable.
[Figure: example of linearly separable data.]
Training Set
[Figure: the training set pairs each input x (the data) with an outcome y, giving (x1, y1), (x2, y2), ..., (xN, yN); the learned model is then used to make a PREDICTION for a new, unseen input.]
Linear Predictive Model
zi = (b1 × xi1) + (b2 × xi2) + ⋯ + (bM × xiM) + b0
where b0 is the bias term and b1 ... bM are the weights on the M features.
Example:
z1 = (b1 × 0.5) + (b2 × 0.8) + (b3 × 75) + (b4 × 1.2) + b0
z2 = (b1 × 0.2) + (b2 × 0.95) + (b3 × 83) + (b4 × 1.3) + b0, with label y2 = 0
Sigmoid function
The sigmoid function, also known as the logistic function, is a
mathematical function that maps any real-valued number to a
value between 0 and 1. It is defined as:
sigmoid(z) = 1 / (1 + exp(-z))
The sigmoid function has an S-shaped curve, where:
As z becomes very large and negative, sigmoid(z) approaches 0
As z becomes very large and positive, sigmoid(z) approaches 1
Sigmoid outputs probabilities, making it suitable for binary
classification problems where the target variable is 0 or 1, yes or
no, etc.
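A minimal Python sketch of the sigmoid function as defined above (NumPy is used only for the exponential):

import numpy as np

def sigmoid(z):
    # Map any real-valued z to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5
print(sigmoid(100))   # ~1.0, matching the worked example below
print(sigmoid(-100))  # ~0.0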
The sigmoid of 100 is:
sigmoid(100) = 1 / (1 + e^(-100))
Using the approximate value of e (2.71828), we can calculate the sigmoid:
sigmoid(100) ≈ 1 / (1 + 2.71828^(-100))
sigmoid(100) ≈ 1 / (1 + 3.7e-44)
sigmoid(100) ≈ 1
So the sigmoid of 100 is approximately 1. Note that the sigmoid function approaches 1 as the input value increases and approaches 0 as the input value decreases. In this case, the input value of 100 is large enough that the sigmoid function outputs a value very close to 1.
Convert to a Probability
Example features: Cloud Cover = 0.5, Humidity = 80% (0.8), Temperature = 75, Air Pressure = 1.2.
zi = (b1 × 0.5) + (b2 × 0.8) + (b3 × 75) + (b4 × 1.2) + b0
Passing zi through the sigmoid function gives the prediction p(yi = 1 | xi) = σ(zi), the chance of rain.
The b parameters tell us how important each feature is to the prediction.
[Figure: the S-shaped sigmoid curve mapping zi to a value between 0 and 1.]
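A short sketch of this conversion in Python; the values chosen for b0 ... b4 are hypothetical stand-ins for learned parameters:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example features: cloud cover, humidity, temperature, air pressure
x = np.array([0.5, 0.8, 75.0, 1.2])

# Hypothetical parameters; real values would come from training
b = np.array([1.5, 2.0, 0.01, -0.5])   # weights b1..b4
b0 = -2.0                               # bias

z = np.dot(b, x) + b0   # linear predictive model
p_rain = sigmoid(z)     # convert the score to a probability
print(f"z = {z:.3f}, chance of rain = {p_rain:.3f}")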
Learned Model Parameters
[Figure: logistic regression drawn as a model (or "network"): each training input xi with features xi1, xi2, ..., xiM is combined as zi = (b1 × xi1) + (b2 × xi2) + ⋯ + (bM × xiM) + b0 and passed through σ(zi); training on (x1, y1) ... (xN, yN) yields the learned parameters b0 ... bM. With K outputs, the network produces zi1 ... ziK.]
[Figure: a grayscale image of a handwritten digit shown as its grid of pixel intensity values, ranging from 0 to 255; each pixel becomes one input feature.]
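A small sketch of how such an image can be turned into a feature vector for a model like logistic regression; the 28×28 size is an assumption (typical of handwritten-digit datasets such as MNIST):

import numpy as np

# Assume a 28x28 grayscale image with pixel intensities 0..255;
# random values stand in here for a real digit image.
image = np.random.randint(0, 256, size=(28, 28))

# Flatten the grid into a single feature vector of length 784
# and scale intensities to the range 0..1.
x = image.reshape(-1) / 255.0

print(x.shape)  # (784,) -> one input feature per pixel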
Single Filter (Shallow Learning)
Only a single filter is used; the filter looks for the average shape.
Layers
1. Input Layer: This layer receives the input features or data. The number of nodes in this layer corresponds to the number of input features.
NEURAL NETWORK REPRESENTATION
Artificial Neural Networks (ANNs) are programs designed to solve problems by trying to mimic the structure and the function of our nervous system.
Neural networks are based on simulated neurons, which are joined together in a variety of ways to form networks (the first example is a 3-4-1 architecture, the second a 3-4-2 architecture).
Neuron
A neuron typically consists of:
1. Inputs: Receive data from other neurons or external inputs.
2. Weights: Assigns importance to each input.
3. Bias: Adds a constant value to the weighted sum.
4. Activation function: Determines the output.
5. Output: Sends the result to other neurons or to the output of the
network.
Activation function
The activation function is used to determine the output of a node given its input. It introduces non-linearity into the node's output.
Common activation functions include:
1. Sigmoid: Maps the input to a value between 0 and 1.
2. ReLU (Rectified Linear Unit): Maps all negative values to 0 and all positive
values to the same value.
3. Tanh (Hyperbolic Tangent): Maps the input to a value between -1 and 1.
4. Softmax: Used for multi-class classification problems, it maps the input to a
probability distribution over all classes.
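A minimal NumPy sketch of these four activation functions, following the descriptions above:

import numpy as np

def sigmoid(z):
    # Maps the input to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Maps negative values to 0, keeps positive values unchanged
    return np.maximum(0, z)

def tanh(z):
    # Maps the input to a value between -1 and 1
    return np.tanh(z)

def softmax(z):
    # Maps a vector of scores to a probability distribution over classes
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), tanh(z), softmax(z), sep="\n")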
Neural networks & Deep learning
Neural networks are a type of machine learning model inspired by the
structure and function of the human brain.
They consist of layers of interconnected nodes (neurons) that process and
transmit information.
Deep learning is a subfield of machine learning that focuses on neural
networks with multiple layers, typically more than three.
Deep learning models are designed to learn complex patterns
and representations in data, such as images, speech, and text.
1. Forward pass: Input data flows through the network, layer by layer, to
produce an output.
2. Error calculation: The difference between the predicted output and the
actual output (target) is calculated, resulting in an error or loss.
3. Backward pass: The error is propagated backwards through the network,
layer by layer, to calculate the gradients of the loss function with respect to
each parameter.
4. Weight update: The gradients are used to update the model parameters
(weights and biases) to minimize the loss function.
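A compact NumPy sketch of these four steps for a single-layer logistic-regression "network" trained with gradient descent; the toy data and learning rate are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary labels (illustrative only)
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.8]])
y = np.array([1.0, 0.0, 0.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # weights
b = 0.0                  # bias
lr = 0.5                 # learning rate

for epoch in range(100):
    # 1. Forward pass
    p = sigmoid(X @ w + b)
    # 2. Error calculation (cross-entropy loss)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # 3. Backward pass: gradients of the loss w.r.t. each parameter
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # 4. Weight update
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss: {loss:.4f}")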
Overfitting
The model performs well during training but does not perform well during
testing.
Causes of overfitting:
1. Model complexity: Using a model with too many parameters or layers.
2. Too much training: Training the model for too many epochs or iterations.
3. Small training dataset: Using a dataset that is too small to capture the
underlying patterns.
4. Noise in the data: Presence of noise or outliers in the training data.
To overcome overfitting, techniques like validation, early stopping, and regularization can be used.
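A minimal sketch of early stopping; train_one_epoch and evaluate are hypothetical placeholder helpers, not a real library API:

def early_stopping_train(model, train_data, val_data,
                         max_epochs=100, patience=5):
    # train_one_epoch(model, data) and evaluate(model, data) are assumed
    # to exist; evaluate returns the loss on the given dataset.
    best_val_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)    # fit on the training set
        val_loss = evaluate(model, val_data)  # check the validation set

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0    # still improving, keep going
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}")
                break
    return model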
Validation methods
Validation methods in deep learning are essential techniques used to evaluate the
performance of a model during training and prevent overfitting.
Validation, in the context of machine learning and deep learning, refers to the
process of evaluating a model's performance on a separate dataset, called the
validation set, during training.
This dataset is not used to fit the model's parameters; rather, it is used for assessing the model's performance and making adjustments as needed.
Validation Process:
Split data: Divide the available data into training, validation, and testing sets
(e.g., 70% for training, 10% for validation, and 20% for testing).
Train model: Train the model on the training set.
Evaluate on validation set: Evaluate the model's performance on the validation
set during training.
Adjust hyperparameters: Based on validation performance, adjust hyperparameters or training parameters.
Repeat: Repeat steps until validation performance improves.
Final evaluation: Evaluate the final model on the testing set to estimate its
performance on new, unseen data.
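A sketch of the 70/10/20 split described above using scikit-learn's train_test_split applied twice; the dataset here is randomly generated just to show the resulting sizes:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative dataset: 1000 samples with 10 features each
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First split off the 20% test set, then carve 10% of the total
# (1/8 of the remaining 80%) out as the validation set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.125, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 100 200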
Split Data into Separate Groups
[Figure: all available data pairs (x1, y1) ... (xN, yN) are randomly assigned into training, validation, and testing groups; the training set is used to fit the model, the validation set to refine it, and the testing set for the final evaluation.]
[Figure: a plot of f(b) against b; each update moves b in the direction of the negative slope, downhill toward the minimum.]
Gradient Descent Optimizer
Visualize the entire hypothesis space of possible weight vectors and their associated E values.
W0, W1: weights of a linear unit
E: error for a fixed set of training examples
Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps.
At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface.
Derivation of the Gradient Descent Rule
How can we calculate the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of E with respect to each component of the weight vector.
The learning rate controls the step size: if the step size is too big, we may miss the global minimum; if it is too small, it takes a lot of time to converge.
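Stated as a reference (standard notation; the slides' own symbols may differ), the direction of steepest descent is given by the negative gradient, and each weight is updated against its partial derivative:

∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wM ]
wi ← wi − η · ∂E/∂wi

where η is the learning rate.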
Gradient Descent (GD):
1. Batch processing: GD uses the entire training dataset to compute the gradient of
the loss function.
2. GD calculates the exact gradient of the loss function.
3. GD updates the model parameters after processing the entire dataset.
Comparison per weight update:
GD: more computation time per weight update; used with a larger step size per weight update.
SGD: less computation time per weight update; used with a smaller step size per weight update.
Summary
SGD has several advantages over GD:
1. SGD converges faster than GD, especially for large datasets.
2. SGD requires less memory since it only processes a single sample or small
batch at a time.
GD is more accurate but slower, while SGD is faster but less accurate. The
choice between GD and SGD depends on the specific problem, dataset size, and
computational resources.
In practice, Mini-Batch Gradient Descent is often used. It computes the gradient from a small batch of samples.
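A sketch contrasting a full-batch gradient step with mini-batch updates, reusing the logistic-regression gradient; the data, batch size, and learning rate are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, w, b):
    # Gradient of the cross-entropy loss for logistic regression
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # toy dataset
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, lr = np.zeros(5), 0.0, 0.1

# Batch GD: one update per pass over the entire dataset
grad_w, grad_b = gradient(X, y, w, b)
w, b = w - lr * grad_w, b - lr * grad_b

# Mini-batch gradient descent: many cheaper updates per pass
batch_size = 32
for start in range(0, len(X), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    grad_w, grad_b = gradient(Xb, yb, w, b)
    w, b = w - lr * grad_w, b - lr * grad_b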
Evaluating neural networks
Evaluating neural networks involves assessing their performance on a given task.
Here are some ways to evaluate neural networks:
1. Accuracy: Measure the proportion of correct predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where: TP = True Positives (correctly predicted instances)
TN = True Negatives (correctly predicted non-instances)
FP = False Positives (incorrectly predicted instances)
FN = False Negatives (incorrectly predicted non-instances)
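For example, computing accuracy directly from the four counts (the counts here are made up):

# Hypothetical confusion-matrix counts for a binary classifier
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy: {accuracy:.2f}")  # 85 / 100 = 0.85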
By implementing early stopping, you can train more efficient and effective
neural networks.
Linear Regression and Logistic Regression are both supervised learning
algorithms used for prediction, but they differ in their approach and application:
Linear Regression
1. Continuous output: Predicts a continuous value (e.g., price, temperature).
2. Linear relationship: Assumes a linear relationship between inputs and output.
3. Mean squared error: Optimizes for mean squared error between predicted and
actual values.
4. Regression: Used for regression tasks, like predicting a continuous value.
No activation function is used here; the model captures linear relationships.
Logistic Regression
1. Binary output: Predicts a binary value (e.g., 0/1, yes/no).
2. Non-linear relationship: Uses a sigmoid function to model a non-linear
relationship between inputs and output.
3. Cross-entropy loss: Optimizes for cross-entropy loss between predicted
probabilities and actual labels.
4. Classification: Used for classification tasks, like predicting a binary label.
Here an activation function (sigmoid) is used to convert the linear regression equation into the logistic regression equation, i.e., the model captures non-linear relationships.
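A brief scikit-learn sketch of the two models side by side; the data is randomly generated for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Linear Regression: continuous target (e.g., a price or temperature)
y_cont = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
lin = LinearRegression().fit(X, y_cont)
print(lin.predict(X[:2]))          # continuous values

# Logistic Regression: binary target (0/1)
y_bin = (X[:, 0] + X[:, 2] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print(log.predict(X[:2]))          # class labels 0 or 1
print(log.predict_proba(X[:2]))    # class probabilities from the sigmoid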
Underfitting and overfitting
Underfitting and overfitting are two common problems in machine learning:
Underfitting
Occurs when a model is too simple or has too few parameters to capture the underlying patterns in the data.
The model fails to learn from the training data and performs poorly on both training and test data.
Symptoms: high bias, low variance, poor performance on training data, poor performance on test data.
Solutions: increase model complexity, add more features and samples, use a different algorithm.
Overfitting
Occurs when a model is too complex or has too many parameters, fitting the noise in the training data rather than the underlying patterns.
The model performs well on training data but poorly on test data.
Symptoms: low bias, high variance, good performance on training data, poor performance on test data.
Solutions: regularization, early stopping, data augmentation, cross-validation.
To avoid underfitting and overfitting, aim for a balance between model complexity
and data complexity. Use techniques like cross-validation, regularization, and early
stopping to find the optimal balance.
Bias-Variance