Module4 DS PPT
What Is a Decision Tree?, Entropy, The Entropy of a Partition, Creating a Decision Tree,
Putting It All Together, Random Forests, Neural Networks, Perceptrons, Feed-Forward Neural
Networks, Back propagation, Example: Fizz Buzz, Deep Learning, The Tensor, The Layer
Abstraction, The Linear Layer, Neural Networks as a Sequence of Layers, Loss and
Optimization, Example: XOR Revisited, Other Activation Functions, Example: Fizz Buzz
Revisited, Softmaxes and Cross-Entropy, Dropout, Example: MNIST, Saving and Loading
Models, Clustering, The Idea, The Model, Example: Meetups, Choosing k, Example:
Clustering Colors, Bottom-Up Hierarchical Clustering
Text Book : Chapters 17, 18, 19 and 20
Decision tree
• A decision tree is a popular machine learning algorithm used for both classification and
regression tasks.
• It works by recursively splitting the data into subsets based on certain conditions, resulting
in a tree-like structure.
• A decision tree uses a tree structure to represent a number of possible decision paths and an
outcome for each path.
• Each internal node represents a feature (or attribute), each branch represents a decision rule,
and each leaf node represents the outcome (either a class label for classification or a
continuous value for regression).
[Figure: a decision tree with a root node, internal decision nodes, and leaf nodes]
Working of decision tree
1. Splitting: The algorithm splits the dataset at each node based on a feature that provides the
best separation of the classes. This is often determined using metrics like:
• Entropy: A measure of the impurity or disorder of the labels at a node.
• Gini Impurity: Another measure of impurity or disorder.
• Information Gain: The reduction in entropy or impurity achieved by splitting the dataset.
2. Recursive Partitioning: The process of splitting continues recursively, creating more
branches and nodes until one of the stopping criteria is met, such as:
• A maximum depth of the tree is reached.
• A minimum number of samples per leaf is reached.
• No further information gain can be achieved.
3. Prediction: Once the tree is built, making predictions involves traversing the tree from the
root to a leaf node based on the feature values of the input data. The class or value at the leaf
node is the prediction.
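A minimal sketch of these steps using scikit-learn (an assumed library choice; the toy animal data and feature encoding below are made up for illustration):

```python
# Hypothetical toy example: fit a small classification tree and traverse it for a prediction.
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_legs, is_delicious (0/1)]
X = [[8, 0], [2, 1], [4, 1], [4, 0]]
y = ["spider", "chicken", "cow", "echidna"]

# With criterion="entropy", splits are chosen by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

# Prediction traverses the tree from the root to a leaf using the feature values.
print(tree.predict([[4, 0]]))   # ['echidna']
```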
• There are many decision tree algorithms, such as ID3, C4.5, CART, GUIDE, and CTREE.
• The most commonly used are ID3 (Iterative Dichotomiser 3), C4.5 (an improved version of
ID3), and CART (Classification and Regression Trees).
Example 1
Example 2:
If you have ever played the game of Twenty Questions, then you are familiar with decision trees.
• “I am thinking of an animal.”
• “Does it have more than five legs?”
• “No.”
• “Is it delicious?”
• “No.”
• “Does it appear on the back of the Australian five-cent coin?”
• “Yes.”
• “Is it an echidna?”
• “Yes, it is!”
This corresponds to the path:
“Not more than 5 legs” → “Not delicious” → “On the 5-cent coin” → “Echidna!”
A “guess the animal” decision tree
The decision tree for hiring
Entropy
• Entropy is a key concept used to measure the impurity or disorder of a
dataset.
• It is used as part of the information gain calculation when determining how
to split nodes in a decision tree.
• The goal is to choose splits that reduce entropy, making the child nodes
more "pure" (i.e., containing mostly one class).
• For a dataset with multiple classes, the entropy is calculated using the
formula:
$$H(S) = -\sum_{i=1}^{C} p_i \log_2(p_i)$$
where $p_i$ is the proportion of instances in class $i$ and $C$ is the total number of classes.
The entropy will be small when every $p_i$ is close to 0 or 1 (i.e., when most of the data is in a
single class), and it will be larger when many of the $p_i$'s are not close to 0 (i.e., when the
data is spread across multiple classes).
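A small from-scratch sketch of this formula in Python (the example label lists are made up):

```python
import math
from collections import Counter
from typing import List

def entropy(labels: List[str]) -> float:
    """H(S) = -sum of p_i * log2(p_i) over the classes present in labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940: mixed classes, high entropy
print(entropy(["yes"] * 10))               # 0.0: a pure set has no disorder
```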
A graph of -p log p
Random Forest
• An ensemble method that builds multiple decision trees and merges their outputs to improve
accuracy and reduce overfitting.
• Each tree in a random forest provides a “vote” for a predicted outcome, and the final result
is based on the majority vote for classification or the average prediction for regression.
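A minimal sketch using scikit-learn's RandomForestClassifier (an assumed library choice; the dataset and parameters are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with a random subset of features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# For classification, each tree votes and the majority class wins.
print("accuracy:", forest.score(X_test, y_test))
```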
Activation function
• The summation result is passed through an activation function to produce the final output y.
• An activation function determines the output of each neuron by introducing non-linearity,
allowing the network to learn complex patterns and relationships within the data.
• The Perceptron typically uses the step function as its activation function, which outputs:
+1 (or class 1) if the weighted sum of inputs ≥0.
−1 (or class 0) otherwise.
Other common activation functions are:
Sigmoid: Compresses outputs to a range between 0 and 1.
ReLU (Rectified Linear Unit): Outputs zero for negative values and passes positive values
unchanged. It’s efficient and commonly used in hidden layers.
Softmax: Used in the output layer of classification models to produce probabilities for each
class.
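A short sketch of these activation functions with NumPy (the example scores are made up):

```python
import numpy as np

def step(x):        # Perceptron activation: 1 if x >= 0, else 0
    return np.where(x >= 0, 1, 0)

def sigmoid(x):     # Squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):        # Zero for negative values, identity for positive values
    return np.maximum(0, x)

def softmax(x):     # Converts a vector of scores into probabilities that sum to 1
    exps = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]
```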
Perceptron algorithm
1. Initialize weights and bias randomly, typically small values close to zero.
2. Loop over each data point in the training set:
• Calculate the output of the perceptron using the current weights and inputs.
• Calculate the error as the difference between the actual label and the
predicted label.
• Update weights and bias based on the error:
$w_i = w_i + \Delta w_i$
$b = b + \Delta b$
where
$\Delta w_i = \alpha \, (y_{\text{true}} - y_{\text{pred}}) \, x_i$
$\Delta b = \alpha \, (y_{\text{true}} - y_{\text{pred}})$
Here, 𝛼 is the learning rate, a hyperparameter that controls how much the
weights adjust with each update.
3. Repeat the process for a specified number of iterations or until convergence,
i.e., when there are no more errors.
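A from-scratch sketch of this training loop (the AND data and hyperparameters below are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=20):
    """X is (n_samples, n_features); y holds labels in {0, 1}."""
    w = np.zeros(X.shape[1])     # weights initialized to small values (here zero)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_pred = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
            error = yi - y_pred                            # y_true - y_pred
            w += alpha * error * xi                        # delta w_i = alpha * error * x_i
            b += alpha * error                             # delta b   = alpha * error
    return w, b

# Learning the AND function (linearly separable, so the perceptron converges).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))
```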
Feed-Forward Neural Network (FNN)
• It is one of the simplest types of artificial neural networks. In a feed-forward network,
information flows in one direction only: from the input layer, through any hidden layers, to
the output layer.
• There are no cycles or loops in this structure, which makes it easier to understand and
implement compared to other types of neural networks.
• Just like in the perceptron, each (noninput) neuron has a weight corresponding to each of its
inputs and a bias.
• As with the perceptron, for each neuron we sum up the products of its inputs and its
weights. But here, rather than applying the step function to that sum, we apply the
sigmoid function.
Why use sigmoid instead of the simpler step_function?
• In order to train a neural network, we need to use calculus, and in order to use calculus,
we need smooth functions.
• step_function isn’t even continuous, and sigmoid is a good smooth approximation of it.
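A small from-scratch sketch of a sigmoid feed-forward pass, in the spirit of the textbook's approach; the hand-tuned XOR weights are illustrative:

```python
import math

def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    # weights includes the bias as its last element; a constant 1 is appended to inputs
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1.0])))

def feed_forward(network, input_vector):
    """network is a list of layers; each layer is a list of neuron weight vectors.
    Returns the outputs of every layer; the last entry is the network's output."""
    outputs = []
    for layer in network:
        output = [neuron_output(neuron, input_vector) for neuron in layer]
        outputs.append(output)
        input_vector = output          # this layer's output feeds the next layer
    return outputs

# A hand-tuned XOR network: one hidden layer of two neurons, one output neuron.
xor_network = [[[20.0, 20.0, -30.0],    # "AND" neuron
                [20.0, 20.0, -10.0]],   # "OR" neuron
               [[-60.0, 60.0, -30.0]]]  # "OR but not AND" neuron
for x in [0, 1]:
    for y in [0, 1]:
        print(x, y, round(feed_forward(xor_network, [x, y])[-1][0]))   # XOR truth table
```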
Layer Composition
Neural networks are built by stacking various types of layers in a specific order.
The most common layer types include:
Input Layer: The first layer that receives the raw data input.
Hidden Layers: Intermediate layers where most of the computation and feature extraction
occur. They can be fully connected (dense) layers, convolutional layers, recurrent layers, etc.
Output Layer: The final layer that produces the network's prediction or classification result.
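A minimal sketch of this layer stacking using Keras (an assumed framework choice; the layer sizes are illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                    # input layer: a flattened 28x28 image
    keras.layers.Dense(128, activation="relu"),   # hidden (fully connected) layer
    keras.layers.Dense(64, activation="relu"),    # another hidden layer
    keras.layers.Dense(10, activation="softmax"), # output layer: class probabilities
])
model.summary()
```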
Loss and Optimization
To train a neural network, you need a loss function to measure the difference between the
model's predictions and the true values, and an optimization algorithm to update the model
parameters to minimize this loss.
1. Loss Functions
The loss function quantifies how well or poorly the model is performing. For most neural
networks, you’ll use one of the following:
•Mean Squared Error (MSE): Common for regression tasks.
•Cross-Entropy Loss: Typically used for classification tasks.
2. Optimization Algorithms
An optimizer adjusts the model’s weights based on the gradients from backpropagation.
Common optimizers include:
•Stochastic Gradient Descent (SGD): A simple and effective method that adjusts
weights by taking small steps proportional to the negative gradient.
•Adam: An advanced version of gradient descent that adapts the learning rate for each
parameter, often leading to faster convergence.
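A small NumPy sketch of these ideas (the function names and sample numbers are illustrative assumptions):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average squared difference, common for regression
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def cross_entropy(y_true_onehot, y_prob):
    # Cross-entropy: heavily penalizes confident wrong predictions, common for classification
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob + 1e-12), axis=1))

def sgd_step(params, grads, learning_rate=0.01):
    # One SGD update: move each parameter a small step against its gradient
    return [p - learning_rate * g for p, g in zip(params, grads)]

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # small errors -> small loss (0.02)
```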
Other activation functions
• Activation functions are essential in neural networks because they introduce non-linearity,
enabling networks to learn complex patterns and approximate non-linear functions.
• Without activation functions, a neural network composed of linear layers would be
equivalent to a single-layer model, limiting its capacity to solve complex tasks.
1. Sigmoid Function
2. Tanh (Hyperbolic Tangent)
3. ReLU (Rectified Linear Unit)
4. Leaky ReLU
5. Softmax
Softmaxes and Cross-Entropy
• Softmax and Cross-Entropy are two important concepts in machine learning, often used
together, especially in classification tasks.
• The Softmax function takes a vector of values (like scores for each class) and converts them
into probabilities. It’s used in the output layer of a neural network to interpret the results as
probabilities for each class.
• Cross-Entropy is a loss function used to measure the difference between two probability
distributions – the true distribution (the actual label) and the predicted distribution (output
from the Softmax function).
How They Work Together
• When Softmax is applied to the neural network’s output, it transforms these raw scores
into probabilities.
• Then, Cross-Entropy is used to measure how well these probabilities align with the
actual class labels, penalizing the model more if it assigns low probability to the correct
class.
• This combination is widely used in multi-class classification tasks because Softmax
ensures the model outputs probabilities.
• Cross-Entropy penalizes incorrect predictions, guiding the model to improve its
accuracy.
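A short NumPy sketch of how the two fit together (the example scores are made up):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

def cross_entropy(probs, true_class):
    # The penalty is large when the probability assigned to the correct class is small.
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])          # raw network outputs ("logits")
probs = softmax(scores)                     # roughly [0.66, 0.24, 0.10]
print(cross_entropy(probs, true_class=0))   # small loss: correct class got high probability
print(cross_entropy(probs, true_class=2))   # large loss: correct class got low probability
```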
Dropout
In machine learning, dropout is a regularization technique primarily used in neural
networks to reduce the risk of overfitting during training.
Random Neuron Deactivation: During each forward pass, dropout temporarily "drops
out" (sets to zero) a random subset of neurons within a layer.
This forces the network to not rely on specific neurons for prediction, distributing the
learning across multiple neurons.
Typically, this is controlled by a probability (called the "dropout rate") that defines the
fraction of neurons to be dropped out.
Applied During Training: Dropout is only used during the training phase, and the
complete network is used during inference (testing/validation) by scaling neuron
activations to account for the missing connections in training.
Benefits:
• Prevents Overfitting: Dropout helps prevent the model from overfitting by ensuring that
the neural network does not become overly reliant on certain neurons.
• Improves Generalization: By making the network more robust to noise and forcing it to
learn more varied representations, dropout improves the network's ability to generalize
on unseen data.
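The training log below is sample Keras output for an MNIST classifier with dropout. A minimal sketch of the kind of model that could produce such output (the slides do not show the code; the architecture and hyperparameters here are assumptions):

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0     # scale pixel values to [0, 1]

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.2),                         # randomly zero 20% of activations during training
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
loss, acc = model.evaluate(x_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {acc:.4f}")
```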
Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - loss: 0.1807 -
sparse_categorical_accuracy: 0.9489 - val_loss: 0.1731 -
val_sparse_categorical_accuracy: 0.9503
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.1672 -
sparse_categorical_accuracy: 0.9542 - val_loss: 0.1651 -
val_sparse_categorical_accuracy: 0.9525
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.1923 -
sparse_categorical_accuracy: 0.9439
Test Loss: 0.1651, Test Accuracy: 0.9525
Clustering
• Clustering is an unsupervised machine learning technique that involves grouping data
points into clusters, or groups, such that data points in the same group are more similar
to each other than to those in other groups.
• This similarity is usually measured using metrics like Euclidean distance, Manhattan
distance, or cosine similarity.
• Clustering is a core concept in data analysis and machine learning, especially when
dealing with unlabeled data, as it allows us to reveal hidden patterns and relationships
within a dataset.
• Different types of clustering methods are
1. K-means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Model-Based Clustering (Gaussian Mixture Models (GMM))
5. Grid-Based Clustering
K-means Clustering:
• One of the simplest clustering methods is k-means, in which the number of clusters k is
chosen in advance.
• The goal is to partition the inputs into sets S1, ..., Sk in a way that minimizes the total
sum of squared distances from each point to the mean of its assigned cluster.
• Divides data into 𝐾 clusters by minimizing the variance within clusters. It randomly
initializes centroids and assigns each data point to the closest centroid.
• The centroids are then updated iteratively.
• An iterative algorithm that usually finds a good clustering:
1. Start with a set of k-means, which are points in d-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point’s assignment has changed, stop and keep the clusters.
4. If some point’s assignment has changed, recompute the means and return to
step 2.
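A from-scratch sketch of these four steps (the toy 2-D points are made up for illustration):

```python
import random
from typing import List

Vector = List[float]

def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vector_mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def k_means(points: List[Vector], k: int) -> List[Vector]:
    means = random.sample(points, k)                     # 1. start with k means
    assignments = [None] * len(points)
    while True:
        new_assignments = [min(range(k), key=lambda i: squared_distance(p, means[i]))
                           for p in points]              # 2. assign each point to its closest mean
        if new_assignments == assignments:               # 3. nothing changed: stop
            return means
        assignments = new_assignments
        for i in range(k):                               # 4. otherwise recompute the means
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:
                means[i] = vector_mean(cluster)

random.seed(0)
points = [[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]]
print(k_means(points, k=2))
```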
Choosing k
There are various ways to choose a k. One that’s reasonably easy to understand involves
plotting the sum of squared errors (between each point and the mean of its cluster) as a
function of k and looking at where the graph “bends”:
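A sketch of this "elbow" plot using scikit-learn's inertia_ (the total within-cluster squared error) and matplotlib, both assumed libraries; the points are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [8, 8], [9, 9]]

ks = range(1, 7)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_ for k in ks]

plt.plot(ks, errors, marker="o")
plt.xlabel("k")
plt.ylabel("total squared error")
plt.title("Look for the bend (elbow) in the curve")
plt.show()
```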
Bottom-Up Hierarchical Clustering
• Bottom-up hierarchical clustering is also known as agglomerative hierarchical
clustering.
• It is a type of clustering approach that builds a hierarchy of clusters by starting with
each data point as an individual cluster and progressively merging clusters based on
their similarity.
• Here we “grow” clusters from the bottom up.
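A minimal sketch using SciPy's hierarchical clustering routines (an assumed library choice; the points and linkage method are illustrative):

```python
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]]

# Every point starts as its own cluster; "single" linkage repeatedly merges the two
# clusters whose closest members are nearest to each other.
merges = linkage(points, method="single")

# Cut the resulting hierarchy so that exactly 2 clusters remain.
print(fcluster(merges, t=2, criterion="maxclust"))
```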