
Module 4: Decision Tree

What Is a Decision Tree?, Entropy, The Entropy of a Partition, Creating a Decision Tree,
Putting It All Together, Random Forests, Neural Networks, Perceptrons, Feed-Forward Neural
Networks, Back propagation, Example: Fizz Buzz, Deep Learning, The Tensor, The Layer
Abstraction, The Linear Layer, Neural Networks as a Sequence of Layers, Loss and
Optimization, Example: XOR Revisited, Other Activation Functions, Example: Fizz Buzz
Revisited, Softmaxes and Cross-Entropy, Dropout, Example: MNIST, Saving and Loading
Models, Clustering, The Idea, The Model, Example: Meetups, Choosing k, Example:
Clustering Colors, Bottom-Up Hierarchical Clustering
Textbook: Chapters 17, 18, 19, and 20
Decision tree
• A decision tree is a popular machine learning algorithm used for both classification and
regression tasks.
• It works by recursively splitting the data into subsets based on certain conditions, resulting
in a tree-like structure.
• A decision tree uses a tree structure to represent a number of possible decision paths and an
outcome for each path.
• Each internal node represents a feature (or attribute), each branch represents a decision rule,
and each leaf node represents the outcome (either a class label for classification or a
continuous value for regression).

Figure: structure of a decision tree, showing the root node, decision nodes, and leaf nodes.
Working of decision tree
1. Splitting: The algorithm splits the dataset at each node based on a feature that provides the
best separation of the classes. This is often determined using metrics like:
• Entropy: A measure of impurity or disorder in the dataset.
• Gini Impurity: Another measure of impurity or disorder.
• Information Gain: The reduction in entropy or impurity achieved by splitting the dataset.
2. Recursive Partitioning: The process of splitting continues recursively, creating more
branches and nodes until one of the stopping criteria is met, such as:
• A maximum depth of the tree is reached.
• A minimum number of samples per leaf is reached.
• No further information gain can be achieved.
3. Prediction: Once the tree is built, making predictions involves traversing the tree from the
root to a leaf node based on the feature values of the input data. The class or value at the leaf
node is the prediction.
• There are many decision tree algorithms, such as ID3, C4.5, CART, GUIDE, and CTREE.
• The most commonly used decision tree algorithms are ID3 (Iterative Dichotomiser 3), C4.5 (an improved version of ID3), and CART (Classification and Regression Trees).
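• As an illustrative sketch (not from the textbook), the same ideas can be tried with scikit-learn's DecisionTreeClassifier; the Iris dataset and the parameters below are assumptions chosen only for demonstration.

# A minimal sketch: training a decision tree with scikit-learn (illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                                  # features and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion='entropy' uses information gain for splitting; max_depth is a stopping criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)                                         # recursive partitioning happens here

print("Test accuracy:", tree.score(X_test, y_test))                # prediction traverses root to leaf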
Example 1
Example 2:

If you have ever played the game of Twenty Questions, then you are familiar with decision trees.

• “I am thinking of an animal.”
• “Does it have more than five legs?”
• “No.”
• “Is it delicious?”
• “No.”
• “Does it appear on the back of the Australian five-cent coin?”
• “Yes.”
• “Is it an echidna?”
• “Yes, it is!”
This corresponds to the path:
“Not more than 5 legs” → “Not delicious” → “On the 5-cent coin” → “Echidna!”
A “guess the animal” decision tree
The decision tree for hiring
Entropy
• Entropy is a key concept used to measure the impurity or disorder of a
dataset.
• It is used as part of the information gain calculation when determining how
to split nodes in a decision tree.
• The goal is to choose splits that reduce entropy, making the child nodes
more "pure" (i.e., containing mostly one class).
• For a dataset with multiple classes, the entropy is calculated using the
formula:

H(S) = −∑_{i=1}^{C} p_i · log₂(p_i)

where p_i is the proportion of instances in class i, and C is the total number of classes.

The entropy will be small when every p_i is close to 0 or 1 (i.e., when most of the data is in a single class), and it will be larger when many of the p_i values are not close to 0 or 1 (i.e., when the data is spread across multiple classes).

A graph of -p log p
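• A small Python sketch of this calculation (the function name and example labels are illustrative):

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p_i * log2(p_i), computed from class proportions."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy(['yes', 'yes', 'no', 'no']))      # 1.0 -> maximally mixed (two equal classes)
print(entropy(['yes', 'yes', 'yes', 'yes']))    # 0.0 -> pure (a single class)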
Random Forest
• An ensemble method that builds multiple decision trees and merges their outputs to improve
accuracy and reduce overfitting.
• Each tree in a random forest provides a “vote” for a predicted outcome, and the final result
is based on the majority vote for classification or the average prediction for regression.

Working of Random Forest


Bootstrap Sampling:
• Random Forest uses bootstrap sampling to create different subsets of the original data. This
means each tree is trained on a different subset, making them diverse.
Random Feature Selection:
• At each split in each decision tree, only a random subset of features is considered, further
diversifying the trees.
• This helps prevent trees from being identical and allows the forest to capture a wider range
of patterns.
Advantages of Random Forest
1. Reduced Overfitting: By averaging the results of multiple trees, Random Forest
reduces the risk of overfitting that is common in single decision trees.
2. Robustness: The combination of many trees makes Random Forest more robust to noise
in the data.
3. Feature Importance: Random Forest provides insights into which features are most
important for making predictions, useful in understanding and refining the model.
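• A minimal sketch using scikit-learn's RandomForestClassifier (the dataset and parameters are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample, with a random subset of
# features (max_features='sqrt') considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))           # majority vote across the trees
print(forest.feature_importances_)     # which features mattered most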
Neural Network
• A Neural Network is a computational model inspired by the structure of the human brain.
• It consists of layers of interconnected nodes (neurons) that process and transmit information
to recognize patterns, make decisions, and learn from data.
• Neural networks are a foundation of deep learning and excel at handling complex tasks such
as handwriting recognition, image recognition, face detection, natural language processing,
and even playing games.

Types of Neural Networks


1. Feedforward Neural Networks (FNN): Information flows in one direction from input to
output, without loops. This is the most basic architecture.
2. Convolutional Neural Networks (CNN): Used for image processing, CNNs have special
layers that detect spatial features like edges and textures.
3. Recurrent Neural Networks (RNN): Designed to handle sequential data (e.g., time series or
language), RNNs have connections that allow them to "remember" previous inputs.
4. Generative Adversarial Networks (GANs): These networks consist of two competing
neural networks (a generator and a discriminator) used in generating new data, like images.
Perceptrons
• A perceptron is one of the simplest types of artificial neural networks, primarily used for binary classification tasks.
• A perceptron is a single-layer neural network with a single neuron. It takes several binary inputs, applies weights, sums them, and passes the result through an activation function to produce a binary output.
• Figure 10.5 shows the perceptron model.
Inputs: A perceptron takes multiple inputs represented as x1, x2, x3, …, xn.
Weights: Each input is associated with a weight w1, w2, w3, …, wn. The weights are real-valued numbers that adjust the importance of each input.
Bias: An additional parameter b is added to the weighted sum to adjust the output independently of the input values.
Summation: The perceptron computes the weighted sum z = w1·x1 + w2·x2 + … + wn·xn + b.

Activation function
• The summation result is passed through an activation function to produce the final output y.
• An activation function determines the output of each neuron by introducing non-linearity,
allowing the network to learn complex patterns and relationships within the data.
• The Perceptron typically uses the step function as its activation function, which outputs:
+1 (or class 1) if the weighted sum of inputs ≥0.
−1 (or class 0) otherwise.
Other common activation functions are:
Sigmoid: Compresses outputs to a range between 0 and 1.
ReLU (Rectified Linear Unit): Outputs zero for negative values and passes positive values
unchanged. It’s efficient and commonly used in hidden layers.
Softmax: Used in the output layer of classification models to produce probabilities for each
class.
Perceptron algorithm
1. Initialize weights and bias randomly, typically small values close to zero.
2. Loop over each data point in the training set:
• Calculate the output of the perceptron using the current weights and inputs.
• Calculate the error as the difference between the actual label and the
predicted label.
• Update weights and bias based on the error:
wi = wi + Δwi
b = b + Δb
where
Δwi = α × (y_true − y_pred) × xi
Δb = α × (y_true − y_pred)
Here, 𝛼 is the learning rate, a hyperparameter that controls how much the
weights adjust with each update.
3. Repeat the process for a specified number of iterations or until convergence,
i.e., when there are no more errors.
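• A from-scratch sketch of this training loop in Python (the helper names and the toy AND dataset are illustrative):

import numpy as np

def step(z):
    return 1 if z >= 0 else 0                    # step activation: class 1 if weighted sum >= 0

def train_perceptron(X, y, alpha=0.1, epochs=10):
    w = np.zeros(X.shape[1])                     # weights (zeros here; small random values are also common)
    b = 0.0                                      # bias
    for _ in range(epochs):                      # repeat for a fixed number of iterations
        for x_i, y_true in zip(X, y):
            y_pred = step(np.dot(w, x_i) + b)    # current prediction
            error = y_true - y_pred              # (y_true - y_pred)
            w += alpha * error * x_i             # delta_wi = alpha * (y_true - y_pred) * x_i
            b += alpha * error                   # delta_b  = alpha * (y_true - y_pred)
    return w, b

# Toy example: learning the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, x) + b) for x in X])       # expected: [0, 0, 0, 1]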
Feed-Forward Neural Network (FNN)
• A feed-forward neural network is one of the simplest types of artificial neural networks. In a feed-forward network, information flows in one direction only, from the input layer, through any hidden layers, to the output layer.
• There are no cycles or loops in this structure, which makes it easier to understand and
implement compared to other types of neural networks.
• Just like in the perceptron, each (noninput) neuron has a weight corresponding to each of its
inputs and a bias.
• As with the perceptron, for each neuron we sum up the products of its inputs and its weights. But here, rather than applying the step function to that sum, we apply the sigmoid function.
Why use sigmoid instead of the simpler step_function?
• In order to train a neural network, we need to use calculus, and in order to use calculus,
we need smooth functions.
• step_function isn’t even continuous, and sigmoid is a good smooth approximation of it.
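• A small sketch of the two functions and of a single sigmoid neuron (the function names are illustrative):

import math

def step_function(t):
    return 1.0 if t >= 0 else 0.0

def sigmoid(t):
    """A smooth approximation of the step function, so calculus can be used for training."""
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, bias, inputs):
    """Weighted sum of inputs plus bias, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

print(step_function(0.001), sigmoid(0.001))      # 1.0 vs ~0.50025: the sigmoid changes smoothly
print(step_function(-0.001), sigmoid(-0.001))    # 0.0 vs ~0.49975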

A neural network for XOR


Training a Feedforward Neural Network
• Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output.
• This process is typically performed using backpropagation and gradient descent.
Backpropagation
• Backpropagation is a fundamental algorithm for training artificial neural networks,
especially deep networks with multiple layers.
• It works by calculating the gradient of the loss function with respect to each weight in the
network and using it to adjust weights and minimize the error in predictions.
• Imagine we have a training set that consists of input vectors and corresponding target
output vectors.
• For example, the input vector [1, 0] corresponded to the target output [1]. Imagine that our
network has some set of weights. We then adjust the weights using the following
algorithm:
1. Run feed_forward on an input vector to produce the outputs of all the neurons
in the network.
2. We know the target output, so we can compute a loss that’s the sum of the squared
errors.
3. Compute the gradient of this loss as a function of the output neuron’s weights.
4. “Propagate” the gradients and errors backward to compute the gradients with respect to
the hidden neurons’ weights.
5. Take a gradient descent step.
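• A minimal NumPy sketch of these five steps for a tiny one-hidden-layer network trained on XOR (the architecture, learning rate, and iteration count are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# XOR training data: input vectors and target output vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer parameters
lr = 1.0

for _ in range(10000):
    hidden = sigmoid(X @ W1 + b1)                            # 1. feed forward through every neuron
    output = sigmoid(hidden @ W2 + b2)
    error = output - y                                       # 2. squared-error loss uses this difference
    grad_out = error * output * (1 - output)                 # 3. gradient w.r.t. the output neuron
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)  # 4. propagate gradients backward
    W2 -= lr * hidden.T @ grad_out                           # 5. gradient descent step on every weight
    b2 -= lr * grad_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ grad_hidden
    b1 -= lr * grad_hidden.sum(axis=0, keepdims=True)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # should approach [[0], [1], [1], [0]]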
Deep Learning
• Deep learning is a subset of machine learning focused on training artificial neural networks
with many layers (or "deep" layers) to perform complex tasks like image recognition, natural
language processing, and more.
• Deep learning originally referred to the application of “deep” neural networks (that is, networks with more than one hidden layer).
• Deep learning models use neural networks with multiple layers.
• Each layer consists of a set of neurons (or nodes) that process inputs, apply weights and
biases, and pass the result through an activation function to introduce non-linearity.
• Input Layer: The first layer, which receives raw data as input.
• Hidden Layers: These layers extract features from the data, with each subsequent layer
learning more complex representations.
• Output Layer: Produces the final output, such as class labels in classification or numeric
values in regression.
The Tensor
• A tensor is a multi-dimensional array or data structure that represents data in various
dimensions.
• Tensors are the fundamental building blocks of data manipulation in neural networks
• A tensor is a generalization of scalars, vectors, and matrices to higher dimensions.
• Scalar: A single number (0-dimensional tensor).
• Vector: A 1-dimensional array of numbers, like [1,2,3] (1-dimensional tensor).
• Matrix: A 2-dimensional array of numbers, (2-dimensional tensor).
• Higher-Dimensional Tensors: An n-dimensional array, with more than two axes, used to
represent more complex data structures, such as images, sequences, or batches of data in
machine learning.
Example
• A 3D tensor could represent color images, where each pixel has three values (R, G, B).
• A 4D tensor could represent a batch of images, where each entry in the batch is a 3D tensor
(for each image).
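• A quick NumPy illustration of these tensor shapes (the sizes are illustrative):

import numpy as np

scalar = np.array(5.0)                   # 0-dimensional tensor, shape ()
vector = np.array([1, 2, 3])             # 1-dimensional tensor, shape (3,)
matrix = np.array([[1, 2], [3, 4]])      # 2-dimensional tensor, shape (2, 2)
image  = np.zeros((28, 28, 3))           # 3D tensor: one 28x28 RGB image
batch  = np.zeros((32, 28, 28, 3))       # 4D tensor: a batch of 32 such images

for t in (scalar, vector, matrix, image, batch):
    print(t.ndim, t.shape)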
The Layer Abstraction
• The layer abstraction in deep learning refers to the conceptual organization of neural network models into distinct layers, each performing a specific type of computation or transformation on the input data.
• This hierarchical structure allows for the progressive extraction of features from raw data, with each layer learning increasingly complex representations.
• Layer abstractions help modularize and structure neural networks, allowing easier design,
training, and modification.
• In deep learning frameworks like TensorFlow and PyTorch, layers are treated as modular
components, each with a specific role in transforming data. For example, layers handle tasks
like feature extraction (Convolutional layers), data transformation (Dense layers), or
regularization (Dropout layers). These modular layers can be stacked to create complex neural
network architectures.
Types of layers include:
Input Layer: The first layer, which accepts raw data and passes it to the next layer.
Hidden Layers: These are the intermediate layers where transformations, feature extraction,
and pattern recognition occur.
• Dense (Fully Connected) Layer: Every neuron in the layer is connected to every neuron in
the previous layer. Often used in the last layers of the network for combining learned
features.
• Convolutional Layer: Used mainly for image data, it applies convolution operations to
detect local patterns, like edges or textures.
• Recurrent Layer: Used for sequential data like text, it maintains memory by carrying
information across time steps.
Output Layer: Produces the final output of the network.
Benefits of Layer Abstraction in Deep Learning
Complexity Management: Each layer performs a specific task, making it easier to understand,
optimize, and troubleshoot the network.
Hierarchical Feature Learning: With layered abstraction, neural networks can learn features
at multiple levels of granularity, allowing them to model complex patterns and relationships in
the data.
Modularity and Flexibility: Layers can be swapped or fine-tuned without changing the entire
network structure, making neural networks highly customizable.
Efficient Computation: Deep learning frameworks optimize layer operations separately,
allowing for faster computation and more efficient memory usage.
Linear Layer
• The Linear Layer is a fundamental component in neural networks, particularly in fully
connected or dense networks.
• It applies a linear transformation to the input data, which means it performs a matrix-vector
multiplication, followed by an optional addition of a bias term.
• For an input vector x and an output vector y, the linear layer computes:
y=Wx+b
where:
• W is the weight matrix, containing learnable parameters.
• b is the bias vector, also learnable parameters.
• x is the input vector.
• y is the output vector.
• The linear layer is often placed between other layers (like activation layers) in a neural
network.
• When combined with nonlinear activation functions (like ReLU, Sigmoid, or Tanh), it
enables the network to model complex, nonlinear relationships.
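• A from-scratch sketch of the computation y = Wx + b (the class and variable names are illustrative):

import numpy as np

class Linear:
    """A linear (fully connected) layer computing y = W x + b."""
    def __init__(self, input_dim, output_dim):
        self.W = np.random.randn(output_dim, input_dim) * 0.01   # learnable weight matrix
        self.b = np.zeros(output_dim)                            # learnable bias vector

    def forward(self, x):
        return self.W @ x + self.b                               # matrix-vector product plus bias

layer = Linear(input_dim=3, output_dim=2)
print(layer.forward(np.array([1.0, 2.0, 3.0])))                  # a 2-dimensional output vector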
Neural Networks as a Sequence of Layers
• Neural networks are structured as sequences of layers, where each layer processes the data
it receives and then passes it to the next layer.
• This sequential layer design enables neural networks to progressively extract and learn
increasingly complex features from data.

Layer Composition
Neural networks are built by stacking various types of layers in a specific order.
The most common layer types include:
Input Layer: The first layer that receives the raw data input.
Hidden Layers: Intermediate layers where most of the computation and feature extraction
occur. They can be fully connected (dense) layers, convolutional layers, recurrent layers, etc.
Output Layer: The final layer that produces the network's prediction or classification result.
Loss and Optimization
To train a neural network, you need a loss function to measure the difference between the
model's predictions and the true values, and an optimization algorithm to update the model
parameters to minimize this loss.
1. Loss Functions
The loss function quantifies how well or poorly the model is performing. For most neural
networks, you’ll use one of the following:
•Mean Squared Error (MSE): Common for regression tasks.
•Cross-Entropy Loss: Typically used for classification tasks.
2. Optimization Algorithms
An optimizer adjusts the model’s weights based on the gradients from backpropagation.
Common optimizers include:
•Stochastic Gradient Descent (SGD): A simple and effective method that adjusts
weights by taking small steps proportional to the negative gradient.
•Adam: An advanced version of gradient descent that adapts the learning rate for each
parameter, often leading to faster convergence.
Other activation functions
• Activation functions are essential in neural networks because they introduce non-linearity,
enabling networks to learn complex patterns and approximate non-linear functions.
• Without activation functions, a neural network composed of linear layers would be
equivalent to a single-layer model, limiting its capacity to solve complex tasks.
1. Sigmoid Function
2. Tanh (Hyperbolic Tangent)
3. ReLU (Rectified Linear Unit)
4. Leaky ReLU
5. Softmax
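• A brief NumPy sketch of the first four of these functions, using their standard formulas (softmax is covered in the next section):

import numpy as np

def sigmoid(z):                          # squashes values into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):                             # squashes values into the range (-1, 1)
    return np.tanh(z)

def relu(z):                             # zero for negative values, identity for positives
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):           # small slope for negatives instead of exactly zero
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))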
Softmaxes and Cross-Entropy
• Softmax and Cross-Entropy are two important concepts in machine learning, often used
together, especially in classification tasks.
• The Softmax function takes a vector of values (like scores for each class) and converts them
into probabilities. It’s used in the output layer of a neural network to interpret the results as
probabilities for each class.
• Cross-Entropy is a loss function used to measure the difference between two probability
distributions – the true distribution (the actual label) and the predicted distribution (output
from the Softmax function).
How They Work Together
• When Softmax is applied to the neural network’s output, it transforms these raw scores
into probabilities.
• Then, Cross-Entropy is used to measure how well these probabilities align with the
actual class labels, penalizing the model more if it assigns low probability to the correct
class.
• This combination is widely used in multi-class classification tasks because:
• Softmax ensures the model outputs probabilities.
• Cross-Entropy penalizes incorrect predictions, guiding the model to improve its accuracy.
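• A small NumPy sketch of this combination (the scores and class indices are illustrative):

import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exp_s = np.exp(scores - np.max(scores))     # subtract the max for numerical stability
    return exp_s / exp_s.sum()

def cross_entropy(probs, true_class):
    """Penalty is -log of the probability assigned to the correct class."""
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])              # raw network outputs for 3 classes
probs = softmax(scores)                         # approximately [0.66, 0.24, 0.10]
print(cross_entropy(probs, true_class=0))       # low loss: high probability on the correct class
print(cross_entropy(probs, true_class=2))       # higher loss: low probability on the correct class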
Dropout
In machine learning, dropout is a regularization technique primarily used in neural
networks to reduce the risk of overfitting during training.
Random Neuron Deactivation: During each forward pass, dropout temporarily "drops
out" (sets to zero) a random subset of neurons within a layer.
This forces the network to not rely on specific neurons for prediction, distributing the
learning across multiple neurons.
Typically, this is controlled by a probability (called the "dropout rate") that defines the
fraction of neurons to be dropped out.
Applied During Training: Dropout is only used during the training phase, and the
complete network is used during inference (testing/validation) by scaling neuron
activations to account for the missing connections in training.
Benefits:
• Prevents Overfitting: Dropout helps prevent the model from overfitting by ensuring that
the neural network does not become overly reliant on certain neurons.
• Improves Generalization: By making the network more robust to noise and forcing it to
learn more varied representations, dropout improves the network's ability to generalize
on unseen data.

Choosing Dropout Rate:


• Common dropout rates are 0.2 to 0.5, but they vary based on the architecture and
dataset.
• Too high a dropout rate can hinder the model's ability to learn, while too low a rate may
not effectively prevent overfitting.
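• A minimal sketch of adding dropout to a Keras model (the 0.3 rate and layer sizes are illustrative assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dropout(0.3),                     # randomly zero 30% of this layer's activations during training
    Dense(10, activation='softmax')
])
# Dropout is active only during training; Keras disables it automatically at inference time.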
Example: MNIST
• MNIST is a dataset of handwritten digits that everyone uses to learn deep learning. It is available in a somewhat tricky binary format, so a helper library is used to load it.
• Below is a Python program to compute loss and perform optimization in deep learning using the MNIST dataset.

Figure 1 MNIST images


import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Load MNIST dataset


(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0 # Normalize data to [0, 1]

# Create a simple neural network model


model = Sequential([
    Flatten(input_shape=(28, 28)),     # Flatten the input images (28x28 pixels)
    Dense(128, activation='relu'),     # First hidden layer with 128 units and ReLU activation
    Dense(10, activation='softmax')    # Output layer with 10 units (for 10 classes) and softmax activation
])

# Compile the model with loss function and optimizer


model.compile(optimizer=SGD(learning_rate=0.01),         # Stochastic Gradient Descent optimizer with learning rate 0.01
              loss=SparseCategoricalCrossentropy(),      # Loss function for multi-class classification
              metrics=[SparseCategoricalAccuracy()])     # Evaluation metric

# Train the model and compute loss


history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_test, y_test))

# Evaluate the model


test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")
• Data Loading and Normalization: MNIST is loaded and normalized to values between 0 and 1.
• Model Architecture: A sequential model is created with one hidden layer and an output layer.
• Loss Function: SparseCategoricalCrossentropy is used since the dataset labels are integers.
• Optimizer: SGD (Stochastic Gradient Descent) with a learning rate of 0.01 is used for optimization.
• Training and Validation: The model is trained for 10 epochs, and validation data is provided for
monitoring.
• Evaluation: The model is evaluated on test data to report final loss and accuracy.
• The output of the model for the 9th and 10th epochs is shown below:

Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - loss: 0.1807 -
sparse_categorical_accuracy: 0.9489 - val_loss: 0.1731 -
val_sparse_categorical_accuracy: 0.9503
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - loss: 0.1672 -
sparse_categorical_accuracy: 0.9542 - val_loss: 0.1651 -
val_sparse_categorical_accuracy: 0.9525
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.1923 -
sparse_categorical_accuracy: 0.9439
Test Loss: 0.1651, Test Accuracy: 0.9525
Clustering
• Clustering is an unsupervised machine learning technique that involves grouping data
points into clusters, or groups, such that data points in the same group are more similar
to each other than to those in other groups.
• This similarity is usually measured using metrics like Euclidean distance, Manhattan
distance, or cosine similarity.
• Clustering is a core concept in data analysis and machine learning, especially when
dealing with unlabeled data, as it allows us to reveal hidden patterns and relationships
within a dataset.
• The different types of clustering methods are:
1. K-means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
4. Model-Based Clustering (Gaussian Mixture Models (GMM))
5. Grid-Based Clustering
K-means Clustering:
• One of the simplest clustering methods is k-means, in which the number of clusters k is chosen in advance.
• The goal is to partition the inputs into sets S1, ..., Sk in a way that minimizes the total sum of squared distances from each point to the mean of its assigned cluster.
• K-means divides the data into k clusters by minimizing the variance within clusters. It randomly initializes centroids and assigns each data point to the closest centroid.
• The centroids are then updated iteratively.
• An iterative algorithm that usually finds a good clustering:
1. Start with a set of k-means, which are points in d-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point’s assignment has changed, stop and keep the clusters.
4. If some point’s assignment has changed, recompute the means and return to
step 2.
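• A minimal sketch of running k-means with scikit-learn (the points and the choice of k = 3 are illustrative):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [10, 10], [10.5, 9], [5, 5], [5.5, 4.5]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the final mean of each cluster
print(kmeans.inertia_)           # total sum of squared distances to the assigned cluster means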
Choosing k
There are various ways to choose k. One that is reasonably easy to understand involves plotting the sum of squared errors (between each point and the mean of its cluster) as a function of k and looking at where the graph “bends” (the “elbow” of the curve):
Bottom-Up Hierarchical Clustering
• Bottom-up hierarchical clustering, also known as agglomerative hierarchical
clustering.
• It is a type of clustering approach that builds a hierarchy of clusters by starting with
each data point as an individual cluster and progressively merging clusters based on
their similarity.
• Here, we “grow” clusters from the bottom up:

1. Make each input its own cluster of one.


2. As long as there are multiple clusters remaining, find the two closest clusters and
merge them.
How Bottom-Up Hierarchical Clustering Works
1. Start with each point as its Own Cluster:
Initially, each data point is considered a separate cluster. If you have 𝑛 data points, you start
with 𝑛 clusters.
2. Compute Pairwise Distances Between Clusters:
Calculate the distance between each pair of clusters. The choice of distance metric (e.g.,
Euclidean, Manhattan, cosine similarity) depends on the nature of the data and the problem
at hand.
3. Merge the Closest Clusters:
Find the two clusters with the smallest distance and merge them into a single cluster. This
reduces the total number of clusters by one.
4. Update Distances Between Clusters:
After merging two clusters, update the distances between the new cluster and the remaining
clusters.
Different strategies for calculating these distances result in different types of hierarchical
clustering:
• Single Linkage: Distance between clusters is based on the closest pair of points (nearest
neighbor).
• Complete Linkage: Distance between clusters is based on the farthest pair of points
(furthest neighbor).
• Average Linkage: Distance between clusters is based on the average distance between all
pairs of points in the two clusters.
• Centroid Linkage: Distance between clusters is based on the distance between their
centroids

5. Repeat Steps 3 and 4 Until Only One Cluster Remains:


Continue merging clusters until all points are merged into a single cluster or until a
desired number of clusters is reached. This results in a hierarchy of clusters, often
visualized as a dendrogram.
A dendrogram is a tree-like diagram that shows the merging process of clusters at each step.
It is read from the bottom up, with each leaf node representing an individual data point.
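• A minimal sketch of agglomerative clustering with SciPy (the points and the 'complete' linkage choice are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.5, 2], [10, 10], [10.5, 9], [5, 5]])

Z = linkage(points, method='complete')      # repeatedly merge the two closest clusters
print(Z)                                    # each row records one merge and its distance

# Cut the hierarchy to obtain a desired number of flat clusters, e.g. 2
print(fcluster(Z, t=2, criterion='maxclust'))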
Advantages:
• Does not require specifying the number of clusters in advance.
• Creates a hierarchy of clusters, allowing for flexibility in choosing the final clusters.
Disadvantages:
• Computationally intensive for large datasets, with a time complexity of O(n³).
• Sensitive to noise and outliers, which can significantly affect the hierarchical structure.
Some Important questions
1. Illustrate the working of a decision tree and explain the importance of entropy in decision trees.
2. Discuss the perceptron neural network in detail.
3. What is a feed-forward neural network? Explain the backpropagation method used to train a neural network.
4. Bring out the differences between machine learning and deep learning.
5. Explain the working of an artificial neural network.
6. What is clustering? Explain k-means clustering in detail.
7. Explain layer abstraction in deep learning.
8. Write a Python program to compute loss and optimization in deep learning using the MNIST dataset.
9. What is an activation function? Discuss the different types of activation functions.
10. Write a note on dropout.
