
THE UNIVERSITY OF DODOMA

COLLEGE OF INFORMATICS AND VIRTUAL EDUCATION (CIVE)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (CSE)

COURSE NAME: ARTIFICIAL INTELLIGENCE

COURSE CODE: CP 422

INSTRUCTOR: MR. THOMAS TESHA

ASSIGNMENT TYPE: GROUP ASSIGNMENT 02

STUDENT NAME REGISTRATION NUMBER PROGRAMME

OMBEN O NGATA T/UDOM/2019/12326 BSc - CNISE

TUMSIFU PALLAGYO T/UDOM/2019/08249 BSc – CNISE

JOSEPHAT TEMBA T/UDOM/2019/08243 BSc – CNISE

POLIKARP MREMA T/UDOM/2019/08290 BSc – CNISE

KHADIJA LUSHINGE T/UDOM/2019/08241 BSc – CNISE

JAMAL MTAZA MIYONGA T/UDOM/2019/08254 BSc – CNISE

BAKARI MPAWASO T/UDOM/2019/12327 BSc – CNISE

JULIUS SORAELY T/UDOM/2019/08250 BSc – CNISE

SULEIMAN JUMA SULEIMAN T/UDOM/2019/08285 BSc – CNISE

DAVID ROBERT MGAYA T/UDOM/2019/08265 BSc - CNISE

Question one
a) Examine the distinctions between Breadth-First Search (BFS) and Depth-First Search (DFS)
algorithms.
Answer:
BREADTH-FIRST SEARCH (BFS) is a simple strategy in which the root node is expanded
first, then all the successors of the root node are expanded next, then their successors, and so
on. In general, all the nodes are expanded at a given depth in the search tree before any nodes
at the next level are expanded.

In contrast,

DEPTH-FIRST SEARCH (DFS) always expands the deepest node in the current frontier of
the search tree. The search proceeds immediately to the deepest level of the search tree, where
the nodes have no successors. As those nodes are expanded, they are dropped from the
frontier, so then the search “backs up” to the next deepest node that still has unexplored
successors.
Example (diagram omitted)
Breadth-First Search (BFS) and Depth-First Search (DFS) are two fundamental search
algorithms used in the field of Artificial Intelligence (AI) to traverse or explore graph-like
structures. While they both aim to search for a particular node or find a path in a graph, they
differ in their approach and the order in which they explore the nodes. Here are the main
distinctions between BFS and DFS:
Exploration Order:
- BFS explores the graph level by level, starting from the initial node. It visits all the
neighbors of a node before moving on to the next level.
- DFS explores the graph by going as deep as possible along each branch before
backtracking. It visits the first neighbor of a node and explores that branch completely
before moving on to the next neighbor.
Data Structure:
- BFS uses a queue data structure to keep track of the nodes to be explored. The nodes
are added to the end of the queue and removed from the front in a first-in-first-out (FIFO)
manner.
- DFS uses a stack (or recursion) to keep track of the nodes to be explored. The nodes are
added to the top of the stack and removed from the top in a last-in-first-out (LIFO)
manner.
Memory Requirements:
- BFS generally requires more memory compared to DFS because it needs to store all the
nodes at each level in the queue until they are explored. In worst-case scenarios, BFS
may require memory proportional to the size of the entire graph.
- DFS typically requires less memory compared to BFS because it only needs to store a
single path from the initial node to the current node being explored. However, in very
deep or infinite graphs, DFS can run into stack overflow errors due to excessive
recursion.
Completeness and Optimality:
- BFS is complete: if a solution exists, it will be found, because every node at each depth is
explored before moving to the next level. It is also optimal whenever all step costs are equal,
since the first goal it reaches is the shallowest one and therefore lies on a shortest path.
- DFS is not complete in general: it can get stuck following an infinite branch or looping among
repeated states, and may therefore fail to find a solution even when one exists. It is also not
optimal, because it commits to a single branch as deeply as possible before backtracking and
may return a longer path than necessary.
Time Complexity:
- In the worst case, both BFS and DFS must visit every vertex and edge, giving the same time
complexity of O(V + E), where V is the number of vertices and E is the number of edges in
the graph.
- In practice, BFS tends to expand more nodes when the goal lies deep in the graph, whereas
DFS can terminate earlier when a solution happens to lie along one of the first branches it
explores.

b) Consider a grid-based navigation problem where an agent needs to find the shortest path
from the initial state S to the goal state G. The grid consists of cells, and the agent can move
horizontally or vertically between adjacent cells. There are no obstacles in the grid, and the
agent can move in any direction.
i. Define the problem in terms of states, actions, initial state, goal state, and the path
cost function
ii. Explain the breadth-first search (BFS) algorithm for solving this problem.
iii. Describe the general steps of BFS and discuss its properties, including
completeness, optimality, time complexity, and space complexity
Answer
i. Problem Definition:
• States: The states in this problem can be represented by the coordinates (x, y) of the cells
in the grid.
• Actions: The agent can move horizontally or vertically between adjacent cells. Therefore,
the actions are {UP, DOWN, LEFT, RIGHT}.
• Initial State: The initial state (S) represents the starting position of the agent in the grid.
• Goal State: The goal state (G) represents the position the agent needs to reach.
• Path Cost Function: The path cost function assigns a cost of 1 to each step taken by the
agent, so the total path cost equals the number of moves.

ii. BFS algorithm to solve the problem


The BFS algorithm explores the grid in a breadth-first manner, systematically searching
through all the cells level by level until it finds the goal state.
iii. General steps
1. Initialize an empty queue and a set of visited cells.
2. Enqueue the initial state (S) into the queue.
3. Mark the initial state as visited.
4. Repeat until the queue is empty:
- Dequeue the front node from the queue.
- If the dequeued node is the goal state (G), terminate the search and return the path.
- Generate all possible valid actions from the current node.
- For each valid action, calculate the new state by applying the action to the current node.
- If the new state has not been visited:
o Mark the new state as visited.
o Enqueue the new state into the queue.
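The steps above can be sketched in Python as follows. This is a minimal illustration, assuming an obstacle-free grid addressed by (row, column) coordinates; the grid size, start and goal cells, and function name are illustrative choices, not part of the original problem statement.

```python
from collections import deque

def bfs_shortest_path(rows, cols, start, goal):
    """Breadth-first search on an obstacle-free grid; returns the shortest path."""
    frontier = deque([start])                    # FIFO queue of cells to expand
    parent = {start: None}                       # doubles as the visited set
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # UP, DOWN, LEFT, RIGHT

    while frontier:
        cell = frontier.popleft()                # dequeue the front node
        if cell == goal:                         # goal test
            path = []
            while cell is not None:              # walk the parent links back to S
                path.append(cell)
                cell = parent[cell]
            return list(reversed(path))
        x, y = cell
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in parent:
                parent[nxt] = cell               # mark visited, remember how we got here
                frontier.append(nxt)
    return None                                  # unreachable on a connected grid

# Example: 4x4 grid, from the top-left corner to the bottom-right corner.
print(bfs_shortest_path(4, 4, (0, 0), (3, 3)))
```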

Completeness:

BFS is complete for finding a solution if one exists. Since it explores all the cells level
by level, it is guaranteed to find a solution if there is a path from the initial state to the goal
state.
Optimality:
BFS guarantees to find the shortest path from the initial state to the goal state. Since it
explores the graph in a breadth-first manner, it will reach the goal state with the fewest
number of steps.
Time Complexity:
In the worst case, BFS explores all the cells in the grid. If the grid has dimensions M x
N, the time complexity of BFS is O(M*N) as it needs to visit each cell at least once.
Space Complexity:
The space complexity of BFS depends on the maximum number of nodes stored in the
queue and the visited set. In general, with branching factor B and solution depth D, BFS may
store O(B^D) nodes. In this grid, however, each cell is enqueued at most once, so the space
requirement is bounded by O(M*N) for an M x N grid.

c) Explain the Depth-first search (DFS) algorithm for solving this problem. Describe the
general steps of DFS and discuss its properties, including completeness, optimality, time
complexity, and space complexity
Depth-First Search (DFS) is an algorithm that explores the grid in a depth-first manner, going
as deep as possible along each branch before backtracking. Here are the general steps of the
DFS algorithm for solving the grid-based navigation problem:
1. Initialize a stack and a set of visited cells.
2. Push the initial state (S) onto the stack.
3. Mark the initial state as visited.
4. Repeat until the stack is empty:
• Pop the top node from the stack.
• If the popped node is the goal state (G), terminate the search and return the path.
• Generate all possible valid actions from the current node.
• For each valid action, calculate the new state by applying the action to the current
node.
• If the new state has not been visited:
o Mark the new state as visited.
o Push the new state onto the stack.
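These DFS steps can be sketched in Python in the same way, under the same illustrative grid assumptions as the BFS sketch above (iterative version with an explicit stack):

```python
def dfs_path(rows, cols, start, goal):
    """Iterative depth-first search on an obstacle-free grid; the returned path
    is valid but not guaranteed to be the shortest one."""
    stack = [start]                              # LIFO stack of cells to expand
    parent = {start: None}                       # doubles as the visited set
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # UP, DOWN, LEFT, RIGHT

    while stack:
        cell = stack.pop()                       # pop the top node
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return list(reversed(path))
        x, y = cell
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in parent:
                parent[nxt] = cell
                stack.append(nxt)
    return None

print(dfs_path(4, 4, (0, 0), (3, 3)))            # a valid path, often longer than BFS's
```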
Properties of DFS:
Completeness: DFS is not complete in general: it can follow an infinite branch forever, or cycle
among repeated states if they are not detected, and so may never reach a solution even when one
exists. On a finite search space with repeated-state checking (such as this grid), DFS will
eventually find a solution if one exists, although it may explore much of the search space first.

Optimality: DFS does not guarantee finding the shortest path. It explores a single branch as deeply
as possible before backtracking, so it may find a suboptimal path before finding the shortest path.
Time Complexity: The time complexity of DFS depends on the branching factor, the maximum
depth of the search tree, and the number of solutions. In the worst case, if the graph has branching
factor B and the maximum depth is D, the time complexity of DFS is O(B^D). However, if there
are multiple solutions, DFS may terminate earlier.
Space Complexity: The space complexity of DFS is determined by the maximum depth of the
search tree. With branching factor B and maximum depth D, the stack (or recursion) holds one
path of length D together with the unexpanded siblings along it, giving O(B*D) space; a pure
backtracking variant needs only O(D). If the depth is very large or unbounded, recursive DFS
can encounter stack overflow errors.

Question two
a) Markov Decision Processes (MDPs) are mathematical models used to formalize the
framework of reinforcement learning. MDPs provide a way to represent decision-making
problems in environments that exhibit Markovian properties. They consist of a set of states,
actions, transition probabilities, and rewards. Summarize the key components and concepts
related to MDPs:
Answer
Markov Decision Processes (MDPs) are mathematical models that are widely used in the field
of reinforcement learning to formalize decision-making problems in environments with
Markovian properties. The key components and concepts related to MDPs include:
• States (S): MDPs consist of a set of states that represent different configurations or
situations in the environment. The agent's actions and the environment's dynamics depend
on the current state.
• Actions (A): Actions are the choices or decisions that the agent can take in each state. The
set of available actions may vary depending on the state. The agent's goal is to learn a
policy that determines the best action to take in each state.
• Transition Probabilities (T): Transition probabilities represent the dynamics of the
environment. They define the probability distribution over next states given the current
state and action. In other words, they specify the likelihood of transitioning from one state
to another when a particular action is taken.
• Rewards (R): Rewards are numerical values that provide feedback to the agent based on
its actions and the resulting state transitions. Rewards can be immediate or delayed, and
they are used to shape the agent's behavior. The objective of the agent is often to maximize
the cumulative sum of rewards over time.
• Policy (π): A policy is a mapping from states to actions, which guides the agent's
decision-making. It defines the agent's behavior and determines the action to take in each
state. The policy can be deterministic (mapping each state to a single action) or stochastic
(mapping each state to a probability distribution over actions).

• Value Function: The value function in an MDP is used to evaluate the desirability or
quality of being in a particular state or taking a specific action. There are two types of value
functions: the state-value function (V(s)) estimates the expected return starting from a
given state under a specific policy, and the action-value function (Q(s, a)) estimates the
expected return starting from a given state, taking a specific action, and following a specific
policy.
• Bellman Equations: The Bellman equations are mathematical equations that describe the
relationships between value functions in MDPs. They provide a way to recursively
decompose the value functions based on the expected future rewards and state transitions.
• Optimal Policy: An optimal policy is the policy that maximizes the expected cumulative
rewards over time. It specifies the best action to take in each state. Finding the optimal
policy is one of the main goals in reinforcement learning, and various algorithms, such as
value iteration and policy iteration, are used to solve for it.
• Exploration and Exploitation: In the context of MDPs, exploration refers to the agent's
ability to explore different actions and states to gather information about the environment.
Exploitation refers to the agent's ability to utilize the knowledge it has gained to maximize
its rewards. Balancing exploration and exploitation is crucial for effective learning in
MDPs.
MDPs provide a framework for modeling decision-making problems in uncertain environments
and serve as the foundation for many reinforcement learning algorithms, allowing agents to
learn optimal policies through interaction with the environment.
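To make these components concrete, here is a minimal Python sketch of one possible way to represent a small MDP; the container layout and the tiny two-state example are illustrative assumptions, not taken from the assignment.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list                      # S: the set of states
    actions: list                     # A: the set of actions
    transitions: dict = field(default_factory=dict)  # T[(s, a)] -> [(s', probability)]
    rewards: dict = field(default_factory=dict)      # R[(s, a, s')] -> immediate reward
    gamma: float = 0.9                # discount factor

# Tiny two-state example: from "s0", action "go" reaches "s1" with probability 0.8.
mdp = MDP(
    states=["s0", "s1"],
    actions=["go", "stay"],
    transitions={("s0", "go"): [("s1", 0.8), ("s0", 0.2)],
                 ("s0", "stay"): [("s0", 1.0)],
                 ("s1", "go"): [("s1", 1.0)],
                 ("s1", "stay"): [("s1", 1.0)]},
    rewards={("s0", "go", "s1"): 1.0},
)
print(mdp.transitions[("s0", "go")])
```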

b) What is the main difference between model-based and model-free reinforcement learning?
Model-Based Reinforcement Learning:
• In model-based reinforcement learning, the agent has access to a model of the environment.
The model provides information about the transition probabilities and rewards associated
with each state-action pair.
• The agent utilizes the model to simulate or predict the outcomes of different actions in
different states. It uses this information to plan and make decisions about which actions to
take.
• Model-based methods typically involve two steps: learning the model from data gathered
through interaction with the environment and then using the learned model to compute
optimal policies or value functions.
• These methods can efficiently plan and reason about long-term consequences, and they
can make predictions about unobserved states.
• However, model-based approaches may suffer from inaccuracies if the learned model does
not fully capture the complexities of the real environment.

Model-Free Reinforcement Learning:
• In model-free reinforcement learning, the agent does not have access to an explicit model
of the environment. It learns directly from interactions with the environment, without
explicitly estimating the transition probabilities and rewards.
• The agent explores the environment and learns from the observed state-action-reward
sequences. It focuses on learning a policy or value function without explicitly reasoning
about the underlying dynamics.
• Model-free methods directly estimate the value function or policy through various
algorithms such as Q-learning, SARSA, or policy gradients.
• These methods are simpler and more straightforward to implement, as they do not require
learning and utilizing a model of the environment.
• However, model-free approaches may require a large number of interactions with the
environment to learn an optimal policy accurately, especially in complex domains.

c) Can a reinforcement learning algorithm be both model-based and model-free? Explain why
or why not
No, a reinforcement learning algorithm cannot be both model-based and model-free
simultaneously. The terms "model-based" and "model-free" represent distinct approaches to
reinforcement learning that differ in their fundamental principles and methodologies.

Model-based reinforcement learning algorithms utilize an explicit model of the environment,
which includes information about the transition probabilities and rewards associated with
each state-action pair. These algorithms learn the model from data and use it to simulate or
predict the outcomes of different actions in different states. They then utilize the learned
model to plan and make decisions about which actions to take. The use of the model is a core
characteristic of model-based algorithms.

On the other hand, model-free reinforcement learning algorithms do not rely on an explicit
model of the environment. Instead, they directly learn from interactions with the
environment, without explicitly estimating the transition probabilities and rewards. Model-
free methods focus on learning a policy or value function directly from observed state-
action-reward sequences without explicitly reasoning about the underlying dynamics.

These two approaches are fundamentally different and involve distinct learning processes
and decision-making mechanisms. Model-based algorithms leverage the model to plan and
make decisions based on explicit predictions, while model-free algorithms learn directly
from experiences without explicitly representing the environment's dynamics.

It's worth noting that some algorithms may incorporate elements of both model-based and
model-free approaches, but they still fall into one category or the other based on their core
principles. For example, some algorithms may combine a learned model with a model-free
learning method to improve efficiency or incorporate prior knowledge. However, they would
still be considered either model-based or model-free based on their primary approach.

d) Consider a in question 1 grid-world, assume it is a 4x4 grid. The agent starts at the top-left
corner and can take actions to move up, down, left, or right in each time step. The agent
receives a reward of +1 for reaching the bottom-right corner and a reward of -1 for reaching
the bottom-left corner. All other transitions receive a reward of 0. The discount factor γ is
set to 0.9. The agent follows an epsilon-greedy policy with ε = 0.1.
i. In your own words, define the terms "agent," "environment," and "reward" in
the context of reinforcement learning.
ii. Explain the concept of the discount factor (γ) and its significance in
reinforcement learning.
iii. Provide a brief overview of the epsilon-greedy policy and its purpose in
reinforcement learning.
iv. Suppose the agent follows a Q-learning algorithm to learn an optimal policy in
this grid-world environment. Outline the steps of the Q-learning algorithm and
explain how the Q-values are updated.
Answer
i. Agent, environment and reward

Agent
An agent is the entity in artificial intelligence that receives percepts from the environment
and performs actions intended to produce the best outcome.
Environment
The environment is the external system that the agent interacts with.
Reward
A reward is the numerical value that the agent receives after performing a certain action in
the environment.

ii. Discount factor (γ)


The discount factor (γ) in reinforcement learning represents the importance or weight given to future
rewards compared to immediate rewards. It determines how much the agent values
immediate rewards versus long-term rewards. The discount factor is a value
between 0 and 1, with higher values indicating that the agent assigns more
significance to future rewards.

The significance of the discount factor lies in its impact on the agent's decision-
making process. When γ is close to 0, the agent becomes myopic and prioritizes
immediate rewards, focusing on short-term gains. Conversely, when γ is close to 1,
the agent takes into account the cumulative long-term rewards and exhibits more
far-sighted behavior, prioritizing long-term gains. By adjusting the discount factor,
the agent can balance the trade-off between immediate rewards and future rewards,
influencing its ability to make optimal decisions in environments with delayed
consequences and uncertain outcomes.
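As a small illustration with made-up numbers, the discounted return of a reward sequence r1, r2, r3, ... is G = r1 + γ·r2 + γ²·r3 + ...; the snippet below shows how the same delayed reward is valued very differently under γ = 0.9 and γ = 0.1.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence (t starts at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 1]                     # a single reward, delayed by three steps
print(discounted_return(rewards, 0.9))     # 0.729: a far-sighted agent still values it
print(discounted_return(rewards, 0.1))     # 0.001: a myopic agent barely values it
```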
iii. Epsilon-greedy policy and its purpose in reinforcement learning
The epsilon-greedy policy is a commonly used exploration-exploitation strategy in
reinforcement learning. It determines the agent's behavior in selecting actions based
on a balance between exploration (trying out different actions) and exploitation
(choosing the currently believed best action).
The epsilon-greedy policy works as follows:
• With a probability of ε (epsilon), the agent explores and selects a random
action, regardless of its estimated value.
• With a probability of 1-ε, the agent exploits its current knowledge and
selects the action with the highest estimated value (greedy action).

The purpose of the epsilon-greedy policy is to encourage exploration in the early
stages of learning, allowing the agent to discover potentially better actions. As
learning progresses and the agent's knowledge improves, exploitation becomes
more dominant, and the agent focuses on selecting the actions it believes to be the
best based on its current estimates.
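A minimal Python sketch of epsilon-greedy action selection; the Q-table layout, action names, and state encoding are illustrative assumptions:

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                                  # explore
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))   # exploit

actions = ["UP", "DOWN", "LEFT", "RIGHT"]
q_values = {((0, 0), "RIGHT"): 0.5, ((0, 0), "DOWN"): 0.2}
print(epsilon_greedy(q_values, (0, 0), actions, epsilon=0.1))
```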

iv. The Q-learning algorithm and how the Q-values are updated
The Q-learning algorithm iteratively updates the Q-values based on the observed
rewards and updates them towards the maximum expected future rewards. Over
time, the Q-values converge to the optimal action-value function, enabling the agent
to learn an optimal policy that maximizes its cumulative rewards
Steps
1. Initialize the Q-values for all state-action pairs arbitrarily or to some initial
values.
2. Repeat the following steps until convergence or a predefined number of
iterations:
o Select an action to take in the current state based on the epsilon-
greedy policy.
o Execute the selected action and observe the resulting next state and
the corresponding reward.
o Update the Q-value for the current state-action pair using the Q-learning update rule:
Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))
where:
- Q(s, a) is the Q-value of the current state-action pair.
- α (alpha) is the learning rate, determining the weight given to the new information
compared to the existing estimate.
- r is the observed reward after taking the action in the current state.
- γ (gamma) is the discount factor, as described earlier.
- max(Q(s', a')) represents the maximum Q-value among all possible actions in the
next state.
o Update the current state to the next state.
3. Repeat the above steps for multiple episodes or until convergence.
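Below is a compact Python sketch of this Q-learning loop for the 4x4 grid-world from the question (start at the top-left, +1 at the bottom-right, -1 at the bottom-left, γ = 0.9, ε = 0.1). The learning rate of 0.5, the episode count, and the (row, column) coordinate convention are illustrative assumptions.

```python
import random

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
N, GAMMA, ALPHA, EPSILON = 4, 0.9, 0.5, 0.1
GOAL, TRAP = (3, 3), (3, 0)                      # +1 and -1 terminal cells

def step(state, action):
    """Apply an action; moves that leave the grid keep the agent in place."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), N - 1), min(max(state[1] + dy, 0), N - 1))
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == TRAP:
        return nxt, -1.0, True
    return nxt, 0.0, False

Q = {}                                           # Q[(state, action)], default 0.0

def choose_action(state):
    if random.random() < EPSILON:                # epsilon-greedy exploration
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

for episode in range(2000):
    state, done = (0, 0), False                  # start at the top-left corner
    while not done:
        action = choose_action(state)
        nxt, reward, done = step(state, action)
        best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[(state, action)] = Q.get((state, action), 0.0) + ALPHA * (
            reward + GAMMA * best_next - Q.get((state, action), 0.0))
        state = nxt

print(max(ACTIONS, key=lambda a: Q.get(((0, 0), a), 0.0)))  # greedy move at the start
```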

Question three
a) Summarize the concept of alpha-beta pruning in adversarial search?

Alpha-beta pruning is a technique used in adversarial search algorithms, such as minimax, to
improve the efficiency of searching through the game tree. The main idea behind alpha-beta
pruning is to avoid evaluating unnecessary branches of the tree by using two values: alpha and
beta.

In adversarial search, the goal is to find the best move for a player in a game against an
opponent. The game tree represents all possible moves and their resulting states. Minimax is a
common algorithm used to explore this tree and determine the optimal move.

During the minimax search, the algorithm explores the tree recursively, alternating between
maximizing the player’s score and minimizing the opponent’s score. However, alpha-beta
pruning helps to reduce the number of unnecessary evaluations by keeping track of two values:
alpha and beta.

Alpha represents the best score found so far for the maximizing player, while beta represents the
best score found so far for the minimizing player. Initially, alpha is set to negative infinity, and
beta is set to positive infinity.

As the algorithm traverses the tree, it updates the alpha and beta values based on the scores
encountered. When a node’s score exceeds the current beta value, it means that the opponent
would never choose that path, so there’s no need to explore it further. Similarly, if a node’s score
is lower than the current alpha value, the maximizing player would never choose that path.

By pruning these branches, the algorithm avoids evaluating unnecessary subtrees and
significantly reduces the search space. This leads to a substantial improvement in the efficiency
of the adversarial search algorithm, allowing it to search deeper into the game tree and make
more informed decisions.

Overall, alpha-beta pruning is a powerful technique in adversarial search that optimizes the
exploration of the game tree by eliminating irrelevant branches, ultimately leading to more
efficient decision-making.
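A minimal Python sketch of minimax with alpha-beta pruning; the list-based game tree (leaves are utility values) is an illustrative representation rather than any particular game:

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning; a node is either a number (leaf utility)
    or a list of child nodes."""
    if isinstance(node, (int, float)):           # terminal node: return its utility
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:                    # beta cutoff: MIN would avoid this branch
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:                        # alpha cutoff: MAX would avoid this branch
            break
    return best

# Classic 3-ply example: the value for the maximizing player is 3, and several
# leaves in the second and third branches are never evaluated thanks to pruning.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, maximizing=True))          # -> 3
```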

b.) The process of traversing the game tree using the Minimax algorithm involves recursively
evaluating the nodes in the tree to determine the optimal move for a player. Summarize the steps
involved

The following are the steps involved in traversing the game tree using the Minimax
algorithm to determine the optimal move for a player:

1.) Start at the root node of the game tree, representing the current state of the game.

2.) If the current node is a terminal node (i.e., the game is over), evaluate the node’s utility
value. The utility value represents the desirability of the outcome for the player (e.g., a
win may have a higher utility value than a loss).

3.) If the current node is a maximizing player’s node, initialize the best value to negative
infinity. Iterate through each child node, which represents the possible moves for the
maximizing player.

4.) Recursively call the Minimax algorithm on each child node. This step alternates between
maximizing and minimizing player nodes.

5.) Update the best value to the maximum of the current best value and the utility value
obtained from the child node.

6.) If the current node is a minimizing player’s node, initialize the best value to positive
infinity. Iterate through each child node, representing the possible moves for the
minimizing player.

7.) Recursively call the Minimax algorithm on each child node. Again, this step alternates
between maximizing and minimizing player nodes.

8.) Update the best value to the minimum of the current best value and the utility value
obtained from the child node.

9.) After evaluating all child nodes, return the best value found.

10.) As the recursion unwinds back to the root node, the best value at each level
represents the optimal move for the corresponding player.

11.) Finally, the algorithm returns the optimal move or the best move found at the root
level.

By following these steps, the Minimax algorithm explores the game tree by recursively
evaluating nodes and propagating the best move choices up to the root node, ultimately
determining the optimal move for the player.
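A minimal Python sketch of the plain Minimax recursion described above (without pruning), using the same illustrative list-based tree representation as the alpha-beta sketch earlier:

```python
def minimax(node, maximizing):
    """Plain minimax; a node is either a number (leaf utility) or a list of children."""
    if isinstance(node, (int, float)):
        return node                                            # step 2: terminal node
    if maximizing:
        return max(minimax(child, False) for child in node)    # steps 3-5
    return min(minimax(child, True) for child in node)         # steps 6-8

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, maximizing=True))            # -> 3, the same value as with pruning
```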

c.) In a two-player, zero-sum game, the Minimax algorithm determines the best move by
considering the actions that maximize the advantage for one player while minimizing the
advantage for the other player. Explain how the algorithm work

The Minimax algorithm is used to determine the best move for a player in a two-player, zero-
sum game, where the outcome is either a win or a loss, and the sum of the players’ utility values
is always zero. The algorithm works by considering the actions that maximize the advantage for
one player while minimizing the advantage for the other player.

Here's how the Minimax algorithm works:

1.) The algorithm starts at the root node of the game tree, representing the current state of the
game.

2.) If the current node is a terminal node (i.e., the game is over), the algorithm evaluates the
utility value of the node. The utility value represents the desirability of the outcome for
the player. For example, a win may have a higher utility value than a loss.

3.) If the current node is a maximizing player’s node, the algorithm initializes the best value
to negative infinity. It then evaluates each child node, which represents the possible
moves for the maximizing player.

4.) The algorithm recursively calls itself on each child node, treating them as the current
node, but now as a minimizing player’s node.

5.) The algorithm updates the best value to the maximum of the current best value and the
utility value obtained from the child node.

6.) If the current node is a minimizing player’s node, the algorithm initializes the best value
to positive infinity. It evaluates each child node, representing the possible moves for the
minimizing player.

7.) The algorithm recursively calls itself on each child node, treating them as the current
node, but now as a maximizing player’s node.

8.) The algorithm updates the best value to the minimum of the current best value and the
utility value obtained from the child node.

9.) After evaluating all child nodes, the algorithm returns the best value found.

10.) As the recursion unwinds back to the root node, the algorithm propagates the best
move choices up to the root node.

11.) Finally, at the root level, the algorithm returns the move associated with the best
value, which represents the optimal move for the player.

By considering all possible moves and evaluating their utility values, the Minimax algorithm
systematically searches through the game tree to determine the best move. It assumes that both
players will make optimal moves, and thus, it selects the move that maximizes the advantage for
the player while minimizing the advantage for the opponent

Question four
a) Broadly what is the concept of entropy in decision tree algorithms?
In the context of decision tree algorithms in artificial intelligence, entropy is a measure of
impurity or disorder within a set of data. It is used as a criterion for splitting the data at
each node of the decision tree.
Entropy is calculated based on the distribution of class labels in a given set of data. In a
binary classification problem, where there are two possible outcomes (e.g., positive and
negative), entropy is calculated using the proportion of positive and negative examples in
the data set.

The formula for entropy is:


Entropy = -p(positive) * log2(p(positive)) - p(negative) * log2(p(negative))

where p(positive) and p(negative) represent the proportions of positive and negative
examples in the data set, respectively.
The goal of a decision tree algorithm is to minimize the entropy or maximize the
information gain at each split. Information gain measures the reduction in entropy achieved
by splitting the data based on a particular attribute or feature. It quantifies how much
information is gained by partitioning the data according to that attribute.

The decision tree algorithm iteratively selects the attribute that maximizes the information
gain and splits the data based on that attribute. This process is repeated recursively until a
stopping criterion is met, such as reaching a maximum tree depth or a minimum number
of samples required to create a leaf node.

By using entropy and information gain, decision tree algorithms aim to create a tree
structure that effectively classifies the data by minimizing impurity and maximizing the
separation of different classes at each node. The resulting decision tree can be used for
prediction and classification tasks based on the learned patterns in the training data.
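For illustration, the binary-entropy formula above can be computed with a few lines of Python (the function name is an arbitrary choice):

```python
import math

def binary_entropy(p_positive):
    """Entropy (in bits) of a two-class distribution."""
    total = 0.0
    for p in (p_positive, 1.0 - p_positive):
        if p > 0:                      # 0 * log2(0) is taken to be 0
            total -= p * math.log2(p)
    return total

print(binary_entropy(0.5))   # 1.0: maximally impure set
print(binary_entropy(1.0))   # 0.0: pure set
```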

b) Broadly how does a decision tree algorithm work?


A decision tree algorithm is a supervised learning method that constructs a tree-like model
to make decisions or predictions based on input data. It partitions the data based on a set
of rules learned from the training examples.
Here's a general overview of how a decision tree algorithm works:
1. Data Preparation: The algorithm begins with a set of labeled training examples, where
each example consists of a set of input features and a corresponding target or class label.
2. Attribute Selection: The algorithm selects the best attribute or feature from the available
features to split the data at the root of the tree. The selection criterion is typically based on
measures such as entropy or Gini impurity, aiming to maximize information gain or purity.

3. Splitting: The data is partitioned into subsets based on the selected attribute. Each subset
corresponds to a branch or child node of the tree.
4. Recursive Splitting: The algorithm repeats the splitting process independently for each
child node. It selects the best attribute to split the data at each node based on the remaining
features and the impurity measure.
5. Stopping Criteria: The splitting process continues recursively until a stopping criterion
is met. Common stopping criteria include reaching a maximum tree depth, a minimum
number of training examples in a node, or when no further improvement in purity or
information gain is achievable.
6. Leaf Node Assignment: Once the splitting process stops, the algorithm assigns a class
label to each leaf node. This is typically determined by majority voting, assigning the
most frequent class label of the training examples in the corresponding node.
7. Pruning (optional): In some cases, the decision tree may grow excessively complex and
overfit the training data. Pruning techniques can be applied to reduce the complexity and
improve generalization by removing or merging certain nodes and branches.
8. Prediction: The constructed decision tree can be used to make predictions or decisions
on unseen or test data. For a given input, the data traverses the tree from the root to a leaf
node based on the attribute conditions, and the class label assigned to that leaf node is
returned as the prediction.
The decision tree algorithm is intuitive and interpretable, as the resulting tree structure can
be visualized and understood easily. It is widely used in various applications such as
classification, regression, and feature selection, due to its simplicity and ability to handle
both numerical and categorical data.
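For illustration only: assuming scikit-learn is available, the pipeline above is what a library decision-tree learner performs internally, and a typical usage looks like the sketch below; the tiny dataset is made up.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: each row is [habitat_is_land, warm_blooded]; labels are the classes.
X = [[1, 1], [0, 1], [1, 0], [0, 0]]
y = ["mammal", "mammal", "reptile", "reptile"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy-based splits
clf.fit(X, y)                                   # steps 1-6: selection, splitting, leaves
print(export_text(clf, feature_names=["habitat_land", "warm_blooded"]))
print(clf.predict([[1, 1]]))                    # step 8: prediction for a new example
```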

c) How are the calculations of information gain performed using both the Gini index and
entropy in decision tree algorithms?

Information gain is a measure used in decision tree algorithms to determine the best
attribute
for splitting the data. It quantifies the amount of information gained by partitioning the data
based on a particular attribute. Both Gini index and entropy are commonly used to calculate
information gain. Here's how the calculations are performed using these measures:
1. Gini Index:
The Gini index measures the impurity or disorder of a set of data. It ranges from 0, which
represents a pure set (all examples belong to the same class), up to a maximum of 1 - 1/k for
k classes (0.5 in the binary case), which corresponds to an equal distribution of examples
across the classes.
To calculate the Gini index for a given attribute:
a. Calculate the Gini index for each possible value of the attribute by considering the
proportion of examples in each class.
- For a binary classification problem, where there are two classes (positive and negative),
the Gini index for a value v is calculated as:
Gini(v) = 1 - (p(positive|v)^2) - (p(negative|v)^2)
where p(positive|v) represents the proportion of positive examples given value v, and
p(negative|v) represents the proportion of negative examples given value v.

b. Calculate the weighted average Gini index (Gini index after the split) by considering the
proportions of each value in the attribute.
- For an attribute A with n possible values, the weighted Gini index (Gini(A)) after the
split is calculated as:
Gini(A) = (n1/N) * Gini(v1) + (n2/N) * Gini(v2) + ... + (nk/N) * Gini(vk)
where n1, n2, ..., nk represent the number of examples for each value v1, v2, ..., vk, and
N is the total number of examples.
c. Calculate the information gain by subtracting the weighted Gini index (after the split)
from the Gini index (before the split):
Information Gain = Gini(before split) - Gini(A)

2. Entropy:
Entropy is another measure of impurity or disorder in a set of data. It ranges from 0, which
represents a pure set, to log2(C), where C is the number of classes; this maximum (1 for
binary classification) is reached when examples are distributed equally across all classes.
To calculate the entropy for a given attribute:

a. Calculate the entropy for each possible value of the attribute by considering the
proportion of examples in each class.
- For a binary classification problem, the entropy for a value v is calculated as:

Entropy(v) = -p(positive|v) * log2(p(positive|v)) - p(negative|v) * log2(p(negative|v))

where p(positive|v) represents the proportion of positive examples given value v, and
p(negative|v) represents the proportion of negative examples given value v.

b. Calculate the weighted average entropy (entropy after the split) by considering the
proportions of each value in the attribute.
- For an attribute A with n possible values, the weighted entropy (Entropy(A)) after the
split is calculated as:
Entropy(A) = (n1/N) * Entropy(v1) + (n2/N) * Entropy(v2) + ... + (nk/N) * Entropy(vk)
where n1, n2, ..., nk represent the number of examples for each value v1, v2, ..., vk, and
N is the total number of examples.

c. Calculate the information gain by subtracting the weighted entropy (after the split) from
the entropy (before the split):
Information Gain = Entropy(before split) - Entropy(A)

In both cases, the attribute with the highest information gain is selected as the best attribute
for splitting the data at a particular node in the decision tree algorithm. This process is
repeated recursively to construct the decision tree.
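Both calculations can be sketched in a few lines of Python; the function names and the example split (warm-blooded versus cold-blooded animals, matching the data used in part d) are illustrative:

```python
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(parent_labels, subsets, impurity):
    """Impurity before the split minus the weighted impurity after the split."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * impurity(s) for s in subsets)
    return impurity(parent_labels) - weighted

parent = ["mammal", "mammal", "reptile", "reptile", "reptile"]
split = [["mammal", "mammal", "reptile"], ["reptile", "reptile"]]  # warm vs cold
print(round(gain(parent, split, entropy), 3))   # 0.42  (information gain with entropy)
print(round(gain(parent, split, gini), 3))      # 0.213 (gain with the Gini index)
```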

d) Let's consider a scenario where we have a dataset of animals classified as either "mammals"
or "reptiles." We want to determine the best attribute to split the data based on the concept
of entropy and calculate the information gain. Determine which attribute (Habitat or Body
Temperature) is more informative in splitting the data based on the entropy measure. You
may use either entropy or gini index, choose only one metric and show your work clearly
Table 4-1 Animals prediction
Animal      Habitat   Body Temperature   Class
Lion        Land      Warm               Mammal
Dolphin     Water     Warm               Mammal
Eagle       Air       Warm               Reptile
Snake       Land      Cold               Reptile
Crocodile   Water     Cold               Reptile

To determine which attribute is more informative, we need to calculate the entropy and
information gain for each attribute (Habitat and Body Temperature).
1. Entropy calculation for the target variable (Class):
• Number of mammals: 2 (Lion, Dolphin)
• Number of reptiles: 3 (Eagle, Snake, Crocodile)
• Total examples: 5
• Proportion of mammals: 2/5
• Proportion of reptiles: 3/5
Entropy(Class) = - (2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971
2. Entropy calculation after splitting based on the attribute "Habitat":
• For the value "Land":
• Number of mammals: 1 (Lion)
• Number of reptiles: 1 (Snake)
• Total examples: 2
• Proportion of mammals: 1/2
• Proportion of reptiles: 1/2
Entropy(Habitat=Land) = - (1/2) * log2(1/2) - (1/2) * log2(1/2) = 1
• For the value "Air":
• Number of mammals: 0
• Number of reptiles: 1 (Eagle)
• Total examples: 1
• Proportion of mammals: 0
• Proportion of reptiles: 1

Entropy(Habitat=Air) = 0
• For the value "Water":
• Number of mammals: 1 (Dolphin)
• Number of reptiles: 1 (Crocodile)
• Total examples: 2
• Proportion of mammals: 1/2
• Proportion of reptiles: 1/2
Entropy(Habitat=Water) = - (1/2) * log2(1/2) - (1/2) * log2(1/2) = 1
Weighted average entropy (after split) = (2/5) * Entropy(Habitat=Land) + (1/5) *
Entropy(Habitat=Air) + (2/5) * Entropy(Habitat=Water) = (2/5) * 1 + (1/5) * 0 + (2/5) * 1
= 0.8
Information Gain(Habitat) = Entropy(Class) - Weighted average entropy (after split) =
0.971 - 0.8 =0.171
3. Entropy calculation after splitting based on the attribute "Body Temperature":
• For the value "Warm":
• Number of mammals: 2 (Lion, Dolphin)
• Number of reptiles: 1 (Eagle)
• Total examples: 3
• Proportion of mammals: 2/3
• Proportion of reptiles: 1/3
Entropy(Body Temperature=Warm) = - (2/3) * log2(2/3) - (1/3) * log2(1/3) ≈ 0.918
• For the value "Cold":
• Number of mammals: 0
• Number of reptiles: 2 (Snake, Crocodile)
• Total examples: 2
• Proportion of mammals: 0
• Proportion of reptiles: 1
Entropy(Body Temperature=Cold) = 0
Weighted average entropy (after split) = (3/5) * Entropy(Body Temperature=Warm) + (2/5) *
Entropy(Body Temperature=Cold) = (3/5) * 0.918 + (2/5) * 0 ≈ 0.551
Information Gain(Body Temperature) = Entropy(Class) - Weighted average entropy (after
split) = 0.971 - 0.551 = 0.42

Based on the entropy measure, the attribute "Body Temperature" has a higher information
gain (0.42) compared to the attribute "Habitat" (0.171). Therefore, "Body Temperature" is
considered more informative in splitting the data.
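The arithmetic above can be checked with a short Python snippet:

```python
import math

def H(*proportions):
    """Entropy (in bits) of a class distribution given as proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

h_class = H(2/5, 3/5)                                           # ~0.971
ig_habitat = h_class - (2/5 * H(1/2, 1/2) + 1/5 * H(1.0) + 2/5 * H(1/2, 1/2))
ig_body_temp = h_class - (3/5 * H(2/3, 1/3) + 2/5 * H(1.0))
print(round(ig_habitat, 3), round(ig_body_temp, 3))             # -> 0.171 0.42
```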

e) Given the dataset (see Table 4-2) of animals classified as either "mammals" or "reptiles"
based on their habitat and body temperature, how can we use the Naive Bayes classifier to
predict whether a new animal, such as a cow, belongs to the "mammal" or "reptile" class
based on its habitat and body temperature? Assume binary variables (1 if the attribute is
present, 0 if not present) and calculate the posterior probabilities for both classes.
Table 4-2 Animals prediction

To use the Naive Bayes classifier to predict the class (mammal or reptile) of a new animal,
such as a cow, based on its habitat and body temperature, we need to calculate the
posterior probabilities for both classes.
We have binary variables for Habitat (1 if present, 0 if not present) and Body Temperature
(1 if warm-blooded, 0 if cold-blooded).
To calculate the posterior probabilities for each class, we can use Bayes' theorem:
P(Class|Habitat, Body Temperature) = (P(Habitat, Body Temperature|Class) * P(Class)) /
P(Habitat, Body Temperature)
Let's calculate the posterior probabilities for both classes (mammal and reptile) based on
the given dataset:
1. Calculate prior probabilities:
- P(Mammal) = Number of mammal examples / Total examples
=2/5
= 0.4
- P(Reptile) = Number of reptile examples / Total examples
=3/5
= 0.6
2. Calculate likelihoods:
- P(Habitat=1|Class=Mammal) = Number of mammal examples with Habitat=1 / Number
of mammal examples
=1/2
= 0.5
- P(Habitat=1|Class=Reptile) = Number of reptile examples with Habitat=1 / Number of
reptile examples
=2/3
≈ 0.667

- P(Body Temperature=1|Class=Mammal) = Number of mammal examples with Body
Temperature=1 / Number of mammal examples
=2/2
=1
- P(Body Temperature=1|Class=Reptile) = Number of reptile examples with Body
Temperature=1 / Number of reptile examples
=1/3
≈ 0.333
3. Calculate the evidence or marginal likelihood (normalizing factor), using the naive
independence assumption P(Habitat, Body Temperature|Class) = P(Habitat|Class) * P(Body
Temperature|Class):
- P(Habitat=1, Body Temperature=1) = Σ P(Habitat=1, Body Temperature=1|Class) * P(Class)
= P(Habitat=1|Mammal) * P(Body Temperature=1|Mammal) * P(Mammal)
+ P(Habitat=1|Reptile) * P(Body Temperature=1|Reptile) * P(Reptile)
= (0.5 * 1 * 0.4) + (0.667 * 0.333 * 0.6)
≈ 0.2 + 0.133
≈ 0.333

4. Calculate the posterior probabilities for each class using Bayes' theorem:
- P(Class=Mammal|Habitat=1, Body Temperature=1)
= (P(Habitat=1|Mammal) * P(Body Temperature=1|Mammal) * P(Mammal)) / P(Habitat=1, Body Temperature=1)
= (0.5 * 1 * 0.4) / 0.333
≈ 0.60
- P(Class=Reptile|Habitat=1, Body Temperature=1)
= (P(Habitat=1|Reptile) * P(Body Temperature=1|Reptile) * P(Reptile)) / P(Habitat=1, Body Temperature=1)
= (0.667 * 0.333 * 0.6) / 0.333
≈ 0.40
Based on these calculations, the posterior probability of the class "Mammal" for a new animal
such as a cow, with Habitat = 1 and Body Temperature = 1 (warm-blooded), is approximately
0.60, while the posterior probability of the class "Reptile" is approximately 0.40. The Naive
Bayes classifier therefore predicts that the cow belongs to the "Mammal" class.
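These numbers can be reproduced with a short Python sketch; the likelihood values are the ones computed in steps 1 and 2 above, and the variable names are illustrative:

```python
# Priors and per-feature likelihoods taken from steps 1 and 2 above.
prior = {"mammal": 2/5, "reptile": 3/5}
p_habitat1 = {"mammal": 1/2, "reptile": 2/3}
p_bodytemp1 = {"mammal": 2/2, "reptile": 1/3}

# Naive Bayes: unnormalized posterior = prior * product of the feature likelihoods.
score = {c: prior[c] * p_habitat1[c] * p_bodytemp1[c] for c in prior}
evidence = sum(score.values())                      # P(Habitat=1, Body Temperature=1)
posterior = {c: score[c] / evidence for c in score}
print({c: round(p, 2) for c, p in posterior.items()})   # {'mammal': 0.6, 'reptile': 0.4}
```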
