ML Assignment-01
PART-A
ANS: The Random Forest algorithm is an ensemble learning method primarily used for classification
and regression tasks. It operates by constructing multiple decision trees during training and outputs the
mode of the classes (classification) or the mean prediction (regression) of the individual trees.
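The aggregation step described above can be sketched in a few lines. This is only an illustration: the function names and the five tree outputs are hypothetical, and the trees themselves are assumed to be trained already.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs from five already-trained trees for one input.
votes = ["yes", "no", "yes", "yes", "no"]
values = [2.0, 3.0, 4.0, 3.0, 3.0]

print(aggregate_classification(votes))  # majority label
print(aggregate_regression(values))     # average prediction
```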
ANS: Logistic Regression is a statistical method used for binary classification problems, where
the goal is to predict one of two possible outcomes (e.g., yes/no, true/false, 0/1). It estimates the
probability that a given input belongs to a particular class by applying the logistic (sigmoid)
function to a weighted sum of the input features.
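A minimal sketch of how such a probability can be computed, assuming the standard logistic (sigmoid) form of the model. The weights, bias, and input values are hypothetical and chosen only for illustration; in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability that the input belongs to class 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 decision using a cut-off."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical parameters for a model with two features.
weights = [0.8, -0.4]
bias = -0.2
sample = [1.0, 2.0]

p = predict_proba(sample, weights, bias)
label = predict(sample, weights, bias)
print(p, label)
```

The 0.5 threshold is a common default, not a requirement; it can be moved when one kind of error is more costly than the other.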
ANS: A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and
regression tasks. It is particularly effective for binary classification problems. The key idea behind SVM is
to find the optimal boundary, known as a hyperplane, that separates the classes with the largest possible margin.
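To make the hyperplane idea concrete, the sketch below classifies a point by which side of the boundary w·x + b = 0 it falls on. The weight vector and bias are hypothetical values, not the result of training, and the margin formula 2/||w|| assumes the usual scaling in which the closest points satisfy |w·x + b| = 1.

```python
import math


def decision_value(x, w, b):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin_width(w):
    """Width of the margin, 2 / ||w||, under the canonical scaling."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane in two dimensions.
w = [3.0, 4.0]
b = -5.0

print(classify([3.0, 4.0], w, b))
print(classify([0.0, 0.0], w, b))
print(margin_width(w))
```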
PART-B
ANS: A Decision Tree is a supervised learning algorithm used for both classification and
regression tasks. It recursively splits the dataset into subsets based on the most informative
feature at each step, forming a tree-like structure that predicts the target variable.
Key Concepts:
1. Root Node: Represents the entire dataset and splits into subsets.
2. Decision Nodes: Intermediate nodes where the data gets further split.
3. Leaf Nodes: The final nodes representing the output (class label in classification or value
in regression).
4. Splitting Criteria: Measures like Gini Index, Information Gain, or Variance Reduction
(for regression) are used to split the data at each node.
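The node types listed above can be represented with one small data structure. The sketch below is an assumption-laden illustration: the Node class and its field names are hypothetical, and the tiny tree uses an age threshold of 35 with assumed leaf labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Decision nodes (including the root) carry a feature and a threshold.
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    # Leaf nodes carry only a prediction.
    prediction: Optional[str] = None

    def is_leaf(self):
        return self.prediction is not None


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while not node.is_leaf():
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical one-split tree: the root is also the only decision node.
root = Node(
    feature="age",
    threshold=35,
    left=Node(prediction="No"),
    right=Node(prediction="Yes"),
)

print(predict(root, {"age": 30}))
print(predict(root, {"age": 50}))
```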
Example: Classification using a Decision Tree
Let's use a simple dataset to predict whether a person will buy a product based on age and
income:
1. Choose the Best Splitting Attribute: To determine the best split, we use Information
Gain or the Gini Index. Let's assume we use Gini Index here.
2. Calculate Gini for Each Split: Let's consider splitting the data based on Age and
Income. We start with Age and find the best threshold for splitting.
o Age ≤ 35:
Group 1 (Age ≤ 35): {Person 1, Person 2, Person 3}
Group 2 (Age > 35): {Person 4, Person 5, Person 6, Person 7}
For Group 1:
2 people don’t buy (No), 1 person buys (Yes).
Gini Index for Group 1: 1 - (1/3)^2 - (2/3)^2 = 4/9 ≈ 0.444
For Group 2:
1 person doesn’t buy (No), 3 people buy (Yes).
Gini Index for Group 2: 1 - (3/4)^2 - (1/4)^2 = 3/8 = 0.375
o Similarly, we calculate for other possible splits (like Income) and find the split
that minimizes the weighted Gini Index of the resulting groups.
3. Split the Data: Assume that splitting by Age ≤ 35 gives the lowest weighted Gini Index.
We now create two branches:
o For Age ≤ 35, we check further conditions like Income.
o For Age > 35, we check the next best feature.
4. Repeat Until Stopping Criteria: The process repeats for each subset until either:
o All data points in a node belong to a single class.
o The maximum depth of the tree is reached.
o There are no further significant splits.
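The Gini calculation in step 2 can be checked with a short script. It uses only the class counts given for the Age ≤ 35 split (2 No and 1 Yes in Group 1, 1 No and 3 Yes in Group 2); the function names are hypothetical, and weighting each group by its size is the usual way to score a split.

```python
def gini(counts):
    """Gini Index of one group, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(groups):
    """Gini Index of a split: each group's Gini weighted by its share of samples."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)


# Class counts as [No, Yes] for the Age <= 35 split.
group1 = [2, 1]  # Age <= 35
group2 = [1, 3]  # Age > 35

print(round(gini(group1), 3))                     # about 0.444
print(round(gini(group2), 3))                     # 0.375
print(round(weighted_gini([group1, group2]), 3))  # about 0.405
```

The same weighted_gini call can be repeated for every candidate split, keeping the one with the smallest value.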
Advantages:
ANS: Linear Regression is a supervised learning algorithm used for predicting a continuous
target variable based on one or more independent (input) variables. It assumes a linear
relationship between the input variables (features) and the output variable (target). The goal is to
find the line (in the case of one feature) or the hyperplane (in the case of multiple features) that
best fits the data.
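As a rough illustration for the single-feature case, the sketch below fits a line by ordinary least squares, assuming that "best fits" means minimizing the sum of squared errors. The function name and the four data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict the target for a new input.
print(slope * 5.0 + intercept)
```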
Key Concepts: