Machine Learning
1. Decision Tree: A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes test a condition and branch into multiple sub-trees, whereas leaf nodes hold the output of those decisions and do not branch any further.
A decision tree is essentially a giant structure of nested if-else statements. It is easy to understand and cheap to evaluate, because a prediction follows a single root-to-leaf path (roughly logarithmic in the number of samples for a balanced tree). However, the algorithm is prone to overfitting (this is addressed by the random forest).
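To make the "nested if-else" view concrete, here is a rough sketch of a hand-written tree; the feature names, thresholds, and labels are made up purely for illustration, not taken from any real dataset.

# A hypothetical, hand-written decision tree: nothing but nested if-else checks.
# Feature names and thresholds here are invented for illustration only.
def predict_play_tennis(outlook: str, humidity: float, wind: str) -> str:
    if outlook == "sunny":          # decision node
        if humidity > 70:           # decision node
            return "no"             # leaf node
        return "yes"                # leaf node
    elif outlook == "rainy":        # decision node
        if wind == "strong":
            return "no"
        return "yes"
    return "yes"                    # leaf node for "overcast"

print(predict_play_tennis("sunny", 85.0, "weak"))  # -> "no"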
I. Entropy
Entropy is a measurement of the disorder (impurity) in a set of samples.
The mathematical formula for entropy is:
E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
where p_i is the proportion of the data points belonging to class i and c is the number of classes.
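A minimal sketch of this formula in plain Python, assuming the labels are given as a simple list:

import math
from collections import Counter

def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n            # proportion of samples in this class
        total -= p * math.log2(p)
    return total

print(entropy(["yes", "yes", "no", "no"]))    # 1.0 -> maximum disorder for two classes
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node, no disorder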
Overfitting:
Overfitting can be controlled with the hyperparameter max_depth, which limits the depth of the tree.
There are other hyperparameters as well, such as min_samples_split (the minimum number of samples required to split a node); see the sketch below.
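A short scikit-learn sketch showing how max_depth and min_samples_split are passed to a decision tree; the dataset and the specific hyperparameter values are illustrative choices, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Limiting the depth and the minimum samples needed to split a node
# restricts how far the tree can grow, which reduces overfitting.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
tree.fit(X_train, y_train)
print("Train accuracy:", tree.score(X_train, y_train))
print("Test accuracy:", tree.score(X_test, y_test))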
Random Forest (very important)
Random Forest is a strong general-purpose algorithm that works well on most machine learning problems, and it can be used for both regression and classification tasks.
Random Forest is bagging with a group of trees: when decision trees are used as the base models in the bagging technique, the resulting ensemble is called a random forest. A short code sketch of bagging appears after the lists below.
1. Bootstrap Sampling:
o Create multiple subsets of the training data by sampling with
replacement.
o Each subset has the same size as the original training set but may
contain duplicate samples and exclude others.
2. Train Models:
o Train one base model (here, a decision tree) on each bootstrap sample.
3. Aggregate Predictions:
o For classification: Use majority voting (the most common class
among models).
o For regression: Use the average of predictions from all models.
1. Reduces Variance:
o By averaging predictions, bagging stabilizes the results and
reduces the likelihood of overfitting.
2. Improves Accuracy:
o Even if individual models are weak learners, their combined
output can be much stronger.
3. Handles Overfitting:
o Particularly effective with high-variance models like decision trees.
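A minimal scikit-learn sketch of bagging with decision trees as the base model; the dataset and n_estimators are illustrative, and note that the base-model argument is named estimator in recent scikit-learn versions (base_estimator in older ones).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is trained on a bootstrap sample (sampling with replacement),
# and the final prediction is a majority vote over all trees.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)
bag.fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))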
Suppose we have a large dataset and we use decision trees as the base models of a random forest. We build n models to train on the dataset, but we do not give all of the data to any single model; instead, we sample the data for each model (by rows, by columns, or by a combination of both).
After training we have n decision tree models, each trained on its own sample of the dataset. When we predict on new data, we collect the prediction outputs from all of the trained models, as shown in the sketch after the steps below.
1. Bootstrap Sampling:
o Create multiple subsets of the dataset by sampling with
replacement.
o Each subset is called a bootstrap sample and is used to train one
decision tree.
o Some data points will be repeated in a subset, while others may
be left out.
2. Train Models:
o Train one decision tree on each bootstrap sample.
3. Aggregate Predictions:
o For classification: Each tree votes for a class, and the majority vote
becomes the final prediction.
o For regression: The predictions are averaged to get the final
output.
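A sketch of both aggregation modes using scikit-learn's random forest estimators; the synthetic datasets and settings are chosen only for illustration.

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each tree votes, and the majority vote is the final prediction.
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:5]))

# Regression: the trees' predictions are averaged to get the final output.
Xr, yr = make_regression(n_samples=500, n_features=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:5]))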
In machine learning we want both low bias and low variance, but a single model usually has to trade one off against the other. Random Forest can get close to both: a decision tree is a low-bias, high-variance model, and by averaging many such trees the random forest brings the high variance down.
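One rough way to see this effect, on an illustrative synthetic dataset, is to compare cross-validation scores of a single tree against a random forest; the forest's scores are typically higher and less spread out across folds.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=1), X, y, cv=5
)

# A single tree: low bias, high variance. The forest averages many such trees,
# which typically raises the mean score and shrinks the spread across folds.
print("Tree:   mean=%.3f std=%.3f" % (tree_scores.mean(), tree_scores.std()))
print("Forest: mean=%.3f std=%.3f" % (forest_scores.mean(), forest_scores.std()))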
Advantages:
1. Reduces Overfitting:
o Individual decision trees may overfit the data, but combining
many trees reduces this risk.
2. Handles High-Dimensional Data:
o Random Forest works well even with many features and datasets
with high dimensionality.
3. Robust to Noise:
o Since it averages predictions, outliers and noisy data have less
impact.
4. Feature Importance:
o Random Forest provides a measure of feature importance, which
can help you identify the most influential features in your dataset.
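A short sketch of reading feature importances from a fitted forest; the feature names come from the example dataset used here.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ sums to 1; higher values mean more influential features.
for name, importance in sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{name}: {importance:.3f}")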