The document provides an overview of the Random Forest algorithm, a popular supervised machine learning technique used for classification and regression tasks. It explains the concepts of ensemble learning, bagging, and boosting, highlighting how Random Forest combines multiple decision trees to improve predictive accuracy while preventing overfitting. Additionally, it discusses the advantages, disadvantages, and applications of Random Forest in various sectors such as banking, medicine, land use, and marketing.
Lecture-12 Machine Learning With Python
Random Forest Algorithm
❖ Random Forest is one of the most popular and commonly used algorithms among data scientists. It is a supervised machine learning algorithm that is widely used for classification and regression problems.
❖ It is based on the concept of ensemble learning.
❖ Ensemble simply means combining multiple models: a collection of models, rather than an individual model, is used to make predictions.

Bagging
❖ Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest.
❖ Bagging draws a random sample from the entire data set for each model. Each model is therefore built from a sample (a bootstrap sample) drawn from the original data with replacement, which is known as row sampling. This step of row sampling with replacement is called bootstrap.
❖ Each model is then trained independently and generates its own results. The final output is obtained by combining the results of all models: by majority voting for classification, or by averaging for regression. This step of combining the results into one output is known as aggregation.
[Figure: Bagging]

Boosting
❖ Boosting is another technique that uses the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does so by building the weak models in series, typically with each new model focusing on the errors of the previous ones.
❖ There are several boosting algorithms. AdaBoost was the first really successful boosting algorithm and was developed for binary classification. AdaBoost is short for Adaptive Boosting and is a prevalent boosting technique that combines multiple “weak classifiers” into a single “strong classifier.”
[Figure: AdaBoost]

Random Forest Algorithm
❖ "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
❖ Instead of relying on a single decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.
❖ A larger number of trees in the forest generally leads to more stable and accurate predictions and reduces the risk of overfitting, although the improvement levels off as more trees are added.
❖ Random forest works on the bagging principle.

Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, some decision trees may predict the correct output while others may not. Taken together, however, the trees tend to predict the correct output. Two assumptions therefore make for a better random forest classifier:
❖ The feature variables of the dataset should contain some actual values, so that the classifier can predict accurate results rather than guessed results.
❖ The predictions from each tree must have very low correlations.

Steps Involved in Random Forest Algorithm
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase. The working process can be explained in the following steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority vote.

Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is given to the random forest classifier. The dataset is divided into subsets, and each subset is given to a decision tree.
During the training phase, each decision tree produces a prediction result. When a new data point arrives, the random forest classifier predicts the final decision based on the majority of those results.

Why use Random Forest?
❖ It often takes less training time than many other algorithms.
❖ It predicts output with high accuracy and runs efficiently even on large datasets.
❖ It can also maintain accuracy when a large proportion of the data is missing.

Applications of Random Forest
There are mainly four sectors where random forest is mostly used:
❖ Banking: The banking sector mostly uses this algorithm to identify loan risk.
❖ Medicine: With the help of this algorithm, disease trends and disease risks can be identified.
❖ Land Use: Areas of similar land use can be identified with this algorithm.
❖ Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest
❖ Random Forest is capable of performing both classification and regression tasks.
❖ It is capable of handling large datasets with high dimensionality.
❖ It enhances the accuracy of the model and reduces the overfitting issue.

Disadvantages of Random Forest
❖ Although random forest can be used for both classification and regression tasks, it is less well suited to regression tasks.

Difference Between Decision Tree and Random Forest
1. Decision tree: Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control.
   Random forest: Random forests are created from subsets of data, and the final output is based on average or majority ranking; hence the problem of overfitting is taken care of.
2. Decision tree: A single decision tree is faster in computation.
   Random forest: It is comparatively slower.
3. Decision tree: When a data set with features is taken as input by a decision tree, it will formulate some rules to make predictions.
   Random forest: Random forest randomly selects observations, builds a decision tree, and takes the average result. It doesn’t use any set of formulas.
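
To tie the lecture together, the bootstrap, independent training, and majority-voting steps described earlier can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not code from the lecture: it uses the standard library alone, every "tree" is simplified to a decision stump (a single split on one randomly chosen feature), and the function names and toy fruit measurements (weight in grams and a redness score) are hypothetical. A real random forest grows deeper trees and considers a random subset of features at every split.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Row sampling with replacement: draw len(rows) random indices."""
    idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def train_stump(rows, labels, rng):
    """Fit a one-split 'tree' (an assumed simplification) on a random feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Steps 1 to 4: build n_trees models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest


def predict_forest(forest, row):
    """Step 5 (aggregation): the class with the most votes wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (weight in grams, redness score between 0 and 1).
rows = [(150, 0.90), (170, 0.80), (160, 0.85),
        (120, 0.20), (110, 0.30), (130, 0.25)]
labels = ["apple", "apple", "apple", "lemon", "lemon", "lemon"]

forest = train_forest(rows, labels)
print(predict_forest(forest, (180, 0.95)))
print(predict_forest(forest, (100, 0.10)))
```

Individual stumps can be wrong, for example when a bootstrap sample happens to contain only one class, but the majority vote over many of them is usually right, which is the point of the ensemble. In practice one would normally rely on a library implementation such as scikit-learn's RandomForestClassifier rather than a hand-written version like this.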
Thank you!!