
Random Forest Classifier

(i) https://www.datacamp.com/tutorial/random-forests-classifier-python
(ii) The code below is for practice.
About dataset
The dataset used here is 'titanic.csv', which is freely available on Kaggle.com. It
includes the features PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare,
Cabin, and Embarked, along with the target column Survived.
1. Importing Libraries and reading dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Read the Titanic dataset into a DataFrame and display it
df = pd.read_csv("titanic.csv")
df
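Before preprocessing, it helps to check the shape, the column types, and which columns
contain missing values (in the standard Kaggle file these are Age, Cabin, and Embarked).
A quick inspection sketch:
# Inspect shape, column types, and per-column missing-value counts
print(df.shape)
df.info()
print(df.isnull().sum())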

2. Data preprocessing
# Drop columns with little predictive signal, then fill missing values with 0
df.drop(['Cabin','PassengerId','Name','Ticket'],axis=1,inplace=True)
df = df.fillna(0)
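Note that df.fillna(0) also writes 0 into the string column Embarked. A hedged
alternative sketch (assuming the standard Kaggle columns Age and Embarked) fills each
column with a more typical value instead:
# Alternative imputation: median for the numeric Age, most frequent value for Embarked
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])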

3. Handling categorical data

# Encode the string columns Sex and Embarked as integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])
df
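LabelEncoder maps Embarked to arbitrary integers (0, 1, 2), which implies an ordering
that does not exist in the data. A hedged alternative sketch, to be used instead of the
LabelEncoder line for Embarked, one-hot encodes that column:
# Alternative: one-hot encode Embarked so no artificial ordering is implied
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)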
4. Dependent and independent variables
# Putting feature variable to X
X = df.drop('Survived',axis=1)
# Putting response variable to y
y = df['Survived']

5. Splitting dataset into Training and Testing Set

# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
Next, split both X and y into training and testing sets with the help of the
train_test_split() function. Here train_size is 0.7, which means 70% of the data is used
for training and the remaining 30% for testing.
6. Implementing a Random forest classifier
# Import Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# Train the model using the training set
clf.fit(X_train, y_train)
Different parameters are used in the Random forest algorithm:
1. n_estimators (int, default=100)
The number of decision trees in the forest.
Note: Increasing the number of trees can increase accuracy, but be careful that it does
not lead to overfitting.
2. criterion ({"gini", "entropy"}, default="gini")
The function to measure the quality of a split; this is the criterion by which the
decision trees actually split on the variables:
 "gini" for the Gini impurity
 "entropy" for the information gain
3. max_depth (int, default=None)
The maximum depth of the tree (root node to terminal node).
Note: A high value means you are overcomplicating things, which can lead to overfitting,
so be careful while choosing the value.
4. min_samples_split (int or float, default=2)
The minimum number of samples required to split an internal node.
Remember: the lower the value, the higher the chance of overfitting; but that does not
mean you should choose a very high value, because that will over-generalize the model
and lead to underfitting. So choose the value accordingly.
5. min_samples_leaf (int or float, default=1)
The minimum number of samples required to be at a leaf node.
6. max_features ({"sqrt", "log2"}, int or float, default="sqrt")
The maximum number of features the random forest considers when looking for the best
split.
7. n_jobs (int, default=None)
The number of jobs to run in parallel. This is useful when you have the capability for
parallel processing: n_jobs=-1 means use all processors, while n_jobs=1 uses only one
processor.
8. random_state (int, RandomState instance or None, default=None)
Controls the randomness of the bootstrap sampling used when building trees (and of the
features considered at each split).
9. verbose (int, default=0)
Controls the verbosity when fitting and predicting; it prints run-time information.
You can hyper-tune these parameters by changing their values, as shown in the sketch
below. You can read my blog on hyper-tuning.
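As a quick sketch (reusing the clf and X_train names from above; the values shown are
illustrative defaults, not tuned choices), the parameters can be set explicitly when
constructing the classifier:
# Illustrative instantiation with the parameters described above (example values)
clf = RandomForestClassifier(n_estimators=100,    # number of trees in the forest
                             criterion='gini',    # split-quality measure
                             max_depth=None,      # grow each tree until leaves are pure
                             min_samples_split=2, # min samples to split an internal node
                             min_samples_leaf=1,  # min samples required at a leaf
                             max_features='sqrt', # features considered per split
                             n_jobs=-1,           # use all processors
                             random_state=42,     # reproducible randomness
                             verbose=0)           # no run-time messages
clf.fit(X_train, y_train)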
7. Predicting test cases using random forest
# Predicting the test set results
y_pred = clf.predict(X_test)
print(y_pred)
Output:
[0 1 1 0 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1
1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1
0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1]

8. Checking the accuracy score

from sklearn.metrics import classification_report
# Accuracy on the test set; equivalently, accuracy_score(y_test, y_pred) from sklearn.metrics
rand_score = clf.score(X_test, y_test)
classification_report_rf = classification_report(y_test, y_pred)
print("Accuracy score:", rand_score)
Output:
Accuracy score: 0.8268156424581006
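To hyper-tune the parameters from section 6 systematically rather than by hand, a
minimal cross-validated grid search sketch (the grid values here are illustrative):
# Minimal grid search over a few of the hyperparameters from section 6
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200],
              'max_depth': [None, 5, 10],
              'min_samples_leaf': [1, 3]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)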
