
Decision Tree and Random Forest
About the data set (Employee data)
The dataset contains information about employees. The aim is to identify which employees are likely to undergo attrition.

Attribute information:

Age: Age of the employee

BusinessTravel: How much travel the job involves for the employee: No Travel, Travel_Frequently, Travel_Rarely

Department: Department of the employee: Human Resources, Research & Development, Sales

DistanceFromHome: Number of miles of the employee's daily commute

EducationField: Employee education field: Human Resources, Life Sciences, Marketing, Medical Sciences, Technical, Other

EnvironmentSatisfaction: Employee's satisfaction with the office environment

Gender: Employee gender

JobInvolvement: Job involvement rating

JobLevel: Job level of the employee's designation

JobRole: Employee's job role, e.g. Research Scientist, Sales Executive, Laboratory Technician

JobSatisfaction: Employee job satisfaction rating

MonthlyIncome: Employee monthly salary

NumCompaniesWorked: Number of companies the employee has worked for

OverTime: Whether the employee is open to working overtime: Yes or No

PercentSalaryHike: Percent increase in salary

PerformanceRating: Overall employee performance rating

YearsAtCompany: Number of years the employee has worked at the company

Attrition: Whether the employee leaves the company: Yes or No

Table of Contents
1. Decision tree
2. Random forest

Import the required libraries


In [1]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        from matplotlib.colors import ListedColormap
        import seaborn as sns
        from warnings import filterwarnings
        filterwarnings('ignore')
        pd.options.display.max_columns = None
        pd.options.display.max_rows = None
        pd.options.display.float_format = '{:.6f}'.format
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import StandardScaler
        from sklearn.utils import resample
        from sklearn.utils import shuffle
        from sklearn import metrics
        from sklearn.metrics import classification_report
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.ensemble import RandomForestClassifier
        from sklearn import tree
        from sklearn.model_selection import GridSearchCV
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import roc_curve
        from sklearn.metrics import roc_auc_score
        from sklearn.metrics import confusion_matrix
        from sklearn.model_selection import cross_val_score
        import pydotplus
        from IPython.display import Image
        import random

In [2]: plt.rcParams['figure.figsize'] = [16, 8]

Load the csv file


In [3]: df_employee = pd.read_csv('emp_attrition.csv')
        df_employee.head().transpose()

Out[3]:
                                         0                       1                      2                       3
Age                                     33                      32                     40                      42
Attrition                              Yes                     Yes                    Yes                      No
BusinessTravel           Travel_Frequently           Travel_Rarely          Travel_Rarely           Travel_Rarely
Department                           Sales  Research & Development Research & Development  Research & Development
DistanceFromHome                         3                       4                      9                       7
EducationField               Life Sciences                 Medical          Life Sciences                 Medical
EnvironmentSatisfaction                  1                       4                      4                       2
Gender                                Male                    Male                   Male                  Female
JobInvolvement                           3                       1                      3                       4
JobLevel                                 1                       3                      1                       2
JobRole                 Research Scientist         Sales Executive  Laboratory Technician      Research Scientist
JobSatisfaction                          1                       4                      1                       2
MonthlyIncome                         3348                   10400                   2018                    2372
NumCompaniesWorked                       1                       1                      3                       6
OverTime                               Yes                      No                     No                     Yes
PercentSalaryHike                       11                      11                     14                      16
PerformanceRating                        3                       3                      3                       3
YearsAtCompany                          10                      14                      5                       1

(the fifth column of head() is truncated in the source)


In [4]: df_employee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1580 entries, 0 to 1579
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1580 non-null int64
1 Attrition 1580 non-null object
2 BusinessTravel 1580 non-null object
3 Department 1580 non-null object
4 DistanceFromHome 1580 non-null int64
5 EducationField 1580 non-null object
6 EnvironmentSatisfaction 1580 non-null int64
7 Gender 1580 non-null object
8 JobInvolvement 1580 non-null int64
9 JobLevel 1580 non-null int64
10 JobRole 1580 non-null object
11 JobSatisfaction 1580 non-null int64
12 MonthlyIncome 1580 non-null int64
13 NumCompaniesWorked 1580 non-null int64
14 OverTime 1580 non-null object
15 PercentSalaryHike 1580 non-null int64
16 PerformanceRating 1580 non-null int64
17 YearsAtCompany 1580 non-null int64
dtypes: int64(11), object(7)
memory usage: 222.3+ KB

Let's begin with some hands-on practice exercises

1. Decision tree

Detect outliers in the dataset and remove them using the IQR method, if any are present.


In [5]: df_employee.boxplot()
        plt.xticks(rotation='vertical', fontsize=12)
        plt.show()

In [6]: q1 = df_employee.quantile(0.25, numeric_only=True)
        q3 = df_employee.quantile(0.75, numeric_only=True)
        iqr = q3 - q1
        # restrict the comparison to the numeric columns the quantiles were computed on
        num = df_employee[iqr.index]
        df_employee = df_employee[~((num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))).any(axis=1)]
        df_employee = df_employee.reset_index(drop=True)
        df_employee.shape

Out[6]: (1487, 18)

Build a full model to predict whether an employee will leave the company. Find the three features that influence the model's predictions the most.

In [7]: df_target = df_employee['Attrition']
        df_feature = df_employee.drop('Attrition', axis=1)

In [8]: # Encode the target: 'Yes' -> 1, 'No' -> 0
        df_target = df_target.map({'Yes': 1, 'No': 0}).astype('int')


In [9]: df_num = df_feature.select_dtypes(include=[np.number])
        df_cat = df_feature.select_dtypes(include=['object'])  # np.object is removed in recent NumPy
        dummy_var = pd.get_dummies(data=df_cat, drop_first=True)
        X = pd.concat([df_num, dummy_var], axis=1)
        X.head()

Out[9]:
   Age  DistanceFromHome  EnvironmentSatisfaction  JobInvolvement  JobLevel  ...
0   33                 3                        1               3         1
1   32                 4                        4               1         3
2   40                 9                        4               3         1
3   42                 7                        2               4         2
4   43                27                        3               3         3

(remaining columns truncated in the source)


In [10]: # test_size inferred from the 447-row test set (447/1487 ≈ 0.3); random_state assumed
         X_train, X_test, y_train, y_test = train_test_split(X, df_target, test_size=0.3, random_state=50)
         dt_full = DecisionTreeClassifier(random_state=50).fit(X_train, y_train)
         y_pred_full = dt_full.predict(X_test)
         imp_features = pd.DataFrame({'Features': X_train.columns, 'Importance': dt_full.feature_importances_})
         imp_features.sort_values(by='Importance', ascending=False)

Out[10]:
                             Features  Importance
6                       MonthlyIncome    0.147614
29                       OverTime_Yes    0.099595
1                    DistanceFromHome    0.096689
10                     YearsAtCompany    0.094220
8                   PercentSalaryHike    0.094219
0                                 Age    0.089836
2             EnvironmentSatisfaction    0.059804
5                     JobSatisfaction    0.054751
7                  NumCompaniesWorked    0.050974
3                      JobInvolvement    0.040995
20                        Gender_Male    0.040213
19    EducationField_Technical Degree    0.026531
18               EducationField_Other    0.016882
21            JobRole_Human Resources    0.015804
4                            JobLevel    0.015257
26         JobRole_Research Scientist    0.010973
17             EducationField_Medical    0.009292
14                   Department_Sales    0.008895
24     JobRole_Manufacturing Director    0.008432
11   BusinessTravel_Travel_Frequently    0.007876
22      JobRole_Laboratory Technician    0.004835
12       BusinessTravel_Travel_Rarely    0.003221
16           EducationField_Marketing    0.003092
28       JobRole_Sales Representative    0.000000
27            JobRole_Sales Executive    0.000000
13  Department_Research & Development    0.000000
25          JobRole_Research Director    0.000000
23                    JobRole_Manager    0.000000
9                   PerformanceRating    0.000000
15      EducationField_Life Sciences     0.000000
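
The top of this table answers the question: MonthlyIncome, OverTime_Yes and DistanceFromHome influence the predictions the most. They can also be extracted programmatically — a one-line sketch, assuming the imp_features dataframe from In [10]:

# Three largest impurity-based importances
print(imp_features.nlargest(3, 'Importance'))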

Find the area under the receiver operating characteristic (ROC) curve for the full model built in the previous question.


In [23]: fpr, tpr, thresholds = roc_curve(y_test, y_pred_full)
         plt.plot([0, 1], [0, 1], 'r--')
         plt.plot(fpr, tpr)
         # the text call was truncated in the source; an f-string is assumed
         plt.text(x=0.02, y=0.8, s=f'AUC Score: {round(metrics.roc_auc_score(y_test, y_pred_full), 4)}')
         plt.grid(True)
         plt.show()

Plot a confusion matrix for the full model built above.

In [25]: cm = confusion_matrix(y_test, y_pred_full)
         # index labels and the colormap colour were truncated in the source and are assumed here
         conf_matrix = pd.DataFrame(data=cm, columns=['Predict Attr:No', 'Predict Attr:Yes'],
                                    index=['Actual Attr:No', 'Actual Attr:Yes'])
         sns.heatmap(conf_matrix, annot=True, fmt='d', cmap=ListedColormap(['lightblue']))
         plt.show()

Calculate the specificity, sensitivity, and the percentages of misclassified and correctly classified observations. What can you say about the model by looking at the sensitivity and specificity values? Is this a good model?


In [26]: tn = cm[0][0]
         tp = cm[1][1]
         fp = cm[0][1]
         fn = cm[1][0]
         total = tn + tp + fp + fn
         correct_classify = ((tn + tp) / total) * 100
         correct_classify

Out[26]: 89.48545861297539

In [ ]: mis_classify = ((fn + fp) / total) * 100
        mis_classify

In [ ]: specificity = tn / (tn + fp)
        specificity

In [ ]: sensitivity = tp / (tp + fn)
        sensitivity
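
As a cross-check, the same quantities can be obtained from sklearn's metric functions — a sketch, assuming the y_test and y_pred_full variables from above:

# Sensitivity is the recall of the positive class; specificity is the recall of the negative class
from sklearn.metrics import recall_score
sensitivity = recall_score(y_test, y_pred_full)               # tp / (tp + fn)
specificity = recall_score(y_test, y_pred_full, pos_label=0)  # tn / (tn + fp)
print(sensitivity, specificity)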

Build and plot a decision tree with a maximum of 5 terminal nodes.

In [13]: dt_2 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=50).fit(X_train, y_train)
         tree.plot_tree(dt_2, class_names=['No', 'Yes'], feature_names=X_train.columns.tolist())
         plt.show()

Build 5 decision trees, each on 20 randomly chosen features, and predict attrition on the test set with each model (a loop-based version of the same procedure is sketched after these five cells).

In [14]: columns = list(X_train.columns)
         # random.sample draws 20 distinct features; the source's random.choices samples
         # with replacement and could pick the same column more than once
         sample_features = random.sample(columns, k=20)
         dt_model_1 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_1 = dt_model_1.predict(X_test[sample_features])


In [16]: sample_features = random.sample(columns, k=20)
         dt_model_2 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_2 = dt_model_2.predict(X_test[sample_features])

In [17]: sample_features = random.sample(columns, k=20)
         dt_model_3 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_3 = dt_model_3.predict(X_test[sample_features])

In [18]: sample_features = random.sample(columns, k=20)
         dt_model_4 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_4 = dt_model_4.predict(X_test[sample_features])

In [19]: sample_features = random.sample(columns, k=20)
         dt_model_5 = DecisionTreeClassifier(random_state=20).fit(X_train[sample_features], y_train)
         y_pred_5 = dt_model_5.predict(X_test[sample_features])
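
The five cells above differ only in the model index. As a compact alternative, the same procedure can be written as a loop — a sketch, assuming the columns, X_train, X_test and y_train variables defined above:

# Build 5 trees, each on its own random subset of 20 features, and collect the test predictions
predictions = {}
for m in range(1, 6):
    feats = random.sample(columns, k=20)
    model = DecisionTreeClassifier(random_state=20).fit(X_train[feats], y_train)
    predictions[f'y_pred_{m}'] = model.predict(X_test[feats])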

Create a new dataframe "model_predictions_df" by appending each prediction made in the previous question. The dataframe will have 5 columns, one for each prediction from the decision tree models built above.

In [20]: model_predictions_df = pd.DataFrame({'y_pred_1': y_pred_1, 'y_pred_2': y_pred_2,
                                              'y_pred_3': y_pred_3, 'y_pred_4': y_pred_4,
                                              'y_pred_5': y_pred_5})
         model_predictions_df.head()

Out[20]:
   y_pred_1  y_pred_2  y_pred_3  y_pred_4  y_pred_5
0         0         0         0         0         0
1         0         0         0         1         0
2         1         1         1         1         1
3         0         0         1         0         0
4         1         0         0         0         1

Create a new column "Voted_Results" in the dataframe "model_predictions_df" that contains the most frequently occurring value (the row-wise mode) of the 5 prediction columns.


In [21]: votes = []
         for i in range(model_predictions_df.shape[0]):
             # the most frequent value across the five predictions in this row
             votes.append(model_predictions_df.iloc[i].value_counts().index[0])
         model_predictions_df['Voted_Results'] = votes
         model_predictions_df.head()

Out[21]:
   y_pred_1  y_pred_2  y_pred_3  y_pred_4  y_pred_5  Voted_Results
0         0         0         0         0         0              0
1         0         0         0         1         0              0
2         1         1         1         1         1              1
3         0         0         1         0         0              0
4         1         0         0         0         1              0
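
A more concise way to compute the row-wise mode is pandas' DataFrame.mode — a sketch, assuming the five prediction columns above; with five binary votes a tie cannot occur, so this matches the loop:

# mode(axis=1) returns the row-wise mode(s); column 0 holds the mode of each row
voted = model_predictions_df[['y_pred_1', 'y_pred_2', 'y_pred_3',
                              'y_pred_4', 'y_pred_5']].mode(axis=1)[0].astype(int)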

Consider the values of "Voted_Results" as the new predictions, store them in a variable "new_y_pred", and find the accuracy and the ROC-AUC score using new_y_pred.

In [22]: new_y_pred = model_predictions_df['Voted_Results']
         print("Accuracy:", accuracy_score(y_test, new_y_pred))
         print("ROC-AUC Score:", roc_auc_score(y_test, new_y_pred))

Accuracy: 0.9485458612975392
ROC-AUC Score: 0.9548693830608723

2. Random Forest

Build a full random forest model to predict whether an employee will leave the company, and generate a classification report.

In [24]: rf_model = RandomForestClassifier(n_estimators=10, random_state=50).fit(X_train, y_train)
         y_pred = rf_model.predict(X_test)
         print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.92      0.94       259
           1       0.89      0.97      0.93       188

    accuracy                           0.94       447
   macro avg       0.93      0.94      0.94       447
weighted avg       0.94      0.94      0.94       447
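
GridSearchCV is imported at the top of the notebook but never used. As a follow-up sketch, the forest's hyperparameters could be tuned with it; the parameter grid below is illustrative, not taken from the source:

# Illustrative hyperparameter search (grid values are assumptions)
param_grid = {'n_estimators': [10, 50, 100],
              'max_depth': [None, 5, 10],
              'max_features': ['sqrt', 'log2']}
grid = GridSearchCV(RandomForestClassifier(random_state=50),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)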

