Decision Tree

The document discusses using a decision tree model to predict whether an employee's salary is over $100k based on their company, job, and level of education using a dataset of salaries. It shows the decision tree that was created using these factors as nodes to make the prediction. The document also discusses using entropy and gini impurity to determine the optimal variables to split the data on at each node for making the most informed predictions.

Uploaded by Sudheer Redus

DECISION TREE

Problem: Take the following salaries.csv data and predict whether an employee's salary is more than $100K, given the company, job and degree.

Dataset: salaries.csv

Based on the following dataset, we want to answer: is the salary for a given employee more than $100K or not?

We can now create a decision tree as shown below:

Why did we start the split at company? Because splitting on company first gives the lowest entropy in this model. Entropy is a measure of randomness (impurity) in a node: the lower the entropy after a split, the more information we gain at that split. [We can also use gini impurity to decide this.]

Gini impurity represents how much impurity (class mixing) remains in a node after a split; like entropy, lower is better.
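The two measures described above can be sketched directly. This is a minimal illustration (not part of the original notes) of how entropy and gini impurity are computed for a list of class labels such as the salary_more_than_100k column:

```python
# Entropy and gini impurity of a list of class labels.
# Lower values mean a purer node; a pure node scores 0 on both.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: sum of -p * log2(p) over each class proportion p."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum of p^2 over each class proportion p."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(entropy([1, 1, 1, 1]))  # pure node -> 0.0
print(gini([1, 1, 1, 1]))     # pure node -> 0.0
print(entropy([0, 0, 1, 1]))  # 50/50 node -> 1.0 (maximum for two classes)
print(gini([0, 0, 1, 1]))     # 50/50 node -> 0.5 (maximum for two classes)
```

A split is good when the weighted impurity of the child nodes is much lower than the impurity of the parent; that drop is the information gain the notes refer to.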

Program

# prediction for salary of an employee - decision tree
import pandas as pd

# load the dataset
df = pd.read_csv('F://datascience-notes/ml/decision-trees/salaries.csv')
df

# separate the features from the target column
inputs = df.drop('salary_more_than_100k', axis='columns')
target = df['salary_more_than_100k']

# convert the string columns into numbers; this is done with LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# add an encoded column for each string column
# (re-fitting the same encoder works here, but note each fit_transform
# overwrites the previous mapping stored in le)
inputs['company_n'] = le.fit_transform(inputs['company'])
inputs['job_n'] = le.fit_transform(inputs['job'])
inputs['degree_n'] = le.fit_transform(inputs['degree'])
inputs.head()

# drop the original string (label) columns, keeping only the encoded ones
input_n = inputs.drop(['company', 'job', 'degree'], axis='columns')
input_n
# company:  0 -> abc pharma, 1 -> facebook, 2 -> google
# job:      0 -> business manager, 1 -> computer programmer, 2 -> sales executive
# degree:   0 -> bachelors, 1 -> masters

# create the model
from sklearn.tree import DecisionTreeClassifier
# default criterion='gini'; pass criterion='entropy' to split on entropy instead
model = DecisionTreeClassifier()
model.fit(input_n, target)

# find accuracy -- note this is scored on the training data itself,
# so a deep tree can memorise a small dataset and score perfectly
accuracy = model.score(input_n, target)
accuracy  # 1.0 (training accuracy, not a test score)

# predict for a person working at google as a sales executive with a masters degree
# (passing a DataFrame with column names avoids sklearn's feature-name warning)
model.predict(pd.DataFrame([[2, 2, 1]], columns=input_n.columns))  # array([0]) # salary <= 100K $
# google, business manager, bachelors degree
model.predict(pd.DataFrame([[2, 0, 0]], columns=input_n.columns))  # array([1]) # salary > 100K $
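Once the model is fitted, the splits it learned can be printed as text with sklearn's export_text, which is a handy substitute for a tree diagram. The sketch below is self-contained: the few rows are made up to mimic salaries.csv (using the same encodings as above), since the real file is not reproduced here.

```python
# Inspect the learned splits of a decision tree as plain text.
# The rows below are illustrative stand-ins for salaries.csv.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    'company_n': [2, 2, 0, 0, 1, 1],  # 0=abc pharma, 1=facebook, 2=google
    'job_n':     [2, 0, 2, 1, 2, 0],  # 0=business mgr, 1=programmer, 2=sales exec
    'degree_n':  [0, 1, 0, 0, 1, 1],  # 0=bachelors, 1=masters
    'salary_more_than_100k': [0, 1, 0, 0, 1, 1],
})
X = df.drop('salary_more_than_100k', axis='columns')
y = df['salary_more_than_100k']

model = DecisionTreeClassifier(criterion='entropy').fit(X, y)
# prints an indented if/else view of the tree, one line per node
print(export_text(model, feature_names=list(X.columns)))
```

In this toy data the degree column alone separates the classes, so the printed tree has a single split on degree_n; on the real salaries.csv the root split would be on company, as discussed above.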

Task on Decision Tree:

Titanic dataset 'titanic.csv' is available.

Predict whether a passenger survived or not (1 or 0) based on the Pclass, Sex, Age and Fare columns.
Note: split the data into 80% for training and 20% for testing, and then train the model.
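The task's steps can be sketched as follows. Since titanic.csv is not reproduced in these notes, a few made-up rows stand in for it; with the real file, replace the DataFrame literal with pd.read_csv('titanic.csv').

```python
# Sketch of the Titanic task: encode Sex, split 80/20, train, score on the test set.
# The rows below are illustrative stand-ins for titanic.csv.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    'Pclass': [1, 3, 2, 3, 1, 3, 2, 1, 3, 2],
    'Sex':    ['female', 'male', 'female', 'male', 'male',
               'female', 'male', 'female', 'male', 'female'],
    'Age':    [29, 22, 35, 28, 54, 14, 40, 58, 2, 27],
    'Fare':   [211.3, 7.25, 26.0, 8.05, 51.9, 11.2, 13.0, 146.5, 21.1, 13.9],
    'Survived': [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
})

# encode Sex to numbers, as done for company/job/degree above
df['Sex_n'] = LabelEncoder().fit_transform(df['Sex'])

X = df[['Pclass', 'Sex_n', 'Age', 'Fare']]
y = df['Survived']

# 80% train / 20% test, as the note asks
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = DecisionTreeClassifier().fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))
```

One caveat with the real file: the Age column has missing values, which DecisionTreeClassifier does not accept, so fill them first (for example with df['Age'].fillna(df['Age'].median())) before splitting.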
