ML 6
Hardware Requirement:
Intel(R) Core(TM) i5-10505 CPU @ 3.20 GHz Processor, 8 GB RAM, 256 GB HDD, 20” LCD
Monitor, Keyboard, Mouse.
Software Requirement:
OS, Python 2.7, PyCharm and Anaconda. Studio.
DESCRIPTION:
Decision Tree is one of the most powerful and popular algorithms. The decision-tree algorithm
falls under the category of supervised learning algorithms. It works for both continuous and
categorical output variables.
2. NumPy :
It is a numeric Python module which provides fast math functions for calculations.
It is used to read data into NumPy arrays and to manipulate them.
3. Pandas :
Used to read and write different files.
Data manipulation can be done easily with dataframes.
Installation of the packages :
In Python, sklearn (scikit-learn) is the package which contains the modules required to
implement machine learning algorithms. You can install the sklearn package with the commands
given below.
using pip :
pip install -U scikit-learn
Before using the above command, make sure you have the scipy and numpy packages installed.
If you don’t have pip, you can install it using
python get-pip.py
using conda :
conda install scikit-learn
Assumptions we make while using Decision tree :
At the beginning, we consider the whole training set as the root.
For information gain, attributes are assumed to be categorical; for the Gini index,
attributes are assumed to be continuous.
Records are distributed recursively on the basis of attribute values.
We use statistical methods to order attributes as the root or as internal nodes.
Pseudocode :
1. Find the best attribute and place it on the root node of the tree.
2. Now, split the training set into subsets. While making the subsets, make sure that all
records in a subset have the same value for the chosen attribute.
3. Find leaf nodes in all branches by repeating 1 and 2 on each subset.
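The three steps above can be sketched in plain Python. This is only an illustrative sketch, not the sklearn implementation: it assumes categorical attributes and uses information gain to pick the best attribute, and the toy rows, labels and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute whose split gives the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 1 to 3: choose an attribute, split on its values, recurse."""
    if len(set(labels)) == 1 or not attributes:
        # Leaf node: predict the most common label in this subset
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)       # step 1
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in set(row[attr] for row in rows):          # step 2
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        tree[attr][value] = build_tree(                   # step 3
            [rows[i] for i in keep], [labels[i] for i in keep], remaining)
    return tree

# Hypothetical toy data: each row is a dict of categorical attributes
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

The result is a nested dictionary: each key is an attribute, each inner key is one of its values, and each leaf is a class label.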
Since 2001
Bhartiya Gramin Punarrachna Sanstha’s
Hi-Tech Institute of Technology, Aurangabad
A Pioneer to Shape Global Technocrats
Approved By AICTE, DTE Govt. of Maharashtra & Affiliated to Dr. Babasaheb Ambedkar Technological University, Lonere, Raigad
P-119, Bajajnagar, MIDC Waluj, Aurangabad, Maharashtra, India - 431136 P: (0240) 2552240, 2553495, 2553496 Web: http://hitechengg.edu.in/
While implementing the decision tree we will go through the following two phases:
1. Building Phase
Preprocess the dataset.
Split the dataset into train and test sets using the Python sklearn package.
Train the classifier.
2. Operational Phase
Make predictions.
Calculate the accuracy.
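The two phases can be sketched end to end as follows. To keep the sketch runnable without an Internet connection, it generates a synthetic stand-in for the data (every combination of four values from 1 to 5, labelled by which side of a balance is heavier) instead of fetching the real file. The classifier settings and the variable names are assumptions made for illustration.

```python
from itertools import product

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Building phase, part 1: prepare the data (synthetic stand-in, see above)
X, y = [], []
for lw, ld, rw, rd in product(range(1, 6), repeat=4):
    X.append([lw, ld, rw, rd])
    left, right = lw * ld, rw * rd
    y.append("L" if left > right else "R" if left < right else "B")

# Building phase, part 2: split 70:30 into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# Building phase, part 3: train the classifier (settings are assumptions)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=100)
clf.fit(X_train, y_train)

# Operational phase: make predictions and calculate the accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred) * 100
print("Accuracy :", accuracy)
```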
Data Import :
To import and manipulate the data we use the pandas package provided in Python.
Here, we use a URL which fetches the dataset directly from the UCI site, so there is no need
to download the dataset. When you run this code on your system, make sure the
system has an active Internet connection.
As the dataset is separated by “,”, we have to pass the sep parameter’s value as “,”.
Another thing to notice is that the dataset doesn’t contain a header, so we pass the
header parameter’s value as None. If we do not pass the header parameter, pandas will
treat the first line of the dataset as the header.
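The effect of the sep and header parameters can be seen in a small sketch. It reads from an in-memory string instead of the URL so that it runs offline; the three sample rows are hypothetical and only mimic the layout described above (target first, then four attributes).

```python
import io

import pandas as pd

# Hypothetical sample in the same layout: comma-separated, no header line
text = "B,1,1,1,1\nR,1,1,1,2\nL,2,1,1,1\n"

# header=None keeps every line as data and numbers the columns 0..4
balance_data = pd.read_csv(io.StringIO(text), sep=',', header=None)
print("Dataset shape:", balance_data.shape)
print(balance_data.head())

# Without header=None the first record is consumed as column names
without_header_arg = pd.read_csv(io.StringIO(text), sep=',')
print("Shape without header=None:", without_header_arg.shape)
```

With a real source, a URL or a file path can be passed to read_csv in place of the StringIO object.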
Data Slicing :
Before training the model we have to split the dataset into training and testing datasets.
To split the dataset for training and testing we use the sklearn function train_test_split.
First of all we have to separate the target variable from the attributes in the dataset.
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
Above are the lines from the code which separate the dataset. The variable X contains the
attributes while the variable Y contains the target variable of the dataset.
The next step is to split the dataset for training and testing purposes.
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size = 0.3, random_state = 100)
The above line splits the dataset for training and testing. As we are splitting the dataset in
a ratio of 70:30 between training and testing, we pass the test_size parameter’s value as 0.3.
The random_state parameter seeds the pseudo-random number generator used for random
sampling, so the same value reproduces the same split.
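A minimal sketch of the slicing and splitting steps on a small hypothetical array (10 records, target in column 0, attributes in columns 1 to 4) shows the resulting sizes and the role of random_state:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 records, column 0 is the target, columns 1-4 are attributes
data = np.arange(50).reshape(10, 5)
X = data[:, 1:5]
Y = data[:, 0]

# 70:30 split; 30% of 10 records gives 3 test records and 7 training records
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, Y, test_size=0.3, random_state=100)
print(np.array_equal(X_test, X_test2))
```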
Terms used in code :
Gini index and information gain are both methods used to select, from the n attributes
of the dataset, which attribute is placed at the root node or at an internal node.
Gini index:
Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly identified.
This means an attribute with a lower Gini index should be preferred.
Sklearn supports the “gini” criterion for the Gini index, and “gini” is the default value.
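For a set of labels with class proportions p_i, the Gini index is 1 - sum(p_i^2). The sketch below computes it for a node and for a candidate split, where the split's score is the size-weighted average over its subsets. The function names and sample labels are hypothetical and are not part of sklearn.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(subsets):
    """Gini index of a split: size-weighted average over its subsets."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_index(s) for s in subsets)

pure = ['L', 'L', 'L', 'L']     # only one class: nothing can be misidentified
mixed = ['L', 'L', 'R', 'R']    # two classes in equal parts

print(gini_index(pure))
print(gini_index(mixed))

# A split that separates the classes perfectly scores lower (better)
# than one that leaves both subsets mixed
good_split = [['L', 'L'], ['R', 'R']]
poor_split = [['L', 'R'], ['L', 'R']]
print(weighted_gini(good_split), weighted_gini(poor_split))
```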
SOURCE CODE:
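# Run this program on your local Python interpreter,
# provided you have installed the required libraries.
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Function to split the dataset
def splitdataset(balance_data):
    # Separate the target variable (column 0) from the attributes
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    # Split the dataset 70:30 into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size = 0.3, random_state = 100)
    return X, Y, X_train, X_test, y_train, y_test
# Function to train using the Gini index
def train_using_gini(X_train, X_test, y_train):
    clf_gini = DecisionTreeClassifier(criterion = "gini",
        max_depth = 3, min_samples_leaf = 5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini
# Function to train using entropy
def train_using_entropy(X_train, X_test, y_train):
    clf_entropy = DecisionTreeClassifier(criterion = "entropy",
        max_depth = 3, min_samples_leaf = 5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
# Function to print the classification report
def cal_accuracy(y_test, y_pred):
    print("Report : ",
        classification_report(y_test, y_pred))
# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    # Operational Phase
    print("Results Using Gini Index:")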
OUTPUT:
Data Information:
Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L'
'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix: [[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
Accuracy : 73.4042553191
Report :
precision recall f1-score support
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix: [[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
Accuracy : 70.7446808511
Report :
             precision    recall  f1-score   support
          B       0.00      0.00      0.00        13
          L       0.71      0.74      0.72        85
          R       0.71      0.78      0.74        90
avg / total       0.66      0.71      0.68       188
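The accuracy figures in the output can be checked against the two confusion matrices: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test records. The sketch below redoes that arithmetic. The row order B, L, R is an assumption, chosen because the row sums match the support column of the report.

```python
import numpy as np

# Confusion matrices copied from the two result blocks above
first_cm = np.array([[0, 6, 7], [0, 67, 18], [0, 19, 71]])
second_cm = np.array([[0, 6, 7], [0, 63, 22], [0, 20, 70]])

def accuracy_from_confusion(cm):
    """Percentage of records on the diagonal, i.e. predicted correctly."""
    return 100.0 * np.trace(cm) / cm.sum()

first_accuracy = accuracy_from_confusion(first_cm)
second_accuracy = accuracy_from_confusion(second_cm)
print("Accuracy :", first_accuracy)
print("Accuracy :", second_accuracy)

# Per-class recall is the diagonal divided by the row sums (the support)
support = second_cm.sum(axis=1)
recall = second_cm.diagonal() / support
print(support, recall.round(2))
```

The first row shows that no record of the first class is ever predicted correctly, which is why that class has precision and recall of 0.00 in the report.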
dataset = np.array(
[['Asset Flip', 100, 1000],
['Text Based', 500, 3000],
['Visual Novel', 1500, 5000],
['2D Pixel Art', 3500, 8000],
['2D Vector Art', 5000, 6500],
['Strategy', 6000, 7000],
['First Person Shooter', 8000, 15000],
['Simulator', 9500, 20000],
['Racing', 12000, 21000],
['RPG', 14000, 25000],
['Sandbox', 15500, 27000],
['Open-World', 16500, 30000],
['MMOFPS', 25000, 52000],
['MMORPG', 30000, 80000]
])
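# select all rows by : and column 1 by 1:2 to X representing features
X = dataset[:, 1:2].astype(int)
# print X
print(X)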
Output:
[[ 100]
[ 500]
[ 1500]
[ 3500]
[ 5000]
[ 6000]
[ 8000]
[ 9500]
[12000]
[14000]
[15500]
[16500]
[25000]
[30000]]
Step 4: Select all of the rows and column 2 from the dataset and assign them to “y”.
# select all rows by : and column 2
# by 2 to y representing labels
y = dataset[:, 2].astype(int)
# print y
print(y)
Output:
[ 1000 3000 5000 8000 6500 7000 15000 20000 21000 25000 27000 30000 52000 80000]
Step 5: Fit decision tree regressor to the dataset
# import the regressor
from sklearn.tree import DecisionTreeRegressor
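A minimal sketch of the fitting step is given here, assuming the X and y values printed in the outputs above. The random_state value and the query cost of 3750 are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Column 1 (X) and column 2 (y) of the dataset, as printed above
X = np.array([[100], [500], [1500], [3500], [5000], [6000], [8000],
              [9500], [12000], [14000], [15500], [16500], [25000], [30000]])
y = np.array([1000, 3000, 5000, 8000, 6500, 7000, 15000,
              20000, 21000, 25000, 27000, 30000, 52000, 80000])

# create a regressor object (random_state is an arbitrary choice)
regressor = DecisionTreeRegressor(random_state=0)

# fit the regressor with the X and y data
regressor.fit(X, y)

# predict the value for a hypothetical new input of 3750
y_pred = regressor.predict([[3750]])
print(y_pred)
```

Because no depth limit is set, the tree grows until every training record sits in its own leaf, so it reproduces the training values exactly and a new input receives the value of the leaf it falls into.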
# specify title
plt.title('Profit to Production Cost (Decision Tree Regression)')
Step 8: Finally, the tree is exported and shown in the TREE STRUCTURE below. It is
visualized at http://www.webgraphviz.com/ by copying the data from the ‘tree.dot’ file.
# import export_graphviz
from sklearn.tree import export_graphviz
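The export step might look like the sketch below. It fits a regressor on the first few rows of the dataset so that it runs on its own, and writes ‘tree.dot’ to a temporary directory. The feature name, the output location and the max_depth limit are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# First rows of the dataset above: column 1 as X, column 2 as y
X = np.array([[100], [500], [1500], [3500], [5000], [6000]])
y = np.array([1000, 3000, 5000, 8000, 6500, 7000])
regressor = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X, y)

# export the decision tree to a tree.dot file (path is an assumption)
out_path = os.path.join(tempfile.gettempdir(), "tree.dot")
export_graphviz(regressor, out_file=out_path,
                feature_names=["Production Cost"])

# the file holds plain DOT text that can be pasted into a Graphviz viewer
with open(out_path) as dot_file:
    dot_text = dot_file.read()
print(dot_text[:200])
```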
Conclusion: A decision tree assists analysts in evaluating upcoming choices. The tree creates a
visual representation of all possible outcomes, rewards and follow-up decisions in one document.