Coincent - Data Science With Python Assignment
“iris.shape” gives us the dimensions of the dataset, i.e. the number of rows and the
number of columns.
“iris.columns” gives us the names of the variables present in the dataset.
“iris.describe()” gives us summary statistics for the numeric variables in the dataset,
such as the count of values, mean, standard deviation, minimum, quartiles and
maximum. The describe() method gives us a good picture of the distribution of the data.
“iris.info()” gives us a short summary of our dataset, such as the datatype of each
column, the non-null value counts and the memory usage.
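A minimal sketch of these calls, assuming the dataset is available locally as "iris.csv" and is loaded into a pandas DataFrame named iris:

import pandas as pd

# Load the Iris dataset; the file name "iris.csv" is an assumption.
iris = pd.read_csv("iris.csv")

print(iris.shape)       # (rows, columns), e.g. (150, 5)
print(iris.columns)     # the names of the variables
print(iris.describe())  # count, mean, std, min, quartiles, max per numeric column
iris.info()             # datatypes, non-null counts and memory usage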
3. Understanding the Variables
“iris.duplicated().sum()” prints the number of duplicated rows in our dataset.
“iris.isnull().sum()” returns the number of missing values in each column of the iris
dataset.
We can also find out how many times each value of a variable occurs in our iris
dataset. “iris.Species.value_counts()” is used for this purpose, and
“iris.Species.value_counts(normalize = True)” helps us see the class counts as
proportions.
We can also visualize the variables using a histogram and a bar graph. The libraries
seaborn and matplotlib are required for this purpose, as in the sketch below.
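A minimal sketch of these checks and plots, reusing the iris DataFrame from above; the column name "SepalLengthCm" is an assumption about the CSV's headers:

import seaborn as sns
import matplotlib.pyplot as plt

print(iris.duplicated().sum())  # number of duplicated rows
print(iris.isnull().sum())      # missing values per column

# Class counts, raw and as proportions
print(iris.Species.value_counts())
print(iris.Species.value_counts(normalize=True))

# Bar graph of the class counts and a histogram of one variable
sns.countplot(x="Species", data=iris)
plt.show()
sns.histplot(iris["SepalLengthCm"])
plt.show()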
4. Study the relationship
In the above step, we understood what kind of variables are present, how many
empty values there are, what the size of the dataset is, and what the datatype of
each variable is. Here in the iris dataset there are no null or empty values, so we
did not fill them. If there are any empty values in the dataset, then we need to fill
them in or remove those rows from the dataset.
“sns.pairplot(iris)” plots every variable in the iris dataset against every other
using scatterplots. This plot can be very helpful for understanding the relationships
between all the variables in our dataset. The resulting graphs compare the
variables of the dataset with each other, as in the sketch below.
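A one-line sketch of the pairplot; the hue="Species" colouring is an optional assumption that makes the class structure easier to see:

import seaborn as sns
import matplotlib.pyplot as plt

# Every numeric variable plotted against every other, coloured by class
sns.pairplot(iris, hue="Species")
plt.show()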
2. What is a Decision Tree? Draw a decision tree by taking the example of Play
Tennis.
Ans. Decision Tree : It is a supervised machine learning algorithm that is used to solve
both classification and regression problems. It is a tree-structured classifier. It has a
root node, internal nodes and leaf nodes, and these are connected by branches. The
features of the dataset are represented by the internal nodes, and the output of the
model is represented by the leaf nodes. The root node and the internal nodes together
are also called decision nodes.
Dataset : the standard Play Tennis table, with the attributes Outlook, Temperature,
Humidity and Wind and the class label PlayTennis (Yes/No).
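As a hedged sketch, here is how such a tree could be fitted with scikit-learn, assuming the standard 14-row Play Tennis table from Tom Mitchell's Machine Learning textbook; the categorical columns are one-hot encoded because scikit-learn's trees expect numeric input:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The classic 14-row Play Tennis table (Mitchell, "Machine Learning")
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical features for scikit-learn
X = pd.get_dummies(data.drop(columns="PlayTennis"))
y = data["PlayTennis"]

# criterion="entropy" makes the tree split by information gain
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

For this table, Outlook carries the highest information gain, so the printed tree splits on its indicator columns first.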
6. Can you explain the difference between a Test Set and a Validation Set?
Ans. Test Set : This data is used to test the machine learning model after the training
stage is complete. This set contains the kinds of data we face in real-life scenarios.
It is used on the final model. It is a subset of the original dataset but doesn't contain
the cases present in the training data set.
Validation Set : This data set is used for validation during the training stage of the
model. Unlike the test set, it is considered to be a part of the training stage. This data
is only used to evaluate the model during its training stage, not to train it; that is, the
model doesn't learn anything from this data set.
Simplify the model : While building the model, we may use a large number of features
on which the output of the model depends. Choose only the important ones, on which
the output majorly depends.
Cross Validation : In this technique, we divide the training dataset into subsets and
perform validation after training the model on each subset.
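A minimal sketch of k-fold cross validation with scikit-learn, using the built-in iris data and a decision tree as stand-ins:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: the data is split into 5 subsets, and the model
# is trained on 4 of them and validated on the remaining one, in turn.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores, scores.mean())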
8. What is Precision?
Ans. It is the fraction of the positive predictions made by the model that actually
belong to the positive class.
Precision = TP / (TP + FP) where,
TP (True Positive) : correct prediction as Positive
FP (False Positive) : wrongly predicted as Positive
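A quick sketch with hypothetical labels, checking the formula against scikit-learn:

from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 0, 1]   # hypothetical predictions
# TP = 3, FP = 1, so precision = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))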
9. Explain How a ROC Curve works.
Ans. The full form of ROC is Receiver Operating Characteristic curve. It is a method used
to evaluate the performance of our model and is used in binary classification problems.
Let us understand this with the help of an example.
Assume that we have built a machine learning model that predicts the output in the
form of a probabilistic score. If the probabilistic score for a class is high, then the
chance of the new data point belonging to that class is also high.
Let the data set be :

xi    yi    yp
x1    1     1.2
x2    0     0.8
x3    1     0.92
x4    1     1.56

1. Sorting : First, we sort the data points in decreasing order of their predicted
scores yp :

xi    yi    yp
x4    1     1.56
x1    1     1.2
x3    1     0.92
x2    0     0.8
2. Thresholding : We take each yp value in turn as the threshold value (T) and
compare every yp value against it. If a point's score is greater than or equal to T,
then it is predicted as the positive class; otherwise, as the other class.
3. In this step, we find the TPR and FPR for each and every case.
TPR : True Positive Rate = TP / (TP + FN).
FPR : False Positive Rate = FP / (FP + TN).
4. Plot the graph between TPR and FPR. The curve obtained in this graph is called
the Receiver Operating Characteristic Curve. It has FPR on the x-axis and TPR on
the y-axis.
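A hedged sketch of these steps with scikit-learn, using the four hypothetical points from the table above; roc_curve performs the sorting and thresholding internally:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# The four hypothetical points from the table above
y_true  = [1, 0, 1, 1]
y_score = [1.2, 0.8, 0.92, 1.56]

# roc_curve sorts the scores and applies each one as a threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve

plt.plot(fpr, tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()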
10. What is Accuracy?
Ans. It is defined as the ratio of the number of correctly classified points to the total
number of points, that is, the ratio of the number of correct predictions made to the
total number of predictions made by the model:
Accuracy = (TP + TN) / (TP + TN + FP + FN).
                Predicted Class
                No      Yes
Actual    No    TN      FP
Label     Yes   FN      TP
3. We can evaluate the model using metrics like Accuracy, Precision, Recall and F1
Score, which can be calculated from this matrix, as in the sketch below.
4. It also tells us the errors made by the model, and not only the errors but also their
type, that is, type-1 (false positive) or type-2 (false negative).
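A minimal sketch computing the matrix and the metrics derived from it, on hypothetical labels:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions

# Rows are the actual labels, columns the predicted labels:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all = 6/8 = 0.75
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))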
Root Node : The decision tree starts with a root node. This root node generates the
child nodes. It doesn't have any incoming branches.
Internal Nodes : They come between the root node and the leaf nodes. These nodes
are also called “Decision Nodes” because they are used to make decisions and have
multiple outgoing branches. They represent the features of the dataset.
Branches : The decisions taken by the decision nodes are represented by the
branches.
Leaf Nodes : The final output class label is represented by the leaf nodes. They are
the end nodes of the decision tree and do not have any outgoing branches. They
represent the final prediction made by the model.
Example : (diagram of a decision tree, with the root node at the top branching
through decision nodes down to leaf nodes)
1. Root Node : It is the parent node of the decision tree. There is always exactly one
root node in the entire decision tree. It represents the feature or attribute with the
highest information gain in the data set.
2. Internal Nodes : They represent the features or attributes that are present in the
dataset. They are also called Decision Nodes because they are used to make
decisions to predict the output.
3. Leaf Node : They represent the final output or class label predicted by the model.
These are the last nodes in the decision tree.
2. Post-Pruning : This pruning creates a decision tree with minimum cross-validation
error. In this method, we check for the overfitting problem after the decision tree is
completely built. The data set is partitioned into many subsets. If there is overfitting,
then we cut the tree's leaf nodes from the bottom up until the cross-validated error
is at its minimum. A smaller tree is better than a larger tree with more error
possibilities.
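A hedged sketch using scikit-learn's cost-complexity pruning, one concrete form of post-pruning; a held-out split stands in for the cross-validation described above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree, then get the candidate pruning strengths (alphas)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that validates best
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))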
23. What are some disadvantages of using Decision Trees and how would
you solve them?
Ans. Disadvantages :
1. Decision trees are more prone to overfitting and do not generalize well to new
data.
2. A small variation in the data can produce a completely different decision tree.
3. They show large variation in the prediction of the class label of a new data point
with small changes to the training set.
4. They are expensive to train compared to other algorithms.
Methods to solve these disadvantages :
1. Pruning of the decision tree can solve the problem of overfitting. There are 2 types
of pruning : Pre-pruning and Post-pruning.
2. Bagging or averaging of the estimates can reduce the variance in the decision tree.
25. How would you define the Stopping Criteria for decision trees?
Ans. We stop growing the decision tree when any one of the following conditions is
met.
1. When all the samples in the corresponding subset of the dataset belong to the
same class label.
2. When the number of samples in a node is less than the specified minimum.
3. In the decision tree, the level of the root node is 1, its child nodes are at level 2,
their child nodes at level 3, and so on. While building the decision tree, if the
level of the current node is greater than the specified maximum, then we stop
growing the tree.
4. When the improvement in class impurity, that is the information gain, becomes
too low. These criteria map onto hyperparameters, as in the sketch below.
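A minimal sketch of how these criteria map onto scikit-learn's decision-tree hyperparameters; the particular values are arbitrary illustrations:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each hyperparameter mirrors one stopping criterion above
tree = DecisionTreeClassifier(
    max_depth=5,                 # stop when the node's level exceeds the maximum
    min_samples_split=10,        # stop when a node holds too few samples to split
    min_impurity_decrease=0.01,  # stop when the impurity improvement is too low
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())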
26. What is Entropy?
Ans. It is defined as the measure of randomness in the data or impurity of a variable or
feature of the given dataset. Mathematically, it can be calculated using
H(Y) or E(Y) = - Σ P(Yi) * log2[P(Yi)]
Where,
H(Y) = E(Y) = Entropy
Y = feature in the dataset
‘i’ runs from 1 to ‘k’
k = no.of class labels.
2. Information Gain : It is the difference between the entropy of the parent node and
the weighted average of the entropies of the child nodes produced by splitting on a
feature.
3. Gini Index : It is similar to the Entropy. If the Gini Index of a feature or attribute
in the dataset is 0, then it is said to be a pure attribute. It is calculated using
G.I. = 1 – Σ (Pi)^2
where i runs from 1 to n, and n = the number of class labels in the set.
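A minimal sketch of these two impurity measures in plain Python/NumPy, checked on the standard Play Tennis label column (9 Yes, 5 No):

import numpy as np

def entropy(labels):
    # H(Y) = -sum(P(yi) * log2(P(yi))) over the k class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G.I. = 1 - sum(Pi^2); equals 0 for a pure set
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

y = ["Yes"] * 9 + ["No"] * 5   # the standard Play Tennis labels
print(entropy(y))              # ~0.940
print(gini(y))                 # ~0.459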
Pre-Pruning is a method in which we stop the tree-building process at an early stage,
that is, before it produces the leaf nodes. While splitting at each node during
tree-building, we perform cross-validation and measure the error. If this error is not
reducing at a stage, then we stop splitting there and check an alternative split.
32. While building Decision Tree how do you choose which attribute to split
at each node?
Ans. While splitting each node during the building of the decision tree, the attribute is
chosen based on its Information Gain.
I.G.(S, a) = E(S) – [(weighted average) * E(each subset)]
or
I.G.(S, a) = E(S) – Σ ( |Sv| / |S| ) * E(Sv), where Sv is the subset of S for which
attribute a takes the value v.
The attribute with the highest information gain is chosen at each node.
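A hedged sketch of this selection rule, reusing the Play Tennis DataFrame "data" from the Q2 sketch above; the entropy helper is redefined here so the block is self-contained:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(data, attribute, target):
    # I.G.(S, a) = E(S) - sum over values v of a of (|Sv| / |S|) * E(Sv)
    weighted = sum(
        len(subset) / len(data) * entropy(subset[target])
        for _, subset in data.groupby(attribute)
    )
    return entropy(data[target]) - weighted

# The attribute with the highest information gain becomes the split
features = ["Outlook", "Temperature", "Humidity", "Wind"]
best = max(features, key=lambda a: information_gain(data, a, "PlayTennis"))
print(best)  # "Outlook" for the standard table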
33. How would you compare different Algorithms to build Decision Trees?
Ans. We always choose the machine learning algorithm that is most appropriate and gives
us the most accurate prediction of the output variable. In order to do this, we must
compare the algorithms. To compare algorithms we calculate metrics such as
accuracy, precision, recall, F1 score, etc. The choice also depends on the type of the
problem statement and the dataset given to us. If the problem statement is suited to
classification or regression, then we use supervised learning algorithms. We also
compare algorithms on how quickly they predict the correct output. Out of all the
machine learning algorithms, we choose a decision tree when the problem is
classification or regression based, mostly when it is a classification problem with a
small number of features on which the output depends.
34. How do you Gradient Boost decision trees?
Ans. Gradient boosting is a method in which we combine various weak learners in a
sequential manner so that together they form a strong learner. In order to gradient
boost decision trees, we combine many trees in series; each tree is trained on the
errors of the previous tree and corrects them, which in turn increases the efficiency
and accuracy of the model.
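A minimal sketch with scikit-learn's GradientBoostingClassifier, using the built-in iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 shallow trees is fitted to the errors of the ensemble
# built so far; learning_rate scales each tree's correction.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3).fit(X_train, y_train)
print(model.score(X_test, y_test))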
35. What are the differences between Decision Trees and Neural Networks?
Ans. Definition :
A Decision Tree is a supervised machine learning algorithm that is used to solve both
classification and regression problems. It has a tree-like structure, and is thus a
tree-structured classifier with a root node, internal nodes and leaf nodes. Entropy,
Information Gain and Gini impurity are the metrics that are used to construct a
decision tree.
A Neural Network is a machine learning model, central to deep learning, which
teaches the computer to learn from its mistakes and improve itself continuously.
It is a method used to process data and help the model make predictions with
greater accuracy.
Structure :
Decision tree
It has a tree-like structure, and is thus a tree-structured classifier. It has the
following parts :
1. Root Node
2. Internal / Decision Nodes
3. Leaf Nodes
4. Branches
Neural Network
The structure of the Neural Network is inspired by the human brain. It
consists of interconnected artificial neurons arranged in 3 layers :
1. Input Layer
2. Hidden Layer
3. Output Layer
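For contrast, a minimal sketch fitting both models on the same data; the single hidden layer of 10 neurons is an arbitrary illustration:

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One hidden layer of 10 neurons sits between the input and output layers
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)
print(nn.score(X, y), tree.score(X, y))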