ML 5 Practical Writeup
ML 5 Practical Writeup
AIM : WRITE STUDY ASSIGNMENT TO BUILD RULE BASED AND TREE BASED
MODEL
THEORY :
Rule-based modeling is a modeling approach that uses a set of rules that indirectly specifies a
mathematical model. The rule-set can either be translated into a model such as Markov chains or
differential equations, or be treated using tools that directly work on the rule-set in place of a
translated model, as the latter is typically much bigger. Rule-based modeling is especially
effective in cases where the rule-set is significantly simpler than the model it implies, meaning
that the model is a repeated manifestation of a limited number of patterns. An important domain
where this is often the case is biochemical models of living organisms. Groups of mutually
corresponding substances are subject to mutually corresponding interactions.
Rule-based machine learning methods have been established for a long time. There exists
a large body of research describing many different approaches, which are offering various
combinations of interesting properties. Among these properties are, for example:
good interpretability,
the ability to incrementally refine rules or sets of rules, and/or
allowing rules to be expressed in a formalism like Datalog or Horn logic, i.e.
going beyond conjunctions of simple conditions based on attribute-value pairs.
Rules: Rules typically take the form of an {IF:THEN} expression, (e.g. {IF 'condition' THEN
'result'}, or as a more specific example, {IF 'red' AND 'octagon' THEN 'stop-sign'}). An
individual rule is not in itself a model, since the rule is only applicable when its condition is
satisfied. Therefore rule-based machine learning methods typically comprise a set of rules,
or knowledge base, that collectively make up the prediction model.
Tree based algorithms are considered to be one of the best and mostly used supervised learning
methods. Tree based algorithms empower predictive models with high accuracy, stability and
ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They
are adaptable at solving any kind of problem at hand (classification or regression).
Methods like decision trees, random forest, gradient boosting are being popularly used in all
kinds of data science problems. Hence, for every analyst (fresher also), it’s important to learn
these algorithms and use them for modeling.
The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria is
different for classification and regression trees. Decision trees use multiple algorithms to decide
to split a node in two or more sub-nodes. The creation of sub-nodes increases the homogeneity of
resultant sub-nodes. In other words, we can say that purity of the node increases with respect to
the target variable. Decision tree splits the nodes on all available variables and then selects the
split which results in most homogeneous sub-nodes.
Overfitting is one of the key challenges faced while using tree based algorithms. If there is no
limit set of a decision tree, it will give you 100% accuracy on training set because in the worse
case it will end up making 1 leaf for each observation. Thus, preventing overfitting is pivotal
while modeling a decision tree and it can be done in 2 ways: