Introduction To Machine Learning
1. Data science is the process of applying scientific methods and domain expertise to data to extract useful information from it.
2. It includes the application of statistical and mathematical tools and techniques to glean useful information from data using machine learning.
“[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel, 1959
1. The ability to do the tasks is embodied in the form of a model, which is the result of the learning process
2. The model represents the process which generated the data used to build the model
3. The data used is expected to represent the long-term behaviour of the process
4. The more representative the data is of the real world in which the process executes, the better the model will be
4. These expressions (the end result of running the algorithms) are broadly called models
1. We cannot express our knowledge about patterns as a program, e.g. character recognition or natural language processing
2. We do not have an algorithm to identify a pattern of interest, e.g. spam mail detection
3. There are too many permutations and combinations possible, e.g. genetic code mapping
1. Fraud detection
2. Sentiment analysis
Attributes / Dimensions
1. A data set representing the real world is a collection of attributes that define an entity
[Figure: patients plotted on Sugar, Age and BP level axes, labelled as Heart healthy vs Potential heart ailments]
1. Position of a point in space is defined with respect to the origin
2. We believe there is a surface in this space that separates the heart-healthy patients from those with potential heart ailments
[Figure: patients plotted on Sugar and BP level axes, labelled as Heart healthy vs Potential heart ailments]
[Figure: a data point lying on the wrong side of the boundary between Heart healthy and Potential heart ailments is an erroneous classification]
[Figure: a plane ax + by + cz = d in the (Sugar, Age, BP level) space separating the Heart healthy region from the Potential heart ailments region]
Model Dimensions
Dimensions / Dimension reduction
1. An important step in preparing data for machine learning. If not done with care, or not done at all, it may have an adverse effect on the accuracy of machine learning models
2. The process of converting a high-dimensional data set into one with fewer dimensions, with minimal loss of variance
Dimensions / Dimension reduction
1. In the animation below, data on one attribute alone does not convey much, but when combined with another dimension it reveals more actionable information
2. The more dimensions, the more information the data set reveals
[Figure: spread, variance and mean shown on one dimension (D2)]
Dimensions / Dimension reduction
1. Beyond a point (in terms of number of dimensions) there are diminishing returns in information content
2. When data with such dimensions is fed to machine learning algorithms, the models tend to become less and less accurate, because of the degree of noise vs information contributed by the extra dimensions
Dimensions / Dimension reduction
a) Low Variance Filter. Columns with little variance carry little information. Columns with
variance lower than a given threshold are removed. A word of caution: variance is range
dependent; therefore normalization is required before applying this technique.
b) High Correlation Filter. Data columns with very similar trends are also likely to carry very
similar information. In this case, only one of them will suffice to feed the machine learning
model.
c) Mutual Information Filter, for non-linearly related coordinates – a criterion for feature selection in machine learning, based on Claude Shannon's information-theoretic concept of entropy (joint entropy). A sketch of filters (a) and (b) follows the reference below.
Refer : http://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html
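The two filters above can be sketched with pandas and scikit-learn; the synthetic columns and the thresholds (0.01 variance, 0.95 correlation) are illustrative assumptions, not values from the deck:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":   rng.integers(20, 80, 200),
    "bp":    rng.normal(120, 15, 200),
    "sugar": rng.normal(100, 20, 200),
})
df["bp_copy"] = df["bp"] * 1.01            # near-duplicate column
df["constant"] = 1.0                       # zero-variance column

# a) Low Variance Filter: normalize first, because variance is range dependent
scaled = MinMaxScaler().fit_transform(df)
keep = VarianceThreshold(threshold=0.01).fit(scaled).get_support()
low_var_kept = df.columns[keep]

# b) High Correlation Filter: drop one column of every highly correlated pair
corr = df[low_var_kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = df[low_var_kept].drop(columns=to_drop)
print(reduced.columns.tolist())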
Feature Extraction / Principal Component Analysis
e) Principal Component Analysis attempts to hit two birds with one stone:
1. It transforms the existing dimensions to increase the signal-to-noise ratio, creating new dimensions (principal components) out of the original ones; a sketch follows
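A minimal PCA sketch with scikit-learn on made-up two-dimensional data, projecting two correlated attributes onto their principal components:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=300)   # x2 is mostly x1 plus noise
X = np.column_stack([x1, x2])

# Standardize, then rotate onto the directions of maximum variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

print(pca.explained_variance_ratio_)      # most of the variance sits in the first component
X_reduced = pca.transform(X_std)[:, :1]   # keep only the first principal component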
Source: https://quantdare.com/machine-learning-a-brief-breakdown/
3. Python stands out as the language best suited for all areas of the data science and machine learning workflow.
4. Refer: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis#gs.qkenPdo
Pandas adds data structures and tools that are designed for practical data analysis in
finance, statistics, social sciences, and engineering. Pandas works well with
incomplete, messy, and unlabeled data (i.e., the kind of data you’re likely to encounter
in the real world), and provides tools for shaping, merging, reshaping, and slicing
datasets
matplotlib is the standard Python library for creating 2D plots and graphs.
Introduction to Supervised Machine Learning
c. The training stage involves the use of training data (a subset of the externally supplied data), supplied in the form of independent and target values
d. It produces a model which is supposed to represent the real process that generated the data
e. The model is tested for its performance in the test stage using test data. If satisfactory, the model is implemented (productionized)
f. The model is used to predict target values for new data points
Pre-process data – Address data quality issues such as missing values, outliers, data pollution etc. Establish the veracity of the data. Select attributes for the model; needs domain expertise
Create training & test set – Split the data into a training set and a test set; generally a 70:30 ratio is used (see the sketch below)
Select appropriate algorithm/s – Select appropriate algorithm/s to model, e.g. Random Forest, K Nearest Neighbours etc.; depends on the data
Productionize & calibrate
(CRISP-DM)
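The split step could look like the sketch below; the iris data set is a stand-in, and only the 70:30 ratio comes from the slide:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70:30 split as suggested above; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)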
Linear Regression
b. The term "linear" in the name "linear regression" refers to the fact that the method models the data as a linear combination of the explanatory variables.
d. In the case of linear regression with a single explanatory variable, the linear combination can be expressed as y = mx + c (an intercept plus a slope times the explanatory variable).
e. In its most basic form, it fits a straight line to the response variable. The model is designed to fit the line that minimizes the squared differences (also called errors or residuals).
Linear Regression Models -
d. Coefficient of correlation – Pearson's coefficient: ρ(x, y) = Cov(x, y) / (StdDev(x) * StdDev(y))
e. Generating a linear model for cases where r is near 0 makes no sense; the model will not be reliable, because for a given value of X there can be many values of Y. Non-linear models may be better in such cases
Linear Regression Models (Recap) -
f. Coefficient of correlation (recap) – Pearson's coefficient: ρ(x, y) = Cov(x, y) / (StdDev(x) * StdDev(y))
[Figure: scatter divided into quadrants about (Xbar, Ybar); the product of deviations is positive in the (+,+) and (-,-) quadrants and negative in the mixed quadrants; a sum near 0 means no correlation, a sum > 0 means positive correlation]
http://www.socscistatistics.com/tests/pearson/Default2.aspx
Linear Regression Models -
g. Given Y = f(X), and the scatter plot shows an apparent correlation between X and Y, let us fit a line to the scatter which shall be our model
h. But there are an infinite number of lines that can be fit to the scatter. Which one should we consider as the model?
j. Gradient descent methods use partial derivatives with respect to the parameters (slope and intercept) to minimize the sum of squared errors, as in the sketch below
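A sketch of gradient descent on the slope and intercept of a single-variable linear model, minimizing the mean of the squared errors on synthetic data (the learning rate and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)   # true slope 3, intercept 5

m, c = 0.0, 0.0          # initial guesses for slope and intercept
lr = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    y_hat = m * x + c
    error = y - y_hat
    # partial derivatives of the mean squared error with respect to m and c
    dm = -2.0 * np.sum(x * error) / n
    dc = -2.0 * np.sum(error) / n
    m -= lr * dm
    c -= lr * dc

print(round(m, 2), round(c, 2))   # should land close to 3 and 5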
Error = T - (mX + C), where T is the observed target value
The sum of raw errors can cancel out and give 0, which is why the squared errors are minimized instead
Linear Regression Models -
n. Coefficient of determination – determines the fitness of a linear model. The closer the points are to the line, the closer R^2 (the coefficient of determination) gets to 1, and the better the model is
[Figure: regression line through the scatter, with the means Xbar and Ybar marked]
Linear Regression Models -
o. Coefficient of determination (contd.)
I. There are a variety of errors for all those points that do not fall exactly on the line.
II. It is important to understand these errors to judge the goodness of fit of the model, i.e. how representative the model is likely to be in general
III. Let us look at point P1, one of the given data points, and the associated errors due to the model:
1. P1 – original y data point for the given x
2. P2 – the point on the fitted line for the same x (the predicted value)
3. SSE – Sum of Squared Errors: Σ(P1 - P2)^2, the portion the model fails to explain
4. SST – Total Sum of Squares: Σ(P1 - Ybar)^2, the total variation of the data around Ybar
5. SSR – Regression Sum of Squares: Σ(P2 - Ybar)^2, the portion captured by the regression model
R^2 = SSR / SST = 1 - SSE / SST. A numeric check follows.
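These quantities can be checked numerically; a NumPy sketch on made-up data that fits a least-squares line and computes SSE, SSR, SST and R^2:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9])

m, c = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
y_hat = m * x + c                # P2: predicted values on the line
y_bar = y.mean()                 # Ybar

sse = np.sum((y - y_hat) ** 2)     # error not explained by the line
ssr = np.sum((y_hat - y_bar) ** 2) # portion captured by the regression
sst = np.sum((y - y_bar) ** 2)     # total variation around the mean

r2 = ssr / sst                     # equivalently 1 - sse/sst
print(round(sse + ssr, 3), round(sst, 3), round(r2, 4))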
Linear Regression Models -
[Figure: two data points A and B relative to the fitted line]
In the case of point A, the line explains the variance of the point, whereas for point B there is a small area (light grey) of variance which the line does not represent.
Advantages –
1. Simple to implement, and the output coefficients are easy to interpret
Disadvantages –
1. Assumes a linear relationship between the dependent and independent variables, i.e. it assumes there is a straight-line relationship between them
2. Outliers can have a huge effect on the regression
3. Assumes independence between the attributes
4. Looks at a relationship between the mean of the dependent variable and the independent variables; just as the mean is not a complete description of a single variable, linear regression is not a complete description of relationships among variables
5. Boundaries are linear
The dataset has the 9 attributes listed below:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Logistic Regression
b. Given the value of the predictor (variable x), the model estimates the probability that the new data point belongs to a given class, say "A". Probability values range between 0 and 1.
[Figure: density distributions of class A and class B along the predictor x]
c. A new data point (shown with "?") needs to be classified, i.e. does it belong to class A or B?
d. Given the distribution, the closer the point is to the origin the less likely it is to belong to class A; the farther away it is from the origin, the more likely it belongs to class A
e. One can try to fit a simple linear model (y = mx + c) where y greater than a threshold means the point most probably belongs to class A. The challenge is that for extreme values of x the predicted probability is < 0 or > 1, which is absurd
[Figure: a straight-line probability model fitted to the class labels]
[Figure: sigmoid curve separating class B (near the origin) from class A]
Note: the linear model t (which is of the form mx + c) represents the logit, i.e. the natural log of the odds: t = ln(p / (1 - p)), where p is the probability that a data point belongs to the class. Inverting this gives the sigmoid p = 1 / (1 + e^(-t)), which stays between 0 and 1. A sketch follows.
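A sketch of the above with scikit-learn, on synthetic one-dimensional data standing in for the class A / class B example; predict_proba returns the sigmoid probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# class B clustered near the origin, class A farther away
x_b = rng.normal(loc=2.0, scale=1.0, size=100)
x_a = rng.normal(loc=6.0, scale=1.0, size=100)
X = np.concatenate([x_b, x_a]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)          # 1 = class A

clf = LogisticRegression().fit(X, y)

# p = 1 / (1 + e^(-(mx + c))): probability of class A for a few query values
query = np.array([[1.0], [4.0], [7.0]])
print(clf.predict_proba(query)[:, 1].round(3))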
Logistic Regression Model –
[Figure: fitted sigmoid with labelled example points; point 1 is a correct classification, points 2 and 4 are incorrect classifications]
The error of the model is the log loss: -Σ [ yi*log(pi) + (1 - yi)*log(1 - pi) ]
j. In the case of point 1 (correct classification), yi = 1 and pi is close to 1, so -log(pi) approaches 0; the total expression approaches 0, i.e. no error
k. In the case of point 2 (incorrect classification), yi = 1 but pi is small, so -log(pi) becomes very large (the log of a small number is a large negative number), and its contribution to the error increases significantly
l. In the case of point 4 (incorrect classification), yi = 0 but pi is large, so -log(1 - pi) becomes very large and its error contribution increases
Advantages –
1. Makes no assumptions about the distributions of the classes in feature space
2. Easily extended to multiple classes (multinomial regression)
3. Natural probabilistic view of class predictions
4. Quick to train
5. Very fast at classifying unknown records
6. Good accuracy for many simple data sets
7. Resistant to overfitting
8. Model coefficients can be interpreted as indicators of feature importance
Disadvantages –
1. Constructs linear boundaries
Confusion Matrix -
2. Of the 84 actual diabetes cases, the model correctly classified only 46 as diabetic
3. Of the 147 non-diabetic cases, the model correctly classified 134 as non-diabetic
4. The 13 cases who are normal but identified as diabetic are called Type I errors (false positives); a sketch reproducing these counts follows
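The quoted counts can be reproduced directly with scikit-learn's confusion_matrix; the label arrays below are constructed to match the slide's numbers rather than coming from an actual model:

import numpy as np
from sklearn.metrics import confusion_matrix

# 84 actual diabetics (46 caught), 147 actual non-diabetics (134 correct, 13 false alarms)
y_test = np.array([1] * 84 + [0] * 147)                 # 1 = diabetic
y_pred = np.array([1] * 46 + [0] * 38 + [0] * 134 + [1] * 13)

cm = confusion_matrix(y_test, y_pred)
print(cm)
# rows = actual (0, 1), columns = predicted (0, 1)
# [[134  13]    <- 13 Type I errors (false positives)
#  [ 38  46]]   <- 38 missed diabetics (Type II errors)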
K Nearest Neighbours
f. The training data is represented by the scattered data points in the feature space
g. The colour of the data points indicates the class they belong to
h. The grey point is the query point whose class has to be determined
i. Similarity is measured by the distance between the points, e.g. using the Euclidean distance
d. Radius Neighbor Classifier may be a better choice when the sampling is not
uniform. However, when there are many attributes and data is sparse, this
method becomes ineffective due to curse of dimensionality
Ref: http://scikit-learn.org/stable/modules/neighbors.html#classification
e. The neighbours-based algorithm can also be used for regression, where the labels are continuous and the label of the query point can be the average of the labels of its neighbours
f. The approach of finding the nearest neighbours by computing the distance between the query point and all other points is called brute force. It becomes time-costly (O(N^2)) and inefficient as the number of points increases
g. A KD-tree based nearest-neighbour approach helps reduce the time from the order of N^2 to O(D*N*log N), where D is the number of dimensions. This method becomes ineffective when D is large, due to the curse of dimensionality
b. Those dimensions which have a larger possible range of values will dominate the result of the distance calculation using the Euclidean formula
c. To ensure all the dimensions have a similar scale, we normalize the data on all the dimensions / attributes
d. There are multiple ways of normalizing the data. We will use Z-score standardization: z = (x - mean) / standard deviation. A sketch follows.
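A sketch combining the two ideas on this slide: z-score standardization so that no attribute dominates the Euclidean distance, followed by a k-nearest-neighbours classifier (the data set and k = 5 are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y)

# StandardScaler applies the z-score: (x - mean) / std, fitted on the training data only
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out 30%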
1. Minkowski distance
2. Euclidean distance
3. Manhattan distance
4. Chebyshev distance
5. Mahalanobis distance
6. Inner product
7. Cosine similarity
8. Pearson correlation
9. Hamming distance
10. Jaccard similarity
11. Edit distance or Levenshtein distance
Ref:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
Disadvantages –
1. Fixing the optimal value of K is a challenge
2. Will not be effective when the class distributions overlap
3. Does not output a model; it calculates distances for every new point (a lazy learner)
4. Computationally intensive (O(D*N^2)); this can be addressed using KD-tree algorithms, which take time to build
a. Naive Bayes classifiers are linear classifiers based on Bayes' theorem. The model generated is probabilistic
b. It is called naive due to the assumption that the features in the dataset are mutually independent
c. In the real world the independence assumption is often violated, but naive Bayes classifiers still tend to perform very well
d. The idea is to factor all available evidence, in the form of predictors, into Bayes' rule to obtain a more accurate probability for class prediction
f. Being relatively robust, easy to implement, fast and accurate, naive Bayes classifiers are used in many different fields
If it rained 3 out of 10 days in the past where the days were exactly like today,
the probability it will rain today is 30%
Conditional Probability – the probability that an event occurs (not yet observed) given that another event has occurred. For example:
1. Given that the card drawn is red (an event that has occurred),
2. what is the probability that it is a king (an event not yet observed)?
3. Since the card is red, there are 26 equally likely red cards
4. Of these 26 possible cards we are interested in kings, of which there are 2 (the king of diamonds and the king of hearts)
5. Thus the conditional probability that the card is a king, given a red card, is 2/26
6. Compare this with the joint probability of a red king (2/52)
7. Knowing that one event has occurred can change (here, increase) the probability of the other event
The objective function is to maximize the posterior probability given the training
data
One assumption that Bayes classifiers make is that the samples are independent and identically distributed (i.i.d.): samples are drawn from the same probability distribution, and the probability of one observation does not affect the probability of another (e.g., time series and network graphs are not independent).
Thus, given a d-dimensional feature vector x and the naive independence assumption, the class-conditional probability can be calculated as the product of the individual feature likelihoods:
P(x | class) = P(x1 | class) * P(x2 | class) * ... * P(xd | class)
Joint Probabilities –
a. Imagine you represent all the flight experience you have had till date as the blue area in a mathematical space. The dimensions of the boxes and circles are immaterial
b. Of these experiences, 20% of the time you experienced a flight delay
[Figure: the full flight data space (100%); event A = flight delay covers 20%, the remaining 80% is no delay]
[Figure: Venn diagram over the flight data; A = flight delay (20%), B = fog (5%), A ∩ B = flight delay and fog]
The more the overlap, the more the occurrences of flight delay and fog; the lesser the overlap, the fewer the occurrences of flight delay and fog.
P(A | B) = P(A ∩ B) / P(B)   ... eq. 1
[Figure: the flight data space with A = flight delay (20%), no delay (80%), and the evidence B = fog overlapping A]
b. The probability of event A, given that event B has occurred (fog has formed), depends on:
I. the likelihood of fog occurring whenever there was a flight delay – P(B | A)
II. the apriori probability of flight delay, P(A), which is 20% in the example
III. the apriori probability of a flight facing fog, P(B), which is 5% in the example
so that P(A | B) = P(B | A) * P(A) / P(B)
c. When it is a matter of deciding the class of an output, such as whether the flight will get delayed or not, we calculate P(A | B) and P(¬A | B) and compare which is higher. Since the denominator in both is P(B), it can be ignored, as it has no influence on which class wins
a. The following two tables reflect the apriori probabilities of the events A and B, based on past data of 100 flights

T1 – Frequency         FOG: Yes    No     Total
  Flight delayed            4      16      20
  Not delayed               1      79      80

T2 – Likelihood        FOG: Yes    No     Total
  Flight delayed           4/20   16/20    20
  Not delayed              1/80   79/80    80

d. P(flight delay | fog) ∝ P(fog | flight delay) * P(flight delay) = (4/20) * (20/100) = 0.04 (the maximal probability; there is no need to divide by P(B), the probability of fog, as it is a constant). This is the naïve Bayes score.
e. The naïve probability if flight delay and fog were unrelated (false independence) would be P(flight delay) * P(fog) = (20/100) * (5/100) = 0.01. The difference indicates the importance of Bayes' theorem.
Naïve Bayes Classifier -
Suppose there are multiple factors that could lead to flight delay (as shown in the likelihood
table below)
T2 – Likelihood      FOG            Technical Snag    Pilot Fatigue     Passenger
                  Yes     No      Yes     No        Yes     No        Yes     No      Total
  Flight delayed  4/20    16/20   10/20   10/20     0/20    20/20     12/20   8/20     20
  Not delayed     1/80    79/80   14/80   66/80     8/80    71/80     23/80   57/80    80
  Total           5/100   95/100  24/100  76/100    8/100   91/100    35/100  65/100  100

a. The probability that the flight will be delayed, given that Fog = Yes, Technical Snag = No, Pilot Fatigue = No and Passenger related = Yes, is given by Bayes' rule as:
P(flight delay | Fog ∩ ¬Snag ∩ ¬Fatigue ∩ Passenger) = P(Fog ∩ ¬Snag ∩ ¬Fatigue ∩ Passenger | flight delay) * P(flight delay) / P(Fog ∩ ¬Snag ∩ ¬Fatigue ∩ Passenger)
and, by the naïve independence assumption,
P(Fog ∩ ¬Snag ∩ ¬Fatigue ∩ Passenger | flight delay) = P(Fog | flight delay) * P(¬Snag | flight delay) * P(¬Fatigue | flight delay) * P(Passenger | flight delay)
A numeric sketch follows.
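Using the fractions in the likelihood table, the naïve Bayes scores for "delayed" vs "not delayed" can be computed directly; a small sketch:

# Likelihoods read off the table for the evidence:
# Fog = Yes, Technical Snag = No, Pilot Fatigue = No, Passenger related = Yes
p_delay, p_no_delay = 20 / 100, 80 / 100

score_delay = (4 / 20) * (10 / 20) * (20 / 20) * (12 / 20) * p_delay
score_no_delay = (1 / 80) * (66 / 80) * (71 / 80) * (23 / 80) * p_no_delay

# P(B) is the same denominator for both classes, so comparing the scores is enough
print(round(score_delay, 5), round(score_no_delay, 5))
print("delayed" if score_delay > score_no_delay else "not delayed")

# Normalizing gives an actual posterior probability for "delayed"
total = score_delay + score_no_delay
print(round(score_delay / total, 3))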
Advantages –
1. Simple, fast in processing and effective
2. Does well with noisy data and missing data
3. Requires few examples for training (assuming the data set is a true representative of the population)
4. Easy to obtain the estimated probability for a prediction
Disadvantages –
1. Relies on an often-incorrect assumption of independent features
2. Not ideal for data sets with a large number of numerical attributes
3. Estimated probabilities are less reliable in practice than the predicted classes
4. If a rare predictor value is not captured in the training set ** but appears in the test set, the probability calculation will be incorrect
** For example, if an input record has Fog = "yes", Technical Snag = "yes", Pilot Fatigue = "yes" and Passenger delay = "yes", and this combination does not occur in the training set for delayed flights, then the probability calculation in step "a" on the previous slide becomes 0!
Sol: Naive+Bayesian+Pima+Diabetes+.ipynb
Decision Trees
Decision Trees -
1. Decision tree classifiers use a tree structure to model the relationships among the features and the potential outcomes
2. Decision trees consist of nodes and branches. A node represents a decision function, while a branch represents the result of the function. The tree is thus a flow chart for deciding how to classify a new observation
3. The nodes are of three types: the Root Node (representing the original data), Branch Nodes (each representing a decision function) and Leaf Nodes (each holding the result of all the previous functions that connect to it)
Decision Trees -
4. For a classification problem, the posterior probability of all the classes is reflected in the leaf node, and the leaf node is assigned the majority class
5. After executing all the functions from the Root Node onwards, the class of a data point is decided by the leaf node it reaches
6. For regression, the average / median value of the target attribute in the leaf is assigned to the query variable
7. Tree creation splits the data into subsets, and the subsets into further smaller subsets. The algorithm stops splitting when the data within the subsets is sufficiently homogeneous, or some other stopping criterion is met
Decision Trees -
1. The decision tree algorithm learns (i.e. creates the decision tree from the data set) by optimizing a loss function
2. The loss function represents the impurity in the target column; the requirement is to minimize the impurity as much as possible at the leaf nodes
Decision Trees -
1. There is a bag of 50 balls of red, green, blue, white and yellow colour
2. You have to pull out one ball from the bag with closed eyes. If the ball is –
a. Red, you lose the prize money accumulated
b. Green, you can quit
c. Blue, you lose half the prize money but continue
d. White, you lose a quarter of the prize money and continue
e. Yellow, you can skip the question
3. This state, where you have to decide and your decision can result in various outcomes with equal probability, is said to be the state of maximum uncertainty
4. If you have a bag full of balls of only one colour, then there is no uncertainty: you know what is going to happen. Uncertainty is zero.
5. Thus, the more the homogeneity, the lesser the uncertainty, and vice versa
6. Uncertainty is expressed as entropy or the Gini index
Decision Trees -
Suppose we wish to find whether shipping mode and order priority have any influence on customer location. Customer location is the target column and is like the bag of coloured balls
[Diagram: Sales Data → split on Shipping Mode (Regular Air / Express Air) → each branch split on Order Priority (Low / High)]
When sub-branches are created, the total entropy of the sub-branches should be less than the entropy of the parent node. The greater the drop in entropy, the more the information gained
b. Let the two classes be Red → class 0 and Black → class 1, with entropy H(X) = -p0*log2(p0) - p1*log2(p1)
d. Suppose we remove all red balls from the bag; then the entropy will be
H(X) = -1.0*log2(1.0) - 0.0*log2(0) = 0 (taking 0*log2(0) = 0) ## Entropy is 0, i.e. information is 100%; see the helper sketch below
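A small helper reproducing the entropy calculation above, with the 0*log2(0) = 0 convention:

import math

def entropy(probabilities):
    """Shannon entropy in bits; 0 * log2(0) is treated as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # two equally likely classes -> 1.0 (maximum uncertainty)
print(entropy([1.0, 0.0]))   # only black balls left      -> 0.0 (no uncertainty)
print(entropy([0.2] * 5))    # five equally likely colours -> ~2.32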
Machine Learning (Decision Tree Classification)
Decision Trees -
Entropy and information gain at each level of the tree:
1. Root: Shipping Mode node with 1000 records; entropy E0 is the maximum (say 1); information gain so far = 0
2. Split 1: Regular Air (700 records, entropy E1a) and Express Air (300 records, entropy E1b); weighted entropy E1 = E1a*(700/1000) + E1b*(300/1000); information gain = E0 - E1
3. Split 2: under Regular Air, Low Priority (500 records, E2a) and High Priority (200, E2b); under Express Air, Low Priority (100, E2c) and High Priority (200, E2d); E2 = E2a*(500/700) + E2b*(200/700) + E2c*(100/300) + E2d*(200/300); information gain = E1 - E2
The tree will stop growing when a stopping criterion for splitting is reached, which could be –
a. The tree has reached a certain pre-fixed depth (the longest path from root node to leaf node)
b. The tree has reached the maximum number of nodes (tree size)
c. All attributes to split on have been exhausted
d. A leaf node resulting from the split would have fewer than a predefined number of data points
Decision Trees -
1. Gini index – calculated by subtracting the sum of the squared class probabilities from one: Gini = 1 - Σ pi^2
a. Uses the squared proportion of classes
b. For a perfectly classified node, the Gini index would be zero
c. For evenly distributed classes it would be 1 - (1 / #Classes)
d. You want a variable split that has a low Gini index
e. Used in the CART algorithm
2. Entropy –
a. Favours splits with small counts but many unique values
b. Weights the probability of a class by log2 of the class probability
c. A smaller value of entropy is better; it makes the difference from the parent node's entropy larger
d. Information gain is the entropy of the parent node minus the weighted entropy of the child nodes
3. C5.0 is Quinlan’s latest version and it uses less memory and builds smaller
rulesets than C4.5 while being more accurate
Decision Trees -
Advantages –
1. Simple, fast in processing and effective
2. Does well with noisy data and missing data
3. Handles numeric and categorical variables
4. Interpretation of results does not require mathematical or statistical knowledge
Disadvantages –
1. Often biased towards splits on features that have a large number of levels
2. May not be optimal, as modelling some relationships with axis-parallel splits is not optimal
3. Small changes in the training data can result in large changes to the logic
4. Large trees can be difficult to interpret
3. If left unconstrained, they can build tree structures that adapt to the training data, leading to overfitting
4. To avoid overfitting, we need to restrict the tree's freedom during tree creation. This is called regularization
1. max_depth – the maximum length of a path from root to leaf (in terms of the number of decision points); a leaf node at this depth is not split further. It can lead to a tree where a leaf node on one side contains many observations, whereas on the other side nodes containing far fewer observations keep getting split. A regularization sketch follows.
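A sketch of regularizing a scikit-learn decision tree; the data set and the hyperparameter values are illustrative, not recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
regularized = DecisionTreeClassifier(
    max_depth=4,            # cap the longest root-to-leaf path
    min_samples_leaf=10,    # do not create leaves with fewer than 10 observations
    random_state=0).fit(X_train, y_train)

# The unconstrained tree typically scores near 100% on training data (overfit)
for name, model in [("unconstrained", unconstrained), ("regularized", regularized)]:
    print(name, round(model.score(X_train, y_train), 3), round(model.score(X_test, y_test), 3))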
Decision Tree -
Sol: Regularization+Credit+Decision+Tree.ipynb
Ensemble Methods
Ensemble Methods -
To combine the predictions of several base estimators built with a given learning
algorithm in order to improve generalizability / robustness over a single estimator
1. Averaging methods: build several estimators independently and then average / vote their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
E.g. Bagging methods, Forests of randomized trees, ...
2. Boosting methods: base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. E.g. AdaBoost, Gradient Tree Boosting, ...
Ensemble Methods -
1. Designed to improve the stability and accuracy of classification and regression models
2. Can be used with any type of machine learning model; mostly used with decision trees
3. Uses sampling with replacement to generate multiple samples of a given size. A sample may contain repeated data points
4. For a large sample size, each sample is expected to contain roughly 63.2% (1 - 1/e) unique data points, the rest being duplicates
5. For classification, bagging is used with voting to decide the class of an input, while for regression the average or median value is calculated. A sketch follows.
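A bagging sketch with scikit-learn (the base learner defaults to a decision tree; the data set and 50 estimators are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 50 base learners (decision trees by default), each trained on a bootstrap
# sample drawn with replacement; predictions are combined by voting
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())   # accuracy averaged across 5 folds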
Source: https://link.springer.com/article/10.1007/s13721-013-0034-x
Sol: Bagging+Credit+Decision+Tree.ipynb
1. Similar to bagging, but the learners are grown sequentially; except for the first, each
subsequent learner is grown from previously grown learners
2. If the learner is a Decision Tree, each of the trees can be small, with just a few terminal
nodes (determined by the parameter d supplied )
3. During voting, higher weight is given to the votes of learners which perform better on their respective training data, unlike bagging where all learners get equal weight
4. Boosting slows down learning (because it is sequential) but the model generally performs
well
It is called Adaptive Boosting because the weights are re-assigned to each instance, with higher weights given to incorrectly classified instances
Source: https://link.springer.com/article/10.1007/s13721-013-0034-x
7. Two prominent boosting algorithms are AdaBoost (short for Adaptive Boosting) and Gradient Boosting
8. In AdaBoost, successive learners are created with a focus on the ill-fitted data of the previous learner
9. Each successive learner focuses more and more on the harder-to-fit data, i.e. the residuals of the previous tree; a sketch follows
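An AdaBoost sketch; the default base learner is a shallow decision tree (a stump), and the data set and 50 estimators are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learners are grown sequentially; misclassified instances get larger weights,
# and better-performing learners get a larger say in the final weighted vote
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(round(ada.score(X_test, y_test), 3))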
Sol: Adaboost+Credit+Decision+Tree.ipynb
1. Each learner is fit on a modified version of the original data (the original data is replaced with the x values and the residuals from the previous learner)
2. By fitting new models to the residuals, the overall learner gradually improves in the areas where the residuals are initially high
Sol: GRB+Credit+Decision+Tree.ipynb
1. Each tree in the ensemble is built from a sample drawn with replacement (bootstrap) from
the training set
2. In addition, when splitting a node during the construction of a tree, the split that is chosen is
no longer the best split among all the features
3. Instead, the split that is picked is the best split among a random subset of the features
4. As a result of this randomness, the bias of the forest usually slightly increases (with respect
to the bias of a single non-random tree)
5. Due to averaging, its variance decreases, usually more than compensating for the increase in bias, hence yielding an overall better result; a sketch follows
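A random forest sketch; max_features controls the random subset of features considered at each split (values here are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample; each split considers only a random
# subset of features ("sqrt" = square root of the feature count)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())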
1. Used with decision trees: different trees are created by providing different subsets of features from the feature set to the tree-creating algorithm. The optimization function is entropy or the Gini index
[Figure: N instances bootstrapped into multiple samples, one tree built per sample]
Lab- 9 Improve defaulter prediction of the decision tree using Random Forest
Sol: RF+Credit+Decision+Tree.ipynb
Source: http://pubs.rsc.org/-/content/articlelanding/2014/mb/c4mb00410h/unauth#!divAbstract
Sol: Stacking+Credit+Decision+Tree.ipynb
[Table: side-by-side comparison of supervised learning algorithms (Naive Bayes, Random Forests, AdaBoost and others) on problem type, interpretability, training and prediction speed, and feature handling – see the source below]
Source: http://www.dataschool.io/comparing-supervised-learning-algorithms/
Support Vector Machines
1. Known as the maximum-margin hyperplane: find the linear model with the maximum margin. Unlike the linear classifiers above, the objective is not minimizing the sum of squared errors but finding a line / plane that separates two or more groups with maximum margins
[Figure: maximum-margin hyperplane with the margin on either side; the data points lying on the margin are the support vectors]
http://stackoverflow.com/questions/9480605/what-is-the-relation-between-the-number-of-support-vectors-and-training-data-and
Support Vector Machines
1. The first line does separate the two sets, but it is too close to both the red and the green data points
2. Chances are that when this model is put in production, variance in the two clusters may force some data points onto the wrong side
3. The second line does not look so vulnerable to that variance. The nearest points from the two clusters define the margin around the line and are the support vectors
4. SVMs try to find the second kind of line, where the line is at maximum distance from both clusters simultaneously
Support Vector Machines
1. The distance from the hyperplane to a support vector is |w•x + b| / ||w|| = 1 / ||w||, so the margin width is 2 / ||w||
2. Think in terms of multi-dimensional space: the SVM algorithm has to find the combination of weights across the dimensions such that the hyperplane has the maximum possible margin around it
3. All the predictor variables have to be numeric and scaled.
Support Vector Machines Allowing Errors
Support Vector Machines Linearly Non Separable Data
1. When data is not linearly separable, SVM uses the kernel trick to make it linearly separable
2. This concept is based on Cover's theorem: "given a set of training data that is not linearly separable, with high probability it can be transformed into a linearly separable training set by projecting it into a higher-dimensional space via some non-linear transformation"
3. In the picture above, replace x1 with x1^2, x2 with x2^2, and create a third dimension x3 = sqrt(2*x1*x2)
Support Vector Machines Linearly Non Separable Data
1. Using kernel tricks, the data points are projected into a higher-dimensional space
2. The data points become relatively more easily separable in the higher-dimensional space
3. An SVM separating hyperplane can now be drawn between the data sets with a given complexity
1. Suppose we are given training data {(x1, y1), ..., (xn, yn)} ⊂ X × R, where X denotes the space of the input patterns (e.g. X = R^d)
2. The goal is to find a function f(x) that has at most ε deviation from the actually obtained targets yi for all the training data, and at the same time is as flat as possible
3. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this
4. f can take the form f(x) = ⟨w, x⟩ + b, with w ∈ X, b ∈ R
5. Flatness means that one seeks a small w. One way to ensure this is to minimize ||w||^2 = ⟨w, w⟩
7. In the first picture, ||w||^2 is not minimized, and the third constraint is not met either. Taking the pointer to be the x value, yi - ⟨w, xi⟩ - b < ε (the difference between the green dot and the line), but ⟨w, xi⟩ + b - yi (the difference between the line and the red dot) is not < ε
8. In the second picture, all three constraints are met
9. Sometimes it may not be possible to meet the constraints because the data points are not linearly separable, so we may want to allow for some errors
10. We introduce slack variables ξi, ξi* to cope with otherwise infeasible constraints of the optimization problem; this is known as the soft-margin formulation
11. The slack terms allow some errors, i.e. data points may lie outside the ε-tube, at a deviation of up to ε + ξ
Support Vector Machines Kernel Functions
1. SVM libraries come packaged with some standard kernel functions such as polynomial, radial basis function (RBF) and sigmoid; a comparison sketch follows the source link below
Source: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
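A sketch comparing those kernels on the same data; the concentric-circles data set is a stand-in for linearly non-separable data, and C and gamma are left at their defaults:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original two dimensions
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    print(kernel, round(model.score(X_test, y_test), 3))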
Sol: OCR-SVM.ipynb
Machine Learning (Support Vector Machines)
The model has predicted the characters correctly 84% of the time
Machine Learning (Artificial Neural Network)
1. An Artificial Neural Network (ANN) models relationships between a set of input data and output data
2. ANN models are based on the observed behaviour of biological neural networks in our brains
Machine Learning (Artificial Neural Network)
[Figures: a natural (biological) neuron and the corresponding abstract (artificial) neuron]
Machine Learning (Artificial Neural Network)
8. The processing element of an ANN is called a node, representing an artificial neuron
10. The initial layer is the input layer and the last layer is the output layer; in between we have the hidden layers
Machine Learning (Artificial Neural Network)
11. A given node will fire and feed a signal to the subsequent nodes in the next layer only if the step function it implements reaches a threshold
12. In ANNs, use of the sigmoid function, 1 / (1 + e^(-x)), is more common than a step function
[Figure: step activation – the output ai fires once the input crosses the threshold]
Machine Learning (Artificial Neural Network)
13. The summation function g can be implemented in many ways. It does not have to be
mathematical addition of the inputs
Machine Learning (Artificial Neural Network)
14. The generic ANN architecture
15. A neural net consists of multiple layers. It has two layers on the edges: one is the input layer and the other is the output layer
16. In between the input and output layers there can be many other layers, called hidden layers
Machine Learning (Artificial Neural Network)
17. The input layer is passive, does no processing, only holds
the input data to supply it to the first hidden layer
Machine Learning (Artificial Neural Network)
18. [Figure: inputs feed hidden node N1, which accumulates the weighted inputs (ACC) and outputs N1Output = Sigmoid(ACC)]
19. The weights for a given hidden node are pre-fixed (set by training), and all the nodes in the hidden layer have their own weights
20. The output of each node is fed to the output-layer nodes, or to another set of hidden nodes in another hidden layer
Machine Learning (Artificial Neural Network)
21. The output value of each hidden node is sent to each output node in the output layer
Machine Learning (Artificial Neural Network)
[Figure: inputs X1–X4 connected to Output Node 1 with weights WO11–WO14]
ACC = X1*WO11 + X2*WO12 + X3*WO13 + X4*WO14
N1Output = Sigmoid(ACC)
A numeric sketch follows.
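The accumulate-then-squash step above can be written out directly; a NumPy sketch with made-up inputs and weights (the bias term is an addition not shown on the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.2, -0.7, 2.0])        # inputs X1..X4
w = np.array([0.4, -0.1, 0.8, 0.3])        # weights WO11..WO14 for output node 1
b = 0.1                                     # bias term (assumed, not on the slide)

acc = np.dot(x, w) + b                      # ACC = X1*WO11 + ... + X4*WO14 + b
output = sigmoid(acc)                       # N1Output = Sigmoid(ACC)
print(round(acc, 3), round(output, 3))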
Machine Learning (Artificial Neural Network)
22. In a binary output ANN, the output node acts like a perceptron classifying the input
into one of the two classes
Machine Learning (Artificial Neural Network)
24. We can have an ANN with multiple output nodes, where a given output node may or may not get triggered depending on the input and the weights
Machine Learning (Artificial Neural Network)
27. The weights required to make a neural network carry out a particular task are found by a learning algorithm, together with examples of how the system should operate
28. The examples in vehicle identification could be a large Hadoop file of several million sample segments such as bicycle, motorcycle, car, bus etc.
29. The learning algorithm calculates the appropriate weights for each classification, for all nodes at all levels in the network
30. If we consider each input as a dimension, then the ANN labels different regions in the n-dimensional space; in our example one region is cars, another region is bicycles
[Figure: regions of the input space labelled Car and Bicycle]
Machine Learning (Artificial Neural Network)
Strengths – The main advantage of ANN models over statistical methods is that the latter assume linear relationships and/or normal distributions, while reality is non-linear and non-normal; the ANN model is thus able to conform to the real world.
Weaknesses – An ANN does not provide information about the relative significance of the various parameters.
Machine Learning (Artificial Neural Network)
Lab-5: Estimate concrete strength – Model Improvement
Modelling Errors
Modelling Errors
All models are impacted by three types of errors which reduce their predictive power:
1. Variance errors
2. Bias errors
3. Random errors
Variance errors
1. Caused by the random factors that impact the process that generates the data
2. The population / universe, representing infinite data points, continuously jiggles
3. A sample drawn from such a universe is a snapshot of a small part of the universe
4. A model based on one sample will perform differently on different samples
5. Variance errors increase with the number of attributes in the model, due to the increased degrees of freedom for the data points to wriggle in
Bias errors
1. Caused by our selection of the attributes and our interpretation of their influence on each other
2. The real model in the universe / population may have many more attributes, and the attributes may interact in ways not reflected in our model
Random errors
1. Caused by unknown factors; they cannot be modelled
[Figure: the population jiggles over time; a sample is a snapshot taken at time T2]
[Figure: sample data (Analytics Base Table); three random training sets and three random test sets drawn from the ABT]
Fitness of a Model
Generalize
1. Models are expected to perform well (meet minimum accuracy thresholds) in production (on real-world data)
2. But data in the real world is in constant flux / jiggle
3. Models have to perform in this context of continuous jiggle; such models are said to generalize well
4. For models to generalize well, they should be neither underfit nor overfit on the training data
Underfit models
1. Models that are over-simplified, i.e. models in which the independent and dependent attributes interact in a simple linear way (can be expressed in a linear form, e.g. y = mx + c)
2. The relationship might have been better captured by a quadratic form such as y = m1*x + m2*x^2 + c
3. Underfit models result in errors, as they fail to capture the complex interactions among the attributes in the real world
4. These models will not generalize in the real world
Overfit models
1. Models that perform very well (sometimes with zero errors) on the training data
2. They are complex polynomial surfaces that twist and turn in the feature space to cleanly separate the classes
3. They adjust to the variance in the training data, i.e. they try to adjust to the positions of the data points even though those positions are not the expected values of the data points (the mean of the jiggle)
4. These models adapt to the variance error in the data set and will not generalize in the real world
[Figure: three fits of the same data – underfit, good fit, overfit]
In overfit models, the model absorbs the noise (variance) in the data points, achieving almost 100% accuracy in a controlled environment. But when used in production, where the data points have a different variance, the model will perform poorly.
Model performance measures
a. Confusion Matrix – a 2x2 tabular structure reflecting the performance of the model in four blocks:

Confusion Matrix     Predicted Positive      Predicted Negative
  Actual Positive    True Positive (TP)      False Negative (FN)
  Actual Negative    False Positive (FP)     True Negative (TN)

b. Accuracy – how accurately / cleanly the model classifies the data points; the fewer the false predictions, the higher the accuracy: (TP + TN) / (TP + TN + FP + FN)
c. Sensitivity / Recall – how many of the actual Positive data points are identified as Positive by the model; remember, False Negatives are those data points which should have been identified as Positive: TP / (TP + FN)
d. Specificity – how many of the actual Negative data points are identified as Negative by the model: TN / (TN + FP)
e. Precision – among the points identified as Positive by the model, how many are really Positive: TP / (TP + FP)
A sketch computing these follows.
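The four measures can be computed from the confusion-matrix blocks; a sketch reusing the TP / FP / TN / FN counts of the diabetes example quoted earlier in the deck:

tp, fn = 46, 38    # actual diabetics: correctly and incorrectly classified
tn, fp = 134, 13   # actual non-diabetics: correctly classified and false alarms

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall: share of actual positives found
specificity = tn / (tn + fp)          # share of actual negatives found
precision   = tp / (tp + fp)          # share of predicted positives that are really positive

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3), round(precision, 3))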
Receiver Operating Characteristics (ROC) Curve
To explain F Stats
Thank You