
Introduction to machine learning

Machine Learning

1
Introduction to machine learning

Machine Learning & Data Science

1. Machine learning is part of a larger discipline called Data Science

2. Data science is the process of applying science and domain expertise to data to
extract useful information from it.

3. It includes the application of statistical and mathematical tools and techniques to
glean useful information from data using machine learning

2
Introduction to machine learning

“[Machine Learning is the] field of study that gives computers the ability to
learn without being explicitly programmed.” — Arthur Samuel, 1959

Image Source : https://www.icecreamlabs.com

3
Introduction to machine learning

What is machine learning?

1. Is a process of enabling automated systems to learn to do tasks based on well-defined
statistical and mathematical methods

2. The ability to do the tasks is embodied in form of a model which is the result of the
learning process

3. The model represents the process which generated the data used to build the
model

4. The data used is expected to represent the long term behaviour of the process

5. The more representative data is of the real world in which the process is
executed, the better the model would be

6. The challenge is how to obtain truly representative data sets

4
Introduction to machine learning

What do machine learning algorithms do?

1. Search through data to look for patterns

2. Patterns in form of trends, cycles, associations, classes etc.

3. Express these patterns as mathematical structures such as probability equations
or polynomial equations

4. These expressions (the end result of running the algorithms) are broadly called models

5
Introduction to machine learning

When is machine learning useful ?

1. Cannot express our knowledge about patterns as a program. For e.g. Character
recognition or natural language processing

2. Do not have an algorithm to identify a pattern of interest. For e.g. In spam mail detection

3. Too complex and dynamic. For e.g. Weather forecasting

4. Too many permutations and combinations possible. For e.g. Genetic code mapping

5. No prior experience or knowledge. For e.g. Mars rover

6. Patterns hidden in humongous data. For e.g. Recommendation system

6
Introduction to machine learning

Where are machine learning based systems used (examples only)

1. Fraud detection

2. Sentiment analysis

3. Credit risk management

4. Prediction of equipment failures

5. New pricing models / strategies

6. Network intrusion detection

7. Pattern and image recognition

8. Email spam filtering

7
Introduction to machine learning

Machine Learning Pre-requisites

1. Rich set of data representing the real world

2. Knowledge and skills in


a. Maths and statistics
b. Programming (Python, R, Java, Go)
c. Domain knowledge

8
Introduction to machine learning

Real World as Mathematical Space

9
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

1. A data set representing the real world is a collection of attributes that define an
entity

2. Each entity is represented as one record / line in the data set

Attributes / Dimensions

10
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

1. Each attribute becomes a dimension

2. Each record becomes a point in the space

(Figure: points plotted in a 3-D feature space with axes BP level, Age and Sugar; points labelled Heart healthy / Potential heart ailments)

11
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

1. Position of a point in space is defined with respect to the origin

2. The position is decided by the values of the attributes for a point

3. We believe there is a process in nature that, based on the given parameters, makes
some healthy and others not so

(Figure: 3-D feature space with axes BP level, Age and Sugar; points labelled Heart healthy / Potential heart ailments)

12
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

4. We wish to build a model that would represent the natural process

5. The model could be a simple plane, complex plane or hyperplane

6. The machine learning algorithm helps us find the model

(Figure: a separating plane in the BP level–Age–Sugar feature space dividing Heart healthy points from Potential heart ailments points; a few points fall on the wrong side and are marked as erroneous classification)
13
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

7. In the figure, since the separator is a plane, the model will be the equation
representing the plane

ax + by + cz = d

8. x, y, z represent the three dimensions i.e. BP, Age, Sugar, while d represents
the color i.e. healthy or ailing heart

(Figure: the separating plane in the BP level–Age–Sugar feature space; Heart healthy / Potential heart ailments)

14
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

9. A new data point (a new customer) enters the system

10. Its x, y and z values will be fed into the model to get the value of d (healthy or
ailing)

11. The data point will be placed above or below the plane based on d

(Figure: the plane ax + by + cz = d in the BP level–Age–Sugar feature space; Heart healthy / Potential heart ailments)

15
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

12. Whether the new data point is correctly placed (above or below the plane), i.e.
correctly classified as an ailing or healthy heart, will be known only after direct
observation

(Figure: the plane ax + by + cz = d in the BP level–Age–Sugar feature space; Heart healthy / Potential heart ailments)

16
Introduction to machine learning

Machine learning happens in mathematical space / feature space:

13. Only a direct test on the object of interest will tell whether the classification is
correct or not

14. If the majority of new data points are correctly classified, the model is good,
else not

(Figure: the plane ax + by + cz = d in the BP level–Age–Sugar feature space; Heart healthy / Potential heart ailments)

17
Introduction to machine learning

Model Dimensions

18
Introduction to machine learning
Dimensions / Dimension reduction
1. An important step in preparing data for machine learning. If not done with care, or not
done at all, it may have an adverse effect on the accuracy of machine learning models

2. Process to convert high dimensional data set into lesser dimension data with minimal
loss of variance

3. Total variance in a dataset (all dimensions taken together) is often equated to the
information conveyed by the data set as a whole.

19
Introduction to machine learning
Dimensions / Dimension reduction
1. In the animation below, data on one attribute alone does not convey much, but when combined
with another dimension it reveals more actionable information
2. The more the dimensions, the more information the data set reveals

(Figure: spread, variance and mean shown on one dimension (D1) and on a second dimension (D2))

20
Introduction to machine learning
Dimensions / Dimension reduction
1. Beyond a point (in terms of number of dimensions) there are diminishing returns in terms of
information content
2. When data with such dimensions is fed to machine learning algorithms, the models
tend to become less and less accurate because of the degree of noise vs. information
contributed by the dimensions

21
Introduction to machine learning
Dimensions / Dimension reduction
a) Low Variance Filter. Columns with little variance carry little information. Columns with
variance lower than a given threshold are removed. A word of caution: variance is range
dependent; therefore normalization is required before applying this technique.

b) High Correlation Filter. Data columns with very similar trends are also likely to carry very
similar information. In this case, only one of them will suffice to feed the machine learning
model.

c) Mutual Information Filter for non-linearly related coordinates - a criterion for feature
selection in machine learning. Based on Information theory by Claude Shannon’s concept
of entropy (joint entropy)

d) Wrapper methods to eliminate features - feature importance assessment is pushed on
to the model training process, e.g. a recursive feature elimination routine

Refer : http://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html
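The two simplest filters above, (a) and (b), can be expressed in a few lines of scikit-learn and pandas. The snippet below is a minimal sketch, not part of the original deck; the toy data frame and the thresholds (0.01 variance, 0.9 correlation) are illustrative assumptions.

```python
# Minimal sketch of a low variance filter and a high correlation filter.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],   # perfectly correlated with x1
    "x3": [7, 7, 7, 7, 7],    # constant column, carries no information
})

# (a) Low variance filter: normalize first, since variance is range dependent
scaled = MinMaxScaler().fit_transform(df)
low_var = VarianceThreshold(threshold=0.01).fit(scaled)
kept = df.columns[low_var.get_support()]
print("after low-variance filter:", list(kept))

# (b) High correlation filter: drop one column from each highly correlated pair
corr = df[kept].corr().abs()
to_drop = [c for i, c in enumerate(corr.columns)
           if any(corr.iloc[:i][c] > 0.9)]
print("dropped by correlation filter:", to_drop)
```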

22
Introduction to machine learning
Feature Extraction / Principal Component Analysis
e) Principal Component Analysis attempts to kill two birds with one stone –
1. It transforms existing dimensions to increase the signal-to-noise ratio (SNR), creating a new
dimension out of the original ones

2. It helps remove redundancy by eliminating attributes / dimensions which contain the same
information as another attribute

Refer to the deck on PCA for a mathematically minimalist approach
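A minimal scikit-learn sketch of PCA, not part of the deck; the synthetic data and the choice of two components are assumptions for illustration.

```python
# PCA sketch: project correlated features onto fewer dimensions
# while keeping most of the variance. Data and n_components are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + rng.normal(scale=0.1, size=200),  # redundant copy of x1
                     rng.normal(size=200)])

X_std = StandardScaler().fit_transform(X)      # PCA is scale sensitive
pca = PCA(n_components=2).fit(X_std)
X_reduced = pca.transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)       # (200, 2)
```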

23
Introduction to machine learning

Machine Learning Categories

24
Introduction to machine learning

Machine learning categories:

Source: https://quantdare.com/machine-learning-a-brief-breakdown/

25
Introduction to machine learning

Python for Machine Learning

26
Introduction to machine learning

Python For Machine Learning


1. Python and R are well suited for data science work. Recently, Go has emerged as
an up-and-coming alternative to these languages, but it is not yet as well
supported as Python.

2. In practice, data science teams use a combination of languages to play to the


strengths of each one, with Python and R used in varying degrees

3. Python stands out as the language best suited for all areas of the data science
and machine learning framework.

4. Refer :
https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis#gs.q
kenPdo

27
Introduction to machine learning

Python packages for Data Science and Machine Learning


NumPy for scientific computing, many other libraries use NumPy arrays to operate
efficiently. It also supports multidimensional arrays and matrices, as well as
mathematical and statistical functions that need little code

SciPy builds on NumPy by adding a collection of algorithms and functions for


computing integrals numerically, solving differential equations, optimization, and more.

Pandas adds data structures and tools that are designed for practical data analysis in
finance, statistics, social sciences, and engineering. Pandas works well with
incomplete, messy, and unlabeled data (i.e., the kind of data you’re likely to encounter
in the real world), and provides tools for shaping, merging, reshaping, and slicing
datasets

matplotlib is the standard Python library for creating 2D plots and graphs.

Jupyter extends the functionality of Python’s interactive interpreter with a souped-up


interactive shell that adds introspection, rich media, shell syntax, tab completion, and
command history retrieval.

28
Introduction to machine learning

Python packages for Data Science and Machine Learning


scikit-learn builds on NumPy and SciPy by adding a set of algorithms for common
machine learning and data mining tasks, including clustering, regression, and
classification.

Theano uses NumPy-like syntax to optimize and evaluate mathematical expressions.


What sets Theano apart is that it takes advantage of the computer’s GPU in order to
make data-intensive calculations up to 100x faster than the CPU alone. Theano’s
speed makes it especially valuable for deep learning and other computationally
complex tasks.

TensorFlow Developed by Google as an open-source library for training neural


networks.

29
Introduction to machine learning

Introduction to Supervised
Machine Learning

30
Introduction to machine learning

Characteristics of Supervised Machine Learning -

a. Class of machine learning algorithms that work on externally supplied instances


(data) in form of predictor attributes and associated target values

b. Process of model building involves training and testing stages

c. Training stage involves use of training data ( a subset of the externally supplied
data) supplied in form of independent and target values

d. They produce a model which is supposed to represent the real process that
generated the data.

e. The model is tested for its performance in the test stage using test data. If
satisfactory, the model is implemented (productionized)

f. The model is used to predict target values for new data points

31
Introduction to machine learning

Data Science Machine Learning Steps -


Identify Data Identify what type of data, source of data and how to ingest data into
your system. Need domain expertise and lateral thinking
Required

Pre-process Address data quality issues such as missing values, outliers, data
Data pollution etc. Establish veracity of the data. Select attributes for model,
Need domain expertise
Create
Split the data into training set and test set. Generally
training & 70:30 ratio is used
test set
Select
Select appropriate algorithm/s to model. For e.g. Random
appropriate Forest, K Nearest Neighbors etc. Depends on data
algorithm/s

Train & build Build the model in Python or Spark or R


the model
Evaluate the model on test data
Evaluate ensure it is not overfit or
with test data underfit and likely to generalize
well
Deploy at scale
OK?
No Yes

Productionize
CRISP DM & calibrate
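Step 3, creating the training and test sets, is a one-liner in scikit-learn. The sketch below is illustrative and not part of the deck; the iris data set stands in for whatever features and target you have prepared, and the 70:30 split matches the ratio mentioned above.

```python
# Minimal sketch of the train/test split step (70:30).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,     # 70:30 split as in the slide
    random_state=42,    # reproducible split
    stratify=y,         # keep class proportions (classification)
)
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
```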
32
Introduction to machine learning

Linear Regression

33
Introduction to machine learning

Linear Regression Models -

a. The term "regression" generally refers to predicting a real number. However, it


can also be used for classification (predicting a category or class.)

b. The term "linear" in the name “linear regression” refers to the fact that the
method models data with linear combination of the explanatory variables.

c. A linear combination is an expression where one or more variables are scaled


by a constant factor and added together.

d. In the case of linear regression with a single explanatory variable, the linear
combination used in linear regression can be expressed as:

response = intercept + constant ∗ explanatory

e. In its most basic form it fits a straight line to the response variable. The model is
designed to fit the line that minimizes the sum of squared differences (also called errors
or residuals).
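A minimal scikit-learn sketch of fitting such a line, not taken from the deck; the synthetic data and noise level are illustrative assumptions.

```python
# Fit response = intercept + constant * explanatory on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)   # single explanatory variable
y = 3.0 + 2.0 * x.ravel() + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(x, y)
print("intercept:", model.intercept_)   # close to 3.0
print("constant :", model.coef_[0])     # close to 2.0
```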

34
Introduction to machine learning
Linear Regression Models -

a. Before we generate a model, we need to understand the degree of relationship


between the attributes Y and X

b. Mathematically correlation between two variables indicates how closely their


relationship follows a straight line. By default we use Pearson’s correlation which
ranges between -1 and +1.

c. Correlation of extreme possible values of -1 and +1 indicate a perfectly linear


relationship between X and Y whereas a correlation of 0 indicates absence of linear
relationship
I. When the r value is small, one needs to test whether it is statistically significant before
concluding that there is (or is not) a correlation

35
Introduction to machine learning
Linear Regression Models -

d. Coefficient of correlation - Pearson's coefficient r(x,y) = Cov(x,y) / ( StdDev(x) * StdDev(y) )

(Figure: three scatter plots where r is near 0, r is near -1 and r is near +1)

e. Generating a linear model for cases where r is near 0 makes no sense. The model will
not be reliable: for a given value of X, there can be many values of Y. Nonlinear
models may be better in such cases
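The coefficient can be computed directly from the covariance formula above, or with scipy; the sketch below is an illustration rather than deck material and does both on toy data.

```python
# Pearson's r two ways: from Cov(x,y) / (std(x) * std(y)) and via scipy.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_scipy, p_value = pearsonr(x, y)       # also returns a significance p-value

print(round(r_manual, 4), round(r_scipy, 4))   # the two values agree
```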

36
Introduction to machine learning
Linear Regression Models (Recap) -

f. Coefficient of correlation - Pearson's coefficient r(x,y) = Cov(x,y) / ( StdDev(x) * StdDev(y) )

(Figure: scatter plot split into quadrants around the means; points in the +ve/+ve and -ve/-ve quadrants contribute positively to the covariance, points in the other two quadrants contribute negatively; panels contrast covariance = 0 and covariance > 0)

http://www.socscistatistics.com/tests/pearson/Default2.aspx

37
Introduction to machine learning
Linear Regression Models -
g. Given Y = f(x), and the scatter plot shows an apparent correlation between X and Y,
let's fit a line into the scatter which shall be our model

h. But there are an infinite number of lines that can be fit in the scatter. Which one
should we consider as the model?

i. This and many other algorithms use gradient descent, or variants of the
gradient descent method, for finding the best model

j. Gradient descent methods use partial derivatives on the parameters (slope and
intercept) to minimize the sum of squared errors

38
Introduction to machine learning

Linear Regression Models (Recap) -


k. Whichever line we consider as the model, it will not pass through all the points.
l. The distance between a point and the line (drop a line vertically (shown in
yellow)) is the error in prediction
m. The line which gives the least sum of squared errors is considered the best line

Error = T - (mx + c)

The sum of all errors can cancel out and give 0, so we square all the errors and
sum them up. The line which gives us the least sum of squared errors is the best fit

39
Introduction to machine learning
Linear Regression Models -
n. Coefficient of determination – determines the fitness of a linear model. The closer the
points get to the line, the closer R^2 (the coefficient of determination) gets to 1, and the better the model is

(Figure: fitted line through the scatter; the model line always passes through Xbar and Ybar)

40
Introduction to machine learning
Linear Regression Models -
o. Coefficient of determination (Contd…)
I. There are a variety of errors for all those points that don't fall exactly on the line.
II. It is important to understand these errors to judge the goodness of fit of the model, i.e.
how representative the model is likely to be in general
III. Let us look at point P1, which is one of the given data points, and the associated errors due to
the model

1. P1 – Original y data point for a given x
2. P2 – Estimated y value for the given x
3. Ybar – Average of all Y values in the data set
4. SST – Total sum of squares: variance of P1 from Ybar, (Y – Ybar)^2
5. SSR – Regression sum of squares: (P2 – Ybar)^2 (the portion of SST captured by the regression model)
6. SSE – Residual sum of squares: (P1 – P2)^2

(Figure: regression line with point P1 above it; vertical segments mark SSE between P1 and P2, SSR between P2 and Ybar, and SST between P1 and Ybar)
41
Introduction to machine learning
Linear Regression Models -

p. Coefficient of determination (Contd…)

1. That model is the most fit where every data point lies on the line, i.e. SSE = 0 for
all data points

2. In that case SSR equals SST, i.e. SSR/SST is 1.

3. A poor fit will mean a large SSE; SSR/SST will be close to 0

4. SSR / SST is called r^2 (r square) or the coefficient of determination

5. r^2 is always between 0 and 1 and is a measure of the utility of the regression model

(Figure: the same regression plot as before, showing SSE, SSR and SST for point P1)
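These quantities are easy to verify numerically. The sketch below is illustrative, not from the deck; it computes SST, SSR and SSE for a fitted line and checks that SSR/SST matches scikit-learn's r2_score.

```python
# Compute SST, SSR, SSE and r^2 = SSR/SST for a simple fitted line.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)  # portion explained by the regression
sse = np.sum((y - y_hat) ** 2)      # residual error

print("SST ~= SSR + SSE:", round(sst, 4), round(ssr + sse, 4))
print("r^2 = SSR/SST   :", round(ssr / sst, 4))
print("r2_score        :", round(r2_score(y, y_hat), 4))
```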

42
Introduction to machine learning
Linear Regression Models -

q. Coefficient of determination (Contd…) -

(Figure: two scatter plots, each marking a point A close to the fitted line and a point B farther from it)

In the case of point “A”, the line explains the variance of the point

Whereas for point “B” there is a small area (light grey) which the line does not represent.

The percentage of total variance that is represented by the line is the coefficient of determination

43
Introduction to machine learning

Linear Regression Model -

Advantages –
1. Simple to implement and easy to interpret the output coefficients

Disadvantages -
1. Assumes a linear relationship between dependent and independent variables. That
is, it assumes there is a straight-line relationship between them
2. Outliers can have huge effects on the regression
3. Linear regression assumes independence between attributes
4. Linear regression looks at a relationship between the mean of the dependent variable
and the independent variables.
5. Just as the mean is not a complete description of a single variable, linear regression
is not a complete description of relationships among variables
6. Boundaries are linear

44
Introduction to machine learning

Linear Regression Model -

Lab- 1- Estimating mileage based on features of a second hand car

Description – Sample data is available at


https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The dataset has 9 attributes, listed below
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Sol : mpg-linear regression.ipynb

45
Introduction to machine learning

Logistic Regression

46
Introduction to machine learning

Logistic Regression Model -

a. A classification method built on the same concept as linear regression. The
response variable is categorical. In its simplest form, the response variable is
binary, i.e. it belongs to one class or the other

b. Given the value of the predictor (variable x), the model estimates the probability that
the new data point belongs to a given class, say “A”. Probability values
range between 0 and 1.

(Figure: density distributions of Class B and Class A along the predictor axis)

47
Introduction to machine learning

Logistic Regression Model

c. A new data point (shown with “?”) needs to be classified i.e. does it belong to
class A or B.

d. Given the distribution, the closer the point is to the origin, the less likely it is to belong to
class A; the farther it is from the origin, the more likely it belongs to class A

e. One can try to fit a simple linear model (y = mx + c) where y greater than a
threshold means the point most probably belongs to class A. The challenge is, for
extreme values of x, the probability is < 0 or > 1, which is absurd

(Figure: a straight-line probability model fitted to the two classes, overshooting 0 and 1 at the extremes)

48
Introduction to machine learning

Logistic Regression Model -

f. The linear model is passed through a logistic function p = 1 / (1 + e^(-t)), the result of
which is a value between 0 and 1. Thus p represents the probability that a data point
belongs to class “A” given x

(Figure: S-shaped logistic curve separating Class B at low x from Class A at high x)

Note: The linear model t (which is of the form mx + c) represents the logit, which is the natural log(p / (1 - p)),
where p is the probability that a data point belongs to the class or not
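A small sketch, illustrative rather than from the deck, of how the logistic function maps the linear model t = mx + c to probabilities; the slope and intercept values are assumptions.

```python
# Map a linear model t = m*x + c to probabilities with the logistic function.
import numpy as np

def logistic(t):
    """p = 1 / (1 + e^(-t)); always returns values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

m, c = 1.5, -4.0                       # illustrative slope and intercept
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
p = logistic(m * x + c)                # probability of belonging to class "A"

print(np.round(p, 3))   # small x -> p near 0, large x -> p near 1
```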
49
Introduction to machine learning

Logistic Regression Model -

k. All blue points belonging to class A are numerically represented by 1 and all red points
belonging to class B by 0

l. The y axis represents the probability of a new data point belonging to the blue class

m. As the value of x increases, the corresponding probability of a new point belonging to the
blue class increases (as is evident from the logistic function with positive slope)

n. As x increases, the probability of belonging to the blue class approaches 1, becoming 1
only in the limit

o. Similarly, as x decreases, the probability of belonging to the blue class, as indicated by the
logistic function, approaches 0

(Figure: logistic curve with Class A points plotted at probability 1 and Class B points at probability 0)

50
Introduction to machine learning

Logistic Regression Model -

(Figure: logistic curve with four labelled data points 1–4; points near Class A at the top of the curve, points near Class B at the bottom)

g. The log loss function determines the best fit line

h. The objective is to minimize the log loss

i. There can be four different cases for the values of yi and pi: points 1 and 3 are correct
classifications, points 2 and 4 are incorrect classifications

j. In the case of point 1 (correct classification), yi = 1 and pi is close to 1, so yi * log(pi) => 0 and the total
expression approaches 0 … no error
k. In the case of point 2 (incorrect classification), yi = 1 but pi is small; -log(pi) becomes very large (the log of a
small number is large in magnitude), so the contribution to the error increases significantly
l. In the case of point 4 (incorrect classification), the log(1 - pi) term becomes very large in magnitude, so the error
contribution increases
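The behaviour described in (j)–(l) can be checked directly. The sketch below is illustrative: it computes the per-point log loss -[y*log(p) + (1-y)*log(1-p)] for confident-correct and confident-wrong predictions, and the averaged loss scikit-learn reports.

```python
# Per-point log loss: correct & confident predictions cost ~0,
# wrong & confident predictions cost a lot.
import numpy as np
from sklearn.metrics import log_loss

def point_log_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(point_log_loss(1, 0.99))   # correct, confident  -> ~0.01
print(point_log_loss(1, 0.01))   # wrong, confident    -> ~4.6
print(point_log_loss(0, 0.99))   # wrong, confident    -> ~4.6

# The model's overall loss is the average over all points:
y_true = [1, 1, 0, 0]
p_pred = [0.9, 0.8, 0.2, 0.3]
print(log_loss(y_true, p_pred))
```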

51
Introduction to machine learning

Logistic Regression Model -

Advantages -
1. Makes no assumptions about distributions of classes in feature space
2. Easily extended to multiple classes (multinomial regression)
3. Natural probabilistic view of class predictions
4. Quick to train
5. Very fast at classifying unknown records
6. Good accuracy for many simple data sets
7. Resistant to overfitting
8. Can interpret model coefficients as indicators of feature importance

Disadvantages -
1. Constructs linear boundaries

52
Introduction to machine learning

Logistic Regression Model -

Lab- 2- Predict diabetes among Pima Indians

Description – Sample data is available at


https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabete
s/pima-indians-diabetes.names

The dataset has 9 attributes listed below


1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1) Sol: Logistic Regression - Lima Diabetes.ipynb

53
Introduction to machine learning

Confusion Matrix -

1. A tool to assess the performance of a classification model such as logistic regression


model

Total = 231   | Predicted 0 | Predicted 1 | Row Total
Actual 0      |         134 |          13 |       147
Actual 1      |          38 |          46 |        84
Col Total     |         172 |          59 |       231

2. Of the 84 actual diabetes cases, the model correctly classified only 46 as diabetic

3. Of the 147 non-diabetic cases, the model correctly classified 134 as non-diabetic

4. The 13 cases who are normal but identified as diabetic are called Type I errors

5. The 38 cases of diabetic patients identified as normal are Type II errors
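scikit-learn produces this matrix directly. The sketch below is illustrative, using made-up label vectors rather than the Pima results, and shows how the cells map to Type I / Type II errors.

```python
# Build a confusion matrix; rows are actual classes, columns are predicted.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # actual labels (toy data)
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]   # model predictions

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Type I errors (false positives) :", fp)
print("Type II errors (false negatives):", fn)
```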

54
Introduction to machine learning

K Nearest Neighbours

55
Introduction to machine learning

K Nearest Neighbors based classifications -

a. Is an instance-based learning or non-generalizing learning. It does not construct


a general internal model
b. Classification is computed from a simple majority vote of the nearest neighbors
of each point
c. New data point is assigned a class which has the most data points in the
nearest neighbors of the point
d. Suited for classification where relationship between features and target classes
is numerous, complex and difficult to understand and yet items in a class tend
to be fairly homogenous on the values of attributes
e. Not suitable if the data is noisy and the target classes do not have clear
demarcation in terms of attribute values

56
Introduction to machine learning

K Nearest Neighbors based classifications -

f. The training data is represented by the scattered data points in the feature
space
g. The color of the data points indicate the class they belong to
h. The grey point is the query point whose class has to be determined

57
Introduction to machine learning

K Nearest Neighbors based classifications -

i. Measuring similarity with the distance between the points using the Euclidean method

j. Other distance measurement methods include Manhattan distance, Minkowski distance,


Mahalanobis distance, Bhattacharya distance etc.

58
Introduction to machine learning

K Nearest Neighbors based classifications -

a. Scikit-learn implements two different nearest neighbors classifiers – K Nearest


Neighbor Classifier and Radius Neighbor Classifier

b. Radius Neighbor Classifier implements learning based on number of neighbors


within a fixed radius r of each training point, where r is a floating point value
specified by the user

c. Determining the optimal K is the challenge in K Nearest Neighbor classifiers. In


general, larger value of K suppresses impact of noise but prone to majority
class dominating

d. Radius Neighbor Classifier may be a better choice when the sampling is not
uniform. However, when there are many attributes and data is sparse, this
method becomes ineffective due to curse of dimensionality

Ref: http://scikit-learn.org/stable/modules/neighbors.html#classification
59
Introduction to machine learning

K Nearest Neighbors based classifications -

e. The Neighbors based algorithm can also be used for regression where the
labels are continuous data and the label of query point can be average of the
labels of the neighbors

f. The approach to find nearest neighbors using distance between the query point
and all other points is called the brute force. Becomes time costly (O(N^2) ) and
inefficient with increase in number of points

g. The KD Tree based nearest neighbor approach helps reduce the time from the order
of N^2 to DNlogN, where D is the number of dimensions. This method becomes
ineffective when D is large due to the curse of dimensionality

60
Introduction to machine learning

K Nearest Neighbors based classifications -

a. The distance formula is highly dependent on how features / attributes /


dimensions are measured.

b. Those dimensions which have a larger possible range of values will dominate the
result of the distance calculation using the Euclidean formula

c. To ensure all the dimensions have similar scale, we normalize the data on all
the dimensions / attributes

d. There are multiple ways of normalizing the data. We will use Z-score
standardization
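A minimal sketch, illustrative rather than the lab solution, of z-score standardization followed by a K Nearest Neighbors classifier; the breast cancer data set and k = 5 are assumptions.

```python
# Z-score standardization + KNN, so no single attribute dominates the distance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

knn = make_pipeline(
    StandardScaler(),                         # z-score: (x - mean) / std
    KNeighborsClassifier(n_neighbors=5),      # majority vote of 5 neighbours
)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```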

61
Introduction to machine learning

K Nearest Neighbors based classifications –

There are many distance calculation formulas in Scikit-learn package-

1. Minkowski distance
2. Euclidean distance
3. Manhattan distance
4. Chebyshev distance
5. Mahalanobis distance
6. Inner product
7. Cosine similarity
8. Pearson correlation
9. Hamming distance
10. Jaccard similarity
11. Edit distance or Levenshtein distance

Ref:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
62
Introduction to machine learning

K Nearest Neighbors based classifications -


Advantages -
1. Makes no assumptions about distributions of classes in feature space
2. Can work for multi classes simultaneously
3. Easy to implement and understand
4. Not impacted by outliers

Disadvantages -
1. Fixing the optimal value of K is a challenge
2. Will not be effective when the class distributions overlap
3. Does not output any model; calculates distances for every new point (lazy learner)
4. Computationally intensive (O(D(N^2))); can be addressed using KD tree algorithms, which
take time to prepare

63
Introduction to machine learning

K Nearest Neighbors based classifications -

Lab- 3 Model the given data to predict type of breast cancer

Description – Sample data is available at


https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

The dataset has 10 attributes listed below


1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
Sol: KNN+Breast+Cancer+Modeling.ipynb

64
Introduction to machine learning

Naïve Bayes Classifier

65
Introduction to machine learning

Naïve Bayes Classifier -

a. Naive Bayes classifiers are linear classifiers based on Bayes’ theorem. The model
generated is probabilistic

b. It is called naive due to the assumption that the features in the dataset are mutually
independent

c. In real world, the independence assumption is often violated, but naive Bayes
classifiers still tend to perform very well

d. Idea is to factor all available evidence in form of predictors into the naïve Bayes rule
to obtain more accurate probability for class prediction

e. It estimates conditional probability which is the probability that something will


happen, given that something else has already occurred. For e.g. the given mail is
likely a spam given appearance of words such as “prize”

f. Being relatively robust, easy to implement, fast, and accurate, naive Bayes classifiers
are used in many different fields
66
Introduction to machine learning

Naïve Bayes Classifier -

Probability - is the number of trials in which an event of interest occurred divided by the
total number of trials. (What is a trial and what is an event?)

If it rained 3 out of 10 days in the past where the days were exactly like today,
the probability it will rain today is 30%

a. In this example, the day is a trial / experiment
b. The event is rain
c. The probability that it will rain is P(A) (A denoting rain) = 3/10
d. Every trial has at least two outcomes (the event occurs, P(A), or does not occur, P(A'))
e. The multiple possible events are mutually exclusive, i.e. they cannot occur simultaneously
f. The total probability in a trial is the sum of the probabilities of all events = P(A) + P(A') = 100% = 1

67
Introduction to machine learning

Naïve Bayes Classifier -

Joint Probability – is the probability of multiple events occurring together (we


are not talking of causality here i.e. one event leads to another). For e.g.
1. probability of drawing a king from a deck of cards is 4/52
2. Probability of drawing a red colour card from a deck of cards is 26/52
3. Probability of drawing a red colour king = 2 / 52

Conditional Probability – it is the probability that an event has occurred (not yet
observed) given another event has occurred. For e.g.
4. given the card drawn is red (an event has occurred)
5. what is the probability it is a king (event not yet observed)?
6. Since the card is red, there are 26 likely values for red
7. Of these 26 possible values we are interested in kings, of which there are 2 (the king of diamonds
and the king of hearts)
8. Thus the conditional probability that the card is a king given red card is 2 /26
9. Compare this with joint probability of red king (2/52).
10. Given an event has occurred, it increases the probability of the other event

68
Introduction to machine learning

Naïve Bayes Classifier -

Posterior probability – Bayes’ rule can be expressed as

P(class | x) = P(x | class) * P(class) / P(x)

The posterior probability, in the context of a classification problem, can be
interpreted as: ”What is the probability that a particular object belongs to a class
given its observed feature values?”

69
Introduction to machine learning

Naïve Bayes Classifier -

Posterior probability – general expression

70
Introduction to machine learning

Naïve Bayes Classifier -

The objective function is to maximize the posterior probability given the training
data

person has diabetes if P(diabetes | xi) ≥ P(not-diabetes | xi),


else classify person as healthy.

71
Introduction to machine learning

Naïve Bayes Classifier (Assumptions) -

One assumption that Bayes classifiers make is that the samples are independent and
identically distributed. Samples are drawn from a similar probability distribution.

Independence means that the probability of one observation does not affect the probability of
another observation (e.g., time series and network graphs are not independent)

An additional assumption of naive Bayes classifiers is the conditional independence of


features. Under this naive assumption, the class-conditional probabilities or (likelihoods) of the
samples can be directly estimated from the training data instead of evaluating all possibilities of
x

Thus, given a d-dimensional feature vector x, the class conditional probability can be calculated
as the product of the individual feature likelihoods: P(x | class) = P(x1 | class) * P(x2 | class) * … * P(xd | class)

72
Introduction to machine learning

Naïve Bayes Classifier -

Joint Probabilities -

a. Imagine you represent all the flight experience you have had till date as the blue area in a
mathematical space. The dimensions of the boxes and circles are immaterial
b. Of these experiences, 20% of the time you experienced a flight delay

(Figure: box labelled Flight data (100%), with region A covering 20% and the remaining 80% outside it)
73
Introduction to machine learning

Naïve Bayes Classifier -

P(flight delay given fog) = P(A ∩ B) / P(B)

(Figure: box labelled Flight data with region A = flight delay (20%) and region B = fog (5%); their overlap (A ∩ B) = flight delay and fog; the non-overlapping parts are "flight delayed, no fog" and "fog but no flight delay")

The more the overlap, the more the occurrences of flight delay and fog; the lesser the
overlap, the lesser the occurrences of flight delay and fog

P(A | B) = P(A ∩ B) / P(B)  ... eq 1

P(A ∩ B) = P(A|B) * P(B) (rearranging the terms in eq 1)

Also, P(A ∩ B) = P(B ∩ A) = P(B|A) * P(A)

Therefore, from eq 1: P(A|B) = P(B|A) * P(A) / P(B)

74
Introduction to machine learning

Naïve Bayes Classifier -

Joint Probabilities (Contd…) -

a. The relationship between dependent events is depicted using Bayes theorem:
Posterior P(A|B) = Likelihood P(B|A) * Prior P(A) / Evidence P(B)

(Figure: the same Flight data diagram, with A = flight delay (20%) and B = fog (5%), annotated with Likelihood, Prior, Posterior and Evidence)

b. The probability of event A given that event B has occurred (fog has formed) depends on
I. The apriori probability of fog occurring whenever there was a flight delay – P(B|A)
II. The apriori probability of flight delay, P(A), which is 20% in the example
III. The apriori probability of a flight facing fog, P(B), which is 5% in the example

c. When it is a matter of deciding the class of an output, such as whether the flight will get
delayed or not, we calculate P(A|B) and P(!A|B) and compare which is higher. Since in both
the denominator is P(B), it is ignored, as it has no influence on which class it will be

d. However, to calculate the updated probability of a class, the denominator P(B) is required

75
Introduction to machine learning

Naïve Bayes Classifier -

a. The following two tables reflect the apriori probabilities of the events A and B. Probabilities
are based on past data of 100 points

T1 Frequency    | Fog Yes | Fog No  | Total
Flight delayed  |       4 |      16 |    20
Not delayed     |       1 |      79 |    80
Total           |       5 |      95 |   100

T2 Likelihood   | Fog Yes | Fog No  | Total
Flight delayed  |  4 / 20 | 16 / 20 |    20
Not delayed     |  1 / 80 | 79 / 80 |    80
Total           | 5 / 100 | 95 /100 |   100

b. The likelihood table (T2) reveals that P(fog = Yes | flight delayed) = 4/20 = .20, indicating that
the probability is 20 percent that there was fog, given the flight was delayed

c. P(flight delay | fog) ∝ P(fog | flight delay) * P(flight delay)

d. P(flight delay | fog) ∝ ( (4/20) * (20 / 100) ) = .04 (maximal probability) (no need to divide by
P(B), the probability of fog, as it is a constant). This is the Naïve Bayes probability.

e. The naïve probability if the events of flight delay and fog were unrelated (false
independence) would be P = ((20 / 100) * (5/100)) = .01. This indicates the importance of Bayes
theorem
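The arithmetic in (b)–(e) is easy to reproduce. The sketch below is illustrative, not deck material; it recomputes the unnormalized Naïve Bayes score for "delayed given fog" from the frequency table T1 and, for completeness, the full posterior after dividing by the evidence.

```python
# Recompute the flight-delay example from the frequency table T1.
delayed_and_fog = 4
delayed_total = 20
n = 100

p_fog_given_delay = delayed_and_fog / delayed_total      # 4/20  = 0.20
p_delay = delayed_total / n                               # 20/100 = 0.20
p_fog = 5 / n                                             # 5/100  = 0.05

# Unnormalized Naive Bayes score (numerator of Bayes' rule):
score_delay_given_fog = p_fog_given_delay * p_delay       # 0.04
# Full posterior, dividing by the evidence P(fog):
posterior = score_delay_given_fog / p_fog                 # 0.80

print(score_delay_given_fog, posterior)
```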

76
Introduction to machine learning
Naïve Bayes Classifier -
Suppose there are multiple factors that could lead to flight delay (as shown in the likelihood
table below)
T2 Likelihood  | Fog Yes | Fog No  | Snag Yes | Snag No | Fatigue Yes | Fatigue No | Passenger Yes | Passenger No | Total
Flight delayed |  4 / 20 | 16 / 20 |   10/20  |  10/20  |    0/20     |   20/20    |     12/20     |     8/20     |   20
Not delayed    |  1 / 80 | 79 / 80 |   14/80  |  66/80  |    8/80     |   71/80    |     23/80     |    57/80     |   80
Total          | 5 / 100 | 95 /100 |  24/100  | 76/100  |   8/100     |  91/100    |    35/100     |    65/100    |  100

a. The probability that the flight will be delayed, given that Fog = Yes, Technical Snag = No, Pilot
Fatigue = No, Passenger related = Yes, is given as –

P(flight delay | Fog ∩ !Technical Snag ∩ !Pilot Fatigue ∩ Passenger delay) =
( P(Fog ∩ !Technical Snag ∩ !Pilot Fatigue ∩ Passenger delay | flight delay) * P(flight delay) ) /
P(Fog ∩ !Technical Snag ∩ !Pilot Fatigue ∩ Passenger delay)

With the naïve (conditional independence) assumption:
P(Fog ∩ !Technical Snag ∩ !Pilot Fatigue ∩ Passenger delay | flight delay) = P(Fog | flight delay) *
P(!Technical Snag | flight delay) * P(!Pilot Fatigue | flight delay) * P(Passenger delay | flight delay)

77
Introduction to machine learning

Naïve Bayes Classifier -

Advantages -
1. Simple , Fast in processing and effective
2. Does well with noisy data and missing data
3. Requires few examples for training (assuming the data set is a true representative of
the population)
4. Easy to obtain estimated probability for a prediction

Disadvantages -
1. Relies on an often-incorrect assumption of independent features
2. Not ideal for data sets with a large number of numerical attributes
3. Estimated probabilities are less reliable in practice than the predicted classes
4. If a rare predictor value is not captured in the ** training set but appears in the test set,
the probability calculation will be incorrect

** For e.g. input record has fog=“yes”, Technical snag = “yes”, Pilot Fatigue = “Yes” and
passenger delay = “yes” .If this combination is not in the training set for delayed flights in
the past then the probability calculation in step “a” on previous slide will become 0!

78
Introduction to machine learning

Naïve Bayes Classifier -

Lab- 4 Model to predict diabetes among Pima Indians

Description – Sample data is available at


https://archive.ics.uci.edu/ml/datasets/Adult

The dataset has 9 attributes listed below


1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

Sol: Naive+Bayesian+Pima+Diabetes+.ipynb

79
Introduction to machine learning

Decision Trees

80
Introduction to machine learning

Decision Trees -

1. Classifiers utilize a tree structure to model relationships among the features and the
potential outcomes

2. Decision trees consist of nodes and branches. Nodes represent a decision function
while branch represents the result of the function. Thus it is a flow chart for deciding
how to classify a new observation:

3. The nodes are of three types, Root Node (representing the original data), Branch
Node (representing a function), Leaf Node (which holds the result of all the previous
functions that connect to it)

81
Introduction to machine learning

Decision Trees -

4. For classification problem, the posterior probability of all the classes is reflected in the
leaf node and the Leaf Node belongs to the majority class.

5. After executing all the functions from Root Node to Leaf Node, the class of a
data point is decided by the leaf node to which it reaches

6. For regression, the average/ median value of the target attribute is assigned to the
query variable

7. Tree creation splits data into subsets and subsets into further smaller subsets. The
algorithm stops splitting data when data within the subsets are sufficiently
homogenous or some other stopping criterion is met

82
Introduction to machine learning

Decision Trees -

1. The decision tree algorithm learns (i.e. creates the decision tree from the data set)
through optimization of a loss function

2. The loss function represents the impurity in the target column. The
requirement here is to reduce the impurity as much as possible at the leaf nodes

3. Purity of a node is a measure of homogeneity in the target column at that node

83
Introduction to machine learning

Decision Trees -
1. There is a bag of 50 balls of red, green, blue, white and
yellow colour respectively
2. You have to pull out one ball from the bag with closed
eyes. If the ball is -
a. Red, you lose the prize money accumulated
b. Green, you can quit
c. Blue, you lose half the prize money but continue
d. White, you lose a quarter of the prize money & continue
e. Yellow, you can skip the question
3. This state, where you have to decide and your decision
can result in various outcomes with equal probability, is
said to be the state of maximum uncertainty
4. If you have a bag full of balls of only one colour, then
there is no uncertainty. You know what is going to
happen. Uncertainty is zero.
5. Thus, the more the homogeneity, the lesser the uncertainty,
and vice versa
6. Uncertainty is expressed as entropy or Gini index

84
Introduction to machine learning

Decision Trees -

Suppose we wish to find if there was any influence of shipping mode, order priority on
customer location. Customer location is target column and like the bag of coloured balls

(Figure: decision tree with root node Sales Data, split first on Shipping Mode into Regular Air and Express Air, and then on Order Priority into Low Priority and High Priority)

When sub branches are created, the total entropy of the sub branches should be
less than the entropy of the parent node. More the drop in entropy, more the
information gained
85
Introduction to machine learning

Decision Trees – Shannon's Entropy


a. Imagine a bag contains 6 red and 4 black balls.

b. Let the two classes Red -> class 0 and Black -> class 1

c. Entropy of the bag (X) will be calculated as per the formula

a. H(X) = - (0.6 * log2( 0.6)) - (0.4 * log2(0.4)) = 0.9709506

d. Suppose we remove all red balls from the bag; then the entropy will be
a. H(X) = - 1.0 * log2(1.0) - 0.0 * log2(0) = 0 (taking 0 * log2(0) = 0) ## Entropy is 0! i.e. Information is 100%
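The two calculations above can be reproduced with a small helper. The sketch below is illustrative and uses the same 6-red / 4-black bag.

```python
# Shannon entropy of a class distribution: H = -sum(p_i * log2(p_i)).
import math

def entropy(counts):
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                       # convention: 0 * log2(0) = 0
            p = c / total
            h -= p * math.log2(p)
    return h

print(entropy([6, 4]))   # 6 red, 4 black   -> 0.9709506...
print(entropy([0, 4]))   # only black balls -> 0.0
```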

86
Introduction to machine learning
Machine Learning (Decision Tree Classification)
Decision Trees -

(Figure: the Shipping Mode tree annotated with entropy and information gain at each level)

Root: Shipping Mode node with 1000 records; entropy E0 = maximum entropy, say 1; information gain so far = 0

First split: Regular Air (700 records, entropy E1a) and Express Air (300 records, entropy E1b)
Weighted entropy after the split: E1 = (E1a * 700/1000) + (E1b * 300/1000); information gain = E0 – E1

Second split on Priority: Low Priority (500, E2a) and High Priority (200, E2b) under Regular Air;
Low Priority (100, E2c) and High Priority (200, E2d) under Express Air
E2 = (E2a * 500/700) + (E2b * 200/700) + (E2c * 100/300) + (E2d * 200/300); information gain = E1 – E2

The tree will stop growing when a stop criterion for the splitting is reached, which could be -
a. The tree has reached a certain pre-fixed depth (longest path from root node to leaf node)
b. The tree has achieved the maximum number of nodes (tree size)
c. All attributes to split on are exhausted
d. A leaf node on splitting would have fewer than a predefined number of data points

87
Introduction to machine learning

Decision Trees - Information Gain using Entropy

Information Gain = reduction in entropy = H(parent) – Σ ( nk / n ) * H(child k), summed over the child nodes k

88
Introduction to machine learning

Decision Trees - Information Gain using Gini index

Information Gain = reduction in Gini index = Gini(parent) – Σ ( nk / n ) * Gini(child k), summed over the child nodes k

89
Introduction to machine learning

Decision Trees -

Common measures of purity

1. Gini index – is calculated by subtracting the sum of the squared probabilities of each
class from one
a. Uses squared proportion of classes
b. Perfectly classified, Gini Index would be zero
c. Evenly distributed would be 1 – (1/# Classes)
d. You want a variable split that has a low Gini Index
e. Used in CART algorithm

2. Entropy –
a. Favors splits with small counts but many unique values
b. Weights probability of class by log(base=2) of the class probability
c. A smaller value of Entropy is better. That makes the difference between the parent node’s
entropy larger
d. Information Gain is the Entropy of the parent node minus the entropy of the child nodes

90
Introduction to machine learning

Decision Trees – Gini , Entropy , Misclassification Error

Note: Misclassification Error is not used in Decision Trees

91
Introduction to machine learning

Decision Trees - Algorithms

1. ID3 (Iterative Dichotomiser 3) – developed by Ross Quinlan. Creates a multi


branch tree at each node using greedy algorithm. Trees grow to maximum
size before pruning

2. C4.5 succeeded ID3 by overcoming limitation of features required to be


categorical. It dynamically defines discrete attribute for numerical attributes.
It converts the trained trees into a set of if-then rules. Accuracy of each rule
is evaluated to determine the order in which they should be applied

3. C5.0 is Quinlan’s latest version and it uses less memory and builds smaller
rulesets than C4.5 while being more accurate

4. CART (Classification & Regression Trees) is similar to C4.5 but it supports


numerical target variables and does not compute rule sets. Creates binary
tree. Scikit uses CART

92
Introduction to machine learning

Decision Trees -

Advantages -
1. Simple, fast in processing and effective
2. Does well with noisy data and missing data
3. Handles numeric and categorical variables
4. Interpretation of results does not require mathematical or statistical knowledge

Disadvantages -
1. Often biased towards splits on features that have a large number of levels
2. May not be optimal, as modelling some relationships with axis-parallel splits is not
optimal
3. Small changes in training data can result in large changes to the logic
4. Large trees can be difficult to interpret

93
Introduction to machine learning

Decision Trees - Preventing overfitting through regularization

1. Decision trees do not assume a particular form of relationship between the


independent and dependent variables unlike linear models for e.g.

2. A DT is a non-parametric algorithm, unlike linear models where we supply
the input parameters

3. If left unconstrained, they can build tree structures to adapt to the training
data leading to overfitting

4. To avoid overfitting, we need to restrict the DT’s freedom during the tree
creation. This is called regularization

5. The regularization hyperparameters depend on the algorithms used

94
Introduction to machine learning

Decision Trees - Regularization parameters

1. max_depth – Is the maximum length of a path from root to leaf (in terms of
number of decision points. The leaf node is not split further. It could lead to
a tree with leaf node containing many observations on one side of the tree,
whereas on the other side, nodes containing much less observations get
further split

2. min_sample_split - A limit to stop further splitting of nodes when the number


of observations in the node is lower than this value

3. min_sample_leaf – Minimum number of samples a leaf node must have.


When a leaf contains too few observations, further splitting will result
in overfitting (modeling of noise in the data).

95
Introduction to machine learning

Decision Trees - Regularization parameters (Contd…)

4. min_weight_fraction_leaf – Same as min_sample_leaf but expressed in


fraction of total number of weighted instances

5. max_leaf_nodes – maximum number of leaf nodes in a tree

6. max_feature_size - max number of features that are evaluated for splitting


each node
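In scikit-learn these hyperparameters appear with slightly different spellings (max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes, max_features) on DecisionTreeClassifier. The sketch below is illustrative; the specific values are assumptions rather than recommendations and would normally be tuned.

```python
# A regularized decision tree in scikit-learn; hyperparameter values are
# illustrative and would normally be tuned (e.g. with GridSearchCV).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

tree = DecisionTreeClassifier(
    max_depth=4,            # limit path length from root to leaf
    min_samples_split=20,   # do not split nodes with fewer observations
    min_samples_leaf=10,    # every leaf must keep at least this many samples
    max_leaf_nodes=15,      # cap on the number of leaves
    max_features=None,      # consider all features at each split
    random_state=42,
)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test  accuracy:", tree.score(X_test, y_test))
```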

96
Introduction to machine learning

Decision Tree -

Lab- 5 Model to predict potential credit defaulters

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: Regularization+Credit+Decision+Tree.ipynb

97
Introduction to machine learning

Ensemble Methods

98
Introduction to machine learning

Ensemble Methods -

To combine the predictions of several base estimators built with a given learning
algorithm in order to improve generalizability / robustness over a single estimator

Two families of ensemble methods are usually distinguished:

1. Averaging methods: the driving principle is to build several estimators independently and
then to average / vote their predictions. On average, the combined estimator is usually
better than any single base estimator because its variance is reduced.
E.g. Bagging methods, Forests of randomized trees, ...

2. Boosting methods: base estimators are built sequentially and one tries to reduce the bias
of the combined estimator. The motivation is to combine several weak models to produce
a powerful ensemble. E.g. AdaBoost, Gradient Tree Boosting, ...

99
Introduction to machine learning

Ensemble Methods -

In the final stage of voting, we essentially have a combined surface resulting from the
individual surfaces

Source: https://github.com/MenuPolis/MLT/wiki/Bagging
10
Introduction to machine learning

Ensemble Methods – Averaging method - Bagging (Bootstrap Aggregation) :

1. Designed to improve the stability and accuracy of classification and regression models

2. It reduces variance errors and helps to avoid overfitting

3. Can be used with any type of machine learning model, mostly used with Decision Tree

4. Uses sampling with replacement to generate multiple samples of a given size. Sample may
contain repeat data points

5. For large sample size, sample data is expected to have roughly 63.2% ( 1 – 1/e) unique
data points and the rest being duplicates

6. For classification, bagging is used with voting to decide the class of an input, while for
regression the average or median value is calculated
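A minimal scikit-learn sketch of bagged decision trees, illustrative rather than the lab solution; the data set, number of estimators and sample fraction are assumptions. Recent scikit-learn versions name the base learner parameter estimator (older versions call it base_estimator).

```python
# Bagging: many trees trained on bootstrap samples, combined by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner
    n_estimators=100,                    # number of bootstrap samples / trees
    max_samples=1.0,                     # each sample as large as the training set
    bootstrap=True,                      # sampling with replacement
    random_state=42,
)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```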

10
Introduction to machine learning

Ensemble Methods – Averaging method - Bagging (Bootstrap Aggregation) :

(Figure: bagging workflow)

Re-sampling is done for every classifier using a random function; for large n, roughly 63.2%
unique samples are likely to be selected

The algorithm used to generate the classifiers could be Decision Tree, Naïve
Bayes etc.

K classifiers are created in parallel and independently on their respective
training data

Voting could be simple or weighted

Source: https://link.springer.com/article/10.1007/s13721-013-0034-x

10
Introduction to machine learning

Ensemble Learning – Bagging:

Lab- 6 Improve defaulter prediction of the decision tree using bagging


ensemble technique

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: Bagging+Credit+Decision+Tree.ipynb

10
Introduction to machine learning

Ensemble Methods – Boosting Method – AdaBoosting :

1. Similar to bagging, but the learners are grown sequentially; except for the first, each
subsequent learner is grown from previously grown learners

2. If the learner is a Decision Tree, each of the trees can be small, with just a few terminal
nodes (determined by the parameter d supplied )

3. During voting higher weight is given to the votes of learners which perform better in
respective training data unlike Bagging where all get equal weight

4. Boosting slows down learning (because it is sequential) but the model generally performs
well

10
Introduction to machine learning

Ensemble Methods – Boosting method - AdaBoosting:

(Figure: AdaBoost workflow)

Training data is drawn from the base data with focus on instances which were
incorrectly classified by the earlier model (if any)

K similar classifiers are created in sequence, each with its respective training
data, with focus on addressing the misclassified data rather than the usual
cost functions

Voting could be simple or weighted

It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher
weights for incorrectly classified instances

Source: https://link.springer.com/article/10.1007/s13721-013-0034-x

10
Introduction to machine learning

Ensemble Methods – Boosting Method – AdaBoosting :

7. Two prominent boosting algorithms are AdaBoost, short for Adaptive Boosting and Gradient
Descent Boosting

8. In AdaBoost, the successive learners are created with a focus on the ill fitted data of the
previous learner

9. Each successive learner focuses more and more on the harder to fit data i.e. their residuals
in the previous tree
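A minimal scikit-learn sketch of AdaBoost over shallow trees, illustrative rather than the lab solution; the stump depth, learning rate and number of estimators are assumptions (and, as above, older scikit-learn versions spell the base learner parameter base_estimator).

```python
# AdaBoost: shallow trees built sequentially, each focusing on the
# instances the previous trees misclassified; weighted voting at the end.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```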

10
Introduction to machine learning

Ensemble Methods – Boosting Method – AdaBoosting :


Adapting weights with focus on erroneously classified instances:

1. Initialize weights: equal weights to all instances

2. Generate the first classifier with equal focus on all instances

3. Total up the weights of all error instances and express it as a ratio to the total weights

4. Check whether the error ratio is > 50%

5. Calculate the predictor weight (i.e. the weight of the classifier)

6. Assign new weights to the instances that were misclassified; otherwise keep the weights the
same

7. Renormalize the weights across all the instances and fit the next classifier

8. For a test instance, use weighted voting to identify the class

10
Introduction to machine learning

Ensemble Learning – AdaBoosting:

Lab- 7 Improve defaulter prediction of the decision tree using Adaboosting

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: Adaboost+Credit+Decision+Tree.ipynb

10
Introduction to machine learning

Ensemble Methods – Boosting Method – Gradient Descent Boosting :

1. Each learner is fit on a modified version of the original data (the original data is replaced with the x
values and the residuals from the previous learner)

2. By fitting new models to the residuals, the overall learner gradually improves in areas where
residuals are initially high
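A minimal scikit-learn sketch of gradient boosting, illustrative rather than the lab solution; the learning rate, tree depth and number of stages are assumptions.

```python
# Gradient boosting: each new shallow tree is fit to the residuals
# left by the ensemble built so far.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

gbm = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting stages
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=3,         # small trees, as described in the slides
    random_state=42,
)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))
```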

10
Introduction to machine learning

Ensemble Methods – Boosting Method – Gradient Descent Boosting :

(Figure: sequence of fitted surfaces)

The first learner results in residuals (dots that fall above and below the
surface). The result (red) is the same as the first classifier

The next classifier focuses on the residuals of the first classifier to
reclassify them as correctly as possible

The combined effect of this surface and the previous classifier
surface is shown in red

The third learner focuses on the residuals of the previous classifier

The combined result of the new surface with the previous surface
is shown in red

11
Introduction to machine learning

Ensemble Learning – Gradient Boosting:

Lab- 8 Improve defaulter prediction of the decision tree using Gradient


boosting

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: GRB+Credit+Decision+Tree.ipynb

11
Introduction to machine learning

Ensemble Methods – Random Forest:

1. Each tree in the ensemble is built from a sample drawn with replacement (bootstrap) from
the training set

2. In addition, when splitting a node during the construction of a tree, the split that is chosen is
no longer the best split among all the features

3. Instead, the split that is picked is the best split among a random subset of the features

4. As a result of this randomness, the bias of the forest usually slightly increases (with respect
to the bias of a single non-random tree)

5. Due to averaging, its variance decreases, usually more than compensating the increase in
bias, hence yielding overall a better result

Source: scikit-learn user guide, chapter 3, page 231
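A minimal scikit-learn sketch of the above, reusing the credit.csv setup from the earlier labs (the 'default' target column name is an assumption):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("credit.csv")
y = df["default"]                                    # hypothetical target column name
X = pd.get_dummies(df.drop(columns="default"), drop_first=True)

rf = RandomForestClassifier(n_estimators=300,        # number of bootstrapped trees
                            max_features="sqrt",     # random feature subset considered at each split
                            random_state=42)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())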

11
Introduction to machine learning

Ensemble Methods - Random Forest:

1. Used with decision trees. Different trees are created by providing different sub-features from the
feature set to the tree-creating algorithm. The split criterion is entropy or the Gini index

Figure: from N instances and the original number of features, randomly selected sub-feature sets are
fed to independent trees created in parallel

11
Introduction to machine learning

Ensemble Learning – Random Forest:

Lab- 9 Improve defaulter prediction of the decision tree using Random Forest

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: RF+Credit+Decision+Tree.ipynb

11
Introduction to machine learning

Ensemble Methods – Stacking:

1. Similar to bagging, but several different models are applied to the original data

2. The weight for each model is determined based on how well it performs on the given input data

3. Similar classifiers usually make similar errors (as in bagging), so forming an ensemble with
   similar classifiers may not improve the classification rate

4. Presence of a poorly performing classifier may cause deterioration in the overall performance

5. Similarly, even the presence of a classifier that performs much better than all of the other
   available base classifiers may cause degradation in the overall performance

6. Another important factor is the amount of correlation among the incorrect classifications made
   by each classifier

7. If the constituent classifiers tend to misclassify the same instances, then combining their
   results will have no benefit

8. In contrast, a greater amount of independence among the classifiers can result in errors by
   individual classifiers being overlooked when the results of the ensemble are combined
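A minimal sketch of stacking heterogeneous base learners with scikit-learn (shown on a synthetic data set so it is self-contained; the meta-model learns how much to trust each base model):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),   # weighs the base models' predictions
    cv=5)                                                 # meta-model trained on out-of-fold predictions
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))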

11
Introduction to machine learning

Ensemble Methods – Stacking:

Source:
http://pubs.rsc.org/-/content/articlelanding/2014/mb/c4mb00410h/unauth#!divAbstract
11
Introduction to machine learning

Ensemble Learning – Stacking:

Lab- 10 Improve defaulter prediction of the decision tree using Stacking

Description – Sample data is available at local file system as credit.csv

The dataset has 16 attributes described at


https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
or in the notes page of this slide

Sol: Stacking+Credit+Decision+Tree.ipynb

11
Introduction to machine learning

Machine Learning Algorithms Comparison:


Attributes compared (in order): problem type; results interpretable by you?; easy to explain the
algorithm to others?; average predictive accuracy; training speed; prediction speed; amount of
parameter tuning needed (excluding feature selection); performs well with a small number of
observations?; handles lots of irrelevant features well (separates signal from noise)?;
automatically learns feature interactions?; gives calibrated probabilities of class membership?;
parametric?; features might need scaling?

KNN – either; yes; yes; lower; fast; depends on n; minimal; no; no; no; yes; no; yes

Linear regression – regression; yes; yes; lower; fast; fast; none (excluding regularization); yes;
no; no; N/A; yes; no (unless regularized)

Logistic regression – classification; somewhat; somewhat; lower; fast; fast; none (excluding
regularization); yes; no; no; yes; yes; no (unless regularized)

Naive Bayes – classification; somewhat; somewhat; lower; fast (excluding feature extraction); fast;
some (for feature extraction); yes; yes; no; no; yes; no

Decision trees – either; somewhat; somewhat; lower; fast; fast; some; no; no; yes; possibly; no; no

Random Forests – either; a little; no; higher; slow; moderate; some; no; yes (unless noise ratio is
very high); yes; possibly; no; no

AdaBoost – either; a little; no; higher; slow; fast; some; no; yes; yes; possibly; no; no

Source: http://www.dataschool.io/comparing-supervised-learning-algorithms/
11
Introduction to machine learning
Support Vector Machines
1. Known as the maximum-margin hyperplane approach: find the linear model with the maximum margin.
Unlike the linear classifiers, the objective is not minimizing the sum of squared errors but finding
a line/plane that separates two or more groups with maximum margins
Figure: the maximum-margin hyperplane, the margin around it and the support vectors
Image source: http://stackoverflow.com/questions/9480605/what-is-the-relation-between-the-number-of-support-vectors-and-training-data-and

11
Introduction to machine learning
Support Vector Machines

Image Source : https://dzone.com/articles/support-vector-machines-tutorial

1. The first line does separate the two sets but is too close to both the red and green data points
2. Chances are that when this model is put in production, variance in both clusters' data may force
   some data points onto the wrong side
3. The second line doesn't look so vulnerable to that variance. The nearest points from the two
   clusters define the margin around the line and are the support vectors
4. SVMs try to find the second kind of line, where the line is at the maximum distance from both
   clusters simultaneously
12
Introduction to machine learning
Support Vector Machines

1. The distance from a point (x0, y0) to the line Ax + By + c = 0 is |A·x0 + B·y0 + c| / sqrt(A^2 + B^2)

   In the figure, the separating hyperplane and the two margin hyperplanes are:
   H1: w·x + b = 1
   H0: w·x + b = 0
   H2: w·x + b = -1

   d = distance between H0 and H1 = |w·x + b| / ||w|| = 1 / ||w||

   The total distance between H1 and H2 is thus 2/||w||

2. Think in terms of a multi-dimensional space. The SVM algorithm has to find the combination of
   weights across the dimensions such that the hyperplane has the maximum possible margin around it
3. All the predictor variables have to be numeric and scaled

12
Introduction to machine learning
Support Vector Machines Allowing Errors

Image Source: https://dzone.com/articles/support-vector-machines-tutorial

1. Data in real world is typically not linearly separable.


2. There will always be instances that a linear classifier can’t get right
3. SVM provides a complexity parameter C: a trade-off between a wide margin with some errors and a
   tight margin with minimal errors. As C increases, the margin becomes tighter (a small sketch of
   this trade-off follows)
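A small sketch of the C trade-off on synthetic, overlapping data (the C values are chosen only for illustration):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=7)
for C in (0.01, 1, 100):                 # small C: wide, tolerant margin; large C: tight margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print("C =", C,
          "| support vectors:", clf.n_support_.sum(),
          "| training accuracy:", round(clf.score(X, y), 2))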

12
Introduction to machine learning
Support Vector Machines Linearly Non Separable Data

Figure (axes: x1^2 and x2^2) – Image Source: https://dzone.com/articles/support-vector-machines-tutorial

1. When data is not linearly separable, SVM uses kernel trick to make it linearly separable
2. This concept is based on Cover’s theorem “given a set of training data that is not linearly
separable, with high probability it can be transformed into a linearly separable training set
by projecting it into a higher-dimensional space via some non-linear transformation”
3. In the figure above, replace x1 with x1^2, x2 with x2^2 and create a third dimension
   x3 = sqrt(2·x1·x2)

12
Introduction to machine learning
Support Vector Machines Linearly Non Separable Data

Image Source : https://dzone.com/articles/support-vector-machines-tutorial

1. Using the kernel trick, the data points are projected to a higher dimensional space
2. The data points become relatively more easily separable in the higher dimension space
3. An SVM boundary can now be drawn between the data sets with a given complexity

12
Introduction to machine learning

Support Vector Machines Basic Idea

1. Suppose we are given training data {(x1, y1), ..., (xn, yn)} ⊂ X × R, where X denotes
   the space of the input patterns (e.g. X = R^d)
2. Goal is to find a function f(x) that has at most ε deviation from the actually obtained
targets yi for all the training data, and at the same time is as flat as possible
3. In other words, we do not care about errors as long as they are less than ε, but will
not accept any deviation larger than this
4. f can take the form f(x) = (w, x) + b with w ∈ X, b ∈ R

5. Flatness means that one seeks a small w. One way to ensure this is to minimize ||w||^2 = (w, w)

12
Introduction to machine learning

Support Vector Machines Basic Idea

6. The problem can be represented as a convex optimization problem
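The slide's original formula image is not available here; the standard ε-insensitive formulation it appears to describe is stated below for reference:

minimize     (1/2) ||w||^2
subject to   yi − (w, xi) − b ≤ ε
             (w, xi) + b − yi ≤ ε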

7. In the first picture, ||w||^2 is not minimized and the third constraint is not satisfied: taking
   the marked point as the x value, yi − (w, xi) − b is < ε (the difference between the green dot and
   the line), but (w, xi) + b − yi (the difference between the line and the red dot) is not < ε
8. In second picture, all three constraints are met
9. Sometimes, it may not be possible to meet the constraint due to data points not being
linearly separable so we may want to allow for some errors.

12
Introduction to machine learning

Support Vector Machines Basic Idea

10. We introduce slack variables ξi, ξi* to cope with the otherwise infeasible constraints of the
    optimization problem; this is known as the soft margin formulation

11. The slack terms allow some errors, i.e. data points may lie outside the ε margins as long as
    they stay within ε + ξi (or ε + ξi*)
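The corresponding soft-margin problem (again the standard formulation, stated here because the original formula image is not available) is:

minimize     (1/2) ||w||^2 + C Σ (ξi + ξi*)
subject to   yi − (w, xi) − b ≤ ε + ξi
             (w, xi) + b − yi ≤ ε + ξi*
             ξi, ξi* ≥ 0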

12
Introduction to machine learning
Support Vector Machines Kernel Functions

1. SVM libraries come packaged with some standard kernel functions such as
polynomial, radial basis function (RBF), and Sigmoid

2. For degree-d polynomials, the polynomial kernel looks like K(x, y) = (x·y + c)^d,
   where x and y are input vectors in the lower dimension space, c is a user specified
   constant (usually 1) and x·y is their inner product. K equals the inner product of the
   images of x and y in the higher dimension space

3. The RBF (Radial Basis Function) kernel on two samples x and x' is represented as
   K(x, x') = exp(−||x − x'||^2 / (2σ^2))

4. It tends towards 0 as the distance between x and x' increases (e^−∞ becomes 0)
   and becomes 1 when x = x', because x − x' = 0 and e^0 is 1
12
Introduction to machine learning
Support Vector Machines Kernel Functions

5. The sigmoid kernel looks like K(x, y) = tanh(γ·x·y + r)

6. The linear kernel is simply the inner product, K(x, y) = x·y, i.e. a linear equation in the
   original feature space

Figure panels: feature space, linear kernel, RBF kernel, polynomial kernel and sigmoid kernel
Source: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
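A quick numeric check of these kernel formulas using scikit-learn's pairwise helpers (toy vectors chosen arbitrarily; the kernels are evaluated without ever constructing the higher dimensional space):

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel

x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 1.0]])
print(linear_kernel(x, y))                          # x·y = 4
print(polynomial_kernel(x, y, degree=2, coef0=1))   # (gamma*(x·y) + 1)^2, gamma defaults to 1/n_features
print(rbf_kernel(x, y, gamma=0.5))                  # exp(-0.5 * ||x - y||^2) = exp(-1)
print(sigmoid_kernel(x, y, gamma=0.5, coef0=0))     # tanh(0.5 * (x·y)) = tanh(2)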

12
Introduction to machine learning
Machine Learning (Support Vector Machines)

Strengths:
- Very stable as it depends on the support vectors only; not influenced by any other data point,
  including outliers
- Can be adapted to classification or numeric prediction problems
- Capable of modelling relatively more complex patterns than nearly any algorithm
- Makes no assumptions about the underlying data sets

Weaknesses:
- Computationally intensive
- Prone to overfitting the training data
- Generally treated as a black-box model

13
Introduction to machine learning

Ensemble Learning – OCR Support Vector Machine

Lab- 11 Handwritten character recognition

Description – Sample data is available at local file system as Letterdata.csv

The dataset is described at


https://archive.ics.uci.edu/ml/datasets/letter+recognition

Sol: OCR-SVM.ipynb

13
Introduction to machine learning
Machine Learning (Support Vector Machines)

The model has predicted the characters correctly 84% of the time

13
Introduction to machine learning

Artificial Neural Networks

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
1. Artificial Neural Network (ANN) models relationships between a set of input data and
output data.

2. ANN models are based on the observed behaviour of neural nets in our brains

3. Just as the brain uses a network of interconnected neurons to process input signals in parallel
   and trigger a response, an ANN uses a network of interconnected nodes

4. It will help to understand how biological neurons function (conceptual understanding)


Ref: http://neuroscience.uth.tmc.edu/s1/introduction.html
1. Resting potential
2. Action potential
3. Threshold for action potential
4. Synapses and synaptic transmission

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
5. Artificial Neural Network (ANN) models relationships between a set of input data and
output data.

6. Natural Neuron

7. Abstract neuron

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
8. The processing element of an ANN is called a node, representing the artificial neuron

9. Each ANN is composed of a collection of nodes grouped in layers. A typical structure


is shown

10. The initial layer is the input layer and the last layer is the output layer. In between we
have the hidden layers

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
11. A given node will fire and feed a signal to subsequent nodes in the next layer only if the
    step function it implements reaches a threshold
12. In ANNs, use of the sigmoid function is more common than a step function

Figure: the node fires output ai once the summed input crosses the threshold

13
Introduction to machine learning
Machine Learning (Artificial Neural Network)
13. The summation function g can be implemented in many ways. It does not have to be
mathematical addition of the inputs

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
14. The ANN generic architecture

15. Neural net consists of multiple layers. It has two layers on the edge, one is input layer
and the other is output layer.

16. In between input and output layer, there can be many other layers. These layers are
called hidden layers

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
17. The input layer is passive, does no processing, only holds
the input data to supply it to the first hidden layer

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
18. The input layer is passive, does no processing, only holds the input data to supply
it to the first hidden layer

Hidden Layer Node 1 receives inputs X1, X2, X3 with weights W11, W12, W13:

ACC = X1*W11 + X2*W12 + X3*W13
N1Output = Sigmoid(ACC)

19. The weights feeding a given hidden node are fixed (once learned), and all the nodes in the
    hidden layer have their own weights

20. The output of each node is fed to output layer nodes or another set of hidden nodes
in another hidden layer
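A short numpy sketch of the single-node computation above (the input values and weights are arbitrary illustrative numbers):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])          # X1, X2, X3
w = np.array([0.4, 0.1, -0.6])          # W11, W12, W13 (illustrative values)
acc = np.dot(x, w)                      # ACC = X1*W11 + X2*W12 + X3*W13
print("N1 output:", sigmoid(acc))       # value passed on to the next layer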

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
21. The output value of each hidden node is sent to each output node in the output layer

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)

Output Node 1 receives the hidden-layer outputs X1, X2, X3, X4 with weights WO11, WO12, WO13, WO14:

ACC = X1*WO11 + X2*WO12 + X3*WO13 + X4*WO14
N1Output = Sigmoid(ACC)

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
22. In a binary output ANN, the output node acts like a perceptron classifying the input
into one of the two classes

23. Examples of such ANN applications would be to detect fraudulent transaction,


whether a customer will buy a product given the attributes etc.

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
24. We can have an ANN with multiple output nodes, where a given output node may or may not get
    triggered given the input and the weights

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)
27. The weights required to make a neural network carry out a particular task are found
by a learning algorithm, together with examples of how the system should operate

28. The examples in vehicle identification could be a large Hadoop file of several million sample
    segments such as bicycle, motorcycle, car, bus etc.

29. The learning algorithms calculate the appropriate weights for each classification for all
nodes at all the levels in the network

30. If we consider each input as a dimension then ANN labels different regions in the n-
dimensional space. In our example one region is cars, other region is bicycle

Car

Bicycle

Image Source: https://en.wikipedia.org/wiki/K-d_tree#/media/File:3dtree.png

14
Introduction to machine learning
Machine Learning (Artificial Neural Network)

Strengths:
- The main advantage of ANN models over statistical methods is that the latter assume linear
  relationships and/or normal distributions, while reality is non-linear and non-normal; the ANN
  model is thus able to conform to the real world
- ANN is a non-parametric model and so eliminates the error in parameter estimation, while most
  statistical methods (MLR, etc.) are parametric models that need a stronger statistical background
- Neural networks are quite simple to implement (you do not need a good linear algebra solver as,
  for example, for SVMs)

Weaknesses:
- ANN does not provide information about the relative significance of the various parameters
- It is a big black box

14
Introduction to machine learning

15
Introduction to machine learning
Machine Learning (Artificial Neural Network)
Lab-5 Estimate concrete strength – Model Improvement

15
Introduction to machine learning

Modelling Errors

15
Introduction to machine learning

Modelling Errors
All models are impacted by three types of errors which reduce their predicting power.
1. Variance errors
2. Bias error
3. Random errors

Variance errors
4. Caused by the random factors that impact the process that generates the data
5. The population / universe, representing the infinite data points, continuously jiggles
6. A sample drawn from such a universe is a snapshot of a small part of the universe
7. A model based on one sample will perform differently on different samples
8. Variance errors increase with the number of attributes in the model, due to the increased degrees
   of freedom for the data points to wriggle in

Bias errors
9. Caused by our selection of the attributes and our interpretation of their influence on each other
10. The real model in the universe / population may have many more attributes and the attributes
interacting in different ways not reflected in our model

Random errors
11. Caused by unknown factors. They cannot be modelled

15
Introduction to machine learning

Modelling Errors – Variance error

Figure: the universe / population at time T1 and at time T2, and a sample (snapshot) drawn from it

15
Introduction to machine learning

Modelling Errors – Visual demo of variance in training and test data

Figure: a sample data set (Analytics Base Table), three random training sets drawn from the ABT,
and three random test sets drawn from the ABT

15
Introduction to machine learning

Modelling Errors

1. We have to find the right attributes and the right number of dimensions such that the total
   effect of bias and variance (indicated by the black curve) is minimized

2. The gap between the variance curve and the total error curve reflects the presence of random
   errors in the model

15
Introduction to machine learning

Fitness of a Model
Generalize
1. Models are expected to perform well (meet minimum accuracy thresholds) in production (real world data)
2. But data in the real world is under continuous flux / jiggle
3. Models have to perform in this context of continuous jiggle. Such models are said to generalize well
4. For models to generalize well, they should be neither underfit nor overfit on the training data

Underfit models
5. Models that are oversimplified, i.e. models in which the independent and dependent attributes interact in
   a simple linear way (can be expressed in a linear form, e.g. y = mx + c)
6. The relationship may really have required a quadratic form such as y = m1*x + m2*x^2 + c
7. Underfit models result in errors as they fail to capture the complex interactions among the attributes in
the real world
8. These models will not generalize in the real world

Overfit models
9. Models that perform very well (sometimes with zero errors) in training data
10. Are complex polynomial surfaces that twist and turn in the feature space to cleanly separate the classes
11. Adjust to the variance in the training data i.e. try to adjust to the positions of the data points though
those positions are not the expected values of the data points (mean of the jiggle)
12. These models adapt to the variance error in the data set and will not generalize in the real world
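A small sketch of this on synthetic data: train vs test performance as the polynomial degree grows (the degrees are picked only to illustrate underfit, reasonable fit and overfit):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for degree in (1, 4, 15):               # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print("degree", degree,
          "| train R^2:", round(model.score(X_tr, y_tr), 2),
          "| test R^2:", round(model.score(X_te, y_te), 2))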

15
Introduction to machine learning

Figure: underfit, good fit and overfit decision boundaries

In overfit models, the model absorbs the noise (variance) in the data points, achieving almost
100% accuracy in the controlled environment. But when used in production, where the data points
have a different variance, the model will perform poorly
15
Introduction to machine learning

Fitness of a Model

Under Fit – the interaction between the attributes and the target dimension is oversimplified in
the model; such models fail to generalize

Good Fit – the right fit; will generalize relatively much better

Over Fit – the interaction between the attributes and the target dimension is overly complex in
the model; such models fail to generalize

15
Introduction to machine learning
Model performance measures
a. Confusion Matrix – A 2X2 tabular structure reflecting the performance of the model in four blocks
Confusion Matrix Predicted Positive Predicted Negative

Actual Positive True Positive False Negative

Actual Negative False Positive True Negative

b. Accuracy – How accurately / cleanly does the model classify the data points. The fewer the false
   predictions, the higher the accuracy

c. Sensitivity / Recall – How many of the actual Positive data points are identified as Positive by
   the model. Remember, False Negatives are those data points which should have been identified as
   Positive

d. Specificity – How many of the actual Negative data points are identified as negative by the model

e. Precision – Among the points identified as Positive by the model, how many are really Positive

16
Introduction to machine learning
Receiver Operating Characteristics (ROC) Curve

A technique for visualizing classifier performance


a. It is a graph between TP rate and FP rates
I. TP rate = TP / total positive
II. FP rate = FP / total negative
b. ROC graph is a trade off between benefits (TP) and
costs (FP)
c. The point (0,1) represents perfect classification (e.g. classifier D)
   I. TP rate = 1 and FP rate = 0
d. Classifiers plotted close to the Y axis and low (nearer to the X axis) are conservative models,
   strict in classifying positives (low TP rate but also low FP rate)
e. Classifiers towards the top right are liberal in classifying positives, hence higher TP rate but
   also higher FP rate (a small sketch of computing these rates follows)
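A minimal sketch of computing the points of an ROC curve from predicted scores (the labels and scores below are made up for illustration):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]    # e.g. predicted probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FP rates:", fpr)
print("TP rates:", tpr)
print("AUC     :", roc_auc_score(y_true, y_score))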

16
Introduction to machine learning
To explain F Statistics

F Test, also known as ANOVA (Analysis of Variance):

Imagine the two samples shown (panels A and B) were the data collected

If the two data sets are not similar (panels C and D), most probably they have been generated by
two different processes

16
Introduction to machine learning

Thank You

16
