PDF of Artificial Intelligence & Machine Learning Lab 2
Published by
Director
Institute of Distance and Open Learning, University of Mumbai, Vidyanagari, Mumbai - 400 098.
Unit I
1. Artificial Intelligence Lab
Unit II
2. Introduction to Python Programming: Learn the Different Libraries
Unit III
3. Supervised Learning
4. Supervised Learning
Unit IV
5. Features and Extraction
6. Classifying Data Using Support Vector Machines (SVMs): SVM-RBF Kernels
Unit V
7. Unsupervised Learning: K-Means Clustering Algorithm
8. Unsupervised Learning: K-Medoid Clustering Algorithm
Unit VI
9. Classifying Data Using Support Vector Machines (SVMs): SVM-RBF Kernels
Unit VII
10. Decision Tree
Unit VIII
11. Boosting Algorithms
12. Examples
Unit IX
13. XGBoost
14. Deployment of Machine Learning Algorithms
*****
SYLLABUS
*****
UNIT I
1
ARTIFICIAL INTELLIGENCE LAB
Unit Structure
1.0 Objectives
1.1 Introduction
1.2 Logic Programming with PROLOG
1.3 Relationships among Objects and Properties Of Objects
1.4 Problem solving
1.4.1 Water jug problem
1.4.2 Tic-Tac-Toe problem
1.4.3 8-Puzzle Problem
1.5 Summary
1.6 References
1.7 Bibliography
1.8 Unit End Exercises
1.0 OBJECTIVES
After reading this chapter students will be able to:
Explain the structure of PROLOG
Describe logic programming in PROLOG
Understand objects and their working principles in PROLOG
Write Artificial Intelligence applications and problem-solving programs using PROLOG
1.1 INTRODUCTION
PROLOG (Programming in Logic) was designed in the early 1970s by Alain Colmerauer and a team of researchers.
It uses a subset of predicate logic and draws its structure from the theoretical work of earlier logicians such as Herbrand (1930) and Robinson (1965) on the automation of theorem proving.
PROLOG supports:
● Natural Language Understanding
● Formal logic and associated forms of programming
● Reasoning modeling
● Database programming
● Expert System Development
● Real time AI programs
1.2 LOGIC PROGRAMMING WITH PROLOG
Consider the following PROLOG facts and rule:
dog (puppy).
dog (kutty).
dog (jimmy).
cat (valu).
cat (miaw).
cat (mouse).
animal(Y):-dog(Y).
Output:
:- dog(puppy).
Yes
:- cat(kar).
No
This illustrates a PROLOG program consisting of facts and a rule, and the use of queries that make PROLOG search through its facts and rules to work out the answer.
Determining that puppy is an animal involves a very simple form of logical reasoning: puppy is a dog, and the rule states that every dog is an animal, so puppy is an animal.
1.3 RELATIONSHIPS AMONG OBJECTS AND PROPERTIES OF OBJECTS
The relationship between the objects and the particular relationship among
the objects are explained through the following example.
Each family has three components: the husband, the wife and the children are the objects of the family. As the number of children varies from family to family, the children are represented by a list that is capable of accommodating any number of items. Each person is, in turn, represented by a structure of four components: name, surname, date of birth and job. The job information is either unemployed or it specifies the working organization and salary. This family can be stored in the database by the clause:
family(
person( tom, fox, date(7,may,1950), works(bbc,15200) ),
person( ann, fox, date(9,may,1951), unemployed),
[person( pat, fox, date(5,may,1973), unemployed),
person( jim, fox, date(5,may,1973), unemployed) ] ).
This program can be extended by adding information on the gender of the people that occur in the parent relation. This can be done by simply adding the following facts to our program:
female( pam).
male( tom).
male( bob).
female( liz).
female( pat).
female( ann).
male( jim).
The relations introduced here, male and female, are unary relations.
A binary relation like parent defines a relation between pairs of objects; on the other hand, unary relations can be used to declare simple yes/no properties of objects. The first unary clause above can be read: Pam is a female. The same information could instead be declared through one binary relation, gender, in place of the two unary relations. An alternative code snippet is:
gender( pam, feminine).
gender( tom, masculine).
gender( bob, masculine).
The offspring relation is the inverse of the parent relation. We could
define offspring in a similar way as the parent relation; that is, by simply
providing a list of simple facts about the offspring relation, each fact
mentioning one pair of people such that one is an offspring of the other.
For example:
offspring( liz, tom).
However, the offspring relation can be defined much more elegantly by
making use of the fact that it is the inverse of parent, and that parent has
already been defined. This alternative way can be based on the following
logical statement:
For all X and Y,
Y is an offspring of X if
X is a parent of Y.
This formulation is already close to the formalism of PROLOG. The
corresponding PROLOG clause which has the same meaning is:
offspring( Y, X) :- parent( X, Y).
This clause can also be read as:
For all X and Y,
if X is a parent of Y then
Y is an offspring of X.
PROLOG clauses of this form are called rules:
offspring( Y, X) :- parent( X, Y).
Difference between facts and rules: A fact is something that is always,
unconditionally, true. On the other hand, rules specify things that may be
true if some condition is satisfied. Therefore we say that rules have:
A condition part and a conclusion part
The conclusion part is also called the head of a clause and the condition
part the body of a clause. For example:
offspring( Y, X) :- parent( X, Y).
     head                body
If the condition parent( X, Y) is true then a logical consequence of this is
offspring( Y, X).
How rules are actually used by PROLOG is illustrated by the query:
:- offspring( liz, tom).
1.4 PROBLEM SOLVING
1.4.1 Water jug problem:
Problem Statement:
We are given a 4-gallon jug and a 3-gallon jug. Neither jug has any markings on it, and no other measuring equipment is available. The agent's task is to get exactly 2 gallons of water into the 4-gallon jug using only these two jugs. Initially, both jugs are empty.
Here, let x denote the amount of water in the 4-gallon jug and y the amount in the 3-gallon jug. A state is written as (x,y). The production rules are:
1. (x,y) if x<4 -> (4,y) Fill the 4-gallon jug
2. (x,y) if y<3 -> (x,3) Fill the 3-gallon jug
3. (x,y) if x>0 -> (x-d,y) Pour some water out of the 4-gallon jug
4. (x,y) if y>0 -> (x,y-d) Pour some water out of the 3-gallon jug
5. (x,y) if x>0 -> (0,y) Empty the 4-gallon jug on the ground
6. (x,y) if y>0 -> (x,0) Empty the 3-gallon jug on the ground
7. (x,y) if x+y>=4 and y>0 -> (4, y-(4-x)) Pour water from the 3-gallon jug into the 4-gallon jug until it is full
8. (x,y) if x+y>=3 and x>0 -> (x-(3-y), 3) Pour water from the 4-gallon jug into the 3-gallon jug until it is full
9. (x,y) if x+y<=4 and y>0 -> (x+y, 0) Pour all the water from the 3-gallon jug into the 4-gallon jug
10. (x,y) if x+y<=3 and x>0 -> (0, x+y) Pour all the water from the 4-gallon jug into the 3-gallon jug
S.No.  4-gallon jug contents  3-gallon jug contents  Rule followed
1. 0 gallon 0 gallon Initial state
2. 0 gallon 3 gallons Rule no.2
3. 3 gallons 0 gallon Rule no. 9
4. 3 gallons 3 gallons Rule no. 2
5. 4 gallons 2 gallons Rule no. 7
6. 0 gallon 2 gallons Rule no. 5
7. 2 gallons 0 gallon Rule no. 9
At the 7th step, the goal state (2 gallons in the 4-gallon jug) is reached.
Program Listing:
database
visited_state(integer,integer)
predicates
state(integer,integer)
clauses
state(2,0).
state(X,Y):-
X < 4,
not(visited_state(4,Y)),
assert(visited_state(X,Y)),
write("Fill the 4-Gallon Jug: (",X,",",Y,") --> (", 4,",",Y,")\n"),
state(4,Y).
state(X,Y):- Y < 3,
not(visited_state(X,3)),
assert(visited_state(X,Y)),
write("Fill the 3-Gallon Jug: (", X,",",Y,") --> (", X,",",3,")\n"),
state(X,3).
state(X,Y):- X > 0,
not(visited_state(0,Y)),
assert(visited_state(X,Y)),
write("Empty the 4-Gallon jug on ground: (", X,",",Y,") -->
(",0,",",Y,")\n"),
state(0,Y).
state(X,Y):- Y > 0,
not(visited_state(X,0)),
assert(visited_state(X,0)),
write("Empty the 3-Gallon jug on ground: (", X,",",Y,") -->
(",X,",",0,")\n"),
state(X,0).
state(X,Y):- X + Y >= 4,
Y > 0,
NEW_Y = Y - (4 - X),
not(visited_state(4,NEW_Y)),
assert(visited_state(X,Y)),
write("Pour water from 3-Gallon jug to 4-gallon until it is full:
(",X,",",Y,") --> (", 4,",",NEW_Y,")\n"),
state(4,NEW_Y).
state(X,Y):- X + Y >=3,
X > 0,
NEW_X = X - (3 - Y),
not(visited_state(X,3)),
assert(visited_state(X,Y)),
write("Pour water from 4-Gallon jug to 3-gallon until it is full:
(",X,",",Y,") --> (", NEW_X,",",3,")\n"),
state(NEW_X,3).
state(X,Y):- X + Y <=4,
Y > 0,
NEW_X = X + Y,
not(visited_state(NEW_X,0)),
assert(visited_state(X,Y)),
write("Pour all the water from 3-Gallon jug to 4-gallon: (",X,",",Y,") --> (", NEW_X,",",0,")\n"),
state(NEW_X,0).
state(X,Y):- X+Y<=3,
X > 0,
NEW_Y = X + Y,
not(visited_state(0,NEW_Y)),
assert(visited_state(X,Y)),
write("Pour all the water from 4-Gallon jug to 3-gallon:
(",X,",",Y,") --> (", 0,",",NEW_Y,")\n"),
state(0,NEW_Y).
state(0,2):- not(visited_state(2,0)),
assert(visited_state(0,2)),
write("Pour 2 gallons from 3-Gallon jug to 4-gallon: (", 0,",",2,") -->
(", 2,",",0,")\n"),
state(2,0).
state(2,Y):- not(visited_state(0,Y)),
assert(visited_state(2,Y)),
write("Empty 2 gallons from 4-Gallon jug on the ground:
(",2,",",Y,") --> (", 0,",",Y,")\n"),
state(0,Y).
goal:-
makewindow(1,2,3,"4-3 Water Jug Problem",0,0,25,80),
state(0,0).
1.4.2 Tic-Tac-Toe Problem:
Tic-Tac-Toe is a game for two players who take turns marking the spaces of a 3x3 grid. The player who succeeds in placing 3 of their marks in a horizontal, vertical or diagonal row wins the game. Players soon discover that best play from both parties ends in a draw.
The game can be generalized to an m,n,k-game, in which 2 players alternately place stones of their own colour on an m×n board with the goal of getting k of their own colour in a row. Tic-Tac-Toe is the (3,3,3)-game.
/*A Tic-Tac-Toe program in PROLOG. */
/*Predicates that define the winning conditions:*/
win(Board, Player) :- rowwin(Board, Player).
win(Board, Player) :- colwin(Board, Player).
win(Board, Player) :- diagwin(Board, Player).
rowwin(Board, Player) :- Board = [Player,Player,Player,_,_,_,_,_,_].
rowwin(Board, Player) :- Board = [_,_,_,Player,Player,Player,_,_,_].
rowwin(Board, Player) :- Board = [_,_,_,_,_,_,Player,Player,Player].
colwin(Board, Player) :- Board = [Player,_,_,Player,_,_,Player,_,_].
colwin(Board, Player) :- Board = [_,Player,_,_,Player,_,_,Player,_].
colwin(Board, Player) :- Board = [_,_,Player,_,_,Player,_,_,Player].
diagwin(Board, Player) :- Board = [Player,_,_,_,Player,_,_,_,Player].
diagwin(Board, Player) :- Board = [_,_,Player,_,Player,_,Player,_,_].
orespond(Board,Newboard):-
move(Board, o, Newboard),
win(Newboard, o),
!.
orespond(Board,Newboard) :-
move(Board, o, Newboard),
not(x_can_win_in_one(Newboard)).
orespond(Board,Newboard) :-
move(Board, o, Newboard).
orespond(Board,Newboard) :-
not(member(b,Board)),
!,
write('Cats game!'), nl,
Newboard = Board.
/* Translation from an integer description of x's move to a board
transformation.*/
xmove([b,B,C,D,E,F,G,H,I], 1, [x,B,C,D,E,F,G,H,I]).
xmove([A,b,C,D,E,F,G,H,I], 2, [A,x,C,D,E,F,G,H,I]).
xmove([A,B,b,D,E,F,G,H,I], 3, [A,B,x,D,E,F,G,H,I]).
xmove([A,B,C,b,E,F,G,H,I], 4, [A,B,C,x,E,F,G,H,I]).
xmove([A,B,C,D,b,F,G,H,I], 5, [A,B,C,D,x,F,G,H,I]).
xmove([A,B,C,D,E,b,G,H,I], 6, [A,B,C,D,E,x,G,H,I]).
xmove([A,B,C,D,E,F,b,H,I], 7, [A,B,C,D,E,F,x,H,I]).
xmove([A,B,C,D,E,F,G,b,I], 8, [A,B,C,D,E,F,G,x,I]).
xmove([A,B,C,D,E,F,G,H,b], 9, [A,B,C,D,E,F,G,H,x]).
xmove(Board, N, Board) :- write('Illegal move.'), nl.
explain :-
write('You play X by entering integer positions followed by a period.'),
nl,
display([1,2,3,4,5,6,7,8,9]).
1.4.3 8-Puzzle Problem:
test(Plan):-
write('Initial state:'),nl,
Init= [at(tile4,1), at(tile3,2), at(tile8,3), at(empty,4), at(tile2,5),
at(tile6,6), at(tile5,7), at(tile1,8), at(tile7,9)],
write_sol(Init),
Goal= [at(tile1,1), at(tile2,2), at(tile3,3), at(tile4,4), at(empty,5),
at(tile5,6), at(tile6,7), at(tile7,8), at(tile8,9)],
nl,write('Goal state:'),nl,
write(Goal),nl,nl,
solve(Init,Goal,Plan).
solve(State, Goal, Plan):-
solve(State, Goal, [], Plan).
act(move(X,Y,Z),
[at(X,Y), at(empty,Z), is_movable(Y,Z)],
[at(X,Y), at(empty,Z)],
[at(X,Z), at(empty,Y)]).
/* Check if the first list is a subset of the second */
is_subset([H|T], Set):-
member(H, Set),
is_subset(T, Set).
is_subset([], _).
remove(X, [X|T], T).
remove(X, [H|T], [H|R]):-
remove(X, T, R).
write_sol([]).
write_sol([H|T]):-
write_sol(T),
write(H), nl.
member(X, [X|_]).
member(X, [_|T]):-
member(X, T).
1.5 SUMMARY
This chapter explains how PROLOG is used for logic programming. Applications such as the water jug problem, tic-tac-toe and the 8-puzzle problem are described.
1.6 REFERENCES
1. Logic Programming with Prolog, Max Bramer, Springer.
2. Prolog Programming for Artificial Intelligence, Ivan Bratko, Addison-Wesley.
1.7 BIBLIOGRAPHY
1. https://www.cse.iitd.ac.in/~mcs052942/ai/print/13.txt
2. https://github.com/
*****
UNIT II
2
INTRODUCTION TO PYTHON
PROGRAMMING: LEARN THE
DIFFERENT LIBRARIES
Unit Structure
2.1 NumPy
2.2 Pandas
2.3 SciPy
2.4 Matplotlib
2.5 Scikit Learn.
2.1 NUMPY
● A Python library is nothing but a ready-made module.
● This library can be used whenever we want.
● If we are writing code and a particular requirement arises, then instead of writing the whole code ourselves we can simply use the ready-made code available in the library.
● Thus, by using a library, our time is saved in a very convenient manner.
● We can relate a Python library to a real-world book library too. A book library has a whole set of books, and we choose a book according to our requirements. Similarly, from a Python library we pick the particular piece of code that is needed.
● On Windows, compiled library files carry the extension ".dll", which stands for Dynamic Link Library.
● So whenever we add a library to our program, during the execution phase Python searches for it and loads the particular module which is needed.
● In this module we are studying NumPy, which is one of the libraries in Python.
● NumPy stands for Numerical Python.
● It is one of the most widely used libraries.
● As it contains code related to numerical work, it is most popular in data science and machine learning, since both these fields apply a lot of numerical logic.
● It is used whenever a coding situation involves working with arrays.
● It also provides functions for linear-algebra-related operations.
● NumPy was created in the year 2005.
● Example:
Let's try to create an array using NumPy:
import numpy as ab
ar = ab.array([1, 2, 3, 4, 5])
print(ar)
print(type(ar))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
● In the above example, in the first line we have imported the NumPy library.
● We have given the library an alias called ab, so whenever NumPy is needed in the program we just type ab.
● Then we created a variable called ar and stored the array data inside it.
● Then we printed it.
● So the output shows the array data that was inserted, along with its type.
# The standard way to import NumPy:
import numpy as np
# Create a 2-D array, set every second element in
# some rows and find max per row:
x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
x
# array([[ 0, 1, 2, 3, 4],
# [-99, 6, -99, 8, -99],
# [-99, 11, -99, 13, -99]])
x.max(axis=1)
# array([ 4, 8, 13])
# Generate normally distributed random numbers:
rng = np.random.default_rng()
samples = rng.normal(size=2500)
samples
Output
array([ 0.38054775, -0.06020411, 0.07380668, ..., 1.07546484,
-0.20855135, 0.09773109])
2.2 PANDAS
● The main role of the pandas library is to analyze data.
● It is open source in nature.
● It is used for relational (tabular) data.
● The Pandas library is built on top of the NumPy library.
● It is very quick in nature.
● It was created in the year 2008.
● It handles data very efficiently.
● With pandas, the data does not have to belong to a single kind of category; many data types are allowed.
● By using pandas you can reshape, analyze, and change your data very easily.
● Pandas supports two data structures:
1. Series:
It is an array.
It can hold any kind of data types like integer, float, character etc.
It points to the column.
Example 1: In the table below, each column, i.e. Name and Roll no, corresponds to a Series. Assuming the table has been loaded into a DataFrame df and pandas has been imported as ab, the two Series can be created as:
name_series = ab.Series(df['Name'])
rollno_series = ab.Series(df['Roll no'])

Name        Roll no
Madhusri    01
Srivatsan   22
Anuradha    6
Balaguru    55
Example 2:
import pandas as ab
import numpy as sj
# Creating empty series
ser = ab.Series()
print(ser)
# simple array
data = sj.array(['g', 'e', 'e', 'k', 's'])
ser = ab.Series(data)
print(ser)
In the above example two libraries have been imported, namely pandas and numpy.
The library pandas is represented by ab and, similarly, the library numpy is represented by sj.
First an empty Series is created and printed.
Then an array is created using numpy, converted into a Series, and finally printed.
The output comes out in the fashion shown below.
Output:
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object
2. Data Frame:
It handles 3 parts, mainly data, columns and rows.
Example:
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Madhu', 'For', 'Madhusri', 'is',
       'portal', 'for', 'students']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
Output:Empty DataFrame
Columns: []
Index: []
0
0 Madhu
1 For
2 Madhusri
3 is
4 portal
5 for
6 students
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Output: the entire DataFrame read from data.csv is printed.
2.3 SCIPY
SciPy is built on top of NumPy:
● It provides routines for scientific and mathematical computations.
● It is open source.
import scipy
print(scipy.__version__)
Output:
0.18.1
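Beyond reporting its version, SciPy bundles many numerical routines. As a small illustrative sketch (added here, not from the original text), the scipy.optimize and scipy.integrate modules can be used as follows:
from scipy import optimize, integrate
import numpy as np

# find the minimum of a simple quadratic function f(x) = (x - 3)^2
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)        # approximately 3.0

# numerically integrate sin(x) from 0 to pi
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)           # approximately 2.0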
2.4 MATPLOTLIB
● It is used to plot graphs
● It is open source
● In Python you need to install matplotlib with pip, otherwise the code will not execute. To do this, open cmd, go to the folder where Python is located and type the following command:
pip install matplotlib
● Example:
import numpy as np
import matplotlib.pyplot as plt
xpoints = np.array([1, 2, 6, 8])   # sample data
ypoints = np.array([3, 8, 1, 10])  # sample data
plt.plot(xpoints, ypoints)
plt.show()
2.5 SCIKIT LEARN
● Scikit-learn is a Python library for machine learning, built on NumPy, SciPy and Matplotlib.
● It is open source.
● Installation of scikit-learn is a must to make the program run; this can be done with:
pip install scikit-learn
● Example:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("First 10 rows of X:\n", X[:10])
Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
First 10 rows of X:
[
[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
]
● Features of Scikit-learn include the following (a short cross-validation example follows below):
● Cross validation
● Feature selection
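As a brief illustration of cross-validation (a sketch added here for clarity, not taken from the original text), scikit-learn's cross_val_score can evaluate a model on the iris data:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
model = LogisticRegression(max_iter=1000)

# 5-fold cross validation: fit and score the model on 5 different splits
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores)          # accuracy on each fold
print(scores.mean())   # average accuracy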
● Example:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# read in the training and testing data (the CSV file names here are placeholders)
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')
print(train_data.head())
# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)
# Now, we need to predict the missing target variable in the test data
# target variable - Item_Outlet_Sales
# seperate the independent and target variable on training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'],axis=1)
train_y = train_data['Item_Outlet_Sales']
'''
Create the object of the Linear Regression model
You can also add other parameters and test your code here
https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LinearRegressio
n.html
'''
model = LinearRegression()
model.fit(train_x,train_y)
print('\nCoefficient of model :', model.coef_)
# intercept of the model
print('\nIntercept of model',model.intercept_)
predict_train = model.predict(train_x)
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
# separate the independent and target variable on the testing data (mirrors the training split above)
test_x = test_data.drop(columns=['Item_Outlet_Sales'], axis=1)
test_y = test_data['Item_Outlet_Sales']
predict_test = model.predict(test_x)
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
Output:
Item_Weight ... Outlet_Type_Supermarket Type3
0 6.800000 ... 0
1 15.600000 ... 0
2 12.911575 ... 1
3 11.800000 ... 0
4 17.850000 ... 0
[5 rows x 36 columns]
Shape of training data : (1364, 36)
Shape of testing data : (341, 36)
Coefficient of model:
[-3.84197604e+00 9.83065945e+00 1.61711856e+01 6.09197622e+01
*****
UNIT III
3
SUPERVISED LEARNING
Unit Structure
3.0 Objectives
3.1 Introduction - Regression
3.1.1 What is a Regression
3.2 Types of Regression models
3.2.1 Linear Regression
3.2.2 Need of a Linear regression
3.2.3 Positive Linear Relationship
3.2.4 Negative Linear Relationship
3.3 Cost function
3.3.1 Gradient descent
3.3.2 Impact of different values for learning rate
3.3.3 Use case
3.3.4 Steps to implement linear regression model
3.4 What is logistic regression?
3.4.1 Hypothesis
3.4.2 A sigmoid function
3.5 Cost function
3.5.1 Gradient Descent
3.6 Lets Sum up
3.7 Exercises
3.8 References
3.0 OBJECTIVES
This Chapter would make you understand the following concepts:
What is a Regression?
Types of a Regression.
What is meant by linear regression and what is the importance of linear regression?
Importance of the cost function and gradient descent in linear regression.
Impact of different values for the learning rate.
What is meant by logistic regression and what is the importance of logistic regression?
Importance of the cost function and gradient descent in logistic regression.
The above graph presents the linear relationship between the dependent variable and the independent variable. When the value of x (independent variable) increases, the value of y (dependent variable) likewise increases. The red line is referred to as the best-fit straight line. Based on the given data points, we try to plot a line that models the points the best.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form:
y = a0 + a1*x
where:
y = dependent variable
x = independent variable
a0 = intercept of the line
a1 = slope of the line (regression coefficient)
3.2.4 Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and the independent
variable increases on the X-axis, such a relationship is called a negative
linear relationship.
The goal of the linear regression algorithm is to get the best values for a0
and a1 to find the best fit line. The best fit line should have the least error
means the error between predicted values and actual values should be
minimized.
Using the MSE (Mean Squared Error) cost function, we will change the values of a0 and a1 such that the MSE value settles at its minimum. The model parameters a0 and a1 can be manipulated to minimize the cost function. These parameters can be determined using the gradient descent method so that the cost function value is minimum.
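The cost function referred to above appears as an image in the original; reconstructed from the definitions around it (a0 the intercept, a1 the slope, n training samples), the standard MSE form is:
\[ J(a0, a1) = (1/n) * Σ ( yi - (a0 + a1*xi) )² \]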
Imagine a pit in the shape of U. You are standing at the topmost point in
the pit, and your objective is to reach the bottom of the pit. There is a
treasure, and you can only take a discrete number of steps to reach the
bottom. If you decide to take one small footstep at a time, you will eventually get to the bottom of the pit, but this will take a longer time. If you choose to take longer steps each time, you may reach the bottom sooner, but there is a chance that you could overshoot the bottom of the pit and end up away from it. In the gradient descent algorithm, the size of the steps you take is the learning rate, and this decides how fast the algorithm converges to the minima.
To update a0 and a1, we take gradients from the cost function. To find
these gradients, we take partial derivatives for a0 and a1.
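The update equations appear as images in the original; reconstructed from the MSE cost function above, the gradients and the corresponding update rules are (the code later in this section omits the constant 1/n factor, absorbing it into the learning rate):
\[ ∂J/∂a0 = (-2/n) * Σ ( yi - (a0 + a1*xi) ) \]
\[ ∂J/∂a1 = (-2/n) * Σ ( yi - (a0 + a1*xi) ) * xi \]
\[ a0 := a0 - α * ∂J/∂a0,   a1 := a1 - α * ∂J/∂a1 \]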
The partial derivatives are the gradients, and they are used to update the values of a0 and a1. Alpha (α) is the learning rate.
3.3.2 Impact of different values for learning rate:
The blue line represents the optimal value of the learning rate: the cost function value is minimized in a few iterations. The green line represents a learning rate lower than the optimal value: the number of iterations required to minimize the cost function becomes high. If the learning rate selected is very high, the cost function can keep increasing with iterations, or saturate at a value higher than the minimum value, as represented by the red and black lines.
The main loop to calculate the values of the coefficients (the arrays experience and salary and the learning rate lr are assumed to have been defined earlier in the use case):
a0 = 0    # intercept
a1 = 0    # slope
for i in range(len(experience)):
    # reset the accumulated gradients for this pass
    cost_a0 = 0
    cost_a1 = 0
    for j in range(len(experience)):
        partial_wrt_a0 = -2 * (salary[j] - (a0 + a1 * experience[j]))
        partial_wrt_a1 = (-2 * experience[j]) * (salary[j] - (a0 + a1 * experience[j]))
        cost_a0 = cost_a0 + partial_wrt_a0    # accumulate gradient for a0
        cost_a1 = cost_a1 + partial_wrt_a1    # accumulate gradient for a1
    a0 = a0 - lr * cost_a0    # update a0
    a1 = a1 - lr * cost_a1    # update a1
3.4 WHAT IS LOGISTIC REGRESSION?
3.4.1 Hypothesis:
The objective of a logistic regression is to learn a function that outputs the
probability that the dependent variable is one for each training sample. To
achieve that, a sigmoid / logistic function is required for the
transformation.
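The hypothesis formula appears as an image in the original; in standard notation it is the sigmoid applied to the linear combination of the inputs:
\[ hθ(x) = 1 / (1 + e^(-θᵀx)) \]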
Where,
θ is a vector of parameters that corresponds to each independent
variable
x is a vector of independent variables
3.5 COST FUNCTION
The cost function for logistic regression is derived from statistics using the principle of maximum likelihood estimation, which allows efficient identification of parameters. In addition, the convex property of the cost function allows gradient descent to work effectively.
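The cost function itself appears as an image in the original; reconstructed in standard form, using the symbols described below, it is:
\[ J(θ) = (1/m) * Σ Cost( hθ(xi), yi ) \]
\[ Cost( hθ(x), y ) = -log( hθ(x) )        if y = 1 \]
\[ Cost( hθ(x), y ) = -log( 1 - hθ(x) )    if y = 0 \]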
Where,
● i is one of the mth training samples
● hƟ(xi) is the predicted value for the training sample
● yi is the actual value for the training sample
To understand the cost function, we can look into each of the two
components in isolation:
Suppose yi = 1:
if hθ(xi) = 1, then the prediction error = 0
if hθ(xi) = 0, then the prediction error approaches infinity
These two scenarios are represented by the blue line in Figure 2 below.
Suppose yi = 0:
if hθ(xi) = 0, then the prediction error = 0
if hθ(xi) = 1, then the prediction error approaches infinity
These two scenarios are likewise represented in Figure 2 below.
The logistic regression cost function can be further simplified into a one-line equation:
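The simplified equation and the gradient descent update that the symbols below refer to appear as images in the original; in standard form they are:
\[ J(θ) = -(1/m) * Σ [ yi * log( hθ(xi) ) + (1 - yi) * log( 1 - hθ(xi) ) ] \]
\[ θj := θj - α * (1/m) * Σ ( hθ(xi) - yi ) * xij \]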
Where,
● values of j = 0,1, …, n
● α is the learning rate
Note: The form of the gradient descent algorithm is identical to that of linear regression; only the hypothesis hθ(x) differs.
3.7 EXERCISES
Differentiate the Linear regression and logistic regression with a real
time example.
3.8 REFERENCES
https://www.studytonight.com/post/linear-regression-and-predicting-values-based-on-a-training-dataset
https://activewizards.com/blog/5-real-world-examples-of-logistic-regression-application
https://www.marktechpost.com/2021/11/12/logistic-regression-with-a-real-world-example-in-python/
https://www.statology.org/linear-regression-real-life-examples/
https://www.quora.com/What-are-applications-of-linear-and-logistic-regression
https://www.statology.org/logistic-regression-real-life-examples/
https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Logistic_Regression.pdf
https://www.analyticsvidhya.com/blog/2021/06/linear-regression-in-machine-learning/
https://www.analyticsvidhya.com/blog/2022/01/an-introductory-note-on-linear-regression/
http://home.iitk.ac.in/~shalab/regression/Chapter3-Regression-MultipleLinearRegressionModel.pdf
https://www.princeton.edu/~otorres/Regression101.pdf
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
https://www.analyticsvidhya.com/blog/2021/04/simple-understanding-and-implementation-of-knn-algorithm/
*****
4
SUPERVISED LEARNING
Unit Structure
4.0 Objectives
4.1 Advanced Optimization Algorithms
4.1.1 Multiclass Classification
4.1.2 Bias-Variance Tradeoff
4.1.3 Regularization
4.2 Applications of Linear/Logistic regression.
4.2.1 Two things you can do using regression are
4.2.2 Application of logistic regression
4.3 K-nearest Neighbors (KNN) Classification Model
4.4 Lets Sum up
4.5 References
4.6 Exercises
4.0 OBJECTIVES
This Chapter would make you understand the following concepts:
Advanced Optimization Algorithms
Applications of Linear/Logistic regression.
KNN- classification
4.1 ADVANCED OPTIMIZATION ALGORITHMS
4.1.1 Multiclass Classification:
To deal with a multiclass problem (one-vs-rest), we train a logistic regression
binary classifier for each class to predict the probability that y = i. The
prediction output for a given new input will be chosen based on the
classifier that has the highest probability.
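As an illustrative sketch (not part of the original text), scikit-learn can train this kind of one-vs-rest logistic regression classifier directly:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# one binary logistic regression classifier is trained per class;
# prediction picks the class whose classifier gives the highest probability
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))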
4.1.2 Bias-Variance Tradeoff:
4.1.3 Regularization:
For a model to generalize well, regularization is usually introduced to reduce overfitting of the training data.
This is represented by a regularization term that is added to the cost function and penalizes all parameters that are high in value. This leads to a simpler hypothesis that is less prone to overfitting. The new cost function then becomes:
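The regularized cost function appears as an image in the original; reconstructed in standard form (using the symbols described below) it is:
\[ J(θ) = -(1/m) * Σ [ yi * log( hθ(xi) ) + (1 - yi) * log( 1 - hθ(xi) ) ] + (λ/2m) * Σ θj² \]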
Where,
● i is one of the m training samples
● hθ(xi) is the predicted value for training sample i
● yi is the actual value for training sample i
● λ is the regularization parameter that controls the tradeoff between fitting the training dataset well and keeping the parameters θ small in value
● j indexes the parameters θ
The gradient descent update keeps the same form, with an additional term coming from the regularization:
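The update equations appear as images in the original; in standard form, the regularized gradient descent update is (θ0 is conventionally not regularized):
\[ θ0 := θ0 - α * (1/m) * Σ ( hθ(xi) - yi ) * xi0 \]
\[ θj := θj - α * [ (1/m) * Σ ( hθ(xi) - yi ) * xij + (λ/m) * θj ],   j = 1, ..., n \]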
4.2.2 Application of logistic regression:
Logistic Regression Real Life Example: 1
Medical researchers want to know how exercise and weight impact the
probability of having a heart attack. To understand the relationship
between the predictor variables and the probability of having a heart
attack, researchers can perform logistic regression.
The response variable in the model will be heart attack and it has two potential outcomes: the individual has a heart attack, or the individual does not have a heart attack.
The results of the model will tell researchers exactly how changes in
exercise and weight affect the probability that a given individual has a
heart attack. The researchers can also use the fitted logistic regression
model to predict the probability that a given individual has a heart
attacked, based on their weight and their time spent exercising.
Logistic Regression Real Life Example: 2
University admissions officers want to know how GPA, ACT score, and the number of AP classes taken impact the probability that an applicant gets accepted. The response variable in the model will be "acceptance" and it has two potential outcomes: the applicant is accepted, or the applicant is not accepted.
The results of the model will tell researchers exactly how changes in GPA,
ACT score, and number of AP classes taken affect the probability that a
given individual gets accepted into the university. The researchers can also
use the fitted logistic regression model to predict the probability that a
given individual gets accepted, based on their GPA, ACT score, and
number of AP classes taken.
Logistic Regression Real Life Example: 3
The response variable in the model will be "spam" and it has two potential outcomes: the email is spam, or the email is not spam.
In [1]:
# read in the iris data
from sklearn.datasets import load_iris
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
Classification accuracy:
● Proportion of correct predictions
● Common evaluation metric for classification problems
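The notebook cells that train the logistic regression model and generate y_pred are not reproduced above; a minimal sketch of what they would contain (training and predicting on the full dataset, as the training-accuracy discussion below assumes) is:
# train a logistic regression model on the full dataset
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)

# predict the response values for the same observations used for training
y_pred = logreg.predict(X)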
In [4]:
# compute classification accuracy for the logistic regression model
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
0.96
● Known as training accuracy when you train and test the model on
the same data
● 96% of our predictions are correct
● KNN model:
1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to
the measurements of the unknown iris
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris (a short scikit-learn sketch of these steps follows below)
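As an illustrative sketch (added here, not from the original notebook), the scikit-learn code corresponding to these steps looks like this:
from sklearn.neighbors import KNeighborsClassifier

# step 1: pick a value for K
knn = KNeighborsClassifier(n_neighbors=5)

# train the model with the iris data (it memorizes the training observations)
knn.fit(X, y)

# steps 2 and 3: find the 5 nearest training observations to the unknown iris
# and return the most popular response value among them
unknown_iris = [[3, 5, 4, 2]]   # an example measurement (illustrative values)
print(knn.predict(unknown_iris))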
With K=1, this would always have 100% training accuracy: because we are testing on the exact same data, it would always make correct predictions.
KNN would search for one nearest observation and find that exact
same observation
KNN has memorized the training set
Because we are testing on the exact same data, it would always make the same prediction.
Your accuracy would be high but may not generalize well for future
observations
Your accuracy is high because it is perfect in classifying your training
data but not out-of-sample data
● Black line (decision boundary): just right
Good for generalizing for future observations
● Hence we need to solve this issue using a train/test split that will be
explained below
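The cells that perform the train/test split and fit the logistic regression model (whose output appears below as Out[11]) are not reproduced above; a minimal sketch of them, assuming the usual scikit-learn workflow, is:
# STEP 1: split X and y into training and testing sets
# (the split proportion and random_state used here are illustrative)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# STEP 2: train the model on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)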
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [12]:
# STEP 3: make predictions on the testing set
y_pred = logreg.predict(X_test)
# compare actual response values (y_test) with predicted response values
(y_pred)
print(metrics.accuracy_score(y_test, y_pred))
0.95
Repeat for KNN with K=5:
In [13]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
0.966666666667
Repeat for KNN with K=1:
In [14]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
0.95
Can we locate an even better value for K?
In [15]:
# try K=1 through K=25 and record testing accuracy
k_range = range(1, 26)
# We create an empty Python list using []
scores = []
# We loop through the range 1 to 25
# We append the scores to the list
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
scores.append(metrics.accuracy_score(y_test, y_pred))
print(scores)
[0.94999999999999996, 0.94999999999999996, 0.96666666666666667,
0.96666666666666667, 0.96666666666666667, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.98333333333333328,
0.98333333333333328, 0.98333333333333328, 0.96666666666666667,
0.98333333333333328, 0.96666666666666667, 0.96666666666666667,
0.96666666666666667, 0.96666666666666667, 0.94999999999999996,
0.94999999999999996]
In [16]:
# import Matplotlib (scientific plotting library)
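# (the rest of this cell is not reproduced in the original; a plausible
#  reconstruction that plots testing accuracy against K is shown below)
import matplotlib.pyplot as plt

# plot the relationship between K and the testing accuracy recorded above
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.show()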
● Training accuracy rises as model complexity increases
● Testing accuracy penalizes models that are too complex or not
complex enough
● For KNN models, complexity is determined by the value of K (lower
value = more complex)
/Users/ritchieng/anaconda3/envs/py3k/lib/python3.5/site-
packages/sklearn/utils/validation.py:386: DeprecationWarning:
Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in
0.19. Reshape your data either using X.reshape(-1, 1) if your data has a
single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Out[17]:
array([1])
4.5 EXERCISES
Compare linear regression and logistic regression with a real-time example.
Take a real-time example and demonstrate KNN classification.
4.6 REFERENCES
https://www.quora.com/What-are-applications-of-linear-and-logistic-regression
https://www.statology.org/logistic-regression-real-life-examples/
https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Logistic_Regression.pdf
https://www.analyticsvidhya.com/blog/2021/06/linear-regression-in-machine-learning/
https://www.analyticsvidhya.com/blog/2022/01/an-introductory-note-on-linear-regression/
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
https://www.analyticsvidhya.com/blog/2021/04/simple-understanding-and-implementation-of-knn-algorithm/
*****
UNIT IV
5
FEATURES AND EXTRACTION
Unit Structure
5.1 Dimensionality reduction
5.2 Feature selection
5.3 Normalization
5.1 DIMENSIONALITY REDUCTION
● Each independent variable is regressed against all the other independent variables, and we calculate the VIF (Variance Inflation Factor), as shown below.
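For reference (added here, not part of the original text), the VIF for the i-th independent variable is computed from the R² of that regression:
\[ VIF_i = 1 / (1 - R_i²) \]
A VIF close to 1 indicates little multicollinearity, while large values indicate that the variable is strongly explained by the other predictors.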
Heatmap also plays a crucial role in understanding the correlation between
variables.
The type of relationship between any two quantities varies over a period of
time.
Correlation varies from -1 to +1
To be precise,
● Values that are close to +1 indicate a positive correlation.
● Values close to -1 indicate a negative correlation.
● Values close to 0 indicate no correlation at all.
Below is the heatmap to show how we will correlate which features are
highly dependent on the target feature and consider them.
Figure: correlation heatmap plotted for the Iris dataset.
Independent features:
● The second feature is almost independent of the others.
Here the correlation matrix and its pictorial representation have given an idea of the potential for feature reduction. Two features can therefore be kept, and the remaining features can be dropped.
Feature Selection:
In feature selection, usually, a subset of original features is selected.
Feature Extraction:
Widespread linear feature extraction methods:
● Principal Component Analysis (PCA): It seeks a projection that
preserves as much information as possible in the data.
● Linear Discriminant Analysis (LDA):- It seeks a projection that best
discriminates the data.
Take a look at the following picture of the Taj Mahal from the top view. Note that there are only a few dimensions in which the information varies, and the variance is also not large. Hence, it is difficult to identify from the top view whether the picture is of the Taj Mahal. Thus, the top view can be ignored easily.
The way PCA differs from other feature selection techniques such as random forest importance, regularization techniques, and forward/backward selection is that it does not require class labels to be present (it is therefore called unsupervised).
The dataset should be standardized / normalized after creating the training / test split. Python's sklearn.preprocessing StandardScaler class can be used for standardizing the dataset.
This section represents custom Python code for extracting the features
using PCA.
Here are the steps followed for performing PCA:
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]   # Find all categorical columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
                                                     df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; this is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import the eigh method for calculating eigenvalues and eigenvectors
#
from numpy.linalg import eigh
#
# Determine the covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance and select the most important eigenvectors based on explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i / total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Construct the projection matrix using the five eigenvectors that correspond to the top five
# eigenvalues (largest), to capture about 75% of the variance in this dataset
#
egnpairs = [(np.abs(egnvalues[i]), egnvectors[:, i])
            for i in range(len(egnvalues))]
egnpairs.sort(key=lambda k: k[0], reverse=True)
projectionMatrix = np.hstack((egnpairs[0][1][:, np.newaxis],
                              egnpairs[1][1][:, np.newaxis],
                              egnpairs[2][1][:, np.newaxis],
                              egnpairs[3][1][:, np.newaxis],
                              egnpairs[4][1][:, np.newaxis]))
#
# Transform the training data set
#
X_train_pca = X_train_std.dot(projectionMatrix)
Here are the steps followed for performing PCA:
● Perform PCA by fitting and transforming the training data set to the new
feature subspace and later transforming test data set.
#
# Perform one-hot encoding
#
categorical_columns = df.columns[df.dtypes == object]   # Find all categorical columns
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
#
# Create training / test split
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[df.columns != 'salary']],
                                                     df['salary'], test_size=0.25, random_state=1)
#
# Standardize the dataset; this is very important before you apply PCA
#
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Perform PCA
#
from sklearn.decomposition import PCA
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
5.2 FEATURE SELECTION
Feature Selection is one of the core concepts in machine learning
which hugely impacts the performance of your model. The data
features that you use to train your machine learning models have a huge
influence on the performance you can achieve. Irrelevant or partially
relevant features can negatively impact model performance. Feature
selection and Data cleaning should be the first and most important step of
your model designing.
Having irrelevant features in your data can decrease the accuracy of the
models and make your model learn based on irrelevant features.
Three common techniques are Univariate Selection, Feature Importance, and Correlation Matrix with Heatmap. Let's have a look at these techniques one by one with an example.
1. Univariate Selection:
Statistical tests can be used to select those features that have the strongest
relationship with the output variable.
The scikit-learn library provides the SelectKBest class that can be used
with a suite of different statistical tests to select a specific number of
features.
The example below uses the chi-squared (chi²) statistical test for non-
negative features to select 10 of the best features from the Mobile Price
Range Prediction Dataset.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1]  #target column i.e price range

#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(10,'Score')) #print 10 best features
2. Feature Importance:
You can get the feature importance of each feature of your dataset by
using the feature importance property of the model.
Feature importance gives you a score for each feature of your data, the
higher the score more important or relevant is the feature towards your
output variable.
import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances
of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_,
index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
3. Correlation Matrix with Heatmap:
Correlation states how the features are related to each other or to the target variable.
Heatmap makes it easy to identify which features are most related to the
target variable, we will plot heatmap of correlated features using the
seaborn library.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
#get correlations of each features in dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()
5.3 NORMALIZATION
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Normalization is also required for some algorithms to model the data correctly.
For example, assume your input dataset contains one column with values
ranging from 0 to 1, and another column with values ranging from 10,000
to 100,000. The great difference in the scale of the numbers could cause
problems when you attempt to combine the values as features during
modelling.
● You can change all values to a 0-1 scale, or transform the values by representing them as percentile ranks rather than absolute values.
Four common normalization techniques are:
● scaling to a range
● clipping
● log scaling
● z-score
Scaling to a range is a good choice when you know the approximate upper and lower bounds on your data, with few or no outliers.
A good example is age. Most age values falls between 0 and 90, and every
part of the range has a substantial number of people.
In contrast, you would not use scaling on income, because only a few
people have very high incomes. The upper bound of the linear scale for
income would be very high, and most people would be squeezed into a
small part of the scale.
Feature Clipping:
If your data set contains extreme outliers, you might try feature clipping,
which caps all feature values above (or below) a certain threshold to a fixed value. For example, you could clip all temperature values above 40 to be exactly 40.
Log Scaling:
Log scaling computes the log of your values to compress a wide range to a
narrow range.
\[ x' = log(x) \]
Log scaling is helpful when a handful of your values have many points,
while most other values have few points. This data distribution is known
as the power law distribution. Movie ratings are a good example. In the
chart below, most movies have very few ratings (the data in the tail), while
a few have lots of ratings (the data in the head). Log scaling changes the
distribution, helping to improve linear model performance.
Z-Score:
\[ x' = (x - μ) / σ \]
Suppose you're not sure whether the outliers truly are extreme. In this
case, start with z-score unless you have feature values that you don't want
the model to learn; for example, the values are the result of measurement
error or a quirk.
Tip:
To ensure that columns of a specific type are provided as input, try using
the Select Columns in Dataset component before Normalize Data.
4. Use 0 for constant columns when checked: Select this option when
any numeric column contains a single unchanging value. This ensures
that such columns are not used in normalization operations.
5. From the Transformation method dropdown list, choose a single
mathematical function to apply to all selected columns.
Zscore: Converts all values to a z-score.
The values in the column are transformed using the following formula:
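The formula itself appears as an image in the original; in the notation used earlier in this chapter it is:
\[ x' = (x - μ) / σ \]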
Mean and standard deviation are computed for each column separately.
Population standard deviation is used.
MinMax: The min-max normalizer linearly rescales every feature to
the [0,1] interval.
Rescaling to the [0,1] interval is done by shifting the values of each
feature so that the minimal value is 0, and then dividing by the new
maximal value (which is the difference between the original maximal and
minimal values).
The values in the column are transformed using the following formula:
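The min-max formula appears as an image in the original; in the same notation it is:
\[ x' = (x - x_min) / (x_max - x_min) \]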
Examples:
Here, we create data with some random values and apply some normalization techniques to it.
# importing packages
import pandas as pd
# create data
df = pd.DataFrame([
[180000, 110, 18.9, 1400],
[360000, 905, 23.4, 1800],
[230000, 230, 14.0, 1300],
[60000, 450, 13.5, 1500]],
columns=['Col A', 'Col B',
'Col C', 'Col D'])
# view data
display(df)
Output: the DataFrame with columns Col A to Col D is displayed.
Using maximum absolute scaling, each column can be normalized by dividing it by its maximum absolute value, giving a scaled DataFrame df_max_scaled (see the sketch below).
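The code that builds df_max_scaled is not visible in the extracted text; a minimal sketch of the maximum absolute scaling step, consistent with the plot call that follows, is:
# copy the data and apply maximum absolute scaling column by column
df_max_scaled = df.copy()
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

# view normalized data
display(df_max_scaled)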
See the plot of this dataframe:
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')
Output: a bar chart of the scaled columns.
Using the z-score method, each column is standardized by subtracting its mean and dividing by its standard deviation:
# copy the data and apply z-score normalization column by column
df_z_scaled = df.copy()
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] -
                           df_z_scaled[column].mean()) / df_z_scaled[column].std()
# view normalized data
display(df_z_scaled)
Output:
*****
6
TRANSFORMATION
Unit Structure
6.1 Introduction
6.2 Transformers
6.3 Principal Component Analysis (PCA)
6.1 INTRODUCTION
What is AI Transformation?:
We have listed below a set of the top 6 steps for Fortune 500 firms. Smaller firms could skip having in-house teams and strive for less risky and less investment-heavy approaches, such as relying on consultants for targeted projects.
2. Execute pilot projects to gain momentum:
The first few projects should create measurable business value while being attainable. This is important for the transformation to gain trust across the organization through completed projects, and it creates momentum that will lead to AI projects with greater success.
These projects can rely on AI/ML powered tools in the marketplace or for
more custom solutions, your company can run a data science
competition and rely on the wisdom of hundreds of data scientists. These
competitions use encrypted data and provide a low cost way to find high
performing data science solutions.
5. Develop internal and external communications:
When the team gains momentum from the initial AI projects and forms a
deeper understanding of AI, the organization will have a better
understanding of improvement areas where AI can create the most value.
An updated strategy that considers the company‟s track record can set a
better direction for the company.
Process Transformation:
For example, there are numerous efforts underway to transform the business of mining into a wholly robotic exercise, where no humans travel below the surface.
Domain Transformation:
We see (and have helped) numerous industrial companies that have
undergone domain transformations. ThyssenKrupp, a diversified industrial
engineering company, broadened its offerings to introduce a lucrative new
digital business alongside its traditional business. The company leveraged
a strong industrial market position and Internet of Things (IOT)
capabilities to help clients manage the maintenance of elevators with asset
health and predictive maintenance offerings—creating a significant new
source of revenue beyond the core. In another example, a major equipment
manufacturer is moving beyond its core machine offerings to introduce a
digital platform of solutions for its client sites: job-site activity
coordination, remote equipment tracking, situational awareness, and
supply chain optimization. The company is moving to become no longer
merely a heavy equipment provider, but also a digital solutions company.
Cultural/Organizational Transformation:
6.2 TRANSFORMERS
Understanding the Training Data:
Sample data point: "write a function that adds two numbers":
Python Code:
def add_two_numbers(num1, num2):
    sum = num1 + num2
    return sum
Tokenized Input:
SRC = [' ', 'write', 'a', 'python', 'function', 'to', 'add', 'two', 'user', 'provided',
'numbers', 'and', 'return', 'the', 'sum']
Tokenized Output:
TRG = [(57, 'utf-8'), (1, 'def'), (1, 'add_two_numbers'), (53, '('), (1, 'num1'),
(53, ','), (1, 'num2'), (53, ')'), (53, ':'), (4, '\n'), (5, ' '), (1, 'sum'), (53, '='),
(1, 'num1'), (53, '+'), (1, 'num2'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''), (6,
''), (0, '')]
Data Augmentations:
While tokenizing the python code, we mask the names of certain variables
randomly(with „var_1, „var_2‟ etc) to ensure that the model that we train
does not merely fixate on the way the variables are named and actually
tries to understand the inherent logic and syntax of the python code.
For example, by masking different variables in the program above, we can expand a single data point into 3 more data points using our random variable replacement technique.
We implement our augmentations at the time of generating our tokens.
While randomly picking variables to mask we avoid keyword
literals(keyword.kwlist), control structures(as can be seen in
below skip_list), and object properties. We add all such literals that need
to be skipped into the skip_list.
We now apply our augmentations and tokenization using
Pytorch‟s torchtext.data.Field.
Output = data.Field(tokenize = augment_tokenize_python_code,
init_token='<sos>',
eos_token='<eos>',
lower=False)
An augmented tokenized output for the same program, with the variable num2 masked as 'var_1', ends as follows:
(1, 'num1'), (53, '+'), (1, 'var_1'), (4, '\n'), (1, 'return'), (1, 'sum'), (4, ''), (6, ''), (0, '')]
Feeding Data:
To feed data into our model we first create batches. The tokenized
predictions are then untokenized via the untokenize function of Python‟s
source code tokenizer.
Loss Function:
We have used augmentations in our dataset to mask variable literals. This
means that our model can predict a variety of values for a particular
variable and all of them are correct as long as the predictions are
consistent through the code. This would mean that our training labels are
not very certain and hence it would make more sense to treat them to be
correct with probability 1- smooth_eps and incorrect otherwise. This is
what label smoothening does. By adding label smoothening to Cross-
Entropy we ensure that the model does not become too confident in
predicting some of our variables that can be replaced via augmentations.
Now with all our components set we can train our model using
backpropagation. We split our dataset into training and validation data.
Our model is trained until our validation loss does not improve any
further.
It is important to note that label smoothening leads to much higher loss
values as compared to models that do not make use of label smoothening.
But this is as expected as we do not intend to be certain with our label
predictions. This is particularly the case with variables as there can be
multiple correct options as long as the predictions are consistent through
the target code sequence.
Sample Results:
Input: “program to sort a list of dictionaries by key”
Output:
var_1 ={'Nikhil':{'roll':24 ,'marks':17 },
'Akshat':{'roll':54 ,'marks':12 },
'Akash':{'roll':15 },'marks':15 }}
sort_key ='marks'
res ='marks'
res =var_2 (test_dict .items (),key =lambda x :x [1 ][sort_key ])
print ("The sorted dictionary by marks is : "+str (res ))
Output:
def sum_odd_elements (l :list ):
return sum ([i for i in l if i %2 = =1 ])
Output:
var_1 = 'Today is bad day'
var_1 [::-1 ]
Principal Components in PCA:
7. Calculating the new features or Principal Components: Here we
will calculate the new features. To do this, we multiply the P* matrix by Z.
In the resultant matrix Z*, each observation is a linear combination of the
original features, and each column of Z* is independent of the others.
8. Remove less important or unimportant features from the new dataset:
Now that the new feature set has been obtained, we decide what to keep and
what to remove. That is, we keep only the relevant or important features in
the new dataset, and the unimportant features are removed.
● Focus on uncorrelated and Gaussian components.
Features to Ignore:
● Collinear features or linearly dependent features. e.g., leg size and
height.
● Noisy features that are constant. e.g., the thickness of hair
● Constant features. e.g., Number of teeth.
Features to Keep:
● Non-collinear features or low covariance.
● Features that change a lot, high variance. e.g., grade.
The eigenvectors determine the direction of the new attribute space, and the
eigenvalues determine its magnitude.
PCA's main objective is to reduce the data's dimensionality by
projecting it into a smaller subspace, where the eigenvectors form the
axes. However, the eigenvectors define only the directions of the new axes,
because they all have a size of 1. Consequently, to decide which
eigenvector(s) we can discard without losing much information when
constructing the subspace, we check the corresponding eigenvalues. The
eigenvectors with the highest eigenvalues are the ones that carry the most
information about the distribution of our data.
Covariance Matrix:
The classic PCA approach calculates the covariance matrix, where each
element represents the covariance between two attributes. The covariance
between two attributes X and Y is calculated as
cov(X, Y) = (1/(n − 1)) · Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
Create a matrix:
import pandas as pd
import numpy as np

matrix = np.array([[0, 3, 4], [1, 2, 4], [3, 4, 5]])
matrix
Correlation Matrix:
Another way to calculate eigenvalues and eigenvectors is by using the
correlation matrix. Although the matrices are different, they will result in
the same eigenvalues and eigenvectors (shown later) since the covariance
matrix's normalization gives the correlation matrix.
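As a small illustration (the toy data below is an assumption, not the chapter's dataset), once the data is standardized the covariance matrix coincides with the correlation matrix, so both yield the same eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                         # toy data: 100 samples x 3 features
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)  # standardize each feature

cov = np.cov(x_std, rowvar=False)                     # covariance of standardized data
corr = np.corrcoef(x, rowvar=False)                   # correlation of the raw data

eig_vals_cov, _ = np.linalg.eigh(cov)
eig_vals_corr, _ = np.linalg.eigh(corr)
print(np.allclose(eig_vals_cov, eig_vals_corr))       # True (up to numerical precision)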
Applications of PCA:
These are the typical applications of PCA:
● Data Visualization.
● Data Compression.
● Noise Reduction.
● Data Classification.
● Image Compression.
● Face Recognition.
# Fragment of the custom convers_pca class: inside fit(), the eigenvalues are
# sorted and the projection matrix and explained variance are stored.
self.sorted_components = np.argsort(self.eigen_values)[::-1]
self.projection_matrix = self.eigen_vectors[self.sorted_components[:self.no_of_components]]
self.explained_variance = self.eigen_values[self.sorted_components]
self.explained_variance_ratio = self.explained_variance / self.eigen_values.sum()

# transform() projects the (standardized) data onto the selected components:
def transform(self, x):
    return np.dot(x - self.mean, self.projection_matrix.T)
Standardization of x:
from sklearn.preprocessing import StandardScaler
transformed = StandardScaler().fit_transform(x)
PCA with two components:
pca = convers_pca(no_of_components=2)
pca.fit(transformed)
Check eigenvectors:
pca.eigen_vectors
Check eigenvalues:
pca.eigen_values
Check sorted components:
pca.sorted_components
Plot PCA with number of components = 2:
x_std = pca.transform(transformed)

plt.figure()
plt.scatter(x_std[:, 0], x_std[:, 1], c=y)
*****
UNIT V
7
UNSUPERVISED LEARNING
K-MEANS CLUSTERING ALGORITHM
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Definition
7.3 Basic Algorithms
7.3.1 K-Means clustering
7.3.2 Practical advantages
7.4 Stages
7.5 Pseudo-code
7.6 The K-Means Algorithm Fits within the Framework of Cover’s
Theorem
7.7 Partitioning Clustering Approach
7.8 The K-means algorithm: a heuristic method
7.8.1 How K-means partitions?
7.8.2 K-means Demo
7.8.3 Application
7.8.4 Relevant issues of K-Means algorithm
7.9 Let's Sum up
7.10 Unit End Exercises
7.11 References
7.0 OBJECTIVES
This Chapter would make you understand the following concepts:
What is K-Means clustering algorithm
Definition of K-Means clustering algorithm
Basics of K-Means clustering
Practical advantages of K-Means clustering algorithm
Stages of K-Means clustering algorithm
Pseudo code of K-Means clustering algorithm
The K-Means Algorithm Fits within the Framework of Cover’s
Theorem
Partitioning Clustering Approach
The K-means algorithm: a heuristic method
How K-means partitions?
K-means Demo
Application of K-Means algorithm
Relevant issues of K-Means algorithm
We have chosen to focus on the so-called K-means algorithm because it is
simple to implement, yet effective in performance, two features that have
made it highly popular.
Let {x_i}, i = 1, ..., N, denote a set of multidimensional observations that is to be
partitioned into a proposed set of K clusters, where K is smaller than the
number of observations, N. Let the relationship
j = C(i), i = 1, 2, ..., N
denote a many-to-one mapper, called the encoder, which assigns the ith
observation x_i to the jth cluster according to a rule yet to be defined. To do
this encoding, we need a measure of similarity between every pair of
vectors x_i and x_i′, which is denoted by d(x_i, x_i′). When the measure
d(x_i, x_i′) is small enough, both x_i and x_i′ are assigned to the same cluster;
otherwise, they are assigned to different clusters.
To optimize the clustering process, we introduce the following cost
function (Hastie et al., 2001):
J(C) = (1/2) Σ_{j=1}^{K} Σ_{i: C(i)=j} Σ_{i′: C(i′)=j} d(x_i, x_i′)
Hence, taking the dissimilarity measure to be the squared Euclidean distance,
J(C) = (1/2) Σ_{j=1}^{K} Σ_{i: C(i)=j} Σ_{i′: C(i′)=j} ||x_i − x_i′||²
2. The inner summation reads as follows: For a given x_i, the encoder C
assigns to cluster j all the observations x_i′ that are closest to x_i. Except
for a scaling factor, the sum of the observations so assigned is an
estimate of the mean vector pertaining to cluster j; the scaling factor
in question is 1/N_j, where N_j is the number of data points within
cluster j. On account of these two points, we may therefore reduce the cost
function to the simplified form
J(C) = Σ_{j=1}^{K} N_j Σ_{i: C(i)=j} ||x_i − μ_j||²
where μ_j denotes the "estimated" mean vector associated with cluster j. In
effect, the mean μ_j may be viewed as the center of cluster j. In light of this
simplified form, we may now restate the clustering problem as follows:
Given a set of N observations, find the encoder C that assigns these
observations to the K clusters in such a way that, within each cluster, the
average measure of dissimilarity of the assigned observations from the
cluster mean is minimized.
Indeed, it is because of the essence of this statement that the clustering
technique described herein is commonly known as the K-means algorithm.
For an interpretation of the cost function J(C), we may say that, except for
a scaling factor 1/N_j, the inner summation in this equation is an estimate
of the variance of the observations associated with cluster j for a given
encoder C, as shown by
σ̂_j² = (1/N_j) Σ_{i: C(i)=j} ||x_i − μ_j||²
Accordingly, we may view the cost function J(C) as a measure of the total
cluster variance resulting from the assignments of all the N observations to
the K clusters that are made by encoder C.
With the encoder C being unknown, how do we minimize the cost function
J(C)? To address this key question, we use an iterative descent algorithm,
each iteration of which involves a two-step optimization. The first step
minimizes the cost function J(C) with respect to the mean vectors μ_j for a
given encoder C. The second step uses the nearest-neighbour rule to
minimize J(C) with respect to the encoder C for a given set of mean
vectors. This two-step iterative procedure is continued until convergence is
attained.
Thus, in mathematical terms, the K-means algorithm proceeds in two
steps:
Step 1: For a given encoder C, the total cluster variance is minimized with
respect to the assigned set of cluster means {μ_j}; that is, we perform the
following minimization:
min over {μ_j} of Σ_{j=1}^{K} Σ_{i: C(i)=j} ||x_i − μ_j||², for a given C
Step 2: Having computed the optimized cluster means in Step 1, we next
optimize the encoder as follows:
C(i) = arg min_{1≤j≤K} ||x_i − μ_j||²
Starting from some initial choice of the encoder C, the algorithm goes
back and forth between these two steps until there is no further change in
the cluster assignments.
Each of these two steps is designed to reduce the cost function J(C) in its
own way; hence, convergence of the algorithm is assured. However,
because the algorithm lacks a global optimality criterion, the result may
converge to a local minimum, resulting in a suboptimal solution to the
clustering assignment.
Stage 1:
Keep the µ fixed and determine r.
In this case, it is easy to see that the minimization decomposes into m
independent problems. The solution for the i-th data point x_i can be found
by setting
r_ij = 1 if j = arg min_{j′} d(x_i, µ_{j′}),
and 0 otherwise.
Stage 2:
Keep the r fixed and determine µ. Since the r's are fixed, J is a quadratic
function of µ. It can be minimized by setting the derivative with respect to
µ_j to 0.
Rearranging obtains
µ_j = (Σ_i r_ij x_i) / (Σ_i r_ij),
that is, each cluster center µ_j is the mean of the data points assigned to cluster j.
7.5 PSEUDO-CODE
Detailed pseudo-code can be found in K-Means Algorithms:
Cluster(X) {Cluster dataset X}
Initialize cluster centers µj for j = 1,...,k randomly
repeat
  for i = 1 to m do
    Compute j' = arg min j=1,...,k d(xi, µj)
    Set rij' = 1 and rij = 0 for all j ≠ j'
  end for
  for j = 1 to k do
    Compute µj = (Σi rij xi) / (Σi rij)
  end for
until cluster assignments rij are unchanged
return {µ1,...,µk} and rij
The algorithm stops when the cluster assignments do not change
significantly.
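The following is a minimal NumPy sketch of the two steps above (assignment, then mean update); the toy data, K and the initialization strategy are illustrative assumptions rather than the pseudo-code's exact notation:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # stop when assignments settle
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = k_means(X, k=2)
print(centers)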
The partitioning criterion is the sum of squared errors,
E = Σ_{k=1}^{K} Σ_{x ∈ C_k} d²(x, m_k),
i.e., each point contributes its squared distance to its "representative object"
(the centroid m_k) of the cluster it belongs to, using, e.g., the Euclidean
distance d²(x, m_k) = Σ_{n=1}^{N} (x_n − m_kn)².
● Given a K, find a partition of K clusters to optimize the chosen
partitioning criterion (cost function)
● Global optimum: exhaustively search all partitions
2. Compute new seed points as the centroids of the clusters of the
current partition (the centroid is the centre, i.e., mean point, of the
cluster)
3. Go back to Step 1), stop when no more new assignment (i.e.,
membership in each cluster no longer changes)
7.8.2 K-means Demo
7.8.3 Application:
Colour-Based Image Segmentation Using K-means
Step 1: Loading a colour image of tissue stained with hematoxylin and
eosin (H&E).
Local optimum
● sensitive to initial seed points
● converge to a local optimum: maybe an unwanted solution
Other problems
● Need to specify K, the number of clusters, in advance
● Unable to handle noisy data and outliers well (addressed by the K-Medoids algorithm)
● Not suitable for discovering clusters with non-convex shapes
How can we improve the K-means performance?
Two issues with K-Means are worth noting.
First, it is sensitive to the choice of the initial cluster centers µ. A number
of practical heuristics have been developed. For instance, one could
randomly choose k points from the given dataset as cluster centers. Other
methods try to pick k points from X which are farthest away from each
other.
Second, it makes a hard assignment of every point to a cluster center.
Variants which we will encounter later in the book will relax this. Instead
of letting rij ∈ {0,1}, these soft variants will replace it with the probability
that a given xi belongs to cluster j.
The K-Means algorithm concludes our discussion of a set of basic
machine learning methods for classification and regression. They provide
a useful starting point for an aspiring machine learning researcher.
7.11 REFERENCES
https://www.javatpoint.com/k-means-clustering-algorithm-in-
machine-learning
https://towardsdatascience.com/k-means-clustering-algorithm-
applications-evaluation-methods-and-drawbacks-aa03e644b48a
https://www.analyticsvidhya.com/blog/2021/11/understanding-k-
means-clustering-in-machine-learningwith-examples/
https://www.geeksforgeeks.org/k-means-clustering-introduction/
https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-
means-clustering-algorithm
*****
8
UNSUPERVISED LEARNING
K- MEDOID CLUSTERING ALGORITHM
Unit Structure
8.0 Objectives
8.1 Definition – K-Medoid clustering algorithm
8.2 Introduction - K-Medoid clustering algorithm
8.3 K-Means & K-Medoids Clustering- Outliers Comparison
8.4 K-Medoids - Basic Algorithm
8.5 K-Medoids - Pam Algorithm
8.5.1 Typical PAM Example
8.6 Advantages and Disadvantages of PAM
8.7 CLARA – Clustering Large Applications
8.7.1 CLARA Algorithm
8.8 Comparison CLARA Vs PAM
8.9 Applications
8.10 General Applications of Clustering
8.11 Working of the K-Medoids approach
8.11.1 Complexity of K-Medoids algorithm
8.11.2 Advantages of the technique
8.12 Practical Implementation
8.13 Let's Sum up
8.14 Unit End Exercises
8.15 References
8.0 OBJECTIVES
This Chapter would make you understand the following concepts:
What is K-Medoid clustering algorithm
Definition of K-Medoid clustering algorithm
Comparison of K-Medoid clustering algorithm
K-Medoid Basic algorithm
K-Medoid PAM algorithm
Clara – Clustering Large Applications
Working and Practical Implementation
8.1 DEFINITION – K-MEDOID CLUSTERING ALGORITHM
K-Medoids is a clustering algorithm resembling the K-Means clustering
technique. It falls under the category of unsupervised machine learning.
Initialize: Select K points as the initial representative objects, i.e., the initial
medoids of our K clusters.
Repeat: Assign each point to the cluster with the closest medoid m. For each
medoid m and non-medoid point o, compute the change in cost S that would
result from swapping m with o.
If S < 0: accept the swap (o becomes the new medoid); otherwise, keep the
current medoid.
Algorithm:
1. Start with initial set of medoids.
2. Iteratively replace one of the medoids with a non-medoid if doing so
reduces the total SSE of the resulting clustering.
The cost is computed as
E = Σ_{i=1}^{k} Σ_{x ∈ Ci} |x − Mi|
where k is the number of clusters, x is a data point in cluster Ci, and Mi is
the medoid of Ci.
E = (3+4+4) + (3+1+1+2+2)
Therefore, E = 20
Swapping o8 with o7:
E = (3+4+4) + (2+2+1+3+3)
Therefore, E = 22
Let's now calculate the cost function S for this swap: S = E for (o2, o7) − E
for (o2, o8)
S = 22 − 20 = 2
Therefore S > 0, so the swap would increase the total cost; it is not carried
out, and o8 is retained as the medoid.
Advantages:
Disadvantages:
The PAM algorithm for K-medoid clustering works well for small datasets
but cannot scale well to large datasets due to its high computational overhead.
PAM complexity: O(k(n−k)²). This is because we compute the distance of
the n−k non-medoid points to each of the k medoids to decide which cluster
each point falls into, and after this we try to replace each of the medoids
with a non-medoid and find its distance to the n−k points.
8.7 CLARA – CLUSTERING LARGE APPLICATIONS
● Improvement over PAM
● Finds medoids in a sample from the dataset
● [Idea]: If the samples are sufficiently random, the medoids of the
sample approximate the medoids of the dataset
● [Heuristics]: 5 samples of size 40+2k gives satisfactory results
● Works well for large datasets (n=1000, k=10)
4. Retain the sub-dataset for which the mean (or sum) is minimal. A
further analysis is carried out on the final partition.
Strength:
8.9 APPLICATIONS
Social Network:
Document Clustering
8.10 GENERAL APPLICATIONS OF CLUSTERING
1. Pattern Recognition
2. Spatial Data Analysis
a. Create thematic maps in GIS by clustering feature spaces
b. Detect spatial clusters and explain them in spatial data mining
3. Image Processing
4. Economic Science (especially market research)
5. WWW
a. Document classification
b. Cluster Weblog data to discover groups of similar access patterns
cluster, such that its distance from other points is minimum. Since
medoids do not get influenced by extremities, the K-Medoids algorithm is
more robust to outliers and noise than the K-Means algorithm.
The following figure explains how the positions of the mean and the medoid
can vary in the presence of an outlier.
2. Import required libraries and modules:
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
# Import the digits dataset available in the sklearn.datasets package
from sklearn.datasets import load_digits

"""
Instead of using all 64 attributes of the dataset, we use Principal
Component Analysis (PCA) to reduce the dimensions of the feature set such
that most of the useful information is covered.
"""
from sklearn.decomposition import PCA

"""
Import the module for standardizing the dataset, i.e. rescaling the data such
that it has a mean of 0 and a standard deviation of 1.
"""
from sklearn.preprocessing import scale
components to be considered. The fit_transform() method fits the PCA
model and performs dimensionality reduction on digit_data.
"""
models = [
    (
        KMedoids(metric="manhattan", n_clusters=num_digits,
                 init="heuristic", max_iter=2), "Manhattan metric",
    ),
    (
        KMedoids(metric="euclidean", n_clusters=num_digits,
                 init="heuristic", max_iter=2), "Euclidean metric",
    ),
    (
        KMedoids(metric="cosine", n_clusters=num_digits,
                 init="heuristic", max_iter=2), "Cosine metric",
    ),
]
7. Initialize the number of rows and columns of the plot for plotting
subplots of each of the three metrics’ results:
#number of rows = integer(ceiling(number of model variants/2))
num_rows = int(np.ceil(len(models) / 2.0))
#number of columns
num_cols = 2
8. Fit each of the model variants to the data and plot the resultant
clustering:
#Clear the current figure first (if any)
plt.clf()
plt.xticks(())
plt.yticks(())
8.15 REFERENCES
● http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedo
ids/Kmeans_Kmedoids.html
● https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-
and-practical-examples/
● https://towardsdatascience.com/understanding-k-means-k-means-
and-k-medoids-clustering-algorithms-ad9c9fbf47ca
● https://iq.opengenus.org/k-medoids-clustering/
*****
UNIT VI
9
CLASSIFYING DATA USING SUPPORT
VECTOR MACHINES (SVMS): SVM-RBF
KERNELS
Unit Structure
9.0 Introduction to SVMS
9.1 What Is A Support Vector Machine, And How Does It Work?
9.2 What Is The Purpose of SVM?
9.3 Importing Datasets
9.4 The Establishment of A Support Vector Machine
9.5 A Simple Description of The SVM Classification Algorithm
9.6 What Is The Best Way To Transform This Problem Into A Linear
One?
9.7 Kernel For The Radial Basis Function (RBF) And Python Examples
9.8 Build A Model With Default Values For C And Gamma
9.9 Radial Basis Function (RBF) Kernel: The Go-To Kernel
9.10 Conclusion
9.11 References
9.2 WHAT IS THE PURPOSE OF SVM?
An SVM training algorithm creates a model that assigns new examples to
one of two categories, making it a non-probabilistic binary linear
classifier, given a series of training examples that are individually
designated as belonging to one of two categories.
Before you go any further, make sure you have a basic knowledge of this
topic. In this article, I'll show you how to use machine learning tools like
scikit-learn to classify the cancer UCI dataset using SVM.
NumPy, Pandas, matplotlib, and scikit-learn are required.
Let's look at a simple support vector categorization example. To begin, we
must first generate a dataset:
Implementation in Python:
# importing make_blobs from scikit-learn
# (sklearn.datasets.samples_generator has been removed in recent versions)
from sklearn.datasets import make_blobs
Support vector machines consider a region around the line of a particular
width in addition to drawing a line between two classes. Here's an
example of how it may appear:
# creating line space between -1 to 3.5
xfit = np.linspace(-1, 3.5)
# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
yfit = m * xfit + b
plt.plot(xfit, yfit, '-k')
plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
plt.show()
[[ 122.8 1001. ]
[ 132.9 1326. ]
[ 130. 1203. ]
...,
[ 108.3 858.1 ]
[ 140.1 1265. ]
[ 47.92 181. ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., ....,
1.])
9.4 THE ESTABLISHMENT OF A SUPPORT VECTOR
MACHINE
These locations will now be fitted with a Support Vector Machine
Classifier. While the mathematical specifics of the likelihood model are
fascinating, we'll save those for another time. Instead, we'll approach the
scikit-learn algorithm as a black box that performs the aforementioned
work.
# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')
# fitting x samples and y classes
clf.fit(x, y)
The model can then be used to forecast new values after it has been fitted:
clf.predict([[120, 990]])
clf.predict([[85, 550]])
array([ 0.])
array([ 1.])
Let's have a look at the graph to see what this means.
Hard-margin:
The SVM method is used to separate the two classes of points. Scenario
with a tight margin.
● The "H1" hyperplane is incapable of accurately separating the two
classes; hence it is not a suitable solution to our problem.
● The "H2" hyperplane accurately splits classes. The distance between
the hyperplane and the nearest blue and green points, on the other
hand, is extremely small. As a result, there's a good risk that any
future new points may be classified erroneously. The algorithm, for
example, would allocate the new grey point (x1=3, x2=3.6) to the
green class when it is evident that it should belong to the blue class
instead.
● Finally, the "H3" hyperplane appropriately and with the greatest
possible margin divides the two classes (yellow shaded area). A
solution has been discovered!
It's worth noting that determining the maximum feasible margin allows for
a more accurate classification of additional data, resulting in a far more
robust model. When utilizing the "H3" hyperplane, you can see that the
new grey point is correctly allocated to the blue class.
Soft-Margin:
It may not always be possible to completely separate the two classes. In
such cases, a soft margin is employed, with some points permitted to be
misclassified or to fall within the margin (yellow shaded area). This is
where the "slack" variable, represented by ξ (xi), comes in.
The SVM method is used to separate the two classes of points. Scenario
with a soft margin.
The green point inside the margin is treated as an outlier by the "H4"
hyperplane in this case. As a result, the support vectors are the two green
spots closest to the main group. This increases the model's resilience by
allowing for a bigger margin.
Note that you may tweak the hyperparameter C to decide how much you
care about misclassifications (and points inside the margin). C is essentially
a weight assigned to those errors. A low C tolerates more misclassified
points and a wider margin, producing a more robust model, whereas a high
C strives to classify all training examples correctly, producing a closer fit to
the training data but making it less robust.
While a high C value will likely result in higher model performance on the
training data, there is a substantial risk of overfitting the model, which
will result in poor outcomes on the test data.
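As a rough illustration of this trade-off, the following sketch compares training and test accuracy for a small and a large C; the toy data generated with make_blobs and the chosen C values are assumptions, not the chess-game example used later in this chapter:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=400, centers=2, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (0.1, 100):
    clf = SVC(kernel='rbf', C=C).fit(X_train, y_train)
    # A large gap between the two scores is a sign of overfitting.
    print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")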
Kernel Trick:
SVM was previously explained in the context of linearly separable blue
and green classes. What if we wanted to use SVMs to solve non-linear
problems? How would we go about doing that? The kernel technique
comes into play at this point. A kernel is a function that takes a nonlinear
problem and converts it to a linear problem in a higher-dimensional space.
Let's look at an example to demonstrate this method.
Assume you have two classes, red and black, as indicated in the diagram below:
As you can see, red and black points are not linearly separable because
there is no way to construct a line that separates these two classes. We can,
however, distinguish them by drawing a circle with all of the red dots
inside and the black points outside.
The RBF kernel is K(x, x′) = exp(−gamma · ||x − x′||²), where gamma can be
adjusted manually and must be greater than zero. In sklearn's SVM
classification method, the default value for gamma is 'scale', i.e.
gamma = 1 / (n_features · X.var()).
Briefly:
||x − x′||² is the squared Euclidean distance between the two feature vectors
(2 points). Gamma is a scalar that expresses how much influence a single
training sample (point) has.
As a result of the above design, we can control the influence of specific
points on the overall algorithm. The bigger the gamma, the closer the other
points must be to have an impact on the model. In the Python examples
below, we'll see how adjusting gamma affects the results.
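Before moving to the data, a minimal sketch (with illustrative, assumed points and gamma values) shows how the RBF similarity between two fixed points shrinks as gamma grows:

import numpy as np

def rbf_kernel(x1, x2, gamma):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([2.5, 3.0])

for gamma in (0.01, 0.1, 1.0, 10.0):
    # Larger gamma => similarity decays faster with distance, so only very
    # close points influence each other's predictions.
    print(f"gamma={gamma}: K={rbf_kernel(x1, x2, gamma):.4f}")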
Setup:
The following data and libraries will be used:
● Kaggle chess games data
● Scikit-learn library for separating the data into train-test samples,
creating SVM classification models, and model evaluation
● Data manipulation with Pandas and Numpy
After you've saved the data to your machine, use the code below to ingest
it. We also get a few new variables that we can use in the modeling.
# Read in the csv
df=pd.read_csv('games.csv', encoding='utf-8')
# Print a snapshot of a few columns
df.iloc[:,[0,1,5,6,8,9,10,11,13,16,17]]
Let's now write a few functions that we may use to generate different
models and plot the results.
This function divides the data into train and test samples, fits the model,
predicts the outcome on a test set, and calculates model performance
metrics.
def fitting(X, y, C, gamma):
# Create training and testing samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)
# Fit the model
# Note, available kernels: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’,
‘precomputed’}, default=’rbf’
model = SVC(kernel='rbf', probability=True, C=C, gamma=gamma)
clf = model.fit(X_train, y_train)
# Predict class labels on the test and training data
pred_labels_te = model.predict(X_test)
pred_labels_tr = model.predict(X_train)
print('----- Evaluation on Test Data -----')
score_te = model.score(X_test, y_test)
print('Accuracy Score: ', score_te)
# Look at classification report to evaluate the model
print(classification_report(y_test, pred_labels_te))
print('--------------------------------------------------------')
print('----- Evaluation on Training Data -----')
score_tr = model.score(X_train, y_train)
print('Accuracy Score: ', score_tr)
# Look at classification report to evaluate the model
print(classification_report(y_train, pred_labels_tr))
print('--------------------------------------------------------')
With the test data and model prediction surface, the following function
will create a Plotly 3D scatter graph.
# Create a 3D scatter plot with predictions
fig = px.scatter_3d(x=X_test['rating_difference'], y=X_test['turns'],
z=y_test,
opacity=0.8, color_discrete_sequence=['black'])
Note that we're cheating a little because the final number of moves won't
be known until after the match. As a result, if we were to make model
predictions before the match, we wouldn't be able to use 'turns.' However,
this is merely for demonstration purposes, therefore we'll use it in the
examples below.
The code is brief because we're using our previously defined 'fitting'
function.
# Select data for modeling
X=df[['rating_difference', 'turns']]
y=df['white_win'].values
We can see that the model's performance on test data is similar to that on
training data, indicating that the default hyperparameters allow the model
to generalize well.
Now we'll use the Plot 3D function to see the prediction:
Plot_3D(X, X_test, y_test, clf)
Note that the top black points are actual class=1 (white won), whereas the
bottom black points are actual class=0 (white did not win). Meanwhile, the
surface represents the model's predicted probability of a white win.
While the probability varies locally, the decision boundary is about x=0
(i.e., rating difference=0) because this is where the probability crosses the
p=0.5 line.
As can be seen, raising gamma improves model performance on training
data but degrades model performance on test data. The graph below
explains why this is the case.
Rather than a smooth prediction surface, we now have one that is highly
"spiky." We need to look into the kernel function a little more to see why
this happens.
When we use a high gamma value, we are telling the function that the
close points are significantly more crucial for the prediction than the far
points. As a result, we see these "spikes" since the prediction is based on
individual points in the training instances rather than the environment.
Reducing gamma, on the other hand, tells the function that when
generating a forecast, it's not only the specific point that matters, but also
the points around it. Let's look at another case with a low gamma value to
see if this is correct.
C Hyperparameter Adjustment:
I chose not to add examples with various C values here because C impacts
the smoothness of the prediction surface similarly to gamma, but for
different reasons. You may observe this for yourself by using the "fitting"
function with a value of C=100. Recall that with a soft margin some points
are permitted to be misclassified or to fall within the margin (yellow shaded
area); this increases the model's resilience by allowing for a bigger margin.
In terms of σ, the RBF kernel is written as
K(X₁, X₂) = exp(−||X₁ − X₂||² / (2σ²))
where:
1. 'σ' is the variance and our hyperparameter.
Let d₁₂ = ||X₁ − X₂|| denote the distance between the two points X₁ and X₂;
we can then express the kernel in terms of d₁₂.
Fig 2: The distance between two points in space.
The following is a rewrite of the kernel equation in terms of this distance:
K(X₁, X₂) = exp(−d₁₂² / (2σ²))
The RBF kernel can have a maximum value of 1 when d12 is 0, which
means that the points are equal, i.e. X1 = X2.
1. There is no distance between the points when they are the same,
therefore they are incredibly comparable.
2. The kernel value is less than 1 and close to 0 when the points are
separated by a wide distance, indicating that the points are dissimilar.
Because the points become less similar as the distance between them
increases, distance can be regarded as an analogue of dissimilarity.
Finding the proper value of σ, which determines which points should be
regarded as similar, is critical, and this can be demonstrated on a
case-by-case basis.
a] σ = 1
When σ = 1, σ² = 1 and the RBF kernel's mathematical equation becomes
K(X₁, X₂) = exp(−d₁₂² / 2)
The curve for this equation is shown below; we can see that the RBF kernel
decreases exponentially as the distance rises and is approximately 0 for
distances larger than 4.
1. We can see that when d₁₂ = 0, the similarity is 1, and when d₁₂ exceeds
4 units, the similarity is 0.
2. We can see from the graph that if the distance between the points is
less than 4, the points are similar, and if the distance is larger than 4,
the points are dissimilar.
b] σ = 0.1
When σ = 0.1, σ² = 0.01 and the RBF kernel's mathematical equation becomes
K(X₁, X₂) = exp(−d₁₂² / 0.02) = exp(−50 · d₁₂²)
For σ = 0.1, the width of the Region of Similarity is the smallest, therefore
only extremely close points are considered comparable.
The RBF-kernel Support Vector Machine is included in the scikit-learn
toolkit and has two hyperparameters: 'C' for the SVM and 'gamma' for the
RBF kernel. Here, gamma is inversely proportional to σ².
9.10 CONCLUSION
A Support Vector Machine (SVM) is a discriminative classifier with a
separating hyperplane as its formal definition. An SVM training algorithm
creates a model that assigns new examples to one of two categories,
making it a non-probabilistic binary linear classifier. To train the
classifier, we must first import the cancer dataset as a CSV file. We then
extract two features out of all the samples and train the classifier on them.
The SVM algorithm seeks out a hyperplane that separates these two
classes by the greatest margin possible.
A hard margin can be utilized if the classes are entirely linearly separable;
otherwise, a soft margin is required, with some points permitted to be
misclassified or to fall within the margin (yellow shaded area). This
increases the model's resilience by allowing for a bigger margin.
9.11 REFERENCES
● https://www.geeksforgeeks.org/classifying-data-using-support-vector-
machinessvms-in-python/
● https://towardsdatascience.com/svm-classifier-and-rbf-kernel-how-to-
make-better-models-in-python-73bb4914af5b
● https://towardsdatascience.com/radial-basis-function-rbf-kernel-the-
go-to-kernel-acf0d22c798a
*****
UNIT VII
10
DECISION TREE
Unit structure
10.0 Objectives
10.1 Decision Tree
10.2 Ensemble Techniques – Bagging
10.3 Ensemble Techniques – Boosting
10.4 Ensemble Techniques – Stacking
10.5 Ensemble Techniques – Voting
10.6 Random Forest- Bagging Attribute Bagging And Voting For Class
Selection
10.7 Summary
10.8 References
10.0 OBJECTIVES
This chapter will enable students to:
● Make use of Data sets in implementing the machine learning
algorithms
● Implement the machine learning concepts and algorithms in any
suitable language of choice.
Data sets can be taken from standard repositories or constructed by the
students.
Introduction:
Decision-tree algorithm falls under the category of supervised learning
algorithms. It works for both continuous as well as categorical output
variables. Makes use of the Tree representation. Can be used for
classification. Given a decision tree, how do we predict an outcome for a
class label? We start from the root of the tree. CART stands for
Classification and Regression Trees.
For example, consider a dataset of cats and dogs, with their features. The
label here is accordingly "cat", or "dog", and the goal is to identify the
animal based on its features, using a decision tree. Say, if at a particular
node in the tree, the input to a node contains only a single type of label,
say cats, we can infer that it is perfectly grouped, or "unmixed". On the
other hand, if the input contains a mix of cats and dogs, we would have to
ask another question about the features in the dataset that can help us
narrow down, and divide the mix further to try and "unmix" them
completely.
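To make the idea of "mixed" and "unmixed" groups concrete, here is a minimal sketch (an illustration, not the balance-scale example that follows) computing the Gini impurity of a group of labels; a node containing only cats has impurity 0, while an even mix of cats and dogs has impurity 0.5:

from collections import Counter

def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class proportions
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["cat", "cat", "cat"]))          # 0.0 -> perfectly "unmixed"
print(gini_impurity(["cat", "dog", "cat", "dog"]))   # 0.5 -> maximally mixed (2 classes)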
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
# Operational Phase
print("Results Using Gini Index:")
A supervised learning algorithm. Makes use of the Tree representation.
Can be used for classification.
10.2 ENSEMBLE TECHNIQUES – BAGGING
# importing the required modules
import xgboost as xgb
from sklearn.ensemble import BaggingRegressor

# initializing the bagging model using XGBoost as base model with default parameters
model = BaggingRegressor(base_estimator=xgb.XGBRegressor())

# training model
model.fit(X_train, y_train)
# making predictions on the test dataset
pred_final = model.predict(X_test)

# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
10.4 ENSEMBLE TECHNIQUES – STACKING
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# initializing the second-level model
final_model = model_1
# fitting the second level model with stack features
final_model = final_model.fit(s_train, y_train)
# predicting the final output using stacking
pred_final = final_model.predict(X_test)
# printing the root mean squared error between real value and predicted
value
print(mean_squared_error(y_test, pred_final))
model_2 = XGBClassifier()
10.6 RANDOM FOREST – BAGGING, ATTRIBUTE BAGGING AND VOTING FOR CLASS SELECTION
# getting train data from the dataframe (drop the target column)
train = df.drop("Weekday", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(train, target,
test_size=0.20)
# initializing all the model objects with default parameters
model_3 = RandomForestClassifier()
# training all the model on the train dataset
final_model.fit(X_train, y_train)
# predicting the output on the test dataset
pred_final = final_model.predict(X_test)
# printing log loss between actual and predicted value
print(log_loss(y_test, pred_final))
example 2:
import pandas as pd
import numpy as np
dataset = pd.read_csv('/content/petrol_consumption.csv')
dataset.head()
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test,
y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:',
np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
dataset.head()
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix,
accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
10.7 SUMMARY
Ensemble means a group of elements viewed as a whole rather than
individually. An ensemble method creates multiple models and combines
them to solve a problem. Ensemble methods help to improve the
robustness/generalizability of the model. In this chapter, we discussed some
of these methods along with their implementation in Python.
10.8 REFERENCES
1 Aurélien Géron, Hands-On Machine Learning with Scikit-Learn,
Keras, and TensorFlow, 2nd Edition.
2 Paul J. Deitel, Python Fundamentals.
3 Stuart Russell, Peter Norvig ,Artificial Intelligence – A Modern
Approach, , Pearson Education / Prentice Hall of India, 3rd Edition,
2009.
4 EthemAlpaydın, Introduction to Machine Learning, PHI, Third
Edition, ISBN No. 978-81-203- 5078-6.
5 Peter Harrington, Machine Learning in Action. Manning Publications,
April 2012ISBN 9781617290183.
6 Introduction to Computer Programming using Python, John V Guttag
7 Core Python Programming, R. Nageswara Rao
8 https://talentsprint.com/pages/artificial-intelligence-machine-learning-iiit-
hprogram/program-details.pdf
9 https://learning.oreilly.com/library/view/learning-robotics
using/9781783287536/cover.html
10 http://www.qboticslabs.com
11 https://subscription.packtpub.com/book/big_data_and_business_intelligence
12 https://scikit-learn.org/0.16/modules/generated/sklearn.lda.LDA.html
13 https://machinelearningmastery.com/ensemble-machine-learning-
algorithmspython-scikit-learn/
14 https://www.coursera.org/learn/machine-learning#syllabus
15 https://data-flair.training/blogs/python-ml-data-preprocessing/
*****
UNIT VIII
11
BOOSTING ALGORITHMS
Unit Structure
11.0 Boosting Algorithms
11.1 How it works
11.2 Types of boosting Algorithms
11.3 Introduction to AdaBoost Algorithm
11.3.1 What is AdaBoost Algorithm
11.3.2 How it works
11.3.3 What is AdaBoost algorithm used for
11.3.4 Pros and Cons
11.3.5 Pseudocode of AdaBoost
11.4 Gradient Boosting Machines Algorithm
11.4.1 Implementation
11.4.2 Implementation using Scikit learn
11.4.3 Stochastic Gradient Boosting
11.4.4 Shrinkage
11.4.5 Regularization
11.4.6 Tree constraints
Example:
Let’s understand this with an example of the email, which recognize
whether the email, is a spam or not? It can be recognized it by the
following conditions:
Spam:
Not Spam:
11.3 INTRODUCTION TO ADABOOST ALGORITHM
An AdaBoost algorithm can be used to boost the performance of any
machine learning algorithm. Machine learning has become a powerful
tool that can make predictions based on huge amounts of data, and it has
become so popular in recent times that applications of machine learning
can be found in our day-to-day activities [1,4,7]. A common illustration
is getting recommendations for items while shopping online, based on the
items previously bought by the customer. Machine learning, frequently
referred to as predictive analysis, can be characterized as the capability of
computers to learn without being programmed explicitly. Instead, it uses
algorithms to analyze input data in order to predict output within a
specified range [1,4,7].
Then we calculate the weight for the mth weak classifier as
α_m = (1/2) · ln((1 − err_m) / err_m)
where err_m is the weighted error rate of that classifier.
The weight is positive for any classifier with an accuracy > 50%, becomes
larger if the classifier is more accurate, and negative if the classifier has an
accuracy < 50%. The prediction can be combined by inverting the sign. By
inverting the sign of the prediction, a classifier with a 40% accuracy can
be converted into a 60% accuracy [1,4,7].
Updating the weight for each data point as
w_i ← w_i · exp(−α_m · y_i · h_m(x_i)) / Z_m
Z_m is here the normalization factor. It makes sure that the sum total of all
instance weights becomes equal to 1.
Cons:
Weak classifiers being too weak can lead to low margins and overfitting
[1,4,7].
1. Initialize all example weights equally.
2. for each base learner do:
Train base learner with a weighted sample.
Test base learner on all data.
Set learner weight with a weighted error.
Set example weights based on ensemble predictions.
3. end for
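A minimal sketch of this procedure using scikit-learn's AdaBoostClassifier is given below; the Iris dataset and the train/test split are assumptions rather than the exact listing behind the output that follows:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# n_estimators = number of weak learners; learning_rate shrinks each learner's weight
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))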
Output:
Accuracy:0.8666666666666667
An accuracy of 86.66% is achieved.
11.4 GRADIENT BOOSTING MACHINES ALGORITHM
1. Loss Function:
The use of the loss function depends on the type of problem. The
advantage of gradient boosting is that there is no need for a new boosting
algorithm for each loss function [4,7,8].
2. Weak Learner:
In gradient boosting, decision trees are used as a weak learner. A
regression tree is used to give true values, which can be combined together
to create correct predictions. Like in the AdaBoost algorithm, small trees
with a single split are used, i.e. decision stumps. Larger trees with 4–8
levels can also be used [4,7,8].
3. Additive Model:
In this model, trees are added one at a time, and the existing trees in the
model remain unchanged. During the addition of trees, gradient descent is
used to minimize the loss function.
The Gradient Boosting Machine is a powerful ensemble machine learning
algorithm that uses decision trees.
Gradient boosting is a generalization of AdaBoosting, improving the
performance of the approach and introducing ideas from bootstrap
aggregation to further improve the models, such as randomly sampling the
samples and features when fitting ensemble members.
Gradient boosting performs well, if not the best, on a wide range of tabular
datasets, and versions of the algorithm like XGBoost and LightGBM often
play an important role in winning machine learning competitions [4,7,8].
Gradient Boosting ensemble is an ensemble created from decision trees
added sequentially to the model.
The weak learners are fit in such a way that each new learner fits into the
residuals of the previous step so as the model improves. The final model
aggregates the result of each step and thus a strong learner is achieved. A
loss function is used to detect the residuals. Mean squared error (MSE) is
used for a regression task and logarithmic loss (log loss) is used for
classification tasks [1,4,7].
Note:
A problem with gradient boosted decision trees is overfitting due to the
addition of too many trees, whereas in random forests the addition of too
many trees won't cause overfitting.
Algorithm:
Let's say the output model y, when fit to only one decision tree, is given by
y = A_1 + B_1·x + e_1
where e_1 is the residual from this decision tree. In gradient boosting, we
fit the consecutive decision trees on the residual from the last one [1,4,7].
So when gradient boosting is applied to this model, the consecutive
decision trees will be mathematically represented as:
e_1 = A_2 + B_2·x + e_2
e_2 = A_3 + B_3·x + e_3
Note that here we stop at 3 decision trees, but in an actual gradient
boosting model, the number of learners or decision trees is much larger
[1,4,7]. The final model will be given by:
y = A_1 + A_2 + A_3 + B_1·x + B_2·x + B_3·x + e_3
11.4.1 Implementation:
Implementation from Scratch
Consider simulated data as shown in scatter plot below with 1 input (x)
and 1 output (y) variables.
Fit a simple regression model on the data and obtain initial predictions
[call them y_predicted1].
Calculate the error residuals: actual target value minus predicted target
value [e1 = y – y_predicted1].
Fit a new model on the error residuals as the target variable, with the same
input variables [call its predictions e1_predicted].
Add the predicted residuals to the previous predictions [y_predicted2 =
y_predicted1 + e1_predicted].
Fit another model on the residuals that are still left, i.e. [e2 = y –
y_predicted2], and repeat steps 2 to 5 until it starts overfitting or the sum of
residuals becomes constant. Overfitting can be controlled by consistently
checking accuracy on validation data. A from-scratch sketch of these steps
is given below.
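A hedged from-scratch sketch of this residual-fitting loop, using shallow regression trees on assumed simulated data with one input (x) and one output (y), is:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=200)   # 1 input, 1 noisy output

learning_rate = 0.3
n_rounds = 20

y_pred = np.full_like(y, y.mean())        # start from a constant prediction
for _ in range(n_rounds):
    residuals = y - y_pred                # e = y - current prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(x, residuals)
    y_pred += learning_rate * tree.predict(x)   # add the predicted residuals

print("final training MSE:", np.mean((y - y_pred) ** 2))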
Improving performance of gradient boosted decision trees [1,4,7]:
Gradient boosting algorithms are prone to overfitting and consequently
poor performance on the test dataset. There are some pointers you can keep
in mind to improve the performance of a gradient boosting algorithm.
11.4.4 Shrinkage:
The predictions of each tree are added together sequentially. Instead, the
contribution of each tree to this sum can be weighted to slow down the
learning by the algorithm. This weighting is called a shrinkage or a
learning rate. Using a low learning rate can dramatically improve the
performance of your gradient boosting model. Usually a learning rate in the
range of 0.1 to 0.3 gives the best results [1,4,7].
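A minimal sketch (synthetic data and candidate rates are assumptions) of setting the shrinkage via the learning_rate parameter of scikit-learn's GradientBoostingClassifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

for lr in (1.0, 0.3, 0.1):
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=lr)
    scores = cross_val_score(model, X, y, cv=5)
    # Smaller learning rates usually need more trees but generalize better.
    print(f"learning_rate={lr}: mean accuracy={scores.mean():.3f}")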
11.4.5 Regularization:
L1 and L2 regularization penalties can be implemented on leaf weight
values to slow down learning and prevent over-fitting. Gradient tree
boosting implementations often also use regularization by limiting the
minimum number of observations in trees’ terminal nodes.
11.4.6 Tree Constraints:
Number of trees
Tree depth
*****
12
EXAMPLES
Unit Structure
12.0 Examples
12.1 Example 1
12.2 Example 2
12.3 Gradient Boosting for classification
12.4 Gradient Boosting for regression
12.5 Gradient Boosting hyperparameters
12.6 Explore number of Samples
12.7 Explore Number of features
12.8 Explore learning rate
12.9 Explore Tree depth
12.10 Grid search hyperparameters
12.1 EXAMPLE 1
Gradient Boosting is a popular boosting algorithm. In gradient boosting,
each predictor corrects its predecessor’s error. There is a technique called
the Gradient Boosted Trees whose base learner is CART (Classification
and Regression Trees) [5].
The below diagram explains how gradient boosted trees are trained for
regression problems.
12.2 EXAMPLE 2
Gradient Boosting Scikit-Learn API:
First, confirm that you are using a modern version of the library by running
the following script:
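A minimal version check along these lines would be:

# check scikit-learn version
import sklearn
print(sklearn.__version__)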
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
Next, we can evaluate a Gradient Boosting algorithm on this dataset [3,9].
We will evaluate the model using repeated stratified k-fold cross-
validation, with three repeats and 10 folds. We will report the mean and
standard deviation of the accuracy of the model across all repeats and
folds [1].
Running the example reports the mean and standard deviation accuracy of
the model.
Gradient Boosting ensemble with default hyperparameters achieves a
classification accuracy of about 89.9 percent on this test dataset.
Mean Accuracy: 0.899 (0.030)
First, the Gradient Boosting ensemble is fit on all available data, then the
predict() function can be called to make predictions on new data.
The example below demonstrates this on our binary classification dataset.
Running the example fits the Gradient Boosting ensemble model on the
entire dataset and is then used to make a prediction on a new row of data,
as we might when using the model in an application.
Predicted Class: 1
Now that we are familiar with using Gradient Boosting for classification,
let’s look at the API for regression.
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
Next, we can evaluate a Gradient Boosting algorithm on this dataset.
As we did with the last section, we will evaluate the model using repeated
k-fold cross-validation, with three repeats and 10 folds. We will report the
mean absolute error (MAE) of the model across all repeats and folds. The
scikit-learn library makes the MAE negative so that it is maximized
instead of minimized. This means that larger negative MAE are better and
a perfect model has a MAE of 0.
The complete example is listed below [1].
Running the example reports the mean and standard deviation accuracy of
the model.
In this case, we can see the Gradient Boosting ensemble with default
hyperparameters achieves a MAE of about 62.
MAE: -62.475 (3.254)
We can also use the Gradient Boosting model as a final model and make
predictions for regression.
First, the Gradient Boosting ensemble is fit on all available data, then the
predict() function can be called to make predictions on new data.
The example below demonstrates this on our regression dataset [1].
Running the example fits the Gradient Boosting ensemble model on the
entire dataset and is then used to make a prediction on a new row of data,
as we might when using the model in an application.
Prediction: 37
Now that we are familiar with using the scikit-learn API to evaluate and
use Gradient Boosting ensembles, let’s look at configuring the model [1].
Running the example first reports the mean accuracy for each configured
number of decision trees.
In this case, we can see that that performance improves on this dataset
until about 500 trees, after which performance appears to level off. Unlike
AdaBoost, Gradient Boosting appears to not overfit as the number of trees
is increased in this case [1].
A box and whisker plot is created for the distribution of accuracy scores
for each configured number of trees.
We can see the general trend of increasing model performance and
ensemble size.
In this case, we can see that mean performance is probably best for a
sample size that is about half the size of the training dataset, such as 0.4 or
higher [1, 4, 7].
A box and whisker plot is created for the distribution of accuracy scores
for each configured number of trees [1, 4, 7].
We can see the general trend of increasing model performance perhaps
peaking around eight or nine features and staying somewhat level.
Box Plot of Gradient Boosting Ensemble Number of Features vs. Classification Accuracy
This highlights the trade-off between the number of trees (speed of
training) and learning rate, e.g. we can fit a model faster by using fewer
trees and a larger learning rate.
A box and whisker plot is created for the distribution of accuracy scores
for each configured number of trees.
Running the example first reports the mean accuracy for each configured
tree depth.
Performance improves with tree depth, perhaps peaking around a depth of
3 to 6, after which the deeper, more specialized trees result in worse
performance.
A box and whisker plot is created for the distribution of accuracy scores
for each configured tree depth.
We can see the general trend of increasing model performance with the
tree depth to a point, after which performance begins to degrade rapidly
with the over-specialized trees.
12.10 GRID SEARCH HYPERPARAMETERS [1,4,7]
Gradient boosting can be challenging to configure, as the algorithm has
many key hyperparameters that influence the behavior of the model on
training data, and the hyperparameters interact with each other.
As such, it is a good practice to use a search process to discover a
configuration of the model hyperparameters that works well or best for a
given predictive modeling problem. Popular search processes include a
random search and a grid search.
In this section we will look at grid searching common ranges for the key
hyperparameters for the gradient boosting algorithm that you can use as
starting point for your own projects. This can be achieving using the
GridSearchCV class and specifying a dictionary that maps model
hyperparameter names to the values to search.
In this case, we will grid search four key hyperparameters for gradient
boosting: the number of trees used in the ensemble, the learning rate,
subsample size used to train each tree, and the maximum depth of each
tree. We will use a range of popular well performing values for each
hyperparameter.
Each configuration combination will be evaluated using repeated k-fold
cross-validation and configurations will be compared using the mean
score, in this case, classification accuracy.
The complete example of grid searching the key hyperparameters of the
gradient boosting algorithm on our synthetic classification dataset is listed
below.
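A minimal sketch along the lines described (the synthetic dataset and candidate value grids are assumptions) is:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# candidate values for the four key hyperparameters named above
grid = {
    "n_estimators": [10, 50, 100, 500],
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9],
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
search = GridSearchCV(GradientBoostingClassifier(), grid,
                      scoring="accuracy", cv=cv, n_jobs=-1)
result = search.fit(X, y)

print("Best: %f using %s" % (result.best_score_, result.best_params_))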
*****
UNIT IX
13
XG BOOST
Unit Structure
13.1 XG Boost
13.1.0 Boosting
13.1.1 Using XGBoost in Python
13.1.2 k- fold cross validation using XGBoost
13.1.3 XGBoost Installation Guide
13.2 Voting Ensembles
13.2.1 Voting ensemble for classification
13.2.2 Hard voting ensemble for classification
13.1 XG BOOST
Extreme Gradient Boosting (XG Boost) is an upgraded implementation of
the Gradient Boosting Algorithm, which is developed for high
computational speed, scalability, and better performance [1-4,7].
XG Boost has various features, which are as follows:
1. Parallel Processing
2. Cross-Validation
3. Cache Optimization
4. Distributed Computing
13.1.0 Boosting [1-4, 7]:
Four classifiers (in 4 boxes) are trying to classify + and - classes as homogeneously as possible.
1. Box 1: The first classifier (usually a decision stump) creates a vertical
line (split) at D1. It says anything to the left of D1 is + and anything to the
right of D1 is -. However, this classifier misclassifies three + points.
Note: a Decision Stump is a Decision Tree model that only splits off at
one level, therefore the final prediction is based on only one feature.
2. Box 2: The second classifier gives more weight to the three +
misclassified points (see the bigger size of +) and creates a vertical line at
D2. Again it says, anything to the right of D2 is - and left is +. Still, it
makes mistakes by incorrectly classifying three - points.
3. Box 3: Again, the third classifier gives more weight to the three -
misclassified points and creates a horizontal line at D3. Still, this classifier
fails to classify the points (in the circles) correctly.
4. Box 4: This is a weighted combination of the weak classifiers (Box 1, 2 and 3). As you can see, it does a good job at classifying all the points correctly.
That's the basic idea behind boosting algorithms: build a weak model, draw conclusions about the various feature importances and parameters, and then use those conclusions to build a new, stronger model that capitalizes on the misclassification error of the previous model and tries to reduce it. Now, let's come to XGBoost. To begin with, you should know
about the default base learners of XGBoost: tree ensembles. The tree
ensemble model is a set of classification and regression trees (CART).
Trees are grown one after another, and attempts to reduce the misclassification rate are made in subsequent iterations. Here’s a simple example of a CART that classifies whether someone will like computer games, taken straight from the XGBoost documentation.
If you check the image in Tree Ensemble section, you will notice each tree
gives a different prediction score depending on the data it sees and the
scores of each individual tree are summed up to get the final score.
# load the Boston housing dataset bundled with scikit-learn
# (note: load_boston was removed in scikit-learn 1.2, so an older version is assumed here)
from sklearn.datasets import load_boston
boston = load_boston()
The boston variable itself is a dictionary, so you can check for its keys
using the .keys() method.
print(boston.keys())
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
You can easily check for its shape by using the boston.data.shape attribute,
which will return the size of the dataset.
print(boston.data.shape)
(506, 13)
As you can see it returned (506, 13), that means there are 506 rows of data
with 13 columns. Now, if you want to know what the 13 columns are, you
can simply use the .feature_names attribute and it will return the feature
names.
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX'
'PTRATIO'
'B' 'LSTAT']
The description of the dataset is available in the dataset itself. You can take a look at it using .DESCR.
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes:
------
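The step that converts the loaded data into a pandas DataFrame is not visible in this extract; a minimal sketch, assuming the DataFrame is named data as in the code that follows, is:

import pandas as pd
# build a DataFrame from the feature matrix, using the dataset's own column names
data = pd.DataFrame(boston.data, columns=boston.feature_names)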
You'll notice that there is no column called PRICE in the DataFrame. This
is because the target column is available in another attribute called
boston.target. Append boston.target to your pandas DataFrame.
data['PRICE'] = boston.target
Run the .info() method on your DataFrame to get useful information about
the data.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
...
TAX        506 non-null float64
...
XGBoost's hyperparameters:
At this point, before building the model, you should be aware of the tuning
parameters that XGBoost provides. Well, there are a plethora of tuning
parameters for tree-based learners in XGBoost and you can read all about
them in the XGBoost documentation. But the most common ones that you should know are:
learning_rate: step size shrinkage used to prevent overfitting. Range is
[0,1]
max_depth: determines how deeply each tree is allowed to grow during
any boosting round.
subsample: percentage of samples used per tree. Low value can lead to
underfitting.
colsample_bytree: percentage of features used per tree. High value can
lead to overfitting.
n_estimators: number of trees you want to build.
objective: determines the loss function to be used, e.g. reg:linear for regression problems, reg:logistic for classification problems that output only a decision, and binary:logistic for classification problems that output a probability.
XGBoost also supports regularization parameters to penalize models as
they become more complex and reduce them to simple (parsimonious)
models [1-4,7].
gamma: controls whether a given node will split based on the expected
reduction in loss after the split. A higher value leads to fewer splits.
Supported only for tree-based learners.
alpha: L1 regularization on leaf weights. A large value leads to more
regularization.
lambda: L2 regularization on leaf weights and is smoother than L1
regularization.
It's also worth mentioning that though you are using trees as your base
learners, you can also use XGBoost's relatively less popular linear base
learners and one other tree learner known as dart. All you have to do is set
the booster parameter to either gbtree (default), gblinear or dart.
Now, you will create the train and test set for cross-validation of the results using the train_test_split function from sklearn's model_selection module, with test_size equal to 20% of the data. Also, to maintain reproducibility of the results, a random_state is also assigned.
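The split-and-fit listing itself is not reproduced here. A sketch of this step, assuming the DataFrame data built above (with PRICE as its last column), might look like the following; the hyperparameter values simply mirror the params dictionary used with cv() below.

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# split features and target; PRICE is the last column appended earlier
X, y = data.iloc[:, :-1], data.iloc[:, -1]
# XGBoost's optimized data structure, reused later by cv()
data_dmatrix = xgb.DMatrix(data=X, label=y)
# hold out 20% of the data, with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)
# 'reg:linear' matches the params used below; newer XGBoost versions prefer 'reg:squarederror'
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10,
                          n_estimators=10)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
# report root mean squared error on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % rmse)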
Well, you can see that your RMSE for the price prediction came out to be around 10.8 (house prices in this dataset are expressed in units of $1000).
You will exclude n_estimators from the hyper-parameter dictionary because you will use num_boost_round instead.
You will use these parameters to build a 3-fold cross validation model by
invoking XGBoost's cv() method and store the results in a cv_results
DataFrame. Note that here you are using the DMatrix object you created before.
params = {"objective": "reg:linear", 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics="rmse", as_pandas=True, seed=123)
cv_results contains train and test RMSE metrics for each boosting round.
cv_results.head()
You can also visualize individual boosted trees using the plot_tree() function along with the number of trees you want to plot using the num_trees argument.
xg_reg = xgb.train(params=params, dtrain=data_dmatrix,
num_boost_round=10)
Plotting the first tree with the matplotlib library:
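A minimal sketch of that plotting step (plot_tree() additionally requires the graphviz package to be installed) could be:

import matplotlib.pyplot as plt
# plot the first boosted tree (index 0); requires graphviz
xgb.plot_tree(xg_reg, num_trees=0)
plt.rcParams['figure.figsize'] = [50, 10]
plt.show()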
These plots provide insight into how the model arrived at its final
decisions and what splits it made to arrive at those decisions.
Another way to visualize your XGBoost models is to examine the
importance of each feature column in the original dataset within the
model.
One simple way of doing this involves counting the number of times each
feature is split on across all boosting rounds (trees) in the model, and then
visualizing the result as a bar graph, with the features ordered according to
how many times they appear. XGBoost has a plot_importance() function
that allows you to do exactly this.
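A sketch of that call, reusing the xg_reg booster trained above, might be:

# plot features ordered by how often they are used for splits
xgb.plot_importance(xg_reg)
plt.rcParams['figure.figsize'] = [5, 5]
plt.show()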
As you can see the feature RM has been given the highest importance
score among all the features.
Example 2:
XGBoost Regression API [1-4,7]
XGBoost can be installed as a standalone library and an XGBoost model
can be developed using the scikit-learn API.
Install the XGBoost library.
sudo pip install xgboost
You can then confirm that the XGBoost library was installed correctly and
can be used by running the following script.
# check xgboost version
import xgboost
print(xgboost.__version__)
Running the script will print your version of the XGBoost library you have
installed.
Your version should be the same or higher. If not, you must upgrade your
version of the XGBoost library.
If you do have errors when trying to run the above script, I recommend
downgrading to version 1.0.1 (or lower). This can be achieved by
specifying the version to install to the pip command, as follows:
sudo pip install xgboost==1.0.1
If you require specific instructions for your development environment, see the XGBoost installation guide.
Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.
When using machine learning algorithms that have a stochastic learning
algorithm, it is good practice to evaluate them by averaging their
performance across multiple runs or repeats of cross-validation. When
fitting a final model, it may be desirable to either increase the number of
trees until the variance of the model is reduced across repeated
evaluations, or to fit multiple final models and average their predictions.
Let’s take a look at how to develop an XGBoost ensemble for regression.
# load the housing dataset
from pandas import read_csv
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
Running the example confirms the 506 rows of data and 13 input variables
and a single numeric target variable (14 in total). We can also see that all
input variables are numeric.
(506, 14)
0 1 2 3 4 5 ... 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 ... 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 ... 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 ... 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 ... 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 ... 3 222.0 18.7 396.90 5.33 36.2
[5 rows x 14 columns]
Next, let’s evaluate a regression XGBoost model with default
hyperparameters on the problem.
First, we can split the loaded dataset into input and output columns for
training and evaluating a predictive model.
...
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
Next, we can create an instance of the model with a default configuration.
...
# define model
model = XGBRegressor()
We will evaluate the model using the best practice of repeated k-fold
cross-validation with 3 repeats and 10 folds.
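The cv object used in the scoring call below is not defined in this extract; it is assumed to be a RepeatedKFold instance, for example:

...
# define the evaluation procedure: 10 folds, repeated 3 times
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)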
The scikit-learn library reports the mean absolute error (MAE) as a negative value so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.
...
# evaluate model
scores = cross_val_score(model, X, y,
scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
In this case, because the scores were made negative, we can use the
absolute() NumPy function to make the scores positive.
...
scores = absolute(scores)
scores = cross_val_score(model, X, y,
scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )
Running the example evaluates the XGBoost Regression algorithm on the
housing dataset and reports the average MAE across the three repeats of
10-fold cross-validation.
In this case, we can see that the model achieved a MAE of about 2.1.
This is a good score, better than the baseline, meaning the model has skill
and close to the best score of 1.9.
Mean MAE: 2.109 (0.320)
We may decide to use the XGBoost Regression model as our final model
and make predictions on new data.
This can be achieved by fitting the model on all available data and calling
the predict() function, passing in a new row of data.
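That final-model listing is not reproduced here; a self-contained sketch, reusing the first row of the dataset as a stand-in for new data (rather than the original listing's values), might be:

from pandas import read_csv
from xgboost import XGBRegressor
# reload and split the housing data (same URL as above)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# fit the final model on all available data
model = XGBRegressor()
model.fit(X, y)
# use the first row of the dataset as a stand-in for a new, unseen row
row = X[0:1, :]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])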
13.2 VOTING ENSEMBLES
A voting ensemble combines the predictions from multiple models. For classification, two approaches are common:
Hard Voting: Predict the class with the largest sum of votes from models.
Soft Voting: Predict the class with the largest summed probability from models.
A voting ensemble may be considered a meta-model, a model of models.
As a meta-model, it could be used with any collection of existing trained
machine learning models and the existing models do not need to be aware
that they are being used in the ensemble. This means you could explore
using a voting ensemble on any set or subset of fit models for your
predictive modeling task.
A voting ensemble is appropriate when you have two or more models that
perform well on a predictive modeling task. The models used in the
ensemble must mostly agree with their predictions.
Voting is provided via the VotingRegressor and VotingClassifier classes.
Both models operate the same way and take the same arguments. Using
the model requires that you specify a list of estimators that make
predictions and are combined in the voting ensemble.
A list of base models is provided via the “estimators” argument. This is a
Python list where each element in the list is a tuple with the name of the
model and the configured model instance. Each model in the list must
have a unique name.
Now that we are familiar with the voting ensemble API in scikit-learn,
let’s look at some worked examples.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=2)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
Next, we will demonstrate hard voting and soft voting for this dataset.
Hard Voting Ensemble for Classification
# evaluate a given model using repeated stratified k-fold cross-validation
# (the enclosing function definition is reconstructed from the surviving fragment)
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                             n_jobs=-1, error_score='raise')
    return scores
We can then report the mean performance of each algorithm, and also
create a box and whisker plot to compare the distribution of accuracy
scores for each algorithm.
# compare hard voting to standalone classifiers
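The listing that followed this comment is not reproduced in this extract. A minimal self-contained sketch, assuming k-nearest-neighbor classifiers with different numbers of neighbors as the standalone base models (an assumption, since the original base models are not shown here) and repeating the evaluate_model() helper so the listing stands alone, might look like this:

# hedged sketch: compare a hard voting ensemble to standalone KNN models
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

def get_models():
    # standalone KNN models with increasing numbers of neighbors (illustrative choices)
    models = dict()
    for k in [1, 3, 5, 7, 9]:
        models['knn%d' % k] = KNeighborsClassifier(n_neighbors=k)
    # hard voting ensemble built over the same base models
    estimators = [(name, m) for name, m in models.items()]
    models['hard_voting'] = VotingClassifier(estimators=estimators, voting='hard')
    return models

def evaluate_model(model, X, y):
    # repeated stratified k-fold cross-validation, scored by accuracy
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                           n_jobs=-1, error_score='raise')

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=2)
results, names = list(), list()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, scores.mean(), scores.std()))
# box and whisker plot of the accuracy distributions
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()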
Running the example first reports the mean and standard deviation
accuracy for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and compare the average outcome.
We can see the hard voting ensemble achieves a better classification
accuracy of about 90.2% compared to all standalone versions of the
model.
First, the hard voting ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.
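A short sketch of that step, reusing the dataset and the KNN base models assumed above:

# fit the final hard voting ensemble on all data and predict one new row
members = [('knn%d' % k, KNeighborsClassifier(n_neighbors=k)) for k in [1, 3, 5, 7, 9]]
ensemble = VotingClassifier(estimators=members, voting='hard')
ensemble.fit(X, y)
# reuse an existing row as a stand-in for new data
row = X[0:1, :]
print('Predicted Class: %d' % ensemble.predict(row)[0])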
Running the example fits the hard voting ensemble model on the entire
dataset and is then used to make a prediction on a new row of data, as we
might when using the model in an application.
Predicted Class: 1
Soft Voting Ensemble for Classification
We can demonstrate soft voting with the support vector machine (SVM)
algorithm.
The SVM algorithm does not natively predict probabilities, although it can
be configured to predict probability-like scores by setting the “probability”
argument to “True” in the SVC class.
We can fit five different versions of the SVM algorithm with a polynomial
kernel, each with a different polynomial degree, set via the “degree”
argument. We will use degrees 1-5.
Our expectation is that by combining the predicted class membership
probability scores predicted by each different SVM model that the soft
voting ensemble will achieve a better predictive performance than any
standalone model used in the ensemble, on average.
First, we can create a function named get_voting() that creates the SVM
models and combines them into a soft voting ensemble.
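The original get_voting() listing is not shown in this extract; a minimal sketch under the assumptions stated above (SVC with polynomial kernels of degree 1 to 5 and probability=True), together with a get_models() helper for comparison, might be:

from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

def get_voting():
    # base models: SVCs that can emit probability-like scores
    models = list()
    for degree in range(1, 6):
        models.append(('svm%d' % degree,
                       SVC(probability=True, kernel='poly', degree=degree)))
    # combine them with soft (probability-averaging) voting
    return VotingClassifier(estimators=models, voting='soft')

def get_models():
    # standalone SVMs plus the soft voting ensemble, for comparison
    models = dict()
    for degree in range(1, 6):
        models['svm%d' % degree] = SVC(probability=True, kernel='poly', degree=degree)
    models['soft_voting'] = get_voting()
    return models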
We can evaluate and report model performance using repeated k-fold
cross-validation as we did in the previous section.
Tying this together, the complete example is listed below.
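The complete listing is not reproduced here; the comparison loop, reusing get_models() and evaluate_model() as sketched earlier, would look much like the hard voting example:

from matplotlib import pyplot
results, names = list(), list()
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, scores.mean(), scores.std()))
# box and whisker plot of the accuracy distributions
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()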
Running the example first reports the mean and standard deviation
accuracy for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and compare the average outcome.
We can see the soft voting ensemble achieves a better classification
accuracy of about 92.4% compared to all standalone versions of the
model.
If we choose a soft voting ensemble as our final model, we can fit and use
it to make predictions on new data just like any other model.
First, the soft voting ensemble is fit on all available data, then the predict()
function can be called to make predictions on new data.
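A short sketch of that step, reusing the get_voting() soft voting ensemble sketched above:

# fit the final soft voting ensemble on all data and predict one new row
ensemble = get_voting()
ensemble.fit(X, y)
row = X[0:1, :]
print('Predicted Class: %d' % ensemble.predict(row)[0])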
Running the example fits the soft voting ensemble model on the entire
dataset and is then used to make a prediction on a new row of data, as we
might when using the model in an application.
Predicted Class: 1
Voting Ensemble for Regression
We will look at using voting for a regression problem.
First, we can use the make_regression() function to create a synthetic
regression problem with 1,000 examples and 20 input features.
The complete example is listed below.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20,
n_informative=15, noise=0.1, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the
input and output components.
(1000, 20) (1000,)
We can demonstrate ensemble voting for regression with a decision tree algorithm, sometimes referred to as a classification and regression tree (CART) algorithm.
We can fit five different versions of the CART algorithm, each with a
different maximum depth of the decision tree, set via the “max_depth”
argument. We will use depths of 1-5.
Our expectation is that by combining the values predicted by each
different CART model that the voting ensemble will achieve a better
predictive performance than any standalone model used in the ensemble,
on average.
First, we can create a function named get_voting() that creates each CART
model and combines the models into a voting ensemble.
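The original listing is not reproduced here; a minimal sketch under the stated assumptions (DecisionTreeRegressor base models with max_depth values of 1 to 5) might be:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor

def get_voting():
    # base models: decision trees with increasing maximum depth
    models = list()
    for depth in range(1, 6):
        models.append(('cart%d' % depth, DecisionTreeRegressor(max_depth=depth)))
    # VotingRegressor averages the predictions of the base models
    return VotingRegressor(estimators=models)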
Running the example first reports the mean and standard deviation of the error score for each model.
Note: Your results may vary given the stochastic nature of the algorithm
or evaluation procedure, or differences in numerical precision. Consider
running the example a few times and compare the average outcome.
We can see the voting ensemble achieves a better mean squared error of
about -136.338, which is larger (better) compared to all standalone
versions of the model.
If we choose a voting ensemble as our final model, we can fit and use it to
make predictions on new data just like any other model.
First, the voting ensemble is fit on all available data, then the predict()
function can be called to make predictions on new data.
The example below demonstrates this on our regression dataset.
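A sketch of that example, reusing the synthetic regression dataset and the CART-based get_voting() sketched above:

from sklearn.datasets import make_regression
# recreate the synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=1)
# fit the final voting ensemble on all available data
ensemble = get_voting()
ensemble.fit(X, y)
# reuse an existing row as a stand-in for new data
row = X[0:1, :]
print('Predicted Value: %.3f' % ensemble.predict(row)[0])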
Running the example fits the voting ensemble model on the entire dataset
and is then used to make a prediction on a new row of data, as we might
when using the model in an application.
Predicted Value: 141.319
*****
14
DEPLOYMENT OF MACHINE LEARNING
ALGORITHMS
Unit Structure
14.1 Deploy your Machine Learning Models
14.1.0 How to deploy machine learning models
14.1.1 Test and clean code ready for deployment
14.1.2 Prepare the model for container deployment
14.1.3 Beyond machine learning deployment
14.1.4 Challenges for machine learning deployment
14.2 Ways to Deploy Machine Learning Models in Production
14.2.1 To create a machine learning web service, you need at least
three steps
14.2.2 Deploying machine learning models for batch prediction
14.2.3 Deploying machine learning models on edge devices as
embedded models
References
MOOCs
API
Video Lectures
Quiz
Accurately explaining the results of a model is a key part of the machine learning oversight process. Clarity around development is needed for the results and predictions to be accepted in a business setting. For this reason, a clear explanatory document or 'read me' file should be produced.
There are three simple steps to prepare for deployment at this stage:
Create a 'read me' file to explain the model in detail ready for deployment by the development team.
Clean and scrutinise the code and functions and ensure clear naming conventions using a style guide.
Ongoing monitoring and maintenance after deployment means machine learning will be effective in an organisation for the long term.
Containers also provide a platform for collaboration between data scientists and the development team, helping to simplify the deployment process.
Model testing and validation are not included here to keep it simple. But do remember those are an integral part of any machine learning project.
In the next step, we need to persist the model. The environment where we
deploy the application is often different from where we train them.
Training usually requires a different set of resources. Thus this separation
helps organizations optimize their budget and efforts.
Scikit-learn models can be persisted with joblib, a Python-specific serialization library that makes model persistence and restoration effortless. The following is an example of how we can store the trained model in a pickle file.
# joblib was removed from sklearn.externals in recent scikit-learn versions; import it directly
import joblib
joblib.dump(classifier, 'classifier.pkl')
Finally, we can serve the persisted model using a web framework. The
following code creates a REST API using Flask. This file is hosted in a
different environment, often in a cloud server.
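The original Flask listing is not reproduced in this extract. A minimal sketch of such a service, in which the /predict route and the JSON payload format are illustrative assumptions, might be:

import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
# load the model persisted earlier with joblib.dump()
classifier = joblib.load('classifier.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # expect a JSON body such as {"features": [[...], [...]]}
    features = np.array(request.get_json()['features'])
    prediction = classifier.predict(features)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)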
14.2.2 Deploying machine learning models for batch prediction [12]:
While online models can serve predictions on demand, batch predictions are sometimes preferable.
Offline models can be optimized to handle a high volume of job instances
and run more complex models. In batch production mode, you don't need
to worry about scaling or managing servers either.
Batch prediction can be as simple as calling the predict function with a
data set of input variables. The following command does it.
prediction = classifier.predict(UNSEEN_DATASET)
Sometimes you will have to schedule the training or prediction in the
batch processing method. There are several ways to do this. My favorite is
to use either Airflow or Prefect to automate the task.
import requests
from datetime import timedelta, datetime
import pandas as pd
import joblib
from prefect import task, Flow
from prefect.schedules import IntervalSchedule

# retry 3 times at a 5-minute interval, as described in the docstring
@task(max_retries=3, retry_delay=timedelta(minutes=5))
def predict(input_data_path: str):
    """
    This task loads the saved model and the input data, and returns predictions.
    If it fails, the task will retry 3 times at 5-minute intervals before
    failing permanently.
    """
    # body reconstructed to match the docstring; file names are illustrative
    classifier = joblib.load('classifier.pkl')
    data = pd.read_csv(input_data_path)
    return classifier.predict(data)
Reduced latency, as the device is likely to be closer to the user than a server far away.
Reduced data bandwidth consumption, as we ship processed results back to the cloud instead of raw data, which is larger and would eventually consume more bandwidth.
Edge devices such as mobile and IoT devices have limited computation
power and storage capacity due to the nature of their hardware. We cannot
simply deploy machine learning models to these devices directly,
especially if our model is big or requires extensive computation to run
inference on them.
Instead, we should simplify the model using techniques such as
quantization and aggregation while maintaining accuracy. These
simplified models can be deployed efficiently on edge devices with
limited computation, memory, and storage.
We can use the TensorFlow Lite library on Android to simplify our
TensorFlow model. TensorFlow Lite is an open-source software library
for mobile and embedded devices that tries to do what the name says: run
TensorFlow models in Mobile and Embedded platforms.
The following example converts a Keras TensorFlow model.
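The conversion listing is not reproduced here; a minimal sketch, using a tiny stand-in Keras model in place of the real trained model being deployed, might be:

import tensorflow as tf
# tiny stand-in Keras model (placeholder for the real trained model)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
# convert the Keras model to the TensorFlow Lite flat-buffer format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# write the converted model to disk for deployment on the device
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)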
REFERENCES
1. Quick Introduction to Boosting Algorithms in Machine Learning. https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/ [Last Accessed on 10.03.2022]
2. Boosting Algorithms Explained. https://towardsdatascience.com/boosting-algorithms-explained-d38f56ef3f30 [Last Accessed on 10.03.2022]
3. A Comprehensive Guide To Boosting Machine Learning Algorithms. https://www.edureka.co/blog/boosting-machine-learning/ [Last Accessed on 10.03.2022]
4. Essence of Boosting Ensembles for Machine Learning. https://machinelearningmastery.com/essence-of-boosting-ensembles-for-machine-learning/ [Last Accessed on 10.03.2022]
5. Boosting in Machine Learning | Boosting and AdaBoost. https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/ [Last Accessed on 10.03.2022]
6. DataCamp. https://www.datacamp.com/ [Last Accessed on 10.03.2022]
7. Machine Learning Plus Platform. https://www.machinelearningplus.com/ [Last Accessed on 10.03.2022]
8. Weights & Biases with Gradient. https://blog.paperspace.com/ [Last Accessed on 10.03.2022]
9. Build a Machine Learning Web App in 5 Minutes. https://www.aiproblog.com/ [Last Accessed on 10.03.2022]
10. AdaBoost Algorithm. https://www.educba.com/adaboost-algorithm/ [Last Accessed on 10.03.2022]
11. Implementing the AdaBoost Algorithm From Scratch. https://www.geeksforgeeks.org/implementing-the-adaboost-algorithm-from-scratch/?ref=gcse [Last Accessed on 10.03.2022]
12. Optimisation algorithms for differentiable functions. https://www.seldon.io/algorithm-optimisation-for-machine-learning [Last Accessed on 10.03.2022]
13. Quiz – Machine Learning. https://mcqmate.com/ [Last Accessed on 10.03.2022]
TUTORIALS
BOOKS
1. Schapire RE, Freund Y. Boosting: Foundations and algorithms.
Kybernetes. 2013 Jan 4.
2. Zhou ZH. Ensemble methods: foundations and algorithms. CRC
press; 2012 Jun 6.
3. Mohri M, Rostamizadeh A, Talwalkar A. Foundations of machine
learning. MIT press; 2018 Dec 25.
4. Data Mining: Practical Machine Learning Tools and Techniques, 2016.
MOOCS
APIS
1. Ensemble methods scikit-learn API.
2. sklearn.ensemble.VotingClassifier API.
3. sklearn.ensemble.VotingRegressor API.
VIDEO LECTURES
QUIZ
1. Ensemble learning can only be applied to supervised learning
methods.
A. True
B. False
2. Ensembles will yield bad results when there is significant diversity
among the models.
Note: All individual models have meaningful and good predictions.
A. true
B. false
3. Which of the following is / are true about weak learners used in
ensemble model?
1. They have low variance and they don't usually overfit
2. They have high bias, so they can not solve hard learning problems
3. They have high variance and they don't usually overfit
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. none of these
4. Ensemble of classifiers may or may not be more accurate than any of
its individual model.
A. true
B. false
5. If you use an ensemble of different base models, is it necessary to
tune the hyper parameters of all base models to improve the ensemble
performance?
A. yes
B. no
C. can't say
6. Generally, an ensemble method works better, if the individual base
models have ____________?
Note: Suppose each individual base model has accuracy greater than 50%.
A. bagging
B. boosting
C. a or b
D. none of these
8. Suppose there are 25 base classifiers. Each classifier has error rates of
e = 0.35.
Suppose you are using averaging as ensemble technique. What will be
the probabilities that ensemble of above 25 classifiers will make a
wrong prediction?
Note: All classifiers are independent of each other
A. 0.05
B. 0.06
C. 0.07
D. 0.09
9. In machine learning, an algorithm (or learning algorithm) is said to be unstable if a small change in training data causes a large change in the learned classifiers. True or False: Bagging of unstable classifiers is a good idea.
A. true
B. false
10. Which of the following parameters can be tuned for finding good
ensemble model in bagging based algorithms?
1. Max number of samples
2. Max features
3. Bootstrapping of samples
4. Bootstrapping of features
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. all of above
11. How is the model capacity affected with dropout rate (where model
capacity means the ability of a neural network to approximate
complex functions)?
A. model capacity increases with increase in dropout rate
B. false
13. Suppose, you want to apply a stepwise forward selection method for
choosing the best models for an ensemble model. Which of the
following is the correct order of the steps?
Note: You have more than 1000 model predictions
1. Add the model predictions (or, in other terms, take the average) one by one to the ensemble when they improve the metrics on the validation set.
2. Start with an empty ensemble
3. Return the ensemble from the nested set of ensembles that has
maximum performance on the validation set
A. 1-2-3
B. 1-3-4
C. 2-1-3
D. none of above
14. Suppose, you have 2000 different models with their predictions and
want to ensemble predictions of best x models. Now, which of the
following can be a possible method to select the best x models for an
ensemble?
A. step wise forward selection
B. step wise backward elimination
C. both
D. none of above
15. Below are the two ensemble models:
1. E1(M1, M2, M3) and
2. E2(M4, M5, M6)
Above, Mx is the individual base models.
Which of the following are you more likely to choose if the following conditions for E1 and E2 are given?
E1: Individual Models accuracies are high but models are of the same type
or in another term less diverse
E2: Individual Models accuracies are high but they are of different types
in another term high diverse in nature
A. e1
B. e2
C. any of e1 and e2
D. none of these
16. In boosting, individual base learners can be parallel.
A. true
B. false
17. Which of the following is true about bagging?
1. Bagging can be parallel
2. The aim of bagging is to reduce bias not variance
3. Bagging helps in reducing overfitting
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of these
18. Suppose you are using stacking with n different machine learning
algorithms with k folds on data.
Which of the following is true about one level (m base models + 1
stacker) stacking?
Note: Here, we are working on binary classification problem
All base models are trained on all features
You are using k folds for base models
A. you will have only k features after the first stage
D. none of these
20. Which of the following can be one of the steps in stacking?
1. Divide the training data into k folds
2. Train k models on each k-1 folds and get the out of fold predictions
for remaining one fold
3. Divide the test data set in “k” folds and get individual fold
predictions by different algorithms
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of above
21. Which of the following are advantages of stacking?
1) More robust model
2) better prediction
3) Lower time of execution
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of the above
22. Which of the following are correct statement(s) about stacking?
1. A machine learning model is trained on predictions of multiple machine learning models
2. A Logistic regression will definitely work better in the second stage as compared to other classification methods
3. First stage models are trained on full / partial feature space of training data
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. all of above
23. Which of the following is true about weighted majority votes?
1. We want to give higher weights to better performing models
2. Inferior models can overrule the best model if collective weighted
votes for inferior models is higher than best model
3. Voting is special case of weighted voting
A. 1 and 3
B. 2 and 3
C. 1 and 2
D. 1, 2 and 3
24. Which of the following is true about averaging ensemble?
A. it can only be used in classification problem
B. it can only be used in regression problem
D. all of above
*****