Python ML Book
We are writing this book as a reference. It is not meant to be a replacement for the live classes that you are required to attend.
We have not included the following topics in the book because they are largely driven by code examples and are better suited to watching the videos and coding along:
Web Scraping
Accessing twitter data with API
Model stacking
Python pipelines
Putting python model in production
The rest of the chapters are supposed to be read, in sequence, before the corresponding class happens, so that when you come to class you have your questions ready. Here are a few pointers which will help you have an overall good learning experience:
Learning is almost never linear, meaning things will not make sense to you in the same sequence in which you read them. Some concepts will click, many times, when you are reading something entirely different. Give yourself that time; be patient.
When you go through something a second time, it makes more sense because you have gathered more context and it becomes easier to connect the dots. The lesson here is: don't get hung up on one point if it doesn't make sense. Go through the entire chapter and it's quite possible that, with greater context, you'll be able to connect the dots better.
There is absolutely no other way to learn programming than to code yourself. You'll never learn programming by just reading a book or just watching a video.
While programming, remembering syntax enables you to be faster while you code, but mugging up syntax is not the goal of learning to program. Syntax is something that you'll retain more and more of as you spend more and more time coding. Focus more on the logic of things.
Any programming language has an overall theme for doing things in a certain way. Don't get too hung up on why it doesn't behave in exactly the same way you expected it to. Especially when you are at the very beginning of your learning-to-program curve, it's okay to accept a few things as they are.
Many chapters will be mathematically heavy; most of the time it'll be difficult to get it all in one go. Keep in mind that if you understand the mathematics, it makes you more confident as a data scientist, but it is not absolutely pivotal to solving the problem using the programming concepts that you learn. Many seasoned data scientists in industry do not fully understand everything which goes on under the hood. Don't let that become a hurdle for you.
Make your own notes, despite having all this material given to you. Making your own notes is one very efficient way for your brain to retain the concepts that you are going through.
Good luck with all the learning, and let us know if you find any issues or have any suggestions. Reach out to us here :
Chapter 1 : Python Fundamentals
Our assumption in starting with this module is that you have already installed the Anaconda distribution for Python using the links given on the LMS, although a better option is to simply Google "Download Anaconda" and follow the links [as the link on the LMS might be an old one, not updated as per the latest release]. Key things before we start with Python programming:
1. Make sure you go through the videos and have chosen a Python editor which you like. (We use Jupyter notebooks in the course; the scripts provided will be notebooks. You can use Spyder as well, there will be no difference in syntax.)
2. Make sure that you code along, with the videos as well as this book. There is no shortcut to learning programming except writing/modifying code on your own, executing it and troubleshooting.
Let's begin
1 x=4
2 y=5
3 x+y
9
For checking what value a particular object holds, you can simply write the object name in the cell and execute; the object's value will be displayed in the output.
1 x
4
Just like R, everything in Python is also case sensitive. If we now try to execute X [X in caps]:
1 X
----> 1 X
NameError: name 'X' is not defined
You can see that Python doesn't recognize X in caps as an object name because we have not created one; we named our object with a lowercase x. As an important side note, error messages in Python are printed with a complete traceback [most of which is generally not very helpful]; the best place to start debugging is from the bottom.
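As a side note, Python reserves a set of keywords which cannot be used as object names. One way to list them (a minimal sketch, assuming the standard library keyword module) is:
1 import keyword
2 keyword.kwlist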
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def',
'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import',
'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try',
'while', 'with', 'yield']
Object Types and Typecasting
Although we don't need to explicitly declare the type of an object at the time of creation, Python does assign a type to variables on its own. Anything between quotes (there is no difference between single or double quotes) is considered to be of string/character type.
1 x=25
2 y='lalit'
3 type(x)
int
1 type(y)
str
int here means integer and str means string. This assigned type is important as operations are allowed/not allowed depending on the type of the variables, irrespective of what values are stored in them. For example, adding a number to a string is going to throw an error. Let's try doing that anyway and see what happens.
1 x='34' # python will consider this to be a string because the value is within quotes
2 x+3
However, we can change the type of the object using a typecasting function and then do the operation.
1 int(x)+3
37
Just applying this function on x doesn't change its type permanently; it outputs an integer on which the addition operation could be done. The type of x is still str.
1 type(x)
str
If we do want to change the type of the object, we need to equate the output of the typecasting function to the original object itself.
1 x=int(x)
2 type(x)
int
One last thing about typecasting: unlike R, if the data contained in an object is such that it cannot be converted to a certain type, then you'll get an error instead of a missing value as output.
1 int('king')
----> 1 int('king')
ValueError: invalid literal for int() with base 10: 'king'
Numeric Operations
The usual arithmetic operations can be used for numeric type objects (int, float, bool) by using the appropriate universal symbols. Let's look at some examples.
1 x=2
2 y=19
3 x+y
21
1 x-y
-17
1 x*y
38
1 x/y
0.10526315789473684
For exponents/powers, Python exclusively uses double asterisks. The ^ operator, which can be used for exponents in R, should not be used in Python for exponents (it does a bit-wise XOR operation).
1 x**y
524288
You can of course write complex calculations using parentheses. In order to save the results of these computations, you simply need to equate them to an object name. Notice that when you do that, an explicit output will not be printed as we saw earlier.
1 z=(x/y)**(x*y+3)
If you want to see the outcome, you can simply type the object name where you stored the result.
1 z
8.190910549494282e-41
1 import math
2 math.log(34)
3.5263605246161616
1 math.exp(3.5263605246161616)
34.00000000000001
Boolean Objects
There are two values which Boolean objects can take True and False . They need to be spelled as is
for them to be considered Boolean values
4
1 x=True
2 y=False
3 type(x),type(y)
(bool, bool)
you can do usual Boolean operations using both words and symbols
1 x and y , x & y
(False, False)
1 x or y , x | y
(True, True)
1 not x , not y
(False, True)
Note that for negation, ! is not used as we did in R. In Python you simply use the keyword not for reversing Boolean values.
Writing Conditions
In practice , it doesn't happen too often that we create objects with Boolean values explicitly .
Boolean values are usually results of conditions that we apply on other objects containing data. Here
are some examples
Equality Condition
1 x=34
2 y=12
3 x==y
False
Since x is not equal to y, the outcome of the condition is False. Note carefully that for writing the equality condition, we used two equals signs; a single equals sign is reserved for assignment.
In-Equality Condition
1 x!=y
True
1 x>y
True
1 x<y
False
1 x>= y
True
1 x<=y
False
These conditions can be written for character data also. In case of equality, the strings should match exactly, including the lower/upper case of characters. In case of less-than/greater-than comparisons, the result [True/False] depends on the dictionary order of the strings [not their length].
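A quick sketch of such comparisons (the strings used here are chosen just for illustration):
1 'python'=='Python' , 'python'=='python'
(False, True)
1 'apple' < 'banana' , 'apple' < 'Banana'
(True, False)
In Python's ordering, all uppercase letters come before all lowercase letters, which is why 'Banana' sorts before 'apple'.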
You can use the operators in and not in to check whether some string/element is present in a string/list. The examples here do not show this in the context of lists, because we haven't introduced them yet.
1 x='python'
2 'py' in x
True
1 'py' not in x
False
Compound Conditions
You can use parentheses to write a compound condition, which is essentially a combination of multiple individual conditions. Here is an example.
1 x=45
2 y=67
3 (x > 20 and y<10) or (x ==4 or y > 15 )
True
String Operations
1 x='Python'
2 y="Data"
All string data will need to be within quotes . Note that it doesn't matter whether you are using
double quotes or single quotes
1 len(x),len(y)
(6, 4)
Duplicating strings
When a string is multiplied by an integer ( not decimals/float) , it results in duplication of the base
string
1 'python'*3
'pythonpythonpython'
Concatenating strings
1 x+y
'PythonData'
1 z.lower()
1 z
Note that these functions give explicit output [they do not make inplace changes in the object itself]. If you want to change z itself, you'll need to equate the function call to the object itself.
1 z=z.lower()
2 z
1 z.upper()
1 z.capitalize()
1 z.rjust(20)
We just added some white spaces [to the left side] of our string z. Note that the number passed to the function rjust represents the length of the string after adding the white spaces; it does not represent the number of white spaces being added to the string. If this input is smaller than the length of the string in question, then the result is the same as the original string, with no changes whatsoever.
1 len(z.rjust(20))
20
1 z.rjust(3)
Try for yourself and see what these functions do: ljust, center.
Let's first add leading and trailing spaces to our string, and then remove those spaces using the available string functions in Python (a sketch follows the next code cell).
1 z=z.center(30)
2 z
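The cells below are a sketch of the standard whitespace-removal string methods (strip, lstrip and rstrip are built-in): lstrip removes leading spaces, rstrip removes trailing spaces and strip removes both.
1 z.lstrip()
1 z.rstrip()
1 z.strip()
The replace() function used in the next cells substitutes occurrences of one substring with another; the optional third argument limits how many occurrences [counted from the left] get replaced.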
1 z.replace('a',"@#!")
1 z.replace('a',"@#!",2)
Indexing : Strings and Lists
For the examples below, x holds the string 'Monty Python'. Each individual character of the string (including white spaces) is assigned an index: from left to right the index starts with 0, and from right to left the index starts with -1.
1 x='Monty Python'
1 x[6]
'P'
1 x[-9]
't'
multiple characters can be extracted as well [contiguous ranges only ] by passing range of indices
1 x[6:10]
'Pyth'
1 x[-12:-7]
'Monty'
Note that x[a:b] will give you the part of the string starting with index a and ending with index b-1 [the last value is not included in the output]. By default this assumes a step size of 1, from left to right. If the starting position happens to occur after the ending position, the result is simply an empty string [instead of an error].
1 x[-7:-12]
''
You can change step size by making use of third input here
1 x[1:9]
'onty Pyt'
1 x[1:9:2]
'ot y'
A step size of 2 means that, starting from the beginning, each next element taken is 2 positions away.
You can also have a negative step size; in that case the direction changes to right-to-left.
1 x[-7:-12:-1]
' ytno'
Just like step size ( which we don't always specify), we don't need to specify starting and ending
points either. In absence of starting position , first value of the strings becomes the default starting
point . In absence of ending position, last value of the strings becomes the default end point.
1 x[:4]
'Mont'
1 x[5:]
' Python'
Note: The same way of indexing will be used for lists as well, as mentioned in the title. However, in the case of lists, an index is assigned to each individual element. The rest of the behavior is the same as for strings, as we discussed here.
Lists
Lists are our first encounter with data structures in Python: objects which can hold more than one value. Many other data structures that we are going to come across in Python are derived from these.
Lists are collections of multiple objects (of any kind). You can create them by simply putting the values within square brackets [], separated by commas.
1 x=[20,92,43,83,"john","a","c",45]
2 len(x)
8
In general you are going to find that when you apply the len function on a data structure object, the outcome is how many elements it contains.
Indexing in the context of lists works just like strings, as mentioned earlier. The only difference is that for lists, each individual element is assigned an index. Here are some examples of using indices with lists; notice the similarity with strings.
1 x[4]
'john'
1 x[-2]
'c'
1 x[:4]
1 x[2:]
1 x[1:9:2]
1 x[-12:-5]
1 x[-5:-12]
[]
1 x[-5:-12:-1]
1 x
1 x[3]
83
1 x[3]='python'
2 x
It is not necessary that the assigned value has the same type as the original one. We can re-assign multiple values also, keeping in mind that the reassignment should be done with a list of values [it is not necessary that the number of values is the same as the original ones].
1 x
1 x[2:6]
1 x[2:6]=[-10,'doe',-20]
2 x
1 x=[20,92,43,83]
2 x=x+[-10,12]
3 x
Notice what happens when we try to add a list to a list using append.
1 len(x)
1 x.append(['a','b','c'])
2 x
[20, 92, 43, 83, -10, 12, 100, ['a', 'b', 'c']]
1 len(x)
8
1 x[7]
['a', 'b', 'c']
Let's try to do a similar thing with the function extend instead.
1 x.extend([3,5,7])
2 x
[20, 92, 43, 83, -10, 12, 100, ['a', 'b', 'c'], 3, 5, 7]
1 len(x)
11
This time, all the elements of the list get added as individual elements.
The first three methods here, however, don't enable us to add elements at any desired position in the list. For that we need to use the function insert.
1 x
[20, 92, 43, 83, -10, 12, 100, ['a', 'b', 'c'], 3, 5, 7]
1 x.insert(4,'python')
2 x
[20, 92, 43, 83, 'python', -10, 12, 100, ['a', 'b', 'c'], 3, 5, 7]
The function pop removes the value at a specified location. If no location is specified, it simply removes (and returns) the last element from the list.
1 x
[20, 92, 43, 83, 'python', -10, 12, 100, ['a', 'b', 'c'], 3, 5, 7]
1 x.pop()
7
1 x
[20, 92, 43, 83, 'python', -10, 12, 100, ['a', 'b', 'c'], 3, 5]
If you specify a position, as shown in the next example, the element at that index gets removed from the list.
1 x.pop(4)
'python'
1 x
[20, 92, 43, 83, -10, 12, 100, ['a', 'b', 'c'], 3, 5]
This, however, might be a big hassle if we don't have prior information on where in the list the value which we want to remove resides. The function remove comes to your rescue.
1 x.remove(83)
2 x
This will throw an error if the value doesn't exist in the list
1 x.remove(83)
----> 1 x.remove(83)
ValueError: list.remove(x): x not in list
Another thing to note is that if there are multiple occurrences of the same value in the list, only the first occurrence will be removed at a time. You might need to run the remove function in a loop (which we will learn about in some time).
Next, the function sort sorts a list in place.
1 x=[2,3,40,2,14,2,3,11,71,26]
2 x.sort()
3 x
The default order of sorting is ascending; we can use the option reverse and set it to True if sorting in a descending manner is required.
1 x=[2,3,40,2,14,2,3,11,71,26]
2 x.sort(reverse=True)
3 x
Many times we might need to simply flip [reverse the order of] the values of a list without doing any sorting. The function reverse can be used for that.
1 x=[2,3,40,2,14,2,3,11,71,26]
2 x.reverse()
3 x
Next we look at controlling the flow of a program with conditional if-else blocks. We'll also learn to write repetitive code in a more concise manner with for and while loops.
if-else
1 x=12
2 if x%2==0:
3     print(x,' is even')
4 else :
5     print(x,' is odd')
12 is even
There are a couple of important things to learn here, both in the context of the if-else block and about Python in general.
a condition follows the keyword if
the colon indicates the end of the condition and the start of the if code block
if the condition is true, the code written inside the if block is executed
if the condition is not true, the code written in the else block is executed
Notice the indentation: instead of curly braces, Python uses levels of indentation to define code blocks; this makes the code easy to read
Let's look at one more example to understand the functionality of the if-else block and the importance of indentation. Let's say, given 3 numbers a, b and c, we are trying to find the maximum value among them (the values assumed for a, b and c are shown below).
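The values of a, b and c are not shown in the text; assume, for illustration, any three numbers consistent with the output further below, for example:
1 a=10
2 b=30
3 c=25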
1 if a>b :
2     if a>c:
3         mymax=a
4     else :
5         mymax=c
6 else :
7     if b>c:
8         mymax=b
9     else:
10         mymax=c
1 mymax
30
You can experiment with passing different numbers. The main lessons to take from this example are how if-else blocks can be nested, and how the level of indentation decides which block a statement belongs to.
1 x=[5,40,12,-10,0,32,4,3,6,72]
For loop
Let's say I wanted to do the odd/even exercise for all the numbers in this list. Technically we could do this by changing the value of x in the code that we had written, 10 times, or by writing 10 if-else blocks. It turns out that, given the for loop, we don't need to do that. Let's see.
1 for element in x:
2     if element%2==0:
3         print(element,' is even')
4     else:
5         print(element,' is odd')
5 is odd
40 is even
12 is even
-10 is even
0 is even
32 is even
4 is even
3 is odd
6 is even
72 is even
element here is known as the loop index/variable. It can be given any name, like generic objects in Python; element is not a fixed name
x/value_list can be any list or, in general, any iterable
the body of the for loop is executed as many times as there are values in the value_list
1 cities=['Mumbai','London','Bangalore','Pune','Hyderabad']
2 for i in cities:
3     num_chars=len(i)
4     print(i+ ':'+ str(num_chars))
Mumbai:6
London:6
Bangalore:9
Pune:4
Hyderabad:9
for loops can have multiple indices as well if the value list elements themselves are lists
1 x=[[1,2],['a','b'],[34,67]]
2 for i,j in x:
3     print(i,j)
1 2
a b
34 67
However, for multiple indices to work, all the list elements within the larger list need to have the same number of elements.
While Loop
We have seen so far that for loops work with indices iterating over a value list. Many times, we need to do this iterative operation based on a condition instead of a value list. We can do this using a while loop. Let's see an example.
Let's say we want to remove all the occurrences of a value from a list. The remove function that we learned about only removes the first occurrence. We can manually keep on calling remove until all the occurrences are removed, or we can put this inside a while loop with the condition.
1 a=[3,3,4,4,43,3,3,3,2,2,45,67,89,3,3]
2 while 3 in a :
3     a.remove(3)
1 a
[4, 4, 43, 2, 2, 45, 67, 89]
List Comprehension
When we need to construct a new list from an existing one by doing some operation, the usual way of doing that is to start with an empty list, iterate over the existing list, do some operation on the elements and append the result to the empty list, like this:
1 x=[3,89,7,-90,10,0,9,1]
2 y=[]
3
4 for elem in x:
5     sq=elem**2
6     y.append(sq)
1 y
[9, 7921, 49, 8100, 100, 0, 81, 1]
y now contains the squares of all numbers in x. There is another way of achieving this, where we write the for loop (a shorter version of it) inside the square brackets directly.
1 y=[elem**2 for elem in x]
2 y
[9, 7921, 49, 8100, 100, 0, 81, 1]
Here the operation on each elem in x is elem**2, and this becomes an element of the new list automatically, without having to write an explicit for loop. This is called a list comprehension. We can also include a conditional statement in a list comprehension. Let's first look at the explicit for loop for the same; the comprehension version follows it.
1 x= [ 4,-5,67,98,11,-20,7]
2 import math
3 y=[]
4 for elem in x:
5     if elem>0:
6         y.append(math.log(elem))
7
8 print(y)
Note: the usage of print here is just to display the list horizontally, as opposed to vertically, the way it naturally gets displayed.
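Written as a list comprehension with a condition, the same logic becomes a single line (a sketch using the same x, math and y names as above):
1 y=[math.log(elem) for elem in x if elem>0]
2 print(y)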
One caution here: don't always push for writing a list comprehension instead of a for loop. The purpose of a list comprehension is to shorten the code without compromising its readability; any run-time performance gain is usually modest. Hence, if a list comprehension starts to become too complex, it is better to go with an explicit for loop.
Dictionaries
Dictionaries store data as key:value pairs. They are created using curly braces, with a colon separating each key from its value.
1 my_dict={'name':'lalit','city':'hyderabad','locality':'gachibowli','num_vehicles':2,3:78,4:[3,4,5,6]}
1 len(my_dict)
6
1 type(my_dict)
dict
special functions associated with dictionaries let you extract keys and values of dictionaries
1 my_dict.keys()
dict_keys(['name', 'city', 'locality', 'num_vehicles', 3, 4])
1 my_dict.values()
dict_values(['lalit', 'hyderabad', 'gachibowli', 2, 78, [3, 4, 5, 6]])
As you can see, numbers and strings alike can be keys of dictionaries . Now that we can not really
use indices to access elements of dictionaries , how do we access them ? Using keys , as follows
1 my_dict['name']
'lalit'
1 my_dict[3]
78
As you can see, the output is the value associated with the key. A dictionary does not support duplicate keys; values, however, have no such restriction.
For adding a new key:value pair (or modifying the value of an existing key), you simply need to do this:
1 my_dict['city']='delhi'
2 print(my_dict)
{'name': 'lalit', 'city': 'delhi', 'locality': 'gachibowli', 'num_vehicles': 2, 3: 78, 4: [3, 4, 5, 6]}
You can see that the value for the key 'city' has been updated to 'delhi'; if the key did not already exist, the same syntax would have added a new key:value pair.
1 del my_dict[4]
2 print(my_dict)
{'name': 'lalit', 'city': 'delhi', 'locality': 'gachibowli', 'num_vehicles': 2, 3: 78}
Although rarely used, you can do something similar to a list comprehension for dictionaries also.
1 {elem:elem**2 for elem in x}
{4: 16, -5: 25, 67: 4489, 98: 9604, 11: 121, -20: 400, 7: 49}
Here elem has become the key of the dictionary and elem**2 the value associated with it.
Sets
Next we'll look at sets. Sets are like mathematical sets: they are unordered collections of unique values. You cannot have duplicate elements in a set, even if you forcefully try to. They are created just like lists, but using curly braces.
1 x= {10, 2, 2, 4, 4, 4, 5, 60, 22, 76}
2 x
to add and remove elements , there are special functions associated with sets
1 x.add('python')
2 x
1 x.remove(5)
As mentioned above, sets are unordered, meaning they do not have any indices associated with their elements and cannot be accessed the way we accessed lists.
1 x[2]
----> 1 x[2]
TypeError: 'set' object does not support indexing
1 for elem in x :
2     print(elem)
2
4
10
76
python
22
60
Notice that the order of the elements printed here is not the same as when we created the set. This order cannot be predetermined either; however, it will remain the same once the set has been created.
Sets also have set-operation functions associated with them, as we find with mathematical sets.
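The sets a and b used in the next few cells are not defined in the text above; for illustration, assume two sets consistent with the difference output shown further below, for example:
1 a={0,1,2,3,5,9,17}
2 b={2,3,4,8}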
1 a.union(b)
1 a.intersection(b)
1 a.difference(b)
{0, 1, 5, 9, 17}
1 b.difference(a)
1 a.symmetric_difference(b)
Sets are used in practice when the usage doesn't require iterating in a fixed order or indexing. They are good for maintaining a collection of unique elements, and checking the presence of an element in a set is significantly faster than in a list, as the sketch below illustrates.
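Here is a small sketch of that speed difference using the standard timeit module (the names big_list and big_set are introduced just for this illustration); the membership check on the set should come out dramatically faster:
1 import timeit
2 big_list=list(range(100000))
3 big_set=set(big_list)
4 timeit.timeit(lambda: 99999 in big_list, number=1000)
5 timeit.timeit(lambda: 99999 in big_set, number=1000)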
Tuples
The last data structure that we talk about is tuples. They are exactly the same as lists except for one difference: you cannot modify the elements of a tuple. They are created just like lists, except using round brackets.
1 x= ('a','b',45,98,0,-10,43)
Tuples don't have functions like add/remove. In general there doesn't exist any way to modify a tuple once created (you can, of course, change the type to list and then modify it, but that's beside the point).
1 x[4] = 99
----> 1 x[4] = 99
TypeError: 'tuple' object does not support item assignment
Functions
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result (this is optional). Functions help break our program into smaller and modular chunks. As our program grows larger and larger, functions make it more organized and manageable. Furthermore, they avoid repetition and make code reusable.
Let's say we are asked to come up with a program which, when provided a list, gives us a dictionary containing the unique elements of the list as keys and their counts as values.
1 x= [2,2,3,4,4,4,4,2,2,2,2,2,4,5,5,5,6,6,1,1,1,3]
2 count_dict={}
3 for elem in x:
4     if elem not in count_dict:
5         count_dict[elem]=0
6     count_dict[elem]=count_dict[elem]+1
1 count_dict
{2: 7, 3: 2, 4: 5, 5: 3, 6: 2, 1: 3}
Now if we need to do this multiple times in our project , we'll need to copy this entire bit of code
wherever this is required. Instead of doing that we can make use of functions . We'll wrap this
program in a function and when we need to use it, we'll have to write just one line, instead of writing
the whole program.
Function definitions start with the keyword def; in the brackets which follow, we name the inputs which our program/function requires. In the case under discussion, there is a single input, a list. Functions can have multiple, even a varying number of, inputs. Let's convert the program written above into a function.
1 def my_count(a):
2     count_dict={}
3     for elem in a:
4         if elem not in count_dict:
5             count_dict[elem]=0
6         count_dict[elem]+=1
7     return(count_dict)
Now I can call this function using a single line and it will return the output back.
1 my_count(x)
{2: 7, 3: 2, 4: 5, 5: 3, 6: 2, 1: 3}
1 my_count([3,3,3,3,3,4,4,4,3,3,3,3,10,10,10,0,0,0,0,10,10,4,4,4])
{3: 9, 4: 6, 10: 5, 0: 4}
Many functions that we have been using (other than the one that we just wrote) can take a variable number of inputs; if we miss passing some value, they work with a default value given to them. Let's see how to create a function with default values for its arguments. We are going to look at a function which takes 3 numbers as input and returns a weighted sum.
1 def mysum(x,y,z):
2
3 s=100*x+10*y+z
4 return(s)
1 mysum(1,2,3)
123
But if we try to call it with fewer than 3 numbers, it throws an error.
1 mysum(1,2)
----> 1 mysum(1,2)
TypeError: mysum() missing 1 required positional argument: 'z'
While creating the function itself, we could have given it default values to work with, in order to avoid this. We can use any value as a default value (the only thing to ensure is that it makes sense in the context of what the function does).
1 def mysum(x=1,y=10,z=-1):
2
3 s=100*x+10*y+z
4 return(s)
1 mysum(1,2,3)
123
It still works as before when we explicitly pass all the values. However, if we choose to pass a smaller number of inputs, instead of throwing an error it makes use of the default values provided by us for each argument.
1 mysum(1,2)
119
One last thing about functions: when we pass arguments while calling the function, they are assigned to the various objects in the function in the sequence in which they are passed. This can be changed if we name our arguments while we pass them; the sequence doesn't matter in that case.
1 mysum(z=7,x=9)
1007
Classes
what is a class? Simply a logical grouping of data and functions . What do we mean by "logical
grouping"? Well, a class can contain any data we'd like it to, and can have any functions (methods)
attached to it that we please. Rather than just throwing random things together under the name
"class", we try to create classes where there is a logical connection between things. Many times,
classes are based on objects in the real world (like Customer or Product or Points).
Regardless, classes are a way of thinking about programs. When you think about and implement your system in this way, you're said to be performing Object-Oriented Programming. "Classes" and "objects" are words that are often used interchangeably, though strictly speaking an object is an instance of a class.
Let's look at a use case which will convince you that using a class to define a logical grouping makes your life easier.
Let's say I want to keep track of customer records. Customers have, let's say, three attributes associated with them: name, account_balance and account_number. Without a class, I would need to create three different objects for each customer and then ensure that I don't end up mixing those objects with another customer's details. Instead, I'll write a class for customers.
1 class customer():
2
3     # the first function is __init__ , with double underscores
4     # this is used to set attributes of an object of the class customer when it gets created
5     # we can also put data here which can be used by any object of the class customer
6     # and can also be used by other methods/functions contained in the class
7
8     # self here is a way to refer to the object of the class and its attributes
9
10     def __init__(self,name,balance,ac_num):
11
12         self.Name=name
13         self.AC_balance=balance
14         self.AcNum = ac_num
Now I just need to create one object per customer with these attributes, and I can seamlessly avoid mixing up attributes across customers.
1 c1=customer('lalit',2000,'A3124')
1 c1.Name
'lalit'
1 c1.AC_balance
2000
1 c1.AcNum
'A3124'
I can also attach methods/functions to this class which will be available to objects of this class only.
1 class customer():
2
3     # the first function is __init__ , with double underscores
4     # this is used to set attributes of an object of the class customer when it gets created
5     # we can also put data here which can be used by any object of the class customer
6     # and can also be used by other methods/functions contained in the class
7
8     # self here is a way to refer to the object of the class and its attributes
9
10     def __init__(self,name,balance,ac_num):
11
12         self.Name=name
13         self.AC_balance=balance
14         self.AcNum = ac_num
15
16     def withdraw(self,amount):
17
18         self.AC_balance -= amount
19
20     def get_account_number(self):
21
22         print(self.AcNum)
1 c1=customer('lalit',2000,'A3124')
1 c1.AC_balance
2000
1 c1.withdraw(243)
1 c1.AC_balance
1757
1 c1.get_account_number()
A3124
As a final note to this discussion: if the idea of classes seemed too daunting, you can safely skip this. You will not need to write your own classes until you start working on bigger projects which implement their own algorithms or complex data processing routines.
In the next module we are going to learn to handle datasets: creating datasets, reading from external files and modifying datasets. We will also see how to summarize and visualize data in Python with the packages numpy, pandas and seaborn.
Chapter 2 : Data Handling with Python
In this module we'll discuss data handling with Python. The discussion will be largely around two packages, numpy and pandas. Numpy is the original package in Python designed to work with multidimensional arrays, which eventually enables us to work with data files. Pandas is a high level package written on top of numpy which makes the syntax much easier, so that we can focus on the logic of data handling processes rather than getting bogged down with the increasingly complex syntax of numpy. However, numpy comes with a lot of functions which we'll be using for data manipulation. Since pandas is written on top of numpy, it's good to know how numpy works in general, to understand the rationale behind a lot of the syntactical choices in pandas. Let's begin the discussion with numpy.
Numpy
Through numpy we will learn to create and handle arrays. Arrays set a background for handling
columns in our datasets when we eventually move to pandas dataframes.
We will cover the following topics in Numpy:
creating nd arrays
subsetting with indices and conditions
comparison with np and math functions [np.sqrt , log etc ] and special numpy functions
For this course, we will consider only two dimensional arrays, though technically, we can create
arrays with more than two dimensions.
We start with importing the package numpy giving it the alias np.
1 import numpy as np
We start with creating a 2 dimensional array and assign it to the variable 'b'. It is simply a list of lists.
1 b = np.array([[3,20,99],[-13,4.5,26],[0,-1,20],[5,78,-19]])
2 b
We have passed 4 lists and each of the lists contains 3 elements. This makes 'b' a 2 dimensional
array.
We can also determine the size of an array by using its shape attribute.
1 b.shape
(4, 3)
Each dimension in a numpy array is referred to by the argument 'axis'. 2 dimensional arrays have
two axes, namely 0 and 1. Since there are two axes here, we will need to pass two indices when
accessing the values in the numpy array. Numbering along both the axes starts with a 0.
1 b
Assuming that we want to access the value -1 from the array 'b', we will need to access it with both
its indices along the two axes.
1 b[2,1]
-1.0
In order to access -1, we will need to first pass index 2, which refers to the third list, and then pass index 1, which refers to the second element in that list. Axis 0 refers to each list and axis 1 refers to the elements present in each list. The first index is the index of the list where the element is (i.e. 2) and the second index is its position within that list (i.e. 1).
Indexing and slicing here works just like it did in lists, only difference being that here we are
considering 2 dimensions.
1 print(b)
2 b[:,1]
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
This statement gives us the second element from all the lists.
1 b[1,:]
The above statement gives us all the elements from the second list.
By default, we can access all the elements of a list by providing a single index as well. The above
code can also be written as:
1 b[1]
We can access multiple elements of a 2 dimensional array by passing multiple indices as well.
1 print(b)
2 b[[0,1,1],[1,2,1]]
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
Here, values are returned by pairing of indices i.e. (0,1), (1,2) and (1,1). We will get the first element
when we run b[0,1] (i.e. the first list and second element within that list); the second element when
we run b[1,2] (i.e. the second list and the third element within that list) and the third element when
we run b[1,1] (i.e. the second list and the second element within that list). The three values returned
in the array above can also be obtained by the three print statements written below:
1 print(b[0,1])
2 print(b[1,2])
3 print(b[1,1])
20.0
26.0
4.5
This way of accessing the index can be used for modification as well e.g. updating the values of
those indices.
1 print(b)
2 b[[0,1,1],[1,2,1]]=[-10,-20,-30]
3 print(b)
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
Here you can see that for each of the indices accessed, we updated the corresponding values i.e. the
values present for the indices (0,1), (1,2) and (1,1) were updated. In other words, 20.0, 26.0 and 4.5
were replaced with -10, -20 and -30 respectively.
1 b
1 b>0
On applying a condition on the array 'b', we get an array with Boolean values; True where the
condition was met, False otherwise. We can use these Boolean values, obtained through using
conditions, for subsetting the array.
1 b[b>0]
The above statement returns all the elements from the array 'b' which are positive.
Let's say we now want all the positive elements from the third list. Then we need to run the following
code:
1 print(b)
2 b[2]>0
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
When we write b[2]>0, it returns a logical array, with True wherever the list's value is positive and False otherwise.
Subsetting the list in the following way, using the condition b[2]>0, returns the actual positive values.
1 b[2,b[2]>0]
array([20.])
Now, what if I want the values from all lists only for those indices where the values in the third list
were either 0 or positive.
1 print(b)
2 print(b[2]>=0)
3 print(b[:,b[2]>=0])
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[[ 3. 99.]
[-13. -20.]
[ 0. 20.]
[ 5. -19.]]
For the statement b[:,b[2]>=0], the ':' sign indicates that we are referring to all the lists and the
condition 'b[2]>=0' would ensure that we will get the corresponding elements from all the lists which
satisfy the condition that the third list is either 0 or positive. In other words, 'b[2]>=0' returns [True,
False, True] which will enable us to get the first and the third values from all the lists.
Now lets consider the following scenario, where we want to apply the condition on the third element
of each list and then apply the condition across all the elements of the lists:
1 b[:,2]>0
Here, we are checking whether the third element in each list is positive or not. b[:,2]>0 returns a
logical array. Note: it will have as many elements as the number of lists.
1 print(b)
2 print(b[:,2])
3 print(b[:,2]>0)
4 b[b[:,2]>0,:]
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[ 99. -20.  20. -19.]
[ True False  True False]
array([[  3., -10.,  99.],
       [  0.,  -1.,  20.]])
Across the lists, it has extracted those values which correspond to the logical array. Using the
statement print(b[:,2]>0), we see that only 99. and 20. are positive, i.e. the third element from each
of the first and third lists are positive and hence True. On passing this condition to the array 'b',
b[b[:,2]>0,:], we get all those lists wherever the condition evaluated to True i.e. the first and the third
lists.
The idea of using numpy is that it allows us to apply functions on multiple values across a full
dimension instead of single values. The math package on the other hand works on scalars of single
values.
As an example, let's say we wanted to replace the entire 2nd list (index = 1) with its exponential values.
1 import math as m
The function exp in the math package returns the exponential value of the number passed as
argument.
1 x=-80
2 m.exp(x)
1.8048513878454153e-35
However, when we pass an array to this function instead of a single scalar value, we get an error.
1 b[1]
1 b[1]=m.exp(b[1])
----> 1 b[1]=m.exp(b[1])
TypeError: only size-1 arrays can be converted to Python scalars
Basically, the math package converts its inputs to scalars, but since b[1] is an array of multiple elements, it gives an error.
We will need to use the corresponding numpy function to be able to apply the exponential to whole arrays, i.e. np.exp().
The following code will return the exponential values of the second list only.
1 print(b)
2 b[1]=np.exp(b[1])
3 print(b)
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[[ 3.00000000e+00 -1.00000000e+01 9.90000000e+01]
[ 2.26032941e-06 9.35762297e-14 2.06115362e-09]
[ 0.00000000e+00 -1.00000000e+00 2.00000000e+01]
[ 5.00000000e+00 7.80000000e+01 -1.90000000e+01]]
There are multiple such functions available in numpy. We can type 'np.' and press the 'tab' key to see
the list of such functions.
All the functions present in the math package will be present in numpy package as well.
Reiterating the advantage of working with numpy instead of math package is that numpy enables us
to work with complete arrays. We do not need to write a for loop to apply a function across the
array.
axis argument
To understand the axis argument better, we will now explore the 'sum()' function which collapses the
array.
1 np.sum(b)
175.00000226239064
Instead of summing the entire array 'b', we want to sum across the list i.e. axis = 0.
1 print(b)
2 np.sum(b,axis=0)
If we want to sum all the elements of each list, then we will refer to axis = 1
1 np.sum(b,axis=1)
axis=0 here corresponds to elements across the lists , axis=1 corresponds to within the list elements.
Note: Pandas dataframes, which in a way are 2 dimensional numpy arrays, have each list in a numpy
array correspond to a column in pandas dataframe. In a pandas dataframe, axis=0 would refer to
rows and axis=1 would refer to columns.
Now we will go through some commonly used numpy functions. We will use the rarely used
functions as and when we come across them.
The commonly used functions help in creating special kinds of numpy arrays.
arange()
1 np.arange(0,6)
array([0, 1, 2, 3, 4, 5])
The arange() function returns an array starting from 0 until (6-1) i.e. 5.
1 np.arange(2,8)
array([2, 3, 4, 5, 6, 7])
We can also control the starting and ending of an arange array. The above arange function starts
from 2 and ends with (8-1) i.e. 7, incrementing by 1.
The arange function is used for creating a sequence of integers with different starting and ending
points having an increment of 1.
linspace()
To create a more customized sequence we can use the linspace() function. The argument num gives
the number of elements in sequence. The elements in the sequence will be equally spaced.
1 np.linspace(start=2,stop=10,num=15)
random.randint()
We can generate an array of random integers within a given range using the randint() function from numpy's random module.
1 np.random.randint(high=10,low=1,size=(2,3))
array([[4, 5, 8],
[1, 3, 4]])
The above code creates a random array of size (2,3) i.e. two lists having three elements each. These random elements are chosen from the integers between 1 (the low argument, inclusive) and 10 (the high argument, exclusive).
random.random()
We can also create an array of random numbers using the random() function from the random
package.
1 np.random.random(size=(3,4))
The above random() function creates an array of size (3,4) where the elements are real numbers
between 0 to 1.
random.choice()
1 x = np.arange(0, 10)
2 x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1 np.random.choice(x,6)
array([9, 0, 4, 3, 8, 5])
The random.choice() function helps us select 6 random numbers from the input array x. Every time we run this code, the result will be different.
We can see that, at times, the function ends up picking up the same element twice. If we want to
avoid that i.e. get each number only once or in other words, get the elements without replacement;
then we need to set the argument 'replace' as False which is True by default.
1 np.random.choice(x,6,replace=False)
array([4, 0, 7, 1, 8, 6])
Now when we run the above code again, we will get different values, but we will not see any number
more than once.
1 y = np.random.choice(['a','b'], 6)
2 print(y)
1 y = np.random.choice(['a','b'], 1000)
The code above samples 'a' and 'b' a 1000 times. Now if we take unique values from this array, it will
be 'a' and 'b', as shown by the code below. The return_counts argument gives the number of times
the two elements are present in the array created.
1 np.unique(y, return_counts=True)
By default, both 'a' and 'b' get picked up with equal probability in the random sample. This does not
mean that the individual values are not random; but the overall percentage of 'a' and 'b' remains
almost the same.
However, if we want the two values according to a specific proportion, we can use the argument 'p'
in the random.choice() function.
1 y=np.random.choice(['a','b'],1000,p=[0.8,0.2])
1 np.unique(y, return_counts=True)
Now we can observe that the value 'a' is present approximately 80% of the time and the value 'b' appears around 20% of the time. The individual values are still random, though overall 'a' will appear about 80% of the time as specified and 'b' about 20% of the time. Since the underlying process is inherently random, these proportions will be close to 80% and 20% respectively but may not be exactly the same. In fact, if we draw samples of smaller sizes, the difference could be quite wide, as shown in the code below.
1 y=np.random.choice(['a','b'],10,p=[0.8,0.2])
2 np.unique(y, return_counts=True)
Here we sample only 10 values in the proportion 8:2. As you repeat this sampling process, at times
the proportion may match, but many times the difference will be big. As we increase the size of the
sample, the number of samples drawn for each element will be closer to the proportion specified.
sort()
1 x=np.random.randint(high=100,low=12,size=(15,))
2 print(x)
[89 60 31 82 97 92 14 96 75 12 36 37 25 27 62]
1 x.sort()
1 print(x)
[12 14 25 27 31 36 37 60 62 75 82 89 92 96 97]
1 x=np.random.randint(high=100,low=12,size=(4,6))
1 x
This array is 2 dimensional containing 4 lists and each list has 6 elements.
If we use the sort function directly, the correspondence between the elements is broken i.e. each
individual list is sorted independent of the other lists. We may not want this. The first elements in
each list belong together, so do the second and so on; but after sorting this correspondence is
broken.
1 np.sort(x)
argsort()
For maintaining order along either of the axis, we can extract indices of the sorted values and
reorder original array with these indices. Lets see the code below:
1 print(x)
2 x[:,2]
[[61 60 78 20 56 50]
[27 56 88 69 40 26]
[35 83 40 17 74 67]
[33 78 25 19 53 12]]
This returns the third element from each list of the 2 dimensional array x.
Lets say we want to sort the 2 dimensional array x by these values and the other values should move
with them maintaining the correspondence. This is where we will use argsort() function.
1 print(x[:,2])
2 x[:,2].argsort()
[78 88 40 25]
array([3, 2, 0, 1], dtype=int64)
Instead of sorting the array, argsort() returns the indices of the elements after they are sorted.
The value with index 3 i.e. 25 should appear first, the value with index 2 i.e. 40 should appear next
and so on.
We can now pass these indices to arrange all the lists according to them. We will observe that the
correspondence does not break.
1 x[x[:,2].argsort(),:]
All the lists have been arranged according to order given by x[:,2].argsort() for the third element
across the lists.
1 x=np.random.randint(high=100,low=12,size=(15,))
2 x
array([17, 77, 49, 36, 39, 27, 63, 99, 94, 22, 55, 66, 93, 32, 16])
1 x.max()
99
The max() function will simply return the maximum value from the array.
1 x.argmax()
7
The argmax() function on the other hand will simply give the index of the maximum value i.e. the
maximum number 99 lies at the index 7.
Pandas
In this section we will start with the python package named pandas which is primarily used for
handling datasets in python.
We will cover the following topics:
creating dataframes from lists and dictionaries
reading data from external csv files
quick checks on a dataframe: head, columns, dtypes, info and shape
subsetting dataframes by position, by conditions and by column names
dropping columns from a dataframe
1 import pandas as pd
2 import numpy as np
3 import random
There are two ways of creating dataframes:
1. From lists
2. From dictionary
We will start with creating some lists and then making a dataframe using these lists.
1 age=np.random.randint(low=16,high=80,size=[20,])
2 city=np.random.choice(['Mumbai','Delhi','Chennai','Kolkata'],20)
3 default=np.random.choice([0,1],20)
We can zip these lists to convert them into a single list of tuples. Each tuple in the list will refer to a row in the dataframe.
1 mydata=list(zip(age,city,default))
2 mydata
Each of the tuples come from zipping the elements in each of the lists (age, city and default) that we
created earlier.
Note: You may have different values when you run this code since we are randomly generating the
lists using the random package.
We can then put this list of tuples in a dataframe simply by using the pd.DataFrame function.
1 df=pd.DataFrame(mydata,columns=['age','city','default'])
2 df.head() # we are using head() function which displays only the first 5 rows.
age city default
0 33 Mumbai 0
1 71 Kolkata 1
2 28 Mumbai 1
3 46 Mumbai 1
4 22 Delhi 1
As you can observe, this is a simple dataframe with 3 columns and 20 rows, having the three lists:
age, city and default as columns. The column names could have been different too, they do not have
to necessarily match the list names.
Another way of creating dataframes is using a dictionary. Here we do not need to provide the column names separately; they will be picked up from the keys of the dictionary.
The keys ("age", "city" and "default") will be taken as column names and the lists (age, city and default) will contain the values themselves.
1 df=pd.DataFrame({'age':age,'city':city,'default':default})
2 df.head() # we are using head() function which displays only the first 5 rows.
age city default
0 33 Mumbai 0
1 71 Kolkata 1
2 28 Mumbai 1
3 46 Mumbai 1
4 22 Delhi 1
In both the cases i.e. creating the dataframe using list or dictionary, the resultant dataframe is the
same. The process of creating them is different but there is no difference in the resulting
dataframes.
Lets first create a string containing the path to the .csv file.
Here, 'r' is added at the beginning of the path. This is to ensure that the file path is read as a raw string and special character combinations are not interpreted with their special meaning by Python, e.g. \n means newline and would otherwise be interpreted as such inside the file path, which can lead to Unicode errors. Sometimes it will work without putting 'r' at the beginning of the path, but it is the safer choice; make it a habit to always use it.
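For example, with a hypothetical location for the loans data csv file (the actual path on your machine will be different):
1 file=r'C:\data\loans data.csv'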
1 ld = pd.read_csv(file)
The pandas function read_csv() reads the file present in the path given by the argument 'file'.
1 ld.head()
2 # display of data in pdf will be truncated on the right hand side
The head() function of the pandas dataframe created, by default, returns the top 5 rows of the
dataframe. If we wish to see more or less rows, for instance 10 rows, then we will pass the number
as an argument to the head() function.
1 ld.head(10)
2 # display of data in pdf will be truncated on the right hand side
We can get the column names by using the 'columns' attribute of the pandas dataframe.
1 ld.columns
If we want to see the type of these columns, we can use the attribute 'dtypes' of the pandas
dataframe.
1 ld.dtypes
1 ID float64
2 Amount.Requested object
3 Amount.Funded.By.Investors object
4 Interest.Rate object
5 Loan.Length object
6 Loan.Purpose object
7 Debt.To.Income.Ratio object
8 State object
9 Home.Ownership object
10 Monthly.Income float64
11 FICO.Range object
12 Open.CREDIT.Lines object
13 Revolving.CREDIT.Balance object
14 Inquiries.in.the.Last.6.Months float64
15 Employment.Length object
16 dtype: object
The float64 datatype refers to numeric columns and object datatype refers to categorical columns.
If we want a concise summary of the dataframe including information about null values, we use the
info() function of the pandas dataframe.
1 ld.info()
1 <class 'pandas.core.frame.DataFrame'>
2 RangeIndex: 2500 entries, 0 to 2499
3 Data columns (total 15 columns):
4 ID 2499 non-null float64
5 Amount.Requested 2499 non-null object
6 Amount.Funded.By.Investors 2499 non-null object
7 Interest.Rate 2500 non-null object
8 Loan.Length 2499 non-null object
9 Loan.Purpose 2499 non-null object
10 Debt.To.Income.Ratio 2499 non-null object
11 State 2499 non-null object
12 Home.Ownership 2499 non-null object
13 Monthly.Income 2497 non-null float64
14 FICO.Range 2500 non-null object
15 Open.CREDIT.Lines 2496 non-null object
16 Revolving.CREDIT.Balance 2497 non-null object
17 Inquiries.in.the.Last.6.Months 2497 non-null float64
18 Employment.Length 2422 non-null object
19 dtypes: float64(3), object(12)
20 memory usage: 293.0+ KB
If we want to get the dimensions i.e. numbers of rows and columns of the data, we can use the
attribute 'shape'.
1 ld.shape
(2500, 15)
1 ld1=ld.iloc[3:7,1:5]
2 ld1
Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length
'iloc' refers to subsetting the dataframe by position. Here we have extracted the rows from 3rd to the
6th (7-1) position and columns from 1st to 4th (5-1) position.
To understand this further, we will further subset the 'ld1' dataframe. It currently has 4 rows and 4
columns. The indexes (3, 4, 5 and 6) come from the original dataframe. Lets subset the 'ld1'
dataframe further.
1 ld1.iloc[2:4,1:3]
Amount.Funded.By.Investors Interest.Rate
5 6000 15.31%
6 10000 7.90%
You can see here that the positions are relative to the current dataframe 'ld1' and not the original
dataframe 'ld'. Hence we end up with the 3rd and 4th rows along with 2nd and 3rd columns of the
new dataframe 'ld1' and not the original dataframe 'ld'.
Generally, we do not subset dataframes by position. We normally subset the dataframes using
conditions or column names.
Lets say, we want to subset the dataframe and get only those rows for which the 'Home.Ownership'
is of the type 'MORTGAGE' and the 'Monthly.Income' is above 5000.
Note: When we combine multiple conditions, we have to enclose each of them in parentheses, else the results will not be as expected.
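A sketch of this subsetting, combining the two conditions described above (head() is added just to keep the display short):
1 ld[(ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000)].head()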
On observing the results, you will notice that for each of the rows both the conditions will be
satisfied i.e. 'Home.Ownership' will be 'MORTGAGE' and the 'Monthly.Income' will be greater than
5000.
In case we want to access a single columns data only, then we simply have to pass the column name
in square brackets as follows:
1 ld['Home.Ownership'].head()
1 0 MORTGAGE
2 1 MORTGAGE
3 2 MORTGAGE
4 3 MORTGAGE
5 4 RENT
6 Name: Home.Ownership, dtype: object
However, if we want to access multiple columns, then the names need to be passed as a list. For
instance, if we wanted to extract both 'Home.Ownership' and 'Monthly.Income', we would need to
pass it as a list, as follows:
1 ld[['Home.Ownership','Monthly.Income']].head()
2 # note the double square brackets used to subset the dataframe using multiple columns
Home.Ownership Monthly.Income
0 MORTGAGE 6541.67
1 MORTGAGE 4583.33
2 MORTGAGE 11500.00
3 MORTGAGE 3833.33
4 RENT 3195.00
If we intend to use both, condition as well as column names, we will need to use the .loc with the
pandas dataframe name.
Observing the code below, we subset the dataframe using conditions and columns both. We are
subsetting the rows, using the condition '(ld['Home.Ownership']=='MORTGAGE') &
(ld['Monthly.Income']>5000)' and we extract the 'Home.Ownership' and 'Monthly.Income' columns.
Here, both the conditions should be met for an observation to appear in the output. e.g. We can see in the first row of the dataframe that 'Home.Ownership' is 'MORTGAGE' and the 'Monthly.Income' is more than 5000. If either of the conditions is false, we will not see that observation in the resulting dataframe.
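A sketch of the .loc call described above; a call along these lines produces output like the table below:
1 ld.loc[(ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000),['Home.Ownership','Monthly.Income']].head()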
Home.Ownership Monthly.Income
0 MORTGAGE 6541.67
2 MORTGAGE 11500.00
7 MORTGAGE 13863.42
12 MORTGAGE 14166.67
20 MORTGAGE 6666.67
The resulting dataframe has only 2 columns and 686 rows. The rows correspond to the result
obtained when the condition (ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000) is
applied to the 'ld' dataframe.
What if we wanted to subset only those rows which did not satisfy a condition i.e. we want to negate
a condition. In order to do this, we can put a '~' sign before the condition.
In the following code, we will get all the observations that do not satisfy the condition ((ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000)). An observation appears in this output if at least one of the two conditions fails. e.g. Considering the first row, even though 'Home.Ownership' is equal to 'MORTGAGE', the 'Monthly.Income' is less than 5000, so the combined condition is false; its negation is therefore true and we see this observation in the output.
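A sketch of the negated condition using the '~' sign, along the lines described above:
1 ld.loc[~((ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000)),['Home.Ownership','Monthly.Income']].head()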
Home.Ownership Monthly.Income
1 MORTGAGE 4583.33
3 MORTGAGE 3833.33
4 RENT 3195.00
5 OWN 4891.67
6 RENT 2916.67
In short, the '~' sign gives us rest of the observations by negating the condition.
To drop columns on the basis of column names, we can use the in-built drop() function.
In order to drop the columns, we pass the list of columns to be dropped along with specifying the
axis argument as 1 (since we are dropping columns).
The following code will return all the columns except 'Home.Ownership' and 'Monthly.Income'.
1 ld.drop(['Home.Ownership','Monthly.Income'],axis=1).head()
2 # display of data in pdf will be truncated on the right hand side
However, when we check the columns of the 'ld' dataframe now, the two columns which we
presumably deleted, are still there.
1 ld.columns
What happens is that the ld.drop() function gives us an output; it does not make any inplace
changes.
So, in case we wish to delete the columns from the original dataframe, we can do two things:
We can update the original dataframe by equating the output to the original dataframe as follows:
1 ld=ld.drop(['Home.Ownership','Monthly.Income'],axis=1)
2 ld.columns
The second way to update the original dataframe is to set the 'inplace' argument of the drop()
function to True
1 ld.drop(['State'],axis=1,inplace=True)
2 ld.columns
Now you will notice that the deleted columns are not present in the original dataframe 'ld' anymore.
We need to be careful when using the inplace=True option; the function drop() doesn't output
anything. So we should not equate ld.drop(['State'],axis=1,inplace=True) to the original dataframe. If
we equate it to the original dataframe 'ld', then 'ld' will end up as a None type object.
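A quick sketch of this pitfall on a throwaway copy (the name tmp and the column chosen here are just for illustration):
1 tmp=ld.copy()
2 tmp=tmp.drop(['Interest.Rate'],axis=1,inplace=True) # do not equate when using inplace=True
3 type(tmp)
NoneType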
We can also delete a column using the 'del' keyword. The following code will remove the column
'Employment.Length' from the original dataframe 'ld'.
1 del ld['Employment.Length']
2 ld.columns
On checking the columns of the 'ld' dataframe, we can observe that 'Employment.Length' column is
not present.
1. Changing column data types
2. Adding/modifying variables with algebraic operations
3. Adding/modifying variables based on conditions
4. Handling missing values
5. Creating flag variables
6. Creating multiple columns from a variable separated by a delimiter
1 import numpy as np
2 import pandas as pd
We will start with creating a custom dataframe having 7 columns and 50 rows as follows:
1 age=np.random.choice([15,20,30,45,12,'10',15,'34',7,'missing'],50)
2 fico=np.random.choice(['100-150','150-200','200-250','250-300'],50)
3 city=np.random.choice(['Mumbai','Delhi','Chennai','Kolkata'],50)
4 ID=np.arange(50)
5 rating=np.random.choice(['Excellent','Good','Bad','Pathetic'],50)
6 balance=np.random.choice([10000,20000,30000,40000,np.nan,50000,60000],50)
7 children=np.random.randint(high=5,low=0,size=(50,))
1 mydata=pd.DataFrame({'ID':ID,'age':age,'fico':fico,'city':city,'rating':rating,'balance':balance,'children':children})
1 mydata.head()
2 # data display in pdf will be truncated on right hand side
1 mydata.dtypes
1 ID int32
2 age object
3 fico object
4 city object
5 rating object
6 balance float64
7 children int32
8 dtype: object
We can see that the 'age' column is of the object datatype, though it should have been numeric; this is due to some character values in the column. We can change the datatype to numeric; the character values which cannot be converted to numbers will be assigned missing values, i.e. NaN's, automatically.
There are multiple numeric formats in Python e.g. integer, float, unsigned integer etc. The to_numeric() function chooses the best one for the column under consideration.
1 mydata['age']=pd.to_numeric(mydata['age'])
ValueError Traceback (most recent call last)
pandas/libs/src\inference.pyx in pandas.libs.lib.maybe_convert_numeric()
When we run the code above, we get an error: "Unable to parse string "missing" at position 2".
This error means that there are a few values in the column that cannot be converted to numbers; in
our case it's the value 'missing' which cannot be converted to a number. In order to handle this, we
need to set the errors argument of the to_numeric() function to 'coerce' i.e. errors='coerce'. With this
argument, wherever it is not possible to convert a value to numeric, the value is converted to a
missing value i.e. NaN.
1 mydata['age']=pd.to_numeric(mydata['age'], errors='coerce')
2 mydata['age'].head()
0 12.0
1 15.0
2 NaN
3 45.0
4 NaN
Name: age, dtype: float64
As we can observe in rows 2, 4, etc., wherever the string 'missing' was present (and could not be
converted to a number), the value has now been converted to NaN i.e. a missing value.
1 mydata['const_var']=100
The above code adds a new column 'const_var' to the mydata dataframe and each element in that
column is 100.
1 mydata.head()
If we want to apply a function on an entire column of a dataframe, we use a numpy function; e.g log
as shown below:
1 mydata['balance_log']=np.log(mydata['balance'])
The code above creates a new column 'balance_log' which has the logarithmic value of each element
present in the 'balance' column. A numpy function np.log() is used to do this.
1 mydata.head()
ID age fico city rating balance children const_var balance_log
0 0 12.0 250-300 Chennai Excellent 10000.0 3 100 9.210340
1 1 15.0 150-200 Chennai Bad 20000.0 3 100 9.903488
2 2 NaN 250-300 Chennai Pathetic 20000.0 2 100 9.903488
3 3 45.0 250-300 Delhi Bad NaN 3 100 NaN
4 4 NaN 250-300 Delhi Pathetic 50000.0 2 100 10.819778
We can do many complex algebraic calculations as well to create/add new columns to the data.
1 mydata['age_children_ratio']=mydata['age']/mydata['children']
The code above creates a new column 'age_children_ratio'; each element of which will be the result
of the division of the corresponding elements present in the 'age' and 'children' columns.
1 mydata.head(10)
2 # display in pdf will be truncated on right hand side
ID age fico city rating balance children const_var balance_log age_children_ratio
0 0 12.0 250-300 Chennai Excellent 10000.0 3 100 9.210340 4.000000
1 1 15.0 150-200 Chennai Bad 20000.0 3 100 9.903488 5.000000
2 2 NaN 250-300 Chennai Pathetic 20000.0 2 100 9.903488 NaN
3 3 45.0 250-300 Delhi Bad NaN 3 100 NaN 15.000000
4 4 NaN 250-300 Delhi Pathetic 50000.0 2 100 10.819778 NaN
5 5 20.0 100-150 Delhi Pathetic 60000.0 3 100 11.002100 6.666667
6 6 34.0 100-150 Mumbai Good 40000.0 0 100 10.596635 inf
7 7 20.0 200-250 Kolkata Pathetic 30000.0 4 100 10.308953 5.000000
8 8 20.0 150-200 Mumbai Good 20000.0 3 100 9.903488 6.666667
9 9 20.0 150-200 Chennai Bad 50000.0 4 100 10.819778 5.000000
Notice that when a missing value is involved in any calculation, the result is also a missing value. We
observe that in the 'age_children_ratio' column we have both NaN's (missing values) as well as inf
(infinity). We get missing values in the 'age_children_ratio' column wherever 'age' has missing values
and we get 'inf' wherever the number of children is 0 and we end up dividing by 0.
Lets say we did not want missing values involved in the calculation i.e. we want to impute the
missing values before computing the 'age_children_ratio' column. For this we would first need to
identify the missing values. The isnull() function will give us a logical array which can be used to
isolate missing values and update these with whatever value we want to impute with.
1 mydata['age'].isnull().head()
0 False
1 False
2 True
3 False
4 True
Name: age, dtype: bool
In the outcome of the code above, we observe that wherever there is a missing value the
corresponding logical value is True.
If we want to know the number of missing values, we can sum the logical array as follows:
1 mydata['age'].isnull().sum()
The following code returns only those elements where the 'age' column has missing values.
1 mydata.loc[mydata['age'].isnull(),'age']
2 NaN
4 NaN
17 NaN
36 NaN
38 NaN
Name: age, dtype: float64
One of the ways of imputing the missing values is with mean. Once these values are imputed, we
then carry out the calculation done above.
1 mydata.loc[mydata['age'].isnull(),'age'] = np.mean(mydata['age'])
In the code above, using the loc indexer of the dataframe, on the row side we access those
rows where the 'age' column is null and on the column side we access only the 'age' column. In other
words, all the missing values from the age column are replaced with the mean computed from
the 'age' column (the mean calculation ignores the NaN's).
1 mydata['age'].head()
0 12.000000
1 15.000000
2 19.533333
3 45.000000
4 19.533333
Name: age, dtype: float64
The missing values from the 'age' column have been replaced by the mean of the column i.e.
19.533333.
Now, we can compute the 'age_children_ratio' again; this time without missing values. We will
observe that there are no missing values in the newly created column now. We however, observe
inf's i.e. infinity which occurs wherever we divide by 0.
1 mydata['age_children_ratio']=mydata['age']/mydata['children']
1 mydata.head()
2 #display in pdf will be truncated on the right hand side
ID age fico city rating balance children const_var balance_log age_children_ratio
0 0 12.000000 250-300 Chennai Excellent 10000.0 3 100 9.210340 4.000000
1 1 15.000000 150-200 Chennai Bad 20000.0 3 100 9.903488 5.000000
2 2 19.533333 250-300 Chennai Pathetic 20000.0 2 100 9.903488 9.766667
3 3 45.000000 250-300 Delhi Bad NaN 3 100 NaN 15.000000
4 4 19.533333 250-300 Delhi Pathetic 50000.0 2 100 10.819778 9.766667
Lets say we want to replace the 'rating' column values with some numeric score - {'Pathetic' : -1 ,
'Bad' : 0 , 'Good' or 'Excellent': 1}. We can do it using the np.where() function as follows:
1 mydata['rating_score']=np.where(mydata['rating'].isin(['Good','Excellent']),1,0)
Using the above code, we create a new column 'rating_score' and wherever either 'Good' or
'Excellent' is present, we replace it with a 1 else with a 0 as we can see below. The function isin() is
used when we need to consider multiple values; in our case 'Good' and 'Excellent'.
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
ID age fico city rating balance children const_var balance_log age_children_ratio
0 0 12.000000 250-300 Chennai Excellent 10000.0 3 100 9.210340 4.000000
1 1 15.000000 150-200 Chennai Bad 20000.0 3 100 9.903488 5.000000
2 2 19.533333 250-300 Chennai Pathetic 20000.0 2 100 9.903488 9.766667
3 3 45.000000 250-300 Delhi Bad NaN 3 100 NaN 15.000000
4 4 19.533333 250-300 Delhi Pathetic 50000.0 2 100 10.819778 9.766667
1 mydata.loc[mydata['rating']=='Pathetic','rating_score']=-1
In the code above, wherever the value in 'rating' column is 'Pathetic', we update the 'rating_score'
column to -1 and leave the rest as is. The above code could also have been written using the np.where()
function. The np.where() function is similar to the ifelse construct we may have seen in other
languages.
1 mydata.head()
2 #display in the pdf will be truncated on the right hand side
ID age fico city rating balance children const_var balance_log age_children_ratio
0 0 12.000000 250-300 Chennai Excellent 10000.0 3 100 9.210340 4.000000
1 1 15.000000 150-200 Chennai Bad 20000.0 3 100 9.903488 5.000000
2 2 19.533333 250-300 Chennai Pathetic 20000.0 2 100 9.903488 9.766667
3 3 45.000000 250-300 Delhi Bad NaN 3 100 NaN 15.000000
4 4 19.533333 250-300 Delhi Pathetic 50000.0 2 100 10.819778 9.766667
Now we can see that the 'rating_score' column takes the values 0, 1 and -1 depending on the
corresponding values from the 'rating' column.
At times, we may have columns which we want to split into multiple columns; column 'fico' in
our case. One sample value is '100-150'. The datatype of 'fico' is object. Processing each value with a
for loop would be tedious; instead, we will discuss an easier approach which is very useful to know
when pre-processing your data.
Coming back to the 'fico' column, one of the first things that comes to mind when we want to
separate the values of the 'fico' column into multiple columns is the split() function. The split function
works on strings, but a pandas Series does not expose string methods directly; if we apply the split()
function directly on 'fico', we will get an AttributeError.
1 mydata['fico'].split()
In order to handle this, we first need to go through the 'str' accessor of the 'fico' column and then
apply the split() function. This will be the case for all string functions and not just split().
1 mydata['fico'].str.split("-").head()
0 [250, 300]
1 [150, 200]
2 [250, 300]
3 [250, 300]
4 [250, 300]
Name: fico, dtype: object
We can see that each of the elements has been split on the basis of '-'. However, the result is still
present in a single column, as a list of two strings per row. We need the values in separate columns.
We will set the 'expand' argument of the split() function to True in order to handle this.
1 mydata['fico'].str.split("-",expand=True).head()
0 1
0 250 300
1 150 200
2 250 300
3 250 300
4 250 300
1 k=mydata['fico'].str.split("-",expand=True).astype(float)
Notice that we have converted the columns to float using the astype(float) function; since after
splitting, by default, the datatype of each column created would be object. But we want to consider
each column as numeric datatype, hence the columns are converted to float. Converting to float is
not a required step when splitting columns. We do it only because these values are supposed to be
considered numeric in the current context.
1 k[0].head()
0 250.0
1 150.0
2 250.0
3 250.0
4 250.0
Name: 0, dtype: float64
We can either concatenate this dataframe to the 'mydata' dataframe after giving proper header to
both the columns or we can directly assign two new columns in the 'mydata' dataframe as follows:
1 mydata['f1'],mydata['f2']=k[0],k[1]
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
ID age fico city rating balance children const_var balance_log age_children_ratio
0 0 12.000000 250-300 Chennai Excellent 10000.0 3 100 9.210340 4.000000
1 1 15.000000 150-200 Chennai Bad 20000.0 3 100 9.903488 5.000000
2 2 19.533333 250-300 Chennai Pathetic 20000.0 2 100 9.903488 9.766667
3 3 45.000000 250-300 Delhi Bad NaN 3 100 NaN 15.000000
4 4 19.533333 250-300 Delhi Pathetic 50000.0 2 100 10.819778 9.766667
We do not need the 'fico' column anymore as we have its values in two separate columns; hence we
will delete it.
1 del mydata['fico']
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
1 print(mydata['city'].unique())
2 print(mydata['city'].nunique())
The 'city' column consists of 4 unique elements - 'Kolkata', 'Mumbai', 'Chennai', 'Delhi'. For this
variable, we will need to create three dummies.
The code below creates a flag variable when the 'city' column has the value 'Mumbai'.
1 mydata['city_mumbai']=np.where(mydata['city']=='Mumbai',1,0)
Wherever the variable 'city' takes the value 'Mumbai', the flag variable 'city_mumbai' will be 1
otherwise 0.
There is another way to do this, where we write the condition and convert the logical value to
integer.
1 mydata['city_chennai']=(mydata['city']=='Chennai').astype(int)
Code "mydata['city']=='Chennai'" gives a logical array; wherever the city is 'Chennai', the value on
'city_chennai' flag variable is True, else False.
1 (mydata['city']=='Chennai').head()
0 True
1 True
2 True
3 False
4 False
Name: city, dtype: bool
When we convert it to an integer, wherever there was True, we get a 1 and wherever there was False,
we get a 0.
1 ((mydata['city']=='Chennai').astype(int)).head()
0 1
1 1
2 1
3 0
4 0
Name: city, dtype: int32
We can use either of the methods for creating flag variables, the end result is same.
1 mydata['city_kolkata']=np.where(mydata['city']=='Kolkata',1,0)
Once the flag variables have been created, we do not need the original variable i.e. we do not need
the 'city' variable anymore.
1 del mydata['city']
This way of creating dummies requires a lot of coding, even if we somehow use a for loop. As an
alternative, we can use get_dummies() function from pandas directly to do this.
We will create dummies for the variable 'rating' using this method.
1 print(mydata['rating'].unique())
2 print(mydata['rating'].nunique())
The pandas function which creates dummies is get_dummies() in which we pass the column for
which dummies need to be created. By default, the get_dummies function creates n dummies if n
unique values are present in the column i.e. for 'rating' column, by default, get_dummies function,
creates 4 dummy variables. We do not want that. Hence, we pass the argument 'drop_first=True'
which removes one of the dummies and creates (n-1) dummies only. Setting the 'prefix' argument
helps to identify which column the dummy variables were created for. It does this by adding
whatever string you give the 'prefix' argument as a prefix to each of the dummy variables created. In
our case 'rating_' will be added to the start of each dummy variable name.
1 dummy=pd.get_dummies(mydata['rating'],drop_first=True,prefix='rating')
The get_dummies() function has created a column for 'Excellent', 'Good' and 'Pathetic' but has
dropped the column for 'Bad'.
1 dummy.head()
rating_Excellent rating_Good rating_Pathetic
0 1 0 0
1 0 0 0
2 0 0 1
3 0 0 0
4 0 0 1
We can now simply attach these columns to the data using pandas concat() function.
1 mydata=pd.concat([mydata,dummy],axis=1)
The concat() function will take the axis argument as 1 since we are attaching the columns.
After creating dummies for 'rating', we will now drop the original column.
1 del mydata['rating']
1 mydata.columns
We need to keep in mind that we will not be doing all of what we learned here at once for any one
dataset. Some of these techniques will be useful at a time while preparing data for machine learning
algorithms.
1 import numpy as np
2 import pandas as pd
The np.random.randint() function is used to create a dataframe 'df' with 4 columns and 20 rows, with
integer values ranging between 2 and 8 (the upper bound being exclusive).
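The code that creates 'df' is not visible in the pdf; a minimal sketch consistent with the description above (the column names A, B, C, D are taken from the output below) could be:

import numpy as np
import pandas as pd

# 20 rows x 4 columns of random integers between 2 and 7 (high=8 is exclusive)
df = pd.DataFrame(np.random.randint(low=2, high=8, size=(20, 4)), columns=['A', 'B', 'C', 'D'])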
1 df.head()
A B C D
0 4 7 4 6
1 2 7 5 6
2 2 3 5 7
3 4 6 2 4
4 6 5 4 4
If we wish to sort the dataframe by column A, we can do that using the sort_values() function on the
dataframe.
1 df.sort_values("A").head()
A B C D
9 2 5 6 7
14 2 6 4 6
18 2 7 7 3
6 2 4 2 5
19 2 6 4 6
The output that we get is sorted by the column 'A'. But when we view 'df' again, we see that there are
no changes; df is the same as it was before sorting, because sort_values() also returns a new
dataframe rather than modifying 'df' in place.
1 df.head()
A B C D
0 4 7 4 6
1 2 7 5 6
2 2 3 5 7
3 4 6 2 4
4 6 5 4 4
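The step that actually changes 'df' is not shown in the pdf; a sketch of the two usual ways of doing it (either one would produce the behaviour described next) is:

# option 1: overwrite 'df' with the sorted output
df = df.sort_values("A")
# option 2: sort the dataframe in place
df.sort_values("A", inplace=True)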
Now when we observe the dataframe 'df', it will be sorted by 'A' in an ascending manner.
1 df.head()
A B C D
9 2 5 6 7
14 2 6 4 6
18 2 7 7 3
6 2 4 2 5
19 2 6 4 6
In case we wish to sort the dataframe in a descending manner, we can set the argument
ascending=False in the sort_values() function.
Now the dataset will be sorted in the reverse order of the values of 'A'.
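The call itself is not shown in the pdf; a sketch of the descending sort (reassigning so that 'df' itself changes) would be:

df = df.sort_values("A", ascending=False)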
1 df.head()
A B C D
17 7 7 5 5
10 7 7 4 4
15 7 4 6 7
4 6 5 4 4
11 6 3 7 2
Sorting by next column in the sequence happens within the groups formed after sorting of the
previous columns.
In the code below, we can see that the 'ascending' argument takes values [True, False]. It is passed in
the same order as the columns ['B','C']. This means that the column 'B' will be sorted in an ascending
order and within the groups created by column 'B', column 'C' will be sorted in a descending order.
1 df.sort_values(['B','C'],ascending=[True,False]).head(10)
A B C D
12 5 2 3 5
11 6 3 7 2
13 3 3 7 6
2 2 3 5 7
7 5 3 3 4
15 7 4 6 7
16 4 4 4 2
6 2 4 2 5
9 2 5 6 7
4 6 5 4 4
We can observe that the column 'B' is sorted in an ascending order. Within the groups formed by
column 'B', column 'C' sorts its values in descending order.
Although we have not taken an explicit example for character data, in case of character data, sorting
happens in lexicographic i.e. dictionary order.
Now we will see how to combine dataframes horizontally or vertically by stacking them.
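The code creating the small example dataframes is not visible in the pdf; a sketch consistent with the outputs shown below would be:

df1 = pd.DataFrame({'letter': ['a', 'b'], 'number': [1, 2]})
df2 = pd.DataFrame({'letter': ['c', 'd'], 'number': [3, 4], 'animal': ['cat', 'dog']})
df3 = pd.DataFrame({'animal': ['bird', 'monkey', 'tiger'], 'name': ['polly', 'george', 'john']})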
1 df1
letter number
0 a 1
1 b 2
1 df2
letter number animal
0 c 3 cat
1 d 4 dog
In order to combine these dataframes, we will use the concat() function of the pandas library.
The argument 'axis=0' combines the dataframes row-wise i.e. stacks the dataframes vertically.
1 pd.concat([df1,df2], axis=0)
Notice that the index of the two dataframes is not generated afresh in the concatenated dataframe.
The original indices are stacked, so we end up with duplicate index names. More often than not, we
would not want the indices to be stacked. We can avoid doing this by setting the 'ignore_index'
argument to True in the concat() function.
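The call with 'ignore_index' is not visible in the pdf; a sketch producing the output shown below would be:

# stack the two dataframes vertically and generate a fresh 0..n-1 index
pd.concat([df1, df2], axis=0, ignore_index=True)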
animal letter number
0 NaN a 1
1 NaN b 2
2 cat c 3
3 dog d 4
We discussed how the dataframes can be stacked vertically. Now lets see how they can be stacked
horizontally.
In order to stack the dataframes column-wise i.e. horizontally, we will need to set the 'axis' argument
to 1.
The datasets which we will stack horizontally are 'df1' and 'df3'.
1 df1
letter number
0 a 1
1 b 2
1 df3
animal name
0 bird polly
1 monkey george
2 tiger john
1 pd.concat([df1,df3],axis=1)
We see that when we use the concat() function with 'axis=1' argument, we combine the dataframes
column-wise.
Since 'df3' dataframe has three rows, whereas 'df1' dataframe has only two rows, the remaining
values are set to missing as can be observed in the dataframe above.
Many times our datasets need to be combined by keys instead of simply stacking them vertically or
horizontally. As an example, lets consider the following dataframes:
1 df1=pd.DataFrame({"custid":[1,2,3,4,5],
2 "product":["Radio","Radio","Fridge","Fridge","Phone"]})
3 df2=pd.DataFrame({"custid":[3,4,5,6,7],
4 "state":["UP","UP","UP","MH","MH"]})
1 df1
custid product
0 1 Radio
1 2 Radio
2 3 Fridge
3 4 Fridge
4 5 Phone
1 df2
custid state
0 3 UP
1 4 UP
2 5 UP
3 6 MH
4 7 MH
Dataframe 'df1' contains information about the customer id and the product they purchased and
dataframe 'df2' also contains the customer id along with which state they come from.
Notice that the first row of the two dataframes have different customer ids i.e. the first row contains
information about different customers, hence it won't make sense to combine the two dataframes
together horizontally.
In order to combine data from the two dataframes, we will first need to set a correspondence using
customer id i.e. combine only those rows having a matching customer id and ignore the rest. In
some situations, if there is data in one dataframe which is not present in the other dataframe,
missing data will be filled in.
There are 4 ways in which the dataframes can be merged - inner join, outer join, left join and right
join:
In the following code, we are joining the two dataframes 'df1' and 'df2' and the key or
correspondence between the two dataframes is determined by 'custid' i.e. customer id. We use the
inner join here (how='inner'), which retains only those rows which are present in both the
dataframes. Since customer id's 3, 4 and 5 are common in both the dataframes, these three rows are
returned as a result of the inner join along with corresponding information 'product' and 'state' from
both the dataframes.
1 pd.merge(df1,df2,on=['custid'],how='inner')
custid product state
0 3 Fridge UP
1 4 Fridge UP
2 5 Phone UP
Now lets consider outer join. In outer join, we keep all the ids, starting at 1 and going up till 7. This
leads to having missing values in some columns e.g. customer ids 6 and 7 were not present in the
dataframe 'df1' containing product information. Naturally the product information for those
customer ids will be absent.
Similarly, customer ids 1 and 2 were not present in the dataframe 'df2' containing state information.
Hence, state information was missing for those customer ids.
Merging cannot fill in the data on its own if that information is not present in the original
dataframes. We will explicitly see a lot of missing values in outer join.
1 pd.merge(df1,df2,on=['custid'],how='outer')
custid product state
0 1 Radio NaN
1 2 Radio NaN
2 3 Fridge UP
3 4 Fridge UP
4 5 Phone UP
5 6 NaN MH
6 7 NaN MH
Using the left join, we will see all customer ids present in the left dataframe 'df1' and only the
corresponding product and state information from the two dataframes. The information present
only in the right dataframe 'df2' is ignored i.e. customer ids 6 and 7 are ignored.
1 pd.merge(df1,df2,on=['custid'],how='left')
custid product state
0 1 Radio NaN
1 2 Radio NaN
2 3 Fridge UP
3 4 Fridge UP
4 5 Phone UP
Similarly, right join will contain all customer ids present in the right dataframe 'df2' irrespective of
whether they are there in the left dataframe 'df1' or not.
1 pd.merge(df1,df2,on=['custid'],how='right')
custid product state
0 3 Fridge UP
1 4 Fridge UP
2 5 Phone UP
3 6 NaN MH
4 7 NaN MH
1 import pandas as pd
2 import numpy as np
1 file=r'/Users/anjal/Dropbox/PDS V3/Data/bank-full.csv'
2 bd=pd.read_csv(file,delimiter=';')
1 bd.describe()
2 # display in pdf will be truncated on the right hand side
Another useful function that can be applied on the entire data is nunique(). It returns the number of
unique values taken by different variables.
Numeric data
1 bd.nunique()
1 age 77
2 job 12
3 marital 3
4 education 4
5 default 2
6 balance 7168
7 housing 2
8 loan 2
9 contact 3
10 day 31
11 month 12
12 duration 1573
13 campaign 48
14 pdays 559
15 previous 41
16 poutcome 4
17 y 2
18 dtype: int64
We can observe that the variables having the 'object' type have fewer values and variables which are
'numeric' have higher number of unique values.
The describe() function can be used with individual columns also. For numeric variables, it gives the
8 summary statistics for that column only.
1 bd['age'].describe()
1 count 45211.000000
2 mean 40.936210
3 std 10.618762
4 min 18.000000
5 25% 33.000000
6 50% 39.000000
7 75% 48.000000
8 max 95.000000
9 Name: age, dtype: float64
When the describe() function is used with a categorical column, it gives the total number of values in
that column, total number of unique values, the most frequent value ('blue-collar') as well as the
frequency of the most frequent value.
1 bd['job'].describe()
1 count 45211
2 unique 12
3 top blue-collar
4 freq 9732
5 Name: job, dtype: object
Note: When we use the describe() function on the entire dataset, by default, it returns the summary
statistics of numeric columns only.
Lets say we only wanted the mean or the median of the variable 'age'.
1 bd['age'].mean(), bd['age'].median()
(40.93621021432837, 39.0)
Apart from the summary statistics provided by the describe function, there are many other statistics
available as shown below:
Function Description
min Minimum
max Maximum
mode Mode
Categorical data
Now starting with categorical data, we would want to look at frequency counts. We use the
value_counts() function to get the frequency counts of each unique element present in the column.
1 bd['job'].value_counts()
blue-collar 9732
management 9458
technician 7597
admin. 5171
services 4154
retired 2264
self-employed 1579
entrepreneur 1487
unemployed 1303
housemaid 1240
student 938
unknown 288
Name: job, dtype: int64
By default, the outcome of the value_counts() function is in descending order. The element 'blue-
collar' with the highest count is displayed on the top and that with the lowest count 'unknown' is
displayed at the bottom.
We should be aware of the format of the output. Lets save the outcome of the above code in a
variable 'k'.
1 k = bd['job'].value_counts()
The outcome stored in 'k' has two attributes. One is values i.e. the raw frequencies and the other is
'index' i.e. the categories to which the frequencies belong.
1 k.values
array([9732, 9458, 7597, 5171, 4154, 2264, 1579, 1487, 1303, 1240, 938,
288], dtype=int64)
1 k.index
As shown, values contain raw frequencies and index contains the corresponding categories. e.g.
'blue-collar' job has 9732 counts.
Lets say, you are asked to get the category with minimum count, you can directly get it with the
following code:
1 k.index[-1]
'unknown'
We can get the category with the second lowest count as well as the one with the highest count as follows:
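The exact code is not shown in the pdf; positional indexing into k.index as sketched below gives the two outputs that follow (remember k is sorted by count in descending order):

k.index[-2]   # second entry from the bottom of the sorted counts: 'student'
k.index[0]    # top entry i.e. the category with the highest count: 'blue-collar'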
'student'
'blue-collar'
Now if someone asks us for the category names with frequencies higher than 1500, we can write the
following code to get the same:
1 k.index[k.values>1500]
Even if we write the condition on k itself, by default the condition is applied on the values.
1 k.index[k>1500]
Index(['blue-collar', 'management', 'technician', 'admin.', 'services',
'retired', 'self-employed'],
dtype='object')
The next kind of frequency table that we are interested in when working with categorical variables is
cross-tabulation i.e. frequency of two categorical variables taken together. e.g. lets consider the
cross-tabulation of two categorical variables 'default' and 'housing'.
1 pd.crosstab(bd['default'],bd['housing'])
housing no yes
default
no 19701 24695
yes 380 435
In the above frequency table, we observe that there are 24695 observation where the value for
'housing' is 'yes' and 'default' is 'no'. This is a huge chunk of the population. There is a smaller chunk
of about 435 observations where housing is 'yes' and default is 'yes' as well. Within the observations
where default is 'yes', 'housing' is 'yes' for a higher number of observations i.e. 435 as compared to
where housing is 'no' i.e. 380.
Now, lets say that we want to look at the unique elements as well as the frequency counts of all
categorical variables in the dataset 'bd'.
1 bd.select_dtypes('object').columns
The code above will give us all the column names which are stored as categorical datatypes in the
'bd' dataframe. We can then run a for loop of top of these columns to get whichever summary
statistic we need for the categorical columns.
1 cat_var = bd.select_dtypes('object').columns
2 for col in cat_var:
3 print(bd[col].value_counts())
4 print('~~~~~')
1 blue-collar 9732
2 management 9458
3 technician 7597
4 admin. 5171
5 services 4154
6 retired 2264
7 self-employed 1579
8 entrepreneur 1487
9 unemployed 1303
10 housemaid 1240
11 student 938
12 unknown 288
13 Name: job, dtype: int64
14 ~~~~~
15 married 27214
16 single 12790
17 divorced 5207
18 Name: marital, dtype: int64
19 ~~~~~
20 secondary 23202
21 tertiary 13301
22 primary 6851
23 unknown 1857
24 Name: education, dtype: int64
25 ~~~~~
26 no 44396
27 yes 815
28 Name: default, dtype: int64
29 ~~~~~
30 yes 25130
31 no 20081
32 Name: housing, dtype: int64
33 ~~~~~
34 no 37967
35 yes 7244
36 Name: loan, dtype: int64
37 ~~~~~
38 cellular 29285
39 unknown 13020
40 telephone 2906
41 Name: contact, dtype: int64
42 ~~~~~
43 may 13766
44 jul 6895
45 aug 6247
46 jun 5341
47 nov 3970
48 apr 2932
49 feb 2649
50 jan 1403
51 oct 738
52 sep 579
53 mar 477
54 dec 214
55 Name: month, dtype: int64
56 ~~~~~
57 unknown 36959
58 failure 4901
59 other 1840
60 success 1511
61 Name: poutcome, dtype: int64
62 ~~~~~
63 no 39922
64 yes 5289
65 Name: y, dtype: int64
66 ~~~~~
Many times we do not only want the summary statistics of numeric or categorical variables
individually; we may want a summary of numeric variables within the categories coming from a
categorical variable. e.g. lets say we want the average age of the people who are defaulting their
loan as opposed to people who are not defaulting. This is known as group wise summary.
1 bd.groupby(['default'])['age'].mean()
default
no 40.961934
yes 39.534969
Name: age, dtype: float64
The result above tells us that the defaulters have a slightly lower average age as compared to non-
defaulters.
Looking at median will give us a better idea in case we have many outliers. We notice that the
difference is not much.
1 bd.groupby(['default'])['age'].median()
default
no 39
yes 38
Name: age, dtype: int64
We can group by multiple variables as well. There is no limit on the number and type of variables we
can group by. But generally, we group by categorical variables only.
Also, it is not necessary to give the name of the column for which we want the summary statistic. e.g.
in the code above, we wanted the median of the column 'age'. It is not necessary to specify the
column 'age'. If we do not specify the column, then we get the median of all the numeric
columns grouped by the variable 'default'.
1 bd.groupby(['default']).median()
age balance day duration campaign pdays previous
default
no 39 468 16 180 2 -1 0
yes 38 -7 17 172 2 -1 0
1 bd.groupby(['default','loan']).median()
default loan
Each row in the result above corresponds to one of the 4 groups defined by the two categorical
variables we have grouped by, and each column gives the median value of a numeric variable for that group.
In short, when we do not give a variable to compute the summary statistic e.g. median, we get all the
columns where median can be computed.
Now, lets say we do not want to find the median for all columns, but only for 'day' and 'balance'
columns. We can do that as follows:
1 bd.groupby(['housing','default'])['balance','day'].median()
balance day
housing default
no no 531 17
yes 0 18
yes no 425 15
yes -137 15
If we were to do the visualizations using matplotlib instead of seaborn, we would need to write a lot
more code. Seaborn has functions which wrap up this code, making it simpler.
Note: %matplotlib inline is required only when we use the Jupyter notebook so that visualizations
appear within the notebook itself. Other editors like Spyder or PyCharm do not need this line of
code as part of our script.
What we will cover is primarily to visualize our data quickly which will help us build our machine
learning models.
1 import pandas as pd
2 import numpy as np
3 import seaborn as sns
4 %matplotlib inline
1 file=r'/Users/anjal/Dropbox/PDS V3/Data/bank-full.csv'
2 bd=pd.read_csv(file,delimiter=';')
Density plots
Lets start with density plots for a single numeric variable. We use the distplot() function from the
seaborn library to get the density plot. The first argument will be the variable for which we want to
make the density plot.
1 sns.distplot(bd['age'])
By default, the distplot() function gives a histogram along with the density plot. In case we do not
want the density plot, we can set the argument 'kde' (short for kernel density) to False.
1 sns.distplot(bd['age'], kde=False)
Setting the 'kde' argument to False will not show us the density curve, but will only show the
histogram.
In order to build a histogram, continuous data is split into intervals called bins. The argument 'bins'
lets us set the number of bins, which in turn affects the width of each bin. This argument has a
default value; however, we can increase the number of bins by changing the 'bins' argument.
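The calls themselves are not visible in the pdf; a sketch of the two histograms being compared (the bin counts 15 and 100 are taken from the discussion below; run each call in its own cell):

sns.distplot(bd['age'], kde=False, bins=15)    # fewer, wider bins
sns.distplot(bd['age'], kde=False, bins=100)   # many narrow bins, a finer picture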
Notice the difference in the width of the bins when the argument 'bins' has different values. With
'bins' argument having the value 15, the bins are much wider as compared to when the 'bins' have
the value 100.
How do we decide the number of bins, what would be a good choice? First consider why we need a
histogram. Using a histogram we can get a fair idea where most of the values lie. e.g. considering the
variable 'age', a histogram tells us how people in the data are distributed across different values of
'age' variable. Looking at the histogram, we get a fair idea that most of the people in our data lie in
30 to 40 age range. If we look a bit further, we can also say that the data primarily lies between 25 to
about 58 years age range. Beyond the age of 58, the density falls. We might be looking at typical
working age population.
Coming back to how do we decide the number of bins. Now, we can see that most of the people are
between 30 to 40 years. Here, if we want to dig deeper to see how data is distributed within this
range we increase the number of bins. In other words, when we want to go finer, we increase the
number of bins.
We can see here that between 30 to 40, the people in their early 30's are much more dominant as
compared to the people whose age is closer to 40. One thing to be kept in mind when increasing the
number of bins is that if the number of data points are very low, for example, if we have only 100
data points then it does not make sense to create 50 bins because the frequency bars that we see
will not give us a very general picture.
In short, higher number of bins will give us a finer picture of how the data is behaving in terms of
density across the value ranges. But with very few data points, a higher number of bins may give us
a picture which may not be generalizable.
We can see that the y axis has frequencies. Sometimes it is much easier to look at frequency
percentages. We get frequency percentages by setting the 'norm_hist' argument to True. It basically
normalizes the histograms.
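The call is not visible in the pdf; a sketch of the normalized histogram being described would be:

# y axis now shows normalized frequencies instead of raw counts
sns.distplot(bd['age'], kde=False, norm_hist=True)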
We can see that about 5% of the population lies between the age range of 31 to about 33.
Note: If we want to get more specific, we need to move towards numeric summary of data.
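The object 'myplot' used below is not created in the visible code; presumably it is the axes object returned by one of the seaborn calls above, e.g. (a sketch, using the name expected by the next cell):

myplot = sns.distplot(bd['age'])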
1 myimage = myplot.get_figure()
1 myimage.savefig("output.png")
The moment we do this, the 'output.png' file will appear in the working directory of this script. We can
get this path using the pwd magic command in Jupyter. My 'output.png' file is present in the following path:
1 pwd
'C:\Users\anjal\Dropbox\PDS V3\2.Data_Prep'
The above method of saving images will work with all kinds of plots.
There is a function kdeplot() in seaborn that is used for generating only the density plot. It will only
give us the density plot without the histogram.
1 sns.kdeplot(bd['age'])
We observe that the different plots have their own functions and we simply need to pass the column
which we wish to visualize.
Now, let us see how to visualize the 'age' column using a boxplot.
1 sns.boxplot(bd['age'])
We can also get the above plot with the following code:
1 sns.boxplot(y='age', data=bd)
We get a vertical boxplot since we mentioned y as 'age'. We can also get the horizontal boxplot if we
specify x as 'age'.
1 sns.boxplot(x='age', data=bd)
We can see that there are no extreme values on lower side of the age; however, there are a lot of
extreme values on the higher side of age.
We can use scatterplots to visualize two numeric columns together. The function which helps us do
this is jointplot() from the seaborn library.
The jointplot not only gives us the scatterplot but also gives the density plots along the axis.
We can do a lot more things with the jointplot() function. We observe that the variable 'balance' takes
a value on a very long range but most of the values are concentrated on a very narrow range. Hence
we will plot the data only for those observations for which 'balance' column has values ranging from
0 to 1000.
Note: Putting conditions is not a requirement for us to visualize the data. Since we see that most of
the data lies in a smaller range and because of the few extreme values we may get a distorted plot.
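The plotting code is not visible in the pdf; a sketch of the jointplot on the restricted balance range (the name bd_small is hypothetical) could look like:

# hypothetical filtered data: keep only balances between 0 and 1000
bd_small = bd[(bd['balance'] >= 0) & (bd['balance'] <= 1000)]
sns.jointplot(x='age', y='balance', data=bd_small)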
Using the above code we have no way of figuring out if individual points are overlapping each other.
Setting the argument 'kind' as 'hex' not only shows us the observations but also helps us in knowing
how many observations are overlapping at a point by observing the shade of each point. The darker
the points, more the number of observations present there.
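The call is not shown in the pdf; a sketch (reusing the hypothetical filtered frame bd_small from above) would be:

sns.jointplot(x='age', y='balance', data=bd_small, kind='hex')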
We can observe that most of the observations lie between the age of 30 and 40 and lot of
observations lie between the balance of 0 to 400. As we move away, the shade keeps getting lighter
indicating that the number of observations reduce or the density decreases.
These hex plots are a combination of scatterplots and pure density plots. If we want pure density
plots, we need to change the 'kind' argument to 'kde'.
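Again the call is not shown; a sketch of the pure density version would be:

sns.jointplot(x='age', y='balance', data=bd_small, kind='kde')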
The darker shade indicates that most of the data lies there and as we move away the density of the
data dissipates.
We observe that lmplot is just like scatter plot, but it fits a line through the data by default. (lmplot -
linear model plot)
We can update lmplot to fit higher order polynomials which gives a sense if there exists a non-linear
relationship between the data.
Since the data in the plot above is overlapping a lot, let us consider the first 500 observations only.
We can see that a line fits through these points.
If we update the code above and add an argument 'order=6' we can see that the function has tried
to fit a curve through the data. Since it still mostly looks like a line, so maybe there is a linear trend.
Also, as the line is horizontal to the x axis, there is not much correlation between the two variables
plotted.
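The lmplot calls are not visible in the pdf; a sketch consistent with the discussion (age vs balance, the first 500 rows, and the order=6 variant) would be:

sns.lmplot(x='age', y='balance', data=bd.head(500))            # straight line fit
sns.lmplot(x='age', y='balance', data=bd.head(500), order=6)   # 6th order polynomial fit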
3. Faceting the data
As of now, we are looking at the age and balance relationship across the entire data.
Now we want to see how will the relationship between 'duration' and 'campaign' behave for different
values of 'housing'. We can observe this by coloring the datapoints for different values of 'housing'
using the 'hue' argument. 'housing' variable takes two values: yes and no.
We can see two different fitted lines. The orange one corresponds to 'housing' equal to no and the
blue one corresponding to 'housing' equal to yes.
Now if we wish to divide our data further on the basis of the 'default' column, we can consider using the
'col' argument. The 'default' column takes two values: yes and no.
Now we have 4 parts of the data. Two are given by the color of the points and two more are given by
separate columns. The first column refers to 'default' being no and the second column refers to
'default' being yes.
Observe that there are very few points when 'default' is equal to yes; hence we cannot trust the
fitted relationship there, as there is very little data.
Next, if we wanted to check how does the relationship between the two variables change when we
are looking at different categories for loan. 'loan' also has two values: yes and no.
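The faceted calls are not shown in the pdf; a sketch covering the three splits discussed (which variable goes on which axis, and mapping 'loan' to the row facet, are assumptions) could be:

sns.lmplot(x='duration', y='campaign', data=bd.head(500), hue='housing', col='default', row='loan')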
We observe, that within the group where 'default' is no; the relationship between the two variables
'campaign' and 'duration' is different when 'loan' is yes as compared to when 'loan' is no. Also,
majority of the data points are present where 'loan' and 'default' both are no. There are no data
points where both 'loan' and 'default' is yes. Keep in mind that we are looking at the first 500
observations only. We may get some data points here if we look at higher number of observations.
This is how we can facet the data: after breaking the data into groups, we observe whether the
relationship between the two variables changes.
We want to know how different education groups are present in the data.
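The plot call is not visible in the pdf; a sketch of the frequency plot being described would be:

sns.countplot(x='education', data=bd)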
We observe that the 'secondary' education group is very frequent. There is a small chunk where the
level of education is 'unknown'. The 'tertiary' education group has the second highest count followed
by 'primary'.
We have options to use faceting here as well. We can use the same syntax we used earlier. Lets start
by adding 'hue' as 'loan'.
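A sketch of the faceted version (the call itself is not shown in the pdf):

sns.countplot(x='education', hue='loan', data=bd)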
We observe that each education level is now broken into two parts according to the value of 'loan'.
This is how we can facet using the same options for categorical data.
When we want to see how 'age' behaves across different levels of education, we can use boxplots.
When we make a boxplot only with the 'age' variable we get the following plot, indicating that the
data primarily lies between age 20 and 70 with a few outlying values and the data overall is positively
skewed.
Now, we want to look how the variable 'age' behaves across different levels of 'education'.
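The call is not shown in the pdf; a sketch of the grouped boxplot being described would be:

sns.boxplot(x='education', y='age', data=bd)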
We observe that the 'primary' education group has an overall higher median age. The behavior of the
data points under 'unknown' is similar to those under 'primary' education, indicating that the
'unknown' ones may have a primary education background.
We also observe that people having 'tertiary' education belong to a lower age group. We can infer
that older people could make do with lesser education but the current generation needs higher
levels of education to get by. This could be one of the inferences.
6. Heatmaps
Heatmaps are two dimensional representation of data in which values are represented using colors.
It uses a warm-to-cool color spectrum to describe the values visually.
1 x = np.random.random(size=(20,20))
1 x[:3]
Looking at the data above, it is difficult for us to determine what kind of values it has.
1 sns.heatmap(x)
When we observe the heatmap, wherever the color of a box is light, the value is closer to 1, and the
darker a box gets, the closer its value is to 0. Looking at the visualization above, we get an idea that
the values are more or less random; there does not seem to be any dominance of lighter or darker
boxes.
Now, since we understand the use of colors in heatmaps, lets get back to understanding how does it
help with understanding our 'bd' dataset. Lets say we look at the correlations in the data 'bd'.
1 bd.corr()
age balance day duration campaign pdays previous
Correlation tables can be quite large; the data could easily have 50 variables. Looking at such a
table, it is very difficult to manually check whether correlation exists between the variables. We
can manually check if any of the values in the table above are close to 1 or -1, which indicates high
correlation, or we can simply pass the above table to a heatmap.
1 sns.heatmap(bd.corr())
The visualization above shows that wherever the boxes are very light i.e. near +1, there is high
correlation and wherever the boxes are too dark i.e. near 0, the correlation is low. The diagonal will
be the lightest because each variable has maximum correlation with itself. But for the rest of the
data, we observe that there is not much correlation. However, there seems to be some positive
correlation between 'previous' and 'pdays'. The advantage of using heatmaps is that we do not have
to go through the correlation table to manually check if correlation is present or not. We can visually
understand the same through the heatmap shown above.
Chapter 3 : Introduction to Machine
Learning
We'll start our discussion with one thing that people tend to forget over time: whatever we are
learning is about solving business problems. Lets start with where we begin after we are
given a business problem to work on, in the context of machine learning.
It's a genuine business problem, but on the face of it, it isn't a data problem yet. So, what is a data
problem then? A data problem is a business problem expressed in terms of the data [potentially]
available in the business process pipeline. It has two components:
Response/Goal/Outcome
Set of factor/features data which affects our goal/response/outcome
Lets look at the loan default problem and find these components.
Outcome is loan default, which we would like to predict when considering giving loan to a
prospect.
What factors could help us in doing that? Banks collect a lot of information on a loan application
such as financial data, personal information. In addition to that they also make queries regarding
credit history to various agencies. We could use all these features/information to predict
whether a customer is going to default on their loan or not and then reconsider our decision of
granting loans depending on the result.
Here are a few more business problems which you can try converting to data problems: forecasting
server load, estimating the required number of staff, predicting sales. Machine learning problems
themselves broadly come in two flavours:
1. Supervised
2. Unsupervised
Supervised problems are those which have an explicit outcome, such as default on a loan, required
number of staff, server load etc. Within these, you can see two separate kinds:
1. Regression
2. Classification
Regression problems are those where the outcome is a continuous numeric value e.g. Sales, Rainfall,
Server Load (values over a range with technically infinite unique values possible as outcome, and
having an ordinal relationship [e.g. 100 is twice as much as 50]).
Classification problems on the other hand have their outcome as categories [e.g. good/bad/worse;
yes/no; 1/0 etc.], with a limited number of defined outcomes.
Unsupervised problems are those where there is no explicit measurable outcome associated with
the problem. You just need to find general pattern hidden in the measured factors. Finding different
customer segments or electoral segments or finding latent factors in the data comes under such
problems.
1. Supervised
1. Regression
2. Classification
2. Unsupervised
Our focus in this module will be on Supervised problems with an existing outcome which we are
trying to predict, in the context of a regression or a classification problem.
We want to make use of historical data to extract patterns so that we can build a predictive model
for future/new data.
We want our solution to be as accurate as possible.
Now, what do we mean by pattern? In mathematical terms, we are looking for a function which
takes inputs (the values of the factors affecting our response/outcome) and outputs the value of the
outcome (which we call predictions).
Regression Problem
We'll denote our prediction for observation $i$ as $\hat{y}_i$ and the real value of the outcome in the
historical data given to us as $y_i$. The function that we talk about is denoted by $f$, and the inputs
are collectively denoted as $X$.
It isn't really possible to have a function which will make perfect predictions [in fact it is possible, but
not good, to have a function which makes perfect predictions on the historical data; it's called
over-fitting, we'll keep that discussion for later].
Our predictions are going to have errors. We can calculate those errors easily by comparing our
predictions with the real outcomes: $e_i = y_i - \hat{y}_i$.
We would want this error to be as small as possible across all observations. One way to represent
the error across all observations would be the average error for the entire data. However, a simple
average is going to be meaningless, because errors for different observations might have different
signs; some negative, some positive. The overall average error might well be zero, but that doesn't
mean individual errors don't exist. We need a measure of error across all observations which
doesn't consider the sign of the errors. There are a couple of ideas that we can consider.
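The two formulas were lost in the pdf conversion; written out (with $y_i$ the real outcome, $\hat{y}_i$ the prediction and $n$ the number of observations) they are:

Mean Absolute Error $= \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Mean Squared Error $= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$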
These are called Cost Functions; another popular name for the same is Loss Function. They are also
referred to as just Cost/Loss. In many places, these are not taken as averages but as simple sums
[Sum of Absolute Errors or Sum of Squared Errors]. It doesn't really matter, because the difference
between them is division by a constant (the number of observations); it doesn't affect our pattern
extraction or function estimation for prediction.
Among the two that we mentioned here , Mean Squared Error is more popular in comparison to
Mean Absolute Error due to its easy differentiability. How does that matter? It'll be clear once we
discuss Gradient Descent .
Lets take a pause here and understand , what do these cost functions represent? These represent
our business expectation of model being as accurate as possible in a mathematically quantifiable
way .
We would want our prediction function $f$ to be such that, for the given historical data, the Mean
Squared Error (or whatever other cost function you are considering) is as small as possible.
Classification Problem
It was pretty straightforward to see what we want to predict in a Regression Problem. However, it
isn't that simple for a classification problem, as you might expect. Consider the following data on
customers' Age and their response to a product campaign for an insurance policy.
Age Response Age Response
20 NO 30 NO
20 NO 30 NO
20 NO 30 YES
20 NO 30 YES
20 YES 30 YES
25 NO 35 NO
25 NO 35 YES
25 NO 35 YES
25 YES 35 YES
25 YES 35 YES
If I asked you what the outcome will be if somebody's age is 31, according to the data shown above,
your likely answer is YES, as you can see that the majority of people at the higher ages have said YES.
If I further asked you what the outcome is for somebody with age 34, your response will again be YES.
Now, what's the difference between these two cases? You can easily see that the outcome is more
likely to be YES when the age is higher. This tells us that we are not really interested in the absolute
predictions YES/NO, but in the probability of the outcome being YES/NO for the given inputs.
YES/NO at the end of the day are just labels. We'll consider them to be 1/0 to make our life
mathematically easy, as will be evident in some time. Also, to keep the notation concise, we'll use
just $p$ instead of $P(Y=1 \mid X)$.
Now that we have figured out that we want to predict $p$, lets see how we write that against our
prediction function $f(X)$. We cannot simply write $p = f(X)$, because $f(X)$ can take any real value
whereas $p$ must lie in the range [0,1];
we need to apply some transformation on one of the two sides to ensure that the ranges match. The
sigmoid or logit function is one such transformation which is popular in use.
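A sketch of that transformation (the exact form shown in the original figure is not visible here): the linear score $f(X)$, which can be any real number, is squashed into the range [0,1] with the sigmoid function

$p = \frac{1}{1 + e^{-f(X)}}$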
Finally, we are clear about what we want to predict and how it is written against our prediction
function . We can now start discussing what are our business expectation from this and how do we
represent the same in form of a cost/loss function .
We'll ideally want the probability of the outcome being 1 to be 100% when the outcome in reality is 1,
and the probability of the outcome being 1 to be 0% when the outcome in reality is 0. But, as usual, a
perfect solution is practically not possible. The closest compromise will be that $p$ is as close to 1 as
possible when the outcome in reality is 1, and as close to 0 as possible when the outcome in reality is 0.
Since $p$ takes values in the range [0,1], we can say that when $p$ goes close to 0, it implies that
$1-p$ goes close to 1. So the above expression can be written as: we want $p$ to be close to 1 when
$y=1$, and $1-p$ to be close to 1 when $y=0$.
This is still at the intuition level; we need to find a way to convert this idea into a mathematical
expression which can be used as a cost function. Consider this expression:
$L_i = p_i^{y_i}(1-p_i)^{(1-y_i)}$
This expression is known as the likelihood, because it gives you the probability of the real outcome that
you observed. It takes the value $p_i$ when $y_i = 1$ and the value $1-p_i$ when $y_i = 0$. Using the
idea that we devised above, we can rewrite our requirement as: we want $L_i$ to be as close to 1 as
possible for every observation.
This implies that we want the likelihood to be as high as possible irrespective of what the outcome
(1/0) is. Another way to express that is that we want the likelihood to align with whatever the real
outcome is. As earlier, we are not concerned with the likelihood of a single observation; we are
interested in it at an overall level. Since the likelihood is nothing but a probability, the likelihood of the
entire data is nothing but the multiplication of all the individual likelihoods:
$L = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{(1-y_i)}$
This is our cost function, which we need to maximize: we want an $f$ for which L is as high as
possible for the given historical data. However, this is not the form in which it is used. Since
optimizing a quantity where individual terms are multiplied with each other is a very difficult
problem to solve, we'll instead use log(L) as our cost function, which makes the individual terms get
summed up instead.
To make this a standard minimization problem, instead of maximizing log(L), minimization of -log(L)
is considered.
Finally, the cost function for the classification problem is -log(L), also known as the negative log likelihood.
Other forms of the same which you'll get to see are shown below:
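The forms themselves were lost in the pdf; the usual way of writing the negative log likelihood (also called log loss or binary cross entropy), with $p_i$ the predicted probability for observation $i$, is

$-\log(L) = -\sum_{i=1}^{n} \left[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\right]$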
Estimating $f$
For this discussion we'll consider regression with $f$ as a linear model, but the same idea will be
applicable to any parametric model.
Lets say we are trying to predict the sales of a jewelry shop, considering the gold price that day and
the temperature that day as factors affecting sales. We can consider a linear model like this:
$Sales = \beta_0 + \beta_1 \times (gold\ price) + \beta_2 \times (temperature)$
Here the $\beta$s are some constants. How do we determine what values of the $\beta$s we should consider?
As per the discussion that we have had about business expectations, we would desire the $\beta$s for which
the cost function is as low as possible for the historical data given to us.
We can try out many different values of the $\beta$s and select the ones for which our cost function
comes out to be as low as possible. One way to find good $\beta$s can be to simply compare the values
of the cost function for each combination and pick the one with the lowest cost function value.
$\beta_0$ $\beta_1$ $\beta_2$ Cost Decision
10 5 0 140 Start
-5 -2 4 160 Discard
1 3 7 150 Discard
1 4 6 132 Discard
3 6 -5 141 Discard
We can keep on trying random values of the $\beta$s like that, switching to new values whenever we
encounter a combination which gives us a new low for the cost function.
And we can stop randomly trying new values of the $\beta$s if we don't get any lower value for the
cost function for a long time.
This works in theory, given that we have an infinite amount of time, resources and, most important,
patience! We would want another method, which enables us to change our $\beta$s in such a
way that the cost function always goes down [instead of changing them randomly and discarding
most of the changes].
Gradient Descent
Here in this image we can see that $f$ is a function of a parameter $w$. If we change $w$ by a small
amount $\Delta w$, the function changes by $\Delta f$. As long as we ensure that $\Delta w$ is small,
we can assume that the triangle shown with sides $\Delta w$ and $\Delta f$ is a right angle triangle,
and we can write:
$\Delta f \approx \frac{df}{dw}\Delta w$
If the function depends on many parameters $w_1, w_2, \dots$, then changing those parameters will
individually contribute to changing the function:
$\Delta f \approx \sum_i \frac{\partial f}{\partial w_i}\Delta w_i$
This can be written as a dot product between two vectors, respectively known as the gradient and
the change in parameters:
$\Delta f \approx \nabla f \cdot \Delta w \quad (5)$
This is not some expression that you haven't seen before; in fact, we use it in our daily lives all the
time. It simply means that if we change our parameter by some amount, the change in the function
can be calculated by multiplying the change in the parameter with the gradient of the function w.r.t.
the parameter.
When somebody asks you how much distance you covered if you were travelling at 4 km/minute for
20 minutes, you simply tell them 80 km! Here distance is nothing but a function of time, and the
speed is nothing but the rate of change of distance w.r.t. time; in other words, the gradient of
distance w.r.t. time [the gradient is nothing but a rate of change!].
You must be wondering why we are talking about all of this. Eq (5) is magical: it gives us an idea
about how we can change our parameters such that the cost function always goes down.
Consider
$\Delta w = -\alpha \nabla f \quad (6)$
where $\alpha$ is some small positive constant. If we put this back in (5), lets see what happens:
$\Delta f \approx \nabla f \cdot (-\alpha \nabla f) = -\alpha \|\nabla f\|^2 \leq 0 \quad (7)$
(7) is an amazing result: it tells us that if we change our parameters as per the suggestion in (6), the
change in the function will always be negative (or zero). This gives us a consistent way of changing
our parameters so that our cost function always goes down. Once we start to get near the optimal
value, the gradient of the cost function will tend to zero and our parameters will stop changing.
Now we have a consistent method for starting from random values of the $\beta$s and changing them in
such a way that we eventually arrive at optimal values of the $\beta$s for the given historical data. Lets see
whether it really works or not, with an example in the context of linear regression.
We can write:
predictions $= Xw$ [keep in mind that this is a matrix multiplication]
errors $= e = Y - Xw$
cost $= e^{T}e$ i.e. the sum of squared errors
I have taken the liberty to extend the idea to more features, which you see here as part of the X
matrix. $x_{ij}$ represents the value of the $j^{th}$ variable/feature for the $i^{th}$ observation, and
each column of the X matrix is the multiplier of the corresponding weight in $w$ (the first column is a
column of 1s, the multiplier of the intercept).
Lets see what the gradient of the loss function is. Keep in mind that $X$ and $Y$ here are
numbers/data points from the data. They are not parameters. They are observed values of the
features and the outcome.
gradient $= \frac{\partial\, (e^{T}e)}{\partial w} \propto -X^{T}(Y - Xw) = -X^{T}e$
1 import pandas as pd
2 import numpy as np
1 x1=np.random.randint(low=1,high=20,size=20000)
2 x2=np.random.randint(low=1,high=20,size=20000)
1 y=3+2*x1-4*x2+np.random.random(20000)
You can see that we have generated data such that y is an approximate linear combination of x1 and
x2. Next, we'll calculate the optimal parameter values using gradient descent, compare them with the
results from sklearn, and see how good the method is.
1 x=pd.DataFrame({'intercept':np.ones(x1.shape[0]),'x1':x1,'x2':x2})
2 w=np.random.random(x.shape[1])
Lets write functions for predictions, error, cost and gradient that we discussed above
1 def myprediction(features,weights):
2 predictions=np.dot(features,weights)
3 return(predictions)
4
5 myprediction(x,w)
Note that np.dot here is being used for matrix multiplication. Simple multiplication results in
element-wise multiplication, which is simply wrong in this context.
1 def myerror(target,features,weights):
2 error=target-myprediction(features,weights)
3 return(error)
4 myerror(y,x,w)
1 def mycost(target,features,weights):
2 error=myerror(target,features,weights)
3 cost=np.dot(error.T,error)
4 return(cost)
5
6 mycost(y,x,w)
23139076.992828812
1 def gradient(target,features,weights):
2
3 error=myerror(target,features,weights)
4 gradient=-np.dot(features.T,error)/features.shape[0]
5 return(gradient)
6
7 gradient(y,x,w)
Note that the gradient here is a vector of 3 values because there are 3 parameters. Also, since this is
being evaluated on the entire data, we scaled it down by the number of observations. Do recall that
the approximation which led to these results was that the change in parameters is small. We don't
have any direct control over the gradient, but we can always choose a small value for the learning
rate $\alpha$ to ensure that the change in parameters remains small. However, if we end up choosing
too small a value for $\alpha$, we'll need to take a larger number of steps in order to arrive at the
optimal values of the parameters.
Let's look at the expected values for the parameters from sklearn. Don't worry about the syntax here; we'll discuss that in detail when we formally start with linear models in the next module.
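The cell that actually fits the sklearn model and builds sk_estimates is not shown in this extract. A minimal sketch of how such estimates could be obtained (assuming sk_estimates simply maps column names to fitted coefficients; the intercept column is already part of x, so fit_intercept=False) is:

from sklearn.linear_model import LinearRegression

# fit on the same features used by our gradient descent version
sk_model = LinearRegression(fit_intercept=False)
sk_model.fit(x, y)
sk_estimates = dict(zip(x.columns, sk_model.coef_))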
sk_estimates
When you run the same code, these might be different for you, as we generated the data randomly. Now let's write our own version of this, using gradient descent.
def my_lr(target,features,learning_rate,num_steps,print_when):

    # start with random values of parameters
    weights=np.random.random(features.shape[1])
    # change parameters multiple times in sequence
    # using the cost function gradient which we discussed earlier
    for i in range(num_steps):
        weights -= learning_rate*gradient(target,features,weights)
        # this simply prints the cost function value every (print_when)th iteration
        if i%print_when==0:
            print(mycost(target,features,weights),weights)

    return(weights)

my_lr(y,x,.0001,500,100)
final weights after 500 iterations: array([ 0.75755246, 1.52465372, -3.28073366])
You can see that if we take too few steps, we do not reach the optimal values.
my_lr(y,x,.01,500,100)
You can see that because of the high learning rate, the change in parameters is huge and we end up overshooting the optimal point; the cost function values as well as the parameter values end up exploding. Now let's run with a low learning rate and a higher number of steps.
my_lr(y,x,.0004,100000,10000)
We can see here that we ended up getting pretty good estimates for the w's, as good as the ones from sklearn.

sk_estimates
There are modifications to gradient descent which can achieve the same thing in a much smaller number of iterations. We'll discuss those in detail when we start with the Deep Learning part of the course. For now, we'll conclude this module here.
Chapter 4 : Linear Models
Our last module got a little mathematical, and rightly so, for it forms the basis of much of what follows next. However, that doesn't mean things are going to get even more complex. No, we'll focus more and more on practical aspects of ML as we move forward and see how the gap is bridged between the math and eventually its application in business.
You'll also notice that when it comes to the application of things, it's good to know the math to understand what's going on at the back end; but it's not absolutely necessary. You can always be "not so confident" about the math of things and still implement standard algorithms with much ease and accuracy with what we are going to learn next.
Meaning, you can always come back and have another go at making sense of all the mathematical
jargon associated , but it shouldn't become a hurdle in you advancing through the implementation
and usage of these algorithms. Persevere!
Let's summarize some relevant bits which we are going to refer to time and again.
For Regression :
Predictive Model : prediction = f(X)
Cost/Loss : Σ ( y - f(X) )²   [ sum of squared errors over the observations ]
For Classification :
Predictive Model : P( y = 1 ) = 1 / ( 1 + e^( -f(X) ) )   [ a probability score between 0 and 1 ]
Cost/Loss : -Σ [ y * log(p) + (1 - y) * log(1 - p) ]   where p = P( y = 1 )
In both of these cases f can be any generic function, which we'll be referring to as the Algorithm.
In the case of Linear Models, f is a linear combination of the input variables of the problem ( for both regression and classification ).
It seems our discussion on linear models should be finished here and we should move on to other
parts of the course already. No, not really. There are many unanswered questions here before we
start to build practical, industry level linear models. Here are those questions which we'll address
one by one
1. All the discussions on theoretical aspects have made the very convenient assumption that all the inputs for predicting the outcome are going to be numeric, which is definitely not the case. For example, if we are trying to predict sales of different shops that are part of a retail chain, which city they are located in can be a very important feature to consider. However, in our discussion so far, we have not considered how we convert this inherently categorical information into numbers and use it in our algorithms.
2. How do I know how good my predictions are? Nobody in industry will accept my solution just because I developed it; they need to know how it will tentatively perform before it makes it into production.
3. It seems that if I pass data on 100 inputs/variables to this algorithm to predict the outcome of interest, it will give coefficients for all the inputs. Meaning, all the inputs are going to be part of the predictive model, or in other words, all of them will have some impact on my predictions, irrespective of some of them being junk inputs. Our algorithm should contain some way of either completely removing them from the model or suppressing their impact on our predictions.
In the context of numeric variables we can say that if the variable changes by some amount, then our prediction will change by some constant multiplied by that amount. However this concept of change by an amount just doesn't exist in the context of categorical variables. Consider a case where we are trying to build a predictive model for how much time people take to run a 100 meter dash, given their age and what kind of terrain they run on [hilly/flat]. You can't really say that terrain changed by flat-hill. All that you can say is that people running on a hilly terrain, on an average, will take some more time, assuming the effect of age remains the same across the population. We can depict this difference by using two separate predictive equations for the two terrains :

Time = w0 + w1 * Age            [ flat terrain ]
Time = ( w0 + 5 ) + w1 * Age    [ hilly terrain ]
You can see that these two equations mean that if two people of the same age run on two different terrains, the person running on hills will take 5 seconds more. Does that mean that if I have 100 categories in my categorical variable, I'll have to make 100 different models? Not really, we can easily combine them like this :

Time = w0 + w1 * Age + 5 * Terrain_hill

This one equation represents both of the separate equations seen above. The variable Terrain_hill takes value 1 when the terrain is hill and value 0 when the terrain is flat. This is called a dummy/flag variable.
It is also known as one hot encoded representation of categorical data. Notice that we had two
categories in our categorical variable Terrain , But we need only one dummy variable to represent
it numerically . In general if our categorical variable has n categories , we need only n-1 dummies.
Theoretically it doesn't matter which category we ignore while creating dummies for rest.
One last thing that you need to keep in mind is that, since categories have an average impact, for their estimated average impact from the data to be consistent/reliable they need to have the backing of a good number of observations. Essentially we should ignore categories which have too few observations in the data. Is there a magic number which we should consider as the minimum required number of observations? No, there isn't. A good sane choice will do; you can even experiment with different numbers. A small dummy-creation example follows.
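As a minimal sketch of how n-1 dummies can be created with pandas, here is a toy Terrain column (the column and its values are purely illustrative, not from the case study data):

import pandas as pd

toy = pd.DataFrame({'Terrain': ['hill', 'flat', 'hill', 'flat', 'flat']})
# drop_first=True keeps n-1 dummies for n categories
dummies = pd.get_dummies(toy['Terrain'], prefix='Terrain', drop_first=True)
print(dummies.head())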
The simple answer is that we want to use the model to make predictions on future data, so ideally we'd like to know its performance on that future data beforehand. But there is no way to get that future data; we'll have to make do with whatever training data we have been given.
It doesn't make sense to check the performance of the model on the same data it was trained on. Since the model has already seen the outcome, it will of course perform as well as possible on that same data. That cannot be taken as a measure of its performance on unseen data. There are two ways around this, both of which have their pros/cons.
This is one of the simpler ways, but not without its flaws. You break your training data randomly into two parts, build the model on one and test its performance on the other. Generally the training data is broken 80:20, the larger part to be used for training and the smaller one to test performance. But again, that isn't a magic ratio. The idea is to keep a small sample separate from the training data, but not too small ( so that it still contains patterns similar to the overall data ) and not too big ( so that the remaining training portion does not end up with patterns different from the overall data ).
Cons : We don't have a good idea about what the optimal ratio is that balances not-too-big against not-too-small. Also, this random sample might be a niche part of the data having a very different pattern from the rest, and in that case the measured performance will not be a good representative of the real performance of the model.
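As a quick illustration of such a random split, here is a minimal sketch with sklearn's train_test_split on toy data (the arrays here are made up just for the example):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.random((100, 3))
y = np.random.random(100)

# an 80:20 random split; the larger part trains the model, the smaller one tests it
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_val.shape)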
Cross-Validation :
Instead of breaking it into two parts , we break it into K parts. A good value of K is in the range 10-
15. You'll have a better understanding of why , once we are done with this discussion .
The problem with simply breaking the data into two parts was that we can't ensure that the smaller validation set really resembles the overall data; performance results on it will vary from the real performance. To counter that we break the data into K parts and build K models, every time leaving one of the parts out as the validation sample. This gives us out-of-sample performance measures for these K parts. Instead of using any one of them as the representative tentative performance, we take the average of these performance measures. Upon averaging, the variations cancel out and we have a more representative measure of performance. We can also look at the variance of this performance to assess how stable it is across the data.
Taking K as 10 means that, at a time, 10% of the data is not involved in the modeling process; it's fair to say the remaining 90% of the data will still have the same patterns as the overall data. This number goes up as we break our data into more parts by taking a higher value of K, but that has the downside of increasing the number of models we build ( increasing the time taken ).
Pros : Gives a better measure of performance, along with variance as a stability measure.
Cons : It takes time/resources, and might even be infeasible for very large datasets.
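To make the mechanics concrete, here is a minimal sketch of K-fold cross validation using sklearn's KFold on toy data (all names and numbers here are illustrative, not from the case study below):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# toy regression data
X = np.random.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + np.random.random(200)

fold_errors = []
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_errors.append(mean_absolute_error(y[val_idx], preds))

# average out-of-sample error across folds, plus its variability
print(np.mean(fold_errors), np.std(fold_errors))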
Problem Statement : We have been given data for people applying for a loan to a peer-to-peer
lending firm. The data contains details of loan application and eventually how much interest rate
was offered to people by the platform. Our solution needs to be able to predict the interest rate
which will be offered to people , given their application detail as inputs. Lets look at the training data
given to us
# imports for processing data
import pandas as pd
import numpy as np
# imports for suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# provide the complete path for the file which contains your data
# r at the beginning is used to ensure that the path is considered a raw string and
# you don't get a unicode error because of special characters combined with \ or /
file=r'/Users/lalitsachan/Dropbox/0.0 Data/loan_data_train.csv'
ld_train=pd.read_csv(file)
ld_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 15 columns):
ID                                2199 non-null float64
Amount.Requested                  2199 non-null object
Amount.Funded.By.Investors        2199 non-null object
Interest.Rate                     2200 non-null object
Loan.Length                       2199 non-null object
Loan.Purpose                      2199 non-null object
Debt.To.Income.Ratio              2199 non-null object
State                             2199 non-null object
Home.Ownership                    2199 non-null object
Monthly.Income                    2197 non-null float64
FICO.Range                        2200 non-null object
Open.CREDIT.Lines                 2196 non-null object
Revolving.CREDIT.Balance          2197 non-null object
Inquiries.in.the.Last.6.Months    2197 non-null float64
Employment.Length                 2131 non-null object
dtypes: float64(3), object(12)
memory usage: 257.9+ KB
Lets comment on each of these variables one by one , as to what we are going to do before we start
building our model
ID : It doesn't make sense to include unique identifiers of the observation (ID vars) as input. We'll
drop this column
Amount.Funded.By.Investors : This information, although present in the data, will not come with
loan application. If we want to build a model for predicting Interest.Rate using loan application
characteristics , then we can not include this information in our model. We'll drop this column
Interest.Rate, Debt.To.Income.Ratio : These come as object type because of the % symbol contained in them. We'll first remove the % sign and then convert them to numeric type.
ld_train['Interest.Rate'].head()
0 18.49%
1 17.27%
2 14.33%
3 16.29%
4 12.23%
ld_train['Debt.To.Income.Ratio'].head()
0 27.56%
1 13.39%
2 3.50%
3 19.62%
4 23.79%
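The cell that actually strips the % sign and converts these two columns to numeric is not visible in this extract; a minimal sketch consistent with the description above would be:

for col in ['Interest.Rate', 'Debt.To.Income.Ratio']:
    # remove the % symbol and convert to numbers
    ld_train[col] = ld_train[col].str.replace('%', '', regex=False)
    ld_train[col] = pd.to_numeric(ld_train[col], errors='coerce')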
ld_train['State'].value_counts(dropna=False).head()
# only partial results are shown. to see the full results, remove .head()
CA 376
NY 231
FL 149
TX 146
PA 88
ld_train['Home.Ownership'].value_counts(dropna=False).head()
MORTGAGE 1018
RENT 999
OWN 177
OTHER 4
NONE 1
ld_train['Loan.Length'].value_counts(dropna=False)
36 months 1722
60 months 476
. 1
NaN 1
ld_train['Loan.Purpose'].value_counts(dropna=False).head()
# only partial results are shown. to see the full results, remove .head()
debt_consolidation 1147
credit_card 394
other 174
home_improvement 135
major_purchase 84
FICO.Range : This comes as object type because the values are written as numeric ranges in the data. As such, we could convert this to dummies, but then we would not be using the information contained in the order of the values. We'll instead take the average of the given range using string processing.
ld_train['FICO.Range'].value_counts().head()
670-674 151
675-679 144
680-684 141
695-699 138
665-669 129
Employment.Length : This takes type object because it contains numeric values written in words. We can again choose to work with it like a categorical variable, but then we'll end up losing the information on the order of the values.
ld_train['Employment.Length'].value_counts(dropna=False)
# Processing FICO.Range
k=ld_train['FICO.Range'].str.split("-",expand=True).astype(float)
ld_train['fico']=0.5*(k[0]+k[1])
del ld_train['FICO.Range']
# Processing Employment.Length

ld_train['Employment.Length']=ld_train['Employment.Length'].str.replace('years','')
ld_train['Employment.Length']=ld_train['Employment.Length'].str.replace('year','')
ld_train['Employment.Length']=np.where(ld_train['Employment.Length'].str[0]=='<',0,
                                       ld_train['Employment.Length'])
ld_train['Employment.Length']=np.where(ld_train['Employment.Length'].str[:2]=='10',10,
                                       ld_train['Employment.Length'])
ld_train['Employment.Length']=pd.to_numeric(ld_train['Employment.Length'],errors='coerce')
# cat_col below is the list of categorical columns to be dummified ( State, Home.Ownership,
# Loan.Length, Loan.Purpose ); the cell defining it is not visible in this extract
for col in cat_col :
    # calculate frequency of categories in the column
    k=ld_train[col].value_counts(dropna=False)
    # ignore categories with too low frequencies and then select n-1 of them to create dummies for
    cats=k.index[k>50][:-1]
    # creating dummies for the remaining categories
    for cat in cats:
        # creating the name of the dummy column corresponding to the category
        name=col+'_'+cat
        # adding the column to the data
        ld_train[name]=(ld_train[col]==cat).astype(int)
    # removing the original column once we are done creating dummies for it
    del ld_train[col]
ld_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 31 columns):
Amount.Requested                   2195 non-null float64
Interest.Rate                      2200 non-null float64
Debt.To.Income.Ratio               2199 non-null float64
Monthly.Income                     2197 non-null float64
Open.CREDIT.Lines                  2193 non-null float64
Revolving.CREDIT.Balance           2195 non-null float64
Inquiries.in.the.Last.6.Months     2197 non-null float64
Employment.Length                  2130 non-null float64
fico                               2200 non-null float64
State_CA                           2200 non-null int64
State_NY                           2200 non-null int64
State_FL                           2200 non-null int64
State_TX                           2200 non-null int64
State_PA                           2200 non-null int64
State_IL                           2200 non-null int64
State_GA                           2200 non-null int64
State_NJ                           2200 non-null int64
State_VA                           2200 non-null int64
State_MA                           2200 non-null int64
State_NC                           2200 non-null int64
State_OH                           2200 non-null int64
State_MD                           2200 non-null int64
State_CO                           2200 non-null int64
Home.Ownership_MORTGAGE            2200 non-null int64
Home.Ownership_RENT                2200 non-null int64
Loan.Length_36 months              2200 non-null int64
Loan.Purpose_debt_consolidation    2200 non-null int64
Loan.Purpose_credit_card           2200 non-null int64
Loan.Purpose_other                 2200 non-null int64
Loan.Purpose_home_improvement      2200 non-null int64
Loan.Purpose_major_purchase        2200 non-null int64
dtypes: float64(9), int64(22)
memory usage: 532.9 KB
All of the columns in the data are now numeric. Just one more thing to take care of before we start with the modeling process : we need to make sure that there are no missing values in the data. If there are, then they need to be imputed.
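The cell producing the per-column missing-value counts shown below is not visible in this extract; it is presumably just:

ld_train.isnull().sum()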
Amount.Requested                   5
Interest.Rate                      0
Debt.To.Income.Ratio               1
Monthly.Income                     3
Open.CREDIT.Lines                  7
Revolving.CREDIT.Balance           5
Inquiries.in.the.Last.6.Months     3
Employment.Length                  70
fico                               0
State_CA                           0
State_NY                           0
State_FL                           0
State_TX                           0
State_PA                           0
State_IL                           0
State_GA                           0
State_NJ                           0
State_VA                           0
State_MA                           0
State_NC                           0
State_OH                           0
State_MD                           0
State_CO                           0
Home.Ownership_MORTGAGE            0
Home.Ownership_RENT                0
Loan.Length_36 months              0
Loan.Purpose_debt_consolidation    0
Loan.Purpose_credit_card           0
Loan.Purpose_other                 0
Loan.Purpose_home_improvement      0
Loan.Purpose_major_purchase        0
dtype: int64
First we'll use the first method of validation, where we break the data into two parts before we start building the model.
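The cells that impute the missing values and create the two parts t1 and t2 are not visible in this extract; a minimal sketch consistent with what follows (mean imputation and a roughly 80:20 random split are assumptions here) is:

# fill missing values with column means ( one simple choice; other imputations are possible )
for col in ld_train.columns:
    if ld_train[col].isnull().sum() > 0:
        ld_train[col] = ld_train[col].fillna(ld_train[col].mean())

# random 80:20 split into t1 ( used for training ) and t2 ( used for testing performance )
t1 = ld_train.sample(frac=0.8, random_state=2)
t2 = ld_train.drop(t1.index)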
we need to separate predictors (input/x vars) and target before we pass them to scikit-learn
functions for building our linear regression model
x_train=t1.drop('Interest.Rate',axis=1)
y_train=t1['Interest.Rate']
x_test=t2.drop('Interest.Rate',axis=1)
y_test=t2['Interest.Rate']
# import the function for Linear Regression
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
# fit function builds the model ( parameter estimation etc )
lr.fit(x_train,y_train)
lr.intercept_
76.32100980688705
list(zip(x_train.columns,lr.coef_))
[('Amount.Requested', 0.00016206823832747567),
('Debt.To.Income.Ratio', -0.005217635523226176),
('Monthly.Income', -4.0339147296499624e-05),
('Open.CREDIT.Lines', -0.03015894666977517),
('Revolving.CREDIT.Balance', -1.7860242337434248e-06),
('Inquiries.in.the.Last.6.Months', 0.32786067084992604),
('Employment.Length', 0.02325230583339998),
('fico', -0.08716732836779102),
('State_CA', -0.16231739562106098),
('State_NY', -0.14426278883807817),
('State_FL', -0.11716306311499997),
('State_TX', 0.4481165264861161),
('State_PA', -0.9332596674212796),
('State_IL', -0.4048740473139449),
('State_GA', -0.33202157322249337),
('State_NJ', -0.49634957660360035),
('State_VA', -0.13349751801583823),
('State_MA', -0.1634714204731154),
('State_NC', -0.47136779712009375),
('State_OH', -0.40429922213664504),
('State_MD', -0.1292878863756837),
('State_CO', 0.10071894446013128),
('Home.Ownership_MORTGAGE', -0.5636395222756556),
('Home.Ownership_RENT', -0.27130802518538744),
('Loan.Length_36 months', -3.1821676438146373),
('Loan.Purpose_debt_consolidation', -0.482384755055442),
('Loan.Purpose_credit_card', -0.5726731705822421),
('Loan.Purpose_other', 0.35159491851815755),
('Loan.Purpose_home_improvement', -0.4952547468027438),
('Loan.Purpose_major_purchase', -0.2391664596860732)]
predicted_values=lr.predict(x_test)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(predicted_values,y_test)
1.6531699740032333
This means we can assume that, tentatively, our model will be off by 1.65 units on average while predicting interest rates on the basis of a loan application. Now let's see how we can do cross-validation with sklearn tools. Keep in mind that we don't need to break our data ourselves in this process; the sklearn function takes care of that internally. All that we need to do is to separate the target and the predictors.
x_train=ld_train.drop('Interest.Rate',axis=1)
y_train=ld_train['Interest.Rate']
from sklearn.model_selection import cross_val_score

errors = np.abs(cross_val_score(lr,x_train,y_train,cv=10,scoring='neg_mean_absolute_error'))
# cv=10 means 10 fold cross validation
# regarding scoring functions, the general theme in scikit-learn is : higher is better
# to remain consistent with that, instead of mean_absolute_error the available
# function for regression is neg_mean_absolute_error
# we can always wrap that and take positive values [ with np.abs ]
errors

avg_error=errors.mean()
error_std=np.std(errors)
avg_error,error_std
(1.6052495976597005, 0.10838823631170523)
Note : you cannot check performance on the separately provided test data if it has no response column given. To score such a test file there are two options :
1. Combine training and test from the very beginning and then separate them once the data prep is done. Build the model on train and make predictions on test.
2. Build a data-prep pipeline which gives the same results for both train and test ( we'll learn about this in later modules ).
# add an identifier column to both files so that they can be separated later on

ld_test['data']='test'
ld_train['data']='train'

# combine them
ld_all=pd.concat([ld_train,ld_test],axis=0)

# carry out the same data prep steps on ld_all as we did for ld_train

#~~~~~~~~~~ data prep on ld_all ~~~~~~~~~~~~~~

# make sure that you don't end up making dummies for the column 'data'
# now separate them

ld_train=ld_all[ld_all['data']=='train']
ld_test=ld_all[ld_all['data']=='test']

ld_test.drop(['data','Interest.Rate'],axis=1,inplace=True)
del ld_train['data']
del ld_all
Now you can build the model on ld_train, use the same model to make predictions on ld_test and make your submission. Since both the datasets have gone through the same data prep process, they'll have the same columns in them.
Let's pick one input, say x_j, to comment on. The coefficient w_j associated with x_j simply means that if x_j changes by 1 unit, my prediction will change by w_j units. If w_j is +ve, then my prediction will increase as x_j increases; if w_j is -ve, my prediction will decrease as x_j increases [ pointing to a negative linear relationship ].
But if x_j was a junk variable, it should not have any impact on my prediction. In order for that to happen, the coefficient associated with it should be zero or very close to zero.
Both models are fit by minimizing the sum of squared residuals. Think of model one as using some set of inputs, and model two as using the same set plus one additional variable x_new with coefficient w_new; for model one we want to minimize Σ ( y - prediction )² over its coefficients. Let's say we have found the best estimates for model one. Then we can obtain that exact same residual sum of squares in model two by choosing the same values for the shared coefficients and letting w_new = 0. Now we can, possibly, find an even lower residual sum of squares by searching for a better value of w_new.
To summarize, the models are nested, in the sense that everything we can model with model one can be matched by model two; model two is more general than model one. So, in the optimization, we have larger freedom with model two and can always find a solution at least as good.
This has really nothing to do with statistics; it is a general fact about optimization. It extrapolates to the following conclusion : if we add any variable to our model/data [ junk or not junk ], the cost function will always decrease. However, for junk variables it will go down by a small amount, and for good variables [ which really do impact our target ] it goes down by a large amount.
Regularisation
The problem with junk vars having a coefficient is that their effect on the model doesn't remain consistent. Having a lot of junk vars might lead to a model which performs entirely differently on the training data and the eventual validation/test data. That defeats the purpose of building the model in the first place. This situation, where our prediction model performs very well on the training data but not so well on the test/validation set, is called Overfitting . Ways to reduce this problem and make our model more generalizable are called Regularisation .
Many a time this is achieved by modifying our cost/loss formulation such that it reduces the impact of junk vars, and in general reduces the impact of over-fitting by ensuring that our predictive model extracts the most generic patterns from the data instead of simply memorizing the training data.
We want to reduce the impact of junk vars; that can simply be achieved by making their coefficients as close to zero as possible. We can achieve that by adding a penalty on the size of the w's to the cost function. Let's understand how that works. Consider two variables x1 and x2. Both of them contribute to a decrease in our traditional loss L; however, x1 results in a higher decrease in comparison to x2.
Let's say the contribution of x1 towards the decrease of L is 10, whereas for x2 it is 0.5. Now we are going to add a penalty on the size of the parameters to our loss formulation. Our new loss function will look like this :

new loss = L + Penalty( w )
This penalty can be anything as long as it serves the purpose of making our model more generalizable. We'll discuss some popular formulations, but those are by no means the only ones which you are theoretically limited to use.
Consider this one, known as the L2 penalty [ also known as Ridge regression in the context of linear regression models ] :

new loss = L + λ * Σ w²   [ the sum runs over all the coefficients ]
We'll expand on the role of λ here in a bit; for now let's take it to be 1, and let's also say that, to start with, the coefficients for x1 and x2 are both 2. Notice that the penalty is always positive, meaning it will increase the loss function for all the vars; the higher the parameter size [ absolute value ], the higher the increase in the loss. Now, in light of the new loss formulation, the decrease due to x1 will be ( 10 - 1*2² = 6 ), and for x2 it'll be ( 0.5 - 1*2² = -3.5 ).
Clearly the new loss formulation does not decrease because of x2; during optimization of the loss, the coefficient for x2 will be pushed close to zero until the decrease because of it becomes positive.
There are many such penalties we'll come across in our course discussions. They'll mainly be variations of two of them, namely the L1 and L2 penalties.
There is one basic difference in the impact of using either of the penalties on the cost function. Let's consider the case of having a single parameter w; the same result extrapolates to a higher number of parameters as well.
Consider the cost function with the L2 penalty for linear regression ( in the matrix format discussed in the earlier module ) :

cost = ( Y - Xw )ᵀ ( Y - Xw ) + λ w²

We'll calculate the gradient and equate it to zero to determine our parameter value :

-2 Xᵀ ( Y - Xw ) + 2 λ w = 0    ⇒    w = XᵀY / ( XᵀX + λ )

You can see that, by using a higher and higher value of λ, you can make the parameter very close to zero, but there is no way to make it exactly zero.
Let's see what happens if we use the L1 penalty instead. Here is the cost function with the L1 penalty :

cost = ( Y - Xw )ᵀ ( Y - Xw ) + λ |w|

Let's assume that w > 0 for this discussion; you can do the exercise for -ve w as well and reach the same conclusion. For +ve w, the cost function is :

cost = ( Y - Xw )ᵀ ( Y - Xw ) + λ w

We'll equate the gradient to zero here also and see what happens :

-2 Xᵀ ( Y - Xw ) + λ = 0    ⇒    w = ( 2 XᵀY - λ ) / ( 2 XᵀX )

You can see that in this case there does exist a value of λ ( namely λ = 2 XᵀY ) for which the parameter estimate becomes exactly 0.
Conclusion :
If we take λ = 0, it simply leads to zero penalty and our traditional cost/loss formulation. If we take λ → ∞, all the w's will have to be zero, leading to complete reduction of the model ( or no model ). There is no mathematical formula for the best value of λ; it's different for different datasets. What we can do is try out different values of λ, look at their cross-validated performance, and choose the one for which the cross-validated performance is best. Let's see how to do that [ extending the example taken earlier ] with sklearn functions.
# this is the model for which we are trying to estimate the best value of lambda
from sklearn.linear_model import Ridge
model=Ridge(fit_intercept=True)
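The cell that defines the parameter grid and the grid search object is not visible in this extract; based on the Lasso example later in this chapter, it would look something like the sketch below (the exact range of lambdas is an assumption):

from sklearn.model_selection import GridSearchCV

# in sklearn's Ridge, the penalty weight lambda is called alpha
lambdas=np.linspace(1,100,100)
params={'alpha':lambdas}
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')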
grid_search.fit(x_train,y_train)

grid_search.best_estimator_
This is the best estimator [ with the best value of λ ]. We can directly use this if we want, or look at other values with similar performance and pick the one with the least variance ( the most stable model ). Let's look at the custom function report, which will enable us to extract more detailed results from the grid_search.cv_results_ object, a huge dictionary containing information on the performance of all the parameter combinations that we experimented with.
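The definition of the custom report function is not included in this extract. One minimal version that prints the top n parameter combinations by mean cross-validated score (one possible implementation, not necessarily the author's exact one) is:

def report(results, n_top=3):
    # results is the grid_search.cv_results_ dictionary
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.5f} (std: {1:.5f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")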
report(grid_search.cv_results_,3)
The performance of these is pretty similar ( it looks identical due to us limiting the display to 5 decimal digits ), and there isn't much difference in stability either. We can go with the best one. The tentative performance measure is off by 1.60 units on average.
If you want to make predictions, you can directly use grid_search.predict; by default it refits the model with the best parameter choices on the entire data. However, if you want to look at the coefficients, you'll have to fit the model separately.
ridge=grid_search.best_estimator_

ridge.fit(x_train,y_train)

ridge.intercept_

75.47046753682933
list(zip(x_train.columns,ridge.coef_))
[('Amount.Requested', 0.00016321376320470183),
('Debt.To.Income.Ratio', -0.0016649630043797535),
('Monthly.Income', -2.7907207451034204e-05),
('Open.CREDIT.Lines', -0.03648076179235473),
('Revolving.CREDIT.Balance', -2.683721318784198e-06),
('Inquiries.in.the.Last.6.Months', 0.3445032296978588),
('Employment.Length', 0.019601482704651712),
('fico', -0.08659781419319143),
('State_CA', -0.14115706864241717),
('State_NY', -0.10835699414412378),
('State_FL', -0.0112049655199493),
('State_TX', 0.4475943683966697),
('State_PA', -0.38750987879294396),
('State_IL', -0.47889963877418085),
('State_GA', -0.14818704990810666),
('State_NJ', -0.29109377437441625),
('State_VA', -0.05256231331588017),
('State_MA', -0.03973819224423696),
('State_NC', -0.342099870504766),
('State_OH', -0.202440726282369),
('State_MD', -0.023012728967286556),
('State_CO', 0.09019079073895374),
('Home.Ownership_MORTGAGE', -0.3571292206883735),
('Home.Ownership_RENT', -0.13033273805328766),
('Loan.Length_36 months', -3.013912446966353),
('Loan.Purpose_debt_consolidation', -0.41916847523850287),
('Loan.Purpose_credit_card', -0.5187875249317675),
('Loan.Purpose_other', 0.36593813217880045),
('Loan.Purpose_home_improvement', -0.3133150928985485),
('Loan.Purpose_major_purchase', -0.06049720536796199)]
You can see that there is no reduction in the number of model coefficients. However, if you compare them with the coefficients obtained from simple linear regression without a penalty, you'll find that many of them have been suppressed by a good factor.
list(zip(x_train.columns,np.round(lr.coef_/ridge.coef_,2)))
[('Amount.Requested', 0.99),
('Debt.To.Income.Ratio', 3.13),
('Monthly.Income', 1.45),
('Open.CREDIT.Lines', 0.83),
('Revolving.CREDIT.Balance', 0.67),
('Inquiries.in.the.Last.6.Months', 0.95),
('Employment.Length', 1.19),
('fico', 1.01),
('State_CA', 1.15),
('State_NY', 1.33),
('State_FL', 10.46),
('State_TX', 1.0),
('State_PA', 2.41),
('State_IL', 0.85),
('State_GA', 2.24),
('State_NJ', 1.71),
('State_VA', 2.54),
('State_MA', 4.11),
('State_NC', 1.38),
('State_OH', 2.0),
('State_MD', 5.62),
('State_CO', 1.12),
('Home.Ownership_MORTGAGE', 1.58),
('Home.Ownership_RENT', 2.08),
('Loan.Length_36 months', 1.06),
('Loan.Purpose_debt_consolidation', 1.15),
('Loan.Purpose_credit_card', 1.1),
('Loan.Purpose_other', 0.96),
('Loan.Purpose_home_improvement', 1.58),
('Loan.Purpose_major_purchase', 3.95)]
from sklearn.linear_model import Lasso
model=Lasso(fit_intercept=True)

grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')

grid_search.fit(x_train,y_train)

grid_search.best_estimator_
You can see that the best value came out to be at the lower end of the range, so we'll expand the range on that side.
lambdas=np.linspace(0.001,2,200)
params={'alpha':lambdas}
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring='neg_mean_absolute_error')
grid_search.fit(x_train,y_train)

grid_search.best_estimator_
This value is inside the range, so we're good. Let's look at the cross-validated performance of some of the top few models.
report(grid_search.cv_results_,3)
Model with rank: 1
Mean validation score: -1.60128 (std: 0.11647)
Parameters: {'alpha': 0.011045226130653268}
There is no dramatic improvement in performance. Now let's see if there is any model reduction ( coefficients becoming exactly 0 ).
lasso=grid_search.best_estimator_
lasso.fit(x_train,y_train)

list(zip(x_train.columns,lasso.coef_))
[('Amount.Requested', 0.00016023851703962787),
('Debt.To.Income.Ratio', -0.0009942600081876468),
('Monthly.Income', -2.7304512141858614e-05),
('Open.CREDIT.Lines', -0.036990231977433084),
('Revolving.CREDIT.Balance', -2.8844495569785946e-06),
('Inquiries.in.the.Last.6.Months', 0.33452489786480466),
('Employment.Length', 0.015521998102599638),
('fico', -0.08654353305995562),
('State_CA', -0.0),
('State_NY', -0.0),
('State_FL', 0.0),
('State_TX', 0.4213181413677211),
('State_PA', -0.07076519997728142),
('State_IL', -0.16988773003610938),
('State_GA', -0.0),
('State_NJ', -0.0),
('State_VA', 0.0),
('State_MA', 0.0),
('State_NC', -0.0),
('State_OH', -0.0),
('State_MD', 0.0),
('State_CO', 0.0),
('Home.Ownership_MORTGAGE', -0.2085859940218407),
('Home.Ownership_RENT', -0.0),
('Loan.Length_36 months', -3.074433736924612),
('Loan.Purpose_debt_consolidation', -0.2540228534857545),
('Loan.Purpose_credit_card', -0.32972568683630343),
('Loan.Purpose_other', 0.3971422404793555),
('Loan.Purpose_home_improvement', -0.022332982889820444),
('Loan.Purpose_major_purchase', 0.0)]
You can see that many of the coefficients have become exactly zero; a much smaller model gives you similar performance. In this case, amongst all of them, the Lasso model is the best, considering its size and performance. That doesn't mean Lasso or the L1 penalty will always result in the best model; it depends on the data.
Note : Ridge and Lasso regression are just different ways of estimating the parameters. The eventual prediction model is linear in both, just as in simple linear regression.
In medical field, the classification task could be assigning a diagnosis to a given patient as
described by observed characteristics of the patient such as age, gender, blood pressure, body
mass index, presence or absence of certain symptoms, etc.
In banking sector, one may want to categorize hundreds or thousands of applications for new
cards containing information for several attributes such as annual salary, outstanding debts, age
etc., into users who have good credit or bad credit for enabling a credit card company to do
further analysis for decision making; OR one might want to learn to predict whether a particular
credit card charge is legitimate or fraudulent.
In social sciences, we may be interested to predict the preference of a voter for a party based on
: age, income, sex, race, residence state, votes in previous elections etc.
In finance sector, one would require to ascertain , whether a vendor is credit worthy?
In insurance domain, the company will need to assess , Is the submitted claim fraudulent or
genuine?
In Marketing, the marketer would like to figure out , Which segment of consumers are likely to
buy?
All of the problems listed above use same underlying algorithms, despite them being pretty different
from each other on the face of it.
In this module we'll look at logistic regression, where the probability is modeled as P( y = 1 ) = 1 / ( 1 + e^( -f(X) ) ) and f(X) is a linear combination of the variables, as discussed earlier. Here is the problem statement we'll be working on.
Problem Statement :
A financial institution is planning to roll out a stock market trading facilitation service for their
existing account holders. This service costs significant amount of money for the bank in terms of
infra, licensing and people cost. To make the service offering profitable, they charge a percentage
base commission on every trade transaction. However this is not a unique service offered by them,
many of their other competitors are offering the same service and at lesser commission some times.
To retain or attract people who trade heavily on stock market and in turn generate a good
commission for institution, they are planning to offer discounts as they roll out the service to entire
customer base.
The problem is that this discount hampers the profits coming from the customers who do not trade in large quantities. To tackle this issue, the company wants to offer discounts selectively. To be able to do so, they need to know which of their customers are going to be heavy traders or money makers for them.
To be able to do this, they decided to do a beta run of their service on a small chunk of their customer base [ approx 10000 people ]. These customers have been manually divided into two revenue categories, 1 and 2. Revenue category 1 customers are the money makers for the bank; revenue category 2 customers are the ones which need to be kept out of the discount offers.
We need to use this study's data to build a prediction model which should be able to identify whether a customer is potentially eligible for discounts [ falls in revenue grid category 1 ]. Let's get the data and begin.
Classification Model Evaluation and Probability Scores to
Hard Classes
Before we jump in head first into building our model, we need to figure out a couple of things about classification problems in general. We discussed earlier that prediction models for classification output probabilities as the outcome. In many cases that will suffice as a way of scoring our observations from most likely to least likely. However, in many cases we'd eventually need to convert those probability scores to hard classes.
The way to do that is to figure out a cut-off/threshold for the probability scores, where we can say that observations with a higher score than this will be classified as 1 and the others will be classified as 0. Now the question becomes : how do we come up with this cut-off?
Irrespective of what cut-off we choose, the hard class prediction rule is fixed : if the probability score is higher than the cutoff then the prediction is 1, otherwise 0 [ given we are predicting the probability of the outcome being 1 ]. Any cut-off decision results in some of our predictions being true and some of them being false. Since there are only two hard classes, we'll have 4 possible cases.
When we predict 1 [+ve] but in reality the outcome is 0[-ve] : False Positive
When we predict 0 [-ve] but in reality the outcome is 1[+ve] : False Negative
When we predict 1 [+ve] and in reality the outcome is 1[+ve] : True Positive
When we predict 0 [-ve] and in reality the outcome is 0[-ve] : True Negative
                   Positive_predicted    Negative_predicted
Positive_real              TP                    FN
Negative_real              FP                    TN
Using these figures we can come up with couple of popular measurements in context of
classification
Accuracy = ( TP + TN ) / ( TP + TN + FP + FN )
Sensitivity or Recall = TP / ( TP + FN )
Specificity = TN / ( TN + FP )
Precision = TP / ( TP + FP )
Some might suggest that we can take any one of these, measure it for all candidate scores, and pick as our cutoff the score for which the chosen measurement is highest on the training data. However, none of the above measurements takes care of our business requirement of a good separation between the two classes. Here are the issues associated with each of them individually, if we consider them as candidates for determining cutoffs.
Accuracy : This works only if our classes are roughly equally present in the data. However, generally this is not the case in many business problems. For example, consider a campaign response model. A typical response rate is 0.5-2%. Even if we predict that none of the customers are going to subscribe to our campaign, accuracy will be in the range 98-99.5%, which is quite misleading.
Sensitivity or Recall : We can arbitrarily make Sensitivity/Recall 100% by predicting all the cases as positive. That is kind of useless because it doesn't take into account that labeling a lot of negative cases as positive should be penalized too.
Specificity : We can arbitrarily make Specificity 100% by predicting all the cases as negative. That is kind of useless because it doesn't take into account that labeling a lot of positive cases as negative should be penalized too.
Precision : We can make Precision very high by keeping the cut-off very high and thus predicting only a few sure-shot cases as positive, but in that scenario our Recall will be very poor.
So it turns out that these measures are a good look into how our model is doing, but none of them taken individually can be used to determine a proper cut-off. Following are some proper measures which give equal weight to the goal of capturing as many positives as possible and, at the same time, not labeling too many negative cases as positives.

KS = TP / ( TP + FN )  -  FP / ( FP + TN )    [ i.e. True Positive Rate - False Positive Rate ]
You can see that KS will be highest when recall is high but, at the same time, we are not labeling a lot of negative cases as positive. We can calculate KS for all the scores and choose as our ideal cut-off the score for which KS is maximum.
There is another measure which lets you give different weights to your precision and recall. There can be cases in real business problems where you'd want to give more importance to recall over precision, or the other way round. Consider a critical test which determines whether someone has a particularly aggressive, fatal disease or not. In this case we wouldn't want to miss out on positive cases and wouldn't really mind some of the negative cases being labeled as positive.

F_beta Score = ( 1 + β² ) * ( Precision * Recall ) / ( β² * Precision + Recall )

Notice that the value of β here determines how much importance we give to what. For β = 1, equal importance is given to both precision and recall. When β < 1, the F_beta score favors Precision and results in a high probability score being chosen as the cutoff. On the other hand, when β > 1 ( and as β grows ) it favors Recall and results in a low probability score being chosen as the cutoff.
Now, these let us find a proper cutoff given a probability score model. However, so far we haven't discussed how to assess how good the score itself is. The next section is dedicated to the same.
When we assess the performance of a classification probability score, we try to see how it stacks up against an ideal scenario. For an ideal score there will exist a clean cutoff, meaning there will be no overlap between the two classes when we choose that cut-off : there will be no False Positives or False Negatives.
Consider a hypothetical scenario where we have an ideal prob score like given below .
import seaborn as sns

d=pd.DataFrame({'score':np.random.random(size=100)})
d['target']=(d['score']>0.3).astype(int)

sns.lineplot(x='score',y='target',data=d)
For this ideal scenario, we are going to consider many cutoffs between 0 and 1 and calculate the True Positive Rate [ same as Sensitivity ] and the False Positive Rate [ same as 1 - Specificity ], and plot those. The resultant plot is known as the ROC Curve . Let's see how that looks.
TPR=[]
FPR=[]
real=d['target']
for cutoff in np.linspace(0,1,100):
    predicted=(d['score']>cutoff).astype(int)
    TP=((real==1)&(predicted==1)).sum()
    FP=((real==0)&(predicted==1)).sum()
    TN=((real==0)&(predicted==0)).sum()
    FN=((real==1)&(predicted==0)).sum()

    TPR.append(TP/(TP+FN))
    FPR.append(FP/(TN+FP))

temp=pd.DataFrame({'TPR':TPR,'FPR':FPR})

sns.lmplot(y='TPR',x='FPR',data=temp,fit_reg=False)
It's perfectly triangular, and the area under the curve is 1 for this ideal scenario. However, let's see what happens if we make it a not-so-ideal scenario and add some overlap [ we'll flip some targets in the same data ].
inds=np.random.choice(range(100),10,replace=False)
d.iloc[inds,1]=1-d.iloc[inds,1]

sns.lineplot(x='score',y='target',data=d)
You can see that there is a lot of overlap now, so a clean cutoff cannot exist. Let's see how the ROC curve looks for the same data.
TPR=[]
FPR=[]
real=d['target']
for cutoff in np.linspace(0,1,100):
    predicted=(d['score']>cutoff).astype(int)
    TP=((real==1)&(predicted==1)).sum()
    FP=((real==0)&(predicted==1)).sum()
    TN=((real==0)&(predicted==0)).sum()
    FN=((real==1)&(predicted==0)).sum()

    TPR.append(TP/(TP+FN))
    FPR.append(FP/(TN+FP))

temp=pd.DataFrame({'TPR':TPR,'FPR':FPR})

sns.lmplot(y='TPR',x='FPR',data=temp,fit_reg=False)
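If we want the area under this curve as a single number instead of eyeballing the plot, sklearn's roc_auc_score can compute it directly from the same toy data (this snippet is an addition for convenience, not part of the original example):

from sklearn.metrics import roc_auc_score

# AUC for the overlapping-score example above
print(roc_auc_score(d['target'], d['score']))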
You can try introducing more overlap and you'll see the ROC curve move further away from the ideal scenario. The Area Under the Curve ( AUC score ) will become less than one; the closer it is to 1, the closer you are to the ideal scenario and the better your model is. Now that we have a way to assess our model, let's begin building it.
train_file=r'~/Dropbox/0.0 Data/rg_train.csv'
test_file=r'~/Dropbox/0.0 Data/rg_test.csv'
bd_train=pd.read_csv(train_file)

bd_test=pd.read_csv(test_file)
bd_train['data']='train'
bd_test['data']='test'
bd_all=pd.concat([bd_train,bd_test],axis=0)
These are a few data decisions that we have taken after exploring the data. Feel free to go alternate routes and see how that makes a difference to model performance.
bd_all.drop(['REF_NO','post_code','post_area'],axis=1,inplace=True)

bd_all['children']=np.where(bd_all['children']=='Zero',0,bd_all['children'])
bd_all['children']=np.where(bd_all['children'].str[:1]=='4',4,bd_all['children'])
bd_all['children']=pd.to_numeric(bd_all['children'],errors='coerce')

bd_all['Revenue.Grid']=(bd_all['Revenue.Grid']==1).astype(int)
bd_all['family_income'].value_counts(dropna=False)
'>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
bd_all['family_income']=bd_all['family_income'].str.replace(',',"")
bd_all['family_income']=bd_all['family_income'].str.replace('<',"")
k=bd_all['family_income'].str.split('>=',expand=True)
# the split pieces are still strings; convert them to numbers before averaging
for col in k.columns:
    k[col]=pd.to_numeric(k[col],errors='coerce')

bd_all['fi']=np.where(bd_all['family_income']=='Unknown',np.nan,
                      np.where(k[0].isnull(),k[1],
                               np.where(k[1].isnull(),k[0],0.5*(k[0]+k[1]))))
bd_all['age_band'].value_counts(dropna=False)
45-50 1359
36-40 1134
41-45 1112
31-35 1061
51-55 1052
55-60 1047
26-30 927
61-65 881
65-70 598
22-25 456
71+ 410
18-21 63
Unknown 55
k=bd_all['age_band'].str.split('-',expand=True)
for col in k.columns:
    k[col]=pd.to_numeric(k[col],errors='coerce')

bd_all['ab']=np.where(bd_all['age_band'].str[:2]=='71',71,
                      np.where(bd_all['age_band']=='Unknown',np.nan,0.5*(k[0]+k[1])))

del bd_all['age_band']
del bd_all['family_income']
cat_vars=bd_all.select_dtypes(['object']).columns
cat_vars=list(cat_vars)
cat_vars.remove('data')
# we are using pd.get_dummies here to create dummies
# it's more straightforward, but doesn't let you ignore categories on the basis of frequencies
for col in cat_vars:
    dummy=pd.get_dummies(bd_all[col],drop_first=True,prefix=col)
    bd_all=pd.concat([bd_all,dummy],axis=1)
    del bd_all[col]
    print(col)
    del dummy
TVarea
gender
home_status
occupation
occupation_partner
region
self_employed
self_employed_partner
status
# separating data
bd_train=bd_all[bd_all['data']=='train']
del bd_train['data']
bd_test=bd_all[bd_all['data']=='test']
bd_test.drop(['Revenue.Grid','data'],axis=1,inplace=True)
In scikit-learn, the L1 and L2 penalties are implemented within the function LogisticRegression itself. We'll be treating them as parameters to tune with cross validation. The counterpart to λ is C, which is implemented in such a way that the lower the value of C, the higher the penalty. There is another parameter, class_weight , which takes two values, balanced and None . None here implies equal weight to all observations while calculating the cost/loss [ thus all observations contribute equally to the loss, which might make the model biased towards the majority class if there is class imbalance ]. balanced artificially inflates the weight of the minority class [ the class with low frequency ] in order to ensure that the cost/loss contribution is the same from both classes and the model focuses on separating the classes.
params={'class_weight':['balanced',None],
        'penalty':['l1','l2'],
        # these are L1 and L2 written in lower case
        # don't confuse them with the numbers eleven and twelve
        'C':np.linspace(0.0001,1000,10)}
# we can certainly try much higher ranges and more values for the parameter 'C'
# grid search in this case will be trying out 2*2*10=40 possible combinations
# and will give us cross validated performance for all of them
from sklearn.linear_model import LogisticRegression
model=LogisticRegression(fit_intercept=True)
grid_search=GridSearchCV(model,param_grid=params,cv=10,scoring="roc_auc",n_jobs=-1)
# note that scoring is now roc_auc as we are solving a classification problem
# n_jobs has nothing to do with model building as such
# it enables parallel processing; the number reflects how many cores
# of your processor are being utilised. -1 means all the cores
x_train=bd_train.drop('Revenue.Grid',axis=1)
y_train=bd_train['Revenue.Grid']

grid_search.fit(x_train,y_train)

report(grid_search.cv_results_,3)
We'll go ahead with the best model here, although ideally we should expand the range of C on the lower side and rerun the experiment, as the best value is coming out at the edge. I am leaving that for you to try. Since this is with the l2 penalty, there will not be any model reduction.
If we want to make predictions just as probabilities and submit them, we can simply use the grid_search object. As for the tentative performance of this model, we can already see that in the output of the report function, as a cross-validated AUC score.
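The cell that creates test_prediction is not shown in this extract; presumably it is something along the lines of:

# class probabilities for the held-out test data
test_prediction = grid_search.predict_proba(bd_test)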
test_prediction
array([[0.99496285, 0.00503715],
[0.95418665, 0.04581335],
[0.98695233, 0.01304767],
...,
[0.97058409, 0.02941591],
[0.77185663, 0.22814337],
[0.80655962, 0.19344038]])
Note that this gives two probabilities for each observation and that they sum up to 1.

array([0, 1])

This ( the ordered list of classes ) means the first probability is for the outcome being 0 and the second is for the outcome being 1. You can extract the probability for either class by using the proper index.
You can submit this by converting it to a pandas DataFrame and then using the function to_csv to write it to a csv file.
train_score=grid_search.predict_proba(x_train)[:,1]
real = y_train
KS=[]
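The rest of the cell that defines cutoffs and fills the KS list is not visible in this extract; a sketch consistent with the lines that follow (the exact cutoff grid is an assumption) is:

cutoffs = np.linspace(0.001, 0.999, 999)   # assumed grid of candidate cutoffs
for cutoff in cutoffs:
    predicted = (train_score > cutoff).astype(int)
    TP = ((real == 1) & (predicted == 1)).sum()
    FP = ((real == 0) & (predicted == 1)).sum()
    TN = ((real == 0) & (predicted == 0)).sum()
    FN = ((real == 1) & (predicted == 0)).sum()
    KS.append(TP/(TP + FN) - FP/(TN + FP))
KS = np.array(KS)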
temp=pd.DataFrame({'cutoffs':cutoffs,'KS':KS})
sns.lmplot(x='cutoffs',y='KS',data=temp,fit_reg=False)
We can now find for which cutoff value KS takes its maximum value
cutoffs[KS==max(KS)][0]
0.467
If we have to submit hard classes, we'll use this cutoff to convert the probability scores to hard classes.

# use the probability of class 1 ( second column of test_prediction )
test_hard_classes=(test_prediction[:,1]>cutoffs[KS==max(KS)][0]).astype(int)
The data prep part will remain the same, irrespective of what modules we study further.
The performance measures and methods will also remain the same, irrespective of the algorithm.
The way to find cutoffs for probability scores will also remain the same, irrespective of which algorithm we obtain those scores from.
Chapter 5 : Decision Trees and Random
Forests
We will start our discussion with Decision Trees. Decision trees are a hierarchical way of partitioning
the data starting with the entire data and recursively partitioning it into smaller parts.
Lets start with a classification example of predicting whether someone will buy an insurance or not.
We have been given a set of rules which can be shown as the diagram below:
What we see here is an example of a decision tree. A decision tree is drawn upside down with the
root at the top.
Starting from the top, the first question asked is whether a person's age is greater than 30 years or not. Depending on the answer, a second question is asked - either whether the person owns a house or not
or does the person have 1 or more children and so on. Lets say that the person we consider is 45
years old, then the answer to the first question is Yes and then the next question for this person
would be whether this person owns a house or not. Lets say he/she does not own a house. Now we
end up in a node where no further questions are asked. This is the terminal node. In the terminal
node, no further questions are asked. The nodes where we ask questions are the decision nodes
with the top one being considered as the parent node. A thing to note here is that all the questions
have binary answers - yes or no.
Now, we know that this person who is 45 years old and does not own a house ends up in one of the
terminal nodes - we can say that this person belongs to this bucket. Now, our main question is to
predict whether the person with these characteristics will buy the insurance or not.
Before we can answer this question, we need to understand a few more things:
Once we come up with the answers for the questions above, we will have a better idea about
decision trees.
The way predictions are made for a classification problem are different than the way they are made
for a regression problem. The details are described below.
Classification:
Someone had given us the rules using which we made the decision tree shown above. Instead, we
ask this person to give us the information using which he built the tree i.e. share the data using
which he/she could come up with the rules. This data can also be referred to as the training data. We
took this training data and passed each observation from this data through the decision tree,
resulting in the following tree:
Note the terminal nodes now. We can see that among people with age greater than 30 years and
who do not own a house, 200 of them bought the insurance and 20 did not buy. Among people with
age less than 30 years and who have children, 15 people bought the insurance and 85 did not buy.
Using this information present in the terminal nodes, we can make predictions for new people for
whom we do not know the outcome. For example, lets consider a new person comes whose age is
greater than 30 years and does not own a house. Using our training data we saw that most of the
people who end up in that node buy the insurance. So our prediction will be that this new person
will buy the insurance using a simple majority vote. Instead of using the majority vote, we can also
consider the probability of this person buying the insurance i.e. 200/220 = 0.9, which is quite high.
We can say that if someone ends up in this terminal node, there is a high chance or probability of
this person to buy the insurance.
Now lets consider a person who owns 1 house and has no children, then this person ends in a node
where 50% of the people buy the insurance and 50% don't. This probability of a person buying the
insurance is as random as a coin flip and hence not desirable.
1. Probability: the proportion of training observations of the class of interest present in the terminal node ( e.g. 200/220 = 0.9 in the node discussed above ).
2. Hard classes: a simple majority vote among the training observations that end up in the terminal node.
Regression:
When the response variable is continuous, in order to make predictions, we take an average of the
response values present in the terminal nodes.
Whether we have a classification decision tree or a regression decision tree, we now know how to
make predictions.
But we still don't know how the decision tree was built in the first place. In order to figure this out,
we need to answer the next question:
The rules are primarily binary questions. How do we come up with binary questions from numerical
and categorical variables?
Numeric variables: For continuous variables, we simply discretize the range and ask questions on
the intervals. Lets say we have values 5, 10, 15, 16 and 20 in a variable say 'age' as follows:
Questions like 'is age greater than 10' have an answer either 'yes' or 'no' which covers the entire
data. Similarly, the rule 'is age greater than 15' will cover the entire range as well. Binary questions
like these can cover the entire data range. Also, it is not necessary that the intervals in which we
break the range should be equidistant. For example, in the line above, there is no value present
between 10 and 15. The question 'is age greater than 11' and the question 'is age greater than 14'
will result in the same partitioning of the data. This is because there are no observations for the
variable 'age' in the range 10 to 15. So whatever question we ask between the range 10 to 15 will
result in the same partition of the data.
The way a continuous variable is discretized and how the interval questions are decided depends on
the kinds of values the continuous variable has.
Categorical variables:
We are already aware how to make dummy variables from categorical predictors. Lets consider the
following dummy variables created for the 'City' predictor:
City var_delhi var_new_york var_beijing
delhi 1 0 0
new york 0 1 0
beijing 0 0 1
new york 0 1 0
delhi 1 0 0
Using the categorical column 'City', we create three dummy variables. In decision trees, an example of a rule is 'Is var_delhi greater than 0.5', which is the same as asking whether it is 0 or 1, i.e. if the variable's value is greater than 0.5 then the city is delhi, else it is either of the other cities.
We have understood how the rules are made for categorical and numerical variables. Given some features, the number of possible rules can be quite big. How do we choose the best rule amongst these to split a node? The way to pick rules differs between classification and regression.
Classification:
Let's, for a moment, consider what an ideal decision tree would be like. Which decisions would we be most happy with? The decisions which give a clear majority or, in more specific words, decisions which result in a more homogeneous child node. Let's consider the example of a person who is 45
years old and does not own a home - whether this person will buy the insurance or not. 200 out of
the 220 people present in the terminal node end up buying the insurance. Hence, if we get another
person with similar characteristics, we can reasonably predict that that person would buy the
insurance. Similarly, if a person ends up in a terminal node in which 10 out of 170 people usually end
up buying the insurance, then we can be reasonably certain that this person will not buy the
insurance. Both these cases are preferred since they result in a more homogeneous terminal node.
However, we would not be certain whether a person will end up buying the insurance if 50 out of
100 people ending up in the terminal node buy the insurance i.e. 50% of the people can either buy
or not buy the insurance. The decision is as good as a coin flip.
We would want our nodes to have as clear majority as possible i.e. the terminal nodes should be as
homogeneous as possible. The commonly used measures of node homogeneity are:
Gini Index
Entropy
Deviance
Lets go through each of these measures of homogeneity. Lower the values of each of these
measures, higher is the homogeneity of the node.
Gini Index: For a node with class proportions p_1, p_2, ..., p_K, the Gini Index is computed as:
Gini = 1 - (p_1^2 + p_2^2 + ... + p_K^2)
Entropy: Using the same class proportions, the entropy of the node is:
Entropy = -(p_1 log2 p_1 + p_2 log2 p_2 + ... + p_K log2 p_K)
Deviance: With n_k denoting the count of class k in the node, the deviance of the node is:
Deviance = -2 (n_1 log p_1 + n_2 log p_2 + ... + n_K log p_K)
An important property of all these measures is that their value will be lowest when one of the classes
has a clear majority.
In case where all the observations in a node belong to the same class, then each of the measures
described above would have the value of 0.
For implementation, we can use any of the measures described above; there is no theoretical
favorite.
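A minimal sketch of how these homogeneity measures can be computed from the class counts of a node (the counts used below are made-up numbers for illustration):

import numpy as np

def node_impurity(class_counts):
    """Compute Gini index and entropy for a node from its class counts."""
    counts = np.array(class_counts, dtype=float)
    p = counts / counts.sum()          # class proportions in the node
    p = p[p > 0]                       # ignore empty classes (0*log0 -> 0)
    gini = 1 - np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    return gini, entropy

print(node_impurity([200, 20]))   # fairly pure node -> low values
print(node_impurity([50, 50]))    # 50-50 node -> gini 0.5, entropy 1.0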
For a detailed example of rule selection for classification, you may refer to the class presentation.
Regression:
We know that Gini Index, Entropy and Deviance are used to select rules for classification. But if the
response variable is continuous, how do we select the rules?
In case of regression, the prediction is the average of the response values in the node. In order to
measure how good our predictions are, we compute the error sum of squares (SSE) of the node:
SSE = sum over observations in the node of (y_i - y_bar)^2, where y_bar is the node average.
You may refer to the class presentation for a detailed example.
We choose the rules based on lower SSE; lower the SSE, more homogeneous is the node.
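A minimal sketch of comparing a candidate rule using SSE (the toy response values and the rule below are made up for illustration):

import numpy as np

def node_sse(y):
    """Error sum of squares of a node: sum((y - mean(y))^2)."""
    y = np.array(y, dtype=float)
    return np.sum((y - y.mean()) ** 2)

y = np.array([10, 12, 11, 30, 32, 31], dtype=float)
age = np.array([22, 25, 28, 50, 55, 60], dtype=float)

# candidate rule: 'is age greater than 40'
left, right = y[age <= 40], y[age > 40]
total_sse = node_sse(left) + node_sse(right)
print(node_sse(y), total_sse)   # the rule reduces SSE drastically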
If we let our tree grow too big, there are higher chances of over-fitting the data. It might be that the
model is too perfect for the training data but does not generalize well.
In case we wish to stop growing a tree before it reaches its natural stopping criterion, we can use
hyper-parameters to control the size of the decision tree.
max_depth: This fixes the maximum depth of the tree. If it is not set then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node; default
value - 2. It is a good idea to keep it slightly higher in order to reduce over-fitting of the data.
min_samples_leaf: The minimum number of samples required to be in each leaf node on splitting
their parent node. This defaults to 1. If this number is higher and a split results in a leaf node
having fewer samples than specified, then that split is cancelled.
max_leaf_nodes: It defines the maximum number of possible leaf nodes. If it is not set then it
takes an unlimited number of leaf nodes. This parameter controls the size of the tree.
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline
We will consider the demographic data 'census_income.csv' for this module. This is typical census
data. Each record has been labelled according to whether annual income is more than 50K dollars or not. We
want to build a model such that, given these census characteristics, we can figure out whether someone falls in the
category in which income is higher than 50K dollars. Such models are mainly used when
formulating government policies.
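The training data is read into a dataframe named ci_train; a minimal loading sketch, assuming the file sits in the working directory:

# assuming census_income.csv is in the current working directory
ci_train = pd.read_csv('census_income.csv')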
1 ci_train.head()
2 # display in pdf will be truncated on the right hand side
   age  workclass  fnlwgt  education  education.num  marital.status      occupation         relationship   rac...
0  39   State-gov  77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  Wh...
2  38   Private    215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  Wh...
3  53   Private    234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Bla...
4  28   Private    338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Bla...
We know that there should not be any redundancy in the data. e.g. consider the 'education' and
'education_num' variables.
1 pd.crosstab(ci_train['education'],ci_train['education.num'])
2 # display will be truncated in pdf on the right hand side
education.num 1 2 3 4 5 6 7 8 9 10 11 12 13 14
education
10th 0 0 0 0 0 933 0 0 0 0 0 0 0 0
11th 0 0 0 0 0 0 1175 0 0 0 0 0 0 0
12th 0 0 0 0 0 0 0 433 0 0 0 0 0 0
1st-4th 0 168 0 0 0 0 0 0 0 0 0 0 0 0
5th-6th 0 0 333 0 0 0 0 0 0 0 0 0 0 0
7th-8th 0 0 0 646 0 0 0 0 0 0 0 0 0 0
9th 0 0 0 0 514 0 0 0 0 0 0 0 0 0
Assoc-acdm 0 0 0 0 0 0 0 0 0 0 0 1067 0 0
Assoc-voc 0 0 0 0 0 0 0 0 0 0 1382 0 0 0
Bachelors 0 0 0 0 0 0 0 0 0 0 0 0 5355 0
Doctorate 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HS-grad 0 0 0 0 0 0 0 0 10501 0 0 0 0 0
Masters 0 0 0 0 0 0 0 0 0 0 0 0 0 1723
Preschool 51 0 0 0 0 0 0 0 0 0 0 0 0 0
Prof-school 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Some-college 0 0 0 0 0 0 0 0 0 7291 0 0 0 0
We observe that there is one to one correspondence. e.g. for 'education' 10th has been labelled as
'education.num' 6, 11th has been labelled as 'education.num' 7 etc. Hence, instead of using the
variable 'education' we can use 'education.num' only. We will go ahead and drop 'education' here.
1 ci_train.drop(['education'],axis=1,inplace=True)
1 ci_train['Y'].value_counts().index
Notice that the category labels have leading white space which should be removed. During data preparation we need to
be careful with this, else when comparing these values we may get unexpected results if the white
space is not accounted for.
1 ci_train['Y']=(ci_train['Y']==' >50K').astype(int)
1 cat_cols=ci_train.select_dtypes(['object']).columns
2 cat_cols
When making dummies, we will ignore those categories which have a frequency of less than 500; a sketch of
this step is shown after the list of categorical columns below. You can always reduce this cutoff and check
if more dummies result in a better model.
workclass
marital.status
occupation
relationship
race
sex
native.country
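A sketch of this dummy-creation step over the categorical columns listed above, assuming the same looping approach that is used later in the boosting chapter (the exact code may differ):

cutoff = 500
for col in cat_cols:
    freqs = ci_train[col].value_counts()
    # keep frequent categories, dropping one to avoid redundant dummies
    cats = freqs.index[freqs > cutoff][:-1]
    for cat in cats:
        name = col + '_' + str(cat)
        ci_train[name] = (ci_train[col] == cat).astype(int)
    del ci_train[col]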
The above steps result in my data having 32561 rows and 39 columns including the response
variable.
1 ci_train.shape
(32561, 39)
1 ci_train.isnull().sum().sum()
1 x_train=ci_train.drop(['Y'],1)
2 y_train=ci_train['Y']
The data preparation steps are similar for all models. Here we are given only the training dataset.
However, if we are given a test dataset also, we should do the data preparation steps for both
training and test datasets.
criterion: Used to set which homogeneity measure to use. Two options available: "entropy" and
"gini" (default).
max_depth: This fixes the maximum depth of the tree. If None, then nodes are expanded until all
leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if
'max_leaf_nodes' is not None.
min_samples_split: The minimum number of samples required to split an internal node; default
value - 2. It is a good idea to keep it slightly higher in order to reduce over-fitting of the data.
Recommended values are between 5 to 10.
min_samples_leaf: The minimum number of samples required to be in each leaf node on splitting
their parent node. This defaults to 1. If this number is higher and a split results in a leaf node
having fewer samples than specified, then that split is cancelled.
max_leaf_nodes: It defines the maximum number of possible leaf nodes. If None, then it takes
an unlimited number of leaf nodes. By default it takes the value None. This parameter controls
the size of the tree. We will be finding the optimal value of this through cross validation.
class_weight: Default is None, in which case each class is given equal weight-age. If the goal of
the problem is good classification instead of accuracy (especially in the case of imbalanced
datasets) then you should set this to "balanced", in which case class weights assigned are
inversely proportional to class frequencies in the input data.
random_state: Used to reproduce random result.
For selecting the best parameters we will use RandomizedSearchCV instead of GridSearchCV.
The variable 'params' below consists of 5 different parameters we intend to tune. Each of these
parameters contain some values. The possible combinations will be 960 as shown below. If we use
grid search and use a 10 fold CV, we will build around 9600 individual trees. It will result in the best
possible combination but will also take a lot of time. In order to handle this, instead of trying out all
960 combinations, we can try only 10% of these combinations i.e 96 combinations. Randomized
Search will randomly select 96 of the combinations and will result in good enough results - though
we do not have a guarantee of the best combination. It will give us a good enough combination at a
fraction of the runtime. The trade-off is between the run time and how good our result will be.
We can make the Randomized Search better by running it multiple times (each time we get a
different combination) and check whether the combinations are consistent across different runs. We
can expand in the neighborhood values as well e.g we get max_depth as 70 using Randomized
Search; in the next run we would want to add 80 and 90 and check again. Another example would be
if we get max_depth as 30, but we did not consider any other value between 30 and 50. We may
want to add a max_depth of 40 and try again.
Once we get our best max_depth value as 30, we can also try values around 30, like 25, 26, 31, 32 etc.,
which may result in better performance of the model.
1 2*2*8*6*5
960
# RandomizedSearchCV/GridSearchCV accept parameter values as dictionaries.
# In the example given below we have constructed a dictionary of the different
# parameter values that we want to try for the decision tree model
params={ 'class_weight':[None,'balanced'],
         'criterion':['entropy','gini'],
         'max_depth':[None,5,10,15,20,30,50,70],
         'min_samples_leaf':[1,2,5,10,15,20],
         'min_samples_split':[2,5,10,15,20]
       }
Now lets see how RandomizedSearchCV works. It works just like GridSearchCV except that we need
to tell RandomizedSearchCV that out of all these combinations how many do we want to try out. e.g.
We may mention that out of 960 combinations only try out 10. We use the argument n_iter to set
this. Ideally, we should try about 10 to 20% of all the combinations.
The classifier we want to try to do this classification is the Decision Tree Classifier.
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
# We try the decision tree classifier here, supplying different parameter ranges to
# RandomizedSearchCV which in turn will pass them on to this classifier.
# The n_iter parameter of RandomizedSearchCV controls how many of all the given
# parameter combinations will be tried; it should be roughly 10 to 20% of the total.
from sklearn.model_selection import RandomizedSearchCV

random_search=RandomizedSearchCV(clf, cv=10, param_distributions=params,
                                 scoring='roc_auc', n_iter=10, n_jobs=-1,
                                 verbose=False)
1 random_search.fit(x_train,y_train)
In the output above, we can see all the parameters that were tried out.
1 random_search.best_estimator_
1 DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
2 max_depth=10, max_features=None, max_leaf_nodes=None,
3 min_impurity_decrease=0.0, min_impurity_split=None,
4 min_samples_leaf=10, min_samples_split=15,
5 min_weight_fraction_leaf=0.0, presort=False, random_state=None,
6 splitter='best')
The report function below gives the rank-wise details of each model.
# Utility function to report best scores. This simply accepts cv results from our
# RandomizedSearchCV/GridSearchCV and prints the top few parameter combinations
# according to their scores.

def report(results, n_top=3):
    for i in range(1, n_top + 1):
        # np.flatnonzero extracts indices of `True` in a boolean array
        candidate = np.flatnonzero(results['rank_test_score'] == i)[0]
        # values passed to format() replace the numbered placeholders {0}, {1} etc.
        print("Model with rank: {0}".format(i))
        # cross validated performance and its standard deviation;
        # .5f means up to 5 decimal digits
        print("Mean validation score: {0:.5f} (std: {1:.5f})".format(
              results['mean_test_score'][candidate],
              results['std_test_score'][candidate]))
        # the parameter combination for which this performance was obtained
        print("Parameters: {0}".format(results['params'][candidate]))
        print("")
Below we can see the details of the top 5 models. The best model has a roc_auc score of 0.894 and
we can check the parameters that resulted in this.
1 report(random_search.cv_results_,5)
We will now fit the best estimator separately.
1 dtree=random_search.best_estimator_
2 dtree
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
max_depth=10, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=10, min_samples_split=15,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
1 dtree.fit(x_train,y_train)
The tentative performance of the model is 0.894. We can now use this model to make predictions on
the test data using either the predict() or predict_proba() functions.
We first need to output the decision tree to a dot file using the export_graphviz() function. 'dtree' is
the model we built, 'out_file' is where we will write this decision tree, 'feature_names' are the names of
the features on which the rules are based, 'class_names' stores the class labels and 'proportion' set to
True means that we want to see the proportions in the nodes too.
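A sketch of this export step; the class labels passed to class_names below are illustrative assumptions:

from sklearn.tree import export_graphviz

# class names below are illustrative labels for Y=0 and Y=1
export_graphviz(dtree, out_file='mytree.dot',
                feature_names=x_train.columns,
                class_names=['below_50K', 'above_50K'],
                proportion=True)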
Open mytree.dot file in a simple text editor and copy and paste the code at http://webgraphviz.com
to visualize the tree.
As far as Decision Trees for Regression are concerned, there will be the following differences: we use
a DecisionTreeRegressor instead of a DecisionTreeClassifier to create the Decision Tree object. The
argument 'class_weight' does not apply to Regression, and 'criterion' takes different values (e.g. 'mse'
or 'mae') instead of 'gini'/'entropy'. Also, when using Regression, the evaluation criterion will be
'neg_mean_absolute_error' instead of the 'roc_auc' score.
Rest of the process remains identical for Decision Tree Regressor and Decision Tree Classifier.
Next we will discuss the issues with Decision Trees and how they are handled.
Random Forests
Issues with decision trees: Decision trees help capture niche non-linear patterns in the data. The model
may fit the training data very well but may not do well with newer data. Whenever we build a model,
we want it to do well with future data too and hence it needs to be as generalizable as possible. How
do we handle this? On one hand we want to capture the decision tree's amazing ability to capture non-
linear patterns, and at the same time we do not want it to be susceptible to noise or niche patterns from
the training data.
A very simple and powerful idea resulted in the popular algorithm called Random Forest.
Random Forests introduce two levels of randomness in the tree building process which help in
handling the noise in the data.
Random Forest takes a random subset of all the observations present in training dataset
It also takes a random subset of all the variables from the training dataset
In order to understand this idea, lets assume that we have 10000 observations and 200 variables in
the dataset. In Random Forest, many decision trees are built instead of a single tree; for our
example lets consider 500 trees are built. Now, if we build these trees using the same 10000
observations 500 times we will end up with the same 500 trees. But this is not what we want since
we will get the same outcome from each of these 500 trees. Hence, each tree in the Random Forest
does not use the entire data, but instead the individual trees are built on random subset of the
observations (we can consider 10000 sample observations with replacement or fewer sample
observations without replacement). In short, in the first level of randomness, instead of using the
entire data, we use a random sample of the data. How does this help in cancelling out the effect of
noise/niche patterns specific to the training data? By definition, noise will be a small chunk of the
data. Hence, when we sample observations from the training dataset repeatedly for our 500 trees,
we can safely assume that a majority of the trees built will not be affected by the noise.
Now, how does taking a random subset of all the variables from the dataset help? When building a
single decision tree (not in Random Forest), in order to choose a rule to split a node, the decision
tree algorithm considers all the variables i.e. in our example all 200 variables will be considered. In
Random Forest algorithm, on the other hand, in order to choose a rule to split a node, instead of
considering all possible variables, the algorithm considers only a random subset of variables. Lets
say only 20 variables are randomly picked up from the 200 variables present and only the rules
generated from these 20 variables are considered when choosing the best rule to split the node. At
every node a fresh random subset of variables is selected from which the best rule is picked up.
How does this help in cancelling out the effect of noisy variables? The noisy variables will be a small
chunk of all the variables present, and when the variables are randomly selected, the chances of the
noisy variables being selected are low. This in turn will not always let the noisy variables affect all the
500 individual trees made. The noisy variables may affect some of the trees a lot, but since we will
take a majority over 500 trees, their effect will be minimized.
In short, the first randomness removes the effect of noisy observations and the second randomness
removes the effect of noisy variables.
The final predictions made by Random Forest model will be a majority vote from all the trees in the
forest in case of classification. For Regression, the predictions will be the average of predictions
made by the 500 trees.
Now we will build a Random Forest model. We will use the same hyper-parameters as for decision
trees amongst others since Random Forest is ultimately a collection of decision trees.
n_estimators: Number of trees to be built in the forest - default: 10; good starting point can be
100.
max_features: Number of features being considered for selecting the best rule at each split.
Note: the value of this parameter should not exceed the total number of features available.
bootstrap: Controls whether sampling with replacement is used when building the individual trees.
Takes a Boolean value; if True, each tree is built on a bootstrap sample (sampling with replacement),
and if False, the whole training dataset is used to build each tree.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
The dictionary below has different values of parameters that will be tried to figure out the model
giving the best performance. Apart from 'n_estimators', 'max_features' and 'bootstrap' parameters
which are specific to Random Forest, the rest of the parameters are the same as the ones used for
decision trees.
# RandomizedSearchCV/GridSearchCV accept parameter values as dictionaries.
# In the example given below we have constructed a dictionary of the different
# parameter values that we want to try for the Random Forest model
param_dist = {"n_estimators":[100,200,300,500,700,1000],
              "max_features": [5,10,20,25,30,35],
              "bootstrap": [True, False],
              "class_weight":[None,'balanced'],
              "criterion":['entropy','gini'],
              "max_depth":[None,5,10,15,20,30,50,70],
              "min_samples_leaf":[1,2,5,10,15,20],
              "min_samples_split":[2,5,10,15,20]}
1 960*6*6*2
69120
We are looking at 69120 combinations. Having said this, each Random Forest can have around 100
to 1000 trees; on an average around 300 trees. If we try all these combinations, we are looking at
building about 20 million decision trees. The time required to execute this will be huge and the
performance improvement we get may not be worth the investment.
Hence, instead of looking at all the possible combinations, we will consider a random subset which
will give us a good enough solution, maybe not the best; the trade-off is between how good the model is
and the execution time.
1 960*6*6*2*300
20736000
A good number to start with would be about 10 to 20 percent of the total number of combinations.
Note: the value of max_features cannot exceed the total number of features in the data. As seen
below, max_features cannot exceed 38.
1 x_train.shape
(32561, 38)
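The search itself is set up just like it was for the decision tree; a minimal sketch (the cv and n_iter values here are indicative assumptions):

random_search = RandomizedSearchCV(clf, cv=5, param_distributions=param_dist,
                                   scoring='roc_auc', n_iter=10,
                                   n_jobs=-1, verbose=False)
random_search.fit(x_train, y_train)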
Amongst all the models built, we get the best estimator using the 'best_estimator_' argument of the
random_search object.
1 random_search.best_estimator_
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=10, max_features=20, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Note: This is a result from one of the runs, you can very well get different results from a different
run. Your results need not match with this.
Looking at the outcome of the report function, we observe that the model with rank 1 has the mean
validation score of 0.918 which is a slight improvement over decision trees. Trying higher values of
'cv' and 'n_iter' arguments should give better models.
1 report(random_search.cv_results_,5)
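The rf object used below is presumably the best estimator extracted from the search, analogous to what we did with dtree for the decision tree:

rf = random_search.best_estimator_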
1 rf.fit(x_train,y_train)
Once the model is fit, we can use it to make predictions using hard classes or probability.
Feature Importance
In scikit-learn, one of the ways in which feature importance is described is by something called
"mean decrease impurity". This "mean decrease impurity" is the total decrease in node impurity
averaged over all trees built in the random forest.
The bigger the decrease in impurity a feature is responsible for, the more important that feature is.
In the code below, feature importance can be obtained using the 'feature_importances_' attribute of
the model. The use of this attribute makes sense only after the model is fit. We store the feature
importance along with the corresponding feature names in a dataframe and then sort them
according to importance.
feat_imp_df=pd.DataFrame({'features':x_train.columns,
                          'importance':rf.feature_importances_})

print(feat_imp_df.sort_values('importance',ascending=False).head())

print(feat_imp_df.sort_values('importance',ascending=False).tail())
We can observe that 'marital.status_ Married-civ-spouse' is identified as the most important variable
and 'relationship_ Unmarried' as the least important.
Random Forests can also be used as a dimensionality reduction technique. In case of high
dimensional data, we can run a random forest model and get features sorted by importance. We can
then go ahead and choose the top, say 100, features for further processing. Using this technique we
would not be losing relevant information and at the same time we can drastically reduce the number
of features considered. In other words, even if we do not use the Random Forest model as the final
model, we can use it for reducing the features.
In order to interpret the effect of a variable used in a Random Forest, we make predictions on the
entire data and then look at how the average prediction changes across the values of the variable we are
interested in.
var_name='education.num'
preds=rf.predict_proba(x_train)[:,1] # making predictions on the entire data

var_data=pd.DataFrame({'var':x_train[var_name],'response':preds})
var_data.head() # dataframe of the 'education.num' variable and the corresponding predictions
var response
0 13 0.079281
1 13 0.285775
2 9 0.027492
3 7 0.114180
4 13 0.506204
We plot the two columns 'education.num' against the 'response' as shown below but it is not very
informative i.e. there is a lot of variation since the response contains the effects of other variables
also.
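A sketch of that raw plot using seaborn (importing seaborn as sns is an assumption consistent with its use in the smoothing step below):

import seaborn as sns

# raw, unsmoothed relationship: lots of vertical spread at each education.num value
sns.lmplot(x='var', y='response', data=var_data, fit_reg=False)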
We will plot a smoothing curve through this which will give an approximate effect of the variable
'education.num' on the response. We basically will average the response at each of the
'education.num' values.
import statsmodels.api as sm
import seaborn as sns

smooth_data=sm.nonparametric.lowess(var_data['response'],var_data['var'])

df=pd.DataFrame({'response':smooth_data[:,1],var_name:smooth_data[:,0]})
sns.lmplot(x=var_name,y='response',data=df,fit_reg=False)
In the plot above we notice that as the 'education.num' value goes up the chances of having income
greater than 50000 dollars go up. The 'education.num' variable does not have much impact on the
response till its value is around 10 or 12, but then it rises steeply indicating that with higher
education the probability of earning more than 50000 dollars increases.
This exercise can be done for any of the other variables we wish to assess. However, it does not
make much sense for dummy variables since they take only two values; it works fine for continuous
variables.
Chapter 6 : Boosting Machines
In the previous module we saw models based on bagging: random forests and extraTrees, where each
individual model was independent and the eventual prediction of the ensemble was a simple majority
vote or average, depending on whether the problem was one of classification or regression.
The randomness in the process helped the model become less affected by noise and more
generalizable. However, bagging did not really help the underlying models become better at
extracting more complex patterns.
Boosting machines go that extra step. In modern implementations, both ideas are used: randomness
to make models generalizable, and boosting [which we'll study in a few moments]. You can
consider boosting machines to be more powerful than bagging models. However, that comes with the
downside of them being, in theory, prone to over-fitting. We'll learn about Extreme Gradient Boosting
(Xgboost), which goes one step further and adds the element of regularization to boosting machines,
along with some other radical changes, to become one of the most successful Machine Learning
algorithms in recent history.
Just like bagging, boosting machines are also made up of multiple individual models. However, in
boosting machines the individual models are built sequentially, and the eventual model is a summation
of the individual models, not an average. Formally, a boosting machine F_t, with t individual models,
is written as:

F_t(x) = f_1(x) + f_2(x) + ... + f_t(x)

where f_k represents the individual models. As mentioned earlier, these individual models are built
sequentially. Patterns which could not be captured by the model so far become the target for the next
model. This is fairly intuitive to understand in the context of a regression model [the one with a continuous
numeric target].

For f_1 the target is simply the original outcome y, but as we go forward, the target simply becomes
the error remaining so far:

target for f_t = y - F_{t-1}(x)
By the looks of it, this seems like a recipe for disaster, certainly for over-fitting. Remember that we are
trying to reduce the error on the training data, and if we keep on fitting models on the residuals,
eventually we'll start to severely over-fit.
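To make this residual-fitting idea concrete, here is a tiny hand-rolled sketch of boosting with tree stumps on made-up data; it only illustrates the idea and is not how the library implementations work internally:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta = 0.1                      # learning rate / shrinkage
F = np.zeros_like(y)           # current ensemble prediction, starts at 0
stumps = []

for t in range(100):
    residual = y - F                            # what is left to explain
    stump = DecisionTreeRegressor(max_depth=1)  # weak learner
    stump.fit(X, residual)
    F = F + eta * stump.predict(X)              # add a small fraction of the new model
    stumps.append(stump)

print(np.mean((y - F) ** 2))   # training error keeps shrinking as we add stumps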
Why Weak-Learners
In order to avoid over-fitting, we can choose our individual models to be weak learners: models
which are incapable of over-fitting by themselves, in fact models which are not good when taken
individually. Individually they are only capable of capturing the strongest and hence most reliable patterns.
Upon boosting, such consistent patterns, extracted with a changing target for each model,
taken together make for a very strong and yet somewhat robust-to-over-fitting model.
What weak-learners
In theory, any kind of base model can be made a weak learner. Few examples :
Linear Regression which makes use of, say, only a small subset of the variables at each step
Linear Regression with a very high penalty [a large value of the penalty weight λ in L1/L2 regularization]
Tree Stumps : Very Shallow Decision Trees [ low depth ]
In Practice however , Tree Stumps are popular and you will find them implemented almost
everywhere.
Remember that decision trees start to over-fit [ extract niche patterns from the training data ] when
we grow them too large and start to pick partition rules from smaller and smaller chunk of the
training data. Shallow decision trees do not reach to that point and hence are incapable of
individually over-fitting .
Lets see how we transition from Gradient Descent for parameters to Gradient Boosting Machines.
If you recall, the idea behind gradient descent was that in each iteration we update the parameters by
changing [updating] them like this:

θ_new = θ_old - η ∂L/∂θ

The parameters θ are getting updated, and the gradient of the cost is taken w.r.t. the parameters.
In the context of boosting machines, however, the model itself is getting updated. In every iteration,
the model F_{t-1} is updated by f_t:

F_t(x) = F_{t-1}(x) + f_t(x)

Following gradient descent, this update will be equal to -η ∂L/∂F_{t-1}; the gradient here is taken
w.r.t. what is being updated, that is F_{t-1}.
Before we are able to take the derivative/gradient of the loss w.r.t. F_{t-1}, we'll need to write the
cost/loss in terms of F_{t-1}.
f_t being equal to -η ∂L/∂F_{t-1} doesn't intuitively make sense at first; f_t is after all a model [a tree
stump]. What does it mean for a model to be equal to something? It means that while we are
building the model f_t, we'll take -∂L/∂F_{t-1} as the target for the model.

where, for the squared error loss L = ½ (y - F_{t-1}(x))²:  -∂L/∂F_{t-1} = y - F_{t-1}(x)

then

F_t(x) = F_{t-1}(x) + η f_t(x)
This simply means that every next shallow decision tree will take a small fraction of the remaining
error as its target. The learning rate η is there to ensure that no single individual model ends up contributing too
heavily towards the eventual prediction.
For classification, just as with logistic regression, the probability is represented as:

p = 1 / (1 + e^(-F(x)))

A couple of things that you need to realize about GBM for classification:

Each individual shallow tree is a regression tree [not a classification tree], as the target for it
is a continuous numeric number (the negative gradient of the loss).

F(x) is not a simple summation of probability scores given by the individual shallow tree models. It is a
summation of scores on the log-odds scale:

F_t(x) = F_{t-1}(x) + η f_t(x)  and  p = 1 / (1 + e^(-F_t(x)))
You should realize that this process can be carried out for any alternate cost/loss formulation as well. All
that we need to do is write the cost in terms of the model itself and then define its derivative.
GBM parameters
n_estimators [default = 100]: the number of boosting stages (individual shallow trees) to be built.
learning_rate [default=0.1]
learning rate shrinks the contribution of each tree by learning_rate .There is a trade-off between
learning_rate and n_estimators. A small learning rate will require large number of n_estimator.
max_depth [default=3]
depth of the individual tree models . High number here will lead to complex/over-fit model
min_samples_split [default=2]
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
min_samples_leaf[default = 1]
The minimum number of samples required to be at a leaf node. A split point at any depth will only
be considered if it leaves at least min_samples_leaf training samples in each of the left and right
branches. This may have the effect of smoothing the model,
especially in regression.
subsample [default =1 ]
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this
results in Stochastic Gradient Boosting.
max_features [default=None]
The number of features to consider when looking for the best split.
1 import pandas as pd
2 import numpy as np
3 file=r'/Users/lalitsachan/Dropbox/0.0 Data/Cycle_Shared.csv'
4 bike=pd.read_csv(file)
5 bike.head()
6 # display will be truncated in pdf on the right hand side
   instant  dteday      season  yr  mnth  holiday  weekday  workingday  weathersit  temp      ate...
0  1        2011-01-01  1       0   1     0        6        0           2           0.344167  0.36...
1  2        2011-01-02  1       0   1     0        0        0           2           0.363478  0.35...
2  3        2011-01-03  1       0   1     0        1        1           1           0.196364  0.18...
3  4        2011-01-04  1       0   1     0        2        1           1           0.200000  0.21...
4  5        2011-01-05  1       0   1     0        3        1           1           0.226957  0.22...
1 bike.shape
(731, 16)
1 bike.drop(['yr','instant','casual','registered'],axis=1,inplace=True)
bike['dteday']=pd.to_datetime(bike['dteday'])
bike['day']=bike['dteday'].dt.day   # extract day of month (month is already available in 'mnth')
del bike['dteday']
1 x_train=bike.drop('cnt',1)
2 y_train=bike['cnt']
gbm_params={'n_estimators':[50,100,200],
            'learning_rate': [0.01,.05,0.1,0.4,0.8,1],
            'max_depth':[1,2,3,4,5,6],
            'subsample':[0.5,0.8,1],
            'max_features':[5,10,15,20,28]
           }
# Note : keep in mind that this dataset is too small and ideally we should avoid complex models.
# Also, max_features should not exceed the number of columns in x_train.
from sklearn.ensemble import GradientBoostingRegressor

model=GradientBoostingRegressor()
random_search=RandomizedSearchCV(model,scoring='neg_mean_absolute_error',
                                 param_distributions=gbm_params,
                                 cv=10,n_iter=10,
                                 n_jobs=-1,verbose=False)
1 random_search.fit(x_train,y_train)
1 report(random_search.cv_results_,3)
This gives us our best model parameters as well as the corresponding performance. We can further
extract variable importance and make partial dependence plots here too, as we did for the Random
Forest results in the earlier module.
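Next, the same census income data is used for a classification example with GBM; the dataframe is named cd this time. A minimal loading sketch, assuming the file sits in the working directory:

cd = pd.read_csv('census_income.csv')
cd.head()   # display is truncated on the right hand side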
   age  workclass  fnlwgt  education  education.num  marital.status      occupation         relationship   rac...
0  39   State-gov  77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  Wh...
2  38   Private    215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  Wh...
3  53   Private    234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Bla...
4  28   Private    338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Bla...
1 cd.shape
(32561, 15)
1 cd.drop('education',1,inplace=True)
1 cd['Y'].unique()
# convert the response to 1/0
cd['Y']=(cd['Y']==' >50K').astype(int)

# create dummies for frequent categories (frequency > 300), dropping one category each
for col in ['workclass','marital.status', 'occupation',
            'relationship', 'race','sex', 'native.country']:
    k=cd[col].value_counts()
    cats=k.index[k>300][:-1]
    for cat in cats:
        name=col+'_'+str(cat)
        cd[name]=(cd[col]==cat).astype(int)

    del cd[col]
1 cd.shape
(32561, 41)
1 x_train=cd.drop('Y',1)
2 y_train=cd['Y']
3
1 gbm_params={'n_estimators':[50,100,200,500],
2 'learning_rate': [0.01,.05,0.1,0.4,0.8,1],
3 'max_depth':[1,2,3,4,5,6],
4 'subsample':[0.5,0.8,1],
5 'max_features':[0.1,0.3,0.5,0.8,1]
6 }
from sklearn.ensemble import GradientBoostingClassifier

model=GradientBoostingClassifier()
random_search=RandomizedSearchCV(model,scoring='roc_auc',
                                 param_distributions=gbm_params,
                                 cv=10,n_iter=10,
                                 n_jobs=-1,verbose=False)
1 random_search.fit(x_train,y_train)
1 report(random_search.cv_results_,3)
This gives us our best model parameters as well as the corresponding performance. We can further
extract variable importance and make partial dependence plots here too, as we did for the Random
Forest results in the earlier module.
However, traditional GBM has a few conceptual issues:
GBM relies on the individual models being weak learners, but there is no framework enforcing this,
i.e. ensuring that the individual models actually remain weak learners.
The entire focus instead is on bringing down the cost, without any regularization on the complexity of
the model, which is eventually a recipe for over-fitting.
The contributions from individual models (scores/updates) are not aligned with the idea of
optimizing the cost; they are simple node averages instead.
Lets see how Xgboost addresses these concerns. The discussion will be deeply mathematical and
pretty complex at places. Even if it doesn't make sense in one go, don't worry about it. Focus on
implementation and usage steps; you can always give more passes to the mathematical details later on.

Xgboost optimizes a regularized objective:

Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)    ...(1)

where the first term represents the traditional loss that we have been using so far for GBM, and the
second term represents the regularization/penalty on the complexity of each of the individual models.
At any boosting step t, with predictions so far F_{t-1}(x), the objective for the new model f_t can be written as:

Obj_t = Σ_i l(y_i, F_{t-1}(x_i) + f_t(x_i)) + Ω(f_t)    ...(2)

As mentioned earlier, Xgboost, along with using regularization, uses some clever modifications to the
loss formulation that eventually make it a great algorithm. One of them is what we are going to
discuss next: using Taylor's expansion of the traditional loss.

g(x + a) ≈ g(x) + g'(x) a + ½ g''(x) a²

The expression written above represents the general Taylor's expansion of a function g, where a is a
very small quantity, g' is the first order derivative, g'' is the second order derivative and so on. Generally,
higher order terms (terms with powers of a higher than 2) are ignored, considering them to be too
close to zero, given that a is a very small number. Lets re-write (2) using this idea, treating f_t(x_i) as the
small quantity a:

Obj_t ≈ Σ_i [ l(y_i, F_{t-1}(x_i)) + g_i f_t(x_i) + ½ h_i f_t(x_i)² ] + Ω(f_t)

where g_i and h_i are the first and second order derivatives of the loss w.r.t. F_{t-1}(x_i). Notice that we
have ignored the higher order terms. Since l(y_i, F_{t-1}(x_i)) is a constant at step t, we can drop it and
simplify this expression so that we don't have to write such a complex mathematical expression every
time we write the objective function:

Obj_t ≈ Σ_i [ g_i f_t(x_i) + ½ h_i f_t(x_i)² ] + Ω(f_t)    ...(3)
We'll see in some time how this helps. So far, we haven't concretely defined the penalty term Ω.
Regularization Term
We have just written Ω as some general function without really being explicit about it. Before blindly
diving into how we define Ω, lets consider what we want to penalize here, what we want to
add the regularization term for:
We want to ensure that individual models remain weak learners. In the context of a shallow decision
tree, we need to ensure the tree size remains small. We can ensure that by adding a penalty on the
tree size (the number of terminal nodes in the tree).
We want the updates coming from individual models to be small; we can add an L1/L2 penalty on
the scores coming from the individual trees.
But what are these scores coming from individual trees? How do we represent a tree
mathematically? A tree is simply a set of rules which sends every observation to exactly one terminal
node, and each terminal node carries a fixed score.
As we know from our discussion in the last module, every observation which ends up in terminal
node 1 will have some fixed score w_1, and so on. Lets say q(x) represents the set of rules which the tree
is made of. The outcome of q(x) for each observation is one of the terminal node numbers, e.g. {1, 2, 3, 4},
essentially telling us which terminal node the observation will end up in. We can formally define
our model f_t (a decision tree with T terminal nodes) as follows:

f_t(x) = w_q(x)
Now that we have defined the model (a decision tree in this case), lets move on to formalize a
regularization term for it. Considering the two concerns that we raised above, here is one
regularization/penalty proposed by the Xgboost team. You can always come up with some formulation
of your own; this one works pretty well though.

Ω(f_t) = γ T + ½ λ Σ_{j=1..T} w_j²

The first term penalizes the size T of the tree, thus ensuring that the trees remain weak learners. The second term
penalizes the outcomes/scores of the tree, thus ensuring that the update coming from an individual model
remains small. Here γ controls the extent of the penalty on the size of the tree, and λ controls the
extent of the L2 penalty on the scores.
Adding the proposed regularization to the objective function (3) that we arrived at earlier gets us to
this:

Obj_t ≈ Σ_i [ g_i w_q(x_i) + ½ h_i w_q(x_i)² ] + γ T + ½ λ Σ_j w_j²    ...(4)

(4) still has some issues though. The first term is written as a summation over all observations and the
second summation term is written as a summation over the terminal nodes of the tree. In order to
consolidate things, we need to bring them under a summation over the same thing.
Consider the set I_j which contains the indices i of the observations which end up in node j. The outcome of the
model for node j is already defined as w_j. Using this, we can write all the summation terms
in (4) over the nodes of the tree:

Obj_t ≈ Σ_{j=1..T} [ (Σ_{i∈I_j} g_i) w_j + ½ (Σ_{i∈I_j} h_i + λ) w_j² ] + γ T    ...(5)
Optimal value of model score/weights
In traditional decision trees (the same idea gets used in traditional GBM also), the score/outcome of the
tree at a node (w_j for node j) is the simple average of the values in the node. However, the Xgboost team
modified that idea as well. Instead of using the simple average of outcomes in node j as the output, they
use the optimal value as per the objective in (5). Writing G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, setting the
derivative of (5) w.r.t. w_j to zero gives:

w_j* = - G_j / (H_j + λ)

and the corresponding best objective value is -½ Σ_j G_j² / (H_j + λ) + γ T.
OK, we have put a lot of work into developing/understanding this objective function. Where do we
use it? This is the objective function used to select rules when we are adding a new partition to our
tree (essentially building the tree).
Consider that we are splitting a node into its children. Some of the observations will go to the right hand
side node and some will go to the left hand side node. The change (gain) in the objective function when we
create these nodes is going to be:

Gain = ½ [ G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ) ] - γ

This will be evaluated for each candidate rule, and the one with the highest gain will be selected as the
rule for that particular node. This is another big change compared to how trees were being built in traditional
algorithms.
Before starting with building Xgboost models, lets take a quick look at parameters associated with
the implementation in python and what should we keep in mind when we are tuning them .
Xgboost parameters
n_estimators [default = 100]
Number of boosted trees in the model. If learning_rate/eta is small, you'll need higher number of
boosted trees to capture proper patterns in the data
learning_rate / eta [default=0.1]
Step size shrinkage used in updates to prevent over-fitting. After each boosting step, we can directly
get the weights of new features, and eta shrinks the feature weights to make the boosting process
more conservative.
range: [0,1]
gamma [default=0]
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger
gamma is, the more conservative (less prone to over-fitting) the algorithm will be.
range: [0,∞]
max_depth [default=3]
Maximum depth of a tree. Increasing this value will make the model more complex and more likely
to over-fit.
min_child_weight [default=1]
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a
leaf node with the sum of instance weight less than min_child_weight, then the building process will
give up further partitioning. In a linear regression task, this simply corresponds to the minimum number
of instances needed to be in each node; for classification it corresponds to a minimum amount of
impurity required to split a node. The larger min_child_weight is, the more conservative the
algorithm will be.
range: [0,∞]
max_delta_step [default=0]
Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no
constraint. If it is set to a positive value, it can help making the update step more conservative.
Usually this parameter is not needed, but it might help in logistic regression when the classes are
extremely imbalanced. Setting it to a value of 1-10 might help control the update.
range: [0,∞]
subsample [default=1]
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly
sample half of the training data prior to growing trees; this will prevent over-fitting. Subsampling
will occur once in every boosting iteration.
range: (0,1]
colsample_bytree, colsample_bylevel, colsample_bynode [default=1]
This is a family of parameters for subsampling of columns. All colsample_by* parameters have a
range of (0, 1], a default value of 1, and specify the fraction of columns to be subsampled.
colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling
occurs once for every tree constructed.
colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for
every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for
the current tree.
colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once
every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the
current level.
colsample_by* parameters work cumulatively. For instance, the combination
{'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8
features to choose from at each split.
reg_lambda (lambda) [default=1]
L2 regularization term on weights. Increasing this value will make the model more conservative.
reg_alpha (alpha) [default=0]
L1 regularization term on weights. Increasing this value will make the model more conservative.
tree_method, string [default='auto']
The tree construction algorithm used internally by XGBoost (for example 'exact', 'approx' or 'hist').
scale_pos_weight [default=1]
Control the balance of positive and negative weights, useful for unbalanced classes. A typical value
to consider: sum(negative instances) / sum(positive instances). Not relevant for regression problems.
As a general strategy you can start with tuning the number of trees, n_estimators. In the case of boosting
machines, learning_rate is directly related to n_estimators: a very low learning_rate will need a high
number of n_estimators. We can start with a decent fixed learning rate and tune n_estimators for it.
All other parameters can be left as default for now except subsample, colsample_bytree and colsample_bylevel;
these default to 1, and we'll take a more conservative value of 0.8.
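The first step, tuning n_estimators at a fixed learning rate, could look like the following sketch; the candidate values and cv setting here are illustrative assumptions:

from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import GridSearchCV

xgb_params = {"n_estimators": [25, 50, 100, 200, 500]}
xgb1 = XGBRegressor(learning_rate=0.1, subsample=0.8,
                    colsample_bylevel=0.8, colsample_bytree=0.8)
grid_search = GridSearchCV(xgb1, cv=10, param_grid=xgb_params,
                           scoring='neg_mean_absolute_error',
                           verbose=False, n_jobs=-1)
grid_search.fit(x_train, y_train)
report(grid_search.cv_results_, 3)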
we'll use n_estimators as 25 here onwards. Lets tune other remaining parameters sequentially.
Next we'll tune max_depth,gamma and min_child_weight, which control over-fit by controlling size of
individual trees
1 xgb_params={
2 "gamma":[0,2,5,8,10],
3 "max_depth": [2,3,4,5,6,7,8],
4 "min_child_weight":[0.5,1,2,5,10]
5 }
6 xgb2=XGBRegressor(n_estimators=25,subsample=0.8,
7 colsample_bylevel=0.8,colsample_bytree=0.8)
8 random_search = RandomizedSearchCV( xgb2, param_distributions = xgb_params,
9 n_iter = 20, cv= 10,
10 scoring ='neg_mean_absolute_error',
11 n_jobs =-1, verbose=False)
12 random_search.fit(x_train,y_train)
13 report(random_search.cv_results_,3)
Next we'll tune the subsampling arguments to take care of the potential effect of noisy observations [if this
were a classification problem, we'd have tuned max_delta_step and scale_pos_weight to take care of the
impact of class imbalance in the data before doing this]. Notice that now we use the values of the
parameters tuned so far from the best model outcomes of the search.
1 xgb_params={
2 'subsample':[i/10 for i in range(5,11)],
3 'colsample_bytree':[i/10 for i in range(5,11)],
4 'colsample_bylevel':[i/10 for i in range(5,11)]
5 }
6 xgb3=XGBRegressor(learning_rate=0.1,n_estimators=25,
7 min_child_weight=0.5,gamma=2,max_depth=2)
8 random_search=RandomizedSearchCV(xgb3,param_distributions=xgb_params,cv=10,
9 n_iter=20,scoring='neg_mean_absolute_error',
10 n_jobs=-1,verbose=False)
11 random_search.fit(x_train,y_train)
12 report(random_search.cv_results_,3)
1 xgb_params={
2 'reg_lambda':[i/10 for i in range(0,50)],
3 'reg_alpha':[i/10 for i in range(0,50)]
4 }
5 xgb4=XGBRegressor(n_estimators=25,min_child_weight=0.5,
6 gamma=2,max_depth=2, colsample_bylevel= 0.6,
7 colsample_bytree= 0.6, subsample= 0.6)
8 random_search=RandomizedSearchCV(xgb4,param_distributions=xgb_params,cv=10,
9 n_iter=20,scoring='neg_mean_absolute_error',
10 n_jobs=-1,verbose=False)
11 random_search.fit(x_train,y_train)
12 report(random_search.cv_results_,3)
Considering the best parameter values thus obtained , our final model is :
1 xgb5=XGBRegressor(n_estimators=25,min_child_weight=0.5,
2 gamma=2,max_depth=2, colsample_bylevel= 0.6,
3 colsample_bytree= 0.6, subsample= 0.6,
4 reg_lambda=3.5,reg_alpha=2.3)
5
Lets check its performance using cross validation
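A sketch of that cross validation call; the negation converts sklearn's negative mean absolute error back into positive error values, consistent with the numbers reported below:

from sklearn.model_selection import cross_val_score

scores = -cross_val_score(xgb5, x_train, y_train,
                          scoring='neg_mean_absolute_error',
                          cv=10, n_jobs=-1)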
1 scores
1 np.mean(scores)
1512.3334176203005
1 np.std(scores)
450.85032233797364
We should not build a complex model such as Xgboost for such a small dataset.
Although the average performance is better than the other alternatives, the variation across the
data is huge.
The performance is therefore not very reliable; you should not be looking at just the average performance.
1 x_train=cd.drop('Y',1)
2 y_train=cd['Y']
3
4 from xgboost.sklearn import XGBClassifier
1 xgb_params = {
2 "n_estimators":[100,500,700,900,1000,1200,1500]
3 }
4 xgb1=XGBClassifier(subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)
5 grid_search=GridSearchCV(xgb1,cv=5,param_grid=xgb_params,
6 scoring='roc_auc',verbose=False,n_jobs=-1)
7 grid_search.fit(x_train,y_train)
8 report(grid_search.cv_results_,3)
1 xgb_params={
2 "gamma":[0,2,5,8,10],
3 "max_depth": [2,3,4,5,6,7,8],
4 "min_child_weight":[0.5,1,2,5,10]
5 }
6 xgb2=XGBClassifier(n_estimators=500,subsample=0.8,
7 colsample_bylevel=0.8,colsample_bytree=0.8)
8 random_search=RandomizedSearchCV(xgb2,param_distributions=xgb_params,n_iter=20,
9 cv=5,scoring='roc_auc',
10 n_jobs=-1,verbose=False)
11 random_search.fit(x_train,y_train)
12 report(random_search.cv_results_,3)
1 xgb_params={
2 'max_delta_step':[0,1,3,6,10],
3 'scale_pos_weight':[1,2,3,4]
4 }
5 xgb3=XGBClassifier(n_estimators=500,min_child_weight=2,gamma=5,max_depth=6,
6 subsample=0.8,colsample_bylevel=0.8,colsample_bytree=0.8)
7
8 grid_search=GridSearchCV(xgb3,param_grid=xgb_params,
9 cv=5,scoring='roc_auc',n_jobs=-1,verbose=False)
10
11 grid_search.fit(x_train,y_train)
12 report(grid_search.cv_results_,3)
1 xgb_params={
2 'subsample':[i/10 for i in range(5,11)],
3 'colsample_bytree':[i/10 for i in range(5,11)],
4 'colsample_bylevel':[i/10 for i in range(5,11)]
5 }
6 xgb4=XGBClassifier(n_estimators=500,min_child_weight=2,gamma=5,max_depth=6,
7 scale_pos_weight=2,max_delta_step=0
8 )
9 random_search=RandomizedSearchCV(xgb4,param_distributions=xgb_params,
10 cv=5,n_iter=20,scoring='roc_auc',
11 n_jobs=-1,verbose=False)
12 random_search.fit(x_train,y_train)
13 report(random_search.cv_results_,3)
1 xgb_params={
2 'reg_lambda':[i/10 for i in range(0,50)],
3 'reg_alpha':[i/10 for i in range(0,50)]
4 }
5 xgb5=XGBClassifier(n_estimators=500,min_child_weight=2,gamma=5,max_depth=6,
6 scale_pos_weight=2,max_delta_step=0,
7 colsample_bylevel=1.0, colsample_bytree= 0.5, subsample= 0.9)
8 random_search=RandomizedSearchCV(xgb5,param_distributions=xgb_params,
9 cv=5,n_iter=20,scoring='roc_auc',
10 n_jobs=-1,verbose=False)
11
12 random_search.fit(x_train,y_train)
13 report(random_search.cv_results_,3)
1 xgb6=XGBClassifier(n_estimators=500,min_child_weight=2,gamma=5,max_depth=6,
2 scale_pos_weight=2,max_delta_step=0,
3 colsample_bylevel=1.0, colsample_bytree= 0.5, subsample= 0.9,
4 reg_lambda=0.7,reg_alpha=0.3)
5 scores=cross_val_score(xgb6,x_train,y_train,scoring='roc_auc',
6 verbose=False,n_jobs=-1,cv=10)
7
1 np.mean(scores)
0.9292637432983983
1 np.std(scores)
0.0030315969143107297
We'll conclude our discussion on boosting machines here. For making predictions on new data, the
same functions on the model object, predict and predict_proba, can be used. As usual, you'll need
to ensure that the test data on which you want to make predictions has the same columns and types as
the training data which was passed to the algorithm while training.
Note : different runs might lead to slightly different parameter values; however, the performance
wouldn't be wildly different.
Chapter 7 : KNN, Naive Bayes and SVM
We will cover the following points in this discussion: K-Nearest Neighbours (KNN), Naive Bayes and Support Vector Machines (SVM).
In KNN for classification, an observation is classified by a majority vote of its neighbors. This
observation is assigned to a class most common amongst its k nearest neighbors. k is a small
positive number which stands for the number of nearest observations from which we wish to take a
vote.
For regression, the difference is that instead of taking a vote from the k nearest neighbors, the
algorithm considers averages of the k nearest neighbors.
In the diagram above, our objective is to classify the light blue circles i.e. determine whether they
should be dark blue or yellow considering their closest neighbors. We choose k=3 i.e. we will consider
the 3 closest neighbors to determine which class each light blue circle belongs to. In the diagram, you
will observe a bigger circle drawn around each of the light blue points which includes the three circles
closest to it.
Lets consider the top left bigger circle. For the light blue circle within it, we observe that its three
closest neighbors are dark blue. Hence, we will consider the majority vote and classify the light blue
circle as dark blue.
Now, considering the top right bigger circle, the three closest neighbors for the light blue circle are
yellow, hence this light blue circle will be classified as a yellow circle.
However, observing the bottom bigger circle, the light blue circle has two yellow circles and one dark
blue circle as its closest neighbors. Since yellow circles are in a majority here, it will be classified as a
yellow circle.
Having said that, for the bottom circle, despite the majority being yellow, we observe that
the dark blue circle is closer to the light blue circle. What if we also want to consider the distance of the
nearest neighbors to the light blue circle while determining its class? To take care of this, we can
consider the inverse of the distance as weights for the votes. In this case, the dark blue circle will be
given a higher weight, say 0.7, since it is closer to the light blue circle, and the two yellow circles will be
given lower weights, say 0.3 and 0.25, since they are further away from the light blue circle. Now
when we consider the weights for the two classes, the dark blue circles will have a weight of 0.7 and
the yellow circles will have a weight of 0.55 (0.3+0.25). Considering this, the dark blue class will have
a higher vote now as compared to the two yellow circles combined and hence the light blue circle
will be classified as dark blue one. In short, if we consider a weighted majority vote instead of a
simple majority vote, our classification may be different.
One thing to note here is that if we change the value of 'k' i.e. the number of closest neighbors to
consider, then the classification of the light blue balls may change depending on the neighborhood
size.
Since KNN relies on a distance measure to figure out which class a test observation belongs to,
scaling the data becomes a must; if we do not scale the data, features with larger
values will be dominant.
We import the KNeighborsClassifier to classify the data, using 'k' as 5 i.e. the algorithm will consider
5 nearest neighbors to assess class membership.
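A minimal sketch of that step, including the scaling discussed above, assuming X_train, X_test, y_train and y_test already hold a split dataset:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# put all features on a single scale so no variable dominates the distance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)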
1 y_pred = classifier.predict(X_test)
1 # comparing actual response values (y_test) with predicted response values (y_pred)
2 from sklearn import metrics
3 print("KNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
Reference: https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
Points to note:
The KNN algorithm is purely distance based and variables which have larger values i.e. are high
on scale will dominate. In order to avoid this, we will need to standardize the variables such that
all the variables are on a single scale.
Votes can be weighted in different ways; the inverse of the distance is most popular.
Lower values of k will capture very niche patterns and will tend to over-fit.
Very high values of k will capture broad patterns but may miss local patterns in the data.
There is no model equation here; the prediction depends completely on the training data. e.g.
when we take a new observation to be classified, we determine its class by checking the classes
of the training data observations present in its neighborhood. Hence, training data itself is the
model.
KNN can capture very local patterns in the data whereas other algorithms can capture patterns
across the data. Standalone KNN is not a very good algorithm because it relies on simple
neighborhood voting. There is no general pattern extraction from different variables. So they are
rarely used on a standalone basis. But since they can extract very niche patterns, it is very useful
in stacking algorithms.
Naive Bayes
Naive Bayes is a classification algorithm based on Bayes Theorem with a naive assumption of
independence among features i.e. presence of one feature does not affect the other. These models
are easy to build and very useful for high dimensional datasets.
The fundamental Naive Bayes assumption is that each feature makes an independent and equal
contribution to the outcome. This assumption is not generally correct in the real world but works
quite well in practice.
Say we have a bag with 10 balls, 6 blue and 4 red. If we choose a ball from this bag, then the
probability that it is blue will be 6/10 i.e. the number of blue balls divided by all the possible
selections and similarly the probability that it will be red will be 4/10.
Lets say, we have a pack of cards of 52 cards. If we choose a card from this pack at random, then, say
for event A the probability that it is a 5 is 4/52 i.e. the number of 5's present in the pack divided by
the total number of cards. And, say for event B, the probability that it is red is 26/52 i.e. total number
of reds in the pack of cards divided by the total number of cards.
Now, instead of seeing two events in isolation, lets look at them together i.e lets consider the
probability of selecting a 5 which is red. In other words, we wish to find the probability when both
the events A and B happen together.
We know that in the pack of cards, we have two 5's that are red as well. So the probability of
selecting a 5 which is red as well is 2/52.
Now, lets say we want to find the probability of selecting a 5 given that we already have the red cards,
P(A|B), i.e. the probability of an event A happening given that event B has happened already. All
possible outcomes in this case will be the number of red cards i.e. the count of B - 26 red cards. Within
this subset we select a 5 that is red i.e. both the events A and B have occurred. Therefore the
probability that a 5 is selected, given that we restrict ourselves to the red cards only, is
P(A|B) = P(A and B) / P(B) = (2/52) / (26/52) = 2/26.
Now, instead of a single factor B, what if we have multiple factors B1, B2, B3, etc.? We want to
know the probability of getting A given multiple factors B1, B2 etc. To handle this we can
extend the Bayes Theorem as follows:

P(A | B1, B2, ..., Bn) = P(B1, B2, ..., Bn | A) * P(A) / P(B1, B2, ..., Bn)

This basically means that, given certain factors B1, B2, B3 etc., what is the probability of getting the
outcome A.
Now, when we consider Naive Bayes, we say that each of these factors B1, B2, B3 etc., given that A
has happened already, are independent of each other; hence we can write the above equation as
follows:

P(A | B1, B2, ..., Bn) = P(A) * P(B1|A) * P(B2|A) * ... * P(Bn|A) / P(B1, B2, ..., Bn)

Note: The denominator will stay constant; it will not change with A, so for classification we only need
to compare the numerators across classes.
For a detailed example with numbers, please visit the following link: https://stackoverflow.com/quest
ions/10059594/a-simple-explanation-of-naive-bayes-classification or refer to the class presentation.
Gaussian Naive Bayes: In Gaussian Naive Bayes, continuous values associated with each feature
are assumed to be distributed according to a Gaussian distribution.
Multinomial Naive Bayes: Features have discrete values. This is primarily used for document
classification. In this case the features used are the frequency of the words present in the
document.
Bernoulli Naive Bayes: This is similar to the multinomial naive Bayes but the predictors are
Boolean variables. This is also popular for document classification tasks.
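The output shown below comes from fitting a GaussianNB model; the fitting code itself is not shown in this excerpt, so here is a hedged sketch using the iris data as a stand-in for whatever dataset the original notebook used.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# the dataset, split proportion and random_state are assumptions for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

gnb = GaussianNB()
gnb.fit(X_train, y_train)       # displaying the fitted estimator gives the output below
y_pred = gnb.predict(X_test)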
1 GaussianNB(priors=None, var_smoothing=1e-09)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy:", metrics.accuracy_score(y_test, y_pred))
Reference: https://www.geeksforgeeks.org/naive-bayes-classifiers/
Support Vector Machine (SVM)
Considering the diagram above, we observe that the orange line separates the two classes, i.e. stars and circles. Any observation that lies to the left of the line falls in the star class and any observation to the right of the line falls under the circle class. In short, SVM separates the classes using a line or a hyper-plane (for higher dimensions).
Many hyper-planes can be considered to categorize the two classes completely. However, using SVM,
our objective is to find the hyper-plane which has maximum distance from points of either class.
Maximizing the distance increases the confidence that the future data points will be categorized
correctly as well.
This distance between the hyper-plane and the nearest data points is called 'margin'. Margin helps
the algorithm decide the optimal hyper-plane.
Now, lets consider what happens when the classes cannot be separated by a line/hyper-plane.
Considering the leftmost diagram above, we can see that a linear boundary cannot categorize the
two classes in the x-y plane. In order to be able to categorize the classes, another dimension z is
added as can be observed from the middle figure. We can see that a clear separation is visible and a
line can be used to categorize the two classes. When we now transform the data back into the x-y
plane, the linear boundary becomes a circle as can be observed from the right most figure.
These transformations are called 'kernels'. Using this 'kernel trick', the algorithm takes a low
dimensional input and converts it into a higher dimensional space; thereby converting a non-
separable problem to a separable one.
In order to understand the math behind SVM, you may want to refer to the following excellent link: h
ttps://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/
Next, we will discuss the parameters used in SVM. There are three parameters we primarily tune in this algorithm:
kernel - It defines a distance measure between new data and the support vectors, i.e. the observations closest to the hyper-plane. The dot product is the similarity measure used for a linear kernel, since the distance is a linear combination of the inputs. When we consider higher dimensions, other kernels such as a Polynomial Kernel and a Radial Kernel can be used, which transform the input space into higher dimensions.
C (Regularization parameter) - When the value of C is large, a smaller-margin hyper-plane will be considered, since it stresses getting all the training points classified correctly. On the other hand, a small value of C will consider a larger-margin hyper-plane, even if some points are misclassified by the hyper-plane.
gamma - The gamma parameter defines how far the influence of each training observation reaches in the calculation of the optimal hyper-plane. Low gamma values consider points even if they are far away from the plausible hyper-plane, whereas high gamma values consider only the points which are closer to the plausible hyper-plane. A minimal example of creating such a classifier is sketched below.
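The svc object fitted below is not defined in this excerpt; a sketch of creating it, with illustrative (not the original) parameter values:

from sklearn.svm import SVC

# kernel, C and gamma are the three parameters discussed above;
# the values here are assumptions for illustration
svc = SVC(kernel='rbf', C=1.0, gamma='scale')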
1 svc.fit(X_train, y_train)
1 # comparing actual response values (y_test) with predicted response values (y_pred)
2 from sklearn import metrics
3 print("SVM model accuracy:", metrics.accuracy_score(y_test, y_pred))
1 import os
2 path=r"\Users\anjal\Dropbox\PDS V3\Data\reuters_data"
We will start with collecting all the data in a single file first.
1 files=os.listdir(path)
The os.listdir(path) function returns all the files present in the folder specified by the 'path'
argument.
1 files[0:10]
['.ipynb_checkpoints',
'.RData',
'.Rhistory',
'training_crude_10011.txt',
'training_crude_10078.txt',
'training_crude_10080.txt',
'training_crude_10106.txt',
'training_crude_10168.txt',
'training_crude_10190.txt',
'training_crude_10192.txt']
We want content only from the .txt files; however, we can see that some other files are present here
as well. We can clean them up i.e. keep only those files which have a .txt in their file name.
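The clean-up step itself is not shown in this excerpt; a one-line sketch of what it might look like:

files = [f for f in files if '.txt' in f]   # keep only the .txt files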
Now if we look at files, the unnecessary files are not there anymore.
1 files[0:10]
['training_crude_10011.txt',
'training_crude_10078.txt',
'training_crude_10080.txt',
'training_crude_10106.txt',
'training_crude_10168.txt',
'training_crude_10190.txt',
'training_crude_10192.txt',
'training_crude_10200.txt',
'training_crude_10228.txt',
'training_crude_1026.txt']
target=[]
article_text=[]
for file in files:
    if '.txt' not in file: continue                 # do not read the content from a file not having .txt in its name
    f=open(path+'\\'+file,encoding='latin-1')       # for every file a handle is created
    article_text.append(" ".join([line.strip() for line in f if line.strip()!=""]))   # removes empty lines using strip() and returns a single string for each article
    if "crude" in file:                             # if the file name has crude, append 'crude' to the target list, else 'money'
        target.append("crude")
    else:
        target.append("money")
    f.close()
1 mydata=pd.DataFrame({'target':target,'article_text':article_text})
The dataframe 'mydata' consists of two columns, the 'article_text' column containing the text and the
'target' column consisting of the topic of the text.
1 mydata.head()
target article_text
1 mydata.shape
(927, 2)
1 mydata['article_text'][0]
'CANADA OIL EXPORTS RISE 20 PCT IN 1986 Canadian oil exports rose 20 pct in 1986 over the
previous year to 33.96 mln cubic meters, while oil imports soared 25.2 pct to 20.58 mln cubic
meters, Statistics Canada said. Production, meanwhile, was unchanged from the previous year
at 91.09 mln cubic feet. Natural gas exports plunged 19.4 pct to 21.09 billion cubic meters, while
Canadian sales slipped 4.1 pct to 48.09 billion cubic meters. The federal agency said that in
December oil production fell 4.0 pct to 7.73 mln cubic meters, while exports rose 5.2 pct to 2.84
mln cubic meters and imports rose 12.3 pct to 2.1 mln cubic meters. Natural gas exports fell
16.3 pct in the month 2.51 billion cubic meters and Canadian sales eased 10.2 pct to 5.25 billion
cubic meters.'
1 mydata['target'].value_counts()
money 538
crude 389
Name: target, dtype: int64
We see that out of the 927 articles, 538 belong to the 'money' category and 389 belong to the 'crude'
category.
To find the tentative performance on our model we will break the dataset into training and
validation parts.
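The splitting code is not shown in this excerpt; a sketch using train_test_split (the 80/20 proportion and random_state are assumptions, chosen to match the 741/186 shapes seen further down):

from sklearn.model_selection import train_test_split

article_train, article_test = train_test_split(mydata, test_size=0.2, random_state=2)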
target article_text
375 crude OPEC WITHIN OUTPUT CEILING, SUBROTO SAYS Opec ...
589 money CURRENCY FUTURES TO KEY OFF G-5, G-7 MEETINGS ...
1 article_train.reset_index(drop=True,inplace=True)
1 article_train.head()
target article_text
1 y_train=(article_train['target']=='money').astype(int)
2 y_test=(article_test['target']=='money').astype(int)
Now we have the data in a column format in a single file. However, we have still not created features
that can be used by the different machine learning algorithms for classification of articles.
One simple idea is that we consider every word across all the articles i.e. create a dictionary
containing all the distinct words present across all the articles. We can then consider a word from
the dictionary and count the number of times it is present in each article; e.g. considering the word
'currency', we can count the number of times it is present in each article and store this count. The
count of different words can be used as features i.e. count features.
When doing this, we will come across many words that do not contribute to an article belonging to one category or the other. These words are called stopwords, e.g. 'the', 'to', 'a', etc. Such words are not useful for differentiating the articles and can be removed.
Also, we may want all words to be converted to their base form i.e. both 'playing' and 'played' should
be converted to 'play'. This is referred to as lemmatization. We will use the WordNetLemmatizer for
this.
We also want to remove punctuation from the text else punctuation will be considered as separate
features.
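The cleaning function below relies on a stopword list my_stop, a lemmatizer lemma and the word_tokenize function, none of which are defined in this excerpt. A sketch of how they might be set up (the exact stopword list used in the original notebook is an assumption):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import string

my_stop = set(stopwords.words('english')) | set(string.punctuation)   # stopwords plus punctuation symbols
lemma = WordNetLemmatizer()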
# Function to clean the text data
def split_into_lemmas(message):
    message=message.lower()                  # converts all text to lowercase
    words = word_tokenize(message)
    words_sans_stop=[]
    for word in words:
        if word in my_stop: continue
        words_sans_stop.append(word)         # keeps all words from the message except the stopwords
    return [lemma.lemmatize(word) for word in words_sans_stop]   # lemmatized words returned
CountVectorizer tokenizes the text and builds a vocabulary of words which are present in the text
body.
'analyzer = split_into_lemmas' sends the text to the function split_into_lemmas() and then gets
every word which is cleaned.
'min_df = 20' argument is used so that only those words are considered which are present in at least 20 of the documents (articles).
'max_df = 500' indicates that words which are present in more than 500 of the documents will not be considered.
'stop_words = my_stop' argument takes the stop words defined earlier as its input.
CountVectorizer just counts the occurrences of each word in its vocabulary. Hence common words like 'the', 'and', 'a' etc. would become very important features since their frequency is high, even though they add little meaning to the text. The words considered as stop words are therefore not used as features.
tf = CountVectorizer(analyzer=split_into_lemmas, min_df=20, max_df=500, stop_words=my_stop)
The fit() function is used to learn the vocabulary from the training text and the transform() function is used to encode each article text as a vector. This encoded vector has the length of the whole vocabulary and, for each article, holds an integer count of the number of times each word appeared in that article's text.
1 tf.fit(article_train['article_text'])
1 train_tf=tf.transform(article_train['article_text'])
1 train_tf
1 print(train_tf.shape)
2 print(type(train_tf))
3 print(train_tf.toarray())
4
1 (741, 677)
2 <class 'scipy.sparse.csr.csr_matrix'>
3 [[0 5 0 ... 0 0 0]
4 [0 0 0 ... 0 0 0]
5 [2 2 0 ... 0 0 0]
6 ...
7 [0 1 0 ... 0 0 0]
8 [0 1 0 ... 0 0 0]
9 [1 6 0 ... 0 1 0]]
We observe that the array version of the encoded vector shows counts for different words.
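The dataframe displayed below, x_train_tf, is presumably built from the encoded vectors in the same way as is done for the test data further down; a sketch:

x_train_tf = pd.DataFrame(train_tf.toarray(), columns=tf.get_feature_names())
x_train_tf.head()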
1 '' 's -- ... 1 1.5 10 100 15 15.8 ... work working world \
2 0 0 5 0 0 1 0 2 0 0 0 ... 0 0 0
3 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
4 2 2 2 0 0 0 0 0 0 0 2 ... 0 0 0
5 3 0 0 0 0 0 0 0 1 0 0 ... 0 0 0
6 4 7 1 2 0 0 0 0 0 0 0 ... 0 0 0
7
8 worth would year yen yesterday yet york
9 0 0 3 1 0 0 0 0
10 1 0 1 0 0 0 0 0
11 2 0 3 0 0 0 0 0
12 3 0 3 0 0 0 0 0
13 4 0 5 0 7 0 0 0
14
15 [5 rows x 677 columns]
1 test_tf=tf.transform(article_test['article_text'])
1 test_tf.toarray()
1 x_test_tf=pd.DataFrame(test_tf.toarray(), columns=tf.get_feature_names())
1 x_test_tf.head()
2 # display in pdf will be truncated on the right hand side
-
'' 's ... 1 1.5 10 100 15 15.8 ... work working world worth would year
-
0 2 1 0 0 0 0 1 1 0 0 ... 0 0 0 0 2 0
1 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0
3 0 3 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 7
4 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
5 rows × 677 columns
In the code below you will notice that the number of columns created is 677, which is quite a lot. Text-based features usually end up being very numerous.
1 print(x_train_tf.shape)
2 print(x_test_tf.shape)
(741, 677)
(186, 677)
KNN
1 knn=KNeighborsClassifier(n_neighbors=10)
1 knn.fit(x_train_tf,y_train)
1 predictions=knn.predict(x_test_tf)
1 accuracy_score(y_test,predictions)
0.9623655913978495
SVM
1 clf_svm=SVC()
1 clf_svm.fit(x_train_tf,y_train)
1 accuracy_score(y_test,clf_svm.predict(x_test_tf))
0.978494623655914
Naive Bayes
1 clf_nb=MultinomialNB()
1 clf_nb.fit(x_train_tf,y_train)
1 accuracy_score(y_test,clf_nb.predict(x_test_tf))
0.989247311827957
We observe that Naive Bayes performs better than the rest of the algorithms for text classification in this example. This does not mean that this will be the case for every problem. We should in any case try multiple algorithms and choose the one which performs best for the particular problem under discussion. Another point for this particular discussion: notice that we haven't tuned any parameters. You can follow a process similar to the one used in earlier modules to do so.
Chapter 8 :Neural Networks
We will cover the following points in our discussion:
We will start with discussing how the neural networks are represented.
We can see that there are three things that make up a neural network:
Referring to the diagram above, we can see that there are two input variables, x1 and x2, and a bias term with a constant value 1. Each of these inputs is multiplied by the corresponding weight w0, w1 or w2. The neuron then sums these products and passes the sum to an activation function, introducing non-linearity into the model. In the example above the non-linearity is introduced using the sigmoid function. This process results in a single output from the neuron, i.e. $\sigma(w_0 + w_1 x_1 + w_2 x_2)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
In the figure above, there are two layers: the input layer and the output layer.
Why do we need to introduce non-linearity in the network or why do we need to use an activation
function?
1. To project values having an infinite range (-inf to +inf) i.e. linear function to probabilities where
the value range may be from 0 to 1 or -1 to +1 or something else depending on the activation
function used.
2. These functions introduce non-linearity which enables the detection of non-linear patterns in the
data. If only linear activation functions are used in the network, a combination of these linear
functions will also be a linear function.
3. Activation functions decide whether a neuron fires or not, provided its value crosses a certain
threshold.
Usually there will be more layers present as shown in the figure below. The layers between the input
and output layers are referred to as hidden layers. In the figure below, we have three layers: the
input layer, a single hidden layer and an output layer. Note: hidden nodes are not connected directly
to the input data or the eventual output. We can have any number of hidden layers as well as any
number of nodes in each layer.
Bias node
We will now discuss about bias nodes. These nodes are added to neural networks to help the
network learn patterns. They act as an input node and always contain a constant value; can be 1 or
some other value. Due to this they are not connected to any previous layer. For understanding bias
further, you may refer to the following post: https://www.quora.com/What-is-bias-in-artificial-neural-
network
Hidden layer
Let's consider the hidden layer in the diagram above. The first node in the hidden layer is the bias node. The second node takes as its input a weighted linear combination of all the nodes in the previous layer, i.e. something of the form $z = w_0 \cdot 1 + w_1 x_1 + w_2 x_2$, and applies an activation function to this linear combination of inputs. In this case, we are using the sigmoid activation function.
So what comes in at each node of the hidden layer is this weighted sum $z$, and what goes out is $\sigma(z)$, i.e. the sigmoid function applied to $z$.
Output layer
For the third layer, i.e. the output layer, we get a weighted linear combination of the outputs of all the nodes of the previous (hidden) layer, along with the bias term.
We need to keep in mind that as we keep adding layers the output function keeps getting more
complex.
Neural network is a very powerful idea. It can model any non-linear pattern present in the data.
"A feed-forward network with a single layer is sufficient to represent any function, but the layer may
be in-feasibly large and may fail to learn and generalize correctly." -??????Ian Goodfellow, DLB
Neural networks may not perform very well if the dataset is too small. But with sufficient data points
neural networks function very well.
Activation Functions
Till now we have considered the sigmoid activation function only. There are a number of other
activation functions that can be used as well.
Reference: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
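For concreteness, here is a small numpy sketch of a few standard activation functions (standard definitions, not code from the original notebook):

import numpy as np

def sigmoid(z):              # squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):                 # squashes into (-1, 1)
    return np.tanh(z)

def relu(z):                 # 0 for negative inputs, identity for positive ones
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))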
The example uses a neural network with two inputs, one output and two hidden neurons. Our
inputs are 0.05 and 0.10 and output is 0.01.
Forward Pass
Given the inputs, bias and the randomly initialized weights, we can check what prediction is done by
the neural network moving from left to right.
Lets begin.
We will first compute the weighted linear combinations of the inputs $i_1$ and $i_2$ coming from the first layer, i.e. $net_{h1} = w_1 i_1 + w_2 i_2 + b_1$ and $net_{h2} = w_3 i_1 + w_4 i_2 + b_1$.
$net_{h1}$ and $net_{h2}$ are the inputs to the next layer, i.e. the hidden layer. Here an activation function is applied, introducing non-linearity and squashing the input, i.e. $out_{h1} = \sigma(net_{h1})$ and $out_{h2} = \sigma(net_{h2})$.
Applying the activation function, in this case the sigmoid function, to $net_{h1}$ gives $out_{h1}$ as follows:
$$out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$$
Note: The activation function need not necessarily be sigmoid. It can be any of the activation
functions discussed before.
After we get the prediction, i.e. the output $out_{o1}$ produced by the first pass of the neural network using the random weights, we calculate the total error.
One of the ways we calculate the error of the output for each observation is the squared error function:
$$E = \frac{1}{2}(target - output)^2$$
Note: Here we have a single neuron in the output layer. In case there are multiple neurons, we sum
the errors computed for each neuron on the output layer.
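To make the forward pass concrete, here is a small numpy sketch with the inputs and target from the example; the initial weights and biases below are hypothetical values chosen for illustration, not the ones shown in the book's figures.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

i1, i2, target = 0.05, 0.10, 0.01          # inputs and desired output from the example
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30    # input -> hidden weights (assumed values)
w5, w6 = 0.40, 0.45                        # hidden -> output weights (assumed values)
b1, b2 = 0.35, 0.60                        # bias terms (assumed values)

net_h1 = w1*i1 + w2*i2 + b1                # weighted sums entering the hidden layer
net_h2 = w3*i1 + w4*i2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

net_o1 = w5*out_h1 + w6*out_h2 + b2        # weighted sum entering the output neuron
out_o1 = sigmoid(net_o1)                   # the network's prediction

E_total = 0.5 * (target - out_o1)**2       # squared error for this observation
print(out_o1, E_total)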
Backward Pass
Backpropagation is done to optimize the weights so that the neural network can learn how to get the
optimum weights which result in minimum error.
Let us start with one of the weight parameters; consider $w_5$, one of the weights connecting the hidden layer to the output. We want to know how much a small change in $w_5$ will affect the total error $E_{total}$. In other words, we need to find the partial derivative of $E_{total}$ w.r.t. $w_5$, i.e. the gradient $\frac{\partial E_{total}}{\partial w_5}$.
In order to find this, we will apply the chain rule from our knowledge of derivatives:
$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}$$
Taking the derivative of the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ w.r.t. $x$ gives $\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$.
Once we have the value of $\frac{\partial E_{total}}{\partial w_5}$, i.e. how much the error changes with a small change in $w_5$, we can compute the new value of $w_5$ as follows:
$$w_5^{new} = w_5 - \eta \cdot \frac{\partial E_{total}}{\partial w_5}$$
Here $\eta$ is a special value known as the learning rate. The learning rate determines how quickly or slowly we want to update the parameters/weights. We will consider $\eta$ to be 0.5. This is a hyper-parameter to be tuned.
We will continue with our backward pass to calculate the new values of $w_1$, $w_2$, $w_3$ and $w_4$ as well.
Let's start with finding the updated value for $w_1$. Similarly, we will be able to find the values for $w_2$, $w_3$ and $w_4$ too.
Let us consider the term $\frac{\partial E_{total}}{\partial out_{h1}}$ in the above equation (15):
We have computed both of these values before; hence, substituting, we get the following:
Now we need to compute $\frac{\partial net_{o1}}{\partial out_{h1}}$ in order to complete equation (15), i.e. in order to compute $\frac{\partial E_{total}}{\partial out_{h1}}$. We know that $net_{o1} = w_5\, out_{h1} + w_6\, out_{h2} + b_2$, and hence $\frac{\partial net_{o1}}{\partial out_{h1}} = w_5$.
Now, referring to equation (14), we still need to find the values of $\frac{\partial out_{h1}}{\partial net_{h1}}$ and $\frac{\partial net_{h1}}{\partial w_1}$ to compute $\frac{\partial E_{total}}{\partial w_1}$.
Putting it all together, we will be able to observe the effect of a small change in $w_1$ on the overall error, i.e. $\frac{\partial E_{total}}{\partial w_1}$.
We can compute the new value for $w_1$ as follows: $w_1^{new} = w_1 - \eta \cdot \frac{\partial E_{total}}{\partial w_1}$
After the first round of back-propagation, a forward pass is done again and we compute the error once more. The total error should now be less than the previous total error.
This cycle continues until we reach the minimum possible overall error.
Reference: https://mattmazur.com/2015/03/17/a-step-by-step-back-propagation-example/
In the forward pass, given inputs and random weights we understand how the output is computed.
After the training of the neural network is complete, we run the forward pass only to make
predictions.
But in order to be able to make these predictions that are close to the actual values, we need
optimal weights. For this we train our model to learn the weights.
Reference: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d
7834f67a4f6
We will work with the census income dataset where we predict the income level i.e. whether it is
more than 50K dollars or less than that amount given the characteristics of the people.
1 import pandas as pd
2 import numpy as np
3 from sklearn.neural_network import MLPClassifier
4 from sklearn.metrics import roc_auc_score
5 from sklearn.model_selection import RandomizedSearchCV
1 file=r'/Users/anjal/Dropbox/0.0 Data/census_income.csv'
2
3 cd= pd.read_csv(file)
1 cd['Y'].unique()
1 cd['Y']=(cd['Y']==' >50K').astype(int)
1 del cd['education']
2 # we have already discussed the reason for this earlier
1 cat_cols=cd.select_dtypes(['object']).columns
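The dummy-variable creation that produces the 49 columns seen below is not shown in this excerpt; a sketch of that step (pd.get_dummies is one way to do it; the exact code in the original notebook may differ):

cd = pd.get_dummies(cd, columns=cat_cols)   # one-hot encode the categorical columns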
1 cd.shape
(32561, 49)
After completing the data preparation part, we observe that there are about 32000 observations and
49 variables.
1 x_train=cd.drop(['Y'],axis=1)
2 y_train=cd['Y']
Parameters
alpha - This is a penalty term added to our cost function to regularize it, so that we do not end up over-optimizing the parameters. The higher the value of alpha, the more penalty we add.
1 parameters={
2 'learning_rate': ["constant", "invscaling", "adaptive"],
3 'hidden_layer_sizes': [(5,10,5),(20,10),(10,20)],
4 'alpha': [0.3,.1,.01],
5 'activation': ["logistic", "relu", "tanh"]
6 }
We use Randomized Search to find which of these parameters give the best results.
1 clf=MLPClassifier()
random_search=RandomizedSearchCV(clf, n_iter=5, cv=10, param_distributions=parameters,
                                 scoring='roc_auc', random_state=2, n_jobs=-1, verbose=False)
1 random_search.fit(x_train,y_train)
We get the model resulting from the best parameter combination using the following code:
1 random_search.best_estimator_
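The report() helper used below is not defined in this excerpt; here is a sketch of such a function, adapted from the scikit-learn RandomizedSearchCV example, that produces output in the format shown:

def report(results, n_top=3):
    # results is the cv_results_ dictionary of a fitted search object
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.5f} (std: {1:.5f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")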
1 report(random_search.cv_results_,5)
Model with rank: 3
Mean validation score: 0.51063 (std: 0.01143)
Parameters: {'learning_rate': 'adaptive', 'hidden_layer_sizes': (5, 10, 5), 'alpha': 0.01, 'activation':
'tanh'}
In the model giving the best performance, there are two hidden layers, the first layer having 20
nodes and the second layer having 10 nodes; regularization parameter is 0.3 and activation function
is relu.
We have tried very few combinations. We can go ahead and try more parameter combinations and
see if we get a better model performance.
1 mlp = random_search.best_estimator_
Now since we know the best parameter combination, we can fit the model on the entire training
data and make predictions on the validation data.
1 mlp.fit(x_train,y_train)
After fitting the model, we make predictions just as we made for the algorithms discussed earlier.
Chapter 9 : Unsupervised Learning
In this module, we are going to look at certain kinds of business problems which are very different from problems where the eventual agenda is to make predictions. Consider some of these problem statements:
Finding segments in your customer population which are different from each other and similar within, in order to run more customized marketing campaigns
Finding anomalies in the data, i.e. observations which are very different from the rest of the population
Reducing representation redundancy in the data, i.e. decreasing the number of columns in the data without loss of information
Visualizing high dimensional data in a 2D or 3D plane
None of these problems fit in the framework of predictive modelling that we have studied so far; there is no target to build the model for. However, these are legitimate business problems to solve. We'll be learning techniques to address these kinds of business problems here.
We'll start by addressing the first problem: finding groups which are similar within and different from each other. To address this problem we first need to define what we mean by 'similar'.
Similarity Measures
In this simple image we have created two groups on the basis of our intuition. That intuition was nothing but distance based: we grouped the points which were closer together. Using Euclidean distance as the similarity measure is pretty common and works for most of the data that we get to work with.
Distances
Euclidean distance, however, is a specific case of the Minkowski distance, which is defined as follows [between two points $x$ and $y$ in $n$ dimensions]:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
Euclidean distance is the special case $p = 2$. Choosing the distance measure carefully matters when all variables are not on the same scale; Euclidean distances, if used on such data, end up exaggerating differences in the dimension which is bigger on the numeric scale.
Cosine Similarity
Cosine similarity is defined as the cosine of the angle between two data points represented as vectors:
$$\text{cosine similarity}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$$
This takes values between -1 and 1, where -1 implies most dissimilar and 1 means identical. As a similarity measure, this is mostly used in the context of text documents or categorical data.
Jaccard Similarity
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $|A|$ for a set $A$ is the number of unique elements in the set [also known as the cardinality of the set]. This again is used mostly in the context of text data or categorical data.
All these different distance/similarity measures can be computed for a pair of points. But we also need to come up with some way of calculating distances between groups of points. Why? Imagine how you would arrive at, say, 3 groups for a dataset containing 1000 observations.
Data Standardisation
Distances which rely on the scale of the individual dimensions might end up exaggerating the impact of some dimensions over the others. Here is an example to understand this intuitively. Consider two individuals with their age and income given:
Intuitively we know that the difference in ages here is much more noticeable in comparison to the difference between their incomes. This intuition is driven by the natural scale of ages, where a difference of 20 years is significant whereas a difference of just Rs 10,000 is not much in the context of income. However, if we let this data stay at its given scale and calculate the distance, the opposite happens: because age is on the smaller numeric scale, the difference there doesn't matter at all. This can be mitigated by removing the effect of scale, which is achieved by centering the data with the mean [or median] and scaling the data with the standard deviation [or MAD or range etc.].
Here is a mathematical explanation of the process and the impact it has. Let's say there is one column $X$ which takes values $x_1, x_2, \dots, x_n$, with mean $\bar{x}$ and standard deviation $s_X$. The standardized version is
$$y_i = \frac{x_i - \bar{x}}{s_X}$$
Now let's see what the mean and standard deviation of this standardized version of $X$ [that is, $Y$] are: $\bar{y} = 0$ and $s_Y = 1$.
This means that the transformed column will have mean 0 and standard deviation 1, irrespective of what values the column takes. This brings all the columns to the same scale and removes the effect of larger numeric scales on distance calculations. The impact of using other measures for centering and scaling is similar, if not exactly the same.
With distances and scaling in place, we now need ways to define the distance between groups of points; the common linkage methods are described below.
Single Linkage : In this method, distance between two groups is defined as distance between
points of the groups which are nearest to each other. This method leads to groups which are as
dispersed as possible.
Complete Linkage: Here, distance between two groups is defined as distance between points of
the groups which are farthest from each other. This method leads to groups which are as
compact as possible
Centroid Linkage : Here, the distance between the groups is defined as simple distance
between centroid of the groups. This is computationally much less expensive and one of the
most popular methods to calculate distances between the groups.
There are some other notable methods : Average Linkage , Ward's Method
- A B C D E
A X 2 7 9 10
B - X 1 8 6
C - - X 5 4
D - - - X 3
E - - - - X
The lower half of the matrix is left empty because it'll take same values . Given this matrix , we can
see that the closest points are B & C. Those will be the first to get combined . Once they are in the
group , new distance matrix will look like this :
- A BC D E
A X 7 9 10
BC - X 8 6
D - - X 3
E - - - X
Here, the distance between the group BC and the other points is calculated using the complete linkage method, i.e. the longest distance between members. Remember that the joining of points and groups will always happen on the basis of the minimum distance; complete linkage is just a way of calculating the distances between groups.
Now minimum distance is between points D and E, those will be the next to get combined .
- A BC DE
A X 7 10
BC - X 8
DE - - X
Next to be combined will be A and BC; this will leave us with only two groups, which will be the ultimate ones to be joined. If we visualize this process, it looks like this.
This is known as a dendrogram, and the process that we just carried out is called hierarchical clustering. In the process, if we stopped at height 1, we would have no groups, i.e. all the points would be separate. If we stopped at height 2, we would have 4 groups.
At the next stop, we would be left with just one group containing all the points. You can see that this process is computationally pretty complex, and this complexity increases rapidly as the number of data points goes up. It's not a very popular clustering algorithm for the same reason. We'll be discussing the alternative, K-means clustering, once we are through with the implementation of hierarchical clustering in Python using scikit-learn.
Note: In general, which variables should be considered during the clustering process is a business-process mandate. There are no statistical or mathematical measures to tell you that one variable is bad and another is not [unlike predictive modelling problems, where we had a definite target].
In this case our variables of interest are sulphates and alcohol. Apart from the imports, we'll standardize the data before starting the clustering process, as discussed above.
1 myfile=r'/Users/lalitsachan/Dropbox/0.0 Data/winequality-white.csv'
2
3 import pandas as pd
4 import matplotlib.pyplot as plt
5 import numpy as np
6 import seaborn as sns
7
8 from sklearn.preprocessing import scale
9 from sklearn.cluster import KMeans
10 from sklearn.metrics import silhouette_score
11
12 %matplotlib inline
13
14 wine=pd.read_csv(myfile,sep=";")
15 wine=wine[["sulphates","alcohol"]]
16 wine.head()
sulphates alcohol
0 0.45 8.8
1 0.49 9.5
2 0.44 10.1
3 0.40 9.9
4 0.40 9.9
We can see that these two variables differ a lot on the numeric scale; let's have a quick look at their comparative mean and standard deviation.
1 wine.agg(['mean','std'])
sulphates alcohol
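The standardization step itself is not shown in this excerpt; a sketch using the scale function already imported above:

wine_std = scale(wine)          # center each column with its mean and scale with its standard deviation
pd.DataFrame(wine_std, columns=wine.columns).agg(['mean', 'std'])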
We can see now that the means are practically zero and the standard deviations 1 for both the variables. We can now start with the clustering process. Before we go ahead, we need to figure out a way to decide how many groups/clusters is an optimal choice for the data we have been given.
For each observation $i$, let $a(i)$ be the average distance between $i$ and all other data points within the same cluster. We can interpret $a(i)$ as a measure of how well $i$ is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of the point $i$ to a cluster $c$ as the average of the distances from $i$ to all points in $c$.
Let $b(i)$ be the smallest such average distance of $i$ to the points of any other cluster of which $i$ is not a member. The cluster with this smallest average dissimilarity is said to be the "neighboring cluster" of $i$, because it is the next best fit cluster for the point $i$. The silhouette index of observation $i$ is then defined as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
which always lies between -1 and 1. We can clearly see that if $s(i)$ is negative (i.e. $a(i) > b(i)$), it implies that the observation does not really belong to the cluster to which it has been assigned. If $s(i)$ is close to 1 (i.e. $a(i) < b(i)$), the observation belongs to the cluster it has been assigned to. The silhouette score for a cluster can be calculated as the average of the silhouette index values for the individual observations in the cluster. The silhouette score for the entire data can be calculated as the average of the silhouette index values for all the data points.
We can carry out the clustering process with different numbers of clusters and eventually choose the one which has the highest silhouette score. Let's now go ahead with our clustering problem.
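The loop that determines the optimal cluster number is not shown in this excerpt; a sketch of how silhouette scores for different numbers of clusters might be computed with hierarchical clustering (the range of cluster counts tried is an assumption):

from sklearn.cluster import AgglomerativeClustering

for k in range(2, 8):
    hclus = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='ward')
    labels = hclus.fit_predict(wine_std)
    print(k, silhouette_score(wine_std, labels))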
We can see that 3 is the optimal cluster number . Let's go ahead with that and look at the results .
1 hclus=AgglomerativeClustering(n_clusters=3, affinity='euclidean',linkage='ward')
2 labels_hclus=hclus.fit_predict(wine_std)
3 wine['cluster_hclus']=labels_hclus
4 sns.lmplot(fit_reg=False,x='sulphates',y='alcohol',data=wine,hue='cluster_hclus')
Apart from visualization, you can also look at cluster-wise numerical summaries to see how the clusters/groups differ from each other. In fact, when you have a large number of variables, you won't be able to visualize clusters like this; numerical summaries are all that you'll have. Later in this module we'll see one way to visualize high dimensional data in a 2D space [known as t-SNE].
Note: The cluster numbers 0, 1, 2 as such are not ordinal. Only the membership in the clusters matters.
K-Means Clustering
Hierarchical clustering requires calculating the distance matrix between all pairs of points, which turns out to be a computationally expensive process. We need an algorithm which scales well with the number of observations in the data; K-means is such an algorithm. The algorithm needs the number of clusters as input. For example, in the figure given below, we can see that in (a) there are two groups present in the data. We'll see if K-means can group these points properly. Here is how the algorithm works:
(b) Algorithm randomly selects two points and considers them to be group centers
(c) Rest of the points in the data are assigned membership to the two groups on the basis of
distances from these group centers
This of course is not a proper grouping, and the process doesn't stop here; this was only the first iteration.
(d) New group centers are calculated on the basis of the groups formed previously.
(e) Points are re-assigned membership on the basis of these new group centers.
This process is carried out for multiple iterations, until either the observations' memberships stop changing or the new group centers are very close to the old ones. In very few iterations, as you can see, the observations' memberships gravitate towards the natural groups present in the data.
How do we, however, come up with the value of K? Since K-means is a computationally cheaper algorithm, we can run it for multiple values of K and choose as best that value of K for which the silhouette score comes out to be the highest, as sketched below.
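The output below comes from running K-means for several values of K and printing the silhouette score for each; that loop is not shown in this excerpt, so here is a sketch of it:

for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(wine_std)
    print(f"{k}, {silhouette_score(wine_std, kmeans.labels_)}")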
2, 0.3739606160278498
3, 0.4108674619159248
4, 0.38344260989210777
5, 0.33457661751536383
6, 0.3477567148831231
7, 0.3519661585228609
8, 0.35398240541646087
9, 0.3507188899680227
1 k = 3
2 kmeans = KMeans(n_clusters=k)
3 kmeans.fit(wine_std)
4 labels = kmeans.labels_
5 wine["cluster"]=labels
1 wine['cluster'].value_counts()
1 2281
0 1476
2 1141
1 sns.lmplot(fit_reg=False,x='sulphates',y='alcohol',data=wine,hue='cluster')
K-means does have some shortcomings though:
It doesn't care whether there really exist different groups in the data; it will forcefully separate the data into groups as per the value of K, even if no natural groups exist in the data.
It is susceptible to extreme values in the data and might give rise to improper grouping if such values are present in the data we are dealing with.
DBSCAN (Density-based spatial clustering of
applications with noise)
This particular algorithm addresses the issues associated with K-means discussed above. The basic ideas behind DBSCAN are:
Clusters are dense regions in the data space, separated by regions of lower object density
A cluster is defined as a maximal set of density-connected points
It discovers clusters of arbitrary shape
It has two parameters, $\epsilon$ and minPts: $\epsilon$ represents the neighborhood size (radius) and minPts represents the minimum number of points required for a point to be a core point. Let's see how DBSCAN actually does the clustering. It starts by randomly selecting a point in the data; the next action depends on what kind of point it is.
In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points within an $\epsilon$ radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
Noise points are the points which are assigned to no group and can be labelled as anomalies
Due to the way membership assignment progresses, as explained above, arbitrarily shaped groups can be captured
The algorithm doesn't need any input for the number of groups; it finds as many groups as are naturally present [this, however, can be varied by controlling the values of $\epsilon$ and minPts]
This source provides nice animation of the process to better understand how DBSCAN works :
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
DBSCAN in python
We'll work with an example which will also give you a comparison between the K-means and DBSCAN results.
1 mydata=pd.read_csv("/Users/lalitsachan/Dropbox/0.0 Data/moon_data.csv").iloc[:,1:]
2 mydata.head()
X Y
0 1.045255 0.332214
1 0.801944 -0.411547
2 -0.749356 0.775108
3 0.975674 0.191768
4 -0.512188 0.929997
The data is already standardized, so we can go for the clustering directly. Let's see how the data looks.
1 sns.lmplot('X','Y',data=mydata,fit_reg=False)
We can clearly see that there are two groups in the data; however, they are not spherical in nature. Let's see how K-means does on this, and then we'll look at the results from DBSCAN.
1 kmeans=KMeans(n_clusters=2)
2 kmeans.fit(mydata)
3 mydata["cluster"]=kmeans.labels_
4 sns.lmplot('X','Y',data=mydata,hue='cluster',fit_reg=False)
Lets remove the results from K-means and populate them with results from DBSCAN
1 del mydata['cluster']
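The DBSCAN fit itself is not shown in this excerpt; a sketch of it, where the eps and min_samples values are assumptions for illustration (the values used in the original notebook are not shown):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=20, metric='euclidean').fit(mydata)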
1 pd.Series(db.labels_).value_counts()
1 999
0 995
-1 6
-1 here marks the points which are not part of any group and can be labelled as anomalies. Try playing around with different values of $\epsilon$ (eps) and minPts (min_samples) and see the impact.
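The next example works with a grocery-spend dataset (columns such as Milk and Grocery); its loading and scaling code is not part of this excerpt, so the file name and preprocessing below are assumptions for illustration only:

groc = pd.read_csv('/Users/lalitsachan/Dropbox/0.0 Data/wholesale_customers.csv')[['Milk', 'Grocery']]   # hypothetical path and file name
groc_std = scale(groc)    # standardize before running DBSCAN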
1 sns.lmplot(x='Milk',y='Grocery',data=groc,fit_reg=False)
Let's try out different values of $\epsilon$ (eps) and see how many observations are labelled as outliers.
r=np.linspace(0.5,4,10)
for epsilon in r:
    db = DBSCAN(eps=epsilon, min_samples=20, metric='euclidean').fit(groc_std)
    labels = db.labels_
    #n_clust=len(set(labels))-1
    outlier=np.round((labels == -1).sum()/len(labels)*100,2)
    #print('Estimated number of clusters: %d', n_clust)
    print("For epsilon =", np.round(epsilon), ", percentage of outliers is: ", outlier)
This can be achieved by projecting the high dimensional data onto a lower dimensional subspace. Let's understand what we mean by that.
The figure here shows what we mean by projection. The red dots represent data in 3 dimensions. The grey plane represents a 2 dimensional subspace. If we project these points onto that plane, we'll have a 2 dimensional representation of the data. The basis of this subspace, however, is a linear combination of the 3 dimensional basis.
The dot product $w^T x$ represents the projection of a data point $x$ onto the direction $w$, where $w$ is the unit vector representing the spatial orientation of the plane onto which we are projecting the data. Now, there are infinitely many possible 2D planes in the example above. Which plane should we select to project our data onto, such that we get the lower dimensional representation with the least amount of loss in the variance of the data? The lower dimensional subspace is essentially defined by the unit vector $w$.
We want to choose $w$ such that the loss in variance after projection is as small as possible [that is, the difference or loss in variance/information after projection]. One of the things which makes our lives mathematically easy is that the data is centered before projection, which means the mean of each column is 0 and the variance of the projected data can be written as $\frac{1}{n}\sum_i (w^T x_i)^2 = w^T \Sigma w$, where $\Sigma$ is the variance-covariance matrix of the data.
Since the total variance of the given data is fixed, minimizing the loss is the same as maximizing the variance retained after projection. This makes our objective to maximize the expression $w^T \Sigma w$ subject to the constraint $w^T w = 1$ (we can divide by $n$, it being a constant, without changing the result of the optimization).
Combining the constraint with the objective using a Lagrange multiplier, the new objective becomes:
$$L(w, \lambda) = w^T \Sigma w - \lambda\,(w^T w - 1)$$
We need to maximize this w.r.t. $w$; differentiating w.r.t. $w$ and equating to zero, we get:
$$\Sigma w = \lambda w$$
If you recall a little bit of matrix algebra, this is the classic eigenvalue equation: $w$ here is nothing but the eigenvectors of the variance-covariance matrix of the given data and $\lambda$ are the eigenvalues.
The variance-covariance matrix is symmetric, so its eigenvectors are all orthogonal and distinct
These eigenvectors are known as principal components
These can be arranged in terms of the variance explained, along with the direction of each principal component
The number of principal components is the same as the number of variables; however, we can always ignore the ones which do not contribute much towards explaining the variance in the data, which eventually results in dimensionality reduction
1 import pandas as pd
2 import numpy as np
3 from sklearn.decomposition import PCA
4 from sklearn.preprocessing import scale
5
6 file=r'/Users/lalitsachan/Dropbox/PDS V3/Data/Existing Base.csv'
7 bd=pd.read_csv(file)
8
9 # selecting numeric columns which have redundancy in representation
10 bd=bd.select_dtypes(exclude=['object'])
11 bd.drop(['REF_NO','year_last_moved','Revenue Grid'],1,inplace=True)
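The correlation heat map being referred to here is not shown in this excerpt; it might be produced with something like:

import seaborn as sns

sns.heatmap(bd.corr())   # visualize pairwise correlations among the numeric columns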
You can see that there are many variables [15 in total in the data] which are highly correlated with each other. We'll first scale the data and then apply PCA with 15 components [the same as the number of variables in the data], to see how many principal components we can ignore on the basis of the variance explained.
1 X=bd.copy()
2 X = scale(X)
3 pca = PCA(n_components=15)
4 pca.fit(X)
1 var= np.round(pca.explained_variance_ratio_,3)
2 print(var)
0.458 0.107 0.084 0.068 0.055 0.046 0.045 0.041 0.033 0.031 0.027 0.005 0. 0. 0.
You can see that last few PCs have almost zero contribution towards explained variance. lets look at
cumulative explained variance as we increase number of PCs.
1 var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
2 print(var1)
45.78 56.46 64.84 71.61 77.13 81.74 86.22 90.36 93.7 96.84 99.57 100.02 100.02 100.02 100.02
You can see that with 10 or so PCs we have covered almost ~97% of the variance. This means that we can bring down the dimension of the data from 15 to 10 (reducing the data size by a third) without much loss of information.
1 pca = PCA(n_components=10)
2 pca.fit(X)
3 X1=pd.DataFrame(pca.transform(X))
4 X1.head()
We can use this new data with 10 columns instead of the original one with 15. Each individual column here is a linear combination of the 15 columns of the original data. Here is how you can get the loadings for any one of the new columns; let's look at the first one.
1 # pc1
2 loadings=pca.components_[0]
3 loadings
These numbers will make more sense alongside the variables they are the loadings for.
1 list(zip(bd.columns,loadings))
As a last check, to confirm the orthogonality [and in turn zero linear dependence] of the columns of this new data, let's look at the correlation heat map again.
You can see zero correlation among the new columns.
Step 1: In the high-dimensional space, create a probability distribution that dictates the
relationships between various neighboring points
Step 2: It then tries to recreate a low dimensional space that follows that probability distribution
as best as possible.
The 't' in t-SNE comes from the t-distribution, which is the distribution used in Step 2. The 'S' and 'N' ('stochastic' and 'neighbor') come from the fact that it uses a probability distribution across neighboring points.
Let's see it in practice. Remember that we said it is impossible to visualize the group separation that we get from clustering, because we are limited to 2D or at most 3D visualization. How about we make use of t-SNE here?
We'll use the same wine data but make use of all the variables and then visualize the resultant
clustering .
1 wine=pd.read_csv(r'/Users/lalitsachan/Dropbox/0.0 Data/winequality-
white.csv',sep=";")
2 X=scale(wine.iloc[:,:-1])
3 # ignoring the last column quality here
4 # using all other columns in the data
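The numbers below are silhouette scores for different values of K on the full scaled wine data; the loop that produces them is not shown in this excerpt, so here is a sketch of it:

for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    print(k, silhouette_score(X, kmeans.labels_))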
2 0.21447656060330086
3 0.14430589277155192
4 0.15898404552909834
5 0.14395682107234764
6 0.14588278206706412
7 0.12773278998393625
8 0.12828212706539913
9 0.12737121583698335
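The tsne object used below is not defined in this excerpt; a sketch of creating it (parameters other than the number of components are left at their defaults as an assumption):

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)   # 2 components so that the data can be plotted in 2D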
1 kmeans=KMeans(n_clusters=2)
2 kmeans.fit(X)
3 X_2d=tsne.fit_transform(X)
4 mydata=pd.DataFrame(X_2d,columns=['X','Y'])
5 mydata['cluster']=kmeans.labels_
This is how the data, when transformed into 2D, looks:
1 sns.lmplot('X','Y',data=mydata,fit_reg=False)
1 sns.lmplot('X','Y',hue='cluster',data=mydata,fit_reg=False)
We can see the separation of the groups is largely fine; there will always be some superficial overlap at the boundaries. If we had multiple groups, this would also help us see visually which groups are closer to each other relative to the others.
We'll conclude our discussion with a few not-so-obvious use cases of what we learnt today [apart from the explicitly stated ones like customer segmentation etc.]:
You can use cluster membership as a feature in a predictive modeling algorithm
You can build different models for different segments
You can use PCA not only to reduce dimensionality but also, at a conceptual level, to disentangle related information sets [variables]
Chapter 10 : Text Mining/ Text Analytics
In Text Mining, we use automated methods in order to understand the content present in text documents. It is used to transform unstructured textual data into structured data useful for further analysis. It identifies and extracts information which would otherwise remain buried in the text.
Machine learning usually requires well structured input features to work well, which is not the case when the data under consideration is textual. However, text mining can extract clean, structured data from the text, which is a requirement for most machine learning algorithms.
1.) Text categorization: e.g. categorizing emails into spam and not spam
3.) Sentiment analysis: to identify whether information in a document is positive, negative or neutral.
e.g. Companies monitor social media to detect and handle any negative comments on their products
or services
4.) POS or parts-of-speech tagging for a language: Each word in a document is tagged with its part of
speech like noun, verb, pronoun, etc. This information can be useful for further analysis.
We have just touched the tip of the iceberg; there are many more applications using text mining.
We will be working with the NLTK package. NLTK is a Python package that is free, opensource and
easy to use.
It assists us in doing the following tasks: tokenizing, part-of-speech tagging, stemming, sentiment
analysis, topic segmentation, and named entity recognition among others. We will be discussing
some of these tasks here.
NLTK mainly helps to preprocess written text and convert it into a format that can be fed to machine learning algorithms.
1 import nltk
2 import os
3 import pandas as pd
We can download NLTK data, i.e. the corpora (a corpus is a large and structured set of texts), and other data from NLTK as follows:
1 nltk.download()
For this discussion we will use Reuters data. We will understand how to gather data from multiple
files. This data can be downloaded from https://www.dropbox.com/sh/865xeu4xwbyo2yt/AACd0pQ
EbeOSfvNjiV7aur4Ka?dl=0
On observing the data, the first thing we notice is that the reuters_data folder consists of separate text files and each file contains a news article. We also observe that there are two categories of files - crude and money. On opening any file we can read its news article.
In order to start processing the data we need to first access the path where all these individual text
files are present:
1 path=r"/Users/anjal/Dropbox/0.0 Data/reuters_data/"
2 files=os.listdir(path)
Note: all the text data which we wish to analyze is not present in a single file.
If we wish to read the contents of a particular file, we can run the following code:
1 f=open(path+'training_money-fx_3593.txt','r',encoding='latin-1')
2 for line in f:
3 print(line)
4 f.close()
->Text Omitted<-
The code below goes through every line of a file, ignores any empty lines, i.e. lines not containing any text, and keeps adding the lines containing text to the text variable.
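That code is not included in this excerpt; a sketch of what it might look like for the single file opened above:

f = open(path+'training_money-fx_3593.txt', 'r', encoding='latin-1')
text = " ".join([line.strip() for line in f if line.strip() != ""])   # skip empty lines, join the rest into one string
f.close()
print(text)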
HUNGARY HOPES DEVALUATION WILL END TRADE DEFICIT National Bank of Hungary first vice-
president Janos Fekete said he hoped a planned eight pct devaluation of the forint will spur
exports and redress last year's severe trade deficit with the West. Fekete told Reuters in an
interview Hungary must achieve at least equilibrium on its hard currency trade. "It is useful to
have a devaluation," he said. "There is now a real push to our exports and a bit of a curb to our
imports." The official news agency MTI said today Hungary would devalue by eight pct and it
expected the new rates to be announced later today.
Now, lets consider all the files present in the 'reuters_data' folder. We want to read from all the files
present in this folder.
1 files[:10]
1 ['.ipynb_checkpoints',
2 '.RData',
3 '.Rhistory',
4 'training_crude_10011.txt',
5 'training_crude_10078.txt',
6 'training_crude_10080.txt',
7 'training_crude_10106.txt',
8 'training_crude_10168.txt',
9 'training_crude_10190.txt',
10 'training_crude_10192.txt']
On scrolling through the files, we observe some non text files i.e files without the .txt extension. We
do not wish to read from the non text files. Hence, in the code below, we ignore the files which do
not have a .txt extension.
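The reading loop referred to here is the same approach used with this dataset earlier in the book, reproduced as a sketch (note that path already ends with a '/'):

target = []
article_text = []
for file in files:
    if '.txt' not in file: continue           # ignore files without a .txt extension
    f = open(path+file, encoding='latin-1')
    article_text.append(" ".join([line.strip() for line in f if line.strip() != ""]))   # drop empty lines, one string per article
    if "crude" in file:
        target.append("crude")
    else:
        target.append("money")
    f.close()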
In the code above we read each file present in the reuters_data folder, ignored the files without the
.txt extension, removed the empty lines from the text and appended all the text in the 'article_text'
list. For each file we also assign the target variable 'crude' or 'money'.
We then create a dataframe with the two lists created: target and article_text.
1 mydata=pd.DataFrame({'target':target,'article_text':article_text})
1 mydata.head()
target article_text
We started with about 927 separate files; now the entire text data is present in a single dataframe
having 927 rows.
1 mydata.shape
(927, 2)
Once the data is converted to a single dataframe from individual files, we can now start exploring its
content.
Word clouds help in identifying patterns in the text that would be difficult to observe in a tabular
format. Higher the frequency of a word in the text, the more prominently it is displayed.
We first need to combine the text into bunches for which we want to see the wordclouds.
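The combining step and the word-cloud imports are not shown in this excerpt; a sketch of them (the wordcloud package is assumed to be installed):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_articles = " ".join(mydata['article_text'])   # every article joined into one long string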
1 all_articles[:1000]
'CANADA OIL EXPORTS RISE 20 PCT IN 1986 Canadian oil exports rose 20 pct in 1986 over the
previous year to 33.96 mln cubic meters, while oil imports soared 25.2 pct to 20.58 mln cubic
meters, Statistics Canada said. Production, meanwhile, was unchanged from the previous year
at 91.09 mln cubic feet. Natural gas exports plunged 19.4 pct to 21.09 billion cubic meters, while
Canadian sales slipped 4.1 pct to 48.09 billion cubic meters. The federal agency said that in
December oil production fell 4.0 pct to 7.73 mln cubic meters, while exports rose 5.2 pct to 2.84
mln cubic meters and imports rose 12.3 pct to 2.1 mln cubic meters. Natural gas exports fell
16.3 pct in the month 2.51 billion cubic meters and Canadian sales eased 10.2 pct to 5.25 billion
cubic meters. BP <BP> DOES NOT PLAN TO HIKE STANDARD <SRD> BID British Petroleum Co Plc
does not intend to raise the price of its planned 70 dlr per share offer for the publicly held 45
pct of Standard Oil Co, BP Managing Director David '
crude_articles=" ".join(mydata.loc[mydata['target']=='crude','article_text'])   # to create the wordcloud for crude articles
money_articles=" ".join(mydata.loc[mydata['target']=='money','article_text'])   # to create the wordcloud for money articles
The wordcloud can be generated by passing a text variable (i.e. all_articles, crude_articles or
money_articles) to the WordCloud() function as shown below.
1 wordcloud = WordCloud().generate(all_articles)
2 plt.imshow(wordcloud, interpolation='bilinear')
3 plt.axis("off")
On observing the wordcloud above, we see that the most common words are 'said', 'dollar', 'will',
'central', 'bank', 'pct' etc. The word 'said' is present most frequently. It gives us the idea that this word
may not contribute much to the differentiation between the articles belonging to the 'crude' or
'money' categories since it may be present frequently in both the categories individually as well. In
order to handle this we may include it as a part of stopwords (words to be removed during analysis).
1 wordcloud = WordCloud().generate(crude_articles)
2 plt.imshow(wordcloud, interpolation='bilinear')
3 plt.axis("off")
Here again we observe that the word 'said' is most prominent. However, we can also see that when
we are referring to 'crude articles' words such as oil, OPEC, crude, price etc are prominent.
1 wordcloud = WordCloud().generate(money_articles)
2 plt.imshow(wordcloud, interpolation='bilinear')
3 plt.axis("off")
When we refer to the 'money articles', we notice that the words prominent here are dollar, money,
market, exchange rate etc.
This visualization of text gives us an insight about the prominent words without having to see its
frequencies in a tabular format.
The process of breaking text paragraphs into smaller chunks like words or sentences is called
tokenization.
Sentence tokenizer breaks paragraphs into sentences and word tokenizer breaks paragraphs into
words.
We will see how word tokenizer works.
The following line of code tokenizes only the articles related to money into words.
1 tokens=word_tokenize(money_articles)
1 tokens[:20]
1 ['CANADA',
2 'OIL',
3 'EXPORTS',
4 'RISE',
5 '20',
6 'PCT',
7 'IN',
8 '1986',
9 'Canadian',
10 'oil',
11 'exports',
12 'rose',
13 '20',
14 'pct',
15 'in',
16 '1986',
17 'over',
18 'the',
19 'previous',
20 'year']
1 #help(nltk.Text)
The nltk.Text() function in the code below takes the tokens as its input and converts it to a data
structure that allows a variety of analysis on the text like counting, finding concordance, finding
similarity etc as shown below:
1 money_articles_Text=nltk.Text(tokens)
1 money_articles_Text
The concordance() function shows every occurrence of a given word along with some context.
1 money_articles_Text.concordance('bank')
14 Herstatt , managing director of the bank when it collapsed , was sentenced to
15 Six other people associated with the bank were jailed in 1983 . But Dattel was
16 have led him to take his own life . Bank of Japan buys dollars around 149.00
17 rs around 149.00 yen - Tokyo dealers Bank of Japan buys dollars around 149.00
18 DEFICIT FORECAST AT 700 MLN STG The Bank of England said it forecast a shorta
19 me 120 mln stg to the system today . Bank of France buying dollars for yen - b
20 ng dollars for yen - banking sources Bank of France buying dollars for yen - b
21 some bankers to question the Central Bank 's policy of pegging the guilder fir
22 ate policy . While agreeing with the Bank 's commitment to defend the guilder
23 der strongly , some bankers want the Bank to make more use of the range within
24 en , chairman of Amsterdam-Rotterdam Bank NV ( Amro ) said the Central Bank 's
25 am Bank NV ( Amro ) said the Central Bank 's policy was overcautious . `` I wo
26 `` I would like to suggest that the Bank use more freely the range given to t
Concordance allows us to see a word in context. e.g. we observe that 'bank' occurs in the context like
The _ of England, _ of France, _ of Japan and so on.
If we want to find words which occur in similar range of contexts, we use the similar() function.
In the code below, we find the words similar to the word 'dollar' in the articles related to money.
1 money_articles_Text.similar('dollar')
1 market currency yen bank system pound mark fed government ems
2 bundesbank agreement rate economy country budget year accord report
3 treasury
Similarly, in the code below we find the words similar to the word 'dollar' in the articles related to
crude.
1 tokens=word_tokenize(crude_articles)
2 crude_articles_Text=nltk.Text(tokens)
3 crude_articles_Text.similar('dollar')
We observe that similar() function gives different result for different text (articles related to crude
and money) for the same word 'dollar'.
The common_contexts() function allows us to examine the contexts shared by two or more words. In
our case we want to check the context shared by the words 'market' and 'government'.
1 money_articles_Text.common_contexts(['market','government'])
Referring to the output above, both the words 'market' and 'government' can be observed in the
same context (say for context 'the_expects' the text has occurrences like 'the market expects' as well
as 'the government expects').
Reference: https://www.nltk.org/book/ch01.html
So far we have talked about exploring text data. In order to make use of this unstructured data in
machine learning algorithms, we need to create features out of this textual information.
There are a number of ways in which we can extract features from textual content. The simplest
method to numerically represent textual data is the Bag of Words method. Using this method we get a
matrix in which each column corresponds to one of the unique words in the corpus, i.e. the vocabulary,
and each row corresponds to a document, in our case an article.
In one representation, the entry for a word in a document is 1 if the word is present in that document
and 0 otherwise. Another representation stores the number of times the word appears in the document
instead of just 1 or 0.
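As a rough illustration (a toy example, not the book's Reuters data), scikit-learn's CountVectorizer builds exactly this document-term matrix; passing binary=True gives the 1/0 representation, while the default gives the counts:

# Toy bag-of-words example: rows = documents, columns = vocabulary words.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["oil exports rose in 1986",
        "the bank bought dollars",
        "oil exports and the dollar"]

cv = CountVectorizer()              # pass binary=True for the 1/0 representation
bow = cv.fit_transform(docs)        # sparse document-term matrix

print(cv.get_feature_names_out())   # the vocabulary (get_feature_names() on older scikit-learn)
print(bow.toarray())                # word counts per document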
However, the most commonly used method is tf-idf, i.e. Term Frequency-Inverse Document Frequency.
Term Frequency: tf = (number of times a term t appears in a document) / (total number of terms in the
document)
Inverse Document Frequency: idf = log(N/n), where N is the total number of documents and n is the
number of documents in which the term t is present. The idf of a rare word is high, whereas the idf of
a frequent word is low, thus highlighting the words that are distinctive.
tf-idf = tf * idf
Reference: https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958
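To make the formula concrete, here is a small hand computation with toy numbers (not taken from the Reuters data): suppose a 100-word document contains the term 'oil' 5 times, and 'oil' appears in 10 of 1000 documents.

import math

tf  = 5 / 100               # term frequency: 'oil' appears 5 times in a 100-term document
idf = math.log(1000 / 10)   # inverse document frequency: 1000 documents, 'oil' present in 10
print(round(tf * idf, 4))   # 0.2303 (using the natural log)

Note that scikit-learn's TfidfVectorizer, used below, applies a smoothed and length-normalized variant of this formula, so its exact values will differ slightly from this hand calculation.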
We start by importing the TfidfVectorizer, which enables us to convert all text features, i.e. words, into
numerical values. We also define a helper function, split_into_lemmas(), which cleans each piece of
text before it is vectorized.
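The imports themselves are not shown in the listing below; a plausible setup is sketched here. The exact definitions of my_stop and lemma are assumptions, since the original notebook defines them earlier:

# Assumed setup for the code below (illustrative guesses, not the original notebook's code).
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer      # requires nltk.download('wordnet')

my_stop = stopwords.words('english')         # assumed: list of English stop words
lemma = WordNetLemmatizer()                  # assumed: the lemmatizer referred to as `lemma`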
def split_into_lemmas(message):
    message = message.lower()        # convert all text to lower case so that there is no distinction between words due to case
    words = word_tokenize(message)   # get individual words from paragraphs
    words_sans_stop = []
    for word in words:
        if word in my_stop:
            continue                 # ignore the word if it is in the list of stop words
        words_sans_stop.append(word) # keep the remaining words in this list
    return [lemma.lemmatize(word) for word in words_sans_stop]   # return all words in their base forms
The split_into_lemmas() function takes a body of text as its input, converts it to lower case and
separates out each word in the text, i.e. tokenizes it. Each word is then checked against our list of
stop words: if it is present there, the word is ignored, otherwise it is added to a list. Finally, these
words are lemmatized, i.e. converted to their base forms, e.g. organize, organizes and organizing are
all converted to organize.
Tf-idf vectorizer
TfidfVectorizer tokenizes the text and builds a vocabulary of words which are present in the text
body.
'analyzer = split_into_lemmas' sends the text to the split_into_lemmas() function defined above, so
every word is cleaned and lemmatized before the vocabulary is built.
The 'min_df = 20' argument keeps only those words which appear in at least 20 documents.
'max_df = 500' indicates that words which appear in more than 500 documents will not be
considered.
The 'stop_words = my_stop' argument takes the stop words defined earlier as its input, so that these
words are removed from the vocabulary. Common words like 'the', 'and', 'a' etc. would otherwise
become very prominent features, since their frequency is high even though they add little meaning to
the text, hence they need to be removed. (When a custom analyzer is supplied, scikit-learn does not
apply the stop_words argument itself; here the stop words are in any case already removed inside
split_into_lemmas().)
tfidf = TfidfVectorizer(analyzer=split_into_lemmas, min_df=20, max_df=500, stop_words=my_stop)
The fit() function is used to learn the vocabulary from the training text, and the transform() function is
used to encode each article's text as a vector. This encoded vector has the length of the whole
vocabulary, with values computed according to the tf-idf method discussed earlier.
1 tfidf.fit(mydata['article_text'])
1 tfidf_data=tfidf.transform(mydata['article_text'])
1 tfidf_data.shape
(927, 795)
tfidf_data has 927 rows (one per article) and 795 columns (one per word). These 795 words are not
necessarily all the words present in the text data; the number is affected by the stop words and the
minimum and maximum document-frequency limits we specified in the TfidfVectorizer, i.e.
min_df=20, max_df=500, stop_words=my_stop.
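If we want to see which 795 words survived these filters, the fitted vectorizer exposes its vocabulary. A quick check (assuming a recent scikit-learn; older versions use get_feature_names() instead):

vocab = tfidf.get_feature_names_out()   # words kept after the min_df/max_df/stop-word filtering
print(len(vocab))                       # 795
print(vocab[:10])                       # a few of the retained words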
1 tfidf_data
1 print(tfidf_data.shape)
2 print(type(tfidf_data))
3 print(tfidf_data.toarray())
1 (927, 795)
2 <class 'scipy.sparse.csr.csr_matrix'>
3 [[0. 0. 0. ... 0. 0. 0. ]
4 [0.14277346 0. 0. ... 0. 0. 0. ]
5 [0.11976991 0. 0.03532356 ... 0. 0. 0. ]
6 ...
7 [0.12536742 0. 0. ... 0. 0. 0. ]
8 [0.03915382 0.11118725 0. ... 0. 0. 0. ]
9 [0. 0. 0. ... 0. 0. 0. ]]
We observe that tfidf_data is a sparse matrix. Each column in the matrix above is a word from the
vocabulary and each row is an article; e.g. the tf-idf value computed for the first vocabulary word in
the second article is 0.14277346. A value of 0 indicates that the corresponding word is not present in
that article.
We can now use this tfidf_data as an input to a machine learning algorithm. We can break this data
into training and validation parts and proceed with our usual model building process.
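As a rough sketch of that next step (the target column name 'category' is an assumption about how the article labels are stored in mydata), the sparse tf-idf matrix can be fed directly to scikit-learn models:

# Illustrative only: 'category' is an assumed name for the label column in mydata.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_val, y_train, y_val = train_test_split(
    tfidf_data, mydata['category'], test_size=0.2, random_state=2)

model = LogisticRegression(max_iter=1000)   # works directly with the sparse tf-idf matrix
model.fit(X_train, y_train)
print(model.score(X_val, y_val))            # accuracy on the validation part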
We can also perform sentiment analysis on text data using NLTK.
Below are a number of sentences for which the sentiment has been analyzed. The
polarity_scores() function returns scores for the different sentiments, using which we
determine whether a sentence is positive, negative or neutral.
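The scores shown below come from the polarity_scores() method of NLTK's VADER sentiment analyzer; a minimal sketch of the call (assuming the 'vader_lexicon' data has been downloaded) looks like this:

# Minimal VADER sentiment example (illustrative).
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for s in ["VADER is smart, handsome, and funny.", "Today SUX!"]:
    print(s, sia.polarity_scores(s))   # dict with 'neg', 'neu', 'pos' and 'compound' scores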
1 VADER is smart, handsome, and funny. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746,
'compound': 0.8316}
2 VADER is not smart, handsome, nor funny. {'neg': 0.646, 'neu': 0.354, 'pos': 0.0,
'compound': -0.7424}
3 VADER is smart, handsome, and funny! {'neg': 0.0, 'neu': 0.248, 'pos': 0.752,
'compound': 0.8439}
4 VADER is very smart, handsome, and funny. {'neg': 0.0, 'neu': 0.299, 'pos': 0.701,
'compound': 0.8545}
5 VADER is VERY SMART, handsome, and FUNNY. {'neg': 0.0, 'neu': 0.246, 'pos': 0.754,
'compound': 0.9227}
6 VADER is VERY SMART, handsome, and FUNNY!!! {'neg': 0.0, 'neu': 0.233, 'pos':
0.767, 'compound': 0.9342}
7 VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!! {'neg': 0.0, 'neu':
0.294, 'pos': 0.706, 'compound': 0.9469}
8 The book was good. {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
9 The book was kind of good. {'neg': 0.0, 'neu': 0.657, 'pos': 0.343, 'compound':
0.3832}
10 The plot was good, but the characters are uncompelling and
11 the dialog is not great. {'neg': 0.327, 'neu': 0.579, 'pos': 0.094,
'compound': -0.7042}
12 At least it isn't a horrible book. {'neg': 0.0, 'neu': 0.637, 'pos': 0.363,
'compound': 0.431}
13 Make sure you :) or :D today! {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound':
0.8633}
14 Today SUX! {'neg': 0.779, 'neu': 0.221, 'pos': 0.0, 'compound': -0.5461}
15 Today only kinda sux! But I'll get by, lol {'neg': 0.179, 'neu': 0.569, 'pos':
0.251, 'compound': 0.2228}
For the sentence "The book was good." we get a higher value for 'pos' than for the sentence "The
book was kind of good".
For the sentence "At least it isn't a horrible book.", despite the word horrible being present in it, this
algorithm correctly gives a higher score to neutral by considering the word 'isn't' in the sentence.
The algorithm also considers exclamation marks, smileys etc to assess the sentiment.
1 chinese_blob = TextBlob(u"??????????????????")
1 chinese_blob.translate(from_lang="zh-CN", to='en')
The translate() function is used to translate the text from Chinese to English. (These translation
features call the Google Translate web service, so they need an internet connection; in recent
TextBlob releases they have been deprecated and may no longer work.)
1 b.detect_language()
'ar'
The detect_language() function is used to find the language of the text held in the TextBlob object;
here the output 'ar' indicates that the text in b (a TextBlob created earlier) was detected as Arabic.
POS Tagging
In Parts of Speech (POS) tagging, each word in a text is assigned its corresponding part of speech,
like noun, verb, pronoun etc.
POS tags can, among other things, be used to create features from text data (a small sketch of this
follows the tagging example below).
source : https://stackoverflow.com/questions/1833252/java-stanford-nlp-part-of-speech-labels
Some of the tags from that list, with example words for each:
CC: conjunction, coordinating
and both but either for less minus neither nor or plus so therefore times versus whether yet
DT: determiner
all an another any both each either every half many much neither no some such that the these this those
MD: modal auxiliary
can cannot could couldn't dare may might must need ought shall should shouldn't will would
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Conchita Oceanside Escobar Kreisler Sawyer Yvette Ervin Darryl Shannon Liverpool ...
PDT: pre-determiner
PRP: pronoun, personal
hers herself him himself it itself me myself one oneself ours ourselves she theirs them themselves they us
RB: adverb
RP: particle
aboard about across along apart around aside at away back before behind by down for forth from in into just later off on out over through under unto up upon with
SYM: symbol
% & ' '' ) * + < = > @
TO: "to" as preposition or infinitive marker
to
UH: interjection
Goodbye Goody Gosh Wow Jeepers Hey Oops amen huh howdy uh shucks heck anyways honey golly man baby hush ...
VB: verb, base form
ask assemble assess assign assume atone avoid bake begin behold believe bend benefit beware bless boil boost brace break bring broil brush build ...
VBD: verb, past tense
dipped pleaded swiped soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted speculated wore appreciated contemplated ...
VBP: verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...
WDT: WH-determiner
that what whatever which whichever
WP: WH-pronoun
WP$: WH-pronoun, possessive
whose
WRB: Wh-adverb
how however whence whenever where whereby wherever wherein whereof why
1 sentence = "We hope you love your sleepycat mattress. Welcom to the sleepycat
community."
We use the pos_tag() function to tag each token of the sentence with a part of speech as described
above; e.g. 'We' is tagged as 'PRP', which is a personal pronoun.
1 nltk.pos_tag(nltk.word_tokenize(sentence))
1 [('We', 'PRP'),
2 ('hope', 'VBP'),
3 ('you', 'PRP'),
4 ('love', 'VB'),
5 ('your', 'PRP$'),
6 ('sleepycat', 'NN'),
7 ('mattress', 'NN'),
8 ('.', '.'),
9 ('Welcom', 'NNP'),
10 ('to', 'TO'),
11 ('the', 'DT'),
12 ('sleepycat', 'NN'),
13 ('community', 'NN'),
14 ('.', '.')]
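To connect this back to feature creation, one simple (purely illustrative) approach is to count how many tokens of each coarse tag a document contains, e.g. the number of nouns or verbs per article. The helper below is an assumed sketch, not the book's code, and it needs the 'averaged_perceptron_tagger' NLTK data for pos_tag():

# Illustrative sketch: turn POS tags into simple numeric count features.
from collections import Counter
import nltk

def pos_counts(text):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tags)
    return {'noun_count': sum(v for t, v in counts.items() if t.startswith('NN')),
            'verb_count': sum(v for t, v in counts.items() if t.startswith('VB'))}

print(pos_counts(sentence))   # e.g. {'noun_count': 5, 'verb_count': 2} for the sentence above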
Please mail details of any errors and suggestions here: [email protected], with subject [Python ML
book errata].