Data Handling With Python
In this module we'll discuss data handling with Python. The discussion will be built largely around two
packages: numpy and pandas. Numpy is the original Python package designed to work with
multidimensional arrays, which eventually enables us to work with data files. Pandas is a high level
package written on top of numpy; it makes the syntax much easier so that we can focus on the
logic of data handling rather than getting bogged down in the increasingly complex syntax
of numpy. Numpy still comes with a lot of functions which we'll be using for data manipulation.
Since pandas is written on top of numpy, it's good to know how numpy works in general to
understand the rationale behind many of the syntactical choices of pandas. Let's begin the discussion with
numpy.
Numpy
Through numpy we will learn to create and handle arrays. Arrays set a background for handling
columns in our datasets when we eventually move to pandas dataframes.
We will cover the following topics in Numpy:
creating nd arrays
subsetting with indices and conditions
comparison with np and math functions [np.sqrt , log etc ] and special numpy functions
For this course, we will consider only two dimensional arrays, though technically, we can create
arrays with more than two dimensions.
We start with importing the package numpy giving it the alias np.
1 import numpy as np
We start with creating a 2 dimensional array and assign it to the variable 'b'. It is simply a list of lists.
1 b = np.array([[3,20,99],[-13,4.5,26],[0,-1,20],[5,78,-19]])
2 b
We have passed 4 lists and each of the lists contains 3 elements. This makes 'b' a 2 dimensional
array.
We can also determine the size of an array by using its shape attribute.
1 b.shape
(4, 3)
Each dimension in a numpy array is referred to by the argument 'axis'. 2 dimensional arrays have
two axes, namely 0 and 1. Since there are two axes here, we will need to pass two indices when
accessing the values in the numpy array. Numbering along both the axes starts with a 0.
1 b
Assuming that we want to access the value -1 from the array 'b', we will need to access it with both
its indices along the two axes.
1 b[2,1]
-1.0
In order to access -1, we first pass index 2, which refers to the third list, and then index 1, which
refers to the second element in that list. Axis 0 refers to the lists themselves and axis 1 refers to the
elements within each list. The first index is the index of the list where the element is (i.e. 2) and the
second index is its position within that list (i.e. 1).
Indexing and slicing here works just like it did in lists, only difference being that here we are
considering 2 dimensions.
1 print(b)
2 b[:,1]
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
This statement gives us the second element from all the lists.
1 b[1,:]
The above statement gives us all the elements from the second list.
By default, we can access all the elements of a list by providing a single index as well. The above
code can also be written as:
1 b[1]
We can access multiple elements of a 2 dimensional array by passing multiple indices as well.
1 print(b)
2 b[[0,1,1],[1,2,1]]
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
Here, values are returned by pairing of indices i.e. (0,1), (1,2) and (1,1). We will get the first element
when we run b[0,1] (i.e. the first list and second element within that list); the second element when
we run b[1,2] (i.e. the second list and the third element within that list) and the third element when
we run b[1,1] (i.e. the second list and the second element within that list). The three values returned
in the array above can also be obtained by the three print statements written below:
1 print(b[0,1])
2 print(b[1,2])
3 print(b[1,1])
20.0
26.0
4.5
This way of accessing the index can be used for modification as well e.g. updating the values of
those indices.
1 print(b)
2 b[[0,1,1],[1,2,1]]=[-10,-20,-30]
3 print(b)
[[ 3. 20. 99. ]
[-13. 4.5 26. ]
[ 0. -1. 20. ]
[ 5. 78. -19. ]]
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
Here you can see that for each of the indices accessed, we updated the corresponding values i.e. the
values present for the indices (0,1), (1,2) and (1,1) were updated. In other words, 20.0, 26.0 and 4.5
were replaced with -10, -20 and -30 respectively.
1 b
1 b>0
On applying a condition on the array 'b', we get an array with Boolean values; True where the
condition was met, False otherwise. We can use these Boolean values, obtained through using
conditions, for subsetting the array.
1 b[b>0]
The above statement returns all the elements from the array 'b' which are positive.
Let's say we now want all the positive elements from the third list. Then we need to run the following
code:
1 print(b)
2 b[2]>0
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
When we write b[2]>0, it returns a logical array, returning True wherever the list's value is positive
and False otherwise.
Subsetting the list in the following way, using the condition b[2]>0, will return the actual positive
value.
1 b[2,b[2]>0]
array([20.])
Now, what if we want the values from all the lists only at those indices where the values in the third
list are either 0 or positive?
1 print(b)
2 print(b[2]>=0)
3 print(b[:,b[2]>=0])
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[[ 3. 99.]
[-13. -20.]
[ 0. 20.]
[ 5. -19.]]
For the statement b[:,b[2]>=0], the ':' sign indicates that we are referring to all the lists and the
condition 'b[2]>=0' would ensure that we will get the corresponding elements from all the lists which
satisfy the condition that the third list is either 0 or positive. In other words, 'b[2]>=0' returns [True,
False, True] which will enable us to get the first and the third values from all the lists.
Now lets consider the following scenario, where we want to apply the condition on the third element
of each list and then apply the condition across all the elements of the lists:
1 b[:,2]>0
Here, we are checking whether the third element in each list is positive or not. b[:,2]>0 returns a
logical array. Note: it will have as many elements as the number of lists.
1 print(b)
2 print(b[:,2])
3 print(b[:,2]>0)
4 b[b[:,2]>0,:]
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[ True False True False]
Across the lists, it has extracted those values which correspond to the logical array. Using the
statement print(b[:,2]>0), we see that only 99. and 20. are positive, i.e. the third element from each
of the first and third lists are positive and hence True. On passing this condition to the array 'b',
b[b[:,2]>0,:], we get all those lists wherever the condition evaluated to True i.e. the first and the third
lists.
The idea of using numpy is that it allows us to apply functions on multiple values across a full
dimension instead of on single values. The math package, on the other hand, works on scalars i.e.
single values.
As an example, let's say we wanted to replace the entire 2nd list (index = 1) with its exponential values.
1 import math as m
The function exp in the math package returns the exponential value of the number passed as
argument.
1 x=-80
2 m.exp(x)
1.8048513878454153e-35
However, when we pass an array to this function instead of a single scalar value, we get an error.
1 b[1]
1 b[1]=m.exp(b[1])
TypeError: only size-1 arrays can be converted to Python scalars
Basically, the math package converts its inputs to scalars, but since b[1] is an array of multiple
elements, it gives an error.
We will need to use the corresponding numpy function, np.exp(), which can compute exponential
values for entire arrays.
The following code will return the exponential values of the second list only.
1 print(b)
2 b[1]=np.exp(b[1])
3 print(b)
[[ 3. -10. 99.]
[-13. -30. -20.]
[ 0. -1. 20.]
[ 5. 78. -19.]]
[[ 3.00000000e+00 -1.00000000e+01 9.90000000e+01]
[ 2.26032941e-06 9.35762297e-14 2.06115362e-09]
[ 0.00000000e+00 -1.00000000e+00 2.00000000e+01]
[ 5.00000000e+00 7.80000000e+01 -1.90000000e+01]]
There are multiple such functions available in numpy. We can type 'np.' and press the 'tab' key to see
the list of such functions.
All the functions present in the math package will be present in numpy package as well.
Reiterating the advantage of working with numpy over the math package: numpy enables us to
work with complete arrays, so we do not need to write a for loop to apply a function across the
array.
axis argument
To understand the axis argument better, we will now explore the 'sum()' function which collapses the
array.
1 np.sum(b)
175.00000226239064
Instead of summing the entire array 'b', we can sum across the lists i.e. axis = 0.
1 print(b)
2 np.sum(b,axis=0)
If we want to sum all the elements of each list, then we will refer to axis = 1
1 np.sum(b,axis=1)
axis=0 here corresponds to elements across the lists , axis=1 corresponds to within the list elements.
Note: Pandas dataframes, which in a way are 2 dimensional numpy arrays, have each list in a numpy
array correspond to a column in pandas dataframe. In a pandas dataframe, axis=0 would refer to
rows and axis=1 would refer to columns.
Now we will go through some commonly used numpy functions. We will use the rarely used
functions as and when we come across them.
The commonly used functions help in creating special kinds of numpy arrays.
arange()
1 np.arange(0,6)
array([0, 1, 2, 3, 4, 5])
The arange() function returns an array starting from 0 until (6-1) i.e. 5.
1 np.arange(2,8)
array([2, 3, 4, 5, 6, 7])
We can also control the starting and ending of an arange array. The above arange function starts
from 2 and ends with (8-1) i.e. 7, incrementing by 1.
The arange function is used for creating a sequence of integers with different starting and ending
points having an increment of 1.
linspace()
To create a more customized sequence we can use the linspace() function. The argument num gives
the number of elements in sequence. The elements in the sequence will be equally spaced.
1 np.linspace(start=2,stop=10,num=15)
random.randint()
The randint() function from numpy's random module generates random integers between a 'low'
(inclusive) and a 'high' (exclusive) bound.
1 np.random.randint(high=10,low=1,size=(2,3))
array([[4, 5, 8],
[1, 3, 4]])
The above code creates a random array of size (2,3) i.e. two lists having three elements each. These
random elements are chosen from the numbers 1 to 9; the 'high' value 10 is excluded.
random.random()
We can also create an array of random numbers using the random() function from the random
package.
1 np.random.random(size=(3,4))
The above random() function creates an array of size (3,4) where the elements are real numbers
between 0 to 1.
random.choice()
1 x = np.arange(0, 10)
2 x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1 np.random.choice(x,6)
array([9, 0, 4, 3, 8, 5])
random.choice() functions helps us to select 6 random numbers from the input array x. Every time
we run this code, the result will be different.
We can see that, at times, the function ends up picking up the same element twice. If we want to
avoid that i.e. get each number only once or in other words, get the elements without replacement;
then we need to set the argument 'replace' as False which is True by default.
1 np.random.choice(x,6,replace=False)
array([4, 0, 7, 1, 8, 6])
Now when we run the above code again, we will get different values, but we will not see any number
more than once.
1 y = np.random.choice(['a','b'], 6)
2 print(y)
1 y = np.random.choice(['a','b'], 1000)
The code above samples 'a' and 'b' 1000 times. If we now take the unique values from this array,
they will be 'a' and 'b', as shown by the code below. The return_counts argument gives the number
of times each element is present in the array created.
1 np.unique(y, return_counts=True)
By default, both 'a' and 'b' get picked up with equal probability in the random sample. This does not
mean that the individual values are not random; but the overall percentage of 'a' and 'b' remains
almost the same.
However, if we want the two values according to a specific proportion, we can use the argument 'p'
in the random.choice() function.
1 y=np.random.choice(['a','b'],1000,p=[0.8,0.2])
1 np.unique(y, return_counts=True)
Now we can observe that the value 'a' is present approximately 80% of the time and the value 'b'
appears around 20% of the time. The individual values are still random, though overall 'a' will appear
roughly 80% of the time as specified and 'b' roughly 20% of the time. Since the underlying process is
inherently random, these proportions will be close to 80% and 20% respectively but may not be
exactly the same. In fact, if we draw samples of smaller sizes, the difference could be quite wide, as
shown in the code below.
1 y=np.random.choice(['a','b'],10,p=[0.8,0.2])
2 np.unique(y, return_counts=True)
Here we sample only 10 values in the proportion 8:2. As you repeat this sampling process, at times
the proportion may match, but many times the difference will be big. As we increase the size of the
sample, the number of samples drawn for each element will be closer to the proportion specified.
sort()
1 x=np.random.randint(high=100,low=12,size=(15,))
2 print(x)
[89 60 31 82 97 92 14 96 75 12 36 37 25 27 62]
1 x.sort()
1 print(x)
[12 14 25 27 31 36 37 60 62 75 82 89 92 96 97]
1 x=np.random.randint(high=100,low=12,size=(4,6))
1 x
This array is 2 dimensional containing 4 lists and each list has 6 elements.
If we use the sort function directly, the correspondence between the elements is broken i.e. each
individual list is sorted independent of the other lists. We may not want this. The first elements in
each list belong together, so do the second and so on; but after sorting this correspondence is
broken.
1 np.sort(x)
argsort()
For maintaining order along either of the axes, we can extract the indices of the sorted values and
reorder the original array with these indices. Let's see the code below:
1 print(x)
2 x[:,2]
[[61 60 78 20 56 50]
[27 56 88 69 40 26]
[35 83 40 17 74 67]
[33 78 25 19 53 12]]
This returns the third element from each list of the 2 dimensional array x.
Lets say we want to sort the 2 dimensional array x by these values and the other values should move
with them maintaining the correspondence. This is where we will use argsort() function.
1 print(x[:,2])
2 x[:,2].argsort()
[78 88 40 25]
Instead of sorting the array, argsort() returns the indices of the elements after they are sorted.
The value with index 3 i.e. 25 should appear first, the value with index 2 i.e. 40 should appear next
and so on.
We can now pass these indices to arrange all the lists according to them. We will observe that the
correspondence does not break.
1 x[x[:,2].argsort(),:]
All the lists have been arranged according to order given by x[:,2].argsort() for the third element
across the lists.
1 x=np.random.randint(high=100,low=12,size=(15,))
2 x
array([17, 77, 49, 36, 39, 27, 63, 99, 94, 22, 55, 66, 93, 32, 16])
1 x.max()
99
The max() function will simply return the maximum value from the array.
1 x.argmax()
The argmax() function on the other hand will simply give the index of the maximum value i.e. the
maximum number 99 lies at the index 7.
Pandas
In this section we will start with the python package named pandas which is primarily used for
handling datasets in python.
We will cover the following topics:
1 import pandas as pd
2 import numpy as np
3 import random
There are two ways of creating dataframes:
1. From lists
2. From dictionary
We will start with creating some lists and then making a dataframe using these lists.
1 age=np.random.randint(low=16,high=80,size=[20,])
2 city=np.random.choice(['Mumbai','Delhi','Chennai','Kolkata'],20)
3 default=np.random.choice([0,1],20)
We can zip these lists to convert them into a single list of tuples. Each tuple in the list will correspond
to a row in the dataframe.
1 mydata=list(zip(age,city,default))
2 mydata
Each of the tuples come from zipping the elements in each of the lists (age, city and default) that we
created earlier.
Note: You may have different values when you run this code since we are randomly generating the
lists using the random package.
We can then put this list of tuples in a dataframe simply by using the pd.DataFrame function.
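A minimal sketch of the creation step, assuming the column names are chosen to match the list names:

df = pd.DataFrame(mydata, columns=['age', 'city', 'default'])  # each tuple becomes a row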
1 df.head() # we are using head() function which displays only the first 5 rows.
age city default
0 33 Mumbai 0
1 71 Kolkata 1
2 28 Mumbai 1
3 46 Mumbai 1
4 22 Delhi 1
As you can observe, this is a simple dataframe with 3 columns and 20 rows, having the three lists:
age, city and default as columns. The column names could have been different too, they do not have
to necessarily match the list names.
Another way of creating dataframes is using a dictionary. Here the column names come from the
keys of the dictionary. The keys ("age", "city" and "default") will be taken as column names and the
lists (age, city and default) will contain the values themselves.
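A minimal sketch of the dictionary-based creation, assuming the same lists as before:

df = pd.DataFrame({'age': age, 'city': city, 'default': default})  # keys become column names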
1 df.head() # we are using head() function which displays only the first 5 rows.
age city default
0 33 Mumbai 0
1 71 Kolkata 1
2 28 Mumbai 1
3 46 Mumbai 1
4 22 Delhi 1
In both the cases i.e. creating the dataframe using list or dictionary, the resultant dataframe is the
same. The process of creating them is different but there is no difference in the resulting
dataframes.
Next, let's read a dataset from a .csv file. We first create a string containing the path to the file.
1 file=r'loans data.csv'
Here, 'r' is added at the beginning of the path. This ensures that the file path is read as a raw string,
so that special character combinations are not interpreted with their special meaning by Python;
e.g. \n means newline, and without the leading 'r' such sequences in the path could be misread or
lead to Unicode errors. Sometimes the path will work without putting r at the beginning, but it is a
safer choice, so make it a habit to always use it.
1 ld = pd.read_csv(file)
The pandas function read_csv() reads the file present in the path given by the argument 'file'.
1 ld.head()
2 # display of data in pdf will be truncated on the right hand side
The head() function of the pandas dataframe created, by default, returns the top 5 rows of the
dataframe. If we wish to see more or less rows, for instance 10 rows, then we will pass the number
as an argument to the head() function.
1 ld.head(10)
2 # display of data in pdf will be truncated on the right hand side
We can get the column names by using the 'columns' attribute of the pandas dataframe.
1 ld.columns
If we want to see the type of these columns, we can use the attribute 'dtypes' of the pandas
dataframe.
1 ld.dtypes
1 ID float64
2 Amount.Requested object
3 Amount.Funded.By.Investors object
4 Interest.Rate object
5 Loan.Length object
6 Loan.Purpose object
7 Debt.To.Income.Ratio object
8 State object
9 Home.Ownership object
10 Monthly.Income float64
11 FICO.Range object
12 Open.CREDIT.Lines object
13 Revolving.CREDIT.Balance object
14 Inquiries.in.the.Last.6.Months float64
15 Employment.Length object
16 dtype: object
The float64 datatype refers to numeric columns and object datatype refers to categorical columns.
If we want a concise summary of the dataframe including information about null values, we use the
info() function of the pandas dataframe.
1 ld.info()
1 <class 'pandas.core.frame.DataFrame'>
2 RangeIndex: 2500 entries, 0 to 2499
3 Data columns (total 15 columns):
4 ID 2499 non-null float64
5 Amount.Requested 2499 non-null object
6 Amount.Funded.By.Investors 2499 non-null object
7 Interest.Rate 2500 non-null object
8 Loan.Length 2499 non-null object
9 Loan.Purpose 2499 non-null object
10 Debt.To.Income.Ratio 2499 non-null object
11 State 2499 non-null object
12 Home.Ownership 2499 non-null object
13 Monthly.Income 2497 non-null float64
14 FICO.Range 2500 non-null object
15 Open.CREDIT.Lines 2496 non-null object
16 Revolving.CREDIT.Balance 2497 non-null object
17 Inquiries.in.the.Last.6.Months 2497 non-null float64
18 Employment.Length 2422 non-null object
19 dtypes: float64(3), object(12)
20 memory usage: 293.0+ KB
If we want to get the dimensions i.e. numbers of rows and columns of the data, we can use the
attribute 'shape'.
1 ld.shape
(2500, 15)
1 ld1=ld.iloc[3:7,1:5]
2 ld1
Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length
'iloc' refers to subsetting the dataframe by position. Here we have extracted the rows from 3rd to the
6th (7-1) position and columns from 1st to 4th (5-1) position.
To understand this further, we will further subset the 'ld1' dataframe. It currently has 4 rows and 4
columns. The indexes (3, 4, 5 and 6) come from the original dataframe. Lets subset the 'ld1'
dataframe further.
1 ld1.iloc[2:4,1:3]
Amount.Funded.By.Investors Interest.Rate
5 6000 15.31%
6 10000 7.90%
You can see here that the positions are relative to the current dataframe 'ld1' and not the original
dataframe 'ld'. Hence we end up with the 3rd and 4th rows along with 2nd and 3rd columns of the
new dataframe 'ld1' and not the original dataframe 'ld'.
Generally, we do not subset dataframes by position. We normally subset the dataframes using
conditions or column names.
Lets say, we want to subset the dataframe and get only those rows for which the 'Home.Ownership'
is of the type 'MORTGAGE' and the 'Monthly.Income' is above 5000.
Note: When we combine multiple conditions, we have to enclose each of them in parentheses, else
the results will not be as expected.
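A minimal sketch of this subsetting, using exactly the two conditions described (the .head() call only limits the display):

ld[(ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000)].head()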
On observing the results, you will notice that for each of the rows both the conditions will be
satisfied i.e. 'Home.Ownership' will be 'MORTGAGE' and the 'Monthly.Income' will be greater than
5000.
In case we want to access a single columns data only, then we simply have to pass the column name
in square brackets as follows:
1 ld['Home.Ownership'].head()
1 0 MORTGAGE
2 1 MORTGAGE
3 2 MORTGAGE
4 3 MORTGAGE
5 4 RENT
6 Name: Home.Ownership, dtype: object
However, if we want to access multiple columns, then the names need to be passed as a list. For
instance, if we wanted to extract both 'Home.Ownership' and 'Monthly.Income', we would need to
pass it as a list, as follows:
1 ld[['Home.Ownership','Monthly.Income']].head()
2 # note the double square brackets used to subset the dataframe using multiple columns
Home.Ownership Monthly.Income
0 MORTGAGE 6541.67
1 MORTGAGE 4583.33
2 MORTGAGE 11500.00
3 MORTGAGE 3833.33
4 RENT 3195.00
If we intend to use both, condition as well as column names, we will need to use the .loc with the
pandas dataframe name.
Observing the code below, we subset the dataframe using conditions and columns both. We are
subsetting the rows, using the condition '(ld['Home.Ownership']=='MORTGAGE') &
(ld['Monthly.Income']>5000)' and we extract the 'Home.Ownership' and 'Monthly.Income' columns.
Here, both the conditions should be met for an observation to appear in the output, e.g. we can see
in the first row of the resulting dataframe that 'Home.Ownership' is 'MORTGAGE' and the
'Monthly.Income' is more than 5000. If either of the conditions is false, we will not see that
observation in the resulting dataframe.
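A minimal sketch of the .loc call being described (exact call assumed):

ld.loc[(ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000),
       ['Home.Ownership','Monthly.Income']].head()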
Home.Ownership Monthly.Income
0 MORTGAGE 6541.67
2 MORTGAGE 11500.00
7 MORTGAGE 13863.42
12 MORTGAGE 14166.67
20 MORTGAGE 6666.67
The resulting dataframe has only 2 columns and 686 rows. The rows correspond to the result
obtained when the condition (ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000) is
applied to the 'ld' dataframe.
What if we wanted to subset only those rows which did not satisfy a condition i.e. we want to negate
a condition. In order to do this, we can put a '~' sign before the condition.
In the following code, we will get all the observations that do not satisfy the condition
((ld['Home.Ownership']=='MORTGAGE') & (ld['Monthly.Income']>5000)). An observation appears in
the output whenever at least one of the two conditions fails, because negating a combined condition
that evaluated to False gives True. e.g. considering the first row, the 'Monthly.Income' is less than
5000, so the combined condition evaluates to False even though 'Home.Ownership' is 'MORTGAGE';
negating it gives True and hence we see this observation in the output.
Home.Ownership Monthly.Income
1 MORTGAGE 4583.33
3 MORTGAGE 3833.33
4 RENT 3195.00
5 OWN 4891.67
6 RENT 2916.67
In short, the '~' sign gives us rest of the observations by negating the condition.
To drop columns on the basis of column names, we can use the in-built drop() function.
In order to drop the columns, we pass the list of columns to be dropped along with specifying the
axis argument as 1 (since we are dropping columns).
The following code will return all the columns except 'Home.Ownership' and 'Monthly.Income'.
1 ld.drop(['Home.Ownership','Monthly.Income'],axis=1).head()
2 # display of data in pdf will be truncated on the right hand side
However, when we check the columns of the 'ld' dataframe now, the two columns which we
presumably deleted, are still there.
1 ld.columns
What happens is that the ld.drop() function returns a new dataframe as its output; it does not make
any inplace changes to 'ld'.
So, in case we wish to delete the columns from the original dataframe, we can do two things:
We can update the original dataframe by equating the output to the original dataframe as follows:
1 ld=ld.drop(['Home.Ownership','Monthly.Income'],axis=1)
2 ld.columns
The second way to update the original dataframe is to set the 'inplace' argument of the drop()
function to True
1 ld.drop(['State'],axis=1,inplace=True)
2 ld.columns
Now you will notice that the deleted columns are not present in the original dataframe 'ld' anymore.
We need to be careful when using the inplace=True option; in that case the drop() function doesn't
return anything. So we should not assign ld.drop(['State'],axis=1,inplace=True) back to the original
dataframe. If we do, 'ld' will end up as a None type object.
We can also delete a column using the 'del' keyword. The following code will remove the column
'Employment.Length' from the original dataframe 'ld'.
1 del ld['Employment.Length']
2 ld.columns
On checking the columns of the 'ld' dataframe, we can observe that 'Employment.Length' column is
not present.
1. Changing variable types
2. Adding/modifying variables with algebraic operations
3. Adding/modifying variables based on conditions
4. Handling missing values
5. Creating flag variables
6. Creating multiple columns from a variable separated by a delimiter
1 import numpy as np
2 import pandas as pd
We will start with creating a custom dataframe having 7 columns and 50 rows as follows:
1 age=np.random.choice([15,20,30,45,12,'10',15,'34',7,'missing'],50)
2 fico=np.random.choice(['100-150','150-200','200-250','250-300'],50)
3 city=np.random.choice(['Mumbai','Delhi','Chennai','Kolkata'],50)
4 ID=np.arange(50)
5 rating=np.random.choice(['Excellent','Good','Bad','Pathetic'],50)
6 balance=np.random.choice([10000,20000,30000,40000,np.nan,50000,60000],50)
7 children=np.random.randint(high=5,low=0,size=(50,))
1 mydata=pd.DataFrame({'ID':ID,'age':age,'fico':fico,'city':city,'rating':rating,'balance':balance,'children':children})
1 mydata.head()
2 # data display in pdf will be truncated on right hand side
1 mydata.dtypes
1 ID int32
2 age object
3 fico object
4 city object
5 rating object
6 balance float64
7 children int32
8 dtype: object
We can see that 'age' column is of the object datatype, though it should have been numeric, maybe
due to some character values in the column. We can change the datatype to 'numeric'; the character
values which cannot be changed to numeric will be assigned missing values i.e. NaN's automatically.
There are multiple numeric formats in Python e.g. integer, float, unsigned integer etc. The
to_numeric() function chooses the best one for the column under consideration.
1 mydata['age']=pd.to_numeric(mydata['age'])
ValueError: Unable to parse string "missing" at position 2
When we run the code above, we get an error i.e. "Unable to parse string "missing" at position 2".
This error means that there are a few values in the column that cannot be converted to numbers; in
our case its the value 'missing' which cannot be converted to a number. In order to handle this, we
need to set the errors argument of the to_numeric() function to 'coerce' i.e. errors='coerce'. When we
use this argument, wherever it was not possible to convert the values to numeric, it converted them
to missing values i.e. NaN's.
1 mydata['age']=pd.to_numeric(mydata['age'], errors='coerce')
2 mydata['age'].head()
0 12.0
1 15.0
2 NaN
3 45.0
4 NaN
Name: age, dtype: float64
As we can observe in rows 2, 4, etc., wherever the string 'missing' was present, the values which
could not be converted to numbers are now NaN's i.e. missing values.
1 mydata['const_var']=100
The above code adds a new column 'const_var' to the mydata dataframe and each element in that
column is 100.
1 mydata.head()
If we want to apply a function on an entire column of a dataframe, we use a numpy function; e.g log
as shown below:
1 mydata['balance_log']=np.log(mydata['balance'])
The code above creates a new column 'balance_log' which has the logarithmic value of each element
present in the 'balance' column. A numpy function np.log() is used to do this.
1 mydata.head()
   ID   age     fico     city     rating  balance  children  const_var  balance_log
0   0  12.0  250-300  Chennai  Excellent  10000.0         3        100     9.210340
1   1  15.0  150-200  Chennai        Bad  20000.0         3        100     9.903488
2   2   NaN  250-300  Chennai   Pathetic  20000.0         2        100     9.903488
3   3  45.0  250-300    Delhi        Bad      NaN         3        100          NaN
4   4   NaN  250-300    Delhi   Pathetic  50000.0         2        100    10.819778
We can do many complex algebraic calculations as well to create/add new columns to the data.
1 mydata['age_children_ratio']=mydata['age']/mydata['children']
The code above creates a new column 'age_children_ratio'; each element of which will be the result
of the division of the corresponding elements present in the 'age' and 'children' columns.
1 mydata.head(10)
2 # display in pdf will be truncated on right hand side
   ID   age     fico     city     rating  balance  children  const_var  balance_log  age_children_ratio
0   0  12.0  250-300  Chennai  Excellent  10000.0         3        100     9.210340            4.000000
1   1  15.0  150-200  Chennai        Bad  20000.0         3        100     9.903488            5.000000
2   2   NaN  250-300  Chennai   Pathetic  20000.0         2        100     9.903488                 NaN
3   3  45.0  250-300    Delhi        Bad      NaN         3        100          NaN           15.000000
4   4   NaN  250-300    Delhi   Pathetic  50000.0         2        100    10.819778                 NaN
5   5  20.0  100-150    Delhi   Pathetic  60000.0         3        100    11.002100            6.666667
6   6  34.0  100-150   Mumbai       Good  40000.0         0        100    10.596635                 inf
7   7  20.0  200-250  Kolkata   Pathetic  30000.0         4        100    10.308953            5.000000
8   8  20.0  150-200   Mumbai       Good  20000.0         3        100     9.903488            6.666667
9   9  20.0  150-200  Chennai        Bad  50000.0         4        100    10.819778            5.000000
Notice that when a missing value is involved in any calculation, the result is also a missing value. We
observe that in the 'age_children_ratio' column we have both NaN's (missing values) as well as inf
(infinity). We get missing values in the 'age_children_ratio' column wherever 'age' has missing values
and we get 'inf' wherever the number of children is 0 and we end up dividing by 0.
Lets say we did not want missing values involved in the calculation i.e. we want to impute the
missing values before computing the 'age_children_ratio' column. For this we would first need to
identify the missing values. The isnull() function will give us a logical array which can be used to
isolate missing values and update these with whatever value we want to impute with.
1 mydata['age'].isnull().head()
0 False
1 False
2 True
3 False
4 True
Name: age, dtype: bool
In the outcome of the code above, we observe that wherever there is a missing value the
corresponding logical value is True.
If we want to know the number of missing values, we can sum the logical array as follows:
1 mydata['age'].isnull().sum()
The following code returns only those elements where the 'age' column has missing values.
1 mydata.loc[mydata['age'].isnull(),'age']
2 NaN
4 NaN
17 NaN
36 NaN
38 NaN
Name: age, dtype: float64
One of the ways of imputing the missing values is with mean. Once these values are imputed, we
then carry out the calculation done above.
1 mydata.loc[mydata['age'].isnull(),'age'] = np.mean(mydata['age'])
In the code above, using the dataframe's loc indexer, on the row side we first access those rows
where the 'age' column is null and on the column side we access only the 'age' column. In other
words, all the missing values in the 'age' column are replaced with the mean computed from the
non-missing values of the 'age' column.
1 mydata['age'].head()
0 12.000000
1 15.000000
2 19.533333
3 45.000000
4 19.533333
Name: age, dtype: float64
The missing values in the 'age' column have been replaced by the mean of the column i.e.
19.533333.
Now, we can compute the 'age_children_ratio' again; this time without missing values. We will
observe that there are no missing values in the newly created column now. We however, observe
inf's i.e. infinity which occurs wherever we divide by 0.
1 mydata['age_children_ratio']=mydata['age']/mydata['children']
1 mydata.head()
2 #display in pdf will be truncated on the right hand side
   ID        age     fico     city     rating  balance  children  const_var  balance_log  age_children_ratio
0   0  12.000000  250-300  Chennai  Excellent  10000.0         3        100     9.210340            4.000000
1   1  15.000000  150-200  Chennai        Bad  20000.0         3        100     9.903488            5.000000
2   2  19.533333  250-300  Chennai   Pathetic  20000.0         2        100     9.903488            9.766667
3   3  45.000000  250-300    Delhi        Bad      NaN         3        100          NaN           15.000000
4   4  19.533333  250-300    Delhi   Pathetic  50000.0         2        100    10.819778            9.766667
Let's say we want to replace the 'rating' column values with a numeric score - {'Pathetic' : -1,
'Bad' : 0, 'Good or Excellent' : 1}. We can do it using the np.where() function together with the isin()
method as follows:
1 mydata['rating_score']=np.where(mydata['rating'].isin(['Good','Excellent']),1,0)
Using the above code, we create a new column 'rating_score' and wherever either 'Good' or
'Excellent' is present, we replace it with a 1 else with a 0 as we can see below. The function isin() is
used when we need to consider multiple values; in our case 'Good' and 'Excellent'.
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
   ID        age     fico     city     rating  balance  children  const_var  balance_log  age_children_ratio
0   0  12.000000  250-300  Chennai  Excellent  10000.0         3        100     9.210340            4.000000
1   1  15.000000  150-200  Chennai        Bad  20000.0         3        100     9.903488            5.000000
2   2  19.533333  250-300  Chennai   Pathetic  20000.0         2        100     9.903488            9.766667
3   3  45.000000  250-300    Delhi        Bad      NaN         3        100          NaN           15.000000
4   4  19.533333  250-300    Delhi   Pathetic  50000.0         2        100    10.819778            9.766667
1 mydata.loc[mydata['rating']=='Pathetic','rating_score']=-1
In the code above, wherever the value in the 'rating' column is 'Pathetic', we update the
'rating_score' column to -1 and leave the rest as is. The above code could have been written using
the np.where() function as well. The np.where() function is similar to the ifelse statement we may
have seen in other languages.
1 mydata.head()
2 #display in the pdf will be truncated on the right hand side
   ID        age     fico     city     rating  balance  children  const_var  balance_log  age_children_ratio
0   0  12.000000  250-300  Chennai  Excellent  10000.0         3        100     9.210340            4.000000
1   1  15.000000  150-200  Chennai        Bad  20000.0         3        100     9.903488            5.000000
2   2  19.533333  250-300  Chennai   Pathetic  20000.0         2        100     9.903488            9.766667
3   3  45.000000  250-300    Delhi        Bad      NaN         3        100          NaN           15.000000
4   4  19.533333  250-300    Delhi   Pathetic  50000.0         2        100    10.819778            9.766667
Now we can see that the 'rating_score' column takes the values 0, 1 and -1 depending on the
corresponding values from the 'rating' column.
At times, we may have columns which we want to split into multiple columns; column 'fico' in our
case. One sample value is '100-150'. The datatype of 'fico' is object. Processing each value with a for
loop would be tedious; instead, we will discuss an easier approach which is very useful to know
when pre-processing your data.
Coming back to the 'fico' column, one of the first things that comes to mind when we want to
separate the values of the 'fico' column into multiple columns is the split() function. The split
function works on strings, but the current datatype of the 'fico' column is object, and the object type
does not directly expose string functions. If we apply the split() function directly on 'fico', we will get
an error.
1 mydata['fico'].split()
In order to handle this, we will first need to extract the 'str' attribute of the 'fico' column and then
apply the split() function. This will be the case for all string functions and not just split().
1 mydata['fico'].str.split("-").head()
0 [250, 300]
1 [150, 200]
2 [250, 300]
3 [250, 300]
4 [250, 300]
Name: fico, dtype: object
We can see that each of the elements have been split on the basis of '-'. However, it is still present in
a single column. We need the values in separate columns. We will set the 'expand' argument of the
split() function to True in order to handle this.
1 mydata['fico'].str.split("-",expand=True).head()
0 1
0 250 300
1 150 200
2 250 300
3 250 300
4 250 300
1 k=mydata['fico'].str.split("-",expand=True).astype(float)
Notice that we have converted the columns to float using the astype(float) function; since after
splitting, by default, the datatype of each column created would be object. But we want to consider
each column as numeric datatype, hence the columns are converted to float. Converting to float is
not a required step when splitting columns. We do it only because these values are supposed to be
considered numeric in the current context.
1 k[0].head()
0 250.0
1 150.0
2 250.0
3 250.0
4 250.0
Name: 0, dtype: float64
We can either concatenate this dataframe to the 'mydata' dataframe after giving proper header to
both the columns or we can directly assign two new columns in the 'mydata' dataframe as follows:
1 mydata['f1'],mydata['f2']=k[0],k[1]
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
   ID        age     fico     city     rating  balance  children  const_var  balance_log  age_children_ratio
0   0  12.000000  250-300  Chennai  Excellent  10000.0         3        100     9.210340            4.000000
1   1  15.000000  150-200  Chennai        Bad  20000.0         3        100     9.903488            5.000000
2   2  19.533333  250-300  Chennai   Pathetic  20000.0         2        100     9.903488            9.766667
3   3  45.000000  250-300    Delhi        Bad      NaN         3        100          NaN           15.000000
4   4  19.533333  250-300    Delhi   Pathetic  50000.0         2        100    10.819778            9.766667
We do not need the 'fico' column anymore as we have its values in two separate columns; hence we
will delete it.
1 del mydata['fico']
1 mydata.head()
2 # display in pdf will be truncated on the right hand side
1 print(mydata['city'].unique())
2 print(mydata['city'].nunique())
The 'city' column consists of 4 unique elements - 'Kolkata', 'Mumbai', 'Chennai' and 'Delhi'. For this
variable, we will need to create three dummies.
The code below creates a flag variable when the 'city' column has the value 'Mumbai'.
1 mydata['city_mumbai']=np.where(mydata['city']=='Mumbai',1,0)
Wherever the variable 'city' takes the value 'Mumbai', the flag variable 'city_mumbai' will be 1
otherwise 0.
There is another way to do this, where we write the condition and convert the logical value to
integer.
1 mydata['city_chennai']=(mydata['city']=='Chennai').astype(int)
Code "mydata['city']=='Chennai'" gives a logical array; wherever the city is 'Chennai', the value on
'city_chennai' flag variable is True, else False.
1 (mydata['city']=='Chennai').head()
0 True
1 True
2 True
3 False
4 False
Name: city, dtype: bool
When we convert it to an integer, wherever there was True, we get a 1 and wherever there was False,
we get a 0.
1 ((mydata['city']=='Chennai').astype(int)).head()
0 1
1 1
2 1
3 0
4 0
Name: city, dtype: int32
We can use either of the methods for creating flag variables, the end result is same.
1 mydata['city_kolkata']=np.where(mydata['city']=='Kolkata',1,0)
Once the flag variables have been created, we do not need the original variable i.e. we do not need
the 'city' variable anymore.
1 del mydata['city']
This way of creating dummies requires a lot of coding, even if we somehow use a for loop. As an
alternative, we can use get_dummies() function from pandas directly to do this.
We will create dummies for the variable 'rating' using this method.
1 print(mydata['rating'].unique())
2 print(mydata['rating'].nunique())
The pandas function which creates dummies is get_dummies() in which we pass the column for
which dummies need to be created. By default, the get_dummies function creates n dummies if n
unique values are present in the column i.e. for 'rating' column, by default, get_dummies function,
creates 4 dummy variables. We do not want that. Hence, we pass the argument 'drop_first=True'
which removes one of the dummies and creates (n-1) dummies only. Setting the 'prefix' argument
helps us identify which column the dummy variables were created for. It does this by adding
whatever string we give to the 'prefix' argument as a prefix to the name of each dummy variable
created. In our case 'rating_' will be prefixed to each dummy variable name.
1 dummy=pd.get_dummies(mydata['rating'],drop_first=True,prefix='rating')
The get_dummies() function has created a column for 'Excellent', 'Good' and 'Pathetic' but has
dropped the column for 'Bad'.
1 dummy.head()
rating_Excellent rating_Good rating_Pathetic
0 1 0 0
1 0 0 0
2 0 0 1
3 0 0 0
4 0 0 1
We can now simply attach these columns to the data using pandas concat() function.
1 mydata=pd.concat([mydata,dummy],axis=1)
The concat() function will take the axis argument as 1 since we are attaching the columns.
After creating dummies for 'rating', we will now drop the original column.
1 del mydata['rating']
1 mydata.columns
We need to keep in mind that we will not be doing all of what we learned here at once for any one
dataset. Some of these techniques will be useful at a time while preparing data for machine learning
algorithms.
1 import numpy as np
2 import pandas as pd
The random.randint() function creates 4 columns having 20 rows with values ranging between 2 and
8.
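A minimal sketch of the creation call, assuming columns named A to D:

df = pd.DataFrame(np.random.randint(low=2, high=8, size=(20, 4)), columns=['A', 'B', 'C', 'D'])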
1 df.head()
A B C D
0 4 7 4 6
1 2 7 5 6
2 2 3 5 7
3 4 6 2 4
4 6 5 4 4
If we wish to sort the dataframe by column A, we can do that using the sort_values() function on the
dataframe.
1 df.sort_values("A").head()
A B C D
9 2 5 6 7
14 2 6 4 6
18 2 7 7 3
6 2 4 2 5
19 2 6 4 6
The output that we get is sorted by the column 'A'. But when we type 'df' again to view the
dataframe, we see that there are no changes; df is the same as it was before sorting.
1 df.head()
A B C D
0 4 7 4 6
1 2 7 5 6
2 2 3 5 7
3 4 6 2 4
4 6 5 4 4
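To make the sorted order stick, the result has to be saved back; a minimal sketch (either form works, the exact call used is assumed):

df = df.sort_values("A")                 # reassign the sorted output
# or equivalently: df.sort_values("A", inplace=True)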
Now when we observe the dataframe 'df', it will be sorted by 'A' in an ascending manner.
1 df.head()
A B C D
9 2 5 6 7
14 2 6 4 6
18 2 7 7 3
6 2 4 2 5
19 2 6 4 6
In case we wish to sort the dataframe in a descending manner, we can set the argument
ascending=False in the sort_values() function.
Now the dataset will be sorted in the reverse order of the values of 'A'.
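A minimal sketch of the descending sort saved back to df (exact call assumed):

df = df.sort_values("A", ascending=False)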
1 df.head()
A B C D
17 7 7 5 5
10 7 7 4 4
15 7 4 6 7
4 6 5 4 4
11 6 3 7 2
Sorting by next column in the sequence happens within the groups formed after sorting of the
previous columns.
In the code below, we can see that the 'ascending' argument takes values [True, False]. It is passed in
the same order as the columns ['B','C']. This means that the column 'B' will be sorted in an ascending
order and within the groups created by column 'B', column 'C' will be sorted in a descending order.
1 df.sort_values(['B','C'],ascending=[True,False]).head(10)
A B C D
12 5 2 3 5
11 6 3 7 2
13 3 3 7 6
2 2 3 5 7
7 5 3 3 4
15 7 4 6 7
16 4 4 4 2
6 2 4 2 5
9 2 5 6 7
4 6 5 4 4
We can observe that the column 'B' is sorted in an ascending order. Within the groups formed by
column 'B', column 'C' sorts its values in descending order.
Although we have not taken an explicit example with character data, in that case sorting happens in
lexicographic (dictionary) order.
Now we will see how to combine dataframes horizontally or vertically by stacking them.
1 df1
letter number
0 a 1
1 b 2
1 df2
letter number animal
0 c 3 cat
1 d 4 dog
In order to combine these dataframes, we will use the concat() function of the pandas library.
The argument 'axis=0' combines the dataframes row-wise i.e. stacks the dataframes vertically.
1 pd.concat([df1,df2], axis=0)
Notice that the index of the two dataframes is not generated afresh in the concatenated dataframe.
The original indices are stacked, so we end up with duplicate index names. More often than not, we
would not want the indices to be stacked. We can avoid doing this by setting the 'ignore_index'
argument to True in the concat() function.
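A minimal sketch of the call with ignore_index (exact call assumed):

pd.concat([df1, df2], axis=0, ignore_index=True)  # builds a fresh 0..3 index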
animal letter number
0 NaN a 1
1 NaN b 2
2 cat c 3
3 dog d 4
We discussed how the dataframes can be stacked vertically. Now lets see how they can be stacked
horizontally.
In order to stack the dataframes column-wise i.e. horizontally, we will need to set the 'axis' argument
to 1.
The datasets which we will stack horizontally are 'df1' and 'df3'.
1 df1
letter number
0 a 1
1 b 2
1 df3
animal name
0 bird polly
1 monkey george
2 tiger john
1 pd.concat([df1,df3],axis=1)
We see that when we use the concat() function with 'axis=1' argument, we combine the dataframes
column-wise.
Since 'df3' dataframe has three rows, whereas 'df1' dataframe has only two rows, the remaining
values are set to missing as can be observed in the dataframe above.
Many times our datasets need to be combined by keys instead of simply stacking them vertically or
horizontally. As an example, lets consider the following dataframes:
1 df1=pd.DataFrame({"custid":[1,2,3,4,5],
2 "product":["Radio","Radio","Fridge","Fridge","Phone"]})
3 df2=pd.DataFrame({"custid":[3,4,5,6,7],
4 "state":["UP","UP","UP","MH","MH"]})
1 df1
custid product
0 1 Radio
1 2 Radio
2 3 Fridge
3 4 Fridge
4 5 Phone
1 df2
custid state
0 3 UP
1 4 UP
2 5 UP
3 6 MH
4 7 MH
Dataframe 'df1' contains information about the customer id and the product they purchased and
dataframe 'df2' also contains the customer id along with which state they come from.
Notice that the first row of the two dataframes have different customer ids i.e. the first row contains
information about different customers, hence it won't make sense to combine the two dataframes
together horizontally.
In order to combine data from the two dataframes, we will first need to set a correspondence using
customer id i.e. combine only those rows having a matching customer id and ignore the rest. In
some situations, if there is data in one dataframe which is not present in the other dataframe,
missing data will be filled in.
There are 4 ways in which the dataframes can be merged - inner join, outer join, left join and right
join:
In the following code, we are joining the two dataframes 'df1' and 'df2' and the key or
correspondence between the two dataframes is determined by 'custid' i.e. customer id. We use the
inner join here (how='inner'), which retains only those rows which are present in both the
dataframes. Since customer id's 3, 4 and 5 are common in both the dataframes, these three rows are
returned as a result of the inner join along with corresponding information 'product' and 'state' from
both the dataframes.
1 pd.merge(df1,df2,on=['custid'],how='inner')
custid product state
0 3 Fridge UP
1 4 Fridge UP
2 5 Phone UP
Now lets consider outer join. In outer join, we keep all the ids, starting at 1 and going up till 7. This
leads to having missing values in some columns e.g. customer ids 6 and 7 were not present in the
dataframe 'df1' containing product information. Naturally the product information for those
customer ids will be absent.
Similarly, customer ids 1 and 2 were not present in the dataframe 'df2' containing state information.
Hence, state information was missing for those customer ids.
Merging cannot fill in the data on its own if that information is not present in the original
dataframes. We will explicitly see a lot of missing values in outer join.
1 pd.merge(df1,df2,on=['custid'],how='outer')
custid product state
0 1 Radio NaN
1 2 Radio NaN
2 3 Fridge UP
3 4 Fridge UP
4 5 Phone UP
5 6 NaN MH
6 7 NaN MH
Using the left join, we will see all customer ids present in the left dataframe 'df1' and only the
corresponding product and state information from the two dataframes. The information present
only in the right dataframe 'df2' is ignored i.e. customer ids 6 and 7 are ignored.
1 pd.merge(df1,df2,on=['custid'],how='left')
custid product state
0 1 Radio NaN
1 2 Radio NaN
2 3 Fridge UP
3 4 Fridge UP
4 5 Phone UP
Similarly, right join will contain all customer ids present in the right dataframe 'df2' irrespective of
whether they are there in the left dataframe 'df1' or not.
1 pd.merge(df1,df2,on=['custid'],how='right')
custid product state
0 3 Fridge UP
1 4 Fridge UP
2 5 Phone UP
3 6 NaN MH
4 7 NaN MH
1 import pandas as pd
2 import numpy as np
1 file=r'bank-full.csv'
2 bd=pd.read_csv(file,delimiter=';')
1 bd.describe()
2 # display in pdf will be truncated on the right hand side
Another useful function that can be applied on the entire data is nunique(). It returns the number of
unique values taken by different variables.
Numeric data
1 bd.nunique()
1 age 77
2 job 12
3 marital 3
4 education 4
5 default 2
6 balance 7168
7 housing 2
8 loan 2
9 contact 3
10 day 31
11 month 12
12 duration 1573
13 campaign 48
14 pdays 559
15 previous 41
16 poutcome 4
17 y 2
18 dtype: int64
We can observe that the variables having the 'object' type have fewer unique values, while the
'numeric' variables have a higher number of unique values.
The describe() function can be used with individual columns also. For numeric variables, it gives the
8 summary statistics for that column only.
1 bd['age'].describe()
1 count 45211.000000
2 mean 40.936210
3 std 10.618762
4 min 18.000000
5 25% 33.000000
6 50% 39.000000
7 75% 48.000000
8 max 95.000000
9 Name: age, dtype: float64
When the describe() function is used with a categorical column, it gives the total number of values in
that column, total number of unique values, the most frequent value ('blue-collar') as well as the
frequency of the most frequent value.
1 bd['job'].describe()
1 count 45211
2 unique 12
3 top blue-collar
4 freq 9732
5 Name: job, dtype: object
Note: When we use the describe() function on the entire dataset, by default, it returns the summary
statistics of numeric columns only.
Lets say we only wanted the mean or the median of the variable 'age'.
1 bd['age'].mean(), bd['age'].median()
(40.93621021432837, 39.0)
Apart from the summary statistics provided by the describe function, there are many other statistics
available as shown below:
Function Description
min Minimum
max Maximum
mode Mode
Categorical data
Now starting with categorical data, we would want to look at frequency counts. We use the
value_counts() function to get the frequency counts of each unique element present in the column.
1 bd['job'].value_counts()
blue-collar 9732
management 9458
technician 7597
admin. 5171
services 4154
retired 2264
self-employed 1579
entrepreneur 1487
unemployed 1303
housemaid 1240
student 938
unknown 288
Name: job, dtype: int64
By default, the outcome of the value_counts() function is in descending order. The element 'blue-
collar' with the highest count is displayed on the top and that with the lowest count 'unknown' is
displayed at the bottom.
We should be aware of the format of the output. Lets save the outcome of the above code in a
variable 'k'.
1 k = bd['job'].value_counts()
The outcome stored in 'k' has two attributes. One is values i.e. the raw frequencies and the other is
'index' i.e. the categories to which the frequencies belong.
1 k.values
array([9732, 9458, 7597, 5171, 4154, 2264, 1579, 1487, 1303, 1240, 938,
288], dtype=int64)
1 k.index
As shown, values contain raw frequencies and index contains the corresponding categories. e.g.
'blue-collar' job has 9732 counts.
Lets say, you are asked to get the category with minimum count, you can directly get it with the
following code:
1 k.index[-1]
'unknown'
We can get the category with the second lowest count as well as the one with the highest count as follows:
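A minimal sketch of the two lookups producing the outputs below (exact calls assumed):

k.index[-2]   # second lowest count -> 'student'
k.index[0]    # highest count -> 'blue-collar'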
'student'
'blue-collar'
Now if someone asks us for category names with frequencies higher than 1500. We can write the
following code to get the same:
1 k.index[k.values>1500]
Even if we write the condition on k itself, by default the condition is applied to the values.
1 k.index[k>1500]
Index(['blue-collar', 'management', 'technician', 'admin.', 'services',
'retired', 'self-employed'],
dtype='object')
The next kind of frequency table that we are interested in when working with categorical variables is
cross-tabulation i.e. frequency of two categorical variables taken together. e.g. lets consider the
cross-tabulation of two categorical variables 'default' and 'housing'.
1 pd.crosstab(bd['default'],bd['housing'])
housing     no    yes
default
no       19701  24695
yes        380    435
In the above frequency table, we observe that there are 24695 observation where the value for
'housing' is 'yes' and 'default' is 'no'. This is a huge chunk of the population. There is a smaller chunk
of about 435 observations where housing is 'yes' and default is 'yes' as well. Within the observations
where default is 'yes', 'housing' is 'yes' for a higher number of observations i.e. 435 as compared to
where housing is 'no' i.e. 380.
Now, lets say that we want to look at the unique elements as well as the frequency counts of all
categorical variables in the dataset 'bd'.
1 bd.select_dtypes('object').columns
The code above gives us all the column names which are stored as the object (categorical) datatype
in the 'bd' dataframe. We can then run a for loop on top of these columns to get whichever summary
statistic we need for the categorical columns.
1 cat_var = bd.select_dtypes('object').columns
2 for col in cat_var:
3 print(bd[col].value_counts())
4 print('~~~~~')
1 blue-collar 9732
2 management 9458
3 technician 7597
4 admin. 5171
5 services 4154
6 retired 2264
7 self-employed 1579
8 entrepreneur 1487
9 unemployed 1303
10 housemaid 1240
11 student 938
12 unknown 288
13 Name: job, dtype: int64
14 ~~~~~
15 married 27214
16 single 12790
17 divorced 5207
18 Name: marital, dtype: int64
19 ~~~~~
20 secondary 23202
21 tertiary 13301
22 primary 6851
23 unknown 1857
24 Name: education, dtype: int64
25 ~~~~~
26 no 44396
27 yes 815
28 Name: default, dtype: int64
29 ~~~~~
30 yes 25130
31 no 20081
32 Name: housing, dtype: int64
33 ~~~~~
34 no 37967
35 yes 7244
36 Name: loan, dtype: int64
37 ~~~~~
38 cellular 29285
39 unknown 13020
40 telephone 2906
41 Name: contact, dtype: int64
42 ~~~~~
43 may 13766
44 jul 6895
45 aug 6247
46 jun 5341
47 nov 3970
48 apr 2932
49 feb 2649
50 jan 1403
51 oct 738
52 sep 579
53 mar 477
54 dec 214
55 Name: month, dtype: int64
56 ~~~~~
57 unknown 36959
58 failure 4901
59 other 1840
60 success 1511
61 Name: poutcome, dtype: int64
62 ~~~~~
63 no 39922
64 yes 5289
65 Name: y, dtype: int64
66 ~~~~~
Many times we do not only want the summary statistics of numeric or categorical variables
individually; we may want a summary of numeric variables within the categories coming from a
categorical variable, e.g. let's say we want the average age of the people who default on their loan
as opposed to people who do not. This is known as a group-wise summary.
1 bd.groupby(['default'])['age'].mean()
default
no 40.961934
yes 39.534969
Name: age, dtype: float64
The result above tells us that the defaulters have a slightly lower average age as compared to non-
defaulters.
Looking at median will give us a better idea in case we have many outliers. We notice that the
difference is not much.
1 bd.groupby(['default'])['age'].median()
default
no 39
yes 38
Name: age, dtype: int64
We can group by multiple variables as well. There is no limit on the number and type of variables we
can group by, but generally we group by categorical variables only.
Also, it is not necessary to give the name of the column for which we want the summary statistic;
e.g. in the code above, we wanted the median of the column 'age'. When we do not specify the
column 'age', we get the median of all the numeric columns grouped by the variable 'default'.
1 bd.groupby(['default']).median()
        age  balance  day  duration  campaign  pdays  previous
default
no       39      468   16       180         2     -1         0
yes      38       -7   17       172         2     -1         0
1 bd.groupby(['default','loan']).median()
default loan
Each row in the result above gives the 4 categories defined by the two categorical variables we have
grouped by and each column give the median value for all the numerical variables for each group.
In short, when we do not give a variable to compute the summary statistic e.g. median, we get all the
columns where median can be computed.
Now, lets say we do not want to find the median for all columns, but only for 'day' and 'balance'
columns. We can do that as follows:
1 bd.groupby(['housing','default'])['balance','day'].median()
balance day
housing default
no no 531 17
yes 0 18
yes no 425 15
yes -137 15
If we were to do the visualizations using matplotlib directly instead of seaborn, we would need to
write a lot more code. Seaborn has functions which wrap up this code, making it simpler.
Note: %matplotlib inline is required only when we use a Jupyter notebook, so that visualizations
appear within the notebook itself. Other editors like Spyder or PyCharm do not need this line of
code as part of our script.
What we will cover is primarily to visualize our data quickly which will help us build our machine
learning models.
1 import pandas as pd
2 import numpy as np
3 import seaborn as sns
4 %matplotlib inline
1 file=r'bank-full.csv'
2 bd=pd.read_csv(file,delimiter=';')
Density plots
Lets start with density plots for a single numeric variable. We use the distplot() function from the
seaborn library to get the density plot. The first argument will be the variable for which we want to
make the density plot.
1 sns.distplot(bd['age'])
By default, the distplot() function gives a histogram along with the density plot. In case we do not
want the density plot, we can set the argument 'kde' (short for kernel density) to False.
1 sns.distplot(bd['age'], kde=False)
Setting the 'kde' argument to False will not show us the density curve, but will only show the
histogram.
In order to build a histogram, continuous data is split into intervals called bins. The argument 'bins' lets us set the number of bins, which in turn affects the width of each bin. This argument has a default value; however, we can change the number of bins by passing the 'bins' argument explicitly.
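The plots themselves are not reproduced here, but the two histograms being compared below would come from calls along these lines (each run in its own cell; only the value of 'bins' differs):
1 sns.distplot(bd['age'], kde=False, bins=15)    # wider bins: a coarser view of the distribution
1 sns.distplot(bd['age'], kde=False, bins=100)   # narrower bins: a much finer view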
Notice the difference in the width of the bins when the argument 'bins' has different values. With
'bins' argument having the value 15, the bins are much wider as compared to when the 'bins' have
the value 100.
How do we decide the number of bins; what would be a good choice? First consider why we need a histogram. Using a histogram we can get a fair idea of where most of the values lie. e.g. considering the variable 'age', a histogram tells us how people in the data are distributed across different values of the 'age' variable. Looking at the histogram, we get a fair idea that most of the people in our data lie in the 30 to 40 age range. If we look a bit further, we can also say that the data primarily lies in the 25 to about 58 years age range. Beyond the age of 58, the density falls. We might be looking at the typical working-age population.
Coming back to how we decide the number of bins: we can see that most of the people are between 30 and 40 years of age. If we want to dig deeper and see how the data is distributed within this range, we increase the number of bins. In other words, when we want to go finer, we increase the number of bins.
We can see here that between 30 and 40, people in their early 30's are much more dominant as compared to people whose age is closer to 40. One thing to keep in mind when increasing the number of bins is the number of data points available: if we have only 100 data points, it does not make sense to create 50 bins, because the frequency bars we see will not give us a very general picture.
In short, a higher number of bins gives us a finer picture of how the data behaves in terms of density across the value ranges. But with very few data points, a higher number of bins may give us a picture which is not generalizable.
We can see that the y axis shows frequencies. Sometimes it is much easier to look at frequency proportions instead. We get these by setting the 'norm_hist' argument to True, which normalizes the histogram.
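The normalized histogram described above would be produced by a call along these lines (a sketch using the same 'age' column):
1 sns.distplot(bd['age'], kde=False, norm_hist=True)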
We can see that about 5% of the population lies between the age range of 31 to about 33.
Note: If we want to get more specific, we need to move towards numeric summary of data.
1 myplot = sns.distplot(bd['age'])   # seaborn plotting functions return a matplotlib Axes
2 myimage = myplot.get_figure()      # get the Figure object that contains this Axes
3 myimage.savefig("output.png")      # write the figure to disk
The moment we do this, the 'output.png' file will appear in the current working directory, i.e. wherever this script or notebook is running from. We can check this path using the pwd magic command in Jupyter. My 'output.png' file is present in the following path:
1 pwd
The above method of saving images will work with all kinds of plots.
There is a function kdeplot() in seaborn that generates only the density plot, without the histogram.
1 sns.kdeplot(bd['age'])
We observe that the different plots have their own functions and we simply need to pass the column
which we wish to visualize.
Now, let us see how to visualize the 'age' column using a boxplot.
1 sns.boxplot(bd['age'])
We can also get the above plot with the following code:
1 sns.boxplot(y='age', data=bd)
We get a vertical boxplot since we mentioned y as 'age'. We can also get the horizontal boxplot if we
specify x as 'age'.
1 sns.boxplot(x='age', data=bd)
We can see that there are no extreme values on the lower side of 'age'; however, there are a lot of extreme values on the higher side.
We can use scatterplots to visualize two numeric columns together. The function which helps us do this is jointplot() from the seaborn library.
The jointplot not only gives us the scatterplot but also gives the density plots along the axes.
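The basic call would look something like the following (a sketch; assuming, as in the discussion below, that we are plotting 'age' against 'balance'):
1 sns.jointplot(x='age', y='balance', data=bd)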
We can do a lot more things with the jointplot() function. We observe that the variable 'balance' takes values over a very long range, but most of the values are concentrated in a very narrow range. Hence we will plot the data only for those observations for which the 'balance' column has values between 0 and 1000.
Note: Putting conditions on the data is not a requirement for visualizing it. It is just that most of the data lies in a small range, and the few extreme values may give us a distorted plot.
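A sketch of the filtered plot described above (the name 'bd_sub' is introduced here only for convenience):
1 bd_sub = bd[(bd['balance'] >= 0) & (bd['balance'] <= 1000)]   # keep balances between 0 and 1000
2 sns.jointplot(x='age', y='balance', data=bd_sub)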
Using the above code we have no way of figuring out if individual points are overlapping each other.
Setting the argument 'kind' to 'hex' not only shows us the observations but also tells us how many observations overlap at a point, through the shade of each hexagonal bin: the darker the bin, the more observations are present there.
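Reusing the 'bd_sub' subset defined above, the hex version would be requested like this (a sketch):
1 sns.jointplot(x='age', y='balance', data=bd_sub, kind='hex')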
We can observe that most of the observations lie between the age of 30 and 40, and a lot of observations lie between a balance of 0 and 400. As we move away, the shade keeps getting lighter, indicating that the number of observations, or the density, decreases.
These hex plots are a combination of scatterplots and pure density plots. If we want pure density
plots, we need to change the 'kind' argument to 'kde'.
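Again reusing 'bd_sub', a sketch of the pure density version:
1 sns.jointplot(x='age', y='balance', data=bd_sub, kind='kde')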
The darker shade indicates that most of the data lies there and as we move away the density of the
data dissipates.
We observe that lmplot() is just like a scatterplot, but it fits a line through the data by default (lmplot stands for linear model plot).
We can also ask lmplot to fit higher order polynomials, which gives us a sense of whether there exists a non-linear relationship in the data.
Since the data in the plot above is overlapping a lot, let us consider the first 500 observations only.
We can see that a line fits through these points.
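A sketch of the call on the first 500 observations (assuming the variables plotted are 'age' and 'balance', the pair referred to in the faceting section below):
1 sns.lmplot(x='age', y='balance', data=bd.head(500))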
If we update the code above and add the argument 'order=6', we can see that the function tries to fit a curve through the data. Since the fit still mostly looks like a line, there is perhaps only a linear trend. Also, as the line is nearly horizontal, there is not much correlation between the two variables plotted.
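The higher order fit would be requested along these lines (a sketch, continuing with the same assumed columns):
1 sns.lmplot(x='age', y='balance', data=bd.head(500), order=6)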
3. Faceting the data
As of now, we are looking at the age and balance relationship across the entire data.
Now we want to see how the relationship between 'duration' and 'campaign' behaves for different values of 'housing'. We can observe this by coloring the data points for different values of 'housing' using the 'hue' argument. The 'housing' variable takes two values: yes and no.
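A sketch of this call (which variable goes on which axis is an assumption here):
1 sns.lmplot(x='duration', y='campaign', hue='housing', data=bd.head(500))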
We can see two different fitted lines. The orange one corresponds to 'housing' equal to no and the
blue one corresponding to 'housing' equal to yes.
Now if we wish to divide our data further on the basis of the 'default' column, we can use the 'col' argument. The 'default' column takes two values: yes and no.
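A sketch of the faceted call, adding 'col' to the previous one:
1 sns.lmplot(x='duration', y='campaign', hue='housing', col='default', data=bd.head(500))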
Now we have 4 parts of the data. Two are given by the color of the points and two more are given by
separate columns. The first column refers to 'default' being no and the second column refers to
'default' being yes.
Observe that there are very few points when 'default' is equal to yes; hence we cannot trust this relationship much, as there is very little data behind it.
Next, we check how the relationship between the two variables changes when we look at different categories of 'loan'. The 'loan' column also takes two values: yes and no.
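One way to add this third split is the 'row' argument (using 'row' here is an assumption; 'col' would work similarly):
1 sns.lmplot(x='duration', y='campaign', hue='housing', col='default', row='loan', data=bd.head(500))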
We observe that, within the group where 'default' is no, the relationship between the two variables 'campaign' and 'duration' is different when 'loan' is yes as compared to when 'loan' is no. Also, the majority of the data points are present where 'loan' and 'default' are both no. There are no data points where both 'loan' and 'default' are yes. Keep in mind that we are looking at the first 500 observations only; we may get some data points here if we look at a higher number of observations.
This is how we can facet the data, observing whether the relationship between two variables changes after the data is broken into groups.
Next, we want to know how frequently the different education groups appear in the data.
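The original call is not shown; a frequency bar chart like the one described can be obtained with seaborn's countplot() (a sketch):
1 sns.countplot(x='education', data=bd)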
We observe that the 'secondary' education group is very frequent. There is a small chunk where the
level of education is 'unknown'. The 'tertiary' education group has the second highest count followed
by 'primary'.
We have options to use faceting here as well. We can use the same syntax we used earlier. Lets start
by adding 'hue' as 'loan'.
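Continuing with the countplot sketch from above:
1 sns.countplot(x='education', hue='loan', data=bd)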
We observe that each education level is now broken into two parts according to the value of 'loan'.
This is how we can facet using the same options for categorical data.
When we want to see how 'age' behaves across different levels of education, we can use boxplots.
When we make a boxplot only with the 'age' variable, we get a plot indicating that the data primarily lies between the ages of 20 and 70, with a few outlying values, and that the data overall is positively skewed.
Now, we want to look at how the variable 'age' behaves across different levels of 'education'.
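A sketch of the grouped boxplot described here:
1 sns.boxplot(x='education', y='age', data=bd)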
We observe that the 'primary' education group has an overall higher median age. The behavior of the data points under 'unknown' is similar to those under 'primary' education, suggesting that the 'unknown' ones may have a primary education background.
We also observe that people having 'tertiary' education belong to a lower age group. One possible inference is that older people could get by with less education, whereas the current generation needs higher levels of education to do so.
6. Heatmaps
Heatmaps are a two dimensional representation of data in which values are represented using colors. A heatmap uses a warm-to-cool color spectrum to describe the values visually.
1 x = np.random.random(size=(20,20))
1 x[:3]
Looking at the data above, it is difficult for us to determine what kind of values it has.
1 sns.heatmap(x)
When we observe the heatmap, wherever the color of a box is light, the value is closer to 1, and the darker a box gets, the closer its value is to 0. Looking at the visualization above, we get the idea that the values are more or less random; there does not seem to be any dominance of lighter or darker boxes.
Now that we understand the use of colors in heatmaps, lets get back to how they help us understand our 'bd' dataset. Lets say we look at the correlations in the data 'bd'.
1 bd.corr()
(The output is the correlation matrix of the numeric columns: age, balance, day, duration, campaign, pdays and previous.)
Correlation tables can be quite large; a dataset may easily have 50 variables. Looking at such a table, it is very difficult to manually check whether correlation exists between variables. We could manually check if any of the values in the table above are close to 1 or -1, which indicates high correlation, or we can simply pass the table to a heatmap.
1 sns.heatmap(bd.corr())
The visualization above shows that wherever the boxes are very light, i.e. the value is near +1, there is high correlation, and wherever the boxes are dark, i.e. the value is near 0, the correlation is low. The diagonal will be the lightest because each variable has maximum correlation with itself. For the rest of the data, we observe that there is not much correlation; however, there seems to be some positive correlation between 'previous' and 'pdays'. The advantage of using heatmaps is that we do not have to go through the correlation table manually to check whether correlation is present; we can understand the same visually through the heatmap shown above.