Python Study Material
1.1 Introduction
The word Python - isn't it scary? Doesn't it bring to mind the image of a big snake found in the
Amazon forest? Well, it is time to change this image. From now on, you will remember Python as
a free-to-use, open-source programming language that is great for performing Data Science
tasks. Python, the programming language, was named after the famous BBC comedy show
Monty Python's Flying Circus. It is an easy-to-learn yet powerful object-oriented
programming language that is now widely used for analyzing data.
1.2 First Program in Python
Let us now try our first Program in Python. A Statement is an instruction that the
computer will execute. Perhaps the simplest Python Program that you can write is the one that
contains a single print statement, as shown below:
> print("Hello, Python!")
When the computer executes the above print statement, it will simply display the value you
write within the parentheses (i.e. "Hello, Python!"). The value you write in the parentheses is
called the argument. If you are using a Jupyter Notebook, you will see a small rectangle containing
the above print statement. This is called a cell. If you select this cell with your mouse and
then click the "Run All" button, the computer will execute the print statement and display the
output (i.e. Hello, Python!) beneath the cell, as shown below:
print("Hello, Python!")
Hello, Python!
Using the type( ) function, we can find the data type of a value. For example, type(29)
will return int, while type(3.14) will return float. If a string contains an integer value, we can
convert it to int. This is known as type casting. For example, int("38") will return 38, while
int(True) will return 1. On the other hand, bool(1) will return True. The data types List,
Tuple and String are known as sequence data types, while a Dictionary maps keys to values.
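A few quick illustrations (the output shown, such as <class 'int'>, is what Python 3 prints):
>>> type(29)
<class 'int'>
>>> type(3.14)
<class 'float'>
>>> int("38") + 2
40
>>> bool(1)
True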
1.4 Expressions
A simple example of an expression is 33 + 60 - 10. Here, we call the numbers 33, 60 and
10 operands, and the maths symbols + and - operators. The value of this expression is
83. We can perform the multiplication operation by using the asterisk symbol, and the division
operation by using the forward slash symbol, as shown below:
5 * 5
25 / 5
The values of the above expressions will be evaluated as 25 and 5.0, respectively.
We can use a double slash for integer division, in which case the result will be rounded down (floored) to the nearest integer.
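For example (note that floor division rounds toward negative infinity):
>>> 25 // 6
4
>>> -25 // 6
-5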
Python follows mathematical conventions when evaluating a mathematical expression. The
arithmetic operators in the following two expressions appear in a different order. But, in both
cases, Python performs the multiplication first (2 * 60 gives 120) and then the addition to obtain the final result:
2 * 60 + 30
150
30 + 2 * 60
150
The expression within the parentheses is evaluated first in the following example: 30 + 2 gives 32. We then
multiply the result by 60. The final result is 1920.
(30 + 2) * 60
1920
1.5 Variable
We can use variables to store values. In the following example, we assign the value
10000 to the variable principal by using the assignment operator (i.e. the equals sign). We can
then use the value somewhere else in the code by typing the exact name of the variable. Note
that a variable name cannot contain a hyphen, so we use the underscore to separate words:
>>> principal = 10000
>>> years = 3
>>> interest_rate = 10
>>> simple_interest = (principal * years * interest_rate) / 100
>>> simple_interest
3000.0
Recall that we can find the data type of a value by using the type( ) function. For example,
type(29) will return int. We can apply the type( ) function on a variable also. For example,
type(principal) will return int.
1.6 Method of providing the input through the keyboard
In the above Program, input values are provided as part of the Program. But, often, the
input values are received from the User through the keyboard. The input( ) function shall be
used for this purpose.
name = input("What is your name?")
When the above statement is executed, the prompt will be displayed as a vertical line, as shown
below:
What is your name? |
We have to type the name in front of the above prompt (|), as shown below:
What is your name? Ganesh
Now, “Ganesh” will be assigned as the value of the variable name.
We have just now seen the method of reading in a string. Even if we provide a number
as the input, the input( ) function will return the entered value as a string. So, what should we
do if we have to read in an integer or a fractional value? What is the way out? We have to use
the int( ) and float( ) functions, as shown below:
>>> age = int(input("What is your age?"))
>>> percentage = float(input("Enter your percentage of marks:"))
The while statement is used to repeatedly execute a set of statements as long as a condition
remains true. Here is an example:
>>> n = 1
>>> while n <= 3:
        print(n * n)
        n = n + 1
Here, n will take the values 1, 2 and 3. So, the output will be as follows:
1
4
9
Note:
Many programming languages use curly braces to delimit blocks of code. But Python
uses indentation, as shown below:
for i in [1, 2, 3, 4, 5]:
    print(i)
    for j in [1, 2, 3, 4, 5]:
        print(j)
        print(i + j)
    print(i)
print("looping is over")
In order to use the functions defined in the re module, such as the compile( ) function, we have to import the re module, as shown
below:
import re
my_regular_expression = re.compile("[0-9]+", re.I)
Here, we prefix the compile( ) function with the name of the module (i.e., re). If we already
use the name re in our Program for some other purpose, we can use an alias, as shown below:
import re as regex
my_regular_expression = regex.compile("[0-9]+", regex.I)
In Python, we use the functions in the module named matplotlib.pyplot for drawing a
variety of figures, such as the Bar Chart and the Pie Chart.
import matplotlib.pyplot
matplotlib.pyplot.plot( ... )
Here, the name of the module is a lengthy one. Instead of writing this lengthy name again and
again, we can use the alias option, as shown below:
import matplotlib.pyplot as plt
plt.plot( ... )
If we need only a few specific functions and constants defined in a Module, we can
import them explicitly and use them without qualification, as shown below:
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter( )
Note that we are not invoking the defaultdict( ) function in the format
collections.defaultdict( ).
A module is a single Python file, saved with the .py extension. Assume that a module
named sum has the following functions, and that the following code is saved as
sum.py.
def add(x, y):
    return x + y

def mul(a, b):
    return a * b

def sub(x, y):
    return x - y

def div(a, b):
    return a / b
The above module sum is imported in the following program. Then, the functions in this
module are used.
import sum
n1 = int(input("Enter the first number:"))
n2 = int(input("Enter the second number:"))
print("The result of add( ): ", sum.add(n1, n2))
print("The result of mul( ): ", sum.mul(n1, n2))
Output
Enter the first number: 30
Enter the second number: 10
The result of add( ): 40
The result of mul( ): 300
A package contains a group of module files, together with an __init__.py file. All the
components of a package are placed in a single directory. The package directory has to be
differentiated from an ordinary directory; for this purpose, every package directory contains
an __init__.py file. The Package Installer for Python (pip)
is used for installing a package.
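For example, a package such as NumPy (named here only for illustration) can be installed from the operating-system command line:
pip install numpy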
Python allows the handling of the exceptions raised in the invoked functions.
1.14 Whitespace Formatting
Whitespace is ignored inside parentheses and brackets. This can be helpful for long-
winded computations:
long_winded_computation = (1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+
16+17+18+19+20)
Whitespace can be used for making code easier to read:
list_of_lists = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
easier_to_read_list_of_lists = [ [1, 2, 3],
[4, 5, 6],
[7, 8, 9] ]
We can also use a backslash to indicate that a statement continues onto the next line.
two_plus_three = 2 + \
3
1.15 Object-oriented Programming in Python
In object-oriented programming, the focus is on the data. The code that works with the
data is grouped with it into a single entity called a class. A class contains member data
and methods. Some of the principles of object-oriented programming are listed below:
a) Abstraction: Abstraction allows the Programmer to use the object without knowing the
details of the object.
b) Inheritance: Inheritance allows one class to derive the functionality and characteristics of
another class.
c) Encapsulation: Encapsulation binds all the data members and methods into a single entity
called a class.
d) Polymorphism: Polymorphism allows the same entity to be used in different forms. Both
compile-time polymorphism and run-time polymorphism are possible in Python.
1.16 Classes and Objects in Python
A class is a collection of data and the methods that interact with those data. An instance
of a class is known as an object. Each data type in Python, namely the integers, floats, strings,
booleans, lists, sets and dictionaries is an object. We will now develop a class to represent a
circle.
All we need to describe a circle is its radius. Let us also include the colour, to make it easier
later to distinguish between the different instances of the class Circle. Here, the data attributes
of the class Circle are radius and color. The class Circle shall now be defined with a
constructor and a utility method for calculating the area, as shown below:
>>> class Circle(object):
        def __init__(self, radius, color):
            self.radius = radius
            self.color = color
        def calculate_area(self):
            area = 3.14 * self.radius * self.radius
            return area
We shall now create an object named playground whose radius is 400 meters and the preferred
color is green. We shall also calculate its area.
>>> playground = Circle(400, 'green')
>>> area1 = playground.calculate_area( )
>>> print(area1)
502400.0
We will now develop a class to represent a rectangle.
All we need to describe a Rectangle is its length and breadth. Here, the data attributes of the
class Rectangle are length and breadth. The class Rectangle shall now be defined with a
constructor and a utility method, as shown below:
>>> class Rectangle(object):
        def __init__(self, length, breadth):
            self.length = length
            self.breadth = breadth
        def calculate_perimeter(self):
            perimeter = 2 * (self.length + self.breadth)
            return perimeter
We shall now create an object named classroom whose length is 40 feet and breadth is
20 feet. We shall then calculate its perimeter.
>>> classroom = Rectangle(40, 20)
>>> perimeter = classroom.calculate_perimeter( )
>>> perimeter
120
2.1 Introduction
Lists, Tuples, Dictionaries and Data Frames are the important data structures available
in Python. All of them are often used now for analyzing data. We will briefly discuss them one
by one.
2.2 Lists
A list is an ordered sequence of comma-separated values of any data type. The values in
a list are written between square brackets. A value in a list can be accessed by specifying its
position in the list. The position is known as index. Here is an example.
>>> list1 = [1, 2, 3, 4, 5]
>>> list1 [0]
1
The lists are mutable. This means that the elements of a list can be changed at a later stage, if
necessary.
>>> list2 = [10, 12, 14, 18, 30, 47]
>>> list2[0] = 20
>>> list2
[20, 12, 14, 18, 30, 47]
Each element of a List can be accessed via an index, as we have seen above. The elements in
a List are indexed from 0. Backward indexing from -1 is also valid, exactly as for the Tuples
discussed later. Python provides several built-in methods for manipulating Lists. The append( )
method inserts a single element at the end of a List, and the extend( ) method inserts all the
elements of a given sequence at the end. Further methods are illustrated below:
4. While the append( ) method and the extend( ) method insert the element(s) at the end of
the List, the insert( ) method inserts an element somewhere in between or any position of
your choice. Example:
>>> t1 = ['a', 'e', 'u']
>>> t1.insert(2, 'i')
>>> t1
['a', 'e', 'i', 'u']
5. The pop( ) method removes an element from a given position in the List and returns it. If
no index is specified, this method removes and returns the last item in the List. Example:
>>> t2 = ['k', 'a', 'e', 'i', 'p', 'q', 'u']
>>> ele1 = t2.pop(0)
>>> ele1
'k'
>>> t2
['a', 'e', 'i', 'p', 'q', 'u']
>>> ele2 = t2.pop( )
>>> ele2
'u'
>>> t2
['a', 'e', 'i', 'p', 'q']
6. The pop( ) method removes an element whose position is given. But, what if you know
the value of the element to be removed, but you don’t know its index (i.e. position) in the
List. Well, Python provides the remove( ) method for this purpose. This method removes
the first occurrence of given item from the List. Example:
>>> t3 = ['a', 'e', 'i', 'p', 'q', 'a', 'q', 'p']
>>> t3.remove('q')
>>> t3
['a', 'e', 'i', 'p', 'a', 'q', 'p']
7. The clear( ) method removes all the items from the given List. The List becomes an empty
List after this method is applied to it. Example:
>>> t4 = [2, 4, 5, 7]
>>> t4.clear( )
>>> t4
[]
8. The count( ) method returns the number of times a given item occurs in the List.
Example:
>>> t5 = [13, 18, 20, 10, 18, 23]
>>> t5.count (18)
2
9. The reverse( ) method just reverses the items in a List. It does not return anything.
Example:
>>> t6 = ['e', 'i', 'q', 'a', 'q', 'p']
>>> t6.reverse( )
>>> t6
['p', 'q', 'a', 'q', 'i', 'e']
10. The sort( ) method sorts the items in a List in the ascending order by default. It does not
return anything. If we want this method to sort the items in the descending order, we have
to include the argument reverse = True. Example:
>>> t7 = ['e', 'i', 'q', 'a', 'q', 'p']
>>> t7.sort( )
>>> t7
['a', 'e', 'i', 'p', 'q', 'q']
>>> t7.sort(reverse=True)
>>> t7
['q', 'q', 'p', 'i', 'e', 'a']
Programming Example 2.1
The following program reads in the marks obtained by a student in 5 different subjects,
stores them in a List and then calculates the average mark.
# Program to calculate the average
mark_list = eval(input("Enter the marks in 5 subjects:"))
length = len(mark_list)
total = 0
for i in range(0, length):
    total = total + mark_list[i]
average = total / length
print("Mean = ", average)
Output
Enter the marks in 5 subjects: [70, 60, 80, 100, 90]
Mean = 80.0
Note: Here, the mean can also be found by using the built-in function mean( ), as shown below:
# Program using built-in function
import statistics
mark_list = eval(input("Enter the marks in 5 subjects: "))
average = statistics.mean(mark_list)
print("Mean = ", average)
Output
Enter the marks in 5 subjects: [70, 60, 80, 100, 90]
Mean = 80
Programming Example 2.2
The following program reads in the salaries of 5 employees in a Startup,
stores them in a List and then calculates the median salary.
# Program to calculate the median
salaries = eval(input("Enter the salaries of 5 employees:"))
salaries.sort( )    # the median is defined on the sorted values
size = len(salaries)
if size % 2 == 0:
    mid = size // 2
    highelement = salaries[mid]
    lowerelement = salaries[mid - 1]
    average = (lowerelement + highelement) / 2
    print("Median Salary =", average)
else:
    mid = size // 2
    average = salaries[mid]
    print("Median Salary =", average)
Output
Enter the salaries of 5 employees: [50000, 40000, 70000, 90000, 100000]
Median Salary = 70000
Note: Here, the median can also be found by using the built-in function median( ), as shown
below:
# Program using built-in function
import statistics
salaries = eval(input("Enter the salaries of 5 employees: "))
average = statistics.median(salaries)
print("Median Salary =", average)
Output
Enter the salaries of 5 employees: [50000, 40000, 70000, 90000, 100000]
Median Salary = 70000
2.6 Tuples
A Tuple is an ordered sequence of comma-separated values of any data type. The values
in a Tuple are written between round brackets. Tuples are just like Lists. The only difference
is that once we define a Tuple, we cannot change its contents. If we have to change the contents
of a Tuple, we have to store the changed contents as a new Tuple. For this reason,
Tuples are said to be immutable. Here are some Tuples:
>>> tuple1 = (2, 4.7, 10, 10.1)
>>> tuple2 = ('a', 'b', 1, 2.5, 3)
As in the case of Lists, each element of a Tuple can be accessed via an index. The elements in
a Tuple are indexed from 0. Backward indexing from -1 is also valid. The following table
represents the relationship between the index and the elements in the following Tuple:
T = ('a', 'e', 'i', 'o', 'u')
Element Index Negative Index
'a' 0 -5
'e' 1 -4
'i' 2 -3
'o' 3 -2
'u' 4 -1
Here, T[0] is 'a'
T[2] is 'i'
T[4] is 'u'
T[-1] is 'u'
T[-3] is 'i'
T[-5] is 'a'
A Tuple, T, can be created by using the tuple( ) function, as shown below:
>>> T = tuple(<sequence>)
Here, the sequence can be a list, a string or another tuple.
Examples:
>>> tuple1 = (1, 5, 10)
>>> tuple2 = tuple([1, 2.5, 3.7, 'a', 'b'])
>>> tuple3 = tuple("hello")
>>> tuple3
('h', 'e', 'l', 'l', 'o')
2.7 Reading in the elements of a Tuple through the keyboard
The elements of a Tuple shall be provided through the keyboard by using the input( ) function
together with the tuple( ) function, as shown below:
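One common sketch splits the typed line into words and converts the result to a Tuple (note that the elements are read in as strings; apply int( ) to each element if numbers are required):
>>> T = tuple(input("Enter the elements: ").split( ))
>>> T
('10', '20', '30')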
Comparing Tuples
We can compare two Tuples without having to write code with loops for achieving
it.
>>> a = (2, 3)
>>> b = (2, 3)
>>> c = ('2', '3')
>>> d = (2.0, 3.0)
>>> e = (2, 3, 4)
>>> a == b
True
>>> a == c
False
>>> a == d
True
>>> a<e
True
Unpacking a Tuple
Creating a Tuple from a set of values is called packing. Its reverse (i.e. creating
individual values from a Tuple's elements) is called unpacking.
>>> t = (1, 2, 'A', 'B')
>>> (w, x, y, z) = t
>>> w
1
>>> x
2
>>> y
'A'
>>> z
'B'
Deleting a Tuple
The del statement is used to delete a Tuple. We can delete a complete Tuple. But we
cannot delete the individual elements of a Tuple, as the Tuples are immutable.
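A minimal sketch (the error behaviour is noted in the comments):
>>> t = (1, 2, 3)
>>> del t[0]    # raises TypeError, as Tuples are immutable
>>> del t       # deletes the whole Tuple
>>> t           # now raises NameError, since t no longer exists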
2.9 Tuple Methods
Python offers many built-in functions and methods for performing Tuple manipulations.
We will now briefly learn about the important built-in methods that are used for Tuple
manipulations.
1. len( ) method
This method returns the length of a Tuple (i.e. the number of elements in a Tuple).
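For example:
>>> T = ('a', 'e', 'i', 'o', 'u')
>>> len(T)
5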
2.10 Sets
A Set is a collection of distinct elements. As in the case of Lists and Tuples, the elements
of a Set can be of different data types (i.e., values of different data types can be placed within
a Set). Unlike Lists and Tuples, Sets are unordered. This means that Sets do not record the
element position. Sets only have unique elements; there is only one of a particular
element in a Set. To define a Set, you have to place its
elements within curly brackets, as shown below:
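A minimal example, consistent with the Set A used below:
> album_set = {"Thriller", "Back in Black", "AC/DC"}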
Let us go over the Set operations. These can be used to change the Set. Consider the
Set, A, given below:
> A = {"Thriller", "Back in Black", "AC/DC"}
We can add an item to a Set by using the add( ) method.
> A.add("NSYNC")
> A
{"AC/DC", "Back in Black", "NSYNC", "Thriller"}
We can remove an item from a Set by using the remove( ) method.
> A.remove("NSYNC")
> A
{"AC/DC", "Back in Black", "Thriller"}
We can verify whether an element is in the Set by using the in operator, as follows:
> A = {"AC/DC", "Back in Black", "Thriller"}
> "AC/DC" in A
True
These are the types of Mathematical Set operations. There are other operations that we
can do. For example, we can find the union and the intersection of two Sets.
> album_set_1 = {"AC/DC", "Back in Black", "Thriller"}
> album_set_2 = {"AC/DC", "Back in Black", "The Dark Side of the Moon"}
> album_set_3 = album_set_1 & album_set_2
> album_set_4 = album_set_1.union(album_set_2)
> album_set_3
{"AC/DC", "Back in Black"}
> album_set_4
{"AC/DC", "Back in Black", "Thriller", "The Dark Side of the Moon"}
Here, all the elements of album_set_3 are in album_set_1. We can check whether a Set is a
subset of another Set by using the issubset( ) method. Here is an example:
> album_set_3.issubset(album_set_1)
True
2.11 Dictionaries
A Dictionary is a collection of key:value pairs. The elements in a Dictionary are written
between curly brackets. In the case of Lists and Tuples, an index is associated with each
element. But, in the case of a Dictionary, a key is associated with each element. The key should
be unique. In the case of Lists and Tuples, the index is used to access the elements. But, in the
case of a Dictionary, the key is used to access the elements.
Dictionaries are mutable. So, we can change some of the elements of a Dictionary and
then store the changed elements in the same Dictionary object. Dictionaries are unordered, as
no index is associated with the elements of a Dictionary. A Dictionary, D, can be created by
using a Command of the following form:
>>> <Dictionary-name> = {<key> : <value>, <key> : <value>, ...}
Example
>>> teachers = {"Benedict": "Maths", "Albert": "CS", "Andrew": "Commerce"}
An element in a Dictionary can be accessed by using the key, as illustrated below:
>>> teachers["Andrew"]
'Commerce'
Traversing a Dictionary
Traversal of a collection of values means accessing and processing each element of it.
The traversal can be done in the case of a Dictionary by using the for loop, as shown below:
>>> d1 = {5: "Number", "a": "String", (1, 2): "Tuple"}
>>> for key in d1:
        print(key, ":", d1[key])
a : String
(1, 2) : Tuple
5 : Number
Programming Example 2.3
We will now write a Program to create a Phone Dictionary for our friends and then
print it.
>>> PhoneDirectory = {"Jagdish": "94437 55625", "Bala": "96297 09185",
                      "Saravanan": "99947 49333"}
>>> for name in PhoneDirectory:
        print(name, ":", PhoneDirectory[name])
Dictionary Methods
Let us now briefly discuss the various built-in functions and methods that are provided
by Python to manipulate the elements of Dictionaries.
1. len( ) method
This method returns the number of key:value pairs in a Dictionary.
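For example, with the teachers Dictionary defined earlier:
>>> teachers = {"Benedict": "Maths", "Albert": "CS", "Andrew": "Commerce"}
>>> len(teachers)
3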
7. update( ) method
This method merges the key:value pairs of one dictionary into another, adding or replacing
entries as needed.
>>> Employee11 = {"name": "John", "salary": 10000, "age": 24}
>>> Employee12 = {"name": "David", "salary": 54000, "department": "Sales"}
>>> Employee11.update(Employee12)
>>> Employee11
{"salary": 54000, "department": "Sales", "name": "David", "age": 24}
Note:
The elements of the Employee12 dictionary have overridden the elements of the Employee11
dictionary having the same keys. So, the values associated with the keys "name" and "salary"
have been changed.
2.12 Default Dictionary
Imagine that you are trying to count the number of occurrences of the words in a
document. An obvious approach is to create a Dictionary in which the keys are words and the
values are counts. As you check each word, you can increment its count if it is already in the
Dictionary and add it to the Dictionary if it is not.
word_counts = { }
for word in document:
    if word in word_counts:
        word_counts[word] = word_counts[word] + 1
    else:
        word_counts[word] = 1
A second approach is to handle the exception that arises when we try to look up a missing key:
word_counts = { }
for word in document:
    try:
        word_counts[word] = word_counts[word] + 1
    except KeyError:
        word_counts[word] = 1
A third approach is to use the get( ) method, which behaves gracefully in the case of missing
keys:
word_counts = { }
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
Every one of these approaches is slightly unwieldy. In such cases, a defaultdict is helpful.
A defaultdict is like a regular dictionary, except that when you try to look up a key it doesn't
contain, it first adds a value for it by using a zero-argument function you provided when you
created it. In order to use defaultdicts, you have to import them from the collections module, as shown below.
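For example, the word-count loop shrinks to the following sketch (document is assumed to be the same list of words as before):
from collections import defaultdict
word_counts = defaultdict(int)    # int( ) produces the default value 0
for word in document:
    word_counts[word] += 1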
2.13 Exception Handling
When something goes wrong, Python raises an exception. If they are not handled,
exceptions will cause our Program to crash. We can handle the exceptions by using try and
except, as shown below:
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")
Suppose that a Person is not sure whether a Tuple is immutable or not. So, when he writes
code to alter the value of an element, he will make use of exception handling, as shown below:
try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a Tuple")
Suppose that a Person is not sure whether a key named “Kate” is present in a Dictionary or
not. So, when she writes code to access the value associated with the key “Kate”, she will
make use of exception handling so that her Program does not crash in case the key “Kate” is
not present in the Dictionary.
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no value for Kate!")
So far, we have discussed handling exceptions that are raised by default. Exceptions can
also be raised manually, as shown in the following example:
try:
    a = int(input("Enter the value of a:"))
    b = int(input("Enter the value of b:"))
    print("The value of a =", a)
    print("The value of b =", b)
    if (a - b) < 0:
        raise Exception("value of a-b is < 0")
except Exception as e:
    print("Received exception:", e)
Output
Enter the value of a : 15
Enter the value of b : 20
The value of a = 15
The value of b = 20
Received exception : value of a-b is < 0
2.14 Counter
A Counter turns a sequence of values into a defaultdict(int)-like object in which the keys
are mapped to counts.
from collections import Counter
c = Counter([0, 1, 2, 0])    # c is {0: 2, 1: 1, 2: 1}
Counter gives us a very simple way to solve our word-counts problem:
word_counts = Counter (document) # Here, document is a list of words
The most_common( ) method of Counter instance is often used in Natural Language
Processing.
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10) :
print (word, count)
2.15 List Comprehensions
Frequently, you will want to transform a list into another list by choosing only certain
elements, by transforming elements, or both. The Pythonic way to do this is with list
comprehensions.
even_numbers = [x for x in range(5) if x % 2 == 0]    # [0, 2, 4]
squares = [x * x for x in range(5)]                   # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers]          # [0, 4, 16]
We can similarly turn lists into dictionaries or sets, as shown below:
square_dict = {x: x * x for x in range(5)}    # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_set = {x * x for x in [1, -1]}         # {1}
If you don't need the value from the list, it is common to use an underscore as the variable name:
zeroes = [0 for _ in even_numbers]    # has the same length as even_numbers
A list comprehension can include multiple for loops:
pairs = [(x, y)
for x in range (10)
for y in range (10) ] # 100 pairs (0,0), (0,1), ..., (9, 8), (9,9)
The later for loops can use the results of earlier for loops:
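A typical sketch, in which the range of the inner loop depends on the loop variable of the outer loop (the name increasing_pairs is illustrative):
increasing_pairs = [(x, y)                    # only pairs with x < y
                    for x in range(10)
                    for y in range(x + 1, 10)]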
In a statically typed language, such as C++, the data types of the arguments and the
returned value of a function must be declared. Recent versions of Python (3.6 onwards)
accept similar type annotations, as shown below for a simple add( ) function:
def add(a: int, b: int) -> int:
    return a + b

add(10, 5)            # You'd like this to be OK
add("hi", "there")    # You'd like this to be not OK
However, these type annotations don't actually do anything at run time. You can
still use the annotated add( ) function to add two strings, even though it is wrong to do
so, and the call add(10, "five") will raise exactly the same TypeError that it would raise
without the annotations. That said, there are still two good reasons to use type annotations in your Python code:
1. Types are an important form of documentation. The second function stub given below is
more informative than the first one.
def dot_product(x, y): ...
def dot_product(x: Vector, y: Vector) -> float: ...
2. There are external tools, such as mypy, that will read your code, inspect the type
annotations, and let you know about type errors before you ever run your code. For
example, if you run mypy over a file containing add (“hi”, “there”), it will warn you, as
shown below:
error: Argument 1 to "add" has incompatible type "str"; expected "int"
Like assert testing, this is a good way to find mistakes in your code before you ever run
it.
How to write Type Annotations?
For built-in types such as int, float and bool, you just use the type itself as the annotation
(e.g. def add(a: int, b: int) -> int). What if you have a list?
def total(xs: list) -> float:
    return sum(xs)
This is not wrong. But the type is not specific enough. It is clear that we really want xs
to be a list of floats, not a list of strings.
The typing module provides a number of parameterized types that we can use to do just
this:
from typing import List    # Note the capital letter L
def total(xs: List[float]) -> float:
    return sum(xs)
Up until now, we have only specified annotations for function parameters and return
types. For variables themselves, it is usually obvious what the data type is:
x: int = 5    # The type annotation is not necessary here; it is obvious.
However, sometimes it is not obvious:
values = [ ]          # The data type is not clear here
best_so_far = None    # The data type is not clear here
In such cases, we supply inline type hints, as shown below:
from typing import List, Optional
values: List[int] = [ ]
best_so_far: Optional[float] = None    # allowed to be either a float or None
Chapter-3:
Case Study: DATASCIENCESTER
3.1 Introduction
Assume that you have been hired as a Data Scientist by the company named
DataSciencester, which runs a social network for Data Scientists. Assume further that the
VP of Networking wants you to write code to identify who the "key connectors" are among
the Data Scientists. In order to do this, he gives you a dump of data about the Data Scientists
who have joined DataSciencester's social network. What does this data dump look like? It
is a list of dictionaries, one per user; each dictionary contains a user's id and name. Here is that great list:
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"},
]
3.2 Friendship Network
The VP of Networking also gives you the "friendship" data, represented as a list of Tuples.
Each Tuple is a pair of ids of users who are friends with each other.
friendship_pairs = [(0,1), (0,2), (1,2), (1,3), (2,3), (3,4), (4,5), (5,6), (5,7), (6,8), (7,8), (8,9)]
Here, the first Tuple (0,1) indicates that the Data Scientist with id 0 (i.e., Hero) and the Data
Scientist with id 1 (i.e., Dunn) are friends. The friendship network can be pictured as a graph
whose nodes are the users and whose edges are the friendships.
Having friendships represented as a list of pairs, as shown above, is not the easiest way
to work with them. In order to find all the friendships of user 1 (i.e., Dunn), we have to iterate
over every pair looking for pairs containing 1. If we have a lot of pairs, this will take a long
time.
Instead, we will now create a Dictionary where the keys are User ids and the values are
lists of friend ids (Looking things up in a dictionary is very fast.) We still have to look at every
pair to create the dictionary, but we only have to do that once, and we will get fast lookups
after that:
# Initialize the dictionary with an empty list for each user id:
friendships = {user["id"]: [ ] for user in users}
# Loop over the friendship pairs to populate it:
for i, j in friendship_pairs:
friendships[i].append(j) # Add j as a friend of user i
friendships[j].append(i) # Add i as a friend of user j
The friendships dictionary now looks like this:
{0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4, 6, 7], 6: [5, 8], 7: [5, 8],
8: [6, 7, 9], 9: [8]}
We have the friendships in a dictionary now. So, we can ask questions, like “What is the
average number of connections?” First we have to find the total number of connections, by
summing up the lengths of all the friends lists:
def number_of_friends(user): # How many friends does user have?
user_id = user["id"]
friend_ids = friendships[user_id]
return len(friend_ids)
total_connections = sum(number_of_friends(user) for user in users) # 24
num_users = len(users) # Number of users
avg_connections = total_connections / num_users # 24 / 10 = 2.4
It is also easy to find the most connected people. They are the people who have the largest
number of friends. In order to find them, we have to sort the Users from those who have most
friends to those who have least friends.
# Create a list of Tuples of the form (user_id, number_of_friends).
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
num_friends_by_id.sort (
key = lambda id_and_friends: id_and_friends[1], reverse=True)
# Each pair is of the form (user_id, num_friends):
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
Here, the Users 1, 2, 3, 5 and 8 are the most connected people. They are somehow central
to the Data Scientists’ Network. What we have just now computed is the Network Metric
named degree centrality.
You are also given interest data, represented as a list of (user_id, interest) pairs:
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
Here, we observe the Users 0 and 9 (i.e., Hero and Klein) have no friends in common. But they
share interests in Java and Big Data. It is now easy to build a function that finds Users with a
certain interest:
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
The above code will work. But it has to examine the whole list of interests for every
search. If we have a lot of Users and interests (or if we want to do a lot of searches), it is
preferable to build an index from interests to users and another index from users to interests.
from collections import defaultdict
# Keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
user_ids_by_interest[interest].append(user_id)
# Building an index from users to interests
# Keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)
for user_id, interest in interests :
interests_by_user_id [user_id].append (interest)
Now, it is easy to find who has the most interests in common with a given user: iterate
over the user's interests; for each interest, iterate over the other users who have that interest;
and keep count of how many times we see each other user.
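A minimal sketch of these three steps, using the two indexes built above (the function name most_common_interests_with is illustrative):
from collections import Counter

def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"])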
It seems clear from the salary data that people with more experience tend to earn
more. How can you turn this into a fun fact? Your first idea is to look at the average salary for
each tenure:
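The salaries_and_tenures list itself is not listed above; a sketch of it, reconstructed from the averages reported below (each pair is (salary, tenure)):
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]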
# Keys are years, values are lists of the salaries for each tenure.
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
salary_by_tenure[tenure].append(salary)
# Keys are years, each value is average salary for that tenure.
average_salary_by_tenure = {
tenure: sum(salaries) / len(salaries)
for tenure, salaries in salary_by_tenure.items()
}
The above Code segment will generate the following output:
{0.7: 48000.0, 1.9: 48000.0, 2.5: 60000.0, 4.2: 63000.0,
6: 76000.0, 6.5: 69000.0, 7.5: 76000.0, 8.1: 88000.0,
8.7: 83000.0, 10: 83000.0}
The above output is not useful, as we are just reporting the individual Users’ salaries. This is
due to the reason that no two of the Users have the same tenure. It may be more helpful to
bucket the tenures:
def tenure_bucket(tenure):
if tenure < 2:
return "less than two"
elif tenure < 5:
return "between two and five"
else:
return "more than five"
Then, we can group together the salaries corresponding to each bucket:
# Keys are tenure buckets, values are lists of salaries for that bucket.
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
bucket = tenure_bucket(tenure)
salary_by_tenure_bucket[bucket].append(salary)
Finally, we can compute the average salary for each group:
# Keys are tenure buckets, values are average salary for that bucket.
average_salary_by_bucket = {
tenure_bucket: sum(salaries) / len(salaries)
for tenure_bucket, salaries in salary_by_tenure_bucket.items( )
}
With more data and more mathematics, we can build a model predicting the likelihood
that a User will pay based on his years of experience. We will build such a model later on by
using the concept of Logistic Regression.
3.6 Topics of Interest
Suppose that the VP of Content Strategy asks you for data about what topics Users are
most interested in, so that she can plan out her blog calendar accordingly. You already have
the new data given below:
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"), (1, "Postgres"),
⁞
(8, "neural networks"), (8, "deep learning"), (8, "Big Data"),
(8, "artificial intelligence"),
(9, "Hadoop"), (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
One simple way to find the most popular interests is to count the words:
1. Express each interest in lowercase. (For example, "Java" should be changed to "java".)
2. Split each interest into words. (For example, "deep learning" should be split into "deep"
and "learning".)
3. Count the results. (For example, "java" occurs 3 times, "hbase" occurs 2 times.)
The above three steps can be implemented by using the following Code:
words_and_counts = Counter( word
for user, interest in interests
for word in interest.lower( ).split( ) )
Now, it is easy to list out the words that occur more than once:
for word, count in words_and_counts.most_common( ):
if count > 1:
print(word, count)
The following output will be generated by the above Code. We will learn later on more
sophisticated ways to extract topics from data by using Natural Language Processing:
learning 3
java 3
python 3
big 3
data 3
hbase 2
regression 2
cassandra 2
statistics 2
probability 2
hadoop 2
networks 2
machine 2
neural 2
scikit-learn 2
r 2
For example, here are the shared interests for the first few users; next to each interest are the
ids of the other users who share it:
User 0 has the most interests in common with User 9:
Hadoop (9), Big Data (8, 9), HBase (1), Java (5, 9), Spark, Storm, Cassandra (1)
User 1 has the most interests in common with User 0:
NoSQL, MongoDB, Cassandra (0), HBase (0), Postgres
User 2 has the most interests in common with Users 3, 5 and 7:
Python (3, 5), scikit-learn (7), scipy, numpy, statsmodels, pandas
User 3 has the most interests in common with Users 5 and 6:
R (5), Python (2, 5), statistics (6), regression (4), probability (6)
User 4 has the most interests in common with User 7:
machine learning (7), regression, decision trees, libsvm
Collecting, for every user, the other users with the maximum overlap gives:
Counter({0: [9], 1: [0], 2: [3, 5, 7], 3: [5, 6], 4: [7], 5: [3, 9], 6: [3], 7: [2, 4, 8], 8: [0, 7, 9], 9: [0]})
Chapter-4:
VISUALIZING DATA
4.1 Introduction
Making plots and visualizations is one of the most important tasks in data analysis. It
may be a part of the exploratory process. For example, it may help us to identify the outliers,
do required data transformations or come up with ideas for models.
A wide variety of tools exist for visualizing data. We will use the matplotlib library,
which is widely used now. This library is not part of the core Python library. In order to use
matplotlib interactively, we can start IPython in pylab mode by using the command ipython --pylab.
The matplotlib.pyplot module is to be imported by using the following command:
> import matplotlib.pyplot as plt
> fig = plt.figure( )
> ax = fig.add_subplot(1, 1, 1)
> from numpy.random import randn
> ax.plot(randn(1000).cumsum( ))
Note-1
We can’t make a plot with a blank figure. We have to create one or more subplots by
using the add_subplot( ) function, as shown below:
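A sketch consistent with the names ax1, ax2 and ax3 used in Note-2 below, creating three subplots in a 2 x 2 grid:
> fig = plt.figure( )
> ax1 = fig.add_subplot(2, 2, 1)
> ax2 = fig.add_subplot(2, 2, 2)
> ax3 = fig.add_subplot(2, 2, 3)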
Note-2
We can draw a Histogram, a Scatter Diagram and a Line Diagram in the above subplots
by using the following commands ('k' stands for the colour black):
> ax1.hist(randn(100), bins=20, color='k', alpha=0.3)
> ax2.scatter(np.arange(30), np.arange(30) + 3 * randn(30))
> ax3.plot(randn(50).cumsum( ), 'k--')
In order to change the X axis ticks, we have to use the set_xticks() and set_xticklabels()
methods. The set_xticks( ) method instructs matplotlib where to place the ticks along the data
range. By default, these locations will also be the labels. But, we can set any other values as
the labels by using the set_xticklabels( ) method.
> ticks = ax.set_xticks([0, 250, 500, 750, 1000])
> labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'],
                              rotation=30, fontsize='small')
Lastly, the set_xlabel( ) method gives a name to the X axis and the set_title( ) method gives
the subplot title.
> ax.set_title('My first matplotlib plot')
> ax.set_xlabel('Stages')
Modifying the Y axis consists of the same process, substituting y for x in the above
commands.
Legends are another critical element for identifying the plot elements. We can pass a
label argument to each call of the plot( ) function, as shown below.
> fig = plt.figure( )
> ax = fig.add_subplot(1, 1, 1)
> ax.plot(randn(1000).cumsum( ), 'k', label='one')
> ax.plot(randn(1000).cumsum( ), 'k--', label='two')
> ax.plot(randn(1000).cumsum( ), 'k.', label='three')
Once we do this, we can invoke ax.legend( ) or plt.legend( ) to automatically create a legend:
> ax.legend(loc='best')
Here, loc tells matplotlib where to place the legend. The value 'best' tells matplotlib to
choose a location that is most out of the way.
4.4 Annotations
In addition to the standard plot types, you may wish to draw your own annotations. The
annotations can consist of text, arrows, or other shapes. Annotations and text can be added by
using the text, arrow and annotate functions. The text( ) function draws text at the given
coordinates (x, y) on the plot with optional custom styling:
> ax.text(x, y, 'Hello world!', family='monospace', fontsize=10)
Annotations can draw both text and arrows arranged appropriately. For example, in the
following figure about the share market trend, the annotations “Peak of bull market”, “Bear
Stearns Falls” and “Lehman Bankruptcy” are shown with arrows at the appropriate places.
Figures can be saved to a file by using the savefig( ) method. Since savefig( ) also accepts any file-like object, this option is useful for serving dynamically-generated images over the Web.
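A brief sketch (the file name is illustrative):
> plt.savefig('figpath.png', dpi=400)    # save the active figure to a file
> from io import BytesIO
> buffer = BytesIO( )
> plt.savefig(buffer)                    # write the figure to an in-memory buffer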
4.6 Drawing a Line Chart
The following code makes use of the plot( ) function to draw a simple Line Chart.
> from matplotlib import pyplot as plt
> years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
> gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
> plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
> plt.title("Nominal GDP")       # A title is added
> plt.ylabel("Billions of $")    # A label is added to the y-axis
> plt.show( )
A Bar Chart can also be a good choice for plotting Histograms of bucketed numeric
values, in order to visually explore how the values are distributed.
> from collections import Counter
> grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
> # Bucket grades by decile, but put 100 in with the 90s
> histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)
> plt.bar([x + 5 for x in histogram.keys( )],    # Shift bars right by 5
          histogram.values( ),                   # Give each bar its correct height
          10,                                    # Give each bar a width of 10
          edgecolor=(0, 0, 0))                   # Black edges for each bar
> plt.axis([-5, 105, 0, 5])                  # X-axis from -5 to 105, Y-axis from 0 to 5
> plt.xticks([10 * i for i in range(11)])    # X-axis labels at 0, 10, ..., 100
> plt.xlabel("Decile")
> plt.ylabel("# of Students")
> plt.title("Distribution of Exam 1 grades")
> plt.show( )
In the above code, the third argument to plt.bar( ) method specifies the bar width. Here,
we chose a width of 10, to fill the entire decile. We also shifted the bars right by 5, so that, for
example, the “10” bar (which corresponds to the decile 10-20) will have its center at 15 and
hence occupy the correct range. We also added a black edge to each bar to make them visually
distinct.
In the above code, the arguments of plt.axis( ) method indicate that we want the x-axis
to range from –5 to 105 (just to leave a little space on the left and right), and that the y-axis
should range from 0 to 5. The call to the plt.xticks( ) method puts x-axis labels at 0, 10, 20, ...,
100.
4.8 Line Charts with Legends
The following Code draws several Line Charts with a legend for each. The Line Charts
are a good choice for showing trends, as illustrated in the following figure.
> variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
> bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
> total_error = [x + y for x, y in zip(variance, bias_squared)]
> xs = [i for i, _ in enumerate(variance)]
> # We can make multiple calls to the plt.plot( ) method
> # to show multiple series on the same chart
> plt.plot(xs, variance, 'g-', label='variance')        # Green solid line
> plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # Red dot-dashed line
> plt.plot(xs, total_error, 'b:', label='total error')  # Blue dotted line
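A plausible completion, which draws the legend and labels the chart (the loc value and the label texts are illustrative; loc=9 places the legend at the top centre):
> plt.legend(loc=9)
> plt.xlabel("model complexity")
> plt.title("The Bias-Variance Tradeoff")
> plt.show( )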
5.1 Introduction
NumPy, short for Numerical Python, is the fundamental package that is required for high
performance scientific computing and data analysis. It is the foundation on which many higher-
level tools are built in Python. Here are some of the things that the NumPy package provides:
1) ndarray, a fast and space-efficient multidimensional array that provides vectorized
arithmetic operations and sophisticated broadcasting capabilities
2) Standard mathematical and statistical functions for fast operations on entire arrays of data
without having to write loops
3) Tools for reading / writing array data to disk and working with memory-mapped files.
While the NumPy package by itself does not provide much high-level data analytical
functionality, an understanding of ndarrays and array-oriented computing will help
you to use tools like Pandas much more effectively. You will learn the following concepts in
the subsequent Sections. These concepts are helpful for most data analysis
operations.
1) Fast vectorized array operations for data munging and cleaning, subsetting and filtering,
transformation, and any other kinds of computations
2) Common array methods for performing sorting and set operations
3) Efficient descriptive statistics and aggregating/summarizing data
4) Data alignment and relational data manipulations for merging and joining together
heterogeneous data sets
5) Group-wise data manipulations (aggregation, transformation etc.).
While the NumPy package provides the computational foundation for all the above
operations, we have to mainly use the Pandas Package as our basis for performing most kinds
of data analysis (especially for structured or tabular data such as a Data Frame) as it provides
a rich, high-level interface making most common data tasks very concise and simple.
5.2 The NumPy ndarray
One of the key features of the NumPy package is the ndarray. It is an N-dimensional array object,
and a fast, flexible container for large data sets in Python. We can perform mathematical
operations on whole arrays exactly in the same way in which we perform such operations on scalar
values. Here is an example:
> import numpy as np
> data = [[0.9526, -0.246, -0.8856], [0.5639, 0.2379, 0.9104]]
> array1 = np.array(data)
> array1
array([[0.9526, -0.246, -0.8856],
       [0.5639, 0.2379, 0.9104]])
> array1 * 10
array([[9.5256, -2.4601, -8.8565],
       [5.6385, 2.3794, 9.104]])
> array1 + array1
array([[1.9051, -0.492, -1.7713],
       [1.1277, 0.4759, 1.8208]])
> array1.shape
(2, 3)    # Number of rows and columns
> array1.dtype
dtype('float64')
The NumPy arange( ) function is similar to the built-in Python range( ) function. While the range( )
function produces a sequence of plain integers, the arange( ) function returns an ndarray.
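A quick illustration:
> import numpy as np
> np.arange(5)
array([0, 1, 2, 3, 4])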
5.4 Vectorization Operation
Arrays enable us to express batch operations on data without writing any for loops. This
is usually called vectorization. Any arithmetic operations between equal-size arrays applies
the operation elementwise. Operations between differently-sized arrays is called
broadcasting. We will discuss it later.
> arr = np.array([[1., 2., 3.], [4., 5., 6.]])
> arr
array ( [ [1., 2., 3.],
[4., 5., 6.] ] )
> arr * arr
array ( [ [1., 4., 9.],
[16., 25., 36.] ] )
> arr - arr
array ( [ [0., 0., 0.],
[0., 0., 0.] ] )
Arithmetic operations with scalars will be done as we expect. The value will be propagated to
each element.
> 1/arr
array ( [ [1., 0.5, 0.3333],
[0.25, 0.2, 0.1667] ] )
> arr ** 0.5
array ( [ [1., 1.4142, 1.7321]
[2., 2.2361, 2.4495] ] )
5.5 Array Indexing and Slicing
NumPy’s ndarray’s elements are indexed from 0. They shall be accessed exactly in the
way in which the elements in a List are accessed.
> arr = np.arange(10)
> arr
array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
> arr[5]
5 # Value at index 5 is 5
> arr[5:8]
array( [5, 6, 7] ) # This is called Array Slicing
> arr[5:8] = 12 # 12 is assigned to arr[5], arr[6] and arr[7]
> arr
array( [0, 1, 2, 3, 4, 12, 12, 12, 8, 9] )
Here, we assign the scalar value 12 to the array slice [5, 6, 7]. Here, the value 12 is
propagated (i.e., broadcasted) to the entire slice. An important distinction from lists is that the
array slices are views on the original array. This means that the data are not copied, and any
modification to the view will be reflected in the source array.
> arr_slice = arr[5:8]
> arr_slice[1] = 12345
> arr
array( [0, 1, 2, 3, 4, 12, 12345, 12, 8, 9] )
> arr_slice[:] = 64
> arr
array( [0, 1, 2, 3, 4, 64, 64, 64, 8, 9] )
If we want a copy of a slice of an ndarray instead of a view, we need to explicitly
copy the array (an example: arr[5:8].copy( )).
In a two-dimensional array, the element at each index is an one-dimensional array, as
shown below:
> arr2d = np.array( [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ] )
> arr2d[0]
array( [1, 2, 3])
> arr2d[0][2]
3
> arr2d [0, 2] # Alternative method of accessing an element
3
The element at each index of a three-dimensional array is a two-dimensional array, as
illustrated below:
> arr3d = np.array( [ [ [1, 2, 3], [4, 5, 6] ], [ [7, 8, 9], [10, 11, 12] ] ] )
> arr3d[0]
array([[1, 2, 3],
       [4, 5, 6]])
Both the scalar values and arrays can be assigned to arr3d[0], as shown below:
> old_values = arr3d[0].copy()
> arr3d[0] = 42 # a scalar value is assigned
> arr3d
array( [ [ [42, 42, 42],
[42, 42, 42] ],
[ [7, 8, 9],
[10, 11, 12] ] ] )
> arr3d[0] = old_values # An array is assigned
> arr3d
array( [ [ [1, 2, 3],
[4, 5, 6]],
[ [7, 8, 9],
[10, 11, 12] ] ] )
The cumsum( ) and cumprod( ) methods compute cumulative sums and cumulative products
along a given axis. For arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]):
> arr.cumsum(0)
array( [ [0, 1, 2],
[3, 5, 7],
[9, 12, 15] ] )
> arr.cumprod(1)
array( [ [0, 0, 0],
[3, 12, 60],
[6, 42, 336] ] )
> arr = randn(8)
> arr
array( [0.6903, 0.4678, 0.0968, -0.1349, 0.9879, 0.0185, -1.3147, -0.5425] )
> arr.sort( )
> arr
array( [-1.3147, -0.5425, -0.1349, 0.0185, 0.0968, 0.4678, 0.6903, 0.9879] )
We can get a 4 by 4 matrix (i.e., array) of random numbers from the Standard Normal
Distribution, as shown below:
> samples = np.random.normal (size = (4, 4) )
> samples
array ( [ [ 0.1241, 0.3026, 0.5238, 0.0009],
[ 1.3438, -0.7135, -0.8312, -2.3702],
[-1.8608, -0.8608, 0.5601, -1.2659],
[ 0.1198, -1.0635, 0.3329, -2.3594] ] )
A pandas Series is a one-dimensional array-like object containing a sequence of values and
an associated array of labels, called its index. We can create a Series with a custom index,
as shown below:
> from pandas import Series
> obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
> obj2
d 4
b 7
a -5
c 3
> obj2.values
array ( [4, 7, -5, 3] )
> obj2.index
Index ( [d, b, a, c], dtype = object)
> obj2['a']
-5
> obj2['d'] = 6
> obj2[['c', 'a', 'd']]
c 3
a -5
d 6
> obj2 [ obj2 > 0]
d 6
b 7
c 3
> obj2 * 2
d 12
b 14
a -10
c 6
> np.exp(obj2)
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
> ‘b’ in obj2
True
> ‘e’ in obj2
False
If we have data contained in a Python dictionary, we can create a Series from it by passing
the dictionary to the Series( ) method as its argument.
> sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
> obj3 = Series(sdata)
> obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
In the above example, the dictionary's keys, in sorted order, are taken as the index of the
resulting Series. We can also specify the index ourselves when creating a Series, as shown below:
> states = ['California', 'Ohio', 'Oregon', 'Texas']
> obj4 = Series(sdata, index=states)
> obj4
California NaN
Ohio 35000
Oregon 16000
Texas 71000
In the above example, three values (i.e., 35000, 16000 and 71000) found in sdata were
placed in the appropriate locations. But since no value for ‘California’ was found in sdata, it
appears as NaN (not a number) in the above output.
An important Series feature for many applications is that it automatically aligns
differently-indexed data in arithmetic operations.
> obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
> obj4
California NaN
Ohio 35000
Oregon 16000
Texas 71000
> obj3 + obj4
California NaN
Ohio 70000
Oregon 32000
Texas 142000
Utah NaN
Both the Series object itself and its index have a name attribute. This attribute integrates
with other key areas of Pandas functionality.
> obj4.name = 'population'
> obj4.index.name = 'state'
> obj4
state
California NaN
Ohio 35000
Oregon 16000
Texas 71000
Name: population
A Series's index can be altered in place by assignment, as illustrated below:
> obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
> obj
Bob 4
Steve 7
Jeff -5
Ryan 3
A DataFrame can be created from a dictionary of equal-length lists, as shown below:
> from pandas import DataFrame
> data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
          'year': [2000, 2001, 2002, 2001, 2002],
          'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
> frame = DataFrame(data)
> frame
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In the above example, the index is assigned automatically as 0, 1, 2, 3 and 4. The columns
are placed in sorted order. (i.e., pop, state, year). However, if we specify a sequence of
columns, the DataFrame’s columns will be exactly what we pass.
> DataFrame(data, columns=['year', 'state', 'pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
As in the case of Series, if we pass a column that is not contained in data, it will appear
with NaN values in the result.
> frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five'])
> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
> frame2.columns
Index ( [year, state, pop, debt], dtype = object)
A column in a DataFrame can be retrieved as a Series either by dictionary-like notation
or by attribute:
> frame2 [‘state’]
one Ohio
five Nevada
> frame2.year
one 2000
⁞
five 2002
Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dictionary.
> frame2['eastern'] = frame2.state == 'Ohio'
> frame2
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
> del frame2 [‘eastern’]
> frame2.columns
Index ( [year, state, pop, debt], dtype = object)
> pop = {'Nevada': {2001: 2.4, 2002: 2.9},
         'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
Here, pop is a nested dictionary of dictionaries. When it is passed as the data to the
DataFrame( ) method, it will interpret the outer dictionary keys as the columns and the inner
keys as the row indices.
> frame3 = DataFrame( pop)
> frame3
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
We can transpose the above result, if we wish to do so.
> frame3.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
The keys in the inner dictionaries are unioned and sorted to form the index in the result.
This is not true if an explicit index is specified.
> DataFrame (pop, index = [2001, 2002, 2003])
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
The dictionaries of Series are treated much in the same way.
> pdata = {'Ohio': frame3['Ohio'][:-1],
           'Nevada': frame3['Nevada'][:2]}
In the case of DataFrame, the reindex( ) method can be used to alter the row index or
column index or both. When passed just a sequence, the rows are reindexed.
> frame = DataFrame(np.arange(9).reshape((3, 3)),
                    index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])
> frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
> frame2
Ohio Texas California
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
The columns can be reindexed by using the columns keyword, as shown below:
> states = ['Texas', 'Utah', 'California']
> frame.reindex(columns=states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Both the rows and the columns can be reindexed in one shot, though interpolation (here, forward filling with method = 'ffill') applies only along the rows (axis 0):
> frame.reindex (index = [‘a’, ‘b’, ‘c’, ‘d’], method = ‘ffill’, columns = states)
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Reindexing can also be done by label-indexing with ix, as illustrated below:
> frame.ix [ [‘a’, ‘b’, ‘c’, ‘d’], states ]
Texas Utah California
a 1 NaN 2
b NaN NaN NaN
c 4 NaN 5
d 7 NaN 8
> s1 = Series ( [7.3, -2.5, 3.4, 1.5], index = [‘a’, ‘c’, ‘d’, ‘e’] )
> s2 = Series ( [-2.1, 3.6, -1.5, 4, 3.1], index = [‘a’, ‘c’, ‘e’, ‘f’, ‘g’] )
> s1
a 7.3
c -2.5
d 3.4
e 1.5
> s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
> s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations. In the case of Data Frame, alignment is
performed on both the rows and the columns:
> df1 = DataFrame ( np.arange(9.).reshape( (3,3) ), columns = list('bcd'),
index = ['Ohio', 'Texas', 'Colorado'] )
> df2 = DataFrame ( np.arange(12.).reshape( (4,3) ), columns = list('bde'),
index = ['Utah', 'Ohio', 'Texas', 'Oregon'] )
> df1
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
> df2
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
Adding these two Data Frames together returns a Data Frame whose index and columns
are the unions of the ones in each Data Frame:
> df1 + df2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3 NaN 6 NaN
Oregon NaN NaN NaN NaN
Texas 9 NaN 12 NaN
Utah NaN NaN NaN NaN
5.19 Arithmetic Operations between Data Frames and Series
As in the case of NumPy arrays, arithmetic operations between a DataFrame and a Series are
well-defined. We will now define a two-dimensional array. Then, we will subtract the elements
in the zeroth row from the elements in each row.
> arr = np.arange(12.).reshape( (3,4) )
> arr
array ( [ [0., 1., 2., 3.],
[4., 5., 6., 7.],
[8., 9., 10., 11.] ] )
> arr [0]
array ( [0., 1., 2., 3.] )
> arr - arr [0]
array ( [ [0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.] ] )
This is referred to as broadcasting. Operations between a DataFrame and a Series are similar.
> frame = DataFrame ( np.arange(12.).reshape( (4,3) ), columns = list('bde'),
index = ['Utah', 'Ohio', 'Texas', 'Oregon'] )
> series = frame.ix [0]
> frame
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
> series
b 0
d 1
e 2
By default, arithmetic between DataFrame and Series matches the index of the Series on
the DataFrame’s columns, broadcasting down the rows:
> frame - series
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
If an index value is not found in either the DataFrame’s columns or the Series’s index,
the DataFrame and the Series will be reindexed to form the union:
> series2 = Series ( range(3), index = [‘b’, ‘e’, ‘f’] )
> frame + series2
b d e f
Utah 0 NaN 3 NaN
Ohio 3 NaN 6 NaN
Texas 6 NaN 9 NaN
Oregon 9 NaN 12 NaN
If we wish instead to broadcast over the columns, matching on the rows, we have to use
one of the arithmetic methods:
> series3 = frame[‘d’]
> frame
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
> series3
Utah 1
Ohio 4
Texas 7
Oregon 10
> frame.sub( series3, axis = 0 )
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
Here, the axis number that we pass is the axis to match on. In the above example, we
mean to match on the DataFrame’s row index and broadcast across.
5.20 Function Application and Mapping
The ufuncs that we applied to NumPy ndarrays work fine with pandas objects.
> frame = DataFrame ( np.random.randn(4,3), columns = list(‘bde’),
index = [‘Utah’, ‘Ohio’, ‘Texas’, ‘Oregon’] )
> frame
b d e
Utah -0.204708 0.478943 -0.519439
Ohio -0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 -1.296221
> np.abs (frame)
b d e
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221
Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply( ) method does exactly this:
> f = lambda x : x.max( ) - x.min( )
> frame.apply (f)
b 1.802165
d 1.684034
e 2.689627
> frame.apply ( f, axis = 1)
Utah 0.998382
Ohio 2.521511
Texas 0.676115
Oregon 2.542656
Chapter-6:
DATA WRANGLING
6.1 Introduction
Much of the programming work in data analysis and modeling is spent on data
preparation: loading, cleaning, transforming and rearranging. Sometimes, the way that data are
stored in files or databases is not the way you need it for a data processing application.
Fortunately, standard libraries in Python, such as Pandas, provide us with a high-level,
flexible, and high-performance set of methods for loading, cleaning, transforming and
rearranging data. We will discuss such methods in this Chapter.
6.2 Merging Data Sets
The merge operation combines data sets by linking rows using one or more keys. This
operation is also known as a join, familiar from relational databases (RDBMS). Here is an example
for the merge( ) function:
> df1 = DataFrame ( { ‘key’ : [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘a’, ‘b’],
‘data1’ : range(7) } )
> df2 = DataFrame ( { ‘key’ : [‘a’, ‘b’, ‘d’],
‘data2’ : range(3) } )
> df1
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
> df2
data2 key
0 0 a
1 1 b
2 2 d
> pd.merge (df1, df2, on = ‘key’)
data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1
If the column names are different in each Data Frame, we can specify the column names
separately:
> df3 = DataFrame ( { ‘lkey’ : [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘a’, ‘b’],
‘data1’ : range(7) } )
> df4 = DataFrame ( { ‘rkey’ : [‘a’, ‘b’, ‘d’ ],
‘data2’ : range(3) } )
> pd.merge ( df3, df4, left_on = ‘lkey’, right_on = ‘rkey’ )
data1 lkey data2 rkey
0 2 a 0 a
1 4 a 0 a
2 5 a 0 a
3 0 b 1 b
4 1 b 1 b
5 6 b 1 b
Note that the keys ‘c’ and ‘d’ and the associated data are not listed in the above results.
By default, the merge( ) function performs an ‘inner’ join: only the keys that appear in both
Data Frames, and their associated values, will be listed.
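The default can also be requested explicitly:
> pd.merge (df1, df2, how = 'inner') # same result as pd.merge (df1, df2, on = 'key')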
The outer join will take the union of the keys, combining the effect of applying both left
and right joins, as shown below:
> pd.merge (df1, df2, how = ‘outer’)
data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1
6 3 c NaN
7 NaN d 2
Here, we merge df1 and df2. This is an example of a many-to-one merge situation: the
data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in
the key column. Many-to-many merging is also possible; such joins form the Cartesian
product of the matching rows, as sketched below.
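Here is a minimal sketch of the many-to-many behaviour (the df5 and df6 names and values are illustrative, and the row order of the result may differ across Pandas versions):
> df5 = DataFrame ( { 'key' : ['b', 'b', 'a'], 'data1' : range(3) } )
> df6 = DataFrame ( { 'key' : ['b', 'b', 'a'], 'data2' : range(3) } )
> pd.merge (df5, df6, on = 'key')
data1 key data2
0 0 b 0
1 0 b 1
2 1 b 0
3 1 b 1
4 2 a 2
Each of the two ‘b’ rows of df5 pairs with each of the two ‘b’ rows of df6, giving four ‘b’ rows in the result.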
Sometimes, we wish to use the index of a Data Frame as the merge key. In such a case,
we can pass left_index = True or right_index = True (or both) as the argument of the merge( )
function to indicate that the index should be used as the merge key. Here is an example:
> left1 = DataFrame ( { ‘key’ : [‘a’, ‘b’, ‘a’, ‘a’, ‘b’, ‘c’],
‘value’ : range(6) } )
> right1 = DataFrame ( { 'group_val' : [3.5, 7] }, index = ['a', 'b'] )
> left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
> right1
group_val
a 3.5
b 7
> pd.merge ( left1, right1, left_on = ‘key’, right_index = True )
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
Here, the key ‘c’ and its associated value 5 are not listed in the above result, as the inner
join is done by default. Here is an example of an outer join:
> pd.merge (left1, right1, left_on = ‘key’, right_index = True, how = ‘outer’)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
6.3 Reshaping and Pivoting
The reshape and pivot operations are the fundamental operations for rearranging tabular
data. Hierarchical indexing provides a consistent way to rearrange the data in a Data Frame.
Rotating or pivoting the data from the columns to the rows is known as the stack operation.
Pivoting from rows to columns is known as the unstack operation.
> data = DataFrame ( np.arange(6) . reshape( (2, 3) ),
index = pd.Index ( [‘Ohio’, ‘Colorado’], name = ‘state’ ),
columns = pd.Index ( [‘one’, ‘two’, ‘three’], name = ‘number’) )
> data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
When the stack( ) method is applied on this data, the columns become rows, producing a
Series. (A Series is a one-dimensional array with an associated array of index labels.)
> result = data.stack( )
> result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
From the above hierarchically-indexed Series, we can rearrange the data back into a Data
Frame by using the unstack( ) method, as shown below:
> result.unstack( )
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
6.4 Data Transformation - Removing Duplicates
If duplicate rows are found in a Data Frame, they can be dropped by using the
drop_duplicates( ) method. Here is an example:
> data = DataFrame ( { ‘K1’ : [‘one’] * 3 + [‘two’] * 4,
‘K2’ : [1, 1, 2, 3, 3, 4, 4] } )
> data
K1 K2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
The duplicated( ) method returns a boolean Series indicating whether each row is a
duplicate of a row observed earlier: True if it is, False otherwise. We can use this method to
identify the duplicated rows. It keeps the first observed value combination and treats
subsequent identical ones as duplicates. For example, in the above Data Frame, the contents
of the first row and the second row are the same. Here, the duplicated( ) method treats the
first row as the original one and reports the second row as a duplicate, as shown below:
> data.duplicated( )
0 False
1 True
2 False
3 False
4 True
5 False
6 True
As noted above, we can use the drop_duplicates( ) method to drop the duplicated rows.
In the above example, the rows with index labels 1, 4 and 6 will be dropped. The remaining
four rows will be retained, as shown below:
> data.drop_duplicates( )
K1 K2
0 one 1
2 one 2
3 two 3
5 two 4
In the above example, the second row is dropped because the values in its columns ‘K1’
and ‘K2’ (i.e., ‘one’, 1) are the same as the values in columns ‘K1’ and ‘K2’ of the first row.
So far, a row is dropped only if the values in all of its columns match the values in all the
corresponding columns of another row. But sometimes we want to treat a row as a duplicate
when just a subset of its columns matches another row. Here is an example.
> data [‘V1’] = range (7)
> data
K1 K2 V1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6
> data.drop_duplicates ( [‘K1’] )
K1 K2 V1
0 one 1 0
3 two 3 3
Here, we filter the duplicates only based on the ‘K1’ column.
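By default, the first observed combination is kept. In the older Pandas API assumed throughout this material, passing take_last = True keeps the last occurrence instead (newer versions spell this keep = 'last'); a sketch:
> data.drop_duplicates ( ['K1'], take_last = True )
K1 K2 V1
2 one 2 2
6 two 4 6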
The map( ) method of a Series performs element-wise transformations. In the example
of the preceding section, map( ) takes a dictionary-like object, meat_to_animal, as its
argument; it can also accept a function as its argument.
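A minimal sketch (the Series values and the meat_to_animal mapping here are illustrative):
> data = Series ( ['bacon', 'pulled pork', 'nova lox'] )
> meat_to_animal = { 'bacon' : 'pig', 'pulled pork' : 'pig', 'nova lox' : 'salmon' }
> data.map (meat_to_animal)
0 pig
1 pig
2 salmon
> data.map ( lambda x : meat_to_animal [x] ) # a function argument works the same way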
6.6 Data Transformation - Replacing Values
When a value in a column of a Data Frame is missing, some people enter the sentinel
value -999 instead of entering NA or NaN. (In Pandas, NaN is used instead of NA to represent
a missing value.) We can replace -999 by NaN in such a case by using the replace( ) method,
as shown below:
> data = Series ( [1.0, -999.0, 2.0, -999.0, -1000.0, 3.0] )
> data
0 1
1 -999
2 2
3 -999
4 -1000
5 3
> data.replace (-999, np.nan)
0 1
1 NaN
2 2
3 NaN
4 -1000
5 3
If we wish to replace multiple values at once, we have to pass a list to the replace( ) function,
as shown below:
> data.replace ( [-999, -1000], np.nan )
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
If we wish to use a different replacement for each value, then we have to pass a list of
substitutes, as shown below:
> data.replace ( [-999, -1000], [np.nan, 0] )
0 1
1 NaN
2 2
3 NaN
4 0
5 3
We can also pass a Dictionary as the argument to the replace( ) method, mapping each value
to its replacement, as shown below:
> data.replace ( { -999 : np.nan, -1000 : 0 } )
0 1
1 NaN
2 2
3 NaN
4 0
5 3
6.7 Data Transformation - Discretization and Binning
Continuous data (heights, weights, etc.) are often separated (i.e., discretized) into bins
(i.e., class intervals) for analysis. Suppose we have data about a group of people in a study,
and we want to group them into discrete age buckets (i.e., class intervals), such as 18-25, 25-
35, 35-60 and 60-100. (Here, these age groups represent youth, young-adult, middle-aged and
seniors.) The cut( ) method, qcut( ) method and the value_counts( ) method are very helpful in
this context.
> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
> bins = [18, 25, 35, 60, 100]
> cats = pd.cut (ages, bins)
> cats
array ( [ (18, 25], (18, 25], (18, 25], (25, 35], (18, 25], (18, 25],
(35, 60], (25, 35], (60, 100], (35, 60], (35, 60], (25, 35] ],
dtype = object )
> cats.labels
array ( [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1] )
> cats.levels
Index ( [ (18, 25], (25, 35], (35, 60], (60, 100] ], dtype = object )
> pd.value_counts (cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
Here, 5, 3, 3 and 1 are the frequencies of the class intervals (18, 25], (25, 35], (35, 60] and
(60, 100]. In this interval notation, a parenthesis means that the side is open, while a square
bracket means it is closed (i.e., inclusive). So, the value 25 belongs to the interval (18, 25],
not to the interval (25, 35]. Which side is closed can be changed by passing the argument
right = False to the cut( ) method, as shown below:
> pd.cut ( ages, [18, 26, 36, 61, 100], right = False )
array ( [ [18, 26), [18, 26), [18, 26), [26, 36), [18, 26), [18, 26),
[36, 61), [26, 36), [61, 100), [36, 61), [36, 61), [26, 36) ],
dtype = object )
We can also pass our own bin names by passing a list or array to the labels option, as
shown below:
> group_names = [ ‘Youth’, ‘YoungAdult’, ‘MiddleAged’, ‘Senior’ ]
> pd.cut ( ages, bins, labels = group_names )
array ( [Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged,
YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult],
dtype = object )
If we pass to the cut( ) method an integer number of bins instead of explicit bin edges, it
will compute equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data divided into fourths:
> data = np.random.rand (20)
> pd.cut (data, 4, precision = 2)
array ( [ (0.45, 0.67], (0.23, 0.45], (0.0037, 0.23], (0.67, 0.9],
.............., (0.23, 0.45] ], dtype = object )
The qcut( ) method is similar to the cut( ) method, but it bins the data based on sample
quantiles. Because it cuts at quantiles, qcut( ) produces bins that each contain roughly the
same number of data points; passing 4 requests quartiles.
> data = np.random.randn (1000)
> cats = pd.qcut (data, 4)
> pd.value_counts (cats)
(-3.745, -0.635] 250
(0.641, 3.26] 250
(-0.635, -0.022] 250
(-0.022, 0.641] 250
We can also pass our own quantiles (numbers between 0 and 1, inclusive) as the
argument for the qcut( ) method, as shown below:
> pd.qcut ( data, [0, 0.1, 0.5, 0.9, 1.0] )
array ( [ (-0.022, 1.302], (-1.266, -0.022], (-0.022, 1.302], ...,
(-1.266, -0.022], (-0.022, 1.302], (-1.266, -0.022] ],
dtype = object )
6.8 Data Transformation - Detecting and Filtering Outliers
We shall filter and transform the outliers by using the array methods, as shown below:
> np.random.seed (12345)
> data = DataFrame ( np.random.randn(1000, 4) )
> data.describe( )
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
Suppose, in this example, we consider a value as an outlier if it is greater than +3 or less
than -3. The following commands will display the outliers in the last column of the above
Data Frame.
> col = data [3]
> col [ np.abs (col) > 3 ]
97 3.927528
305 -3.399312
400 -3.745356
In order to select all rows that have an outlier in one or more columns, we can use the
following command:
> data [ ( np.abs (data) > 3 ) . any(1) ]
0 1 2 3
5 -0.539741 0.476985 3.248944 -1.021228
97 -0.774363 0.552936 0.106061 3.927528
102 -0.655054 -0.565230 3.176873 0.959533
305 -2.315555 0.457246 -0.025907 -3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990 1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586 0.275144 1.179227 -3.184377 1.369891
808 -0.362528 -3.548824 1.553205 -2.186301
900 3.366626 -2.372214 0.851010 1.332846
If a value is greater than +3, we will replace it by +3. Similarly, if a value is less than -3,
we will replace it by -3. This caps the values so that they lie within the interval -3 to +3.
The following code will do this.
> data [ np.abs (data) > 3 ] = np.sign ( data ) * 3
> data.describe( )
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067623 0.068473 0.025153 -0.002081
std 0.995485 0.990253 1.003977 0.989736
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.000000 2.653656 3.000000 3.000000
Here, the sign( ) function returns +1 for a positive value and -1 for a negative value (and 0
for 0). So, the expression np.sign(data) * 3 evaluates to +3 where a value is positive and to -3
where it is negative; combined with the boolean mask np.abs(data) > 3, the assignment
replaces only the out-of-range values.
6.9 Data Transformation - Random Sampling
In order to select a random sample without replacement, we first permute (i.e.,
randomly reorder) the rows of a Data Frame or the values of a Series by using the
numpy.random.permutation function. Then, if we want a random sample of size K, we
simply select the first K elements. Here is an example.
> df = DataFrame ( np.arange (5*4).reshape(5,4) )
> df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
> sampler = np.random.permutation(5)
> sampler
array( [1, 0, 2, 3, 4] )
> df.take (sampler)
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
Here, the rows of the Data Frame have been randomly reordered; the new order of the rows
is 1, 0, 2, 3, 4. If we want a random sample of size 3 without replacement, we can simply take
the first three rows of the above reordered set. Alternatively, we can execute the following
command.
> df.take ( np.random.permutation ( len (df) ) [:3] )
0 1 2 3
1 4 5 6 7
3 12 13 14 15
4 16 17 18 19
In order to generate a random sample with replacement, the fastest way is to use the
np.random.randint( ) method, as shown below:
> bag = np.array ( [5, 7, -1, 6, 4] ) # A Sample is to be taken from these five values
> sampler = np.random.randint ( 0, len(bag), size = 10 )
> sampler
array ( [4, 4, 2, 2, 2, 0, 3, 0, 4, 1] )
> draws = bag.take ( sampler )
> draws
array ( [4, 4, -1, -1, -1, 5, 6, 5, 4, 7] ) # A random sample of size 10
6.10 Data Transformation - Computing Dummy/Indicator Variables
The categorical variables are often transformed into dummy variables or indicator
variables for statistical modeling or machine learning applications. Here is an example:
> df = DataFrame ( { 'key' : ['b', 'b', 'a', 'c', 'a', 'b'],
'data1' : range(6) } )
> df
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
> dummies = pd.get_dummies ( df [‘key’] )
> df_with_dummy = df [ [ ‘data1’ ] ] . join ( dummies )
> df_with_dummy
data1 a b c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
Here, the categorical variable ‘key’ has three distinct values, namely ‘a’, ‘b’ and ‘c’. So,
three dummy variables, named ‘a’, ‘b’ and ‘c’, will be generated for it by the get_dummies( )
method. If we look at the values in the dummy columns ‘a’, ‘b’ and ‘c’ together, we note
that ‘a’ is represented by the combination (1, 0, 0), ‘b’ is represented by (0, 1, 0) and ‘c’ is
represented by (0, 0, 1).
The dummy variable columns are named as ‘a’, ‘b’ and ‘c’ in the above result. If we want
the names of these columns to be prefixed with the word ‘key’, we have to write the following
code:
> dummies = pd.get_dummies ( df [‘key’], prefix = ‘key’ )
> df_with_dummy = df [ ['data1'] ].join (dummies)
> df_with_dummy
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
In the above example, the categorical variable ‘key’ takes only one of three distinct
values, namely ‘a’, ‘b’ and ‘c’. In such a case, the corresponding dummy variables can be
easily created as shown above. But a categorical variable, such as the movie genre, may take
one or more values from a large collection, such as Animation, Comedy, Adventure, Fantasy,
Romance, Action, Crime, Thriller, Children’s, etc. Here, generating a set of dummy
variables for the genres is a little more involved, and it is explained below with the help of
the MovieLens 1M dataset.
> mnames = [ ‘movie_id’, ‘title’, ‘genres’ ]
> movies = pd.read_table ( 'movies.dat', sep = '::', header = None,
names = mnames )
> movies [ :5]
movie_id title genres
0 1 Toy Story (1995) Animation | Children’s | Comedy
1 2 Jumanji (1995) Adventure | Children’s | Fantasy
2 3 Grumpier Old Men (1995) Comedy | Romance
3 4 Waiting to Exhale (1995) Comedy | Drama
4 5 Father of the Bride Part II (1995) Comedy
Here, adding a dummy variable for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset by using the following two commands:
> genre_iter = ( set ( x.split ('|') ) for x in movies.genres )
> genres = sorted ( set.union (*genre_iter) )
Here, a set of dummy variables is to be generated corresponding to the categorical
variable genres. If this categorical variable takes K distinct values, then K dummy variables
are to be generated. These K dummy variables will be represented by K columns in the Data
Frame. Initially, we will place zeroes in all the element positions in these K columns.
The second of the two commands given above collects the distinct values that the
categorical variable genres can take (via set.union) and places them in sorted order in the
list named genres. The first two entries in this list are Action and Adventure. The K columns,
which represent the K dummy variables, in a row whose genres column contains only the
value Action, will take the values (1, 0, 0, 0, ..., 0, 0, 0).
Similarly, the K columns in a row whose genres column contains only the two values
Action and Adventure will take the values (1, 1, 0, 0, ..., 0, 0, 0). The commands required for
performing these operations are given below:
> dummies = DataFrame ( np.zeros ( ( len (movies), len (genres) ) ),
columns = genres )
> for i, gen in enumerate ( movies.genres ) :
dummies.ix [ i, gen.split ('|') ] = 1
> movies_windic = movies.join ( dummies.add_prefix ('Genre_') )
> movies_windic.ix [0]
movie_id 1
title Toy Story (1995)
genres Animation | Children’s | Comedy
Genre_Action 0
Genre_Adventure 0
Genre_Animation 1
Genre_Children’s 1
Genre_Comedy 1
Genre_Crime 0
Genre_Documentary 0
Genre_Drama 0
Genre_Fantasy 0
Genre_Film_Noir 0
Genre_Horror 0
Genre_Musical 0
Genre_Mystery 0
Genre_Romance 0
Genre_Sci-Fi 0
Genre_Thriller 0
Genre_War 0
Genre_Western 0
The difference between the find( ) and index( ) methods is that the index( ) method raises
an exception if the string is not found. On the other hand, the find( ) method returns –1, as we
have seen above, if the string is not found.
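A minimal sketch (the string is illustrative):
> val = 'a,b, guido'
> val.index (',')
1
> val.find (':')
-1
> val.index (':') # raises ValueError, since ':' does not occur in val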
Suppose we want to split a string with a variable number of whitespace characters (tabs,
spaces, and newlines). The regex describing one or more whitespace characters is \s+ :
> import re
> text = “foo bar\t baz \tqux”
> re.split (‘\s+’ , text)
['foo', 'bar', 'baz', 'qux']
When we call re.split (‘\s+’, text), the regular expression is first compiled, and then its split( )
method is called on the passed text. We can compile the regex ourselves with the re.compile( )
method, forming a reusable regex object.
> regex = re.compile (‘\s+’)
> regex.split(text)
[ ‘foo’, ‘bar’, ‘baz’, ‘qux’ ]
If, instead, we want to get a list of all patterns matching the regex, we can use the findall( )
method, as shown below:
> regex.findall (text)
[' ', '\t ', ' \t']
The match( ) and search( ) methods are closely related to findall( ). While findall( ) returns
all the matches in a string, search( ) returns only the first match, and match( ) matches only
at the beginning of the string.
Let us now consider a block of text and a regular expression capable of identifying most
e-mail addresses:
> text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
> regex = re.compile ( pattern, flags = re. IGNORECASE )
> regex.findall(text)
[ ‘[email protected]’, ‘[email protected]’, ‘[email protected]’, ‘[email protected]’ ]
The search( ) method will return a special match object for the first e-mail address in the
text. For the above regex, the match object can only tell us the start and end position of the
pattern in the string:
> m = regex.search (text)
> m
<_sre.SRE_Match at 0x10a05de00>
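We can use these positions to slice the matched address out of the text (a sketch; the address shown is the placeholder value used above):
> text [ m.start( ) : m.end( ) ]
'[email protected]'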
Chapter-7:
DATA AGGREGATION AND GROUP OPERATIONS
7.1 Introduction
After loading, merging, and preparing a data set, a familiar data processing task is to
categorize the data into groups and then compute the group statistics for reporting and
visualization purposes. The Pandas library provides a flexible and high-performance groupby
facility, enabling us to slice and dice, and summarize data sets in a natural way.
One reason for the popularity of Relational Databases and SQL is the ease with which
data can be joined, filtered, transformed, and aggregated. However, query languages like SQL
are rather limited in the kinds of group operations that can be performed. With the
expressiveness and power of Python and Pandas, we can perform much more complex grouped
operations, like the ones listed below:
1. We can split a pandas object into pieces using one or more keys.
2. We can compute group summary statistics, like count, mean, or standard deviation.
3. We can apply a varying set of functions to each column of a Data Frame.
4. We can apply within-group transformations or other manipulations, like normalization,
linear regression, rank, or subset selection.
5. We can compute Pivot Tables and Cross-Tabulation.
7.2 GroupBy Operation
Hadley Wickham coined the term split-apply-combine for talking about group operations,
and this is a good description of the process. In the first stage of the process, the data are
split into groups based on one or more keys that we provide. The splitting is performed on
a particular axis; in the case of a Data Frame, the rows constitute the zeroth axis and the
columns constitute the first axis. Once this is done, a function is applied to each group,
producing a new value. Finally, the results of all those function applications are combined
into a result object. (The accompanying figure, not reproduced here, is a mockup of a simple
group aggregation.)
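As a minimal sketch of this pattern (the data values are illustrative; the numeric results depend on the random data):
> df = DataFrame ( { 'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5) } )
> grouped = df ['data1'].groupby ( df ['key1'] ) # split the data1 column by the values of key1
> grouped.mean( ) # apply mean( ) to each group and combine the results
Grouping by two keys produces a Series with a hierarchical index, as in the following result: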
> df ['data1'].groupby ( [df ['key1'], df ['key2']] ).mean( )
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
7.5 Grouping with Dictionaries and Series
So far, we have passed one or more keys to the groupby( ) method. But we can just as
well pass a Dictionary or a Series as the argument. In other words, grouping of data can
also be done on the basis of a Dictionary or a Series.
> people = DataFrame ( np.random.randn (5,5),
columns = ['a', 'b', 'c', 'd', 'e'],
index = ['Joe', 'Steve', 'Wes', 'Jim', 'Travis'] )
> people.ix [2:3, ['b', 'c']] = np.nan # Add a few NA values
> people
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Suppose, now, that we have a group correspondence for the columns and want to sum the
columns by group:
> mapping = { 'a' : 'red', 'b' : 'red', 'c' : 'blue',
'd' : 'blue', 'e' : 'red', 'f' : 'orange' }
> people.groupby ( mapping, axis = 1 ) . sum( )
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
The same functionality holds for a Series, which can be viewed as a fixed-size mapping:
> map_series = Series ( mapping )
> map_series
a red
b red
c blue
d blue
e red
f orange
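Passing this Series to the groupby( ) method works the same way. For instance, counting the non-null values per colour group gives (a sketch, using the people Data Frame above):
> people.groupby ( map_series, axis = 1 ).count( )
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3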
We can specify a list of functions as the argument of the agg( ) method; they will be
applied to each of the selected columns of a Data Frame. We can also apply different
functions to different columns, as sketched after the following example.
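(Here, grouped is assumed to be the tips data set grouped by sex and smoker, e.g. grouped = tips.groupby ( ['sex', 'smoker'] ), with tip_pct = tips ['tip'] / tips ['total_bill'].)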
> functions = [ ‘count’, ‘mean’, ‘max’ ]
> result = grouped [ ‘tip_pct’, ‘total_bill’ ] . agg (functions)
> result
tip_pct total_bill
sex smoker count mean max count mean max
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
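To apply different functions to different columns, we can pass a dictionary that maps column names to functions (a sketch using the same grouped object):
> grouped.agg ( { 'tip_pct' : 'mean', 'total_bill' : 'sum' } )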