Python Study Material

This document provides an introduction to Python programming fundamentals including data types, variables, expressions, control statements like if/else, for/while loops, and user-defined functions. It explains Python's main data types like integers, floats, strings, lists, tuples and dictionaries. It also demonstrates how to write simple Python programs to calculate averages and use input/output, if/else conditional logic, and for/while loops.

PYTHON

Chapter-1: Python Fundamentals

1.1 Introduction
The word Python: doesn't it bring to mind the image of a big snake found in the
Amazon forest? Well, it is time to change this image. From now on, you will remember Python as
a free-to-use, open-source programming language that is great for performing Data Science
tasks. Python, the programming language, was named after the famous BBC comedy show
Monty Python's Flying Circus. It is an easy-to-learn yet powerful object-oriented
programming language that is now widely used for analyzing data.
1.2 First Program in Python
Let us now try our first Program in Python. A Statement is an instruction that the
computer will execute. Perhaps the simplest Python Program that you can write is the one that
contains a single print statement, as shown below:
print("Hello, Python!")
When the computer executes the above print statement, it will simply display the value you
write within the parentheses (i.e. “Hello, Python!”). The value you write in the parentheses is
called the argument. If you are using a Jupyter Notebook, you will see a small rectangle containing
the above print statement. This is called a cell. If you select this cell with your mouse and
then click the “Run All” button, the computer will execute the print statement and display the
output (i.e. Hello, Python!) beneath the cell, as shown below:
print("Hello, Python!")
Hello, Python!

1.3 Data Types in Python


Integers like 11, real numbers like 21.62 and strings like "Python" are known as
literals in Python. They are represented by the data types int, float and str respectively. Apart
from these widely used data types, some other data types are also available in Python, as shown
in the following diagram. We will learn about all of them later.
Data Types in Python
- Numbers: Integer, Floating Point, Complex Number, Boolean
- Sequences: String, List, Tuple
- Mappings: Dictionary
- None (no value)

Data Type        Examples
Integer          29, 38000
Floating Point   21.62, 3.14
Complex Number   2.9 + 3.8j, 4.7 + 5.6j
Boolean          True, False
String           "Hello Python", "Data Science"
List             [2, 10, 29, 38, 56], ["Neha", 83, 65, 70.4]
Tuple            (2, 92, 47, 65, 47), (8, 7.4, 9, "ABC")
Dictionary       {"a": 1, "e": 2, "i": 3, "o": 4, "u": 5}
None             Indicates the absence of a value. The None value in Python means "there is nothing here."
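As a small illustrative sketch, None can be tested with the is operator, and a function that does not explicitly return a value implicitly returns None:

```python
result = None              # indicates the absence of a value
print(result is None)      # True
print(type(None))          # <class 'NoneType'>

def greet():
    print("Hello")         # no return statement here

value = greet()            # greet() returns None implicitly
print(value)               # None
```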

Using the type( ) function, we can find the data type of a value. For example, type(29)
will return int, while type(3.14) will return float. If a string contains an integer value, we can
convert it to int. This is known as type casting. For example, int("38") will return 38, while
int(True) will return 1. On the other hand, bool(1) will return True. The data types String,
List and Tuple are known as sequence data types, while Dictionary is a mapping type.
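The following short sketch collects these type checks and conversions in one place:

```python
# Checking data types with type( )
print(type(29))         # <class 'int'>
print(type(3.14))       # <class 'float'>
print(type("Python"))   # <class 'str'>

# Type casting between types
print(int("38"))        # 38   (string to int)
print(float("21.62"))   # 21.62 (string to float)
print(int(True))        # 1    (Boolean to int)
print(bool(1))          # True (int to Boolean)
```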
1.4 Expressions
A simple example for an expression is 33+60–10. Here, we call the numbers 33, 60 and
10 as operands and the Maths symbols + and – as operators. The value of this expression is
83. We can perform the multiplication operation by using the asterisk symbol, and the division
operation by using the forward slash symbol, as shown below:
5 * 5
25 / 5
The values of the above expressions will be evaluated as 25 and 5.0, respectively.
We can use a double slash (//) for integer division, in which case the result is rounded down to the nearest integer.
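A quick sketch of the difference between the two division operators:

```python
print(25 / 4)     # 6.25  (true division always gives a float)
print(25 // 4)    # 6     (floor division rounds down)
print(25 // 5)    # 5     (an int, not 5.0)
print(-25 // 4)   # -7    (rounds toward negative infinity, not toward zero)
```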
Python follows mathematical conventions when evaluating a mathematical expression. The
arithmetic operators in the following two expressions are in a different order. But, in both
cases, Python performs the multiplication first and then the addition, so both evaluate to 150:
2 * 60 + 30
30 + 2 * 60
The expression in the parentheses is evaluated first in the following example. The result (32) is then
multiplied by 60. The final result is 1920:
(30 + 2) * 60

1.5 Variable
We can use variables to store values. In the following example, we assign the value
10000 to the variable principal by using the assignment operator (i.e. the equals sign). We can
then use the value somewhere else in the code by typing the exact name of the variable.
>>> principal = 10000
>>> years = 3
>>> interest_rate = 10
>>> simple_interest = (principal * years * interest_rate) / 100
>>> simple_interest
3000.0
Recall that we can find the data type of a value by using the type( ) function. For example,
type(29) will return int. We can apply the type( ) function on a variable also. For example,
type(principal) will return int.
1.6 Method of providing the input through the keyboard
In the above Program, input values are provided as part of the Program. But, often, the
input values are received from the User through the keyboard. The input( ) function shall be
used for this purpose.
name = input("What is your name? ")
When the above statement is executed, the prompt will be displayed as a vertical line, as shown
below:
What is your name? |
We have to type the name in front of the above prompt (|), as shown below:
What is your name? Ganesh
Now, “Ganesh” will be assigned as the value of the variable name.
We have just now seen the method of reading in a string. Even if we provide a number
as the input, the input( ) function will return the entered value as a string. So, what should we
do if we have to read in an integer or a fractional value? What is the way out? We have to use
the int( ) and float( ) functions, as shown below:
>>> age = int(input("What is your age?"))
>>> percentage = float(input("Enter your percentage of marks:"))

1.7 Program to calculate Mean


The following Program will read in the marks obtained by a student in three different
subjects and then calculate the average mark.

>>> # Program to calculate the Average
>>> mark1 = int(input("Enter the first mark: "))
>>> mark2 = int(input("Enter the second mark: "))
>>> mark3 = int(input("Enter the third mark: "))
>>> average = (mark1 + mark2 + mark3) / 3
>>> print("Average Mark is:", average)
We will see the following entries on the screen when we execute the above program:
Enter the first mark: 90
Enter the second mark: 70
Enter the third mark: 80
Average Mark is: 80.0

1.8 Control statements in Python


The if statement, the for statement and the while statement are the important control
statements available in Python. The if...else statement tests a condition; if the condition
evaluates to True, it carries out one set of instructions, otherwise it carries out another set
of instructions. Example:
>>> import statistics
>>> marks = [10, 20, 60]
>>> option = int(input("Enter 1 to calculate Mean, 2 to calculate Median: "))
>>> if option == 1:
        average = statistics.mean(marks)
    else:
        average = statistics.median(marks)
>>> print(average)
Here, when the value of option is 1, the output will be 30. Otherwise, the output will be 20.
The for statement is used to repeatedly execute a set of statements for a certain number of
times. Here is an example.
>>> for a in [1, 3, 5]:
        print(a)
        print(a * a)
Here, the two print statements will be repeatedly executed three times. Here, a will take the
values 1, 3 and 5, one by one. So, the output will be as follows:
1
1
3
9
5
25

The while statement is used to repeatedly execute a set of statements as long as a condition
remains true. Here is an example:
>>> n = 1
>>> while n <= 3:
        print(n * n)
        n = n + 1
Here, n will take the values 1, 2 and 3. So, the output will be as follows:
1
4
9
Note:
Many programming languages use curly braces to delimit blocks of code. But Python
uses indentation, as shown below:

for i in [1, 2, 3, 4, 5]:
    print(i)
    for j in [1, 2, 3, 4, 5]:
        print(j)
        print(i + j)
    print(i)
print("looping is over")

1.9 User-defined Functions


When a Code Segment is to be reused (i.e., executed again and again), it shall be defined
as a function. Every function has a name and it is defined by using the def statement. The
statements indented below the def statement form the body of the function.
>>> def mult(a, b):
        c = a * b
        return c
>>> mult(20, 10)   # Invoking the mult( ) function
200
The syntax of the user-defined function is as follows:
def user_defined_function_name (list_of_arguments) :
body of the user_defined function
In the above example, mult is the name of the function. In this example, a and b
constitute the argument list. The two statements c = a*b and return c constitute the body of
the function.

The syntax for calling the user-defined function is as follows:


user_defined_function_name (list_of_arguments)
In the above example, mult (20, 10) calls the user-defined function. The values 20 and
10 are passed as the values of a and b. The value of c is calculated as 200 and it is returned.
When the user-defined function is called, the values are transferred from the calling function
to the called function. These values are called as arguments. (In the above example, 20 and
10 are the arguments). Four types of arguments are possible in Python.
a) Default Arguments
The value of an argument may be specified when an user-defined function is defined.
Such an argument is called a default argument. Here is an example:
>>> def mult(a=20, b=10):
        c = a * b
        return c
>>> mult()          # 200 will be the result
>>> mult(30, 20)    # 600 will be the result
If no arguments are passed during the function call (e.g. mult( )), then the default values
(i.e., 20 and 10) are used. However, if the arguments are passed (e.g. mult(30, 20)), the
passed values are assigned and the default values are ignored.
b) Required Arguments
Consider the following user-defined function:
>>> def sum(a, b=10):
        return a + b
In this example, a is the required argument. Its value should be provided during the function
call.
c) Keyword Arguments
Consider the following ways of calling the above sum( ) function:
>>> sum (60, 40) # 60 will be assigned to a; 40 to b.
>>> sum (b = 40, a = 60) # 60 will be assigned to a; 40 will be assigned to b.
Here, a and b are called keyword arguments. When they are used, values can be listed in a
different order in the argument list.
d) Variable length Arguments
Consider the following user-defined function:
def disp(x, *t):
    print("Value of first argument:", x)
    print("Number of Variable Arguments:", len(t))
    print("The Variable Arguments:")
    for i in t:
        print(i)

n1 = int(input("Enter a Number: "))
n2 = int(input("Enter a Number: "))
n3 = int(input("Enter a Number: "))
n4 = int(input("Enter a Number: "))
disp(n1, n2, n3, n4)
Output:
Enter a Number : 200
Enter a Number : 300
Enter a Number : 400
Enter a Number : 500
Value of first argument: 200
Number of Variable Arguments: 3
The Variable Arguments:
300
400
500
1.10 Anonymous Functions
In Python, we can define a user-defined function without a name. Such a function is
called an anonymous function. It is not defined with the def keyword. Instead, the keyword
lambda is used. Such a function can consist of only one expression. It can have multiple
number of arguments. It will return only one value. Here is an example.
m = lambda x, y: x * y
n1 = int(input("Enter a Number: "))
n2 = int(input("Enter a Number: "))
print("The two input values:", n1, n2)
print("The result:", m(n1, n2))
Output
Enter a Number: 10
Enter a Number: 20
The two input values: 10 20
The result: 200
The syntax of an anonymous function is as follows:
lambda list_of_arguments: expression
In the above example, x and y are the arguments. Here x*y is the expression. Note that the
arguments and the expression are separated by : symbol.

1.11 Scope of Local Variables and Global Variables


The Variables are of two types: local and global. Local Variable is a Variable that is
defined within a user-defined function. The value of a local variable is available only within
the given user-defined function. Here is an example:
def f1():
    x = 10
    print("Value of Local Variable x inside f1():", x)

f1()
print("Value of Local Variable x outside f1():", x)
Output
Value of Local Variable x inside f1(): 10
NameError: name 'x' is not defined
Now, consider the following user-defined function:
x = 10
def f1():
    print("Value of Global Variable x inside f1():", x)

f1()
print("Value of Global Variable x outside f1():", x)
Output
Value of Global Variable x inside f1( ): 10
Value of Global Variable x outside f1( ): 10
In Python, scope resolution for the Variable follows the famous “LEGB” rule. Thus, for
a Variable name, the initial search is made in the Local name space. If the Variable is not
found, then the search is made in the Enclosed name space. If the Variable is not found, then
the search is made in the Global name space. If the Variable is not found in the global name
space, then finally the search is made in the Built-in name space. The following list compares
local and global variables:
1. Local variables are declared within a function; global variables are declared outside the
functions, but within the main program.
2. The scope of a local variable is restricted to the function in which it is declared; a global
variable is accessible throughout the entire program.
3. The value of a local variable can be accessed only within the function in which it is
declared; the value of a global variable can be accessed throughout the entire program.
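The LEGB order can be demonstrated with a small sketch: the same name x is bound in the local, enclosing and global namespaces, and each lookup stops at the first namespace that defines it:

```python
x = "global"                 # Global namespace

def outer():
    x = "enclosed"           # Enclosing namespace (relative to inner)
    def inner():
        x = "local"          # Local namespace
        return x
    return inner(), x

print(outer())   # ('local', 'enclosed') -- inner() finds its local x first
print(x)         # global -- at module level, the global x is found
print(len)       # len is found last, in the Built-in namespace
```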

1.12 Functional Programming


In Python, the map( ) function, the filter( ) function and the reduce( ) function apply a
given function to the elements of a list. The map( ) function takes a function and one or more
lists as input and applies the function to each element. Here is an example:
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = list(map(lambda x, y: x + y, a, b))
print("a =", a)
print("b =", b)
print("map( ) result =", c)
Output
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
map( ) result = [11, 22, 33, 44]
The filter( ) function takes two arguments. The first argument is the function to be applied to
the values in a list; the second argument is the list of values. It returns only those values that
satisfy the specified condition.
a = [15, 20, 24, 35]
c = list(filter(lambda x: x % 2 == 0, a))
print("a =", a)
print("filter( ) result =", c)
Output
a = [15, 20, 24, 35]
filter( ) result = [20, 24]
The reduce( ) function repeatedly applies a function to the elements of the input list and finally
produces a single value as the answer.
from functools import reduce
a = [15, 20, 24, 35]
c = reduce(lambda x, y: x + y, a)
print("reduce( ) result =", c)
Output
reduce( ) result = 94
1.13 Modules
A set of built-in functions and constants that were written for a specific purpose
constitute a Module. For example, the module named re contains functions and constants
required for working with regular expressions. If we wish to make use of the functions defined

in the re module, such as the compile( ) function, we have to import the re module, as shown
below:
import re
my_regular_expression = re.compile("[0-9]+", re.I)
Here, we prefix the compile( ) function with the name of the module (i.e., re). If the name re is
already used in our program for some other purpose, we can use an alias, as shown below:
import re as regex
my_regular_expression = regex.compile("[0-9]+", regex.I)
In Python, we use the functions in the module named matplotlib.pyplot for drawing a
variety of figures, such as the Bar Chart and the Pie Chart.
import matplotlib.pyplot
matplotlib.pyplot.plot( ... )
Here, the name of the module is a lengthy one. Instead of writing this lengthy name again and
again, we can use the alias option, as shown below:
import matplotlib.pyplot as plt
plt.plot( ... )
If we need only a few specific functions and constants defined in a Module, we can
import them explicitly and use them without qualification, as shown below:
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()
Note that we are not invoking the defaultdict( ) function in the format
collections.defaultdict( ).
A module is a single Python file, saved with a .py extension. Assume that a module
named sum has the following functions, and that the following code is saved as
sum.py.
def add(x, y):
    return x + y

def mul(a, b):
    return a * b

def sub(x, y):
    return x - y

def div(a, b):
    return a / b
The above module sum is imported in the following program. Then, the functions in this
module are used.

import sum
n1 = int(input("Enter the first number: "))
n2 = int(input("Enter the second number: "))
print("The result of add( ):", sum.add(n1, n2))
print("The result of mul( ):", sum.mul(n1, n2))
Output
Enter the first number: 30
Enter the second number: 10
The result of add( ): 40
The result of mul( ): 300
A package contains a group of module files, all placed in a single directory. To
differentiate a package directory from an ordinary directory, the package directory contains a
file named __init__.py; every package directory contains this file. The Package Installer for
Python (pip) is used for installing packages.
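As a sketch (using the hypothetical package name mathpack), the following code creates a package directory containing an __init__.py file and one module, then imports from it. In practice, the package directory would usually be created by hand rather than from a script:

```python
import os
import sys
import tempfile
import importlib

# Create the package directory (hypothetical name: mathpack)
base = tempfile.mkdtemp()
pkg = os.path.join(base, "mathpack")
os.makedirs(pkg)

# The empty __init__.py file marks the directory as a package
open(os.path.join(pkg, "__init__.py"), "w").close()

# A module inside the package
with open(os.path.join(pkg, "arith.py"), "w") as f:
    f.write("def add(x, y):\n    return x + y\n")

# Import the module from the package and use it
sys.path.insert(0, base)
arith = importlib.import_module("mathpack.arith")
print(arith.add(30, 10))   # 40
```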
Python allows the handling of the exceptions raised in the invoked functions.
1.14 Whitespace Formatting
Whitespace is ignored inside parentheses and brackets. This can be helpful for long-
winded computations:
long_winded_computation = (1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+
16+17+18+19+20)
Whitespace can be used for making code easier to read:
list_of_lists = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
easier_to_read_list_of_lists = [ [1, 2, 3],
[4, 5, 6],
[7, 8, 9] ]
We can also use a backslash to indicate that a statement continues onto the next line.
two_plus_three = 2 + \
3
1.15 Object-oriented Programming in Python
In object-oriented programming, the focus is on the data. The data and the code that works
with it are grouped into a single entity called a class. The class contains member data
and methods. Some of the principles of object-oriented programming are listed below:
a) Abstraction: Abstraction allows the Programmer to use the object without knowing the
details of the object.

b) Inheritance: Inheritance allows one class to derive the functionality and characteristics of
the other class.
c) Encapsulation: Encapsulation binds all the data members and methods into a single entity
called as class.
d) Polymorphism: Polymorphism allows the same entity to be used in different forms. In
Python, polymorphism is resolved at run time, based on the actual type of the object.
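The following minimal sketch illustrates run-time polymorphism: two unrelated classes each provide an area( ) method, and the same call works on objects of both classes (the class names here are illustrative, anticipating section 1.16):

```python
class Circle:
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return 3.14 * self.radius * self.radius

class Rectangle:
    def __init__(self, length, breadth):
        self.length = length
        self.breadth = breadth
    def area(self):
        return self.length * self.breadth

# The same call, shape.area(), behaves differently for each object
for shape in (Circle(10), Rectangle(4, 5)):
    print(shape.area())
```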
1.16 Classes and Objects in Python
A class is a collection of data and the methods that interact with those data. An instance
of a class is known as an object. Each data type in Python, namely the integers, floats, strings,
booleans, lists, sets and dictionaries is an object. We will now develop a class to represent a
circle.


All we need to describe a circle is its radius. Let us also consider the colour, to make it easier
later to distinguish between the different instances of the class Circle. Here, the data attributes
of the class Circle are radius and color. The class Circle shall now be defined with a
constructor and a utility method for calculating the area, as shown below:
>>> class Circle(object):
        def __init__(self, radius, color):
            self.radius = radius
            self.color = color
        def calculate_area(self):
            area = 3.14 * self.radius * self.radius
            return area
We shall now create an object named playground whose radius is 400 meters and whose preferred
color is green. We shall also calculate its area.
>>> playground = Circle(400, 'green')
>>> area1 = playground.calculate_area()
>>> print(area1)
502400.0
We will now develop a class to represent a rectangle.


All we need to describe a Rectangle is its length and breadth. Here, the data attributes of the
class Rectangle are length and breadth. The class Rectangle shall now be defined with a
constructor and a utility method, as shown below:
>>> class Rectangle(object):
        def __init__(self, length, breadth):
            self.length = length
            self.breadth = breadth
        def calculate_perimeter(self):
            perimeter = 2 * (self.length + self.breadth)
            return perimeter
We shall now create an object named classroom whose length is 40 feet and whose breadth is
20 feet. We will then calculate its perimeter.
>>> classroom = Rectangle(40, 20)
>>> perimeter = classroom.calculate_perimeter()
>>> perimeter
120

1.17 Inheritance in Python


In object oriented programming, inheritance is a mechanism in which a new class is
derived from an already defined class. The derived class is known as a subclass or a child
class. The pre-existing class is known as the base class or a parent class or super class. The
mechanism of inheritance gives rise to hierarchy in classes. The major purpose of inheriting a
base class into one or more derived classes is code reuse. The subclass inherits all the methods
and properties of the superclass. The subclass can also create its own methods and replace the
methods of the superclass. The process of replacing the methods defined in the super class
with the new methods with the same name in the derived class is known as overriding.
We will now define a base class named person that has two data attributes, namely name
and age and two methods namely getName( ) and getAge( ). We will also define a derived
class named student, which has two additional data attributes, namely rollNumber and marks
and two additional methods, namely getRollNumber( ) and getMarks( ).
>>> class person(object):
        def __init__(self, name, age):
            self.name = name
            self.age = age
        def getName(self):
            return self.name
        def getAge(self):
            return self.age

>>> class student(person):
        def __init__(self, name, age, rollNumber, marks):
            super().__init__(name, age)
            self.rollNumber = rollNumber
            self.marks = marks
        def getRollNumber(self):
            return self.rollNumber
        def getMarks(self):
            return self.marks
The above classes shall be instantiated and used as shown below:
>>> person1 = person("Ganesh", 20)
>>> person1.getName()
'Ganesh'
>>> person1.getAge()
20
>>> student1 = student("Ganesh", 20, 101, 83)
>>> student1.getName()
'Ganesh'
>>> student1.getAge()
20
>>> student1.getRollNumber()
101
>>> student1.getMarks()
83
Chapter-2: Data Structures in Python

2.1 Introduction
Lists, Tuples, Dictionaries and Data Frames are the important data structures available
in Python. All of them are often used now for analyzing data. We will briefly discuss them one
by one.
2.2 Lists
A list is an ordered sequence of comma-separated values of any data type. The values in
a list are written between square brackets. A value in a list can be accessed by specifying its
position in the list. The position is known as index. Here is an example.
>>> list1 = [1, 2, 3, 4, 5]
>>> list1 [0]
1
The lists are mutable. This means that the elements of a list can be changed at a later stage, if
necessary.
>>> list2 = [10, 12, 14, 18, 30, 47]
>>> list2[0] = 20
>>> list2
[20, 12, 14, 18, 30, 47]
Each element of a list can be accessed via an index, as we have seen above. The elements in
a list are indexed from 0. Backward indexing from -1 is also valid. Consider the list
L = ["Michael Jackson", 10.1, 2018]. The following table represents the relationship between
the index and the elements of this list:

Element             Index   Negative Index
"Michael Jackson"   0       -3
10.1                1       -2
2018                2       -1

Here, L[0] is "Michael Jackson"
L[1] is 10.1
L[2] is 2018
L[-1] is 2018
L[-2] is 10.1
L[-3] is "Michael Jackson"
L[1:3] will provide the list [10.1, 2018]
A list can be created by using the list( ) function, as shown below:
>>> list3 = list (<sequence>)

Here, the sequence shall be a list or string or tuple. Examples:


>>> list4 = list("hello")
>>> list4
['h', 'e', 'l', 'l', 'o']
>>> list5 = list(('p', 'y', 't', 'h', 'o', 'n'))   # Tuple
>>> list5
['p', 'y', 't', 'h', 'o', 'n']
2.3 Reading in the elements of a List through the keyboard
We can use the list( ) method to read in the elements of a list through the keyboard, as
shown below:
>>> list6 = list(input("Enter the elements of the List: "))
Enter the elements of the List: 12345
>>> list6
['1', '2', '3', '4', '5']
Notice that the integers that we have entered as input are represented as strings in the above
case. In order to enter a list of integers or fractional values through the keyboard, we have to
use the eval( ) method, as shown below:
>>> marks = eval(input("Enter the marks obtained in Maths, Physics, Chemistry: "))
Enter the marks obtained in Maths, Physics, Chemistry: [92, 74, 83]
>>> percentage = eval(input("Enter percentage marks: "))
Enter percentage marks: [92.0, 74.0, 83.0]

2.4 List Operations


The most common operations that we perform with lists include joining lists,
replicating lists and slicing lists. We will briefly discuss each of them now.
Joining Two Lists
Joining two lists is as easy as performing an addition. The concatenation
operator +, when used with two lists, joins them. Here is an example:
>>> list7 = [1, 2, 4]
>>> list8 = [5, 7, 8]
>>> list7 + list8
[1, 2, 4, 5, 7, 8]
Replicating (i.e. Repeating) Lists
As in the case of Strings, you can use the * operator to replicate a List for a specified
number of times. Example:
>>> list7 * 3
[1, 2, 4, 1, 2, 4, 1, 2, 4]

Slicing the Lists


As in the case of strings, a list slice is a part of a list that is extracted from it.
Example:
>>> list8 = [10, 15, 20, 25, 30, 35, 40, 45, 50]
>>> seq1 = list8[3:6]
>>> seq1
[25, 30, 35]
Note that the slice of a list contains the elements falling between the indices 3 and 6, not
including 6 (i.e., it contains the elements at the positions 3, 4, and 5).
Recall what you have learnt about the negative indexing. In the above example, the
negative index of the element 50 is –1 and the negative index of 40 is –3. So, the above List
Slice shall also be obtained as shown below:
>>> list9 = [10, 15, 20, 25, 30, 35, 40, 45, 50]
>>> seq2 = list9 [3 : –3]
>>> seq2
[25, 30, 35]

2.5 List Methods


Python offers many built-in functions and methods for performing List manipulations.
We will now briefly learn about the important built-in methods that are used for List
manipulations.
1. The index( ) method returns the index of the first matched item from the list. Example:
>>> list10 = [13, 18, 11, 16, 18, 14]
>>> list10.index(18)
1
2. The append( ) method adds a value to the end of the List. Example:
>>> colors = ['red', 'green', 'blue']
>>> colors.append('yellow')
>>> colors
['red', 'green', 'blue', 'yellow']
3. While the append( ) method adds just one element to the List, the extend( ) method can
add multiple elements to a List.
>>> t1 = ['a', 'b', 'c']
>>> t2 = ['d', 'e']
>>> t1.extend(t2)
>>> t1
['a', 'b', 'c', 'd', 'e']
>>> t2
['d', 'e']
4. While the append( ) method and the extend( ) method insert the element(s) at the end of
the List, the insert( ) method inserts an element somewhere in between or any position of
your choice. Example:
>>> t1 = ['a', 'e', 'u']
>>> t1.insert(2, 'i')
>>> t1
['a', 'e', 'i', 'u']
5. The pop( ) method removes an element from a given position in the List and returns it. If
no index is specified, this method removes and returns the last item in the List. Example:
>>> t2 = ['k', 'a', 'e', 'i', 'p', 'q', 'u']
>>> ele1 = t2.pop(0)
>>> ele1
'k'
>>> t2
['a', 'e', 'i', 'p', 'q', 'u']
>>> ele2 = t2.pop()
>>> ele2
'u'
>>> t2
['a', 'e', 'i', 'p', 'q']
6. The pop( ) method removes an element whose position is given. But what if you know
the value of the element to be removed, but not its index (i.e. position) in the
list? Python provides the remove( ) method for this purpose. This method removes
the first occurrence of the given item from the list. Example:
>>> t3 = ['a', 'e', 'i', 'p', 'q', 'a', 'q', 'p']
>>> t3.remove('q')
>>> t3
['a', 'e', 'i', 'p', 'a', 'q', 'p']
7. The clear( ) method removes all the items from the given List. The List becomes an empty
List after this method is applied to it. Example:
>>> t4 = [2, 4, 5, 7]
>>> t4.clear( )
>>> t4
[]

8. The count( ) method returns the number of times a given item occurs in the List.
Example:
>>> t5 = [13, 18, 20, 10, 18, 23]
>>> t5.count (18)
2
9. The reverse( ) method just reverses the items in a List. It does not return anything.
Example:
>>> t6 = ['e', 'i', 'q', 'a', 'q', 'p']
>>> t6.reverse()
>>> t6
['p', 'q', 'a', 'q', 'i', 'e']
10. The sort( ) method sorts the items in a List in the ascending order by default. It does not
return anything. If we want this method to sort the items in the descending order, we have
to include the argument reverse = True. Example:
>>> t7 = ['e', 'i', 'q', 'a', 'q', 'p']
>>> t7.sort()
>>> t7
['a', 'e', 'i', 'p', 'q', 'q']
>>> t7.sort(reverse=True)
>>> t7
['q', 'q', 'p', 'i', 'e', 'a']
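Beyond reverse=True, the sort( ) method also accepts an optional key argument (a standard feature of Python's list.sort, shown here as a supplementary sketch) that lets us sort by a computed value:

```python
words = ["banana", "fig", "apple", "cherry"]
words.sort(key=len)            # sort by length instead of alphabetically
print(words)                   # ['fig', 'apple', 'banana', 'cherry']

pairs = [("Neha", 83), ("Ravi", 65), ("Asha", 92)]
pairs.sort(key=lambda p: p[1], reverse=True)   # sort by mark, descending
print(pairs)                   # [('Asha', 92), ('Neha', 83), ('Ravi', 65)]
```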
Programming Example 2.1
The following program reads in the marks obtained by a student in 5 different subjects,
stores them in a list and then calculates the average mark.
# Program to calculate the average
mark_list = eval(input("Enter the marks in 5 subjects: "))
length = len(mark_list)
total = 0
for i in range(0, length):
    total = total + mark_list[i]
average = total / length
print("Mean =", average)
Output
Enter the marks in 5 subjects: [70, 60, 80, 100, 90]
Mean = 80.0

Note: Here, the mean can also be found by using the built-in function statistics.mean( ), as shown below:
# Program using the built-in function
import statistics
mark_list = eval(input("Enter the marks in 5 subjects: "))
average = statistics.mean(mark_list)
print("Mean =", average)
Output
Enter the marks in 5 subjects: [70, 60, 80, 100, 90]
Mean = 80
Programming Example 2.2
The following program reads in the salaries of 5 employees in a startup,
stores them in a list and then calculates the median salary.
# Program to calculate the Median
salaries = eval(input("Enter the salaries of 5 employees: "))
salaries.sort()          # the median is defined on the sorted values
size = len(salaries)
if size % 2 == 0:
    mid = size // 2
    higher_element = salaries[mid]
    lower_element = salaries[mid - 1]
    average = (lower_element + higher_element) / 2
    print("Median Salary =", average)
else:
    mid = size // 2
    average = salaries[mid]
    print("Median Salary =", average)
Output
Enter the salaries of 5 employees: [50000, 40000, 70000, 90000, 100000]
Median Salary = 70000
Note: Here, the median can also be found by using the built-in function statistics.median( ), as shown
below:
# Program using the built-in function
import statistics
salaries = eval(input("Enter the salaries of 5 employees: "))
average = statistics.median(salaries)
print("Median Salary =", average)
Output
Enter the salaries of 5 employees: [50000, 40000, 70000, 90000, 100000]
Median Salary = 70000

2.6 Tuples
A Tuple is an ordered sequence of comma-separated values of any data type. The values
in a Tuple are written between round brackets (parentheses). Tuples are just like lists; the only
difference is that once we define a Tuple, we cannot change its contents. If we have to change
the contents of a Tuple, we have to store the changed contents as a new Tuple. For this reason,
Tuples are said to be immutable. Here are some Tuples:
>>> tuple1 = (2, 4.7, 10, 10.1)
>>> tuple2 = (‘a’, ‘b’, 1, 2.5, 3)
As in the case of Lists, each element of a Tuple can be accessed via an index. The element in
a Tuple are indexed from 0. Backward indexing from –1 is also valid. The following table
represents the relationship between the index and the elements in the following Tuple
T = (‘a’, ‘e’, ‘i’, ‘o’, ‘u’)
Element  Index  Negative Index
'a'      0      -5
'e'      1      -4
'i'      2      -3
'o'      3      -2
'u'      4      -1
Here, T[0] is 'a'
T[2] is 'i'
T[4] is 'u'
T[-1] is 'u'
T[-3] is 'i'
T[-5] is 'a'
A Tuple, T, can be created by using the tuple( ) function, as shown below:
>>> T = tuple(<sequence>)
Here, the sequence can be a list, a string or another tuple.
Examples:
>>> tuple1 = (1, 5, 10)
>>> tuple2 = tuple([1, 2.5, 3.7, 'a', 'b'])
>>> tuple3 = tuple("hello")
>>> tuple3
('h', 'e', 'l', 'l', 'o')
2.7 Reading in the elements of a Tuple through the keyboard
The elements of a Tuple can be provided through the keyboard by using the tuple( )
function, as shown below:
>>> tuple4 = tuple(input("Enter the elements of the Tuple: "))
Enter the elements of the Tuple: 12345
>>> tuple4
('1', '2', '3', '4', '5')
The tuple( ) around input( ) will create a tuple that uses the individual characters of the
input string as its elements, as illustrated above. But the most commonly used method to input
the elements of a Tuple is to provide them through the keyboard by using the eval( ) function,
as shown below:
>>> tuple5 = eval(input("Enter the elements of a Tuple: "))
Enter the elements of a Tuple: (1, 2, "be", [5, 6])
>>> tuple5
(1, 2, 'be', [5, 6])
When we used the tuple( ) function around the input( ) function above, each character that we
entered was stored as a separate one-character string. But when we used the eval( ) function
around the input( ) function, the numbers 1 and 2 were stored as numbers (not as the strings
'1' and '2').
2.8 Tuple Operations
The most common operations that we perform with tuples include joining tuples,
replicating tuples and slicing tuples. We will briefly discuss each of them now.
Joining two Tuples
Joining two Tuples is just like performing an addition operation. The concatenation
operator +, when used with two tuples, joins the two tuples. Here is an example.
>>> t1 = (1, 3, 5)
>>> t2 = (6, 7, 8)
>>> t1 + t2
(1, 3, 5, 6, 7, 8)
Replicating (i.e. Repeating) Tuples
As in the case of Lists, you can use the * operator to replicate a Tuple for a specified
number of times. Example:
>>> t1 * 3
(1, 3, 5, 1, 3, 5, 1, 3, 5)
Slicing the Tuples
As in the case of Lists and Strings, a slice of a Tuple is a part of a Tuple, as illustrated
below:
>>> t3 = (10, 12, 14, 20, 22, 24, 30, 32, 34)
>>> t4 = t3 [3 : –3]
>>> t4
(20, 22, 24)
Comparing Tuples
We can compare two Tuples without having to write loop-based code for it. Python
compares the two Tuples element by element (i.e., lexicographically):
>>> a = (2, 3)
>>> b = (2, 3)
>>> c = ('2', '3')
>>> d = (2.0, 3.0)
>>> e = (2, 3, 4)
>>> a == b
True
>>> a == c
False
>>> a == d
True
>>> a < e
True
Unpacking a Tuple
Creating a Tuple from a set of values is called packing. Its reverse (i.e., creating
individual values from a Tuple's elements) is called unpacking.
>>> t = (1, 2, 'A', 'B')
>>> (w, x, y, z) = t
>>> w
1
>>> x
2
>>> y
'A'
>>> z
'B'
Deleting a Tuple
The del statement is used to delete a Tuple. We can delete a complete Tuple, but we
cannot delete the individual elements of a Tuple, as Tuples are immutable.
2.9 Tuple Methods
Python offers many built-in functions and methods for performing Tuple manipulations.
We will now briefly learn about the important built-in methods that are used for Tuple
manipulations.
1. len( ) method
This method returns the length of a Tuple. (i.e. the number of elements in a Tuple).
>>> employee = ('John', 10000, 24, 'Sales')
>>> len(employee)
4
2. max( ) method
This method returns the element that has the maximum value.
>>> t1 = (10, 12, 14, 20, 22, 24, 30, 32, 34)
>>> max(t1)
34
3. min( ) method
This method returns the element that has the minimum value.
>>> t2 = (10, 12, 14, 20, 22)
>>> min(t2)
10
4. index( ) method
This method returns the index of an element in a Tuple.
>>> t3 = (3, 4, 5, 6)
>>> t3.index(5)
2
5. count( ) method
This method returns the number of times an element occurs in a Tuple.
>>> t4 = (2, 4, 2, 5, 7, 4, 8, 9, 9, 11, 7, 2)
>>> t4.count(2)
3
6. tuple( ) method
This is the constructor method. It is used to create tuples from different types of values.
>>> t = tuple ( [1, 2, 3] ) # Creating a Tuple from a List
>>> t
(1, 2, 3)
2.10 Sets
A Set is a collection of distinct elements. As in the case of Lists and Tuples, the elements
of a set can be of different data types. (i.e., Values of different data types can be placed within
a Set). Unlike Lists and Tuples, sets are unordered. This means that the Sets do not record the
element position. Sets only have unique elements. This means there is only one of a particular
element in a Set. To define a Set, you have to use curly brackets. You have to place the
elements of a Set within the curly brackets, as shown below:
>>> set1 = {"rock", "R&B", "disco", "hard rock", "pop", "soul"}
You can convert a List to a Set by using the function set( ). This is called type-casting. You
have to simply use the List as the input to the function set( ). The result will be the List
converted to a Set.
Let us go over an example. We start off with a List and input it to the function set( ),
which returns a Set. Notice how there are no duplicate elements.
>>> album_list = ["Michael Jackson", "Thriller", "Thriller", 1982]
>>> album_set = set(album_list)
>>> album_set
{'Michael Jackson', 'Thriller', 1982}
Let us go over the Set operations. These can be used to change a Set. Consider the
Set, A, given below:
>>> A = {"Thriller", "Back in Black", "AC/DC"}
We can add an item to a Set by using the add( ) method.
>>> A.add("NSYNC")
>>> A
{'AC/DC', 'Back in Black', 'NSYNC', 'Thriller'}
We can remove an item from a Set by using the remove( ) method.
>>> A.remove("NSYNC")
>>> A
{'AC/DC', 'Back in Black', 'Thriller'}
We can verify whether an element is in a Set by using the in operator, as follows:
>>> A = {"AC/DC", "Back in Black", "Thriller"}
>>> "AC/DC" in A
True
Sets also support the standard mathematical Set operations. For example, we can
find the union and the intersection of two Sets:
>>> album_set_1 = {"AC/DC", "Back in Black", "Thriller"}
>>> album_set_2 = {"AC/DC", "Back in Black", "The Dark Side of the Moon"}
>>> album_set_3 = album_set_1 & album_set_2          # intersection
>>> album_set_4 = album_set_1.union(album_set_2)     # union
>>> album_set_3
{'AC/DC', 'Back in Black'}
>>> album_set_4
{'AC/DC', 'Back in Black', 'Thriller', 'The Dark Side of the Moon'}
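Beyond union and intersection, Sets also support difference and symmetric difference; a brief sketch using the same (hypothetical) album sets:

```python
album_set_1 = {"AC/DC", "Back in Black", "Thriller"}
album_set_2 = {"AC/DC", "Back in Black", "The Dark Side of the Moon"}

# elements of album_set_1 that are not in album_set_2
print(album_set_1 - album_set_2)

# elements that are in exactly one of the two sets
print(album_set_1 ^ album_set_2)
```

The equivalent method forms are album_set_1.difference(album_set_2) and album_set_1.symmetric_difference(album_set_2).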
Here, all the elements of album_set_3 are in album_set_1. We can check whether a Set is a
subset of another Set by using the issubset( ) method. Here is an example:
>>> album_set_3.issubset(album_set_1)
True
2.11 Dictionaries
A Dictionary is a collection of key:value pairs. The elements of a Dictionary are written
between curly brackets. In the case of Lists and Tuples, an index is associated with each
element. But, in the case of a Dictionary, a key is associated with each element. The keys must
be unique. In the case of Lists and Tuples, the index is used to access the elements. But, in the
case of a Dictionary, the key is used to access the elements.
Dictionaries are mutable. So, we can change some of the elements of a Dictionary and
then store the changed elements in the same Dictionary object. No positional index is
associated with the elements of a Dictionary, so the elements are always accessed by key.
(Since Python 3.7, a Dictionary does remember the order in which its elements were inserted.)
A Dictionary, D, can be created by using a command of the following form:
>>> <Dictionary-name> = {<key> : <value>, <key> : <value>, ...}
Example
>>> teachers = {"Benedict": "Maths", "Albert": "CS", "Andrew": "Commerce"}
An element in a Dictionary can be accessed by using the key, as illustrated below:
>>> teachers["Andrew"]
'Commerce'
Traversing a Dictionary
Traversal of a collection of values means accessing and processing each of its elements.
A Dictionary can be traversed by using the for loop, as shown below:
>>> d1 = {5: "Number", "a": "String", (1, 2): "Tuple"}
>>> for key in d1:
...     print(key, ":", d1[key])
5 : Number
a : String
(1, 2) : Tuple
Programming Example 2.3
We will now write a Program to create a Phone Dictionary for our friends and then
print it.
>>> PhoneDirectory = {"Jagdish": "94437 55625", "Bala": "96297 09185",
                      "Saravanan": "99947 49333"}
>>> for name in PhoneDirectory:
...     print(name, ":", PhoneDirectory[name])
The output of the above program will be as shown below:
Jagdish : 94437 55625
Bala : 96297 09185
Saravanan : 99947 49333
Adding an element to a Dictionary
We can add a new key:value pair to a Dictionary by using a simple assignment
statement, as shown below:
>>> Employee = {"name": "John", "salary": 10000, "age": 24}
>>> Employee["dept"] = "Sales"
>>> Employee
{'name': 'John', 'salary': 10000, 'age': 24, 'dept': 'Sales'}
Updating an element in a Dictionary
Like Lists, Dictionaries are mutable. So, we can add an element to an existing
Dictionary, as shown above. We can also change (i.e., update) an existing element by using a
simple assignment statement, as shown below:
>>> Employee2 = {"name": "John", "salary": 10000, "age": 24}
>>> Employee2["salary"] = 20000
>>> Employee2
{'name': 'John', 'salary': 20000, 'age': 24}
Deleting an existing element from a Dictionary
Both the del statement and the pop( ) method can be used for this purpose, as illustrated
below:
>>> Employee3 = {"salary": 10000, "age": 24, "name": "John"}
>>> del Employee3["age"]
>>> Employee3
{'salary': 10000, 'name': 'John'}
>>> Employee4 = {"salary": 10000, "age": 24, "name": "John"}
>>> Employee4.pop("age")
24
>>> Employee4
{'salary': 10000, 'name': 'John'}
Dictionary Methods
Let us now briefly discuss the various built-in functions and methods that are provided
by Python to manipulate the elements of Dictionaries.
1. len( ) method
This method returns the number of key:value pairs in a Dictionary
>>> Employee5 = {"name": "John", "salary": 10000, "age": 24}
>>> len(Employee5)
3
2. clear( ) method
This method removes all the elements of a Dictionary and makes it an empty Dictionary.
(When we use the del statement instead, the Dictionary no longer exists, not even as an
empty Dictionary.)
>>> Employee6 = {"name": "John", "salary": 10000, "age": 24}
>>> Employee6.clear()
>>> Employee6
{}
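The contrast between clear( ) and del can be sketched as follows, using a throwaway employee dictionary:

```python
employee = {"name": "John", "salary": 10000}

employee.clear()   # the dictionary still exists, but is now empty
print(employee)    # {}

del employee       # now the name employee itself no longer exists
try:
    print(employee)
except NameError:
    print("employee no longer exists")
```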
3. get( ) method
This method helps us to get the value that is associated with a key.
>>> Employee7 = {"salary": 10000, "department": "Sales", "age": 24, "name": "John"}
>>> Employee7.get("department")
'Sales'
4. items( ) method
This method returns a view of all the key:value pairs in a Dictionary.
>>> Employee8 = {"name": "John", "salary": 10000, "age": 24}
>>> myList = Employee8.items()
>>> for x in myList:
...     print(x)
('name', 'John')
('salary', 10000)
('age', 24)
5. keys( ) method
This method returns a view of all the keys in a Dictionary.
>>> Employee9 = {"salary": 10000, "department": "sales", "age": 24, "name": "John"}
>>> Employee9.keys()
dict_keys(['salary', 'department', 'age', 'name'])
6. values( ) method
This method returns a view of all the values in a Dictionary.
>>> Employee10 = {"salary": 10000, "department": "Sales", "age": 24, "name": "John"}
>>> Employee10.values()
dict_values([10000, 'Sales', 24, 'John'])
7. update( ) method
This method merges the key:value pairs from a new dictionary into the original dictionary,
adding or replacing entries as needed.
>>> Employee11 = {"name": "John", "salary": 10000, "age": 24}
>>> Employee12 = {"name": "David", "salary": 54000, "department": "Sales"}
>>> Employee11.update(Employee12)
>>> Employee11
{'name': 'David', 'salary': 54000, 'age': 24, 'department': 'Sales'}
Note:
The elements of the Employee12 dictionary have overridden the elements of the Employee11
dictionary having the same keys. So, the values associated with the keys "name" and "salary"
have been changed.
2.12 Default Dictionary
Imagine that you are trying to count the number of occurrences of the words in a
document. An obvious approach is to create a Dictionary in which the keys are words and the
values are counts. As you check each word, you can increment its count if it is already in the
Dictionary and add it to the Dictionary if it is not.
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] = word_counts[word] + 1
    else:
        word_counts[word] = 1
Alternatively, we can handle the exception that arises when we try to look up a missing key.
word_counts = {}
for word in document:
    try:
        word_counts[word] = word_counts[word] + 1
    except KeyError:
        word_counts[word] = 1
A third approach is to use the get( ) method, which behaves gracefully in the case of missing
keys:
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
Every one of these is slightly unwieldy. In such cases, the defaultdict type is helpful.
A defaultdict is like a regular dictionary, except that when you try to look up a key it doesn't
contain, it first adds a value for it by using a zero-argument function you provided when you
created it. In order to use defaultdicts, you have to import them from the collections module.
2.13 Exception Handling
When something goes wrong, Python raises an exception. If they are not handled,
exceptions will cause our Program to crash. We can handle the exceptions by using try and
except, as shown below:
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")
Suppose that a person is not sure whether a Tuple is immutable or not. So, when he writes
code to alter the value of an element, he will make use of exception handling, as shown below:
try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a Tuple")
Suppose that a person is not sure whether a key named "Kate" is present in a Dictionary or
not. So, when she writes code to access the value associated with the key "Kate", she will
make use of exception handling so that her Program does not crash in case the key "Kate" is
not present in the Dictionary.
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no value for Kate!")
So far, we have discussed handling the exceptions that are raised automatically. Exceptions
can also be raised manually, as shown in the following example:
try:
    a = int(input("Enter the value of a: "))
    b = int(input("Enter the value of b: "))
    print("The value of a =", a)
    print("The value of b =", b)
    if (a - b) < 0:
        raise Exception("value of a-b is < 0")
except Exception as e:
    print("Received exception:", e)
Output
Enter the value of a: 15
Enter the value of b: 20
The value of a = 15
The value of b = 20
Received exception: value of a-b is < 0
2.14 Counter
A Counter turns a sequence of values into a defaultdict(int)-like object in which the keys
are mapped to counts.
from collections import Counter
c = Counter([0, 1, 2, 0])   # c is Counter({0: 2, 1: 1, 2: 1})
Counter gives us a very simple way to solve our word-counts problem:
word_counts = Counter (document) # Here, document is a list of words
The most_common( ) method of a Counter instance is often used in Natural Language
Processing.
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print(word, count)
2.15 List Comprehensions
Frequently, you will want to transform a list into another list by choosing only certain
elements, by transforming elements, or both. The Pythonic way to do this is with list
comprehensions.
even_numbers = [x for x in range(5) if x % 2 == 0]   # [0, 2, 4]
squares = [x * x for x in range(5)]                  # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers]         # [0, 4, 16]
We can similarly turn lists into dictionaries or sets, as shown below:
square_dict = {x: x * x for x in range(5)}   # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_set = {x * x for x in [1, -1]}        # {1}
If you don't need the value from the list, it is common to use an underscore as the variable:
zeroes = [0 for _ in even_numbers]   # has the same length as even_numbers
A list comprehension can include multiple for loops:
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]   # 100 pairs: (0, 0), (0, 1), ..., (9, 8), (9, 9)
The later for loops can use the results of the earlier for loops:
increasing_pairs = [(x, y)                      # only pairs with x < y
                    for x in range(10)
                    for y in range(x + 1, 10)]
2.16 Automated Testing and Assert
As Data Scientists, we will be writing a lot of code. How can we be confident that our
code is correct? One way is with types (we will discuss this later). Another way is with
automated tests.
There are elaborate frameworks for writing and running tests, but we will use the assert
statement for this purpose. It will cause your code to raise an AssertionError if your specified
condition is not met.
assert 1 + 1 == 2
assert 1 + 1 == 2, "1+1 should equal 2 but didn't"
As you can see in the second case, you can optionally add a message to be displayed if
the assertion fails. The above example is a trivial one. A real-life example is given below:
def smallest_item(xs):
    return min(xs)
assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1
The assert statement is used as shown above. It is a good practice to liberally use the assert
statement in your code, as it helps you to be confident that your code is correct.
2.17 Iterables and Generators
We can retrieve the specific elements of a List by their indices. But you don’t always
need this! A list of a billion numbers takes up a lot of memory. If you only want the elements
one at a time, there is no good reason to keep them all around in memory. If you need only
the first few elements, generating all billion values is highly wasteful.
Often all we need is to iterate over a collection of values by using for and in. In this case,
we can create generators, which can be iterated over just like lists but generate their values
lazily on demand. One way to create a generator is to use a function and the yield operator, as
shown below:
def generate_range(n):
    i = 0
    while i < n:
        yield i   # every call to yield produces a value of the generator
        i = i + 1
The following loop will make use of the yielded values one at a time until no value is left:
for i in generate_range(10):
    print(f"i: {i}")
With a generator, you can even create an infinite sequence:
def natural_numbers():
    n = 1
    while True:
        yield n
        n = n + 1
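Because such a generator never terminates on its own, you have to consume it lazily and stop on your own; one common way, sketched here with the standard itertools module, is:

```python
import itertools

def natural_numbers():
    n = 1
    while True:
        yield n
        n = n + 1

# take just the first five values; the infinite generator is never exhausted
first_five = list(itertools.islice(natural_numbers(), 5))
print(first_five)   # [1, 2, 3, 4, 5]
```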
A second way to create generators is by using for comprehensions wrapped in parentheses:
evens_below_20 = (i for i in generate_range(20) if i % 2 == 0)
Such a generator comprehension doesn’t do any work until you iterate over it by using for
or next.
Quite frequently, when we are iterating over a list or a generator, we will want not just
the values but also their indices. For this common case, Python provides an enumerate
function, which turns values into pairs (index, value):
names = ["Alice", "Bob", "Charlie", "Debbie"]
# not the Pythonic way of writing the code
for i in range(len(names)):
    print(f"name {i} is {names[i]}")
# also not the Pythonic way of writing the code
i = 0
for name in names:
    print(f"name {i} is {name}")
    i = i + 1
# the Pythonic way of writing the code
for i, name in enumerate(names):
    print(f"name {i} is {name}")
2.18 Random Number Generation
As we learn Data Science, we will frequently need to generate random numbers. We can
generate random numbers by using the functions in the random module:
import random
random.seed (10) # This ensures that we get the same results every time.
The random( ) function in the random module is used most often. This function
generates random numbers that are uniformly distributed in the interval [0, 1). Suppose
that you generate 100 numbers by using this function. Then, roughly 25 of them will lie in
the interval [0, 0.25), about another 25 in the interval [0.25, 0.50), about another 25 in
[0.50, 0.75), and the rest in [0.75, 1).
four_uniform_random_numbers = [random.random() for _ in range(4)]
# [0.57140259468, 0.42888905467, 0.57809130113, 0.20609823213]
We will sometimes use random.randrange( ) function. It takes either one or two
arguments and returns an element chosen randomly from the corresponding range:
random.randrange(10) # choose randomly from the range [0, 1, 2, ..., 8, 9]
random.randrange(3, 6) # choose randomly from the range [3, 4, 5]
There are a few more functions that we will sometimes find convenient. For example,
random.shuffle( ) randomly reorders the elements of a list:
up_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(up_to_ten)
print(up_to_ten)
# [7, 2, 6, 8, 9, 4, 10, 1, 3, 5]   (Your results will probably be different.)
If you need to randomly pick one element from a list, you can use the random.choice( )
function.
randomly_picked_value = random.choice(["Alice", "Bob", "Charlie"])
If you wish to select a random sample of values without replacement (i.e., with no
duplicates), you can use the random.sample( ) function.
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)   # [16, 36, 10, 6, 25, 9]
In order to select a random sample of values with replacement (i.e., allowing
duplicates), you can just make multiple calls to the random.choice( ) function.
four_numbers_with_replacement = [random.choice(range(10)) for _ in range(4)]
print(four_numbers_with_replacement)   # [9, 4, 4, 2]
2.19 Type Annotations
Python is a dynamically typed language. This means that, in general, it does not care
about the data types of the objects we use, as long as we use them in valid ways:
def add(a, b):
    return a + b
assert add(10, 5) == 15, "+ is valid for numbers"
assert add([1, 2], [3]) == [1, 2, 3], "+ is valid for lists"
assert add("hi ", "there") == "hi there", "+ is valid for strings"
try:
    add(10, "five")
except TypeError:
    print("cannot add an int to a string")
35

In a statically typed language, such as C++, the data types of the arguments and of the
returned value of the above add( ) function would be specified in more or less the way
shown below:
def add(a: int, b: int) -> int:
    return a + b
add(10, 5)          # You would like this to be OK
add("hi", "there")  # You would like this to be not OK
The above version of the add( ) function with the int type annotations is valid in recent
versions of Python, such as Python 3.6.
However, the above-mentioned type annotations don't actually do anything. You can
still use the annotated add( ) function to add two strings, even though it is wrong to do so,
and the call add(10, "five") will still raise the same TypeError at run time. That said, there
are still two good reasons to use type annotations in your Python code:
1. Types are an important form of documentation. The second function stub given below is
more informative than the first one.
def dot_product(x, y): ...
def dot_product(x: Vector, y: Vector) -> float: ...
2. There are external tools, such as mypy, that will read your code, inspect the type
annotations, and let you know about type errors before you ever run your code. For
example, if you run mypy over a file containing add("hi", "there"), it will warn you, as
shown below:
error: Argument 1 to "add" has incompatible type "str"; expected "int"
Like assert testing, this is a good way to find mistakes in your code before you ever run
it.
How to write Type Annotations?
For built-in types such as int, float and bool, you just use the type itself as the annotation
(e.g., def add(a: int, b: int) -> int). What if you have a list?
def total(xs: list) -> float:
    return sum(xs)
This is not wrong. But the type is not specific enough. It is clear that we really want xs
to be a list of floats, not a list of strings.
The typing module provides a number of parameterized types that we can use to do just
this:
from typing import List   # Note the capital letter L
def total(xs: List[float]) -> float:
    return sum(xs)
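The typing module parameterizes the other container types in the same way; a brief sketch (the variable names here are only illustrative):

```python
from typing import Dict, Tuple

# a dictionary whose keys are strings and whose values are ints
word_counts: Dict[str, int] = {"data": 2, "science": 1}

# a tuple whose three slots have fixed, possibly different types
triple: Tuple[int, float, str] = (5, 2.5, "five")

print(word_counts["data"], triple[2])   # 2 five
```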
Up until now, we have only specified annotations for function parameters and return
types. For variables themselves, it is usually obvious what the data type is:
x: int = 5   # Type annotation is not necessary here. It is obvious.
However, sometimes it is not obvious:
values = []          # Data type is not clear here
best_so_far = None   # Data type is not clear here
In such cases, we can supply inline type hints, as shown below:
from typing import List, Optional
values: List[int] = []
best_so_far: Optional[float] = None   # allowed to be either a float or None
Chapter-3:
Case Study: DATASCIENCESTER
3.1 Introduction
Assume that you have been hired as a Data Scientist by the Company named
DataSciencester, which promotes a social network for Data Scientists. Assume further that the
VP of Networking wants you to write code to identify who the "key connectors" are among
the Data Scientists. In order to do this, he gives you a dump of data about the Data Scientists
who have joined DataSciencester's social network. What does this data dump look like? It
consists of a list of Dictionaries, one per user, each containing that user's id and name. Here
is that list:
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"},
]
3.2 Friendship Network
The VP of Networking also gives you the "friendship" data, represented as a list of Tuples.
Each Tuple is a pair of ids of Users who are friends with each other.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
Here, the first Tuple (0, 1) indicates that the Data Scientist with id 0 (i.e., Hero) and the Data
Scientist with id 1 (i.e., Dunn) are friends. The Data Scientists' friendship network can be
visualized as a graph, with the users as the nodes and the friendships as the edges.
Having friendships represented as a list of pairs, as shown above, is not the easiest way
to work with them. In order to find all the friendships for user 0 (i.e., Hero), we have to iterate
over every pair looking for pairs containing 0. If you have a lot of pairs, this will take a long
time.
Instead, we will now create a Dictionary where the keys are User ids and the values are
lists of friend ids (Looking things up in a dictionary is very fast.) We still have to look at every
pair to create the dictionary, but we only have to do that once, and we will get fast lookups
after that:
# Initialize the dictionary with an empty list for each user id:
friendships = {user["id"]: [ ] for user in users}
# Loop over the friendship pairs to populate it:
for i, j in friendship_pairs:
friendships[i].append(j) # Add j as a friend of user i
friendships[j].append(i) # Add i as a friend of user j
{0 : [1,2], 1 : [0,2,3], 2: [0,1,3], 3 : [1,2,4], 4 : [3,5], 5 : [4,6,7], 6 : [5,8], 7 : [5, 8],
8 : [6,7,9], 9 : [8] }
We have the friendships in a dictionary now. So, we can ask questions, like “What is the
average number of connections?” First we have to find the total number of connections, by
summing up the lengths of all the friends lists:
def number_of_friends(user): # How many friends does user have?
user_id = user["id"]
friend_ids = friendships[user_id]
return len(friend_ids)
total_connections = sum(number_of_friends(user) for user in users) # 24
num_users = len(users) # Number of users
avg_connections = total_connections / num_users # 24 / 10 = 2.4
It is also easy to find the most connected people: they are the people who have the largest
number of friends. In order to find them, we sort the Users from those who have the most
friends to those who have the fewest.
# Create a list of Tuples of the form (user_id, number_of_friends).
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
num_friends_by_id.sort(
    key=lambda id_and_friends: id_and_friends[1], reverse=True)
# Each pair is of the form (user_id, num_friends):
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
Here, the Users 1, 2, 3, 5 and 8 are the most connected people. They are somehow central
to the Data Scientists’ Network. What we have just now computed is the Network Metric
named degree centrality.
3.3 Data Scientists You May Know
Suppose the VP of Fraternization wants to encourage more connections among your
members. She asks you to design a "Data Scientists You May Know" suggester.
Your first instinct will be to suggest that Users may know the friends of their friends. So,
you write some code to iterate over their friends and collect the friends' friends:
def foaf_ids_bad (user):
return [foaf_id
for friend_id in friendships[user["id"]]
for foaf_id in friendships[friend_id]]
When you call the above function with the argument users[0] (i.e., Hero), you will get the
following result:
[0, 2, 3, 0, 1, 3]
The above List includes User 0 twice, since Hero is indeed a friend of both of his own friends
(i.e., Users 1 and 2). It includes Users 1 and 2, although they are both friends with Hero
already. It includes User 3 twice, as Chi (User 3) is reachable through two different friends:
print(friendships[0])   # [1, 2]
print(friendships[1])   # [0, 2, 3]
print(friendships[2])   # [0, 1, 3]
Suppose you wish to produce a count of mutual friends. In that case, you also have to
exclude the people already known to the User:
from collections import Counter   # not loaded by default
def friends_of_friends(user):
user_id = user["id"]
return Counter(
foaf_id
for friend_id in friendships[user_id] # For each of my friends,
for foaf_id in friendships[friend_id] # find their friends
if foaf_id != user_id # who aren't me
and foaf_id not in friendships[user_id] # and aren't my friends.
)
print(friends_of_friends(users[3] ) ) # Counter({0: 2, 5: 1})
Here, both users 1 and 2 are friends of User 3 as well as User 0. So, User 3 has 2 mutual
friends with User 0. Similarly, User 4 is the friend of User 3 as well as User 5. So, User 3 has
1 mutual friend with User 5.
As a Data Scientist, you may enjoy meeting Users with similar interests. For example,
Users 2, 3 and 5 are interested in Python. So, User 2 may enjoy meeting the Users 3 and 5.
Assume that, after asking around, you manage to get your hands on this data, as a List of
Tuples, each of which is of the form (user_id, interest).
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
Here, we observe that Users 0 and 9 (i.e., Hero and Klein) have no friends in common, but
they share interests in Java and Big Data. It is now easy to build a function that finds the
Users with a certain interest:
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]
The above code will work. But it has to examine the whole list of interests for every
search. If we have a lot of Users and interests (or if we want to do a lot of searches), it is
preferable to build an index from interests to users and another index from users to interests.
from collections import defaultdict

# Keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# Building an index from users to interests
# Keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
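With the two indexes built, a lookup becomes a direct dictionary access rather than a scan of the whole list. A minimal self-contained sketch, using a two-user subset of the interests list above:

```python
from collections import defaultdict

interests = [(0, "Hadoop"), (0, "Java"), (9, "Hadoop"), (9, "Big Data")]

user_ids_by_interest = defaultdict(list)   # interest -> user ids
interests_by_user_id = defaultdict(list)   # user id -> interests
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

print(user_ids_by_interest["Hadoop"])   # [0, 9]
print(interests_by_user_id[9])          # ['Hadoop', 'Big Data']
```

Building the indexes costs one pass over the list; every search after that is constant time.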
Now, it is easy to find who has the most interests in common with a given User by
carrying out the following three steps:
1. Iterate over the user’s interests.
2. For each interest, iterate over the other users with that interest.
3. Keep count of how many times we see each other user.
The required Code is given below:
def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
    )
print(most_common_interests_with(users[0]))  # Counter({9: 3, 1: 2, 8: 1, 5: 1})
print(most_common_interests_with(users[1]))  # Counter({0: 2})
print(most_common_interests_with(users[9]))  # Counter({0: 3, 5: 1, 8: 1})
3.4 Salaries and Experience
Suppose that the VP of Public Relations asks if you can provide some fun facts about
how much Data Scientists earn. Salary data is of course sensitive, but he manages to provide
you with an anonymized data set containing each User's salary (in dollars) and tenure as a Data
Scientist (in years):
salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
(48000, 0.7), (76000, 6),
(69000, 6.5), (76000, 7.5),
(60000, 2.5), (83000, 10),
(48000, 1.9), (63000, 4.2)]
The natural first step is to plot the data, which we will see how to do in the next Chapter.
Here is the figure that we will get:
From the above Figure, it seems clear that people with more experience tend to earn
more. How can you turn this into a fun fact? Your first idea is to look at the average salary for
each tenure:
# Keys are years, values are lists of the salaries for each tenure.
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

# Keys are years, each value is average salary for that tenure.
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
The above Code segment will generate the following output:
{0.7: 48000.0, 1.9: 48000.0, 2.5: 60000.0, 4.2: 63000.0,
6: 76000.0, 6.5: 69000.0, 7.5: 76000.0, 8.1: 88000.0,
8.7: 83000.0, 10: 83000.0}
The above output is not useful, as we are just reporting each individual User's salary:
no two of the Users have the same tenure. It may be more helpful to bucket the tenures:
def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"
Then, we can group together the salaries corresponding to each bucket:
# Keys are tenure buckets, values are lists of salaries for that bucket.
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)
Finally, we can compute the average salary for each group:
# Keys are tenure buckets, values are average salary for that bucket.
average_salary_by_bucket = {
    tenure_bucket: sum(salaries) / len(salaries)
    for tenure_bucket, salaries in salary_by_tenure_bucket.items()
}
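Putting the pieces together, the bucketing and averaging steps can be run end to end as one self-contained script on the salary data:

```python
from collections import defaultdict

salaries_and_tenures = [(83000, 8.7), (88000, 8.1), (48000, 0.7),
                        (76000, 6), (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10), (48000, 1.9), (63000, 4.2)]

def tenure_bucket(tenure):
    if tenure < 2:
        return "less than two"
    elif tenure < 5:
        return "between two and five"
    else:
        return "more than five"

# Group the salaries by bucket, then average within each bucket
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    salary_by_tenure_bucket[tenure_bucket(tenure)].append(salary)

average_salary_by_bucket = {
    bucket: sum(salaries) / len(salaries)
    for bucket, salaries in salary_by_tenure_bucket.items()
}
print(average_salary_by_bucket)
# 'less than two': 48000.0, 'between two and five': 61500.0, 'more than five': ~79166.67
```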
The above Code segment will generate the following output:
{'between two and five': 61500.0,
'less than two' : 48000.0,
'more than five' : 79166.67}
Here, we get our insight: “Data Scientists with more than five years’ experience earn 65%
more than Data Scientists with little or no experience!”
Here, we chose the buckets in a fairly arbitrary way. What we would really like is to make
some statement about the salary effect of having an additional year of experience. We will do
so later on by developing a Linear Regression model.
3.5 Paid Accounts
Suppose that the VP of Revenue wants to better understand which Users pay for the
services rendered by the DataScienceter social network Website and which Users make use of
those services free of cost. She knows their names, but she wants more actionable information
(insights). When you look at the following data provided to you by her, you notice that there
seems to be a correspondence between years of experience and paid accounts:
0.7 paid
1.9 unpaid
2.5 paid
4.2 unpaid
6.0 unpaid
6.5 unpaid
7.5 unpaid
8.1 unpaid
8.7 paid
10.0 paid
From the above data, you note that Users with very few years of experience and Users with
very many years of experience tend to pay, while Users with average amounts of experience
don't. Accordingly, if you want to create a model - though this is definitely not enough data to
base a model on - you may try to predict "paid" for Users with very few and very many years
of experience, and "unpaid" for Users with middling amounts of experience:
def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"
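Applied to a few of the tenures in the table above, the rule behaves as follows (note that it is only a rough rule of thumb; for instance, it labels the 1.9-year User as "paid" even though that User did not pay):

```python
def predict_paid_or_unpaid(years_experience):
    if years_experience < 3.0:
        return "paid"
    elif years_experience < 8.5:
        return "unpaid"
    else:
        return "paid"

for tenure in [0.7, 1.9, 6.5, 10.0]:
    print(tenure, predict_paid_or_unpaid(tenure))
# 0.7 paid, 1.9 paid (actually unpaid), 6.5 unpaid, 10.0 paid
```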
With more data and more mathematics, we can build a model predicting the likelihood
that a User will pay based on his years of experience. We will build such a model later on by
using the concept of Logistic Regression.
3.6 Topics of Interest
Suppose that the VP of Content Strategy asks you for data about what topics Users are
most interested in, so that she can plan out her blog calendar accordingly. You already have
the new data given below:
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"), (1, "Postgres"),

(8, "neural networks"), (8, "deep learning"), (8, "Big Data"),
(8, "artificial intelligence"),
(9, "Hadoop"), (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
One simple way to find the most popular interests is to count the words:
1. Express each interest in lowercase. (For example, "Java" should be changed to "java".)
2. Split each interest into words. (For example, "deep learning" should be split into "deep"
and "learning".)
3. Count the results. (For example, "java" occurs 3 times and "hbase" occurs 2 times.)
The above three steps can be implemented by using the following Code:
words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())
Now, it is easy to list out the words that occur more than once:
for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)
The following output will be generated by the above Code. We will learn later on more
sophisticated ways to extract topics from data by using Natural Language Processing:
learning 3
java 3
python 3
big 3
data 3
hbase 2
regression 2
cassandra 2
statistics 2
probability 2
hadoop 2
networks 2
machine 2
neural 2
scikit-learn 2
r 2
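The same two snippets can also be run as a self-contained script on a small, made-up subset of the interests data:

```python
from collections import Counter

# A small subset of the interests list, enough to show repeated words
interests = [(0, "Hadoop"), (0, "Big Data"), (8, "deep learning"),
             (8, "Big Data"), (9, "Hadoop"), (9, "Big Data")]

words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())

for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)   # prints: big 3, data 3, hadoop 2
```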
The following worked listing shows, for each User, the other Users who share each of his
interests (listed after the interest), and hence the User(s) with whom he has the most interests
in common:
User 0 (Hadoop: 9; Big Data: 8, 9; HBase: 1; Java: 5, 9; Spark; Storm; Cassandra: 1) has the
most interests in common with User 9.
User 1 (NoSQL; MongoDB; Cassandra: 0; HBase: 0; Postgres) has the most interests in
common with User 0.
User 2 (Python: 3, 5; scikit-learn: 7; scipy; numpy; statsmodels; pandas) has the most interests
in common with Users 3, 5 and 7.
User 3 (R: 5; Python: 2, 5; statistics: 6; regression: 4; probability: 6) has the most interests in
common with Users 5 and 6.
User 4 (machine learning: 7; regression: 3; decision trees; libsvm) has the most interests in
common with Users 3 and 7.
User 5 (Python: 2, 3; R: 3; Java: 0, 9; C++; Haskell; programming languages) has the most
interests in common with User 3.
User 6 (statistics: 3; probability: 3; mathematics; theory) has the most interests in common
with User 3.
User 7 (machine learning: 4; scikit-learn: 2; Mahout; neural networks: 8) has the most interests
in common with Users 2, 4 and 8.
User 8 (neural networks: 7; deep learning; Big Data: 0, 9; artificial intelligence) has the most
interests in common with Users 0, 7 and 9.
User 9 (Hadoop: 0; Java: 0, 5; MapReduce; Big Data: 0, 8) has the most interests in common
with User 0.
Summarizing, the User(s) sharing the most interests with each User are:
{ 0: [9], 1: [0], 2: [3, 5, 7], 3: [5, 6], 4: [3, 7], 5: [3], 6: [3], 7: [2, 4, 8], 8: [0, 7, 9], 9: [0] }
Chapter-4:
VISUALIZING DATA

4.1 Introduction
Making plots and visualizations is one of the most important tasks in data analysis. It
may be a part of the exploratory process. For example, it may help us to identify the outliers,
do required data transformations or come up with ideas for models.
A wide variety of tools exist for visualizing data. We will use the matplotlib library,
which is widely used now. This library is not part of the core Python library. In order to use
matplotlib interactively, we can start IPython in Pylab mode by using the ipython --pylab command.
The matplotlib.pyplot module is to be imported by using the following command:
> import matplotlib.pyplot as plt
Plots in matplotlib reside within a Figure object. We can create a new figure by using
the figure( ) method.
> fig = plt.figure( )
> ax = fig.add_subplot (1,1,1)
> from numpy.random import randn
> ax.plot ( randn (1000).cumsum( ) )

Note-1
We can’t make a plot with a blank figure. We have to create one or more subplots by
using the add_subplot( ) function, as shown below:
> fig = plt.figure( )
> ax1 = fig.add_subplot(2,2,1)
> ax2 = fig.add_subplot(2,2,2)
> ax3 = fig.add_subplot(2,2,3)
> ax4 = fig.add_subplot(2,2,4)

Note-2
We can draw a Histogram, Scatter Diagram and a Line Diagram in the above subplots
by using the following commands:
> ax1.hist ( randn(100), bins = 20, color = 'k', alpha = 0.3 )
> ax2.scatter ( np.arange(30), np.arange(30) + 3*randn(30) )
> ax3.plot ( randn(50).cumsum( ), 'k--' )
4.2 Colors, Markers, Line Styles
Matplotlib's main plot( ) function accepts arrays of X and Y coordinates and optionally
a string abbreviation indicating color and line style. For example, in order to plot x versus y
with green dashes, we have to execute the following commands:
> fig = plt.figure( ); ax = fig.add_subplot(1, 1, 1)
> ax.plot (x, y, 'g--')
The above command can also be written in the following explicit manner:
> ax.plot (x, y, linestyle = '--', color = 'g')
There are a number of color abbreviations provided for commonly used colors ('g' for green,
'k' for black, etc.). The full set of line styles can be seen by looking at the docstring for plot.
Line plots can additionally have markers to highlight the actual data points. Since
matplotlib creates a continuous line plot, interpolating between points, it can occasionally be
unclear where the points lie. Markers are helpful in such cases. The marker can be part of the
style string, which must have the color followed by the marker type and line style, as shown in the
following figure:
> plt.plot ( randn(30).cumsum( ), 'ko--' )
The above command can be written more explicitly, as shown below:
> plt.plot ( randn(30).cumsum( ), color = 'k', linestyle = 'dashed', marker = 'o' )
4.3 Ticks, Labels, Legends
For most kinds of plot decorations, we use the pyplot interface. It consists of methods
like xlim, xticks and xticklabels, which control the plot range, tick locations and tick labels,
respectively. Here is an example:
> fig = plt.figure( )
> ax = fig.add_subplot (1,1,1)
> ax.plot ( randn (1000).cumsum( ) )

In order to change the X axis ticks, we have to use the set_xticks( ) and set_xticklabels( )
methods. The set_xticks( ) method instructs matplotlib where to place the ticks along the data
range. By default, these locations will also be the labels. But we can set any other values as
the labels by using the set_xticklabels( ) method.
> ticks = ax.set_xticks ( [0, 250, 500, 750, 1000] )
> labels = ax.set_xticklabels ( ['one', 'two', 'three', 'four', 'five'],
rotation = 30, fontsize = 'small' )
Lastly, the set_xlabel( ) method gives a name to the X axis and the set_title( ) method gives
the subplot its title.
> ax.set_title ( 'My first matplotlib plot' )
> ax.set_xlabel ( 'Stages' )

Modifying the Y axis consists of the same process, substituting y for x in the above
commands.
Legends are another critical element for identifying plot elements. Each plot element is
given a name through the label argument of the plot( ) function, and the legend is built from
these labels, as shown below.
> fig = plt.figure( )
> ax = fig.add_subplot (1,1,1)
> ax.plot ( randn (1000).cumsum( ), 'k', label = 'one' )
> ax.plot ( randn (1000).cumsum( ), 'k--', label = 'two' )
> ax.plot ( randn (1000).cumsum( ), 'k.', label = 'three' )
Once we do this, we can invoke ax.legend( ) or plt.legend( ) to automatically create the legend:
> ax.legend ( loc = 'best' )
Here, loc tells matplotlib where to place the legend. The value 'best' tells matplotlib to
choose a location that is most out of the way.
4.4 Annotations
In addition to the standard plot types, you may wish to draw your own annotations, which
can consist of text, arrows or other shapes. Annotations and text can be added by
using the text( ), arrow( ) and annotate( ) functions. The text( ) function draws text at the given
coordinates (x, y) on the plot, with optional custom styling:
> ax.text (x, y, 'Hello world!', family = 'monospace', fontsize = 10)
Annotations can draw both text and arrows arranged appropriately. For example, in the
following figure about the share market trend, the annotations “Peak of bull market”, “Bear
Stearns Falls” and “Lehman Bankruptcy” are shown with arrows at the appropriate places.

4.5 Saving Plots to File
The active figure can be saved to a file by using the savefig( ) function, as shown below:
> plt.savefig ('figure1.svg')
The file type here is svg, but it can also be png or pdf. Additional options, such as the
resolution and the surrounding whitespace, can also be specified:
> plt.savefig ('figure2.png', dpi = 400, bbox_inches = 'tight')
A figure need not only be saved to disk. It can also be written to any file-like object, such as a
StringIO:
> from io import StringIO
> buffer = StringIO( )
> plt.savefig (buffer)
> plot_data = buffer.getvalue( )
This option is useful for saving dynamically-generated images over the Web.
4.6 Drawing a Line Chart
The following code makes use of the plot( ) function to draw a simple Line Chart.
> from matplotlib import pyplot as plt
> years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
> gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
> plt.plot (years, gdp, color = 'green', marker = 'o', linestyle = 'solid')
> plt.title ("Nominal GDP") # A title is added
> plt.ylabel ("Billions of $") # A label is added to the y-axis
> plt.show( )

4.7 Bar Charts
A Bar Chart is a good choice when you want to show how some quantity varies among
some discrete set of items. The following code generates a Bar Chart that shows how many
Academy Awards were won by each of a variety of movies.
> movies = ["Annie Hall", "Benhur", "Casablanca", "Gandhi", "WestSideStory"]
> num_oscars = [5, 11, 3, 8, 10]
> # Plot bars with left x-coordinates [0, 1, 2, 3, 4] and heights num_oscars
> plt.bar ( range (len (movies) ), num_oscars )
> plt.title ("My Favourite Movies") # A title is added
> plt.ylabel ("# of Academy Awards") # y-axis is labelled
> # Label x-axis with movie names at bar centers
> plt.xticks ( range ( len (movies) ), movies )
> plt.show( )

A Bar Chart can also be a good choice for plotting Histograms of bucketed numeric
values, as shown in the following figure, in order to visually explore how the values are
distributed.
> from collections import Counter
> grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
> # Bucket grades by decile, but put 100 in with the 90s
> histogram = Counter ( min (grade // 10 * 10, 90) for grade in grades )
> plt.bar ( [x+5 for x in histogram.keys( )], # Shift Bars right by 5
histogram.values( ), # Give each Bar its correct height
10, # Give each Bar a width of 10
edgecolor = (0, 0, 0) ) # Black edges for each Bar
> plt.axis ( [-5, 105, 0, 5] ) # X-axis from -5 to 105,
# Y-axis from 0 to 5
> plt.xticks ( [10*i for i in range(11)] ) # X-axis labels at 0, 10, ..., 100
> plt.xlabel ("Decile")
> plt.ylabel ("# of Students")
> plt.title ("Distribution of Exam 1 grades")
> plt.show( )
In the above code, the third argument to plt.bar( ) method specifies the bar width. Here,
we chose a width of 10, to fill the entire decile. We also shifted the bars right by 5, so that, for
example, the “10” bar (which corresponds to the decile 10-20) will have its center at 15 and
hence occupy the correct range. We also added a black edge to each bar to make them visually
distinct.
In the above code, the arguments of the plt.axis( ) method indicate that we want the x-axis
to range from -5 to 105 (just to leave a little space on the left and right), and that the y-axis
should range from 0 to 5. The call to the plt.xticks( ) method puts x-axis labels at 0, 10, 20, ...,
100.
4.8 Line Charts with Legends
The following Code draws several Line Charts with a legend for each. Line Charts
are a good choice for showing trends, as illustrated in the following figure.
> variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
> bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
> total_error = [x+y for x, y in zip (variance, bias_squared)]
> xs = [i for i, _ in enumerate (variance)]
> # We can make multiple calls to the plt.plot( ) method
> # to show multiple series on the same chart
> plt.plot (xs, variance, 'g-', label = 'variance') # Green Solid Line
> plt.plot (xs, bias_squared, 'r-.', label = 'bias^2') # Red dot-dashed Line
> plt.plot (xs, total_error, 'b:', label = 'total error') # Blue Dotted Line
> plt.legend (loc = 9) # loc = 9 means "top center"
> plt.xlabel ("model complexity")
> plt.xticks ([ ])
> plt.title ("The Bias-Variance Tradeoff")
> plt.show( )

4.9 Scatter Plots
A Scatter Plot is the right choice for visualizing the relationship between two paired sets
of data. For example, the following figure illustrates the relationship between the number of
friends your Users have and the number of minutes they spend on the DataScienceter Website
every day.
> friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67 ]
> minutes = [ 175, 170, 205, 120, 220, 130, 105, 145, 190 ]
> labels = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i' ]
> plt.scatter (friends, minutes)
> # Label each point
> for label, friend_count, minute_count in zip (labels, friends, minutes):
plt.annotate ( label,
xy = (friend_count, minute_count), # Put the label with its point
xytext = (5, -5), # but slightly offset
textcoords = 'offset points' )
> plt.title ("Daily Minutes vs. Number of Friends")
> plt.xlabel ("# of friends")
> plt.ylabel ("daily minutes spent on the site")
> plt.show( )

4.10 Data Frame's plot( ) Method
A Data Frame's plot( ) method plots each of its columns as a different line on the same
subplot, creating a legend automatically. The following code will plot each column of the Data
Frame df defined below:
> df = DataFrame ( np.random.randn(10, 4).cumsum(0),
columns = ['A', 'B', 'C', 'D'],
index = np.arange (0, 100, 10) )
> df.plot( )
Chapter-5:
NumPy and Pandas Packages

5.1 Introduction
NumPy, short for Numerical Python, is the fundamental package that is required for high
performance scientific computing and data analysis. It is the foundation on which many higher-
level tools are built in Python. Here are some of the things that the NumPy package provides:
1) ndarray, a fast and space-efficient multidimensional array that provides vectorized
arithmetic operations and sophisticated broadcasting capabilities
2) Standard mathematical and statistical functions for fast operations on entire arrays of data
without having to write loops
3) Tools for reading / writing array data to disk and working with memory-mapped files.
While NumPy package by itself does not provide very much high-level data analytical
functionality, having an understanding of ndarrays and array-oriented computing will help
you to use tools like Pandas much more effectively. You will learn the following concepts in
the subsequent Sections. These concepts are helpful for developing most data analysis
operations.
1) Fast vectorized array operations for data munging and cleaning, subsetting and filtering,
transformation, and any other kinds of computations
2) Common array methods for performing sorting and set operations
3) Efficient descriptive statistics and aggregating/summarizing data
4) Data alignment and relational data manipulations for merging and joining together
heterogeneous data sets
5) Group-wise data manipulations (aggregation, transformation etc.).
While the NumPy package provides the computational foundation for all the above
operations, we have to mainly use the Pandas Package as our basis for performing most kinds
of data analysis (especially for structured or tabular data such as a Data Frame) as it provides
a rich, high-level interface making most common data tasks very concise and simple.
5.2 The NumPy ndarray
One of the key features of the NumPy package is the ndarray, an N-dimensional array
object. It is a fast, flexible container for large data sets in Python. We can perform mathematical
operations on whole arrays exactly in the same way in which we perform such operations on
scalar elements. Here is an example:
> import numpy as np
> data = [ [0.9526, -0.246, -0.8856], [0.5639, 0.2379, 0.9104] ]
> arr1 = np.array (data)
> arr1
array( [ [ 0.9526, -0.246 , -0.8856],
[ 0.5639, 0.2379, 0.9104] ] )
> arr1 * 10
array( [ [ 9.526 , -2.46 , -8.856 ],
[ 5.639 , 2.379 , 9.104 ] ] )
> arr1 + arr1
array( [ [ 1.9052, -0.492 , -1.7712],
[ 1.1278, 0.4758, 1.8208] ] )
> arr1.shape
(2, 3) # Number of rows and columns
> arr1.dtype
dtype ('float64')

5.3 Creating ndarrays
The easiest way to create an ndarray is to use the array( ) function. This function accepts
any sequence-like object (List, Tuple, etc.) as its argument and produces a new ndarray
containing the passed data. Here is an example.
> data1 = [6, 7.5, 8, 0, 1]
> arr1 = np.array(data1)
> arr1
array ( [ 6. , 7.5, 8. , 0. , 1. ] )
A nested sequence, like a list of equal-length lists, can also be passed as the argument of
the array( ) function. A multidimensional array will be created in such a case.
> data2 = [ [1, 2, 3, 4], [5, 6, 7, 8] ]
> arr2 = np.array(data2)
> arr2
> array ( [ [1, 2, 3, 4],
[5, 6, 7, 8] ] )
> arr2.ndim
2 # number of dimensions
> arr2.shape
(2, 4) # number of rows and number of columns
> arr2.dtype
dtype ('int64') # data type of the elements in the array
> np.arange(5)
array ( [0, 1, 2, 3, 4] )
Here, the arange( ) function is similar to the built-in Python range( ) function. While the
range( ) function returns a sequence of Python integers, the arange( ) function returns an ndarray.
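The difference can be seen directly in a short self-contained sketch (np is numpy imported in the usual way):

```python
import numpy as np

r = range(5)        # built-in: a sequence of Python ints
a = np.arange(5)    # NumPy: an ndarray

print(list(r))      # [0, 1, 2, 3, 4]
print(a * 2)        # vectorized arithmetic: [0 2 4 6 8]
```

Multiplying a range by 2 would raise a TypeError; on the ndarray, the same expression is an elementwise operation.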
5.4 Vectorization Operation
Arrays enable us to express batch operations on data without writing any for loops. This
is usually called vectorization. Any arithmetic operation between equal-size arrays applies
the operation elementwise. An operation between differently-sized arrays is called
broadcasting. We will discuss it later.
> arr = np.array ( [ [1., 2., 3.], [4., 5., 6.] ] )
> arr
array ( [ [1., 2., 3.],
[4., 5., 6.] ] )
> arr * arr
array ( [ [1., 4., 9.],
[16., 25., 36.] ] )
> arr - arr
array ( [ [0., 0., 0.],
[0., 0., 0.] ] )
Arithmetic operations with scalars will be done as we expect. The value will be propagated to
each element.
> 1/arr
array ( [ [1., 0.5, 0.3333],
[0.25, 0.2, 0.1667] ] )
> arr ** 0.5
array ( [ [1., 1.4142, 1.7321]
[2., 2.2361, 2.4495] ] )
5.5 Array Indexing and Slicing
The elements of a NumPy ndarray are indexed from 0. They can be accessed exactly in
the way in which the elements in a List are accessed.
> arr = np.arange(10)
> arr
array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
> arr[5]
5 # Value at index 5 is 5
> arr[5:8]
array( [5, 6, 7] ) # This is called Array Slicing
> arr[5:8] = 12 # 12 is assigned to arr[5], arr[6] and arr[7]
> arr
array( [0, 1, 2, 3, 4, 12, 12, 12, 8, 9] )
Here, we assign the scalar value 12 to the array slice [5, 6, 7]. Here, the value 12 is
propagated (i.e., broadcasted) to the entire slice. An important distinction from lists is that the
array slices are views on the original array. This means that the data are not copied, and any
modification to the view will be reflected in the source array.
> arr_slice = arr[5:8]
> arr_slice[1] = 12345
> arr
array( [0, 1, 2, 3, 4, 12, 12345, 12, 8, 9] )
> arr_slice[:] = 64
> arr
array( [0, 1, 2, 3, 4, 64, 64, 64, 8, 9] )
If we want a copy of a slice of an ndarray instead of a view, we will need to explicitly
copy the array (for example, arr[5:8].copy( ) ).
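The view-versus-copy distinction can be demonstrated in a few self-contained lines:

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]          # a slice is a view: it shares memory with arr
view[:] = 64             # writing through the view modifies arr itself
print(arr)               # [ 0  1  2  3  4 64 64 64  8  9]

arr_copy = arr[5:8].copy()   # an explicit copy has independent storage
arr_copy[:] = 0              # writing to the copy leaves arr untouched
print(arr[5:8])              # [64 64 64]
```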
In a two-dimensional array, the element at each index is a one-dimensional array, as
shown below:
> arr2d = np.array( [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ] )
> arr2d[0]
array( [1, 2, 3])
> arr2d[0][2]
3
> arr2d [0, 2] # Alternative method of accessing an element
3
The element at each index of a three-dimensional array is a two-dimensional array, as
illustrated below:
> arr3d = np.array( [ [ [1, 2, 3], [4, 5, 6] ], [ [7, 8, 9], [10, 11, 12] ] ] )
> arr3d[0]
array( [ [1, 2, 3],
[4, 5, 6] ] )
Both the scalar values and arrays can be assigned to arr3d[0], as shown below:
> old_values = arr3d[0].copy()
> arr3d[0] = 42 # a scalar value is assigned
> arr3d
array( [ [ [42, 42, 42],
[42, 42, 42] ],
[ [7, 8, 9],
[10, 11, 12] ] ] )
> arr3d[0] = old_values # An array is assigned
> arr3d
array( [ [ [1, 2, 3],
[4, 5, 6]],
[ [7, 8, 9],
[10, 11, 12] ] ] )

5.6 Transposing Array and Swapping Axes
Transposing is a special form of reshaping. It returns a view on the underlying data
without copying anything. Arrays have the transpose( ) method for performing the
transposing operation. They also have the special T attribute for the same purpose. We may
use either of these two options.
> arr = np.arange(15).reshape( (3, 5) )
> arr
array( [ [ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14] ] )
> arr.T # We shall also use the command arr.transpose( ) here.
array ( [ [0, 5, 10],
[1, 6, 11],
[2, 7, 12],
[3, 8, 13],
[4, 9, 14] ] )
When doing matrix computations, we often need to compute the inner matrix product
XᵀX by using the np.dot( ) method.
> arr = np.random.randn(6, 3)
> np.dot (arr.T, arr)
array( [ [2.584 , 1.8753, 0.8888],
[1.8753, 6.6636, 0.3884],
[0.8888, 0.3884, 3.9781] ] )
Transposing is just a special case of swapping axes. In the above example, we can swap
the axes (i.e., rows and columns) by invoking the swapaxes(0, 1) method.
> arr = np.arange(15).reshape( (3, 5) )
> arr.swapaxes (0, 1)
array( [ [0, 5, 10],
[1, 6, 11],
[2, 7, 12],
[3, 8, 13],
[4, 9, 14] ] )
5.7 Universal Functions: Fast Element-wise Array Functions
A universal function (abbreviated as ufunc) performs elementwise operations on the data
present in ndarrays. We can think of universal functions as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results. Many
ufuncs are simple elementwise transformations, like sqrt( ) or exp( ).
> arr = np.arange(10)
> np.sqrt(arr)
array( [0. , 1. , 1.4142, 1.7321, 2. , 2.2361, 2.4495, 2.6458, 2.8284, 3. ] )
> np.exp(arr)
array( [1., 2.7183, 7.3891, 20.0855, 54.5982, 148.4132,
403.4288, 1096.6332, 2980.958 , 8103.0839] )
The above ufuncs are known as unary ufuncs, as they take only one argument. On the
other hand, ufuncs such as add( ) and maximum( ) are called binary ufuncs, as they take
two arguments.
> x = randn(8)
> y = randn(8)
> x
array( [0.0749, 0.0974, 0.2002, -0.2551, 0.4655, 0.9222, 0.446, -0.9337] )
> y
array( [0.267 , -1.1131, -0.3361, 0.6117, -1.2323, 0.4788, 0.4315, -0.7147] )
> np.maximum(x, y)
array( [0.267 , 0.0974, 0.2002, 0.6117, 0.4655, 0.9222, 0.446, -0.7147] )

5.8 Mathematical and Statistical Methods
A set of mathematical and statistical methods is available for computing statistics for
an entire array or for the data along a single axis alone. Most of these are available both as
array instance methods and as top-level NumPy functions.
> arr = np.random.randn (5, 4) # Normally-distributed data
> arr.mean( ) # Array Instance Method
0.0628
> np.mean (arr) # NumPy Function
0.0628
> arr = np.array( [ [0, 1, 2], [3, 4, 5], [6, 7, 8] ] )
> arr.sum(0) # Sum along axis 0 (down each column)
array( [ 9, 12, 15] )
> arr.mean(axis = 1) # Mean along axis 1 (across each row)
array( [1., 4., 7.] )
> arr.cumsum(0)
array( [ [0, 1, 2],
[3, 5, 7],
[9, 12, 15] ] )
> arr.cumprod(1)
array( [ [0, 0, 0],
[3, 12, 60],
[6, 42, 336] ] )
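Because the axis argument is easy to mix up, it helps to verify it on a small array: axis 0 aggregates down the rows (one result per column), while axis 1 aggregates across the columns (one result per row):

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

print(arr.sum(axis=0))    # totals down each column: [ 9 12 15]
print(arr.sum(axis=1))    # totals across each row:  [ 3 12 21]
print(arr.mean(axis=1))   # mean of each row:        [1. 4. 7.]
```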
> arr = randn(8)
> arr
array( [0.6903, 0.4678, 0.0968, -0.1349, 0.9879, 0.0185, -1.3147, -0.5425] )
> arr.sort( )
> arr
array( [-1.3147, -0.5425, -0.1349, 0.0185, 0.0968, 0.4678, 0.6903, 0.9879] )
We can get a 4 by 4 matrix (i.e., array) of random numbers from the Standard Normal
Distribution, as shown below:
> samples = np.random.normal (size = (4, 4) )
> samples
array ( [ [ 0.1241, 0.3026, 0.5238, 0.0009],
[ 1.3438, -0.7135, -0.8312, -2.3702],
[-1.8608, -0.8608, 0.5601, -1.2659],
[ 0.1198, -1.0635, 0.3329, -2.3594] ] )

5.9 Saving and Loading Arrays
The np.save( ) and np.load( ) methods are used to save and load arrays. Arrays are saved
by default in an uncompressed raw binary format with the file extension .npy.
> arr = np.arange(10)
> arr
array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
> np.save ('some_array', arr) # The array will be stored in the current directory.
> np.load ('some_array.npy') # The array will get loaded
array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
The loadtxt( ) method can be used to load the numerical contents of a text file into a
2D array, as shown below:
> cat array_ex.txt
0.580052, 0.186730, 1.040717, 1.134411,
0.194163, -0.636917, -0.938659, 0.124094,
-0.126410, 0.268607, -0.695724, 0.047428,
> arr = np.loadtxt ('array_ex.txt', delimiter = ',')
> arr
array( [ [ 0.5801, 0.1867, 1.0407, 1.1344],
[ 0.1942, -0.6369, -0.9387, 0.1241],
[-0.1264, 0.2686, -0.6957, 0.0474] ] )
The np.savetxt( ) method performs the inverse operation. It writes an array to a delimited
text file.
5.10 Getting Started with Pandas
The Pandas package contains high-level data structures and manipulation tools that are
designed to make data analysis fast and easy in Python. Pandas is built on top of NumPy
package and makes it easy to use in NumPy centric applications. Series and DataFrame are
the two important data structures available in Pandas. They are used so often that it is
preferable to import them into the local namespace by using the following command:
> from pandas import Series, DataFrame
After we execute the following import command, we can refer to pandas as pd in our
Code:
> import pandas as pd
5.11 Series
A Series is a one-dimensional array-like object. It contains an array of data and an
associated array of data labels, called its index. The simplest way of creating a Series is by
passing an array of data to the Series( ) method.
> obj = Series ( [4, 7, -5, 3] )
> obj
0 4
1 7
2 -5
3 3
Since we did not specify an index for the above data, a default one consisting of the integers
0, 1, 2 and 3 is created.
> obj.values
array( [4, 7, -5, 3] )
> obj.index
Int64Index ( [0, 1, 2, 3] )
Often, it is desirable to create a Series with an index identifying each data point, as shown
below:
> obj2 = Series ( [4, 7, -5, 3], index = ['d', 'b', 'a', 'c'] )
> obj2
d 4
b 7
a -5
c 3
> obj2.values
array ( [4, 7, -5, 3] )
> obj2.index
Index ( [d, b, a, c], dtype = object)
> obj2 ['a']
-5
> obj2 ['d'] = 6
> obj2 [ ['c', 'a', 'd'] ]
c 3
a -5
d 6
> obj2 [ obj2 > 0]
d 6
b 7
c 3
> obj2 * 2
d 12
b 14
a -10
c 6
> np.exp(obj2)
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
> 'b' in obj2
True
> 'e' in obj2
False
If we have data contained in a Python dictionary, we can create a Series from it by passing
the dictionary to the Series( ) method as its argument.

> sdata = {‘Ohio’: 35000, ‘Texas’: 71000, ‘Oregon’: 16000, ‘Utah’: 5000}
> obj3 = Series (sdata)
> obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
In the above example, the dictionary’s keys in sorted order are taken as the index of the
resulting Series. We can ourselves specify the keys when creating a Series, as shown below:
> states = [‘California’, ‘Ohio’, ‘Oregon’, ‘Texas’]
> obj4 = Series (sdata, index = states)
> obj4
California NaN
Ohio 35000
Oregon 16000
Texas 71000
In the above example, three values (i.e., 35000, 16000 and 71000) found in sdata were
placed in the appropriate locations. But since no value for ‘California’ was found in sdata, it
appears as NaN (not a number) in the above output.
An important Series feature for many applications is that it automatically aligns
differently-indexed data in arithmetic operations.
> obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
> obj4
California NaN
Ohio 35000
Oregon 16000
Texas 71000
> obj3 + obj4
California NaN
Ohio 70000
Oregon 32000
Texas 142000
Utah NaN
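This alignment can be reproduced and checked directly (re-creating obj3 and obj4 from the text):

```python
import numpy as np
import pandas as pd

# The two Series from the text, with partially overlapping indices.
obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
obj4 = pd.Series({'California': np.nan, 'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000})
total = obj3 + obj4   # labels are aligned; non-overlapping labels produce NaN
print(total)
```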
Both the Series object itself and its index have a name attribute. This attribute integrates
with other key areas of Pandas functionality.
> obj4.name = ‘population’
> obj4.index.name = ‘state’

> obj4
state
California NaN
Ohio 35000
Oregon 16000
Texas 71000
Name: population
A Series’s index can be altered in place by assignment, as illustrated below:
> obj.index = [‘Bob’, ‘Steve’, ‘Jeff’, ‘Ryan’]
> obj
Bob 4
Steve 7
Jeff -5
Ryan 3

5.12 Data Frame


As we have seen earlier, Series and DataFrame are the two important Data Structures
available in the Pandas package. A DataFrame represents a tabular, spreadsheet-like data
structure. It contains an ordered collection of columns. All the values in each column should
be of the same data type, but different columns may be of different data types. For example,
while all the values in the first column may be numeric, all the values in the second column
may be strings.
The DataFrame has both a row index and a column index. It can be thought of as a
dictionary of Series. Internally, a DataFrame stores the data in a two-dimensional format.
However, we can easily represent much higher-dimensional data in a tabular format by using
hierarchical indexing. This is a key ingredient in many of the more advanced data-handling
features in Pandas.
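Hierarchical indexing is covered in detail later; as a brief sketch of the idea, a Series with two index levels can already represent table-like data (the state/year values below are invented for illustration):

```python
import pandas as pd

# Two index levels let one-dimensional data behave like a table.
s = pd.Series([1.5, 1.7, 2.4, 2.9],
              index=[['Ohio', 'Ohio', 'Nevada', 'Nevada'],
                     [2000, 2001, 2001, 2002]])
print(s['Ohio'])      # selecting on the outer level yields a sub-Series
print(s.unstack())    # the inner level becomes the columns of a DataFrame
```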
5.13 Creation of Data Frame
The most common way to construct a DataFrame is passing a dictionary of equal-length
lists or ndarrays to the DataFrame( ) method, as shown below:
> data = { ‘state’ : [‘Ohio’, ‘Ohio’, ‘Ohio’, ‘Nevada’, ‘Nevada’],
‘year’ : [2000, 2001, 2002, 2001, 2002],
‘pop’ : [1.5, 1.7, 3.6, 2.4, 2.9] }
> frame = DataFrame (data)

> frame
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In the above example, the index is assigned automatically as 0, 1, 2, 3 and 4. The columns
are placed in sorted order (i.e., pop, state, year). However, if we specify a sequence of
columns, the DataFrame’s columns will be exactly what we pass.
> DataFrame ( data, columns = [‘year’, ‘state’, ‘pop’] )
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
As in the case of Series, if we pass a column that is not contained in data, it will appear
with NaN values in the result.
> frame2 = DataFrame ( data, columns = [‘year’, ‘state’, ‘pop’, ‘debt’],
index = [‘one’, ‘two’, ‘three’, ‘four’, ‘five’] )
> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
> frame2.columns
Index ( [year, state, pop, debt], dtype = object)
A column in a DataFrame can be retrieved as a Series either by dictionary-like notation
or by attribute:
> frame2 [‘state’]
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
> frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002

A row in a DataFrame can be retrieved by position or name by a couple of methods, such
as the ix indexing field.
> frame2.ix [‘three’]
year 2002
state Ohio
pop 3.6
debt NaN
Name: three
Columns in a DataFrame can be modified by assignment. For example, the empty ‘debt’
column can be assigned a scalar value or an array of values:
> frame2 [‘debt’] = 16.5
> frame2
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5

> frame2 [‘debt’] = np.arange(5.)


> frame2
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
When we assign a list or an array to a column in a DataFrame, the length of the list or
the length of the array should be same as the length of the column of the DataFrame. But, if
we assign a Series, it will instead be conformed exactly to the DataFrame’s index, inserting
missing values in any holes:
> val = Series ( [-1.2, -1.5, -1.7], index = [‘two’, ‘four’, ‘five’] )
> frame2 [‘debt’] = val
> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dictionary.
> frame2 [‘eastern’] = frame2.state == ‘Ohio’
> frame2
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
> del frame2 [‘eastern’]
> frame2.columns
Index ( [year, state, pop, debt], dtype = object)
> pop = {‘Nevada’ : {2001: 2.4, 2002: 2.9},
‘Ohio’ : {2000: 1.5, 2001: 1.7, 2002: 3.6} }
Here, pop is a nested dictionary of dictionaries. When it is passed as the data to the
DataFrame( ) method, it will interpret the outer dictionary keys as the columns and the inner
keys as the row indices.
> frame3 = DataFrame( pop)
> frame3
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
We can transpose the above result, if we wish to do so.
> frame3.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
The keys in the inner dictionaries are unioned and sorted to form the index in the result.
This is not true if an explicit index is specified.
> DataFrame (pop, index = [2001, 2002, 2003])
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
Dictionaries of Series are treated in much the same way.
> pdata = {‘Ohio’: frame3 [‘Ohio’] [ :-1],
‘Nevada’: frame3 [‘Nevada’] [ :2] }

> DataFrame (pdata)


Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
If a DataFrame’s index and columns have their name attributes set, these will also be
displayed.
> frame3.index.name = ‘year’
> frame3.columns.name = ‘state’
> frame3
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
Like Series, the values attribute returns the data contained in the DataFrame as a 2D
ndarray:
> frame3.values
array( [ [nan, 1.5],
[2.4, 1.7],
[2.9, 3.6] ] )
If the DataFrame’s columns are of different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:
> frame2.values
array( [ [2000, Ohio, 1.5, nan],
[2001, Ohio, 1.7, -1.2],
[2002, Ohio, 3.6, nan],
[2001, Nevada, 2.4, -1.5],
[2002, Nevada, 2.9, -1.7] ], dtype = object)
5.14 Pandas Index Objects
Pandas Index objects are responsible for holding the axis labels and axis names. Any
sequence of labels used when constructing a Series or a DataFrame is internally converted to
an Index:
> obj = Series (range(3), index = [‘a’, ‘b’, ‘c’])
> index = obj.index
> index
Index ( [a, b, c], dtype = object)
> index [1: ]
Index ( [b, c], dtype = object)
Index objects are immutable. So, they cannot be modified by the User.

> index [1] = ‘d’ # Error will be reported.


Immutability of Index objects is important so that Index objects can be safely shared
among data structures:
> index = pd.Index (np.arange(3) )
> obj2 = Series ( [1.5, -2.5, 0], index = index )
> obj2.index is index
True
In addition to being array-like, an Index also functions as a fixed-size set.
> ‘Ohio’ in frame3.columns
True
> 2003 in frame3.index
False
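These set-like and immutability properties can be sketched in a few lines (method names per the standard pandas API):

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
print('b' in idx)                        # membership test, like a set
print(idx.union(pd.Index(['c', 'd'])))   # set union of two Index objects
try:
    idx[1] = 'z'                         # immutable: item assignment raises TypeError
except TypeError as exc:
    print('immutable:', exc)
```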
5.15 Reindexing Series and DataFrames
A new index can be created for a Series. When we do this, the data will be conformed to
the new index. A new object will be created, as shown below:
> obj = Series ( [4.5, 7.2, -5.3, 3.6], index = [‘d’, ‘b’, ‘a’, ‘c’] )
> obj
d 4.5
b 7.2
a -5.3
c 3.6
> obj2 = obj.reindex ( [‘a’, ‘b’, ‘c’, ‘d’, ‘e’] )
> obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
In the above example, the data are rearranged according to the new index. As the index
value ‘e’ is not present in obj, NaN is assigned as the value of ‘e’ in obj2. We can ourselves
assign a value, such as zero as the value of ‘e’ as shown below:
> obj.reindex ( [‘a’, ‘b’, ‘c’, ‘d’, ‘e’], fill_value = 0 )
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0

In the case of DataFrame, the reindex( ) method can be used to alter the row index or
column index or both. When passed just a sequence, the rows are reindexed.
> frame = DataFrame (np.arange(9).reshape( (3,3) ),
index = [‘a’, ‘c’, ‘d’],
columns = [‘Ohio’, ‘Texas’, ‘California’] )
> frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
> frame2 = frame.reindex ( [‘a’, ‘b’, ‘c’, ‘d’] )
> frame2
Ohio Texas California
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
The columns can be reindexed by using the columns keyword, as shown below:
> states = [‘Texas’, ‘Utah’, ‘California’]
> frame.reindex (columns = states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Both the rows and columns can be reindexed in one shot, though reindexing and data
interpolation will be applied to the rows (axis 0) first:
> frame.reindex (index = [‘a’, ‘b’, ‘c’, ‘d’], method = ‘ffill’, columns = states)
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Reindexing can also be done by label-indexing with ix, as illustrated below:
> frame.ix [ [‘a’, ‘b’, ‘c’, ‘d’], states ]
Texas Utah California
a 1 NaN 2
b NaN NaN NaN
c 4 NaN 5
d 7 NaN 8

5.16 Dropping entries from Series and Data Frames


Dropping entries from an axis would otherwise require a bit of munging and set logic, so
the drop( ) method returns a new object with the indicated value(s) deleted from an axis
(i.e., deleted from a row or column).
> obj = Series (np.arange(5.), index = [‘a’, ‘b’, ‘c’, ‘d’, ‘e’] )
> new_obj = obj.drop (‘c’)
> new_obj
a 0
b 1
d 3
e 4
> obj.drop ( [‘d’, ‘c’] )
a 0
b 1
e 4
In the case of a Data Frame, Value(s) can be dropped from rows or columns. (In the case
of rows, axis = 0. In the case of columns, axis = 1).
> data = DataFrame( np.arange(16).reshape ( (4, 4) ),
index = [‘Ohio’, ‘Colorado’, ‘Utah’, ‘New York’],
columns = [‘one’, ‘two’, ‘three’, ‘four’] )
> data.drop ( [‘Colorado’, ‘Ohio’] )
one two three four
Utah 8 9 10 11
New York 12 13 14 15
> data.drop (‘two’, axis = 1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
> data.drop ( [‘two’, ‘four’ ], axis = 1 )
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
5.17 Indexing, Selection and Filtering
Only integers can be used for indexing NumPy arrays (i.e., ndarrays). But this restriction
is not there in the case of Series indexing. In the following example, ‘a’, ‘b’, ‘c’ and ‘d’ are
used as indices.

> obj = Series (np.arange(4.), index = [‘a’, ‘b’, ‘c’, ‘d’] )


> obj [‘b’]
1.0
> obj [ [‘b’, ‘a’, ‘d’] ]
b 1
a 0
d 3
> obj [1]
1.0
> obj [2:4]
c 2
d 3
> obj [ [1, 3] ]
b 1
d 3
Slicing with labels behaves differently than normal Python slicing. While the endpoint is not
included in normal Python, it is included in Series.
> obj [ ‘b’ : ‘c’ ]
b 1
c 2
Setting one or more values in a Series works just as you would expect.
> obj [ ‘b’ : ‘c’ ] = 5
> obj
a 0
b 5
c 5
d 3
In the case of DataFrames, indexing is used for retrieving one or more columns.
> data = DataFrame (np.arange(16).reshape( (4, 4) ),
index = [‘Ohio’, ‘Colorado’, ‘Utah’, ‘New York’],
columns = [‘one’, ‘two’, ‘three’, ‘four’ ] )
> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

> data [‘two’]


Ohio 1
Colorado 5
Utah 9
New York 13
> data [ [‘three’, ‘one’] ]
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
For DataFrame label-indexing on the rows, the special indexing field ix is used. It enables
us to select a subset of the rows and columns from a DataFrame with NumPy-like notation
plus axis labels. This is also a less verbose way to do reindexing.
> data.ix [ ‘Colorado’, [‘two’, ‘three’] ]
two 5
three 6
> data.ix [ [‘Colorado’, ‘Utah’], [3, 0, 1] ]
four one two
Colorado 7 4 5
Utah 11 8 9
> data.ix [2]
one 8
two 9
three 10
four 11
> data.ix [ : ‘Utah’, ‘two’ ]
Ohio 1
Colorado 5
Utah 9
> data.ix [ data.three > 5, : 3 ]
one two three
Colorado 4 5 6
Utah 8 9 10
New York 12 13 14
5.18 Arithmetic and Data Alignment
When adding together two Series, if any index pairs are not the same, the respective
index in the result will be the union of the index pairs. Here is a simple example:

> s1 = Series ( [7.3, -2.5, 3.4, 1.5], index = [‘a’, ‘c’, ‘d’, ‘e’] )
> s2 = Series ( [-2.1, 3.6, -1.5, 4, 3.1], index = [‘a’, ‘c’, ‘e’, ‘f’, ‘g’] )
> s1
a 7.3
c -2.5
d 3.4
e 1.5
> s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
> s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations. In the case of Data Frame, alignment is
performed on both the rows and the columns:
> df1 = DataFrame ( np.arange(9.).reshape( (3, 3) ), columns = list(‘bcd’),
index = [‘Ohio’, ‘Texas’, ‘Colorado’] )
> df2 = DataFrame ( np.arange(12.).reshape( (4, 3) ), columns = list(‘bde’),
index = [‘Utah’, ‘Ohio’, ‘Texas’, ‘Oregon’] )
> df1
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
> df2
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
Adding these two Data Frames together returns a Data Frame whose index and columns
are the unions of the ones in each Data Frame:

> df1 + df2


b c d e
Colorado NaN NaN NaN NaN
Ohio 3 NaN 6 NaN
Oregon NaN NaN NaN NaN
Texas 9 NaN 12 NaN
Utah NaN NaN NaN NaN
When we perform arithmetic operations between differently indexed two Series or two
DataFrames, we may want to fill with a special value, like 0, when an axis label is found in one
Series/DataFrame but not found in the other Series/DataFrame.
> df1 = DataFrame ( np.arange(12.).reshape( (3, 4) ), columns = list(‘abcd’) )
> df2 = DataFrame ( np.arange(20.).reshape( (4, 5) ), columns = list(‘abcde’) )
> df1
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
> df2
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
> df1 + df2
a b c d e
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN
> df1.add (df2, fill_value = 0)
a b c d e
0 0 2 4 6 4
1 9 11 13 15 9
2 18 20 22 24 14
3 15 16 17 18 19
When we reindex a Series or Data Frame, we can specify a different fill value:
> df1.reindex (columns = df2.columns, fill_value = 0)

a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0
5.19 Arithmetic Operations between Data Frames and Series
As in the case of NumPy arrays, arithmetic operations between a DataFrame and a Series
are well-defined. We will first define a two-dimensional array. Then, we will subtract the elements
in the zeroth row from the elements in each row.
> arr = np.arange(12.).reshape( (3, 4) )
> arr
array( [ [0., 1., 2., 3.],
[4., 5., 6., 7.],
[8., 9., 10., 11.] ] )
> arr[0]
array( [0., 1., 2., 3.] )
> arr - arr[0]
array( [ [0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.] ] )
This is referred to as broadcasting. Operations between a DataFrame and a Series are
similar.
> frame = DataFrame ( np.arange(12.).reshape( (4, 3) ), columns = list(‘bde’),
index = [‘Utah’, ‘Ohio’, ‘Texas’, ‘Oregon’] )
> series = frame.ix [0]
> frame
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
> series
b 0
d 1
e 2
By default, arithmetic between DataFrame and Series matches the index of the Series on
the DataFrame’s columns, broadcasting down the rows:
> frame - series

b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
If an index value is not found in either the DataFrame’s columns or the Series’s index,
the DataFrame and the Series will be reindexed to form the union:
> series2 = Series ( range(3), index = [‘b’, ‘e’, ‘f’] )
> frame + series2
b d e f
Utah 0 NaN 3 NaN
Ohio 3 NaN 6 NaN
Texas 6 NaN 9 NaN
Oregon 9 NaN 12 NaN
If we wish to instead broadcast over the columns, matching on the rows, we have to use
one of the arithmetic methods:
> series3 = frame[‘d’]
> frame
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
> series3
Utah 1
Ohio 4
Texas 7
Oregon 10
> frame.sub( series3, axis = 0 )
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
Here, the axis number that we pass is the axis to match on. In the above example, we
mean to match on the DataFrame’s row index and broadcast across.
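A minimal, self-contained re-creation of this example confirms the axis rule:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']
result = frame.sub(series3, axis=0)  # axis=0: match on the row index, broadcast across columns
print(result)
```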
5.20 Function Application and Mapping
The ufuncs that we applied to NumPy ndarrays work fine with pandas objects.
> frame = DataFrame ( np.random.randn(4,3), columns = list(‘bde’),
index = [‘Utah’, ‘Ohio’, ‘Texas’, ‘Oregon’] )

> frame
b d e
Utah -0.204708 0.478943 -0.519439
Ohio -0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 -1.296221
> np.abs (frame)
b d e
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221
Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply( ) method does exactly this:
> f = lambda x : x.max( ) - x.min( )
> frame.apply (f)
b 1.802165
d 1.684034
e 2.689627
> frame.apply ( f, axis = 1)
Utah 0.998382
Ohio 2.521511
Texas 0.676115
Oregon 2.542656
Chapter-6:
DATA WRANGLING

6.1 Introduction
Much of the programming work in data analysis and modeling is spent on data
preparation: loading, cleaning, transforming and rearranging. Sometimes, the way that data are
stored in files or databases is not the way you need it for a data processing application.
Fortunately, standard libraries in Python, such as Pandas, provide us with a high-level,
flexible, and high-performance set of methods for loading, cleaning, transforming and
rearranging data. We will discuss such methods in this Chapter.
6.2 Merging Data Sets
A merge operation combines data sets by linking rows using one or more keys. This
operation is also known as a join operation and it is often used in RDBMSs. Here is an example
of the merge( ) function:
> df1 = DataFrame ( { ‘key’ : [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘a’, ‘b’],
‘data1’ : range(7) } )
> df2 = DataFrame ( { ‘key’ : [‘a’, ‘b’, ‘d’],
‘data2’ : range(3) } )
> df1
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
> df2
data2 key
0 0 a
1 1 b
2 2 d
> pd.merge (df1, df2, on = ‘key’)
data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1

If the column names are different in each Data Frame, we can specify the column names
separately:
> df3 = DataFrame ( { ‘lkey’ : [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘a’, ‘b’],
‘data1’ : range(7) } )
> df4 = DataFrame ( { ‘rkey’ : [‘a’, ‘b’, ‘d’ ],
‘data2’ : range(3) } )
> pd.merge ( df3, df4, left_on = ‘lkey’, right_on = ‘rkey’ )
data1 lkey data2 rkey
0 2 a 0 a
1 4 a 0 a
2 5 a 0 a
3 0 b 1 b
4 1 b 1 b
5 6 b 1 b
Note that the keys ‘c’ and ‘d’ and their associated data are not listed in the above results.
By default, the merge( ) function does an ‘inner’ join. Only the keys that are in both the
Data Frames, and their associated values, will be listed in this case.
The outer join will take the union of the keys, combining the effect of applying both left
and right joins, as shown below:
> pd.merge (df1, df2, how = ‘outer’)
data1 key data2
0 2 a 0
1 4 a 0
2 5 a 0
3 0 b 1
4 1 b 1
5 6 b 1
6 3 c NaN
7 NaN d 2
Here, we merge df1 and df2. This is an example of a many-to-one merge situation. The
data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in
the key column. Many-to-many merging is also possible. Such joins form the Cartesian
product of the rows.
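The many-to-many case can be sketched with made-up frames (the names below are illustrative): two rows with key ‘b’ on each side yield 2 × 2 = 4 ‘b’ rows in the result.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'b', 'a'], 'rval': [4, 5, 6]})
merged = pd.merge(left, right, on='key')  # Cartesian product within each key
print(merged)
```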
Sometimes, we wish to use the index of a Data Frame as the merge key. In such a case,
we can pass left_index = True or right_index = True (or both) as the argument of the merge( )
function to indicate that the index should be used as the merge key. Here is an example:
> left1 = DataFrame ( { ‘key’ : [‘a’, ‘b’, ‘a’, ‘a’, ‘b’, ‘c’],
‘value’ : range(6) } )
> right1 = DataFrame ( { ‘group_val’ : [3.5, 7] }, index = [‘a’, ‘b’] )

> left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
> right1
group_val
a 3.5
b 7
> pd.merge ( left1, right1, left_on = ‘key’, right_index = True )
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
Here, the key ‘c’ and its associated value 5 are not listed in the above result, as the inner
join is done by default. Here is an example for outer join.
> pd.merge (left1, right1, left_on = ‘key’, right_index = True, how = ‘outer’)
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
6.3 Reshaping and Pivoting
The reshape and pivot operations are the fundamental operations for rearranging tabular
data. Hierarchical indexing provides a consistent way to rearrange the data in a Data Frame.
Rotating or pivoting the data from the columns to the rows is known as the stack operation.
Pivoting from rows to columns is known as the unstack operation.
> data = DataFrame ( np.arange(6) . reshape( (2, 3) ),
index = pd.Index ( [‘Ohio’, ‘Colorado’], name = ‘state’ ),
columns = pd.Index ( [‘one’, ‘two’, ‘three’], name = ‘number’) )

> data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
When the stack( ) method is applied on this data, columns will become rows. Thus, a
Series will be produced. (A Series is a one-dimensional array. It will have an associated array
of indices.)
> result = data.stack( )
> result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
From the above hierarchically-indexed Series, we can rearrange the data back into a Data
Frame by using the unstack( ) method, as shown below:
> result.unstack( )
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
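The round trip can be verified directly (a minimal sketch mirroring the example above):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()          # columns rotate into an inner index level
roundtrip = result.unstack()   # and rotate back into columns
print(roundtrip.equals(data))  # the original DataFrame is recovered
```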
6.4 Data Transformation - Removing Duplicates
If duplicate rows are found in a Data Frame, they can be dropped by using the
drop_duplicates( ) method. Here is an example:
> data = DataFrame ( { ‘K1’ : [‘one’] * 3 + [‘two’] * 4,
‘K2’ : [1, 1, 2, 3, 3, 4, 4] } )
> data
K1 K2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
If a row (i.e., record) in a Data Frame is duplicated, the method duplicated( ) will return
True. Otherwise, this method will return False. We shall use this method to find the

duplicated rows. This method will keep the first observed value combination and treat the next
one as duplicated. For example, in the above Data Frame, the contents of first row and second
row are same. Here, the duplicated method will treat the first row as the original one and report
the second row as a duplicate as shown below:
> data.duplicated( )
0 False
1 True
2 False
3 False
4 True
5 False
6 True
As we have seen above, we shall use the drop_duplicates( ) method to drop the duplicated
rows. In the above example, the rows at index 1, 4 and 6 will be dropped. The remaining
four rows will be retained as shown below:
> data.drop_duplicates( )
K1 K2
0 one 1
2 one 2
3 two 3
5 two 4
In the above example, the second row is dropped because the values in columns ‘K1’
and ‘K2’ (i.e., ‘one’, 1) are same as the values in columns ‘K1’ and ‘K2’ in first row. In this
example, we drop a row only if the values in all the columns of that row are same as the values
in all the corresponding columns of another row. But, sometimes, we have to drop a row as a
duplicate even if the value in just one column of that row is same as the value in that column
in another row. Here is an example.
> data [‘V1’] = range (7)
> data
K1 K2 V1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6
> data.drop_duplicates ( [‘K1’] )
K1 K2 V1
0 one 1 0
3 two 3 3
Here, we filter the duplicates only based on the ‘K1’ column.
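A hedged note: in recent pandas versions a keep parameter (assumed available in your installation; older releases used a take_last flag instead) selects which occurrence of a duplicate survives:

```python
import pandas as pd

data = pd.DataFrame({'K1': ['one'] * 3 + ['two'] * 4,
                     'K2': [1, 1, 2, 3, 3, 4, 4]})
data['V1'] = range(7)
# keep='last' retains the last occurrence of each duplicate instead of the first.
last = data.drop_duplicates(['K1', 'K2'], keep='last')
print(last)
```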

6.5 Data Transformation by using a Function or Mapping


For many Data Frames, we may wish to perform some transformation based on the values
in a column. Consider the following hypothetical data collected about some kinds of meat:
> data = DataFrame ( { ‘food’ : [ ‘bacon’, ‘pulled pork’, ‘bacon’,
‘pastrami’, ‘corned beef’, ‘bacon’,
‘pastrami’, ‘honey ham’, ‘nova lox’],
‘ounces’ : [4, 3, 12, 6, 7.5, 8, 3, 5, 6] } )
> data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 pastrami 6.0
4 corned beef 7.5
5 bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose that we want to add a column indicating the type of animal that each food came
from. We shall write down a mapping of each distinct meat type to the kind of animal:
> meat_to_animal = {
‘bacon’ : ‘pig’,
‘pulled pork’ : ‘pig’,
‘pastrami’ : ‘cow’,
‘corned beef’ : ‘cow’,
‘honey ham’: ‘pig’,
‘nova lox’: ‘salmon’
}
> data [‘animal’] = data [‘food’].map(meat_to_animal)
> data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 pastrami 6.0 cow
4 corned beef 7.5 cow
5 bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

Here, we use the map( ) method to perform element-wise transformations. The
map( ) method takes the dictionary-like object meat_to_animal as its argument. It can also
accept a function as its argument.
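As a sketch of passing a function instead, the lookup can be combined with case normalization (the mixed-case food names below are invented for the example):

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
                  'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}
data = pd.DataFrame({'food': ['Bacon', 'pastrami', 'Nova lox'],
                     'ounces': [4, 6, 6]})
# The function normalizes case before the dictionary lookup.
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)
```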
6.6 Data Transformation - Replacing Values
When a value in a column of a Data Frame is missing, some people enter the sentinel
value -999 instead of entering NA or NaN. (In Pandas, NaN is used instead of NA to represent
a missing value.) We shall replace -999 by NaN in such a case by using the replace( ) method,
as shown below:
> data = Series ( [1.0, -999.0, 2.0, -999.0, -1000.0, 3.0] )
> data
0 1
1 -999
2 2
3 -999
4 -1000
5 3
> data.replace (-999, np.nan)
0 1
1 NaN
2 2
3 NaN
4 -1000
5 3
If we wish to replace multiple values at once, we have to pass a list to the replace( ) function,
as shown below:
> data.replace ( [-999, -1000], np.nan )
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
If we wish to use a different replacement for each value, then we have to pass a list of
substitutes, as shown below:
> data.replace ( [-999, -1000], [np.nan, 0] )
0 1
1 NaN
2 2
3 NaN
4 0
5 3

We shall also pass a Dictionary as the argument to the replace( ) method, as shown below:
> data.replace ( {-999: np.nan, -1000: 0} )
0 1
1 NaN
2 2
3 NaN
4 0
5 3
6.7 Data Transformation - Discretization and Binning
Continuous data (heights, weights, etc.) are often separated (i.e., discretized) into bins
(i.e., class intervals) for analysis. Suppose we have data about a group of people in a study,
and we want to group them into discrete age buckets (i.e., class intervals), such as 18-25, 25-
35, 35-60 and 60-100. (Here, these age groups represent youth, young-adult, middle-aged and
seniors.) The cut( ) method, qcut( ) method and the value_counts( ) method are very helpful in
this context.
> ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
> bins = [18, 25, 35, 60, 100]
> cats = pd.cut (ages, bins)
> cats
array ( [ (18, 25], (18, 25], (18, 25], (25, 35], (18, 25], (18, 25],
(35, 60], (25, 35], (60, 100], (35, 60], (35, 60], (25, 35] ],
dtype = object )
> cats.labels
array ( [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1] )
> cats.levels
Index ( [ (18, 25], (25, 35], (35, 60], (60, 100] ], dtype = object)
> pd.value_counts (cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
Here, 5, 3, 3 and 1 are the frequencies of the class intervals 18-25, 25-35, 35-60 and 60-100.
In a representation such as (18, 25], the parenthesis means that that side is open, while the
square bracket means it is closed (i.e., inclusive). So, the value 25 belongs to the interval
(18, 25]. (It does not belong to the interval (25, 35].) Which side is closed can be changed by
passing the argument right = False to the cut( ) method, as shown below:
> pd.cut ( ages, [18, 26, 36, 61, 100], right = False )
array ( [ [18, 26), [18, 26), [18, 26), [26, 36), [18, 26), [18, 26),
[36, 61), [26, 36), [61, 100), [36, 61), [36, 61), [26, 36) ],
dtype = object )

We can also pass our own bin names by passing a list or array to the labels option, as
shown below:
> group_names = [ ‘Youth’, ‘YoungAdult’, ‘MiddleAged’, ‘Senior’ ]
> pd.cut ( ages, bins, labels = group_names )
array ( [Youth, Youth, Youth, YoungAdult, Youth, Youth, MiddleAged,
YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult],
dtype = object )
If we pass to the cut( ) method an integer number of bins instead of explicit bin edges, it
will compute equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data divided into fourths:
> data = np.random.rand (20)
> pd.cut (data, 4, precision = 2)
array ( [ (0.45, 0.67], (0.23, 0.45], (0.0037, 0.23], (0.67, 0.9],
.............., (0.23, 0.45] ], dtype = object )
The qcut( ) method is similar to the cut( ) method, but it bins the data based on sample
quantiles. Since quartiles divide the data into four equal-sized parts, passing 4 to qcut( ) will
generate bins that each contain roughly the same number of data points.
> data = np.random.randn (1000)
> cats = pd.qcut (data, 4)
> pd.value_counts (cats)
(-3.745, -0.635] 250
(0.641, 3.26] 250
(-0.635, -0.022] 250
(-0.022, 0.641] 250
We can also pass our own quantiles (numbers between 0 and 1, inclusive) as the
argument for the qcut( ) method, as shown below:
> pd.qcut ( data, [0, 0.1, 0.5, 0.9, 1.0] )
array ( [ (-0.022, 1.302], (-1.266, -0.022], (-0.022, 1.302], ...,
(-1.266, -0.022], (-0.022, 1.302], (-1.266, -0.022] ],
dtype = object )
6.8 Data Transformation - Detecting and Filtering Outliers
We shall filter and transform the outliers by using the array methods, as shown below:
> np.random.seed (12345)
> data = DataFrame ( np.random.randn(1000, 4) )
> data.describe( )

0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067684 0.067924 0.025598 -0.002298
std 0.998035 0.992106 1.006835 0.996794
min -3.428254 -3.548824 -3.184377 -3.745356
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.366626 2.653656 3.260383 3.927528
Suppose, in this example, we consider a value as an outlier if it is greater than +3 or less
than -3. The following commands will display the outliers in the last column of the above
Data Frame.
> col = data [3]
> col [ np.abs (col) > 3 ]
97 3.927528
305 -3.399312
400 -3.745356
In order to select all rows that have outliers in one or more columns, we have to use the
following Command:
> data [ ( np.abs (data) > 3 ) . any(1) ]
0 1 2 3
5 -0.539741 0.476985 3.248944 -1.021228
97 -0.774363 0.552936 0.106061 3.927528
102 -0.655054 -0.565230 3.176873 0.959533
305 -2.315555 0.457246 -0.025907 -3.399312
324 0.050188 1.951312 3.260383 0.963301
400 0.146326 0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990 1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586 0.275144 1.179227 -3.184377 1.369891
808 -0.362528 -3.548824 1.553205 -2.186301
900 3.366626 -2.372214 0.851010 1.332846
If a value is greater than +3, we will replace it by +3. Similarly, if a value is less than -3,
we will replace it by -3. This is the way to cap the values that lie outside the interval -3 to +3.
The following code will do this.
> data [ np.abs (data) > 3 ] = np.sign ( data ) * 3
> data.describe( )

0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.067623 0.068473 0.025153 -0.002081
std 0.995485 0.990253 1.003977 0.989736
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.774890 -0.591841 -0.641675 -0.644144
50% -0.116401 0.101143 0.002073 -0.013611
75% 0.616366 0.780282 0.680391 0.654328
max 3.000000 2.653656 3.000000 3.000000
Here, the sign( ) function returns +1 if the sign of a value is positive. Otherwise, it
returns -1. So, the expression np.sign(data) * 3 will return +3 when a value is greater than +3
and it will return -3 when a value is less than -3.
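The capping expression can be checked on a tiny array:

```python
import numpy as np

vals = np.array([-4.2, -1.0, 0.5, 3.7])
mask = np.abs(vals) > 3
vals[mask] = np.sign(vals[mask]) * 3   # cap out-of-range values at +3 or -3
print(vals)
```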
6.9 Data Transformation - Random Sampling
In order to select a random sample without replacement, we have to first permute (i.e.,
randomly reorder) the rows in a Data Frame or the values in a Series by using the
numpy.random.permutation function. Then, we shall select the first K elements, if we want a
random sample of size K. Here is an example.
> df = DataFrame ( np.arange (5*4).reshape(5,4) )
> df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
> sampler = np.random.permutation(5)
> sampler
array( [1, 0, 2, 3, 4] )
> df.take (sampler)
0 1 2 3
1 4 5 6 7
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
Here, the rows of the data frame have been randomly reordered. The new order of rows
is 1, 0, 2, 3, 4. If we want a random sample of size 3 without replacement, we shall simply take
the first three rows of the above reordered set of rows. Instead, we can execute the following
Command.
> df.take ( np.random.permutation ( len (df) ) [:3] )
0 1 2 3
1 4 5 6 7
3 12 13 14 15
4 16 17 18 19
In order to generate a random sample with replacement, the fastest way is to use
np.random.randint( ) method, as shown below:
> bag = np.array ( [5, 7, -1, 6, 4] ) # A Sample is to be taken from these five values
> sampler = np.random.randint ( 0, len(bag), size = 10 )
> sampler
array ( [4, 4, 2, 2, 2, 0, 3, 0, 4, 1] )
> draws = bag.take ( sampler )
> draws
array ( [4, 4, -1, -1, -1, 5, 6, 5, 4, 7] ) # A random sample of size 10
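Both sampling schemes above can be sketched with the modern NumPy Generator API (a hedged alternative to the older `np.random.permutation`/`np.random.randint` calls; the seed and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
bag = np.array([5, 7, -1, 6, 4])

# Without replacement: permute the indices, then keep the first K.
k = 3
without = bag[rng.permutation(len(bag))[:k]]

# With replacement: draw K indices uniformly at random; duplicates allowed.
with_repl = bag[rng.integers(0, len(bag), size=10)]
```

Because the five values in `bag` are distinct, the sample drawn without replacement always contains three distinct values, while the sample drawn with replacement may repeat values.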
6.10 Data Transformation - Computing Dummy/Indicator Variables
The categorical variables are often transformed into dummy variables or indicator
variables for statistical modeling or machine learning applications. Here is an example:
> df = DataFrame ( { ‘key’ : [‘b’, ‘b’, ‘a’, ‘c’, ‘a’, ‘b’],
‘data1’ : range(6) } )
> df
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
> dummies = pd.get_dummies ( df [‘key’] )
> df_with_dummy = df [ [ ‘data1’ ] ] . join ( dummies )
> df_with_dummy
data1 a b c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
Here, the categorical variable ‘key’ has three distinct values, namely ‘a’, ‘b’ and ‘c’. So,
three dummy variables, namely ‘a’, ‘b’ and ‘c’ will be generated corresponding to the
categorical variable ‘key’ by the get_dummies( ) method. If we carefully look at the values in
the dummy variable columns ‘a’, ‘b’ and ‘c’ together, we note that ‘a’ is represented by the
combination (1, 0, 0), ‘b’ is represented by (0, 1, 0) and ‘c’ is represented by (0, 0, 1).
The dummy variable columns are named as ‘a’, ‘b’ and ‘c’ in the above result. If we want
the names of these columns to be prefixed with the word ‘key’, we have to write the following
code:
> dummies = pd.get_dummies ( df [‘key’], prefix = ‘key’ )
> df_with_dummy = df [ [‘data1’] ].join (dummies)
> df_with_dummy
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
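The prefixed-dummies example can be reproduced directly with a current pandas version (a minimal sketch; note that get_dummies joins the prefix with an underscore by default, and in recent pandas the indicator columns come back with a boolean dtype):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# prefix='key' yields columns key_a, key_b, key_c.
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
```

Each row of `dummies` contains exactly one indicator set to 1 (True), because every row of `df` has exactly one key value.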
In the above example, the categorical variable ‘key’ takes only one of the three distinct
values, namely ‘a’, ‘b’ and ‘c’. In such a case, the corresponding dummy variables can be
easily created as shown above. But, a categorical variable, such as the movie genre, may take
one of a very large collection of values, such as Animation, Comedy, Adventure, Fantasy,
Romance, Action, Crime, Thriller, Children’s, etc. Here, the generation of a set of dummy
variables corresponding to the categorical variable movie genre is a little bit difficult and it is
explained below with the help of MovieLens 1M dataset.
> mnames = [ ‘movie_id’, ‘title’, ‘genres’ ]
> movies = pd.read_table (‘movies.dat’, sep = ‘::’, header = None,
names = mnames )
> movies [ :5]
movie_id title genres
0 1 Toy Story (1995) Animation | Children’s | Comedy
1 2 Jumanji (1995) Adventure | Children’s | Fantasy
2 3 Grumpier Old Men (1995) Comedy | Romance
3 4 Waiting to Exhale (1995) Comedy | Drama
4 5 Father of the Bride Part II (1995) Comedy
Here, adding dummy variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset by using the following two Commands:
> genre_iter = ( set (x.split (‘|’) ) for x in movies.genres)
> genres = sorted ( set.union (* genre_iter) )
Here, a set of dummy variables is to be generated corresponding to the categorical
variable genres. If this categorical variable takes K distinct values, then K dummy variables
are to be generated. These K dummy variables will be represented by K columns in the Data
Frame. Initially, we will place zeroes in all the element positions in these K columns.
The last Command given above ensures that the distinct values that the categorical
variable genres can take are placed in sorted order in the list named genres. The first two
entries in this sorted list are Action and Adventure (the prefix Genre_ is attached later by the
add_prefix( ) method). Here, the K columns, which represent the K dummy variables, in a
row whose genres column contains only the value Action, will take the values
(1, 0, 0, 0, ..., 0, 0, 0).
Similarly, the K columns in a row that contains only the two values Action and
Adventure will take the values (1, 1, 0, 0, ..., 0, 0, 0). The commands required for
performing these operations are given below:
> dummies = DataFrame ( np.zeros ( (len (movies), len (genres) ) ),
columns = genres )
> for i, gen in enumerate ( movies.genres ) :
dummies.ix [ i, gen.split ( ‘|’ ) ] = 1
> movies_windic = movies.join ( dummies.add_prefix (‘Genre_’) )
> movies_windic.ix [0]
movie_id 1
title Toy Story (1995)
genres Animation | Children’s | Comedy
Genre_Action 0
Genre_Adventure 0
Genre_Animation 1
Genre_Children’s 1
Genre_Comedy 1
Genre_Crime 0
Genre_Documentary 0
Genre_Drama 0
Genre_Fantasy 0
Genre_Film_Noir 0
Genre_Horror 0
Genre_Musical 0
Genre_Mystery 0
Genre_Romance 0
Genre_Sci-Fi 0
Genre_Thriller 0
Genre_War 0
Genre_Western 0
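In current pandas versions the manual zero-matrix-and-loop wrangling above (and the now-removed `.ix` indexer) can be replaced by `Series.str.get_dummies`, which splits on a separator and builds the indicator columns in one step. The miniature frame below is a stand-in for the MovieLens data:

```python
import pandas as pd

# A tiny stand-in for the MovieLens 'genres' column.
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each genres string on '|' and build one indicator column per genre.
dummies = movies['genres'].str.get_dummies(sep='|').add_prefix('Genre_')
movies_windic = movies.join(dummies)
```

For Toy Story, `Genre_Animation` and `Genre_Comedy` are 1 while `Genre_Fantasy` is 0, matching the hand-built result above.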
6.11 Data Transformation - String Manipulation
Python has long been a popular data munging language in part due to its ease-of-use for
text processing. Most text processing operations are made simple with string object’s built-in
methods, such as the split( ), strip( ) and replace( ) methods. For more complex pattern
matching and text manipulations, regular expressions, which are discussed in the next Section,
are needed.
6.11.1 split( ) method
A comma-separated string can be broken into pieces by using the split( ) method, as
shown below:
> val = ‘a,b,  guido’
> val.split ( ‘,’ )
[‘a’, ‘b’, ‘  guido’]
6.11.2 strip( ) method
This method is often combined with the split( ) method to trim the whitespace (including
newlines).
> pieces = [ x.strip( ) for x in val.split(‘,’) ]
> pieces
[‘a’, ‘b’, ‘guido’]
6.11.3 join( ) method
Substrings can be concatenated by using the usual addition operator. But this can be done
in a better way by using the join( ) method.
> first, second, third = pieces
> first + ‘::’ + second + ‘::’ + third
‘a::b::guido’
But this method of concatenating the strings is not a practical generic method. A faster
and more Pythonic way is to pass a list or tuple to the join( ) method on the string ‘::’, as shown
below:
> ‘::’.join(pieces)
‘a::b::guido’
6.11.4 find( ) and index( ) methods
> val = ‘a,b,guido’
> val.find(‘:’)
-1
> val.index(‘,’)
1
The difference between the find( ) and index( ) methods is that the index( ) method raises
an exception if the string is not found. On the other hand, the find( ) method returns –1, as we
have seen above, if the string is not found.
6.11.5 count( ) method
This method returns the number of occurrences of a particular substring.
> val.count(‘,’)
2
6.11.6 replace( ) method
This method will substitute occurrences of one pattern for another. This is commonly
used to delete patterns, too, by passing an empty string:
> val.replace(‘,’ , ‘::’)
‘a::b::guido’
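The string methods of this Section can be exercised together in one short sketch (plain Python, no assumptions beyond the standard string API):

```python
val = 'a,b,  guido'

# split() then strip() trims the stray whitespace from each piece.
pieces = [x.strip() for x in val.split(',')]   # ['a', 'b', 'guido']

# join() is the fast, generic way to concatenate with a delimiter.
joined = '::'.join(pieces)                     # 'a::b::guido'

missing = val.find(':')          # -1: find() signals absence without raising
commas = val.count(',')          # 2
replaced = val.replace(',', '::')
```

Note that `val.index(':')` on the same string would raise a ValueError instead of returning -1; that is the only behavioral difference between `index( )` and `find( )`.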
6.12 Data Transformation - Regular Expressions
Regular Expressions provide a flexible way to search or match string patterns in text. A
single expression, commonly called a regex, is a string formed according to the regular
expression language.
Suppose we need to express in the regular expression language the structure of an e-mail
address, like the one shown below, but in the general case.
[email protected]
Basically, an e-mail can be expressed as a set of characters, followed by an @ sign, followed
by another set of characters. What we have just described can be written as the following
Regular Expression string:
.+@.+
Here, we have the @ sign in the middle, with text before it and text after it. So, let us take a
closer look at what all the symbols involved in the above Regular Expression do. The first
symbol in the above Regular Expression is a period (.). This symbol has a special meaning. A
period is like a wild card that will match any character. The plus (+) sign is also a special
character. It is used to match the preceding pattern element one or more times. So, a period
followed by a plus sign will match any sequence of one or more characters. The @ sign simply
matches an @ sign in the string.
The built-in methods in Python’s re module helps us to apply regular expressions to
strings. The built-in methods in the re module fall into three categories: pattern matching,
substitution, and splitting. These are all related. A regex describes a pattern, such as .+@.+, to
locate in the text, which can then be used for many purposes. Let us look at a simple example:
suppose we want to split a string with a variable number of whitespace characters (tabs, spaces,
and newlines). The regex describing one or more whitespace characters is \s+
> import re
> text = “foo bar\t baz \tqux”
> re.split (‘\s+’ , text)
[‘foo’, ‘bar’, ‘baz’, ‘qux’]
When we call re.split (‘\s+’, text), the regular expression is first compiled, and then its split( )
method is called on the passed text. We can compile the regex ourselves with re.compile( )
method, forming a reusable regex object.
> regex = re.compile (‘\s+’)
> regex.split(text)
[ ‘foo’, ‘bar’, ‘baz’, ‘qux’ ]
If, instead, we want to get a list of all patterns matching the regex, we can use the findall( )
method, as shown below:
> regex.findall (text)
[‘ ’, ‘\t ’, ‘ \t’]
The match( ) and search( ) methods are closely related to the findall( ) method. While
findall( ) returns all matches in a string, the search( ) method returns only the first match.
The match( ) method only matches at the beginning of the string.
Let us now consider a block of text and a regular expression capable of identifying most
e-mail addresses:
> text = “ “ “ Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
”””
> pattern = r ‘[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}’
> regex = re.compile ( pattern, flags = re. IGNORECASE )
> regex.findall(text)
[ ‘[email protected]’, ‘[email protected]’, ‘[email protected]’, ‘[email protected]’ ]
The search( ) method will return a special match object for the first e-mail address in the
text. For the above regex, the match object can only tell us the start and end position of the
pattern in the string:
> m = regex.search (text)
> m
<_sre.SRE_Match at 0x10a05de00>
> text [m.start( ) : m.end( ) ]
[email protected]
> print regex.match(text)
None
Here, regex.match( ) returns None, as it only will match if the pattern occurs at the start
of the string.
The sub( ) method will return a new string with occurrences of the pattern replaced by a
new string.
> print regex.sub ( ‘REDACTED’, text )
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose we want to find the e-mail addresses and simultaneously segment each address
into its three components username, domain name, and domain suffix. In order to do this, put
parentheses around the parts of the pattern to segment:
> pattern = r ‘([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})’
> regex = re.compile ( pattern, flags = re.IGNORECASE )
> m = regex.match ( ‘[email protected]’ )
> m.groups( )
( ‘wesm’, ‘bright’, ‘net’ )
> regex.findall (text)
[ ( ‘dave’, ‘google’, ‘com’ ),
( ‘steve’, ‘gmail’, ‘com’ ),
( ‘rob’, ‘gmail’, ‘com’ ),
( ‘ryan’, ‘yahoo’, ‘com’ ) ]
Chapter-7:
DATA AGGREGATION & GROUP OPERATIONS
7.1 Introduction
After loading, merging, and preparing a data set, a familiar data processing task is to
categorize the data into groups and then compute the group statistics for reporting and
visualization purposes. Pandas library provides a flexible and high-performance groupby
facility, enabling us to slice and dice, and summarize data sets in a natural way.
One reason for the popularity of Relational Databases and SQL is the ease with which
data can be joined, filtered, transformed, and aggregated. However, query languages like SQL
are rather limited in the kinds of group operations that can be performed. With the
expressiveness and power of Python and Pandas, we can perform much more complex grouped
operations, like the ones listed below:
1. We can split a Panda object into pieces using one or more keys.
2. We can compute group summary statistics, like count, mean, or standard deviation.
3. We can apply a varying set of functions to each column of a Data Frame.
4. We can apply within-group transformations or other manipulations, like normalization,
linear regression, rank, or subset selection.
5. We can compute Pivot Tables and Cross-Tabulation.
7.2 GroupBy Operation
Hadley Wickham coined the
term split-apply-combine for talking
about group operations and this is a
good description of the process. In the
first stage of the process, the data are
split into groups based on one or more
keys that we provide. The splitting is
performed on a particular axis. In the
case of a Data Frame, rows constitute
zeroth axis and the columns constitute
the first axis. Once this is done, a
function is applied to each group,
producing a new value. Finally, the
results of all those function
applications are combined into a
result object. The figure is a mockup of a simple group aggregation.
In order to get started, let us consider the following Data Frame.
> df = DataFrame ( { ‘key1’ : [ ‘a’, ‘a’, ‘b’, ‘b’, ‘a’ ],
‘key2’ : [ ‘one’, ‘two’, ‘one’, ‘two’, ‘one’ ],
‘data1’: np.random.randn(5),
‘data2’: np.random.randn(5) } )
> df
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
Suppose we want to compute the mean of the data1 column by using the group labels
from key1 column (i.e., ‘a’ and ‘b’). To do this, we shall access the data1 column and call the
groupby( ) method. We shall pass the key1 Column as the argument of the groupby( ) method,
as shown below:
> grouped = df[‘data1’].groupby( df [‘key1’] )
> grouped.mean( )
key1
a 0.746672
b -0.537585
The important point to be noted here is that the data (a Series) has been aggregated
according to the group key, producing a new Series that is now indexed by the unique values
in the key1 column. The result index has the name key1.
We have grouped the data above by using one key. We will now group the data by using
two keys. The resulting Series will have a hierarchical index consisting of the unique pairs of
keys, as shown below:
> means = df[‘data1’].groupby( [df [‘key1’], df [‘key2’] ] ).mean( )
> means
key1 key2
a one 0.880536
two 0.478943
b one -0.519439
two -0.555730
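A Series grouped on two keys comes back with a hierarchical index, and such a result can be pivoted into a table with unstack( ). A minimal sketch with made-up values (not the book's random data):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Grouping by two keys yields a Series with a two-level index.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# unstack() moves the inner level (key2) into the columns.
table = means.unstack()
```

Here the ('a', 'one') group averages rows 0 and 4, giving 3.0, and `table` is a 2x2 frame indexed by key1 with key2 values as columns.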
In the above examples, the group keys are all Series. But the group keys can be any arrays
of the right length. Here is an example.
> states = np.array ( [ ‘Ohio’, ‘California’, ‘California’, ‘Ohio’, ‘Ohio’ ] )
> years = np. array ( [ 2005, 2005, 2006, 2005, 2006 ] )
> df[ ‘data1’ ].groupby( [states, years] ).mean( )
California 2005 0.478943
2006 -0.519439
Ohio 2005 -0.380219
2006 1.965781
If the grouping information is available in the same Data Frame, we can just pass column
names as the group keys, as shown below:
> df.groupby (‘key1’).mean( )
key1 data1 data2
a 0.746672 0.910916
b -0.537585 0.525384
> df.groupby ( [‘key1’, ‘key2’] ).mean( )
key1 key2 data1 data2
a one 0.880536 1.319920
two 0.478943 0.092908
b one -0.519439 0.281746
two -0.555730 0.769023
Note that in the first case, df.groupby (‘key1’).mean( ), there is no key2 column in the
result. Because df[‘key2’] does not contain numeric data, it is said to be a nuisance column
and is therefore excluded from the result. By default, all of the numeric columns are aggregated.
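Be aware that recent pandas versions no longer drop nuisance columns silently; aggregating a frame with non-numeric columns now requires asking for numeric columns explicitly (a hedged sketch with illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'key2': ['one', 'two', 'one'],
                   'data1': [1.0, 3.0, 5.0]})

# numeric_only=True restricts the aggregation to numeric columns,
# reproducing the older nuisance-column behavior.
result = df.groupby('key1').mean(numeric_only=True)
```

The string column key2 is excluded, and the mean of data1 for group ‘a’ is (1.0 + 3.0) / 2 = 2.0.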
7.3 Iterating over Groups
The GroupBy object (i.e., the result obtained by applying a GroupBy operation) supports
iteration. Such an iteration operation generates the group names along with the chunk of data,
as shown below:
> for name, group in df.groupby (‘key1’) :
print name
print group
a
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
4 1.965781 1.246435 a one
b
data1 data2 key1 key2
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
In the above example, we provide a single key. But, we can very well pass two keys to
the groupby( ) method, as shown below:
> for (k1, k2), group in df.groupby ( [ ‘key1’, ‘key2’ ] ):
print k1, k2
print group
a one
data1 data2 key1 key2
0 -0.204708 1.393406 a one
4 1.965781 1.246435 a one
a two
data1 data2 key1 key2
1 0.478943 0.092908 a two
b one
data1 data2 key1 key2
2 -0.519439 0.281746 b one
b two
data1 data2 key1 key2
3 -0.555730 0.769023 b two
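A useful recipe built on this iteration is to materialize the groups as a dictionary of name-to-chunk, which makes it easy to grab one piece of the data by its key (a minimal sketch with illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1, 2, 3]})

# Each iteration element is a (name, sub-frame) pair,
# so dict(list(...)) maps group name -> chunk of data.
pieces = dict(list(df.groupby('key1')))
```

Here `pieces['a']` is the sub-frame holding the two ‘a’ rows and `pieces['b']` holds the single ‘b’ row.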
7.4 Selecting a Column or a subset of Columns
Indexing a GroupBy object (i.e., the result obtained by applying a Groupby operation)
with a column name has the effect of selecting that column for aggregation. This means that:
> df.groupby ( ‘key1’ ) [ ‘data1’ ]
> df.groupby ( ‘key1’ ) [ [‘data2’ ] ]
are same as:
> df [ ‘data1’ ].groupby ( df [ ‘key1’ ] )
> df [ [ ‘data2’ ] ].groupby ( df [ ‘key1’ ] )
Especially for large data sets, it may be desirable to aggregate only a few columns. For
example, in the above data set df, to compute means for just the data2 column and get the result
as a Data Frame, we can write:
> df.groupby ( [ ‘key1’, ‘key2’ ] ) [ [ ‘data2’ ] ].mean( )
data2
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
We can get the above result by using the following command also.
> df.groupby ( [ ‘key1’, ‘key2’ ] ) [ ‘data2’ ].mean( )
key1 key2
a one 1.319920
two 0.092908
b one 0.281746
two 0.769023
7.5 Grouping with Dictionaries and Series
So far, we have passed one or more keys to the groupby( ) method. But we can very
well pass a Dictionary or a Series as the argument to the groupby( ) method. In other words,
grouping of data can be done on the basis of a Dictionary or a Series also.
> people = DataFrame (np.random.randn (5,5) ,
columns = [ ‘a’, ‘b’, ‘c’, ‘d’, ‘e’ ],
index = [‘Joe’, ‘Steve’, ‘Wes’, ‘Jim’, ‘Travis’] )
> people.ix [2:3, [‘b’, ‘c’] ] = np.nan # Add a few NA values
> people
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Suppose we have a group correspondence for the columns and want to sum together the columns
by group.
> mapping = { ‘a’ : ‘red’, ‘b’ : ‘red’, ‘c’ : ‘blue’,
‘d’ : ‘blue’, ‘e’ : ‘red’, ‘f’ : ‘orange’ }
> people.groupby ( mapping, axis = 1 ) . sum( )
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
The same functionality holds for Series. It can be viewed as a fixed size mapping.
> map_series = Series ( mapping )
> map_series
a red
b red
c blue
d blue
e red
f orange
> people.groupby ( map_series, axis = 1 ).count( )
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
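Note that `groupby(..., axis=1)` is deprecated in recent pandas versions; the same column-wise grouping can be obtained by grouping the transposed frame. A minimal sketch with one illustrative row:

```python
import pandas as pd

people = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0]],
                      columns=list('abcde'), index=['Joe'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose, group the (column) labels through the dict, sum,
# then transpose back; unused keys such as 'f' are simply ignored.
sums = people.T.groupby(mapping).sum().T
```

For Joe, blue = c + d = 7.0 and red = a + b + e = 8.0.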
7.6 Grouping with Functions
Any function passed as a group key will be called once per index value, with the return
values being used as the group names. In the data frame named people that we have seen in
the previous Section, people’s first names (i.e., Joe, Steve, Wes, Jim, Travis) are used as index
values. Suppose we want to group by the length of the names. (The length of the name Joe
is 3). We can compute an array of string lengths. But, instead, we can just pass the len function,
as shown below:
> people.groupby(len).sum( )
a b c d e
3 0.591569 -0.993608 0.798764 -0.791374 2.119639
5 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Mixing functions with arrays, dictionaries, or Series is not a problem, as everything gets
converted to arrays internally.
> key_list = [ ‘one’, ‘one’, ‘one’, ‘two’, ‘two’ ]
> people.groupby ( [ len, key_list ] ).min( )
a b c d e
3 one -0.539741 -1.296221 0.274992 -1.021228 -0.577087
two 0.124121 0.302614 0.523772 0.000940 1.343810
5 one 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
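Grouping by a function can be demonstrated in isolation with a few made-up values (illustrative, not the book's random data):

```python
import pandas as pd

people = pd.DataFrame({'a': [1.0, 2.0, 3.0]},
                      index=['Joe', 'Steve', 'Wes'])

# len is called once per index label, so names of equal length
# ('Joe' and 'Wes', both length 3) land in the same group.
sums = people.groupby(len).sum()
```

The length-3 group sums Joe and Wes (1.0 + 3.0 = 4.0) while Steve, the only length-5 name, forms its own group.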
7.7 Data Aggregation
Any data transformation that produces scalar values from arrays is an aggregation
operation. The mean, count, min, and sum are all examples of aggregation operations. Consider
the following example:
> df
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
> grouped = df.groupby ( ‘key1’ )
> grouped [‘data1’].quantile (0.9)
key1
a 1.668413
b -0.523068
While quantile is not explicitly implemented for GroupBy, it is a Series method and thus
available for use. Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9)
for each piece, then assembles those results together into the result object.
In order to use our own aggregation functions, we have to pass any function that
aggregates an array to the aggregate or agg method.
> def peak_to_peak ( arr ) :
return arr.max( ) - arr.min( )
> grouped.agg ( peak_to_peak )
key1 data1 data2
a 2.170488 1.300498
b 0.036292 0.487276
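The custom-aggregation pattern above can be run end to end on a small frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1.0, 4.0, 2.0, 7.0]})

def peak_to_peak(arr):
    # Range of each group: max minus min.
    return arr.max() - arr.min()

# agg() applies the custom function once per group, per column.
result = df.groupby('key1').agg(peak_to_peak)
```

Group ‘a’ spans 1.0 to 4.0 (range 3.0) and group ‘b’ spans 2.0 to 7.0 (range 5.0).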
Some methods like describe( ) also work with GroupBy, even though they are not
aggregations, strictly speaking.
> grouped.describe( )
key1 data1 data2
a count 3.000000 3.000000
mean 0.746672 0.910916
std 1.109736 0.712217
min -0.204708 0.092908
25% 0.137118 0.669671
50% 0.478943 1.246435
75% 1.222362 1.319920
max 1.965781 1.393406
b count 2.000000 2.000000
mean -0.537585 0.525384
std 0.025662 0.344556
min -0.555730 0.281746
25% -0.546657 0.403565
50% -0.537585 0.525384
75% -0.528512 0.647203
max -0.519439 0.769023
7.8 Column-wise and Multiple Function Application
We have seen above how to aggregate all of the columns of a Data Frame by invoking
the aggregate( ) method with a desired function or calling a method like mean or std. But,
sometimes we may wish to call the method mean for one column and the method std for
another column. Fortunately, this is straightforward to do, as illustrated by the following
example:
> tips = pd.read_csv( ‘tips.csv’ ) # Available in R reshape2 package
> tips [ ‘tip_pct’ ] = tips [‘tip’] / tips [ ‘total_bill’ ] # Add a tip percentage column
> tips [ :4 ]
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
> grouped = tips.groupby ( [ ‘sex’, ‘smoker’ ] )
> grouped_pct = grouped [ ‘tip_pct’ ]
> grouped_pct.agg ( ‘mean’ )
sex smoker
Female No 0.156921
Yes 0.182150
Male No 0.160669
Yes 0.152771
We have passed the function name ‘mean’ as the argument of the agg( ) method above.
Instead, we can pass a list of function names (‘mean’, ‘std’, etc.) or functions (peak_to_peak,
for example), as illustrated below:
> grouped_pct.agg ( [ ‘mean’, ‘std’, peak_to_peak ] )
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707
In the above command, the function names ‘mean’ and ‘std’ are passed to the agg( )
method, and these names are used as the column names in the above output by default. But
we can assign different names to the columns, if we so wish, by passing a list of
(name, function) tuples:
> grouped_pct.agg ( [ ‘foo’, ‘mean’ ), ( ‘bar’, np.std ) ] )
sex smoker foo bar
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588
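The (name, function) tuple form can be checked on a tiny Series (an illustrative sketch; np.std passed through agg is treated like pandas' sample standard deviation in current versions):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
grouped = s.groupby([0, 0, 1, 1])

# Each (name, function) tuple renames the corresponding output column.
out = grouped.agg([('foo', 'mean'), ('bar', np.std)])
```

The result has columns named foo and bar instead of mean and std; group 0 averages 1.0 and 2.0 to give 1.5.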
We can specify a list of functions as the argument of the agg( ) method. They will be
applied to all of the columns of a Data Frame. We can also apply different functions to different
columns of a Data Frame, if we so wish.
> functions = [ ‘count’, ‘mean’, ‘max’ ]
> result = grouped [ ‘tip_pct’, ‘total_bill’ ] . agg (functions)
> result
tip_pct total_bill
sex smoker count mean max count mean max
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81
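Applying one list of functions to several columns can be sketched with a miniature stand-in for the tips data (values are invented; note that current pandas requires double brackets to select multiple columns from a GroupBy):

```python
import pandas as pd

tips = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                     'tip_pct': [0.06, 0.16, 0.17, 0.14],
                     'total_bill': [17.0, 10.3, 21.0, 23.7]})

functions = ['count', 'mean', 'max']
# The same three statistics are computed for both selected columns,
# producing hierarchical (column, statistic) output columns.
result = tips.groupby('sex')[['tip_pct', 'total_bill']].agg(functions)
```

Individual cells of the result are addressed with a (column, statistic) tuple, e.g. `result.loc['M', ('total_bill', 'max')]`.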