
Unit 3

NumPy is a Python library designed for efficient array manipulation and mathematical operations, particularly useful in data science. It introduces the ndarray object, which allows for fast processing of large datasets and supports various functions for creating and manipulating arrays. Key features include array creation methods, data types, indexing, slicing, and advanced operations like Boolean and fancy indexing.

Uploaded by

Barun Shrestha
Unit 3

Libraries
NumPy
NumPy stands for Numerical Python. It is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and
matrices. In Python, lists serve the purpose of arrays, but they are slow to
process. NumPy aims to provide an array object that is much faster than traditional
Python lists. The array object in NumPy is called ndarray, and it provides many supporting
functions that make working with it easy. Arrays are used very frequently in
data science, where speed and resources are very important. Unlike lists, NumPy arrays are
stored at one contiguous place in memory, so processes can access and manipulate
them very efficiently. This behavior is called locality of reference in computer science,
and it is the main reason why NumPy is faster than lists. NumPy is also optimized to work with
the latest CPU architectures using vectorized processing.
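The contrast between looping over a list and a vectorized array expression can be sketched as follows. This is only an illustration; the variable names (values, squares_list, squares_array) are chosen for this example and do not come from the notes.

```python
import numpy as np

values = list(range(1000))

# Pure-Python approach: an explicit loop over the list elements
squares_list = [v * v for v in values]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(values)
squares_array = arr * arr

# Both approaches produce the same numbers
print(squares_array[:5])
print(np.array_equal(squares_array, np.array(squares_list)))

# The array keeps its elements in one contiguous block of memory
print(arr.flags['C_CONTIGUOUS'])
```

The vectorized form avoids the Python-level loop, which is where most of the speed difference on large arrays comes from.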

The NumPy ndarray: A Multidimensional Array Object


One of the key features of NumPy is its N-dimensional array object, or ndarray, which is
a fast, flexible container for large data sets in Python. Arrays enable you to perform
mathematical operations on whole blocks of data using similar syntax to the equivalent
operations between scalar elements.

import numpy as np
#creating and displaying array
data=[[1,2,6],[3,5,9]]
data=np.array(data)
print("Array Data")
print(data)

Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an
object describing the data type of the array.

#displaying array shape
print(data.shape)

#displaying datatype of array elements
print(data.dtype)

Prepared By: Arjun Singh Saud, Asst. Prof. CDCSIT


Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing the
passed data.

data=(2,6,9)
data=np.array(data)
print("Array Data")
print(data)

Nested sequences, like a list of equal-length lists, will be converted into a
multidimensional array:

import numpy as np
#creating and displaying array
data=[[1,2,6],[3,5,9]]
data=np.array(data)
print("Array Data")
print(data)

NumPy arrays have many attributes. ndim is the attribute that represents the number of
dimensions (axes) of the ndarray.

#displaying dimension
print(data.ndim)

Unless explicitly specified, np.array tries to infer a good data type for the array that it
creates. The data type is stored in a special dtype object.

data=[2.4,3.9,-1.2]
data=np.array(data)
#print data type of array elements
print(data.dtype)

We can also specify data type of array elements explicitly while creating ndarrays.

data=np.array([1,3,5,8],dtype='int64')
print(data)

In addition to np.array, there are a number of other functions for creating new arrays. As
examples, zeros and ones create arrays of 0’s or 1’s, respectively, with a given length or
shape. empty creates an array without initializing its values to any particular value. To
create a higher dimensional array with these methods, pass a tuple for the shape.

#create array of length 5 filled with zeros
data=np.zeros(5)
print(data)

#create array of length 5 filled with ones
data=np.ones(5)
print(data)

#create array of shape(3,3) filled with zeros
data=np.zeros((3,3))
print(data)

#create array of shape(3,3) filled with ones
data=np.ones((3,3))
print(data)

#create uninitialized array of length 10 (contents are arbitrary)
data=np.empty(10)
print(data)

#create uninitialized array of shape(2,3)
data=np.empty((2,3))
print(data)

The numpy.arange() function is used to generate an array with evenly spaced values
within a specified interval. The function returns a one-dimensional array of type
numpy.ndarray.

Syntax: numpy.arange([start, ]stop, [step, ]dtype=None)

#create array with elements 0-9
data=np.arange(10)
print(data)

#create array with elements 5-9
data=np.arange(5,10)
print(data)

#create array with odd elements 1-9 (step of 2)
data=np.arange(1,10,2)
print(data)

A list of standard array creation functions is given below.

• array: Convert input data (list, tuple, array, or other sequence type) to an ndarray
either by inferring a dtype or explicitly specifying a dtype. Copies the input data
by default.
• asarray: Convert input to ndarray, but do not copy if the input is already an
ndarray.
• arange: Generate an array with evenly spaced values within a specified
interval.
• ones, ones_like: Produce an array of all 1’s with the given shape and dtype.
ones_like takes another array and produces a ones array of the same shape and
dtype.

#create array with elements 0-9


data=np.arange(10)
print(data)

d=np.ones_like(data)
print(d)

• zeros, zeros_like: Like ones and ones_like but producing arrays of 0’s instead.

#create array with elements 0-9


data=np.arange(10)
print(data)

d=np.zeros_like(data)
print(d)
• empty, empty_like: Create new arrays by allocating new memory, but do not
populate them with any values as ones and zeros do.
• eye, identity: Create a square N x N identity matrix (1’s on the diagonal and 0’s
elsewhere).

#creates identity matrix of 3x3
data=np.eye(3)
print(data)

#creates identity matrix of 4x4
data=np.identity(4)
print(data)

Data Types for ndarrays

The data type or dtype is a special object containing the information the ndarray needs
to interpret a chunk of memory as a particular type of data.

data=np.array([1,3,5,8],dtype='int64')
print(data)

The numerical dtypes are named in the format: a type name, like float or int, followed by
a number indicating the number of bits per element. We can explicitly convert or cast an
array from one dtype to another using ndarray’s astype method.

data=np.array([1,3,5,8],dtype='int64')
print(data)

data=data.astype('float64')
print(data)

data=data.astype(np.int32)
print(data)
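One detail worth seeing in a small example is what astype does when floating-point values are cast to an integer dtype. The sketch below uses made-up values; it shows that the decimal part is discarded and that astype returns a new array rather than changing the original.

```python
import numpy as np

floats = np.array([3.7, -1.2, -2.6, 0.5])

# Casting float to int drops the decimal part (truncation toward zero)
ints = floats.astype(np.int32)
print(ints)          # values 3, -1, -2, 0

# astype creates a new array; the original keeps its dtype and values
print(floats.dtype)  # float64
print(floats)
```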

If we have an array of strings representing numbers, we can use astype to convert them
to numeric form.

data=np.array(['2.5','3.7','9.1'],dtype=np.bytes_) #np.string_ in NumPy 1.x; that alias was removed in NumPy 2.0
print(data)
data=data.astype('float64')
print(data)
print(data.dtype)

If casting fails for some reason (like a string that cannot be converted to float64), a
ValueError will be raised.

data=np.array(['2.5','3.7','9.1f'],dtype=np.bytes_) #np.string_ in NumPy 1.x
print(data)
data=data.astype('float64') #raises ValueError: '9.1f' is not a valid number
print(data)
print(data.dtype)
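A failed cast can be handled with a try/except block so that the program does not stop. The following is a minimal sketch with illustrative names (raw, converted, error_name). With current NumPy versions a malformed numeric string surfaces as a ValueError; catching TypeError as well is a conservative choice, not a requirement.

```python
import numpy as np

raw = np.array(['2.5', '3.7', '9.1f'])   # last entry is not a valid number

try:
    converted = raw.astype('float64')
except (ValueError, TypeError) as exc:
    # Conversion failed; record which exception type was raised
    converted = None
    error_name = type(exc).__name__
    print("Conversion failed:", error_name)
```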

Operations between Arrays and Scalars
Arrays are important because they enable you to express batch operations on data
without writing any for loops. This is usually called vectorization. Any arithmetic
operation between equal-size arrays applies the operation element-wise.

import numpy as np
a = np.array([[1., 2., 3.], [4., 5., 6.]])
print(a)
r=a*a
print("Element-wise multiplication of arrays:")
print(r)
r=a+a
print("Sum of arrays:")
print(r)

Arithmetic operations with scalars propagate the scalar value to each element in the
NumPy array.

import numpy as np
a = np.array([[1., 2., 3.], [4., 5., 6.]])
print(a)
r=a/2
print("Half of array elements:")
print(r)
r=a**0.5
print("Square root of array elements:")
print(r)

Basic Indexing and Slicing


There are many ways to select a subset of data or individual elements stored in NumPy
arrays. One-dimensional arrays are simple; on the surface they act similarly to Python
lists.

import numpy as np
a = np.arange(10)
print("Array Elements:")
print(a)
print("Element at index 3")
print(a[3])
print("Element from index 3-6")
print(a[3:7])

a[2]=10 #modifying element at index 2
a[4:6]=11 #modifying element from index 4 to 5
print("Array Elements:")
print(a)

If we assign a scalar value to a slice, as in a[4:6] = 11, the value is propagated (or
broadcast) to the entire selection. An important first distinction from lists is
that array slices are views on the original array. This means that the data is not copied,
and any modifications to the view will be reflected in the source array, as demonstrated
below.

import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9])
aslice=a[3:7]
aslice[1]=15 #modification will be reflected in original array
print("Array Elements:")
print(a)

a = [1,2,3,4,5,6,7,8,9]
aslice=a[3:7]
aslice[1]=15 #modification will not be reflected in original list
print("List Elements:")
print(a)

As NumPy has been designed with large data use cases in mind, we could imagine
performance and memory problems if NumPy copied data instead of creating views. If we
want a copy of a slice of an ndarray instead of a view, we need to explicitly copy the
array; for example, arr[5:8].copy().
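The difference between a view and an explicit copy can be checked directly. In this sketch the names view and copy are illustrative, and np.shares_memory is used only to confirm which object shares data with the original array.

```python
import numpy as np

a = np.arange(10)
view = a[5:8]           # a view: shares memory with a
copy = a[5:8].copy()    # an explicit copy: independent data

copy[:] = -1            # changing the copy does not touch a
print(a)

view[:] = 99            # changing the view writes through to a
print(a)

print(np.shares_memory(a, view), np.shares_memory(a, copy))
```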

With higher dimensional arrays, we have many more options. In a two-dimensional
array, the elements at each index are no longer scalars but rather one-dimensional arrays.
Thus, individual elements can be accessed recursively. We can also pass a
comma-separated list of indices to select individual elements. So these are equivalent.

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array Element at index 2")
print(a[2])
print("Array Element at index 1,2")
print(a[1][2])
print(a[1,2])#Equivalent to a[1][2]

In multidimensional arrays, if you omit later indices, the returned object will be a lower
dimensional ndarray consisting of all the data along the higher dimensions, as
demonstrated below with a 2 × 2 × 3 array.

import numpy as np
a = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print("Array Elements")
print(a)
print("Array Element at index 1")
print(a[1])
print("Array Element at index 1,1")
print(a[1,1])
print("Array Element at index 1,1,2")
print(a[1,1,2])

Indexing with slices


Like one-dimensional objects such as Python lists, ndarrays can be sliced using
similar syntax. Higher dimensional objects give you more options, as you can slice one or more
axes and also mix integer indexes with slices. Note that a colon by itself means to take the
entire axis, so a[:, 0:2] keeps every row and slices only the columns.

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array Elements:")
print(a)
print("First Two Rows")
print(a[0:2])# Or print(a[:2])
print("First Two Columns of array")
print(a[:,0:2])
print("2x2 slice in top-left corner")
print(a[0:2,0:2])

As with 1D arrays, a slice of a 2D array is a view, so updating the slice is reflected in the
original array.

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
aslice=a[0:2,0:2]
aslice[:,:]=0
print("Array Elements")
print(a)

Boolean Indexing
In NumPy, Boolean indexing allows us to filter elements from an array based on a specific
condition. We use Boolean masks to specify the condition. A Boolean mask is a NumPy
array containing truth values (True/False) that correspond to each element in the array.
Suppose we have an array named ‘a’.

a = np.array([12, 24, 16, 21, 32, 29, 7, 15])

We can create a mask that selects all elements of a that are greater than 20.

boolean_mask = a> 20

Above statement creates a Boolean mask that evaluates to True for elements that are
greater than 20, and False for elements that are less than or equal to 20. The resulting
mask is an array stored in the boolean_mask variable as below.

[False, True, False, True, True, True, False, False]

Boolean indexing allows us to create a filtered subset of an array by passing a Boolean
mask as an index. The boolean_mask selects only those elements in the array that have a
True value at the corresponding index position, as demonstrated below.

import numpy as np
a = np.array([12, 24, 16, 21, 32, 29, 7, 15])
boolean_mask = a > 20
print(boolean_mask)
print(a[boolean_mask])
a[boolean_mask]=0#sets all elements greater than 20 to zero
print(a)
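Several conditions can be combined into one mask. The sketch below reuses the same array and assumes the element-wise operators & (and), | (or) and ~ (not); the Python keywords and/or do not work on Boolean arrays, and each condition needs its own parentheses. The mask names are illustrative.

```python
import numpy as np

a = np.array([12, 24, 16, 21, 32, 29, 7, 15])

between = (a > 15) & (a < 30)    # both conditions must hold
outside = (a < 10) | (a > 30)    # either condition may hold
not_big = ~(a > 20)              # negation of a mask

print(a[between])   # 24, 16, 21, 29
print(a[outside])   # 32, 7
print(a[not_big])   # 12, 16, 7, 15
```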

Fancy Indexing
In NumPy, fancy indexing allows us to use an array of indices to access multiple array
elements at once. Fancy indexing can perform more advanced and efficient array
operations, including conditional filtering, sorting, and so on.

Example

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# select a single element
simple_indexing = a[3]
print("Simple Indexing:",simple_indexing) # 4

# select multiple elements
fancy_indexing = a[[1, 2, 5, 7]]
print("Fancy Indexing:",fancy_indexing) # [2 3 6 8]

# np.argsort returns the indices that would sort the array in ascending order
print("Indices of Sorted Data:",np.argsort(a))

# sort a using fancy indexing
sorted_array = a[np.argsort(a)]
print("Sorted Data:",sorted_array)

# sort in descending order
sorted_array = a[np.argsort(-a)]
print("Reverse Sorted Data:",sorted_array)

We can also use fancy indexing on multi-dimensional arrays, where the concept is the
same: a list of indices selects the corresponding rows.

import numpy as np
a=np.array([[1,3,6],[2,7,1],[1,9,4]])
ind=[0,2]
print(a[ind])#prints row 0 and row 2
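When one index list is passed per axis, the lists are paired up element by element. The following sketch (rows, cols and picked are illustrative names) selects the elements at positions (0,1) and (2,2), and also shows that fancy indexing returns a copy rather than a view.

```python
import numpy as np

a = np.array([[1, 3, 6], [2, 7, 1], [1, 9, 4]])

rows = [0, 2]
cols = [1, 2]
picked = a[rows, cols]   # elements a[0,1] and a[2,2]
print(picked)            # 3 and 4

picked[0] = 100          # modifies only the copy
print(a[0, 1])           # still 3
```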

Universal Functions: Fast Element-wise Array Functions


A universal function is a function that performs element-wise operations on data in
ndarray. We can think of them as fast vectorized wrappers for simple functions that take
one or more scalar values and produce one or more scalar results.

Example

import numpy as np
a = np.arange(10)
print("Dataset:",a)
s=np.sqrt(a)#unary universal function
print("Square Roots:",s)
e=np.exp(a)
print("Exp(a):",e)
x=np.random.randn(10)
y=np.random.randn(10)

z=np.maximum(x,y)#binary universal function
print("x=",x)
print("y=",y)
print("z=",z)

m=np.max(x)
print("Maximum=",m)
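A universal function may also return more than one array. As a small sketch, np.modf splits each element into its fractional and whole parts, while np.add is the binary ufunc behind the + operator. The input values are made up for illustration.

```python
import numpy as np

values = np.array([3.75, -1.5, 2.0])

# np.modf is a unary ufunc that returns two arrays
fractional, whole = np.modf(values)
print(fractional)   # 0.75, -0.5, 0.0
print(whole)        # 3.0, -1.0, 2.0

# np.add is a binary ufunc; values + 1.0 calls it internally
shifted = np.add(values, 1.0)
print(shifted)
```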

List of Unary Universal Functions

List of Binary Universal Functions

Data Processing With Arrays
Example

import numpy as np
import matplotlib.pyplot as plt

points = np.arange(-5, 5, 0.01) #1000 equally spaced points
xs, ys = np.meshgrid(points, points) #coordinate matrices for the grid
z = np.sqrt(xs ** 2 + ys ** 2) #distance of each grid point from the origin
print(z.shape)
plt.imshow(z, cmap=plt.cm.gray)
plt.colorbar()
plt.title("Image plot of a grid of values")
plt.show()
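What np.meshgrid returns is easier to see on a tiny grid. In this sketch (with made-up coordinates x and y) the first result repeats the x values along every row and the second repeats the y values along every column, so that an expression such as xs + ys is evaluated at every grid point.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])

xs, ys = np.meshgrid(x, y)
print(xs)         # x values repeated on each row
print(ys)         # y values repeated on each column
print(xs.shape)   # (2, 3): len(y) rows, len(x) columns

total = xs + ys   # evaluated for every (x, y) pair on the grid
print(total)
```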

Array Functions

Note: Write down programs to demonstrate each of the above methods

File Input and Output with Arrays


NumPy is able to save and load data to and from disk either in text or binary format.

Storing Arrays on Disk in Binary Format

np.save and np.load are the two workhorse
functions for efficiently saving and loading array data on disk. Arrays are saved by
default in an uncompressed raw binary format with file extension .npy. If the file path
does not already end in .npy, the extension will be appended. The array on disk can then
be loaded using np.load. We can save multiple arrays in a zip archive using np.savez and
passing the arrays as keyword arguments. When loading an .npz file, we get back a
dictionary-like object which loads the individual arrays.

import numpy as np
a = np.arange(10)
print("a=",a)
np.save('some_array', a)
b=np.load('some_array.npy')
print("b=",b)
c = np.arange(20)
print("c=",c)
np.savez('array_archive.npz', x=a, y=c)
arch = np.load('array_archive.npz')
print("Arrays in Archive:")
for k in arch:
    print(arch[k])
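The same round trip can be tried without leaving files behind by writing into a temporary directory. This is a sketch under that assumption; the use of tempfile and the names path, x_loaded and y_loaded are illustrative and not part of the notes.

```python
import os
import tempfile
import numpy as np

a = np.arange(10)
c = np.arange(20)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'array_archive.npz')
    np.savez(path, x=a, y=c)            # keyword names become the keys
    with np.load(path) as arch:         # closes the archive when done
        names = sorted(arch.files)      # names of the stored arrays
        x_loaded = arch['x']
        y_loaded = arch['y']

print(names)
print(np.array_equal(x_loaded, a), np.array_equal(y_loaded, c))
```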

Saving and Loading Text Files


Loading text from files is a fairly standard task. We will focus mainly on the read_csv and
read_table functions in pandas. Sometimes it is useful to load data into vanilla NumPy
arrays using np.loadtxt or the more specialized np.genfromtxt. These functions have
many options allowing us to specify different delimiters, converter functions for certain
columns, skipping rows, and other things. Take a simple case of a comma-separated file
as demonstrated in the example below. np.savetxt performs the inverse operation:
writing an array to a delimited text file.

Example

import numpy as np
a = np.loadtxt('/content/drive/My Drive/test.txt', delimiter=',')
print(a)
np.savetxt('/content/drive/My Drive/test1.txt', a)
print("File is saved")
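The example above depends on a file in a mounted drive. A self-contained variant, sketched here with a temporary directory and made-up data, writes an array with np.savetxt and reads it back with np.loadtxt using the same delimiter.

```python
import os
import tempfile
import numpy as np

a = np.array([[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.csv')
    # Write comma-separated text with two decimal places per value
    np.savetxt(path, a, delimiter=',', fmt='%.2f')
    # Read it back; the delimiter must match the one used for writing
    b = np.loadtxt(path, delimiter=',')

print(b)
print(np.allclose(a, b))
```

Note that np.savetxt writes space-separated values unless a delimiter is given.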

The genfromtxt() function is used to load data in a program from a text file. It takes
multiple argument values to clean the data of the text file. It also has the ability to deal
with missing or null values through the processes of filtering, removing, and replacing.

import numpy as np
# invoking genfromtxt method to read test.txt file
content = np.genfromtxt("/content/drive/My Drive/test.txt", dtype=str,
encoding = None, delimiter=",")
# print file data on console
print("File data:", content)
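How genfromtxt treats a missing field can be sketched with in-memory text instead of a file. The data below is made up; io.StringIO stands in for a file, and filling_values is the genfromtxt argument that replaces missing entries.

```python
import io
import numpy as np

text = "1,2,3\n4,,6\n7,8,9"   # the middle value of the second row is missing

# By default a missing numeric field becomes nan
raw = np.genfromtxt(io.StringIO(text), delimiter=",")
print(raw)

# filling_values replaces missing fields with the given value
filled = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=0)
print(filled)
```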

Linear Algebra
Linear algebra, like matrix multiplication, decompositions, determinants, and other
square matrix math, is an important part of any array library. Unlike some languages like
MATLAB, multiplying two two-dimensional arrays with * is an element-wise product
instead of a matrix dot product. The numpy.linalg module has a standard set of matrix
decompositions and things like inverse and determinant. Commonly-used numpy.linalg
functions are listed below.

#Matrix multiplication
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])
z=x.dot(y)
print(z)
r=np.dot(x, y)#equivalent to x.dot(y)
print(r)
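The @ operator is another way to write the same matrix product. The short sketch below reuses x and y from the example above and contrasts @ with the element-wise * operator.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])

product = x @ y          # matrix product, same result as x.dot(y)
print(product)

elementwise = x * x      # element-wise product, shapes must match
print(elementwise)
```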

#solving system of linear equations, finding determinant and inverse
#2x+3y-z=5
#x+3y-z=4
#3x-y+2z=7
import numpy as np
from numpy.linalg import inv, solve,det
a = np.array([[2,3,-1],[1,3,-1],[3,-1,2]])
b=np.array([5,4,7])
s=solve(a,b)
print(s)
d=det(a)
print("determinant of a=",d)
a_inv=inv(a) #separate name so the right-hand side vector b is not overwritten
print("Inverse of a=",a_inv)
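The results of solve and inv can be verified numerically. This sketch uses the same system of equations and checks, with np.allclose, that the solution satisfies the equations and that the matrix times its inverse gives the identity matrix.

```python
import numpy as np
from numpy.linalg import inv, solve, det

a = np.array([[2, 3, -1], [1, 3, -1], [3, -1, 2]])
b = np.array([5, 4, 7])

s = solve(a, b)
print(s)                                     # x=1, y=2, z=3

# Substituting the solution back reproduces the right-hand side
print(np.allclose(a @ s, b))

# A matrix multiplied by its inverse gives the identity matrix
print(np.allclose(a @ inv(a), np.eye(3)))

# The determinant is nonzero, so the system has a unique solution
print(round(det(a), 6))
```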

Random Number Generation


The numpy.random module supplements the built-in Python random with functions for
efficiently generating whole arrays of sample values from many kinds of probability
distributions. For example, you can get a 4 by 4 array of samples from the standard
normal distribution using normal. See table given below for a partial list of functions
available in numpy.random.

import numpy as np
np.random.seed(100)
d=np.random.randint(0,10)
print("d=",d)
samples = np.random.normal(size=(4, 4))
print(samples)
d=np.random.permutation([1,2,3])
print("d=",d)

l=[1,2,3,4,5]
np.random.shuffle(l) #shuffles the list in place and returns None
print("Shuffled List=",l)
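NumPy also provides a newer generator-based interface, np.random.default_rng, which keeps the random state in an object instead of a global seed. The sketch below mirrors the calls above with that interface; the seed value and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(100)         # generator with its own seed

d = rng.integers(0, 10)                  # one integer from 0 to 9
samples = rng.normal(size=(4, 4))        # 4 by 4 standard normal samples
perm = rng.permutation([1, 2, 3])        # returns a shuffled copy

items = [1, 2, 3, 4, 5]
rng.shuffle(items)                       # shuffles in place, returns None

print(d)
print(samples.shape)
print(perm, items)
```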

Introduction to pandas Data Structures


Series and DataFrame are two widely used data structures of Pandas. While they are not
a universal solution for every problem, they provide a solid, easy-to-use basis for most
applications.

Series

A Series is a one-dimensional array-like object containing an array of data and an
associated array of data labels, called its index.

Example

import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3]) #series data structure
print(obj.values) #displaying values in the data structure
print(obj[1]) #value at index 1
obj[2]=5 #modifying value
print(obj.values)
print(obj[[1,2,3]]) #displaying values at index 1, 2, and 3
obj = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
print(obj.values)
print(obj['a'])
obj=obj*2 # scalar multiplication
print(obj.values)
print(obj[obj>0])#boolean indexing
print(np.exp(obj)) #using universal functions

print(pd.isnull(obj))#checking for null values
print(pd.notnull(obj))
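In the example above isnull returns False everywhere because no value is missing. A Series built from a dict with an index that contains an extra label gives a case where a missing value does appear. The data and names in this sketch are made up for illustration.

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30}

obj = pd.Series(data)                          # dict keys become the index
obj2 = pd.Series(data, index=['a', 'b', 'd'])  # 'd' has no value, so it is NaN

print(obj2)
print(pd.isnull(obj2))
```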

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a
dict of Series. The data is stored as one or more two-dimensional blocks rather than a list,
dict, or some other collection of one-dimensional arrays.

import pandas as pd
data = {'State': ['Bagmati', 'Koshi', 'Karnali', 'Lumbini', 'Gandaki'],
'Year': [2000, 2001, 2002, 2001, 2002]}
frame1 = pd.DataFrame(data)#creating dataframe
print(frame1)
frame2 = pd.DataFrame(data,columns=["State","Year","Debt"])#"Debt" is not in data, so it is filled with NaN
print(frame2)
print(frame2["State"])#displaying column State
obj=pd.Series([2,5,3,3,4])
frame2["Debt"]=obj
print(frame2)#displaying data frame
print(frame2.values)#displaying in 2D array format

Index Objects

Pandas’s Index objects are responsible for holding the axis labels and other metadata (like
the axis names). Any array or other sequence of labels used when constructing a Series
or DataFrame is internally converted to an Index. Index objects are immutable and thus
can’t be modified by the user.

import pandas as pd
s= pd.Series(range(3), index=[1, 2, 3])
print(s)
print(s.index)
print(pd.Index(s)) #builds an Index from the values; pd.Int64Index was removed in pandas 2.0
#s.index[1]='d'# index is immutable
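The immutability of an Index can be demonstrated safely inside a try/except block. In this sketch, assigning to a single position fails with a TypeError, while replacing the whole index of the Series is allowed. The variable outcome is illustrative.

```python
import pandas as pd

s = pd.Series(range(3), index=[1, 2, 3])

try:
    s.index[1] = 'd'          # item assignment on an Index is not allowed
    outcome = "modified"
except TypeError:
    outcome = "TypeError"
print(outcome)

# Assigning a completely new index to the Series is permitted
s.index = ['a', 'b', 'c']
print(s)
```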

Essential Functionalities
This section discusses fundamental mechanics of interacting with the data contained in a
Series or DataFrame.

Reindexing

A critical method on pandas objects is reindex, which creates a new object with
the data conformed to a new index. Calling reindex on a Series rearranges the data
according to the new index, introducing missing values if any index values were not
already present.

import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
obj1 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj1)
obj2=obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(obj2)

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as ffill which forward fills the values.

import pandas as pd
obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj)
obj1=obj.reindex(range(6), method='ffill')
print(obj1)
obj2=obj.reindex(range(6), method='bfill')
print(obj2)

Dropping Entries From an Axis

Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop method
will return a new object with the indicated value or values deleted from an axis.

import pandas as pd
import numpy as np
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
obj1 = obj.drop('c')
print(obj1)
obj2 = obj.drop(['b','d'])
print(obj2)

With DataFrame, index values can be deleted from either axis:

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['r1', 'r2', 'r3', 'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data)
d=data.drop('r2') #drops row r2
print(d)
d=data.drop('c2',axis=1) #drops column c2
print(d)

Indexing, Selection, and Filtering

Series indexing works analogously to NumPy array indexing, except we can use the
Series’s index values instead of only integers.

import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj.iloc[2]) #position 2, same element as obj['c']; plain obj[2] is deprecated for label indexes in recent pandas
print(obj['c'])
print(obj[1:3])
print(obj[['b','c','d']])

Slicing with labels behaves differently than normal Python slicing in that the endpoint is
inclusive and setting using these methods works just as we would expect.

import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj.iloc[2]) #position 2, same element as obj['c']; plain obj[2] is deprecated for label indexes in recent pandas
print(obj['c'])
print(obj[1:3])
print(obj['b':'d'])
obj['b':'c'] = 5
print(obj)
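The inclusive endpoint of label slicing is easiest to see next to position slicing. This sketch uses the explicit loc (label-based) and iloc (position-based) accessors on the same Series; the names by_label and by_position are illustrative.

```python
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'d']     # label slice: endpoint 'd' is included
by_position = obj.iloc[1:3]     # position slice: endpoint 3 is excluded

print(by_label)                 # three elements: b, c, d
print(by_position)              # two elements: b, c
```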

Indexing into a DataFrame is for retrieving one or more columns either with a single
value or sequence:

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['r1', 'r2', 'r3', 'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data['c1'])
print(data[['c1','c3']])
print(data[:2])
print(data[data['c3'] > 5])

Another use case is in indexing with a Boolean DataFrame, such as one produced by a
scalar comparison.

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['r1', 'r2', 'r3',
'r4'],
columns=['c1', 'c2', 'c3', 'c4'])
print(data < 5)
data[data < 5] = 0
print(data)

Arithmetic and Data Alignment

One of the most important pandas features is the behavior of arithmetic between objects
with different indexes. When adding together objects, if any index pairs are not the same,
the respective index in the result will be the union of the index pairs.

import pandas as pd
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s3=s1+s1
print(s3)
s3=s1+s2
print(s3)

In the case of DataFrame, alignment is performed on both the rows and the columns:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
index=['1', '2', '3', '4'])
print(df1)
print(df2)
df=df1+df1
print(df)

df=df1+df2
print(df)

Arithmetic Methods with Fill Values

In arithmetic operations between differently-indexed objects, we might want to fill with
a special value, like 0, when an axis label is found in one object but not the other.

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'),
index=['1', '2', '3'])
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list('abcde'),
index=['1', '2', '3', '4'])
df=df1.add(df2, fill_value=0)
print(df)
df=df1.sub(df2, fill_value=0)
print(df)
df=df1.mul(df2, fill_value=0)
print(df)
df=df1.div(df2, fill_value=0)
print(df)

Operations between DataFrame and Series

As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. In such
a case, the operation is performed using broadcasting.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((3, 4)))
s=pd.Series([2,4,5,7])
df1=df+s
print(df)
print(s)
print(df1)

By default, arithmetic between DataFrame and Series matches the index of the Series on
the DataFrame’s columns, broadcasting down the rows. If an index value is not found in
either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form
the union.
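Both behaviours described above can be sketched on a small frame with made-up column labels: a Series whose index only partly matches the columns produces the union of labels, and the arithmetic methods accept axis=0 to match on the row index and broadcast across the columns instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))

# Series index partly matches the columns: result has the union a..e,
# with NaN wherever a label is missing on either side
s = pd.Series([1, 2], index=['a', 'e'])
union = df + s
print(union)

# axis=0 matches the Series on the row index and broadcasts over columns
first_col = df['a']
result = df.sub(first_col, axis=0)
print(result)
```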

Function application and mapping

NumPy universal functions work fine with pandas objects.

import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),index=['r1', 'r2', 'r3', 'r4'])
print(frame)
frame=np.abs(frame)
print(frame)

Another frequent operation is applying a function on 1D arrays to each column or row.

import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),index=['r1', 'r2', 'r3', 'r4'])
print(frame)
frame=np.abs(frame)
print(frame)
f= lambda x: x.max() - x.min()
fr=frame.apply(f,axis=0)
print(fr)

The function passed to apply need not return a scalar value; it can also return a Series
with multiple values.

import numpy as np
import pandas as pd

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),index=['r1', 'r2', 'r3', 'r4'])
print(frame)
frame=np.abs(frame)
print(frame)
fr=frame.apply(f)
print(fr)
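Two related variations are sketched below on a small frame with fixed, made-up values so that the results are predictable: apply with axis=1 calls the function once per row, and the map method of a Series applies a function to each element individually.

```python
import pandas as pd

frame = pd.DataFrame([[1.5, -2.25, 3.0], [-4.0, 5.5, -6.75]],
                     columns=list('bde'), index=['r1', 'r2'])

# axis=1: the function receives one row at a time
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)
print(row_range)       # r1: 5.25, r2: 12.25

# Series.map: the function receives one element at a time
formatted = frame['e'].map(lambda v: f"{v:.2f}")
print(formatted)       # '3.00', '-6.75'
```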

Sorting and Ranking

Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns a
new, sorted object.

import numpy as np
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])


obj1=obj.sort_index()
print(obj1)

With a DataFrame, we can sort by index on either axis. The data is sorted in ascending
order by default, but can be sorted in descending order too.

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
fr=frame.sort_index(axis=0)
print(fr)
fr=frame.sort_index(axis=1)
print(fr)
fr=frame.sort_index(axis=1,ascending=False)
print(fr)

To sort a Series by its values, use its sort_values method. Any missing values are sorted
to the end of the Series by default.

import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])


obj1=obj.sort_values()
print(obj1)

On a DataFrame, we may want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option:

import numpy as np

import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})


print(frame)
fr=frame.sort_values(by='b')
print(fr)
fr=frame.sort_values(by=['a','b'])
print(fr)
fr=frame.sort_values(by=['a','b'], ascending=False)
print(fr)

Ranking is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. Ties are broken according to a rule. By default, rank breaks
ties by assigning each group the mean rank. We can also rank in descending order.

import numpy as np
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])


obj1=obj.rank()
print(obj1)
obj1=obj.rank(method='first')
print(obj1)
obj1=obj.rank(ascending=False, method='min')
print(obj1)
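The tie-breaking rules can be compared side by side on the same Series. In the sketch below the value 7 appears twice and would occupy ranks 6 and 7, and the value 4 appears twice and would occupy ranks 4 and 5; each method resolves these ties differently.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                  # default: mean rank of the tied group
lowest = obj.rank(method='min')       # lowest rank of the tied group
highest = obj.rank(method='max')      # highest rank of the tied group
first = obj.rank(method='first')      # order of appearance breaks the tie

print(pd.DataFrame({'value': obj, 'average': average, 'min': lowest,
                    'max': highest, 'first': first}))
```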

DataFrame can compute ranks over the rows or the columns:

import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2,
5, 8, -2.5]})
print(frame)
fr=frame.rank(axis=1)
print(fr)

Axis indexes with Duplicate Values

Series may have duplicate indices. The index’s is_unique property can tell you whether
its values are unique or not. Data selection is one of the main things that behaves
differently with duplicates. Indexing a value with multiple entries returns a Series while
single entries return a scalar value.

import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])


print(obj.index.is_unique)
print(obj['a'])
print(obj['c'])

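The same logic extends to a DataFrame with duplicate labels, shown here for duplicate
column names.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=['a', 'a', 'b'])
print(df.columns.is_unique) #False, because 'a' appears twice
print(df['a']) #returns a DataFrame with both 'a' columns
print(df['b']) #returns a Series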

Summarizing and Computing Descriptive Statistics


Pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods that
extract a single value (like the sum or mean) from a Series or a Series of values from the
rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Calling
DataFrame’s sum method returns a Series containing column sums. Passing axis=1 sums
over the rows instead. NA values are excluded by default. If an entire slice (a row or
column in this case) is NA, mean returns NaN, while sum returns 0 in current pandas versions.


Some methods, like idxmin and idxmax, return indirect statistics, such as the index value
at which the minimum or maximum is attained. Another method, describe, produces
multiple summary statistics in one shot. The summary and descriptive statistics methods
of a DataFrame are listed in the table given below.

Example

import numpy as np
import pandas as pd
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df)
print(df.sum())
print(df.sum(axis=1))
print(df.mean())
print(df.describe())
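
The indirect statistics and the handling of missing values mentioned above can be sketched on the same DataFrame. This is a minimal illustration, not part of the original example; the variable names are hypothetical, and min_count and skipna are standard keyword arguments of the pandas reduction methods.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Index label at which each column reaches its maximum (NaN values are skipped).
max_labels = df.idxmax()
# Row sums; min_count=1 asks for NaN when a row contains no valid value at all.
row_sums = df.sum(axis=1, min_count=1)
# skipna=False lets any NaN in a row propagate to the result.
strict_means = df.mean(axis=1, skipna=False)

print(max_labels)
print(row_sums)
print(strict_means)
```

Row 'c' holds only missing values, so its sum is NaN here because of min_count=1.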

Covariance
Covariance is a measure of the relationship between two random variables. It indicates
the direction of the linear relationship between them. If the covariance of two variables
is positive, the variables tend to move in the same direction. If it is negative, they
tend to move in opposite directions. It can be calculated as below:
$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$

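A square matrix that gives the covariance between each pair of components (or elements)
of a given random vector is called a covariance matrix. Note that the DataFrame cov()
method used below computes the sample covariance, which divides by n - 1 rather than n.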

# importing pandas as pd
import pandas as pd

# Creating the dataframe


df = pd.DataFrame({"A":[5, 3, 6, 4],
"B":[11, 2, 4, 3],
"C":[4, 3, 8, 5],
"D":[5, 4, 2, 8]})
print(df)
print(df.cov())
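
To connect the formula above with the output of cov(), the sketch below computes the covariance of columns A and B by hand. It is an illustration under the assumption that only these two columns are needed; the names cross, cov_population and cov_sample are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 6, 4],
                   "B": [11, 2, 4, 3]})

x = df["A"].to_numpy(dtype=float)
y = df["B"].to_numpy(dtype=float)
n = len(x)

# Sum of the products of the deviations from the means.
cross = ((x - x.mean()) * (y - y.mean())).sum()

cov_population = cross / n        # divides by n, as in the formula above
cov_sample = cross / (n - 1)      # divides by n - 1, as DataFrame.cov() does

print(cov_population)             # 1.75
print(cov_sample, df.cov().loc["A", "B"])
```

The two printed values on the last line agree, which shows that cov() uses the n - 1 divisor.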

Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It is a common tool for
describing simple relationships without making a statement about cause and effect. The
sample correlation coefficient, r, quantifies the strength of the relationship. A correlation
coefficient close to 0, whether positive or negative, implies little or no linear relationship
between the two variables. A correlation coefficient close to +1 means a positive
relationship between the two variables, with an increase in one of the variables being
associated with an increase in the other variable. A correlation coefficient close to -1
indicates a negative relationship between the two variables, with an increase in one of the
variables being associated with a decrease in the other variable. The most common
measure is the Pearson correlation coefficient, which is used for linear dependency between
two data sets and is given below.
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - (\sum x)^2\right)\left(n \sum y^2 - (\sum y)^2\right)}}$$

Example

# importing pandas as pd
import pandas as pd

# Creating the dataframe


df = pd.DataFrame({"A":[5, 3, 6, 4],
"B":[11, 2, 4, 3],

"C":[4, 3, 8, 5],
"D":[5, 4, 2, 8]})
print(df)
print(df.corr())
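
As a check of the Pearson formula given above, the following sketch evaluates it term by term for columns A and B and compares the result with the value pandas reports. The variable names are illustrative and only two of the four columns are used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 6, 4],
                   "B": [11, 2, 4, 3]})

x = df["A"].to_numpy(dtype=float)
y = df["B"].to_numpy(dtype=float)
n = len(x)

# Numerator: n * sum(xy) - sum(x) * sum(y)
numerator = n * (x * y).sum() - x.sum() * y.sum()
# Denominator: sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
denominator = np.sqrt((n * (x ** 2).sum() - x.sum() ** 2) *
                      (n * (y ** 2).sum() - y.sum() ** 2))
r = numerator / denominator

print(r)                        # computed from the formula
print(df["A"].corr(df["B"]))    # computed by pandas (Pearson by default)
```

Here the numerator is 28 and the denominator is the square root of 4000, so r is about 0.44, a weak positive relationship.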

Unique Values, Value Counts, and Membership


Another class of related methods extracts information about the values contained in a
one-dimensional Series. The unique() method returns an array of the unique values in a
Series. The unique values are not necessarily returned in sorted order, but they can be
sorted afterwards if needed. The value_counts() method computes a Series containing
value frequencies. Lastly, isin() performs a vectorized set membership check and can be
very useful in filtering a data set down to a subset of values in a Series or column in a
DataFrame.

# importing pandas as pd
import pandas as pd
s=pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = s.unique()
print(uniques)
l=s.value_counts()
print(l)
m = s.isin(['b', 'c'])
print(m)
print(s[m])
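
The paragraph above mentions filtering a DataFrame column with isin() and sorting the unique values. A minimal sketch follows; the DataFrame, its column names (fruit, qty) and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the calls.
df = pd.DataFrame({'fruit': ['apple', 'mango', 'apple', 'kiwi', 'mango'],
                   'qty': [3, 5, 2, 7, 1]})

# Boolean mask: True where the fruit is one of the listed values.
mask = df['fruit'].isin(['apple', 'kiwi'])
subset = df[mask]

# Unique values, sorted with Python's built-in sorted().
sorted_uniques = sorted(df['fruit'].unique())

# Relative frequencies instead of raw counts.
shares = df['fruit'].value_counts(normalize=True)

print(subset)
print(sorted_uniques)
print(shares)
```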

Handling Missing Data


Missing data is common in most data analysis applications. One of the goals in designing
pandas was to make working with missing data as easy as possible.

Example


import pandas as pd
import numpy as np
s = pd.Series(['Orange', 'Mango', np.nan, 'Avocado'])
print(s)
print(s.isnull())
print(s.notnull())
s1=s.dropna()
print(s1)
s2=s.fillna(0)
print(s2)

Example 2
import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
print(data)
d1=data.dropna()
print(d1)
d2=data.fillna(0)
print(d2)
d3=data.dropna(how='all')
print(d3)
d4=data.dropna(how='all',axis=1)
print(d4)

By calling fillna with a dict, you can use a different fill value for each column. fillna
returns a new object by default, but you can modify the existing object in place by passing
inplace=True. The same interpolation methods available for reindexing, such as forward
filling with ffill, can also be used to fill missing values. With a little creativity, fillna
can do many other things. For example, we might pass the mean or median value of a Series.

import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
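d=data
# fillna returns a new object; the result is discarded here, so d is unchanged
d.fillna(0)
print(d)
# inplace=True modifies d itself (and data, since both names refer to the same object)
d.fillna(0,inplace=True)
print(d)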


import pandas as pd
import numpy as np
data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],[np.nan,
np.nan, np.nan], [np.nan, 6.5, 3.]])
print(data)
d=data.ffill()
print(d)
d=data.ffill(limit=1)
print(d)
d=data.fillna(data.mean())
print(d)
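
The examples above do not show fillna with a dict. A minimal sketch on the same data follows; the fill values 0.5 and the column median are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[np.nan, 6.5, 3.], [np.nan, np.nan, 2.0],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Dict keys are column labels (0, 1, 2 here). Column 0 is not listed,
# so its missing values are left untouched.
filled = data.fillna({1: 0.5, 2: data[2].median()})

print(filled)
```

Column 1 is filled with 0.5 and column 2 with its own median, while the original data object is not modified.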

Hierarchical Indexing
Hierarchical indexing, also known as multi-level indexing, is a way of organizing data in
Pandas with multiple levels of row or column labels. This allows you to work with more
complex data structures than a simple table with a single level of row labels and a single
level of column labels. For example, imagine you have a dataset with sales data for a
company, broken down by region and by quarter. You could organize this data with a
hierarchical index that has two levels: one for the region and one for the quarter.

Example

import pandas as pd

# Each index entry is a (region, quarter) tuple
index = [('Kathmandu', 'Q1'), ('Kathmandu', 'Q2'), ('Kathmandu', 'Q3'), ('Kathmandu', 'Q4'),
         ('Pokhara', 'Q1'), ('Pokhara', 'Q2'), ('Pokhara', 'Q3'), ('Pokhara', 'Q4')]
sales = [350, 500, 325, 475, 200, 300, 350, 250]
# A list of tuples used as the index produces a two-level MultiIndex
sales_data = pd.Series(sales, index=index)
print(sales_data)
print(sales_data.index)
# Print the Q2 sales of every region
for x in index:
    if x[1] == 'Q2':
        print(x, sales_data[x])
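
The loop above selects the Q2 entries by testing each tuple. As a sketch of what the hierarchical index offers directly, the example below builds the same data with an explicit MultiIndex and selects by level. The level names 'region' and 'quarter' are assumptions added for illustration.

```python
import pandas as pd

# Build the two-level index from the product of regions and quarters.
index = pd.MultiIndex.from_product(
    [['Kathmandu', 'Pokhara'], ['Q1', 'Q2', 'Q3', 'Q4']],
    names=['region', 'quarter'])
sales_data = pd.Series([350, 500, 325, 475, 200, 300, 350, 250], index=index)

# Partial indexing on the outer level: all quarters of one region.
kathmandu = sales_data['Kathmandu']

# Cross-section on the inner level: Q2 sales of every region.
q2 = sales_data.xs('Q2', level='quarter')

# Pivot the inner level into columns: regions as rows, quarters as columns.
table = sales_data.unstack()

print(kathmandu)
print(q2)
print(table)
```

unstack() turns the Series into an ordinary two-dimensional table, which is often easier to read than the long form.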

Panel Data
The Panel in Pandas was used for working with three-dimensional data. It had three
axes: items (axis 0), where each item corresponds to a DataFrame; major_axis (axis 1),
the rows of each DataFrame; and minor_axis (axis 2), the columns of each DataFrame. A
panel was created with the pd.Panel() constructor, for example from a three-dimensional
ndarray or from a dictionary of DataFrames, and data could be extracted from it using
different methods. Panel is deprecated and has been removed from current versions of
pandas; three-dimensional data is now usually represented by a DataFrame with a
hierarchical index.
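
Since Panel is no longer available, the sketch below shows one possible replacement: a dictionary of DataFrames combined into a single DataFrame with a hierarchical row index. The item names, column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Two small DataFrames standing in for the "items" of a panel (made-up data).
frames = {
    'Item1': pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y']),
    'Item2': pd.DataFrame(np.arange(6, 12).reshape(3, 2), columns=['x', 'y']),
}

# pd.concat on a dict uses the keys as the outer level of the row index.
stacked = pd.concat(frames, names=['item', 'row'])

# Selecting one item gives back an ordinary DataFrame.
item2 = stacked.loc['Item2']

# A plain three-dimensional NumPy array is another option: (items, rows, columns).
cube = np.stack([frame.to_numpy() for frame in frames.values()])

print(stacked)
print(item2)
print(cube.shape)
```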

Matplotlib Library
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. Matplotlib makes difficult things possible and simple things easy. matplotlib.pyplot is
a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes
some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines
in a plotting area, decorates the plot with labels, etc. Once we are done, we can save it with
savefig() or display it with show().
Example
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the x and y-axis
plt.ylabel("Billions")
plt.xlabel("Years")
plt.show()
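
The example above displays the chart with show(). Saving it with savefig() can be sketched as follows. The output file name and the temporary directory are assumptions, and the Agg backend is selected only so that the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

plt.figure(figsize=(6, 4))
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
plt.title("Nominal GDP")
plt.xlabel("Years")
plt.ylabel("Billions")

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "nominal_gdp.png")
plt.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close()
```

The file format is inferred from the extension, so a name ending in .pdf or .svg would produce a vector file instead.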

Bar Charts
A bar chart, often known as a bar graph, is a diagram that displays categorical data as rectangular
bars with heights or lengths proportional to the values they stand for. You can plot the bars either
vertically or horizontally. A vertical bar chart may also be referred to as a column chart.
Comparisons among distinct categories are displayed in a bar graph. The comparison categories
are shown on one axis of the chart, and a measured value is shown on the other axis.
Example
from matplotlib import pyplot as plt
Country = ["Nepal", "Sri Lanka", "Bangladesh", "India",
           "Bhutan", "Maldives", "Pakistan", "Afghanistan"]
GDP_growth_rate = [6.4, 4.5, 8.3, 7.4, 5.8,8.7,3.2,2.1]
# plot bars with Country as x-coordinate and GDP_growth_rate as height
plt.figure(figsize=(8,4))
plt.bar(Country, GDP_growth_rate)
plt.title("GDP Growth Rates of SAARC Countries") # add a title

plt.ylabel("GDP Growth Rate") # label the y-axis
plt.xlabel("Country") # label the x-axis
plt.show()

Calling the plt.barh() function as plt.barh(y, width), with the categories first and the
bar lengths second, plots a horizontal bar chart.
Example
from matplotlib import pyplot as plt
Country = ["Nepal", "Sri Lanka", "Bangladesh", "India",
           "Bhutan", "Maldives", "Pakistan", "Afghanistan"]
GDP_growth_rate = [6.4, 4.5, 8.3, 7.4, 5.8, 8.7, 3.2, 2.1]
# plot horizontal bars with Country on the y-axis and GDP_growth_rate as bar length
plt.figure(figsize=(8,4))
plt.barh(Country, GDP_growth_rate)
plt.title("GDP Growth Rates of SAARC Countries") # add a title
plt.ylabel("Country") # label the y-axis
plt.xlabel("GDP Growth Rate") # label the x-axis
plt.show()

In a stacked bar chart, the bars of each series are stacked one over another. An unstacked bar
chart compares the categories on a single measure; a stacked bar chart additionally shows how each
category's total is divided among the levels of a grouping variable. The stacked segments can
represent group counts or relative proportions. Occasionally, the segments are scaled so that each
bar sums to 100%.
Example
# importing package
import matplotlib.pyplot as plt
import numpy as np

# create data
x = ['A', 'B', 'C', 'D']
y1 = np.array([10, 20, 10, 30])
y2 = np.array([20, 25, 15, 25])
y3 = np.array([12, 15, 19, 6])
y4 = np.array([10, 29, 13, 19])

# plot bars in stack manner


plt.bar(x, y1, color='r')
plt.bar(x, y2, bottom=y1, color='b')
plt.bar(x, y3, bottom=y1+y2, color='y')
plt.bar(x, y4, bottom=y1+y2+y3, color='g')
plt.xlabel("Teams")
plt.ylabel("Score")

plt.legend(["Round 1", "Round 2", "Round 3", "Round 4"])
plt.title("Scores by Teams in 4 Rounds")
plt.show()
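
The 100% variant mentioned above can be sketched by converting each team's scores to percentages of that team's total before stacking. This reuses the data of the previous example; the output file name is an assumption, and the Agg backend is chosen only so the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = ['A', 'B', 'C', 'D']
# One row per round, one column per team.
rounds = np.array([[10, 20, 10, 30],
                   [20, 25, 15, 25],
                   [12, 15, 19, 6],
                   [10, 29, 13, 19]], dtype=float)

# Share of each round in the team's total score, in percent.
percent = rounds / rounds.sum(axis=0) * 100

bottom = np.zeros(len(x))
for i, row in enumerate(percent):
    plt.bar(x, row, bottom=bottom, label=f"Round {i + 1}")
    bottom = bottom + row  # the next round starts where this one ends

plt.xlabel("Teams")
plt.ylabel("Share of total score (%)")
plt.title("Share of Each Round in Team Scores")
plt.legend()

# Hypothetical output location; use plt.show() instead in an interactive session.
out_path = os.path.join(tempfile.gettempdir(), "stacked_percent.png")
plt.savefig(out_path)
plt.close()
```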

Line Charts
A line chart is a type of chart that represents data as points connected by straight line
segments. Line charts are a good choice for showing trends. They are used to represent the
relationship between two variables X and Y plotted on different axes.
Example
import matplotlib.pyplot as plt

quantity=[1123,1256,1289,1378,1456,1367,1256]
amount=[2246,2512,2588,2702,2912,3214,3250]
Month=["Jan","Feb","Mar","Apr","May","June","July"]

plt.figure(figsize=(8,4))
plt.plot(Month,quantity,marker='x')
plt.plot(Month,amount,marker='o')
plt.title('Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(["Sales Quantity","Sales Amount"],loc="upper left")

# Show the plot


plt.show()

Example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Load the dataset into a Pandas DataFrame
df = pd.read_csv("/content/drive/My Drive/HistoricalPrices.csv")

# Convert the date column to datetime


df['Date'] = pd.to_datetime(df['Date'])

# Sort the dataset in the ascending order of date


df = df.sort_values(by = 'Date')

plt.figure(figsize=(8,4))

# Plot the open and close prices against the date


plt.plot(df['Date'], df['Open'])
plt.plot(df['Date'], df['Close'])

plt.title('DJIA Open and Close Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend(["Open Price","Close Price"],loc="upper left")

# Show the plot


plt.show()

Scatterplots
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe
relationships between variables.

Example

import matplotlib.pyplot as plt


numofgames =[3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores =[80, 90, 75, 80, 90, 50, 65, 85, 40, 100]
teams=['A','B','C','D','E','F','G','H','I','J']

plt.scatter(numofgames, scores, c ="blue", marker='o', linewidths=0.25)


plt.title("Game Scores")
plt.xlabel("#Games")
plt.ylabel("Scores")

#Labeling Scatter plot


for i,txt in enumerate(teams):
    plt.annotate(txt, (numofgames[i], scores[i]))

# To show the plot


plt.show()

Histogram and Density Plots


A histogram is a graph that displays the frequency of data in equal intervals or bins. It
consists of a series of bars, where each bar represents a range of values, and the height of
the bar corresponds to the number of data points that fall within that range. Histograms
are commonly used to show the distribution of a single variable, such as age, income, or
test scores.

Example


import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(170, 10, 250)
num_bins = 7
plt.figure(figsize=(4,3))
plt.hist(x, num_bins, color='Blue', alpha=0.5)
plt.show()
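
To see the numbers behind a histogram like the one above, the sketch below computes the bin counts and bin edges with NumPy, without drawing anything. The fixed seed is an assumption added so that the sample is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded generator, for reproducibility
x = rng.normal(170, 10, 250)          # 250 values, mean 170, std. deviation 10

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(x, bins=7)

print(counts)
print(edges)
print(counts.sum())                   # every value falls into exactly one bin
```

Seven bins need eight edges, and the bins are of equal width between the smallest and largest value in the sample.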

Example 2

import matplotlib.pyplot as plt
import pandas as pd
flights = pd.read_csv('/content/drive/My Drive/Data/flights.csv')
print(flights)
plt.figure(figsize=(9,7))
# drop missing delays first, since NaN values can make hist() fail
plt.hist(flights['arr_delay'].dropna(), color = 'blue', edgecolor = 'black',
         bins = int(180/5))
plt.show()

A density plot shows the probability density function of a variable. It is a smoothed
version of a histogram, where the bars are replaced by a continuous line. Density plots
are useful for showing the shape of a distribution and identifying its mode, skewness,
and kurtosis.

Example

import matplotlib.pyplot as plt


import pandas as pd
import seaborn as sns
flights = pd.read_csv('/content/drive/My Drive/Data/flights.csv')
plt.figure(figsize=(9,7))
sns.kdeplot(flights['arr_delay'], fill=True, color='blue')
plt.show()

Plotting Maps
Maps have been used for centuries to help people navigate and understand their
surroundings. In the age of big data, maps have become an essential tool for data
visualization. They allow us to visualize data in a way that is intuitive, interactive, and
easy to understand. Maps can help us identify patterns and relationships that might be
difficult to see in other types of visualizations.

Plotly is a powerful data visualization library for Python that allows you to create a wide
range of interactive visualizations, including maps. One of the advantages of Plotly is
that it is designed to work seamlessly with other Python libraries, such as Pandas and
NumPy. This makes it easy to import and manipulate data and to create visualizations
that are customized to your specific needs.

The scatter_geo() function of Plotly Express, which is built on the Scattergeo trace of
plotly.graph_objects, is used to create a scatter plot on a geographic map. This means
that it can help you plot points on a map where each point represents a specific
geographic location, like a city or a landmark. For example, if you have a dataset that
contains the latitude and longitude coordinates of different cities around the world, you
can use scatter_geo() to plot each city on a world map.

Example

import plotly.express as px
import pandas as pd

# Import data from USGS
data = pd.read_csv('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv')

# Drop rows with missing values in the 'mag' column and keep magnitudes of 4 or above
data = data.dropna(subset=['mag'])
data = data[data.mag >= 4]

# Create scatter map


fig = px.scatter_geo(data, lat='latitude', lon='longitude', color='mag',
hover_name='place', #size='mag',
title='Earthquakes Around the World')
fig.show()

Python Visualization Tool Ecosystem


There are a plethora of options for creating graphics in Python. In addition to open
source libraries, there are numerous commercial libraries with Python bindings. Matplotlib
is the most widely used plotting tool in Python. While it is an important part of the
scientific Python ecosystem, matplotlib has plenty of shortcomings when it comes to the
creation and display of statistical graphics. MATLAB users will likely find matplotlib
familiar, while R users may be somewhat disappointed. It is possible to make beautiful
plots for display on the web in matplotlib, but doing so often requires significant effort
as the library is designed for the printed page. There are a number of other visualization
tools in wide use. A few of them are listed below.

• Chaco
• Mayavi
• Other packages

See Book for details
