Numpy Basics
NumPy, short for Numerical Python, is the fundamental package for high-performance
scientific computing and data analysis in Python. It adds support for
large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was
originally created by Jim Hugunin with contributions from several other developers. In
2005, Travis Oliphant created NumPy by incorporating features of the competing
Numarray into Numeric, with extensive modifications. NumPy is open-source software
and has many contributors. Here are some of the things it provides:
• ndarray, a fast and space-efficient multidimensional array providing
vectorized arithmetic operations and sophisticated broadcasting capabilities
• Standard mathematical functions for fast operations on entire arrays of data without
having to write loops
• Tools for reading / writing array data to disk and working with memory-mapped files
• Linear algebra, random number generation, and Fourier transform capabilities
• Tools for integrating code written in C, C++, and Fortran
The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is
a fast, flexible container for large data sets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements:
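A minimal sketch of such whole-array arithmetic, assuming a small hypothetical array named data (the values are chosen only for illustration):

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

scaled = data * 10      # every element is multiplied by 10
summed = data + data    # element-wise addition of two arrays

print(scaled)
print(summed)
```

The same expressions would work unchanged if data were a single number, which is what makes the array syntax easy to read.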
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of
non-negative integers. In NumPy, dimensions are called axes. The number of dimensions is
the rank of the array; the shape of an array is a tuple of integers giving the size of the
array along each dimension.
We can initialize numpy arrays from nested Python lists, and access elements using
square brackets:
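For instance, the following sketch builds a two-dimensional array from a nested list and inspects it; the array contents are a hypothetical example:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2-D array built from a nested list

print(a.ndim)    # number of axes
print(a.shape)   # size along each axis
print(a.dtype)   # common element type (a platform-dependent integer type)
print(a[0, 1])   # row 0, column 1
print(a[1][2])   # the same kind of access written as chained indexing
```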
Array indexing starts from 0. Matrix operations can be done with NumPy arrays.
You can explicitly convert or cast an array from one dtype to another using ndarray’s
astype method:
A=np.arange(10)
print(A)
[0 1 2 3 4 5 6 7 8 9]
print(A.astype(np.float64))
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Arrays of zeros and ones
import numpy as np
za = np.zeros( (2, 3) )
print(za)
Output:
[[0. 0. 0.]
[0. 0. 0.]]
oa= np.ones( (2,3), dtype=np.int32 )
print(oa)
Output:
[[1 1 1]
 [1 1 1]]
Identity matrix
import numpy as np
print(np.eye(4,4))
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
Constant array
import numpy as np
print(np.full((4,4),5))
[[5 5 5 5]
[5 5 5 5]
[5 5 5 5]
[5 5 5 5]]
An uninitialized array
import numpy as np
print(np.empty((4,2)))
[[2.37663529e-312, 2.14321575e-312],
[2.37663529e-312, 2.56761491e-312],
[1.18831764e-312, 1.10343781e-312],
[2.02566915e-322, 0.00000000e+000]]
A = np.arange(4)
print('A =', A)
A = [0 1 2 3]
B = np.arange(12).reshape(2, 6)
print('B =', B)
B = [[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]
print(np.arange(2, 3, 0.1) )
[2.  2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]
A=np.random.random((2,3))
print(A)
[[0.5516055 0.27255692 0.74313995]
[0.39238878 0.63832042 0.11740813]]
A=np.array([[1,2],[3,4]])
print(A)
[[1 2]
[3 4]]
adding a row
A=np.append(A,[[5,6]],axis=0)
print(A)
[[1 2]
[3 4]
[5 6]]
adding a column
A=np.array([[1,2],[3,4]])
A=np.append(A,[[5],[6]],axis=1)
print(A)
[[1 2 5]
[3 4 6]]
A=np.array([10,20,30,40,50,60,70,80])
print(A)
[10 20 30 40 50 60 70 80]
A=np.delete(A,1)
print(A)
[10 30 40 50 60 70 80]
deleting a row
A=np.array([[10,20,30],[40,50,60],[70,80,90]])
print(A)
[[10 20 30]
[40 50 60]
[70 80 90]]
A=np.delete(A,1,axis=0)
print(A)
[[10 20 30]
[70 80 90]]
If you assign a scalar value to a slice, as in A[0:3] = 100, the value is
propagated (or broadcast) to the entire selection. An important first
distinction from lists is that array slices are views on the original array. This means
that the data is not copied, and any modifications to the view will be reflected in the source
array:
A=np.arange(10)
A[0:3]=100
view=A[5:9]
print(view)
[5 6 7 8]
view[:]=200
print(A)
[100 100 100 3 4 200 200 200 200 9]
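When an independent copy is wanted instead of a view, the slice can be copied explicitly. A small sketch, with hypothetical variable names, that compares the two using np.shares_memory:

```python
import numpy as np

A = np.arange(10)
view = A[5:8]                 # a slice: shares memory with A
independent = A[5:8].copy()   # an explicit, independent copy

independent[:] = -1           # does not touch A
view[:] = 99                  # writes through to A

print(A)
print(np.shares_memory(A, view), np.shares_memory(A, independent))
```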
B=np.arange(10)
print(B[0:8:2])
[0 2 4 6]
print(B[8:0:-2])
[8 6 4 2]
print(B[:4])
[0 1 2 3]
print(B[5:])
[5 6 7 8 9]
print(B[::-1])
[9 8 7 6 5 4 3 2 1 0]
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(A[2])
[7 8 9]
print(A[2,2])
9
print(A[2][2])
9
print(A[1:,1:])
[[5 6]
[8 9]]
print(A[:2,1:])
[[2 3]
[5 6]]
A[:,:1]
print(A[:,:1])
[[1]
[4]
[7]]
A[:,:1]=10
print(A)
[[10 2 3]
[10 5 6]
[10 8 9]]
Suppose student names are stored in an array names and their marks in another array
marks, one row per name. Names may be repeated, and a boolean condition on names can be
used to select all the matching rows of the marks array.
import numpy as np
names=np.array(['biju','bini','binu','bini'])
marks=np.array([[30,40,40],[45,46,47],[38,40,45],[47,48,30]])
index=names=='bini'
print(index)
[False True False True]
print(marks[index])
[[45 46 47]
[47 48 30]]
Selecting data from an array by boolean indexing always creates a copy of the data, even
if the returned array is unchanged.
Fancy Indexing
We can select particular rows or columns, in any order, by indexing with a list or array of
integer indices:
A=np.empty([4,2])
print(A)
[[2.37663529e-312 2.14321575e-312]
[2.37663529e-312 2.56761491e-312]
[1.18831764e-312 1.10343781e-312]
[2.02566915e-322 0.00000000e+000]]
index=[2,3,1]
print(A[index])
[[1.18831764e-312 1.10343781e-312]
[2.02566915e-322 0.00000000e+000]
[2.37663529e-312 2.56761491e-312]]
index=[-1,-3]
print(A[index])
[[2.02566915e-322 0.00000000e+000]
[2.37663529e-312 2.56761491e-312]]
Passing multiple index arrays does something slightly different; it selects a 1D array of
elements corresponding to each tuple of indices:
print(A[[1,2],[1,1]]) # prints elements at (1,1) and (2,1)
[2.56761491e-312 1.10343781e-312]
Another way is to use the np.ix_ function, which converts two 1D integer arrays to an
indexer that selects the square region: (1,0) (1,1) (2,0) (2,1).
print(A[np.ix_([1,2],[0,1])])
[[2.37663529e-312 2.56761491e-312]
[1.18831764e-312 1.10343781e-312]]
Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
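The difference can be checked directly. The following sketch, using a hypothetical 4 by 3 array, modifies the result of a fancy-indexing selection and of a slice and then prints the original array:

```python
import numpy as np

A = np.arange(12).reshape(4, 3)

picked = A[[0, 2]]     # fancy indexing: rows 0 and 2 are copied into a new array
picked[:] = -1         # modifies only the copy

sliced = A[0:2]        # basic slicing: a view on A
sliced[:] = 0          # modifies A itself

print(A)
print(np.shares_memory(A, picked), np.shares_memory(A, sliced))
```

Only the assignment through the slice changes A; the assignment through the fancy-indexed result leaves row 2 untouched.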
Universal Functions
A universal function, or ufunc, is a function that performs element wise operations
on data in ndarrays. You can think of them as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results.
A unary function takes a single array and applies the operation to every element.
A=np.arange(10)
print(np.sqrt(A))
print(np.exp(A))
[1.00000000e+00 2.71828183e+00 7.38905610e+00 2.00855369e+01 5.45981500e+01
1.48413159e+02 4.03428793e+02 1.09663316e+03 2.98095799e+03 8.10308393e+03]
print(np.square(A))
[ 0 1 4 9 16 25 36 49 64 81]
The binary functions take two arrays as arguments and return a single array.
x=np.array([3,4,5,6])
y=np.array([1,4,7,2])
print(np.minimum(x,y))
[1 4 5 2]
print(np.maximum(x,y))
[3 4 7 6]
print(np.mod(x,y))
[0 0 5 0]
Data Processing
A typical use of where in data analysis is to produce a new array of values based on
another array. Suppose you had a matrix of randomly generated data and you wanted to
replace all positive values with 2 and all negative values with -2. This is very easy to do
with np.where:
A=np.random.randn(4,4)
B=np.where(A>0,2,-2)
print(A)
print(B)
[[-2 -2 2 -2]
[ 2 2 2 -2]
[-2 -2 -2 2]
[-2 2 2 2]]
statistical methods
A=np.array([[1,2,3],[4,5,6]])
print(A)
[[1 2 3]
[4 5 6]]
print(np.min(A))
1
print(np.min(A,axis=0))
[1 2 3]
print(np.min(A,axis=1))
[1 4]
print(np.max(A))
6
print(np.max(A,axis=0))
[4 5 6]
print(np.max(A,axis=1))
[3 6]
print(np.sum(A))
21
print(np.sum(A,axis=0))
[5 7 9]
print(np.sum(A,axis=1))
[ 6 15]
print(np.mean(A))
3.5
print(np.mean(A,axis=0))
[2.5 3.5 4.5]
print(np.mean(A,axis=1))
[2. 5.]
print(np.var(A))
2.9166666666666665
print(np.var(A,axis=0))
[2.25 2.25 2.25]
print(np.var(A,axis=1))
[0.66666667 0.66666667]
print(np.std(A))
1.707825127659933
print(np.std(A,axis=0))
[1.5 1.5 1.5]
print(np.std(A,axis=1))
[0.81649658 0.81649658]
print(np.cumsum(A))
[ 1 3 6 10 15 21]
print(np.cumsum(A,axis=0))
[[1 2 3]
[5 7 9]]
print(np.cumsum(A,axis=1))
[[ 1 3 6]
[ 4 9 15]]
print(np.cumprod(A))
[ 1 2 6 24 120 720]
print(np.cumprod(A,axis=0))
[[ 1 2 3]
[ 4 10 18]]
print(np.cumprod(A,axis=1))
[[ 1 2 6]
[ 4 20 120]]
A=np.random.randn(2,3)
print(A)
[[-0.4203773 -0.30855868 -1.08973324]
[ 1.40028156 0.5397636 -0.02366591]]
union of elements
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.union1d(A,B))
[1 2 3 4 5 6 7 8]
difference
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.setdiff1d(A,B))
[1 2 6 7]
symmetric difference
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.setxor1d(A,B))
[1 2 6 7 8]
np.save('arr',np.arange(10)) # writes the file arr.npy
A=np.load('arr.npy')
print(A)
[0 1 2 3 4 5 6 7 8 9]
Multiplication
import numpy as np
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
C = A.dot(B)
print(C)
[[ 30 18]
[ 27 -51]]
Transpose
import numpy as np
A = np.array([[2, 4], [5, -6]])
print(A.T)
print(A.transpose()) # same result as A.T
[[ 2  5]
 [ 4 -6]]
[[ 2  5]
 [ 4 -6]]
Multiplying array with * operation results in element wise multiplication
import numpy as np
A = np.array([[2, 4], [5, -6]])
print(A*A)
[[ 4 16]
[25 36]]
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(1/A)
[[1.         0.5       ]
 [0.33333333 0.25      ]]
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(A*2)
[[2 4]
[6 8]]
Random Numbers
Random means something that cannot be predicted logically. Computers work on
programs, and programs are definite sets of instructions, so there must be
some algorithm to generate a random number as well.
If a program generates the random number, the number can in principle be predicted, and
thus it is not truly random.
Random numbers generated through a generation algorithm are called pseudo-random.
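One way to see this predictability is to fix the seed of the generator. In the sketch below the seed value 42 is an arbitrary choice; resetting the generator to the same seed reproduces exactly the same numbers:

```python
import numpy as np

np.random.seed(42)            # fix the starting state of the generator
first = np.random.rand(3)

np.random.seed(42)            # same seed again
second = np.random.rand(3)

print(first)
print(second)
print(np.array_equal(first, second))
```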
The random module's rand() method returns a random float between 0 and 1.
import numpy as np
x = np.random.rand()
print(x)
0.2733166576024767
This will generate 10 random numbers
x = np.random.rand(10)
print(x)
[0.82536563 0.46789636 0.28863107 0.83941914 0.24424812 0.2581629
0.72567413 0.80770073 0.32845661 0.34451507]
Generate an array with size (3,5)
x = np.random.rand(3,5)
print(x)
[[0.16220086 0.80935717 0.97331357 0.60975199 0.48542906]
 [0.68311884 0.27623475 0.73447814 0.29257476 0.27329666]
 [0.62625815 0.0069779  0.21403868 0.49191027 0.4116709 ]]
The choice() method returns a random value from an array of values.
import numpy as np
x = np.random.choice([3,5,6,7,9,2])
print(x)
3
import numpy as np
x = np.random.choice([3,5,6,7,9,2],size=(3,5))
print(x)
[[3 2 5 2 6]
[5 9 3 6 9]
[5 6 9 3 3]]
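The choice() method also accepts a list of probabilities through the p parameter, one for
each value in the array. Each probability is a number between 0 and 1, where 0 means that
the value will never occur and 1 means that the value will always occur. The probabilities
must add up to 1.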
Example
import numpy as np
x = np.random.choice([3,5,7,9],p=[0.1,0.3,0.6,0.0],size=10)
print(x)
[5 7 7 7 5 7 7 3 7 5]
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of [1,
2, 3] and vice-versa.
The NumPy Random module provides two methods for
this: shuffle() and permutation().
Shuffling Arrays
Shuffle means changing arrangement of elements in-place. i.e. in the array itself.
import numpy as np
x=np.array([1,2,3,4,5])
np.random.shuffle(x)
print(x)
[4 1 3 5 2]
Generating Permutation of Arrays
The permutation() method returns a re-arranged array (and leaves the original array
unchanged).
import numpy as np
x=np.array([1,2,3,4,5])
y=np.random.permutation(x)
print(y)
[3 1 5 2 4]
The numpy.random module supplements the built-in Python random with functions
for efficiently generating whole arrays of sample values from many kinds of probability
distributions. For example, you can get a 4 by 4 array of samples from the standard normal
distribution using normal:
import numpy as np
print(np.random.normal(size=(4,4)))
[[ 0.18577774 -1.07506339 1.0338707 1.32696306]
[ 0.41939598 -1.15732977 -0.19081001 0.10567808]
[ 0.7482679 -0.39357911 0.08297663 -0.60563642]
[ 0.23671784 -1.3504756 0.24030689 0.4240251 ]]
Binomial Distribution
Example
Given 10 trials for coin toss generate 10 data points:
import numpy as np
x = np.random.binomial(n=10, p=0.5, size=10)
print(x)
Visualization of Binomial Distribution
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
# histplot replaces the deprecated distplot function
sns.histplot(random.binomial(n=10, p=0.5, size=1000), kde=False)
plt.show()
x=np.array([[1,2,3],[4,5,6]])
y=np.array([2,4,1])
print(x.dot(y))
print(x@y)
print(np.dot(x,y))
[13 34]
[13 34]
[13 34]
import numpy as np
A = np.array([[6, 1, 1],
              [4, -2, 5],
              [2, 8, 7]])
Rank of a matrix
print("Rank of A:", np.linalg.matrix_rank(A))
Rank of A: 3
Diagonals of a matrix
print(np.diag(A))
[ 6 -2  7]
print(np.diag(A,k=1)) # above the main diagonal
[1 5]
print(np.diag(A,k=-1)) # below the main diagonal
[4 8]
y=np.fliplr(A)
print(np.diag(y))
[ 1 -2  2]
z=np.flipud(A)
print(np.diag(z))
[ 2 -2  1]
Trace of matrix A
print("Trace of A:", np.trace(A))
Trace of A: 11
Determinant of a matrix
print("Determinant of A:", round(np.linalg.det(A), 1)) # rounded to hide floating-point noise
Determinant of A: -306.0
Inverse of matrix A
print("Inverse of A:")
print(np.linalg.inv(A))
Transpose of matrix A
print("Transpose of A:")
print(A.T)
Transpose of A:
[[ 6  4  2]
 [ 1 -2  8]
 [ 1  5  7]]
Power
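A minimal sketch, assuming this heading refers to raising a square matrix to an integer power; the matrix is a hypothetical example. np.linalg.matrix_power performs repeated matrix multiplication, unlike the ** operator, which works element by element:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

squared = np.linalg.matrix_power(A, 2)   # same as A @ A
elementwise = A ** 2                     # squares each entry separately

print(squared)
print(elementwise)
```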
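An eigenvector X of a matrix A and its eigenvalue e satisfy AX = eX.
The eigenvalues of a real symmetric matrix are always real, and eigenvectors belonging to
distinct eigenvalues are always orthogonal.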
import numpy as np
A=np.array([[ 2 , 3, 5],
[ 3, -2 ,1],
[ 1, 5 , 7 ]])
e,v=np.linalg.eig(A)
print(e)
print(v)
print(v[:,0]*e[0]) # eX
print(A.dot(v[:,0])) # AX
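The matrix A above is not symmetric. To check the statement about symmetric matrices, the following sketch uses a hypothetical symmetric matrix S and np.linalg.eigh, the NumPy routine intended for symmetric (Hermitian) input:

```python
import numpy as np

# Hypothetical symmetric matrix (S equals its transpose)
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

e, v = np.linalg.eigh(S)   # eigh is intended for symmetric/Hermitian matrices

print(e)                                          # real eigenvalues in ascending order
print(np.allclose(v.T @ v, np.eye(3)))            # columns of v are orthonormal
print(np.allclose(S @ v[:, 0], e[0] * v[:, 0]))   # checks AX = eX for the first pair
```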
One of the greatest benefits of visualization is that it presents huge amounts of data
in an easily digestible form. Matplotlib provides several kinds of plots, such as line,
bar, scatter and histogram plots.
Anaconda is a free and open source distribution of the Python and R programming
languages for large-scale data processing, predictive analytics, and scientific computing.
The distribution makes package management and deployment simple and easy.
Matplotlib and many other useful data science tools form part of the distribution. If you
have Anaconda installed on your computer, Matplotlib can be used directly; otherwise,
install Matplotlib separately.
import numpy as np
import math
import matplotlib.pyplot as plt
x = np.arange(0, math.pi*2, 0.05)
3. The ndarray object serves as the values on the x axis of the graph. The corresponding
sine values of the angles in x, to be displayed on the y axis, are obtained by the
following statement:
y = np.sin(x)
4. The values from the two arrays are plotted using the plot() function.
plt.plot(x,y)
5.You can set the plot title, and labels for x and y axes.
plt.xlabel("angle")
plt.ylabel("sine")
plt.title('sine wave')
plt.show()
import numpy as np
import math
import matplotlib.pyplot as plt
x = np.arange(0, math.pi*2, 0.05)
y = np.sin(x)
plt.plot(x,y)
plt.xlabel("angle")
plt.ylabel("sine")
plt.title('sine wave')
plt.show()
PyLab is a convenience module that bulk imports matplotlib.pyplot (for plotting) and
NumPy (for Mathematics and working with arrays) in a single name space. Although many
examples use PyLab, it is no longer recommended.
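For comparison, here is a sketch of a simple plot written with explicit imports, the style generally used in place of PyLab. The file name parabola.png is a hypothetical choice; in an interactive session plt.show() could be used instead of saving:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 30)     # 30 evenly spaced points
y = x**2

fig, ax = plt.subplots()
ax.plot(x, y, 'r.')            # red dots
ax.set_xlabel('x')
ax.set_ylabel('x squared')
fig.savefig('parabola.png')    # hypothetical file name
```

With explicit imports it is always clear whether a name such as linspace comes from NumPy or from Matplotlib.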
basic plot
from pylab import *
x = linspace(-3, 3, 30)
y = x**2
plot(x, y)
show()
from pylab import *
x = linspace(-3, 3, 30)
y = x**2
plot(x, y, 'r.')
show()
plot(x, sin(x))
show()
Color codes
Character Color
'b' Blue
'g' Green
'r' Red
'c' Cyan
'm' Magenta
'y' Yellow
'k' Black
'w' White
Marker codes
Character Description
‘x’ X marker
Line styles
Character Description
plot(x, sin(x),label='sin')
grid(True)
title('waves')
xlabel('x')
legend(loc='upper right')
show()
The following code creates three separate figures and plots one curve in each:
figure(1)
plot(x, sin(x),label='sin')
xlabel('x')
ylabel('sin')
legend(loc='upper right')
grid(True)
figure(2)
plot(x, cos(x),label='cos')
xlabel('x')
ylabel('cos')
legend(loc='upper right')
grid(True)
figure(3)
plot(x, -sin(x),label='-sin')
xlabel('x')
ylabel('-sin')
legend(loc='upper right')
grid(True)
show()
Creating a histogram
# x-axis values
x = [5, 2, 9, 4, 7,5,5,5,4,9,9,9,9,9,9,9,9,9]
plt.hist(x)
plt.show()
Scatter Plot
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
plt.scatter(x, y)
plt.show()
Stem plot
x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
# Function to plot a stem plot
plt.stem(x, y)
plt.show()
Pie Plot
data=[20,30,10,50]
pie(data)
show()
Subplots within the same figure
subplot(2,2,1)
plot(x, sin(x),label='sin')
xlabel('x')
ylabel('sin')
legend(loc='upper right')
grid(True)
subplot(2,2,2)
plot(x, cos(x),label='cos')
xlabel('x')
ylabel('cos')
legend(loc='upper right')
grid(True)
subplot(2,2,3)
plot(x, -sin(x),label='-sin')
xlabel('x')
ylabel('-sin')
legend(loc='upper right')
grid(True)
subplot(2,2,4)
plot(x, tan(x),label='tan')
xlabel('x')
ylabel('tan')
grid(True)
show()
Ticks in Plot
Ticks are the values used to show specific points on the coordinate axis. It can be a
number or a string. Whenever we plot a graph, the axes adjust and take the default ticks.
Matplotlib’s default ticks are generally sufficient in common situations but are in no way
optimal for every plot. Here, we will see how to customize these ticks as per our need.
The following program shows the default ticks and customized ticks
import matplotlib.pyplot as plt
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y = [1, 4, 3, 2, 7, 6, 9, 8, 10, 5]
# figure 1: default ticks
plt.figure(1)
plt.plot(x, y, 'b')
plt.xlabel('x')
plt.ylabel('y')
# figure 2: customized ticks on the y axis
plt.figure(2)
plt.plot(x, y, 'r')
plt.xlabel('x')
plt.ylabel('y')
plt.tick_params(axis='y',colors='red',rotation=45)
plt.show()
The direction parameter of tick_params() takes 'in', 'out' or 'inout' and puts the ticks
inside the axes, outside the axes, or both.
Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR
Capital Management. He convinced AQR to allow him to open source the library.
Another AQR employee, Chang She, joined as the second major contributor to the library
in 2012.
Advantages
• Fast and efficient for manipulating and analyzing data.
• Data from different file objects can be loaded.
• Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
• Size mutability: columns can be inserted and deleted from DataFrame and higher
dimensional objects.
• Data set merging and joining.
• Flexible reshaping and pivoting of data sets.
• Provides time-series functionality.
• Powerful group by functionality for performing split-apply-combine operations on data
sets.
Pandas provides two main data structures for manipulating data:
1) Series
2) DataFrame
Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called the
index. A Series can be thought of as a single column in an Excel sheet. Labels need not be
unique but must be of a hashable type. The object supports both integer and label-based
indexing and provides a host of methods for performing operations involving the index.
A Series is a one-dimensional array-like object containing an array of data (of any NumPy
data type) and an associated array of data labels, called its index. The simplest Series is
formed from only an array of data:
import pandas as pd
obj=pd.Series([3,5,-8,7,9])
print(obj)
0 3
1 5
2 -8
3 7
4 9
dtype: int64
print(obj.values)
[ 3  5 -8  7  9]
print(obj.index)
RangeIndex(start=0, stop=5, step=1)
Often it will be desirable to create a Series with an index identifying each data point:
obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])
print(obj2)
d 4
b 7
a -5
c 3
dtype: int64
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or
applying math functions, will preserve the index-value link:
print(obj2[obj2>0])
d 4
b 7
c 3
dtype: int64
print(obj2*2)
d 8
b 14
a -10
c 6
dtype: int64
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of
index values to data values. It can be substituted into many functions that expect a dict:
'b' in obj2
True
'e' in obj2
False
If you have data contained in a Python dict, you can create a Series from it by passing the
dict:
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=pd.Series(sdata)
print(obj3)
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4=pd.Series(sdata,index=states)
print(obj4)
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
The three values found in sdata were placed in the appropriate locations, but since no value
for 'California' was found, it appears as NaN (not a number), which pandas uses to
mark missing or NA values. The isnull and notnull methods should be used to
detect missing data:
obj4.isnull()
California True
Ohio False
Oregon False
Texas False
dtype: bool
obj4.notnull()
California False
Ohio True
Oregon True
Texas True
dtype: bool
Pandas DataFrame
Creating a DataFrame
In the real world, a pandas DataFrame is usually created by loading a dataset from
existing storage, such as an SQL database, a CSV file, or an Excel file. A DataFrame
can also be created from lists, a dictionary, a list of dictionaries, and so on. Here are
some of the ways in which we can create a DataFrame:
import pandas as pd
# list of strings
lst=['mec','minor','stud','eee','bio']
df = pd.DataFrame(lst)
print(df)
0
0 mec
1 minor
2 stud
3 eee
4 bio
Creating DataFrame from dict of ndarray/lists: To create a DataFrame from a dict of
ndarrays/lists, all the arrays must be of the same length. If an index is passed, then its
length should be equal to the length of the arrays. If no index is passed, then by default
the index will be range(n), where n is the array length.
import pandas as pd
# dictionary of lists
data={'Name':['Tom','nick','krish','jack'],'Age':[20,21,19,18]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
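The case where an index is passed can be sketched as follows. The row labels r1 to r4 are hypothetical; the only requirement is that there is one label per row:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# One label per row; the labels themselves are hypothetical
df = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])
print(df)
print(df.loc['r3', 'Age'])
```

Passing an index list of a different length than the data would raise a ValueError.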
import pandas as pd
df = pd.DataFrame(data)
print(df)
print(df[['Name', 'Qualification']])
Row Selection: The DataFrame.loc[] indexer is used to retrieve rows from a pandas
DataFrame by index label. Rows can also be selected by integer position using the
iloc[] indexer.
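The difference between the two can be sketched with a small hypothetical DataFrame indexed by roll number, where the label 103 happens to sit at position 1:

```python
import pandas as pd

# Hypothetical data, indexed by roll number
df = pd.DataFrame({'name': ['binu', 'ashik', 'faisal'],
                   'mark': [45, 35, 48]},
                  index=[101, 103, 102])

by_label = df.loc[103]       # row whose index label is 103
by_position = df.iloc[1]     # second row, counting from 0

print(by_label)
print(by_position)
print(df.loc[[101, 102], 'mark'])   # several labels, one column
```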
Create a data file using excel and save it in CSV(Comma Separated Values) format as
shown below
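# Import pandas package
import pandas as pd
# read the CSV file, using the rollno column as the index
data=pd.read_csv('stud.csv',index_col='rollno')
print(data)
print(data.loc[101])
print(data.iloc[1])
print(data[["name","mark"]])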
rollno name place mark
101 binu 45
103 ashik 35
102 faisal 48
105 biju 25
106 ann 30
Missing data can occur when no information is provided for one or more items or for a
whole unit. Missing data is a very big problem in real-life scenarios. Missing values are
also referred to as NA (Not Available) values in pandas.
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95]}
df = pd.DataFrame(dict)
print(df.isnull())
print(df.notnull())
To fill null values in a dataset, we use the fillna(), replace() and interpolate()
functions, which replace NaN values with some other value. The interpolate() function
also fills NA values in the DataFrame, but it uses an interpolation technique to estimate
the missing values rather than hard-coding a value.
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95]}
df = pd.DataFrame(dict)
print(df)
print(df.fillna(0))
print(df.interpolate())
print(df.replace(np.nan,-1))
print(df.dropna())
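To make the effect of each function easier to see, here is a sketch on a small Series with hypothetical values, where the missing entries sit between known neighbours:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])   # hypothetical values

filled = s.fillna(0)               # constant replacement
interp = s.interpolate()           # linear interpolation between neighbours
replaced = s.replace(np.nan, -1)   # replace NaN with a chosen value
dropped = s.dropna()               # removes the missing entries instead

print(filled.tolist())
print(interp.tolist())
print(replaced.tolist())
print(dropped.tolist())
```

Here interpolate() fills the gaps with 20.0 and 40.0, the midpoints of the neighbouring values, while fillna(0) and replace(np.nan, -1) insert the fixed values 0 and -1.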
Iteration is a general term for taking each item of something, one after another. A pandas
DataFrame consists of rows and columns, and iterating over it directly behaves like
iterating over a dictionary: it yields the column names.
To iterate over rows, we can use iterrows() or itertuples(). The items() function
(called iteritems() in older pandas versions) iterates over columns instead.
# importing pandas as pd
import pandas as pd
# dictionary of lists
df = pd.DataFrame(dict)
print(df)
for i,j in df.iterrows(): # this will get each index and each row values
    print(i,j)
You can convert a column to list and later process the list easily
sc=df['score'].to_list() #sc is a list of score
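The iteration functions can be compared side by side. This is a sketch on a small hypothetical DataFrame, not the data used elsewhere in these notes:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'name': ['aparna', 'pankaj'],
                   'score': [90, 40]})

for idx, row in df.iterrows():          # (index, Series) pairs, one per row
    print(idx, row['name'], row['score'])

for t in df.itertuples():               # namedtuples with an Index field
    print(t.Index, t.name, t.score)

for col, values in df.items():          # (column name, Series) pairs
    print(col, values.tolist())

total = sum(t.score for t in df.itertuples())
print(total)
```

itertuples() is usually the faster of the two row-wise options, because it does not build a Series for every row.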
# importing pandas as pd
import pandas as pd
# dictionary of lists
df = pd.DataFrame(dict)
print(df)
print(df)
lst=[47,45,26,34,45]
print(df)
lst=[2002,2003,2004,2005,2017]
print(df)
print(df)
print(df)
The following changes the score in the row with index label 3 (the fourth row under the
default index). The at accessor takes an index label and a column name.
df.at[3,'score']=100
print(df)
This will add value 2 to all values in age column
df['age'] +=2
print(df)
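The column and cell updates described above can be tried on a self-contained example. The data below is hypothetical; it only needs a score column and room for an age column:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'name': ['aparna', 'pankaj', 'sudhir'],
                   'score': [90, 40, 80]})

df['age'] = [21, 23, 22]        # add a column from a list
df.at[1, 'score'] = 100         # set one cell: index label 1, column 'score'
df['age'] += 2                  # add 2 to every value in the column

print(df)
```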
The following are some of the operations you can perform on this data file:
# importing pandas as pd
import pandas as pd
df=pd.read_csv('stud.csv',index_col='rollno')
print(df)
data frame stud
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 ann thrisur 25
107 padma kylm 25
print("columns")
print(df.columns)
columns
Index(['name', 'place', 'mark'], dtype='object')
print(df.describe())
print("size")
print(df.size)
size
18
print("data types")
print(df.dtypes)
data types
name object
place object
mark int64
dtype: object
print("shapes")
print(df.shape)
shapes
(6, 3)
print("index and length of index")
print(df.index,len(df.index))
index and length of index
Int64Index([101, 103, 102, 105, 106, 107], dtype='int64', name='rollno') 6
print("statistical functions")
print("sum=",df['mark'].sum())
print("mean=",df['mark'].mean())
print("max=",df['mark'].max())
print("min=",df['mark'].min())
print("var=",df['mark'].var())
print("standard deviation=",df['mark'].std())
print(df.std(numeric_only=True))
statistical functions
sum= 203
mean= 33.833333333333336
max= 48
min= 25
var= 112.16666666666667
standard deviation= 10.59087657687817
mark 10.590877
dtype: float64
print("top 2 rows")
print(df.head(2))
top 2 rows
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
print("last 2 rows")
print(df.tail(2))
last 2 rows
name place mark
rollno
106 ann thrisur 25
107 padma kylm 25
print(df[0:3])
data from rows 0,1,2
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
print(df['mark'])
mark column values
rollno
101 45
103 35
102 48
105 25
106 25
107 25
Name: mark, dtype: int64
print(df[df['mark']>40])
rows where mark >40
name place mark
rollno
101 binu ernkulam 45
102 faisal kollam 48
print(df.iloc[0:3,[0,2]])
rows 0,1,2 columns 0,2
name mark
rollno
101 binu 45
103 ashik 35
102 faisal 48
print(df.sort_values(by='mark',ascending=False))
sorting in the descending order of marks
name place mark
rollno
102 faisal kollam 48
101 binu ernkulam 45
103 ashik alleppey 35
105 biju kotayam 25
106 ann thrisur 25
107 padma kylm 25
print(df['mark'].agg(['min','max','mean']))
use agg function to compute all the values
min 25.000000
max 48.000000
mean 33.833333
Name: mark, dtype: float64
print("median of marks")
print("Median",df['mark'].median())
median of marks
Median 30.0
print("mode of marks")
print("Mode",df['mark'].mode())
mode of marks
Mode 0 25
dtype: int64
print("count of marks")
print(df['mark'].value_counts())
count of marks
25 3
45 1
35 1
48 1
Name: mark, dtype: int64
print(df.groupby('mark')['mark'].mean())
grouping data based on column value
mark
25    25.0
35    35.0
45    45.0
48    48.0
Name: mark, dtype: float64
import matplotlib.pyplot as plt
plt.figure(1)
plt.hist(df['mark'])
plt.figure(2)
plt.scatter(df['name'],df['mark'])
plt.figure(3)
plt.pie(df['mark'])
plt.show()
Column names can also be specified via the keyword argument columns, as well as a
different delimiter via the sep argument. Again, the default delimiter is a comma, ','.
Here is a simple example showing how to export a DataFrame to a CSV file via to_csv():
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)
df.to_csv('studdata.csv')
#open the studdata.csv and see the data written
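The columns and sep arguments mentioned above can be sketched as follows. To keep the example self-contained it writes to an in-memory buffer instead of a file on disk, and the choice of ';' as the delimiter is arbitrary:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['aparna', 'pankaj'],
                   'degree': ['MBA', 'BCA'],
                   'score': [90, 40]})

buf = io.StringIO()    # in-memory stand-in for a file on disk
df.to_csv(buf, sep=';', columns=['name', 'score'], index=False)
text = buf.getvalue()
print(text)

# Reading the data back requires the same delimiter
back = pd.read_csv(io.StringIO(text), sep=';')
print(back)
```

Passing index=False leaves the row index out of the file, which avoids an extra unnamed column when the file is read back.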
1. Add two matrices and find the transpose of the result (university question)
import numpy as np

def readmatrix(x,r,c):
    # read r*c integers from the keyboard into the array x
    for i in range(r):
        for j in range(c):
            x[i][j]=int(input('enter elements row by row'))

r1=int(input('rows of a'))
c1=int(input('columns of a'))
r2=int(input('rows of b'))
c2=int(input('columns of b'))
if r1!=r2 or c1!=c2:
    print("cant add matrices")
else:
    A=np.zeros((r1,c1))
    print("Enter the elements of A")
    readmatrix(A,r1,c1)
    B=np.zeros((r2,c2))
    print("Enter the elements of B")
    readmatrix(B,r2,c2)
    print("Matrix A")
    print(A)
    print("Matrix B")
    print(B)
    C=A+B
    print("sum")
    print(C)
    print("transpose of sum")
    print(C.T)
import pandas as pd
#initialize a dataframe
df = pd.DataFrame(
df = df.set_index('rollno')
print('\nDataFrame with column as index\n',df)
import pandas as pd
# create dataframe
writer = pd.ExcelWriter('output.xlsx')
df_marks.to_excel(writer)
writer.close() # save() was removed in pandas 2.0; close() writes the file
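import xlrd
# note: xlrd 2.0 and later reads only .xls files; for .xlsx, pd.read_excel(loc) is an alternative
loc = ("stud.xlsx")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
# skip the header row and print each remaining row
for i in range(1,sheet.nrows):
    print(sheet.row_values(i))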
5. Write a Python program to write the data given below to a CSV file. (university
question)
# importing pandas as pd
import pandas as pd
# dictionary of lists
columns=['SN','Name','Country','Contribution','Year'])
print(df)
df.to_csv('inventors.csv')
import pandas as pd
# dictionary of lists
df = pd.DataFrame(dict)
print(df)
7.Given a file “auto.csv” of automobile data with the fields index, company,
Reading the data file and showing the first five records
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df.head(5)
   index      company   body-style  wheel-base  length engine-type num-of-cylinders  horsepower  average-mileage    price
0      0  alfa-romero  convertible        88.6   168.8        dohc             four         111               21  13495.0
1      1  alfa-romero  convertible        88.6   168.8        dohc             four         111               21  16500.0
2      2  alfa-romero    hatchback        94.5   171.2        ohcv              six         154               19  16500.0
3      3         audi        sedan        99.8   176.6         ohc             four         102               24  13950.0
4      4         audi        sedan        99.4   176.6         ohc             five         115               18  17450.0
df.tail(7)
import pandas as pd
df = pd.read_csv("Automobile_data.csv",
na_values={
'price':["?","n.a"],
'stroke':["?","n.a"],
'horsepower':["?","n.a"],
'peak-rpm':["?","n.a"],
'average-mileage':["?","n.a"]})
print (df)
df.to_csv("Automobile_data.csv")
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df = df [['company','price']][df.price==df['price'].max()]
df
output
company price
35 mercedes-benz 45400.0
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
print(df[df['company']=='toyota'])
OR
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
car_Manufacturers = df.groupby('company')
toyotaDf = car_Manufacturers.get_group('toyota')
toyotaDf
df = pd.read_csv("Automobile_data.csv")
df.groupby('company')['company'].count()
OR
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df['company'].value_counts()
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df.groupby('company')[['company','price']].max()
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df.groupby('company')['average-mileage'].mean()
import pandas as pd
df = pd.read_csv("Automobile_data.csv")
df.sort_values(by=['price11
8) Create a stud.csv file containing rollno,name,place and mark of students. Use this file and do
the following
import pandas as pd
df = pd.read_csv("stud.csv")
print(df)
rollno name place mark
0 101 binu ernkulam 45
1 103 ashik alleppey 35
df=df.set_index('rollno')
print(df)
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 anu thrisur 25
107 padma kylm 25
df=df[['name','mark']]
print(df)
print(df['mark'].var())
print(df['mark'].std())
112.16666666666667
10.59087657687817
i) Display the histogram of marks
import matplotlib.pyplot as plt
plt.hist(df['mark'])
plt.show()
df.drop(['place'],axis=1,inplace=True)
print(df)
0 101 binu 45
1 103 ashik 35
2 102 faisal 48
3 105 biju 25
4 106 ann 25
5 107 padma 25
df = df [['name']][df.mark==df['mark'].max()]
print(df[df['place']=='ernakulam'])