Numpy Basics

Uploaded by

Kamal Krishnan

NumPy Basics: Creating arrays, arithmetic, indexing and slicing, functions
NumPy, short for Numerical Python, is the fundamental package for high-performance
scientific computing and data analysis in Python. It adds support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions that operate on these arrays. The ancestor of NumPy, Numeric, was
originally created by Jim Hugunin with contributions from several other developers. In
2005, Travis Oliphant created NumPy by incorporating features of the competing
Numarray into Numeric, with extensive modifications. NumPy is open-source software
and has many contributors. Here are some of the things it provides:
• ndarray, a fast and space-efficient multidimensional array providing
vectorized arithmetic operations and sophisticated broadcasting capabilities
• Standard mathematical functions for fast operations on entire arrays of data without
having to write loops
• Tools for reading / writing array data to disk and working with memory-mapped files
• Linear algebra, random number generation, and Fourier transform capabilities
• Tools for integrating code written in C, C++, and Fortran
The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is
a fast, flexible container for large data sets in Python. Arrays enable you to
perform mathematical operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements:
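As a minimal sketch of this idea (the array name data and its values are illustrative assumptions, not taken from the notes), arithmetic written as if for scalars is applied to every element at once:

```python
import numpy as np

# Illustrative data; the name `data` is an assumption, not from the notes.
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

scaled = data * 10      # every element is multiplied by 10
doubled = data + data   # element-wise addition of two arrays of the same shape

print(scaled)
print(doubled)
```

No explicit loop is written; NumPy applies the operation to the whole block of data.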

A NumPy array is a grid of values, all of the same type, indexed by a tuple of
non-negative integers. In NumPy, dimensions are called axes. The number of dimensions is the
rank of the array; the shape of an array is a tuple of integers giving the size of the array
along each dimension.
We can initialize NumPy arrays from nested Python lists and access elements using
square brackets. Array indexing starts from 0. Matrix operations can be done with NumPy arrays.

Creating numpy arrays


import numpy as np
A = np.array([[1, 2, 3], [3, 4, 5]]) #Array of integers
print(A)
[[1 2 3]
[3 4 5]]

A = np.array([[1.1, 2, 3], [3, 4, 5]]) # Array of floats


print(A)
[[1.1 2. 3. ]
[3. 4. 5. ]]

A = np.array([[1, 2, 3], [3, 4, 5]], dtype = complex) # Array of complex numbers


print(A)
[[1.+0.j 2.+0.j 3.+0.j]
 [3.+0.j 4.+0.j 5.+0.j]]
The data type or dtype is a special object containing the information the ndarray needs
to interpret a chunk of memory as a particular type of data:
print(type(A)) # prints the class type
<class 'numpy.ndarray'>

print(A.dtype) # print the data type


complex128

print(A.ndim) # print the number of dimensions


2

print(A.size) # print the size...size zero means array empty


6

print(A.shape) # shape of the array as tuple


(2,3)

You can explicitly convert or cast an array from one dtype to another using ndarray’s
astype method:
A=np.arange(10)
print(A)
[0 1 2 3 4 5 6 7 8 9]
print(A.astype(np.float64))
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
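Casting can also go the other way or start from strings. The following is a small sketch under assumed example values (the names f, i, s and n are illustrative only): converting floats to integers discards the fractional part, and an array of numeric strings can be parsed into floats.

```python
import numpy as np

f = np.array([1.7, -2.9, 3.2])
i = f.astype(np.int32)        # fractional part is discarded (truncation toward zero)

s = np.array(['1.25', '3.5'])
n = s.astype(float)           # numeric strings are parsed into floats

print(i)
print(n)
```

Note that astype always returns a new array; the original array keeps its dtype.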
Arrays of zeros and ones
import numpy as np
za = np.zeros( (2, 3) )
print(za)
Output:
[[0. 0. 0.]
[0. 0. 0.]]
oa = np.ones((2, 3), dtype=np.int32)
print(oa)
Output:
[[1 1 1]
 [1 1 1]]
Identity matrix
import numpy as np
print(np.eye(4,4))
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
Constant array
import numpy as np
print(np.full((4,4),5))
[[5 5 5 5]
[5 5 5 5]
[5 5 5 5]
[5 5 5 5]]

An uninitialized array
import numpy as np
print(np.empty((4,2)))  # contents are arbitrary leftover memory values
[[2.37663529e-312 2.14321575e-312]
 [2.37663529e-312 2.56761491e-312]
 [1.18831764e-312 1.10343781e-312]
 [2.02566915e-322 0.00000000e+000]]

Nine numbers from 0-2


import numpy as np
print(np.linspace(0,2,9))
[0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]
Using arange() and reshape()
import numpy as np

A = np.arange(4)
print('A =', A)
A = [0 1 2 3]

B = np.arange(12).reshape(2, 6)
print('B =', B)
B = [[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]

print(np.arange(2, 10, dtype=float))
[2. 3. 4. 5. 6. 7. 8. 9.]

print(np.arange(2, 3, 0.1))
[2.  2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]

A=np.random.random((2,3))
print(A)
[[0.5516055 0.27255692 0.74313995]
[0.39238878 0.63832042 0.11740813]]

Adding and removing elements


import numpy as np
A=np.array([10,20,30])
print(A)
[10 20 30]
A=np.append(A,[40,50])
print(A)
[10 20 30 40 50]
A=np.insert(A,0,100)
print(A)
[100 10 20 30 40 50]

A=np.array([[1,2],[3,4]])
print(A)
[[1 2]
[3 4]]
adding a row
A=np.append(A,[[5,6]],axis=0)
print(A)
[[1 2]
[3 4]
[5 6]]
adding a column
A=np.array([[1,2],[3,4]])
A=np.append(A,[[5],[6]],axis=1)
print(A)
[[1 2 5]
[3 4 6]]

A=np.array([10,20,30,40,50,60,70,80])
print(A)
[10 20 30 40 50 60 70 80]
A=np.delete(A,1)
print(A)
[10 30 40 50 60 70 80]

deleting a row
A=np.array([[10,20,30],[40,50,60],[70,80,90]])
print(A)
[[10 20 30]
[40 50 60]
[70 80 90]]
A=np.delete(A,1,axis=0)
print(A)
[[10 20 30]
[70 80 90]]

Indexing and slicing


One-dimensional arrays can be indexed, sliced and iterated over, much like lists and
other Python sequences.
import numpy as np
A=np.arange(10)
print(A)
[0 1 2 3 4 5 6 7 8 9]
print(A[0])
0
print(A[-1])
9
print(A[0:3])
[0 1 2]
A[0:3]=100
A[3]=200
print(A)
[100 100 100 200 4 5 6 7 8 9]

As you can see, if you assign a scalar value to a slice, as in A[0:3] = 100, the value is
propagated (broadcast) to the entire selection. An important first
distinction from lists is that array slices are views on the original array. This means
that the data is not copied, and any modification to the view is reflected in the source
array:

s=A[5:9]
print(s)
[5 6 7 8]

s[:]=200
print(A)
[100 100 100 200 4 200 200 200 200 9]
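When an independent copy is wanted instead of a view, the slice can be copied explicitly. The sketch below (variable names view and independent are illustrative) contrasts the two and uses np.shares_memory to check which one is tied to the source array:

```python
import numpy as np

A = np.arange(10)

view = A[5:9]                 # a view: shares memory with A
independent = A[5:9].copy()   # an explicit copy with its own memory

independent[:] = -1           # does not touch A
view[:] = 200                 # writes through to A

print(A)
print(np.shares_memory(A, view), np.shares_memory(A, independent))
```

Only the assignment through view changes A; the copy can be modified freely.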

B=np.arange(10)
print(B[0:8:2])
[0 2 4 6]
print(B[8:0:-2])
[8 6 4 2]
print(B[:4])
[0 1 2 3]
print(B[5:])
[5 6 7 8 9]
print(B[::-1])
[9 8 7 6 5 4 3 2 1 0]

Indexing on 2-D array

A=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(A[2])
[7 8 9]
print(A[2,2])
9
print(A[2][2])
9
print(A[1:,1:])
[[5 6]
[8 9]]
print(A[:2,1:])
[[2 3]
[5 6]]
print(A[:,:1])
[[1]
[4]
[7]]

A[:,:1]=10
print(A)
[[10 2 3]
[10 5 6]
[10 8 9]]

Boolean array Indexing


Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this
type of indexing is used to select the elements of an array that satisfy some condition.
import numpy as np
A=np.random.randn(3,2)
print(A)
[[-0.71292301 0.52865595]
[-0.54578822 -0.48479499]
[-0.01538739 0.00882706]]
print(A[A<0])
[-0.71292301 -0.54578822 -0.48479499 -0.01538739]

Suppose student names are stored in the array names and their marks in another array marks,
one row per name. Names may repeat, and a boolean condition on names selects all matching rows of marks.

import numpy as np
names=np.array(['biju','bini','binu','bini'])
marks=np.array([[30,40,40],[45,46,47],[38,40,45],[47,48,30]])
index=names=='bini'
print(index)
[False True False True]
print(marks[index])
[[45 46 47]
[47 48 30]]
Selecting data from an array by boolean indexing always creates a copy of the data, even
if the returned array is unchanged.
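The following sketch reuses the names and marks arrays from the example above to illustrate two points: conditions are combined with & (and), | (or) and ~ (not), and assigning through a mask changes the array in place even though selecting with a mask returns a copy. The threshold 35 is an arbitrary illustrative value.

```python
import numpy as np

names = np.array(['biju', 'bini', 'binu', 'bini'])
marks = np.array([[30, 40, 40], [45, 46, 47], [38, 40, 45], [47, 48, 30]])

# Combine conditions with | ; Python's `or` does not work element-wise.
mask = (names == 'bini') | (names == 'biju')
selected = marks[mask]      # a copy of the matching rows

selected[:] = 0             # modifying the copy leaves marks unchanged
marks[marks < 35] = 35      # assigning through a mask modifies marks in place

print(mask)
print(marks)
```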

Fancy Indexing
We can select particular rows or columns, in a chosen order, by passing a list or array of integer indices:

A=np.empty([4,2])

print(A)
[[2.37663529e-312 2.14321575e-312]
[2.37663529e-312 2.56761491e-312]
[1.18831764e-312 1.10343781e-312]
[2.02566915e-322 0.00000000e+000]]

index=[2,3,1]

print(A[index])
[[1.18831764e-312 1.10343781e-312]
[2.02566915e-322 0.00000000e+000]
[2.37663529e-312 2.56761491e-312]]

index=[-1,-3]
print(A[index])
[[2.02566915e-322 0.00000000e+000]
[2.37663529e-312 2.56761491e-312]]

Passing multiple index arrays does something slightly different; it selects a 1D array of
elements corresponding to each tuple of indices:
print(A[[1,2],[1,1]]) # prints elements at (1,1) and (2,1)
[2.56761491e-312 1.10343781e-312]

Another way is to use the np.ix_ function, which converts two 1D integer arrays to an
indexer that selects the rectangular region formed by rows 1, 2 and columns 0, 1: (1,0) (1,1) (2,0) (2,1).

print(A[np.ix_([1,2],[0,1])])
[[2.37663529e-312 2.56761491e-312]
[1.18831764e-312 1.10343781e-312]]
Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.

Universal Functions
A universal function, or ufunc, is a function that performs element wise operations
on data in ndarrays. You can think of them as fast vectorized wrappers for simple
functions that take one or more scalar values and produce one or more scalar results.

A unary function takes a single array and applies the operation to every element.
A=np.arange(10)
print(np.sqrt(A))

[0. 1. 1.41421356 1.73205081 2. 2.23606798 2.44948974 2.64575131 2.82842712 3. ]

print(np.exp(A))
[1.00000000e+00 2.71828183e+00 7.38905610e+00 2.00855369e+01 5.45981500e+01
1.48413159e+02 4.03428793e+02 1.09663316e+03 2.98095799e+03 8.10308393e+03]

print(np.square(A))
[ 0 1 4 9 16 25 36 49 64 81]

The binary functions take two arrays as arguments and return a single array.

x=np.array([3,4,5,6])
y=np.array([1,4,7,2])
print(np.minimum(x,y))
[1 4 5 2]
print(np.maximum(x,y))
[3 4 7 6]
print(np.mod(x,y))
[0 0 5 0]

Data Processing
A typical use of np.where in data analysis is to produce a new array of values based on
another array. Suppose you had a matrix of randomly generated data and you wanted to
replace all positive values with 2 and all other values with -2. This is very easy to do
with np.where:
A=np.random.randn(4,4)
B=np.where(A>0,2,-2)
print(A)
print(B)

[[-1.01775201 -0.73483517 0.10462159 -0.23697366]
[ 0.23281261 0.13014115 1.82079278 -0.72670015]
[-2.67186248 -0.8649474 -0.25756318 0.49680316]
[-0.65459274 0.17070326 3.29936106 0.14854436]]

[[-2 -2 2 -2]
[ 2 2 2 -2]
[-2 -2 -2 2]
[-2 2 2 2]]
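The second and third arguments of np.where may also be arrays. As a small sketch with fixed illustrative values (so the result is repeatable), the following replaces only the positive entries with 2 and keeps every other entry as it is:

```python
import numpy as np

A = np.array([[-1.5, 0.3], [2.0, -0.7]])

# Where the condition holds take 2, elsewhere take the value from A itself.
B = np.where(A > 0, 2, A)
print(B)
```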

statistical methods
A=np.array([[1,2,3],[4,5,6]])
print(A)
[[1 2 3]
[4 5 6]]
print(np.min(A))
1
print(np.min(A,axis=0))
[1 2 3]
print(np.min(A,axis=1))
[1 4]
print(np.max(A))
6
print(np.max(A,axis=0))
[4 5 6]
print(np.max(A,axis=1))
[3 6]
print(np.sum(A))
21
print(np.sum(A,axis=0))
[5 7 9]
print(np.sum(A,axis=1))
[ 6 15]
print(np.mean(A))
3.5
print(np.mean(A,axis=0))
[2.5 3.5 4.5]
print(np.mean(A,axis=1))
[2. 5.]
print(np.var(A))
2.9166666666666665
print(np.var(A,axis=0))
[2.25 2.25 2.25]
print(np.var(A,axis=1))
[0.66666667 0.66666667]
print(np.std(A))
1.707825127659933
print(np.std(A,axis=0))
[1.5 1.5 1.5]
print(np.std(A,axis=1))
[0.81649658 0.81649658]
print(np.cumsum(A))
[ 1 3 6 10 15 21]
print(np.cumsum(A,axis=0))
[[1 2 3]
[5 7 9]]
print(np.cumsum(A,axis=1))
[[ 1 3 6]
[ 4 9 15]]
print(np.cumprod(A))
[ 1 2 6 24 120 720]
print(np.cumprod(A,axis=0))
[[ 1 2 3]
[ 4 10 18]]
print(np.cumprod(A,axis=1))
[[ 1 2 6]
[ 4 20 120]]

Counting the number of positive elements


A=np.array([1,2,3,-1,-4])
print((A>0).sum())
3

Sorting and searching


A=np.random.randn(10)
print(A)
[-0.00557752 0.81721283 0.96476642 0.9729171 -1.12079968 -0.3228774 -1.56839221
-0.96986826 -0.63970741 0.1422579 ]
A.sort()
print(A)
[-1.56839221 -1.12079968 -0.96986826 -0.63970741 -0.3228774 -0.00557752
0.1422579 0.81721283 0.96476642 0.9729171 ]
The where function can be used for searching; it returns the indices at which the condition is true.
A=np.array([[10,20,30],[40,50,60],[70,80,90]])
print(A)
pos=np.where(A==30)
print(pos[0], pos[1])
[0] [2]

A=np.random.randn(2,3)
print(A)
[[-0.4203773 -0.30855868 -1.08973324]
[ 1.40028156 0.5397636 -0.02366591]]

unique elements in sorted order


A=np.array([3,3,2,1,1,5,4,4,6,7,3])
print(np.unique(A))
[1 2 3 4 5 6 7]

common elements in sorted order


A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.intersect1d(A,B))
[3 4 5]

union of elements
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.union1d(A,B))
[1 2 3 4 5 6 7 8]

difference
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.setdiff1d(A,B))
[1 2 6 7]

symmetric difference
A=np.array([3,3,2,1,1,5,4,4,6,7,3])
B=np.array([3,4,5,5,8])
print(np.setxor1d(A,B))
[1 2 6 7 8]

in1d(A, B): compute a boolean array indicating whether each element of A is contained in B
(newer NumPy releases recommend np.isin for the same purpose)
A=np.array([1,2,3,4])
B=np.array([3,4,5,6,7,8])
print(np.in1d(A,B))
[False False True True]

Saving and loading data from disk


np.save and np.load are the two workhorse functions for efficiently saving and
loading array data on disk. Arrays are saved by default in an uncompressed raw binary
format with file extension .npy; np.save appends the extension if the file name lacks it.
A=np.arange(10)
np.save('arr',A)

A=np.load('arr.npy')
print(A)
[0 1 2 3 4 5 6 7 8 9]

Reading data from a file into a NumPy array


Suppose arr.txt contains comma-separated values (CSV) like
this:
arr.txt
12,13,14,15
20,30,40,50
70,70,80,90
import numpy as np
x=np.loadtxt('arr.txt', delimiter=',')
print(x)
[[12. 13. 14. 15.]
[20. 30. 40. 50.]
[70. 70. 80. 90.]]
np.savetxt performs the inverse operation: writing an array to a delimited text file.
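A minimal round-trip sketch, assuming a temporary file named out.txt (a hypothetical name chosen so the example does not depend on an existing arr.txt):

```python
import os
import tempfile
import numpy as np

x = np.array([[12, 13, 14, 15], [20, 30, 40, 50]])

# Write to a temporary directory; 'out.txt' is an illustrative file name.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
np.savetxt(path, x, delimiter=',', fmt='%d')   # fmt='%d' writes plain integers

y = np.loadtxt(path, delimiter=',')            # loadtxt returns floats by default
print(y)
```

The values survive the round trip, but the dtype does not: the integers come back as floats unless a dtype is passed to np.loadtxt.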

Matrix operations can be implemented using arrays


Addition
import numpy as np
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
C = A + B # element wise addition
print(C)
[[11 1]
[ 8 0]]
Subtraction
import numpy as np
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
C=A-B
print(C)
[[ -7 7]
[ 2 -12]]

Multiplication
import numpy as np
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
C = A.dot(B)  # matrix product
print(C)
[[ 30 18]
[ 27 -51]]
Transpose
import numpy as np
A = np.array([[2, 4], [5, -6]])
print(A.T)
[[ 2  5]
 [ 4 -6]]
print(A.transpose())
[[ 2  5]
 [ 4 -6]]
Multiplying arrays with the * operator performs element-wise multiplication
import numpy as np
A = np.array([[2, 4], [5, -6]])
print(A*A)
[[ 4 16]
[25 36]]

import numpy as np
A = np.array([[1, 2], [3, 4]])
print(1/A)
[[1.         0.5       ]
 [0.33333333 0.25      ]]

import numpy as np
A = np.array([[1, 2], [3, 4]])
print(A*2)
[[2 4]
[6 8]]
Random Numbers
Random means something that cannot be predicted logically. Computers run
programs, and a program is a definite set of instructions, so there must be
some algorithm to generate a random number as well.

If a program generates the random number, the number can be predicted, so it is not truly
random.
Random numbers generated through a generation algorithm are called pseudo-random.

Can we make truly random numbers?


Yes. In order to generate a truly random number on our computers we need to get the
random data from some outside source. This outside source is generally our keystrokes,
mouse movements, data on network etc.
Pseudo random number generation can be done with numpy random module
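Because the numbers come from an algorithm, fixing its starting state (the seed) reproduces the same sequence. The sketch below assumes the legacy np.random.seed function and an arbitrary seed value of 42; it is an illustration of the idea rather than a recommendation of a particular seeding scheme.

```python
import numpy as np

np.random.seed(42)                       # fix the starting state (42 is arbitrary)
first = np.random.randint(100, size=5)

np.random.seed(42)                       # reset to the same state
second = np.random.randint(100, size=5)

print(np.array_equal(first, second))     # True: the two sequences are identical
```

Seeding is useful when results need to be repeatable, for example in tests or published experiments.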

The random module's randint() method returns a random integer from 0 up to, but not including, n.


This will generate a random integer from 0 to 99. Try running it multiple times and observe the
output.
import numpy as np
x = np.random.randint(100)
print(x)
64
The randint() method takes a size parameter where you can specify the shape of an
array. The following commands will generate 5 random integers from 0 to 99.
import numpy as np
x = np.random.randint(100,size=5)
print(x)
[25 62 24 81 39]
The following generates a 2-D array with 3 rows, each row containing 5 random
integers from 0 to 99:
import numpy as np
x = np.random.randint(100,size=(3,5))
print(x)
[[ 2 96 40 43 85]
[81 81 4 48 29]
[80 31 6 10 24]]

The random module's rand() method returns a random float between 0 and 1.
import numpy as np
x = np.random.rand()
print(x)
0.2733166576024767
This will generate 10 random numbers

x = np.random.rand(10)
print(x)
[0.82536563 0.46789636 0.28863107 0.83941914 0.24424812 0.25816291
 0.72567413 0.80770073 0.32845661 0.34451507]
Generate an array with size (3,5)
x = np.random.rand(3,5)
print(x)
[[0.16220086 0.80935717 0.97331357 0.60975199 0.48542906]
 [0.68311884 0.27623475 0.73447814 0.29257476 0.27329666]
 [0.62625815 0.0069779  0.21403868 0.49191027 0.4116709 ]]

The choice() method returns a random value drawn from an array of values.
import numpy as np
x = np.random.choice([3,5,6,7,9,2])
print(x)
3
import numpy as np
x = np.random.choice([3,5,6,7,9,2],size=(3,5))
print(x)
[[3 2 5 2 6]
[5 9 3 6 9]
[5 6 9 3 3]]

Random Data Distribution


A data distribution is a list of all possible values together with how often each value occurs. Such
lists are important when working with statistics and data science.
The random module offers methods that return randomly generated data distributions.
A random distribution is a set of random numbers that follow a certain probability
density function.
Probability density function: a function that describes a continuous probability distribution, i.e.
how likely each possible value is.
We can generate random numbers based on defined probabilities using the choice()
method of the random module.
The choice() method allows us to specify the probability for each value.

The probability is a number between 0 and 1, where 0 means that the value will
never occur and 1 means that the value will always occur. The probabilities of all values must sum to 1.

Example

Generate a 1-D array containing 10 values, where each value has to be 3, 5, 7 or 9.


The probability for the value to be 3 is set to be 0.1
The probability for the value to be 5 is set to be 0.3
The probability for the value to be 7 is set to be 0.6
The probability for the value to be 9 is set to be 0

import numpy as np
x = np.random.choice([3,5,7,9],p=[0.1,0.3,0.6,0.0],size=10)
print(x)
[5 7 7 7 5 7 7 3 7 5]
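One way to see the probabilities at work is to draw many values and compare the observed frequencies with the requested ones. This is a rough empirical check, assuming an arbitrary seed and sample size; the frequencies will be close to, not exactly equal to, the probabilities.

```python
import numpy as np

values = [3, 5, 7, 9]
probs = [0.1, 0.3, 0.6, 0.0]             # must sum to 1

np.random.seed(0)                         # arbitrary seed, only for repeatability
x = np.random.choice(values, p=probs, size=10000)

# Fraction of draws equal to each value; these should be close to probs.
freqs = [float(np.mean(x == v)) for v in values]
print(freqs)
```

The value 9 has probability 0, so it never appears no matter how many values are drawn.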

Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of [1,
2, 3] and vice-versa.
The NumPy Random module provides two methods for
this: shuffle() and permutation().
Shuffling Arrays
Shuffle means changing arrangement of elements in-place. i.e. in the array itself.
import numpy as np
x=np.array([1,2,3,4,5])
np.random.shuffle(x)
print(x)
[4 1 3 5 2]
Generating Permutation of Arrays
The permutation() method returns a re-arranged array (and leaves the original array un-
changed).
import numpy as np
x=np.array([1,2,3,4,5])
y=np.random.permutation(x)
print(y)
[3 1 5 2 4]

Normal (Gaussian) Distribution

The Normal Distribution is one of the most important distributions.
It is also called the Gaussian Distribution after the German mathematician Carl Friedrich
Gauss.
It fits the probability distribution of many events, e.g. IQ scores, heartbeat, etc.
Use the random.normal() method to get a normal data distribution.
It has three parameters:

loc - (Mean) where the peak of the bell exists.
scale - (Standard Deviation) how flat the graph of the distribution should be.
size - The shape of the returned array.

The numpy.random module supplements the built-in Python random with functions
for efficiently generating whole arrays of sample values from many kinds of probability
distributions. For example, you can get a 4 by 4 array of samples from the standard normal
distribution using normal:

import numpy as np
print(np.random.normal(size=(4,4)))
[[ 0.18577774 -1.07506339 1.0338707 1.32696306]
[ 0.41939598 -1.15732977 -0.19081001 0.10567808]
[ 0.7482679 -0.39357911 0.08297663 -0.60563642]
[ 0.23671784 -1.3504756 0.24030689 0.4240251 ]]

x = np.random.normal(loc=1, scale=2, size=(2, 3))


print(x)
[[ 4.6162552 2.90317721 1.75121165]
[-0.03026904 3.54906062 1.25067476]]
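To connect loc and scale with the samples, the sketch below draws a large sample and compares its mean and standard deviation with the requested parameters. The seed and the sample size are arbitrary assumptions; the printed values will be near 1 and 2 but not exact.

```python
import numpy as np

np.random.seed(1)                                   # arbitrary seed for repeatability
samples = np.random.normal(loc=1, scale=2, size=100000)

print(samples.mean())   # close to loc
print(samples.std())    # close to scale
```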

Visualization of Normal Distribution

from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.kdeplot(random.normal(size=1000))  # density curve; distplot is deprecated in recent seaborn
plt.show()

Binomial Distribution

Binomial Distribution is a discrete distribution.
It describes the outcome of binary scenarios, e.g. the toss of a coin, which will be either heads or
tails.
It has three parameters:
n - number of trials.
p - probability of success in each trial (e.g. 0.5 for heads in a coin toss).
size - The shape of the returned array.

Discrete distribution: the distribution is defined over a separate set of events, e.g. a coin
toss's result is discrete as it can be only heads or tails, whereas the height of people is
continuous as it can be 170, 170.1, 170.11 and so on.

Example
Given 10 trials of a coin toss, generate 10 data points (each is the number of successes in 10 tosses):
import numpy as np
x = np.random.binomial(n=10, p=0.5, size=10)
print(x)
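Each generated value counts the successes in n trials, so it lies between 0 and n, and over many draws the average approaches n * p. A rough empirical sketch, assuming an arbitrary seed and sample size:

```python
import numpy as np

np.random.seed(2)                         # arbitrary seed for repeatability
n, p = 10, 0.5
x = np.random.binomial(n=n, p=p, size=100000)

print(x.min(), x.max())   # every value lies between 0 and n
print(x.mean())           # close to n * p
print(x.var())            # close to n * p * (1 - p)
```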
Visualization of Binomial Distribution
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(random.binomial(n=10, p=0.5, size=1000), discrete=True)  # histogram; distplot is deprecated in recent seaborn
plt.show()

Numpy and Linear Algebra


The Linear Algebra module of NumPy offers various methods to apply linear algebra on
any numpy array.
One can find:

• rank, determinant, trace, etc. of an array.


• eigen values of matrices
• matrix and vector products (dot, inner, outer,etc. product), matrix exponentiation
• solve linear or tensor equations and much more!

Multiplication of matrices using the dot function (the @ operator can also be used)


import numpy as np

x=np.array([[1,2,3],[4,5,6]])

y=np.array([2,4,1])
print(x.dot(y))

print(x@y)

print(np.dot(x,y))

[13 34]
[13 34]

[13 34]

import numpy as np

A = np.array([[6, 1, 1],

[4, -2, 5],

[2, 8, 7]])

Rank of a matrix

print("Rank of A:", np.linalg.matrix_rank(A))

Rank of A: 3

Diagonals of a matrix

print(np.diag(A))

[ 6 -2 7]
print(np.diag(A,k=1)) # above the main diagonal
[1 5]
print(np.diag(A,k=-1)) # below the main diagonal
[4 8]

y=np.fliplr(A)

print(np.diag(y)) # secondary diagonal

[ 1 -2 2]

z=np.flipud(A)
print(np.diag(z))

[ 2 -2 1]

Trace of matrix A

print("\nTrace of A:", np.trace(A))

Trace of A: 11

Determinant of a matrix

print("\nDeterminant of A:", np.linalg.det(A))

Determinant of A: -306.0

Inverse of matrix A

print("\nInverse of A:\n", np.linalg.inv(A))

Inverse of A:

[[ 0.17647059 -0.00326797 -0.02287582]

[ 0.05882353 -0.13071895 0.08496732]

[-0.11764706 0.1503268 0.05228758]]

Transpose of matrix A

print("\nTranspose of A:\n", A.T)

Transpose of A:

[[ 6 4 2]

[ 1 -2 8]

[ 1 5 7]]

Power

print("\nMatrix A raised to power 3:\n",np.linalg.matrix_power(A, 3))

Matrix A raised to power 3:

[[336 162 228]

[406 162 469]


[698 702 905]]

Solving system of linear equations


Consider the system
2x1 + 3x2 + 5x3 = 10
3x1 - 2x2 + x3 = 3
x1 + 5x2 + 7x3 = 8
The matrix representation is
Ax = b
where
A = [[2, 3, 5],
     [3, -2, 1],
     [1, 5, 7]]
b = [10, 3, 8]
The following is the python code to solve the problem
import numpy as np
A=np.array([[ 2 , 3, 5],
[ 3, -2 ,1],
[ 1, 5 , 7 ]])
b=np.array([10,3,8])
x=np.linalg.solve(A,b)
print(x)

[ 5.69230769 5.30769231 -3.46153846]
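A solution can be checked by substituting it back into the equations. The sketch below does this with np.allclose, and also compares against the inverse-based route, which is shown only for comparison (the name x_inv is illustrative); np.linalg.solve is generally the preferred call.

```python
import numpy as np

A = np.array([[2, 3, 5],
              [3, -2, 1],
              [1, 5, 7]])
b = np.array([10, 3, 8])

x = np.linalg.solve(A, b)

# Substitute the solution back into the equations.
print(np.allclose(A @ x, b))

# Same result via the explicit inverse, shown only for comparison.
x_inv = np.linalg.inv(A) @ b
print(np.allclose(x, x_inv))
```

np.allclose is used instead of == because floating-point results are rarely exact.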

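Eigenvalues and eigenvectors

Let A be a square matrix. A non-zero vector X is an eigenvector of A with eigenvalue e if

AX=eX

The eigenvalues of a real symmetric matrix are always real, and its eigenvectors can be chosen to be
mutually orthogonal. The matrix in the example below is not symmetric, so these properties need not hold for it.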

import numpy as np

A=np.array([[ 2 , 3, 5],

[ 3, -2 ,1],

[ 1, 5 , 7 ]])
e,v=np.linalg.eig(A)

print(e)

[-2.81422161 0.49572305 9.31849856]

print(v)

[[ 0.09368857 -0.64029415 0.61137707]

[-0.89093813 -0.55826909 0.2289721 ]

[ 0.44435537 0.5275974 0.75748918]]

print(v[:,0]*e[0]) # eX

[-0.2636604 2.50729735 -1.25051449]

print(A.dot(v[:,0])) # AX

[-0.2636604 2.50729735 -1.25051449]

# Note that AX=eX
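The statement about symmetric matrices can be illustrated with a small symmetric example. The matrix S below is an assumption chosen for illustration (it does not appear in the notes), and np.linalg.eigh is used because it is intended for symmetric (Hermitian) matrices.

```python
import numpy as np

# A small symmetric matrix chosen for illustration (not from the notes).
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

e, v = np.linalg.eigh(S)     # eigenvalues in ascending order, eigenvectors as columns

print(e)                                   # real eigenvalues
print(np.allclose(v.T @ v, np.eye(3)))     # columns of v are orthonormal
print(np.allclose(S @ v, v * e))           # S v_i = e_i v_i for every column
```

The product v.T @ v being the identity matrix is exactly the orthogonality property: distinct eigenvectors have zero dot product and each has unit length.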

Plotting and Visualization-Matplotlib


Matplotlib is one of the most popular Python packages used for data visualization. It is a
cross-platform library for making 2D plots from data in arrays. Matplotlib is written in
Python and makes use of NumPy. It was introduced by John Hunter in the year 2002.

One of the greatest benefits of visualization is that it allows us visual access to huge
amounts of data in easily digestible visuals. Matplotlib consists of several plots like line,
bar, scatter, histogram etc.

Anaconda is a free and open source distribution of the Python and R programming
languages for large-scale data processing, predictive analytics, and scientific computing.
The distribution makes package management and deployment simple and easy.
Matplotlib and lots of other useful (data) science tools form part of the distribution. If you
have Anaconda installed on your computer, Matplotlib can be used directly; otherwise install
Matplotlib separately.

Let's plot a simple sine wave using Matplotlib


1. To begin with, the pyplot module from the Matplotlib package is imported

import matplotlib.pyplot as plt

2. Next we need an array of numbers to plot.

import numpy as np

import math
x = np.arange(0, math.pi*2, 0.05)

3. The ndarray object serves as the values on the x axis of the graph. The corresponding sine
values of the angles in x, to be displayed on the y axis, are obtained by the following statement

y = np.sin(x)
4. The values from the two arrays are plotted using the plot() function.

plt.plot(x,y)

5. You can set the plot title, and labels for x and y axes.

plt.xlabel("angle")

plt.ylabel("sine")

plt.title('sine wave')

6. The plot viewer window is invoked by the show() function

plt.show()

The complete program is as follows −

from matplotlib import pyplot as plt

import numpy as np

import math #needed for definition of pi

x = np.arange(0, math.pi*2, 0.05)

y = np.sin(x)

plt.plot(x,y)
plt.xlabel("angle")

plt.ylabel("sine")

plt.title('sine wave')

plt.show()

Matplotlib - PyLab module

PyLab is a procedural interface to the Matplotlib object-oriented plotting library. Matplotlib
is the whole package; matplotlib.pyplot is a module in Matplotlib; and PyLab is a module
that gets installed alongside Matplotlib.

PyLab is a convenience module that bulk imports matplotlib.pyplot (for plotting) and
NumPy (for mathematics and working with arrays) in a single namespace. Although many
examples use PyLab, it is no longer recommended.

basic plot

from numpy import *

from pylab import *

x = linspace(-3, 3, 30)

y = x**2

plot(x, y)

show()
from pylab import *

x = linspace(-3, 3, 30)

y = x**2

plot(x, y, 'r.')

show()

from pylab import *

x = arange(0, 2*pi, 0.05)  # arange and pi come from the pylab star import

plot(x, sin(x))

plot(x, cos(x), 'r-')

plot(x, -sin(x), 'g--')

show()
Color codes

Character   Color
'b'         Blue
'g'         Green
'r'         Red
'c'         Cyan
'm'         Magenta
'y'         Yellow
'k'         Black
'w'         White

Marker codes

Character   Description
'.'         Point marker
'o'         Circle marker
'x'         X marker
'D'         Diamond marker
'H'         Hexagon marker
's'         Square marker
'+'         Plus marker

Line styles

Character   Description
'-'         Solid line
'--'        Dashed line
'-.'        Dash-dot line
':'         Dotted line

Adding Grids and Legend to the Plot

from pylab import *

x = arange(0, 2*pi, 0.05)  # arange and pi come from the pylab star import

plot(x, sin(x),label='sin')

plot(x, cos(x), 'r-',label='cos')

plot(x, -sin(x), 'g--',label='-sin')

grid(True)

title('waves')

xlabel('x')

ylabel('sin cos -sin')

legend(loc='upper right')

show()
The following code creates three separate figures, with one plot in each

from pylab import *

x = arange(0, 2*pi, 0.05)  # arange and pi come from the pylab star import

figure(1)

plot(x, sin(x),label='sin')

xlabel('x')

ylabel('sin')

legend(loc='upper right')

grid(True)

figure(2)

plot(x, cos(x), 'r-',label='cos')

xlabel('x')

ylabel('cos')

legend(loc='upper right')

grid(True)

figure(3)

xlabel('x')

ylabel('-sin')

plot(x, -sin(x), 'g--',label='-sin')


legend(loc='upper right')

grid(True)

show()

Creating a bar plot

from matplotlib import pyplot as plt


x = [5, 2, 9, 4, 7]
y = [10, 5, 8, 4, 2]
# Function to plot the bar
plt.bar(x,y)
# function to show the plot
plt.show()

Creating a histogram

from matplotlib import pyplot as plt

# x-axis values

x = [5, 2, 9, 4, 7,5,5,5,4,9,9,9,9,9,9,9,9,9]

# Function to plot the histogram

plt.hist(x)

# function to show the plot

plt.show()
Scatter Plot

from matplotlib import pyplot as plt

x = [5, 2, 9, 4, 7]

y = [10, 5, 8, 4, 2]

# Function to plot scatter

plt.scatter(x, y)

plt.show()

Stem plot

from matplotlib import pyplot as plt

x = [5, 2, 9, 4, 7]

y = [10, 5, 8, 4, 2]
# Function to draw the stem plot

plt.stem(x, y)

plt.show()

Pie Plot

data=[20,30,10,50]

from pylab import *

pie(data)

show()
Subplots within the same figure

from pylab import *

x = arange(0, 2*pi, 0.05)  # arange and pi come from the pylab star import

subplot(2,2,1)

plot(x, sin(x),label='sin')

xlabel('x')

ylabel('sin')

legend(loc='upper right')

grid(True)

subplot(2,2,2)

plot(x, cos(x), 'r-',label='cos')

xlabel('x')

ylabel('cos')

legend(loc='upper right')

grid(True)

subplot(2,2,3)

xlabel('x')

ylabel('-sin')

plot(x, -sin(x), 'g--',label='-sin')

legend(loc='upper right')

grid(True)

subplot(2,2,4)

xlabel('x')

ylabel('tan')

plot(x, tan(x), 'y-',label='tan')


legend(loc='upper right')

grid(True)

show()

Ticks in Plot

Ticks are the values used to show specific points on the coordinate axis. It can be a
number or a string. Whenever we plot a graph, the axes adjust and take the default ticks.
Matplotlib’s default ticks are generally sufficient in common situations but are in no way
optimal for every plot. Here, we will see how to customize these ticks as per our need.

The following program shows the default ticks and customized ticks

import matplotlib.pyplot as plt

import numpy as np

# values of x and y axes

x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

y = [1, 4, 3, 2, 7, 6, 9, 8, 10, 5]

plt.figure(1)

plt.plot(x, y, 'b')

plt.xlabel('x')

plt.ylabel('y')
plt.figure(2)

plt.plot(x, y, 'r')

plt.xlabel('x')

plt.ylabel('y')

# 0 is the initial value, 51 is the final value

# (last value is not taken) and 5 is the difference

# of values between two consecutive ticks

plt.xticks(np.arange(0, 51, 5))

plt.yticks(np.arange(0, 11, 1))

plt.tick_params(axis='y',colors='red',rotation=45)

plt.show()

PARAMETER   VALUE            USE
axis        x, y, both       Tells which axis to operate on
reset       True, False      If True, set all parameters to default
direction   in, out, inout   Puts the ticks inside or outside or both
length      Float            Sets tick's length
width       Float            Sets tick's width
rotation    Float            Rotates ticks wrt the axis
colors      Color            Changes tick color
pad         Float            Distance in points between tick and label

Pandas: Panel Data and Python Data Analysis


Pandas is an open-source library built on top of the NumPy library. It is a Python
package that offers various data structures and operations for manipulating numerical
data and time series. It is popular mainly because it makes importing and analyzing data much easier.
Pandas is fast and offers high performance and productivity for users.

Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR
Capital Management. He convinced AQR to allow him to open-source Pandas.
Another AQR employee, Chang She, joined as the second major contributor to the library
in 2012.

Advantages
• Fast and efficient for manipulating and analyzing data.
• Data from different file objects can be loaded.
• Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
• Size mutability: columns can be inserted and deleted from DataFrame and higher
dimensional objects.
• Data set merging and joining.
• Flexible reshaping and pivoting of data sets.
• Provides time-series functionality.
• Powerful group by functionality for performing split-apply-combine operations on data
sets.

Pandas generally provides two data structures for manipulating data:
1) Series
2) DataFrame
Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, python objects, etc.). The axis labels are collectively called index.
Pandas Series is nothing but a column in an excel sheet. Labels need not be unique but
must be a hashable type. The object supports both integer and label-based indexing and
provides a host of methods for performing operations involving the index.

A Series is a one-dimensional array-like object containing an array of data (of any NumPy
data type) and an associated array of data labels, called its index. The simplest Series is
formed from only an array of data:

import pandas as pd

obj=pd.Series([3,5,-8,7,9])

print(obj)

0 3
1 5
2 -8
3 7
4 9
dtype: int64
print(obj.index)

RangeIndex(start=0, stop=5, step=1)


print(obj.values)

[ 3 5 -8 7 9]
Often it will be desirable to create a Series with an index identifying each data point:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(obj2)
d 4
b 7
a -5
c 3
dtype: int64
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or
applying math functions, will preserve the index-value link:

print(obj2[obj2>0])

d 4
b 7
c 3
dtype: int64
print(obj2*2)

d 8
b 14
a -10
c 6
dtype: int64
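The examples above show filtering and scalar multiplication. A math function behaves the same way; the sketch below applies np.abs to the same obj2 values (the choice of np.abs is only for illustration) and checks that the labels are kept.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# np.abs is applied element by element; the labels d, b, a, c stay attached.
absolute = np.abs(obj2)
print(absolute)
```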
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of
index values to data values. It can be substituted into many functions that expect a dict:

'b' in obj2

True

'e' in obj2

False

If you have data contained in a Python dict, you can create a Series from it by passing the
dict:

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3=pd.Series(sdata)

print(obj3)

Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4=pd.Series(sdata,index=states)

print(obj4)

California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
The three values found in sdata were placed in the appropriate locations, but since no value for
'California' was found, it appears as NaN (not a number), which pandas uses to
mark missing or NA values. The isnull and notnull functions in pandas should be used to
detect missing data:

obj4.isnull()

California True
Ohio False
Oregon False
Texas False
dtype: bool
obj4.notnull()

California False
Ohio True
Oregon True
Texas True
dtype: bool
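Because isnull() and notnull() return boolean Series, they can be summed or used as filters. The following is a minimal sketch that rebuilds obj4 from the same sdata and states values; the variable names missing_count and present are introduced here only for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# True counts as 1 when summed, so this is the number of missing entries.
missing_count = obj4.isnull().sum()

# Boolean filtering with notnull() keeps only the entries that have a value.
present = obj4[obj4.notnull()]

print(missing_count)
print(present)
```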

A critical Series feature for many applications is that it automatically aligns differently indexed data in arithmetic operations:
print(obj3)
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
print(obj4)
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
print(obj3+obj4)
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
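In the sum above, 'Utah' becomes NaN because it has no partner in obj4. If a missing partner should count as zero instead, the Series.add method accepts a fill_value argument. This is a sketch with the same values; note that 'California' is still NaN because it has no value on either side.

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
obj4 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000},
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value=0 stands in for a label that is missing from one of the two Series.
total = obj3.add(obj4, fill_value=0)
print(total)
```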
obj4.name='Population'
obj4.index.name='states'
print(obj4)
states
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: Population, dtype: float64

Pandas DataFrame

Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns); that is, data is aligned in a tabular
fashion in rows and columns. A Pandas DataFrame consists of three principal components:
the data, the rows, and the columns.
The following basic operations on a Pandas DataFrame are covered briefly:
Creating a DataFrame
Dealing with Rows and Columns
Indexing and Selecting Data
Working with Missing Data
Iterating over rows and columns

Creating a DataFrame

In practice, a Pandas DataFrame is usually created by loading a dataset from existing
storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be
created from lists, from a dictionary, from a list of dictionaries, and so on. Here are some
of the ways to create a DataFrame:

import pandas as pd

# list of strings

lst = ['mec', 'minor', 'stud', 'eee', 'bio']

# Calling DataFrame constructor on list

df = pd.DataFrame(lst)

print(df)

0
0 mec
1 minor
2 stud
3 eee
4 bio
Creating DataFrame from dict of ndarray/lists: To create a DataFrame from a dict of
ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, its length
should be equal to the length of the arrays. If no index is passed, then by default the index will
be range(n), where n is the array length.

import pandas as pd

# initialise data as a dict of lists.

data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}

# Create DataFrame

df = pd.DataFrame(data)

# Print the output.

print(df)

Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
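Two of the variations mentioned above are not shown: creating a DataFrame from a list of dictionaries, and passing an explicit index. The sketch below covers both; the labels 's1' and 's2' and the shortened sample rows are assumptions made for the example.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
rows = [{'Name': 'Tom', 'Age': 20},
        {'Name': 'nick', 'Age': 21},
        {'Name': 'krish'}]            # 'Age' is missing here, so it becomes NaN

df_rows = pd.DataFrame(rows)

# A dict of lists with an explicit index; the index length must match the lists.
data = {'Name': ['Tom', 'nick'], 'Age': [20, 21]}
df_indexed = pd.DataFrame(data, index=['s1', 's2'])

print(df_rows)
print(df_indexed)
```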

Dealing with Rows and Columns


A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion
in rows and columns. We can perform basic operations on rows/columns like selecting,
deleting, adding, and renaming.
Column Selection: To select a column in a Pandas DataFrame, we access it by its
column name.

import pandas as pd

# Define a dictionary containing employee data

data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame

df = pd.DataFrame(data)

# print the whole DataFrame

print(df)

# select two columns

print(df[['Name', 'Qualification']])

Name Age Address Qualification


0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd

Row Selection: Pandas provides the DataFrame.loc[] indexer to retrieve rows from a
DataFrame by index label. Rows can also be selected by integer position with the
iloc[] indexer.

Create a data file using excel and save it in CSV(Comma Separated Values) format as
shown below
# Import pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("stud.csv", index_col ="rollno")

print(data)

print("retrieving row by loc method")

print(data.loc[101])

print("retrieving row by iloc method")

print(data.iloc[1])

print("Selecting name and mark")

print(data[["name","mark"]])
rollno name place mark

101 binu ernkulam 45


103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 ann thrisur 30

retrieving row by loc method


name binu
place ernkulam
mark 45
Name: 101, dtype: object

retrieving row by iloc method


name ashik
place alleppey
mark 35
Name: 103, dtype: object
Selecting name and mark
rollno name mark

101 binu 45
103 ashik 35
102 faisal 48
105 biju 25
106 ann 30
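The difference between loc and iloc can be tried without a file on disk. The sketch below reads a few rows from an in-memory string; the text is an assumed stand-in for stud.csv, built from the output shown above.

```python
import io
import pandas as pd

# In-memory stand-in for stud.csv (contents assumed from the output above).
csv_text = """rollno,name,place,mark
101,binu,ernkulam,45
103,ashik,alleppey,35
102,faisal,kollam,48
"""
data = pd.read_csv(io.StringIO(csv_text), index_col="rollno")

row_by_label = data.loc[102]      # the row whose rollno is 102
row_by_position = data.iloc[2]    # the third row, whatever its label is

print(row_by_label['name'], row_by_position['name'])
```

Here both lookups return the same row only because rollno 102 happens to be the third row in the file.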

Indexing and Selecting Data


Indexing in pandas means simply selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some
of the rows and all of the columns, or some of each of the rows and columns. Indexing
can also be known as Subset Selection.
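The three kinds of subset described above can be sketched as follows. The small frame is built in code from sample student values and is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['binu', 'ashik', 'faisal', 'biju'],
                   'place': ['ernkulam', 'alleppey', 'kollam', 'kotayam'],
                   'mark': [45, 35, 48, 25]},
                  index=[101, 103, 102, 105])

all_rows_some_cols = df[['name', 'mark']]             # every row, two columns
some_rows_all_cols = df.loc[[101, 102]]               # two rows, every column
some_of_each = df.loc[[101, 102], ['name', 'mark']]   # two rows, two columns
same_by_position = df.iloc[[0, 2], [0, 2]]            # the same cells by position

print(some_of_each)
```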

Working with Missing Data

Missing data occurs when no information is provided for one or more items or for a
whole unit. Missing data is a very common problem in real-life scenarios. In pandas, missing
data is also referred to as NA (Not Available) values.

Checking for missing values using isnull() and notnull() :


To check for missing values in a Pandas DataFrame, we use the functions isnull() and
notnull(). Both functions help in checking whether a value is NaN or not. These functions
can also be used on a Pandas Series to find null values in the series.

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],


'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# using isnull() function

print(df.isnull())

print(df.notnull())

First Score Second Score Third Score


0 False False True
1 False False False
2 True False False
3 False True False
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True
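The boolean tables above are often reduced to counts. As a sketch on the same score data (the frame is named scores here to avoid reusing the name dict), the missing values can be counted per column and the affected rows picked out:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# Summing the boolean table gives one missing-value count per column.
missing_per_column = scores.isnull().sum()

# any(axis=1) is True for a row that has at least one NaN.
rows_with_missing = scores[scores.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```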

Filling missing values using fillna(), replace() and interpolate() :

To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some other value. The interpolate() function
also fills NA values in the DataFrame, but it uses an interpolation technique to
compute the missing values rather than a hard-coded value.

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

print(df)

First Score Second Score Third Score


0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
# filling missing value using fillna()

print(df.fillna(0))

First Score Second Score Third Score


0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
#filling the NaN values by interpolation

print(df.interpolate())

First Score Second Score Third Score


0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 92.5 56.0 80.0
3 95.0 56.0 98.0
#replacing the nan values with -1

print(df.replace(np.nan,-1))

First Score Second Score Third Score


0 100.0 30.0 -1.0
1 90.0 45.0 40.0
2 -1.0 56.0 80.0
3 95.0 -1.0 98.0
#dropping the rows containing null values

print(df.dropna())

First Score Second Score Third Score


1 90.0 45.0 40.0
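Two common variations on the calls above are filling each column with its own mean, and dropping rows only when a chosen column is missing. The following is a sketch on the same score data; the variable names are introduced for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# fillna accepts a Series of per-column values; here each column's mean.
filled_with_mean = scores.fillna(scores.mean())

# subset limits the check to the listed columns.
without_first_gap = scores.dropna(subset=['First Score'])

print(filled_with_mean)
print(without_first_gap)
```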

Iterating over rows and columns

Iteration is a general term for taking each item of something, one after another. A Pandas
DataFrame consists of rows and columns, so we can iterate over it row by row or column
by column.

To iterate over rows, we can use iterrows() and itertuples(). To iterate over columns, we
can use items() (called iteritems() in older pandas versions; that name was removed in
pandas 2.0).
# importing pandas as pd

import pandas as pd

# dictionary of lists

dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],'degree': ["MBA", "BCA",


"M.Tech","MBA"], 'score':[90, 40, 80, 98]}

# creating a dataframe from a dictionary

df = pd.DataFrame(dict)

print(df)

for row in df.itertuples(): # this will get each row as a named tuple
    print(row)

for i,j in df.iterrows(): # this will get each index and each row's values
    print(i, j['name'])

for i,j in df.items(): # this will extract each column separately
    print(i, list(j))

You can convert a column to list and later process the list easily
sc=df['score'].to_list() #sc is a list of score

Inserting/Deleting rows and columns

# importing pandas as pd

import pandas as pd

# dictionary of lists

dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"], 'degree': ["MBA", "BCA", "M.Tech",


"MBA"], 'score':[90, 40, 80, 98]}

# creating a dataframe from a dictionary

df = pd.DataFrame(dict)

print(df)

df.loc[len(df.index)]=['binu','Phd',47] #adding a new row (loc can add a new label; iloc cannot)

print(df)

lst=[47,45,26,34,45]

df['age']=lst # adding a new column age at the end

print(df)

lst=[2002,2003,2004,2005,2017]

df.insert(1,'year',lst) # adding a new column at a particular position

print(df)

df.drop([0,1],inplace=True) # use the index values to remove the rows

print(df)

df.drop(['score'],axis=1,inplace=True) # use the column name to drop a column

print(df)

Outputs:

name degree score


0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98
name degree score
0 aparna MBA 90
1 pankaj BCA 40
2 sudhir M.Tech 80
3 Geeku MBA 98
4 binu Phd 47
name degree score age
0 aparna MBA 90 47
1 pankaj BCA 40 45
2 sudhir M.Tech 80 26
3 Geeku MBA 98 34
4 binu Phd 47 45
name year degree score age
0 aparna 2002 MBA 90 47
1 pankaj 2003 BCA 40 45
2 sudhir 2004 M.Tech 80 26
3 Geeku 2005 MBA 98 34
4 binu 2017 Phd 47 45
name year degree score age
2 sudhir 2004 M.Tech 80 26
3 Geeku 2005 MBA 98 34
4 binu 2017 Phd 47 45
name year degree age
2 sudhir 2004 M.Tech 26
3 Geeku 2005 MBA 34
4 binu 2017 Phd 45

Updating a particular value

The following will change the score in the row with index label 3 (this assumes the score
column has not been dropped). The at accessor takes an index label and a column name.

df.at[3,'score']=100

print(df)
This will add value 2 to all values in age column

df['age'] +=2

print(df)
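The update styles can be compared side by side on a fresh frame. This is a sketch with a small, assumed data set; the conditional update with loc is an extra variation not shown in the text.

```python
import pandas as pd

df = pd.DataFrame({'name': ['aparna', 'pankaj', 'sudhir', 'Geeku'],
                   'score': [90, 40, 80, 98]})

df.at[3, 'score'] = 100                    # one cell: row label 3, column 'score'
df.loc[df['score'] < 50, 'score'] = 50     # every row that meets the condition
df['score'] += 2                           # the whole column at once

print(df)
```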

Handling data from a data file

Create a data file in Excel and save it in CSV format.

The following are some of the operations you can perform on this data file:

# importing pandas as pd

import pandas as pd

df=pd.read_csv('stud.csv',index_col='rollno')

print("data frame stud")

print(df)
data frame stud
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 ann thrisur 25
107 padma kylm 25

print("columns")
print(df.columns)
columns
Index(['name', 'place', 'mark'], dtype='object')

print("statistical info of numerical column")

print(df.describe())

statistical info of numerical column


mark
count 6.000000
mean 33.833333
std 10.590877
min 25.000000
25% 25.000000
50% 30.000000
75% 42.500000
max 48.000000

print("size")

print(df.size)
size
18

print("data types")

print(df.dtypes)
data types
name object
place object
mark int64
dtype: object

print("shapes")

print(df.shape)
shapes
(6, 3)
print("index and length of index")

print(df.index,len(df.index))
index and length of index
Int64Index([101, 103, 102, 105, 106, 107], dtype='int64', name='rollno') 6

print("statistical functions")

print("sum=",df['mark'].sum())

print("mean=",df['mark'].mean())

print("max=",df['mark'].max())

print("min=",df['mark'].min())

print("var=",df['mark'].var())

print("standard deviation=",df['mark'].std())

print(df.std(numeric_only=True))
statistical functions
sum= 203
mean= 33.833333333333336
max= 48
min= 25
var= 112.16666666666667
standard deviation= 10.59087657687817
mark 10.590877
dtype: float64

print("top 2 rows")

print(df.head(2))
top 2 rows
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35

print("last 2 rows")

print(df.tail(2))
last 2 rows
name place mark
rollno
106 ann thrisur 25
107 padma kylm 25

print("data from rows 0,1,2")

print(df[0:3])
data from rows 0,1,2
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48

print("mark column values")

print(df['mark'])
mark column values
rollno
101 45
103 35
102 48
105 25
106 25
107 25
Name: mark, dtype: int64

print("rows where mark >40")

print(df[df['mark']>40])
rows where mark >40
name place mark
rollno
101 binu ernkulam 45
102 faisal kollam 48

print("rows 0,1,2 columns 0,2")

print(df.iloc[0:3,[0,2]])
rows 0,1,2 columns 0,2
name mark
rollno
101 binu 45
103 ashik 35
102 faisal 48

print("sorting in the descending order of marks")

print(df.sort_values(by='mark',ascending=False))
sorting in the descending order of marks
name place mark
rollno
102 faisal kollam 48
101 binu ernkulam 45
103 ashik alleppey 35
105 biju kotayam 25
106 ann thrisur 25
107 padma kylm 25

print("use agg function to compute all the values")

print(df['mark'].agg(['min','max','mean']))
use agg function to compute all the values
min 25.000000
max 48.000000
mean 33.833333
Name: mark, dtype: float64

print("median of marks")

print("Median",df.median(numeric_only=True))
median of marks
Median mark 30.0
dtype: float64

print("mode of marks")

print("Mode",df['mark'].mode())
mode of marks
Mode 0 25
dtype: int64

print("count of marks")
print(df['mark'].value_counts())
count of marks
25 3
45 1
35 1
48 1
Name: mark, dtype: int64

print("grouping data based on column value")

print(df.groupby('mark')['mark'].mean())
grouping data based on column value
mark
25 25.0
35 35.0
45 45.0
48 48.0
Name: mark, dtype: float64
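Grouping a column by itself only repeats its values. Grouping is more informative when the key is a different column. The sketch below assumes a hypothetical 'dept' column that is not part of stud.csv.

```python
import pandas as pd

df = pd.DataFrame({'name': ['binu', 'ashik', 'faisal', 'biju'],
                   'dept': ['cs', 'ec', 'cs', 'ec'],   # hypothetical column
                   'mark': [45, 35, 48, 25]})

# Split the rows by dept, then compute the mean mark inside each group.
mean_by_dept = df.groupby('dept')['mark'].mean()

# agg computes several statistics per group in one call.
summary = df.groupby('dept')['mark'].agg(['min', 'max', 'count'])

print(mean_by_dept)
print(summary)
```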

print("plotting the histogram")

import matplotlib.pyplot as plt

plt.figure(1)

plt.hist(df['mark'])

plt.figure(2)

plt.scatter(df['name'],df['mark'])

plt.figure(3)

plt.pie(df['mark'])

plt.show()

Outputs:

plotting the histogram


Writing Data to CSV file
The process of creating or writing a CSV file through Pandas can be a little more
complicated than reading CSV, but it's still relatively simple. We use the to_csv() function
to perform this task. However, you have to create a Pandas DataFrame first, followed by
writing that DataFrame to the CSV file.

Column names can also be specified via the keyword argument columns, as well as a
different delimiter via the sep argument. Again, the default delimiter is a comma, ','.

Here is a simple example showing how to export a DataFrame to a CSV file via to_csv():
# importing pandas as pd
import pandas as pd
# dictionary of lists
dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(dict)
print(df)
df.to_csv('studdata.csv')
#open the studdata.csv and see the data written
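The columns and sep arguments mentioned above can be tried without creating a file by writing to an in-memory buffer. This is a sketch; the buffer stands in for a path such as 'studdata.csv', and index=False (an extra option not discussed above) leaves out the row index.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ["aparna", "pankaj"],
                   'degree': ["MBA", "BCA"],
                   'score': [90, 40]})

buffer = io.StringIO()   # stands in for a file path
# Write only two of the columns, separated by ';', without the row index.
df.to_csv(buffer, columns=['name', 'score'], sep=';', index=False)
csv_text = buffer.getvalue()

print(csv_text)
```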

Example Programs using numpy and pandas


November 23, 2020

1.Add two matrices and find the transpose of the result (university question)
import numpy as np

def readmatrix(x,r,c):
    # fill the r x c array x with values typed in row by row
    for i in range(r):
        for j in range(c):
            x[i][j]=int(input('enter elements row by row'))

r1=int(input('rows of a'))
c1=int(input('columns of a'))
r2=int(input('rows of b'))
c2=int(input('columns of b'))
if r1!=r2 or c1!=c2:
    print("cant add matrices")
else:
    A=np.zeros((r1,c1))
    print("Enter the elements of A")
    readmatrix(A,r1,c1)
    B=np.zeros((r2,c2))
    print("Enter the elements of B")
    readmatrix(B,r2,c2)
    print("Matrix A")
    print(A)
    print("Matrix B")
    print(B)
    C=A+B
    print("sum")
    print(C)
    print("transpose of sum")
    print(C.T)
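For a quick check without keyboard input, the same steps can be run on fixed arrays. This is a sketch with made-up values, comparing the shapes directly instead of reading the row and column counts.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[10, 20, 30],
              [40, 50, 60]])

# Matrices can be added only when their shapes match.
if A.shape != B.shape:
    print("cant add matrices")
else:
    C = A + B        # element-wise sum
    print(C)
    print(C.T)       # transpose: rows become columns
```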

2.Creating a dataframe from a list of data and setting the index

import pandas as pd

#initialize a dataframe

df = pd.DataFrame(

[[21, 'Amol', 72, 67],

[23, 'Lini', 78, 69],

[32, 'Kiku', 74, 56],

[52, 'Ajit', 54, 76]],

columns=['rollno', 'name', 'physics', 'botony'])

print('DataFrame with default index\n', df)

#set column as index

df = df.set_index('rollno')
print('\nDataFrame with column as index\n',df)

DataFrame with default index


rollno name physics botony
0 21 Amol 72 67
1 23 Lini 78 69
2 32 Kiku 74 56
3 52 Ajit 54 76

DataFrame with column as index


name physics botony
rollno
21 Amol 72 67
23 Lini 78 69
32 Kiku 74 56
52 Ajit 54 76
3.Writing data to an excel file

import pandas as pd

# create dataframe

df_marks = pd.DataFrame({'name': ['Somu', 'Kiku', 'Amol', 'Lini'],

'physics': [68, 74, 77, 78],

'chemistry': [84, 56, 73, 69],

'algebra': [78, 88, 82, 87]})

# create excel writer object

writer = pd.ExcelWriter('output.xlsx')

# write dataframe to excel

df_marks.to_excel(writer)

# save the excel (close() writes the file; writer.save() was removed in pandas 2.0)

writer.close()

print('DataFrame is written successfully to Excel File.')


4.Reading data from excel file

# Program to extract a particular row value

import xlrd # xlrd 2.0 and later read only .xls files; .xlsx needs an xlrd version below 2.0

loc = ("stud.xlsx")

wb = xlrd.open_workbook(loc)

sheet = wb.sheet_by_index(0)

#extracting column names

print(sheet.cell_value(0, 0),sheet.cell_value(0, 1),sheet.cell_value(0, 2))

for i in range(1,sheet.nrows):

    print(sheet.row_values(i))

5.Write Python program to write the data given below to a CSV file.(university
question)

SN Name Country Contribution Year

1 Linus Torvalds Finland Linux Kernel 1991

2 Tim Berners-Lee England World Wide Web 1990

3 Guido van Rossum Netherlands Python 1991

# importing pandas as pd

import pandas as pd

# creating a dataframe from a list of rows

df = pd.DataFrame([[1,'Linus Torvalds','Finland','Linux Kernel',1991],

[2,'Tim Berners-Lee','England','World Wide Web',1990],

[3,'Guido van Rossum','Netherlands','Python',1991]],

columns=['SN','Name','Country','Contribution','Year'])

print("data frame with default index\n",df)

df=df.set_index('SN')

print("data frame with SN as index\n",df)

df.to_csv('inventors.csv')

6.Create a data frame from the dictionary of lists

import pandas as pd

# dictionary of lists

dict = {'name':["aparna", "pankaj", "sudhir", "Geeku"],

'degree': ["MBA", "BCA", "M.Tech", "MBA"],

'score':[90, 40, 80, 98]}

# creating a dataframe from a dictionary

df = pd.DataFrame(dict)

print(df)

7.Given a file “auto.csv” of automobile data with the fields index, company,

body-style, wheel-base, length, engine-type, num-of-cylinders, horsepower,

average-mileage, and price, write Python codes using Pandas to

1) Clean and Update the CSV file

2)Find the most expensive car company name

3)Print all toyota car details

4) Print total cars of all companies

5) Find the highest priced car of all companies

6)Find the average mileage of all companies

7)Sort all cars by Price column

Reading the data file and showing the first five records
import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df.head(5)

   index      company   body-style  wheel-base  length  engine-type  num-of-cylinders  horsepower  average-mileage    price
0      0  alfa-romero  convertible        88.6   168.8         dohc              four         111               21  13495.0
1      1  alfa-romero  convertible        88.6   168.8         dohc              four         111               21  16500.0
2      2  alfa-romero    hatchback        94.5   171.2         ohcv               six         154               19  16500.0
3      3         audi        sedan        99.8   176.6          ohc              four         102               24  13950.0
4      4         audi        sedan        99.4   176.6          ohc              five         115               18  17450.0

This will show last 7 rows

df.tail(7)

1) Clean and Update the CSV file

import pandas as pd

df = pd.read_csv("Automobile_data.csv",

na_values={

'price':["?","n.a"],

'stroke':["?","n.a"],
'horsepower':["?","n.a"],

'peak-rpm':["?","n.a"],

'average-mileage':["?","n.a"]})

print (df)

df.to_csv("Automobile_data.csv")
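The effect of na_values can be seen on a tiny in-memory sample. The three rows below are made up for illustration and are not taken from Automobile_data.csv; the markers "?" and "n.a" are turned into NaN when the file is read.

```python
import io
import pandas as pd

# Made-up sample in which missing prices are written as "?" or "n.a".
raw = """company,price,horsepower
audi,13950,102
bmw,?,101
toyota,n.a,62
"""
cars = pd.read_csv(io.StringIO(raw),
                   na_values={'price': ["?", "n.a"], 'horsepower': ["?", "n.a"]})

# After cleaning, the rows without a price can be dropped if required.
priced = cars.dropna(subset=['price'])

print(cars)
print(priced)
```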

2)Find the most expensive car company name

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df = df [['company','price']][df.price==df['price'].max()]

df

output

company price

35 mercedes-benz 45400.0

3) Print all toyota car details

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

print(df[df['company']=='toyota'])

OR

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

car_Manufacturers = df.groupby('company')

toyotaDf = car_Manufacturers.get_group('toyota')

toyotaDf

4)Print total cars of all companies


import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df.groupby('company')['company'].count()

OR

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df['company'].value_counts()

5) Find the highest priced car of all companies

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df.groupby('company')[['company','price']].max()

6)Find the average mileage of all companies

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df.groupby('company')[['average-mileage']].mean()

7)Sort all cars by Price column

import pandas as pd

df = pd.read_csv("Automobile_data.csv")

df.sort_values(by=['price', 'horsepower'], ascending=False)[['company','price']]
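The queries in this exercise can be tried without the data file on a small frame built in code. The rows below are hypothetical sample values, and idxmax is used here as one possible way to find the most expensive car.

```python
import pandas as pd

# Hypothetical sample rows; the real file has many more rows and columns.
cars = pd.DataFrame({'company': ['audi', 'audi', 'bmw', 'toyota', 'toyota'],
                     'price': [13950.0, 17450.0, 16430.0, 5348.0, 6338.0],
                     'average-mileage': [24, 18, 23, 35, 31]})

most_expensive = cars.loc[cars['price'].idxmax(), 'company']      # company of the top price
cars_per_company = cars['company'].value_counts()                 # number of cars per company
top_price = cars.groupby('company')['price'].max()                # highest price per company
mean_mileage = cars.groupby('company')['average-mileage'].mean()  # average mileage per company
by_price = cars.sort_values(by='price', ascending=False)          # most expensive first

print(most_expensive)
print(top_price)
```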

8) Create a stud.csv file containing rollno,name,place and mark of students. Use this file and do
the following

a) Read and display the file contents

import pandas as pd
df = pd.read_csv("stud.csv")
print(df)
rollno name place mark
0 101 binu ernkulam 45
1 103 ashik alleppey 35

2 102 faisal kollam 48

3 105 biju kotayam 25

4 106 anu thrisur 25

5 107 padma kylm 25

b)Set rollno as index

df=df.set_index('rollno')
print(df)
name place mark
rollno
101 binu ernkulam 45
103 ashik alleppey 35
102 faisal kollam 48
105 biju kotayam 25
106 anu thrisur 25
107 padma kylm 25

c)Display name and mark

print(df[['name','mark']])
name mark
rollno
101 binu 45
103 ashik 35
102 faisal 48
105 biju 25
106 anu 25
107 padma 25

d) rollno,Name and mark in the order of name


df=df[['name','mark']]
df=df.sort_values('name')
print(df)
name mark
rollno
106 anu 25
103 ashik 35
105 biju 25
101 binu 45
102 faisal 48
107 padma 25
e) Display the rollno,name, mark in the descending order of mark
df=df.sort_values(by='mark',ascending=False)
print(df)
name mark
rollno
102 faisal 48
101 binu 45
103 ashik 35
106 anu 25
105 biju 25
107 padma 25

f) Find the average mark,median and mode of marks


print(df['mark'].mean())
print(df['mark'].median())
print(df['mark'].mode()[0])
33.833333333333336
30.0
25
g)Find minimum and maximum marks
print(df['mark'].min())
print(df['mark'].max())
25
48
h)variance and standard deviation of marks

print(df['mark'].var())

print(df['mark'].std())

112.16666666666667

10.59087657687817
i)display the histogram of marks

import matplotlib.pyplot as plt

plt.hist(df['mark'])

plt.show()

j)remove the place column

df.drop(['place'],axis=1,inplace=True)

print(df)

rollno name mark

0 101 binu 45

1 103 ashik 35

2 102 faisal 48

3 105 biju 25

4 106 ann 25

5 107 padma 25

Find the student with max marks

print(df[['name']][df.mark==df['mark'].max()])

Print all student details who are from ernakulam

print(df[df['place']=='ernkulam']) # spelled as it appears in stud.csv
