Data Science Lab (To Write)

Uploaded by

jenishton7


1(a).

Download and install the different packages like NumPy, SciPy, Jupyter, Statsmodels and
Pandas

AIM:
To learn how to download and install the different packages of NumPy, SciPy, Jupyter,
Statsmodels and Pandas.
ALGORITHM:
1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install the packages NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.
Python Installation
● Open the official Python website (https://www.python.org/).
● Downloads ==> Windows ==> Select Recent Release. (Requires Windows 10 or
above versions)
● Install "python-3.10.6-amd64.exe"
Jupyter Installation
● Open a command prompt and enter "python --version" to check whether Python was
installed properly.
● If the installation is correct, it returns the version of Python.
● Enter "pip --version" to check whether the Python package manager was
installed properly.
● If the installation is correct, it returns the version of the Python package manager.
● Enter the command "pip install jupyterlab".
● Enter the command "pip install jupyter notebook".
● If pip reports that a newer version is available, copy the upgrade command from
its output, paste it and execute it to upgrade.
● Create a folder and name it accordingly.
● Open a command prompt and change into that folder. Enter the command
"jupyter notebook" and press Enter.
● A new Jupyter notebook session now opens for use.
pip Installation
Installation of NumPy
● pip install numpy
Installation of SciPy
● pip install scipy
Installation of Statsmodels
● pip install statsmodels
Installation of Pandas
● pip install pandas
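
The verification step of the algorithm can also be scripted. The following is a minimal sketch, not part of the original procedure: it uses only the standard library (importlib.metadata, Python 3.8 or later) to look up the installed version of each package. The helper name installed_versions and the PACKAGES list are illustrative assumptions.

```python
from importlib import metadata

# Distribution names as used with "pip install" in the steps above.
PACKAGES = ["numpy", "scipy", "statsmodels", "pandas", "jupyterlab", "notebook"]

def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name:12s} {version if version else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with the corresponding pip command listed above.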

RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed properly and
their execution was also verified.
1(b). Explore the features of NumPy

AIM:
To learn the different features provided by NumPy package.

ALGORITHM:
1. Install the NumPy package
2. Study all the features of NumPy package.

NumPy
● NumPy is a Python library used for working with arrays.
● It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.
1. Arrays
1.1 Array Slicing
● Slicing in python means taking elements from one given index to another
given index.
● We pass slice instead of index like this: [start:end].
● We can also define the step, like this: [start:end:step].

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

1.2 Array Shape & Reshaping

1.2.1 Array Shape
NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

1.2.2 Array Reshaping

● Reshaping means changing the shape of an array.
● By reshaping we can add or remove dimensions or change number of
elements in each dimension.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
2. Random
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of
[1, 2, 3] and vice-versa.
The NumPy Random module provides two methods for this: shuffle() and
permutation().
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
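
The text names two methods but the listing above shows only shuffle(), which rearranges the array in place. As a short companion sketch (the variable name shuffled_copy is illustrative), permutation() returns a rearranged copy and leaves the original array untouched:

```python
import numpy as np
from numpy import random

arr = np.array([1, 2, 3, 4, 5])
shuffled_copy = random.permutation(arr)  # returns a new, rearranged array

print(shuffled_copy)  # some ordering of the values 1..5
print(arr)            # the original array is unchanged
```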

2.1 Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to
visualize random distributions.
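import matplotlib.pyplot as plt
import seaborn as sns
# distplot() is deprecated in recent seaborn releases; histplot() with kde=True replaces it
sns.histplot([0, 1, 2, 3, 4, 5], kde=True)
plt.show()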
2.2 Normal (Gaussian) Distribution
It uses the random.normal() method to get a Normal Data Distribution.
It has three parameters:
loc - (Mean) where the peak of the bell exists.
scale - (Standard Deviation) how flat the graph distribution should be.
size - The shape of the returned array.
Ex:
from numpy import random
x = random.normal(loc=1, scale=2, size=(2, 3))
print(x)
2.3 Binomial Distribution
Binomial Distribution is a Discrete Distribution. It describes the outcome of binary
scenarios, e.g. toss of a coin.
It has three parameters:
n - number of trials.
p - probability of occurrence of each trial (e.g. for toss of a coin 0.5 each).
size - The shape of the returned array.
Ex:
from numpy import random
x = random.binomial(n=10, p=0.5, size=10)
print(x)

2.4 Poisson Distribution

It estimates how many times an event can happen in a specified time.
It has two parameters:
lam - rate or known number of occurrences, e.g. 2.
size - The shape of the returned array.
Ex:
from numpy import random
x = random.poisson(lam=2, size=10)
print(x)
2.5 Uniform Distribution
Used to describe probability where every event has equal chances of occurring.
E.g. generation of random numbers.
It has three parameters:
low - lower bound - default 0.0.
high - upper bound - default 1.0.
size - The shape of the returned array.
Ex:
from numpy import random
x = random.uniform(size=(2, 3))
print(x)

2.6 Logistic Distribution

Logistic Distribution is used to describe growth.
Used extensively in machine learning in logistic regression, neural networks etc.
It has three parameters:
loc - mean, where the peak is. Default 0.
scale - standard deviation, the flatness of distribution. Default 1.
size - The shape of the returned array.
EX:
from numpy import random
x = random.logistic(loc=1, scale=2, size=(2, 3))
print(x)

2.7 Multinomial Distribution

Multinomial distribution is a generalization of binomial distribution.
It describes outcomes of multi-nomial scenarios unlike binomial where scenarios must
be only one of two. e.g. Blood type of a population, dice roll outcome.
It has three parameters:
n - number of experiments (e.g. the number of dice rolls).
pvals - list of probabilities of outcomes (e.g. [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] for dice roll).
size - The shape of the returned array.
EX:
from numpy import random
x = random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(x)
2.8 Exponential Distribution
Exponential distribution is used for describing time till next event e.g. failure/success
etc.
It has two parameters:
scale - inverse of rate (see lam in Poisson distribution), defaults to 1.0.
size - The shape of the returned array.
Ex:
from numpy import random
x = random.exponential(scale=2, size=(2, 3))
print(x)
2.9 Chi Square Distribution
Chi Square distribution is used as a basis to verify the hypothesis.
It has two parameters:
df - (degree of freedom).
size - The shape of the returned array.
EX:
from numpy import random
x = random.chisquare(df=2, size=(2, 3))
print(x)
2.10 Rayleigh Distribution
Rayleigh distribution is used in signal processing.
It has two parameters:
scale - (standard deviation) decides how flat the distribution will be (default 1.0).
size - The shape of the returned array.
Ex:
from numpy import random
x = random.rayleigh(scale=2, size=(2, 3))
print(x)
2.11 Pareto Distribution
A distribution following Pareto's law i.e. 80-20 distribution (20% factors cause 80%
outcome).
It has two parameters:
a - shape parameter.
size - The shape of the returned array.
Ex:
from numpy import random
x = random.pareto(a=2, size=(2, 3))
print(x)
2.12 Zipf Distribution
Zipf distributions are used to sample data based on Zipf's law.
Zipf's Law: In a collection, the nth most common term occurs 1/n times as often as the
most common term.
It has two parameters:
a - distribution parameter.
size - The shape of the returned array.
EX:
from numpy import random
x = random.zipf(a=2, size=(2, 3))
print(x)
3. Universal Functions
Create Your Own ufunc (Universal)
To create your own ufunc, you have to define a function, like you do with normal
functions in Python, then you add it to your NumPy ufunc library with the frompyfunc()
method.
The frompyfunc() method takes the following arguments:
function - the name of the function.
inputs - the number of input arguments (arrays).
outputs - the number of output arrays.
Create your own ufunc for addition:
import numpy as np
def myadd(x, y):
    return x+y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
3.1 Simple Arithmetic
Addition
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Subtraction
Subtract the values in arr2 from the values in arr1:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Multiplication
Multiply the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Division
Divide the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Power

Raise the values in arr1 to the power of values in arr2:

import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
newarr = np.power(arr1, arr2)
print(newarr)
Remainder
Return the remainders:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Absolute Values
Return the absolute value of each element:
import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)

3.2 Rounding Decimals

There are primarily five ways of rounding off decimals in NumPy:

● truncation
● fix
● rounding
● floor
● ceil
3.2.1 Truncation
Remove the decimals, and return the float number closest to zero. Use the trunc() and
fix() functions.
Truncate elements of following array:
import numpy as np
arr = np.trunc([-3.1666, 3.6667])
print(arr)

3.2.2 Rounding
The around() function rounds to the given number of decimal places. Values exactly
halfway between two candidates are rounded to the nearest even value.
Round off 3.1666 to 2 decimal places:
import numpy as np
arr = np.around(3.1666, 2)
print(arr)
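
One detail worth seeing in action: NumPy's around() rounds values that lie exactly halfway between two integers to the nearest even value, rather than always rounding up. A small illustrative sketch (the array name halves is an assumption for this example):

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])
rounded = np.around(halves)  # exact halves go to the nearest even integer
print(rounded)               # [0. 2. 2. 4.]
```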

3.2.3 Floor
The floor() function rounds off decimal to nearest lower integer.
Floor the elements of following array:
import numpy as np
arr = np.floor([-3.1666, 3.6667])
print(arr)

3.2.4 Ceil
The ceil() function rounds off decimal to nearest upper integer.
Ceil the elements of following array:
import numpy as np
arr = np.ceil([-3.1666, 3.6667])
print(arr)

3.3 Logs

NumPy provides functions to perform log at the base 2, e and 10. We will also explore
how we can take log for any base by creating a custom ufunc.
Ex:
import numpy as np
arr = np.arange(1, 10)
print(np.log10(arr))
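
For a log at an arbitrary base, one possible approach is to wrap Python's math.log(), which accepts a base as its second argument, with frompyfunc(). This is a sketch under that assumption; the name nplog is illustrative.

```python
from math import log
import numpy as np

# frompyfunc(function, number of inputs, number of outputs)
nplog = np.frompyfunc(log, 2, 1)

print(nplog(100, 15))         # log of 100 at base 15
print(nplog([8, 16, 32], 2))  # element-wise; the result is an object-dtype array
```

Because frompyfunc() calls the Python function once per element, this is convenient rather than fast. For large arrays, dividing np.log(arr) by np.log(base) gives the same values.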

3.4 Summations
Addition is done between two arguments whereas summation happens over n
elements.
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.add(arr1, arr2)
print(newarr)
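
The listing above shows addition only. To contrast it with summation, the following sketch applies np.sum() to the same two arrays; the variable names total and per_array are illustrative.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])

total = np.sum([arr1, arr2])              # one number: the sum of all six elements
per_array = np.sum([arr1, arr2], axis=1)  # one sum per array
print(total)      # 12
print(per_array)  # [6 6]
```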

3.5 Products
To find the product of the elements in an array, use the prod() function.
Find the product of the elements of this array:
import numpy as np
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)

3.6 Differences
A discrete difference means subtracting two successive elements. To find the
discrete difference, use the diff() function.
Compute discrete difference of the following array:
import numpy as np
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)

3.7 LCM (Lowest Common Multiple)

The Lowest Common Multiple is the least number that is common multiple of both of
the numbers.
import numpy as np
num1 = 4
num2 = 6
x = np.lcm(num1, num2)
print(x)

3.8 GCD (Greatest Common Divisor)

The GCD (Greatest Common Divisor), also known as HCF (Highest Common
Factor), is the biggest number that is a common factor of both of the numbers.
Find the HCF of the following two numbers:
import numpy as np
num1 = 6
num2 = 9
x = np.gcd(num1, num2)
print(x)
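
LCM and GCD are linked by the identity lcm(a, b) * gcd(a, b) = a * b for positive integers. The sketch below checks this with the standard library only; the helper name lcm is defined here for illustration and is not the NumPy function.

```python
from math import gcd

def lcm(a, b):
    """Lowest common multiple via the identity lcm(a, b) * gcd(a, b) == a * b."""
    return a * b // gcd(a, b)

print(gcd(6, 9))  # 3
print(lcm(4, 6))  # 12
```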

3.9 Trigonometric Functions

NumPy provides the ufuncs sin(), cos() and tan() that take values in radians and
produce the corresponding sin, cos and tan values.
Find sine value of PI/2:
import numpy as np
x = np.sin(np.pi/2)
print(x)

Find sine values for all of the values in arr:

import numpy as np
arr = np.array([np.pi/2, np.pi/3, np.pi/4, np.pi/5])
x = np.sin(arr)
print(x)

3.10 Hyperbolic Functions

NumPy provides the ufuncs sinh(), cosh() and tanh() that take values in radians and
produce the corresponding sinh, cosh and tanh values.
Find sinh value of PI/2:
import numpy as np
x = np.sinh(np.pi/2)
print(x)

Find cosh values for all of the values in arr:

import numpy as np
arr = np.array([np.pi/2, np.pi/3, np.pi/4, np.pi/5])
x = np.cosh(arr)
print(x)

3.11 Set Operations

A set in mathematics is a collection of unique elements.
3.11.1 Create Sets in NumPy
We can use NumPy's unique() method to find unique elements from any array. E.g.
create a set array, but remember that the set arrays should only be 1-D arrays.
Convert following array with repeated elements to a set:
import numpy as np
arr = np.array([1, 1, 1, 2, 3, 4, 5, 5, 6, 7])
x = np.unique(arr)
print(x)
3.11.2 Finding Union
To find the unique values of two arrays, use the union1d() method.
Find union of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.union1d(arr1, arr2)
print(newarr)

3.11.3 Finding Intersection

To find only the values that are present in both arrays, use the intersect1d() method.
Find intersection of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.intersect1d(arr1, arr2, assume_unique=True)
print(newarr)

RESULT
Thus the feature study of NumPy was completed successfully.
1(c). Explore the features of SciPy

AIM:
To learn the different features provided by SciPy package.

ALGORITHM:
1. Install the SciPy package
2. Study all the features of SciPy package.

SciPy
SciPy stands for Scientific Python. It is a scientific computation library that uses
NumPy underneath.
1. Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
1.1 Constants in SciPy
Metric
Return the specified unit in meter
ex: print(constants.milli)
Binary
Return the specified unit in bytes
ex: print(constants.kibi)
Mass
Return the specified unit in kg
ex: print(constants.stone)
Angle
Return the specified unit in radians
ex: print(constants.degree)
Time
Return the specified unit in seconds
ex: print(constants.year)
Length
Return the specified unit in meters
ex: print(constants.mile)
Pressure
Return the specified unit in pascals
ex: print(constants.bar)
Area
Return the specified unit in square meters
ex: print(constants.hectare)
Volume
Return the specified unit in cubic meters
ex: print(constants.litre)
Speed
Return the specified unit in meters per second
ex: print(constants.kmh)
Temperature
Return the specified unit in Kelvin
ex: print(constants.zero_Celsius)
Energy
Return the specified unit in joules
ex: print(constants.calorie)
Power
Return the specified unit in watts
ex: print(constants.hp)
Force
Return the specified unit in newton
ex: print(constants.pound_force)

2. Sparse Data
Sparse data is data that has mostly unused elements (elements that don't carry any
information).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values are not zero.

2.1 CSR(Compressed Sparse Row) Matrix

We can create a CSR matrix by passing an array into the
function scipy.sparse.csr_matrix().
Create a CSR matrix from an array:
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])
print(csr_matrix(arr))

3. Graphs
Graphs are an essential data structure.
SciPy provides us with the module scipy.sparse.csgraph for working with such data
structures.
Adjacency Matrix
An adjacency matrix is an n x n matrix where n is the number of elements in a graph.
The values represent the connections between the elements.
3.1 Dijkstra
Use the dijkstra method to find the shortest path in a graph from one element to
another.
It takes following arguments:
return_predecessors: boolean (True to return whole path of traversal otherwise False).
indices: index of the element to return all paths from that element only.
limit: max weight of path.
Find the shortest path from element 1 to 2:
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix
arr = np.array([ [0, 1, 2],
[1, 0, 0],
[2, 0, 0]])
newarr = csr_matrix(arr)
print(dijkstra(newarr, return_predecessors=True, indices=0))
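
To show what the predecessor array is for, here is a minimal pure-Python sketch of Dijkstra's algorithm on the same adjacency matrix, followed by path reconstruction. It is an illustration of the technique, not SciPy's implementation. The names dijkstra_dense and path_to are hypothetical, and None marks "no predecessor" where SciPy uses -9999.

```python
import heapq

def dijkstra_dense(adj, source):
    """Shortest distances and predecessors from source; 0 in adj means "no edge"."""
    n = len(adj)
    dist = [float("inf")] * n
    pred = [None] * n
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue  # stale queue entry, a shorter route was already found
        for v, w in enumerate(adj[u]):
            if w and d + w < dist[v]:
                dist[v] = d + w
                pred[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, pred

def path_to(pred, target):
    """Follow predecessors back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]

adj = [[0, 1, 2],
       [1, 0, 0],
       [2, 0, 0]]
dist, pred = dijkstra_dense(adj, 0)
print(dist)              # [0, 1, 2]
print(path_to(pred, 2))  # [0, 2]
```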

3.2 Depth First Order

The depth_first_order() method returns a depth first traversal from a node.
This function takes following arguments:
The graph.
The starting element to traverse graph from.

Traverse the graph depth first for given adjacency matrix:

import numpy as np
from scipy.sparse.csgraph import depth_first_order
from scipy.sparse import csr_matrix
arr =np.array([ [0, 1, 0, 1],
[1, 1, 1, 1],
[2, 1, 1, 0],
[0, 1, 0, 1]])
newarr = csr_matrix(arr)
print(depth_first_order(newarr, 1))
3.3 Breadth First Order
The breadth_first_order() method returns a breadth first traversal from a node.
This function takes following arguments:
The graph.
The starting element to traverse graph from.
Traverse the graph breadth first for given adjacency matrix:
import numpy as np
from scipy.sparse.csgraph import breadth_first_order
from scipy.sparse import csr_matrix
arr = np.array([ [0, 1, 0, 1],
[1, 1, 1, 1],
[2, 1, 1, 0],
[0, 1, 0, 1]])
newarr = csr_matrix(arr)
print(breadth_first_order(newarr, 1))
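
The traversal itself can be sketched in a few lines of plain Python, which may help when reading the node order returned above. This is an illustrative queue-based version over the same adjacency matrix. The name bfs_order is hypothetical, and it returns only the visit order, not the predecessor array that SciPy also returns.

```python
from collections import deque

def bfs_order(adj, start):
    """Visit order of a breadth-first traversal over an adjacency matrix."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour, weight in enumerate(adj[node]):
            if weight and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order

adj = [[0, 1, 0, 1],
       [1, 1, 1, 1],
       [2, 1, 1, 0],
       [0, 1, 0, 1]]
print(bfs_order(adj, 1))  # [1, 0, 2, 3]
```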

4. Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
4.1 Triangulation
A triangulation of a polygon divides the polygon into multiple triangles with which
we can compute the area of the polygon. One method to generate these triangulations through
points is the Delaunay() triangulation.
Example:
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()

4.2 Convex Hull

A convex hull is the smallest polygon that covers all of the given points.
Use the ConvexHull() method to create a Convex Hull.
Example
Create a convex hull for following points:
import numpy as np
from scipy.spatial import ConvexHull
import matplotlib.pyplot as plt
points = np.array([ [2, 4],[3, 4],[3, 0],
[2, 2],[4, 1],[1, 2],
[5, 0],[3, 1],[1, 2],
[0, 2] ])
hull = ConvexHull(points)
hull_points = hull.simplices
plt.scatter(points[:,0], points[:,1])
for simplex in hull_points:
    plt.plot(points[simplex,0], points[simplex,1], 'k-')
plt.show()

4.3 KDTrees
KDTrees are a data structure optimized for nearest neighbor queries.
E.g. in a set of points using KDTrees we can efficiently ask which points are nearest to
a certain given point.
The KDTree() method returns a KDTree object.
The query() method returns the distance to the nearest neighbor and the location of
the neighbors.
Example
Find the nearest neighbor to point (1,1):
from scipy.spatial import KDTree
points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)
res = kdtree.query((1, 1))
print(res)
4.4 Distance Matrix
Data science uses many distance metrics to measure the distance between two
points, such as Euclidean distance and cosine distance.
Algorithms such as "K Nearest Neighbors" or "K Means" depend on these distances.
4.4.1 Euclidean Distance
Find the euclidean distance between given points A and B.
Example
Find the euclidean distance between given points.
from scipy.spatial.distance import euclidean
p1 = (1, 0)
p2 = (10, 2)
res = euclidean(p1, p2)
print(res)

4.4.2 Cosine Distance

Is one minus the cosine of the angle between the two points A and B, so points lying in
the same direction from the origin have distance 0.
Example
Find the cosine distance between given points:
from scipy.spatial.distance import cosine
p1 = (1, 0)
p2 = (10, 2)
res = cosine(p1, p2)
print(res)

4.4.3 Hamming Distance
Is the proportion of positions at which two sequences differ. It's a way to
measure distance for binary sequences.
Example
Find the hamming distance between given points:
from scipy.spatial.distance import hamming
p1 = (True, False, True)
p2 = (False, True, True)
res = hamming(p1, p2)
print(res)
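
The three metrics above are short enough to write out by hand, which makes the formulas explicit. The following standard-library sketch mirrors the definitions; the function names are local to this example and only shadow the SciPy names for readability.

```python
import math

def euclidean(p, q):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)  # 1 minus the cosine similarity

def hamming(p, q):
    # Fraction of positions at which the two sequences differ.
    return sum(a != b for a, b in zip(p, q)) / len(p)

print(euclidean((1, 0), (10, 2)))        # sqrt(85)
print(cosine_distance((1, 0), (10, 2)))
print(hamming((True, False, True), (False, True, True)))  # 2 of 3 positions differ
```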

5. Matlab Arrays
We know that NumPy provides us with methods to persist the data in readable
formats for Python. But SciPy provides us with interoperability with Matlab as well.
Working With Matlab Arrays
Exporting Data in Matlab Format
The savemat() function allows us to export data in Matlab format.
The method takes the following parameters:
filename - the file name for saving data.
mdict - a dictionary containing the data.
do_compression - a boolean value that specifies whether to compress the
result or not. Default False.
Example
Export the following array as variable name "vec" to a mat file:
from scipy import io
import numpy as np
arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})

Import Data from Matlab Format

The loadmat() function allows us to import data from a Matlab file.
The function takes one required parameter:
filename - the file name of the saved data.
It will return a dictionary whose keys are the variable names, and the
corresponding values are the variable values.
Example
Import the array from following mat file.:
from scipy import io
import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9,])
# Export:
io.savemat('arr.mat', {"vec": arr})
# Import:
mydata = io.loadmat('arr.mat')
print(mydata)

6. Interpolation
Interpolation is a method for generating points between given points.
For example: for points 1 and 2, we may interpolate and find points 1.33 and 1.66.
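
The idea can be illustrated without SciPy. The sketch below performs linear interpolation between two known points; the function name lerp_points and the sample coordinates are assumptions made for this example.

```python
def lerp_points(x0, y0, x1, y1, xs):
    """Linearly interpolate y at each x in xs on the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return [y0 + slope * (x - x0) for x in xs]

# Two evenly spaced values between y = 1 and y = 2 (roughly 1.33 and 1.67).
xs = [1 / 3, 2 / 3]
values = lerp_points(0, 1, 1, 2, xs)
print(values)
```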
6.1 1D Interpolation
The function interp1d() is used to interpolate a distribution with 1 variable.
It takes x and y points and returns a callable function that can be called with new x
and returns corresponding y.
Example
For given xs and ys interpolate values from 2.1, 2.2... to 2.9:
from scipy.interpolate import interp1d
import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)

6.2 Spline Interpolation

In 1D interpolation the points are fitted for a single curve whereas in Spline
interpolation the points are fitted against a piecewise function defined with polynomials
called splines.
The UnivariateSpline() function takes xs and ys and produces a callable function that
can be called with new xs.
Example
Find univariate spline interpolation for 2.1, 2.2 ... 2.9 for the following non linear points:
from scipy.interpolate import UnivariateSpline
import numpy as np
xs = np.arange(10)
ys = xs**2 + np.sin(xs) + 1
interp_func = UnivariateSpline(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)

RESULT
Thus the feature study of SciPy was completed successfully.
1(d). Explore the features of Pandas

AIM:
To learn the different features provided by Pandas package.

ALGORITHM:
1. Install the Pandas package
2. Study all the features of Pandas package.

Pandas
● Pandas is a Python library used for working with data sets.
● It has functions for analyzing, cleaning, exploring, and manipulating data.
● Pandas allows us to analyze big data and make conclusions based on statistical
theories.
● Pandas can clean messy data sets, and make them readable and relevant.

Installation of Pandas
Install it using this command:
C:\Users\Your Name>pip install pandas

Import Pandas
Once pandas is installed, import it in your applications by adding the import keyword:

import pandas
Pandas is usually imported under the alias pd:
import pandas as pd

Features
1. Series
● A Pandas Series is like a column in a table.
● It is a one-dimensional array holding data of any type.
● Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

1.1 Create Labels

With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
1.2 Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

2. DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

3. Read CSV
A simple way to store big data sets is to use CSV files (comma separated values files). CSV
files contain plain text and are a well-known format that can be read by everyone,
including Pandas.
Example
To print maximum rows in a CSV file
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)

4. Read JSON
● Big data sets are often stored, or extracted as JSON.
● JSON is plain text, but has the format of an object, and is well known in the world
of programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
5. Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the
head() method. The head() method returns the headers and a specified number of rows,
starting from the top.
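
A short sketch of head() in use. The DataFrame here is made-up in-memory data so that the example does not depend on data.csv, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical in-memory data so the example does not depend on data.csv.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 60, 45],
    "Pulse": [110, 117, 103, 109, 117, 102, 110],
})

print(df.head())   # first 5 rows by default
print(df.head(3))  # first 3 rows
```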

5.1 Info About the Data

The DataFrames object has a method called info(), that gives you more information
about the data set.
Example
Print information about the data:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.info())
6. Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
● Empty cells
● Data in wrong format
● Wrong data
● Duplicates

6.1 Empty Cells

6.1.1 Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not
have a big impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
inplace argument
With inplace = True, dropna() removes all rows with NULL values from the original
DataFrame instead of returning a new one:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())

6.1.2 Replace Empty Values

Another way of dealing with empty cells is to insert a new value instead.
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

6.1.3 Replace Using Mean, Median, or Mode

A common way to replace empty cells is to calculate the mean, median or mode
value of the column. Pandas uses the mean(), median() and mode() methods to calculate the
respective values for a specified column:
mean()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
# assign the result back; calling fillna(inplace = True) on a selected column is unreliable in recent pandas
df["Calories"] = df["Calories"].fillna(x)
print(df.to_string())
median()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
mode()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)

6.2 Data of Wrong Format

Cells with data of wrong format can make it difficult, or even impossible, to analyze
data. To fix it, you have two options: remove the rows, or convert all cells in the columns
into the same format.
Example
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

6.2.1 Removing Rows

Remove rows with a NULL value in the "Date" column:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())

6.3 Fixing Wrong Data

6.3.1 Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be
wrong, like if someone registered "199" instead of "1.99".
6.3.2 Replacing Values
One way to fix wrong values is to replace them with something else.
Example
Set "Duration" = 45 in row 7:
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7,'Duration'] = 45
print(df.to_string())

6.3.3 Removing Rows

Another way of handling wrong data is to remove the rows that contains wrong data.
Example
Delete rows where "Duration" is higher than 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())
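
The same rows can be removed without an explicit loop by filtering with a boolean mask. This is an alternative sketch, not the method used above. The sample values are made up to stand in for data.csv, and the name cleaned is illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv.
df = pd.DataFrame({"Duration": [60, 450, 45, 120, 130]})

# Drop the rows where Duration is higher than 120, keeping all others.
cleaned = df[~(df["Duration"] > 120)]
print(cleaned.to_string())
```

The mask is evaluated for the whole column at once, which is typically faster than dropping rows one by one on large data sets.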

6.4 Removing Duplicates

6.4.1 Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
duplicated() method
import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())

6.4.2 Removing Duplicates

To remove duplicates, use the drop_duplicates() method.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
7. Plotting
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on
the screen. Pandas uses the plot() method to create diagrams.

7.1 Scatter Plot

Specify that you want a scatter plot with the kind argument: kind = 'scatter'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()

7.2 Histogram
Use the kind argument to specify that you want a histogram: kind = 'hist'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()

RESULT
Thus the feature study of Pandas was completed successfully.
1(e). Explore the features of statsmodels

AIM:
To learn the different features provided by statsmodels package.

ALGORITHM:
1. Install the statsmodels package
2. Study all the features of statsmodels package.

Statsmodels
statsmodels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests, and statistical data
exploration.

1. Linear regression models

Linear regression analysis is a statistical technique for predicting the value of one
variable(dependent variable) based on the value of another(independent variable).
Syntax:
statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none',
hasconst=None, **kwargs)
Parameters:
● endog: array like object.
● exog: array like object.
● missing: str. 'none', 'drop', and 'raise' are the available alternatives.
● hasconst: None or Bool. Indicates whether a user-supplied constant is included in the
RHS.
● **kwargs: When using the formula interface, additional arguments are utilised to set
model characteristics.
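
To make concrete what ordinary least squares estimates in the single-predictor case, the sketch below computes the intercept and slope by hand with the standard library. It illustrates the closed-form solution, not the statsmodels implementation. The function name fit_line and the sample points are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up points lying exactly on y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)  # 1.0 2.0
```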

Step 1: Import packages.

Importing the required packages is the first step of modeling. The pandas, NumPy,
and statsmodels packages are imported.
import numpy as np
import pandas as pd
import statsmodels.api as sm
Step 2: Loading data
The CSV file is read using the pandas.read_csv() method.
df = pd.read_csv('headbrain1.csv')
df.head()
Visualizing the data:
By using the matplotlib and seaborn packages, we visualize the data. The sns.regplot()
function helps us create a regression plot.
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('headbrain1.csv')
# recent seaborn versions require x and y to be passed as keywords
sns.regplot(x='Head Size(cm^3)', y='Brain Weight(grams)', data=df)
plt.show()
Step 3: Setting a hypothesis.
Null hypothesis (H0): There is no relationship between head size and brain weight.
Alternative hypothesis (Ha): There is a relationship between head size and brain
weight.
Step 4: Fitting the model
The ols() function of the statsmodels formula interface (statsmodels.formula.api) is used
to set up an ordinary least squares model, and the fit() method is used to fit the data to it.
dependent_column ~ independent_columns:
the left side of the ~ operator contains the name of the dependent variable (the predicted
column) and the right side contains the independent variables.
import statsmodels.formula.api as smf
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Brain_weight ~ Head_size', data=df).fit()
Step 5: Summary of the model.
All the summary statistics of the linear regression model are returned by the
model.summary() method. The p-value and many other statistics can be read from this
summary. Predictions are obtained separately with the model.predict() method.
print(model.summary())
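The five steps above can be sketched end to end without the CSV file. The block below is only an illustration: the head-size and brain-weight values are randomly generated (they are not the headbrain1.csv data), and the names df_demo, model_demo, p_value and pred are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the head/brain data (assumption: roughly linear relation)
rng = np.random.default_rng(1)
head = rng.uniform(3000, 4500, size=60)
brain = 300 + 0.27 * head + rng.normal(0, 40, size=60)
df_demo = pd.DataFrame({"Head_size": head, "Brain_weight": brain})

# Dependent variable on the left of ~, independent variable on the right
model_demo = smf.ols("Brain_weight ~ Head_size", data=df_demo).fit()

# The p-value of the slope is used to decide between H0 and Ha
p_value = model_demo.pvalues["Head_size"]
print("slope:", model_demo.params["Head_size"], "p-value:", p_value)

# Predictions come from predict(), not from summary()
pred = model_demo.predict(pd.DataFrame({"Head_size": [4000]}))
print("predicted brain weight for head size 4000:", pred.iloc[0])
```

A small p-value for Head_size is evidence against the null hypothesis of no relationship.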
2. Survival analysis
The statsmodels.api.SurvfuncRight class can be used to estimate survival functions
using data that may be right-censored. The duration.survdiff function provides a test
procedure for comparing survival distributions.
Example:
# Importing libraries
import statsmodels.api as sm
X = sm.datasets.get_rdataset("Moore", "carData").data
# Filtering data of low fcategory
X = X[X['fcategory'] == "low"]
# Creating SurvfuncRight model
model = sm.SurvfuncRight(X["conformity"], X["fscore"])
# Model Summary
model.summary()

RESULT
Thus a few important features of statsmodels were studied successfully.
2. Working with Numpy arrays
AIM:
To work with different features provided by Numpy arrays.

ALGORITHM:
1. Install the numpy package
2. Work with all the features of numpy array.

Arrays
1. Creating Arrays
● 0-D Arrays
Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)
print(arr)
● 1-D Arrays
An array that has 0-D arrays as its elements is called 1-D array.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
● 2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
● 3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Example:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers
representing the dimension and the index of the element.
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated integers
representing the dimensions and the index of the element.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])

2. Array Slicing
● Slicing in python means taking elements from one given index to another given index.
● We pass slice instead of index like this: [start:end].
● We can also define the step, like this: [start:end:step].
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

3. Data Types
NumPy has some extra data types, and refer to data types with one character,
like i for integers, u for unsigned integers etc.
Example:
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)

4. Copy & View
4.1 Copy:
Make a copy
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)

4.2 View:
Make a view
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42
print(arr)
print(x)

5. Array Shape & Reshaping
5.1 Array Shape
NumPy arrays have an attribute called shape that returns a tuple with each index
having the number of corresponding elements.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

5.2 Array Reshaping
● Reshaping means changing the shape of an array.
● By reshaping we can add or remove dimensions or change number of elements in each dimension.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

6. Array Iterating
● Iterating means going through elements one by one.
● As we deal with multi-dimensional arrays in numpy, we can do this using basic for loop of python.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    print(x)

7. Joining Array
Joining means putting contents of two or more arrays in a single array.
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)

8. Splitting Array
Splitting is reverse operation of Joining.
Joining merges multiple arrays into one and Splitting breaks one array into multiple.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)

9. Searching Arrays
We can search an array for a certain value, and return the indexes that get a match.
To search an array, use the where() method.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)

10. Sorting Arrays
● Sorting means putting elements in an ordered sequence.
● Ordered sequence is any sequence that has an order corresponding to elements, like
numeric or alphabetical, ascending or descending.
● The NumPy ndarray object has a function called sort(), that will sort a specified array.
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))

11. Filtering Arrays
Getting some elements out of an existing array and creating a new array out
of them is called filtering. In NumPy, you filter an array using a boolean index list.
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)

RESULT
Thus the important features of numpy array was completed successfully.
3. Working with DataFrame

AIM:
To work with dataframe provided by pandas.

ALGORITHM:
1. Install the pandas package
2. Work with all the features of dataframe.

1. DataFrame
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

2. Locate Row
As you can see from the result above, the DataFrame is like a table with rows and
columns. Pandas uses the loc attribute to return one or more specified row(s).
Example
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df.loc[0])
3. Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

4. Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df.loc["day2"])
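As a small extension of the two loc examples, the sketch below selects several rows at once by passing a list of labels, and also selects a row by position with iloc. It reuses the same calories/duration data; the variable names df_demo, two_rows and first_row are hypothetical.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df_demo = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns a DataFrame containing just those rows
two_rows = df_demo.loc[["day1", "day3"]]
print(two_rows)

# iloc selects by integer position regardless of the index labels
first_row = df_demo.iloc[0]
print(first_row)
```

Passing a single label returns a Series, while passing a list of labels returns a DataFrame.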

5. Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)

RESULT
Thus the dataframe features of pandas were explored successfully.
Ex.No. 4 Reading data from iris data set and doing descriptive analytics on the Iris
data set

AIM:
To read data from files and explore various commands for doing descriptive
analytics on the Iris data set.

ALGORITHM:
1. Download the “Iris.csv” file from GitHub.com.
2. Load “Iris.csv” into Google Colab.
3. Perform descriptive analysis on the Iris file.

Importing Iris.csv
● Log in to Google Colab using a Gmail account.
● Log in to Google Drive and create a folder with the required name.
● Move the Iris file from the local system to Google Drive.
● In Colab, click on the “Files” icon and then click on “Mount Drive”.
● Code will appear in a code cell; execute that code.
● Authentication is required; complete the authentication.
● After successful verification, the message “Mounted at /content/drive” is shown.
● Find the Iris.csv file and copy its path for future reference.

About Iris Database

The Iris dataset is considered the Hello World of data science. It contains five columns,
namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant; researchers have measured various features of different iris flowers
and recorded them digitally.
The read_csv() method is used to read CSV files.

Example:
import pandas as pd

# Reading the CSV file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Printing top 5 rows
df.head()

Getting Information about the Dataset

We will use the shape attribute to get the shape of the dataset.
df.shape -> returns the number of rows and columns
df.info() -> returns the column names, non-null counts and data types.
Checking Missing Values
We will check if our data contains any missing values or not. We will use the isnull()
method.
Example:
df.isnull().sum()
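The checks above can be tried on a tiny frame before running them on the real file. This is a minimal sketch: the four rows below are made up (they are not taken from iris.csv), and the column names simply follow the sepal.length / variety naming used in this exercise.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one repeated row
sample = pd.DataFrame({
    "sepal.length": [5.1, 4.9, np.nan, 5.1],
    "sepal.width": [3.5, 3.0, 3.2, 3.5],
    "variety": ["Setosa", "Setosa", "Versicolor", "Setosa"],
})

n_rows, n_cols = sample.shape                 # number of rows and columns
missing_per_column = sample.isnull().sum()    # count of missing values per column
n_duplicates = sample.duplicated().sum()      # rows identical to an earlier row

# Drop rows with missing values, then drop exact duplicate rows
cleaned = sample.dropna().drop_duplicates()
print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
print(cleaned)
```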

Checking Duplicates
Let’s see if the dataset contains any duplicates or not. The Pandas drop_duplicates()
method helps in removing duplicates from the data frame. With subset="variety", only
the first row of each variety is kept.

Example:
data = df.drop_duplicates(subset="variety")
data

Data Visualization
Visualizing the target column
Our target column will be the species column (named variety in this file) because at the
end we will need the result according to the species only. Let’s see a countplot for species.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='variety', data=df)
plt.show()

Relation between variables

We will see the relationship between the sepal length and sepal width and also
between petal length and petal width.
Example 1: Comparing Sepal Length and Sepal Width
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='sepal.length', y='sepal.width', hue='variety', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Example 2: Comparing Petal Length and Petal Width

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='petal.length', y='petal.width', hue='variety', data=df)
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()

Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the
dataframe. Any NA values are automatically excluded. Non-numeric columns must be
left out; in recent pandas versions this is done by passing numeric_only=True.

Example:
df.corr(method='pearson', numeric_only=True)

RESULT
The Iris.csv file was loaded into Google Colab and descriptive analytics was performed on
the Iris data set successfully.
5(a). Perform Univariate analysis on the diabetes data set

AIM:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for Univariate
analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform analysis like Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.

Univariate analysis
● The term univariate analysis refers to the analysis of one variable.
● Common ways to perform univariate analysis on one variable include summary
statistics, which measure the center, spread and shape of the values, and frequency tables:
1. Central tendency: mean, median, mode
2. Dispersion: variance, standard deviation, range, interquartile
range (IQR)
3. Skewness: asymmetry of the data about the mean
4. Kurtosis: heaviness of the tails of the distribution
5. Frequency table: describes how often different values occur.
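Before applying these measures to the diabetes files, it can help to compute them on a handful of numbers that can be checked by hand. The sketch below uses eight made-up values (not taken from either data set); the dictionary name summary is hypothetical. Note that pandas var() and std() use the sample formulas (division by n - 1).

```python
import pandas as pd

# Made-up values chosen so the results are easy to verify by hand
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],        # mode() can return several values
    "variance": values.var(),             # sample variance (ddof=1)
    "std": values.std(),                  # sample standard deviation
    "range": values.max() - values.min(),
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    "skewness": values.skew(),
    "kurtosis": values.kurt(),
}
for name, value in summary.items():
    print(name, value)

# Frequency table: how often each distinct value occurs
frequency = values.value_counts().sort_index()
print(frequency)
```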

File Importing:
# Reading the UCI file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Printing top 5 rows
df.head()
# Reading the Pima file
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Printing top 5 rows
df.head()

1. Central Tendency
We can use the following syntax to calculate various summary statistics like Mean,
Median and Mode.

1.1 Mean:
It is the average value of the given numeric values.
● Mean of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mean of the numeric columns of the UCI data
df.mean(axis=0, numeric_only=True)
● Mean of Pima data
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mean of Pima data
df.mean(axis=0)

1.2 Median:
It is the middle value of the given values when they are sorted.
● Median of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Median of the numeric columns of the UCI data
df.median(axis=0, numeric_only=True)

● Median of Pima data

import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Median of Pima data
df.median(axis=0)

1.3 Mode:
It is the most frequently occurring value of the given variable.
● Mode of UCI data
import pandas as pd
# Reading the UCI file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Mode of UCI data
df.mode(axis=0)

● Mode of Pima data

import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Mode of Pima data
df.mode(axis=0)

2. Dispersion
2.1 Variance
Variance measures how far the values are spread around the mean; it is the average of
the squared deviations from the mean.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# variance of the BMI column
df.loc[:,"BMI"].var()

2.2 Standard deviation

Standard deviation is the square root of the variance.
Example
import pandas as pd
# Reading the Pima file
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Standard deviation of the BMI column
df.loc[:,"BMI"].std()

2.3 Range
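The range is the difference between the maximum and minimum values of a data set.
Example
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
print("Range is:", df.BloodPressure.max() - df.BloodPressure.min())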

2.4 Interquartile range

Example
# Importing important libraries
import numpy as np
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# Removing the outliers
def removeOutliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1
    print("IQR value for column %s is: %s" % (col, IQR))
    # results are stored in global variables so that the loop below can reuse them
    global outlier_free_list
    global filtered_data
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    outlier_free_list = [x for x in data[col] if (
        (x > lower_range) & (x < upper_range))]
    filtered_data = data.loc[data[col].isin(outlier_free_list)]
# Filter the first column on the full data, then each later column on the
# already filtered data
for i in data.columns:
    if i == data.columns[0]:
        removeOutliers(data, i)
    else:
        removeOutliers(filtered_data, i)
# Assigning filtered data back to our original variable
data = filtered_data
print("Shape of data after outlier removal is: ", data.shape)
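The same 1.5 × IQR rule can also be written without global variables, by returning the filtered frame from the function. The following is a sketch of that alternative, not the manual's own method: remove_outliers_iqr is a hypothetical helper name, and the small BMI/Age table is made up for illustration.

```python
import pandas as pd

def remove_outliers_iqr(frame, col, k=1.5):
    """Return the rows of frame whose value in col lies strictly inside
    (Q1 - k*IQR, Q3 + k*IQR)."""
    q1 = frame[col].quantile(0.25)
    q3 = frame[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = (frame[col] > lower) & (frame[col] < upper)
    return frame.loc[keep]

# Made-up data with one obvious outlier in BMI
demo = pd.DataFrame({
    "BMI": [10, 12, 11, 13, 12, 100],
    "Age": [30, 31, 29, 35, 33, 32],
})

filtered = demo
for col in demo.columns:
    # each column is filtered on the rows that survived the previous columns
    filtered = remove_outliers_iqr(filtered, col)

print("Shape after outlier removal:", filtered.shape)
```

Returning the result keeps the function reusable on any DataFrame and makes the column-by-column filtering explicit in the loop.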

3. Skewness
● Skewness essentially measures the asymmetry of the distribution.
Example
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# skip the NA values
# find the skewness of each column (axis = 0)
df.skew(axis = 0, skipna = True)
4. Kurtosis
Kurtosis determines the heaviness of the distribution tails.
Example
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
df['BloodPressure'].kurtosis()

5. Frequency
Example
# import packages
import pandas as pd
import numpy as np
# reading csv file
data = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# one-way frequency table for the Age column
freq_table = pd.crosstab(index=data['Age'], columns='count')
# frequency table as a proportion of all rows
freq_table = freq_table/len(data)
freq_table

RESULT
Thus the Univariate analysis on the Diabetes data of UCI and Pima was performed
successfully.
5(b). Perform Bivariate analysis on the diabetes data set.

AIM:
To use the UCI and Pima Indians Diabetes data set for Bivariate analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform various methods of bivariate analysis.

Bivariate analysis
The term bivariate analysis refers to the analysis of two variables. The purpose of
bivariate analysis is to understand the relationship between two variables.
There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression

1. Scatterplots
A scatterplot is a type of data display that shows the relationship between two
numerical variables.
Example
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Diabetes Outcome
g1 = data.loc[data.Outcome==1,:]
# Pregnancies, Glucose and Diabetes relation
g1.plot.scatter('Pregnancies', 'Glucose');

2. Correlation Coefficients
The correlation coefficient is a statistical measure of the strength of the relationship
between the relative movements of two variables. The values range between -1.0 and 1.0.
Example
# Import those libraries
import pandas as pd
from scipy.stats import pearsonr
# Import your data into Python
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
# Convert dataframe into series
list1 = df['BloodPressure']
list2 = df['SkinThickness']
# Apply the pearsonr()
corr, _ = pearsonr(list1, list2)
print('Pearsons correlation: %.3f' % corr)
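To see what the Pearson coefficient measures, it can be computed directly from its definition: the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The sketch below does this with NumPy and compares the result with np.corrcoef; the function name pearson_r and the six value pairs are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up readings, not taken from the Pima file
blood_pressure = [72, 66, 64, 66, 40, 74]
skin_thickness = [35, 29, 0, 23, 35, 0]

r = pearson_r(blood_pressure, skin_thickness)
print("Pearson r from the definition:", r)
print("Pearson r from np.corrcoef:  ", np.corrcoef(blood_pressure, skin_thickness)[0, 1])
```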

3. Simple Linear Regression

Simple linear regression is a statistical method that we can use to find a relationship
between two variables and make predictions. The independent variable, or the variable
used to predict the dependent variable is denoted as x. The dependent variable, or the
outcome/output, is denoted as y.
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('/content/drive/MyDrive/Data_Science/Pima_diabetes.csv')
# Simple linear regression uses a single predictor
X = dataset.iloc[:, [0]].values  # first column as the independent variable (2-D array)
y = dataset.iloc[:, 1].values    # second column as the dependent variable
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
random_state=0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
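For a single predictor, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) minus slope times mean(x). The sketch below works this out with NumPy on five made-up points and checks it against np.polyfit; it illustrates the formula and is not a replacement for the scikit-learn code above.

```python
import numpy as np

# Made-up points lying close to the line y = 3 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.0])

# Closed-form least-squares estimates for one predictor
dx = x - x.mean()
slope = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()
print("slope:", slope, "intercept:", intercept)

# Prediction for a new x value
x_new = 6.0
print("prediction at x = 6:", intercept + slope * x_new)

# np.polyfit with degree 1 solves the same least-squares problem
check_slope, check_intercept = np.polyfit(x, y, 1)
```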

RESULT:
Thus the Bivariate analysis on the diabetes data set was executed successfully.
5(c). Perform Multiple Regression Analysis on the diabetes data set

AIM:
To use UCI and Pima Indians Diabetes data set for Multiple Regression Analysis.

ALGORITHM:
1. Download diabetes data set from UCI and Pima Indians Diabetes data set.
2. Load the above data files into google colab.
3. Perform multiple regression analysis on data sets.

Multiple Regression Analysis

Multiple regression is like linear regression, but with more than one independent
variable, meaning that we try to predict a value based on two or more variables.
Example
# Pima_diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/Pima_diabetes.csv")
X = df[['Pregnancies', 'Glucose']]
y = df['BloodPressure']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the Blood Pressure based on Pregnancies and Glucose level:
predictedBP = regr.predict([[4, 120]])
print(predictedBP)

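# UCI-Diabetes
import pandas
from sklearn import linear_model
df = pandas.read_csv("/content/drive/MyDrive/Data_Science/UCI_diabetes.csv")
# Assumption: Time is stored as text in HH:MM form; convert it to minutes since
# midnight so that it can be used as a numeric predictor
time_parts = df['Time'].str.split(':', expand=True).astype(int)
df['Minutes'] = time_parts[0] * 60 + time_parts[1]
X = df[['Minutes', 'Code']]
y = df['Value']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the Value based on Time (13:23) and Code (46):
predictedValue = regr.predict([[13 * 60 + 23, 46]])
print(predictedValue)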
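A quick way to check that a multiple regression is set up correctly is to fit it on data whose true coefficients are known. In the sketch below the response is generated exactly as 1 + 2·a + 3·b from two random predictors, so the fitted model should recover those numbers; all data and names (X_demo, y_demo, reg) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two synthetic predictors and a response built from known coefficients
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

reg = LinearRegression().fit(X_demo, y_demo)
print("coefficients:", reg.coef_)     # one coefficient per predictor
print("intercept:", reg.intercept_)

# Predict the response for a = 4, b = 5
pred = reg.predict([[4, 5]])
print("prediction:", pred[0])
```

Each coefficient describes the change in the predicted value for a one-unit change in that predictor while the other predictor is held fixed.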

RESULT
Thus the Multiple Regression analysis on the Diabetes data of UCI and Pima was
performed successfully.

6(a). Apply and explore Normal curves & Histograms plotting functions on UCI-Iris
data sets

AIM:
To apply and explore Normal curves & Histograms plotting functions on UCI-Iris
data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve and Histograms for Iris data set.

Normal Curves
The normal curve is the probability density function of the normal distribution; it tells
how the data values are distributed. It is the most important probability distribution
used in statistics because many real-world measurements approximately follow it.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of the same column
mean = df["sepal.length"].mean()
sd = df["sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
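The curve drawn by norm.pdf is the normal density f(x) = exp(-(x - mean)² / (2·sd²)) / (sd·√(2π)). The sketch below evaluates that formula directly and compares it with scipy; the mean and standard deviation used here are illustrative numbers, not values computed from iris.csv.

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters (assumed, not computed from the data file)
mean, sd = 5.8, 0.8

# Evaluate the density over mean ± 4 standard deviations
x = np.linspace(mean - 4 * sd, mean + 4 * sd, 200)
pdf_manual = np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
pdf_scipy = norm.pdf(x, loc=mean, scale=sd)

print("largest difference:", np.abs(pdf_manual - pdf_scipy).max())
print("peak height at the mean:", norm.pdf(mean, loc=mean, scale=sd))
```

Restricting the x axis to a few standard deviations around the mean keeps the bell shape visible when the curve is plotted.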

Histograms plotting functions

A histogram is used to represent data grouped into ranges. It is an accurate method for
the graphical representation of a numerical data distribution. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives information about
frequency.
Example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('/content/drive/MyDrive/Data_Science/iris.csv')
data = df['sepal.length']
bins = np.arange(min(data), max(data) + 1, 1)
plt.hist(data, bins = bins, density = True)
plt.xlabel('sepal.length')
plt.ylabel('density')
plt.show()
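With density = True the bar heights are scaled so that the total area of the bars is 1, rather than showing raw counts. The sketch below checks this with np.histogram, which performs the same binning that plt.hist draws; the ten values are made up and only resemble sepal lengths.

```python
import numpy as np

# Made-up values in the range of iris sepal lengths
data = np.array([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.9])

# Bin edges of width 1 starting at the minimum, as in the example above
bins = np.arange(data.min(), data.max() + 1, 1)

heights, edges = np.histogram(data, bins=bins, density=True)
widths = np.diff(edges)

# Area of each bar is height * width; the areas sum to 1
area = (heights * widths).sum()
print("bin edges:", edges)
print("bar heights:", heights)
print("total area:", area)
```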

RESULT
Thus the UCI data set was plotted using the Normal Curve and Histogram plotting
functions successfully.
6(b). Density and contour plotting functions on UCI-Iris data sets.

AIM:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.

Density Plotting
Density Plot is a type of data visualization tool. It is a variation of the histogram that
uses ‘kernel smoothing’ while plotting the values. It is a continuous and smooth version of a
histogram inferred from a data.

Example - Density plot of several variables

# libraries & dataset
import seaborn as sns
import matplotlib.pyplot as plt
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
sns.set(style="darkgrid")
df = sns.load_dataset('iris')
# plotting both distributions on the same figure
# (fill=True replaces the deprecated shade=True in seaborn 0.11.0 or above)
fig = sns.kdeplot(df['sepal_width'], fill=True, color="r")
fig = sns.kdeplot(df['sepal_length'], fill=True, color="b")
plt.show()

Contour plotting
Contour plots, also called level plots, are a tool for doing multivariate analysis and
visualizing 3-D plots in 2-D space. If we consider X and Y as the variables we want to plot,
then the response Z is plotted as slices on the X-Y plane, which is why contours are
sometimes referred to as Z-slices or iso-response values.
Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# The response Z is taken here to be the number of flowers in each
# (sepal.length, sepal.width) cell of an 8 x 8 grid
counts, xedges, yedges = np.histogram2d(df["sepal.length"], df["sepal.width"], bins=8)
# Cell centres are the X and Y coordinates of the grid
x = (xedges[:-1] + xedges[1:]) / 2
y = (yedges[:-1] + yedges[1:]) / 2
mpl.rcParams['font.size'] = 14
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
fig, ax = plt.subplots()
levels = np.linspace(counts.min(), counts.max(), 11)
ticks = np.linspace(counts.min(), counts.max(), 6)
# contourf expects Z with shape (len(y), len(x)), hence the transpose
CS = ax.contourf(x, y, counts.T, cmap="RdBu", levels=levels)
ax.set_xlabel('sepal.length')
ax.set_ylabel('sepal.width')
fig.colorbar(CS, format="%.3f", ticks=ticks)
plt.show()

RESULT
Thus the UCI data set was plotted using the Density & Contour plotting functions
successfully.
6(c). Correlation and scatter plotting functions on UCI data sets.

AIM:
To apply and explore Correlation & Scatter plotting functions on UCI-Iris data sets.

ALGORITHM:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.

Correlation Matrix Plotting

Correlation gives an indication of how related the changes are between two
variables.
This is useful to know because some machine learning algorithms, such as linear and
logistic regression, can perform poorly if there are highly correlated input variables in the data.

Example
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two dimensions,
one axis for each attribute. Drawing all these scatterplots together is called a scatterplot
matrix.

Example

# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()

RESULT
Thus the UCI data set was plotted using the Correlation and Scatter plotting functions
successfully.
7. Visualizing Geographic Data with Basemap

AIM:
To visualize Geographic Data with Basemap using Zomato geographic data.

ALGORITHM:
1. Study the basics of Basemap.
2. Use Zomato data to plot city names and restaurants details.

Basemap Introduction
Basemap is a toolkit under the Python visualization library Matplotlib. Its main
function is to draw 2D maps, which are important for visualizing spatial data. basemap itself
does not do any plotting, but provides the ability to transform coordinates into one of 25
different map projections.

441 pages
Vineet Intodia CV
No ratings yet
Vineet Intodia CV
1 page
UNIT 5 Vocab
No ratings yet
UNIT 5 Vocab
5 pages
Chapter 3 Riph
No ratings yet
Chapter 3 Riph
3 pages
HSM Script
No ratings yet
HSM Script
18 pages
Research example-WPS Office
No ratings yet
Research example-WPS Office
2 pages
Ruby in Her Own Time Homework Coversheet 2018
No ratings yet
Ruby in Her Own Time Homework Coversheet 2018
1 page
Acd Syllabus
No ratings yet
Acd Syllabus
2 pages
Qabool Hai by Shagufta Kanwal Novelatte
No ratings yet
Qabool Hai by Shagufta Kanwal Novelatte
68 pages
The Creative Process Scott Jeffrey
No ratings yet
The Creative Process Scott Jeffrey
16 pages
Boundaries (Easy) Answers
No ratings yet
Boundaries (Easy) Answers
28 pages
Math Syllabus (2025-27)
No ratings yet
Math Syllabus (2025-27)
43 pages
SPC 6 R 06
No ratings yet
SPC 6 R 06
833 pages
Social Intelligence
No ratings yet
Social Intelligence
2 pages
Entry Level Software Developer Resume Examples
100% (2)
Entry Level Software Developer Resume Examples
5 pages
Refresher Course Report
100% (1)
Refresher Course Report
3 pages
St. Peter's High School: Instructions
No ratings yet
St. Peter's High School: Instructions
4 pages


1(a).

1.2 Array Shape & Reshaping

1.2.2 Array Reshaping

2.4 Poisson Distribution

2.6 Logistic Distribution

2.7 Multinomial Distribution

Raise the values in arr1 to the power of values in arr2:

3.2 Rounding Decimals

3.7 LCM (Lowest Common Multiple)

3.8 GCD (Greatest Common Divisor)

3.9 Trigonometric Functions

Find sine values for all of the values in arr:

3.10 Hyperbolic Functions

Find cosh values for all of the values in arr:

3.11 Set Operations

3.11.3 Finding Intersection

Force: Return the specified unit in newton

2.1 CSR (Compressed Sparse Row) Matrix

3.2 Depth First Order

Traverse the graph depth first for given adjacency matrix:

4.2 Convex Hull

4.4.2 Cosine Distance

Import Data from Matlab Format

6.2 Spline Interpolation

1.1 Create Labels

5.1 Info About the Data

6.1 Empty Cells

6.1.2 Replace Empty Values

6.1.3 Replace Using Mean, Median, or Mode

6.2 Data of Wrong Format

6.2.1 Removing Rows

6.3 Fixing Wrong Data

6.3.3 Removing Rows

6.4 Removing Duplicates

6.4.2 Removing Duplicates

7.1 Scatter Plot

1. Linear regression models

Step 1: Import packages.

4. Locate Named Indexes

5. Load Files Into a DataFrame

About Iris Database

Example: import pandas as pd

Getting Information about the Dataset

Relation between variables

Example 2: Comparing Petal Length and Petal Width

● Median of Pima data

● Mode of Pima data

2.2 Standard deviation

2.4 Interquartile range

print("IQR value for column %s is: %s" % (col,

3. Simple Linear Regression

Multiple Regression Analysis

Histograms plotting functions

Example - Density plot of several variables

Correlation Matrix Plotting
