
DS - Ex-1

The document outlines the installation process for Python, Jupyter, and various packages including NumPy, SciPy, Statsmodels, and Pandas. It also explores the features of NumPy, detailing its capabilities in array manipulation, random number generation, and universal functions. Additionally, it provides examples of different statistical distributions and arithmetic operations using NumPy.

Uploaded by

vishnupriyapacet
Copyright
© © All Rights Reserved

1(A). DOWNLOAD AND INSTALL THE DIFFERENT PACKAGES LIKE NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS

AIM:
To learn how to download and install the different packages of NumPy, SciPy, Jupyter,
Statsmodels and Pandas.

ALGORITHM:
1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install packages such as NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.

Python Installation
➢ Open the official Python web site (https://www.python.org/).
➢ Downloads ==> Windows ==> select a recent release (requires Windows 10 or above).
➢ Install "python-3.10.6-amd64.exe".

Jupyter Installation
➢ Open a command prompt and enter "python --version" to check whether Python was installed properly.
➢ If the installation is correct, it returns the version of Python.
➢ Enter "pip --version" to check whether the Python package manager was installed properly.
➢ If the installation is correct, it returns the version of the Python package manager.
➢ Enter the command "pip install jupyterlab".
➢ Enter the command "pip install jupyter notebook".
➢ If pip reports that a newer version is available, copy the upgrade command from its output, then paste and run it to upgrade.
➢ Create a folder and name it appropriately.
➢ Open a command prompt and change into that folder. Type "jupyter notebook" and press Enter.
➢ A new Jupyter notebook now opens for use.

pip Installation
Installation of NumPy
➢ pip install numpy
Installation of SciPy
➢ pip install scipy
Installation of Statsmodels
➢ pip install statsmodels
Installation of Pandas
➢ pip install pandas

P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY
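
One way to carry out the verification step is to try importing each package from Python. The following is a minimal sketch, not part of the original exercise; the package list and the variable names are chosen here for illustration.

```python
import importlib

# Packages from the installation steps above.
packages = ["numpy", "scipy", "statsmodels", "pandas"]

versions = {}
for name in packages:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None   # the package is not installed

for name, version in versions.items():
    print(name, version if version else "NOT INSTALLED")
```

A package reported as NOT INSTALLED can be installed again with the matching pip command.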

Sample Output

RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed properly and their execution was verified.


1(B). EXPLORE THE FEATURES OF NUMPY

AIM:
To learn the different features provided by NumPy package.

ALGORITHM:
1. Install the NumPy package
2. Study all the features of NumPy package.

NumPy
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

Features
These are the important features of NumPy
1. Array 2. Random 3. Universal Functions

1. Arrays
1.1 Array Slicing
• Slicing in Python means taking elements from one given index to another given index.
• We pass a slice instead of an index, like this: [start:end].
• We can also define the step, like this: [start:end:step].
• If we don't pass start, it is considered 0.
• If we don't pass end, it is considered the length of the array in that dimension.
• If we don't pass step, it is considered 1.

import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])

1.2 Array Shape & Reshaping


1.2.1 Array Shape
NumPy arrays have an attribute called shape that returns a tuple in which each index holds the number of elements in the corresponding dimension.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)

1.2.2 Array Reshaping
• Reshaping means changing the shape of an array.
• The shape of an array is the number of elements in each dimension.
• By reshaping we can add or remove dimensions or change the number of elements in each dimension.
• Convert the following 1-D array with 12 elements into a 3-D array.

The outermost dimension will have 2 arrays that each contain 3 arrays, each with 2 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
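
Reshaping only works when the new shape holds exactly as many elements as the original array. The sketch below illustrates this under that assumption; the use of -1 to let NumPy infer one dimension is an addition not covered in the text above.

```python
import numpy as np

arr = np.arange(1, 13)          # 12 elements: 1 .. 12
newarr = arr.reshape(2, 3, 2)   # 2 * 3 * 2 == 12, so this works
print(newarr.shape)             # (2, 3, 2)

# One dimension may be given as -1; NumPy infers it from the element count.
inferred = arr.reshape(2, -1, 2)
print(inferred.shape)           # (2, 3, 2)

# A shape whose sizes do not multiply to 12 raises ValueError.
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)           # True
```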

2. Random
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of
[1, 2, 3] and vice-versa. The NumPy Random module provides two methods for this:
shuffle() and permutation().
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
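
The two methods differ in whether they change the original array. A short sketch of that difference, with variable names chosen here for illustration:

```python
from numpy import random
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# permutation() returns a shuffled copy and leaves arr untouched.
shuffled_copy = random.permutation(arr)
unchanged = arr.tolist() == [1, 2, 3, 4, 5]
print(shuffled_copy, unchanged)

# shuffle() rearranges arr itself and returns None.
result = random.shuffle(arr)
print(arr, result)
```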

2.1 Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to
visualize random distributions.
import matplotlib.pyplot as plt
import seaborn as sns
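# distplot() is deprecated in recent seaborn releases; histplot() with kde=True draws a similar plot.
sns.histplot([0, 1, 2, 3, 4, 5], kde=True)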
plt.show()

2.2 Normal (Gaussian) Distribution


It is also called the Gaussian Distribution after the German mathematician Carl
Friedrich Gauss. It fits the probability distribution of many events, eg. IQ Scores, Heartbeat
etc.
It uses the random.normal() method to get a Normal Data Distribution.
It has three parameters:
loc - (Mean) where the peak of the bell exists.
scale - (Standard Deviation) how flat the graph distribution should be.
size - The shape of the returned array.
Generate a random normal distribution of size 2x3 with mean at 1 and standard
deviation of 2:
from numpy import random
x = random.normal(loc=1, scale=2, size=(2, 3))
print(x)


2.3 Binomial Distribution


Binomial Distribution is a Discrete Distribution.
It describes the outcome of binary scenarios, e.g. toss of a coin, it will either be head
or tails.
It has three parameters:
n - number of trials.
p - probability of success on each trial (e.g. 0.5 for the toss of a coin).
size - The shape of the returned array.
Given 10 trials for coin toss generate 10 data points:
from numpy import random
x = random.binomial(n=10, p=0.5, size=10)
print(x)

2.4 Poisson Distribution


It estimates how many times an event can happen in a specified time, e.g. if someone eats twice a day, what is the probability that he will eat thrice?
It has two parameters:
lam - rate or known number of occurrences, e.g. 2 for the above problem.
size - The shape of the returned array.
Generate a random 1x10 distribution for occurrence 2:
from numpy import random
x = random.poisson(lam=2, size=10)
print(x)

2.5 Uniform Distribution


Used to describe probability where every event has an equal chance of occurring, e.g. generation of random numbers.
It has three parameters:
low - lower bound - default 0.0.
high - upper bound - default 1.0.
size - The shape of the returned array.
Create a 2x3 uniform distribution sample:
from numpy import random
x = random.uniform(size=(2, 3))
print(x)

2.6 Logistic Distribution


Logistic Distribution is used to describe growth.
Used extensively in machine learning in logistic regression, neural networks etc.
It has three parameters:
loc - mean, where the peak is. Default 0.
scale - spread, the flatness of the distribution (proportional to the standard deviation). Default 1.
size - The shape of the returned array.


Draw 2x3 samples from a logistic distribution with mean at 1 and scale 2.0:
from numpy import random
x = random.logistic(loc=1, scale=2, size=(2, 3))
print(x)

2.7 Multinomial Distribution


Multinomial distribution is a generalization of binomial distribution.
It describes outcomes of multinomial scenarios, unlike binomial where each outcome must be one of only two, e.g. blood type of a population, dice roll outcome.
It has three parameters:
n - number of trials (e.g. 6 rolls of a die).
pvals - list of probabilities of the outcomes (e.g. [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] for a dice roll).
size - The shape of the returned array.
Draw a sample of 6 dice rolls; the result counts how many times each face came up:
from numpy import random
x = random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
print(x)
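
Because n is the number of trials, the counts in each sample always add up to n. A small sketch that checks this; the size=3 call is an extra illustration, not part of the original example.

```python
from numpy import random

# Six rolls of a fair die: x[i] counts how often face i+1 came up.
x = random.multinomial(n=6, pvals=[1/6] * 6)
print(x, x.sum())   # the counts vary, but the sum is always 6

# size=3 repeats the 6-roll experiment three times, one row per experiment.
y = random.multinomial(n=6, pvals=[1/6] * 6, size=3)
print(y.shape)      # (3, 6)
```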

2.8 Exponential Distribution


Exponential distribution is used for describing time till next event e.g. failure/success
etc.
It has two parameters:
scale - inverse of rate ( see lam in poisson distribution ) defaults to 1.0.
size - The shape of the returned array.
Draw out a sample for exponential distribution with 2.0 scale with 2x3 size:
from numpy import random
x = random.exponential(scale=2, size=(2, 3))
print(x)

2.9 Chi Square Distribution


Chi Square distribution is used as a basis to verify the hypothesis.
It has two parameters:
df - (degree of freedom).
size - The shape of the returned array.
Draw out a sample for chi squared distribution with degree of freedom 2 with size 2x3:
from numpy import random
x = random.chisquare(df=2, size=(2, 3))
print(x)

2.10 Rayleigh Distribution


Rayleigh distribution is used in signal processing.
It has two parameters:
scale - decides how flat the distribution will be (default 1.0).
size - The shape of the returned array.


Draw out a sample for rayleigh distribution with scale of 2 with size 2x3:
from numpy import random
x = random.rayleigh(scale=2, size=(2, 3))
print(x)

2.11 Pareto Distribution


A distribution following Pareto's law i.e. 80-20 distribution (20% factors cause
80% outcome).
It has two parameters:
a - shape parameter.
size - The shape of the returned array.
Draw out a sample for pareto distribution with shape of 2 with size 2x3:
from numpy import random
x = random.pareto(a=2, size=(2, 3))
print(x)

2.12 Zipf Distribution


Zipf distributions are used to sample data based on Zipf's law.
Zipf's Law: In a collection, the nth most common term occurs 1/n times as often as the most common term. E.g. the 5th most common word in English occurs nearly 1/5th as often as the most used word.
It has two parameters:
a - distribution parameter.
size - The shape of the returned array.
Draw out a sample for zipf distribution with distribution parameter 2 with size 2x3:
from numpy import random
x = random.zipf(a=2, size=(2, 3))
print(x)

3. Universal Functions
Create Your Own ufunc (Universal)
To create your own ufunc, you have to define a function, like you do with normal functions in Python, then you add it to your NumPy ufunc library with the frompyfunc() method.
The frompyfunc() method takes the following arguments:
function - the name of the function.
inputs - the number of input arguments (arrays).
outputs - the number of output arrays.
Create your own ufunc for addition:
import numpy as np
def myadd(x, y):
    return x+y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))


3.1 Simple Arithmetic


You could use arithmetic operators + - * / directly between NumPy arrays, but this
section discusses an extension of the same where we have functions that can take any array-
like objects e.g. lists, tuples etc. and perform arithmetic conditionally.
Addition
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([10, 11, 12, 13, 14, 15])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.add(arr1, arr2)
print(newarr)
Subtraction
Subtract the values in arr2 from the values in arr1:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.subtract(arr1, arr2)
print(newarr)
Multiplication
Multiply the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([20, 21, 22, 23, 24, 25])
newarr = np.multiply(arr1, arr2)
print(newarr)
Division
Divide the values in arr1 with the values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 10, 8, 2, 33])
newarr = np.divide(arr1, arr2)
print(newarr)
Power
Raise the values in arr1 to the power of values in arr2:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])
arr2 = np.array([3, 5, 6, 8, 2, 33])
# Note: 60**33 does not fit in a 64-bit integer, so that result overflows.
newarr = np.power(arr1, arr2)
print(newarr)
Remainder
Return the remainders:
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50, 60])

arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Absolute Values
Return the absolute value of each element:
import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)

3.2 Rounding Decimals


There are primarily five ways of rounding off decimals in NumPy:
• truncation
• fix
• rounding
• floor
• ceil

3.2.1 Truncation

Remove the decimals, and return the float number closest to zero. Use
the trunc() and fix() functions.
Truncate elements of following array:
import numpy as np
arr = np.trunc([-3.1666, 3.6667])
print(arr)

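3.2.2 Rounding
The around() function increments the preceding digit by 1 if the digit that follows is >= 5, otherwise it leaves it unchanged (a value exactly halfway is rounded to the nearest even digit).
Round off 3.1666 to 2 decimal places:
import numpy as np
arr = np.around(3.1666, 2)
print(arr)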

3.2.3 Floor
The floor() function rounds off decimal to nearest lower integer.
Floor the elements of following array:
import numpy as np
arr = np.floor([-3.1666, 3.6667])
print(arr)

3.2.4 Ceil
The ceil() function rounds off decimal to nearest upper integer.
Ceil the elements of following array:
import numpy as np
arr = np.ceil([-3.1666, 3.6667])
print(arr)

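The differences between the rounding functions are easiest to see when they are applied to the same input. The following sketch does that; the variable names are chosen here for illustration.

```python
import numpy as np

values = np.array([-3.1666, 3.6667])

truncated = np.trunc(values)      # drop the decimals, toward zero: -3, 3
fixed = np.fix(values)            # same result as trunc() for these values
rounded = np.around(values, 2)    # two decimal places: -3.17, 3.67
floored = np.floor(values)        # nearest lower integer: -4, 3
ceiled = np.ceil(values)          # nearest upper integer: -3, 4

for name, result in [("trunc", truncated), ("fix", fixed), ("around", rounded),
                     ("floor", floored), ("ceil", ceiled)]:
    print(name, result)
```

Note that floor() and trunc() only differ for negative numbers, where floor() moves away from zero.
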
3.3 Logs
• NumPy provides functions to perform log at the base 2, e and 10.
• We will also explore how we can take log for any base by creating a custom ufunc. The log functions place -inf in an element when the input is 0 and nan when the input is negative, since the log cannot be computed there.
Find log at base 10 of all elements of following array:
import numpy as np
arr = np.arange(1, 10)
print(np.log10(arr))
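
A log at any base can be sketched by wrapping Python's math.log() with frompyfunc(). This is one possible approach, assuming math.log(value, base) is acceptable for the purpose; the name nplog is made up for the example.

```python
from math import log
import numpy as np

# frompyfunc(function, number_of_inputs, number_of_outputs)
nplog = np.frompyfunc(log, 2, 1)

print(nplog(100, 15))              # log of 100 at base 15

# Element-wise on array-like inputs: log2(8) and log3(27), both about 3.
values = nplog([8, 27], [2, 3])
print(values)
```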

3.4 Summations
Addition is done between two arguments, whereas summation happens over n elements.
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.add(arr1, arr2)
print(newarr)
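
The example above shows addition. For comparison, summation over the same two arrays can be sketched with np.sum(); np.cumsum() is included as a related illustration and is not mentioned in the text above.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])

total = np.sum([arr1, arr2])              # sums all six elements
per_array = np.sum([arr1, arr2], axis=1)  # one sum per input array
running = np.cumsum(arr1)                 # cumulative (partial) sums

print(total)      # 12
print(per_array)  # sums 6 and 6
print(running)    # partial sums 1, 3, 6
```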

3.5 Products
To find the product of the elements in an array, use the prod() function.
Find the product of the elements of this array:
import numpy as np
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)

3.6 Differences
• A discrete difference means subtracting two successive elements.
• To find the discrete difference, use the diff() function.
Compute discrete difference of the following array:
import numpy as np
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)

3.7 LCM (Lowest Common Multiple)


The Lowest Common Multiple is the smallest number that is a common multiple of both of the numbers.
Find the LCM of the following two numbers:
import numpy as np
num1 = 4
num2 = 6
x = np.lcm(num1, num2)
print(x)

3.8 GCD (Greatest Common Divisor)

The GCD (Greatest Common Divisor), also known as HCF (Highest Common Factor), is the biggest number that is a common factor of both of the numbers.
Find the HCF of the following two numbers:
import numpy as np
num1 = 6
num2 = 9

x = np.gcd(num1, num2)
print(x)
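
Both lcm() and gcd() are ufuncs, so they can also be applied across a whole array with reduce(). A brief sketch, with sample arrays chosen here for illustration:

```python
import numpy as np

arr = np.array([20, 8, 32, 36, 16])

# reduce() applies the ufunc across the array, pair by pair.
gcd_all = np.gcd.reduce(arr)
lcm_all = np.lcm.reduce(np.array([3, 6, 9]))

print(gcd_all)  # 4
print(lcm_all)  # 18
```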

3.9 Trigonometric Functions


NumPy provides the ufuncs sin(), cos() and tan() that take values in radians and
produce the corresponding sin, cos and tan values.
Find sine value of PI/2:
import numpy as np
x = np.sin(np.pi/2)
print(x)

Find sine values for all of the values in arr:


import numpy as np
arr = np.array([np.pi/2, np.pi/3, np.pi/4, np.pi/5])
x = np.sin(arr)
print(x)

3.10 Hyperbolic Functions


NumPy provides the ufuncs sinh(), cosh() and tanh() that take values in radians and
produce the corresponding sinh, cosh and tanh values.
Find sinh value of PI/2:
import numpy as np
x = np.sinh(np.pi/2)
print(x)

Find cosh values for all of the values in arr:


import numpy as np
arr = np.array([np.pi/2, np.pi/3, np.pi/4, np.pi/5])
x = np.cosh(arr)
print(x)

3.11 Set Operations


A set in mathematics is a collection of unique elements.
3.11.1 Create Sets in NumPy
We can use NumPy's unique() method to find the unique elements of any array and so create a set array, but remember that set arrays should only be 1-D arrays.
Convert following array with repeated elements to a set:
import numpy as np
arr = np.array([1, 1, 1, 2, 3, 4, 5, 5, 6, 7])
x = np.unique(arr)
print(x)
3.11.2 Finding Union
To find the unique values of two arrays, use the union1d() method.
Find union of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.union1d(arr1, arr2)
print(newarr)

3.11.3 Finding Intersection


To find only the values that are present in both arrays, use the intersect1d() method.
Find intersection of the following two set arrays:
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([3, 4, 5, 6])
newarr = np.intersect1d(arr1, arr2, assume_unique=True)
print(newarr)

OUTPUT:
RESULT
Thus the feature study of NumPy has been completed successfully.

1(C). EXPLORE THE FEATURES OF SCIPY

AIM:
To learn the different features provided by SciPy package.

ALGORITHM:
1. Install the SciPy package
2. Study all the features of SciPy package.

SciPy
SciPy stands for Scientific Python. It is a scientific computation library that uses NumPy underneath.

Features
These are the important features of SciPy
1. Constants 2. Sparse Data 3. Graphs
4. Spatial Data 5. Matlab Arrays 6. Interpolation

1. Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
These constants can be helpful when you are working with Data Science.
1.1 Constants in SciPy
Metric - Return the specified unit in meter. ex: print(constants.milli)
Binary - Return the specified unit in bytes. ex: print(constants.kibi)
Mass - Return the specified unit in kg. ex: print(constants.stone)
Angle - Return the specified unit in radians. ex: print(constants.degree)
Time - Return the specified unit in seconds. ex: print(constants.year)
Length - Return the specified unit in meters. ex: print(constants.mile)
Pressure - Return the specified unit in pascals. ex: print(constants.bar)
Area - Return the specified unit in square meters. ex: print(constants.hectare)
Volume - Return the specified unit in cubic meters. ex: print(constants.litre)
Speed - Return the specified unit in meters per second. ex: print(constants.kmh)
Temperature - Return the specified unit in Kelvin. ex: print(constants.zero_Celsius)
Energy - Return the specified unit in joules. ex: print(constants.calorie)
Power - Return the specified unit in watts. ex: print(constants.hp)
Force - Return the specified unit in newton. ex: print(constants.pound_force)
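
The examples above assume that the constants module has been imported. A minimal runnable sketch with a few of the listed constants:

```python
from scipy import constants

print(constants.milli)         # metric prefix: 0.001
print(constants.kibi)          # binary prefix: 1024
print(constants.hectare)       # 10000.0 square meters
print(constants.litre)         # 0.001 cubic meters
print(constants.zero_Celsius)  # 273.15 Kelvin
```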

2. Sparse Data
Sparse data is data that has mostly unused elements (elements that don't
carry any information).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values are not
zero.

2.1 CSR(Compressed Sparse Row) Matrix


We can create a CSR matrix by passing an array into the function scipy.sparse.csr_matrix().
Create a CSR matrix from an array:
import numpy as np
from scipy.sparse import csr_matrix
arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])
print(csr_matrix(arr))

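A CSR matrix stores only the non-zero values. The sketch below inspects what is stored and converts the matrix back to a dense array; the 3x3 sample array is chosen here for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([[0, 0, 1],
                [0, 0, 0],
                [1, 0, 2]])
mat = csr_matrix(arr)

print(mat.data)                      # stored non-zero values: 1, 1, 2
print(mat.count_nonzero())           # 3
dense = mat.toarray()                # back to a regular NumPy array
print(np.array_equal(dense, arr))    # True
```
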
3. Graphs
Graphs are an essential data structure.
SciPy provides us with the module scipy.sparse.csgraph for working with
such data structures.
Adjacency Matrix
An adjacency matrix is an n x n matrix, where n is the number of elements in a graph.
The values represent the connections between the elements.

3.1 Dijkstra
Use the dijkstra method to find the shortest path in a graph from one element to
another.
It takes following arguments:
return_predecessors: boolean (True to return whole path of traversal otherwise False).
indices: index of the element to return all paths from that element only.
limit: max weight of path.
Find the shortest paths from the first element (index 0) to the other elements:
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 2],
[1, 0, 0],
[2, 0, 0]
])
newarr = csr_matrix(arr)
print(dijkstra(newarr, return_predecessors=True, indices=0))
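
With return_predecessors=True the result is a pair: the distances and a predecessor array. The sketch below shows one way to turn the predecessor array into an explicit path; path_to() is a hypothetical helper written for this example, not a SciPy function.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix

graph = csr_matrix(np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0],
]))
dist, pred = dijkstra(graph, return_predecessors=True, indices=0)

def path_to(target, predecessors):
    """Walk the predecessor array backwards from target to the start node."""
    path = [target]
    while predecessors[path[-1]] >= 0:   # a negative value marks "no predecessor"
        path.append(int(predecessors[path[-1]]))
    return path[::-1]

print(dist)              # distances from element 0: 0, 1 and 2
print(path_to(2, pred))  # [0, 2]
```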

3.2 Depth First Order


The depth_first_order() method returns a depth first traversal from a node.
This function takes following arguments:
the graph.
the starting element to traverse graph from.

Traverse the graph depth first for given adjacency matrix:


import numpy as np
from scipy.sparse.csgraph import depth_first_order
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 0, 1],
[1, 1, 1, 1],
[2, 1, 1, 0],
[0, 1, 0, 1]
])
newarr = csr_matrix(arr)
print(depth_first_order(newarr, 1))
3.3 Breadth First Order
The breadth_first_order() method returns a breadth first traversal from a node.
This function takes following arguments:
the graph.
the starting element to traverse graph from.
Traverse the graph breadth first for given adjacency matrix:
import numpy as np
from scipy.sparse.csgraph import breadth_first_order
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 0, 1],
[1, 1, 1, 1],
[2, 1, 1, 0],
[0, 1, 0, 1]
])
newarr = csr_matrix(arr)
print(breadth_first_order(newarr, 1))

4. Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
We deal with spatial data problems on many tasks.
E.g. finding if a point is inside a boundary or not.

4.1 Triangulation
A triangulation of a polygon divides the polygon into multiple triangles, with which we can compute the area of the polygon.
A triangulation of points means creating a surface composed of triangles in which every given point is a vertex of at least one triangle in the surface.
One method to generate these triangulations through points is the Delaunay() triangulation.
Example:
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]
])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()

4.2 Convex Hull


A convex hull is the smallest polygon that covers all of the given points.
Use the ConvexHull() method to create a Convex Hull.
Example
Create a convex hull for following points:
import numpy as np
from scipy.spatial import ConvexHull
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1],
[1, 2],
[5, 0],
[3, 1],
[1, 2],
[0, 2]
])
hull = ConvexHull(points)
hull_points = hull.simplices
plt.scatter(points[:,0], points[:,1])
for simplex in hull_points:
    plt.plot(points[simplex,0], points[simplex,1], 'k-')
plt.show()

4.3 KDTrees
KDTrees are a datastructure optimized for nearest neighbor queries.
E.g. in a set of points using KDTrees we can efficiently ask which points are nearest
to a certain given point.
The KDTree() method returns a KDTree object.
The query() method returns the distance to the nearest neighbor and the index of that neighbor in the given points.
Example
Find the nearest neighbor to point (1,1):
from scipy.spatial import KDTree
points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)
res = kdtree.query((1, 1))
print(res)
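
The result of query() can be unpacked into the distance and the index, and the index can then be used to look up the point itself. A short sketch; the k=2 call is an extra illustration of asking for more than one neighbor.

```python
from scipy.spatial import KDTree

points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)

distance, index = kdtree.query((1, 1))
print(distance)        # 2.0
print(points[index])   # (1, -1)

# k=2 asks for the two nearest neighbors instead of one.
distances, indices = kdtree.query((1, 1), k=2)
print(indices)         # indices 0 and 1
```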
4.4 Distance Matrix
There are many distance metrics used to find various types of distances between two points in data science: Euclidean distance, cosine distance etc.
The distance between two vectors may not only be the length of the straight line between them; it can also be the angle between them from the origin, or the number of unit steps required, etc.
The performance of many Machine Learning algorithms depends greatly on distance metrics, e.g. "K Nearest Neighbors" or "K Means".
Let us look at some of the distance metrics:

4.4.1 Euclidean Distance
Find the euclidean distance between given points A and B.
Example
Find the euclidean distance between given points.
from scipy.spatial.distance import euclidean
p1 = (1, 0)
p2 = (10, 2)
res = euclidean(p1, p2)
print(res)

4.4.2 Cosine Distance
Is one minus the cosine of the angle between the two points A and B.
Example
Find the cosine distance between given points:
from scipy.spatial.distance import cosine
p1 = (1, 0)
p2 = (10, 2)
res = cosine(p1, p2)
print(res)

4.4.3 Hamming Distance
Is the proportion of positions at which the two sequences differ.
It's a way to measure distance for binary sequences.
Example
Find the hamming distance between given points:
from scipy.spatial.distance import hamming
p1 = (True, False, True)
p2 = (False, True, True)
res = hamming(p1, p2)
print(res)
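
The three results can be checked by hand against the definitions. The sketch below does this for the points used in the examples; the by_hand_* names are made up for the illustration.

```python
import math
from scipy.spatial.distance import euclidean, cosine, hamming

p1, p2 = (1, 0), (10, 2)

# Euclidean: straight-line length, sqrt((10-1)^2 + (2-0)^2) = sqrt(85).
by_hand_euclid = math.sqrt(9**2 + 2**2)

# Cosine distance: 1 - (p1 . p2) / (|p1| * |p2|) = 1 - 10 / sqrt(104).
by_hand_cosine = 1 - 10 / math.sqrt(104)

# Hamming: 2 of the 3 positions differ.
by_hand_hamming = 2 / 3

print(math.isclose(euclidean(p1, p2), by_hand_euclid))   # True
print(math.isclose(cosine(p1, p2), by_hand_cosine))      # True
print(math.isclose(hamming((True, False, True), (False, True, True)),
                   by_hand_hamming))                      # True
```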

5. Matlab Arrays
We know that NumPy provides us with methods to persist the data in readable
formats for Python. But SciPy provides us with interoperability with Matlab as well.
Working With Matlab Arrays
Exporting Data in Matlab Format
• The savemat() function allows us to export data in Matlab format.
• The method takes the following parameters:
filename - the file name for saving data.
mdict - a dictionary containing the data.
do_compression - a boolean value that specifies whether to compress the result or not. Default False.
Example
Export the following array as variable name "vec" to a mat file:
from scipy import io
import numpy as np
arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})

Import Data from Matlab Format


• The loadmat() function allows us to import data from a Matlab file.
The function takes one required parameter:
filename - the file name of the saved data.
• It will return a dictionary whose keys are the variable names, and the corresponding values are the variable values.
Example
Import the array from the following mat file:
from scipy import io
import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9,])
# Export:
io.savemat('arr.mat', {"vec": arr})
# Import:
mydata = io.loadmat('arr.mat')
print(mydata)
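
The imported array comes back with one more dimension than the exported one, because Matlab has no 1-D arrays. The following sketch shows this and one way to undo it with the squeeze_me argument; the temporary file location is an assumption made so the example does not write into the working folder.

```python
import os
import tempfile
import numpy as np
from scipy import io

arr = np.arange(10)
path = os.path.join(tempfile.mkdtemp(), "arr.mat")  # throwaway location
io.savemat(path, {"vec": arr})

loaded = io.loadmat(path)["vec"]
print(loaded.shape)      # (1, 10)

# squeeze_me=True removes the dimensions of length 1 again.
squeezed = io.loadmat(path, squeeze_me=True)["vec"]
print(squeezed.shape)    # (10,)
```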

6. Interpolation
• Interpolation is a method for generating points between given points.
For example: for points 1 and 2, we may interpolate and find points 1.33 and 1.66.
• Interpolation has many usages. In Machine Learning we often deal with missing data in a dataset, and interpolation is often used to substitute those values. This method of filling values is called imputation.
• Apart from imputation, interpolation is often used where we need to smooth the discrete points in a dataset.

6.1 1D Interpolation
The function interp1d() is used to interpolate a distribution with 1 variable.
It takes x and y points and returns a callable function that can be called with new x
and returns corresponding y.
Example
For given xs and ys interpolate values from 2.1, 2.2... to 2.9:
from scipy.interpolate import interp1d
import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
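
Since the sample points lie on the line y = 2x + 1, the interpolated values should lie on the same line. The sketch below checks this and cross-checks against NumPy's own np.interp(), which is brought in here only for comparison.

```python
import numpy as np
from scipy.interpolate import interp1d

xs = np.arange(10)
ys = 2 * xs + 1
interp_func = interp1d(xs, ys)

new_xs = np.arange(2.1, 3, 0.1)
new_ys = interp_func(new_xs)

# The data lie on y = 2x + 1, so linear interpolation reproduces the line.
print(np.allclose(new_ys, 2 * new_xs + 1))              # True
# np.interp performs the same linear interpolation without SciPy.
print(np.allclose(new_ys, np.interp(new_xs, xs, ys)))   # True
```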

6.2 Spline Interpolation


In 1D interpolation the points are fitted for a single curve whereas in Spline
interpolation the points are fitted against a piecewise function defined with polynomials
called splines.
The UnivariateSpline() function takes xs and ys and produces a callable function that can be called with new xs.
Example
Find univariate spline interpolation for 2.1, 2.2....2.9 for the following non linear points:
from scipy.interpolate import UnivariateSpline
import numpy as np
xs = np.arange(10)
ys = xs**2 + np.sin(xs) + 1
interp_func = UnivariateSpline(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)

OUTPUT:
RESULT
Thus the feature study of SciPy was completed successfully.

1(D). EXPLORE THE FEATURES OF PANDAS

AIM:
To learn the different features provided by Pandas package.

ALGORITHM:
1. Install the Pandas package
2. Study all the features of Pandas package.

Pandas
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• Pandas allows us to analyze big data and make conclusions based on statistical theories.
• Pandas can clean messy data sets, and make them readable and relevant.

Features
These are the important features of Pandas.
1. Series 2. DataFrames 3. Read CSV
4. Read JSON 5. Viewing the Data 6. Data Cleaning
7. Plotting

1. Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

1.1 Create Labels


With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
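
Once labels exist, an item can be accessed by its label instead of its position. A minimal sketch using the same Series:

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

print(myvar["y"])          # 7
print(list(myvar.index))   # ['x', 'y', 'z']
```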

1.2 Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

2. DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

3. Read CSV
A simple way to store big data sets is to use CSV files (comma separated files). CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
Example
Print the maximum number of rows in a CSV file:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)

4. Read JSON
• Big data sets are often stored, or extracted, as JSON.
• JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
5. Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method. The head() method returns the headers and a specified number of rows, starting from the top.
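
A short sketch of head() in use. The DataFrame is built in memory from hypothetical values so that the example does not depend on a data.csv file.

```python
import pandas as pd

# Hypothetical in-memory data, used here instead of a CSV file.
df = pd.DataFrame({
    "calories": [420, 380, 390, 410, 400, 395, 405],
    "duration": [50, 40, 45, 55, 60, 30, 35],
})

first_five = df.head()      # 5 rows when no number is given
first_three = df.head(3)    # the specified number of rows
print(first_five)
print(first_three)
```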

5.1 Info About the Data


The DataFrame object has a method called info() that gives you more information about the data set.
Example
Print information about the data:
import pandas as pd
df = pd.read_csv('data.csv')
df.info()  # info() prints its report itself and returns None

6. Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

6.1 Empty Cells


6.1.1 Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
Example
Return a new DataFrame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
inplace argument
With inplace = True, dropna() removes all rows with NULL values from the original DataFrame instead of returning a new one:
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())

6.1.2 Replace Empty Values


Another way of dealing with empty cells is to insert a new value instead.
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)

6.1.3 Replace Using Mean, Median, or Mode


A common way to replace empty cells is to calculate the mean, median or mode
value of the column.
Pandas uses the mean(), median() and mode() methods to calculate the respective
values for a specified column:
mean()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
# Passing a dict fills only "Calories" and avoids chained in-place assignment
df.fillna({"Calories": x}, inplace = True)
print(df.to_string())

median()
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df.fillna({"Calories": x}, inplace = True)

mode()
import pandas as pd
df = pd.read_csv('data.csv')
# mode() returns a Series because several values can tie; [0] takes the first
x = df["Calories"].mode()[0]
df.fillna({"Calories": x}, inplace = True)
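
To see how the three statistics differ, the sketch below computes them for one small, made-up "Calories" column. Empty cells are skipped when the statistics are calculated.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one empty cell
calories = pd.Series([300.0, 300.0, 400.0, np.nan, 600.0], name="Calories")

mean_value = calories.mean()        # (300 + 300 + 400 + 600) / 4 = 400.0
median_value = calories.median()    # middle of 300, 300, 400, 600 = 350.0
mode_value = calories.mode()[0]     # most frequent value = 300.0

# Fill the empty cell with the median as one possible choice
filled = calories.fillna(median_value)
print(mean_value, median_value, mode_value)
print(filled.tolist())
```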

6.2 Data of Wrong Format


Cells with data of wrong format can make it difficult, or even impossible, to analyze
data.
To fix it, you have two options: remove the rows, or convert all cells in the columns
into the same format.
Example
Convert the "Date" column to date format:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

6.2.1 Removing Rows
Remove rows with a NULL value in the "Date" column:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())
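
The sketch below combines both options on made-up data. It assumes that unparseable entries should be turned into empty values rather than stop the program, which is what errors="coerce" does in pd.to_datetime(); the resulting NaT (Not a Time) cells are then removed with dropna().

```python
import pandas as pd

# Hypothetical data: one valid date, one unparseable entry, one empty cell
df = pd.DataFrame({
    "Date": ["2020/12/01", "not a date", None],
    "Duration": [60, 45, 30],
})

# Entries that cannot be converted become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Keep only the rows whose "Date" was converted successfully
cleaned = df.dropna(subset=["Date"])
print(cleaned.to_string())
```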

6.3 Fixing Wrong Data


6.3.1 Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format"; it can simply be
wrong, for example if someone registered "199" instead of "1.99".
Sometimes you can spot wrong data by looking at the data set, because you have an
expectation of what it should be.

6.3.2 Replacing Values
One way to fix wrong values is to replace them with something else.
Example
Set "Duration" = 45 in row 7:
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7, 'Duration'] = 45
print(df.to_string())

6.3.3 Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
Example
Delete rows where "Duration" is higher than 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())
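
For larger data sets, the same rule can be written without a loop by using a Boolean condition on the whole column. This is offered as an alternative sketch on made-up data, not as a replacement for the method above; the second part replaces the out-of-range values with 120 instead of removing the rows.

```python
import pandas as pd

# Hypothetical data with two suspiciously long durations
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Option 1: keep only the rows where "Duration" is at most 120
kept = df[df["Duration"] <= 120]

# Option 2: replace every value above 120 with 120
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

print(kept.to_string())
print(capped.to_string())
```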

6.4 Removing Duplicates


6.4.1 Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
duplicated() method
The duplicated() method returns True for every row that repeats an earlier row,
otherwise False:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.duplicated())

6.4.2 Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
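
Both duplicate-handling methods can be tried on a small DataFrame in which the third row repeats the first. The values are hypothetical.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

flags = df.duplicated()          # the first occurrence is not marked
deduped = df.drop_duplicates()   # returns a copy without the repeated row

print(flags.tolist())
print(deduped.to_string())
```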
7. Plotting
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on
the screen.
Pandas uses the plot() method to create diagrams.

7.1 Scatter Plot


Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
# Display the diagram on the screen
plt.show()

7.2 Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
# Display the diagram on the screen
plt.show()
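
When no screen is available (for example, when a script runs on a server), a diagram can be saved instead of shown. The sketch below is one possible way to do this: it selects Matplotlib's non-interactive 'Agg' backend and writes a scatter plot as PNG data into an in-memory buffer. The data values are made up for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: draws without opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for data.csv
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30],
    "Maxpulse": [130, 145, 135, 175, 148],
})

ax = df.plot(kind="scatter", x="Duration", y="Maxpulse")

# Save the figure as PNG bytes; a file name such as 'scatter.png' also works
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
plt.close(ax.figure)

png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes of PNG data written")
```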
OUTPUT
RESULT
Thus the feature study of Pandas has been completed successfully.
1(E). EXPLORE THE FEATURES OF STATSMODELS

AIM:
To learn the different features provided by statsmodels package.

ALGORITHM:
1. Install the statsmodels package
2. Study all the features of statsmodels package.

Statsmodels
statsmodels is a Python module that provides classes and functions for the
estimation of many different statistical models, as well as for conducting statistical tests,
and statistical data exploration.

Features
These are the important features of statsmodels
1. Linear regression models
2. Survival analysis

1. Linear regression models


Linear regression analysis is a statistical technique for predicting the value of one
variable (the dependent variable) based on the value of another (the independent variable).
In simple linear regression, there is one independent variable used to predict a single
dependent variable. In multiple linear regression, there is more than one independent
variable.
The independent variable is the one you are using to forecast the value of the other
variable. The statsmodels.regression.linear_model.OLS class is used to perform linear
regression. Linear equations are of the form:
Y = mX + c (m = slope; c = constant)
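
As a worked illustration of the equation, the slope and constant of the least-squares line can be computed directly with NumPy. The sketch uses the standard least-squares formulas for simple linear regression and made-up points that lie exactly on Y = 2X + 1.

```python
import numpy as np

# Hypothetical points lying exactly on the line Y = 2X + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Slope: sum of products of deviations divided by sum of squared x deviations
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Constant: the fitted line passes through the point of means
c = y.mean() - m * x.mean()

print(m, c)
```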
Syntax:
statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none',
hasconst=None, **kwargs)
Parameters:
• endog: array-like object. The dependent (response) variable.
• exog: array-like object. The independent variables (regressors).
• missing: str. 'none', 'drop', and 'raise' are the available alternatives. If the value is
'none', no NaN checking is performed. Any observations with NaNs are dropped if 'drop'
is selected. An error is raised if 'raise' is used. 'none' is the default.
• hasconst: None or bool. Indicates whether a user-supplied constant is included in the
RHS. If True, k_constant is set to 1 and all outcome statistics are calculated as if a
constant is present. If False, k_constant is set to 0. In both cases, no check for a
constant is made.
• **kwargs: When using the formula interface, additional arguments are used to set
model characteristics.

Step 1: Import packages.


Importing the required packages is the first step of modeling. The pandas, NumPy,
and statsmodels packages are imported.
import numpy as np
import pandas as pd
import statsmodels.api as sm
Step 2: Loading data
The CSV file is read using the pandas.read_csv() method. The first five rows of the
dataset are returned by the head() method. Head size and Brain weight are the columns.
df = pd.read_csv('headbrain1.csv')
df.head()
Visualizing the data:
By using the matplotlib and seaborn packages, we visualize the data. The sns.regplot()
function helps us create a regression plot.
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('headbrain1.csv')
# x and y are passed as keyword arguments, as recent seaborn versions require
sns.regplot(x='Head Size(cm^3)', y='Brain Weight(grams)', data=df)
plt.show()
Step 3: Setting a hypothesis.
Null hypothesis (H0): There is no relationship between head size and brain weight.
Alternative hypothesis (Ha): There is a relationship between head size and brain
weight.
Step 4: Fitting the model
The statsmodels.regression.linear_model.OLS class implements ordinary least
squares, and its fit() method is used to fit the model to the data.
The ols() function of the formula interface takes in a formula and the data and sets up
the linear regression. We provide the dependent and independent columns in this format:
dependent_column ~ independent_columns
The left side of the ~ operator contains the name of the dependent variable (the
predicted column) and the right side contains the independent variables.
df.columns = ['Head_size', 'Brain_weight']
# sm.formula.ols is the formula interface exposed through statsmodels.api
model = sm.formula.ols(formula='Brain_weight ~ Head_size', data=df).fit()
Step 5: Summary of the model.
All the summary statistics of the linear regression model are returned by the
model.summary() method. The p-value and many other statistics are reported by this
method. Predictions about the data are obtained separately, with the model.predict() method.
print(model.summary())
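
Individual values from the summary can also be read directly from the fitted model. The sketch below is a self-contained illustration with made-up head size and brain weight values (not the contents of headbrain1.csv). It fits Brain_weight as the response, written on the left of ~, and uses statsmodels.formula.api, the usual import for the formula interface.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical measurements for illustration only
df = pd.DataFrame({
    "Head_size": [3000, 3200, 3400, 3600, 3800, 4000],
    "Brain_weight": [1100, 1150, 1210, 1240, 1310, 1350],
})

model = smf.ols(formula="Brain_weight ~ Head_size", data=df).fit()

slope = model.params["Head_size"]        # estimated change in weight per unit of size
p_value = model.pvalues["Head_size"]     # p-value for the test of "no relationship"

# Predict the brain weight for a new, hypothetical head size
predicted = model.predict(pd.DataFrame({"Head_size": [3500]}))

print(slope, p_value, predicted.iloc[0])
```

A small p-value for Head_size is evidence against the null hypothesis that there is no relationship between the two variables.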
2. Survival analysis
The statsmodels.api.SurvfuncRight class can be used to estimate survival
functions using data that may be right-censored. SurvfuncRight implements several
inference methods, including confidence intervals for survival quantiles, pointwise
and simultaneous confidence bands for survival functions, and plotting methods. The
duration.survdiff function provides a test procedure for comparing survival distributions.
Here we create a SurvfuncRight object using the data from the Moore study,
available from the R dataset repository. The survival distribution is fitted for 'low'
fcategory subjects only.

Example:
# Importing libraries
import statsmodels.api as sm
X = sm.datasets.get_rdataset("Moore", "carData").data
# Filtering data of low fcategory
X = X[X['fcategory'] == "low"]
# Creating SurvfuncRight model
model = sm.SurvfuncRight(X["conformity"], X["fscore"])
# Model Summary
model.summary()

Sample Output

Linear regression models


Survival analysis

RESULT
Thus the study of a few important features of statsmodels has been completed
successfully.
