DS - Ex-1
AIM:
To learn how to download and install the different packages of NumPy, SciPy, Jupyter,
Statsmodels and Pandas.
ALGORITHM:
1. Download Python and Jupyter.
2. Install Python and Jupyter.
3. Install packages such as NumPy, SciPy, Statsmodels and Pandas.
4. Verify the proper execution of Python and Jupyter.
Python Installation
Open the official Python web site. (https://www.python.org/)
Downloads ==> Windows ==> Select Recent Release. (Requires Windows 10
or above versions)
Install "python-3.10.6-amd64.exe"
Jupyter Installation
Open command prompt and enter the following to check whether Python
was installed properly or not: “python --version”.
If the installation is proper, it returns the version of Python.
Enter the following to check whether the Python package manager was
installed properly or not: “pip --version”.
If the installation is proper, it returns the version of the Python package manager.
Enter the following command “pip install jupyterlab”.
Enter the following command “pip install jupyter notebook”.
If pip reports that a newer version is available, copy the upgrade command shown in
its output, paste it into the command prompt and execute it to complete the upgrade.
Create a folder and name the folder accordingly.
Open command prompt and change into that folder. Type
“jupyter notebook” and press Enter.
A new Jupyter notebook will now open for use.
pip Installation
Installation of NumPy
pip install numpy
Installation of SciPy
pip install scipy
Installation of Statsmodels
Page No.
P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY
pip install statsmodels
Installation of Pandas
pip install pandas
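The verification in step 4 can also be done from Python itself. The following is a minimal sketch, assuming the package names used with pip above; the helper name installed_versions is hypothetical and not part of any package. It reports the installed version of each package, or marks it as missing.

```python
from importlib import metadata

# Package names as passed to "pip install" in the steps above.
PACKAGES = ["numpy", "scipy", "statsmodels", "pandas", "jupyterlab", "notebook"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED can be added with the matching pip install command.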
Sample Output
RESULT:
NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were installed properly and
their execution was also verified.
1(B). EXPLORE THE FEATURES OF NUMPY
AIM:
To learn the different features provided by NumPy package.
ALGORITHM:
1. Install the NumPy package
2. Study all the features of NumPy package.
NumPy
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.
Features
These are the important features of NumPy
1. Array 2. Random 3. Universal Functions
1. Arrays
1.1 Array Slicing
Slicing in python means taking elements from one given index to another
given index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0.
If we don't pass end, it is considered the length of the array in that dimension.
If we don't pass step, it is considered 1.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
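The default rules for start, end and step can be checked on the same array. This is a small sketch; the variable names are chosen only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

no_start = arr[:4]       # start omitted: begins at index 0
no_end = arr[4:]         # end omitted: runs to the end of the array
no_step = arr[1:5]       # step omitted: every element from index 1 to 4
with_step = arr[1:5:2]   # every second element from index 1 to 4

print(no_start, no_end, no_step, with_step)
```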
1.2 Array Reshaping
Reshaping means changing the shape of an array.
The shape of an array is the number of elements in each dimension.
The outermost dimension will have 2 arrays that each contain 3 arrays, each with 2 elements:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
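The shape attribute confirms the number of elements in each dimension. The sketch below also shows, as an illustration, that reshaping only works when the element counts match; the 2 x 3 x 3 target shape is an arbitrary wrong example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr.shape)  # number of elements in each dimension

# 12 elements cannot fill a 2 x 3 x 3 array (18 elements), so NumPy raises an error.
try:
    arr.reshape(2, 3, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```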
2. Random
Random Permutations
A permutation refers to an arrangement of elements. e.g. [3, 2, 1] is a permutation of
[1, 2, 3] and vice-versa. The NumPy Random module provides two methods for this:
shuffle() and permutation().
from numpy import random
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
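The shuffle() method changes the original array in place. The permutation() method, sketched below on the same data, returns a rearranged copy and leaves the original untouched.

```python
from numpy import random
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
rearranged = random.permutation(arr)  # new array with the elements rearranged

print(rearranged)
print(arr)  # the original array keeps its order
```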
2.1 Seaborn
Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to
visualize random distributions.
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot([0, 1, 2, 3, 4, 5], kde=True)  # distplot() is deprecated in recent Seaborn releases
plt.show()
Draw 2x3 samples from a logistic distribution with mean 1 and scale 2.0:
from numpy import random
x = random.logistic(loc=1, scale=2, size=(2, 3))
print(x)
Draw a 2x3 sample from a Rayleigh distribution with a scale of 2:
from numpy import random
x = random.rayleigh(scale=2, size=(2, 3))
print(x)
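Random samples differ on every run, which makes results hard to compare. A possible way to get repeatable samples is a seeded generator from np.random.default_rng(); this is a sketch, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

logistic_samples = rng.logistic(loc=1, scale=2, size=(2, 3))
rayleigh_samples = rng.rayleigh(scale=2, size=(2, 3))

print(logistic_samples)
print(rayleigh_samples)
```

Running the block again with the same seed gives the same numbers.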
3. Universal Functions
Create Your Own ufunc (Universal)
To create your own ufunc, you have to define a function, like you do with normal
functions in Python, then you add it to your NumPy ufunc library with the frompyfunc()
method.
The frompyfunc() method takes the following arguments:
function - the name of the function.
inputs - the number of input arguments (arrays).
outputs - the number of output arrays.
Create your own ufunc for addition:
import numpy as np
def myadd(x, y):
    return x+y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
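Whether an object really is a ufunc can be checked with isinstance(). The sketch below also shows that a ufunc built with frompyfunc() returns an array of Python objects, which can be converted to a numeric type if needed; the name add_ufunc is chosen only for illustration.

```python
import numpy as np


def myadd(x, y):
    return x + y


add_ufunc = np.frompyfunc(myadd, 2, 1)  # 2 inputs, 1 output
result = add_ufunc([1, 2, 3, 4], [5, 6, 7, 8])

print(isinstance(add_ufunc, np.ufunc))  # True for real ufuncs
print(result.dtype)                     # frompyfunc results hold Python objects
as_ints = result.astype(int)            # convert to a numeric array
print(as_ints)
```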
arr2 = np.array([3, 7, 9, 8, 2, 33])
newarr = np.mod(arr1, arr2)
print(newarr)
Absolute Values
Return the absolute value of each element:
import numpy as np
arr = np.array([-1, -2, 1, 2, 3, -4])
newarr = np.absolute(arr)
print(newarr)
Remove the decimals, and return the float number closest to zero. Use
the trunc() and fix() functions.
Truncate elements of following array:
import numpy as np
arr = np.trunc([-3.1666, 3.6667])
print(arr)
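3.2.2 Rounding
The around() function rounds to the given number of decimals: the preceding digit is
incremented by 1 if the part that follows is more than half, otherwise it is left
unchanged (values exactly halfway are rounded to the nearest even value).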
Round off 3.1666 to 2 decimal places:
import numpy as np
arr = np.around(3.1666, 2)
print(arr)
3.2.3 Floor
The floor() function rounds off decimal to nearest lower integer.
Floor the elements of following array:
import numpy as np
arr = np.floor([-3.1666, 3.6667])
print(arr)
3.2.4 Ceil
The ceil() function rounds off decimal to nearest upper integer.
Ceil the elements of following array:
import numpy as np
arr = np.ceil([-3.1666, 3.6667])
print(arr)
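The difference between the rounding functions is easiest to see when they are applied to the same values. A short sketch using the array from the examples above:

```python
import numpy as np

values = np.array([-3.1666, 3.6667])

truncated = np.trunc(values)    # drops the decimals: -3 and 3
fixed = np.fix(values)          # same result as trunc()
rounded = np.around(values, 2)  # two decimals: -3.17 and 3.67
floored = np.floor(values)      # next lower integer: -4 and 3
ceiled = np.ceil(values)        # next upper integer: -3 and 4

print(truncated, fixed, rounded, floored, ceiled)
```

Note that trunc() and floor() differ only for negative numbers.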
3.3 Logs
NumPy provides functions to perform log at the base 2, e and 10.
We can also take the log for any base by creating a custom ufunc. The log functions
place -inf in an element whose input is 0 and nan in an element whose input is
negative, because the log cannot be computed there.
Find log at base 10 of all elements of following array:
import numpy as np
arr = np.arange(1, 10)
print(np.log10(arr))
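A log at any base can be sketched by wrapping Python's math.log() with frompyfunc(), since math.log() accepts the base as a second argument. The name nplog and the sample values 100 and 15 are assumptions for illustration. The second part shows the -inf result for an input of 0.

```python
from math import log
import numpy as np

# math.log(x, base) takes 2 inputs and gives 1 output.
nplog = np.frompyfunc(log, 2, 1)
print(nplog(100, 15))  # log of 100 at base 15

# log10 of 0 gives -inf; the divide warning is silenced here on purpose.
with np.errstate(divide="ignore"):
    log_of_zero = np.log10(np.array([0.0, 1.0, 10.0]))
print(log_of_zero)
```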
3.4 Summations
Addition is done between two arguments whereas summation happens over n elements.
Add the values in arr1 to the values in arr2:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])
newarr = np.add(arr1, arr2)
print(newarr)
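In contrast to add(), the sum() function collapses all given elements into one value. The sketch below uses the same two arrays and also shows summation along an axis and the cumulative sum.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 3])

total = np.sum([arr1, arr2])              # one value over all elements
per_array = np.sum([arr1, arr2], axis=1)  # one sum per input array
running = np.cumsum(arr1)                 # cumulative (partial) sums

print(total, per_array, running)
```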
3.5 Products
To find the product of the elements in an array, use the prod() function.
Find the product of the elements of this array:
import numpy as np
arr = np.array([1, 2, 3, 4])
x = np.prod(arr)
print(x)
3.6 Differences
A discrete difference means subtracting two successive elements.
To find the discrete difference, use the diff() function.
Compute discrete difference of the following array:
import numpy as np
arr = np.array([10, 15, 25, 5])
newarr = np.diff(arr)
print(newarr)
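The difference can be applied repeatedly with the n parameter of diff(). A brief sketch on the same array:

```python
import numpy as np

arr = np.array([10, 15, 25, 5])

first = np.diff(arr)        # 15-10, 25-15, 5-25
second = np.diff(arr, n=2)  # the differences of the first differences

print(first, second)
```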
x = np.gcd(num1, num2)
print(x)
OUTPUT:
RESULT
Thus the feature study of NumPy has been completed successfully.
1(C). EXPLORE THE FEATURES OF SCIPY
AIM:
To learn the different features provided by SciPy package.
ALGORITHM:
1. Install the SciPy package
2. Study all the features of SciPy package.
SciPy
SciPy stands for Scientific Python. It is a scientific computation library that uses
NumPy underneath.
Features
These are the important features of SciPy
1. Constants 2. Sparse Data 3. Graphs
4. Spatial Data 5. Matlab Arrays 6. Interpolation
1. Constants in SciPy
As SciPy is more focused on scientific implementations, it provides many built-in
scientific constants.
These constants can be helpful when you are working with Data Science.
1.1 Constants in SciPy
Metric: return the specified unit in meter.
ex:
from scipy import constants
print(constants.milli)
2. Sparse Data
Sparse data is data that has mostly unused elements (elements that don't
carry any information).
It can be an array like this one:
[1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0]
Sparse Data: is a data set where most of the item values are zero.
Dense Array: is the opposite of a sparse array: most of the values are not
zero.
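SciPy stores sparse data in compressed form, keeping only the non-zero values. A minimal sketch using csr_matrix() (Compressed Sparse Row) on the array shown above:

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.array([1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0])
sparse = csr_matrix(arr)        # stored as a matrix with 1 row and 12 columns

print(sparse)                   # lists only the non-zero entries
print(sparse.data)              # the stored values
print(sparse.count_nonzero())   # how many values are not zero
dense_again = sparse.toarray()  # back to an ordinary (dense) NumPy array
```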
3.1 Dijkstra
Use the dijkstra method to find the shortest path in a graph from one element to
another.
It takes following arguments:
return_predecessors: boolean (True to return whole path of traversal otherwise False).
indices: index of the element to return all paths from that element only.
limit: max weight of path.
Find the shortest path from element 1 to 2:
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix
arr = np.array([
[0, 1, 2],
[1, 0, 0],
[2, 0, 0]
])
newarr = csr_matrix(arr)
print(dijkstra(newarr, return_predecessors=True, indices=0))
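With return_predecessors=True the dijkstra() method returns two arrays, which can be unpacked and read separately. A sketch on the same graph; the variable names are chosen for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix

graph = csr_matrix(np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0],
]))

distances, predecessors = dijkstra(graph, return_predecessors=True, indices=0)

print(distances)     # cost of the cheapest path from element 0 to each element
print(predecessors)  # previous element on that path; a negative value means "none"
```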
4. Spatial Data
Spatial data refers to data that is represented in a geometric space.
E.g. points on a coordinate system.
We deal with spatial data problems on many tasks.
E.g. finding if a point is inside a boundary or not.
4.1 Triangulation
A Triangulation of a polygon divides the polygon into multiple triangles, from
which we can compute the area of the polygon.
A Triangulation with points means creating a surface composed of triangles in which all of
the given points are on at least one vertex of any triangle in the surface.
One method to generate these triangulations through points is the Delaunay()
Triangulation.
Example:
Create a triangulation from following points:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt
points = np.array([
[2, 4],
[3, 4],
[3, 0],
[2, 2],
[4, 1]
])
simplices = Delaunay(points).simplices
plt.triplot(points[:, 0], points[:, 1], simplices)
plt.scatter(points[:, 0], points[:, 1], color='r')
plt.show()
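The simplices property can also be inspected without plotting. Each row holds the indices of the three points that form one triangle, as this sketch on the same points shows:

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.array([[2, 4], [3, 4], [3, 0], [2, 2], [4, 1]])
tri = Delaunay(points)

# Each row of simplices holds the indices of the 3 corner points of one triangle.
for corner_indices in tri.simplices:
    print(corner_indices, points[corner_indices].tolist())

used_points = set(tri.simplices.ravel().tolist())
print(used_points)  # every given point is a corner of at least one triangle
```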
4.3 KDTrees
KDTrees are a data structure optimized for nearest neighbor queries.
E.g. in a set of points using KDTrees we can efficiently ask which points are nearest
to a certain given point.
The KDTree() method returns a KDTree object.
The query() method returns the distance to the nearest neighbor and the index of that
neighbor in the given points.
Example
Find the nearest neighbor to point (1,1):
from scipy.spatial import KDTree
points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)
res = kdtree.query((1, 1))
print(res)
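The result of query() can be unpacked into the distance and the index, and the index can then be used to look up the neighbor itself. A short sketch with the same points:

```python
from scipy.spatial import KDTree

points = [(1, -1), (2, 3), (-2, 3), (2, -3)]
kdtree = KDTree(points)

distance, index = kdtree.query((1, 1))

print(distance)       # distance to the nearest neighbor
print(points[index])  # the nearest neighbor itself
```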
4.4 Distance Matrix
There are many Distance Metrics used to find various types of distances between two
points in data science, e.g. Euclidean distance, cosine distance etc.
The distance between two vectors may not only be the length of the straight line between
them, it can also be the angle between them from the origin, or the number of unit steps required
etc.
The performance of many Machine Learning algorithms depends greatly on distance
metrics. E.g. "K Nearest Neighbors", or "K Means" etc.
Let us look at some of the Distance Metrics:
4.4.1 Euclidean Distance
Find the euclidean distance between given points A and B.
Example
Find the euclidean distance between given points.
from scipy.spatial.distance import euclidean
p1 = (1, 0)
p2 = (10, 2)
res = euclidean(p1, p2)
print(res)
4.4.2 Cosine Distance
Is one minus the cosine of the angle between the two points A and B, seen as vectors from the origin.
Example
Find the cosine distance between given points:
from scipy.spatial.distance import cosine
p1 = (1, 0)
p2 = (10, 2)
res = cosine(p1, p2)
print(res)
4.4.3 Hamming Distance
Is the proportion of bits where the two sequences are different.
It's a way to measure distance for binary sequences.
Example
Find the hamming distance between given points:
from scipy.spatial.distance import hamming
p1 = (True, False, True)
p2 = (False, True, True)
res = hamming(p1, p2)
print(res)
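The three metrics can be cross-checked against their definitions with plain NumPy. This is a sketch for understanding, not a replacement for the SciPy functions; the names ending in _manual are chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, hamming

p1, p2 = np.array([1, 0]), np.array([10, 2])

# Length of the straight line between the points.
euclid_manual = np.sqrt(np.sum((p1 - p2) ** 2))
# One minus the cosine of the angle between the vectors.
cosine_manual = 1 - np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))

b1, b2 = np.array([True, False, True]), np.array([False, True, True])
# Share of positions at which the two sequences differ.
hamming_manual = np.mean(b1 != b2)

print(euclid_manual, euclidean(p1, p2))
print(cosine_manual, cosine(p1, p2))
print(hamming_manual, hamming(b1, b2))
```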
5. Matlab Arrays
We know that NumPy provides us with methods to persist the data in readable
formats for Python. But SciPy provides us with interoperability with Matlab as well.
Working With Matlab Arrays
Exporting Data in Matlab Format
The savemat() function allows us to export data in Matlab format.
The method takes the following parameters:
filename - the file name for saving data.
mdict - a dictionary containing the data.
do_compression - a boolean value that specifies whether to compress the
result or not. Default False.
Example
Export the following array as variable name "vec" to a mat file:
from scipy import io
import numpy as np
arr = np.arange(10)
io.savemat('arr.mat', {"vec": arr})
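The exported file can be read back with the loadmat() function of the same module, which returns a dictionary of the stored variables. The sketch below writes to a temporary folder (an assumption made so that no file is left behind) and shows that the vector comes back as a 2-D array.

```python
import os
import tempfile

import numpy as np
from scipy import io

arr = np.arange(10)

with tempfile.TemporaryDirectory() as folder:  # temporary folder, removed afterwards
    path = os.path.join(folder, "arr.mat")
    io.savemat(path, {"vec": arr})
    loaded = io.loadmat(path)  # dictionary of the variables in the file

print(loaded["vec"])        # stored as a 2-D array (1 row, 10 columns)
print(loaded["vec"].shape)
```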
6. Interpolation
Interpolation is a method for generating points between given points.
For example: for points 1 and 2, we may interpolate and find points 1.33 and 1.66.
Interpolation has many usages. In Machine Learning we often deal with missing data in
a dataset, and interpolation is often used to substitute those values. This method of filling
values is called imputation.
Apart from imputation, interpolation is often used where we need to smooth the
discrete points in a dataset.
6.1 1D Interpolation
The function interp1d() is used to interpolate a distribution with 1 variable.
It takes x and y points and returns a callable function that can be called with new x
and returns corresponding y.
Example
For given xs and ys interpolate values from 2.1, 2.2... to 2.9:
from scipy.interpolate import interp1d
import numpy as np
xs = np.arange(10)
ys = 2*xs + 1
interp_func = interp1d(xs, ys)
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
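Because ys follows the straight line 2*x + 1, the interpolated values should lie on the same line. The sketch below checks this with NumPy's own np.interp() function, used here as a cross-check in place of interp1d().

```python
import numpy as np

xs = np.arange(10)
ys = 2 * xs + 1

new_xs = np.arange(2.1, 3, 0.1)
new_ys = np.interp(new_xs, xs, ys)  # piecewise-linear interpolation

print(new_ys)
print(np.allclose(new_ys, 2 * new_xs + 1))  # values lie on the line 2*x + 1
```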
OUTPUT:
RESULT
Thus the feature study of SciPy was completed successfully.
1(D). EXPLORE THE FEATURES OF PANDAS
AIM:
To learn the different features provided by Pandas package.
ALGORITHM:
1. Install the Pandas package
2. Study all the features of Pandas package.
Pandas
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas allows us to analyze big data and make conclusions based on statistical
theories.
Pandas can clean messy data sets, and make them readable and relevant.
Features
These are the important features of Pandas.
1. Series 2. DataFrames 3. Read CSV
4. Read JSON 5. Viewing the Data 6. Data Cleaning
7. Plotting
1. Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
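If nothing else is specified, the values of a Series are labeled with their index number, starting at 0. Custom labels can be given with the index argument, as in this sketch; the labels "x", "y" and "z" are arbitrary examples.

```python
import pandas as pd

a = [1, 7, 2]

default = pd.Series(a)                         # labels 0, 1, 2
labeled = pd.Series(a, index=["x", "y", "z"])  # custom labels

print(default[0])    # access by index number
print(labeled["y"])  # access by label
```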
2. DataFrames
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
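Rows and columns of a DataFrame can be picked out individually. A brief sketch on the same data, using loc for rows and square brackets for a column:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

first_row = df.loc[0]      # one row, returned as a Series
two_rows = df.loc[[0, 1]]  # several rows, returned as a DataFrame
calories = df["calories"]  # one column, returned as a Series

print(df.shape)            # (number of rows, number of columns)
print(first_row)
```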
3. Read CSV
A simple way to store big data sets is to use CSV files (comma separated files). CSV
files contain plain text and are a well known format that can be read by everyone
including Pandas.
Example
To print all rows of a CSV file, increase the maximum number of rows displayed:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
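When no 'data.csv' file is at hand, read_csv() can be tried out on text held in memory. The content below is hypothetical sample data, not the file used in this exercise; note how the empty cell is read as a missing value.

```python
import io
import pandas as pd

# Hypothetical file content; a real 'data.csv' would be read from disk instead.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,
30,103,300.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # column types detected by Pandas
```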
4. Read JSON
Big data sets are often stored, or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the
world of programming, including Pandas.
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
5. Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the
head() method. The head() method returns the headers and a specified number of rows,
starting from the top.
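A short sketch of head(), together with its counterpart tail() for the last rows; the DataFrame is hypothetical sample data.

```python
import pandas as pd

df = pd.DataFrame({"Duration": range(10, 110, 10)})  # 10 hypothetical rows

print(df.head())   # first 5 rows when no number is given
print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows
```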
6. Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
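Replace empty cells in "Calories" with the median:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
# assign the result back; fillna(inplace = True) on a selected column may not update df in recent Pandas versions
df["Calories"] = df["Calories"].fillna(x)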
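Replace empty cells in "Calories" with the mode:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
# assign the result back; fillna(inplace = True) on a selected column may not update df in recent Pandas versions
df["Calories"] = df["Calories"].fillna(x)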
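The effect of the different replacement values can be compared side by side on a small example. The data below is hypothetical and contains two empty cells.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two empty cells.
df = pd.DataFrame({"Calories": [300.0, np.nan, 250.0, 300.0, np.nan, 410.0]})

filled_mean = df["Calories"].fillna(df["Calories"].mean())      # average value
filled_median = df["Calories"].fillna(df["Calories"].median())  # middle value
filled_mode = df["Calories"].fillna(df["Calories"].mode()[0])   # most frequent value

print(pd.DataFrame({"mean": filled_mean, "median": filled_median, "mode": filled_mode}))
```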
6.2.1 Removing Rows
Remove rows with a NULL value in the "Date" column:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.dropna(subset=['Date'], inplace = True)
print(df.to_string())
6.3.2 Replacing Values
One way to fix wrong values is to replace them with something else.
Example
Set "Duration" = 45 in row 7:
import pandas as pd
df = pd.read_csv('data.csv')
df.loc[7,'Duration'] = 45
print(df.to_string())
6.3.3 Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data.
Example
Delete rows where "Duration" is higher than 120:
import pandas as pd
df = pd.read_csv('data.csv')
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())
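The same rows can be removed without a loop by filtering with a condition. This is an alternative sketch on hypothetical data, not the method used above.

```python
import pandas as pd

# Hypothetical data; row 1 holds an implausible duration.
df = pd.DataFrame({"Duration": [60, 450, 45, 120], "Pulse": [110, 104, 109, 100]})

cleaned = df[df["Duration"] <= 120]  # keep only the rows that pass the condition

print(cleaned.to_string())
```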
6.4.2 Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
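Before removing duplicates it can help to see which rows are affected. The duplicated() method returns True for every row that repeats an earlier one; the data in this sketch is hypothetical.

```python
import pandas as pd

# Hypothetical data; row 1 repeats row 0.
df = pd.DataFrame({"Duration": [60, 60, 45], "Pulse": [110, 110, 109]})

flags = df.duplicated()              # True for each repeated row
deduplicated = df.drop_duplicates()  # returns a copy without the repeats

print(flags)
print(deduplicated)
```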
7. Plotting
We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on
the screen.
Pandas uses the plot() method to create diagrams.
7.2 Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
Example
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()
OUTPUT
RESULT
Thus the feature study of Pandas has been completed successfully.
1(E). EXPLORE THE FEATURES OF STATSMODELS
AIM:
To learn the different features provided by statsmodels package.
ALGORITHM:
1. Install the statsmodels package
2. Study all the features of statsmodels package.
Statsmodels
statsmodels is a Python module that provides classes and functions for the
estimation of many different statistical models, as well as for conducting statistical tests,
and statistical data exploration.
Features
These are the important features of statsmodels
1. Linear regression models
2. Survival analysis
Example:
# Importing libraries
import statsmodels.api as sm
X = sm.datasets.get_rdataset("Moore", "carData").data
# Filtering data of low fcategory
X = X[X['fcategory'] == "low"]
# Creating SurvfuncRight model
model = sm.SurvfuncRight(X["conformity"], X["fscore"])
# Model Summary
model.summary()
Sample Output
RESULT
Thus the study of a few important features of statsmodels has been completed
successfully.