FDS Record

This document is a certificate certifying that Sreya Reddy Addula completed practical work for the Fundamentals of Data Science lab during the 2020-2021 academic year at Chaitanya Bharathi Institute of Technology. The certificate is signed by the internal and external examiners as well as the Head of the Department of Computer Science and is dated February 2nd, 2022. The attached index lists topics covered in the lab including installing Python, NumPy commands, data visualization techniques, and data analysis methods with and without Scipy.


DEPARTMENT OF COMPUTER SCIENCE

B.E – III SEMESTER


Fundamentals of Data Science Lab
Course Code: 20CAC02

Academic Year
2021-22
CHAITANYA BHARATHI INSTITUTE OF
TECHNOLOGY
Gandipet, Hyderabad-500075

Certificate
Certified that this is the bonafide record of the practical work done during the academic year
2020-2021 by Sreya Reddy Addula
Roll Number _ 160120748017 Section CSE-4
in the Laboratory of Fundamentals of Data Science of the Department of Computer
Science.

Internal examiner External examiner

Head of the Department

Date : 02-02-2022
INDEX

S.No TOPICS

1. Installation process for Python in Windows
2. NumPy
• NumPy Commands
• Array Slicing and Dimensions in NumPy
• User-Defined Datatypes using NumPy
3. Pandas
4. Data Visualization
• Bar Graphs
• Pie Charts
• Box Plots
• Frequency Polygons
• Histograms
• Scatter Plots
5. Data Analysis and Distribution, with and without SciPy
• 1-sample t-test
• Unpaired unequal variance t-test (theory)
• Unpaired equal variance t-test
• Paired t-test
• ANOVA Test
CSE-4 FDS RECORD 160120748017

Fundamentals of Data Science Lab

INSTALLATION PROCEDURE FOR PYTHON IN WINDOWS

STEP 1: SELECT VERSION OF PYTHON TO INSTALL:
The installation procedure involves downloading the official Python installer
(.exe) and running it on the system.
STEP 2: DOWNLOAD PYTHON EXECUTABLE INSTALLER:
Open the browser and navigate to the official Python website. Search for the
desired version of Python.
E.g.: 3.9.7
STEP 3: RUN EXECUTABLE INSTALLER:
Run the Python installer once it has downloaded. Make sure the "Install
launcher for all users" and "Add Python 3.9 to PATH" checkboxes are selected,
then select "Install Now".
STEP 4: VERIFY PYTHON WAS INSTALLED ON WINDOWS
Navigate to the directory in which Python was installed on the system, e.g.
C:/Programfile/Python/Python3.9.7
(the exact path depends on the options chosen during installation).
After finding that folder, double-click python.exe. The output will appear in a
Python terminal.
STEP 5: VERIFY WHETHER pip WAS INSTALLED:
CASE 1: If pip was not installed:
How to install pip:
pip is a package management system used to install and manage software
packages written in Python.
The Python documentation describes pip as the preferred installer program.
Step 1: Download get-pip.py:
Download it from the official website, or run the following command from the
Command Prompt to fetch the get-pip.py file:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Step 2: Install pip on Windows:
python get-pip.py
Step 3: Once pip is installed, test it by typing the following command in
the Command Prompt: "pip help"


CASE 2: If pip is already installed:

Step 1: Open the Start menu, type cmd, select the Command Prompt application
and enter the command "pip -V". If pip was installed successfully, you should
see the pip version along with the Python version it runs under.
STEP 6: ADD PYTHON PATH TO ENVIRONMENT VARIABLES:
Open the Start menu, right-click My Computer and choose Properties.
Navigate to Advanced system settings and choose Environment Variables.
Under System variables, choose Path and add the Python installation directory to it.
For data science we need additional packages and libraries such as SciPy,
pandas and NumPy.
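Once these packages have been installed with pip, a quick way to confirm that Python can find them is to look each one up by name. The sketch below is a minimal check that uses only the standard library; the helper name check_packages and the list of package names passed to it are illustrative choices, not part of the original procedure.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Packages named in this record; adjust the list as needed.
status = check_packages(["numpy", "pandas", "scipy", "matplotlib"])
for name, found in status.items():
    print(f"{name:<12} {'installed' if found else 'missing'}")
```

A package reported as missing can be installed from the Command Prompt with pip install followed by the package name.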
NUMPY:
NumPy stands for Numerical Python. It is an open-source library for the Python
programming language, used for scientific computing and working with
arrays. Apart from its multidimensional array object, it also provides high-level
functions for working with arrays.
How to install NumPy:
Note: The prerequisite is that Python is installed on your system.
To install NumPy, type the following in the Command Prompt:
pip install numpy
PANDAS:
pandas is an open-source Python package that is widely used for data science,
data analysis and machine learning tasks. It is built on top of another package
named NumPy. pandas works well with many other data science modules in the
Python ecosystem. It simplifies many of the time-consuming, repetitive tasks
associated with working with data, including data cleaning, normalization,
visualization, statistics etc.
How to install pandas:
pip install pandas


WEEK-1:

PROGRAM 1:
AIM: To use basic commands on a numpy array
PROCEDURE: In this code we use the type and shape commands. A numpy array
is a grid of values, all of the same type, indexed by a tuple of
nonnegative integers. The number of dimensions is the rank of the array;
the shape of an array is a tuple of integers giving the size of the array along
each dimension.
CODE:
import numpy as np
a = np.array([5, 12, 23, 40])        # rank-1 array
print(type(a))
print(a.shape)
print(a[3], a[1], a[0])
a[0] = 6                             # change an element in place
print(a)
b = np.array([[1,2,43],[14,5,6]])    # rank-2 array
print(b.shape)
print(b[0, 0], b[0, 1], b[1, 0])
OUTPUT:

PROGRAM 2:
AIM: To create arrays using various functions
PROCEDURE: Here we use different functions to create an array:
zeros((m,n)) creates an array of all zeros with m rows and n columns
ones((m,n)) creates an array of all ones with m rows and n columns
full((m,n), value) creates a constant array with m rows and n columns, every
element equal to value
eye(n) creates an identity matrix with n rows and n columns
random.random((m,n)) creates an array of random values in the interval [0, 1)
with m rows and n columns
CODE:
import numpy as np


a = np.zeros((2,5))
print(a)
b = np.ones((2,3))
print(b)
c = np.full((2,2), 12)
print(c)
d = np.eye(2)
print(d)
e = np.random.random((3,2))
print(e)
OUTPUT:

PROGRAM 3:
AIM: To implement slicing.
PROCEDURE: Similar to Python lists, numpy arrays can be sliced. Since arrays
may be multidimensional, you must specify a slice for each dimension of the
array
CODE:
import numpy as np
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
b = a[:3, 1:2]
print(a[0, 1])
b[0, 0] = 23
print(a[0, 1])
OUTPUT:

PROGRAM 4:
AIM: To create an array with different dimensions
PROCEDURE: In this program we take row and column slices of an array and
compare them using the shape attribute, which returns a tuple giving the size
of each dimension of a numpy array. Indexing with an integer drops that
dimension, whereas slicing keeps it.
CODE:
import numpy as np
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
row_r1 = a[1, :]
row_r2 = a[1:2, :]
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)
print(col_r2, col_r2.shape)
OUTPUT:

PROGRAM 5:
AIM: To implement integer array indexing
PROCEDURE: When you index into numpy arrays using slicing, the resulting
array view will always be a subarray of the original array. In contrast, integer
array indexing allows you to construct arbitrary arrays using the data from
another array.
CODE:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
print(a[[0, 1, 2], [0, 1, 0]])
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))
print(a[[0, 0], [1, 1]])
print(np.array([a[0, 1], a[0, 1]]))
OUTPUT:
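The difference between slicing (which returns a view) and integer array indexing (which builds a new array) can be checked directly. The following is a minimal sketch using np.shares_memory; the variable names sliced and picked are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

sliced = a[:2, :1]           # slicing: a view onto a's data
picked = a[[0, 1], [0, 1]]   # integer array indexing: a new array

print(np.shares_memory(a, sliced))  # True
print(np.shares_memory(a, picked))  # False

sliced[0, 0] = 99   # writes through to a
picked[0] = -1      # leaves a untouched
print(a[0, 0])      # 99
```

Because the view shares memory with the original, np.copy or the copy() method is needed whenever an independent slice is wanted.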


PROGRAM 6:
AIM: To implement Boolean array indexing
PROCEDURE: Boolean array indexing lets you pick out arbitrary elements of an
array. Frequently this type of indexing is used to select the elements of an
array that satisfy some condition.
CODE:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2)
print(bool_idx)
print(a[bool_idx])
print(a[a > 2])
OUTPUT:

PROGRAM 7:
AIM: To implement data types
PROCEDURE: Numpy provides a large set of numeric datatypes that you can
use to construct arrays. Numpy tries to guess a datatype when you create an
array, but functions that construct arrays usually also include an optional
argument to explicitly specify the datatype
CODE:
import numpy as np
x = np.array([1, 2])
print(x.dtype)
x = np.array([1.0, 2.0])
print(x.dtype)
x = np.array([1, 2], dtype=np.int64)
print(x.dtype)
OUTPUT:

PROGRAM 8
AIM: To implement math in arrays


PROCEDURE: Basic mathematical functions operate elementwise on arrays,
and are available both as operator overloads and as functions in the numpy
module.
CODE:
import numpy as np
x = np.array([[12,23],[39,40]], dtype=np.float64)
y = np.array([[6,21],[5,17]], dtype=np.float64)
print(x + y)
print(np.add(x, y))
print(x - y)
print(np.subtract(x, y))
print(x * y)
print(np.multiply(x, y))
print(x / y)
print(np.divide(x, y))
print(np.sqrt(x))

OUTPUT:

PROGRAM 9:
AIM: To implement inner product and vector product
PROCEDURE: We use the dot function to compute inner products of vectors, to
multiply a vector by a matrix, and to multiply matrices. dot is available both as
a function in the numpy module and as an instance method of array objects
CODE:


import numpy as np
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])
print(v.dot(w))
print(np.dot(v, w))
print(x.dot(v))
print(np.dot(x, v))

OUTPUT:
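The procedure also mentions multiplying two matrices, which the program above does not print. A minimal sketch of that case follows, reusing the same x and y; the @ operator is shown as an equivalent spelling of np.dot for 2-D arrays.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

product_dot = np.dot(x, y)   # matrix product via the function
product_at = x @ y           # same product via the @ operator
elementwise = x * y          # for contrast: elementwise multiplication

print(product_at)
print(elementwise)
```

Note that x * y multiplies matching elements, so it gives a different result from the matrix product.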

PROGRAM 10
AIM: To implement computation functions.
PROCEDURE: The sum function returns the sum of the elements of an array.
With axis=0 it sums down each column, and with axis=1 it sums along each row.
CODE:
import numpy as np
x = np.array([[1,2],[3,4]])
print(np.sum(x))
print(np.sum(x, axis=0))
print(np.sum(x, axis=1))
OUTPUT:

PROGRAM 11
AIM: To display transpose of a matrix
PROCEDURE: The transpose of a matrix is obtained with arrayname.T. Transposing a one-dimensional array leaves it unchanged.
CODE:
import numpy as np
x = np.array([[1,2], [3,4]])
print(x)
print(x.T)
v = np.array([1,2,3])
print(v)
print(v.T)


OUTPUT:
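Since .T has no visible effect on a one-dimensional array, a column vector has to be made by reshaping. The sketch below illustrates this; the names column and row are illustrative.

```python
import numpy as np

v = np.array([1, 2, 3])

same = v.T                  # 1-D array: transpose changes nothing
column = v.reshape(-1, 1)   # explicit 3x1 column vector
row = column.T              # transposing a 2-D array swaps its axes

print(same.shape)    # (3,)
print(column.shape)  # (3, 1)
print(row.shape)     # (1, 3)
```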


WEEK-1 Dt: 21.10.21


Aim: To demonstrate how the built-in numpy function arange works
Procedure: arange is a built-in numpy function that returns an array of
evenly spaced values within a given interval. arange(start, stop) returns the
values from start up to, but not including, stop. arange(start, stop, step)
returns the values spaced step apart.
Code:
import numpy as np
b=np.arange(1,10)
print(list(b))

Output:

Code:
import numpy as np
b=np.arange(1,9,2)
print(list(b))

Output:

Code:
# arange([start,] stop[, step], dtype=None)
x = np.arange(19.8)
print(x)
x = np.arange(0.8, 19.8,1.0 )
print(x)

Output:

Code:


# 8 values between 1 and 100:


print(np.linspace(1, 100, 8))

Output:

Aim: To demonstrate numpy arrays of various dimensions.


Procedure:
A numpy array is a grid of values, all of the same type, and is indexed by a tuple
of nonnegative integers. The number of dimensions is the rank of the array;
the shape of an array is a tuple of integers giving the size of the array along
each dimension.
Code:
#zero Dimensional Arrays
import numpy as np
l = np.array(89)
print("l: ", l)
print("The type of l: ", type(l))
print("The dimension of l:", np.ndim(l))

Output:

Code:

#one dimensional Arrays


A = np.array([1,3,4,6,10,13,15,19])
B = np.array([2.2,5.9,4.5,1.9,12.8,19.5])
print("A: ", A)
print("B: ", B)
print("Type of A: ", A.dtype)
print("Type of B: ", B.dtype)
print("Dimension of A: ", np.ndim(A))
print("Dimension of B: ", np.ndim(B))

Output:


Code:

# three-dimensional array of shape (3, 2, 4)
M = np.array([[[-12, 100, -903, 901], [-156, -34, 123, 392]],
              [[39, 278, 890, 456], [-12, -279, 125, 580]],
              [[190, -19, -78, 90], [-292, 70, 109, -18]]])

print(M.shape)
print(M)

Output:

Aim: To perform array indexing and slicing operations

Procedure:

Indexing: Array indexing refers to the accessing of elements in the given array.
Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may
be multidimensional, you must specify a slice for each dimension of the array.

Code:
Q = np.array([1,5,14,6,87,24,84])
# print the first element of Q
print(Q[0])
# print the last but one element of Q
print(Q[-2])


Output:

Code:
#slicing ( Single Dimensional Array)
S = np.array([ 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(S[2:4])
print(S[:2])
print(S[3:])
print(S[:]) #prints entire array

Output:

Code:
# slicing (three-dimensional array)
L = np.array([[[-12, 100, -903, 901], [-156, -34, 123, 392]],
              [[39, 278, 890, 456], [-12, -279, 125, 580]],
              [[190, -19, -78, 90], [-292, 70, 109, -18]]])
print(L[1:3, 0:1, 1:4])  # blocks 1 and 2, row 0, columns 1 to 3

Output:


Dt:21.10.2021

PROGRAM 1:
AIM: To write a program using numpy in Python to create an array using dtype.
PROCEDURE: In this program, dtype sets the data type (and therefore the byte size)
of the elements in the array. i4 is declared as np.dtype(np.int32), and arr is then
created with np.array(list_a, dtype=i4), so all the elements of arr are of type
int32. The fractional part of each float value is discarded in the conversion.
CODE:
import numpy as np
i4 = np.dtype(np.int32)
print(i4)
list_a = [ [1.2,2.3,4.5,9.0],[2.4,7.8,4.7,5],[7.9,-5.3,7, 5.9],[4.6,7,9,-6.8]]
arr= np.array(list_a, dtype=i4)   # floats are converted to 32-bit integers
print(arr)
OUTPUT:
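The conversion from float to int32 truncates toward zero rather than rounding. The short sketch below shows the difference on a few of the values used above; the variable names and the use of np.rint for comparison are illustrative additions.

```python
import numpy as np

values = [1.2, -5.3, 9.0, -6.8]

truncated = np.array(values, dtype=np.int32)   # drops the fractional part
rounded = np.rint(values).astype(np.int32)     # rounds to nearest first

print(truncated)  # [ 1 -5  9 -6]
print(rounded)    # [ 1 -5  9 -7]
```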

PROGRAM 2:
AIM: To write a program to create an array using dtype and to show the repr()
function.
PROCEDURE: In this program, dtype defines the layout of a structured array. A
structured dtype can give a different data type (and byte size) to each named
column (field) of the array. repr() shows the internal representation of the
array, including its dtype.
CODE:
import numpy as np
dt = np.dtype([('area', np.int32)])
arr = np.array([(2357,), (1456,), (6789,)], dtype=dt)   # one record per 1-tuple
print(arr)
print("Internal representation:")
print(repr(arr))
OUTPUT:


PROGRAM 3:
AIM : To write a program to create an array which shows different datatypes in
different columns of the array.
PROCEDURE: In this program, dtype is used to create the layout for the array.
A structured dtype can give a different data type (and byte size) to each column
of the array. Some slicing and indexing operations are then performed on
the array arr1.
CODE:
d=np.dtype([('product','S20'),('productId','i4'),('Price',np.float64)])
arr1= np.array([('Pen',245,20.4),
('Pencil',304,35.8),
('Book',498,57),
('Mask',268,10),
('Sanitiser',468,59.9)],dtype=d)
print(arr1)
print(repr(arr1))
print(arr1[1])
print(arr1[1][2])
print(arr1[1:])
OUTPUT:
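Besides indexing by position, the columns of a structured array can be selected by field name, which is often more convenient for filtering and sorting. The following is a minimal sketch on a shortened copy of the data; it assumes a Unicode field ('U20') instead of the byte-string field used above so that the names print without a b prefix.

```python
import numpy as np

# Assumption: 'U20' (Unicode) is used here in place of 'S20' (bytes).
item_dtype = np.dtype([("product", "U20"), ("productId", "i4"), ("Price", np.float64)])
items = np.array([("Pen", 245, 20.4),
                  ("Pencil", 304, 35.8),
                  ("Book", 498, 57.0)], dtype=item_dtype)

prices = items["Price"]                        # one column, selected by field name
expensive = items[items["Price"] > 30]         # rows filtered on a field
cheapest_first = np.sort(items, order="Price")  # rows sorted by a field

print(prices)
print(expensive["product"])
print(cheapest_first["product"][0])
```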

PROGRAM 4:
AIM : To write a program to save the array to a file using savetxt and print data
from the file.
PROCEDURE: np.savetxt is used to save an array to a text file in the required
format. np.genfromtxt is one of the numpy functions that reads tabular data
from a file and returns it as an array.
CODE:
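# write one record per line, fields separated by ';'
np.savetxt("products.csv",
           arr1,
           fmt="%s;%d;%d",    # the second %d writes Price as a whole number
           delimiter=";")
# read the file back; Price is now an integer field
d=np.dtype([('product','S20'),('productId','i4'),('Price','i4')])
a7 = np.genfromtxt("products.csv",
                   dtype=d,
                   delimiter=";")
print(a7)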
OUTPUT:
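A save-and-load round trip is easiest to verify on a plain numeric array. The sketch below is a simplified illustration under that assumption: it writes a small integer table (hypothetical values) to a temporary file with np.savetxt and reads it back with np.loadtxt.

```python
import os
import tempfile

import numpy as np

# Hypothetical numeric table: product id and whole-number price.
table = np.array([[245, 20], [304, 35], [498, 57]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.csv")
    np.savetxt(path, table, fmt="%d", delimiter=";")      # write as text
    restored = np.loadtxt(path, delimiter=";", dtype=int)  # read it back

print(np.array_equal(table, restored))  # True
```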


Dt: 28.10.21
PROGRAM 1
AIM: To demonstrate pandas series
PROCEDURE: A pandas Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, Python objects, etc.). The axis
labels are collectively called the index. A Series is comparable to a column in an
Excel sheet. Labels need not be unique but must be of a hashable type.
CODE:
import pandas as pd
A=pd.Series([12,40,23,17])
A
OUTPUT:

PROGRAM 2:
AIM: To access single values from pandas series
PROCEDURE: A Series can be given a custom index when it is created. Each
value can then be accessed by its label, for example I['red'], as well as by
position. Labels need not be unique but must be of a hashable type.
PROGRAM CODE:
colors=['blue','red','black','white']
codes=[12,40,23,17]
I=pd.Series(codes,index=colors)
I
OUTPUT:

PROGRAM 3
AIM: To demonstrate addition on pandas series
PROCEDURE: When two Series are added, pandas aligns the values by index label
rather than by position. Labels present in both Series are summed; labels
present in only one of them produce NaN. The built-in sum() adds up all the
values of a single Series.
CODE:
colors=['blue','red','black','white']
colors1=['blue','orange','black','green']
T=pd.Series([12,23,40,17],index=colors)
Y=pd.Series([5,12,16,39],index=colors1)
print(T+Y)
print(sum(T))
OUTPUT:
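Adding Series whose labels only partly match produces NaN for every label that is missing on one side. If a missing label should instead count as zero, the add() method accepts a fill_value. The sketch below reuses the two Series from this program; the variable names plain_sum and filled_sum are illustrative.

```python
import pandas as pd

T = pd.Series([12, 23, 40, 17], index=["blue", "red", "black", "white"])
Y = pd.Series([5, 12, 16, 39], index=["blue", "orange", "black", "green"])

plain_sum = T + Y                    # NaN where a label is missing on one side
filled_sum = T.add(Y, fill_value=0)  # missing labels are treated as 0

print(plain_sum)
print(filled_sum)
```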

PROGRAM 4
AIM: To demonstrate how to handle missing values in pandas series
PROCEDURE: Addition of two Series is aligned on the index labels. Here the
two Series share no labels, so every entry of T+Y is a missing value (NaN).
The built-in sum() still works on a single Series because it has no missing
values.
CODE:
colors=['blue','red','black','white']
colors1=['pink','orange','yellow','green']
T=pd.Series([12,23,40,17],index=colors)
Y=pd.Series([5,12,16,39],index=colors1)
print(T+Y)
print(sum(T))

OUTPUT:


PROGRAM 5
AIM: To demonstrate pandas isnull() and notnull() function
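PROCEDURE: isnull() returns a boolean object of the same size indicating whether
each value is NA. NA values, such as None or numpy.nan, are mapped to True;
everything else is mapped to False. Empty strings '' and numpy.inf are not
considered NA values. notnull() returns the opposite. Here the labels "Poland"
and "Berlin" are not keys of the dictionary, so their entries are NaN.
CODE:
# values to look up by label
cities={"Australia":123456,"China":9324,"Russia":683506,
        "USA":56897,"Cambodia":896764}
my_cities=["USA","Poland","Berlin","China"]
my_city_series=pd.Series(cities,index=my_cities)
print(my_city_series.isnull())
print(my_city_series.notnull())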
OUTPUT:

PROGRAM 7
AIM: To demonstrate pandas dropna () function
PROCEDURE: The dropna() function returns a new Series with the missing values
removed. A Series has only one axis to drop values from. If the inplace
parameter is True, the operation is performed in place and None is returned.
fillna(value) replaces the missing values with the given value instead of
removing them.
CODE:
cities={"Australia":123456,
"China":9324,
"Russia":683506,
"USA":56897,
"Cambodia":896764}
city_series=pd.Series(cities)
print(city_series)
print(my_city_series.dropna())
print(my_city_series.fillna(0))
OUTPUT:


Dt: 11.11.21

Program 1:
AIM : To plot a 2-d graph using matplotlib.
PROCEDURE: A line plot displays data as a series of points connected by straight
line segments. matplotlib.pyplot is a collection of functions that make matplotlib
work like MATLAB and help to visualise data. In this program, the plot() function
is used to plot the 2-D graph, and xlabel() and ylabel() are used to label its axes.
CODE:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x=[100,200,300]
y=[400,500,600]
plt.plot(x,y)

OUTPUT:

CODE:
plt.title("Graph")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.plot(x,y)


OUTPUT :

PROGRAM 2:
AIM: To plot a 2-D graph and demonstrate the use of title(), fontdict, xticks() and yticks().
PROCEDURE: In this program, title() sets the title of the graph. fontdict is used to
style the title, for example to set its font name and font size. xticks() and yticks()
set the tick locations on the x-axis and y-axis.
CODE:
x=[100,200,300]
y=[400,500,600]
plt.plot(x,y)
plt.title("Graph",fontdict={'fontname':'FreeSerif','fontsize':20})
plt.xlabel("X")
plt.ylabel("Y")
plt.xticks([60,100,140,180,220,260,300])
plt.yticks([400,500,600,700,800])
plt.show()

OUTPUT:


PROGRAM 3:

AIM: To plot a 2-D graph and demonstrate the use of the plot() function.
PROCEDURE: In this program, the arguments of plot() set the color, line width and
line style of the plot, as well as the marker, marker size and marker edge color.
CODE :
x=[100,200,300]
y=[400,500,600]
plt.plot(x,y,label='x+300',color="blue",linewidth=3,linestyle='--'
,marker="*",markersize=12,markeredgecolor="red")
plt.title("Graph",fontdict={'fontname':'FreeSerif','fontsize':20})
plt.xlabel("X")
plt.ylabel("Y")
plt.xticks([100,140,180,220,260,300])
plt.yticks([300,350,400,450,500,550,600])
plt.legend()
plt.show()
OUTPUT:

PROGRAM 4:
AIM : To draw multiple plots using plot() function and save the figure.
PROCEDURE: In this program, the plots of x+1, x^2 and x^3 are drawn using plot().
The savefig() function saves the figure; savefig('linegraph.png', dpi=300) saves it
as linegraph.png at 300 dpi.

CODE:
x=[1,1.2,1.4,1.6]
y=[2,2.2,2.4,2.6]
plt.plot(x,y,'b*--',label='x+1')
plt.title("Graph",fontdict={'fontname':'FreeSerif','fontsize':20})
x2=np.arange(0,2.5,0.5)
plt.plot(x2,x2**2,'g^--',label='x^2')
plt.plot(x2,x2**3,'r',label='x^3')
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.savefig('linegraph.png',dpi=300)   # save after adding the legend so it appears in the file
plt.show()

OUTPUT:

PROGRAM 5:

AIM : To plot a bar graph.


PROCEDURE: A bar plot presents categorical data with rectangular bars whose
lengths are proportional to the values that they represent. A bar plot shows
comparisons among discrete categories. One axis of the plot shows the specific
categories being compared, and the other axis represents a measured value.
set_hatch() is used to give each bar of the bar plot a different hatch pattern.
CODE:

labels=['a','b','c']
values=[10,20,30]
b=plt.bar(labels,values)
OUTPUT:

CODE:
labels=['a','b','c']
values=[10,20,30]
b=plt.bar(labels,values)
b[0].set_hatch('/')
b[1].set_hatch('*')
b[2].set_hatch('.')

OUTPUT:

CODE:
labels=['a','b','c']
values=[10,20,30]
b=plt.bar(labels,values)
patterns=['.','/',"*"]
# give each bar its own hatch pattern
for bar, pattern in zip(b, patterns):
    bar.set_hatch(pattern)

OUTPUT:


Dt:18.11.2021
Program 1:
Aim: To demonstrate plots for gas prices datasets
Procedure:
A line plot displays data as a series of points connected by straight line
segments.
A legend is an area describing the elements of the graph. The matplotlib
function legend() places a legend on the axes.
Format string: [color][marker][linestyle], for example 'b*--'.
Program Code:

import matplotlib.pyplot as plt


import numpy as np
import pandas as pd
plt.title('Gas Prices (in USD)',fontdict={'fontweight':'bold','fontsize':10})
gas=pd.read_csv('gasprices.csv')
plt.plot(gas.Year,gas.USA,label='United States')
plt.plot(gas.Year,gas.Canada,label='Canada')
plt.plot(gas.Year,gas['South Korea'],label='S K')
plt.legend()
plt.show()

Output:


Code:
for country in gas:
    print(country)
for country in gas:
    if country!='Year':
        plt.plot(gas.Year,gas[country],marker='.',label=country)
print(gas.Year[::3])
plt.xticks(gas.Year[::3])
plt.xlabel('Year')
plt.ylabel('US Dollars')
plt.legend()
plt.show()


Output:

Program 2:
Aim: To read data from fifa dataset
Procedure:
The pandas function read_csv() reads a CSV file into a DataFrame, and head(n)
returns its first n rows.

Program Code:
fifa=pd.read_csv('fifa_data.csv')
fifa.head(5)

Output:

Program 3:
Aim: To represent data from fifa dataset using histograms
Procedure:
A histogram is a bar-style representation of the distribution of numerical
data. The range of values is divided into intervals (bins) along the x-axis,
and the height of each column on the y-axis shows the number of data values
that fall in that bin.
Program Code:
plt.hist(fifa.Overall)
plt.show()

Output:


Program Code:

bins=[40,50,60,70,80,90,100]
plt.figure(figsize=(6,5))
plt.hist(fifa.Overall,bins=bins,color='blue')
plt.xticks(bins)
plt.ylabel('Number of Players')
plt.xlabel('Skill Level')
plt.title('Distribution of Player Skills in FIFA 2018')
plt.savefig('histogram.png',dpi=300)
plt.show()
Output:
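The bar heights that plt.hist() draws can also be obtained as numbers with np.histogram, which is useful for checking a plot. The sketch below uses the same bin edges as the program above; the ratings list is hypothetical and merely stands in for the fifa.Overall column.

```python
import numpy as np

bins = [40, 50, 60, 70, 80, 90, 100]
ratings = [45, 55, 58, 65, 95, 100]   # hypothetical stand-in for fifa.Overall

# Each bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(ratings, bins=bins)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3}-{right:<3} {count}")
```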


Program 4:
Aim: To represent data from the fifa dataset using a pie chart of preferred foot
Procedure:
A pie chart (or a circle chart) is a circular statistical graphic, which is divided
into slices to illustrate numerical proportion. In a pie chart, the arc length of
each slice (and consequently its central angle and area), is proportional to the
quantity it represents.

Program Code:

# count() gives per-column counts; iloc[0] takes the first column's count by position
l=fifa.loc[fifa['Preferred Foot']=='Left'].count().iloc[0]
r=fifa.loc[fifa['Preferred Foot']=='Right'].count().iloc[0]
labels=['Left','Right']
colors=['y','g']
plt.pie([l,r],labels=labels,colors=colors,autopct='%.1f%%')
plt.title('Foot Preference of FIFA Players')
plt.show()

Output:

Program 5:

Aim: To represent data from the fifa dataset using a pie chart of player weights
Procedure:
A pie chart (or a circle chart) is a circular statistical graphic, which is divided
into slices to illustrate numerical proportion. In a pie chart, the arc length of


each slice (and consequently its central angle and area), is proportional to the
quantity it represents.

Program Code:

# assumes fifa.Weight holds numeric values in lbs
light=fifa.loc[fifa.Weight<125].count().iloc[0]
light_medium=fifa[(fifa.Weight>=125)&(fifa.Weight<150)].count().iloc[0]
medium=fifa[(fifa.Weight>=150)&(fifa.Weight<175)].count().iloc[0]
medium_heavy=fifa[(fifa.Weight>=175)&(fifa.Weight<200)].count().iloc[0]
heavy=fifa[(fifa.Weight>=200)].count().iloc[0]
labels=['Under 125','125-150','150-175','175-200','Over 200']
weights=[light,light_medium,medium,medium_heavy,heavy]
plt.pie(weights,labels=labels)
plt.title('Weight of professional Soccer Players(lbs)')
plt.show()

Output:

Program 6:
Aim: To demonstrate box plots for fifa dataset

Procedure:
Boxplots are a standardized way of displaying the distribution of data based on
a five number summary (“minimum”, first quartile (Q1), median, third quartile
(Q3), and “maximum”).

Program Code:
barcelona=fifa.loc[fifa.Club=='FC Barcelona']['Overall']


madrid=fifa.loc[fifa.Club=='Real Madrid']['Overall']
bp=plt.boxplot([barcelona,madrid])
plt.title('Professional Soccer Team Comparison')
plt.ylabel('FIFA Overall Rating')
plt.show()

Output:

Program 7:
Aim: To demonstrate box plots for fifa dataset

Procedure:
Boxplots are a standardized way of displaying the distribution of data based on
a five number summary (“minimum”, first quartile (Q1), median, third quartile
(Q3), and “maximum”).

Program Code:

barcelona=fifa.loc[fifa.Club=='FC Barcelona']['Overall']
madrid=fifa.loc[fifa.Club=='Real Madrid']['Overall']
rev=fifa.loc[fifa.Club=='New England Revolution']['Overall']
labels=['FC Barcelona','Real Madrid','New England Revolution']
bp=plt.boxplot([barcelona,madrid,rev],labels=labels,patch_artist=True)
for box in bp['boxes']:
    box.set(color='b',linewidth=2)   # outline of each box
    box.set(facecolor='y')           # fill colour (needs patch_artist=True)
plt.title('Professional Soccer Team Comparison')
plt.ylabel('FIFA Overall Rating')


plt.show()
Output:

Dt:25.11.21
PROGRAM 1
AIM: To implement scatter plot
PROCEDURE: With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same
length, one for the values of the x-axis, and one for values on the y-axis
CODE:
import matplotlib.pyplot as plt
import numpy as np
price=np.asarray([23.3,23.20,20.3,10.3,3.2])
sales_per_day=np.asarray([10,20,30,40,50])
profit_margin=np.asarray([5,10,15,20,25])
low=(0,1,0)
medium=(0,0,1)
high=(1,0,0)
sugar_cont=[low,high,high,medium,high]
plt.scatter(x=price,y=sales_per_day,s=profit_margin*10,c=sugar_cont)
plt.show()
OUTPUT:

CODE:
import matplotlib.pyplot as plt
import numpy as np
low=(0,1,0)
medium=(0,0,1)
high=(1,0,0)
price_orange=np.asarray([23.3,23.20,20.3,10.3,3.2])
sales_per_day_orange=np.asarray([10,20,30,40,50])
profit_margin_orange=np.asarray([5,10,15,20,25])
sugar_cont_orange=[low,high,high,medium,high]
price_cereal = np.asarray([1.50, 2.50, 1.15, 1.95])

sales_per_day_cereal = np.asarray([67, 34, 36, 12])
profit_margin_cereal = np.asarray([20,12,7,9])
sugar_cont_cereal = [low, high, medium, low]
plt.scatter(x=price_orange,y=sales_per_day_orange,s=profit_margin_orange*10,
            c=sugar_cont_orange,marker="X")
plt.scatter(x=price_cereal,y=sales_per_day_cereal,s=profit_margin_cereal*10,
            c=sugar_cont_cereal,marker="D")
plt.show()
OUTPUT:

PROGRAM 2
AIM: To demonstrate plot function
DESCRIPTION: In this program, the kind parameter of plot() determines the type of plot
required, and figsize sets the size of the figure in inches.
CODE:
import pandas as pd
plotdata = pd.DataFrame({
"2018":[57,67,77,83],
"2019":[68,73,80,79],
"2020":[73,78,80,85]},
index=["Django", "Gafur", "Tommy", "Ronnie"])
plotdata.plot(kind="bar",figsize=(15, 8))
plt.title("FIFA ratings")
plt.xlabel("Footballer")
plt.ylabel("Ratings")
OUTPUT:


PROGRAM 3
AIM: To implement a stacked bar plot
DESCRIPTION: With stacked=True, the bars for each index entry are plotted one above the other, like a pile.
CODE:
import pandas as pd
plotdata = pd.DataFrame({
"2018":[57,67,77,83],
"2019":[68,73,80,79],
"2020":[73,78,80,85]},
index=["Django", "Gafur", "Tommy", "Ronnie"])
plotdata.plot(kind="bar",figsize=(15, 8),stacked=True)
plt.title("FIFA ratings")
plt.xlabel("Footballer")
plt.ylabel("Ratings")
OUTPUT:

PROGRAM 4:
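AIM: To plot the n most frequent values of a column from a CSV file
PROCEDURE: value_counts() returns the number of occurrences of each distinct value in a
column, sorted from most to least frequent. Slicing the result with [:20] keeps the top 20.
CODE:
# df: DataFrame of the medals dataset with a 'Country' column (its loading step is not shown in this record)
top_20 = df['Country'].value_counts()[:20]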
top_20.plot(kind='bar',figsize=(10,8))
plt.title('All Time Medals of top 20 countries')
plt.show()
OUTPUT:

Dt:16.12.21
Program 1:
AIM: To implement box plot without using inbuilt function
PROCEDURE:
Boxplots are a standardized way of displaying the distribution of data based on a five
number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and
“maximum”).
CODE:
#box plot
import matplotlib.pyplot as plt
import numpy as np

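data=[199,201,236,269,271,278,283,291,301,303,341]   # already sorted
n=len(data)
# positions (n+1)/4, (n+1)/2 and 3(n+1)/4 are 1-based; list indices are 0-based
q1=data[(n+1)//4-1]
q2=data[(n+1)//2-1]
q3=data[(n+1)*3//4-1]
iqr= q3-q1
lower= q1-1.5*iqr    # "minimum" of the box plot
upper= q3+1.5*iqr    # "maximum" of the box plot
x= ['min','q1','q2','q3','max']
y= [lower,q1,q2,q3,upper]
plt.boxplot(y)
print(y)
plt.show()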
OUTPUT:
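The hand-computed quartiles can be cross-checked with the standard library. The sketch below uses statistics.quantiles, whose default "exclusive" method places the quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4, and then applies the conventional 1.5 × IQR fences; the names lower_fence and upper_fence are illustrative.

```python
import statistics

data = [199, 201, 236, 269, 271, 278, 283, 291, 301, 303, 341]

# Default method="exclusive": quartile positions are based on n + 1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, q2, q3)                 # 236.0 278.0 301.0
print(lower_fence, upper_fence)   # 138.5 398.5
```

Other quartile conventions (for example the linear interpolation used by default in np.percentile) can give slightly different values for the same data.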

Program 2:
AIM: To plot frequency polygons using input frequency.
PROCEDURE:
A frequency polygon is a line graph of class frequency plotted against class midpoint. It can
be obtained by joining the midpoints of the tops of the rectangles in the histogram
CODE:
#q2 frequency polygons using frequency
import matplotlib.pyplot as plt
import numpy as np
range_bin=[5.5,10.5,15.5,20.5,25.5,30.5,35.5,40.5]
freq=[1,3,2,4,5,3,2]
l= len(range_bin)
r=[]   # class midpoints
for i in range(l-1):
    x= (range_bin[i]+range_bin[i+1])/2
    r.append(x)

40
CSE-4 FDS RECORD 160120748017
plt.plot(r,freq,marker="*")
plt.xticks([0,5.5,10.5,15.5,20.5,25.5,30.5,35.5,40.5,45.5])
plt.xlabel("Bin range")
plt.ylabel("Frequency")
plt.show()
OUTPUT:

Program 3:
AIM: To plot relative frequency polygon with given input frequencies.
PROCEDURE:
A relative frequency polygon has peaks that represent the percentage of total data points
falling within the interval.
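The relative frequencies themselves are a short calculation. The sketch below uses the class frequencies from this program; the variable names total, rel_freq and percent are chosen here for illustration.

```python
freq = [1, 3, 2, 4, 5, 3, 2]           # class frequencies used in the program
total = sum(freq)                      # total number of observations
rel_freq = [f / total for f in freq]   # share of the data falling in each class
percent = [100 * r for r in rel_freq]  # the same shares expressed as percentages
print(total)
print(rel_freq)
```

The relative frequencies always add up to 1, which is a quick check that no class was missed.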
CODE:
#q3 frequency polygons using relative frequency
import matplotlib.pyplot as plt
import numpy as np
range_bin=[5.5,10.5,15.5,20.5,25.5,30.5,35.5,40.5]
freq=[1,3,2,4,5,3,2]
p=len(freq)
# total number of observations
c_freq=0
for i in range(p):
    c_freq=c_freq+freq[i]
print(c_freq)
# relative frequency of each class
c_f=[]
for i in range(p):
    c_f.append(freq[i]/c_freq)
l= len(range_bin)
# class midpoints
r=[]
for i in range(l-1):
    x= (range_bin[i]+range_bin[i+1])/2
    r.append(x)
plt.plot(r,c_f,marker="*")
plt.xticks([0,5.5,10.5,15.5,20.5,25.5,30.5,35.5,40.5,45.5])
plt.xlabel("Bin range")
plt.ylabel("Relative frequency")
plt.show()
OUTPUT:

Program 4:
AIM: To demonstrate stem and leaf plot
PROCEDURE:
Stem and leaf plot is a way of organizing data into a form that makes it easy to observe the
frequency of different types of values.
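A compact way to build and print the plot is sketched below. It assumes every value is a non-negative integer, takes the last digit as the leaf and the remaining digits as the stem; the function name stem_and_leaf is chosen here for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 143 -> stem 14, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

data = [143, 163, 154, 159, 172, 165, 162, 171, 146, 165,
        176, 145, 165, 182, 175, 186, 160, 158, 167, 172]
result = stem_and_leaf(data)
for stem, leaves in result.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

The longest row (stem 16) shows at a glance where most of the values lie.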
CODE:
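#stem and leaf plot
x=[143,163,154,159,172,165,162,171,146,165,176,145,165,182,175,186,160,158,167,172]
x=sorted(x)
dict_a={}
# create an empty leaf list for every stem (the first two digits)
for i in x :
    s= str(i)
    y=int(s[0:2])
    dict_a[y]=[]
# append the last digit of each value to the list of its stem
for i in x:
    s= str(i)
    y= int(s[0:2])
    z= int(s[2])
    dict_a[y].append(z)
# display the plot, one stem per row
for stem in dict_a:
    print(stem, "|", *dict_a[stem])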
OUTPUT:


Dt: 06-01-2022
Program-1
AIM: To implement one sample t test
PROCEDURE:
1. Identify Null hypothesis for the given problem.
2. Calculate mean of the given data set.

3. Calculate s value by using the formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
where $\bar{x}$ is the mean of the given data set and
n is the size of the data set
4. Find degrees of freedom i.e., n-1.
5. By using degrees of freedom find t critical value.
6. Calculate t value by using the formula:
t=(x-mu)/(s/n**0.5)
7. If t calculated is less than t critical then Null hypothesis is accepted or else rejected.
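Step 5 is usually done with a t table, but the critical value can also be computed. The sketch below assumes a significance level of 0.05 (the level is not stated in the procedure) and uses scipy.stats.t.ppf, the inverse of the t distribution's cumulative distribution function.

```python
from scipy import stats

n = 10        # size of the data set used in the program
alpha = 0.05  # assumed significance level
df = n - 1    # degrees of freedom

t_one_tailed = stats.t.ppf(1 - alpha, df)      # critical value for a one-tailed test
t_two_tailed = stats.t.ppf(1 - alpha / 2, df)  # critical value for a two-tailed test
print(round(t_one_tailed, 3), round(t_two_tailed, 3))
```

The one-tailed value rounds to the 1.83 used in the code, which suggests the program treats the test as one-tailed.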
CODE:
#t test
#one sample
x=[90,98,110,150,200,91,82,80,110,96]
# sample mean
su=0
for j in x:
    su=su+j
n=len(x)
avg=su/n
b=90  # hypothesised population mean
# sample standard deviation with n-1 in the denominator
sd=0
for i in x:
    sd=sd+(i-avg)**2
s=(sd/(n-1))**(0.5)
tstat=(avg-b)/(s/(n)**(0.5))
print(tstat)
tcritic=1.83  # t critical value for n-1 = 9 degrees of freedom
if tstat<tcritic:
    print("accept NH")
else:
    print("reject NH")

OUTPUT:

USING SCIPY
#one sample t-test
from scipy import stats
data=[90,98,110,150,200,91,82,80,110,96]
# ttest_1samp returns the t statistic and the two-sided p value
t,p=stats.ttest_1samp(data,90)
print("tstat: ",t)
tcr=1.83
print("tcritical: ",tcr)
if t<tcr:
    print("Null hypothesis is accepted")
else:
    print("Null hypothesis is rejected")
OUTPUT:

Program 2
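AIM: To implement the unpaired unequal variance t-test (Welch's t-test)
PROCEDURE:
1. Identify the null hypothesis for the given problem.
2. Calculate the first sample mean and the second sample mean for the given data sets.
3. Calculate s1 and s2 values by using the formula (applied to each sample separately):
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
4. Find the degrees of freedom:
$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
5. t can be calculated using the formula:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
where $s_1$ is the standard deviation of the first sample,
$s_2$ is the standard deviation of the second sample,
$\bar{x}_1$, $\bar{x}_2$ are the means of the first and second samples, and
$n_1$, $n_2$ are the sizes of the first and second samples.
6. If t calculated is less than t critical then the null hypothesis is accepted, or else it is rejected.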
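The degrees of freedom in step 4 are the easiest part to get wrong, so a separate sketch may help. It assumes the standard Welch-Satterthwaite approximation; the function name welch_df and the numbers passed to it are chosen here for illustration.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom from two standard deviations and sizes."""
    v1 = s1 ** 2 / n1  # squared standard error of the first mean
    v2 = s2 ** 2 / n2  # squared standard error of the second mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# With equal spreads and equal sizes the result matches n1 + n2 - 2
print(welch_df(2.0, 10, 2.0, 10))
# With unequal spreads it lies between min(n1, n2) - 1 and n1 + n2 - 2
print(welch_df(1.0, 5, 3.0, 20))
```

The result is generally not a whole number; when a t table is used it is commonly rounded down to the nearest integer.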
CODE:
import numpy as np
x=[13.5,23,13.2,12.7,22.1,17.5,20.1,22.5,19.0,21.9,13.2,11,12.8,13.1,11.6,23.0,13.2,22.9,13.1]
y=[10.1,27.6,13.8,13.1,25.6,26.7,28.9,30.1,25.4,21.9,12.1,13.4,12.3,11.9,22.2,12.3,22.2]
nx=len(x)
ny=len(y)
meanx=sum(x)/nx
meany=sum(y)/ny
def standard_deviation(a,mean):
    # sample standard deviation with n-1 in the denominator
    total=0
    n=len(a)
    for i in a:
        c= (i-mean)**2
        total= total+ c
    variance= total/(n-1)
    sd=(variance)**(0.5)
    return sd
sdx= standard_deviation(x,meanx)
sdy= standard_deviation(y,meany)
# squared standard errors of the two sample means
vx=(sdx**2)/nx
vy=(sdy**2)/ny
# Welch-Satterthwaite degrees of freedom: the whole numerator is squared
df=((vx+vy)**2)/(((vx**2)/(nx-1)) +((vy**2)/(ny-1)))
t_stat= (meanx-meany)/(np.sqrt(vx+vy))
print("Degree of freedom :",df)
print("t-statistic value :",t_stat)
# F test on the two sample variances
f_stat= (sdy**2)/(sdx**2)
f_critic=2.23
tcritic=2.052
if f_stat>f_critic :
    print("Unequal variances")
else:
    print("Equal variances")
# 2.052 is a two-tailed critical value, so the magnitude of t is compared
if abs(t_stat)<tcritic:
    print("Null hypothesis is accepted")
else:
    print("Null hypothesis is rejected")
OUTPUT:

USING SCIPY
from scipy import stats
x=[13.5,23,13.2,12.7,22.1,17.5,20.1,22.5,19.0,21.9,13.2,11,12.8,13.1,11.6,23.0,13.2,22.9,13.1]
y=[10.1,27.6,13.8,13.1,25.6,26.7,28.9,30.1,25.4,21.9,12.1,13.4,12.3,11.9,22.2,12.3,22.2]
# equal_var=False selects the unequal variance (Welch) form of the test
res=stats.ttest_ind(x,y,equal_var=False)
print(res)
t_crit=2.052
# a t statistic beyond the critical value is evidence against the null hypothesis
if abs(res[0]) > t_crit:
    print("Null hypothesis is rejected.")
else:
    print("Null hypothesis is accepted.")

OUTPUT:

Program 3
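AIM: To implement the unpaired equal variance t-test
PROCEDURE:
1. Identify the null hypothesis for the given problem.
2. Calculate the first sample mean and the second sample mean as $\bar{x}_1$ and $\bar{x}_2$.
3. Calculate s1 and s2 values by using the formula (applied to each sample separately):
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
4. f can be calculated by using
$F = \frac{s_2^2}{s_1^2}$
5. If f calculated is less than f critical, then it denotes that the variances of the samples are
equal.
6. Find the degrees of freedom:
$df = n_1 + n_2 - 2$
7. By using the degrees of freedom, find the t critical value.
8. Calculate the pooled sample variance and the t value by using the formulae
$S_p^2 = \frac{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2}$
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
9. If t calculated is less than t critical then the null hypothesis is accepted, or else it is rejected.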
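Steps 6 and 8 can be checked by hand on a very small example. The sketch below is a plain Python version of the pooled variance and t value; the function name pooled_t and the two three-value samples are made up for illustration and are not the lab data.

```python
import math

def pooled_t(sample1, sample2):
    """Return (pooled variance, t value, degrees of freedom) for two samples."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((v - mean1) ** 2 for v in sample1)  # squared deviations, sample 1
    ss2 = sum((v - mean2) ** 2 for v in sample2)  # squared deviations, sample 2
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df

pooled_var, t, df = pooled_t([1, 2, 3], [2, 3, 4])
print(pooled_var, t, df)
```

Here each sample contributes a sum of squares of 2, so the pooled variance is 4/4 = 1 and t = -1/sqrt(2/3).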
CODE:
import numpy as np
a=[23,15,16,25,20,17,18,14,12,19,21,22]
b= [16,21,16,11,24,21,18,15,19,22,13,24]
f_critic= 2.82
na=len(a)
sa=sum(a)
mean_a=sa/na
nb= len(b)
sb= sum(b)
mean_b= sb/nb
def standard_deviation(a,mean):
    # sample standard deviation with n-1 in the denominator
    x=0
    n=len(a)
    for i in a:
        c= (i-mean)**2
        x= x+ c
    variance= x/(n-1)
    sd=(variance)**(0.5)
    return sd
sda= standard_deviation(a,mean_a)
sdb= standard_deviation(b,mean_b)
# F test on the two sample variances
f_stat=(sdb/sda)**2
print("F-stat:",f_stat)
if f_stat>f_critic:
    print("Variances are unequal")
else:
    print("Equal variances")
def pooledSV(X,Y):
    # pooled sample variance of the two samples
    n1, n2 = len(X), len(Y)
    xbar, ybar = np.mean(X), np.mean(Y)
    sum1 = sum([(x - xbar)**2 for x in X])
    sum2 = sum([(y - ybar)**2 for y in Y])
    return (sum1+sum2)/ (n1+n2-2)
S2 = pooledSV(a,b)
print("Pooled Sample Variance:{}".format(S2))
t_value = (mean_a-mean_b)/ (np.sqrt(S2 * (1/na + 1/nb)))
print("t-value :",t_value)
# one-tailed critical value for na+nb-2 = 22 degrees of freedom at the 0.05 level
tcritic=1.717
if t_value<tcritic:
    print("Null hypothesis is accepted")
else:
    print("Null hypothesis is rejected")
OUTPUT:


USING SCIPY
from scipy import stats
x= [23, 15, 16, 25, 20, 17, 18, 14, 12, 19, 21, 22]
y= [16, 21, 16, 11, 24, 21, 18, 15, 19, 22, 30, 24]
# equal_var=True selects the pooled variance form of the test
res = stats.ttest_ind(x,y, equal_var=True)
print(res)
t_crit = 1.717
# a t statistic beyond the critical value is evidence against the null hypothesis
if abs(res[0]) > t_crit:
    print("Null hypothesis is rejected.")
else:
    print("Null hypothesis is accepted.")
OUTPUT:

Program 4:
AIM: To implement the paired t-test
PROCEDURE:
1. Identify the null hypothesis for the given problem.

2. Find the difference between the paired samples, i.e. d, and d^2.

3. Find the degrees of freedom, i.e., n-1.

4. By using the degrees of freedom, find t critical.

5. Calculate the t value by using the formula:
$t = \frac{\sum d}{\sqrt{\frac{n \sum d^2 - (\sum d)^2}{n - 1}}}$

6. If t calculated is less than t critical, the null hypothesis is accepted, or else it is rejected.
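The t value in step 5 can equivalently be written as the mean difference divided by its standard error, t = d̄ / (s_d / √n). The sketch below computes it in that form on the pre-test and post-test scores of this program and compares it with the sum-based formula; the function name paired_t is chosen here for illustration.

```python
import math

def paired_t(before, after):
    """Paired t statistic: mean difference divided by its standard error."""
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

pretest = [23, 25, 28, 25, 25, 26, 25, 22, 30, 35, 40, 35, 30, 30]
posttest = [35, 40, 30, 40, 45, 30, 30, 55, 40, 40, 35, 38, 41, 35]
t_mean_form = paired_t(pretest, posttest)

# The same statistic from the sums of d and d^2
d = [b - a for b, a in zip(pretest, posttest)]
n = len(d)
t_sum_form = sum(d) / math.sqrt((n * sum(v * v for v in d) - sum(d) ** 2) / (n - 1))
print(t_mean_form, t_sum_form)
```

Both forms give about -3.87; the sign is negative because the post-test scores are higher on average.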


CODE:
pretest=[23,25,28,25,25,26,25,22,30,35,40,35,30,30]
posttest=[35,40,30,40,45,30,30,55,40,40,35,38,41,35]
D=[]   # differences between the paired observations
D2=[]  # squared differences
n=len(pretest)
for i in range(n):
    d= pretest[i]-posttest[i]
    D.append(d)
    D2.append(d**2)
sD=sum(D)
sD2=sum(D2)
tnum=sD
tden=((n*sD2-(sD*sD))/(n-1))**(1/2)
tstat=tnum/tden
print("t_stat value :",tstat)
# two-tailed critical value for n-1 = 13 degrees of freedom at the 0.05 level
t_critic=2.160
# tstat is negative here (post-test scores are higher), so its magnitude is compared
if abs(tstat)<t_critic:
    print("Null hypothesis is accepted")
else:
    print("Null hypothesis is rejected")
OUTPUT:

USING SCIPY
import pandas as pd
from scipy import stats
#paired t test
pretest=[23,25,28,25,25,26,25,22,30,35,40,35,30,30]
posttest=[35,40,30,40,45,30,30,55,40,40,35,38,41,35]
# ttest_rel returns the t statistic and the two-sided p value
ttest,pvalue=stats.ttest_rel(pretest,posttest)
print("ttest:",ttest)
print("pvalue:",pvalue)
if pvalue<0.05:
    print("Reject Null hypothesis")
else:
    print("Accept Null hypothesis")
OUTPUT:

Program 5:

AIM: To perform one way ANOVA test.


PROCEDURE: Analysis of variance (ANOVA) is a statistical technique that is used to check if
the means of two or more groups are significantly different from each other. The one-way
ANOVA tests the null hypothesis that two or more groups have the same population mean.
The test is applied to samples from two or more groups, possibly with differing sizes.
With k groups, n observations in total, group sizes $n_j$, group means $\bar{x}_j$ and grand mean G:
$SSW = \sum_{j=1}^{k} \sum_{i} (x_{ij} - \bar{x}_j)^2$, $MSSW = \frac{SSW}{n - k}$
$SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - G)^2$, $MSSB = \frac{SSB}{k - 1}$
fstat=mssb/mssw
In the program without scipy, ssw, mssw, ssb, mssb and fstat are calculated in plain Python
with the help of the above formulae. If the f statistic is less than f critical then the null
hypothesis is accepted, or else it is rejected.
In the program with scipy, stats is imported, and with the help of f_oneway() the
f statistic and the p value are generated.
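The calculation can be written once for any number of groups. The sketch below is a plain Python version under the definitions above; the function name one_way_anova and the three small groups are made up so that the result can be verified by hand.

```python
def one_way_anova(*groups):
    """Return (mssb, mssw, f statistic) for two or more groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # between-group sum of squares: group size times squared distance from the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # within-group sum of squares: squared distance of every value from its group mean
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    mssb = ssb / (k - 1)
    mssw = ssw / (n - k)
    return mssb, mssw, mssb / mssw

mssb, mssw, fstat = one_way_anova([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(mssb, mssw, fstat)
```

In this toy example the group means are 2, 3 and 4 around a grand mean of 3, so SSB = 6 and SSW = 6, giving MSSB = 3, MSSW = 1 and F = 3.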

CODE:
1) Without using scipy
#one way anova
g1=[7,7,6,9,7,7,6,7,8,9]
g2=[5,6,3,5,4,6,5,4,5,5,6,7,6]
g3=[1,3,4,3,1,1,2,6,5,4,3,4,5]
# group means
mg1=sum(g1)/len(g1)
mg2=sum(g2)/len(g2)
mg3= sum(g3)/len(g3)
x=0
n1=len(g1)
n2=len(g2)
n3=len(g3)
n= len(g1) +len(g2) +len(g3)
k=3  # number of groups
# within-group sum of squares, accumulated over the three groups
for i in g1:
    y= (i-mg1)**2
    x = x+ y
for i in g2:
    y= (i-mg2)**2
    x = x+ y
for i in g3:
    y= (i-mg3)**2
    x = x+ y
mssw= x/(n-k)
G=(sum(g1)+sum(g2)+sum(g3))/n  # grand mean
# between-group mean square, divided by k-1 degrees of freedom
mssb= ((n1*(mg1-G)**2) + (n2*(mg2-G)**2) +(n3*(mg3-G)**2))/(k-1)
fstat=mssb/mssw
print("Mssb and Mssw are :",mssb,mssw)
fcritic=3.32
if fstat>fcritic :
    print("Reject Null Hypothesis")
else:
    print("Accept null hypothesis")
print("f-stat:",fstat)
OUTPUT:


2) using scipy
from scipy import stats
import numpy as np
g1=[7,7,6,9,7,7,6,7,8,9]
g2=[5,6,3,5,4,6,5,4,5,5,6,7,6]
g3=[1,3,4,3,1,1,2,6,5,4,3,4,5]
# f_oneway returns the F statistic and the p value; print them so a script shows the result
print(stats.f_oneway(g1,g2,g3))
OUTPUT:
