
LABORATORY RECORD

for

OCS353 – Data Science Fundamentals


OPEN ELECTIVE - II

of

B.Tech - Information Technology

Anna University
Regulation - 2021

Batch (2021 - 2025)

DEPARTMENT OF INFORMATION TECHNOLOGY

KCG COLLEGE OF TECHNOLOGY,

CHENNAI – 600097
Bonafide Certificate

REG. No……………..…………………………….

Certified that this is the bonafide record of work done by --------------------------------------------

of ---------------- year ---------------- semester, from ------------------------------------------branch

during the period from July 2024 to Nov 2024 of the academic year 2024 - 2025, for the

subject OCS353 – Data Science Fundamentals Lab.

Faculty in-charge HOD

Submitted for the Anna University practical examination held at KCG COLLEGE

OF TECHNOLOGY, CHENNAI on ………………………….

Internal Examiner External Examiner


VISION OF THE COLLEGE

KCG College of Technology aspires to become a globally recognized Centre of excellence for science, technology & engineering education, committed to quality teaching, learning, and research, while ensuring for every student a unique educational experience which will promote leadership, job creation, social commitment and service to nation building.

MISSION OF THE COLLEGE


 Disseminate knowledge in a rigorous and intellectually stimulating environment
 Facilitate socially responsive research, innovation and entrepreneurship
 Foster holistic development and professional competency
 Nurture the virtue of service and an ethical value system in the young minds

VISION OF THE DEPARTMENT

The Department of Information Technology aspires to become a globally acclaimed center of excellence offering quality education and enabling innovative research in Information Technology, producing competent Information Technology graduates who contribute towards nation building.

MISSION OF THE DEPARTMENT

 Impart knowledge of fundamentals as well as emerging trends in Information Technology.
 Inculcate innovative and entrepreneurial abilities as well as ethical values among the students.
 Establish computing facilities and research activities to enhance the knowledge.
 Enhance competency of faculty with the advanced technologies in Information Technology.
PROGRAMME EDUCATIONAL OBJECTIVES
The graduates will be able to

PEO 1 Stand out as technically skilled professionals in Information Technology and relevant sectors.

PEO 2 Devise, Implement and Deploy software solutions for computational problems.

PEO 3 Build software solutions for the challenging problems in industry and research.

PEO 4 Manifest the ethical values and exhibit social responsibility.

PROGRAMME OUTCOMES AND PROGRAMME SPECIFIC OUTCOMES


After successful completion of B.Tech (IT) programme, the graduates will be able to

PO 1 Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

PO 2 Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

PO 3 Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.

PO 4 Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

PO 5 Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.

PO 9 Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

PO 12 Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
After successful completion of B.Tech (IT) programme, the graduates will be able to

PSO No. Description of PSO

PSO 1 Design systems to solve complex IT-related problems using algorithm analysis, database technology, multimedia, web design, networking and principles of Software Engineering, to face the challenges in corporate and industry settings.

PSO 2 Communicate and function efficiently as an individual and as a member or leader in multidisciplinary teams in the software development process.

PSO 3 Identify the need for sustainable development in software industries and follow the professional code of ethics.

Course Outcomes:

Upon completion of this course, the student will be able to:

Course Outcomes                                                    Highest Cognitive Level

CO 1  Gain knowledge on the data science process.                  K2
CO 2  Perform data manipulation functions using NumPy and Pandas.  K3
CO 3  Combine datasets for aggregation and grouping.               K3
CO 4  Understand different types of machine learning approaches.   K2
CO 5  Perform data visualization using tools.                      K3
CO 6  Handle large volumes of data in practical scenarios.         K3


Department of Information Technology
OCS353 – Data Science Fundamentals Lab
CONTENTS
Exp. No.   Name of the Experiment                                                        Page No   Date   Signature

1.  Download, Install and Explore the Features of Python for Data Analytics
2.  Working with Numpy Arrays
3.  Dataframes Using Pandas
4.  Basic plots using Matplotlib
5.  Statistical and Probability Measures
    a. Frequency Distributions
    b. Averages
    c. Variability
    d. Normal Curve
    e. Correlation and scatter plots
    f. Correlation coefficient
    g. Simple Linear Regression
6.  Data Analysis
    a. Univariate Analysis - Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
    b. Bivariate Analysis - Linear and logistic regression modelling.
7.  Apply and explore various plotting functions on any data set
8.  Visualizing Geographic Data with Basemap
9.  Exploratory Data Analysis
EX.NO: 1 Download, Install and Explore the Features
DATE: of Python for Data Analytics

Python is a high-level, general-purpose programming language with a rich ecosystem of data science and machine learning packages. As a first step, install Python for Windows, macOS, or Linux.

Install Python Packages


Numpy is a numerical computing package for mathematics, science, and engineering.
Many data science packages use Numpy as a dependency.

Ex: pip install numpy

Output:

1
PANDAS
Pandas visualizes and manipulates data tables. There are many functions that allow
efficient manipulation for the preliminary steps of data analysis problems.

Ex: pip install pandas

Output:

Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and
performing statistical tests. It includes descriptive statistics, statistical tests, plotting
functions, and result statistics.

Ex: pip install statsmodels

Output:

2
Scipy:
SciPy is a general-purpose package for mathematics, science, and engineering and
extends the base capabilities of NumPy.

Ex: pip install scipy

Output:
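Once the installs complete, a quick way to confirm the packages are usable is to import each one and print its version. A minimal check (run from the Python interpreter or a small script):

# verify that the installed data analytics packages import correctly
import numpy as np
import pandas as pd
import statsmodels
import scipy

print("NumPy      :", np.__version__)
print("Pandas     :", pd.__version__)
print("Statsmodels:", statsmodels.__version__)
print("SciPy      :", scipy.__version__)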

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:

Thus the Python packages for data analytics have been installed successfully.

3
EX.NO: 2
DATE:
Working with Numpy arrays

AIM
To work with NumPy arrays.

NumPy stands for Numerical Python. It is a Python library for working with arrays. In Python, lists can serve the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array object used in linear algebra, Fourier transforms, and random number generation, and it is much faster than a traditional Python list.

Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array

1. One Dimensional Array:


A one-dimensional array is a type of linear array.

Example: One-Dimensional Array

ALGORITHM
Step1: Start
Step2: Import numpy module
Step3: Print the basic characteristics and operations of the array
Step4: Stop

4
PROGRAM
# importing numpy module
import numpy as np
# creating a list
list1 = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(list1)
print("List in python : ", list1)
print("Numpy Array in python :", sample_array)

Multi-Dimensional Array:

Data in multidimensional arrays are stored in tabular form.

Two Dimensional Array

PROGRAM

# importing numpy module
import numpy as np
# creating lists
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]

# creating numpy array
sample_array = np.array([list_1, list_2, list_3])
print("Numpy multi dimensional array in python\n", sample_array)

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the working with Numpy arrays was successfully completed.

6
EX.NO: 3 Create a Dataframe Using Pandas with
DATE: a List of Elements

AIM:
To create a DataFrame using Pandas from a single list or a list of lists, locate a row, use named indexes, and locate named indexes.

ALGORITHM

Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4: Print the output
Step5: Stop

PROGRAM
CODE 1: A DataFrame can be created using a single list.
import pandas as pd
lst = ['python', 'For', 'first', 'year', 'students', 'interesting', 'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

Output:
0
0 python
1 For
2 first
3 year
4 students
5 interesting
6 programs

7
CODE 2: Locate Row:
import pandas as pd
lst = ['python', 'For', 'first', 'year', 'students', 'interesting', 'programs']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
print(df.loc[0])

Output:
0
0 python
1 For
2 first
3 year
4 students
5 interesting
6 programs
0 python
Name: 0, dtype: object

CODE 3:

Named Indexes, Locate Named Indexes

import pandas as pd
lst = {"list1":['python', 'For', 'first'], "list2":['students', 'interesting', 'programs']}

# Calling DataFrame constructor on list


df = pd.DataFrame(lst, index = ["day1", "day2", "day3"])
print(df)
print(df.loc["day2"])

8
OUTPUT:

list1 list2
day1 python students
day2 For interesting
day3 first programs
list1 For
list2 interesting
Name: day2, dtype: object

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the working with Pandas data frames was successfully completed.

9
EX.NO: 4
DATE: Basic plots using Matplotlib

AIM
To draw basic plots in Python program using Matplotlib

ALGORITHM

Step1: Start
Step2: import Matplotlib module
Step3: Create a Basic plots using Matplotlib
Step4: Print the output
Step5: Stop

Program: 4a

# importing the required module


import matplotlib.pyplot as plt
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()

10
Output:

Program: 4b
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is
# for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis set the marks
plt.xticks(list(range(-3, 10)))

11
# set the intervals by which y-axis
# set the marks
plt.yticks(list(range(-3, 20, 3)))

# legend denotes that what color


# signifies what
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])

# annotate command helps to write


# ON THE GRAPH any text xy denotes
# the position on the graph
plt.annotate('Temperature V / s Days', xy = (1.01, -2.15))

# gives a title to the Graph


plt.title('All Features Discussed')
plt.show()

Output:

Program: 4c

import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
c = [4, 2, 6, 8, 3, 20, 13, 15]
# use fig when you want the output in a new window;
# figsize specifies the size of the displayed figure
fig = plt.figure(figsize =(10, 10))
# creating multiple plots in a
# single plot
sub1 = plt.subplot(2, 2, 1)
sub2 = plt.subplot(2, 2, 2)
sub3 = plt.subplot(2, 2, 3)
sub4 = plt.subplot(2, 2, 4)
sub1.plot(a, 'sb')
# sets how the display subplot
# x axis values advances by 1
# within the specified range
sub1.set_xticks(list(range(0, 10, 1)))
sub1.set_title('1st Rep')
sub2.plot(b, 'or')
# sets how the display subplot x axis
# values advances by 2 within the
# specified range
sub2.set_xticks(list(range(0, 10, 2)))
sub2.set_title('2nd Rep')
# can directly pass a list in the plot
# function instead adding the reference
sub3.plot(list(range(0, 22, 3)), 'vg')
sub3.set_xticks(list(range(0, 10, 1)))
sub3.set_title('3rd Rep')
sub4.plot(c, 'Dm')
# similarly we can set the ticks for
# the y-axis range(start(inclusive),
# end(exclusive), step)
sub4.set_yticks(list(range(0, 24, 2)))
sub4.set_title('4th Rep')
# without writing plt.show(), no plot will be visible
plt.show()

13
Output:

CODE 4d: BAR CHART
from matplotlib import pyplot as plt
plt.bar([0.25, 1.25, 2.25, 3.25, 4.25], [50, 40, 70, 80, 20], label="BMW", width=.5)
plt.bar([.75, 1.75, 2.75, 3.75, 4.75], [80, 20, 20, 50, 60], label="Audi", color='r', width=.5)
plt.legend()
plt.xlabel('Days')
plt.ylabel('Distance (kms)')
plt.title('Information')
plt.show()

14
OUTPUT:

CODE 4e: HISTOGRAM
import matplotlib.pyplot as plt
population_age = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 102, 95, 85, 55, 110, 120, 70, 65,
                  55, 111, 115, 80, 75, 65, 54, 44, 43, 42, 48]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()

15
CODE 4f: AREA PLOT
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
plt.plot([], [], color='m', label='Sleeping', linewidth=5)
plt.plot([], [], color='c', label='Eating', linewidth=5)
plt.plot([], [], color='r', label='Working', linewidth=5)
plt.plot([], [], color='k', label='Playing', linewidth=5)
plt.stackplot(days, sleeping, eating, working, playing, colors=['m', 'c', 'r', 'k'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()

OUTPUT:

16
CODE 4g: PIE CHART
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
slices = [7, 2, 2, 13]
activities = ['sleeping', 'eating', 'working', 'playing']
cols = ['c', 'm', 'r', 'b']
plt.pie(slices, labels=activities, colors=cols, startangle=90, shadow=True,
        explode=(0, 0.1, 0, 0), autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()

OUTPUT

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the basic plots using Matplotlib in Python were completed successfully.

17
EX.NO: 5 a Statistical and Probability measures
DATE: (Frequency Distributions)

AIM
To count the frequency of occurrence of each word in a body of text, a task often needed during text processing.

ALGORITHM

Step 1: Start the Program


Step 2: Create text file blake-poems.txt
Step 3: Import the word tokenize function and Gutenberg
Step 4: Write the code to count the frequency of occurrence of a word in a body of text
Step 5: Print the result
Step 6: Stop the process

Program:
from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")
token = word_tokenize(sample)
wlist = []
# take the first 50 tokens
for i in range(50):
    wlist.append(token[i])
wordfreq = [wlist.count(w) for w in wlist]
print("Pairs\n" + str(list(zip(wlist, wordfreq))))

18
Output:
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3),
('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1),
('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1),
('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2),
('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1), (',', 3),
('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]
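NLTK also provides a FreqDist class that computes such counts directly; a minimal sketch reusing the token list built above:

from nltk import FreqDist

# frequency distribution over all tokens in the poem
fdist = FreqDist(token)
print(fdist.most_common(10))   # the ten most frequent tokens with their counts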

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the program to count the frequency of occurrence of words in a body of text, as needed during text processing, was completed successfully.

19
EX.NO: 5 b Statistical and Probability measures
DATE: (Averages)

AIM
To compute weighted averages in Python, either by defining your own functions or by using NumPy.

ALGORITHM

Step 1: Start the Program


Step 2: Create the employees_salary table and save as .csv file
Step 3: Import packages (pandas and numpy) and the employees_salary table itself:
Step 4: Calculate weighted sum and average using Numpy Average() Function
Step 5 : Stop the process

Program:
# Method using the NumPy average() function

weighted_avg_m3 = round(np.average(df['salary_p_year'], weights = df['employees_number']), 2)

weighted_avg_m3

Output:
44225.35
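For reference, a complete, self-contained version of this computation could look like the sketch below; the file name employees_salary.csv and the column names salary_p_year and employees_number are assumptions based on the algorithm above.

import pandas as pd
import numpy as np

# load the employees_salary table (assumed file and column names)
df = pd.read_csv("employees_salary.csv")

# weighted average salary, weighted by the number of employees in each group
weighted_avg = round(np.average(df['salary_p_year'], weights=df['employees_number']), 2)
print(weighted_avg)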
                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the computation of weighted averages in Python, either by defining your own functions or by using NumPy, was completed successfully.

20
EX.NO: 5 c Statistical and Probability measures
DATE: (Variability)

AIM
To write a python program to calculate the variance.

ALGORITHM

Step 1: Start the Program

Step 2: Import statistics module from statistics import variance

Step 3: Import fractions as parameter values from fractions import Fraction as fr

Step 4: Create tuple of a set of positive and negative numbers

Step 5: Print the variance of each samples

Step 6: Stop the process

Program
# Python code to demonstrate variance()
# function on varying range of data-types
# importing variance from the statistics module
from statistics import variance
# importing Fraction to pass fractions as parameter values
from fractions import Fraction as fr
# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)
# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)
# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))
# tuple of a set of floating point values

sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
# Print the variance of each samples
print("Variance of Sample1 is % s " %(variance(sample1)))
print("Variance of Sample2 is % s " %(variance(sample2)))
print("Variance of Sample3 is % s " %(variance(sample3)))
print("Variance of Sample4 is % s " %(variance(sample4)))
print("Variance of Sample5 is % s " %(variance(sample5)))

Output:

Variance of Sample 1 is 15.80952380952381


Variance of Sample 2 is 3.5
Variance of Sample 3 is 61.125
Variance of Sample 4 is 1/45
Variance of Sample 5 is 0.17613000000000006

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the computation for variance was successfully completed.

22
EX.NO: 5 d Statistical and Probability measures
DATE: (Normal Curve)

AIM:
To create a normal curve using python program.

ALGORITHM

Step 1: Start the Program


Step 2: Import packages scipy and call function scipy.stats
Step 3: Import packages numpy, matplotlib and seaborn
Step 4: Create the distribution
Step 5: Visualizing the distribution
Step 6: Stop the process

Program:
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating the distribution


data = np.arange(1,10,0.01)
pdf = norm.pdf(data , loc = 5.3 , scale = 1 )

#Visualizing the distribution


sb.set_style('whitegrid')
sb.lineplot(x = data, y = pdf, color = 'black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')

23
Output:

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the normal curve using python program was successfully completed.

24
EX.NO: 5 e Statistical and Probability measures
DATE: (Correlation and scatter plots)

AIM
To write a python program for correlation with scatter plot.

ALGORITHM

Step 1: Start the Program


Step 2: Create variable y1, y2
Step 3: Create variable x, y3 using random function
Step 4: plot the scatter plot
Step 5: Print the result
Step 6: Stop the process

Program:

# Scatterplot and Correlations
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.random.randn(100)
y1 = x * 5 + 9
y2 = -5 * x
y3 = np.random.randn(100)

# Plot
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label=f'y1, Correlation = {np.round(np.corrcoef(x, y1)[0, 1], 2)}')
plt.scatter(x, y2, label=f'y2, Correlation = {np.round(np.corrcoef(x, y2)[0, 1], 2)}')
plt.scatter(x, y3, label=f'y3, Correlation = {np.round(np.corrcoef(x, y3)[0, 1], 2)}')

# Title, legend and display
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()

25
Output

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the Correlation and scatter plots using python program was successfully
completed.

26
EX.NO: 5 f Statistical and Probability measures
DATE: (Correlation coefficient)

AIM:
To write a python program to compute correlation coefficient.

ALGORITHM

Step 1: Start the Program


Step 2: Import math package
Step 3: Define correlation coefficient function
Step 4: Calculate correlation using formula
Step 5: Print the result
Step 6: Stop the process

Program:

# Python Program to find correlation coefficient.


import math
# function that returns the correlation coefficient.
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]
        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use the formula for calculating the correlation coefficient.
    corr = (n * sum_XY - sum_X * sum_Y) / \
           math.sqrt((n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# Find the size of the arrays.
n = len(X)

# Function call to correlationCoefficient.
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output :

0.953463
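The result can be cross-checked against NumPy's built-in routine; a short sketch using the same X and Y lists:

import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
print(round(np.corrcoef(X, Y)[0, 1], 6))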

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the computation for correlation coefficient was successfully completed.

28
EX.NO: 5 g Statistical and Probability measures
DATE: (Simple Linear Regression)

AIM
To write a python program for Simple Linear Regression.

ALGORITHM
Step 1: Start the Program
Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):


# number of observations/points
n = np.size(x)

# mean of x and y vector


m_x = np.mean(x)
m_y = np.mean(y)

# calculating cross-deviation and deviation about x


SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x

# calculating regression coefficients


b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
return (b_0, b_1)

def plot_regression_line(x, y, b):
# plotting the actual points as scatter plot
plt.scatter(x, y, color = "m", marker = "o", s = 30)

# predicted response vector


y_pred = b[0] + b[1]*x

# plotting the regression line


plt.plot(x, y_pred, color = "g")

# putting labels
plt.xlabel('x')
plt.ylabel('y')

# function to show plot


plt.show()

def main():
# observations / data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \\ nb_1 = {}".format(b[0], b[1]))

# plotting regression line


plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

30
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437

Graph:

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the computation for Simple Linear Regression was successfully completed.

31
EX.NO: 6 a Univariate Analysis
Frequency, Mean, Median, Mode, Variance,
DATE:
Standard Deviation, Skewness and Kurtosis.

AIM
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.

DESCRIPTION:
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included in the dataset.
Several constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage. The
dataset consists of several medical predictor variables and one target variable, Outcome.
Predictor variables include the number of pregnancies the patient has had, their BMI, insulin
level, age, and so on.

The datasets consist of several medical predictor (independent) variables and one target
(dependent) variable, Outcome. Independent variables include the number of pregnancies the
patient has had, their BMI, insulin level, age, and so on.

Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms,
descriptive statistics can be defined as the measures that summarize a given data, and these
measures can be broken down further into the measures of central tendency and the measures
of dispersion.

Measures of central tendency include mean, median, and the mode, while the measures of
variability include standard deviation, variance, and the interquartile range. In this guide, you
will learn how to compute these measures of descriptive statistics and use them to interpret
the data.

We will cover the topics given below:


 Mean
 Median

32
 Mode
 Standard Deviation
 Variance
 Interquartile Range
 Skewness

Data
The guide below was written around fictitious data of loan applicants containing 600 observations and 10 variables, as described next; in this experiment the same measures are computed on the diabetes dataset (diabetes.csv) loaded in the code that follows:
1. Marital_status: Whether the applicant is married ("Yes") or not ("No").
2. Dependents: Number of dependents of the applicant.
3. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
4. Income: Annual Income of the applicant (in USD).
5. Loan_amount: Loan amount (in USD) for which the application was submitted.
6. Term_months: Tenure of the loan (in months).
7. Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
8. Age: The applicant’s age in years.
9. Sex: Whether the applicant is female (F) or male (M).
10. approval_status: Whether the loan application was approved ("Yes") or not ("No").

Let's start by loading the required libraries and the data.

Code:
import pandas as pd
import numpy as np
import statistics as st
# Load the data
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.info())

33
Measures of Central Tendency
Measures of central tendency describe the center of the data, and are often represented by the mean,
the median, and the mode.

Mean
Mean represents the arithmetic average of the data. The line of code below prints the mean of the numerical variables in the data; from the output we can read off, for example, the average age of the patients in the dataset. The command df.mean(axis = 0) will also give the same output.

Code:

df.mean()

Code:
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())

Median
In simple terms, median represents the 50th percentile, or the middle value of the data, that separates the
distribution into two halves. The line of code below prints the median of the numerical variables
in the data. The command df.median(axis = 0) will also give the same output.

Code:

df.median()

34
Mode
Mode represents the most frequent value of a variable in the data.
This is the only central tendency measure that can be used with categorical variables, unlike the
mean and the median which can be used only with quantitative data.
The line of code below prints the mode of all the variables in the data.
The .mode() function returns the most common value or most repeated value of a variable.
The command df.mode (axis = 0) will also give the same output.

Code:

df.mode()
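Frequency counts (listed in the experiment title) for a single variable can be obtained with value_counts(); a minimal sketch, assuming the diabetes data loaded above with its Outcome column:

Code:

# frequency of each value of the target variable
print(df['Outcome'].value_counts())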

Standard Deviation
Standard deviation is a measure that is used to quantify the amount of variation of a set of data
values from its mean. A low standard deviation for a variable indicates that the data points tend to be
close to its mean, and vice versa. The line of code below prints the standard deviation of all the
numerical variables in the data.

Code:

df.std()

35
Variance
Variance is another measure of dispersion. It is the square of the standard deviation and the covariance
of the random variable with itself. The line of code below prints the variance of all the numerical
variables in the dataset. The interpretation of the variance is similar to that of the standard deviation.

Code:

df.var()

Interquartile Range (IQR)


The Interquartile Range (IQR) is a measure of statistical dispersion, and is calculated as the difference
between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR is also a
very important measure for identifying outliers and could be visualized using a boxplot. IQR can be
calculated using the iqr() function. The first line of code below imports the 'iqr' function from
the scipy.stats module, while the second line prints the IQR for the variable 'Age'.

Code:
from scipy.stats import iqr
iqr(df['Age'])

36
Skewness

Another useful statistic is skewness, which is the measure of the symmetry, or lack of it, for a real-valued
random variable about its mean. The skewness value can be positive, negative, or undefined. In
a perfectly symmetrical distribution, the mean, the median, and the mode will all have the same value.
However, the variables in our data are not symmetrical, resulting in different values of the
central tendency. We can calculate the skewness of the numerical variables using the skew() function, as
shown below.

Code:

print(df.skew())
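Kurtosis, also listed in the experiment title, measures the heaviness of the tails of a distribution relative to the normal curve; it can be computed in the same way with the kurtosis() function.

Code:

print(df.kurtosis())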

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the univariate analysis was performed on the given dataset successfully.

37
EX.NO: 6 b Bivariate Analysis -
DATE: Linear and logistic regression modelling.

AIM
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Bivariate analysis linear and logistic regression modelling.

DESCRIPTION:
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a
patient has diabetes, based on certain diagnostic measurements included in the dataset.
Several constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage. The
dataset consists of several medical predictor variables and one target variable, Outcome.
Predictor variables include the number of pregnancies the patient has had, their BMI, insulin
level, age, and so on.

The datasets consist of several medical predictor (independent) variables and one target
(dependent) variable, Outcome. Independent variables include the number of pregnancies the
patient has had, their BMI, insulin level, age, and so on.

Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms,
descriptive statistics can be defined as the measures that summarize a given data, and these
measures can be broken down further into the measures of central tendency and the measures
of dispersion.

Bivariate Regression Analysis is a type of statistical analysis that can be used during the
analysis and reporting stage of quantitative market research. It is often considered the simplest
form of regression analysis, and is also known as Ordinary Least-Squares regression or linear
regression.

38
Code:

import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()

Code:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', context='notebook')

cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
        'DiabetesPedigreeFunction', 'Age']
# pairwise scatter plots of the selected columns (the plotting call appears to have been
# dropped in the original listing; sns.pairplot is one likely candidate)
sns.pairplot(df[cols], height=2.5)
plt.show()

39
Code:

import numpy as np
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,cbar=True,annot=True,square=True,fmt='.2f',annot_kws={'size':
15},yticklabels=cols,xticklabels=cols)
plt.show()

Code:

class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

X = df[['Age']].values
y = df['Pregnancies'].values
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
# StandardScaler expects a 2-D array, so y is reshaped before scaling
y_std = sc_y.fit_transform(y.reshape(-1, 1)).flatten()
lr = LinearRegressionGD()
lr.fit(X_std, y_std)
plt.plot(range(1, lr.n_iter + 1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.show()

Code:
def lin_regplot(X, y, model):
    plt.scatter(X, y, c='blue')
    plt.plot(X, model.predict(X), color='red')
    return None

lin_regplot(X_std, y_std, lr)
plt.xlabel('Age (standardized)')
plt.ylabel('Pregnancies (standardized)')
plt.show()

41
Code:

age_std = sc_x.transform([[20]])
pregnancy_std = lr.predict(age_std)
print("Pregnancy: %.3f" % sc_y.inverse_transform(pregnancy_std.reshape(-1, 1))[0, 0])
print('Slope: %.3f' % lr.w_[1])
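The experiment title also calls for logistic regression modelling. Since the code above fits only a linear model, a minimal logistic-regression sketch using scikit-learn is given below; the choice of Glucose and BMI as predictors of Outcome is an assumption made for illustration.

Code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df[['Glucose', 'BMI']].values   # illustrative choice of predictor variables
y = df['Outcome'].values            # binary target: diabetic or not

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy: %.3f" % clf.score(X_test, y_test))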

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the bivariate analysis was performed on the given dataset successfully.

42
EX.NO: 7 Apply and explore various plotting functions on any
DATE: data set

AIM
To Apply and explore various plotting functions on any data set.

PROCEDURE:

The programs below apply and explore several plotting functions on sample data sets: normal curves, density and contour plots, correlation and scatter plots, histograms, and three-dimensional plots.

a. Normal curves

A probability distribution is a statistical function that describes the likelihood of obtaining the possible values that a random variable can take. Suppose we pick one adult at random and ask what his or her height is: there is no way to know the exact height in advance, but if we have the distribution of heights of adults in the city, we can bet on the most probable outcome. A Normal Distribution, also known as a Gaussian distribution or, famously, the Bell Curve, is a continuous probability distribution; the names are used interchangeably.

Code:
import numpy as np
import matplotlib.pyplot as plt
# Creating a series of data of in range of 1-50.
x = np.linspace(1,50,200)
#Creating a Function.
def normal_dist(x , mean , sd):
    # Gaussian probability density function
    prob_density = 1 / (sd * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean) / sd)**2)
    return prob_density
#Calculate mean and Standard deviation.
mean = np.mean(x)
sd = np.std(x)
#Apply function to the data.
pdf = normal_dist(x,mean,sd)

#Plotting the Results
plt.plot(x, pdf, color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')

OUTPUT:

b. Density and contour plots

Contour plots also called level plots are a tool for doing multivariate analysis and visualizing 3-D plots in
2-D space. If we consider X and Y as our variables we want to plot then the response Z will be plotted
as slices on the X-Y plane due to which contours are sometimes referred as Z-slices or iso-response.

Contour plots are widely used to visualize density, altitudes or heights of the mountain as well as in the
meteorological department. Due to such wide usage matplotlib.pyplot provides a method contour to make
it easy for us to draw contour plots

Code:

import matplotlib.pyplot as plt


import numpy as np
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)
# Creating 2-D grid of features
[X, Y] = np.meshgrid(feature_x, feature_y)
fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) + np.sin(Y / 4)
# plots contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')

ax.set_ylabel('feature_y')
plt.show()

OUTPUT:

c. Correlation and scatter plots

Correlation means an association, It is a measure of the extent to which two variables are
related.

1. Positive Correlation: When two variables increase together and decrease together. They
are positively correlated.
‘1’ is a perfect positive correlation. For example – demand and profit are positively
correlated the more the demand for the product, the more profit hence positive correlation.

2. Negative Correlation: When one variable increases, the other variable decreases, and vice-versa; they are negatively correlated. For example, as the distance between two magnets increases their attraction decreases. ‘-1’ is a perfect negative correlation.

3. Zero Correlation (No Correlation): When two variables don’t seem to be linked at all. ‘0’ indicates no correlation. For example, the amount of tea you take and your level of intelligence.

Code:
import pandas as pd
con = pd.read_csv('concrete.csv')
list(con.columns)
con.head()
con['cement'] = con['cement'].astype('category')
con.describe(include='category')
import seaborn as sns
sns.scatterplot(x="water", y="coarseagg", data=con);
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
ax.set_title("Water vs. Coarse Aggregate")
ax.set_xlabel("water");
sns.lmplot(x="water", y="coarseagg", data=con);

OUTPUT:

46
d. Histograms:

A histogram is basically used to represent data provided in the form of some groups. It is an accurate method for the graphical representation of numerical data distribution. It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis gives information about frequency.

Creating a Histogram

To create a histogram the first step is to create bin of the ranges, then distribute the whole
range of the values into a series of intervals, and count the values which fall into each of the
intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables.
The matplotlib.pyplot.hist() function is used to compute and create histogram of x.

Code:

from matplotlib import pyplot as plt


import numpy as np

# Creating dataset
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])

# Show plot
plt.show()

OUTPUT:

47
CODE:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# Creating dataset
np.random.seed(23685752)
N_points = 10000
n_bins = 20

# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) + 25

# Creating histogram
fig, axs = plt.subplots(1, 1,figsize =(10, 7),tight_layout = True)
axs.hist(x, bins = n_bins)

# Show plot
plt.show()

OUTPUT:

48
e. Three dimensional plotting

Matplotlib was originally designed with only two-dimensional plotting in mind. Around the 1.0 release, the 3D utilities were built on top of the 2D ones, so a 3D implementation of data is available today. 3D plots are enabled by importing the mplot3d toolkit. In this section, we deal with 3D plots using Matplotlib.

CODE:

from mpl_toolkits import mplot3d


import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()

# syntax for 3-D projection


ax = plt.axes(projection ='3d')

# defining axes
z = np.linspace(0, 1, 100)
x= z * np.sin(25 * z)
y = z * np.cos(25 * z)
c=x+y
ax.scatter(x, y, z, c = c)

# syntax for plotting


ax.set_title('3d Scatter plot')
plt.show()

OUTPUT:

49
                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the various plotting functions were applied on the given dataset successfully.

50
EX.NO: 8
Visualizing Geographic Data with Basemap
DATE:

AIM
To Visualizing Geographic Data with Basemap.

PROCEDURE:

One common type of visualization in data science is that of geographic data. Matplotlib's
main tool for this type of visualization is the Basemap toolkit, which is one of several
Matplotlib toolkits which lives under the mpl_toolkits namespace. Admittedly, Basemap
feels a bit clunky to use, and often even simple visualizations take much longer to render
than you might hope. More modern solutions such as leaflet or the Google Maps API may
be a better choice for more intensive map visualizations. Still, Basemap is a useful tool for
Python users to have in their virtual toolbelts. In this section, we'll show several examples
of the type of map visualization that is possible with this toolkit.

Installation of Basemap is straightforward; if you're using conda you can type this and the
package will be downloaded:

conda install basemap

Code:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6, lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12, 12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

OUTPUT:

52
CODE:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
sns.set_style('whitegrid')
fp = r'Maps_with_python\india-polygon.shp'
map_df = gpd.read_file(fp)
map_df_copy = gpd.read_file(fp)
map_df.plot(markersize=5)

53
OUTPUT:

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus visualizing geographic data with Basemap was completed successfully.

54
EX.NO: 9
Exploratory Data Analysis
DATE:

AIM
To do Exploratory Data Analysis on Iris dataset.

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a technique to analyze data using visual techniques. With it, we can get detailed information about the statistical summary of the data, deal with duplicate values and outliers, and see trends or patterns present in the dataset.

Now let’s see a brief about the Iris dataset.

Iris Dataset

If you are from a data science background, you must be familiar with the Iris dataset. If you are not, don't worry, we will discuss it here.

The Iris dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of different iris flowers and recorded them digitally.

Note: This dataset can be downloaded from www.kaggle.com.

You can download the Iris.csv file from the above link. We will use the Pandas library to load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.

55
Code:

import pandas as pd

data1=pd.read_csv("Iris.csv")

data1.head()

data1.info()

56
data1.describe()

data1.isnull().sum()

data1.shape

57
data = data1.drop_duplicates(subset ="Species",)

data
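To see the trends or patterns mentioned above, the dataframe can also be visualized; a minimal sketch with seaborn, assuming the Kaggle Iris.csv column names Id and Species:

Code:

import seaborn as sns
import matplotlib.pyplot as plt

# pairwise relationships between the measured features, coloured by species
sns.pairplot(data1.drop(columns=['Id']), hue='Species')
plt.show()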

                                  MARKS ALLOTTED    MARKS SECURED
AIM & ALG/PROCEDURE                      5
PROGRAM                                 10
OUTPUT                                   5
TOTAL                                   20

Result:
Thus the Exploratory Data Analysis on the Iris dataset was completed successfully.

58
