
Fods Lab

The document outlines a series of experiments and exercises focused on data science tools including NumPy, SciPy, Jupyter, Statsmodels, and Pandas. Each section describes the installation procedures, features, and applications of these packages, along with specific aims for practical exercises. The document serves as a manual for understanding and utilizing these essential Python libraries for data manipulation and analysis.

Uploaded by

viji
S.No  Date  Name of the Experiment/Exercise  Page No.  Marks Awarded  Remarks

CONTENTS

S.No  Experiment name
1     Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages
2     Working with NumPy arrays
3     Working with Pandas data frames
4     Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
5a    Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
5b    Bivariate analysis: Linear and logistic regression modeling
5c    Multiple Regression analysis
5d    Compare the results of the above analysis for the two data sets
6a    Normal curves
6b    Density and contour plots
6c    Correlation and scatter plots
6d    Histograms
6e    Three dimensional plotting
7     Visualizing Geographic Data with Basemap

EX.NO: 1.A NUMPY



AIM:
To download and install the NumPy package and explore its features.

NUMPY STUDY:
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. NumPy’s main object is the homogeneous
multidimensional array. NumPy’s array class is called ndarray. It is also known by the alias
array.

NUMPY USED FOR:


NumPy makes many mathematical operations widely used in scientific computing fast and easy to
use, such as:
• Vector-vector multiplication
• Matrix-matrix and matrix-vector multiplication
• Element-wise operations on vectors and matrices
(e.g., adding, subtracting, multiplying, and dividing by a number)
• Element-wise or array-wise comparisons
• Applying functions element-wise to a vector/matrix (like pow, log, and exp)
• Many linear algebra operations, found in numpy.linalg
• Reductions, statistics, and much more
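
The operations listed above can be tried out in a few lines. This is a minimal sketch; the arrays v, w and M are made-up values chosen only so the results are easy to check by hand.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
M = np.array([[2.0, 0.0], [1.0, 3.0]])

dot_vw = v @ w                    # vector-vector (inner) product: 1*4 + 2*5 + 3*6
MM = M @ M                        # matrix-matrix product
Mv = M @ np.array([1.0, 1.0])     # matrix-vector product
scaled = v * 2                    # element-wise operation with a number
mask = v > 1.5                    # element-wise comparison
exp_log = np.log(np.exp(v))       # functions applied element-wise
det_M = np.linalg.det(M)          # linear algebra routine from numpy.linalg
total, mean = v.sum(), v.mean()   # reductions and basic statistics

print(dot_vw, Mv, mask, det_M, total, mean)
```

Each line maps to one bullet in the list, so individual lines can be run on their own in the Python shell.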

PROCEDURE TO INSTALL:
Step 1: Open Command Prompt.
Step 2: Check that pip is available using the command “pip --version” (pip is bundled with current Python installers).
Step 3: Use the command “pip install numpy” to install the NumPy package.
Step 4: After it is installed, import the package to check and use it.

FEATURES:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities

RESULT:
Thus the NumPy package was installed and imported, and its features were explored.

EX.NO: 1.B SCIPY

Study:
SciPy stands for Scientific Python. It builds on NumPy and provides additional utility functions
for optimization, statistics and signal processing. Like NumPy, SciPy is open source, so it can be
used freely. SciPy was created by NumPy's creator, Travis Oliphant. The SciPy library supports
integration, gradient optimization, special functions, ordinary differential equation solvers,
parallel programming tools, and much more. SciPy is used across a wide range of complex
numerical computations. It offers a data-processing and system-prototyping environment
similar to MATLAB. It is easy to use and provides great flexibility to scientists and engineers.

Uses:
• SciPy adds optimized functions that are frequently used with NumPy in data science.
• SciPy contains a variety of subpackages that help solve the most common problems in scientific
computation.
• SciPy is among the most widely used scientific libraries, alongside the GNU
Scientific Library for C/C++ and MATLAB's toolboxes.
• It is easy to use and understand, and computationally fast.

Aim:
To download and install the SciPy package and explore its features.

Features:
• It supports integration, gradient optimization and special functions.
• It operates on NumPy arrays.
• It is one of the most widely used scientific libraries.
Procedure to Install:
Step 1: Open Command Prompt.
Step 2: Check that pip is available using the command “pip --version”.
Step 3: Use the command “pip install scipy” to install the SciPy package.
Step 4: After it is installed, import the package to check and use it.
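
Once the import in Step 4 works, two of the features named above (integration and optimization) can be checked quickly. This is a minimal sketch; the integrand and the quadratic being minimised are arbitrary examples, not part of the original exercise.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integral of sin(x) from 0 to pi (exact value is 2).
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Scalar optimisation: minimum of (x - 3)^2 + 1, which lies at x = 3.
res = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, res.x, res.fun)
```

Both calls take ordinary Python callables and NumPy functions, which is what "operates on NumPy arrays" means in practice.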

Output:

RESULT:
Thus the SciPy package was installed and imported, and its features were explored.

EX.NO: 1.C JUPYTER

Study:
Jupyter Notebook is an open-source, browser-based, interactive tool for developing data science
projects in different languages such as Python and R. It allows code to be run in the browser.
A notebook does not consist of code alone; it can also contain output, mathematical equations
and other narrative text, so it can be used for sharing data science projects that combine code
and reports. It can be installed using the Python pip command. If you are using Anaconda, it is
installed automatically as part of the Anaconda installation.

It can be used for:
➔ Writing and executing code
➔ Generating output
➔ Visualizing output
➔ Generating reports
➔ It also supports containers such as Docker.

Aim:
To download, install and explore the features of the Jupyter package.

Features:
➔ It is used for creating and sharing computational documents.
➔ It offers a simple, streamlined, document-centric experience.
➔ It integrates with many programming languages such as Python, PHP, R, C#, etc.

Procedure:
Step 1: Open the command prompt.
Step 2: Check that pip is available using the command “pip --version”.
Step 3: To install Jupyter Notebook, type the command “pip install jupyter”.
Step 4: Once the installation process is completed, you can run your notebook
on the server using the command “jupyter notebook”.
Step 5: To upgrade Jupyter Notebook, type the command “pip install --upgrade notebook”.

Output:
C:\Users\Lenovo>pip install jupyter
Collecting jupyter
Using cached jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting notebook
Downloading notebook-6.5.2-py3-none-any.whl (439 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 439.1/439.1 kB 1.0 MB/s eta 0:00:00
Collecting qtconsole
Downloading qtconsole-5.4.0-py3-none-any.whl (121 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.0/121.0 kB 1.0 MB/s eta 0:00:00
Collecting jupyter-console
Using cached jupyter_console-6.4.4-py3-none-any.whl (22 kB)
Collecting nbconvert
Downloading nbconvert-7.2.7-py3-none-any.whl (273 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 273.2/273.2 kB 1.2 MB/s eta 0:00:00
Collecting ipykernel
Downloading ipykernel-6.20.1-py3-none-any.whl (149 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 149.2/149.2 kB 1.1 MB/s eta 0:00:00
Collecting ipywidgets
Downloading ipywidgets-8.0.4-py3-none-any.whl (137 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 137.8/137.8 kB 910.9 kB/s eta 0:00:00
Installing collected packages
Successfully installed …

Result:
Thus Jupyter Notebook was downloaded and installed successfully.

EX.NO: 1.D STATSMODELS

Study:
Statsmodels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests and statistical data
exploration. An extensive list of result statistics is available for each estimator. The results are
tested against existing statistical packages to ensure that they are correct. The package is
released under the open source Modified BSD (3-clause) license. The online documentation is
hosted at statsmodels.org.

It can be used for:
➔ Exploring data
➔ Estimating statistical models
➔ Performing statistical tests

Aim:
To download, install and explore the features of Statsmodels package.
Features:
➔ It provides classes and functions for the estimation of many different statistical models.
➔ It offers classes and functions for conducting statistical tests.
➔ It provides classes and functions for statistical data exploration.
Procedure:
Step 1: Open the command prompt.
Step 2: Check that pip is available using the command “pip --version”.
Step 3: To install Statsmodels, type the command “pip install statsmodels”.
Step 4: Once the installation process is completed, you can import and use
statsmodels in the Python shell.

Output:

Microsoft Windows [Version 10.0.22000.1455]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Admin>pip install statsmodels
Defaulting to user installation because normal site-packages is not writeable
Collecting statsmodels
Using cached statsmodels-0.13.5-cp311-cp311-win_amd64.whl (9.0 MB)
Requirement already satisfied: pandas>=0.25 in c:\users\admin\appdata\roaming\python\python311\site-packages (from statsmodels) (1.5.2)
Requirement already satisfied: patsy>=0.5.2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: packaging>=21.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from statsmodels) (23.0)
Requirement already satisfied: scipy>=1.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from statsmodels) (1.10.0)
Requirement already satisfied: numpy>=1.17 in c:\users\admin\appdata\roaming\python\python311\site-packages (from statsmodels) (1.24.1)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas>=0.25->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas>=0.25->statsmodels) (2022.7)
Requirement already satisfied: six in c:\users\admin\appdata\roaming\python\python311\site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
Installing collected packages: statsmodels
Successfully installed statsmodels-0.13.5

Data Science Manual R-2021

Result:
Thus the Statsmodels package was downloaded and installed successfully.

EX.NO: 1.E PANDAS

STUDY:
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations
for manipulating numerical tables and time series. It is free software released
under the three-clause BSD license. The name Pandas refers to both
"panel data" and "Python data analysis". It was created by Wes McKinney in 2008.
Pandas provides the following data structures for manipulating data:
• Series
• Index
• DataFrame
It can be used for:
• Loading data from different file objects.
• Easy handling of missing data (represented by NaN) in floating-point as well as
non-floating-point data.
• Size mutability: columns can be inserted into and deleted from a DataFrame and
higher-dimensional objects.
• Time-series functionality.
• Powerful group-by functionality for performing split-apply-combine operations on data
sets.
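
Several of the uses listed above (missing-data handling, size mutability and group-by) can be seen on a tiny table. This is a minimal sketch; the column names "group", "value" and "double" and all the values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])   # labelled 1-D data

df = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [1.0, np.nan, 3.0, 5.0],    # NaN marks a missing entry
})

n_missing = int(df["value"].isna().sum())   # count missing values
filled = df["value"].fillna(0.0)            # replace NaN with 0.0
df["double"] = df["value"] * 2              # size mutability: add a column
df = df.drop(columns="double")              # ...and delete it again
group_means = df.groupby("group")["value"].mean()   # split-apply-combine

print(s["b"], n_missing, group_means.to_dict())
```

Note that mean() skips NaN by default, so group "x" averages only its one known value.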
Aim:
To download, install and explore the features of the Pandas package.
Features:
• Fast and efficient DataFrame object with default and customized indexing.
• Reshaping and pivoting of datasets.
• Tools for loading data into in-memory data objects from different
file formats.
Procedure:
Step 1: Open the command prompt.
Step 2: Check that pip is available using the command “pip --version”.
Step 3: To install Pandas, type the command “pip install pandas”.
Step 4: Once the installation process is completed, import the package in the
Python shell to check and use it.
Step 5: To upgrade Pandas, type the command “pip install --upgrade pandas”.

Output:
C:\Users\USER>pip install pandas
Collecting pandas
Using cached pandas-1.5.2-cp311-cp311-win_amd64.whl (10.3 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in d:\pthon works\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in d:\pthon works\lib\site-packages (from pandas) (2022.6)
Requirement already satisfied: numpy>=1.21.0 in d:\pthon works\lib\site-packages (from pandas) (1.23.5)
Requirement already satisfied: six>=1.5 in d:\pthon works\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Installing collected packages: pandas
Successfully installed pandas-1.5.2

Result:
Thus the Pandas package was downloaded and installed successfully.

EX.NO:2 NUMPY
STUDY:
Python lists are a substitute for arrays, but they fail to deliver the performance required
while computing large sets of numerical data. To address this issue we use a Python library
called NumPy. The word NumPy stands for Numerical Python. NumPy offers an array object
called ndarray. NumPy arrays are similar to standard Python sequences but differ in certain key ways.
Unlike lists, NumPy arrays are of fixed size, and changing the size of an array creates a new
array; the original is discarded once nothing refers to it. All the elements in an array are of
the same type. NumPy arrays are faster, more efficient, and require less syntax than standard
Python sequences.
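
The three properties described above (fixed size, a single element type, and shorter syntax) can be observed directly. This is a minimal sketch with made-up values.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.insert(a, 2, 6)    # returns a NEW array; `a` itself keeps its size

mixed = np.array([1, 2.5, 3])   # integers are promoted so every element is a float

doubled_list = [2 * x for x in [1, 2, 3, 4]]   # list: explicit loop
doubled_arr = a * 2                            # array: one vectorised expression

print(a, b, mixed.dtype, doubled_arr)
```

Because np.insert and np.delete return new arrays, the result has to be assigned back (a = np.insert(a, ...)) if the change should persist.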

AIM:
To implement a Python program that performs NumPy operations on an array.

ALGORITHM:

STEP 1: Start
STEP 2: Read the elements in the array
STEP 3: Read the choice
STEP 4: Repeat steps 5 to 10 while the choice is not equal to 5
STEP 5: If ch=1, read the value and index, and insert the value using the insert() function
STEP 6: If ch=2, read the element, search for it and return its index using the where() function
STEP 7: If ch=3, sort the elements using the sort() function
STEP 8: If ch=4, read the index and delete the element using the delete() function
STEP 9: If ch=5, print exit
STEP 10: If nothing matches, print invalid choice
STEP 11: Stop

PROGRAM:

import numpy as np

a = np.array([1, 2, 3, 4])
print("Options\n\t1.Insertion\n\t2.Search\n\t3.Sort\n\t4.Deletion\n\t5.Exit")
ch = 0
while ch != 5:
    ch = int(input("Enter the choice:"))   # read the choice on every pass
    if ch == 1:
        print("Array before insertion:", a)
        b = int(input("Enter the element to be inserted :"))
        p = int(input("Enter the position:"))
        a = np.insert(a, p, b)             # new array with b placed at index p
        print("Array after insertion:", a)
    elif ch == 2:
        s = int(input("Enter the element to be searched:"))
        d = np.where(a == s)[0]            # indices at which the element occurs
        print("Position:", d)
    elif ch == 3:
        print("Sorted array is:", np.sort(a))   # sorted copy; a is unchanged
    elif ch == 4:
        e = int(input("Enter position of the element to be deleted :"))
        a = np.delete(a, e)
        print("Array after deletion:", a)
    elif ch == 5:
        print("EXIT")
    else:
        print("Invalid choice")

OUTPUT:

Options
    1.Insertion
    2.Search
    3.Sort
    4.Deletion
    5.Exit
Enter the choice:1
Array before insertion: [1 2 3 4]
Enter the element to be inserted :6
Enter the position:2
Array after insertion: [1 2 6 3 4]
Enter the choice:2
Enter the element to be searched:2
Position: [1]
Enter the choice:3
Sorted array is: [1 2 3 4 6]
Enter the choice:4
Enter position of the element to be deleted :4
Array after deletion: [1 2 6 3]
Enter the choice:7
Invalid choice
Enter the choice:5
EXIT

RESULT:
Thus the Python program to perform NumPy operations on an array was written and executed, and
the output was verified successfully.

EX.NO:3 WORKING WITH PANDAS DATAFRAME

STUDY:
The Pandas DataFrame is a structure that contains two-dimensional data and its
corresponding labels. DataFrames are widely used in data science, machine learning, scientific
computing, and many other data-intensive fields. DataFrames are similar to SQL tables or the
spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier
to use, and more powerful than tables or spreadsheets because they’re an integral part of the
Python and NumPy ecosystems.
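
The idea of two-dimensional data with labels can be sketched on a tiny table of marks. The student labels "A" and "B" and the marks are hypothetical; the pass mark of 34 matches the threshold used in the program for this exercise.

```python
import pandas as pd

marks = pd.DataFrame(
    {"datascience": [90, 30], "maths": [89, 75]},
    index=["A", "B"],              # row labels (hypothetical student names)
)

one_cell = marks.loc["A", "maths"]        # label-based lookup of a single value
pass_mask = (marks > 34).all(axis=1)      # True where every mark exceeds 34
marks["result"] = pass_mask.map({True: "Qualified", False: "Failed"})

print(marks)
```

Computing the result column from whole columns at once avoids an explicit loop over students.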
AIM:
To write a Python program that creates and displays a Pandas DataFrame.
ALGORITHM:
STEP 1 : Start.
STEP 2 : Import the pandas package.
STEP 3 : Create the dataframe for the list of elements,
STEP 3.1: roll number, name, datascience, datastructure, oops, maths.
STEP 4 : Display the table using pd.DataFrame().
STEP 5 : Display the output.
STEP 6 : Stop.
PROGRAM:
import pandas as pd

n = int(input("Enter the total number of students:"))
rollnumber = []
name = []
datascience = []
datastructure = []
oops = []
maths = []
result = []
for i in range(n):
    rn = int(input("Enter the roll number"))
    rollnumber.append(rn)
    nm = input("Enter the name")   # separate variable so the count n is not overwritten
    name.append(nm)
    dsc = int(input("Enter the datascience mark"))
    datascience.append(dsc)
    ds = int(input("Enter the datastructure mark"))
    datastructure.append(ds)
    o = int(input("Enter the oops mark"))
    oops.append(o)
    m = int(input("Enter the maths mark"))
    maths.append(m)
    # A student qualifies only if every mark is above 34.
    if dsc > 34 and ds > 34 and o > 34 and m > 34:
        r = "Qualified"
    else:
        r = "Failed"
    result.append(r)
l = pd.DataFrame({"rollnumber": rollnumber, "name": name,
                  "datascience": datascience, "datastructure": datastructure,
                  "oops": oops, "maths": maths, "result": result})
print(l)

OUTPUT:
Enter the total number of students:2
Enter the roll number1
Enter the nameA
Enter the datascience mark90
Enter the datastructure mark89
Enter the oops mark90
Enter the maths mark89
Enter the roll number2
Enter the nameB
Enter the datascience mark99
Enter the datastructure mark89
Enter the oops mark99
Enter the maths mark89
   rollnumber  name  datascience  datastructure  oops  maths     result
0           1     A           90             89    90     89  Qualified
1           2     B           99             89    99     89  Qualified

RESULT:
Thus the Python program to create a Pandas DataFrame was written and executed, and the
output was verified successfully.

EX.NO:4 DESCRIPTIVE ANALYTICS ON THE IRIS DATASET

AIM:
To write a python program to read data from text files, Excel and from web and to explore
various commands for doing descriptive analytics on the Iris Dataset.
STUDY:
The Iris dataset is a collection of data that is used to demonstrate the properties of
various statistical models. It contains 150 observations (50 for each of three species) on four
variables: Petal Length, Petal Width, Sepal Length, and Sepal Width. The dataset is often used in
data mining, classification and clustering examples and to test algorithms. Information about the
original paper and usages of the dataset can be found in the UCI Machine Learning
Repository. The dataset consists of the petal and sepal measurements of 3 different types of
irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. The rows are the
samples and the columns are Sepal Length, Sepal Width, Petal Length and Petal Width.

ALGORITHM:
STEP 1: Start the program.
STEP 2: Import the required modules.
STEP 3: Read the Iris dataset (text file, Excel, web).
STEP 4: Read the specific column name for calculating mean, median, mode, sum, maximum and minimum.
STEP 5: Display the final dataset after replacing the old column name with the new column name.
STEP 6: Stop.
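
The steps above can be sketched without any file on disk. This is a minimal sketch: it reads only the first five Iris rows that appear in this manual from an in-memory string, whereas a real run would pass a file path or URL to read_csv (or use read_excel for a spreadsheet). The column names are assumed to match the headers shown in the output of this exercise.

```python
import io
import pandas as pd

# First five Iris rows as shown in this manual (not the full 150-row data set).
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""
d = pd.read_csv(io.StringIO(csv_text))

col = "sepalwidth"
summary = {
    "mean": d[col].mean(),
    "median": d[col].median(),
    "mode": d[col].mode().tolist(),   # every value that ties for most frequent
    "sum": d[col].sum(),
    "min": d[col].min(),
    "max": d[col].max(),
}
renamed = d.rename(columns={"sepallength": "New sepal"})   # returns a renamed copy

print(summary)
print(renamed.columns.tolist())
```

d.describe() gives most of these statistics for every numeric column in one call.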

PROGRAM:
import pandas as pd

d = pd.read_csv("filename")   # path or URL of the Iris CSV file
print("The length of the dataset:", len(d))
start = int(input("Enter the start value:"))
stop = int(input("Enter the stop value:"))
print("After slicing:\n", d[start:stop])   # rows start .. stop-1
n = int(input("Enter the specific column to display:"))
col = d.columns[n - 1]                     # 1-based column number to column name
print(d[col])
print("Mean:", d[col].mean(), "\nMedian:", d[col].median(),
      "\nSum:", d[col].sum(), "\nMinimum:", d[col].min(),
      "\nMaximum:", d[col].max())
E = input("Enter the column name to be replaced:")
G = input("Enter new name:")
replace = {E: G}
d.rename(columns=replace, inplace=True)
print("Dataset:\n", d)

OUTPUT:
The length of the dataset: 150
Enter the start value: 2
Enter the stop value: 5
After slicing:
   sepallength  sepalwidth  petallength  petalwidth        class
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

Enter the specific column to display: 2
sepalwidth
3.5
3
3.2
3.1
3.6
3.9
3.4
3.4
2.9
3.1
3.7
3.4
3
3
4
4.4
3.9

Mean: ≈3.06
Median: 3.0
Sum: ≈458
Minimum: 2.0
Maximum: 4.4
Enter the column name to be replaced: sepallength
Enter new name: New sepal
Dataset:
   New sepal  sepalwidth  petallength  petalwidth        class
0        5.1         3.5          1.4         0.2  Iris-setosa
1        4.9         3.0          1.4         0.2  Iris-setosa
2        4.7         3.2          1.3         0.2  Iris-setosa
3        4.6         3.1          1.5         0.2  Iris-setosa
4        5.0         3.6          1.4         0.2  Iris-setosa
5        5.4         3.9          1.7         0.4  Iris-setosa
6        4.6         3.4          1.4         0.3  Iris-setosa
7        5.0         3.4          1.5         0.2  Iris-setosa
8        4.4         2.9          1.4         0.2  Iris-setosa

RESULT:
Thus the Python program to read the dataset from text files, Excel and the web, and to
explore various commands for doing descriptive analytics on the Iris dataset, was executed
and the output was verified successfully.

Ex.No:5a UNIVARIATE ANALYSIS

AIM:
To perform univariate analysis on the diabetes data set from UCI and the Pima Indians diabetes
data set.

STUDY:
UNIVARIATE ANALYSIS:
It is the simplest form of statistical analysis. It can be inferential or descriptive. The key
fact is that only one variable is involved. It involves finding the frequency, mean, mode,
median, standard deviation, skewness and kurtosis of the variable under consideration.
It can yield misleading results in cases in which multivariate analysis is more
appropriate.

ALGORITHM:
STEP 1: Start
STEP 2: Import pandas and scipy packages.
STEP 3: Download required datasets.
STEP 4: Read dataset using pandas read_csv().
STEP 5: Display top contents of dataset using head().
STEP 6: Display bottom contents of dataset using tail().
STEP 7: Read column name to calculate mean and display mean using mean().
STEP 8: Read column name to calculate median and display median using median().
STEP 9: Read column name to calculate mode and display mode using mode().
STEP 10: Read column name to find frequency and display frequency using value_counts().
STEP 11: Read column name to find variance and display variance using var().
STEP 12: Read column name to find standard deviation and display using std().
STEP 13: Read column name to find skewness and display using skew().
STEP 14: Read column name to find kurtosis and display using kurtosis().
STEP 15: Stop.
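
Steps 7 to 14 can be tried on a small column before running them on the full data sets. This is a minimal sketch; the values in ages are made up and are not taken from either diabetes data set.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

# Made-up sample values, used only to exercise the calls.
ages = pd.Series([21, 21, 22, 25, 31, 50])

stats = {
    "mean": ages.mean(),
    "median": ages.median(),
    "mode": ages.mode().tolist(),
    "frequency": ages.value_counts().to_dict(),
    "variance": ages.var(),     # sample variance (ddof=1)
    "std": ages.std(),          # sample standard deviation (ddof=1)
    "skewness": skew(ages, bias=True),
    "kurtosis": kurtosis(ages, bias=True),   # excess kurtosis (normal distribution gives 0)
}
print(stats)
```

The single large value (50) pulls the mean above the median and makes the skewness positive, which is the pattern to look for in a right-skewed column.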

PROGRAM:
import pandas as pd
from scipy.stats import kurtosis
from scipy.stats import skew

print("press 1 for UCI diabetes dataset and 2 for Pima Indians diabetes dataset")
ch = int(input("Enter choice:"))   # convert to int so the comparison below works
if ch == 1:
    d = pd.read_csv("uci diabetes.csv")
else:
    d = pd.read_csv("pima diabetes.csv")
print("Top rows:\n", d.head())
print("Bottom rows:\n", d.tail())
a = input("Enter column name to find mean:")
print("Mean of", a, "is\n", d[a].mean())
b = input("Enter column name to find median:")
print("Median of", b, "is\n", d[b].median())
c = input("Enter column name to find mode:")
print("Mode of", c, "is\n", d[c].mode())
g = input("Enter column name to find frequency:")
print("Frequency of", g, "is\n", d[g].value_counts())
e = input("Enter column name to find variance:")
print("Variance of", e, "is\n", d[e].var())
f = input("Enter column name to find standard deviation:")
print("Standard deviation of", f, "is\n", d[f].std())
h = input("Enter column name to find skewness:")
print("The skewness value:", skew(d[h], axis=0, bias=True))
i = input("Enter column name to find kurtosis:")
print("The kurtosis value:", kurtosis(d[i], axis=0, bias=True))

OUTPUT:
press 1 for UCI diabetes dataset and 2 for Pima Indians diabetes dataset
Enter choice:1
Top rows:
   Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72  ...                     0.627   50        1
1            1       85             66  ...                     0.351   31        0
2            8      183             64  ...                     0.672   32        1
3            1       89             66  ...                     0.167   21        0
4            0      137             40  ...                     2.288   33        1
[5 rows x 9 columns]
Bottom rows:
     Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome
763           10      101             64  ...                     0.672   63        0
764            2      122             72  ...                     0.627   27        0
765            5      121             66  ...                     0.167   30        0
766            1      126             66  ...                     0.351   47        1
767            1       93             40  ...                     2.288   23        0
[5 rows x 9 columns]
Enter column name to find mean: Age
Mean of Age is
33.240882416666664
Enter column name to find median: Insulin
Median of Insulin is
30.5
Enter column name to find mode: BloodPressure
Mode of BloodPressure is
0    70
Name: BloodPressure, dtype: int64
Enter column name to find frequency: Glucose
Frequency of Glucose is
99     17
100    17
111    14
129    14
125    14
       ..
191     1
177     1
44      1
62      1
190     1
Name: Glucose, Length: 136, dtype: int64
Enter column name to find variance: Age
Variance of Age is
138.30304589037377
Enter column name to find standard deviation: Glucose
Standard deviation of Glucose is
31.97261819513622
Enter column name to find skewness: Age
The skewness value: 1.27389259531697
Enter column name to find kurtosis: Age
The kurtosis value: 0.6311769413798585

RESULT:
Thus the Python program to perform univariate analysis on the diabetes datasets was
written and executed, and the output was verified.

EX.NO:5b PIMA INDIAN DATA SET

STUDY:
Bivariate analysis is a statistical technique applied to a pair of variables (features/attributes)
of data to determine the empirical relationship between them. In other words, it is meant to
determine any concurrent relation, usually over and above a simple correlation analysis.

Example:
If you are studying a group of college students to find out their average SAT score and their
age, you have two pieces of the puzzle to find.

AIM:
To write a Python program to perform bivariate analysis on the diabetes dataset from the Pima
Indian data set.

ALGORITHM (LINEAR REGRESSION):


STEP 1 : Start
STEP 2 : Read the diabetes dataset.
STEP 3 : Use iloc to select an entire column each for the x and y axes.
STEP 4 : Create an object of the LinearRegression class and fit it to x and y.
STEP 5 : Using the scatter() function, draw a scatter plot of the values.
STEP 6 : Set the x label, y label and title.
STEP 7 : Predict y for each x using the fitted model.
STEP 8 : Plot the linear regression line using plot().
STEP 9 : Display the plot using show().
STEP 10 : Stop.

PROGRAM (LINEAR REGRESSION):


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

d = pd.read_csv("diabetes.csv")
x = d.iloc[:, 5].values.reshape(-1, 1)   # column 5: BMI
y = d.iloc[:, 1].values.reshape(-1, 1)   # column 1: Glucose in the Pima column order
lin = LinearRegression()
lin.fit(x, y)                            # least-squares fit of y on x
y_pred = lin.predict(x)                  # fitted values on the regression line
plt.scatter(x, y)
plt.xlabel("BMI")
plt.ylabel("Glucose")
plt.plot(x, y_pred, color="red")
plt.show()
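
The fit that LinearRegression performs can be checked on data where the answer is known in advance. This is a minimal sketch using scipy.stats.linregress as a stand-in for the scikit-learn model, with made-up points lying exactly on a line; it does not read the diabetes file.

```python
import numpy as np
from scipy.stats import linregress

# Made-up points lying exactly on y = 2x + 1, so the fit is easy to check.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

fit = linregress(x, y)                    # least-squares line through (x, y)
y_pred = fit.intercept + fit.slope * x    # fitted values, the analogue of lin.predict(x)

print(fit.slope, fit.intercept, fit.rvalue)
```

On real data the correlation coefficient fit.rvalue indicates how closely the points follow the fitted line.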

ALGORITHM (LOGISTIC REGRESSION):


STEP 1 : Start
STEP 2 : Read the diabetes dataset.
STEP 3 : Use iloc to select an entire column each for the x and y axes.
STEP 4 : Define the sigmoid function sig(x).
STEP 5 : Assign the predicted value of y using the sigmoid of x.
STEP 6 : Using the scatter() function, draw a scatter plot of the values.
STEP 7 : Set the x label, y label and title.
STEP 8 : Plot the logistic regression curve.
STEP 9 : Display the plot using show().
STEP 10 : Stop.

PROGRAM (LOGISTIC REGRESSION):


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def sig(x):
    # Logistic (sigmoid) function.
    return 1 / (1 + np.exp(-x))

d = pd.read_csv("diabetes.csv")
x = d.iloc[:, 5].values    # column 5: BMI
y = d.iloc[:, 8].values    # column 8: Outcome (0 or 1)
y_pred = sig(x)            # sigmoid of the raw values; regplot fits its own curve
plt.scatter(x, y)
plt.xlabel("BMI")
plt.ylabel("Diabetes Coefficient")
sns.regplot(x=x, y=y, logistic=True, ci=None, color="red")
plt.show()
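
The role of the sigmoid in logistic regression can be seen numerically. This is a minimal sketch; the coefficients b0 and b1 are hypothetical values chosen for illustration, not coefficients fitted to either diabetes data set.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration (not fitted values).
b0, b1 = -6.0, 0.2
bmi = np.array([20.0, 30.0, 40.0])
prob = sigmoid(b0 + b1 * bmi)        # modelled probability of outcome 1
label = (prob >= 0.5).astype(int)    # threshold at 0.5 to get a class

print(prob, label)
```

The sigmoid maps any real number into the interval (0, 1), which is why its output can be read as a probability and thresholded into a 0/1 outcome.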

RESULT:
Thus the program to perform bivariate analysis on the diabetes dataset from Pima was written and
executed, and the output was verified successfully.

EX.NO UCI DATA SET

AIM:
To write a Python program to perform bivariate analysis on the diabetes dataset from the UCI
data set.

ALGORITHM (LINEAR REGRESSION):


STEP 1 : Start
STEP 2 : Read the diabetes dataset.
STEP 3 : Use iloc to select an entire column each for the x and y axes.
STEP 4 : Create an object of the LinearRegression class and fit it to x and y.
STEP 5 : Using the scatter() function, draw a scatter plot of the values.
STEP 6 : Set the x label, y label and title.
STEP 7 : Predict y for each x using the fitted model.
STEP 8 : Plot the linear regression line using plot().
STEP 9 : Display the plot using show().
STEP 10 : Stop.

PROGRAM (LINEAR REGRESSION):


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

d = pd.read_csv("UCI diabetes.csv")
x = d.iloc[:, 5].values.reshape(-1, 1)   # column 5 as the independent variable
y = d.iloc[:, 1].values.reshape(-1, 1)   # column 1 as the dependent variable
lin = LinearRegression()
lin.fit(x, y)                            # least-squares fit of y on x
y_pred = lin.predict(x)                  # fitted values on the regression line
plt.scatter(x, y)
plt.xlabel("BMI")
plt.ylabel("Diabetes")
plt.plot(x, y_pred, color="red")
plt.show()

ALGORITHM (LOGISTIC REGRESSION):


STEP 1 : Start
STEP 2 : Read the diabetes dataset.
STEP 3 : Use iloc to select an entire column each for the x and y axes.
STEP 4 : Define the sigmoid function sig(x).
STEP 5 : Assign the predicted value of y using the sigmoid of x.
STEP 6 : Using the scatter() function, draw a scatter plot of the values.
STEP 7 : Set the x label, y label and title.
STEP 8 : Plot the logistic regression curve.
STEP 9 : Display the plot using show().
STEP 10 : Stop.

PROGRAM (LOGISTIC REGRESSION):


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def sig(x):
    # Logistic (sigmoid) function.
    return 1 / (1 + np.exp(-x))

d = pd.read_csv("UCI diabetes.csv")
x = d.iloc[:, 5].values    # column 5 as the independent variable
y = d.iloc[:, 8].values    # column 8 as the binary outcome
y_pred = sig(x)            # sigmoid of the raw values; regplot fits its own curve
plt.scatter(x, y)
plt.xlabel("BMI")
plt.ylabel("Diabetes Coefficient")
sns.regplot(x=x, y=y, logistic=True, ci=None, color="red")
plt.show()

RESULT:
Thus the program to perform bivariate analysis on the diabetes dataset from UCI was written
and executed, and the output was verified successfully.

EX.NO:5c. MULTIPLE REGRESSION ANALYSIS

STUDY:
Multiple Regression Analysis works by considering the values of the available multiple
independent variables and predicting the value of one dependent variable.
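
The idea can be made concrete with two independent variables and one dependent variable. This is a minimal sketch using ordinary least squares from NumPy on made-up, noise-free data, so the recovered coefficients are known in advance; x1, x2 and y are illustrative names, not columns of the diabetes data sets.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 + 3*x2 (no noise).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept column + two predictors
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef                                 # predicted dependent variable

print(coef)   # intercept, coefficient of x1, coefficient of x2
```

With real data the fit is not exact, and the residuals measure how much of y the two predictors leave unexplained.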

AIM:
To perform Multiple Regression Analysis for the diabetes dataset from UCI and the PIMA
Indian dataset.

ALGORITHM:
STEP 1: Start
STEP 2: Display the choices for datasets
STEP 3: Get the choice
STEP 4: Repeat the following steps while the choice is not equal to 3
4.1: If choice is 1, read the diabetes dataset
4.2: If choice is 2, read the PIMA Indians diabetes dataset
4.3: Otherwise, print Invalid choice
STEP 5: Import necessary modules
STEP 6: Reshape the columns of the dataset to be suitable for the analysis plotting
STEP 7: Set the figure to plot and the projection as 3D
STEP 8: Plot the analysis using a scatter plot and set the color as desired
STEP 9: Display the plot of the multiple regression
STEP 10: Stop

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # registers the 3D projection on older Matplotlib

print("Choices:\n1.Diabetes\n2.Pima indians\n3.Exit\n")
ch = int(input("Enter choice:"))
while ch != 3:
    if ch == 1:
        d = pd.read_csv("Dataset of Diabetes.csv")
    elif ch == 2:
        d = pd.read_csv("pima.csv")
    else:
        print("Invalid choice")
        ch = int(input("Enter choice:"))
        continue                           # nothing to plot for an invalid choice
    x = d.iloc[:, 5].values
    y = d.iloc[:, 1].values
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(xs=x, ys=y, c='red', s=5)
    plt.show()
    ch = int(input("Enter choice:"))

OUTPUT:
Choices:
1.Diabetes
2.Pima indians
3.Exit

Enter choice:1

Enter choice:2

Enter choice :3
>>>

RESULT:
Thus, the multiple regression analysis on two given datasets was performed successfully.

EX.NO:5d COMPARING RESULTS OF ABOVE TWO DATA SETS

AIM:
To write an analysis of the results obtained for the two data sets (UCI dataset and Pima Indians
data set).

STUDY:
To compare the two data sets under the following analyses:
o Univariate Analysis
o Bivariate Analysis
o Multiple Regression Analysis

UNIVARIATE ANALYSIS:
For each data set you have downloaded, calculate the total number of rows and columns.
Take some columns from both data sets and find the mean, mode, median, frequency, skewness and kurtosis.
From these results, analyse the differences between the statistical values of the two data sets. If you
need to differentiate more accurately, use plotting techniques.
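
The comparison described above can be sketched as follows, using only the Python standard library. The two lists are hypothetical values standing in for the same column taken from each data set, and the helper name describe is an assumption; skewness and kurtosis use the simple population (moment) formulas.

```python
import statistics as st


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = st.mean(values)
    sd = st.pstdev(values)  # population standard deviation
    # Moment-based skewness and excess kurtosis (population formulas).
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd ** 4) - 3
    return {
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "std": sd,
        "skewness": skew,
        "kurtosis": kurt,
    }


# Hypothetical values standing in for one column from each data set.
col_a = [2, 4, 4, 4, 5, 5, 7, 9]
col_b = [1, 1, 2, 3, 3, 3, 10, 17]

stats_a = describe(col_a)
stats_b = describe(col_b)

# Print the two sets of statistics side by side for comparison.
print(f"{'statistic':>9} {'set A':>8} {'set B':>8}")
for key in stats_a:
    print(f"{key:>9} {stats_a[key]:8.3f} {stats_b[key]:8.3f}")
```

Both lists have the same mean here, yet their medians, spreads and skewness differ, which is the kind of difference this analysis is meant to bring out.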

BIVARIATE ANALYSIS:
In bivariate analysis, we calculate the linear and logistic regression and analyse
the results.

LINEAR REGRESSION:
For the Pima Indian and UCI datasets, take two columns as the x
and y axes. Calculate the range between the two axes and
analyse the result.

LOGISTIC REGRESSION:
For the Pima Indian and UCI datasets, take two columns as the x
and y axes. Calculate the range between the two axes and
analyse the result.

MULTIPLE REGRESSION ANALYSIS:


In multiple regression, we have to take three axes: one is the dependent variable and
the others are independent variables.
For the UCI and Pima Indian datasets, write a program to calculate the multiple
regression. Finally, write down the values corresponding to the three axes.

RESULT:
Thus, we have performed the comparison of the given analyses using both datasets.

EX.NO:6 APPLY AND EXPLORE THE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS

Study:
The UCI Machine Learning Repository is a collection of databases, domain
theories and data generators that are used by the machine learning community for
the empirical analysis of machine learning algorithms. In this exercise, a dataset from the
repository is used and graphs are plotted with various plotting functions.

EX.NO:6A NORMAL CURVES


Aim:
To write a python program to explore the normal curves for UCI dataset.

Study:
The normal curve represents the shape of an important class of statistical probabilities. It
is used to characterize complex constructs containing continuous random
variables, and many phenomena observed in nature have been found to follow a normal
curve. The normal distribution is a probability function used in statistics that describes how
data values are distributed. It is the most important probability function used in statistics
because many real-world measurements are approximately normally distributed.
For example: -> the heights of a population
-> shoe size
-> IQ level
-> the sum of many dice rolls (a single roll is uniformly distributed) and many more
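
To make the shape of the curve concrete, the following sketch evaluates the normal density directly from its formula and compares it with statistics.NormalDist from the Python standard library. The mean and standard deviation used here are hypothetical values, not statistics computed from the dataset, and normal_pdf is a helper name chosen for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 4.0, 1.8  # hypothetical mean and standard deviation
dist = NormalDist(mu, sigma)

# The hand-written formula and the library agree at each point.
for x in (mu - sigma, mu, mu + sigma):
    print(f"x={x:5.2f}  formula={normal_pdf(x, mu, sigma):.6f}  library={dist.pdf(x):.6f}")

# Share of the probability lying within one standard deviation of the mean.
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)
print(f"within one standard deviation: {within_one_sd:.4f}")
```

The curve is highest at the mean and symmetric about it, and roughly 68% of the probability lies within one standard deviation of the mean.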

Algorithm:
Step 1: Start and download the dataset.
Step 2: Import the numpy, scipy, pandas, matplotlib and statistics packages.
Step 3: Read the UCI dataset.
Step 4: Calculate the mean and standard deviation.
Step 5: Use the NumPy function arange to generate the evenly spaced values over which
the graph is plotted.
Step 6: Display the plot.
Step 7: Stop

Program:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import statistics
import pandas as pd
d=pd.read_csv("forestfires.csv")
c=str(input("Enter column name to plot normal:"))
# Mean and standard deviation of the chosen column
m=d[c].mean()
sd=d[c].std()
print("mean",m,"standard deviation",sd)
# Evenly spaced x values over which the curve is drawn
a=np.arange(-15,15,0.01)
plt.plot(a,norm.pdf(a,m,sd))
plt.show()

Output:
Enter column name to plot normal: wind

Result:
Thus the Python program to draw the normal curve using plotting functions for the UCI dataset
was implemented and the output was verified successfully.

EX.NO:6B DENSITY AND CONTOUR PLOTS

Aim:
To write a python program to explore the density and contour plot functions for UCI
dataset.

Study on Density Plot:


A density plot is a type of data visualization tool. It is a variation of the histogram that uses
'kernel smoothing' while plotting the values. It is a continuous and smooth version of a
histogram inferred from the data. Density plots use Kernel Density Estimation (so they are also
known as Kernel Density Estimation plots or KDE plots), which estimates the probability density
function of the data. The region of the plot with the highest peak is the region where the most
data points are concentrated.
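
The kernel smoothing mentioned above can be sketched by hand with NumPy: one Gaussian bump is placed on every data point and the bumps are averaged. This is only an illustration of the idea, not the method used internally by any plotting library. The samples are synthetic, and the bandwidth formula is a common normal-reference rule of thumb, chosen here as an assumption.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from each grid point to each sample.
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Synthetic samples standing in for one numeric column of a dataset.
rng = np.random.default_rng(1)
samples = rng.normal(loc=4.0, scale=1.5, size=300)
grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)

# Rule-of-thumb bandwidth (an assumption; other choices give smoother or rougher curves).
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should enclose a total area of about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("bandwidth:", round(float(bandwidth), 3), "area under curve:", round(area, 3))
```

A smaller bandwidth follows the individual points more closely, while a larger one gives a smoother curve with less detail.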

Study on Contour Plot:


Contour plots, also called level plots, are a tool for doing multivariate analysis and visualizing
3-D plots in 2-D space. If we consider X and Y as the variables we want to plot, then the
response Z is plotted as slices on the X-Y plane, which is why contours are sometimes
referred to as Z-slices or iso-response lines. Contour plots are widely used to visualize density,
altitude or mountain heights, and in meteorology. Because of this wide usage,
matplotlib.pyplot provides a contour method that makes it easy to draw contour plots.
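
One way to obtain a response Z from two data columns is to count how many points fall into each cell of an X-Y grid. The sketch below does this with NumPy on synthetic values; the variable names and the 20 x 20 grid size are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-ins for two numeric columns of a dataset.
rng = np.random.default_rng(2)
x = rng.normal(20, 5, 500)
y = 0.4 * x + rng.normal(0, 2, 500)

# Count the points in a 20 x 20 grid of cells; counts[i, j] belongs to x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20)

# The cell centres serve as the X-Y coordinates of the response surface.
x_mid = (x_edges[:-1] + x_edges[1:]) / 2
y_mid = (y_edges[:-1] + y_edges[1:]) / 2
X, Y = np.meshgrid(x_mid, y_mid)

# Transpose so that rows follow y and columns follow x, matching the meshgrid layout.
Z = counts.T

print("grid shape:", Z.shape, "points counted:", int(Z.sum()))
```

With matplotlib available, calling plt.contour(X, Y, Z) on these arrays draws the lines of equal count, so the densest region of the data appears as the innermost contours.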

Algorithm 1:
Step 1: Start the program.
Step 2: Import the modules and read the dataset.
Step 3: Read the column name for the density plot.
Step 4: Plot the density plot for the column using the density() function.
Step 5: Display the plot.
Step 6: Stop the program.

Algorithm 2:
Step 1: Start the program.
Step 2: Import the modules and read the dataset.
Step 3: Read the column names for the contour plot.
Step 4: Plot the contour plot for the columns using the contour() function.
Step 5: Display the plot.
Step 6: Stop the program.

Program 1:
import pandas as pd
import matplotlib.pyplot as plt


data=pd.read_csv("forestfires.csv")
print(data.head(2))
a=input("Enter column name to form density curve:")
data[a].plot.density(color='blue')
plt.title('Density plot')
plt.xlabel(a)
plt.show()

Program 2:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv("forestfires.csv")
print(data.head(2))
a=input("Enter column name for x-axis:")
b=input("Enter column name for y-axis:")
x1=np.linspace(data[a].min(),data[a].max(),50)
y1=np.linspace(data[b].min(),data[b].max(),50)
x,y=np.meshgrid(x1,y1)
z=np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
plt.xlabel(a)
plt.ylabel(b)
plt.title('Contour Plot')
plt.contour(x,y,z)
plt.show()

Output 1:
   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
Enter column name to form density curve:wind

Output 2:
   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
Enter column name for x-axis:temp
Enter column name for y-axis:ISI

Result:
Thus, the Python program to explore the density and contour plot functions for the UCI
dataset was written, executed and the output was verified.

EX.NO:6C CORRELATION AND SCATTER PLOT

Aim:
To write a python program for correlation and scatterplot in UCI dataset.

Study:
Correlation:
A correlational research design investigates relationships between two
or more variables without the researcher controlling or manipulating any of them. It is a
non-experimental type of quantitative research.
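
The strength of a linear relationship between two variables is usually summarized by the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it from its definition and checks it against NumPy's corrcoef. The two short lists are hypothetical values, and pearson_r is a helper name chosen for illustration.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: co-variation of a and b divided by the product of their spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return float(np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2)))


# Hypothetical paired observations of two variables.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])
print("manual:", round(r, 4), "numpy:", round(r_numpy, 4))
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.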

Scatter plot:
A scatter plot is a type of data visualization that shows the relationship
between different variables. This data is shown by placing various data points between an x-
and y-axis. Essentially, each of these data points looks “scattered” around the graph, giving
this type of data visualization its name.
Algorithm:
Step 1: Start
Step 2: Read the dataset and compute the correlation for two
selected columns.
Step 3: Using seaborn, draw the scatter plot and draw the regression line through the points.
Step 4: Stop.

Program:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
con=pd.read_csv("forestfires.csv")
# Correlate the numeric columns only (month and day are text)
cormat=con.corr(numeric_only=True)
sns.lmplot(x="rain",y="temp",data=con)
sns.scatterplot(x="rain",y="temp",data=con)
plt.show()
sns.heatmap(cormat)
plt.show()

Output:

Result:
Thus the python program for correlation and scatter plot in UCI dataset was written,
executed and outputs were verified.

EX.NO:6D HISTOGRAM
Aim:
To write a python program for plotting histogram for UCI datasets.

Study:
A histogram is a graphical representation of data points organized into user-specified ranges.
Similar in appearance to a bar graph, the histogram condenses a data series into an easily
interpreted visual by taking many data points and grouping them into logical ranges, or bins. A
histogram divides the range of possible values in a data set into classes or groups. For each
group, a rectangle is constructed with a base equal to the range of values in that
group and a height equal to the number of observations falling into that group. A histogram
looks similar to a vertical bar chart, but there are no gaps between the bars.
Generally, a histogram has bars of equal width.
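
The grouping into bins can be sketched with NumPy's histogram function, which returns the number of observations in each bin together with the bin edges. The values, the number of bins and the range below are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical measurements to be grouped into bins.
values = np.array([0.9, 1.8, 2.2, 2.7, 3.1, 3.6, 4.0, 4.5, 4.9, 5.4, 6.3, 7.6])

# Four equal-width bins covering 0 to 8: [0,2), [2,4), [4,6), [6,8].
counts, edges = np.histogram(values, bins=4, range=(0, 8))

# Print a simple text histogram: one '#' per observation in the bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:3.0f} to {right:3.0f} | {'#' * int(count)}")
# counts -> [2 4 4 2]
```

Each value is counted in exactly one bin, so the counts always add up to the number of observations.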

Algorithm:
Step 1: Start
Step 2: Import libraries
Step 3: Read the csv file


Step 4: Display the content
Step 5: Plot the histogram for the dataset
Step 6: Stop

Program:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
d=pd.read_csv("forestfires.csv")
sns.histplot(d['wind'])
plt.show()

Output:

Result:
Thus the python program to plot histogram for UCI datasets was written, executed and output
was verified successfully.

EX.NO:6E THREE DIMENSIONAL PLOTTING


Aim:
To write a python program for three-dimensional plotting using UCI Datasets.

Study:
When three-dimensional axes are enabled, data can be plotted in three dimensions. A 3-D graph can be rotated
and viewed from different angles, which makes the data more interactive. As with 2-D graphs, there are different
ways to represent a 3-D graph: we can make a scatter plot, contour plot, surface plot, etc. The 3-D plots are
enabled by importing the mplot3d toolkit. The most basic three-dimensional plot is a line or a collection of
scatter points created from sets of (x, y, z) triples. In analogy with the more common two-dimensional plots
discussed earlier, these can be created using the ax.plot3D and ax.scatter3D functions.

Algorithm:
Step 1: Start
Step 2: Read the UCI dataset and draw three dimensional plotting using
the dataset.
Step 3: Perform various methods for the dataset
Step 4: Stop
Program:
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
d=pd.read_csv("C:/Users/admin/Documents/mad/forestfires.csv")
# DC values for three selected DMC readings, trimmed to a common length
x=d[d["DMC"]==35.4]["DC"].values.reshape(-1,1)
y=d[d["DMC"]==43.7]["DC"].values.reshape(-1,1)
z=d[d["DMC"]==33.3]["DC"].values.reshape(-1,1)
l=[len(x),len(y),len(z)]
m=min(l)
x=x[:m]
y=y[:m]
z=z[:m]
# Randomly generated points used for the 3-D scatter
np.random.seed(42)
xs=np.random.random(100)*10+0.2
ys=np.random.random(100)*5+4.0
zs=np.random.random(100)*15+0.1
fig=plt.figure()
ax=fig.add_subplot(111,projection='3d')
# The listing as printed set up the axes but drew no points; plot the generated values
ax.scatter3D(xs,ys,zs)
ax.set_xlabel("wind")
ax.set_ylabel("temp")
ax.set_zlabel("rain")
plt.show()

Output:

Result:
Thus the python program for three dimensional plotting using UCI dataset was written,
executed and output was verified.

EX. NO: 7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
Aim:
To write a python program to visualize Geographic data with Basemap.

Algorithm:
Step 1 : Start
Step 2 : Import the required packages
Step 3 : By using the Basemap library, create the required map using the functions
drawcoastlines(), fillcontinents() and drawcountries()
Step 4 : Display the plot

Step 5 : Stop

Program:
from mpl_toolkits.basemap import Basemap


import matplotlib.pyplot as plt
fig=plt.figure()
m=Basemap()
m.drawcoastlines()
m.fillcontinents()
m.drawcountries()
plt.title("Base Map")
plt.show()
Output:

Result:
Thus the python program to visualize geographic data using basemap is written,
executed and the output is verified.
