CS3361-Data Science Laboratory Manual
Salem College
of Engineering and Technology
NH-68, Salem-Attur Main Road, Mettupatty, Perumapalayam, Selliamman
Nagar, Salem, Tamil Nadu 636111
Lab Manual
(III semester)
Regulation 2021
Prepared By
Ms.S.Namagiri, AP/CSE
Ms.A.Merlin, AP/CSE
Department of CSE & IT
Salem college of Engineering and Technology Data Science Laboratory
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
2. Working with NumPy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
LIST OF EQUIPMENT: (30 Students per Batch)
Tools: Python, NumPy, SciPy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh
Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.
LIST OF EXPERIMENTS
1 Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages
2 Working with NumPy arrays
5.1.b Univariate Analysis - Pima Indians Diabetes for Non Diabetic Patients
5.2.a Bivariate Analysis - Linear Regression - Pima Indians Diabetes for Diabetic Patients
5.2.b Bivariate Analysis - Linear Regression - Pima Indians Diabetes for Non Diabetic Patients
6.2 Density and contour plots - Pima Indians Diabetes data set
AIM:-
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
DESCRIPTION:-
1. Download the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages from the web.
2. Check the Python version before installing NumPy, because most systems have Python pre-installed.
To check the version of Python 3, run the command python3 --version.
3. Install the packages on Windows. First install pip, which is a package manager for installing and managing
Python software packages. The easiest way to install NumPy is by using pip.
4. With pip set up, install NumPy from the command line by typing pip install numpy.
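Once the packages are installed, the installation can be checked from Python itself. The following is a minimal sketch that uses only the standard library; the helper name installed_versions and the PACKAGES list are illustrative and are not part of the original procedure.

```python
from importlib import metadata

# Distribution names as published on PyPI.
PACKAGES = ["numpy", "scipy", "jupyter", "statsmodels", "pandas"]


def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(name, version if version else "not installed")
```

A package reported as "not installed" can then be added with pip in the same way as NumPy.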
FEATURES:-
1. NUMPY
NumPy stands for Numerical Python.
NumPy is one of the most commonly used packages for scientific computing in Python.
It provides a high-performance N-dimensional array object.
It contains tools for integrating code from C/C++ and Fortran.
It provides a multidimensional container for generic data.
It offers additional linear algebra, Fourier transform and random number capabilities.
It supports broadcasting functions.
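Broadcasting is easiest to understand from a small example. The sketch below uses illustrative values only; it shows how a row and a column are stretched to match the shape of a 2 x 3 array during addition.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
col = np.array([[100], [200]])              # shape (2, 1)

# The row is stretched across both rows of the matrix.
added_row = matrix + row
# The column is stretched across all three columns of the matrix.
added_col = matrix + col

print(added_row)
print(added_col)
```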
2. SCIPY:
SciPy stands for Scientific Python.
SciPy is a scientific computation library that uses NumPy underneath.
It provides more utility functions for optimization, statistics and signal processing.
SciPy is a Python library that is useful for solving many mathematical equations and algorithms.
It is built on top of the NumPy library and extends it with scientific and mathematical
routines such as matrix rank, inverse, polynomial equations, LU decomposition, etc.
3. JUPYTER:-
Jupyter is a loose acronym for Julia, Python and R.
These programming languages were the first target languages of the Jupyter application.
The main components of the whole environment are, on the one hand, the notebooks
themselves and, on the other, the application.
Jupyter Notebook is an open-source, web-based interactive environment that allows you to
create and share documents that contain live code, mathematical equations, graphics, maps, plots,
visualizations and narrative text.
4. STATSMODELS:-
Statsmodels stands for statistical models. Statsmodels is a Python package that allows users
to explore data, estimate statistical models, and perform statistical tests.
An extensive list of descriptive statistics, statistical tests, plotting functions and result
statistics is available for different types of data and each estimator.
5. PANDAS:-
The name Pandas is derived from "panel data"; the project describes itself as the Python Data Analysis Library.
It provides a fast and efficient DataFrame object with default and customized indexing.
It provides tools for loading data into in-memory data objects from different file formats.
It supports data alignment and integrated handling of missing data.
It supports reshaping and pivoting of data sets.
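Data alignment and missing-data handling can be seen with two small series. This is a minimal sketch with illustrative values; the variable names are chosen for the example only.

```python
import pandas as pd

# Two series with partly different index labels (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on index labels; labels present in only one series give NaN.
total = s1 + s2

# Missing values can be counted, filled, or dropped.
missing = int(total.isna().sum())
filled = total.fillna(0.0)
dropped = total.dropna()

print(total, missing, filled, dropped, sep="\n")
```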
RESULT:-
Thus the above packages were downloaded and installed, and their features were studied.
Date:
AIM:-
ALGORITHM:-
2. Create an array from a list with type float using array() and print the same.
3. Create an array from a tuple using array() and print the same.
4. Create a 3*4 array of all zeros using zeros() and print the same.
5. Create a constant-value array of complex type using full() and print the same.
6. Create a sequence of integers with steps from 0 to 30 using arange() and print the same.
7. Reshape a 3*4 array to a 2*2*3 array using reshape() and print the same.
10. Split an array using array_split() and print the same.
11. Sort a given array using sort() and print the same.
12. Search for the given key value in the given array using where() and print the same.
PROGRAM :
import numpy as np
a=np.array([[1,2,4],[5,8,7]],dtype='float')
b=np.array((1,3,2))
c=np.zeros((3,4))
OUTPUT :
1.Array created using passed list:
[[1. 2. 4.]
[5. 8. 7.]]
[4 5 2]]
[[4 2 1]
[2 0 1]]]
7.Old array
[[1 2 3]
[4 5 6]]
Flatten array
[1 2 3 4 5 6]
8.Joined array
[1 2 3 4 5 6]
9.Split array
[array([1, 2]), array([3, 4]), array([5, 6])]
10.Sorted 1D array:
[0 1 2 3]
10.Sorted 2D array:
[[2 3 4]
[0 1 5]]
11.Searched array position:
(array([3, 5, 6], dtype=int64),)
RESULT:-
Thus the above program is executed successfully.
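Only part of the program listing is reproduced in this experiment. The following is a minimal sketch of the steps in the algorithm; the constant value, the step size and the sample arrays are assumptions chosen so that the flatten, join, split, sort and search results agree with the output shown.

```python
import numpy as np

# Steps 2-3: arrays from a list (as float) and from a tuple.
a = np.array([[1, 2, 4], [5, 8, 7]], dtype="float")
b = np.array((1, 3, 2))

# Step 4: 3x4 array of zeros.
c = np.zeros((3, 4))

# Step 5: constant-value array of complex type (the value 6 is an assumption).
d = np.full((3, 3), 6, dtype="complex")

# Step 6: integers from 0 up to 30 in steps of 5 (the step size is an assumption).
e = np.arange(0, 30, 5)

# Step 7: reshape a 3x4 array into a 2x2x3 array.
f = np.arange(12).reshape(3, 4)
g = f.reshape(2, 2, 3)

# Flatten a 2D array and join two 1D arrays.
old = np.array([[1, 2, 3], [4, 5, 6]])
flat = old.flatten()
joined = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))

# Step 10: split into three parts.
parts = np.array_split(joined, 3)

# Step 11: sort a 1D array.
sorted_1d = np.sort(np.array([3, 2, 0, 1]))

# Step 12: positions where the key value 4 occurs (sample data is an assumption).
h = np.array([1, 2, 3, 4, 5, 4, 4])
positions = np.where(h == 4)

print(a, b, c, d, e, g, flat, joined, parts, sorted_1d, positions, sep="\n")
```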
Date:
AIM:-
ALGORITHM:-
5. Use the concat() method to join the two data frames into a single data frame.
PROGRAM:
import pandas as pd
data1={'Name':['Jai','Princi','Gaurav','Anuj'],
'Age':[27,24,22,32],
'Address':['Nagpur','Kanpur','Allahabad','Kannuaj'],
'Qualification':['M.sc','MA','MCA','P.hd']}
data2={'Name':['Abhi','Ayushi','Dhiraj','Hitesh'],
'Age':[17,14,12,52],
'Address':['Nagpur','Kanpur','Allahabad','Kannuaj'],
'Qualification':['B.tech','BA','B.com','B.hons']}
df=pd.DataFrame(data1, index=[0,1,2,3])
df1=pd.DataFrame(data2,index=[4,5,6,7])
print ("DataFrame1\n\n",df,"\n\n","DataFrame2\n\n",df1,"\n\n")
frames=[df,df1]
res1=pd.concat(frames)
print(res1)
OUTPUT:
DataFrame 1
1 Princi 24 Kanpur MA
DataFrame 2
5 Ayushi 14 Kanpur BA
1 Princi 24 Kanpur MA
5 Ayushi 14 Kanpur BA
RESULT:-
Date:
AIM:-
To write a Python program for reading data from a text file, an Excel file and the web.
ALGORITHM:-
PROGRAM :
import pandas as pd
f=open("file.txt","r")
print (f.read())
df=pd.read_excel('test.xlsx',sheet_name='Employee')
print (df)
f=open("D:\\cse\\test.txt","r")
print(f.read())
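The program above covers the text file and the Excel file. For the web, pandas.read_csv() accepts a URL in place of a file path. The sketch below is illustrative: the helper load_csv, the sample contents and the example.com address are assumptions, and an in-memory buffer stands in for the downloaded file so that the example runs without a network connection.

```python
import io

import pandas as pd


def load_csv(source):
    """Read CSV data from a local path, a URL string, or an open file-like object."""
    return pd.read_csv(source)


# Stand-in for a downloaded file; the contents are illustrative only.
sample = io.StringIO("S.no,Name\n1,Steve\n2,Andrea\n3,Mike\n")
df = load_csv(sample)
print(df)

# Reading from the web uses the same call with a URL (hypothetical address):
# df_web = load_csv("https://example.com/data.csv")

# A with-block closes a text file automatically after reading:
# with open("file.txt", "r") as f:
#     print(f.read())
```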
INPUT:-
OUTPUT:-
S.no Name
0 1 Steve
1 2 Andrea
2 3 Mike
3 4 John
4 5 Max
5 6 Emma
RESULT:-
#5.checking duplicates
print("\n checking duplicates\n")
data=df.drop_duplicates(subset="variety")
print(data)
#6.value counts
print("\n variety counts \n")
print(df.value_counts("variety"))
OUTPUT:-
:first 5 rows
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
(150, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal.length 150 non-null float64
1 sepal.width 150 non-null float64
2 petal.length 150 non-null float64
3 petal.width 150 non-null float64
4 variety 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
Statistical values
Checking duplicates
variety counts
RESULT:-
Thus the above program for exploratory descriptive analytics on the Iris data set is verified.
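The output shows the first rows, the shape, the column information and the statistical values, but the listing for those steps is not reproduced. A minimal sketch of the usual pandas calls follows; the small DataFrame is a stand-in for the Iris file, with column names taken from the output and a handful of rows used for illustration only.

```python
import pandas as pd

# Small stand-in for the Iris file (illustrative rows, not the full data set).
df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 4.7, 7.0, 6.3],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.3],
    "petal.length": [1.4, 1.4, 1.3, 4.7, 6.0],
    "petal.width": [0.2, 0.2, 0.2, 1.4, 2.5],
    "variety": ["Setosa", "Setosa", "Setosa", "Versicolor", "Virginica"],
})

print(df.head())       # first 5 rows
print(df.shape)        # (rows, columns)
df.info()              # dtypes and non-null counts; prints directly and returns None
stats = df.describe()  # count, mean, std, min, quartiles and max of numeric columns
print(stats)
```

Because info() returns None, wrapping it in print() adds the extra "None" line seen in the output.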
print(df1['insulin'].skew())
print(df1['insulin'].kurt())
print(df1['bmi'].value_counts())
print(df1['bmi'].mean())
print(df1['bmi'].std())
print(df1['bmi'].mode())
print(df1['bmi'].var())
print(df1['bmi'].skew())
print(df1['bmi'].kurt())
print(df1['pedigree'].value_counts())
print(df1['pedigree'].mean())
print(df1['pedigree'].std())
print(df1['pedigree'].mode())
print(df1['pedigree'].var())
print(df1['pedigree'].skew())
print(df1['pedigree'].kurt())
print(df1['age'].value_counts())
print(df1['age'].mean())
print(df1['age'].std())
print(df1['age'].mode())
print(df1['age'].var())
print(df1['age'].skew())
print(df1['age'].kurt())
OUTPUT:
8 2
6 1
0 1
3 1
2 1
10 1
1 1
5 1
Name: pregnant, dtype: int64
4.777777777777778
3.492054473292827
0 8
Name: pregnant, dtype: int64
12.194444444444445
0.07976827393633572
-1.4086477342894645
148 1
183 1
137 1
78 1
197 1
125 1
168 1
189 1
166 1
Name: glucose, dtype: int64
154.55555555555554
37.4069215223303
0 78
1 125
2 137
3 148
4 166
5 168
6 183
7 189
8 197
Name: glucose, dtype: int64
1399.2777777777776
-1.0313853707975722
0.978100480723807
72 2
64 1
40 1
50 1
70 1
96 1
74 1
60 1
Name: bp, dtype: int64
66.44444444444444
15.898986690282428
0 72
Name: bp, dtype: int64
252.77777777777777
0.13656074039998925
1.0136266496454889
0 3
35 2
32 1
45 1
23 1
19 1
Name: skin, dtype: int64
21.0
17.392527130926087
0 0
Name: skin, dtype: int64
302.5
-0.21810449977720303
-1.6245455521188052
0 4
168 1
88 1
543 1
846 1
175 1
Name: insulin, dtype: int64
202.22222222222223
297.7233521987223
0 0
Name: insulin, dtype: int64
88639.19444444445
1.6550059400947466
1.9781799097232406
33.6 1
23.3 1
43.1 1
31.0 1
30.5 1
0.0 1
38.0 1
30.1 1
25.8 1
Name: bmi, dtype: int64
28.37777777777778
12.189521912053994
0 0.0
1 23.3
2 25.8
3 30.1
4 30.5
5 31.0
6 33.6
7 38.0
8 43.1
Name: bmi, dtype: float64
148.58444444444447
-1.6632175863057026
3.989200103644359
0.627 1
0.672 1
2.288 1
0.248 1
0.158 1
0.232 1
0.537 1
0.398 1
0.587 1
Name: pedigree, dtype: int64
0.6385555555555555
0.6462886566989844
0 0.158
1 0.232
2 0.248
3 0.398
4 0.537
5 0.587
6 0.627
7 0.672
8 2.288
Name: pedigree, dtype: float64
0.41768902777777767
2.521186410463069
6.957869024161264
50 1
32 1
33 1
26 1
53 1
54 1
34 1
59 1
51 1
Name: age, dtype: int64
43.55555555555556
12.135805608931685
0 26
1 32
2 33
3 34
4 50
5 51
6 53
7 54
8 59
Name: age, dtype: int64
147.27777777777774
-0.23884551536673332
-1.9149668580541754
RESULT:-
Thus the above program for univariate analysis on the Pima Indians Diabetes data set for diabetic patients is verified.
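The same set of statistics is printed for every column, so the repeated print statements can be collected into one helper. The sketch below is one possible arrangement, not the listing used in this experiment; the function name univariate_summary is hypothetical, and the sample values are reconstructed from the 'pregnant' frequency table in the output.

```python
import pandas as pd


def univariate_summary(series):
    """Return the statistics printed in this experiment for one column."""
    return {
        "frequency": series.value_counts(),
        "mean": series.mean(),
        "median": series.median(),
        "mode": series.mode().tolist(),
        "std": series.std(),
        "variance": series.var(),
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }


# Values reconstructed from the 'pregnant' frequency table in the output.
pregnant = pd.Series([8, 8, 6, 0, 3, 2, 10, 1, 5], name="pregnant")
summary = univariate_summary(pregnant)
for name, value in summary.items():
    print(name, value, sep=": ")
```

With a full data frame, the helper can be applied to each column in a loop, for example for col in df1.columns.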
print(df1['insulin'].mean())
print(df1['insulin'].median())
print(df1['insulin'].std())
print(df1['insulin'].mode())
print(df1['insulin'].var())
print(df1['insulin'].skew())
print(df1['insulin'].kurt())
print(df1['bmi'].value_counts())
print(df1['bmi'].mean())
print(df1['bmi'].median())
print(df1['bmi'].std())
print(df1['bmi'].mode())
print(df1['bmi'].var())
print(df1['bmi'].skew())
print(df1['bmi'].kurt())
print(df1['pedigree'].value_counts())
print(df1['pedigree'].mean())
print(df1['pedigree'].median())
print(df1['pedigree'].std())
print(df1['pedigree'].mode())
print(df1['pedigree'].var())
print(df1['pedigree'].skew())
print(df1['pedigree'].kurt())
print(df1['age'].value_counts())
print(df1['age'].mean())
print(df1['age'].median())
print(df1['age'].std())
print(df1['age'].mode())
print(df1['age'].var())
print(df1['age'].skew())
print(df1['age'].kurt())
OUTPUT:
1 2
10 2
5 1
4 1
Name: pregnant, dtype: int64
5.166666666666667
4.5
4.070217029430577
0 1
1 10
Name: pregnant, dtype: int64
16.56666666666667
0.3539476771784376
-1.9239379941621584
85 1
89 1
116 1
115 1
110 1
139 1
Name: glucose, dtype: int64
109.0
112.5
19.809088823063014
0 85
1 89
2 110
3 115
4 116
5 139
Name: glucose, dtype: int64
392.4
0.221379243643542
-0.31517019081197173
66 2
74 1
0 1
92 1
80 1
Name: bp, dtype: int64
63.0
70.0
32.36664950222682
0 66
Name: bp, dtype: int64
1047.6
-1.9408208876079585
4.311601666734459
0 4
29 1
23 1
Name: skin, dtype: int64
8.666666666666666
0.0
13.559744343705992
0 0
Name: skin, dtype: int64
183.86666666666665
1.052575991628368
-1.3694264585166165
0 5
94 1
21 1
29 1
57 1
Name: age, dtype: int64
33.0
30.0
12.312595177297109
0 30
Name: age, dtype: int64
151.6
1.923829603041346
4.4999860763988
RESULT:-
Thus the above program for univariate analysis on the Pima Indians Diabetes data set for non-diabetic patients is verified.
AIM
To write a Python program for linear regression using the Pima Indians Diabetes data set for diabetic
patients.
ALGORITHM
Step 1: Start the program
Step 2: Import the pandas, matplotlib.pyplot and scipy packages
Step 3: Read data from the Pima Indians Diabetes CSV data set
Step 4: Separate diabetic patients from the data set
Step 5: Get x, y values from the data set for linregress()
Step 6: Plot x, y values using a scatter plot
Step 7: Plot the regression line by calculating y=bx+a
Step 8: Stop the program
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
df=pd.read_csv("pima1.csv")
df1=df[df.outcome==1]
x=df1["age"]
y=df1["bp"]
slope,intercept,r,p,std_err=stats.linregress(x,y)
def myfunc(x):
    return slope*x+intercept
mymodel=list(map(myfunc,x))
plt.scatter(x,y)
plt.plot(x,mymodel)
plt.show()
OUTPUT:
RESULT:
Thus the above program for bivariate analysis with linear regression using the Pima Indians Diabetes data set for
diabetic patients is verified.
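The line y=bx+a in Step 7 is the least-squares line: b is the covariance of x and y divided by the variance of x, and a is chosen so that the line passes through the point of means. The sketch below computes both directly with NumPy; the function name fit_line and the sample points are illustrative and are not taken from the data set.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares slope b and intercept a for y = b*x + a."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    a = y_mean - b * x_mean
    return b, a


# Illustrative points lying exactly on y = 2x + 1.
b, a = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b, a)
```

stats.linregress() returns the same slope and intercept, together with the correlation coefficient r, the p-value and the standard error.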
AIM
To write a Python program for linear regression using the Pima Indians Diabetes data set for non-diabetic
patients.
ALGORITHM
Step 1: Start the program
Step 2: Import the pandas, matplotlib.pyplot and scipy packages
Step 3: Read data from the Pima Indians Diabetes CSV data set
Step 4: Separate non-diabetic patients from the data set
Step 5: Get x, y values from the data set for linregress()
Step 6: Plot x, y values using a scatter plot
Step 7: Plot the regression line by calculating y=bx+a
Step 8: Stop the program
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
df=pd.read_csv("pima1.csv")
df1=df[df.outcome==0]
x=df1["age"]
y=df1["bp"]
slope,intercept,r,p,std_err=stats.linregress(x,y)
def myfunc(x):
    return slope*x+intercept
mymodel=list(map(myfunc,x))
plt.scatter(x,y)
plt.plot(x,mymodel)
plt.show()
OUTPUT:
RESULT:
Thus the above program for bivariate analysis with linear regression using the Pima Indians Diabetes data set for
non-diabetic patients is verified.
RESULT:
Thus the above program for bivariate analysis with logistic regression using the Pima Indians Diabetes data set is
verified.
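The listing for this experiment is not reproduced here. As a minimal sketch only, logistic regression can be fitted with scikit-learn's LogisticRegression; the synthetic predictor and outcome values below are assumptions standing in for one column of the data set and the outcome column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): one predictor and a binary outcome.
X = np.array([[20], [25], [30], [35], [40], [45], [50], [55], [60], [65]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict() returns class labels; predict_proba() returns class probabilities.
new_values = np.array([[22], [62]])
labels = model.predict(new_values)
probs = model.predict_proba(new_values)

print(labels)
print(probs)
```

Unlike linear regression, the fitted model outputs a probability between 0 and 1 for each class, which suits a binary outcome such as diabetic or non-diabetic.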
AIM
To write a Python program for multiple regression analysis using the Pima Indians Diabetes data set.
ALGORITHM
Step 1: Start the program
Step 2: Import pandas, and linear_model from the sklearn package
Step 3: Read data from the Pima Indians Diabetes CSV data set
Step 4: Get X as the independent values and y as the dependent values from the data set for linear_model.LinearRegression()
Step 5: Predict y for the given X values based on the Pima Indians Diabetes data set
Step 6: Stop the program
PROGRAM:
import pandas as pd
from sklearn import linear_model
df=pd.read_csv("pima1.csv")
X=df[['age','bp']]
y=df['outcome']
regr=linear_model.LinearRegression()
regr.fit(X.values,y)
predicteddia=regr.predict([[62.11,63]])
print(predicteddia)
OUTPUT:
[1.00002405]
RESULT:
Thus the above program for multiple regression analysis using the Pima Indians Diabetes data set is verified.
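Multiple regression fits one coefficient per independent variable plus an intercept. The sketch below shows the same computation with NumPy's least-squares solver on synthetic data generated from y = 1 + 2*x1 + 3*x2; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumption) generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Add a column of ones so that the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, b1, b2 = coef

# Predict for a new observation.
new_point = np.array([6.0, 2.0])
prediction = intercept + b1 * new_point[0] + b2 * new_point[1]

print(coef, prediction)
```

In scikit-learn the same quantities are available after fit() as regr.intercept_ and regr.coef_.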
AIM:
To perform result analysis for 5.1.a,5.1.b,5.2.a and 5.2.b using pima Indians diabetes data set.
ANALYSIS
5.1.a vs 5.1.b
category | function | Diabetes | Non-Diabetes
pregnant | frequency (value:count) | 8:2, 6:1, 0:1, 3:1, 2:1, 10:1, 1:1, 5:1 | 1:2, 10:2, 5:1, 4:1
pregnant | mean | 4.777777777777778 | 5.166666666666667
pregnant | standard deviation | 3.492054473292827 | 4.070217029430577
pregnant | mode | 8 | 1, 10
pregnant | variance | 12.194444444444445 | 16.56666666666667
pregnant | skewness | 0.07976827393633572 | 0.3539476771784376
pregnant | kurtosis | -1.4086477342894645 | -1.9239379941621584
glucose | frequency (value:count) | 148:1, 183:1, 137:1, 78:1, 197:1, 125:1, 168:1, 189:1, 166:1 | 85:1, 89:1, 116:1, 115:1, 110:1, 139:1
glucose | mean | 154.55555555555554 | 109.0
glucose | standard deviation | 37.4069215223303 | 19.809088823063014
glucose | mode | 78, 125, 137, 148, 166, 168, 183, 189, 197 | 85, 89, 110, 115, 116, 139
glucose | variance | 1399.2777777777776 | 392.4
glucose | skewness | -1.0313853707975722 | 0.221379243643542
glucose | kurtosis | 0.978100480723807 | -0.31517019081197173
bp | frequency (value:count) | 72:2, 64:1, 40:1, 50:1, 70:1, 96:1, 74:1, 60:1 | 66:2, 74:1, 0:1, 92:1, 80:1
bp | mean | 66.44444444444444 | 63.0
bp | standard deviation | 15.898986690282428 | 32.36664950222682
bp | mode | 72 | 66
bp | variance | 252.77777777777777 | 1047.6
bp | skewness | 0.13656074039998925 | -1.9408208876079585
bp | kurtosis | 1.0136266496454889 | 4.311601666734459
skin | frequency (value:count) | 0:3, 35:2, 32:1, 45:1, 23:1, 19:1 | 0:4, 29:1, 23:1
skin | mean | 21.0 | 8.666666666666666
skin | standard deviation | 17.392527130926087 | 13.559744343705992
skin | mode | 0 | 0
skin | variance | 302.5 | 183.86666666666665
skin | skewness | -0.21810449977720303 | 1.052575991628368
skin | kurtosis | -1.6245455521188052 | -1.3694264585166165
insulin | frequency (value:count) | 0:4, 168:1, 88:1, 543:1, 846:1, 175:1 | 0:5, 94:1
insulin | mean | 202.22222222222223 | 15.666666666666666
insulin | standard deviation | 297.7233521987223 | 38.37533930360312
insulin | mode | 0 | 0
insulin | variance | 88639.19444444445 | 1472.6666666666665
insulin | skewness | 1.6550059400947466 | 2.4494897427831783
insulin | kurtosis | 1.9781799097232406 | 6.0
bmi | frequency (value:count) | 33.6:1, 23.3:1, 43.1:1, 31.0:1, 30.5:1, 0.0:1, 38.0:1, 30.1:1, 25.8:1 | 26.6:1, 28.1:1, 25.6:1, 35.3:1, 37.6:1, 27.1:1
bmi | mean | 28.37777777777778 | 30.05
bmi | standard deviation | 12.189521912053994 | 5.074938423271753
bmi | mode | 0.0, 23.3, 25.8, 30.1, 30.5, 31.0, 33.6, 38.0, 43.1 | 25.6, 26.6, 27.1, 28.1, 35.3, 37.6
bmi | variance | 148.58444444444447 | 25.75499999999999
bmi | skewness | -1.6632175863057026 | 0.9474768598128397
bmi | kurtosis | 3.989200103644359 | -1.3608302417826117
pedigree | frequency (value:count) | 0.627:1, 0.672:1, 2.288:1, 0.248:1, 0.158:1, 0.232:1, 0.537:1, 0.398:1, 0.587:1 | 0.351:1, 0.167:1, 0.201:1, 0.134:1, 0.191:1, 1.441:1
pedigree | mean | 0.6385555555555555 | 0.41416666666666674
pedigree | standard deviation | 0.6462886566989844 | 0.5085675635219638
pedigree | mode | 0.158, 0.232, 0.248, 0.398, 0.537, 0.587, 0.627, 0.672, 2.288 | 0.134, 0.167, 0.191, 0.201, 0.351, 1.441
pedigree | variance | 0.41768902777777767 | 0.25864096666666664
pedigree | skewness | 2.521186410463069 | 2.336696757512407
pedigree | kurtosis | 6.957869024161264 | 5.534561849601891
age | frequency (value:count) | 50:1, 32:1, 33:1, 26:1, 53:1, 54:1, 34:1, 59:1, 51:1 | 30:2, 31:1, 21:1, 29:1, 57:1
5.2.a vs 5.2.b
5.2.a
5.2.b
RESULT:-
Thus the results of programs 5.1.a, 5.1.b, 5.2.a and 5.2.b are analyzed.
AIM
To write a Python program for normal curves using the Iris data set.
ALGORITHM:-
Step 1: Start the program
Step 2: Import the pandas and matplotlib.pyplot packages
Step 3: Read data from the Iris CSV data set
Step 4: Get x, y from the data set
Step 5: Plot a dotted line curve by using the x, y values
Step 6: Stop the program
PROGRAM:-
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("irisdata.csv")
x=df['sepal.length']
y=df['petal.length']
plt.plot(x,y, linestyle = 'dotted')
plt.show()
OUTPUT:-
RESULT:-
Thus the above program for normal curves using the Iris data set is verified.
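The program above draws a dotted line through two columns. A normal curve in the statistical sense is the bell-shaped density defined by a mean and a standard deviation. The sketch below evaluates that density with NumPy; the function name normal_pdf and the sample values are illustrative, and in the experiment the values would come from a column such as df['sepal.length'].

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Illustrative sample (assumption).
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
mu, sigma = values.mean(), values.std(ddof=1)

# Evaluate the curve on a grid spanning three standard deviations either side.
x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 200)
y = normal_pdf(x, mu, sigma)

print(mu, sigma, y.max())
```

Passing x and y to plt.plot(x, y) then draws the bell curve.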
OUTPUT:-
RESULT:-
Thus the above program for density and contour plots using the Pima Indians Diabetes data set is verified.
AIM:
To write a Python program for correlation using the Iris data set.
ALGORITHM:
1. Start the program.
2. Import the pandas and matplotlib.pyplot packages.
3. Read the iris.csv file.
4. Calculate the correlation coefficient r by using corr(method='pearson') defined in the pandas package.
5. Print the coefficient r for each pair of columns.
6. Stop the program.
PROGRAM:-
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("iris.csv")
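# numeric_only=True skips the text column 'variety'; recent pandas versions raise an error without it
x=df.corr(method='pearson',numeric_only=True)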
print(x)
OUTPUT:
              sepal.length  sepal.width  petal.length  petal.width
sepal.length      1.000000    -0.117570      0.871754     0.817941
sepal.width      -0.117570     1.000000     -0.428440    -0.366126
petal.length      0.871754    -0.428440      1.000000     0.962865
petal.width       0.817941    -0.366126      0.962865     1.000000
RESULT:
Thus the above program for correlation using the Iris data set is verified.
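Each entry of the correlation matrix is a Pearson coefficient r: the sum of products of deviations from the mean, divided by the square root of the product of the sums of squared deviations. The sketch below computes r directly; the function name pearson_r and the short sample lists are illustrative and are not the full Iris columns.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))


# Illustrative values (assumption).
sepal_length = [5.1, 4.9, 4.7, 7.0, 6.3]
petal_length = [1.4, 1.4, 1.3, 4.7, 6.0]
r = pearson_r(sepal_length, petal_length)

print(r)
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.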
PROGRAM:-
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("iris.csv")
df1=df[df.variety=='Setosa']
x1=df1['sepal.length']
y1=df1['petal.length']
df2=df[df.variety=='Versicolor']
x2=df2['sepal.length']
y2=df2['petal.length']
df3=df[df.variety=='Virginica']
x3=df3['sepal.length']
y3=df3['petal.length']
plt.scatter(x1,y1,color='red',marker='o',label='Setosa')
plt.scatter(x2,y2,color='blue',marker='s',label='Versicolor')
plt.scatter(x3,y3,color='green',marker='x',label='Virginica')
plt.title('iris_scatterplot')
plt.xlabel("sepal.length[cm]")
plt.ylabel("petal.length[cm]")
plt.legend()
plt.show()
OUTPUT:
RESULT:
Thus the above program for a scatter plot using the Iris data set is verified.
AIM:
To write a Python program for histograms using the Iris data set.
ALGORITHM:
1. Start the program.
2. Import the pandas and matplotlib.pyplot packages.
3. Read the iris.csv file.
4. Draw the histogram using hist() with sepal.width on the x-axis.
5. Draw the histogram using hist() with sepal.length on the x-axis.
6. Draw the histogram using hist() with petal.length on the x-axis.
7. Draw the histogram using hist() with petal.width on the x-axis.
8. Stop the program.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("iris.csv")
x1=df["sepal.width"]
plt.subplot(2,2,1)
plt.hist(x1,bins=20,color="green")
plt.title("sepal width in cm")
plt.xlabel("sepal width cm")
plt.ylabel("count")
x2=df["sepal.length"]
plt.subplot(2,2,2)
plt.hist(x2,bins=20,color="blue")
plt.title("sepal length in cm")
plt.xlabel("sepal length cm")
plt.ylabel("count")
x3=df["petal.length"]
plt.subplot(2,2,3)
plt.hist(x3,bins=20,color="red")
plt.title("petal length in cm")
plt.xlabel("petal length cm")
plt.ylabel("count")
x4=df["petal.width"]
plt.subplot(2,2,4)
plt.hist(x4,bins=20,color="yellow")
plt.title("petal width in cm")
plt.xlabel("petal width cm")
plt.ylabel("count")
plt.show()
OUTPUT:
RESULT:
Thus the above program for histograms using the Iris data set is verified.
AIM:
To write a Python program for three-dimensional plotting using the Pima Indians Diabetes data set.
ALGORITHM:
1. Start the program.
2. Import mplot3d from mpl_toolkits, and the numpy, pandas and matplotlib.pyplot packages.
3. Read the pima1.csv file.
4. Read x and y.
5. Generate the grid for the X and Y axes using meshgrid().
6. Generate the Z axis by using f(x,y).
7. Draw the wireframe by using plot_wireframe().
8. Stop the program.
PROGRAM:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("pima1.csv")
#function for z area
def f(x,y):
return np.sin(np.sqrt(x**2+y**2))
#x and y axis
x=df['age']
y=df['bp']
X,Y=np.meshgrid(x,y)
Z=f(X,Y)
fig=plt.figure()
ax=plt.axes(projection='3d')
ax.plot_wireframe(X,Y,Z,color='green')
ax.set_title('wireframe geeks for geeks')
plt.show()
OUTPUT:
RESULT:
Thus the above program for three-dimensional plotting using the Pima Indians Diabetes data set is verified.
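The role of meshgrid() is easier to see on short axes. The sketch below uses illustrative values in place of the age and bp columns and the same f(x,y) as the program; it shows that meshgrid() pairs every x with every y, so the grids and Z all have shape (len(y), len(x)).

```python
import numpy as np


def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))


# Short illustrative axes (assumption); the experiment uses the age and bp columns.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 3.0])

# meshgrid pairs every x with every y: both outputs have shape (len(y), len(x)).
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

print(X.shape, Y.shape, Z.shape)
print(Z)
```

With two columns of n rows each, the grid has n x n points, so the wireframe grows quickly with the size of the data set.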
AIM:
To write the python program for Visualizing Geographic Data with Basemap
ALGORITHM:
1. Start the program.
2. Import matplotlib.pyplot and Basemap from mpl_toolkits.basemap package.
3. Draw the figure window using figure().
4. Draw the coastline using drawcoastlines()
5. Draw the countries using drawcountries().
6. Fill the continents using fillcontinents().
7. Draw the boundaries using drawmapboundary().
8. Stop the program.
PROGRAM:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(12,12))
m=Basemap()
m.drawcoastlines(linewidth=1.0,linestyle='solid',color='black')
m.drawcountries(linewidth=1.0,linestyle='solid',color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(color='b',linewidth=2.0,fill_color='aqua')
plt.title("Filled map boundary",fontsize=20)
plt.show()
OUTPUT:
RESULT:-
Thus the above program for Visualizing Geographic Data with Basemap is verified.