Grace Python Numpy MB
LAB
PREPARED BY Mrs. P. JOY SUGANTHY BAI,
Assistant professor
CSE Department
Grace College of Engineering, Thoothukudi
CS3361_DATA SCIENCE
4931_Grace College of Engineering,
List of Experiments
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
Anaconda:
Jupyter NoteBook:
The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations, and narrative text. Its
uses include data cleaning and transformation, numerical simulation, statistical modeling,
data visualization, machine learning, and much more.
NumPy:
NumPy is a Python library used for working with arrays. It also has functions for working
in the domains of linear algebra, Fourier transforms, and matrices.
SciPy:
SciPy is a scientific computation library that uses NumPy underneath. SciPy stands
for Scientific Python. It provides more utility functions for optimization, stats and signal
processing. Like NumPy, SciPy is open source so we can use it freely.
NumPy, which stands for Numerical Python, is used for the manipulation of elements of numerical array
data. SciPy, which stands for Scientific Python, is used for numerical computations in
Python. Both these packages provide extended functionality to work with Python.
Statsmodels:
Statsmodels is a popular library in Python that enables us to estimate and analyze various
statistical models. It includes various models, such as ordinary least squares, weighted
least squares, etc.
Pandas
Pandas is a powerful library. It provides a large set of commands and features
for analyzing data easily. We can use Pandas to perform various tasks like
filtering data according to certain conditions, or segmenting and segregating the data
according to preference, etc.
Download Anaconda:
7. Click "New"
9. Sample programs
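A quick way to confirm the installation from a notebook cell is to ask Python whether it can locate each package. The following is a minimal sketch, not part of the original record; the list of import names and the helper installed_packages are assumptions made for illustration.

```python
import importlib.util

# Import names of the packages explored in this exercise (assumed list).
PACKAGES = ["numpy", "scipy", "pandas", "statsmodels", "notebook"]

def installed_packages(names):
    """Map each package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = installed_packages(PACKAGES)
for name, found in status.items():
    print(f"{name:12s} {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed from the Anaconda prompt before continuing.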
Result :
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are downloaded and
installed successfully.
Aim:
To work with NumPy arrays using Jupyter Notebook.
Program 1:
# Python program to demonstrate
# basic array characteristics
import numpy as np
Output:
array is of type:
No. of dimensions:
Shape of array: 2
Size of array: 6
Array stores elements of type: int64
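The body of Program 1 did not survive in this copy. A minimal sketch that produces output of the same form is given below; the array values are hypothetical and were chosen only so that the array has 2 dimensions and 6 elements.

```python
import numpy as np

# Hypothetical 2 x 3 integer array (values are an assumption)
arr = np.array([[1, 2, 3],
                [4, 2, 5]])

print("Array is of type:", type(arr))
print("No. of dimensions:", arr.ndim)
print("Shape of array:", arr.shape)
print("Size of array:", arr.size)
print("Array stores elements of type:", arr.dtype)
```

The integer dtype printed depends on the platform, so it may not be int64 everywhere.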
Program 2:
# Python program to demonstrate
# array creation techniques
import numpy as np
newarr = arr.reshape(2, 2, 3)
# Flatten array
arr = np.array([[1, 2, 3], [4, 5, 6]])
flarr = arr.flatten()
Output:
Array created using passed list:
[[ 1. 2. 4.]
[ 5. 8. 7.]]
A random array:
[[ 0.46829566 0.67079389]
[ 0.09079849 0.95410464]]
Original array:
[[1 2 3 4]
[5 2 4 2]
[1 2 0 1]]
Reshaped array:
[[[1 2 3]
[4 5 2]]
[[4 2 1]
[2 0 1]]]
Original array:
[[1 2 3]
[4 5 6]]
Flattened array:
[1 2 3 4 5 6]
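Only part of Program 2 is legible above. The sketch below rebuilds the steps that the printed output implies (creation from a list, a random array, reshape and flatten). The array values are taken from the output; the random generator and its seed are assumptions, so the random numbers will differ from those shown.

```python
import numpy as np

# Array created from a nested list, stored as floats
a = np.array([[1, 2, 4], [5, 8, 7]], dtype=float)
print("Array created using passed list:\n", a)

# 2 x 2 array of random values in [0, 1); the seed is an arbitrary choice
rng = np.random.default_rng(0)
r = rng.random((2, 2))
print("A random array:\n", r)

# Reshape a 3 x 4 array into 2 x 2 x 3 (same 12 elements, new layout)
arr = np.array([[1, 2, 3, 4],
                [5, 2, 4, 2],
                [1, 2, 0, 1]])
newarr = arr.reshape(2, 2, 3)
print("Reshaped array:\n", newarr)

# Flatten a 2-D array into a 1-D copy
flarr = np.array([[1, 2, 3], [4, 5, 6]]).flatten()
print("Flattened array:\n", flarr)
```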
Program 3:
# Python program to demonstrate
# indexing in numpy
import numpy as np
# An exemplar array
arr = np.array([[-1, 2, 0, 4],
                [4, -0.5, 6, 0],
                [2.6, 0, 7, 8],
                [3, -7, 4, 2.0]])
# Slicing array
temp = arr[:2, ::2]
print ("Array with first 2 rows and alternate"
       "columns(0 and 2):\n", temp)
# Python program to demonstrate
# basic operations on single array
import numpy as np
a = np.array([1,
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:
[[1 3 9]
[2 4 6]
[3 5 0]]
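The print statements for the transpose step are missing above. A minimal sketch that reproduces the output shown, assuming the .T attribute was used:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])

# .T swaps rows and columns without copying the data
t = a.T

print("Original array:\n", a)
print("Transpose of array:\n", t)
```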
# Python program to demonstrate sorting in numpy
import numpy as np
a = np.array([[1, 4, 2],
              [3, 4, 6],
              [0, -1, 5]])
# sorted array
print ("Array elements in sorted order:\n", np.sort(a, axis = None))
# sort array row-wise
print ("Row-wise sorted array:\n",
np.sort(a, axis = 1))
# Creating array
arr = np.array(values, dtype = dtypes)
print ("\nArray sorted by names:\n",
np.sort(arr, order = 'name'))
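The definitions of values and dtypes are not legible in this copy. The sketch below shows one way such a structured array could be set up and sorted by a named field; the field names other than 'name' and all of the records are hypothetical.

```python
import numpy as np

# Hypothetical field layout and records (assumptions for illustration)
dtypes = [("name", "U10"), ("grad_year", int), ("cgpa", float)]
values = [("Hrithik", 2009, 8.5), ("Ajay", 2008, 8.7),
          ("Pankaj", 2008, 7.9), ("Aakash", 2009, 9.0)]

# Creating a structured array
arr = np.array(values, dtype=dtypes)

# Sort by a single field, then by two fields in priority order
by_name = np.sort(arr, order="name")
by_year_then_cgpa = np.sort(arr, order=["grad_year", "cgpa"])

print("Array sorted by names:\n", by_name)
print("Array sorted by graduation year and then cgpa:\n", by_year_then_cgpa)
```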
Result:
Thus the programs using NumPy were executed successfully.
Aim:
To Work with Pandas data frames using Jupyter Notebook
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was
created by Wes McKinney in 2008.
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Example1:
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Create Labels
import pandas as pd
a = [1, 7, 2]
print(myvar)
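The line that attaches the labels is missing from the listing above. A minimal sketch of creating a labelled Series follows; the label names "x", "y" and "z" are an assumption, since the original labels are not legible.

```python
import pandas as pd

a = [1, 7, 2]

# index= assigns a label to each value (label names are hypothetical)
myvar = pd.Series(a, index=["x", "y", "z"])

print(myvar)
print(myvar["y"])  # a value can then be looked up by its label
```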
What is a DataFrame?
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# load the dictionary into a DataFrame object
df = pd.DataFrame(data)
print(df)
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well known format that can be read by everyone, including
Pandas.
import pandas as pd
# raw string, so that the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'C:\Users\New\Desktop\AD8302\data.csv')
print(df.to_string())
Result:
Thus the programs using DataFrames were executed successfully.
Ex.No:4 Reading data from text files , Excel and the web and
exploring various commands for doing descriptive analysis
on the Iris data set.
Aim:
To read data from text files, Excel and the web and to explore various commands for
doing descriptive analysis on the Iris data set.
Iris Dataset
Iris Dataset is considered as the Hello World for data science. It contains five columns namely
– Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering
plant; researchers have measured various features of the different iris flowers and recorded
them digitally.
Output:
(150, 6)
The dataframe contains 6 columns and 150 rows
3. Describe(): The describe() function applies basic statistical computations on the dataset like
extreme values, count of data points, standard deviation, etc. Any missing value or NaN value
is automatically skipped. The describe() function gives a good picture of the distribution of data.
df.describe()
Output:
2. Checking Duplicates:
Pandas drop_duplicates() method helps in removing duplicates from the data frame
data = df.drop_duplicates(subset ="class")
Output:
3. Count:
Series.value_counts() function: this function returns a Series containing counts of
unique values.
df.value_counts("class")
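The two commands above can be tried without the CSV file. The sketch below uses a small hypothetical frame with a "class" column (the rows are made up, not taken from the Iris file) to show what drop_duplicates and value_counts return.

```python
import pandas as pd

# Hypothetical five-row stand-in for the Iris data
df = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica"],
})

# Keep only the first row seen for each class
first_per_class = df.drop_duplicates(subset="class")

# Number of rows belonging to each class
counts = df["class"].value_counts()

print(first_per_class)
print(counts)
```

On the full data set a balanced result (the same count for every class) would indicate that no class is under-represented.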
Output:
III. Data Visualization: We will use the Matplotlib and Seaborn libraries for data visualization.
Matplotlib is an easy to use visualization library in Python. It is built on NumPy
arrays and designed to work with the broader SciPy stack and consists of several plots like
line, bar, scatter, histogram, etc.
Seaborn is a library mostly used for statistical plotting in Python. It is built on top of
Matplotlib and provides beautiful default styles and color palettes to make statistical plots
more attractive
Output:
Hue: The hue parameter denotes which column decides the kind of color.
Legend(): A legend is an area describing the elements of the graph.
Bounding Box: bbox_to_anchor=[x0, y0] will create a bounding box with lower left corner at
position [x0, y0]. The legend will then be placed 'inside' this box and overlap it according to
the specified loc parameter.
Loc: The attribute loc in legend() is used to specify the location of the legend.
Default value of loc is loc="best" (upper left).
Example 1: Co
plt.show()
Output:
sns.scatterplot(x='petallength', y='petalwidth',
                hue='class', data=df, )
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
Output:
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['sepallength'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['sepalwidth'], bins=5);
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['petallength'], bins=6);
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['petalwidth'], bins=6);
Output:
Handling Correlation
Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the
dataframe. Any NA values are automatically excluded. Non-numeric columns are ignored in
older pandas releases; from pandas 2.0 onward, pass numeric_only=True to exclude them.
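The behaviour described above can be checked on a tiny frame. This is a sketch on hypothetical data; selecting the numeric columns explicitly with select_dtypes is one way to keep the call working across pandas versions, and is an assumption about how the exercise could be run rather than the original code.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls with x, label is non-numeric
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "label": ["a", "a", "b", "b", "c"],
})

# Restrict to numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes("number").corr()
print(corr)
```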
Output:
Box Plots
We can use boxplots to see how the categorical value is distributed with other numerical
values.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
def graph(y):
sns.boxplot(x ="class", y=y, data=df)
plt.figure(figsize =(10,10))
plt.subplot(222)
graph('sepalwidth')
plt.subplot(223)
graph('petallength')
plt.subplot(224)
graph('petalwidth')
plt.show()
Output:
Handling Outliers
An outlier is a data-item/object that deviates significantly from the rest of the (so-called
normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier
detection is referred to as outlier mining. There are many ways to detect the outliers, and the
removal process is the same as removing a data item from a pandas dataframe.
Let's consider the iris dataset and plot the boxplot for the sepalwidth column.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x='sepalwidth', data=df)
Output:
For removing the outlier, one must follow the same process of removing an entry from the
dataset using its exact position in the dataset, because in all the above methods of detecting the
outliers the end result is the list of all those data items that satisfy the outlier definition according
to the method used.
Example: We will detect the outliers using IQR and then we will remove them. We will also
draw the boxplot to see if the outliers are removed or not.
import pandas as pd
import seaborn as sns
import numpy as np
# Load the dataset
df = pd.read_csv('iris_csv.csv')
# IQR (method= needs NumPy 1.22 or later; older releases use interpolation=)
Q1 = np.percentile(df['sepalwidth'], 25,
method = 'midpoint')
Q3 = np.percentile(df['sepalwidth'], 75,
method = 'midpoint')
IQR = Q3 - Q1
# Upper bound
upper = np.where(df['sepalwidth'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['sepalwidth'] <= (Q1-1.5*IQR))
sns.boxplot(x='sepalwidth', data=df)
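The listing above locates the outliers but the removal step is not shown. The sketch below illustrates the whole IQR procedure on a small hypothetical column (the ten values are made up); it uses pandas quantiles and a boolean mask, which is a swapped-in alternative to dropping rows by position.

```python
import pandas as pd

# Hypothetical sepal widths with one high and one low outlier
df = pd.DataFrame({"sepalwidth": [3.0, 3.2, 2.9, 3.1, 3.4,
                                  2.8, 3.0, 3.3, 4.6, 1.5]})

# Interquartile range
q1 = df["sepalwidth"].quantile(0.25)
q3 = df["sepalwidth"].quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the rows that lie within the fences
cleaned = df[df["sepalwidth"].between(lower_fence, upper_fence)]

print("Rows before:", len(df), "Rows after:", len(cleaned))
```

Drawing the boxplot again on cleaned should then show no points beyond the whiskers for this sample.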
Output:
pandas.DataFrame() is used to create a DataFrame in pandas. There are two ways to use this
function. You can form a DataFrame column-wise by passing a dictionary into
the pandas.DataFrame() function. Here, each key is a column, while the values are the rows:
import pandas
DataFrame = pandas.DataFrame({"A" : [1, 3, 4], "B": [5, 9, 12]})
print(DataFrame)
A B
0 1 5
1 3 9
2 4 12
import pandas as pd
df = pd.read_csv("iris_csv.csv")
print(df)
We can also compute the central tendencies of each column in a DataFrame using
pandas:
DataFrame.mean()
df.median()
df.mode()
4. DataFrame.transform
pandas' DataFrame.transform() modifies the values of a DataFrame. It accepts a function as an
argument.
data = df.transform(lambda y: y*3)
print(data)
5. DataFrame.isnull
df.isnull().sum()
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
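To see transform() and isnull() at work without the Iris file, the following sketch applies them to a small hypothetical frame that contains one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column A has one missing entry
df = pd.DataFrame({"A": [1.0, 3.0, np.nan], "B": [5.0, 9.0, 12.0]})

# transform() applies the function to every column and keeps the shape
tripled = df.transform(lambda y: y * 3)

# isnull() marks missing cells; sum() counts them per column
missing = df.isnull().sum()

print(tripled)
print(missing)
```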
6. Dataframe.info
df.info()
df.describe()
7. DataFrame.loc
loc is used to find the elements at a particular index label. To view all items in the third row
(label 2 under the default index), for instance:
data=df.loc[2]
print(data)
df.min()
df.max()
9. DataFrame.astype
10. DataFrame.insert
The insert() function is used to add a new column to a DataFrame. It accepts three keywords:
loc (the position of the new column), column (its name) and value (the data to insert).
print(DataFrame)
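The code for astype and insert is not legible above. A minimal sketch of both calls on the small A/B frame used earlier follows; the new column name "C" and its values are hypothetical.

```python
import pandas as pd

DataFrame = pd.DataFrame({"A": [1, 3, 4], "B": [5, 9, 12]})

# astype() converts a column to another data type
DataFrame["A"] = DataFrame["A"].astype("float64")

# insert() adds column "C" at position 2, i.e. after the existing two columns
DataFrame.insert(loc=2, column="C", value=[3, 4, 6])

print(DataFrame)
print(DataFrame.dtypes)
```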
11. DataFrame.sum
The sum() function in pandas returns the sum of the values in each column
DataFrame.cumsum()
12. Correlation:
Want to find the correlation between integer or float columns? pandas can help you
achieve that using the corr() function.
DataFrame.corr()
13. DataFrame.add
The add() function is used to add a specific number to each value in a DataFrame or
column; the addition is applied element-wise.
DataFrame['A'].add(20)
Like the addition function, you can also subtract a number from each value in a
DataFrame or specific column:
DataFrame['A'].sub(10)
DataFrame['A'].mul(10)
Using the std() function, pandas also lets you compute the standard deviation for each
column in a DataFrame. It works by iterating through each column in a dataset and calculating
the standard deviation for each:
DataFrame.std()
18.DataFrame.melt
The melt() function in pandas flips the columns in a DataFrame to individual rows. It's
like exposing the anatomy of a DataFrame. So it lets you view the value assigned to each
column explicitly.
newDataFrame = DataFrame.melt()
print(newDataFrame)
19. DataFrame.pop
This function lets you remove a specified column from a pandas DataFrame. It accepts
an item keyword, returns the popped column, and separates it from the rest of the DataFrame:
print(DataFrame)
20. DataFrame.dropna
The dropna() method removes all rows containing null values:
print(DataFrame)
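The calls for melt, pop and dropna are only partly shown above. The sketch below runs all three on a hypothetical two-column frame with one missing value, so that the effect of each can be seen.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column A has one missing entry
DataFrame = pd.DataFrame({"A": [1, 3, np.nan], "B": [5, 9, 12]})

# melt(): one row per (column name, value) pair
melted = DataFrame.melt()

# pop(): removes column B from the frame and returns it as a Series
popped = DataFrame.pop("B")

# dropna(): returns a copy without the rows that contain nulls
cleaned = DataFrame.dropna()

print(melted)
print(popped)
print(cleaned)
```

Note that pop() changes the DataFrame in place, while melt() and dropna() return new objects.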
Result:
Aim :
To find Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Read Diabetes data set:
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
df.shape
OutPut
(768, 9)
df.dtypes
Output:
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
df['Outcome'] = df['Outcome'].astype('bool')
df.dtypes['Outcome']
Output:
dtype('bool')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
df.describe().T
Pregnancy Proportion:
import numpy as np
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)
preg = pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)
import seaborn as sns
import matplotlib.pyplot as plt
fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))
# x= is passed by keyword; recent seaborn releases no longer accept it positionally
plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
# y= gives a vertical box plot of a single column
plot20 = sns.boxplot(y=df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of Pregnancies in the data is unimodal and skewed to the right, centered at
about 1 with most of the data between 0 and 15, a range of roughly 15, and outliers
present on the higher end.
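The aim of this exercise lists frequency, mean, median, mode, variance and standard deviation, and the observation above describes right skew. The sketch below computes these univariate measures, together with skewness and kurtosis, for a hypothetical right-skewed sample; the ten values are made up and are not taken from the diabetes file.

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for df['Pregnancies']
preg = pd.Series([0, 1, 1, 1, 2, 2, 3, 4, 6, 10])

summary = {
    "frequency": preg.value_counts().to_dict(),  # count of each distinct value
    "mean": preg.mean(),
    "median": preg.median(),
    "mode": preg.mode().tolist(),
    "variance": preg.var(),       # sample variance (ddof=1)
    "std": preg.std(),            # sample standard deviation
    "skewness": preg.skew(),      # > 0 means a longer right tail
    "kurtosis": preg.kurt(),      # excess kurtosis
}

for name, value in summary.items():
    print(name, ":", value)
```

A mean above the median, together with positive skewness, matches the right-skewed shape described for the Pregnancies column.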
Glucose Variable
df.Glucose.describe()
Output:
count    768.000000
mean     120.894531
std       31.972618
min        0.000000
25%       99.000000
50%      117.000000
75%      140.250000
max      199.000000
Name: Glucose, dtype: float64
#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
# distplot is deprecated in recent seaborn; histplot(..., kde=True) is its replacement
plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
#axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
#axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
# y= gives a vertical box plot of a single column
plot10=sns.boxplot(y=df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of Glucose level among patients is unimodal and roughly bell shaped,
centered at about 115 with most of the data between 90 and 140, a range of roughly 150, and
outliers present on the lower end (Glucose == 0).
fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))  # assumed layout: one row, two panels
plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
#axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(y=df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
Blood Pressure variable
df.BloodPressure.describe()
count    768.000000
mean      69.105469
std       19.355807
min        0.000000
25%       62.000000
50%       72.000000
75%       80.000000
max      122.000000
Name: BloodPressure, dtype: float64
from matplotlib.ticker import FormatStrFormatter
# first panel rebuilt on the pattern of the Glucose plots (layout and labels assumed)
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
# y= gives a vertical box plot of a single column
plot10=sns.boxplot(y=df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()
Understanding Distribution
The distribution of BloodPressure among patients is unimodal (it is not bimodal,
because BP = 0 does not make any sense and is an outlier) and bell shaped, centered at about
65 with most of the data between 60 and 90, a range of roughly 100, and outliers present
on the lower end (BP == 0).
import os
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
os.chdir("C:/Users/Administrator/Desktop/DS")
df = pd.read_csv('diabetes.csv')
df.head()
# x= and y= are passed by keyword; recent seaborn releases require this
sns.scatterplot(x=df.DiabetesPedigreeFunction,y=df.Glucose)
plt.ylim(0,20000)
sns.scatterplot(x=df.BMI,y=df.Age)
plt.ylim(0,20000)
sns.scatterplot(x=df.BloodPressure,y=df.Glucose)
plt.ylim(0,20000)
plt.figure(figsize=(12,8))
sns.kdeplot(data=df,x=df.Glucose,hue=df.Outcome,fill=True)
df.isnull().values.any()
False
## Counting cells with 0 Values for each variable and publishing the counts below
(df.Pregnancies == 0).sum(),(df.Glucose==0).sum(),(df.BloodPressure==0).sum(),(df.SkinThickness==0).sum(),(df.Insulin==0).sum(),(df.BMI==0).sum(),(df.DiabetesPedigreeFunction==0).sum(),(df.Age==0).sum()
Output:
(111, 5, 35,
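Zeros in columns such as Glucose or BloodPressure cannot be real measurements, so they are usually treated as missing. The step that reduces the data to the smaller frame shown next is not legible here; the sketch below shows one common way to do it on a hypothetical four-row frame, and both the choice of columns and the use of dropna are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the layout of the diabetes data
df = pd.DataFrame({
    "Pregnancies":   [0, 2, 5, 1],
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
})

# Count the zero cells in every column
zero_counts = (df == 0).sum()

# Treat zeros as missing only where zero is physically impossible
# (zero pregnancies is a valid value, so that column is left alone)
cols = ["Glucose", "BloodPressure", "BMI"]
cleaned = df.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)
cleaned = cleaned.dropna()

print(zero_counts)
print(cleaned)
```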
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 392 non-null int64
1 Glucose 392 non-null int64
2 BloodPressure 392 non-null int64
Output
cor = df.corr()
sns.heatmap(cor)
Result:
Thus the univariate, bivariate and multivariate analyses were performed successfully.
Aim:
Apply and explore various plotting functions on UCI data sets.
To load and quickly visualize the Multiple Features Dataset [1] from the UCI repository, which
is available in mvlearn. This dataset can be a good tool for analyzing the effectiveness
of multiview algorithms. It contains 6 views of handwritten digit images, thus allowing for
analysis of multiview algorithms in multiclass or unsupervised tasks.
a. Normal curves
A probability distribution is a statistical function that describes the likelihood of obtaining the
possible values that a random variable can take. By this, we mean the range of values that a
parameter can take when we randomly pick values from it. Suppose we were asked to pick 1 adult
at random and say what his/her height would be (assuming gender does not affect height).
There is no way to know what the height will be. But if we have the distribution of heights of
adults in the city, we can bet on the most probable outcome. A Normal Distribution is also known
as a Gaussian distribution or, famously, the Bell Curve. People use these terms interchangeably, but
they mean the same thing. It is a continuous probability distribution.
Code:
import numpy as np
import matplotlib.pyplot as plt
# Creating a series of data in the range of 1-50.
x = np.linspace(1,50,200)
#Creating a Function.
def normal_dist(x, mean, sd):
    # Normal probability density with the given mean and standard deviation
    prob_density = 1/(sd*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-mean)/sd)**2)
    return prob_density
# Calculate mean and standard deviation.
mean = np.mean(x)
sd = np.std(x)
# Apply function to the data.
pdf = normal_dist(x,mean,sd)
plt.plot(x,pdf , color = 'red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
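A useful sanity check on a normal curve is that the area under it is 1 and that its peak sits at the mean. The sketch below checks both numerically for the standard normal density; the grid limits and the simple Riemann sum are choices made for illustration.

```python
import numpy as np

def normal_dist(x, mean, sd):
    """Normal probability density at x for the given mean and sd."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Evaluate the standard normal density on a fine grid from -5 to 5
x = np.linspace(-5, 5, 2001)
pdf = normal_dist(x, 0.0, 1.0)

# Approximate the area under the curve with a Riemann sum
dx = x[1] - x[0]
area = np.sum(pdf) * dx

print("Area under the curve:", round(area, 4))
print("Peak located at x =", x[np.argmax(pdf)])
```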
Contour plots are widely used to visualize density, altitudes or heights of mountains, as well
as in the meteorological department. Due to such wide usage, matplotlib.pyplot provides
a contour() method for drawing contour plots.
Code:
import matplotlib.pyplot as plt
import numpy as np
feature_y =
features
feature_y)
fig, ax = plt.subplots(1, 1)
Z = np.cos(X / 2) +
contour lines
ax.contour(X, Y, Z)
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
plt.show()
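Parts of the contour listing above are not legible. The following is a self-contained sketch of the same kind of plot; the feature ranges and the sine term in Z are assumptions chosen for illustration, not the original values.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical feature ranges
feature_x = np.arange(0, 50, 2)
feature_y = np.arange(0, 50, 3)

# Creating a 2-D grid of features
X, Y = np.meshgrid(feature_x, feature_y)

# Surface height at every grid point (the sine term is an assumption)
Z = np.cos(X / 2) + np.sin(Y / 4)

fig, ax = plt.subplots(1, 1)
ax.contour(X, Y, Z)  # plots contour lines
ax.set_title('Contour Plot')
ax.set_xlabel('feature_x')
ax.set_ylabel('feature_y')
```

In a notebook, calling plt.show() after these lines displays the figure; ax.contourf draws filled contours instead of lines.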
3. Zero Correlation (No Correlation): When two variables don't seem to be linked at all. A correlation
coefficient of 0 indicates no linear relationship. For example, the amount of tea you take and level of intelligence.
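The correlation coefficient ranges from -1 to +1. The sketch below uses small made-up arrays to show a perfect positive, a perfect negative and a zero correlation; np.corrcoef computes the Pearson coefficient.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)

pos = 2 * x + 1     # rises with x
neg = -3 * x + 5    # falls as x rises
sym = x ** 2        # depends on x, but not linearly

r_pos = np.corrcoef(x, pos)[0, 1]
r_neg = np.corrcoef(x, neg)[0, 1]
r_zero = np.corrcoef(x, sym)[0, 1]

print("Positive:", r_pos)
print("Negative:", r_neg)
print("Zero:", r_zero)
```

The last case shows that zero correlation only rules out a linear relationship: sym is fully determined by x, yet its coefficient is 0.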
Code:
import pandas as pd
import seaborn as sns
con = pd.read_csv('concrete.csv')
con
list(con.columns)
con.head()
con['cement'] = con['cement'].astype('category')
con.describe(include='category')
ax = sns.scatterplot(x="water", y="coarseagg", data=con)
ax.set_
d. Histograms:
A histogram is basically used to represent data provided in the form of groups. It is an accurate
method for the graphical representation of numerical data distribution. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives information about frequency.
Creating a Histogram
To create a histogram, the first step is to create bins of the ranges, then distribute the whole range
of the values into a series of intervals, and count the values which fall into each of the
intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables. The
matplotlib.pyplot.hist() function is used to compute and draw the histogram.
Code:
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
27])
# Creating histogram
plt.show()
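The binning and counting steps described above can be seen directly with np.histogram, which returns the counts that a histogram plot would draw as bars. The data set and bin edges below are hypothetical, since the original values are not legible.

```python
import numpy as np

# Hypothetical data set of 15 values
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# Consecutive, non-overlapping intervals: [0,25), [25,50), [50,75), [75,100]
bins = [0, 25, 50, 75, 100]

# counts[i] is the number of values that fall into the i-th interval
counts, edges = np.histogram(a, bins=bins)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:3.0f} to {right:3.0f}: {n}")
```

Passing the same data and bins to plt.hist(a, bins=bins) would draw these counts as bars.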
Code:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(23685752)
N_points = 10000
n_bins = 20
# Creating distribution
x = np.random.randn(N_points)
y = .8 ** x + np.random.randn(10000) +
# Show plot
plt.show()
Matplotlib was introduced with only two-dimensional plotting in mind. Around the
1.0 release, 3D utilities were built on top of the 2D ones, so 3D plotting of data is
available today. The 3D plots are enabled by importing the
mplot3d toolkit. In this section, we will deal with 3D plots using matplotlib.
Code:
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c = x + y
ax.scatter(x, y, z, c = c)
# syntax for plotting
plt.show()
Result:
Thus the various plots were executed and plotted successfully.
Aim:
To visualize geographic data with Basemap.
One common type of visualization in data science is that of geographic data. Matplotlib's main
tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib
toolkits which live under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky
to use, and often even simple visualizations take much longer to render than you might hope.
More modern solutions such as leaflet or the Google Maps API may be a better choice for more
intensive map visualizations. Still, Basemap is a useful tool for Python users to have in their
virtual toolbelts. In this section, we'll show several examples of the type of
map visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you're using conda you can type this and
the package will be downloaded:
conda install basemap
Code:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
widt
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
fontsize=12);
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines()
m.drawcoastlines(linewidth=1.0,
linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import shapefile as shp
from shapely.geometry import Point
fp = r'Map
map_df_copy = gpd.read_file(fp)
plt.plot(map_df , markersize=5)
Result :
Thus Basemap was installed and the program for geographic
visualization was executed successfully.