Laboratory Record
Name :
Reg. No. :
Laboratory :
Semester :
Year :
KATHIR COLLEGE OF ENGINEERING
“WISDOM TREE”, NEELAMBUR, COIMBATORE – 641 062.
Bonafide Certificate
This is to certify that the record work for …………. DATA EXPLORATION AND VISUALIZATION
B.E/B.Tech …………..…………………………………………………….. during ……. semester of
Ex.No.1 Installation of Python and the data analysis and visualization packages
Aim:
To install Python and the packages required for data analysis and visualization (NumPy, Pandas, Matplotlib).
Procedure:
Installation
Pandas-DataFrames
Data frames are the main tools when we are working with pandas.
Code:
import numpy as np
import pandas as pd
from numpy.random import randn
# seed the generator so the random values are reproducible
np.random.seed(50)
# 6x4 DataFrame of random values, row labels a-f, column labels w-z
df = pd.DataFrame(randn(6,4), ['a','b','c','d','e','f'], ['w','x','y','z'])
df
Output
w x y z
(six rows labelled a to f of random values; the exact numbers are fixed by the seed)
Pandas-Missing Data
We are going to see some convenient ways to deal with missing data in
pandas, where missing entries are represented as NaN.
import numpy as np
import pandas as pd
from numpy.random import randn
# dictionary with NaN entries in columns A and B
d = {'A': [1,2,np.nan], 'B': [9, np.nan, np.nan], 'C': [1,4,9]}
df = pd.DataFrame(d)
df
Output
A B C
0 1.0 9.0 1
1 2.0 NaN 4
2 NaN NaN 9
So we have three missing values in the DataFrame above.
df.dropna() # drops every row that contains at least one NaN
A B C
0 1.0 9.0 1
df.dropna(axis = 1) # drops every column that contains at least one NaN
C
0 1
1 4
2 9
df.dropna(thresh = 2) # keeps rows with at least 2 non-NaN values
A B C
0 1.0 9.0 1
1 2.0 NaN 4
df.fillna(value = df.mean()) # replaces each NaN with its column's mean
A B C
0 1.0 9.0 1
1 2.0 9.0 4
2 1.5 9.0 9
To read the number of rows and columns in our DataFrame or CSV file:
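A minimal sketch of this step (assuming the DataFrame df built in the missing-data example above):
df.shape # returns a (rows, columns) tuple, here (3, 3)
len(df) # number of rows, here 3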
.....
.....
Output
Result:
Thus the procedure to install Python and the data analysis and
visualization packages has been done and verified.
Ex.No.2 Perform exploratory data analysis (EDA) on Email data set
Aim:
To perform exploratory data analysis (EDA) on an email data set.
PROGRAM CODE:
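A minimal EDA sketch, assuming the email data sits in a hypothetical CSV file named emails.csv (both the file name and the steps are assumptions, not the original program):
import pandas as pd
# load the email data set (file name is an assumption)
df = pd.read_csv('emails.csv')
print(df.shape) # number of rows and columns
print(df.head()) # first five records
df.info() # column names, types, and non-null counts
print(df.isnull().sum()) # missing values per column
print(df.describe()) # summary statistics for numeric columns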
Result:
Thus the program to perform EDA on the email data set has been done and the
output has been verified.
Ex.No.3 Working with Numpy arrays, Pandas data frames, Basic plots using
Matplotlib
Aim:
To work with Numpy arrays, Pandas data frames, and basic plots using
Matplotlib.
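The code that produced the output below is not shown in the record; a minimal sketch consistent with that output:
import numpy as np
# a plain Python list and its NumPy equivalent
ls = [1, 2, 3, 4]
arr = np.array(ls)
print("List in python :", ls)
print("Numpy Array in python :", arr)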
Output:
List in python : [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]
Multi-Dimensional Array:
# creating lists
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# combining them into one multi-dimensional NumPy array
arr = np.array([list_1, list_2, list_3])
print("Numpy multi dimensional array in python\n", arr)
Output:
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Pandas data frames
import pandas as pd
# build a small DataFrame from a dictionary
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
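Note: the output below shows a larger health data set (Duration, Pulse, Maxpulse, Calories) rather than the small frame built above; it was presumably loaded from a CSV file, for example (the file name data.csv is an assumption):
df = pd.read_csv("data.csv")
print(df)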
Output
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Output:
Result:
Thus the program for working with Numpy arrays, Pandas data frames, and
basic plots using Matplotlib has been done and the output has been verified.
Ex.No.4. Explore various variable and row filters in R for cleaning data
Aim:
To explore various variable and row filters in R for cleaning data, and to apply
various plot features in R on sample data sets and visualize them.
For this, we will use an inbuilt dataset (the airquality dataset) which is
available in R.
head(airquality)
Output:
In the above dataset, we can clearly see the NA values inside the columns, which will
generate errors or produce inaccurate predictions for a machine learning
model.
Handling missing value in R
To handle the missing values we first check the columns of the dataset; if a
column contains missing data, computations on it return NA as the output,
which is not good for most models. So let's check each column
using the mean() method.
mean(airquality$Solar.R)
Output:
<NA>
Checking another column
mean(airquality$Ozone)
Output:
<NA>
Checking another column
Here we get the mean value of the Wind column, which means it doesn't have any missing
values in this column.
mean(airquality$Wind)
Output:
9.95751633986928
Handling NA values
Handling NA values using na.rm in both columns.
After checking the summary of the dataset, we found the number of NA values in two
columns (Ozone and Solar.R).
summary(airquality)
Output:
We can get a clear visual of the irregular data using a boxplot.
boxplot(airquality)
Output:
Building New_df, the cleaned copy of the dataset; one typical approach (an assumption, since the original code is missing) is to replace each NA with the column mean computed with na.rm = TRUE:
New_df <- airquality
New_df$Ozone[is.na(New_df$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)
New_df$Solar.R[is.na(New_df$Solar.R)] <- mean(airquality$Solar.R, na.rm = TRUE)
Now we can clearly see that we don't have any unclean data, using the summary()
method.
summary(New_df)
Output:
We can clearly see that we don't have any missing data inside the data frame.
head(New_df)
Output:
Now the boxplot also shows no outliers or irregular data.
boxplot(New_df)
Result:
Thus the program to explore various variable and row filters in R for
cleaning data and applying various plot features in R on sample data sets and
visualizing has been done and the output has been verified.
Ex.No.5 Perform Time Series Analysis and apply the various visualization
Techniques
Aim:
To perform time series analysis and apply various visualization techniques.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")
# display the first few rows
df.head()
Output:
# deleting the unwanted column (drop returns a new DataFrame unless inplace=True)
df.drop(columns='Unnamed: 0')
Output:
Plotting a simple line plot for time series data.
df['Volume'].plot()
Output:
# Resampling the time series data based on monthly 'M' frequency
df_month = df.resample("M").mean()
# using subplot
fig, ax = plt.subplots(figsize=(10, 6))
# plotting bar graph
ax.bar(df_month['2016':].index,
df_month.loc['2016':, "Volume"],
width=25, align='center')
# function to show the plot
plt.show()
Output:
There are 24 bars in the graph and each bar represents a month, covering two years of monthly averages.
Result:
Thus the program to perform Time Series Analysis and apply the various
visualization techniques has been done and the output has been verified.
Ex.No.6 Perform Data Analysis and representation on a Map
Aim:
To perform data analysis and representation on a map using Plotly.
PROGRAM:
import plotly.express as px
print("getting data")
df=px.data.carshare()
print(df.head(10))
print(df.tail(10))
fig=px.scatter_mapbox(df,
lon=df["centroid_lon"],
lat=df["centroid_lat"],
zoom=10,
color=df["peak_hour"],
size=df["car_hours"],
width=1200,
height=900,
title="CAR SHARE SCATTER MAP")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":50,"b":10})
fig.show()
OUTPUT:
Result:
Thus the program to perform data analysis and representation on a map has been
done and the output has been verified.
Ex.No.7 Build cartographic visualization for multiple datasets involving various
countries of the world; states and districts in India
Aim:
To build cartographic visualization for multiple datasets involving various
countries of the world, and states and districts in India.
PROCEDURE:
Here is a GeoJSON feature object for the boundary of the U.S. state of Colorado:
{
"type": "Feature",
"id": 8,
"properties": {"name": "Colorado"},
"geometry": {
"type": "Polygon",
"coordinates": [
[[-106.32056285448942,40.998675790862656],[-106.19134826714341,40.99813863734313],[-105.27607827344248,40.99813863734313],
[-104.9422739227986,40.99813863734313],[-104.05212898774828,41.00136155846029],[-103.57475287338661,41.00189871197981],
[-103.38093099236758,41.00189871197981],[-102.65589358559272,41.00189871197981],[-102.62000064466328,41.00189871197981],
[-102.052892177978,41.00189871197981],[-102.052892177978,40.74889940428302],[-102.052892177978,40.69733266640851],
[-102.052892177978,40.44003613055551],[-102.052892177978,40.3492571857556],[-102.052892177978,40.00333031918079],
[-102.04930288388505,39.57414465707943],[-102.04930288388505,39.56823596836465],[-102.0457135897921,39.1331416175485],
[-102.0457135897921,39.0466599009048],[-102.0457135897921,38.69751011321283],[-102.0457135897921,38.61478847120581],
[-102.0457135897921,38.268861604631],[-102.0457135897921,38.262415762396685],[-102.04212429569915,37.738153927339205],
[-102.04212429569915,37.64415206142214],[-102.04212429569915,37.38900413964724],[-102.04212429569915,36.99365914927603],
[-103.00046581851544,37.00010499151034],[-103.08660887674611,37.00010499151034],[-104.00905745863294,36.99580776335414],
[-105.15404227428235,36.995270609834606],[-105.2222388620483,36.995270609834606],[-105.7175614468747,36.99580776335414],
[-106.00829426840322,36.995270609834606],[-106.47490250048605,36.99365914927603],[-107.4224761410235,37.00010499151034],
[-107.48349414060355,37.00010499151034],[-108.38081766383978,36.99903068447129],[-109.04483707103458,36.99903068447129],
[-109.04483707103458,37.484617466122884],[-109.04124777694163,37.88049961001363],[-109.04124777694163,38.15283644441336],
[-109.05919424740635,38.49983761802722],[-109.05201565922046,39.36680339854235],[-109.05201565922046,39.49786885730673],
[-109.05201565922046,39.66062637372313],[-109.05201565922046,40.22248895514744],[-109.05201565922046,40.653823231326896],
[-109.05201565922046,41.000287251421234],[-107.91779872584989,41.00189871197981],[-107.3183866123281,41.00297301901887],
[-106.85895696843116,41.00189871197981],[-106.32056285448942,40.998675790862656]]
]
}
}
Let's load a TopoJSON file of world countries (at 110 meter resolution); the imports below are implied by the rest of the code:
import altair as alt
from vega_datasets import data
world = data.world_110m.url
world
world
'https://cdn.jsdelivr.net/npm/[email protected]/data/world-110m.json'
world_topo = data.world_110m()
world_topo.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
world_topo['type']
'Topology'
world_topo['objects'].keys()
dict_keys(['land', 'countries'])
As TopoJSON is a specialized format, we need to instruct Altair to parse the TopoJSON
format, indicating which named feature object we wish to extract from the topology.
The following code indicates that we want to extract GeoJSON features from
the world dataset for the countries object:
alt.topo_feature(world, 'countries')
This alt.topo_feature method call expands to the following Vega-Lite JSON:
{
"values": world,
"format": {"type": "topojson", "feature": "countries"}
To visualize geographic data, Altair provides the geoshape mark type. To create a basic
map, we can create a geoshape mark and pass it our TopoJSON data, which is then
unpacked into GeoJSON features, one for each country of the world:
alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape()
(iii) Point Maps
Altair includes special longitude and latitude encoding channels to handle geographic
coordinates. These channels indicate which data fields should be mapped
to longitude and latitude coordinates, and then applies a projection to map those
coordinates to (x, y) positions.
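(The zipcodes variable is not defined in the record; presumably it is loaded from the vega_datasets package, e.g.:)
zipcodes = data.zipcodes.url # CSV of U.S. zip codes with longitude/latitude columns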
alt.Chart(zipcodes).mark_square(
size=1, opacity=1
).encode(
longitude='longitude:Q', # apply the field named 'longitude' to the longitude channel
latitude='latitude:Q' # apply the field named 'latitude' to the latitude channel
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)
(iv) Symbol Maps
Let’s start by creating a base map using the albersUsa projection, and add a layer that
plots circle marks for each airport:
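(As before, the usa and airports variables are assumed to come from vega_datasets:)
usa = data.us_10m.url # TopoJSON of U.S. states and counties
airports = data.airports.url # CSV of U.S. airports with iata, latitude, longitude fields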
alt.layer(
alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
fill='#ddd', stroke='#fff', strokeWidth=1
),
alt.Chart(airports).mark_circle(size=9).encode(
latitude='latitude:Q',
longitude='longitude:Q',
tooltip='iata:N'
)
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)
To integrate our data sources, we will again need to use the lookup transform,
augmenting our TopoJSON-based geoshape data with unemployment rates. We can
then create a map that includes a color encoding for the looked-up rate field.
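(Assuming the unemployment rates also come from vega_datasets; the file has county id and rate columns:)
unemp = data.unemployment.url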
alt.Chart(alt.topo_feature(usa, 'counties')).mark_geoshape(
stroke='#aaa', strokeWidth=0.25
).transform_lookup(
lookup='id', from_=alt.LookupData(data=unemp, key='id', fields=['rate'])
).encode(
alt.Color('rate:Q',
scale=alt.Scale(domain=[0, 0.3], clamp=True),
legend=alt.Legend(format='%')),
alt.Tooltip('rate:Q', format='.0%')
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)
Result:
Thus the program to build cartographic visualization for multiple datasets
involving various countries of the world, and states and districts in India, has been
done and the output has been verified.
Ex.No.8 Perform EDA on Wine Quality Data Set
Aim:
To perform exploratory data analysis (EDA) on the Wine Quality data set.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()
OUTPUT:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
Find total number of rows and columns in the dataset using shape
df.shape
(1599, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
df.describe()
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide   density        pH  sulphates   alcohol  quality
count           1599              1599         1599            1599       1599                 1599                  1599      1599      1599       1599      1599     1599
mean        8.319637          0.527821     0.270976        2.538806   0.087467             15.87492              46.46779  0.996747  3.311113   0.658149  10.42298  5.63602
std         1.741096           0.17906     0.194801        1.409928   0.047065             10.46016              32.89532  0.001887  0.154386   0.169507  1.065668  0.80757
min              4.6              0.12            0             0.9      0.012                    1                     6   0.99007      2.74       0.33       8.4        3
25%              7.1              0.39         0.09             1.9       0.07                    7                    22    0.9956      3.21       0.55       9.5        5
50%              7.9              0.52         0.26             2.2      0.079                   14                    38   0.99675      3.31       0.62      10.2        6
75%              9.2              0.64         0.42             2.6       0.09                   21                    62  0.997835       3.4       0.73      11.1        6
max             15.9              1.58            1            15.5      0.611                   72                   289   1.00369      4.01          2      14.9        8
df.quality.unique()
array([5, 6, 7, 4, 8, 3])
This tells us the vote count of each quality score, in descending order. "quality" has most
of its values concentrated in the categories 5, 6 and 7; only a few observations fall in the
categories 3 and 8.
df.quality.value_counts()
5 681
6 638
7 199
4 53
8 18
3 10
Name: quality, dtype: int64
df['quality'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd6695f8240>
Result:
Thus the program to perform EDA on Wine Quality Data Set has been done and
the output has been verified.
Ex.No.9 Apply the various EDA on a dataset using a case study
Aim:
To apply various EDA and visualization techniques and present an analysis report
for a case study.
PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
#Load the data
df = pd.read_csv('titanic.csv')
#View the data
df.head()
#Basic information
df.info()
df.describe()
Describe the data - Descriptive statistics.
Duplicate values
You can use the df.duplicated().sum() function to count the duplicate values present,
if any, in the data.
df.duplicated().sum()
Output:
0
This means there is not a single duplicate value present in our dataset.
#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
array([3, 1, 2], dtype=int64)
array([0, 1], dtype=int64)
array(['male', 'female'], dtype=object)
You have to call the sns.countplot() function and specify the variable to plot the count
plot.
sns.countplot(df['Pclass'])
Find the Null values
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
The replace() function replaces all the null values with a specific value.
df.replace(np.nan, '0', inplace = True)
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
Know the datatypes
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age object
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Note that Age is now of type object, because its missing values were replaced with the string '0'.
# passengers travelling in first class
df[df['Pclass']==1].head()
# box plot of the Fare column to spot outliers
df[['Fare']].boxplot()
Correlation Plot - EDA
df.corr()
This is the correlation matrix, with values in the range +1 to -1, where +1 means highly
positively correlated and -1 means highly negatively correlated.
#Correlation plot
sns.heatmap(df.corr())
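A common variation (not part of the original program) is to annotate each cell with its correlation value:
sns.heatmap(df.corr(), annot=True)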
Result:
Thus the program for analyzing EDA for Titanic case study has been done and
the output has been verified.
CONTENT BEYOND SYLLABUS
Ex.No.1 Data visualization techniques using Seaborn with Python
Line Plot
Scatter Plot
Box plot
Point plot
Count plot
Violin plot
Swarm plot
Bar plot
KDE Plot
Line plot:
CODE:
# import module
import seaborn as sns
import pandas
# loading csv
data = pandas.read_csv("nba.csv")
# plotting lineplot
sns.lineplot(x=data['Age'], y=data['Weight'])
Output:
Scatter Plot:
CODE:
# import module
import seaborn
import pandas
# load csv
data = pandas.read_csv("nba.csv")
# plotting
seaborn.scatterplot(x=data['Age'], y=data['Weight'])
Output:
Box plot:
CODE:
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
sns.boxplot(data['Age'])
Violin Plot:
CODE:
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
sns.violinplot(data['Age'])
Output:
Swarm plot:
CODE:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.swarmplot(data['Age'])
Output:
Bar plot:
CODE:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x=data["Age"], y=data["Weight"])
Output:
Count plot:
CODE:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.countplot(data["Age"])
Output:
KDE Plot:
CODE:
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
sns.kdeplot(data['Age'])
Output:
Result:
Thus the program for data visualization techniques using seaborn with python has
been done and the output has been verified.
Ex.No.2 Numerical summaries in Statistics using python
Aim:
To compute numerical summaries in statistics using Python.
PROCEDURE:
Mean
The 'x-bar' is used to represent the sample mean (the mean of a sample of data). '∑'
(sigma) implies the addition of all values from 'i=1' up to 'i=n', where 'n' is the number of data
points; the mean is that sum divided by n.
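Python: np.mean([1,2,3,4,5,6]) (an illustrative list, matching the median example below). The result is 3.5, the sum 21 divided by n = 6.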
Median
The median is the middle value of the sorted data. When n is even, there is no "middle" data point, so the middle two values are averaged.
Python: np.median([1,2,3,4,5,6]) (n is even). The result is 3.5, the average between 3 and
4 (middle points).
Mode
The mode will return you the most commonly occurring data value.
Python: statistics.mode([1,2,2,2,3,3,4,5,6]) The result is 2.
Percentile
A percentile is often confused with a "percentage"; it defines the percent of observations in the data that fall at or below a given value.
Python:
from scipy import stats
x = [10, 12, 15, 17, 20, 25, 30]
## In what percentile lies the number 25?
stats.percentileofscore(x, 25)
# result: 85.7
Standard Deviation
Deviation: the idea is to use the mean as a reference point from which everything varies.
A deviation is defined as the distance an observation lies from the reference point; this
distance is obtained by subtracting the mean (x-bar) from the data point (xi). The
standard deviation is the square root of the average squared deviation.
Python: np.std(x)
Variance
Variance is almost the same calculation as the standard deviation, but it stays in
squared units. So, if you take the square root of the variance, you have the standard
deviation.
Python: np.var(x)
Range
The difference between the maximum and minimum values. Useful as a basic measure of spread.
Python: np.ptp(x)
Correlation
Defines the strength and direction of the association between two quantitative
variables. It ranges between -1 and 1. Positive correlations mean that one variable
increases as the other variable increases. Negative correlations mean that one variable
decreases as the other increases. When the correlation is zero, there is no linear
association at all.
Python: stats.pearsonr(x,y)
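For example, with illustrative data: stats.pearsonr([1, 2, 3, 4], [2, 4, 6, 8]) returns a correlation of 1.0, since the second list is exactly twice the first.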
Result:
Thus the program for finding numerical summaries in statistics has been done and
the output has been verified.