
KATHIR COLLEGE OF ENGINEERING

“WISDOM TREE”, Neelambur, Coimbatore – 641 062.

Department of Computer Science and Engineering

Laboratory Record

Name :

Reg. No. :

Laboratory :

Semester :

Year :
KATHIR COLLEGE OF ENGINEERING
“WISDOM TREE”, NEELAMBUR, COIMBATORE – 641 062.

Bonafide Certificate

This is to certify that this record work for …………. DATA EXPLORATION AND

VISUALIZATION LABORATORY is a bonafide record of work done by

Mr./Ms. ………………………………………………….. Register Number …………………… for the course

B.E/B.Tech …………..…………………………………………………….. during the ……. semester of the

academic year 2022-2023.

Faculty In-Charge HoD

Submitted for Practical Examination of Anna University, Chennai held on ………………

Internal Examiner External Examiner


INDEX

S.No   Date   Experiment                                                                    Page No   Marks   Signature

1             Install the data analysis and visualization tool: Python
2             Perform exploratory data analysis (EDA) on an email data set
3             Working with NumPy arrays, Pandas data frames, basic plots using Matplotlib
4             Explore various variable and row filters in R for cleaning data
5             Perform time series analysis and apply the various visualization techniques
6             Perform data analysis and representation on a map
7             Build cartographic visualization for multiple datasets
8             Perform EDA on the Wine Quality data set
9             Apply the various EDA techniques on a dataset using a case study

CONTENT BEYOND SYLLABUS

1             Data visualization with Python Seaborn
2             Numerical summaries in statistics using Python
EX.No.01 Install the data analysis and visualization tool: Python

Aim:

To install the data analysis and visualization tool: Python.

Procedure:

Installation

To install pandas, run the command below in your terminal:

pip install pandas

Or, if you have Anaconda, you can use:

conda install pandas
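
To verify the installation, import pandas and print its version; a minimal check:

import pandas as pd
# prints the installed pandas version string
print(pd.__version__)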

Pandas - DataFrames

Data frames are the main tools when we are working with pandas.
Code:
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(50)
df = pd.DataFrame(randn(6,4), ['a','b','c','d','e','f'],['w','x','y','z'])
df
Output
w x y z

a -1.560352 -0.030978 -0.620928 -1.464580

b 1.411946 -0.476732 -0.780469 1.070268

c -1.282293 -1.327479 0.126338 0.862194

d 0.696737 -0.334565 -0.997526 1.598908


e 3.314075 0.987770 0.123866 0.742785

f -0.393956 0.148116 -0.412234 -0.160715

Pandas - Missing Data

We are going to see some convenient ways to deal with missing data in
pandas, where missing entries are represented as NaN.
import numpy as np
import pandas as pd
from numpy.random import randn
d = {'A': [1,2,np.nan], 'B': [9, np.nan, np.nan], 'C': [1,4,9]}
df = pd.DataFrame(d)
df
Output
A B C

0 1.0 9.0 1

1 2.0 NaN 4

2 NaN NaN 9
So we have three missing values in the DataFrame above.
df.dropna()

A B C

0 1.0 9.0 1

df.dropna(axis = 1)

C
0 1
1 4
2 9

df.dropna(thresh = 2)

A B C

0 1.0 9.0 1

1 2.0 NaN 4

df.fillna(value = df.mean())

A B C

0 1.0 9.0 1

1 2.0 9.0 4

2 1.5 9.0 9

Pandas − Import Data

We are going to read a CSV file that is either stored on the local machine (as in
this case) or fetched directly from the web.

#import pandas library


import pandas as pd
#Read csv file and assigned it to dataframe variable
df = pd.read_csv("SYB61_T03_Population Growth Rates in Urban areas and
Capital cities.csv",encoding = "ISO-8859-1")
#Read the first five rows of the dataframe
df.head()
Output

To get the number of rows and columns in our dataframe or CSV file:

#Count the number of rows and columns in our dataframe.

df.shape
Output
(4166, 9)
Pandas − Dataframe Math

Operations on dataframes can be done using various pandas tools for statistics.

#To compute various summary statistics, excluding NaN values


df.describe()
Output

# computes numerical data ranks


df.rank()
Output

.....
.....

Pandas − plot graph


import matplotlib.pyplot as plt
years = [1981, 1991, 2001, 2011, 2016]
Average_populations = [716493000, 891910000, 1071374000, 1197658000,
1273986000]
plt.plot(years, Average_populations)
plt.title("Census of India: sample registration system")
plt.xlabel("Year")
plt.ylabel("Average_populations")
plt.show()
Output

Scatter plot of above data:


plt.scatter(years,Average_populations)
Histogram:
import matplotlib.pyplot as plt
Average_populations = [716493000, 891910000, 1071374000, 1197658000,
1273986000]
plt.hist(Average_populations, bins = 10)
plt.xlabel("Average_populations")
plt.ylabel("Frequency")
plt.show()

Output

Result:

Thus the procedure to install Python, the data analysis and visualization
tool, has been carried out and verified.
Ex.No.2 Perform exploratory data analysis (EDA) on Email data set

Aim:

To perform exploratory data analysis (EDA) on an email data set: export all
our emails as a dataset, import them into a pandas data frame, visualize them,
and get different insights from the data.

PROGRAM CODE:

#import required modules


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#read the dataset to be accessed
df=pd.read_csv(r"C:\\Users\sibis\Desktop\studies\2nd sem\dev lab\EMAIL.csv")
#head() - display the first five rows
print(df.head())
#dimensions of the table
print(df.shape)
#table info
print(df.info())
#table statistics
print(df.describe())
#unique values in a particular attribute
print(df.TO.unique())
#subject of mail
plt.plot(df["FROM"],df["SUBJECT"])
plt.show()
OUTPUT:
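
Since FROM and SUBJECT are both categorical fields, a line plot of one against the
other is hard to read; a bar chart of the most frequent senders is often more
informative. A minimal sketch, assuming the same EMAIL.csv with its FROM column:

#count emails per sender and plot the ten most frequent
df["FROM"].value_counts().head(10).plot(kind="barh")
plt.xlabel("Number of emails")
plt.show()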

Result:

Thus the program for EDA on the email dataset has been executed and the
output has been verified.
Ex.No.3 Working with NumPy arrays, Pandas data frames, basic plots using
Matplotlib

Aim:
To work with NumPy arrays, Pandas data frames, and basic plots using
Matplotlib.

Working with Numpy arrays

One Dimensional Array:

A one-dimensional array is a type of linear array.


# importing numpy module
import numpy as np
# creating a list (named list1 to avoid shadowing the built-in list)
list1 = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(list1)
print("List in python : ", list1)
print("Numpy Array in python :",
sample_array)

Output:
List in python : [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]

Multi-Dimensional Array:

# importing numpy module


import numpy as np

# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]

# creating numpy array


sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy multi dimensional array in python\n",
sample_array)

Output:
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Pandas data frames
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:


df = pd.DataFrame(data)
print(df)
Result
calories duration
0 420 50
1 380 40
2 390 45

Load Files Into a DataFrame


import pandas as pd
df = pd.read_csv('data.csv')
print(df)

Output
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]
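
Pandas and NumPy interoperate directly: a DataFrame column can be extracted as a
NumPy array. A minimal sketch, reusing the same data.csv as above (its output shows
a Pulse column):

import pandas as pd

df = pd.read_csv('data.csv')
# extract the Pulse column as a NumPy array and compute summary values
pulse = df['Pulse'].to_numpy()
print(pulse.mean(), pulse.max())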

Basic plots using Matplotlib

Creating a Simple Plot

# importing the required module


import matplotlib.pyplot as plt

# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()

Output:
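
For comparison, another basic Matplotlib plot type: a bar chart of the same points.
A minimal sketch:

import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [2, 4, 1]
# draw vertical bars instead of a connected line
plt.bar(x, y)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('A simple bar chart')
plt.show()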

Result:

Thus the programs for working with NumPy arrays, Pandas data frames, and
basic plots using Matplotlib have been executed and the output has been verified.
Ex.No.4. Explore various variable and row filters in R for cleaning data

Aim:
To explore various variable and row filters in R for cleaning data, and to apply
various plot features in R on sample data sets and visualize them.

Start the implementation of Data Cleaning in R

For this, we will use the inbuilt airquality dataset, which is available in R.

head(airquality)

Output:

In the above dataset, we can clearly see NA values inside some columns; these will
raise errors or produce inaccurate predictions in a machine learning model.
Handling missing value in R

To handle missing values, we first check the columns of the dataset: if a column
contains missing data, aggregate functions return NA as output, which is not good
for any model. Let's check using the mean() method.

mean(airquality$Solar.R)

Output:
<NA>
Checking another column

mean(airquality$Ozone)

Output:
<NA>
Checking another column: here we get the actual mean of the Wind column, which
means it does not have any missing values.

mean(airquality$Wind)

Output:
9.95751633986928
Handling NA values
Handling NA values using na.rm = TRUE, which tells mean() to ignore NA entries in
the computation.

mean(airquality$Solar.R, na.rm = TRUE)


Output:
185.931506849315
Data Cleaning Operation

After checking the summary of the dataset, we find a number of NA values in two
columns (Ozone and Solar.R).

summary(airquality)
Output:
We can get a clear visual of the irregular data using a boxplot.

boxplot(airquality)

Output:

Replacing the missing values with the column median using the is.na() method,
applied to both affected columns:

New_df = airquality
New_df$Ozone = ifelse(is.na(New_df$Ozone),
median(New_df$Ozone, na.rm = TRUE),
New_df$Ozone)
New_df$Solar.R = ifelse(is.na(New_df$Solar.R),
median(New_df$Solar.R, na.rm = TRUE),
New_df$Solar.R)

Output:
Using the summary() method, we can now clearly see that there is no unclean
data left.
summary(New_df)

Output:

We can clearly see that we don’t have any missing data inside data frame.

head(New_df)

Output:
The boxplot now also shows no irregularities in the data.

boxplot(New_df)

Result:

Thus the program to explore various variable and row filters in R for
cleaning data, and to apply various plot features in R on sample data sets and
visualize them, has been done and the output has been verified.
Ex.No.5 Perform Time Series Analysis and apply the various visualization
Techniques

Aim:

To perform time series analysis on a stock data set and apply the various
visualization techniques.

Program:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")

# displaying the first five rows of dataset


df.head()

Output:

# deleting a column (assign the result back or pass inplace=True to persist the change)
df.drop(columns='Unnamed: 0')

Output:
Plotting a simple line plot for time series data.
df['Volume'].plot()

Output:

Here, we have plotted the ‘Volume’ column data.


Now let’s plot all other columns using subplot.

df.plot(subplots=True, figsize=(10, 12))

Output:
# Resampling the time series data based on monthly 'M' frequency
df_month = df.resample("M").mean()
# using subplot
fig, ax = plt.subplots(figsize=(10, 6))
# plotting bar graph
ax.bar(df_month['2016':].index,
df_month.loc['2016':, "Volume"],
width=25, align='center')

Output:

There are 24 bars in the graph and each bar represents a month.
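
Another common time series technique is smoothing with a rolling window; a minimal
sketch reusing the 'Volume' column of the same df:

# 30-day rolling mean smooths out short-term fluctuations
df['Volume'].rolling(window=30).mean().plot(figsize=(10, 6))
plt.ylabel('Volume (30-day rolling mean)')
plt.show()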

Result:

Thus the program to perform Time Series Analysis and apply the various
visualization techniques has been done and the output has been verified.
Ex.No.6 Perform Data Analysis and representation on a Map

Aim:

To perform Data Analysis and representation on a Map using various Map


data sets with Mouse Rollover effect, user interaction, etc.
Program:
import plotly.express as px
import pandas as pd

print("getting data")

df=px.data.carshare()
print(df.head(10))
print(df.tail(10))

fig=px.scatter_mapbox(df,
lon=df["centroid_lon"],
lat=df["centroid_lat"],
zoom=10,
color=df["peak_hour"],
size=df["car_hours"],
width=1200,
height=900,
title="CAR SHARE SCATTER MAP")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":50,"b":10})
fig.show()
OUTPUT:
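
The mouse-rollover text can be enriched with the hover_name and hover_data options
of scatter_mapbox; a minimal sketch reusing the same carshare frame:

fig = px.scatter_mapbox(df,
lon="centroid_lon",
lat="centroid_lat",
color="peak_hour",
size="car_hours",
hover_name="peak_hour", # headline shown on rollover
hover_data={"car_hours": ":.2f"}, # extra field, formatted to two decimals
zoom=10)
fig.update_layout(mapbox_style="open-street-map")
fig.show()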

Result:

Thus the program to perform Data Analysis and representation on a Map


using various Map data sets with Mouse Rollover effect, user interaction, etc has
been done and the output has been verified.
Ex.No.7. To build cartographic visualization for multiple datasets

Aim:

Build cartographic visualization for multiple datasets involving various


countries of the world; states and districts in India etc.

PROCEDURE:

(i) Geographic Data: GeoJSON and TopoJSON

GeoJSON models geographic features within a specialized JSON format. A


GeoJSON feature can include geometric data – such as longitude, latitude coordinates
that make up a country boundary – as well as additional data attributes.

Here is a GeoJSON feature object for the boundary of the U.S. state of Colorado:

{
"type": "Feature",
"id": 8,
"properties": {"name": "Colorado"},
"geometry": {
"type": "Polygon",
"coordinates": [
[[-106.32056285448942,40.998675790862656],[-106.19134826714341,40.99813863734313],[-
105.27607827344248,40.99813863734313],[-104.9422739227986,40.99813863734313],[-
104.05212898774828,41.00136155846029],[-103.57475287338661,41.00189871197981],[-
103.38093099236758,41.00189871197981],[-102.65589358559272,41.00189871197981],[-
102.62000064466328,41.00189871197981],[-102.052892177978,41.00189871197981],[-
102.052892177978,40.74889940428302],[-102.052892177978,40.69733266640851],[-
102.052892177978,40.44003613055551],[-102.052892177978,40.3492571857556],[-
102.052892177978,40.00333031918079],[-102.04930288388505,39.57414465707943],[-
102.04930288388505,39.56823596836465],[-102.0457135897921,39.1331416175485],[-
102.0457135897921,39.0466599009048],[-102.0457135897921,38.69751011321283],[-
102.0457135897921,38.61478847120581],[-102.0457135897921,38.268861604631],[-
102.0457135897921,38.262415762396685],[-102.04212429569915,37.738153927339205],[-
102.04212429569915,37.64415206142214],[-102.04212429569915,37.38900413964724],[-
102.04212429569915,36.99365914927603],[-103.00046581851544,37.00010499151034],[-
103.08660887674611,37.00010499151034],[-104.00905745863294,36.99580776335414],[-
105.15404227428235,36.995270609834606],[-105.2222388620483,36.995270609834606],[-
105.7175614468747,36.99580776335414],[-106.00829426840322,36.995270609834606],[-
106.47490250048605,36.99365914927603],[-107.4224761410235,37.00010499151034],[-
107.48349414060355,37.00010499151034],[-108.38081766383978,36.99903068447129],[-
109.04483707103458,36.99903068447129],[-109.04483707103458,37.484617466122884],[-
109.04124777694163,37.88049961001363],[-109.04124777694163,38.15283644441336],[-
109.05919424740635,38.49983761802722],[-109.05201565922046,39.36680339854235],[-
109.05201565922046,39.49786885730673],[-109.05201565922046,39.66062637372313],[-
109.05201565922046,40.22248895514744],[-109.05201565922046,40.653823231326896],[-
109.05201565922046,41.000287251421234],[-107.91779872584989,41.00189871197981],[-
107.3183866123281,41.00297301901887],[-106.85895696843116,41.00189871197981],[-
106.32056285448942,40.998675790862656]]
]
}
}

Let’s load a TopoJSON file of world countries (at 110 meter resolution). This
assumes Altair and the vega_datasets package are imported:

import altair as alt
from vega_datasets import data

world = data.world_110m.url

world
'https://cdn.jsdelivr.net/npm/vega-datasets@…/data/world-110m.json'

world_topo = data.world_110m()

world_topo.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
world_topo['type']
'Topology'
world_topo['objects'].keys()
dict_keys(['land', 'countries'])
As TopoJSON is a specialized format, we need to instruct Altair to parse the TopoJSON
format, indicating which named feature object we wish to extract from the topology.
The following code indicates that we want to extract GeoJSON features from
the world dataset for the countries object:
alt.topo_feature(world, 'countries')
This alt.topo_feature method call expands to the following Vega-Lite JSON:
{
"values": world,
"format": {"type": "topojson", "feature": "countries"}
}
(ii) Geoshape Marks

To visualize geographic data, Altair provides the geoshape mark type. To create a basic
map, we can create a geoshape mark and pass it our TopoJSON data, which is then
unpacked into GeoJSON features, one for each country of the world:

alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape()
(iii) Point Maps

Altair includes special longitude and latitude encoding channels to handle geographic
coordinates. These channels indicate which data fields should be mapped
to longitude and latitude coordinates, and then applies a projection to map those
coordinates to (x, y) positions.
zipcodes = data.zipcodes.url  # assuming the vega_datasets zipcodes table

alt.Chart(zipcodes).mark_square(
size=1, opacity=1
).encode(
longitude='longitude:Q', # apply the field named 'longitude' to the longitude channel
latitude='latitude:Q' # apply the field named 'latitude' to the latitude channel
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)
(iv) Symbol Maps

Let’s start by creating a base map using the albersUsa projection, and add a layer that
plots circle marks for each airport (assuming the us_10m and airports tables from
vega_datasets):

usa = data.us_10m.url
airports = data.airports.url

alt.layer(
alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
fill='#ddd', stroke='#fff', strokeWidth=1
),
alt.Chart(airports).mark_circle(size=9).encode(
latitude='latitude:Q',
longitude='longitude:Q',
tooltip='iata:N'
)
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)

(v) Choropleth Maps

To integrate our data sources, we will again need to use the lookup transform,
augmenting our TopoJSON-based geoshape data with unemployment rates. We can
then create a map that includes a color encoding for the looked-up rate field
(assuming the unemployment table from vega_datasets):

unemp = data.unemployment.url

alt.Chart(alt.topo_feature(usa, 'counties')).mark_geoshape(
stroke='#aaa', strokeWidth=0.25
).transform_lookup(
lookup='id', from_=alt.LookupData(data=unemp, key='id', fields=['rate'])
).encode(
alt.Color('rate:Q',
scale=alt.Scale(domain=[0, 0.3], clamp=True),
legend=alt.Legend(format='%')),
alt.Tooltip('rate:Q', format='.0%')
).project(
type='albersUsa'
).properties(
width=900,
height=500
).configure_view(
stroke=None
)

Result:
Thus the program to build cartographic visualizations for multiple datasets
involving various countries of the world, and states and districts in India, has been
done and the output has been verified.
Ex.No.8 Perform EDA on Wine Quality Data Set

Aim:

To perform EDA on Wine Quality Data Set using python.

PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

OUTPUT:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
Find total number of rows and columns in the dataset using shape

df.shape

(1599, 12)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

df.describe()

Summary statistics (transposed here for readability):

                      count       mean        std       min      25%       50%        75%        max
fixed acidity          1599   8.319637   1.741096   4.60000   7.1000   7.90000   9.200000   15.90000
volatile acidity       1599   0.527821   0.179060   0.12000   0.3900   0.52000   0.640000    1.58000
citric acid            1599   0.270976   0.194801   0.00000   0.0900   0.26000   0.420000    1.00000
residual sugar         1599   2.538806   1.409928   0.90000   1.9000   2.20000   2.600000   15.50000
chlorides              1599   0.087467   0.047065   0.01200   0.0700   0.07900   0.090000    0.61100
free sulfur dioxide    1599  15.874922  10.460157   1.00000   7.0000  14.00000  21.000000   72.00000
total sulfur dioxide   1599  46.467792  32.895324   6.00000  22.0000  38.00000  62.000000  289.00000
density                1599   0.996747   0.001887   0.99007   0.9956   0.99675   0.997835    1.00369
pH                     1599   3.311113   0.154386   2.74000   3.2100   3.31000   3.400000    4.01000
sulphates              1599   0.658149   0.169507   0.33000   0.5500   0.62000   0.730000    2.00000
alcohol                1599  10.422983   1.065668   8.40000   9.5000  10.20000  11.100000   14.90000
quality                1599   5.636023   0.807569   3.00000   5.0000   6.00000   6.000000    8.00000
df.quality.unique()

array([5, 6, 7, 4, 8, 3])
The value_counts() call below gives the count of each quality score in descending
order. 'quality' has most values concentrated in categories 5, 6 and 7; only a few
observations were made for categories 3 and 8.

df.quality.value_counts()

5 681
6 638
7 199
4 53
8 18
3 10
Name: quality, dtype: int64

df['quality'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7fd6695f8240>
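
A further common EDA step is to inspect pairwise correlations between the features;
a minimal sketch using the libraries already imported above:

# correlation matrix of all numeric columns, drawn as a heatmap
corr = df.corr()
sns.heatmap(corr, cmap='coolwarm')
plt.title('Correlation between wine attributes')
plt.show()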

Result:

Thus the program to perform EDA on Wine Quality Data Set has been done and
the output has been verified.
Ex.No.9 Apply the various EDA on a dataset using a case study

Aim:
To apply various EDA and visualization techniques and present an analysis report
for a case study.
PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
#Load the data
df = pd.read_csv('titanic.csv')
#View the data
df.head()

#Basic information

df.info()

#Describe the data

df.describe()
 describe() gives the descriptive statistics of the data.

Duplicate values

You can use the df.duplicated().sum() function to get the number of duplicate rows
present in the data, if any.

df.duplicated().sum()

Output:
0
This means there is not a single duplicate row present in our dataset.

Unique values in the data

#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
array([3, 1, 2], dtype=int64)
array([0, 1], dtype=int64)
array(['male', 'female'], dtype=object)

Visualize the Unique counts

You have to call the sns.countplot() function and specify the variable to plot the
count plot.

#Plot the unique values

sns.countplot(x='Pclass', data=df)
Find the Null values

df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2

dtype: int64

Replace the Null values

The replace() function replaces all the null values with a specific value; here the
string '0' is used, which converts numeric columns such as Age to the object dtype
(visible in the dtypes listing below).

#Replace null values

df.replace(np.nan,'0',inplace = True)

#Check the changes now


df.isnull().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0

dtype: int64
Know the datatypes

df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age object
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object

dtype: object

Filter the Data

df[df['Pclass']==1].head()

A quick box plot

df[['Fare']].boxplot()
Correlation Plot - EDA

df.corr()

This produces the correlation matrix, with values ranging from +1 to -1, where +1
means highly positively correlated and -1 means highly negatively correlated.

#Correlation plot

sns.heatmap(df.corr())
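
A short extension often used in this case study is comparing survival rates across
groups; a minimal sketch assuming the same Titanic columns:

#mean of the 0/1 Survived flag per group gives the survival rate
print(df.groupby('Pclass')['Survived'].mean())
print(df.groupby('Sex')['Survived'].mean())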

Result:

Thus the program applying EDA to the Titanic case study has been done and
the output has been verified.
CONTENT BEYOND SYLLABUS

Ex.No.1 Data Visualization with Python Seaborn

Seaborn is an amazing visualization library for statistical graphics plotting in Python.
It is built on top of the matplotlib library and closely integrated with the data
structures from pandas.
Installation
For python environment :

pip install seaborn

For conda environment :

conda install seaborn

Seaborn: statistical data visualization

These are the plots that will help to visualize the data:

 Line Plot
 Scatter Plot
 Box plot
 Point plot
 Count plot
 Violin plot
 Swarm plot
 Bar plot
 KDE Plot

Line plot:

CODE:
# import module
import seaborn as sns
import pandas

# loading csv
data = pandas.read_csv("nba.csv")

# plotting lineplot
sns.lineplot(x=data['Age'], y=data['Weight'])
Output:

Scatter Plot:

CODE:

# import module
import seaborn
import pandas

# load csv
data = pandas.read_csv("nba.csv")

# plotting
seaborn.scatterplot(x=data['Age'], y=data['Weight'])

Output:
Box plot:
CODE:
# import module
import seaborn as sns
import pandas

# read csv and plotting


data = pandas.read_csv( "nba.csv" )
sns.boxplot(x=data['Age'])
Output:

Violin Plot:
CODE:
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
sns.violinplot(x=data['Age'])
Output:
Swarm plot:

CODE:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot


data = pandas.read_csv( "nba.csv" )
seaborn.swarmplot(x = data["Age"])

Output:

Bar plot:
CODE:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot


data = pandas.read_csv("nba.csv")
seaborn.barplot(x =data["Age"])
Output:
Point plot:

CODE:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot


data = pandas.read_csv("nba.csv")
seaborn.pointplot(x = "Age", y = "Weight", data = data)

Output:

Count plot:
CODE:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.countplot(x=data["Age"])
Output:
KDE Plot:
CODE:
# import module
import seaborn as sns
import pandas

# read top 5 column


data = pandas.read_csv("nba.csv").head()

sns.kdeplot(x=data['Age'], y=data['Number'])

Output:

Result:

Thus the program for data visualization techniques using seaborn with python has
been done and the output has been verified.
Ex.No.2 Numerical summaries in Statistics using python

Aim:

To compute the following numerical summaries using Python:


1. Mean
2. Median
3. Mode
4. Percentile
5. Quartiles (five-number summary)
6. Standard Deviation
7. Variance
8. Range
9. Proportion
10. Correlation

PROCEDURE:

Mean

The symbol x̄ ('x-bar') represents the sample mean (the mean of a sample of data).
The sigma '∑' denotes the addition of all values from i = 1 up to i = n, where n is the
number of data values; the sum is then divided by n:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Python: np.mean([1,2,3,4,5]). The result is 3.


Median

When n is odd, the middle data point of the sorted values is the median:

$\tilde{x} = x_{(n+1)/2}$

When n is even, there is no single middle data point, so the middle two values are
averaged:

$\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$

Python: np.median([1,2,3,4,5,6]) (n is even). The result is 3.5, the average of the two
middle points, 3 and 4.

Mode

The mode will return you the most commonly occurring data value.
Python: statistics.mode([1,2,2,2,3,3,4,5,6]) The result is 2.

Percentile
Python:

from scipy import stats

x = [10, 12, 15, 17, 20, 25, 30]

## In what percentile lies the number 25?
stats.percentileofscore(x, 25)

# result: 85.7

Quartiles (five-number summary)


Five-number summary is composed of:
1. Minimum
2. 25th percentile (lower quartile)
3. 50th percentile (median)
4. 75th percentile (upper quartile)
5. 100th percentile (maximum)
Python:

import numpy as np

x = [10, 12, 15, 17, 20, 25, 30]
x_min = np.min(x)
q1 = np.quantile(x, .25)
median = np.median(x)
q3 = np.quantile(x, .75)
x_max = np.max(x)

print(x_min, q1, median, q3, x_max)

Standard Deviation
Deviation: the idea is to use the mean as a reference point from which everything
varies. A deviation is the distance an observation lies from that reference point,
obtained by subtracting the mean (x̄) from the data point (xᵢ).

The sample standard deviation is the square root of the averaged squared deviations:

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Python: np.std(x) (note: NumPy divides by n by default, the population formula; pass
ddof=1 for the sample formula above).

Variance

Variance is almost the same calculation as the standard deviation, but it stays in
squared units, so taking the square root of the variance gives the standard deviation:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Python: np.var(x) (again, pass ddof=1 for the sample variance).

Range

The difference between the maximum and minimum values. Useful for some basic
exploratory analysis, but not as informative as the standard deviation:

$\text{range} = \max(x) - \min(x)$

Python: np.max(x) - np.min(x)


Proportion

It's often referred to as a "percentage". A proportion is the fraction of observations
in the data set that satisfy some condition, as the sketch below shows.
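
In Python, the mean of a boolean condition gives the proportion directly; a minimal
sketch:

import numpy as np

x = np.array([10, 12, 15, 17, 20, 25, 30])
# proportion of observations greater than 15 (4 of the 7 values)
print(np.mean(x > 15)) # 0.5714...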

Correlation

Defines the strength and direction of the association between two quantitative
variables. It ranges between -1 and 1. Positive correlations mean that one variable
increases as the other increases; negative correlations mean that one variable
decreases as the other increases. When the correlation is zero, there is no linear
association at all.

The Pearson correlation coefficient is

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

Python: stats.pearsonr(x, y)

Result:

Thus the program for finding numerical summaries in statistics has been done and
the output has been verified.
