
Aids Lab

The document is a lab manual for an Artificial Intelligence and Data Science course at Anna University, detailing various exercises related to data analysis, visualization, and cleaning using Python libraries such as Pandas and Matplotlib. It includes installation procedures, exploratory data analysis, working with NumPy arrays and Pandas data frames, creating plots, and data cleaning techniques. Each exercise concludes with a result statement confirming successful completion of the tasks.

AD3301 dev lab - dev lab manual

Artificial intelligence and data science (Anna University)


EX.NO:1 INSTALLATION OF DATA ANALYSIS AND VISUALIZATION TOOL

AIM :

To install the data analysis and visualization tools (Python, pandas, NumPy, and Matplotlib).

INSTALLATION PROCEDURE:

Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages, and it
makes importing and analyzing data much easier.

Step 1: Open a terminal window and update the package index so that the latest package
versions are available:

$ sudo apt update

Step 2: Install Python 3 and pip:

$ sudo apt install python3 python3-pip

Step 3: Verify the installation by checking the installed version:

$ python3 --version

Step 4: Install pandas and NumPy:

$ pip install pandas

$ pip install numpy

Step 5: Install the visualization tool:

$ pip install matplotlib
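
Once the steps above are complete, the installation can be checked from Python itself. The following is a minimal sketch, not part of the original procedure: the helper name check_packages is hypothetical, and it only reports whether each package can be found by the interpreter.

```python
import importlib.util

# Packages installed in Steps 4 and 5
REQUIRED_PACKAGES = ["pandas", "numpy", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED_PACKAGES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If a package is reported as missing, repeat the matching pip install command from Step 4 or Step 5.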

RESULT:

Thus, the data analysis and visualization tools were installed successfully.

EX.NO:2 EXPLORATORY DATA ANALYSIS

AIM:

To perform exploratory data analysis (EDA) on datasets such as an email data set: export all
your emails as a dataset, import them into a pandas data frame, visualize them, and get
different insights from the data.

ALGORITHM:
Step 1: Import necessary libraries
Step 2: Create a simple example dataset
Step 3: Display the dataset
Step 4: Check for missing values
Step 5: Visualize the distribution of email lengths, the count of emails per sender, and the most common words
Step 6: Stop

PROGRAM:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Create a simple example dataset (replace this with your actual dataset)
data = {'sender': ['sender1', 'sender2', 'sender1', 'sender3'],
        'content': ['Hello, how are you?', 'Meeting tomorrow?',
                    'Important update', 'Check this out!']}
email_data = pd.DataFrame(data)

# Display the dataset
print(email_data)

# Check for missing values
print(email_data.isnull().sum())

# Visualize the distribution of email lengths
email_data['email_length'] = email_data['content'].apply(len)
plt.figure(figsize=(10, 6))
plt.hist(email_data['email_length'], bins=30, edgecolor='black')
plt.title('Distribution of Email Lengths')
plt.xlabel('Email Length')
plt.ylabel('Frequency')
plt.show()

# Visualize the count of emails per sender
plt.figure(figsize=(8, 5))
email_data['sender'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Count of Emails per Sender')
plt.xlabel('Sender')
plt.ylabel('Count')
plt.show()

# Visualize the most common words in emails
all_emails = ' '.join(email_data['content'])
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(all_emails)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Email Content')
plt.show()
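
The word cloud shows common words visually. When exact counts are wanted, a plain frequency table can be built with the standard library. This is a minimal sketch that reuses the example messages from this exercise; the helper name word_counts and the simple letters-only tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

# Example messages from this exercise
contents = ['Hello, how are you?', 'Meeting tomorrow?',
            'Important update', 'Check this out!']


def word_counts(texts):
    """Count lowercase words across all texts, ignoring punctuation."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(words)


counts = word_counts(contents)

# Show the five most frequent words with their counts
print(counts.most_common(5))
```

With a real mailbox export, the same function can be applied to the 'content' column of the data frame.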

OUTPUT:

RESULT:

Thus, exploratory data analysis (EDA) on an email data set was performed successfully.

EX.NO:3.1 NUMPY ARRAYS

AIM:
To work with NumPy arrays, Pandas data frames, and create basic plots using
Matplotlib.

PROGRAM
import numpy as np

# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8, 9, 10])

# Basic operations on arrays (element-wise)
sum_array = arr1 + arr2
product_array = arr1 * arr2
mean_value = np.mean(arr1)

# Displaying array contents and results
print("Array 1:", arr1)
print("Array 2:", arr2)
print("Sum of arrays:", sum_array)
print("Product of arrays:", product_array)
print("Mean of Array 1:", mean_value)

OUTPUT

Array 1: [1 2 3 4 5]
Array 2: [ 6  7  8  9 10]
Sum of arrays: [ 7  9 11 13 15]
Product of arrays: [ 6 14 24 36 50]
Mean of Array 1: 3.0

RESULT:

Thus, working with NumPy arrays, Pandas data frames, and basic plots using Matplotlib
was carried out successfully.

Ex.No:3.2 PANDAS DATAFRAMES

AIM:
To work with Pandas data frames.

PROGRAM

import pandas as pd

# Creating a Pandas DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 22, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']}
df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

# Basic DataFrame operations
average_age = df['Age'].mean()
youngest_person = df[df['Age'] == df['Age'].min()]

# Displaying results of operations
print("\nAverage Age:", average_age)
print("\nYoungest Person:")
print(youngest_person)

OUTPUT:
Original DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   35        Chicago
4    Emily   28         Boston

Average Age: 28.0

Youngest Person:
      Name  Age         City
2  Charlie   22  Los Angeles

RESULT:
Thus, working with Pandas data frames was carried out successfully.

EX.NO:3.3 BASIC PLOTS USING MATPLOTLIB

AIM:
To create basic plots using Matplotlib.

PROGRAM
import matplotlib.pyplot as plt

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Line plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Line Plot', marker='o', linestyle='-', color='blue')
plt.title('Simple Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, label='Scatter Plot', color='red', marker='x')
plt.title('Simple Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()

OUTPUT:
1. Line Plot:

2. Scatter Plot:

RESULT:
Thus, basic plots were created using Matplotlib successfully.

Ex.No:3.4 CUSTOMIZING PLOT

AIM:
To customize plots using Matplotlib.

PROGRAM

import numpy as np
import matplotlib.pyplot as plt

# Generate data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the data with custom styles
ax.plot(x, y1, label='Sine Wave', color='blue', linestyle='--', linewidth=2)
ax.plot(x, y2, label='Cosine Wave', color='red', linestyle='-', linewidth=2)

# Customize axes labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Customized Sine and Cosine Waves')

# Add a legend
ax.legend()

# Add grid
ax.grid(True, linestyle='--', alpha=0.7)

# Show the plot
plt.show()

OUTPUT:

RESULT:
Thus, customizing plots using Matplotlib was carried out successfully.

EX.NO:4 DATA CLEANING USING R

AIM:
To explore various variable and row filtering techniques in R for cleaning data, and then
apply different plot features on a sample dataset.

PROGRAM
import pandas as pd

# Load a sample dataset (e.g., Iris dataset)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

df = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print("Sample Dataset:")
print(df.head())

OUTPUT:
Sample Dataset:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

RESULT:
Thus, various variable and row filtering techniques in R for cleaning data were explored,
and different plot features were applied on a sample dataset successfully.

EXP 4.1 EXPLORING DATA

AIM:
To explore and understand the underlying patterns, trends, and relationships within the
dataset.

PROGRAM
import pandas as pd

# Load a sample dataset (e.g., Iris dataset)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(url, header=None, names=column_names)

# Display basic information about the dataset
# (df.info() prints its report and returns None, so "None" is printed after it)
print("Dataset Information:")
print(df.info())

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Display the first few rows of the dataset
print("\nFirst Few Rows of the Dataset:")
print(df.head())

# Display unique classes in the 'class' column
print("\nUnique Classes:")
print(df['class'].unique())

OUTPUT:

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

Summary Statistics:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

First Few Rows of the Dataset:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Unique Classes:
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

RESULT:
Thus, the underlying patterns, trends, and relationships within the dataset were explored
and understood successfully.

EX.NO:4.2 VARIABLE FILTERS

AIM:
To apply variable filters in order to refine data based on specific conditions or criteria.

PROGRAM
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Filter specific variables (columns)
selected_columns = ['Name', 'City']
filtered_df = df[selected_columns]

# Display the filtered DataFrame
print("\nFiltered DataFrame:")
print(filtered_df)

OUTPUT:

Original DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles

Filtered DataFrame:
      Name           City
0    Alice       New York
1      Bob  San Francisco
2  Charlie    Los Angeles

RESULT:
Filtered data will display only the values or records that meet the defined criteria, improving
data analysis and decision-making.

EX.NO:4.3 ROW FILTERS


Aim:
To apply a row filter to restrict data visibility to only specific rows based on predefined
conditions.

PROGRAM

import pandas as pd

# Sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [20, 22, 21, 23],
        'Grade': [85, 92, 78, 95]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

OUTPUT:
Original DataFrame:
      Name  Age  Grade
0    Alice   20     85
1      Bob   22     92
2  Charlie   21     78
3    David   23     95
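
The program above builds and prints the DataFrame; the row filter itself can be sketched with a boolean mask, as shown here. The thresholds (a grade of at least 90, an age below 23) and the variable names high_grades and young_high_grades are assumptions chosen only for illustration.

```python
import pandas as pd

# Same sample data as in this exercise
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [20, 22, 21, 23],
        'Grade': [85, 92, 78, 95]}
df = pd.DataFrame(data)

# Keep only the rows where Grade is at least 90
high_grades = df[df['Grade'] >= 90]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_high_grades = df[(df['Age'] < 23) & (df['Grade'] >= 90)]

print("Rows with Grade >= 90:")
print(high_grades)
print("\nRows with Age < 23 and Grade >= 90:")
print(young_high_grades)
```

The mask df['Grade'] >= 90 is a column of True/False values, and indexing the DataFrame with it keeps only the rows marked True.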


Result:

The dataset will display only the rows that meet the filter criteria, improving focus and analysis
efficiency.

EX.NO:4.4 DATA CLEANING

Aim:
To identify and correct errors, inconsistencies, and inaccuracies in the dataset to ensure its quality
and reliability.

PROGRAM

import pandas as pd

# Sample data with a missing value and a repeated name
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
        'Age': [20, None, 21, 23, 22],
        'Grade': [85, 92, 78, 95, 92]}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

OUTPUT:
Original DataFrame:
      Name   Age  Grade
0    Alice  20.0     85
1      Bob   NaN     92
2  Charlie  21.0     78
3    David  23.0     95
4    Alice  22.0     92

In this example, we have a DataFrame with a missing value in the 'Age' column and a repeated
entry for the name 'Alice'.
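
The cleaning steps themselves are not part of the program above. One possible way to handle the missing age and the repeated name is sketched here; filling with the mean age and keeping the first row per name are assumptions chosen for illustration, and other strategies may suit real data better.

```python
import pandas as pd

# Same sample data as in this exercise
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
        'Age': [20, None, 21, 23, 22],
        'Grade': [85, 92, 78, 95, 92]}
df = pd.DataFrame(data)

# Count missing values in each column
missing_counts = df.isnull().sum()

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill the missing age with the mean of the known ages
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Keep only the first row for each repeated name
deduplicated = filled.drop_duplicates(subset='Name', keep='first')

print(missing_counts)
print(dropped)
print(deduplicated)
```

Dropping rows loses Bob's grade, while filling keeps the row at the cost of an estimated age, so the choice depends on how the column will be used.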


RESULT:
Thus, the various variable and row filtering techniques in R for cleaning data were explored,
and different plot features were applied on a sample dataset successfully.

EX.NO:5 TIME SERIES
AIM:
To perform time series analysis and apply various visualization techniques.
DEFINITION
Time series analysis involves analyzing and modeling data collected over time to identify patterns
and trends and to make predictions. This exercise demonstrates time series analysis in Python with
the pandas, matplotlib, and seaborn libraries, using a simple synthetic time series dataset.

First, ensure that the required libraries are installed:

$ pip install pandas matplotlib seaborn

Now, let's create a synthetic time series dataset and perform basic time series analysis:
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Generate a synthetic time series dataset (10 daily values starting today)
np.random.seed(42)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(9), freq='D')
data = {'Date': days, 'Value': np.random.randint(10, 100, size=len(days))}
df = pd.DataFrame(data)

# Display the dataset
print("Time Series Dataset:")
print(df.head())

# Visualize the time series data
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Value', data=df)
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
OUTPUT:

This will generate a synthetic time series dataset and plot it using a line chart.

Now, let's perform some basic time series analysis and visualization techniques:

EX.NO:5.1 TREND ANALYSIS
DEFINITION:
Trend analysis examines data over time to identify long-term patterns, helping discern upward,
downward, or stable directional changes.
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a synthetic time series dataset (a random walk over one year)
np.random.seed(42)
date_today = pd.to_datetime('2023-01-01')
days = pd.date_range(date_today, date_today + pd.to_timedelta(365, unit='D'),
                     freq='D')
values = np.cumsum(np.random.randn(len(days)))
df = pd.DataFrame({'Date': days, 'Value': values})

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()
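
The program above plots the raw series only. A trend estimate can be sketched by smoothing the series with a moving average and fitting a straight line to it. The 30-day window and the names Trend, slope, and direction are assumptions chosen for illustration, not part of the original program.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk series built the same way as in this exercise
np.random.seed(42)
days = pd.date_range('2023-01-01', periods=366, freq='D')
values = np.cumsum(np.random.randn(len(days)))
df = pd.DataFrame({'Date': days, 'Value': values})

# Centered moving average as a simple trend estimate
# (the first and last few points have no full window and stay NaN)
window = 30
df['Trend'] = df['Value'].rolling(window=window, center=True).mean()

# Overall direction: slope of a straight line fitted to the values
t = np.arange(len(df))
slope, intercept = np.polyfit(t, df['Value'].to_numpy(), 1)
direction = 'upward' if slope > 0 else 'downward' if slope < 0 else 'stable'

print(f"Linear trend slope: {slope:.4f} per day ({direction})")
```

Plotting df['Trend'] on top of df['Value'] with plt.plot shows the smoothed long-term movement against the day-to-day fluctuations.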

OUTPUT:


EX.NO : 5.2 SEASONAL DECOMPOSITION


DEFINITION:
Seasonal decomposition separates a time series into components: trend, seasonal, and residual,
aiding analysis by isolating underlying patterns and fluctuations.

PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a synthetic time series dataset with a clear seasonal component
np.random.seed(42)

date_today = pd.to_datetime('2023-01-01')
days = pd.date_range(date_today, date_today + pd.to_timedelta(365, unit='D'), freq='D')

# Creating a seasonal component (7 full cycles per 365 days)
seasonal_component = np.sin(2 * np.pi * np.arange(len(days)) / 365 * 7)

# Generating synthetic data with trend and seasonal components
trend_component = np.cumsum(np.random.randn(len(days)))
values = trend_component + 10 * seasonal_component

df = pd.DataFrame({'Date': days, 'Value': values})

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data')
plt.title('Original Time Series Data with Seasonal Component')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()
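
The program above plots a seasonal series without separating its components. A classical additive decomposition can be sketched with pandas alone, as shown here. This sketch uses its own synthetic data with an assumed season length of 7 days and a linear trend so that the arithmetic is easy to follow; it is not the method of the original program, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
period = 7            # assumed season length in days
n = 20 * period       # 20 full cycles
t = np.arange(n)
days = pd.date_range('2023-01-01', periods=n, freq='D')
values = 0.1 * t + 10 * np.sin(2 * np.pi * t / period) + np.random.randn(n)
series = pd.Series(values, index=days)

# 1. Trend: centered moving average over one full period
trend = series.rolling(window=period, center=True).mean()

# 2. Seasonal: average detrended value at each position within the period
detrended = series - trend
position = t % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()  # seasonal part sums to zero
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=days)

# 3. Residual: what remains after removing the trend and seasonal parts
residual = series - trend - seasonal

print(seasonal_means.round(2))
print(f"Residual standard deviation: {residual.std():.2f}")
```

Because the three parts are defined by subtraction, trend + seasonal + residual reproduces the original series wherever the moving average is defined.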

OUTPUT:

Result:

Thus, time series analysis was performed and various visualization techniques were applied
successfully.

EX.NO : 6 INTERACTIVE MAP VISUALIZATION

AIM:
To perform data analysis and representation on a map using various map data
sets with mouse rollover effect, user interaction, etc.

DEFINITION:
Performing data analysis and representation on a map involves visualizing geographic
data in a way that provides insights into spatial patterns. A popular tool for this is the
`folium` library in Python, which allows you to create interactive maps. Below is a basic
example using `folium` with mouse rollover effect and user interaction.

PROGRAM
import folium

# Load a GeoJSON dataset (you can use your own GeoJSON file or find one online)
geojson_data = ("https://raw.githubusercontent.com/nvkelso/natural-earth-vector/"
                "master/geojson/ne_110m_admin_0_countries.geojson")

# Create a folium map centered at latitude 0, longitude 0 (world view)
m = folium.Map(location=[0, 0], zoom_start=2)

# Add GeoJSON data with mouse rollover effect
folium.GeoJson(
    geojson_data,
    name='geojson',
    style_function=lambda x: {'fillColor': 'green', 'color': 'black'},
    highlight_function=lambda x: {'fillColor': 'yellow', 'color': 'blue'},
    # The field listed here must match a property key in the GeoJSON features
    tooltip=folium.features.GeoJsonTooltip(fields=['name'], labels=False,
                                           sticky=False)
).add_to(m)

# Add layer control for user interaction
folium.LayerControl().add_to(m)

# Save the map to an HTML file
m.save("interactive_map_with_interaction.html")

OUTPUT:

When you run the script, it will create an HTML file (in this case,
"interactive_map_with_interaction.html") in the same directory where you run the script.
Open this HTML file in a web browser to visualize the map with mouse rollover effects.

RESULT:

Thus, data analysis and representation on a map using various map data sets with
mouse rollover effect, user interaction, etc. were performed successfully.

EX.NO : 7 CARTOGRAPHIC VISUALIZATION


AIM:

To build cartographic visualizations for multiple datasets involving various
countries of the world, and states and districts in India.

DEFINITION:
Creating cartographic visualizations for multiple datasets involving various countries,
states, or districts often involves combining data with geographical boundaries. The
Python script here uses `folium` to create a visualization with markers for a few world
countries and states in India; the marker locations are for illustration only.

PROGRAM

import folium

# Create a folium map centered around a specific location (India)
m = folium.Map(location=[20.5937, 78.9629], zoom_start=5)

# Add a marker for a few world countries
folium.Marker([37.7749, -122.4194], popup='USA').add_to(m)
folium.Marker([35.8617, 104.1954], popup='China').add_to(m)
folium.Marker([20.5937, 78.9629], popup='India').add_to(m)

# Add a marker for a few Indian states
folium.Marker([19.7515, 75.7139], popup='Maharashtra').add_to(m)
folium.Marker([27.0238, 74.2179], popup='Rajasthan').add_to(m)

# Save the map to an HTML file
m.save("world_and_india_visualization_simple.html")

PROCEDURE:
You need to run the provided code on your local machine to see the output. When you run the script,
it will generate an HTML file named "world_and_india_visualization_simple.html" in
the same directory where you saved the script.

OUTPUT:

RESULT:

Thus, cartographic visualizations for multiple datasets involving various
countries of the world, and states and districts in India, were built successfully.

EX.NO : 8 EDA ON WINE QUALITY DATA SET

AIM:

To perform EDA on the Wine Quality data set.

DEFINITION:
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics of a
dataset. This exercise performs EDA on the white wine file of the Wine Quality dataset
available in the UCI Machine Learning Repository.

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Wine Quality dataset (white wine, semicolon-separated)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")
wine_data = pd.read_csv(url, sep=';')

# Display the first few rows of the dataset
print(wine_data.head())

# Summary statistics
print(wine_data.describe())

# Distribution of Wine Quality
sns.countplot(x='quality', data=wine_data)
plt.title('Distribution of Wine Quality')
plt.show()

# Correlation heatmap
correlation_matrix = wine_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

# Pairplot for selected features (column names in this file contain spaces)
selected_features = ['fixed acidity', 'volatile acidity', 'citric acid',
                     'residual sugar', 'chlorides', 'quality']
sns.pairplot(wine_data[selected_features], hue='quality', markers='o')
# pairplot draws a grid of axes, so set one title for the whole figure
plt.suptitle('Pairplot of Selected Features', y=1.02)
plt.show()

OUTPUT:

RESULT:

Thus, EDA on the Wine Quality data set was performed successfully.

EX.NO : 9 CASE STUDY ON A DATA SET TO PRESENT AN ANALYSIS REPORT

AIM:
To use a case study on a data set, apply various EDA and visualization
techniques, and present an analysis report.

DEFINITION:

PROGRAM

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (replace 'college_students.csv' with your actual file path)
df = pd.read_csv('college_students.csv')

# Display the first few rows of the dataset
print(df.head())

# Check for missing values
print(df.isnull().sum())

# Summary statistics
print(df.describe())

# Univariate Analysis
# Histograms for numerical variables
df.plot(kind='hist', subplots=True, layout=(2, 2), figsize=(12, 10), bins=20,
        title='Histograms')
plt.show()

# Bar plot for categorical variables (e.g., Gender)
df['Gender'].value_counts().plot(kind='bar', color=['skyblue', 'pink'])
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

# Bivariate Analysis
# Pairplot for numerical variables
sns.pairplot(df, hue='Gender', markers=['o', 's'], height=3)
plt.suptitle('Pair Plot of Numerical Variables by Gender', y=1.02)
plt.show()

# Correlation heatmap (numeric columns only; Gender and Grade are text)
correlation_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

# Insights and Analysis
# You can print and analyze key insights based on the EDA performed

# Save the visualizations or export the analysis report as needed
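
The "Insights and Analysis" step is left open in the program. One way to turn the EDA into reportable numbers is sketched here, using only the first five rows of this exercise's CSV file, inlined so the sketch runs on its own. The choice of summaries (group means, grade counts, one correlation) and the variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# First five rows of the college students CSV file from this exercise
csv_text = """Age,Gender,GPA,StudyHours,Grade
24,Female,2.76,24,A
21,Male,2.53,25,D
22,Male,3.24,14,D
24,Female,2.77,12,F
20,Female,3.05,21,B
"""
df = pd.read_csv(io.StringIO(csv_text))

# Average GPA and study hours for each gender
by_gender = df.groupby('Gender')[['GPA', 'StudyHours']].mean()

# Number of students who received each grade
grade_counts = df['Grade'].value_counts()

# Correlation between study hours and GPA
gpa_hours_corr = df['StudyHours'].corr(df['GPA'])

print(by_gender)
print(grade_counts)
print(f"Correlation between StudyHours and GPA: {gpa_hours_corr:.2f}")
```

With the full file, replace the inlined text with pd.read_csv('college_students.csv'); five rows are far too few to draw conclusions from.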

CSV file (college_students.csv):
Age,Gender,GPA,StudyHours,Grade
24,Female,2.76,24,A
21,Male,2.53,25,D
22,Male,3.24,14,D
24,Female,2.77,12,F
20,Female,3.05,21,B
22,Female,3.62,29,F
22,Female,3.58,13,B
24,Female,2.96,25,F
19,Female,3.31,16,C
20,Female,3.26,22,C
24,Female,3.45,19,C
20,Male,2.88,16,C
20,Female,3.38,23,A
22,Female,3.97,14,C
21,Male,3.23,12,D
20,Female,3.86,20,D
23,Male,3.15,20,A
22,Male,3.03,27,C
19,Female,3.47,24,C
21,Male,3.5,21,C
23,Male,3.8,18,F
23,Male,2.85,19,B
19,Male,3.25,21,F
21,Female,3.36,26,B
22,Male,3.65,15,C
18,Female,2.57,16,C
21,Male,3.99,23,F
19,Male,3.2,22,F
23,Male,2.92,17,B
22,Male,3.83,19,D
21,Female,3.62,18,B
18,Female,3.93,27,F
18,Male,3.0,11,F
20,Male,3.33,14,A
20,Female,3.36,14,F
24,Male,3.97,15,A
19,Male,2.61,28,D
21,Male,2.96,17,B
21,Female,2.79,25,B
24,Female,2.9,22,A
23,Female,3.23,10,B
23,Male,3.06,29,F
24,Male,3.09,26,C
23,Female,3.77,16,A
20,Female,3.9,22,B
21,Female,2.61,13,A
24,Female,2.81,13,A
21,Male,3.51,15,C
18,Female,3.04,28,F
20,Male,2.88,21,A
22,Female,2.94,16,B
20,Male,2.98,19,D
24,Female,3.77,28,A
22,Female,2.7,16,A
18,Female,3.56,12,C
24,Female,3.33,22,F
19,Male,2.94,22,D
21,Female,3.13,27,B
18,Male,2.88,29,D
21,Male,3.42,17,B
23,Male,2.62,18,F
19,Male,2.51,16,B
19,Female,3.44,10,C
18,Male,2.79,12,C
19,Male,2.61,22,C
22,Male,3.1,26,C
19,Female,2.58,10,D
21,Female,3.83,15,F
21,Female,2.54,15,B
24,Female,3.37,21,B
21,Male,3.16,22,C
24,Male,3.51,22,C
21,Female,2.99,24,A
22,Male,2.73,25,F
24,Male,3.97,20,D
20,Male,3.76,14,B
23,Female,3.79,13,A
18,Female,2.88,12,A
21,Male,2.56,28,B
19,Female,2.95,29,D
21,Female,3.31,27,A
19,Female,2.99,24,A
23,Female,3.74,18,F
23,Female,2.91,26,D
23,Male,3.95,23,A
19,Female,3.19,24,D
21,Male,3.76,10,B
23,Male,2.79,12,C
22,Female,3.12,25,A
24,Male,3.55,20,F
19,Male,2.71,21,B
19,Male,2.7,19,D
21,Female,3.95,25,B
19,Female,3.57,17,A
19,Male,2.56,15,D
23,Female,3.1,21,C
21,Female,3.15,17,B
23,Female,3.62,13,A
24,Female,2.88,17,F
24,Male,2.78,27,D
23,Male,2.62,14,B
24,Female,3.14,18,B
21,Female,3.53,13,C
18,Male,2.59,26,C
23,Male,3.87,18,F
22,Female,3.16,10,F
22,Male,2.86,29,A
19,Female,2.64,22,A
24,Female,2.77,25,F
22,Male,3.9,22,F
19,Male,3.46,23,D
18,Female,3.28,12,C
21,Female,3.49,15,A
21,Male,3.15,27,C
21,Female,3.6,28,C
22,Male,2.57,14,F
18,Female,3.35,24,D
22,Male,2.74,11,B
24,Male,2.68,19,D
22,Male,3.01,27,D
18,Female,2.64,22,C
18,Female,2.64,14,D
24,Male,2.97,10,A
18,Female,3.97,10,C
18,Male,2.76,27,A
21,Male,2.53,24,B
24,Female,3.65,26,C
20,Female,3.71,20,B
20,Male,3.02,26,C
18,Female,3.2,22,F
20,Female,3.47,10,D
20,Male,2.57,11,F
18,Male,3.92,18,B
20,Female,3.83,12,D
22,Male,2.89,10,C
19,Male,2.52,25,D
24,Female,3.9,15,A
19,Male,3.25,26,D
18,Male,3.31,14,A
21,Female,3.53,14,D
24,Female,3.42,15,A
18,Male,3.92,12,B
21,Male,3.92,14,F
19,Female,3.8,14,C
18,Female,3.45,19,D
24,Female,3.7,19,F
24,Female,3.52,28,C
23,Male,3.36,26,C
22,Female,2.69,23,A
20,Female,3.72,18,B
21,Male,3.73,23,B
23,Female,3.44,10,F
20,Male,3.73,28,B
20,Female,3.48,22,D
18,Female,2.81,22,B
20,Female,2.91,13,F
22,Male,2.82,10,B
24,Male,3.07,26,D
23,Female,2.56,17,A
20,Male,3.43,11,F
18,Male,3.0,17,A
22,Male,3.48,16,A
19,Male,3.08,11,A
24,Male,3.52,12,C
24,Male,3.01,27,C
23,Female,2.89,21,A
24,Female,3.24,10,F
20,Male,3.54,21,D
18,Female,3.02,14,D
24,Female,3.9,26,B
24,Female,2.56,25,F
19,Female,3.13,24,C
19,Male,3.95,24,A
21,Female,3.32,14,B
22,Female,3.14,23,D
20,Male,3.35,11,C
24,Male,3.36,20,C
24,Female,3.6,28,A
18,Female,2.69,16,D
21,Female,2.88,15,F
22,Female,3.37,11,C
21,Female,3.8,15,A
23,Female,3.34,27,F
22,Male,2.86,11,D
24,Female,3.52,27,C
24,Female,3.61,24,F
22,Male,2.86,28,F
24,Male,3.07,11,F
20,Male,3.3,29,C
22,Male,3.24,15,C
21,Female,3.08,10,B
22,Female,2.95,24,D
24,Female,2.65,19,A
20,Female,2.58,28,F
20,Female,3.94,26,B
23,Female,3.77,14,A
21,Male,3.03,13,B
19,Female,3.94,19,C
19,Male,3.52,26,F
22,Male,3.22,19,A

OUTPUT:

RESULT:
Thus, a case study on a data set was carried out by applying various EDA and visualization
techniques, and an analysis report was presented successfully.