Eda Lab Manual

The document outlines a series of exercises focused on data analysis and visualization using tools such as R, Python, and libraries like Pandas and Matplotlib. Each exercise includes aims, procedures, sample programs, and results, covering topics like exploratory data analysis, time series analysis, and interactive mapping. The exercises demonstrate the application of statistical methods and graphical representations to various datasets.

Uploaded by

alonedhoni26

Ex No: 01 Install the data Analysis and Visualization tool: R

Date:

Aim:

Install R, a data analysis and visualization tool, to leverage its capabilities for statistical analysis
and graphical representation of data.

Procedure:

1. Download and install the latest version of R from the official R website.

2. Optionally, install an integrated development environment (IDE) such as RStudio for a
user-friendly interface.

3. Use the R console or an IDE to execute R scripts and commands for data analysis and
visualization.

Installation/Output:

Result:

Thus, R has been installed successfully, providing a data analysis and visualization tool for
exploring and representing data through statistical methods and graphs.

Ex No: 02 Perform exploratory data analysis (EDA) with email data set

Date:

Aim:

To perform exploratory data analysis (EDA) on a dataset such as an email data set: export all
your emails as a dataset, import them into a pandas data frame, visualize them and draw
insights from the data.

Procedure:

Exporting Email Data:

Export your email data in a suitable format (e.g., CSV, JSON, or plain text).

Importing Data into Pandas:

Use Pandas to import your email data into a DataFrame. You can use pd.read_csv(),
pd.read_json(), or other relevant functions depending on the data format.

Exploratory Data Analysis:

Start by examining the basic properties of the dataset. Use functions like info(), head(), and
describe() to get an overview.
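
As a rough illustration of the import and overview steps, the sketch below reads a small CSV from an in-memory string instead of a real export. The column names (date, sender, recipient, subject) and the example.com addresses are assumptions made for this example, not the layout of any particular mail client's export.

```python
import io
import pandas as pd

# Hypothetical export: three messages with made-up example.com addresses.
csv_text = """date,sender,recipient,subject
2024-01-05 09:15:00,alice@example.com,bob@example.com,Meeting
2024-01-05 10:02:00,bob@example.com,alice@example.com,Re: Meeting
2024-01-06 14:30:00,carol@example.com,alice@example.com,Report
"""

# parse_dates converts the date column to datetime64 while loading.
emails = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(emails.head())   # first rows of the table
emails.info()          # column dtypes and non-null counts
# Summary of two text columns: count, unique, top, freq.
print(emails[["sender", "subject"]].describe())
```

With a real export, the io.StringIO object would be replaced by the path of the exported file.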

Data Cleaning:

Clean the data by handling missing values, removing duplicates, and addressing any other
data quality issues.
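
A minimal sketch of the cleaning step, assuming a hypothetical table with one missing sender and one exact duplicate row (the addresses are made up):

```python
import pandas as pd

# Hypothetical raw export: row 2 has no sender, row 3 repeats row 1.
raw = pd.DataFrame({
    "sender": ["alice@example.com", "bob@example.com", None,
               "bob@example.com"],
    "subject": ["Meeting", "Report", "Hello", "Report"],
})

# Missing values per column.
print(raw.isna().sum())

cleaned = (
    raw.dropna(subset=["sender"])   # drop rows without a sender
       .drop_duplicates()           # drop exact duplicate rows
       .reset_index(drop=True)      # renumber the remaining rows
)
print(cleaned)
```

Whether to drop or fill missing values depends on the analysis; dropping is used here only because it is the simplest choice.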

Visualization:

Utilize libraries like Matplotlib or Seaborn to create visualizations. For email data, you might
want to create:

- Histograms of email frequencies over time.
- Bar charts showing the most frequent senders or recipients.
- Word clouds to identify common words or phrases in emails.
- Time series plots for email activity.
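
The frequency tables behind such charts can be sketched with pandas alone. The messages below are hypothetical; the sketch only computes the per-sender and per-day counts that a bar chart or time series plot would display.

```python
import pandas as pd

# Hypothetical messages; addresses and timestamps are made up.
emails = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-05 09:15", "2024-01-05 10:02",
        "2024-01-06 14:30", "2024-01-08 08:45",
    ]),
    "sender": ["alice@example.com", "bob@example.com",
               "alice@example.com", "alice@example.com"],
})

# Messages per sender, most frequent first (input for a bar chart).
sender_counts = emails["sender"].value_counts()

# Messages per calendar day (input for a time series plot);
# days without mail appear with a count of 0.
daily_counts = emails.set_index("date").resample("D").size()

print(sender_counts)
print(daily_counts)
```

After importing Matplotlib, sender_counts.plot(kind="bar") and daily_counts.plot() would draw the corresponding charts.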

Program:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    "Sender": ["[email protected]", "[email protected]", "[email protected]",
               "[email protected]", "[email protected]"],
    "Receiver": ["[email protected]", "[email protected]", "[email protected]",
                 "[email protected]", "[email protected]"],
    "Subject": ["Meeting", "Meeting", "Meeting", "Meeting", "Meeting"],
    "Body": ["3pm Yes, Let's meet at 4", "OK, 3 pm works for me", "OK, let's meet at 5",
             "Let's meet?", "Sure, I'll be there"],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Count the number of emails sent by each sender
sender_counts = df["Sender"].value_counts()

# Plot the counts
sender_counts.plot(kind="bar")
plt.xlabel("Sender")
plt.ylabel("Number of emails")
plt.title("Number of emails sent by each sender")
plt.show()

Output:

Result:

Thus, the given program to perform exploratory data analysis (EDA) with the email data set
has been executed successfully.

Ex.no:03 Working with NumPy arrays, Pandas data frames, Basic plots using Matplotlib

Date:

Aim:

To write a Python program that works with NumPy arrays and Pandas data frames and draws basic plots using Matplotlib.

Procedure:

1. Start the program.
2. Import libraries such as numpy, pandas and matplotlib.
3. Create an array and a DataFrame.
4. Plot the given array.
5. Stop.
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Creating a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Creating a Pandas DataFrame
df = pd.DataFrame({'Values': data})
# Plotting the data
plt.plot(df['Values'])
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Simple Line Plot')
plt.show()

Output:

Result:

Thus, the given program to work with NumPy arrays, Pandas data frames and basic plots using
Matplotlib has been executed successfully.

Ex.no:04 Explore various variable and row filters in R for cleaning data

Date:

Aim:

To explore various variable and row filters in R for cleaning data. Apply various plot
features in R on sample data sets and visualize.

Algorithm:

1. Start the program.
2. Declare a dataset with several variables.
3. Clean the given dataset.
4. Visualize the variables using ggplot2 or base R graphics.
5. Stop.

Program:

# Sample data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(25, 32, 29, NA, 27),
  Score = c(85, 92, 78, 64, 89))

# Row filters: drop rows with missing values, then duplicate rows
data_cleaned <- unique(na.omit(data))

# Row filter on a condition; which() skips the NA comparison for David
data_filtered <- data[which(data$Age > 30), ]

library(ggplot2)

# Scatter plot (the raw data is plotted, so ggplot2 warns about the NA row)
ggplot(data, aes(x = Age, y = Score)) +
  geom_point() +
  labs(x = "Age", y = "Score") +
  ggtitle("Scatter Plot of Age vs. Score")

# Mean Age
mean_age <- mean(data$Age, na.rm = TRUE)
cat("Mean Age:", mean_age, "\n")

# Median Score
median_score <- median(data$Score)
cat("Median Score:", median_score, "\n")

# Histogram of Age
ggplot(data, aes(x = Age)) +
  geom_histogram() +
  labs(x = "Age", y = "Frequency") +
  ggtitle("Histogram of Age")

Output:

Warning message:

Removed 1 rows containing missing values (geom_point).

Mean Age: 28.25

Median Score: 85

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning message:

Removed 1 rows containing non-finite values (stat_bin).

[Execution complete with exit code 0]

Result:

Thus, the R program to clean and visualize the data has been executed successfully
and the output is verified.
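
For readers working in Python, the same row and variable filters can be sketched with pandas. This is an analogue for comparison, not part of the R exercise; it reuses the sample values from the R data frame above.

```python
import numpy as np
import pandas as pd

# Same sample values as the R data frame in the program above.
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 32, 29, np.nan, 27],
    "Score": [85, 92, 78, 64, 89],
})

# Row filters: drop incomplete rows, then exact duplicates.
data_cleaned = data.dropna().drop_duplicates()

# Row filter on a condition: comparisons with NaN are False,
# so David's row is excluded automatically.
data_filtered = data[data["Age"] > 30]

# Variable (column) filter: keep only the numeric columns.
numeric_columns = data.select_dtypes(include="number")

print(data["Age"].mean())      # NaN is skipped by default
print(data["Score"].median())
```

The mean age and median score agree with the values printed by the R program (28.25 and 85).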

Ex.no:05 Perform Time Series Analysis and apply the various visualization techniques

Date:

Aim:

To perform Time Series Analysis and apply the various visualization techniques.

Algorithm:

1. Start the program.
2. Load the required libraries and the Air Passengers dataset.
3. Plot the time series.
4. Decompose the series into trend, seasonal and residual components.
5. Plot the rolling statistics and the autocorrelation functions.
6. Stop.

Program:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose

# Load the Air Passengers dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')

# Display the first few rows of the dataset


print(df.head())

# Visualize the time series


plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Passengers'], label='Passenger Count')
plt.title('Airline Passengers Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

# Time Series Decomposition


result = seasonal_decompose(df['Passengers'], model='multiplicative', period=12)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 8), sharex=True)

ax1.plot(result.trend, label='Trend')
ax1.set_title('Trend Component')

ax2.plot(result.seasonal, label='Seasonal')
ax2.set_title('Seasonal Component')

ax3.plot(result.resid, label='Residual')
ax3.set_title('Residual Component')

ax4.plot(result.observed, label='Observed')
ax4.set_title('Observed')

plt.tight_layout()
plt.show()

# Rolling mean and standard deviation (12-month window)
rolling_mean = df['Passengers'].rolling(window=12).mean()
rolling_std = df['Passengers'].rolling(window=12).std()

plt.figure(figsize=(12, 6))
plt.plot(df['Passengers'], label='Original')
plt.plot(rolling_mean, label='Rolling Mean (12 months)')
plt.plot(rolling_std, label='Rolling Std (12 months)')
plt.title('Rolling Mean & Standard Deviation')
plt.legend()
plt.show()

# Seasonal-Trend decomposition using LOESS (STL)
from statsmodels.tsa.seasonal import STL

stl = STL(df['Passengers'], seasonal=13)
result_stl = stl.fit()

plt.figure(figsize=(12, 6))
plt.plot(result_stl.trend, label='Trend')
plt.plot(result_stl.seasonal, label='Seasonal')
plt.plot(result_stl.resid, label='Residual')
plt.title('Seasonal-Trend decomposition using LOESS (STL)')
plt.legend()
plt.show()

# Autocorrelation and Partial Autocorrelation


from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# plot_acf and plot_pacf create their own figure unless an axis is passed in
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Autocorrelation Function (ACF)')
plt.show()

fig, ax = plt.subplots(figsize=(12, 6))
plot_pacf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Partial Autocorrelation Function (PACF)')
plt.show()

Output:

Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121

Result:

Thus, the given Python program to perform Time Series Analysis and apply various
visualization techniques has been executed successfully.
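
To make the decomposition step less of a black box, the sketch below computes a simplified classical multiplicative decomposition by hand with pandas. It runs on a synthetic monthly series (an assumption, so that no download is needed), and it illustrates the general idea only; the details of statsmodels' seasonal_decompose may differ.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (not the airline data): a linear trend
# multiplied by a repeating 12-month seasonal pattern.
index = pd.date_range("2000-01-01", periods=72, freq="MS")
t = np.arange(72)
true_seasonal = 1 + 0.2 * np.sin(2 * np.pi * (t % 12) / 12)
series = pd.Series((100 + 2 * t) * true_seasonal, index=index)

# Trend: centred 2x12 moving average built from trailing 12-month means.
m12 = series.rolling(window=12).mean()
trend = (m12.shift(-5) + m12.shift(-6)) / 2

# Seasonal: average the detrended ratios by calendar month,
# then rescale so the twelve indices average to 1.
ratio = series / trend
monthly = ratio.groupby(ratio.index.month).mean()
monthly = monthly / monthly.mean()
seasonal = pd.Series(monthly.loc[series.index.month].to_numpy(), index=index)

# Residual: what is left after removing trend and seasonality.
resid = series / (trend * seasonal)

print(monthly.round(3))
```

The first and last six months have no trend value because the centred window does not fit there, which is why decomposition plots typically show gaps at both ends of the trend and residual panels.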

Ex.no:06 Perform Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user interaction, etc

Date:

AIM:

Create an interactive map using Folium to display markers for specified cities, showcasing their
populations with tooltips and popups.

Algorithm:

1. Start the program.
2. Load the required libraries and prepare the city data.
3. Create a Folium map centered at a chosen location.
4. Add a marker with a tooltip and popup for each city.
5. Save the map to an HTML file.
6. Stop.

Program:

import folium
import pandas as pd

# Sample data (replace with your own dataset)


data = {
'City': ['New York', 'San Francisco', 'Los Angeles'],
'Population': [8175133, 884363, 3906772],
'Latitude': [40.7128, 37.7749, 34.0522],
'Longitude': [-74.0060, -122.4194, -118.2437]
}

df = pd.DataFrame(data)

# Create a folium map centered at a specific location


map_center = [37.7749, -122.4194]
map_obj = folium.Map(location=map_center, zoom_start=5)

# Add markers to the map with mouseover text
for index, row in df.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=f"City: {row['City']} \nPopulation: {row['Population']}",
        tooltip=row['City'],
    ).add_to(map_obj)

# Save the map to an HTML file


map_obj.save('interactive_map.html')

Output:

Result:

An interactive map, saved as 'interactive_map.html', showcasing city markers with
tooltips and popups.
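
Two small refinements can be prepared with pandas before the markers are created: centering the map on the mean coordinate, and building popup text with a line break and thousands separators. The sketch below reuses the sample cities; it assumes that string popups are rendered as HTML, which is why it uses <br> instead of "\n".

```python
import pandas as pd

# Same sample cities as the program above.
df = pd.DataFrame({
    "City": ["New York", "San Francisco", "Los Angeles"],
    "Population": [8175133, 884363, 3906772],
    "Latitude": [40.7128, 37.7749, 34.0522],
    "Longitude": [-74.0060, -122.4194, -118.2437],
})

# Centre the map on the mean coordinate so that the markers
# are roughly balanced around the starting view.
map_center = [df["Latitude"].mean(), df["Longitude"].mean()]

# Assumption: popup strings are treated as HTML, so <br> gives a line break.
# The ":," format inserts thousands separators.
popups = [
    f"City: {row.City}<br>Population: {row.Population:,}"
    for row in df.itertuples(index=False)
]

print(map_center)
print(popups[0])
```

The resulting map_center and popup strings could then be passed to folium.Map and folium.Marker in place of the hard-coded values.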

Ex.no:07 Build cartographic visualization for multiple datasets involving various countries of the world; states and districts in India etc.

Date:

AIM:

Develop cartographic visualizations for multiple datasets, encompassing countries
worldwide, and regions like states and districts in India.

Algorithm:

1. Start the program.
2. Load the required libraries and the world map data.
3. Prepare the sample population data for countries and Indian states.
4. Merge the population data with the map data.
5. Plot the world and India maps.
6. Stop.

Program:

import geopandas as gpd


from shapely.geometry import Polygon
import matplotlib.pyplot as plt
import pandas as pd

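# Create a GeoDataFrame for the world
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; with newer
# versions, download the Natural Earth data and pass its path to gpd.read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))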

# Create a GeoDataFrame for India


# Note: For simplicity, using a small polygon for India
india_geometry = gpd.GeoSeries([Polygon([(75, 20), (80, 20), (80, 25), (75, 25)])],
crs='EPSG:4326')
india = gpd.GeoDataFrame(geometry=india_geometry)

# Sample data for demonstration
# Country names must match the 'name' column of the world map for the merge to succeed
world_data = {'Country': ['United States of America', 'China', 'India'],
              'Population': [331002651, 1444216107, 1380004385]}
world_df = pd.DataFrame(world_data)

india_data = {'State': ['Maharashtra', 'Uttar Pradesh', 'Tamil Nadu'],
              'Population': [123144223, 223897418, 77841267]}
india_df = pd.DataFrame(india_data)

# Merge world and India data with the map data


world = world.merge(world_df, left_on='name', right_on='Country', how='left')
india['Population'] = india_df['Population']

# Plot world map


fig, ax = plt.subplots(1, 2, figsize=(15, 7))
world.plot(column='Population', cmap='OrRd', ax=ax[0], legend=True,
legend_kwds={'label': "Population by Country"})
ax[0].set_title('World Population')

# Plot India map


india.plot(column='Population', cmap='OrRd', ax=ax[1], legend=True, legend_kwds={'label':
"Population by State"})
ax[1].set_title('India Population')

plt.show()

Output:

Result:

Thus, informative and visually engaging maps representing diverse datasets for
countries worldwide and specific Indian regions have been produced.
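
A left merge leaves unmatched rows empty without any warning, so it can help to check the join keys before plotting. The sketch below uses a small stand-in table for the map's name column (an assumption, so that GeoPandas is not required) and shows one way to list attribute rows that found no matching shape.

```python
import pandas as pd

# Stand-in for the map table: only the name column matters here.
map_names = pd.DataFrame(
    {"name": ["United States of America", "China", "India", "Brazil"]}
)

# Attribute table that labels one country differently ("USA").
stats = pd.DataFrame({
    "Country": ["USA", "China", "India"],
    "Population": [331002651, 1444216107, 1380004385],
})

# An outer merge with indicator=True records, for every row, whether
# it came from the left table, the right table, or both.
check = map_names.merge(
    stats, left_on="name", right_on="Country",
    how="outer", indicator=True,
)

# Attribute rows that found no matching map shape.
unmatched = check.loc[check["_merge"] == "right_only", "Country"].tolist()
print(unmatched)
```

Any name listed in unmatched would need to be renamed to the spelling used by the map data before the population values can be drawn.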

Ex.No:08 Perform EDA on Wine Quality Data Set
Date:

Aim:

To perform EDA on Wine Quality Data Set.

Algorithm:

1. Start the program.
2. Import libraries such as pandas, matplotlib and seaborn.
3. Perform initial data exploration and compute summary statistics.
4. Visualize the dataset for further analysis.
5. Stop.

Program:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

data = pd.read_csv("winequalityN.csv")

print("First few rows of the dataset:")

print(data.head())

print("Summary statistics of the dataset:")

print(data.describe())

data.hist(bins=30, figsize=(12, 8))

plt.suptitle("Histograms of Features", y=1.02)

plt.show()

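# Restrict the correlation to numeric columns; the text column "type" would otherwise raise an error in pandas 2.x
correlation_matrix = data.corr(numeric_only=True)

plt.figure(figsize=(10, 8))

sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)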

plt.title("Correlation Heatmap")

plt.show()

plt.figure(figsize=(12, 8))

sns.boxplot(data=data, width=0.5)

plt.xticks(rotation=45)

plt.title("Box Plots of Features")

plt.show()

plt.figure(figsize=(12, 6))

sns.histplot(data["alcohol"], kde=True)

plt.title("Alcohol Content Distribution")

plt.show()

Output:

First few rows of the dataset:

type fixed acidity volatile acidity ... sulphates alcohol quality

0 white 7.0 0.27 ... 0.45 8.8 6

1 white 6.3 0.30 ... 0.49 9.5 6

2 white 8.1 0.28 ... 0.44 10.1 6

3 white 7.2 0.23 ... 0.40 9.9 6

4 white 7.2 0.23 ... 0.40 9.9 6

[5 rows x 13 columns]

Summary statistics of the dataset:


fixed acidity volatile acidity ... alcohol quality

count 6487.000000 6489.000000 ...6497.000000 6497.000000

mean 7.216579 0.339691 ... 10.491801 5.818378

std 1.296750 0.164649 ... 1.192712 0.873255

min 3.800000 0.080000 ... 8.000000 3.000000

25% 6.400000 0.230000 ... 9.500000 5.000000

50% 7.000000 0.290000 ... 10.300000 6.000000

75% 7.700000 0.400000 ... 11.300000 6.000000

max 15.900000 1.580000 ... 14.900000 9.000000

[8 rows x 12 columns]

Result:

Thus, the Python program to perform EDA on the Wine Quality Data Set has been executed successfully.
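
In the summary statistics above, the count for fixed acidity (6487) is lower than the count for alcohol (6497), which points to missing values. The sketch below shows how such gaps, a numeric-only correlation and a group-wise summary might be checked; it uses a small hypothetical sample rather than the real file.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample with the same kind of columns as the
# wine data: one text column and numeric columns with a gap.
wine = pd.DataFrame({
    "type": ["white", "white", "red", "red"],
    "fixed acidity": [7.0, np.nan, 7.4, 7.8],
    "alcohol": [8.8, 9.5, 9.4, 9.8],
    "quality": [6, 6, 5, 5],
})

# Missing values per column.
missing = wine.isna().sum()

# Correlation restricted to the numeric columns.
corr = wine.corr(numeric_only=True)

# Group-wise summary by wine type.
mean_quality = wine.groupby("type")["quality"].mean()

print(missing)
print(corr.round(2))
print(mean_quality)
```

On the real data, the same three calls would be applied to the frame returned by pd.read_csv.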

Ex.no:09 Use a case study on a data set and apply EDA, visualization techniques and present an analysis report.

Date:

Aim:

Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.

Algorithm:

1. Start the program.
2. Import the necessary libraries: pandas, numpy, matplotlib, seaborn, and sklearn.datasets for data loading.
3. Load the dataset and explore the data.
4. Visualize the dataset and report the analysis.
5. Stop.

Program:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.datasets import load_iris

iris = load_iris()

data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['target'])

print("First few rows of the dataset:")

print(data.head())

print("\nData Information:")

print(data.info())

print("\nSummary Statistics:")
print(data.describe())

sns.set(style="whitegrid")

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)

sns.histplot(data['sepal length (cm)'], kde=True)

plt.title("Distribution of Sepal Length")

plt.subplot(1, 2, 2)

sns.histplot(data['sepal width (cm)'], kde=True)

plt.title("Distribution of Sepal Width")

plt.show()

sns.set_style("whitegrid")

sns.pairplot(data, hue='target', markers=["o", "s", "D"])

plt.show()

plt.figure(figsize=(10, 6))

sns.boxplot(x='target', y='petal length (cm)', data=data)

plt.title("Petal Length Boxplot by Species")

plt.show()

correlation_matrix = data.corr()

plt.figure(figsize=(8, 6))

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.title("Correlation Heatmap")

plt.show()

print("\nAnalysis Report:")

print("- The dataset contains three species of iris flowers: setosa, versicolor, and virginica.")

print("- The features vary in their distributions, with sepal length and sepal width showing "
      "different patterns.")

print("- The pairplot shows how the features are correlated and how they can be used to "
      "distinguish between species.")

print("- The petal length is a strong predictor for species differentiation, with setosa "
      "having the shortest petals and virginica the longest.")

print("- The correlation heatmap confirms that petal length is highly correlated with the "
      "target variable, making it an important feature for classification.")

Output:

First few rows of the dataset:

sepal length (cm) sepal width (cm) ... petal width (cm) target

0 5.1 3.5 ... 0.2 0.0

1 4.9 3.0 ... 0.2 0.0

2 4.7 3.2 ... 0.2 0.0

3 4.6 3.1 ... 0.2 0.0

4 5.0 3.6 ... 0.2 0.0

[5 rows x 5 columns]

Data Information:
<class 'pandas.core.frame.DataFrame'>

RangeIndex: 150 entries, 0 to 149

Data columns (total 5 columns):

# Column Non-Null Count Dtype


--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64

3 petal width (cm) 150 non-null float64

4 target 150 non-null float64

dtypes: float64(5)

memory usage: 6.0 KB

None

Summary Statistics:

sepal length (cm) sepal width (cm) ... petal width (cm) target

count 150.000000 150.000000 ...150.000000 150.000000

mean 5.843333 3.057333 ... 1.199333 1.000000

std 0.828066 0.435866 ... 0.762238 0.819232

min 4.300000 2.000000 ... 0.100000 0.000000

25% 5.100000 2.800000 ... 0.300000 0.000000

50% 5.800000 3.000000 ... 1.300000 1.000000

75% 6.400000 3.300000 ... 1.800000 2.000000

max 7.900000 4.400000 ... 2.500000 2.000000

[8 rows x 5 columns]

Result:

Thus, the case study applying various EDA and visualization techniques to a data set has been
carried out and an analysis report has been presented.
