EDA Lab Manual
Date:
Aim:
Install R, a data analysis and visualization tool, to leverage its capabilities for statistical analysis
and graphical representation of data.
Procedure:
1. Download and install the latest version of R from the official R website.
2. Optionally, install an integrated development environment (IDE) such as RStudio for a user-
friendly interface.
3. Use the R console or an IDE to execute R scripts and commands for data analysis and
visualization.
Installation/Output:
Result:
Thus, R has been installed, giving access to a data analysis and visualization tool for exploring
and representing data through statistical methods and graphs in the R environment.
Ex No: 02 Perform exploratory data analysis (EDA) with email data set
Date:
Aim:
To perform exploratory data analysis (EDA) on a dataset such as an email data set: export all
your emails as a dataset, import them into a pandas DataFrame, visualize them, and draw
insights from the data.
Procedure:
Export your email data in a suitable format (e.g., CSV, JSON, or plain text).
Use Pandas to import your email data into a DataFrame. You can use pd.read_csv(),
pd.read_json(), or other relevant functions depending on the data format.
Start by examining the basic properties of the dataset. Use functions like info(), head(), and
describe() to get an overview.
Data Cleaning:
Clean the data by handling missing values, removing duplicates, and addressing any other
data quality issues.
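The import, overview, and cleaning steps above can be sketched as follows. This is a minimal illustration, not the exercise's actual data: the in-memory CSV, its column names (Date, Sender, Subject), and the addresses are all assumptions standing in for a real mailbox export.

```python
import io
import pandas as pd

# Hypothetical export: a tiny in-memory CSV standing in for a real mailbox dump.
# The column names (Date, Sender, Subject) are assumptions for illustration.
raw_csv = io.StringIO(
    "Date,Sender,Subject\n"
    "2024-01-05 09:15:00,alice@example.com,Meeting\n"
    "2024-01-05 09:15:00,alice@example.com,Meeting\n"
    "2024-01-06 14:30:00,bob@example.com,\n"
    "2024-01-07 08:00:00,,Invoice\n"
)

df = pd.read_csv(raw_csv, parse_dates=["Date"])

# Overview of the raw data
df.info()
print(df.head())
print(df.describe(include="all"))

# Cleaning: drop exact duplicates, drop rows with no sender, fill empty subjects
clean = (
    df.drop_duplicates()
      .dropna(subset=["Sender"])
      .fillna({"Subject": "(no subject)"})
      .reset_index(drop=True)
)
print(clean)
```

Whether to drop or fill a missing value depends on the column: a row without a sender is of little use for sender statistics, while an empty subject can be kept with a placeholder.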
Visualization:
Utilize libraries like Matplotlib or Seaborn to create visualizations. For email data, you might
want to create: histograms of email frequencies over time; bar charts showing the most
frequent senders or recipients; word clouds to identify common words or phrases in
emails; and time series plots for email activity.
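Two of the visualizations listed above, email activity over time and the most frequent senders, might be produced along these lines. The email log, its column names (Date, Sender), and the output file name are hypothetical; the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical email log; the Date and Sender columns are assumed names
emails = pd.DataFrame({
    "Date": pd.to_datetime([
        "2024-01-05", "2024-01-05", "2024-01-06",
        "2024-01-08", "2024-01-08", "2024-01-08",
    ]),
    "Sender": ["alice", "bob", "alice", "carol", "alice", "bob"],
})

# Emails per calendar day; days without mail appear as 0
per_day = emails.set_index("Date").resample("D").size()

# Number of emails per sender, most frequent first
top_senders = emails["Sender"].value_counts()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
per_day.plot(ax=ax1, marker="o", title="Emails per day")
top_senders.plot(kind="bar", ax=ax2, title="Emails per sender")
ax1.set_ylabel("Number of emails")
ax2.set_ylabel("Number of emails")
fig.tight_layout()
fig.savefig("email_activity.png")
```

Resampling by day rather than grouping on the raw timestamps keeps empty days in the series, so quiet periods stay visible in the time series plot.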
Program:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data; the sender names are placeholders assumed for illustration
data = {
    "Sender": ["Alice", "Bob", "Alice", "Carol", "Bob"],
    "Body": ["3pm Yes, Let's meet at 4", "OK, 3 pm works for me", "OK, let's meet at 5",
             "Let's meet?", "Sure, I'll be there"],
}
# Create a DataFrame
df = pd.DataFrame(data)
# Count the number of emails sent by each sender
sender_counts = df["Sender"].value_counts()
sender_counts.plot(kind="bar")
plt.xlabel("Sender")
plt.ylabel("Number of emails")
plt.show()
Output:
Result:
Thus, the given program to perform exploratory data analysis (EDA) on an email data set
has been executed successfully.
Ex.no:03 Working with NumPy arrays, Pandas data frames, Basic plots using Matplotlib
Date:
Aim:
To write a Python program that works with NumPy arrays and Pandas data frames and draws basic plots using Matplotlib.
Procedure:
1. Start the program.
2. Import the libraries numpy, pandas and matplotlib.
3. Create an array and a DataFrame.
4. Plot the values of the array.
5. Stop.
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Creating a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Creating a Pandas DataFrame
df = pd.DataFrame({'Values': data})
# Plotting the data
plt.plot(df['Values'])
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Simple Line Plot')
plt.show()
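As a complement to the program above, a few common array and DataFrame operations can be sketched on the same five values. The derived names (squares, evens, large) are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5])

# Vectorised arithmetic and a boolean mask on the array
squares = data ** 2
evens = data[data % 2 == 0]

# Build a DataFrame from both arrays and keep the rows that meet a condition
df = pd.DataFrame({"Values": data, "Squares": squares})
large = df[df["Squares"] > 5]

print(squares)
print(evens)
print(large)
print(df["Values"].mean())
```

Both the array mask and the DataFrame filter use the same idea: a boolean condition evaluated element by element selects the entries to keep.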
Output:
Result:
Thus, the given program using NumPy arrays, Pandas data frames and basic plots with
Matplotlib has been executed successfully.
Ex.no:04 Explore various variable and row filters in R for cleaning data
Date:
Aim:
To explore various variable and row filters in R for cleaning data. Apply various plot
features in R on sample data sets and visualize.
Algorithm:
Program:
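library(ggplot2)
# `data` is assumed to be a data frame with numeric columns Age and Score
# Row filters: drop rows with missing values, then drop duplicate rows
data_cleaned <- na.omit(data)
data_cleaned <- unique(data_cleaned)
# Scatter plot (Age against Score is an assumed choice of axes)
ggplot(data_cleaned, aes(x = Age, y = Score)) +
  geom_point() +
  ggtitle("Scatter Plot of Age vs Score")
# Mean Age
mean_age <- mean(data_cleaned$Age)
cat("Mean Age:", mean_age, "\n")
# Median Score
median_score <- median(data_cleaned$Score)
cat("Median Score:", median_score, "\n")
# Histogram of Age
ggplot(data_cleaned, aes(x = Age)) +
  geom_histogram() +
  ggtitle("Histogram of Age")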
Output:
Warning message:
Median Score: 85
Warning message:
Result:
Thus, the R program for cleaning and visualizing the data has been executed successfully
and the output is verified.
Ex.no:05 Perform Time Series Analysis and apply the various visualization techniques
Date:
Aim:
To perform Time Series Analysis and apply the various visualization techniques.
Algorithm:
Program:
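import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Load the monthly passenger counts. The file name is an assumption; the Month
# index and Passengers column are the ones shown in the output.
df = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
print(df.head())
# Classical decomposition; period=12 assumes monthly data with a yearly cycle
result = seasonal_decompose(df['Passengers'], period=12)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12, 10))
ax1.plot(result.trend, label='Trend')
ax1.set_title('Trend Component')
ax2.plot(result.seasonal, label='Seasonal')
ax2.set_title('Seasonal Component')
ax3.plot(result.resid, label='Residual')
ax3.set_title('Residual Component')
ax4.plot(result.observed, label='Observed')
ax4.set_title('Observed')
plt.tight_layout()
plt.show()
# Rolling statistics over a 12-month window
rolling_mean = df['Passengers'].rolling(window=12).mean()
rolling_std = df['Passengers'].rolling(window=12).std()
plt.figure(figsize=(12, 6))
plt.plot(df['Passengers'], label='Original')
plt.plot(rolling_mean, label='Rolling Mean (12 months)')
plt.plot(rolling_std, label='Rolling Std (12 months)')
plt.title('Rolling Mean & Standard Deviation')
plt.legend()
plt.show()
# Seasonal-Trend decomposition using LOESS
stl = STL(df['Passengers'], period=12, seasonal=13)
result_stl = stl.fit()
plt.figure(figsize=(12, 6))
plt.plot(result_stl.trend, label='Trend')
plt.plot(result_stl.seasonal, label='Seasonal')
plt.plot(result_stl.resid, label='Residual')
plt.title('Seasonal-Trend decomposition using LOESS (STL)')
plt.legend()
plt.show()
# Passing ax draws on the prepared figure instead of opening a second one
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Autocorrelation Function (ACF)')
plt.show()
fig, ax = plt.subplots(figsize=(12, 6))
plot_pacf(df['Passengers'], lags=40, alpha=0.05, ax=ax)
ax.set_title('Partial Autocorrelation Function (PACF)')
plt.show()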
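To see what the 12-month rolling window in the program does, the following sketch applies it to a synthetic series. The data are made up (a linear trend plus a yearly sine wave), not the passenger counts, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical data): linear trend + yearly sine seasonality
months = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)
trend = 100 + 2.0 * t
seasonal = 10 * np.sin(2 * np.pi * t / 12)
series = pd.Series(trend + seasonal, index=months, name="Passengers")

# A 12-month window spans exactly one seasonal cycle, so the sine terms cancel
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# The first 11 entries are NaN because the window is not yet full
print(rolling_mean.head(13))

# Compare with the rolling mean of the trend alone
trend_mean = pd.Series(trend, index=months).rolling(window=12).mean()
print(np.allclose(rolling_mean.dropna(), trend_mean.dropna()))
```

Because the window length matches the seasonal period, the rolling mean tracks the trend only. This is why a 12-month window is a natural choice for monthly data with a yearly pattern.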
Output:
            Passengers
Month
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121
Result:
Thus, the given Python program to perform Time Series Analysis and apply the various
visualization techniques has been executed successfully.
Ex.no:06 Perform Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user interaction, etc
Date:
AIM:
Create an interactive map using Folium to display markers for specified cities, showcasing their
populations with tooltips and popups.
Algorithm:
Program:
import folium
import pandas as pd
df = pd.DataFrame(data)
Output:
Result:
Ex.no:07 Build cartographic visualization for multiple datasets involving various countries of the world; states and districts in India etc.
Date:
AIM:
Algorithm:
Program:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data for demonstration
world_data = {'Country': ['USA', 'China', 'India'],
              'Population': [331002651, 1444216107, 1380004385]}
world_df = pd.DataFrame(world_data)
plt.show()
Output:
Result:
Thus, informative and visually engaging maps representing diverse datasets for
countries of the world and specific Indian regions have been produced.
Ex.No:08 Perform EDA on Wine Quality Data Set
Date:
Aim:
Algorithm:
Program:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("winequalityN.csv")
print(data.head())
print(data.describe())
plt.show()
# Correlate the numeric columns only; one of the 13 columns is non-numeric
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
# Box plots of all features
plt.figure(figsize=(12, 8))
sns.boxplot(data=data, width=0.5)
plt.xticks(rotation=45)
plt.show()
# Distribution of alcohol content
plt.figure(figsize=(12, 6))
sns.histplot(data["alcohol"], kde=True)
plt.show()
Output:
[5 rows x 13 columns]
[8 rows x 12 columns]
Result:
Thus, the Python program to perform EDA on the Wine Quality Data Set has been executed successfully.
Ex.no:09 Use a case study on a data set and apply EDA, visualization techniques and present an analysis report.
Date:
Aim:
Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
Algorithm:
Program:
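import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
iris = load_iris()
# Combine features and target into one DataFrame (all five columns are float64)
data = pd.DataFrame(np.c_[iris.data, iris.target], columns=iris.feature_names + ['target'])
print(data.head())
print("\nData Information:")
print(data.info())
print("\nSummary Statistics:")
print(data.describe())
sns.set(style="whitegrid")
# The plotting calls below are reconstructed from the analysis report and are assumptions
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data['sepal length (cm)'], kde=True)
plt.subplot(1, 2, 2)
sns.histplot(data['sepal width (cm)'], kde=True)
plt.show()
sns.set_style("whitegrid")
sns.pairplot(data, hue='target')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x='target', y='petal length (cm)', data=data)
plt.show()
correlation_matrix = data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
print("\nAnalysis Report:")
print("- The dataset contains three species of iris flowers: setosa, versicolor, and virginica.")
print("- The features vary in their distributions, with sepal length and sepal width showing different patterns.")
print("- The pairplot shows how the features are correlated and how they can be used to distinguish between species.")
print("- The petal length is a strong predictor for species differentiation, with setosa having the shortest petals and virginica the longest.")
print("- The correlation heatmap confirms that petal length is highly correlated with the target variable, making it an important feature for classification.")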
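The report's statement about petal length can be checked numerically as well as visually. The sketch below groups the iris data by species and compares mean petal lengths. The `species` column and the variable names are introduced here for illustration and are not part of the original program.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the numeric target codes to species names for readable group labels
data["species"] = iris.target_names[iris.target]

# Mean petal length per species, shortest first
mean_petal = data.groupby("species")["petal length (cm)"].mean()
print(mean_petal.sort_values())

# Correlation between petal length and the numeric target code
petal_target_corr = data["petal length (cm)"].corr(pd.Series(iris.target))
print(petal_target_corr)
```

A per-group summary like this gives the analysis report a figure to cite alongside each plot.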
Output:
sepal length (cm) sepal width (cm) ... petal width (cm) target
[5 rows x 5 columns]
Data Information:
<class 'pandas.core.frame.DataFrame'>
Data columns (total 5 columns):
dtypes: float64(5)
None
Summary Statistics:
sepal length (cm) sepal width (cm) ... petal width (cm) target
[8 rows x 5 columns]
Result:
Thus, the case study on a data set, applying various EDA and visualization
techniques and presenting an analysis report, has been completed successfully.