NAME: Bharathram Srinivasan
ROLL NO:2201107
T-22
EXPERIMENT NO.1
Aim: To Study and Implement Exploratory Data Analytics on a given
dataset
THEORY:
Exploratory Data Analysis (EDA) is an important first step in data
science projects. It involves looking at and visualizing data to
understand its main features, find patterns, and discover how different
parts of the data are connected.
EDA helps to spot any unusual data or outliers and is usually done
before starting more detailed statistical analysis or building models. In
this experiment, we discuss what Exploratory Data Analysis (EDA) is
and the steps to perform it.
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is important for several reasons,
especially in the context of data science and statistical modeling. Here
are some of the key reasons why EDA is a critical step in the data
analysis process:
Helps to understand the dataset, showing how many features
there are, the type of data in each feature, and how the data
is spread out, which helps in choosing the right methods for
analysis.
EDA helps to identify hidden patterns and relationships
between different data points, which helps us in model
building.
Allows us to spot errors or unusual data points (outliers) that
could affect your results.
Insights that you obtain from EDA help you decide which
features are most important for building models and how to
prepare them to improve performance.
By understanding the data, EDA helps us in choosing the best
modeling techniques and adjusting them for better results.
Types of Exploratory Data Analysis
There are various types of EDA techniques based on the nature of the
data. Depending on the number of columns we are analyzing, we
can divide EDA into three types: univariate, bivariate, and
multivariate.
1. Univariate Analysis
Univariate analysis focuses on studying one variable to understand its
characteristics. It helps describe the data and find patterns within a
single feature. Common methods include histograms to show data
distribution, box plots to detect outliers and understand data spread,
and bar charts for categorical data. Summary
statistics like mean, median, mode, variance, and standard
deviation help describe the central tendency and spread of the data.
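A minimal univariate sketch (the single-column DataFrame df and its values are assumptions for illustration only):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical single numeric feature
df = pd.DataFrame({"col": [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]})

# Summary statistics: central tendency and spread
print(df["col"].describe())
print("mode:", df["col"].mode().iloc[0])

# Histogram to show the distribution
sns.histplot(df["col"], bins=10, kde=True)
plt.show()

# Box plot to detect outliers (the value 30 stands out here)
sns.boxplot(x=df["col"])
plt.show()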
2. Bivariate Analysis
Bivariate analysis focuses on exploring the relationship between two
variables to find connections, correlations, and dependencies. It’s an
important part of exploratory data analysis that helps understand how
two variables interact. Some key techniques used in bivariate analysis
include scatter plots, which visualize the relationship between two
continuous variables; correlation coefficient, which measures how
strongly two variables are related, commonly using Pearson’s
correlation for linear relationships; and cross-tabulation,
or contingency tables, which show the frequency distribution of two
categorical variables and help understand their relationship.
Line graphs are useful for comparing two variables over time,
especially in time series data, to identify trends or
patterns. Covariance measures how two variables change together,
though it’s often supplemented by the correlation coefficient for a
clearer, more standardized view of the relationship.
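A minimal bivariate sketch (the toy columns x and y are assumptions for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical paired observations
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]})

# Scatter plot of the relationship between two continuous variables
plt.scatter(df["x"], df["y"])
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Pearson's correlation coefficient for the linear relationship
print("Pearson r:", df["x"].corr(df["y"]))

# Cross-tabulation (contingency table) of two derived categorical variables
df["x_high"] = df["x"] > df["x"].median()
df["y_high"] = df["y"] > df["y"].median()
print(pd.crosstab(df["x_high"], df["y_high"]))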
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more
variables in the dataset. It aims to understand how variables interact
with one another, which is crucial for most statistical modeling
techniques. It includes techniques like pair plots, which show the
relationships between multiple variables at once, helping to see how
they interact. Another technique is Principal Component Analysis
(PCA), which reduces the complexity of large datasets by simplifying
them, while keeping the most important information.
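A minimal multivariate sketch, assuming seaborn's bundled iris demo dataset is available (load_dataset fetches it; it is used here purely for illustration):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = sns.load_dataset("iris")

# Pair plot: pairwise relationships between all numeric variables at once
sns.pairplot(iris, hue="species")
plt.show()

# PCA: compress the four measurements into two principal components
X = StandardScaler().fit_transform(iris.drop(columns="species"))
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)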
In addition to univariate, bivariate, and multivariate analysis, there are
specialized EDA techniques tailored for specific types of data or
analysis needs:
Spatial Analysis: For geographical data, using maps and spatial
plotting to understand the geographical distribution of variables.
Text Analysis: Involves techniques like word clouds, frequency
distributions, and sentiment analysis to explore text data.
Time Series Analysis: This type of analysis is mainly applied to
datasets that have a temporal component. Time series analysis
involves examining and modeling patterns, trends, and
seasonality in the data over time. Techniques
like line plots, autocorrelation analysis, moving averages,
and ARIMA (AutoRegressive Integrated Moving Average)
models are commonly used in time series analysis.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps
designed to help you understand the data you’re working with,
uncover underlying patterns, identify anomalies, test hypotheses, and
ensure the data is clean and suitable for further analysis.
Step 1: Understand the Problem and the Data
The first step in any data analysis project is to clearly understand the
problem you’re trying to solve and the data you have. This involves
asking key questions such as:
What is the business goal or research question?
What are the variables in the data and what do they
represent?
What types of data (numerical, categorical, text, etc.) do
you have?
Are there any known data quality issues or limitations?
Are there any domain-specific concerns or restrictions?
By thoroughly understanding the problem and the data, you can better
plan your analysis, avoid wrong assumptions, and ensure accurate
conclusions.
Step 2: Import and Inspect the Data
After clearly understanding the problem and the data, the next step is to
import the data into your analysis environment (like Python, R, or a
spreadsheet tool). At this stage, it’s crucial to examine the data to get
an initial understanding of its structure, variable types, and potential
issues.
Here’s what you can do:
Load the data into your environment carefully to avoid
errors or truncations.
Examine the size of the data (number of rows and columns) to
understand its complexity.
Check for missing values and see how they are distributed
across variables, since missing data can impact the quality of
your analysis.
Identify data types for each variable (like numerical,
categorical, etc.), which will help in the next steps of
data manipulation and analysis.
Look for errors or inconsistencies, such as invalid values,
mismatched units, or outliers, which could signal deeper
issues with the data.
By completing these tasks, you’ll be prepared to clean and analyze the
data more effectively.
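A minimal sketch of this inspection step (the file name data.csv is a placeholder assumption):

import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical input file

print(df.shape)                # number of rows and columns
print(df.dtypes)               # data type of each variable
print(df.isnull().sum())       # missing values per column
print(df.head())               # eyeball a few rows for obvious errors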
Step 3: Handle Missing Data
Missing data is common in many datasets and can significantly affect
the quality of your analysis. During Exploratory Data Analysis
(EDA), it’s important to identify and handle missing data properly to
avoid biased or misleading results.
Here’s how to handle it:
Understand the patterns and possible reasons for missing data.
Is it missing completely at random (MCAR), missing at
random (MAR), or missing not at random (MNAR)?
Knowing this helps decide how to handle the missing data.
Decide whether to remove missing data (listwise deletion)
or impute (fill in) the missing values. Removing data can lead
to biased outcomes, especially if the missing data isn’t MCAR.
Imputing values helps preserve data but should be done
carefully.
Use appropriate imputation methods like mean/median
imputation, regression imputation, or machine learning
techniques like KNN or decision trees based on the data’s
characteristics.
Consider the impact of missing data. Even after imputing,
missing data can cause uncertainty and bias, so interpret the
results with caution.
Properly handling missing data improves the accuracy of your
analysis and prevents misleading conclusions.
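A minimal sketch of the deletion and imputation options (the toy DataFrame is an assumption; SimpleImputer and KNNImputer are standard scikit-learn utilities):

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with gaps
df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0],
                   "b": [10.0, 20.0, None, 40.0]})

# Option 1: listwise deletion (risks bias if data is not MCAR)
dropped = df.dropna()

# Option 2: median imputation
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# Option 3: KNN imputation, filling gaps from similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
print(knn_filled)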
Step 4: Explore Data Characteristics
After addressing missing data, the next step in EDA is to explore the
characteristics of your data by examining the distribution, central
tendency, and variability of your variables, as well as identifying
any outliers or anomalies. This helps in selecting appropriate
analysis methods and spotting potential data issues. You should
calculate summary statistics like mean, median, mode, standard
deviation, skewness, and kurtosis for numerical variables. These
provide an overview of the data’s distribution and help identify any
irregular patterns or issues.
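A minimal sketch of these summary statistics on a hypothetical numeric feature:

import pandas as pd

s = pd.Series([2, 3, 3, 4, 5, 5, 6, 7, 8, 40])  # illustrative values

print(s.describe())            # count, mean, std, quartiles, min/max
print("median:", s.median())
print("mode:", s.mode().iloc[0])
print("skewness:", s.skew())   # positive here: a long right tail (40)
print("kurtosis:", s.kurtosis())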
Step 5: Perform Data Transformation
Data transformation is an essential step in EDA because it prepares
your data for accurate analysis and modeling. Depending on your
data’s characteristics and analysis needs, you may need to transform it
to ensure it’s in the right format.
Common transformation techniques include:
Scaling or normalizing numerical variables (e.g., min-max
scaling or standardization).
Encoding categorical variables for machine learning
(e.g., one-hot encoding or label encoding).
Applying mathematical
transformations (e.g., logarithmic or square root) to correct
skewness or non-linearity.
Creating new variables from existing ones (e.g., calculating
ratios or combining variables).
Aggregating or grouping data based on specific variables or
conditions, as sketched below.
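A minimal sketch of these transformations (the price/city toy data is an assumption):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"price": [10.0, 200.0, 3000.0, 45000.0],
                   "city": ["A", "B", "A", "C"]})

# Scaling and standardization of a numeric variable
df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Log transform to correct right skew
df["price_log"] = np.log1p(df["price"])

# One-hot encoding of a categorical variable
df = pd.get_dummies(df, columns=["city"])
print(df)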
Step 6: Visualize Data Relationships
Visualization is a powerful tool in the EDA process, helping to
uncover relationships between variables and identify patterns or
trends that may not be obvious from summary statistics alone.
For categorical variables, create frequency tables, bar plots,
and pie charts to understand the distribution of categories and
identify imbalances or unusual patterns.
For numerical variables, generate histograms, box plots, violin
plots, and density plots to visualize distribution, shape, spread,
and potential outliers.
To explore relationships between variables, use scatter
plots, correlation matrices, or statistical tests like Pearson’s
correlation coefficient or Spearman’s rank correlation, as
sketched below.
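A minimal sketch of these visualizations, assuming seaborn's bundled tips demo dataset (fetched by load_dataset) and pandas 1.5+ for the numeric_only flag:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Categorical variable: bar plot of category counts
sns.countplot(x="day", data=tips)
plt.show()

# Numerical variable: histogram with a density curve
sns.histplot(tips["total_bill"], kde=True)
plt.show()

# Relationships: scatter plot and a correlation heatmap
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="Blues")
plt.show()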
Step 7: Handling Outliers
Outliers are data points that significantly differ from the rest of the data,
often caused by errors in measurement or data entry. Detecting and
handling outliers is important because they can skew your analysis
and affect model performance. You can identify outliers using
methods like interquartile range (IQR), Z-scores, or domain-
specific rules. Once identified, outliers can be removed or adjusted
depending on the context. Properly managing outliers ensures your
analysis is accurate and reliable.
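A minimal sketch of IQR- and Z-score-based detection on hypothetical values (the 2-standard-deviation cutoff is a choice made for this small sample; 3 is more common on larger data):

import numpy as np
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 95])  # 95 is a suspicious point

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag points far from the mean in standard-deviation units
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 2]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())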
Step 8: Communicate Findings and Insights
The final step in EDA is to communicate your findings clearly. This
involves summarizing your analysis, pointing out key discoveries, and
presenting your results in a clear and engaging way.
Clearly state the goals and scope of your analysis.
Provide context and background to help others understand your
approach.
Use visualizations to support your findings and make them
easier to understand.
Highlight key insights, patterns, or anomalies discovered.
Mention any limitations or challenges faced during the analysis.
Suggest next steps or areas that need further investigation.
Effective communication is critical for ensuring that your EDA efforts
have a meaningful impact and that your insights are understood
and acted upon by stakeholders.
Exploratory Data Analysis (EDA) can be performed using a variety of
tools and software, each offering features that cater to different data
and analysis needs.
In Python, libraries like Pandas are essential for data manipulation,
providing functions to clean, filter, and transform data. Matplotlib is
used for creating basic static, interactive, and animated visualizations,
while Seaborn, built on top of Matplotlib, allows for the creation of
more attractive and informative statistical plots. For interactive and
advanced visualizations, Plotly is an excellent choice.
In R, packages like ggplot2 are powerful for creating complex and
visually appealing plots from data frames. dplyr helps in data
manipulation, making tasks like filtering and summarizing easier,
and tidyr ensures your data is in a tidy format, making it easier to
work with.
Applications and Use Cases of Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a foundational process that has
widespread applications across various industries and domains. By
understanding the dataset and uncovering hidden patterns, EDA
facilitates better decision-making, problem-solving, and model
building.
1. Business and Marketing
Applications:
Customer Segmentation: Analyze demographic and behavioral
data to group customers with similar characteristics for targeted
marketing.
Churn Analysis: Identify patterns in customer behavior to
predict and reduce churn.
Sales Forecasting: Analyze historical sales data to identify
trends and make predictions.
Campaign Effectiveness: Evaluate the impact of marketing
campaigns by analyzing response rates, conversions, and ROI.
Use Case:
A retail company uses EDA to analyze customer purchasing data. By
visualizing product sales trends, they identify high-performing
products and seasonal buying patterns, enabling better inventory
management and promotional strategies.
2. Healthcare
Applications:
Patient Outcome Prediction: Analyze medical history and
demographic data to predict health outcomes.
Disease Progression: Study patterns in patient data to track the
progression of diseases.
Clinical Trials: Evaluate experimental results by analyzing
differences between control and treatment groups.
Resource Allocation: Analyze hospital occupancy and patient
flow to optimize resource allocation.
Use Case:
A hospital conducts EDA on patient admission data to identify peak
hours and days, which helps optimize staff scheduling and reduce
wait times.
3. Finance
Applications:
Fraud Detection: Spot anomalies in transaction data to detect
fraudulent activities.
Risk Analysis: Assess credit risk by analyzing customer
financial behavior and historical data.
Portfolio Analysis: Evaluate investment performance and
optimize asset allocation.
Customer Lifetime Value: Predict future revenue
from customers using transactional and behavioral data.
Use Case:
A bank performs EDA on loan repayment data to identify factors
influencing defaults. This insight helps refine credit scoring models
and lending policies.
4. Technology and IT
Applications:
User Behavior Analysis: Understand user interactions with
software or websites to improve user experience.
System Monitoring: Detect anomalies in server logs and
network traffic for cybersecurity purposes.
Product Performance: Analyze feature usage and identify
issues affecting application performance.
Chatbot Training: Use EDA to preprocess and understand
conversational data for chatbot development.
Use Case:
An e-commerce platform uses EDA to analyze website navigation
patterns. Insights from this analysis lead to improvements in website
design, reducing bounce rates and increasing conversions.
5. Education
Applications:
Student Performance Analysis: Identify factors that influence
student performance.
Curriculum Development: Analyze feedback and success rates
to design effective curricula.
Retention Rates: Predict and address issues leading to student
dropout.
Adaptive Learning: Personalize learning paths based on EDA
of student data.
Use Case:
An online learning platform uses EDA to analyze quiz scores and
engagement metrics. The findings guide the creation of personalized
learning recommendations for students.
6. Manufacturing and Supply Chain
Applications:
Quality Control: Analyze production data to identify and
minimize defects.
Predictive Maintenance: Use sensor data to predict equipment
failures and schedule maintenance.
Supply Chain Optimization: Study delivery times and
inventory levels to optimize supply chains.
Energy Efficiency: Analyze machine performance to reduce
energy consumption.
Use Case:
A car manufacturer conducts EDA on sensor data from assembly lines to
detect patterns associated with defects. This helps improve the
manufacturing process and product quality.
7. Entertainment and Media
Applications:
Content Recommendation: Analyze user viewing habits to
improve recommendation systems.
Audience Analysis: Identify demographics and preferences to
tailor content.
Box Office Predictions: Use historical data to forecast movie
performance.
Trend Analysis: Analyze social media and streaming data to
identify emerging trends.
Use Case:
A streaming service uses EDA to analyze user viewing habits. This
insight improves their recommendation algorithms, increasing user
engagement.
8. Energy and Utilities
Applications:
Consumption Forecasting: Predict energy demand based
on historical consumption data.
Grid Optimization: Analyze grid performance data to enhance
efficiency and reliability.
Renewable Energy: Study weather patterns and energy output
to optimize renewable energy sources.
Leak Detection: Identify anomalies in usage data to detect leaks
or unauthorized consumption.
Use Case:
An energy company analyzes household energy consumption patterns
to design targeted energy-saving programs.
9. Social Sciences and Public Policy
Applications:
Survey Analysis: Analyze survey data to identify public
opinion trends.
Crime Analysis: Spot patterns in crime data to aid in resource
allocation and prevention strategies.
Policy Impact Assessment: Evaluate the effectiveness of
policies by analyzing social and economic data.
Demographic Studies: Study population data for urban
planning and resource allocation.
Use Case:
A government agency uses EDA on crime data to identify high-risk
areas, enabling better deployment of law enforcement resources.
10. Sports Analytics
Applications:
Player Performance Analysis: Assess player strengths and
weaknesses.
Team Strategy: Analyze game data to develop strategies.
Injury Prevention: Use sensor and health data to identify risk
factors for injuries.
Fan Engagement: Study fan behavior and preferences
to enhance the spectator experience.
Use Case:
A football team uses EDA to analyze match performance data.
Insights from this analysis help coaches refine tactics and training
programs.
Benefits of EDA in Use Cases
Improved Decision-Making: Provides data-driven insights to
inform strategies.
Resource Optimization: Identifies inefficiencies and helps
allocate resources effectively.
Enhanced User Experience: Uncovers patterns in user
behavior for personalized experiences.
Risk Mitigation: Detects anomalies and reduces exposure to
potential risks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, confusion_matrix, ConfusionMatrixDisplay
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output
data = pd.read_csv('/content/FINANCE.csv')
data
{"type":"dataframe","variable_name":"data"}
print("Shape of the dataset:", data.shape)
data.info()
Shape of the dataset: (869, 27)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 869 entries, 0 to 868
Data columns (total 27 columns):
# Column Non-Null Count Dtype
0 COMM_NAME 869 non-null object
1 COMM_CODE 869 non-null int64
2 COMM_WT 869 non-null float64
3 INDEX2011 0 non-null float64
4 INDEX20112012 0 non-null float64
5 INDEX2012 0 non-null float64
6 INDEX20122013 869 non-null float64
7 INDEX2013 869 non-null float64
8 INDEX20132014 869 non-null float64
9 INDEX2014 869 non-null float64
10 INDEX20142015 869 non-null float64
11 INDEX2015 869 non-null float64
12 INDEX20152016 869 non-null float64
13 INDEX2016 869 non-null float64
14 INDEX20162017 869 non-null float64
15 INDEX2017 869 non-null float64
16 INDEX20172018 869 non-null float64
17 INDEX2018 869 non-null float64
18 INDEX20182019 869 non-null float64
19 INDEX2019 869 non-null float64
20 INDEX20192020 869 non-null float64
21 INDEX2020 869 non-null float64
22 INDEX20202021 869 non-null float64
23 INDEX2021 869 non-null float64
24 INDEX20212022 869 non-null float64
25 INDEX2022 869 non-null float64
26 INDEX20222023 869 non-null float64
dtypes: float64(25), int64(1), object(1)
memory usage: 183.4+ KB
print("Missing values per column:\n", data.isnull().sum())
Missing values per column:
COMM_NAME 0
COMM_CODE 0
COMM_WT 0
INDEX2011 869
INDEX20112012 869
INDEX2012 869
INDEX20122013 0
INDEX2013 0
INDEX20132014 0
INDEX2014 0
INDEX20142015 0
INDEX2015 0
INDEX20152016 0
INDEX2016 0
INDEX20162017 0
INDEX2017 0
INDEX20172018 0
INDEX2018 0
INDEX20182019 0
INDEX2019 0
INDEX20192020 0
INDEX2020 0
INDEX20202021 0
INDEX2021 0
INDEX20212022 0
INDEX2022 0
INDEX20222023 0
dtype: int64
# Impute the remaining missing numeric values with each column's median
data = data.fillna(data.median(numeric_only=True))
duplicates = data.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")
Number of duplicate rows: 0
data = data.drop_duplicates()
print("\nSummary Statistics:")
print(data.describe())
Summary Statistics:
          COMM_CODE     COMM_WT  INDEX2011  INDEX20112012  INDEX2012
count  8.690000e+02  869.000000        0.0            0.0        0.0
mean   1.275946e+09    0.595963        NaN            NaN        NaN
std    8.216739e+07    4.293702        NaN            NaN        NaN
min    1.000000e+09    0.000020        NaN            NaN        NaN
25%    1.301100e+09    0.022430        NaN            NaN        NaN
50%    1.310070e+09    0.075530        NaN            NaN        NaN
75%    1.316040e+09    0.241250        NaN            NaN        NaN
max    2.000000e+09  100.000000        NaN            NaN        NaN

       INDEX20122013   INDEX2013  INDEX20132014   INDEX2014  INDEX20142015
count     869.000000  869.000000     869.000000  869.000000     869.000000
mean      106.347641  110.648446     111.990449  115.296663     115.661105
std         9.487981   14.442772      15.022158   16.131469      16.871443
min        38.300000   59.200000      63.300000   60.000000      59.700000
25%       102.100000  103.800000     104.500000  106.300000     106.100000
50%       105.100000  108.300000     109.800000  113.200000     113.500000
75%       109.200000  114.800000     116.300000  121.100000     121.900000
max       186.800000  282.300000     280.200000  285.000000     266.200000

       ...   INDEX2018  INDEX20182019   INDEX2019  INDEX20192020   INDEX2020
count  ...  869.000000     869.000000  869.000000     869.000000  869.000000
mean   ...  121.806904     122.758688  125.337514     125.905293  126.047986
std    ...   23.562135      24.368001   28.692250      30.252793   31.007760
min    ...   40.900000      30.900000   44.300000      43.600000   41.800000
25%    ...  109.300000     109.600000  109.500000     109.600000  109.100000
50%    ...  119.000000     119.800000  119.700000     119.500000  119.900000
75%    ...  131.400000     133.200000  136.500000     137.500000  138.300000
max    ...  288.300000     311.200000  337.900000     350.800000  400.600000

       INDEX20202021   INDEX2021  INDEX20212022   INDEX2022  INDEX20222023
count     869.000000  869.000000     869.000000  869.000000     869.000000
mean      127.237399  136.333372     140.165132  148.902532     149.627043
std        30.525569   32.319046      34.624997   39.374676      39.602056
min        42.200000   44.300000      44.800000   47.600000      48.200000
25%       110.400000  116.600000     119.300000  125.800000     126.600000
50%       121.200000  130.500000     133.500000  140.900000     141.400000
75%       140.700000  150.100000     155.900000  164.200000     165.500000
max       371.600000  327.000000     372.200000  413.500000     410.300000

[8 rows x 26 columns]
categorical_columns = data.select_dtypes(include=['object']).columns
print(f"\nCategorical Columns: {categorical_columns.tolist()}")
Categorical Columns: ['COMM_NAME']
numeric_columns = data.select_dtypes(include=[np.number]).columns
print(f"\nNumeric Columns: {numeric_columns.tolist()}")
Numeric Columns: ['COMM_CODE', 'COMM_WT', 'INDEX2011',
'INDEX20112012', 'INDEX2012', 'INDEX20122013', 'INDEX2013',
'INDEX20132014', 'INDEX2014', 'INDEX20142015', 'INDEX2015',
'INDEX20152016', 'INDEX2016', 'INDEX20162017', 'INDEX2017',
'INDEX20172018', 'INDEX2018', 'INDEX20182019', 'INDEX2019',
'INDEX20192020', 'INDEX2020', 'INDEX20202021', 'INDEX2021',
'INDEX20212022', 'INDEX2022', 'INDEX20222023']
# Binary actual labels: is INDEX2022 above its median?
y_true = (data['INDEX2022'] > data['INDEX2022'].median()).astype(int)
# Binary predicted labels: is INDEX2021 above its median?
y_pred = (data['INDEX2021'] > data['INDEX2021'].median()).astype(int)
# Generate the confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Visualize the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
# Output the confusion matrix for reference
print("Confusion Matrix:")
print(cm)
Confusion Matrix:
[[389 46]
[ 47 387]]
for column in numeric_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(data[column], kde=True, bins=30)
    plt.title(f"Distribution of {column}")
    plt.show()
for column in categorical_columns:
    print(f"\nValue counts for {column}:")
    print(data[column].value_counts())
Value counts for COMM_NAME:
COMM_NAME
ALL COMMODITIES 1
Rubber cloth/sheet 1
Plasticizer 1
Polyester film(metalized) 1
Gelatine 1
..
Cranes 1
Material handling, lifting and hoisting equipment 1
Deep freezers 1
Grinding or polishing machine 1
e. Manufacture of medical and dental instruments and supplies 1
Name: count, Length: 869, dtype: int64