
PDF Experiments-1 DADV

The document outlines the steps for performing Exploratory Data Analysis (EDA) using Python, emphasizing the importance of data pre-processing and feature engineering. It details key processes such as data loading, cleaning, univariate and bivariate analysis, and visualization techniques using libraries like Pandas, NumPy, Matplotlib, and Seaborn. The conclusion highlights the significance of EDA in understanding datasets and informing further analysis.

Uploaded by

okkshrutii

CSE-5th SEM

303105315 - Data Analytics and Data Visualization Laboratory

Dr. Vinod Patidar


Asst. Prof. (CSE Dept.)
9009526270
Experiment-1

1. Perform Exploratory Data Analysis on the given dataset using Python.
Introduction to EDA
• The main objective of this Experiment is to cover
the steps involved in Data pre-processing, Feature
Engineering, and different stages of Exploratory
Data Analysis, which is an essential step in any
research analysis.
• Data pre-processing, Feature Engineering, and
EDA are fundamental early steps after data
collection.
What is Exploratory Data Analysis?
• Exploratory Data Analysis (EDA) is a method of analyzing
datasets to understand their main characteristics.
• It involves summarizing data features, detecting patterns, and
uncovering relationships through visual and statistical
techniques.
• EDA helps in gaining insights and formulating hypotheses for
further analysis.
What is Data Pre-processing and Feature Engineering?

• Data pre-processing involves cleaning and preparing raw data
to facilitate feature engineering. Feature engineering, in turn,
applies various techniques to manipulate the data. These may
include adding or removing features, handling missing data,
and encoding categorical variables, among other tasks.
• Feature Engineering is a critical task that significantly
influences the outcome of a model. It involves crafting new
features based on existing data, while pre-processing primarily
focuses on cleaning and organizing the data.
Let’s look at how to perform EDA using Python!

Step 1: Import Python Libraries

• Import all the libraries required for the analysis: data loading, statistical
analysis, visualization, data transformation, merges and joins, etc.
• The dataset is available here: https://www.kaggle.com/datasets/sukhmanibedi/cars4u/data
• Pandas and NumPy are used for data manipulation and numerical calculations.
• Matplotlib and Seaborn are used for data visualization.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Step 2: Reading Dataset

• The Pandas library can load data into a pandas DataFrame from many file
formats, including JSON, .csv, .xlsx, .sql, .pickle, .html, and .txt.

• Most datasets are distributed as tabular CSV files, which are widely supported
and easy to access. The read_csv() function reads a CSV file into a pandas DataFrame.

• This example uses a used-car price dataset. The aim of the EDA is to identify
the factors that influence the price of a used car. The data is stored in the
DataFrame data.
data.shape

OUTPUT: (27, 7)
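The loading step might look like the following minimal sketch. The file name used_cars.csv and every column other than S.No, Name, and Year are assumptions made for illustration. A small in-memory CSV stands in for the real file so that the snippet runs on its own, and its shape therefore differs from the output shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file. With the real dataset this
# would be something like pd.read_csv("used_cars.csv") (file name assumed).
csv_text = """S.No,Name,Year,Kilometers_Driven,Fuel_Type,Price
0,Maruti Wagon R,2010,72000,CNG,1.75
1,Hyundai Creta,2015,41000,Diesel,12.50
2,Honda Jazz,2011,46000,Petrol,4.50
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (rows, columns) of this small sample
print(data.head())  # first rows, to confirm the columns parsed as expected
```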
Step 3: Data Reduction

• Some columns or variables can be dropped if they do not add value to our analysis.

• In our dataset, the column S.No contains only ID values, so we assume it has
no predictive power for the dependent variable and drop it.
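Dropping the identifier column can be sketched as below. The sample rows and the Price column are illustrative assumptions; only the column name S.No comes from the text above.

```python
import pandas as pd

# Small illustrative sample (values are made up for the sketch).
data = pd.DataFrame({
    "S.No": [0, 1, 2],
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
    "Price": [1.75, 12.50, 4.50],
})

# S.No is only a row identifier, so it is removed before analysis.
data = data.drop(columns=["S.No"])

print(data.columns.tolist())
```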
Step 4: Feature Engineering

• Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive
model using machine learning or statistical modeling.

• The main goal of Feature engineering is to create meaningful data from raw data.
Step 5: Creating Features

• We will work with the variables Year and Name in our dataset. In the sample
data, the column “Year” holds the manufacturing year of the car.

• The age of the car is a contributing factor to its price, but it is hard to
read directly from the manufacturing year.

• We therefore introduce a new column, “Car_Age”, which holds the age of the car.
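One possible way to derive Car_Age is to subtract the manufacturing year from the current year, as in this sketch. Using today's date as the reference point is an assumption. A fixed reference year would make the result reproducible across runs.

```python
from datetime import date

import pandas as pd

# Illustrative sample; the Year values are made up for the sketch.
data = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
    "Year": [2010, 2015, 2011],
})

# Assumption: age is measured relative to the year the code is run.
current_year = date.today().year
data["Car_Age"] = current_year - data["Year"]

print(data[["Name", "Year", "Car_Age"]])
```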


Step 6: Exploratory Data Analysis (EDA)
Exploratory Data Analysis refers to the crucial process of performing initial
investigations on data to discover patterns and check assumptions with the help
of summary statistics and graphical representations.

• EDA can be used to check for outliers, patterns, and trends in the given
data.
• EDA helps to find meaningful patterns in data.
• EDA provides in-depth insights into the data sets to solve our business
problems.
• EDA gives clues on how to impute missing values in the dataset.
Step 7: Statistics Summary
• A statistics summary gives a quick and simple description of the data.
• It can include the count, mean, standard deviation, median, mode,
minimum value, maximum value, range, etc.
• It gives a high-level view of whether the data has outliers or data
entry errors, and of how the data is distributed, for example
normally distributed or left/right skewed.
Step 8: Statistics Summary…
In Python, this can be achieved using describe().

• describe() provides a statistics summary of the columns with a numerical
datatype, such as int and float.

• describe(include='all') provides a statistics summary of all columns,
including object, category, etc.
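The difference between the two calls can be seen on a small sample, as in this sketch. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample with two numerical columns and one text column.
data = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012],
    "Price": [1.75, 12.5, 4.5, 6.0],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel"],
})

# Numerical columns only: count, mean, std, min, quartiles, max.
numeric_summary = data.describe()

# All columns: adds unique, top, and freq for non-numerical columns.
full_summary = data.describe(include="all")

print(numeric_summary)
print(full_summary)
```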
Before we do EDA, let's separate the numerical and
categorical variables for easy analysis.
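A minimal sketch of this separation, assuming that every non-numerical column should be treated as categorical. The names num_cols and cat_cols and the sample columns are illustrative.

```python
import pandas as pd

# Illustrative sample; the values are made up for the sketch.
data = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
    "Year": [2010, 2015, 2011],
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
    "Price": [1.75, 12.5, 4.5],
})

# Split the column names by datatype.
num_cols = data.select_dtypes(include="number").columns.tolist()
cat_cols = data.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", num_cols)
print("Categorical:", cat_cols)
```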
Step 9: EDA Univariate Analysis
• Univariate analysis means analyzing and visualizing the dataset one variable
at a time.
• Data visualization is essential; we must decide which charts to plot to
better understand the data. In this experiment, we visualize our data using
the Matplotlib and Seaborn libraries.
• Matplotlib is a Python 2D plotting library used to draw basic charts.
• Seaborn is a Python library built on top of Matplotlib that creates and
styles statistical plots from Pandas and NumPy data in a few lines of code.
• Univariate analysis can be done for both categorical and numerical
variables.
• Categorical variables can be visualized using a count plot,
bar chart, pie plot, etc.
• Numerical variables can be visualized using a histogram, box
plot, density plot, etc.
• In our example, we have done a univariate analysis using a
histogram and a box plot for continuous variables.
• The histogram and box plot show the distribution of each
variable and reveal that some variables have skewness and
outliers.
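A histogram and box plot for one continuous variable could be drawn as in this sketch. The Price values are made up, and plain Matplotlib is used here to keep the example small. Seaborn's histplot() and boxplot() produce comparable charts.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample with one unusually high value.
data = pd.DataFrame({"Price": [1.75, 12.5, 4.5, 6.0, 17.74, 3.5, 2.35, 45.0]})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shape of the distribution.
ax_hist.hist(data["Price"], bins=5)
ax_hist.set_title("Price: histogram")

# Box plot: median, quartiles, and points flagged as outliers.
ax_box.boxplot(data["Price"])
ax_box.set_title("Price: box plot")

fig.tight_layout()

# A positive value indicates a right-skewed distribution.
skewness = data["Price"].skew()
print(skewness)
```

In an interactive session, plt.show() displays the figure.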
Exploratory Data Analysis in Python
Exploratory data analysis (EDA) is a critical initial step in the data
science workflow. It involves using Python libraries to inspect,
summarize, and visualize data to uncover trends, patterns, and
relationships. Here’s a breakdown of the key steps in performing
EDA with Python:
1. Importing Libraries:
• pandas (pd): For data manipulation and analysis.
• NumPy (np): For numerical computations.
• matplotlib.pyplot (plt): For basic plotting functionalities.
• Seaborn (sns): Built on top of Matplotlib, providing high-level visualization.
2. Loading the Data:

• Use pd.read_csv() for CSV files; similar functions exist for other data
formats (e.g., pd.read_excel() for .xlsx, pd.read_json() for .json).
3. Initial Inspection:
• Get an overview of the data using df.head(), df.tail(), and df.info().
• Check data types with df.dtypes.
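The inspection calls can be tried on a small sample, as in this sketch. The DataFrame contents are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample; the values are made up for the sketch.
df = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz", "Audi A4"],
    "Year": [2010, 2015, 2011, 2013],
    "Price": [1.75, 12.5, 4.5, 17.74],
})

print(df.head(2))  # first two rows
print(df.tail(2))  # last two rows
df.info()          # column names, non-null counts, dtypes, memory usage
print(df.dtypes)   # datatype of each column
```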


4. Data Cleaning:

• Identify missing values with df.isnull().sum(), then handle them, for
example by dropping or imputing them.

• Count duplicate rows with df.duplicated().sum(), and remove them if needed.
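A minimal cleaning sketch on made-up data follows. Filling the missing Price with the column median is only one possible choice and is an assumption here; dropping the row with dropna() is another option.

```python
import numpy as np
import pandas as pd

# Illustrative sample with one duplicated row and one missing price.
df = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Hyundai Creta", "Honda Jazz"],
    "Year": [2010, 2015, 2015, 2011],
    "Price": [1.75, 12.5, 12.5, np.nan],
})

missing_per_column = df.isnull().sum()  # number of missing values per column
duplicate_rows = df.duplicated().sum()  # number of fully repeated rows

# Remove repeated rows, then fill the remaining gap with the median price.
cleaned = df.drop_duplicates()
cleaned = cleaned.fillna({"Price": cleaned["Price"].median()})

print(missing_per_column)
print(duplicate_rows)
print(cleaned)
```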


5. Univariate Analysis:

• Analyze single variables at a time.
• Use descriptive statistics with df.describe() for numerical data.
• Create histograms, box plots, and density plots to visualize distributions.


6. Bivariate Analysis:
• Explore relationships between two variables.
• Create scatter plots to identify trends and potential correlations.
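A scatter plot and a correlation coefficient for two numerical variables might be produced as in this sketch. The Car_Age and Price values are made up to show a downward trend and are not taken from the dataset.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample: price falls as the car gets older.
df = pd.DataFrame({
    "Car_Age": [2, 4, 6, 8, 10, 12],
    "Price": [14.0, 11.5, 8.0, 6.5, 4.0, 2.5],
})

fig, ax = plt.subplots()
ax.scatter(df["Car_Age"], df["Price"])
ax.set_xlabel("Car_Age")
ax.set_ylabel("Price")
ax.set_title("Price vs. Car_Age")

# Pearson correlation: values near -1 or 1 indicate a strong linear relationship.
correlation = df["Car_Age"].corr(df["Price"])
print(correlation)
```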
7. Visualization:

• Effective visualizations are crucial for understanding data.
• Use plots such as bar charts and pie charts to represent categorical
data, and heatmaps to show relationships such as correlations between
variables.
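The sketch below draws a bar chart of category counts and a correlation heatmap. The data is made up, and plain Matplotlib is used to keep the example self-contained. Seaborn's countplot() and heatmap() offer the same charts with less code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample; the values are made up for the sketch.
df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG", "Petrol", "Diesel"],
    "Year": [2010, 2015, 2011, 2012, 2013, 2016],
    "Price": [2.5, 12.5, 4.5, 1.75, 6.0, 14.0],
})

counts = df["Fuel_Type"].value_counts()  # rows per category
corr = df[["Year", "Price"]].corr()      # correlation matrix of numerical columns

fig, (ax_bar, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart for the categorical variable.
ax_bar.bar(list(counts.index), counts.values)
ax_bar.set_title("Cars per Fuel_Type")

# Heatmap of the correlation matrix, on a fixed -1 to 1 colour scale.
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns)
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(corr.index)
ax_heat.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax_heat)

fig.tight_layout()
```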
Conclusion
In conclusion,
• Exploratory Data Analysis (EDA) is crucial for understanding datasets,
identifying patterns, and informing subsequent analysis.
• Data pre-processing and feature engineering are essential steps in
preparing data for analysis, involving tasks such as data reduction,
cleaning, and transformation.
• Python libraries offer powerful tools for executing these steps
efficiently.
Thanks…
