DSP Unit - II
UNIT-2
EXPLORATORY DATA ANALYSIS (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate datasets and
summarize their main characteristics, often using statistical graphics and other data visualization methods.
It helps determine how best to manipulate data sources to get the answers you need, making it easier
for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
The main objectives of EDA are to:
1. Understand the Data: Gain insights into the data distribution, structure, and patterns.
2. Detect Anomalies and Outliers: Identify any unusual observations or data points that may require
further investigation.
3. Formulate Hypotheses: Generate initial hypotheses about relationships and patterns in the data that can
be further tested.
4. Prepare for Modeling: Inform the selection of appropriate statistical techniques and machine learning
algorithms for subsequent analysis.
Exploratory data analysis tools:
Exploratory Data Analysis (EDA) can be performed using a variety of tools and libraries.
Here are some popular tools commonly used for EDA:
Python
Pandas: Pandas is a powerful data manipulation library in Python that provides data structures and functions
for working with structured data.
NumPy: NumPy is a fundamental package for scientific computing in Python, providing support for arrays,
matrices, and mathematical functions.
Matplotlib: Matplotlib is a plotting library for Python that provides a MATLAB-like interface for creating
static, interactive, and animated visualizations.
Seaborn: Seaborn is a statistical data visualization library based on Matplotlib, providing a high-level
interface for drawing attractive and informative statistical graphics.
Plotly: Plotly is an interactive visualization library in Python that supports a wide range of chart types,
including scatter plots, line plots, bar charts, and more.
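As a minimal sketch of how Pandas supports a first look at a dataset (the DataFrame contents below are made up purely for illustration; the plotting libraries above then work on top of data prepared this way):

```python
import pandas as pd

# A small, made-up dataset used only to illustrate the calls below.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000, 52000, 61000, 58000, 45000],
    "city": ["Hyderabad", "Kurnool", "Hyderabad", "Chennai", "Kurnool"],
})

print(df.head())      # first rows: a quick look at the raw data
print(df.describe())  # summary statistics for the numeric columns
print(df.dtypes)      # column types: a first structural check
```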
R
ggplot2: ggplot2 is a plotting system for R based on the grammar of graphics, allowing users to create
complex plots with ease.
dplyr: dplyr is a data manipulation library for R that provides a set of functions for filtering, summarizing,
and transforming data.
tidyr: tidyr is a library for R that provides functions for reshaping and tidying data, making it easier to work
within the context of EDA.
ggvis: ggvis is an interactive visualization library for R, inspired by ggplot2, allowing users to create
interactive plots with linked brushing and dynamic tooltips.
Jupyter Notebooks
Jupyter Notebooks provide an interactive computing environment that allows users to create and share
documents containing live code, equations, visualizations, and narrative text.
Jupyter Notebooks support multiple programming languages, including Python, R, Julia, and Scala, making
them a versatile tool for conducting EDA.
V. ARUNA KUMARI-Asst. Professor-Dept of MCA
ASHOKA WOMEN’S ENGINEERING COLLEGE (AUTONOMOUS)
Tableau
Tableau is a powerful data visualization tool that allows users to create interactive dashboards and
visualizations without writing code.
Tableau supports a wide range of data sources and provides drag-and-drop functionality for building
visualizations, making it accessible to users with varying levels of technical expertise.
Excel
Microsoft Excel is a widely used spreadsheet application that provides basic data analysis and visualization
capabilities.
Excel supports features such as pivot tables, charts, and conditional formatting, making it suitable for
simple EDA tasks and data exploration by non-technical users.
These are just a few examples of tools and libraries commonly used for exploratory data analysis.
Types of Exploratory Data Analysis:
Exploratory Data Analysis (EDA) encompasses a variety of techniques and methods to understand and
analyze data. Here are some common types of EDA techniques:
Univariate Analysis: Analyzing individual variables in isolation to understand their distribution, central
tendency, and variability. Techniques used in univariate analysis include histograms, box plots, frequency
tables, and summary statistics.
Bivariate Analysis: Examining the relationship between two variables to understand patterns, dependencies,
and correlations. Techniques used in bivariate analysis include scatter plots, correlation coefficients, and
cross-tabulations.
Multivariate Analysis: Analyzing the relationships among multiple variables simultaneously to uncover
complex patterns and dependencies. Techniques used in multivariate analysis include principal component
analysis (PCA), factor analysis, and cluster analysis.
Summary Statistics: Calculating summary statistics such as mean, median, mode, variance, and standard
deviation to summarize the central tendency, dispersion, and shape of the data.
Data Visualization: Creating visual representations of the data using plots and charts such as histograms,
box plots, scatter plots, bar plots, and heatmaps to explore patterns and relationships.
Outlier Detection: Identifying unusual observations or data points that deviate significantly from the rest of
the data. Techniques used in outlier detection include z-scores, Tukey's method, and visualization methods
such as box plots and scatter plots.
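Both outlier rules named above can be sketched in a few lines of NumPy; the sample below is made up, with one deliberate outlier:

```python
import numpy as np

# Made-up sample with one obvious outlier (250).
data = np.array([12, 15, 14, 10, 13, 250, 11, 16, 14, 12], dtype=float)

# z-score rule: with such a small sample the outlier inflates the standard
# deviation, so a slightly lower cutoff of 2.5 is used here instead of 3.
z = (data - data.mean()) / data.std(ddof=1)
z_outliers = data[np.abs(z) > 2.5]

# Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
tukey_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers)
print("Tukey outliers  :", tukey_outliers)
```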
Missing Values Handling: Dealing with missing data through techniques such as imputation or deletion,
depending on the nature and extent of the missingness.
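Both strategies, deletion and imputation, can be sketched with Pandas on a made-up column with gaps:

```python
import pandas as pd
import numpy as np

# Made-up column with gaps (NaN marks a missing value).
s = pd.Series([20.0, np.nan, 23.0, 21.0, np.nan, 25.0])

print(s.isna().sum())      # how many values are missing
print(s.dropna())          # deletion: drop the incomplete rows
print(s.fillna(s.mean()))  # imputation: fill gaps with the mean
```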
Feature Engineering: Creating new features or transforming existing ones to better represent the underlying
patterns in the data and improve model performance.
Dimensionality Reduction: Reducing the dimensionality of the data using techniques such as principal
component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) to visualize high-
dimensional data and identify clusters or patterns.
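PCA can be sketched by hand with NumPy (center the data, then take the top singular vectors); the dataset below is synthetic, built so that its third column nearly duplicates the first:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 3-D data whose third column is a noisy copy of the first,
# so most of the variance lives in fewer than 3 directions.
x = rng.normal(size=(100, 3))
x[:, 2] = x[:, 0] + 0.05 * rng.normal(size=100)

# PCA by hand: center the data, then take the top singular vectors.
centered = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T       # data projected onto the first 2 PCs
explained = (s**2) / (s**2).sum()  # fraction of variance per component

print("explained variance ratios:", explained.round(3))
```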
Correlation Analysis: Investigating the relationships between pairs of variables using correlation
coefficients (e.g., Pearson correlation, Spearman correlation) and visualizations such as correlation matrices
and scatter plots.
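A correlation matrix over a made-up dataset can be sketched with Pandas; since Spearman correlation is just Pearson correlation applied to ranks, it can be computed without any extra dependency:

```python
import pandas as pd

# Made-up data: sales rise with ad spend (positive correlation expected).
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [15, 24, 36, 42, 55],
    "returns":  [ 5,  4,  4,  3,  2],
})

print(df.corr(method="pearson"))         # linear correlation matrix
print(df.rank().corr(method="pearson"))  # Spearman = Pearson on ranks
```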
Data Science Process:
EDA is one phase of the broader data science process, whose stages are outlined below.
Business Understanding:
Identify the business problem or opportunity that can be addressed through data analysis.
Clearly define the objectives and key performance indicators (KPIs) that will measure the success
of the data science project. These objectives should be aligned with the overall business goals
and priorities.
For example, you need to understand whether the customer wants to minimize losses or to predict the
price of a commodity.
Data Understanding:
It involves gaining a comprehensive understanding of the data that will be used for analysis.
This stage is crucial for ensuring that the data is relevant, accurate, and sufficient for addressing
the business problem.
Gather relevant data from various sources, including databases, files, APIs, web scraping, or sensors.
Ensure that the data collected is comprehensive and covers all relevant aspects of the problem domain.
Data Preparation:
Clean the data by handling missing values, outliers, and inconsistencies.
Perform data transformation and feature engineering to create new variables.
Explore the data to understand its distribution, relationships, and patterns.
Exploratory Data Analysis (EDA):
Conduct exploratory data analysis to gain insights into the data and identify potential relationships
and patterns.
Visualize the data using plots, charts, and summary statistics to understand its characteristics.
Data Modeling:
Select appropriate machine learning algorithms or statistical models based on the nature of the
problem and the data.
Split the data into training and testing sets to train and evaluate the models.
Train multiple models and fine-tune hyperparameters to optimize performance.
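The train/test split step can be sketched without any ML library by shuffling row indices with NumPy (the feature matrix and labels below are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(100).reshape(100, 1)  # made-up feature matrix
y = np.arange(100)                  # made-up labels

# Shuffle the row indices, then carve off 20% as a held-out test set.
idx = rng.permutation(len(x))
split = int(0.8 * len(x))
train_idx, test_idx = idx[:split], idx[split:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(len(x_train), len(x_test))  # 80 20
```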
Model Evaluation:
Evaluate the performance of the models using appropriate metrics such as accuracy, precision, recall,
F1-score, or area under the ROC curve (AUC).
Compare the performance of different models and select the best-performing one for deployment.
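The metrics named above follow directly from the confusion-matrix counts; a sketch in plain Python with made-up labels:

```python
# Made-up true and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 3))
```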
Model Deployment:
Deploy the trained model into production, either as a standalone application, integrated into existing
systems, or as a web service/API.
Monitor the model's performance in production and make necessary adjustments or updates as needed.
DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that focuses on summarizing and describing the features of a
dataset. These statistics provide insights into the central tendency, variability, and distribution of the data.
Descriptive statistics are useful for gaining a better understanding of the dataset before performing more
advanced analyses. Here are some common descriptive statistics:
Measures of Central Tendency:
Mean: The arithmetic average of all the values in the dataset.
Median: The middle value of the dataset when it is sorted in ascending order.
Mode: The most frequently occurring value(s) in the dataset.
Measures of Variability:
Range: The difference between the maximum and minimum values in the dataset.
Variance: The average of the squared differences from the mean.
Standard Deviation: The square root of the variance, representing the average deviation of
data points from the mean.
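All of these measures are available in Python's standard-library `statistics` module; a sketch on a small made-up sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 4, 8]  # a small made-up sample

print("mean    :", statistics.mean(data))
print("median  :", statistics.median(data))
print("mode    :", statistics.mode(data))
print("range   :", max(data) - min(data))
print("variance:", statistics.variance(data))  # sample variance (n - 1)
print("std dev :", statistics.stdev(data))
```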
Measures of Shape and Distribution:
Skewness: A measure of the asymmetry of the distribution.
Kurtosis: A measure of the "peakedness" or "flatness" of the distribution.
Frequency Distribution:
A tabular or graphical representation of the number of times each value occurs in the dataset,
showing the distribution of values.
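A tabular frequency distribution can be sketched with the standard-library `Counter`; the survey answers below are made up:

```python
from collections import Counter

# Made-up survey answers: preferred transport of 12 students.
answers = ["bus", "walk", "bus", "car", "bus", "walk",
           "bicycle", "bus", "car", "walk", "bus", "bicycle"]

freq = Counter(answers)
for value, count in freq.most_common():
    # A '#' bar per occurrence gives a crude text histogram.
    print(f"{value:8s} {count:2d}  {'#' * count}")
```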
Descriptive statistics can be calculated using various tools and software, including statistical software
packages like Python's NumPy and pandas libraries, R, and Microsoft Excel. These statistics help summarize
the essential characteristics of the dataset, making it easier to interpret and draw insights from the data.
Horizontal Bar Graph: Horizontal bar graphs are graphs whose rectangular bars lie horizontally.
This means the frequency of the data lies on the x-axis while the categories of the data lie on the y-axis.
Grouped Bar Graph: Grouped bar graphs are bar charts in which multiple sets of data items are compared,
with a single color used to denote a specific series across all sets. It is also called a clustered bar
graph, and it can be drawn with either vertical or horizontal bars.
Line Chart:
A line graph represents two or more variables as lines or curves, helping to visualize how they change
and making the data easier to understand.
It displays data that changes continuously with respect to time.
In a line graph, consecutive data points are connected by line segments, and each data point is marked
with a dot or symbol.
Parts of Line Graph:
Title: The title of the graph.
Axes: The line graph contains two axes, the x-axis and the y-axis.
Labels: The names given to the x-axis and y-axis.
Line: The line segment used to connect two or more data points.
Point: The marker drawn at each data point.
Multiple Line Graph: A line graph in which two or more lines are drawn on a single set of axes. The lines
can belong to the same category or to different ones, which makes it easy to make comparisons between
them. A double line graph is a special case of a multiple line graph with exactly two lines.
Compound Line Graph: A type of line graph in which multiple lines or data series are combined into a
single graph showing different categories or variables. Its main aim is to display the relationship
between the different variables on a single graph.
Pie Chart:
A pie chart is a pictorial or graphical representation of data in chart format.
A pie chart uses a circle to represent the data, where the whole circle represents the entire dataset
and the slices represent its parts.
A pie chart is one of the easiest ways to present and understand given data, and pie charts are very
commonly used; for example, they are a standard chart type in Excel.
Example: A pie chart shows how 500 students travel to school: bus 26%, car 32%, bicycle 25%, and the
remaining 17% walk. Find (a) how many students come by bicycle, (b) how many do not walk, and (c) how
many come by bus or car.
Solution:
a) The students who come by bicycle = 25%; (25/100) × 500 = 25 × 5 = 125
b) The students who do not walk to school - we need to add the shares of all the other modes of
transport, i.e., bus + car + bicycle = 26 + 32 + 25 = 83
Hence, (83/100) × 500 = 83 × 5 = 415 students do not walk to school.
c) The students who come by bus or car: [(32 + 26)/100] × 500 = 58 × 5 = 290.
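The pie-chart arithmetic above can be checked in plain Python, assuming shares of bus 26%, car 32%, bicycle 25%, and walk 17% (the walking share is the remainder of 100%):

```python
# Percentage share of each mode of transport (assumed from the worked example).
total = 500
share = {"bus": 26, "car": 32, "bicycle": 25, "walk": 17}

bicycle = share["bicycle"] * total // 100
not_walking = (share["bus"] + share["car"] + share["bicycle"]) * total // 100
bus_and_car = (share["bus"] + share["car"]) * total // 100

print(bicycle, not_walking, bus_and_car)  # 125 415 290
```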
Scatter Plots:
A scatter plot is a chart type that is normally used to observe and visually display the relationship
between variables.
The values of the variables are represented by dots.
The positioning of the dots on the vertical and horizontal axis will inform the value of the respective
data point; hence, scatter plots make use of Cartesian coordinates to display the values of the variables
in a data set.
Scatter plots are also known as scattergrams, scatter graphs, or scatter charts.
Example: The scores obtained by a player in 10 games are given below. Draw a scatter plot for the data.
Game   1   2   3   4   5   6   7   8   9   10
Score  80  90  75  80  90  50  65  85  40  100
Solution:
Take the number of games on the x-axis (horizontal axis) and the scores on the y-axis (vertical axis),
and plot one point per game to obtain the scatter graph.
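The scatter plot can be drawn with Matplotlib; the game numbers 1-10 are assumed x-values for the ten scores:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

games = list(range(1, 11))  # game number 1..10 (assumed x values)
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

fig, ax = plt.subplots()
ax.scatter(games, scores)        # one dot per (game, score) pair
ax.set_xlabel("Number of games")
ax.set_ylabel("Scores")
ax.set_title("Scores per game")
fig.savefig("scatter.png")
```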
Histogram:
A histogram is a graphical representation of a grouped frequency distribution with continuous classes.
It is an area diagram and can be defined as a set of rectangles with bases along with the intervals
between class boundaries and with areas proportional to frequencies in the corresponding classes.
In such representations, all the rectangles are adjacent since the base covers the intervals between class
boundaries.
When all classes have the same width, the heights of the rectangles are proportional to the
corresponding frequencies; for classes of different widths, the heights are proportional to the
corresponding frequency densities.
In other words, a histogram is a diagram of rectangles whose areas are proportional to the frequencies
of a variable and whose widths equal the class intervals.
Types of Histogram:
Uniform histogram
Bimodal histogram
Symmetric histogram
Uniform Histogram: A uniform histogram has roughly the same number of elements in each class. It often
indicates that the number of classes is too small, and the underlying distribution may in fact have
several peaks.
Bimodal Histogram: If a histogram has two peaks, it is said to be bimodal. Bimodality occurs when the
data set contains observations on two different kinds of individuals or combined groups, provided the
centers of the two separate histograms are far enough apart relative to the variability in both data
sets.
Symmetric Histogram: A symmetric histogram is also called a bell-shaped histogram. When you draw the
vertical line down the center of the histogram, and the two sides are identical in size and shape, the histogram is
said to be symmetric.
Histogram Example:
Question: The following table gives the lifetime of 400 neon lamps. Draw the histogram for the below data.
Lifetime (in hours) Number of lamps
300 – 400 14
400 – 500 56
500 – 600 60
600 – 700 86
700 – 800 74
800 – 900 62
900 – 1000 48
Solution: Since the classes are continuous with equal widths of 100 hours, draw adjacent rectangles
with the lifetime intervals on the x-axis and the number of lamps on the y-axis.
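With the data already grouped, the histogram above can be reproduced in Matplotlib as adjacent bars of width 100:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Lifetime classes from the table above (equal 100-hour widths).
left_edges = [300, 400, 500, 600, 700, 800, 900]
lamps = [14, 56, 60, 86, 74, 62, 48]

fig, ax = plt.subplots()
# Adjacent bars of width 100 reproduce a histogram from grouped data.
ax.bar(left_edges, lamps, width=100, align="edge", edgecolor="black")
ax.set_xlabel("Lifetime (in hours)")
ax.set_ylabel("Number of lamps")
fig.savefig("histogram.png")
```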
HeatMap:
A heatmap is a graphical representation of data where values are depicted by colors.
Heatmaps make it easy to visualize complex data and understand it.
A heat map represents the data points in a data set visually.
All heat maps share one thing in common: they use different colors, or different shades of the same
color, to represent different values and to communicate the relationships that may exist between the
variables plotted on the x-axis and y-axis.
Usually, a darker color or shade represents a higher or greater value in the heat map.
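A heatmap can be sketched with Matplotlib's `imshow` (Seaborn's `heatmap` offers a higher-level version of the same idea); the 4x4 grid of values below is made up:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np

# Made-up 4x4 grid of values; darker/lighter cells encode magnitude.
values = np.array([[0.2, 0.8, 0.5, 0.1],
                   [0.9, 0.3, 0.7, 0.4],
                   [0.6, 0.1, 0.8, 0.2],
                   [0.4, 0.5, 0.3, 0.9]])

fig, ax = plt.subplots()
im = ax.imshow(values, cmap="viridis")  # each cell colored by its value
fig.colorbar(im, ax=ax)                 # legend mapping color -> value
for i in range(4):                      # annotate cells with the numbers
    for j in range(4):
        ax.text(j, i, f"{values[i, j]:.1f}", ha="center", va="center")
fig.savefig("heatmap.png")
```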
BoxPlots:
A boxplot displays the data distribution in a standardized way using the five-number summary: minimum,
Q1 (first quartile), median, Q3 (third quartile), and maximum. It is also termed a box-and-whisker plot.
Parts of BoxPlots:
A box plot marks the minimum, first quartile, median, third quartile, and maximum, and shows outliers
as individual points beyond the whiskers.
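The five-number summary and the outlier rule behind a boxplot can be computed directly with NumPy; the sample below is made up, with one deliberate outlier:

```python
import numpy as np

# Made-up sorted sample; 89 sits well above the rest.
data = np.array([58, 59, 60, 61, 63, 66, 66, 67, 67, 68, 69,
                 70, 70, 70, 71, 71, 72, 73, 75, 78, 79, 89])

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

# Whiskers usually stop at the last points within 1.5 * IQR of the box;
# anything beyond is drawn as an individual outlier point.
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("outliers:", outliers)
```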