DSP Unit - II


ASHOKA WOMEN’S ENGINEERING COLLEGE (AUTONOMOUS)

UNIT-2
EXPLORATORY DATA ANALYSIS (EDA)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate datasets and
summarize their main characteristics, often using statistical graphics and other data visualization methods.
It helps determine how best to manipulate data sources to get the answers you need, making it easier
for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
The main objectives of EDA are to:
1. Understand the Data: Gain insights into the data distribution, structure, and patterns.
2. Detect Anomalies and Outliers: Identify any unusual observations or data points that may require
further investigation.
3. Formulate Hypotheses: Generate initial hypotheses about relationships and patterns in the data that can
be further tested.
4. Prepare for Modeling: Inform the selection of appropriate statistical techniques and machine learning
algorithms for subsequent analysis.
Exploratory data analysis tools:
Exploratory Data Analysis (EDA) can be performed using a variety of tools and libraries.
Here are some popular tools commonly used for EDA:
 Python
 Pandas: Pandas is a powerful data manipulation library in Python that provides data structures and functions
for working with structured data.
 NumPy: NumPy is a fundamental package for scientific computing in Python, providing support for arrays,
matrices, and mathematical functions.
 Matplotlib: Matplotlib is a plotting library for Python that provides a MATLAB-like interface for creating
static, interactive, and animated visualizations.
 Seaborn: Seaborn is a statistical data visualization library based on Matplotlib, providing a high-level
interface for drawing attractive and informative statistical graphics.
 Plotly: Plotly is an interactive visualization library in Python that supports a wide range of chart types,
including scatter plots, line plots, bar charts, and more.
 R
 ggplot2: ggplot2 is a plotting system for R based on the grammar of graphics, allowing users to create
complex plots with ease.
 dplyr: dplyr is a data manipulation library for R that provides a set of functions for filtering, summarizing,
and transforming data.
 tidyr: tidyr is a library for R that provides functions for reshaping and tidying data, making it easier to work
within the context of EDA.
 ggvis: ggvis is an interactive visualization library for R based on ggplot2, allowing users to create interactive
plots with linked brushing and dynamic tooltips.
 Jupyter Notebooks
 Jupyter Notebooks provide an interactive computing environment that allows users to create and share
documents containing live code, equations, visualizations, and narrative text.
 Jupyter Notebooks support multiple programming languages, including Python, R, Julia, and Scala, making
them a versatile tool for conducting EDA.
V. ARUNA KUMARI-Asst. Professor-Dept of MCA

 Tableau
 Tableau is a powerful data visualization tool that allows users to create interactive dashboards and
visualizations without writing code.
 Tableau supports a wide range of data sources and provides drag-and-drop functionality for building
visualizations, making it accessible to users with varying levels of technical expertise.
 Excel
 Microsoft Excel is a widely used spreadsheet application that provides basic data analysis and visualization
capabilities.
 Excel supports features such as pivot tables, charts, and conditional formatting, making it suitable for simple
EDA tasks and data exploration by non-technical users. These are just a few examples of tools and libraries
commonly used for exploratory data analysis.
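As a sketch of how these tools fit together, here is a minimal first look at a dataset with Pandas (the data and column names below are made up purely for illustration):

```python
# A minimal first look at a dataset with Pandas (toy data).
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 25, 31, 35, 42, 29, 31],
    "income": [30000, 32000, 45000, 52000, 61000, 38000, 47000],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.describe())    # count, mean, std, min, quartiles, max
print(df.isna().sum())  # missing values per column
```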
Types of Exploratory Data Analysis:
Exploratory Data Analysis (EDA) encompasses a variety of techniques and methods to understand and
analyze data. Here are some common types of EDA techniques:
 Univariate Analysis: Analyzing individual variables in isolation to understand their distribution, central
tendency, and variability. Techniques used in univariate analysis include histograms, box plots, frequency
tables, and summary statistics.
 Bivariate Analysis: Examining the relationship between two variables to understand patterns, dependencies,
and correlations. Techniques used in bivariate analysis include scatter plots, correlation coefficients, and
cross-tabulations.
 Multivariate Analysis: Analyzing the relationships among multiple variables simultaneously to uncover
complex patterns and dependencies. Techniques used in multivariate analysis include principal component
analysis (PCA), factor analysis, and cluster analysis.
 Summary Statistics: Calculating summary statistics such as mean, median, mode, variance, and standard
deviation to summarize the central tendency, dispersion, and shape of the data.
 Data Visualization: Creating visual representations of the data using plots and charts such as histograms,
box plots, scatter plots, bar plots, and heatmaps to explore patterns and relationships.
 Outlier Detection: Identifying unusual observations or data points that deviate significantly from the rest of
the data. Techniques used in outlier detection include z-scores, Tukey's method, and visualization methods
such as box plots and scatter plots.
 Missing Values Handling: Dealing with missing data through techniques such as imputation or deletion,
depending on the nature and extent of missingness.
 Feature Engineering: Creating new features or transforming existing ones to better represent the underlying
patterns in the data and improve model performance.
 Dimensionality Reduction: Reducing the dimensionality of the data using techniques such as principal
component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) to visualize high-
dimensional data and identify clusters or patterns.
 Correlation Analysis: Investigating the relationships between pairs of variables using correlation
coefficients (e.g., Pearson correlation, Spearman correlation) and visualizations such as correlation matrices
and scatter plots.
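A minimal sketch of univariate and bivariate analysis with Pandas (the toy data below is illustrative):

```python
# Univariate and bivariate analysis with Pandas (toy data).
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 55, 61, 64, 70, 74],
})

# Univariate: distribution of a single variable
print(df["exam_score"].describe())

# Bivariate: correlation between two variables (Pearson by default)
r = df["hours_studied"].corr(df["exam_score"])
print(f"Pearson r = {r:.3f}")
```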


THE LIFE CYCLE OF DATA SCIENCE:


The data science lifecycle, also known as the data science process or workflow, is a systematic approach to
solving data-related problems and extracting actionable insights from data.

 Business Understanding:
 Identify the business problem or opportunity that can be addressed through data analysis.
 Clearly define the objectives and key performance indicators (KPIs) that will measure the success
of the data science project. These objectives should be aligned with the overall business goals
and priorities.
 You need to understand, for example, whether the customer wants to minimize losses or predict
the price of a commodity.
 Data Understanding:
 It involves gaining a comprehensive understanding of the data that will be used for analysis.
 This stage is crucial for ensuring that the data is relevant, accurate, and sufficient for addressing
the business problem.
 Gather relevant data from various sources, including databases, files, APIs, web scraping, or sensors.
 Ensure that the data collected is comprehensive and covers all relevant aspects of the problem domain.
 Data Preparation:
 Clean the data by handling missing values, outliers, and inconsistencies.
 Perform data transformation and feature engineering to create new variables.
 Explore the data to understand its distribution, relationships, and patterns.
 Exploratory Data Analysis (EDA):
 Conduct exploratory data analysis to gain insights into the data and identify potential relationships
and patterns.
 Visualize the data using plots, charts, and summary statistics to understand its characteristics.
 Data Modeling:
 Select appropriate machine learning algorithms or statistical models based on the nature of the
problem and the data.
 Split the data into training and testing sets to train and evaluate the models.
 Train multiple models and fine-tune hyperparameters to optimize performance.

 Model Evaluation:
 Evaluate the performance of the models using appropriate metrics such as accuracy, precision, recall,
F1-score, or area under the ROC curve (AUC).
 Compare the performance of different models and select the best-performing one for deployment.
 Model Deployment:
 Deploy the trained model into production, either as a standalone application, integrated into existing
systems, or as a web service/API.
 Monitor the model's performance in production and make necessary adjustments or updates as needed.

DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that focuses on summarizing and describing the features of a
dataset. These statistics provide insights into the central tendency, variability, and distribution of the data.
Descriptive statistics are useful for gaining a better understanding of the dataset before performing more
advanced analyses. Here are some common descriptive statistics:
 Measures of Central Tendency:
 Mean: The arithmetic average of all the values in the dataset.
 Median: The middle value of the dataset when it is sorted in ascending order.
 Mode: The most frequently occurring value(s) in the dataset.

 Measures of Variability:
 Range: The difference between the maximum and minimum values in the dataset.
 Variance: The average of the squared differences from the mean.
 Standard Deviation: The square root of the variance, representing the average deviation of
data points from the mean.
 Measures of Shape and Distribution:
 Skewness: A measure of the asymmetry of the distribution.
 Kurtosis: A measure of the "peakedness" or "flatness" of the distribution.

 Percentiles and Quartiles:


 Percentiles: Values that divide the dataset into 100 equal parts, indicating the relative position of a
data point within the dataset.
 Quartiles: Values that divide the dataset into four equal parts (25th, 50th, and 75th percentiles),
providing insights into the spread of the data.

 Frequency Distribution:
 A tabular or graphical representation of the number of times each value occurs in the dataset,
showing the distribution of values.

Descriptive statistics can be calculated using various tools and software, including statistical software
packages like Python's NumPy and pandas libraries, R, and Microsoft Excel. These statistics help summarize
the essential characteristics of the dataset, making it easier to interpret and draw insights from the data.
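For instance, most of these measures can be computed with Python's standard-library `statistics` module alone (the sample below is made up):

```python
# Descriptive statistics for a small sample using only the standard library.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 2]

print("mean:    ", statistics.mean(data))       # central tendency
print("median:  ", statistics.median(data))
print("mode:    ", statistics.mode(data))       # most frequent value
print("variance:", statistics.variance(data))   # sample variance
print("stdev:   ", statistics.stdev(data))      # sample standard deviation
print("range:   ", max(data) - min(data))       # max minus min
```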


BASIC TOOLS OF EDA


Exploratory Data Analysis (EDA) involves the use of various tools and techniques to gain insights into a
dataset and understand its key characteristics. Here are some basic tools commonly used in EDA:
 Statistical Summary:
Summary statistics such as mean, median, mode, standard deviation, variance, minimum, maximum, and
quartiles provide a basic overview of the dataset's central tendency, spread, and distribution.
 Histograms:
Histograms visualize the distribution of numerical variables by dividing the data into bins and plotting the
frequency or count of observations within each bin. They help identify patterns, outliers, and the shape of the
distribution.
 Box Plots:
Box plots (also known as box-and-whisker plots) provide a graphical summary of the distribution of numerical
variables. They display the median, quartiles, and potential outliers, making it easy to compare the
distributions of different variables or groups.
 Scatter Plots:
Scatter plots visualize the relationship between two numerical variables by plotting data points on a two-
dimensional graph. They help identify correlations, trends, clusters, and outliers in the data.
 Pair Plots:
Pair plots (or scatterplot matrices) display scatter plots for pairs of numerical variables in a dataset. They
allow for a quick visual inspection of relationships between multiple variables.
 Correlation Heatmaps:
Correlation heatmaps visualize the correlation coefficients between pairs of numerical variables in a dataset.
They use colors to represent the strength and direction of correlations, making it easy to identify variables that
are highly correlated or inversely correlated.
 Bar Plots:
Bar plots (or bar charts) visualize the distribution of categorical variables by plotting the frequency or count
of each category as bars. They help compare the frequency of different categories and identify dominant
categories in the dataset.
 Pie Charts:
Pie charts represent the distribution of categorical variables as slices of a pie, where each slice corresponds to a
category's proportion of the total. They are useful for visualizing the composition of categorical variables.
 Summary Tables:
Summary tables provide tabular summaries of data, including counts, percentages, and summary statistics for
categorical variables. They help organize and present data in a concise format. These are some basic tools
commonly used in Exploratory Data Analysis.
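A minimal sketch of three of these plots with Matplotlib (toy data; the Agg backend is used so no display is required):

```python
# Three basic EDA plots side by side with Matplotlib (toy data).
import matplotlib
matplotlib.use("Agg")              # headless backend; safe without a display
import matplotlib.pyplot as plt

values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 9]
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(9, 3))
ax1.hist(values, bins=5)           # distribution of one variable
ax1.set_title("Histogram")
ax2.boxplot(values)                # median, quartiles, outliers
ax2.set_title("Box plot")
ax3.scatter(x, y)                  # relationship between two variables
ax3.set_title("Scatter plot")
fig.savefig("eda_basics.png")
```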
PHILOSOPHY OF EDA
There are important reasons anyone working with data should do EDA.
 Namely, to gain intuition about the data;
 To make comparisons between distributions;
 For sanity checking (making sure the data is on the scale you expect, in the format you thought it
should be);

 To find out where data is missing or if there are outliers;


 To summarize the data.
In the context of data generated from logs, EDA also helps with debugging the logging process. For example,
“patterns” you find in the data could actually be something wrong in the logging process that needs to be fixed.
If you never go to the trouble of debugging, you’ll continue to think your patterns are real. The engineers we’ve
worked with are always grateful for help in this area.
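Sanity checks of this kind can be written as plain assertions, for example with Pandas (the columns and plausible ranges below are illustrative assumptions):

```python
# Sanity checks as simple assertions (toy data; ranges are assumptions).
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 29, 41], "height_cm": [160, 172, 181, 168]})

# Scale checks: is the data in the units/range we expect?
assert df["age"].between(0, 120).all(), "age outside a plausible range"
assert df["height_cm"].between(50, 250).all(), "height not in centimetres?"

# Format checks: expected columns present, no missing values
assert {"age", "height_cm"} <= set(df.columns)
assert not df.isna().any().any()
print("all sanity checks passed")
```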
DATA VISUALIZATION
Data visualization is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and
patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive
amounts of information and make data-driven decisions.
More specific examples of methods to visualize data:
 Bar Chart
 Line Chart
 Pie Chart
 Scatter Plots
 Box Plot
 Histograms
 Heat Maps
 Bar Chart:
 A bar chart is a visual representation of data using rectangular bars.
 The bars can be vertical or horizontal, and their lengths are proportional to the data they represent.
 Bar Charts are also known as bar graphs or bar diagrams.
Parts of a Bar Graph:
The main parts of a bar graph include:
 Title: Describes the purpose or subject of the graph.
 X-axis (horizontal axis): Represents the categories or groups being compared.
 Y-axis (vertical axis): Displays the values or quantities corresponding to each category.
 Bars: Vertical or horizontal rectangles representing the data values for each category.
 Data labels: Numerical values attached to the bars to show the exact measurement.
 Legend: Explains the meaning of different colors or patterns if multiple data sets are presented.
 Scale: The units or intervals used on the axes to measure and represent the data accurately.

Bar Graph Types:


The different types of bar graph are :
 Vertical Bar Graph
 Horizontal Bar Graph
 Grouped Bar Graph
Vertical Bar Graph: Vertical bar graph is a type of data visualization technique used to represent data using
vertical bars or columns. It is also known as a vertical bar chart.


Horizontal Bar Graph: Horizontal bar graphs are graphs whose rectangular bars lie horizontally.
This means that the frequency of the data lies on the x-axis while the categories of the data lie on the y-axis.

Grouped Bar Graph: Grouped bar graphs are the bar charts in which multiple sets of data items are compared,
with a single color used to denote a specific series across all sets. It is also called the clustered bar graph.
A grouped bar graph compares different sets of data items. The grouped bar graph can be represented
using both vertical and horizontal bar charts.
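A grouped bar graph can be sketched with Matplotlib roughly as follows (the sales figures are made up):

```python
# A grouped (clustered) bar chart with Matplotlib (toy sales data).
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 35, 30, 35]
product_b = [25, 32, 34, 20]

x = range(len(quarters))
width = 0.35                       # width of each bar

fig, ax = plt.subplots()
# Shift each series sideways so the two bars sit next to each other
ax.bar([i - width / 2 for i in x], product_a, width, label="Product A")
ax.bar([i + width / 2 for i in x], product_b, width, label="Product B")
ax.set_xticks(list(x))
ax.set_xticklabels(quarters)
ax.set_ylabel("Sales")
ax.set_title("Grouped bar graph")
ax.legend()
fig.savefig("grouped_bar.png")
```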


 Line Chart:
 A line graph represents two or more variables as lines or curves, making trends in the data easier to
visualize and understand.
 It displays data that changes continuously over time.
 In a line graph, data points are plotted as points and connected by line segments.
Parts of Line Graph:
 Title: It is nothing but the title of the graph drawn.
 Axes: The line graph contains two axes i.e. X-axis and Y-axis.
 Labels: The name given to the x-axis and y-axis.
 Line: It is the line segment that is used to connect two or more data points.
 Point: It is nothing but a point given at each segment.

Types of Line Graph:


Let us discuss the types of line graphs:
 Simple Line Graph
 Multiple Line Graph
 Compound Line Graph
Simple Line Graph: It is the most common type of line graph in which a single line represents the
relationship between two variables over time. The below diagram is an example of a basic line graph.

Multiple Line Graph: It is the type of line graph in which two or more lines are drawn on a single
graph; the lines can belong to the same category or to different ones, which makes it easy to make comparisons
between them. A double line graph is a special case of a multiple line graph. An example of a multiple line
graph is shown below:


Compound Line Graph: It is a type of line graph in which multiple lines or data are combined into a single
graph showing different categories or variables. The main aim of a compound line graph is to represent or
display the relationship between different variables on a single graph.
A Compound Line graph example is shown below:
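The three line-graph types can be sketched with Matplotlib along these lines (toy data; here the compound graph is drawn as stacked lines with `stackplot`):

```python
# Simple, multiple, and compound (stacked) line graphs with Matplotlib.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022]
series_a = [10, 12, 15, 18]
series_b = [8, 11, 13, 14]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 3))
ax1.plot(years, series_a, marker="o")             # simple line graph
ax1.set_title("Simple")
ax2.plot(years, series_a, marker="o", label="A")  # multiple line graph
ax2.plot(years, series_b, marker="s", label="B")
ax2.set_title("Multiple")
ax2.legend()
ax3.stackplot(years, series_a, series_b)          # compound: stacked series
ax3.set_title("Compound")
fig.savefig("line_graphs.png")
```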

 Pie Chart:
 A pie chart is a pictorial or graphical representation of data in chart format.
 A pie chart uses a circle to represent the data, where the full circle represents the entire data set and
the slices represent its parts.
 A pie chart is one of the easiest ways to present and understand the given data, and pie charts are used
very commonly.
 For example, pie charts are used in Excel very often.

Pie Chart Formula:


We know that the total value of the pie is always 100%. It is also known that a circle subtends an angle of 360°.
Hence, the total of all the data is equal to 360°. Based on these, there are two main formulas used in pie charts:
 To calculate the percentage of the given data, we use the formula: (Frequency ÷ Total Frequency) × 100
 To convert the data into degrees we use the formula: (Given Data ÷ Total value of Data) × 360°
We can work out the percentage for a given pie chart using the steps given below,
 Categorize the given data and calculate the total
 Divide each category’s value by the total
 Convert the data into percentages
 Calculate the degrees
Let us understand the above steps using an example.
 Example: The pie chart shown below shows the percentages of types of transportation used by 500 students
to come to school. With this given information, answer the following questions:
a) How many students come to school by bicycle?
b) How many students do not walk to school?
c) How many students come to school by bus and car?


Solution:
a) The students who come by bicycle = 25%; (25/100) × 500 = 25 × 5 = 125
b) The students who do not walk to school - We need to add the values of all the remaining means,
i.e., bus + car + bicycle = 26 + 32 + 25 = 83
Hence, (83/100) × 500 = 83 × 5 = 415 students do not walk to school.
c) The students who come by bus and car [(32 + 26)/100] × 500 = 58 × 5 = 290.
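The arithmetic above can be checked in plain Python; note that the walking share (17%) is inferred here so that the slices total 100%:

```python
# Checking the pie-chart example arithmetic (walk% inferred as 100 - 83).
total = 500
pct = {"walk": 17, "bus": 26, "car": 32, "bicycle": 25}
assert sum(pct.values()) == 100

bicycle = pct["bicycle"] * total / 100                 # a) 125.0 students
not_walking = (100 - pct["walk"]) * total / 100        # b) 415.0 students
bus_and_car = (pct["bus"] + pct["car"]) * total / 100  # c) 290.0 students

# Converting a slice to degrees: (value / total value) * 360
bicycle_degrees = pct["bicycle"] * 360 / 100           # 90.0 degrees

print(bicycle, not_walking, bus_and_car, bicycle_degrees)
```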

 Scatter Plots:
 A scatter plot is a chart type that is normally used to observe and visually display the relationship
between variables.
 The values of the variables are represented by dots.
 The positioning of the dots on the vertical and horizontal axis will inform the value of the respective
data point; hence, scatter plots make use of Cartesian coordinates to display the values of the variables
in a data set.
 Scatter plots are also known as scattergrams, scatter graphs, or scatter charts.

Scatter plot Example:


Let us understand how to construct a scatter plot with the help of the below example.
Question: Draw a scatter plot for the given data that shows the number of games played and scores obtained
in each instance.
No. of games:   3    5    2    6    7    1    2    7    1    7
Scores:        80   90   75   80   90   50   65   85   40  100


Solution:
X-axis or horizontal axis: Number of games
Y-axis or vertical axis: Scores
Now, the scatter graph will be:
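The same scatter graph can be drawn with Matplotlib:

```python
# Scatter plot of the games-vs-scores data from the example above.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

games  = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

fig, ax = plt.subplots()
ax.scatter(games, scores)
ax.set_xlabel("Number of games")
ax.set_ylabel("Scores")
fig.savefig("scatter_games.png")
```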

 Histogram:
 A histogram is a graphical representation of a grouped frequency distribution with continuous classes.
 It is an area diagram and can be defined as a set of rectangles with bases along with the intervals
between class boundaries and with areas proportional to frequencies in the corresponding classes.
 In such representations, all the rectangles are adjacent since the base covers the intervals between class
boundaries.
 The heights of the rectangles are proportional to the corresponding frequencies when the classes have
equal width; for classes of unequal width, the heights are proportional to the frequency densities.
 In other words, a histogram is a diagram of rectangles whose area is proportional to the frequency
of a variable and whose width is equal to the class interval.

Types of Histogram:
 Uniform histogram
 Bimodal histogram
 Symmetric histogram

Uniform Histogram: A uniform histogram is one in which each class has roughly the same number of elements.
It can indicate that the number of classes is too small, and it may hide a distribution that actually has several peaks.


Bimodal Histogram: If a histogram has two peaks, it is said to be bimodal. Bimodality occurs when the data set
contains observations on two different kinds of individuals or combined groups, and the centers of the two separate
histograms are far enough apart relative to the variability in both data sets.

Symmetric Histogram: A symmetric histogram is also called a bell-shaped histogram. When you draw the
vertical line down the center of the histogram, and the two sides are identical in size and shape, the histogram is
said to be symmetric.

Histogram Example:
Question: The following table gives the lifetime of 400 neon lamps. Draw the histogram for the below data.
Lifetime (in hours)    Number of lamps
300 – 400              14
400 – 500              56
500 – 600              60
600 – 700              86
700 – 800              74
800 – 900              62
900 – 1000             48
Solution:
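The histogram can be drawn with Matplotlib; since the data is already grouped into classes of width 100, bars placed at the class boundaries reproduce it:

```python
# Histogram of the grouped lamp-lifetime data from the question above.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

lower_bounds = [300, 400, 500, 600, 700, 800, 900]
lamps        = [14, 56, 60, 86, 74, 62, 48]

fig, ax = plt.subplots()
# align="edge" with width=100 makes each bar span its class interval
ax.bar(lower_bounds, lamps, width=100, align="edge", edgecolor="black")
ax.set_xlabel("Lifetime (in hours)")
ax.set_ylabel("Number of lamps")
fig.savefig("lamp_histogram.png")
```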


 HeatMap:
 A heatmap is a graphical representation of data where values are depicted by colors.
 Heatmaps make it easy to visualize complex data and understand it.
 A heat map is a way to represent data points in a data set in a visual manner.
 All heat maps share one thing in common: they use different colors or different shades of the same
color to represent different values and to communicate the relationships that may exist between the
variables plotted on the x-axis and y-axis.
 Usually, a darker color or shade represents a higher or greater quantity of the value being represented in
the heat map.
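A minimal correlation heatmap can be sketched with Matplotlib's `imshow` (the columns below are toy data chosen so the correlations are obvious):

```python
# A correlation heatmap via imshow (toy data).
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],    # negatively correlated with x
})
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # color encodes r
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("corr_heatmap.png")
```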

 BoxPlots:
 When we display the data distribution in a standardized way using the five-number summary (minimum,
Q1 (first quartile), median, Q3 (third quartile), and maximum), it is called a boxplot. It is
also termed a box-and-whisker plot.
Parts of BoxPlots:
Check the image below which shows the minimum, maximum, first quartile, third
quartile, median and outliers.


Minimum: The minimum value in the given dataset.


First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset into two equal parts.
The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is known as the
interquartile range. (i.e.) IQR = Q3-Q1
Outlier: Data points that fall far to the left or right of the ordered data are candidate outliers. Generally,
an outlier falls more than a specified distance (commonly 1.5 × IQR) from the first and third quartiles.
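The 1.5 × IQR rule (Tukey's fences) can be sketched in plain Python; the data below is made up so that one obvious outlier appears:

```python
# Flagging outliers with the 1.5 * IQR rule (Tukey's fences), toy data.
data = sorted([10, 11, 12, 12, 12, 13, 13, 14, 15, 100])

def median(xs):
    m = len(xs) // 2
    return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

half = len(data) // 2
q1 = median(data[:half])           # median of the lower half -> 12
q3 = median(data[half:])           # median of the upper half -> 14
iqr = q3 - q1                      # 2
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences at 9.0 and 17.0
outliers = [x for x in data if x < low or x > high]
print(outliers)
```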
Box Plot Example:
Example:
Find the maximum, minimum, median, first quartile, third quartile for the given data set:
23, 42, 12, 10, 15, 14, 9.
Solution:
Given: 23, 42, 12, 10, 15, 14, 9.
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 42
Hence,
Minimum = 9
Maximum = 42
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
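The same worked example can be reproduced in plain Python, computing the quartiles as medians of the lower and upper halves of the sorted data:

```python
# Five-number summary for the box-plot example above.
data = sorted([23, 42, 12, 10, 15, 14, 9])   # [9, 10, 12, 14, 15, 23, 42]

def median(xs):
    m = len(xs) // 2
    return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2

n = len(data)
q2 = median(data)                  # median of all 7 values -> 14
q1 = median(data[:n // 2])         # median of [9, 10, 12] -> 10
q3 = median(data[n // 2 + 1:])     # median of [15, 23, 42] -> 23
print(min(data), q1, q2, q3, max(data))
```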