EDA Unit 1
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any
data science project, you need to determine the basic requirements, priorities, and project budget.
In this phase, we determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform
the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish
the relation between input variables. We will apply Exploratory Data Analysis (EDA) using various
statistical formulas and visualization tools to understand the relations between variables and to see what the data
can tell us. Common tools used for model planning are:
o SQL Analysis Services
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We will create datasets for training
and testing purposes. We will apply different techniques such as association, classification, and clustering
to build the model.
Following are some common Model building tools:
o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings,
code, and technical documents. This phase provides a clear overview of the complete project's performance
and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal that we set in the
initial phase. We communicate the findings and final results to the business team.
Data Science Techniques
Here are some of the technical concepts you should know about before starting to learn what is data science.
Machine Learning: Machine learning is the backbone of data science. Data Scientists need to have a solid
grasp of ML in addition to basic knowledge of statistics.
Modeling: Mathematical models enable you to make quick calculations and predictions based on what you
already know about the data. Modeling is also a part of Machine Learning and involves identifying which
algorithm is the most suitable to solve a given problem and how to train these models.
Statistics: Statistics are at the core of data science. A sturdy handle on statistics can help you extract more
intelligence and obtain more meaningful results.
Hypothesis testing
Central limit theorem
z-test and t-test
Correlation coefficients
Sampling techniques
Programming: Some level of programming is required to execute a successful data science project. The
most common programming languages are Python and R. Python is especially popular because it is easy to learn
and supports multiple libraries for data science and ML.
Database: A capable data scientist needs to understand how databases work, how to manage them, and
how to extract data from them.
Data Science Components:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to
collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: Domain expertise binds data science together. Domain expertise means specialized
knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science that involves acquiring, storing,
retrieving, and transforming the data. Data engineering also involves adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can
easily understand its significance. Data visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing,
writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of
quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about
training a machine so that it can act like a human brain. In data science, we use various machine
learning algorithms to solve problems.
1. Problem Formulation
The product managers or the stakeholders need to understand the problems associated with a particular
operation. It is one of the most crucial aspects of a Data Science pipeline. To frame a use case as a Data
Science problem, the subject matter experts must first understand the current work stream and the
nitty-gritty associated with it. A Data Science problem needs strong domain input, without which coming up with
a viable success criterion becomes challenging.
2. Data Sources
Once the problem is clearly defined, the product managers, along with the Data Scientist, need to work
together to figure out the data required and the various sources from which it may be acquired. The source of
data could be IoT sensors, cloud platforms like GCP, AWS, Azure, or even web-scraped data from social
media.
3. Exploratory Data Analysis
The next process in the pipeline is EDA, where the gathered data is explored and analyzed for any descriptive
pattern in the data. Often the common exploratory data analysis steps involve finding missing values,
checking for correlation among the variables, performing univariate, bi-variate, and multivariate analysis.
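As a rough illustration of these EDA steps, the following Python sketch (which assumes a hypothetical file named sales.csv loaded with pandas) checks for missing values, summarizes each variable, and inspects correlations:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the gathered data (sales.csv is a hypothetical file used only for illustration)
df = pd.read_csv("sales.csv")

# Find missing values in each column
print(df.isnull().sum())

# Univariate analysis: summary statistics for every numeric column
print(df.describe())

# Bivariate/multivariate analysis: correlation among the numeric variables
print(df.corr(numeric_only=True))

# Visual check of pairwise relationships
sns.pairplot(df)
plt.show()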
4. Feature Engineering
The process of EDA is followed by fetching key features from the raw data or creating additional features
based on the results of EDA and some domain experience. The process of feature engineering can be both
model-agnostic (e.g., finding correlation, forward selection, backward elimination) and model-dependent
(e.g., getting feature importance from tree-based algorithms).
5. Modelling
It largely depends on whether the scope of the project deems the usage of predictive, diagnostic, or
prescriptive modeling. In this step, a Data Scientist would try out multiple experiments using various
Machine Learning or Deep Learning algorithms. The trained models are validated against the test data to
check their performance. Once you have performed a thorough analysis of the data and decided on a suitable
model/algorithm, it is time to develop the real model and test it. Before building the model, you need to divide
the data into Training and Testing data sets. In normal circumstances, the Training data set constitutes 80% of
the data, while the Testing data set consists of the remaining 20%.
First, the Training data is used to build the model, and then the Testing data is used to evaluate whether the
developed model works correctly or not.
There are various packaged libraries in different programming languages (R, Python, MATLAB), which you
can use to build the model just by inputting the labeled training data.
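As a hedged sketch of this step, the snippet below uses scikit-learn to perform the 80/20 split, train a simple classifier, and evaluate it on the test set; the built-in Iris dataset stands in for real project data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example labeled data (the Iris dataset is used only for illustration)
X, y = load_iris(return_X_y=True)

# Divide the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model using the training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate whether the developed model works correctly using the testing data
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))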
6. Deployment
The models developed need to be hosted on an on-premises or cloud server for the end users to consume.
Highly optimized and scalable code must be written to put models into production.
7. Monitoring
After the models are deployed, it is necessary to set up a monitoring pipeline. Often the deployed models
suffer from various data drift challenges in real time which need to be monitored and dealt with accordingly.
8. User Acceptance
The data science project life cycle is only completed once the end-user has given a sign-off. The deployed
models are kept under observation for some time to validate their success against various business metrics.
Once that’s validated over a period, the users often give a sign-off for the closure of the project.
EDA Fundamentals
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring record sets to apprehend their
predominant traits, discover patterns, locate outliers, and identify relationships between variables. EDA is
normally carried out as a preliminary step before undertaking extra formal statistical analyses or modelling.
4. Feature Engineering: EDA allows for the exploration of variables and their transformations to create
new features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning,
encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between
variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the
strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on
certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the
data and can lead to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the
preliminary exploration of the data. It forms the foundation for further analysis and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It
involves checking for data integrity, consistency, and accuracy to make certain the data is suitable for
analysis.
Types of EDA
Depending on the number of columns we are analyzing, EDA can be divided into univariate and multivariate
analysis. More broadly, EDA refers to the process of examining and analyzing data sets to uncover patterns,
identify relationships, and gain insights. Various EDA techniques can be employed depending on the nature of
the data and the goals of the analysis. Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining individual variables in the
data set. It involves summarizing and visualizing a single variable at a time to understand its distribution,
central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, bar charts,
and summary statistics are commonly used in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps
find associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation
matrices, and cross-tabulation are commonly used techniques in bivariate analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to include more than two
variables. It aims to understand the complex interactions and dependencies among multiple variables
in a data set. Techniques such as heatmaps, parallel coordinates, factor analysis, and principal component
analysis (PCA) are used for multivariate analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal
component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the
data over time. Techniques like line plots, autocorrelation analysis, moving averages, and
ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it may impact the
reliability and validity of the analysis. Missing data analysis involves identifying missing values,
understanding the patterns of missingness, and using suitable techniques to deal with missing data. Techniques such as
missing data patterns, imputation strategies, and sensitivity analysis are employed in missing data analysis.
6. Outlier Analysis: Outliers are data points that significantly deviate from the general pattern of the
data. Outlier analysis involves identifying and understanding the presence of outliers, their potential causes, and
their impact on the analysis. Techniques such as box plots, scatter plots, z-scores, and clustering algorithms
are used for outlier analysis.
7. Data Visualization: Data visualization is a critical component of EDA that involves creating visual
representations of the data to facilitate understanding and exploration. Various visualization techniques,
including bar charts, histograms, scatter plots, line plots, heatmaps, and interactive dashboards, are used to
represent different kinds of data.
These are just a few examples of the types of EDA techniques that can be employed during data
analysis. The choice of techniques depends on the data characteristics, research questions, and the insights
sought from the analysis.
Understanding data science or Stages of EDA
➢ Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to the Cross-Industry Standard Process for data mining (CRISP) framework in data
mining.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data is required for the organization to be collected, curated, and stored.
• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several types of sensor data storage, such as sleep data, heart rate from the patient, electro-dermal activities, and user activity patterns.
• All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application.
• It is also required to categorize the data as numerical or categorical, and to decide the format of storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the correct format and transferred to the right information
technology personnel within a company.
• Data can be collected from several objects during several events using different types of sensors and storage
tools.
3. Data processing
• Preprocessing involves the process of pre-curating (selecting and organizing) the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing value check.
• This stage involves responsibilities such as matching the correct record, finding inaccuracies in the
dataset, understanding the overall data quality, removing duplicate items, and filling in the missing
values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets.
• An example of data cleaning is using outlier detection methods for quantitative data cleaning.
5. EDA
• Exploratory data analysis is the stage where the message contained in the data is actually understood.
• Several types of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent or exhibit relationships among different variables, such
as correlation or causation.
• These models or equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of
pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price is
dependent on the unit price. Hence, the total price is referred to as the dependent variable and the unit price is
referred to as an independent variable.
• In general, a model always describes the relationship between independent and dependent variables.
• Inferential statistics deals with quantifying relationships between particular variables.
• The model for describing the relationship between data, model, and the error still holds true:
Data = Model + Error
7. Data Product
• Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to
control the environment is referred to as a data product.
• A data product is generally based on a model developed during data analysis
• Example: a recommendation model that inputs user purchase history and recommends a related item that the
user is highly likely to buy.
8. Communication
• This stage deals with disseminating the results to end stakeholders to use the result for business intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts, summary diagrams, and
bar charts to show the analyzed result.
The significance of EDA
Different fields of science, economics, engineering, and marketing accumulate and store data
primarily in electronic databases. Appropriate and well-established decisions should be made using the data
collected. It is practically impossible to make sense of datasets containing more than a handful of data points
without the help of computer programs. To be certain of the insights that the collected data provides and to make
further decisions, data mining is performed where we go through distinctive analysis processes. Exploratory data
analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well
as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data
or insights for the next steps in a data mining project.
EDA actually reveals the ground truth about the data without making any underlying assumptions. This is why
data scientists use this process to understand what types of models and hypotheses can be created.
Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of
data.
➢ Python provides expert tools for exploratory analysis
• pandas for summarizing
• scipy, along with others, for statistical analysis
• matplotlib and plotly for visualizations
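A small sketch showing how these libraries can work together on a toy dataset (the numbers below are made up purely for illustration):

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# pandas for summarizing
df = pd.DataFrame({"age": [23, 31, 35, 41, 52, 60],
                   "income": [28, 42, 50, 61, 75, 90]})
print(df.describe())

# scipy for statistical analysis (Pearson correlation between age and income)
r, p_value = stats.pearsonr(df["age"], df["income"])
print("correlation:", r, "p-value:", p_value)

# matplotlib for visualization
plt.scatter(df["age"], df["income"])
plt.xlabel("age")
plt.ylabel("income (thousands)")
plt.show()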
Steps in EDA
The four different steps involved in exploratory data analysis are,
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
Problem definition: Before trying to extract useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan execution. The
main tasks involved in problem definition are defining the main objective of the analysis, defining the main
deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution plan can be
created.
Data preparation: This step involves methods for preparing the dataset before actual analysis. In this step, we
define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean
the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.
Data analysis: This is one of the most crucial steps that deals with descriptive statistics and analysis of the data.
The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data,
developing predictive models, evaluating the models, and calculating the accuracies. Some of the techniques
used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation
statistics, searching, grouping, and mathematical models.
Development and representation of the results: This step involves presenting the dataset to the target audience in
the form of graphs, summary tables, maps, and diagrams. This is also an essential step as the result analyzed
from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA.
Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual
plots, mean plots, and others.
Making Sense of Data
It is crucial to identify the type of data under analysis. In this section, we are going to learn about
different types of data that you can encounter during analysis. Different disciplines store different kinds of data
for different purposes. For example, medical researchers store patients' data, universities store students' and
teachers' data, and real estate industries store housing and building datasets. A dataset contains many observations
about a particular object. For instance, a dataset about patients in a hospital can contain many observations. A
patient can be described by a patient identifier (ID), name, address, weight, date of birth,
email, and gender. Each of these features that describes a patient is a variable. Each observation can have a
specific value for each of these variables. For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
These datasets are stored in hospitals and are presented for analysis. Most of this data is stored in some sort of
database management system in tables/schema. An example of a table for storing patient information is shown
here:
To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation
describes the variables (PatientID, name, address, dob, email, gender, and weight). Most datasets broadly fall
into two groups—numerical data and categorical data.
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and categorical data.
Numerical data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure,
heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is often
referred to as quantitative data in statistics. The numerical dataset can be either discrete or continuous types.
Discrete data
This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of heads
in 200 coin flips can take values from 0 to 200 (finite) cases. A variable that represents a discrete dataset is
referred to as a discrete variable. The discrete variable takes a fixed number of distinct values. For example,
the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank variable of a
student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data. A variable describing continuous data is a continuous variable. For example, what is the
temperature of your city today? Can its possible values be counted as finite? Similarly, the weight variable in the
previous section is a continuous variable.
A section of the table is shown in the following table:
Check the preceding table and determine which of the variables are discrete and which of the
variables are continuous. Can you justify your claim? Continuous data can follow an interval measure of scale or
ratio measure of scale.
Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type of address,
or categories of the movies. This data is often referred to as qualitative datasets in statistics. To understand
clearly, here are some of the most common types of categorical data you can find in data:
Gender (Male, Female, Other, or Unknown)
Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never
Married, Domestic Partner, Unmarried, Widowed, or Unknown)
Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery,
Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
Blood type (A, B, AB, or O)
Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or
Cannabis)
A variable describing categorical data is referred to as a categorical variable. These types of variables can have
one of a limited number of values. It is easier for computer science students to understand categorical values as
enumerated types or enumerations of variables. There are different types of categorical variables:
A binary categorical variable can take exactly two values and is also referred to as a dichotomous
variable. For example, when you create an experiment, the result is either success or failure. Hence, results can
be understood as a binary categorical variable.
Polytomous variables are categorical variables that can take more than two possible values. For
example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated,
married, polygamous, never married, domestic partner, unmarried, widowed, and unknown.
Since marital status can take more than two possible values, it is a polytomous variable.
Most of the categorical dataset follows either nominal or ordinal measurement scales.
Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal,
interval, and ratio. These scales are used more in academic and industrial settings. Let's understand each of them
with some examples.
Nominal
These are practiced for labeling variables without any quantitative value. The scales are generally referred to
as labels. And these scales are mutually exclusive and do not carry any numerical importance. Let's see some
examples:
What is your gender?
Male
Female
Third gender/Non-binary
I prefer not to answer
Other
Other examples include the following:
The languages that are spoken in a particular country
Biological species
Parts of speech in grammar (noun, pronoun, adjective, and so on)
Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)
Nominal scales are considered qualitative scales, and the measurements taken using qualitative scales are
considered qualitative data. However, advancements in qualitative research have created some confusion about
whether such data should definitely be considered qualitative. If, for example, someone uses numbers as labels
in the nominal measurement sense, those numbers have no concrete numerical value or meaning. No form of
arithmetic calculation can be made on nominal measures.
For example, in the case of a nominal dataset, you can certainly know the following:
Frequency is the number of times each label occurs in the dataset.
Proportion can be calculated by dividing the frequency of a label by the total number of events.
From the proportions, you can compute the percentage of each category.
To visualize a nominal dataset, you can use either a pie chart or a bar chart.
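A short pandas sketch of these ideas, using made-up blood-type labels as the nominal variable:

import pandas as pd
import matplotlib.pyplot as plt

blood_type = pd.Series(["A", "O", "B", "O", "AB", "A", "O", "A"])

frequency = blood_type.value_counts()      # how often each label occurs
proportion = frequency / frequency.sum()   # frequency divided by the total count
percentage = proportion * 100              # percentage of each proportion

print(frequency)
print(percentage)

# Visualize the nominal data with a bar chart
frequency.plot(kind="bar")
plt.show()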
Ordinal
The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a
significant factor. An easy tip to remember the ordinal scale is that it sounds like an order. Have you heard about
the Likert scale, which uses a variation of an ordinal scale? Let's check an example of ordinal scale using the
Likert scale: WordPress is making content managers' lives easier. How do you feel about this statement? The
following diagram shows the Likert scale:
As depicted in the preceding diagram, the answer to the question of WordPress is making content managers'
lives easier is scaled down to five different ordinal values, Strongly Agree, Agree, Neutral, Disagree,
and Strongly Disagree. Scales like these are referred to as the Likert scale. Similarly, the following diagram
shows more examples of the Likert scale:
To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on). The median item
is allowed as the measure of central tendency; however, the average is not permitted.
Interval
In interval scales, both the order and exact differences between the values are significant. Interval scales are
widely used in statistics, for example, in the measure of central tendencies—mean, median, mode, and standard
deviations. Examples include location in Cartesian coordinates and direction measured in degrees from magnetic
north. The mean, median, and mode are allowed on interval data.
Ratio
Ratio scales contain order, exact values, and an absolute zero, which makes them usable in descriptive and
inferential statistics. These scales provide numerous possibilities for statistical analysis. Mathematical
operations, the measures of central tendency, and the measures of dispersion and the coefficient of variation can
also be computed from such scales.
Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume. The
following table gives a summary of the data types and scale measures:
Comparing EDA with classical and Bayesian analysis
There are several approaches to data analysis. The most popular ones that are relevant to this book
are the following:
Classical data analysis: For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by analysis and result communication.
Exploratory data analysis approach: For the EDA approach, it follows the same approach as
classical data analysis except the model imposition and the data analysis steps are swapped. The main focus is on
the data, its structure, outliers, models, and visualizations. Generally, in EDA, we do not impose any
deterministic or probabilistic models on the data.
Bayesian data analysis approach: The Bayesian approach incorporates prior probability
distribution knowledge into the analysis steps as shown in the following diagram. Well, simply put, prior
probability distribution of any quantity expresses the belief about that particular quantity before considering
some evidence. Are you still lost with the term prior probability distribution? Andrew Gelman has a very
descriptive paper about prior probability distribution. The following diagram shows three different approaches
for data analysis illustrating the difference in their execution steps:
Data analysts and data scientists freely mix steps from the preceding approaches to get meaningful
insights from the data. In addition, it is essentially difficult to judge or estimate which approach is best for a
given data analysis. All of them have their own paradigms and are suitable for different types of data analysis.
Software tools available for EDA
Exploratory Data Analysis (EDA) involves examining datasets to summarize their main
characteristics, often with visual methods. There are several software tools available for EDA, each offering a
variety of features for data visualization, statistical analysis, and data manipulation. Here are some of the most
popular tools:
1. Python Libraries
Pandas: A powerful data manipulation and analysis library. It offers data structures like DataFrames and
functions for cleaning, aggregating, and analyzing data.
Matplotlib: A plotting library that produces static, animated, and interactive visualizations in Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical
graphics.
Plotly: An interactive graphing library that makes it easy to create interactive plots.
Scipy: A library used for scientific and technical computing.
Statsmodels: A library for estimating and testing statistical models.
Sweetviz: An open-source library that generates beautiful, high-density visualizations for EDA.
Pandas Profiling: A library that creates HTML profiling reports from pandas DataFrames.
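As a quick, hedged illustration of the automated-report libraries listed above, the sketch below generates a profiling report with pandas-profiling; it assumes the package is installed, and note that recent releases are published as ydata-profiling, so the import name may differ by version:

import pandas as pd
from pandas_profiling import ProfileReport  # newer versions: from ydata_profiling import ProfileReport

# A tiny, made-up dataset used only for illustration
df = pd.DataFrame({"age": [23, 31, 35, 41],
                   "city": ["Chennai", "Mumbai", "Delhi", "Chennai"]})

# Build an HTML report covering distributions, missing values, and correlations
profile = ProfileReport(df, title="EDA Report")
profile.to_file("eda_report.html")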
2. R Libraries
ggplot2: A data visualization package for the statistical programming language R.
dplyr: A data manipulation library that provides a grammar for data manipulation.
tidyr: Helps to tidy up data.
Shiny: Allows for building interactive web applications directly from R.
DataExplorer: An R package that simplifies EDA by providing functions to visualize data distributions,
correlations, and more.
3. Integrated Development Environments (IDEs)
Jupyter Notebooks: An open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text.
RStudio: An integrated development environment for R that includes a console, syntax-highlighting
editor, and tools for plotting, history, debugging, and workspace management.
4. Data Visualization Tools
Tableau: A powerful data visualization tool that allows for creating a wide range of interactive and
shareable dashboards.
Power BI: A business analytics service by Microsoft that provides interactive visualizations and business
intelligence capabilities.
QlikView: A business discovery platform that provides self-service BI for all business users in
organizations.
Looker: A data-discovery platform that offers a user-friendly approach to data exploration and
visualization.
5. Data Analysis Platforms
Google Data Studio: A free tool to create customizable reports and dashboards from data.
KNIME: An open-source platform for data analytics, reporting, and integration.
RapidMiner: A data science platform for data preparation, machine learning, deep learning, text
mining, and predictive analytics.
Weka: An open-source data mining package that includes several EDA tools and algorithms.
6. Other Tools
Orange: An open-source data visualization and analysis tool, with a visual programming interface.
Apache Zeppelin: A web-based notebook that enables data-driven, interactive data analytics and
collaborative documents with SQL, Scala, and more.
Databricks: An enterprise software company founded by the creators of Apache Spark, that
provides a unified analytics platform.
Python tools and packages
Visual aids are very useful tools. In this section, we will focus on the different types of visual aids that can be
used with our datasets and the different techniques used to visualize data.
Line chart: Line plots are used to show the trend or change in a numerical variable over time or some
other continuous dimension.
Bar chart: Bar plots are used to display the distribution of a categorical variable or to compare values
across different categories. They can be used for both single and grouped categorical data.
Scatter plot: Scatter plots are used to visualize the relationship between two numerical variables. Each
data point is plotted as a point on the graph, allowing you to observe patterns, trends, and correlations.
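A minimal matplotlib sketch of these three basic plots, using made-up monthly sales figures:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 160]      # illustrative numbers only
ad_spend = [20, 25, 27, 30]

# Line chart: trend of sales over time
plt.plot(months, sales)
plt.title("Monthly sales (line chart)")
plt.show()

# Bar chart: comparing values across categories
plt.bar(months, sales)
plt.title("Monthly sales (bar chart)")
plt.show()

# Scatter plot: relationship between two numerical variables
plt.scatter(ad_spend, sales)
plt.xlabel("Advertising spend")
plt.ylabel("Sales")
plt.title("Ad spend vs sales (scatter plot)")
plt.show()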
Pie chart: Pie charts are used to show the proportion of different categories within a whole. However,
they are best used when you have a small number of categories and clear proportions.
Table chart: A table chart is a chart that helps visually represent data that is arranged in rows and columns.
Throughout all forms of communication and research, tables are used extensively to store, analyze, compare, and
present data.
Polar chart: Polar charts are data visualizations best used for displaying multivariate observations with
an arbitrary number of variables in the form of a two-dimensional chart.
Lollipop chart
A lollipop chart can be a sweet alternative to a regular bar chart if you are dealing with a lot of categories and
want to make optimal use of space. It shows the relationship between a numeric and a categorical variable. This
type of chart consists of a line, which represents the magnitude, and ends with a dot, or a circle, which highlights
the data value.
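A small sketch of a lollipop chart built with matplotlib's stem plot (the category values are illustrative):

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D", "E"]
values = [5, 9, 3, 7, 6]   # illustrative numbers only

# The stem (line) represents the magnitude and the marker highlights the data value
plt.stem(range(len(categories)), values)
plt.xticks(range(len(categories)), categories)
plt.title("Lollipop chart")
plt.show()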
Violin Plot
Violin plots combine box plots with kernel density estimation, offering a detailed view of data distribution.
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is assumed to be a pandas DataFrame (for example, the Titanic passenger data)
# that contains the 'Pclass' and 'Age' columns.
sns.violinplot(x="Pclass", y="Age", data=data)
plt.show()
Box plot
Box plots represent the distribution and spread of data, useful for detecting outliers and understanding central
tendencies.
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is assumed to be the same DataFrame used above, with 'Pclass' and 'Age' columns.
sns.boxplot(x="Pclass", y="Age", data=data)
plt.show()
Heatmap
Heatmaps visualize the correlation between numerical features, helping to uncover dependencies in the data.
import seaborn as sns
import matplotlib.pyplot as plt

# Compute pairwise correlations between the numeric columns of 'data'
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()
Treemap: Treemaps display hierarchical data as nested rectangles, where the size and color of each rectangle
encode numerical values.
import plotly.express as px

data = px.data.tips()
fig = px.treemap(
    data, path=['day', 'time', 'sex'], values='total_bill',
    color='tip', hover_data=['tip'], color_continuous_scale='Viridis'
)
fig.update_layout(title_text="Tree Map with Heatmap Example")
fig.show()
Histograms: Histograms display the distribution of a single numerical variable by dividing it into bins
and showing the frequency or count of data points in each bin. They help you understand the central
tendency, spread, and shape of the data.
Box Plots (Box-and-Whisker Plots): Box plots provide a visual summary of the distribution of a
numerical variable. They show the median, quartiles, and potential outliers in the data.
Pair Plots: A pair plot (from Seaborn) displays pairwise relationships in a dataset, showing scatter plots
for pairs of numerical variables and histograms on the diagonal.
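A short sketch of a pair plot using seaborn's built-in iris dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# Load a small built-in dataset
iris = sns.load_dataset("iris")

# Scatter plots for each pair of numerical variables, with histograms on the diagonal
sns.pairplot(iris, hue="species")
plt.show()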
TYPES OF EXPLORATORY DATA ANALYSIS:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
Univariate Non-graphical: This is the simplest form of data analysis, where we use just
one variable to analyze the data. The standard goal of univariate non-graphical EDA is to understand
the underlying sample distribution of the data and make observations about the population. Outlier
detection is also part of this analysis. The characteristics of the population distribution include:
Central tendency: The central tendency or location of a distribution has to do with its typical or
middle values. The commonly useful measures of central tendency are the mean, median, and sometimes
the mode, of which the most common is the mean. For skewed distributions, or when there is concern
about outliers, the median may be preferred.
Spread: Spread is an indicator of how far from the center we are likely to find the data values.
The standard deviation and variance are two useful measures of spread. The variance is the mean of
the squares of the individual deviations, and the standard deviation is the square root of the variance.
Skewness and kurtosis: Two more useful univariate descriptors are the skewness and
kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a
more subtle measure of peakedness compared to a normal distribution.
Multivariate Non-graphical: Multivariate non-graphical EDA techniques are generally used to
show the relationship between two or more variables in the form of either cross-tabulation
or statistics.
For categorical data, an extension of tabulation called cross-tabulation is extremely useful.
For two variables, cross-tabulation is performed by making a two-way table with column
headings that match the levels of one variable and row headings that match the levels of
the other variable, then filling in the counts of all subjects that share the same
pair of levels.
For one categorical variable and one quantitative variable, we create statistics for
the quantitative variable separately for each level of the categorical variable, and then compare the
statistics across the levels of the categorical variable.
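A brief pandas sketch of both ideas: a cross-tabulation of two categorical variables and group-wise statistics of a quantitative variable (the small data frame below is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "smoker": ["yes", "no", "no", "no", "yes", "yes"],
    "age": [34, 29, 41, 52, 37, 45],
})

# Cross-tabulation: counts of subjects sharing each pair of levels
print(pd.crosstab(df["gender"], df["smoker"]))

# Statistics of a quantitative variable for each level of a categorical variable
print(df.groupby("gender")["age"].describe())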
Univariate graphical: Non-graphical methods are quantitative and objective, but they do not
give a complete picture of the data; therefore, graphical methods, which involve a degree of
subjective analysis, are also required. Common types of univariate
graphics are:
Histogram: The most basic graph is a histogram, which is a barplot in which
each bar represents the frequency (count) or proportion (count/total count) of cases for a
range of values. Histograms are one of the simplest ways to quickly learn a lot about your
data, including central tendency, spread, modality, shape, and outliers.
Stem-and-leaf plots: An easy substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots
are excellent at presenting information about central tendency, show robust measures of
location and spread, and provide information about symmetry and outliers, although
they can be misleading about aspects like multimodality. One of the simplest uses of
boxplots is in the form of side-by-side boxplots.
Quantile-normal plots: The final univariate graphical EDA technique is the most
intricate. It is called the quantile-normal or QN plot, or more generally the quantile-quantile
or QQ plot. It is used to see how well a particular sample follows a particular theoretical
distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis.
Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The only one used commonly is a grouped
barplot, with each group representing one level of one of the variables and each bar within a
group representing the levels of the other variable.
Other common types of multivariate graphics are:
Scatterplot: For two quantitative variables, the basic graphical EDA technique is
the scatterplot, which has one variable on the x-axis and one on the y-axis, and
a point for every case in your dataset.
Run chart: It’s a line graph of data plotted over time.
Heat map: It’s a graphical representation of data where values are depicted by
color.
Multivariate chart: It’s a graphical representation of the relationships between
factors and a response.
Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a two-dimensional
plot.
Technical requirements
You can find the code for this chapter on GitHub: https://github.com/PacktPublishing/hands-on-
exploratory-data-analysis-with-python. In order to get the best out of this chapter, ensure the following:
Make sure you have Python 3.x installed on your computer. It is recommended to use a Python notebook
environment such as Jupyter Notebook (for example, installed via the Anaconda distribution).
You must have Python libraries such as pandas, seaborn, and matplotlib installed.
Data transformation:
Data transformation is the mutation of data characteristics to improve access or
storage. Transformation may occur on the format, structure, or values of data. With regard to data
analytics, transformation usually occurs after data is extracted or loaded (ETL/ELT).
Data transformation increases the efficiency of analytic processes and enables data-
driven decisions. Raw data is often difficult to analyze and too vast in quantity to derive
meaningful insight, hence the need for clean, usable data.
During the transformation process, an analyst or engineer will determine the data structure.
The most common types of data transformation are:
Constructive: The data transformation process adds, copies, or replicates data.
Destructive: The system deletes fields or records.
Aesthetic: The transformation standardizes the data to meet requirements or parameters.
Structural: The database is reorganized by renaming, moving, or combining columns.
Data transformation techniques:
There are 6 basic data transformation techniques that you can use in your analysis project or data
pipeline:
• Data Smoothing
• Attribution Construction
• Data Generalization
• Data Aggregation
• Data Discretization
• Data Normalization
Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms. It allows
for highlighting important features present in the dataset. It helps in predicting the patterns. When
collecting data, it can be manipulated to eliminate or reduce any variance or any other noise form.
The concept behind data smoothing is that it can identify simple changes that help predict
different trends and patterns. This helps analysts or traders who need to look at a lot of data,
which can often be difficult to digest, to find patterns they would not otherwise see.
Noise can be removed from the data using techniques such as binning, regression, and clustering:
o Binning: This method splits the sorted data into a number of bins and smooths the data values
in each bin by considering the neighborhood values around it (a short code sketch illustrating this follows the list below).
o Regression: This method identifies the relation between two attributes so that, if we have
one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values to form a cluster. The values that lie outside
a cluster are known as outliers.
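A rough sketch of smoothing by bin means with pandas (the sorted price values and the bin count are illustrative):

import pandas as pd

# Sorted data values (illustrative)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split the sorted data into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace every value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed)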
Attribution Construction
Attribution construction is one of the most common techniques in data transformation pipelines.
Attribution construction or feature construction is the process of creating new features from a set
of the existing features/attributes in the dataset.
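A minimal sketch of attribute (feature) construction: deriving a new price-per-square-foot feature from two existing attributes (the house data is made up for illustration):

import pandas as pd

houses = pd.DataFrame({
    "price": [250000, 320000, 180000],
    "area_sqft": [1000, 1600, 900],
})

# Construct a new attribute from the existing ones
houses["price_per_sqft"] = houses["price"] / houses["area_sqft"]
print(houses)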
Data Generalization
Data generalization refers to the process of transforming low-level attributes into high-level ones by using
the concept of hierarchy. Data generalization is applied to categorical data where they have a finite but
large number of distinct values. It converts low-level data attributes to high-level data attributes using
concept hierarchy. This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
For example, age data in a dataset may appear as numerical values (e.g., 20, 30). It can be transformed to a
higher conceptual level as a categorical value (young, old).
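A small sketch of generalizing numerical ages into the higher-level categories mentioned in the example above (the cut-off at 35 is an arbitrary, illustrative choice):

import pandas as pd

ages = pd.Series([20, 30, 45, 62, 25, 70])

# Map low-level numerical ages to high-level concepts (young / old)
age_group = pd.cut(ages, bins=[0, 35, 120], labels=["young", "old"])
print(age_group)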
Data Aggregation
Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained
from multiple data sources, and these sources are integrated into a single data analysis description. This is a
crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant results. The
collection of data is useful for everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We can
aggregate the data to get the enterprise's annual sales report.
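A brief sketch of aggregating quarterly sales into an annual report with pandas (the revenue figures are illustrative):

import pandas as pd

sales = pd.DataFrame({
    "year": [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "revenue": [120, 135, 150, 160, 140, 155, 170, 180],
})

# Aggregate the quarterly figures into annual totals
annual = sales.groupby("year")["revenue"].sum()
print(annual)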
Data Discretization
Data discretization refers to the process of transforming continuous data into a set of data
intervals. This is an especially useful technique that can help you make the data easier to study
and analyze and improve the efficiency of any applied algorithm.
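A short sketch of discretizing a continuous variable into labeled intervals with pandas (the temperature readings and cut points are illustrative):

import pandas as pd

temperatures = pd.Series([12.5, 18.3, 22.1, 27.8, 31.4, 35.0])

# Transform continuous values into a set of data intervals
levels = pd.cut(temperatures, bins=[0, 20, 30, 40], labels=["cool", "warm", "hot"])
print(levels)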
Data Normalization
Data normalization is the process of scaling the data to a much smaller range,
without losing information in order to help minimize or exclude duplicated data, and improve
algorithm efficiency and data extraction performance.
There are three methods to normalize an attribute:
Min-max normalization: This method implements a linear transformation on the original data. Let us consider
that we have minA and maxA as the minimum and maximum values observed for attribute A, and Vi is the value
of attribute A that has to be normalized.
Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA]. The formula for
min-max normalization is:
V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and
[0.0, 1.0] is the range to which we have to map the value $73,600.
The value $73,600 would be transformed using min-max normalization as follows:
V'i = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716
Z-score normalization: This method normalizes the value for attribute A using the mean and standard
deviation. The following formula is used for Z-score normalization:
V'i = (Vi - Ā) / σA
Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation of attribute A are $54,000 and $16,000, and we have to
normalize the value $73,600 using z-score normalization:
V'i = (73,600 - 54,000) / 16,000 = 1.225
Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the value.
The movement of the decimal point depends on the maximum absolute value of A. The formula for decimal
scaling is:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1.
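A compact sketch applying the three normalization methods to a few illustrative income values with NumPy:

import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])  # illustrative values

# Min-max normalization to the range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the mean and standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1
j = int(np.floor(np.log10(np.abs(income).max()))) + 1
decimal_scaled = income / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)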
Data Integration
Data integration is a technique applied during the data pre-processing phase. It is the process of combining data
from multiple sources, creating a unified view of the data. These sources can be:
• Traditional databases
• Data warehouses
• Simple CSV or Excel files
• Exports from popular tools
Data Manipulation
Data manipulation refers to the process of making your data more readable and organized. This
can be achieved by changing or altering your raw datasets.
Data manipulation tools can help you identify patterns in your data and apply any data
transformation technique (e.g. attribute creation, normalization, or aggregation) in an efficient
and easy way.
The Data Transformation Process
In a cloud data warehouse, the data transformation process most typically takes the form
of ELT (Extract Load Transform) or ETL (Extract Transform Load). With cloud storage costs
becoming cheaper by the year, many teams opt for ELT; the difference is that all data is first
loaded into cloud storage and then transformed and added to the warehouse.
The transformation process generally follows 6 stages/steps:
1. Data Discovery: During the first stage, analysts work to understand and identify data in its source format. To
do this, they will use data profiling tools. This step helps analysts decide what they need to do to get data into its
desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how individual fields are
modified, mapped, filtered, joined, and aggregated. Data mapping is essential to many data processes, and one
misstep can lead to incorrect analysis and ripple through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source. These may include
structured sources such as databases or streaming sources such as customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to create code to complete
the transformation. Often, analysts generate this code with the help of data transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has been formatted
correctly.
6. Sending: The final step involves sending the data to its target destination. The target might be a data warehouse
or a database that handles both structured and unstructured data.
Database:
A database is an organized collection of data, so that it can be easily accessed and managed.
We can organize data into tables, rows, columns, and index it to make it easier to find
relevant information.
Database handlers create a database in such a way that a single set of software programs
provides access to the data for all users.
The main purpose of a database is to handle a large amount of information by storing,
retrieving, and managing data.
There are many dynamic websites on the World Wide Web nowadays that are handled
through databases. For example, a site that checks the availability of rooms in a hotel is a
dynamic website that uses a database.
There are many databases available like MySQL, Sybase, Oracle, MongoDB, Informix,
PostgreSQL, SQL Server, etc.
Modern databases are managed by the database management system (DBMS).
SQL, or Structured Query Language, is used to operate on the data stored in a database.
SQL is based on relational algebra and tuple relational calculus.
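As a small illustration, a sketch using Python's built-in sqlite3 module and a made-up hotel rooms table shows how data can be stored in a database and queried with SQL:

import sqlite3

# create a small in-memory database, store data, and query it with SQL
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE rooms (room_no INTEGER, available INTEGER)')
cur.executemany('INSERT INTO rooms VALUES (?, ?)', [(101, 1), (102, 0), (103, 1)])
conn.commit()

# SQL query: which rooms are currently available?
for row in cur.execute('SELECT room_no FROM rooms WHERE available = 1'):
    print(row[0])
conn.close()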
Merging Data
We will often come across multiple independent or dependent datasets (e.g. when working with
different groups or multiple kinds of data). In such cases, it might be necessary to get all the
data from the different datasets into one data frame.
First, we need to find a meaningful way to merge the datasets "zoo1", "zoo2",
"zoo3" and "class" into one data frame. Take a look at the data again. Do you see similarities
and differences? In order to properly match two (or more) datasets, we have to make sure that
the structure and data types match.
Preparation
Before merging the zoo datasets, you should match the data type and structure of all the zoo
datasets. You can for example use a loop in order to match the data types so that you don't have to
do all variables manually. You can convert data types in R with as.numeric(), as.character(),
as.logical(), ...
All of the datasets can then be merged in two steps.
1) Merging zoo datasets
In order to merge all of the zoo datasets, we recommend using either cbind() or rbind(). You
can find in the R documentation what both commands do. Choose the one that matches your
data.
2) Merging zoo and class datasets
When looking at the class dataset, you will see that it looks fundamentally different from the
zoo datasets. For this, dplyr's join functions are helpful: left_join(), right_join(), full_join() or inner_join(). You
should merge the datasets in such a way that the class dataset adds data to your merged zoo dataset.
In pandas, the same kind of merging can be done with merge(), concat() and join().
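A minimal pandas sketch of the same two-step idea; the zoo1, zoo2 and classes DataFrames below are hypothetical stand-ins for the actual datasets:

import pandas as pd

# hypothetical stand-ins for the zoo and class datasets
zoo1 = pd.DataFrame({'animal': ['lion', 'zebra'], 'legs': [4, 4]})
zoo2 = pd.DataFrame({'animal': ['eagle', 'frog'], 'legs': [2, 4]})
classes = pd.DataFrame({'animal': ['lion', 'zebra', 'eagle', 'frog'],
                        'class': ['mammal', 'mammal', 'bird', 'amphibian']})

# step 1: stack the zoo datasets row-wise (the pandas analogue of rbind())
zoo = pd.concat([zoo1, zoo2], ignore_index=True)

# step 2: add the class information to the merged zoo data (a left join)
merged = zoo.merge(classes, on='animal', how='left')
print(merged)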
Reshaping and Pivoting
Reshaping and Pivoting are powerful operations in Pandas that allow us to transform and
restructure our data in various ways. This can be particularly useful when dealing with messy or
unstructured datasets. In this explanation, we will cover both reshaping and pivoting in detail with
examples, sample code, and output.
1. Reshaping:
Reshaping in Pandas is the process of converting data from a “wide” format to a “long” format
or vice versa. This can be achieved using the `melt()` and `stack()` functions.
a) Melt:
The `melt()` function is used to transform a DataFrame from a wide format to a long format. It
essentially unpivots the data and brings it into a tabular form.
Let’s consider an example where we have a DataFrame with three variables: Name, Math score,
and English score. We can use the `melt()` function to bring the Math and English scores into a
single column:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Math Score': [85, 90, 75],
        'English Score': [95, 80, 70]}
df = pd.DataFrame(data)

melted_df = pd.melt(df, id_vars=['Name'], var_name='Subject', value_name='Score')
Output:
      Name        Subject  Score
0    Alice     Math Score     85
1      Bob     Math Score     90
2  Charlie     Math Score     75
3    Alice  English Score     95
4      Bob  English Score     80
5  Charlie  English Score     70
In the melted DataFrame, the 'Math Score' and 'English Score' columns have been merged into a
single 'Subject' column, and the respective scores are placed in a new 'Score' column. The
'Name' column remains unchanged.
b) Stack:
The `stack()` function is used to pivot a level of column labels into a level of row labels, resulting
in a long format DataFrame.
Consider a DataFrame where we have multiple variables for each person. We can use the
`stack()` function to stack these variables into a single column:
import pandas as pd

data = {'Name': ['Alice', 'Bob'],
        'Variable1': [5, 10],
        'Variable2': [8, 12]}
df = pd.DataFrame(data)

stacked_df = df.set_index('Name').stack().reset_index()
stacked_df.columns = ['Name', 'Variable', 'Value']
Output:
    Name   Variable  Value
0  Alice  Variable1      5
1  Alice  Variable2      8
2    Bob  Variable1     10
3    Bob  Variable2     12

The `stack()` function stacks the 'Variable1' and 'Variable2' columns into a single 'Variable'
column. The respective values are stored in the 'Value' column, while the 'Name' column
remains unchanged.
2. Pivoting:
Pivoting is the process of converting a long format DataFrame into a wide format, where the
unique values of one column become new columns in the resulting DataFrame. This can be done
using the `pivot()` and `pivot_table()` functions.
c) Pivot:
The `pivot()` function is used to convert a DataFrame from long format to wide format based on
unique values in a column. It requires specifying both an index and a column.
Consider the following melted DataFrame from earlier:
import pandas as pd

melted_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math Score', 'Math Score', 'Math Score',
                'English Score', 'English Score', 'English Score'],
    'Score': [85, 90, 75, 95, 80, 70]
})

pivoted_df = melted_df.pivot(index='Name', columns='Subject', values='Score')
Output:
Subject  English Score  Math Score
Name
Alice               95          85
Bob                 80          90
Charlie             70          75
In the pivoted DataFrame, the unique values in the 'Subject' column ('Math Score', 'English
Score') become the column names, and the corresponding scores are filled into the respective
cells.
d) Pivot_table:
The `pivot_table()` function is similar to `pivot()` but allows us to aggregate values that have the
same indices. This is useful when we have multiple values for the same index-column
combinations.
Consider the following DataFrame:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
        'Subject': ['Math', 'Math', 'English', 'English'],
        'Score': [85, 90, 95, 80]}
df = pd.DataFrame(data)

pivoted_table = pd.pivot_table(df, index='Name', columns='Subject',
                               values='Score', aggfunc='mean')
Output:
Subject  English  Math
Name
Alice         95    85
Bob           80    90
The `pivot_table()` function aggregates the scores based on the provided aggregation function,
which is the mean (`aggfunc='mean'`). We obtain the average scores for each subject and each
person in the pivoted DataFrame.
These are some examples of reshaping and pivoting on Pandas DataFrames using Python. The
ability to reshape and pivot data provides flexibility in manipulating and analyzing datasets
efficiently.
GROUPING DATASETS AND DATA AGGREGATION
Grouping and aggregating help us carry out data analysis easily using various
functions. These methods help us group and summarize our data and make complex
analysis comparatively easy.
Creating a sample dataset of marks of various subjects.
# import module
import pandas as pd
Output:
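The screenshot of the sample dataset is not reproduced here. A minimal sketch of how such a marks DataFrame could be created is shown below; the column names and values are assumptions, chosen to be consistent with the 'Maths' and 'Science' columns used in the later examples:

import pandas as pd

# hypothetical marks data for five students
df = pd.DataFrame({
    'Maths':   [80, 90, 80, 70, 95],
    'Science': [85, 90, 75, 70, 92],
    'English': [88, 79, 81, 70, 90]
})
print(df)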
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be used to get a
summary of columns in our dataset like getting sum, minimum, maximum, etc. from a particular
column of our dataset. The function used for aggregation is agg(); its parameter is the function
we want to perform.
Some functions used in the aggregation are:
Function : Description
• sum() : Compute sum of column values
• min() : Compute min of column values
• max() : Compute max of column values
• mean() : Compute mean of column values
• size() : Compute column sizes
• describe() : Generate descriptive statistics
• first() : Compute first of group values
• last() : Compute last of group values
• count() : Compute count of column values
• std() : Compute standard deviation of column values
• var() : Compute variance of column values
• sem() : Compute standard error of the mean of column values
Examples:
• The sum() function is used to calculate the sum of every value.

df.sum()
Output:
• We use the agg() function to calculate the sum, min, and max of each column in our dataset.

df.agg(['sum', 'min', 'max'])
Output:
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows a
split-apply-combine strategy:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the 'Maths' column. It returns a GroupBy object as the result.
df.groupby(by=['Maths'])
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function to group the data on the 'Maths' column. To view the result of the formed
groups, use the first() function.

a = df.groupby('Maths')
a.first()
Output:
First we group based on 'Maths'; then, within each group, we group based on 'Science'. The code for this step is sketched below.
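A plausible sketch of the nested grouping, reusing the same kind of marks DataFrame as above (the values are hypothetical):

import pandas as pd

# hypothetical marks data, matching the sample dataset used above
df = pd.DataFrame({
    'Maths':   [80, 90, 80, 70, 95],
    'Science': [85, 90, 75, 70, 92],
    'English': [88, 79, 81, 70, 90]
})

# group first by 'Maths', then within each group by 'Science'
b = df.groupby(['Maths', 'Science'])
print(b.first())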
Implementation on a Dataset
Here we are using a dataset of diamond information.
# import modules
import numpy as np
import pandas as pd
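The source does not show how the dataset is loaded; one plausible way (an assumption) is to use the well-known 'diamonds' dataset that ships with seaborn:

import seaborn as sns

# load the diamonds dataset (carat, cut, color, clarity, price, ...)
dataset = sns.load_dataset('diamonds')
print(dataset.head())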
Output:
• Here we are grouping by 'cut' and 'color' and getting the minimum value of all other columns for each group.

dataset.groupby(['cut', 'color']).agg('min')
Output:
Here we are grouping by 'color' and getting aggregate values like sum, mean, min, etc. for the 'price' column.
# dictionary with the column name ('price') as key and the list of
# aggregation functions we want to apply to that column as value
agg_functions = {
    'price': ['sum', 'mean', 'median', 'min', 'max', 'prod']
}

dataset.groupby(['color']).agg(agg_functions)
Output:
Pivot Tables: A pivot table is a table of statistics that summarizes the data of a more extensive table (such as
from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or
other statistics, which the pivot table groups together in a meaningful way.
Steps Needed
• Import Library (Pandas)
• Import / Load / Create data.
• Use the pandas.pivot_table() method with different variants.
# import packages
import pandas as pd

# create data (the last few Marks values are truncated in the source;
# the values for IDs 90, 56 and 34 below are placeholders)
df = pd.DataFrame({'ID': {0: 23, 1: 43, 2: 12,
                          3: 13, 4: 67, 5: 89,
                          6: 90, 7: 56, 8: 34},
                   'Name': {0: 'Ram', 1: 'Deep', 2: 'Yash',
                            3: 'Aman', 4: 'Arjun', 5: 'Aditya',
                            6: 'Akash', 7: 'Chalsea', 8: 'Divya'},
                   'Marks': {0: 89, 1: 97, 2: 45,
                             3: 78, 4: 56, 5: 76,
                             6: 81, 7: 87, 8: 100}})

# one variant: pivot table of the mean Marks for each Name
table = pd.pivot_table(df, index=['Name'], values=['Marks'], aggfunc='mean')
print(table)
Output:
CROSS TABULATIONS:
Cross tabulation (or crosstab) is an important tool for analyzing two categorical
variables in a dataset. It provides a tabular summary of the frequency distribution of two
variables, allowing us to see the relationship between them and identify any patterns or trends.
Also known as contingency tables or cross tabs, cross tabulation groups variables to
understand the correlation between different variables. It also shows how
correlations change from one variable grouping to another. It is usually used in
statistical analysis to find patterns, trends, and probabilities within raw data.
The pandas crosstab function builds a cross-tabulation table that can show the frequency with
which certain groups of data appear.
This method is used to compute a simple cross-tabulation of two (or more) factors. By default,
it computes a frequency table of the factors unless an array of values and an aggregation function
are passed.
Syntax: pandas.crosstab(index, columns, values=None, rownames=None, colnames=None,
aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Arguments :
• index : array-like, Series, or list of arrays/Series, Values to group by in the rows.
• columns : array-like, Series, or list of arrays/Series, Values to group by in the columns.
• values : array-like, optional, array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
• rownames : sequence, default None, If passed, must match number of row arrays passed.
• colnames : sequence, default None, If passed, must match number of column arrays
passed.
• aggfunc : function, optional, If specified, requires `values` be specified as well.
• margins : bool, default False, Add row/column margins (subtotals).
• margins_name : str, default 'All', Name of the row/column that will contain the totals
when margins is True.
• dropna : bool, default True, Do not include columns whose entries are all NaN.
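A small example of pd.crosstab() on made-up data; the 'Gender' and 'Handedness' columns are hypothetical:

import pandas as pd

df = pd.DataFrame({
    'Gender':     ['M', 'F', 'M', 'F', 'M', 'F'],
    'Handedness': ['Right', 'Right', 'Left', 'Right', 'Right', 'Left']
})

# frequency table of Gender vs. Handedness, with row/column totals (margins)
table = pd.crosstab(df['Gender'], df['Handedness'], margins=True)
print(table)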
Unit 1 Completed