
St. Joseph's College of Engineering


Department of Artificial Intelligence and Data Science
Academic Year: 2025-2026 [ODD Semester]

Introduction to EDA
Unit I

Exploratory Data Analysis Fundamentals - Understanding data science - The significance of EDA - Making sense of data - Comparing EDA with classical and Bayesian analysis - Software tools available for EDA - Visual aids for EDA - Types of Charts
EDA Fundamentals
● What is data?
- Data is a collection of discrete objects, events, and facts in the form of numbers, text, pictures, videos, audio, and other entities.
● What is information?
- Processing data yields useful information, and processing that information yields useful knowledge.
● To extract meaningful information from data, we need EDA.

Definition: EDA

● EDA is the process of investigating datasets, elucidating subjects, and visualizing outcomes.
● EDA is an approach to data analysis that applies a variety of techniques to maximize specific insights into a dataset, reveal its underlying structure, extract significant variables, detect outliers and anomalies, test assumptions, develop models, and determine optimal parameters for future estimation.
● EDA is the process of examining datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.

Various exploratory tools: Python, R

Enterprise Applications: Power BI, SAP Analytics Cloud, Tableau



EDA Fundamentals
Main pillars of EDA:
- Data Cleaning
- Data Preparation
- Data Exploration
- Data Visualization
[Python: the pandas library is the most important; other libraries include NumPy, scikit-learn, SciPy, statsmodels (for regression), and Matplotlib (for visualization)]
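A minimal sketch of the four pillars in pandas and Matplotlib. The file name sales.csv and the columns region and revenue are hypothetical, used only for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data Cleaning: load a (hypothetical) dataset, drop duplicates, fill missing values
df = pd.read_csv("sales.csv")  # assumed file with 'region' and 'revenue' columns
df = df.drop_duplicates()
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Data Preparation: derive a new column and encode a categorical one
df["revenue_k"] = df["revenue"] / 1000
df["region"] = df["region"].astype("category")

# Data Exploration: summary statistics and group-wise aggregates
print(df.describe())
print(df.groupby("region", observed=True)["revenue"].mean())

# Data Visualization: distribution of the target variable
df["revenue"].plot(kind="hist", bins=30, title="Revenue distribution")
plt.show()
```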
EDA Fundamentals
Difference between R and Python

| Aspect | R | Python |
| --- | --- | --- |
| **Primary Use** | Statistical analysis, data visualization | General-purpose programming, data science, machine learning |
| **Strengths** | Built-in stats packages<br>Great for academic and statistical modeling | Versatile<br>Strong machine learning libraries (scikit-learn, TensorFlow, etc.) |
| **Community** | Strong among statisticians and researchers | Very large and cross-disciplinary |
| **Visualization** | ggplot2, plotly | matplotlib, seaborn, plotly |
| **Ease of Use** | Easier for statistical analysis tasks | Easier for general programming & integration |
| **Flexibility** | Limited outside data analysis | High; used for web dev, automation, AI, etc. |
| **Use Case** | Research, reports, prototyping | Full-stack applications, production ML systems |


EDA Fundamentals

| Feature | R/Python (Analysis Tools) | Enterprise Applications |
| --- | --- | --- |
| **Goal** | Explore data, build models | Manage business workflows |
| **User** | Analysts, data scientists | Enterprise employees, managers |
| **Customization** | High (code-based) | Limited to platform capabilities |
| **Deployment** | Local or cloud notebooks/scripts | Cloud/on-premises systems |
| **Data Handling** | Excellent for analysis | Strong for storage, reporting, transactional processing |
| **Real-time Support** | Typically not built-in | Often includes dashboards, alerts, automation |
| **Examples** | Jupyter Notebook, RStudio | SAP, Salesforce, Oracle EBS |


EDA Fundamentals
Bridging the Two
Modern enterprises combine both:

● Use Python/R for advanced analytics and ML models
● Integrate with enterprise apps via APIs for automation (e.g., a Python script sending predictions to Salesforce), as sketched below
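A hedged illustration of that integration pattern using the third-party requests library. The endpoint, token, and payload are entirely hypothetical, not a real Salesforce API:

```python
import requests

# Hypothetical scores produced by a Python/R model
predictions = [{"customer_id": 101, "churn_risk": 0.82},
               {"customer_id": 102, "churn_risk": 0.17}]

# Push the predictions to an enterprise application over its REST API.
# The URL and token are placeholders; a real integration would use the
# vendor's documented endpoint and authentication flow.
response = requests.post(
    "https://example-enterprise-app.invalid/api/predictions",
    json=predictions,
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
response.raise_for_status()
```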
LAB
Software/hardware covered in the book
OS requirements: Python 3.x on Windows, macOS, Linux, or any other OS
Python notebooks - there are several options:
Local: Jupyter: https://jupyter.org/
Local: Anaconda: https://www.anaconda.com/distribution/
Online: Google Colab: https://colab.research.google.com/
Python libraries: NumPy, pandas, scikit-learn, Matplotlib, Seaborn, statsmodels
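A quick way to verify the lab setup, assuming the libraries above are already installed (e.g., via pip install numpy pandas scikit-learn matplotlib seaborn statsmodels):

```python
# Print the version of each library used in the lab to confirm the environment
import numpy, pandas, sklearn, matplotlib, seaborn, statsmodels

for lib in (numpy, pandas, sklearn, matplotlib, seaborn, statsmodels):
    print(f"{lib.__name__:12s} {lib.__version__}")
```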
Understanding Data Science
Data science involves cross-disciplinary knowledge from computer science, statistics, and mathematics. There are several phases of data analysis: data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication. These phases are similar to the CRISP-DM (CRoss-Industry Standard Process for Data Mining) framework used in data mining.

1. Data requirement: It is important to understand what type of data the organization needs to collect, curate, and store. For example, an application tracking the sleeping patterns of patients suffering from dementia requires several types of sensor data, such as sleep data, the patient's heart rate, electrodermal activity, and user activity patterns. All of these data points are required to correctly diagnose the mental state of the person, so they are mandatory requirements for the application. In addition, the data must be categorized as numerical or categorical, and the format of its storage and dissemination must be defined.

2. Data collection: Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company. As mentioned previously, data can be collected from several objects during several events using different types of sensors and storage tools.

3. Data processing: Preprocessing involves pre-curating the dataset before the actual analysis. Common tasks include correctly exporting the dataset, placing the records under the right tables, structuring them, and exporting them in the correct format, as sketched below.
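A minimal sketch of such preprocessing with pandas. The file sensor_dump.csv and its columns timestamp and sensor_type are hypothetical:

```python
import pandas as pd

# Export/import step: read a raw dump and parse types correctly up front
raw = pd.read_csv("sensor_dump.csv", parse_dates=["timestamp"])  # assumed columns

# Structuring step: place each record under the right "table"
# (here, one DataFrame per sensor type)
tables = {name: group for name, group in raw.groupby("sensor_type")}

# Export each table in the correct format for downstream analysis
for name, table in tables.items():
    table.to_csv(f"{name}_clean.csv", index=False)
```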
Understanding Data Science
4. Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be correctly transformed through an incompleteness check, a duplicates check, an error check, and a missing value check. These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct records, finding inaccuracies in the dataset, understanding overall data quality, removing duplicate items, and filling in missing values.

Data cleaning depends on the type of data under study, so it is essential for data scientists or EDA experts to understand different types of datasets. An example of data cleaning would be using outlier detection methods to clean quantitative data, as in the sketch below.
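A minimal pandas sketch of these cleaning tasks, including a simple IQR-based outlier check. The file patients.csv and the heart_rate column are hypothetical:

```python
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical dataset

# Incompleteness / missing value check
print(df.isna().sum())
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].median())

# Duplicates check
df = df.drop_duplicates()

# Outlier detection for quantitative cleaning: flag values outside 1.5 * IQR
q1, q3 = df["heart_rate"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["heart_rate"] < q1 - 1.5 * iqr) |
              (df["heart_rate"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers found")
```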
Understanding Data Science
5. EDA: Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message contained in the data. It should be noted that several types of data transformation techniques might be required during the process of exploration, as in the sketch below.
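One common transformation during exploration is reducing skew with a log transform; a brief sketch with invented numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20_000, 25_000, 30_000, 45_000, 60_000, 1_200_000]})

# A heavily right-skewed variable hides structure in plots and summaries;
# log1p compresses the long tail while keeping zero values defined
df["log_income"] = np.log1p(df["income"])
print(df["income"].skew(), df["log_income"].skew())
```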


Understanding Data Science
6. Modeling and algorithms: From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation. These models or equations involve one or more variables that depend on other variables to cause an event. For example, when buying pens, the total price of pens (Total) = the price of one pen (UnitPrice) * the number of pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price, so the total price is referred to as the dependent variable and the unit price as the independent variable. In general, a model describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables. The Judd model for describing the relationship between data, model, and error still holds true: Data = Model + Error.
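The pen example and Data = Model + Error can be made concrete with a tiny regression, sketched here with statsmodels (the numbers are made up; the true UnitPrice is 10 plus noise):

```python
import numpy as np
import statsmodels.api as sm

# Quantity of pens bought and the observed total price (UnitPrice = 10, plus noise)
quantity = np.array([1, 2, 3, 4, 5, 6])
total = 10 * quantity + np.random.default_rng(0).normal(0, 0.5, size=6)

# Fit Total = UnitPrice * Quantity: the fitted line is the Model;
# the residuals are the Error term
X = sm.add_constant(quantity)
fit = sm.OLS(total, X).fit()
print(fit.params)  # intercept ~ 0, slope ~ UnitPrice
print(fit.resid)   # Data - Model = Error
```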
Understanding Data Science
7. Data product: Any computer software that uses data as input, produces output, and provides feedback based on that output to control the environment is referred to as a data product. A data product is generally based on a model developed during data analysis, for example, a recommendation model that takes a user's purchase history as input and recommends a related item that the user is highly likely to buy (see the sketch after this slide).
8. Communication: This stage deals with disseminating the results to end stakeholders so they can use them for business intelligence. One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to present the analyzed results (a bar chart sketch follows below).
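A toy sketch of the recommendation idea from step 7, using simple item co-occurrence in purchase histories. The histories are invented; real recommenders use far richer models:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one list of items per user
histories = [["pen", "notebook"], ["pen", "ink"], ["notebook", "pen", "ink"]]

# Count how often each pair of items is bought together
pair_counts = Counter()
for items in histories:
    pair_counts.update(combinations(sorted(set(items)), 2))

def recommend(item: str) -> list[str]:
    """Items most frequently co-purchased with the given item."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(3)]

print(recommend("pen"))  # items most often bought together with 'pen'
```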

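For step 8, a minimal Matplotlib bar chart communicating an analyzed result (the figures shown are invented):

```python
import matplotlib.pyplot as plt

# Invented summary of an analysis, ready to be communicated to stakeholders
regions = ["North", "South", "East", "West"]
avg_revenue = [42, 58, 35, 61]

plt.bar(regions, avg_revenue)
plt.ylabel("Average revenue (in thousands)")
plt.title("Average revenue by region")
plt.show()
```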