Introduction-to-Exploratory-Data-Analysis-EDA
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is a crucial stage in the data
science lifecycle. It involves examining and understanding the data
to gain insights and prepare it for further analysis.
by Dency John
Importance of EDA in the Data Science Lifecycle
1 Uncover Hidden Patterns
EDA helps identify trends, outliers, and relationships within the data that might not be obvious at first glance.
2 Validate Assumptions
It allows you to check if your initial assumptions about the data are accurate or need to be revised.
1 Data Gathering
Begin by acquiring the data from various sources, ensuring it's relevant to your analytical goals.
2 Data Cleaning
Address any inconsistencies, missing values, or outliers in the data to ensure its quality and accuracy.
3 Data Transformation
Transform the data to make it suitable for analysis, such as scaling or encoding categorical variables.
4 Data Exploration
Explore the data by using descriptive statistics, visualizations, and summary tables to uncover patterns and relationships.
5 Data Interpretation
Interpret the insights gained from the exploration to draw conclusions and form hypotheses for further analysis.
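Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration, assuming Python with pandas; the column names and values are hypothetical toy data, and min-max scaling and one-hot encoding are only one possible choice for each kind of transformation.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "temp_c": [4.0, 19.0, 6.0, 31.0],
})

# Step 3 (transformation): min-max scale a numeric column to [0, 1].
t = df["temp_c"]
df["temp_scaled"] = (t - t.min()) / (t.max() - t.min())

# Step 3 (transformation): one-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["city"])

# Step 4 (exploration): descriptive statistics and a summary table.
summary = df["temp_c"].describe()
mean_by_city = df.groupby("city")["temp_c"].mean()

print(encoded)
print(summary)
print(mean_by_city)
```

The printed summary table is the kind of output that feeds step 5, where the patterns are interpreted and turned into hypotheses.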
Importing Data from Various Sources
Databases
Connect to databases like MySQL, PostgreSQL, or SQLite to retrieve data directly.
Files
Import data from various file formats such as CSV, Excel, JSON, or XML.
APIs
Access data from external APIs to retrieve data from websites, social media platforms, or weather services.
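As a rough sketch of the database route, the example below assumes Python with pandas and uses an in-memory SQLite database so it runs without a server. The table name `readings` and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, so the sketch needs no server or file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 4.0), ("B", 19.0), ("A", 6.0)],
)
conn.commit()

# Load the query result directly into a data frame.
df = pd.read_sql_query("SELECT station, temp_c FROM readings", conn)
conn.close()

print(df)
```

For MySQL or PostgreSQL the query step would look similar, but the connection would come from a driver or engine for that database rather than from `sqlite3`.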
Creating Data Frames from Diverse Formats
CSV
Read data from comma-separated values (CSV) files into a data frame.
Excel
Import data from Excel spreadsheets into a data frame.
JSON
Load data from JavaScript Object Notation (JSON) files into a data frame.
HTML
Extract data from HTML tables into a data frame using web scraping techniques.
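A minimal sketch of the CSV and JSON cases, assuming Python with pandas. The data is held in hypothetical in-memory strings so the example runs without any files on disk; in practice a file path would be passed instead.

```python
import io

import pandas as pd

# Hypothetical sample content standing in for real files.
csv_text = "name,score\nAda,91\nLin,78\n"
json_text = '[{"name": "Ada", "score": 91}, {"name": "Lin", "score": 78}]'

# read_csv and read_json accept a path or a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

pandas also provides `pd.read_excel` and `pd.read_html` for the other two formats. Both depend on optional parser packages being installed, so they are left out of this sketch.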
Exploring Data Structure and Dimensions
Data Type Description
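A first look at structure and dimensions might go as follows. This is a sketch assuming Python with pandas; the data frame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data frame.
df = pd.DataFrame({
    "name": ["Ada", "Lin", "Sam"],
    "age": [36, 41, 29],
    "height_m": [1.70, 1.82, 1.75],
})

n_rows, n_cols = df.shape   # dimensions: (rows, columns)
column_types = df.dtypes    # data type of each column
first_rows = df.head(2)     # quick look at the first two rows

print(n_rows, n_cols)
print(column_types)
print(first_rows)
```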
Position-Based Indexing
Access data using numerical indices for rows and
columns.
Boolean Indexing
Select rows or columns based on conditions that
evaluate to True or False.
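Both indexing styles can be illustrated briefly. The sketch assumes Python with pandas, where position-based access uses `iloc` and boolean indexing uses a True/False mask; the data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data frame.
df = pd.DataFrame({
    "name": ["Ada", "Lin", "Sam"],
    "age": [36, 41, 29],
})

# Position-based indexing: integer positions for rows and columns.
first_row = df.iloc[0]         # row at position 0
cell = df.iloc[1, 1]           # row 1, column 1 (the "age" of "Lin")
first_two_rows = df.iloc[:2]   # rows at positions 0 and 1

# Boolean indexing: keep the rows where the condition is True.
mask = df["age"] > 30
over_30 = df[mask]

print(cell)
print(over_30)
```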
Handling Missing Values and Outliers
Missing Values
Identify and handle missing values by imputing them with statistical measures or dropping rows/columns.
Outliers
Detect and address outliers by replacing them with appropriate values, removing them, or applying transformations.
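The two tasks above can be sketched together, assuming Python with pandas and hypothetical values. Median imputation and the 1.5 × IQR rule are choices made for this example, not the only options; the right method depends on the data.

```python
import pandas as pd

# Hypothetical measurements with one missing value and one extreme value.
s = pd.Series([10.0, 12.0, None, 11.0, 13.0, 95.0])

# Missing values: count them, then impute with the median.
n_missing = int(s.isna().sum())
filled = s.fillna(s.median())

# Outliers: flag values outside 1.5 * IQR of the quartiles (assumed rule).
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (filled < lower) | (filled > upper)

# Two ways to address them: cap to the bounds, or remove the rows.
capped = filled.clip(lower=lower, upper=upper)
removed = filled[~is_outlier]

print(n_missing)
print(capped)
print(removed)
```

Dropping instead of imputing would use `s.dropna()`, which trades lost rows for not having to guess at values.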
Visualising Data Patterns and Relationships