Uploaded by

Akansha S


Exploratory Data Analysis (EDA): Summary Notes and Explanation

What is Exploratory Data Analysis?

EDA is the initial step in data analysis where graphical and statistical
techniques are used to:

Summarize the main characteristics of the dataset.

Detect patterns, anomalies, or relationships.

Refine hypotheses for further analysis.

Prepare data for formal modeling by identifying trends, distributions, and outliers.

Key Steps in EDA

1. Data Collection and Loading

Gather data from sources like CSV, databases, or APIs.

Load the data into analysis tools (e.g., Python, R, Excel).

2. Data Cleaning

Handle missing data (imputation, removal, or flagging).

Correct data types (e.g., integers, dates, strings).

Identify and address duplicate or inconsistent records.
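The type-correction and duplicate-handling steps above can be sketched in pandas as follows. The column names and values are hypothetical and exist only for illustration; real data will need its own rules for what counts as an inconsistent record.

```python
import pandas as pd

# Hypothetical raw data: numbers and dates stored as strings, one repeated row,
# and inconsistent spelling of the same city.
raw = pd.DataFrame({
    "order_id": ["1", "2", "2", "3"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
    "city": [" Pune", "Delhi", "Delhi", "pune "],
})

clean = raw.copy()
clean["order_id"] = clean["order_id"].astype(int)           # string -> integer
clean["order_date"] = pd.to_datetime(clean["order_date"])   # string -> datetime
clean["city"] = clean["city"].str.strip().str.title()       # unify labels

n_duplicates = int(clean.duplicated().sum())                 # fully repeated rows
clean = clean.drop_duplicates().reset_index(drop=True)

print("duplicates removed:", n_duplicates)
print(clean.dtypes)
```

Working on a copy keeps the raw data available for comparison if a cleaning rule turns out to be wrong.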

3. Descriptive Statistics

Central Tendency: Mean, median, mode.

Dispersion: Range, variance, standard deviation, IQR (Interquartile Range).

Distribution: Skewness and kurtosis to understand the shape of data.
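As a rough illustration of the measures listed above, the sketch below computes them with pandas on a small made-up sample. The data are an assumption chosen so that one large value produces a visible right skew.

```python
import pandas as pd

# Hypothetical sample with one large value to make the skew visible.
values = pd.Series([2, 3, 3, 4, 5, 6, 20])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],
    "range": values.max() - values.min(),
    "variance": values.var(),     # sample variance (ddof=1)
    "std": values.std(),          # sample standard deviation (ddof=1)
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    "skewness": values.skew(),    # > 0 indicates a longer right tail
    "kurtosis": values.kurt(),    # excess kurtosis (normal distribution = 0)
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here the mean is pulled above the median by the single large value, which is the usual sign of a right-skewed variable.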

4. Data Visualization

Graphical methods to explore relationships, trends, and patterns.

Essential EDA Techniques

1. Univariate Analysis

Focus on a single variable at a time.

Tools:

Histograms

Boxplots

Frequency tables

Summary statistics
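A frequency table for a single categorical variable might be built as in this minimal sketch. The series of colors is hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

counts = colors.value_counts()                  # absolute frequencies
shares = colors.value_counts(normalize=True)    # relative frequencies

# Combine both views into one table, aligned on the category labels.
freq_table = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(freq_table)
```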

2. Bivariate Analysis

Study relationships between two variables.

Tools:

Scatter plots

Correlation coefficients

Line plots (for trends over time)

Bar plots (categorical vs. numerical relationships)

3. Multivariate Analysis

Explore relationships among three or more variables simultaneously.

Tools:

Pair plots (e.g., seaborn.pairplot in Python)

Heatmaps (correlation matrix visualization)

Parallel coordinate plots

Common Multivariate Statistical Techniques

1. Principal Component Analysis (PCA)

Reduces the dimensionality of data while retaining as much variance as possible.

Visualizes high-dimensional data in 2D or 3D plots.
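One way to see what PCA does is to compute it directly from the singular value decomposition of the centered data. The sketch below uses plain NumPy rather than a library PCA class, and the synthetic data (five features driven by two hidden directions) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, variance concentrated in 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA via SVD: center the data, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal components for a 2D view.
X_2d = X_centered @ Vt[:2].T

print(explained_variance_ratio.round(3))
print(X_2d.shape)
```

The explained variance ratio shows how much information a 2D plot keeps; if the first two components capture little of the variance, the plot can be misleading.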

2. Clustering

Groups similar data points together:

K-Means Clustering

Hierarchical Clustering

3. Multidimensional Scaling (MDS)

Projects high-dimensional data into a lower-dimensional space for visualization.
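MDS comes in several variants. The sketch below shows classical (metric) MDS only, written with NumPy: it starts from a matrix of pairwise distances and recovers coordinates whose distances match them. The helper names and the synthetic data (points on a 2D plane embedded in 6 dimensions) are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Classical MDS: double-center the squared distances, then eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

rng = np.random.default_rng(1)
# Hypothetical data: 30 points that lie on a 2D plane inside a 6D space.
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))

D = pairwise_distances(X)
embedding = classical_mds(D, n_components=2)

# Because the data are intrinsically 2D, the 2D embedding preserves all distances.
max_error = np.abs(pairwise_distances(embedding) - D).max()
print(embedding.shape, max_error)
```

On real data the distances are usually only approximated, so the size of the error is itself a useful diagnostic.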

4. Factor Analysis

Identifies underlying latent variables that influence the observed variables.

5. t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear technique to visualize complex patterns in high-dimensional data.

6. Heatmaps

Used for showing relationships between variables in a matrix format, often paired with correlation coefficients.

Key Questions EDA Aims to Address

1. Data Quality

Are there missing, duplicate, or erroneous entries?

2. Data Distribution

What is the shape of the data (e.g., normal, skewed)?

3. Relationships

How do variables interact with each other?

4. Outliers

Are there unusual data points, and what impact do they have?

Steps to Perform EDA in Python

1. Loading Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

2. Data Loading

df = pd.read_csv('data.csv') # Load dataset

3. Summary Statistics

df.describe()

df.info()

4. Visualization

Histogram:

df['column_name'].hist()

Scatter Plot:

sns.scatterplot(x='feature1', y='feature2', data=df)

Heatmap:

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

Importance of EDA

Helps in identifying errors and cleaning data.

Provides insights into the structure of the dataset.

Aids in selecting the right modeling techniques.

Reduces time spent troubleshooting issues during modeling.

Exploratory Data Analysis (EDA): Detailed Notes

1. Goals of EDA

Understand the Dataset: Gain insights into the dataset's structure, distributions, and key characteristics.

Identify Patterns: Uncover relationships, trends, and structures in the data.

Detect Anomalies: Spot missing values, outliers, or unusual observations.

Hypothesis Generation and Refinement: Form, refine, or discard hypotheses before formal modeling.

Feature Engineering: Identify features to add, transform, or remove for better modeling results.

2. Types of EDA Techniques

1. Univariate Analysis

Examines a single variable.

Focuses on:

Central tendency (mean, median, mode).

Dispersion (variance, standard deviation, range).

Distribution shape (normal, skewed, etc.).

Visualizations:

Numerical Data: Histograms, boxplots, density plots.

Categorical Data: Bar charts, pie charts.

2. Bivariate Analysis

Analyzes relationships between two variables.

Visualizations:

Numerical-Numerical: Scatter plots, line charts, correlation matrices.

Numerical-Categorical: Boxplots, violin plots, bar plots.

Categorical-Categorical: Grouped bar charts, contingency tables.
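The tabular side of bivariate analysis can be sketched with pandas: a contingency table for two categorical variables and a grouped summary for a numerical variable against a categorical one. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "pro", "free", "pro"],
    "churned": ["yes", "no", "no", "no", "yes", "yes"],
    "monthly_usage": [3.0, 12.5, 4.5, 15.0, 2.0, 6.0],
})

# Categorical vs. categorical: contingency table of counts.
contingency = pd.crosstab(df["plan"], df["churned"])
print(contingency)

# Numerical vs. categorical: summarize the numeric variable within each group.
usage_by_plan = df.groupby("plan")["monthly_usage"].agg(["mean", "median", "std"])
print(usage_by_plan)
```

The grouped summary carries the same information a boxplot per group would show, in numeric form.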

3. Multivariate Analysis

Explores relationships among three or more variables.

Useful for datasets with high dimensions.

Visualizations:

Pair plots.

3D scatter plots.

Heatmaps (for correlation matrices).

Statistical Techniques:

PCA, t-SNE, clustering methods, multidimensional scaling.

3. EDA Techniques in Depth

Data Cleaning

1. Handling Missing Data:

Strategies:

Deletion: Remove rows or columns with missing values.

Imputation: Replace with mean, median, mode, or interpolation.

Flagging: Add an indicator variable for missing values.
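The three strategies above can be compared side by side in pandas. This is a minimal sketch on a made-up table; the choice of median for the numeric column and mode for the categorical one is an assumption, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Flagging: record where values were missing before changing anything.
df["age_was_missing"] = df["age"].isna()

# Deletion: keep only rows with no missing values.
complete_rows = df.dropna()

# Imputation: median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

print("rows kept after deletion:", len(complete_rows))
print("missing values after imputation:", int(imputed.isna().sum().sum()))
```

Creating the flag before imputing matters: once the gaps are filled, the information about which values were originally missing is otherwise lost.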

2. Handling Outliers:

Detect using boxplots, z-scores, or the IQR method.

Options:

Remove.

Transform (log, square root).

Cap values at reasonable thresholds.
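A minimal sketch of outlier detection and capping follows. It assumes the common convention of flagging points more than 1.5 times the IQR beyond the quartiles; that multiplier and the sample values are assumptions, and other thresholds may suit other data.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="response_ms")

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# z-score: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Capping: pull extreme values back to the IQR fences instead of dropping rows.
capped = values.clip(lower=lower, upper=upper)

print("flagged by IQR method:", values[is_outlier].tolist())
print("largest |z|:", round(float(z_scores.abs().max()), 2))
print("max after capping:", capped.max())
```

In small samples an extreme value inflates the mean and standard deviation, so its own z-score can look modest; the IQR method is less affected by this.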

Data Transformation

1. Standardization:

Scale data to have a mean of 0 and standard deviation of 1.

Suitable for techniques sensitive to magnitude (e.g., PCA, clustering).

2. Normalization:

Scale data to range [0, 1].

Useful for features with different units.
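Both scalings can be written in a line each with pandas, as in this sketch. The two columns are hypothetical and were chosen because their units differ by several orders of magnitude.

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "income": [20_000.0, 35_000.0, 50_000.0, 80_000.0, 120_000.0],
})

# Standardization (z-score): each column gets mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std(ddof=0)

# Normalization (min-max): each column is rescaled to the range [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())

print(standardized.round(2))
print(normalized.round(2))
```

Min-max scaling is sensitive to outliers, since a single extreme value sets the range for the whole column.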

3. Encoding Categorical Variables:

One-hot encoding, label encoding, or ordinal encoding.
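The three encodings differ mainly in whether they imply an order. The sketch below shows one possible pandas version of each; the columns and the assumed ordering of sizes are hypothetical.

```python
import pandas as pd

# Hypothetical categorical features.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that respect a known, meaningful order.
size_order = ["small", "medium", "large"]   # assumed ordering
df["size_ordinal"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# Label encoding: an arbitrary integer per category (alphabetical here).
df["color_label"] = df["color"].astype("category").cat.codes

print(one_hot)
print(df)
```

Label-encoded values should not be treated as magnitudes, since the integers carry no meaning beyond identity.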

4. Statistical Analysis

1. Descriptive Statistics:

Mean, median, mode, variance, standard deviation.

Skewness and kurtosis to understand distribution shape.

2. Inferential Statistics:

Hypothesis testing (e.g., t-tests, chi-square tests).

Confidence intervals.

3. Correlation Analysis:

Pearson correlation: Measures linear relationships (-1 to 1).

Spearman correlation: Measures monotonic relationships.
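The difference between the two coefficients shows up on data that rise steadily but not in a straight line. In this sketch the exponential relationship is a made-up example, and Spearman correlation is computed as Pearson correlation on the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 3)    # strictly increasing, but far from a straight line

pearson = x.corr(y)                    # Pearson is the default method
spearman = x.rank().corr(y.rank())     # Spearman: Pearson applied to the ranks

print("Pearson: ", round(float(pearson), 3))
print("Spearman:", round(float(spearman), 3))
```

Spearman reaches 1 because the ordering of x and y agrees perfectly, while Pearson stays lower because the relationship is not linear.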

4. Advanced Multivariate Techniques

1. Principal Component Analysis (PCA):

Reduces dimensionality while preserving most variance.

Helps visualize high-dimensional data in 2D or 3D.

2. Clustering:

Groups similar data points into clusters.

Examples: K-means, hierarchical clustering, DBSCAN.
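To make the idea of grouping concrete, here is a minimal NumPy sketch of K-means (alternating assignment and update steps). The farthest-first initialization is an assumption chosen to keep the example deterministic; library implementations typically use other initialization schemes. The two synthetic blobs are also made up.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Pick the first point, then repeatedly the point farthest from chosen centers."""
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[dists.min(axis=1).argmax()])
    return np.array(centers)

def assign(X, centers):
    """Index of the nearest center for every point."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated groups of 50 points each.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=6.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels, centers = kmeans(X, k=2)
print(np.bincount(labels))
print(centers.round(2))
```

K-means needs the number of clusters up front and assumes roughly spherical groups; hierarchical clustering and DBSCAN relax those assumptions in different ways.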

3. t-SNE:

Visualizes high-dimensional data by projecting it into a 2D or 3D space.

Effective for clustering visualization.

4. Heatmaps:

Displays relationships between variables using color-coded matrices.

Common for correlation analysis.

5. Tools for EDA

Python Libraries:

pandas: Data manipulation and summary statistics.

Matplotlib and seaborn: Visualization.

NumPy: Numerical computations.

SciPy: Statistical computations.

Plotly and Bokeh: Interactive visualizations.

R Libraries:

ggplot2: Data visualization.

dplyr: Data manipulation.

caret: Preprocessing and feature engineering.

Other Tools:

Tableau, Power BI for visualization.

Excel for basic EDA.

6. Practical Tips for Effective EDA

1. Start Simple:

Begin with univariate analysis before diving into complex relationships.

2. Iterative Process:

EDA is not linear; revisit earlier steps as needed.

3. Ask Questions:

What does each variable represent?

Are there any inconsistencies or anomalies?

How do variables interact?

4. Document Findings:

Keep track of key insights and issues for future reference.

5. Automate Repetitive Steps:

Use scripts to standardize data cleaning and visualization.
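One simple form of automation is a small helper that produces the same overview for any table. The function name `quick_overview` and the example data are hypothetical; this is a sketch of the idea rather than a complete profiling tool.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# Hypothetical table used to demonstrate the helper.
example = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "y", "y", None]})

overview = quick_overview(example)
print(overview)
print("duplicate rows:", int(example.duplicated().sum()))
```

Running the same helper on every new dataset makes results comparable across projects and reduces the chance of skipping a check.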

7. Example Code for EDA in Python

# Importing Libraries

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

# Load Dataset ('data.csv' is a placeholder path)

df = pd.read_csv('data.csv')

# Basic Information

df.info() # Data types and non-null counts (prints directly)

print(df.describe()) # Summary statistics

print(df.head()) # First few rows

# Missing Values

print(df.isnull().sum()) # Count missing values per column

# Univariate Analysis ('column_name' is a placeholder)

df['column_name'].hist(bins=30)

plt.title('Histogram of column_name')

plt.show()

# Bivariate Analysis ('feature1' and 'feature2' are placeholders)

sns.scatterplot(x='feature1', y='feature2', data=df)

plt.title('Scatter Plot of Feature1 vs Feature2')

plt.show()

# Correlation Heatmap (numeric columns only)

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

plt.title('Correlation Heatmap')

plt.show()

# Pair Plot

sns.pairplot(df)

plt.show()
