
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/384698814

Exploratory data analysis (EDA) using python: a tutorial

Presentation · October 2024

DOI: 10.13140/RG.2.2.20542.09284/1


1 author:

Sami D. Alaruri ‫ﺳﺎﻣﻲ اﻟﻌﺎروري‬

Independent Researcher

All content following this page was uploaded by Sami D. Alaruri ‫ ﺳﺎﻣﻲ اﻟﻌﺎروري‬on 19 October 2024.


EXPLORATORY DATA ANALYSIS
(EDA) USING PYTHON: A TUTORIAL
SAMI D. ALARURI
OCT., 2024
AGENDA
• Introduction

• Overview

• Data Inspection & Cleaning Steps

• Graphical EDA

• Conclusions

• References
INTRODUCTION: WHAT IS EDA?
Exploratory data analysis (EDA) is an approach for investigating a dataset and
summarizing its main characteristics. A main advantage of EDA is that it presents
the results of the analysis visually.

Tukey defined data analysis in 1961 as: "Procedures for analyzing data,
techniques for interpreting the results of such procedures, ways of planning
the gathering of data to make its analysis easier, more precise or more
accurate, and all the machinery and results of (mathematical) statistics which
apply to analyzing data".
https://en.wikipedia.org/wiki/Exploratory_data_analysis
OVERVIEW
Obesity is a complex disease involving having too much body fat. Obesity isn't just a cosmetic
concern. It's a medical problem that increases the risk of many other diseases and health
problems. These can include heart disease, diabetes, high blood pressure, high cholesterol, liver
disease, sleep apnea and certain cancers.
There are many reasons why some people have trouble losing weight. Often, obesity results from
inherited, physiological and environmental factors, combined with diet, physical activity and
exercise choices.
The good news is that even modest weight loss can improve or prevent the health problems
associated with obesity. A healthier diet, increased physical activity and behavior changes can
help you lose weight. Prescription medicines and weight-loss procedures are other options for
treating obesity.
In this exploratory data analysis (EDA) using Python, we examine NObeyesdad (the
obesity level of the individual) as a function of several factors (e.g., age, gender,
height and alcohol consumption).
https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742
DATASET INFORMATION
This dataset includes data for the estimation of obesity levels in individuals from
Mexico, Peru and Colombia, based on their eating habits and physical condition.
The data contains 17 attributes and 2111 records. The records are labeled with the
class variable NObesity (Obesity Level; the NObeyesdad column used in this tutorial),
which allows classification of the data using the values Insufficient Weight,
Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity
Type II and Obesity Type III. 77% of the data was generated synthetically using the
Weka tool and the SMOTE filter; 23% was collected directly from users through a
web platform.
https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
DATA INSPECTION & CLEANING STEPS
1. Dataset dimensions
2. Titles of columns
3. Data types
4. Missing values
5. Nulls in the dataset (not required for this dataset)
6. Duplicate rows
7. NaN values
8. Infinity values
9. Outlier detection
10. Encode categorical features
LOADING PYTHON LIBRARIES

pandas (https://pandas.pydata.org)

numpy (https://numpy.org)

matplotlib (https://matplotlib.org)
seaborn (https://seaborn.pydata.org)
sklearn (https://scikit-learn.org/stable)
DATA SET ATTRIBUTES

PANDAS FUNCTIONS FOR DATA INSPECTION
• df.head()/df.tail()

• df.sample()

• df.info()

• df.columns

• df.describe()

• df.shape

• df.count()

https://www.geeksforgeeks.org/pandas-functions-in-python/
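
The inspection functions listed above can be tried on a small frame. The sketch below is only an illustration: the five rows are hypothetical values that reuse a few of the dataset's column names, not records from the actual file.

```python
import pandas as pd

# Hypothetical five-row sample; values are made up for illustration.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Age": [21.0, 23.0, 27.0, 22.0, 31.0],
    "Height": [1.62, 1.80, 1.80, 1.52, 1.78],
    "Weight": [64.0, 77.0, 87.0, 56.0, 89.8],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I",
                   "Normal_Weight", "Overweight_Level_II"],
})

print(df.head(2))                      # first rows (default is 5)
print(df.tail(2))                      # last rows (default is 5)
print(df.sample(2, random_state=0))    # random rows; random_state makes it repeatable
df.info()                              # dtypes and non-null counts (prints directly)
print(df.columns.tolist())             # column titles
print(df.describe())                   # summary statistics of the numeric columns
print(df.shape)                        # (rows, columns)
print(df.count())                      # non-null count per column
```

Note that `df.columns` and `df.shape` are attributes, so they are read without parentheses, while the others are method calls.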
USING PANDAS FOR LOADING THE DATA FILE
& VIEWING THE FIRST 5 ROWS

Dataset source: UC Irvine-Machine Learning Repository

https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
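
A minimal sketch of the loading step, under stated assumptions: the header line uses the 17 column names of the UCI file as I understand them, the two data rows are invented, and an in-memory string stands in for the downloaded CSV so the example runs without the file. The file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV. Header: the 17 dataset column names
# (assumed); the two data rows are made up for illustration.
csv_text = """Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
Female,21,1.62,64,yes,no,2,3,Sometimes,no,2,no,0,1,no,Public_Transportation,Normal_Weight
Male,23,1.80,77,yes,no,2,3,Sometimes,no,2,no,1,1,Frequently,Public_Transportation,Normal_Weight
"""

# With the real file, pass its path instead, e.g. pd.read_csv("obesity.csv")
# (hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first 5 rows (this stand-in only has 2)
```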
EXAMINING 5 RANDOM ROWS & THE LAST 5 ROWS OF
THE DATASET

USING PANDAS FUNCTIONS FOR
VIEWING THE DATA

Data set size:

2111 rows x 17 columns
DATA STATISTICS & MISSING VALUES
CHECK

Dataset Statistics Summary

Missing Values Check

No missing values
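
The statistics summary and the missing-values check could look like the sketch below. The frame is hypothetical and contains one deliberately missing value so the check has something to report; on the obesity dataset the slide reports zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one deliberately missing Weight value.
df = pd.DataFrame({
    "Age": [21.0, 23.0, 27.0, 22.0],
    "Weight": [64.0, 77.0, np.nan, 56.0],
    "CALC": ["no", "Sometimes", "Frequently", "Sometimes"],
})

summary = df.describe()                   # count, mean, std, min, quartiles, max
missing_per_column = df.isnull().sum()    # isnull() is an alias of isna()
total_missing = int(missing_per_column.sum())

print(summary)
print(missing_per_column)
print("Total missing values:", total_missing)
```

Because pandas treats `None` and `NaN` alike as missing, the "missing values", "nulls" and "NaN values" steps of the checklist all reduce to this one call.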
INSPECTING THE DATASET FOR
DUPLICATE ROWS & DROPPING THE
DUPLICATE ROWS

24 duplicate rows
After dropping the 24
duplicate rows, new dataset
size:
2087 rows x 17 columns
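
A minimal sketch of the duplicate check and removal, assuming whole-row duplicates are meant. The four rows are hypothetical; the third repeats the first.

```python
import pandas as pd

# Hypothetical sample: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [21.0, 23.0, 21.0, 23.0],
    "Weight": [64.0, 77.0, 64.0, 80.0],
})

n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row
df_clean = df.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape before:", df.shape, "after:", df_clean.shape)
```

`duplicated()` marks only the second and later copies, so the count equals the number of rows that `drop_duplicates()` removes.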

NAN VALUES CHECK

No NaN values in
the dataset

INFINITY VALUES CHECK

No +∞ values in
the data set
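
One way to run an infinity check is sketched below. The frame is hypothetical and has infinities planted in it; restricting the check to numeric columns is an assumption made here because `np.isinf` does not accept text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one +inf and one -inf planted in it.
df = pd.DataFrame({
    "Height": [1.62, 1.80, np.inf],
    "Weight": [64.0, -np.inf, 87.0],
    "Gender": ["Female", "Male", "Male"],
})

numeric = df.select_dtypes(include="number")    # keep numeric columns only
inf_per_column = np.isinf(numeric).sum()         # counts both +inf and -inf
pos_inf_total = int(np.isposinf(numeric.to_numpy()).sum())   # +inf only

print(inf_per_column)
print("Number of +inf values:", pos_inf_total)
```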

GENERATING HISTOGRAM PLOTS FOR
THE DATASET ATTRIBUTES
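
A sketch of how per-attribute histograms can be produced with pandas and matplotlib. The data is randomly generated with hypothetical means and spreads, and the bin count and figure size are arbitrary choices, not the settings used for the slide.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric attributes drawn from normal distributions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(24, 6, 200),
    "Height": rng.normal(1.70, 0.09, 200),
    "Weight": rng.normal(86, 26, 200),
})

# One histogram per numeric column, arranged on a grid of subplots.
axes = df.hist(bins=20, figsize=(9, 6))
plt.tight_layout()
# Call plt.show() or plt.savefig(...) to view the figure.
```

A single attribute can be plotted the same way with `df["Weight"].hist(bins=20)`.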

HISTOGRAM PLOT SHOWING
THE WEIGHT DISTRIBUTION

SCATTER PLOT DEPICTING AGE
VS. WEIGHT
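
A scatter plot of two numeric attributes can be sketched with matplotlib alone. The seven points are hypothetical, and the units in the axis labels (years, kg) are assumed from the dataset description.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Age": [21, 23, 27, 22, 31, 19, 40],
    "Weight": [64.0, 77.0, 87.0, 56.0, 89.8, 53.0, 102.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["Age"], df["Weight"], alpha=0.7)   # one marker per record
ax.set_xlabel("Age (years)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Age vs. Weight")
```

The other scatter plots in this deck follow the same pattern with different column pairs; a third attribute can be encoded through the `c` (color) or `s` (marker size) arguments of `scatter`.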

SCATTER PLOT SHOWING AGE
VS. FCVC

SCATTER PLOT ILLUSTRATING AGE
VS. HEIGHT FOR DIFFERENT
WEIGHTS

USING ‘JOINTPLOT’ FOR PLOTTING
HEIGHT VS. WEIGHT

BAR PLOT SHOWING AGE VS. AVERAGE
WEIGHT FOR THE TOP 15 AGES
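
The numbers behind such a bar plot can be computed with a group-by. The slide does not say how the "top 15 ages" were chosen; the sketch below assumes they are the most frequent ages, and uses hypothetical data with `top_n = 2` to keep the output short.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Age": [21, 21, 23, 23, 23, 27, 31],
    "Weight": [64.0, 70.0, 77.0, 80.0, 83.0, 87.0, 89.8],
})

top_n = 2   # the slide uses 15
top_ages = df["Age"].value_counts().head(top_n).index   # most frequent ages (assumed meaning)

avg_weight = (
    df[df["Age"].isin(top_ages)]   # keep only records with one of those ages
    .groupby("Age")["Weight"]
    .mean()                        # average weight per age
    .sort_index()
)
print(avg_weight)
# avg_weight.plot(kind="bar") would draw the bar chart (requires matplotlib).
```

If the age column holds fractional values, rounding it first (for example with `df["Age"].round()`) may be needed before grouping.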

BARPLOT ILLUSTRATING
CONSUMPTION OF ALCOHOL
FOR FEMALES & MALES

PIE CHART SHOWING THE %
CONSUMPTION OF ALCOHOL
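
The counts behind a by-gender bar plot and the percentages behind a pie chart of alcohol consumption can be sketched as follows. The six records are hypothetical; CALC is taken to be the alcohol-consumption column, as the later count-plot slide suggests.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "CALC": ["no", "Sometimes", "Frequently", "Sometimes", "Sometimes", "no"],
})

# Rows: gender, columns: consumption level, cells: number of records.
counts = pd.crosstab(df["Gender"], df["CALC"])

# Share of each consumption level across all records, in percent.
share = df["CALC"].value_counts(normalize=True) * 100

print(counts)
print(share)
# counts.plot(kind="bar") and share.plot(kind="pie") would draw the charts
# (both require matplotlib).
```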

PLOTTING AGE VS. NOBEYESDAD
USING SCATTER PLOT

PLOTTING THE
DISTRIBUTION OF CH2O

https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
COUNT PLOT SHOWING THE
CATEGORICAL FACTOR ‘CAEC’

COUNT PLOT DEPICTING THE
CATEGORICAL FACTOR ‘CALC’

COUNT PLOT DEPICTING
NOBEYESDAD VS. FREQUENCY

HOW TO INTERPRET A
BOX PLOT?

https://www.atlassian.com/data/charts/box-plot-complete-guide
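
The quantities a box plot draws can be computed directly, which may help when reading one. The sketch assumes the common convention in which the box spans the first to third quartile and points beyond 1.5 times the interquartile range (IQR) from the box are drawn as outliers; the nine values are hypothetical.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])   # box edges and center line
iqr = q3 - q1                                          # height of the box
lower_fence = q1 - 1.5 * iqr                           # whiskers cannot extend past the fences
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

For this sample the box runs from 4 to 8 with the median line at 6, the fences sit at -2 and 14, and the value 30 is the only point plotted as an outlier.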
INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT

INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT

BOXPLOT SHOWING THE NCP FACTOR
VALUES

THE DATA SET FACTORS BEFORE
REMOVING THE OUTLIERS

Example showing the removal of outliers using the diabetes.csv dataset

https://www.kaggle.com/datasets/saurabh00007/diabetescsv
THE DATASET FACTORS AFTER
REMOVING THE OUTLIERS

https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
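
The slides do not state which removal rule was applied. As one common option, the sketch below filters a column with the 1.5 x IQR rule; the helper name `remove_iqr_outliers` and the values are hypothetical, with "Glucose" borrowed as a column name from the diabetes example.

```python
import pandas as pd

def remove_iqr_outliers(frame, column, k=1.5):
    """Return the rows of `frame` whose `column` value lies within the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default.
    within = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[within].reset_index(drop=True)

# Hypothetical values with one extreme reading.
df = pd.DataFrame({"Glucose": [85, 89, 92, 99, 103, 110, 117, 125, 400]})
cleaned = remove_iqr_outliers(df, "Glucose")

print("Rows before:", len(df), "after:", len(cleaned))
```

Whether a flagged value should actually be dropped is a judgment call: an extreme but valid measurement may carry real information.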
WHAT IS A CORRELATION?

https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python
WHAT IS A CORRELATION MATRIX?
A correlation matrix is a table of correlation coefficients that indicate the
strength and direction of the relationships between the variables in a
dataset. Each cell holds the correlation between two specific variables.
The matrix serves as a summary of data relationships, as an input to
more sophisticated analyses, and as a diagnostic aid for advanced
analytical procedures. By giving an overview of the correlations
between variables, it helps in discerning patterns, guiding further
analyses, and identifying potential areas of interest or concern in the
dataset, which makes it a fundamental component of the preliminary
stages of a data analysis.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
INTERPRETING THE CORRELATION
MATRIX RESULTS
Strong correlations, indicated by values close to 1 or -1, suggest a
robust connection, while weak correlations, near 0, imply a less
pronounced association. Identifying these degrees of correlation
aids in understanding the intensity of interactions within the
dataset, facilitating targeted analysis and decision-making.
Positive correlations (values > 0) signify that as one variable
increases, the other tends to increase as well. Conversely, negative
correlations (values < 0) imply an inverse relationship: when one
variable increases, the other tends to decrease. Investigating these
directional associations provides insights into how variables move
together, which is useful for formulating hypotheses and predictions.
A correlation by itself does not establish that one variable
influences the other.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
CALCULATING THE PAIRWISE
CORRELATION FOR ALL COLUMNS

https://www.geeksforgeeks.org/python-pandas-dataframe-corr/
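
A minimal sketch of the pairwise calculation with pandas. The five records are hypothetical, and selecting the numeric columns first is a choice made here so that the text column does not get in the way.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Height": [1.52, 1.62, 1.70, 1.78, 1.80],
    "Weight": [56.0, 64.0, 72.0, 85.0, 87.0],
    "Age": [22, 21, 35, 27, 23],
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
})

# Pearson correlation (the default method) between every pair of numeric columns.
corr = df.select_dtypes(include="number").corr()

print(corr.round(2))
```

The result is a square, symmetric table with ones on the diagonal, since every column correlates perfectly with itself.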
PLOTTING THE CORRELATION
MATRIX HEATMAP

https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
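
A correlation heatmap can be sketched with matplotlib alone, as below; seaborn's `heatmap` function is a common alternative and may be what produced the slide. The data, color map and figure size are hypothetical choices.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Height": [1.52, 1.62, 1.70, 1.78, 1.80],
    "Weight": [56.0, 64.0, 72.0, 85.0, 87.0],
    "Age": [22, 21, 35, 27, 23],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to [-1, 1] so colors are comparable across datasets.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
fig.tight_layout()
```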
CORRELATION MATRIX CLUSTER MAP

CONVERTING CATEGORICAL
VARIABLES TO NUMERIC VALUES

Before After

Converting categorical variables to numerical values for use in machine learning predictive models

https://www.geeksforgeeks.org/how-to-convert-categorical-variable-to-numeric-in-pandas/
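
Two simple pandas-only encodings are sketched below. The slide does not show which encoder was used (the deck lists sklearn among its libraries, so a scikit-learn encoder is possible); the data is hypothetical, and the ordering chosen for CAEC is an assumption about how its levels rank.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "CAEC": ["Sometimes", "Frequently", "no", "Always"],
})

encoded = df.copy()

# Nominal column: integer codes assigned in alphabetical order of the labels.
encoded["Gender"] = encoded["Gender"].astype("category").cat.codes

# Ordinal column: an explicit mapping keeps the intended order of the levels
# (assumed order: no < Sometimes < Frequently < Always).
caec_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
encoded["CAEC"] = encoded["CAEC"].map(caec_order)

print(df)        # before
print(encoded)   # after
```

Integer codes imply an order, which suits ordinal levels; for nominal columns with more than two labels, one-hot encoding with `pd.get_dummies` avoids suggesting an order that does not exist.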
CONCLUSIONS
• The original dataset contains 2111 rows and 17 columns
• There are 24 duplicate rows in the dataset
• After the duplicate rows are dropped, all rows are unique and the
dataset contains no null, NaN, +∞ or missing values
• The NCP (number of main meals per day) data contains
some outlier values that required removal
• There is a significant correlation between weight and height

ADDITIONAL EXAMPLES

DISTRIBUTION PLOT ILLUSTRATION

https://www.kaggle.com/datasets/mathchi/diabetes-data-set
PROBABILITY PLOT FOR TWO FACTORS
IN THE DATASET

https://www.kaggle.com/datasets/mathchi/diabetes-data-set
ANOTHER FORM OF HEATMAP
PRESENTATION

https://www.kaggle.com/code/busekseolu/diabetes-classification
REFERENCES
1. Data Science Horizon, Data cleaning and preprocessing for data science beginners.
(https://datasciencehorizons.com/data-cleaning-preprocessing-data-science-beginners-ebook/)
2. Matthieu Komorowski, et al., Exploratory Data Analysis, Chapter 15, doi:10.1007/978-3-319-43742-2_15.
3. DataQuest, Data Science Cheat Sheet-Pandas. (https://s3.amazonaws.com/dq-blog-files/pandas-cheat-sheet.pdf)
4. DataCamp, Python for Data Science Cheat Sheet-Matplotlib.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
5. DataQuest, Data Science Cheat Sheet-Numpy.
(https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
6. DataCamp, Python for Data Science, Seaborn Cheat Sheet.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf)
7. DataCamp, Python for Data Science Cheat Sheet-Scikit-Learn.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
8. David Beazley, et al., Python Cookbook, 3rd Ed., O’Reilly, Beijing, 2013.
