See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/384698814

Exploratory data analysis (EDA) using python: a tutorial

Presentation · October 2024


DOI: 10.13140/RG.2.2.20542.09284/1

1 author:

Sami D. Alaruri ‫ﺳﺎﻣﻲ اﻟﻌﺎروري‬


Independent Researcher
All content following this page was uploaded by Sami D. Alaruri ‫ ﺳﺎﻣﻲ اﻟﻌﺎروري‬on 19 October 2024.

EXPLORATORY DATA ANALYSIS
(EDA) USING PYTHON: A TUTORIAL
SAMI D. ALARURI
OCT., 2024
AGENDA
• Introduction

• Overview

• Data Inspection & Cleaning Steps

• Graphical EDA

• Conclusions

• References
INTRODUCTION: WHAT IS EDA?
Exploratory data analysis (EDA) is an approach for investigating a dataset
and summarizing its main characteristics. A main advantage of EDA is that it
presents the findings of the analysis through data visualization.

Tukey defined data analysis in 1961 as: "Procedures for analyzing data,
techniques for interpreting the results of such procedures, ways of planning
the gathering of data to make its analysis easier, more precise or more
accurate, and all the machinery and results of (mathematical) statistics which
apply to analyzing data."
https://en.wikipedia.org/wiki/Exploratory_data_analysis
OVERVIEW
Obesity is a complex disease involving having too much body fat. Obesity isn't just a cosmetic
concern. It's a medical problem that increases the risk of many other diseases and health
problems. These can include heart disease, diabetes, high blood pressure, high cholesterol, liver
disease, sleep apnea and certain cancers.
There are many reasons why some people have trouble losing weight. Often, obesity results from
inherited, physiological and environmental factors, combined with diet, physical activity and
exercise choices.
The good news is that even modest weight loss can improve or prevent the health problems
associated with obesity. A healthier diet, increased physical activity and behavior changes can
help you lose weight. Prescription medicines and weight-loss procedures are other options for
treating obesity. In this Exploratory Data Analysis (EDA) using Python, we will examine
NObeyesdad (the obesity level of the individual) as a function of several factors (e.g., age,
gender, height and alcohol consumption).
https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742
DATASET INFORMATION
This dataset includes data for the estimation of obesity levels in individuals from the
countries of Mexico, Peru and Colombia, based on their eating habits and physical
condition. The data contains 17 attributes and 2111 records. The records are
labeled with the class variable NObesity (Obesity Level), which allows classification of
the data using the values of Insufficient Weight, Normal Weight, Overweight Level
I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of
the data was generated synthetically using the Weka tool and the SMOTE filter; 23%
of the data was collected directly from users through a web platform.
https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
DATA INSPECTION & CLEANING STEPS
1. Dataset dimensions
2. Titles of columns
3. Data types
4. Missing values
5. Nulls in the dataset (not required for this dataset)
6. Duplicate rows
7. NaN values
8. Infinity values
9. Outlier detection
10. Encode categorical features
LOADING PYTHON LIBRARIES

pandas (https://pandas.pydata.org)

numpy (https://numpy.org)

matplotlib (https://matplotlib.org)
seaborn (https://seaborn.pydata.org)
sklearn (https://scikit-learn.org/stable)
DATA SET ATTRIBUTES

PANDAS FUNCTIONS FOR DATA INSPECTION
• df.head()/df.tail()

• df.sample()

• df.info()

• df.columns

• df.describe()

• df.shape

• df.count()

https://www.geeksforgeeks.org/pandas-functions-in-python/
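The inspection functions listed on this slide can be tried on a small DataFrame. The sketch below is illustrative only: the three rows are made-up sample values, and only the column names (Gender, Age, Weight) follow the obesity dataset.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names follow the obesity dataset.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [21.0, 23.0, 27.0],
    "Weight": [64.0, 77.0, 87.0],
})

first_rows = df.head(2)                     # first 2 rows (default is 5)
last_rows = df.tail(2)                      # last 2 rows (default is 5)
random_row = df.sample(1, random_state=0)   # 1 randomly chosen row
column_names = list(df.columns)             # column titles
n_rows, n_cols = df.shape                   # (rows, columns)
non_null_counts = df.count()                # non-null entries per column
summary = df.describe()                     # summary statistics of numeric columns

df.info()                                   # prints dtypes and non-null counts
print(n_rows, n_cols)
```

df.shape and df.columns are attributes, so they are used without parentheses, while the other entries are method calls.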
USING PANDAS FOR LOADING THE DATA FILE
& VIEWING THE FIRST 5 ROWS

Dataset source: UC Irvine Machine Learning Repository


https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
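A minimal sketch of loading a CSV file with pandas and viewing the first 5 rows. To stay self-contained it reads from an in-memory string with a handful of illustrative rows and only five of the 17 columns; with the downloaded file on disk, the file path would be passed to pd.read_csv instead (the file name in the comment is a hypothetical placeholder).

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file. With a real file on disk, pass its
# path instead, e.g. pd.read_csv("obesity.csv") (hypothetical file name).
csv_text = """Gender,Age,Height,Weight,NObeyesdad
Female,21,1.62,64.0,Normal_Weight
Female,21,1.52,56.0,Normal_Weight
Male,23,1.80,77.0,Normal_Weight
Male,27,1.80,87.0,Overweight_Level_I
Male,22,1.78,89.8,Overweight_Level_II
Male,29,1.62,53.0,Normal_Weight
"""

df = pd.read_csv(io.StringIO(csv_text))

first_five = df.head()   # head() returns the first 5 rows by default
print(first_five)
```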
EXAMINING 5 ROWS AT RANDOM & AT THE END OF
THE DATASET

USING PANDAS FUNCTIONS FOR
VIEWING THE DATA

Dataset size: 2111 rows x 17 columns
DATA STATISTICS & MISSING VALUES
CHECK

Dataset Statistics Summary

Missing Values Check


No missing values
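The statistics summary and the missing values check can be sketched as follows. The four rows are made-up sample values, not taken from the dataset, and the column names are chosen to resemble it.

```python
import pandas as pd

# Made-up sample values for illustration.
df = pd.DataFrame({
    "Age": [21.0, 23.0, 27.0, 22.0],
    "Weight": [64.0, 77.0, 87.0, 89.8],
    "CALC": ["no", "Sometimes", "Frequently", "Sometimes"],
})

stats = df.describe()                    # count, mean, std, min, quartiles, max
missing_per_column = df.isnull().sum()   # number of missing entries per column

print(stats)
print(missing_per_column)
```

By default describe() summarizes only the numeric columns when the DataFrame mixes numeric and text columns, so CALC does not appear in the summary here.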
INSPECTING THE DATASET FOR
DUPLICATE ROWS & DROPPING THE
DUPLICATE ROWS

24 duplicate rows
After dropping the 24 duplicate rows, new dataset size:
2087 rows x 17 columns

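A minimal sketch of counting and dropping duplicate rows with pandas. The four-row DataFrame is a made-up example in which two rows repeat the first one.

```python
import pandas as pd

# Made-up example: rows 1 and 3 are exact copies of row 0.
df = pd.DataFrame({
    "Age": [21, 21, 23, 21],
    "Weight": [64.0, 64.0, 77.0, 64.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

Applied to the obesity dataset, the same two calls would correspond to the step from 2111 rows to 2087 rows reported on this slide.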
NAN VALUES CHECK

No NaN values in the dataset

INFINITY VALUES CHECK

No +∞ values in the dataset

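The NaN check and the infinity check can be sketched together. The DataFrame below is a made-up example that deliberately contains one NaN and one infinite value, so that both counts are non-zero. Restricting the infinity check to the numeric columns is an assumption of this sketch, since np.isinf is not defined for text columns.

```python
import numpy as np
import pandas as pd

# Made-up example with one NaN and one infinite value planted on purpose.
df = pd.DataFrame({
    "Age": [21.0, np.nan, 27.0],
    "Weight": [64.0, 77.0, np.inf],
    "Gender": ["Female", "Male", "Male"],
})

# Total number of NaN entries over all columns.
nan_count = int(df.isna().sum().sum())

# np.isinf flags both +inf and -inf; apply it to numeric columns only.
numeric = df.select_dtypes(include="number")
inf_count = int(np.isinf(numeric.to_numpy()).sum())

print(nan_count, inf_count)
```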
GENERATING HISTOGRAM PLOTS FOR
THE DATASET ATTRIBUTES

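One possible way to generate a histogram for every numeric attribute is the DataFrame.hist method of pandas, which draws with matplotlib. The sketch below uses made-up values; the number of bins and the figure size are arbitrary choices, and the non-interactive "Agg" backend is selected only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values for illustration.
df = pd.DataFrame({
    "Age": [21, 21, 23, 27, 22, 29, 23, 22],
    "Height": [1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.50, 1.64],
    "Weight": [64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.0, 53.0],
})

# One histogram per numeric column; returns a 2-D array of matplotlib Axes.
axes = df.hist(bins=5, figsize=(8, 6))
plt.tight_layout()
# plt.show()  # uncomment to display the figure in an interactive session
```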
HISTOGRAM PLOT SHOWING
THE WEIGHT DISTRIBUTION

SCATTER PLOT DEPICTING AGE
VS. WEIGHT

SCATTER PLOT SHOWING AGE
VS. FCVC

SCATTER PLOT ILLUSTRATING AGE
VS. HEIGHT FOR DIFFERENT
WEIGHTS

USING ‘JOINTPLOT’ FOR PLOTTING
HEIGHT VS. WEIGHT

BAR PLOT SHOWING AGE VS. AVERAGE
WEIGHT FOR THE TOP 15 AGES

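The values behind a bar plot of age vs. average weight can be computed with a groupby. The slide does not state how the "top 15 ages" were selected; the sketch below assumes they are the ages with the highest average weight, and it keeps the top 2 instead of 15 because the made-up sample is tiny.

```python
import pandas as pd

# Made-up sample values for illustration.
df = pd.DataFrame({
    "Age": [21, 21, 23, 23, 27, 30],
    "Weight": [60.0, 70.0, 80.0, 90.0, 100.0, 50.0],
})

# Average weight for each distinct age.
avg_weight_by_age = df.groupby("Age")["Weight"].mean()

# Assumption: "top" means highest average weight; use head(15) on the full dataset.
top_ages = avg_weight_by_age.sort_values(ascending=False).head(2)

print(top_ages)
# top_ages.plot(kind="bar") would draw the bar chart (requires matplotlib).
```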
BARPLOT ILLUSTRATING
CONSUMPTION OF ALCOHOL
FOR FEMALES & MALES

PIE CHART SHOWING THE %
CONSUMPTION OF ALCOHOL

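The percentages shown in a pie chart of a categorical attribute can be obtained with value_counts. The sketch below uses made-up CALC (alcohol consumption) answers; the category labels are assumed to resemble those in the dataset.

```python
import pandas as pd

# Made-up CALC answers for illustration.
calc = pd.Series(
    ["Sometimes", "no", "Sometimes", "Frequently",
     "no", "Sometimes", "Sometimes", "no"],
    name="CALC",
)

# normalize=True returns fractions; multiply by 100 to obtain percentages.
shares = calc.value_counts(normalize=True) * 100

print(shares)
# shares.plot(kind="pie", autopct="%.1f%%") would draw the chart (requires matplotlib).
```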
PLOTTING AGE VS. NOBEYESDAD
USING SCATTER PLOT

PLOTTING THE
DISTRIBUTION OF CH2O

https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
COUNT PLOT SHOWING THE
CATEGORICAL FACTOR ‘CAEC’

COUNT PLOT DEPICTING THE
CATEGORICAL FACTOR ‘CALC’

COUNT PLOT DEPICTING
NOBEYESDAD VS. FREQUENCY

HOW TO INTERPRET A
BOX PLOT?

https://www.atlassian.com/data/charts/box-plot-complete-guide
INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT

INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT

BOXPLOT SHOWING THE NCP FACTOR
VALUES

THE DATA SET FACTORS BEFORE
REMOVING THE OUTLIERS

Example showing the removal of outliers using the diabetes.csv dataset


https://www.kaggle.com/datasets/saurabh00007/diabetescsv
THE DATASET FACTORS AFTER
REMOVING THE OUTLIERS

https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
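One common way to remove outliers is the interquartile range (IQR) rule, which matches the whiskers of a box plot. The slides do not state which method was used, so the sketch below is an assumption: it keeps the rows whose value lies within 1.5 x IQR of the quartiles. The column name "value" and the numbers are made up.

```python
import pandas as pd

# Made-up values; 40 is the planted outlier.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 12, 40]})

q1 = df["value"].quantile(0.25)   # first quartile
q3 = df["value"].quantile(0.75)   # third quartile
iqr = q3 - q1                     # interquartile range

lower = q1 - 1.5 * iqr            # lower fence
upper = q3 + 1.5 * iqr            # upper fence

# Keep only the rows inside the fences (bounds inclusive).
cleaned = df[df["value"].between(lower, upper)]

print(lower, upper, len(cleaned))
```

The factor 1.5 is the conventional choice for box plots; a larger factor removes fewer points.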
WHAT IS A CORRELATION?

https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python
WHAT IS A CORRELATION MATRIX?
A correlation matrix is a tabular representation that displays correlation
coefficients, indicating the strength and direction of relationships
between variables in a dataset. Within this matrix, each cell
signifies the correlation between two specific variables. This tool
serves multiple purposes: a summary of data
relationships, an input for more sophisticated analyses, and a
diagnostic aid for advanced analytical procedures. By presenting a
comprehensive overview of inter-variable correlations, the matrix
becomes invaluable in discerning patterns, guiding further analyses,
and identifying potential areas of interest or concern in the dataset.
Its applications extend beyond mere summary statistics, positioning
it as a fundamental component in the preliminary stages of diverse
and intricate data analyses.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
INTERPRETING THE CORRELATION
MATRIX RESULTS
Strong correlations, indicated by values close to 1 or -1, suggest a
robust connection, while weak correlations, near 0, imply a less
pronounced association. Identifying these degrees of
correlation aids in understanding the intensity of interactions within
the dataset, facilitating targeted analysis and decision-making.
Positive correlations (values > 0) signify that as one variable
increases, the other tends to increase as well. Conversely, negative
correlations (values < 0) imply an inverse relationship—when one
variable increases, the other tends to decrease. Investigating these
directional associations provides insights into how variables influence
each other, crucial for formulating informed hypotheses and
predictions.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
CALCULATING THE PAIRWISE
CORRELATION FOR ALL COLUMNS

https://www.geeksforgeeks.org/python-pandas-dataframe-corr/
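A minimal sketch of computing the pairwise correlation for all columns with DataFrame.corr. The five rows are made-up values chosen so that one pair of columns rises together and another pair moves in opposite directions; only the column names (Height, Weight, FAF) follow the obesity dataset.

```python
import pandas as pd

# Made-up values: Weight rises with Height, FAF falls with Height.
df = pd.DataFrame({
    "Height": [1.50, 1.60, 1.70, 1.80, 1.90],
    "Weight": [50.0, 60.0, 68.0, 80.0, 92.0],
    "FAF": [3.0, 2.0, 2.0, 1.0, 0.0],
})

# Pearson correlation coefficients between every pair of columns.
corr = df.corr()

print(corr.round(2))
```

Each diagonal entry is 1 because every column correlates perfectly with itself, and the matrix is symmetric. With text columns present, they would need to be excluded or encoded first.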
PLOTTING THE CORRELATION
MATRIX HEATMAP

https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
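A correlation matrix heatmap can be drawn in several ways. The sketch below uses plain matplotlib as a stand-in and is not necessarily how the original figure was produced; the data values, the colormap and the figure size are arbitrary choices, and the "Agg" backend is selected only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values for illustration.
df = pd.DataFrame({
    "Height": [1.50, 1.60, 1.70, 1.80, 1.90],
    "Weight": [50.0, 60.0, 68.0, 80.0, 92.0],
    "FAF": [3.0, 2.0, 2.0, 1.0, 0.0],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(4, 4))
# Fix the color scale to [-1, 1] so colors are comparable between plots.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient into its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax)
# plt.show()  # uncomment to display the figure in an interactive session
```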
CORRELATION MATRIX CLUSTER MAP

CONVERTING CATEGORICAL
VARIABLES TO NUMERIC VALUES

Before After

Converting categorical variables to numerical values for use in machine learning predictive models

https://www.geeksforgeeks.org/how-to-convert-categorical-variable-to-numeric-in-pandas/
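Two simple ways to convert categorical columns to numeric values with pandas are sketched below; the slides do not state which encoder was used, so both are assumptions. The rows are made-up, and the ordering assigned to the CALC answers is a hypothetical choice for illustration.

```python
import pandas as pd

# Made-up sample values for illustration.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "CALC": ["no", "Sometimes", "Frequently", "no"],
})

encoded = df.copy()

# Option 1: integer codes assigned in alphabetical order of the categories
# (Female -> 0, Male -> 1).
encoded["Gender"] = encoded["Gender"].astype("category").cat.codes

# Option 2: an explicit mapping, useful when the categories have a natural order.
calc_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}  # hypothetical ordering
encoded["CALC"] = encoded["CALC"].map(calc_order)

print(encoded)
```

An explicit mapping keeps the numeric order meaningful, whereas automatically assigned codes only reflect alphabetical order.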
CONCLUSIONS
• The original dataset contains 2111 rows and 17 columns
• There are 24 duplicate rows in the dataset
• After dropping the duplicate rows, all rows in the dataset are unique and it
contains no null, NaN, +∞ or missing values
• The NCP (number of main meals per day) data contains
some outlier values that required removal
• There is a significant correlation between weight and height

ADDITIONAL EXAMPLES

DISTRIBUTION PLOT ILLUSTRATION

https://www.kaggle.com/datasets/mathchi/diabetes-data-set
PROBABILITY PLOT FOR TWO FACTORS
IN THE DATASET

https://www.kaggle.com/datasets/mathchi/diabetes-data-set
ANOTHER FORM OF HEATMAP
PRESENTATION

https://www.kaggle.com/code/busekseolu/diabetes-classification
REFERENCES
1. Data Science Horizons, Data cleaning and preprocessing for data science beginners.
(https://datasciencehorizons.com/data-cleaning-preprocessing-data-science-beginners-ebook/)
2. Matthieu Komorowski, et al., Exploratory Data Analysis, Chapter 15, doi:10.1007/978-3-319-43742-2_15.
3. DataQuest, Data Science Cheat Sheet-Pandas. (https://s3.amazonaws.com/dq-blog-files/pandas-cheat-sheet.pdf)
4. DataCamp, Python for Data Science Cheat Sheet-Matplotlib.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
5. DataQuest, Data Science Cheat Sheet-Numpy.
(https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
6. DataCamp, Python for Data Science, Seaborn Cheat Sheet.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf)
7. DataCamp, Python for Data Science Cheat Sheet-Scikit-Learn.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
8. David Beazley, et al., Python Cookbook, 3rd Ed., O’Reilly, Beijing, 2013.
