EDAusingpython_SAlaruri
All content following this page was uploaded by Sami D. Alaruri ﺳﺎﻣﻲ اﻟﻌﺎروري on 19 October 2024.
• Overview
• Graphical EDA
• Conclusions
• References
INTRODUCTION: WHAT IS EDA?
Exploratory data analysis (EDA) is a technique for investigating a dataset
and summarizing its main characteristics. A main advantage of EDA is that
the results of the analysis are presented visually, which makes patterns in
the data easier to see.
Tukey defined data analysis in 1961 as: "Procedures for analyzing data,
techniques for interpreting the results of such procedures, ways of planning
the gathering of data to make its analysis easier, more precise or more
accurate, and all the machinery and results of (mathematical) statistics which
apply to analyzing data".
https://en.wikipedia.org/wiki/Exploratory_data_analysis
3
OVERVIEW
Obesity is a complex disease involving having too much body fat. Obesity isn't just a cosmetic
concern. It's a medical problem that increases the risk of many other diseases and health
problems. These can include heart disease, diabetes, high blood pressure, high cholesterol, liver
disease, sleep apnea and certain cancers.
There are many reasons why some people have trouble losing weight. Often, obesity results from
inherited, physiological and environmental factors, combined with diet, physical activity and
exercise choices.
The good news is that even modest weight loss can improve or prevent the health problems
associated with obesity. A healthier diet, increased physical activity and behavior changes can
help you lose weight. Prescription medicines and weight-loss procedures are other options for
treating obesity. In this exploratory data analysis (EDA) using Python, we examine NObeyesdad
(the obesity level of the individual) as a function of several factors (e.g., age, gender,
height, and alcohol consumption).
https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742
4
DATASET INFORMATION
This dataset includes data for the estimation of obesity levels in individuals from
Mexico, Peru and Colombia, based on their eating habits and physical
condition. The data contains 17 attributes and 2111 records. The records are
labeled with the class variable NObesity (Obesity Level; the NObeyesdad column in
the data), which allows classification of the data using the values Insufficient
Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I,
Obesity Type II and Obesity Type III. 77% of the data was generated synthetically
using the Weka tool and the SMOTE filter, and 23% of the data was collected
directly from users through a web platform.
https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition
5
DATA INSPECTION & CLEANING STEPS
1. Dataset dimensions
2. Titles of columns
3. Data types
4. Missing values
5. Nulls in the dataset (not required for this dataset)
6. Duplicate rows
7. NaN values
8. Infinity values
9. Outlier detection
10. Encode categorical features
6
LOADING PYTHON LIBRARIES
pandas (https://pandas.pydata.org)
numpy (https://numpy.org)
matplotlib (https://matplotlib.org)
seaborn (https://seaborn.pydata.org)
sklearn (https://scikit-learn.org/stable)
7
DATA SET ATTRIBUTES
8
PANDAS FUNCTIONS FOR DATA INSPECTION
• df.head()/df.tail()
• df.sample()
• df.info()
• df.columns
• df.describe()
• df.shape
• df.count()
https://www.geeksforgeeks.org/pandas-functions-in-python/
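The inspection functions listed above can be tried on a small frame. This is a minimal sketch: the four rows are made-up values standing in for the obesity dataset, and only the column names are taken from the attributes discussed in these slides.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (illustrative values only).
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [21.0, 23.0, 27.0, 22.0],
    "Height": [1.62, 1.80, 1.80, 1.78],
    "Weight": [64.0, 77.0, 87.0, 89.8],
})

print(df.head(2))                    # first rows (default is 5)
print(df.tail(2))                    # last rows (default is 5)
print(df.sample(2, random_state=0))  # random rows, seeded for repeatability
df.info()                            # dtypes and non-null counts (prints directly)
print(df.columns)                    # column titles
print(df.describe())                 # summary statistics of the numeric columns
print(df.shape)                      # (number of rows, number of columns)
print(df.count())                    # non-null count per column
```

Note that df.columns and df.shape are attributes, so they are written without parentheses.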
9
USING PANDAS FOR LOADING THE DATA FILE
& VIEWING THE FIRST 5 ROWS
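A minimal sketch of this step, assuming the data is stored as a CSV file. The file name in the comment is hypothetical, and the inline rows are made-up sample values used only so that the example runs on its own.

```python
import io

import pandas as pd

# With the real file this would be a single call, for example:
#   df = pd.read_csv("obesity_dataset.csv")   # hypothetical file name
# Here a small made-up CSV is read from memory instead.
csv_text = """Gender,Age,Height,Weight,CALC,NObeyesdad
Female,21,1.62,64.0,no,Normal_Weight
Female,21,1.52,56.0,Sometimes,Normal_Weight
Male,23,1.80,77.0,Frequently,Normal_Weight
Male,27,1.80,87.0,Frequently,Overweight_Level_I
Male,22,1.78,89.8,Sometimes,Overweight_Level_II
Male,29,1.62,53.0,Sometimes,Normal_Weight
"""
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()   # head() returns the first 5 rows by default
print(first_rows)
```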
11
USING PANDAS FUNCTIONS FOR
VIEWING THE DATA
24 duplicate rows
After dropping the 24
duplicate rows, the new dataset
size is:
2087 rows x 17 columns
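Counting and dropping duplicate rows can be sketched as follows. The frame is a made-up five-row example, not the obesity data, so the counts differ from the 24 duplicates reported on this slide.

```python
import pandas as pd

# Made-up frame in which two rows repeat earlier rows exactly.
df = pd.DataFrame({
    "Age": [21, 21, 23, 27, 23],
    "Weight": [64.0, 64.0, 77.0, 87.0, 77.0],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)     # number of duplicate rows
print(df_clean.shape)   # size after removal
```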
14
NAN VALUES CHECK
No NaN values in
the dataset
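One way to run such a check is to count missing entries per column with isna(). The frame below is made up and deliberately contains missing entries so that the check has something to report; on the obesity dataset the slides report a count of zero.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
df = pd.DataFrame({
    "Age": [21.0, np.nan, 23.0],
    "CALC": ["no", "Sometimes", None],
})

nan_per_column = df.isna().sum()        # missing entries in each column
total_nan = int(nan_per_column.sum())   # missing entries in the whole frame

print(nan_per_column)
print(total_nan)
```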
15
INFINITY VALUES CHECK
No +∞ values in
the data set
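A possible form of this check uses numpy.isinf on the numeric columns. The frame is made up and contains infinite values on purpose; note that isinf flags both +∞ and -∞, which is slightly broader than the +∞ check named on the slide.

```python
import numpy as np
import pandas as pd

# Made-up frame with one +inf and one -inf entry.
df = pd.DataFrame({
    "Height": [1.62, np.inf, 1.80],
    "Weight": [64.0, 77.0, -np.inf],
    "Gender": ["Female", "Male", "Male"],
})

# isinf only makes sense for numbers, so text columns are left out first.
numeric = df.select_dtypes(include="number")
inf_per_column = np.isinf(numeric).sum()
total_inf = int(inf_per_column.sum())

print(inf_per_column)
print(total_inf)
```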
16
GENERATING HISTOGRAM PLOTS FOR
THE DATASET ATTRIBUTES
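Histograms for all numeric attributes can be drawn in one call with DataFrame.hist. This is a sketch on randomly generated stand-in data; the means and spreads are arbitrary choices, not statistics of the obesity dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (arbitrary parameters).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(24, 6, 200),
    "Height": rng.normal(1.70, 0.09, 200),
    "Weight": rng.normal(86, 26, 200),
})

# One histogram per numeric column, arranged on a grid of subplots.
axes = df.hist(bins=20, figsize=(9, 6))
plt.tight_layout()
# In an interactive session, plt.show() would display the figure.
```

A single attribute, such as the weight distribution, can be plotted the same way with df["Weight"].hist(bins=20).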
17
HISTOGRAM PLOT SHOWING
THE WEIGHT DISTRIBUTION
18
SCATTER PLOT DEPICTING AGE
VS. WEIGHT
19
SCATTER PLOT SHOWING AGE
VS. FCVC
20
SCATTER PLOT ILLUSTRATING AGE
VS. HEIGHT FOR DIFFERENT
WEIGHTS
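A scatter plot of two attributes with a third shown as color can be sketched with matplotlib as below. The data is randomly generated for illustration, and encoding weight as marker color is an assumption about how "for different weights" was shown on the slide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (arbitrary parameters).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.uniform(16, 60, 150),
    "Height": rng.normal(1.70, 0.09, 150),
    "Weight": rng.normal(86, 26, 150),
})

fig, ax = plt.subplots()
# Age on x, height on y, and weight mapped to marker color.
points = ax.scatter(df["Age"], df["Height"], c=df["Weight"], cmap="viridis")
fig.colorbar(points, ax=ax, label="Weight")
ax.set_xlabel("Age")
ax.set_ylabel("Height")
```

Dropping the c and cmap arguments gives a plain two-variable scatter plot such as age vs. weight.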
21
USING ‘JOINTPLOT’ FOR PLOTTING
HEIGHT VS. WEIGHT
22
BAR PLOT SHOWING AGE VS. AVERAGE
WEIGHT FOR THE TOP 15 AGES
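The slide title does not say how the "top 15 ages" were chosen. The sketch below assumes they are the 15 most frequent ages and plots the mean weight for each; the data is randomly generated for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (arbitrary parameters).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "Age": rng.integers(16, 45, 300),
    "Weight": rng.normal(86, 26, 300),
})

# Assumption: "top 15 ages" means the 15 ages that occur most often.
top_ages = df["Age"].value_counts().head(15).index

# Mean weight for each of those ages, ordered by age for plotting.
avg_weight = (
    df[df["Age"].isin(top_ages)]
    .groupby("Age")["Weight"]
    .mean()
    .sort_index()
)

fig, ax = plt.subplots()
avg_weight.plot(kind="bar", ax=ax)
ax.set_xlabel("Age")
ax.set_ylabel("Average weight")
```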
23
BARPLOT ILLUSTRATING
CONSUMPTION OF ALCOHOL
FOR FEMALES & MALES
24
PIE CHART SHOWING THE %
CONSUMPTION OF ALCOHOL
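Percentage shares for a pie chart can be obtained with value_counts(normalize=True). In this sketch the eight CALC answers are made up, and the category labels (no, Sometimes, Frequently, Always) are assumed to match the labels used in the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up answers for CALC (consumption of alcohol); labels are assumed.
calc = pd.Series(
    ["no", "Sometimes", "Sometimes", "Frequently",
     "Sometimes", "no", "Always", "Sometimes"],
    name="CALC",
)

# Share of each category as a percentage of all records.
shares = calc.value_counts(normalize=True) * 100

fig, ax = plt.subplots()
ax.pie(shares, labels=shares.index, autopct="%1.1f%%")
ax.set_title("Consumption of alcohol (CALC)")
```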
25
PLOTTING AGE VS. NOBEYESDAD
USING SCATTER PLOT
26
PLOTTING THE
DISTRIBUTION OF CH2O
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
27
COUNT PLOT SHOWING THE
CATEGORICAL FACTOR ‘CAEC’
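A count plot shows how many records fall into each category of a factor. A minimal sketch with seaborn's countplot on made-up CAEC answers; the category labels and their order are assumptions, not values read from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up answers for the categorical factor CAEC (labels are assumed).
df = pd.DataFrame({
    "CAEC": ["Sometimes", "Sometimes", "Frequently", "no",
             "Sometimes", "Always", "Sometimes", "Frequently"],
})

# One bar per category; 'order' fixes the left-to-right order of the bars.
ax = sns.countplot(
    data=df, x="CAEC", order=["no", "Sometimes", "Frequently", "Always"]
)
ax.set_ylabel("Frequency")
```

The same call with x="CALC" or x="NObeyesdad" would produce the other count plots in this part of the deck.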
28
COUNT PLOT DEPICTING THE
CATEGORICAL FACTOR ‘CALC’
29
COUNT PLOT DEPICTING
NOBEYESDAD VS. FREQUENCY
30
HOW TO INTERPRET A
BOX PLOT?
https://www.atlassian.com/data/charts/box-plot-complete-guide
31
INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT
32
INSPECTING THE DATASET FOR
OUTLIERS USING BOXPLOT
33
BOXPLOT SHOWING THE NCP FACTOR
VALUES
34
THE DATA SET FACTORS BEFORE
REMOVING THE OUTLIERS
https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
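The slides do not state which rule was used to remove outliers. One common choice, sketched here as an assumption, is the interquartile range (IQR) rule that also defines the whiskers of a box plot: values more than 1.5 x IQR below the first quartile or above the third quartile are treated as outliers. The weights are made-up values.

```python
import pandas as pd

# Made-up weights with one obviously extreme value.
df = pd.DataFrame({"Weight": [60.0, 65.0, 70.0, 72.0, 75.0, 80.0, 85.0, 200.0]})

q1 = df["Weight"].quantile(0.25)   # first quartile
q3 = df["Weight"].quantile(0.75)   # third quartile
iqr = q3 - q1                      # interquartile range

lower = q1 - 1.5 * iqr             # lower fence
upper = q3 + 1.5 * iqr             # upper fence

# Keep only the rows whose value lies between the two fences.
inside = df["Weight"].between(lower, upper)
df_no_outliers = df[inside].reset_index(drop=True)

print(lower, upper)
print(df_no_outliers)
```

The same steps could be applied to any numeric column, such as NCP.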
36
WHAT IS A CORRELATION?
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python
37
WHAT IS A CORRELATION MATRIX?
A correlation matrix is a tabular representation that displays correlation
coefficients, indicating the strength and direction of relationships
between variables in a dataset. Within this matrix, each cell
gives the correlation between two specific variables. The matrix
serves multiple purposes: it is a summary of data
relationships, an input for more sophisticated analyses, and a
diagnostic aid for advanced analytical procedures. By presenting a
comprehensive overview of inter-variable correlations, the matrix
helps in discerning patterns, guiding further analyses,
and identifying potential areas of interest or concern in the dataset.
Its applications extend beyond summary statistics, which makes
it a fundamental component in the preliminary stages of many
data analyses.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
38
INTERPRETING THE CORRELATION
MATRIX RESULTS
Strong correlations, indicated by values close to 1 or -1, suggest a
robust connection, while weak correlations, near 0, imply a less
pronounced association. Identifying these degrees of
correlation aids in understanding the intensity of interactions within
the dataset, facilitating targeted analysis and decision-making.
Positive correlations (values > 0) signify that as one variable
increases, the other tends to increase as well. Conversely, negative
correlations (values < 0) imply an inverse relationship—when one
variable increases, the other tends to decrease. Investigating these
directional associations provides insights into how variables influence
each other, crucial for formulating informed hypotheses and
predictions.
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
39
CALCULATING THE PAIRWISE
CORRELATION FOR ALL COLUMNS
https://www.geeksforgeeks.org/python-pandas-dataframe-corr/
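DataFrame.corr computes the pairwise correlation of the columns, using the Pearson coefficient by default. In this sketch the values are made up, and weight is an exact linear function of height so that the expected coefficient is easy to verify.

```python
import pandas as pd

# Made-up values; Weight = 100 * Height - 100, so the two are perfectly correlated.
df = pd.DataFrame({
    "Height": [1.5, 1.6, 1.7, 1.8, 1.9],
    "Weight": [50.0, 60.0, 70.0, 80.0, 90.0],
    "Age": [40, 20, 35, 25, 30],
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
})

# numeric_only=True leaves out text columns such as Gender.
corr = df.corr(numeric_only=True)
print(corr)
```

Each diagonal entry is 1 because every column is perfectly correlated with itself.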
40
PLOTTING THE CORRELATION
MATRIX HEATMAP
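A correlation matrix is often easier to read as a color-coded grid. A minimal sketch with seaborn's heatmap on made-up values; the colormap and the fixed color range of -1 to 1 are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up numeric data.
df = pd.DataFrame({
    "Height": [1.5, 1.6, 1.7, 1.8, 1.9],
    "Weight": [52.0, 58.0, 71.0, 79.0, 93.0],
    "Age": [40, 20, 35, 25, 30],
})
corr = df.corr()

# annot=True writes each coefficient into its cell;
# vmin/vmax pin the color scale to the full correlation range.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```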
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
41
CORRELATION MATRIX CLUSTER MAP
42
CONVERTING CATEGORICAL
VARIABLES TO NUMERIC VALUES
Before After
Converting categorical variables to numerical values for use in machine learning predictive models
https://www.geeksforgeeks.org/how-to-convert-categorical-variable-to-numeric-in-pandas/
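The slides do not show which encoding method was applied. The sketch below illustrates two pandas options on made-up rows: category codes for an unordered factor, and an explicit mapping for an ordered one. The CALC labels and the order chosen for them are assumptions.

```python
import pandas as pd

# Made-up rows with two categorical columns.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "CALC": ["no", "Sometimes", "Frequently", "Sometimes"],
})

encoded = df.copy()

# Unordered factor: category codes follow alphabetical order (Female=0, Male=1).
encoded["Gender"] = encoded["Gender"].astype("category").cat.codes

# Ordered factor: an explicit mapping keeps the intended ranking (assumed order).
calc_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
encoded["CALC"] = encoded["CALC"].map(calc_order)

print(df)        # before
print(encoded)   # after
```

An explicit mapping is worth the extra line when the categories have a natural order, because automatically assigned codes follow alphabetical order instead.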
43
CONCLUSIONS
• The original dataset contains 2111 rows and 17 columns
• There are 24 duplicate rows in the dataset
• After the duplicate rows are dropped, all rows are unique and the
dataset contains no null, NaN, +∞ or missing values
• The NCP (number of main meals per day) data contains
some outliers, which required removal
• There is a significant correlation between weight and height
44
ADDITIONAL EXAMPLES
45
DISTRIBUTION PLOT ILLUSTRATION
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
46
PROBABILITY PLOT FOR TWO FACTORS
IN THE DATASET
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
47
ANOTHER FORM OF HEATMAP
PRESENTATION
https://www.kaggle.com/code/busekseolu/diabetes-classification
48
REFERENCES
1. Data Science Horizon, Data cleaning and preprocessing for data science beginners.
(https://datasciencehorizons.com/data-cleaning-preprocessing-data-science-beginners-ebook/)
2. Matthieu Komorowski, et al., Exploratory Data Analysis, Chapter 15, doi:10.1007/978-3-319-43742-2_15.
3. DataQuest, Data Science Cheat Sheet-Pandas. (https://s3.amazonaws.com/dq-blog-files/pandas-cheat-sheet.pdf)
4. DataCamp, Python for Data Science Cheat Sheet-Matplotlib.
(https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
5. DataQuest, Data Science Cheat Sheet-Numpy.
(https://assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
49