03b EDA-Tutorial
Tutorial
Exploratory Data Analysis
and Tools – Orange and Python
Exploratory Data Analysis
• EDA is an iterative cycle:
• Generate questions about your data
• Search for answers by visualising, transforming, and modelling your data
• Use what you learn to refine your questions and/or generate new questions
https://duo.com/labs/research/gamifying-data-science-education
We consider the popular “iris” data set
- The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis [1] (Wikipedia).
[1] R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. hdl:2440/15227.
Let’s start with Orange
- Load the data set
• What to notice?
- The target is categorical → classification
- The data set is small → might need cross-validation
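The same first checks can be sketched in plain Python. This is only an illustration under stated assumptions: `IRIS_SAMPLE` is a 15-row excerpt (the first five rows of each species) rather than the full 150-row data set, and the column names and the helper `is_number` are chosen here for the example.

```python
import csv
import io
from collections import Counter

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""


def is_number(text):
    """Return True if the string parses as a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
target = "species"
features = [name for name in rows[0] if name != target]

# A target whose values are not numbers is categorical, so this is classification.
target_is_categorical = not all(is_number(row[target]) for row in rows)
class_counts = Counter(row[target] for row in rows)

print("rows:", len(rows), "| features:", features)
print("categorical target:", target_is_categorical)
print("class counts:", dict(class_counts))
```

With so few rows per class, holding out a single test split would leave very little data to learn from, which is the reason the slide suggests cross-validation.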
Orange EDA: A first look
• What to notice?
- The variables are on different scales (e.g. sepal length takes the largest values and petal width the smallest) → might need normalization
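One common way to put the variables on a comparable scale is min-max normalization to [0, 1]. The sketch below is an assumption about what "normalization" could look like here, not the specific method Orange applies; `IRIS_SAMPLE` is a 15-row excerpt of the data set and `min_max_scale` is a name chosen for the example.

```python
import csv
import io

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""

rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
features = [name for name in rows[0] if name != "species"]
columns = {name: [float(row[name]) for row in rows] for name in features}


def min_max_scale(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Raw ranges differ noticeably between variables.
for name, values in columns.items():
    print(f"{name:13s} raw range: {min(values):.1f} to {max(values):.1f}")

# After scaling, every variable spans exactly [0, 1].
scaled = {name: min_max_scale(values) for name, values in columns.items()}
for name, values in scaled.items():
    print(f"{name:13s} scaled range: {min(values):.1f} to {max(values):.1f}")
```

Standardization (subtracting the mean and dividing by the standard deviation) is a frequent alternative when outliers would distort the min and max.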
Orange EDA:
What are the summary statistics of the variables?
• What to notice?
- Centre and spread of each variable; no missing values
- Distribution of each variable across the classes
- Class balance
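The quantities listed above (centre, spread, missing values, per-class behaviour, class balance) can be computed with Python's `statistics` module. This is a rough sketch on a 15-row excerpt of iris, so the numbers will differ from those Orange shows for the full data set; the variable names are assumptions made for the example.

```python
import csv
import io
import statistics
from collections import Counter, defaultdict

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""

rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
features = [name for name in rows[0] if name != "species"]

# Missing values: count empty cells per variable.
missing = {name: sum(1 for row in rows if row[name].strip() == "") for name in features}

# Centre (mean) and spread (sample standard deviation) per variable.
overall = {}
for name in features:
    values = [float(row[name]) for row in rows if row[name].strip()]
    overall[name] = (statistics.mean(values), statistics.stdev(values))

# Mean of each variable within each class.
by_class = defaultdict(list)
for row in rows:
    by_class[row["species"]].append(row)
per_class_mean = {
    species: {name: statistics.mean(float(r[name]) for r in group) for name in features}
    for species, group in by_class.items()
}

# Class balance: number of samples per class.
class_counts = Counter(row["species"] for row in rows)

for name, (mean, sd) in overall.items():
    print(f"{name:13s} mean={mean:.2f} sd={sd:.2f} missing={missing[name]}")
for species, means in per_class_mean.items():
    print(species, {name: round(value, 2) for name, value in means.items()})
print("class counts:", dict(class_counts))
```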
Orange EDA:
What are the most important features?
• What to notice?
- Petal length seems the most important feature and sepal width the least → feature selection
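The slide does not say which score was used for the ranking, so the sketch below uses the one-way ANOVA F-statistic as a stand-in: a feature scores highly when its class means are far apart relative to the spread within each class. It runs on a 15-row excerpt of iris, and `anova_f` is a helper written for this example.

```python
import csv
import io
from collections import defaultdict

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numbers."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the samples around their own group mean.
    ss_within = sum(
        (v - mean) ** 2 for group, mean in zip(groups, group_means) for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
features = [name for name in rows[0] if name != "species"]

by_class = defaultdict(list)
for row in rows:
    by_class[row["species"]].append(row)

scores = {
    name: anova_f([[float(r[name]) for r in group] for group in by_class.values()])
    for name in features
}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:13s} F = {score:.1f}")
```

On this excerpt the ordering agrees with the slide: petal length ranks first and sepal width last. Other scores (for example information gain) may order the middle features differently.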
Orange EDA:
What are the most important features?
• What to notice?
- Correlation lies in [-1, 1]: negative/positive (sign), strong/weak (magnitude)…
- Petal length and petal width look strongly correlated → feature selection
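For reference, the Pearson correlation coefficient between every pair of variables can be computed directly from its definition. The sketch assumes Pearson correlation is the measure shown on the slide, uses a 15-row excerpt of iris, and introduces the helper `pearson` for illustration only.

```python
import csv
import io
import math
from itertools import combinations

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
features = [name for name in rows[0] if name != "species"]
columns = {name: [float(row[name]) for row in rows] for name in features}

# Correlation for every unordered pair of variables, strongest first.
pairs = {
    (a, b): pearson(columns[a], columns[b]) for a, b in combinations(features, 2)
}
for (a, b), r in sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{a:13s} vs {b:13s} r = {r:+.2f}")
```

When two variables are very strongly correlated, as petal length and petal width are, one of them carries little additional information, which is why the slide points to feature selection.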
Orange EDA:
What is the relationship between two variables (e.g. sepal length and sepal width), per class or regardless of class?
- Change variables
- What to notice?
- Compare with the correlation shown previously
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed?
Univariate
• What to notice?
- Graphical presentation of the stats
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?
Multivariate
• What to notice?
- Graphical presentation of the stats per class
- Small sepal length → Iris-setosa class
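A box plot draws the five-number summary (minimum, lower quartile, median, upper quartile, maximum), so the per-class plot can be approximated numerically. The sketch below works on a 15-row excerpt of iris and relies on `statistics.quantiles` with its default method, which may place the quartiles slightly differently from Orange.

```python
import csv
import io
import statistics
from collections import defaultdict

# Assumption: a 15-row excerpt of iris (first five rows of each species).
IRIS_SAMPLE = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
"""

variable = "sepal_length"
rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE)))

# Collect the chosen variable's values separately for each species.
values_by_class = defaultdict(list)
for row in rows:
    values_by_class[row["species"]].append(float(row[variable]))

# Five-number summary per class: the numbers a box plot visualizes.
summary = {}
for species, values in values_by_class.items():
    q1, _, q3 = statistics.quantiles(values, n=4)
    summary[species] = {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
    }

for species, s in summary.items():
    print(
        f"{species:16s} min={s['min']:.2f} q1={s['q1']:.2f} "
        f"median={s['median']:.2f} q3={s['q3']:.2f} max={s['max']:.2f}"
    )
```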
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?
• What to notice?
- Similar to the box plot, but the density/frequency of the samples over the variable's values is also visualized
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?
Change
• What to notice?
- Show the points for clearer visualization
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?
• What to notice?
- Shorter sepal → Iris-setosa
- Longer sepal → more likely Iris-virginica
Orange EDA:
How are the values of the input variables distributed with respect to another variable?