03b EDA-Tutorial

Uploaded by

Van loi Ha

Lesson 03 - Tutorial

Tutorial
Exploratory Data Analysis
and Tools – Orange and Python
Exploratory Data Analysis
• EDA is an iterative cycle:
  • Generate questions about your data
  • Search for answers by visualising, transforming, and modelling your data
  • Use what you learn to refine your questions and/or generate new questions

https://duo.com/labs/research/gamifying-data-science-education

We are considering the popular “iris” data set
- The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis [1] (Wikipedia).

• UCI Machine Learning Repository

[1] R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. hdl:2440/15227.
Let’s start with Orange
- Load the data set

• What to notice?
- Type of target is categorical → classification
- Data size is small → might need cross validation
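
The same first checks can be sketched in Python. This is a minimal sketch that assumes scikit-learn's bundled copy of the Iris data matches the file loaded in Orange; the "species" column name is introduced here for illustration only.

```python
# Sketch only: uses scikit-learn's bundled Iris data (an assumption,
# the slides load the data set through Orange).
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four numeric features plus an integer-coded "target" column

# Map the integer codes to species names so the target is explicitly categorical.
# "species" is a hypothetical column name chosen for this example.
names = {code: str(name) for code, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(names).astype("category")

n_rows, n_cols = df.shape
categories = df["species"].cat.categories.tolist()

print(f"{n_rows} rows, {n_cols} columns")  # data size
print(df.dtypes)                            # variable types
print(categories)                           # target classes
```

A categorical target with a handful of classes points to classification, and a row count this small is why cross validation may be preferable to a single train/test split.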
Orange EDA: A first look

• What to notice?
- Variables are on different scales (e.g. sepal length takes the largest values and petal width the smallest)
→ might need normalization
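
A possible way to check the scales and normalize them in Python is sketched below. It assumes scikit-learn's bundled Iris data and uses standardization (zero mean, unit variance) as one normalization choice; the slides do not say which method Orange would apply.

```python
# Sketch only: standardization is one of several normalization options.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris(as_frame=True).data  # the four numeric input variables

# Compare raw scales: minimum, maximum and mean differ between variables.
scales = X.agg(["min", "max", "mean"]).round(2)
print(scales)

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After scaling, every variable contributes on a comparable scale, which matters for distance-based methods.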
Orange EDA: A first look

• What to notice?
- Type of target is categorical → classification
- Data size is small → might need cross validation
Orange EDA:
What are the stats of the variables?

• What to notice?
- Centre and spread of each variable; no missing values
- Distributions of variables over classes
- Class balance
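
These statistics can also be computed with pandas. The sketch below assumes scikit-learn's bundled Iris data; the variable names are illustrative.

```python
# Sketch only: summary statistics, missing values and class balance with pandas.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Centre (mean, median) and spread (std, min, max) of each variable.
summary = X.describe().T[["mean", "50%", "std", "min", "max"]]

# Number of missing values per variable.
missing = X.isna().sum()

# Number of samples per class, to check class balance.
names = {code: str(name) for code, name in enumerate(iris.target_names)}
class_counts = y.map(names).value_counts()

print(summary.round(2))
print(missing)
print(class_counts)
```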
Orange EDA:
What are the most important features?

• What to notice?
- Petal length seems to be the most important feature and sepal width the least important
→ Feature selection
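
A comparable ranking can be sketched in Python. The slides do not state which scoring method Orange used, so the ANOVA F-statistic (scikit-learn's f_classif) below is an assumed stand-in; other scores may order the features differently.

```python
# Sketch only: ranks features by the ANOVA F-statistic, an assumed scoring method.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# One F-score per feature: higher means the class means differ more
# relative to the within-class variance.
f_scores, p_values = f_classif(X, y)
ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

print(ranking.round(1))
```

Features at the bottom of such a ranking are the first candidates to drop during feature selection.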
Orange EDA:
What are the most important features?

• What to notice?
- Correlation values lie in [-1, 1]: check the sign (negative/positive), the strength (strong/weak)…
- Petal length and petal width look strongly correlated
→ Feature selection
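
The pairwise correlations can be reproduced with pandas. This sketch assumes Pearson correlation and scikit-learn's bundled Iris data; the slides do not say which correlation measure Orange displayed.

```python
# Sketch only: Pearson correlation matrix of the four input variables.
from sklearn.datasets import load_iris

X = load_iris(as_frame=True).data

corr = X.corr()  # Pearson by default; every value lies in [-1, 1]
print(corr.round(2))

# Correlation between the two petal measurements.
r_petal = corr.loc["petal length (cm)", "petal width (cm)"]
print(round(r_petal, 2))
```

When two features are this strongly correlated, one of them carries little extra information, which is why the pair is a candidate for feature selection.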
Orange EDA:
What is the relationship between two variables (e.g. the sepal length and width), per class or regardless of class?

- Change variables
- What to notice?
- Compare with the correlation shown previously
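
The difference between looking at all samples together and looking at each class separately can be quantified in Python. The sketch below assumes Pearson correlation and scikit-learn's bundled Iris data; the "species" column name is illustrative.

```python
# Sketch only: overall vs per-class correlation of sepal length and sepal width.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
names = {code: str(name) for code, name in enumerate(iris.target_names)}
df["species"] = iris.target.map(names)  # hypothetical column name

x_col, y_col = "sepal length (cm)", "sepal width (cm)"

# Regardless of class: one correlation over all 150 samples.
overall_r = df[x_col].corr(df[y_col])

# Per class: one correlation for each species.
per_class_r = {
    name: group[x_col].corr(group[y_col])
    for name, group in df.groupby("species")
}

print(round(overall_r, 2))
print({name: round(r, 2) for name, r in per_class_r.items()})
```

The overall correlation is weakly negative while each per-class correlation is positive, so the class structure changes how the relationship reads.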
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed?

Univariate

• What to notice?
- Graphical presentation for the stats
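
The numbers behind such a plot can be computed directly. This is a minimal sketch assuming scikit-learn's bundled Iris data; the choice of 8 histogram bins is arbitrary.

```python
# Sketch only: five-number summary and histogram counts for sepal length.
import numpy as np
from sklearn.datasets import load_iris

sepal_length = load_iris(as_frame=True).data["sepal length (cm)"].to_numpy()

# Five-number summary: the values a box plot is drawn from.
minimum, q1, median, q3, maximum = np.percentile(sepal_length, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)

# Counts over 8 equal-width bins (an arbitrary choice), a text stand-in
# for a distribution plot.
counts, edges = np.histogram(sepal_length, bins=8)
print(counts)
```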
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?

Multivariate

• What to notice?
- Graphical presentation for the stats per class
- Small sepal length → Iris-setosa class
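
The per-class statistics can be tabulated with a pandas group-by. The sketch assumes scikit-learn's bundled Iris data, where the species are named "setosa", "versicolor" and "virginica" rather than with Orange's "Iris-" prefix.

```python
# Sketch only: sepal length statistics per species.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
names = {code: str(name) for code, name in enumerate(iris.target_names)}
species = iris.target.map(names)

# One row of summary statistics per class.
per_class = iris.data["sepal length (cm)"].groupby(species).describe()
print(per_class[["mean", "min", "25%", "50%", "75%", "max"]].round(2))

# Class with the smallest mean sepal length.
smallest_mean_class = per_class["mean"].idxmin()
print(smallest_mean_class)
```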
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?

• What to notice?
- Similar to a box plot, but the density/frequency of samples across the variable's values is also visualized
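
The density shown in such a plot can be approximated numerically. The sketch below assumes a Gaussian kernel density estimate with SciPy's default bandwidth; the slides do not say how Orange estimates the density, and the 4 to 8 cm evaluation grid is an arbitrary choice.

```python
# Sketch only: per-class kernel density estimate of sepal length.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
sepal_length = iris.data["sepal length (cm)"]
grid = np.linspace(4.0, 8.0, 81)  # evaluation points in cm (arbitrary grid)

densities = {}
for code, name in enumerate(iris.target_names):
    values = sepal_length[iris.target == code].to_numpy()
    # Default bandwidth (Scott's rule); an assumed setting.
    densities[str(name)] = gaussian_kde(values)(grid)

# Grid value at which each class's estimated density peaks.
peaks = {name: float(grid[d.argmax()]) for name, d in densities.items()}
print(peaks)
```

The peak locations show where each class concentrates, which a box plot alone does not reveal.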
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?

Change

• What to notice?
- Show the points for clearer visualization
Orange EDA:
How are the values of a certain variable (e.g. sepal length) distributed per target class (iris species)?

• What to notice?
- Shorter sepal → Iris-setosa
- Longer sepal → more likely Iris-virginica
Orange EDA:
How are the values of input variables distributed w.r.t. another variable?
