Lecture 2 EDA 1
EXPLORATORY DATA ANALYSIS (EDA)
What is Exploratory Data Analysis?
Exploratory Data Analysis
Get Data
Preprocessing
Data Mining
WHAT IS EDA?
• The analysis of datasets based on various numerical methods and
graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the
expected.
• Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).
Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Possibly producing some numerical summaries (central tendency,
spread, etc.)
“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
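The numerical summaries mentioned above (central tendency and spread) can be sketched with Python's standard library. This is a minimal illustration, not a prescribed workflow; the five values are assumed purely for the example and do not come from a real data set.

```python
import statistics

# Illustrative values only (assumed for this sketch, not a real data set).
values = [18, 24, 43, 22, 37]

# Central tendency
mean_value = statistics.mean(values)
median_value = statistics.median(values)

# Spread
value_range = max(values) - min(values)
sample_sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

print(f"mean={mean_value:.1f}, median={median_value}, "
      f"range={value_range}, sd={sample_sd:.2f}")
```

Comparing the mean with the median is a quick first check for skew: a large gap between them suggests that a few extreme values are pulling the mean.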
Importance of EDA
EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985)
9 variables for 329 metropolitan areas in the USA
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Is there any relation between economic outlook and crime?
5. What else?
EXAMPLE 2
• In breast cancer research, the main questions of interest might be:
• Does any treatment method result in a higher survival rate? Can a
particular treatment be suggested to a woman with specific
characteristics?
• Is there any difference between patients in terms of survival rates
(e.g. are white women more likely to survive than black women
if both are at the same stage of the disease)?
EXAMPLE 3
• In a project investigating the well-being of teenagers after an
economic hardship, the main questions might be:
• Is there a positive (and significant) effect of economic problems on
distress?
• Which other factors are most related to the distress of teenagers,
e.g. age, gender, …?
What is data?
• A bunch of numbers (usually)
• Each number summarises some property or event of interest
e.g. 18
• Age, Beck Depression Inventory (BDI) score, Income in £’000s
• Data: lots of numbers
• e.g. 18, 24, 43, 22, 37, …
Is there a pattern?
Types of Data
• Quantitative data
• Qualitative data
Attributes
• Data points or Samples are described by attributes.
• Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
• Types
• Nominal or Categorical
• Ordinal
• Binary
• Numerical
Attribute types
• Nominal: categories, states, or “names of things”
• Hair color = {black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Ordinal: values have a meaningful order (ranking), but the size of the
difference between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
• Binary: Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important, e.g., gender
• Asymmetric binary: outcomes not equally important. e.g., medical test
(positive vs. negative)
• Numeric: represents quantity (integer or real-valued)
• Temperature, length, counts, grade point, CGPA, salary etc.
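The attribute type determines which summaries make sense. The sketch below illustrates this with the standard library only; all values, and names such as `SIZE_ORDER` and `test_positive`, are hypothetical and chosen just for the example.

```python
from collections import Counter
from statistics import mean

# Nominal: categories have no order, so only counting and the mode are meaningful.
hair = ["black", "brown", "black", "red", "blond"]  # hypothetical values
mode_hair = Counter(hair).most_common(1)[0][0]

# Ordinal: levels are ranked, so a median level is meaningful, but a mean is not,
# because the distance between levels is unknown.
SIZE_ORDER = ["small", "medium", "large"]
sizes = ["medium", "small", "large", "medium", "small"]  # hypothetical values
ranks = sorted(SIZE_ORDER.index(s) for s in sizes)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]

# Binary: a nominal attribute with two states; coded 0/1, its mean is a proportion.
test_positive = [0, 1, 0, 0, 1]  # hypothetical medical test results
positive_rate = sum(test_positive) / len(test_positive)

# Numeric: quantities, so arithmetic summaries such as the mean are meaningful.
salaries = [32000, 45000, 51000]  # hypothetical values
mean_salary = mean(salaries)

print(mode_hair, median_size, positive_rate, mean_salary)
```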
DISCRETE vs. continuous attributes
A sample Dataset
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 72 95 FALSE no
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
rainy 71 91 TRUE no
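A first look at this data set might be sketched as follows, using only the standard library. The table is copied from above; the parsing approach and the variable names (`rows`, `play_counts`, `outlook_play`) are assumptions made for illustration.

```python
from collections import Counter

RAW = """\
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 72 95 FALSE no
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
rainy 71 91 TRUE no
"""

lines = RAW.strip().splitlines()
header = lines[0].split()

rows = []
for line in lines[1:]:
    record = dict(zip(header, line.split()))
    # Convert each column to a type that matches its attribute type.
    record["temperature"] = int(record["temperature"])  # numeric
    record["humidity"] = int(record["humidity"])        # numeric
    record["windy"] = record["windy"] == "TRUE"         # binary
    rows.append(record)

# Frequency of the class attribute, and a cross-tabulation against outlook.
play_counts = Counter(r["play"] for r in rows)
outlook_play = Counter((r["outlook"], r["play"]) for r in rows)

print(play_counts)
for (outlook, play), n in sorted(outlook_play.items()):
    print(f"{outlook:<9} play={play:<3} {n}")
```

Here `outlook` is nominal, `windy` and `play` are binary, and `temperature` and `humidity` are numeric, so counts suit the former and arithmetic summaries suit the latter.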
Univariate Analysis
• Boxplot
• Histogram
Bivariate Analysis
• Scatter plot
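The quantities behind these plots can be computed without a plotting library. The sketch below uses the temperature and humidity columns of the sample data set. It computes the five-number summary that a boxplot draws, the bin counts that a histogram draws, and the Pearson correlation that summarises a scatter plot. The function names and the bin width of 5 are assumptions made for illustration.

```python
import math
import statistics
from collections import Counter

# Temperature and humidity columns of the sample data set.
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]


def five_number_summary(xs):
    """Univariate: min, Q1, median, Q3, max (the ingredients of a boxplot)."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)
    return min(xs), q1, q2, q3, max(xs)


def histogram_counts(xs, width):
    """Univariate: number of values in each bin [start, start + width)."""
    counts = Counter((x // width) * width for x in xs)
    return dict(sorted(counts.items()))


def pearson(xs, ys):
    """Bivariate: linear correlation between two numeric attributes."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


summary = five_number_summary(temperature)
hist = histogram_counts(temperature, width=5)
r = pearson(temperature, humidity)

print("five-number summary:", summary)
print("histogram bins:", hist)
print(f"correlation(temperature, humidity) = {r:.3f}")
```

Quartile conventions differ between tools, so the Q1 and Q3 values from `statistics.quantiles` may not match those drawn by a particular plotting package.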