Lecture 2 EDA 1

Uploaded by

tama1999tonni
EXPLORATORY DATA ANALYSIS (EDA)

What is Exploratory Data Analysis?

Exploratory Data Analysis
Where EDA sits in the workflow: Get Data → Exploratory Data Analysis → Preprocessing → Data Mining

WHAT IS EDA?
• The analysis of datasets using various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange structures.
• It helps to discover the unexpected as well as to confirm the expected.
• Another definition: an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical).

Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Possibly producing some numerical summaries (central tendency, spread, etc.)

“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
Importance of EDA

EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985)
9 variables for 329 metropolitan areas in the USA
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city

Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Is there any relation between economic outlook and crime?
5. What else?

EXAMPLE 2
• In breast cancer research, the main questions of interest might be:
  • Does any treatment method result in a higher survival rate? Can a particular treatment be suggested to a woman with specific characteristics?
  • Are there differences between patient groups in survival rates (e.g. are white women more likely to survive than Black women at the same stage of disease)?

EXAMPLE 3
• In a project investigating the well-being of teenagers after an economic hardship, the main questions might be:
  • Is there a positive (and significant) effect of economic problems on distress?
  • Which other factors are most related to the distress of teenagers, e.g. age, gender, …?

What is data?
• A bunch of numbers (usually)
• Each number summarises some property or event of interest, e.g. 18
  • Age, Beck Depression Inventory (BDI) score, income in £’000s
• Data: lots of numbers
  • e.g. 18, 24, 43, 22, 37, …

Is there a pattern?
Types of Data
• Quantitative Data
• Qualitative Data

© 2011 Pearson Education, Inc


Quantitative Data
Measured on a numeric scale.
• Number of defective items in a lot.
• Salaries of CEOs of oil companies.
• Ages of employees at a company.


Qualitative Data
Classified into categories.
• College major of each student in a class.
• Gender of each employee at a company.
• Method of payment (cash, check, credit card).

Attributes
• Data points or Samples are described by attributes.
• Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
• Types
• Nominal or Categorical
• Ordinal
• Binary
• Numerical

Attribute types
• Nominal: categories, states, or “names of things”
• Hair color = {black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Ordinal: values have a meaningful order (ranking), but the size of the difference between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
• Binary: Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important, e.g., gender
• Asymmetric binary: outcomes not equally important. e.g., medical test
(positive vs. negative)
• Numeric: represents quantity (integer or real-valued)
• Temperature, length, counts, grade point, CGPA, salary etc.
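
As a rough illustration of the four attribute types, the Python sketch below stores one record and shows which operations make sense for each type. The record, its field names and the SIZE_ORDER mapping are hypothetical, invented for this example; they are not part of the lecture.

```python
# Hypothetical record; field names and values are invented for illustration.
record = {
    "hair_color": "brown",   # nominal: a name, no order
    "size": "medium",        # ordinal: ordered, gaps between levels unknown
    "test_positive": True,   # binary: exactly two states
    "salary": 52000.0,       # numeric: a measurable quantity
}

# Nominal values support only equality checks.
same_hair = record["hair_color"] == "black"

# Ordinal values can be ranked once an order is declared explicitly.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}
is_larger_than_small = SIZE_ORDER[record["size"]] > SIZE_ORDER["small"]

# Numeric values support arithmetic such as differences and ratios.
salary_after_raise = record["salary"] * 1.10

print(same_hair, is_larger_than_small, round(salary_after_raise, 2))
```

The integer codes in SIZE_ORDER only encode rank: subtracting them would suggest that "large" minus "medium" equals "medium" minus "small", which an ordinal attribute does not guarantee.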
Discrete vs. Continuous Attributes

• Discrete Attribute: has only a finite or countably infinite set of values
  • E.g., zip codes, profession, or the set of words in a collection of documents
  • Sometimes represented as integer variables
  • Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute: has real numbers as attribute values
  • E.g., temperature, height, or weight
  • Continuous attributes are typically represented as floating-point variables

A sample Dataset
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 72 95 FALSE no
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
rainy 71 91 TRUE no
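
One way to start exploring the sample dataset is to load it and list the distinct values of each qualitative attribute. The sketch below is a minimal, standard-library version; the names RAW, records and describe are assumptions made for this example.

```python
# The sample dataset, copied verbatim from the table above.
RAW = """
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 68 80 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 72 95 FALSE no
sunny 69 70 FALSE yes
rainy 75 80 FALSE yes
sunny 75 70 TRUE yes
overcast 72 90 TRUE yes
overcast 81 75 FALSE yes
rainy 71 91 TRUE no
"""

lines = RAW.strip().splitlines()
header = lines[0].split()

# Convert each row to a dict, giving every attribute a suitable Python type.
records = []
for line in lines[1:]:
    outlook, temperature, humidity, windy, play = line.split()
    records.append({
        "outlook": outlook,               # nominal
        "temperature": int(temperature),  # numeric
        "humidity": int(humidity),        # numeric
        "windy": windy == "TRUE",         # binary
        "play": play,                     # binary (yes/no)
    })


def describe(name):
    """Return the distinct values of one attribute, sorted."""
    return sorted({r[name] for r in records})


print(len(records), "records with attributes", header)
for name in ("outlook", "windy", "play"):
    print(name, "->", describe(name))
```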

Univariate Analysis

Univariate analysis techniques


Univariate Analysis
Use of statistical techniques in univariate analysis
Some statistical methods for univariate analysis include looking at:
• Mean
• Median
• Mode
• Range
• Variance
• Maximum
• Minimum
• Quartiles
• Standard deviation
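
The statistics listed above can be computed for a single attribute with Python's standard statistics module. The sketch below uses the temperature column of the sample dataset. Two choices here are assumptions rather than anything the slides specify: the variance and standard deviation are the sample versions (dividing by n - 1), and the quartiles use the module's default interpolation method.

```python
import statistics

# Temperature column of the sample dataset (14 observations).
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

mean = statistics.mean(temperature)
median = statistics.median(temperature)
modes = statistics.multimode(temperature)       # every most-frequent value
minimum, maximum = min(temperature), max(temperature)
value_range = maximum - minimum
variance = statistics.variance(temperature)     # sample variance (n - 1)
std_dev = statistics.stdev(temperature)         # sample standard deviation
quartiles = statistics.quantiles(temperature, n=4)  # cut points Q1, Q2, Q3

print(f"mean={mean:.2f} median={median} modes={modes}")
print(f"min={minimum} max={maximum} range={value_range}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
print(f"quartiles={quartiles}")
```

multimode is used instead of mode because this column has two values tied for most frequent; mode would silently return only one of them.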
Univariate Analysis
Use of graphical techniques in univariate analysis
Some graphical methods for univariate analysis involve preparing:
• frequency distribution tables
• bar charts
• histograms
• frequency polygons
• pie charts
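
A frequency distribution table and a histogram are both built from counts, so they can be sketched without a plotting library. The example below tabulates the outlook column of the sample dataset and bins the temperature column; the 5-degree bin width and the text rendering with "#" characters are arbitrary choices made for this illustration.

```python
from collections import Counter

# Two columns of the sample dataset.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

# Frequency distribution table for a qualitative attribute.
outlook_freq = Counter(outlook)

# Histogram counts for a quantitative attribute: each value is assigned
# to a bin labelled by its lower edge (60, 65, 70, ...).
BIN_WIDTH = 5
temp_bins = Counter((t // BIN_WIDTH) * BIN_WIDTH for t in temperature)

print("outlook   count  share")
for value, count in outlook_freq.most_common():
    print(f"{value:<9} {count:>5}  {count / len(outlook):.0%}")

print("\ntemperature histogram")
for lower in sorted(temp_bins):
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * temp_bins[lower]}")
```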
Examples of graphical methods for univariate analysis
[Figure panels: scatter plot, boxplot, histogram]
Bivariate Analysis

What are the different methods to perform bivariate analysis?
Bivariate analysis is usually done by using graphical methods like
• scatter plots
• line charts
• pair plots
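
The plots above are usually read alongside simple numeric summaries of the same pair of attributes. As a sketch under that assumption (neither summary is named in the slides), the example below cross-tabulates outlook against play and computes a Pearson correlation coefficient for temperature against humidity, using the sample dataset and only the standard library. The helper name pearson is hypothetical.

```python
from collections import Counter
from math import sqrt

# Columns of the sample dataset.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]

# Two qualitative attributes: count each (outlook, play) combination.
crosstab = Counter(zip(outlook, play))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Two quantitative attributes: strength of the linear relationship.
r = pearson(temperature, humidity)

print("outlook   yes  no")
for value in sorted(set(outlook)):
    print(f"{value:<9} {crosstab[(value, 'yes')]:>3} {crosstab[(value, 'no')]:>3}")
print(f"Pearson r (temperature, humidity) = {r:.3f}")
```

A coefficient near zero does not rule out a non-linear relationship, which is one reason a scatter plot of the same pair is still worth drawing.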
Multivariate Analysis

What are the different methods to perform multivariate analysis?
Different methods to perform multivariate analysis are:
• Canonical Correlation Analysis
• Cluster Analysis
• Contour plots
• Principal Component Analysis