Lecture 2.1 Data - Exploration

The document discusses data exploration, emphasizing its importance in understanding data characteristics and selecting appropriate analysis tools. It covers techniques such as summary statistics, visualization, and specific methods like histograms, box plots, and scatter plots, using the Iris Plant data set as an example. Key statistical measures like mean, median, variance, and percentiles are also highlighted to summarize data properties.

Uploaded by revaldochetie092
Copyright © All Rights Reserved
Machine Learning: Exploring Data

What is data exploration?

• A preliminary exploration of the data to better understand its characteristics.
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools
Techniques Used In Data Exploration
• In exploratory data analysis (EDA), as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory techniques
– In data mining, by contrast, clustering and anomaly detection are major areas of interest and are not thought of as just exploratory
• In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
[Figure: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Summary Statistics
• Summary statistics are numbers that summarize properties of the data
– Summarized properties include frequency, location and spread
• Examples: location - mean; spread - standard deviation
– Most summary statistics can be calculated in a single pass through the data
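The single-pass claim can be illustrated with a small sketch. It assumes Welford's running-update method, which the slides do not name, and the sample values are hypothetical. It computes the count, the mean (location) and the sample standard deviation (spread) while reading each value exactly once.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) from one pass over values."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, math.sqrt(variance)

# Hypothetical sample values, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
count, mean, std_dev = single_pass_summary(sample)
print(count, mean, round(std_dev, 3))
```

The median is a common exception to the single-pass rule, since it is usually computed from sorted data rather than from running totals.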
Variable Summaries
Indices of central tendency:
Mean – the average value
Median – the middle value
Mode – the most frequent value
Indices of variability:
Variance – the spread around the mean
Standard deviation – the square root of the variance
Standard error of the mean (estimate) – the standard deviation divided by the square root of the number of scores
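As a rough illustration, the indices listed above can be computed with Python's standard `statistics` module. The `scores` list is hypothetical, and the sketch assumes the sample (n - 1) form of the variance, which the slides do not specify.

```python
import math
import statistics

# Hypothetical scores, chosen so that the mode is unique.
scores = [1, 2, 2, 3, 4, 7, 9]

# Indices of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Indices of variability (sample versions, dividing by n - 1)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)
std_error = std_dev / math.sqrt(len(scores))  # standard error of the mean

print(mean, median, mode)
print(round(variance, 3), round(std_dev, 3), round(std_error, 3))
```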
The Mean
Mean = sum of all scores divided by number of scores

$\bar{X} = \dfrac{X_1 + X_2 + X_3 + \dots + X_n}{n}$

Subject   before   during   after
1         3        2        7
2         3        8        4
3         3        7        3
4         3        2        6
5         3        8        4
6         3        1        6
7         3        9        3
8         3        3        6
9         3        9        4
10        3        1        7
Sum =     30       50       50
/n        10       10       10
Mean =    3        5        5
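The arithmetic in the table can be checked with a few lines of Python. The scores are copied from the table; the helper name `mean` and the dictionary layout are only illustrative choices.

```python
# Scores for the ten subjects, copied from the table above.
before = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]
after = [7, 4, 3, 6, 4, 6, 3, 6, 4, 7]

def mean(scores):
    """Sum of all scores divided by the number of scores."""
    return sum(scores) / len(scores)

column_means = {
    "before": mean(before),
    "during": mean(during),
    "after": mean(after),
}
print(column_means)
```

The "during" and "after" columns have the same mean even though their scores are spread out differently, which is why a mean is normally reported together with an index of variability.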
Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
– For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
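A minimal sketch of frequency and mode for a categorical attribute, using `collections.Counter` from the standard library. The `species` list is a hypothetical sample, not the actual Iris data, and the function names are illustrative.

```python
from collections import Counter

def frequencies(values):
    """Map each attribute value to the percentage of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent attribute value (ties are broken arbitrarily)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical class labels, for illustration only.
species = ["setosa", "virginica", "setosa", "versicolour",
           "setosa", "virginica", "setosa", "versicolour"]

species_frequencies = frequencies(species)
species_mode = mode(species)
print(species_frequencies, species_mode)
```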
Percentiles
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.
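One way to turn this definition into code is the nearest-rank convention, sketched below. This is an assumption: the slides do not say how ties or fractional ranks are handled, and nearest-rank counts values "at most" the result rather than strictly "less than" it. Other conventions interpolate between neighbouring values. The function name and the sample data are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the observations are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical observations, for illustration only.
observations = [40, 15, 50, 20, 35]
print(percentile(observations, 25), percentile(observations, 50))
```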
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
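The sensitivity of the mean to outliers can be seen in a small sketch. The `incomes` list is hypothetical and contains one extreme value; `trimmed_mean` is an illustrative helper that assumes the same fraction of values is discarded from each end of the sorted data.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after discarding trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(trim_fraction * len(ordered))  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical values with a single outlier (1000).
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 1000]

plain_mean = statistics.mean(incomes)
median = statistics.median(incomes)
trimmed = trimmed_mean(incomes, 0.1)
print(plain_mean, median, trimmed)
```

Here the single outlier pulls the mean well above every typical value, while the median and the 10% trimmed mean stay close to the bulk of the data.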
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most common measure of the spread of a set of points.
• However, these are also sensitive to outliers, so other measures are often used.
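The sketch below compares the range and variance with two more robust alternatives. The interquartile range and the median absolute deviation are chosen here as examples of "other measures"; the slides do not name them, and the data are hypothetical.

```python
import statistics

def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)

def interquartile_range(values):
    """Difference between the third and first quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

# Hypothetical data, without and with one extreme value.
clean = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = clean + [100]

for data in (clean, with_outlier):
    print(value_range(data), round(statistics.variance(data), 2),
          interquartile_range(data), median_absolute_deviation(data))
```

Adding the single extreme value inflates the range and variance sharply, while the two robust measures change only slightly.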
Visualization

• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
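The binning step behind a histogram can be sketched without any plotting library. The function below assumes equal-width bins between the minimum and maximum value, which is one common choice and not necessarily what produced the slide's figure. The `petal_widths` values are hypothetical, not the real Iris measurements.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they all land in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value is placed in the last bin, not in a bin of its own.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical petal widths in cm, for illustration only.
petal_widths = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.0, 2.5, 1.8, 2.1, 2.3]

for bins in (3, 6):
    counts = histogram_counts(petal_widths, bins)
    print(bins, "bins:", counts)
    for count in counts:
        print("  " + "#" * count)  # text stand-in for a bar
```

Running it with different bin counts shows how the apparent shape of the distribution depends on the number of bins.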
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
– What does this tell us?
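A two-dimensional histogram counts objects per pair of bins. The sketch below assumes equal-width bins on both attributes and uses hypothetical values; the function names are illustrative.

```python
from collections import Counter

def bin_index(value, lo, hi, num_bins):
    """Index of the equal-width bin that value falls into."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * num_bins), num_bins - 1)

def histogram_2d(xs, ys, num_bins):
    """Count the objects in each (x bin, y bin) cell."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    return Counter(
        (bin_index(x, x_lo, x_hi, num_bins), bin_index(y, y_lo, y_hi, num_bins))
        for x, y in zip(xs, ys)
    )

# Hypothetical attribute values in which the two attributes rise together.
widths = [0, 1, 2, 3, 4]
lengths = [0, 1, 2, 3, 4]
cells = histogram_2d(widths, lengths, 2)
print(cells)
```

When the counts concentrate in the cells along the diagonal, as in this toy data, the two attributes tend to take low values together and high values together.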
Visualization Techniques: Box Plots
• Box Plots
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot
[Figure: basic parts of a box plot, labeled from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
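The quantities marked on the box plot can be computed as sketched below. It assumes the whiskers sit at the 10th and 90th percentiles, as in the figure labels, and treats every point beyond them as an outlier. That is one reading of the figure; many plotting tools instead place whiskers using a multiple of the interquartile range. The data and the function name are hypothetical.

```python
import statistics

def box_plot_parts(values):
    """Return the percentiles marked on the box plot, plus the outliers."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1 to 99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    parts = {f"p{p}": cuts[p - 1] for p in (10, 25, 50, 75, 90)}
    parts["outliers"] = [
        x for x in values if x < parts["p10"] or x > parts["p90"]
    ]
    return parts

# Hypothetical measurements with one extreme value.
measurements = list(range(1, 21)) + [100]
parts = box_plot_parts(measurements)
print(parts)
```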
Example of Box Plots
• Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
• Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
Chart types
• Line plot

Visualization Techniques: Matrix Plots
• Matrix plots
– Can plot the data matrix
– This can be useful when objects are sorted according to class
– Typically, the attributes are normalized to prevent one attribute from dominating the plot
– Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
– Examples of matrix plots are presented on the next two slides
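The normalization step, and one kind of similarity matrix, can be sketched as follows. The sketch assumes z-score standardization (subtract the column mean, divide by the column's sample standard deviation) and a Pearson correlation matrix between attributes. The slides do not state which normalization was used, the rows are hypothetical, and no column may be constant.

```python
import statistics

def standardize_columns(rows):
    """Rescale each attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

def correlation_matrix(rows):
    """Pearson correlation between every pair of attributes."""
    z_columns = list(zip(*standardize_columns(rows)))
    n = len(rows)
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
        for ci in z_columns
    ]

# Hypothetical data matrix: one row per object, one column per attribute.
data_matrix = [
    [1, 10, 5],
    [2, 20, 3],
    [3, 30, 4],
    [4, 40, 1],
]
standardized = standardize_columns(data_matrix)
correlations = correlation_matrix(data_matrix)
for row in correlations:
    print([round(value, 2) for value in row])
```

After standardization the second attribute no longer dominates because of its larger scale, so either matrix could be passed to a plotting routine as an image.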
Visualization of the Iris Data Matrix
[Figure: Iris data matrix; scale labeled "standard deviation"]
Visualization of the Iris Correlation Matrix
