Lec 2 Getting To Know Data EDA
Data object
represents an entity in the data set
also called data item, point, instance, example, sample, row, observation
e.g. a patient, movie, student, customer, product, book, tweet
described by a set of attributes
Attribute
is a data field, representing a feature/characteristic of data objects
also called variable, feature, dimension, column, coordinate, field
e.g. reaction to a test, genre/director, course, address, price/category,
author, publisher, word
Sparsity in Data
If most of the feature values are missing, then the data is called sparse
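A minimal sketch of how sparsity could be measured, assuming missing values are encoded as None (the toy table and the helper name missing_fraction are hypothetical, not from the lecture):

```python
# Hypothetical toy data: rows are data objects, columns are attributes;
# None marks a missing value (an assumed encoding).
rows = [
    [5, None, None, None],
    [None, None, 3, None],
    [None, 1, None, None],
]

def missing_fraction(table):
    """Return the fraction of cells whose value is missing (None)."""
    total = sum(len(row) for row in table)
    missing = sum(value is None for row in table for value in row)
    return missing / total

sparsity = missing_fraction(rows)
print(f"missing fraction: {sparsity:.2f}")  # 9 of 12 cells are missing
```

Here most cells are missing, so this table would count as sparse under the definition above.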
Univariate Data
Bivariate Data
Multivariate Data
Nominal/Categorical Attributes
Ordinal Attributes
Numeric Attributes
For location (central tendency) of nominal and ordinal attributes one can use
the mode, i.e. the most frequent value
The mode is not the same as the majority element (a value with frequency
more than 50%)
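The difference between the most frequent value and a majority element can be sketched as follows. The genre list is hypothetical illustration data, and both helper functions are assumptions written for this example:

```python
from collections import Counter

def mode(values):
    """Most frequent value in a non-empty sequence."""
    return Counter(values).most_common(1)[0][0]

def majority_element(values):
    """Value with frequency above 50%, or None if no such value exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

genres = ["drama", "comedy", "drama", "action", "horror"]  # hypothetical
print(mode(genres))              # drama (2 of 5 values)
print(majority_element(genres))  # None: 2 of 5 is not more than 50%
```

A most frequent value always exists for non-empty data, while a majority element may not.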
[Figure: data plotted on an axis running from 5 to 99, with panels annotated Mean = 13.57 and Mean = 4.34]
Max
Min
Range := max - min
Midrange := average of min and max
Inter-Quartile Range := 3rd quartile - 1st quartile
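The spread measures above can be computed with the Python standard library. This is a sketch on a hypothetical sample; note that statistics.quantiles uses one particular interpolation convention, so other tools may report slightly different quartiles:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 12]  # hypothetical sample

lo, hi = min(data), max(data)
data_range = hi - lo          # range: max - min
midrange = (lo + hi) / 2      # midrange: average of min and max

# statistics.quantiles returns the n - 1 cut points; for n=4 these are the quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                 # inter-quartile range: 3rd quartile - 1st quartile

print(data_range, midrange, iqr)
```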
[Figure: example distributions with low spread, mid-spread, and high spread]
Variance and Standard Deviation
The i-th q-quantile is a data point x such that roughly an i/q fraction of the
points are less than x and roughly a (q−i)/q fraction are greater than x
The median is the first 2-quantile
3rd quartile := 3rd 4-quantile := 75th percentile
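The quantile definition can be sketched with the nearest-rank convention. This is one of several conventions in use and is an assumption here, as are the function name and the data:

```python
import math

def quantile(values, i, q):
    """i-th q-quantile by the nearest-rank convention (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(i * len(ordered) / q))  # 1-based rank
    return ordered[rank - 1]

data = list(range(1, 101))  # hypothetical data: 1, 2, ..., 100

median = quantile(data, 1, 2)           # first 2-quantile
third_quartile = quantile(data, 3, 4)   # third 4-quantile
percentile_75 = quantile(data, 75, 100) # 75th percentile, same point

# Fraction of points strictly below the third quartile: roughly 3/4.
below = sum(v < third_quartile for v in data) / len(data)
print(median, third_quartile, percentile_75, below)
```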
[Figure: box plot annotated with minimum (0th percentile), median (50th percentile), maximum, interquartile range, and data range]
Standard Deviation
For a normal distribution, a fixed fraction of values falls within k st-dev
of the mean: about 68% for k = 1, 95% for k = 2, and 99.7% for k = 3
For any distribution of data, Chebyshev's inequality guarantees that at least
a 1 − 1/k² fraction of the values must fall within k st-dev of the mean
At least 75% must lie within k = 2 st-dev (x̄ ± 2σ)
At least ∼ 89% must lie within k = 3 st-dev (x̄ ± 3σ)
At least ∼ 93.75% must lie within k = 4 st-dev (x̄ ± 4σ)
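The distribution-free guarantee can be checked empirically. The sketch below compares the observed fraction of values within k st-dev of the mean against the lower bound 1 − 1/k² (Chebyshev's inequality), on a hypothetical, strongly skewed data set. It uses the population standard deviation, for which the bound holds for any finite data set:

```python
import statistics

def fraction_within(values, k):
    """Fraction of values within k population standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

# Hypothetical skewed data: many small values and a few huge ones.
data = [1] * 90 + [50] * 8 + [1000] * 2

for k in (2, 3, 4):
    bound = 1 - 1 / k**2  # lower bound from Chebyshev's inequality
    print(k, fraction_within(data, k), ">=", round(bound, 4))
```

The observed fractions are typically far above the bound; the bound is loose because it has to hold for every possible distribution.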
Numeric Attributes
Covariance
Correlation
Correlation Matrix
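Contingency Table:
           a1    a2    . . .   ap
      b1
C =   b2            fij
      ...
      bq
where a1, ..., ap and b1, ..., bq are the values of the two attributes and fij is the frequency (count) in the corresponding cell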
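A contingency table for two categorical attributes can be built by counting value pairs. This is a minimal sketch; the attributes genre and rating and their values are hypothetical:

```python
from collections import Counter

# Hypothetical paired observations of two categorical attributes.
genre = ["drama", "comedy", "drama", "action", "comedy", "drama"]
rating = ["PG", "PG", "R", "R", "PG", "R"]

counts = Counter(zip(rating, genre))  # (row value, column value) -> frequency
col_values = sorted(set(genre))       # column labels a1 ... ap
row_values = sorted(set(rating))      # row labels b1 ... bq

# One row per rating value, one column per genre value.
table = [[counts[(b, a)] for a in col_values] for b in row_values]

print("  ", col_values)
for b, row in zip(row_values, table):
    print(b, row)
```

Every observation lands in exactly one cell, so the cell counts sum to the number of data objects.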
cov(x, y) = cov(y, x)
cov(x, x) = var(x)
If x and y are independent, then cov(x, y) = 0
For constant a and b
cov(x, a) = 0
cov(ax, by) = ab cov(x, y)
cov(x + a, y + b) = cov(x, y)
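The algebraic properties above can be verified numerically. The sketch uses the population covariance on hypothetical values; the independence property is a statement about random variables and is not checked here:

```python
import math
import statistics

def cov(x, y):
    """Population covariance: mean of (x_i - mean_x) * (y_i - mean_y)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((p - mx) * (r - my) for p, r in zip(x, y)) / len(x)

x = [1.0, 2.0, 4.0, 7.0]  # hypothetical values
y = [3.0, 1.0, 6.0, 8.0]
a, b = 2.0, -3.0          # arbitrary constants

checks = {
    "cov(x, y) = cov(y, x)": math.isclose(cov(x, y), cov(y, x)),
    "cov(x, x) = var(x)": math.isclose(cov(x, x), statistics.pvariance(x)),
    "cov(x, a) = 0": math.isclose(cov(x, [a] * len(x)), 0.0, abs_tol=1e-12),
    "cov(ax, by) = ab cov(x, y)": math.isclose(
        cov([a * v for v in x], [b * v for v in y]), a * b * cov(x, y)
    ),
    "cov(x + a, y + b) = cov(x, y)": math.isclose(
        cov([v + a for v in x], [v + b for v in y]), cov(x, y)
    ),
}
for name, ok in checks.items():
    print(name, ok)
```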
Correlation
Covariance depends on magnitude and scale of variable x and y
Correlation quantifies how strongly two variables are linearly related
rxy = corr (x, y) = cov(x, y) / (σx · σy)
−1 ≤ corr (x, y) ≤ 1
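A sketch of the correlation coefficient computed directly from its definition, using population covariance and standard deviations. The data are hypothetical and chosen to show that correlation ignores scale and captures only linear relationships:

```python
import statistics

def corr(x, y):
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y), population form."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((p - mx) * (r - my) for p, r in zip(x, y)) / len(x)
    return cov_xy / (statistics.pstdev(x) * statistics.pstdev(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values

r_up = corr(x, [2 * v + 1 for v in x])    # increasing line: close to 1
r_down = corr(x, [-3 * v for v in x])     # decreasing line: close to -1
r_scaled = corr(x, [100 * v for v in x])  # rescaling y does not change it

# A perfect but non-linear (quadratic) relationship can still give 0.
s = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_quadratic = corr(s, [v * v for v in s])

print(r_up, r_down, r_scaled, r_quadratic)
```

The quadratic case shows that a correlation of 0 does not rule out a strong relationship that is not linear.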
[Figure: example scatter plots labelled with their correlation coefficients (1, −1, and 0)]
Bar Charts
Histogram ▷ and also overlapping histogram
Box Plot ▷ and also side-by-side box-plots
Scatter Plot ▷ and scatter plot matrix
Heat map
Line Graph
Parallel Axis Plot
Word-Cloud
The word cloud on the left is clearly built from a collection of articles about US politics
and political news, while the one on the right appears to come from an astronomy/astrophysics corpus