Presentation 1

- Descriptive analytics involves summarizing data using statistics like frequency, mode, percentiles, mean, median, standard deviation, and skewness. These statistics describe properties of data like location, spread, and distribution.
- Data can take various forms like matrices, documents, transactions, graphs, sequences, and more. It can have different attribute types like nominal, binary, numeric, ordinal, discrete vs continuous.
- Understanding data types, attributes, and calculating summary statistics provides insight into the characteristics of structured data.

Uploaded by

Narasimman C
Copyright
© All Rights Reserved
Descriptive Analytics

Overview
• Data – types and formats
• Descriptive Analytics
Data – types and formats
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where
each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x load | Projection of y load | Distance | Load | Thickness
10.23                | 5.27                 | 15.22    | 2.7  | 1.2
12.65                | 6.25                 | 16.22    | 2.2  | 1.1
Document Data
• Each document becomes a `term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.

           | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 3    | 0     | 5    | 0    | 2     | 6    | 0   | 2    | 0       | 2
Document 2 | 0    | 7     | 0    | 2    | 1     | 0    | 0   | 3    | 0       | 0
Document 3 | 0    | 1     | 0    | 0    | 1     | 2    | 2   | 0    | 3       | 0
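
The term-count representation described above can be sketched in a few lines. This is only an illustration: the vocabulary and the two sample documents are made up, and tokenization is assumed to be lowercase whitespace splitting, which the slides do not specify.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the count of each vocabulary term in text, in vocabulary order."""
    counts = Counter(text.lower().split())  # assumed tokenizer: whitespace split
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
documents = [
    "team play team score game game",
    "coach ball coach coach score",
]

vectors = [term_vector(doc, vocabulary) for doc in documents]
for vec in vectors:
    print(vec)
```

Each row of the resulting list plays the role of one document row in a document-term matrix; terms that never occur simply get a count of 0, which is why such matrices tend to be sparse.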
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction,
while the individual products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Butter, Bread
3 Butter, Coke, Donut, Milk
4 Butter, Bread, Donut, Milk
5 Coke, Donut, Milk
Graph Data
• Examples: Generic graph and HTML Links

[Figure: a generic graph with numbered labels]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
Important Characteristics of Structured Data

• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale

• Distribution
• Centrality and dispersion
Data Objects

• Data sets are made up of data objects.


• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal or categorical: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Classifications must be mutually exclusive (every element should belong to one category with no
ambiguity).
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
• compute the dissimilarity using methods for interval-scaled variables
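
The rank mapping above can be sketched as follows. This is a minimal illustration, assuming the ordered list of levels is known in advance; the function name `normalize_ordinal` and the sample values are hypothetical.

```python
def normalize_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] using z = (r - 1) / (M - 1)."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    # Rank r runs from 1 (lowest level) to M (highest level).
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical data using the Size = {small, medium, large} example.
sizes = ["small", "large", "medium", "small"]
z = normalize_ordinal(sizes, ["small", "medium", "large"])
print(z)
```

After this mapping the lowest level is 0, the highest is 1, and the values can be passed to any distance designed for interval-scaled attributes.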
Attributes of Mixed Type

• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
• f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled
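
A rough sketch of the weighted mixed-type formula follows. Two points are assumptions rather than something the slides state: the indicator is taken to be 0 when either value is missing and 1 otherwise, and the "normalized distance" for a numeric attribute is taken to be the absolute difference divided by that attribute's range. The attribute specs and sample objects are hypothetical.

```python
def attribute_dissimilarity(x, y, spec):
    """Per-attribute dissimilarity in [0, 1]."""
    kind = spec["kind"]
    if kind in ("nominal", "binary"):
        return 0.0 if x == y else 1.0
    if kind == "numeric":
        # Assumed normalization: |x - y| divided by the attribute's range.
        return abs(x - y) / (spec["max"] - spec["min"])
    if kind == "ordinal":
        levels = spec["levels"]
        m = len(levels)
        # levels.index(v) equals rank - 1, so this is (r - 1) / (M - 1).
        return abs(levels.index(x) - levels.index(y)) / (m - 1)
    raise ValueError(f"unknown attribute kind: {kind}")


def mixed_dissimilarity(obj_i, obj_j, specs):
    """Weighted average of per-attribute dissimilarities."""
    numerator = 0.0
    denominator = 0.0
    for x, y, spec in zip(obj_i, obj_j, specs):
        # Assumed indicator: skip the attribute when a value is missing.
        if x is None or y is None:
            continue
        numerator += attribute_dissimilarity(x, y, spec)
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no attribute is present in both objects")
    return numerator / denominator


# Hypothetical schema: hair color (nominal), age (numeric), size (ordinal).
specs = [
    {"kind": "nominal"},
    {"kind": "numeric", "min": 0, "max": 100},
    {"kind": "ordinal", "levels": ["small", "medium", "large"]},
]
d_full = mixed_dissimilarity(("red", 20, "small"), ("blue", 70, "large"), specs)
d_missing = mixed_dissimilarity(("red", 20, "small"), ("red", None, "medium"), specs)
print(d_full, d_missing)
```

In the first pair all three attributes contribute, giving (1 + 0.5 + 1) / 3; in the second the missing numeric value is excluded, giving (0 + 0.5) / 2.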
Descriptive Analytics
Summary Statistics
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
• Examples: location - mean
spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
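
To illustrate the single-pass claim, here is a small sketch that computes the count, mean and standard deviation while reading each value exactly once. It uses Welford's online update, a technique the slides do not name, and it divides by n (population form); both choices are assumptions made for the example.

```python
import math


def running_summary(values):
    """One pass over values: return (count, mean, population std deviation)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("no data")
    return n, mean, math.sqrt(m2 / n)


# An iterator is passed on purpose: the data can only be traversed once.
n, mean, std = running_summary(iter([2, 4, 4, 4, 5, 5, 7, 9]))
print(n, mean, std)
```

Statistics such as the median or other percentiles do not fit this pattern exactly, since they need the values in sorted order.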
Frequency and Mode
• The frequency of an attribute value is the percentage of times the value occurs in the data set
• For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
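
A minimal sketch of frequency and mode for a categorical attribute, using made-up hair color values. When several values tie for most frequent, this version returns the one that appears first in the data, which is a choice of the example rather than part of the definition.

```python
from collections import Counter


def frequencies(values):
    """Return the fraction of the data set taken by each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (first seen wins on ties)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical observations of a nominal attribute.
hair = ["black", "brown", "black", "red", "black", "brown"]
freq = frequencies(hair)
print(freq)
print(mode(hair))
```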
Percentiles
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.


The mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
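
The effect of an outlier on the mean, and the robustness of the median and a trimmed mean, can be seen on a tiny made-up sample. The percentile function below uses the nearest-rank convention, which is only one of several definitions in use, and the trimmed mean drops a fixed number k of values from each end; both are assumptions for this sketch.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k removes all of the data")
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 100]
print(statistics.mean(data))    # pulled toward the outlier
print(statistics.median(data))  # unaffected by the outlier
print(trimmed_mean(data, 1))    # outlier removed before averaging
print(percentile(data, 50))
```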
Skewness
• The first thing you usually notice about a distribution’s shape is
whether it has one mode (peak) or more than one.
• If it’s unimodal (has just one peak), like most data sets, the next thing
you notice is whether it’s symmetric or skewed to one side.
• If the bulk of the data is at the left and the right tail is longer, we say
that the distribution is skewed right or positively skewed;
• If the peak is toward the right and the left tail is longer, we say that
the distribution is skewed left or negatively skewed.
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distributions: symmetric, positively skewed, negatively skewed]
Interpreting
• If skewness is positive, the data are positively skewed or skewed right, meaning that the right
tail of the distribution is longer than the left.
• If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail
is longer.
• If skewness = 0, the data are perfectly symmetrical.
• But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret
the skewness number?
• There’s no one agreed interpretation, but for what it’s worth Bulmer (1979) — a classic —
suggests this rule of thumb:
• If skewness is less than −1 or greater than +1, the distribution can be called highly skewed.
• If skewness is between −1 and −½ or between +½ and +1, the distribution can be called moderately
skewed.
• If skewness is between −½ and +½, the distribution can be called approximately symmetric.
• With a skewness of −0.1098, the sample data for student heights are approximately symmetric.
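
As a sketch of how the rule of thumb might be applied, the code below computes a moment-based skewness and labels it with the thresholds quoted above. The slides do not give a skewness formula, so the third-moment form used here (third central moment divided by the 1.5 power of the second, both dividing by n) is an assumption, and the sample data are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


def describe_skew(g):
    """Label a skewness value using the rule of thumb quoted above."""
    if abs(g) > 1:
        return "highly skewed"
    if abs(g) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # long right tail, so positive skewness
print(skewness(symmetric), describe_skew(skewness(symmetric)))
print(skewness(right_tailed), describe_skew(skewness(right_tailed)))
print(describe_skew(-0.1098))
```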
Kurtosis
• The other common measure of shape is called the kurtosis.
• As skewness involves the third moment of the distribution, kurtosis involves the fourth moment.
• The outliers in a sample, therefore, have even more effect on the kurtosis than they do on the
skewness.
• Traditionally, kurtosis has been explained in terms of the central peak.
• Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
[Figure: three curves compared by kurtosis: leptokurtic (kurtosis > 3), mesokurtic (kurtosis = 3), platykurtic (kurtosis < 3)]
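
A short sketch of a fourth-moment kurtosis, compared against the reference value 3. The formula (fourth central moment divided by the squared second moment, both dividing by n) is an assumption, since the slides only say that kurtosis involves the fourth moment. The labels leptokurtic and mesokurtic are the standard names for the kurtosis > 3 and kurtosis = 3 cases; the data are made up.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / m2 ** 2


def kurtosis_label(k, tol=1e-9):
    """Name the shape relative to the reference value 3."""
    if k > 3 + tol:
        return "leptokurtic"
    if k < 3 - tol:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]                        # evenly spread values
heavy_tails = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]  # mostly central, two extremes
print(kurtosis(flat), kurtosis_label(kurtosis(flat)))
print(kurtosis(heavy_tails), kurtosis_label(kurtosis(heavy_tails)))
```

The second sample shows the point made above about outliers: two extreme values are enough to push the kurtosis well past 3.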
Measuring the Dispersion of Data

• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ for a sample, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$ for a population
• Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
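
The five-number summary and an IQR-based outlier threshold can be sketched with the standard library. Quartile definitions differ between tools; this example assumes the "inclusive" interpolation method of `statistics.quantiles` and the common 1.5 × IQR rule, and the data are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary)
print(outliers)
```

In a boxplot of this sample the box would span Q1 to Q3, and the value 100 would be drawn as an individual point rather than absorbed into a whisker.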
Measures of Association
Measures of Association…

• Covariance Matrix
• The variance–covariance information for the two attributes X1 and X2 can be summarized in the square 2×2 covariance matrix, given as
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$
• Because $\sigma_{12} = \sigma_{21}$, the matrix is symmetric.
• The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.
• The total variance of the two attributes is given as the sum of the diagonal elements:
$$\mathrm{var}(D) = \sigma_1^2 + \sigma_2^2$$
Measures of Association…
• Correlation
• The correlation between variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as:
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$
• The correlation is then the cosine of the angle between the two centered attribute vectors
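
The relationship between covariance, correlation and the cosine of the angle can be checked numerically. The sketch below uses two short made-up attribute vectors and divides by n (population form), which is an assumption; the correlation itself does not depend on that choice.

```python
import math


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def covariance(x, y):
    """Population covariance (divides by n); covariance(x, x) is the variance."""
    cx, cy = center(x), center(y)
    return sum(a * b for a, b in zip(cx, cy)) / len(x)


def correlation(x, y):
    """Covariance normalized by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def cosine_of_centered(x, y):
    """Cosine of the angle between the centered attribute vectors."""
    cx, cy = center(x), center(y)
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return dot / (norm_x * norm_y)


# Hypothetical attribute vectors.
x1 = [2.0, 4.0, 6.0, 8.0]
x2 = [1.0, 3.0, 2.0, 5.0]

cov_matrix = [
    [covariance(x1, x1), covariance(x1, x2)],
    [covariance(x2, x1), covariance(x2, x2)],
]
total_variance = cov_matrix[0][0] + cov_matrix[1][1]
print(cov_matrix, total_variance)
print(correlation(x1, x2), cosine_of_centered(x1, x2))
```

The two printed values on the last line agree up to rounding, because the factors of n in the covariance and the two standard deviations cancel.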


Example – Iris Data set
Sample Mean
For sepal length, the sample mean is 5.843

Median
Because n = 150 is even, the sample median is the average of the values at positions
n/2 = 75 and n/2 + 1 = 76 in sorted order. For sepal length both these
values are 5.8; thus the sample median is 5.8.

Mode
The sample mode for sepal length is 5

Range

Variance
$$\sigma^2 = \frac{1}{150}\left[(5.9 - 5.843)^2 + (6.9 - 5.843)^2 + (6.6 - 5.843)^2 + (4.6 - 5.843)^2 + \cdots\right] = 0.681$$

Standard Deviation
$$\sigma = \sqrt{0.681} \approx 0.825$$
Example…
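• Sample Mean and Covariance
• Sample covariance matrix
$$\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$
• The variance for sepal length is $\sigma_1^2 = 0.681$, and that for sepal width is $\sigma_2^2 = 0.187$.
• Covariance
• The covariance between the two attributes is $\sigma_{12} = -0.039$
• Correlation
$$\rho_{12} = \frac{-0.039}{\sqrt{0.681 \times 0.187}} \approx -0.109, \qquad \theta = \cos^{-1}(-0.109) \approx 96.3^\circ$$
• The angle is close to 90°, that is, the two centered attribute vectors are almost orthogonal, indicating weak correlation. Further, the angle being greater than 90° indicates negative correlation.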
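
The correlation and angle for this example can be recomputed from the three summary values quoted above (the two variances and the covariance). This sketch uses only those rounded figures, not the raw Iris measurements, so the results are approximate.

```python
import math

# Summary values quoted in the example above (rounded to three decimals).
var_sepal_length = 0.681
var_sepal_width = 0.187
cov_length_width = -0.039

# Correlation: covariance normalized by both standard deviations.
rho = cov_length_width / math.sqrt(var_sepal_length * var_sepal_width)

# Angle between the centered attribute vectors, in degrees.
angle_deg = math.degrees(math.acos(rho))

print(round(rho, 3), round(angle_deg, 1))
```

The correlation comes out slightly negative and the angle slightly above 90 degrees, matching the description of a weak negative correlation.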
