Presentation 1

- Descriptive analytics involves summarizing data using statistics like frequency, mode, percentiles, mean, median, standard deviation, and skewness. These statistics describe properties of data like location, spread, and distribution.
- Data can take various forms like matrices, documents, transactions, graphs, sequences, and more. It can have different attribute types like nominal, binary, numeric, and ordinal, and attributes can be discrete or continuous.
- Understanding data types, attributes, and calculating summary statistics provides insight into the characteristics of structured data.

Uploaded by

Narasimman C

Descriptive Analytics

Overview
• Data – types and formats
• Descriptive Analytics
Data – types and formats
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where
each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x load   Projection of y load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
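
As a minimal sketch, the two objects in the table above can be stored as an m by n matrix using plain Python lists. The attribute identifiers (`projection_x_load` and so on) are illustrative names chosen here, not part of the slides.

```python
# Two data objects (rows) described by five numeric attributes (columns).
# Attribute identifiers are hypothetical names for the table's column headers.
attributes = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# Each row is a point in n-dimensional space; a column holds one attribute.
thickness = [row[attributes.index("thickness")] for row in data_matrix]
print(m, n, thickness)
```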
Document Data
• Each document becomes a 'term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
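
A term vector can be built by counting words, as in the rough sketch below. The helper `term_vector`, the sample sentence, and the lowercase whitespace tokenization are assumptions made for illustration; real text would usually need proper tokenization and stemming.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    # Assumption: terms are separated by whitespace and compared in lowercase.
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "Team play wins the game and the coach praised the team play"
print(term_vector(doc, vocabulary))  # [2, 1, 2, 0, 0, 1]
```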
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction,
while the individual products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Butter, Bread
3 Butter, Coke, Donut, Milk
4 Butter, Bread, Donut, Milk
5 Coke, Donut, Milk
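
One possible in-memory form of the transaction table above is a mapping from TID to a set of items. The variable names are illustrative only.

```python
from collections import Counter

# Each transaction is a set of items, keyed by its transaction ID (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Butter", "Bread"},
    3: {"Butter", "Coke", "Donut", "Milk"},
    4: {"Butter", "Bread", "Donut", "Milk"},
    5: {"Coke", "Donut", "Milk"},
}

# How many transactions contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

# TIDs of the transactions that contain both Milk and Coke.
with_milk_and_coke = [tid for tid, items in transactions.items()
                      if {"Milk", "Coke"} <= items]
print(item_counts["Milk"], with_milk_and_coke)  # 4 [1, 3, 5]
```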
Graph Data
• Examples: Generic graph and HTML Links

[Figure: generic graph with numeric labels]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
Important Characteristics of Structured Data

• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale

• Distribution
• Centrality and dispersion
Data Objects

• Data sets are made up of data objects.

• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal or categorical: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Classifications must be mutually exclusive (every element should belong to one category with no
ambiguity).
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• Ratios of values are meaningful: we can speak of one value as being a multiple of another (10 K is twice as high as 5 K).
• E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite number of digits
• Continuous attributes are typically represented as floating-point variables
Ordinal Variables

• An ordinal variable can be discrete or continuous

• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
• compute the dissimilarity using methods for interval-scaled variables
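
The rank mapping for ordinal variables can be sketched as follows. The function name `normalize_ordinal` and the example levels are illustrative, and the sketch assumes each variable has at least two ordered levels (otherwise the denominator is zero).

```python
def normalize_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M - 1)."""
    # Rank r runs from 1 (lowest level) to M (highest level).
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    m_f = len(ordered_levels)  # M_f: number of ordered states, assumed >= 2
    return [(rank[v] - 1) / (m_f - 1) for v in values]

sizes = ["small", "large", "medium", "small"]
z = normalize_ordinal(sizes, ["small", "medium", "large"])
print(z)  # [0.0, 1.0, 0.5, 0.0]
```

The resulting z values can then be compared with any distance used for interval-scaled attributes.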
Attributes of Mixed Type

• A database may contain all attribute types

• Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
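• One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where p is the number of attributes and $\delta_{ij}^{(f)}$ is the weight (indicator) of attribute f for the pair (i, j)
• f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled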
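
A rough sketch of the weighted combination is given below. Several choices are assumptions made for illustration and are not stated on the slide: the weight of an attribute is taken as 0 when either value is missing (`None`) and 1 otherwise, numeric distances are normalized by a supplied range (max minus min), and the function returns 0.0 when no attribute is comparable.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted mixed-type dissimilarity d(i, j) between two objects.

    kinds[f] is "nominal" (also used for binary) or "numeric";
    ranges[f] is max - min of attribute f (ignored for nominal attributes).
    """
    num, den = 0.0, 0.0
    for xf, yf, kind, rng in zip(x, y, kinds, ranges):
        if xf is None or yf is None:
            continue  # assumed weight 0: attribute is skipped for this pair
        if kind == "nominal":
            d = 0.0 if xf == yf else 1.0
        else:
            # Numeric; ordinal values should first be mapped to z in [0, 1].
            d = abs(xf - yf) / rng
        num += d
        den += 1.0  # assumed weight 1 for a comparable attribute
    return num / den if den else 0.0

a = ("red", 1, 20.0)
b = ("blue", 1, 30.0)
d = mixed_dissimilarity(a, b, ["nominal", "nominal", "numeric"], [None, None, 40.0])
print(round(d, 4))  # 0.4167
```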
Descriptive Analytics
Summary Statistics
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
• Examples: location - mean
            spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
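
To illustrate the single-pass claim, the sketch below computes the mean and the population standard deviation in one loop. It uses Welford's update, which is one common choice and not necessarily the method the slides have in mind; the function name is illustrative and a non-empty input is assumed.

```python
import math

def running_mean_std(values):
    """Single pass over the data using Welford's update (population std)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
    return mean, math.sqrt(m2 / n)

mean, std = running_mean_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, std)
```

For this input the mean is 5 and the population standard deviation is 2, up to floating-point rounding.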
Frequency and Mode
• The frequency of an attribute value is the percentage of times the value occurs in the data set
• For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
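
Frequency and mode for a categorical attribute can be sketched with a counter. The hair-color sample below is made up for illustration.

```python
from collections import Counter

def frequencies(values):
    """Percentage of times each attribute value occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}

# Made-up sample of a nominal attribute.
hair = ["black", "brown", "black", "red", "black", "brown", "blond", "black"]
freq = frequencies(hair)
mode = max(freq, key=freq.get)  # most frequent value
print(freq["black"], mode)  # 50.0 black
```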
Percentiles
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.


The mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
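
The percentile definition and the outlier sensitivity of the mean can be illustrated as follows. The helpers `trimmed_mean` and `percentile_nearest_rank` are illustrative names; the nearest-rank rule (values at or below the percentile) is only one of several percentile conventions, and the data set is made up.

```python
import math
import statistics

def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def percentile_nearest_rank(values, p):
    """Smallest observed value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an outlier
print(statistics.mean(data))              # 14.5, pulled up by the outlier
print(statistics.median(data))            # 5.5
print(trimmed_mean(data, 0.1))            # 5.5
print(percentile_nearest_rank(data, 50))  # 5
```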
Skewness
• The first thing you usually notice about a distribution’s shape is
whether it has one mode (peak) or more than one.
• If it’s unimodal (has just one peak), like most data sets, the next thing
you notice is whether it’s symmetric or skewed to one side.
• If the bulk of the data is at the left and the right tail is longer, we say
that the distribution is skewed right or positively skewed;
• If the peak is toward the right and the left tail is longer, we say that
the distribution is skewed left or negatively skewed.
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: symmetric, positively skewed, and negatively skewed distributions]

Interpreting
• If skewness is positive, the data are positively skewed or skewed right, meaning that the right
tail of the distribution is longer than the left.
• If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail
is longer.
• If skewness = 0, the data are perfectly symmetrical.
• But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret
the skewness number?
• There’s no one agreed interpretation, but for what it’s worth Bulmer (1979) — a classic —
suggests this rule of thumb:
• If skewness is less than −1 or greater than +1, the distribution can be called highly skewed.
• If skewness is between −1 and −½ or between +½ and +1, the distribution can be called moderately
skewed.
• If skewness is between −½ and +½, the distribution can be called approximately symmetric.
• With a skewness of −0.1098, the sample data for student heights are approximately symmetric.
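
A small sketch of the rule of thumb above. The slides do not give the skewness formula, so the moment-based coefficient g1 = m3 / m2^(3/2) is assumed here; the function names and the sample data are illustrative.

```python
def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(g1):
    """Apply the rule-of-thumb thresholds of 1/2 and 1."""
    if abs(g1) > 1:
        return "highly skewed"
    if abs(g1) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

print(describe_skew(skewness([1, 2, 3, 4, 5])))   # approximately symmetric
print(describe_skew(skewness([1, 1, 1, 1, 10])))  # highly skewed
```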
Kurtosis
• The other common measure of shape is called the kurtosis.
• As skewness involves the third moment of the distribution, kurtosis involves the fourth moment.
• The outliers in a sample, therefore, have even more effect on the kurtosis than they do on the
skewness.
• Traditionally, kurtosis has been explained in terms of the central peak.
• Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
[Figure: distributions with kurtosis > 3 (leptokurtic), kurtosis = 3 (mesokurtic), and kurtosis < 3 (platykurtic)]
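
Kurtosis can be computed from the second and fourth central moments, as sketched below. This assumes the plain (non-excess) definition m4 / m2^2, which matches the threshold of 3 used above; the function name and data are illustrative.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (a normal distribution gives 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

k = kurtosis([1, 2, 3, 4, 5])
print(round(k, 3))  # 1.7, below 3, so platykurtic
```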
Measuring the Dispersion of Data

• Quartiles, outliers and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance (algebraic, scalable computation): $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2$
• Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
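
The quartile-based measures can be sketched with the standard library. The data set is made up, and the `inclusive` interpolation method is one of several quartile conventions, so other tools may report slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles with linear interpolation between observed values.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, median, q3, max(data))

# Usual outlier rule: more than 1.5 x IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary)  # (1, 3.25, 5.5, 7.75, 100)
print(iqr, outliers)        # 4.5 [100]
```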
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
Measures of Association
Measures of Association…

• Covariance Matrix
• The variance–covariance information for the two attributes $X_1$ and $X_2$ can be summarized in the square 2×2 covariance matrix, given as
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$
• Because $\sigma_{12} = \sigma_{21}$, the matrix is symmetric.
• The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.
• The total variance of the two attributes is given as the sum of the diagonal elements:
$$\mathrm{var}(D) = \sigma_1^2 + \sigma_2^2$$
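
A minimal sketch of the 2×2 covariance matrix and the total variance for two attributes. It divides by n (the population form); whether the slides intend n or n − 1 is not stated, so that choice is an assumption, as are the function name and the toy data.

```python
def covariance_matrix(x1, x2):
    """2x2 covariance matrix of two attributes (dividing by n)."""
    n = len(x1)
    mu1, mu2 = sum(x1) / n, sum(x2) / n

    def cov(a, mu_a, b, mu_b):
        return sum((ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)) / n

    s11 = cov(x1, mu1, x1, mu1)  # variance of X1
    s22 = cov(x2, mu2, x2, mu2)  # variance of X2
    s12 = cov(x1, mu1, x2, mu2)  # covariance (equals s21)
    return [[s11, s12], [s12, s22]]

sigma = covariance_matrix([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
total_variance = sigma[0][0] + sigma[1][1]  # sum of the diagonal
print(sigma, total_variance)  # [[1.25, 2.5], [2.5, 5.0]] 6.25
```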
Measures of Association…
• Correlation
• The correlation between variables $X_1$ and $X_2$ is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as:
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$
• The correlation is then the cosine of the angle between the two centered attribute vectors
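
The cosine view of correlation can be sketched as follows: center each attribute vector by subtracting its mean, then take the cosine of the angle between the centered vectors. The function name and the toy data are illustrative, and the sketch assumes neither attribute is constant.

```python
import math

def correlation(x1, x2):
    """Correlation as the cosine between the centered attribute vectors."""
    n = len(x1)
    mu1, mu2 = sum(x1) / n, sum(x2) / n
    c1 = [a - mu1 for a in x1]  # centered X1
    c2 = [b - mu2 for b in x2]  # centered X2
    dot = sum(a * b for a, b in zip(c1, c2))
    norm1 = math.sqrt(sum(a * a for a in c1))
    norm2 = math.sqrt(sum(b * b for b in c2))
    return dot / (norm1 * norm2)

# X2 decreases linearly as X1 increases, so the correlation is about -1.
rho = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(round(rho, 6))  # -1.0
```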

Example – Iris Data set
Sample Mean
$\hat{\mu} = 5.843$ (sepal length)

Median
Because n = 150 is even, the sample median is the average of the values at positions n/2 = 75 and n/2 + 1 = 76 in sorted order. For sepal length both these values are 5.8; thus the sample median is 5.8

Mode
The sample mode for sepal length is 5

Range

Variance
$\sigma^2 = [(5.9 - 5.843)^2 + (6.9 - 5.843)^2 + (6.6 - 5.843)^2 + (4.6 - 5.843)^2 + \ldots]/150 = 0.681$

Standard Deviation
$\sigma = \sqrt{0.681} \approx 0.825$
Example…
• Sample Mean and Covariance
• Sample covariance matrix
$$\hat{\Sigma} = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$
• The variance for sepal length is $\sigma_1^2 = 0.681$, and that for sepal width is $\sigma_2^2 = 0.187$.
• Covariance
• The covariance between the two attributes is $\sigma_{12} = -0.039$
• Correlation
$$\rho_{12} = \frac{-0.039}{\sqrt{0.681 \cdot 0.187}} \approx -0.109$$
The angle is close to 90°, that is, the two attribute vectors are almost orthogonal, indicating weak correlation.
Further, the angle being greater than 90° indicates negative correlation.
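
The correlation and angle for this example can be checked from the reported variances and covariance alone. The sketch below uses only the three numbers quoted above; the variable names are illustrative.

```python
import math

# Values reported above for sepal length (1) and sepal width (2).
var_length = 0.681
var_width = 0.187
cov_12 = -0.039

# Correlation = covariance divided by the product of standard deviations.
rho = cov_12 / math.sqrt(var_length * var_width)

# Angle between the centered attribute vectors, in degrees.
angle_deg = math.degrees(math.acos(rho))

print(round(rho, 3), round(angle_deg, 1))
```

The result is a correlation of about −0.109 and an angle slightly above 96°, consistent with a weak negative correlation.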
