1 L2 Intro DAM
Data analytics is a discipline focused on extracting insights from data. It covers the collection,
organization, storage, analysis, and visualization of data, as well as the tools and techniques used to do so.
• It applies statistical analysis and technology to data to find trends, make predictions, and
improve performance.
• It draws on a range of data management techniques, including data mining, data cleansing,
data transformation, and data modelling.
• Document data: text documents represented as term-frequency vectors

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2

• Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

• Graph and network
• Ordered
• Video data: sequence of images
• Temporal data: time-series
• Sequential data: transaction sequences
• Genetic sequence data
• Spatial, image and multimedia:
• Spatial data: maps
• Image data
• Video data
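A term-frequency vector such as the one for Document 1 can be sketched in a few lines of Python. The vocabulary below reuses the ten terms from the example; the sample text and the function name `term_frequency_vector` are invented for illustration, and splitting on whitespace is a simplifying assumption (real text would need proper tokenization).

```python
from collections import Counter

# Vocabulary taken from the terms in the term-frequency example.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Hypothetical document, invented for illustration.
doc = "team play play score game game win"
vector = term_frequency_vector(doc)
print(vector)
```

Each position in the resulting list corresponds to one vocabulary term, so documents of different lengths map to vectors of the same length.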
Data Object
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
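The row/column correspondence can be illustrated with a small in-memory table. The records and field names below are hypothetical; each dictionary stands for one data object, and each key for one attribute.

```python
# Hypothetical rows from a university database; each dict is one data object.
students = [
    {"student_id": 1, "name": "Ana", "major": "Physics"},
    {"student_id": 2, "name": "Ben", "major": "History"},
]

# The attributes are the field names shared by all data objects.
attributes = list(students[0].keys())

# One attribute viewed across all data objects corresponds to a column.
majors = [student["major"] for student in students]

print(attributes)
print(majors)
```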
Data Attributes
• Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric (quantitative): interval-scaled or ratio-scaled
Types of Attributes
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking), but the magnitude of the difference between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• Values can be expressed as multiples of one another
(10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities
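The difference between interval and ratio scales can be checked numerically. The sketch below uses two arbitrary temperatures chosen for illustration: differences agree on both scales, but only the Kelvin ratio is meaningful because Kelvin has a true zero point.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Two arbitrary temperatures, chosen for illustration.
c_low, c_high = 5.0, 10.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences are meaningful on an interval scale and agree across both scales.
diff_celsius = c_high - c_low
diff_kelvin = k_high - k_low

# Ratios are only meaningful on a ratio scale.
ratio_celsius = c_high / c_low   # 2.0, but 10 °C is not "twice as hot" as 5 °C
ratio_kelvin = k_high / k_low    # roughly 1.018

print(diff_celsius, diff_kelvin, ratio_celsius, ratio_kelvin)
```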
Discrete vs Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a
finite number of digits
• Continuous attributes are typically represented as floating-point
variables
Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and
spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
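• Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
• Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
• Trimmed mean: chopping extreme values
• Median: the middle value of the sorted data if the number of values is odd; otherwise the average of the two middle values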
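The measures of central tendency listed above can be compared on a small sample. The data, the weights, and the helper names `weighted_mean` and `trimmed_mean` are hypothetical; the trimmed mean here drops a fixed number of values from each end, which is one common way to chop extremes.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value.
values = [2, 3, 5, 7, 100]
weights = [2, 1, 1, 1, 1]

mean = statistics.mean(values)
median = statistics.median(values)
w_mean = weighted_mean(values, weights)
t_mean = trimmed_mean(values, k=1)

print(mean, median, w_mean, t_mean)
```

The single extreme value pulls the mean far above the median, while the trimmed mean stays close to the median.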
August 21, 2023 Data Mining: Concepts and Techniques
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
• Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
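• Variance: (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$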
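The dispersion measures on this slide can be computed as follows. This is a sketch on a hypothetical sample; quartile conventions differ between textbooks and tools, and the `"inclusive"` method of `statistics.quantiles` used here is only one of them. The variances use the running-sum forms (sum of x and sum of x squared), which is what makes the computation scalable in a single pass.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
n = len(data)

# Quartiles and five-number summary.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, median, q3, max(data))

# Values beyond 1.5 x IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Variance from running sums only.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
population_variance = sum_x2 / n - (sum_x / n) ** 2

print(five_number_summary, iqr, outliers)
print(sample_variance, population_variance)
```

Here the data are treated once as a sample (dividing by n - 1) and once as a full population (dividing by N = n), purely to show both formulas.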
Some Illustrations
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Properties of Normal Distribution Curve
Graphic Displays of Basic Statistical Descriptions
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the
data are less than or equal to x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against
the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Histogram Analysis
• Histogram: graph display of tabulated
frequencies, shown as bars
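Tabulating frequencies for a histogram amounts to counting values per bin. The sketch below assumes equal-width bins and integer data; the prices, the bin width, and the function name `histogram` are invented for illustration.

```python
def histogram(values, bin_width, start):
    """Count values per equal-width bin [lo, lo + bin_width), keyed by lo."""
    counts = {}
    for value in values:
        index = (value - start) // bin_width
        lo = start + index * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical unit prices.
prices = [42, 45, 58, 61, 63, 77, 79, 80, 95]
counts = histogram(prices, bin_width=20, start=40)

# Text rendering: one bar per bin, one '#' per value.
for lo, count in counts.items():
    print(f"{lo}-{lo + 19}: {'#' * count}")
```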
Histograms Often Tell More than Boxplots
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
• For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i % of the data are below or equal to the
value x_i
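The points of a quantile plot can be generated directly from the sorted data. The slide does not give a formula for f_i; the sketch below assumes the common convention f_i = (i - 0.5) / n, and the sample values are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices.
points = quantile_plot_points([74, 40, 65, 55])
for f, x in points:
    print(f"f = {f:.3f}, x = {x}")
```

Plotting x against f gives the quantile plot; every observation appears as one point.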
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
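The pairing behind a q-q plot can be sketched by computing the same quantiles for both samples. The branch prices below are invented and only the three quartiles are used; the `"inclusive"` quantile method is an assumption, and a real plot would typically use more quantile levels.

```python
import statistics

# Hypothetical unit prices at two branches (samples may differ in size).
branch1 = [40, 42, 47, 52, 58, 60, 68, 75, 80, 90]
branch2 = [45, 50, 53, 60, 66, 70, 78, 85, 92, 105, 110, 120]

# The same quantile levels (here the quartiles) computed for each sample.
quantiles1 = statistics.quantiles(branch1, n=4, method="inclusive")
quantiles2 = statistics.quantiles(branch2, n=4, method="inclusive")

# Each q-q point pairs a Branch 1 quantile with the matching Branch 2 quantile.
qq_points = list(zip(quantiles1, quantiles2))

# Points with y > x lie above the diagonal: Branch 1 is cheaper at that quantile.
branch1_lower = [x < y for x, y in qq_points]

print(qq_points)
print(branch1_lower)
```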
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
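The positive, negative, and uncorrelated patterns named in these slide titles can be quantified numerically. The slides do not define a measure; the sketch below uses Pearson's correlation coefficient as one standard choice, with small invented data sets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data sets.
xs = [1, 2, 3, 4, 5]
positive = pearson_r(xs, [2, 4, 6, 8, 10])     # y rises with x
negative = pearson_r(xs, [10, 8, 6, 4, 2])     # y falls as x rises
uncorrelated = pearson_r(xs, [2, 1, 3, 1, 2])  # no linear trend

print(positive, negative, uncorrelated)
```

A coefficient near +1 or -1 indicates a strong linear relationship; a value near 0 indicates no linear relationship, although a nonlinear one may still exist.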
Predictive Analysis
• It uses historical data, machine learning, and artificial intelligence to
predict future outcomes.
• The historical data is used to derive a mathematical model that considers
key trends and patterns in the data.
• The model is then applied to current data to predict what will happen next.
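The fit-then-apply workflow described above can be sketched with the simplest possible model, a least-squares straight line. The monthly figures and the function name `fit_line` are hypothetical, and a linear trend is an assumption made for illustration; real predictive models are usually richer.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical historical data: month index and units sold.
months = [1, 2, 3, 4, 5]
units = [100, 110, 125, 130, 145]

# Step 1: derive the model from historical data.
intercept, slope = fit_line(months, units)

# Step 2: apply the model to a new input to predict what happens next.
forecast = intercept + slope * 6

print(intercept, slope, forecast)
```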
Applications of Predictive Analysis
• Health
• Google Flu Trends (GFT)
• Retail
• Recommendation lists on Amazon, which use customer behavior and past
transactions
• Social Media
• For marketing, reviews and content are combined to derive a
strategy
• Risk Assessment
• Insurance companies use it for selling products, estimating future losses, and catching
fraudulent claims
• Financial Modelling
• Credit rating for loan approval, revenue generation, resource optimization
• Sports, Competition, Elections
• Bing Predicts forecast the US elections, American Idol, and the World Cup, with a reported accuracy of 90%
• Bing Predicts uses statistics and social media sentiment
Data Analytics Methods
Regression analysis: statistical processes used to estimate the relationships between variables
(how changes to one or more variables might affect another). E.g., how might social media
spending affect sales?
Classification analysis: classify objects into separate predefined classes.
Cluster analysis: classify objects or cases into relative groups called clusters (e.g., investigate why
certain locations are associated with particular purchases).
Time series analysis: a statistical technique that deals with time series data, or trend analysis
(identify trends and cycles over time, e.g., for economic and sales forecasting).
Sentiment analysis:
Monte Carlo simulation:
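As an illustration of the cluster analysis method listed above, the sketch below groups values of a single numeric attribute. The slides do not name an algorithm; k-means is used here as one widely known choice, restricted to one dimension for brevity. The purchase amounts, the starting centers, and the function name `kmeans_1d` are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric attribute, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)),
                          key=lambda k: abs(value - centers[k]))
            clusters[nearest].append(value)
        # Update step: move each center to its cluster mean (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical average purchase amounts for six store locations.
purchases = [12, 15, 14, 80, 85, 90]
centers, clusters = kmeans_1d(purchases, centers=[10, 100])

print(centers)
print(clusters)
```

Unlike classification, no class labels are supplied in advance: the two groups emerge from the data, and the analyst then investigates what distinguishes them.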