AAU Data Analytics 24
What is analytics?
• Analytics is the systematic discovery, interpretation,
and use of meaningful patterns in data.
– It also entails applying patterns in data towards
effective problem solving and decision making.
[Figure: the Data-Information-Knowledge-Wisdom hierarchy, moving from explicit DATA up to tacit TRUTH through INFORMATION, KNOWLEDGE and WISDOM; depth of meaning grows as we create concepts, contexts, patterns, and finally principles and morals.]
Why Data Analytics?
• Reason one: We are living in a complex and dynamic business
environment
• How to gain competitive advantage when the competitive
pressure is very strong?
• How to control the volatile market (Product, Price, Promotion,
Place, People, Process & Physical evidence)?
• How to satisfy the needs of increasingly professional users (such
as customers and consumers)?
• How to manage the high turnover rate of professionals, which
results in diminishing individual and organizational experience?
– Requirement: Business Intelligence
• Prediction: attempting to know what may happen in the future
• Just-in-time response
• Quality, rational, sound and value-added decision making and
problem solving
• Enhanced efficiency and competency
The need for Business Intelligence
• Business Intelligence is getting the right information to the right
people at the right time to support better decision making and
gain competitive advantage
Why Data Analytics
Reason two: Massive data collection
• Data is being produced (generated & collected) at alarming rate
because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned texts & image
platforms to satellite remote sensing systems
– Above all, popular use of WWW as a global information system
• Nowadays large databases (data warehouses) are growing at
unprecedented rates to manage the explosive growth of data.
• Examples of massive data sets
– Google: Order of 10 billion Web pages indexed
• 100’s of millions of site visitors per day
– MEDLINE text database: 17 million published articles
– Retail transaction data: EBay, Amazon, Wal-Mart: order of 100
million transactions per day
• Visa, MasterCard: similar or larger numbers
The Web Expansion: Web 0.0 to Web 5.0
Too much data & information, but too little knowledge
• With the phenomenal rate of growth of data, users expect
more useful and valuable information
– There is a need to extract knowledge (useful information) from the
massive data.
• Faced with such enormous volumes of data, human analysts without
special tools can no longer make sense of it.
– Data analytics can automate the process of finding patterns &
relationships in raw data and the results can be utilized for decision
support. That is why data analytics is used in science, health and
business areas.
• If we know how to reveal valuable meaningful patterns in data,
data might be one of our most valuable assets.
– Data analytics is the technology that extracts diamonds of knowledge
from historical data & predicts future outcomes.
The Way Forward
• Overfitting
– The size and representativeness of the dataset determine whether a
model fitted to the current database state will also fit future
database states.
– Overfitting occurs when the model fails to generalize to those future
states, typically because the training database is small or
unbalanced (see the sketch below).
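A minimal sketch of how overfitting shows up in practice, assuming scikit-learn is available (the synthetic dataset and the unrestricted decision tree below are illustrative choices, not taken from the slides): the model fits the small training set almost perfectly but scores noticeably lower on held-out data.

# Illustrative overfitting check: an unrestricted decision tree memorizes
# a small training set but generalizes poorly to a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42)

model = DecisionTreeClassifier(max_depth=None, random_state=42)  # no depth limit
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("Test accuracy:    ", model.score(X_test, y_test))    # noticeably lower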
Assignment
• Compare and contrast: overfitting vs
underfitting
– Discuss what they mean
– Show their similarity and differences
– Methods used to solve them, show with example
– Conclusion
– Reference
Data Preparation
• There are three major phases in data analytics:
– Pre-data analytics
– Data analytics for modeling
– Post-data analytics
• Pre-data analytics involves four major tasks:
– Data cleansing
– Data integration
– Data reduction
– Data transformation
Data Collection for analytics
• Data analytics and data mining require collecting a great
amount of data to achieve the intended objective.
– Data analytics starts by understanding the business or problem
domain in order to gain the business knowledge
• Business knowledge guides the process towards useful results,
and enables the recognition of those results that are useful.
– Based on the business knowledge data related to the business
problem are identified from the database/data warehouse for
analytics.
– Once we collect the data, the next task is data understanding:
we need to understand well the type of data we are using for
analysis and identify the problems observed within the data.
• Before feeding data to DM, we have to make sure of the
quality of the data.
Data Quality Measures
• A well-accepted set of multidimensional data quality measures
is the following:
– Accuracy (free from errors and outliers)
– Completeness (no missing attributes and values)
– Consistency (no inconsistent values and attributes)
– Timeliness (appropriateness of the data for the purpose it is
required)
– Believability (acceptability)
– Interpretability (easy to understand)
• Most of the data in the real world is of poor quality; that
is:
– Incomplete, Inconsistent, Noisy, Invalid, Redundant, …
Data is often of low quality
• Data in the real world is of poor quality
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality analytics results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Why low quality Data?
• Collecting the required data is challenging
– In addition to its heterogeneous & distributed nature,
real-world data is low in quality.
• Why?
– You didn’t collect it yourself
– It probably was created for some other use, and then you came
along wanting to integrate it
– People make mistakes (typos)
– People are too busy (“this is good enough”) to systematically
and carefully organize data using structured formats
Types of problems with data
• Some data have problems on their own that need to be
cleaned:
– Outliers: misleading data that do not fit to most of the data/facts
– Missing data: attributes values might be absent which needs to be
replaced with estimates
– Irrelevant data: attributes in the database that might not be of
interest to the DM task being developed
– Noisy data: attribute values that might be invalid or incorrect. E.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
– Everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How to integrate data organized in different formats following
different conventions?
The issue is …..
• How do we prepare enough complete, good-quality data
for analytics?
Coming up with good-quality data requires passing
through different data preprocessing tasks.
Forms of data preprocessing
Data Cleaning
Data Cleaning: Incomplete Data
• The dataset may lack certain attributes of interest
– Is it enough to have the patient demographic profile and the
address of the region to predict the vulnerability (or exposure) of
a given region to a Malaria outbreak?
• The dataset may contain only aggregate data, e.g., a traffic
police car-accident report:
– “this many accidents occurred on this day in this sub-city”

No. of accidents   Date           Address
3                  Oct 23, 2012   Yeka, Addis Ababa
Data Cleaning: Missing Data
• Data is not always available; attribute values may be lacking. E.g.,
Occupation = “ ”
– Many tuples have no recorded value for several attributes, such
as customer income in sales data
Data Cleaning: Missing Data
• Missing data may be due to:
– values that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding, or not considered
important at the time of entry
– the history or changes of the data not being registered
Data Cleaning: Missing Data
There are different methods for treating missing values.
• Ignore the record (tuple) with the missing value: usually done
when the class label is missing (assuming the task is
classification).
– Not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: this method is tedious,
time consuming and often infeasible.
• Use a global constant to fill in the missing value: e.g., a
label such as “unknown”, treated as a new class.
• Use the attribute’s mean or mode to fill in the missing
value: replace the missing values with the attribute’s
mean (for numeric attributes) or mode (most frequent value,
for nominal attributes).
– Use the most probable value to fill in the missing value automatically
• e.g., calculate the most probable value using the Expectation
Maximization (EM) algorithm.
Example: Missing Values Handling method
Attribute        Data type   Handling method
Name             –           –
Sex              Nominal     Replace by the mode value.
Age              Numeric     Replace by the mean value.
Religion         Nominal     Replace by the mode value.
Height           Numeric     Replace by the mean value.
Marital status   Nominal     Replace by the mode value.
Job              Nominal     Replace by the mode value.
Weight           Numeric     Replace by the mean value.
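A minimal sketch of mean/mode filling with pandas, assuming the library is available (the toy DataFrame below is hypothetical; column names follow the table above):

# Fill missing values: mean for numeric attributes, mode for nominal ones.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sex":    ["F", "M", None, "F"],
    "Age":    [25, np.nan, 40, 31],
    "Height": [1.65, 1.80, np.nan, 1.70],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())     # numeric -> mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])  # nominal -> mode

print(df)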
Predict missing value using EM
• Solves estimation with incomplete data.
– Obtain initial estimates for the parameters using the mean value.
– Use the estimates to calculate a value for the missing data, &
– The process continues iteratively until convergence (|μi − μi+1| ≤ θ).
• E.g.: out of six data items, the known values are {1, 5, 10, 4};
estimate the two missing data items.
– Let EM converge when two successive estimates differ by at most 0.05, and
let our initial guess for the two missing values be 3.
• The algorithm stops when the last two estimates are only 0.05 apart;
thus, our estimate for the two missing items is about 4.97.
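A short sketch of this iterative estimation in plain Python (no external libraries), reproducing the example above: the two missing items start at 3, are repeatedly replaced by the current mean of all six items, and the loop stops once successive estimates differ by at most 0.05 (ending near 4.97).

# Iteratively estimate the two missing items for {1, 5, 10, 4}.
known = [1, 5, 10, 4]
estimate = 3.0        # initial guess for both missing items
threshold = 0.05

while True:
    new_estimate = (sum(known) + 2 * estimate) / 6  # mean over all six items
    print(round(new_estimate, 3))                   # 4.333, 4.778, 4.926, 4.975
    if abs(new_estimate - estimate) <= threshold:
        break
    estimate = new_estimate

print("Estimate for the missing items:", round(new_estimate, 3))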
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– Noise: random error or variance in a measured variable
– e.g., Salary=“−10” (an error)
• Typographical errors are errors that corrupt data
– e.g., ‘green’ written as ‘rgeen’
Data Integration: Formats
• Not everyone uses the same format. Do you agree?
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Dates are especially problematic:
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or
names, which is also the problem of lack of
standardization / naming conventions. e.g.,
– Age=“26” vs. Birthday=“03/07/1986”
– Some use “1,2,3” for rating; others “A, B, C”
• Discrepancy between duplicate records
ID   Name                         City          State
1    Ministry of Transportation   Addis Ababa   Addis Ababa region
2    Ministry of Finance          Addis Ababa   Addis Ababa administration
3    Office of Foreign Affairs    Addis Ababa   Addis Ababa regional administration
Data Integration: Inconsistent
Attribute name    Current values                        New value
Job status        “no work”, “job less”, “Jobless”      Unemployed
Marital status    “not married”, “single”               Unmarried
Education level   “uneducated”, “no education level”    Illiterate
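A minimal pandas sketch of this kind of standardization, assuming the mapping below (which mirrors the table above) is what the analyst has agreed on; the DataFrame is hypothetical:

# Map inconsistent category labels to a single standard value.
import pandas as pd

df = pd.DataFrame({"Job status": ["no work", "job less", "Jobless", "employed"]})

job_map = {"no work": "Unemployed", "job less": "Unemployed", "Jobless": "Unemployed"}
df["Job status"] = df["Job status"].replace(job_map)

print(df["Job status"].tolist())  # ['Unemployed', 'Unemployed', 'Unemployed', 'employed']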
Data Integration: different structure
What’s wrong here? No data type constraints

ID     Name                         City          State
1234   Ministry of Transportation   Addis Ababa   AA
Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin
it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide the data into the appropriate categories
(a binning sketch follows this list)
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50, etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore the age ranges because you aren’t sure
• Make an educated guess based on the imported data (e.g.,
assume that the # of people aged 25-35 is the average of the #
of people aged 20-30 & 30-40)
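A minimal sketch of the “finer level of detail” case with pandas, assuming birth years are available and 2024 is used as a hypothetical reference year:

# Bin a finer-grained attribute (birth year -> age) into the required ranges.
import pandas as pd

df = pd.DataFrame({"birth_year": [1995, 1988, 1979, 1969]})
df["age"] = 2024 - df["birth_year"]              # assumed reference year

bins = [20, 30, 40, 50, 60]                      # 20-30, 30-40, 40-50, 50-60
labels = ["20-30", "30-40", "40-50", "50-60"]
df["age_range"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)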
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
–For the same real world entity, attribute values from
different sources are different
–Possible reasons: different representations, different
scales, e.g., American vs. British units
• weight measurement: KG or pound
• Height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
– Information source #2 says that Alex lives in Mekele
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
Handling Redundant Data
• Redundant data often occurs when integrating multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and inconsistencies
and improve analytics speed and quality
Covariance
• Covariance is similar to correlation
p q
where n is the number of tuples, and are the respective
mean of p and q, σp and σq are the respective standard deviation
of p and q.
• It can be simplified in computation as
[Figure: Sampling from raw data: SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement).]
Sampling: Cluster or Stratified Sampling
Data Transformation
• A function that maps the entire set of values of
a given attribute to a new set of replacement
values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization
– Discretization
‐ Generalization: Concept hierarchy climbing
Data Transformation: Normalization
• Normalization: scaled to fall within a smaller, specified range of values
• min-max normalization
• z-score normalization
• Min-max normalization:

v' = ((v − minA) / (maxA − minA)) × (newMax − newMin) + newMin

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to:

((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
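A minimal plain-Python sketch of both normalization methods; the income values other than the ones in the example are hypothetical:

# Min-max and z-score normalization; reproduces the $73,600 -> 0.716 mapping.
values = [12000, 47000, 73600, 98000]

min_a, max_a = min(values), max(values)
new_min, new_max = 0.0, 1.0
minmax = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
          for v in values]
print(round(minmax[2], 3))                    # 0.716

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscore = [(v - mean) / std for v in values]   # z-score: (v - mean) / std
print([round(z, 2) for z in zscore])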
[Figure: Final evaluation: the final model is applied to a held-out final test set of positive (+) and negative (−) examples.]
Divide the dataset into training & test
• There are various ways in which to separate the data
into training and test sets
– These are the established ways of using the two sets to
assess the effectiveness and the predictive/descriptive
accuracy of a machine learning technique on unseen
examples:
– The holdout method
• Repeated holdout method
– Cross-validation
– The bootstrap
The holdout method
• The holdout method reserves a certain amount for
testing and uses the remainder for training
– Usually: one third for testing, the rest for training
Cross-validation
• Cross-validation works as follows:
– First step: data is split into k subsets of equal-sized sets
randomly. A partition of a set is a collection of subsets for
which the intersection of any pair of sets is empty. That is, no
element of one subset is an element of another subset in a
partition.
– Second step: each subset in turn is used for testing and the
remainder for training
• This is called k-fold cross-validation
– Often the subsets are stratified before the cross-validation is
performed
• The error estimates are averaged to yield an overall error
estimate
Cross-validation example:
– Break up the data into k groups of the same size
– Hold aside one group for testing and use the rest to build the
model
– Repeat, holding out each group in turn

[Figure: the dataset split into folds, with a different fold marked “Test” in each round.]
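A minimal k-fold cross-validation sketch, assuming scikit-learn is available; the Iris data and the decision tree are illustrative stand-ins for whatever dataset and learner are being evaluated:

# 10-fold (stratified) cross-validation: each fold is used once for testing,
# and the per-fold accuracies are averaged into an overall estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("Fold accuracies:", scores.round(2))
print("Overall estimate:", scores.mean().round(3))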
DATA MINING: a step in the process of Data Analytics
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about
values of data using known results found from different
historical data
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable Y given a set of other variables X.
Here X could be an n-dimensional vector
– In effect this is a function approximation through learning the
relationship between Y and X
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, less
emphasis on understanding the model
Prediction Problems:
Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data
• Numeric Prediction
– models continuous-valued functions, and predicts
unknown or missing values
Predictive Modeling: Customer Scoring
• Goal: To predict whether a customer is a high risk
customer or not.
– Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a
function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc
– Many, many applications of this nature
Classification
[Figure: scatter plot of blood-cell measurements against Red Blood Cell Volume, illustrating how classification separates the instances into classes.]
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns
(sequential patterns) in the data rather than to
characterize the data globally
– Also called link analysis (uncovers relationships among data)
Basic Data Mining algorithms
• Classification: which is also called Supervised learning,
maps data into predefined groups or classes to enhance the
prediction process
• Clustering: which is also called Unsupervised learning,
groups similar data together into clusters.
– is used to find appropriate groupings of elements for a set of data.
– Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by bottom-up approach.
• Association Rule: is also known as market-basket
analysis
– It discovers interesting associations between attributes contained in
a database.
– Based on the frequency of occurrence of items in an event, an
association rule tells us: if item X is part of the event, what is the
likelihood that item Y is also part of the event?
Classification
Classification is a data mining (machine
learning) technique used to predict group
membership of new data instances.
OVERVIEW OF CLASSIFICATION
• Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the
class.
– construct a model for class attribute as a function of the
values of other attributes.
– Given data D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm},
the classification problem is to define a mapping f: D → C
where each ti is assigned to one class.
• Goal: previously unseen records should be assigned a class
as accurately as possible. A test set is used to determine
the accuracy of the model.
– Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
Classification Examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Predict whether the weather on a particular day will
be “sunny”, “rainy” or “cloudy”.
• Identify individuals with credit risks.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Document classification into the predefined classes,
such as politics, sport, social, economy, law, etc.
CLASSIFICATION: A TWO-STEP PROCESS
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Illustrating Classification Task
Training Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No

A learning algorithm performs induction over the training set to build a
model; the model is then applied to classify the records of the test set.

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
15    No        Large     67K       ?
Confusion Matrix & Performance Evaluation
                         PREDICTED CLASS
                         Class=Yes     Class=No
ACTUAL    Class=Yes      a (TP)        b (FN)
CLASS     Class=No       c (FP)        d (TN)
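A quick plain-Python sketch of the performance measures that come out of these four counts; the counts themselves are hypothetical:

# Accuracy (and two related measures) from confusion-matrix counts.
tp, fn = 50, 10   # actual Yes: predicted Yes / predicted No
fp, tn = 5, 35    # actual No:  predicted Yes / predicted No

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.85 precision=0.91 recall=0.83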
Classification methods
K-Nearest Neighbors
• K-nearest neighbor is a supervised learning algorithm
where the result of new instance query is classified
based on majority of K-nearest neighbor category.
• The purpose of this algorithm is to classify a new object
based on attributes and training samples: (xi, f(xi)),
i=1..N.
• Given a query point, we find K number of objects or
(training points) closest to the query point.
– The classification is using majority vote among the
classification of the K objects.
– The K-nearest-neighbor algorithm uses the neighborhood
classification as the prediction value of the new query instance.
• The K-nearest-neighbor algorithm is very simple: it works
based on the minimum distance from the query instance to
the training samples to determine the K nearest neighbors.
How to compute K-Nearest Neighbor (KNN)
Algorithm?
• Determine parameter K = number of nearest neighbors
• Calculate the distance between the query-instance and
all the training samples
– we can use Euclidean distance
• Sort the distance and determine nearest neighbors
based on the Kth minimum distance
• Gather the category of the nearest neighbors
• Use simple majority of the category of nearest
neighbors as the prediction value of the query instance
– Any ties can be broken at random with reason.
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
• Setting the variable K (Number of nearest neighbors)
–The numbers of nearest neighbors (K) should be based on cross
validation over a number of K setting.
–k = 1 is a good baseline model to benchmark against.
–A good rule-of-thumb is that K should be less than or equal to the
square root of the total number of training patterns.
• Setting the type of distance metric
–We need a measure of distance in order to know who the
neighbours are
–Assume that we have T attributes for the learning problem. Then
one example point xi has elements xit , t = 1,…,T.
–The distance between two points xi and xj is often defined as the
Euclidean distance:

Dist(Xi, Xj) = sqrt( Σt=1..T (xit − xjt)² )
Example
• We have data from a questionnaire survey (asking people’s opinions) & objective
testing with two attributes (acid durability & strength) to classify whether a
special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
7                                7                       Bad
7                                4                       Bad
3                                4                       Good
1                                4                       Good

• Now the factory produces a new paper tissue that passes the laboratory test with
X1 = 3 and X2 = 7.
– Without undertaking another expensive survey, guess the goodness of the
new tissue. Use squared Euclidean distance for the similarity measurement.

K = sqrt(4) = 2
dis(TD1, NP) = sqrt(16 + 0) = sqrt(16) = 4
dis(TD2, NP) = sqrt(16 + 9) = sqrt(25) = 5
dis(TD3, NP) = sqrt(0 + 9)  = sqrt(9)  = 3
dis(TD4, NP) = sqrt(4 + 9)  = sqrt(13) = 3.6
Ranking: 1. TD3 (Good)   2. TD4 (Good)   3. TD1 (Bad)   4. TD2 (Bad)
– The new product is Good
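A short plain-Python sketch that reproduces this example end to end (distances, ranking, and the majority vote over the K = 2 nearest neighbors):

# KNN on the paper-tissue example.
from math import sqrt
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 2

dists = sorted(
    (sqrt((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2), label)
    for (x1, x2), label in train
)
print(dists)            # [(3.0, 'Good'), (~3.61, 'Good'), (4.0, 'Bad'), (5.0, 'Bad')]

votes = Counter(label for _, label in dists[:k])   # majority vote over K nearest
print("Prediction:", votes.most_common(1)[0][0])   # Good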
KNNs: advantages & Disadvantages
• Advantage
– Simple
– Powerful
– Requires no training time
– Nonparametric architecture
• Disadvantage: Difficulties with k-nearest neighbour
algorithms
– Memory intensive: just store the training examples
• when a test example is given then find the closest matches
– Classification/estimation is slow
– Have to calculate the distance of the test case from all training
cases
– There may be irrelevant attributes amongst the attributes –
curse of dimensionality
Decision Tree
Decision Trees
• Decision tree constructs a tree where internal
nodes are simple decision rules on one or more
attributes and leaf nodes are predicted class labels.
– Given an instance of an object or situation, which is specified by a
set of properties, the tree returns a “yes” or “no” decision about
that instance.
[Figure: a small decision tree: the root tests Attribute_1; the value-1 and value-3 branches lead to further tests on Attribute_2, while the value-2 branch leads directly to Class1.]
• Information Gain
–Select the attribute with the highest information gain, i.e., the one
that creates the smallest average disorder
• First, compute the disorder using Entropy: the expected
information needed to classify objects into classes
• Second, measure the Information Gain: calculate by how
much the disorder of a set would be reduced by knowing the value
of a particular attribute.
Entropy
• The Entropy measures the disorder of a set S containing a
total of n examples, of which n+ are positive and n− are
negative, and it is given by:

Entropy(S) = D(n+, n−) = −(n+/n) log2(n+/n) − (n−/n) log2(n−/n)

• Some useful properties of the Entropy:
– D(n, m) = D(m, n)
– D(0, m) = D(m, 0) = 0
D(S) = 0 means that all the examples in S have the same
class
– D(m, m) = 1
D(S) = 1 means that half the examples in S are of one class
and half are in the opposite class
Information Gain
• The Information Gain measures the expected
reduction in entropy due to splitting on an attribute A:

GAIN_split = Entropy(S) − Σi=1..k (ni / n) × Entropy(i)

where the parent node S is split into k partitions and ni is the number of
records in partition i.

• For example, for a set with 3 positive and 5 negative examples:

D(3+, 5−) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954
Which attribute minimises the disorder?

Test (attribute)   Average disorder
Hair               0.50
Height             0.69
Weight             0.94
Lotion             0.61

• Which decision variable maximises the Info Gain then?
• Remember: it’s the one which minimises the average disorder.
Gain(hair)   = 0.954 − 0.50 = 0.454
Gain(height) = 0.954 − 0.69 = 0.264
Gain(weight) = 0.954 − 0.94 = 0.014
Gain(lotion) = 0.954 − 0.61 = 0.344
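A small plain-Python sketch of these two quantities; the branch counts used for the hair-colour split, (2+,2−), (1+,0−) and (0+,3−), are taken from the sunburn example on the following slides:

# Entropy of a (positive, negative) split and information gain of an attribute.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p, n = pos / total, neg / total
    return -p * log2(p) - n * log2(n)

def info_gain(parent, branches):
    total = sum(p + n for p, n in branches)
    avg_disorder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - avg_disorder

print(round(entropy(3, 5), 3))                                # 0.954
print(round(info_gain((3, 5), [(2, 2), (1, 0), (0, 3)]), 3))  # 0.454 (hair colour)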
The best decision tree?

[Figure: partially built tree for is_sunburned: the root tests Hair colour; the red branch is Sunburned (Emily), the brown branch is None (Alex, Pete, John), and the blonde branch is still mixed (2, 2)?
Sunburned = Sarah, Annie; None = Dana, Katie.]
• Once we have finished with hair colour we then need to build the
remaining branches of the decision tree. Which attribute is better for
classifying the remaining (blonde) examples?
• E(2,2) = 1
• IG(lotion) = E(beforeLotion) − E(afterLotion)
= 1 − (2/4 × E(0,2) + 2/4 × E(2,0)) = 1 − (2/4 × 0 + 2/4 × 0) = 1
• IG(weight) = 1 − (E(weight)) = 1 − (E(heavy) + E(light) + E(average), weighted)
= 1 − (0 × E(0,0) + 2/4 × E(1,1) + 2/4 × E(1,1)) = 1 − (0 + 2/4 × 1 + 2/4 × 1) = 1 − 1 = 0
The best Decision Tree
• This is the simplest and optimal tree possible, and it makes a lot of
sense.
• It classifies 4 of the people on just the hair colour alone.

[Figure: the final tree for is_sunburned: the root tests Hair colour; red -> Sunburned (Emily); brown -> None (Alex, Pete, John); blonde -> test Lotion used: no -> Sunburned (Sarah, Annie), yes -> None (Dana, Katie).]
Sunburn sufferers are ...
• You can view a Decision Tree as an IF-THEN-ELSE
statement which tells us whether someone will suffer
from sunburn.
If (hair-colour = “red”) then
    return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
    return (sunburned = yes)
else
    return (sunburned = no)
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification
methods)
• Convertible to simple and easy-to-understand classification
if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution;
works well on noisy data.

Pros:
+ Reasonable training time
+ Easy to understand & interpret
+ Easy to generate rules & implement
+ Can handle a large number of features
Cons:
− Cannot handle complicated relationships between features
− Simple decision boundaries
− Problems with lots of missing data
Regression analysis
Prediction
• Regression analysis is used to find equations
that fit data. Once we have the regression
equation, we can use the model to make
predictions
• If you know something about X, this
knowledge helps you predict something about
Y using the constructed regression equation…
– Expected value of y at a given level of x:
E(y|x) = mX + B
Computing regression line
• The y-intercept of the regression line is B and
the slope is m. The standard least-squares formulas for
the slope and the y-intercept are:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)        B = (Σy − m Σx) / n

• For the given sample data, these formulas give m = 0.93 and B = 7.29.
• The following data give the height in inches (X) and the weight in lb
(Y) of a random sample of 10 students from a large group of students
of age 17 years: estimate the weight of a student with a height of 69 inches.
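A minimal least-squares sketch in plain Python using the formulas above; the (x, y) pairs below are hypothetical and are not the slide’s sample of 10 students:

# Least-squares slope m and intercept B for y = m*x + B.
xs = [60, 62, 64, 66, 68, 70]        # heights in inches (hypothetical)
ys = [110, 116, 121, 128, 134, 139]  # weights in lb (hypothetical)

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
B = (sum_y - m * sum_x) / n

print(f"m = {m:.2f}, B = {B:.2f}")
print("Predicted weight at height 69:", round(m * 69 + B, 1))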
Individual Assignment (due: )
Discuss the concept, show the algorithm, demonstrate with example
how it works, concluding remarks & Reference:
1. One class classification
2. Single-link clustering
3. Bayesian Belief Network
4. Complete-link clustering
5. Support Vector Machine
6. k-medoids clustering (Partitioning Around Medoids (PAM))
7. Divisive clustering
8. Regression Analysis
9. Decision tree with GINI index
10. Principal Component Analysis (PCA)
11. Average-link clustering
12. Vertical data format for frequent pattern mining
13. Recurrent neural network
14. Ensemble Model
15. Hidden Markov Model
16. Expectation maximization (EM) clustering
17. Outlier Detection Methods
18. Decision tree with information gain ratio
19. Missing value prediction
20. Convolutional Neural Network
21. Genetic algorithm
22. Density Based Spatial Clustering (DBSCAN)
23. Bisecting k-means
Clustering
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster
• User inspection
– Study centroids of the cluster, and spreads of data items in
each cluster
– For text documents, one can read some documents in clusters
to evaluate the quality of clustering algorithms employed.
Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.
[Figure: three groups of labelled points shown as Cluster I, Cluster II and Cluster III.]
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of distance
“farness” or “nearness” measurement between data points.
– Distances are normally used to measure the similarity or dissimilarity
between two data objects
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
• A popular similarity measure is the Minkowski distance:

dis(X, Y) = ( Σi=1..n |xi − yi|^q )^(1/q)

where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-
dimensional data objects; n is the size of the vector of attributes of the
data object; q = 1, 2, 3, …
Similarity & Dissimilarity Between Objects
• If q = 1, dis is the Manhattan distance:

dis(X, Y) = Σi=1..n |xi − yi|

• If q = 2, dis is the Euclidean distance:

dis(X, Y) = sqrt( Σi=1..n (xi − yi)² )
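A tiny plain-Python sketch of the Minkowski family and its two common special cases; the example points are hypothetical:

# Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (2, 10), (8, 4)
print(minkowski(x, y, 1))            # 12.0 -> Manhattan: |2-8| + |10-4|
print(round(minkowski(x, y, 2), 2))  # 8.49 -> Euclidean: sqrt(36 + 36)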
The need for a representative
• Key problem: as you build clusters, how do you
represent the location of each cluster, to tell
which pair of clusters is closest?
• For each cluster assign a centroid (closest to all
other points) = the average of its points:

Cm = ( Σi=1..N tip ) / N

where N is the number of points in cluster C and the tip are its points.
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters; such that, sum of squared
distance is minimum
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the
cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Algorithm
Given k (number of clusters), the k-means algorithm is
implemented as follows:
– Select K cluster points randomly as initial centroids
– Repeat until the centroids don’t change
• Compute similarity between each instance and
each cluster
• Assign each instance to the cluster with the
nearest seed point
• Recompute the centroids of each K clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5)
A6(6, 4) A7(1, 2) A8(4, 9).
– Assume that the initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers - centroids, are (2, 10), (8,4) and (1, 2) - chosen
randomly.
Data Points   Cluster 1 with     Cluster 2 with    Cluster 3 with    Cluster
              centroid (2, 10)   centroid (8, 4)   centroid (1, 2)
A1 (2, 10)    0                  12                9                 1
A2 (2, 5)     5                  7                 4                 3
A3 (8, 4)     12                 0                 9                 2
A4 (5, 8)     5                  7                 10                1
A5 (7, 5)     10                 2                 9                 2
A6 (6, 4)     10                 2                 7                 2
A7 (1, 2)     9                  9                 0                 3
A8 (4, 9)     3                  9                 10                1

Next, we will calculate the distance from each point to each of the
three centroids, by using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the
three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table & decide which cluster should the point (2, 10) be
placed in? The one, where the point has the shortest distance to the mean – i.e.
mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to cluster 3,
since mean 3 is at the shortest distance from A2.
• Analogically, we fill in the rest of the table, and place each
point in one of the clusters
Iteration 1
• Next, we need to re-compute the new cluster centers. We
do so, by taking the mean of all points in each cluster.
• For Cluster 1, we have three points and take their
average as the new centroid, i.e.
((2+5+4)/3, (10+8+9)/3) = (3.67, 9)
• For Cluster 2, we have three points. The new centroid is:
((8+7+6)/3, (4+5+4)/3 ) = (7, 4.33)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• Since centroids changes in Iteration1 (epoch1), we go to the
next Iteration (epoch2) using the new means we computed.
– The iteration continues until the centroids do not change anymore..
Second epoch
• Using the new centroid compute cluster members again.
Data Points   Cluster 1 with      Cluster 2 with      Cluster 3 with      Cluster
              centroid (3.67, 9)  centroid (7, 4.33)  centroid (1.5, 3.5)
A1 (2, 10)    2.67                10.67               7                   1
A2 (2, 5)     5.67                5.67                2                   3
A3 (8, 4)     9.33                1.33                7                   2
A4 (5, 8)     2.33                5.67                8                   1
A5 (7, 5)     7.33                0.67                7                   2
A6 (6, 4)     7.33                1.33                5                   2
A7 (1, 2)     9.67                8.33                2                   3
A8 (4, 9)     0.33                7.67                8                   1
• After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 2nd epoch there is no change in the members of the
clusters or in the centroids, so the algorithm stops.
• The result of clustering is shown in the following figure
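A plain-Python sketch of this K-means run (Manhattan distance, the given initial centroids); it reproduces the final clusters {A1, A4, A8}, {A3, A5, A6}, {A2, A7} and their centroids:

# K-means with Manhattan distance on the eight example points.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centroids = [(2, 10), (8, 4), (1, 2)]          # initial centroids

def manhattan(p, c):
    return abs(p[0] - c[0]) + abs(p[1] - c[1])

while True:
    # assignment step: each point goes to its nearest centroid
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: manhattan(p, centroids[i]))
        clusters[nearest].append(p)
    # update step: recompute each centroid as the mean of its cluster
    new_centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     for c in clusters]
    if new_centroids == centroids:             # no change -> converged
        break
    centroids = new_centroids

print(clusters)   # [[(2,10),(5,8),(4,9)], [(8,4),(7,5),(6,4)], [(2,5),(1,2)]]
print(centroids)  # approx. [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]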
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is the number of
objects, k is the number of clusters, and t is the number of
iterations. Normally, k, t << n.
• Weakness
– Applicable only when the mean is defined and K, the number of
clusters, is specified in advance (use the Elbow method to determine k
for K-Means)
• Alternatively, use hierarchical clustering
– Unable to handle noisy data & outliers, since an object with an
extremely large value may substantially distort the distribution of
the data.
• K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the
most centrally located object in a cluster.
Hierarchical Clustering
• As compared to partitioning algorithms, in hierarchical clustering the data are
organized into a nested (tree) structure that is more informative than the
unstructured set of flat clusters.

[Figure: a dendrogram (with merge heights on the vertical axis) alongside scatter plots showing the corresponding nested groupings of the points.]
Example
• Perform an agglomerative clustering of the five samples below using two
features X and Y. Calculate the Manhattan distance between each
pair of samples to measure their similarity.
Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
Proximity Matrix: First epoch
1 2 3 4 5
1 = (4,4) - 4 15 20 28
2= (8,4) - 11 16 24
3=(15,8) - 13 13
4=(24,4) - 8
5=(24,12) -
Proximity Matrix: Second epoch (after merging samples 1 and 2)
(1,2) 3 4 5
(1,2) = (6,4) - 13 18 26
3=(15,8) - 13 13
4=(24,4) - 8
5=(24,12) -
Proximity Matrix: Third epoch (after merging samples 4 and 5)
(1,2) 3 (4,5)
(1,2) = (6,4) - 13 22
3=(15,8) - 9
(4,5)=(24,8) -
Proximity Matrix: Fourth epoch (after merging sample 3 with cluster (4,5))
(1,2) (3,4,5)
(1,2) = (6,4) - 19
(3,4,5)=(21,8) -
Proximity Matrix: Fifth epoch (all samples merged)
(1,2,3,4,5)
(1,2,3,4,5)=(15, 6.4) -
Dendrogram (order of merges):
(1,2)
(4,5)
(3,4,5)
(1,2,3,4,5)
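A plain-Python sketch of this bottom-up process: repeatedly merge the two closest clusters, representing each cluster by the centroid of its members and measuring closeness with the Manhattan distance, as in the proximity matrices above.

# Agglomerative (centroid-based) clustering of the five samples.
samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

clusters = {(i,): xy for i, xy in samples.items()}   # cluster members -> centroid
while len(clusters) > 1:
    # find the closest pair of clusters
    d, a, b = min((manhattan(clusters[a], clusters[b]), a, b)
                  for a in clusters for b in clusters if a < b)
    merged = tuple(sorted(a + b))
    members = [samples[i] for i in merged]
    centroid = (sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members))
    print(f"merge {a} + {b} at distance {d} -> {merged}, centroid {centroid}")
    del clusters[a], clusters[b]
    clusters[merged] = centroid
# Output: (1,2) merged at 4, (4,5) at 8, (3,4,5) at 9, (1,2,3,4,5) at 19.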
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by ‘cutting’
the dendrogram at the proper level
• They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Major Weakness
• Do not scale well: time complexity of at least
O(n²), where n is the total number of data objects
Project (Due:_______)
•Requirement:
–Choose dataset with 10+ attributes and at least 1500 instances. As much as
possible try to use local data to learn a lot about DM
–Prepare the dataset by applying business understanding, data understanding
and data preprocessing.
–Use DM algorithm assigned to experiment using WEKA and discover
interesting patterns
•Project Report
–Write a publishable report with the following sections:
• Abstract -- ½ page
• Introduce problem, objective, scope & methodology -- 2 pages
• Review related works -- 4 pages
• Description of Data preparation -- 3 pages
• Description of DM algorithms used for the experiment -- 3 pages
• Discussion of experimental result, with findings --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style)
• Describe, in detail, the contribution of each member of the group
Group   Group members                   Data analytics task                                       Project title
1       Samson Akale, Dawit, Leniesil   Classification: Decision tree, Neural Network,            –
                                        K-Nearest Neighbour; select one more algorithm
2       G/Tsadk, Mulu, Eshetu           Classification: Decision tree, Support Vector Machine,    –
                                        Regression Analysis; select one more algorithm
3       Hulu, Betty, Adane              Clustering: K-Means, Hierarchical, Expectation            –
                                        Maximization; select one more algorithm
Association Rule Discovery: Definition
• Association rule discovery attempts to discover
hidden linkage between data items
• Given a set of records each of which contain some
number of items from a given collection;
– Association rule discovery produces dependency
rules which determine the likelihood of
occurrence of an item based on the occurrences of
other items.
Association Rule Discovery: Definition
• Given a set of records each of which contain some
number of items from a given collection;
– Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Motivation of Association Rule discovery
Association Rule Discovery: Application
• Shelf management (in Supermarket, Book shop, Pharmacy
…).
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process historical sales transactions data (the
point-of-sale data collected with barcode scanners) to find
dependencies among items.
– A classic rule –
• If a customer buys Coffee and Milk, then he/she is very
likely to buy Beer.
• So, don’t be surprised if you find six-packs of Beer stacked
next to Coffee!
{Coffee, Milk} --> Beer
Prevalent vs. Interesting Rules
• Analysts already know about prevalent rules
– Interesting rules are those that deviate from
prior expectation
• Mining’s payoff is in finding interesting
(surprising) phenomena
• What makes a rule surprising?
– Does not match prior expectation
• e.g., the correlation between milk and cereal
remains roughly constant over time
(1995 callout: “Milk and Cereal sell together!” … Zzzz, already known)
• Cannot be trivially derived from simpler
rules
– Milk 60%, cereal … 60%
– Milk & cereal 60% … prevailing
– Raw Meat … 65%
– Milk & Raw Meat … 65% … Surprising!
(2014 callout: “Milk and Raw Meat sell together!”)
Association Rule Discovery: Two Steps
• The problem of Association rule discovery can be
generalized into two steps:
– Finding frequent patterns from large itemsets
• Frequent pattern is a set of items (subsequences,
substructures, etc.) that occurs frequently in a data set
– Generating association rules from these itemsets.
• Association rules are defined as statements of the form {X1, X2,
…, Xn} --> Y, which means that Y may be present in the transaction
if X1, X2, …, Xn are all in the transaction.
• Example: Rules Discovered:
{Milk} --> {Coke}
{Tea, Milk} --> {Coke}
Frequent Pattern Analysis: Basic concepts
• Itemset:
– A set of one or more items; a k-itemset: X = {x1, …, xk}
• Support
– support, s, is the fraction of transactions that contain X (i.e., the
probability that a transaction contains X)
– the support of X & Y must be greater than or equal to a user-defined
threshold s; i.e., the probability s that a transaction contains X ∪ Y
– An itemset X is frequent if X’s support is no less than a minsup
threshold
• Confidence
– confidence is the probability of finding Y = {y1, …, yk} in a
transaction that contains X = {x1, x2, …, xn}
– confidence, c, is the conditional probability that a transaction having X
also contains Y; i.e., the conditional probability (confidence) of Y given
X must be greater than or equal to a user-defined threshold c
Example: Finding frequent itemsets
• Given a support threshold S, sets of items that
appear in at least S baskets are called frequent itemsets.
• Example: Frequent Itemsets
– Itemsets bought={milk, coke, pepsi, biscuit, juice}.
– Find frequent k-itemsets that fulfill minimum support of 50% of
the given transactions (i.e. 4 baskets).
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets:
• Frequent 1-itemset: {m}, {c}, {b}, {j};
• Frequent 2-itemsets: {m,b} , {b,c}.
• Is there any frequent 3-itemset?
Association Rules
• Find all rules on frequent itemsets of the form XY that fulfills
minimum confidence
– If-then rules about the contents of baskets.
• X → Y; where X = {x1, …, xn} and Y = {y1, …, yk}. This means that:
– “if a basket contains all the items of X, then what is the likelihood of it
containing Y?”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” The confidence of an association rule is the
probability of Y given X, i.e., the fraction of the transactions
containing X that also contain Y.
• Example: Given the following transactions, generate
association rules with minimum support & confidence of
50%:
B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}      B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
• Association rule: b → c (fulfils both: s = 50% & c = 67%).
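A small plain-Python sketch of how these two measures are computed over the baskets (the rule b → c shown above is used as the example):

# Support and confidence of a candidate rule over the eight baskets.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

lhs, rhs = {"b"}, {"c"}                                # rule b -> c
print("support    =", support(lhs | rhs))              # 0.5  (50%)
print("confidence =", round(confidence(lhs, rhs), 2))  # 0.67 (67%)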
Frequent Itemset Mining Methods
• The hardest problem often turns out to be finding the frequent
pairs.
• Naïve Algorithm
– Read file once, counting in main memory the occurrences of each pair.
• From each basket of n items, generate all n(n−1)/2 pairs by
two nested loops.
– Fails if (number_of_items)² exceeds main memory.
• Remember: number of items can be, say Billion of Web pages via
Internet.
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
– If {Coke, Tea, nuts} is frequent, so is {Coke, Tea}
• i.e., every transaction having {Coke, Tea, nuts} also contains {Coke,
Tea}
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
that limits the need for main memory.
– Key idea: if a set of items appears at least s times, so does
every subset.
• Contra-positive for pairs: if item i does not appear in
s baskets, then no pair including i can appear in s
baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
– Mining Frequent Patterns without explicit Candidate
Generation, rather it generates frequent itemsets
– Uses the Apriori Pruning Principle to generate frequent
itemsets
A-Priori Algorithm
• A-priori is a two-pass approach for each frequent k-itemset
generation
• Step 1
– Pass 1: Read the baskets & count in main memory the occurrences
of each item
• Identify candidate frequent k-itemsets
– Pass 2: identify those truly frequent k-itemsets
• Step 2
– Pass 1: Read the baskets again & count in main memory only
those pairs both of whose items were found to be frequent in Step 1
– Pass 2: identify those truly frequent k-itemsets

[Figure: main memory in the two passes: Pass 1 holds the item counts (yielding the frequent items); Pass 2 holds the counts of the candidate pairs.]
The Apriori Algorithm: A Candidate Generation &
Test Approach
• Iterative algorithm (also called level-wise search): Find
all 1-item frequent itemsets; then all 2-item frequent
itemsets, and so on.
– In each iteration k, only consider itemsets that contain some
k-1 frequent itemset.
Find frequent itemsets of size 1: F1
For k = 2 onwards:
Ck = candidates of size k: those itemsets of size k that could
be frequent, given Fk-1
Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need
to scan the database once).
The Apriori Algorithm—An Example
Assume that min support = 2 and min confidence = 80%; identify the
frequent itemsets and construct association rules.

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (support >= 2): {A}:2, {B}:3, {C}:3, {E}:3

2nd scan -> C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan -> C3: {A,B,C}:1, {A,B,E}:1, {A,C,E}:1, {B,C,E}:2
L3: {B,C,E}:2
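A compact level-wise sketch of this process in plain Python; note that it prunes candidates with the Apriori property before counting, so its C3 contains only {B, C, E}:

# Level-wise (Apriori-style) frequent itemset mining on the TDB example.
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def count(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if count({i}) >= min_sup}   # L1
k = 2
while frequent:
    print(sorted(sorted(s) for s in frequent))     # L1, L2, L3, ...
    # join frequent (k-1)-itemsets into size-k candidates, then prune any
    # candidate that has an infrequent (k-1)-subset (Apriori property)
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = {c for c in candidates if count(c) >= min_sup}
    k += 1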
Which of the above rules fulfil a confidence level of
at least 80%?

Rule          Support   Confidence
C --> B       50%       f(B,C)/f(C) = 2/3 = 67%   (weak relationship)
B --> C       50%       66.67%
B --> E       75%       f(B,E)/f(B) = 3/3 = 100%
C --> E       50%       66.67%
(B,C) --> E   50%       100%
(B,E) --> C   50%       f(B,C,E)/f(B,E) = 2/3 = 67%
E --> B       75%       f(B,E)/f(E) = 3/3 = 100%  (strong relationship)

Results:
A --> C       (support 50%, confidence 100%)
B --> E       (support 75%, confidence 100%)
(B,C) --> E   (support 50%, confidence 100%)
Bottlenecks of the Apriori approach
• The Apriori algorithm reduces the size of candidate
frequent itemsets by using “Apriori property.”
– However, it still requires two nontrivial computationally
expensive processes.
• It requires as many database scans as the size of the largest
frequent itemsets. In order to find frequent k-itemsets, the
Apriori algorithm needs to scan database k times.
– Breadth-first (i.e., level-wise) search
• Candidate generation and test the frequency of true
appearance of the itemsets
– It may generate a huge number of candidate sets that will
be discarded later in the test stage.
Frequent Pattern-Growth Approach
• The FP-Growth Approach
– Depth-first search: search depth wise by identifying
different set of combinations with a given single or pair of
items
• Steps followed: The FP-Growth Approach scans DB only twice
Scan DB once to find frequent 1-itemset (single
item pattern)
Sort frequent items in frequency descending order,
f-list
Scan DB again to construct FP-tree, the data
structure of FP-Growth
Construct FP-tree from a Transaction Database
• Assume min-support = 3 and min-confidence = 80%
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Header Table (item : frequency, each with a head link into the tree):
f : 4,  c : 4,  a : 3,  b : 3,  m : 3,  p : 3
F-list = f-c-a-b-m-p

[Figure: the resulting FP-tree. From the root {}: a branch f:4 -> c:3 -> a:3, where a:3 has children m:2 (with child p:2) and b:1 (with child m:1); f:4 also has a child b:1; a separate branch c:1 -> b:1 -> p:1 hangs directly off the root. Node-links from the header table connect all nodes carrying the same item.]
[Figure: step-by-step construction of the FP-tree as each ordered transaction is inserted. After T100: a single path f:1 -> c:1 -> a:1 -> m:1 -> p:1. After T200: counts on the shared prefix f-c-a rise to 2 and a new branch b:1 -> m:1 appears under a. After T300: f becomes 3 and gains a child b:1. After T400: a new branch c:1 -> b:1 -> p:1 is added under the root. After T500: the final tree with f:4, c:3, a:3, m:2, p:2.]
FP-Growth Example
• Construct the conditional pattern base, which consists of the set of prefix
paths in the FP-tree co-occurring with the suffix pattern, and then
construct its conditional FP-tree.
• Compactness
– Reduce irrelevant information: infrequent items are gone
– Items in frequency descending order: the more
frequently occurring, the more likely to be shared
– Never be larger than the original database (not count
node-links and the count field)
Exercise
• Given the frequent 3-itemsets
{abc, acd, ace, bcd},
generate the possible frequent 4-itemsets.
Important Dates:
• Concept presentation
• Project Presentation:
• Final exam:
THANK YOU