AAU Data Analytics 24

The document provides an overview of data analytics, defining it as the systematic discovery and application of meaningful patterns in data for effective decision-making. It discusses the importance of business intelligence in navigating complex market environments and the challenges posed by massive data collection. Additionally, it highlights the need for data science to manage big data and the various types of data analytics used to extract valuable insights from large datasets.

Data Analytics

What is analytics?
• Analytics is the systematic discovery, interpretation,
and use of meaningful patterns in data.
– It also entails applying patterns in data towards
effective problem solving and decision making.

• What is data? What makes it different from information and knowledge?
– To answer this, it helps to understand the DIKW hierarchy: Data, Information, Knowledge, Wisdom.
Data, Information, Knowledge, Wisdom & Truth
(Figure: the DIKW pyramid, running from explicit FACT/DATA up to tacit TRUTH, with the depth of meaning increasing at each level)
• DATA: explicit facts; creating concepts leads to
• INFORMATION; creating contexts leads to
• KNOWLEDGE; creating patterns leads to
• WISDOM; creating principles and morals leads to
• TRUTH (tacit)
Why Data Analytics?
• Reason one: We are living in a complex and dynamic business
environment
• How to gain competitive advantage when competitive pressure is
very strong?
• How to control a volatile market (Product, Price, Promotion,
Place, People, Process & Physical evidence)?
• How to satisfy increasingly demanding and knowledgeable users
(customers and consumers)?
• How to manage the high turnover rate of professionals, which
results in diminishing individual and organizational experience?
– Requirement: Business Intelligence
• Prediction: attempting to know what may happen in the future
• Just-in-time response
• Quality, rational, sound and value-adding decision making and
problem solving
• Enhanced efficiency and competency
The need for Business Intelligence
• Business Intelligence is getting the right information to the right
people at the right time to support better decision making and
gain competitive advantage
Why Data Analytics
Reason two: Massive data collection
• Data is being produced (generated & collected) at alarming rate
because of:
– The computerization of business & scientific transactions
– Advances in data collection tools, ranging from scanned texts & image
platforms to satellite remote sensing systems
– Above all, popular use of WWW as a global information system
• Nowadays large databases (data warehouses) are growing at
unprecedented rates to manage the explosive growth of data.
• Examples of massive data sets
– Google: Order of 10 billion Web pages indexed
• 100’s of millions of site visitors per day
– MEDLINE text database: 17 million published articles
– Retail transaction data: EBay, Amazon, Wal-Mart: order of 100
million transactions per day
• Visa, MasterCard: similar or larger numbers
The Web Expansion: Web 0.0 to Web 5.0
Too much data & information, but too little knowledge
• With the phenomenal rate of growth of data, users expect
more useful and valuable information
– There is a need to extract knowledge (useful information) from the
massive data.
• Facing such enormous volumes of data, human analysts without
special tools can no longer make sense of it.
– Data analytics can automate the process of finding patterns &
relationships in raw data and the results can be utilized for decision
support. That is why data analytics is used, in science, health and
business areas.
• If we know how to reveal valuable meaningful patterns in data,
data might be one of our most valuable assets.
– Data analytics is the technology that extracts diamonds of knowledge
from historical data & predicts future outcomes.
The Way Forward
Topics and areas covered:
• Introducing Data Analytics: Meaning of Data Analytics; Essence of DA; Why DA; Issues in DA
• Major Tasks in Data Preparation: Quality Data Preprocessing; Data Cleaning; Data Integration; Data Reduction; Data Transformation
• Supervised learning (Classification): Concepts of Classification; K-Nearest Neighbour; Decision Trees; Naïve Bayes; Artificial Neural Networks
• Unsupervised learning (Clustering): Overview of Clustering; Partitioning algorithms: K-Means, K-Mode & K-Medoids; Hierarchical Clustering: Agglomerative & Divisive Algorithms
• Association Rules Discovery: The essence of association rules discovery; Vertical & Horizontal Data Format; Apriori algorithm; Pattern-Growth Approach
Evaluation
• Assignments & Presentation: 20%
• Project: 25%
• Final exam: 45%
• Knowledge sharing (class attendance & participation): 10%
Presentation assignment
• Instruction: For the given topic, review at least five journal
articles & prepare presentation slides covering the following:
– (i) Introduce what it means, i.e. overview and
definition of the concept;
– (ii) explain why we need it, pros & cons, significance;
– (iii) discuss how it works, architecture, & approaches
followed;
– (iv) concluding remarks (show strengths & weaknesses
of the concept and the way forward);
– (v) references.
Presentation assignment
No Name Topic Date
1 Kidist teshome Data science
2 Mesfin Sisay Data lake
3 Meroda Goshu Big data analytics
4 Mesele KEBEDE Predictive data analytics
5 Temesgen fiseha Education data analytics
6 Tenesgen endakmew Real time data analytics
7 Agriculture data analytics
8 Tadele abate Data mining
9 Samuel Twsfaye Descriptive data analytics
10 Selamawit Tibebe Diagnostic analytics
11 Prescriptive analytics
12 Martha Abdissa Business Intelligence
13 Sentayehu demissie Data analytics
14 Zerihun Tobesa Big data
15 Mulugeta G/medhin Health care data analytics
16 Agriculture data analytics
17 Multimodal data analytics
18 Mintesinot Girma Data ecosystem
19 Machine learning
20 Rediet Firew Image analytics
Presentation assignment
No Name Topic Date
21 Bacha Belete Business data analytics
22 Intrusion detection
23 Azeze debolaw Text classification
24 Sentiment analysis
25 Mehret Biruk Hate speech detection
26 Data Curation
27 Helen Tesfaye Exploratory Data analysis
28 Multimedia data analytics
29 Sirak Gelaye Data wrangling
30 Text clustering
Introducing Big Data
• Big Data and its characteristics

What is Big Data?
• Big data refers to extremely huge and diverse collections of
structured, unstructured and semi-structured data that continue to
grow exponentially over time.
• Big data is a collection of data so large, complex and
heterogeneous, and generated at such a high speed, that it becomes
difficult to store, process and analyze using on-hand database
management tools.

Big Data
• Big data refers to the incredible amount of structured and
unstructured data that humans, applications and machines generate,
which is collected, analyzed and mined for information and insights.
56 V’s of Big data
Two types of big data
• Big data is divided into data at rest and data in
motion.
• Data at rest:
– This refers to data that has been collected from various
sources and is then analyzed after the event occurs.
– The point where the data is analyzed and the point where
action is taken on it occur at two separate times.
• Data in motion:
– The collection process for data in motion is similar to that of
data at rest; however, the difference lies in the analytics.
– In this case, the analytics occur in real-time as the event
happens.
Stages of Big data
• Data Generation: concerns how data are being generated, this is to
mean large diverse and complex dataset that generated from different
data sources. However there are overwhelming technical challenges in
collecting, processing and analyzing these datasets.
• Data acquisition: refers to the process of obtaining information and is
subdivided into data collection, data transmission, and data pre-
processing.
• Data storage and retrieval: concerns persistently storing and managing
large-scale datasets as well as searching for relevant data as per the
information need of data.
• Data analytics: leverages data analysis methods or tools to inspect,
transform, and model data to extract value.

• Each component of this value chain presents various challenges that


require deep research into, mostly because of the heterogeneous and
complex character of the data involved.
Big Data Challenges
Classification of big data challenges
• Challenges of big data can be classified into:
data management and data analytics.
– Data management involves processes and
supporting technologies to acquire and store data
as well as to prepare and retrieve it for analysis.
– Data analytics refers to techniques used to
discover meaningful patterns and acquire
intelligence from big data.
• Such challenges inspire the introduction of a
multidisciplinary field: Data Science
Data science

More data usually beats better algorithms, techniques and tools

The need for data science
• No matter how efficient your algorithms are, they can often be
beaten simply by having more data.
The need for data science
• There is a high demand for Data Science and data analytics
– databases, warehousing, data architectures
– data analytics – statistics, data mining, machine learning
• Data science is the ability to store large amounts of data, to
understand it, to process it, to extract value from it, to
visualize it, and to communicate it for decision making and problem
solving in such a dynamic world
• It supports “Business Intelligence”
– for smart decision-making and problem solving
– for predicting potential market, potential product, potential
customers
– need data for identifying risks, opportunities, conducting
“what-if” analyses
Components of Data Science
Where to concentrate in Data Science?
Data Architecture
The Data Architecture includes a thorough analysis of:
• what data needs to be accessed, and why?
• where is the data located ?
• what is the currency of the data ?
• how will we maintain data integrity ?
• what is the data relationship between data displayed
& stored?
• how can we provide round the clock availability
when the backend systems and databases are not
available on a 24x7 basis?
Issues of efficiency with real databases
• indexing
– how to efficiently find all songs users want in a database with
10,000,000 entries?
– data structures for representing sorted order on fields
• disk management
– databases are often too big to fit in RAM, leave most of it on
disk and swap in blocks of records as needed – could be slow
• concurrency
– transaction semantics: either all updates happen in batch or
none (commit or rollback)
– like delete one record and simultaneously add another but
guarantee not to leave in an inconsistent state
– other users might be blocked till done
Data quality
• missing values
– how to interpret? not available? 0? Predict values
using the mean or mode?
• duplicated values
– including partial matches (Jon Smith=John Smith?)
• inconsistency:
– multiple information (say, addresses) for a person
• outliers:
– salaries that are negative, or in the trillions
Solution to manage big data: Data lake
• A data lake is a centralized repository that ingests and stores
large volumes of data in its original form.
– The data can then be processed and used as a basis for a variety of
analytic needs.
• Due to its open, scalable architecture, a data lake can
accommodate all types of data from any source, from structured
(database tables, Excel sheets) to semi-structured (XML files,
webpages) to unstructured (images, audio files, tweets), all
without sacrificing fidelity.
• The data files are typically stored in staged zones—raw,
cleansed, and curated—so that different types of users may use
the data in its various forms to meet their needs.
• Data lakes provide core data consistency across a variety of
applications, powering big data analytics, machine learning,
predictive analytics, and other forms of intelligent action.
Data lake vs. data warehouse

Aspect       Data lake                                   Data warehouse
Type         Structured, semi-structured, unstructured;  Structured;
             relational, non-relational                  relational
Schema       Schema on read                              Schema on write
Format       Raw, unfiltered                             Processed, checked
Sources      Big data, IoT, social media,                Application, business, transactional
             streaming data                              data, batch reporting
Scalability  Easy to scale at a low cost                 Difficult and expensive to scale
Users        Data scientists, data engineers             Data warehouse professionals,
                                                         business analysts
Use cases    Machine learning, predictive analytics,     Core reporting, BI
             real-time analytics
Data lake vs. data lakehouse

Aspect       Data lake                                   Data lakehouse
Type         Structured, semi-structured, unstructured;  Structured, semi-structured, unstructured;
             relational, non-relational                  relational, non-relational
Schema       Schema on read                              Schema on read, schema on write
Format       Raw, unfiltered, processed, curated         Raw, unfiltered, processed, curated,
                                                         delta format files
Sources      Big data, IoT, social media,                Big data, IoT, social media, streaming
             streaming data                              data, application, business,
                                                         transactional data, batch reporting
Scalability  Easy to scale at a low cost                 Easy to scale at a low cost
Users        Data scientists                             Business analysts, data engineers,
                                                         data scientists
Use cases    Machine learning, predictive analytics      Core reporting, BI, machine learning,
                                                         predictive analytics
What Can Be Learned From Big Data?
• Data analytics is an emerging technique that dives into a data set
without a prior set of hypotheses.
• From the data it derives meaningful trends or intriguing findings
that were not previously seen or empirically validated.
• Data analytics enables quick decisions and helps change policies
based on the trends observed.
Data analytics
• Data analytics is a multidisciplinary field, which involves an
extensive use of computer science, information science,
mathematics, statistics, and the use of descriptive techniques and
predictive models to gain valuable knowledge from data.
Defining data analytics
• Data analytics is the method of looking at big data to reveal hidden
patterns, previously unknown relationships, and other important
information that can be utilized to make better business
decisions.
• Data analytics is defined as an approach or type of business analytics
that assists organizations to analyze large amounts of data in a timely
fashion and extract fruitful patterns and relationships
• Data analytics is the systematic approach of collecting, processing,
and analyzing data sets using statistical and other business analysis
methodologies regardless of size and volume to provide better
insights in strategic, tactical, and operational decision making.
• Hence, data analytics can be exemplified as the systematic approach
of the collection of massive data sets, processing, and analyzing for
data-driven decision making
Four types of data analytics: descriptive, diagnostic, predictive, and prescriptive
Example of Data Analytics
• Customer relationship management:
– Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
– Identify likely responders to sales promotions

• Fraud detection/Network intrusion detection
– Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular
customer?

Data analytics helps extract such useful information.
Issues in Data Analytics
• Scalability
– Applicability of data analytics to perform well with massive real
world data sets
– Techniques should also work regardless of the amount of available
main memory
• Real World Data
– Real world data are noisy and have many missing attribute values.
Algorithms should be able to work even in the presence of these
problems
• Updates
– The database cannot be assumed to be static; the data is frequently
changing.
– However, many data mining and machine learning algorithms work
with static data sets. This requires that the algorithm be completely
rerun any time the database changes.
Data analytics implementation issues
• High dimensionality:
– A conventional database schema may be composed of
many different attributes. The problem here is that all
attributes may not be needed to solve a given DM
problem.
– The use of unnecessary attributes may increase the
overall complexity and decrease the efficiency of an
algorithm.
– The solution is dimensionality reduction (reduce the
number of attributes). But determining which
attributes are not needed is a tough task!

• Overfitting
– The size and representativeness of the dataset
determine whether a model built on the current
database state will also fit future database states.
– Overfitting occurs when the model does not generalize
to future states, typically caused by a small or
unbalanced training database.
Assignment
• Compare and contrast: overfitting vs
underfitting
– Discuss what they mean
– Show their similarity and differences
– Methods used to solve them, show with example
– Conclusion
– Reference
Data Preparation
• There are three major phases in data analytics:
• Pre-data analytics,
• data analytics for modeling
• post-data analytics
• Pre-data analytics involves four major tasks:
• Data cleansing
• Data integration
• Data reduction
• Data transformation

Data Collection for analytics
• Data analytics and data mining requires collecting great
amount of data to achieve the intended objective.
– Data analytics starts by understanding the business or problem
domain in order to gain the business knowledge
• Business knowledge guides the process towards useful results,
and enables the recognition of those results that are useful.
– Based on the business knowledge data related to the business
problem are identified from the database/data warehouse for
analytics.
– Once we collect the data, the next task is data understanding,
where we need to understand well the type of data we are
using for analysis and identify the problems observed within the
data.
• Before feeding data to DM, we have to make sure of the
quality of the data.
Data Quality Measures
• A well-accepted multidimensional data quality measures
are the following:
– Accuracy (free from errors and outliers)
– Completeness (no missing attributes and values)
– Consistency (no inconsistent values and attributes)
– Timeliness (appropriateness of the data for the purpose it is
required)
– Believability (acceptability)
– Interpretability (easy to understand)

• Most of the data in the real world are poor quality; that
is:
– Incomplete, Inconsistent, Noisy, Invalid, Redundant, …

Data is often of low quality
• Data in the real world is with poor quality
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality analytics results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Why low quality Data?
• Collecting the required data is challenging
– In addition to its heterogeneous & distributed nature of data,
real world data is low in quality.

• Why?
– You didn’t collect it yourself
– It probably was created for some other use, and then you came
along wanting to integrate it
– People make mistakes (typos)
– People are too busy (“this is good enough”) to systematically
organize it carefully using structured formats
Types of problems with data
• Some data have problems on their own that needs to be
cleaned:
– Outliers: misleading data that do not fit to most of the data/facts
– Missing data: attributes values might be absent which needs to be
replaced with estimates
– Irrelevant data: attributes in the database that might not be of
interest to the DM task being developed
– Noisy data: attribute values that might be invalid or incorrect. E.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
– Everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How to integrate data organized in different formats following
different conventions?
The issue is…
• How do we prepare enough complete, good-quality data for
analytics?
– Coming up with good quality data requires passing
through different data preprocessing tasks.
Forms of data preprocessing
Data Cleaning

• Data cleaning tasks


– Correct redundant data
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that requires data
cleaning
• What’s wrong here? (records 2 and 3 are duplicates)

ID  Name                        City         State
1   Ministry of Transportation  Addis Ababa  Addis Ababa
2   Ministry of Finance         Addis Ababa  Addis Ababa
3   Ministry of Finance         Addis Ababa  Addis Ababa

• How to clean it: manually or automatically?
Data Cleaning: Incomplete Data
• The dataset may lack certain attributes of interest
– Is that enough if you have patient demographic profile and
address of region to predict the vulnerability (or exposure) of
a given region to Malaria outbreak?
• The dataset may contain only aggregate data. E.g.: traffic
police car accident report
– this much accident occurs this day in this sub-city
No of accident Date address
3 Oct 23, 2012 Yeka, Addis Ababa

2 Oct 12, 2011 Amhara region

Data Cleaning: Missing Data
• Data is not always available, lacking attribute values. E.g.,
Occupation=“ ”
 many tuples have no recorded value for several attributes, such
as customer income in sales data

ID Name City State


1 Ministry of Transportation Addis Ababa Addis Ababa
2 Ministry of Finance ? Addis Ababa
3 Office of Foreign Affairs Addis Ababa Addis Ababa

• What’s wrong here? A missing required field

Data Cleaning: Missing Data
• Missing data may be due to
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding and may not be
considered important at the time of entry
– not register history or changes of the data

Data Cleaning: Missing Data
There are different methods for treating missing values.
• Ignore the record (tuple) with the missing value: usually
done when the class label is missing (assuming the task is
classification).
– Not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: tedious, time consuming
and often infeasible.
• Use a global constant to fill in the missing value: e.g., fill all
missing values with a label such as “unknown”.
• Use the attribute’s mean or mode to fill in the missing
value: replace the missing values with the attribute’s mean
(for numeric attributes) or mode (most frequent value, for
nominal attributes), as shown in the table and sketch below.
• Use the most probable value to fill in the missing value automatically:
– calculate the most probable value using, say, the Expectation
Maximization (EM) algorithm.
Example: Missing Values Handling method
Attribute Data type Handling method
Name
Sex Nominal Replace by the mode value.
Age Numeric Replace by the mean value.
Religion Nominal Replace by the mode value.
Height Numeric Replace by the mean value.
Marital status Nominal Replace by the mode value.
Job Nominal Replace by the mode value.
Weight Numeric Replace by the mean value.
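The handling rules in the table above can be applied in a few lines of pandas. A minimal sketch, assuming pandas is available; the toy records and column names are illustrative only:

```python
import pandas as pd

# Toy records with missing values; columns follow the table above (illustrative only).
df = pd.DataFrame({
    "Sex":    ["F", "M", None, "F"],
    "Age":    [25, None, 40, 31],
    "Height": [1.62, 1.75, None, 1.58],
})

# Numeric attributes: replace missing values with the column mean.
for col in ["Age", "Height"]:
    df[col] = df[col].fillna(df[col].mean())

# Nominal attributes: replace missing values with the mode (most frequent value).
for col in ["Sex"]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```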
Predict missing value using EM
• EM solves estimation with incomplete data:
– Obtain initial estimates for the parameters (e.g., using the mean value).
– Use the estimates to calculate a value for the missing data, and
– continue the process iteratively until convergence ((μi − μi+1) ≤ Ө).
• E.g.: out of six data items, the known values are {1, 5, 10, 4};
estimate the two missing data items.
– Let EM converge when two successive estimates differ by at most 0.05,
and let our initial guess for the two missing values be 3.
• The algorithm stops when the last two estimates are only about 0.05
apart; thus, our estimate for the two missing items is about 4.97.
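The iterative estimation described above is easy to reproduce. A minimal sketch of the simplified EM-style loop, assuming the known values {1, 5, 10, 4}, an initial guess of 3 and a convergence threshold of 0.05 as on the slide:

```python
known = [1, 5, 10, 4]      # observed values
guess = 3.0                # initial estimate for the two missing items
threshold = 0.05           # convergence criterion from the slide

while True:
    # E-step (simplified): treat the current guess as the value of both missing items.
    # M-step: re-estimate the mean over all six items.
    new_guess = (sum(known) + 2 * guess) / 6
    converged = abs(new_guess - guess) <= threshold
    guess = new_guess
    if converged:
        break

# ~4.98 with exact arithmetic; the slide, which rounds intermediate estimates, reports 4.97.
print(round(guess, 2))
```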
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– Noise: random error or variance in a measured variable
– e.g., Salary=“−10” (an error)
• Typographical errors are errors that corrupt data
– e.g., ‘green’ is written as ‘rgeen’
• Incorrect attribute values may be due to
– faulty data collection instruments (e.g., OCR)
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
Data Cleaning: How to catch Noisy Data
• Manually check all data: tedious and usually infeasible
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data
• Use numerical constraints to catch corrupt data, e.g.:
• Weight can’t be negative
• People can’t have more than 2 parents
• Salary can’t be less than Birr 300
• Use statistical techniques to catch corrupt data
– Check for outliers (the case of the 8-metre man)
– Check for correlated outliers using n-grams (“pregnant male”):
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
(A small sketch of the frequency- and constraint-based checks follows below.)
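A minimal sketch of the frequency- and constraint-based checks above, assuming pandas; the table, column names and thresholds are purely hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["green", "green", "rgeen", "blue", "green"],
    "weight": [62.0, -5.0, 71.5, 80.2, 68.4],
    "salary": [4500, 150, 9000, 3200, 5600],
})

# Sort categorical values by frequency: rare spellings such as 'rgeen' stand out.
print(df["colour"].value_counts())

# Numerical constraints: flag rows that violate simple domain rules.
bad_weight = df[df["weight"] < 0]      # weight can't be negative
bad_salary = df[df["salary"] < 300]    # salary can't be less than Birr 300
print(bad_weight)
print(bad_salary)
```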
Data Integration
• Data integration
– combines data from multiple sources (database, data
warehouse, files & sometimes from non-electronic
sources) into a coherent store
– Because of the use of different sources, data that
is fine on its own may become problematic
when we want to integrate it.
• Some of the issues are:
– Different formats and structures
– Conflicting and redundant data
– Data at different levels
Data Integration: Formats
• Not everyone uses the same format. Do you agree?
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources: e.g., A.cust-id  B.cust-#
• Dates are especially problematic:
– 12/19/97
– 19/12/97
– 19/12/1997
– 19-12-97
– Dec 19, 1997
– 19 December 1997
– 19th Dec. 1997
• Are you frequently writing money as:
– Birr 200, Br. 200, 200 Birr, …
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or
names, which is also a problem of lack of
standardization / naming conventions, e.g.,
– Age=“26” vs. Birthday=“03/07/1986”
– Some use “1,2,3” for rating; others “A, B, C”
• Discrepancy between duplicate records

ID  Name                        City         State
1   Ministry of Transportation  Addis Ababa  Addis Ababa region
2   Ministry of Finance         Addis Ababa  Addis Ababa administration
3   Office of Foreign Affairs   Addis Ababa  Addis Ababa regional administration
Data Integration: Inconsistent

Attribute name   Current values                       New value
Job status       “no work”, “job less”, “Jobless”     Unemployed
Marital status   “not married”, “single”              Unmarried
Education level  “uneducated”, “no education level”   Illiterate
Data Integration: different structure
What’s wrong here? Different column orders and no data type constraints (numeric vs. string IDs)

ID     Name                        City         State
1234   Ministry of Transportation  Addis Ababa  AA

ID     Name                        City         State
GCR34  Ministry of Finance         Addis Ababa  AA

Name                       ID     City         State
Office of Foreign Affairs  GCR34  Addis Ababa  AA
Data Integration: Data that Moves
• Be careful of taking snapshots of a moving target
• Example: Let’s say you want to store the price of a shoe
in France, and the price of a shoe in Italy. Can we use
same currency (say, US$) or country’s currency?
– You can’t store it all in the same currency (say, US$) because
the exchange rate changes frequently
– Price in foreign currency stays the same
– Must keep the data in foreign currency and use the current
exchange rate to convert
• The same needs to be done for ‘Age’
– It is better to store ‘Date of Birth’ than ‘Age’

Data at different level of detail than needed
• If it is at a finer level of detail, you can sometimes bin
it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate categories
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore the age ranges because you aren’t sure, or
• Make an educated guess based on the imported data (e.g.,
assume the # of people aged 25–35 is the average of the #
of people aged 20–30 and 30–40)
Data Integration: Conflicting Data
• Detecting and resolving data value conflicts
–For the same real world entity, attribute values from
different sources are different
–Possible reasons: different representations, different
scales, e.g., American vs. British units
• weight measurement: KG or pound
• Height measurement: meter or inch
• Information source #1 says that Alex lives in Bahirdar
– Information source #2 says that Alex lives in Mekele
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of information
– Use the “most trusted” information
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)

Handling Redundant Data
• Redundant data occur often when integration of multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and inconsistencies
and improve analytics speed and quality
Covariance
• Covariance is similar to correlation:

  Cov(p, q) = (1/n) Σ_{i=1..n} (p_i − p̄)(q_i − q̄)

  where n is the number of tuples, p̄ and q̄ are the respective means
  of p and q, and σp and σq are the respective standard deviations
  of p and q.
• It can be simplified in computation as

  Cov(p, q) = E(p·q) − p̄·q̄

• Positive covariance: if Cov(p,q) > 0, then p and q tend to be
directly related.
• Negative covariance: if Cov(p,q) < 0, then p and q are inversely
related.
• Independence: Cov(p,q) = 0
Example: Covariance
• Suppose two stocks A and B have the following values in
one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
– Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
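The same computation in a few lines of NumPy, using the slide's simplified (population) covariance formula:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov = (A * B).mean() - A.mean() * B.mean()
print(A.mean(), B.mean(), cov)   # means 4 and 9.6, covariance 4 -> positive, they rise together
```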
Data Reduction Strategies
• Why data reduction?
– A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction: obtains a reduced representation of the
data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results
• Data reduction strategies
– Dimensionality reduction,
• Select best attributes or remove unimportant attributes
– Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
– Data compression
• A technology that reduces the size of large files so that the
smaller files take less memory space and are faster to transfer
over a network or the Internet
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
• Dimensionality reduction
– Helps to eliminate Irrelevant attributes and reduce noise: that
contain no information useful for the data analytics task at hand
• E.g., is students' ID relevant to predict students' GPA?
– Helps to avoid redundant attributes : that contain duplicate
information in one or more other attributes
• E.g., purchase price of a product & the amount of sales tax paid
– Reduce time and space required in data analytics
– Allow easier visualization
• Method: attribute subset selection
– One way to reduce the dimensionality of data is by selecting the
best attributes
– Given M attributes, there are 2^M possible attribute combinations
Heuristic Search in Attribute Selection
Commonly used heuristic attribute selection methods:
– Best step-wise attribute selection:
• Start with empty set of attributes, {}
• The best single-attribute is picked first, {Ai}
• Then combine best attribute with the remaining to select the best combined two
attributes, {AiAj}, then three attributes {AiAjAk},…
• The process continues until the performance of the combined attributes starts to
decline
– Example: Given ABCDE attributes, we can start with {}, and then compare and select
attribute with best accuracy, say {B}. Then combine it with others, {[BA][BC][BD][BE]}
& compare and select those with best accuracy, say {BD}, then combine with the
rest, {[BDA][BDC][BDE]}, select those with best accuracy or ignore if accuracy start
decreasing
– Step-wise attribute elimination:
• Start with all attributes as best
• Eliminate one of the worst performing attribute
• Repeatedly continue the process if the performance of the combined attributes
increases
– Example: Given attributes ABCDE, we can start with {ABCDE}, then compare the accuracy of
the (n-1)-attribute subsets {[ABCD][ABCE][ABDE][ACDE][BCDE]}. If {ABCE} performs best,
ignore attribute {D}. Again compare the accuracy of the (n-2)-attribute subsets
{[ABC][ABE][ACE][BCE]}, and based on accuracy ignore the attribute that hurts accuracy.
– Best combined attribute selection and elimination (see the sketch below)
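A minimal sketch of best step-wise (forward) attribute selection as described above, assuming scikit-learn is available; the bundled Iris data and a decision tree are used only as stand-ins for the dataset and the evaluator:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []                         # best subset found so far
best_score = 0.0

while remaining:
    # Try adding each remaining attribute to the current subset; keep the best addition.
    scores = {a: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    attr, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:           # stop when accuracy no longer improves
        break
    selected.append(attr)
    remaining.remove(attr)
    best_score = score

print(selected, round(best_score, 3))
```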
Data Reduction: Numerosity Reduction
• Different methods can be used, including Clustering and sampling
• Clustering
– Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
– There are many choices of clustering definitions and clustering
algorithms
• Sampling
– Obtain a small sample s to represent the whole data set N
– Allow a analytics algorithm to run in complexity that is potentially sub-
linear to the size of the data
– Key principle: Choose a representative subset of the data using
suitable sampling technique
• A sample may be smaller than, equal to, or larger than a class population. For instance:
– Total instances of class 1 = 10,000 and class 2 = 1,000. Suppose a sample of 5,000
records is required, with 3,000 from class 1 and 2,000 from class 2. How do we
select the samples? Apply sampling with and without replacement (class 2 needs
sampling with replacement, since 2,000 > 1,000 available instances).
Types of Sampling
• Stratified sampling:
– Develop adaptive sampling methods, e.g., stratified sampling;
which partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
– Used in conjunction with skewed data

• Simple random sampling


– There is an equal probability of selecting any particular item
– Simple random sampling may have very poor performance in
the presence of skew
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
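A minimal sketch of these sampling variants with pandas, reusing the hypothetical 10,000/1,000 class example from the previous slide; the class names and sizes are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"cls": ["c1"] * 10000 + ["c2"] * 1000,
                   "x":   range(11000)})

# Simple random sampling without replacement: each object can be chosen only once.
srs = df.sample(n=5000, replace=False, random_state=1)

# Sampling with replacement: a selected object stays in the population.
srswr = df.sample(n=5000, replace=True, random_state=1)

# Stratified sampling: draw a fixed quota per class
# (class c2 must be sampled with replacement, since 2,000 > 1,000 available).
quota = {"c1": (3000, False), "c2": (2000, True)}
strata = [df[df["cls"] == c].sample(n=n, replace=repl, random_state=1)
          for c, (n, repl) in quota.items()]
stratified = pd.concat(strata)
print(stratified["cls"].value_counts())
```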
Simple Random Sampling: With or Without Replacement

(Figure: raw data sampled two ways: SRS, simple random sampling without replacement, and SRSWR, simple random sampling with replacement.)
Sampling: Cluster or Stratified Sampling

(Figure: raw data partitioned into a cluster/stratified sample.)
Data Transformation
• A function that maps the entire set of values of
a given attribute to a new set of replacement
values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization
– Discretization
‐ Generalization: Concept hierarchy climbing

Data Transformation: Normalization
• Normalization: values are scaled to fall within a smaller, specified range
– min-max normalization
– z-score normalization
• Min-max normalization:

  v' = ((v − minA) / (maxA − minA)) × (newMax − newMin) + newMin

– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then
  $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):

  v' = (v − μA) / σA

– Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
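A minimal sketch of both normalizations, reproducing the worked numbers above (NumPy assumed; the income values are the slide's example):

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min
print(minmax)                   # 73,600 maps to about 0.716

# Z-score normalization with the slide's parameters (mu = 54,000, sigma = 16,000)
mu, sigma = 54_000, 16_000
print((73_600 - mu) / sigma)    # 1.225
```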


Simple Discretization: Binning
• Equal-width (distance) partitioning
–Divides the range into N intervals of equal size (uniform grid)
–if A and B are the lowest and highest values of the attribute,
the width of intervals for N bins will be:
W = (B –A)/N.
–This is the most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


–Divides the range into N bins, each containing approximately
same number of samples
–Good data scaling
–Managing categorical attributes can be tricky
Binning into Ranges
• Given the following AGE attribute values for 9 instances:
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Rearrange the data in increasing order if not already sorted
• Equi-width binning for a bin width of e.g. 10 (−∞ denotes negative infinity, +∞ positive infinity):
– Bin 1: 0, 4             [−∞, 10) bin
– Bin 2: 12, 16, 16, 18   [10, 20) bin
– Bin 3: 24, 26, 28       [20, +∞) bin
• Equi-frequency binning for a bin density of e.g. 3:
– Bin 1: 0, 4, 12         [−∞, 14) bin
– Bin 2: 16, 16, 18       [14, 21) bin
– Bin 3: 24, 26, 28       [21, +∞] bin
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
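A small NumPy sketch of equal-frequency partitioning and the two smoothing variants, reproducing the bins above:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
means = bins.mean(axis=1, keepdims=True)
print(np.round(np.repeat(means, 4, axis=1)))   # bin means round to 9, 23 and 29

# Smoothing by bin boundaries: each value moves to the closer of the bin's min/max
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
smoothed = np.where(bins - lo <= hi - bins, lo, hi)
print(smoothed)   # rows: 4,4,4,15 / 21,21,25,25 / 26,26,26,34
```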
Data Transformation: Concept Hierarchy Generation
• Generalization: concept hierarchy climbing, which organizes concepts
(i.e., attribute values) hierarchically and is usually associated with
each dimension in a data warehouse
– Concept hierarchy formation: recursively reduce the data by
collecting and replacing low-level concepts (such as numeric values
for age) by higher-level concepts (such as child, youth, adult, or
senior)
(Figure: an example location hierarchy: Kebele < sub-city < city < region or state < country)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• A concept hierarchy can also be formed automatically by analysing
the number of distinct values, e.g., for the set of attributes
{Kebele, city, state, country}
• For numeric data, use discretization methods.
Data sets preparation for learning
• A standard machine learning technique is to divide the
dataset into a training set and a test set.
– Training dataset is used for learning the parameters of
the model in order to produce hypotheses.
• A training set is a set of problem instances (described as a set
of properties and their values), together with a classification
of the instance.
– Test dataset, which is never seen during the
hypothesis forming stage, is used to get a final,
unbiased estimate of how well the model works.
• Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
• A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
Classification: Train, Validation, Test split

(Figure: the training set, with known class labels, goes to the model builder; the builder's predictions on a held-out test dataset are compared with the known labels to evaluate the model; a separate final test set gives the final evaluation of the final model.)
Divide the dataset into training & test
• There are various ways in which to separate the data
into training and test sets
– There are established ways to use the two sets to
assess the effectiveness and the predictive/descriptive
accuracy of a machine learning technique over unseen
examples:
– The holdout method
• Repeated holdout method
– Cross-validation
– The bootstrap
The holdout method
• The holdout method reserves a certain amount of data for
testing and uses the remainder for training
– Usually: one third for testing, the rest for training
• For small or “unbalanced” datasets, samples might not be
representative
– Few or no instances of some classes
• Stratified sampling: an advanced version that balances the data
– Make sure that each class is represented with approximately
equal proportions in both subsets
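A minimal holdout sketch, assuming scikit-learn; the Iris data here is only a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out one third for testing; stratify so every class keeps its proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```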
Cross-validation
• Cross-validation works as follows:
– First step: data is split into k subsets of equal-sized sets
randomly. A partition of a set is a collection of subsets for
which the intersection of any pair of sets is empty. That is, no
element of one subset is an element of another subset in a
partition.
– Second step: each subset in turn is used for testing and the
remainder for training
• This is called k-fold cross-validation
– Often the subsets are stratified before the cross-validation is
performed
• The error estimates are averaged to yield an overall error
estimate
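A minimal stratified k-fold cross-validation sketch, again assuming scikit-learn and using Iris and a decision tree purely as placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold stratified cross-validation: each fold is used once for testing,
# and the per-fold estimates are averaged into one overall estimate.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```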
Cross-validation example:
– Break up the data into groups of the same size
– Hold aside one group for testing and use the rest to build the model
– Repeat, holding out each group in turn
DATA MINING: a step in the process of Data Analytics
• Data analytics deals with every step in the process of building a
data-driven model, including data mining
• Data mining is therefore a step in the process of data
analytics for
– Predictive modeling
– Descriptive modeling
What is Data Mining?
• DM is the process of discovery of useful and
hidden patterns in large quantities of data
using machine learning algorithms
– It is concerned with the non-trivial extraction of
implicit, previously unknown and potentially useful
information and knowledge from data
– It discovers meaningful patterns that are valid,
novel, useful and understandable.
• The major task of data mining includes:
– Classification
– Clustering
– Association rule discovery
Data Mining Main Tasks
• Prediction Methods: create a model to predict the class of unknown
or new instances.
• Description Methods: construct a model that can describe the
existing data.
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about
values of data using known results found from different
historical data
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable Y given a set of other variables X.
Here X could be an n-dimensional vector
– In effect this is a function approximation through learning the
relationship between Y and X
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, less
emphasis on understanding the model
Prediction Problems:
Classification vs. Numeric Prediction

• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data

• Numeric Prediction
– models continuous-valued functions, and predicts
unknown or missing values

Predictive Modeling: Customer Scoring
• Goal: To predict whether a customer is a high risk
customer or not.
– Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a
function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc
– Many, many applications of this nature
Classification

• Example: Credit scoring


– Differentiating between low-risk and high-risk customers from
their income and savings
Discriminant rule: IF income > θ1 AND savings > θ2
THEN low-risk
ELSE high-risk
Predictive Modeling: Fraud Detection
• Goal: Predict fraudulent cases in credit card
transactions.
– Credit card losses in the US are over 1 billion $ per year
– Roughly 1 in 50 transactions are fraudulent
• Approach:
– Use credit card transactions and the information on its
account-holder as attributes.
• When does a customer buy, what does he buy, how often
he pays on time, etc
– Label past transactions as fraud or fair transactions. This
forms the class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card
transactions on an account.
DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the underlying
observation
– e.g., a model that could simulate the data if needed
(Figure: EM clustering, iteration 25, of red blood cell volume vs. red blood cell hemoglobin concentration.)
• A descriptive model identifies patterns or relationships in data
– Unlike the predictive model, a descriptive model serves as a way
to explore the properties of the data examined, not to predict new
properties
• Description methods find human-interpretable patterns that
describe and find natural groupings of the data.
• Methods used in descriptive modeling are: clustering,
summarization, association rule discovery, etc.
Example of Descriptive Modeling
• Goal: learn directed relationships among p variables
• Techniques: directed (causal) graphs
• Challenge: distinguishing between correlation and causation
– Example: Do yellow fingers cause lung cancer?
(Figure: a direct edge “yellow fingers → cancer?” versus a hidden common cause: smoking → yellow fingers and smoking → cancer.)
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns
(sequential patterns) in the data rather than to
characterize the data globally
– Also called link analysis (uncovers relationships among data)

• Given market basket data we might discover that


– If customers buy wine and bread then they buy cheese with
probability 0.9

• Methods used in pattern discovery include:


– Association rules, Sequence discovery, etc.

Basic Data Mining algorithms
• Classification: which is also called Supervised learning,
maps data into predefined groups or classes to enhance the
prediction process
• Clustering: which is also called Unsupervised learning,
groups similar data together into clusters.
– is used to find appropriate groupings of elements for a set of data.
– Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by bottom-up approach.
• Association Rule discovery: also known as market-basket
analysis
– It discovers interesting associations between attributes contained in
a database.
– Based on the frequency of co-occurrence of items in events,
an association rule tells us: if item X is part of the event, what is the
likelihood that item Y is also part of the event (see the sketch below).
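A tiny sketch of the support/confidence idea behind such rules; the transactions are made up for illustration, and a real system would use Apriori or a pattern-growth algorithm rather than this brute-force count:

```python
# Estimate support and confidence for the rule {wine, bread} -> {cheese}
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "milk"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

X, Y = {"wine", "bread"}, {"cheese"}
n = len(transactions)
support_X = sum(X <= t for t in transactions) / n            # P(X)
support_XY = sum((X | Y) <= t for t in transactions) / n     # P(X and Y)
confidence = support_XY / support_X                          # P(Y | X)
print(support_XY, confidence)
```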
Classification
Classification is a data mining (machine
learning) technique used to predict group
membership of new data instances.

OVERVIEW OF CLASSIFICATION
• Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the
class.
– construct a model for class attribute as a function of the
values of other attributes.
– Given data D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm},
the classification problem is to define a mapping f: D → C
where each ti is assigned to one class.
• Goal: previously unseen records should be assigned a class
as accurately as possible. A test set is used to determine
the accuracy of the model.
– Usually, the given data set is divided into training and test
sets, with the training set used to build the model and the
test set used to validate it.
Classification Examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Predict whether the weather on a particular day will
be “sunny”, “rainy” or “cloudy”.
• Identify individuals with credit risks.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Document classification into the predefined classes,
such as politics, sport, social, economy, law, etc.

CLASSIFICATION: A TWO-STEP PROCESS
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Illustrating Classification Task

Training Set (fed to the learning algorithm, which induces / learns a model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (the learned model is applied to deduce the unknown class labels):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
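A minimal sketch of this induction/deduction loop on the toy table above, assuming pandas and scikit-learn; Attrib3 is written in thousands, and the one-hot encoding is just one possible way to feed categorical attributes to a learner:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set from the slide (Attrib1-3 plus the known Class label).
train = pd.DataFrame(
    [("Yes","Large",125,"No"), ("No","Medium",100,"No"), ("No","Small",70,"No"),
     ("Yes","Medium",120,"No"), ("No","Large",95,"Yes"), ("No","Medium",60,"No"),
     ("Yes","Large",220,"No"),  ("No","Small",85,"Yes"), ("No","Medium",75,"No"),
     ("No","Small",90,"Yes")],
    columns=["Attrib1", "Attrib2", "Attrib3", "Class"])

# Test set whose class labels are unknown ("?") on the slide.
test = pd.DataFrame(
    [("No","Small",55), ("Yes","Medium",80), ("Yes","Large",110),
     ("No","Small",95), ("No","Large",67)],
    columns=["Attrib1", "Attrib2", "Attrib3"])

# Induction: one-hot encode the categorical attributes and learn a model.
X_train = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["Class"])

# Deduction: apply the model to the test records (align columns with the training encoding).
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
print(model.predict(X_test))   # predicted class labels for records 11-15
```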
Confusion Matrix & Performance Evaluation

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes      a (TP)      b (FN)
CLASS    Class=No       c (FP)      d (TN)

• The most widely-used metric is the Accuracy of the system:

  Accuracy = (a + d) / (a + b + c + d) × 100 = (TP + TN) / (TP + TN + FP + FN) × 100

• Other metrics for performance evaluation are Precision,
Recall & F-Measure
Confusion matrix & model effectiveness

Outlook   Temp  Windy  Predicted class  Real class
overcast  mild  yes    YES              YES
rainy     mild  no     YES              YES
rainy     cool  yes    YES              NO
sunny     mild  no     NO               YES
sunny     cool  no     NO               NO
sunny     hot   no     NO               YES
sunny     hot   yes    NO               NO
rainy     hot   yes    NO               NO

Confusion matrix (true class across the top; totals: 4 Yes, 4 No):
PREDICTED YES:  2 (TP)   1 (FP)
PREDICTED NO:   2 (FN)   3 (TN)

Compute accuracy, recall and precision:
• Accuracy = (2+3)/8 = 5/8 ≈ 62%
                          Yes                              No
Recall                    2/4                              3/4
Precision                 2/3                              3/5
F-score = (2·P·R)/(P+R)   (2·(2/4)·(2/3))/((2/4)+(2/3))    (2·(3/4)·(3/5))/((3/4)+(3/5))
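The same metrics computed directly from the eight predicted/actual pairs above (plain Python, no libraries assumed):

```python
predicted = ["YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO"]
actual    = ["YES", "YES", "NO",  "YES", "NO", "YES", "NO", "NO"]

tp = sum(p == "YES" and a == "YES" for p, a in zip(predicted, actual))   # 2
fp = sum(p == "YES" and a == "NO"  for p, a in zip(predicted, actual))   # 1
fn = sum(p == "NO"  and a == "YES" for p, a in zip(predicted, actual))   # 2
tn = sum(p == "NO"  and a == "NO"  for p, a in zip(predicted, actual))   # 3

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 5/8 = 0.625
precision = tp / (tp + fp)                    # 2/3, for the YES class
recall    = tp / (tp + fn)                    # 2/4, for the YES class
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, round(f1, 3))
```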


Classification methods
• Goal: Predict class Ci = f(x1, x2, .. xn)
• There are various classification methods. Popular
classification techniques include the following.
– K-nearest neighbor
– Decision tree classifier: divide decision space into
piecewise constant regions.
– Neural networks: partition by non-linear boundaries
– Bayesian network: a probabilistic model
– Support vector machine

Classification methods
K-Nearest Neighbors
• K-nearest neighbor is a supervised learning algorithm
where the result of a new instance query is classified
based on the majority category of its K nearest neighbors.
• The purpose of this algorithm is to classify a new object
based on its attributes and the training samples (xi, f(xi)),
i = 1..N.
• Given a query point, we find K number of objects or
(training points) closest to the query point.
– The classification is using majority vote among the
classification of the K objects.
– K Nearest neighbor algorithm used neighborhood classification
as the prediction value of the new query instance.
• K nearest neighbor algorithm is very simple. It works
based on minimum distance from the query instance to
the training samples to determine the K-nearest
neighbors.
How to compute K-Nearest Neighbor (KNN)
Algorithm?
• Determine parameter K = number of nearest neighbors
• Calculate the distance between the query-instance and
all the training samples
– we can use Euclidean distance
• Sort the distance and determine nearest neighbors
based on the Kth minimum distance
• Gather the category of the nearest neighbors
• Use simple majority of the category of nearest
neighbors as the prediction value of the query instance
– Any ties can be broken at random with reason.
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
• Setting the variable K (number of nearest neighbors)
– The number of nearest neighbors (K) should be chosen based on
cross validation over a number of K settings.
– k=1 gives a good baseline model to benchmark against.
– A good rule of thumb is that K should be less than or equal to the
square root of the total number of training patterns.
• Setting the type of distance metric
– We need a measure of distance in order to know who the
neighbours are.
– Assume that we have T attributes for the learning problem. Then
one example point x has elements x_t, t = 1, …, T.
– The distance between two points X and Y is often defined as the
Euclidean distance:

  Dist(X, Y) = sqrt( Σ_{i=1..T} (X_i − Y_i)² )
Example
• We have data from a questionnaire survey (asking people's opinion) and objective
testing, with two attributes (acid durability & strength), to classify whether a
special paper tissue is good or not. Here are the four training samples:

X1 = Acid Durability (seconds)  X2 = Strength (kg/m2)  Y = Classification
7                               7                      Bad
7                               4                      Bad
3                               4                      Good
1                               4                      Good

• Now the factory produces a new paper tissue that passes the laboratory test with
X1 = 3 and X2 = 7.
– Without undertaking another expensive survey, guess the goodness of the
new tissue. Use Euclidean distance for the similarity measurement.
K = sqrt(4) = 2
dis(TD1, NP) = sqrt(16+0) = sqrt(16) = 4
dis(TD2, NP) = sqrt(16+9) = sqrt(25) = 5
dis(TD3, NP) = sqrt(0+9)  = sqrt(9)  = 3
dis(TD4, NP) = sqrt(4+9)  = sqrt(13) = 3.6
Ranking: 1. TD3 (Good)  2. TD4 (Good)  3. TD1  4. TD2
– The two nearest neighbours are both Good, so the new product is classified as Good.
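A minimal KNN sketch reproducing the paper-tissue example above with k = 2:

```python
import math
from collections import Counter

# Training samples: (acid durability, strength) -> classification
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 2   # sqrt(4) = 2, as on the slide

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank the training points by distance to the query and take the k nearest.
neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
votes = Counter(label for _, label in neighbours)
print(neighbours)                   # TD3 (distance 3) and TD4 (distance ~3.6), both 'Good'
print(votes.most_common(1)[0][0])   # -> 'Good'
```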
KNNs: advantages & Disadvantages
• Advantage
– Simple
– Powerful
– Requires no training time
– Nonparametric architecture
• Disadvantage: Difficulties with k-nearest neighbour
algorithms
– Memory intensive: just store the training examples
• when a test example is given then find the closest matches
– Classification/estimation is slow
– Have to calculate the distance of the test case from all training
cases
– There may be irrelevant attributes amongst the attributes –
curse of dimensionality
Decision Tree
Decision Trees
• Decision tree constructs a tree where internal
nodes are simple decision rules on one or more
attributes and leaf nodes are predicted class labels.
Given an instance of an object or situation, which is specified by a
set of properties, the tree returns a "yes" or "no" decision about
that instance.
(Figure: an example tree. The root tests Attribute_1: value-2 leads directly
to the leaf Class1, while value-1 and value-3 lead to further tests on
Attribute_2, whose branches (value-4 … value-7) end in the leaf classes
Class1 or Class2.)
Choosing the Splitting Attribute
• At each node, the best attribute is selected for splitting the
training examples using a Goodness function
– The best attribute is the one that separate the classes of the
training examples faster such that it results in the smallest tree
• Typical goodness functions:
– information gain, information gain ratio, and GINI index

• Information Gain
– Select the attribute with the highest information gain, i.e. the one
that creates the smallest average disorder
• First, compute the disorder using Entropy: the expected
information needed to classify objects into classes
• Second, measure the Information Gain: by how much the
disorder of a set would be reduced by knowing the value
of a particular attribute.
Entropy
• The Entropy measures the disorder of a set S containing a
total of n examples, of which n+ are positive and n- are
negative, and it is given by:

  Entropy(S) = D(n+, n-) = −(n+/n) log2(n+/n) − (n-/n) log2(n-/n)

• Some useful properties of the Entropy:
– D(n,m) = D(m,n)
– D(0,m) = D(m,0) = 0
  D(S)=0 means that all the examples in S have the same
  class
– D(m,m) = 1
  D(S)=1 means that half the examples in S are of one class
  and half are in the opposite class
Information Gain
• The Information Gain measures the expected reduction in
entropy due to splitting on an attribute A:

  GAIN_split = Entropy(S) − Σ_{i=1..k} (n_i / n) × Entropy(i)

  where the parent node S is split into k partitions and n_i is the
  number of records in partition i

• Information Gain measures the reduction in entropy
achieved because of the split. Choose the split that
achieves the most reduction (maximizes GAIN).
Example 1: The problem of “Sunburn”
• You want to predict whether another person is likely to get
sunburned if he is back to the beach. How can you do this?
• Data Collected: predict based on the observed properties of the
people
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Kate Blonde Short Light Yes None
Attribute Selection by Information Gain to construct the
optimal decision tree
• Entropy: the disorder of the sunburned data
D({Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie})
= D(3+, 5−) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) = 0.954
Which attribute minimises the disorder?
Test Average Disorder of attributes
Hair 0.50
height 0.69
weight 0.94
lotion 0.61
• Which decision variable maximises the Info Gain then?
• Remember it’s the one which minimises the average disorder.
 Gain(hair) = 0.954 - 0.50 = 0.454
 Gain(height) = 0.954 - 0.69 =0.264
 Gain(weight) = 0.954 - 0.94 =0.014
 Gain (lotion) = 0.954 - 0.61 =0.344
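A small Python sketch (illustrative only) that reproduces the dataset entropy of 0.954 and the per-attribute information gains from the sunburn table above:

```python
from math import log2
from collections import Counter

# (hair, height, weight, lotion, result) for the eight people
data = [
    ("Blonde", "Average", "Light",   "No",  "Sunburned"),
    ("Blonde", "Tall",    "Average", "Yes", "None"),
    ("Brown",  "Short",   "Average", "Yes", "None"),
    ("Blonde", "Short",   "Average", "No",  "Sunburned"),
    ("Red",    "Average", "Heavy",   "No",  "Sunburned"),
    ("Brown",  "Tall",    "Heavy",   "No",  "None"),
    ("Brown",  "Average", "Heavy",   "No",  "None"),
    ("Blonde", "Short",   "Light",   "Yes", "None"),
]
ATTRS = {"hair": 0, "height": 1, "weight": 2, "lotion": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, attr_idx):
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr_idx] for r in rows):
        subset = [r for r in rows if r[attr_idx] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(rows) - remainder

print(round(entropy(data), 3))                   # 0.954
for name, idx in ATTRS.items():
    print(name, round(info_gain(data, idx), 3))  # hair has the highest gain
```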
The best decision tree?
After splitting on Hair colour (is_sunburned):
– red → Sunburned (Emily)
– brown → None (Alex, Pete, John)
– blonde → still mixed (2,2): Sunburned = Sarah, Annie; None = Dana, Katie
• Once we have finished with hair colour we then need to calculate the
remaining branches of the decision tree. Which attributes is better to classify
the remaining ?
• E(2,2) = 1 (the blonde branch is an even mix)
• IG(lotion) = E(before split) − E(after split)
= 1 − (2/4·E(0,2) + 2/4·E(2,0)) = 1 − (2/4·0 + 2/4·0) = 1
• IG(weight) = 1 − (weighted entropies of the heavy/light/average subsets)
= 1 − (0 + 2/4·E(1,1) + 2/4·E(1,1)) = 1 − (2/4·1 + 2/4·1) = 0
So Lotion used is the best attribute for splitting the blonde branch.
The best Decision Tree
• This is the simplest and optimal one possible and it makes a lot of
sense.
• It classifies 4 of the people on just the hair colour alone.
is_sunburned: the final tree
– Hair colour = red → Sunburned (Emily)
– Hair colour = brown → None (Alex, Pete, John)
– Hair colour = blonde → test Lotion used:
• no → Sunburned (Sarah, Annie)
• yes → None (Dana, Katie)
Sunburn sufferers are ...
• You can view a Decision Tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn.

If (hair-colour = "red") then
return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
return (sunburned = yes)
else
return (sunburned = no)
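The same rule set written as a small Python function (a direct transcription of the IF-THEN-ELSE above):

```python
def is_sunburned(hair_colour: str, lotion_used: str) -> bool:
    """Return True if the learned tree predicts sunburn."""
    if hair_colour == "red":
        return True
    if hair_colour == "blonde" and lotion_used == "no":
        return True
    return False

print(is_sunburned("blonde", "no"))  # True  (Sarah, Annie)
print(is_sunburned("brown", "no"))   # False (Pete, John)
```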
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification
methods)
• Convertible to simple and easy to understand classification if-
then-else rules
•Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution,
works well on noisy data.
Pros:
+ Reasonable training time
+ Easy to understand & interpret
+ Easy to generate rules & implement
+ Can handle a large number of features
Cons:
− Cannot handle complicated relationships between features
− Simple decision boundaries
− Problems with lots of missing data
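For comparison, a hedged sketch of inducing a tree over the same sunburn data with scikit-learn (assuming pandas and scikit-learn are installed; the categorical attributes are one-hot encoded because sklearn trees expect numeric inputs):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(
    [["Blonde", "Average", "Light",   "No",  "Sunburned"],
     ["Blonde", "Tall",    "Average", "Yes", "None"],
     ["Brown",  "Short",   "Average", "Yes", "None"],
     ["Blonde", "Short",   "Average", "No",  "Sunburned"],
     ["Red",    "Average", "Heavy",   "No",  "Sunburned"],
     ["Brown",  "Tall",    "Heavy",   "No",  "None"],
     ["Brown",  "Average", "Heavy",   "No",  "None"],
     ["Blonde", "Short",   "Light",   "Yes", "None"]],
    columns=["hair", "height", "weight", "lotion", "result"],
)
X = pd.get_dummies(df[["hair", "height", "weight", "lotion"]])
y = df["result"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # prints the learned if-then rules
```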
Regression analysis
Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.
What is Regression Analysis?
• Regression analysis is a form of predictive modelling
technique which investigates the relationship between
a dependent (target) and independent variable
(s) (predictor).
• This technique is used for forecasting, time series
modelling and finding the causal effect
relationship between the variables.
– For example, relationship between rash driving and number of
road accidents by a driver is best studied through regression.
• Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve/line to the data points in such a manner that the distances of the data points from the curve or line are minimized.
Use of Regression analysis
• Regression analysis is used to:
– Predict the value of a dependent variable based on the value
of at least one independent variable
– Explain the impact of changes in an independent variable on
the dependent variable
– Dependent variable: the variable we wish to explain
– Independent variable: the variable used to explain the
dependent variable
• In regression, one or more variables are considered the independent (= predictor) variable(s) X, and the other the dependent (= outcome) variable Y.
Types of regression analysis
• There are various kinds of regression techniques
available to make predictions.
• These techniques are mostly driven by three
metrics (number of independent variables, type of
dependent variables and shape of regression line).
Types of regression analysis
• Shape of regression line
– Regression analysis can result in linear or nonlinear graphs.
– A linear regression is where the relationships between your variables can
be described with a straight line. Non-linear regressions produce curved
lines.
• Type of dependent variable
– Logistic regression is one of the types of regression analysis technique,
which gets used when the dependent variable is binary (0 or 1, true or
false, yes or No)
• Number of independent variables
– Linear regression is a way to model the relationship between two
variables.
– In the same way that linear regression produces an equation that uses
values of X to predict values of Y,
• multiple regression produces an equation that uses two or more different
variables (X1, …. Xn) to predict values of Y.
– For two predictor variables, the general form of the multiple regression
equation is:
Y = b1·X1 + b2·X2 + b0
Overview of Linear Regression Model
• Linear Regression Model
– Only one independent variable, x
– Relationship between x and y is described by a linear function
– Changes in y are assumed to be caused by changes in x
• Linear regression equation: Y=mX+B, where m and B are
constants.
– The value of m is called the slope constant and determines the
direction and degree to which the line is tilted.
– The value of B is called the Y-intercept and determines the point
where the line crosses the Y-axis.
– A slope of 2 means that every 1-unit change in X yields a 2-unit change
in Y.
(Figure: a regression line with slope m and Y-intercept B.)
Prediction
• Regression analysis is used to find equations
that fit data. Once we have the regression
equation, we can use the model to make
predictions
• If you know something about X, this
knowledge helps you predict something about
Y using the constructed regression equation…
– Expected value of y at a given level of x: E(y|x) = mX + B
Computing regression line
• The y-intercept of the regression line is B and
the slope is m. The following formulas give
the y-intercept and the slope of the equation.
m = (n·∑XY − ∑X·∑Y) / (n·∑X² − (∑X)²)        B = (∑Y − m·∑X) / n

where n is the number of data points and:
• ∑X = the sum of all the values in the x column
• ∑Y = the sum of all the values in the y column
• ∑X² = the sum of squared values in the x column
• ∑Y² = the sum of squared values in the y column
• ∑XY = the sum of the products of x and y at each point
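A minimal Python sketch of these least-squares formulas (the demand/price numbers below are made up for illustration; they are not the textbook data):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + B from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    B = (sum_y - m * sum_x) / n
    return m, B

# Hypothetical demand (x) and price (y) observations
demand = [10, 12, 15, 18, 22]
price = [16, 18, 21, 24, 28]
m, B = fit_line(demand, price)
print(m, B)                               # 1.0 6.0 for these made-up numbers
print("Price at demand 20:", m * 20 + B)  # 26.0
```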
Solved Example Problems
• Calculate the regression coefficient and obtain the
lines of regression for the following data, where x is
the demand for a product and y is the price of the
product
• Estimate the likely price when the demand is Birr 20.

Solution (using the sums from the data table):
m = 0.93 and B = 7.29
The regression equation is therefore Y = 0.93X + 7.29.
For X = 20, the estimated price is Y = 0.93(20) + 7.29 = 25.89.
Exercise
• Find (a) the regression equation, (b) the most likely marks in Statistics when the marks in Economics are 30.
• From the data given below, the heights (in cm) of a group of fathers and sons are given. Find the lines of regression and estimate the height of the son when the height of the father is 164 cm.
• The following data give the height in inches (X) and the weight in lb.
(Y) of a random sample of 10 students from a large group of students
of age 17 years: Estimate weight of the student of a height 69 inches.
Individual Assignment (due: )
Discuss the concept, show the algorithm, demonstrate with example
how it works, concluding remarks & Reference:
1. One class classification
2. Single-link clustering
3. Bayesian Belief Network
4. Complete-link clustering
5. Support Vector Machine
6. k-medoids clustering (Partitioning Around Medoids (PAM))
7. Divisive clustering
8. Regression Analysis
9. Decision tree with GINI index
10. Principal Component Analysis (PCA)
11. Average-link clustering
12. Vertical data format for frequent pattern mining
13. Recurrent neural network
14. Ensemble Model
15. Hidden Markov Model
16. Expectation maximization (EM) clustering
17. Outlier Detection Methods
18. Decision tree with information gain ratio
19. Missing value prediction
20. Convolutional Neural Network
21. Genetic algorithm
22. Density Based Spatial Clustering (DBSCAN)
23. Bisecting k-means
Clustering
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster
• Given a set of data points, each having a set of attributes, and a similarity measure among them, group the points into some number of clusters, so that
– Data points in the same cluster are similar to one another.
– Data points in separate clusters are dissimilar to one another.
Example: Clustering
• The example below demonstrates the clustering of padlocks of the same kind. There are a total of 10 padlocks which vary in color, size, shape, etc.
• How many possible clusters of padlocks can be identified?
– There are three different kinds of padlocks, which can be grouped into three different clusters.
– The padlocks of the same kind are clustered into a group as shown below.
Clustering: Document Clustering
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach:
 Identify content-bearing terms in each document.
 Form a similarity measure based on the frequencies of different
terms and use it to cluster documents.
• Application:
 Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.
Quality: What Is Good Clustering?
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
– Key requirement of clustering: need a good measure of similarity between instances.
• A good clustering method will produce high quality clusters with
– high intra-class similarity (intra-cluster distances are minimized)
– low inter-class similarity (inter-cluster distances are maximized)
Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to evaluate
because
– We do not know the correct clusters/classes
• Some methods are used:
– Direct evaluation (using either User inspection or Ground
Truth)
– Indirect Evaluation

• User inspection
– Study centroids of the cluster, and spreads of data items in
each cluster
– For text documents, one can read some documents in clusters
to evaluate the quality of clustering algorithms employed.
Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.

• After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divide D into k disjoint subsets, D1, D2, …, Dk.
Evaluation of Cluster Quality using Purity
• Quality measured by its ability to discover some or all of the
hidden patterns or latent classes in gold standard data
• Assesses a clustering with respect to ground truth … requires
labeled data
• Assume documents with C gold standard classes, while our
clustering algorithms produce K clusters, ω1, ω2, …, ωK with ni
members
• Simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of the cluster:
Purity(ωi) = (1/ni) · max_j (nij),   j ∈ C
• Other measures are the entropy of classes in clusters (or the mutual information between classes and clusters)
Purity example
(Figure: three clusters of items coloured red, blue and green, containing 6, 6 and 5 items respectively; the class counts per cluster are (5, 1, 0), (1, 4, 1) and (2, 0, 3).)
• Assume that we cluster three categories of data items (those colored red, blue and green) into three clusters as shown in the above figure. Calculate purity to measure the quality of each cluster.
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 = 83%
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 = 67%
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 = 60%
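A quick Python check of these purity values (the per-cluster class counts are the ones used in the calculation above):

```python
def purity(class_counts):
    """Purity of one cluster: dominant class count divided by cluster size."""
    return max(class_counts) / sum(class_counts)

clusters = {
    "Cluster I":   [5, 1, 0],
    "Cluster II":  [1, 4, 1],
    "Cluster III": [2, 0, 3],
}
for name, counts in clusters.items():
    print(name, round(purity(counts), 2))  # 0.83, 0.67, 0.6
```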
Indirect Evaluation
• In some applications, clustering is not the primary task, but
used to help perform another task.
• We can use the performance on the primary task to compare
clustering methods.
• For instance, in designing a recommender system, if the
primary task is to provide recommendations on book
purchasing to online shoppers.
– If we can cluster books according to their features, we might be
able to provide better recommendations.
– We can evaluate different clustering algorithms based on how
well they help with the recommendation task.
– Here, we assume that the recommendation can be reliably
evaluated.
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of distance
“farness” or “nearness” measurement between data points.
– Distances are normally used to measure the similarity or dissimilarity
between two data objects
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
• Popular similarity measure is: Minkowski distance:
dis(X, Y) = ( Σ(i=1..n) |xi − yi|^q )^(1/q)
where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-dimensional data objects; n is the size of the attribute vector of the data object; q = 1, 2, 3, …
Similarity & Dissimilarity Between Objects
• If q = 1, dis is the Manhattan distance:
dis(X, Y) = Σ(i=1..n) |xi − yi|
• If q = 2, dis is the Euclidean distance:
dis(X, Y) = sqrt( Σ(i=1..n) (xi − yi)² )
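A small generic helper (not from the slides) showing the Minkowski family in Python; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

p1, p2 = (2, 10), (8, 4)
print(minkowski(p1, p2, q=1))  # Manhattan: 12
print(minkowski(p1, p2, q=2))  # Euclidean: ~8.49
```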
The need for representative
• Key problem: as you build clusters, how do you
represent the location of each cluster, to tell
which pair of clusters is closest?
• For each cluster assign a centroid (closest to all
other points)= average of its points.
Cm = ( Σ(i=1..N) Cip ) / N, i.e. the average of the N points in the cluster
• One can measure inter-cluster distances by the distances of centroids.
Major Clustering Approaches
• Partitioning clustering approach:
– Construct various partitions as per the given number of
clusters
– Typical methods:
• distance-based: K-means clustering
• model-based: expectation maximization (EM) clustering.

• Hierarchical clustering approach:


– Create a hierarchical decomposition of the set of data (or
objects) using some criterion
– Typical methods:
• agglomerative vs. divisive clustering
• single link vs. complete link vs. average link clustering
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters; such that, sum of squared
distance is minimum
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the
cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Algorithm
 Given k (number of clusters), the k-means algorithm is
implemented as follows:
– Select K cluster points randomly as initial centroids
– Repeat until the centroid don’t change
• Compute similarity between each instance and
each cluster
• Assign each instance to the cluster with the
nearest seed point
• Recompute the centroids of each K clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5)
A6(6, 4) A7(1, 2) A8(4, 9).
– Assume that initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers - centroids, are (2, 10), (8,4) and (1, 2) - chosen
randomly.
Data Points   Cluster 1 (centroid (2,10))   Cluster 2 (centroid (8, 4))   Cluster 3 (centroid (1, 2))   Assigned Cluster
A1 (2, 10) 0 12 9 1
A2 (2, 5) 5 7 4 3
A3 (8, 4) 12 0 9 2
A4 (5, 8) 5 7 10 1
A5 (7, 5) 10 2 9 2
A6 (6, 4) 10 2 7 2
A7 (1, 2) 9 9 0 3
A8 (4, 9) 3 9 10 1
Next, we will calculate the distance from each points to each of the
three centroids, by using the distance function:
dis(point i,mean j)=|x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the
three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table & decide which cluster should the point (2, 10) be
placed in? The one, where the point has the shortest distance to the mean – i.e.
mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to cluster 3
since mean 3 is the shortest distance from A2.
• Analogically, we fill in the rest of the table, and place each
point in one of the clusters
Iteration 1
• Next, we need to re-compute the new cluster centers. We
do so, by taking the mean of all points in each cluster.
• For Cluster 1, we have three points and needs to take
average of them as new centroid, i.e.
((2+5+4)/3, (10+8+9)/3) = (3.67, 9)
• For Cluster 2, we have three points. The new centroid is:
((8+7+6)/3, (4+5+4)/3 ) = (7, 4.33)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• Since centroids changes in Iteration1 (epoch1), we go to the
next Iteration (epoch2) using the new means we computed.
– The iteration continues until the centroids do not change anymore..
Second epoch
• Using the new centroid compute cluster members again.
Data Points   Cluster 1 (centroid (3.67, 9))   Cluster 2 (centroid (7, 4.33))   Cluster 3 (centroid (1.5, 3.5))   Assigned Cluster
A1 (2, 10)    2.67     10.67    7       1
A2 (2, 5)     5.67     5.67     2       3
A3 (8, 4)     9.33     1.33     7       2
A4 (5, 8)     2.33     5.67     8       1
A5 (7, 5)     7.33     0.67     7       2
A6 (6, 4)     7.33     1.33     5       2
A7 (1, 2)     9.67     8.33     2       3
A8 (4, 9)     0.33     7.67     8       1
• After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
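A compact from-scratch Python sketch of this K-means run (Manhattan distance, the eight points and the initial centers from the example; written for illustration rather than efficiency):

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centroids = [(2, 10), (8, 4), (1, 2)]  # initial centers A1, A3, A7

while True:
    # Assignment step: each point joins the cluster of its nearest centroid
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: manhattan(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: recompute each centroid as the mean of its members
    new_centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     for c in clusters]
    if new_centroids == centroids:  # converged: centroids did not change
        break
    centroids = new_centroids

print(centroids)  # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]
print(clusters)   # the member points of clusters {A1,A4,A8}, {A3,A5,A6}, {A2,A7}
```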
Final results
• Finally, in the 2nd epoch there is no change in the members of the clusters or in the centroids, so the algorithm stops.
• The result of the clustering is: cluster 1 = {A1, A4, A8}, cluster 2 = {A3, A5, A6}, cluster 3 = {A2, A7}.
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
• Weakness
– Applicable only when mean is defined and K, the number of
clusters, specified in advance (use Elbow algorithm to determine k
for K-Means)
• Use hierarchical clustering
– Unable to handle noisy data & outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the
most centrally located object in a cluster.
Hierarchical Clustering
• As compared to partitioning algorithms, in hierarchical clustering the data are not partitioned into a particular cluster.
– Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object.
• Produces a set of nested clusters organized as a hierarchical tree.
– Hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by partitioning clustering.
– Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
(Figure: a dendrogram over six data points and the corresponding nested clusters.)
Two main types of hierarchical clustering
• Agglomerative: it is a Bottom Up clustering technique
– Start with all sample units in n clusters of size 1.
– Then, at each step of the algorithm, the pair of clusters with the
shortest distance are combined into a single cluster.
– The algorithm stops when all sample units are grouped into one
cluster of size n.
• Divisive: it is a Top Down clustering technique
– Start with all sample units in a single cluster of size n.
– Then, at each step of the algorithm, clusters are partitioned into a pair
of clusters, selected to maximize the distance between each daughter.
– The algorithm stops when sample units are partitioned into n clusters
of size 1.
(Figure: agglomerative clustering proceeds bottom-up over points a, b, c, d, e: a and b merge into ab, d and e into de, then c joins de to form cde, and finally ab and cde merge into abcde. Divisive clustering performs the same steps in the reverse direction, from one cluster back to single objects.)
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Let each data point be a cluster
2. Compute the proximity matrix
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two
clusters
Example
• Perform a agglomerative clustering of five samples using two
features X and Y. Calculate Manhattan distance between each
pair of samples to measure their similarity.

Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
Proximity Matrix: Step 1

1 2 3 4 5

1 = (4,4) - 4 15 20 28
2= (8,4) - 11 16 24
3=(15,8) - 13 13
4=(24,4) - 8
5=(24,12) -
Proximity Matrix: Step 2

(1,2) 3 4 5

(1,2) = (6,4) - 13 18 26
3=(15,8) - 13 13
4=(24,4) - 8
5=(24,12) -
Proximity Matrix: Step 3

(1,2) 3 (4,5)
(1,2) = (6,4) - 13 22
3=(15,8) - 9
(4,5)=(24,8) -
Proximity Matrix: Step 4

(1,2) (3,4,5)
(1,2) = (6,4) - 19
(3,4,5)=(21,8) -
Proximity Matrix: Step 5
(1,2,3,4,5)
(1,2,3,4,5) = (15, 6.4)   -
Dendrogram

(1) (2) (3) (4) (5)

(1,2)
(4,5)

(3,4,5)

(1,2,3,4,5)
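A from-scratch Python sketch of this agglomerative run (centroid representatives, Manhattan distance); it reproduces the merge order shown in the dendrogram:

```python
points = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def centroid(members):
    xs = [points[m][0] for m in members]
    ys = [points[m][1] for m in members]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

clusters = [[m] for m in points]  # start: every point is its own cluster
while len(clusters) > 1:
    # Find the pair of clusters whose centroids are closest
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: manhattan(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
    )
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(merged, centroid(merged))
# Merge order: [1,2] -> [4,5] -> [3,4,5] -> [1,2,3,4,5]
```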
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by ‘cutting’
the dendrogram at the proper level
• They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)

Major Weakness
• Do not scale well: time complexity of at least O(n²), where n is the total number of data objects
Project (Due:_______)
•Requirement:
–Choose dataset with 10+ attributes and at least 1500 instances. As much as
possible try to use local data to learn a lot about DM
–Prepare the dataset by applying business understanding, data understanding
and data preprocessing.
–Use DM algorithm assigned to experiment using WEKA and discover
interesting patterns
•Project Report
–Write a publishable report with the following sections:
• Abstract -- ½ page
• Introduce problem, objective, scope & methodology -- 2 pages
• Review related works -- 4 pages
• Description of Data preparation -- 3 pages
• Description of DM algorithms used for the experiment -- 3 pages
• Discussion of experimental result, with findings --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style)
• Describe, in detail contribution of each member of the group
Group Group members Data analytics task Project title
number
1 Classification:
Samson Akale  Decision tree
Dawit  Neural Network
Leniesil  K-Nearest Neignbour
 Select one more algorithm

2 Classification:
G/Tsadk  Decision tree
Mulu  Support Vector Machine
Eshetu  Regression Analysis
 Select one more algorithm

3 Hulu Clustering:
Betty  K-Means
Adane  Hierarchical
 Expected Maximization
 Select one more algorithm

4 Association rule discovery


Abel  Apriori
Berhan  FP growth
 Select one more algorithm
Association Rule Discovery

• Finding frequent patterns


• Generating association rules
• Frequent Itemset Mining Methods
Association Rule Discovery: Definition
• Association rule discovery attempts to discover
hidden linkage between data items
• Given a set of records each of which contain some
number of items from a given collection;
– Association rule discovery produce dependency
rules which will determine the likelihood of
occurrence of an item based on occurrences of
other items.
Association Rule Discovery: Definition
• Given a set of records each of which contain some
number of items from a given collection;
– Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Motivation of Association Rule discovery

• Finding inherent regularities in dataset


– What products were often purchased together?
Pasta & Tea? Meat & Milk
– What are the subsequent purchases after
buying a PC?
– What kinds of DNA are sensitive to the new drug
D?
– Can we find redundant lab tests in medicine?
Association Rule Discovery: Application
• Shelf management (in Supermarket, Book shop, Pharmacy
…).
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process historical sales transactions data (the
point-of-sale data collected with barcode scanners) to find
dependencies among items.
– A classic rule –
• If a customer buys Coffee and Milk, then he/she is very
likely to buy Beer.
• So, don’t be surprised if you find six-packs of Beer stacked
next to Coffee!
{Coffee, Milk} → Beer
Prevalent ≠ Interesting Rules
• Analysts already know about prevalent rules
– Interesting rules are those that deviate from prior expectation
(Slide illustration: "1995 — milk and cereal sell together! Zzzz..." vs. "2014 — milk and raw meat sell together!")
• Mining's payoff is in finding interesting (surprising) phenomena
• What makes a rule surprising?
– Does not match prior expectation
• Correlation between milk and cereal remains roughly constant over time
– Cannot be trivially derived from simpler rules
– Milk 60%, cereal … 60%
– Milk & cereal 60% … prevailing
– Raw meat … 65%
– Milk & raw meat … 65% … surprising!
Association Rule Discovery: Two Steps
• The problem of Association rule discovery can be
generalized into two steps:
– Finding frequent patterns from large itemsets
• Frequent pattern is a set of items (subsequences,
substructures, etc.) that occurs frequently in a data set
– Generating association rules from these itemsets.
• Association rules are defined as statements of the form {X1,X2,
…,Xn}  Y, which means that Y may present in the transaction
if X1,X2,…,Xn are all in the transaction.
• Example: Rules Discovered:
{Milk}  {Coke}
{Tea, Milk}  {Coke}
Frequent Pattern Analysis: Basic concepts
• Itemset:
– A set of one or more items; k-itemset: X = {x1, …, xk}
• Support
– support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
– for a rule, the support of X & Y must be greater than or equal to a user-defined threshold s; i.e. the probability that a transaction contains X ∪ Y is at least s
– An itemset X is frequent if X's support is no less than a minsup threshold
• Confidence
– confidence is the probability of finding Y = {y1, …, yk} in a transaction that contains X = {x1, x2, …, xn}
– confidence, c, is the conditional probability that a transaction having X also contains Y; this conditional probability of Y given X must be greater than or equal to a user-defined threshold c
Example: Finding frequent itemsets
• Given a support threshold S, sets of items that appear in at least S baskets are called frequent itemsets.
• Example: Frequent Itemsets
– Itemsets bought={milk, coke, pepsi, biscuit, juice}.
– Find frequent k-itemsets that fulfill minimum support of 50% of
the given transactions (i.e. 4 baskets).
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets:
• Frequent 1-itemset: {m}, {c}, {b}, {j};
• Frequent 2-itemsets: {m,b} , {b,c}.
• Is there any frequent 3-itemset?
Association Rules
• Find all rules on frequent itemsets of the form XY that fulfills
minimum confidence
– If-then rules about the contents of baskets.
• X → Y; where X = {x1, …, xn} and Y = {y1, …, yk}. This means that:
– “if a basket contains all of X’s then what is the likelihood of
containing Y.”
• A typical question: "find all association rules with support ≥ s and confidence ≥ c." The confidence of an association rule X → Y is the probability of Y given X, i.e. the fraction of transactions containing X that also contain Y.
• Example: Given the following transactions, generate association rules with minimum support & confidence of 50%
B1 = {m, c, b}   B2 = {m, p, j}   B3 = {m, b}   B4 = {c, j}
B5 = {m, p, b}   B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
• Association rule: b → c (fulfils both: support = 4/8 = 50% and confidence = 4/6 ≈ 67%).
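A short Python sketch that counts support and confidence over these eight baskets (the helper names are illustrative):

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"m", "b"}))       # 0.5   -> {m, b} is frequent at minsup 50%
print(confidence({"b"}, {"c"}))  # 0.67  -> rule b -> c
print(confidence({"c"}, {"b"}))  # 0.8
```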
Frequent Itemset Mining Methods
• The hardest problem often turns out to be finding the frequent
pairs.
• Naïve Algorithm
– Read file once, counting in main memory the occurrences of each pair.
• From each basket of n items, generate all n(n−1)/2 pairs by two nested loops.
– Fails if (number_of_items)² exceeds main memory.
• Remember: number of items can be, say Billion of Web pages via
Internet.
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
– If {Coke, Tea, nuts} is frequent, so is {Coke, Tea}
• i.e., every transaction having {Coke, Tea, nuts} also contains {Coke,
Tea}
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
that limits the need for main memory.
– Key idea: if a set of items appears at least s times, so does
every subset.
• Contra-positive for pairs: if item i does not appear in
s baskets, then no pair including i can appear in s
baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
– Mining Frequent Patterns without explicit Candidate
Generation, rather it generates frequent itemsets
– Uses the Apriori Pruning Principle to generate frequent
itemsets
A-Priori Algorithm
• A-priori is a two-pass approach for each frequent k-itemset generation
• Step 1
– Pass 1: read baskets & count in main memory the occurrences of each item (identify candidate frequent items)
– Pass 2: identify those truly frequent items
• Step 2
– Pass 1: read baskets again & count in main memory only those pairs both of whose items were found frequent in Step 1
– Pass 2: identify those truly frequent pairs
(Figure: main-memory layout across the two passes: item counts and the frequent-items table in Pass 1, counts of candidate pairs in Pass 2.)
The Apriori Algorithm: A Candidate Generation &
Test Approach
• Iterative algorithm (also called level-wise search): Find
all 1-item frequent itemsets; then all 2-item frequent
itemsets, and so on.
– In each iteration k, only consider itemsets that contain some
k-1 frequent itemset.
Find frequent itemsets of size 1: F1
From k = 2
Ck = candidates of size k: those itemsets of size k that could
be frequent, given Fk-1
Fk = those itemsets that are actually frequent, Fk  Ck (need
to scan the database once).
The Apriori Algorithm—An Example
Assume that min Support = 2 and min confidence = 80%, identify
frequent itemsets and construct association rules
Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3   →   L1: {A}:2, {B}:3, {C}:3, {E}:3
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2   →   L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
3rd scan → C3: {A,B,C}:1, {A,B,E}:1, {A,C,E}:1, {B,C,E}:2   →   L3: {B,C,E}:2
Which of the above rules fulfill a confidence level of at least 80%?
Rule          Support   Confidence
C → B         50%       f(B,C)/f(C) = 2/3 ≈ 67%   (weak relationship)
B → C         50%       66.67%
B → E         75%       f(B,E)/f(B) = 3/3 = 100%
C → E         50%       66.67%
(B,C) → E     50%       100%
(B,E) → C     50%       f(B,C,E)/f(B,E) = 2/3 ≈ 67%
E → B         75%       f(B,E)/f(E) = 3/3 = 100%   (strong relationship)

Results:
A → C (with support 50%, confidence 100%)
B → E (with support 75%, confidence 100%)
E → B (with support 75%, confidence 100%)
(B,C) → E (with support 50%, confidence 100%)
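A brute-force Python sketch (itertools-based, not an optimized Apriori implementation) that reproduces the frequent itemsets and the strong rules for this small TDB:

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
items = sorted(set().union(*transactions))
MIN_SUP, MIN_CONF = 2, 0.8

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

# All frequent itemsets (brute force; real Apriori would prune candidates level by level)
frequent = [fs for k in range(1, len(items) + 1)
            for fs in combinations(items, k) if count(fs) >= MIN_SUP]
print(frequent)

# Strong rules X -> Y derived from every frequent itemset of size >= 2
for fs in frequent:
    for r in range(1, len(fs)):
        for lhs in combinations(fs, r):
            rhs = tuple(i for i in fs if i not in lhs)
            conf = count(fs) / count(lhs)
            if conf >= MIN_CONF:
                print(lhs, "->", rhs, "support:", count(fs), "confidence:", conf)
```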
Bottlenecks of the Apriori approach
• The Apriori algorithm reduces the size of candidate
frequent itemsets by using “Apriori property.”
– However, it still requires two nontrivial computationally
expensive processes.
• It requires as many database scans as the size of the largest
frequent itemsets. In order to find frequent k-itemsets, the
Apriori algorithm needs to scan database k times.
– Breadth-first (i.e., level-wise) search
• Candidate generation and test the frequency of true
appearance of the itemsets
– It may generate a huge number of candidate sets that will
be discarded later in the test stage.
Frequent Pattern-Growth Approach
• The FP-Growth Approach
– Depth-first search: search depth wise by identifying
different set of combinations with a given single or pair of
items
• Steps followed: The FP-Growth Approach scans DB only twice
 Scan DB once to find frequent 1-itemset (single
item pattern)
 Sort frequent items in frequency descending order,
f-list
 Scan DB again to construct FP-tree, the data
structure of FP-Growth
Construct FP-tree from a Transaction Database
• Assume min-support = 3 and min-confidence = 80%
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Header Table:
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3
F-list = f-c-a-b-m-p
(Figure: the resulting FP-tree: root {} with main path f:4 → c:3 → a:3 → m:2 → p:2; side branches b:1 under f:4 and b:1 → m:1 under a:3; and a second branch from the root, c:1 → b:1 → p:1.)
(Figure: the FP-tree after inserting each of the five transactions in turn; it grows from the single path f:1 → c:1 → a:1 → m:1 → p:1 to the final tree shown above.)
FP-Growth Example
• Construct conditional pattern base which consists of the set of prefix
paths in the FP tree co-occuring with the suffix pattern, and then
construct its conditional FP-tree.

Item Conditional pattern-base Conditional FP-tree


p {(fcam:2), (cb:1)} {(c:3)}|p
m {(fca:2), (fcab:1)} {(f:3, c:3, a:3)}|m
b {(fca:1), (f:1), (c:1)} --
a {(fc:3)} {(f:3, c:3)}|a
c {(f:3)} {(f:3)}|c
f -- --
Which of the above patterns fulfill a confidence level of at least 80%?
Pattern      Support   Confidence
cp           3         75%
fcam         3         100%
cam          3         100%
fa           3         75%
ca           3         75%
fc           3         100%

Results: generated association rules
• fcam
• cam
• p → c   (f(c,p)/f(p) = 3/3 = 100%)
Benefits of the FP-tree Structure
• Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction

• Compactness
– Reduce irrelevant information: infrequent items are gone
– Items in frequency descending order: the more
frequently occurring, the more likely to be shared
– Never be larger than the original database (not count
node-links and the count field)
Exercise
• Given frequent 3-itemset
{abc, acd, ace, bcd}
Generate possible frequent 4-itemset?
Important Dates:
• Concept presentation

• Project Presentation:

• Final exam:
THANK YOU