Machine Learning
INSA Rouen Normandie – Département GM
Cecilia ZANNI-MERK
Office: BO BR1 04
Goals
• Understanding the motivations for machine learning and data mining
• Being able to implement the data mining process
• Identifying the differences among the different types of learning and
understanding the basic algorithms associated with each of them
• Developing skills in the design and use of those algorithms for specific
tasks
2
Analytical Program
• Introduction and Examples
• Algorithms and Approaches
• Implementation
3
Introduction
Thank you to my colleague Professor Christoph Reich (Hochschule Furtwangen) for
his slides
4
“Big Data”
• Trendy word to describe *a lot* of data!
• Buzz:
5
“Big Data”
• We live in the information age
• data accumulation in all areas
• internet
• biology: human genome, DNA sequencing
• physics: Large Hadron Collider, 10^20 bytes/day per sensor
• recording devices
• sensors, mobile phones, interactions on the Internet, ...
• IT challenges
• storage, recovery, distributed calculation...
• the 3 Vs: volume, velocity, variety
• We need to give meaning to data → machine learning
6
Giving meaning to (Big) Data
• Some quantitative indicators
• Twitter: 50 million tweets / day (≈ 7 terabytes)
• Facebook: 10 terabytes / day
• YouTube: 50 hours of video uploaded / minute
• Email: 2.9 million emails / second
• The quantity of data is too large to be processed manually or by classical
algorithms:
• The number of entries is in the millions to billions
• Multi-dimensional data
• Heterogeneous sources of data
7
Paradigm Shift
Store Everything Now Because
it May be Useful Later
8
Source: Microsoft’s Chicago data center (Image courtesy cnet.com)
Giving meaning to (Big) Data?
• The user is full of data, but does not know how to understand it:
• “The greatest problem of today is how to teach people to ignore the
irrelevant, how to refuse to know things, before they are suffocated. For too
many facts are as bad as none at all.“ (W.H. Auden)
• What do we need?
• To extract interesting and useful knowledge from the data: rules, regularities,
irregularities, patterns, constraints
• To make predictions, detect faults, solve problems...
9
Giving meaning to (Big) Data
• Science behind it all:
• machine learning / computational statistics
• Other terms in practice:
• data mining, business/data analytics, pattern recognition, knowledge
discovery in databases (KDD), knowledge extraction, data/pattern analysis
10
What is machine learning?
• How to build computer systems that improve with experience, and
what are the fundamental laws that govern all machine learning
processes (Tom Mitchell)
• Mixing computer science and statistics
• CS: "How do you build machines that solve problems, and which problems are
intrinsically feasible / unfeasible?"
• Statistics: "What can be deduced from data and a set of modeling
assumptions?"
• How can a computer learn from data?
11
What is data mining ?
• Extraction of implicit, non-trivial information, previously
unknown and potentially useful, from different data sources:
• Not trivial: otherwise knowledge is not useful
• Implicit: hidden knowledge is difficult to observe
• Unknown until now: obvious!
• Potentially useful: usable, understandable
• Whole process of discovery and interpretation of regularity in data
12
Machine Learning or Data Mining?
• Data Mining is a cross-disciplinary field that focuses on discovering
properties of data sets.
• Machine Learning is a sub-field of data science that focuses on
designing algorithms that can learn from and make predictions on the
data.
• There are different approaches to discovering the properties of data sets.
Machine Learning is one of them. Another one is simply looking at the data
sets using visualization techniques.
• It is clear then that machine learning can be used for data mining. However,
data mining can use other techniques besides or on top of machine learning.
13
Some warnings, however …
• Hype: With enough data, we can solve “everything” with “no
assumptions”!
• Theory: No Free Lunch Theorem!
• If we do not make assumptions about the data, all learning methods perform as
badly "on average" on unseen data as random prediction!
• Consequence: need some assumptions
• for example, that time series vary ‘smoothly’
14
No Free Lunch Theorem
• Hume (1739–1740) pointed out that ‘even after the observation of
the frequent or constant conjunction of objects, we have no reason to
draw any inference concerning any object beyond those of which we
have had experience’.
• More recently, and with increasing rigour, Mitchell (1980), Schaffer
(1994) and Wolpert (1996) showed that bias-free learning is futile.
• The mathematical demonstration is here
• http://www.no-free-lunch.org/coev.pdf
15
No Free Lunch Theorem
• A 2-class problem with 100 binary attributes
• Say you know a million instances, and their classes (training set)
• You don't know the classes of the remaining 2^100 − 10^6 examples! (that's 99.9999…% of the
data set)
• How could you possibly figure them out?
• In order to generalize, every learner must embody some knowledge
or assumptions beyond the data it’s given
• A learning algorithm implicitly provides a set of assumptions
• There can be no “universal” best algorithm (no free lunch)
• Data mining is an experimental science
16
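To make the numbers above concrete, here is a quick check (a minimal Python sketch; the quantities are exactly the ones quoted on the slide):

total = 2 ** 100    # possible instances with 100 binary attributes
known = 10 ** 6     # instances whose class we know (the training set)

# Fraction of the instance space covered by the training set: vanishingly small
print(known / total)              # ~7.9e-25
# Fraction whose class is unknown and must be generalized to
print((total - known) / total)    # effectively 100% (the 99.9999...% of the slide)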
Mining random patterns
• We can ‘discover’ meaningless random patterns if we look through too
many possibilities: “Bonferroni’s principle”
• NSA example: say we consider it suspicious when a pair of (unrelated)
people stayed at least twice in the same hotel on the same day
• suppose 10^9 people are tracked during 1000 days
• each person stays in a hotel 1% of the time (1 day out of 100)
• each hotel holds 100 people (so we need 10^5 hotels)
• if everyone behaves randomly (i.e. no terrorists), can we still detect something
suspicious?
• The probability that a specific pair of people visit the same hotel on the same day is
10^-9; the probability that this happens twice is thus 10^-18 (tiny),
• ... but there are many possible pairs
• The expected number of “suspicious” pairs is actually about 250,000!
example taken from Rajaraman et al., Mining of Massive Datasets
17
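A short Python calculation reproducing the slide's estimate, under the stated assumptions (10^9 people, 1000 days, a 1% chance of being in a hotel on a given day, 10^5 hotels):

from math import comb

people = 10 ** 9
days = 1000
p_same_hotel_same_day = 0.01 * 0.01 / 10 ** 5      # both in a hotel that day, and it is the same hotel: 1e-9

pairs_of_people = comb(people, 2)                  # ~5e17 candidate pairs of people
pairs_of_days = comb(days, 2)                      # ~5e5 pairs of days
p_suspicious_pair = pairs_of_days * p_same_hotel_same_day ** 2   # ~5e-13 per pair of people

expected = pairs_of_people * p_suspicious_pair
print(f"expected 'suspicious' pairs: {expected:,.0f}")           # about 250,000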
Mining random patterns
• Suppose there are (say) 10 pairs of terrorists who definitely stayed at
the same hotel twice.
• Analysts have to go through 250,010 candidates to find the 10 real
cases !!!!
• Moral: When looking for a property (e.g., “two people stayed at the
same hotel twice”), make sure that the property does not allow so
many possibilities that random data will surely produce facts “of
interest”.
18
Some success stories using Machine Learning
• spam classification (Google)
• machine translation (impressive sometimes, see
https://www.deepl.com/translator)
• speech recognition (used in your smart phone)
• self-driving cars (again Google)
19
Some examples
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
20
Bibliography
• This textbook discusses data mining and Weka in depth:
• Data Mining: Practical machine learning tools and techniques,
by Ian H. Witten, Eibe Frank
and Mark A. Hall.
Morgan Kaufmann, 2011
21
Processing loan applications (American Express)
22
With machine learning
● 1000 training examples of borderline cases
● 20 attributes:
● age
● years with current employer
● years at current address
● years with the bank
● other credit cards possessed,…
● Learned rules: correct on 70% of cases
● human experts only 50%
● Rules could be used to explain decisions to customers
23
Screening images
● Given: radar satellite images of coastal waters
● Problem: detect oil slicks in those images
● Oil slicks appear as dark regions with changing size and shape
● Not easy: lookalike dark regions can be caused by weather conditions
(e.g. high wind)
● Expensive process requiring highly trained personnel
24
With machine learning
● Extract dark regions from normalized image
● Attributes:
● size of region
● shape, area
● intensity
● sharpness and jaggedness of boundaries
● proximity of other regions
● Info about background
● Constraints:
● Few training examples—oil slicks are rare!
● Unbalanced data: most dark regions aren’t slicks
● Regions from same image form a batch
● Requirement: adjustable false-alarm rate
25
Load forecasting
● Electricity supply companies
need forecast of future demand
for power
● Forecasts of min/max load for each hour
⇒ significant savings
● Given: manually constructed load model that assumes “normal”
climatic conditions
● Problem: adjust for weather conditions
● The static model consists of:
● base load for the year
● load periodicity over the year
● effect of holidays
26
With machine learning
● Prediction corrected using “most similar” days
● Attributes:
● temperature
● humidity
● wind speed
● cloud cover readings
● plus difference between actual load and predicted load
● Average difference among three “most similar” days added to static
model
● Linear regression coefficients form attribute weights in similarity
function
27
Diagnosis of machine faults
● Diagnosis: classical domain
of expert systems
● Given: Fourier analysis of vibrations measured at various points of a
device’s mounting
● Question: which fault is present?
● Preventive maintenance of electromechanical motors and generators
● Information very noisy
● So far: diagnosis by expert/hand-crafted rules
28
With machine learning
● Available: 600 faults with expert’s diagnosis
● ~300 unsatisfactory, rest used for training
● Attributes augmented by intermediate concepts that embodied
causal domain knowledge
● Expert not satisfied with initial rules because they did not relate to his
domain knowledge
● Further background knowledge resulted in more complex rules that
were satisfactory
● Learned rules outperformed hand-crafted ones
29
Marketing and sales I
● Companies precisely record massive amounts of marketing and sales
data
● Applications:
● Customer loyalty: identifying customers that are likely to defect by detecting
changes in their behavior
(e.g. banks/phone companies)
● Special offers: identifying profitable customers
(e.g. reliable owners of credit cards that need extra money during the holiday
season)
30
Marketing and sales II
● Market basket analysis
● Association techniques find groups of items
that tend to occur together in a transaction
(used to analyze checkout data)
● Historical analysis of purchasing patterns
● Identifying prospective customers
● Focusing promotional mail outs
(targeted campaigns are cheaper than mass-marketed ones)
31
Data mining and ethics
• Information privacy laws (in Europe, but not US)
• A purpose must be stated for any personal information collected
• Such information must not be disclosed to others without consent
• Records kept on individuals must be accurate and up to date
• To ensure accuracy, individuals should be able to review data about
themselves
• Data must be deleted when it is no longer needed for the stated purpose
• Personal information must not be transmitted to locations where equivalent
data protection cannot be assured
• Some data is too sensitive to be collected, except in extreme circumstances
(e.g., sexual orientation, religion)
32
Data mining and ethics
• Anonymization is harder than you think
• When Massachusetts released medical records summarizing every state
employee’s hospital record in the mid‐1990s, the governor gave a public
assurance that it had been anonymized by removing all identifying
information such as name, address, and social security number. He was
surprised to receive his own health records (which included diagnoses and
prescriptions) in the mail.
33
Data mining and ethics
• Re-identification techniques. Using publicly available records:
• 50% of Americans can be identified from city, birth date, and sex
• 85% can be identified if you include the 5‐digit zip code as well
• Netflix movie database: 100 million records of movie ratings (1–5)
• Can identify 99% of people in the database if you know their ratings for 6
movies and approximately when they saw the movies (± one week)
• Can identify 70% if you know their ratings for 2 movies and roughly when
they saw them
34
Data mining and ethics
• The purpose of data mining is to discriminate …
• who gets the loan
• who gets the special offer
• Certain kinds of discrimination are unethical, and illegal
• racial, sexual, religious, …
• But it depends on the context
• sexual discrimination is usually illegal
• … except for doctors, who are expected to take gender into account
• … and information that appears innocuous may not be
• ZIP code correlates with race
• membership of certain organizations correlates with gender
35
Correlation vs Causation
• Correlation does not imply causation!!
• http://www.tylervigen.com/spurious-correlations
36
Data mining and ethics
37
Algorithms and Approaches
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)
38
Group exercise
• Form groups
• You will get a list of steps in the data mining process
• Please order the steps in chronological order!
• You have 5 min for the task.
39
Group exercise
• Clean the data
• Identify the problem
• Find data
• Integration of new knowledge
• Validation and interpretation of the results, with a possible return to
previous steps
• Coding of the data, transformations on the variables
• Search for a model, for knowledge, etc.
40
The data mining process
1. Identify the problem
2. Find data
3. Clean the data
4. Coding of the data, transformations on the variables
5. Search for a model, for knowledge, etc.
6. Validation and interpretation of the results, with a possible return to
previous steps
7. Integration of new knowledge
41
The data mining process
42
Data sets
• Data for a data mining problem
• Information is examples with attributes
• A data set of N examples is generally available.
• Attributes
• An attribute is a descriptor of an entity
• It is also called variable or characteristic
• Examples
• These are entities that characterize an object
• They consist of attributes
• Synonyms: point, vector (often in ℝⁿ)
45
Preparation of the data
• Existing or needed data:
• Files: information contained in one or more independent files
• Relational Database: information contained in several files linked by a common ‘key’
• Transactional database
• Data cleaning:
• duplicates, input errors, outliers, missing information (ignore the observation,
replace by the overall mean value, by the mean value of the class, use regression, etc.)
46
Preparation of the data
• Data warehouses: collections of data gathered from multiple
heterogeneous sources
• Data is recorded, cleaned, transformed and integrated
• Usually modeled by a multidimensional data structure (cube):
• Data is structured along several lines of analysis (dimensions of the
cube) such as time, location, etc.
• A cell is the intersection of different dimensions
• The calculation of each cell is carried out at loading time
• The response time is thus stable whatever the request
47
Preparation of the data
• Selection of the data
• Data sampling
• Selection of sources
• Dimensionality reduction
• Selection or transformation of attributes
• Weighting
• Feature extraction
• Coding
• Aggregation (sum, average), discretization, coding of discrete attributes, scale
standardization
48
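As an illustration of the coding step, here is a minimal sketch (standard-library Python, hypothetical toy values) of two operations listed above, scale standardization and discretization:

from statistics import mean, stdev

ages = [23, 35, 47, 51, 62, 29, 44]                 # hypothetical attribute values

# Scale standardization: zero mean, unit standard deviation
m, s = mean(ages), stdev(ages)
standardized = [(a - m) / s for a in ages]

# Discretization: replace the numeric attribute by coarse categories
def discretize(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

print(standardized)
print([discretize(a) for a in ages])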
Data visualization
• Obtain a visual representation of the data
• Not always possible depending on the data type
• Not always possible depending on the amount of data
49
A word about data quality
• A huge problem in practice
• any manually entered data is suspect
• most data sets are in practice deeply problematic
• Even automatically gathered data can be a problem
• systematic problems with sensors
• errors causing data loss
• incorrect metadata about the sensor
• Never, never, never trust the data without checking it!
50
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)
51
The data mining process
52
Data mining goals and principles
• Goal: Learn something new!
• Concepts: grouping of data based on shared characteristics
• Associations: correlations between attributes or data
• Principles:
• Getting the highest level of abstraction possible
• Rules or truths that are the basis for other truths
• Three types
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
53
Supervised learning
• Inductive model where the learner considers a set of labeled
examples representative of the learning task (class membership, etc.)
• The examples are labeled beforehand
• Predictive data mining
• Divide / group instances into special classes for future predictions
• Predicting unknown or missing values
54
Supervised learning
• Induction:
• Generalization of an observation or reasoning established from singular cases.
• It consists in drawing conclusions from a series of facts
• Example:
• Induction: water, oil and milk freeze when they are cooled; we infer that
all liquids freeze, provided the cold is intense enough
• Deduction: all liquids are likely to freeze; so, if mercury is a liquid, it can
freeze
55
Supervised learning
• From data (xᵢ, yᵢ) ∈ 𝒳 × 𝒴, i = 1, …, n, estimate the dependencies
between 𝒳 and 𝒴
• Elements of 𝒴 are called labels, tags, or annotations
• It is called supervised learning because the yᵢ are used to guide the estimation
process.
• Examples
• Estimate the links between eating habits and the risk of heart attack.
xᵢ: attributes of a patient's diet, yᵢ: the patient's category (at risk, not at risk).
• Applications
• fraud detection, medical diagnosis...
• Algorithms
• Decision trees, classifications, genetic algorithms, (linear and non-linear) regression
Taken from Gilles Gasso’s course at ASI4 56
Unsupervised learning
• Construction of a model and discovery of relations is done without
reference to other data
• There is no prior information on the data
• Explanatory data mining
• Grouping instances into special classes based on their resemblance or their
sharing of properties.
• The classes are unknown and are therefore created; they are used to explain
or summarize the data
57
Unsupervised learning
• Only data {xᵢ ∈ 𝒳, i = 1, …, n} are available. The aim is to describe
how the data are organized and to extract homogeneous subsets of data.
• Examples
• Categorization of supermarket customers. xᵢ represents an individual
(address, age, shopping habits...)
• Applications
• identification of market segments, categorization of similar documents,
segmentation of biomedical images....
• Algorithms
• Segmentation, clustering, discovery of associations and rules
Source: http://www.favouriteblog.com/essential-cheat-sheets-for-machine-learning-python-and-maths/
60
Types of algorithms and their use …
Source: https://www.analyticsvidhya.com/blog/2017/02/top-28-cheat-sheets-for-machine-learning-data-science-probability-sql-big-data/ 61
Types of Algorithms and their Usage
62
Source: http://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)
63
Data mining approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering :
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables
64
Estimation
• Goal : to create a model that best describes a prediction or forecast
variable linked to real data
• How: Analyze the relationship of one variable vs. one or more other
variables
65
Estimation : Regression
• Least-squares method
66
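A minimal least-squares sketch on hypothetical one-dimensional data (numpy is an implementation choice here, not something imposed by the course):

import numpy as np

# Hypothetical data: y is roughly 2x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.8, 5.1, 7.2, 8.9, 11.3])

# Fit y = a*x + b by minimizing the sum of squared residuals
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"y ~ {a:.2f} * x + {b:.2f}")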
Estimation : Regression
• Underfitting
• Using an algorithm that cannot capture the full complexity of the data
Estimation : Regression
• Overfitting
• Tuning the algorithm so carefully it starts matching the noise in the training
data
68
Estimation : Neural networks
• Neural network
69
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables
70
Classification
• Division of the data set into disjoint classes
• Goal:
• search for a set of predicates characterizing a class of objects, which can
be applied to unknown objects in order to identify their class membership.
• Main techniques:
• Decision trees
• Bayesian classifier
• k-nearest neighbors
• Neural networks
• SVM
• Genetic algorithms
71
Classification: Decision Trees
• Classify objects into subclasses by hierarchical divisions
• Automatic construction from a sample
• There are several techniques to build the tree
72
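A quick sketch of building such a tree automatically from a sample, here with scikit-learn on hypothetical data (the library and the attribute names are assumptions; the course itself uses Weka later on):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical sample: [age, is_executive, is_female] -> responded to the offer (1) or not (0)
X = [[25, 0, 0], [40, 1, 0], [35, 1, 1], [50, 0, 1], [45, 1, 1], [30, 0, 0]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "is_executive", "is_female"]))
print(tree.predict([[38, 1, 0]]))   # classify a new, unknown example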
Classification: Decision Trees
• An example
• A gift is sent to potential customers, who can then place an order.
• If the customer does not place an order the company pays 50 euro; otherwise
it earns 100 euro.
• Forgetting to send a gift to a potentially responsive customer “costs” 100 euro
• Given a table of responses for a population sample (size 100), decide which
group of people should be targeted in the future
73
Classification: Decision Trees
• An example
Mail only to
executives or only
female executives
74
Classification: K-nearest neighbors
• All distances between the point X to be classified and all labeled
points are computed
• We keep the K closest labeled points. The majority class in this set is
assigned to X.
Example of k-NN classification. The test sample (green
circle) should be classified either to the first class of blue
squares or to the second class of red triangles.
• If k = 3 (solid line circle) it is assigned to the second class
because there are 2 triangles and only 1 square inside the
inner circle.
• If k = 5 (dashed line circle) it is assigned to the first class (3
squares vs. 2 triangles inside the outer circle).
Source Wikipedia 75
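A from-scratch sketch of the K-nearest-neighbors rule just described (hypothetical 2-D points, Euclidean distance, ties not handled specially):

from collections import Counter
from math import dist

def knn_classify(x, labeled_points, k=3):
    """Assign to x the majority class among its k nearest labeled points."""
    neighbors = sorted(labeled_points, key=lambda p: dist(x, p[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training data: (point, class)
data = [((1, 1), "square"), ((1, 2), "square"), ((2, 1), "square"),
        ((5, 5), "triangle"), ((6, 5), "triangle"), ((5, 6), "triangle")]
print(knn_classify((2, 2), data, k=3))   # -> "square"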
Classification: K-nearest neighbors
• No hypothesis on the distribution and the “shape” of the classes
• Complexity increasing with the size of the training base
• For high-dimensional data (e.g., with number of dimensions more
than 10) dimension reduction is usually performed prior to applying
the k-NN algorithm
• In fact, the Euclidean distance is unhelpful in high dimensions because all
vectors can be almost equidistant to the search query vector
76
Classification: Neural Networks
• Inspired by the structure of the nervous system
• A large number of connected neurons that process information
• The response of the neuron depends on its state and the weights of
the connections
• The weights (or strengths) are learned from experience
77
Classification: Neural Networks
• Principle
• Construction of a network of simple computational units (neurons) linked by
connections
• Learning network parameters (weight of connections) using a set of examples
• Components of a neuron:
• Inputs (incoming connections or input variables)
• weights on incoming connections
• a function F which computes an output from the inputs and the
weights on the incoming connections
• activation function which modifies the amplitude of the output of the
node
78
Classification: Neural Networks
• Activation function
• For example, it can be linear
79
Classification: Neural Networks
• The simplest neural network
• Single neuron or perceptron
• F = weighted sum of inputs
• Activation function = thresholding
80
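A minimal sketch of this single neuron: a weighted sum of the inputs followed by a thresholding activation (the weights below are hypothetical, not learned):

def perceptron(inputs, weights, bias=0.0):
    """Single neuron: F = weighted sum of the inputs, activation = thresholding."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s >= 0 else 0

# Hypothetical weights realizing the linear separation x + y >= 1 in the plane
print(perceptron([0.2, 0.9], weights=[1.0, 1.0], bias=-1.0))   # -> 1
print(perceptron([0.1, 0.3], weights=[1.0, 1.0], bias=-1.0))   # -> 0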
Classification: Neural Networks
• The simplest neural network
• Linear separation
81
Classification: Neural Networks
• The simplest neural network
• Additional examples
• We still need to find a neural network discriminating between the two
classes
82
Classification: Neural Networks
• The simplest neural network
• Learning of the new weights would lead to something like this
83
Classification: Neural Networks
• In practice:
• Choose a calculation function and an activation function
• Choose an architecture:
• Number of inputs
• Number of outputs
• Number of internal layers
• Number of neurons of each of the internal layers
• Select a cost function
• Decide when to stop training
84
Classification: Neural Networks
• Advantages of neural networks:
• Robust to noise
• Classification or estimation is quick after training is completed
• Available in all data mining software
• Disadvantages:
• Black box: difficult to interpret the obtained model
• Significant learning time
• Selection of parameters is difficult
85
Classification: Neural Networks
• Attention: Neural Nets can model anything …
• Express some output as a nonlinear function (sets of equations) of
some inputs
86
Source: http://www.asimovinstitute.org/neural-network-zoo/
87
Classification: NN & Deep Learning
Previously: Machine Learning (about 1-2 hidden layers)
Often: Principal Component
Analysis for Dimensionality
Reduction
89
Classification: Training and Testing
• Construct a model on the training set and test the model on a test set
for which the results are known
90
Classification: Evaluation
• Cross-validation
• Split the data into training and test sets; in general, more data is kept for learning
• The number of cross-validation folds depends on the volume of available
data
91
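A sketch of the train/test split and of k-fold cross-validation with scikit-learn (the data set and the classifier are placeholders chosen for the example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold-out evaluation: build the model on the training set, test it on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# 10-fold cross-validation; the number of folds depends on the volume of available data
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())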
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)
92
The data mining process
93
Reminder: Data mining goals and principles
• Goal: Learn something new!
• Principles:
• Getting the highest level of abstraction possible
• Rules or truths that are the basis for other truths
• Three types
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
94
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables
95
Clustering
• Aim of clustering: obtain a simplified representation (a structuring)
of the initial data
• Organization of a set of objects into a set of homogeneous and / or
natural groupings
96
Clustering
• Automatic partitioning from data
• No a priori semantic interpretation
97
Clustering
• Two approaches:
• classification / clustering / grouping: discovery of these sets in extension → notion of
classes or clusters
• generation of concepts: discovery of these clusters in intension → notion of concepts
• Different methods:
• Partition-based clustering
• Model-based clustering
• Hierarchical clustering
• And many others …
• density-based clustering
• grid-based clustering
98
Partition-based clustering
• Principle: Place the different objects in clusters (groups)
• Different types
• Hard clustering: each object is in one and only one class
• Soft clustering: An object can be in several classes
• Fuzzy clustering: An object belongs partly to all classes
• Goals
• Find the organization in homogeneous classes such that two objects of the
same class are more similar than two objects from different classes
• Find the organization in homogeneous classes such that the classes are as
different as possible
99
Partition-based clustering
• Find the organization in homogeneous classes such that two objects
of the same class are more similar than two objects from different
classes
• Example: if x1 and x2 are in the same class and x3 is in a different class, then
d(x1, x2) < d(x1, x3) and d(x1, x2) < d(x2, x3)
100
Partition-based clustering
• Find the organization in homogeneous classes such that the distances
between classes are maximal
101
Partition-based clustering : notion of inertia
• Inertia of a class G_k (k = 1, …, K)
  I_k = Σ_{x_i ∈ G_k} dist²(x_i, g_k),   where g_k is the center of gravity of class k
• Intra-class inertia
  I_intra = Σ_{k=1}^{K} I_k
• Inter-class inertia
  I_inter = Σ_{k=1}^{K} |G_k| · dist²(g, g_k),   where g is the center of gravity of the whole data set
102
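The definitions above can be computed directly; a small numpy sketch with hypothetical points partitioned into K = 2 classes:

import numpy as np

# Hypothetical data, already partitioned into K = 2 classes
classes = [np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]]),
           np.array([[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])]

g = np.vstack(classes).mean(axis=0)              # center of gravity of the whole data set
centers = [G.mean(axis=0) for G in classes]      # g_k: center of gravity of each class

I_intra = sum(((G - gk) ** 2).sum() for G, gk in zip(classes, centers))
I_inter = sum(len(G) * ((g - gk) ** 2).sum() for G, gk in zip(classes, centers))
print("I_intra =", I_intra, " I_inter =", I_inter)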
Partition-based clustering : K-means
• Criterion: minimize I_intra, maximize I_inter
• Algorithm:
1. Choose the number of classes K
2. Choose K random initial data cluster gravity centers
3. Assign each data point to its closest cluster center
4. Compute the new cluster gravity centers (average of data points assigned to
the cluster)
5. Repeat 3.-4. until convergence
103
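The loop in steps 3-5 is the classical Lloyd iteration; scikit-learn's KMeans implements it (with a smarter default initialization). A usage sketch on hypothetical data (the library choice is an assumption):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data with two visible groups
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # steps 1-5 of the algorithm
print("cluster assignments:", km.labels_)
print("gravity centers:", km.cluster_centers_)
print("I_intra (within-cluster sum of squares):", km.inertia_)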
Partition-based clustering : K-means
104
Partition-based clustering : K-means
• Attention!! I_intra decreases as K increases
• it cannot be used to find the ideal number of groups, but it allows comparing two
partitions into K classes
• Properties
• At each reallocation-recentering step (3 + 4), it can be shown that
I_intra decreases → stabilization when I_intra no longer decreases.
• Experience shows that the algorithm converges quite fast (fewer than 10
iterations), even with large volumes of data and a “bad” initialization
105
Partition-based clustering : K-means
• Advantages:
• quick and easy to implement
• allows the processing of large datasets
• Disadvantages:
• obligation to fix k
• the result depends strongly on the choice of the initial gravity centers
• it does not necessarily provide the optimal result (i.e., the partition into K groups for which I_intra
is minimal)
• it provides a local optimum that depends on the initial gravity centers
106
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables
107
Associations
• Association rules: analysis of the shopping basket
• “On Fridays, customers often buy beer packs and, at the same time,
diapers”
• Are there any causal links between the purchase of a product P0 and
another product P1?
108
Associations
(Figure: a table of transactions, each identified by an Id and listing the items purchased; shown in full on slide 112)
109
Associations
• Formally:
• Given a set of transactions D, find all the association rules X → Y having
support and confidence above the minimum thresholds predefined by the
user
• A transaction is a set of attributes (for example, butter, fruit, milk, bread)
110
Associations
• Interpretation of a rule R: X → Y (A%, B%)
• A% of all transactions show that X and Y have been purchased at the same
time (support of the rule)
• B% of the clients who purchased X have also purchased Y (confidence of the
rule)
• Two sub-problems
• Find all frequent sets (frequent item sets or FIS) that have support greater
than or equal to a minimum value “minsup”
• Generate all the association rules having confidence greater or equal to
“minconf”
111
Associations

support(X → Y) = |{transactions containing X and Y}| / |{transactions}|
confidence(X → Y) = |{transactions containing X and Y}| / |{transactions containing X}|

Id | Transaction
1  | butter fruit milk bread
2  | fruit milk bread
3  | butter cheese bread pasta meat wine
4  | cheese fruit milk vegetables bread pasta fish
5  | butter fruit milk vegetables bread pasta fish meat
6  | butter cheese vegetables bread pasta meat wine
7  | butter cheese milk vegetables bread pasta meat wine
8  | fruit vegetables fish
9  | butter cheese milk bread pasta meat wine
10 | butter cheese fruit milk vegetables bread fish meat

Rule                 | Support | Confidence
butter → bread       | 70%     | 100%
fish, meat → milk    | 20%     | 100%
cheese, pasta → wine | 40%     | 80%
112
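A short sketch that recomputes the support and confidence values of the table above from the ten transactions (plain Python, no library assumed):

transactions = [
    {"butter", "fruit", "milk", "bread"},
    {"fruit", "milk", "bread"},
    {"butter", "cheese", "bread", "pasta", "meat", "wine"},
    {"cheese", "fruit", "milk", "vegetables", "bread", "pasta", "fish"},
    {"butter", "fruit", "milk", "vegetables", "bread", "pasta", "fish", "meat"},
    {"butter", "cheese", "vegetables", "bread", "pasta", "meat", "wine"},
    {"butter", "cheese", "milk", "vegetables", "bread", "pasta", "meat", "wine"},
    {"fruit", "vegetables", "fish"},
    {"butter", "cheese", "milk", "bread", "pasta", "meat", "wine"},
    {"butter", "cheese", "fruit", "milk", "vegetables", "bread", "fish", "meat"},
]

def support(X, Y):
    return sum(1 for t in transactions if X <= t and Y <= t) / len(transactions)

def confidence(X, Y):
    with_X = [t for t in transactions if X <= t]
    return sum(1 for t in with_X if Y <= t) / len(with_X)

print(support({"butter"}, {"bread"}), confidence({"butter"}, {"bread"}))                  # 0.7, 1.0
print(support({"cheese", "pasta"}, {"wine"}), confidence({"cheese", "pasta"}, {"wine"}))  # 0.4, 0.8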
Associations
• Numerous criteria for evaluating the interest of a rule
• Difficult to scale to large volumes of data
114
The data mining process
115
Validation
• Generation of a large number of models
• Is the generated model interesting?
• How to measure the interest of a model:
• New
• Easy to understand
• Valid on new data (with a measure of certainty)
• Useful
• Confirms (or invalidates) the hypotheses of an expert
116
Validation
• Evaluation of a model
• subjective: expert
• objective: statistics and structure of the model
• Can we find all models? (completeness)
• Can we only generate the interesting models? (optimization):
• Generating all the models and filtering them according to certain measures and
characteristics: not realistic
• Generating only the models that satisfy a particular condition
117
Conclusions
118
Is this statement True or False? Why?
• “Data mining methods are purely inductive rather than hypothesis-
based because there is no a priori on the data.”
119
Is this statement True or False? Why?
• “It is necessary to systematically use all the available data to
generate the best possible model”
120
Is this statement True or False? Why?
• “With all these techniques, we will always make amazing
discoveries.”
121
Is this statement True or False? Why?
• “Data mining is revolutionary”
122
Conclusions
• Question
• “Why so many algorithms?”
• Answer
• Because none is optimal in all cases
• Since they are in practice complementary to each other, intelligently
combining them (by constructing meta-models) makes it possible to obtain very
significant performance gains.
123
Implementation
Introduction to Weka
Data Mining: Practical Machine Learning Tools and Techniques (Chapters 2 and 3)
124
Weka
• What’s Weka?
• A bird found only in New Zealand
• Data mining workbench
• Waikato Environment for Knowledge Analysis
• Machine learning algorithms for data mining tasks
• 100+ algorithms for classification
• 75 for data preprocessing
• 25 to assist with feature selection
• 20 for clustering, finding association rules, etc
125
WEKA: the bird
• The Weka or woodhen (Gallirallus australis) is an endemic bird of New
Zealand. (Source: Wikipedia)
127
The interface
128
The interface
129
Terminology
• Components of the input
• Concepts: kinds of things that can be learned
130
What’s a concept?
• Styles of learning (learning schemes)
• Classification learning: predicting a discrete class
• Association learning: detecting associations between features
• Clustering: grouping similar instances into clusters
• Numeric prediction: predicting a numeric quantity
• Concept: thing to be learned
• Concept description: output of learning scheme
131
Classification learning
• Classification learning is supervised
• Scheme is provided with actual outcome
• Outcome is called the class of the example
• Measure success on fresh data for which class labels are known (test
data)
• In practice success is often measured subjectively
132
Classification learning
• By default, the last attribute is considered to be the class variable, i.e.,
the variable whose value we need to predict
• Dataset: the classified examples
• The idea is to deduce a model or classifier that will be used to classify
the new (unclassified) examples
133
Association learning
• Can be applied if no class is specified and any kind of structure is
considered “interesting”
• The idea is to find associations between attributes, when no “class” is
specified
• Difference to classification learning:
• Can predict any attribute’s value, not just the class, and more than one
attribute’s value at a time
• Hence: far more association rules than classification rules
• Thus: constraints are necessary
• Minimum coverage and minimum accuracy
134
Clustering
• Finding groups of items that are similar
• Clustering is unsupervised
• The class of an example is not known
• Success often measured subjectively
137
What’s in an attribute?
• Each instance is described by a fixed predefined set of features, its
“attributes”
• But: number of attributes may vary in practice
• Possible solution: “irrelevant value” flag
• Related problem: the existence of an attribute may depend on the value of
another one
• Possible attribute types (“levels of measurement”):
• Nominal, ordinal, interval and ratio
138
WEKA only deals with “flat” files
@relation heart-disease-simplified
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
Flat file in ARFF (Attribute-Relation File Format) format
139
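An ARFF file like this one can also be read outside of Weka; a sketch using scipy (the file name is hypothetical, and a real file would also need the @attribute declarations not shown on the slide):

from scipy.io import arff
import pandas as pd

# Hypothetical path; the file must contain @relation, @attribute and @data sections
data, meta = arff.loadarff("heart-disease-simplified.arff")
df = pd.DataFrame(data)
print(meta)        # attribute names and types
print(df.head())   # the flat table of instances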
Output (knowledge representation)
• Tables
• Linear models
• Trees
• Rules
• Instance-based representation
• Clusters
141
Tables
• Simplest way of representing output:
• Use the same format as input!
• Decision table for the weather problem:
Outlook Humidity Play
Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No
143
Linear models
145
Linear models for classification
148
Classification rules
• Popular alternative to decision trees
• Antecedent (pre-condition): a series of tests (just like the tests at the
nodes of a decision tree)
• Tests are usually logically ANDed together (but may also be general
logical expressions)
• Consequent (conclusion): classes, set of classes, or probability
distribution assigned by rule
• Individual rules are often logically ORed together
• Conflicts arise if different conclusions apply
149
More about rules …
• Are rules independent pieces of knowledge? (It seems easy to add a
rule to an existing rule base.)
• Problem: ignores how rules are executed
• Two ways of executing a rule set:
• Ordered set of rules (“decision list”)
• Order is important for interpretation
• Unordered set of rules
• Rules may overlap and lead to different conclusions for the same instance
150
More about rules …
• What if two or more rules conflict?
• Give no conclusion at all?
• Go with rule that is most popular on training data?
•…
• What if no rule applies to a test instance?
• Give no conclusion at all?
• Go with class that is most frequent in training data?
•…
151
Instance-based representation
• Simplest form of learning: plain memorization or rote learning
• Training instances are searched for the instance that most closely resembles the new
instance
• The instances themselves represent the knowledge
• Also called instance-based learning
• Similarity function defines what’s “learned”
• Instance-based learning is lazy learning
• Methods: nearest-neighbor, k-nearest-neighbor, …
152
Distance …
• Simplest case: one numeric attribute
• Distance is the difference between the two attribute values involved (or a
function thereof)
• Several numeric attributes: normally, Euclidean distance is used and
attributes are normalized
• Nominal attributes: distance is set to 1 if values are different, 0 if they
are equal
• Are all attributes equally important?
• Weighting the attributes might be necessary
153
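A sketch of such a mixed distance (normalized differences for numeric attributes, 0/1 for nominal ones; the attributes and ranges below are hypothetical):

def mixed_distance(a, b, numeric_ranges):
    """Euclidean-style distance over mixed attributes.
    numeric_ranges[i] is the attribute's range, or None for a nominal attribute."""
    total = 0.0
    for x, y, rng in zip(a, b, numeric_ranges):
        if rng is None:                          # nominal: 0 if equal, 1 otherwise
            total += 0.0 if x == y else 1.0
        else:                                    # numeric: normalized absolute difference
            total += (abs(x - y) / rng) ** 2
    return total ** 0.5

# Hypothetical instances: (age, outlook, humidity)
ranges = [60, None, 40]
print(mixed_distance((25, "sunny", 70), (40, "rainy", 90), ranges))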
Representing Clusters
154
Representing Clusters
155
Algorithms: the Basic Methods
Data Mining: Practical Machine Learning Tools and Techniques (Chapters 4)
156
Simplicity first!!
• Simple algorithms often work very well!
• There are many kinds of simple structure, e.g.:
• One attribute does all the work
• Attributes contribute equally and independently
• A decision tree that tests a few attributes
• Calculate distance from training instances
• Result depends on a linear combination of attributes
• Success of method depends on the domain
• Again, data-mining is an experimental science!!
157
Implementation in WEKA
• Classifiers in WEKA are models for predicting nominal or numeric
quantities
• Implemented learning schemes include:
• Decision trees and lists, instance-based classifiers, support vector machines,
multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
• Bagging, boosting, stacking, error-correcting output codes, locally weighted
learning, …
158
Two words about evaluation
• Confusion matrix
• TP: true positives
• FP: false positives
• TN: true negatives
• FN: false negatives

                      Predicted class
                      Yes    No
Real class     Yes    TP     FN
               No     FP     TN
159
Two words about evaluation
• Classical measures
• Precision: P = TP / (TP + FP)
• Recall: R = TP / (TP + FN)
• F-measure = 2 · P · R / (P + R)
In a classification task, a precision score of 1.0 for a class C means that every
item labeled as belonging to class C does indeed belong to class C (but says
nothing about the number of items from class C that were not labeled
correctly) whereas a recall of 1.0 means that every item from class C was
labeled as belonging to class C (but says nothing about how many other items
were incorrectly also labeled as belonging to class C).
160
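The three measures, computed from the confusion-matrix counts (plain-Python sketch, hypothetical counts):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion matrix: TP = 40, FP = 10, FN = 20, TN = 30
print(precision(40, 10))       # 0.80
print(recall(40, 20))          # ~0.67
print(f_measure(40, 10, 20))   # ~0.73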
The Basic Methods
• Inferring rudimentary rules
• Statistical modeling
• Constructing decision trees
• Constructing rules
• Association rule learning
• Linear models
• Instance-based learning
• Clustering
161