
Introduction to Machine

Learning
INSA Rouen Normandie – Département GM

Cecilia ZANNI-MERK
Office: BO BR1 04
Goals
• Understanding the motivations for machine learning and data mining
• Being able to implement the data mining process
• Identifying the differences among the different types of learning and
understanding the basic algorithms associated with each of them
• Developing skills in the design and use of those algorithms for specific
tasks

2
Analytical Program
• Introduction and Examples
• Algorithms and Approaches
• Implementation

3
Introduction
Thank you to my colleague Professor Christoph Reich (Hochschule Furtwangen) for
his slides

4
“Big Data”
• Trendy word to describe *a lot* of data!
• Buzz:

5
“Big Data”
• We live in the information age
• data accumulation in all areas
• internet
• biology: human genome, DNA sequencing
• physics: Large Hadron Collider, 10^20 bytes/day per sensor
• recording devices
• sensors, mobile phones, interactions on the Internet, ...
• IT challenges
• storage, recovery, distributed calculation...
• 3 V's: volume, velocity, variety
• We need to give meaning to data → machine learning
6
Giving meaning to (Big) Data
• Some quantitative indicators
• Twitter: 50 M tweets/day (≈ 7 terabytes)
• Facebook: 10 terabytes/day
• Youtube: 50 hours of video uploaded per minute
• Mail: 2.9 million emails/second
• The quantity of data is too large to be processed manually or by classical
algorithms:
• The number of entries is millions to billions
• Multi-dimensional data
• Heterogeneous sources of data

7
Paradigm Shift
Store Everything Now Because
it May be Useful Later

8
Source: Microsoft’s Chicago data center (Image courtesy cnet.com)
Giving meaning to (Big) Data?
• The user is full of data, but does not know how to understand it:
• “The greatest problem of today is how to teach people to ignore the
irrelevant, how to refuse to know things, before they are suffocated. For too
many facts are as bad as none at all.” (W.H. Auden)
• What do we need?
• To extract interesting and useful knowledge from the data: rules, regularities,
irregularities, patterns, constraints
• To make predictions, detect faults, solve problems...

9
Giving meaning to (Big) Data
• Science behind it all:
• machine learning / computational statistics
• Other terms in practice:
• data mining, business/data analytics, pattern recognition, knowledge
discovery in databases (KDD), knowledge extraction, data/pattern analysis

10
What is machine learning?
• How to build computer systems that improve with experience, and
what are the fundamental laws that govern all machine learning
processes (Tom Mitchell)
• Mixing computer science and statistics
• CS: “How do you build machines that solve problems, and what problems are
intrinsically feasible / unfeasible?”
• Statistics: “What can be deduced from data and a set of modeling
assumptions?”
• how can a computer learn from data?

11
What is data mining ?
• Extraction of implicit, non-trivial, previously unknown and potentially
useful information from different data sources:
• Non-trivial: otherwise the knowledge is not useful
• Implicit: hidden knowledge is difficult to observe
• Previously unknown: obvious!
• Potentially useful: usable, understandable
• Whole process of discovery and interpretation of regularity in data

12
Machine Learning or Data Mining?
• Data Mining is a cross-disciplinary field that focuses on discovering
properties of data sets.
• Machine Learning is a sub-field of data science that focuses on
designing algorithms that can learn from and make predictions on the
data.
• It is clear then that machine learning can be used for data mining.
However, data mining can use other techniques besides or on top of
machine learning.
• There are different approaches to discovering properties of data sets.
Machine Learning is one of them. Another one is simply looking at the
data sets using visualization techniques.

13
Some warnings, however …
• Hype: With enough data, we can solve “everything” with “no
assumptions”!
• Theory: No Free Lunch Theorem!
• If we do not make assumptions about the data, all learning methods perform,
“on average”, as badly on unseen data as a random prediction!
• Consequence: need some assumptions
• for example, that time series vary ‘smoothly’

14
No Free Lunch Theorem
• Hume (1739–1740) pointed out that ‘even after the observation of
the frequent or constant conjunction of objects, we have no reason to
draw any inference concerning any object beyond those of which we
have had experience’.
• More recently, and with increasing rigour, Mitchell (1980), Schaffer
(1994) and Wolpert (1996) showed that bias-free learning is futile.
• The mathematical demonstration is here
• http://www.no-free-lunch.org/coev.pdf

15
No Free Lunch Theorem
• 2-class problem with 100 binary attributes
• Say you know a million instances, and their classes (training set)
• You don’t know the classes of 2^100 − 10^6 examples! (that’s 99.9999…% of the
data set)
• How could you possibly figure them out?
• In order to generalize, every learner must embody some knowledge
or assumptions beyond the data it’s given
• A learning algorithm implicitly provides a set of assumptions
• There can be no “universal” best algorithm (no free lunch)
• Data mining is an experimental science
16
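The fraction in the example above is easy to check (a quick sketch with the slide's numbers):

```python
# How much of the instance space is still unlabeled in the example above?
total = 2 ** 100   # all possible instances over 100 binary attributes
known = 10 ** 6    # one million labeled training instances

unknown_fraction = (total - known) / total
print(f"fraction known: {known / total:.1e}")  # 7.9e-25
```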
Mining random patterns
• We can ‘discover’ meaningless random patterns if we look through too
many possibilities : “Bonferroni’s principle”
• NSA example: say we consider it suspicious when a pair of (unrelated)
people stayed at least twice in the same hotel on the same day
• suppose 10^9 people tracked during 1000 days
• each person stays in a hotel 1% of the time (1 day out of 100)
• each hotel holds 100 people (so we need 10^5 hotels)
• if everyone behaves randomly (i.e. no terrorists), can we still detect something
suspicious?
• Probability that a specific pair of people visit the same hotel on the same day is
10^−9; the probability this happens twice is thus 10^−18 (tiny),
• ... but there are many possible pairs
• Expected number of “suspicious” pairs is actually about 250,000!
example taken from Rajaraman et al., Mining of Massive Datasets
17
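The expected count can be verified with a back-of-the-envelope computation following the slide's assumptions:

```python
# Back-of-the-envelope check of the expected number of "suspicious" pairs
n_people = 10 ** 9
n_days = 1000
p_same_hotel_same_day = 1e-9   # both in a hotel (1e-2 * 1e-2) and the same one (1e-5)

pairs_of_people = n_people * (n_people - 1) / 2   # ~5e17
pairs_of_days = n_days * (n_days - 1) / 2         # ~5e5 pairs of distinct days
expected = pairs_of_people * pairs_of_days * p_same_hotel_same_day ** 2
print(f"{expected:,.0f}")   # about 250,000
```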
Mining random patterns
• Suppose there are (say) 10 pairs of terrorists who definitely stayed at
the same hotel twice.
• Analysts have to go through 250,010 candidates to find the 10 real
cases !!!!
• Moral: when looking for a property (e.g., “two people stayed at the
same hotel twice”), make sure that the property does not allow so
many possibilities that random data will surely produce facts “of
interest”.

18
Some success stories using Machine Learning
• spam classification (Google)
• machine translation (impressive sometimes, see
https://www.deepl.com/translator)
• speech recognition (used in your smart phone)
• self-driving cars (again Google)

19
Some examples
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

20
Bibliography
• This textbook discusses data mining and Weka in depth:
• Data Mining: Practical machine learning tools and techniques,
by Ian H. Witten, Eibe Frank
and Mark A. Hall.
Morgan Kaufmann, 2011

21
Processing loan applications (American Express)

• Given: questionnaire with financial and


personal information
• Question: should money be lent?
• Simple statistical method covers 90% of cases
• Borderline cases referred to loan officers
• But: 50% of accepted borderline cases defaulted!
• Solution: reject all borderline cases?
• No! Borderline cases are most active customers

22
With machine learning
● 1000 training examples of borderline cases
● 20 attributes:
● age
● years with current employer
● years at current address
● years with the bank
● other credit cards possessed,…
● Learned rules: correct on 70% of cases
● human experts only 50%
● Rules could be used to explain decisions to customers

23
Screening images
● Given: radar satellite images of coastal waters
● Problem: detect oil slicks in those images
● Oil slicks appear as dark regions with changing size and shape
● Not easy: lookalike dark regions can be caused by weather conditions
(e.g. high wind)
● Expensive process requiring highly trained personnel

24
With machine learning
● Extract dark regions from normalized image
● Attributes:
● size of region
● shape, area
● intensity
● sharpness and jaggedness of boundaries
● proximity of other regions
● Info about background
● Constraints:
● Few training examples—oil slicks are rare!
● Unbalanced data: most dark regions aren’t slicks
● Regions from same image form a batch
● Requirement: adjustable false-alarm rate

25
Load forecasting
● Electricity supply companies
need forecast of future demand
for power
● Forecasts of min/max load for each hour
⇒ significant savings
● Given: manually constructed load model that assumes “normal”
climatic conditions
● Problem: adjust for weather conditions
● The static model consists of:
● base load for the year
● load periodicity over the year
● effect of holidays

26
With machine learning
● Prediction corrected using “most similar” days
● Attributes:
● temperature
● humidity
● wind speed
● cloud cover readings
● plus difference between actual load and predicted load
● Average difference among three “most similar” days added to static
model
● Linear regression coefficients form attribute weights in similarity
function
27
Diagnosis of machine faults
● Diagnosis: classical domain
of expert systems
● Given: Fourier analysis of vibrations measured at various points of a
device’s mounting
● Question: which fault is present?
● Preventive maintenance of electromechanical motors and generators
● Information very noisy
● So far: diagnosis by expert/hand-crafted rules

28
With machine learning
● Available: 600 faults with expert’s diagnosis
● ~300 unsatisfactory, rest used for training
● Attributes augmented by intermediate concepts that embodied
causal domain knowledge
● Expert not satisfied with initial rules because they did not relate to his
domain knowledge
● Further background knowledge resulted in more complex rules that
were satisfactory
● Learned rules outperformed hand-crafted ones

29
Marketing and sales I
● Companies precisely record massive amounts of marketing and sales
data
● Applications:
● Customer loyalty: identifying customers that are likely to defect by detecting
changes in their behavior
(e.g. banks/phone companies)
● Special offers: identifying profitable customers
(e.g. reliable owners of credit cards that need extra money during the holiday
season)

30
Marketing and sales II
● Market basket analysis
● Association techniques find groups of items
that tend to occur together in a transaction
(used to analyze checkout data)
● Historical analysis of purchasing patterns
● Identifying prospective customers
● Focusing promotional mail outs
(targeted campaigns are cheaper than mass-marketed ones)

31
Data mining and ethics
• Information privacy laws (in Europe, but not US)
• A purpose must be stated for any personal information collected
• Such information must not be disclosed to others without consent
• Records kept on individuals must be accurate and up to date
• To ensure accuracy, individuals should be able to review data about
themselves
• Data must be deleted when it is no longer needed for the stated purpose
• Personal information must not be transmitted to locations where equivalent
data protection cannot be assured
• Some data is too sensitive to be collected, except in extreme circumstances
(e.g., sexual orientation, religion)

32
Data mining and ethics
• Anonymization is harder than you think
• When Massachusetts released medical records summarizing every state
employee’s hospital record in the mid‐1990s, the governor gave a public
assurance that it had been anonymized by removing all identifying
information such as name, address, and social security number. He was
surprised to receive his own health records (which included diagnoses and
prescriptions) in the mail.

33
Data mining and ethics
• Re-identification techniques. Using publicly available records:
• 50% of Americans can be identified from city, birth date, and sex
• 85% can be identified if you include the 5‐digit zip code as well
• Netflix movie database: 100 million records of movie ratings (1–5)
• Can identify 99% of people in the database if you know their ratings for 6
movies and approximately when they saw the movies (± one week)
• Can identify 70% if you know their ratings for 2 movies and roughly when
they saw them

34
Data mining and ethics
• The purpose of data mining is to discriminate …
• who gets the loan
• who gets the special offer
• Certain kinds of discrimination are unethical, and illegal
• racial, sexual, religious, …
• But it depends on the context
• sexual discrimination is usually illegal
• … except for doctors, who are expected to take gender into account
• … and information that appears innocuous may not be
• ZIP code correlates with race
• membership of certain organizations correlates with gender

35
Correlation vs Causation
• Correlation does not imply causation!!
• http://www.tylervigen.com/spurious-correlations

36
Data mining and ethics

• Data mining reveals correlation, not causation
• … but really, we want to predict the effects of our actions

37
Algorithms and Approaches
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)

38
Group exercise
• Form groups
• You will get a list of steps in the data mining process
• Please order the steps in chronological order!
• You have 5 min for the task.

39
Group exercise
• Clean the data
• Identify the problem
• Find data
• Integration of new knowledge
• Validation and interpretation of the result, with possible return to the
results in the previous steps
• Coding of data, do actions on variables
• Search for a model, for knowledge, etc.

40
The data mining process
1. Identify the problem
2. Find data
3. Clean the data
4. Coding of data, do actions on variables
5. Search for a model, for knowledge, etc.
6. Validation and interpretation of the result, with possible return to
the results in the previous steps
7. Integration of new knowledge

41
The data mining process

42
Data sets
• Data for a data mining problem
• Information is examples with attributes
• A set of N data is generally available.
• Attributes
• An attribute is a descriptor of an entity
• It is also called variable or characteristic
• Examples
• These are entities that characterize an object
• They consist of attributes
• Synonyms: point, vector (often in ℝⁿ)

Taken from Gilles Gasso’s course at ASI4 43


Types of data
• Sensors: quantitative, qualitative and ordinal variables
• Text: String of characters
• Speech: Time series
• Images: 2D data
• Videos: 2D data + time
• Networks: Graphs
• Tags or labels or annotations: evaluation information

Taken from Gilles Gasso’s course at ASI4 44


The data mining process

45
Preparation of the data
• Existing or needed data:
• Files: information contained in one or more independent files
• Relational Database: information contained in several files by a common ‘key’
• Transactional database
• Data cleaning:
• duplicates, input errors, outliers, missing information (ignore observations,
average value, mean value on class, regression, etc.)

46
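As a sketch of one of the cleaning strategies listed above (replacing a missing value by the average of the observed ones; the column of values is hypothetical):

```python
# Mean imputation: fill holes in a column with the mean of the observed values
import math

values = [1.0, 2.0, float("nan"), 4.0]            # hypothetical column with a hole
present = [v for v in values if not math.isnan(v)]
mean = sum(present) / len(present)                # mean of the observed values
cleaned = [mean if math.isnan(v) else v for v in values]
print(cleaned)
```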
Preparation of the data
• Data warehouses: collection of data collected from multiple
heterogeneous sources
• Data is recorded, cleaned, transformed and integrated
• Usually modeled by a multidimensional data structure (cube):
• Data is structured along several lines of analysis (dimensions of the
cube) such as time, location, etc.
• A cell is the intersection of different dimensions
• The calculation of each cell is carried out at loading time
• The response time is thus stable, whatever the query

47
Preparation of the data
• Selection of the data
• Data sampling
• Selection of sources
• Dimensionality reduction
• Selection or transformation of attributes
• Weighting
• Feature extraction
• Coding
• Aggregation (sum, average), discretization, coding of discrete attributes, scale
standardization

48
Data visualization
• Obtain a visual representation of the data
• Not always possible depending on the data type
• Not always possible depending on the amount of data

49
A word about data quality
• A huge problem in practice
• any manually entered data is suspect
• most data sets are in practice deeply problematic
• Even automatically gathered data can be a problem
• systematic problems with sensors
• errors causing data loss
• incorrect metadata about the sensor
• Never, never, never trust the data without checking it!

50
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)

51
The data mining process

52
Data mining goals and principles
• Goal: Learn something new!
• Concepts: regrouping of data based on shared characteristics
• Associations: correlations between attributes or data
• Principles:
• Getting the highest level of abstraction possible
• Rules or truths that are the basis for other truths
• Three types
• Supervised learning
• Unsupervised learning
• Semi-supervised learning

53
Supervised learning
• Inductive model where the learner considers a set of grouped
examples representative of the learning task (class membership,
etc.)
• The examples are labeled beforehand
• Predictive data mining
• Divide / group instances into special classes for future predictions
• Predicting unknown or missing values

54
Supervised learning
• Induction:
• Generalization of an observation or reasoning established from singular cases.
• It consists in drawing conclusions from a series of facts
• Example:
• Induction: water, oil and milk freeze when they are cooled, we will infer that
all liquids must freeze, provided the cold is rather intense
• Deduction: all liquids are likely to freeze; so, if mercury is a liquid, it can
freeze

55
Supervised learning
• From data (xᵢ, yᵢ) ∈ 𝒳 × 𝒴, i = 1, …, n, estimate the dependencies
between 𝒳 and 𝒴
• Elements in 𝒴 are called labels, tags, annotations
• It is called supervised learning because the yᵢ are used to guide the estimation
process.
• Examples
• Estimate the links between eating habits and risk of heart attack.
xᵢ: attributes of a patient's diet, yᵢ the patient's category (at risk, not at risk).
• Applications
• fraud detection, medical diagnosis...
• Algorithms
• Decision trees, classifications, genetic algorithms, (linear and non-linear) regression
Taken from Gilles Gasso’s course at ASI4 56
Unsupervised learning
• Construction of a model and discovery of relations is given without
reference to other data
• There is no prior information on the data
• Explanatory data mining
• Grouping instances into special classes based on their resemblance or their
sharing of properties.
• The classes are unknown and are therefore created, they are used to explain
or summarize the data

57
Unsupervised learning
• Only the data {xᵢ ∈ 𝒳, i = 1, …, n} are available. The aim is to describe
how the data are organized and to extract homogeneous subsets of data.
• Examples
• Categorization of supermarket customers. 𝑥𝑖 represents an individual
(address, age, shopping habits...)
• Applications
• identification of market segments, categorization of similar documents,
segmentation of biomedical images....
• Algorithms
• Segmentation, clustering, discovery of associations and rules

Taken from Gilles Gasso’s course at ASI4 58


Semi-supervised learning
• Among the data, only a small number have a label:
(x₁, y₁), …, (xₖ, yₖ), xₖ₊₁, …, xₙ
• The objective is the same as for supervised learning, but we would like
to take advantage of the unlabeled data as well.
• Examples
• For discrimination of web pages, the number of 'examples' can be very large,
but associating a label (or tag) with them is expensive.
• Algorithms
• Bayesian methods, SVM....

Taken from Gilles Gasso’s course at ASI4 59


Trying to summarize …

Source: http://www.favouriteblog.com/essential-cheat-sheets-for-machine-learning-python-and-maths/
60
Types of algorithms and their use …

Source: https://www.analyticsvidhya.com/blog/2017/02/top-28-cheat-sheets-for-machine-learning-data-science-probability-sql-big-data/ 61
Types of Algorithms and their Usage

62
Source: http://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)

63
Data mining approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering :
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables

64
Estimation
• Goal : to create a model that best describes a prediction or forecast
variable linked to real data
• How: Analyze the relationship of one variable vs. one or more other
variables

65
Estimation : Regression
• Least-squares method

66
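As a minimal sketch of the least-squares method on hypothetical data (fitting a line y = ax + b by minimizing the sum of squared residuals):

```python
# Least-squares fit of a line y = a*x + b to a small synthetic data set
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # noiseless data from y = 2x + 1
A = np.vstack([x, np.ones_like(x)]).T    # design matrix [x, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)   # recovers slope ~2.0 and intercept ~1.0
```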
Estimation : Regression
• Underfitting
• Using an algorithm that cannot capture the full complexity of the data
Estimation : Regression
• Overfitting
• Tuning the algorithm so carefully it starts matching the noise in the training
data

68
Estimation : Neural networks
• Neural network

69
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables

70
Classification
• Division of the data set into disjoint classes
• Goal:
• search for a set of predicates characterizing a class of objects and which can
be applied to unknown objects in order to identify their class of belonging.
• Main techniques:
• Decision trees
• Bayesian classifier
• k-nearest neighbors
• Neural networks
• SVM
• Genetic algorithms

71
Classification: Decision Trees
• Classify objects into subclasses by hierarchical divisions
• Automatic construction from a sample
• There are several techniques to build the tree

72
Classification: Decision Trees
• An example
• A gift is sent to potential customers, who can then place an order.
• If the customer does not place an order, the company pays 50 euro; otherwise
it earns 100 euro.
• Forgetting to send a gift to a potentially responsive customer “costs” 100 euro
• Given a table of responses for a population sample (size 100), decide which
group of people should be targeted in the future

73
Classification: Decision Trees
• An example

Mail only to
executives or only
female executives
74
Classification: K-nearest neighbors
• All distances between the point X to be classified and all labeled
points are computed
• We keep the K closest labeled points. The majority class in this set is
assigned to X.
Example of k-NN classification. The test sample (green
circle) should be classified either to the first class of blue
squares or to the second class of red triangles.
• If k = 3 (solid line circle) it is assigned to the second class
because there are 2 triangles and only 1 square inside the
inner circle.
• If k = 5 (dashed line circle) it is assigned to the first class (3
squares vs. 2 triangles inside the outer circle).
Source Wikipedia 75
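The procedure can be sketched in a few lines (the 2-D points and labels are hypothetical, mirroring the squares/triangles figure):

```python
# Minimal k-NN classifier: compute all distances, keep the K closest,
# assign the majority class among them
from collections import Counter

def knn_predict(query, points, labels, k):
    # all squared distances from the query to every labeled point
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, query)), lab)
        for p, lab in zip(points, labels)
    )
    # majority class among the K closest labeled points
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = ["square", "square", "square", "triangle", "triangle", "triangle", "triangle"]
print(knn_predict((5.5, 5.5), points, labels, k=3))  # triangle
```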
Classification: K-nearest neighbors
• No hypothesis on the distribution and the “shape” of the classes
• Complexity increasing with the size of the training base
• For high-dimensional data (e.g., with more than 10 dimensions), dimension
reduction is usually performed prior to applying the k-NN algorithm
• In fact, the Euclidean distance is unhelpful in high dimensions because all
vectors can be almost equidistant to the search query vector

76
Classification: Neural Networks
• Inspired by the structure of the nervous system
• A large number of connected neurons that process information
• The response of the neuron depends on its state and the weights of
the connections
• The weights (or forces) are developed by experiment

77
Classification: Neural Networks
• Principle
• Construction of a network of simple computational units (neurons) linked by
connections
• Learning network parameters (weight of connections) using a set of examples
• Components of a neuron:
• Inputs (incoming connections or input variables)
• weights on incoming connections
• a function F which computes an output as a function of the inputs and the
weights on the incoming connections
• an activation function φ which modifies the amplitude of the output of the
node
78
Classification: Neural Networks
• Activation function
• φ(s) can be linear

• φ(s) can be of type threshold (aka binary step):
• φ(s) = 0 if s ≤ a
• φ(s) = 1 if s > a

• φ(s) = 1 / (1 + e^(−ks)), of type logistic or sigmoid (aka soft step)

79
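The activation functions above can be written out directly (a sketch; k is the logistic steepness and a the threshold):

```python
# The three activation functions from the slide, written out explicitly
import math

def linear(s):
    return s

def threshold(s, a=0.0):        # binary step: 0 if s <= a, 1 if s > a
    return 0 if s <= a else 1

def sigmoid(s, k=1.0):          # logistic / soft step: 1 / (1 + e^(-ks))
    return 1.0 / (1.0 + math.exp(-k * s))

print(linear(0.5), threshold(0.5), sigmoid(0.0))  # 0.5 1 0.5
```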
Classification: Neural Networks
• The simplest neural network
• Single neuron or perceptron
• F = weighted sum of inputs
• Activation function = thresholding

• Therefore, output is φ(w1x1 + w2x2 + ...)

• = 1 if w1x1 + w2x2 + ... > t
• = 0 if w1x1 + w2x2 + ... ≤ t
• Output is 1 if w1x1 + w2x2 + ... − t > 0 → Equation of a hyperplane!

80
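As a sketch, a single perceptron with hand-picked weights (hypothetical values realizing a logical AND) illustrates the thresholded weighted sum and the separating hyperplane w1·x1 + w2·x2 = t:

```python
# A single perceptron: thresholded weighted sum of inputs
# (weights w and threshold t are hand-picked to realize a logical AND)
def step(s, t):
    return 1 if s > t else 0

def perceptron(x, w, t):
    s = sum(wi * xi for wi, xi in zip(w, x))  # F: weighted sum of inputs
    return step(s, t)                         # activation: thresholding

w, t = [1.0, 1.0], 1.5                        # separating line: x1 + x2 = 1.5
outputs = [perceptron(x, w, t) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```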
Classification: Neural Networks
• The simplest neural network
• Linear separation

81
Classification: Neural Networks
• The simplest neural network
• Additional examples
• We still need to find a neural network discriminating between the two
classes

82
Classification: Neural Networks
• The simplest neural network
• Learning of the new weights would lead to something like this

83
Classification: Neural Networks
• In practice:
• Choose a calculation function and an activation function
• Choose an architecture:
• Number of inputs
• Number of outputs
• Number of internal layers
• Number of neurons of each of the internal layers
• Select a cost function
• Decide when to stop training

84
Classification: Neural Networks
• Advantages of neural networks:
• Robust to noise
• Classification or estimation is quick after training is completed
• Available in all data mining software
• Disadvantages:
• Black box: difficult to interpret the obtained model
• Significant learning time
• Selection of parameters is difficult

85
Classification: Neural Networks
• Attention: Neural Nets can model anything …
• Express some output as a nonlinear function (sets of equations) of
some inputs

86
Source: http://www.asimovinstitute.org/neural-network-zoo/

87
Classification: NN & Deep Learning
Previously: Machine Learning (about 1-2 hidden layers)
Often: Principal Component
Analysis for Dimensionality
Reduction

Nowadays: Deep Learning (up to 100 and more hidden layers)

Increasingly powerful computational capabilities


• graphical processing unit (GPU)
• massively parallel processing (MPP) 88
Classification: Training and Testing

89
Classification: Training and Testing

• Construction of a model on the training set, and testing of the model on the
test set, for which the results are known
90
Classification: Evaluation
• Cross-validation
• Split the data into test / train sets; in general, more data for learning
• The number of cross-validation folds depends on the volume of available
data

91
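A minimal sketch of how k-fold splits can be generated (indices only; the helper name is hypothetical): each fold is held out once as the test set while the remaining folds form the training set.

```python
# Generate k-fold train/test index splits for n data points
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]                               # held-out fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(n=10, k=5):
    print(len(train), len(test))  # 8 2 on every round
```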
Algorithms and Approaches (cont)
Slides adapted from the course of Maja Temerinac-Ott
(Master ILC, University of Strasbourg)
Thank you also to Nicolas Lachiche (University of Strasbourg)

92
The data mining process

93
Reminder: Data mining goals and principles
• Goal: Learn something new!
• Principles:
• Getting the highest level of abstraction possible
• Rules or truths that are the basis for other truths
• Three types
• Supervised learning
• Unsupervised learning
• Semi-supervised learning

94
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables

95
Clustering
• Aim of clustering: obtain a simplified representation (structuring)
of the initial data
• Organization of a set of objects into a set of homogeneous and / or
natural groupings

96
Clustering
• Automatic partitioning from data
• No a priori semantic interpretation

97
Clustering
• Two approaches:
• classification / clustering / grouping: discovery in extension of these sets → notion of
classes or clusters
• generation of concepts: discovery in intension of these clusters → notion of concepts
• Different methods:
• Partition-based clustering
• Model-based clustering
• Hierarchical clustering
• And many others …
• density-based clustering
• grid-based clustering

98
Partition-based clustering
• Principle: Place the different objects in clusters (groups)
• Different types
• Hard clustering: each object is in one and only one class
• Soft clustering: An object can be in several classes
• Fuzzy clustering: An object belongs partly to all classes
• Goals
• Find the organization in homogeneous classes such that two objects of the
same class are more similar than two objects from different classes
• Find the organization in homogeneous classes such that the classes are as
different as possible

99
Partition-based clustering
• Find the organization in homogeneous classes such that two objects
of the same class are more similar than two objects from different
classes
• Example: if x1 and x2 are in the same class and x3 is in a different class, then
d(x1, x2) < d(x1, x3) and d(x1, x2) < d(x2, x3)

100
Partition-based clustering
• Find the organization in homogeneous classes such that the distances
between classes are maximal

101
Partition-based clustering : notion of inertia
• Inertia of a class G_k (k = 1 → K):
I_k = Σ_{x_i ∈ G_k} dist²(x_i, g_k), where g_k is the center of gravity of class k
• Intra-class inertia:
I_intra = Σ_{k=1}^{K} I_k
• Inter-class inertia:
I_inter = Σ_{k=1}^{K} |G_k| · dist²(g, g_k), where g is the center of gravity of the whole data set
102
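The three quantities can be computed directly on a tiny hypothetical data set (note that the total inertia decomposes as I_intra + I_inter, by the Huygens theorem):

```python
# Class, intra-class, and inter-class inertias on two toy classes
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
g = X.mean(axis=0)                       # center of gravity of the whole data set

I_intra, I_inter = 0.0, 0.0
for k in range(2):
    Gk = X[labels == k]
    gk = Gk.mean(axis=0)                 # center of gravity of class k
    I_intra += ((Gk - gk) ** 2).sum()    # I_k = sum of dist^2(x_i, g_k)
    I_inter += len(Gk) * ((g - gk) ** 2).sum()

total = ((X - g) ** 2).sum()             # total inertia
print(I_intra, I_inter, total)           # total == I_intra + I_inter
```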
Partition-based clustering : K-means
• Criteria: minimize I_intra, maximize I_inter

• Algorithm:
1. Choose the number of classes K
2. Choose K random initial data cluster gravity centers
3. Assign each data point to its closest cluster center
4. Compute the new cluster gravity centers (average of data points assigned to
the cluster)
5. Repeat 3.-4. until convergence

103
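The loop above can be sketched as follows (illustrative, not production code; a cluster left empty during a pass would need special handling):

```python
# Minimal K-means following steps 1-5 of the algorithm
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick K random data points as initial gravity centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # step 3: assign each point to its closest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):  # step 5: stop at convergence
            break
        centers = new_centers
    return centers, labels

# two well-separated blobs: K-means should recover them
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, labels = kmeans(X, K=2)
print(labels)
```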
Partition-based clustering : K-means

104
Partition-based clustering : K-means
• Attention!! I_intra decreases as K increases
• it cannot be used to find the ideal number of groups, but it allows comparing two
partitions into K classes
• Properties
• At each reallocation-recentering step (3 + 4), it can be shown that
I_intra decreases → stabilization when I_intra no longer decreases.
• Experience shows that the algorithm converges quite fast (fewer than 10
iterations), even with large volumes of data and a “bad” initialization

105
Partition-based clustering : K-means
• Advantages:
• quick and easy to implement
• allows the processing of large datasets
• Disadvantages:
• obligation to fix k
• the result depends strongly on the choice of the initial gravity centers.
• does not necessarily provide the optimal result (i. e. partition in K groups for which 𝐼𝑖𝑛𝑡𝑟𝑎
is minimum)
• provides a local optimum that depends on the initial gravity centers.

106
Reminder: Four different approaches
1. Estimation:
• create a model that best describes a prediction or forecast variable linked to
real data
2. Classification:
• create a function that classifies an element among several pre-existing classes
3. Clustering:
• identify a finite set of categories or groups with the goal of describing the
data
4. Dependency modelling:
• find a model that describes significant dependencies among variables

107
Associations
• Association rules: analysis of the shopping basket
• “On Fridays, customers often buy beer packs and, at the same time,
diapers”
• Are there any causal links between the purchase of a product P0 and
another product P1?

108
Associations
• Association rule
  • Premise → Conclusion
• Questions:
  • butter → bread ?
  • fish, meat → milk ?
  • cheese, pasta → wine ?

Id  Transaction
1   butter fruit milk bread
2   fruit milk bread
3   butter cheese bread pasta meat wine
4   cheese fruit milk vegetables bread pasta fish
5   butter fruit milk vegetables bread pasta fish meat
6   butter cheese vegetables bread pasta meat wine
7   butter cheese milk vegetables bread pasta meat wine
8   fruit vegetables fish
9   butter cheese milk bread pasta meat wine
10  butter cheese fruit milk vegetables bread fish meat

109
Associations
• Formally:
• Given a set of transactions D, find all the association rules X → Y having
support and confidence above the minimum thresholds predefined by the
user
• A transaction is a set of attributes (for example, butter, fruit, milk, bread)

• Support: % of transactions in D that contain X and Y


• Confidence: % of transactions in D that contain Y among those
containing X

110
Associations
• Interpretation of rule R : X → Y (A%, B%)
• A% of all transactions show that X and Y have been purchased at the same
time (support of the rule)
• B% of the clients who purchased X have also purchased Y (confidence of the
rule)
• Two sub-problems
• Find all frequent sets (frequent item sets or FIS) that have support greater
than or equal to a minimum value “minsup”
• Generate all the association rules having confidence greater or equal to
“minconf”

111
Associations

support(X → Y) = |transactions containing X and Y| / |transactions|

confidence(X → Y) = |transactions containing X and Y| / |transactions containing X|

Id  Transaction
1   butter fruit milk bread
2   fruit milk bread
3   butter cheese bread pasta meat wine
4   cheese fruit milk vegetables bread pasta fish
5   butter fruit milk vegetables bread pasta fish meat
6   butter cheese vegetables bread pasta meat wine
7   butter cheese milk vegetables bread pasta meat wine
8   fruit vegetables fish
9   butter cheese milk bread pasta meat wine
10  butter cheese fruit milk vegetables bread fish meat

                        Support   Confidence
butter → bread          70%       100%
fish, meat → milk       20%       100%
cheese, pasta → wine    40%       80%

112
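These figures can be reproduced directly from the table (a small Python sketch; the transactions are copied from the slide above):

```python
# Transactions from the table above (ids 1-10)
D = [
    {"butter", "fruit", "milk", "bread"},
    {"fruit", "milk", "bread"},
    {"butter", "cheese", "bread", "pasta", "meat", "wine"},
    {"cheese", "fruit", "milk", "vegetables", "bread", "pasta", "fish"},
    {"butter", "fruit", "milk", "vegetables", "bread", "pasta", "fish", "meat"},
    {"butter", "cheese", "vegetables", "bread", "pasta", "meat", "wine"},
    {"butter", "cheese", "milk", "vegetables", "bread", "pasta", "meat", "wine"},
    {"fruit", "vegetables", "fish"},
    {"butter", "cheese", "milk", "bread", "pasta", "meat", "wine"},
    {"butter", "cheese", "fruit", "milk", "vegetables", "bread", "fish", "meat"},
]

def support(X, Y):
    """Fraction of transactions in D containing both X and Y."""
    return sum(1 for t in D if X <= t and Y <= t) / len(D)

def confidence(X, Y):
    """Fraction of transactions containing Y among those containing X."""
    with_X = [t for t in D if X <= t]
    return sum(1 for t in with_X if Y <= t) / len(with_X)

print(support({"butter"}, {"bread"}), confidence({"butter"}, {"bread"}))  # 0.7 1.0
print(support({"fish", "meat"}, {"milk"}),
      confidence({"fish", "meat"}, {"milk"}))                             # 0.2 1.0
print(support({"cheese", "pasta"}, {"wine"}),
      confidence({"cheese", "pasta"}, {"wine"}))                          # 0.4 0.8
```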
Associations
• Numerous criteria for evaluating the interest of a rule
• Difficult to scale to large volumes of data

114
The data mining process

115
Validation
• Generation of a large number of models
• Is the generated model interesting?
• How to measure the interest of a model:
• New
• Easy to understand
• Valid on new data (with a measure of certainty)
• Useful
• Confirms (or invalidates) the hypotheses of an expert

116
Validation
• Evaluation of a model
• subjective: expert
• objective: statistics and structure of the model
• Can we find all models? (completeness)
• Can we only generate the interesting models? (optimization):
• Generating all the models and filtering according to certain measures and
characteristics: non-realistic
• Generating only the models that satisfy a particular condition

117
Conclusions

118
Is this statement True or False? Why?
• “Data mining methods are purely inductive rather than hypothesis-
based because there is no a priori on the data.”

• False: condition of application of the methods, choice of data, coding


of data, choice of explanatory variables, choice of variables to be
explained, order of entry of variables in the algorithm, …

119
Is this statement True or False? Why?
• “It is necessary to systematically use all the available data to
generate the best possible model”

• False: data coding, order of entry of variables into the algorithm,


irregular number of data, outliers, influence of redundancies,
correlations, computer data model, saturation, instability, ...

120
Is this statement True or False? Why?
• “With all these techniques, we will always make amazing
discoveries.”

• False: common sense solutions must be found (specialists, business


experts). In fact, we need to find the best solution (amongst others)
for a given problem.

121
Is this statement True or False? Why?
• “Data mining is revolutionary”

• False: traditional data analysis + more specific methods (neural


networks). Optimization of existing (old?) techniques made possible by the
large amounts of data now available.

122
Conclusions
• Question
• “Why so many algorithms?”

• Answer
• Because none is optimal in all cases
• As they are in practice complementary to each other, by intelligently
combining them (by constructing meta models) it is possible to obtain very
significant performance gains.

123
Implementation
Introduction to Weka
Data Mining: Practical Machine Learning Tools and Techniques (Chapters 2 and 3)

124
Weka
• What’s Weka?
• A bird found only in New Zealand
• Data mining workbench
• Waikato Environment for Knowledge Analysis
• Machine learning algorithms for data mining tasks
• 100+ algorithms for classification
• 75 for data preprocessing
• 25 to assist with feature selection
• 20 for clustering, finding association rules, etc

125
WEKA: the bird
• The Weka or woodhen (Gallirallus australis) is a bird endemic to New
Zealand. (Source: Wikipedia)

Copyright: Martin Kramer ([email protected])

126


WEKA: the software
• Machine learning/data mining software written in Java (distributed
under the GNU Public License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
• Comprehensive set of data pre-processing tools, learning algorithms and
evaluation methods
• Graphical user interfaces (incl. data visualization)
• Environment for comparing learning algorithms

127
The interface

128
The interface

129
Terminology
• Components of the input
• Concepts: kinds of things that can be learned

• Aim: intelligible and operational concept description


• Instances: the individual, independent examples of a concept

• Note: more complicated forms of input are possible


• Attributes: measuring aspects of an instance

• We will focus on nominal and numeric ones

130
What’s a concept?
• Styles of learning (learning schemes)
• Classification learning: predicting a discrete class
• Association learning: detecting associations between features
• Clustering: grouping similar instances into clusters
• Numeric prediction: predicting a numeric quantity
• Concept: thing to be learned
• Concept description: output of learning scheme

131
Classification learning
• Classification learning is supervised
• Scheme is provided with actual outcome
• Outcome is called the class of the example
• Measure success on fresh data for which class labels are known (test
data)
• In practice success is often measured subjectively

132
Classification learning
• By default, the last
attribute is considered to
be the class variable or
the variable whose value
we need to predict
• Dataset : the classified
examples
• The idea is to deduce a
model or classifier that
will be used to classify
the new (unclassified)
examples

133
Association learning
• Can be applied if no class is specified and any kind of structure is
considered “interesting”
• The idea is to find associations between attributes, when no “class” is
specified
• Difference to classification learning:
• Can predict any attribute’s value, not just the class, and more than one
attribute’s value at a time
• Hence: far more association rules than classification rules
• Thus: constraints are necessary
• Minimum coverage and minimum accuracy

134
Clustering
• Finding groups of items that are similar
• Clustering is unsupervised
• The class of an example is not known
• Success often measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
…
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
…
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
…

135
Numeric prediction
• Variant of classification learning where “class” is numeric (also called
“regression”)
• Learning is supervised
• Scheme is being provided with target value
• Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…         …            …         …      …

136
What’s in an example?
• Instance: specific type of example
• Thing to be classified, associated, or clustered
• Individual, independent example of target concept
• Characterized by a predetermined set of attributes
• Input to learning scheme: set of instances/dataset
• Represented as a single relation/flat file
• Rather restricted form of input
• No relationships between objects
• Most common form in practical data mining

137
What’s in an attribute?
• Each instance is described by a fixed predefined set of features, its
“attributes”
• But: number of attributes may vary in practice
• Possible solution: “irrelevant value” flag
• Related problem: existence of an attribute may depend of value of
another one
• Possible attribute types (“levels of measurement”):
• Nominal, ordinal, interval and ratio

138
WEKA only deals with “flat” files
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Flat file in ARFF (Attribute-Relation File Format) format

139
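For illustration only, a deliberately simplified reader for the subset of ARFF shown above (a real application should rely on Weka itself or an established parser; `read_arff` and its conventions, such as returning `None` for the missing-value marker `?`, are this sketch's own):

```python
def read_arff(text):
    """Toy reader for the ARFF subset above: collects attribute names,
    then parses @data lines into dicts, mapping '?' to None."""
    attributes, data, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):             # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])    # keep the attribute name
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            values = [None if v == '?' else v for v in line.split(',')]
            data.append(dict(zip(attributes, values)))
    return attributes, data

sample = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male }
@data
63,male
38,?
"""
attrs, rows = read_arff(sample)
print(attrs)            # ['age', 'sex']
print(rows[1]['sex'])   # None
```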
Output (knowledge representation)
• Tables
• Linear models
• Trees
• Rules
• Instance-based representation
• Clusters

141
Tables
• Simplest way of representing output:
• Use the same format as input!
• Decision table for the weather problem:
Outlook Humidity Play
Sunny High No
Sunny Normal Yes
Overcast High Yes
Overcast Normal Yes
Rainy High No
Rainy Normal No

• Main problem: selecting the right attributes


142
Linear models
• Another simple representation
• Regression model
• Inputs (attribute values) and output are all numeric
• Output is the sum of weighted attribute values
• The trick is to find good values for the weights

143
Linear models

PRP = 37.06 + 2.47 × CACH

144


Linear models for classification
• Binary classification
• Line separates the two classes
• Decision boundary - defines where the decision changes from one class value
to the other
• Prediction is made by plugging in observed values of the attributes
into the expression
• Predict one class if output ≥ 0, and the other class if output < 0
• Boundary becomes a high-dimensional plane (hyperplane) when
there are multiple attributes

145
Linear models for classification

2.0 – 0.5 × PETAL-LENGTH – 0.8 × PETAL-WIDTH = 0


146
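Plugging observed attribute values into the iris boundary above can be sketched as follows (which class sits on which side of the boundary is an assumption made here for illustration):

```python
def predict(petal_length, petal_width):
    """Linear decision boundary from the slide above:
    output >= 0 -> one class, output < 0 -> the other."""
    output = 2.0 - 0.5 * petal_length - 0.8 * petal_width
    return "Iris setosa" if output >= 0 else "Iris versicolor"

# Rows 1 and 51 of the iris table shown earlier:
print(predict(1.4, 0.2))   # Iris setosa      (2.0 - 0.7 - 0.16 = 1.14 >= 0)
print(predict(4.7, 1.4))   # Iris versicolor  (2.0 - 2.35 - 1.12 = -1.47 < 0)
```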
Trees
• “Divide-and-conquer” approach produces tree
• Nodes involve testing a particular attribute
• Usually, an attribute value is compared to a constant
• Other possibilities:
• Comparing values of two attributes
• Using a function of one or more attributes
• Leaves assign classification, set of classifications, or probability
distribution to instances
• Unknown instance is routed down the tree
147
Trees

148
Classification rules
• Popular alternative to decision trees
• Antecedent (pre-condition): a series of tests (just like the tests at the
nodes of a decision tree)
• Tests are usually logically ANDed together (but may also be general
logical expressions)
• Consequent (conclusion): classes, set of classes, or probability
distribution assigned by rule
• Individual rules are often logically ORed together
• Conflicts arise if different conclusions apply

149
More about rules …
• Are rules independent pieces of knowledge? (It seems easy to add a
rule to an existing rule base.)
• Problem: ignores how rules are executed
• Two ways of executing a rule set:
• Ordered set of rules (“decision list”)
• Order is important for interpretation
• Unordered set of rules
• Rules may overlap and lead to different conclusions for the same instance

150
More about rules …
• What if two or more rules conflict?
• Give no conclusion at all?
• Go with the rule that is most popular on the training data?
• …
• What if no rule applies to a test instance?
• Give no conclusion at all?
• Go with the class that is most frequent in the training data?
• …

151
Instance-based representation
• Simplest form of learning: plain memorization or rote learning
• Training instances are searched for instance that most closely resembles new
instance
• The instances themselves represent the knowledge
• Also called instance-based learning
• Similarity function defines what’s “learned”
• Instance-based learning is lazy learning
• Methods: nearest-neighbor, k-nearest-neighbor, …

152
Distance …
• Simplest case: one numeric attribute
• Distance is the difference between the two attribute values involved (or a
function thereof)
• Several numeric attributes: normally, Euclidean distance is used and
attributes are normalized
• Nominal attributes: distance is set to 1 if values are different, 0 if they
are equal
• Are all attributes equally important?
• Weighting the attributes might be necessary

153
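The distance computation described above might be sketched as follows (attribute names, ranges and instances are hypothetical; all attributes are weighted equally, though weighting could be added as the last bullet suggests):

```python
import math

def distance(a, b, numeric_range):
    """Euclidean-style distance over mixed attributes: numeric differences
    are normalized by the attribute's range over the training data;
    nominal attributes contribute 1 if the values differ, 0 if equal."""
    total = 0.0
    for attr, va in a.items():
        vb = b[attr]
        if attr in numeric_range:                        # numeric attribute
            lo, hi = numeric_range[attr]
            total += ((va - vb) / (hi - lo)) ** 2        # normalized difference
        else:                                            # nominal attribute
            total += 0.0 if va == vb else 1.0
    return math.sqrt(total)

# Hypothetical weather instances; temperature ranges over [60, 90] in the data
ranges = {"temperature": (60, 90)}
x1 = {"outlook": "sunny", "temperature": 75}
x2 = {"outlook": "rainy", "temperature": 60}
print(distance(x1, x2, ranges))   # sqrt(1 + 0.25) ≈ 1.118
```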
Representing Clusters

154
Representing Clusters

155
Algorithms: the Basic Methods
Data Mining: Practical Machine Learning Tools and Techniques (Chapters 4)

156
Simplicity first!!
• Simple algorithms often work very well!
• There are many kinds of simple structure, eg:
• One attribute does all the work
• Attributes contribute equally and independently
• A decision tree that tests a few attributes
• Calculate distance from training instances
• Result depends on a linear combination of attributes
• Success of method depends on the domain
• Again, data mining is an experimental science!!

157
Implementation in WEKA
• Classifiers in WEKA are models for predicting nominal or numeric
quantities
• Implemented learning schemes include:
• Decision trees and lists, instance-based classifiers, support vector machines,
multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
• Bagging, boosting, stacking, error-correcting output codes, locally weighted
learning, …

158
Two words about evaluation
• Confusion matrix
• TP or true positives,
• FP or false positives,
• TN or true negatives,
• FN or false negatives.

                        Predicted class
                        Yes     No
Real class      Yes     TP      FN
                No      FP      TN

159
Two words about evaluation
• Classical measures
• Precision : P = TP / (TP + FP)
• Recall : R = TP / (TP + FN)
• f-measure = 2 · P · R / (P + R)

In a classification task, a precision score of 1.0 for a class C means that every
item labeled as belonging to class C does indeed belong to class C (but says
nothing about the number of items from class C that were not labeled
correctly), whereas a recall of 1.0 means that every item from class C was
labeled as belonging to class C (but says nothing about how many other items
were incorrectly also labeled as belonging to class C).

160
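The three measures can be computed directly from the confusion-matrix counts (the counts below are hypothetical):

```python
def prf(tp, fp, fn):
    """Precision, recall and f-measure from confusion-matrix counts,
    following the formulas above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives
p, r, f = prf(tp=8, fp=2, fn=2)
print(p, r, f)   # 0.8 0.8 0.8
```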
The Basic Methods
• Inferring rudimentary rules
• Statistical modeling
• Constructing decision trees
• Constructing rules
• Association rule learning
• Linear models
• Instance-based learning
• Clustering

161
