DLWSS551 - Introduction
Chapter 1: Introduction
Harris Papadopoulos
Course Details
• Material:
– Study the main concepts and techniques of Data Mining
and its applications
– Provide hands-on experience on real-life problems and data!
Course Details
Learning Outcomes:
• Define and explain the major principles, terminology and problem types of Data Mining
• Describe and discuss the main Machine Learning techniques used in practical Data
Mining and their theoretical basis and evaluate their strengths and weaknesses
• Explain and propose ways of dealing with the issues involved in the application of
Machine Learning techniques to practical problems
• Apply Machine Learning techniques to a practical problem in either an exploratory or a targeted manner
• Analyse and evaluate the performance of Machine Learning techniques on a supervised
Data Mining task
• Define and apply the main data transformation approaches used in practical Data Mining
• Define and explain the main concepts and terminology of Web Mining
• Define, explain and demonstrate the main concepts, approaches and issues for designing
a recommendation system
• Describe and explain the two main versions of the Conformal Prediction framework for
quantifying uncertainty both in classification and regression and evaluate their outputs
Course Content
• Chapter 1 (Week 1) is the introductory chapter for the whole course
• Chapter 2 (Weeks 2 & 3) introduces the main components and terminology of a Data Mining task
• Chapter 3 (Weeks 4 & 5) analyses the main ideas behind some of the leading techniques that are
used in practical Data Mining
• Chapter 4 (Week 6) deals with the evaluation and comparison of Machine Learning techniques
• Chapter 5 (Weeks 7 & 8) examines some of the most prominent advanced Machine Learning
techniques used in practice today
• Chapter 6 (Week 9) introduces the unsupervised learning setting and various approaches for this
setting leading to different kinds of representations
• Chapter 7 (Week 10) studies data engineering approaches for transforming the input and output to a
suitable or even more effective form
• Chapter 8 (Week 11) examines Web Mining and personalization through recommendation systems
• Chapter 10 (Week 12) introduces a recently developed framework for quantifying the uncertainty of
Machine Learning predictions, called Conformal Prediction
• The course also includes a week for revising the whole course (Week 13)
Assessment
What’s it all About?
• Data vs. information
• Data mining and machine learning
• Structural descriptions
• Rules: classification and association
• Decision trees
• Datasets
• Weather, contact lens, CPU performance, labor negotiation data,
soybean classification
• Fielded applications
• Loan applications, screening images, load forecasting, machine fault
diagnosis, market basket analysis, ranking web pages…
• Generalization as search
• Data mining and ethics
Data vs. Information
• Society produces huge amounts of data – Big Data!
• E.g. the Large Hadron Collider computers have to store 15
petabytes a year!
• World wide web
• Sources: business, science, medicine, economics, geography,
environment, sports, …
• The amount of data stored in the world’s databases doubles
every 20 months!
• Potentially valuable resource
• Raw data is useless: need techniques to automatically
extract information from it
• Data: recorded facts
• Information: patterns underlying the data
Information is Crucial
• Example 1: in vitro fertilization
• Given: embryos described by 60 features
• Problem: selection of embryos that will survive
• Data: historical records of embryos and outcome
Two examples of how data can be used for deriving valuable information
Data Mining
• Extracting
• implicit,
• previously unknown,
• potentially useful
information from data
• Needed: programs that detect patterns and
regularities in the data
• Strong patterns ⇒ good predictions
• Problem 1: most patterns are not interesting
• Problem 2: patterns may be inexact (or spurious)
• Problem 3: data may be garbled or missing
Data mining is defined as the process of discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The patterns discovered must be meaningful
in that they lead to some advantage, usually an economic one. The data is invariably
present in substantial quantities.
Useful patterns allow us to make nontrivial predictions on new data.
Machine Learning Techniques
• Algorithms for acquiring structural descriptions from
examples
• Structural descriptions represent patterns explicitly
• Can be used to predict outcome in new situation
• Can be used to understand and explain how prediction is
derived
(may be even more important)
• Methods originate from artificial intelligence,
statistics, and research on databases
Structural Descriptions
Example: if-then rules
If tear production rate = reduced
then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft
… … … … …
What is meant by structural descriptions/patterns? And what form does the input take?
Let’s have a look at an example.
The table gives the conditions under which an optician might want to prescribe soft
contact lenses, hard contact lenses, or no contact lenses at all.
Each line of the table is one of the examples.
Part of a structural description of this information might be the if-then-rules you see in
the slide.
Structural descriptions need not necessarily be couched as rules such as these. Decision
trees, which specify the sequences of decisions that need to be made along with the
resulting recommendation, are another popular means of expression. We will discuss
decision trees and many other representations later.
Can Machines Really Learn?
• Definitions of “learning” from dictionary:
– To get knowledge of by study, experience, or being taught
– To become aware by information or from observation
(difficult to measure)
– To commit to memory
– To be informed of, ascertain; to receive instruction
(trivial for computers)
• Operational definition:
Things learn when they change their behavior in a way that makes them perform better in the future.
(Does a slipper learn?)
Now that we have some idea of the inputs and outputs, let’s turn to machine learning.
What is learning, anyway? What is machine learning?
Four dictionary definitions of learning are shown here.
For the first two, it is virtually impossible to test whether learning has been achieved or
not. How can we know if a machine has got knowledge of something or if it has become
aware of something?
The last two are obviously trivial for computers.
The operational definition ties learning to performance rather than knowledge. You can
test learning by observing present behaviour and comparing it with past behaviour. This
is a much more objective kind of definition and appears to be far more satisfactory.
But still there’s a problem: Lots of things change their behaviour in ways that make them
perform better in the future, yet we wouldn’t want to say that they have actually learned.
A good example is a comfortable slipper. Has it learned the shape of your foot? It has
certainly changed its behaviour to make it perform better as a slipper!
Fortunately, the kind of learning techniques explained in this course do not present these
conceptual problems—they are called machine learning without really presupposing any
particular philosophical stance about what learning actually is.
The Weather Problem
Conditions for playing a certain game
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …
In this course we will be using a lot of toy problems to illustrate different concepts and to
explain what algorithms do and how they work. These are all unrealistically simple –
serious application of data mining involves thousands, hundreds of thousands, or even
millions of individual cases. We will go through many of these toy problems (illustrative
datasets) here.
The weather problem is a tiny dataset that will be used repeatedly to illustrate machine
learning methods. Entirely fictitious, it supposedly concerns the conditions that are
suitable for playing some unspecified game.
In general, instances in a dataset are characterized by the values of features, or attributes,
that measure different aspects of the instance. In this case there are four attributes:
outlook, temperature, humidity, and windy. The outcome is whether to play or not.
Each row in the table corresponds to an instance – a particular day.
Each column in the table corresponds to a feature.
In the simplest form of this toy problem, all four attributes have values that are symbolic
categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can
be hot, mild, or cool; humidity can be high or normal; and windy can be true or false.
This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the
set of input examples.
A set of rules learned from this data – not necessarily a very good one – is shown in the
slide. These rules are meant to be interpreted in order: The first one; then, if it doesn’t
apply, the second; and so on. A set of rules that are intended to be interpreted in sequence
is called a decision list.
Classification vs. Association Rules
• Classification rule:
predicts the value of a given attribute (the classification of an example)
• Association rule:
predicts the value of an arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
then outlook = sunny and humidity = high
The rules we have seen so far are classification rules: They predict the classification of
the example in terms of whether to play or not.
It is equally possible to disregard the classification and just look for any rules that
strongly associate different attribute values. These are called association rules.
Many association rules can be derived from the weather data. Some good ones are shown
in the slide.
All these rules are 100% correct on the given data; they make no false predictions. The
first two apply to four examples in the dataset, the third to three examples, and the fourth
to two examples. And there are many other rules. In fact, nearly 60 association rules can
be found that apply to two or more examples of the weather data and are completely
correct on this data. And if you look for rules that are less than 100% correct, then you
will find many more.
There are so many because, unlike classification rules, association rules can “predict” any
of the attributes, not just a specified class, and can even predict more than one thing. For
example, the fourth rule predicts both that outlook will be sunny and that humidity will
be high.
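To make “correct on the given data” concrete, here is a minimal Python sketch for measuring how many examples an association rule applies to and how often it is right (the function name and the two-row data fragment are illustrative assumptions; the full weather dataset has 14 instances):

def rule_stats(data, antecedent, consequent):
    """Return (coverage, accuracy): how many instances match the rule's
    conditions, and what fraction of those also match its conclusion."""
    matched = [row for row in data
               if all(row[a] == v for a, v in antecedent.items())]
    correct = [row for row in matched
               if all(row[a] == v for a, v in consequent.items())]
    return len(matched), (len(correct) / len(matched) if matched else 0.0)

weather = [  # illustrative fragment of the weather data
    {"outlook": "sunny", "temperature": "hot", "humidity": "high",
     "windy": "false", "play": "no"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal",
     "windy": "false", "play": "yes"},
]
# "If temperature = cool then humidity = normal"
print(rule_stats(weather, {"temperature": "cool"}, {"humidity": "normal"}))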
Weather Data with Mixed Attributes
• Some attributes have numeric values
Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …
This slide shows a slightly more complex form of the weather data: two of the attributes
(temperature and humidity) have numeric values.
This means that any learning scheme must create inequalities involving these attributes
rather than simple equality tests as in the former case. This is called a numeric-attribute problem – or, more precisely here, a mixed-attribute problem, because not all attributes are numeric.
Some example rules for this case are shown in the slide.
A slightly more complex process is required to come up with rules that involve numeric
tests such as these.
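As an illustration, a rule learned from this version of the data might contain a numeric test of the form below (a hypothetical example of the kind of rule meant here, not one taken from the slide):
If outlook = sunny and humidity > 83 then play = no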
The Contact Lenses Data
Age Spectacle prescription Astigmatism Tear production rate Recommended lenses
Young Myope No Reduced None
Young Myope No Normal Soft
Young Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope No Reduced None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal Hard
Pre-presbyopic Myope No Reduced None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Reduced None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Reduced None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Reduced None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Reduced None
Presbyopic Myope No Normal None
Presbyopic Myope Yes Reduced None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Reduced None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None
This is the complete contact lens dataset introduced earlier, which tells you the kind of
contact lens to prescribe, given certain information about a patient.
The first four columns correspond to the features or attributes, the final column shows
which kind of lenses to prescribe, whether hard, soft, or none.
All possible combinations of the attribute values are represented in the table.
A Complete and Correct Rule Set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
A sample set of rules learned from this information is shown in this slide. This is a rather
large set of rules, but they do correctly classify all the examples.
Generally this is not the case in real-world problems. Sometimes there are situations in
which no rule applies; other times more than one rule may apply, resulting in conflicting
recommendations.
Sometimes probabilities or weights may be associated with the rules themselves to
indicate that some are more important, or more reliable, than others.
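To make the rule-set reading concrete, here is a minimal Python sketch of the rule set above, with the rules tried in order (the function name and lower-case value encoding are illustrative):

def recommend(age, prescription, astigmatic, tear_rate):
    """Apply the rules above in order: the first rule whose
    conditions all hold determines the recommendation."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if age == "presbyopic" and prescription == "myope" and astigmatic == "no":
        return "none"
    if prescription == "hypermetrope" and astigmatic == "no" and tear_rate == "normal":
        return "soft"
    if prescription == "myope" and astigmatic == "yes" and tear_rate == "normal":
        return "hard"
    if age == "young" and astigmatic == "yes" and tear_rate == "normal":
        return "hard"
    if age == "pre-presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    if age == "presbyopic" and prescription == "hypermetrope" and astigmatic == "yes":
        return "none"
    return None  # no rule applies

print(recommend("young", "myope", "yes", "normal"))  # -> "hard", matching the table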
A Decision Tree for this Problem
This slide shows a structural description for the contact lens data in the form of a decision
tree, which for many purposes is a more concise and perspicuous representation of the
rules and has the advantage that it can be visualized more easily.
The tree calls first for a test on the tear production rate, and the first two branches
correspond to the two possible outcomes. If the tear production rate is reduced (the left
branch), the outcome is none. If it is normal (the right branch), a second test is made, this
time on astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is
reached that dictates the contact lens recommendation for that case.
Classifying Iris Flowers
Sepal length Sepal width Petal length Petal width Type
1 5.1 3.5 1.4 0.2 Iris setosa
2 4.9 3.0 1.4 0.2 Iris setosa
…
51 7.0 3.2 4.7 1.4 Iris versicolor
52 6.4 3.2 4.5 1.5 Iris versicolor
…
101 6.3 3.3 6.0 2.5 Iris virginica
102 5.8 2.7 5.1 1.9 Iris virginica
…
The iris dataset, which is arguably the most famous dataset used in data mining, contains
50 examples of each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica.
There are four attributes: sepal length, sepal width, petal length, and petal width (all
measured in centimeters).
Unlike previous datasets, all attributes have values that are numeric.
Two example rules (from a rather long set of rules) that might be learned from this
dataset are shown in the slide.
Predicting CPU Performance
• Example: 209 different computer configurations
Columns: cycle time in ns (MYCT); main memory in Kb (MMIN, MMAX); cache in Kb (CACH); channels (CHMIN, CHMAX); performance (PRP)
MYCT MMIN MMAX CACH CHMIN CHMAX PRP
1 125 256 6000 256 16 128 198
2 29 8000 32000 32 8 32 269
…
208 480 512 8000 32 0 0 67
209 480 1000 4000 0 0 0 45
Although the iris dataset involves numeric attributes, the outcome – the type of iris – is a
category, not a numeric value.
This slide shows some data for which both the outcome and the attributes are numeric. It
concerns the relative performance of computer processing power on the basis of a
number of relevant attributes; each row represents one of 209 different computer
configurations.
The classic way of dealing with continuous prediction is to write the outcome as a linear
sum of the attribute values with appropriate weights, such as the example shown. This is
called a regression equation, and the process of determining the weights is called
regression, a well-known procedure in statistics.
However, the basic regression method is incapable of discovering nonlinear relationships.
We will cover both linear and nonlinear regression later in the course.
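As a minimal sketch, the weights of such a regression equation can be determined by least squares; here in Python/NumPy with just the two configurations shown above (the real task uses all 209 rows, so this tiny fit is purely illustrative):

import numpy as np

# The two configurations shown above: MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX
X = np.array([[125, 256, 6000, 256, 16, 128],
              [29, 8000, 32000, 32, 8, 32]], dtype=float)
y = np.array([198, 269], dtype=float)  # PRP, the performance column

# Fit PRP ≈ w0 + w1*MYCT + ... + w6*CHMAX by least squares
X1 = np.hstack([np.ones((len(X), 1)), X])  # add intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(X1 @ w)  # fitted performance values for the two rows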
Data from Labor Negotiations
Attribute Type 1 2 3 … 40
Duration (Number of years) 1 2 3 2
Wage increase first year Percentage 2% 4% 4.3% 4.5%
Wage increase second year Percentage ? 5% 4.4% 4.0%
Wage increase third year Percentage ? ? ? ?
Cost of living adjustment {none,tcf,tc} none tcf ? none
Working hours per week (Number of hours) 28 35 38 40
Pension {none,ret-allw, empl-cntr} none ? ? ?
Standby pay Percentage ? 13% ? ?
Shift-work supplement Percentage ? 5% 4% 4%
Education allowance {yes,no} yes ? ? ?
Statutory holidays (Number of days) 11 15 12 12
Vacation {below-avg,avg,gen} avg gen gen avg
Long-term disability assistance {yes,no} no ? ? yes
Dental plan contribution {none,half,full} none ? full full
Bereavement assistance {yes,no} no ? ? yes
Health plan contribution {none,half,full} none ? full half
Acceptability of contract {good,bad} bad good good good
The labor negotiations dataset summarizes the outcome of Canadian contract negotiations
in 1987 and 1988.
Each case concerns one contract, and the outcome is whether the contract is deemed
acceptable or unacceptable.
There are 40 examples in the dataset (plus another 17 that are normally reserved for test
purposes). Unlike the other tables we have seen, this table presents the examples as
columns rather than as rows; otherwise, it would have to be stretched over several pages.
Many of the values are unknown or missing, as indicated by question marks. This is a
much more realistic dataset than the others we have seen. It contains many missing
values, and it seems unlikely that an exact classification can be obtained.
Decision Trees for the Labor Data
This slide shows two decision trees that represent the labor negotiations dataset.
The one on the left is simple and approximate – it doesn’t represent the data exactly, but
it makes intuitive sense.
The one on the right is more complex and gives a more accurate representation of the
training data. But it is not necessarily a more accurate representation of the underlying
concept. Although it is more accurate on the data that was used to train the classifier, it
may perform less well on an independent set of test data. It may be “overfitted” to the
training data – following it too closely.
We will discuss the issue of overfitting later in the course.
Soybean Classification
Category Attribute Number of values Sample value
Environment Time of occurrence 7 July
Precipitation 3 Above normal
…
Seed Condition 2 Normal
Mold growth 2 Absent
…
Fruit Condition of fruit pods 4 Normal
Fruit spots 5 ?
Leaf Condition 2 Abnormal
Leaf spot size 3 ?
…
Stem Condition 2 Abnormal
Stem lodging 2 Yes
…
Root Condition 3 Normal
Diagnosis 19 Diaporthe stem canker
An often quoted early success story in the application of machine learning to practical
problems is the identification of rules for diagnosing soybean diseases. The data is taken
from questionnaires describing plant diseases.
There are about 680 examples, each representing a diseased plant. Plants were measured
on 35 attributes, each one having a small set of possible values.
Examples are labelled with the diagnosis of an expert in plant biology: There are 19
disease categories altogether – horrible sounding diseases such as diaporthe stem canker,
rhizoctonia root rot, and bacterial blight, to mention just a few.
This table gives the attributes, the number of different values that each can have, and a
sample record for one particular plant. The attributes are placed in different categories
just to make them easier to read.
The Role of Domain Knowledge
These are two example rules, learned from the soybean dataset.
These rules nicely illustrate the potential role of prior knowledge – often called domain
knowledge – in machine learning.
In this domain, if the leaf condition is normal then leaf malformation is necessarily
absent, so one of these conditions happens to be a special case of the other. Thus, if the
first rule is true, the second is necessarily true as well. The only time the second rule
comes into play is when leaf malformation is absent, but leaf condition is not normal –
that is, when something other than malformation is wrong with the leaf. This is certainly
not apparent from a casual reading of the rules.
ML on Soybean Data
• Research in the late 1970s using machine learning
• Rules for every disease based on ~300 examples
• Gave correct disease 97.5% of the time
• Rules from the plant pathologist who produced the
diagnoses
• Gave correct disease only 72% of the time
Not only did the learning algorithm find rules that outperformed those of the expert
collaborator, but the same expert was so impressed that he allegedly adopted the
discovered rules in place of his own!
Fielded Applications
• The result of learning—or the learning method itself—is
deployed in practical applications
• Processing loan applications
• Screening images for oil slicks
• Electricity supply forecasting
• Diagnosis of machine faults
• Marketing and sales
• Separating crude oil and natural gas
• Reducing banding in rotogravure printing
• Finding appropriate technicians for telephone faults
• Scientific applications: biology, astronomy, chemistry
• Automatic selection of TV programs
• Monitoring intensive care patients
The examples in the previous slides are speculative research projects, not production
systems. And the previous datasets are toy problems: They are deliberately chosen to be
small so that we can use them to work through algorithms later in the course.
Here are some applications of machine learning that have actually been put into use.
Processing Loan Applications
(American Express)
Enter Machine Learning
• 1000 training examples of borderline cases
• 20 attributes:
• age
• years with current employer
• years at current address
• years with the bank
• other credit cards possessed,…
• Learned rules: correct on 70% of cases
• human experts only 50%
• Rules could be used to explain decisions to customers
Screening Images
• Given: radar satellite images of coastal waters
• Problem: detect oil slicks in those images
• Oil slicks appear as dark regions with changing size
and shape
• Not easy: lookalike dark regions can be caused by
weather conditions (e.g. high wind)
• Expensive process requiring highly trained personnel
Enter Machine Learning
• Extract dark regions from normalized image
• Attributes:
• size of region
• shape, area
• intensity
• sharpness and jaggedness of boundaries
• proximity of other regions
• info about background
• Constraints:
• Few training examples—oil slicks are rare!
• Unbalanced data: most dark regions aren’t slicks
• Regions from same image form a batch
• Requirement: adjustable false-alarm rate
Load Forecasting
• Electricity supply companies
need forecast of future demand
for power
• Forecasts of min/max load for each hour
⇒ significant savings
• Given: manually constructed load model that assumes
“normal” climatic conditions
• Problem: adjust for weather conditions
• Static model consists of:
• base load for the year
• load periodicity over the year
• effect of holidays
Enter Machine Learning
• Prediction corrected using “most similar” days
• Attributes:
• temperature
• humidity
• wind speed
• cloud cover readings
• plus difference between actual load and predicted load
• Average difference among three “most similar” days
added to static model
• Linear regression coefficients form attribute weights
in similarity function
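A minimal sketch of this correction step follows (all names and the use of plain Euclidean distance are assumptions for illustration; the fielded system weighted the attributes with regression coefficients):

import numpy as np

def corrected_load(static_forecast, today, history, k=3):
    """history holds (weather_vector, actual_load - static_load) pairs for
    past days. Find the k days whose weather is most similar to today's
    and add the average of their load differences to the static forecast."""
    ranked = sorted(history, key=lambda day: np.linalg.norm(today - day[0]))
    return static_forecast + np.mean([diff for _, diff in ranked[:k]])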
Diagnosis of Machine Faults
• Diagnosis: classical domain
of expert systems
• Given: Fourier analysis of vibrations measured at
various points of a device’s mounting
• Question: which fault is present?
• Preventative maintenance of electromechanical
motors and generators
• Information very noisy
• So far: diagnosis by expert/hand-crafted rules
Enter Machine Learning
• Available: 600 faults with expert’s diagnosis
• ~300 unsatisfactory, rest used for training
• Attributes augmented by intermediate concepts that
embodied causal domain knowledge
• Expert not satisfied with initial rules because they did
not relate to his domain knowledge
• Further background knowledge resulted in more
complex rules that were satisfactory
• Learned rules outperformed hand-crafted ones
Marketing and Sales I
• Companies precisely record massive amounts of
marketing and sales data
• Applications:
• Customer loyalty:
identifying customers that are likely to defect by detecting
changes in their behavior
(e.g. banks/phone companies)
• Special offers:
identifying profitable customers
(e.g. reliable owners of credit cards that need extra money
during the holiday season)
Marketing and Sales II
• Market basket analysis
• Association techniques find
groups of items that tend to
occur together in a
transaction
(used to analyze checkout data)
• Historical analysis of purchasing patterns
• Identifying prospective customers
• Focusing promotional mailouts
(targeted campaigns are cheaper than mass-marketed
ones)
Web Mining
• Discovering interesting and useful information from
Web content, structure and usage
Web Mining
• Discovering interesting and useful information from
Web content, structure and usage
• Web Content Mining
• Content like text, images, audio, video etc.
• Example applications:
• intelligent search tools
• document clustering or categorization
• content-based personalization
Web Mining
• Discovering interesting and useful information from
Web content, structure and usage
• Web Structure Mining
• Explicit hyper-links or implicit links (e.g. tags)
• Example applications:
• document retrieval and ranking (e.g. PageRank, HITS)
• discovery of Web communities
• social network analysis
Web Mining
• Discovering interesting and useful information from
Web content, structure and usage
• Web Usage Mining
• User interactions with resources on one or more Web sites
• usage statistics or logs
• Example applications:
• user and customer behavior modeling
• Web site optimization
• Web marketing
Recommender Systems
• Give recommendations to a user based on preferences of
“similar” users or on items with “similar” content
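A minimal sketch of the “similar users” idea (user-based collaborative filtering; the rating encoding and function names are illustrative assumptions):

import math

def similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict_rating(target, others, item):
    """Similarity-weighted average of the ratings that other users
    (those who have rated the item) gave to it."""
    scored = [(similarity(target, u), u[item]) for u in others if item in u]
    total = sum(s for s, _ in scored)
    return sum(s * r for s, r in scored) / total if total else None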
DM on Mobile Computing
• Indoor localization based on WiFi RSS measurements
• Based on data of WiFi RSS measurements in different
indoor locations
• Find the current user location
• Malware detection
• Based on data of malware and non-malware application
resource usage (battery, memory, network etc.) and
permissions
• Determine if an application is likely to contain malware
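A minimal sketch of the fingerprinting approach behind the indoor-localization task above (the data layout and the choice of k are illustrative assumptions):

import numpy as np

def locate(current_rss, fingerprints, k=3):
    """fingerprints: (rss_vector, (x, y)) pairs recorded at known indoor
    positions. Estimate the user's position as the average position of the
    k fingerprints whose RSS vectors are closest to the current scan."""
    ranked = sorted(fingerprints,
                    key=lambda fp: np.linalg.norm(current_rss - fp[0]))
    return np.mean([pos for _, pos in ranked[:k]], axis=0)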
DM in Games
• ARShooter Wars
• Developed at Frederick Mobile Devices Laboratory
• Augmented reality FPS game
• Realistic AI bots were developed based on real data crowdsourced from human users playing the game
Machine Learning and Statistics
• Historical difference (grossly oversimplified):
• Statistics: testing hypotheses
• Machine learning: finding the right hypothesis
• But: huge overlap
• Decision trees (C4.5 and CART)
• Breiman et al. 1984 – statistics
• J. Ross Quinlan 1970s and early 1980s – machine learning
• Nearest-neighbor methods
• Today: perspectives have converged
• Most ML algorithms employ statistical techniques
In truth, you should not look for a dividing line between machine learning and statistics
because there is a continuum – and a multidimensional one at that – of data analysis
techniques. Some derive from the skills taught in standard statistics courses, and others
are more closely associated with the kind of machine learning that has arisen out of
computer science.
In our study of practical techniques for data mining, we will learn a great deal about
statistics.
Generalization as Search
• Inductive learning: find a concept description that fits
the data
• Example: rule sets as description language
• Enormous, but finite, search space
• Simple solution:
• enumerate the concept space
• eliminate descriptions that do not fit examples
• surviving descriptions contain target concept
One way of visualizing the problem of learning – and one that distinguishes it from
statistical approaches – is to imagine a search through a space of possible concept
descriptions for one that fits the data.
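A minimal sketch of the enumerate-and-eliminate idea from the slide (the candidate descriptions are assumed to be objects with a predict method; everything here is illustrative):

def surviving_descriptions(candidates, examples):
    """Keep only the concept descriptions consistent with every training
    example; if the language can express the target concept, it is
    among the survivors."""
    return [d for d in candidates
            if all(d.predict(x) == y for x, y in examples)]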
Enumerating the Concept Space
• Search space for weather problem
• 4 × 4 × 3 × 3 × 2 = 288 possible combinations
• With 14 rules ⇒ 2.7 × 10^34 possible rule sets
• Other practical problems:
• More than one description may survive
• No description may survive
• Language is unable to describe target concept
• or data contains noise
• Another view of generalization as search:
hill-climbing in description space according to pre-specified matching criterion
• Most practical algorithms use heuristic search that cannot
guarantee finding the optimum solution
Regarding it as search is a good way of looking at the learning process. However, the
search space, although finite, is extremely big, and it is generally quite impractical to
enumerate all possible descriptions and then see which ones fit. See for example the huge
number of possible rule sets for even the very trivial weather problem.
Plus there are the other practical problems listed in the slide.
Another way of looking at generalization as search is to imagine it not as a process of
enumerating descriptions and striking out those that don’t apply but as a kind of hill
climbing in description space to find the description that best matches the set of examples
according to some prespecified matching criterion. This is the way that most practical
machine learning methods work. However, except in the most trivial cases, it is
impractical to search the whole space exhaustively; most practical algorithms involve
heuristic search and cannot guarantee to find the optimal description.
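The slide's figures are easy to check (a quick sketch; note that the 2.7 × 10^34 figure corresponds to counting ordered lists of 14 rules, i.e. decision lists):

# Each rule condition: outlook (3 values or absent), temperature (3 or absent),
# humidity (2 or absent), windy (2 or absent), plus a class (yes/no)
rules = 4 * 4 * 3 * 3 * 2          # 288 possible rules
rule_sets = rules ** 14            # ordered lists of 14 rules
print(rules, f"{rule_sets:.2e}")   # 288 2.70e+34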
Bias
• Important decisions in learning systems:
• Concept description language
• Order in which the space is searched
• Way that overfitting to the particular training data is
avoided
• These form the “bias” of the search:
• Language bias
• Search bias
• Overfitting-avoidance bias
Viewing generalization as a search in a space of possible concepts makes it clear that the
most important decisions in a machine learning system are the ones listed in this slide.
These three properties are generally referred to as the bias of the search and are called
language bias, search bias, and overfitting-avoidance bias.
Without some form of bias, nothing can be learned.
Language Bias
• Important question:
• is language universal
or does it restrict what can be learned?
• Universal language can express arbitrary subsets of
examples
• If language includes logical or (“disjunction”), it is
universal
• Example: rule sets
• Domain knowledge can be used to exclude some
concept descriptions a priori from the search
The most important question for language bias is whether the concept description
language is universal or whether it imposes constraints on what concepts can be learned.
If you consider the set of all possible examples, a concept is really just a division of that
set into subsets.
A concept description language that can express any possible division into subsets is
universal, i.e. no language bias.
In the absence of any other constraints, a universal language will only memorise the
given data.
For example, if the description language is rule-based, disjunction can be achieved by
using separate rules. Then one possible concept representation is just to enumerate all the
given instances. This concept description does not perform any generalization; it simply
records the original data.
On the other hand, if the language is not universal, some possible concepts – sets of
examples – may not be able to be represented at all. In that case, a machine learning
scheme may simply be unable to achieve good performance.
Search Bias
• Search heuristic
• “Greedy” search: performing the best single step
• “Beam search”: keeping several alternatives
• …
• Direction of search
• General-to-specific
• E.g. specializing a rule by adding conditions
• Specific-to-general
• E.g. generalizing an individual instance into a rule
In realistic data mining problems, there are many alternative concept descriptions that fit
the data, and the problem is to find the “best” one according to some criterion – usually
simplicity.
It is often computationally infeasible to search the whole space and guarantee that the
description found really is the best. Consequently, the search procedure is heuristic, and
no guarantees can be made about the optimality of the final result. This leaves plenty of
room for bias: Different search heuristics bias the search in different ways.
A more general and higher-level kind of search bias concerns whether the search is done
by starting with a general description and refining it or by starting with a specific
example and generalizing it. The former is called a general-to-specific search bias; the
latter, a specific-to-general one.
Overfitting-Avoidance Bias
• Can be seen as a form of search bias
• Modified evaluation criterion
• E.g. balancing simplicity and number of errors
• Modified search strategy
• E.g. pruning (simplifying a description)
• Pre-pruning: stops at a simple description before search
proceeds to an overly complex one
• Post-pruning: generates a complex description first and
simplifies it afterwards
Overfitting-avoidance bias is often just another kind of search bias. However, because it
addresses a rather special problem, we treat it separately.
In summary, although generalization as search is a nice way to think about the learning
problem, bias is the only way to make it feasible in practice. Different learning
algorithms correspond to different concept description spaces searched with different
biases. This is what makes it interesting: Different description languages and biases serve
some problems well and other problems badly. There is no universal “best” learning
method.
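As a minimal illustration of a modified evaluation criterion of the kind listed above, balancing simplicity against errors (the trade-off constant and the description object's interface are assumptions):

def score(description, data, alpha=0.5):
    """Lower is better: training errors plus a complexity penalty, so a
    slightly less accurate but much simpler description can still win."""
    errors = sum(1 for x, y in data if description.predict(x) != y)
    return errors + alpha * description.size()  # size(): e.g. number of conditions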
Data Mining and Ethics I
• Ethical issues arise in
practical applications
• Anonymizing data is difficult
• 85% of Americans can be identified from just zip code, birth
date and sex
• Data mining often used to discriminate
• E.g. loan applications: using some information (e.g. sex,
religion, race) is unethical
• Ethical situation depends on application
• E.g. same information ok in medical application
• Attributes may contain problematic information
• E.g. area code may correlate with race
The use of data – particularly data about people – for data mining has serious ethical
implications, and practitioners of data mining techniques must act responsibly by making
themselves aware of the ethical issues that surround their particular application.
When applied to people, data mining is frequently used to discriminate: who gets the
loan, who gets the special offer, and so on. Certain kinds of discrimination – racial,
sexual, religious, and so on – are not only unethical but also illegal.
However, the situation is complex: Everything depends on the application. Using sexual
and racial information for medical diagnosis is certainly ethical, but using the same
information when mining loan payment behaviour is not.
Even when sensitive information is discarded, there is a risk that models will be built that
rely on variables that can be shown to substitute for racial or sexual characteristics.
Data Mining and Ethics II
• Important questions:
• Who is permitted access to the data?
• For what purpose was the data collected?
• What kind of conclusions can be legitimately drawn from it?
• Caveats must be attached to results
• Purely statistical arguments are never sufficient!
• Are resources put to good use?
When presented with data, you need to ask who is permitted to have access to it, for what
purpose it was collected, and what kind of conclusions are legitimate to draw from it.
In addition, logical and scientific standards must be adhered to when drawing conclusions
from data. If you do come up with conclusions (e.g., red car owners being greater credit
risks), you need to attach caveats to them and back them up with arguments other than
purely statistical ones. The point is that data mining is just a tool in the whole process. It
is people who take the results, along with other knowledge, and decide what action to
apply.
Important question: Are resources put to good use?
Research at Frederick
• Indoor Localization
• Android Malware Detection
• Face recognition
• Prediction of Stroke Risk
• Osteoporosis Risk Assessment
• Acute Abdominal Pain Diagnosis
• Space Weather Prediction
• Information Extraction from Complicated Documents
• …
These are just some examples of the research that has been or is currently being performed at Frederick.
Competitions
• Netflix Prize: Recommender system
– $1 Million
Lastly, there are many data mining competitions. Maybe the most famous was the Netflix
Prize, in which the team that managed to improve Netflix’s own algorithm for predicting
ratings won $1 Million. This started back in 2006 and ended in 2009.
If you want to have a look at current competitions, most are held on a platform called
Kaggle: https://www.kaggle.com/