
Lecture 1: Introduction to Data Mining

7CCSMDM1 Data Mining

Dr Dimitrios Letsios

Department of Informatics
King’s College London

1 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

2 / 72
Definition

Main Data Mining Goal


I Extract information from data, i.e. understand and take
advantage of it.

I Computers allow generating, managing, processing, and


communicating data and information.
I Data is raw, unorganized facts that are not a priori useful.
I If data is processed, organized, structured, and meaningfully
presented in some context, then it becomes information.
I Information is hidden in the data and understanding tends to
decrease as the volume of data increases.

3 / 72
Definition (2)

Data Mining Definition [WFH3, Section 1.1]

The process of discovering patterns in data.


I The process must be automated.
I The patterns must be meaningful and useful, i.e. lead to some
benefit and inform future decisions.
I Data mining works on existing data, i.e. data that has already
been generated, by people, machines, processes, etc.

I A pattern can be thought of as a series of data that repeat in a


recognizable way.
I Finding patterns in data involves:
1. Identifying patterns
2. Validating patterns
3. Using patterns for predictions

4 / 72
Real-World Applications

Web Data
I PageRank assigns an importance measure to each web page, used
to rank online search results (Google).
I Email filtering classifies new messages as spam or ham.
I Online advertising recommends products to users with similar
purchase histories.
I Social media identify users with similar preferences.

5 / 72
Real-World Applications (2)

Marketing and Sales


I Identifying customers likely to defect, in order to fight churn.
I Market basket analysis for personalised offers.

Risk
I Statistical calculation of bank loan default risk.
I Anticipated job candidate performance in recruitments.

6 / 72
Real-World Applications (3)

Images
I Oil spill or deforestation detection from satellite images.
I Currency recognition in automated payment machines.
I Face recognition for police surveillance.

Engineering
I Power demand forecasting for electricity suppliers.
I Failure prediction for machine maintenance in manufacturing.

7 / 72
Data Sets
Contact Lens Data Set [WFH3, Table 1.1]

8 / 72
Data Sets (2)

Nominal Weather Data Set [WFH3, Table 1.2]

9 / 72
Data Sets (3)

Numeric Weather Data Set [WFH3, Table 1.3]

10 / 72
Data Sets (4)

CPU Performance Data Set [WFH3, Table 1.5]

11 / 72
Data Sets (5)

Main Data Set Elements


I Attributes
I Instances

I Attributes or Features or Columns:


I Characterise each data set entry, i.e. specify the data set form.
I E.g. types of conditions considered in the weather data set.
I Attributes might depend on each other.
I Instances or Examples or Rows
I A set of values, one for each attribute.
I Typically, instances are considered to be independent.
I However, there might be relationships between instances.

12 / 72
Data Sets (6)
Family Tree [WFH3, Figure 2.1]

13 / 72
Data Sets (7)

Attribute Types:
I Numeric: Continuous or discrete with well-defined distance
between values.
I Nominal: Categorical.
I Dichotomous: Binary or boolean or yes/no.
I Ordinal: Ordered but without well-defined distance, e.g. poor,
reasonable, good and excellent health quality.
I Interval: Ordered, but also measured in fixed units, e.g. cool,
mild and hot temperatures.

14 / 72
Data Sets (8)
Attribute-Relation File Format (ARFF) [WFH3, Figure 2.2]
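A minimal sketch of an ARFF file for the nominal weather data (attribute domains follow the slides; only the first few of the 14 rows are shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
...
```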

15 / 72
Data Sets (9)

I Lectures include simple data sets which are appropriate for


learning because they expose different issues and challenges.
I Practicals use a range of data sets from online sources.
I Often, real-world data sets:
I contain thousands or millions of entries,
I are incomplete,
I are noisy,
I incorporate randomness,
I are sparse.

16 / 72
Data Sets (10)

I Data preparation can be a significant part of the data mining


process and may require:
I Assembly
I Integration
I Cleaning
I Transformation

17 / 72
Data Sets (11)

Feature Engineering
The process of transforming raw data by selecting the most
suitable attributes for the data mining problem to be solved.

I Significant part of the data preparation time before modeling.


I Coming up with appropriate attributes can be difficult,
time-consuming and requires experience.
I May significantly affect data mining methods.

18 / 72
Patterns

I Allow making non-trivial predictions for new data.

I A pattern may be:
I Black box, i.e. incomprehensible with hidden structure.
I Transparent with visible structure.

I Structural patterns:
I Capture and explain data aspects in an explicit way.
I Can be used for better-informed decisions.
I E.g. rules in the form if-then-else.

19 / 72
Patterns (2)

Nominal Weather Data Set [WFH3, Table 1.2]

20 / 72
Patterns (3)

Classification Rule
Attribute values predict the label.
if (outlook == sunny) and (humidity == high):
    play = no

Patterns can be captured by different types of models:


I Linear equations.
I Clusters, i.e. meaningful groups of data.
I Tree structures.
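The classification rule above can be sketched as a plain Python function (an illustrative sketch, not code from the slides; the default decision for all other cases is assumed):

```python
def predict_play(outlook, humidity):
    """Classification rule: attribute values predict the label."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    return "yes"  # assumed default for all other cases

print(predict_play("sunny", "high"))  # -> no
```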

21 / 72
Patterns (4)

Weather Data Set Elements


I The input is four attributes or features:
I outlook = { sunny, overcast, rainy }
I temperature = { hot, mild, cool }
I humidity = { high, normal }
I windy = { true, false }
I The output is a decision, i.e. one label or class:
I play = { yes, no }

I There are 3 × 3 × 2 × 2 = 36 possible cases, i.e. conditions.


I Only 14 cases are present in the data set.
I A rule may use one or more attributes to make the right
decision (i.e. select the correct label).
I Rules obtained for a data set might not be good.
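The 3 × 3 × 2 × 2 count can be checked by enumerating the attribute domains (a quick sketch using Python's standard library):

```python
from itertools import product

outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

# Every combination of attribute values is one possible case.
cases = list(product(outlook, temperature, humidity, windy))
print(len(cases))  # -> 36
```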

22 / 72
Patterns (5)

I Data mining aims to construct models from the data.


I If the data set is complete, then the rules produce 100%
correct predictions.
I If the data set is incomplete, then the rules may produce
incorrect predictions because information is missing.
I Real-world data sets are typically incomplete, so we aim for
the best possible rules.

23 / 72
General Picture

I Data mining is a process for exploring data to discover


meaningful patterns.
I Given a data set, we aim to construct models expressing these
patterns using some form of knowledge representation.

24 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

25 / 72
Concrete Tasks

Examples:
I Classification models relationships between data elements to
predict classes or labels.
I Regression models relationships between data elements to
predict numeric quantities.
I Clustering models relationships of instances to group them so
that instances in the same group are similar.
I Association models relationships between attributes.

26 / 72
Concrete Tasks (2)

Classification:
I The data is labelled, e.g. people can be classified as Covid
positive or negative based on their symptoms.
I Models ways that attributes determine the class of instances.
I Supervised learning task because it uses already classified
instances to make predictions for new instances.

27 / 72
Concrete Tasks (3)

Regression:
I Models ways that attributes determine a numeric value.
I Variant of classification, but without discrete classes.
I Supervised learning task, similarly to classification.
I Often, the produced model is more interesting than predicted
values, e.g. what attributes affect car prices.

28 / 72
Concrete Tasks (4)

Clustering:
I Models similarity between instances and divides them into
groups so that instances in the same group are more similar
than instances in different groups.
I E.g. partition customers into groups.
I By labelling the clusters, we may use them in meaningful ways.
I Unsupervised learning task because the data is not labelled.

29 / 72
Concrete Tasks (5)

Association:
I Models how some attributes determine other attributes.
I No specific class or label.
I May examine any subset of attributes to predict any other
disjoint subset of attributes.
I Usually involve only nominal data.
I E.g. use supermarket data, to identify combinations of
products that occur together in transactions.

30 / 72
Process

I Decomposes the data mining process into a number of steps.


I Allows distinguishing various issues.
I Provides a methodology for implementing data mining tasks.

31 / 72
Process (2)
Data Mining Process

[Flow diagram: Question of Interest → Data containing Examples →
Model → Score / Evaluate → Prediction or Insight]
32 / 72
Process (3)

Step 1: Objective Specification


I Identify the data mining problem type.

I Supervised learning:
I There is a target attribute.
I If nominal, then classification. E.g. to play or not in the
weather data set.
I If numeric, then prediction. E.g. predict power value in the CPU
performance data set.
I Unsupervised learning:
I There is no target attribute.
I Cluster instances into groups of similarity.
I Find attribute correlations or associations.
I There exist other data mining tasks for other types of data.

33 / 72
Process (4)

Step 2: Data Exploration


I Visualise the data, e.g. using histograms or scatter plots.
I Confirm that the objective can be achieved with the data set.

I In this module, we first select the methods and then an


appropriate data set.
I In real-world applications, we typically begin with the data
and then select an appropriate method.

34 / 72
Process (5)

Step 3: Data Cleaning


I Fix any problems with the data.

I Confirm there is enough data, i.e. broad and deep.


I With very sparse data, data mining might not be effective.
I Rule of thumb: the more, the better.
I However, very large data sets can be problematic when (i) the
target variable appears in extremely rare patterns, or (ii) model
building is very resource consuming.
I Check whether there are imprecise or missing values.
I Verify that the data set is representative and not biased.

35 / 72
Process (6)

Step 4: Model Building


I Select the most appropriate model for the data.

I The data may contain:


I Discrete or continuous numbers.
I Categorical, numeric, or mixed values.
I Grayscale or coloured images.

36 / 72
Process (7)

Step 5: Model Evaluation


I Assess whether the model achieves the desiderata.

I Measure accuracy, i.e. how well the model performs.


I Use both existing and new data by partitioning the data into:
I Training set for model building.
I Validation set for model selection.
I Test set for model evaluation with unseen data.
I Measure accuracy in all training, validation and test sets.
I Overfitting: the model is very tailored to the training
instances and does not generalize well to new instances.

37 / 72
Process (8)

Step 6: Repeat
I Usually, multiple iterations of the aforementioned steps are
required to build a good enough model.
I Revise the performed steps, adapt and reiterate.

38 / 72
Relation to Other Fields

I Data Mining (DM) is an interdisciplinary field using


approaches and techniques from multiple fields, including:
I Artificial Intelligence (AI) and Machine Learning (ML)
I Statistics (Stats)
I Algorithms and Mathematical Optimization

39 / 72
Relation to Other Fields (2)
I DM is strongly related to ML and Stats.
I These fields share methods, but use them in different ways
and for different reasons.

40 / 72
Relation to Other Fields (3)

When does a machine learn? [WFH3, Section 1.1]

Subjects learn when they change their behaviour in a way that


makes them perform better in the future.

I The above statement emphasises performance rather than


knowledge, which can be measured by comparing past
behaviour to present and future behaviour.
I There is a difference between learning and adaptation. E.g.
the adaptation of a shoe to the shape of a foot is not learning.
I Learning implies purpose, i.e. intention.
I There are philosophical questions here.

41 / 72
Relation to Other Fields (4)

I The application of ML to DM is not a philosophical question.


I DM involves learning in a practical sense, i.e. finding and
describing well-structured patterns in the data.
I Input: data containing a set of examples.
I Output: explicit knowledge representation.
I DM involves not only the efficient acquisition of knowledge,
but also the ability to use it.

42 / 72
Relation to Other Fields (5)

I ML and Stats have different histories, but similar methods


have been developed in parallel in the two fields. E.g.
I Generating decision trees from examples.
I Nearest-neighbour classification.
I ML and Stats have different goals:
I ML aims at the most accurate predictions.
I Stats infers variable relationships and tests hypotheses.
I Nowadays, there is significant overlap between the two.
I Many DM techniques require statistical thinking.

43 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

44 / 72
Knowledge Representations

I Decision Tables
I Trees
I Rules
I Linear Models
I Instance-Based Representations
I Clusters

45 / 72
Tables
Decision Table
I Concise visual representation for specifying which actions to
perform based on given conditions.
I Contains a set of attributes and a decision label for each
unique set of attribute values.

46 / 72
Trees
Building Blocks
I Nodes: specify decisions to be made.
I Branches from a node represent possible alternatives.
I A branch connects a parent node to one of its child nodes.
I The very top node without a parent is called root.
I The very bottom nodes without a child are called leaves.

47 / 72
Trees (2)

Decision Trees:
I Branches may involve a single or multiple attributes.
I We examine the value of an attribute and branch based on
equality or inequality.
if (temperature < 80):
    branch left
else:
    branch right

I The alternatives of a decision can be:


I two-way such as yes or no,
I three-way such as <, =, or >,
I multi-way.

48 / 72
Trees (3)
Decision Trees:
I A path is a sequence of nodes such that each node is the
child of the previous node in the sequence.
I An attribute can be tested more than once in a path.
I In a classification context:
I A leaf specifies a class.
I Each instance satisfying all decisions of the corresponding path
from the root to the leaf is assigned this class.
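A small decision tree over the weather attributes can be sketched as nested tests, with each leaf returning a class (an illustrative tree, not one built from the slides' data):

```python
def classify(outlook, humidity, windy):
    """Each root-to-leaf path assigns a class to matching instances."""
    if outlook == "sunny":
        if humidity == "high":
            return "no"
        return "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy": branch on windy
    return "no" if windy else "yes"

print(classify("sunny", "normal", False))  # -> yes
```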

49 / 72
Trees (4)

Missing Value Problem


It is unclear which branch should be considered if an attribute
value is missing.

I Possible solutions:
I Ignore all instances with missing values.
I Treat missing as an additional value that each attribute may take.
I Set the most popular choice for each missing attribute value.
I Make a probabilistic (weighted) choice for each missing
attribute value, based on the other instances.
I All these solutions propagate errors, especially when the
number of missing values increases.
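The "most popular choice" option can be sketched as mode imputation (a hypothetical helper, using only the standard library):

```python
from collections import Counter

def impute_most_popular(values):
    """Replace None entries by the most common observed value."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

print(impute_most_popular(["high", None, "high", "normal"]))
# -> ['high', 'high', 'high', 'normal']
```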

50 / 72
Trees (5)

Functional Tree
I Computes a function of multiple attribute values in each node.
I Branches based on the value returned by the function.
if (petal_length * petal_width > threshold):
    make decision
else:
    make different decision

51 / 72
Trees (6)
Regression Tree
I Predicts numeric values.
I Each node branches on the value of an attribute or on the
value of a function of the attributes.
I A leaf specifies a predicted value for corresponding instances.

52 / 72
Trees (7)
Model Trees
I Similar to a regression tree, except that a regression
equation predicts the numeric output value in each leaf.
I A regression equation predicts a numeric quantity as a
function of the attributes.
I More sophisticated than linear regression and regression trees.

53 / 72
Rules
Rule
I An expression in if-then format.
I The if part is the pre-condition or antecedent and consists
of a series of tests.
I The then part is the conclusion or consequent and assigns
values to one or more attributes.

The pre-condition may contain multiple clauses in the form of:


I Conjunction, i.e. tests linked by and, meaning that all tests
must be true to fire the rule.
I Disjunction, i.e. tests linked by or, meaning that at least one
test must be true to fire the rule.
I General logic expressions, i.e. tests linked by different
logical operators (and/or).

54 / 72
Rules (2)

Classification rules:
I Predict the class or label of an instance.
I Can be derived from a decision tree.
I One rule can be constructed for each leaf of the tree:
I The pre-condition contains a clause for each decision along the
path from the root to the leaf.
I The conclusion is the class of the leaf.
I Rule sets constructed in this way may contain redundancies,
especially if multiple leaves contain the same class.

55 / 72
Rules (3)
I Transforming a set of rules into a decision tree is also
possible, but not straightforward.
I The difficulty is choosing the order of tests, starting from the root.
I The replicated subtree problem may occur, i.e. no matter
which rule is chosen first, the other is replicated in the tree.
I Sometimes, classification rules can be significantly more
compact than decision trees.

if a and b:
    x
if c and d:
    x

56 / 72
Rules (4)

I A set of rules may fail to classify an instance.


I These situations cannot happen with decision trees.
I Ordered set of rules:
I Should be applied based on the given order (decision list).
I An individual rule out of the list may be incorrect.
I Unordered sets of rules:
I Each rule represents an independent piece of knowledge.
I Different rules may lead to different classes for one instance.

57 / 72
Rules (5)
Association rules:
I Predict an attribute of an instance.
I Similar to classification rules except that they can predict
combinations of attributes too.
I Express different regularities in the data set.
I Many different association rules, even in tiny data sets.

Interesting association rules:


I High coverage and accuracy.
I Coverage: number of instances that the rule predicts correctly.
I Accuracy: % of correctly predicted instances among those the
rule applies to.
I Typically, we seek rules with coverage and accuracy above
prescribed thresholds.
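Coverage and accuracy can be sketched as counts over the instances on which the rule's pre-condition fires (a sketch with hypothetical instances, not data from the slides):

```python
def rule_stats(instances, pre_condition, conclusion):
    """Coverage: instances the rule predicts correctly.
    Accuracy: proportion correct among instances the rule applies to."""
    fired = [x for x in instances if pre_condition(x)]
    correct = [x for x in fired if conclusion(x)]
    coverage = len(correct)
    accuracy = coverage / len(fired) if fired else 0.0
    return coverage, accuracy

data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "play": "yes"},
]
cov, acc = rule_stats(
    data,
    lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",
    lambda x: x["play"] == "no",
)
print(cov, acc)  # -> 1 0.5
```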

58 / 72
Rules (6)

Learning Rules:
I By adding new rules and refining existing rules while more
instances are added in the training set.
I A refinement may add another conjunctive clause (and) to a
pre-condition.

Rules may:
I contain functions of attribute values, e.g. area(rectangle).
I compare attribute values or functions of them, e.g.
area(rectangle) > width(rectangle).
I recursively concern different data set parts, e.g.
tallerThan(rectangle, triangle).

59 / 72
Linear Models
I A linear model is a weighted sum of attribute values.
I E.g. PRP = 2.47 · CACH + 37.06.
I All attribute values must be numeric.
I Typically visualised as a 2D scatter plot with a regression
line, i.e. a linear function that best represents the data.
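The linear model above can be evaluated directly; with one numeric attribute it is a weighted sum plus a constant (a minimal sketch using the slide's coefficients):

```python
def predict_prp(cach):
    """Linear model from the slide: PRP = 2.47 * CACH + 37.06."""
    return 2.47 * cach + 37.06

print(predict_prp(16.0))  # ≈ 76.58
```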

60 / 72
Linear Models (2)
I Linear models can be applied to classification problems, by
defining decision boundaries separating instances that
belong to different classes.
I E.g. 0.5 · PL + 0.8 · PW = 2.0.
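The boundary above can serve as a classifier: instances on each side of the line get a different class (PL and PW are assumed to be petal length and width; the class names are illustrative):

```python
def classify_iris(pl, pw):
    """Decision boundary 0.5 * PL + 0.8 * PW = 2.0 separates two classes."""
    score = 0.5 * pl + 0.8 * pw
    return "versicolor" if score > 2.0 else "setosa"

print(classify_iris(1.4, 0.2))  # -> setosa
print(classify_iris(4.7, 1.4))  # -> versicolor
```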

61 / 72
Instance-Based Representations

Instance-Based Learning
I Instead of creating models, memorise actual instances.
I Instances are knowledge representation themselves.
I For new instances, search for their closest ones in the training set.

I All work is done when classifying new instances, rather than


when processing the training set.
I Instances are compared using a distance metric.
I The closest training instances are referred to as nearest
neighbours.

62 / 72
Instance-Based Representations (2)

Euclidean Distance
Metric computing the distance between instances i and i′ with
numeric attributes.

d(i, i′) = sqrt( Σ_{j=1}^{n} (x_{i,j} − x_{i′,j})² )

I d(i, i′): distance between instances i and i′.
I n: number of attributes.
I x_{i,j}: value of attribute j for instance i.
I x_{i′,j}: value of attribute j for instance i′.
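The Euclidean distance can be sketched directly in Python (instances represented as lists of numeric attribute values):

```python
import math

def euclidean(x, y):
    """Distance between two instances with numeric attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # -> 5.0
```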

63 / 72
Instance-Based Representations (3)

Hamming Distance
Metric computing the distance between instances i and i′ with
nominal attributes.

d(i, i′): number of attributes at which i and i′ differ.

I Attribute j contributes to d(i, i′):
  0, if x_{i,j} = x_{i′,j}
  1, if x_{i,j} ≠ x_{i′,j}
I Contributes 0 if the attribute values are the same.
I Contributes 1 if the attribute values are different.
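The Hamming distance is just a count of differing attributes (a minimal sketch with instances as lists of nominal values):

```python
def hamming(x, y):
    """Number of attributes at which instances x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming(["sunny", "hot", "high"], ["sunny", "mild", "high"]))  # -> 1
```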

64 / 72
Instance-Based Representations (4)
I Often, it is not desirable to store all training instances.
I Deciding the (i) saved and (ii) discarded instances is an issue.
I Even though instance-based methods do not learn an explicit
structure, the instances and distance metric specify
boundaries distinguishing different classes.
I Some instance-based methods create rectangular regions
containing instances of the same class.

65 / 72
Clusters

Clustering
Partitions the training set into regions which can be:
I non-overlapping, i.e. each instance is in exactly one cluster,
I overlapping, i.e. an instance may appear in multiple clusters.

66 / 72
Clusters (2)

Dendrogram (Hierarchical Clustering)


I Type of tree diagram with a hierarchical structure of clusters.
I The top level partitions the space into two or more groups.
I These groups are further partitioned into subgroups and so on.

67 / 72
Model Evaluation

https://xkcd.com/242/

68 / 72
Model Evaluation (2)

I How good is the model?


I How does it perform on known data?
I How well does it predict for new data?
I A score function or error function computes the differences
between the predictions and the actual outcome.
I Typically, we want to maximize score or minimize error.
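Two common examples, sketched for illustration: classification accuracy (a score to maximise) and mean squared error (an error to minimise):

```python
def accuracy(predicted, actual):
    """Fraction of predictions matching the actual outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

print(accuracy(["yes", "no", "yes"], ["yes", "no", "no"]))  # ≈ 0.667
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))           # -> 2.0
```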

69 / 72
Model Evaluation (3)

I Different data mining tasks use different score functions.


I We will cover scoring and evaluation approaches together
with data mining methods.
I The evaluation also depends on the type of the modelled data.
I For example, model evaluation may require:
I measuring quantitative differences,
I counting the number of correct predictions,
I statistical measures, e.g. t-test.

70 / 72
Model Evaluation (4)

I Typically, we divide the data set into:


I the training set for model building,
I the validation set for model selection,
I the test set for model evaluation.
I If the training set or test set is not a representative sample,
then we will not build a strong model.
I A model can only be as good as the data used to construct it.
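One simple (hypothetical) way to carve a data set into the three parts, with shuffling so each part is a representative sample:

```python
import random

def split_dataset(instances, train=0.6, validation=0.2, seed=42):
    """Shuffle, then slice into training / validation / test sets."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train)
    n_val = int(len(data) * validation)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(10))
print(len(train_set), len(val_set), len(test_set))  # -> 6 2 2
```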

71 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

72 / 72
