
Lecture 1: Introduction to Data Mining

7CCSMDM1 Data Mining

Dr Dimitrios Letsios

Department of Informatics
King’s College London

1 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

2 / 72
Definition

Main Data Mining Goal


I Extract information from data, i.e. understand and take
advantage of it.

I Computers allow generating, managing, processing, and


communicating data and information.
I Data is raw, unorganized facts that are not a priori useful.
I If data is processed, organized, structured, and meaningfully
presented in some context, then it becomes information.
I Information is hidden in the data and understanding tends to
decrease as the volume of data increases.

3 / 72
Definition (2)

Data Mining Definition [WFH3, Section 1.1]

The process of discovering patterns in data.


I The process must be automated.
I The patterns must be meaningful and useful, i.e. lead to some
benefit and inform future decisions.
I Data mining works on existing data, i.e. data that has already
been generated, by people, machines, processes, etc.

I A pattern can be thought of as a series of data that repeat in a


recognizable way.
I Finding patterns in data involves:
1. Identifying patterns
2. Validating patterns
3. Using patterns for predictions

4 / 72
Real-World Applications

Web Data
I PageRank assigns an importance measure to each web page, used
to rank online search results (Google).
I Email filtering classifies new messages as spam or ham.
I Online advertising recommends products to users with similar
purchase histories.
I Social media identify users with similar preferences.

5 / 72
Real-World Applications (2)

Marketing and Sales


I Identifying customers likely to defect, in order to fight churn.
I Market basket analysis for personalised offers.

Risk
I Statistical calculation of bank loan default risk.
I Anticipated job candidate performance in recruitments.

6 / 72
Real-World Applications (3)

Images
I Oil spill or deforestation detection from satellite images.
I Currency recognition in automated payment machines.
I Face recognition for police surveillance.

Engineering
I Power demand forecasting for electricity suppliers.
I Failure prediction for machine maintenance in manufacturing.

7 / 72
Data Sets
Contact Lens Data Set [WFH3, Table 1.1]

8 / 72
Data Sets (2)

Nominal Weather Data Set [WFH3, Table 1.2]

9 / 72
Data Sets (3)

Numeric Weather Data Set [WFH3, Table 1.3]

10 / 72
Data Sets (4)

CPU Performance Data Set [WFH3, Table 1.5]

11 / 72
Data Sets (5)

Main Data Set Elements


I Attributes
I Instances

I Attributes or Features or Columns:


I Characterise each data set entry, i.e. specify the data set form.
I E.g. types of conditions considered in the weather data set.
I Attributes might depend on each other.
I Instances or Examples or Rows
I A set of values, one for each attribute.
I Typically, instances are considered to be independent.
I However, there might be relationships between instances.

12 / 72
Data Sets (6)
Family Tree [WFH3, Figure 2.1]

13 / 72
Data Sets (7)

Attribute Types:
I Numeric: Continuous or discrete with well-defined distance
between values.
I Nominal: Categorical.
I Dichotomous: Binary or boolean or yes/no.
I Ordinal: Ordered but without well-defined distance, e.g. poor,
reasonable, good and excellent health quality.
I Interval: Ordered, but also measured in fixed units, e.g. cool,
mild and hot temperatures.

14 / 72
Data Sets (8)
Attribute-Relation File Format (ARFF) [WFH3, Figure 2.2]
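A minimal sketch of an ARFF file for the nominal weather data (attribute domains follow the slides; only the first few of the 14 rows are shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {true, false}
@attribute play {yes, no}

@data
sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
...
```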

15 / 72
Data Sets (9)

I Lectures include simple data sets which are appropriate for


learning because they expose different issues and challenges.
I Practicals use a range of data sets from online sources.
I Often, real-world data sets:
I contain thousands or millions of entries,
I are incomplete,
I are noisy,
I incorporate randomness,
I are sparse.

16 / 72
Data Sets (10)

I Data preparation can be a significant part of the data mining


process and may require:
I Assembly
I Integration
I Cleaning
I Transformation

17 / 72
Data Sets (11)

Feature Engineering
The process of transforming raw data by selecting the most
suitable attributes for the data mining problem to be solved.

I Significant part of the data preparation time before modeling.


I Coming up with appropriate attributes can be difficult,
time-consuming and requires experience.
I May significantly affect data mining methods.

18 / 72
Patterns

I Allow making non-trivial predictions for new data.

I A pattern may be:
I Black box, i.e. incomprehensible with hidden structure.
I Transparent with visible structure.

I Structural patterns:
I Capture and explain data aspects in an explicit way.
I Can be used for better-informed decisions.
I E.g. rules in the form if-then-else.

19 / 72
Patterns (2)

Nominal Weather Data Set [WFH3, Table 1.2]

20 / 72
Patterns (3)

Classification Rule
Attribute values predict the label.
if (outlook == sunny) and (humidity == high):
    play = no

Patterns can be captured by different types of models:


I Linear equations.
I Clusters, i.e. meaningful groups of data.
I Tree structures.
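The classification rule above can be sketched as a plain Python function (an illustrative sketch, not code from the slides; the default decision for all other cases is assumed):

```python
def predict_play(outlook, humidity):
    """Classification rule: attribute values predict the label."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    return "yes"  # assumed default for all other cases

print(predict_play("sunny", "high"))  # -> no
```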

21 / 72
Patterns (4)

Weather Data Set Elements


I The input is four attributes or features:
I outlook = { sunny, overcast, rainy }
I temperature = { hot, mild, cool }
I humidity = { high, normal }
I windy = { true, false }
I The output is a decision, i.e. one label or class:
I play = { yes, no }

I There are 3 × 3 × 2 × 2 = 36 possible cases, i.e. conditions.


I Only 14 cases are present in the data set.
I A rule may use one or more attributes to make the right
decision (i.e. select the correct label).
I Rules obtained for a data set might not be good.
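The 3 × 3 × 2 × 2 count can be checked by enumerating the attribute domains (a quick sketch using Python's standard library):

```python
from itertools import product

outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

# Every combination of attribute values is one possible case.
cases = list(product(outlook, temperature, humidity, windy))
print(len(cases))  # -> 36
```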

22 / 72
Patterns (5)

I Data mining aims to construct models from the data.


I If the data set is complete, then the rules produce 100%
correct predictions.
I If the data set is incomplete, then the rules may produce
incorrect predictions because information is missing.
I Real-world data sets are typically incomplete, so we aim for
the best possible rules.

23 / 72
General Picture

I Data mining is a process for exploring data to discover


meaningful patterns.
I Given a data set, we aim to construct models expressing these
patterns using some form of knowledge representation.

24 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

25 / 72
Concrete Tasks

Examples:
I Classification models relationships between data elements to
predict classes or labels.
I Regression models relationships between data elements to
predict numeric quantities.
I Clustering models relationships of instances to group them so
that instances in the same group are similar.
I Association models relationships between attributes.

26 / 72
Concrete Tasks (2)

Classification:
I The data is labelled, e.g. people can be classified as Covid
positive or negative based on their symptoms.
I Models ways that attributes determine the class of instances.
I Supervised learning task because it uses already classified
instances to make predictions for new instances.

27 / 72
Concrete Tasks (3)

Regression:
I Models ways that attributes determine a numeric value.
I Variant of classification, but without discrete classes.
I Supervised learning task, similarly to classification.
I Often, the produced model is more interesting than predicted
values, e.g. what attributes affect car prices.

28 / 72
Concrete Tasks (4)

Clustering:
I Models similarity between instances and divides them into
groups so that instances in the same group are more similar
than instances in different groups.
I E.g. partition customers into groups.
I By labelling the clusters, we may use them in meaningful ways.
I Unsupervised learning task because the data is not labelled.

29 / 72
Concrete Tasks (5)

Association:
I Models how some attributes determine other attributes.
I No specific class or label.
I May examine any subset of attributes to predict any other
disjoint subset of attributes.
I Usually involve only nominal data.
I E.g. use supermarket data, to identify combinations of
products that occur together in transactions.

30 / 72
Process

I Decomposes the data mining process into a number of steps.


I Allows distinguishing various issues.
I Provides a methodology for implementing data mining tasks.

31 / 72
Process (2)
Data Mining Process

[Flow diagram: Question of Interest → Data containing Examples →
Model → Score / Evaluate → Prediction or Insight]
32 / 72
Process (3)

Step 1: Objective Specification


I Identify the data mining problem type.

I Supervised learning:
I There is a target attribute.
I If nominal, then classification. E.g. to play or not in the
weather data set.
I If numeric, then prediction. E.g. predict power value in the CPU
performance data set.
I Unsupervised learning:
I There is no target attribute.
I Cluster instances into groups of similarity.
I Find attribute correlations or associations.
I There exist other data mining tasks for other types of data.

33 / 72
Process (4)

Step 2: Data Exploration


I Visualise the data, e.g. using histograms or scatter plots.
I Confirm that the objective can be achieved with the data set.

I In this module, we first select the methods and then an


appropriate data set.
I In real-world applications, we typically begin with the data
and then select an appropriate method.

34 / 72
Process (5)

Step 3: Data Cleaning


I Fix any problems with the data.

I Confirm there is enough data, i.e. broad and deep.


I With very sparse data, data mining might not be effective.
I Rule of thumb: the more, the better.
I However, very large data sets can be problematic when (i) the
target variable appears in extremely rare patterns, or (ii) model
building is very resource consuming.
I Check whether there are imprecise or missing values.
I Verify that the data set is representative and not biased.

35 / 72
Process (6)

Step 4: Model Building


I Select the most appropriate model for the data.

I The data may contain:


I Discrete or continuous numbers.
I Categorical, numeric, or mixed values.
I Grayscale or coloured images.

36 / 72
Process (7)

Step 5: Model Evaluation


I Assess whether the model achieves the desiderata.

I Measure accuracy, i.e. how well the model performs.


I Use both existing and new data by partitioning the data into:
I Training set for model building.
I Validation set for model selection.
I Test set for model evaluation with unseen data.
I Measure accuracy in all training, validation and test sets.
I Overfitting: the model is very tailored to the training
instances and does not generalize well to new instances.

37 / 72
Process (8)

Step 6: Repeat
I Usually, multiple iterations of the aforementioned steps are
required to build a good enough model.
I Revise the performed steps, adapt and reiterate.

38 / 72
Relation to Other Fields

I Data Mining (DM) is an interdisciplinary field using


approaches and techniques from multiple fields, including:
I Artificial Intelligence (AI) and Machine Learning (ML)
I Statistics (Stats)
I Algorithms and Mathematical Optimization

39 / 72
Relation to Other Fields (2)
I DM is strongly related to ML and Stats.
I These fields share methods, but use them in different ways
and for different reasons.

40 / 72
Relation to Other Fields (3)

When does a machine learn? [WFH3, Section 1.1]

Subjects learn when they change their behaviour in a way that


makes them perform better in the future.

I The above statement emphasises performance rather than


knowledge, which can be measured by comparing past
behaviour to present and future behaviour.
I There is a difference between learning and adaptation. E.g.
the adaptation of a shoe to the shape of a foot is not learning.
I Learning implies purpose, i.e. intention.
I There are philosophical questions here.

41 / 72
Relation to Other Fields (4)

I The application of ML to DM is not a philosophical question.


I DM involves learning in a practical sense, i.e. finding and
describing well-structured patterns in the data.
I Input: data containing a set of examples.
I Output: explicit knowledge representation.
I DM involves not only the efficient acquisition of knowledge,
but also the ability to use it.

42 / 72
Relation to Other Fields (5)

I ML and Stats have different histories, but similar methods


have been developed in parallel in the two fields. E.g.
I Generating decision trees from examples.
I Nearest-neighbour classification.
I ML and Stats have different goals:
I ML aims at the most accurate predictions.
I Stats infers variable relationships and tests hypotheses.
I Nowadays, there is significant overlap between the two.
I Many DM techniques require statistical thinking.

43 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

44 / 72
Knowledge Representations

I Decision Tables
I Trees
I Rules
I Linear Models
I Instance-Based Representations
I Clusters

45 / 72
Tables
Decision Table
I Concise visual representation for specifying which actions to
perform based on given conditions.
I Contains a set of attributes and a decision label for each
unique set of attribute values.

46 / 72
Trees
Building Blocks
I Nodes: specify decisions to be made.
I Branches from a node represent possible alternatives.
I A branch connects a parent node to one of its child nodes.
I The very top node without a parent is called root.
I The very bottom nodes without a child are called leaves.

47 / 72
Trees (2)

Decision Trees:
I Branches may involve a single or multiple attributes.
I We examine the value of an attribute and branch based on
equality or inequality.
if (temperature < 80):
    branch left
else:
    branch right

I The alternatives of a decision can be:


I two-way such as yes or no,
I three-way such as <, =, or >,
I multi-way.

48 / 72
Trees (3)
Decision Trees:
I A path is a sequence of nodes such that each node is the
child of the previous node in the sequence.
I An attribute can be tested more than once in a path.
I In a classification context:
I A leaf specifies a class.
I Each instance satisfying all decisions of the corresponding path
from the root to the leaf is assigned this class.
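A small decision tree over the weather attributes can be sketched as nested tests, with each leaf returning a class (an illustrative tree, not one built from the slides' data):

```python
def classify(outlook, humidity, windy):
    """Each root-to-leaf path assigns a class to matching instances."""
    if outlook == "sunny":
        if humidity == "high":
            return "no"
        return "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy": branch on windy
    return "no" if windy else "yes"

print(classify("sunny", "normal", False))  # -> yes
```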

49 / 72
Trees (4)

Missing Value Problem


It is unclear which branch should be considered if an attribute
value is missing.

I Possible solutions:
I Ignore all instances with missing values.
I Treat missing as an additional value that each attribute may take.
I Set the most popular choice for each missing attribute value.
I Make a probabilistic (weighted) choice for each missing
attribute value, based on the other instances.
I All these solutions propagate errors, especially when the
number of missing values increases.
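The "most popular choice" option can be sketched as mode imputation (a hypothetical helper, using only the standard library):

```python
from collections import Counter

def impute_most_popular(values):
    """Replace None entries by the most common observed value."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

print(impute_most_popular(["high", None, "high", "normal"]))
# -> ['high', 'high', 'high', 'normal']
```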

50 / 72
Trees (5)

Functional Tree
I Computes a function of multiple attribute values in each node.
I Branches based on the value returned by the function.
if (petal_length * petal_width > threshold):
    make decision
else:
    make different decision

51 / 72
Trees (6)
Regression Tree
I Predicts numeric values.
I Each node branches on the value of an attribute or on the
value of a function of the attributes.
I A leaf specifies a predicted value for corresponding instances.

52 / 72
Trees (7)
Model Trees
I Similar to a regression tree, except that a regression
equation predicts the numeric output value in each leaf.
I A regression equation predicts a numeric quantity as a
function of the attributes.
I More sophisticated than linear regression and regression trees.

53 / 72
Rules
Rule
I An expression in if-then format.
I The if part is the pre-condition or antecedent and consists
of a series of tests.
I The then part is the conclusion or consequent and assigns
values to one or more attributes.

The pre-condition may contain multiple clauses in the form of:


I Conjunction, i.e. tests linked by and, meaning that all tests
must be true to fire the rule.
I Disjunction, i.e. tests linked by or, meaning that at least one
test must be true to fire the rule.
I General logic expressions, i.e. tests linked by different
logical operators (and/or).

54 / 72
Rules (2)

Classification rules:
I Predict the class or label of an instance.
I Can be derived from a decision tree.
I One rule can be constructed for each leaf of the tree:
I The pre-condition contains a clause for each decision along the
path from the root to the leaf.
I The conclusion is the class of the leaf.
I Rule sets constructed in this way may contain redundancies,
especially if multiple leaves contain the same class.

55 / 72
Rules (3)
I Transforming a set of rules into a decision tree is also
possible, but not straightforward.
I The difficulty is choosing the order of tests, starting from the root.
I The replicated subtree problem may occur, i.e. no matter
which rule is chosen first, the other is replicated in the tree.
I Sometimes, classification rules can be significantly more
compact than decision trees.

if a and b:
    x
if c and d:
    x

56 / 72
Rules (4)

I A set of rules may fail to classify an instance.


I These situations cannot happen with decision trees.
I Ordered set of rules:
I Should be applied based on the given order (decision list).
I An individual rule out of the list may be incorrect.
I Unordered sets of rules:
I Each rule represents an independent piece of knowledge.
I Different rules may lead to different classes for one instance.

57 / 72
Rules (5)
Association rules:
I Predict an attribute of an instance.
I Similar to classification rules except that they can predict
combinations of attributes too.
I Express different regularities in the data set.
I Many different association rules, even in tiny data sets.

Interesting association rules:


I High coverage and accuracy.
I Coverage: number of instances that the rule predicts correctly.
I Accuracy: % of correctly predicted instances among those the
rule applies to.
I Typically, we seek rules with coverage and accuracy above
prescribed thresholds.
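Coverage and accuracy can be sketched as counts over the instances on which the rule's pre-condition fires (a sketch with hypothetical instances, not data from the slides):

```python
def rule_stats(instances, pre_condition, conclusion):
    """Coverage: instances the rule predicts correctly.
    Accuracy: proportion correct among instances the rule applies to."""
    fired = [x for x in instances if pre_condition(x)]
    correct = [x for x in fired if conclusion(x)]
    coverage = len(correct)
    accuracy = coverage / len(fired) if fired else 0.0
    return coverage, accuracy

data = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "play": "yes"},
]
cov, acc = rule_stats(
    data,
    lambda x: x["outlook"] == "sunny" and x["humidity"] == "high",
    lambda x: x["play"] == "no",
)
print(cov, acc)  # -> 1 0.5
```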

58 / 72
Rules (6)

Learning Rules:
I By adding new rules and refining existing rules while more
instances are added in the training set.
I A refinement may add another conjunctive clause (and) to a
pre-condition.

Rules may:
I contain functions of attribute values, e.g. area(rectangle).
I compare attribute values or functions of them, e.g.
area(rectangle) > width(rectangle).
I recursively concern different data set parts, e.g.
tallerThan(rectangle, triangle).

59 / 72
Linear Models
I A linear model is a weighted sum of attribute values.
I E.g. PRP = 2.47 · CACH + 37.06.
I All attribute values must be numeric.
I Typically visualised as a 2D scatter plot with a regression
line, i.e. a linear function that best represents the data.
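The linear model above can be evaluated directly; with one numeric attribute it is a weighted sum plus a constant (a minimal sketch using the slide's coefficients):

```python
def predict_prp(cach):
    """Linear model from the slide: PRP = 2.47 * CACH + 37.06."""
    return 2.47 * cach + 37.06

print(predict_prp(16.0))  # ≈ 76.58
```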

60 / 72
Linear Models (2)
I Linear models can be applied to classification problems, by
defining decision boundaries separating instances that
belong to different classes.
I E.g. 0.5 · PL + 0.8 · PW = 2.0.
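The boundary above can serve as a classifier: instances on each side of the line get a different class (PL and PW are assumed to be petal length and width; the class names are illustrative):

```python
def classify_iris(pl, pw):
    """Decision boundary 0.5 * PL + 0.8 * PW = 2.0 separates two classes."""
    score = 0.5 * pl + 0.8 * pw
    return "versicolor" if score > 2.0 else "setosa"

print(classify_iris(1.4, 0.2))  # -> setosa
print(classify_iris(4.7, 1.4))  # -> versicolor
```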

61 / 72
Instance-Based Representations

Instance-Based Learning
I Instead of creating models, memorise actual instances.
I Instances are knowledge representation themselves.
I For new instances, search for their closest ones in the training set.

I All work is done when classifying new instances, rather than


when processing the training set.
I Instances are compared using a distance metric.
I The closest training instances are referred to as nearest
neighbours.

62 / 72
Instance-Based Representations (2)

Euclidean Distance
Metric computing the distance between instances i and i′ with
numeric attributes.

d(i, i′) = sqrt( Σ_{j=1}^{n} (x_{i,j} − x_{i′,j})² )

I d(i, i′): distance between instances i and i′.
I n: number of attributes.
I x_{i,j}: value of attribute j for instance i.
I x_{i′,j}: value of attribute j for instance i′.
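The Euclidean distance can be sketched directly in Python (instances represented as lists of numeric attribute values):

```python
import math

def euclidean(x, y):
    """Distance between two instances with numeric attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # -> 5.0
```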

63 / 72
Instance-Based Representations (3)

Hamming Distance
Metric computing the distance between instances i and i′ with
nominal attributes.

d(i, i′): number of attributes at which i and i′ differ.

I Attribute j contributes to d(i, i′):
  0, if x_{i,j} = x_{i′,j}
  1, if x_{i,j} ≠ x_{i′,j}
I Contributes 0 if the attribute values are the same.
I Contributes 1 if the attribute values are different.
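The Hamming distance is just a count of differing attributes (a minimal sketch with instances as lists of nominal values):

```python
def hamming(x, y):
    """Number of attributes at which instances x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming(["sunny", "hot", "high"], ["sunny", "mild", "high"]))  # -> 1
```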

64 / 72
Instance-Based Representations (4)
I Often, it is not desirable to store all training instances.
I Deciding the (i) saved and (ii) discarded instances is an issue.
I Even though instance-based methods do not learn an explicit
structure, the instances and distance metric specify
boundaries distinguishing different classes.
I Some instance-based methods create rectangular regions
containing instances of the same class.

65 / 72
Clusters

Clustering
Partitions the training set into regions which can be:
I non-overlapping, i.e. each instance is in exactly one cluster,
I overlapping, i.e. an instance may appear in multiple clusters.

66 / 72
Clusters (2)

Dendrogram (Hierarchical Clustering)


I Type of tree diagram with a hierarchical structure of clusters.
I The top level partitions the space into two or more groups.
I These groups are further partitioned into subgroups and so on.

67 / 72
Model Evaluation

https://xkcd.com/242/

68 / 72
Model Evaluation (2)

I How good is the model?


I How does it perform on known data?
I How well does it predict for new data?
I A score function or error function computes the differences
between the predictions and the actual outcome.
I Typically, we want to maximize score or minimize error.
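Two common examples, sketched for illustration: classification accuracy (a score to maximise) and mean squared error (an error to minimise):

```python
def accuracy(predicted, actual):
    """Fraction of predictions matching the actual outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

print(accuracy(["yes", "no", "yes"], ["yes", "no", "no"]))  # ≈ 0.667
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))           # -> 2.0
```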

69 / 72
Model Evaluation (3)

I Different data mining tasks use different score functions.


I We will cover scoring and evaluation approaches together
with data mining methods.
I The evaluation also depends on the type of the modelled data.
I For example, model evaluation may require:
I measuring quantitative differences,
I counting the number of correct predictions,
I statistical measures, e.g. t-test.

70 / 72
Model Evaluation (4)

I Typically, we divide the data set into:


I the training set for model building,
I the validation set for model selection,
I the test set for model evaluation.
I If the training set or test set is not a representative sample,
then we will not build a strong model.
I A model can only be as good as the data used to construct it.
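One simple (hypothetical) way to carve a data set into the three parts, with shuffling so each part is a representative sample:

```python
import random

def split_dataset(instances, train=0.6, validation=0.2, seed=42):
    """Shuffle, then slice into training / validation / test sets."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train)
    n_val = int(len(data) * validation)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(10))
print(len(train_set), len(val_set), len(test_set))  # -> 6 2 2
```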

71 / 72
Lecture Contents

I Section 1: Basic Elements


I Definition, Real-World Applications
I Data Sets, Patterns, General Picture

I Section 2: Conceptual Framework


I Concrete Tasks
I Process
I Relation to Other Fields

I Section 3: Models
I Knowledge Representations
I Evaluation

72 / 72
