Mekelle University-Mekelle Institute of Technology Department of Information Technology Data Mining and Knowledge Discovery

The document provides an overview of data mining and knowledge discovery. It defines data mining as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining involves techniques from machine learning, statistics, databases, and other fields to discover patterns in large data sets. It discusses how vast amounts of data are now collected and stored, creating opportunities to apply data mining to gain useful knowledge and insights. The document outlines some common data mining tasks like classification, clustering, and association rule mining and the types of patterns they can reveal in databases, data warehouses, and transactional data.

Uploaded by

Yrga Weldegiwergs


MEKELLE UNIVERSITY-MEKELLE INSTITUTE OF

TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

DATA MINING AND KNOWLEDGE DISCOVERY

Halefom Tekle
Friday, February 5, 2021
Outlines
Chapter 1: Definition
 Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
 What is not data mining?
 Looking up a phone number in a phone directory
 Querying a Web search engine for information about “Amazon”
 What is data mining?
 Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
 Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Con.
 Data mining is a technique for discovering interesting
patterns from data
 Data mining is also known as knowledge discovery from data.
 It is a multi-disciplinary field involving
 Machine learning
 Statistics
 Databases
 Artificial intelligence
 Information retrieval, and
 Visualization
1.1 Why Data Mining? Commercial view

 We live in a world where vast amounts of data are
collected daily.
 Lots of data is being collected and warehoused
 Web data, e-commerce
 purchases at department/grocery stores
 Bank/Credit Card transactions

 Computers have become cheaper and more powerful

 Competitive Pressure is Strong
 Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
1.3 Motivation

 There is often information “hidden” in the data that is
not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
1.4 Data Mining as the Evolution of Information
Technology
 Data mining can be viewed as a result of the natural evolution of
information technology.
 The stages of this evolution are
 Data collection and database creation
 Database management system
 Advanced database system
 Advanced data analysis
 The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of
effective mechanisms for data storage and retrieval, as well as query
and transaction processing.
 Nowadays numerous database systems offer query and transaction
processing as common practice.
 Advanced data analysis has naturally become the next step.
Con.
Con.
This means the world is data rich but information poor.

So, we need tools to extract the valuable knowledge
embedded in the vast amounts of data to support the decision
maker’s intuition.
Con.

Data mining
 Is the process of discovering interesting patterns and
knowledge from large amounts of data.
 Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or
KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
 The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
 The knowledge discovery process is an iterative sequence
of steps.
Con.
 Pre-processing:
 The raw data is usually not suitable for mining due to
various reasons.
 Data mining:
 The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
 Post-processing:
 In many applications, not all discovered patterns are
useful. This step identifies those useful ones for
applications. Various evaluation and visualization
techniques are used to make the decision.
Con.
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
retrieved from the database
4. Data transformation: where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining: an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to
users
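The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch: the records, the function names, and the "best-selling item" pattern are all hypothetical and stand in for real cleaning, mining, and evaluation methods.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item, amount).
store_a = [("ann", "laptop", 900.0), ("bob", "camera", None), ("ann", "laptop", 900.0)]
store_b = [("cy", "laptop", 950.0), ("dee", "printer", 120.0)]

def clean(records):
    """Step 1: drop records with a missing amount and exact duplicates."""
    seen, out = set(), []
    for rec in records:
        if rec[2] is None or rec in seen:
            continue
        seen.add(rec)
        out.append(rec)
    return out

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [rec for src in sources for rec in src]

def select(records, min_amount):
    """Step 3: keep only the records relevant to the analysis task."""
    return [rec for rec in records if rec[2] >= min_amount]

def transform(records):
    """Step 4: aggregate to the total sales amount per item."""
    totals = Counter()
    for _, item, amount in records:
        totals[item] += amount
    return dict(totals)

def mine(totals):
    """Step 5: a deliberately trivial 'pattern': the best-selling item."""
    return max(totals, key=totals.get)

def evaluate(totals, pattern, min_share=0.5):
    """Step 6: call the pattern interesting if the item holds at least min_share of sales."""
    return totals[pattern] / sum(totals.values()) >= min_share

data = select(integrate(clean(store_a), clean(store_b)), min_amount=100.0)
totals = transform(data)
pattern = mine(totals)
interesting = evaluate(totals, pattern)

# Step 7: present the mined knowledge to the user.
print(f"Top item: {pattern}; totals: {totals}; interesting: {interesting}")
```

In practice the process is iterative: a disappointing result at step 6 usually sends the analyst back to selection or transformation.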
1.5 What Kinds of Data Can Be Mined?
 Data mining can be applied to any kind of data as long as the data
are meaningful for a target application.
 The most basic forms of data for mining applications are
 Database data
 Data warehouse data
 Transactional data
 Can also be applied to other forms of data
 data streams
 ordered/sequence data
 graph or networked data
 text data
 multimedia data (audio, video, image)
 and WWW
Con.
1.5.1 Database data
 Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
annual income, credit information, category, . . .)
Item: (item_ID, brand, category, type, price, place made,
supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method
paid, amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
Con.
 Database data
 Relational data can be accessed by database queries written in a
relational query language (e.g., SQL, as supported by PostgreSQL, …) or
 With the assistance of graphical user interfaces.
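As a rough illustration of querying relational data, the sketch below loads a cut-down version of the Customer and Purchases relations into an in-memory SQLite database and asks for total spending per customer. The rows, the reduced column lists, and the use of SQLite are assumptions made for the example.

```python
import sqlite3

# In-memory database with a simplified subset of the AllElectronics schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, amount REAL);
INSERT INTO customer VALUES ('C1', 'Ann', 34), ('C2', 'Bob', 51);
INSERT INTO purchases VALUES
    ('T100', 'C1', 1200.0), ('T200', 'C1', 300.0), ('T300', 'C2', 80.0);
""")

# Join the two relations and aggregate the purchase amounts per customer.
rows = conn.execute("""
SELECT c.name, SUM(p.amount) AS total
FROM customer AS c
JOIN purchases AS p ON p.cust_ID = c.cust_ID
GROUP BY c.cust_ID, c.name
ORDER BY total DESC
""").fetchall()

print(rows)
conn.close()
```

A query like this answers a question the analyst already knows how to ask; data mining goes further by searching for patterns that were not specified in advance.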

 The mining tasks are
 Prediction methods
 Use some variables to predict unknown or future values of
other variables.
 Example: predict the credit risk of new customers
 Example: detect deviations, that is, items with sales that are far from
those expected in comparison with the previous year
 Description methods
 Find human-interpretable patterns that describe the data.
Con.

 Predictive
 Classification
 Regression
 Deviation Detection
 Descriptive
 Clustering
 Association Rule Discovery
 Sequential Pattern Discovery
Con.
1.5.2 Data warehouse
 Is a repository of multiple heterogeneous data sources
organized under a unified schema at a single site to
facilitate management decision making.

 Data warehouse technology includes data cleaning, data
integration, and online analytical processing (OLAP).
 OLAP is a set of analysis techniques with functionalities such
as summarization, consolidation, and aggregation, as well
as the ability to view information from different angles.
Con.

 Although OLAP tools support multidimensional analysis and
decision making, additional data analysis tools are required
for in-depth analysis, for example, data mining tools that
provide data classification, clustering, outlier/anomaly
detection, and the characterization of changes in data over
time.
 A data warehouse is usually modeled by a multidimensional
data structure, called a data cube, in which each dimension
corresponds to an attribute or a set of attributes in the schema,
and each cell stores the value of some aggregate measure such
as count or sum (sales_amount).
 A data cube provides a multidimensional view of data and
allows the precomputation and fast access of summarized data.
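To make the data cube idea concrete, here is a minimal sketch that precomputes sum(sales_amount) for every combination of dimensions. The dimension names (time, item, location) and the fact rows are hypothetical, and real warehouses use far more efficient cube computation than this brute-force loop.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (time, item, location, sales_amount).
facts = [
    ("Q1", "computer", "Chicago", 800.0),
    ("Q1", "phone", "Chicago", 200.0),
    ("Q2", "computer", "Toronto", 500.0),
    ("Q2", "computer", "Chicago", 300.0),
]
dimensions = ("time", "item", "location")

def build_cube(rows):
    """Precompute sum(sales_amount) for every subset of the dimensions."""
    cube = defaultdict(float)
    n = len(dimensions)
    for *dims, amount in rows:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # "*" marks a dimension that has been aggregated away.
                cell = tuple(dims[i] if i in kept else "*" for i in range(n))
                cube[cell] += amount
    return cube

cube = build_cube(facts)

# Summaries are now simple lookups instead of scans over the fact rows.
print(cube[("*", "computer", "*")])   # all computer sales
print(cube[("Q1", "*", "Chicago")])   # Q1 sales in Chicago
print(cube[("*", "*", "*")])          # grand total
```

Each key with "*" entries is one cell of a summarized view; looking at the same facts through different keys is the "different angles" that OLAP provides.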
Con.
 Suppose AllElectronics has a data warehouse.
Con.
1.5.3 Transactional Data
 Each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a
web page.
 A transaction typically includes
 a unique transaction identity number (trans ID) and
 a list of the items making up the transaction, such as the items
purchased in the transaction.
 A transactional database may have additional tables, which
contain other information related to the transactions
 such as item description,
 information about the salesperson or the branch, and so on.
1.6 What Kinds of Patterns Can Be Mined?
 There are a number of data mining functionalities. These include
 Characterization and discrimination
 Mining of frequent patterns, associations, and correlations

 Classification and regression

 Clustering analysis

 Outlier analysis

 Data mining functionalities are used to specify the kinds of patterns to
be found in data mining tasks.
 Such tasks can be classified into two categories:
 Descriptive and
 Predictive.
 Descriptive mining tasks characterize properties of the data in a target
data set.
 Predictive mining tasks perform induction on the current data in order
to make predictions.
Con.
1.6.1 Class/Concept Description: Characterization and Discrimination
 Data entries can be associated with classes or concepts.
 For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms.
 Such descriptions of a class or a concept are called class/concept
descriptions.
 These descriptions can be derived using
 Data characterization, by summarizing the data of the class under study
(often called the target class) in general terms
 Data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes) or
 both data characterization and discrimination.
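A minimal sketch of the two approaches, assuming a hypothetical customer table in which each row is already tagged with its concept. Characterization summarizes the target class by attribute averages, and discrimination reports how those averages differ from a contrasting class.

```python
from statistics import mean

# Hypothetical customers already associated with a concept.
customers = [
    {"age": 45, "annual_spend": 6000, "concept": "bigSpenders"},
    {"age": 35, "annual_spend": 8000, "concept": "bigSpenders"},
    {"age": 22, "annual_spend": 400, "concept": "budgetSpenders"},
    {"age": 28, "annual_spend": 600, "concept": "budgetSpenders"},
]
ATTRIBUTES = ("age", "annual_spend")

def characterize(rows, concept):
    """Summarize the target class in general terms (here: attribute averages)."""
    members = [r for r in rows if r["concept"] == concept]
    return {a: mean(r[a] for r in members) for a in ATTRIBUTES}

def discriminate(rows, target, contrasting):
    """Compare the target class with a contrasting class, attribute by attribute."""
    t = characterize(rows, target)
    c = characterize(rows, contrasting)
    return {a: t[a] - c[a] for a in ATTRIBUTES}

profile = characterize(customers, "bigSpenders")
difference = discriminate(customers, "bigSpenders", "budgetSpenders")
print(profile)
print(difference)
```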
Con.
1.6.2 Mining Frequent Patterns, Associations, and
Correlations
 Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
 There are many kinds of frequent patterns
 Frequent itemsets
 a set of items that often appear together in a transactional data set,
e.g., milk and bread
 Frequent subsequences (also known as sequential patterns)
 e.g., customers tend to purchase first a laptop, followed by a digital
camera, and then a memory card
 Frequent substructures
 can refer to different structural forms (e.g., graphs, trees, or lattices) that
may be combined with itemsets or subsequences.

Con.

 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 Association analysis.
 Suppose that, as a marketing manager at AllElectronics, you want to
know which items are frequently purchased together (i.e., within the
same transaction).
 buys(X, “computer”) => buys(X, “software”) [support = 1%,
confidence = 50%]
 a single-dimensional association rule (one predicate: buys).
 age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”)
[support = 2%, confidence = 60%]
 a multidimensional association rule (predicates: age, income, buys).
 Typically, association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold and a minimum
confidence threshold.
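Support and confidence can be computed directly from a list of transactions. The sketch below uses a tiny hypothetical transaction set and hypothetical thresholds: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

def is_interesting(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is kept only if it meets both minimum thresholds."""
    sup = support(set(antecedent) | set(consequent), transactions)
    conf = confidence(antecedent, consequent, transactions)
    return sup >= min_sup and conf >= min_conf

# Rule: buys(X, "computer") => buys(X, "software")
sup = support({"computer", "software"}, transactions)      # 2 of 5 transactions
conf = confidence({"computer"}, {"software"}, transactions)  # 2 of 3 computer buyers
keep = is_interesting({"computer"}, {"software"}, transactions, 0.3, 0.6)
print(sup, conf, keep)
```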
Con.
1.6.3 Classification and Regression for Predictive Analysis
 Classification (e.g., naïve Bayesian, SVM, and KNN)
 Is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
 The model is derived based on the analysis of a set of training
data (i.e., data objects for which the class labels are known).
 The model is used to predict the class label of objects for which
the class label is unknown.
 It predicts categorical (discrete, unordered) labels
 Regression analysis
 Is a statistical methodology that is most often used for
numeric prediction
 It predicts continuous-valued (numeric) outputs rather than discrete class labels
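As one concrete instance of classification, here is a minimal k-nearest-neighbor (KNN) sketch. The training objects, the two features (age, income in thousands), and the choice k = 3 are assumptions for illustration; the labels reuse the bigSpenders and budgetSpenders concepts from the earlier slides.

```python
import math
from collections import Counter

# Hypothetical training data: ((age, income in thousands), known class label).
training = [
    ((22, 20), "budgetSpenders"),
    ((25, 25), "budgetSpenders"),
    ((30, 28), "budgetSpenders"),
    ((45, 90), "bigSpenders"),
    ((50, 95), "bigSpenders"),
    ((40, 85), "bigSpenders"),
]

def knn_predict(training, query, k=3):
    """Predict the label of query by majority vote among its k nearest training objects."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Objects whose class label is unknown.
print(knn_predict(training, (48, 88)))
print(knn_predict(training, (24, 22)))
```

Regression follows the same train-then-predict pattern, but the output is a number rather than one of a fixed set of labels.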
Con.
Con.

1.6.4 Cluster Analysis

 Unlike classification and regression, which analyze class-labeled
(training) data sets, clustering analyzes data objects without
consulting class labels.
 In many cases, class-labeled data may simply not exist at the
beginning.
 Clustering can be used to generate class labels for a group of
data.
 The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the
interclass similarity.
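The slides do not name a particular clustering algorithm; the sketch below uses k-means as one common choice. The points and the initial centroids are hypothetical, and the initial centroids are fixed by hand so that the run is repeatable.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])

# The cluster index can then serve as a generated class label.
for label, cluster in enumerate(clusters):
    print(label, cluster)
```

Points inside each resulting cluster are close to one another and far from the other cluster, which is the similarity principle stated above.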
Con.
Con.
1.6.5 Outlier Analysis
 A data set may contain objects that do not comply with the
general behavior or model of the data.
 These data objects are outliers.
 Many data mining methods discard outliers as noise or
exceptions.
 However, in some applications (e.g., fraud detection) the rare
events can be more interesting than the more regularly
occurring ones.
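One simple way to flag such objects is a z-score test, sketched below. This is only one of many outlier detection approaches; the transaction amounts and the threshold of 2 standard deviations are assumptions for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one suspicious charge.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

In a fraud detection setting the flagged value is the object of interest, not noise to be discarded.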
1.7 Which Technologies Are Used?
Con.

 A statistical model
 Is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their
associated probability distributions.
 Machine Learning
 Machine learning investigates how computers can learn (or improve
their performance) based on data.
 A main research area is for computer programs to automatically learn
to recognize complex patterns and make intelligent decisions based on
data.
 learning methods
 Supervised

 Unsupervised

 Semi-supervised

 Reinforcement
Which Kinds of Applications Are Targeted?

 Business Intelligence
 Understanding the organization’s commercial context:
customers, the market, supply and resources, and
competitors
 Provides historical, current, and predictive views of business
operations
 Web Search Engines
 Have to deal with
 a huge and ever-growing amount of data

 online data

 queries that are asked only a very small number of times

 Bioinformatics and health informatics

 Finance, digital libraries, and digital governments.
1.8 Major Issues in Data Mining
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multidimensional space
 Data mining—an interdisciplinary effort
 Boosting the power of discovery in a networked environment
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Ad hoc data mining and data mining query languages
 Presentation and visualization of data mining results
 Efficiency and Scalability
 Efficiency, scalability, performance, optimization, ability to execute in real time
 Parallel, distributed, and incremental mining algorithms
 Diversity of Database Types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data Mining and Society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Exercises
 How is a data warehouse different from a database? How are
they similar?
 What are the major challenges of mining a huge amount of
data (e.g., billions of tuples) in comparison with mining a
small amount of data (e.g., a data set of a few hundred tuples)?
 Define each of the following data mining functionalities:
characterization, discrimination, association and correlation
analysis, classification, regression, clustering, and outlier
analysis. Give examples of each data mining functionality,
using a real-life database that you are familiar with.
