Mekelle University-Mekelle Institute of Technology Department of Information Technology Data Mining and Knowledge Discovery

The document provides an overview of data mining and knowledge discovery. It defines data mining as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining involves techniques from machine learning, statistics, databases, and other fields to discover patterns in large data sets. It discusses how vast amounts of data are now collected and stored, creating opportunities to apply data mining to gain useful knowledge and insights. The document outlines some common data mining tasks like classification, clustering, and association rule mining and the types of patterns they can reveal in databases, data warehouses, and transactional data.
MEKELLE UNIVERSITY-MEKELLE INSTITUTE OF TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

DATA MINING AND KNOWLEDGE DISCOVERY

Halefom Tekle
Friday, February 5, 2021
Outlines
Chapter 1: Definition
 Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
 What is not Data Mining?
 Look up a phone number in a phone directory
 Query a Web search engine for information about “Amazon”
 What is Data Mining?
 Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
 Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Con.
 Data mining is a technique for discovering interesting
patterns from data
 Data mining is also known as knowledge discovery from data.
 It is a multi-disciplinary field involving
 Machine learning
 Statistics
 Databases
 Artificial intelligence
 Information retrieval, and
 Visualization
1.1 Why Data Mining? Commercial view

 We live in a world where vast amounts of data are collected daily.
 Lots of data is being collected and warehoused
 Web data, e-commerce
 purchases at department/grocery stores
 Bank/Credit Card transactions

 Computers have become cheaper and more powerful


 Competitive Pressure is Strong
 Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
1.3 Motivation

 There is often information “hidden” in the data that is not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
1.4 Data Mining as the Evolution of Information
Technology
 Data mining can be viewed as a result of the natural evolution of
information technology.
 The stages of this evolution are
 Data collection and database creation
 Database management system
 Advanced database system
 Advanced data analysis
 The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of
effective mechanisms for data storage and retrieval, as well as query
and transaction processing.
 Nowadays numerous database systems offer query and transaction
processing as common practice.
 Advanced data analysis has naturally become the next step.
Con.
 This means the world is data rich but information poor.
 So, we need tools to extract the valuable knowledge embedded in the vast amounts of data to support the decision maker’s intuition.
Con.

Data mining
 Is the process of discovering interesting patterns and
knowledge from large amounts of data.
 Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or
KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
 The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
 The knowledge discovery process is an iterative sequence of steps.
Con.
 Pre-processing:
 The raw data is usually not suitable for mining, for example
because it is noisy or inconsistent.
 Data mining:
 The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
 Post-processing:
 In many applications, not all discovered patterns are
useful. This step identifies those useful ones for
applications. Various evaluation and visualization
techniques are used to make the decision.
Con.
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
retrieved from the database
4. Data transformation: where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining: an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to
users
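The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a method from the slides: the records, the field names (`cust`, `item`, `amount`), and the threshold `min_total` are all hypothetical, and the "mining" step is a simple ranking standing in for a real algorithm.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"cust": "c1", "item": "milk", "amount": 2.0},
    {"cust": "c2", "item": "bread", "amount": None},
]
store_b = [
    {"cust": "c3", "item": "milk", "amount": 3.0},
    {"cust": "c1", "item": "milk", "amount": 2.5},
    {"cust": "c4", "item": "bread", "amount": 1.0},
]

def clean(records):
    # Step 1: data cleaning (here: drop records with missing values).
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    # Step 2: data integration (combine several sources into one list).
    return [r for source in sources for r in source]

def select(records, fields):
    # Step 3: data selection (keep only the fields relevant to the task).
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Step 4: data transformation (aggregate the amount per item).
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    # Step 5: data mining (toy stand-in: rank items by total sales).
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    # Step 6: pattern evaluation (keep patterns above a threshold).
    return [(item, total) for item, total in patterns if total >= min_total]

def present(patterns):
    # Step 7: knowledge presentation (format the result for the user).
    return [f"{item}: total sales {total:.2f}" for item, total in patterns]

records = integrate(clean(store_a), clean(store_b))
totals = transform(select(records, ["item", "amount"]))
report = present(evaluate(mine(totals), min_total=5.0))
print(report)
```

In practice the process is iterative: a poor result at step 6 usually sends the analyst back to an earlier step with different selections or transformations.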
1.5 What Kinds of Data Can Be Mined?
 Data mining can be applied to any kind of data as long as the data
are meaningful for a target application.
 The most basic forms of data for mining applications are
 Database data
 Data warehouse data
 Transactional data
 Can also be applied to other forms of data
 data streams
 ordered/sequence data
 graph or networked data
 text data
 multimedia data (audio, video, image)
 and WWW
Con.
1.5.1 Database data
 Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
annual income, credit information, category, . . .)
Item: (item_ID, brand, category, type, price, place made,
supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method
paid, amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
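As a rough illustration, part of this schema can be created and queried with Python's built-in `sqlite3` module. Only two of the relations are modelled, with a reduced set of attributes, and all rows are invented sample data.

```python
import sqlite3

# In-memory database holding a simplified subset of the AllElectronics schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, amount REAL);
""")

# Hypothetical sample rows.
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", "Abebe", 34), ("C2", "Sara", 27)],
)
cur.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("T100", "C1", 250.0), ("T200", "C2", 80.0), ("T300", "C1", 120.0)],
)

# Relational query: total amount spent per customer, largest first.
rows = cur.execute("""
SELECT c.name, SUM(p.amount) AS total
FROM customer AS c
JOIN purchases AS p ON p.cust_ID = c.cust_ID
GROUP BY c.cust_ID, c.name
ORDER BY total DESC
""").fetchall()
print(rows)
conn.close()
```

A query like this answers a question that was stated in advance; data mining goes further by searching for patterns that were not specified beforehand.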
Con.
 Database data
 Relational data can be accessed by database queries written in a relational query language such as SQL (supported by systems such as PostgreSQL), or
 with the assistance of graphical user interfaces.
 Mining tasks fall into two groups
 Prediction methods
 Use some variables to predict unknown or future values of other variables.
 Example: predict the credit risk of new customers.
 Example: detect deviations, that is, items with sales that are far from those expected in comparison with the previous year.
 Description methods
 Find human-interpretable patterns that describe the data.
Con.

 Predictive: Classification, Regression, Deviation Detection
 Descriptive: Clustering, Association Rule Discovery, Sequential Pattern Discovery
Con.
1.5.2 Data warehouse
 Is a repository of multiple heterogeneous data sources
organized under a unified schema at a single site to
facilitate management decision making.

 Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP)
 OLAP is a set of analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.
Con.

 Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, for example, data mining tools that provide data classification, clustering, outlier/anomaly detection, and the characterization of changes in data over time.
 A data warehouse is usually modeled by a multidimensional
data structure, called a data cube, in which each dimension
corresponds to an attribute or a set of attributes in the schema,
and each cell stores the value of some aggregate measure such
as count or sum (sales_amount).
 A data cube provides a multidimensional view of data and
allows the precomputation and fast access of summarized data.
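A minimal sketch of the data cube idea follows, assuming three hypothetical dimensions (time, item, location) and `sum(sales_amount)` as the measure. Every combination of kept and rolled-up dimensions is precomputed, so a summary is later a single dictionary lookup. The fact rows are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (time, item, location, sales_amount).
facts = [
    ("Q1", "computer", "Mekelle", 400.0),
    ("Q1", "printer", "Mekelle", 100.0),
    ("Q1", "computer", "Addis Ababa", 600.0),
    ("Q2", "computer", "Mekelle", 500.0),
]

def build_cube(rows):
    """Precompute sum(sales_amount) for every subset of the dimensions."""
    cube = defaultdict(float)
    for *dims, amount in rows:
        n = len(dims)
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # A dimension that is not kept is rolled up to "*" (all values).
                cell = tuple(dims[i] if i in kept else "*" for i in range(n))
                cube[cell] += amount
    return cube

cube = build_cube(facts)
print(cube[("Q1", "*", "*")])        # total sales in Q1
print(cube[("*", "computer", "*")])  # total computer sales
print(cube[("*", "*", "*")])         # grand total
```

Precomputing all 2^n groupings trades storage for speed; real warehouses usually materialize only some of them.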
Con.
 Suppose AllElectronics has a data warehouse
Con.
1.5.3 Transactional Data
 Each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a
web page.
 A transaction typically includes
 a unique transaction identity number (trans ID) and
 a list of the items making up the transaction, such as the items
purchased in the transaction.
 A transactional database may have additional tables, which
contain other information related to the transactions
 such as item description,
 information about the salesperson or the branch, and so on.
1.6 What Kinds of Patterns Can Be Mined?
 There are a number of data mining functionalities. These include
 Characterization and discrimination
 Mining of frequent patterns, associations, and correlations

 Classification and regression

 Clustering analysis

 Outlier analysis

 Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
 Such tasks can be classified into two categories:
 Descriptive and
 Predictive.
 Descriptive mining tasks characterize properties of the data in a target data set.
 Predictive mining tasks perform induction on the current data in order to make predictions.
Con.
1.6.1 Class/Concept Description: Characterization and Discrimination
 Data entries can be associated with classes or concepts.
 For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms.
 Such descriptions of a class or a concept are called class/concept
descriptions.
 These descriptions can be derived using
 Data characterization, by summarizing the data of the class under study
(often called the target class) in general terms
 Data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes) or
 both data characterization and discrimination.
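One simple way to picture characterization and discrimination is to summarize each class by attribute averages and then compare the summaries. The sketch below assumes invented customer records with hypothetical attributes `age` and `annual_income`; real systems use richer summaries than a mean.

```python
from statistics import mean

# Hypothetical customer records labelled with a concept.
customers = [
    {"concept": "bigSpenders", "age": 40, "annual_income": 80000},
    {"concept": "bigSpenders", "age": 50, "annual_income": 100000},
    {"concept": "budgetSpenders", "age": 20, "annual_income": 20000},
    {"concept": "budgetSpenders", "age": 30, "annual_income": 30000},
]
ATTRIBUTES = ("age", "annual_income")

def characterize(rows, target):
    """Summarize the target class by the mean of each attribute."""
    members = [r for r in rows if r["concept"] == target]
    return {a: mean(r[a] for r in members) for a in ATTRIBUTES}

def discriminate(rows, target, contrasting):
    """Compare the target class with a contrasting class, attribute by attribute."""
    t = characterize(rows, target)
    c = characterize(rows, contrasting)
    return {a: t[a] - c[a] for a in ATTRIBUTES}

profile = characterize(customers, "bigSpenders")
difference = discriminate(customers, "bigSpenders", "budgetSpenders")
print(profile)
print(difference)
```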
Con.
1.6.2 Mining Frequent Patterns, Associations, and
Correlations
 Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
 There are many kinds of frequent patterns
 Frequent itemsets
 A set of items that often appear together in a transactional data set, for example, milk and bread
 Frequent subsequences (also known as sequential patterns)
 For example, customers tend to purchase first a laptop, followed by a digital camera, and then a memory card
 Frequent substructures
 Can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences.
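To make frequent itemsets concrete, the sketch below counts every item pair in a handful of invented transactions and keeps the pairs that reach a minimum count. It is a brute-force illustration under those assumptions, not an efficient algorithm such as Apriori, and `min_count` is a hypothetical parameter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def frequent_itemsets(transactions, size, min_count):
    """Return itemsets of the given size that occur in at least min_count transactions."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each itemset one canonical tuple representation.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

pairs = frequent_itemsets(transactions, size=2, min_count=2)
print(pairs)
```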


Con.

 Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
 Association analysis.
 Suppose that, as a marketing manager at AllElectronics, you want to
know which items are frequently purchased together (i.e., within the
same transaction).
 buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
 A single-dimensional association rule (the only predicate is buys).
 age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]
 A multidimensional association rule (predicates age, income, and buys).
 Typically, association rules are discarded as uninteresting if they
do not satisfy both a minimum support threshold and a minimum
confidence threshold.
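Support and confidence can be computed directly from transaction counts: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left side that also contain the right side. The transactions and thresholds below are invented for illustration.

```python
# Hypothetical transactions at an electronics store.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    # Fraction of all transactions containing both sides of the rule.
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    # Among transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return count(antecedent | consequent) / count(antecedent)

def is_interesting(antecedent, consequent, min_support, min_confidence):
    # A rule is kept only if it meets BOTH thresholds.
    return (support(antecedent, consequent) >= min_support
            and confidence(antecedent, consequent) >= min_confidence)

lhs, rhs = {"computer"}, {"software"}
print(support(lhs, rhs), confidence(lhs, rhs))
print(is_interesting(lhs, rhs, min_support=0.3, min_confidence=0.6))
```

Here the rule computer ⇒ software has support 2/5 and confidence 2/4, so it passes the support threshold but is discarded for failing the confidence threshold.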
Con.
1.6.3 Classification and Regression for Predictive Analysis
 Classification (e.g., naïve Bayesian, SVM, and KNN classifiers)
 Is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
 The model is derived based on the analysis of a set of training
data (i.e., data objects for which the class labels are known).
 The model is used to predict the class label of objects for which
the class label is unknown.
 It predicts categorical (discrete, unordered) labels
 Regression analysis
 Is a statistical methodology that is most often used for
numeric prediction
 It predicts continuous-valued functions (numeric values) rather than class labels
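The contrast between the two tasks can be sketched with one small example of each: a k-nearest-neighbor classifier that predicts a categorical label, and a least-squares line that predicts a number. Both are minimal illustrations; the training data (age, income in thousands, credit risk) and the function names are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training objects."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Classification: training objects with known class labels (credit risk).
train = [
    ((25, 20), "high"), ((30, 25), "high"), ((28, 22), "high"),
    ((45, 80), "low"), ((50, 90), "low"), ((48, 85), "low"),
]
label = knn_predict(train, (47, 82))

# Regression: predict a continuous value from a numeric input.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept
print(label, slope, intercept, prediction)
```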
Con.
Con.

1.6.4 Cluster Analysis


 Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels.
 In many cases, class-labeled data may simply not exist at the
beginning.
 Clustering can be used to generate class labels for a group of
data.
 The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the
interclass similarity.
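As one possible illustration, the sketch below uses k-means, a common clustering algorithm that the slides do not name. It groups unlabeled points so that each point is close to the centroid of its own cluster. The points are invented and the initial centroids are chosen by hand (an assumption made so the result is reproducible).

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            # Keep the old centroid if a cluster ends up empty.
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The index of the cluster a point falls into can then serve as a generated class label for that point.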
Con.
Con.
1.6.5 Outlier Analysis
 A data set may contain objects that do not comply with the
general behavior or model of the data.
 These data objects are outliers.
 Many data mining methods discard outliers as noise or
exceptions.
 However, in some applications (e.g., fraud detection) the rare
events can be more interesting than the more regularly
occurring ones
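A simple statistical way to flag outliers is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes invented transaction amounts and a hypothetical threshold of 2.0; other detection methods (distance-based, density-based) exist and may suit real data better.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in standard
    deviations) exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud detection setting the flagged value is the object of interest, not noise to be discarded. Note that with very small samples a single extreme value inflates the standard deviation, so a high threshold may never be reached.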
1.7 Which Technologies Are Used?
Con.

 A statistical model
 Is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their
associated probability distributions.
 Machine Learning
 Machine learning investigates how computers can learn (or improve
their performance) based on data.
 A main research area is for computer programs to automatically learn
to recognize complex patterns and make intelligent decisions based on
data.
 Learning methods
 Supervised
 Unsupervised
 Semi-supervised
 Reinforcement
Which Kinds of Applications Are Targeted?

 Business Intelligence
 Helps an organization understand its commercial context: customers, the market, supply and resources, and competitors
 Provides historical, current, and predictive views of business operations
 Web Search Engines
 Have to deal with
 a huge and ever-growing amount of data
 online data
 queries that are asked only a very small number of times
 Bioinformatics and health informatics
 Finance, digital libraries, and digital governments.
1.8 Major Issues in Data Mining
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multidimensional space
 Data mining—an interdisciplinary effort
 Boosting the power of discovery in a networked environment
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Ad hoc data mining and data mining query languages
 Presentation and visualization of data mining results
 Efficiency and Scalability
 Efficiency, scalability, performance, optimization, ability to execute in real time
 Parallel, distributed, and incremental mining algorithms
 Diversity of Database Types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data Mining and Society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Exercises
 How is a data warehouse different from a database? How are
they similar?
 What are the major challenges of mining a huge amount of
data (e.g., billions of tuples) in comparison with mining a
small amount of data (e.g., a data set of a few hundred tuples)?
 Define each of the following data mining functionalities:
characterization, discrimination, association and correlation
analysis, classification, regression, clustering, and outlier
analysis. Give examples of each data mining functionality,
using a real-life database that you are familiar with.
