Mekelle University-Mekelle Institute of Technology Department of Information Technology Data Mining and Knowledge Discovery
Halefom Tekle
Friday, February 5, 2021
Chapter 1: Definition
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
What is not Data mining? What is Data Mining?
Data mining
Is the process of discovering interesting patterns and
knowledge from large amounts of data.
Many people treat data mining as a synonym for another
popularly used term, knowledge discovery from data, or
KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
The knowledge discovery process is an iterative sequence of steps,
which fall into three broad phases:
Pre-processing:
The raw data is usually not suitable for mining due to
various reasons.
Data mining:
The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
Post-processing:
In many applications, not all discovered patterns are
useful. This step identifies those useful ones for
applications. Various evaluation and visualization
techniques are used to make the decision.
In detail, the knowledge discovery steps are:
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are
retrieved from the database
4. Data transformation: where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining: an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns
representing knowledge based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to
users
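The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real mining system: the two sources (store_a, store_b), the record fields, and the thresholds are all hypothetical, and the "mining" step is just a ranking of items by total revenue.

```python
from collections import Counter

# Two hypothetical data sources with noisy, inconsistent records.
store_a = [
    {"cust": "c1", "item": "Laptop ", "amount": 900.0},
    {"cust": "c2", "item": "camera", "amount": None},  # missing value
]
store_b = [
    {"cust": "c1", "item": "camera", "amount": 300.0},
    {"cust": "c3", "item": "LAPTOP", "amount": 950.0},
    {"cust": "c3", "item": "memory card", "amount": 20.0},
]

def clean(records):
    """Step 1: drop records with missing amounts, normalise item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for src in sources for r in src]

def select(records, min_amount):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Step 4: aggregate the amount per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5: a trivial 'pattern': items ranked by total revenue."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Step 6: keep only patterns above an interestingness threshold."""
    return [p for p in patterns if p[1] >= min_total]

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: {total:.2f}" for item, total in patterns]

def run_pipeline():
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, min_amount=100.0)
    patterns = evaluate(mine(transform(data)), min_total=500.0)
    return present(patterns)

print(run_pipeline())
```

In practice the process is iterative: the result of pattern evaluation often sends the analyst back to the selection or transformation step with different settings.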
1.5 What Kinds of Data Can Be Mined?
Data mining can be applied to any kind of data as long as the data
are meaningful for a target application.
The most basic forms of data for mining applications are
Database data
Data warehouse data
Transactional data
Can also be applied to other forms of data
data streams
ordered/sequence data
graph or networked data
text data
multimedia data (audio, video, image)
the World Wide Web (WWW)
1.5.1 Database data
Consider a relational database for AllElectronics.
Customer: (cust_ID, name, address, age, occupation,
annual income, credit information, category, . . .)
Item: (item_ID, brand, category, type, price, place made,
supplier, cost, . . . )
Employee: (empl_ID, name, category, group, salary,
commission, . . . )
Branch: (branch_ID, name, address, . . . )
Purchases: (trans_ID, cust_ID, empl_ID, date, time, method
paid, amount)
Items_sold: (trans_ID, item_ID, qty)
Works_at: (empl_ID, branch_ID)
Relational data can be accessed by database queries written in a
relational query language (SQL, as supported by systems such as PostgreSQL, …) or
with the assistance of graphical user interfaces.
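As a minimal sketch of such a query, the following uses Python's built-in sqlite3 module with cut-down versions of the Purchases and Items_sold relations. The column subset, the sample rows, and the helper names build_demo_db and items_bought_by are assumptions made for illustration only.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two simplified AllElectronics tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Purchases (trans_ID TEXT, cust_ID TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE Items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER)"
    )
    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO Purchases VALUES (?, ?, ?)",
        [("T100", "C1", 1200.0), ("T200", "C2", 50.0), ("T300", "C1", 30.0)],
    )
    conn.executemany(
        "INSERT INTO Items_sold VALUES (?, ?, ?)",
        [("T100", "I1", 1), ("T100", "I8", 2), ("T200", "I8", 1), ("T300", "I8", 3)],
    )
    return conn

def items_bought_by(conn, cust_id):
    """Total quantity of each item bought by one customer (join + aggregation)."""
    query = """
        SELECT s.item_ID, SUM(s.qty) AS total_qty
        FROM Items_sold AS s
        JOIN Purchases AS p ON p.trans_ID = s.trans_ID
        WHERE p.cust_ID = ?
        GROUP BY s.item_ID
        ORDER BY s.item_ID
    """
    return conn.execute(query, (cust_id,)).fetchall()

conn = build_demo_db()
print(items_bought_by(conn, "C1"))
```

The query joins the two relations on trans_ID, which is the shared key in the schema listed earlier.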
Data mining tasks
Predictive:
Classification
Regression
Deviation Detection
Descriptive:
Clustering
Association Rule Discovery
Sequential Pattern Discovery
1.5.2 Data warehouse
Is a repository of multiple heterogeneous data sources
organized under a unified schema at a single site to
facilitate management decision making.
Clustering analysis
Outlier analysis
Predictive.
and bread
Frequent subsequences (also known as sequential patterns)
e.g., customers tend to purchase first a laptop, followed by a digital
camera, and then a memory card
Frequent substructures
can refer to different structural forms (e.g., graphs, trees, or lattices)
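How frequent a sequential pattern is can be measured by its support: the fraction of customer sequences that contain the pattern's items in the same order. The sketch below counts this directly. The purchase histories are invented, and a real miner would search for candidate patterns rather than check a single given one.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same order
    (not necessarily next to each other)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must come after the last.
    return all(item in it for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    hits = sum(contains_in_order(s, pattern) for s in sequences)
    return hits / len(sequences)

# Hypothetical purchase histories, one list per customer.
purchases = [
    ["laptop", "digital camera", "memory card"],
    ["laptop", "mouse", "digital camera", "memory card"],
    ["digital camera", "laptop"],
    ["laptop", "memory card"],
]

pattern = ["laptop", "digital camera", "memory card"]
print(support(purchases, pattern))
```

A pattern is usually called frequent when its support reaches a user-chosen minimum threshold.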
A statistical model
Is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their
associated probability distributions.
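As a small illustration of a statistical model, one attribute of a target class can be treated as a random variable with a normal distribution fitted from samples. The income figures, the choice of a normal distribution, and the three-standard-deviation rule in is_unusual are all assumptions for this sketch.

```python
from statistics import NormalDist

# Hypothetical annual incomes (in thousands) of customers in one class.
incomes = [38.0, 42.0, 45.0, 40.0, 35.0]

# Fit the model: estimate the mean and standard deviation from the samples.
model = NormalDist.from_samples(incomes)

def is_unusual(value, dist, z=3.0):
    """Flag a value lying more than z standard deviations from the mean."""
    return abs(value - dist.mean) > z * dist.stdev

print(model.mean, model.stdev)
print(is_unusual(60.0, model), is_unusual(41.0, model))
```

Once fitted, the same model can describe the class (its typical income range) and also flag values that the distribution makes very unlikely.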
Machine Learning
Machine learning investigates how computers can learn (or improve
their performance) based on data.
A main research area is for computer programs to automatically learn
to recognize complex patterns and make intelligent decisions based on
data.
Learning methods
Supervised
Unsupervised
Semi-supervised
Reinforcement
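The difference between the first two methods can be shown on one-dimensional toy data. In the supervised case the labels are given and the program learns one centroid per label. In the unsupervised case no labels are given and the program groups the values itself. The data, the nearest-centroid classifier, and the two-cluster procedure are illustrative choices; the clustering function assumes at least two distinct values.

```python
def centroid(values):
    return sum(values) / len(values)

def fit_supervised(labeled):
    """Learn one centroid per label from (value, label) pairs."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: centroid(vals) for label, vals in groups.items()}

def predict(model, value):
    """Assign the label whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

def fit_unsupervised(values, iterations=10):
    """Two-means clustering on unlabeled 1-D data.
    Assumes the data holds at least two distinct values."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = centroid(near_lo), centroid(near_hi)
    return lo, hi

# Supervised: labels are part of the training data.
labeled = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
model = fit_supervised(labeled)
print(predict(model, 3.0))

# Unsupervised: only the raw values are available.
print(fit_unsupervised([1.0, 2.0, 9.0, 10.0]))
```

Semi-supervised learning combines the two settings by using a few labeled examples together with many unlabeled ones.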
Which Kinds of Applications Are Targeted?
Business Intelligence
Understand the commercial context of the organization:
customers, the market, supply and resources, and
competitors
provide historical, current, and predictive views of business
operations
Web Search Engines
Have to deal with
a huge and ever-growing amount of data
online data