Introduction To Data Mining

The document provides an introduction to data mining, including its meaning and applications. It discusses the key steps in knowledge discovery: data cleaning, integration, selection, transformation, mining, evaluation, and representation. It also outlines the types of data, patterns, and functionalities involved in data mining, as well as the technologies used. The essence of data mining is extracting valuable knowledge and insights from the vast amounts of data now available.

Uploaded by

Shumet Woldie
Chapter 1

Introduction to Data Mining

Meaning of data mining
• Extracting information from huge sets of data
• Procedure of mining knowledge from data
• Efficient discovery of previously unknown, valid, potentially useful, and understandable patterns in large datasets
• Analysis of observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
• Popularly known as Knowledge Discovery in Databases (KDD)
• Extracted knowledge can be used for any of the following applications:
  o Market analysis
  o Fraud detection
  o Customer retention
  o Production control
  o Science exploration

Knowledge discovery steps
• Data cleaning
  o Noisy data (random error or variance in a measured variable) and irrelevant data are removed from the collection
  o Missing values are filled in, noise is smoothed out while outliers are identified, and inconsistencies in the data are corrected
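
Two of the cleaning operations named above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: filling with the median and smoothing by bin means are assumed choices, and the function names and sample values are hypothetical.

```python
from statistics import mean, median


def fill_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical readings with two missing entries.
readings = [10, None, 12, 14, None]
cleaned = fill_missing(readings)

# Hypothetical prices, smoothed in bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Smoothing by bin means returns the values in sorted order, so it suits attributes whose original ordering does not matter.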

• Data integration
  o Multiple data sources, often heterogeneous, may be combined into a common source
• Data selection (reduction)
  o Data relevant to the analysis task are retrieved from the database
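
Integration and selection can be sketched as follows. The two sources, their field names, and the selection criterion are all hypothetical. The sketch assumes the sources share a customer identifier stored under different names.

```python
# Two hypothetical sources that describe the same customers with different field names.
crm_rows = [
    {"cust_id": 1, "name": "Ann", "city": "Toronto"},
    {"cust_id": 2, "name": "Ben", "city": "Chicago"},
]
sales_rows = [
    {"customer": 1, "amount": 250.0},
    {"customer": 2, "amount": 40.0},
    {"customer": 1, "amount": 75.0},
    {"customer": 9, "amount": 10.0},  # no matching customer record
]


def integrate(crm, sales):
    """Combine both sources into one list of records with a common schema."""
    customers = {row["cust_id"]: row for row in crm}
    combined = []
    for sale in sales:
        customer = customers.get(sale["customer"])
        if customer is None:
            continue  # skip sales that cannot be matched to a customer
        combined.append({
            "cust_id": customer["cust_id"],
            "city": customer["city"],
            "amount": sale["amount"],
        })
    return combined


def select_by_city(records, city):
    """Keep only the records relevant to an analysis of one city."""
    return [r for r in records if r["city"] == city]


combined = integrate(crm_rows, sales_rows)
toronto = select_by_city(combined, "Toronto")
```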

• Data transformation
  o Data are transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations
• Data mining
  o Intelligent methods are applied to extract data patterns
• Pattern evaluation
  o The truly interesting patterns representing knowledge are identified based on interestingness measures
• Knowledge representation
  o Visualization and knowledge representation techniques are used to present the mined knowledge to users
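
As a small illustration of the data transformation step above, daily records can be consolidated into monthly totals by aggregation. The data and the function name are hypothetical, and dates are assumed to be ISO-formatted strings.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
]


def monthly_totals(rows):
    """Aggregate daily amounts into one total per month (the 'YYYY-MM' prefix)."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:7]] += amount
    return dict(totals)


totals = monthly_totals(daily_sales)
```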

What kinds of data can be mined?
• Data mining can be applied to any kind of data as long as the data are meaningful for a target application
• The most basic forms of data for mining applications are:

Relational database
o Collection of tables, each of which is assigned a unique name
o Each table consists of a set of attributes and usually stores a large set of tuples
o One of the most commonly available and richest information repositories for finding trends
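
A relational table and a simple summarizing query over it can be sketched with Python's built-in sqlite3 module. The table name, attributes, and tuples are hypothetical.

```python
import sqlite3


def customers_per_city():
    """Create a small named table in memory and count its tuples per city."""
    conn = sqlite3.connect(":memory:")
    # A table has a unique name (customer) and a set of attributes.
    conn.execute(
        "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # Each inserted row is one tuple of the table.
    conn.executemany(
        "INSERT INTO customer (cust_id, name, city) VALUES (?, ?, ?)",
        [(1, "Ann", "Toronto"), (2, "Ben", "Chicago"), (3, "Cara", "Toronto")],
    )
    rows = conn.execute(
        "SELECT city, COUNT(*) FROM customer GROUP BY city ORDER BY city"
    ).fetchall()
    conn.close()
    return rows
```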

Data warehouses
o A repository of information collected from multiple sources
o Stored under a unified schema and usually residing at a single site
o Constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
o Data in a data warehouse are organized around major subjects

Transactional data
o A set of records representing transactions
o Each record has a time stamp, an identifier, and a set of items
o Transaction files may also come with descriptive data for the items
o A typical data mining analysis on transactional data is market basket analysis, or association rule mining

Multimedia databases
o Include video, images, audio, and text media

Spatial databases
o Store geographical information such as maps and global or regional positioning

Time-series databases
o Contain time-related data such as stock market data or logged activities
o Have a continuous flow of new data coming in
o Data mining in such databases commonly includes the study of trends and of correlations between the evolutions of different variables
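
Trend and correlation on time series can be sketched as follows. The two series are made-up values; the average change per step and the Pearson correlation coefficient are assumed here as simple measures and are not named in the slides.

```python
from math import sqrt


def average_change(series):
    """Average change per time step; a positive value indicates an upward trend."""
    return (series[-1] - series[0]) / (len(series) - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily prices of two stocks that move in opposite directions.
stock_a = [10.0, 11.0, 12.5, 13.0, 15.0]
stock_b = [20.0, 19.0, 17.5, 17.0, 15.0]

trend_a = average_change(stock_a)
correlation = pearson(stock_a, stock_b)
```

A coefficient close to -1, as for these two series, means that one variable tends to fall when the other rises.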

What kinds of patterns can be mined?
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Such tasks can be classified into two categories:
1. Descriptive task
  • Deals with the general properties of data in the database
  • Descriptive functions are:
    o Class/concept description
    o Mining of frequent patterns
    o Mining of associations
    o Mining of correlations
    o Mining of clusters
2. Predictive/classification task
  • Performs induction on the current data in order to make predictions
  • Classification is the process of finding a model that describes the data classes
  • The model is used to predict the class of objects whose class label is unknown
  • The derived model is based on the analysis of sets of training data
  • The derived model can be presented in the following forms: classification (IF-THEN) rules, decision trees, mathematical formulae, and neural networks

Data mining functionalities

Characterization
o Summarization of the general features of objects in a target class
o Data corresponding to the user-specified class are typically collected by a query
o Output of data characterization can be presented in various forms, such as pie charts, bar charts, curves, and multidimensional data cubes

Discrimination
o Comparison of the general features of objects between two classes, referred to as the target class and the contrasting class
o Target and contrasting classes can be specified by a user
o Corresponding data objects can be retrieved through database queries
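
Characterization and discrimination can be sketched on a toy data set. The customer records, the "segment" attribute, and the choice of count and mean as summary features are all hypothetical.

```python
from statistics import mean

# Hypothetical customer records; "segment" plays the role of the class.
customers = [
    {"segment": "frequent", "age": 34, "spend": 900.0},
    {"segment": "frequent", "age": 40, "spend": 1100.0},
    {"segment": "rare", "age": 52, "spend": 150.0},
    {"segment": "rare", "age": 48, "spend": 250.0},
]


def characterize(rows, label):
    """Summarize the general features of the objects in one class."""
    members = [r for r in rows if r["segment"] == label]
    return {
        "count": len(members),
        "mean_age": mean(r["age"] for r in members),
        "mean_spend": mean(r["spend"] for r in members),
    }


def discriminate(rows, target, contrast):
    """Compare the target class with the contrasting class, feature by feature."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {key: t[key] - c[key] for key in ("mean_age", "mean_spend")}


profile = characterize(customers, "frequent")
difference = discriminate(customers, "frequent", "rare")
```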

Association analysis
o Studies the frequency of items occurring together in transactional databases

o Commonly used for market basket analysis
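
How often items occur together can be quantified with support and confidence. These are the standard association rule measures, assumed here since the slides do not name them. The baskets are made-up data.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


bread_and_milk = support({"bread", "milk"}, baskets)
bread_implies_milk = confidence({"bread"}, {"milk"}, baskets)
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.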

Classification
o Organization of data in given classes
o Process of finding a model (function) that describes and distinguishes data classes or concepts
o Uses a training set in which all objects are already associated with known class labels
o The classification algorithm learns from the training set and builds a model
o The classification model can be represented in various forms, such as IF-THEN rules, decision trees, and neural networks
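
Learning from a labeled training set can be sketched with a one-level decision tree (a decision stump), which reads directly as an IF-THEN rule. This is a deliberately tiny learner chosen for illustration. It assumes one numeric attribute and exactly two class labels, and the training data are made up.

```python
def train_stump(samples):
    """Learn a threshold rule from (value, label) pairs with exactly two labels.

    Returns (threshold, label_if_below_or_equal, label_if_above).
    """
    values = sorted({v for v, _ in samples})
    labels = sorted({lab for _, lab in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below, above in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum(
                1 for v, lab in samples
                if (below if v <= threshold else above) != lab
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def predict(model, value):
    """Apply the learned rule to an object whose class label is unknown."""
    threshold, below, above = model
    return below if value <= threshold else above


def rule_text(model):
    """Render the model as an IF-THEN rule."""
    threshold, below, above = model
    return f"IF income <= {threshold} THEN {below} ELSE {above}"


# Hypothetical training set: (income in thousands, loan decision).
training = [
    (18, "reject"), (25, "reject"), (32, "reject"),
    (45, "approve"), (51, "approve"), (60, "approve"),
]
model = train_stump(training)
```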

Prediction
o The major idea is to use a large number of past values to estimate probable future values
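
One simple way to estimate a future value from past values is to fit a straight line by least squares and extend it. The linear model is an assumption made for this sketch, and the series is made up.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys, steps_ahead=1):
    """Extend the fitted line steps_ahead time steps past the last observation."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)


# Hypothetical past values of a steadily growing quantity.
past_values = [10, 12, 14, 16]
next_value = forecast(past_values)
```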

Clustering
o Organization of data in classes
o In clustering, class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes
o Clustering approaches aim at maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity)
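
A minimal sketch of clustering without class labels, using k-means on one-dimensional data. k-means is one common algorithm, chosen here for illustration and not named in the slides. The initial centers are spread over the sorted data to keep the result deterministic, and the points are made up.

```python
def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, k, iterations=20):
    """Group points into k clusters so that members lie close to their own center."""
    ordered = sorted(points)
    step = len(ordered) / k
    # Deterministic start: one center from the middle of each slice of the sorted data.
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical measurements that form two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, k=2)
```

Points within a cluster are close to each other (high intra-class similarity), while the two centers are far apart (low inter-class similarity).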

Outlier analysis
o Outliers are data elements that cannot be grouped in a given class or cluster
o They are data objects that do not comply with the general behavior or model of the data
o Many data mining methods discard outliers as noise or exceptions
o However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones
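
Outliers can be flagged, for example, by their distance from the mean in units of standard deviation (the z-score). The threshold of 2.0 and the transaction amounts are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500, 21]
suspicious = zscore_outliers(amounts)
```

In a fraud detection setting the flagged transaction would be kept for inspection and not discarded as noise.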

Technologies used in data mining
• Data mining has incorporated many techniques from other domains
• The following figure shows techniques adopted from different domains

Essence of data mining
Moving toward the Information Age
o Vast amounts of data are collected daily, and analyzing such data is an important need
o The explosive growth of available data volume is the result of the computerization of our society and the fast development of powerful data collection and storage tools
o Global backbone telecommunication networks carry tens of petabytes of data traffic every day

o The explosively growing, widely available, and gigantic body of data makes our time truly the data age
o Powerful and versatile tools are critically needed to uncover valuable information from the tremendous amounts of data and to transform such data into organized knowledge
o This necessity has led to the birth of data mining


Data mining as the evolution of information technology
o Data mining can be viewed as a result of the natural evolution of information technology
o Database and information technology has evolved systematically
o Database management systems technology moved toward the development of advanced database systems and data warehousing

o Computer hardware technology supplies powerful and affordable computers, data collection equipment, and storage media
o This technology provides a great boost to the database and information industry
o It enables a huge number of databases and information repositories to be available for transaction management, information retrieval, and data analysis
o Internet-based global information bases, such as the WWW and various kinds of interconnected and heterogeneous databases, have emerged and play a vital role in the information industry

The abundance of data, coupled with the need for powerful data analysis tools
o Has been described as a data-rich but information-poor situation
o The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has exceeded our human ability for comprehension without powerful tools
o The widening gap between data and information calls for the systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge

Relationship between data mining, data warehousing and OLAP
o A data warehouse is a repository of information collected from multiple sources
o It is stored under a unified schema and usually resides at a single site
o Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing
o The following figure shows a typical framework for the construction and use of a data warehouse for a particular electronics shop
o Data in a data warehouse are organized around major subjects
o Data are stored to provide information from a historical perspective
o A data warehouse is usually modeled by a multidimensional data structure called a data cube
o Each dimension corresponds to an attribute in the schema
o Each cell stores the value of some aggregate measure
o A data cube allows the precomputation and fast access of summarized data
o Data warehouse systems can provide inherent support for OLAP
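
The idea of a data cube can be sketched with plain dictionaries: each combination of dimensions gets its own table of cells, and each cell holds a precomputed sum. The dimensions, the "sales" measure, and the fact rows are hypothetical, and the sketch ignores the storage optimizations a real warehouse would use.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "city", "quarter")

# Hypothetical sales facts of an electronics shop.
facts = [
    {"item": "TV", "city": "Toronto", "quarter": "Q1", "sales": 400.0},
    {"item": "TV", "city": "Chicago", "quarter": "Q1", "sales": 150.0},
    {"item": "Phone", "city": "Toronto", "quarter": "Q1", "sales": 300.0},
    {"item": "Phone", "city": "Toronto", "quarter": "Q2", "sales": 350.0},
]


def build_cube(rows, dimensions, measure):
    """Precompute the summed measure for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cells = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key] += row[measure]
            cube[dims] = dict(cells)
    return cube


cube = build_cube(facts, DIMENSIONS, "sales")

# Summaries are now simple lookups, with no scan over the fact rows.
sales_per_item = cube[("item",)]
toronto_q1 = cube[("city", "quarter")][("Toronto", "Q1")]
grand_total = cube[()][()]
```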

OLAP (online analytical processing)
o OLAP operations use background knowledge regarding the domain of the data being studied to allow the presentation of data at different levels of abstraction
o Such operations accommodate different user viewpoints
o Drill-down and roll-up are examples of OLAP operations; they allow the user to view the data at different degrees of summarization
o The following figure shows examples of drill-down and roll-up operations
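
Roll-up and drill-down can also be sketched in code. The city-to-country mapping stands in for the background knowledge about the domain. The mapping, the totals, and the function names are hypothetical.

```python
from collections import defaultdict

# Background knowledge: a concept hierarchy from city up to country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

# Hypothetical sales totals at the city level.
sales_by_city = {"Toronto": 395.0, "Vancouver": 605.0, "Chicago": 440.0}


def roll_up(city_totals, hierarchy):
    """Climb the hierarchy: summarize city totals into country totals."""
    country_totals = defaultdict(float)
    for city, amount in city_totals.items():
        country_totals[hierarchy[city]] += amount
    return dict(country_totals)


def drill_down(city_totals, hierarchy, country):
    """Step down the hierarchy: show the city-level detail behind one country."""
    return {
        city: amount
        for city, amount in city_totals.items()
        if hierarchy[city] == country
    }


by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
canada_detail = drill_down(sales_by_city, CITY_TO_COUNTRY, "Canada")
```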

Issues in data mining
• Many pending issues have to be addressed in data mining
• Some of these issues are:

Mining methodology
o Issues affecting the data mining approaches applied and their limitations
o Factors that can dictate mining methodology choices include the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, broad analysis needs, the assessment of the knowledge discovered, and the control and handling of noise in data

Performance issues
o The scalability and efficiency of data mining methods when processing considerably large data
o Incremental updating and parallel programming

Data source issues
o Many issues relate to the data sources, such as the diversity of data types, the data glut problem, and the storage of different types of data in a variety of repositories
o Different kinds of data and sources may require distinct algorithms and methodologies
o The proliferation of heterogeneous data sources poses important challenges for the database community and the data mining community

Security and social issues
o Sensitive and private information about individuals or companies is gathered and stored
o This information is collected for customer profiling, user behavior understanding, and correlating personal data with other information
o This could disclose new implicit knowledge about individuals or groups that could be against privacy policies
o Important information could be withheld, while other information could be widely distributed and used without control

User interface issues
o Knowledge discovered by data mining tools should be useful and, above all, understandable by the user
o Good data visualization eases the interpretation of data mining results
o Major issues related to user interfaces and visualization are screen real estate, information rendering, and interaction

Thank You
