Unit 1

Data Mining (DM)

GTU #3160714

Unit-1
Introduction to
Data Mining (DM)
Topics to be covered
• Motivation for Data Mining
• Data Mining - Definition
• Data Mining – On what kind of data?
• Data Mining Functionalities
• KDD Process (Knowledge Discovery in Databases)
• Classification of DM (Data Mining) Systems
• DM task primitives
• Issues in DM
Just think: One Second on the Internet
9,003 Tweets
4,705 Skype Calls
1,711 Tumblr Posts
83,378 Google Searches
84,388 YouTube videos viewed
996 Instagram photos uploaded
& many more…
Is all of this information really important to us?
Motivation: Why data mining?
“Necessity is the Mother of all Inventions”
“It has been estimated that the amount of information in the world doubles every
10 months.”
There is a tremendous increase in the amount of data recorded and stored on digital
media as well as in individual sources.
• Since the 1960s, database and information technology has evolved
systematically from primitive file processing systems to powerful database systems.
• Research and development in database systems since the 1970s has led to the
development of relational database systems.

“We are drowning in data, but starving for knowledge!”
“Data rich but Information poor”
Motivation: Why data mining? (Cont..)
Years: Evolutions
1960s: Data collection, database creation, IMS (hierarchical database system by IBM) and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models, application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and web databases
2000s: Stream data management and mining, social networks (Facebook, etc.), web technology (XML) and global information systems
At present: Heterogeneous database systems, big data

Data grows exponentially every day, but is all of this data really important to us?
Motivation for Data Mining : An Example
Data → Knowledge → Action → Goal

Netflix collects user ratings of movies (data) → What types of movies you
will like (knowledge) → Recommend new movies to you (action) → Users
stay with Netflix (goal)

Gene sequences of cancer patients (data) → Which genes lead to cancer?
(knowledge) → Appropriate treatment (action) → Save life (goal)

Road traffic (data) → Which road is likely to be congested? (knowledge) →
Suggest better routes to drivers (action) → Save time and energy (goal)

Summary
The overall goal of the data mining process is to extract information
from large data sets or databases and transform it into an
understandable structure for further use.
What is Data Mining?
• Data mining refers to extracting or “mining” knowledge from large amounts of data.
• “Knowledge mining from data” or “Knowledge mining”
• “Extract knowledge from large data or databases”
• “Knowledge discovery from databases (KDD)”

[Figure: Data mining at the centre of related fields: database technology, statistics,
machine learning, visualization, information science, and other disciplines]
Data Mining—On what kind of data?
Relational Databases:
• A database system, also called a database management system (DBMS), consists of a collection
of interrelated data, known as a database, and a set of software programs to manage and
access the data. In a relational database, the data are stored in tables.
• E.g. : SQL Server, Oracle etc.
Data Warehouses:
• A data warehouse is a repository of information collected from multiple sources.
• It is constructed by pre-processing the data (data cleaning, data integration, data
transformation, data loading, periodic data refreshing, etc.).
• E.g. : Stock Market, D-Mart, Big Bazar etc.
Data Mining—On what kind of data? (Cont..)
Transactional Databases:
• A transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (TID) and a list of the items
making up the transaction (such as items purchased in a store).
• E.g. : Online shopping on Flipkart, Amazon etc.
Other Data/Databases
• Spatial data (Maps or Location related data)
• Engineering design data (designs of buildings, office structures, etc.)
• Hypertext and multimedia data (including text, image, video and audio data)
• The World Wide Web (WWW), a huge, widely distributed information repository made available
on the Internet
Data Mining Architecture

[Figure: layered architecture, from top to bottom]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine (linked to the Knowledge Base)
• Database or Data Warehouse Server
• Cleaning, Integration & Selection
• Data sources: Database, Data Warehouse, WWW, Other Info Repositories
Data Mining Functionalities
Data mining functionalities can be classified into two categories:
1. Descriptive
2. Predictive

 Descriptive
• This task presents the general properties of data stored in a database.
• Descriptive tasks are used to find patterns in data.
• E.g.: clusters, trends, etc.

 Predictive
• These tasks predict the value of one attribute on the basis of the values of other attributes.
• E.g.: predicting festival-season customers or product sales at a store
Data Mining Functionalities (Cont..)
1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define the class
or concept.
A class can be a category of items on a shop floor, and a concept could be the
abstract idea by which data may be categorized, such as products to be put on clearance
sale versus non-sale products.

Data Characterization: This refers to the summary of general characteristics or
features of the class, resulting in specific rules that define a target class.

Data Discrimination: Discrimination is used to separate distinct data sets based on
the disparity in attribute values.
2. Mining Frequent Patterns
One of the functions of data mining is finding data patterns. Frequent patterns are
those that are discovered to occur most commonly in the data.
Various types of frequent patterns can be found in a dataset.

Frequent item set: A group of items that are commonly found together, such as
milk and sugar.

Frequent substructure: A structural form, such as a tree or graph, that occurs
frequently and may be combined with item sets or subsequences.

Frequent subsequence: A regularly occurring sequence of events, such as buying a
phone followed by a cover.
3. Association Analysis
It analyses the sets of items that generally occur together in a transactional dataset. It is
also known as Market Basket Analysis for its wide use in retail sales. Two parameters
are used for determining the association rules:

Support identifies how frequently an item set occurs in the database.

Confidence is the conditional probability that an item occurs in a transaction when
another item occurs in it.
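
The two parameters can be illustrated with a minimal Python sketch. The transactions and item names below are made up for illustration, and the function names `support` and `confidence` are assumptions rather than part of any particular library.

```python
# Hypothetical toy transactions; each set is one market basket.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once (non-zero support).
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

milk_sugar_support = support({"milk", "sugar"}, transactions)      # 2 of 4 baskets
milk_to_sugar_conf = confidence({"milk"}, {"sugar"}, transactions)  # 2 of the 3 milk baskets
```

Here the rule "milk → sugar" holds in two of the three baskets that contain milk, so its confidence is about 0.67, while the item set {milk, sugar} appears in half of all baskets.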
4. Classification
Classification is a data mining technique that categorizes items in a collection based
on some predefined properties.

It uses methods such as if-then rules, decision trees or neural networks to predict a class,
that is, to classify a collection of items.

A training set containing items whose properties are known is used to train the
system to predict the category of items from an unknown collection of items.
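
As a rough illustration of training on a labelled set and then classifying unseen items, the sketch below learns a single if-then rule (a one-level decision tree, often called a decision stump) on one numeric attribute. The prices, the labels, and the function name `train_stump` are hypothetical, and the sketch assumes the training set has at least two distinct attribute values.

```python
def train_stump(training_set):
    """Learn one rule: 'if value <= threshold then low_label else high_label'.

    training_set is a list of (value, label) pairs whose labels are known.
    """
    values = sorted({x for x, _ in training_set})
    labels = sorted({y for _, y in training_set})
    best = None
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    1 for x, y in training_set
                    if (low_label if x <= threshold else high_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x <= threshold else high_label

# Hypothetical training set: product price -> known category.
training = [(120, "budget"), (150, "budget"), (300, "premium"), (340, "premium")]
classify = train_stump(training)
predicted = [classify(200), classify(310)]  # items from an unknown collection
```

A real classifier would use many attributes and a deeper tree or another model; the point here is only the split between a training phase and a prediction phase.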
5. Prediction
Prediction estimates unavailable data values or trends, such as spending trends.
The value for an object can be predicted from the attribute values of the object and the
attribute values of the classes.
It can be a prediction of missing numerical values or of increasing or decreasing trends in
time-related information.
There are primarily two types of predictions in data mining: numeric and class
predictions.

Numeric predictions are made by creating a linear regression model that is based
on historical data. Prediction of numeric values helps businesses ramp up for a
future event that might impact the business positively or negatively.

Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
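
Numeric prediction with a linear regression model can be sketched as follows. This is a minimal least-squares fit of a straight line in plain Python; the monthly sales figures and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: month number -> units sold.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
forecast_month_5 = slope * 5 + intercept  # extrapolate one month ahead
```

With this perfectly linear toy data the fitted line is y = 2x + 8, so the forecast for month 5 is 18 units.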
6. Cluster Analysis
In image processing, pattern recognition and bioinformatics, clustering is a popular
data mining functionality.
It is similar to classification, but the classes are not predefined.
Data attributes represent the classes. Similar data are grouped together, with the
difference being that a class label is not known.
Clustering algorithms group data based on similar features and dissimilarities.
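
One common way to group similar data without class labels is k-means. The sketch below is a simplified one-dimensional version with hand-picked starting centroids; the data values and the name `kmeans_1d` are assumptions for illustration only.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Simplified k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(points, centroids=[1, 12])
```

No class label is supplied anywhere; the two groups emerge only from the distances between the values.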
7. Outlier Analysis
Outlier analysis is important to understand the quality of data.
If there are too many outliers, you cannot trust the data or draw patterns.
An outlier analysis determines whether there is something out of the ordinary in the data and
whether it indicates a situation that a business needs to consider and take measures
to mitigate.
Data that cannot be grouped into any class by the algorithms is pulled up for
outlier analysis.
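
One simple way to flag outliers, shown here only as an illustrative sketch, is the z-score rule: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The threshold of 2, the data, and the name `find_outliers` are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily order counts with one unusual day.
orders = [10, 11, 9, 10, 12, 11, 10, 50]
unusual = find_outliers(orders)
```

The z-score rule is sensitive to the outliers themselves, because they inflate the mean and standard deviation, so other rules may be preferable on heavily skewed data.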
8. Evolution and Deviation Analysis
Evolution Analysis pertains to the study of data sets that change over time.
Evolution analysis models are designed to capture evolutionary trends in data
helping to characterize, classify, cluster or discriminate time-related data.

9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly
two attributes are related to one another.
It determines how well two numerically measured continuous variables are linked.
• Researchers can use this type of analysis to see if there are any possible
correlations between variables in their study.
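
A widely used measure of linear correlation between two numeric attributes is the Pearson coefficient, which ranges from -1 to +1. The sketch below computes it in plain Python; the function name `pearson` and the sample values are illustrative, and both inputs are assumed to be non-constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attribute pairs.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # moves together
falling = pearson([1, 2, 3], [3, 2, 1])        # moves in opposite directions
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.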
KDD (Knowledge Discovery in Databases) Process
Knowledge discovery in databases is an iterative sequence of the
following steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Pattern Evaluation
6. User Interface (Visualization of Pattern or Knowledge)
KDD (Knowledge Discovery in Databases) Process (Cont..)
[Figure: KDD process. Selection → Target Data → Preprocessing → Preprocessed Data →
Transformation → Transformed Data → Data Mining → Patterns → Pattern Evaluation → Knowledge]
KDD (Knowledge Discovery in Databases) Process (Cont..)
• Data Selection: Where data relevant to the analysis task are retrieved from the
database.
• Data Cleaning: To remove noise and inconsistent data.
• Data Integration: Where multiple data sources may be combined.
• Data Transformation: Where data are transformed or consolidated into appropriate
forms for mining by performing summary or aggregation operations.
• Data Mining: An essential process where intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation: To identify the truly interesting patterns representing knowledge
based on some interestingness measures.
• Knowledge Presentation: Where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
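
The steps listed above can be sketched as a chain of small Python functions. Everything here is a toy assumption: the records, the field names, the `min_qty` threshold, and the choice of "total quantity per item" as the mined pattern. Data integration is skipped because the sketch uses a single source.

```python
from collections import Counter

# Hypothetical raw records from one source.
raw = [
    {"store": "A", "item": "milk", "qty": 2},
    {"store": "A", "item": "milk", "qty": None},  # missing value
    {"store": "A", "item": "tea", "qty": 1},
    {"store": "B", "item": "milk", "qty": 5},
    {"store": "A", "item": "milk", "qty": 3},
]

def select(records, store):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["store"] == store]

def clean(records):
    """Data cleaning: drop records with missing quantities."""
    return [r for r in records if r["qty"] is not None]

def transform(records):
    """Data transformation: aggregate quantity per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Data mining: rank items by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep patterns that meet the interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Knowledge presentation: render patterns for the user."""
    return [f"{item}: {qty} units" for item, qty in patterns]

target = select(raw, "A")
knowledge = present(evaluate(mine(transform(clean(target))), min_qty=2))
```

In practice the process is iterative: the result of pattern evaluation often sends the analyst back to selection or cleaning with a refined question.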
Classification of Data Mining Systems
Data mining systems can be classified based on:
1. Databases to be mined
2. Knowledge to be mined
3. Techniques/Methods utilized
4. Application adapted
1. Classification according to the kinds of Databases mined
Database models are important for classification according to the kinds of databases
to be mined.
Types of database models
 Hierarchical database model
 Relational model
 Network model
 Object-oriented database model
 Entity-relationship model
 Document model
 Entity-attribute-value model
 Star schema
 Object-relational model

1. Classification according to the kinds of Databases mined
(Cont..)
A data mining system can be classified according to different criteria (such as data models,
the types of data, or the applications involved), each of which may require its own data
mining technique.
For instance, if classifying according to data models, we may have a relational,
transactional, object-oriented, object-relational, or data warehouse mining system.
If classifying according to the special data types, we may have a spatial, time-series,
text or multimedia data mining system or a world-wide web mining system.
Other system types include heterogeneous data mining systems and legacy data
mining systems.

2. Classification according to the kinds of Knowledge mined
Based on data mining functionalities:
• Characterization: Summarization of the general characteristics or features of a target
class of data.
• Association: Discovers the probability of the co-occurrence of items in a collection.
• Correlation analysis: Used to find the association between variables.
• Classification: Discovers a model that describes the data classes or concepts.
• Prediction: Uses a model of the data classes to predict future data or trends.
• Cluster analysis: Finds groups of objects that are similar to each other within a
group but different from the objects in other groups.
• Outlier analysis: Identifies the anomalous observations in the dataset.

3. Classification according to the kinds of Techniques utilized
These techniques can be described according to:
• The degree of user interaction involved (e.g., autonomous systems, query-driven systems).
• The methods of data analysis employed (e.g., database-oriented or data
warehouse-oriented techniques, machine learning, statistics, visualization, pattern
recognition, neural networks, etc.).
A sophisticated data mining system will often adopt multiple data mining techniques or
work out an effective, integrated technique that combines the merits of a few individual
approaches.

4. Classification according to the Applications adapted
Retail
Telecommunication
Banking
Fraud analysis
Stock market analysis
Text mining
Web mining etc.

Data Mining Task Primitives
 A data mining task can be specified in the form of a data mining query, which is
input to the data mining system.
 A data mining query is defined in terms of data mining task primitives.
 These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or examine
the findings from different angles or depths.
 The data mining primitives specify the following
 The set of task-relevant data to be mined
 The kind of knowledge to be mined
 The background knowledge to be used in the discovery process
 The interestingness measures and thresholds for pattern evaluation
 The expected representation for visualizing the discovered patterns
Data Mining Task Primitives (Cont..)
The set of task-relevant data to be mined
 This specifies the portions of the database or the set of data in which the user is interested
(Target Data)
 This includes the database attributes or data warehouse dimensions of interest
The kind of knowledge to be mined
 This specifies the data mining functions to be performed, such as
• Characterization: Summarization of the general characteristics or features of a target class of data.
• Association: Discovers the probability of the co-occurrence of items in a collection.
• Correlation analysis: Used to find the association between variables.
• Classification: Discovers a model that describes the data classes or concepts.
• Prediction: Uses a model of the data classes to predict future data or trends.
• Cluster analysis: Finds groups of objects that are similar to each other within a group but different from
the objects in other groups.
• Outlier analysis: Identifies the anomalous observations in the dataset.
Data Mining Task Primitives (Cont..)
The background knowledge to be used in the discovery process
 This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and for evaluating the patterns found.
 Concept hierarchies are a popular form of background knowledge; they allow data to be mined
at multiple levels of abstraction. A concept hierarchy defines a sequence of mappings from a
set of low-level concepts to higher-level, more general concepts.

[Figure: concept hierarchy for roads. Road → National Highway (H1, H2) and
State Highway (S1, S2)]
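
A concept hierarchy such as the road example can be stored as a child-to-parent mapping and used to roll data up to a more general level. The sketch below is illustrative only: the dictionary layout, the function name `generalize`, and the trip records are assumptions.

```python
from collections import Counter

# Child concept -> parent concept, following the road hierarchy in the figure.
hierarchy = {
    "H1": "National Highway",
    "H2": "National Highway",
    "S1": "State Highway",
    "S2": "State Highway",
    "National Highway": "Road",
    "State Highway": "Road",
}

def generalize(concept, levels=1):
    """Climb `levels` steps up the hierarchy; stay put once the top is reached."""
    for _ in range(levels):
        concept = hierarchy.get(concept, concept)
    return concept

# Hypothetical low-level records: which road each trip used.
trips = ["H1", "H2", "S1", "H1"]
by_highway_type = Counter(generalize(t) for t in trips)
by_top_level = Counter(generalize(t, levels=2) for t in trips)
```

Mining the same trips at different levels of abstraction gives different patterns: by highway type at one level up, and simply "Road" at two levels up.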
Data Mining Task Primitives (Cont..)
The interestingness measures and thresholds for pattern evaluation
 They may be used to guide the mining process or, after discovery, to evaluate the discovered
patterns.
 Different kinds of knowledge may have different interestingness measures
 For example,
 Interestingness measures for association rules include support and confidence.
 Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
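
Filtering rules by user-specified thresholds can be sketched as below. The rules and their measure values are made-up numbers, and the sketch assumes a rule is kept only when both support and confidence meet their thresholds.

```python
# Hypothetical mined rules with precomputed interestingness measures.
rules = [
    {"rule": "milk -> sugar", "support": 0.50, "confidence": 0.67},
    {"rule": "tea -> sugar", "support": 0.25, "confidence": 1.00},
    {"rule": "bread -> tea", "support": 0.05, "confidence": 0.10},
]

def interesting(rules, min_support, min_confidence):
    """Keep rules whose support and confidence both reach the thresholds."""
    return [
        r["rule"]
        for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

kept = interesting(rules, min_support=0.3, min_confidence=0.6)
```

Lowering the thresholds admits more rules at the cost of more noise, which is why they are left to the user as a task primitive.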
Data Mining Task Primitives (Cont..)
The expected representation for visualizing the discovered patterns
 This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.
Interestingness of Patterns
• In a data mining system, millions of data patterns are generated every day.
• Among all these generated patterns, how many are really interesting?
• Only a small fraction of the patterns generated would be of interest to
any given user.
• This raises 3 questions:
1. What makes a pattern interesting?
A pattern is interesting if it is
- easily understood by humans
- valid on new/test data
- potentially useful
Contd..
2. Can a data mining system generate all of the interesting patterns?
- Refers to the completeness of a DM system.
- In reality, it is not possible for a DM system to generate all interesting patterns.

3. Can a DM system generate only interesting patterns?
- Refers to the optimization of a DM system.
- Generating only interesting patterns is a challenging task.
- If only interesting patterns are generated, the system becomes easier and more efficient
for the user (time is saved).
Integration of a Data Mining System with a Database/Data Warehouse System
Integration means association / combining.
If there is no integration, the DM system has no communication with the DB.
There are 4 integration schemes in total.

1. No Coupling
The DM system does not use any function of the DB system, i.e. there is no communication with the DB.
In this case, it communicates directly with other storage (file systems).

2. Loose Coupling
The DM system uses some of the functionalities of the DB system (only to some extent).
Better than no coupling.
Suitable for small data sets.
Contd..
3. Semi-tight Coupling
The DM system is linked to the DB.
In addition, some of the DM primitives are implemented in the DB.

4. Tight Coupling
The DM system is completely linked to the DB.
Most efficient among all.
The two systems are fully integrated, so that data mining becomes one functional
component of the combined system.
Efficient and optimized implementation of DM.
Data Mining Issues
 Data mining issues can be classified into five categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability (Algorithms)
4. Diversity of Database Types
5. Data Mining and Society
1. Mining Methodology
 Mining various and new kinds of knowledge
• Data mining covers a wide spectrum of data analysis and knowledge discovery tasks. These tasks may
use the same database in different ways, which requires the development of numerous data mining techniques.
 Mining knowledge in multidimensional space
• When searching for knowledge in large data sets, we can explore the data in multidimensional space.
• That is, we can search for interesting patterns among combinations of dimensions (attributes) at varying
levels of abstraction. Such mining is known as (exploratory) multidimensional data mining.
 Data mining—an interdisciplinary effort
• The power of data mining can be substantially enhanced by integrating new methods from multiple
disciplines.
• For example, to mine data with natural language text, it makes sense to fuse data mining methods of
information retrieval and natural language processing.
 Handling uncertainty, noise, or incompleteness of data
• Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
• Errors and noise may confuse the data mining process, leading to the derivation
of erroneous patterns.
2. User Interaction
 Interactive mining
• The data mining process should be highly interactive. Thus, it is important to build flexible user
interfaces and an exploratory mining environment, facilitating the user’s interaction with the
system.
 Incorporation of background knowledge
• Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process.
 Presentation and visualization of data mining results
• How can a system present data mining results vividly (so that they form a clear image in the mind)
and flexibly, so that the discovered knowledge can be easily understood and directly used by humans?
3. Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
• Data mining algorithms must be efficient and scalable in order to effectively extract information
from the huge amounts of data that lie in many data repositories or in dynamic data streams.
• In other words, the running time of a data mining algorithm must be predictable, short, and
acceptable by applications.
• Efficiency, scalability, performance, optimization and the ability to execute in real time are key
criteria for new mining algorithms.
 Parallel, distributed, and incremental mining algorithms
• The giant size of many data sets, the wide distribution of data, and the computational complexity
of some data mining methods are factors that motivate the development of parallel and
distributed data-intensive mining algorithms.
4. Diversity of Database Types
Handling complex types of data
• A key challenge is how to uncover knowledge from stream, time-series, sequence, graph, social
network and multi-relational data.
• A database or dataset may contain many types of attributes and many different types of data.
Mining dynamic, networked, and global data repositories
• Data from multiple sources are connected by the Internet and various kinds of networks, forming
distributed and heterogeneous global information systems.
• Discovering knowledge from different sources of structured, semi-structured, or unstructured
data is challenging.
5. Data Mining and Society
Social impacts of data mining
• With data mining penetrating our everyday lives, it is important to study the impact of data
mining on society.
• How can we use data mining technology to benefit our society?
• How can we guard against its misuse?

Privacy-preserving data mining


• Data mining will help in scientific discovery, business management, economy recovery, and
security protection (e.g., the real-time discovery of intruders and cyber attacks).
• However, it poses the risk of disclosing an individual’s personal information.
Invisible data mining
• We cannot expect everyone in society to learn and master data mining techniques.
• For example, when purchasing items online, users may be unaware that the store is likely
collecting data on the buying patterns of its customers, which may be used to recommend other
items for purchase in the future.
Data Pre-processing
The process of transforming raw data into an understandable format.

4 Major tasks

1. Data Cleaning

2. Data Integration

3. Data Reduction

4. Data Transformation
Data Cleaning:
The process of removing incorrect, incomplete, or inaccurate data; it also replaces missing
data.
1. Handling Missing Values:
Missing values can be filled in 2 ways:
• Manual: used for small data sets.
• Automatic: more efficient; used for large data sets.
Ways to replace missing values:
• Replace with “NA”.
• Replace with the mean value (suitable when the data follow a normal distribution).
• Replace with the median value (suitable when the data do not follow a normal distribution).
• Sometimes replace with the most probable value.
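
Automatic filling with the mean or median can be sketched in a few lines of Python. Missing values are represented here as `None`, which is an assumption of this sketch, as are the sample values and the name `fill_missing`.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]

# Roughly symmetric data: the mean is a reasonable replacement.
symmetric = fill_missing([10, None, 20, 30], strategy="mean")

# Skewed data with one very large value: the median is less affected by it.
skewed = fill_missing([1, 2, None, 3, 100], strategy="median")
```

In the skewed example the mean of the known values would be 26.5, far from most of the data, whereas the median is 2.5.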
Contd..
2. Handling Noisy Data
Noisy data: inconsistent or erroneous data.

Methods to handle
1. Binning
First, the data is sorted. Then the sorted data is distributed into bins.
3 methods to smooth the data in the bins:
- Smoothing by bin mean
- Smoothing by bin median
- Smoothing by bin boundary
2. Regression
Data is smoothed by fitting it to a function (numerical prediction of data).
Contd..
3. Clustering
Similar data items are grouped together into clusters.
Dissimilar items fall outside the clusters.
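
The binning method for noisy data can be sketched as follows, assuming equal-size bins of three sorted values. The sample values and function names are illustrative; smoothing by bin median would follow the same pattern as smoothing by bin mean.

```python
def make_bins(values, size):
    """Sort the data and split it into consecutive bins of `size` values."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative values, given unsorted on purpose.
data = [21, 4, 28, 15, 8, 25, 21, 34, 24]
bins = make_bins(data, size=3)
by_mean = smooth_by_mean(bins)
by_boundary = smooth_by_boundary(bins)
```

For these values the bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34]; smoothing by mean gives 9, 22 and 29 for the three bins, and smoothing by boundary gives [4, 4, 15], [21, 21, 24] and [25, 25, 34].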

Data Integration
Multiple heterogeneous sources of data are combined into a single dataset.
2 types of data integration
1. Tight Coupling
Data is combined into a single physical location.
2. Loose Coupling
Only an interface is created; data is combined and accessed through that interface.
The data remains in the source databases.
Data Reduction
The volume of data is reduced to make analysis easier.
Methods for Data Reduction
1. Dimensionality Reduction
Reduces the number of input variables in the data set.
A large number of input variables can lead to poor performance.
2. Data Cube Aggregation
Data is combined to construct a data cube. (Redundant and noisy data are removed.)
3. Attribute Subset Selection
Only highly relevant attributes (columns) are used; the others are discarded, which
reduces the data.
4. Numerosity Reduction
Here, we store only a model (or sample) of the data instead of the entire data.
Data Transformation
Data is transformed into a form suitable for the mining process.

4 Methods used
1. Normalization
Data values are scaled into a specified range (e.g. -1.0 to 1.0, or 0 to 1).
2. Attribute Selection
New attributes are created from existing ones.
3. Discretization
Raw values are replaced by interval labels.
4. Concept Hierarchy Generation
Attributes are converted from a low level to a higher level.
Ex: City → Country
