GTU #3160714
Unit 1
Introduction to Data Mining (DM)
Topics to be covered
• Motivation for Data Mining
• Data Mining - Definition
• Data Mining – On what kind of data?
• Data Mining Functionalities
• KDD Process (Knowledge Discovery in Databases)
• Classification of DM (Data Mining) Systems
• DM task primitives
• Issues in DM
Just think: One Second on the Internet
• 9,003 Tweets
• 4,705 Skype calls
• 1,711 Tumblr posts
• 83,378 Google searches
• 84,388 YouTube videos viewed
• 996 Instagram photos uploaded
• & many more…
Is all this information really important to us?
Motivation: Why data mining?
“Necessity is the Mother of all Inventions”
“It has been estimated that the amount of information in the world doubles every
10 months.”
There is a tremendous increase in the amount of data recorded and stored on digital media as well as from individual sources.
Since the 1960s, database and information technology has evolved systematically from primitive file-processing systems to powerful database systems.
Research and development in database systems since the 1970s has led to the development of relational database systems.
Netflix collects user ratings of movies (data) → what types of movies you will like (knowledge) → recommend new movies to you (action) → users stay with Netflix (goal)
Summary
The overall goal of the data mining process is to extract information from large data sets or databases and transform it into an understandable structure for further use.
What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
“Knowledge mining from data” or “Knowledge mining”
“Extract knowledge from large data sets or databases”
“Knowledge discovery from databases (KDD)”
[Diagram: Data Mining at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining—On what kind of data?
Relational Databases:
• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access these data.
• E.g.: SQL Server, Oracle, etc.
Data Warehouses:
• A data warehouse is a repository of information collected from multiple sources.
• It is constructed after pre-processing of data. (Data cleaning, Data integration, Data
transformation, Data loading, and Periodic data refreshing etc.)
• E.g.: Stock Market, D-Mart, Big Bazaar, etc.
Data Mining—On what kind of data? (Cont..)
Transactional Databases:
• Transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (TID) and a list of the items
making up the transaction (such as items purchased in a store).
• E.g.: Online shopping on Flipkart, Amazon, etc.
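As an illustration, a tiny transactional database can be modeled as TID → item-list records (a minimal Python sketch; the TIDs and items are invented):

# A toy transactional database: each record is a transaction with a
# unique transaction identity number (TID) and the list of items bought.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "diapers", "beer"],
    "T300": ["milk", "diapers", "beer", "cola"],
    "T400": ["bread", "milk", "diapers", "beer"],
}

for tid, items in transactions.items():
    print(tid, items)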
Other Data/Databases
• Spatial data (Maps or Location related data)
• Engineering design data (Designs of Buildings, Offices Structures data)
• Hypertext and multimedia data (Including text, image, video and audio data), the World Wide
Web (WWW a huge, widely distributed information repository made available on the Internet).
Data Mining Architecture
[Diagram: Data Mining Engine, with a Pattern Evaluation module and a Knowledge Base]
Descriptive
• This task presents the general properties of data stored in a database.
• The descriptive tasks are used to find out patterns in data.
• E.g.: Cluster, Trends, etc.
Predictive
• These tasks predict the value of one attribute on the basis of values of other attributes.
• E.g.: Festival customer/product sales prediction at a store
Data Mining Functionalities
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class
or a concept.
A class can be a category of items on a shop floor, and a concept could be the abstract idea on which data may be categorized, such as products to be put on clearance sale versus non-sale products.
2. Mining Frequent Patterns
Frequent item set: This term refers to a group of items that are commonly found together, such as milk and sugar.
Frequent substructure: It refers to the various types of data structures that can be combined with an item set or subsequences, such as trees and graphs.
3. Association Analysis
Association analysis is based on two measures:
Support, which identifies the common item sets in the database.
Confidence, which is the conditional probability that an item occurs when another item occurs in a transaction.
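A minimal sketch of how support and confidence could be computed over a toy transaction list (the transactions and item names are invented for illustration):

# Toy transactions: each is the set of items bought together.
transactions = [
    {"milk", "sugar"},
    {"milk", "bread"},
    {"milk", "sugar", "bread"},
    {"sugar", "tea"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(both) / support(antecedent).
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "sugar"}))       # 0.5 -> {milk, sugar} occurs in half
print(confidence({"milk"}, {"sugar"}))  # ~0.67 for the rule milk -> sugar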
4. Classification
Classification is a data mining technique that categorizes items in a collection based
on some predefined properties.
It uses methods like if-then rules, decision trees or neural networks to predict a class or essentially classify a collection of items.
A training set containing items whose properties are known is used to train the
system to predict the category of items from an unknown collection of items.
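A minimal classification sketch, assuming scikit-learn is available; the features [price, weight], category labels, and test items are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Training set: items whose properties and categories are already known.
X_train = [[10, 1.0], [12, 1.2], [85, 0.2], [90, 0.3]]   # [price, weight]
y_train = ["grocery", "grocery", "electronics", "electronics"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the category of items from an unknown collection.
print(clf.predict([[11, 1.1], [88, 0.25]]))  # -> ['grocery' 'electronics']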
5. Prediction
Prediction estimates some unavailable data values or spending trends.
An object can be anticipated based on the attribute values of the object and attribute
values of the classes.
It can be a prediction of missing numerical values or increase or decrease trends in
time-related information.
There are primarily two types of predictions in data mining: numeric and class
predictions.
Numeric predictions are made by creating a linear regression model that is based
on historical data. Prediction of numeric values helps businesses ramp up for a
future event that might impact the business positively or negatively.
Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
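A minimal numeric-prediction sketch using a linear regression fitted on historical data (the months and sales figures are invented):

import numpy as np

# Hypothetical historical data: month index vs. sales.
months = np.array([1, 2, 3, 4, 5])
sales = np.array([100, 120, 138, 160, 181])

# Fit a least-squares line to the history.
slope, intercept = np.polyfit(months, sales, deg=1)

# Numeric prediction for the next month (month 6).
print(slope * 6 + intercept)  # ~200: expected sales, so the business can ramp up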
6. Cluster Analysis
In image processing, pattern recognition and bioinformatics, clustering is a popular
data mining functionality.
It is similar to classification, but the classes are not predefined.
Data attributes represent the classes. Similar data are grouped together, with the
difference being that a class label is not known.
Clustering algorithms group data based on similar features and dissimilarities.
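A minimal clustering sketch, assuming scikit-learn; the 2-D points are invented and carry no predefined class labels:

from sklearn.cluster import KMeans

# Points with no class labels (hypothetical 2-D data).
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# k-means groups the data by similarity into 2 clusters on its own.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]: similar points share a cluster label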
7. Outlier Analysis
Outlier analysis is important to understand the quality of data.
If there are too many outliers, you cannot trust the data or draw patterns.
An outlier analysis determines if there is something out of turn in the data and
whether it indicates a situation that a business needs to consider and take measures
to mitigate.
An outlier analysis pulls up the data that cannot be grouped into any class by the algorithms.
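A minimal outlier-detection sketch using a simple z-score rule (the sales values and the 2-standard-deviation cutoff are illustrative assumptions):

import numpy as np

# Hypothetical daily sales; 500 looks out of turn.
data = np.array([48, 52, 50, 47, 53, 49, 500, 51])

# Flag values more than 2 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])  # -> [500]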
8. Evolution and Deviation Analysis
Evolution Analysis pertains to the study of data sets that change over time.
Evolution analysis models are designed to capture evolutionary trends in data
helping to characterize, classify, cluster or discriminate time-related data.
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another.
It determines how well two numerically measured continuous variables are linked.
Researchers can use this type of analysis to see if there are any possible
correlations between variables in their study.
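A minimal correlation sketch using the Pearson coefficient (the two variables, ad spend and sales, are invented for illustration):

import numpy as np

# Two hypothetical continuous variables measured together.
ad_spend = np.array([10, 20, 30, 40, 50])
sales = np.array([25, 44, 62, 85, 100])

# Pearson coefficient: near +1 strong positive, near -1 strong negative.
r = np.corrcoef(ad_spend, sales)[0, 1]
print(round(r, 3))  # close to 1: the two variables are strongly linked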
KDD (Knowledge Discovery in Databases) Process
Knowledge discovery in databases is an iterative process consisting of the following steps:
1. Selection
2. Preprocessing
3. Transformation
4. Data Mining
5. Pattern Evaluation
6. Knowledge Presentation (visualization of patterns or knowledge)
KDD (Knowledge Discovery in Databases) Process (Cont..)
[KDD process diagram: Database → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Pattern Evaluation) → Knowledge, presented to the user through visualization and knowledge representation techniques]
KDD (Knowledge Discovery in Databases) Process (Cont..)
• Data Selection: Where data relevant to the analysis task are retrieved from the
database.
• Data Cleaning: To remove noise and inconsistent data.
• Data Integration: Where multiple data sources may be combined.
• Data Transformation: Where data are transformed or consolidated into appropriate
forms for mining by performing summary or aggregation operations.
• Data Mining: An essential process where intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation: To identify the truly interesting patterns representing knowledge
based on some interestingness measures.
• Knowledge Presentation: Where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
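As a toy illustration of the step order, each KDD step below is a deliberately tiny stand-in (all values and the trivial "mining" step are invented):

# Toy end-to-end KDD sketch.
raw = [("jan", 100.0), ("feb", None), ("mar", 140.0), ("corrupt",)]

selected = [r for r in raw if len(r) == 2]                 # selection
cleaned = [(m, v) for m, v in selected if v is not None]   # cleaning
values = [v / 100.0 for _, v in cleaned]                   # transformation
pattern = "rising" if values[-1] > values[0] else "flat"   # "mining" a trend
if pattern == "rising":                                    # pattern evaluation
    print("Interesting pattern: sales are", pattern)       # presentation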
Classification of Data Mining Systems
Classification of data mining based on..
1. Databases to be mined
2. Knowledge to be mined
3. Techniques/Methods utilized
4. Application adapted
1. Classification according to the kinds of Databases mined
Database models are important for classification according to the kinds of databases
to be mined.
Types of database models
Hierarchical database model
Relational model
Network model
Object-oriented database model
Entity-relationship model
Document model
Entity-attribute-value model
Star schema
Object-relational model
[Hierarchical database model example: Road → National Highway (H1, H2); Road → State Highway (S1, S2)]
Data Mining Task Primitives (Cont..)
The interestingness measures and thresholds for pattern evaluation
They may be used to guide the mining process or, after discovery, to evaluate the discovered
patterns.
Different kinds of knowledge may have different interestingness measures
For example,
Interestingness measures for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
Data Mining Task Primitives (Cont..)
The expected representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.
Interestingness of Patterns
In a data mining system, millions of data patterns are generated every day.
Among all these patterns generated, how many are really interesting?
Actually, a small fraction of patterns generated would be of interest to
any given user.
This raises 3 Questions:
1. What makes patterns interesting?
A pattern is interesting if it is
-easily understood by humans
-Valid on new/test data
-Potentially useful.
Contd..
2. Can a data mining system generate all of the interesting patterns?
-refers to the completeness of a DM system.
-In reality, it is not possible for a DM system to generate all interesting patterns.
3. Can a data mining system generate only interesting patterns?
-refers to the optimization of a DM system.
-It is desirable, but remains a challenging problem.
Coupling of a DM System with a Database System
1. No Coupling
The DM system will not use any function of the DB system, i.e., there is no communication with the DB.
In this case, it communicates directly with other storage methods (file systems).
2. Loose Coupling
Uses some of the DB system's functionalities (only to some extent).
Better than no coupling
Suitable for small data sets.
Contd..
3. Semi-tight Coupling
Linked to the DB.
In addition, some of the DM primitives are implemented in the DB.
4. Tight Coupling
DM System is completely linked to DB.
Most efficient among all.
The DB system is fully integrated in such a way that it becomes part of the DM
System.
Efficient and optimized implementation of DM.
Data Mining Issues
Data mining issues can be classified into five categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability (Algorithms)
4. Diversity of Database Types
5. Data Mining and Society
Data Mining Issues: 1. Mining Methodology
It involves 4 major tasks:
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
Data Cleaning:
The process of removing incorrect, incomplete, or inaccurate data; it also fills in missing data.
1. Handling Missing Values:
Missing values can be filled in 2 ways:
Manual – used for small data sets.
Automatic – more efficient; used for large data sets.
Missing values can be replaced with “NA”.
Missing values can be replaced with the mean value (useful for normal distributions).
Missing values can be replaced with the median value (useful for non-normal distributions).
Sometimes they are replaced with the most probable value.
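A minimal imputation sketch, assuming pandas is available (the column values are invented; the 60.0 is there to skew the mean):

import pandas as pd

# Hypothetical column with missing values.
s = pd.Series([10.0, 12.0, None, 11.0, None, 60.0])

# Mean imputation (reasonable for roughly normal data).
print(s.fillna(s.mean()))

# Median imputation (more robust for non-normal/skewed data:
# the 60.0 pulls the mean up but barely moves the median).
print(s.fillna(s.median()))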
Contd..
2. Handling Noisy Data
Noisy data are inconsistent or erroneous data.
Methods to handle noisy data:
1. Binning
First, the data are sorted. Then the sorted data are distributed into bins.
3 methods to smooth data in bins (see the sketch after this list):
-Smoothing by bin mean
-Smoothing by bin median
-Smoothing by bin boundary
2. Regression
Numerical prediction of data.
Contd..
3. Clustering
Similar data items are grouped together in one cluster.
Dissimilar items fall outside the cluster.
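A minimal sketch of smoothing by bin means on invented values (equal-frequency bins of size 3 are an assumption of this example):

# 1) Sort the data, then 2) distribute it into bins of 3 values each.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# 3) Smoothing by bin mean: every value is replaced by its bin's mean.
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed)  # bin means 9, 22, 29 replace the raw values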
Data Integration
Multiple heterogeneous sources of data are combined into a single dataset.
2 types of data Integration
1. Tight Coupling
Data is combined together into one physical location.
2. Loose Coupling
Only an interface is created; data is combined and accessed through that interface.
The data remains in the actual source databases.
Data Reduction
Volume of data is reduced to make analysis easier.
Methods for Data Reduction
1. Dimensionality Reduction
Reduces the number of input variables in the data set (see the sketch after this list).
A large number of input variables can lead to poor performance.
2. Data Cube Aggregation
Data is combined to construct a data cube. (Redundant, noisy data removed)
3. Attribute Subset Selection
Only highly relevant attributes (columns) are kept; the others are discarded, so the data is reduced.
4. Numerosity Reduction
Here, we store only a model (sample) of the data instead of the entire data.
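A minimal dimensionality-reduction sketch, assuming scikit-learn; the 4-variable records are invented:

from sklearn.decomposition import PCA
import numpy as np

# Hypothetical records, each with 4 input variables.
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 1.1, 0.9],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.8]])

# Project the 4 input variables down to 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (4, 2): fewer input variables, same rows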
Data Transformation
Data is transformed into a form suitable for the mining process.
4 Methods used
1. Normalization
It is done in order to scale the data values into a specified range (-1.0 to 1.0, or 0 to 1); see the sketch after this list.
2. Attribute Selection
New attributes are constructed from the older ones.
3. Discretization
Raw values are replaced by interval labels.
4. Concept Hierarchy Generation
Attributes are converted from a low level to a high level.
Ex: City → Country
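A minimal sketch of min-max normalization into [0, 1] plus a simple discretization (the ages and the interval cut-offs are invented):

# Min-max normalization: scale raw values into the 0-to-1 range.
ages = [18, 25, 40, 60, 90]
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]
print(normalized)  # 18 -> 0.0, 90 -> 1.0, the rest in between

# Discretization: raw values replaced by interval labels.
labels = ["young" if a < 30 else "middle" if a < 60 else "senior" for a in ages]
print(labels)  # ['young', 'young', 'middle', 'senior', 'senior']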