Data Mining Unit 1
What is Data
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for
communication, interpretation, or processing by humans or electronic machines.
A data object represents an entity. It is also called a record, sample, example, instance, data point, object, or tuple.
Examples:
In a sales database, the objects may be customers, store items, and sales;
In a medical database, the objects may be patients;
In a university database, the objects may be students, professors, and courses.
Data objects are described by attributes; in other words, a collection of attributes describes an object.
Attributes
An attribute is a data field representing a property or feature of a data object.
It is also known as a dimension, feature, or variable.
Definition 1
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data
sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into
the system dynamically.
(or)
Definition 2
Data Mining is all about discovering hidden, unsuspected, and previously unknown yet valid relationships amongst the
data.
Data mining is needed to extract useful information from large datasets and to use it for predictions or better
decision-making. Nowadays, data mining is used in almost every domain where a large amount of data is stored and
processed.
For example: the banking sector, market basket analysis, and network intrusion detection.
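As a taste of what these applications look like in practice, here is a minimal sketch of the idea behind market basket analysis: measuring how often items are bought together. The transactions and item names are invented for illustration.

```python
# Invented example transactions: each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"bread", "milk"}, transactions))  # 2 of 4 baskets -> 0.5
```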
Data mining is also known as Knowledge Discovery from Data, or KDD.
KDD is a process that involves the extraction of useful, previously unknown, and potentially valuable information from
large datasets.
The KDD process is iterative, and it usually requires multiple passes through its steps to extract accurate
knowledge from the data.
Fig.: The KDD process
The KDD process includes the following steps:
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
a) Data cleaning: to remove noise or irrelevant data
b) Data integration: where multiple data sources may be combined
c) Data selection: where data relevant to the analysis task are retrieved from the database
d) Data transformation: where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations
e) Data mining: an essential process where intelligent methods are applied in order to extract data patterns
f) Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some
interestingness measures
g) Knowledge presentation: where visualization and knowledge representation techniques are used to present the
mined knowledge to the user.
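The steps above can be sketched end to end on a toy dataset. Everything here (the records, the threshold, and the trivial "pattern") is an invented illustration, not a real mining algorithm:

```python
# Two invented data sources with a missing value in one record.
sales_db = [{"item": "pen", "qty": 3}, {"item": "pen", "qty": None}]
web_log  = [{"item": "book", "qty": 2}]

# a) Data cleaning: drop records with missing values
cleaned = [r for r in sales_db if r["qty"] is not None]

# b) Data integration: combine the two sources
integrated = cleaned + web_log

# c) Data selection: keep only the attributes relevant to the task
selected = [(r["item"], r["qty"]) for r in integrated]

# d) Data transformation: aggregate quantities per item
totals = {}
for item, qty in selected:
    totals[item] = totals.get(item, 0) + qty

# e) Data mining (here, a trivial "pattern"): the best-selling item
best = max(totals, key=totals.get)

# f) Pattern evaluation: keep the pattern only if it passes a threshold
interesting = best if totals[best] >= 2 else None

# g) Knowledge presentation
print(f"Best seller: {interesting}, totals: {totals}")
```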
Data mining is a very important process where potentially useful and previously unknown information is extracted from
large volumes of data. There are several components involved in the data mining process.
The major components of any data mining system are data source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface and knowledge base.
Data Sources
Database, data warehouse, World Wide Web (WWW), text files and other documents are the actual sources of data.
You need large volumes of historical data for data mining to be successful.
Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is
responsible for retrieving the relevant data based on the data mining request of the user.
Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of several modules for performing
data mining tasks, including association analysis, classification, characterization, and clustering.
Pattern Evaluation Modules
The pattern evaluation module is mainly responsible for measuring the interestingness of patterns, typically by
comparing them against a threshold value.
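As a rough sketch of this idea, the function below filters candidate association rules against user-supplied support and confidence thresholds; the rules and threshold values are invented for illustration:

```python
# Invented candidate rules with made-up interestingness measures.
candidate_rules = [
    {"rule": "bread -> butter", "support": 0.40, "confidence": 0.80},
    {"rule": "milk -> eggs",    "support": 0.05, "confidence": 0.90},
    {"rule": "pen -> paper",    "support": 0.30, "confidence": 0.50},
]

def evaluate(rules, min_support=0.10, min_confidence=0.70):
    """Keep only the rules that meet both interestingness thresholds."""
    return [r["rule"] for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]

print(evaluate(candidate_rules))  # only "bread -> butter" passes both
```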
Graphical User Interface
The graphical user interface module provides the communication between the user and the data mining system. This
module helps the user use the system easily.
Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the search or evaluating
the interestingness of the result patterns.
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target
application.
The following are the most basic forms of data for mining.
Multimedia Database
Spatial Database
World Wide Web
Text data (Flat File)
Time series database
5) Describe the Data Mining Functionalities
Data mining is important because there is so much data out there, and it's impossible for people to look through it all by
themselves.
Data mining uses various functionalities to analyze the data and find patterns, trends, and other information that would
be hard for people to find on their own.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such
data mining tasks can be classified into two categories: descriptive and predictive.
7) Interesting Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules. This raises a
serious question: are all of the patterns interesting? Typically not; only a small fraction of the patterns potentially
generated would be of interest to any given user.
Data Mining is considered as an interdisciplinary field. It includes a set of various disciplines such as statistics, database
systems, machine learning, visualization, and information sciences. Classification of the data mining system helps users
to understand the system and match their requirements with such systems.
Data mining discovers patterns and extracts useful information from large datasets. Organizations need to analyze and
interpret data using data mining systems as data grows rapidly. With an exponential increase in data, active data
analysis is necessary to make sense of it all.
Data mining (DM) systems can be classified based on various factors.
A data mining system can be classified by the type of data it mines, the data model used, or the application the data
serves.
For example: relational databases, transactional databases, multimedia databases, textual data, the World Wide Web
(WWW), etc.
We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is
classified based on functionalities such as
Association Analysis
Classification
Prediction
Cluster Analysis
Characterization
Discrimination
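To make one of these functionalities concrete, here is a minimal classification sketch using a nearest-neighbour rule; the training points and labels are invented for illustration:

```python
# Invented labelled examples: (feature vector, class label).
training = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((5.0, 5.2), "high"), ((4.8, 5.1), "high")]

def classify(point):
    """Assign the label of the closest training example (1-nearest-neighbour)."""
    def dist2(a, b):
        # squared Euclidean distance (enough for comparing distances)
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(training, key=lambda ex: dist2(ex[0], point))[1]

print(classify((1.1, 0.9)))  # nearest neighbours are "low" examples
```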
We can classify a data mining system according to the kind of techniques used. We can describe these techniques
according to the degree of user interaction involved or the methods of analysis employed.
Data Mining systems use various techniques, including Statistics, Machine Learning, Database Systems, Information
retrieval, Visualization, and pattern recognition.
We can classify a data mining system according to the applications adapted. These applications are as follows
Finance
Telecommunications
E-Commerce
Media Sector
Stock Markets
9) Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data
mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively
communicate with the data mining system during the mining process to discover interesting patterns.
Here is the list of data mining task primitives:
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge to be used in the discovery process
The interestingness measures and thresholds for pattern evaluation
The expected representation for visualizing the discovered patterns
The data mining system is integrated with a database or data warehouse system so that it can perform its tasks
effectively. A data mining system operates in an environment where it needs to communicate with other data systems,
such as a database or data warehouse system.
There are different possible integration (coupling) schemes as follows:
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that the data mining system does not utilize any function of a database or data warehouse system.
It may fetch data from a particular source (such as a file system), process the data using some data mining algorithm,
and then store the mining results in another file.
Loose Coupling
In loose coupling, the data mining system uses some facilities or services of a database or data warehouse system.
The data is fetched from a data repository managed by these (DB/DW) systems.
Loose coupling is better than no coupling because it can fetch any portion of data stored in Databases or Data
Warehouses by using query processing, indexing, and other system facilities.
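The loose-coupling idea can be sketched with Python's standard-library SQLite bindings: the mining code delegates selection and aggregation to the database's query processor rather than reimplementing them. The table and values are invented:

```python
import sqlite3

# An in-memory database standing in for the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pen", 3), ("book", 2), ("pen", 1)])

# Let the DB system do selection and aggregation (its query processing),
# then mine on the returned result set.
rows = conn.execute(
    "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item").fetchall()
print(rows)  # [('book', 2), ('pen', 4)]
```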
Semi-Tight Coupling
Semi-tight coupling means that, besides linking a data mining system to a DB/DW system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW system. These primitives can
include sorting, indexing, aggregation, histogram analysis, multi-way join, and precomputation of some essential
statistical measures, such as sum, count, max, min, and standard deviation.
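A rough sketch of the kind of precomputed statistical measures such a system might maintain (the values are invented):

```python
import math

values = [4, 8, 6, 2]  # an invented attribute column

# Precompute the simple measures once, as the DB/DW layer might.
stats = {
    "sum":   sum(values),
    "count": len(values),
    "max":   max(values),
    "min":   min(values),
}
mean = stats["sum"] / stats["count"]
# Population standard deviation, derived from the precomputed sum and count.
stats["std"] = math.sqrt(sum((v - mean) ** 2 for v in values) / stats["count"])

print(stats)
```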
Tight Coupling
Tight coupling means that the data mining system is smoothly integrated into the database/data warehouse system.
The data mining subsystem is treated as one functional component of the information system. Data mining queries and
functions are optimized based on mining query analysis, data structures, indexing schemes, and the query processing
methods of the DB or DW system.
Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data
generated by individuals, organizations, and machines has grown exponentially. Data mining is not an easy task: the
algorithms used can get very complex, and data is not always available in one place; it often needs to be integrated
from various heterogeneous data sources.
These factors can lead to issues in data mining, which fall mainly into three categories: mining methodology and user
interaction issues, performance issues, and issues arising from diverse data types.
Data preprocessing is a crucial step in data mining. It involves transforming raw data into a clean, structured, and
suitable format for mining. Proper data preprocessing helps improve the quality of the data, enhances the performance
of algorithms, and ensures more accurate and reliable results.
Major Tasks in Data Preprocessing
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions must be based on
quality data. And Data Preprocessing involves Data Cleaning, Data Integration, Data Reduction and Data Transformation.
Steps in Data Preprocessing
1. Data Cleaning
2. Data integration
3. Data Transformation
4. Data Reduction
1. Data Cleaning:
● Handling missing values: Dealing with cases where some data points have no values by filling them in or
removing them.
● Smoothing noisy data: Removing or reducing random errors or outliers in the data.
● Removing outliers: Identifying and eliminating data points that significantly deviate from the overall pattern.
● Resolving inconsistencies: Correcting discrepancies or conflicts in codes, names, or values across the data.
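A small sketch of two of these cleaning tasks on invented data: filling a missing value with the attribute mean, and dropping a value outside a simple fixed bound (a real system might use standard deviations or the interquartile range instead):

```python
# Invented ages: None marks a missing value, 180 is an obvious outlier.
ages = [23, 25, None, 24, 180]

# Handling missing values: replace None with the mean of the known values.
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
filled = [mean_age if a is None else a for a in ages]

# Removing outliers: here, a simple fixed upper bound.
cleaned = [a for a in filled if a <= 120]

print(cleaned)
```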
2. Data Integration:
● Combining data from multiple sources: Bringing together data from different databases, files, or data cubes
into a single, unified format for analysis.
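A minimal sketch of integration: merging two invented sources that describe the same customers under different attributes:

```python
# Two invented sources keyed by the same customer id.
crm = {"c1": {"name": "Asha"}, "c2": {"name": "Ravi"}}
billing = {"c1": {"total_spend": 120}, "c2": {"total_spend": 45}}

# Build one unified record per customer from both sources.
unified = {}
for cid in crm:
    record = dict(crm[cid])              # start from the CRM attributes
    record.update(billing.get(cid, {}))  # add the billing attributes
    unified[cid] = record

print(unified["c1"])  # {'name': 'Asha', 'total_spend': 120}
```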
3. Data Transformation:
● Normalizing data: Scaling the values of different attributes to a common range, ensuring they are on the same
scale for accurate analysis.
● Aggregating data: Summarizing or grouping data to a higher level of abstraction, such as calculating averages
or totals.
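Both transformation tasks can be sketched in a few lines; the income and sales figures are invented:

```python
def min_max(values):
    """Normalize values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 50_000, 80_000]
print(min_max(incomes))  # [0.0, 0.5, 1.0]

# Aggregation: invented daily sales rolled up to monthly totals.
daily_sales = {"2024-01-01": 10, "2024-01-02": 12, "2024-02-01": 7}
monthly = {}
for day, amount in daily_sales.items():
    month = day[:7]  # "YYYY-MM" prefix of the date
    monthly[month] = monthly.get(month, 0) + amount
print(monthly)  # {'2024-01': 22, '2024-02': 7}
```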
4. Data Reduction:
● Reducing data volume: Applying techniques to reduce the size of the dataset without losing essential
information.
● Preserving important information: Ensuring that the reduced dataset still retains key
patterns, trends, or characteristics present in the original data.
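Data reduction by simple random sampling can be sketched as follows; the "large" dataset here is synthetic:

```python
import random

population = list(range(1000))           # the synthetic "large" dataset
random.seed(42)                          # fixed seed for reproducibility
sample = random.sample(population, 100)  # a 10% simple random sample

# The reduced dataset should preserve key characteristics, e.g. the mean.
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(pop_mean, round(sample_mean, 1))
```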