Data Mining Unit 1

The document provides an overview of data and data mining, defining data as a collection of objects characterized by attributes. It explains the data mining process, including its functionalities, architecture, and applications across various sectors such as healthcare, finance, and education. Additionally, it discusses the Knowledge Discovery from Data (KDD) process, types of data, and the classification of data mining systems.

Uploaded by

Aparna kallepu
Copyright © All Rights Reserved

UNIT-1

What is Data
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner that is
suitable for communication, interpretation, or processing by humans or electronic machines.

In other words, data is a collection of objects defined by attributes.

A data object represents an entity. It is also called a record, sample, example, instance, data point, object, or tuple.
Examples:
● In a sales database, the objects may be customers, store items, and sales.
● In a medical database, the objects may be patients.
● In a university database, the objects may be students, professors, and courses.

Data objects are described by attributes. In other words, a collection of attributes describes an object.
Attributes
● An attribute is a data field representing a property or feature of a data object.
● It is also known as a dimension, feature, or variable.
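
As a small illustration of the terms above, a data object can be modeled as a record whose fields are its attributes. This is only a sketch: the `Student` class, its field names, and the sample values are hypothetical, chosen to mirror the university database example.

```python
from dataclasses import dataclass, fields


@dataclass
class Student:
    """One data object (record / tuple) in a hypothetical university database."""
    student_id: int   # attribute
    name: str         # attribute
    department: str   # attribute
    gpa: float        # attribute


# A data set is a collection of data objects.
students = [
    Student(1, "Asha", "CSE", 8.4),
    Student(2, "Ravi", "ECE", 7.9),
]

# The attributes (features / dimensions / variables) shared by every object.
attribute_names = [f.name for f in fields(Student)]
```

Each `Student` instance is one data object, and `attribute_names` lists the attributes that describe it.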

1) What is Data Mining

Definition 1

Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data
sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into
the system dynamically.

(or)

Definition 2
Data Mining is all about discovering hidden, unsuspected, and previously unknown yet valid relationships amongst the
data.

Some Terms in Data mining


Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research
level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be
applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

Relational Database (RDBMS):


● RDBMS stands for Relational Database Management System. Data mining techniques
can be used to analyze data stored in relational databases.
● These databases organize data into tables, consisting of rows and columns.
Data Warehouse:
● A data warehouse is a large centralized repository that consolidates data from various
sources within an organization.
● It is designed to support analytical processing and decision-making.
Transactional Data:
● Transactional data captures records of individual transactions or activities, such as
customer purchases, financial transactions, online interactions, and user behavior.
● Data mining techniques can be applied to transactional data to discover patterns, detect
anomalies, and make predictions.
2) Knowledge Discovery from Data (KDD)

Data mining is needed to extract useful information from large datasets and to use it for making predictions or better
decisions. Nowadays, data mining is used almost everywhere that large amounts of data are stored and
processed.
For example: the banking sector, market basket analysis, and network intrusion detection.
Data mining is also known as Knowledge Discovery from Data, or KDD.

Knowledge Discovery from Data (KDD) Process

KDD is a process that involves the extraction of useful, previously unknown, and potentially valuable information from
large datasets.
The KDD process is iterative: it usually takes multiple passes through the steps listed below to extract accurate
knowledge from the data.

Fig : KDD
The following steps are included in the KDD process:
● Data Cleaning
● Data Integration
● Data Selection
● Data Transformation
● Data Mining
● Pattern Evaluation
● Knowledge Presentation
a) Data cleaning: to remove noise or irrelevant data
b) Data integration: where multiple data sources may be combined
c) Data selection: where data relevant to the analysis task are retrieved from the database
d) Data transformation: where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations
e) Data mining: an essential process where intelligent methods are applied in order to extract data patterns
f) Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some
interestingness measures
g) Knowledge presentation: where visualization and knowledge representation techniques are used to present the
mined knowledge to the user.
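
The steps above can be sketched as a chain of small functions, one per KDD step. This is a toy illustration under stated assumptions, not a real mining system: the two sales sources, the function names, and the threshold are all hypothetical, and the "mining" step is just a ranking of items by total sales.

```python
from collections import Counter

# Two hypothetical sources of sales records: (customer, item, amount).
store_a = [("c1", "milk", 2.0), ("c2", "bread", 1.5), ("c1", "milk", None)]
store_b = [("c3", "milk", 2.5), ("c2", "eggs", 3.0), ("c3", "bread", 1.0)]


def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r[2] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r[1] in items]


def transform(records):
    """Data transformation: aggregate the total amount per item."""
    totals = Counter()
    for _, item, amount in records:
        totals[item] += amount
    return dict(totals)


def mine(totals):
    """Data mining (toy stand-in): rank items by total sales."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def evaluate(patterns, min_total):
    """Pattern evaluation: keep patterns that meet the threshold."""
    return [p for p in patterns if p[1] >= min_total]


def present(patterns):
    """Knowledge presentation: format the result for the user."""
    return [f"{item}: {total:.2f}" for item, total in patterns]


data = integrate(clean(store_a), clean(store_b))
relevant = select(data, {"milk", "bread"})
report = present(evaluate(mine(transform(relevant)), min_total=2.0))
```

In practice the process is iterative, so a poor `report` would send the analyst back to an earlier step, for example to change the selection or the threshold.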

3) Data Mining architecture

Data mining is a very important process where potentially useful and previously unknown information is extracted from
large volumes of data. There are several components involved in the data mining process.
The major components of any data mining system are data source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface and knowledge base.

Data Sources
Database, data warehouse, World Wide Web (WWW), text files and other documents are the actual sources of data.
You need large volumes of historical data for data mining to be successful.
Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is
responsible for retrieving the relevant data based on the data mining request of the user.
Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of several modules for performing
data mining tasks such as association, classification, characterization, and clustering.
Pattern Evaluation Module
The pattern evaluation module is mainly responsible for measuring the interestingness of a pattern by using
a threshold value.
Graphical User Interface
The graphical user interface module provides the communication between the user and the data mining system. This
module helps the user use the system easily.
Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the search or evaluating
the interestingness of the result patterns.

4) What are the Types of Data

What Kinds of Data Can Be Mined

As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target
application.
The following are the most basic forms of data for mining.

Basic forms of data for mining

● Database Data (or) Relational database
● Data warehouse data
● Transactional data

Other forms of data for mining

● Multimedia Database
● Spatial Database
● World Wide Web
● Text data (Flat File)
● Time series database
5) Describe the Data Mining Functionalities
Data mining is important because there is so much data out there, and it's impossible for people to look through it all by
themselves.
Data mining uses various functionalities to analyze the data and find patterns, trends, and other information that would
be hard for people to find on their own.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such
data mining tasks can be classified into two categories: descriptive and predictive.

1] Descriptive Data Mining: This category of data mining is concerned with finding patterns and relationships in the
data that can provide insight into the underlying structure of the data. Descriptive data mining is often used to
summarize or explore the data.

2] Predictive Data Mining: This category of data mining is concerned with developing models that can predict future
behavior or outcomes based on historical data. Predictive data mining is often used for classification or regression
tasks.

Fig: Data Mining Functionalities

1] Descriptive Data Mining:

● Data Characterization
Data characterization summarizes the general features or characteristics of the class
under study. The output can be presented in various forms, including bar charts, pie charts,
multidimensional data cubes, etc.
● Data Discrimination
Data discrimination compares the general features of the class under study with the general features of one
or more contrasting classes.
● Cluster Analysis
Cluster analysis, also called clustering, is a data mining process in which similar data points are identified
and grouped.
It is commonly used in customer behavior analysis, fraud detection, etc.
● Classification
Classification in data mining is a technique used to categorize data into predefined classes or categories
based on specific attributes or characteristics.
It is commonly used for churn prediction, loan default risk assessment, item categorization, etc.
● Regression
Regression is a data mining technique that predicts numeric values by modeling the relationship
between a dependent variable and one or more independent variables. The model can then be used to
predict values of the dependent variable for new values of the independent variables.
Some examples of what can be predicted using regression include:
Profit, Sales, Mortgage rates, House values, Square footage, Temperature, and Distance.
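
To make the regression item above concrete, here is a minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The function name `fit_line` and the house size and price figures are made up for illustration; the prices were chosen to be exactly 3 times the size so the fit is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size (independent) and price (dependent).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)

# Predict the dependent variable (price) for a new size.
predicted_price = slope * 90 + intercept
```

The fitted model predicts the dependent variable (price) from a new value of the independent variable (size), which is the direction regression works in.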

2] Predictive Data Mining:

● Prediction
Prediction in data mining is the process of using historical data and patterns to make informed estimates
about future or missing data values. It involves the application of various algorithms and techniques to
anticipate numerical values, such as sales figures, or to classify items into predefined categories. This
predictive capability enables businesses and researchers to make data-driven decisions, identify trends,
and enhance their understanding of complex datasets, ultimately facilitating better planning and
strategy development.
● Decision Tree
Decision trees are a common way of predicting values. They use a tree-like structure that shows
how the model reaches a prediction. This allows users to drill deeper into the data and understand the
relationship between the predictors and the predicted value.
● Neural Networks
A more advanced way of performing predictive data mining is by using neural networks, a class of
algorithms loosely inspired by how the human brain works. Each node takes inputs, applies weights to them,
and produces an output, playing a role similar to a brain cell (neuron).
● Association Analysis
In association analysis, we identify rules that describe relationships between items in the data. For
instance, conducting market basket analysis on a supermarket’s transaction data helps identify
items that are frequently bought together, leading to improved inventory management, optimized product placement,
and effective group discounts.
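
Association analysis is commonly quantified with two measures, support and confidence. The sketch below computes both for a rule over a tiny set of baskets; the transactions and the function names are hypothetical and only illustrate the definitions.

```python
# Hypothetical supermarket baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: bread -> butter
rule_support = support({"bread", "butter"}, transactions)        # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"butter"}, transactions)  # 2 of 3 bread baskets
```

A rule is usually kept only if both values exceed user-chosen minimum thresholds.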

6) Applications or Uses or Advantages of Data Mining

Data Mining in Healthcare:


Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better
insights and to identify best practices that will enhance health care services and reduce costs. Data mining can be used
to forecast the number of patients in each category, which helps ensure that patients get intensive care at the right
place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:


Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you
are more likely to buy another group of products. This technique may enable the retailer to understand the purchase
behavior of a buyer.

Data mining in Education:


Education data mining is a newly emerging field, concerned with developing techniques that explore knowledge from
the data generated in educational environments. An organization can use data mining to make precise decisions and
also to predict students' results. With the results, the institution can concentrate on what to teach and how to
teach.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial for finding patterns
in a complex manufacturing process. Data mining can be used in system-level design to obtain the relationships
between product architecture, product portfolio, and the data needs of the customers.
CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer
loyalty, and implementing customer-oriented strategies. To build a good relationship with its customers, a business
organization needs to collect data and analyze it. With data mining technologies, the collected data can be used
for analytics.

Data Mining in Fraud detection:


Data mining finds meaningful patterns and turns data into information. An ideal fraud detection system should
protect the data of all the users. Supervised methods start from a collection of sample records that are labeled as
fraudulent or non-fraudulent; a model built from these records can then classify new records.

Data Mining in Lie Detection:


Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications,
etc.

Data Mining in Financial Banking:


The digitalization of the banking system generates an enormous amount of data with every new
transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by
identifying trends, causalities, and correlations in business information and market costs that are not instantly evident to
managers or executives, because the data volume is too large or the data are generated too rapidly for experts to review.
Managers may use this information for better targeting, acquiring, retaining, and segmenting of profitable customers.

7) Interesting Patterns

A data mining system has the potential to generate thousands or even millions of patterns, or rules. Are all of the
patterns interesting? Typically not: only a small fraction of the patterns potentially generated would be of interest to
any given user.
This raises some serious questions for data mining. You may wonder,

1. What makes a pattern interesting?
2. Can a data mining system generate all the interesting patterns?
3. Can a data mining system generate only interesting patterns?
To answer the first question, a pattern is interesting if it is

1. easily understood by humans,
2. valid on new or test data with some degree of certainty,
3. potentially useful, and
4. novel.
The second question, “Can a data mining system generate all the interesting patterns?”, refers to the completeness of a
data mining algorithm. It is often unrealistic and inefficient for data mining systems to generate all the possible patterns.
Instead, user-provided constraints and interestingness measures should be used to focus the search. A data mining
algorithm is complete if it mines all interesting patterns.
Finally, the third question, “Can a data mining system generate only interesting patterns?”, is an optimization
problem in data mining. It is highly desirable for data mining systems to generate only interesting patterns. An
interesting pattern represents knowledge.

Fig: interesting Patterns

8) Classification of Data Mining systems

Data mining is considered an interdisciplinary field. It draws on various disciplines such as statistics, database
systems, machine learning, visualization, and information science. Classifying data mining systems helps users
to understand a system and match their requirements with it.
Data mining discovers patterns and extracts useful information from large datasets. Organizations need to analyze and
interpret data using data mining systems as data grows rapidly. With an exponential increase in data, active data
analysis is necessary to make sense of it all.
Data mining (DM) systems can be classified based on various factors.

● Classification based on Types of Data Mined
● Classification based on Type of knowledge Mined
● Classification based on Type of Technique Utilized
● Classification based on Application Domain

Classification based on Types of Data Mined

A data mining system can be classified based on the type of data, the data model used, or the application of the data.
For example: relational database, transactional database, multimedia database, textual data, World Wide Web
(WWW), etc.

Classification based on Type of knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is
classified based on functionalities such as
● Association Analysis
● Classification
● Prediction
● Cluster Analysis
● Characterization
● Discrimination

Classification based on Type of Technique Utilized

We can classify a data mining system according to the kind of techniques used. We can describe these techniques
according to the degree of user interaction involved or the methods of analysis employed.
Data Mining systems use various techniques, including Statistics, Machine Learning, Database Systems, Information
retrieval, Visualization, and pattern recognition.

Classification based on Application Domain

We can classify a data mining system according to the application domain it is adapted to. These domains include the following:

● Finance
● Telecommunications
● E-Commerce
● Media Sector
● Stock Markets
9) Data mining Task primitives

A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data
mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively
communicate with the data mining system during the mining process to discover interesting patterns.
Here is the list of Data Mining Task Primitives

● Set of task-relevant data to be mined.
● Kind of knowledge to be mined.
● Background knowledge to be used in the discovery process.
● Interestingness measures and thresholds for pattern evaluation.
● Representation for visualizing the discovered patterns.
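
One way to picture the five primitives is as the fields of a single query object that is handed to the mining system. The `MiningQuery` class below is a hypothetical container invented for this sketch; its field names and example values are assumptions and do not correspond to any real query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five data mining task primitives."""
    task_relevant_data: str   # which data to mine
    knowledge_kind: str       # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measures and thresholds
    presentation: str = "table"  # how to display discovered patterns


query = MiningQuery(
    task_relevant_data="sales records for one year",
    knowledge_kind="association",
    background_knowledge={"city": "country"},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="rules",
)
```

Changing one field, for example raising `min_support`, corresponds to the user interactively refining the query during mining.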
10) Integration of Data mining system with a Data warehouse

The data mining system is integrated with a database or data warehouse system so that it can do its tasks
effectively. A data mining system operates in an environment where it needs to communicate with other data systems such as a
database or data warehouse system.
There are different possible integration (coupling) schemes as follows:

● No Coupling
● Loose Coupling
● Semi-Tight Coupling
● Tight Coupling

No Coupling

No coupling means that a data mining system will not utilize any function of a database or data warehouse system.
It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and
then store the mining results in another file.

Loose Coupling

In loose coupling, the data mining system uses some facilities or services of a database or data warehouse system.
The data is fetched from a data repository managed by these (DB/DW) systems.
Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data
warehouses by using query processing, indexing, and other system facilities.

Semi-Tight Coupling

Semi-tight coupling means that besides linking a data mining system to a database/data warehouse system, efficient
implementations of a few essential data mining primitives can be provided in the DB/DW system. These primitives can
include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential
statistical measures, such as sum, count, max, min, and standard deviation.
Tight Coupling

Tight coupling means that a data mining system is smoothly integrated into the database/data warehouse system.
The data mining subsystem is treated as one functional component of an information system. Data mining queries and
functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing
methods of a DB or DW system.
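
The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module standing in for the DB/DW system. The `sales` table and its rows are hypothetical, and the "mining" here is only a sum, chosen to show where the work happens rather than to be a real algorithm.

```python
import sqlite3

# An in-memory database stands in for the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("milk", 2.0), ("bread", 1.5), ("milk", 2.5), ("eggs", 3.0)],
)

# Loose coupling: the database only selects the rows; the analysis
# (here a trivial total) runs outside the database.
rows = conn.execute("SELECT amount FROM sales WHERE item = 'milk'").fetchall()
milk_total_outside = sum(amount for (amount,) in rows)

# Semi-tight coupling: primitives such as SUM and COUNT are computed
# inside the database, and only the summary is returned.
milk_total_inside, milk_count = conn.execute(
    "SELECT SUM(amount), COUNT(*) FROM sales WHERE item = 'milk'"
).fetchone()

conn.close()
```

Both paths give the same total, but the second moves less data out of the database, which is the motivation for pushing primitives into the DB/DW system.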

Major issues in Data Mining

Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data
generated by individuals, organizations, and machines has grown exponentially. Data mining is not an easy task, as the
algorithms used can get very complex and data is not always available in one place. It needs to be integrated from
various heterogeneous data sources.
These factors may lead to some issues in data mining. The issues are mainly divided into three categories, which
are given below:

1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues

Fig : Data Mining Issues


11) What is Data Preprocessing?

Data preprocessing is a crucial step in data mining. It involves transforming raw data into a clean, structured, and
suitable format for mining. Proper data preprocessing helps improve the quality of the data, enhances the performance
of algorithms, and ensures more accurate and reliable results.
Major Tasks in Data Preprocessing
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions must be based on
quality data. Data preprocessing involves data cleaning, data integration, data transformation, and data reduction.
Steps in Data Preprocessing
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
1.Data Cleaning:
● Handling missing values: Dealing with cases where some data points have no values by filling them in or
removing them.
● Smoothing noisy data: Removing or reducing random errors or outliers in the data.
● Removing outliers: Identifying and eliminating data points that significantly deviate from the overall pattern.
● Resolving inconsistencies: Correcting discrepancies or conflicts in codes, names, or values across the data.
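
Two of the cleaning tasks above, removing outliers and handling missing values, can be sketched in a few lines. The sensor readings are invented, and the outlier rule (more than `k` median absolute deviations from the median) is one simple choice among many, assumed here only for illustration.

```python
from statistics import mean, median

# Hypothetical sensor readings; None marks a missing value.
readings = [10.0, 12.0, None, 11.0, 13.0, 100.0]


def drop_outliers(values, k=5.0):
    """Remove known values more than k median absolute deviations from the median."""
    known = [v for v in values if v is not None]
    med = median(known)
    mad = median([abs(v - med) for v in known])
    return [v for v in values if v is None or abs(v - med) <= k * mad]


def fill_missing(values):
    """Replace missing entries (None) with the mean of the known values."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


# Remove the outlier first so it does not distort the fill value.
cleaned = fill_missing(drop_outliers(readings))
```

Here the reading 100.0 is dropped as an outlier and the missing entry is filled with 11.5, the mean of the remaining values.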

2.Data Integration:
● Combining data from multiple sources: Bringing together data from different databases, files, or data cubes
into a single, unified format for analysis.

3.Data Transformation:
● Normalizing data: Scaling the values of different attributes to a common range, ensuring they are on the same
scale for accurate analysis.
● Aggregating data: Summarizing or grouping data to a higher level of abstraction, such as calculating averages
or totals.
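
Normalization to a common range is often done with min-max scaling, which maps each value v to (v - min) / (max - min), optionally stretched to a new range. The sketch below assumes the attribute has at least two distinct values (otherwise the denominator is zero); the income figures and the function name are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero: values must not all be equal
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Hypothetical attribute on a large scale.
incomes = [20000, 35000, 50000, 80000]
scaled = min_max_normalize(incomes)
```

After scaling, `incomes` lies in the range 0 to 1 and can be compared with other attributes that were normalized the same way.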

4.Data Reduction:
● Reducing data volume: Applying techniques to reduce the size of the dataset without losing essential
information.
● Preserving important information: Ensuring that the reduced dataset still retains key
patterns, trends, or characteristics present in the original data.
