Data Mining Notes

Uploaded by

shimi

DATA MINING

Data mining refers to extracting, or mining, knowledge from large amounts of data. It would
have been more appropriately named knowledge mining, a name that emphasizes that knowledge
is what is mined from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases

The knowledge discovery process is shown in the figure as an iterative sequence of the following
steps:

• Data cleaning: To remove noise and inconsistent data
• Data integration: Where multiple data sources may be combined
• Data selection: Where data relevant to the analysis task are retrieved from the database
• Data transformation: Where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations.
• Data mining: An essential process where intelligent methods are applied to extract data
patterns.
• Pattern evaluation: To identify the interesting patterns representing knowledge based on
interestingness measures.
• Knowledge presentation: Where visualization and knowledge representation techniques
are used to present mined knowledge to users.
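
As a minimal sketch of how the first four steps might look in practice, the following Python
fragment cleans, integrates, selects, and aggregates a toy set of sales records. The record
layout, the field names, and the helper functions are hypothetical and chosen only for
illustration; they are not part of the original notes.

```python
from collections import defaultdict

# Two hypothetical data sources; None marks a missing (noisy) value.
store_a = [
    {"customer": "c1", "product": "milk", "amount": 2.5},
    {"customer": "c2", "product": "bread", "amount": None},
]
store_b = [
    {"customer": "c1", "product": "bread", "amount": 1.5},
    {"customer": "c3", "product": "milk", "amount": 3.0},
]


def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, product):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["product"] == product]


def transform(records):
    """Data transformation: aggregate the amount per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


combined = integrate(clean(store_a), clean(store_b))
milk_totals = transform(select(combined, "milk"))
print(milk_totals)
```

The mining, evaluation, and presentation steps would then operate on the transformed data.
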
What kinds of data can be mined?
The most basic forms of data for mining applications are database data, data warehouse data and
transactional data. Data mining can also be applied to other forms of data such as data streams,
ordered/sequence data, graph or networked data, spatial data, text data, multimedia data and the
WWW.

Database data
A database system, also called a database management system (DBMS), consists of a collection
of interrelated data known as a database and a set of software programs to manage and access the
data. The software programs provide mechanisms for defining database structures and data
storage, for specifying and managing concurrent, shared, or distributed data access and for
ensuring consistency and security of the information stored despite system crashes or attempts at
unauthorized access.
Relational databases are one of the most commonly available and richest information repositories
and thus they are a major data form in the study of data mining.
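
To illustrate how a relational database serves as a data source, here is a small sketch using
Python's built-in sqlite3 module with an in-memory database. The table name, column names, and
rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define the database structure (hypothetical schema).
conn.execute(
    "CREATE TABLE purchases (customer TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("c1", "milk", 2.5), ("c1", "bread", 1.5), ("c2", "milk", 3.0)],
)

# Retrieve and aggregate only the data relevant to the analysis task.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "WHERE product = ? GROUP BY customer ORDER BY customer",
    ("milk",),
).fetchall()
print(rows)
conn.close()
```
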

THE SCOPE OF DATA MINING

Data mining derives its name from the similarities between searching for valuable
business information in a large database (for example, finding linked products in gigabytes of
store scanner data) and mining a mountain for a vein of valuable ore.
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly and quickly from the data. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

Tasks of Data Mining

Data mining involves six common classes of tasks:
• Anomaly detection (Outlier/change/deviation detection) - The identification of
unusual data records that might be interesting, or of data errors that require further
investigation.
• Association rule learning (Dependency modeling) - Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
• Clustering - The task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
• Classification - The task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
• Regression - Attempts to find a function which models the data with the least error.
• Summarization - Providing a more compact representation of the data set, including
visualization and report generation.
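
The market basket idea can be sketched with two standard measures: support (the fraction of
baskets containing a set of items) and confidence (the estimated probability that a basket
contains one item given that it contains another). The baskets below are invented for
illustration, and the function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(*items):
    """Fraction of baskets that contain every given item."""
    wanted = set(items)
    return sum(wanted <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent, consequent) / support(antecedent)


print(support("bread", "milk"))     # share of baskets with both items
print(confidence("bread", "milk"))  # how often milk accompanies bread
```

A rule such as "bread implies milk" would typically be reported only if both values exceed
thresholds chosen by the analyst.
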
Architecture of Data Mining
A typical data mining system may have the following major components.

1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included. Other examples of domain knowledge
are additional interestingness constraints or thresholds, and metadata (e.g., describing
data from multiple heterogeneous sources).
2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.
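
The difference between filtering patterns afterwards and pushing the threshold into the search
can be sketched for frequent item pairs. This is only an illustration under assumed data: the
baskets, the minimum count, and both function names are hypothetical. The second function relies
on the fact that a pair cannot be frequent unless each of its items is frequent.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions and interestingness threshold (minimum count).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "jam"},
    {"tea"},
]
MIN_COUNT = 2


def pairs_filtered_afterwards(baskets, min_count):
    """Count every pair first, then apply the threshold at the end."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    return frequent, len(counts)  # len(counts) = candidate pairs examined


def pairs_threshold_pushed_in(baskets, min_count):
    """Apply the threshold to single items first, so pairs that contain
    an infrequent item are never generated or counted."""
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, n in item_counts.items() if n >= min_count}
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket & frequent_items), 2))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    return frequent, len(counts)


late = pairs_filtered_afterwards(baskets, MIN_COUNT)
early = pairs_threshold_pushed_in(baskets, MIN_COUNT)
print(late)
print(early)
```

Both functions return the same frequent pairs, but the second examines fewer candidates.
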

4. User interface:

This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:

1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to
come up with a meaningful problem statement. Unfortunately, many application studies
tend to focus on the data-mining technique at the expense of a clear problem statement.
In this step, a modeler usually specifies a set of variables for the unknown dependency
and, if possible, a general form of this dependency as an initial hypothesis. There may be
several hypotheses formulated for a single problem at this stage. The first step requires
the combined expertise of an application domain and a data-mining model. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the
initial phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there
are two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and
implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori
knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model and
the data used later for testing and applying a model come from the same, unknown,
sampling distribution. If this is not the case, the estimated model cannot be successfully
used in a final application of the results.
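
One common way to keep the estimation and testing data on the same sampling distribution is to
partition a single collected data set at random. The sketch below assumes this approach; the
function name, the test fraction, and the seed are hypothetical choices. A random split does not
by itself guarantee that data met later in the final application follow the same distribution.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into two parts."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for collected observations
train, test = train_test_split(records)
print(len(train), len(test))
```
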

3. Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors or from coding and recording errors; sometimes they are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or

b. Develop robust modeling methods that are insensitive to outliers.
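
Both strategies can be sketched on a toy sample. For strategy (a), the example assumes one
particular detection rule, the modified z-score based on the median absolute deviation (MAD)
with the commonly used cutoff of 3.5; other rules are equally possible. For strategy (b), it
simply compares the mean with the median, which is a robust estimate. The data and the function
name are hypothetical.

```python
from statistics import mean, median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread around the median; the rule does not apply
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


values = [10, 12, 11, 13, 12, 11, 95]  # 95 looks like a recording error

# Strategy (a): detect, then remove.
outliers = mad_outliers(values)
cleaned = [v for v in values if v not in outliers]
print(outliers, mean(cleaned))

# Strategy (b): use an estimate that is insensitive to the outlier.
print(mean(values), median(values))
```
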

2. Scaling, encoding, and selecting features – Data preprocessing includes several
steps such as variable scaling and different types of encoding. For example, one
feature with the range [0, 1] and the other with the range [−100, 1000] will not have
the same weights in the applied technique; they will also influence the final
data-mining results differently. Therefore, it is recommended to scale them and bring both
features to the same weight for further analysis. Also, application-specific encoding
methods usually achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modeling.

These two classes of preprocessing tasks are only illustrative examples of a large spectrum of
preprocessing activities in a data-mining process. Data-preprocessing steps should
not be considered completely independent from other data-mining phases. In every
iteration of the data-mining process, all activities, together, could define new and
improved data sets for subsequent iterations. Generally, a good preprocessing
method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.
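
The scaling recommendation can be illustrated with min-max scaling, one of several possible
scaling methods, which maps each feature onto the range [0, 1]. The function name and sample
values are hypothetical; the second feature uses the [−100, 1000] range from the example in the
text.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


feature_a = [0.0, 0.5, 1.0]           # already in [0, 1]
feature_b = [-100.0, 450.0, 1000.0]   # much wider range

scaled_a = min_max_scale(feature_a)
scaled_b = min_max_scale(feature_b)
print(scaled_a)
print(scaled_b)
```

After scaling, both features span the same range, so neither dominates a distance-based
technique merely because of its units.
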

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main task in
this phase. This process is not straightforward; usually, in practice, the implementation is based
on several models, and selecting the best one is an additional task. The basic principles of
learning and discovery from data are given in Chapter 4 of this book. Later, Chapters 5 through
13 explain and analyze specific techniques that are applied to perform a successful learning
process from data and to develop an appropriate model.
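
Selecting the best of several models can be sketched as follows, assuming that candidates are
compared by mean squared error on held-out validation data. The two candidate models (a constant
and a least-squares line), the data values, and all names are hypothetical.

```python
def fit_constant(xs, ys):
    """Model that always predicts the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: estimate on one part, compare on another.
train_x, train_y = [0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.0]
valid_x, valid_y = [5, 6], [11.1, 12.9]

candidates = {"constant": fit_constant, "line": fit_line}
errors = {
    name: mse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best = min(errors, key=errors.get)
print(best, errors)
```
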

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models need to
be interpretable in order to be useful because humans are not likely to base their decisions on
complex "black-box" models. Note that the goals of accuracy of the model and accuracy of its
interpretation are somewhat contradictory. Usually, simple models are more interpretable, but
they are also less accurate. Modern data-mining methods are expected to yield highly accurate
results using high-dimensional models. The problem of interpreting these models, also very
important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric results.
Users cannot understand, summarize, interpret, and use such results for successful decision
making.

DEPT OF CSE & IT VSSUT, Burla

Classification of Data mining Systems:

The data mining system can be classified according to the following criteria:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization and Other Disciplines
Some Other Classification Criteria:
1. Classification according to kind of databases mined
2. Classification according to kind of knowledge mined
3. Classification according to kinds of techniques utilized
4. Classification according to applications adapted
1. Classification according to kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria, such as data models or types of data,
and the data mining system can be classified accordingly. For example, if we classify the
database according to data model, then we may have a relational, transactional,
object-relational, or data warehouse mining system.

2. Classification according to kind of knowledge mined

We can classify a data mining system according to the kind of knowledge mined. This means
that data mining systems are classified on the basis of functionalities such as:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
3. Classification according to kinds of techniques utilized
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.

4. Classification according to applications adapted

We can classify a data mining system according to the applications adapted. These applications are
as follows:
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
