
Advanced Data Mining 221ECS001 Module-1 Part-1

Module - 1

Data Mining and Knowledge Discovery

Desirable properties of Discovered Knowledge - Knowledge representation, Data Mining Functionalities, Motivation and Importance of Data Mining, Classification of Data Mining Systems, Integration of a data mining system with a Database or Data Warehouse System, Classification, Clustering, Regression, Data Pre-Processing: Data Cleaning, Data Integration and Transformation, normalization, standardization, Data Reduction, Feature vector representation, importance of feature engineering in machine learning; forward selection and backward selection for feature selection; curse of dimensionality; data imputation techniques; No Free Lunch theorem in the context of machine learning, Data Discretization and Concept Hierarchy Generation.

Data Mining Concepts and Applications


● Data mining is the process of extracting knowledge or insights from large amounts of
data using various statistical and computational techniques. The data can be structured,
semi-structured, or unstructured, and can be stored in various forms such as databases
and data warehouses.
● Data mining is the process of searching and analyzing a large batch of raw data in order
to identify patterns and extract useful information.
● The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Application Areas
● Financial Data Analysis: Financial data in the banking and financial industry is generally
reliable and of high quality, which facilitates systematic data analysis and data mining.
● Retail Industry: Collects large amounts of data on sales, customer purchasing history,
goods transportation, consumption, and services.
● Telecommunication Industry: With the development of new computer and
communication technologies, the telecommunication industry is rapidly expanding.
● Biological Data Analysis: Tremendous growth in areas of biology such as genomics,
proteomics, functional genomics, and biomedical research has generated large volumes
of biological data to mine.
● Intrusion Detection: The increased usage of the internet and the availability of tools and
tricks for intruding into and attacking networks have made intrusion detection an
important application of data mining.

Knowledge discovery (mining) in databases (KDD)


● Knowledge discovery in databases (KDD) is the process of finding useful information
and patterns in data. Data mining is the use of algorithms to extract the information and
patterns derived by the KDD process. KDD refers to a process consisting of many steps,
while data mining is only one of these steps. KDD is an interactive and iterative
process.
● Data cleaning: also known as data cleansing, it is a phase in which noisy data and
irrelevant data are removed from the collection.
● Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
● Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
● Data transformation: also known as data consolidation, it is a phase in which the selected
data is transformed into forms appropriate for the mining procedure.
● Data mining: it is the crucial step in which clever techniques are applied to extract
potentially useful patterns.
● Pattern evaluation: in this step, strictly interesting patterns representing knowledge are
identified based on given measures.
● Knowledge representation: is the final phase in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to help
users understand and interpret the data mining results.
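
As an illustration of how these steps fit together, here is a minimal sketch in Python, assuming pandas and scikit-learn are available; the file name ("sales.csv") and column names ("age", "income") are hypothetical placeholders, and clustering stands in for whichever mining technique is chosen.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans

    # Data cleaning: remove records with missing values.
    data = pd.read_csv("sales.csv").dropna()      # hypothetical source file

    # Data selection: keep only the attributes relevant to the analysis.
    selected = data[["age", "income"]]            # hypothetical columns

    # Data transformation: scale values into a form suited to mining.
    scaled = MinMaxScaler().fit_transform(selected)

    # Data mining: apply a technique (here, clustering) to extract patterns.
    model = KMeans(n_clusters=3, n_init=10).fit(scaled)

    # Pattern evaluation / knowledge representation: inspect the result.
    print(model.cluster_centers_)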

Data Mining Process:


The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis.
2. Collect the data.
3. Preprocess the data.
4. Estimate the model.
5. Interpret the model and draw conclusions.

Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.

● Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
● Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
● Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
whereas a data warehouse can hold all addresses associated with a customer.
● Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in
a data warehouse should never be altered.

Desirable properties of Discovered Knowledge - Knowledge representation

Knowledge representation is the presentation of knowledge to the user for visualization in terms
of trees, tables, rules, graphs, charts, matrices, etc.
There are mainly six forms of knowledge from data mining:
● Rule
Rule knowledge consists of a precondition and a conclusion. The precondition is
composed of “and”/“or” operations on the values of field terms (attributes); the
conclusion gives the value or class of the decision-making field term (attribute).
Taking one corporation’s notebook computer sales data (Table 1, not reproduced here) as
an example, rule knowledge of this form can be obtained by data mining methods.
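
Since Table 1 and its derived rules are not reproduced, the following hypothetical rules merely illustrate the precondition/conclusion form such knowledge takes (the attribute names and values are invented):

    IF (price = "low" AND RAM = "16GB") OR brand = "A" THEN sales = "high"
    IF price = "high" AND brand = "B" THEN sales = "low"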

● Decision tree
A decision tree is a treelike graph that represents a series of judgment procedures
leading to a decision. A decision tree consists of decision nodes, branches, and leaves.
The top node is the root, which is the start of the decision tree. Traversing the tree
from top to bottom, each node poses a question, and different answers lead down
different branches until a leaf is reached. Each leaf corresponds to a different
classification. (The decision tree figure corresponding to Table 1 is not reproduced
here.)
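
With the figure unavailable, this minimal sketch, assuming scikit-learn, shows how such a tree could be learned and printed; the encoded attributes (price_level, ram_gb) and class labels are hypothetical stand-ins for the Table 1 data.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical encoded rows: [price_level, ram_gb] -> sales class
    X = [[0, 8], [0, 16], [1, 8], [1, 16]]
    y = ["low", "high", "low", "high"]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["price_level", "ram_gb"]))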

● Knowledge base (concentrated data)
By calculating the importance of each field term (attribute) in the database, unimportant
fields can be deleted and records can be merged according to defined principles. The
concentrated data acquired from the condensed database is called a knowledge base.
● Network weight
The neural network method learns from training samples to acquire knowledge in the
form of network connection weights and node thresholds, usually expressed as matrices
and vectors.
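
A minimal sketch of this representation, assuming NumPy (the numbers are hypothetical): the learned knowledge of a single-layer network is just a weight matrix W and a threshold (bias) vector b.

    import numpy as np

    W = np.array([[0.4, -0.2, 0.7],
                  [0.1,  0.9, -0.5]])   # connection weights: 2 inputs -> 3 nodes
    b = np.array([0.3, -0.1, 0.2])      # node thresholds

    x = np.array([1.0, 0.5])            # one input sample
    print(x @ W + b)                    # weighted sums expressed via the matrix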
● Formula
Science and engineering databases usually store large amounts of experimental data in
which regularities are implicit. Formula discovery algorithms can find correlations
among the variables and express them as formulas.
● Case
A case is a complete event that people have experienced. Solutions to problems, or the
results of their processing, from past cases can be reused, with appropriate revision, to
solve new problems. Case knowledge is expressed as a triple:
<problem description, solution description, effects description>
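
A minimal sketch of this triple as a Python data structure (the field contents are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Case:
        problem_description: str
        solution_description: str
        effects_description: str

    # A past case that can be revised to solve a similar new problem.
    old_case = Case("printer offline", "restart the print spooler", "issue resolved")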

Other Output knowledge representation

• Association rules
Association rules are "if-then" statements that help to show the probability of relationships
between data items within large data sets in various types of databases.


• Classification rules
Rule-based classification in data mining is a technique in which class decisions are made based
on various “if...then...else” rules. Thus, it is a classification type governed by a set of
IF-THEN rules. An IF-THEN rule is written as:
“IF condition THEN conclusion.”
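
A minimal sketch of such rule-based classification in Python; the attributes, thresholds, and class labels are hypothetical:

    def classify(record):
        # IF age < 25 AND student THEN buys_computer
        if record["age"] < 25 and record["student"]:
            return "buys_computer"
        # IF income = high THEN buys_computer
        if record["income"] == "high":
            return "buys_computer"
        # ELSE (default rule)
        return "does_not_buy"

    print(classify({"age": 22, "student": True, "income": "low"}))  # buys_computer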

• Prediction schemes:
Nearest neighbor
Bayesian classification
Neural networks
Regression
• Clusters:
Type of grouping: partitions/hierarchical
Grouping or describing: agglomerative/conceptual
Type of descriptions: statistical/structural

Data Mining Functionality

1. Class/Concept Descriptions:
Data can be associated with classes or concepts. It can be useful to describe individual classes
and concepts in summarized, concise, and yet precise terms. Such descriptions are called
class/concept descriptions.

Data Characterization: This refers to a summary of the general characteristics or features of the
class under study. The output of data characterization can be presented in various forms,
including pie charts, bar charts, curves, and multidimensional data cubes.
Example: to study the characteristics of software products whose sales increased by 10% in the
previous year, or to summarize the characteristics of customers who spend more than $5000 a
year at AllElectronics. The result is a general profile of those customers, such as that they are
40-50 years old, employed, and have an excellent credit rating.

Data Discrimination: It is a comparison of the general features of the target class data objects
against the general features of objects from one or more contrasting classes.
Example: we may want to compare two groups of customers: those who shop for computer
products regularly and those who rarely shop for such products (less than three times a year).
The resulting description provides a general comparative profile of those customers, such as:
80% of the customers who frequently purchase computer products are between 20 and 40 years
old and have a university degree, while 60% of the customers who infrequently buy such
products are either seniors or youths and have no university degree.


2. Mining Frequent Patterns, Associations, and Correlations:


Frequent patterns are patterns that occur frequently in the data. Several kinds of frequent
patterns can be observed in a dataset:
Frequent itemset: a set of items that frequently appear together, e.g., milk and sugar.
Frequent subsequence: a pattern series that occurs regularly, such as purchasing a phone
followed by a back cover.
Frequent substructure: structural forms such as trees and graphs that may be combined with an
itemset or subsequence.

3. Association Analysis:
The process involves uncovering the relationship between data and deciding the rules of the
association. It is a way of discovering the relationship between various items.
Example: Suppose we want to know which items are frequently purchased together. An example
for such a rule mined from a transactional database is,

buys (X, “computer”) ⇒ buys (X, “software”) [support = 1%, confidence = 50%],

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all the transactions under analysis show that computer and software are
purchased together. This association rule involves a single attribute or predicate (i.e., buys) that
repeats.
It analyzes the set of items that generally occur together in a transactional dataset. It is also
known as Market Basket Analysis for its wide use in retail sales. Two parameters are used for
determining the association rules:

○ Support identifies how frequently the item set appears in the database.

○ Confidence is the conditional probability that an item occurs when another item occurs in
a transaction.
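
A minimal sketch of how these two measures are computed for the rule above, using a hypothetical list of transactions in Python:

    transactions = [
        {"computer", "software"},
        {"computer"},
        {"software", "printer"},
        {"computer", "software", "printer"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"computer", "software"} <= t)
    computer = sum(1 for t in transactions if "computer" in t)

    support = both / n            # fraction of all transactions containing both items
    confidence = both / computer  # fraction of computer buyers who also buy software
    print(support, confidence)    # 0.5 and 0.666...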

4. Correlation Analysis: Correlation is a mathematical technique that can show whether and
how strongly pairs of attributes are related to each other.
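
A minimal sketch of correlation analysis with NumPy, computing the Pearson correlation coefficient between two hypothetical attributes:

    import numpy as np

    age    = np.array([23, 35, 47, 52, 61])
    income = np.array([30, 48, 55, 70, 82])   # hypothetical values

    r = np.corrcoef(age, income)[0, 1]        # near +1: strong positive relationship
    print(r)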

5. Cluster Analysis
In image processing, pattern recognition, and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined; they are derived
from the data itself. Similar data objects are grouped together, with the difference being that a
class label is not known in advance. Clustering algorithms group data based on similarities and
dissimilarities among features.
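
A minimal clustering sketch, assuming scikit-learn; no class labels are supplied, and KMeans groups the hypothetical 2-D points by similarity alone:

    from sklearn.cluster import KMeans

    points = [[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]]

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
    print(labels)   # e.g. [1 1 1 0 0 0]: two groups discovered from the data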

6. Outlier Analysis
Outlier analysis is important for understanding the quality of data. If there are too many
outliers, the data cannot be trusted and reliable patterns cannot be drawn from it.
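
A minimal outlier-analysis sketch using a simple z-score rule (flag values more than 2 standard deviations from the mean); the data and the threshold are hypothetical:

    import numpy as np

    values = np.array([10, 12, 11, 13, 12, 95])   # 95 is an obvious outlier
    z = (values - values.mean()) / values.std()
    print(values[np.abs(z) > 2])                  # [95]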

7. Evolution and Deviation Analysis


Evolution analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data, helping to characterize, classify,
cluster, or discriminate time-related data.


Motivation and Importance of Data Mining

Data mining tools perform data analysis and may uncover important data patterns, contributing
greatly to business strategies, knowledge bases, and scientific and medical research. The
widening gap between data and information calls for the systematic development of data mining
tools that will turn data tombs into “golden nuggets” of knowledge.
Data can be stored in several types of databases and data repositories. One data repository
architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data
sources organized under a unified schema at a single site to support management decision
making. Data warehouse technology involves data cleaning, data integration, and online
analytical processing (OLAP), that is, analysis techniques with functionalities including
summarization, consolidation, and aggregation, as well as the ability to view data from multiple
angles.
While data can help organizations achieve their goals, it needs to be mined first. Raw data
cannot be used for any purpose. Data mining ensures that useful information can be derived
from raw data and used to benefit both the organization and its customers. Some of the areas
where data mining helps are detecting fraud, spam filtering, managing risks, and cybersecurity.
In the marketing sector, it helps in forecasting customer behavior. In the banking sector, it can
help in determining fraudulent transactions. Data mining is not only beneficial for organizations.
From governments to healthcare, it is used everywhere.


Classification of Data Mining Systems

Classifying data mining systems helps users understand them and match their requirements to
appropriate systems.
1. Classification according to the form of mined databases

A data mining system can be classified according to the type of databases mined. Database
systems can be distinguished according to various criteria (such as the data models, or the types
of data or applications involved), and each may require its own data mining technique. Data
mining systems can therefore be categorized accordingly.

2. Classification according to the forms of information derived

Data mining systems can be classified by the kinds of knowledge they mine, that is, based on
data mining functionalities such as characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A robust
data mining framework generally offers several and/or integrated functionalities for data
mining.

In other words, a data mining system may be categorized on the basis of the kind of knowledge
extracted, including:

● Characterization
● Discrimination
● Association and correlation analysis
● Classification
● Prediction
● Outlier analysis
● Evolution analysis

3. Classification by the kind of techniques used


This classification is based on the degree of user interaction involved or the method of data
analysis employed: for example, machine learning, statistics, visualization, pattern recognition,
neural networks, and database-oriented or data-warehouse-oriented techniques.
4. Classification according to the applications adapted
Data mining systems can also be classified according to the applications they are adapted to.
For instance, data mining systems may be tailored specifically for finance, telecommunications,
DNA analysis, stock markets, e-mail, and so on. Different applications often require
application-specific methods; a generic, all-purpose data mining system may therefore not fit
domain-specific mining tasks.


These applications are as follows:

● Finance
● Telecommunications
● DNA
● Stock Markets
● E-mail

Integration of a data mining system with a Database or Data Warehouse System

It is a strategy that integrates data from several sources to make it available to users in a single,
uniform view. The communicating sources may include multiple databases, data cubes, or flat
files. The goal of data integration is to make it easier to access and analyze data that is spread
across multiple systems or platforms, in order to gain a more complete and accurate
understanding of the data.

Data Integration Approaches


The possible coupling schemes between a data mining (DM) system and a database or data warehouse (DB/DW) system are as follows:
Tight Coupling
It is the process of using ETL (Extraction, Transformation, and Loading) to combine data
from various sources into a single physical location.
Loose Coupling
With loose coupling, the data is kept in the actual source databases. This approach provides an
interface that takes a query from the user, transforms it into a format the source database can
understand, and then sends the query directly to the source databases to obtain the result.
No coupling
No coupling means that a DM system will not utilize any function of a DB or DW system. It may
fetch data from a particular source (such as a file system), process data using some data mining
algorithms, and then store the mining results in another file.
Semitight coupling
Semitight coupling means that, besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives (identified by analyzing frequently
encountered data mining functions) are provided in the DB/DW system. These primitives can
include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation
of essential statistical measures such as sum, count, max, min, and standard deviation.

Prepared by : Sandra Raju , Asst. Professor, Department of CSE, MLMCE
