Knowledge Management UNIT-3 Notes

Subject: Knowledge Management

Code: BCA-604

Unit-3 Notes
Data Mining:
Data mining is the process of extracting useful information from large sets of data. In other words, it is the procedure of discovering knowledge from data. The information or knowledge extracted can be used in applications such as −

● Market Analysis
● Fraud Detection
● Customer Retention
● Production Control
● Science Exploration

Data Mining Applications

Data mining is highly useful in the following domains −

● Market Analysis and Management


● Corporate Analysis & Risk Management
● Fraud Detection
Apart from these, data mining is also applied in production control, customer retention, science exploration, sports, astronomy, and Web usage analysis.

Market Analysis and Management


Listed below are the various fields of marketing where data mining is used −
● Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
● Identifying Customer Requirements − Data mining helps identify the best products for different customers. It uses prediction to find the factors that may attract new customers.
● Cross-Market Analysis − Data mining identifies associations and correlations between product sales.
● Target Marketing − Data mining helps find clusters of model customers who share the same characteristics, such as interests, spending habits, and income.
● Determining Customer Purchasing Patterns − Data mining helps determine customers' purchasing patterns.
● Providing Summary Information − Data mining provides various multidimensional summary reports.
Corporate Analysis and Risk Management

Data mining is used in the following fields of the Corporate Sector −


● Finance Planning and Asset Evaluation − This involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
● Resource Planning − This involves summarizing and comparing resources and spending.
● Competition − This involves monitoring competitors and market directions.

Fraud Detection
Data mining is also used in credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps analyze the call's destination, duration, time of day or week, etc. It also flags patterns that deviate from expected norms.

Data Mining Task:


Data mining tasks describe the kinds of patterns that can be mined. Based on the kind of data to be mined, data mining functions fall into two categories −

● Classification and Prediction
● Descriptive

Classification and Prediction


Classification is the process of finding a model that describes data classes or concepts. The purpose is to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data and can be presented in the following forms −

● Classification (IF-THEN) Rules
● Decision Trees
● Mathematical Formulae
● Neural Networks
The list of functions involved in these processes is as follows −
● Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are known.
● Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
● Outlier Analysis − Outliers are data objects that do not comply with the general behavior or model of the available data.
● Evolution Analysis − Evolution analysis refers to describing and modeling regularities or trends for objects whose behavior changes over time.
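The IF-THEN rule form above can be sketched as a small Python function; the attributes, thresholds, and class labels below are hypothetical illustrations, not taken from any real model:

```python
# Minimal sketch of a classifier expressed as IF-THEN rules.
# The attributes (income, age), thresholds, and labels are assumptions.

def classify_customer(income, age):
    """Predict a class label for a customer whose label is unknown."""
    if income > 50000 and age >= 30:
        return "high_value"
    if income > 50000:
        return "potential"
    return "low_value"

print(classify_customer(60000, 35))  # high_value
print(classify_customer(20000, 25))  # low_value
```

In practice such rules are not written by hand but derived from training data, e.g. by a decision-tree learner.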

Data Mining Techniques:

Data mining is the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in large data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates both analysis and prediction.

In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification:

This technique is used to obtain important and relevant information about data and metadata, and helps classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks by the type of data source mined:
This classification is based on the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks by the database involved:
This classification is based on the data model involved, for example, object-oriented databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks by the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities.
iv. Classification of data mining frameworks by the data mining techniques used:
This classification is based on the data analysis approach used, such as neural networks, machine learning, genetic algorithms, visualization, statistics, or a data warehouse-oriented or database-oriented approach.

2. Clustering:

● Clustering is a division of data into groups of related objects. Describing the data by a few clusters loses certain fine details but achieves simplification: the data is modeled by its clusters.
● Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept.
● From a practical point of view, clustering plays an extraordinary role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.
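As an illustration of the idea, a minimal two-cluster k-means on one-dimensional data can be sketched in pure Python; the data values are invented, and real applications would use a library implementation:

```python
# Minimal sketch of two-cluster k-means on 1-D data (illustrative only).

def kmeans_1d(points, iters=10):
    """Find two cluster centers, initialized at the data range extremes."""
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:  # assignment step: nearest center
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        # update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]
print(kmeans_1d(data))  # two centers, near 1.0 and 10.07
```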

3. Regression:

● Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value of a dependent variable from one or more other variables.
● Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.
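A minimal sketch of regression on a toy data set, fitting a straight line y = a·x + b by ordinary least squares (the data points are illustrative and chosen to lie exactly on a line):

```python
# Minimal sketch of simple linear regression via ordinary least squares.

def fit_line(xs, ys):
    """Fit y = a*x + b by least squares; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]        # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)              # 2.0 1.0
print(a * 5 + b)         # predict y at x = 5 -> 11.0
```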

4. Association Rules:
This data mining technique helps discover a link between two or more items and finds hidden patterns in the data set.

● Association rules are if-then statements that show the likelihood of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to uncover sales correlations in transactional data or associations in medical data sets.
● The algorithm works on a collection of records, for example, a list of grocery items bought over the last six months. It calculates the percentage of items being purchased together.
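The support and confidence of such if-then rules can be sketched as follows; the grocery baskets are invented for illustration:

```python
# Minimal sketch of association-rule measures (support, confidence)
# over a tiny invented set of grocery baskets.

baskets = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread", "milk"}, {"butter", "milk"}, {"bread", "butter"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Likelihood of `consequent` given `antecedent` (rule IF A THEN C)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))        # 3 of 5 baskets -> 0.6
print(confidence({"bread"}, {"butter"}))   # about 0.75
```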

5. Outlier Detection:

This data mining technique observes data items in the data set that do not match an expected pattern or expected behavior.

● This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.
● An outlier is a data point that diverges greatly from the rest of the dataset. The majority of real-world datasets contain outliers.
● Outlier detection plays a significant role in the data mining field. It is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, detecting outlying values in wireless sensor network data, etc.
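A simple z-score rule is one common way to flag such diverging points; this sketch uses made-up data and an assumed threshold of 2 standard deviations:

```python
# Minimal sketch of outlier detection using z-scores.
from statistics import mean, stdev

def zscore_outliers(data, threshold=2.0):
    """Return values lying more than `threshold` std. devs from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

readings = [10, 11, 9, 10, 12, 11, 10, 55]
print(zscore_outliers(readings))  # [55]
```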

6. Sequential Patterns:

● Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns.
● It involves finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
● In other words, this technique helps discover or recognize similar patterns in transaction data over time.
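A minimal sketch of measuring the occurrence frequency of a (not necessarily contiguous) subsequence across transaction sequences; the session data is invented:

```python
# Minimal sketch of sequential-pattern counting over invented sessions.

def contains_subsequence(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

transactions = [
    ["login", "browse", "add_to_cart", "checkout"],
    ["login", "browse", "logout"],
    ["browse", "add_to_cart", "checkout"],
]
pattern = ["browse", "checkout"]
freq = sum(contains_subsequence(t, pattern) for t in transactions)
print(freq)  # pattern occurs in 2 of 3 sequences
```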

7. Prediction:

● Prediction uses a combination of other data mining techniques, such as trend analysis, clustering, classification, etc.
● It analyzes past events or instances in the right sequence to predict a future event.

Data Mining Implementation Process:


● Many different sectors take advantage of data mining to boost their business efficiency, including manufacturing, chemicals, marketing, aerospace, etc. Therefore, the need for a standard data mining process grew.
● Data mining techniques must be reliable and repeatable by company individuals with little or no knowledge of the data mining context.
● As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced in 1996, after going through many workshops and contributions from more than 300 organizations.

Data mining is described as the process of finding hidden, valuable patterns by evaluating the huge quantity of information stored in data warehouses, using multiple techniques such as artificial intelligence (AI), machine learning, and statistics.

KDD- Knowledge Discovery in Databases:

● The term KDD stands for Knowledge Discovery in Databases. It is a field of interest to researchers in various areas, including artificial intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert systems, and data visualization.
● The main objective of the KDD process is to extract information from data in the context of large databases. It does this by using data mining algorithms to identify what is deemed knowledge.
● Data mining is the core of the KDD process, involving algorithms that explore the data, develop models, and find previously unknown patterns. The model is used to extract knowledge from the data, analyze the data, and make predictions.

The KDD Process:

● The knowledge discovery process (illustrated in the given figure) is iterative and interactive, comprising nine steps.
● The process is iterative at each stage, implying that moving back to previous steps might be required.
● The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge.

1. Building up an understanding of the application domain

● This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions (transformation, algorithms, representation, etc.).
● The individuals in charge of a KDD project need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur (including relevant prior knowledge).

2. Choosing and creating a data set on which discovery will be performed

● Once the objectives are defined, the data that will be used for the knowledge discovery process should be determined.
● This includes discovering what data is accessible, obtaining important data, and then integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process.
● This process is important because data mining learns and discovers from the accessible data.

3. Preprocessing and cleansing

● In this step, data reliability is improved. It incorporates data cleaning, for example, handling missing values and removing noise or outliers.
● It might involve complex statistical techniques or the use of a data mining algorithm in this context.
● For example, when one suspects that a specific attribute is unreliable or has many missing values, this attribute could become the target of a supervised data mining algorithm.
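One common cleansing step, mean imputation of missing values, can be sketched as follows (the values are illustrative):

```python
# Minimal sketch of handling missing values by mean imputation.
from statistics import mean

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_missing([4.0, None, 6.0, 5.0, None]))
# -> [4.0, 5.0, 6.0, 5.0, 5.0]
```

Real projects may instead use median imputation, model-based imputation, or drop unreliable records, depending on the data.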

4. Data Transformation

● In this stage, appropriate data for data mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformation).
● This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio of attributes may often be the most significant factor rather than each one by itself.
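Discretization of a numerical attribute can be sketched as follows; the age bins and labels are illustrative assumptions:

```python
# Minimal sketch of attribute transformation: discretizing a numeric
# attribute into labelled intervals. Bin edges and labels are assumptions.

def discretize(value, bins):
    """Map a numeric value to the label of the first bin it fits in."""
    for upper, label in bins:
        if value <= upper:
            return label
    return bins[-1][1]  # fallback: last label

age_bins = [(29, "young"), (59, "middle"), (200, "senior")]
print([discretize(a, age_bins) for a in [22, 45, 70]])
# -> ['young', 'middle', 'senior']
```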

5. Prediction and description

● We are now prepared to decide which kind of data mining to use, for example, classification, regression, clustering, etc.
● This mainly depends on the KDD objectives, and also on the previous steps.
● There are two significant objectives in data mining: the first is prediction, and the second is description.
● Prediction is usually referred to as supervised data mining, while descriptive data mining incorporates the unsupervised and visualization aspects of data mining.

6. Selecting the Data Mining algorithm


● This stage involves choosing a specific technique for searching for patterns, possibly from multiple inducers. For example, when weighing precision against understandability, the former is better with neural networks, while the latter is better with decision trees.
● Thus, this methodology attempts to understand the conditions under which a data mining algorithm is most suitable. Each algorithm has parameters and learning strategies, such as ten-fold cross-validation or another division into training and testing sets.
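The ten-fold cross-validation mentioned above splits the data indices into ten train/test partitions; a minimal sketch:

```python
# Minimal sketch of k-fold cross-validation index splitting.

def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs covering n samples in k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(20, k=10))
print(len(folds))     # 10 folds
print(folds[0][1])    # first test fold: [0, 1]
```

Each fold in turn serves as the test set while the model is trained on the remaining nine folds, and the k scores are averaged.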

7. Utilizing the Data Mining algorithm

● At last, the implementation of the data mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained.
● For example, by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.

8. Evaluation

● In this step, we assess and interpret the mined patterns and rules, and their reliability, with respect to the objectives characterized in the first step.
● Here we consider the preprocessing steps in terms of their impact on the data mining algorithm's results, for example, adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model.
● In this step, the discovered knowledge is also recorded for further use. The last step is its use, along with the overall feedback and discovery results acquired by data mining.

9. Using the discovered knowledge

● Now we are prepared to incorporate the knowledge into another system for further action. The knowledge becomes effective in the sense that we may make changes to the system and measure the impact.
● The success of this step determines the effectiveness of the whole KDD process.

Data Mining Architecture:

The significant components of a data mining system are the data source, data mining engine, data warehouse server, pattern evaluation module, graphical user interface, and knowledge base.
Data Source:

● The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents.
● You need a large amount of historical data for data mining to be successful. Organizations typically store such data in databases or data warehouses.
● Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data.
● Sometimes even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web, or the internet.

Different processes:

● Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected.
● As the information comes from various sources and in different formats, it can't be used directly for the data mining procedure, because the data may not be complete and accurate. So, the data first needs to be cleaned and unified.
● More information than needed will be collected from the various data sources, and only the data of interest has to be selected and passed to the server.
Database or Data Warehouse Server:

The database or data warehouse server contains the actual data, ready to be processed. The server is responsible for retrieving the relevant data based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring the interestingness of patterns using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.

Graphical User Interface:

● The graphical user interface (GUI) module communicates between the data mining
system and the user.
● This module helps the user to easily and efficiently use the system without knowing the
complexity of the process.
● This module cooperates with the data mining system when the user specifies a query or a
task and displays the results.

Knowledge Base:

● The knowledge base is helpful throughout the data mining process. It may be used to guide the search or to evaluate the interestingness of the resulting patterns.
● The knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process.
● The data mining engine may receive inputs from the knowledge base to make the result
more accurate and reliable.

Multi-Dimensional Data Model:

A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for item may contain the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2-D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in thousands).

Now suppose we want to view the sales data with a third dimension: for example, the data according to time and item, as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3-D data are shown in the table, represented as a series of 2-D tables. Conceptually, the same data may also be represented in the form of a 3-D data cube, as shown in the figure.
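Aggregating such a fact table over a subset of its dimensions (a roll-up) can be sketched as a simple grouping; the dimension names follow the example above, while the sales figures are invented:

```python
# Minimal sketch of rolling up a (time, item, location) fact table.
# The fact values (rupees sold, in thousands) are invented.
from collections import defaultdict

sales = {
    ("Q1", "phone", "Delhi"): 605, ("Q1", "phone", "Mumbai"): 680,
    ("Q2", "phone", "Delhi"): 680, ("Q1", "tv", "Delhi"): 825,
}

def roll_up(facts, keep):
    """Aggregate the measure, keeping only the dimensions named in `keep`."""
    dims = ("time", "item", "location")
    out = defaultdict(int)
    for key, value in facts.items():
        reduced = tuple(v for d, v in zip(dims, key) if d in keep)
        out[reduced] += value
    return dict(out)

print(roll_up(sales, {"location"}))  # total sales per city
```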

Data Cube:

● Data grouped or combined into multidimensional matrices is called a data cube. The data cube method has a few alternative names and variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
● The general idea of this approach is to materialize certain expensive computations that are frequently queried.
● For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping part alone; and so on.

● A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
● For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items, and the branches and locations at which the items were sold. Each dimension may have a table identified with it, known as a dimension table, which describes the dimension. For example, a dimension table for item may contain the attributes item_name, brand, and type.
● The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases, because not every cell in each dimension may have corresponding data in the database.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).

● The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
● The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, it is the total sales, or dollars sold, summarized over all four dimensions.
● The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
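The lattice of cuboids for a 4-D cube is simply the set of all subsets of the four dimensions, 2^4 = 16 cuboids in total; a minimal sketch:

```python
# Minimal sketch: enumerate the lattice of cuboids of a 4-D data cube.
from itertools import combinations

dims = ["time", "item", "location", "supplier"]
cuboids = [subset for r in range(len(dims) + 1)
           for subset in combinations(dims, r)]

print(len(cuboids))  # 16 cuboids in the lattice
print(cuboids[0])    # () -- the apex (0-D) cuboid
print(cuboids[-1])   # the base (4-D) cuboid with all dimensions
```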
