Knowledge Management UNIT-3 Notes
Code:-BCA-604
Unit-3 Notes
Data Mining:
Data Mining is defined as the process of extracting useful information from huge sets of data. In
other words, we can say that data mining is the procedure of mining knowledge from data. The
information or knowledge so extracted can be used for any of the following applications −
● Market Analysis
● Fraud Detection
● Customer Retention
● Production Control
● Science Exploration
Fraud Detection
Data mining is also used in fields such as credit card services and telecommunications to detect
fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of
the call, the time of day or week, etc. It also analyzes patterns that deviate from expected norms.
Data mining includes the utilization of refined data analysis tools to find previously unknown,
valid patterns and relationships in huge data sets. These tools can incorporate statistical models,
machine learning techniques, and mathematical algorithms, such as neural networks or decision
trees. Thus, data mining incorporates both analysis and prediction.
In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data into different classes.
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining
functionalities.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, or data warehouse-oriented
or database-oriented approaches.
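The classification idea above can be sketched with a minimal nearest-neighbour classifier: a new record is assigned the class of the most similar labelled record. The training records and class labels below are made-up examples, not data from the notes.

```python
# Minimal 1-nearest-neighbour classifier (illustrative sketch).

def classify(record, training):
    """Return the class label of the training record nearest to `record`."""
    def distance(a, b):
        # Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = min(training, key=lambda item: distance(record, item[0]))
    return nearest[1]

# Hypothetical training set: (features, class label).
training = [
    ((1.0, 1.0), "low-spender"),
    ((1.2, 0.8), "low-spender"),
    ((8.0, 9.0), "high-spender"),
    ((9.0, 8.5), "high-spender"),
]

print(classify((1.1, 0.9), training))  # -> low-spender
print(classify((8.5, 9.2), training))  # -> high-spender
```

Real systems would use richer models (decision trees, neural networks), but the assign-to-a-class step is the same.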
2. Clustering:
This data mining technique divides data into groups (clusters) such that objects within a cluster
are similar to one another and dissimilar to objects in other clusters. It helps to identify the
natural groupings present in the data.
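A common clustering algorithm is k-means, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its cluster. The points and the choice of k = 2 below are illustrative assumptions.

```python
# Minimal k-means sketch (k = 2) on 2-D points.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2
                                    + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # two centroids, one near each natural group
```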
3. Regression:
● Regression analysis is a data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to
determine the likelihood of a specific variable.
● Regression is primarily a form of planning and modeling. For example, we might use it to
project certain costs, depending on other factors such as availability, consumer demand,
and competition. Primarily, it gives the exact relationship between two or more variables
in the given data set.
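The simplest form is linear regression, which fits a straight line y = a·x + b by least squares. The demand/cost numbers below are made up purely for illustration.

```python
# Simple linear regression (least squares) between two variables.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

demand = [10, 20, 30, 40]   # hypothetical units demanded
cost = [15, 25, 35, 45]     # hypothetical projected cost
a, b = fit_line(demand, cost)
print(a, b)  # perfectly linear toy data: slope 1.0, intercept 5.0
```

Once fitted, the line can be used to project a cost for a demand level not in the data (e.g. a = 1, b = 5 predicts cost 55 for demand 50).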
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.
● Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to find sales
correlations in transactional data or in medical data sets.
● The way the algorithm works is that you have various data, for example, a list of grocery
items that you have been buying for the last six months. It calculates the percentage of
items being purchased together.
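The two basic measures behind association rules are support (how often an itemset appears) and confidence (how often the "then" part follows the "if" part). The grocery transactions below are a made-up example.

```python
# Support and confidence of an if-then association rule.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule: if `antecedent` then `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of 3 bread baskets -> 0.666...
```

An algorithm such as Apriori generates all itemsets whose support exceeds a threshold, then forms rules whose confidence is high enough.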
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data set that
do not match an expected pattern or expected behavior.
● This technique may be used in various domains such as intrusion detection, fraud
detection, etc. It is also known as Outlier Analysis or Outlier Mining.
● An outlier is a data point that diverges too much from the rest of the dataset. Most
real-world datasets contain outliers.
● Outlier detection plays a significant role in the data mining field. It is valuable in
numerous fields such as network intrusion identification, credit or debit card fraud
detection, detecting outliers in wireless sensor network data, etc.
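A simple way to flag outliers is the z-score rule: mark any value more than a chosen number of standard deviations from the mean. The sensor readings and the threshold of 2 below are illustrative.

```python
# Z-score outlier detection: flag values far from the mean.

def outliers(values, threshold=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    # A value is an outlier if its z-score exceeds the threshold.
    return [v for v in values if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 deviates from the rest
print(outliers(readings))  # -> [95]
```

More robust methods (median-based scores, distance- or density-based detection) follow the same idea of measuring how far a point diverges from the bulk of the data.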
6. Sequential Patterns:
● Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns.
● It comprises finding interesting subsequences in a set of sequences, where the value of
a sequence can be measured in terms of different criteria such as length, occurrence
frequency, etc.
● In other words, this data mining technique helps to discover or recognize similar
patterns in transaction data over a period of time.
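A basic building block is testing whether a candidate pattern occurs as an in-order (not necessarily adjacent) subsequence of each customer's purchase sequence, and counting how often it does. The purchase sequences below are made up.

```python
# Counting how many sequences contain a pattern as an ordered subsequence.

def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # Each `item in it` advances the iterator, enforcing the order.
    return all(item in it for item in pattern)

sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
]

pattern = ["phone", "charger"]
freq = sum(contains(seq, pattern) for seq in sequences)
print(freq)  # the pattern occurs in 2 of the 3 sequences
```

Algorithms such as GSP or PrefixSpan use this notion of occurrence to find all patterns whose frequency exceeds a threshold.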
7. Prediction:
● Prediction uses a combination of other data mining techniques, such as trend analysis,
clustering, classification, etc.
● It analyzes past events or instances in the right sequence to predict a future event.
Data mining is described as a process of finding hidden, valuable information by evaluating the
huge quantity of data stored in data warehouses, using multiple data mining techniques such as
Artificial Intelligence (AI), machine learning, and statistics.
The KDD Process:
● The term KDD stands for Knowledge Discovery in Databases. It is a field of interest to
researchers in various fields, including artificial intelligence, machine learning, pattern
recognition, databases, statistics, knowledge acquisition for expert systems, and data
visualization.
● The main objective of the KDD process is to extract information from data in the context
of large databases. It does this by using Data Mining algorithms to identify what is
deemed knowledge.
● Data Mining is the root of the KDD procedure, involving the inference of algorithms that
investigate the data, develop the model, and find previously unknown patterns. The
model is used for extracting knowledge from the data, analyzing the data, and making
predictions from the data.
● The knowledge discovery process (illustrated in the given figure) is iterative and
interactive, and comprises nine steps.
● The process is iterative at each stage, implying that moving back to the previous actions
might be required.
● The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge.
1. Developing an understanding of the application domain
● This is the initial preliminary step. It sets the scene for understanding what should
be done with the various decisions (transformation, algorithms, representation, etc.).
● The individuals who are in charge of a KDD venture need to understand and characterize
the objectives of the end-user and the environment in which the knowledge discovery
process will occur (involves relevant prior knowledge).
2. Selecting and creating a data set on which discovery will be performed
● Once the objectives are defined, the data that will be utilized for the knowledge discovery
process should be determined.
● This incorporates discovering what data is accessible, obtaining important data, and
afterward integrating all the data for knowledge discovery into one set, including the
attributes that will be considered for the process.
● This process is important because Data Mining learns and discovers from the
accessible data.
3. Preprocessing and cleansing
● In this step, data reliability is improved. It incorporates data cleansing, for example,
handling missing values and removing noise or outliers.
● It might involve complex statistical techniques, or the use of a Data Mining algorithm in
this context.
● For example, when one suspects that a specific attribute is of insufficient reliability or has
many missing values, this attribute could become the target of a supervised Data Mining
algorithm.
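The cleansing step above can be sketched as a simple missing-value treatment: replace each missing entry of an attribute with the attribute's mean. The `ages` values are a made-up example.

```python
# Data-cleansing sketch: fill missing values (None) with the attribute mean.

def fill_missing_with_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    # Keep observed values; substitute the mean where data is missing.
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(fill_missing_with_mean(ages))  # -> [25, 30.0, 30, 35, 30.0]
```

More sophisticated options, as the notes mention, include predicting the missing attribute with a supervised algorithm trained on the complete records.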
4. Data Transformation
● In this stage, appropriate data for Data Mining is prepared and developed.
Techniques here incorporate dimension reduction (for example, feature selection and
extraction, and record sampling) and attribute transformation (for example, discretization
of numerical attributes and functional transformations).
● This step can be essential for the success of the entire KDD project, and it is typically
very project-specific. For example, in medical assessments, the quotient of attributes may
often be the most significant factor and not each one by itself.
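Discretization, one of the attribute transformations named above, maps a numerical attribute onto labelled bins. The income cut-offs below are illustrative assumptions.

```python
# Attribute transformation sketch: discretize a numerical attribute into bins.

def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs in ascending order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return "high"  # anything above the last cut-off

income_bins = [(20000, "low"), (50000, "medium")]
incomes = [12000, 34000, 90000]
print([discretize(v, income_bins) for v in incomes])  # -> ['low', 'medium', 'high']
```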
5. Choosing the appropriate Data Mining task
● We are now prepared to decide which kind of Data Mining task to use, for example,
classification, regression, clustering, etc.
● This mostly relies on the KDD objectives, and also on the previous steps.
● There are two significant objectives in Data Mining, the first one is a prediction, and the
second one is the description.
● Prediction is usually referred to as supervised Data Mining, while descriptive Data
Mining incorporates the unsupervised and visualization aspects of Data Mining.
7. Employing the Data Mining algorithm
● At last, the implementation of the Data Mining algorithm is reached. In this stage, we
may need to run the algorithm several times until a satisfying outcome is obtained.
● For example, by tuning the algorithm's control parameters, such as the minimum number
of instances in a single leaf of a decision tree.
8. Evaluation
● In this step, we assess and interpret the mined patterns and rules, and their reliability,
with respect to the objectives characterized in the first step.
● Here we consider the preprocessing steps with respect to their impact on the Data Mining
algorithm's results.
● For example, adding a feature in step 4 and repeating from there. This step focuses on the
comprehensibility and utility of the induced model.
● In this step, the identified knowledge is also recorded for further use.
9. Using the discovered knowledge
● The last step is the usage of, and overall feedback on, the discovery results acquired by
Data Mining.
● Now, we are prepared to incorporate the knowledge into another system for further
action. The knowledge becomes effective in the sense that we may make changes to the
system and measure the effects.
● The success of this step determines the effectiveness of the entire KDD process.
Data Mining Architecture:
The significant components of a data mining system are the data source, data warehouse server,
data mining engine, pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
● The actual source of data is the Database, data warehouse, World Wide Web (WWW),
text files, and other documents.
● You need a huge amount of historical data for data mining to be successful. Organizations
typically store data in databases or data warehouses.
● Data warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data.
● Sometimes, even plain text files or spreadsheets may contain information. Another
primary source of data is the World Wide Web or the internet.
Different processes:
● Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected.
● As the information comes from various sources and in different formats, it can't be used
directly for the data mining procedure, because the data may not be complete or
accurate. So, the data first needs to be cleaned and unified.
● More information than needed will be collected from various data sources, and only the
data of interest will have to be selected and passed to the server.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data, based on the user's data mining
request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several
modules for performing data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring the interestingness of a
pattern by using a threshold value. It collaborates with the data mining engine to focus the search
on interesting patterns.
Graphical User Interface:
● The graphical user interface (GUI) module communicates between the data mining
system and the user.
● This module helps the user to use the system easily and efficiently without knowing the
complexity of the process.
● This module cooperates with the data mining system when the user specifies a query or a
task and displays the results.
Knowledge Base:
● The knowledge base is helpful in the entire process of data mining. It may be used to
guide the search or to evaluate the interestingness of the resulting patterns.
● The knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process.
● The data mining engine may receive inputs from the knowledge base to make the result
more accurate and reliable.
Multidimensional Data Model:
A multidimensional model views data in the form of a data cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, and location. These dimensions allow the store to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimensional table, which describes the dimension further. For
example, a dimensional table for an item may contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an item sold).
The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the
data according to time and item, as well as location, is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
Data Cube:
● When data is grouped or combined in multidimensional matrices, the result is called a
Data Cube. The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
● The general idea of this approach is to materialize certain expensive computations that
are frequently queried.
● For example, a relation with the schema sales (part, supplier, customer, and sale-price)
can be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the corresponding
aggregate function values calculated by grouping part alone, etc.
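The eight materialized views described above can be sketched by grouping the sales rows on every subset of the three dimensions and summing the sale-price. The sample rows are made up for illustration.

```python
# Materializing the 2^3 = 8 views of sales(part, supplier, customer, sale-price):
# one total-sales aggregate per subset of the three dimensions.
from itertools import combinations

sales = [
    ("p1", "s1", "c1", 100),
    ("p1", "s2", "c1", 50),
    ("p2", "s1", "c2", 70),
]

dims = ("part", "supplier", "customer")

def view(group_by):
    """Total sale-price grouped by the chosen dimensions."""
    idx = [dims.index(d) for d in group_by]
    totals = {}
    for row in sales:
        key = tuple(row[i] for i in idx)
        totals[key] = totals.get(key, 0) + row[3]
    return totals

# All 8 views: psc, ps, pc, sc, p, s, c, and the empty (grand total) grouping.
views = {g: view(g) for r in range(3, -1, -1) for g in combinations(dims, r)}

print(views[("part",)])  # the 'p' view: totals grouped by part alone
print(views[()])         # the apex view: grand total over all rows
```

With this toy data, the `p` view gives 150 for p1 and 70 for p2, and the empty grouping gives the overall total of 220.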
● A data cube is created from a subset of attributes in the database. Specific attributes are
chosen to be measure attributes, i.e., the attributes whose values are of interest. Other
attributes are selected as dimensions or functional attributes. The measure attributes are
aggregated according to the dimensions.
● For example, XYZ may create a sales data warehouse to keep records of the store's sales
for the dimensions time, item, branch, and location. These dimensions enable the store to
keep track of things like monthly sales of items, and the branches and locations at which
the items were sold. Each dimension may have a table identified with it, known as a
dimensional table, which describes the dimension. For example, a dimension table for
items may contain the attributes item_name, brand, and type.
● Data cube method is an interesting technique with many applications. Data cubes could
be sparse in many cases because not every cell in each dimension may have
corresponding data in the database.
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
● The figure shows a 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure displayed is dollars_sold (in
thousands).
● The topmost 0-D cuboid, which holds the highest level of summarization, is known as
the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over
all four dimensions.
● The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating
a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid
represents a different degree of summarization.