1.1 - Data Mining

Data mining is the process of analyzing large datasets to identify patterns and relationships that can help solve business problems. It is a key part of data analytics and involves techniques to extract useful information from data. The document discusses different types of data that can be mined, functionalities of data mining like classification and prediction, and patterns that can be mined like descriptive patterns involving class descriptions and predictive patterns for predicting future values.


Data Mining:

Data mining is the process of sorting through large data sets to identify patterns and relationships that can
help solve business problems through data analysis. Data mining techniques and tools enable enterprises
to predict future trends and make more-informed business decisions.

Data mining is a key part of data analytics overall and one of the core disciplines in data science, which uses
advanced analytics techniques to find useful information in data sets. At a more granular level, data mining
is a step in the knowledge discovery in databases (KDD) process, a data science methodology for gathering,
processing and analysing data. Data mining and KDD are sometimes referred to interchangeably, but they're
more commonly seen as distinct things.

Why is data mining important?

Data mining is a crucial component of successful analytics initiatives in organizations. The information it
generates can be used in business intelligence (BI) and advanced analytics applications that involve analysis
of historical data, as well as real-time analytics applications that examine streaming data as it's created or
collected.

Effective data mining aids in various aspects of planning business strategies and managing operations. That
includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It also plays an important
role in healthcare, government, scientific research, mathematics, sports and more.

What kinds of data can be mined?


Data mining refers to extracting, or "mining", knowledge from huge amounts of data. It is generally used wherever large volumes of data are stored and processed; for example, banking systems apply data mining to the huge amounts of data they store and process constantly.
In data mining, hidden patterns in the data are identified and organized into useful information. The data is assembled in a repository such as a data warehouse, where it is analyzed and data mining algorithms are applied. The resulting information supports effective decisions that cut costs and increase revenue.
The main kinds of data repositories that can be mined are as follows −
 Relational Databases − A database system, also called a database management system, consists of a set of interrelated data, called a database, and a set of software programs to manage and access the data.
A relational database is a set of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and generally stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.
 Transactional Databases − A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction (such as the items purchased in a store).
The transactional database may have additional tables associated with it that contain other information about the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson and of the branch at which the sale occurred, and so on.
 Object-Relational Databases − Object-relational databases are built on an object-relational data model. This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
 Temporal Databases − A temporal database typically stores relational data that includes time-related attributes. These attributes may involve several timestamps, each having a different semantics.
 Sequence Databases − A sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences.
 Time-Series Databases − A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (such as temperature and wind).
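
To make these repository types concrete, here is a minimal Python sketch; the field names and values are invented for illustration and are not taken from any particular system.

# Hypothetical examples of the kinds of data that can be mined.

# A relational tuple: an object identified by a key and described by attribute values.
customer = {"cust_ID": "C1023", "name": "A. Smith", "age": 34, "city": "Pune"}

# A transactional record: a unique trans_ID plus the list of items making up the transaction.
transaction = {"trans_ID": "T100", "items": ["bread", "milk", "butter"]}

# A time series: values obtained over repeated measurements of time (here, daily closing prices).
time_series = [("2024-01-01", 101.2), ("2024-01-02", 103.8), ("2024-01-03", 99.5)]

# A sequence: ordered events with no fixed time interval (e.g., a Web click stream).
click_stream = ["home", "catalog", "item_42", "cart", "checkout"]

print(customer["age"], transaction["items"], time_series[0], click_stream[:2])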
What are the functionalities of data mining?
Data mining functionalities are used to specify the kinds of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities, which are as follows −
 Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
 Data discrimination − It is a comparison of the general characteristics of the target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are retrieved through database queries.
 Association Analysis − It analyses the sets of items that frequently occur together in a transactional dataset. Two parameters are used for determining the association rules (a small worked sketch appears after this list) −
o Support, which identifies the frequently occurring item sets in the database.
o Confidence, which is the conditional probability that an item occurs in a transaction
when another item occurs.
 Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
 Prediction − It is used to predict missing or unavailable data values, or pending trends. An object can be predicted based on the attribute values of the object and the attribute values of the classes. It can be a prediction of missing numerical values or of increase/decrease trends in time-related data.
 Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is a form of unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
 Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. They are data objects whose behaviour differs markedly from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.
 Evolution analysis − It describes and models trends for objects whose behaviour changes over time.
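
As a small worked sketch of the association analysis functionality above, the following Python code computes support and confidence for a single candidate rule over a toy transactional dataset (the items and figures are made up for illustration).

# Toy transactional dataset: each transaction is a set of purchased items (illustrative only).
transactions = [
    {"football", "kit", "socks"},
    {"football", "kit"},
    {"football", "water bottle"},
    {"kit", "socks"},
    {"football", "kit", "water bottle"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Conditional probability that the consequent occurs when the antecedent occurs.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_a, rule_b = {"football"}, {"kit"}
print("support(football => kit) =", support(rule_a | rule_b, transactions))      # 3/5 = 0.6
print("confidence(football => kit) =", confidence(rule_a, rule_b, transactions))  # 3/4 = 0.75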
What kind of patterns can be mined in data mining?

Overview

Different types of data can be mined in data mining. However, the data must contain patterns for helpful
information to be extracted.

Based on the data functionalities, patterns can be further classified into two categories.

Descriptive patterns

Descriptive patterns deal with the general characteristics of the data and convert them into relevant and helpful information.

Descriptive patterns can be divided into the following patterns:

 Class/concept description: Data entries are associated with labels or classes. For instance, in a library, the classes of borrowed items include books and research journals, and the concepts for customers include registered members and non-registered members. These types of descriptions are class or concept descriptions.

 Frequent patterns: These are data points that occur frequently in the dataset. There are many kinds of recurring patterns, such as frequent itemsets, frequent subsequences, and frequent substructures.

 Associations: These show relationships between data items in the form of association rules. For instance, a shopkeeper may establish an association rule that 70% of the time, when a football is sold, a kit is bought alongside it. These two items can be combined to form an association.
 Correlations: This is performed to find the statistical correlation between two data attributes, i.e., whether the relationship between them is positive, negative, or absent.

 Clusters: This is the formation of groups of similar data points. Each point in a group is similar to the other members of its group but quite different from the members of other groups.

Predictive patterns

It predicts future values by analyzing the data patterns and their outcomes based on the previous data. It also
helps us find missing values in the data.

Predictive patterns can be categorized into the following patterns.


 Classification: It helps predict the label of unknown data points with the help of known data
points. For instance, if we have a dataset of X-rays of cancer patients, then the possible labels would
be cancer patient and not cancer patient. These classes can be obtained by data characterizations or
by data discrimination.
 Regression: Unlike classification, regression is used to find missing numeric values in the dataset and to predict future numeric values. For instance, we can estimate next year's sales from the past twenty years' sales by finding the relation in the data (a small sketch of classification and regression follows this list).
 Outlier analysis: Not all data points in a dataset follow the same behavior. Data points that deviate from the usual behavior are called outliers, and the analysis of these points is called outlier analysis. Outliers are usually set aside when building models on the data.
 Evolution analysis: As the name suggests, this analyses data points whose behavior and trends change over time.
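
The following Python sketch illustrates the classification and regression patterns above on deliberately tiny, made-up data: a one-nearest-neighbour rule stands in for classification, and a least-squares line fit stands in for regression. It is only one possible way to realise these tasks, not a prescribed method.

# --- Classification: predict the label of an unknown point from labelled points (1-nearest neighbour).
labelled = [((1.0, 1.2), "average"), ((2.9, 3.1), "good"), ((4.8, 5.2), "excellent")]

def classify(point):
    # Assign the label of the closest known point (squared Euclidean distance).
    return min(labelled, key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[0], point)))[1]

print(classify((3.0, 3.0)))  # -> "good"

# --- Regression: fit y = m*x + c by least squares and predict a future numeric value.
years = [1, 2, 3, 4, 5]                  # e.g., year index
sales = [10.0, 12.1, 13.9, 16.2, 18.0]   # e.g., yearly sales figures (made up)

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) / sum((x - mean_x) ** 2 for x in years)
c = mean_y - m * mean_x
print("predicted sales for year 6:", m * 6 + c)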

Data Mining – Systems


There is a large variety of data mining systems available. Data mining systems may integrate techniques
from the following −

 Spatial Data Analysis


 Information Retrieval
 Pattern Recognition
 Image Analysis
 Signal Processing
 Computer Graphics
 Web Technology
 Business
 Bioinformatics
Data Mining System Classification
A data mining system can be classified according to the following criteria −

 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b)
knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models or types of data, and the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a relational,
transactional, object-relational, or data warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. It means the data mining
system is classified on the basis of functionalities such as −

 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe these
techniques according to the degree of user interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications are as
follows −

 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail
Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data warehouse system, then there will be no
system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main
focus is on data mining design and on developing efficient and effective algorithms for mining the available
data sets.
The list of Integration Schemes is as follows −
 No Coupling − In this scheme, the data mining system does not utilize any of the database or
data warehouse functions. It fetches the data from a particular source and processes that data
using some data mining algorithms. The data mining result is stored in another file.
 Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse systems. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining results either in a file or in a designated place in a database or data warehouse.
 Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a
data warehouse system and in addition to that, efficient implementations of a few data mining
primitives can be provided in the database.
 Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into
the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the data mining
system. A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during discovery to direct the mining process
or examine the findings from different angles or depths. The data mining primitives specify the following,

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to interact
with data mining systems flexibly. Having a data mining query language provides a foundation on which
user-friendly graphical interfaces can be built.

Designing a comprehensive data mining language is challenging because data mining covers a wide
spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements.
The design of an effective data mining query language requires a deep understanding of the power,
limitation, and underlying mechanisms of the various kinds of data mining tasks. This facilitates a data
mining system's communication with other information systems and integrates with the overall information
processing environment.

List of Data Mining Task Primitives

A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This includes the
database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query involving
operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.
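
As a hedged illustration of collecting task-relevant data, the Python sketch below builds a throwaway SQLite database (the table and column names are invented for the example) and retrieves a minable view using selection, projection, and a join.

import sqlite3

# In-memory database with two illustrative tables (not a real schema from the text).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (cust_ID INTEGER, age INTEGER, income REAL)")
con.execute("CREATE TABLE purchases (cust_ID INTEGER, item TEXT, price REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, 34, 52000.0), (2, 45, 38000.0), (3, 29, 61000.0)])
con.executemany("INSERT INTO purchases VALUES (?, ?, ?)",
                [(1, "laptop", 900.0), (2, "pen", 2.0), (3, "camera", 450.0)])

# Selection (income >= 40000), projection (age, income, item, price), and a join
# together produce the initial data relation, i.e. the minable view.
minable_view = con.execute("""
    SELECT c.age, c.income, p.item, p.price
    FROM customer c JOIN purchases p ON c.cust_ID = p.cust_ID
    WHERE c.income >= 40000
    ORDER BY c.age
""").fetchall()

print(minable_view)  # [(29, 61000.0, 'camera', 450.0), (34, 52000.0, 'laptop', 900.0)]
con.close()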

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and
evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which
allows data to be mined at multiple levels of abstraction.

A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.

o Rolling up - Generalization of data: Allows data to be viewed at more meaningful, higher-level abstractions and makes it easier to understand. It also compresses the data, so that mining requires fewer input/output operations.
o Drilling down - Specialization of data: Higher-level concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.

For example, a concept hierarchy for the attribute (or dimension) age might map raw age values into ranges such as 20-39, 40-59, and 60+, and those ranges into higher-level concepts such as young, middle-aged, and senior. User beliefs regarding relationships in the data are another form of background knowledge.
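
Here is a minimal Python sketch of such a concept hierarchy for age; the cut-off values and level names are illustrative assumptions rather than fixed definitions.

# Illustrative two-level concept hierarchy for the attribute "age".
def roll_up_age(age):
    # Level 1: raw age -> range (the generalization used when rolling up).
    if age < 20:
        level1 = "under_20"
    elif age < 40:
        level1 = "20-39"
    elif age < 60:
        level1 = "40-59"
    else:
        level1 = "60+"
    # Level 2: range -> higher-level concept.
    level2 = {"under_20": "youth", "20-39": "young", "40-59": "middle_aged", "60+": "senior"}[level1]
    return level1, level2

ages = [23, 37, 45, 51, 67]
print([roll_up_age(a) for a in ages])
# Drilling down simply means replacing a general concept (e.g. "young")
# by its lower-level values (e.g. the range "20-39" or the raw ages).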

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interestingness measures. These measures may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values fall below user-specified thresholds are considered uninteresting.

o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity
for human comprehension. For example, the more complex the structure of a rule is, the more
difficult it is to interpret, and hence, the less interesting it is likely to be. Objective measures of
pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern
size in bits or the number of attributes or operators appearing in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. The certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
o Utility (Support): The potential usefulness of a pattern is another factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true:
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
o Novelty: Novel patterns are those that contribute new information or improved performance to the given pattern set, for example, a data exception. Another strategy for detecting novelty is to remove redundant patterns.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include rules, tables,
cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered patterns.
Some representation forms may be better suited than others for particular kinds of knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for
presenting characteristic descriptions, whereas decision trees are common for classification.
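
As one small example of these presentation forms, the Python sketch below builds a simple cross tab of counts from generalized (location, item type) tuples; the categories are invented, and in practice a reporting or charting tool would render the result.

from collections import Counter

# Illustrative generalized tuples: (location, item_type) pairs after generalization.
tuples = [("Asia", "TV"), ("Asia", "computer"), ("Europe", "TV"),
          ("Asia", "TV"), ("Europe", "computer"), ("Europe", "computer")]

counts = Counter(tuples)
locations = sorted({loc for loc, _ in tuples})
items = sorted({item for _, item in tuples})

# Print a small cross tab: rows are locations, columns are item types.
print("location".ljust(10) + "".join(i.ljust(10) for i in items))
for loc in locations:
    row = "".join(str(counts[(loc, i)]).ljust(10) for i in items)
    print(loc.ljust(10) + row)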

Example of Data Mining Task Primitives

Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their
buying patterns. You are especially interested in those customers whose salary is no less than $40,000 and
who have bought more than $1,000 worth of items, each of which is priced at no less than $100.

In particular, you are interested in the customer's age, income, the types of items purchased, the purchase
location, and where the items were made. You would like to view the resulting classification in the form of
rules. This data mining query is expressed in DMQL as follows, where each line of the query has been
enumerated to aid in our discussion.

1. use database AllElectronics_db
2. use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
3. mine classification as promising_customers
4. in relevance to C.age, C.income, I.type, I.place_made, T.branch
5. from customer C, item I, transaction T
6. where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥ 40,000 and I.price ≥ 100
7. group by T.cust_ID

Data Warehousing - System Processes


For operational databases, we have a fixed set of operations to apply and well-defined techniques for delivering a solution, such as using normalized data and keeping tables small. In the case of decision-support systems, however, we do not know which queries and operations will need to be executed in the future. Therefore, techniques applied to operational databases are not suitable for data warehouses.
In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.
Process Flow in Data Warehouse
There are four major processes that contribute to a data warehouse −

 Extract and load the data.


 Cleaning and transforming the data.
 Backup and archive the data.
 Managing queries and directing them to the appropriate data sources.

Extract and Load Process


Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the
data warehouse.
Note − Before loading the data into the data warehouse, the information extracted from the external sources
must be reconstructed.
Controlling the Process
Controlling the process involves determining when to start data extraction and the consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.
When to Initiate Extract
Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single,
consistent version of the information to the user.
For example, in a customer profiling data warehouse in telecommunication sector, it is illogical to merge the
list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up
to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated
subscriptions.
Loading the Data
After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.
Note − Consistency checks are executed only when all the data sources have been loaded into the temporary
data store.
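
The following Python sketch mirrors this flow with invented source names, record fields, and a sample consistency rule: each source is extracted and loaded into a temporary store, and the consistency check runs only after all sources have been loaded.

# Illustrative source systems and the records that would be extracted from them.
SOURCES = {
    "customer_db":   [{"cust_ID": 1, "name": "A. Smith"}, {"cust_ID": 2, "name": "B. Jones"}],
    "subscriptions": [{"cust_ID": 1, "plan": "gold"}, {"cust_ID": 3, "plan": "silver"}],
}

def extract(source_name):
    # In a real solution this would query the operational system; here we return canned rows.
    return SOURCES[source_name]

temporary_store = {}

# Extract and load every source into the temporary data store.
for name in SOURCES:
    temporary_store[name] = extract(name)

# Consistency check, run only after all sources are loaded:
# every subscription row should refer to a customer present in the customer extract.
known_customers = {row["cust_ID"] for row in temporary_store["customer_db"]}
orphans = [row for row in temporary_store["subscriptions"] if row["cust_ID"] not in known_customers]
print("subscriptions with no matching customer:", orphans)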
Clean and Transform Process
Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and
Transforming. Here is the list of steps involved in Cleaning and Transforming −

 Clean and transform the loaded data into a structure


 Partition the data
 Aggregation
Clean and Transform the Loaded Data into a Structure
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data
consistent −

 within itself.
 with other data within the same data source.
 with the data in other source systems.
 with the existing data present in the warehouse.
Transforming involves converting the source data into a structure. Structuring the data increases the query
performance and decreases the operational cost. The data contained in a data warehouse must be
transformed to support performance requirements and control the ongoing operational costs.
Partition the Data
It will optimize the hardware performance and simplify the management of data warehouse. Here we
partition each fact table into multiple separate partitions.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
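
A minimal Python sketch of this kind of pre-aggregation is shown below: detailed sales rows (with invented values) are summarized per day, so a common "sales per day" query can read the small summary instead of the detailed data.

from collections import defaultdict

# Detailed fact rows: (date, store, amount) - illustrative values only.
detailed_sales = [
    ("2024-01-01", "S1", 120.0), ("2024-01-01", "S2", 80.0),
    ("2024-01-02", "S1", 95.5),  ("2024-01-02", "S2", 60.0),
    ("2024-01-02", "S1", 40.0),
]

# Pre-computed aggregation: total sales per day.
sales_per_day = defaultdict(float)
for date, _store, amount in detailed_sales:
    sales_per_day[date] += amount

# A common query ("sales on 2024-01-02") now reads the small summary instead of the detail.
print(dict(sales_per_day))          # {'2024-01-01': 200.0, '2024-01-02': 195.5}
print(sales_per_day["2024-01-02"])  # 195.5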
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.
Query Management Process
This process performs the following functions −
 manages the queries.
 helps speed up the execution time of queries.
 directs the queries to their most effective data sources.
 ensures that all the system sources are used in the most effective way.
 monitors actual query profiles.
The information generated in this process is used by the warehouse management process to determine which
aggregations to generate. This process does not generally operate during the regular load of information into
data warehouse.
Data Warehousing - Architecture
Business analysts use the information in the data warehouse to measure performance and make critical adjustments in order to win over other business stakeholders in the market. Having a data warehouse offers the following advantages −
 Since a data warehouse can gather information quickly and efficiently, it can enhance
business productivity.
 A data warehouse provides us with a consistent view of customers and items and hence helps us
manage customer relationships.
 A data warehouse also helps in bringing down costs by tracking trends and patterns over a
long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs
and construct a business analysis framework. Each person has different views regarding the design of a data
warehouse. These views are as follows −
 The top-down view − This view allows the selection of relevant information needed for a data
warehouse.
 The data source view − This view presents the information being captured, stored, and
managed by the operational system.
 The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
 The business query view − It is the view of the data from the viewpoint of the end-user.
Three-Tier Data Warehouse Architecture
Generally a data warehouses adopts a three-tier architecture. Following are the three tiers of the data
warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is
the relational database system. We use the back end tools and utilities to feed data into the
bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and refresh
functions.
 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either
of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional
data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
 Top-Tier − This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse models −

 Virtual Warehouse
 Data Mart
 Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual
warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of
an organization.
In other words, we can claim that data marts contain data specific to a particular group. For example, the
marketing data mart may contain data related to items, customers, and sales. Data marts are confined to
subjects.
Points to remember about data marts −
 Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather
than months or years.
 The life cycle of a data mart may become complex in the long run if its planning and design are not
organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally structured data warehouse.
 Data marts are flexible.
Enterprise Warehouse
 An enterprise warehouse collects all the information and the subjects spanning an entire
organization
 It provides us enterprise-wide data integration.
 The data is integrated from operational systems and external information providers.
 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
Load Manager
This component performs the operations required to extract data and load it into the warehouse.
The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
 Extract the data from source system.
 Fast Load the extracted data into temporary data store.
 Perform simple transformations into structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
 In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.
 The transformations affect the speed of data processing.
 It is more effective to load the data into a relational database prior to applying transformations
and checks.
 Gateway technology is often not suitable, since gateways tend not to perform well when large
data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following steps:

 Strip out all the columns that are not required within the warehouse.
 Convert all the values to the required data types.
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −

 The controlling process


 Stored procedures or C with SQL
 Backup/Recovery tool
 SQL Scripts
Operations Performed by Warehouse Manager
 A warehouse manager analyzes the data to perform consistency and referential integrity
checks.
 Creates indexes, business views, partition views against the base data.
 Generates new aggregations and updates existing aggregations. Generates normalizations.
 Transforms and merges the source data into the published data warehouse.
 Backs up the data in the data warehouse.
 Archives the data that has reached the end of its captured life.
Query Manager
 Query manager is responsible for directing the queries to the suitable tables.
 By directing the queries to appropriate tables, the speed of querying and response generation
can be increased.
 Query manager is responsible for scheduling the execution of the queries posed by the user.
Query Manager Architecture
The architecture of a query manager includes the following:

 Query redirection via C tool or RDBMS


 Stored procedures
 Query management tool
 Query scheduling via C tool or RDBMS
 Query scheduling via third-party software
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and then archived to
tape. The detailed information part of data warehouse keeps the detailed information in the starflake schema.
Detailed information is loaded into the data warehouse to supplement the aggregated data.

Note − If detailed information is held offline to minimize disk storage, we should make sure that the data has
been extracted, cleaned up, and transformed into starflake schema before it is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These aggregations
are generated by the warehouse manager. Summary Information must be treated as transient. It changes on-
the-go in order to respond to the changing query profiles.
The points to note about summary information are as follows −
 Summary information speeds up the performance of common queries.
 It increases the operational cost.
 It needs to be updated whenever new data is loaded into the data warehouse.
 It may not have been backed up, since it can be generated fresh from the detailed information.

What are the major issues of data mining?

Mining methodology and user interaction issues

1. User interface: The knowledge discovered using data mining tools is useful only if it is interesting and, above all, understandable to the user. Good visualization and interpretation of data mining results make the results easier to use and help users better understand their requirements. Much research has therefore been done on visualization techniques for enormous data sets that manipulate and display mined knowledge.
2. Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge, so data mining should cover a wide range of knowledge discovery tasks. Because the information and the ways it is used differ so much, it is difficult for one system to cover this whole range of tasks.

3. Interactive mining of knowledge at multiple levels of abstraction: Interactive mining is crucial because it permits the user to focus the search for patterns, providing and refining data mining requests based on the results that are returned. In simpler words, it allows users to examine the patterns being searched for from various angles.

4. Incorporation of background knowledge: Background knowledge is used to guide the discovery process and to express the discovered patterns or trends. It can be used to express discovered patterns in brief and precise terms and to represent them at different levels of abstraction.

5. Data mining query languages and ad hoc data mining: A data mining query language should give the user the ability to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.

6. Presentation and visualization of data mining results: The patterns or trends that are discovered need to be expressed in high-level languages and visual representations. The representation has to be simple enough to be easily understood.

7. Handling noisy or incomplete data: Data cleaning methods are used to handle noise and incomplete objects in data mining. Without data cleaning methods, the accuracy of the discovered patterns suffers, and the patterns will be poor in quality.

8. Noisy and incomplete data: Data mining is the process of obtaining information from huge volumes of data. Real-world data is noisy, incomplete, and heterogeneous, and data in huge amounts is often unreliable or inaccurate. These issues can be caused by human error or by errors in the instruments that measure the data.

9. Incorporation of background knowledge: If background knowledge can be incorporated, more accurate and reliable data mining results can be found. Predictive tasks can make more accurate predictions, while descriptive tasks can come up with more useful findings. However, gathering and including background knowledge is a complex process.

Performance issues
1. Performance: The performance of a data mining system basically relies on the efficiency of the techniques and algorithms used. If the techniques and algorithms are not well designed, the performance of the data mining process is affected adversely.

2. Scalability and efficiency of the algorithms: Data mining algorithms should be scalable and efficient so that they can extract information from the tremendous amounts of data in the database.

3. Parallel and incremental mining algorithms: Several factors motivate the development of parallel and distributed algorithms in data mining: the large size of databases, the wide distribution of data, and the complexity of data mining methods. In such algorithms, the data is first divided into partitions, the partitions are then processed in parallel, and finally the results from the partitions are merged (a small sketch of this pattern appears after this list).

4. Distributed data: Real-world data is normally stored at various sites in distributed computing environments. It may be on the internet, on individual systems, or in databases. It is practically difficult to bring all the data to a centralized data repository, mainly for technical and organizational reasons.

5. Managing relational as well as complex data types: Many kinds of data can be complicated to manage, as they may take the form of tables, media files, or spatial and temporal data. Mining all of these data types in one go is hard to do.

6. Data mining from globally distributed heterogeneous databases: Data is fetched from various data sources available on LANs and WANs, and these sources can be structured or semi-structured. Integrating and streamlining them is the hardest challenge.
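
The sketch below, in plain Python with invented data, follows the partition / process-in-parallel / merge pattern mentioned in the list above: the transactions are split into partitions, each partition's item counts are computed in a separate process, and the partial counts are merged.

from collections import Counter
from multiprocessing import Pool

def count_items(partition):
    # Local mining step: count item occurrences within one partition.
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

if __name__ == "__main__":
    transactions = [["bread", "milk"], ["bread", "butter"], ["milk"],
                    ["bread", "milk", "butter"], ["butter"], ["milk", "butter"]]

    # Step 1: partition the data (here, two partitions of three transactions each).
    partitions = [transactions[:3], transactions[3:]]

    # Step 2: process the partitions in parallel.
    with Pool(processes=2) as pool:
        partial_counts = pool.map(count_items, partitions)

    # Step 3: merge the partial results.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    print(total)  # e.g. Counter({'milk': 4, 'butter': 4, 'bread': 3})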

Diverse data types issues

1. Security and social challenges: Decision-making strategies rely on data collection and sharing, so they require considerable data security. Private information about people and other sensitive information is gathered to build customer profiles and to understand user behaviour patterns, so illegal access to the data and the confidential nature of the data become significant issues.

2. Complex data: Real-world data is truly heterogeneous; it may be media data, including natural language text, time series, spatial data, temporal data, audio or video, images, etc. It is hard to deal with these various types of data and extract the necessary information. More often than not, new tools and methodologies have to be developed to extract the relevant information.

3. Improvement of mining algorithms: Factors such as the complexity of data mining approaches, the enormous size of the database, and the flow of the entire dataset motivate the distribution and creation of parallel data mining algorithms.

4. Data visualization: Data visualization is a vital process in data mining because it is the main interaction that presents the output to the user in an accessible way. The extracted information should convey exactly what it is intended to convey, yet it is often hard to present information to the end-user in a precise and straightforward manner. Because both the output information and the input data can be very complex, effective and sophisticated data visualization methods need to be applied to make the presentation successful.

5. Data privacy and security: Data mining often raises significant issues regarding governance, privacy, and data security. For instance, when a retailer analyses purchase details, it can reveal information about the buying habits and choices of customers without their authorization.

Examples:

 Data Integrity: A bank may maintain credit card accounts on several different databases. The
addresses (or even the names) of a single cardholder may be different in each. Software must
translate data from one system to another and select the address most recently entered.

 Overfitting: Over-fitting occurs when the model does not fit future states. For example, a classification model for a student database may be developed to classify students as excellent, good, or average. If the training database is quite small, the model might erroneously indicate that an excellent student is anyone who scores more than 90% because there is only one entry in the training database under 90%. In this case, many future students would be erroneously classified as excellent (a small sketch of this appears after this list). Over-fitting can arise under other circumstances as well, even though the data are not changing.

 Large data sets: The massive datasets associated with data mining create problems when applying algorithms designed for small datasets. Many modeling applications grow exponentially with dataset size and thus are too inefficient for larger datasets. Sampling and parallelization are effective tools for attacking this scalability problem.
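
To make the over-fitting example above concrete, here is a small Python sketch; the scores are invented and the one-split threshold rule is just one simple way of modelling the scenario.

# Tiny, made-up training set of (score, label) pairs - only one entry is below 90%.
training = [(95, "excellent"), (92, "excellent"), (88, "average")]

# "Learn" a one-split rule: the threshold is the midpoint between the lowest score
# labelled excellent and the highest score labelled average in the training data.
lowest_excellent = min(score for score, label in training if label == "excellent")  # 92
highest_average = max(score for score, label in training if label == "average")     # 88
threshold = (lowest_excellent + highest_average) / 2                                 # 90.0

def classify(score):
    return "excellent" if score > threshold else "average"

# Future students with a realistic spread of scores: the over-fitted rule labels
# everyone above 90% as excellent, so many students are misclassified as excellent.
future_scores = [95, 93, 91, 90.5, 82, 76]
print(threshold, [classify(s) for s in future_scores])
# 90.0 ['excellent', 'excellent', 'excellent', 'excellent', 'average', 'average']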
