1.1 - Data Mining
Data mining is the process of sorting through large data sets to identify patterns and relationships that can
help solve business problems through data analysis. Data mining techniques and tools enable enterprises
to predict future trends and make more-informed business decisions.
Data mining is a key part of data analytics overall and one of the core disciplines in data science, which uses
advanced analytics techniques to find useful information in data sets. At a more granular level, data mining
is a step in the knowledge discovery in databases (KDD) process, a data science methodology for gathering,
processing and analysing data. Data mining and KDD are sometimes referred to interchangeably, but they're
more commonly seen as distinct things.
Data mining is a crucial component of successful analytics initiatives in organizations. The information it
generates can be used in business intelligence (BI) and advanced analytics applications that involve analysis
of historical data, as well as real-time analytics applications that examine streaming data as it's created or
collected.
Effective data mining aids in various aspects of planning business strategies and managing operations. That
includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It also plays an important
role in healthcare, government, scientific research, mathematics, sports and more.
Overview
Many different types of data can be mined. However, the data must contain patterns if mining is to yield helpful
information.
Based on their function, patterns can be classified into two categories.
Descriptive patterns
Descriptive patterns deal with the general characteristics of the data and summarize them into relevant and helpful information.
Class/concept description: Data entries are associated with labels or classes. For instance, in a
library, the classes of borrowed items include books and research journals, and the concepts for customers
include registered and non-registered members. These kinds of descriptions are called class or concept
descriptions.
Frequent patterns: These are data items that occur frequently in the dataset. There are many kinds of
recurring patterns, such as frequent itemsets, frequent subsequences, and frequent substructures.
Associations: These show relationships between data items in the form of association rules. For instance,
a shopkeeper might find that 70% of the time, when a football is sold, a kit is bought alongside it; the two
items can then be combined into an association rule (a small sketch follows this list).
Correlations: Correlation analysis is performed to find the statistical relationship between two attributes,
i.e., whether they are positively correlated, negatively correlated, or not correlated at all.
Clusters: This is the formation of groups of similar data points. Each point in a cluster is similar to the
other members of its own group but quite different from the members of other groups.
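As a rough illustration of frequent patterns and associations (the transactions below are toy data invented for this sketch), the following Python snippet counts frequent items and item pairs and checks how often a kit accompanies a football:

```python
from collections import Counter
from itertools import combinations

# Toy market-basket data (invented for illustration).
transactions = [
    {"football", "kit", "socks"},
    {"football", "kit"},
    {"football", "water bottle"},
    {"kit", "socks"},
    {"football", "kit", "whistle"},
]

# Frequent items: items that occur in many transactions.
item_counts = Counter(item for t in transactions for item in t)
print("Item frequencies:", item_counts.most_common())

# Frequent pairs (a simple kind of frequent sub-pattern).
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))
print("Pair frequencies:", pair_counts.most_common(3))

# Association "football => kit": of the transactions containing a
# football, what fraction also contain a kit?
with_football = [t for t in transactions if "football" in t]
also_kit = sum(1 for t in with_football if "kit" in t)
print("P(kit | football) =", also_kit / len(with_football))
```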
Predictive patterns
Predictive patterns predict future values by analyzing patterns and their outcomes in previously collected data.
They also help in estimating missing values, as in the sketch below.
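A minimal sketch of a predictive pattern, assuming a toy monthly sales series invented for illustration: a least-squares line is fitted to past values and used to predict a future (or missing) one.

```python
# Minimal sketch of a predictive pattern: fit a least-squares line to
# past (month, sales) data (values invented) and predict a missing month.
months = [1, 2, 3, 4, 5]
sales  = [100, 110, 123, 129, 141]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

# Predict the value for month 6 (a "future" or missing data point).
print("Predicted sales for month 6:", intercept + slope * 6)
```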
Data mining draws on techniques from several other disciplines:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b)
knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be
classified according to different criteria, such as data models or types of data, and the data mining system
can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a relational,
transactional, object-relational, or data warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. It means the data mining
system is classified on the basis of functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe these
techniques according to the degree of user interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications are as
follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data warehouse system, then there will be no
system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main
focus is on data mining design and on developing efficient and effective algorithms for mining the available
data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the database or
data warehouse functions. It fetches the data from a particular source and processes that data
using some data mining algorithms. The data mining result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of the
database and data warehouse system. It fetches the data from the data repository managed by
these systems and performs data mining on that data. It then stores the mining result either in
a file or in a designated place in a database or a data warehouse (the sketch after this list
contrasts this with no coupling).
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a
data warehouse system and in addition to that, efficient implementations of a few data mining
primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into
the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
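As an informal sketch of the difference between no coupling and loose coupling (file, database, and table names are invented), the first function mines a flat file and writes its result to another file, while the second fetches data through a DBMS and stores the mining result back in a designated table:

```python
import csv
import sqlite3
from collections import Counter

def mine_no_coupling(in_path="transactions.csv", out_path="result.txt"):
    """No coupling: read a flat file, mine it, write the result to a file."""
    with open(in_path, newline="") as f:
        items = [row["item"] for row in csv.DictReader(f)]
    counts = Counter(items)
    with open(out_path, "w") as f:
        for item, count in counts.most_common():
            f.write(f"{item},{count}\n")

def mine_loose_coupling(db_path="warehouse.db"):
    """Loose coupling: fetch data through the DBMS and store the
    mining result in a designated table of the same database."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT item FROM sales").fetchall()  # assumes a 'sales' table
    counts = Counter(item for (item,) in rows)
    con.execute("CREATE TABLE IF NOT EXISTS mining_result (item TEXT, freq INTEGER)")
    con.executemany("INSERT INTO mining_result VALUES (?, ?)", counts.items())
    con.commit()
    con.close()
```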
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining
system. A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during discovery, to direct the mining process,
or to examine the findings from different angles or depths. The primitives specify the set of task-relevant data,
the kind of knowledge to be mined, the background knowledge to be used, the interestingness measures for
pattern evaluation, and how the discovered patterns are to be presented.
A data mining query language can be designed to incorporate these primitives, allowing users to interact
with data mining systems flexibly. Having a data mining query language provides a foundation on which
user-friendly graphical interfaces can be built.
Designing a comprehensive data mining language is challenging because data mining covers a wide
spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements.
The design of an effective data mining query language requires a deep understanding of the power,
limitation, and underlying mechanisms of the various kinds of data mining tasks. This facilitates a data
mining system's communication with other information systems and integrates with the overall information
processing environment.
A data mining query is defined in terms of the following primitives:
Task-relevant data: This specifies the portions of the database or the set of data in which the user is
interested. It includes the database attributes or data warehouse dimensions of interest (the relevant
attributes or dimensions).
In a relational database, the set of task-relevant data can be collected via a relational query involving
operations like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation. The initial data
relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can
be thought of as a subtask of the data mining task.
This initial relation may or may not correspond to a physical relation in the database. Since virtual relations
are called views in the field of databases, the set of task-relevant data for data mining is called a minable
view.
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
Background knowledge: Knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more
general concepts.
o Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and general levels of
abstraction, making it easier to understand. It also compresses the data, so fewer input/output
operations are required.
o Drilling Down - Specialization of data: Higher-level concept values are replaced by lower-level concepts.
Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or
dimension.
A simple concept hierarchy for the attribute (or dimension) age is sketched below. User beliefs
regarding relationships in the data are another form of background knowledge.
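As a minimal sketch (the age ranges and labels are illustrative assumptions, not taken from the text), a concept hierarchy for age can be represented as mappings from raw values to ranges and from ranges to general concepts:

```python
# Illustrative concept hierarchy for the attribute "age":
# raw value -> range -> general concept (ranges and labels invented).
def age_to_range(age):
    if age < 20:  return "0-19"
    if age < 40:  return "20-39"
    if age < 60:  return "40-59"
    return "60+"

range_to_concept = {"0-19": "young", "20-39": "young",
                    "40-59": "middle-aged", "60+": "senior"}

ages = [17, 23, 35, 48, 64]

# Rolling up: generalize raw ages to higher-level concepts.
rolled_up = [range_to_concept[age_to_range(a)] for a in ages]
print(rolled_up)   # ['young', 'young', 'young', 'middle-aged', 'senior']

# Drilling down would replace a concept such as "young" with the
# lower-level ranges or raw ages it covers.
```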
Interestingness measures: Different kinds of knowledge may have different interestingness measures. They
may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example,
interestingness measures for association rules include support and confidence. Rules whose support and
confidence values are below user-specified thresholds are considered uninteresting.
o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity
for human comprehension. For example, the more complex the structure of a rule is, the more
difficult it is to interpret, and hence, the less interesting it is likely to be. Objective measures of
pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern
size in bits or the number of attributes or operators appearing in the pattern.
o Certainty (Confidence): Each discovered pattern should have a measure of certainty associated with
it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for association
rules of the form "A => B", where A and B are sets of items, is confidence. Given a set of
task-relevant data tuples, the confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
o Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can
be estimated by a utility function, such as support. The support of an association pattern refers to the
percentage of task-relevant data tuples (or transactions) for which the pattern is true:
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
A small sketch computing both measures follows this list.
o Novelty: Novel patterns are those that contribute new information or improved performance to the
given pattern set; a data exception is one example. Another strategy for detecting novelty is to
remove redundant patterns.
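Both measures can be computed directly from a set of transactions. Below is a minimal Python sketch using toy transactions invented for illustration:

```python
# Support and confidence for the rule A => B, computed exactly as in
# the formulas above (toy transactions invented for illustration).
transactions = [
    {"football", "kit"},
    {"football", "kit", "socks"},
    {"football"},
    {"kit"},
    {"football", "kit"},
]

A, B = {"football"}, {"kit"}

contains_A       = sum(1 for t in transactions if A <= t)
contains_A_and_B = sum(1 for t in transactions if (A | B) <= t)

support    = contains_A_and_B / len(transactions)   # (A and B) / total
confidence = contains_A_and_B / contains_A          # (A and B) / A

print(f"support(A=>B) = {support:.2f}, confidence(A=>B) = {confidence:.2f}")
```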
Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns
are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other
visual representations.
Users must be able to specify the forms of presentation to be used for displaying the discovered patterns.
Some representation forms may be better suited than others for particular kinds of knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for
presenting characteristic descriptions, whereas decision trees are common for classification.
Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their
buying patterns. You are especially interested in those customers whose salary is no less than $40,000 and
who have bought more than $1,000 worth of items, each of which is priced at no less than $100.
In particular, you are interested in the customer's age, income, the types of items purchased, the purchase
location, and where the items were made. You would like to view the resulting classification in the form of
rules. This task can be specified as a data mining query in a language such as DMQL, which would name the
relevant data, the classification task, the thresholds, and the desired presentation form. A rough sketch of the
underlying data selection is given below.
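As a rough, hypothetical sketch of the data selection behind this example (the customer and purchase records are invented, and a real specification would be written in a data mining query language rather than Python):

```python
# Rough sketch of the data selection behind the AllElectronics example:
# customers with salary >= $40,000 who bought more than $1,000 worth of
# items, each priced at no less than $100 (records invented).
customers = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 45, "income": 38000},
    {"id": 3, "age": 29, "income": 61000},
]
purchases = [  # (customer id, item price, item type, place made)
    (1, 450, "TV", "USA"), (1, 700, "laptop", "China"),
    (2, 1200, "fridge", "Mexico"),
    (3, 90, "cable", "China"), (3, 1500, "camera", "Japan"),
]

qualifying = []
for c in customers:
    items = [p for p in purchases if p[0] == c["id"] and p[1] >= 100]
    if c["income"] >= 40000 and sum(p[1] for p in items) > 1000:
        qualifying.append(c["id"])

print("Task-relevant customers:", qualifying)   # [1, 3]
```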
Clean and Transform the Data
As data is loaded into the warehouse, it must be cleaned and checked for consistency −
within itself.
with other data within the same data source.
with the data in other source systems.
with the existing data present in the warehouse.
Transforming involves converting the source data into a structure suitable for the warehouse. Structuring the
data increases query performance and decreases operational costs. The data contained in a data warehouse
must be transformed to support performance requirements and to control ongoing operational costs.
Partition the Data
Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here,
each fact table is split into multiple separate partitions, as sketched below.
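A minimal sketch of partitioning (rows and keys invented): a fact table is split by month so that a query for one month only scans the relevant partition.

```python
from collections import defaultdict

# Partition a fact table of sales rows by month (rows invented).
fact_table = [
    {"date": "2024-01-15", "amount": 120},
    {"date": "2024-01-28", "amount": 80},
    {"date": "2024-02-03", "amount": 200},
]

partitions = defaultdict(list)
for row in fact_table:
    partitions[row["date"][:7]].append(row)   # key = "YYYY-MM"

# A query for January only needs to scan one partition.
print(sum(r["amount"] for r in partitions["2024-01"]))   # 200
```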
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
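A small sketch of a pre-computed aggregation (detail rows invented): total sales per region is computed once, so common queries read the summary rather than the detailed data.

```python
from collections import defaultdict

# Detailed fact rows (invented).
sales = [
    ("north", 100), ("north", 250), ("south", 75), ("south", 300),
]

# Pre-compute the aggregation that common queries will need.
sales_by_region = defaultdict(int)
for region, amount in sales:
    sales_by_region[region] += amount

# A common query now reads the small summary, not the detailed data.
print(dict(sales_by_region))   # {'north': 350, 'south': 375}
```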
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to
keep regular backups. Archiving involves removing old data from the system and storing it in a format that
allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the
latest 6 months of data kept online. In such a scenario, there is often a requirement to do month-on-month
comparisons for this year and last year, in which case some data must be restored from the archive.
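A rough sketch of the retention rule described above, with dates and cut-offs invented for illustration: rows newer than about six months stay online, older rows are archived, and an archived month can be restored for a month-on-month comparison.

```python
from datetime import date, timedelta

# Split warehouse rows into online and archived sets by age (rows invented).
today = date(2024, 6, 30)
online_cutoff = today - timedelta(days=182)   # roughly the latest 6 months

rows = [
    {"date": date(2024, 5, 10), "amount": 120},
    {"date": date(2023, 11, 2), "amount": 90},
    {"date": date(2022, 7, 19), "amount": 60},
]

online  = [r for r in rows if r["date"] >= online_cutoff]
archive = [r for r in rows if r["date"] < online_cutoff]

# For a month-on-month comparison with last year, restore the needed
# month from the archive back into the online store.
restored = [r for r in archive if (r["date"].year, r["date"].month) == (2023, 11)]
online.extend(restored)
print(len(online), "rows online,", len(archive), "rows archived")
```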
Query Management Process
This process performs the following functions −
manages the queries.
helps speed up the execution time of queries.
directs the queries to their most effective data sources.
ensures that all the system sources are used in the most effective way.
monitors actual query profiles.
The information generated in this process is used by the warehouse management process to determine which
aggregations to generate. This process does not generally operate during the regular load of information into
the data warehouse.
Data Warehousing - Architecture
Business analysts get information from the data warehouse to measure performance and make critical
adjustments in order to outperform other businesses in the market. Having a data warehouse offers the
following advantages −
Since a data warehouse can gather information quickly and efficiently, it can enhance
business productivity.
A data warehouse provides us a consistent view of customers and items; hence, it helps us
manage customer relationships.
A data warehouse also helps in bringing down costs by tracking trends and patterns over a
long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs
and construct a business analysis framework. Each person has different views regarding the design of a data
warehouse. These views are as follows −
The top-down view − This view allows the selection of relevant information needed for a data
warehouse.
The data source view − This view presents the information being captured, stored, and
managed by the operational system.
The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
The business query view − It is the view of the data from the viewpoint of the end-user.
Three-Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data
warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is
a relational database system. Back-end tools and utilities are used to feed data into the
bottom tier; these tools and utilities perform the extract, clean, load, and refresh
functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either
of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. ROLAP maps the operations on multidimensional
data to standard relational operations (a sketch of this mapping follows the list).
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools, and data mining tools.
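As an informal sketch of the ROLAP idea mentioned above (table, column, and function names are invented, and an in-memory sqlite3 database stands in for the bottom-tier relational store), the middle tier maps a multidimensional request onto a standard relational query that the top-tier client simply calls:

```python
import sqlite3

# Bottom tier: a relational store (in-memory for this sketch, data invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", 2023, 100), ("north", 2024, 150), ("south", 2024, 80),
])

# Middle tier (ROLAP idea): map a multidimensional request
# ("total amount by region for 2024") onto a standard relational query.
# In a real system the dimension name would be validated, not interpolated.
def rollup(dimension, year):
    sql = f"SELECT {dimension}, SUM(amount) FROM sales WHERE year = ? GROUP BY {dimension}"
    return con.execute(sql, (year,)).fetchall()

# Top tier: a client or reporting tool simply calls the middle tier.
print(rollup("region", 2024))   # e.g. [('north', 150.0), ('south', 80.0)]
```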
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse models −
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A virtual warehouse is a set of views over operational databases. It is easy to build a virtual warehouse, but
it requires excess capacity on operational database servers.
Data Mart
A data mart contains a subset of organization-wide data that is valuable to a specific group within the
organization.
In other words, a data mart contains data specific to a particular group. For example, a marketing data mart
may contain data related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts −
Windows-based or Unix/Linux-based servers are used to implement data marts; they are
implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather
than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not
organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all of the information and subjects spanning an entire
organization.
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions, from one data warehouse to
another.
Load Manager Architecture
The load manager performs the following functions −
Extract the data from the source system.
Fast-load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the
application programs used to extract data; a gateway is supported by the underlying DBMS and allows a
client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java
Database Connectivity (JDBC) are examples of gateways.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying transformations
and checks (a sketch of this load-then-transform pattern follows this list).
Gateway technology is often unsuitable, since gateways tend not to perform well when large
data volumes are involved.
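One way to picture the load-then-transform approach from the list above (table and column names are invented, with an in-memory SQLite database standing in for the warehouse store):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (cust_id TEXT, amount TEXT)")

# Fast load: bulk-insert the raw extract without per-row checks.
raw_rows = [("C1", "100.0"), ("C2", "250.5"), ("C3", "80")]
con.executemany("INSERT INTO staging VALUES (?, ?)", raw_rows)

# Transformations and checks are applied afterwards, inside the database.
con.execute("""
    CREATE TABLE sales AS
    SELECT cust_id, CAST(amount AS REAL) AS amount FROM staging
""")
print(con.execute("SELECT * FROM sales").fetchall())
```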
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed, we are
in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we then need to
perform the following checks, sketched below:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
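A minimal sketch of those two checks on an EPOS-style record (the field names and required types are invented for illustration):

```python
# Raw EPOS record as extracted (field names and values invented).
raw = {"till_id": "07", "item": "football", "price": "25.99",
       "cashier_note": "ignore me", "quantity": "2"}

REQUIRED = {"item": str, "price": float, "quantity": int}

# Strip out columns not required within the warehouse and
# convert each remaining value to its required data type.
clean = {field: cast(raw[field]) for field, cast in REQUIRED.items()}

print(clean)   # {'item': 'football', 'price': 25.99, 'quantity': 2}
```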
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
Note − If detailed information is held offline to minimize disk storage, we should make sure that the data has
been extracted, cleaned up, and transformed into a starflake schema before it is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These aggregations
are generated by the warehouse manager. Summary Information must be treated as transient. It changes on-
the-go in order to respond to the changing query profiles.
The points to note about summary information are as follows −
Summary information speeds up the performance of common queries.
It increases the operational cost.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed information.
Major Issues in Data Mining
Data mining faces a number of methodology and user interaction issues, including the following:
1. User Interface: The knowledge discovered using data mining tools is useful only if it is interesting and,
above all, understandable to the user. Good visualization makes mining results easier to interpret and
helps users better understand their requirements. Considerable research has gone into visualizing large
data sets so that mined knowledge can be manipulated and displayed effectively.
2. Mining different kinds of knowledge in databases: Different users may be interested in different kinds
of knowledge, so a data mining system should cover a broad range of knowledge discovery tasks.
Because the required information differs so widely, it is difficult for a single system to cover this whole
range well.
3. Interactive mining of knowledge at multiple levels of abstraction: Interactive mining is very crucial
because it permits the user to focus the search for patterns, posing and refining data mining
requests based on the results that are returned. In simpler words, it allows users to focus the search
and examine discovered patterns from different angles.
4. Data mining query languages and ad hoc data mining: A data mining query language should allow the
user to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query
language.
5. Presentation and visualization of data mining results: The patterns or trends that are discovered need
to be expressed in high-level languages and visual representations so that they are easily understood
by everyone.
6. Handling noisy or incomplete data: Data cleaning methods are used to handle noise and incomplete
objects in data mining. Without data cleaning methods, the accuracy of the discovered patterns
suffers, and the resulting patterns are of poor quality.
7. Noisy and incomplete data: Data mining is the process of obtaining information from huge volumes
of data. Real-world data is noisy, incomplete, and heterogeneous, and data in huge amounts is often
unreliable or inaccurate. These problems can be caused by human error or by faults in the instruments
that measure the data.
Performance issues
1. Performance: The performance of a data mining system depends primarily on the efficiency of the
techniques and algorithms it uses. If the techniques and algorithms are not well designed, the
performance of the data mining process suffers.
2. Scalability and Efficiency of the Algorithms: Data mining algorithms should be scalable and
efficient enough to extract information from the huge amounts of data in the database.
3. Parallel and incremental mining algorithms: Several factors motivate the development of parallel and
distributed data mining algorithms: the large size of databases, the wide distribution of data, and the
complexity of data mining methods. In such algorithms, the data is first divided into partitions, the
partitions are then processed in parallel, and finally the results from the partitions are merged.
4. Distributed Data: Real-world data is normally stored on different platforms in distributed computing
environments, such as on the internet, on individual systems, or in databases. It is practically difficult
to bring all the data to a centralized data repository, mainly for technical and organizational reasons.
5. Managing relational as well as complex data types: Many forms of data are complicated to manage
because they may be tabular data, media files, or spatial and temporal data. Mining all of these data
types in one go is difficult.
6. Data mining from globally present heterogeneous databases: Data is fetched from various data
sources available over LANs and WANs, and these sources may be structured or semi-structured.
Streamlining such heterogeneous data is one of the hardest challenges.
Other challenges
1. Security and Social Challenges: Decision-making strategies rely on data collection and sharing, which
requires considerable data security. Private and sensitive information about individuals is gathered to
build customer profiles and to understand user behaviour patterns, so illegal access to data and the
confidential nature of data become significant issues.
2. Complex Data: Real-world data is truly heterogeneous and may include multimedia data such as
natural language text, time series, spatial data, temporal data, audio, video, and images. It is hard to
handle these different kinds of data and extract the required information; more often than not, new
tools and methodologies have to be developed to do so.
3. Improvement of Mining Algorithms: Factors such as the complexity of data mining approaches, the
enormous size of databases, and the flow of the entire data motivate the distribution and creation of
parallel data mining algorithms.
4. Data Visualization: Data visualization is a vital step in data mining because it is the primary way
results are presented to the user. The extracted information should convey exactly what it is intended
to convey, yet it is often hard to present information to the end-user in a precise and straightforward
manner. Because both the input data and the output information can be complex, effective and
sophisticated visualization techniques must be applied to make the presentation useful.
5. Data Privacy and Security: Data mining typically raises significant issues regarding governance,
privacy, and data security. For instance, when a retailer analyzes purchase details, it reveals
information about the buying habits and preferences of customers without their authorization.
Examples:
Data Integrity: A bank may maintain credit card accounts on several different databases. The
addresses (or even the names) of a single cardholder may be different in each. Software must
translate data from one system to another and select the address most recently entered.
Overfitting: Overfitting occurs when the model does not fit future states. For example, a classification
model for a student database may be developed to classify students as excellent, good, or average. If the
training database is quite small, the model might erroneously indicate that an excellent student is anyone
who scores more than 90%, because there is only one entry in the training database under 90%. In this
case, many future students would be erroneously classified as excellent. Overfitting can arise under other
circumstances as well, even though the data are not changing.
Large data sets: The massive datasets associated with data mining create problems when applying
algorithms designed for small datasets. Many modeling applications grow exponentially with dataset size
and thus are too inefficient for larger datasets. Sampling and parallelization are effective tools to attack
this scalability problem, as sketched below.
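For the large data sets point above, here is a tiny sketch of sampling as a scalability tool; the data is generated for illustration and the "analysis" is just a mean:

```python
import random

# A "large" dataset generated for illustration.
random.seed(0)
full_data = [random.gauss(50, 10) for _ in range(1_000_000)]

# Sampling: run the (potentially expensive) analysis on a small random sample.
sample = random.sample(full_data, 10_000)

estimate = sum(sample) / len(sample)
exact    = sum(full_data) / len(full_data)
print(f"sample mean = {estimate:.2f}, full mean = {exact:.2f}")
```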