MCA Data Mining and Business Intelligence 2
MCA
SEMESTER - IV (CBCS)
: Dr. D. S. Rao
Professor (CSE) & Associate Dean
(Student Affairs),
Koneru Lakshmaiah Education Foundation.
KL H (Deemed to be University)
Hyderabad Campus, Hyderabad, Telangana
: Dr. R. Saradha
Assistant Professor,
SDNB Vaishnav College For Women,
Chennai
Module I
1. Introduction and Overview of BI 01
Module II
2. Data preparation 16
3. Optimization Technique 37
Module III
4. BI Using Data Warehousing 50
5. Data mart 76
Module IV
6. Data Mining and Preprocessing 107
Module V
7. Association Rule Mining 122
Module VI
8. Classification and Prediction 140
Module VII
9. Clustering 170
Module VIII
10. Text Mining 188
11. Web Mining 205
S.Y. MCA
SEMESTER - IV (CBCS)
6. Classification and Prediction: Introduction, Classification methods: Decision Tree (ID3, CART), Bayesian classification: Bayes' theorem (Naïve Bayesian classification), Linear and nonlinear regression. (08)
Module I
Business Intelligence
1
INTRODUCTION AND OVERVIEW OF BI
Unit Structure
1.0 Objectives
1.1 Introduction
1.2 An Overview
1.2.1 Effective and timely decisions
1.2.2 Data Information and knowledge
1.2.3 BI Architecture
1.2.4 Ethics and BI
1.3 BI Applications
1.4 List of References
1.5 Unit End Exercises
1.0 OBJECTIVES
After going through this unit, you will be able to:
● Design and build a Business Intelligence solution
● Understand the concept of Business Intelligence
● Understand the basics of design and management of BI systems
1.1 INTRODUCTION
The advent of low-cost data storage technologies and the wide availability
of Internet connections have made it easier for individuals and
organizations to access large amounts of data. Such data are often
heterogeneous in origin, content and representation, as they include
commercial, financial and administrative transactions, web navigation
paths, emails, texts and hypertexts, and the results of clinical tests, to name
just a few examples. Their accessibility opens up promising scenarios and
opportunities, and raises an enticing question: is it possible to convert such
data into information and knowledge that can then be used by decision
makers to aid and improve the governance of enterprises and of public
administration? Business intelligence may be defined as a set of
mathematical models and analysis methodologies that exploit the available
data to generate information and knowledge useful for complex decision-
making processes. This opening chapter will describe in general terms the
problems entailed in business intelligence, highlighting the
interconnections with other disciplines and identifying the primary
components typical of a business intelligence environment.
In complex organizations, public or private, decisions are made on a
continual basis. Such decisions may be more or less critical, have long- or
short-term effects and involve people and roles at various hierarchical
levels. The ability of these knowledge workers to make decisions, both as
individuals and as a community, is one of the primary factors that
influence the performance and competitive strength of a given
organization.
Individuals and companies may now access enormous amounts of data
more easily owing to the introduction of low-cost data storage
technologies and the widespread availability of Internet connections.
Commercial, financial, and administrative transactions, web navigation
patterns, emails, texts and hypertexts, and the results of clinical testing, to
name a few instances, are all examples of data that are heterogeneous in
origin, content, and representation. Their accessibility opens up exciting
situations and opportunities, as well as the intriguing question of whether
such data can be converted into information and knowledge that can be
used by decision makers to help and improve corporate and government
governance. Business intelligence is a set of mathematical models and
analysis procedures that employ existing data to develop information and
knowledge that may be used in complex decision-making processes. This
first chapter will outline the difficulties that business intelligence entails,
as well as the links with other disciplines and the major components that
make up a business intelligence ecosystem. Examples 1.1 and 1.2 illustrate
two highly complex decision-making processes in rapidly changing
conditions.
Example 1.1 – Retention in the mobile phone industry. The marketing
manager of a mobile phone company realizes that a large number of
customers are discontinuing their service, leaving her company in favour
of some competing provider. As can be imagined, low customer loyalty,
also known as customer attrition or churn, is a critical factor for many
companies operating in service industries. Suppose that the marketing
manager can rely on a budget adequate to pursue a customer retention
campaign aimed at 2000 individuals out of a total customer base of 2
million people. Hence, the question naturally arises of how she should go
about choosing those customers to be contacted so as to optimize the
effectiveness of the campaign. In other words, how can the probability that
each single customer will discontinue the service be estimated so as to
target the best group of customers and thus reduce churning and maximize
customer retention? By knowing these probabilities, the target group can
be chosen as the 2000 people having the highest churn likelihood among
the customers of high business value. Without the support of advanced
mathematical models and data mining techniques, it would be arduous to
derive a reliable estimate of the churn probability and to determine the
best recipients of a specific marketing campaign.
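To make the targeting step concrete, the selection of the group can be sketched in a few lines of Python. This is only an illustration added here, assuming scikit-learn, a historical file of customers with known outcomes, and hypothetical column names; none of these appear in the original example.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical customers with a known 0/1 'churned' outcome (hypothetical file)
history = pd.read_csv("churn_history.csv")
# Current customer base to be scored (hypothetical file)
current = pd.read_csv("active_customers.csv")

features = ["monthly_minutes", "complaints", "months_of_service"]  # illustrative names
model = LogisticRegression(max_iter=1000)
model.fit(history[features], history["churned"])

# Estimated probability that each active customer will discontinue the service
current["churn_score"] = model.predict_proba(current[features])[:, 1]

# Target group: the 2000 customers with the highest churn likelihood
target_group = current.nlargest(2000, "churn_score")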
Example 1.2 – Logistics planning. The logistics manager of a
manufacturing company wishes to develop a medium-term logistic-
production plan. This is a decision-making process of high complexity
which includes, among other choices, the allocation of the demand
originating from different market areas to the production sites, the
procurement of raw materials and purchased parts from suppliers, the
production planning of the plants and the distribution of end products to
market areas. In a typical manufacturing company this could well entail
tens of facilities, hundreds of suppliers, and thousands of finished goods
and components, over a time span of one year divided into weeks. The
magnitude and complexity of the problem suggest that advanced
optimization models are required to devise the best logistic plan.
Optimization models allow highly complex and large-scale problems to be
tackled successfully within a business intelligence framework.
1.2.3 BI ARCHITECTURE
Data sources: In a first stage, it is necessary to gather and integrate the
data stored in the various primary and secondary sources, which are
heterogeneous in origin and type. The sources consist for the most part of
data belonging to operational systems, but may also include unstructured
documents, such as emails and data received from external providers.
Generally speaking, a major effort is required to unify and integrate the
different data sources.
Data warehouses and data marts: Using extraction and transformation
tools known as extract, transform, load (ETL), the data originating from
the different sources are stored in databases intended to support business
intelligence analyses. These databases are usually referred to as data
warehouses and data marts.
Business intelligence methodologies: Data are finally extracted and used
to feed mathematical models and analysis methodologies intended to
support decision makers. In a business intelligence system, several
decision support applications may be implemented.
Several public and private enterprises and organizations have developed in
recent years formal and systematic mechanisms to gather, store and share
their wealth of knowledge, which is now perceived as an invaluable
intangible asset. The activity of providing support to knowledge workers
through the integration of decision-making processes and enabling
information technologies is usually referred to as knowledge management.
It is apparent that business intelligence and knowledge management share
some degree of similarity in their objectives. The main purpose of both
disciplines is to develop environments that can support knowledge
workers in decision-making processes and complex problem-solving
activities. To draw a boundary between the two approaches, we may
observe that knowledge management methodologies primarily focus on
the treatment of information that is usually unstructured, at times implicit,
contained mostly in documents, conversations and past experience.
Conversely, business intelligence systems are based on structured
information, most often of a quantitative nature and usually organized in a
database. However, this distinction is a somewhat fuzzy one: for example,
the ability to analyse emails and web pages through text mining methods
progressively induces business intelligence systems to deal with
unstructured information.
Optimization. By moving up one level in the pyramid we find
optimization models that allow us to determine the best solution out of a
set of alternative actions, which is usually fairly extensive and sometimes
even infinite. Example 1.3 shows a typical field of application of
optimization models. Other optimization models applied in marketing and
logistics will be described in later chapters.
Decisions. Finally, the top of the pyramid corresponds to the choice and
the actual adoption of a specific decision, and in some way represents the
natural conclusion of the decision-making process. Even when business
intelligence methodologies are available and successfully adopted, the
choice of a decision pertains to the decision makers, who may also take
advantage of informal and unstructured information available to adapt and
modify the recommendations and the conclusions achieved through the
use of mathematical models. As we progress from the bottom to the top of
the pyramid, business intelligence systems offer increasingly more
advanced support tools of an active type. Even roles and competencies
change. At the bottom, the required competencies are provided for the
most part by the information systems specialists within the organization,
usually referred to as database administrators. Analysts and experts in
mathematical and statistical models are responsible for the intermediate
phases.
Finally, the activities of decision makers responsible for the application
domain appear dominant at the top. As described above, business
intelligence systems address the needs of different types of complex
organizations, including agencies of public administration and association.
Decision. During the third phase, knowledge obtained as a result of the
insight phase is converted into decisions and subsequently into actions.
The availability of business intelligence methodologies allows the analysis
and insight phases to be executed more rapidly so that more effective and
timely decisions can be made that better suit the strategic priorities of a
given organization. This leads to an overall reduction in the execution time
of the analysis–decision–action– revision cycle, and thus to a decision-
making process of better quality.
Evaluation. Finally, the fourth phase of the business intelligence cycle
involves performance measurement and evaluation. Extensive metrics
should then be devised that are not exclusively limited to the financial
aspects but also take into account the major performance indicators
defined for the different company departments.
ENABLING FACTORS IN BUSINESS INTELLIGENCE
PROJECTS
Some factors are more critical than others to the success of a business
intelligence project: technologies, analytics and human resources.
Technologies. Hardware and software technologies are significant
enabling factors that have facilitated the development of business
intelligence systems within enterprises and complex organizations. On the
one hand, the computing capabilities of microprocessors have increased on
average by 100% every 18 months during the last two decades, and prices
have fallen. This trend has enabled the use of advanced algorithms which
are required to employ inductive learning methods and optimization
models, keeping the processing times within a reasonable range.
Moreover, it permits the adoption of state-of-the-art graphical
visualization techniques, featuring real-time animations. A further relevant
enabling factor derives from the exponential increase in the capacity of
mass storage devices, again at decreasing costs, enabling any organization
to store terabytes of data for business intelligence systems. And network
connectivity, in the form of Extranets or Intranets, has played a primary
role in the diffusion within organizations of information and knowledge
extracted from business intelligence systems. Finally, the easy integration
of hardware and software purchased by different suppliers, or developed
internally by an organization, is a further relevant factor affecting the
diffusion of data analysis tools.
Analytics. As stated above, mathematical
models and analytical methodologies play a key role in information
enhancement and knowledge extraction from the data available inside
most organizations. The mere visualization of the data according to timely
and flexible logical views, as described in further chapters, plays a
relevant role in facilitating the decision-making process, but still
represents a passive form of support. Therefore, it is necessary to apply
more advanced models of inductive learning and optimization in order to
achieve active forms of support for the decision-making process.
Human resources. The human assets of an organization are built up by the
competencies of those who operate within its boundaries, whether as
individuals or collectively. The overall knowledge possessed and shared
by these individuals constitutes the organizational culture. The ability of
knowledge workers to acquire information and then translate it into
practical actions is one of the major assets of any organization, and has a
major impact on the quality of the decision-making process. If a given
enterprise has implemented an advanced business intelligence system,
there still remains much scope to emphasize the personal skills of its
knowledge workers, who are required to perform the analyses and to
interpret the results, to work out creative solutions and to devise effective
action plans. All the available analytical tools being equal, a company
employing human resources endowed with a greater mental agility and
willing to accept changes in the decision-making style will be at an
advantage over its competitors.
1.3 BI APPLICATIONS
BALANCED SCORECARD
The balanced scorecard is the most developed method for high-level
reporting. The balanced scorecard was the first type of performance
reporting, and it was intended primarily for top-level management.
Financial information, information about customers and their perceptions
of the business, information about internal business procedures, and
measures for business improvement were the key themes of discussion.
Indicators were established for each topic to measure the relevant business
goals in an effective manner. The so-called third-generation balanced
scorecard has evolved from this basic definition. It has four main
components.
1. Destination statement: It describes the organization at present and at a
defined point in the future (midterm planning) in the four perspectives:
financial and stakeholder expectations, customer and external
relationships, process activities, and organization and culture.
2. Strategic linkage model: This topic contains strategic objectives with
respect to outcome and activities, together with hypothesized causal
relationships between these strategic objectives.
3. Definitions of strategic objectives.
4. Definitions of measures: For each strategic objective, measures are
defined, together with their targets.
FRAUD DETECTION
Another important application of data mining is fraud detection. Fraud can
occur in a variety of businesses, including telecommunications, insurance
(false claims), and banking (illegal use of credit cards and bank checks;
illegal monetary transactions).
TELECOMMUNICATION INDUSTRY
Mobile phone companies were among the first to employ learning models
and data mining techniques to assist relational marketing efforts. Customer
retention, often known as churn analysis, has been one of the key goals.
Market saturation and fierce competition have combined to create
insecurity and dissatisfaction among consumers, who can choose a
provider depending on the rates, services, and access methods that suit
them best. This phenomenon is especially important in the case of prepaid
telephone cards, which are quite popular in the mobile phone business
today since they make switching a phone service provider relatively
simple and inexpensive.
Company and objectives. A mobile phone
company wishes to model its customers’ propensity to churn, that is, a
predictive model able to associate each customer with a numerical value
(or score) that indicates their probability of discontinuing service, based
on the value of the available explanatory variables. The model should be
able to identify, based on customer characteristics, homogeneous segments
relative to the probability of churning, in order to later concentrate on
these groups, the marketing actions to be carried out for retention, thus
reducing attrition and increasing the overall effectiveness. Figure shows
the possible segments derived using a classification model, using only two
predictive attributes in order to create a two-dimensional chart for
illustration purposes. The segments with the highest density of churners
make it possible to identify the recipients of the marketing actions. After an
initial exploratory data analysis, the decision is made to develop more than
one predictive model, determining a priori some market macro-segments
that appear heterogeneous among themselves, so as to obtain more accurate
models instead of a single model related to the entire customer base.
methods confirms the appropriateness of the segments considered, and
leads to the subdivision of customers into groups based on the following
dimensions:
• customer type (business or private);
• telephone card type (subscription or prepaid);
• years of service provision, whether above or below a given threshold;
• area of residence.
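One way to sketch this segment-by-segment modelling in Python, using pandas and scikit-learn (an assumption of this illustration; the file and column names below are hypothetical):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("telco_customers.csv")        # hypothetical extract of the data mart
features = ["outgoing_traffic", "assistance_calls", "months_of_service"]

models = {}
# One classifier per a priori macro-segment (customer type x card type here;
# the other segmentation dimensions could be added in the same way)
for (customer_type, card_type), segment in data.groupby(["customer_type", "card_type"]):
    clf = DecisionTreeClassifier(max_depth=4)
    clf.fit(segment[features], segment["churned"])
    models[(customer_type, card_type)] = clf

Each fitted model then scores only the customers of its own segment, which is what allows more accurate predictions than a single model for the whole customer base.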
The marketing data mart provides for each customer a large amount of
data:
• personal information (socio-demographic);
• administrative and accounting information;
• incoming and outgoing telephone traffic, subdivided by period (weeks or
months) and traffic direction;
• access to additional services, such as fax, voice mail, and special service
numbers;
• calls to customer assistance centres;
• notifications of malfunctioning and disservice;
• emails and requests through web pages.
BANKING AND FINANCE
1. FRAUD ANALYSIS
According to a global banking survey published by KPMG, financial
frauds have increased both in volume & value. This has made fraud
detection and prevention the top priority of every bank.
Beyond delivering a faster, safer, and convenient experience, the solution
also:
● Simplified payment processing
● Triggered warning signals to take preventive action
● Tracked in-process transactions in real-time
● Blocked fraudulent credit cards & payments in real-time
2. CROSS-SELLING
BI solutions help conduct a win-loss data analysis to predict acceptance
rates for upcoming cross-selling initiatives. An Asia-based financial
services institution wanted to increase its revenue through cross-selling
insurance policies. For this, they needed an intelligent solution that could:
● Analyse the CRM data
● Uncover customer trends
● Identify the customers that are most likely to convert based on their
purchase history of other products.
The Business Intelligence & Analytics help desk created for the customer
helps:
● Generate excel-based analytics reports
● Identify potential customers most likely to convert based on their
buying behaviour and profile
It helped increase the conversion rate and revenues while reducing the
incurred costs on expensive statistical tools.
1.5 UNIT END EXERCISES
1) What are effective and timely decisions in business intelligence?
2) Explain Data, Information and knowledge in detail.
3) Explain Business Intelligence architecture in detail.
4) Explain BI applications in detail.
Module II: Prediction Methods and
Mathematical Method
2
DATA PREPARATION
Unit Structure
2.1 Data preparation
2.2 Prediction methods
2.3 Mathematical methods
2.4 Distance methods
2.5 Logic method
2.6 Heuristic method
combined. Correcting data errors, validating data quality and consolidating
data sets are big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that
analytics applications deliver meaningful information and actionable
insights for business decision-making. The data often is enriched and
optimized to make it more informative and useful -- for example, by
blending internal and external data sets, creating new data fields,
eliminating outlier values and addressing imbalanced data sets that could
skew analytics results.
In addition, BI and data management teams use the data preparation
process to curate data sets for business users to analyse. Doing so helps
streamline and guide self-service BI applications for business analysts,
executives and workers.
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the
data preparation steps listed by different data professionals and software
vendors, but the process typically involves the following tasks:
1. Data collection. Relevant data is gathered from operational systems,
data warehouses, data lakes and other data sources. During this step,
data scientists, members of the BI team, other data professionals and
end users who collect data should confirm that it's a good fit for the
objectives of the planned analytics applications.
2. Data discovery and profiling. The next step is to explore the
collected data to better understand what it contains and what needs to
be done to prepare it for the intended uses. To help with that, data
profiling identifies patterns, relationships and other attributes in the
data, as well as inconsistencies, anomalies, missing values and other
issues so they can be addressed.
3. Data cleansing. Next, the identified data errors and issues are
corrected to create complete and accurate data sets. For example, as
part of cleansing data sets, faulty data is removed or fixed, missing
values are filled in and inconsistent entries are harmonized.
4. Data structuring. At this point, the data needs to be modelled and
organized to meet the analytics requirements. For example, data stored
in comma-separated values (CSV) files or other file formats has to be
converted into tables to make it accessible to BI and analytics tools.
5. Data transformation and enrichment. In addition to being
structured, the data typically must be transformed into a unified and
usable format. For example, data transformation may involve creating
new fields or columns that aggregate values from existing ones. Data
enrichment further enhances and optimizes data sets as needed,
through measures such as augmenting and adding data.
6. Data validation and publishing. In this last step, automated routines
are run against the data to validate its consistency, completeness and
accuracy. The prepared data is then stored in a data warehouse, a data
lake or another repository and either used directly by whoever
prepared it or made available for other users to access.
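As a minimal illustration of these steps (not part of the original text), the sketch below runs a small pandas pipeline over a hypothetical sales extract; file names, column names and rules are invented for the example:

import pandas as pd

# 1-2. Collect and profile
df = pd.read_csv("sales_extract.csv")
print(df.describe(include="all"))              # quick profile of values and ranges

# 3. Cleanse: drop duplicates, harmonize entries, fill missing values
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.upper()
df["quantity"] = df["quantity"].fillna(0)

# 4-5. Structure and enrich: derive a new aggregate field
df["revenue"] = df["quantity"] * df["unit_price"]

# 6. Validate and publish
assert df["revenue"].ge(0).all(), "negative revenue found"
df.to_csv("prepared_sales.csv", index=False)   # hand off to the analytics repository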
Data preparation can also incorporate or feed into data curation work that
creates and oversees ready-to-use data sets for BI and analytics. Data
curation involves tasks such as indexing, cataloging and maintaining data
sets and their associated metadata to help users find and access the data. In
some organizations, data curator is a formal role that works
collaboratively with data scientists, business analysts, other users and the
IT and data management teams. In others, data may be curated by data
stewards, data engineers, database administrators or data scientists and
business users themselves.
These issues complicate the process of preparing data for BI and analytics
applications.
2.2 PREDICTION METHODS
What is Prediction in Data Mining?
To find a numerical output, prediction is used. The training dataset
contains the inputs and numerical output values. According to the training
dataset, the algorithm generates a model or predictor. When fresh data is
provided, the model should find a numerical output. This approach, unlike
classification, does not have a class label. A continuous-valued function or
ordered value is predicted by the model.
In most cases, regression is utilized to make predictions. For example:
Predicting the worth of a home based on facts like the number of rooms,
total area, and so on.
Consider the following scenario: A marketing manager needs to forecast
how much a specific consumer will spend during a sale. In this scenario,
we need to forecast a numerical value. In this situation, a model or
predictor that forecasts a continuous or ordered value function will be
built.
Prediction Issues:
Preparing the data for prediction is the most pressing challenge. The
following activities are involved in data preparation:
● Data Cleaning: Cleaning data includes reducing noise and treating
missing values. Smoothing techniques remove noise, and the problem
of missing values is solved by replacing a missing value with the most
often occurring value for that characteristic.
● Relevance Analysis: The irrelevant attributes may also be present in
the database. The correlation analysis method is used to determine
whether two attributes are connected.
● Data Transformation and Reduction: Any of the methods listed
below can be used to transform the data.
● Normalization: Normalization is used to transform the data.
Normalization is the process of scaling all values for a given attribute
so that they lie within a narrow range. When neural networks or
methods requiring measurements are utilized in the learning process,
normalization is performed.
● Generalization: The data can also be modified by rolling it up to a
higher-level concept. We can use concept hierarchies for this.
Other data reduction techniques include wavelet processing, binning,
histogram analysis, and clustering.
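For example, min-max normalization rescales every attribute into the range [0, 1]. A small sketch in Python using NumPy (the figures are made up for illustration):

import numpy as np

X = np.array([[120.0, 3.0],
              [ 80.0, 1.0],
              [200.0, 4.0]])

X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every column now lies in [0, 1]
print(X_norm)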
Prediction Methods
3. Regression Tree
A Regression tree may be considered a variant of a decision tree, designed
to approximate real-valued functions instead of being used for
classification methods. As with all regression techniques, XLMiner
assumes the existence of a single output (response) variable and one or
more input (predictor) variables. The output variable is numerical. The
general regression tree building methodology allows input variables to be
a mixture of continuous and categorical variables. A decision tree is
generated when each decision node in the tree contains a test on some
input variable's value. The terminal nodes of the tree contain the predicted
output variable values.
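The text above refers to XLMiner; the same idea can be sketched with scikit-learn's DecisionTreeRegressor instead (the toy data are invented for illustration):

from sklearn.tree import DecisionTreeRegressor

X = [[1200, 2], [1500, 3], [900, 1], [2000, 4]]   # area, number of rooms
y = [200, 260, 150, 340]                          # house prices (in thousands)

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)                                    # decision nodes test input variable values
print(tree.predict([[1400, 3]]))                  # predicted numerical output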
4. Neural Network
Artificial neural networks are based on the operation and structure of the
human brain. These networks process one record at a time and “learn” by
comparing their prediction of each record (which at the beginning is largely
arbitrary) with the known actual value of the response variable. Errors
from the initial prediction of the first records are fed back into the network
and used to modify the network's algorithm the second time around. This
continues for many, many iterations.
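A minimal sketch of this iterative error-correction idea, using scikit-learn's MLPRegressor (an assumption of this example, not named in the text); in practice the inputs would be normalized first, as discussed above:

from sklearn.neural_network import MLPRegressor

X = [[0.50, 0.33], [0.67, 0.67], [0.17, 0.00], [1.00, 1.00]]  # already normalized inputs
y = [200, 260, 150, 340]

net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
net.fit(X, y)        # prediction errors are fed back to adjust the weights at each iteration
print(net.predict([[0.60, 0.67]]))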
2.3 MATHEMATICAL METHODS
1. Decision-Making
Making decisions is a crucial activity for businesses. It often involves
multiple participants with conflicting views. Decision-making
mathematical models can be of great use here. Such models use input
variables and a set of conditions to be fulfilled to help management arrive
at a decision.
One of the most common decision-making problems faced by any
business is the investment decision, where it must decide whether to
invest its money in a project or not. Businesses often use mathematical
models that assess the potential valuation of the project against the
investment to be made for making such decisions. Examples of such
models are net present value (NPV), internal rate of return (IRR), etc. A
simple NPV model can be illustrated as:
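The original illustration is not reproduced here; a minimal sketch with assumed figures is NPV = sum over periods t of CF_t / (1 + r)^t, for example:

# Invest 1000 now, receive 450 per year for 3 years, discount rate 10% (assumed figures)
cash_flows = [-1000, 450, 450, 450]
rate = 0.10
npv = sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))
print(round(npv, 2))   # a positive NPV suggests the project adds value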
2. Making Predictions
Often businesses have the requirement of predicting certain factors, such
as revenue, growth rate, costs, etc. These are usually used in case of new
product launch, change in strategy, investment needs, expansion projects,
etc. In such cases, predictive mathematical models are used that analyze
historical data and use probability distribution as input for predicting the
future values.
Regression analysis involves examining both dependent and
independent variables present in a given situation and then determining the
level of correlation they have with one another. Because of this, it is one of
the most commonly used techniques for predictive models.
Let's understand more with the help of an example of a company that is
conceptualizing a new product for kids aged 8-12 years. Before the
production, the company would want to understand the potential demand
for the product. In doing so, it will require understanding the interests of
their target audience. Then, with the help of a predictive model, they can
forecast the demand in the future.
3. Optimizing
Businesses often need to optimize certain variables to control costs and
ensure maximum efficiency. Such variables might include capacity
planning, human resources planning, space planning, route planning,
etc. Optimization mathematical models are typically used for such
problems. These models often maximize or minimize a quantity by
making changes in another variable or a set of variables.
In devising a pricing strategy, price optimization models are commonly
used to analyze demand of a product at different price points to calculate
profits. The goal of the model is to maximize profits by optimizing prices.
Thus, a company can determine a price level that achieves maximum
profit.
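A toy sketch of such a price optimization follows; the linear demand curve and the unit cost are assumptions made only for this example:

import numpy as np

prices = np.linspace(10, 50, 401)       # candidate price points
demand = 1000 - 20 * prices             # assumed linear demand curve
profit = (prices - 10) * demand         # unit cost assumed to be 10

best_price = prices[np.argmax(profit)]
print(best_price, profit.max())         # profit-maximizing price level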
2.4 DISTANCE METHODS
Clustering consists of grouping certain objects that are similar to each
other, it can be used to decide if two items are similar or dissimilar in their
properties.
In a Data Mining sense, the similarity measure is a distance with
dimensions describing object features. That means if the distance among
two data points is small then there is a high degree of similarity among
the objects and vice versa. The similarity is subjective and depends
heavily on the context and application. For example, similarity among
vegetables can be determined from their taste, size, colour etc.
Most clustering approaches use distance measures to assess the similarities
or differences between a pair of objects, the most popular distance
measures used are:
1. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with
geometry. It can be simply explained as the ordinary distance between
two points. It is one of the most used algorithms in the cluster analysis.
One of the algorithms that use this formula would be K-mean.
Mathematically, it computes the square root of the sum of squared
differences between the coordinates of two objects.
2. Manhattan Distance:
This determines the sum of the absolute differences between the pairs of
coordinates. Suppose we have two points P and Q. To determine the
distance between these points, we sum the absolute differences of their
coordinates along the X-axis and Y-axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2).
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
Here the total distance of the Red line gives the Manhattan distance
between both the points.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of
their intersection divided by the size of their union; the Jaccard distance is
one minus the Jaccard index.
4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan Distance
Measure. In an N-dimensional space, a point is represented as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given by
D(P1, P2) = (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)
● When p = 2, Minkowski distance is same as the Euclidean distance.
● When p = 1, Minkowski distance is same as the Manhattan
distance.
5. Cosine Index:
Cosine distance measure for clustering determines the cosine of the angle
between two vectors.
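The measures above can be sketched in a few lines of Python using NumPy (the points and sets are arbitrary examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean  = np.sqrt(np.sum((a - b) ** 2))
manhattan  = np.sum(np.abs(a - b))
minkowski3 = np.sum(np.abs(a - b) ** 3) ** (1 / 3)          # Minkowski with p = 3
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A, B = {1, 2, 3, 4}, {3, 4, 5}
jaccard_index = len(A & B) / len(A | B)                      # 2 / 5 = 0.4

print(euclidean, manhattan, minkowski3, cosine_sim, jaccard_index)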
2.5 LOGIC METHOD
Logic Etymology
The word logic stems from the Greek word logike and or logos which
translates to reason. While many versions of the word have existed over
time, from Latin to Middle English, the first known use of the word was in
the 12th century to define a scientific set of principles. In the 14th century,
the word's definition grew to encompass the idea of true and false thinking
in terms of reasoning. Today, logic is connected to reasoning in forms of
nuance found in argumentation, math, symbolism, and much more.
Logic Examples and Concepts
Since logic is dependent upon reason, emotions are removed from this
practice, which means the concept of logic relies solely on given data and
valid correlations based on the governing principles presented. The goal of
logic is to find reasonable conclusions based on the given information, but
to make those conclusions, the person in question must ensure they are
making valid arguments.
Types of Logic
There are many types of logic located within the governing science. The
four main logic types are:
● Informal logic
● Formal logic
● Symbolic logic
● Mathematical logic
Read on to learn about each logic type and gain a better understanding
through definitions and examples.
Informal Logic
Most people use informal logic every day, as it's how we reason and form
argumentation in the moment. For example, arguing with a friend about if
Rachel and Ross were on a break in the TV show Friends would result in
the use of informal logic. On the show, the couple decided to take time
away from each other, and in that time, Ross slept with another woman.
Ross argues they were on a break, and Rachel argues they weren't. For this
argument, each person uses the information presented and creates their
conclusion based on their understanding of the word ''break''.
Informal logic consists of two types of reasoning to make arguments:
● Deductive reasoning: Uses information from various sources and
applies that information to the argument at hand to support a larger,
generalized conclusion
● Inductive reasoning: Uses the specific information given to form a
generalized conclusion
In the Friends example, the arguing friends would use inductive
reasoning, since they are only using the evidence given from the one
source (the TV show). They would look at the episode before and after
Ross' actions to determine if the couple was, in fact, on a break. To use
deductive reasoning, the arguing friends would look into more examples
of infidelity and might even define the word ''break'' in terms of various
definitions. Inductive reasoning uses a smaller source pool and focuses on
Ross and Rachel. Deductive reasoning would center on the concept of
cheating and the notion behind the word ''break'' pulling from multiple
sources until a larger conclusion about cheating is created.
Formal Logic
Formal logic uses deductive reasoning in conjunction with syllogisms and
mathematical symbols to infer if a conclusion is valid. In formal logic, a
person looks to ensure the premises made about a topic logically connects
to the conclusion.
A common example of formal logic is the use of a syllogism to explain
those connections. A syllogism is a form of reasoning which draws
conclusions based on two given premises. In each syllogism, there are two
premises and one conclusion that is drawn based on the given information.
The most famous example is about Socrates.
Premise A: Socrates is a man.
Premise B: All men are mortal.
Conclusion C: Therefore, Socrates is mortal.
2.6 HEURISTIC METHODS
The heuristic method refers to finding the best possible solution to a
problem quickly, effectively, and efficiently. The word heuristic is derived
from an ancient Greek word, 'eurisko.' It means to find, discover, or
search. It is a practical method of mental shortcut for problem-solving and
decision making that reduces the cognitive load and doesn't require to be
perfect. The method is helpful in getting a satisfactory solution to a much
larger problem within a limited time frame.
The trial and error heuristic is the most fundamental heuristic. It can be
applied in all situations, from matching nuts and bolts to finding the
answer related to algebra. Some common heuristics used to solve
metamaterial problems are visual representation, forward/backward
reasoning, additional assumptions, and simplification.
Advantages of Heuristic
The heuristic method is one of the best ways to solve a problem and make
decisions. The method provides a quick solution with the help of mental
tricks. Some advantages of the heuristic method are given below:
o Attribute Substitution: In place of a more complex and difficult
question, one can opt for a simpler question related to the original one.
This technique of attribute substitution makes the method more
beneficial.
o Effort Reduction: The heuristic method reduces the mental efforts
required to solve a problem by making different choices and decisions. It
makes the method one of the most effective ways to find solutions to
many time-consuming problems.
o Fast and Frugal: With the help of a heuristic method, the problems
can be solved within a limited time, and the best & accurate answer can be
obtained.
Disadvantages of Heuristic Method
While the heuristic method helps us reach a quick and effective solution
or decision for a problem, it can also lead to errors and biased decisions in
some situations. Certain disadvantages of the heuristic method are given
below:
o Inaccurate Decision: It is not true that the heuristic method always
provides an accurate answer or decision to a problem. Sometimes, the
method yields an inaccurate solution or a judgment based on how readily
things come to mind and how representative they appear to be. This can be
easily understood from examples of manipulated decision-making.
o Bias Decision: There is no guarantee that a decision or a solution
that was effective in a past situation will work in other situations, or even
in the same situation again. If a person always relies on the same
heuristic, it can become difficult to see other, better alternative solutions.
o Reduces Creativity: If a person always relies on the same decision,
it can also reduce his/her creativity and decision-making & problem-
solving ability. It does not allow the person to come up with new ideas and
judgments.
o Stereotypes and Prejudice: The method can also give rise to
stereotypes and prejudice. When a person classifies and categorizes other
people using mental shortcuts, he/she can miss more relevant and
informative details. Such conditions may create stereotyped and
prejudiced categorizations of people and decisions that do not match the
real conditions.
o Second Principle - Making a Plan: A problem can be solved in
many different ways. The second principle says that it is necessary to
find the best way that can be used to reach the solution to the given
problem. For this purpose, the right strategy is first to find out what is
required. Working backward can help with this: people assume they
already have the solution and reason back from it to the starting point of
the problem.
It also helps in making an overview of the possibilities, removing the less
efficient immediately, comparing all the remaining possibilities, or
applying symmetry. This improves the judgment ability as well as the
creativity of a person.
o Third Principle - Implementing the Plan: After making the proper
strategy, the plan can be implemented. However, for this, it is necessary to
be patient and give the required time to solve the problem. Because
implementing the plan is tougher than making a plan. If the plan does not
provide any solution or does not stand as per the expectations, then it is
advised to repeat the second principle in a better way.
o Fourth Principle - Evaluation and Adaptation: This principle
checks whether things have gone according to plan; in other words, the
planned way is compared with the expected standard. If things have gone
well, the chosen way of solving the problem is retained. Some plans may
work while others may not, so after a proper evaluation the most
appropriate way can be adapted to solve the main problem.
helps in getting the best way to solve the problem and getting a successful
result.
o Local Search Method: In this method, the most feasible way of
solving a problem is searched and used. Continuous improvement is made
in the method during the solving process, and when there is no more scope
for improvement, the method gets to the end, and the final result is the
answer to the problem.
3
OPTIMIZATION TECHNIQUES
Unit Structure
3.1 Introduction
3.2 Local Optimization Technique
3.3 Stochastic hill climber
3.4 Evaluation of models
3.1 INTRODUCTION
The field of data mining increasingly adapts methods and algorithms from
advanced matrix computations, graph theory and optimization. In these
methods, the data is described using matrix representations (graphs are
represented by their adjacency matrices) and the data mining problem is
formulated as an optimization problem with matrix variables. With these,
the data mining task becomes a process of minimizing or maximizing a
desired objective function of matrix variables.
Prominent examples include spectral clustering, matrix factorization,
tensor analysis, and regularizations. These matrix-formulated
optimization-centric methodologies are rapidly evolving into a popular
research area for solving challenging data mining problems. These
methods are amenable to rigorous analysis and benefit from the well-
established knowledge in linear algebra, graph theory, and optimization
accumulated through centuries. They are also simple to implement and
easy to understand, in comparison with probabilistic, information-
theoretic, and other methods. In addition, they are well-suited to parallel
and distributed processing for solving large scale problems.
None, One, or Many Objectives
Although most optimization problems have a single objective function,
there have been peculiar cases when optimization problems have either -
no objective function or multiple objective functions. Multi-objective
optimization problems arise in streams such as engineering, economics,
and logistics. Often, problems with multiple objectives are reformulated as
single-objective problems.
Figure: a linear optimization problem whose constraint set is bounded by
the five lines x1 = 0, x2 = 0, x1 = 8, x2 = 5, and x1 + x2 = 10. These
enclose an infinite number of points that represent feasible solutions.
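This constraint set can be explored with a small linear program. The objective function below (maximize 3*x1 + 2*x2) is an assumption added for illustration, since the figure only specifies the constraints:

from scipy.optimize import linprog

c = [-3, -2]                     # linprog minimizes, so negate to maximize 3*x1 + 2*x2
A_ub = [[1, 1]]                  # x1 + x2 <= 10
b_ub = [10]
bounds = [(0, 8), (0, 5)]        # 0 <= x1 <= 8, 0 <= x2 <= 5

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)           # optimal corner of the feasible region and its value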
2. Non-linear programming
Although the linear programming model works fine for many situations,
some problems cannot be modeled accurately without including nonlinear
components. One example would be the isoperimetric problem: determine
the shape of the closed plane curve having a given length and enclosing
the maximum area. The solution, but not a proof, was known by Pappus of
Alexandria c. 340 CE:
Bees, then, know just this fact which is useful to them, that the hexagon is
greater than the square and the triangle and will hold more honey for the
same expenditure of material in constructing each. But we, claiming a
greater share of wisdom than the bees, will investigate a somewhat wider
problem, namely that, of all equilateral and equiangular plane figures
having the same perimeter, that which has the greater number of angles is
always greater, and the greatest of them all is the circle having its perimeter
equal to them.
State Space Diagram – Hill Climbing in Artificial Intelligence
Local Maxima/Minima: A local maximum is a state which is better than its
neighbouring states; however, it is not the best possible state, as there
exists a state where the objective function value is higher
Global Maxima/Minima: It is the best possible state in the state diagram.
Here the value of the objective function is highest
Current State: Current State is the state where the agent is present
currently
Flat Local Maximum: This region is depicted by a straight line where all
neighbouring states have the same value so every node is local maximum
over the region
1. Local Maximum
The algorithm terminates when the current node is local maximum as it is
better than its neighbours. However, there exists a global maximum where
objective function value is higher
Solution: Backtracking can mitigate the problem of local maxima, as it
starts exploring alternate paths when it encounters a local maximum
2. Ridge
Ridge occurs when there are multiple peaks and all have the same value or
in other words, there are multiple local maxima which are same as global
maxima
3. Plateau
Plateau is the region where all the neighbouring nodes have the same
value of objective function so the algorithm finds it hard to select an
appropriate direction.
Algorithm
1. Examine the current state, Return success if it is a goal state
2. Continue the Loop until a new solution is found or no operators are left
to apply
3. Apply the operator to the node in the current state
4. Check for the new state
If Current State = Goal State, Return success and exit
Else if New state is better than current state then Goto New state
Else return to step 2
5. Exit
Algorithm
1. Examine the current state, Return success if it is a goal state
2. Continue the Loop until a new solution is found or no operators are left
to apply
Let 'Temp' be a state such that any successor of the current state will be
better than 'Temp'. For all operators that can be applied to the current state:
Apply the operator to create a new state
Examine the new state
If New State = Goal State, Return success and exit
Else if New state is better than Temp then set this state as Temp
If Temp is better than Current State set Current State to Temp
3. Stochastic Hill Climbing
Stochastic Hill Climbing doesn’t look at all its neighboring nodes to check
if it is better than the current node instead, it randomly selects one
neighboring node, and based on the pre-defined criteria it decides whether
to go to the neighboring node or select an alternate node.
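A minimal stochastic hill-climbing sketch in Python; the objective f(x) = -(x - 3)^2, the integer state space and the acceptance criterion are assumptions made for illustration:

import random

def f(x):
    return -(x - 3) ** 2          # objective to maximize; global maximum at x = 3

current = 10
for _ in range(1000):
    neighbour = current + random.choice([-1, 1])   # pick one neighbour at random
    if f(neighbour) > f(current):                  # a simple pre-defined acceptance criterion
        current = neighbour

print(current, f(current))        # ends at (or near) the maximum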
Advantages of Hill Climbing Algorithm in Artificial Intelligence
The advantages of the Hill Climbing algorithm are given below:
Hill Climbing is very useful in routing-related problems like Travelling
Salesmen Problem, Job Scheduling, Chip Designing, and Portfolio
Management
It is good in solving the optimization problem while using only limited
computation power
It is more efficient than other search algorithms
Hill Climbing Algorithm is a very widely used algorithm for Optimization
related problems as it gives decent solutions to computationally
challenging problems. It has certain drawbacks associated with it like its
Local Minima, Ridge, and Plateau problem which can be solved by using
some advanced algorithms.
3.4 EVALUATION OF MODELS
What is Model Evaluation?
Model evaluation is the process of using different evaluation metrics to
understand a machine learning model’s performance, as well as its
strengths and weaknesses. Model evaluation is important to assess the
efficacy of a model during initial research phases, and it also plays a role
in model monitoring.
To understand if your model(s) is working well with new data, you can
leverage a number of evaluation metrics.
Classification
The most popular metrics for measuring classification performance
include accuracy, precision, confusion matrix, log-loss, and AUC (area
under the ROC curve).
● Accuracy measures how often the classifier makes the correct
predictions, as it is the ratio between the number of correct predictions and
the total number of predictions.
● Precision measures the proportion of predicted Positives that are
truly Positive. Precision is a good choice of evaluation metrics when you
want to be very sure of your prediction. For example, if you are building a
system to predict whether to decrease the credit limit on a particular
account, you want to be very sure about the prediction or it may result in
customer dissatisfaction.
● The confusion matrix (or confusion table) shows a more detailed
breakdown of correct and incorrect classifications for each class. Using a
confusion matrix is useful when you want to understand the distinction
between classes, particularly when the cost of misclassification might
differ for the two classes, or you have a lot more test data on one class
than the other. For example, the consequences of making a false positive
or false negative in a cancer diagnosis are very different.
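These metrics can be computed directly, for instance with scikit-learn (the labels below are made up for illustration):

from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))     # correct predictions / all predictions
print(precision_score(y_true, y_pred))    # true positives / predicted positives
print(confusion_matrix(y_true, y_pred))   # per-class breakdown of correct and incorrect predictions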
Module III
4
BI USING DATA WAREHOUSING
4.1 Introduction to DW
4.2 DW architecture
4.3 ETL Process
4.4 Data Warehouse Design
4.1 INTRODUCTION TO DW
A Data Warehouse (DW) is an environment that is maintained separately
from the organization's operational database. Its architectural construct
provides users with current and historical decision support information
which is not possible in the traditional operational data store. A DW
provides a new design which helps reduce response time and enhances the
performance of queries for reports and analytics.
Data warehouse system is also known by the following name:
❖ Decision Support System (DSS)
❖ Executive Information System
❖ Management Information System
❖ Business Intelligence Solution
❖ Analytic Application
❖ Data Warehouse
History of Data Warehouse
The need to warehouse data arose from the need to handle increasing
amounts of information.
3. Data Mart:
A data mart, a subset of the DW, is designed for a particular line of
business, such as sales or finance.
❖ Query Manager
❖ End-user access tools:
These are categorized into five different groups: 1. Data reporting tools,
2. Query tools, 3. Application development tools, 4. EIS tools, 5. OLAP
tools and data mining tools.
The Future of Data Warehousing
❖ Change in regulatory constraints.
❖ Size of the database.
❖ Multimedia data.
2. Oracle:
https://www.oracle.com/index.html
3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful Datawarehouse Tools.
3. QuerySurge (Windows, Linux): It speeds up the testing process up to
1,000x while providing up to 100% data coverage. It integrates an
out-of-the-box DevOps solution for most build, ETL and QA management
software.
4.2 DW ARCHITECTURE
Business Analysis Framework
Business analysts get the information from the data warehouses to
measure performance and make critical adjustments in order to win
over other business stakeholders in the market. A data warehouse has the
following advantages:
❖ Can enhance business productivity.
❖ Helps us manage customer relationship.
❖ Brings down the costs by tracking trends, patterns over a long period
in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to
understand and analyze the business needs and construct a business
analysis framework. Views are as follows:
❖ The top-down view
❖ The data source view
❖ The data warehouse.
❖ The business query view
Data warehouses and their architectures vary depending upon the elements
of an organization's situation and are classified as:
❖ Data Warehouse Architecture: Basic
❖ Data Warehouse Architecture: With Staging Area
❖ Data Warehouse Architecture: With Staging Area and Data Marts
Fig 6: Data Warehouse Architecture with Staging Area and Data Marts (a)
The figure 6 illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple
source system), the reconciled layer and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer
sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard
reference data model for a whole enterprise. At the same time, it separates
the problems of source data extraction and integration from those of data
warehouse population.
Load Performance
Data warehouses require incremental loading of new data on a periodic
basis within short time windows; performance of the load process should
be measured in hundreds of millions of rows and gigabytes per hour and
must not artificially constrain the volume of data required by the business.
Load Processing
Many phases must be taken to load new or update data into the data
warehouse, including data conversion, filtering, reformatting, indexing,
and metadata update.
Query Performance
Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must be complete in
seconds.
Terabyte Scalability
Data warehouse sizes are growing at enormous rates. Today these range
from a few gigabytes, to hundreds of gigabytes, to terabyte-sized data
warehouses.
Types of Data Warehouses
There are different types of data warehouses, which are as follows:
Host-Based Data Warehouses
There are two types of host-based data warehouses which can be
implemented:
❖ Host-Based mainframe warehouses which reside on a high volume
database. Supported by robust and reliable high capacity structure such as
IBM system/390, UNISYS and Data General sequent systems, and
databases such as Sybase, Oracle, Informix, and DB2.
❖ Host-Based LAN data warehouses, where data delivery can be
handled either centrally or from the workgroup environment. The size of
the data warehouses of the database depends on the platform.
Data Extraction and transformation tools allow the automated extraction
and cleaning of data from production systems.
1. A huge load of complex warehousing queries would possibly have too
much of a harmful impact upon the mission-critical transaction
processing (TP)-oriented application.
2. These transaction processing systems have been developing in their
database design for transaction throughput.
3. There is no assurance that data remains consistent.
Host-Based (MVS) Data Warehouses
Data warehouses that reside on large-volume databases on MVS are the
host-based type of data warehouse. Often the DBMS is DB2, with a huge
variety of original sources for legacy information, such as VSAM, DB2,
flat files, and Information Management System (IMS).
66
❖ Impacting performance, since the user will be competing with the
production data stores.
Disadvantages
1. Queries competing with production record transactions can degrade
the performance.
2. No metadata, no summary record, or no individual DSS (Decision
Support System) integration or history.
3. No refreshing process, causing the queries to be very complex.
Step 2) Transformation
Data extracted from source server is raw and not usable in its original
form and needs to be cleansed, mapped and transformed.
Step 3) Loading
Large volume of data needs to be loaded in a relatively short period and
needs to be optimized for performance.
In case of load failure, recover mechanisms should be configured to restart
from the point of failure without data integrity loss.
Types of Loading:
❖ Initial Load — populating all the Data Warehouse tables
❖ Incremental Load — applying ongoing changes as needed, periodically.
❖ Full Refresh —erasing the contents of one or more tables and
reloading with fresh data.
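The three loading modes can be illustrated with a short, hedged sketch. This is not a production ETL tool, just a minimal Python example using the built-in sqlite3 module; the warehouse table dw_sales and its columns are assumed names used only for illustration.

import sqlite3

def initial_load(wh, staged_rows):
    # Initial Load: populate the warehouse table from scratch.
    wh.executemany("INSERT INTO dw_sales(sale_id, amount) VALUES (?, ?)", staged_rows)

def incremental_load(wh, staged_rows, last_loaded_id):
    # Incremental Load: apply only the rows that arrived after the last load.
    new_rows = [r for r in staged_rows if r[0] > last_loaded_id]
    wh.executemany("INSERT INTO dw_sales(sale_id, amount) VALUES (?, ?)", new_rows)

def full_refresh(wh, staged_rows):
    # Full Refresh: erase the table contents and reload with fresh data.
    wh.execute("DELETE FROM dw_sales")
    wh.executemany("INSERT INTO dw_sales(sale_id, amount) VALUES (?, ?)", staged_rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dw_sales(sale_id INTEGER PRIMARY KEY, amount REAL)")
    initial_load(conn, [(1, 10.0), (2, 25.5)])
    incremental_load(conn, [(1, 10.0), (2, 25.5), (3, 7.0)], last_loaded_id=2)
    print(conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone())  # (3,)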
ETL Tools
Prominent data warehousing tools available in the market are:
1. MarkLogic:
https://www.marklogic.com/product/getting-started/
2. Oracle:
https://www.oracle.com/index.html
3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Weaknesses
❖ Flexibility
❖ Hardware
❖ Learning Curve
ELT (Extract, Load and Transform)
ELT (Extract, Load and Transform) is another way of looking at data
migration or movement. ELT involves extracting the data from the source
system and loading it into the target system, instead of transforming it
between the extraction and loading phases. Once the data is copied or
loaded into the target system, the transformation takes place there.
❖ Risk minimization
❖ Utilize Existing Hardware
❖ Utilize Existing Skill sets
Weaknesses
❖ Against the Norm
❖ Tools Availability
Difference between ETL vs. ELT
❖ Process: In ETL, data is transferred to the ETL server and moved back
to the DB; high network bandwidth is required. In ELT, data remains in
the DB, except for cross-database loads (e.g. source to target).
❖ Transformation: In ETL, transformations are performed in the ETL
server. In ELT, transformations are performed in the target (or in the
source).
❖ Code Usage: ETL is typically used for source-to-target transfer,
compute-intensive transformations and small amounts of data. ELT is
typically used for high amounts of data.
❖ Time/Maintenance: ETL needs high maintenance, as you need to select
the data to load and transform. ELT needs low maintenance, as data is
always available.
❖ Calculations: ETL overwrites the existing column, or the result must be
appended to the dataset and pushed to the target platform. ELT easily
adds the calculated column to the existing table.
Bottom-Up Design Approach
In "Bottom-Up" approach, a DW is described as "a copy of transaction
data specifical architecture for query and analysis," term the star schema.
In this approach, a data mart is created first to necessary reporting and
analytical capabilities for particular business processes. Data marts include
the lowest grain data and, aggregated data, if needed.
Main advantage of "bottom-up" design approach is, it has quick ROI, and
takes less time and effort than developing an enterprise-wide data
warehouse. In addition to it the risk of failure is even less. This method is
inherently incremental. This method allows the project team to learn and
grow.
Differentiate between Top-Down Design Approach and Bottom-Up
Design Approach
Top-Down Design Approach: Breaks the vast problem into smaller sub-problems.
Bottom-Up Design Approach: Solves the essential low-level problems and integrates them into a higher-level one.
5
DATA MART
Unit Structure
5.1 Data mart
5.2 OLAP
5.3 Dimensional Modeling
5.4 Operations on Data Cube
5.5 Schema
5.6 References
5.7 MOOCs
5.8 Video Lectures
5.9 Quiz
Hybrid Data Mart:
It combines input from sources apart from the data warehouse and is helpful
in integration. A hybrid data mart also supports large storage structures, and
it is best suited for flexible, smaller data-centric applications.
Constructing Data mart
This second phase of implementation involves creating the physical
database and the logical structures, and includes the following tasks:
● Implementing the physical database designed in the earlier phase:
database schema objects such as tables, indexes and views are created.
Populating:
In the third phase, data is populated into the data mart, involving the
following tasks:
❖ Data Mapping
❖ Extraction of source data
❖ Cleaning and transformation operations
❖ Loading data into the data mart
❖ Creating and storing metadata
Accessing
Accessing is the fourth step, which involves putting the data to use:
submitting queries to the database and displaying the results of the queries.
The accessing step needs to perform the following tasks:
❖ Translate database structures and object names into business terms
❖ Set up and maintain database structures
❖ Set up APIs and interfaces, if required
Managing
Managing is the last step of the data mart implementation process and
covers management tasks such as:
❖ User access management.
❖ System optimizations and fine-tuning
❖ Adding and managing fresh data into the data mart.
❖ Planning recovery scenarios and ensuring system availability in case the
system fails.
Disadvantages
● Maintenance problem.
● Data analysis is limited.
5.2 OLAP
Online Analytical Processing provides analysis of data for business
decisions and allows users to analyze database information from multiple
database systems at one time.
The primary objective of OLAP is data analysis, not data processing.
Example of OLAP
Any Data warehouse system is an OLAP system.
Uses of OLAP:
❖ A company might compare their mobile phone sales in September with
sales in October, then compare those results with another location
which may be stored in a separate database.
❖ Amazon analyzes purchases by its customers to come up with a
personalized homepage with products which likely interest to their
customer.
OLTP
Online transaction processing supports transaction-oriented applications in
a 3-tier architecture administering day to day transaction of an
organization.
OLTP vs OLAP
Dimensions: A collection of data which describes one business dimension.
Measure: A numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.
ROLAP
ROLAP servers are placed between relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMS.
ROLAP includes:
❖ Implementation of aggregation navigation logic.
❖ Optimization for each DBMS back end.
❖ Additional tools and services.
5.4 OPERATIONS ON DATA CUBE
Basic operations implemented on a data cube are:
1. Roll Up
2. Drill Down
3. Slice and Dice
4. Pivot
Roll Up: Summarizes or aggregates the dimensions either by performing
dimension reduction or by concept hierarchy.
3. Slice and Dice
Slicing picks up one dimension of the data cube and then forms a sub-cube
out of it; here, the data cube is sliced based on time.
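As a rough illustration of these operations (not taken from the text), the following Python sketch treats a small sales table held in pandas as a cube; the dimension and measure names (year, quarter, city, amount) are assumed.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Mumbai", "Chennai", "Mumbai", "Chennai"],
    "amount":  [100, 150, 120, 180],
})

# Roll-up: aggregate from (year, quarter, city) up to (year) by summing the measure.
rollup = sales.groupby("year")["amount"].sum()

# Slice: fix one dimension (time = 2022) to obtain a sub-cube.
slice_2022 = sales[sales["year"] == 2022]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["year"] == 2022) & (sales["city"] == "Mumbai")]

print(rollup)
print(slice_2022)
print(dice)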
Advantages of Data Cube
❖ Data cubes ease aggregating and summarizing the data.
❖ Data cubes provide better visualization of data.
❖ Data cubes store huge amounts of data in a very simplified way.
❖ Data cubes increase the overall efficiency of the data warehouse.
❖ The aggregated data in a data cube helps in analysing the data quickly,
thereby reducing the access time.
5.5 SCHEMA
Schema is used to define the way to organize the system with all the
database entities and their logical association.
Results:
Product_Name Quantity_Sold
Novels 12,702
DVDs 32,919
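The query behind these results is not shown in this excerpt; the sketch below is one plausible star-schema query of the same shape, using Python's sqlite3. The table names fact_sales and dim_product and the inserted rows are assumptions made only for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product(product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales(sale_id INTEGER PRIMARY KEY,
                        product_id INTEGER REFERENCES dim_product(product_id),
                        quantity INTEGER);
INSERT INTO dim_product VALUES (1, 'Novels'), (2, 'DVDs');
INSERT INTO fact_sales VALUES (1, 1, 7000), (2, 1, 5702), (3, 2, 32919);
""")

# The fact table holds the measures; the dimension table supplies descriptive names.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.quantity) AS quantity_sold
    FROM   fact_sales f
    JOIN   dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
""").fetchall()
print(rows)   # e.g. [('DVDs', 32919), ('Novels', 12702)]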
2) SnowFlake Schema
The snowflake schema is obtained by completely normalizing all the
dimension tables of a star schema.
3) Galaxy Schema
The Galaxy Schema, also known as the Fact Constellation Schema,
resembles a collection of stars. In this schema, multiple fact tables share
the same dimension tables.
4) Star Cluster Schema
A SnowFlake schema with many dimension tables may need more
complex joins while querying. A star schema with fewer dimension tables
may have more redundancy.
5.6 REFERENCES
1. Introduction to Data Warehouse System.
https://www.javatpoint.com/. [Last Accessed on 10.03.2022]
2. Introduction to Data Warehouse System. https://www.guru99.com/.
[Last Accessed on 10.03.2022]
3. Introduction to Data Warehouse System. http://www.nitjsr.ac.in/. [Last
Accessed on 10.03.2022]
4. Introduction to Data Warehouse System. https://oms.bdu.ac.in/. [Last
Accessed on 10.03.2022]
5. Data Warehouse. https://www.softwaretestinghelp.com/. [Last
Accessed on 10.03.2022]
6. Introduction to Data Warehouse System.
https://www.tutorialspoint.com/ebook/data_warehouse_tutorial/index
.asp. [Last Accessed on 10.03.2022].
7. Using ID3 Algorithm to build a Decision Tree to predict the weather.
https://iq.opengenus.org/id3-algorithm/. [Last Accessed on
10.03.2022].
8. Data Warehouse Architecture. https://binaryterms.com/data-
warehouse-architecture.html. [Last Accessed on 10.03.2022].
9. Compare-and-Contrast-Database-with-Data-Warehousing-and-Data-
Visualization . https://www.coursehero.com/file/28202760/Compare-
and-Contrast-Database-with-Data-Warehousing-and-Data-
Visualization-Databases-Assignmentdocx/. [Last Accessed on
10.03.2022].
10. Data Warehousing and data mining.
https://lastmomenttuitions.com/course/data-warehousing-and-
mining/. [Last Accessed on 10.03.2022].
11. Data Warehouse. https://one.sjsu.edu/task/all/finance-data-
warehouse. [Last Accessed on 10.03.2022].
12. DWDM Notes. https://dwdmnotes.blogspot.com. [Last Accessed on
10.03.2022].
13. Data Warehouse and Data Mart.
https://www.geeksforgeeks.org/difference-between-data-warehouse-
and-data-mart/?ref=gcse. [Last Accessed on 10.03.2022].
14. Data Warehouse System. https://www.analyticssteps.com/. [Last
Accessed on 10.03.2022].
15. Han J, Pei J, Kamber M. Data mining: concepts and techniques.
Elsevier; 2011 Jun 9.
16. CART. https://iq.opengenus.org/cart-algorithm. [Last Accessed on
10.03.2022].
17. Bhatia P. Data mining and data warehousing: principles and practical
techniques. Cambridge University Press; 2019 Jun 27.
18. Han J, Pei J, Kamber M. Data mining: concepts and techniques.
Elsevier; 2011 Jun 9.
19. Berzal F, Matín N. Data mining: concepts and techniques by Jiawei
Han and Micheline Kamber. ACM Sigmod Record. 2002 Jun 1;
31(2):66-8.
20. Gupta GK. Introduction to data mining with case studies. PHI
Learning Pvt. Ltd.; 2014 Jun 28.
21. Zhou, Zhi-Hua. "Three perspectives of data mining." (2003): 139-
146.
22. Wang J, editor. Encyclopedia of data warehousing and mining. iGi
Global; 2005 Jun 30.
23. Pujari AK. Data mining techniques. Universities press; 2001.
5.7 MOOCS
1. Data Warehousing for Business Intelligence Specialization.
https://www.coursera.org/specializations/data-warehousing.
2. Data Mining.
https://onlinecourses.swayam2.ac.in/cec20_cs12/preview.
3. Data Warehouse Concepts, Design, and Data Integration.
https://www.coursera.org/learn/dwdesign.
4. Data Warehouse Courses. https://www.edx.org/learn/data-warehouse.
5. BI Foundations with SQL, ETL and Data Warehousing
Specialization. https://www.coursera.org/specializations/bi-
foundations-sql-etl-data-warehouse.
6. Fundamentals of Data Warehousing. https://www.mooc-
list.com/initiative/coursera.
7. Foundations for Big Data Analysis with SQL.
https://www.coursera.org/learn/foundations-big-data-analysis-sql.
5.9 QUIZ
1. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing
Answer: a
2. Data that can be modeled as dimension attributes and measure attributes
are called _______ data.
a) Multidimensional
b) Singledimensional
c) Measured
d) Dimensional
Answer: a
3. The generalization of cross-tab which is represented visually is
____________ which is also called as data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid
Answer: a
4. The process of viewing the cross-tab (Single dimensional) with a fixed
value of one attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing
Answer: a
5. The operation of moving from finer-granularity data to a coarser
granularity (by means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting
Answer: a
6. In SQL the cross-tabs are created using
a) Slice
b) Dice
c) Pivot
d) All of the mentioned
Answer: a
7. { (item name, color, clothes size), (item name, color), (item name, clothes
size), (color, clothes size), (item name), (color), (clothes size), ( ) }
This can be achieved by using which of the following?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned
Answer: d
8. What do data warehouses support?
a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a
9. SELECT item name, color, clothes SIZE, SUM(quantity)
FROM sales
GROUP BY rollup (item name, color, clothes SIZE);
How many groupings are possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b
10. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d
11. What is the full form of OLAP?
a) Online Application Programming
b) Online Application Processing
c) Online Analytical programming
d) Online Analytical Processing
Answer: d
12. Data that can be modelled as dimension attributes and measure
attributes are called ___________
a) Mono-dimensional data
b) Multi-dimensional data
c) Measurable data
d) Efficient data
Answer: b
13. The operation of changing the dimensions used in a cross-tab is called
________
a) Alteration
b) Pivoting
c) Piloting
d) Renewing
Answer: b
14. The operation of moving from finer granular data to coarser granular
data is called _______
a) Reduction
b) Increment
c) Roll up
d) Drill down
Answer: c
15. How many dimensions of multi-dimensional data do cross tabs enable
analysts to view?
a) 1
b) 2
c) 3
d) None of the mentioned
Answer: b
16. The _______ function allows substitution of values in an attribute of a
tuple
a) Cube
b) Unknown
c) Decode
d) Substitute
Answer: c
17. Which of the following OLAP systems do not exist?
a) HOLAP
b) MOLAP
c) ROLAP
d) None of the mentioned
Answer: d
18. State true or false: OLAP systems can be implemented as client-server
systems
a) True
b) False
Answer: a
19. The operation of moving from coarser granular data to finer granular
data is called _______
a) Reduction
b) Increment
c) Roll back
d) Drill down
Answer: d
20. State true or false: In OLAP, analysts cannot view a dimension in
different levels of detail.
a) True
b) False
Answer: b
21. What is a Star Schema?
a) a star schema consists of a fact table with a single table for each
dimension
b) a star schema is a type of database system
c) a star schema is used when exporting data from the database
d) none of these
Answer: A
22. What is the type of relationship in star schema?
a) many-to-many.
b) one-to-one
c) many-to-one
d) one-to-many
Answer: D
23. Fact tables are _______.
a) completely denormalized.
b) partially denormalized.
c) completely normalized.
d) partially normalized.
Answer: C
24. Data warehouse is volatile, because obsolete data are discarded
a) TRUE
b) FALSE
Answer: B
25. Which is NOT a basic conceptual schema in Data Modeling of Data
Warehouses?
a) Star schema
b) Tree schema
c) Snowflake schema
d) Fact constellations
Answer: B
26. Which is NOT a valid OLAP Rule by E.F.Codd?
a) Accessibility
b) Transparency
c) Flexible reporting
d) Reliability
Answer: D
27. Which is NOT a valid layer in Three-layer Data Warehouse
Architecture in Conceptual View?
a) Processed data layer
b) Real-time data layer
c) Derived data layer
d) Reconciled data layer
Answer: A
28. Among the types of fact tables which is not a correct type ?
a) Fact-less fact table
b) Transaction fact tables
c) Integration fact tables
d) Aggregate fact tables
Answer: C
29. Among the followings which is not a characteristic of Data
Warehouse?
a) Integrated
b) Volatile
c) Time-variant
d) Subject oriented
Answer: B
30. Which of the following is not considered as an issue in data warehousing?
a) optimization
b) data transformation
c) extraction
d) inter mediation
Answer: D
31. Which one is NOT considering as a standard query technique?
a) Drill-up
b) Drill-across
c) DSS
d) Pivoting
Answer: C
32. Among the following which is not a type of business data ?
a) Real time data
b) Application data
c) Reconciled data
d) Derived data
Answer : B
33. A data warehouse is which of the following?
a) Can be updated by end users.
b) Contains numerous naming conventions and formats.
c) Organized around important subject areas.
d) Contains only current data.
Answer: C
34. A snowflake schema is which of the following types of tables?
a) Fact
b) Dimension
c) Helper
d) All of the above
Answer: D
35. The extract process is which of the following?
a) Capturing all of the data contained in various operational systems
b) Capturing a subset of the data contained in various operational systems
c) Capturing all of the data contained in various decision support systems
d) Capturing a subset of the data contained in various decision support
systems
Answer: B
36. The generic two-level data warehouse architecture includes which of
the following?
a) At least one data mart
b) Data that can extracted from numerous internal and external sources
c) Near real-time updates
d) All of the above.
Answer: B
37. Which one is correct regarding MOLAP ?
a) Data is stored and fetched from the main data warehouse.
b) Use complex SQL queries to fetch data from the main warehouse
c) Large volume of data is used.
d) All are incorrect
Answer: A
38. In terms of data warehouse, metadata can be define as,
a) Metadata is a road-map to data warehouse
b) Metadata in data warehouse defines the warehouse objects.
c) Metadata acts as a directory.
d) All are incorrect
Answer: D
39. In terms of the ROLAP model, choose the most suitable answer
a) The warehouse stores atomic data.
b) The application layer generates SQL for the two dimensional view
c) The presentation layer provides the multidimensional view.
d) All are incorrect
Answer: D
40. In the OLAP model, the _ provides the multidimensional view.
a) Data layer
b) Data link layer
c) Presentation layer
d) Application layer
Answer: A
41. Which of the following is not true regarding characteristics of
warehoused data?
a) Changed data will be added as new data
b) Data warehouse can contains historical data
c) Obsolete data are discarded
d) Users can change data once entered into the data warehouse
Answer: D
42. ETL is an abbreviation for Elevation, Transformation and Loading
a) TRUE
b) FALSE
Answer: B
43. Which is the core of the multidimensional model that consists of a
large set of facts and a number of dimensions?
a) Multidimensional cube
b) Data model
c) Data cube
d) None of the above
Answer: C
44. Which of the following statements is incorrect
a) ROLAPs have large data volumes
b) Data form of ROLAP is large multidimentional array made of cubes
c) MOLAP uses sparse matrix technology to manage data sparcity
d) Access for MOLAP is faster than ROLAP
Answer: B
45. Which of the following standard query techniques increase the
granularity
a) roll-up
b) drill-down
c) slicing
d) dicing
Answer: B
46. The full form of OLAP is
a) Online Analytical Processing
b) Online Advanced Processing
c) Online Analytical Performance
d) Online Advanced Preparation
Answer: A
47. Which of the following statements is/are incorrect about ROLAP?
a) ROLAP fetched data from data warehouse.
b) ROLAP data store as data cubes.
c) ROLAP use sparse matrix to manage data sparsity.
Answer: B and C
48. __ is a standard query technique that can be used within OLAP to
zoom in to more detailed data by changing dimensions.
a) Drill-up
b) Drill-down
c) Pivoting
d) Drill-across
Answer: B
49. Which of the following statements is/are correct about Fact
constellation
a) Fact constellation schema can be seen as a combination of many star
schemas.
b) It is possible to create fact constellation schema, for each star
schema or snowflake schema.
c) Can be identified as a flexible schema for implementation.
Answer: C
50. How to describe the data contained in the data warehouse?.
a) Relational data
b) Operational data
c) Meta data
d) Informational data
Answer: C
51. The output of an OLAP query is displayed as a
a) Pivot
b) Matrix
c) Excel
Answer: B and C
52. One can perform query operations on the data present in the Data
Warehouse
a) TRUE
b) FALSE
Answer: A
53. A __ combines facts from multiple processes into a single fact table
and eases the analytic burden on BI applications.
a) Aggregate fact table
b) Consolidated fact table
c) Transaction fact table
d) Accumulating snapshot fact table
Answer: B
54. In OLAP operations, Slicing is the technique of ____
a) Selecting one particular dimension from a given cube and providing
a new sub-cube
b) Selecting two or more dimensions from a given cube and providing
a new sub-cube
c) Rotating the data axes in order to provide an alternative presentation
of data
d) Performing aggregation on a data cube
Answer: A
55. Standalone data marts built by drawing data directly from operational
or external sources of data or both are known as independent data marts
a) TRUE
b) FALSE
Answer: A
56. Focusing on the modeling and analysis of data for decision makers, not
on daily operations or transaction processing is known
a) Integrated
b) Time-variant
c) Subject oriented
d) Non-volatile
Answer: C
57. Most of the time data warehouse is
a) Read
b) Write
c) Both
Answer: A
58. Data granularity is ——————- of details of data ?
a) summarization
b) transformation
c) level
Answer: C
59. Which one is not a type of fact?
a) Fully Additive
b) Cumulative Additive
c) Semi Additive
d) Non Additive
Answer: C
60. When the level of details of data is reducing the data granularity goes
higher
a) True
b) False
Answer: B
61. Data Warehouses are having summarized and reconciled data which
can be used by decision makers
a) True
b) False
Answer: A
62. _____ refers to the currency and lineage of data in a data warehouse
a) Operational metadata
b) Business metadata
c) Technical metadata
d) End-User metadata
Answer: A
Module IV
6
DATA MINING AND PREPROCESSING
Unit Structure
6.0 Objectives
6.1 Introduction
6.2 Definition
6.3 Functionalities of Data Mining
6.3.1 Class/ Concept Descriptions
6.3.2 Mining Frequent Patterns, Associations, and Correlations
6.3.3 Association Analysis
6.3.4 Correlation Analysis
6.4 Data Preprocessing & KDD
6.4.1 Data Cleaning
6.4.2 Data Integration
6.4.3 Data Selection
6.4.4 Data Transformation
6.4.5 Data Mining
6.4.6 Pattern Evaluation
6.4.7 Knowledge representation
6.5 Data Reduction
6.6 Let us sum up
6.7 List of References
6.8 Bibliography
6.9 Unit End Exercises
6.0 OBJECTIVES
After going through this unit, you will be able to:
• Define data mining and its functionalities
• Understand the Knowledge discovery of data process
• Explain the steps in data pre-processing
• describe the dimensionality of data
• learn the data reduction and data compression techniques
6.1 INTRODUCTION
Generally, data mining (sometimes called data or knowledge discovery) is
the process of analyzing data from different perspectives and summarizing
it into useful information - information that can be used to increase
revenue, cut costs, or both. Data mining software is one of a number of
analytical tools for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding
correlations or patterns among dozens of fields in large relational
databases.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer.
Today, organizations are accumulating vast and growing amounts of data
in different formats and different databases. This includes:
metadata - data about the data itself, such as logical database design or
data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data
can yield information on which products are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and
future trends. For example, summary information on retail supermarket
sales can be analyzed in light of promotional efforts to provide knowledge
of consumer buying behavior. Thus, a manufacturer or retailer could
determine which items are most susceptible to promotional efforts.
Steps in Evolution of Data Mining (Table)
6.2 DEFINITION
The technique of identifying patterns and relationships within large
databases through the use of advanced statistical methods.
Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help
companies focus on the most important information in their data
warehouses. Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions.
Descriptive Data Mining, on the other hand, focuses on finding patterns
describing the data that can be interpreted by humans. The common data
features are highlighted in the data set.
For example, count, average, etc. Therefore, it is possible to put data-
mining activities into one of two categories:
Frequent Subsequence:
A pattern series that happens on a frequent basis, such as purchasing a
phone followed by a rear cover.
Frequent Substructure:
It refers to the various types of data structures, such as trees and graphs,
that can be joined with an itemset or subsequence.
information from data recorded in databases. Here is the list of steps
involved in the knowledge discovery process:
An Outline of the steps in the KDD process
6.4.1 Data Cleaning: Data cleaning is the elimination of noisy and useless
data from a collection, including cleaning in the event of missing values.
The following methods can be used to handle missing and noisy values:
6.4.1.3 Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
6.4.1.4 Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be considered
outliers
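As a small illustrative sketch of the cleaning steps just described (filling missing values, regression smoothing and a simplified outlier check), assuming a toy table with age and income columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan, 33],
                   "income": [30000, 32000, 61000, 300000, 40000, 41000]})

# Missing values: replace NaNs in 'age' with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Regression smoothing: fit income as a linear function of age and use the
# fitted values as a smoothed version of the attribute.
coeffs = np.polyfit(df["age"], df["income"], deg=1)
df["income_smoothed"] = np.polyval(coeffs, df["age"])

# Outlier analysis (simplified): flag values far from the bulk of the data.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df["income_outlier"] = z.abs() > 2
print(df)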
6.4.6 Pattern Evaluation is described as the identification of strictly
increasing patterns expressing knowledge based on supplied metrics.
● Determine the pattern's interestingness score.
● Summarization and visualization are used to make data clear to the
user.
6.4.7 Knowledge representation is described as a strategy that uses
visualization tools to depict data mining findings.
● Create reports.
● Create tables.
● Create discriminant rules, classification rules, characterization rules,
and other rules.
● Dimensionality reduction, where encoding mechanisms are used to
reduce the dataset size. Dimensionality refers to the number of input
characteristics, variables, or columns contained in a particular dataset,
while dimensionality reduction refers to the process of reducing these
features. In certain circumstances, a dataset comprises a large number of
input characteristics, which complicates the predictive modeling work.
Because it is difficult to see or forecast for a training dataset with a large
number of characteristics, dimensionality reduction techniques must be
used. The dimensionality reduction approach is defined as "a method of
transforming greater dimensions datasets into smaller dimensions datasets
while guaranteeing that they give identical information." These strategies
are commonly used in machine learning to obtain a better fit prediction
model when tackling classification and regression problems. It is
frequently utilized in high-dimensional data domains such as voice
recognition, signal processing, bioinformatics, and so on. It may also be
used for data visualization, noise reduction, cluster analysis, and other
similar tasks.
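A minimal sketch of dimensionality reduction, assuming scikit-learn's PCA as one possible encoding mechanism (the text does not prescribe a specific tool); the random data stands in for a high-dimensional dataset.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 records with 10 input features

pca = PCA(n_components=3)                # keep 3 derived features
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 3)
print(pca.explained_variance_ratio_)     # share of variance each component retains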
● Data Compression
Using various encoding algorithms, data compression minimizes the size
of files (Huffman Encoding & run-length Encoding). Based on its
compression methodologies, we may categorize it into two groups.
Lossless Compression - Encoding Techniques (Run Length Encoding)
provide easy and minimum data size reduction. Lossless data compression
use techniques to reconstruct the exact original data from compressed
data.
Lossy compression methods include the Discrete Wavelet transform
algorithm and PCA (principal component analysis). JPEG picture format,
for example, is a lossy compression, yet we may find the meaning
comparable to the original image. The decompressed data in lossy-data
compression may differ from the original data, but they are still usable for
retrieving information.
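A toy sketch of lossless run-length encoding, one of the techniques mentioned above; the data string is invented for illustration.

from itertools import groupby

def rle_encode(seq):
    # Replace runs of repeated symbols with (symbol, count) pairs.
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(pairs):
    # Rebuild the original sequence exactly from the pairs.
    return "".join(sym * count for sym, count in pairs)

data = "AAAABBBCCDAA"
encoded = rle_encode(data)
print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(encoded) == data)  # True: lossless reconstruction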
● Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need to store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of
histograms.
● Discretization and concept hierarchy generation, where raw data
values for attributes are replaced by ranges or higher conceptual levels.
Data discretization is a form of numerosity reduction that is very useful for
the automatic generation of concept hierarchies. Discretization and
concept hierarchy generation are powerful tools for data mining, in that
they allow the mining of data at multiple levels of abstraction
Top-down discretization — If you first pick one or a couple of points
(so-called breakpoints or split points) to divide the whole range of
attribute values, and then repeat this on the resulting intervals until the
end, the procedure is known as splitting.
Bottom-up discretization — If you start by considering all of the
continuous values as potential split points and then remove some of them
by merging neighbourhood values into intervals, this is known as
bottom-up discretization.
● Concept Hierarchies: It decreases data size by gathering and then
substituting low-level ideas (for example, 43 for age) with high-level
concepts (categorical variables such as middle age or senior). For
numerical data, the following strategies can be used:
o Binning - The process of converting numerical variables into
categorical equivalents is known as binning. The number of category
equivalents is determined by the user's selection of bins.
o Histogram analysis - The histogram, like the binning procedure, is
used to separate the value for the attribute X into disjoint ranges called
brackets. There are a number of partitioning rules:
o Partitioning values based on their frequency of occurrence in the data
collection is known as equal frequency partitioning.
o Partitioning the data in a fixed gap depending on the number of rows
and columns is known as equal width partitioning.
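A short sketch of equal-frequency and equal-width partitioning using pandas (qcut and cut); the age values and bin labels are assumptions chosen for illustration.

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 35, 36, 40, 45, 46, 52, 70])

# qcut: equal-frequency partitioning (roughly the same number of rows per bin).
equal_frequency = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# cut: equal-width partitioning (each bin spans the same range of values).
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

print(pd.DataFrame({"age": ages, "equal_frequency": equal_frequency,
                    "equal_width": equal_width}))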
6.7 REFERENCES
1. R. Agrawal, T. Imielinski, and A. Swami (1993). "Mining
associations between sets of items in massive databases," in
Proceedings of the 1993 ACM-SIGMOD International Conference
on Management of Data (pp. 207–216), New York: ACM Press.
2. M. J. A. Berry, and G. S. Linoff (1997). Data Mining Techniques.
New York: Wiley.
3. M. J. A. Berry, and G. S. Linoff (2000). Mastering Data Mining.
New York: Wiley.
4. L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984).
Classification and Regression Trees. Boca Raton, FL: Chapman &
Hall/CRC (orig. published by Wadsworth).
5. C. Chatfield (2003). The Analysis of Time Series: An Introduction,
6th ed. Chapman & Hall/CRC.
6. R. Delmaster, and M. Hancock (2001). Data Mining Explained.
Boston: Digital Press.
7. S. Few (2004). Show Me the Numbers. Oakland, CA, Analytics
Press.
8. J. Han, and M. Kamber (2001). Data Mining: Concepts and
Techniques. San Diego, CA: Academic.
9. D. Hand, H. Mannila and P. Smyth (2001). Principles of Data
Mining. Cambridge, MA: MIT Press.
10. T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of
Statistical Learning. 2nd ed. New York: Springer.
11. D. W. Hosmer, and S. Lemeshow (2000). Applied Logistic
Regression, 2nd ed. New York: Wiley-Interscience.
12. W. Jank, and I. Yahav (2010). E-Loyalty Networks in Online
Auctions. Annals of Applied Statistics, forthcoming.
13. W. Johnson, and D. Wichern (2002). Applied Multivariate Statistics.
Upper Saddle River, NJ: Prentice Hall.
6.8 BIBLIOGRAPHY
1. Bezdek, J. C., & Pal, S. K. (1992). Fuzzy models for pattern
recognition: Methods that search for structures in data. New York:
IEEE Press
2. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R.
(Eds.). (1996). Advances in knowledge discovery and data mining.
AAAI/MIT Press.
3. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques:
Morgan Kaufmann.
4. Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of
statistical learning: Data mining, inference, and prediction: New York:
Springer.
5. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data.
New Jersey: Prentice Hall.
6. Jensen, F. V. (1996). An introduction to bayesian networks. London:
University College London Press.
7. Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An
introduction to cluster analysis. New York: John Wiley.
8. Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine
learning, neural and statistical classification: Ellis Horwood.
Module V
Associations and Correlation
7
ASSOCIATION RULE MINING
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Steps in Association Rule Mining
7.3 Market Basket Analysis
7.3.1 What is an Itemset?
7.3.2 What is a Frequent Itemset?
7.3.3 Association rules
7.3.4 Why Frequent Itemset mining?
7.4 Frequent Pattern Algorithms
7.4.1 Apriori algorithm
7.4.2 Steps in Apriori algorithm
7.4.3 Example of Apriori
7.4.4 Advantages and Disadvantages
7.4.5 Methods to improve Apriori Efficiency
7.4.6 Applications of Apriori algorithm.
7.5 Incremental Association Rule Mining
7.5.1 Classification rule mining
7.6 Let us sum up
7.7 List of References
7.8 Bibliography
7.9 Unit End Exercises
7.0 OBJECTIVES
7.1 INTRODUCTION
The data mining technique of discovering the rules that regulate
relationships and causal objects between collections of items is known as
association rule mining. So, in a particular transaction involving many
items, it attempts to identify the principles that govern how or why such
items are frequently purchased together. Association rule mining is a well-
known and well-studied approach for uncovering interesting relationships
between variables in huge datasets. Its goal is to find strong rules in
databases using various metrics of interestingness.
The goal of ARM is to discover association rules, frequent patterns,
subsequences, or correlation links among a huge collection of data items
that meet the established minimal support and confidence from a given
database.
Association rule learning is a form of unsupervised learning approach that
detects the dependency of one data item on another and maps
appropriately to make it more lucrative. It attempts to discover some
interesting relationships or links among the variables in the dataset. It uses
several criteria to uncover interesting relationships between variables in a
database.
One of the most significant topics in machine learning is association rule
learning, which is used in Market Basket analysis, Web usage mining,
continuous manufacturing, and other applications. Market basket analysis
is a technique used by many large retailers to find the relationships
between commodities. We may explain it by using a supermarket as an
example because, at a supermarket, all things that are purchased together
are placed together. For example, if a consumer purchases bread, he is
likely to also purchase butter, eggs, or milk, therefore these items are
displayed on a shelf or in close proximity.
Itemsets and their Support values
Consider the [Bread] itemset, which has 80% support. This suggests that
bread appears 80 times out of every 100 transactions.
Defining support as a percentage allows us to establish a frequency
threshold called min support. If we set support to 50%, we define a
frequent itemset as one that appears at least 50 times in 100 transactions.
For example, in the preceding dataset, we set the threshold support to 60%:
minimum support count = 60% of (total # of transactions) = 0.6 * 5 = 3
For an itemset to be frequent, it should occur at least 3 times in the 5
transactions of the given dataset.
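A quick sketch of how support counts and the minimum support threshold can be checked in Python; the transactions below are a toy example, not the dataset referred to above.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
min_support = 0.6                      # 60% of 5 transactions => count of 3

def support_count(itemset):
    # Number of transactions that contain every item of the candidate itemset.
    return sum(1 for t in transactions if itemset <= t)

for candidate in [{"bread"}, {"butter"}, {"bread", "butter"}, {"milk"}]:
    count = support_count(candidate)
    frequent = count >= min_support * len(transactions)
    print(sorted(candidate), count, "frequent" if frequent else "infrequent")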
Association Rules
Table 5.1 Association Rule Notation
Term Description
D Database
ti Transaction in D
S Support
A Confidence
X, Y Itemsets
X =Y Association Rule
P Number of partitions
When all large Itemsets are found, generating the association rules is
straightforward.
The term "Association Rule Mining" refers to the following:
"Assume I=... is a collection of 'n' binary attributes known as items. Let
D=.... be a set of transactions referred to as a database. Each transaction in
127
Data Mining and Business D has a distinct transaction ID and includes a subset of the items in I. A
Intelligence rule is defined as a logical implication of the type X->Y, where X, Y? I,
and X?Y=? The sets of objects X and Y are referred to as the rule's
antecedent and consequent."
Association rule learning is used to discover associations between
attributes in huge datasets. An A=>B association rule will be of the form
"given a set of transactions, some value of itemset A determines the values
of itemset B under the condition that minimal support and confidence are
present."
Support(A) = (number of transactions containing A) / (total number of transactions)
Confidence(A => B) = Support(A ∪ B) / Support(A)
Apriori says:
The probability that item I is not frequent is if:
• P(I) < minimum support threshold, then I is not frequent.
• P (I+A) < minimum support threshold, then I+A is not frequent, where A
also belongs to itemset.
• If an itemset set has value less than minimum support then all of its
supersets will also fall below min support, and thus can be ignored. This
property is called the Antimonotone property
Apriori Property:
1. It makes use of “Upward Closure property” (Any superset of
infrequent itemset is also an infrequent set). It follows Bottom-up
search, moving upward level-wise in the lattice.
2. It makes use of the “downward closure property” (any subset of a
frequent itemset is a frequent itemset).
3. The support of an itemset never exceeds the support of its subsets; this
is known as the antimonotone property of support.
This essential Apriori property is utilized to increase the efficiency of the
level-wise generation of frequent itemsets by reducing the search space.
End
Return UkLi;
Table – 1
Solution:
Support Threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count the support of each item in the transactions of TABLE-1 (Table – 2):
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3,
thus it is deleted; only I1, I2, I3 and I4 meet the min_sup count.
Table -3
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemsets from the frequent items and count their
occurrences in TABLE-1 (Table – 4):
Item Count
I1, I2 4
I1, I3 3
I1, I4 2
I2, I3 4
I2, I4 3
I3, I4 2
4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not
meet min_sup, thus they are deleted.
Table – 5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1 find the
occurrences of the 3-itemsets, and from TABLE-5 find the 2-itemset
subsets which satisfy min_sup. For itemset {I1, I2, I3}, the subsets
{I1, I2}, {I1, I3} and {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is
frequent. For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4} and
{I2, I4}; since {I1, I4} is not frequent (it does not occur in TABLE-5),
{I1, I2, I4} is not frequent and is deleted. Similarly, {I1, I3, I4} and
{I2, I3, I4} contain the infrequent subset {I3, I4} and are deleted.
Table – 6
Item
I1,I2, I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
• This shows that all the above association rules are strong if the
minimum confidence threshold is 60%.
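A compact, illustrative Apriori-style sketch in Python, covering the level-wise join and prune steps followed by confidence checking of the rules; the transaction data is a toy example and not the chapter's TABLE-1.

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I1", "I2", "I3"},
]
min_sup, min_conf = 3, 0.6

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {}
k, level = 1, [frozenset([i]) for i in items]
while level:
    # Prune step: keep only the candidates that meet the minimum support count.
    survivors = [c for c in level if count(c) >= min_sup]
    frequent.update({c: count(c) for c in survivors})
    # Join step: build (k+1)-candidates whose k-subsets are all frequent.
    k += 1
    level = list({a | b for a in survivors for b in survivors if len(a | b) == k
                  if all(frozenset(s) in frequent for s in combinations(a | b, k - 1))})

# Generate rules X -> Y from each frequent itemset and keep the confident ones.
for itemset in (f for f in frequent if len(f) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = frequent[itemset] / count(antecedent)
            if conf >= min_conf:
                print(sorted(antecedent), "->", sorted(itemset - antecedent), round(conf, 2))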
7.4.4 Advantages and Disadvantages of APRIORI
Advantages of APRIORI:
● Easy to understand algorithm
● Join and Prune steps are easy to implement on large Itemsets in large
databases
Disadvantages of APRIORI:
● It requires high computation if the Itemsets are very large and the
minimum support is kept very low.
● The entire database needs to be scanned.
Transaction Reduction
A transaction that does not contain any frequent k-itemsets cannot contain
any frequent (k + 1)-itemsets. As a result, such a transaction can be
flagged or removed from further consideration, since subsequent database
scans for j-itemsets with j > k will not require it.
Partitioning
A partitioning strategy that required two database scans to mine the
frequent Itemsets can be utilized. It is divided into two stages. The method
separates D's transactions into n non-overlapping segments in Phase I. If
the minimum support threshold for transactions in D is min sup, then the
minimum support count for a partition is the number of transactions in that
partition multiplied by min sup.
All frequent Itemsets inside a partition are found for each partition. These
are known as local frequently occurring Itemsets. The procedure utilizes a
special data structure that stores the TIDs of the transactions that include
the items in the itemset for each itemset. This allows it to locate all of the
local frequent k-Itemsets for k = 1, 2... in a single database search.
134
A local frequent itemset may or may not be frequent with respect to the
entire database, D. However, any itemset that is potentially frequent in D
must appear as a frequent itemset in at least one of the partitions.
Therefore, all local frequent itemsets are candidate itemsets with respect to
D, and the collection of frequent itemsets from all partitions forms the
global candidate itemsets for D. A second scan of D is arranged in Phase
II, in which the actual support of each candidate is assessed to determine
the global frequent itemsets.
Sampling
The sampling approach's basic premise is to take a random sample S of the
provided data D and then look for frequent Itemsets in S rather than D. It
is possible to trade off some degree of accuracy for efficiency with this
strategy. Because the sample size of S is such that the search for frequent
Itemsets in S may be performed in main memory, just one scan of the
transactions in S is required overall.
The discovery of association rules is based on frequent itemset mining.
Earlier in this chapter, many strategies for frequent itemset mining and the
development of association rules were given. In this section, we will look
into associative classification, which is the process of generating and
analyzing association rules for use in classification. The main approach is
to look for substantial correlations between common patterns
(conjunctions of attribute-value pairs) and class labels. Because
association rules investigate extremely confident links among several
characteristics, this technique may be able to circumvent some of the
limits imposed by decision - tree induction, which investigates only one
attribute at a time. In many studies, associative classification has been
found to be more accurate than some traditional classification methods,
such as C4.5. In particular, we study three main methods: CBA, CMAR,
and CPAR.
square approach, the CMAR algorithm employs numerous criteria to
forecast previously undiscovered cases.
CPAR:
● Since it searches for only high-quality rules, it is slower.
● Performing weighted analysis adds substantial computational load to
the algorithm.
● CPAR is more complex to understand as well as to implement.
● Usage of a greedy algorithm to train the dataset adds additional
computational overhead to the algorithm.
7.6 LET US SUM UP
● Association Rules are used to show the relationship between data
items and are used frequently by retail stores to assist in marketing,
advertisement, inventory control etc.
● The selection of Association Rule depends on support and
confidence. Apriori algorithm, sampling algorithm are some of the
basic algorithms used in Association Rules.
● The Apriori algorithm prunes candidate itemsets level by level using
the Apriori property, although it may require several database scans;
refinements such as partitioning and sampling reduce the number of scans.
● It reduces the size of the Itemsets in the database considerably
providing a good performance. Thus, data mining helps consumers
and industries better in the decision-making process.
● Frequent Itemsets discovered through Apriori have many
applications in data mining tasks. Tasks such as finding interesting
patterns in the database, finding out sequence and Mining of
association rules is the most important of them.
7.7 REFERENCES
1. Su Z, Song W, Cao D, Li J. Discovering informative association rules
for associative classification. IEEE International Symposium on
Knowledge Acquisition and Modeling Workshop; Wuhan. 2008. p.
1060–3.
2. Ye Y, Li T, Jiang Q, Wang Y. CIMDS: Adapting postprocessing
techniques of associative classification for malware detection. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews. 2010; 40(3):298–307.
3. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS:
Ordering points to identify the clustering structure. In Proc. 1999
ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’99), pages
49–60, Philadelphia, PA, June 1999.
4. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic
subspace clustering of high dimensional data for data mining
applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD’98), pages 94–105, Seattle, WA, June 1998.
5. C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining:
Models and Algorithms. Springer, 2008.
7.8 BIBLIOGRAPHY
1. Bezdek, J. C., & Pal, S. K. (1992). Fuzzy models for pattern
recognition: Methods that search for structures in data. New York:
IEEE Press
2. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R.
(Eds.). (1996). Advances in knowledge discovery and data mining.
AAAI/MIT Press.
3. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques:
Morgan Kaufmann.
4. Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of
statistical learning: Data mining, inference, and prediction: New York:
Springer.
5. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data.
New Jersey: Prentice Hall.
6. Jensen, F. V. (1996). An introduction to bayesian networks. London:
University College London Press.
7. Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An
introduction to cluster analysis. New York: John Wiley.
8. Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine
learning, neural and statistical classification: Ellis Horwood.
7.9 UNIT END EXERCISES
1. Explain Associations Rule Mining
2. Explain how to generate strong association rule mining
3. Explain the mining methods for association rule generation
4. What are the methods to improve Accuracy of Apriori?
5. Write short notes on Prediction.
6. Elaborate associative classification methods.
7. Explain support and confidence rule with examples.
8. Write short notes on frequent pattern mining.
9. Explain Market basket analysis with example.
10. Define Incremental rule mining.
Module VI
8
CLASSIFICATION AND PREDICTION
Unit Structure
8.0 Introduction
8.1 Decision Tree
8.2 CART
8.3 Bayesian classification
8.4 Linear and nonlinear regression.
8.5 References
8.6 MOOCs
8.7 Video Lectures
8.8 Quiz
8.0 INTRODUCTION
There are two forms of data analysis that can be used for extracting
models describing important classes or to predict future data trends.
❖ Classification
❖ Prediction
Classification models predict categorical class labels; and prediction
models predict continuous valued functions.
Examples:
❖ A bank loan officer wants to analyze the data in order to know which
customer (loan applicant) are risky or which are safe.
❖ A marketing manager at a company needs to analyze a customer with a
given profile, who will buy a new computer.
The Data Classification process includes two steps −
❖ Building the Classifier or Model
❖ Using Classifier for Classification
Prediction examples:
Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company. Data analysis task is an
example of numeric prediction. In this case, a model or a predictor will be
constructed that predicts a continuous-valued-function or ordered value.
Classification and Prediction Issues
❖ Data Cleaning
❖ Relevance Analysis
❖ Data Transformation and reduction − The data can be transformed
by any of the following methods.
❖ Normalization
❖ Generalization
Comparison of Classification and Prediction Methods
❖ Accuracy
❖ Speed
❖ Robustness
❖ Scalability
❖ Interpretability
8.1 DECISION TREE
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
top most node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a
class.
Decision Tree
Tree Pruning
Pruning is performed in order to remove anomalies in the training data due
to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
❖ Pre-pruning
❖ Post-pruning
Cost Complexity
❖ Number of leaves in the tree, and
❖ Error rate of the tree.
Decision Tree
❖ Classifies data using the attributes
❖ Tree consists of decision nodes and decision leafs.
❖ Nodes can have two or more branches which represents the value for
the attribute tested.
❖ Leaf nodes produce a homogeneous result.
The algorithm
❖ The ID3 follows the Occam’s razor principle.
❖ Attempts to create the smallest possible decision tree.
The Process
❖ Takes all unused attributes and calculates their entropies.
❖ Chooses the attribute whose entropy is minimum, or equivalently
whose information gain is maximum.
❖ Makes a node containing that attribute.
The Algorithm
❖ Create a root node for the tree
❖ If all examples are positive, Return the single-node tree Root, with
label = +.
❖ If all examples are negative, Return the single-node tree Root, with
label = -.
❖ If number of predicting attributes is empty, then Return the single
node tree Root, with label = most common value of the target
attribute in the examples.
❖ Else
– A = The Attribute that best classifies examples.
– Decision Tree attribute for Root = A.
– For each possible value, vi, of A,
❖ Add a new tree branch below Root, corresponding to the test A = vi.
❖ Let Examples(vi), be the subset of examples that have the value vi
for A
❖ If Examples(vi) is empty
– Then below this new branch add a leaf node with label = most common
target value in the examples
❖ Else below this new branch add the subtree ID3 (Examples(vi),
Target_Attribute, Attributes – {A})
❖ End
❖ Return Root
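A compact, illustrative Python sketch of the ID3 outline above; it relies on the entropy and information gain measures defined in the next subsections, and the weather-style attribute values are assumed toy data, not examples from the text.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Expected reduction in entropy after splitting on the attribute.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                 # all examples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in subset], [l for _, l in subset]
        tree[best][value] = id3(sub_rows, sub_labels, [a for a in attributes if a != best])
    return tree

rows = [{"outlook": "sunny", "wind": "weak"}, {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rain", "wind": "weak"}, {"outlook": "overcast", "wind": "strong"}]
labels = ["no", "no", "yes", "yes"]
print(id3(rows, labels, ["outlook", "wind"]))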
Entropy
❖ A completely homogeneous sample has an entropy of 0
❖ An equally divided sample has an entropy of 1
❖ For a sample of positive and negative elements,
Entropy = -p+ log2(p+) - p- log2(p-)
Exercise
Calculate the entropy
Given:
❖ Set S contains 14 examples
❖ 9 Positive values
❖ 5 Negative values
Entropy(S) = - (9/14) Log2 (9/14) - (5/14) Log2 (5/14)
= 0.940
Information Gain
❖ Information gain is based on the decrease in entropy after a dataset
is split on an attribute.
❖ Looking for which attribute creates the most homogeneous branches
Information Gain Example
❖ 14 examples, 9 positive 5 negative
❖ The attribute is Wind.
❖ Values of wind are Weak and Strong
❖ 8 occurrences of weak winds
❖ 6 occurrences of strong winds
❖ For the weak winds, 6 are positive and 2 are negative
❖ For the strong winds, 3 are positive and 3 are negative
Gain(S, Wind) = Entropy(S) - (8/14)*Entropy(Weak) - (6/14)*Entropy(Strong)
❖ Entropy (Weak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
❖ Entropy (Strong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
0.940 - (8/14)*0.811 - (6/14)*1.00
= 0.048
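The two calculations above can be reproduced with a few lines of Python (a check of the worked numbers, not a new method).

import math

def entropy(p, n):
    total = p + n
    terms = [x / total for x in (p, n) if x > 0]
    return -sum(f * math.log2(f) for f in terms)

e_s      = entropy(9, 5)                 # 0.940  (14 examples: 9 positive, 5 negative)
e_weak   = entropy(6, 2)                 # 0.811  (8 weak-wind examples)
e_strong = entropy(3, 3)                 # 1.000  (6 strong-wind examples)
gain     = e_s - (8/14) * e_weak - (6/14) * e_strong
print(round(e_s, 3), round(gain, 3))     # 0.94 0.048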
Advantage of ID3
❖ Understandable prediction rules are created from the training data.
❖ Builds the fastest tree.
❖ Builds a short tree.
❖ Only need to test enough attributes until all data is classified.
❖ Finding leaf nodes enables test data to be pruned, reducing number
of tests.
Disadvantage of ID3
❖ Data may be over-fitted or overclassified, if a small sample is tested.
❖ Only one attribute at a time is tested for making a decision.
❖ Classifying continuous data may be computationally expensive, as
many trees must be generated to see where to break the continuum.
However, we must note that there can be many other possible decision
trees for a given problem - we want the shortest one. We also want it to be
better in terms of accuracy (prediction error measured in terms of
misclassification cost).
An alternative, shorter decision tree for the same –
Step 3: Partition instances according to the selected attribute recursively
Partitioning stops when:
❖ There are no examples left
❖ All examples for a given node belong to the same class
❖ There are no remaining attributes for further partitioning – majority
class is the leaf
What is Impurity?
The key to building a decision tree is in Step 2 above - selecting which
attribute to branch off on. We want to choose the attribute that gives us the
most information. This subject is called information theory.
In our dataset we can see that a loan is always approved when the
applicant owns their own house. This is very informative (and certain) and
is hence set as the root node of the alternative decision tree shown
previously. Classifying a lot of future applicants will be easy.
Selecting the age attribute is not as informative - there is a degree of
uncertainty (or impurity). The person's age does not seem to affect the
final class as much.
Based on the above discussion:
A subset of data is pure if all instances belong to the same class.
Our objective is to reduce impurity or uncertainty in data as much as
possible.
The metric (or heuristic) used in CART to measure impurity is the Gini
Index and we select the attributes with lower Gini Indices first. Here is the
algorithm:
We first need to define the Gini Index, which is used to measure the
information gained by selecting an attribute. The Gini Index favours larger
partitions. For a node whose observations fall into n classes, it is calculated as
Gini = 1 - Σ (p_i)^2, for i = 1 to n,
where p_i is the proportion of observations belonging to class i.
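As an illustration only, the following Python sketch computes the Gini index of a node and the weighted Gini index of a split; the class counts are invented, and the point is simply that CART prefers the attribute with the lower weighted Gini.

    def gini(counts):
        # Gini index of a node from its class counts: 1 - sum of squared proportions
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def gini_split(partitions):
        # Weighted Gini index of a split into several partitions
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    print(gini([9, 5]))                   # impurity of the node before splitting
    print(gini_split([[6, 0], [3, 5]]))   # hypothetical split; lower is better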
Prediction using CARTs
Conclusion
The CART algorithm is organized as a series of questions, the response to
each of which determines the next question, if any. The outcome of these
questions is a tree-like structure whose terminal nodes are reached when
there are no more questions to ask.
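A CART-style tree of this kind can be grown with scikit-learn's DecisionTreeClassifier, which uses the Gini index by default. The sketch below is illustrative only; the tiny loan-approval dataset and its integer encoding are invented and are not the dataset discussed in the text.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [owns_house, age_group] encoded as integers; target: loan approved
    X = [[1, 0], [1, 1], [1, 2], [0, 0], [0, 1], [0, 2], [0, 1], [0, 2]]
    y = [1, 1, 1, 0, 1, 0, 0, 1]

    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    tree.fit(X, y)

    # Each internal node is one "question"; a terminal node gives the prediction
    print(export_text(tree, feature_names=["owns_house", "age_group"]))
    print(tree.predict([[1, 1]]))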
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −
❖ Posterior Probability [P(H/X)]
❖ Prior Probability [P(H)]
where X is data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H/X)= P(X/H)P(H) / P(X)
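A minimal numeric illustration of the theorem is given below; the probabilities are made up purely for demonstration, with H standing for a hypothesis such as "the customer buys a computer" and X for evidence such as "the customer is a student".

    def posterior(p_x_given_h, p_h, p_x):
        # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
        return p_x_given_h * p_h / p_x

    p_h = 0.5           # prior probability P(H)
    p_x = 0.3           # evidence P(X)
    p_x_given_h = 0.4   # likelihood P(X|H)
    print(round(posterior(p_x_given_h, p_h, p_x), 3))   # posterior P(H|X) = 0.667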
Applications of Bayes' Theorem
❖ It can also be used as a building block and starting point for more
complex methodologies.
❖ Used in classification problems and other probability-related
questions.
❖ Statistical inference.
❖ Can be used to calculate the probability of an individual having a
specific genotype.
Example: Linear and Nonlinear Regression
Even when the independent variable is squared, the model is still linear in the
parameters. Linear models can also contain log terms and inverse terms to
follow different kinds of curves and yet continue to be linear in the
parameters.
The regression example below models the relationship between body mass
index (BMI) and body fat percentage.
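The idea that such a model stays linear in the parameters can be seen in a short sketch: the design matrix simply gains a squared column, and ordinary least squares still applies. The data below are synthetic, not the BMI study referred to above.

    import numpy as np

    rng = np.random.default_rng(0)
    bmi = rng.uniform(18, 35, size=50)
    body_fat = 0.05 * bmi**2 + 0.4 * bmi + rng.normal(0, 1.5, size=50)

    # Design matrix with intercept, BMI and BMI^2 columns: nonlinear in BMI,
    # but linear in the unknown coefficients, so least squares still works.
    X = np.column_stack([np.ones_like(bmi), bmi, bmi**2])
    coef, *_ = np.linalg.lstsq(X, body_fat, rcond=None)
    print(coef)   # estimated [intercept, linear, quadratic] parameters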
8.5 REFERENCES
1. Introduction to Data Warehouse System.
https://www.javatpoint.com/. [Last Accessed on 10.03.2022]
2. Introduction to Data Warehouse System. https://www.guru99.com/.
[Last Accessed on 10.03.2022]
3. Introduction to Data Warehouse System. http://www.nitjsr.ac.in/. [Last
Accessed on 10.03.2022]
4. Introduction to Data Warehouse System. https://oms.bdu.ac.in/. [Last
Accessed on 10.03.2022]
5. Data Warehouse. https://www.softwaretestinghelp.com/. [Last
Accessed on 10.03.2022]
6. Introduction to Data Warehouse System.
https://www.tutorialspoint.com/ebook/data_warehouse_tutorial/inde
x.asp. [Last Accessed on 10.03.2022].
7. Using ID3 Algorithm to build a Decision Tree to predict the
weather. https://iq.opengenus.org/id3-algorithm/. [Last Accessed on
10.03.2022].
8. Data Warehouse Architecture. https://binaryterms.com/data-
warehouse-architecture.html. [Last Accessed on 10.03.2022].
9. Compare-and-Contrast-Database-with-Data-Warehousing-and-Data-
Visualization .
https://www.coursehero.com/file/28202760/Compare-and-Contrast-
Database-with-Data-Warehousing-and-Data-Visualization-
Databases-Assignmentdocx/. [Last Accessed on 10.03.2022].
10. Data Warehousing and data mining.
https://lastmomenttuitions.com/course/data-warehousing-and-
mining/. [Last Accessed on 10.03.2022].
11. Data Warehouse. https://one.sjsu.edu/task/all/finance-data-
warehouse. [Last Accessed on 10.03.2022].
12. DWDM Notes. https://dwdmnotes.blogspot.com. [Last Accessed on
10.03.2022].
13. Data Warehouse and Data Mart.
https://www.geeksforgeeks.org/difference-between-data-warehouse-
and-data-mart/?ref=gcse. [Last Accessed on 10.03.2022].
14. Data Warehouse System. https://www.analyticssteps.com/. [Last
Accessed on 10.03.2022].
15. Han J, Pei J, Kamber M. Data mining: concepts and techniques.
Elsevier; 2011 Jun 9.
16. CART. https://iq.opengenus.org/cart-algorithm. [Last Accessed on
10.03.2022].
17. Bhatia P. Data mining and data warehousing: principles and
practical techniques. Cambridge University Press; 2019 Jun 27.
18. Han J, Pei J, Kamber M. Data mining: concepts and techniques.
Elsevier; 2011 Jun 9.
19. Berzal F, Marín N. Data mining: concepts and techniques by Jiawei
Han and Micheline Kamber. ACM Sigmod Record. 2002 Jun 1;
31(2):66-8.
20. Gupta GK. Introduction to data mining with case studies. PHI
Learning Pvt. Ltd.; 2014 Jun 28.
21. Zhou, Zhi-Hua. "Three perspectives of data mining." (2003): 139-
146.
22. Wang J, editor. Encyclopedia of data warehousing and mining. iGi
Global; 2005 Jun 30.
23. Pujari AK. Data mining techniques. Universities press; 2001.
8.6 MOOCS
1. Data Warehousing for Business Intelligence Specialization.
https://www.coursera.org/specializations/data-warehousing.
2. Data Mining.
https://onlinecourses.swayam2.ac.in/cec20_cs12/preview.
3. Data Warehouse Concepts, Design, and Data Integration.
https://www.coursera.org/learn/dwdesign.
4. Data Warehouse Courses. https://www.edx.org/learn/data-
warehouse.
5. BI Foundations with SQL, ETL and Data Warehousing
Specialization. https://www.coursera.org/specializations/bi-
foundations-sql-etl-data-warehouse.
6. Fundamentals of Data Warehousing. https://www.mooc-
list.com/initiative/coursera.
7. Foundations for Big Data Analysis with SQL.
https://www.coursera.org/learn/foundations-big-data-analysis-sql.
7. Star Schema & Snow Flake Design. Classification and Prediction
https://www.youtube.com/watch?v=KUwOcip7Zzc.
8. OLTP vs OLAP.
https://www.youtube.com/watch?v=aRT8E0nD_LE.
9. OLAP and Data Modeling Concepts.
https://www.youtube.com/watch?v=rnQDuz1ZkIo.
10. Understand OLAP.
https://www.youtube.com/watch?v=yoE6bgJv08E.
11. OLAP Cubes. https://www.youtube.com/watch?v=UKCQQwx-Fy4.
12. OLAP vs OLTP. https://www.youtube.com/watch?v=TCrCo2-w-vM.
13. OLAP. https://www.youtube.com/watch?v=AC1cLmbXcqA.
14. OLAP Vs OLTP. https://www.youtube.com/watch?v=kFQRrgHeiOo.
8.8 QUIZ
1. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing
Answer: a
7. The set of groupings
{ (item name, color, clothes size), (item name, color), (item name, clothes
size), (color, clothes size), (item name), (color), (clothes size), () }
can be achieved by using which of the following?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned
Answer: d
8. What do data warehouses support?
a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a
10. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d
12. Data that can be modelled as dimension attributes and measure
attributes are called ___________
a) Mono-dimensional data
b) Multi-dimensional data
c) Measurable data
d) Efficient data
Answer: b
14. The operation of moving from finer granular data to coarser granular
data is called _______
a) Reduction
b) Increment
c) Roll up
d) Drill down
Answer: c
19. The operation of moving from coarser granular data to finer granular
data is called _______
a) Reduction
b) Increment
c) Roll back
d) Drill down
Answer: d
22. What is the type of relationship in star schema?
a) many-to-many.
b) one-to-one
c) many-to-one
d) one-to-many
Answer: D
27. Which is NOT a valid layer in Three-layer Data Warehouse
Architecture in Conceptual View?
a) Processed data layer
b) Real-time data layer
c) Derived data layer
d) Reconciled data layer
Answer: A
28. Among the types of fact tables which is not a correct type ?
a) Fact-less fact table
b) Transaction fact tables
c) Integration fact tables
d) Aggregate fact tables
Answer: C
42. ETL is an abbreviation for Elevation, Transformation and Loading
a) TRUE
b) FALSE
Answer: B
47. Which of the following statements is/are incorrect about ROLAP?
a) ROLAP fetched data from data warehouse.
b) ROLAP data store as data cubes.
c) ROLAP use sparse matrix to manage data sparsity.
Answer: B and C
52. One can perform query operations on the data present in a Data
Warehouse
a) TRUE
b) FALSE
Answer: A
53. A __ combines facts from multiple processes into a single fact table
and eases the analytic burden on BI applications.
a) Aggregate fact table
b) Consolidated fact table
c) Transaction fact table
d) Accumulating snapshot fact table
Answer: B
55. Standalone data marts built by drawing data directly from operational
or external sources of data or both are known as independent data marts
a) TRUE
b) FALSE
Answer: A
56. Focusing on the modeling and analysis of data for decision makers, not
on daily operations or transaction processing, is known as
a) Integrated
b) Time-variant
c) Subject oriented
d) Non-volatile
Answer: C
57. Most of the time a data warehouse is
a) Read
b) Write
c) Both
Answer: A
60. When the level of detail of the data is reduced, the data granularity goes
higher
a) True
b) False
Answer: B
61. Data warehouses contain summarized and reconciled data which
can be used by decision makers
a) True
b) False
Answer: A
62. _____ refers to the currency and lineage of data in a data warehouse
a) Operational metadata
b) Business metadata
c) Technical metadata
d) End-User metadata
Answer: A
Module VII
9
CLUSTERING
Unit Structure
9.0 Introduction
9.1 Types of Clustering
9.1.1 Hard clustering
9.1.2 Soft clustering
9.2 Categorization of Major Clustering Methods
9.2.1 Partitioning methods - K-Means.
9.2.2 Hierarchical methods
9.2.2.1 Agglomerative hierarchical methods
9.2.2.2 Divisive hierarchical methods
9.2.3 Model- based- Expectation and Maximization
9.2.3.1 Expectation- Maximization algorithm
9.2.3.2 EM Application, Advantages & Disadvantages
9.3 Evaluating cluster models
9.4 List of References
9.5 Quiz
9.6 Exercise
9.7 Video Links
9.0 INTRODUCTION
The term cluster refers to a homogeneous subgroup existing within a
population. Clustering techniques are therefore aimed toward segmenting
a heterogeneous population into a given number of subgroups composed
of observations that share similar characteristics. The characteristics of
observations in different clusters are distinct. In classification we have
predefined classes or labels indicating the target class, but in clustering
there are no predefined classes or reference examples indicating the target
class. In clustering, the objects are grouped together based on their mutual
homogeneity. Sometimes, exploratory data analysis is used for identifying
clusters at an initial stage in the data mining process. The aim of clustering
is to sort data with similar traits into clusters while reducing the effective
size of the dataset. In clustering, observations in a dataset are grouped
together based on the distance between them. Observations that are not
placed in any of the clusters are called outliers. An outlier may be an error
or a genuine variation in the observations of a specific dataset. Clustering
algorithms may find and remove outliers to ensure that they perform better;
still, care must be taken when removing outliers. Outlier detection, or
outlier mining, is the process of identifying outliers in a set of data.
(Figure: clusters and outliers illustrated on the iris dataset.)
Partition methods
Partition methods are used to develop a subdivision of a given dataset
using a predetermined number K of non-empty data subsets. These are
iterative clustering models in which the similarity is derived from the
closeness of a data point to the centroid or medoid of its cluster.
Hierarchical methods
A hierarchical method is a type of connectivity model. It is based on the
fact that data points closer to each other in the data space are more similar
to each other than data points lying farther away. In this type of clustering
a predetermined number of clusters is not required. This type of clustering
supports both top-down and bottom-up approaches.
Grid methods
In grid-based clustering, the data space containing the observations is
divided into a grid-like structure consisting of a finite number of cells. A
grid is a multidimensional data structure used for achieving reduced
computing times.
While doing grid-based clustering, the following steps need to be followed:
● Manhattan distance measure
● Cosine distance measure
The general formula for an n-dimensional space with data points
p = (p1, …, pn) and q = (q1, …, qn) depends on the chosen measure; for
example, the Euclidean distance is given by
d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2 )
K-Means Algorithm
K-means is a centroid-based unsupervised learning algorithm used for
solving clustering problems. It is iterative in nature and divides an
unlabelled dataset into K different clusters of observations with similar
properties. It is very sensitive to outliers. The K-means algorithm takes as
input a dataset D, the number K of clusters to be generated, and a function
dist(xi, xk) that expresses the inhomogeneity, or distance, between each
pair of observations.
It assigns data points to clusters so that the sum of the squared distances
between each observation and the centroid of its cluster (i.e., the mean of
the observations in that cluster) is at a minimum.
Procedure for K-means algorithm
1. During the initialization phase, K observations are arbitrarily chosen in
D as the centroids of the clusters.
2. Each observation is iteratively assigned to the cluster whose centroid is
the most similar to the observation, in the sense that it minimizes the
distance from the record.
3. If no observation is assigned to a different cluster with respect to the
previous iteration, the algorithm stops.
4. For each cluster, the new centroid is computed as the mean of the
values of the observations belonging to the cluster, and then the
algorithm returns to step 2.
The calculation begins by arbitrarily selecting K observations that
represent the initial centroids; for example, the K points may be randomly
chosen among the m observations in D. At every succeeding iteration,
each record is assigned to the cluster whose centroid is the closest, that is,
the one that minimizes the distance from the observation among all
centroids. If no observation is reallocated to a cluster different from the
one to which it was assigned during the previous iteration, the procedure
stops, since any subsequent iteration cannot alter this subdivision into
clusters. Otherwise, the new centroids for every cluster are computed and
a new assignment is made.
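The procedure above can be written compactly in Python with NumPy. The sketch below is illustrative only: it uses random initial centroids and omits refinements such as k-means++ seeding or the handling of empty clusters.

    import numpy as np

    def k_means(D, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = D[rng.choice(len(D), size=k, replace=False)]      # step 1
        for _ in range(max_iter):
            # step 2: assign each observation to the nearest centroid
            dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # step 4: recompute each centroid as the mean of its cluster
            new_centroids = np.array([D[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):                 # step 3: stop
                break
            centroids = new_centroids
        return labels, centroids

    data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centers = k_means(data, k=2)
    print(centers)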
Advantages and Disadvantages
Advantages of K-Means Algorithm
● K-means algorithm is simple, easy to understand and easy to
implement.
● It is efficient, as the time taken by K-means grows roughly linearly with
the number of data points.
● For well-separated, compact clusters it often performs as well as more
complex clustering algorithms.
Disadvantages of K-Means Algorithm
● The initial value of K has to be specified by the user.
● The final clusters depend on the initial choice of centroids, so the
algorithm may converge to a poor local solution.
● It is not suitable for discovering clusters that are not hyper-ellipsoids or
hyper-spheres.
● Minimum distance:
According to the criterion of minimum distance, also called the single
linkage criterion, the dissimilarity between two clusters is given by the
minimum distance among all pairs of observations such that one belongs
to the first cluster and the other to the second cluster, that is,
● Maximum distance:
According to the criterion of maximum distance, also called the complete
linkage criterion, the dissimilarity between two clusters is given by the
maximum distance among all pairs of observations such that one belongs
to the first cluster and the other to the second cluster, that is,
● Mean distance:
The mean distance criterion expresses the dissimilarity among clusters
through the mean of the distances among all pairs of observations
belonging to the 2 clusters, that is,
● Ward distance:
Ward's distance criterion, based on an analysis of variance of Euclidean
distances between observations, is somewhat more complex than the
criteria described above. It requires that the algorithm first calculate the
sum of the squared distances between all pairs of observations belonging
to a cluster. Then all pairs of clusters that could be merged in the current
iteration are considered, and for each pair the total variance is calculated
as the sum of the two within-cluster variances evaluated in the first step.
Finally, the pair of clusters associated with the minimum total variance is
merged. Methods based on the Ward distance tend to generate a large
number of clusters, each containing a few observations.
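These linkage criteria correspond directly to the "method" argument of SciPy's hierarchical clustering routines. The sketch below is illustrative, using random data, and simply shows how the choice of criterion changes the resulting clusters.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)
    for method in ("single", "complete", "average", "ward"):
        Z = linkage(X, method=method)                     # merge history
        labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
        print(method, labels)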
Hierarchical methods are divided into two main categories: agglomerative
methods and divisive methods. In divisive hierarchical clustering:
o The single initial cluster containing all the observations has to be divided
into two subsets such that the distance between the two resulting
clusters is maximised.
o Since there are 2^(m-1) - 1 possible partitions of the whole data set into
two nonempty disjoint subsets, this results in an exponential number
of operations already at the first iteration.
o To overcome this difficulty, at any given iteration a divisive hierarchical
algorithm usually determines for each cluster the two observations that
are furthest from each other and subdivides the cluster by assigning the
remaining records to the one or the other based on their proximity.
Procedure for Divisive Hierarchical Clustering Algorithm
1. In the initialization phase, all observations constitute a single cluster.
The distance between clusters therefore corresponds to the matrix D of
the distances between all pairs of observations.
2. The minimum distance between the clusters is then computed, and the
cluster Ch is subdivided into smaller clusters based on this distance,
thus deriving new clusters Ce, Cf, and so on.
3. If every observation has been placed in a cluster holding a single
observation, or if a stopping criterion is met, the procedure stops.
Otherwise, it is repeated from step 2.
Step III: Maximization or M-step
● The likelihood estimated in the previous step is now maximized, i.e., the
complete data (the observed data together with the estimates of the
missing data) is used to update the parameters by finding their most
likely values.
● It is used to update the hypothesis.
Step IV: Convergence
● In this it is checked whether the values are converging or not.
● If yes, then stop otherwise repeat Step II and III until convergence.
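The E-step / M-step / convergence loop can be sketched for a mixture of two one-dimensional Gaussians as below; the data and starting values are invented, and the code is a simplified illustration rather than a production EM implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

    mu = np.array([-1.0, 1.0])        # initial hypothesis for the two means
    sigma = np.array([1.0, 1.0])
    weights = np.array([0.5, 0.5])
    for _ in range(200):
        # E-step: responsibility of each component for each point
        dens = weights * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the parameters using the expected (complete) data
        nk = resp.sum(axis=0)
        new_mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk)
        weights = nk / len(x)
        if np.allclose(new_mu, mu, atol=1e-6):            # convergence check
            mu = new_mu
            break
        mu = new_mu
    print(mu, sigma, weights)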
9.5 QUIZ
1. Point out the wrong statement
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbour is same as k-means
4. _____ is a clustering procedure where all objects start out in one single
huge cluster. Later smaller clusters are formed by dividing this cluster.
a) Non-hierarchical clustering
b) Divisive clustering
c) Model-based clustering
d) Agglomerative clustering
7. Point out correct statement for single linkage hierarchical clustering.
a) we merge in the members of the clusters in each step, which provide the
smallest maximum pairwise distance.
b) the distance between two clusters is defined as the average distance
between each point in one cluster to every point in the other cluster.
c) we merge in each step the two clusters, whose two closest members
have the smallest distance.
11. The silhouette coefficient can have the value in the interval of ____.
a) [1,3]
b) [-1,0]
c) [-1,1]
d) [-3,3]
12. The sum of the horizontal and vertical components or the distance
between two points measured along axes at right angles is called
__________.
a) Euclidean distance measure
b) Squared Euclidean distance measure
c) Manhattan distance measure
d) Cosine distance measure
13. __________ is the process for identification of outliers in a set of data
a) Outlier definition
b) Outlier reduction
c) Outlier mining
d) Outlier collection
9.6 EXERCISE
1. What is clustering? Explain its types.
2. State and explain different clustering methods.
3. Explain k-means clustering algorithm with its advantages and
disadvantages.
4. Explain
a. Euclidean distance measure
b. Squared Euclidean distance measure
c. Manhattan distance measure
d. Cosine distance measure
5. Short note on:
a. Hierarchical clustering
b. Divisive clustering
6. Explain EM algorithm in detail
7. What are the distance measures associated with hierarchical
clustering?
8. Explain silhouette coefficient.
9.7 VIDEO LINKS
1. https://www.youtube.com/watch?v=CLKW6uWJtTc
2. https://www.youtube.com/watch?v=p3HbBlcXDTE
3. https://www.youtube.com/watch?v=ieMjGVYw9ag
4. https://www.youtube.com/watch?v=VMyXc3SiEqs
5. https://www.youtube.com/watch?v=7enWesSofhg
6. https://www.youtube.com/watch?v=EFhcDnw7RGY
7. https://www.youtube.com/watch?v=G_Ob1k28ZJo
8. https://www.youtube.com/watch?v=7e65vXZEv5Q
9. https://www.youtube.com/watch?v=g5e_r8dw3uc
10. https://www.youtube.com/watch?v=aOnKnLM4eok
Module VIII
Web mining and Text mining
10
TEXT MINING
Unit Structure
As we discussed above, the volume of information is expanding at
exponential rates. Today all institutes, companies, organizations, and
business ventures store their information electronically. A huge collection
of data is available on the internet and stored in digital libraries, database
repositories, and other textual sources such as websites, blogs, social
media networks, and e-mails. It is a difficult task to determine appropriate
patterns and trends to extract knowledge from this large volume of data.
Text mining is a part of data mining that extracts valuable textual
information from a text database repository. Text mining is a
multi-disciplinary field based on information retrieval, data mining, AI,
statistics, machine learning, and computational linguistics.
● Tokenization: It segments sentences into words, splitting at spaces,
commas, etc.
● Filtering: It removes words that carry no relevant content information,
such as articles, conjunctions, and prepositions. Words with very
frequent repetitions are also removed.
● Stemming: It is the process of transforming words to their stem, or
normalized form, by reducing them to basic word forms so that words
can be recognized by their roots. For example, "go" is the stem of goes,
going and gone.
● Lemmatization: It maps a word to its linguistically correct root, that is,
the base form of the verb. In this process, the first step is to understand
the context, then to find the POS of a word in a sentence, and finally to
identify the 'lemma'. For example, go is the lemma of goes, gone,
going, went.
● Linguistic processing: Involving Part-of-speech tagging (POS), Word
Sense Disambiguation (WSD) and semantic structure, it works as
follows:
Part-of-speech tagging: determines the linguistic category of a word by
assigning a word class to each token. There are eight classes: noun,
pronoun, adjective, verb, adverb, preposition, conjunction and interjection.
Word Sense Disambiguation (WSD): determines the intended meaning of
an ambiguous word in a text, e.g., resolving whether "bank" refers to a
river bank or a financial institution. Basically, it automatically assigns the
most suitable meaning to a polysemous word in a given context.
Semantic structure: Full parsing and partial parsing are the two known
methods for building semantic structures.
● Full Parsing: builds a full parse tree for a sentence, and sometimes
fails due to poor tokenizing, errors in POS tagging, unknown words,
incorrect sentence breaking, grammatical inaccuracy, and many more.
● Partial Parsing: Also known as chunking, it builds syntactic
constructs such as noun phrases and verb groups.
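These pre-processing steps can be tried out with the NLTK library. The sketch below is illustrative only: the sample sentence is invented, and the nltk.download calls fetch the language resources the functions need.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
        nltk.download(pkg, quiet=True)          # fetch required models once

    text = "The analysts were going through thousands of customer emails."
    tokens = nltk.word_tokenize(text.lower())                    # tokenization
    filtered = [t for t in tokens if t.isalpha()
                and t not in stopwords.words("english")]         # filtering
    stems = [PorterStemmer().stem(t) for t in filtered]          # stemming
    lemmas = [WordNetLemmatizer().lemmatize(t, pos="v")          # lemmatization
              for t in filtered]
    print(nltk.pos_tag(filtered))                                # POS tagging
    print(stems)
    print(lemmas)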
Text Transformation
After the process of feature selection, text transformation carries out
feature generation. Feature generation represents documents by the words
they contain and their occurrence counts, where the order of words is not
significant. Here feature selection is the process of choosing the subset of
significant features that are used in creating a model. It diminishes the
dimensionality by excluding redundant and unnecessary features.
Methods Used in Text Mining
Various techniques have been developed to solve text mining problems;
essentially, they retrieve the relevant information according to the
requirements of users. Building on information retrieval techniques, some
common methods are the following:
Term Based Method
2. Second component: It decides a conceptual ontological graph (COG)
that explains semantic structure.
3. Third component: It extracts major concepts on the basis of first-two
components in an attempt to construct feature vectors via
implementing standard vector space model.
With its ability to distinguish unnecessary terms from meaningful terms,
this model describes the meaning of a sentence and usually relies on NLP
methods.
Pattern Taxonomy Method
The Pattern based model performs better than any other pure data mining-
based method.
Under this method, documents are examined on the basis of patterns
where patterns are built in a taxonomy by applying a relation. Patterns can
be identified by employing data mining techniques including association
rule, frequent itemset mining, sequential and closed pattern mining.
In text mining, the knowledge identified in this way is important, but the
approach can be inefficient: some useful long patterns with high
specificity lack support, while many frequent short patterns are not really
useful (so-called misinterpreted patterns) and lead to ineffective
performance.
As a consequence, an effective pattern discovery process is required to
overcome the low-frequency and misinterpretation problems of text
mining. The pattern taxonomy method employs two procedures, "pattern
deploying" and "pattern evolving", to refine the discovered patterns.
2. Academic and Research Field
In the education field, different text mining tools and strategies are utilized
to examine educational patterns in a specific region or research field. The
main purpose of using text mining in the research field is to discover and
organize research papers and relevant material from various fields on one
platform. For this, k-means clustering and other strategies are used to
identify the properties of significant data. Student performance in various
subjects can also be assessed, and the way different qualities influence the
choice of subjects can be evaluated with this mining.
3. Life Science
Life science and health care industries are producing an enormous volume
of textual and mathematical data regarding patient records, sicknesses,
medicines, symptoms, and treatments of diseases, etc. It is a major issue to
filter data and relevant text to make decisions from a biological data
repository. The clinical records contain variable data which is
unpredictable, lengthy. Text mining can help to manage such kinds of
data. Text mining is also used in biomarker discovery, the pharmaceutical
industry, clinical trial analysis, clinical studies, and patent competitive
intelligence.
4. Social Media
Text mining can be used to analyze web-based media applications in order
to monitor and investigate online content such as plain text from internet
news, web journals, e-mails, blogs, etc. Text mining tools help to identify
and investigate the number of posts, likes, and followers on a web-based
media network. This kind of analysis shows how individuals respond to
various posts and news and how they spread. It also shows the behaviour
of people who belong to a specific age group and variations in likes and
views of the same post.
5. Business Intelligence
Text mining plays an important role in business intelligence by helping
different organizations and enterprises analyze their customers and
competitors in order to make better decisions. It gives an accurate
understanding of the business and provides information on how to improve
customer satisfaction and gain competitive benefits. Text mining tools such
as IBM text analytics support this kind of analysis.
The text mining market has experienced exponential growth and adoption
over the last few years and is also expected to gain significant growth and
adoption in the coming future. One of the primary reasons behind the
adoption of text mining is higher competition in the business market; many
organizations are seeking value-added solutions to compete with other
organizations. With increasing competition in business and changing
customer perspectives, organizations are making huge investments to find
a solution that is capable of analyzing customer and competitor data to
improve competitiveness. The primary sources of data are e-commerce
websites, social media platforms, published articles, surveys, and many
more. The larger part of the generated data is unstructured, which makes it
challenging and expensive for organizations to analyze it manually. This
challenge, combined with the exponential growth in data generation, has
led to the growth of analytical tools that are not only able to handle large
volumes of text data but also help in decision-making purposes. Text
mining software empowers a user to draw useful information from a huge
set of available data sources.
Information Extraction:
The automatic extraction of structured data such as entities, entity
relationships, and attributes describing entities from an unstructured
source is called information extraction.
Data Mining:
Data mining refers to the extraction of useful data, hidden patterns from
large data sets. Data mining tools can predict behaviors and future trends
that allow businesses to make a better data-driven decision. Data mining
tools can be used to resolve many business problems that have
traditionally been too time-consuming.
Information Retrieval:
Information retrieval deals with retrieving useful data from the data that is
stored in our systems. Alternatively, as an analogy, we can view the search
engines used on websites such as e-commerce sites or any other sites as
part of information retrieval.
Text Mining Process:
The text mining process incorporates the following steps to extract the
data from the document.
Text transformation
A text transformation is a technique that is used to control the
capitalization of the text.
Here, the two major ways of document representation are given:
● Bag of words
● Vector Space
Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining,
Natural Language Processing (NLP), and information retrieval(IR). In the
field of text mining, data pre-processing is used for extracting useful
information and knowledge from unstructured text data. Information
Retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfil the user's need.
Feature selection:
Feature selection is a significant part of data mining. Feature selection can
be defined as the process of reducing the input of processing or finding the
essential information sources. The feature selection is also called variable
selection.
Data Mining:
Now, in this step, the text mining procedure merges with the conventional
process. Classic data mining techniques are applied to the resulting
structured database.
Evaluate:
Afterward, the results are evaluated. Once a result has been evaluated, it
is discarded.
Applications:
These are the following text mining applications:
Risk Management:
Risk Management is a systematic and logical procedure of analyzing,
identifying, treating, and monitoring the risks involved in any action or
process in organizations. Insufficient risk analysis is usually a leading
cause of failure. This is particularly true in financial organizations, where
adoption of risk management software based on text mining technology
can effectively enhance the ability to diminish risk. It enables the
administration of millions of sources and petabytes of text documents, and
gives the ability to connect the data. It helps to access the appropriate data
at the right time.
Business Intelligence:
Companies and business firms have started to use text mining strategies as
a major aspect of their business intelligence. Besides providing significant
insights into customer behaviour and trends, text mining strategies also
help organizations analyze the strengths and weaknesses of their
competitors, giving them a competitive advantage in the market.
1. Keyword-based Association Analysis:
It collects sets of keywords or terms that often occur together and then
discovers the association relationships among them. First, it pre-processes
the text data by parsing, stemming, removing stop words, etc. Once the
data has been pre-processed, it applies association mining algorithms.
Here, human effort is not required, so the number of unwanted results and
the execution time are reduced.
Numericizing text:
Stemming algorithms
A significant pre-processing step before indexing of input documents is
the stemming of words. The term "stemming" can be defined as a
reduction of words to their roots, so that, for example, different
grammatical forms of a word are treated as the same term. The primary
purpose of stemming is to ensure that similar words are recognized as the
same word by the text mining program.
Support for different languages:
There are some highly language-dependent operations such as stemming,
synonyms, the letters that are allowed in words. Therefore, support for
various languages is important.
10.2.1 Information retrieval
Information retrieval (IR) is a field that has been developing in parallel
with database systems for many years. Unlike the field of database
systems, which has targeted query and transaction processing of structured
data, information retrieval is concerned with the organization and retrieval
of data from multiple text-based documents.
Since information retrieval and database systems each handle different
kinds of data, some database system problems are usually not present in
information retrieval systems, such as concurrency control, recovery,
transaction management, and update. There are some common information
retrieval problems that are usually not encountered in traditional database
systems, such as unstructured documents, approximate search based on
keywords, and the notion of relevance.
Because of the abundance of text data, information retrieval has
discovered several applications. There exist several information retrieval
systems, including online library catalog systems, online records
management systems, and the more currently developed Web search
engines.
A general data retrieval problem is to locate relevant documents in a
document set depending on a user’s query, which is often some keywords
defining an information need, although it can also be an example of
relevant records.
This is most suitable when a user has some ad hoc (i.e., short-term) data
need, such as finding information to buy a used car. When a user has a long-
term data need (e.g., a researcher’s interests), a retrieval system can also
take the initiative to “push” any newly arrived data elements to a user if
the element is judged as being relevant to the user’s data need.
There are two basic measures for assessing the quality of text retrieval
which are as follows −
Precision − This is the percentage of retrieved data that are actually
relevant to the query (i.e., “correct” responses). It is formally represented
as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall − This is the percentage of records that are relevant to the query
and were actually retrieved. It is formally represented as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
An information retrieval system is often required to trade-off recall for
precision or vice versa. There is one generally used trade-off is the F-
score, which is represented as the harmonic mean of recall and precision −
F-score = (recall × precision) / [(recall + precision) / 2]
The harmonic mean penalizes a system that sacrifices one measure for the
other too heavily. Precision, recall, and F-score are the basic measures of a
retrieved collection of records. These three measures are not generally
useful for comparing two ranked lists of documents, because they are not
sensitive to the internal ranking of the documents in a retrieved set.
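As a minimal sketch, the three measures can be computed for a single query by treating the relevant and retrieved documents as sets of identifiers; the document IDs below are made up.

    def retrieval_scores(relevant, retrieved):
        # precision, recall and F-score for one query, given two sets of doc IDs
        hits = len(relevant & retrieved)
        precision = hits / len(retrieved)
        recall = hits / len(relevant)
        f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
        return precision, recall, f_score

    print(retrieval_scores({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d5"}))
    # roughly (0.667, 0.5, 0.571) for this made-up example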
The most popular approach of this method is the vector space model. The
basic idea of the vector space model is the following: It can represent a
document and a query both as vectors in a high-dimensional space
corresponding to all the keywords and use an appropriate similarity
measure to evaluate the similarity among the query vector and the record
vector. The similarity values can then be used for ranking documents.
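A small sketch of the vector space model using scikit-learn is given below: documents and a query are mapped to TF-IDF vectors over the keyword space and ranked by cosine similarity. The three documents and the query are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["data mining extracts patterns from data",
            "web mining analyses web usage logs",
            "business intelligence supports decision making"]
    query = ["mining patterns in data"]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)     # documents in keyword space
    query_vector = vectorizer.transform(query)       # query in the same space

    scores = cosine_similarity(query_vector, doc_vectors)[0]
    for score, doc in sorted(zip(scores, docs), reverse=True):
        print(round(score, 3), doc)                  # documents ranked by similarity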
11
WEB MINING
Unit Structure
11.1 Web Mining
11.2 Web content
11.3 Web structure
11.4 Web usage
Comparison Between Data mining and Web mining:
● Access: Data mining accesses data privately, whereas web mining
accesses data publicly.
● Problem Type: Data mining deals with clustering, classification,
regression, prediction, optimization and control, whereas web mining
deals with web content mining and web structure mining.
● Skills: Data mining includes approaches for data cleansing, machine
learning algorithms, and statistics and probability, whereas web mining
includes application-level knowledge and data engineering with
mathematical modules like statistics and probability.
This includes the usage of common content, terminology, and positioning;
consistent navigation; link management; and finally, metadata application.
There is a wide range of WCM tools available for effectively handling
web content.
1. Blogs
Blogging is an invaluable tool for driving visitors to your website, and
building awareness about you and your brand.
Generally written from a more personal and informal point of view than
content assets, a blog is a great way to connect with readers. It is the
perfect vehicle for providing them with information that not only answers
a question or solves a problem, but also helps to establish you as a trusted
authority on the topic.
Blogs are also a great way to keep your web content fresh, enabling you to
post new content on a regular basis and helping you continue to rank in
SERPs (search results).
2. Content assets
This broad category of web content includes collateral and similar
resources you have already invested in and can now repurpose to help
draw visitors to your website.
Some examples are product brochures, user manuals, slide presentations,
white papers, industry reports, case studies, fact sheets, ebooks, webinars,
and podcasts.
The goal is to extend the value of these assets by using them across
different digital media and channels. The content can be broken up into
smaller pieces and distributed in new ways, such as via blog posts, tweets,
video clips, email blasts, search engine ads, and other channels.
3. Calls to action
A call to action (CTA) is a prompt designed to get your website visitor to
take some immediate action, such as make a purchase or get more
information.
In addition to having CTAs on your web pages, you can include them in
other marketing content you use to drive traffic to your website, such as
blogs, emails, social media posts, and e-newsletters.
Some common prompts:
● Apply today
● Book now
● Contact us
● Download for free
● Get a quote
● Join today
● Learn more
● Order now
● Register today
● Shop online and save
A CTA may take your web visitor to a landing page for further action.
Whatever your CTA is, it is important that the intent is clear and your
audience has a good idea of what to expect. After all, you don't want to lose
visitors by having them click on a link that takes them somewhere they
really don't want to go.
4. Landing pages
Landing pages are destinations — the web pages where visitors are sent
when they click on a hyperlink, such as a search engine result, a social
media ad, a CTA, or a special offer on your website.
These pages are designed to help you convert website visitors into leads
by providing a way to capture their contact information.
For example, suppose you want to build your authority as an SME by
offering a free white paper to your website visitors. When they click on
the offer link, it can take them to a landing page where the content of
white paper is described in more detail and they can download the paper
by submitting an email address.
5. Testimonials
One of the best ways to appeal to prospects and build credibility is with
relatable success stories from their peers. That is what makes customer
testimonials such valuable web content.
Whether your goal is to create formal case studies, include real-life
customer scenarios in a white paper, or post short video clips on Twitter or
Facebook, having a process in place to identify happy customers and
capture their feedback is a great idea.
TIP: Don’t hide all your valuable customer feedback on one testimonials
page. Include testimonials throughout your site to serve as social proof
that validates your claims.
7. Visual content
According to the Social Science Research Network, 65% of people are
visual learners. So, it makes good sense to incorporate visual web content
into your website.
In addition to having a graphic design that helps to convey the flavor and
purpose of your brand, you can:
● Use images — preferably original ones — to break up and enhance the
text
● Create videos to entertain and inform
● Reiterate key information in a concise way through infographics
● Create your own memes to make important messages more memorable
● Offer presentations for visitors who want details in a more graphic,
bulleted format
● Include screenshots to clearly show things that may be difficult to
explain in words
The ideal website structure looks like a pyramid, starting with the home
page at the top, then categories, subcategories, and individual posts and
pages.
● Home page – The home page is at the top of the pyramid. It acts as
a hub for the visitors of your site. Designers should link to critical or
popular pages from the home page. In doing so, designers will be able to
more easily guide users to the most important pages.
● Categories – Categorization is a valuable part of a website’s
structure. Designers can help users make decisions faster and easier with
good categorization. Designers can use categories to reduce the amount of
time spent considering a decision.
● Subcategories – These play a major role in defining a website’s
structure. For example, online marketplaces like eBay and Amazon have a
nearly unfathomable number of pages. It would be easy for a user to get
lost in the information provided. Subcategories provide a structured
methodology for browsing and categorizing information in a meaningful
manner, especially for websites with complex data.
● Individual posts and pages – Individual posts and pages are the
basic elements of a website. Designers should focus on how to create a
meaningful information hierarchy within every page, so the user has less
to consider when it comes to consuming content.
Hierarchical model
Matrix model
The matrix model of a web structure lets users choose where they want to
go next.
The matrix model is one of the oldest site structure types on the internet.
This model is unique and non-traditional in its behavior. A matrix-type
structure gives users options to choose where they want to go next. These
types of sites are best navigated via search or internal links.
Database model
Such data can include user-browsing sequences of web pages held in the
internet server buffer. With the help of such weblog records, studies have
been conducted on analyzing system performance, improving system
design through web caching, web page prefetching, and web page
swapping; understanding the nature of web traffic; and understanding
customer reaction and motivation.
For instance, some studies have proposed adaptive sites − websites that
improve themselves by learning from user access patterns. Weblog
analysis can also help construct customized web services for individual
users.
Web mining is distinctive in that it deals with multiple data types. The web
has several facets that yield multiple approaches for the mining procedure:
web pages consist of text, web pages are linked via hyperlinks, and user
activity can be monitored via web server logs.
There are various rules of web usage mining which are as follows −
Preprocessing − The web usage log is not in a format that is accessible by
mining applications. For some data to be used in a mining application, the
data can be required to be reformatted and cleansed. There are some issues
specifically related to the use of weblogs. There are some steps included in
the processing phase include cleansing, user identification, session
identification, path completion, and formatting.
Data structure − Several unique data structures have been proposed to keep
track of patterns identified during the web usage mining process. A basic
data structure that is used is a tree, in this case a rooted tree where each
path from the root to a leaf represents a sequence. Trees can store strings
for pattern matching applications; the only problem with trees is the amount
of space they require.
Types of Web Usage Mining based upon the Usage Data:
1. Web Server Data: The web server data generally includes the IP
address, browser logs, proxy server logs, user profiles, etc. The user logs
are being collected by the web server data.
2. Application Server Data: An added feature of commercial application
servers is the ability to build applications on them. Application server data
mainly consists of various business events that are tracked and logged into
the application server logs.
3. Application-level data: There are various new kinds of events that can
be there in an application. The logging feature enabled in them helps us
get the past record of the events.
Advantages of Web Usage Mining
● Government agencies benefit from this technology in combating
terrorism.
● Predictive capabilities of mining tools have helped identify various
criminal activities.
● Customer relationships are better understood by the company with the
aid of these mining tools, which helps them satisfy the needs of the
customer faster and more efficiently.
Disadvantages of Web Usage Mining
● Privacy stands out as a major issue. Analyzing data for the benefit of
customers is good, but using the same data for something else can be
dangerous. Using it without the individual's knowledge can pose a big
threat to the company.
● If a data mining company does not maintain high ethical standards, two
or more attributes can be combined to obtain personal information
about the user, which again is not respectful of the user's privacy.
3. Clustering: Clustering is a technique to group together a set of things
having similar features/traits. There are mainly 2 types of clusters- the first
one is the usage cluster and the second one is the page cluster. The
clustering of pages can be readily performed based on the usage data. In
usage-based clustering, items that are commonly accessed /purchased
together can be automatically organized into groups. The clustering of
users tends to establish groups of users exhibiting similar browsing
patterns. In page clustering, the basic concept is to get information quickly
over the web pages.
Applications of Web Usage Mining
1. Personalization of Web Content: The World Wide Web has a lot of
information and is expanding very rapidly day by day. The big problem is
that the specific needs of people are increasing every day, and they often
do not get exactly the results they query for. A solution to this is web
personalization. Web personalization may be defined as catering to the
user's needs based upon tracking of their navigational behaviour and
interests. Web personalization includes recommender systems, check-box
customization, etc. Recommender systems are popular and are used by
many companies.