
S.Y.

MCA
SEMESTER - IV (CBCS)

DATA MINING AND


BUSINESS INTELLIGENCE

SUBJECT CODE : MCA401


© UNIVERSITY OF MUMBAI

Prof. Suhas Pednekar


Vice-Chancellor,
University of Mumbai,

Prof. Ravindra D. Kulkarni
Pro Vice-Chancellor,
University of Mumbai

Prof. Prakash Mahanwar
Director,
IDOL, University of Mumbai

Programme Co-ordinator : Shri Mandar Bhanushe


Asst. Prof. cum Asst. Director in Mathematics,
IDOL, University of Mumbai, Mumbai

Course Co-ordinator : Mrs. Reshama Kurkute


Asst. Professor, Dept. of MCA
IDOL, University of Mumbai, Mumbai

Course Writers : Dr. Rakhee Yadav


Assistant Professor,
Somaiya College, Vidyavihar

: Mr. Sandeep Kamble


Assistant Professor,
Cosmopolitan’s Valia College

: Dr. D. S. Rao
Professor (CSE) & Associate Dean
(Student Affairs),
Koneru Lakshmaiah Education Foundation.
KL H (Deemed to be University)
Hyderabad Campus Hyderabad,Telangana

: Dr. R. Saradha
Assistant Professor,
SDNB Vaishnav College For Women,
Chennai

: Ms.Gauri Ulhas Ansurkar


Lecturer,
KSD’s Model College (Autonomous)
April 2021, Print - I

Published by : Director,
Institute of Distance and Open Learning,
University of Mumbai,
Vidyanagari, Mumbai - 400 098.

DTP Composed by : ipin Enterprises,
Tantia Jogani Industrial Estate, Unit No. 2,
Ground Floor, Sitaram Mill Compound,
J.R. Boricha Marg, Mumbai - 400 011

Printed by : Mumbai University Press,
Vidyanagari, Santacruz (E), Mumbai
CONTENTS
Unit No. Title Page No.

Module I
1. Introduction and Overview of BI 01

Module II
2. Data preparation 16
3. Optimization Technique 37

Module III
4. BI Using Data Warehousing 50
5. Data mart 76

Module IV
6. Data Mining and Preprocessing 107

Module V
7. Association Rule Mining 122

Module VI
8. Classification and Prediction 140

Module VII
9. Clustering 170

Module VIII
10. Text Mining 188
11. Web Mining 205


S.Y. MCA
SEMESTER - IV (CBCS)

DATA MINING AND BUSINESS INTELLIGENCE


Syllabus

1. Business Intelligence (06 Hrs): Introduction and overview of BI - Effective and timely decisions, Data, Information and knowledge, BI Architecture, Ethics and BI. BI Applications - Balanced score card, Fraud detection, Telecommunication Industry, Banking and finance, Market segmentation.

2. Prediction methods and models for BI (06 Hrs): Data preparation, Prediction methods - Mathematical method, Distance methods, Logic method, Heuristic method - local optimization technique, stochastic hill climber, evaluation of models.

3. BI using Data Warehousing (08 Hrs): Introduction to DW, DW architecture, ETL Process, Top-down and bottom-up approaches, characteristics and benefits of data mart, Difference between OLAP and OLTP. Dimensional analysis - Define cubes. Drill-down and roll-up, slice and dice or rotation, OLAP models - ROLAP and MOLAP. Define Schemas - Star, snowflake and fact constellations.

4. Data Mining and Preprocessing (06 Hrs): Data mining - definition and functionalities, KDD Process, Data Cleaning: Missing values, Noisy data, data integration and transformations. Data Reduction: Data cube aggregation, dimensionality reduction - data compression, Numerosity reduction - discretization and concept hierarchy.

5. Associations and Correlation (06 Hrs): Association rule mining: support and confidence and frequent item sets, market basket analysis, Apriori algorithm, Incremental ARM, Associative classification - Rule Mining.

6. Classification and Prediction (08 Hrs): Introduction, Classification methods: Decision Tree - ID3, CART, Bayesian classification - Bayes' theorem (Naïve Bayesian classification), Linear and nonlinear regression.

7. Clustering (08 Hrs): Introduction, categorization of major clustering methods: Partitioning methods - K-Means. Hierarchical - Agglomerative and divisive methods, Model-based - Expectation and Maximization.

8. Web mining and Text mining (04 Hrs): Text data analysis and Information retrieval, text retrieval methods, dimensionality reduction for text. Web Mining: web content, web structure, web usage.
Module I
Business Intelligence

1
INTRODUCTION AND OVERVIEW OF BI
Unit Structure
1.0 Objectives
1.1 Introduction
1.2 An Overview
1.2.1 Effective and timely decisions
1.2.2 Data Information and knowledge
1.2.3 BI Architecture
1.2.4 Ethics and BI
1.3 BI Applications
1.4 List of References
1.5 Unit End Exercises

1.0 OBJECTIVES
After going through this unit, you will be able to:
● To learn how to design and build a Business Intelligence solution
● To understand the concept of Business Intelligence
● To understand the basics of design and management of BI systems

1.1 INTRODUCTION
The advent of low-cost data storage technologies and the wide availability
of Internet connections have made it easier for individuals and
organizations to access large amounts of data. Such data are often
heterogeneous in origin, content and representation, as they include
commercial, financial and administrative transactions, web navigation
paths, emails, texts and hypertexts, and the results of clinical tests, to name
just a few examples. Their accessibility opens up promising scenarios and
opportunities, and raises an enticing question: is it possible to convert such
data into information and knowledge that can then be used by decision
makers to aid and improve the governance of enterprises and of public
administration? Business intelligence may be defined as a set of
mathematical models and analysis methodologies that exploit the available
data to generate information and knowledge useful for complex decision-
making processes. This opening chapter will describe in general terms the
problems entailed in business intelligence, highlighting the
interconnections with other disciplines and identifying the primary
components typical of a business intelligence environment.
In complex organizations, public or private, decisions are made on a
continual basis. Such decisions may be more or less critical, have long- or
short-term effects and involve people and roles at various hierarchical
levels. The ability of these knowledge workers to make decisions, both as
individuals and as a community, is one of the primary factors that
influence the performance and competitive strength of a given
organization.
Individuals and companies may now access enormous amounts of data
more easily owing to the introduction of low-cost data storage
technologies and the widespread availability of Internet connections.
Commercial, financial, and administrative transactions, web navigation
patterns, emails, texts and hypertexts, and the results of clinical testing, to
name a few instances, are all examples of data that are heterogeneous in
origin, content, and representation. Their accessibility opens up exciting
situations and opportunities, as well as the intriguing question of whether
such data can be converted into information and knowledge that can be
used by decision makers to help and improve corporate and government
governance. Business intelligence is a set of mathematical models and
analysis procedures that employ existing data to develop information and
knowledge that may be used in complex decision-making processes. This
first chapter will outline the difficulties that business intelligence entails,
as well as the links with other disciplines and the major components that
make up a business intelligence ecosystem. Examples 1.1 and 1.2 illustrate
two highly complex decision-making processes in rapidly changing
conditions.
Example 1.1 – Retention in the mobile phone industry. The marketing
manager of a mobile phone company realizes that a large number of
customers are discontinuing their service, leaving her company in favour
of some competing provider. As can be imagined, low customer loyalty,
also known as customer attrition or churn, is a critical factor for many
companies operating in service industries. Suppose that the marketing
manager can rely on a budget adequate to pursue a customer retention
campaign aimed at 2000 individuals out of a total customer base of 2
million people. Hence, the question naturally arises of how she should go
about choosing those customers to be contacted so as to optimize the
effectiveness of the campaign. In other words, how can the probability that
each single customer will discontinue the service be estimated so as to
target the best group of customers and thus reduce churning and maximize
customer retention? By knowing these probabilities, the target group can
be chosen as the 2000 people having the highest churn likelihood among
the customers of high business value. Without the support of advanced
mathematical models and data mining techniques, it would be arduous to
derive a reliable estimate of the churn probability and to determine the
best recipients of a specific marketing campaign.
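As a rough illustration of the reasoning in Example 1.1 (not part of the original example), the sketch below scores a customer base with an already fitted classification model and keeps the 2000 customers with the highest estimated churn probability. The column names, the `model` object and the use of Python with pandas and a scikit-learn-style classifier are assumptions made only for illustration.

import pandas as pd

def select_campaign_targets(customers: pd.DataFrame, model, feature_cols, k=2000):
    # Estimated probability of discontinuing the service for each customer.
    scores = model.predict_proba(customers[feature_cols])[:, 1]
    scored = customers.assign(churn_score=scores)
    # Highest churn likelihood first; in practice the ranking would also
    # take the customers' business value into account, as the text notes.
    return scored.sort_values("churn_score", ascending=False).head(k)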
Example 1.2 – Logistics planning. The logistics manager of a
manufacturing company wishes to develop a medium-term logistic-
production plan. This is a decision-making process of high complexity
which includes, among other choices, the allocation of the demand
originating from different market areas to the production sites, the
procurement of raw materials and purchased parts from suppliers, the
production planning of the plants and the distribution of end products to
market areas. In a typical manufacturing company this could well entail
tens of facilities, hundreds of suppliers, and thousands of finished goods
and components, over a time span of one year divided into weeks. The
magnitude and complexity of the problem suggest that advanced
optimization models are required to devise the best logistic plan.
Optimization models allow highly complex and large-scale problems to be
tackled successfully within a business intelligence framework.

1.2.1 EFFECTIVE AND TIMELY DECISIONS


Decisions are made on a regular basis in complicated organisations,
whether public or private. These decisions might be crucial or not, have
long- or short-term consequences, and include individuals and jobs at
various levels of the hierarchy. One of the key variables that influences the
performance and competitive strength of a given business is the ability of
these knowledge workers to make decisions, both individually and
collectively. Most knowledge workers reach their decisions primarily
using easy and intuitive methodologies, which take into account specific
elements such as experience, knowledge of the application domain and the
available information. This approach leads to a stagnant decision-making
style which is inappropriate for the unstable conditions determined by
frequent and rapid changes in the economic environment. Indeed,
decision-making processes within today’s organizations are often too
complex and dynamic to be effectively dealt with through an intuitive
approach, and require instead a more rigorous attitude based on analytical
methodologies and mathematical models.
The main purpose of business intelligence systems is to provide
knowledge workers with tools and methodologies that allow them to make
effective and timely decisions.
Effective decisions: The application of rigorous analytical methods allows
decision makers to rely on information and knowledge which are more
dependable. As a result, they are able to make better decisions and devise
action plans that allow their objectives to be reached in a more effective
way. Indeed, turning to formal analytical methods forces decision makers
to explicitly describe both the criteria for evaluating alternative choices
and the mechanisms regulating the problem under investigation.
Furthermore, the ensuing in-depth examination and thought lead to a
deeper awareness and comprehension of the underlying logic of the
decision-making process.
Timely decisions: Enterprises operate in economic environments
characterized by growing levels of competition and high dynamism. As a
consequence, the ability to rapidly react to the actions of competitors and
to new market conditions is a critical factor in the success or even the
survival of a company.

Figure 1.1: Advantages of Business Intelligence


Figure 1.1 illustrates the major benefits that a given organization may
draw from the adoption of a business intelligence system. When facing
problems such as those described in Examples 1.1 and 1.2 above, decision
makers ask themselves a series of questions and develop the
corresponding analysis. Hence, they examine and compare several
options, selecting among them the best decision, given the conditions at
hand. If decision makers can rely on a business intelligence system
facilitating their activity, we can expect that the overall quality of the
decision-making process will be greatly improved. With the help of
mathematical models and algorithms, it is actually possible to analyse a
larger number of alternative actions, achieve more accurate conclusions
and reach effective and timely decisions. We may therefore conclude that
the major advantage deriving from the adoption of a business intelligence
system is found in the increased effectiveness of the decision-making
process.

1.2.2 DATA INFORMATION AND KNOWLEDGE


The information systems of both public and commercial enterprises have
amassed massive amounts of data. These data come from both internal and
external sources, including administrative, logistical, and commercial
transactions. These data, even if collected and maintained in a systematic
and organised manner, cannot be used directly for decision-making
reasons. They must be processed using proper extraction techniques and
analytical processes capable of changing them into information and
knowledge that decision-makers can use.
Data: Generally, data represent a structured codification of single primary
entities, as well as of transactions involving two or more primary entities.
For example, for a retailer data refer to primary entities such as customers,
points of sale and items, while sales receipts represent the commercial
transactions.
Information: Information is the outcome of extraction and processing
activities carried out on data, and it appears meaningful for those who
receive it in a specific domain. For example, to the sales manager of a
retail company, the proportion of sales receipts in the amount of over €100
per week, or the number of customers holding a loyalty card who have
reduced by more than 50% the monthly amount spent in the last three
months, represent meaningful pieces of information that can be extracted
from raw stored data.
Knowledge: Information is transformed into knowledge when it is used to
make decisions and develop the corresponding actions. Therefore, we can
think of knowledge as consisting of information put to work into a specific
domain, enhanced by the experience and competence of decision makers
in tackling and solving complex problems. For a retail company, a sales
analysis may detect that a group of customers, living in an area where a
competitor has recently opened a new point of sale, have reduced their
usual amount of business. The knowledge extracted in this way will
eventually lead to actions aimed at solving the problem detected, for
example by introducing a new free home delivery service for the
customers residing in that specific area.

1.2.3 BI ARCHITECTURE

Figure 1.2: Architecture of a business intelligence system


The architecture of a business intelligence system, depicted in Figure 1.2,
includes three major components.

Data sources: In a first stage, it is necessary to gather and integrate the
data stored in the various primary and secondary sources, which are
heterogeneous in origin and type. The sources consist for the most part of
data belonging to operational systems, but may also include unstructured
documents, such as emails and data received from external providers.
Generally speaking, a major effort is required to unify and integrate the
different data sources.
Data warehouses and data marts: Using extraction and transformation
tools known as extract, transform, load (ETL), the data originating from
the different sources are stored in databases intended to support business
intelligence analyses. These databases are usually referred to as data
warehouses and data marts.
Business intelligence methodologies: Data are finally extracted and used
to feed mathematical models and analysis methodologies intended to
support decision makers. In a business intelligence system, several
decision support applications may be implemented.
Several public and private enterprises and organizations have developed in
recent years formal and systematic mechanisms to gather, store and share
their wealth of knowledge, which is now perceived as an invaluable
intangible asset. The activity of providing support to knowledge workers
through the integration of decision-making processes and enabling
information technologies is usually referred to as knowledge management.
It is apparent that business intelligence and knowledge management share
some degree of similarity in their objectives. The main purpose of both
disciplines is to develop environments that can support knowledge
workers in decision-making processes and complex problem-solving
activities. To draw a boundary between the two approaches, we may
observe that knowledge management methodologies primarily focus on
the treatment of information that is usually unstructured, at times implicit,
contained mostly in documents, conversations and past experience.
Conversely, business intelligence systems are based on structured
information, most often of a quantitative nature and usually organized in a
database. However, this distinction is a somewhat fuzzy one: for example,
the ability to analyse emails and web pages through text mining methods
progressively induces business intelligence systems to deal with
unstructured information.


Figure 1.3: Components of Business Intelligence


The pyramid in Figure 1.3 shows the building blocks of a business
intelligence system. So far, we have seen the components of the first two
levels when discussing Figure 1.2. We now turn to the description of the
upper tiers.
Data exploration. At the third level of the pyramid, we find the tools for
performing a passive business intelligence analysis, which consist of query
and reporting systems, as well as statistical methods. These are referred to
as passive methodologies because decision makers are requested to
generate prior hypotheses or define data extraction criteria, and then use
the analysis tools to find answers and confirm their original insight. For
instance, consider the sales manager of a company who notices that
revenues in a given geographic area have dropped for a specific group of
customers. Hence, she might want to bear out her hypothesis by using
extraction and visualization tools, and then apply a statistical test to verify
that her conclusions are adequately supported by data. Statistical
techniques for exploratory data analysis will be described in later chapters.
Data mining. The fourth level includes active business intelligence
methodologies, whose purpose is the extraction of information and
knowledge from data. These include mathematical models for pattern
recognition, machine learning and data mining techniques, which will be
dealt with in Part II of this book. Unlike the tools described at the previous
level of the pyramid, the models of an active kind do not require decision
makers to formulate any prior hypothesis to be later verified. Their
purpose is instead to expand the decision makers’ knowledge.

Optimization. By moving up one level in the pyramid we find
optimization models that allow us to determine the best solution out of a
set of alternative actions, which is usually fairly extensive and sometimes
even infinite. Example 1.3 shows a typical field of application of
optimization models. Other optimization models applied in marketing and
logistics will be described in later chapters.
Decisions. Finally, the top of the pyramid corresponds to the choice and
the actual adoption of a specific decision, and in some way represents the
natural conclusion of the decision-making process. Even when business
intelligence methodologies are available and successfully adopted, the
choice of a decision pertains to the decision makers, who may also take
advantage of informal and unstructured information available to adapt and
modify the recommendations and the conclusions achieved through the
use of mathematical models. As we progress from the bottom to the top of
the pyramid, business intelligence systems offer increasingly more
advanced support tools of an active type. Even roles and competencies
change. At the bottom, the required competencies are provided for the
most part by the information systems specialists within the organization,
usually referred to as database administrators. Analysts and experts in
mathematical and statistical models are responsible for the intermediate
phases.
Finally, the activities of decision makers responsible for the application
domain appear dominant at the top. As described above, business
intelligence systems address the needs of different types of complex
organizations, including agencies of public administration and association.

Figure 1.4: Departments of an enterprise concerned with business intelligence


Figure 1.5 Cycle of business intelligence


Each business intelligence analysis follows its own path according to the
application domain, the personal attitude of the decision makers and the
available analytical methodologies. However, it is possible to identify an
ideal cyclical path characterizing the evolution of a typical business
intelligence analysis, as shown in Figure 1.5, even though differences still
exist based upon the peculiarity of each specific context. The phases are as
follows:
Analysis. During the analysis phase, it is necessary to recognize and
accurately spell out the problem at hand. Decision makers must then create
a mental representation of the phenomenon being analysed, by identifying
the critical factors that are perceived as the most relevant. The availability
of business intelligence methodologies may help already in this stage, by
permitting decision makers to rapidly develop various paths of
investigation. For instance, the exploration of data cubes in a
multidimensional analysis, according to different logical views, allows
decision makers to modify their hypotheses flexibly and rapidly, until they
reach an interpretation scheme that they deem satisfactory. Thus, the first
phase in the business intelligence cycle leads decision makers to ask
several questions and to obtain quick responses in an interactive way.
Insight. The second phase allows decision makers to better and more
deeply understand the problem at hand, often at a causal level. For
instance, if the analysis carried out in the first phase shows that a large
number of customers are discontinuing an insurance policy upon yearly
expiration, in the second phase it will be necessary to identify the profile
and characteristics shared by such customers. The information obtained
through the analysis phase is then transformed into knowledge during the
insight phase. On the one hand, the extraction of knowledge may occur
due to the intuition of the decision makers and therefore be based on their
experience and possibly on unstructured information available to them. On
the other hand, inductive learning models may also prove very useful
during this stage of analysis, particularly when applied to structured data.

Decision. During the third phase, knowledge obtained as a result of the
insight phase is converted into decisions and subsequently into actions.
The availability of business intelligence methodologies allows the analysis
and insight phases to be executed more rapidly so that more effective and
timely decisions can be made that better suit the strategic priorities of a
given organization. This leads to an overall reduction in the execution time
of the analysis–decision–action– revision cycle, and thus to a decision-
making process of better quality.
Evaluation. Finally, the fourth phase of the business intelligence cycle
involves performance measurement and evaluation. Extensive metrics
should then be devised that are not exclusively limited to the financial
aspects but also take into account the major performance indicators
defined for the different company departments.
ENABLING FACTORS IN BUSINESS INTELLIGENCE PROJECTS
Some factors are more critical than others to the success of a business
intelligence project: technologies, analytics and human resources.
Technologies. Hardware and software technologies are significant
enabling factors that have facilitated the development of business
intelligence systems within enterprises and complex organizations. On the
one hand, the computing capabilities of microprocessors have increased on
average by 100% every 18 months during the last two decades, and prices
have fallen. This trend has enabled the use of advanced algorithms which
are required to employ inductive learning methods and optimization
models, keeping the processing times within a reasonable range.
Moreover, it permits the adoption of state-of-the-art graphical
visualization techniques, featuring real-time animations. A further relevant
enabling factor derives from the exponential increase in the capacity of
mass storage devices, again at decreasing costs, enabling any organization
to store terabytes of data for business intelligence systems. And network
connectivity, in the form of Extranets or Intranets, has played a primary
role in the diffusion within organizations of information and knowledge
extracted from business intelligence systems. Finally, the easy integration
of hardware and software purchased by different suppliers, or developed
internally by an organization, is a further relevant factor affecting the
diffusion of data analysis tools.
Analytics. As stated above, mathematical
models and analytical methodologies play a key role in information
enhancement and knowledge extraction from the data available inside
most organizations. The mere visualization of the data according to timely
and flexible logical views, as described in further chapters, plays a
relevant role in facilitating the decision-making process, but still
represents a passive form of support. Therefore, it is necessary to apply
more advanced models of inductive learning and optimization in order to
achieve active forms of support for the decision-making process.
Human resources. The human assets of an organization are built up by the
competencies of those who operate within its boundaries, whether as
individuals or collectively. The overall knowledge possessed and shared
by these individuals constitutes the organizational culture. The ability of
knowledge workers to acquire information and then translate it into
practical actions is one of the major assets of any organization, and has a
major impact on the quality of the decision-making process. If a given
enterprise has implemented an advanced business intelligence system,
there still remains much scope to emphasize the personal skills of its
knowledge workers, who are required to perform the analyses and to
interpret the results, to work out creative solutions and to devise effective
action plans. All the available analytical tools being equal, a company
employing human resources endowed with a greater mental agility and
willing to accept changes in the decision-making style will be at an
advantage over its competitors.

1.2.4 ETHICS AND BI


The adoption of business intelligence methodologies, data mining methods
and decision support systems raises some ethical problems that should not
be overlooked. Indeed, the progress toward the information and
knowledge society opens up countless opportunities, but may also
generate distortions and risks which should be prevented and avoided by
using adequate control rules and mechanisms. Usage of data by public and
private organizations that is improper and does not respect the individuals’
right to privacy should not be tolerated. More generally, we must guard
against the excessive growth of the political and economic power of
enterprises allowing the transformation processes outlined above to
exclusively and unilaterally benefit such enterprises themselves, at the
expense of consumers, workers and inhabitants of the Earth ecosystem.
However, even failing specific regulations that would prevent the abuse of
data gathering and invasive investigations, it is essential that business
intelligence analysts and decision makers abide by the ethical principle of
respect for the personal rights of the individuals. The risk of overstepping
the boundary between correct and intrusive use of information is
particularly high within the relational marketing and web mining fields.
For example, even if disguised under apparently inoffensive names such
as ‘data enrichment’, private information on individuals and households
does circulate, but that does not mean that it is ethical for decision makers
and enterprises to use it. Respect for the right to privacy is not the only
ethical issue concerning the use of business intelligence systems. There
has been much discussion in recent years of the social responsibilities of
enterprises, leading to the introduction of the new concept of stakeholders.
This term refers to anyone with any interest in the activities of a given
enterprise, such as investors, employees, labour unions and civil society as
a whole. There is a diversity of opinion on whether a company should
pursue the short-term maximization of profits, acting exclusively in the
interest of shareholders, or should instead adopt an approach that takes
into account the social consequences of its decisions. As this is not the
right place to discuss a problem of such magnitude, we will confine
ourselves to pointing out that analyses based on business intelligence
systems are affected by this issue and therefore run the risk of being used
to maximize profits even when different considerations should prevail
related to the social consequences of the decisions made, according to a
logic that we believe should be rejected. For example, is it right to develop
an optimization model with the purpose of distributing costs on an
international scale in order to circumvent the tax systems of certain
countries? Is it legitimate to make a decision on the optimal position of the
tank in a vehicle in order to minimize production costs, even if this may
cause serious harm to the passengers in the event of a collision? As proven
by these examples, analysts developing a mathematical model and those
who make the decisions cannot remain neutral, but have the moral
obligation to take an ethical stance.

1.3 BI APPLICATIONS
BALANCED SCORE CARD
The balanced scorecard is the most developed method for high-level
reporting. The balanced scorecard was the first type of performance
reporting, and it was intended primarily for top-level management.
Financial information, information about customers and their perceptions
of the business, information about internal business procedures, and
measures for business improvement were the key themes of discussion.
Indicators were established for each topic to measure the relevant business
goals in an effective manner. The so-called third-generation balanced
scorecard has evolved from this basic definition. It has four main
components.
1. Destination statement: It describes the organization at present and at a
defined point in the future (midterm planning) in the four perspectives:
financial and stakeholder expectations, customer and external
relationships, process activities, and organization and culture.
2. Strategic linkage model: This topic contains strategic objectives with
respect to outcome and activities, together with hypothesized causal
relationships between these strategic objectives.
3. Definitions of strategic objectives.
4. Definitions of measures: For each strategic objective, measures are
defined, together with their targets.

FRAUD DETECTION
Another important application of data mining is fraud detection. Fraud can
occur in a variety of businesses, including telecommunications, insurance
(false claims), and banking (illegal use of credit cards and bank checks;
illegal monetary transactions).

TELECOMMUNICATION INDUSTRY
Mobile phone companies were among the first to employ learning models
and data mining techniques to assist relational marketing efforts. Customer
retention, often known as churn analysis, has been one of the key goals.
Market saturation and fierce competition have combined to create
insecurity and dissatisfaction among consumers, who can choose a
provider depending on the rates, services, and access methods that suit
them best. This phenomenon is especially important in the case of prepaid
telephone cards, which are quite popular in the mobile phone business
today since they make switching a phone service provider relatively
simple and inexpensive.
Company and objectives: A mobile phone
company wishes to model its customers’ propensity to churn, that is, a
predictive model able to associate each customer with a numerical value
(or score) that indicates their probability of discontinuing service, based
on the value of the available explanatory variables. The model should be
able to identify, based on customer characteristics, homogeneous segments
relative to the probability of churning, in order to later concentrate on
these groups, the marketing actions to be carried out for retention, thus
reducing attrition and increasing the overall effectiveness. Figure shows
the possible segments derived using a classification model, using only two
predictive attributes in order to create a two-dimensional chart for
illustration purposes. The segments with the highest density of churners
make it possible to identify the recipients of the marketing actions. After an initial
exploratory data analysis, the decision is made to develop more than one
predictive model, determining a priori some market macro-segments that
appear heterogeneous, so as to obtain more accurate models instead of a single model for
the entire customer base. The analysis carried out using clustering
methods confirms the appropriateness of the segments considered, and
leads to the subdivision of customers into groups based on the following
dimensions:
• customer type (business or private);
• telephone card type (subscription or prepaid);
• years of service provision, whether above or below a given threshold;
• area of residence.
The marketing data mart provides for each customer a large amount of
data:
• personal information (socio-demographic);
• administrative and accounting information;
• incoming and outgoing telephone traffic, subdivided by period (weeks or
months) and traffic direction;
• access to additional services, such as fax, voice mail, and special service
numbers;
• calls to customer assistance centres;
• notifications of malfunctioning and disservice;
• emails and requests through web pages.

BANKING AND FINANCE
1. FRAUD ANALYSIS
According to a global banking survey published by KPMG, financial
fraud has increased in both volume and value. This has made fraud
detection and prevention the top priority of every bank.
Beyond delivering a faster, safer, and convenient experience, the solution
also:
● Simplified payment processing
● Triggered warning signals to take preventive action
● Tracked in-process transactions in real-time
● Blocked fraudulent credit cards & payments in real-time
2. CROSS-SELLING
BI solutions help conduct a win-loss data analysis to predict acceptance
rates for upcoming cross-selling initiatives. An Asia-based financial
services institution wanted to increase its revenue through cross-selling
insurance policies. For this, they needed an intelligent solution that could:
● Analyse the CRM data
● Uncover customer trends
● Identify the customers that are most likely to convert based on their
purchase history of other products.
The Business Intelligence & Analytics help desk created for the
customer helps:
● Generate excel-based analytics reports
● Identify potential customers most likely to convert based on their
buying behaviour and profile
It helped increase the conversion rate and revenues while reducing the
incurred costs on expensive statistical tools.

1.4 LIST OF REFERENCES


● Business Intelligence: Data Mining and Optimization for Decision
Making, by Carlo Vercellis, Wiley.
● Adaptive Business Intelligence, by Zbigniew Michalewicz, Martin
Schmidt, Matthew Michalewicz and Constantin Chiriac, Springer.
● Fundamentals of Business Intelligence, by W. Grossmann and
S. Rinderle-Ma, Springer, 2015.

1.5 UNIT END EXERCISES
1) What is effective and timely decisions in business intelligence?
2) Explain Data, Information and knowledge in detail.
3) Explain Business Intelligence architecture in detail.
4) Explain BI applications in detail.



Module II: Prediction Methods and
Mathematical Method

2
DATA PREPARATION
Unit Structure
2.1 Data preparation
2.2 Prediction methods
2.3 Mathematical methods
2.4 Distance methods
2.5 Logic method
2.6 Heuristic method

2.1 DATA PREPARATION


Data preparation is the process of gathering, combining, structuring and
organizing data so it can be used in business intelligence (BI), analytics
and data visualization applications. The components of data preparation
include data pre-processing, profiling, cleansing, validation and
transformation; it often also involves pulling together data from different
internal systems and external sources.
Data preparation work is done by information technology (IT), BI and data
management teams as they integrate data sets to load into a data
warehouse, NoSQL database or data lake repository, and then when new
analytics applications are developed with those data sets. In addition, data
scientists, data engineers, other data analysts and business users
increasingly use self-service data preparation tools to collect and prepare
data themselves.
Data preparation is often referred to informally as data prep. It's also
known as data wrangling, although some practitioners use that term in a
narrower sense to refer to cleansing, structuring and transforming data;
that usage distinguishes data wrangling from the data pre-processing stage.

Purposes of data preparation


One of the primary purposes of data preparation is to ensure that raw data
being readied for processing and analysis is accurate and consistent so the
results of BI and analytics applications will be valid. Data is commonly
created with missing values, inaccuracies or other errors, and separate data
sets often have different formats that need to be reconciled when they're

combined. Correcting data errors, validating data quality and consolidating
data sets are big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that
analytics applications deliver meaningful information and actionable
insights for business decision-making. The data often is enriched and
optimized to make it more informative and useful -- for example, by
blending internal and external data sets, creating new data fields,
eliminating outlier values and addressing imbalanced data sets that could
skew analytics results.
In addition, BI and data management teams use the data preparation
process to curate data sets for business users to analyse. Doing so helps
streamline and guide self-service BI applications for business analysts,
executives and workers.

What are the benefits of data preparation?


Data scientists often complain that they spend most of their time
gathering, cleansing and structuring data instead of analysing it. A big
benefit of an effective data preparation process is that they and other end
users can focus more on data mining and data analysis -- the parts of their
job that generate business value. For example, data preparation can be
done more quickly, and prepared data can automatically be fed to users for
recurring analytics applications.
Done properly, data preparation also helps an organization do the
following:
● ensure the data used in analytics applications produces reliable results;
● identify and fix data issues that otherwise might not be detected;
● enable more informed decision-making by business executives and
operational workers;
● reduce data management and analytics costs;
● avoid duplication of effort in preparing data for use in multiple
applications; and
● get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data
environments that store a combination of structured, semi-structured and
unstructured data, often in raw form until it's needed for specific analytics
uses. Those uses include predictive analytics, machine learning (ML) and
other forms of advanced analytics that typically involve large amounts of
data to prepare.

Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the
data preparation steps listed by different data professionals and software
vendors, but the process typically involves the following tasks:
1. Data collection. Relevant data is gathered from operational systems,
data warehouses, data lakes and other data sources. During this step,
data scientists, members of the BI team, other data professionals and
end users who collect data should confirm that it's a good fit for the
objectives of the planned analytics applications.
2. Data discovery and profiling. The next step is to explore the
collected data to better understand what it contains and what needs to
be done to prepare it for the intended uses. To help with that, data
profiling identifies patterns, relationships and other attributes in the
data, as well as inconsistencies, anomalies, missing values and other
issues so they can be addressed.
3. Data cleansing. Next, the identified data errors and issues are
corrected to create complete and accurate data sets. For example, as
part of cleansing data sets, faulty data is removed or fixed, missing
values are filled in and inconsistent entries are harmonized.
4. Data structuring. At this point, the data needs to be modelled and
organized to meet the analytics requirements. For example, data stored
in comma-separated values (CSV) files or other file formats has to be
converted into tables to make it accessible to BI and analytics tools.
5. Data transformation and enrichment. In addition to being
structured, the data typically must be transformed into a unified and
usable format. For example, data transformation may involve creating
new fields or columns that aggregate values from existing ones. Data
enrichment further enhances and optimizes data sets as needed,
through measures such as augmenting and adding data.
6. Data validation and publishing. In this last step, automated routines
are run against the data to validate its consistency, completeness and
accuracy. The prepared data is then stored in a data warehouse, a data
lake or another repository and either used directly by whoever
prepared it or made available for other users to access.
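The fragment below is a minimal sketch of steps 2 to 6 above using pandas (a library this text does not prescribe); the file name, column names and validation rule are hypothetical and stand in for whatever the source systems actually provide.

import pandas as pd

# Collection / structuring: load a hypothetical CSV extract into a table.
sales = pd.read_csv("sales_extract.csv")

# Discovery and profiling: inspect types and missing values.
print(sales.dtypes)
print(sales.isna().sum())

# Cleansing: drop duplicates, harmonize entries, fill missing values.
sales = sales.drop_duplicates()
sales["region"] = sales["region"].str.strip().str.title()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Transformation and enrichment: derive a new aggregate field.
monthly = (sales
           .assign(month=pd.to_datetime(sales["order_date"]).dt.to_period("M"))
           .groupby(["region", "month"], as_index=False)["amount"].sum())

# Validation and publishing: a simple consistency check, then store the result.
assert monthly["amount"].ge(0).all(), "negative monthly totals found"
monthly.to_csv("prepared_sales.csv", index=False)

In a real pipeline each of these steps would be parameterized and automated so that the prepared data set can be refreshed for recurring analytics applications.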
Data preparation can also incorporate or feed into data curation work that
creates and oversees ready-to-use data sets for BI and analytics. Data
curation involves tasks such as indexing, cataloging and maintaining data
sets and their associated metadata to help users find and access the data. In
some organizations, data curator is a formal role that works
collaboratively with data scientists, business analysts, other users and the
IT and data management teams. In others, data may be curated by data
stewards, data engineers, database administrators or data scientists and
business users themselves.


The data preparation process includes these primary steps.

What are the challenges of data preparation?


Data preparation is inherently complicated. Data sets pulled together from
different source systems are highly likely to have numerous data quality,
accuracy and consistency issues to resolve. The data also must be
manipulated to make it usable, and irrelevant data needs to be weeded out.
As noted above, it's a time-consuming process, with common challenges including the following:
● Inadequate or non-existent data profiling. If data isn't properly
profiled, errors, anomalies and other problems might not be identified,
which can result in flawed analytics.
● Missing or incomplete data. Data sets often have missing values and
other forms of incomplete data; such issues need to be assessed as
possible errors and addressed if so.
● Invalid data values. Misspellings, other typos and wrong numbers are
examples of invalid entries that frequently occur in data and must be
fixed to ensure analytics accuracy.
● Name and address standardization. Names and addresses may be
inconsistent in data from different systems, with variations that can
affect views of customers and other entities.
● Inconsistent data across enterprise systems. Other inconsistencies in
data sets drawn from multiple source systems, such as different
terminology and unique identifiers, are also a pervasive issue in data
preparation efforts.
● Data enrichment. Deciding how to enrich a data set -- for example,
what to add to it -- is a complex task that requires a strong
understanding of business needs and analytics goals.
● Maintaining and expanding data prep processes. Data preparation
work often becomes a recurring process that needs to be sustained and
enhanced on an ongoing basis.


These issues complicate the process of preparing data for BI and analytics
applications.

Data preparation tools and the self-service data prep market


Data preparation can pull skilled BI, analytics and data management
practitioners away from more high-value work, especially as the volume
of data used in analytics applications continues to grow. However, various
software vendors have introduced self-service tools that automate data
preparation methods, enabling both data professionals and business users
to get data ready for analysis in a streamlined and interactive way. Other
prominent BI, analytics and data management vendors that offer data
preparation tools or capabilities include the following:
● Altair
● Boomi
● Datameer
● DataRobot
● IBM
● Informatica
● Microsoft
● Precisely
● SAP
● SAS
● Tableau
● Talend
● Tamr
● Tibco Software


Data preparation software typically provides these capabilities.

How to get started on data preparation


The following six items serve as starting points for successful data prep
initiatives:
1. Think of data preparation as part of data analysis. Data preparation
and analysis are "two sides of the same coin," Farmer wrote. Data, he
said, can't be properly prepared without knowing what analytics use it
needs to fit.
2. Define what data preparation success means. Desired data accuracy
levels and other data quality metrics should be set as goals, balanced
against projected costs to create a data prep plan that's appropriate to
each use case.
3. Prioritize data sources based on the application. Resolving
differences in data from multiple source systems is an important
element of data preparation that also should be based on the planned
analytics use case.
4. Use the right tools for the job and your skill level. Self-service data
preparation tools aren't the only option available -- other tools and
technologies can also be used, depending on your skills and data needs.
5. Be prepared for failures when preparing data. Error-handling
capabilities need to be built into the data preparation process to prevent
it from going awry or getting bogged down when problems occur.
6. Keep an eye on data preparation costs. The cost of software licenses,
processing and storage resources, and the people involved in preparing
data should be watched closely to ensure that they don't get out of hand.

2.2 PREDICTION METHODS
What is Prediction in Data Mining?
To find a numerical output, prediction is used. The training dataset
contains the inputs and numerical output values. According to the training
dataset, the algorithm generates a model or predictor. When fresh data is
provided, the model should find a numerical output. This approach, unlike
classification, does not have a class label. A continuous-valued function or
ordered value is predicted by the model.
In most cases, regression is utilized to make predictions. For example:
Predicting the worth of a home based on facts like the number of rooms,
total area, and so on.
Consider the following scenario: A marketing manager needs to forecast
how much a specific consumer will spend during a sale. In this scenario,
we are required to forecast a numerical value. In this situation, a model or
predictor that forecasts a continuous or ordered value function will be
built.

Prediction Issues:
Preparing the data for prediction is the most pressing challenge. The
following activities are involved in data preparation:
● Data Cleaning: Cleaning data include reducing noise and treating
missing values. Smoothing techniques remove noise, and the problem
of missing values is solved by replacing a missing value with the most
often occurring value for that characteristic.

● Relevance Analysis: The irrelevant attributes may also be present in
the database. The correlation analysis method is used to determine
whether two attributes are connected.
● Data Transformation and Reduction: Any of the methods listed
below can be used to transform the data.
● Normalization: Normalization transforms the data by scaling all
values of a given attribute so that they fall within a small, specified
range. It is performed when neural networks or methods involving
distance measurements are used in the learning process (a min-max
scaling sketch is shown below, after this list).
● Generalization: The data can also be modified by generalizing it to a
higher-level concept, using concept hierarchies.
Other data reduction techniques include wavelet processing, binning,
histogram analysis, and clustering.
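As an illustration of the normalization step mentioned in the list above, the sketch below rescales a numeric attribute into the range [0, 1] (min-max normalization); the data values are invented for the example and numpy is an assumption, not something the text requires.

import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale all values of an attribute into [new_min, new_max].
    # Assumes the attribute is not constant (old_max != old_min).
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(min_max_normalize([12000, 35000, 98000, 54000]))
# approximately [0.0, 0.267, 1.0, 0.488]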
Prediction Methods

1. Multiple Linear Regression


This method is performed on a dataset to predict the response variable
based on a predictor variable or used to study the relationship between a
response and predictor variable, for example, student test scores compared
to demographic information such as income, education of parents, etc.
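A minimal sketch of multiple linear regression using scikit-learn (a library not referenced in this chapter); the income and parental-education predictors and the test scores are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: household income (thousands) and parents' years
# of education; response: student test score.
X = np.array([[40, 12], [55, 14], [75, 16], [90, 18], [60, 15]])
y = np.array([62, 70, 81, 88, 74])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # fitted relationship between predictors and response
print(model.predict([[65, 15]]))       # predicted score for a new student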
2. k-Nearest Neighbors
Like the classification method of the same name, this prediction
method finds, for each record, the k most similar observations
("neighbors") in the training dataset, using a Euclidean distance measure
to determine similarity. The responses of these neighbors are then
combined to predict the value of the response for each
member of the validation set.
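A minimal sketch of k-nearest-neighbors prediction with scikit-learn (an assumption, not a tool named in the text), which averages the responses of the k closest training observations under Euclidean distance; the data and the choice k = 3 are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

knn = KNeighborsRegressor(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[2.5]]))   # average response of the 3 nearest neighbors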

3. Regression Tree
A Regression tree may be considered a variant of a decision tree, designed
to approximate real-valued functions instead of being used for
classification methods. As with all regression techniques, XLMiner
assumes the existence of a single output (response) variable and one or
more input (predictor) variables. The output variable is numerical. The
general regression tree building methodology allows input variables to be
a mixture of continuous and categorical variables. A decision tree is
generated when each decision node in the tree contains a test on some
input variable's value. The terminal nodes of the tree contain the predicted
output variable values.
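The paragraph above describes regression trees in terms of XLMiner; a comparable sketch using scikit-learn's DecisionTreeRegressor (an assumption, not the tool named above) is shown below, with made-up housing data.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical data: house area (sq. m) and number of rooms vs. price.
X = np.array([[50, 2], [65, 3], [80, 3], [100, 4], [120, 4]])
y = np.array([120.0, 150.0, 175.0, 220.0, 260.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["area", "rooms"]))  # tests at the decision nodes
print(tree.predict([[90, 3]]))   # predicted value stored in the reached terminal node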

4. Neural Network
Artificial neural networks are based on the operation and structure of the
human brain. These networks process one record at a time and “learn” by
comparing their prediction of the record (which at the beginning is largely
arbitrary) with the known actual value of the response variable. Errors
from the initial prediction of the first records are fed back into the network
and used to modify the network's weights the second time around. This
continues for many, many iterations.
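A minimal sketch of this iterative error-feedback idea using scikit-learn's MLPRegressor (an assumption; the text does not prescribe a library). The network repeatedly adjusts its weights from prediction errors over many iterations; the synthetic data below is only for illustration.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=200)   # noisy linear target

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X, y)            # prediction errors drive weight updates at each iteration
print(net.predict([[4.0]]))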

2.3 MATHEMATICAL MODEL


Mathematical models are used in data analysis to aid in decision-making
and other functions in businesses. Discover how mathematical models are
used in the business field, including making predictions and optimizing
costs.

Mathematical Models Defined


Mathematics can be used to represent real-world situations. Indeed, the
instruments used to represent the real world in mathematical terms are
called mathematical models. They help us understand how the real world
works.
Most models don't recreate the real world as it is, but they offer a
simplified approximation of the real-world situations. A simple
mathematical model representing profit calculation is: Profit = Total Revenue − Total Costs.

Every mathematical model requires a set of inputs and mathematical


functions to generate an output.

Mathematical Models in Business


There are several ways in which mathematical models are used in the field
of business. We will explore some of them here.

1. Decision-Making
Making decisions is a crucial activity for businesses. It often involves
multiple participants with conflicting views. Decision-making
mathematical models can be of great use here. Such models use input
variables and a set of conditions to be fulfilled to help management arrive
at a decision.
One of the most common decision-making problems faced by any
business is the investment decision, where it must decide whether to
invest its money in a project or not. Businesses often use mathematical
models that assess the potential valuation of the project against the
investment to be made for making such decisions. Examples of such
models are net present value (NPV), internal rate of return (IRR), etc. A
simple NPV model can be illustrated as:
NPV = CF1/(1 + r) + CF2/(1 + r)^2 + ... + CFn/(1 + r)^n − Initial Investment,
where CFt is the cash flow in period t and r is the discount rate.
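A small Python sketch of this NPV calculation (an illustration, not part of the original text); the cash flows, discount rate and investment figure are hypothetical. A project with NPV greater than zero would typically be accepted.

def npv(rate, initial_investment, cash_flows):
    # Net present value: discounted future cash flows minus the initial investment.
    return sum(cf / (1 + rate) ** t
               for t, cf in enumerate(cash_flows, start=1)) - initial_investment

# Hypothetical project: invest 100 now, receive 40 per year for 4 years, 10% rate.
print(round(npv(0.10, 100.0, [40.0, 40.0, 40.0, 40.0]), 2))   # about 26.79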

2. Making Predictions
Often businesses have the requirement of predicting certain factors, such
as revenue, growth rate, costs, etc. These are usually used in case of new
product launch, change in strategy, investment needs, expansion projects,
etc. In such cases, predictive mathematical models are used that analyze
historical data and use probability distribution as input for predicting the
future values.
Regression analysis involves examining both dependent and
independent variables present in a given situation and then determining the
level of correlation they have with one another. Because of this, it is one of
the most commonly used techniques for predictive modelling.
Let's understand more with the help of an example of a company that is
conceptualizing a new product for kids aged 8-12 years. Before the
production, the company would want to understand the potential demand
for the product. In doing so, it will require understanding the interests of
their target audience. Then, with the help of a predictive model, they can
forecast the demand in the future.
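A minimal sketch of such a predictive model in Python using NumPy; the historical demand figures are invented purely for illustration.

# Fit a simple linear trend to historical demand and predict the next period.
import numpy as np

months = np.array([1, 2, 3, 4, 5, 6], dtype=float)
demand = np.array([120, 135, 148, 160, 171, 186], dtype=float)  # historical data

slope, intercept = np.polyfit(months, demand, deg=1)   # least-squares fit
next_month = 7
forecast = slope * next_month + intercept
print(round(forecast, 1))   # projected demand for month 7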

3. Optimizing
Businesses often need to optimize certain variables to control costs and
ensure maximum efficiency. Such variables might include capacity
planning, human resources planning, space planning, route planning,
etc. Optimization mathematical models are typically used for such
problems. These models often maximize or minimize a quantity by
making changes in another variable or a set of variables.
In devising a pricing strategy, price optimization models are commonly
used to analyze demand of a product at different price points to calculate
profits. The goal of the model is to maximize profits by optimizing prices.
Thus, a company can determine a price level that achieves maximum
profit.
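A minimal price-optimization sketch in plain Python; the linear demand curve, unit cost, and candidate price range are assumptions made only for illustration.

# Evaluate profit at several candidate price points and pick the best one.
def demand(price):
    return max(0.0, 1000 - 8 * price)      # assumed linear demand curve

def profit(price, unit_cost=40.0):
    return (price - unit_cost) * demand(price)

candidate_prices = range(50, 121, 5)
best_price = max(candidate_prices, key=profit)
print(best_price, round(profit(best_price), 2))   # price level with the highest profit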

2.4 DISTANCE METHODS

Clustering consists of grouping objects that are similar to each other; a distance measure can be used to decide whether two items are similar or dissimilar in their properties.
In a Data Mining sense, the similarity measure is a distance with
dimensions describing object features. That means if the distance among
two data points is small then there is a high degree of similarity among
the objects and vice versa. The similarity is subjective and depends
heavily on the context and application. For example, similarity among
vegetables can be determined from their taste, size, colour etc.
Most clustering approaches use distance measures to assess the similarities
or differences between a pair of objects, the most popular distance
measures used are:

1. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with
geometry. It can be simply explained as the ordinary distance between
two points. It is one of the most used algorithms in the cluster analysis.
One of the algorithms that use this formula would be K-mean.
Mathematically, it computes the square root of the sum of squared differences between the coordinates of two objects. For two points P(x1, y1) and Q(x2, y2):

Euclidean distance between P and Q = √((x1 − x2)² + (y1 − y2)²)

Figure – Euclidean Distance

2. Manhattan Distance:
This determines the absolute difference between the pairs of coordinates. Suppose we have two points P and Q; to determine the distance between these points, we simply add the absolute differences of their coordinates along the X-axis and Y-axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2),
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|

Figure – Manhattan Distance: the total length of the red, axis-parallel path between the two points gives the Manhattan distance.

3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of their intersection divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard distance, used as a dissimilarity measure, is 1 − J(A, B).


Figure – Jaccard Index

4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan Distance
Measure. In an N-dimensional space, a point is represented as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given by

D(P1, P2) = (Σ |Xi − Yi|^p)^(1/p)
● When p = 2, Minkowski distance is same as the Euclidean distance.
● When p = 1, Minkowski distance is same as the Manhattan
distance.

5. Cosine Index:
The cosine distance measure for clustering determines the cosine of the angle between two vectors:

cosine similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

and the cosine distance is 1 − cosine similarity(A, B).
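The sketch below implements the distance measures discussed above in plain Python; the two sample points and sets are arbitrary illustrations.

# Distance measures used in clustering (illustrative two-dimensional points).
import math

p, q = (1.0, 2.0), (4.0, 6.0)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, power):
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1 / power)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def jaccard_index(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

print(euclidean(p, q))            # 5.0
print(manhattan(p, q))            # 7.0
print(minkowski(p, q, 2))         # 5.0, same as Euclidean when p = 2
print(round(cosine_distance(p, q), 4))
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5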


2.5 LOGIC METHODS


Logic Definition
There are many types of logic that exist. From formal to symbolic, logic
takes on many forms. What is logic, and is there a logic definition that
encompasses them all?
Logic is defined as a system that aims to draw reasonable conclusions
based on given information. This means the goal of logic is to use data to
make inferences. For example, if a person walked into a room and saw
children holding markers and then saw marker scribbles all over the walls,
logic would dictate that, from the information presented, the
children drew all over the walls with markers. There is no direct evidence
or confession, but logical principles reveal what is true based on the given
information.
While one can reasonably argue the children drew on the walls with
markers, when it comes to logic, these inferences must follow a set of
guidelines to ensure the reached conclusion is valid and accurate. The
differences will be found in each logic type.

Logic Etymology
The word logic stems from the Greek word logike and or logos which
translates to reason. While many versions of the word have existed over
time, from Latin to Middle English, the first known use of the word was in
the 12th century to define a scientific set of principles. In the 14th century,
the word's definition grew to encompass the idea of true and false thinking
in terms of reasoning. Today, logic is connected to reasoning in forms of
nuance found in argumentation, math, symbolism, and much more.

Logic Examples and Concepts
Since logic is dependent upon reason, emotions are removed from this
practice, which means the concept of logic relies solely on given data and
valid correlations based on the governing principles presented. The goal of
logic is to find reasonable conclusions based on the given information, but
to make those conclusions, the person in question must ensure they are
making valid arguments.

Valid Argument vs. Invalid Argument


Argumentation is the basis of logic in that it presents a series of
statements or premises that help support an overall claim. These
statements create a foundation for a conclusion to be true or false. There
are two types of arguments: valid and invalid.
● Valid argument: When a person makes an argument, and all the
claims they make are true, then it's deduced that the conclusion must be
true, too. A valid argument provides clear and true premises that support
the overall conclusion, which, in turn, makes it valid.
● Invalid argument: When a person makes an argument and presents
claims that don't prove their conclusion, or the premises simply aren't true,
this argument is considered invalid or false.
To understand these definitions, think about this conclusion: The
cancellation of student loan debt will help boost the economy. To make a
valid argument in regards to student loan debt cancellation, a person
would need to research factual information to prove the conclusion true.
They could do so by looking at financial projections, the demographics of
those who would be affected, and interviewing people with student loan debt to learn how cancellation would change their lives. By bringing this
information to the table, one could construct a valid argument for
cancelling the debt.
However, invalid arguments are just as common in today's world; such false claims could look like arguing the opposite of the conclusion, forming stereotypes about people who have accrued the debt, or giving opinions about the unfair treatment of those who have already paid off their debt.

Types of Logic
There are many types of logic located within the governing science. The
four main logic types are:
● Informal logic
● Formal logic
● Symbolic logic
● Mathematical logic

Read on to learn about each logic type and gain a better understanding
through definitions and examples.

Informal Logic
Most people use informal logic everyday, as it's how we reason and form
argumentation in the moment. For example, arguing with a friend about if
Rachel and Ross were on a break in the TV show Friends would result in
the use of informal logic. On the show, the couple decided to take time
away from each other, and in that time, Ross slept with another woman.
Ross argues they were on a break, and Rachel argues they weren't. For this
argument, each person uses the information presented and creates their
conclusion based on their understanding of the word ''break''.
Informal logic consists of two types of reasoning to make arguments:
● Deductive reasoning: Uses information from various sources and
applies that information to the argument at hand to support a larger,
generalized conclusion
● Inductive reasoning: Uses the specific information given to form a
generalized conclusion
In the Friends example, the arguing friends would use inductive
reasoning, since they are only using the evidence given from the one
source (the TV show). They would look at the episode before and after
Ross' actions to determine if the couple was, in fact, on a break. To use
deductive reasoning, the arguing friends would look into more examples
of infidelity and might even define the word ''break'' in terms of various
definitions. Inductive reasoning uses a smaller source pool and focuses on
Ross and Rachel. Deductive reasoning would center on the concept of
cheating and the notion behind the word ''break'' pulling from multiple
sources until a larger conclusion about cheating is created.

Formal Logic
Formal logic uses deductive reasoning in conjunction with syllogisms and
mathematical symbols to infer if a conclusion is valid. In formal logic, a
person looks to ensure the premises made about a topic logically connects
to the conclusion.
A common example of formal logic is the use of a syllogism to explain
those connections. A syllogism is a form of reasoning which draws
conclusions based on two given premises. In each syllogism, there are two
premises and one conclusion that is drawn based on the given information.
The most famous example is about Socrates.
Premise A: Socrates is a man.
Premise B: All men are mortal.
Conclusion C: Therefore, Socrates is mortal.

2.6 HEURISTIC METHODS
The heuristic method refers to finding the best possible solution to a
problem quickly, effectively, and efficiently. The word heuristic is derived
from an ancient Greek word, 'eurisko.' It means to find, discover, or
search. It is a practical method, or mental shortcut, for problem-solving and decision making that reduces cognitive load and does not need to be perfect. The method is helpful for getting a satisfactory solution to a much larger problem within a limited time frame.
The trial and error heuristic is the most fundamental heuristic. It can be applied in all situations, from matching nuts and bolts to finding the answer to an algebra problem. Some common heuristics used to solve mathematical problems are visual representation, forward/backward reasoning, additional assumptions, and simplification.

Advantages of Heuristic
The heuristic method is one of the best ways to solve a problem and make
decisions. The method provides a quick solution with the help of mental
tricks. Some advantages of the heuristic method are given below:
o Attribute Substitution: In place of a more complex and difficult question, one can opt for a simpler question related to the original one. This technique of attribute substitution makes the method more beneficial.
o Effort Reduction: The heuristic method reduces the mental efforts
required to solve a problem by making different choices and decisions. It
makes the method one of the most effective ways to find solutions to
many time-consuming problems.
o Fast and Frugal: With the help of a heuristic method, the problems
can be solved within a limited time, and the best & accurate answer can be
obtained.

Disadvantages of Heuristic Method
While the heuristic method helps us get quick and effective solutions and decisions, it can also lead to errors and biased decisions in some situations. Certain disadvantages of the heuristic method are as under:
o Inaccurate Decision: The heuristic method does not always provide an accurate answer or decision to a problem. Sometimes the method can produce an inaccurate judgment, for example about how commonly things occur or how representative certain things may be; this can be easily seen in examples of manipulated decision-making.
o Biased Decision: A decision or solution that was effective in a past situation is not guaranteed to work in other situations, or even in the same situation again. If a person always relies on the same heuristic, it can become difficult to see better, alternative solutions.
o Reduces Creativity: If a person always relies on the same decision,
it can also reduce his/her creativity and decision-making & problem-
solving ability. It does not allow the person to come up with new ideas and
judgments.
o Stereotypes and Prejudice: Heuristics can also reinforce certain biases, such as stereotypes and prejudice. When a person classifies and categorizes other people using mental shortcuts, he/she can miss more relevant and informative details. Such conditions may create stereotyped and prejudiced categorizations of people and decisions that do not match the real conditions.

Four Principles of the Heuristic Method


György (George) Pólya gave the four principles of the heuristic method
in his book. The book was published in 1945 with the title 'How to solve
it'. These principles should be followed in the proper sequence in which
they are given; otherwise, it can be difficult to find the solution to the
problem. That's why they are also called the four steps of the method.
o First Principle - Understanding the Problem: It is the first step to
solve a problem. This is the most important principle because before
solving a problem, it is required to understand the real problem. But many
people skip this principle of finding the initial suitable approach. The
principle is focused on knowing the problem and looking at the problem
from other angles.
The various aspects covered under this principle are: what is the problem,
what is going on, is there any other way to explain the problem, is there all
required information available, etc. These all points help in understanding
the actual problem and its aspects.

o Second Principle - Making a Plan: A problem can be solved in many different ways. The second principle says that it is necessary to find the best way that can be used to reach the solution to the given problem. For this purpose, the right strategy is first to find out what is required. The reverse strategy of 'working backward' can help with this: people assume they already have the solution and work backward from it toward the starting point.
It also helps in making an overview of the possibilities, removing the less
efficient immediately, comparing all the remaining possibilities, or
applying symmetry. This improves the judgment ability as well as the
creativity of a person.
o Third Principle - Implementing the Plan: After making the proper
strategy, the plan can be implemented. However, for this, it is necessary to
be patient and give the required time to solve the problem, because implementing the plan is tougher than making one. If the plan does not provide a solution or does not meet expectations, it is advised to go back to the second principle and devise a better plan.
o Fourth Principle - Evaluation and Adaptation: This principle checks whether things are going according to plan; in other words, the actual progress is compared with the planned approach. Some plans may work while others may not, so after proper evaluation the most appropriate way can be adapted to solve the main problem.

Types of Heuristic Methods


Some of the most popular methods are discussed below:
o Dividing Technique: Under this technique, the original problem is
divided into smaller pieces or sub-problems so that the answer can be
found more easily. After solving these sub-problems separately, they can
be merged to get the final answer of the solution of the original problem.
o Inductive Method: This method involves a smaller problem than
the original problem, which has been solved already. The original bigger
problem can be solved by deriving the generalization from the smaller
problem or by using the same method that is applied in the previous
problem.
o Reduction Method: Because a problem is influenced by many different factors and causes, this method sets various limits for the main problem in advance. This helps in reducing the leeway of the original problem and getting to the solution more easily.
o Constructive Method: Under this method, the problem is solved step by step; when the first step is passed, it is taken as a small victory, and consecutive steps are then taken to reach the final stage. It helps in finding the best way to solve the problem and in getting a successful result.
o Local Search Method: In this method, the most feasible way of
solving a problem is searched and used. Continuous improvement is made
in the method during the solving process, and when there is no more scope
for improvement, the method gets to the end, and the final result is the
answer to the problem.

Uses of Heuristic in Various Fields


Psychology:
o Informal Modes of Heuristic:
o Affect Heuristic: Emotion is used as a mental shortcut to affect a
decision. Emotion is the driving force behind making a decision or solving
an issue fast and effectively. It's used to assess the dangers and advantages
of something based on the pleasant or negative emotions people connect
with a stimulus. It can also be termed a gut decision because if the gut
feeling is correct, the rewards will outweigh the risks.
o Familiarity Heuristic: A mental shortcut used in various scenarios
in which people presume that the circumstances that led to previous
conduct are still true in the present situation and that the previous behavior
may thus be applied correctly to the new situation. This is true when the
person is under a lot of mental strain.
o Peak-end Rule: An event's experience is rated solely on the
sentiments felt at the event's apex. Typically, not every event is viewed as
complete, but rather what the spectator felt at the climax, whether the
event was pleasant or painful. All other emotions aren't lost, but they aren't
used. It can also contain the duration of the event.
o Some Other Types:
o Balance Heuristic
o Base Rate Heuristic
o Common Sense Heuristic
o Anchoring and Adjustment
o Availability Heuristic
o Contagion Heuristic
o Default Heuristic
o Educated Guess Heuristic
o Effort Heuristic
o Escalation of Commitment
o Fairness Heuristic
o Naïve Diversification
o Representativeness Heuristic
o Scarcity Heuristic
o Simulation Heuristic
o Social Proof
o Working Backward

o Formal Modes of Heuristic:


o Elimination-by-aspects heuristic
o Fast-and-frugal trees
o Fluency heuristic
o Gaze heuristic
o Recognition heuristic
o Satisficing
o Similarity heuristic
o Take-the-best heuristic



3
OPTIMIZATION TECHNIQUES
Unit Structure
3.1 Introduction
3.2 Local Optimization Technique
3.3 Stochastic hill climber
3.4 Evaluation of models

3.1 INTRODUCTION
The field of data mining increasingly adapts methods and algorithms from
advanced matrix computations, graph theory and optimization. In these
methods, the data is described using matrix representations (graphs are
represented by their adjacency matrices) and the data mining problem is
formulated as an optimization problem with matrix variables. With these,
the data mining task becomes a process of minimizing or maximizing a
desired objective function of matrix variables.
Prominent examples include spectral clustering, matrix factorization,
tensor analysis, and regularizations. These matrix-formulated
optimization-centric methodologies are rapidly evolving into a popular
research area for solving challenging data mining problems. These
methods are amenable to vigorous analysis and benefit from the well-
established knowledge in linear algebra, graph theory, and optimization
accumulated through centuries. They are also simple to implement and
easy to understand, in comparison with probabilistic, information-
theoretic, and other methods. In addition, they are well-suited to parallel
and distributed processing for solving large scale problems.

3.2 LOCAL OPTIMIZATION TECHNIQUES


The optimization problem can be defined as a computational situation
where the objective is to find the best of all possible solutions.


Types of Optimization Technique


An essential step to optimization technique is to categorize the
optimization model since the algorithms used for solving optimization
problems are customized as per the nature of the problem. Let us walk
through the various optimization problem types:

Continuous Optimization versus Discrete Optimization


Models with discrete variables are discrete optimization problems, while
models with continuous variables are continuous optimization problems.
Continuous optimization problems are easier to solve than discrete
optimization problems. In a discrete optimization problem, the aim is to
look for an object such as an integer, permutation, or graph from
a countable set. However, with improvements in algorithms coupled along
with advancements in computing technology, there has been an increase in
the size and complexity of discrete optimization problems that can be
solved efficiently. It is to note that Continuous optimization algorithms are
essential in discrete optimization because many discrete optimization
algorithms generate a series of continuous sub-problems.

Unconstrained Optimization versus Constrained Optimization


An essential distinction between optimization problems is between problems in which there are constraints on the variables and problems in which there are no constraints on the variables.
Unconstrained optimization problems arise primarily in many practical
applications and also in the reformulation of constrained optimization
problems. Constrained optimization problems appear from applications
where there are explicit constraints on the variables. Constrained
optimization problems are further divided according to the nature of the
limitations, such as linear, nonlinear, convex, and functional smoothness,
such as differentiable or non-differentiable.

None, One, or Many Objectives Optimization Techniques
Although most optimization problems have a single objective function,
there have been peculiar cases when optimization problems have either -
no objective function or multiple objective functions. Multi-objective
optimization problems arise in streams such as engineering, economics,
and logistics. Often, problems with multiple objectives are reformulated as
single-objective problems.

Deterministic Optimization versus Stochastic Optimization


Deterministic optimization is where the data for the given problem is
known accurately. But sometimes, the data cannot be known precisely for
a variety of reasons. A simple measurement error can be a reason for that.
Another reason is that some data describe information about the future,
hence cannot be known with certainty. In optimization under uncertainty,
when the uncertainty is incorporated into the model, it is called stochastic
optimization.

Optimization problems are classified into two types:


1. Linear programming
A simple problem in linear programming is one in which it is necessary to
find the maximum (or minimum) value of a simple function subject to
certain constraints. An example might be that of a factory producing two
commodities. In any production run, the factory produces x1 of the first
type and x2 of the second. If the profit on the second type is twice that on
the first, then x1 + 2x2 represents the total profit. The function x1 + 2x2 is
known as the objective function.
Clearly the profit will be highest if the factory devotes its entire
production capacity to making the second type of commodity. In a
practical situation, however, this may not be possible; a set of constraints
is introduced by such factors as availability of machine time, labour, and
raw materials. For example, if the second type of commodity requires a
raw material that is limited so that no more than five can be made in any
batch, then x2 must be less than or equal to five; i.e., x2 ≤ 5. If the first
commodity requires another type of material limiting it to eight per batch,
then x1 ≤ 8. If x1 and x2 take equal time to make and the machine time
available allows a maximum of 10 to be made in a batch, then x1 + x2 must
be less than or equal to 10; i.e., x1 + x2 ≤ 10.
Two other constraints are that x1 and x2 must each be greater than or equal
to zero, because it is impossible to make a negative number of either;
i.e., x1 ≥ 0 and x2 ≥ 0. The problem is to find the values of x1 and x2 for
which the profit is a maximum. Any solution can be denoted by a pair of
numbers (x1, x2); for example, if x1 = 3 and x2 = 6, the solution is (3, 6).
These numbers can be represented by points plotted on two axes, as shown
in the figure. On this graph the distance along the horizontal axis
represents x1 and that along the vertical represents x2. Because of the
constraints given above, the feasible solutions must lie within a certain
well-defined region of the graph. For example, the constraint x1 ≥ 0 means that points representing feasible solutions lie on or to the right of
the x2 axis. Similarly, the constraint x2 ≥ 0 means that they also lie on or
above the x1 axis. Application of the entire set of constraints gives the
feasible solution set, which is bounded by a polygon formed by the
intersection of the lines x1 = 0, x2 = 0, x1 = 8, x2 = 5, and x1 + x2 = 10. For
example, production of three items of commodity x1 and four of x2 is a
feasible solution since the point (3, 4) lies in this region. To find the best
solution, however, the objective function x1 + 2x2 = k is plotted on the
graph for some value of k, say k = 4. This value is indicated by the
broken line in the figure. As k is increased, a family of parallel lines is
produced and the line for k = 15 just touches the constraint set at the point
(5, 5). If k is increased further, the values of x1 and x2 will lie outside the
set of feasible solutions. Thus, the best solution is that in which equal
quantities of each commodity are made. It is no coincidence that an
optimal solution occurs at a vertex, or “extreme point,” of the region. This
will always be true for linear problems, although an optimal solution may
not be unique. Thus, the solution of such problems reduces to finding
which extreme point (or points) yields the largest value for the objective
function.

Figure: Constraint set bounded by the five lines x1 = 0, x2 = 0, x1 = 8, x2 = 5, and x1 + x2 = 10. These enclose an infinite number of points that represent feasible solutions.
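For reference, the same two-commodity problem can be solved numerically; the sketch below uses SciPy's linprog routine (which minimizes, so the objective is negated). Availability of the SciPy package is assumed.

# Solve: maximize x1 + 2*x2  subject to  x1 <= 8, x2 <= 5, x1 + x2 <= 10, x >= 0.
from scipy.optimize import linprog

c = [-1, -2]                      # negate the profit because linprog minimizes
A_ub = [[1, 0], [0, 1], [1, 1]]   # left-hand sides of the <= constraints
b_ub = [8, 5, 10]                 # right-hand sides
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

print(result.x)      # approximately [5. 5.], the extreme point (5, 5)
print(-result.fun)   # optimal objective value, 15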

The simplex method


The graphical method of solution illustrated by the example in the
preceding section is useful only for systems of inequalities involving two
variables. In practice, problems often involve hundreds of equations with
thousands of variables, which can result in an astronomical number of
extreme points. In 1947 George Dantzig, a mathematical adviser for the
U.S. Air Force, devised the simplex method to restrict the number of
extreme points that have to be examined. The simplex method is one of
the most useful and efficient algorithms ever invented, and it is still the
standard method employed on computers to solve optimization problems.
First, the method assumes that an extreme point is known. (If no extreme
point is given, a variant of the simplex method, called Phase I, is used to
find one or to determine that there are no feasible solutions.) Next, using
an algebraic specification of the problem, a test determines whether that
extreme point is optimal. If the test for optimality is not passed,
an adjacent extreme point is sought along an edge in the direction for
which the value of the objective function increases at the fastest rate.
Sometimes one can move along an edge and make the objective function
value increase without bound. If this occurs, the procedure terminates with
a prescription of the edge along which the objective goes to
positive infinity. If not, a new extreme point is reached having at least as
high an objective function value as its predecessor. The sequence
described is then repeated. Termination occurs when an optimal extreme
point is found or the unbounded case occurs. Although in principle the
necessary steps may grow exponentially with the number of extreme
points, in practice the method typically converges on the optimal solution
in a number of steps that is only a small multiple of the number of extreme
points.
To illustrate the simplex method, the example from the preceding section
will be solved again. The problem is first put into canonical form by
converting the linear inequalities into equalities by introducing “slack
variables” x3 ≥ 0 (so that x1 + x3 = 8), x4 ≥ 0 (so that x2 + x4 = 5), x5 ≥ 0 (so
that x1 + x2 + x5 = 10), and the variable x0 for the value of the objective
function (so that x1 + 2x2 − x0 = 0). The problem may then be restated as
that of finding nonnegative quantities x1, …, x5 and the largest
possible x0 satisfying the resulting equations. One obvious solution is
to set the objective variables x1 = x2 = 0, which corresponds to the extreme
point at the origin. If one of the objective variables is increased from zero
while the other one is fixed at zero, the objective value x0 will increase as
desired (subject to the slack variables satisfying the equality constraints).
The variable x2 produces the largest increase of x0 per unit change; so it is
used first. Its increase is limited by the nonnegativity requirement on the
variables. In particular, if x2 is increased beyond 5, x4 becomes negative.
At x2 = 5, this situation produces a new solution—(x0, x1, x2, x3, x4, x5) =
(10, 0, 5, 8, 0, 5)—that corresponds to the extreme point (0, 5) in the
figure. The system of equations is put into an equivalent form by solving
for the nonzero variables x0, x2, x3, x5 in terms of those variables now at
zero; i.e., x1 and x4. Thus, the new objective function is x1 − 2x4 = x0 − 10,
while the constraints are x1 + x3 = 8, x2 + x4 = 5, and x1 − x4 + x5 = 5. It is
now apparent that an increase of x1 while holding x4 equal to zero will
produce a further increase in x0. The nonnegativity restriction
on x3 prevents x1 from going beyond 5. The new solution—
(x0, x1, x2, x3, x4, x5) = (15, 5, 5, 3, 0, 0)—corresponds to the extreme point
(5, 5) in the figure. Finally, since solving for x0 in terms of the
variables x4 and x5 (which are currently at zero value) yields x0 = 15
− x4 − x5, it can be seen that any further change in these slack variables will
decrease the objective value. Hence, an optimal solution exists at the
extreme point (5, 5).
Standard formulation
In practice, optimization problems are formulated in terms of matrices—a
compact symbolism for manipulating the constraints and testing the
objective function algebraically. The original (or “primal”) optimization
problem was given its standard formulation by von Neumann in 1947. In
the primal problem the objective is replaced by the product (px) of
a vector x = (x1, x2, x3, …, xn)T, whose components are the objective
variables and where the superscript “transpose” symbol indicates that the
vector should be written vertically, and another vector p = (p1, p2, p3,
…, pn), whose components are the coefficients of each of the objective
variables. In addition, the system of inequality constraints is replaced by
Ax ≤ b, where the m by n matrix A replaces the m constraints on
the n objective variables, and b = (b1, b2, b3, …, bm)T is a vector whose
components are the inequality bounds.

2. Non-linear programming
Although the linear programming model works fine for many situations,
some problems cannot be modeled accurately without including nonlinear
components. One example would be the isoperimetric problem: determine
the shape of the closed plane curve having a given length and enclosing
the maximum area. The solution, but not a proof, was known by Pappus of
Alexandria c. 340 CE:
Bees, then, know just this fact which is useful to them, that the hexagon is
greater than the square and the triangle and will hold more honey for the
same expenditure of material in constructing each. But we, claiming a
greater share of wisdom than the bees, will investigate a somewhat wider
problem, namely that, of all equilateral and equiangular plane figures
having the same perimeter, that which has the greater number of angles is
always greater, and the greatest of them all is the circle having its perimeter
equal to them.

An important early algorithm for solving nonlinear programs was given by


the Nobel Prize-winning Norwegian economist Ragnar Frisch in the mid-
1950s. Curiously, his approach fell out of favour for some decades,
reemerging as a viable and competitive approach only in the 1990s. Other
important algorithmic approaches include sequential quadratic
programming, in which an approximate problem with a quadratic
objective and linear constraints is solved to obtain each search step; and
penalty methods, including the “method of multipliers,” in which points
that do not satisfy the constraints incur penalty terms in the objective to
discourage algorithms from visiting them.
The Nobel Prize-winning American economist Harry M. Markowitz
provided a boost for nonlinear optimization in 1958 when he formulated
the problem of finding an efficient investment portfolio as a nonlinear
optimization problem with a quadratic objective function. Nonlinear
optimization techniques are now widely used in finance, economics,
manufacturing, control, weather modeling, and all branches of
engineering.
An optimization problem is nonlinear if the objective function f(x) or any
of the inequality constraints ci(x) ≤ 0, i = 1, 2, …, m, or equality
constraints dj(x) = 0, j = 1, 2, …, n, are nonlinear functions of the vector of
variables x. For example, if x contains the components x1 and x2, then the
function 3 + 2x1 − 7x2 is linear, whereas the functions (x1)³ + 2x2 and 3x1 + 2x1x2 + x2 are nonlinear.
Nonlinear problems arise when the objective or constraints cannot be
expressed as linear functions without sacrificing some essential nonlinear
feature of the real world system. For example, the folded conformation of
a protein molecule is believed to be the one that minimizes a certain
nonlinear function of the distances between the nuclei of its component
atoms—and these distances themselves are nonlinear functions of the
positions of the nuclei. In finance, the risk associated with a portfolio of
investments, as measured by the variance of the return on the portfolio, is
a nonlinear function of the amounts invested in each security in the
portfolio. In chemistry, the concentration of each chemical in a solution is
often a nonlinear function of time, as reactions between chemicals usually
take place according to exponential formulas.
Nonlinear problems can be categorized according to several properties.
There are problems in which the objective and constraints are smooth
functions, and there are nonsmooth problems in which the slope or value
of a function may change abruptly. There are unconstrained problems, in
which the aim is to minimize (or maximize) the objective function f(x)
with no restrictions on the value of x, and there are constrained problems,
in which the components of x must satisfy certain bounds or other more
complex interrelationships. In convex problems the graph of the objective
function and the feasible set are both convex (where a set is convex if
a line joining any two points in the set is contained in the set). Another
special case is quadratic programming, in which the constraints are linear
but the objective function is quadratic; that is, it contains terms that are
multiples of the product of two components of x. (For instance, the function 3(x1)² + 1.4x1x2 + 2(x2)² is a quadratic function of x1 and x2.)
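As a small numerical illustration, the quadratic function above can be minimized with SciPy's general-purpose minimize routine; the starting point is an arbitrary assumption, and the unconstrained minimum of this convex quadratic lies at the origin.

# Minimize the convex quadratic 3*x1**2 + 1.4*x1*x2 + 2*x2**2 from an arbitrary start.
import numpy as np
from scipy.optimize import minimize

def objective(x):
    x1, x2 = x
    return 3 * x1**2 + 1.4 * x1 * x2 + 2 * x2**2

result = minimize(objective, x0=np.array([1.0, 1.0]))   # iterates x1, x2, x3, ...
print(result.x)     # close to [0, 0]
print(result.fun)   # close to 0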
Another useful way to classify nonlinear problems is according to the
number of variables (that is, components of x). Loosely speaking, a
problem is said to be “large” if it has more than a thousand or so variables,
although the threshold of “largeness” continually increases as computers
become more powerful. Another useful distinction is between problems
that are computationally “expensive” to evaluate and those that are
relatively cheap, as is the case in linear programming.
Nonlinear programming algorithms typically proceed by making a
sequence of guesses of the variable vector x (known as iterates and
distinguished by superscripts x1, x2, x3, …) with the goal of eventually
identifying an optimal value of x. Often, it is not practical to identify the
globally optimal value of x. In these cases, one must settle for a local
optimum—the best value in some region of the feasible solutions.
Each iterate is chosen on the basis of knowledge about the constraint and
objective functions gathered at earlier iterates. Most nonlinear
programming algorithms are targeted to a particular subclass of problems.
For example, some algorithms are specifically targeted to large, smooth unconstrained problems in which the matrix of second derivatives of f(x)
contains few nonzero entries and is expensive to evaluate, while other
algorithms are aimed specifically at convex quadratic programming
problems, and so on.
Software for solving optimization problems is available both
commercially and in the public domain. In addition to computer
optimization programs, a number of optimization modeling languages are
available that allow a user to describe the problem in intuitive terms,
which are then automatically translated to the mathematical form required
by the optimization software.

3.3 STOCHASTIC HILL CLIMBER


Hill Climbing is a form of heuristic search algorithm which is used in
solving optimization related problems in Artificial Intelligence domain.
The algorithm starts with a non-optimal state and iteratively improves its
state until some predefined condition is met. The condition to be met is
based on the heuristic function.
The aim of the algorithm is to reach an optimal state which is better than
its current state. The starting point which is the non-optimal state is
referred to as the base of the hill and it tries to constantly iterate (climb)
untill it reaches the peak value, that is why it is called Hill Climbing
Algorithm.
Hill Climbing Algorithm is a memory-efficient way of solving large
computational problems. It takes into account the current state and
immediate neighbouring state. The Hill Climbing Problem is particularly
useful when we want to maximize or minimize any particular function
based on the input which it is taking.
A commonly cited application of the Hill Climbing Algorithm is the “Travelling Salesman” problem, where we have to minimize the distance travelled by the salesman. The Hill Climbing Algorithm may not find the global optimal
(best possible) solution but it is good for finding local minima/maxima
efficiently.

Key Features of Hill Climbing in Artificial Intelligence


Following are few of the key features of Hill Climbing Algorithm
Greedy Approach: The algorithm moves in the direction of optimizing
the cost i.e. finding Local Maxima/Minima
No Backtracking: It cannot remember the previous state of the system so
backtracking to the previous state is not possible
Feedback Mechanism: The feedback from the previous computation
helps in deciding the next course of action i.e. whether to move up or
down the slope

State Space Diagram – Hill Climbing in Artificial Intelligence
Local Maxima/Minima: A local maximum is a state which is better than its neighbouring states; however, it is not the best possible state, as there exists a state where the objective function value is higher
Global Maxima/Minima: It is the best possible state in the state diagram.
Here the value of the objective function is highest
Current State: Current State is the state where the agent is present
currently
Flat Local Maximum: This region is depicted by a straight line where all neighbouring states have the same value, so every node is a local maximum over the region

Problems in Hill Climbing Algorithm


Here we discuss the problems in the hill-climbing algorithm:

1. Local Maximum
The algorithm terminates when the current node is a local maximum, as it is better than its neighbours. However, there exists a global maximum where the objective function value is higher.
Solution: Backtracking can mitigate the problem of the local maximum, as it starts exploring alternate paths when it encounters a local maximum.

2. Ridge
Ridge occurs when there are multiple peaks and all have the same value or
in other words, there are multiple local maxima which are same as global
maxima

3. Plateau
Plateau is the region where all the neighbouring nodes have the same
value of objective function so the algorithm finds it hard to select an
appropriate direction.

Types of Hill Climbing Algorithm


Here we discuss the types of a hill-climbing algorithm in artificial
intelligence:
1. Simple Hill Climbing
It is the simplest form of the Hill Climbing Algorithm. It only takes into
account the neighboring node for its operation. If the neighboring node is
better than the current node then it sets the neighbor node as the current
node. The algorithm checks only one neighbor at a time. Following are a
few of the key feature of the Simple Hill Climbing Algorithm
Since it needs low computation power, it consumes lesser time
The algorithm results in sub-optimal solutions and at times the solution is
not guaranteed

Algorithm
1. Examine the current state, Return success if it is a goal state
2. Continue the Loop until a new solution is found or no operators are left
to apply
3. Apply the operator to the node in the current state
4. Check for the new state
If Current State = Goal State, Return success and exit
Else if New state is better than current state then Goto New state
Else return to step 2
5. Exit

2. Steepest-Ascent Hill Climbing


Steepest-Ascent hill climbing is an advanced form of simple Hill Climbing
Algorithm. It runs through all the nearest neighbor nodes and selects the
node which is nearest to the goal state. The algorithm requires more
computation power than Simple Hill Climbing Algorithm as it searches
through multiple neighbors at once.

Algorithm
1. Examine the current state, return success if it is a goal state
2. Continue the loop until a new solution is found or no operators are left to apply
Let 'Temp' be a state that records the best successor of the current state found so far. For all operators that can be applied to the current state:
Apply the operator to create a new state
Examine the new state
If the new state is the goal state, return success and exit
Else if the new state is better than Temp, then set the new state as Temp
If Temp is better than the Current State, set Current State to Temp
3. Stochastic Hill Climbing
Stochastic Hill Climbing doesn’t look at all its neighboring nodes to check
if it is better than the current node instead, it randomly selects one
neighboring node, and based on the pre-defined criteria it decides whether
to go to the neighboring node or select an alternate node.
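A minimal sketch of stochastic hill climbing in Python, maximizing a simple one-dimensional objective; the objective function, step size, and iteration count are assumptions chosen only for illustration.

# Stochastic hill climbing: pick ONE random neighbour and move to it if it is better.
import random

def objective(x):
    return -(x - 3) ** 2 + 9          # single peak at x = 3

current = random.uniform(-10, 10)      # arbitrary (non-optimal) starting state
step = 0.5

for _ in range(10_000):
    neighbour = current + random.uniform(-step, step)   # one randomly chosen neighbour
    if objective(neighbour) > objective(current):        # pre-defined acceptance criterion
        current = neighbour                               # climb toward the peak

print(round(current, 2), round(objective(current), 2))   # near x = 3, value near 9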
Advantages of Hill Climbing Algorithm in Artificial Intelligence
The advantages of the Hill Climbing Algorithm in Artificial Intelligence are given below:
Hill Climbing is very useful in routing-related problems like Travelling
Salesmen Problem, Job Scheduling, Chip Designing, and Portfolio
Management
It is good in solving the optimization problem while using only limited
computation power
It is more efficient than other search algorithms
Hill Climbing Algorithm is a very widely used algorithm for Optimization
related problems as it gives decent solutions to computationally
challenging problems. It has certain drawbacks associated with it, like the Local Minima, Ridge, and Plateau problems, which can be addressed by using more advanced algorithms.
3.4 EVALUATION OF MODELS
What is Model Evaluation?
Model evaluation is the process of using different evaluation metrics to
understand a machine learning model’s performance, as well as its
strengths and weaknesses. Model evaluation is important to assess the
efficacy of a model during initial research phases, and it also plays a role
in model monitoring.
To understand if your model(s) is working well with new data, you can
leverage a number of evaluation metrics.

Classification
The most popular metrics for measuring classification performance
include accuracy, precision, confusion matrix, log-loss, and AUC (area
under the ROC curve).
● Accuracy measures how often the classifier makes the correct
predictions, as it is the ratio between the number of correct predictions and
the total number of predictions.
● Precision measures the proportion of predicted Positives that are
truly Positive. Precision is a good choice of evaluation metrics when you
want to be very sure of your prediction. For example, if you are building a
system to predict whether to decrease the credit limit on a particular
account, you want to be very sure about the prediction or it may result in
customer dissatisfaction.
● The confusion matrix (or confusion table) shows a more detailed
breakdown of correct and incorrect classifications for each class. Using a
confusion matrix is useful when you want to understand the distinction
between classes, particularly when the cost of misclassification might
differ for the two classes, or you have a lot more test data on one class
than the other. For example, the consequences of making a false positive
or false negative in a cancer diagnosis are very different.

Example of Confusion Matrix on Iris Flower Dataset

● Log-loss (logarithmic loss) can be used if the raw output of the


classifier is a numeric probability instead of a class label. The probability
can be understood as a gauge of confidence, as it is a measurement of
accuracy.
● AUC (Area Under the ROC Curve) is a performance measurement
for classification problems at various threshold settings. It tells how capable a model is of distinguishing between classes. The higher the AUC, the better the model is at predicting that a 0 is actually a 0 and a 1 is actually a 1. Similarly, the higher the AUC, the better the model is at distinguishing between patients with a disease and patients with no disease.
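The sketch below computes several of the classification metrics above with scikit-learn on a small made-up set of labels and predicted probabilities; availability of the scikit-learn package is assumed.

# Classification evaluation metrics on illustrative labels and scores.
from sklearn.metrics import (accuracy_score, precision_score,
                             confusion_matrix, log_loss, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]     # predicted probability of class 1
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # hard labels at a 0.5 threshold

print(accuracy_score(y_true, y_pred))          # fraction of correct predictions
print(precision_score(y_true, y_pred))         # proportion of predicted positives that are truly positive
print(confusion_matrix(y_true, y_pred))        # per-class breakdown of correct and incorrect calls
print(log_loss(y_true, y_prob))                # penalises confident wrong probabilities
print(roc_auc_score(y_true, y_prob))           # threshold-independent ranking quality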
Other popular metrics exist for regression models, like R Square, Adjusted
R Square, MSE (Mean Squared Error), RMSE (Root Mean Squared
Error), and MAE (Mean Absolute Error).



Module III

4
BI USING DATA WAREHOUSING
4.1 Introduction to DW
4.2 DW architecture
4.3 ETL Process
4.4 Data Warehouse Design

4.1 INTRODUCTION TO DW
A Data Warehouse (DW) is an environment maintained separately from the organization's operational database. Its architectural construct provides users with current and historical decision support information which is not available in the traditional operational data store. The DW provides a new design which helps reduce response time and enhance the performance of queries for reports and analytics.
Data warehouse system is also known by the following name:
❖ Decision Support System (DSS)
❖ Executive Information System
❖ Management Information System
❖ Business Intelligence Solution
❖ Analytic Application
❖ Data Warehouse

Fig 1: Data warehouse System

History of Datawarehouse
Data warehousing arose from the need to handle increasing amounts of information.

How Datawarehouse works?


The DW works as a central repository fed from one or more data sources; data flows into the DW from transactional systems and other relational databases.
Data may be:
1. Structured
2. Semi-structured
3. Unstructured data

Types of Data Warehouse


Three main types of Data Warehouses (DWH) are:
1. Enterprise Data Warehouse (EDW):
An Enterprise Data Warehouse (EDW) is a centralized warehouse that provides decision support services across the enterprise for organizing and representing data.

2. Operational Data Store:


An Operational Data Store (ODS) is a data store required when neither the DW nor the OLTP systems support an organization's reporting needs; it is preferred for routine activities like storing records of employees.

3. Data Mart:
A data mart, a subset of the DW, is designed for a particular line of business such as sales, finance, etc.

General stages of Data Warehouse


❖ Offline Operational Database
❖ Offline Data Warehouse
❖ Real-time Data Warehouse
❖ Integrated Data Warehouse

Components of Data warehouse


Four components of Data Warehouses are:
❖ Load manager
❖ Warehouse Manager

❖ Query Manager
❖ End-user access tools: these are categorized into five different groups, namely 1. data reporting tools, 2. query tools, 3. application development tools, 4. EIS tools, and 5. OLAP and data mining tools.

Who needs Data warehouse?


A DWH (data warehouse) is needed for all types of users, such as:
❖ Decision makers
❖ Users who use customized, complex processes to obtain information from multiple data sources
❖ Technology-savvy users who want a simple way to access data
❖ Users who need a systematic approach for making decisions
❖ Users who need fast performance on a huge amount of data
❖ Users who want to discover hidden patterns in data

Data Warehouse is used for?


❖ Airline
❖ Banking
❖ Healthcare
❖ Public sector
❖ Investment and Insurance sector
❖ Retail chain
❖ Telecommunication
❖ Hospitality Industry

Steps to Implement Data Warehouse


1. Enterprise strategy
2. Phased delivery
3. Iterative Prototyping
Here are the key steps in data warehouse implementation along with their deliverables.


Fig 2: Data Warehouse implementation


Advantages of Data Warehouse (DWH) [1-15]:
● DW allows business users to quickly access critical data.
● DW provides consistent information on various cross-functional
activities and supports ad-hoc reporting and query.
● DW helps to integrate many sources of data.
● DW helps to reduce total turnaround time for analysis and reporting.
● Restructuring and Integration make it easier for the user to use for
reporting and analysis.
● DW allows users to access critical data.
● DW analyzes different time periods and trends to make future
predictions.

Disadvantages of Data Warehouse:


❖ Not an ideal option for unstructured data.
❖ Creating and implementing a DW is a time-consuming process.
❖ Difficult to make changes in data types and ranges, data source
schema, indexes, and queries.
❖ Organizations must spend a lot of resources on training and implementation.

The Future of Data Warehousing
❖ Changes in regulatory constraints.
❖ Size of the database.
❖ Multimedia data.

Data Warehouse Tools


1. MarkLogic:
https://www.marklogic.com/product/getting-started/

2. Oracle:
https://www.oracle.com/index.html

3. Amazon RedShift:
https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful Datawarehouse Tools.

Sl. No | DWH Tool | Platform | Features
1 | CData Sync | Windows, Mac, Linux, Cloud | Automated intelligent incremental data replication; fully customizable ETL/ELT data transformation; runs anywhere – on-premise or in the cloud.
2 | Integrate.io | Cloud | Connects to all major e-commerce providers such as Shopify, NetSuite, BigCommerce, and Magento; meets all compliance requirements with security features such as field-level data encryption, SOC II certification, GDPR compliance, and data masking.
3 | QuerySurge | Windows, Linux | Speeds up the testing process by up to 1,000x while providing up to 100% data coverage; integrates an out-of-the-box DevOps solution for most build, ETL & QA management software.
4 | Astera DW Builder | Windows | Automates ETL operations through job scheduling and workflow automation.
5 | MS SSIS | Windows | Tightly integrated with Microsoft Visual Studio and SQL Server; consumes data from difficult sources like FTP, HTTP, MSMQ, and Analysis Services; data can be loaded in parallel to many varied destinations.
Characteristics of Data warehouse


Data Warehouse Concepts have following characteristics:
❖ Subject-Oriented
❖ Integrated
❖ Time-variant
❖ Non-volatile

Fig 3: Data Integration Issues

4.2 DW ARCHITECTURE
Business Analysis Framework
Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business stakeholders in the market. A data warehouse has the following advantages −
❖ Can enhance business productivity.
❖ Helps us manage customer relationship.
❖ Brings down the costs by tracking trends, patterns over a long period
in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to
understand and analyze the business needs and construct a business
analysis framework. Views are as follows:
❖ The top-down view
❖ The data source view
❖ The data warehouse.
❖ The business query view
Data warehouses and their architectures vary depending upon the elements
of an organization's situation and are classified as:
❖ Data Warehouse Architecture: Basic
❖ Data Warehouse Architecture: With Staging Area
❖ Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Fig 4: Data Warehouse Architecture - Basic


Data Warehouse Architecture: With Staging Area
We must clean and process operational information before putting it into the warehouse.

Fig 5: Data Warehouse Architecture with a staging area


Note: The Data Warehouse Staging Area is a temporary location where records from source systems are copied.

Fig 6: Data Warehouse Architecture with Staging Area and Data Marts (a)

The figure 6 illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.

Fig 7: Architecture of Date warehouse with staging area and data


marts (b)
Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse
system:
1. Separation
2. Scalability
3. Extensibility
4. Security
5. Administerability

DATA WAREHOUSE ARCHITECTURES


Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows the only layer physically available is the source layer. In
this method, data warehouses are virtual. This means that the data
warehouse is implemented as a multidimensional view of operational data
created by specific middleware, or an intermediate processing layer.

Fig 8: Single Tier Data warehouse Architecture


The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are submitted to operational data after the middleware interprets them. In this way, queries affect transactional
workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:


Fig 9: Two Tier Data warehouse Architecture


Although it is typically called two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse (a minimal ETL sketch is given after this list).
3. Data Warehouse layer: Information is saved to one centralized
repository: a data warehouse. The data warehouse can be directly
accessed, but it can also be used as a source for creating data marts,
which partially replicate data warehouse contents and are designed for
specific enterprise departments.
4. Analysis: Integrated data is efficiently and flexibly accessed to issue
reports, dynamically analyze information, and simulate hypothetical
business scenarios.

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple
source systems), the reconciled layer, and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer
sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard
reference data model for a whole enterprise. At the same time, it separates
the problems of source data extraction and integration from those of data
warehouse population.

Fig 10: Three Tier Data warehouse Architecture


Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that
includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is
almost always an RDBMS. It may include several specialized data marts
and a metadata repository.
Data from operational databases and external sources (such as user profile
data provided by external consultants) are extracted using application
program interfaces called a gateway. A gateway is provided by the
underlying DBMS and allows client programs to generate SQL code to
be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity)
and OLE-DB (Object Linking and Embedding for Databases), by
Microsoft, and JDBC (Java Database Connectivity).
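As a purely illustrative sketch of how a client program might use such a gateway, the following Python snippet sends SQL through an ODBC connection using the pyodbc package; the DSN name, credentials, and table name are invented for the example and do not come from the text above.

import pyodbc  # third-party ODBC gateway library for Python

# The DSN, user name, and password are hypothetical placeholders.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=analyst;PWD=secret")
cursor = conn.cursor()

# The client program generates SQL; the gateway executes it at the server.
cursor.execute("SELECT region, SUM(sales_amount) FROM fact_sales GROUP BY region")
for region, total_sales in cursor.fetchall():
    print(region, total_sales)

conn.close()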
A middle-tier which consists of an OLAP server for fast querying of the
data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational
DBMS that maps functions on multidimensional data to standard
relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose
server that directly implements multidimensional information and
operations.
A top-tier that contains front-end tools for displaying results provided by
OLAP, as well as additional tools for data mining of the OLAP-generated
data.
The overall Data Warehouse Architecture is shown in Fig 11.

Fig 11: Overall Data warehouse Architecture


The metadata repository stores information that defines DW objects. It
includes the following parameters and information for the middle and the
top-tier applications:
1. A description of the DW structure, including the warehouse schema,
dimension, hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of
the stored data.
3. System performance data, which includes indices, used to improve
data access and retrieval performance.
4. Summarization algorithms

Principles of Data Warehousing

Fig 12: Principles of Data warehousing

Load Performance
Data warehouses require incremental loading of new data on a periodic
basis within narrow time windows; performance of the load process should
be measured in hundreds of millions of rows and gigabytes per hour and
must not artificially constrain the volume of data the business requires.

Load Processing
Many steps must be taken to load new or updated data into the data
warehouse, including data conversion, filtering, reformatting, indexing,
and metadata updates.

Data Quality Management


Fact-based management demands the highest data quality.

Query Performance
Fact-based management must not be slowed by the performance of the
data warehouse RDBMS; large, complex queries must complete in
seconds.

Terabyte Scalability
Data warehouse sizes are growing at enormous rates, ranging today from a
few gigabytes to hundreds of gigabytes and terabyte-sized data warehouses.
Types of Data Warehouses
There are different types of data warehouses, which are as follows:

Fig 13: Types of Data warehouses

Host-Based Data Warehouses
There are two types of host-based data warehouses which can be
implemented:
❖ Host-based mainframe warehouses, which reside on a high-volume
database, supported by robust and reliable high-capacity structures such as
IBM System/390, UNISYS and Data General sequent systems, and
databases such as Sybase, Oracle, Informix, and DB2.
❖ Host-based LAN data warehouses, where data delivery can be
handled either centrally or from the workgroup environment. The size of
the data warehouse depends on the platform.
Data Extraction and transformation tools allow the automated extraction
and cleaning of data from production systems.
1. A huge load of complex warehousing queries would possibly have too
much of a harmful impact upon the mission-critical transaction
processing (TP)-oriented application.
2. These transaction processing systems have been optimized in their
database design for transaction throughput, not for analysis.
3. There is no assurance that data remains consistent.
Host-Based (MVS) Data Warehouses
Data warehouses that reside on large-volume databases on MVS are the
host-based type of data warehouse. Often the DBMS is DB2, with a huge
variety of original sources for legacy information such as VSAM, DB2,
flat files, and Information Management System (IMS).

Fig 14: Host based (MVS) Data warehouse


Before embarking on designing, building and implementing such a
warehouse, some further considerations must be taken into account:
1. Databases have very high volumes of data storage.
2. Warehouses may require support for both MVS and customer-based
report and query facilities.
3. DW has complicated source systems.
4. Needs continuous maintenance.
To make the building of such data warehouses successful, the following
phases are generally followed:
❖ Unload Phase
❖ Transform Phase
❖ Load Phase
An integrated metadata repository is central to any data warehouse
environment. It provides a dynamic link between the multiple data
source databases and the DB2-based data warehouse.
A metadata repository is necessary to design, build, and maintain data
warehouse processes. It should be capable of providing data as to what
data exists in both the operational system and data warehouse, where the
data is located. Query, reporting, and maintenance facilities are another
indispensable part of such a data warehouse; an MVS-based query and
reporting tool for DB2 is typically used.
Host-Based (UNIX) Data Warehouses
Oracle and Informix RDBMSs support the facilities for such data
warehouses. Both of these databases can extract information from MVS-
based databases as well as from a large number of other UNIX-based
databases.
LAN-Based Workgroup Data Warehouses
A LAN based workgroup warehouse is an integrated structure for building
and maintaining a data warehouse in a LAN environment. We can extract
information from a variety of sources and support multiple LAN based
warehouses; generally chosen warehouse databases include the DB2
family, Oracle, Sybase, and Informix. Other databases that can also be
included, though infrequently, are IMS, VSAM, flat files, MVS, and VM.

Fig 15: LAN based Work Group Warehouse


Designed for the workgroup environment, a LAN based workgroup
warehouse is optimal for any business organization that wants to build a
data warehouse, often called a data mart.
Data Delivery: With a LAN based workgroup warehouse, customers need
minimal technical knowledge to create and maintain a store of data that is
customized for use at the department, business unit, or workgroup level,
and it ensures the delivery of information from corporate resources by
providing transparent access to the data in the warehouse.

Host-Based Single Stage (LAN) Data Warehouses


Within a LAN based data warehouse, data delivery can be handled either
centrally or from the workgroup environment so that business groups can
process their data needs without burdening centralized IT resources,
enjoying the autonomy of their data mart without compromising overall
data integrity and security in the enterprise.

Fig 16: LAN based Single Stage Warehouse


A LAN based warehouse provides data from many sources while requiring
minimal initial investment and technical knowledge. A LAN based
warehouse can also use replication tools for populating and updating the
data warehouse. This type of warehouse can include business views,
histories, aggregation, versioning, and heterogeneous source support, such
as
❖ DB2 Family
❖ IMS, VSAM, Flat File [MVS and VM]
A single store frequently drives a LAN based warehouse and provides
existing DSS applications, enabling the business user to locate data in their
data warehouse. The LAN based warehouse can support business users
with a complete data-to-information solution.

Multi-Stage Data Warehouses


A multi-stage data warehouse is well suited to environments where
end-clients in numerous capacities require access both to summarized
information for up-to-the-minute tactical decisions and to a summarized,
cumulative record for long-term strategic decisions. Both the Operational
Data Store (ODS) and the data warehouse may reside on host-based or
LAN based databases, depending on volume and custom requirements.
These include DB2, Oracle, Informix, IMS, flat files, and Sybase.

Fig 17: Multistage Data Warehouse


Stationary Data Warehouses
In this type of data warehouse, the data is not moved from the sources,
as depicted in fig:

Fig 18: Stationary data Warehouse


The customer is given direct access to the data. Problems generated by this
schema are:
❖ Identifying the location of the information for the users
❖ Providing clients the ability to query different DBMSs as if they were
all a single DBMS with a single API.

❖ Impacting performance since the customer will be competing with the
production data stores.

Distributed Data Warehouses


The concept of a distributed data warehouse suggests two types: local
enterprise warehouses, which are distributed throughout the enterprise,
and global warehouses, as shown in fig:

Fig 19: Distributed Data Warehouse


Characteristics of Local data warehouses
❖ Has its unique architecture and contents of data
❖ The data is unique
❖ Majority of the record is local and not replicated
❖ Any intersection of data between local data warehouses is
circumstantial
❖ Local warehouse serves different technical communities

Virtual Data Warehouses


A virtual data warehouse is created in the following stages:
1. Installing a set of data access, data dictionary, and process
management facilities.
2. Training end-clients.
3. Monitoring how DW facilities will be used
This strategy provides ultimate flexibility as well as the minimum amount
of redundant information that must be loaded and maintained and is
termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of data:
1. A data dictionary including the definitions of the various databases.
2. A description of the relationship between the data components.
3. A description of how the user will interface with the system.
4. The algorithms and business rules that describe what to do and how to
do it.

Disadvantages
1. Queries competing with production record transactions can degrade
the performance.
2. No metadata, no summary record, or no individual DSS (Decision
Support System) integration or history.
3. No refreshing process, causing the queries to be very complex.

Fig 20: Virtual data Warehouse

4.3 ETL PROCESS


ETL (Extract, Transform and Load) is a process that extracts data from
different source systems, transforms the data, and finally loads it into the
Data Warehouse system.

ETL Process in Data Warehouses


ETL is a 3-step process


Fig 21: ETL Process in Data Warehouse


Step 1) Extraction
Data is extracted into a staging area, and transformations are done there so
that the performance of the source system is not degraded. The staging
area also gives an opportunity to validate extracted data before it moves
into the DW.

Three Data Extraction methods:


1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification

Step 2) Transformation
Data extracted from the source server is raw and not usable in its original
form; it needs to be cleansed, mapped, and transformed.

Fig 22: Data Integration Issues

Step 3) Loading
A large volume of data needs to be loaded in a relatively short period, so
loading needs to be optimized for performance.
In case of load failure, recovery mechanisms should be configured to
restart from the point of failure without loss of data integrity.

Types of Loading:
❖ Initial Load — populating all the Data Warehouse tables
❖ Incremental Load — applying ongoing changes as when needed
periodically.
❖ Full Refresh —erasing the contents of one or more tables and
reloading with fresh data.
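The three steps just described can be illustrated with a minimal, self-contained Python sketch. It is only a toy pipeline under assumed names (daily_sales.csv, customer_id, country, amount, fact_sales), with sqlite3 standing in for the warehouse; it is not the design of any particular ETL tool.

import csv
import sqlite3

# Create a tiny sample source file so the sketch is self-contained.
with open("daily_sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "country", "amount"])
    writer.writerow(["C001", " in ", "1250.50"])
    writer.writerow(["", "uk", "99.90"])      # missing key: dropped during transformation
    writer.writerow(["C002", "uk", "480.00"])

def extract(path):
    # Extraction: read raw records from the source into memory (the staging area).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: cleanse, standardize, and convert types before loading.
    cleaned = []
    for row in rows:
        if not row["customer_id"]:
            continue                                     # drop records with a missing key
        cleaned.append((row["customer_id"].strip(),
                        row["country"].strip().upper(),  # standardize country codes
                        float(row["amount"])))           # fix data types
    return cleaned

def load(rows, full_refresh=False):
    # Loading: supports an initial/incremental load or a full refresh.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales "
                 "(customer_id TEXT, country TEXT, amount REAL)")
    if full_refresh:
        conn.execute("DELETE FROM fact_sales")           # erase and reload with fresh data
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

load(transform(extract("daily_sales.csv")))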

ETL Tools
Prominent data warehousing tools available in the market are:

1. MarkLogic: https://www.marklogic.com/product/getting-started/
2. Oracle: https://www.oracle.com/index.html
3. Amazon Redshift: https://aws.amazon.com/redshift/?nc2=h_m1

Best practices ETL process


❖ Never try to cleanse all the data
❖ Never cleanse anything
❖ Determine the cost of cleansing the data
❖ To speed up query processing, have auxiliary views and indexes
Difference between ETL and ELT

ETL (Extract, Transform, and Load)


Extract, Transform and Load is the technique of extracting the records
from sources to a staging area, then transforming or reformatting them
with business manipulation performed in order to fit the operational needs
or data analysis, and later loading them into the target or destination
database or data warehouse.


Fig 23: ETL


Strengths
❖ Development Time
❖ Targeted data
❖ Tools Availability

Weaknesses
❖ Flexibility
❖ Hardware
❖ Learning Curve
ELT (Extract, Load and Transform)
ELT stands for Extract, Load and Transform and is a different way of
looking at data migration or movement. ELT involves the extraction of
aggregate information from the source system and loading it to the target
system instead of transforming it between the extraction and loading
phases. Once the data is copied or loaded into the target system, the
transformation takes place.

Fig 24: ELT


Strengths
❖ Project Management
❖ Flexible & Future Proof

❖ Risk minimization
❖ Utilize Existing Hardware
❖ Utilize Existing Skill sets
Weaknesses
❖ Against the Norm
❖ Tools Availability
Difference between ETL vs. ELT
Process: In ETL, data is transferred to the ETL server and moved back to
the DB, so high network bandwidth is required. In ELT, data remains in
the DB except for cross-database moves (e.g. source to target).
Transformation: In ETL, transformations are performed in the ETL
server. In ELT, transformations are performed in the target (or in the
source).
Code Usage: ETL is typically used for source-to-target transfers,
compute-intensive transformations, and small amounts of data. ELT is
typically used for high amounts of data.
Time and Maintenance: ETL needs high maintenance, as you need to
select the data to load and transform. ELT needs low maintenance, as the
data is always available.
Calculations: ETL overwrites the existing column or needs to append the
dataset and push it to the target platform. ELT easily adds the calculated
column to the existing table.

4.4 DATA WAREHOUSE DESIGN


A data warehouse is a single data repository where records from multiple
data sources are integrated for online business analytical processing
(OLAP). This implies a data warehouse needs to meet the requirements of
all the business stages within the entire organization, which makes its
design a hugely complex, lengthy, and hence error-prone process.
Furthermore, data warehouse and OLAP systems are dynamic, and the
design process is continuous.
Data warehouse design takes a method different from view materialization
in industry and has two approaches:
1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a
subject-oriented, time-variant, non-volatile and integrated data repository
for the entire enterprise. Data from different sources are validated,
reformatted and saved in a normalized (up to 3NF) database as the data
warehouse. The main advantage of this method is that it supports a single
integrated data source.

Advantages of top-down design


❖ Data Marts are loaded from the data warehouses.
❖ Developing new data marts from the data warehouse is very easy.

Disadvantages of top-down design


❖ This technique is inflexible to changing departmental needs.
❖ The cost of implementing the project is high.

Fig 25: Top Down Design Approach

Bottom-Up Design Approach
In the "Bottom-Up" approach, a DW is described as "a copy of transaction
data specifically architected for query and analysis," termed the star
schema. In this approach, a data mart is created first to provide the
necessary reporting and analytical capabilities for particular business
processes. Data marts include the lowest grain data and, if needed,
aggregated data.
The main advantage of the "bottom-up" design approach is that it has a
quick ROI and takes less time and effort than developing an
enterprise-wide data warehouse. In addition, the risk of failure is lower.
This method is inherently incremental and allows the project team to learn
and grow.

Fig 26: Bottom Up Design Approach


Advantages of bottom-up design
❖ Documents can be generated quickly.
❖ DW can be extended to accommodate new business units.
❖ New data marts can be developed and then integrated with other data
marts.

Disadvantages of bottom-up design


The locations of the data warehouse and the data marts are reversed in the
bottom-up approach design.

Differentiate between Top-Down Design Approach and Bottom-Up
Design Approach

❖ Top-down breaks the vast problem into smaller sub-problems;
bottom-up solves the essential low-level problems and integrates them
into a higher one.
❖ Top-down is inherently architected, not a union of several data marts;
bottom-up is inherently incremental and can schedule essential data marts
first.
❖ Top-down keeps a single, central store of information about the
content; bottom-up stores information departmentally.
❖ Top-down has centralized rules and control; bottom-up has
departmental rules and control.
❖ Top-down includes redundant information; with bottom-up,
redundancy can be removed.
❖ Top-down may show quick results if implemented with repetitions;
bottom-up carries less risk of failure, a favorable return on investment, and
proof of techniques.
5
DATA MART
Unit Structure
5.1 Data mart
5.2 OLAP
5.3 Dimensional Modeling
5.4 Operations on Data Cube
5.5 Schema
5.6 References
5.7 MOOCs
5.8 Video Lectures
5.9 Quiz

5.1 DATA MART


A Data Mart is focused on a single functional area of an organization and
contains a subset of the data stored in a Data Warehouse. It is a condensed
version of a Data Warehouse designed for use by a specific department,
unit or set of users in an organization. Data marts are small in size and are
more flexible compared to a Data Warehouse.
Types of Data Mart
There are three main types of data mart:
1. Dependent
2. Independent
3. Hybrid

Dependent Data Mart


A dependent data mart sources an organization's data from a single Data
Warehouse. It can be built in two different ways:
❖ A user can access both the data mart and the data warehouse,
depending on need, or access is limited only to the data mart.
❖ The latter approach sometimes produces a data junkyard (all data
begins with a common source, but is scrapped and mostly junked).


Fig 27: Dependent Data Mart

Independent Data Mart


It is created without the use of a central data warehouse and is an ideal
option for smaller groups within an organization. It has neither a
relationship with the enterprise data warehouse nor with any other data
mart.

Fig 28: Independent Data Mart

Hybrid Data Mart:
It combines input from sources apart from the data warehouse and is
helpful in integration. A hybrid data mart also supports large storage
structures, and it is best suited for smaller, flexible, data-centric
applications.

Fig 29: Hybrid Data Mart

Steps in Implementing a Datamart

Fig 30: Steps in implementing a Data mart


Designing
Designing is the first phase of Data Mart implementation. It includes the
following tasks:
❖ Gathering the business & technical requirements and Identifying data
sources.
❖ Selecting the appropriate subset of data.
❖ Designing the logical and physical structure of the data mart.
Data could be partitioned based on following criteria:
❖ Date
❖ Business or Functional Unit
❖ Geography
❖ Any of the above combination

Constructing Data mart
This second phase of implementation involves creating the physical
database and the logical structures. It involves the following tasks:
● Implementing the physical database designed in the earlier phase;
database schema objects like tables, indexes, views, etc. are to be created.

Populating:
In the third phase, data is populated in the data mart, involving the
following tasks:
❖ Data Mapping
❖ Extraction of source data
❖ Cleaning and transformation operations
❖ Loading data into the data mart
❖ Creating and storing metadata

Accessing
Accessing is the fourth step, which involves putting the data to use:
submitting queries to the database and displaying the results of the queries.
The accessing step needs to perform the following tasks:
❖ Translate database structures and object names into business terms
❖ Set up and maintain database structures.
❖ Set up API and interfaces if required

Managing
Managing is the last step of the data mart implementation process and
covers management tasks like:
❖ User access management.
❖ System optimizations and fine-tuning
❖ Adding and managing fresh data into the data mart.
❖ Planning recovery scenarios and ensuring system availability in case
the system fails.

Advantages and Disadvantages of a Data Mart


Advantages
● Data is valuable to a specific group of people in an organization.
● Cost-effective
● Easy to use and can accelerate business processes.
● Consumes less implementation time as compared to Data Warehouse
systems.
● Contains historical data enabling the analyst to determine data trends.

Disadvantages
● Maintenance problem.
● Data analysis is limited.

5.2 OLAP
Online Analytical Processing (OLAP) provides analysis of data for
business decisions and allows users to analyze database information from
multiple database systems at one time.
The primary objective is data analysis and not data processing.
Example of OLAP
Any Data warehouse system is an OLAP system.
Uses of OLAP:
❖ A company might compare their mobile phone sales in September with
sales in October, then compare those results with another location
which may be stored in a separate database.
❖ Amazon analyzes purchases by its customers to come up with a
personalized homepage with products which likely interest to their
customer.

Benefits of using OLAP services


❖ OLAP creates a single platform for all type of business analytical
needs which includes planning, budgeting, forecasting, and analysis.
❖ The main benefit of OLAP is the consistency of information and
calculations.
❖ Easily apply security restrictions on users and objects to comply with
regulations and protect sensitive data.

Drawbacks of OLAP service


❖ Implementation and maintenance are dependent on IT professional
because the traditional OLAP tools require a complicated modeling
procedure.
❖ OLAP tools need cooperation between people of various departments
to be effective, which might not always be possible.

OLTP
Online transaction processing (OLTP) supports transaction-oriented
applications in a 3-tier architecture, administering the day-to-day
transactions of an organization.

Example of OLTP system


An example of an OLTP system is an ATM center. Assume that a couple
has a joint account with a bank. One day both simultaneously reach
different ATM centers at precisely the same time and want to withdraw the
total amount present in their bank account.
However, the person who completes the authentication process first will
be able to get the money. In this case, the OLTP system makes sure that
the withdrawn amount will never be more than the amount present in the
account. The key point to note here is that OLTP systems are optimized
for transactional superiority instead of data analysis.
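As a hedged sketch of how an OLTP system can enforce such a rule, the following Python snippet uses the sqlite3 module; the account table, account number, and amounts are invented for the illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acc_no TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('JOINT-01', 5000.0)")
conn.commit()

def withdraw(conn, acc_no, amount):
    # The check and the update run inside one transaction, so two
    # simultaneous withdrawals cannot both succeed against the same balance.
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE acc_no = ?", (acc_no,)).fetchone()
        if amount > balance:
            raise ValueError("withdrawal exceeds available balance")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE acc_no = ?",
                     (amount, acc_no))

withdraw(conn, "JOINT-01", 3000.0)   # first withdrawal succeeds
# A second withdrawal of 3000.0 would now fail: only 2000.0 remains.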
Other examples of OLTP applications are:
❖ Online banking
❖ Online airline ticket booking
❖ Sending a text message
❖ Order entry
❖ Add a book to shopping cart

Benefits of OLTP method


❖ It administers daily transactions of an organization.
❖ OLTP widens the customer base of an organization by simplifying
individual processes.

Drawbacks of OLTP method


❖ If OLTP system faces hardware failures, then online transactions get
severely affected.
❖ OLTP systems allow multiple users to access and change the same
data at the same time, which can sometimes create unprecedented
situations.

OLTP Vs OLAP

Fig 31: OLTP VS OLAP

5.3 DIMENSIONAL MODELING


Dimensional modeling represents data with a cube operation, making the
logical data representation more suitable for OLAP data management.
The transaction record is divided into either "facts," which are frequently
numerical transaction data, or "dimensions," which are the reference
information that gives context to the facts.
Fact: A collection of associated data items, consisting of measures and
context data, representing business items or business transactions.
Dimension: A collection of data which describes one business dimension.
Measure: A numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.

Objectives of Dimensional Modeling


1. To produce database architecture that is easy for end-clients to
understand and write queries.
2. To maximize the efficiency of queries. These goals are achieved by
minimizing the number of tables and the relationships between them.

Advantages of Dimensional Modeling


❖ Dimensional modeling is simple
❖ Dimensional modeling promotes data quality
❖ Performance optimization is possible through aggregates

Disadvantages of Dimensional Modeling


❖ To maintain the integrity of facts and dimensions, loading the data
warehouse with records from various operational systems is complicated.
❖ It is difficult to modify the data warehouse operation if the
organization adopting the dimensional technique changes the method in
which it does business.
Two basic models used in dimensional modeling:
❖ Star Model
❖ Snowflake Model
Example: A store summary can be viewed by city and state in a fact table,
an item summary can be viewed by brand, color, etc., and customer
information can be viewed by name and address.

Fig 32: Fact table


Data Cube
A data cube is a multidimensional data model that stores optimized,
summarized, or aggregated data, which eases quick and easy analysis by
OLAP tools; storing pre-computed data eases online analytical processing.
Let us take an example of the data cube for AllElectronics sales.

Fig 33: Data Cube


Each dimension has a dimension table containing descriptions of the
dimension, such as branch_name, branch_code, branch_address, etc.
The fact table denotes the numeric measures, such as the number of units
of an item sold, the sales of a particular branch in a particular year, etc.
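As a small, hedged illustration, the pandas library can be used to pre-compute one face of such a cube; the item, branch, and year values below are invented, not taken from the AllElectronics example.

import pandas as pd

# A miniature fact table: one row per (item, branch, year) with the units sold.
fact = pd.DataFrame({
    "item":       ["TV", "TV", "Laptop", "Laptop", "TV", "Laptop"],
    "branch":     ["B1", "B2", "B1", "B2", "B1", "B2"],
    "year":       [2021, 2021, 2021, 2022, 2022, 2022],
    "units_sold": [120, 95, 60, 70, 130, 80],
})

# Aggregating along the item and year dimensions gives one face of the data cube.
cube_face = fact.pivot_table(values="units_sold", index="item",
                             columns="year", aggfunc="sum")
print(cube_face)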

TYPES OF OLAP SERVERS


❖ Relational OLAP (ROLAP)
❖ Multidimensional OLAP (MOLAP)
❖ Hybrid OLAP (HOLAP)
❖ Specialized SQL Servers

ROLAP
ROLAP servers are placed between relational back-end server and client
front-end tools. To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMS.
ROLAP includes:
❖ Implementation of aggregation navigation logic.
❖ Optimization for each DBMS back end.
❖ Additional tools and services.


Fig 34: ROLAP Architecture


MOLAP
MOLAP uses array-based multidimensional storage engines for
multidimensional views of data.

Fig 35: MOLAP Architecture


Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the
higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers allow storing large volumes of detailed data.

Specialized SQL Servers


Specialized SQL servers provide advanced query language and query
processing support for SQL queries over star and snowflake schemas in a
read-only environment.

ROLAP Vs MOLAP Vs HOLAP

5.4 OPERATIONS ON DATA CUBE
Basic operations implemented on a data cube are:
1. Roll Up
2. Drill Down
3. Slice and Dice
4. Pivot
1. Roll Up: Summarizes or aggregates the dimensions either by
performing dimension reduction or by climbing up a concept hierarchy.

Fig 36: Roll up


2. Drill Down
The data on a dimension is fragmented into a more granular form; for
example, quarters Q1 and Q2 are fragmented into months.

Fig 37: Drill Down

3. Slice and Dice
Slice picks up one dimension of the data cube and then forms a sub-cube
out of it. Here, the data cube is sliced based on time.

Fig 38: Slice


The dice operation selects more than one dimension to form a sub-cube.

Fig 39: Dice


4. Pivot
Pivot is not a calculative operation; rather, it rotates the data cube in order
to view it from different dimensions.

Fig 40: Pivot
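The four operations above can be mimicked, purely as a sketch, with the pandas library; the dimension and measure names below are assumptions made for the example.

import pandas as pd

fact = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "city":    ["Mumbai", "Pune", "Mumbai", "Pune", "Mumbai", "Pune"],
    "item":    ["TV", "TV", "Laptop", "Laptop", "TV", "Laptop"],
    "sales":   [100, 80, 60, 90, 120, 70],
})

# Roll up: climb the time hierarchy from quarters up to years.
roll_up = fact.groupby("year")["sales"].sum()

# Drill down: move back to the finer quarter level.
drill_down = fact.groupby(["year", "quarter"])["sales"].sum()

# Slice: fix one dimension (time = 2021) to obtain a sub-cube.
slice_2021 = fact[fact["year"] == 2021]

# Dice: select on two or more dimensions at once.
dice = fact[(fact["year"] == 2021) & (fact["city"] == "Mumbai")]

# Pivot: rotate the axes to view city by item instead of item by city.
pivoted = fact.pivot_table(values="sales", index="city", columns="item", aggfunc="sum")

print(roll_up, drill_down, slice_2021, dice, pivoted, sep="\n\n")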

Advantages of Data Cube
❖ Data cubes ease aggregating and summarizing the data.
❖ Data cubes provide better visualization of data.
❖ Data cubes store huge amounts of data in a very simplified way.
❖ Data cubes increase the overall efficiency of the data warehouse.
❖ The aggregated data in a data cube helps in analysing the data fast and
thereby reduces the access time.

5.5 SCHEMA
A schema defines the way the system is organized, with all the database
entities and their logical associations.

Different types of Schemas in DW:


1. Star Schema
2. SnowFlake Schema
3. Galaxy Schema
4. Star Cluster Schema
1) Star Schema
This is the simplest and most effective schema in a data warehouse. A fact
table in the center surrounded by multiple dimension tables resembles a
star in the Star Schema model. While designing star schemas the
dimension tables are purposefully de-normalized. They are wide with
many attributes to store the contextual data for better analysis and
reporting.

Benefits of Star Schema


❖ Queries use very simple joins while retrieving the data and thereby
query performance is increased.
❖ It is simple to retrieve data for reporting, at any point of time for any
period.

Disadvantages of Star Schema


❖ If there are many changes in the requirements, the existing star schema
is not recommended to modify and reuse in the long run.
❖ Data redundancy is more as tables are not hierarchically divided.


Fig 41: STAR Schema


Querying a Star Schema
An end-user can request a report using Business Intelligence tools, and the
request is processed by creating a chain of "SELECT queries". The
performance of these queries can be judged by the report execution time.
For example, if a business user wants to know how many Novels and
DVDs have been sold in the state of Kerala in January 2018, the query can
be applied as follows on the Star schema tables:
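The query itself is shown as a figure in the source and is not reproduced here. As a hedged reconstruction, assuming hypothetical tables named fact_sales, dim_product, dim_store, and dim_date, it might be written and run as follows, with Python's sqlite3 module standing in for the warehouse database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER,
                              date_id INTEGER, quantity_sold INTEGER);
    INSERT INTO dim_product VALUES (1, 'Novels'), (2, 'DVDs');
    INSERT INTO dim_store   VALUES (10, 'Kerala');
    INSERT INTO dim_date    VALUES (100, 'January', 2018);
    INSERT INTO fact_sales  VALUES (1, 10, 100, 12702), (2, 10, 100, 32919);
""")

# One simple join per dimension table: the hallmark of querying a star schema.
query = """
    SELECT p.product_name, SUM(f.quantity_sold) AS quantity_sold
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    JOIN dim_date    d ON f.date_id    = d.date_id
    WHERE s.state = 'Kerala' AND d.month = 'January' AND d.year = 2018
    GROUP BY p.product_name
    ORDER BY p.product_name;
"""
for product_name, quantity_sold in conn.execute(query):
    print(product_name, quantity_sold)   # DVDs 32919, then Novels 12702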

Results:
Product_Name Quantity_Sold

Novels 12,702

DVDs 32,919

2) SnowFlake Schema
A snowflake schema is obtained by completely normalizing all the
dimension tables of a star schema.

Benefits of SnowFlake Schema:


❖ Data redundancy is completely removed by creating new dimension
tables.
❖ When compared with star schema, less storage space is used by the
Snow Flaking dimension tables.
❖ It is easy to update (or) maintain the Snow Flaking tables.
Disadvantages of SnowFlake Schema:
❖ Due to normalized dimension tables, the ETL system has to load a
larger number of tables.
❖ You may need complex joins to perform a query due to the number
of tables added. Hence query performance will be degraded.

Fig 42: SnowFlake Schema


Querying a Snowflake Schema
Similarly, if a business user wants to know how many Novels and DVDs
have been sold in the state of Kerala in January 2018, the query can be
applied as follows on the SnowFlake schema tables:
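Again, the original query is given as a figure. As a hedged sketch of the difference, using the same hypothetical tables as in the star-schema example plus a normalized dim_state sub-dimension, the extra join is the only change.

# Hypothetical snowflake version: dim_store no longer holds the state name
# directly; it references a normalized dim_state table instead.
snowflake_query = """
    SELECT p.product_name, SUM(f.quantity_sold) AS quantity_sold
    FROM fact_sales f
    JOIN dim_product p  ON f.product_id = p.product_id
    JOIN dim_store   s  ON f.store_id   = s.store_id
    JOIN dim_state   st ON s.state_id   = st.state_id
    JOIN dim_date    d  ON f.date_id    = d.date_id
    WHERE st.state_name = 'Kerala' AND d.month = 'January' AND d.year = 2018
    GROUP BY p.product_name;
"""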

3) Galaxy Schema
A Galaxy Schema, also known as a Fact Constellation Schema, resembles
a collection of stars in the galaxy. In this schema, multiple fact tables
share the same dimension tables.

Fig 43: Galaxy Schema

4) Star Cluster Schema
A SnowFlake schema with many dimension tables may need more
complex joins while querying, while a star schema with fewer dimension
tables may have more redundancy. A star cluster schema takes a middle
path, selectively normalizing only the larger dimension tables and thereby
combining the benefits of both.

Fig 44: Star Cluster Schema


Star Schema Vs Snowflake Schema: Key Differences

5.6 REFERENCES
1. Introduction to Data Warehouse System. https://www.javatpoint.com/. [Last accessed on 10.03.2022]
2. Introduction to Data Warehouse System. https://www.guru99.com/. [Last accessed on 10.03.2022]
3. Introduction to Data Warehouse System. http://www.nitjsr.ac.in/. [Last accessed on 10.03.2022]
4. Introduction to Data Warehouse System. https://oms.bdu.ac.in/. [Last accessed on 10.03.2022]
5. Data Warehouse. https://www.softwaretestinghelp.com/. [Last accessed on 10.03.2022]
6. Introduction to Data Warehouse System. https://www.tutorialspoint.com/ebook/data_warehouse_tutorial/index.asp. [Last accessed on 10.03.2022]
7. Using ID3 Algorithm to build a Decision Tree to predict the weather. https://iq.opengenus.org/id3-algorithm/. [Last accessed on 10.03.2022]
8. Data Warehouse Architecture. https://binaryterms.com/data-warehouse-architecture.html. [Last accessed on 10.03.2022]
9. Compare and Contrast Database with Data Warehousing and Data Visualization. https://www.coursehero.com/file/28202760/Compare-and-Contrast-Database-with-Data-Warehousing-and-Data-Visualization-Databases-Assignmentdocx/. [Last accessed on 10.03.2022]
10. Data Warehousing and Data Mining. https://lastmomenttuitions.com/course/data-warehousing-and-mining/. [Last accessed on 10.03.2022]
11. Data Warehouse. https://one.sjsu.edu/task/all/finance-data-warehouse. [Last accessed on 10.03.2022]
12. DWDM Notes. https://dwdmnotes.blogspot.com. [Last accessed on 10.03.2022]
13. Data Warehouse and Data Mart. https://www.geeksforgeeks.org/difference-between-data-warehouse-and-data-mart/?ref=gcse. [Last accessed on 10.03.2022]
14. Data Warehouse System. https://www.analyticssteps.com/. [Last accessed on 10.03.2022]
15. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011 Jun 9.
16. CART. https://iq.opengenus.org/cart-algorithm. [Last accessed on 10.03.2022]
17. Bhatia P. Data mining and data warehousing: principles and practical techniques. Cambridge University Press; 2019 Jun 27.
18. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011 Jun 9.
19. Berzal F, Matín N. Data mining: concepts and techniques by Jiawei Han and Micheline Kamber. ACM SIGMOD Record. 2002 Jun 1; 31(2):66-8.
20. Gupta GK. Introduction to data mining with case studies. PHI Learning Pvt. Ltd.; 2014 Jun 28.
21. Zhou, Zhi-Hua. "Three perspectives of data mining." (2003): 139-146.
22. Wang J, editor. Encyclopedia of data warehousing and mining. IGI Global; 2005 Jun 30.
23. Pujari AK. Data mining techniques. Universities Press; 2001.

5.7 MOOCS
1. Data Warehousing for Business Intelligence Specialization. https://www.coursera.org/specializations/data-warehousing.
2. Data Mining. https://onlinecourses.swayam2.ac.in/cec20_cs12/preview.
3. Data Warehouse Concepts, Design, and Data Integration. https://www.coursera.org/learn/dwdesign.
4. Data Warehouse Courses. https://www.edx.org/learn/data-warehouse.
5. BI Foundations with SQL, ETL and Data Warehousing Specialization. https://www.coursera.org/specializations/bi-foundations-sql-etl-data-warehouse.
6. Fundamentals of Data Warehousing. https://www.mooc-list.com/initiative/coursera.
7. Foundations for Big Data Analysis with SQL. https://www.coursera.org/learn/foundations-big-data-analysis-sql.

5.8 VIDEO LECTURES


1. Data Warehouse Concepts. https://www.youtube.com/watch?v=CHYPF7jxlik.
2. Data warehouse schema design. https://www.youtube.com/watch?v=fpquGrdgbLg.
3. Data Modeling. https://www.youtube.com/watch?v=7ciFtfi-kQs.
4. Star and SnowFlake Schema in Data Warehouse. https://www.youtube.com/watch?v=VOJ54hu2e2Q.
5. Dimensional Modeling – Declaring Dimensions. https://www.youtube.com/watch?v=ajVfBJrTOxw.
6. What is ETL. https://www.youtube.com/watch?v=oF_2uDb7DvQ.
7. Star Schema & Snow Flake Design. https://www.youtube.com/watch?v=KUwOcip7Zzc.
8. OLTP vs OLAP. https://www.youtube.com/watch?v=aRT8E0nD_LE.
9. OLAP and Data Modeling Concepts. https://www.youtube.com/watch?v=rnQDuz1ZkIo.
10. Understand OLAP. https://www.youtube.com/watch?v=yoE6bgJv08E.
11. OLAP Cubes. https://www.youtube.com/watch?v=UKCQQwx-Fy4.
12. OLAP vs OLTP. https://www.youtube.com/watch?v=TCrCo2-w-vM.
13. OLAP. https://www.youtube.com/watch?v=AC1cLmbXcqA.
14. OLAP Vs OLTP. https://www.youtube.com/watch?v=kFQRrgHeiOo.

5.9 QUIZ
1. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing
Answer: a
2. Data that can be modeled as dimension attributes and measure attributes
are called _______ data.
a) Multidimensional
b) Singledimensional
c) Measured
d) Dimensional
Answer: a

3. The generalization of cross-tab which is represented visually is
____________ which is also called as data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid
Answer: a
4. The process of viewing the cross-tab (Single dimensional) with a fixed
value of one attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing
Answer: a
5. The operation of moving from finer-granularity data to a coarser
granularity (by means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting
Answer: a
6. In SQL the cross-tabs are created using
a) Slice
b) Dice
c) Pivot
d) All of the mentioned
Answer: a
{ (item name, color, clothes size), (item name, color), (item name, clothes
size), (color, clothes size), (item name), (color), (clothes size), () }
7. This can be achieved by using which of the following ?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned
Answer: d

8. What do data warehouses support?
a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a
9. SELECT item name, color, clothes SIZE, SUM(quantity)
FROM sales
GROUP BY rollup (item name, color, clothes SIZE);
How many grouping is possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b
10. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d
11. What is the full form of OLAP?
a) Online Application Programming
b) Online Application Processing
c) Online Analytical programming
d) Online Analytical Processing
Answer: d
12. Data that can be modelled as dimension attributes and measure
attributes are called ___________
a) Mono-dimensional data
b) Multi-dimensional data
c) Measurable data
d) Efficient data
Answer: b

13. The operation of changing the dimensions used in a cross-tab is called
as ________
a) Alteration
b) Pivoting
c) Piloting
d) Renewing
Answer: b
14. The operation of moving from finer granular data to coarser granular
data is called _______
a) Reduction
b) Increment
c) Roll up
d) Drill down
Answer: c
15. How many dimensions of multi-dimensional data do cross tabs enable
analysts to view?
a) 1
b) 2
c) 3
d) None of the mentioned
Answer: b
16. The _______ function allows substitution of values in an attribute of a
tuple
a) Cube
b) Unknown
c) Decode
d) Substitute
Answer: c
17. Which of the following OLAP systems do not exist?
a) HOLAP
b) MOLAP
c) ROLAP
d) None of the mentioned
Answer: d

18. State true or false: OLAP systems can be implemented as client-server
systems
a) True
b) False
Answer: a
19. The operation of moving from coarser granular data to finer granular
data is called _______
a) Reduction
b) Increment
c) Roll back
d) Drill down
Answer: d
20. State true or false: In OLAP, analysts cannot view a dimension in
different levels of detail.
a) True
b) False
Answer: b
21. What is a Star Schema?
a) a star schema consists of a fact table with a single table for each
dimension
b) a star schema is a type of database system
c) a star schema is used when exporting data from the database
d) none of these
Answer: A
22. What is the type of relationship in star schema?
a) many-to-many.
b) one-to-one
c) many-to-one
d) one-to-many
Answer: D
23. Fact tables are _______.
a) completely denormalized.
b) partially denormalized.
c) completely normalized.
d) partially normalized.
Answer: C

24. Data warehouse is volatile, because obsolete data are discarded
a) TRUE
b) FALSE
Answer: B
25. Which is NOT a basic conceptual schema in Data Modeling of Data
Warehouses?
a) Star schema
b) Tree schema
c) Snowflake schema
d) Fact constellations
Answer: B
26. Which is NOT a valid OLAP Rule by E.F.Codd?
a) Accessibility
b) Transparency
c) Flexible reporting
d) Reliability
Answer: D
27. Which is NOT a valid layer in Three-layer Data Warehouse
Architecture in Conceptual View?
a) Processed data layer
b) Real-time data layer
c) Derived data layer
d) Reconciled data layer
Answer: A
28. Among the types of fact tables which is not a correct type ?
a) Fact-less fact table
b) Transaction fact tables
c) Integration fact tables
d) Aggregate fact tables
Answer: C
29. Among the followings which is not a characteristic of Data
Warehouse?
a) Integrated
b) Volatile
c) Time-variant
d) Subject oriented
Answer: B
30. What is not considered as issues in data warehousing?
a) optimization
b) data transformation
c) extraction
d) intermediation
Answer: D
31. Which one is NOT considered a standard query technique?
a) Drill-up
b) Drill-across
c) DSS
d) Pivoting
Answer: C
32. Among the following which is not a type of business data ?
a) Real time data
b) Application data
c) Reconciled data
d) Derived data
Answer : B
33. A data warehouse is which of the following?
a) Can be updated by end users.
b) Contains numerous naming conventions and formats.
c) Organized around important subject areas.
d) Contains only current data.
Answer: C
34. A snowflake schema is which of the following types of tables?
a) Fact
b) Dimension
c) Helper
d) All of the above
Answer: D
35. The extract process is which of the following?
a) Capturing all of the data contained in various operational systems
b) Capturing a subset of the data contained in various operational systems
c) Capturing all of the data contained in various decision support systems
d) Capturing a subset of the data contained in various decision support
systems
Answer: B
36. The generic two-level data warehouse architecture includes which of
the following?
a) At least one data mart
b) Data that can be extracted from numerous internal and external sources
c) Near real-time updates
d) All of the above.
Answer: B
37. Which one is correct regarding MOLAP ?
a) Data is stored and fetched from the main data warehouse.
b) Use complex SQL queries to fetch data from the main warehouse
c) Large volume of data is used.
d) All are incorrect
Answer: A
38. In terms of data warehouse, metadata can be define as,
a) Metadata is a road-map to data warehouse
b) Metadata in data warehouse defines the warehouse objects.
c) Metadata acts as a directory.
d) All are incorrect
Answer: D
39. In terms of the ROLAP model, choose the most suitable answer
a) The warehouse stores atomic data.
b) The application layer generates SQL for the two dimensional view
c) The presentation layer provides the multidimensional view.
d) All are incorrect
Answer: D
40. In the OLAP model, the _ provides the multidimensional view.
a) Data layer
b) Data link layer
c) Presentation layer
d) Application layer
Answer: A

41. Which of the following is not true regarding characteristics of
warehoused data?
a) Changed data will be added as new data
b) Data warehouse can contains historical data
c) Obsolete data are discarded
d) Users can change data once entered into the data warehouse
Answer: D
42. ETL is an abbreviation for Elevation, Transformation and Loading
a) TRUE
b) FALSE
Answer: B
43. Which is the core of the multidimensional model that consists of a
large set of facts and a number of dimensions?
a) Multidimensional cube
b) Data model
c) Data cube
d) None of the above
Answer: C
44. Which of the following statements is incorrect
a) ROLAPs have large data volumes
b) Data form of ROLAP is a large multidimensional array made of cubes
c) MOLAP uses sparse matrix technology to manage data sparsity
d) Access for MOLAP is faster than ROLAP
Answer: B
45. Which of the following standard query techniques increase the
granularity
a) roll-up
b) dril-down
c) slicing
d) dicing
Answer: B
46. The full form of OLAP is
a) Online Analytical Processing
b) Online Advanced Processing
c) Online Analytical Performance
d) Online Advanced Preparation
Answer: A

47. Which of the following statements is/are incorrect about ROLAP?
a) ROLAP fetched data from data warehouse.
b) ROLAP data store as data cubes.
c) ROLAP use sparse matrix to manage data sparsity.
Answer: B and C
48. __ is a standard query technique that can be used within OLAP to
zoom in to more detailed data by changing dimensions.
a) Drill-up
b) Drill-down
c) Pivoting
d) Drill-across
Answer: B
49. Which of the following statements is/are correct about Fact
constellation
a) Fact constellation schema can be seen as a combination of many star
schemas.
b) It is possible to create fact constellation schema, for each star
schema or snowflake schema.
c) Can be identified as a flexible schema for implementation.
Answer: C
50. How to describe the data contained in the data warehouse?.
a) Relational data
b) Operational data
c) Meta data
d) Informational data
Answer: C
51. The output of an OLAP query is displayed as a
a) Pivot
b) Matrix
c) Excel
Answer: B and C
52. One can perform query operations on the data present in a Data
Warehouse
a) TRUE
b) FALSE
Answer: A

53. A __ combines facts from multiple processes into a single fact table
and eases the analytic burden on BI applications.
a) Aggregate fact table
b) Consolidated fact table
c) Transaction fact table
d) Accumulating snapshot fact table
Answer: B
54. In OLAP operations, Slicing is the technique of ____
a) Selecting one particular dimension from a given cube and providing
a new sub-cube
b) Selecting two or more dimensions from a given cube and providing
a new sub-cube
c) Rotating the data axes in order to provide an alternative presentation
of data
d) Performing aggregation on a data cube
Answer: A
55. Standalone data marts built by drawing data directly from operational
or external sources of data or both are known as independent data marts
a) TRUE
b) FALSE
Answer: A
56. Focusing on the modeling and analysis of data for decision makers, not
on daily operations or transaction processing is known
a) Integrated
b) Time-variant
c) Subject oriented
d) Non-volatile
Answer: C
57. Most of the time data warehouse is
a) Read
b) Write
c) Both
Answer: A
58. Data granularity is the ________ of details of data?
a) summarization
b) transformation
c) level
Answer: C

59. Which one is not a type of fact?
a) Fully Additive
b) Cumulative Additive
c) Semi Additive
d) Non Additive
Answer: C
60. When the level of details of data is reducing the data granularity goes
higher
a) True
b) False
Answer: B
61. Data Warehouses are having summarized and reconciled data which
can be used by decision makers
a) True
b) False
Answer: A
62. _____ refers to the currency and lineage of data in a data warehouse
a) Operational metadata
b) Business metadata
c) Technical metadata
d) End-User metadata
Answer: A





Module IV

6
DATA MINING AND PREPROCESSING
Unit Structure
6.0 Objectives
6.1 Introduction
6.2 Definition
6.3 Functionalities of Data Mining
6.3.1 Class/ Concept Descriptions
6.3.2 Mining Frequent Patterns, Associations, and Correlations
6.3.3 Association Analysis
6.3.4 Correlation Analysis
6.4 Data Preprocessing & KDD
6.4.1 Data Cleaning
6.4.2 Data Integration
6.4.3 Data Selection
6.4.4 Data Transformation
6.4.5 Data Mining
6.4.6 Pattern Evaluation
6.4.7 Knowledge representation
6.5 Data Reduction
6.6 Let us sum up
6.7 List of References
6.8 Bibliography
6.9 Unit End Exercises

6.0 OBJECTIVES
After going through this unit, you will be able to:
• Define data mining and its functionalities
• Understand the Knowledge discovery of data process
• Explain the steps in data pre-processing
• Describe the dimensionality of data
• Learn the data reduction and data compression techniques

6.1 INTRODUCTION
Generally, data mining (sometimes called data or knowledge discovery) is
the process of analyzing data from different perspectives and summarizing
it into useful information - information that can be used to increase
revenue, cut costs, or both. Data mining software is one of a number of
analytical tools for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding
correlations or patterns among dozens of fields in large relational
databases.
Data, Information, and Knowledge

Data
Data are any facts, numbers, or text that can be processed by a computer.
Today, organizations are accumulating vast and growing amounts of data
in different formats and different databases. This includes:

• operational or transactional data, such as sales, cost, inventory, payroll,
and accounting
• nonoperational data, such as industry sales, forecast data, and
macroeconomic data
• metadata - data about the data itself, such as logical database design or
data dictionary definitions

Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data
can yield information on which products are selling and when.

Knowledge
Information can be converted into knowledge about historical patterns and
future trends. For example, summary information on retail supermarket
sales can be analyzed in light of promotional efforts to provide knowledge
of consumer buying behavior. Thus, a manufacturer or retailer could
determine which items are most susceptible to promotional efforts.

The Foundations of Data Mining


Data mining techniques are the result of a long process of research and
product development. This evolution began when business data was first
stored on computers, continued with improvements in data access, and
more recently, generated technologies that allow users to navigate through
their data in real-time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and proactive
information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now
sufficiently mature:
Massive data collection
Powerful multiprocessor computers
Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent
META Group survey of data warehouse projects found that 19% of
respondents are beyond the 50-gigabyte level, while 59% expect to be
there by the second quarter of 1996. In some industries, such as retail,
these numbers can be much larger. The accompanying need for improved
computational engines can now be met in a cost-effective manner with
parallel multiprocessor computer technology. Data mining algorithms
embody techniques that have existed for at least 10 years but have only
recently been implemented as mature, reliable, understandable tools that
consistently outperform older statistical methods.
In the evolution from business data to business information, each new step
has built upon the previous one. For example, dynamic data access is
critical for drill-through in data navigation applications, and the ability to
store large databases is critical to data mining. From the user’s point of
view, the four steps listed in Table 1 were revolutionary because they
allowed new business questions to be answered accurately and quickly.

Table 1: Steps in the Evolution of Data Mining

Data Collection (1960s)
Business question: "What was my total revenue in the last five years?"
Enabling technologies: computers, tapes, disks
Product providers: IBM, CDC
Characteristics: retrospective, static data delivery

Data Access (1980s)
Business question: "What were unit sales in New England last March?"
Enabling technologies: relational databases (RDBMS), Structured Query
Language (SQL), ODBC
Product providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: retrospective, dynamic data delivery at the record level

Data Warehousing & Decision Support (1990s)
Business question: "What were unit sales in New England last March?
Drill down to Boston."
Enabling technologies: online analytic processing (OLAP),
multidimensional databases, data warehouses
Product providers: Pilot, Comshare, Arbor, Cognos, MicroStrategy
Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
Business question: "What's likely to happen to Boston unit sales next
month? Why?"
Enabling technologies: advanced algorithms, multiprocessor computers,
massive databases
Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent
industry)
Characteristics: prospective, proactive information delivery

6.2 DEFINITION
The technique of identifying patterns and relationships within large
databases through the use of advanced statistical methods.
Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help
companies focus on the most important information in their data
warehouses. Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions.

6.3 FUNCTIONALITIES OF DATA MINING


The two primary goals of data mining tasks tend to be prediction and
description.
Predictive Data Mining involves using some variables or fields in the
data set to predict unknown or future values of other variables of interest.
For example: judging from the findings of a patient's medical examination
whether he is suffering from a particular disease.

110
Descriptive Data Mining, on the other hand, focuses on finding patterns Data Mining and Preprocessing
describing the data that can be interpreted by humans. The common data
features are highlighted in the data set.
For example, count, average, etc. Therefore, it is possible to put data-
mining activities into one of two categories:

Data Mining Functionality:


6.3.1 Class/Concept Descriptions:
Data can be associated with classes or concepts, and it can be helpful to
describe individual classes and concepts in summarized, descriptive, and
yet accurate ways. These class or concept definitions are referred to as
class/concept descriptions.
Data Characterization: This is a summary of the general characteristics
or attributes of the class under inquiry. As an example, anyone may
acquire data connected to such items by performing SQL queries to
investigate the features of a software product whose sales climbed by 15%
two years ago.
Data Discrimination: This compares the general features of the class
under study (the target class) with the general features of one or more
contrasting classes. The output of this process can take many forms,
for example, bar charts, curves, and pie charts.

6.3.2 Mining Frequent Patterns, Associations, and Correlations:


Frequent patterns are items that are discovered to be most prevalent in
data.
There are several types of frequency that may be found in the dataset.

Frequent item set:


This refers to a group of items that are frequently found together, such as
milk and sugar.

Frequent Subsequence:
A pattern series that happens on a frequent basis, such as purchasing a
phone followed by a rear cover.

Frequent Substructure:
It refers to the various types of data structures, such as trees and graphs,
that can be joined with an itemset or subsequence.

6.3.3 Association Analysis:


This technique entails discovering association rules that expose
relationships among data items. It is a method for determining the links
between various elements; for example, it can be used to analyze the
sales of goods that are usually purchased together.

6.3.4 Correlation Analysis:


Correlation is a mathematical technique that can show whether and how
strongly the pairs of attributes are related to each other. For example,
taller people tend to weigh more.

6.4 DATA PREPROCESSING AND KNOWLEDGE DISCOVERY IN DATABASES (KDD)
Why is Data Preprocessing Important?
Data have quality if they satisfy the requirements of the intended use.
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Preprocessing of data is mainly to check the data quality. The quality can
be checked by the following
● Accuracy: To check whether the data entered is correct or not.
● Completeness: To check whether all required data is available or whether some values are missing or not recorded.
● Consistency: To check whether copies of the same data kept in different places match.
● Timeliness: The data should be updated in a timely manner.
● Believability: The data should be trustworthy.
● Interpretability: The data should be easy to understand.
Some people treat data mining as synonymous with knowledge discovery,
while others view data mining as an essential step in the process of
knowledge discovery. Data Mining, also known as Knowledge Discovery in
Databases (KDD), is the nontrivial extraction of implicit, previously
unknown, and potentially valuable

information from data recorded in databases. Here is the list of steps
involved in the knowledge discovery process:
An Outline of the steps in the KDD process

6.4.1 Data Cleaning: Data cleaning is the removal of noisy and irrelevant
data from the collection, together with the handling of missing values.
The following methods can be used to deal with missing values.

6.4.1.1 Missing Data


● Ignore the tuple: This is usually done when the class label is
missing (assuming the
mining task involves classification). This method is not very effective
unless the tuple
contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably. By
ignoring the tuple, we do not make use of the remaining attributes’ values
in the tuple. Such data could have been useful to the task at hand.
● Fill in the missing value manually: In general, this approach is
time-consuming and may not be feasible given a large data set with many
missing values.
● Use a global constant to fill in the missing value: Replace all
missing attribute values with the same constant such as a label like
“Unknown” or −∞. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an
interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
● Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value: For normal (symmetric)
data distributions, the mean can be used, while skewed data distribution
should employ the median.
113
Data Mining and Business ● Use the attribute mean or median for all samples belonging to
Intelligence the same class as the given tuple: For example, if classifying customers
according to credit risk, we may replace the missing value with the mean
income value for customers in the same credit risk category as that of the
given tuple. If the data distribution for a given class is skewed, the median
value is a better choice.
● Use the most probable value to fill in the missing value: This may
be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
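The fill-in strategies above can be illustrated with a few lines of pandas. The sketch below is only an illustration; the column names and values are made up and are not taken from the text.

```python
import pandas as pd
import numpy as np

# Hypothetical customer records with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, np.nan, 31000, np.nan, 48000],
})

# 1. Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 2. Fill with the overall mean (the median would suit skewed data)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Fill with the mean of the same class (here, the credit_risk group)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)
```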

6.4.1.2 Noisy Data:


Noise is a random error or variance in a measured variable. Following are
some data smoothing techniques available for identifying and rectifying
Noisy data.
● Binning Approach: This method is used to smooth sorted data. The
sorted data is separated into equal-sized segments (bins), and each
segment is treated independently. One can, for example, replace all
values in a segment with the segment mean or with the segment's
boundary values.
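A minimal plain-Python sketch of smoothing by bin means (equal-depth binning) is shown below; the price values are made up for illustration.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning: replace each value in a bin with the bin mean."""
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), depth):
        bin_vals = values[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))
    return smoothed

# Hypothetical sorted prices
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, depth=3))
# bins: [4, 8, 15] [21, 21, 24] [25, 28, 34] -> means 9, 22, 29
```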

6.4.1.3 Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
6.4.1.4 Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be considered
outliers.
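One simple way to operationalise this idea is to cluster the values with k-means and flag the points that lie unusually far from their nearest cluster centre. The sketch below uses scikit-learn on made-up one-dimensional values; the 3-times-median threshold is an arbitrary illustrative choice, not a standard rule.

```python
import numpy as np
from sklearn.cluster import KMeans

# One-dimensional values forming two natural groups plus one stray point (95)
values = np.array([21, 22, 23, 24, 25, 61, 62, 63, 64, 65, 95]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = kmeans.transform(values).min(axis=1)   # distance to the nearest cluster centre

# Flag points lying much farther from their centre than is typical
threshold = 3 * np.median(dist)
print(values[dist > threshold].ravel())       # [95]
```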


6.4.2 Data Integration: Data integration is the process of combining


heterogeneous data from numerous sources into a single source (Data
warehouse).
● Data migration tools are used for data integration.
● Data synchronization tools are used for data integration.
● The ETL (Extract-Transform-Load) process is used for data
integration.
6.4.3 Data Selection: Data selection is described as the process of
deciding on and retrieving data relevant to the analysis from the data
gathering.
● Using a neural network to choose data.
● Decision Trees are used to pick data.
● The Naive Bayes method is used to choose data.
● Data selection techniques such as clustering, regression, and so on are
used.
6.4.4 Data Transformation: this is described as the process of changing
data into the proper form required by the mining technique. The
transformation of data is a two-step process:
● Data Mapping is the process of assigning items from a source base to
a destination base in order to capture transformations.
● Code generation is the process of creating the actual transformation
program.
6.4.5 Data Mining: Data mining is defined as the application of smart
techniques to discover potentially relevant patterns.
● Patterns are created by transforming task-relevant data.
● Determines the goal of the model through classification or
characterization.

6.4.6 Pattern Evaluation: This is described as the identification of
interesting patterns representing knowledge, based on given interestingness measures.
● Determine the pattern's interestingness score.
● Summarization and visualization are used to make data clear to the
user.
6.4.7 Knowledge representation is described as a strategy that uses
visualization tools to depict data mining findings.
● Create reports.
● Create tables.
● Create discriminant rules, classification rules, characterization rules,
and other rules.

6.5 DATA REDUCTION:


Data reduction techniques can be applied to obtain a reduced
representation of the dataset that is much smaller in volume, yet closely
maintains the integrity of the original data. That is, mining on the reduced
data set should be more efficient yet produce the same (or almost the
same) analytical results.

Strategies for data reduction include the following:


● Data cube aggregation, where aggregation operations are applied to
the data in the construction of a data cube. Data cubes are
multidimensional aggregated information. Each cell holds an aggregate
data value, corresponding to the data point in multidimensional space.
o The cube created at the lowest level of abstraction is referred to as the
base cuboid; it holds the aggregated data for an individual entity of
interest (e.g., a customer in a phone-calling data warehouse).
o The cube created at the highest level of abstraction is the apex cuboid.

Multiple levels of aggregation in data cubes


o Further, queries regarding aggregated information should be answered
using a data cube whenever possible, since this reduces the amount of
data that must be processed.
o Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining.
● Attribute subset selection, where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.

● Dimensionality reduction, where encoding mechanisms are used to
reduce the dataset size. Dimensionality refers to the number of input
characteristics, variables, or columns contained in a particular dataset,
while dimensionality reduction refers to the process of reducing these
features. In certain circumstances, a dataset comprises a large number of
input characteristics, which complicates the predictive modeling work.
Because it is difficult to see or forecast for a training dataset with a large
number of characteristics, dimensionality reduction techniques must be
used. The dimensionality reduction approach is defined as "a method of
transforming greater dimensions datasets into smaller dimensions datasets
while guaranteeing that they give identical information." These strategies
are commonly used in machine learning to obtain a better fit prediction
model when tackling classification and regression problems. It is
frequently utilized in high-dimensional data domains such as voice
recognition, signal processing, bioinformatics, and so on. It may also be
used for data visualization, noise reduction, cluster analysis, and other
similar tasks.

The Advantages of Dimensionality Reduction


The following are some advantages of using the dimensionality reduction
approach on the supplied dataset:
● The space required to store the dataset is lowered when the dimensions
of the features are reduced.
● Reduced feature dimensions need less Computation training time.
● Reduced dimension aspects of the dataset aid in quickly displaying the
data.
● It eliminates superfluous features (if present) while accounting for
multicollinearity.
Disadvantages of Dimensionality Reduction
There are various drawbacks to using dimensionality reduction, which are
listed below:
● Due to dimensionality reduction, some data may be lost.
● The number of principal components to retain in the PCA
dimensionality reduction approach is sometimes not known in advance.
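As an illustration of the ideas above, the following sketch applies scikit-learn's PCA to a small synthetic dataset and lets the library choose how many principal components are needed to retain 95% of the variance (one common way of dealing with the second drawback). This is an illustrative sketch, not part of the original text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 10 correlated features driven by 3 hidden factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(100, 10))

# Keep as many principal components as needed to retain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape)            # (100, 10)
print("reduced shape :", X_reduced.shape)    # roughly (100, 3)
print("variance kept :", round(pca.explained_variance_ratio_.sum(), 3))
```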

● Data Compression
Using various encoding algorithms, data compression minimizes the size
of files (Huffman Encoding & run-length Encoding). Based on its
compression methodologies, we may categorize it into two groups.
Lossless compression - Encoding techniques (such as run-length
encoding) provide simple, modest reductions in data size. Lossless data
compression uses techniques that reconstruct the exact original data from
the compressed data.
Lossy compression methods include the Discrete Wavelet transform
algorithm and PCA (principal component analysis). JPEG picture format,
for example, is a lossy compression, yet we may find the meaning
comparable to the original image. The decompressed data in lossy-data
compression may differ from the original data, but they are still usable for
retrieving information.
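The lossless idea can be made concrete with run-length encoding, the technique named above. The sketch below is a minimal illustration; real compression systems use far more sophisticated encoders.

```python
def rle_encode(data):
    """Run-length encode a string: 'AAAABBB' -> [('A', 4), ('B', 3)]."""
    encoded = []
    for ch in data:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)
        else:
            encoded.append((ch, 1))
    return encoded

def rle_decode(encoded):
    """Reverse the encoding, recovering the exact original string (lossless)."""
    return "".join(ch * count for ch, count in encoded)

original = "AAAABBBCCDAA"
packed = rle_encode(original)
print(packed)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(packed) == original  # lossless: exact reconstruction
```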
● Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need to store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of
histograms.
● Discretization and concept hierarchy generation, where raw data
values for attributes are replaced by ranges or higher conceptual levels.
Data discretization is a form of numerosity reduction that is very useful for
the automatic generation of concept hierarchies. Discretization and
concept hierarchy generation are powerful tools for data mining, in that
they allow the mining of data at multiple levels of abstraction
Top-down discretization — The process starts by choosing one or a few
points (so-called breakpoints or split points) to divide the entire range of
attribute values, and then repeats this splitting recursively on the resulting
intervals until a stopping condition is met. This procedure is known as splitting.
Bottom-up discretization — The process starts by considering all of the
continuous values as potential split points; some are then removed by
merging neighbouring values into intervals. This is known as bottom-up
discretization (merging).
● Concept Hierarchies: These reduce data size by collecting and then
replacing low-level concepts (for example, a numeric value such as 43 for
age) with higher-level concepts (categorical values such as middle-aged or
senior). For numerical data, the following strategies can be used:
o Binning - The process of converting numerical variables into
categorical equivalents is known as binning. The number of category
equivalents is determined by the user's selection of bins.
o Histogram analysis - The histogram, like the binning procedure, is
used to partition the values of an attribute X into disjoint ranges called
buckets. There are a number of partitioning rules:
o Equal-frequency partitioning: the values are partitioned so that each
bucket contains roughly the same number of data points.
o Equal-width partitioning: the range of values is partitioned into
intervals of equal width.
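A short pandas sketch of equal-width partitioning, equal-frequency partitioning, and a simple age concept hierarchy is given below; the ages and bin edges are hypothetical choices made for illustration.

```python
import pandas as pd

ages = pd.Series([23, 25, 31, 36, 43, 47, 52, 58, 61, 66])

# Equal-width partitioning: three bins spanning equal ranges of values
equal_width = pd.cut(ages, bins=3)

# Equal-frequency partitioning: three bins with (roughly) equal counts
equal_freq = pd.qcut(ages, q=3)

# Concept hierarchy: replace raw ages with higher-level categories
concept = pd.cut(ages, bins=[0, 35, 55, 120],
                 labels=["young", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "concept": concept}))
```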

6.6 LET US SUM UP


● Data Mining is finding the hidden information from the database.
● Data Mining is one of the steps in the KDD process.
● Data mining tasks are categorized according to the nature of the data
being processed.
● When dealing with the techniques to be implemented with the
applications it should compute the different issues of Data Mining.
● Data quality is characterized as correctness, completeness,
consistency, timeliness, believability, and interpretability. These
characteristics are evaluated depending on the data's intended purpose.
● Data cleaning processes aim to fill in missing numbers, smooth out
noise while detecting outliers, and fix anomalies in the data. Data
cleaning is often accomplished as a two-step iterative process
consisting of discrepancy identification and data modification.
● Data integration is the process of combining data from different
sources to create a cohesive data repository. The resolution of
semantic heterogeneity, metadata, correlation analysis, tuple
duplication detection, and data conflict detection all contribute to
seamless data integration.
● Despite the development of several data preprocessing methods, data
preprocessing remains an active topic of study due to the large volume
of inconsistent or dirty data and the complexity of the problem.
● Data reduction is a procedure that takes a large amount of original data
and compresses it into a much smaller volume. Data reduction
techniques are used to generate a reduced version of the dataset that is
significantly lower in volume while preserving the original data's
integrity.

6.7 REFERENCES
1. R. Agrawal, T. Imielinski, and A. Swami (1993). "Mining
associations between sets of items in massive databases," in
Proceedings of the 1993 ACM-SIGMOD International Conference
on Management of Data (pp. 207–216), New York: ACM Press.
2. M. J. A. Berry, and G. S. Linoff (1997). Data Mining Techniques.
New York: Wiley.
3. M. J. A. Berry, and G. S. Linoff (2000). Mastering Data Mining.
New York: Wiley.
4. L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984).
Classification and Regression Trees. Boca Raton, FL: Chapman &
Hall/CRC (orig. published by Wadsworth).
5. C. Chatfield (2003). The Analysis of Time Series: An Introduction,
6th ed. Chapman & Hall/CRC.
6. R. Delmaster, and M. Hancock (2001). Data Mining Explained.
Boston: Digital Press.
7. S. Few (2004). Show Me the Numbers. Oakland, CA, Analytics
Press.
8. J. Han, and M. Kamber (2001). Data Mining: Concepts and
Techniques. San Diego, CA: Academic.
9. D. Hand, H. Mannila and P. Smyth (2001). Principles of Data
Mining. Cambridge, MA: MIT Press.
10. T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of
Statistical Learning. 2nd ed. New York: Springer.
11. D. W. Hosmer, and S. Lemeshow (2000). Applied Logistic
Regression, 2nd ed. New York: Wiley-Interscience.
12. W. Jank, and I. Yahav (2010). E-Loyalty Networks in Online
Auctions. Annals of Applied Statistics, forthcoming.
13. W. Johnson, and D. Wichern (2002). Applied Multivariate Statistics.
Upper Saddle River, NJ: Prentice Hall.

6.8 BIBLIOGRAPHY
1. Bezdek, J. C., & Pal, S. K. (1992). Fuzzy models for pattern
recognition: Methods that search for structures in data. New York:
IEEE Press
2. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R.
(Eds.). (1996). Advances in knowledge discovery and data mining.
AAAI/MIT Press.

3. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques:
Morgan Kaufmann.
4. Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of
statistical learning: Data mining, inference, and prediction: New York:
Springer.
5. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data.
New Jersey: Prentice Hall.
6. Jensen, F. V. (1996). An introduction to bayesian networks. London:
University College London Press.
7. Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An
introduction to cluster analysis. New York: John Wiley.
8. Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine
learning, neural and statistical classification: Ellis Horwood.

6.9 UNIT END EXERCISES


1. Data quality can be assessed in terms of several issues, including
accuracy, completeness, and consistency. For each of the above three
issues, discuss how the assessment of data quality can depend on the
intended use of the data, giving examples. Propose two other
dimensions of data quality.
2. Describe why concept hierarchies are useful in data mining.
3. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.
4. Given the following data (in increasing order) for the attribute age:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35,
35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin
depth of 3. Illustrate your steps. Comment on the effect of this
technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
5. Discuss issues to consider during data integration.




Module V
Associations and Correlation

7
ASSOCIATION RULE MINING
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Steps in Association Rule Mining
7.3 Market Basket Analysis
7.3.1 What is an Itemset?
7.3.2 What is a Frequent Itemset?
7.3.3 Association rules
7.3.4 Why Frequent Itemset mining?
7.4 Frequent Pattern Algorithms
7.4.1 Apriori algorithm
7.4.2 Steps in Apriori algorithm
7.4.3 Example of Apriori
7.4.4 Advantages and Disadvantages
7.4.5 Methods to improve Apriori Efficiency
7.4.6 Applications of Apriori algorithm.
7.5 Incremental Association Rule Mining
7.5.1 Classification rule mining
7.6 Let us sum up
7.7 List of References
7.8 Bibliography
7.9 Unit End Exercises

7.0 OBJECTIVES

After going through this unit, you will be able to:


• Explain what and why we need Data Mining.
• Understand Association Rule, its needs and Applications
• Differentiate between large and small data sets.
• Describe the different techniques and algorithms used in association rule mining.
• Understand the difference between Data Mining and KDD in
Databases.

7.1 INTRODUCTION
The data mining technique of discovering the rules that govern
relationships and causal structures between collections of items is known as
association rule mining. So, in a particular transaction involving many
items, it attempts to identify the principles that govern how or why such
items are frequently purchased together. Association rule mining is a well-
known and well-studied approach for uncovering interesting relationships
between variables in huge datasets. Its goal is to find strong rules in
databases using various metrics of interestingness.
The goal of ARM is to discover association rules, frequent patterns,
subsequences, or correlation links among a huge collection of data items
that meet the established minimal support and confidence from a given
database.
Association rule learning is a form of unsupervised learning approach that
detects the dependency of one data item on another and maps
appropriately to make it more lucrative. It attempts to discover some
interesting relationships or links among the variables in the dataset. It uses
several criteria to uncover interesting relationships between variables in a
database.
One of the most significant topics in machine learning is association rule
learning, which is used in Market Basket analysis, Web usage mining,
continuous manufacturing, and other applications. Market basket analysis
is a technique used by many large retailers to find the relationships
between commodities. We may explain it by using a supermarket as an
example because, at a supermarket, all things that are purchased together
are placed together. For example, if a consumer purchases bread, he is
likely to also purchase butter, eggs, or milk, therefore these items are
displayed on a shelf or in close proximity.

7.2 STEPS IN ASSOCIATION RULE MINING


Association rule mining can be described in a two-step process explained
below.
Step 1: Find all frequent Itemsets
● An itemset is a collection of items found in a shopping basket. It can
include any number of products. For example, [bread, butter, eggs] is an
itemset in the database.
● A frequent itemset is one that appears frequently in a database. This
raises the issue of how frequency is defined. This is where the support
count comes into play.
● The frequency of an item in the dataset is used to calculate its support
count.

Itemsets and their respective support counts.


● The support count can only speak for the frequency of an Itemset. It
does not take into account relative frequency i.e., the frequency with
respect to the number of transactions. This is called the support of an
Itemset.
● Support of an itemset is the frequency of the itemset with respect to
the number of transactions.

Itemsets and their Support values
Consider the [Bread] itemset, which has 80% support. This means that
bread appears in 80 out of every 100 transactions.
Defining support as a percentage allows us to establish a frequency
threshold called minimum support (min_sup). If we set the support
threshold to 50%, we define a frequent itemset as one that appears in at
least 50 out of every 100 transactions.
For example, in the preceding dataset, suppose we set the support threshold to 60%.
Minimum support count = 60% of (total number of transactions) = 0.6 * 5 = 3
For an Itemset to be frequent, it should therefore occur in at least 3 of the
5 transactions in the given dataset.

Step 2: Generate strong association rules from the frequent Itemsets.


● Association rules are generated by building associations from
frequent Itemsets generated in step 1.
● This uses a measure called confidence to find strong associations.

7.3 MARKET BASKET ANALYSIS

Fig: 1 Market basket analysis


Market basket analysis is a common example of frequent itemset mining.
This approach examines consumer purchasing behaviors by identifying
correlations between the various things that customers place in their
shopping carts (Figure 1). The finding of such relationships might assist
merchants in developing marketing strategies by providing information on
which things are commonly purchased.
For example, if customers buy milk, how likely are they to also buy bread
(and what kind of bread) on the same trip to the grocery store? Such
information can lead to increased sales by helping retailers carry out
selective marketing and plan their shelf space.
If we consider the universe to be the set of items offered at the shop, then
each item has a Boolean variable that represents its presence or absence. A
Boolean vector of values set to these variables may subsequently be used
to represent each basket. The Boolean vectors may be used to identify
purchasing patterns that represent products that are commonly related or
purchased together. Association rules can be used to express these
patterns. For example, in Association Rule (6.1), the knowledge that
consumers who buy computers also tend to buy antivirus software at the
same time is represented:
Computer => antivirus software [support = 2%; confidence = 60%] (6.1)
Rule support and confidence are two measures of rule interest. They
indicate the utility and certainty of discovered rules, respectively. A
support of 2% for Association Rule (6.1) indicates that computers and
antivirus software are purchased together in 2% of all the transactions
under analysis. A confidence of 60% indicates that 60% of the consumers
who purchased a computer also purchased the software.
Association rules are often regarded as intriguing if they meet both
minimal support and a minimum confidence level. Users or domain
experts can specify such thresholds.
Additional analysis can be carried out to discover intriguing statistical
relationships between related objects.

7.3.1 What is an Itemset


An itemset is a collection of one or more items; a k-itemset is an itemset
that contains k items. A frequent itemset is an itemset that occurs
frequently in the data. As a result, frequent itemset mining is a data
mining approach for identifying items that frequently appear together.
For instance, bread and butter, a laptop and antivirus software, Milk and
bread and so on.

7.3.2 What is a Frequent Itemset?


Frequent patterns are items that are discovered to be most prevalent in
data. There are several types of frequency that may be found in the
dataset. A combination of items is considered frequent if it meets a
minimum threshold of support and confidence. Support counts the
transactions in which the items were purchased together in a single
transaction. Confidence counts the transactions in which the items are
purchased one after the other.
We evaluate just those transactions that match the minimal threshold
support and confidence conditions for the frequent itemset mining
approach. Insights from these mining algorithms provide several benefits,
including cost savings and an increased competitive edge.
For frequent mining, there is a tradeoff between the time it takes to mine
data and the volume of data. A frequent mining method is an effective
approach for mining latent patterns in Itemsets in a short amount of time
with little memory use.

7.3.3 Frequent pattern Mining and Association Rules.


One of the most significant data mining approaches for discovering links
between distinct elements in a collection is the frequent pattern mining
algorithm. Association rules are used to express these relationships. It aids
in the detection of data anomalies.
FPM has several applications in data analysis, software defects, cross-
marketing, sale campaign analysis, market basket analysis, and so on.
Apriori-discovered frequent Itemsets offer a wide range of applications in
data mining activities. The most significant of them are tasks like
detecting interesting patterns in a database, determining sequence, and
mining association rules.
Association rules are applied to supermarket transaction data in order to
investigate customer behavior in terms of the purchased items. The
association rules describe how frequently the items are purchased together.

Association Rules
Table 5.1 Association Rule Notation

Term        Description
D           Database
ti          Transaction in D
s           Support
α           Confidence
X, Y        Itemsets
X => Y      Association Rule
L           Set of Large Itemsets
C           Set of Candidate Itemsets
P           Number of partitions

When all large Itemsets are found, generating the association rules is
straightforward.
The term "Association Rule Mining" refers to the following:
"Assume I=... is a collection of 'n' binary attributes known as items. Let
D=.... be a set of transactions referred to as a database. Each transaction in
127
Data Mining and Business D has a distinct transaction ID and includes a subset of the items in I. A
Intelligence rule is defined as a logical implication of the type X->Y, where X, Y? I,
and X?Y=? The sets of objects X and Y are referred to as the rule's
antecedent and consequent."
Association rule learning is used to discover associations between
attributes in huge datasets. An A=>B association rule will be of the form
"given a set of transactions, some value of itemset A determines the values
of itemset B under the condition that minimal support and confidence are
present."

Support and Confidence can be represented by the following example:


Bread => Butter [support = 2%, confidence = 60%]
The above statement is an example of an association rule. It means that
2% of the transactions contain bread and butter together, and 60% of the
customers who bought bread also bought butter.
Support and Confidence for Itemsets A and B are given by the following
formulas:

Support (A) = (Number of transactions containing A) / (Total number of transactions)

Confidence (A => B) = Support (A ∪ B) / Support (A)
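These two formulas translate directly into code. The sketch below is a minimal illustration on a handful of made-up transactions; the function names and data are chosen for illustration only and are not a library API.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A => B."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
print(support({"bread", "butter"}, transactions))      # 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```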

7.3.4 Why frequent Itemset Mining?


The process of generating association rules from a Transactional Dataset is
known as frequent mining. If two things, X and Y, are regularly
purchased, it is a good idea to stock them together in stores or to provide a
discount on one item with the purchase of another. This has the potential
to significantly enhance sales. For example, if a consumer purchases milk
and bread, it is likely that he or she also purchases butter.
As a result, the association rule is {milk, bread} => {butter}, and the
merchant may recommend butter to a consumer who purchases milk and bread.

7.4 FREQUENT PATTERN ALGORITHMS


Frequent patterns are patterns (such as Itemsets, subsequences, or
substructures) that occur frequently in a data collection. A frequent itemset
is a group of items, such as milk and bread, that appear frequently together
in a transaction data set. A subsequence, such as purchasing a PC first,
then a digital camera, and then a memory card, is a (often) sequential
pattern if it happens frequently in a shopping history database. A
substructure can refer to a variety of structural forms such as subgraphs,
subtrees, or sublattices that can be coupled with Itemsets or subsequences.
There are two categories of frequent pattern mining algorithms, namely
Apriori-based algorithms and tree-structure-based algorithms. The Apriori-based
algorithm uses a generate-and-test strategy to find frequent patterns
by constructing candidate itemsets and checking their counts (frequencies)
against the transactional database.

7.4.1 APRIORI Algorithm:


R. Agrawal and R. Srikant presented the Apriori technique in 1994 for
identifying common Itemsets in a dataset for Boolean association rules.
The approach is called Apriori because it makes use of prior
knowledge of frequent itemset properties. We employ an iterative
technique or level-wise search to identify k+1 Itemsets by using k-frequent
Itemsets. This algorithm uses two steps “join” and “prune” to reduce the
search space. It is an iterative approach to discover the most frequent
Itemsets.

Apriori says:
An itemset I is not frequent if:
• P(I) < minimum support threshold; then I is not frequent.
• P(I + A) < minimum support threshold; then I + A is not frequent, where A
also belongs to the itemset.
• If an itemset has a support value less than the minimum support, then all of its
supersets will also fall below the minimum support and can thus be ignored. This
property is called the antimonotone property.
Apriori Property:
1. It makes use of “Upward Closure property” (Any superset of
infrequent itemset is also an infrequent set). It follows Bottom-up
search, moving upward level-wise in the lattice.
2. It makes use of the “downward closure property” (any subset of a
frequent itemset is a frequent itemset).
3. The support of an itemset never exceeds the support of its subsets;
this is known as the antimonotone property of support.
This essential property, called the Apriori property, is utilized to increase the
efficiency of the level-wise generation of frequent itemsets by reducing the search space.

The Apriori Algorithm Pseudocode

Ck: Candidate itemsets of size k
Lk: Frequent itemsets of size k

● Join step: Ck is generated by joining Lk-1 with itself.
● Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset.

Pseudocode:
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
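The pseudocode can be turned into a short runnable sketch. The implementation below is a plain, unoptimized level-wise version written only for illustration; it treats min_support as an absolute count and reproduces the result of the worked example that follows in Section 7.4.3.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: returns {frequent_itemset: support_count}."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        prev = list(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent

# Transactions from the worked example in Section 7.4.3
T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
     {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
result = apriori(T, min_support=3)
for itemset, n in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)   # ends with ['I1', 'I2', 'I3'] 3
```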

7.4.2 Steps in APRIORI:


The Apriori algorithm is a sequence of steps to determine the most
frequent itemsets in a given database. This data mining approach
iteratively applies the join and prune steps until the largest frequent
itemset is obtained. A minimum support threshold is specified in the
problem or assumed by the user.
● In the first iteration of the algorithm, each item is treated as a
candidate for a 1-itemset. Each item's occurrences will be counted by
the algorithm.
● Let there be some minimum support, min_sup ( eg 2). The set of 1 –
Itemsets whose occurrence meets the min sup condition is determined.
Only candidates with a score greater than or equal to min sup are
considered for the following iteration, while the rest are pruned.
● Next, 2-itemset frequent items with min_sup are discovered. For this
in the join step, the 2-itemset is generated by forming a group of 2 by
combining items with itself.
● The 2-itemset candidates are pruned using min-sup threshold value.
Now the table will have 2 –Itemsets with min-sup only.
● The next iteration will form 3 –Itemsets using join and prune step.
This iteration will follow antimonotone property where the subsets of
3-itemsets, that is the 2 –itemset subsets of each group fall in min_sup.
If all 2-itemset subsets are frequent then the superset will be frequent
otherwise it is pruned.
● Next step will follow making 4-itemset by joining 3-itemset with itself
and pruning if its subset does not meet the min_sup criteria. The
algorithm is stopped when the most frequent itemset is achieved.


7.4.3 Example of APRIORI


Support threshold=50%, Confidence= 60%

Table – 1

Transaction List of Items


T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1.I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4

Solution:
Support threshold = 50%, so the minimum support count = 0.5 * 6 = 3, i.e., min_sup = 3

1. Count of each item


Table - 2

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2

2. Prune Step: Table 2 shows that item I5 does not meet min_sup = 3,
thus it is deleted; only I1, I2, I3, I4 meet the min_sup count.
Table -3

Item Count

I1 4

I2 5

I3 4

I4 4

3. Join Step: Form 2-itemsets. From Table 1, find out the occurrences of
each 2-itemset.
Table – 4

Item Count

I1, I2 4

I1, I3 3

I1,I4 2

I2,I3 4

I2,I4 3

I3,I4 2

4. Prune Step: Table 4 shows that the itemsets {I1, I4} and {I3, I4}
do not meet min_sup, thus they are deleted.
Table – 5

Item Count

I1,I2 4

I1,I3 3

I2,I3 4

I2,I4 3

5. Join and Prune Step: Form 3-itemsets. From Table 1, find out the
occurrences of each 3-itemset. From Table 5, find out the 2-itemset subsets
that satisfy min_sup. For the itemset {I1, I2, I3}, the subsets {I1, I2},
{I1, I3}, {I2, I3} all occur in Table 5, thus {I1, I2, I3} is frequent.
For the itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, {I2, I4};
{I1, I4} is not frequent, as it does not occur in Table 5, thus {I1, I2, I4}
is not frequent and hence is deleted.

Table – 6

Item

I1,I2, I3

I1,I2,I4

I1,I3,I4

I2,I3,I4

Only {I1, I2, I3} is frequent.


6. Generate Association Rules: From the frequent itemset
discovered above, the association rules could be:
• {I1, I2} => {I3}
• Confidence = support {I1, I2, I3} / support {I1, I2} = (3/ 4)* 100 = 75%
• {I1, I3} => {I2}
• Confidence = support {I1, I2, I3} / support {I1, I3} = (3/ 3)* 100 =
100%
• {I2, I3} => {I1}
• Confidence = support {I1, I2, I3} / support {I2, I3} = (3/ 4)* 100 = 75%
• {I1} => {I2, I3}
• Confidence = support {I1, I2, I3} / support {I1} = (3/ 4)* 100 = 75%
• {I2} => {I1, I3}
• Confidence = support {I1, I2, I3} / support {I2} = (3/ 5)* 100 = 60%
• {I3} => {I1, I2}
• Confidence = support {I1, I2, I3} / support {I3} = (3/ 4)* 100 = 75%

• This shows that all the above association rules are strong if the
minimum confidence threshold is 60%.
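These confidence values can be double-checked with a few lines of Python that simply recount the itemsets in the six transactions of Table 1. This is an illustrative check, not part of the original solution.

```python
T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
     {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in T if set(itemset) <= t)

rules = [({"I1", "I2"}, {"I3"}), ({"I1", "I3"}, {"I2"}), ({"I2", "I3"}, {"I1"}),
         ({"I1"}, {"I2", "I3"}), ({"I2"}, {"I1", "I3"}), ({"I3"}, {"I1", "I2"})]

for A, B in rules:
    conf = count(A | B) / count(A) * 100
    print(f"{sorted(A)} => {sorted(B)}: confidence = {conf:.0f}%")
# prints 75%, 100%, 75%, 75%, 60%, 75% - matching the calculations above
```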
7.4.4 Advantages and Disadvantages of APRIORI:
Advantages of APRIORI:
● Easy to understand algorithm
● Join and Prune steps are easy to implement on large Itemsets in large
databases

Disadvantages of APRIORI:
● It requires high computation if the Itemsets are very large and the
minimum support is kept very low.
● The entire database needs to be scanned.

7.4.5 Methods to improve Apriori Efficiency


The following are some Apriori algorithm variants that have been
proposed in order to improve the efficiency of the original method:

The Hash- Based Technique


The hash-based approach involves hashing itemsets into the appropriate
buckets of a hash table. For k > 1, a hash-based method can be utilized to
reduce the size of the candidate k-Itemsets, Ck. For example, when
scanning each transaction in the database to generate the frequent 1-
itemsets, L1, from the candidate 1-itemsets in C1, it can generate some 2-
itemsets for each transaction, hash (i.e., map) them into the various
buckets of a hash table structure, and increase the equivalent bucket
counts.

Transaction Reduction
A transaction that does not include certain frequent k-Itemsets cannot also
include some frequent (k + 1)-Itemsets. As a result, such a transaction can
be flagged or removed from further consideration since further database
searches for j-Itemsets with j > k will not require it.

Partitioning
A partitioning strategy that required two database scans to mine the
frequent Itemsets can be utilized. It is divided into two stages. The method
separates D's transactions into n non-overlapping segments in Phase I. If
the minimum support threshold for transactions in D is min sup, then the
minimum support count for a partition is the number of transactions in that
partition multiplied by min sup.
All frequent Itemsets inside a partition are found for each partition. These
are known as local frequently occurring Itemsets. The procedure utilizes a
special data structure that stores the TIDs of the transactions that include
the items in the itemset for each itemset. This allows it to locate all of the
local frequent k-Itemsets for k = 1, 2... in a single database search.
A local frequent itemset may or may not be frequent with respect to the
entire database, D. However, any itemset that is potentially frequent with
respect to D must occur as a frequent itemset in at least one of the
partitions. As a result, all local frequent itemsets are candidate itemsets
with respect to D. The set of frequent itemsets from all partitions forms
the global candidate itemsets for D. In Phase II, a second scan of D is
conducted, in which the actual support of each candidate is counted to
determine the global frequent itemsets.

Sampling
The sampling approach's basic premise is to take a random sample S of the
provided data D and then look for frequent Itemsets in S rather than D. It
is possible to trade off some degree of accuracy for efficiency with this
strategy. Because the sample size of S is such that the search for frequent
Itemsets in S may be performed in main memory, just one scan of the
transactions in S is required overall.

7.4.6 Applications of the Apriori algorithm


Some fields where Apriori is used:
● In Education Field: Extracting association rules in data mining of
admitted students through characteristics and specialties.
● In the Medical field: For example, Analysis of the patient’s database.
● In Forestry: Analysis of probability and intensity of forest fire with
the forest fire data.
● Apriori is used by many companies like Amazon in the Recommender
System and by Google for the auto-complete feature.

7.5 INCREMENTAL ASSOCIATION RULE MINING


Incremental association rule mining maintains discovered association rules
when the underlying database changes, for example by using a
promising-frequent-itemset approach. Association rule discovery is an important
area of data mining, and in dynamic databases new transactions are appended
as time advances. By reusing previously discovered (promising) frequent
itemsets when new transactions arrive, an incremental algorithm can reduce
the number of times the original database has to be scanned.

7.5.1 Associative Classification: Classification by Rule Analysis


Frequent patterns and their associated association or correlation rules
indicate noteworthy correlations between attribute conditions and class
labels and have therefore lately been employed for successful
categorization. Association rules reveal significant relationships between
attribute-value pairs (or items) that appear often in a particular data
collection. In a store, association rules are widely used to study the
purchase habits of customers. This type of analysis is beneficial in a
variety of decision-making situations; product placement, catalogue
design, and cross-marketing are examples of such processes.

The discovery of association rules is based on frequent itemset mining.
Earlier in this chapter, many strategies for frequent itemset mining and the
development of association rules were given. In this section, we will look
into associative classification, which is the process of generating and
analyzing association rules for use in classification. The main approach is
to look for substantial correlations between common patterns
(conjunctions of attribute-value pairs) and class labels. Because
association rules investigate extremely confident links among several
characteristics, this technique may be able to circumvent some of the
limits imposed by decision-tree induction, which investigates only one
attribute at a time. In many studies, associative classification has been
found to be more accurate than some traditional classification methods,
such as C4.5. In particular, we study three main methods: CBA, CMAR,
and CPAR.

CBA: Classification- based Association


The CBA algorithm combines association rule mining and classification.
Its procedure is separated into two phases. First, Class Association Rules
(CARs) are produced using the well-known Apriori technique. Second, the
CARs are sorted and pruned in order to identify efficient CARs for use in
a classifier.
The CBA algorithm has been shown to produce fewer errors than C4.5.
Unfortunately, owing to its Apriori inheritance, which identifies all
potentially frequent rules at each level, the CBA algorithm faces a huge
candidate generation problem. When the samples become larger and larger
and the characteristic attributes become more and more numerous, the
CBA algorithm becomes much slower.

CMAR: Classification based on Multiple Association Rules


CMAR is an acronym that stands for classification based on multiple
association rules. CMAR uses numerous strong association rules to do
weight analysis. The weight analysis is carried out using the chi-square
(χ2) approach. This defines the association rule's strength in terms of both
support and class distribution. CMAR similarly operates in two parts,
with the first being rule creation and the second being class distribution.
The Classification Rule Tree (CR-Tree) data structure is used to increase
efficiency and accuracy. The new data structure is an extension of
Frequent Pattern Growth (FP-tree), and it stores and retrieves a large
number of categorization rules in a compact manner. CMAR employs a
form of the FP-growth algorithm, which is quicker than apriori-like
approaches, to speed up the classification process.
CMAR, as opposed to CBA, uses a Frequent pattern tree (FP-tree) and a
Cosine R-tree (CR-tree) for the rule generation and classification phases.
It divides the subset in the FP-tree to search for frequent rule items, and
then adds the frequent rule items to the CR-tree based on their frequencies.
As a result, CMAR only has to search the database once. Based on the chi-

square approach, the CMAR algorithm employs numerous criteria to
forecast previously unseen cases.

CPAR: Classification-based on Predicted Association Rules


CPAR is an acronym that stands for classification based on predicted
association rules. To avoid overfitting, CPAR evaluates each rule using
the anticipated accuracy metric. In rule generation, CPAR employs the
core concept of the First Order Inductive Learner (FOIL) algorithm. It
finds the ideal rule condition that will result in the greatest benefit from
the supplied dataset. Once the condition has been discovered, the weights
of positive instances linked with it will be lowered by a factor.
Misclassification penalty is a typical mistake that arises throughout the
classifier construction process. This mistake can be reduced by employing
modified CPAR (M-CPAR). CPAR combines the benefits of
associative and classic rule-based categorization. CPAR uses a greedy
algorithm to derive rules directly from training data.

Comparison of Associative Classification methods

Method: CBA
Advantages:
● Simple algorithm that finds valuable rules.
● Capable of handling data in table form as well as in transaction form.
● Does not require the whole dataset to be fetched into main memory.
Disadvantages:
● Training the dataset often generates a huge set of rules, leading to redundancy.
● Classification is based on a parameter called confidence, which can sometimes be biased.

Method: CMAR
Advantages:
● Finds frequent patterns and generates association rules in one step.
● Since the CR-tree data structure is used, both accuracy and efficiency are improved.
● By the pruning process, CMAR selects only high-quality rules for classification.
● CMAR is superior to C4.5 and CBA in terms of accuracy, and it is also scalable.
Disadvantages:
● Since it searches for only high-quality rules, it is slower.
● Performing weighted analysis adds substantial computational load to the algorithm.

Method: CPAR
Advantages:
● Evaluates each rule with an expected accuracy measure, which helps avoid overfitting.
● Derives rules directly from the training data, combining the benefits of associative and classic rule-based classification.
Disadvantages:
● CPAR is more complex to understand as well as to implement.
● Usage of a greedy algorithm to train the dataset adds additional computational overhead to the algorithm.

7.6 LET US SUM UP
● Association Rules are used to show the relationship between data
items and are used frequently by retail stores to assist in marketing,
advertisement, inventory control etc.
● The selection of Association Rule depends on support and
confidence. Apriori algorithm, sampling algorithm are some of the
basic algorithms used in Association Rules.
● Apriori algorithm is an efficient algorithm that scans the database
only once.
● It reduces the size of the Itemsets in the database considerably
providing a good performance. Thus, data mining helps consumers
and industries better in the decision-making process.
● Frequent Itemsets discovered through Apriori have many
applications in data mining tasks. Tasks such as finding interesting
patterns in the database, finding out sequence and Mining of
association rules is the most important of them.

7.7 LIST OF REFERENCES
1. Su Z, Song W, Cao D, Li J. Discovering informative association rules
for associative classification. IEEE International Symposium on
Knowledge Acquisition and Modeling Workshop; Wuhan. 2008. p.
1060–3.
2. Ye Y, Li T, Jiang Q, Wang Y. CIMDS: Adapting postprocessing
techniques of associative classification for malware detection. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews. 2010; 40(3):298–307.
3. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS:
Ordering points to identify the clustering structure. In Proc. 1999
ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’99), pages
49–60, Philadelphia, PA, June 1999.
4. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic
subspace clustering of high dimensional data for data mining

applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD’98), pages 94–105, Seattle, WA, June 1998.
5. C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining:
Models and Algorithms. Springer, 2008.

7.8 BIBLIOGRAPHY
1. Bezdek, J. C., & Pal, S. K. (1992). Fuzzy models for pattern
recognition: Methods that search for structures in data. New York:
IEEE Press
2. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R.
(Eds.). (1996). Advances in knowledge discovery and data mining.
AAAI/MIT Press.
3. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques:
Morgan Kaufmann.
4. Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of
statistical learning: Data mining, inference, and prediction: New York:
Springer.
5. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data.
New Jersey: Prentice Hall.
6. Jensen, F. V. (1996). An introduction to bayesian networks. London:
University College London Press.
7. Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An
introduction to cluster analysis. New York: John Wiley.
8. Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine
learning, neural and statistical classification: Ellis Horwood.
7.9 UNIT END EXERCISES
1. Explain Associations Rule Mining
2. Explain how to generate strong association rule mining
3. Explain the mining methods for association rule generation
4. What are the methods to improve Accuracy of Apriori?
5. Write short notes on Prediction.
6. Elaborate associative classification methods.
7. Explain support and confidence rule with examples.
8. Write short notes on frequent pattern mining.
9. Explain Market basket analysis with example.
10. Define Incremental rule mining.


Module VI

8
CLASSIFICATION AND PREDICTION

Unit Structure
8.0 Introduction
8.1 Decision Tree
8.2 CART
8.3 Bayesian classification
8.4 Linear and nonlinear regression.
8.5 References
8.6 MOOCs
8.7 Video Lectures
8.8 Quiz

8.0 INTRODUCTION
There are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
❖ Classification
❖ Prediction
Classification models predict categorical class labels; and prediction
models predict continuous valued functions.
Examples:
❖ A bank loan officer wants to analyze the data in order to know which
customer (loan applicant) are risky or which are safe.
❖ A marketing manager at a company needs to analyze a customer with a
given profile, who will buy a new computer.
The Data Classification process includes two steps −
❖ Building the Classifier or Model
❖ Using Classifier for Classification


Building the Classifier or Model


❖ Learning phase.
❖ Classification algorithms build the classifier.
❖ The classifier is built from the training set made up of database
tuples and their associated class labels.
❖ Each tuple that constitutes the training set is assumed to belong to a
predefined category or class. These tuples can also be referred to as
samples, examples, objects or data points.

Fig 1: Building the Classifier


Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is
used to estimate the accuracy of classification rules. The classification
rules can be applied to the new data tuples if the accuracy is considered
acceptable.

Fig 2: Classifier for Classification

Prediction examples:
Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company. This data analysis task
is an example of numeric prediction. In this case, a model or predictor
will be constructed that predicts a continuous-valued function or ordered value.
Classification and Prediction Issues
❖ Data Cleaning
❖ Relevance Analysis
❖ Data Transformation and reduction − The data can be transformed
by any of the following methods.
❖ Normalization
❖ Generalization
Comparison of Classification and Prediction Methods
❖ Accuracy
❖ Speed
❖ Robustness
❖ Scalability
❖ Interpretability
8.1 DECISION TREE
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents a
class.


Fig 3: Decision Tree
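A small scikit-learn sketch of the same idea is shown below. The training tuples are invented for illustration (the actual buys_computer data is not given in the text), and only two attributes are used; the point is simply to see a classifier being built from labelled tuples and then used to classify a new one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples for the buys_computer concept
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle", "youth", "senior"],
    "student": ["no",    "yes",   "no",     "no",     "yes",    "yes",    "no",    "yes"],
    "buys":    ["no",    "yes",   "yes",    "no",     "yes",    "yes",    "no",    "yes"],
})

X = pd.get_dummies(data[["age", "student"]])   # one-hot encode categorical attributes
y = data["buys"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # view the learned tests

# Classify a new, unseen tuple
new = pd.get_dummies(pd.DataFrame({"age": ["youth"], "student": ["yes"]}))
new = new.reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))   # e.g. ['yes']
```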


The benefits of having a decision tree are as follows −
❖ It does not require any domain knowledge.
❖ It is easy to comprehend.
❖ The learning and classification steps of a decision tree are simple
and fast.

Decision Tree Induction Algorithm


A researcher named J. Ross Quinlan developed a decision tree algorithm
known as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5,
which was the successor of ID3. ID3 and C4.5 adopt a greedy approach.
There is no backtracking; the trees are constructed in a top-down recursive
divide-and-conquer manner.

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training
data due to noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
❖ Pre-pruning
❖ Post-pruning

Cost Complexity
The cost complexity of a tree is measured in terms of:
❖ the number of leaves in the tree, and
❖ the error rate of the tree.

Decision Tree
❖ Classifies data using the attributes.
❖ The tree consists of decision nodes and leaf nodes.
❖ Decision nodes can have two or more branches, each representing a
value of the attribute being tested.
❖ Leaf nodes produce a homogeneous result (a class label).

The algorithm
❖ ID3 follows Occam's razor principle.
❖ It attempts to create the smallest possible decision tree.

The Process
❖ Take all unused attributes and calculate their entropies.
❖ Choose the attribute for which the entropy is minimum (equivalently,
for which the information gain is maximum).
❖ Make a node containing that attribute.
The Algorithm
❖ Create a root node for the tree
❖ If all examples are positive, Return the single-node tree Root, with
label = +.
❖ If all examples are negative, Return the single-node tree Root, with
label = -.
❖ If number of predicting attributes is empty, then Return the single
node tree Root, with label = most common value of the target
attribute in the examples.
❖ Else
– A = The Attribute that best classifies examples.
– Decision Tree attribute for Root = A.
– For each possible value, vi, of A,
❖ Add a new tree branch below Root, corresponding to the test A = vi.
❖ Let Examples(vi), be the subset of examples that have the value vi
for A
❖ If Examples(vi) is empty
– Then below this new branch add a leaf node with label = most common
target value in the examples
❖ Else below this new branch add the subtree ID3 (Examples(vi),
Target_Attribute, Attributes – {A})
❖ End
❖ Return Root
Entropy
❖ A completely homogeneous sample has an entropy of 0.
❖ An equally divided sample has an entropy of 1.
❖ For a sample of positive and negative elements, the formula is:
Entropy = - p+ log2(p+) - p- log2(p-)

Exercise
Calculate the entropy
Given:
❖ Set S contains 14 examples
❖ 9 Positive values
❖ 5 Negative values
Entropy(S) = - (9/14) Log2 (9/14) - (5/14) Log2 (5/14)
= 0.940
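The entropy value above can be verified with a short Python sketch (the helper name entropy is illustrative, not from the original text):

import math

def entropy(positives, negatives):
    # Entropy of a sample split into positive and negative examples
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count > 0:                      # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))             # prints 0.94, as computed above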

Information Gain
❖ Information gain is based on the decrease in entropy after a dataset
is split on an attribute.
❖ Looking for which attribute creates the most homogeneous branches

Information Gain Example
❖ 14 examples, 9 positive 5 negative
❖ The attribute is Wind.
❖ Values of wind are Weak and Strong
❖ 8 occurrences of weak winds
❖ 6 occurrences of strong winds
❖ For the weak winds, 6 are positive and 2 are negative
❖ For the strong winds, 3 are positive and 3 are negative
Gain(S, Wind) = Entropy(S) − (8/14)*Entropy(Weak) − (6/14)*Entropy(Strong)
❖ Entropy(Weak) = − (6/8)*log2(6/8) − (2/8)*log2(2/8) = 0.811
❖ Entropy(Strong) = − (3/6)*log2(3/6) − (3/6)*log2(3/6) = 1.00
Gain(S, Wind) = 0.940 − (8/14)*0.811 − (6/14)*1.00 = 0.048
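The same calculation can be scripted by reusing the entropy helper from the earlier sketch (the function name information_gain is illustrative):

def information_gain(total_pos, total_neg, subsets):
    # Gain = Entropy(S) - sum over subsets of (|Sv|/|S|) * Entropy(Sv)
    # subsets is a list of (positives, negatives) pairs, one per attribute value
    total = total_pos + total_neg
    gain = entropy(total_pos, total_neg)
    for pos, neg in subsets:
        gain -= (pos + neg) / total * entropy(pos, neg)
    return gain

# Wind = Weak -> (6+, 2-), Wind = Strong -> (3+, 3-)
print(round(information_gain(9, 5, [(6, 2), (3, 3)]), 3))   # prints 0.048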

Advantage of ID3
❖ Understandable prediction rules are created from the training data.
❖ Builds the fastest tree.
❖ Builds a short tree.
❖ Only need to test enough attributes until all data is classified.
❖ Finding leaf nodes enables test data to be pruned, reducing number
of tests.

Disadvantage of ID3
❖ Data may be over-fitted or overclassified, if a small sample is tested.
❖ Only one attribute at a time is tested for making a decision.
❖ Classifying continuous data may be computationally expensive, as
many trees must be generated to see where to break the continuum.

8.2 CLASSIFICATION AND REGRESSION TREES (CART) ALGORITHM
Classification and Regression Trees (CART) is only a modern term for
what are otherwise known as Decision Trees. Decision Trees have been
around for a very long time and are important for predictive modelling in
Machine Learning.
They have withstood the test of time because of the following reasons:
❖ Very competitive with other methods
❖ High efficiency
Classification
Here our focus is to learn a target function that can be used to predict the
values of a discrete class attribute. Naturally, this falls into Supervised
Learning.
This includes applications such as approving loans (high-risk or low-risk),
predicting the weather (sunny, rainy or cloudy).
Example Problem
Generally, a classification problem can be described as follows:
Data: A set of records (instances) that are described by:
* k attributes: A1, A2,...Ak
* A class: Discrete set of labels
Goal: To learn a classification model from the data that can be used to
predict the classes of new (future, or test) instances.
Note – when our problem has 2 possible labels, it is called a binary classification problem, e.g., predicting whether someone's tumour is benign or malignant.
Our example problem is Loan Approval Prediction, and this is also a
binary classification problem - either 'Yes' or 'No'. Each record is for one
loan applicant at a famous bank. The attributes being considered are - age,
job status, do they own a house or not, and their credit rating. In the real
world, banks would look into many more attributes. They may even
classify individuals on the basis of risk - high, medium and low.
Sample data:

However, we must note that there can be many other possible decision
trees for a given problem - we want the shortest one. We also want it to be
better in terms of accuracy (prediction error measured in terms of
misclassification cost).
An alternative, shorter decision tree for the same –

CART Algorithm for Classification


Step 1: Start at the root node with all training instances
Step 2: Select an attribute on the basis of splitting criteria (Gain Ratio or
other impurity metrics, discussed below)

Step 3: Partition instances according to the selected attribute recursively
Partitioning stops when:
❖ There are no examples left
❖ All examples for a given node belong to the same class
❖ There are no remaining attributes for further partitioning – majority
class is the leaf
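As an illustration of these steps, the sketch below fits a CART-style tree with pandas and scikit-learn (both assumed to be installed) on a tiny, made-up loan-approval table; the records and attribute values are hypothetical, not the sample data above. scikit-learn's DecisionTreeClassifier uses the Gini index as its default splitting criterion.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan-approval records
data = pd.DataFrame({
    "age":       ["young", "young", "middle", "old", "old", "middle"],
    "has_job":   [0, 1, 0, 0, 1, 1],
    "own_house": [0, 0, 1, 1, 0, 1],
    "credit":    ["fair", "good", "good", "fair", "good", "excellent"],
    "approved":  ["No", "Yes", "Yes", "Yes", "Yes", "Yes"],
})

# One-hot encode the categorical attributes so the tree can split on them
X = pd.get_dummies(data.drop(columns="approved"))
y = data["approved"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))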
What is Impurity?
The key to building a decision tree is in Step 2 above - selecting which
attribute to branch off on. We want to choose the attribute that gives us the
most information. This subject is called information theory.
In our dataset we can see that a loan is always approved when the
applicant owns their own house. This is very informative (and certain) and
is hence set as the root node of the alternative decision tree shown
previously. Classifying a lot of future applicants will be easy.
Selecting the age attribute is not as informative – there is a degree of uncertainty (or impurity). The person's age does not seem to affect the final class as much.
Based on the above discussion:
A subset of data is pure if all instances belong to the same class.
Our objective is to reduce impurity or uncertainty in data as much as
possible.
The metric (or heuristic) used in CART to measure impurity is the Gini Index, and we select the attributes with lower Gini indices first.
We first need to define the Gini Index, which is used to measure the impurity of a partition of the data. The Gini Index favours larger partitions and, for a node whose records fall into n classes with relative frequencies p1, p2, …, pn, is calculated as:
Gini = 1 − (p1² + p2² + … + pn²)
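A small helper that computes this index from class counts (the function name gini_index is illustrative):

def gini_index(class_counts):
    # Gini impurity of a node, given the number of records in each class
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A pure node has Gini 0; an evenly split binary node has Gini 0.5
print(gini_index([10, 0]))   # 0.0
print(gini_index([5, 5]))    # 0.5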

Prediction using CARTs
For prediction (regression) tasks, CART chooses splits that reduce the variability of the target values within each partition, and the value predicted at a terminal node is the average of the training observations that reach it.
Conclusion
The CART algorithm is organized as a series of questions, the response to each of which determines the next question, if any. The outcome of these questions is a tree-like structure with terminal nodes, at which no more questions are asked.

8.3 BAYESIAN CLASSIFICATION


Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −
❖ Posterior Probability [P(H/X)]
❖ Prior Probability [P(H)]
where X is data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H/X)= P(X/H)P(H) / P(X)

Bayesian Belief Network


Bayesian Belief Networks specify joint conditional probability
distributions allowing class conditional independencies to be defined
between subsets of variables. It provides a graphical model of causal relationships on which learning can be performed.
Components that define a Bayesian Belief Network −
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph
❖ Each node in a directed acyclic graph represents a random variable.
❖ These variables may be discrete or continuous valued.
❖ These variables may correspond to the actual attribute given in the
data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean
variables.

Fig 4: Directed Acyclic Graph


The arc in the diagram allows representation of causal knowledge. For
example, lung cancer is influenced by a person's family history of lung
cancer, as well as whether or not the person is a smoker. It is worth noting
that the variable PositiveXray is independent of whether the patient has a
family history of lung cancer or that the patient is a smoker, given that we
know the patient has lung cancer.

Conditional Probability Table


The conditional probability table for the values of the variable
LungCancer (LC) showing each possible combination of the values of its
parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −

Fig 5: Probability Table

Applications of Bayes’ Theorem
❖ It can also be used as a building block and starting point for more
complex methodologies.
❖ Used in classification problems and other probability-related
questions.
❖ Statistical inference.
❖ Can be used to calculate the probability of an individual having a
specific genotype.

Example
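As a brief illustration with hypothetical numbers: suppose 1% of loan applicants default (hypothesis H), 90% of defaulters have a low credit score (evidence X), and 10% of non-defaulters also have a low score. The posterior probability of default given a low score follows directly from the theorem; a minimal Python sketch:

# Hypothetical figures, purely for illustration
p_h = 0.01              # P(H): prior probability of default
p_x_given_h = 0.90      # P(X/H): low score among defaulters
p_x_given_not_h = 0.10  # P(X/not H): low score among non-defaulters

# P(X) by the law of total probability
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H/X) = P(X/H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # about 0.083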

8.4 LINEAR REGRESSION


A linear regression model follows a very particular form when all terms in
the model are one of the following:
❖ The constant
❖ A parameter multiplied by an independent variable (IV)
Then, you build the equation by only adding the terms together. These
rules limit the form to just one type:
Dependent variable = constant + parameter * IV + … + parameter * IV
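A minimal sketch of fitting such a model with scikit-learn, using made-up BMI and body-fat values (the numbers are hypothetical, not the data behind the figure below):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations: BMI (independent variable) and body fat % (dependent)
bmi = np.array([[18.5], [21.0], [24.3], [27.8], [31.2], [35.0]])
body_fat = np.array([14.0, 18.5, 22.0, 27.5, 31.0, 36.5])

model = LinearRegression().fit(bmi, body_fat)

# Dependent variable = constant + parameter * IV
print("constant  :", round(model.intercept_, 2))
print("parameter :", round(model.coef_[0], 2))
print("prediction for BMI 25:", round(model.predict([[25.0]])[0], 2))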

Even when an independent variable is squared, the model is still linear in the parameters. Linear models can also contain log terms and inverse terms to follow different kinds of curves and yet continue to be linear in the parameters.
The regression example below models the relationship between body mass index (BMI) and body fat percentage.

Fig 6: Linear Regression


Nonlinear Regression
If a regression equation doesn’t follow the rules for a linear model, it is not linear; nonlinear regression can fit an enormous variety of curves. The example below models the relationship between density and electron mobility.

Fig 7: Nonlinear Regression

8.5 REFERENCES
1. Introduction to Data Warehouse System. https://www.javatpoint.com/. [Last Accessed on 10.03.2022]
2. Introduction to Data Warehouse System. https://www.guru99.com/. [Last Accessed on 10.03.2022]
3. Introduction to Data Warehouse System. http://www.nitjsr.ac.in/. [Last Accessed on 10.03.2022]
4. Introduction to Data Warehouse System. https://oms.bdu.ac.in/. [Last Accessed on 10.03.2022]
5. Data Warehouse. https://www.softwaretestinghelp.com/. [Last Accessed on 10.03.2022]
6. Introduction to Data Warehouse System. https://www.tutorialspoint.com/ebook/data_warehouse_tutorial/index.asp. [Last Accessed on 10.03.2022]
7. Using ID3 Algorithm to build a Decision Tree to predict the weather. https://iq.opengenus.org/id3-algorithm/. [Last Accessed on 10.03.2022]
8. Data Warehouse Architecture. https://binaryterms.com/data-warehouse-architecture.html. [Last Accessed on 10.03.2022]
9. Compare and Contrast Database with Data Warehousing and Data Visualization. https://www.coursehero.com/file/28202760/Compare-and-Contrast-Database-with-Data-Warehousing-and-Data-Visualization-Databases-Assignmentdocx/. [Last Accessed on 10.03.2022]
10. Data Warehousing and Data Mining. https://lastmomenttuitions.com/course/data-warehousing-and-mining/. [Last Accessed on 10.03.2022]
11. Data Warehouse. https://one.sjsu.edu/task/all/finance-data-warehouse. [Last Accessed on 10.03.2022]
12. DWDM Notes. https://dwdmnotes.blogspot.com. [Last Accessed on 10.03.2022]
13. Data Warehouse and Data Mart. https://www.geeksforgeeks.org/difference-between-data-warehouse-and-data-mart/?ref=gcse. [Last Accessed on 10.03.2022]
14. Data Warehouse System. https://www.analyticssteps.com/. [Last Accessed on 10.03.2022]
15. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011.
16. CART. https://iq.opengenus.org/cart-algorithm. [Last Accessed on 10.03.2022]
17. Bhatia P. Data mining and data warehousing: principles and practical techniques. Cambridge University Press; 2019.
18. Berzal F, Matín N. Data mining: concepts and techniques by Jiawei Han and Micheline Kamber. ACM SIGMOD Record. 2002; 31(2):66-8.
19. Gupta GK. Introduction to data mining with case studies. PHI Learning Pvt. Ltd.; 2014.
20. Zhou, Zhi-Hua. "Three perspectives of data mining." (2003): 139-146.
21. Wang J, editor. Encyclopedia of data warehousing and mining. IGI Global; 2005.
22. Pujari AK. Data mining techniques. Universities Press; 2001.

8.6 MOOCS
1. Data Warehousing for Business Intelligence Specialization. https://www.coursera.org/specializations/data-warehousing.
2. Data Mining. https://onlinecourses.swayam2.ac.in/cec20_cs12/preview.
3. Data Warehouse Concepts, Design, and Data Integration. https://www.coursera.org/learn/dwdesign.
4. Data Warehouse Courses. https://www.edx.org/learn/data-warehouse.
5. BI Foundations with SQL, ETL and Data Warehousing Specialization. https://www.coursera.org/specializations/bi-foundations-sql-etl-data-warehouse.
6. Fundamentals of Data Warehousing. https://www.mooc-list.com/initiative/coursera.
7. Foundations for Big Data Analysis with SQL. https://www.coursera.org/learn/foundations-big-data-analysis-sql.

8.7 VIDEO LECTURES
1. Data Warehouse Concepts. https://www.youtube.com/watch?v=CHYPF7jxlik.
2. Data warehouse schema design. https://www.youtube.com/watch?v=fpquGrdgbLg.
3. Data Modeling. https://www.youtube.com/watch?v=7ciFtfi-kQs.
4. Star and SnowFlake Schema in Data Warehouse. https://www.youtube.com/watch?v=VOJ54hu2e2Q.
5. Dimensional Modeling – Declaring Dimensions. https://www.youtube.com/watch?v=ajVfBJrTOxw.
6. What is ETL. https://www.youtube.com/watch?v=oF_2uDb7DvQ.
7. Star Schema & Snow Flake Design. https://www.youtube.com/watch?v=KUwOcip7Zzc.
8. OLTP vs OLAP. https://www.youtube.com/watch?v=aRT8E0nD_LE.
9. OLAP and Data Modeling Concepts. https://www.youtube.com/watch?v=rnQDuz1ZkIo.
10. Understand OLAP. https://www.youtube.com/watch?v=yoE6bgJv08E.
11. OLAP Cubes. https://www.youtube.com/watch?v=UKCQQwx-Fy4.
12. OLAP vs OLTP. https://www.youtube.com/watch?v=TCrCo2-w-vM.
13. OLAP. https://www.youtube.com/watch?v=AC1cLmbXcqA.
14. OLAP vs OLTP. https://www.youtube.com/watch?v=kFQRrgHeiOo.

8.8 QUIZ
1. OLAP stands for
a) Online analytical processing
b) Online analysis processing
c) Online transaction processing
d) Online aggregate processing
Answer: a

2. Data that can be modeled as dimension attributes and measure attributes


are called _______ data.
a) Multidimensional
b) Singledimensional
c) Measured
d) Dimensional
Answer: a

3. The generalization of cross-tab which is represented visually is


____________ which is also called as data cube.
a) Two dimensional cube
b) Multidimensional cube
c) N-dimensional cube
d) Cuboid
Answer: a
4. The process of viewing the cross-tab (single dimensional) with a fixed value of one attribute is
a) Slicing
b) Dicing
c) Pivoting
d) Both Slicing and Dicing
Answer: a

5. The operation of moving from finer-granularity data to a coarser


granularity (by means of aggregation) is called a ________
a) Rollup
b) Drill down
c) Dicing
d) Pivoting
Answer: a

6. In SQL the cross-tabs are created using


a) Slice
b) Dice
c) Pivot
d) All of the mentioned
Answer: a

{ (item name, color, clothes size), (item name, color), (item name, clothes
size), (color, clothes size), (item name), (color), (clothes size), () }
7. This can be achieved by using which of the following ?
a) group by rollup
b) group by cubic
c) group by
d) none of the mentioned
Answer: d

8. What do data warehouses support?
a) OLAP
b) OLTP
c) OLAP and OLTP
d) Operational databases
Answer: a

9. SELECT item name, color, clothes SIZE, SUM(quantity)


FROM sales
GROUP BY rollup (item name, color, clothes SIZE);
How many grouping is possible in this rollup?
a) 8
b) 4
c) 2
d) 1
Answer: b

10. Which one of the following is the right syntax for DECODE?
a) DECODE (search, expression, result [, search, result]… [, default])
b) DECODE (expression, result [, search, result]… [, default], search)
c) DECODE (search, result [, search, result]… [, default], expression)
d) DECODE (expression, search, result [, search, result]… [, default])
Answer: d

11. What is the full form of OLAP?


a) Online Application Programming
b) Online Application Processing
c) Online Analytical programming
d) Online Analytical Processing
Answer: d

12. Data that can be modelled as dimension attributes and measure attributes are called ___________
a) Mono-dimensional data
b) Multi-dimensional data
c) Measurable data
d) Efficient data
Answer: b

13. The operation of changing a dimensions used in a cross-tab is called as


________
a) Alteration
b) Pivoting
c) Piloting
d) Renewing
Answer: b

14. The operation of moving from finer granular data to coarser granular
data is called _______
a) Reduction
b) Increment
c) Roll up
d) Drill down
Answer: c

15. How many dimensions of multi-dimensional data do cross tabs enable


analysts to view?
a) 1
b) 2
c) 3
d) None of the mentioned
Answer: b

16. The _______ function allows substitution of values in an attribute of a


tuple
a) Cube
b) Unknown
c) Decode
d) Substitute
Answer: c
17. Which of the following OLAP systems do not exist?
a) HOLAP
b) MOLAP
c) ROLAP
d) None of the mentioned
Answer: d

18. State true or false: OLAP systems can be implemented as client-server


systems
a) True
b) False
Answer: a

19. The operation of moving from coarser granular data to finer granular
data is called _______
a) Reduction
b) Increment
c) Roll back
d) Drill down
Answer: d

20. State true or false: In OLAP, analysts cannot view a dimension in


different levels of detail.
a) True
b) False
Answer: b

21. What is a Star Schema?


a) a star schema consists of a fact table with a single table for each
dimension
b) a star schema is a type of database system
c) a star schema is used when exporting data from the database
d) none of these
Answer: A

22. What is the type of relationship in star schema?
a) many-to-many.
b) one-to-one
c) many-to-one
d) one-to-many
Answer: D

23. Fact tables are _______.


a) completely denormalized.
b) partially denormalized.
c) completely normalized.
d) partially normalized.
Answer: C

24. Data warehouse is volatile, because obsolete data are discarded


a) TRUE
b) FALSE
Answer: B

25. Which is NOT a basic conceptual schema in Data Modeling of Data


Warehouses?
a) Star schema
b) Tree schema
c) Snowflake schema
d) Fact constellations
Answer: B

26. Which is NOT a valid OLAP Rule by E.F.Codd?


a) Accessibility
b) Transparency
c) Flexible reporting
d) Reliability
Answer: D

27. Which is NOT a valid layer in Three-layer Data Warehouse Architecture in Conceptual View?
a) Processed data layer
b) Real-time data layer
c) Derived data layer
d) Reconciled data layer
Answer: A

28. Among the types of fact tables which is not a correct type ?
a) Fact-less fact table
b) Transaction fact tables
c) Integration fact tables
d) Aggregate fact tables
Answer: C

29. Among the followings which is not a characteristic of Data


Warehouse?
a) Integrated
b) Volatile
c) Time-variant
d) Subject oriented
Answer: B

30. What is not considered as an issue in data warehousing?


a) optimization
b) data transformation
c) extraction
d) inter mediation
Answer: D

31. Which one is NOT considering as a standard query technique?


a) Drill-up
b) Drill-across
c) DSS
d) Pivoting
Answer: C
32. Among the following, which is not a type of business data?
a) Real time data
b) Application data
c) Reconciled data
d) Derived data
Answer : B

33. A data warehouse is which of the following?


a) Can be updated by end users.
b) Contains numerous naming conventions and formats.
c) Organized around important subject areas.
d) Contains only current data.
Answer: C

34. A snowflake schema is which of the following types of tables?


a) Fact
b) Dimension
c) Helper
d) All of the above
Answer: D

35. The extract process is which of the following?


a) Capturing all of the data contained in various operational systems
b) Capturing a subset of the data contained in various operational systems
c) Capturing all of the data contained in various decision support systems
d) Capturing a subset of the data contained in various decision support
systems
Answer: B

36. The generic two-level data warehouse architecture includes which of


the following?
a) At least one data mart
b) Data that can extracted from numerous internal and external sources
c) Near real-time updates
d) All of the above.
Answer: B
37. Which one is correct regarding MOLAP?
a) Data is stored and fetched from the main data warehouse.
b) Use complex SQL queries to fetch data from the main warehouse
c) Large volume of data is used.
d) All are incorrect
Answer: A

38. In terms of data warehouse, metadata can be define as,


a) Metadata is a road-map to data warehouse
b) Metadata in data warehouse defines the warehouse objects.
c) Metadata acts as a directory.
d) All are incorrect
Answer: D

39. In terms of the ROLAP model, choose the most suitable answer


a) The warehouse stores atomic data.
b) The application layer generates SQL for the two dimensional view
c) The presentation layer provides the multidimensional view.
d) All are incorrect
Answer: D

40. In the OLAP model, the _ provides the multidimensional view.


a) Data layer
b) Data link layer
c) Presentation layer
d) Application layer
Answer: A

41. Which of the following is not true regarding characteristics of


warehoused data?
a) Changed data will be added as new data
b) Data warehouse can contains historical data
c) Obsolete data are discarded
d) Users can change data once entered into the data warehouse
Answer: D

42. ETL is an abbreviation for Elevation, Transformation and Loading
a) TRUE
b) FALSE
Answer: B

43. Which is the core of the multidimensional model that consists of a


large set of facts and a number of dimensions?
a) Multidimensional cube
b) Data model
c) Data cube
d) None of the above
Answer: C

44. Which of the following statements is incorrect


a) ROLAPs have large data volumes
b) Data form of ROLAP is a large multidimensional array made of cubes
c) MOLAP uses sparse matrix technology to manage data sparsity
d) Access for MOLAP is faster than ROLAP
Answer: B

45. Which of the following standard query techniques increase the


granularity
a) roll-up
b) drill-down
c) slicing
d) dicing
Answer: B

46. The full form of OLAP is


a) Online Analytical Processing
b) Online Advanced Processing
c) Online Analytical Performance
d) Online Advanced Preparation
Answer: A

47. Which of the following statements is/are incorrect about ROLAP?
a) ROLAP fetched data from data warehouse.
b) ROLAP data store as data cubes.
c) ROLAP use sparse matrix to manage data sparsity.
Answer: B and C

48. __ is a standard query technique that can be used within OLAP to


zoom in to more detailed data by changing dimensions.
a) Drill-up
b) Drill-down
c) Pivoting
d) Drill-across
Answer: B

49. Which of the following statements is/are correct about Fact


constellation
a) Fact constellation schema can be seen as a combination of many star
schemas.
b) It is possible to create fact constellation schema, for each star schema
or snowflake schema.
c) Can be identified as a flexible schema for implementation.
Answer: C

50. How to describe the data contained in the data warehouse?.


a) Relational data
b) Operational data
c) Meta data
d) Informational data
Answer: C

51. The output of an OLAP query is displayed as a


a) Pivot
b) Matrix
c) Excel
Answer: B and C

52. One can perform query operations on the data present in a Data Warehouse
a) TRUE
b) FALSE
Answer: A

53. A __ combines facts from multiple processes into a single fact table
and eases the analytic burden on BI applications.
a) Aggregate fact table
b) Consolidated fact table
c) Transaction fact table
d) Accumulating snapshot fact table
Answer: B

54. In OLAP operations, Slicing is the technique of ____


a) Selecting one particular dimension from a given cube and providing a
new sub-cube
b) Selecting two or more dimensions from a given cube and providing a
new sub-cube
c) Rotating the data axes in order to provide an alternative presentation of
data
d) Performing aggregation on a data cube
Answer: A

55. Standalone data marts built by drawing data directly from operational
or external sources of data or both are known as independent data marts
a) TRUE
b) FALSE
Answer: A

56. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing, is known as
a) Integrated
b) Time-variant
c) Subject oriented
d) Non-volatile
Answer: C
57. Most of the time, a data warehouse is
a) Read
b) Write
c) Both
Answer: A

58. Data granularity is ——————- of details of data ?


a) summarization
b) transformation
c) level
Answer: C

59. Which one is not a type of fact?


a) Fully Additive
b) Cumulative Additive
c) Semi Additive
d) Non Additive
Answer: C

60. When the level of detail of the data is reduced, the data granularity goes higher
a) True
b) False
Answer: B

61. Data warehouses have summarized and reconciled data which can be used by decision makers
a) True
b) False
Answer: A

62. _____ refers to the currency and lineage of data in a data warehouse
a) Operational metadata
b) Business metadata
c) Technical metadata
d) End-User metadata
Answer: A



Module VII

9
CLUSTERING
Unit Structure
9.0 Introduction
9.1 Types of Clustering
9.1.1 Hard clustering
9.1.2 Soft clustering
9.2 Categorization of Major Clustering Methods
9.2.1 Partitioning methods - K-Means.
9.2.2 Hierarchical methods
9.2.2.1 Agglomerative hierarchical methods
9.2.2.2 Divisive hierarchical methods
9.2.3 Model- based- Expectation and Maximization
9.2.3.1 Expectation- Maximization algorithm
9.2.3.2 EM Application, Advantages & Disadvantages
9.3 Evaluating cluster models
9.4 List of References
9.5 Quiz
9.6 Exercise
9.7 Video Links

9.0 INTRODUCTION
The term cluster refers to a homogeneous subgroup existing within a
population. Clustering techniques are therefore aimed toward segmenting
a heterogeneous population into a given number of subgroups composed
of observations that share similar characteristics. The characteristics of
observations in different clusters are distinct. In classification we have
predefined classes or labels indicating the target class but in clustering
there are no predefined classes or reference examples indicating the target
class. In clustering, the objects are grouped together based on their mutual
homogeneity. Sometimes, exploratory data analysis is used for identifying

clusters at an initial stage in the data mining process. The aim of clustering is to sort data with similar traits and create clusters, thereby reducing the size of datasets. In clustering, observations in a dataset are grouped together based on the distances among each other. The observations that are not placed in any of the clusters are called outliers. An outlier may be an error or a variation in the observations of a specific dataset. Clustering algorithms may find and remove outliers so that they perform better; still, care must be taken when removing outliers. Outlier detection, or outlier mining, is the process of identifying outliers in a set of data.
The figure below indicates clusters and outliers using the iris dataset.

9.1 TYPES OF CLUSTERING


There are two types of clustering: hard clustering and soft clustering. The distinction is whether each data point belongs to a single distinct cluster, or whether the likelihood (probability) of the data point belonging to each cluster, in particular the nearest cluster, is considered.

9.1.1 Hard clustering


In hard clustering each datapoint either belongs to a cluster completely or
does not belong to the cluster at all.


9.1.2 Soft clustering


In soft clustering, instead of putting each data point into exactly one cluster, a probability or likelihood of the data point belonging to each cluster is considered. An observation can belong to more than one cluster to a certain degree, that is, its likelihood of belonging to a given cluster can be higher or lower.

9.2 CATEGORIZATION OF MAJOR CLUSTERING METHODS
Clustering methods can be classified into four categories by considering the logic used for deriving the clusters:

Partition methods
Partition methods are used to develop a subdivision of a given dataset
using a predetermined number K of non-empty data subsets. These are
iterative clustering models in which the similarity is derived by the
closeness of a data point to the given centroid or medoid of dataset.

Hierarchical methods
The hierarchical method is a type of connectivity model. It is based on the idea that data points closer together in the data space have more similarity with each other than data points lying far away. In this type of clustering, a predetermined number of clusters is not required. This type of clustering supports both top-down and bottom-up approaches.

Density based methods


Density-based methods differ from partition and hierarchical methods: rather than relying only on the distances between observations or between clusters, they derive clusters from the number of observations that are locally close to each other in the neighbourhood of each observation. The method isolates the various density regions and assigns the data points within each region to the same cluster.

Grid methods
In grid-based clustering observations belonging to data space is divided
into a grid like structure consisting of finite number of cells. A grid is a
multidimensional data structure used for achieving reduced computing
times.
While doing grid-based clustering below steps need to be followed:

● After creation of finite number of cells, we need to calculate the


density of the observations present in each cell.
● The cells must be sorted based on the density.
● After sorting the center of cluster must be identified.
● once the center is identified update the neighbour cells with the
new value.

9.2.1 Partitioning methods - K-Means


Consider a dataset T having m observations, each represented by a vector in n-dimensional space. A partition method constructs a subdivision of T into non-empty subsets C = {C1, C2, …, CK}, where K ≤ m. Generally, the number of clusters is predetermined and assigned as an input to the partition algorithm. A partition algorithm generates clusters that are mutually exclusive, such that each observation belongs to one and only one cluster. In R, kmeans(dataset, number of clusters) can be used to create clusters from the actual dataset.
K-Means clustering supports various sorts of distance measures, such as:
● Euclidean distance measure
● Squared Euclidean distance measure

● Manhattan distance measure
● Cosine distance measure

Euclidean distance measure


The Euclidean distance is the ordinary straight-line distance between two points. Consider the points M(p1, p2) and N(q1, q2); the straight-line distance between them in Euclidean space is
d(M, N) = sqrt((p1 − q1)² + (p2 − q2)²)
Squared Euclidean Distance Measure


The squared Euclidean distance measure is similar to the Euclidean distance measure but does not take the square root at the end:
d²(M, N) = (p1 − q1)² + (p2 − q2)²
Manhattan distance measure


The sum of the horizontal and vertical components or the distance
between two points measured along axes at right angles is called
Manhattan distance.
In this we take absolute value to avoid negative values in calculation.
The formula for the Manhattan distance in 2-dimensional space is
d(M, N) = |p1 − q1| + |p2 − q2|
The general formula for an n-dimensional space having data points pi and qi is
d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
Cosine distance measure


Cosine similarity between two vectors corresponds to their dot product
divided by the product of their magnitudes.

The formula is given below:
cos(p, q) = (p · q) / (||p|| ||q||), and the cosine distance is defined as 1 − cos(p, q).
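These four measures can be computed directly with NumPy; a small sketch for two arbitrary points (the values are made up):

import numpy as np

p = np.array([2.0, 3.0, 1.0])
q = np.array([5.0, 1.0, 4.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))          # straight-line distance
squared_euclidean = np.sum((p - q) ** 2)           # no square root at the end
manhattan = np.sum(np.abs(p - q))                  # sum of absolute differences
cosine_similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
cosine_distance = 1.0 - cosine_similarity

print(euclidean, squared_euclidean, manhattan, round(cosine_distance, 3))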
K-Means Algorithm
K-means is a centroid-based unsupervised learning algorithm used for solving clustering problems. It is iterative in nature and is used to divide an unlabelled dataset into K different clusters, each containing data points with similar properties. It is very sensitive to outliers. The K-means algorithm receives as input the dataset T and the number K of clusters to generate, along with a function dist(xi, xk) which expresses the inhomogeneity between each pair of observations, or the distance between observations.
It assigns data points to clusters in such a way that the sum of the squared distances between each observation and the centroid of its cluster (i.e., the mean of the observations assigned to the cluster) is at a minimum.
Procedure for K-means algorithm
1. During the initialization phase, K observations are arbitrarily chosen in
D as the centroids of the clusters.
2. Each observation is iteratively assigned to the cluster whose centroid is
the most similar to the observation, in the sense that it minimizes the
distance from the record.
3. If no observation is assigned to a different cluster with respect to the
previous iteration, the algorithm stops.
4. For each cluster, the new centroid is computed as the mean of the
values of the observations belonging to the cluster, and then the
algorithm returns to step 2.
The calculation begins by arbitrarily selecting K observations that represent the initial centroids. For example, the K points may be randomly chosen among the m observations in D. At every succeeding iteration, each record is assigned to the cluster whose centroid is the closest, that is, the one which minimizes the distance from the observation among all centroids. If no observation is reallocated to a cluster different from the one to which it belonged during the previous iteration, the procedure stops, since any subsequent iteration cannot alter this subdivision into clusters. Otherwise, the new centroids for every cluster are computed and a new assignment is made.
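This procedure is essentially what scikit-learn's KMeans implements; a minimal sketch on synthetic two-dimensional data (the data and the choice K = 3 are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D observations drawn around three centres
D = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(D)

print("centroids:")
print(kmeans.cluster_centers_)
print("first ten cluster labels:", labels[:10])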

Advantages and Disadvantages
Advantages of K-Means Algorithm
● K-means algorithm is simple, easy to understand and easy to
implement.
● It is efficient as time taken to cluster K-means rises linearly with the
number of data points.
● There is no clear evidence that any other clustering algorithm performs better than K-means in general.
Disadvantages of K-Means Algorithm
● The initial value of K has to be specified by user.
● The process of finding the clusters may not converge.
● It is not suitable for discovering clusters that are not hyper ellipsoids or
hyper spheres.

9.2.2 Hierarchical - Agglomerative and Divisive methods


Hierarchical clustering techniques are based on a tree structure. Unlike partition methods, they do not require the number of clusters to be determined in advance. Hence, they receive as input a dataset D containing m observations and a matrix of distances dist(xi, xk) between all pairs of observations. The space complexity of a hierarchical algorithm is O(n2), because this is the space required for the adjacency (distance) matrix. The space required for the dendrogram is O(kn), which is usually much less than O(n2). The time complexity of hierarchical algorithms is O(kn2), because there is one iteration for each level in the dendrogram. Depending on the specific algorithm, however, this could actually be O(maxd n2), where maxd is the maximum distance between points.
In order to assess the distance between clusters, most hierarchical algorithms resort to one of five alternative measures: minimum distance, maximum distance, mean distance, distance between centroids, and Ward distance. Suppose that we wish to calculate the distance between clusters Ch and Cf, and let zh and zf be the corresponding centroids.

● Minimum distance:
According to the criterion of minimum distance, also called the single linkage criterion, the dissimilarity between two clusters is given by the minimum distance among all pairs of observations such that one belongs to the first cluster and the other to the second cluster, that is,
dist(Ch, Cf) = min dist(xi, xk), for xi in Ch and xk in Cf
● Maximum distance:
According to the criterion of maximum distance, also called the complete linkage criterion, the dissimilarity between two clusters is given by the maximum distance among all pairs of observations such that one belongs to the first cluster and the other to the second cluster, that is,
dist(Ch, Cf) = max dist(xi, xk), for xi in Ch and xk in Cf
● Mean distance:
The mean distance criterion expresses the dissimilarity between two clusters through the mean of the distances between all pairs of observations belonging to the two clusters, that is, the sum of dist(xi, xk) over all xi in Ch and xk in Cf, divided by the product of the numbers of observations in the two clusters.
● Distance between centroids:


The criterion based on the distance between centroids determines the dissimilarity between two clusters through the distance between the centroids representing the two clusters, that is,
dist(Ch, Cf) = dist(zh, zf)
● Ward distance:
Ward's distance criterion, based on an analysis of variance of Euclidean
distances between observations, is somewhat more complex than the
criteria described above. In fact, it requires that the algorithm first
calculate the sum of the squared distances between all pairs of
observations belonging to a cluster. Then all pairs of clusters that could be
merged in the current iteration are considered and for each pair the total
variance is calculated as the sum of the two variances between the
distances in each cluster evaluated in the first step. Finally, the pair of clusters associated with the minimum total variance is merged.
Methods based on the Ward distance tend to generate a large number of
clusters, each containing a few observations.

Hierarchical methods are divided into two main categories: agglomerative methods and divisive methods.

Representation of hierarchical clustering methods

9.2.2.1 Agglomerative Hierarchical Clustering (AHC)


o Agglomerative Hierarchical Clustering methods are bottom-up techniques in which each observation initially represents a distinct cluster.
o Over multiple iterations these clusters are aggregated to derive clusters of increasingly larger cardinality.
o Once a single cluster including all the observations has been reached, the algorithm stops.
o The user has to decide the number of clusters and determine the corresponding cut point.
o After the algorithm has been executed, it is possible to graphically represent the entire process of successive mergers using a dendrogram, in which one axis shows the minimum distance corresponding to each merger and the other axis represents the observations.
Procedure for Agglomerative Hierarchical Clustering Algorithm
1. In the initialization phase, each observation constitutes a cluster. The
distance between clusters therefore corresponds to the matrix D of the
distances between all pairs of observations.
2. The minimum distance between the clusters is then computed, and the
two clusters Ch and Cf with the minimum distance are merged, thus
deriving a new cluster Ce. The corresponding minimum distance
dist(Ch, Cf ) originating the merger is recorded.
3. The distance between the new cluster Ce, resulting from the merger
between Ch and Cf, and the pre-existing clusters is computed.
4. If all the observations are included into a single cluster, the procedure
stops. Otherwise, it is repeated from step 2
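SciPy provides these linkage criteria directly; a sketch that builds an agglomerative clustering on synthetic data (single linkage here; 'complete', 'average' and 'ward' correspond to the other criteria discussed above) and cuts the resulting dendrogram into two clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated groups of synthetic observations
D = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(4, 0.3, size=(20, 2))])

# 'single' = minimum distance criterion
Z = linkage(D, method="single")

# Cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)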

9.2.2.2 Divisive hierarchical methods


o Divisive Hierarchical Clustering methods are the reverse of agglomerative methods: they are top-down techniques in which all observations are initially placed in a single cluster.
o The single cluster is then divided into multiple clusters of smaller size, so that the distances between the generated subgroups are minimised.
o The above steps are repeated until clusters containing a single observation are obtained or until a stopping criterion is met.
o To keep computing times within a reasonable limit, divisive algorithms require a strict limitation on the number of combinations that can be analysed, which is not needed in the agglomerative method.
o The initial single cluster containing all the observations has to be divided into two subsets such that the distance between the two resulting clusters is maximised.
o As there are 2^m − 2 possible partitions of the whole dataset into two non-empty disjoint subsets, this results in an exponential number of operations already at the first iteration.
o To overcome this difficulty, at any given iteration a divisive hierarchical algorithm usually determines, for each cluster, the two observations that are furthest from each other and subdivides the cluster by assigning the remaining records to one or the other based on their proximity.
Procedure for Divisive Hierarchical Clustering Algorithm
1. In the initialization phase, all observations constitute a single cluster.
The distance between clusters therefore corresponds to the matrix D of
the distances between all pairs of observations.
2. The minimum distance between the clusters is then computed, and the cluster Ch is subdivided into smaller clusters based on the minimum distance, thus deriving new clusters Ce, Cf, and so on.
3. If all the observations have been subdivided into clusters each holding a single observation, or if a stopping criterion is met, the procedure stops. Otherwise, it is repeated from step 2.


Representation of hierarchical methods using dendrogram

9.2.3 Model- based- Expectation and Maximization


In K-means, we try to find out centroids that are good representatives and
these centroids act as a model that generates the data. Model-based
clustering assumes that the data were generated by the model and then
tries to recover the original model from the data. Unlike k-means
clustering, model-based clustering offers more flexibility.
Expectation- Maximization algorithm or EM algorithm is most commonly
used algorithm for model-based clustering. EM clustering is an iterative
algorithm. EM can be applied to many different types of probabilistic
modelling. The E-step is assigning the data points to the closest cluster.
The M-step is computing the centroid of each cluster.

9.2.3.1 Expectation- Maximization algorithm


EM algorithm is an iterative estimation algorithm that can derive the
maximum likelihood estimate in the presence of incomplete or missing
data.
Step I: Estimation
● Consider a set of incomplete data.
● Assume the incomplete observed data came from a specific model.
Step II: Expectation or E-step
● Use this to estimate the missing data and formulate some parameters
for that model, using this we can guess the missing value or data.
● It is used to update the variables.

Step III: Maximization or M-step
● The data which was estimated in previous step should now be
maximized i.e. use complete data to update the parameters from the
missing data and observed data by finding the most likely parameters.
● It is used to update the hypothesis.
Step IV: Convergence
● In this it is checked whether the values are converging or not.
● If yes, then stop otherwise repeat Step II and III until convergence.
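For model-based clustering with Gaussian mixtures, scikit-learn's GaussianMixture runs this E-step/M-step loop internally; a minimal sketch on synthetic one-dimensional data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Synthetic data drawn from two Gaussian components
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(6.0, 1.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0)
gmm.fit(X)                      # the EM iterations happen inside fit()

print("estimated means  :", gmm.means_.ravel())
print("estimated weights:", gmm.weights_)
# Soft clustering: probability of each component for a new observation
print("P(component) for x = 3:", gmm.predict_proba([[3.0]]))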

9.2.3.2 EM Application, Advantages & Disadvantages


The EM algorithm can be applied in areas of Artificial Intelligence, parameter estimation in mixture models (for example, Gaussian mixtures), Natural Language Processing, clustering, segmentation, etc. It is also used for image reconstruction in medicine and structural engineering, in computer vision, and for estimating parameters in HMMs (Hidden Markov Models).
Advantages of EM algorithm
● Implementation of the E-step and the M-step are very easy for many
problems.
● With each iteration, it is guaranteed that the likelihood will increase.
● The solution for the M-step often exists in closed form.
Disadvantages of EM algorithm
● It requires both forward and backward probabilities.
● It may converge to a local optimum only.
● It has very slow convergence.

9.3 EVALUATING CLUSTERING MODELS


To evaluate a clustering method, it is necessary to verify that the clusters generated correspond to an actual regular pattern in the data. It is therefore appropriate to apply other clustering algorithms and to compare the results obtained by different clustering methods, so that it is possible to evaluate whether the number of identified clusters is robust with respect to the different techniques applied.
Let C = {C1, C2, …, CK} be the set of K clusters generated.
Cohesion is used as an indicator of the homogeneity of the observations within a cluster Ch; it can be expressed as the sum of the distances between all pairs of observations belonging to Ch.
The overall cohesion of the partition C can therefore be defined as the sum of the cohesion values of all the clusters.
In terms of homogeneity within each cluster, one clustering is preferred over another if it has a smaller overall cohesion.
Separation is used as an indicator of the inhomogeneity between a pair of clusters; it can be expressed as the sum of the distances between all pairs of observations such that one belongs to the first cluster and the other to the second.
The overall separation of the partition C can therefore be defined as the sum of the separation values over all pairs of clusters.
In terms of inhomogeneity among clusters, one clustering is preferred over another if it has a greater overall separation.
Another indicator of clustering quality is given by silhouette coefficient
which involves a combination of cohesion and separation.
To calculate silhouette coefficient for a single observation xi, three steps
should be followed.
Procedure for calculation of the silhouette coefficient
1. The mean distance ui of xi from all the remaining observations
belonging to the same cluster is computed.
2. For each cluster Cf other than the cluster to which xi belongs, the mean distance wif between xi and all the observations in Cf is calculated. The minimum vi among the distances wif is determined by varying the cluster Cf.
3. The silhouette coefficient of xi is defined as
silh(xi) = (vi − ui) / max(ui, vi)
The silhouette coefficient takes values between −1 and 1. A negative value indicates that the mean distance ui of the observation xi from the points of its own cluster is greater than the minimum value vi of the mean distances from the observations of the other clusters; this is undesirable, since the membership of xi in its cluster is not well characterized. Ideally, the coefficient should be positive and ui should be as close as possible to zero. Finally, it should be noticed that the overall silhouette coefficient of a clustering may be computed as the mean of the silhouette coefficients of all the observations in the dataset D.
Silhouette coefficients can be illustrated by silhouette diagrams, in which the observations are placed on the vertical axis, subdivided by cluster, and the value of the silhouette coefficient of each observation is shown on the horizontal axis.

Representation of silhouette diagrams with two clusters


The silhouette coefficient can have values in the interval of [-1,1]
If,
● Value is 0 : the sample is very close to the neighbouring clusters.
● Value is 1 : the sample is far away from the neighbouring clusters.
● Value is -1 : the sample is assigned to the wrong clusters.
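scikit-learn computes both the per-observation coefficients and the overall mean; a short sketch that evaluates a K-means partition of synthetic data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
D = np.vstack([rng.normal(0, 0.4, size=(40, 2)),
               rng.normal(3, 0.4, size=(40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(D)

print("overall silhouette coefficient:", silhouette_score(D, labels))
print("first five per-observation values:", silhouette_samples(D, labels)[:5])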

9.4 LIST OF REFERENCES


● Business Intelligence data mining and optimization for decision
making- by Carlo Vercellis, wiley publication.
● Data Mining for Business Intelligence: Concepts, Techniques and Applications in Microsoft Excel, by G. Shmueli, N. R. Patel, P. C. Bruce, Wiley.
● Data Mining: Introductory and Advanced Topics, by M. Dunham, Pearson Education.
● An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Cambridge University Press, Cambridge, England.

9.5 QUIZ
1. Point out the wrong statement
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbour is same as k-means

2. ________ clustering requires merging approach.


a) Partitioning
b) Hierarchical
c) Naive Bayes
d) Grid based

3. Suppose we would like to perform clustering on spatial data such as the


geometrical locations of houses. We wish to produce clusters of many
different sizes and shapes. Which of the following methods is the most
appropriate?
a) Decision Trees
b) Density-based clustering
c) Model-based clustering
d) K-means clustering

4. _____ is a clustering procedure where all objects start out in one single
huge cluster. Later smaller clusters are formed by dividing this cluster.
a) Non-hierarchical clustering
b) Divisive clustering
c) Model-based clustering
d) Agglomerative clustering

5. Which of the following clustering algorithms suffers from the problem


of convergence at local optima?
a) Divisive clustering
b) Agglomerative clustering algorithm
c) Expectation-Maximization clustering algorithm
d) Non-hierarchical clustering

6. __________ algorithm is most sensitive to outliers.


a) K-means
b) K-medians
c) K-medoids
d) K-modes

7. Point out correct statement for single linkage hierarchical clustering.
a) we merge in the members of the clusters in each step, which provide the
smallest maximum pairwise distance.
b) the distance between two clusters is defined as the average distance
between each point in one cluster to every point in the other cluster.
c) we merge in each step the two clusters, whose two closest members
have the smallest distance.

8. Point out correct statement for complete linkage hierarchical clustering.


a) we merge in the members of the clusters in each step, which
provide the smallest maximum pairwise distance.
b) the distance between two clusters is defined as the average distance
between each point in one cluster to every point in the other cluster.
c) we merge in each step the two clusters, whose two closest members
have the smallest distance.

9. Point out correct statement for average linkage hierarchical clustering.


a) we merge in the members of the clusters in each step, which provide the
smallest maximum pairwise distance.
b) the distance between two clusters is defined as the average distance
between each point in one cluster to every point in the other cluster.
c) we merge in each step the two clusters, whose two closest members
have the smallest distance.

10. In Hierarchical Clustering ___________ is used to find the right


number of clusters.
a) Scatter plot
b) Dendrograms
c) Bar charts
d) Histograms

11. The silhouette coefficient can have the value in the interval of ____.
a) [1,3]
b) [-1,0]
c) [-1,1]
d) [-3,3]

12. The sum of the horizontal and vertical components or the distance
between two points measured along axes at right angles is called
__________.
a) Euclidean distance measure
b) Squared Euclidean distance measure
c) Manhattan distance measure
d) Cosine distance measure

13. __________ is the process for identification of outliers in a set of data
a) Outlier definition
b) Outlier reduction
c) Outlier mining
d) Outlier collection

14. ______ is used as an indicator of homogeneity of the observations


within the cluster.
a) Cohesion
b) Separation
c) Silhouette
d) Elbow method

15. If we give the number of centroids as three in the k-means algorithm, then how many clusters will be formed?
a) 3
b) 1
c) 2
d) n

9.6 EXERCISE
1. What is clustering? Explain its types.
2. State and explain different clustering methods.
3. Explain k-means clustering algorithm with its advantages and
disadvantages.
4. Explain
a. Euclidean distance measure
b. Squared Euclidean distance measure
c. Manhattan distance measure
d. Cosine distance measure
5. Short note on:
a. Hierarchical clustering
b. Divisive clustering
6. Explain EM algorithm in detail
7. What are the distance measures associated with hierarchical
clustering?
8. Explain silhouette coefficient.
9.7 VIDEO LINKS
1. https://www.youtube.com/watch?v=CLKW6uWJtTc
2. https://www.youtube.com/watch?v=p3HbBlcXDTE
3. https://www.youtube.com/watch?v=ieMjGVYw9ag
4. https://www.youtube.com/watch?v=VMyXc3SiEqs
5. https://www.youtube.com/watch?v=7enWesSofhg
6. https://www.youtube.com/watch?v=EFhcDnw7RGY
7. https://www.youtube.com/watch?v=G_Ob1k28ZJo
8. https://www.youtube.com/watch?v=7e65vXZEv5Q
9. https://www.youtube.com/watch?v=g5e_r8dw3uc
10. https://www.youtube.com/watch?v=aOnKnLM4eok




Module VIII
Web mining and Text mining

10
TEXT MINING
Unit Structure

10.1 Text Mining


10.2 Text data analysis and information retrieval
10.2.1 Information retrieval
10.3 Text retrieval methods
10.4 Dimensionality reduction for text
10.1 TEXT MINING
Rapid increment in computerized or digital information has prompted an
enormous volume of information and data. A substantial portion of the
available information is stored in Text databases, which consist of large
collections of documents from various sources. Text databases are rapidly
growing due to the increasing amount of information available in
electronic form. In excess of 80% of the present information is in the form
of unstructured or semi-organized data. Traditional information retrieval
techniques become inadequate for the increasingly vast amount of text
data. Thus, text mining has become an increasingly popular and essential
part of Data Mining. The discovery of proper patterns and analyzing the
text document from the huge volume of data is a major issue in real-world
application areas.
“Extraction of interesting information or patterns from data in large
databases is known as data mining.”
Text mining is a process of extracting useful information and nontrivial
patterns from a large volume of text databases. There exist various
strategies and devices to mine the text and find important data for the
prediction and decision-making process. The selection of the right and
accurate text mining procedure helps to enhance the speed and the time
complexity also. This article briefly discusses and analyzes text mining
and its applications in diverse fields.

“Text Mining is the procedure of synthesizing information, by


analyzing relations, patterns, and rules among textual data.”

As we discussed above, the size of information is expanding at exponential rates. Today all institutes, companies, organizations, and business ventures store their information electronically. A huge collection of data is available on the internet and stored in digital libraries, database repositories, and other textual sources such as websites, blogs, social media networks, and e-mails. It is a difficult task to determine appropriate patterns and trends in order to extract knowledge from this large volume of data. Text mining is the part of data mining concerned with extracting valuable information from text database repositories. Text mining is a multi-disciplinary field based on information retrieval, data mining, AI, statistics, machine learning, and computational linguistics.

The conventional process of text mining as follows:


● Gathering unstructured information from various sources available in different document formats, for example plain text, web pages, PDF records, etc.
● Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistencies in the data. The data cleansing process makes sure to capture the genuine text; it also removes stop words, performs stemming (the process of identifying the root of a certain word) and indexes the data.
● Processing and controlling tasks are applied to review and further
clean the data set.
● Pattern analysis is implemented in Management Information System.
● Information processed in the above steps is utilized to extract
important and applicable data for a powerful and convenient
decision-making process and trend analysis.

Procedures of analyzing Text Mining:


● Text Summarization: to automatically extract a partial content that reflects the whole content of the text.
● Text Categorization: To assign a category to the text among
categories predefined by users.
● Text Clustering: To segment texts into several clusters, depending on
the substantial relevance.


Text Mining Techniques:


● Information Extraction: It is the process of extracting meaningful information from documents.
● Information Retrieval: It is the process of extracting relevant and associated patterns according to a given set of words or text documents.
● Natural Language Processing: It concerns the automatic processing and analysis of unstructured text information.
● Clustering: It is an unsupervised learning process that groups texts according to their similar characteristics.
● Text Summarization: to automatically extract a condensed version that reflects the whole content of the document.

Text Mining Process


1. Text Pre-processing
A large number of documents contain unstructured and semi-structured data. Text pre-processing is applied to them to transform a raw text file into a clearly defined sequence of linguistically meaningful units. Text pre-processing incorporates various kinds of processing, as described below (a minimal code sketch follows the list).
● Text Cleanup: It performs various tasks such as removing advertisements from web pages and cutting out tables and figures.

● Tokenization: It segments sentences into words by splitting on spaces, commas, etc.
● Filtering: It removes words that carry no relevant content information, including articles, conjunctions, and prepositions. Words with very frequent repetitions are also removed.
● Stemming: It is the process of transforming words to their stem, or normalized form, by reducing words to basic forms so that they can be recognized by their root word-forms. For example, the word “go” is the stem of goes, going and gone.
● Lemmatization: It maps the word to its linguistically correct root, that is, the base form of the word. During the process, the first step is to understand the context, then find the POS of the word in the sentence, and finally identify the ‘lemma’. For example, go is the lemma of goes, gone, going, went.
● Linguistic processing: Involving Part-of-speech tagging (POS), Word Sense Disambiguation (WSD) and Semantic structure, it works as follows:
Part-of-speech tagging: determines the linguistic category of a word by assigning a word class to each token. There are eight classes: noun, pronoun, adjective, verb, adverb, preposition, conjunction and interjection.
Word Sense Disambiguation (WSD): determines the intended sense of a word that is ambiguous in a text, e.g., resolving whether the word “bank” refers to a river bank or a financial institution. Basically, it automatically assigns the most suitable meaning to a polysemous word in a given context.
Semantic structure: Full parsing and partial parsing are two known methods for making semantic structures.
● Full Parsing: makes a full parse tree for a sentence; it sometimes fails due to poor tokenizing, errors in POS tagging, unknown words, incorrect sentence breaking, grammatical inaccuracy, and so on.
● Partial Parsing: Also known as word chunking, it makes syntactic
constructs such as Noun Phrases and Verb Groups.
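As an illustration of the pre-processing steps just described (tokenization, filtering of stop words, stemming, lemmatization, and POS tagging), here is a minimal Python sketch using the NLTK library. The example sentence and the choice of NLTK are illustrative assumptions rather than part of the original text, and the required NLTK corpora (punkt, stopwords, wordnet, averaged_perceptron_tagger) must be downloaded once before running.

```python
# Minimal text pre-processing sketch using NLTK (illustrative; assumes the
# corpora 'punkt', 'stopwords', 'wordnet' and 'averaged_perceptron_tagger'
# have already been fetched with nltk.download()).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

raw_text = "The cars are going to the repair shops, and the banks were closed."

# Tokenization: split the sentence into word tokens.
tokens = nltk.word_tokenize(raw_text.lower())

# Filtering: remove stop words and non-alphabetic tokens.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming: reduce each word to its stem (e.g., "going" -> "go").
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]

# Lemmatization: map each word to its base form, here with a verb POS hint.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in filtered]

# Part-of-speech tagging: assign a word class to each token.
pos_tags = nltk.pos_tag(filtered)

print("Filtered:", filtered)
print("Stems:   ", stems)
print("Lemmas:  ", lemmas)
print("POS tags:", pos_tags)
```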
Text Transformation
After the process of feature selection, text transformation conducts feature generation. Feature generation represents documents by the words they contain and their occurrences, where the order of the words is not significant. Here, feature selection is the process of choosing the subset of significant features that are used in creating a model. It reduces the dimensionality by excluding redundant and unnecessary features.
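To make the idea of feature generation concrete, the following sketch builds a word-based feature matrix with TF-IDF weighting using scikit-learn. The three example documents and the use of scikit-learn are illustrative assumptions only.

```python
# Feature generation sketch: represent documents by the words they contain,
# weighted by TF-IDF (illustrative; assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "text mining extracts useful patterns from text databases",
    "data mining discovers hidden knowledge in large databases",
    "web mining applies data mining techniques to web documents",
]

# stop_words="english" performs simple filtering; max_features limits
# the dimensionality of the generated feature space.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20)
feature_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Document-term matrix shape:", feature_matrix.shape)
print(feature_matrix.toarray().round(2))
```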

Methods Used in Text Mining
Various techniques have been developed to solve text mining problems; they basically retrieve the relevant information according to the requirements of users. Building on information retrieval techniques, some common methods are the following:
Term Based Method

A term is a word that has a well-defined meaning in a document. Under the term based method, the document is analyzed on the basis of terms, which takes advantage of efficient computational performance and of well-developed theories for term weighting. Over time, with the association of information retrieval and machine learning, term based techniques have developed further. The method has some disadvantages as well, such as:
● Polysemy: means a word having multiple meanings, and
● Synonymy: means multiple words have the same meaning.
This confuses users trying to understand the meaning of a document. Information retrieval techniques provide many term-based methods to resolve such ambiguity.

Phrase Based Method

Phrases carry more semantic information and are less ambiguous than individual terms. In this method, documents are analyzed on the basis of phrases, as phrases are less doubtful and more discriminative than individual terms. Some causes that impede performance are:
● inferior statistical properties compared with terms,
● limited (low-frequency) occurrence, and
● the presence of redundant and noisy phrases.
Concept Based Method
Under this method, terms (words) are analyzed at the sentence and document level; such text mining techniques are based on the statistical analysis of words and phrases. Here, the analysis considers the significance of a word within its sentence and document context.
It often happens that two terms have the same frequency in the same document, but one term contributes more meaning than the other. Therefore, concept based text mining was introduced in order to capture the semantics of text.
The concept based mining model contains three components:
1. First component: It analyzes the semantic structure of sentences.

2. Second component: It builds a conceptual ontological graph (COG) that describes the semantic structure.
3. Third component: It extracts the major concepts on the basis of the first two components in order to construct feature vectors using the standard vector space model.
Having the ability to distinguish unnecessary terms from meaningful terms, this model captures meaningful sentences and relies on NLP methods.
Pattern Taxonomy Method
The pattern based model performs better than other pure data mining-based methods.
Under this method, documents are examined on the basis of patterns, where patterns are organized into a taxonomy by applying a relation. Patterns can be identified by employing data mining techniques including association rule mining, frequent itemset mining, and sequential and closed pattern mining.
In text mining, although this discovered knowledge is important, it can also be inefficient: some useful long patterns with high selectivity lack sufficient support (the low-frequency problem), while many short patterns are ambiguous and misinterpreted, which leads to ineffective performance.
As a consequence, an effective pattern discovery process is required to overcome the low-frequency and misinterpretation problems of text mining. The pattern taxonomy method employs two procedures, “pattern deploying” and “pattern evolving”, to refine the discovered patterns.

Application Area of Text Mining


1. Digital Library
Various text mining strategies and tools are used to extract patterns and trends from journals and proceedings stored in text database repositories. These information resources help researchers in their fields of study. Libraries are a good resource of text data in digital form. Text mining gives a novel technique for obtaining useful data in such a way that it becomes possible to access millions of records online. The Greenstone international digital library, which supports numerous languages and multilingual interfaces, gives a flexible method for extracting documents in various formats, i.e. Microsoft Word, PDF, PostScript, HTML, scripting languages, and e-mail. It additionally supports the extraction of audiovisual and image formats along with text documents. Text mining processes perform different activities such as document collection, selection, enrichment, extracting data, handling entities, and producing summarizations. There are different digital library text mining tools, namely GATE, NetOwl, and Aylien, which are used for text mining.

2. Academic and Research Field
In the education field, different text mining tools and strategies are utilized to examine the educational trends in a specific region or research field. The main purpose of using text mining in the research field is to help discover and organize research papers and relevant material from various fields on one platform. For this, k-means clustering and other strategies help to distinguish the properties of significant data. Also, student performance in various subjects can be assessed, and this mining can evaluate how various qualities influence the selection of subjects.

3. Life Science
Life science and health care industries are producing an enormous volume of textual and numerical data regarding patient records, sicknesses, medicines, symptoms, and treatments of diseases. It is a major challenge to filter relevant data and text to make decisions from a biological data repository. Clinical records contain variable data which is unpredictable and lengthy. Text mining can help to manage such kinds of data. Text mining is also used in biomarker discovery, the pharmaceutical industry, clinical trial analysis, clinical studies, and patent competitive intelligence.

4. Social Media
Text mining is available for analyzing web-based media applications to monitor and investigate online content such as plain text from internet news, web journals, e-mails, blogs, etc. Text mining tools help to identify and investigate the number of posts, likes, and followers on a social media network. This kind of analysis shows how individuals react to different posts and news and how the reaction spreads. It also shows the behavior of people who belong to a specific age group and the variation in likes and views about the same post.

5. Business Intelligence
Text mining plays an important role in business intelligence, helping different organizations and enterprises to analyze their customers and competitors and make better decisions. It gives an accurate understanding of the business and provides information on how to improve customer satisfaction and gain competitive advantage. Text mining tools such as IBM Text Analytics are used for this purpose.

Issues in Text Mining


Numerous issues arise during the text mining process:
1. Ensuring the efficiency and effectiveness of decision making based on the mined results.
2. Uncertainty can arise at intermediate stages of text mining. In the pre-processing stage, different rules and guidelines are defined to normalize the text, which makes the text mining process efficient. Prior to applying pattern analysis on the documents, the unstructured data needs to be converted into an intermediate structured form.
3. Sometimes the original message or meaning can be changed due to this alteration.
4. Another issue in text mining is that many algorithms and techniques must support multi-language text, which may create ambiguity in text meaning. This problem can lead to false-positive results.
5. The use of synonyms, polysemy, and antonyms in document text creates issues for text mining tools that treat them in the same way. It is difficult to categorize such kinds of text or words.

10.2 TEXT DATA ANALYSIS AND INFORMATION RETRIEVAL
Text data mining can be described as the process of extracting essential data from natural language text. All the data that we generate via text messages, documents, emails, and files is written in natural language text. Text mining is primarily used to draw useful insights or patterns from such data.

The text mining market has experienced exponential growth and adoption over the last few years and is also expected to see significant growth and adoption in the coming future. One of the primary reasons behind the adoption of text mining is higher competition in the business market; many organizations are seeking value-added solutions to compete with other organizations. With increasing competition in business and changing customer perspectives, organizations are making huge investments to find a solution that is capable of analyzing customer and competitor data to improve competitiveness. The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more. The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyze with the help of people alone. This challenge, combined with the exponential growth in data generation, has led to the growth of analytical tools. Text mining is not only able to handle large volumes of text data but also helps in decision-making purposes. Text mining software empowers a user to draw useful information from a huge set of available data sources.

Areas of text mining in data mining:


The following are the areas of text mining:

Information Extraction:
The automatic extraction of structured data such as entities, entity relationships, and attributes describing entities from an unstructured source is called information extraction.

Natural Language Processing:


NLP stands for Natural Language Processing. It enables computer software to understand human language as it is spoken or written. NLP is primarily a component of artificial intelligence (AI). The development of NLP applications is difficult because computers generally expect humans to "speak" to them in a programming language that is accurate, clear, and highly structured. Human speech, however, is often imprecise; it can depend on many complex variables, including slang, social context, and regional dialects.

Data Mining:
Data mining refers to the extraction of useful data, hidden patterns from
large data sets. Data mining tools can predict behaviors and future trends
that allow businesses to make better data-driven decisions. Data mining tools can be used to resolve many business problems that have traditionally been too time-consuming to solve manually.

Information Retrieval:
Information retrieval deals with retrieving useful data from data that is stored in our systems. As an analogy, we can view the search engines on websites such as e-commerce sites, or any other sites, as part of information retrieval.
Text Mining Process:
The text mining process incorporates the following steps to extract the
data from the document.

Text transformation
A text transformation is a technique used to normalize the text, for example by controlling its capitalization.
The two major ways of document representation are given below (a minimal sketch follows the list):
● Bag of words
● Vector Space
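A minimal sketch of the bag-of-words and vector space representations mentioned above, using only Python's standard library; the example documents are illustrative assumptions. Each document becomes a dictionary of word counts, and the vector space view simply lines these counts up against a shared vocabulary.

```python
# Bag-of-words sketch using only the standard library (illustrative example).
from collections import Counter

documents = [
    "web content includes text images audio and video",
    "text content is the most common web content",
]

# Bag of words: each document is reduced to word counts; order is ignored.
bags = [Counter(doc.split()) for doc in documents]

# Vector space: align the counts against a shared, sorted vocabulary.
vocabulary = sorted(set(word for bag in bags for word in bag))
vectors = [[bag.get(word, 0) for word in vocabulary] for bag in bags]

print("Vocabulary:", vocabulary)
for vec in vectors:
    print(vec)
```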

Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining,
Natural Language Processing (NLP), and information retrieval(IR). In the
field of text mining, data pre-processing is used for extracting useful
information and knowledge from unstructured text data. Information
Retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfil the user's need.

Feature selection:
Feature selection is a significant part of data mining. Feature selection can
be defined as the process of reducing the input of processing or finding the
essential information sources. The feature selection is also called variable
selection.

Data Mining:
Now, in this step, the text mining procedure merges with the conventional process. Classic data mining procedures are applied to the resulting structured database.

Evaluate:
Afterward, the results are evaluated. Once the result has been evaluated and used, it is discarded.
Applications:
The following are text mining applications:

Risk Management:
Risk Management is a systematic and logical procedure of analyzing, identifying, treating, and monitoring the risks involved in any action or process in an organization. Insufficient risk analysis is usually a leading cause of failure. This is particularly true in financial organizations, where the adoption of risk management software based on text mining technology can effectively enhance the ability to reduce risk. It enables the administration of millions of sources and petabytes of text documents, and gives the ability to connect the data, helping to access the appropriate data at the right time.

Customer Care Service:


Text mining methods, particularly NLP, are finding increasing significance in the field of customer care. Organizations are investing in text analytics software to improve their overall customer experience by accessing textual data from different sources such as customer feedback, surveys, customer calls, etc. The primary objective of text analysis is to reduce the response time of the organization and help address the complaints of the customer rapidly and productively.

Business Intelligence:
Companies and business firms have started to use text mining strategies as a major aspect of their business intelligence. Besides providing significant insights into customer behavior and trends, text mining strategies also help organizations to analyze the strengths and weaknesses of their competitors, giving them a competitive advantage in the market.

Social Media Analysis:


Social media analysis helps to track online data, and there are numerous text mining tools designed particularly for performance analysis of social media sites. These tools help to monitor and interpret the text generated via the internet from news, emails, blogs, etc. Text mining tools can precisely analyze the total number of posts, followers, and likes of your brand on a social media platform, which enables you to understand the response of the individuals who are interacting with your brand and content.

Text Mining Approaches in Data Mining:


The following text mining approaches are used in data mining.

1. Keyword-based Association Analysis:
This approach collects sets of keywords or terms that often occur together and then discovers the association relationships among them. First, it pre-processes the text data by parsing, stemming, removing stop words, etc. Once the data is pre-processed, it applies association mining algorithms. Here, human effort is not required, so the number of unwanted results and the execution time are reduced.
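The following sketch illustrates the idea of keyword-based association analysis with plain Python: after simple pre-processing, it counts keyword pairs that co-occur in the same document and reports their support and confidence. The documents, the stop-word list, and the thresholds are illustrative assumptions.

```python
# Keyword co-occurrence / association sketch (illustrative, standard library only).
from itertools import combinations
from collections import Counter

documents = [
    "text mining extracts patterns from text databases",
    "data mining discovers patterns in large databases",
    "web mining applies mining techniques to web documents",
]
stop_words = {"from", "in", "to", "the"}  # toy stop-word list

# Pre-process: lowercase, split, remove stop words, keep unique keywords per doc.
keyword_sets = [
    set(w for w in doc.lower().split() if w not in stop_words) for doc in documents
]

# Count single keywords and co-occurring keyword pairs across documents.
item_counts = Counter(w for kws in keyword_sets for w in kws)
pair_counts = Counter(
    pair for kws in keyword_sets for pair in combinations(sorted(kws), 2)
)

n_docs = len(documents)
for (a, b), count in pair_counts.items():
    support = count / n_docs
    confidence = count / item_counts[a]          # confidence of the rule a -> b
    if support >= 0.5:                           # report only frequent pairs
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```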

2. Document Classification Analysis:


Automatic document classification:
This analysis is used for the automatic classification of the huge number of online text documents such as web pages, emails, etc. Text document classification differs from the classification of relational data, as document databases are not organized according to attribute-value pairs.

Numericizing text:
Stemming algorithms
A significant pre-processing step before the indexing of input documents starts with the stemming of words. The term "stemming" can be defined as the reduction of words to their roots, so that, for example, different grammatical forms of a word are recognized as the same. The primary purpose of stemming is to ensure that similar words are treated alike by the text mining program.
Support for different languages:
There are some highly language-dependent operations such as stemming, synonym handling, and determining the letters that are allowed in words. Therefore, support for various languages is important.

Exclude certain characters:

Excluding numbers, specific characters, sequences of characters, or words that are shorter or longer than a specific number of letters can be done before the indexing of the input documents.

Include lists, exclude lists (stop-words):


A particular list of words to be indexed can be defined, which is useful when we want to search for specific words and classify the input documents based on the frequencies with which those words occur. Additionally, "stop words," meaning terms that are to be rejected from the indexing, can be defined. Normally, a default list of English stop words includes "the," "a," "since," etc. These words are used very often in the respective language but communicate very little information in the document.

10.2.1 Information retrieval
Information retrieval (IR) is a field that has been developing in parallel
with database systems for many years. Unlike the field of database
systems, which has targeted query and transaction processing of structured
data, information retrieval is concerned with the organization and retrieval
of data from multiple text-based documents.
Since information retrieval and database systems each handle different
kinds of data, some database system problems are usually not present in
information retrieval systems, such as concurrency control, recovery,
transaction management, and update. There are some common information
retrieval problems that are usually not encountered in traditional database
systems, such as unstructured documents, approximate search based on
keywords, and the notion of relevance.
Because of the abundance of text data, information retrieval has found several applications. There exist several information retrieval systems, including online library catalog systems, online records management systems, and the more recently developed Web search engines.
A general data retrieval problem is to locate relevant documents in a
document set depending on a user’s query, which is often some keywords
defining an information need, although it can also be an example of
relevant records.
This is most suitable when a user has an ad hoc (i.e., short-term) information need, such as finding information to buy a used car. When a user has a long-term information need (e.g., a researcher’s interests), a retrieval system can also take the initiative to “push” any newly arrived data element to the user if the element is judged as being relevant to the user’s information need.
There are two basic measures for assessing the quality of text retrieval
which are as follows −
Precision − This is the percentage of retrieved data that are actually
relevant to the query (i.e., “correct” responses). It is formally represented
as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall − This is the percentage of records that are relevant to the query
and were actually retrieved. It is formally represented as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
An information retrieval system is often required to trade off recall for precision or vice versa. One commonly used trade-off is the F-score, which is defined as the harmonic mean of recall and precision −
F-score = (recall × precision) / ((recall + precision) / 2) = (2 × precision × recall) / (precision + recall)
The harmonic mean penalizes a system that sacrifices one measure for the other too heavily. Precision, recall, and F-score are the basic measures of a retrieved collection of records. These three measures are not directly useful for comparing two ranked lists of files because they are not sensitive to the internal ranking of the documents in a retrieved set.
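A small sketch of how the precision, recall, and F-score defined above can be computed from the sets of relevant and retrieved documents; the document identifiers are illustrative assumptions.

```python
# Precision, recall and F-score for a retrieved set (illustrative example).
relevant = {"d1", "d2", "d3", "d4"}      # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}           # documents returned by the system

hits = relevant & retrieved              # {Relevant} ∩ {Retrieved}

precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved
f_score = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f}, recall={recall:.2f}, F-score={f_score:.2f}")
```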

10.3 TEXT RETRIEVAL METHODS


Text retrieval is the process of transforming unstructured text into a
structured format to identify meaningful patterns and new insights. By
using advanced analytical techniques, including Naïve Bayes, Support
Vector Machines (SVM), and other deep learning algorithms,
organizations are able to explore and find hidden relationships inside their
unstructured data. There are two methods of text retrieval which are as
follows −
Document Selection − In document selection methods, the query is
regarded as defining constraint for choosing relevant documents. A
general approach of this category is the Boolean retrieval model, in which
a document is defined by a set of keywords and a user provides a Boolean
expression of keywords, such as car and repair shops, tea or coffee, or
database systems but not Oracle.
The retrieval system can take such a Boolean query and return records that
satisfy the Boolean expression. Because of the difficulty of specifying a user’s information need exactly with a Boolean query, Boolean retrieval techniques usually only work well when the user knows a lot about the document set and can formulate a good query in this way.
Document ranking − Document ranking methods use the query to rank all records in order of relevance. For ordinary users and exploratory queries, these techniques are more suitable than document selection methods. Most current information retrieval systems present a ranked list of documents in response to a user’s keyword query.
There are several ranking methods based on a broad spectrum of mathematical foundations, such as algebra, logic, probability, and statistics. The common intuition behind all of these techniques is that they connect the keywords in a query with those in the records and score each record depending on how well it matches the query.
The objective is to approximate the degree of relevance of a record with a score computed from information such as the frequency of words in the document and in the whole collection. It is inherently difficult to provide a precise measure of the degree of relevance between a set of keywords. For example, it is difficult to quantify the distance between the concepts of data mining and data analysis.

The most popular approach of this method is the vector space model. The basic idea of the vector space model is the following: It can represent a
document and a query both as vectors in a high-dimensional space
corresponding to all the keywords and use an appropriate similarity
measure to evaluate the similarity among the query vector and the record
vector. The similarity values can then be used for ranking documents.
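A minimal sketch of the vector space model for document ranking: documents and the query are represented as TF-IDF vectors and scored by cosine similarity. The documents, the query, and the use of scikit-learn are illustrative assumptions.

```python
# Vector space model ranking sketch (illustrative; assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "used car sales and repair shops",
    "database systems and transaction processing",
    "buying a used car: prices and reviews",
]
query = "buy used car"

# Represent documents and the query as vectors in the same keyword space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Score each document by cosine similarity to the query and rank them.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"score={scores[idx]:.2f}  {documents[idx]}")
```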

10.4 DIMENSIONALITY REDUCTION FOR TEXT


What is Dimensionality Reduction?
In machine learning classification problems, there are often too many
factors on the basis of which the final classification is done. These factors
are basically variables called features. The higher the number of features,
the harder it gets to visualize the training set and then work on it.
Sometimes, most of these features are correlated, and hence redundant.
This is where dimensionality reduction algorithms come into play.
Dimensionality reduction is the process of reducing the number of random
variables under consideration, by obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning
and Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed
through a simple e-mail classification problem, where we need to classify
whether the e-mail is spam or not. This can involve a large number of
features, such as whether or not the e-mail has a generic title, the content
of the e-mail, whether the e-mail uses a template, etc. However, some of
these features may overlap. In another condition, a classification problem
that relies on both humidity and rainfall can be collapsed into just one
underlying feature, since both of the aforementioned are correlated to a
high degree. Hence, we can reduce the number of features in such
problems. A 3-D classification problem can be hard to visualize, whereas
a 2-D one can be mapped to a simple 2 dimensional space, and a 1-D
problem to a simple line. The below figure illustrates this concept, where a
3-D feature space is split into two 1-D feature spaces, and later, if found to
be correlated, the number of features can be reduced even further.


Components of Dimensionality Reduction


There are two components of dimensionality reduction:
● Feature selection: In this, we try to find a subset of the original set
of variables, or features, to get a smaller subset which can be used to
model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
● Feature extraction: This reduces the data in a high dimensional space to a lower dimension space, i.e., a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
● Principal Component Analysis (PCA)
● Linear Discriminant Analysis (LDA)
● Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear or non-linear, depending
upon the method used. The prime linear method, called Principal
Component Analysis, or PCA, is discussed below.

Principal Component Analysis


This method was introduced by Karl Pearson. It works on a condition that
while the data in a higher dimensional space is mapped to data in a lower
dimension space, the variance of the data in the lower dimensional space should be maximum.

It involves the following steps:


● Construct the covariance matrix of the data.
● Compute the eigenvectors of this matrix.
● Eigenvectors corresponding to the largest eigenvalues are used to
reconstruct a large fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might
have been some data loss in the process. But, the most important variances
should be retained by the remaining eigenvectors.
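The steps listed above can be sketched directly with NumPy: build the covariance matrix, compute its eigenvectors, and project the data onto the eigenvectors with the largest eigenvalues. The toy data and the choice of two retained components are illustrative assumptions.

```python
# PCA sketch following the steps above (illustrative; uses NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)          # center the data

# Step 1: construct the covariance matrix of the data.
cov = np.cov(X_centered, rowvar=False)

# Step 2: compute eigenvalues and eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: keep the eigenvectors with the largest eigenvalues (here, the top 2)
# and project the data onto them to obtain the reduced representation.
order = np.argsort(eigenvalues)[::-1]
top_components = eigenvectors[:, order[:2]]
X_reduced = X_centered @ top_components

print("Explained variance ratio:",
      (eigenvalues[order[:2]] / eigenvalues.sum()).round(3))
print("Reduced shape:", X_reduced.shape)
```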
Advantages of Dimensionality Reduction
● It helps in data compression, and hence reduced storage space.
● It reduces computation time.
● It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
● It may lead to some amount of data loss.
● PCA tends to find linear correlations between variables, which is
sometimes undesirable.
● PCA fails in cases where mean and covariance are not enough to
define datasets.
● We may not know how many principal components to keep; in practice, some rules of thumb are applied.



11
WEB MINING
Unit Structure
11.1 Web Mining
11.2 Web content
11.3 Web structure
11.4 Web usage

11.1 WEB MINING


Web Mining is the process of applying Data Mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
Web mining can widely be viewed as the application of adapted data mining methods to the web, whereas data mining is the application of algorithms to find patterns in mostly structured data embedded in a knowledge discovery process.

There are various applications of web mining which are as follows −


● Web mining is used to discover how users navigate a website and the
results can help in improving the site design and making it more
visible on the web.
● In Customer Relationship Management (CRM), Web mining is the unification of data gathered by traditional data mining approaches and techniques with data gathered over the World Wide Web. Web mining can learn user behavior, compute the effectiveness of a specific Web site, and help quantify the success of a marketing campaign.
● The popularity of digital images is quickly increasing because of improving digital imaging technologies and the convenient availability supported by the web. However, finding customer-intended images from the web is non-trivial. The main reason is that web images are generally not annotated using semantic descriptors. To fetch such web images from the internet, web mining is utilized.
● Web mining is used for keyphrase extraction. Keyphrases are beneficial for several purposes, such as summarizing, indexing, labeling, categorizing, clustering, featuring, scanning, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the large number of files that do not have manually assigned keyphrases.
● Web mining is used for social network analysis. A social network is the study of social entities (persons in an organization, known as actors), and their connections and relationships.
● Social network analysis is helpful for the Web because the Web is
significantly a virtual society, and therefore a virtual social web, where
every page can be regarded as a social actor and every hyperlink as a
relationship. Many of the results from social networks can be adapted
and extended for use in the Web context. The ideas from social
network analysis are indeed instrumental to the success of Web search
engines.
Web mining can be broadly divided into three different types of
techniques of mining: Web Content Mining, Web Structure Mining, and
Web Usage Mining. These are explained as following below.

1. Web Content Mining:


Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data – text, image, audio, video, etc. Content data is the collection of facts that a web page is designed to convey. It can provide effective and interesting patterns about user needs. Text documents are related to text mining, machine learning and natural language processing, so this mining is also known as text mining. This type of mining performs scanning and mining of the text, images and groups of web pages according to the content of the input.

2. Web Structure Mining:


Web structure mining is the application of discovering structure information from the web. The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining basically shows the structured summary of a particular website. It identifies relationships between web pages linked by information or direct link connections. Web structure mining can be very useful for determining the connection between two commercial websites.

3. Web Usage Mining:


Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets. These patterns enable you to understand user behavior. In web usage mining, users access data on the web, and this access is collected in the form of logs. So, web usage mining is also called log mining.

Comparison Between Data Mining and Web Mining:

● Definition: Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system, whereas Web Mining is the process of applying data mining techniques to automatically discover and extract information from web documents.
● Application: Data Mining is very useful for web page analysis, whereas Web Mining is very useful for a particular website and e-services.
● Target Users: Data Mining is used by data scientists and data engineers, whereas Web Mining is used by data scientists along with data analysts.
● Access: Data Mining accesses data privately, whereas Web Mining accesses data publicly.
● Structure: In Data Mining, the information is obtained from an explicit structure, whereas in Web Mining, the information is obtained from structured, unstructured and semi-structured web pages.
● Problem Type: Data Mining covers clustering, classification, regression, prediction, optimization and control, whereas Web Mining covers web content mining and web structure mining.
● Tools: Data Mining includes tools like machine learning algorithms, whereas special tools for Web Mining are Scrapy, PageRank and Apache logs.
● Skills: Data Mining includes approaches for data cleansing, machine learning algorithms, statistics and probability, whereas Web Mining includes application-level knowledge and data engineering with mathematical modules like statistics and probability.

11.2 WEB CONTENT


Web content refers to the textual, aural, or visual content published on a
website. Content means any creative element, for example, text,
applications, images, archived e-mail messages, data, e-services, audio
and video files, and so on.
Web content is the key behind traffic generation to websites. Creating
engaging content and organizing it into various categories for easy
navigation is most important for a successful website. Also, it is important
to optimize the web content for search engines so that it responds to the
keywords used for searching.

There are two basic kinds of web content:


Text: Text is simple. It is added on the webpage as text blocks or within
images. The best written content is unique textual web content that is free
from plagiarism. Web content added as text can also include good internal
links that help readers gain access to more information.
Multimedia: Another kind of web content is multimedia. Simply put,
multimedia refers to any content which is not text; some examples
include:
1. Animations: Animations can be added with the help of Flash, Ajax,
GIF images as well as other animation tools.
2. Images: Images are considered the most popular option to incorporate
multimedia to websites. Clip art, photos, or even drawings can be
created by means of a scanner or a graphics editor. It is recommended
to optimize the images so that the users can download them quickly.
3. Audio: Different types of audio files can be added as part of the web
content so as to increase the desirability of the website.
4. Video: It is the most popular type of multimedia content; however, when adding video files, publishers should make sure that they play efficiently on various browsers.
Web content management (WCM) is essential to run a website
successfully. To manage web content, publishers should organize content
in line with the requirements of the audience.

This includes usage of common content, terminology, and positioning;
consistent navigation; link management; and finally, metadata application.
There are a wide range of WCM tools available for effectively handling
web content.

Examples of effective website content


Expert web content development produces pages that your audience will
find helpful, informative, unique, and entertaining. Done right, it
encourages visitors to stay longer on a site to explore and learn more or to
bookmark pages to return to again later.
In addition, original and high-quality text-based content that is optimized
for SEO can help your site rank well in web searches, helping customers
and prospects find you.
Here are 7 types of high-quality web content that can please website
visitors and keep them coming back for more.

1. Blogs
Blogging is an invaluable tool for driving visitors to your website, and
building awareness about you and your brand.
Generally written from a more personal and informal point of view than
content assets, a blog is a great way to connect with readers. It is the
perfect vehicle for providing them with information that not only answers
a question or solves a problem, but also helps to establish you as a trusted
authority on the topic.
Blogs are also a great way to keep your web content fresh, enabling you to
post new content on a regular basis and helping you continue to rank in
SERPs (search results).

2. Content assets
This broad category of web content includes collateral and similar
resources you have already invested in and can now repurpose to help
draw visitors to your website.
Some examples are product brochures, user manuals, slide presentations,
white papers, industry reports, case studies, fact sheets, ebooks, webinars,
and podcasts.
The goal is to extend the value of these assets by using them across
different digital media and channels. The content can be broken up into
smaller pieces and distributed in new ways, such as via blog posts, tweets,
video clips, email blasts, search engine ads, and other channels.

3. Calls to action
A call to action (CTA) is a prompt designed to get your website visitor to
take some immediate action, such as make a purchase or get more
information.
In addition to having CTAs on your web pages, you can include them in other marketing content you use to drive traffic to your website, such as blogs, emails, social media posts, and e-newsletters.
blogs, emails, social media posts, and e-newsletters.
Some common prompts:
● Apply today
● Book now
● Contact us
● Download for free
● Get a quote
● Join today
● Learn more
● Order now
● Register today
● Shop online and save
A CTA may take your web visitor to a landing page for further action.
Whatever your CTA is, it is important that the intent is clear and your
audience has a good idea what to expect. After all, you don’t want to lose visitors by having them click on a link that takes them somewhere they really don’t want to go.

4. Landing pages
Landing pages are destinations — the web pages where visitors are sent
when they click on a hyperlink, such as a search engine result, a social
media ad, a CTA, or a special offer on your website.
These pages are designed to help you convert website visitors into leads
by providing a way to capture their contact information.
For example, suppose you want to build your authority as an SME by
offering a free white paper to your website visitors. When they click on
the offer link, it can take them to a landing page where the content of the white paper is described in more detail and they can download the paper by submitting an email address.

5. Testimonials
One of the best ways to appeal to prospects and build credibility is with
relatable success stories from their peers. That is what makes customer
testimonials such valuable web content.
Whether your goal is to create formal case studies, include real-life
customer scenarios in a white paper, or post short video clips on Twitter or
Facebook, having a process in place to identify happy customers and
capture their feedback is a great idea.
TIP: Don’t hide all your valuable customer feedback on one testimonials
page. Include testimonials throughout your site to serve as social proof
that validates your claims.

6. Video & audio content


With the ability to embed video and audio clips so that anyone can view
and listen without leaving the webpage, digitally recorded media are
increasingly popular web content tools. It is a great way to offer content
such as how-tos, webinars, podcasts, and seminars.

7. Visual content
According to the Social Science Research Network, 65% of people are
visual learners. So, it makes good sense to incorporate visual web content
into your website.
In addition to having a graphic design that helps to convey the flavor and
purpose of your brand, you can:
● Use images — preferably original ones — to break up and enhance the
text
● Create videos to entertain and inform
● Reiterate key information in a concise way through infographics
● Create your own memes to make important messages more memorable
● Offer presentations for visitors who want details in a more graphic,
bulleted format
● Include screenshots to clearly show things that may be difficult to
explain in words

11.3 WEB STRUCTURE


Web structure mining is a tool that can recognize the relationship between
web pages linked by data or direct link connection. This structured data is
discoverable by the provision of web structure schema through database
techniques for Web pages.
This connection enables a search engine to pull data associated with a search query directly from the connecting Web page of the website the content rests upon. This is accomplished by spiders scanning the websites, fetching the home page, and then connecting the data through reference links to bring forth the specific page containing the desired information.
Web mining can widely be viewed as the application of adapted data
mining methods to the web, whereas data mining is represented as the
application of the algorithm to find patterns on mostly structured data fixed into a knowledge discovery process.
Web mining has a distinctive property to support a collection of multiple
data types. The web has several aspects that yield multiple approaches for
the mining process, such as web pages including text, web pages are
connected via hyperlinks, and user activity can be monitored via web
server logs.
Structure mining helps to minimize two main problems of the World Wide Web that arise from its large amount of data. The first problem is irrelevant search results: the relevance of search information becomes misconstrued because search engines often allow only low-precision criteria. The second problem is the inability to index the large amount of data available on the Web, which results in low recall with content mining. This minimization is achieved in part by finding the model underlying the Web hyperlink structure with the help of Web structure mining.
The objective of structure mining is to extract previously unknown relationships among web pages. This form of structure mining allows a business to connect the data of its website to support navigation and to cluster the data into site maps. This enables its users to retrieve the desired data through keyword relations and content mining. The hyperlink hierarchy is also traced to relate the data within the site to competitor links and to connections through search engines and third-party co-links. This allows clustering of linked Web pages to establish the relationship between these pages.
On the World Wide Web, the use of structure mining allows the identification of Web pages with similar architecture by clustering on their basic structure. This information can be used to establish the similarity of web content. The known similarities then make it possible to improve the data of a site so that web spiders access it at a higher rate. The higher the number of Web crawlers visiting a site, the more advantageous it is to the site, because its content becomes related to more searches.
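Hyperlink analysis of the kind described here is often illustrated with PageRank (listed earlier in this chapter among web mining tools). Below is a minimal power-iteration sketch on a tiny hypothetical link graph; the graph, the damping factor, and the iteration count are illustrative assumptions.

```python
# PageRank-style power iteration on a toy link graph (illustrative example).
# links[p] is the list of pages that page p links to.
links = {
    "home": ["products", "blog"],
    "products": ["home"],
    "blog": ["home", "products"],
    "contact": ["home"],
}

pages = list(links)
n = len(pages)
damping = 0.85
rank = {p: 1.0 / n for p in pages}       # start with a uniform rank

for _ in range(50):                      # power iteration
    new_rank = {p: (1.0 - damping) / n for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# Pages with more (and better-ranked) incoming links end up with higher scores.
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```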
The ideal website structure can be looked at like a pyramid. It consists of a
home page, categories, subcategories, and individual posts and pages.


The ideal website structure looks like a pyramid, starting with the home
page at the top, then categories, subcategories, and individual posts and
pages.
● Home page – The home page is at the top of the pyramid. It acts as
a hub for the visitors of your site. Designers should link to critical or
popular pages from the home page. In doing so, designers will be able to
more easily guide users to the most important pages.
● Categories – Categorization is a valuable part of a website’s
structure. Designers can help users make decisions faster and easier with
good categorization. Designers can use categories to reduce the amount of
time spent considering a decision.
● Subcategories – These play a major role in defining a website’s
structure. For example, online marketplaces like eBay and Amazon have a
nearly unfathomable number of pages. It would be easy for a user to get
lost in the information provided. Subcategories provide a structured
methodology for browsing and categorizing information in a meaningful
manner, especially for websites with complex data.
● Individual posts and pages – Individual posts and pages are the
basic elements of a website. Designers should focus on how to create a
meaningful information hierarchy within every page, so the user has less
to consider when it comes to consuming content.

Types of website structures


There are four main types of website structures. Having a proper
understanding of website structures makes it easier for designers to create
a meaningful website information architecture. Let’s look into them one
by one.

Hierarchical model

The hierarchical model is used in web applications that contain a large amount of data.
The hierarchical model is one of the most common types of site
architecture. The hierarchical model is often used in web applications that
contain a large amount of data. The hierarchical model is similar to a tree
in that it has a trunk (like a homepage) that branches out into categories
and pages.
Sequential model

The sequential model can be used to develop flows for a process.


Sequential models are popular when leading users through a sequence like
onboarding or new account creation when the user is taken through the
process step-by-step. UX designers can use this model to create flows for
a process.

Matrix model

The matrix model of a web structure lets users choose where they want to
go next.
The matrix model is one of the oldest site structure types on the internet.
This model is unique and non-traditional in its behavior. A matrix-type
structure gives users options to choose where they want to go next. These
types of sites are best navigated via search or internal links.
Database model

The database model of a web structure determines the logical structure of a database.
A database model is a dynamic approach to the website structure. To build
a website structure like this, designers should think about the bottom-up
approach by considering a page’s metadata and adhering to strong information architecture and taxonomic best practices.

Why you should start with the site structure


By considering the user’s needs first when beginning a design, UX
designers can create a website structure that helps the user rather than
standing in their way. A good structure adds to usability and can help
improve the site’s overall user experience. Plainly put, a website’s
structure helps the designer create delightful user experiences through
improved discoverability and intuitiveness.

11.4 WEB USAGE


Web usage mining is used to derive useful data, information, and knowledge from weblog data, and helps in identifying user access patterns for web pages.
In mining the usage of web resources, one considers the data about requests from visitors of a website, which are recorded as web server logs. While the content and structure of the collection of web pages follow the intentions of the authors of the pages, the individual requests show how the users view these pages. Web usage mining can disclose relationships that were not suggested by the designer of the pages.
A web server generally registers a (Web) log entry, or weblog entry, for each access of a web page. It contains the URL requested, the IP address from which the request originated, and a timestamp.
For Web-based e-commerce servers, a large number of Web access log records are being collected. Popular websites can register weblog records in the order of thousands of megabytes each day. Weblog databases provide rich information about Web dynamics. Therefore it is essential to develop sophisticated weblog mining approaches.
In developing methods for Web usage mining, we can consider the following. First, although it is encouraging and stimulating to imagine the many applications of weblog file analysis, it is essential to understand that the success of such applications depends on what and how much true and reliable knowledge can be found in the large raw log records.
Second, with the available URL, time, IP address, and web page content data, a multidimensional view can be built on the weblog database, and multidimensional OLAP analysis can be implemented to discover the top N users, top N accessed web pages, the most frequently accessed time periods, etc., which will help identify potential customers, users, markets, etc.
Third, data mining can be implemented on weblog records to discover association patterns, sequential patterns, and trends of Web accessing. For Web access pattern mining, it is essential to take further measures to obtain more data about user traversal to facilitate accurate weblog analysis.

Such additional data can include user-browsing sequences of the web pages in the internet server buffer. With the help of such weblog documents, studies have been conducted on analyzing system performance, enhancing system design by web caching, web page prefetching, and web page swapping; understanding the nature of Web traffic; and understanding customer reaction and motivation.
For instance, some studies have proposed adaptive sites − websites that improve themselves by learning from user access patterns. Weblog analysis can also help construct customized web services for individual users.
Web mining has distinctive features to offer a set of multiple data types. The web has multiple elements that yield multiple approaches for the mining procedure, including web pages containing text, web pages linked via hyperlinks, and customer activity that can be monitored via web server logs.

There are various phases of web usage mining, which are as follows −
Preprocessing − The web usage log is not in a format that is usable by mining applications. Before the data can be used in a mining application, it may need to be reformatted and cleansed. There are some issues specifically related to the use of weblogs. The steps included in the preprocessing phase are cleansing, user identification, session identification, path completion, and formatting (a minimal parsing and sessionization sketch is given below).
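As an illustration of the cleansing and session identification steps just described, the following sketch parses web server log lines in the Common Log Format and groups requests from the same IP address into sessions using a 30-minute inactivity timeout. The sample log lines, the regular expression, and the timeout value are illustrative assumptions.

```python
# Weblog preprocessing sketch: parsing and session identification (illustrative).
import re
from datetime import datetime, timedelta

log_lines = [
    '192.168.1.5 - - [10/Mar/2021:10:01:02 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.168.1.5 - - [10/Mar/2021:10:02:10 +0000] "GET /products.html HTTP/1.1" 200 2310',
    '192.168.1.5 - - [10/Mar/2021:11:15:00 +0000] "GET /blog.html HTTP/1.1" 200 512',
]

# Common Log Format fields: IP, timestamp, request line, status, size.
pattern = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)')
timeout = timedelta(minutes=30)

sessions = {}                 # (ip, session number) -> list of requested URLs
last_seen = {}                # ip -> (last timestamp, current session number)

for line in log_lines:
    match = pattern.match(line)
    if not match:
        continue              # cleansing: skip malformed lines
    ip, ts, method, url, status, size = match.groups()
    when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")

    prev_time, session_no = last_seen.get(ip, (None, 0))
    if prev_time is None or when - prev_time > timeout:
        session_no += 1       # start a new session after 30 minutes of inactivity
    last_seen[ip] = (when, session_no)
    sessions.setdefault((ip, session_no), []).append(url)

for (ip, session_no), urls in sessions.items():
    print(ip, f"session {session_no}:", urls)
```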
Data structure − Several unique data structures have been proposed to keep track of patterns identified during the web usage mining process. A basic data structure that is used is the tree, specifically a rooted tree in which each path from the root to a leaf represents a sequence. Trees can store strings for pattern matching applications.
Types of Web Usage Mining based upon the Usage Data:
1. Web Server Data: The web server data generally includes the IP
address, browser logs, proxy server logs, user profiles, etc. The user logs
are being collected by the web server data.
2. Application Server Data: A key feature of commercial application servers is the ability to build applications on top of them. Application server data mainly consists of tracking various business events and logging them into application server logs.
3. Application-level data: There are various new kinds of events that can
be there in an application. The logging feature enabled in them helps us
get the past record of the events.
Advantages of Web Usage Mining
● Government agencies are benefited from this technology to overcome
terrorism.
● Predictive capabilities of mining tools have helped identify various
criminal activities.
● Customer relationships are better understood by the company with the aid of these mining tools, helping them to satisfy the needs of the customer faster and more efficiently.
Disadvantages of Web Usage Mining
● Privacy stands out as a major issue. Analyzing data for the benefit of customers is good, but using the same data for something else can be dangerous. Using it without the individual’s knowledge can pose a big threat to the company.
● If a data mining company does not maintain high ethical standards, two or more attributes can be combined to obtain personal information about the user, which again is not acceptable.

Some Techniques in Web Usage Mining

1. Association Rules: The most used technique in Web usage mining is Association Rules. Basically, this technique focuses on relations among the web pages that frequently appear together in users’ sessions. The pages accessed together are always put together into a single server session. Association Rules help in the reconstruction of websites using the access logs. Access logs generally contain information about requests approaching the web server. The major drawback of this technique is that producing so many sets of rules together may result in some of the rules being completely inconsequential; they may not be useful in the future either.
2. Classification: Classification maps a particular record to one of several predefined classes. The main target in web usage mining is to develop the kind of profile of users/customers that is associated with a particular class/category. For this, one needs to extract the features that are best suited for the associated class. Classification can be implemented by various algorithms, some of which include Support Vector Machines, K-Nearest Neighbors, Logistic Regression, Decision Trees, etc. For example, with a track record of customers’ purchase history over the last 6 months, customers can be classified into frequent and non-frequent classes/categories; there can also be multiple classes in other cases (a minimal sketch follows).
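A minimal sketch of the frequent vs. non-frequent customer classification described above, using a decision tree from scikit-learn on toy features (purchases and site visits in the last 6 months); the data, the implicit labeling threshold, and the model choice are illustrative assumptions.

```python
# Classification sketch: frequent vs. non-frequent customers (illustrative).
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [purchases in last 6 months, number of site visits].
X_train = [[1, 3], [0, 2], [2, 5], [9, 30], [12, 41], [7, 22]]
y_train = ["non-frequent", "non-frequent", "non-frequent",
           "frequent", "frequent", "frequent"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class of two new customers.
X_new = [[8, 25], [1, 4]]
print(model.predict(X_new))   # expected: ['frequent', 'non-frequent']
```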

3. Clustering: Clustering is a technique to group together a set of items having similar features/traits. There are mainly two types of clusters: the usage cluster and the page cluster. The clustering of pages can be readily performed based on the usage data. In usage-based clustering, items that are commonly accessed or purchased together can be automatically organized into groups. The clustering of users tends to establish groups of users exhibiting similar browsing patterns. In page clustering, the basic concept is to retrieve information quickly over the web pages (a minimal sketch follows).
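A minimal usage-based clustering sketch: users are described by how often they accessed a few pages, and k-means groups users with similar browsing patterns. The access-count matrix, the number of clusters, and the use of scikit-learn are illustrative assumptions.

```python
# Usage-based clustering sketch with k-means (illustrative; assumes scikit-learn).
from sklearn.cluster import KMeans

# Rows = users, columns = access counts for [/products, /blog, /support].
usage_matrix = [
    [12, 1, 0],
    [10, 2, 1],
    [0, 8, 9],
    [1, 7, 11],
    [11, 0, 2],
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(usage_matrix)

for user_id, label in enumerate(labels):
    print(f"user {user_id} -> cluster {label}")
```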
Applications of Web Usage Mining
1. Personalization of Web Content: The World Wide Web has a lot of information and is expanding very rapidly day by day. The big problem is that the specific needs of people are increasing on an everyday basis, and they quite often do not get the query results they need. A solution to this is web personalization. Web personalization may be defined as catering to the user’s needs based upon tracking their navigational behavior and interests. Web personalization includes recommender systems, check-box customization, etc. Recommender systems are popular and are used by many companies.

2. E-commerce: Web usage mining plays a very vital role in web-based companies, since their ultimate focus is on customer attraction, customer retention, cross-sales, etc. To build a strong relationship with the customer, it is very necessary for the web-based company to rely on web usage mining, through which it can get a lot of insights about customers’ interests. It also tells the company how to improve its web design in some aspects.
3. Prefetching and Caching: Prefetching basically means loading data before it is required, to decrease the time spent waiting for that data, hence the term ‘prefetch’. The results obtained from web usage mining can be used to produce prefetching and caching strategies, which in turn can greatly reduce the server response time and space requirements.
Pattern discovery − The most common data mining technique used on clickstream data is that of uncovering traversal patterns. A traversal pattern is a group of pages visited by a user in a session. Other types of patterns may also be uncovered by web usage mining. Patterns are found using different combinations of techniques, which are used to discover different features and for different purposes.
Pattern analysis − When patterns are discovered, they must be analyzed to determine how that information can be used. Some of the patterns may be discarded if they are determined not to be of interest.
Pattern analysis is the phase of viewing and interpreting the outcomes of the discovery activities. It is necessary not only to identify frequent types of traversal patterns, but also to identify patterns that are of interest because of their uniqueness or statistical properties.



