
Data Warehousing and Mining

Programs Offered

Post Graduate Programmes (PG)
• Master of Business Administration
• Master of Computer Applications
• Master of Commerce (Financial Management / Financial Technology)
• Master of Arts (Journalism and Mass Communication)
• Master of Arts (Economics)
• Master of Arts (Public Policy and Governance)
• Master of Social Work
• Master of Arts (English)
• Master of Science (Information Technology) (ODL)
• Master of Science (Environmental Science) (ODL)

Diploma Programmes
• Post Graduate Diploma (Management)
• Post Graduate Diploma (Logistics)
• Post Graduate Diploma (Machine Learning and Artificial Intelligence)
• Post Graduate Diploma (Data Science)

Undergraduate Programmes (UG)
• Bachelor of Business Administration
• Bachelor of Computer Applications
• Bachelor of Commerce
• Bachelor of Arts (Journalism and Mass Communication)
• Bachelor of Social Work
• Bachelor of Science (Information Technology) (ODL)
• Bachelor of Arts (General / Political Science / Economics / English / Sociology)

DIRECTORATE OF DISTANCE & ONLINE EDUCATION
Amity Helpline: 1800-102-3434 (Toll-free), 0120-4614200
For Distance Learning Programmes: [email protected] | www.amity.edu/addoe
For Online Learning Programmes: [email protected] | www.amityonline.com
© Amity University Press

All Rights Reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise
without the prior permission of the publisher.

SLM & Learning Resources Committee

Chairman : Prof. Abhinash Kumar

Members : Dr. Divya Bansal
          Dr. Coral J Barboza
          Dr. Monica Rose
          Dr. Apurva Chauhan
          Dr. Winnie Sharma

Member Secretary : Ms. Rita Naskar

Published by Amity University Press for exclusive use of Amity Directorate of Distance and Online Education,
Amity University, Noida-201313
Contents

Module - I: Data Warehouse Fundamentals
1.1 Defining the Cloud for the Enterprise
1.1.1 Database as a Service
1.1.2 Governance/Management as a Service
1.1.3 Testing as a Service
1.1.4 Storage as a Service
1.2 Cloud Service Development
1.2.1 Cloud Service Development
1.2.2 Cloud Computing Challenges
1.3 Cloud Computing Layers
1.3.1 Understand Layers of Cloud Computing
1.4 Cloud Computing Types
1.4.1 Types of Cloud Computing and Features
1.5 Cloud Computing Security Requirements, Pros and Cons, and Benefits
1.5.1 Cloud Computing Security Requirements
1.5.2 Cloud Computing - Pros, Cons and Benefits
Case Study

Module - II: Principles of Dimensional Modelling
2.1 Basics of Dimensional Modelling
2.1.1 Identify Facts and Dimensions
2.1.2 Design Fact Tables and Design Dimension Table
2.2 Understanding Schemas
2.2.1 Data Warehouse Schemas
2.3 OLAP Operations
2.3.1 Concept of OLAP, OLAP Features, Benefits
2.3.2 OLAP Operations
2.4 Data Extraction, Clean-up and Transformation
2.4.1 Data Extraction, Clean-up and Transformation
2.4.2 Concept of Schemas, Star Schemas for Multidimensional Databases
2.4.3 Snowflake and Galaxy Schemas for Multidimensional Databases
2.5 Warehouse Architecture
2.5.1 Architecture for a Warehouse
2.5.2 Steps for Construction of Data Warehouses, Data Marts and Metadata
2.6 OLAP Operations
2.6.1 OLAP Server - ROLAP
2.6.2 OLAP Server - MOLAP
2.6.3 OLAP Server - HOLAP
Case Study

Module - III: Data Mining
3.1 Understanding Data Mining
3.1.1 Understanding Concepts of Data Warehousing
3.1.2 Advancements to Data Mining
3.2 Motivation and Knowledge Discovery Process
3.2.1 Data Mining on Databases
3.2.2 Data Mining Functionalities
3.3 Data Mining Basics
3.3.1 Objectives of Data Mining and the Business Context for Data Mining
3.3.2 Data Mining Process Improvement
3.3.3 Data Mining in Marketing
3.3.4 Data Mining in CRM
3.3.5 Tools of Data Mining
Case Study

Module - IV: Data Mining Functionalities
4.1 Data Preparation and Data Mining Techniques
4.1.1 Statistical Techniques
4.2 Characterisation of Data Mining
4.2.1 Data Mining Characterisation
4.2.2 Data Mining Discrimination
4.3 Association and Market Basket Analysis
4.3.1 Mining Patterns
4.3.2 Mining Associations
4.3.3 Mining Correlations
4.4 Data Mining Classification
4.4.1 Classification
4.4.2 Prediction
4.5 Data Mining Analysis
4.5.1 Cluster Analysis
4.5.2 Outlier Analysis
Case Study

Module - V: Data Mining Applications
5.1 Text Mining, Spatial Databases and Web Mining
5.1.1 Text Mining
5.1.2 Spatial Databases
5.1.3 Web Mining
5.2 Multimedia Web Mining
5.2.1 Multidimensional Analysis of Multimedia Data
5.2.2 Applications in Telecommunications Industry
5.2.3 Applications in Retail Marketing
5.2.4 Applications in Target Marketing
5.3 Applications in Industry
5.3.1 Mining in Fraud Protection
5.3.2 Mining in Healthcare
5.3.3 Mining in Science
5.3.4 Mining in E-commerce
5.3.5 Mining in Finance
Case Study
Module - I: Data Warehouse Fundamentals

Learning Objectives:

At the end of this topic, you will be able to understand:

●● Database as a Service
●● Governance/Management as a Service
●● Testing as a Service
●● Storage as a Service
●● Cloud Service Development
●● Cloud Computing Challenges
●● Understand Layers of Cloud Computing
●● Types of Cloud Computing and Features
●● Cloud Computing Security Requirements
●● Cloud Computing - Pros, Cons and Benefits

Introduction

The field of data warehousing has expanded dramatically during the past ten years. To aid in business decision-making, many firms are either actively investigating this technology or are already using one or more data marts or warehouses. In the current economic climate, competitive advantage frequently results from the proactive use of the data that businesses have been accumulating in their operational systems. They are becoming aware of the enormous potential that this information may have for their company. Through the data warehouse, users have access to these huge quantities of integrated, non-volatile, time-variant data, which may be utilised to monitor commercial trends, simplify forecasting and enhance strategic choices.

A growing trend in IT called cloud computing aims to make the Internet the ultimate repository for all computing resources, including storage, computation and accessibility. It provides pay-per-use, on-demand access to external infrastructures.

The object-oriented technology and fields in the database, which increase presentation and its liveliness, are the basis for the multimedia database used in cloud computing. This paradigm lessens the demand for hardware on consumers while enhancing the flexibility of computational resources, enabling them to adjust to changing business needs. Businesses are finding it appealing to adopt the cloud computing paradigm due to its robustness, scalability, performance, high availability, low cost and a number of other factors.

Software manufacturers and developers gain the promise of on-demand scalability and effective resource usage through the use of cloud computing. Consumers can access their "desktop" and data from anywhere, whether at work, at home, while travelling, or from within other enterprises, thanks to infrastructure-free computing. Cloud computing gives enterprise firms the opportunity to outsource computing infrastructure so that they may concentrate on their core skills more efficiently, which is very advantageous to them. Virtualization, which allows for multitenancy and on-demand utilisation of scalable shared resources by all tenants, is the foundation of cloud computing.

It is tremendously advantageous for business people to use cloud services, since these services do away with the need for technicians to support and manage some of the most coveted new IT technologies, such as highly scalable, variably provisioned systems. Computing resources, including virtual servers, data storage and network capabilities, are all load-balanced and automatically extensible, which is obviously advantageous to clients. A solid, dependable service is produced when resources are assigned as necessary and loads can be moved automatically to better locations.

The concept is not too complex. A huge business that has access to plenty of computing resources, such as sizable data centres, comes to an arrangement with clients. Utilising the capabilities of the provider, customers can execute their applications, store their data, host virtual machines and more. Customers have the option of ending their contracts, saving money on setup and maintenance fees, and gaining access to the provider's resource allocation flexibility.
Definition of Cloud Computing

According to the National Institute of Standards and Technology (NIST) definition, "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." Five essential characteristics, three service models and four deployment models make up this cloud model.
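For quick reference, the sketch below simply enumerates, as a small Python structure, the elements that the NIST definition (SP 800-145) names; the variable name is ours and the snippet is only an illustrative summary of the definition quoted above, not part of the NIST text.

# Illustrative summary of the NIST cloud model components (SP 800-145).
NIST_CLOUD_MODEL = {
    "essential_characteristics": [
        "On-demand self-service",
        "Broad network access",
        "Resource pooling",
        "Rapid elasticity",
        "Measured service",
    ],
    "service_models": ["SaaS", "PaaS", "IaaS"],
    "deployment_models": ["Private", "Community", "Public", "Hybrid"],
}

if __name__ == "__main__":
    for part, items in NIST_CLOUD_MODEL.items():
        print(part, "->", ", ".join(items))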
Definitions of cloud computing have also been offered by various researchers and scientists.
1.1 Defining the Cloud for the Enterprise
Applications and services made available through the Internet are referred to as cloud computing. These services are offered via data centres located all over the world, collectively known as the "cloud". This metaphor illustrates how thin and yet universal the Internet is. The numerous network connections and computer systems required for Internet services are made simpler by the "cloud" technology.

In reality, the Internet is frequently depicted as a cloud in network diagrams. This highlights the Internet's enormous reach while also reducing its complexity. The cloud and the services it offers are accessible to anybody with an Internet connection. Users can exchange information with other users and between other systems thanks to the regular connections between these services.

The most recent development in Internet-based computing is cloud computing, a very successful paradigm of service-oriented computing. Cloud computing is an approach to providing simple, on-demand network access to a shared pool of reconfigurable computing resources or shared services (e.g., networks, servers, storage, applications and IT services). The main advantages of cloud computing are lower costs, less complexity, better service quality and more adaptability to workload changes.

The major enabling features of cloud computing are elasticity, pay-per-use and low upfront investment, which make cloud computing a ubiquitous paradigm for deploying novel applications. This has led many SMBs (Small and Medium Businesses) and SMEs (Small and Medium Enterprises) to deploy applications online that were not economically feasible in traditional enterprise infrastructure settings.

Cloud computing is transforming the way data is stored, retrieved and served. Computing resources like servers, storage, networks and applications (including databases) are hosted and made available as cloud services, for a price. Cloud platforms have evolved to offer many IT needs as online services, without organisations having to invest in expensive data centres and worry about the hassle of managing them.

Cloud platforms virtually alleviate the need for organisations to run their own expensive data centre. For database environments, the PaaS cloud model provides better IT services than the IaaS model. The PaaS model provides enough resources in cloud databases to enable users to create the applications they need. A lot of research has been carried out for more than three decades to address scalable and distributed data management.
ve
With the enterprise cloud computing paradigm, businesses can pay as they go for access to virtualized IT resources from public or private cloud service providers. Servers, computing power (CPU cores), data storage, virtualization tools and networking infrastructure are a few examples of these resources.

Enterprise cloud computing gives companies new ways to cut expenses while boosting their adaptability, network security and resilience.

Processing power, computer memory and data storage are three types of computing resources to which enterprises undergoing digital transformation need flexible and scalable access. In the past, these companies were responsible for paying for the setup and upkeep of their own networks and data centres. Now that they have partnered with public and private enterprise cloud service providers, businesses can use these resources at a reasonable cost.


Why are Businesses Choosing the Cloud?


m

The enterprise cloud as we know it today has only been around for a few years.
It began with the introduction of Amazon Web Services (AWS), the first commercial
cloud storage provider, in March 2006. Enterprise cloud technology has grown quickly
)A

and widely since the introduction of AWS. According to a 2019 survey (conducted by
right Scale State of The Cloudreport) of 786 technical professionals from various firms,
91 percent of public cloud solutions and 72 percent of private cloud solutions were
adopted.

Why did these companies adopt business cloud so quickly?


(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 5

There are numerous impetuses:

1. Cost savings: Using pay-as-you-go pricing, a typical enterprise cloud solution allows firms to pay only for the services they really utilise. Additionally, companies that switch to the cloud can avoid most or all of the upfront expenditures associated with building comparable capabilities internally. There is no requirement to purchase servers, lease a data centre, or manage any physical computing infrastructure. As a result, IT costs for businesses that utilise the cloud are frequently lower, simpler to estimate and more predictable.
2. Security: Cybercriminals who want to steal or expose data commonly target enterprise firms. Data breaches can harm a company's reputation and business relationships and are very expensive to fix. Enterprise clouds let businesses use security technologies such as system-wide identity/access management and cloud security monitoring, and identification and access controls can be deployed quickly across the entire network. In both public and private installations, cloud service providers enable data security in a variety of ways.
3. Business resilience and disaster recovery: Business resilience in the case of a service outage, a natural disaster or a cyber attack is at risk without a reliable disaster recovery solution. Possible repercussions include lost sales, declining customer trust and even bankruptcy. Faction's Hybrid Disaster-Recovery-as-a-Service (HDRaaS™), for example, is a proprietary, fully managed enterprise disaster recovery solution offering reliable backup data storage, non-disruptive recovery testing and process management for disaster declaration and failover. Disaster recovery services can help businesses like Delta Air Lines recover from service interruptions more quickly and keep their income flowing.
4. Flexibility and innovation: Businesses can dynamically scale their resource consumption up or down as needed using enterprise cloud computing. As a result, obstacles to innovation are reduced, as are the upfront capital costs of introducing a new product or testing a new service.
introducing a new product or testing a new service are reduced.
ity

1.1.1 Database as a Service


Infrastructure, platform, or software are the most prevalent resource categories
provided by cloud computing services, which lead to a more widespread taxonomy of
cloud kinds. The Daas service type is the primary topic of this essay.
m

A good illustration of SaaS is DaaS. Customers are given access to a certain


piece of software using SaaS, or software as a service. Although the consumer feeds
the software with data and instructions, the provider runs the programme and gives it
)A

Internet connection. The service provider selects, instals, launches, and manages the
database management software.

In any event, all the consumer is getting is a piece of software. DaaS is especially
well suited for small- to medium-sized organisations who depend on databases but
(c

find the installation and maintenance fees prohibitive for financial reasons. Because
consumers don’t have to find, hire, train, and pay people to maintain a database, the
service is much more valuable.
Amity Directorate of Distance & Online Education
6 Data Warehousing and Mining

Database-as-a-Service (DBaaS) is a service that supports applications and is


Notes

e
administered by a cloud operator (public or private), freeing the application team from
having to handle routine database management tasks. Application developers shouldn’t
be required to be database professionals or to pay a database administrator (DBA)

in
to maintain the database if there is a DBaaS available. When database services are
merely invoked by application developers and the database is taken care of, true
database as a service (DBaaS) has been attained.

nl
This would imply that the database will scale without any issues, as well as be
maintained, upgraded, backed up, and able to handle server failure without having
any negative effects on the developer. A service that is both intriguing and loaded with

O
challenging security challenges is database as a service (DaaS).

The DBaaS service mentioned above is one that cloud providers seek to provide.
The cloud providers require a high level of automation in order to offer a complete

ty
DBaaS solution to numerous customers. Backups are an example of an operation that
can be planned and batch processed. Numerous other processes, including elastic
scale-out, can be automated based on certain business rules. For instance, in order to

si
meet the service level agreement’s (SLA’s) requirements for quality of service (QoS),
databases may need to be restricted in terms of connections or peak CPU usage.

The DBaaS may automatically add a new database instance to share the load if
r
this requirement is surpassed. Additionally, the cloud service provider must be able to
ve
automatically create and set up database instances. This method can automate a large
portion of database administration, but the database management system that powers
DBaaS must provide these features through an application programming interface in
order to accomplish this level of automation.
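As an illustration of the kind of rule-driven automation described above, the following minimal Python sketch checks a tenant database against assumed SLA thresholds for connections and CPU usage and asks a management API to add an instance when either is exceeded. The DbaasApi class, its provision_replica() method and the threshold values are hypothetical stand-ins, not any real provider's interface.

# Hypothetical sketch of an elastic scale-out rule in a DBaaS control plane.
from dataclasses import dataclass

@dataclass
class TenantMetrics:
    connections: int
    cpu_percent: float

class DbaasApi:
    """Stand-in for the cloud provider's management API (illustrative only)."""
    def provision_replica(self, tenant_id: str) -> None:
        print(f"Provisioning an additional database instance for {tenant_id}")

def enforce_sla(tenant_id: str, metrics: TenantMetrics, api: DbaasApi,
                max_connections: int = 500, max_cpu_percent: float = 80.0) -> None:
    # Scale out when either SLA/QoS threshold is exceeded.
    if metrics.connections > max_connections or metrics.cpu_percent > max_cpu_percent:
        api.provision_replica(tenant_id)

enforce_sla("tenant-42", TenantMetrics(connections=620, cpu_percent=91.5), DbaasApi())

In practice such checks would run continuously inside the provider's automation layer, which is what allows one operator to manage many thousands of databases at once.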
ni

The number of databases that cloud operators must work on concurrently must
be in the hundreds of thousands or possibly tens of thousands. Automation is required
in this. The DBaaS solution must give the cloud operator an API in order to automate
U

these tasks in a customizable way. The main objective of a DBaaS is to free the
customer from having to worry about the database.

Today’s cloud users simply operate without having to consider server instances,
storage, or networking. Cloud computing is made possible through virtualization, which
ity

automates many of the labor-intensive processes associated with purchasing, setting


up, configuring, and managing these features. Currently, database virtualization is
performing the same function for cloud databases, which are offered as Database as
a Service (DBaaS). The DBaaS can function well and significantly lower operational
m

costs. The fact that DaaS aims to simplify things is crucial as well as obvious.

Requirements of Database Management in Cloud


)A

Data processing efficiency is a fundamental and critical issue for nearly any
scientific, academic, or business institution. As a result, enterprises install, operate, and
maintain database management systems to meet various data processing demands.
Although it is possible to purchase the necessary hardware, deploy database products,
establish network connectivity, and hire professional system administrators as a
(c

traditional solution, this solution has become increasingly expensive and impractical as
database systems and problems have grown larger and more complex. The traditional
solution has various costs.
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 7

Though the costs of technology, software, and networks are anticipated to


Notes

e
decrease over time, the costs of people do not. People costs will most likely dominate
computing solution costs in the future. Database backup, database restore, and
database reorganisation are also required to free up space or restore a preferred data

in
structure. Migration from one database version to the next without affecting solution
availability is a skill that is still in its early stages. During a version change, parts of a
database system, if not the entire solution, typically become inaccessible.

nl
Database management solutions have been used in data centres by businesses.
Initially, developers were left to install, manage, and use their preferred cloud database
instance, with the developer bearing the burden of all database administration tasks.

O
The benefit is that you may choose your own database and have complete control over
how the data is managed. Many PaaS companies have begun to offer cloud database
services in order to reduce the strain on customers of their cloud products.

ty
All physical database administration chores, such as backup, recovery, managing
the logs, etc., are managed by the cloud provider. The developer is in charge of the
database’s logical administration, which includes table tuning and query optimization. If

si
a database service provider is efficient, it has the opportunity to perform these jobs and
create a value proposition.

Database service providers provide enterprises with seamless means for


r
creating, storing, and accessing their databases. Users who want to access data will
ve
now do so utilising the service provider’s hardware and software rather than their
own organization’s computing infrastructure. Outages caused by software, hardware,
or networking modifications or failures at the database service provider’s site would
have no effect on the application. This would eliminate the need to purchase, install,
ni

maintain, and update software, as well as administer the system. Instead, for its
database needs, the company will solely use the ready system managed by the service
provider.
U

Database systems have shown to be extremely effective in a wide range of


financial, economic, and Internet applications. However, they have several significant
drawbacks, including:
ity

◌◌ Scaling database systems is difficult.


◌◌ Database systems are difficult to set up and keep running.
◌◌ The variety of available systems makes selecting more difficult.
◌◌ Peak provisioning incurs unnecessary costs.
m

Database as a service in the cloud computing environment addresses the


constraints of traditional database systems. There are currently no true DBaaS options
that meet all of these requirements. As a result, the future phase of database evolution
)A

will be driven by these cloud computing requirements.

Deployment of DBaaS
Database as a Service (DaaS) (DBaaS) is a technical and operational method that
enables IT companies to provide database functionality as a service to one or more
(c

customers. There are two use-case situations in which cloud database products meet
an organization’s database demands. These are their names:

Amity Directorate of Distance & Online Education


8 Data Warehousing and Mining

●● A single huge corporation with several individual databases that can be transferred
Notes

e
to the organization’s private cloud.
●● Outsourcing the data management needs of small and medium-sized

in
organisations to a public cloud provider that serves a large number of small and
medium-sized businesses.
In a true sense, a DaaS offering should meet the following requirements:

nl
◌◌ Relieving the end developer/user of database administration, tuning, and
maintenance tasks, while providing high performance, availability, and fault
tolerance, as well as advanced features such as snapshot, analytics, and time

O
travel.
◌◌ Elasticity, or the capacity to adapt to changing workloads. Elasticity is
essential to meet user SLAs (Service Level Agreements) while lowering
infrastructure, power, and administration costs for the cloud provider.

ty
◌◌ Guarantees of security and privacy, as well as a pay-per-usage pricing plan.
Because it allows IT organisations to consolidate servers, storage, and database
workloads into a shared hardware and software architecture, a private cloud is an

si
efficient approach to supply database services. Databases hosted on a private cloud
provide significant cost, quality of service, and agility advantages by providing self-
service, elastically scalable, and metered access to database services.
r
For a variety of reasons, private clouds are preferable to public clouds. There are
ve
significant data security hazards with public clouds, which often provide little or no
availability or performance service-level agreements. Private clouds, on the other hand,
give IT departments entire control over the performance and availability service levels
they provide, as well as the capacity to efficiently implement data governance standards
ni

and auditing policies.

Challenges of DBaaS
U

A DBaaS claims to shift much of the operational load of database users’


provisioning, configuration, scalability, performance tuning, backup, data protection, and
access control to the service operator, resulting in lower overall costs for users. The
three major difficulties that DBaaS providers must handle are efficient multi-tenancy,
ity

elastic scalability, and database privacy.

User data must reside on the database service provider’s premises in the database
service provider model. Most businesses regard their data as a highly valuable asset.
The service provider must provide adequate security measures to protect data privacy.
m

At the same time, cloud databases have several negatives, such as security and
privacy concerns, as well as the potential loss or inability to access vital data in the
event of a disaster or bankruptcy of the cloud database service provider.
)A

Another issue confronting the database service provider paradigm is the


development of an effective user interface. Clearly, the interface must be simple to use
while also being sophisticated enough to allow for ease in developing applications.
(c

1.1.2 Governance/Management as a Service


With the introduction of the distributed computing model, IT service management
must solve a wide range of multi-device system management difficulties, from simple
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 9

point products to full enterprise frameworks. Existing technology products rarely solve
Notes

e
the complete problem since they are expensive to purchase, complex to integrate, and
do not include the full range of features and functions required to fulfil today’s system
administration requirements.

in
With low-cost data centre appliances on the rise, the only way to truly solve
today’s heterogeneous IT dilemma is to offer systems management capabilities at the

nl
service level. IT professionals should be able to use their management experience and
knowledge to improve management performance.

This level of integration is being enabled by a new approach to IT service

O
management that is emerging. Management as a service (MaaS) enables IT service
management to create network management appliances, services, and knowledge
repositories that are highly adaptable and capable of growing in response to changing
client needs. MaaS deployments combine several system management features and

ty
solutions into a unified management environment that covers all system management
kinds.

MaaS-based management services provide better flexibility and lower installation

si
costs than any enterprise architecture. MaaS enables enterprises to fully leverage the
benefits of their converged network by recognising and resolving problems faster, more
precisely, less expensively, and with greater visibility than silos alone.
r
ve
1.1.3 Testing as a Service
IT has recently become a utility. Furthermore, software-oriented architecture and
software-as-a-service models have a significant impact on software-based companies.
These models have had a significant impact on the nature of software systems. Every
ni

software company strives to provide software that is simple to use, error-free, and of
excellent quality. To create high-quality, adaptable, and error-free software, it must
be tested against certain criteria in a specific environment. It is extremely tough and
U

expensive to administer that environment because each client or customer requirement


is unique.

It is nearly impossible and highly expansive for a company to sustain that climate.
ity

So, to test software, it is very simple and cost effective to contract the services of a third
party who has the necessary tools, hardware, simulators, or devices.

Software testing as a service is a business model in which testing activities are


purchased or rented on demand from a third party that possesses all of the testing
m

resources required in a real-world context. We can define Software as a service as a


procedure in which software corporations ask service provider companies to supply
their services of testing software as needed.
)A

When enterprises need to generate the appropriate outcomes from software


utilising pricey technologies, Software as a Service solutions are very helpful and
valuable. Instead of purchasing pricey instruments, software testing as a service
provider delivers such tools to the company to utilise for the needed outcomes.
Regression testing, security testing, performance testing, monitoring, compatibility
(c

testing, functional testing, ERP software testing, cloud-based application testing, and
other services are supplied under the Software testing as a Service model.

Amity Directorate of Distance & Online Education


10 Data Warehousing and Mining

Most firms desire automated testing operations to be completed in less time


Notes

e
in order to achieve the best results. However, due of real-world environment tools,
automation is prohibitively expensive for most enterprises. As a result of this issue,
testing operations were restricted, and everything was done manually. However, this

in
process was incredibly slow because it took a long time and resulted in a risk.

Testing software as a service: When businesses require their services, solutions

nl
supply them. It means that those businesses can employ automation tools and qualified
workers on demand rather than purchasing those tools at a low cost that is affordable
for all organisations, large or small.

O
Software testing is a comprehensive and continuous technique for checking and
authenticating software that meets the customer’s technical and business requirements.
Other quality measures like as integrity, dependability, interoperability, usability,
efficiency, maintainability, security, portability, and so on are tested using software.

ty
Software testing methods varies depending on the amount of testing and the goal
of the testing. Testing should be done successfully and efficiently under the constraints
of available financial and scheduling constraints.

si
Because of the vast number of testing limits, it is nearly impossible to ensure that
testing has eradicated almost every error. Applying previously developed concepts

r
for testing software can make testing easier and more successful. Testing is a crucial
quality filter, and it should be organised to identify its principles, aims, and restrictions.
ve
Software testing is the process of verifying and checking the defects in a
programme or system. It not only checks for mistakes, but it also ensures that the
software is operating in accordance with the user’s specifications. It has recently
ni

become an important job in the software business. Because of the increasing


intricacies of technology, software testing has become a difficult procedure for software
organisations. In the beginning, software testing was done internally within a business.
U

Testing is now valuable for software organisations in order to decrease costs


and improve services based on customer needs. Testing is critical for increasing
user happiness while minimising costs. As a result, firms must invest in people, the
environment, and tools while reducing a specific percentage of their budget. The most
ity

important thing to remember is that quality should never be compromised at any cost.
Today, with the help of innovative methods of innovation and testing, firms can secure
the highest quality of software at the lowest possible cost.

Software testing as a Service (STaaS) is an outsourced working paradigm in which


m

test actions are outsourced to a third party that focuses on reproducing real-life test
environments based on the needs of the purchaser. Simply said, STaaS is a means via
which businesses request providers to deliver application testing services as needed.
)A

STaaS entails outsourcing worker testing actions to providers by delivering the


most appropriate files as well as examination files, so that providers may imitate real-life
testing environments and deliver thorough examination findings on time. Providers that
are well-suited for the STaaS model include programmed regression tests, efficiency
tests, security tests, tests associated with large ERP applications, and monitoring/
(c

testing associated with cloud-based purposes.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 11

STaaS alternatives are very advantageous, although companies should automate


Notes

e
examination activities using costly application models. Rather than purchasing these
types of tools, STaaS allows businesses to pay only for what they need, lowering
costs while achieving the best potential results. STaaS is sometimes referred to as on-

in
demand testing.

Software testing as an online service is described as a method of software

nl
package testing used to test an application as a service given to customers via the
internet. It offers regular procedure, upkeep, and examination support via web-based
web browsers, examination frameworks, and hosts.

O
This design helps to sustain any demand-led software package testing market by
allowing enterprises to supply and purchase testing products and services as needed.
There are numerous advantages to online software testing, including: Customers who
use tests do not have to invest significantly in installing and maintaining test conditions.

ty
This dramatically decreases check prices while providing customers with a flexible
way to acquire evaluation products and services as needed, from anywhere in the
world. Second, online distribution of software package evaluation opens up a bigger

si
current market for both evaluation companies and users.

This evaluation company attracts a larger base of customers, while customers

r
have access to international evaluation professionals. Furthermore, it has been said
that software package evaluation as an online service may be given in a time period
ve
of up to 10 trading days. As a result, this adds to quicker turnaround times, allowing
customers to realise the ideal time to current market quickly.

Furthermore, while dealing with published assessment national infrastructure


ni

on the web, the World Wide Web service APIs used can undoubtedly disguise
the complexity involved with using published assessment national infrastructure,
encouraging developers and testers to use it more frequently.
U

Cloud computing has gotten a lot of attention recently because it alters the way
computation as well as providers interact with customers. For example, the concept
alters how processing resources, including as CPUs, databases, and storage space
devices, are supplied and managed. Small and medium-sized organisations want
ity

higher - speed, security, and scalability in their product structure to meet their business
requirements.

However, in their assumption, these firms would not be able to possess this type of
creation. Now that organisations are generally focusing on improving efficiency as well
m

as returns on investments, CEOs should evaluate ways to reduce their investments in


engineering or perhaps gain a higher return on a single or even additional investments.
Test is essential for increasing individual agreement as well as slowing down the
)A

preservation and charge.

A fresh approach to improvement and screening enables businesses to ensure


high quality while spending significantly less money. As a result, the necessity for cloud
migration evolved as a means of let businesses to focus on their core competencies
rather than dealing with the economics and up keep of their the item structure.
(c

However, the choice has a lot of issues that must be addressed in terms of stability,
uniformity, and preservation, so your company must conduct thorough testing.

Amity Directorate of Distance & Online Education


12 Data Warehousing and Mining

TaaS is an outsourcing model in which testing tasks related to portions of an


Notes

e
organization’s business activities are conducted by a service provider rather than in-
house workers.

in
TaaS may entail hiring consultants to assist and advise personnel, or it may simply
entail outsourcing a portion of testing to a service provider. Typically, a corporation will
conduct some testing in-house.

nl
TaaS is best suited for specific testing efforts that do not necessitate extensive
understanding of the architecture or system. Automated regression testing,
performance testing, security testing, application testing, testing of key ERP (enterprise

O
resource planning) software, and monitoring/testing of cloud-based apps are all good
candidates for the TaaS approach.

TaaS, or Testing as a Service, is an outsourcing model in which software testing

ty
is performed by a third-party service provider rather than by organisation employees.
TaaS testing is performed by a service provider who specialises in mimicking real-world
testing environments and locating defects in software products.

si
TaaS is employed when
◌◌ A corporation lacks the necessary skills and resources to conduct internal

◌◌
testing.
r
Don’t want in-house developers to have a say in the testing process (which
ve
they could if done internally).
◌◌ Save money.
◌◌ Accelerate test execution and shorten software development time.
ni
U
ity
m
)A

How does testing as a service work?

TaaS is essentially when a business employs a third party to undertake testing


operations that would otherwise be handled in-house. Providers sell testing tools,
(c

software, and infrastructure to organisations, often on a pay-per-use basis. TaaS can


refer to a single component of the testing procedure, such as a platform, a software-
and-infrastructure combination, or the outsourcing of an entire department. TaaS, in

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 13

whatever shape it takes, entails a provider taking on some of the organization’s testing
Notes

e
obligations.

TaaS could be utilised for automated testing activities that would take in-house

in
workers longer to accomplish manually. It can also be employed when the customer
organisation lacks the resources to conduct testing themselves. The resource could be
time, money, people, or technology. TaaS may not be the best option for enterprises

nl
that want in-depth knowledge of their infrastructure.

There are numerous varieties of TaaS, each with its own set of procedures, but in
general, TaaS will work as follows:

O
◌◌ To conduct the test, a scenario and environment are built. In software testing,
this is known as a user scenario.
◌◌ A test is created to assess the company’s reaction to such scenario.

ty
◌◌ The test is carried out in the vendor’s secure test environment.
◌◌ The vendor monitors performance and assesses the company’s ability to
satisfy test design goals.

si
◌◌ To improve future performance and results, the vendor and company
collaborate to improve the system or product being tested.
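A minimal sketch of how this workflow might look from the customer's side is given below, assuming a hypothetical TaaS provider that exposes a REST API; the endpoint paths, field names and example scenario are illustrative assumptions, not any real product's interface.

# Hypothetical TaaS client sketch: submit a scenario, let the vendor run it, poll for results.
import time
import requests

BASE_URL = "https://taas.example-provider.com/api"  # hypothetical provider endpoint

def run_remote_test(scenario: dict, api_key: str) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    # 1. Submit the user scenario and test environment definition.
    job = requests.post(f"{BASE_URL}/test-runs", json=scenario, headers=headers).json()
    # 2.-4. The vendor executes the test in its own secure environment and monitors it;
    #       the customer simply polls until a final result is reported.
    while True:
        status = requests.get(f"{BASE_URL}/test-runs/{job['id']}", headers=headers).json()
        if status["state"] in ("passed", "failed"):
            return status  # 5. Results feed back into improving the system under test.
        time.sleep(30)

# Example scenario: an automated regression suite against a staging build (hypothetical values).
scenario = {"suite": "regression", "target": "https://staging.example.com", "browsers": ["chrome"]}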

Key TaaS Features

Benefits of testing as a service


)A

The primary advantages of testing as a service are the same as those of employing
any service or outsourcing. They revolve around the fact that the company paying for
the service is not required to host or maintain the testing processes and technology.

The following are the primary advantages of TaaS:


(c

Reduced costs: Companies are not required to host infrastructure or pay people.
There are no licencing or staffing fees.
Amity Directorate of Distance & Online Education
14 Data Warehousing and Mining

Pay as you go pricing: Companies just pay for what they utilise.
Notes

e
Less rote maintenance: In-house IT personnel will be doing less rote maintenance.

High availability: TaaS providers normally offer services 24x7.

in
High flexibility: Companies’ service plans can be simply adjusted as their needs
change.

nl
Less-biased testers: The test is being carried out by a third party with little
understanding of the product or firm. The test is not influenced by internal personnel.

Data integrity: The vendor cleans test data and runs tests in controlled conditions.

O
Scalability: TaaS products can be tailored to the size of the organisation.

1.1.4 Storage as a Service

ty
Storage as a service (STaaS) is a managed service in which the provider offers the consumer access to a data storage platform. The service can be offered on-premises using infrastructure dedicated to a single customer, or it can be delivered from the public cloud as a shared service acquired on a subscription basis and invoiced based on one or more consumption indicators.

r
Individual storage services are accessed by STaaS customers via regular system
interface protocols or application programme interfaces (APIs). Bare-metal storage
ve
capacity; raw storage volumes; network file systems; storage objects; and storage
programmes that provide file sharing and backup lifecycle management are typical
offerings.
ni

Storage as a service was first viewed as a cost-effective solution for small and
medium-sized organisations without the technical personnel and capital funding to
construct and operate their own storage infrastructure. Storage as a service is now
U

used by businesses of all sizes.

Uses of STaaS
Storage as a service can be used for data transfers, redundant storage, and data
ity

restoration from corrupted or missing files. CIOs may want to use STaaS to quickly
deploy resources or to replace some existing storage space, freeing up space for on-
premises storage gear. CIOs may also value the option to customise storage capacity
and performance based on workload.
m

Instead of maintaining a huge tape library and arranging for tape vaulting (storage)
elsewhere, a network administrator who utilises STaaS for backups could designate
what data on the network should be backed up and how frequently it should be backed
)A

up. Their company would sign a service-level agreement (SLA) in which the STaaS
provider agrees to rent storage space on a cost-per-gigabyte-stored and cost-per-
data-transfer basis, and the company’s data would then be automatically transferred at
the specified time over the storage provider’s proprietary WAN or the internet. If the
company's data becomes corrupted or lost, the network administrator can contact the STaaS supplier and seek a copy.
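The metered, cost-per-gigabyte terms of such an SLA can be illustrated with a small calculation; the rates used below are invented for the example and are not any provider's actual prices.

# Illustrative pay-per-use billing estimate for an STaaS agreement (made-up rates).
def monthly_staas_cost(gb_stored: float, gb_transferred: float,
                       price_per_gb_stored: float = 0.02,
                       price_per_gb_transferred: float = 0.01) -> float:
    return gb_stored * price_per_gb_stored + gb_transferred * price_per_gb_transferred

# e.g., 5 TB kept in the service and 800 GB transferred during the month
print(f"Estimated bill: ${monthly_staas_cost(5000, 800):.2f}")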

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 15

Storage as a Service in Cloud Computing


Notes

e
Organizations that employ STaaS will often use a public cloud for storage and
backup purposes rather than storing data on-premises. Different storage strategies

in
may be used for STaaS in public cloud storage. Backup and restore, disaster recovery,
block storage, SSD storage, object storage, and bulk data transfer are all examples of
storage technologies. Backup and restore refers to the process of backing up data to

nl
the cloud in order to protect it in the event of data loss. Data protection and replication
from virtual computers may be referred to as disaster recovery Virtual Machines(VMs).

Customers can use block storage to provision block storage volumes for lower-

O
latency I/O. SSD storage is yet another type of storage that is commonly utilised for
demanding read/write and I/O operations. Object storage systems, which have a high
latency, are utilised in data analytics, disaster recovery, and cloud applications. Cold
storage is used to quickly create and configure stored data. Bulk data transfers will

ty
transmit data using discs and other gear.

Advantages of STaaS

si
●● Storage costs: Expenses for personnel, hardware, and physical storage space are
decreased.
●● Disaster recovery: Having numerous copies of data kept in separate locations can
help disaster recovery techniques work better. r
ve
●● Scalability: Users of most public cloud services only pay for the resources they
utilise.
●● Syncing: Files can be synced automatically across various devices.
ni

●● Security: Because security measures vary by provider, security can be both an


advantage and a negative. Data is often encrypted both during transmission and
at rest.
U

Disadvantages of STaaS
●● Security: Users may wind up sending business-sensitive or mission-critical data to
the cloud, making it crucial to select a dependable service provider.
ity

●● Potential storage costs: If bandwidth limits are exceeded, this could be costly.
●● Potential downtimes: Vendors may experience periods of downtime during which
the service is unavailable, which can be problematic for mission-critical data.
m

●● Limited customization: The cloud infrastructure is less adaptable because it is


owned and managed by the service provider.
●● Potential for vendor lock-in: Migrating from one service to another may be tough.
)A

Popular Storage-as-a-service Vendors


STaaS providers include Dell EMC, Hewlett Packard Enterprise (HPE), NetApp,
and IBM. Isilon NAS storage, EMC Unity hybrid-flash storage, and other storage
alternatives are available from Dell EMC. When compared to Dell EMC, HPE has an
(c

equal, if not greater, presence in storage systems.

Amity Directorate of Distance & Online Education


16 Data Warehousing and Mining

Other public cloud suppliers offering cloud storage services include:


Notes

e
◌◌ Amazon Web Services (AWS)
◌◌ Microsoft Azure

in
◌◌ Google Cloud
◌◌ Oracle cloud

nl
◌◌ Box
◌◌ Arcserve

O
1.2 Cloud Service Development
With the increasing rise of cloud computing, many companies are shifting their
computing activities to the cloud. Cloud computing refers to the delivery of computer
or IT infrastructure over the Internet. That is the provisioning of shared resources,

ty
software, apps, and services via the internet to meet the customer’s elastic demand
with minimal effort or interaction with the service provider.

si
It is a network access model that enables ubiquitous, convenient, on-demand
network access to a shared pool of computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with
minimal management effort or interaction from service providers.
r
ve
The cloud-based high-performance computing centre attempts to address the
following issues:

◌◌ Dynamically produced high-performance computer platform.


◌◌ Computing resources are virtualized.
ni

◌◌ Combining high-performance computer management technologies with


traditional methods.
◌◌ Dynamically produced high-performance computer platform.
U

Different phases of cloud computing

Phase Description
ity

Mainframes 1950s User shared powerful mainframes using dummy terminals.


Start of automation phase
Localized infrastructure
PC Computing 1960s Stand-alone PCs became powerful enough to meet the majority
m

of users needs.
Rise in Demand of personnedl Computer
Decentralized Computing
)A

Birth of IT service
Network Computing Pcs, laptops and servers were connected together through local
1990s networks to share resources and increase performance.
Internet Computing Local networks were connected to other local networks forming a
2001 global network such as the internet to utilize remote applications
(c

and resourse

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 17

G r i l d C o m p u t i n g Computing provided shared computing power and storage


Notes

e
Beyond 2010 through a distributed computing.
Solving large problems with prarallel computing

in
C l o u r C o m p u t i n g Cloud computing is the provision of computer or IT infrastructure
Beyond 2010 through the Internet. That is the provisioning of shared resources,
software, applications and services over the internet to meet the

nl
demand of the customer with minimum effort or interaction with
the service provider.

1.2.1 Cloud Service Development

O
Historically, computing power was a precious and expensive resource. With the
advent of cloud computing, it is now abundant and inexpensive, resulting in a significant
paradigm shift – a shift from scarcity computing to abundance computing. This

ty
computing revolution hastens the commoditization of products, services, and business
models, while also disrupting the current information and communications technology
(ICT) industry. It provided the same services as water, electricity, gas, telephony, and

si
other appliances. Cloud computing provides on-demand computing, storage, software,
and other IT services with metered billing based on usage.

Cloud computing enables the reinvention and transformation of technology


r
collaborations in order to better marketing, simplify and strengthen security, and
ve
increase stakeholder engagement and user experience while lowering costs. You don’t
have to over-provision resources for future peak levels of business operation while
using cloud computing. Then you have the materials you truly needed. These resources
may be scaled immediately to grow and contract capability as business needs change.
ni

Cloud computing virtualization is the providing of hardware, runtime environment,


and resources for a fee to a user. These goods can be used as long as the User
wants, with no upfront commitment required. The entire computing device collection is
U

converted into a Utilities set that can be supplied and assembled in hours rather than
days, allowing devices to be deployed without incurring maintenance costs. The long-
term objective of a cloud computer is that IT services be exchanged on an open market
without the use of technology and as utilities as barriers to the rules.
ity

We can hope that in the near future, a solution that clearly meets our needs will
be identified and entered into our application on a worldwide digital market services
for cloud computing. This market will enable the automation of the discovery and
integration processes with its existing software platforms. A digital cloud trading platform
m

will also allow service providers to increase their earnings. A competitor’s customer
service may also be a cloud service to meet consumer promises.

Company and personal data are available in structured formats everywhere,


)A

allowing us to simply access and connect on a bigger scale. The security and stability of
cloud computing will continue to develop, making it even safer with a number of ways.
Instead of focusing on the services and applications that they enable, we do not believe
that “cloud” is the most relevant technology. The combination of wearables and bring
your own device (BYOD) with cloud technology and the Internet of Things (IOT) would
(c

become so widespread in personal and professional life that cloud technology would be
disregarded as an enabling.

Amity Directorate of Distance & Online Education


18 Data Warehousing and Mining

Historical Developments
Notes

e
Cloud computing is not a cutting-edge technology. Cloud computing has gone
through several stages of development, including Grid computing, utility computing,

in
application service provision, software as a service, etc. However, the idea of providing
computing resources across a worldwide network was first introduced in the 1960s.
The market for cloud computing is anticipated to reach $241 billion by 2020(Forrester

nl
Research). But how we got there and where all that began is explained by the history of
cloud computing.

The first commercial and consumer cloud computing website was built in 1999,

O
hence cloud computing has a recent history (Salesforce.com and Google). As cloud
computing is the solution to the issue of how the Internet may enhance business
technology, it is intimately related to both the development of the Internet and the
advancement of corporate technology. Nearly as long as businesses themselves,

ty
business technology has a rich and fascinating history. However, the evolution that has
most directly influenced cloud computing starts with the introduction of computers as
providers of practical business solutions.

si
History of Cloud Computing
Cloud computing is one of today's cutting-edge technologies. A brief history of its development follows.

Figure: History of Cloud Computing


m

EARLY 1960S
John McCarthy, a computer scientist, developed the idea of time-sharing, which enabled a company to use a single expensive mainframe concurrently. This idea is hailed as a significant step towards the Internet and a forerunner of cloud computing.

IN 1969
J.C.R. Licklider helped establish the Advanced Research Projects Agency Network (ARPANET) and advocated the concept of an "Intergalactic Computer Network" or "Galactic Network" (a computer networking idea akin to the modern Internet). His goal was to connect everyone on the planet and enable universal access to data and applications.

IN 1970

in
Virtualization made it possible to run many operating systems concurrently in separate environments: an entire separate computer could be operated under a different operating system as a virtual machine. Virtualization products such as VMware later commercialised this idea.

nl
IN 1997

O
The first known definition of “cloud computing,” “a paradigm in which computer
boundaries are set purely on economic rather than technical restrictions alone,”
appears to have been provided by Prof. Ramnath Chellappa in Dallas in 1997.

ty
IN 1999
In 1999, Salesforce.com was introduced as the first company to offer client
applications through a straightforward website. The services provider was able to

si
offer software applications over the Internet to both niche and mainstream software
providers.

IN 2003
r
Xen, a hypervisor also known as a Virtual Machine Monitor (VMM),
ve
is a software system that enables many virtual guest operating systems to be run
concurrently on a single machine. This is its first public release.

IN 2006
ni

The Amazon cloud service was launched in 2006. First, its Elastic Compute
Cloud (EC2) enabled users to access machines and use their own cloud apps. Simple
Storage Service (S3) was then released. This used the pay-as-you-go concept and
U

has since evolved into the accepted practise for both users and the sector at large.

IN 2013
ity

In 2012, the global market for public cloud services expanded by 18.5% to £ 78
billion, with IaaS being one of the services with the greatest market growth.

IN 2014
m

Global business spending on cloud-related technology and services was predicted to reach £103.8 billion in 2014, up 20% from 2013.

The figure below shows the evolution of distributed computing technologies. In tracing these historical advancements, we briefly review five essential technologies that have been crucial to the development of cloud computing: distributed systems, virtualization, Web 2.0, service orientation, and utility computing.


Figure: The evolution of distributed computing technologies, 1950s- 2010s.

As per TechTarget, the term "distributed computing" refers to the use of several computer systems to tackle a single task. In distributed computing, a single task is divided into several parts, with separate machines handling each part. Because the computers are connected, they can communicate with one another to address the issue. If done correctly, the computers operate as a single unit.
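To make this concrete, here is a minimal, hypothetical Python sketch of the divide-and-combine pattern just described: one task (summing a large range of numbers) is split into parts, each part is handled by a separate worker process standing in for a separate machine, and the partial results are combined so the workers behave as a single unit. The function names and the choice of task are illustrative only.

```python
# A toy illustration of the distributed-computing pattern described above:
# one task is split into parts, separate workers handle each part, and the
# partial results are combined. Worker processes stand in for separate machines.
from multiprocessing import Pool

def partial_sum(bounds):
    """Compute the sum of integers in [start, end) -- one 'part' of the task."""
    start, end = bounds
    return sum(range(start, end))

def distributed_sum(n, workers=4):
    """Split the task 0..n into chunks and farm the chunks out to workers."""
    chunk = n // workers
    parts = [(i * chunk, (i + 1) * chunk if i < workers - 1 else n)
             for i in range(workers)]
    with Pool(processes=workers) as pool:
        results = pool.map(partial_sum, parts)   # each part handled separately
    return sum(results)                          # combine the partial results

if __name__ == "__main__":
    print(distributed_sum(1_000_000))  # same answer as summing on one machine
```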

Distributed computing's ultimate goal is to boost overall performance by establishing affordable, open, and secure links between users and IT resources. It also guarantees fault tolerance and offers access to resources in the event that one component fails.

The way resources are distributed in computer networks is really not all that
unusual. This was first accomplished with the use of mainframe terminals, progressed
to minicomputers, and is currently feasible with personal computers and client-server
ity

architecture with multiple tiers.

A distributed computing architecture consists of a number of very lightweight client computers together with one or more dedicated management servers. Client agents typically detect when a machine is idle and inform the management server that the machine is free to use. The agent then requests an application package; when the client receives this package from the management server, it runs the application software whenever it has free CPU cycles and sends the results back to the management server. When the user logs back in, the management server returns the resources that were used to complete these tasks while the user was away.

Heterogeneity, openness, scalability, transparency, concurrency, continuous


availability, and independent failures are all characteristics of distributed systems. In
some ways, these describe clouds, particularly in terms of scalability, concurrency, and
(c

continuous availability. Three significant achievements—mainframe, cluster computing,


and grid computing—have been made possible by cloud computing.


Mainframes: A mainframe is a robust computer that frequently acts as the primary data store for an organization's IT infrastructure. It communicates with users through
less capable hardware, such as workstations or terminals. By consolidating data into
a single mainframe repository, it is simpler to manage, update, and safeguard data

in
integrity. In contrast to smaller machines, mainframes are typically employed for
large-scale procedures that demand higher levels of availability and safety. Large
enterprises utilise mainframe computers or mainframes largely for processing mass

nl
data for things like censuses, industry and consumer statistics, enterprise resource
planning, and transaction processing. In the late 1950s, mainframes had a simple
interactive interface and transmitted data and programmes using punched cards,

O
paper tape, or magnetic tape.

To handle back office activities like payroll and customer billing, they functioned in batch mode, mostly using repetitive tape and merge operations followed by a line print to continuous stationery using pre-printed ink. Digital user interfaces are now almost exclusively employed to run applications (such as airline reservations) rather than to create the software. Although mainly replaced by keypads, typewriter and teletype machines were the usual network operators' control consoles in the early 1970s.
si
Cluster Computing: In the computer clustering approach, a set of computer nodes (personal computers used as servers) is connected by a fast local area network (LAN). The "clustering middleware," a software layer that sits in front of the nodes and lets users access the cluster as a whole through a single system image concept, coordinates the activities of the computing nodes.

A cluster is a sort of parallel or distributed computer system that consists of a group


ni

of connected, independent computers that act as a highly centralised computing tool


that combines separate machines, networking, and software in one system.

Typically, clusters are utilised to provide more computational power than a single
U

computer can in order to support high availability, higher reliability, or high performance
computing. Since it uses off-the-shelf hardware and software components as opposed
to mainframe computers, which use custom-built hardware and software, the cluster
technique is more power and processing speed efficient when compared to other
ity

technologies.

A cluster of computers collaborates to provide unified, faster processing. In contrast to a mainframe computer, a cluster can be expanded by adding new nodes or upgrading to a higher standard. Redundant machines that continuously resume processing reduce the likelihood that a single component failure brings the system down; applications running on mainframes do not have this kind of redundancy.
running on mainframes don’t have this kind of redundancy.

Grid Computing: Grid computing is a processor architecture that integrates computing power from multiple domains to accomplish a main goal. It allows networked computers to collaborate on a job and act together as a single supercomputer. A grid generally works on numerous networked jobs, but it can also be dedicated to particular applications. It is designed to solve problems that are too big for a single supercomputer while also handling many smaller problems. Computing grids provide a multi-user network that supports discontinuous information processing.


A grid is connected to a computer cluster by a parallel node operating system, such as Linux or other free software. The cluster's size might range from one tiny
network to many. The technology is employed in a wide range of applications,
including mathematics, research, and instructional tasks, through a number of

in
computing resources. It is frequently used in web services like ATM banking, back
office infrastructure, scientific and marketing research, as well as structural analysis.
Applications are utilised in a parallel networking environment as part of grid computing

nl
to address computational issues. Each PC is connected, and information is combined
into a computational application.

O
Virtualization
Cloud computing is based on virtualization, a method that improves the utilisation
of actual computer hardware. Through the use of software, virtualization may divide the
hardware components of a single computer, such as processors, memory, storage, and

ty
more, into several virtual computers, commonly referred to as VMs. Despite only using
a piece of the underlying computer hardware, each VM runs its own OS and functions
like a standalone machine.

si
As a result, virtualization enables a considerably more efficient use of physical
computer hardware, enabling an organisation to get a higher return on its hardware
investment.
r
ve
Today, virtualization is a standard approach in business IT architecture. The
technology is also what powers the cloud computing industry. Because of virtualization,
cloud service providers can service customers using their own physical computing
hardware, and cloud customers can buy only the computer resources they require at
the time they require them and expand them affordably as their workloads grow.
ni

The process of creating a virtual platform, including virtual computer networks,


virtual storage devices, and virtual computer hardware, is known as virtualization.
U

Hardware virtualization is accomplished using a program known as a hypervisor, which is integrated with the server hardware. The physical hardware shared by the client and the provider is under the control of the hypervisor. A Virtual Machine Monitor (VMM) can be used to abstract the actual hardware and implement hardware virtualization, and a number of processor extensions help accelerate virtualization and improve hypervisor performance. Virtualizing the server platform in this way is referred to as server virtualization.
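As a small illustrative sketch (not part of the original text), the following Python snippet checks whether a CPU advertises the hardware-virtualization extensions that hypervisors rely on (Intel VT-x appears as the `vmx` flag, AMD-V as `svm`). It assumes a Linux host where /proc/cpuinfo is available; on other systems it simply reports that the check is unavailable.

```python
# Minimal sketch: detect the processor extensions that accelerate virtualization
# (Intel VT-x -> 'vmx', AMD-V -> 'svm'). Assumes a Linux host with /proc/cpuinfo.
def hardware_virtualization_support(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            flags = {
                flag
                for line in f
                if line.startswith("flags")
                for flag in line.split(":", 1)[1].split()
            }
    except FileNotFoundError:
        return None  # not a Linux host, or /proc is unavailable
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return "none"

if __name__ == "__main__":
    print("Hardware virtualization:", hardware_virtualization_support())
```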
m

Web 2.0
Websites that emphasise user-generated content, ease of use, participatory culture, and interoperability for end users are referred to as participative (or social) websites. The term "Web 2.0" is relatively recent, first appearing in everyday speech around 1999. Tim O'Reilly and Dale Dougherty popularised it at a conference in 2004, after Darcy DiNucci's initial use of the phrase. It is important to keep in mind that Web 2.0 describes how websites are designed and used; it does not impose any technical restrictions on designers.


The phrase "Web 2.0" is used to describe a variety of websites and applications that enable anyone to produce or share content online. The ability for individuals to create, exchange, and communicate is one of its major features. Web 2.0 differs from earlier types of websites in that it allows anybody to easily create, publish, and share their work with the world without any prior knowledge of web design or publishing. This makes it simple and popular to share knowledge with a small community or a much larger audience. A university, for example, can use these tools to communicate with its employees, students, and other members, and students and colleagues can engage and communicate effectively through them.

O
The web apps, which allow for interactive data sharing, user-centered design, and
global collaboration, symbolise the progress of the World Wide Web in this context. The
term “Web 2.0” refers to a broad category of Web-based technologies, including blogs

ty
and wikis, social networks, podcasts, social bookmaking websites, and really simple
syndication (RSS) feeds. The basic idea behind Web 2.0 is to improve the connection
of Web applications and make it possible for consumers to access the Web quickly
and effectively. Web applications that offer computing capabilities on demand over the

si
Internet are exactly what cloud computing services are.

As a result, cloud computing adheres to the Web 2.0 approach; it is regarded as providing a key component of the Web 2.0 infrastructure and benefits from the Web 2.0 framework. A number of web technologies sit beneath Web 2.0, including Rich Internet Applications (RIAs) that have recently matured into production use. The most well-known and widely accepted of these is AJAX (Asynchronous JavaScript and XML). Further technologies include widgets (plug-in modular components), RSS (Really Simple Syndication), and web services (e.g. SOAP, REST).
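Because Web 2.0 applications and cloud services expose their functionality through web APIs, a REST call is often the simplest way to consume them. The sketch below uses the widely available requests library against a placeholder URL; the endpoint and field names are hypothetical and stand in for whatever service is actually being consumed.

```python
# Minimal sketch of consuming a RESTful web service, the style of API
# (alongside SOAP, RSS and AJAX front ends) that underpins Web 2.0 and
# cloud services. The endpoint and field names below are placeholders.
import requests

BASE_URL = "https://api.example.com/v1"   # hypothetical service

def list_documents(owner):
    """GET a JSON collection, the typical read operation of a REST API."""
    resp = requests.get(f"{BASE_URL}/documents", params={"owner": owner}, timeout=10)
    resp.raise_for_status()                 # surface HTTP errors explicitly
    return resp.json()

def create_document(owner, title):
    """POST a JSON body to create a new resource."""
    resp = requests.post(f"{BASE_URL}/documents",
                         json={"owner": owner, "title": title}, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(list_documents("alice"))
```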
ni

1.2.2 Cloud Computing Challenges


Because businesses require data storage, nearly all of them adopt cloud computing. Companies generate and store massive amounts of data, and as a result they confront numerous security and operational challenges. Organisations therefore look for ways to streamline and optimise their processes and to improve the administration of their cloud environments. The main challenges are outlined below.
ity

1. Security and Privacy of Cloud: The cloud data storage must be secure and private.
Clients are completely reliant on the cloud provider. In other words, the cloud
provider must implement the appropriate security procedures to protect customer
data. Securities are also the customer’s liability since they must have a good
m

password, not share the password with others, and regularly update the password.
Certain issues may arise if the data is stored outside of the firewall, which the cloud
provider can resolve. Hacking and viruses are also major issues because they might
harm a large number of clients. Data loss may occur, as well as disruptions to the
)A

encrypted file system and other concerns.


2. Interoperability and Portability: The Customer will be provided with migration
services into and out of the cloud. Customers may be harmed if a bond period is
permitted. The cloud will be able to provide premises facilities. One of the cloud
(c

hurdles is remote access, which prevents the cloud provider from accessing the
cloud from anyplace.


3. Reliable and Flexible: Reliability and flexibility are demanding requirements for cloud consumers; meeting them helps avoid data leakage and builds customer trust. To address this, third-party services should be monitored, along with their performance, robustness, and the business's dependence on them.

in
4. Cost: Cloud computing is inexpensive, but changing the cloud to meet customer
demand can be costly at times. Furthermore, changing the cloud might be detrimental

nl
to small businesses because demand can often be more expensive. Furthermore,
data transport from the Cloud to the premises can be pricey at times.
5. Downtime: Downtime is the most common cloud computing difficulty, as no cloud

O
provider guarantees a platform free of downtime. Internet connection is also vital, as
it might be a problem if a corporation has an untrustworthy internet connection that
experiences downtime.
6. Lack of resources: Many organisations are attempting to overcome a shortage of

ty
resources and skills in the cloud market by hiring new, more experienced staff.
These personnel will not only assist in resolving business difficulties, but will also
train existing staff to benefit the organisation. Currently, many IT staff are working to

si
improve their cloud computing skills, which is challenging for the CEO because the
employees are inexperienced. It says that individuals who are exposed to the latest
developments and associated technology will be more valuable in the workplace.
7. r
Dealing with Multi-Cloud Environments: Today, hardly a single cloud is running
ve
full-fledged businesses. According to the RightScale survey, nearly 84 percent of
organisations use a multi-cloud approach, and 58 percent use hybrid cloud systems
that combine public and private clouds. Furthermore, corporations employ five
separate public and private clouds.
ni
U
ity
m
)A

Figure: RightScale 2019 report revelation

Long-term predictions concerning the future of cloud computing technology are more
challenging for IT infrastructure teams to make. Professionals have also proposed
(c

top solutions to this challenge, such as redesigning processes, training employees,


tools, active vendor relations management, and research.


8. Cloud Migration: While it is straightforward to launch a new cloud app, it is more difficult to migrate an existing programme to a cloud computing environment. According to a survey conducted by Velostrata, 62% of respondents said their cloud migration projects were more difficult than expected. Furthermore, 64% of migration initiatives took longer than projected, and 55% went over budget. Organisations migrating their applications to the cloud reported migration downtime (37%), data synchronisation issues before cutover (40%), problems getting migration tooling to work well (40%), slow data migration (44%), security configuration issues (40%), and time-consuming troubleshooting (47%). To address these issues, over 42% of IT specialists said they wanted to see their budget grow, nearly 45% wanted to work with an in-house professional, 50% wanted to extend the project timeline, and 56% wanted more pre-migration testing.
budget grow, and nearly 45% wanted to work with an in-house professional, 50%
wanted to extend the project, and 56% wanted more pre-migration tests.
9. Vendor lock-in: The issue with vendor lock-in cloud computing is that clients are

ty
dependant (i.e. locked in) on the deployment of a single Cloud provider and cannot
migrate to another vendor in the future without incurring large costs, regulatory
constraints, or technological incompatibilities. The lock-up condition can be seen
in programmes for specific cloud platforms, such as Amazon EC2 and Microsoft

si
Azure, which are not readily transferred to any other cloud platform and expose
users to modifications made by their providers to further confirm the objectives of a
software developer.
r
In fact, the issue of lock-in arises when a company decides to change cloud
ve
providers (or perhaps integrate services from different providers), but is unable to
move applications or data across different cloud services because the semantics
of the cloud providers’ resources and services do not correspond. Because of the
heterogeneity of cloud semantics and APIs, technological incompatibility arises,
ni

posing a challenge to interoperability and portability.


This makes interoperability, cooperation, portability, handling, and maintaining
data and services extremely complicated and challenging. For these reasons, it is
U

necessary for the organisation to maintain flexibility in changing providers based on


business needs, or even to keep in-house certain components that are less critical
to safety owing to risks. Supplier lock-in will stymie interoperability and portability
among cloud providers. It is the path to greater competitiveness for cloud providers
ity

and clients.
10. Privacy and Legal issues: The biggest issue with cloud privacy/data security appears
to be ‘data leak.’ Data infringement is broadly defined as the loss of electronically
encrypted personal information. A breach of the information could result in a slew
m

of losses for both the provider and the customer, including identity theft, debit/credit
card fraud for the customer, loss of credibility, future prosecutions, and so on. In
the event of a data breach, American law requires impacted individuals to notify the
)A

authorities.
Almost every state in the United States is now required to notify affected individuals
of data breaches. When data is subject to many jurisdictions and the laws governing
data privacy differ, problems occur. For example, the European Union’s Data
Privacy Directive clearly stipulates that “data can only leave the EU if it goes to a
(c

‘higher degree of security’ country.” This rule, while simple to establish, restricts data
transfer and so reduces data capacity. The EU’s regulations are enforceable.


1.3 Cloud Computing Layers


What distinguishes cloud computing so significantly from conventional data storage? Thanks to cloud computing, businesses and private individuals can use databases, storage, and computing power to manage their data without needing to own and operate physical data centres and servers. These technologies are delivered through what are known as cloud computing layers.

nl
The following figure illustrates a tiered cloud computing architecture that can be
used to conceptualise cloud computing as a collection of services:

O
ty
r si
ve
ni

Figure: Layered architecture of Cloud Computing (adapted from Jones 2003)


U

1.3.1 Understand Layers of Cloud Computing


According to the types of services provided, cloud computing may be divided into three layers: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) (Iyer and Henderson, 2010; Han, 2010; Mell and Grance, 2010). IaaS is the lowest layer and provides fundamental infrastructure support services. The PaaS layer is the middle layer, offering platform-oriented services in addition to the environment for hosting client applications. The uppermost layer is SaaS, which provides a complete application delivered as a service on demand.
)A

Software-as-a-Service (SaaS): SaaS is a model in which Application Service Providers (ASPs) deliver various software applications over the Internet. This frees the client from installing and running the application on their own PC and also removes a large burden of software maintenance, ongoing operation, and support. The SaaS vendor is responsible for deploying and managing the IT infrastructure (servers, operating system software, databases, data centre space, network access, power and cooling, and so on) and the processes (infrastructure patches and upgrades, application patches and upgrades, backups, and so on) required to run and manage the complete solution.
in
SaaS emphasises a complete application provided as a service on demand. SaaS uses a divided-cloud and convergence-coherence mechanism in which each data item carries either a read lock or a write lock, and it relies on two types of servers: the Main Consistence Server (MCS) and the Domain Consistence Server (DCS). Cache coherence is achieved through the cooperation of the MCS and the DCS [6]. If the MCS is damaged or compromised, control over the cloud status is lost, so securing the MCS will be very important in the future. Salesforce.com and Google Apps are two examples of SaaS.
O
Apps are two examples of SaaS.

The SaaS, or actual software, is the third cloud tier (Software as a service). This
layer offers a comprehensive software solution. An app is rented out by organisations,

ty
and consumers connect to it online, typically through a web browser.

SaaS is the layer where the user accesses the service provider’s offering in a cloud
environment. It must be available on the web, accessible from anywhere, and ideally

si
compatible with any device. The hardware and software are managed by the service
provider.

r
Web-based email services like Outlook, Gmail, and Hotmail are an example of
SaaS. Here, the email client and your communications are on the network of the service
ve
provider.

Organizations can hire a variety of business software. Enterprise resource planning


(ERP), customer relationship management (CRM), and document management are a
ni

few examples of business apps.

This allows companies to invest in high-end enterprise apps without having to set
them up or manage them. The service provider then makes hardware, middleware, and
U

software purchases and updates them. All businesses can now afford it because they
just have to pay for the services they actually utilise.

Additionally, the ability for users to access information and data at any time,
ity

anyplace, has significant benefits for remote work.

Platform as a Service (PaaS): PaaS is the delivery of a computing platform and solution stack as a service, without the need for developers, IT administrators, or end users to download or install software. It provides a highly integrated environment for building and testing cloud applications. The client does not manage the underlying infrastructure (servers, operating systems, storage), but does control the deployed applications and, possibly, their configurations. PaaS examples include Force.com, Google App Engine, and Microsoft Azure.

Azure.

The platform, or PaaS, is the second layer of the cloud (Platform as a service). This
layer offers the tools necessary to create apps in a cloud development and deployment
environment.
(c

PaaS involves infrastructure, like IaaS, but it also includes middleware, database management systems, business intelligence, development tools, and more. It is intended to handle the complete lifecycle of a web application, from development and testing to deployment, administration, and updating.

PaaS thus offers the capability to create, test, execute, and host applications when

in
combined with Infrastructure as a Service.

Businesses benefit from Platform as a Service in addition to the benefits they


already get from Infrastructure as a Service. These benefits include making it simpler

nl
to create for several platforms. Additionally, it involves reducing the amount of coding
required because the platform already has pre-coded application components.
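To illustrate what handling the complete lifecycle of a web application looks like from the developer's side, here is a minimal, hypothetical web application of the kind that would be pushed to a PaaS offering (for example Google App Engine or Azure App Service). The platform, not the developer, supplies the runtime, web server, scaling, and monitoring; the route names below are illustrative.

```python
# Minimal sketch of an application a developer would deploy to a PaaS.
# The platform provides the runtime, routing, scaling and monitoring;
# the developer supplies only the application code below.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    return jsonify(message="Hello from a PaaS-hosted app")

@app.route("/health")
def health():
    # Health endpoints let the platform restart or replace unhealthy instances.
    return jsonify(status="ok")

if __name__ == "__main__":
    # Run the development server locally; on a PaaS the platform starts the app.
    app.run(host="0.0.0.0", port=8080)
```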

Infrastructure as a Service (IaaS): IaaS refers to the sharing of hardware resources for the execution of services using virtualization technology. Its primary goal is to make resources such as servers, networks, and storage quickly accessible to applications and operating systems. It therefore offers basic on-demand infrastructure services and uses Application Programming Interfaces (APIs) for interaction with hosts, routers, and switches, along with the ability to add new equipment in a simple and straightforward manner. In general, the customer does not manage the underlying hardware in the cloud infrastructure, but does have control over the operating systems, storage, and deployed applications. The service provider owns the hardware and is responsible for housing, running, and maintaining it, and the customer is typically charged on a per-use basis. Amazon Elastic Compute Cloud (EC2), Amazon S3, and GoGrid are examples of IaaS.
ve
(EC2), Amazon S3, and GoGrid are examples of IaaS.

The first and most fundamental layer of cloud computing is Infrastructure as a Service (IaaS). Renting IT infrastructure from a cloud provider, such as Microsoft Azure or Amazon Web Services, is known as "infrastructure as a service." Pricing is pay-as-you-go, so you only pay for what you use.

It is a cloud computing service where a provider gives customers access to


resources including networking, storage, and data servers. This indicates that
U

businesses are not required to handle something internally.

Infrastructure as a Service includes both hardware and networking components, such as servers, storage, network firewalls, and data centres. This means that companies and organisations can run their own platforms and applications on infrastructure that a service provider supplies.

Businesses can swiftly set up and operate test and development environments
thanks to IaaS. They may be able to launch new application solutions more quickly as
m

a result. Additionally, it enables firms to analyse huge data and better manage storage
requirements as the business expands. Finally, it makes backup system management
simpler.
)A

Businesses can benefit from renting IT infrastructure from cloud providers in a


number of ways, including:

◌◌ Reduced recurring costs as well as lower initial setup and management costs
for on-site data centres.
(c

◌◌ Improved business continuity, creativity, and disaster recovery.


◌◌ Improved stability, security, and dependability.
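As a hedged illustration of renting infrastructure on demand, the sketch below uses the AWS SDK for Python (boto3) to request a single virtual server and then release it. The AMI ID, region, and key-pair name are placeholders, and real accounts incur charges, so treat this as a sketch of the IaaS pay-as-you-go workflow rather than a ready-to-run script.

```python
# Sketch of the IaaS pay-as-you-go workflow with the AWS SDK (boto3):
# provision a virtual server, use it, then release it so billing stops.
# The AMI ID, region and key name are placeholders for illustration only.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

def provision_server():
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder machine image
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        KeyName="my-keypair",              # placeholder SSH key pair
    )
    instance = instances[0]
    instance.wait_until_running()
    instance.reload()                      # refresh attributes such as the IP
    return instance

def release_server(instance):
    instance.terminate()                   # stop paying for the capacity
    instance.wait_until_terminated()

if __name__ == "__main__":
    server = provision_server()
    print("Running at", server.public_ip_address)
    release_server(server)
```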


1.4 Cloud Computing Types


1.4.1 Types of Cloud Computing and Features

in
As shown in the diagram below, in the cloud deployment model, networking, platform, storage, and software infrastructure are supplied as services that scale up or down on demand. The cloud computing model comprises four core deployment paradigms, which are as follows:

O
ty
r si
ve
ni

Figure: Cloud Deployment Model


U

Private Cloud: "Private cloud" is a term some vendors use to describe offerings that emulate cloud computing on private networks. It is deployed within an organisation's internal enterprise datacenter. In a private cloud, the vendor's scalable resources and virtual applications are pooled together and made available for cloud users to share and use. It differs from the public cloud in that all cloud resources and applications, such as intranet functionality, are managed by the organisation itself. Because of its internal orientation, private cloud usage can be significantly more secure than public cloud usage, and only the organisation and its designated partners may access and work on a specific private cloud. Eucalyptus Systems is one of the best-known examples of a private cloud.

Public Cloud: The term "public cloud" refers to cloud computing in the traditional sense, in which resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications and web services, from an off-site third-party provider who shares resources and bills on a fine-grained utility computing basis. It is frequently based on a pay-per-use model, similar to a prepaid electricity metering system, that is flexible enough to cater for spikes in demand. Public clouds are less secure than other cloud models because they place an added burden on ensuring that all applications and data accessed on the public cloud are not subjected to malicious attacks. Microsoft Azure and Google App Engine are examples of public clouds.

Hybrid Cloud: A hybrid cloud is a private cloud linked to one or more external cloud services, centrally managed, provisioned as a single unit, and protected by a secure network. It provides virtual IT solutions by combining public and private clouds. A hybrid cloud provides more secure control of data and applications while allowing various parties to access information over the Internet. It also features an open architecture that allows integration with other management systems. A hybrid cloud can describe the pairing of a local device, for example a plug computer, with cloud services; it can also describe configurations that combine virtual and physical resources, for example a mostly virtualised environment that still requires physical servers, routers, or other hardware such as a network appliance acting as a firewall or spam filter. Amazon Web Services (AWS) is an example of a provider commonly used in hybrid cloud deployments.

ty
example of a Hybrid Cloud (AWS).

Community Cloud: A community cloud is infrastructure shared by several organisations for a common purpose; it may be managed by the organisations themselves or by a third-party service provider, and it is an occasionally used deployment model. Such communities are typically based on an agreement between related business organisations, for example banking or educational associations. The cloud environment in this model may be hosted locally or remotely. Facebook is an example of a community cloud.
r
an example of a Community Cloud.
ve
Furthermore, as technology advances, we can see derivative cloud deployment models emerge from various client demands and requirements. One such architecture is the virtual private cloud, in which a public cloud is used privately and is linked to the internal resources of the client's datacenter. With the development of high-speed network access technologies such as 2G, 3G, Wi-Fi, and WiMAX, together with feature phones, another derivative of cloud computing has emerged, commonly referred to as Mobile Cloud Computing (MCC).

It can be described as a combination of mobile technology and cloud computing infrastructure where data and the related processing happen in the cloud, with the exception that they can be accessed through a mobile device, hence the name mobile cloud computing. It is becoming a trend, and numerous organisations are quick to give their employees the ability to access office applications from a mobile phone anywhere. Recent technological advancements, such as the rise of HTML5 and other application development tools, have only increased the demand for mobile cloud computing, and the growing adoption of feature-rich phones has also expanded the MCC market.
reception has also increased the MCC market.

1.5 Cloud Computing Security Requirements, Pros and Cons,


)A

and Benefits
Many businesses (such as major enterprises) are likely to find that the infrastructure and data security in public cloud computing are less reliable than their own current capabilities. As a result of this presumably less safe and higher-risk security posture, there is also a higher chance that privacy may be violated. However, it should be emphasised that many small and medium-sized businesses (SMBs) have constrained IT and dedicated information security resources, which causes them to give this sector little attention. For these companies, a public cloud service provider's (CSP) level of security may be higher.
in
Even a seemingly minor data breach can have significant financial repercussions
(e.g., cost of incident response and potential forensic investigation, compensation for
identity theft victims, punitive penalties), as well as long-term effects like bad press

nl
and a loss of customer confidence. Despite the overused headlines, privacy concerns
frequently don’t match the degree of intrinsic risk.

What Is Privacy?

O
The idea of privacy varies greatly between (and occasionally even within)
nations, cultures, and legal systems. A succinct definition is difficult, if not impossible,
because it is formed by legal interpretations and societal expectations. The gathering,

ty
use, disclosure, storage, and destruction of personal data (or personally identifiable
information, or PII) are all covered by privacy rights or obligations. The accountability
of corporations to data subjects and the openness of their practises around personal
information are ultimately what privacy is all about.

si
Similar to this, there is no accepted definition of what personal information is.
The Organization for Economic Cooperation and Development (OECD) accepted the

r
following definition for the purposes of this discussion: any information relating to a
named or identifiable individual (data subject).
ve
Another definition that is gaining popularity is the one given in the Generally
Accepted Privacy Principles (GAPP) standard by the American Institute of Certified
Public Accountants (AICPA) and the Canadian Institute of Chartered Accountants
ni

(CICA): “The rights and obligations of individuals and organisations with respect to the
collection, use, retention, and disclosure of personal information.”

1.5.1 Cloud Computing Security Requirements


U

For cloud computing to succeed, it is essential to comprehend the security


and privacy issues and to design practical, workable solutions. Clouds’ distinctive
architectural qualities also create a number of security and privacy problems, even
ity

though they enable clients to minimise start-up costs, lower running costs, and boost
their agility by quickly obtaining services and infrastructure resources when needed.

The use of cloud computing in sensitive areas is hampered by a number of serious


security and privacy problems. Due to the dynamic nature of the cloud computing
m

architecture and the fact that hardware and software components of a single service in
this model transcend many trust domains, the move to this model exacerbates security
and privacy concerns.
)A

Settings for cloud computing are multi-domain environments where each domain
may use a separate set of security, privacy, and trust standards as well as perhaps
use a variety of processes, interfaces, and semantics. These domains might stand
in for independently enabled services or other infrastructure or application parts.
(c

Through service composition and orchestration, service-oriented architectures are a


technologically relevant tool to make such multi-domain formation possible.


Building a thorough policy-based management framework in cloud computing settings requires utilising existing research on multi-domain policy integration and safe service composition.

in
Clients that subscribe to the cloud will cede some control to a third party.

Customers of cloud computing are simultaneously interested and concerned by


the potential security risks that cloud computing poses. They are enthusiastic about

nl
the chances to cut down on capital expenses, absolve themselves of infrastructure
administration, and concentrate on key skills. Again, the agility provided by on-demand
compute provisioning and the ability to more easily match information technology with

O
business plans and demands make it more alluring to cloud clients. Users are worried
about the security dangers of cloud computing as well as the loss of direct control over
the systems for which they are still responsible.

ty
Cloud computing (CC) is a computer concept built on parallel computing,
distributed computing, and grid computing rather than a specific technology. The cloud
service provider (CSP) and the cloud service consumer are the two primary actors
implied by the CC business model (CSU).

si
The corporate software and data are kept on servers located far away, while
the CSPs provide applications over the internet that the CSUs access through

r
web browsers, desktop applications, and mobile apps. Three types of clouds are
distinguishable based on the service level offered: Infrastructure-as-a-Service (IaaS),
ve
Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). Security has been
identified as the biggest challenge for all three.

Cloud services have gradually become a key feature of modern computers,


ni

smartphones and tablets. Each leading company in the industry offers consumers its
own offer in this segment which significantly intensifies the competition in it. Cloud
technologies are a flexible, highly efficient and proven platform for providing IT services
over the Internet. However, their increasing use and the provision of cloud services
U

by providers creates a number of risks associated with ensuring the protection of


information.

In order to avoid ineffective and expensive operation, broken governance, hard to


ity

automate controls or procedures and poor security, sound foundations of principles,


structure and methodology must be established.

Systems engineering is a methodology for achieving integration by viewing the collection of components as a whole rather than as an assembly. Each component should be designed in a way that interoperates with the rest. The figure below shows the scope of systems engineering.


Figure: Systems Engineering Management Activities

The architecture of individual components is the main emphasis of IT architecture.


The main goal is to create an effective structure that can sustainably and long-term

ty
meet the needs or objectives of an organisation. It should promote stability, allow for
ongoing innovation, support corporate objectives, and assist in cost-cutting.

Architectural methods, engineering, and models have a bearing on security

si
architecture. Many new models are based on the System Security Engineering
Capability Maturity Model (SSE-CMM), which was developed in the early 2000s and
highlights the value of practising security engineering. These kinds of models can
r
be used as reference models for cloud computing, security engineering, security
ve
architecture, and security operations. A few models worth mentioning are:

●● The International Standards for Security are ISO 27001 and ISO 27006. They
cover management, ideal procedures, specifications, and methods.
●● The European cyber security agency’s suggestions for security risks when
ni

implementing cloud computing are provided by the European Network and


Information Security Agency (ENISA).
U

●● Business-focused Information Technology Infrastructure Library (ITIL) It is


organised around service life cycles and procedures and focuses on IT service
management.
●● A collection of universally recognised best practises for governance and control is
ity

called COBIT, or Control Objectives for Information and Related Technology.


●● Standards and recommendations for security engineering and architecture
used by non-government organisations are covered by the National Institute for
Standards and Technology (NIST).
m

A shared set of security requirements from various stakeholders can be brought


together with the aid of security architecture, which can then meet them collectively with
a better solution than can be offered to them separately.
)A

One of the difficulties raised in relation to cloud computing is the restriction of users' independence and creativity, as well as the development of a strong dependency on the cloud service provider. The Free Software Foundation's founding member, Richard Stallman, claims that "cloud computing undermines personal freedom since users give their personal data to a third party." He has concerns about software vendors forcing people to utilise particular platforms and systems. Stallman stated that, because access to internal corporate networks must be rigorously controlled, cloud services will be challenging to implement in sectors like defence, government
institutions, e-services, etc.

in
Data security is a different issue that comes up when using cloud-based IT
services. It entails growing reliant on the supplier and the security features it provides
for transferring and storing data. The likelihood of information leaks to rival businesses

nl
or malicious persons, the best way to handle a potential provider change, what would
happen in the event of a system breakdown, and other issues come up as they relate to
future hosting conditions.

O
A service portfolio provides a commercial value assessment of the cloud service
provider’s offerings. It is a dynamic strategy for controlling financial value-based
investments in service management throughout the entire organisation. Managers can
evaluate the expenses and associated quality requirements using Service Portfolio

ty
Management (SPM).

Access Control

si
This reflects how strictly the system restricts unauthorised parties from accessing
its resources (e.g. human users, programs, processes, devices). The need to identify
parties who want to engage with the system, confirm that the parties are who they claim
r
they are, and grant them access to only the resources they are authorised to access is
ve
therefore addressed by access control security requirements.

Almulla and Yeun(2013) outline the many issues, the present authentication, and
authorization practises for CSU’s cloud access, while Sato et al. make use of the new
Identity and Access Management (IAM) protocols and standards.
ni

IAM are strategies that impose rules and policies on CSUs using a variety of
tactics, such as enforcing login passwords, giving CSUs access, and provisioning
CSU accounts, to offer an acceptable level of protection for company resources and
U

data. It could be beneficial to utilise a model-driven approach for designing the security
needs because it converts the security objectives into enforceable security regulations.
Zhou et al. explain this method, which enables CSUs to declare security intentions in
system models that allow for the straightforward characterization of security needs.
ity

A 3D model might be used to create some of the specifications, with the data’s value
being first determined and then placed appropriately in nested “protective rings.” The
Security Access Control Service (SACS), a different model, combines the Security API
(which ensures safe use of the services after gaining access to the cloud) and Cloud
m

Connection Security (for those CSUs who want to request cloud services), as well as
Access Authorization (for those CSUs who want to do so).

Utilizing a combination of symmetric and asymmetric encryption as well as


)A

capability-based access control are other methods for threat prevention. When it comes
to personal cloud computing, for instance, prominent cloud services like Amazon EC2
and Azure address security threats and requirements by regulating access to platform
services based on permissions encoded in cryptographic capability tokens.
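The following is a minimal, hypothetical sketch of the role- and capability-based idea described above: each user is mapped to roles, each role to permitted actions on cloud resources, and every request is checked against that mapping before it reaches the service. Real IAM systems (and the S-RBAC model discussed below) are far richer; the role names, resources, and actions here are illustrative only.

```python
# Minimal sketch of role-based access control (RBAC) for cloud resources.
# Users map to roles, roles map to allowed actions; a request is granted
# only if one of the caller's roles permits the action on that resource type.
ROLE_PERMISSIONS = {
    "admin":     {("vm", "create"), ("vm", "delete"), ("bucket", "read"), ("bucket", "write")},
    "developer": {("vm", "create"), ("bucket", "read")},
    "auditor":   {("bucket", "read")},
}

USER_ROLES = {
    "alice": {"admin"},
    "bob":   {"developer"},
    "carol": {"auditor"},
}

def is_authorized(user, resource_type, action):
    """Return True if any of the user's roles allows the action."""
    roles = USER_ROLES.get(user, set())
    return any((resource_type, action) in ROLE_PERMISSIONS.get(r, set()) for r in roles)

if __name__ == "__main__":
    print(is_authorized("bob", "vm", "create"))      # True
    print(is_authorized("bob", "vm", "delete"))      # False
    print(is_authorized("carol", "bucket", "read"))  # True
```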
(c

Three factors need to be considered while establishing the access control


requirements:


◌◌ The method of access to the cloud.


◌◌ The architecture of the cloud.
◌◌ The features of the multi-tenant environment

in
1. The CSUs typically access cloud environments through a web application, which
is frequently thought of as CC’s weakest link. Because browsers can’t generate
XML-based security tokens on their own, the current browser-based authentication

nl
protocols for the cloud are insecure. Technical solutions are put out to get around
these challenges, such as encrypting data while it is kept in the care of a cloud
service provider or while it is sent to a CSU.

O
2. Virtual machine (VM) instance interconnection is one of the greatest problems with
the cloud’s architecture. Isolation, which ensures that one VM cannot harm another
VM running on the same host, is a major issue in virtualization. One VM may be
unlawfully accessed through another VM when numerous VMs are present on

ty
the same hardware (which is typical for clouds). The Virtual Network Framework,
which consists of three layers (routing layer, firewall, and shared network) and tries
to regulate intercommunication among VMs installed in physical computers with

si
improved security, is a way to stop this from happening.
3. In order to prevent potential issues caused by role name conflicts, cross-level
management, and the makeup of tenants’ access control, needs should be aligned
r
to the specific context of the multi-tenant environments. The SaaS Role Based
ve
Access Control (S-RBAC) model, the reference architecture outlined in, and the
reference architecture including the idea of “interoperable security” are solutions
that satisfy these needs. These tools make it easier to distinguish between “home
clouds” and “foreign clouds.” When a CSP cannot meet demand with its available
ni

resources, it sends federation requests to “foreign clouds” in an effort to take use of


those clouds’ virtualization infrastructures.

Attack/Harm Detection


U

This speaks to the demands for detection, recording, and warning whenever
an assault is launched and/or is successful. Four categories of solutions are now
put forward. Cloud firewalls are a part of the first category and serve as the attack
ity

prevention filtering method. Their key strength is their dynamic and intelligent
technology, which fully utilises the cloud to collect and distribute threat information
in real time. Unfortunately, there is still no order to the cloud security standards used
by cloud firewalls, with providers competing with one another and adopting different
standards.
m

The second group speaks about a framework for measuring security that is
appropriate for SaaS. The framework in, for instance, seeks to ascertain the status of
)A

user-level apps in guest VMs that have been running for some time.

The third group of solutions includes cloud community watch services, which
constantly analyse data from millions of CSUs to find newly inserted malware threats.
Community services receive greater web traffic than individual CSPs and can therefore
make use of more defences.
(c

The fourth and final category includes multi-technology strategies, such as the cloud-security-based intelligent Network-based Intrusion Prevention System (NIPS), which combines four key technologies to block visits based on real-time analysis: active defence technology, linkage technology with the firewall, a synthesis detection method, and a hardware acceleration system.

in
Integrity
This relates to how well various components are guarded against unauthorised
and purposeful corruption. Data integrity, hardware integrity, personal integrity, and

nl
software integrity are the four components of integrityDifferent models are deployed by
multi-model techniques in response to security-related features. In, a data tunnel and a
cryptography model team up to provide data security during storage and transmission,

O
whereas in, five models each deal with separation, availability, migration, data
tunnelling, and cryptography.

Last but not least, by raising the standards for VM security, VM-focused techniques

ty
ensure integrity. In systems where many VMs are co-located on the same physical
server, a malicious user in control of one VM may attempt to take over the resources
of other VMs, exhaust all system resources, target other VM users by depriving them of

si
resources, or steal server-based data. Jasti et al. investigate 2010 how such co-existing
VMs can be used to access other CSU’s data or disrupt service and provide useful
security measures that can be used to prevent such attacks.

r
When SaaS applications in the cloud need to contact business on-premises
ve
applications for data interchange and on-premises services, integrity issues in the cloud
may occur. Implementing a proxy-based firewall/NAT traversal solution that maintains
the security of on-premise applications while enabling SaaS applications to interface
with them without requiring firewall reconfiguration can solve these problems. A way for
the CSU or the CSP to check whether the record has been updated in the cloud storage
ni

is also required, in addition to safeguarding integrity during the data transmission


process. This could involve the usage of authenticate code and digital signatures, which
are very effective and resistant to Byzantine failure, malicious data alteration assaults,
U

and even server collusion attacks.

Security Auditing
ity

By examining security-related events, security staff will be able to assess the status
and application of security systems. Typically, this is done to ensure accountability
and control as well as compliance with laws and regulations. Security configuration
management and vulnerability assessment seem to be the foundations of approaches
to security audibility needs. For instance, this method evaluates the vulnerabilities of
m

each virtual machine (VM) in an infrastructure and then combines these results into a
general evaluation of the multitier infrastructure’s vulnerabilities using attack graphs.
For security audits, a logging structure like the one suggested in might be very helpful.
)A

The data protection scheme with public auditing described in offers an alternative
strategy and provides a means to enable data encryption in the cloud without
compromising accessibility or functionality for authorised parties. We discovered
research that cover the auditability needs for IaaS and PaaS in addition to SaaS
(c

security auditability. To help with security evaluation and improvement at the IaaS layer,
a Security Model for IaaS (SMI) is proposed. Presents an infrastructure that combines
dynamic web service policy frameworks and semantic security risk management

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 37

tools on the PaaS layer to facilitate the mitigation of security threats. The platform
Notes

e
takes care of the requirements for dynamically provisioning and configuring security
services, modelling security requirements, and connecting operational security events
to vulnerability and effect analyses at the business level.
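To show what the kind of logging structure referred to above might capture, here is a small, hypothetical sketch that records security-relevant events as structured JSON lines carrying the fields an auditor typically needs (timestamp, actor, action, resource, outcome). The field names and log file path are illustrative assumptions, not part of any specific framework cited in the text.

```python
# Minimal sketch of structured security-event logging to support audits:
# each event is one JSON line recording who did what, to which resource,
# when, and with what outcome. Field names are illustrative.
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("cloud.audit")
audit_logger.setLevel(logging.INFO)
audit_logger.addHandler(logging.FileHandler("audit.log"))

def record_event(actor, action, resource, outcome):
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
        "outcome": outcome,          # e.g. "allowed", "denied", "error"
    }
    audit_logger.info(json.dumps(event))

if __name__ == "__main__":
    record_event("alice", "vm.start", "vm-42", "allowed")
    record_event("bob", "bucket.delete", "backups", "denied")
```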

in
Privacy
In order to protect sensitive information from being obtained by unauthorised

nl
parties. It incorporates anonymity and secrecy, two features of privacy. It is crucial
to identify relevant, understandable, and worthwhile privacy metrics in order for
privacy needs to be clearly specified. At the moment, ways to improve privacy include

O
separating the data from the software used by CSUs and using cloud-based virus
scanners to obscure the relationship between data elements.

Non-repudiation

ty
This subarea contains rules for preventing one of the parties to a cloud interaction
from rejecting the interaction. In order to locate the origin of the CSUs, the authors
apply a technique. It makes it possible to capture visitor information and makes it

si
exceedingly difficult for CSUs to misrepresent visitors’ identity information. The multi-
party non-repudiation (MPNR) protocol is another option because it offers a fair non-
repudiation storage cloud and guards against roll-back attacks.
r
ve
1.5.2 Cloud Computing - Pros, Cons and Benefits
Today, cloud computing has completely taken over the IT industry. No of their size,
businesses are migratingtheir current IT infrastructure to the public cloud, their own
private cloud, or a hybrid cloud that combines the finest aspects of both.
ni

However, a small minority of sceptics continue to wonder about the benefits and
drawbacks of cloud computing, as well as whether they should migrate to the cloud
U

simply because others are doing so.

Well, they should undoubtedly assess their current IT infrastructure, consider


any workload or application restrictions, and assess whether their current issues and
limitations will be resolved or eliminated by the cloud.
ity

Benefits of Cloud Computing


For a wide range of applications and advantages, cloud computing is a great
choice for businesses wishing to implement it in their workplaces. The advantages are
m

the key reason why big businesses started using cloud computing early on and why
more industries will do so in the near future.

1. Time Saving: With cloud computing, we can save time with just a few clicks. A PC with an Internet connection is all that is needed to reach our data from any location.
2. Inferior computer cost: Applications that run on the cloud do not require a powerful machine to run them. While running on the cloud, apps don't require the computing power, hard disc space, or even a DVD drive that typical desktop software does.
3. Enhanced presentation: The performance of the computer will be better if fewer apps are simultaneously consuming its memory. Because they have fewer processes and programmes loaded into memory, desktop PCs that make use of cloud-based services might boot and operate more quickly.
4. Abridged software costs: Cloud computing programmes are frequently available for download for free, as opposed to having to pay for them. The consumer version of Google Docs is one illustration.
5. Immediate software updates: A cloud-based application's updates often happen immediately and are accessible after logging in. The most recent version of a web-based application is typically immediately accessible without the requirement for an upgrade.
6. Disaster recovery: Companies won't require elaborate disaster recovery strategies once they begin utilising cloud-based services. The majority of problems are handled quickly and efficiently by cloud computing companies.
7. Enhanced document format compatibility: Any other user that accesses the web-based programme has access to all documents created by it. When all users share documents and apps in the same cloud, there are fewer format mismatches.
8. Limitless storage capacity: Cloud computing has nearly infinite storage capacity. The hard disc space on a PC today is insignificant in comparison to the cloud's storage capacity. However, keep in mind that large-scale storage is typically not free, even in a cloud environment.
9. Better data reliability: A computer catastrophe in the cloud shouldn't have an impact on the storage of data, unlike desktop computing, where a hard disc crash might obliterate personal data. This is because cloud services normally offer numerous layers of security.
10. Universal document access: Documents remain in the cloud and are accessible from any location that has an Internet connection and a device that is capable of using the Internet. Documents are instantly accessible from anywhere, eliminating the need to carry them with you when you travel.
11. Newest version availability: A document modified from one location, such as your home, is identical to the document accessed from another location (e.g. at work).
12. Easier cluster cooperation: Users can simply collaborate with one another on documents and projects. An Internet connection is all that is required to collaborate because the documents are stored in the cloud rather than on individual PCs.
13. Domestic device independence: Applications that can be used continue to be accessible whether switching computers or switching to a portable device. There is no requirement to save a document in a format specific to a given device or to use a special version of a programme for that device.
14. Shrink expenses on technology infrastructure: Maintain simple access to your data with little up-front cost. Pay as you go (weekly, quarterly, or annually), as necessary.
15. Developing accessibility: Our lives are made so much easier by our access at any time and from anyplace.

This is a new age in which we are moving from the age of iron and stone to the age of technology, and we participate on the biggest platform of all - the Internet. The cloud is not merely a figure of speech, something floating above our heads; it is a real technology that combines data to open doors for our future generations. Together we upload pictures, make transactions, learn, chat, post blogs, and stay socially connected with our colleagues.

With this feature, the Internet has given a boon to its users by providing clouds. With the advancement of technology and growing user needs, it is no surprise that cloud computing has become the talk of the town. Even for a person who has little knowledge of the Internet, the cloud is enormously effective. People are now understanding and grasping how useful and beneficial the cloud can be for them.

Pros in Cloud Computing
1. Dumping the costly systems: Business owners can spend the least amount of money possible on system management thanks to cloud hosting. Since everything can be done on the cloud, there is little or no need for local systems, which helps save money that would have been spent on expensive equipment.
2. Providing various options to access: The ability to access the cloud for several purposes without relying solely on a computer makes it the most widely used technology today. The cloud is accessible outside of the office via mobile devices like iPods and tablets, making work simple and effective for users. In addition to improving efficiency, it also improves the services offered to customers. With a single touch, the consumer has access to the desired files and documents.
3. No Software maintenance Expense: The design of cloud computing eliminates the need for any additional software, saving users from having to invest in pricey software systems for their companies. The user can relax because all the practical software is already stored on cloud servers. This removes the barrier entirely for customers who cannot afford expensive software and its licence costs. Another popular benefit is periodic software upgrades, which help businesses save time and money.
4. Pre Processed Platform: Adding a new user does not require configuring, organising, or installing anything on a new device. There is no need to modify the platform in order to add a new user or application because cloud applications can be used without modification.
5. Say No to Server: Using the cloud for business eliminates the significant server cost burden. Up to a point, the additional cost associated with server maintenance is eliminated.
6. Centralization of Data: The most amazing features of the Cloud include centralising all the data from various projects and enabling one-click access to data from faraway locations.
7. No data is lost, can be recovered easily: Since all of the data is automatically backed up on the cloud, cloud computing allows for quick data recovery. In personal business servers, data recovery is very expensive or not possible, wasting a lot of time and money.
8. Capable for sharing: Document accessibility also brings easy sharing. Anytime you want, you can send the documents to a friend or colleague via email.
9. Secure Store and Accessibility: Users' data is accessible and stored in a secure area thanks to cloud services; there is no risk of data loss or corruption. Users can thus store the data without thinking about it.

Cons in Cloud Computing
1. No Internet No cloud: An internet connection is required to access the cloud. We

si
can therefore conclude that it is a significant barrier to the expansion of cloud usage
globally.
2. Need a good Bandwidth: We are unable to take full advantage of the benefits of the
r
clouds due to insufficient internet connectivity. Again, high latency can cause poor
ve
performance even with a high capacity satellite connection. It is therefore quite
challenging to achieve the simultaneous availability of all these items.
3. Accessing multiple things simultaneously can affect Quality of cloud access: A
user must give up some of the benefits of the cloud if they want to use numerous
ni

internet connections at once. Good Internet speed is required for cloud accessibility,
therefore if a user is viewing video or listening to audio while using the cloud, his
experience may be subpar.
U

4. Security is Must: Your data is secure when stored in the cloud, but for the best
experience, support from an IT business is required to maintain comprehensive
security. If a person uses the cloud without the right support and information, their
ity

company may be exposed to hackers.


5. Hard and Non-negotiable Agreements: Some cloud vendors have rigid, non-
negotiable agreements with business personnel that cause the businessmen to
hesitate before using the cloud.
m

6. Cost Comparison with traditional Utilities: When compared to the cost of installing
software in homes the traditional way, software that is available in the cloud may
appear to be a more cost-effective solution. Software that is accessible from the
)A

cloud may incur additional fees that do not apply when it is installed on-site or used
at home. As a result, offering a lot of features at once forces customers to consider
their entire requirement and free cloud availability because you are charged for the
more features.
7. Problems to programmers with no Hard Drive: A hard drive is a necessity for
(c

programmers, thus keeping one would be a waste of money if there was no need

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 41

for storage in the system. On the other hand, programmers must have a hard drive
Notes

e
for their own use.
8. Lack of full support: The dependability of cloud-based services is not perfect. When

in
a user experiences a difficulty while using a cloud service, they are not assisted
in accordance with the fees they are charged; instead, they are forced to rely on
the FAQ and user manual, which are inadequate to address the issues they are

nl
experiencing. Because there is a lack of transparency in the service, users may be
hesitant to rely only on cloud services.
9. Incompatibility in software: The incompatibility of the programme with cloud

O
applications causes issues on occasion. This is because some programmes,
devices, and software are only compatible with personal computers, and when such
computers are connected to the internet, they don’t completely support that feature.
10. Lack of insight into your network: It is undoubtedly true that cloud computing

ty
companies give you access to CPU, application, and disc usage. However, it limits
the user’s understanding to their own network. Therefore, it is impossible to correct
a bug in the code, a hardware issue, or anything else without first identifying the

si
issue.
11. Minimum flexibility: On a distant server, the application and services are running.
Because of this, businesses employing cloud computing have little control over how
r
the hardware and software work. The remote software prevents the applications
ve
from ever being run locally.
ni
U
ity
m
)A
(c

Amity Directorate of Distance & Online Education


42 Data Warehousing and Mining

Case Study
Notes

e
Cloud as Infrastructure for an Internet Data Center (IDC)

in
Internet portals invested a significant amount of money in the 1990s to draw
users. Their market value was dependent on the number of unique “hits,” or visitors,
as opposed to earnings and losses. This approach worked effectively as these portals
started to provide paid services to users as well as advertising options aimed at their

nl
installed user base, raising revenue per capita in a theoretically unlimited growth curve.

Similar to this, Internet Data Centers (IDC) have developed into a strategic plan for

O
Cloud service providers to draw clients. An IDC would transform into a portal drawing
more applications and more users in a virtuous loop if it reached a critical mass of
people using computer resources and applications.

Two crucial elements will determine how the future generation of IDC is developed.

ty
The expansion of the Internet is the first. For instance, China had 253 million Internet
users at the end of June 2008, with a 56.2% annual growth rate. As a result, more
storage and servers are needed by Internet carriers to accommodate consumers’

si
growing demands for traffic and storage capacity on the network. The growth of mobile
communication is the second. The total number of mobile phone subscribers in China
reached 4 billion by the end of 2008. Server-based computing and storage are driven
r
by the growth of mobile communication, giving customers Internet access to the data
and computer services they require.
ve
How can we create a new IDC with core competency in the era of tremendous
Internet and mobile communication expansion?

In light of the whole business integration of fixed and mobile networks, cloud
ni

computing offers a cutting-edge business model for data centres, consequently


assisting telecom operators to foster business innovation and greater service
capabilities.
U

The Bottleneck on IDC Development


Traditional IDCs offer highly homogenised goods and services. Value-added
ity

services only provide a small portion of the revenue in almost all IDCs, while basic
collocation services account for the majority of it. For instance, a telecom operator’s
IDC reports that hosting services account for 90% of its income but value-added
services only account for 10%. The demands of customers for load balancing, disaster
recovery, data flow analysis, resource utilisation analysis, and other services cannot be
m

met as a result.

Energy consumption is high and brings substantial operating


expenses. According to CCID research statistics, IDC firms’ energy expenditures
)A

account for over 50% of their operational expenses, and adding additional servers will
result in a sharp spike in the related power consumption. IDC firms will have to deal
with a significant growth in power usage as their businesses expand due to the rise
in Internet users and enterprise IT transformation. If prompt action is not done to find
appropriate solutions, the high costs will jeopardise the long-term growth of these
(c

businesses.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 43

Additionally, as Web 2.0 sites and online games gain popularity, all forms of
Notes

e
content, including audio, video, photos, and games, will require a significant amount
of storage and the appropriate infrastructure to facilitate transmission. As a result, the
demand for IDC services from businesses will steadily climb, and service levels and

in
resource utilisation efficiency in data centres will be held to higher standards.

The market competitiveness intensifies under the full service operating model that

nl
evolved with the restructuring of telecom operators. Higher standards are placed on
telecom IDC operators as a result of the merging of fixed network and mobile services
since they must quickly roll out new services to keep up with consumer demand.

O
Cloud Computing Provides IDC with a New Infrastructure Solution
IDC is given a solution by cloud computing that takes into account both current
development needs and future development objectives. When using the cloud, you can

ty
create a resource service management system where physical resources are used as
input and virtual resources are produced at the appropriate time, in the right quantity,
and in the correct quality. The resources of IDC centres, including servers, storage, and

si
networks, are combined into a vast resource pool through cloud computing, thanks to
virtualization technologies.

Administrators may dynamically monitor, schedule, and deploy all of the resources
r
in the pool using a cloud computing management platform, then make them available to
ve
users via the network. A single resource management platform can result in improved
IDC operation, scheduling effectiveness, resource utilisation in the centre, and
decreased management complexity. The timely release of new services is ensured by
the automatic resource deployment and software installation, which can also shorten
the time-to-market. Customers who rent from data centres can use the resources
ni

according to their professional requirements.

Additionally, companies are permitted to timely modify the resources that they
U

rent and pay fees based on resource utilisation as needed by business development
needs. IDC is more appealing when it has variable charging modes like this. IDC
growth benefits from management through a single platform as well. The existing
cloud computing management platform can be expanded to accommodate additional
ity

resources when an IDC operator requires them, allowing them to be managed and
deployed consistently.

Software upgrades and the addition of new features and services will become
a continuous process thanks to cloud computing, which may be accomplished using
m

intelligent monitoring and an automatic installation programme rather than manual


labour.

The Long Tail Theory claims that: Cloud computing creates infrastructures
)A

depending on the size of the market head and offers plug-and-play technological
infrastructure with marginal management expenses that are almost nil in the market tail.
With flexible costs, it is able to satisfy a wide range of needs. By utilising cutting-edge
IT technology in this way, the Long Tail’s effect of maintaining low-volume production
of varied goods is realised, creating a market economy model that is competitive and
(c

supportive of the survival of the fittest.

Amity Directorate of Distance & Online Education


44 Data Warehousing and Mining

The Value of Cloud Computing for IDC Service Providers


Notes

e
1. IDC is versatile and scalable and can exploit the Long Tail effect at a reasonable
cost because to cloud computing technology. The platform for cloud computing

in
enables the development and introduction of new goods at a low marginal cost of
management. As a result, initial expenses for new businesses can be practically
eliminated, and the resources are not constrained to a specific class of goods or

nl
services. In order to best utilise the Long Tail, operators can considerably expand
their product lines within a given investment scope and provide a variety of services
by automatically allocating resources.

O
2. When business demands are at their greatest, the dynamic architecture of cloud
computing can deploy resources in a flexible manner. For instance, during the
Olympics, a tonne of people visit the websites dedicated to the events. The cloud
computing technology would temporarily deploy more idle resources to support the

ty
demand on resources during peak hours to solve this problem. The USOC has
utilised the cloud computing tools offered by AT&T to enable competition viewing
throughout the Olympics. In addition, the peak times for resource demands include
holidays, application and inquiry days for exams, and SMS and phone calls.

si
3. The return on investment for IDC service providers is increased by cloud computing.
Cloud computing technology can lower the price of computing resources, electricity
r
consumption, and human resource expenses by increasing the usage and
management efficiency of resources. Additionally, it may result in a new service
ve
having a quicker time to market, assisting IDC service providers in gaining market
share.
Additionally, cloud computing offers a cutting-edge pricing method. IDC service
ni

providers bill clients depending on the terms of resource rental, and clients only have to
pay for what they really utilise. This increases payment charge transparency and may
draw in additional clients.
U

Table: Value comparison on co-location, physical server renting and IaaS for providers

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 45

The Value Brought by Cloud Computing for IDC Users


Notes

e
1. It is possible to cut startup costs, ongoing expenses, and hazards. IDC customers
are not required to purchase pricey software licences or expensive hardware up

in
front. Instead, users only need to rent the hardware and software resources they
actually need, and they only have to pay for usage-related fees. More and more
subject matter specialists are starting to create their own websites and information

nl
systems in the age of business informationization. These businesses can achieve
digital transformation with the aid of cloud computing with relatively little expenditure
and fewer IT staff.

O
2. A platform for automatic, streamlined, and unified service management can help
customers quickly meet their growing resource needs and obtain the resources they
need on time. Customers can increase their responsiveness to market demands
and promote business innovation in this way.

ty
3. IDC users have quicker requirement response times and access to more value-
added services. The IDC cloud computing unified service delivery platform enables
users to submit customised requests and take advantage of numerous value-added

si
services. Additionally, their requests would receive a prompt answer.

Table: Value comparison on co-location, physical server renting and IaaS for users
m

An IDC Cloud Example


As an illustration, an IDC in Europe provides business services to clients in the
)A

sports, government, banking, automotive, and healthcare industries in four nearby


nations.

IDC places a high value on cloud computing technologies with the goal of creating
a data centre that is adaptable, demand-driven, and quick to respond. It has made the
decision to create a number of cloud centres throughout Europe using cloud computing
(c

technology. Virtual SAN and the newest MPLS technologies connect the first five data
centres. Additionally, the centre complies with the ISO27001 security standard, and

Amity Directorate of Distance & Online Education


46 Data Warehousing and Mining

other security activities, such as auditing functions offered by certified partners, that are
Notes

e
required by banks and government institutions are also implemented.

ty
si
Figure: IDC cloud

r
Customers in the IDC’s sister sites are served by the main Data Center. This
IDC will be able to pay for fixed or usage-based variable services via a credit card bill
ve
thanks to the new cloud computing facility. The management of this hosting facility will
eventually include even more European data centres.

Summary
ni

●● Applications for SaaS cloud computing have a lower total cost of ownership. SaaS
programmes don’t require significant capital expenditures for licences or support
infrastructure. The price of cloud computing is constantly changing, thus it is
U

crucial to review current pricing.


●● Although the cloud offers advantages and disadvantages, it is now a necessary
component of every corporate endeavour. Therefore, it is impossible to think
ity

or grow without using cloud computing. The issue with cloud computing can be
reduced with proper study and safety measures. As it is undeniably true, one of
the rapidly-evolving technologies that has astounded the corporate world is cloud
computing. The advantages of cloud computing outweigh the drawbacks. Even
with so many drawbacks, cloud computing continues to gain popularity due to its
m

outstanding benefits, which include low costs, simple access, data backup, data
centralization, sharing capabilities, security, free storage, and speedy testing. The
improved dependability and adaptability make the case even stronger.
)A

●● The field of data warehousing has expanded dramatically during the past
ten years. To aid in business decision-making, many firms are either actively
investigating this technology or are using one or more data marts or warehouses.
●● Data warehousing is tremendously advantageous for business people to use
(c

cloud services since they enable them to do away with the need for technicians
to support and manage some of the most coveted new IT technologies, such as
highly scalable, variably provided systems.
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 47

●● Cloud computing is a model for enabling ubiquitous, convenient, on-demand


Notes

e
network access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be quickly
provisioned and released with little management effort or service provider

in
interaction.
●● Applications and services made available through the Internet are referred to as

nl
cloud computing. These services are offered via data centres located all over the
world, collectively known as the “cloud”.
●● The most recent development in Internet-based computing is called cloud

O
computing, and it is a very successful paradigm of service-oriented computing.
An approach to providing simple, on-demand network access to a shared pool
of reconfigurable computer resources or shared services is known as cloud
computing (e.g., networks, servers, storage, applications, and IT services).

ty
●● Benefits of business cloud computing: a) Cost savings, b) Security, c) Business
resilience and disaster recovery, d) Flexibility and innovation.
●● Database-as-a-Service (DBaaS) is a service that supports applications and

si
is administered by a cloud operator (public or private), freeing the application
team from having to handle routine database management tasks. Application
developers shouldn’t be required to be database professionals or to pay a
r
database administrator (DBA) to maintain the database if there is a DBaaS
ve
available.
●● Today’s cloud users simply operate without having to consider server instances,
storage, or networking. Cloud computing is made possible through virtualization,
which automates many of the labor-intensive processes associated with
ni

purchasing, setting up, configuring, and managing these features.


●● All physical database administration chores, such as backup, recovery, managing
the logs, etc., are managed by the cloud provider. The developer is in charge of
U

the database’s logical administration, which includes table tuning and query
optimization.
●● Database as a Service (DaaS) (DBaaS) is a technical and operational method
ity

that enables IT companies to provide database functionality as a service to one


or more customers. There are two use-case situations in which cloud database
products meet an organization’s database demands.
●● MaaS-based management services provide better flexibility and lower installation
m

costs than any enterprise architecture. MaaS enables enterprises to fully leverage
the benefits of their converged network by recognising and resolving problems
faster, more precisely, less expensively, and with greater visibility than silos alone.
)A

●● Testing is now valuable for software organisations in order to decrease costs and
improve services based on customer needs. Testing is critical for increasing user
happiness while minimising costs. As a result, firms must invest in people, the
environment, and tools while reducing a specific percentage of their budget. The
most important thing to remember is that quality should never be compromised at
(c

any cost.

Amity Directorate of Distance & Online Education


48 Data Warehousing and Mining

●● Software testing as a Service (STaaS) is an outsourced working paradigm in which


Notes

e
test actions are outsourced to a third party that focuses on reproducing real-life
test environments based on the needs of the purchaser.

in
●● TaaS, or Testing as a Service, is an outsourcing model in which software testing is
performed by a third-party service provider rather than by organisation employees.
TaaS testing is performed by a service provider who specialises in mimicking real-

nl
world testing environments and locating defects in software products.
●● Benefits of TaaS: a) Reduced costs, b) Pay as you go pricing, c) Less rote
maintenance, d) High availability, e) High flexibility, f) Less-biased testers, g) Data

O
integrity, h) Scalibility.
●● Storage as a service (STaaS) is a managed service in which the provider provides
access to a data storage platform to the consumer. The service can be offered on-
premises using infrastructure dedicated to a single customer, or it can be delivered

ty
from the public cloud as a shared service acquired on a subscription basis and
invoiced based on one or more consumption indicators.
●● Advantages of STaaS: a) Storage costs, b) Disaster recovery, c) Scalability, d)

si
Syncing, e) Security.
●● A mainframe is a robust computer that frequently acts as the primary data store

r
for an organization’s IT infrastructure. It communicates with users through less
capable hardware, such as workstations or terminals. By consolidating data into a
ve
single mainframe repository, it is simpler to manage, update, and safeguard data
integrity.
●● A fast local area network (LAN) is used to connect a few computer nodes
(personal computers employed as servers) that are available for download in
ni

the approach to computer clustering. The “clustering middleware,” a software


layer placed in front of nodes that enables users to access the cluster as a whole
through the use of a Single system image concept, coordinates computing node
U

activities.
●● A cluster is a sort of parallel or distributed computer system that consists of a
group of connected, independent computers that act as a highly centralised
ity

computing tool that combines separate machines, networking, and software in one
system.
●● Grid computing is a processor architecture that integrates computing power from
other disciplines to accomplish the main goal. Grid computing allows network
m

computers to collaborate on a job and act as a single supercomputer.


●● Types of cloud computing: a) Private cloud, b) Public cloud, c) Hybrid cloud, d)
Community cloud.
)A

●● The gathering, use, disclosure, storage, and destruction of personal data (or
personally identifiable information, or PII) are all covered by privacy rights or
obligations. The accountability of corporations to data subjects and the openness
of their practises around personal information are ultimately what privacy is all
about.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 49

●● The use of cloud computing in sensitive areas is hampered by a number of


Notes

e
serious security and privacy problems. Due to the dynamic nature of the cloud
computing architecture and the fact that hardware and software components of a
single service in this model transcend many trust domains, the move to this model

in
exacerbates security and privacy concerns.
●● Cloud services have gradually become a key feature of modern computers,

nl
smartphones and tablets. Each leading company in the industry offers consumers
its own offer in this segment which significantly intensifies the competition in it.
Cloud technologies are a flexible, highly efficient and proven platform for providing
IT services over the Internet.

O
●● Benefits of cloud computing: a) Time saving, b) Inferior computer cost, c)
Enhanced presentation, d) Abridged software costs, e) Immediate software
updates, f) Disaster recovery, g) Enhanced document format compatibility, h)

ty
Limitless storage capacity, i) Better data reliability, j) Universal Document Access,
k) Newest version availability, l) Easier cluster cooperation, m) Domestic device
independence, n) Shrink expenses on technology infrastructure, o) Developing
accessibility.

si
Glossary
●●
r
HDRaaSTM: Hybrid-Disaster-Recovery-as-a-Service.
ve
●● DBaaS: Database-as-a-Service.
●● DBA: Database Administrator.
●● SaaS: Software as a Service.
ni

●● SLA: Service Level Agreement.


●● QoS: Quality of Service.
●● MaaS: Management as a Service.
U

●● STaaS: Software Testing as a Service.


●● ERP: Enterprise Resource Planning.
ity

●● WWW: World Wide Web.


●● API: Application Programming Interface.
●● TaaS: Testing as a Service.
●● STaaS: Storage as a Service.
m

●● AWS: Amazon Web Service.


●● HPE: Hewlett Packard Enterprise.
)A

●● ARPANET: Advanced Research Projects Agency.


●● LAN: Local Area Network.
●● VM: Virtual Machine.
(c

●● AJAX: Asynchronous JavaScript and XML.


●● RSS: Really Simple Syndication.

Amity Directorate of Distance & Online Education


50 Data Warehousing and Mining

●● ASP: Application Service Providers.


Notes

e
●● MCS: Main Consistence Server.
●● DCS: Domain Consistence Server.

in
●● CRM: Customer Relationship Management.
●● PaaS: Platform as a Service.

nl
●● IaaS: Infrastructure as a Service.
●● EC2: Elastic Cloud Computing.

O
●● MCC: Mobile Cloud Computing.
●● SMBs: Small and Medium-sized Businesses.
●● CSP: Cloud Service Provider.

ty
●● OECD: Organization for Economic Cooperation and Development.
●● GAAP: Generally Accepted Privacy Principles.

si
●● AICPA: American Institute of Certified Public Accountants.
●● CICA: Canadian Institute of Chartered Accountants.
●● SSE-CMM: the System Security Engineering Capability Maturity Model.
●●
r
ENISA: European Network and Information Security Agency.
ve
●● ITIL: Information Technology Infrastructure Library.
●● COBIT: Control Objectives for Information and Related Technology.
●● NIST: National Institute for Standards and Technology.
ni

●● SPM: Service Portfolio Management.


●● IAM: Identity and Access Management.
U

●● SACS: Security Access Control Service.


●● NIPS: Network-based Intrusion Prevention system.
●● MPNR: Multi-Party Non-Repudiation.
ity

Check Your Understanding


1. Which of the following features usually applies to data in a data warehouse?
a. Data are often deleted
m

b. Most applications consist of transactions


c. Data are rarely deleted
)A

d. Relatively few records are processed by applications


2. Which of the following statement is true?
a. The operational data is used as a source for the data warehouse
(c

b. The data warehouse consists of data marts and operational data


c. The data warehouse is used as a source for the operational data

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 51

d. All of the mentioned


Notes

e
3. _ _ _ _ _is a term that applies to applications and data storage that are delivered
over the internet or via wireless technology.

in
a. Web hoisting
b. Cloud computing

nl
c. Web searching
d. Cloud enterprise
4. Applications and services are offered via data centres located all over the world,

O
collectively known as the_ _ _ _ _.
a. Storage
b. Database

ty
c. Data mining
d. Cloud

si
5. A data warehouse is which of the following?
a. Organised around important subject areas.
b. Can be updated by end user r
ve
c. Contains numerous naming convention and formats
d. Contains only current data
6. The following technology is not well-suited for data mining?
ni

a. Expert system technology


b. Technology limited to specific data types such as numeric data types
U

c. Data visualization
d. Parallel architecture
7. Which is true about multidimensional model?
ity

a. It typically requires less disk storage


b. Increasing the size of the dimension is difficult
c. It typically requires more disk storage
m

d. None of the above


8. What is the full form of IoT?
a. Intranet of Thing
)A

b. Internet on Thinks
c. Internet over Things
d. Internet of Things
(c

Amity Directorate of Distance & Online Education


52 Data Warehousing and Mining

9. _ _ _ is a service that supports applications and is administered by a cloud operator


Notes

e
(public or private), freeing the application team from having to handle routine
database management tasks.

in
a. DBaaS
b. SaaS
c. PaaS

nl
d. IaaS
10. _ _ _ _is an outsourcing model in which software testing is performed by a third-

O
party service provider rather than by organisation employees.
a. Database as a Service
b. Testing as a Service

ty
c. Software as a Service
d. Platform as a Service

si
11. A_ _ _ _is a robust computer that frequently acts as the primary data store for an
organization’s IT infrastructure.
a. LAN
b. WAN
r
ve
c. Mainframe
d. Cluster computing
12. What is the abbreviation of API?
ni

a. Allied Program Interface


b. Applicants Programming Interface
U

c. Application program Internet


d. Application ProgrammeInterface
13. A__ _ _ _is a sort of parallel or distributed computer system that consists of a group
ity

of connected, independent computers that act as a highly centralised computing


tool that combines separate machines, networking, and software in one system.
a. Cluster
b. WAN
m

c. Mainframe
d. None of the mentioned
)A

14. The process of creating a virtual platform, including virtual computer networks,
virtual storage devices, and virtual computer hardware, is known as_ _ _ .
a. Cloud computing
b. Virtualization
(c

c. Data warehousing
d. None of the mentioned
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 53

15. _ _ _ _is the delivery of a calculating stage and arrangement stack as a service
Notes

e
without the need for programming downloads or establishing for engineers, IT
administrators, or end-clients.

in
a. Software as a Service
b. Database as a Service
c. Platform as a Service

nl
d. Infrastructure as a Service
16. Data scrubbing is which of the following?

O
a. A process to upgrade the quality of data before it is moved into a data
warehouse
b. A process to upgrade the quality of data after it is moved into a data

ty
warehouse
c. A process to reject data from the data warehouse and to create the necessary
indexes

si
d. A process to load data from the data warehouse and to create the necessary
indexes

r
17. The @active data warehouse architecture includes which of the following?
ve
a. At least one data mart
b. Data that can be extracted from numerous internal and external sources
c. Near real-time updates
ni

d. All of the above


18. A goal of data mining includes which of the following?
a. To confirm that data exists
U

b. To explain some observed eventor condition


c. To analyze data for expected relationships
ity

d. To create a new data warehouse


19. A data warehouse is which of the following?
a. Can be updated by end users
b. Contains numerous naming conventions and formats
m

c. Organized around important subject areas


d. Contains only current data
)A

20. A snowflake schema is which of the following types of tables?


a. Fact
b. Dimension
(c

c. Helper
d. All of the above

Amity Directorate of Distance & Online Education


54 Data Warehousing and Mining

Exercise
Notes

e
1. What is database as a service and governance/management as a service?
2. Define the following terms:

in
a. Testing as a service
b. Storage as a service

nl
3. What do you mean by cloud service development?
4. What are the challenges in cloud computing?

O
5. Define different layers of cloud computing
6. Explain different types of cloud computing and their features?
7. What are the cloud computing security requirements?

ty
8. Define Cloud computing - pros, cons and benefits.

Learning Activities

si
1. Do you agree that a typical retail store collects huge volumes of data through its
operational systems? Name three types of transaction data likely to be collected by
a retail store in large volumes during its daily operations.
r
Check Your Understanding - Answers
ve
1 c 2 a
3 b 4 d
5 a 6 b
ni

7 c 8 d
9 a 10 b
U

11 c 12 d
13 a 14 b
15 c 16 a
ity

17 d 18 b
19 c 20 d

Further Readings and Bibliography:


m

1. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.


2. Data Mining Techniques. For marketing, sales, and customer relationship
)A

management, Berry, M. &Linoff, G.


3. Principles of Data Mining, Hand, D.J., Mannila, H. & Smyth, P.
4. Data Mining: Next Generation Challenges and Future Directions, KarguptaHillol
5. Data Mining System And Applications, Deshpande S. P.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 55

6. Data Mining Techniques AndApplicationsInternational Journal of Computer


Notes

e
Science and Engineering., Bharati M
7. https://mu.ac.in/wp-content/uploads/2021/01/Cloud-Computing.pdf

in
8. Source: www.ibm.com/cloud
9. blog.irvingwb.com/blog/2008/07/what-is-cloud-c.html

nl
10. Source: CCIDConsulting, 2008–2009 China IDC Market Research Annual
Report
11. https://vmblog.com/archive/2017/11/15/research-shows-nearly-half-of-all-cloud-

O
migration-projects-over-budget-and-behind-schedule.aspx#.YzC-l3ZBzIU

ty
r si
ve
ni
U
ity
m
)A
(c

Amity Directorate of Distance & Online Education


56 Data Warehousing and Mining

Module - II: Principles of Dimensional Modelling


Notes

e
Learning Objectives:

in
At the end of this topic, you will be able to understand:

●● Identify Facts and Dimensions

nl
●● Design Fact Tables and Design Dimension Table
●● Data Warehouse Schemas

O
●● Concept of OLAP, OLAP Features, Benefits
●● OLAP Operations
●● Data Extraction, Clean-up and Transformation

ty
●● Concept of Schemas, Star Schemas for Multidimensional Databases
●● Snowflake and Galaxy Schemas for Multidimensional Databases

si
●● Architecture for a Warehouse
●● Steps for Construction of Data Warehouses, Data Marts and Metadata
●● OLAP Server - ROLAP
●● OLAP Server - MOLAP
r
ve
●● OLAP Server - HOLAP

Introduction
ni

Data warehouses are challenging to make. Their strategy calls for a mindset
that is recently the opposite of how normal computer architectures are created. Their
creation necessitates the radical reconstruction of vast amounts of information that
U

are frequently of dubious or conflicting quality and are derived from many diverse
sources. Businesses only employ effective data warehouses to provide answers to their
questions. To communicate with experts, one must comprehend their questions. The
DW configuration connects corporate innovation with learning expertise.
ity

The information distribution center’s layout will indicate the difference between
success and failure. Deep knowledge of the business is necessary for the Data
Warehousing plan. Data warehousing frequently employs a legal outline technique
known as “dimensional demonstrating” (DM). It differs from and looks distinct from
m

substance connection displaying in certain ways (ER).

We employ dimensional models to solve performance problems for massive


queries in the data warehouse. The use of dimensional modelling offers a method
)A

for enhancing summary report query efficiency without compromising data integrity.
However, the extra storage capacity required to achieve that performance has a
price. In general, a dimensional database needs a lot more room than its relational
cousin. That expense is becoming less significant, though, as storage space costs
continue to fall.
(c

Another name for a dimensional model is a star schema. Because it can deliver
significantly higher query performance than an E/R model, especially for very big
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 57

queries, this type of model is very common in data warehousing. The main advantage
Notes

e
is that it is simpler to understand, though. It often comprises of a sizable table of
facts (referred to as a fact table) and other tables surrounding it that offer descriptive
information, known as dimensions. The term comes from the fact that when it is drawn,

in
it has a star-like appearance.

2.1 Basics of Dimensional Modelling

nl
Dimensional modelling uses a cube operation to represent data, making OLAP
data management more suitable for logical data representation. Ralph Kimball created

O
the fact and dimension tables that make up the notion of “dimensional modelling.”

The transaction record is separated into “facts,” which are often numerical
transaction data, and “dimensions,” which are the reference data that places the facts
in context. For instance, a sale transaction can be broken down into specifics like the

ty
quantity of products ordered and the cost of those products, as well as into dimensions
like the date of the order, the user’s name, the product number, the ship-to and bill-to
addresses, and the salesperson who was in charge of receiving the order.

si
Objectives of Dimensional Modeling
Dimensional modelling serves the following purposes:

●●
r
create database design that is simple for end users to comprehend and create
ve
queries for.
●● to make queries as efficient as possible. By reducing the amount of tables and
relationships between them, it accomplishes these objectives.
ni

2.1.1 Identify Facts and Dimensions


Data structure method called Dimensional Modeling (DM) is designed specifically
U

for data warehouse storage. Dimensional modelling is used to enhance databases for
quicker data retrieval. The “fact” and “dimension” tables that make up the Dimensional
Modelling idea were created by Ralph Kimball.
ity

A dimensional model is a tool used in data warehouses to read, summarise, and


analyse numerical data such as values, balances, counts, weights, etc. Relational
models, on the other hand, are designed for the insertion, updating, and deletion of data
in an online transaction system that is live.
m

These dimensional and relational models each offer a special method of storing
data that has certain benefits.

For instance, normalisation and ER models in the relational manner eliminate data
)A

redundancy. On the other hand, a data warehouse’s dimensional model organises data
to make it simpler to obtain information and produce reports.

In dimensional data warehouse schemas, the following object types are frequently
employed:
(c

The huge tables in your warehouse schema known as “fact tables” are used to
record business measurements. Facts and foreign keys to the dimension tables are

Amity Directorate of Distance & Online Education


58 Data Warehousing and Mining

often included in fact tables. Fact tables provide examineable data, which is typically
Notes

e
additive and quantitative. Examples include revenue, expenses, and earnings.

in
nl
O
Figure: A sales analysis of star schema

The relatively static data in the warehouse is stored in dimension tables, which

ty
are also referred to as lookup or reference tables. The information that you often use
to contain queries is stored in dimension tables. You can use dimension tables, which
are often text-based and illustrative, as the result set’s row headers. Examples include

si
clients, the present moment, suppliers, and products.

Fact Tables

r
Both columns that hold numerical facts (commonly referred to as measurements)
ve
and columns that are foreign keys to dimension tables are common in fact tables. A
fact table either includes information at the aggregate level or at the detail level.
Summarized fact tables are frequently referred to as SUMMARY TABLES. Facts in a
fact table often have the same level of aggregation. Despite the fact that most facts
are additive, some are semi-additive or non-additive. Simple arithmetic addition can be
ni

used to group additive information together. Sales are an instance of this frequently.
Facts that are non-additive cannot be added at all. Averages are one illustration of this.
Some of the dimensions can be aggregated with semi-additive facts whereas others
U

cannot. Inventory levels are one instance of this, where it is impossible to determine
what a level means just by looking at it.
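As a rough illustration of the difference, the sketch below uses the sales fact table defined in the next subsection together with a hypothetical daily_inventory fact table (prod_id, time_id, quantity_on_hand) that is not part of this unit's schema. An additive fact such as the sales amount can simply be summed across any dimension, while a semi-additive fact such as an inventory level is usually averaged (or taken at period end) when aggregating across time:

-- Additive fact: sales amount can be summed across every dimension.
SELECT time_id, SUM(amount) AS total_sales
FROM sales
GROUP BY time_id;

-- Semi-additive fact: an inventory level can be summed across products for one
-- day, but across time it is typically averaged or taken at the period end.
SELECT prod_id, AVG(quantity_on_hand) AS avg_level
FROM daily_inventory
GROUP BY prod_id;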

Creating a New Fact Table:


ity

For each star schema, a fact table needs to be defined. From a modelling
perspective, the fact table’s primary key is typically a composite key made up of all of its
foreign keys.
m

Fact tables provide information on business events for synthesis. Fact tables are
frequently very huge, requiring hundreds of gigabytes or even terabytes of storage and
having hundreds of millions of entries. The fact table can be simplified to columns for
dimension foreign keys and numeric fact values since dimension tables include records
)A

that explain facts. Normally, the fact table does not store text, BLOBs, or denormalized
data. The following are definitions for this “sales” fact table:

CREATE TABLE sales


(c

prod_idNUMBER(7) CONSTRAINT sales_product_nn NOT NULL,

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 59

cust_id NUMBER CONSTRAINT sales_customer_nn NOT NULL,


Notes

e
time_id DATE CONSTRAINT sales_time_nn NOT NULL,

ad_idNUMBER(7),quantity_sold NUMBER(4) CONSTRAINT sales_quantity_nn

in
NOT NULL,

amount NUMBER(10,2) CONSTRAINT sales_amount_nn NOT NULL,

nl
cost NUMBER(10,2) CONSTRAINT sales_cost_nn NOT NULL )

Multiple Fact Tables:

O
Data warehouses that cover several business tasks, such as sales, inventories,
and finance, require a number of fact tables. There should be a fact table specific to
each business function, and there presumably will be some unique dimension tables as

ty
well. As was previously covered in “Dimension Tables,” all dimensions that are shared
by all business processes must express the dimension information uniformly. Typically,
each business function has its own schema, which includes a fact table, a number

si
of conforming dimension tables, and a few dimension tables that are special to that
business function. Such industry-specific schemas could be deployed as data marts or
as a component of the central data warehouse.

r
Physical partitioning of very large fact tables may be used for implementation
ve
and maintenance design purposes. Due to the historical nature of the majority of data
warehouse data, the partition divisions are nearly usually along a single dimension,
with time being the most frequently used one. For simplicity of maintenance, OLAP
cubes are typically partitioned to match the partitioned fact table portions if fact tables
are partitioned. As long as the total number of tables involved does not exceed the
ni

maximum for a single query, partitioned fact tables can be seen as a single table with a
SQL UNION query.
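A minimal sketch of that idea, assuming the large sales fact table has been split by year into physical tables named sales_2021 and sales_2022 (hypothetical names with the same columns as the sales table above): a view can present the partitions as one logical fact table for queries:

-- Hypothetical yearly partitions exposed as a single logical fact table.
CREATE VIEW sales_all AS
SELECT prod_id, cust_id, time_id, quantity_sold, amount, cost FROM sales_2021
UNION ALL
SELECT prod_id, cust_id, time_id, quantity_sold, amount, cost FROM sales_2022;

UNION ALL is used rather than UNION so that rows from the partitions are simply concatenated instead of being compared and de-duplicated.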
U

2.1.2 Design Fact Tables and Design Dimension Table


A dimension is a structure that classifies data and is frequently made up of one or
more hierarchies. The dimensional value is better explained by dimensional qualities.
ity

Typically, they are textual, descriptive values. You can respond to business inquiries
by combining a number of different dimensions with facts. Customers, items, and time
are frequently utilised dimensions. The smallest possible degree of detail is often used
to collect dimension data, which is then combined to create higher-level totals that are
m

better for analysis. Hierarchies are the terms used to describe these organic rollups or
aggregations found in dimension tables.

If the data warehouse has numerous fact tables or provides data to data marts,
)A

a dimension table could be used in more than one place. For instance, in the data
warehouse, one or more departmental data marts, and sales fact tables and inventory
fact tables, a product dimension may be employed. If all copies of a dimension, such as
customer, time, or product, utilised across several schemas are identical, the dimension
is said to be compliant.
(c

If separate schemas make use of various versions of a dimension table, the data
and reports for summarization won’t match. Successful data warehouse architecture

Amity Directorate of Distance & Online Education


60 Data Warehousing and Mining

depends on the usage of conforming dimensions. These are the definitions for this fact
Notes

e
table’s “customer” row:

CREATE TABLE customers

in
( cust_id NUMBER, cust_first_name VARCHAR2(20) CONSTRAINT customer_
fname_nn NOT

nl
NULL, cust_last_name VARCHAR2(40) CONSTRAINT customer_lname_nn NOT
NULL,cust_sex

CHAR(1), cust_year_of_birth NUMBER(4), cust_marital_status

O
VARCHAR2(20),cust_street_address

VARCHAR2(40) CONSTRAINT customer_st_addr_nn NOT NULL, cust_postal_


code VARCHAR2(10)

ty
CONSTRAINT customer_pcode_nn NOT NULL,cust_city VARCHAR2(30)
CONSTRAINT

customer_city_nn NOT NULL, cust_state_district VARCHAR2(40),country_id

si
CHAR(2)

CONSTRAINT customer_country_id_nn NOT NULL, cust_phone_number


VARCHAR2(25),
r
ve
cust_income_level VARCHAR2(30), cust_credit_limit NUMBER, cust_email
VARCHAR2(30) )

CREATE DIMENSION products_dim


ni

LEVEL product IS (products.prod_id)

LEVEL subcategory IS (products.prod_subcategory)

LEVEL category IS (products.prod_category)


U

HIERARCHY prod_rollup (

product CHILD OF
ity

subcategory CHILD OF

category

)
m

ATTRIBUTE product DETERMINES products.prod_name

ATTRIBUTE product DETERMINES products.prod_desc


)A

ATTRIBUTE subcategory DETERMINES products.prod_subcat_desc

ATTRIBUTE category DETERMINES products.prod_cat_desc;

A dimension table’s records create many-to-one relationships with the fact table.
For instance, several sales of the same product or multiple sales to a same consumer
(c

may occur. The attribute table for the dimension entry comprises extensive and user-
oriented textual information, such as the name of the product or the customer’s name

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 61

and address. Report labels and query restrictions are provided through attributes. An
Notes

e
OLTP database’s coded attributes need to be translated into descriptions.

For instance, the product category field in the OLTP database might only be a

in
simple integer, but the dimension table ought to have the text for the category. If it’s
necessary for maintenance, the code could also be stored in the dimension table. This
denormalization streamlines searches, increases their effectiveness, and makes user

nl
query tools simpler. However, if a dimension attribute changes regularly, creating a
snowflake dimension by assigning the attribute to its own table may make maintenance
simpler.

O
Hierarchies:
A dimension typically contains hierarchical data. The necessity for data to be
grouped and summarised into meaningful information drives the creation of hierarchies.

ty
For instance, the hierarchy elements (all time), Year, Quarter, Month, Day, or Week
are frequently found in time dimensions. Multiple hierarchies may exist inside a
single dimension; for example, a time dimension frequently has both fiscal year and

si
calendar year hierarchies. Geography is typically a hierarchy that puts a structure
on sales points, clients, or other geographically spread dimensions rather than being
a dimension in and of itself. The following geography hierarchy for sales points is an
example: (all), nation, area, state or region, city, store.
r
ve
The top-to-bottom ordering of levels, from the root (the most general) to the leaf
(the most detailed), is specified by level relationships. In a hierarchy, they specify the
parent-child relationship between the levels. In order to facilitate more complicated
rewrites, hierarchies are also crucial components. When the dimensional dependencies
between the quarter and the year are understood, the database, for instance, can
ni

aggregate previous sales revenue on a quarterly basis to a yearly aggregation.

Dimensional Modeling Process


U

The fact table is surrounded by dimensions in the dimensional model, which has a
star-like structure. The schema is constructed using the following design model:

1. Choose the business process


ity

2. Declare the grain

3. Identify the dimensions

4. Identify the fact


m

Choose the Business Process


The utilisation of the data warehouse and the usefulness of the dimensional
)A

model are ensured by the dimensional modelling process, which is built on a 4-step
design methodology. The fundamentals of the design are based on the real business
process that the data warehouse should support. As a result, the model’s initial step is
to describe the business process on which it is based. For instance, this may be a sales
scenario in a store. One has the option of describing the business process in plain text,
(c

using the Unified Modeling Language or simple Business Process Modeling Notation
(BPMN) (UML).

Amity Directorate of Distance & Online Education


62 Data Warehousing and Mining

Declare the Grain


Declaring the grain of the model comes after specifying the business process in the design. The grain precisely captures what the dimensional model should concentrate on, for example, "a single line item on a customer receipt from a retail establishment." Pick the central process and describe it succinctly in one sentence to make clear what the grain implies. It is from this grain (sentence) that you will construct the dimensions and fact table. As new knowledge emerges about what your model is meant to deliver, you might find that you need to return to this phase and change the grain.

Identify the Dimensions
Specifying the model's dimensions is the third step in the design process. The dimensions must be defined within the grain declared in the second step of the 4-step method. Dimensions are the foundation of the fact table and are where the information for the fact table is gathered. A dimension is typically a noun, such as date, store, or inventory, and all the descriptive data is kept in these dimensions. For instance, the date dimension can include information on the year, month, and weekday.

Identify the Facts

After establishing the dimensions, the next stage is to make keys for the fact table and to find the numerical facts that will fill each row of the fact table. The business users of the system are directly impacted by this stage because it is where they access the data that is kept in the data warehouse. As a result, most of the values in the fact table are numerical, additive measures such as quantity sold or cost per unit.
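As an illustrative sketch of where the four steps lead for the retail line-item grain above (all table and column names are assumed, not prescribed), the dimensions become tables of descriptive attributes and the fact table holds the additive measures at that grain:

-- Step 3: dimensions identified from the grain
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER,
    weekday  VARCHAR(10)
);

CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(50),
    city       VARCHAR(50)
);

-- Step 4: one row per receipt line item, holding numerical, additive facts
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim(date_key),
    store_key     INTEGER REFERENCES store_dim(store_key),
    product_key   INTEGER,
    units_sold    INTEGER,
    cost_per_unit DECIMAL(10,2)
);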

Benefits of Dimensional Modeling


●● The firm can easily report across departments thanks to the standardisation of dimensions.
●● The history of the dimensional information is stored in dimension tables.
●● It enables the addition of a completely new dimension without substantially changing the fact table.
●● Dimensional data storage also makes it simpler to retrieve information from the data once it has been placed in a database.
●● Dimensional tables are simpler to interpret than normalised models.
●● Information is categorised into understandable and straightforward business categories.
●● The business can easily understand the dimensional model. This model is built on business terminology, so the business understands what each fact, dimension, or attribute implies.
●● Denormalized and streamlined dimensional models allow quick data querying. This paradigm is recognised by many relational database platforms, which then optimise query execution plans to improve performance.
●● In a data warehouse, dimensional modelling produces a high-performance-optimized schema. It results in fewer joins and lessens data redundancy.
●● The performance of queries is also improved by the dimensional model. Because it is more denormalized, querying is optimised.
●● Dimensional models can easily adapt to change. More columns can be added to dimension tables without having an impact on currently running business intelligence applications that use these tables.

Multi-use dimensions:

By aggregating many minor, unconnected dimensions into a single physical dimension, frequently referred to as a junk (or garbage) dimension, data warehouse design can occasionally be made simpler. By lowering the number of foreign keys in fact table records, this can significantly reduce the size of the fact table. The combined dimension will frequently be prepopulated with the cartesian product of all the dimension values. If the number of discrete values would result in a very big table of all potential value combinations, the table can instead be filled with value combinations as they are encountered during the load or update process.

A typical example of a multi-use dimension is a dimension of customer demographics selected for reporting standardisation. Another multi-use dimension might contain helpful textual comments that only occasionally appear in the source data records; collecting these comments into a single dimension eliminates a sparse text field from the fact table and replaces it with a compact foreign key.
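A minimal sketch of such a combined dimension, with invented flag attributes purely for illustration:

CREATE TABLE txn_flags_dim (          -- one junk/garbage dimension for several small attributes
    flags_key     INTEGER PRIMARY KEY,
    payment_type  VARCHAR(10),        -- e.g. 'CASH' or 'CARD'
    is_gift_wrap  CHAR(1),            -- 'Y' / 'N'
    delivery_mode VARCHAR(10)         -- e.g. 'PICKUP' or 'COURIER'
);

-- The fact table then carries a single compact foreign key (flags_key)
-- instead of several sparse, low-cardinality columns.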

2.2 Understanding Schemas



A schema is a logical definition of a database that connects fact and dimension tables. Star, Snowflake, and Fact Constellation schemas are used to maintain a data warehouse.

2.2.1 Data Warehouse Schemas


A schema is a logical description of a database. It serves as the database's overall design blueprint: it explains how the data are organised and how the relationships between them are connected. The data warehouse schema includes the name and description of records, together with any associated data items and aggregates. A data warehouse uses the Star, Snowflake, and Fact Constellation types of schema, while an operational database uses relational models.

The following definitions of the basic terms used in this approach will help you
comprehend these schemas:

Dimension
In data warehousing, a "dimension" is a group of references to data concerning a quantifiable event. These events are known as facts and are kept in a fact table. The dimensions are often the entities for which an organisation wants to keep data. A data warehouse arranges the descriptive qualities as columns in dimension tables. As an illustration, a student dimension might include attributes such as first and last names, roll number, age, and gender, while an address dimension might have attributes for the street name, state, and country. Each record (row) of a dimension table has a primary key column that uniquely identifies it. A dimension is a classification framework made up of one or more hierarchies. Dimensions are typically de-normalized tables with potentially redundant data.

The terms normalisation and de-normalization will be used throughout this chapter, so let's quickly review them. A larger table is split up into smaller tables during the normalisation process to remove any potential insertion, update, or deletion anomalies. Data redundancy is eliminated in normalised tables. These tables are typically joined to obtain complete information.

De-normalization involves combining smaller tables to create larger ones in order to minimise joining procedures. De-normalization is especially useful when retrieval is crucial and insert, update, and delete operations are few, as is the case when dealing with historical data in a data warehouse. The data in these de-normalized tables will be redundant. As an illustration, in the case of the EMP-DEPT database, the two normalised tables would be EMP (eno, ename, job, sal, deptno) and DEPT (deptno, dname), whereas in the de-normalized case we would have a single table called EMP_DEPT with the attributes eno, ename, job, sal, deptno, and dname.
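In SQL, the two designs from this EMP-DEPT example could be sketched as follows (the data types are assumed for illustration):

-- Normalised: two tables, joined when complete information is needed
CREATE TABLE DEPT (
    deptno INTEGER PRIMARY KEY,
    dname  VARCHAR(30)
);

CREATE TABLE EMP (
    eno    INTEGER PRIMARY KEY,
    ename  VARCHAR(30),
    job    VARCHAR(20),
    sal    DECIMAL(10,2),
    deptno INTEGER REFERENCES DEPT(deptno)
);

-- De-normalised: a single table with a redundant dname, but no join at query time
CREATE TABLE EMP_DEPT (
    eno    INTEGER PRIMARY KEY,
    ename  VARCHAR(30),
    job    VARCHAR(20),
    sal    DECIMAL(10,2),
    deptno INTEGER,
    dname  VARCHAR(30)
);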

Let's look at the location and item dimensions shown in the Figure below. In Figure (a), the location dimension, location_id is the primary key and its attributes are street_name, city, state_id, and country_code. In Figure (b), the item dimension is depicted, with item_code as the primary key and the attributes item_name, item_type, brand_name, and supplier_id.

Figure: (a) location dimension, (b) item dimension

The location dimension may be produced from the location detail (location_id, street_name, state_id) and state detail (state_id, state_name, country_code) tables, as illustrated in the Figure: Normalized view below. It is vital to keep in mind that these dimensions may be de-normalized tables.

Figure: Normalized view


Measure
The term "measure" refers to values that depend on dimensions, for instance, units sold, sales amount, etc.

Fact Table
A bundle of related data pieces is referred to as a "fact table." It consists of dimension keys and measure values; that is, a fact table can be defined by the specified dimensions and measures. Foreign keys and measure columns are the two common sorts of columns found in fact tables. As demonstrated in Figure: Representation of fact and dimension tables, the foreign keys are connected to dimension tables, and the measures are numerical facts. Dimension tables are often smaller in size than fact tables.


Figure: Representation of fact and dimension tables



A dataset of facts, either detailed or aggregated, can be stored in a fact table.


Take a look at the sales fact table in the Figure below for an example. It has rupees_sold and units_sold as measures or aggregations, as well as time_key, item_code, branch_code, and location_id as foreign keys to dimension tables. FK stands for foreign key in this context.

Figure: The sales fact table
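A sketch of this sales fact table in SQL, assuming the four dimension tables already exist under the illustrative names used below:

CREATE TABLE sales_fact (
    time_key    INTEGER NOT NULL,  -- FK to the time dimension
    item_code   INTEGER NOT NULL,  -- FK to the item dimension
    branch_code INTEGER NOT NULL,  -- FK to the branch dimension
    location_id INTEGER NOT NULL,  -- FK to the location dimension
    units_sold  INTEGER,           -- measure
    rupees_sold DECIMAL(12,2),     -- measure
    FOREIGN KEY (time_key)    REFERENCES time_dim(time_key),
    FOREIGN KEY (item_code)   REFERENCES item_dim(item_code),
    FOREIGN KEY (branch_code) REFERENCES branch_dim(branch_code),
    FOREIGN KEY (location_id) REFERENCES location_dim(location_id)
);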


Multi-dimensional View of Data


There are many different dimensions to data. Hierarchies are frequently used in
dimensions to depict parent-child connections.

2.3 OLAP Operations
The question of whether OLAP is simply data warehousing in a pretty package comes up rather frequently. Can you not think of online analytical processing as nothing more than a method of information delivery? Is this layer in the data warehouse not just another layer that serves as an interface between the users and the data? OLAP does function somewhat as a data warehouse's information distribution mechanism. OLAP, however, is much more than that. Data is stored and made more easily accessible through a data warehouse. An OLAP system enhances the data warehouse by expanding the capacity for information distribution.

2.3.1 Concept of OLAP, OLAP Features, Benefits

OLAP stands for on-line analytical processing. OLAP is a category of software technology through which analysts, managers, and executives gain insight into information by means of quick, consistent, interactive access to a variety of views of data that have been transformed from raw data to reflect the true dimensionality of the enterprise as understood by the client.

Business information is analysed multidimensionally using OLAP, which also supports sophisticated data modelling, complex estimations, and trend analysis. OLAP is rapidly strengthening the vital framework for intelligent solutions, which includes business performance management, planning, budgeting, forecasting, financial documentation, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP makes ad hoc analysis of records in several dimensions possible, giving end users the knowledge and information they need to make better decisions.

quickly improved. Ad hoc record analysis in several dimensions is made possible
by OLAP, giving end users the knowledge and information they need to make better
U

decisions.

Who uses OLAP and Why?

Many different organisational functions employ OLAP systems.


ity

Finance and accounting:


◌◌ Budgeting
m

◌◌ Activity-based costing
◌◌ Financial performance analysis
◌◌ And financial modeling

Sales and Marketing


◌◌ Sales analysis and forecasting
◌◌ Market research analysis

◌◌ Promotion analysis
◌◌ Customer analysis


◌◌ Market and customer segmentation


Production

◌◌ Production planning
◌◌ Defect analysis
There are two primary uses for OLAP cubes. First, a data model that is more understandable to business users than a tabular model must be made available to them; this is known as a dimensional model. The second goal is to provide quick query responses, which are typically challenging to achieve with tabular models.

How OLAP Works?

The notion behind OLAP is fundamentally straightforward. The majority of the queries, such as aggregation, joining, and grouping, which are notoriously challenging to run over tabular databases, are pre-calculated. These queries are calculated as part of the OLAP cube's "building" or "processing" operation. By the time end users arrive at work in the morning, the data will have been updated as a result of this process.

OLAP Guidelines

Dr. E.F. Codd, the "founder" of the relational model, created a list of 12 standards and criteria that serve as the foundation for choosing OLAP systems:

1) Multidimensional Conceptual View: This is one of an OLAP system’s main


characteristics. Slice and dice techniques can be used since they call for a
multidimensional view.
2) Transparency: Make the users well aware of the technology, underlying information

repository, computing operations, and the disparate nature of source data. The
users’ productivity and efficiency are enhanced by this transparency.


3) Accessibility: It only gives access to the data that is actually necessary to carry out
the specific analysis and give the clients a single, coherent, and consistent view.
The OLAP system must apply any necessary transformations and map its own
logical schema to the various physical data storage. The OLAP operations ought

to be situated in the middle of an OLAP front-end and data sources (such as data
warehouses).

4) Consistent Reporting Performance: To ensure that when the number of dimensions
or the size of the database rises, the users do not notice any appreciable reduction
in documentation performance. That is, as the number of dimensions rises, OLAP
performance shouldn’t deteriorate. Every time a specific query is done, users must

see a consistent run time, response time, or machine utilisation.
5) Client/Server Architecture: Make the OLAP tool’s server component sufficiently
clever so that different clients can be connected with the least amount of effort and

integration code. The server ought to be able to map and combine data from many
databases.
6) Generic Dimensionality: Each dimension should be treated equally by an OLAP

method in terms of its operational capability and organisational structure. Selected
dimensions may be given access to new operational capabilities, but all dimensions
should be able to receive these additional jobs.
7) Dynamic Sparse Matrix Handling: To modify the physical schema in order to best
handle sparse matrices in the particular analytical model that is being built and
loaded. To achieve and keep up a constant level of performance when faced with a
sparse matrix, the system must be simple to dynamically assume the distribution of
the information and modify the storage and access.

8) Multiuser Support: Concurrent data access, data integrity, and access security are
requirements for OLAP tools.
9) Unrestricted cross-dimensional Operations: It gives the techniques the ability to

determine the order of the dimensions and inevitably works roll-up and drill-down
methods within or across the dimensions.
10) Intuitive Data Manipulation: Fundamental data manipulation includes drill-down and

roll-up operations, as well as other manipulations that can be carried out organically
and precisely using point-and-click and drag-and-drop techniques on the scientific
model’s cells. It does not require the usage of a menu or repeated visits to the user
interface.

11) Flexible Reporting: Columns, rows, and cells are organised efficiently for business
clients so that data may be easily modified, analysed, and synthesised.
12) Unlimited Dimensions and Aggregation Levels: There should be no limit to the

amount of data dimensions. Within any given consolidation path, each of these
common dimensions must provide nearly an infinite number of customer-defined
aggregation levels.

Characteristics of OLAP

The term FASMI, used to summarise the characteristics of OLAP methods, is generated from the initial letters of those characteristics, as follows:

Fast

It specifies that the system is intended to deliver most responses to the client within approximately five seconds, with the simplest analyses taking little more than one second and very few taking more than 20 seconds.

Analysis
It specifies that the method can handle any business logic and statistical analysis pertinent to the function and the user, while keeping the method simple enough for the intended client. Products (such as Oracle Discoverer) that do not allow the user to define new ad hoc calculations as part of the analysis, and to report on the data in any desired way, without having to programme, are excluded.

Share

Not all functions require the user to write data back, but an increasing number do, so the system should be able to manage multiple updates in a timely, secure manner. It requires that the system implement all the security requirements for confidentiality and, if multiple write connections are needed, concurrent update locking at an appropriate level.

Multidimensional
Due to the fact that this is unquestionably the most logical way to assess businesses and organisations, OLAP systems must offer a multidimensional conceptual representation of the data, including full support for hierarchies.

Information

The system should be able to store all the data required by the applications. It is important to handle data sparsity effectively.


Benefits of OLAP
OLAP holds several benefits for businesses:

1. OLAP increases managers' productivity by assisting decision-making through the efficient provision of multidimensional record views.
2. Because organised databases have an inherent flexibility, OLAP functions are self-sufficient.
3. Through careful management of analysis capabilities, it makes it easier to simulate business models and challenges.
4. When used in conjunction with a data warehouse, OLAP can help to speed up data retrieval, reduce query drag, and reduce the backlog of applications.

2.3.2 OLAP Operations

OLAP stands for on-line analytical processing. OLAP is a category of software technology that allows analysts, managers, and executives to access data quickly, consistently, and interactively, in a range of alternative views that reflect the true dimensionality of the company as understood by the client.

Without worrying about how or where the data are saved, OLAP servers provide business users with multidimensional information from data warehouses or data marts. Data storage concerns should nevertheless be taken into account in the OLAP servers' physical design and operation.

These various views are materialised via a number of OLAP data cube operations, allowing for interactive querying and analysis of the available data. As a result, OLAP provides a practical setting for interactive data analysis.

Slice − In order to obtain more precise information, it describes the subcube. You
can do this by choosing one dimension.
U

Dice − By doing selection on two or more dimensions, it describes the subcube.

Roll-up − The roll-up gives the user the ability to summarise data at a more general
level of the hierarchy. By expanding the area hierarchy from the level of the city to the
ity

level of the country, the roll-up operation illustrated gathers the data. In other words, the
generated cube groups the data by country rather than by city.

When roll-up is carried out using dimension reduction, the specified cube may
have one or more dimensions removed. Think of a sales data cube that just has the two
m

dimensions of location and time. By eliminating the time dimension, roll-up can be used
to aggregate total sales by place rather than by location and by time.
)A

Drill-down − Roll-up is the opposite of drill-down. It works by moving from less


specific information to more specific information. Drill-down can be finished by either
showing new dimensions or going down a concept hierarchy for a dimension. Drill-down
is achieved by moving down the time hierarchy from the precise level of the month to
the level of the quarter. Instead of summarising the revenues by quarter, the data cube
(c

that was produced analyses the entire sales for each month.
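In SQL terms, roll-up and drill-down correspond to grouping the same fact data at different levels of a hierarchy. A hedged sketch, assuming the illustrative sales_fact, location_dim, and time_dim tables (with year and month columns) used elsewhere in this unit:

-- Roll-up: summarise sales from the city level up to the country level
SELECT l.country_code, SUM(s.rupees_sold) AS total_sales
FROM sales_fact s
JOIN location_dim l ON s.location_id = l.location_id
GROUP BY l.country_code;

-- Drill-down: move from quarterly totals to monthly totals
SELECT t.year, t.month, SUM(s.rupees_sold) AS total_sales
FROM sales_fact s
JOIN time_dim t ON s.time_key = t.time_key
GROUP BY t.year, t.month;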


Visualisation − Visualization is the process of visually representing data using


Notes

e
detailed graphs, pictures, lists, charts, and other visual elements. It enables users to
quickly comprehend the data and extract pertinent information, patterns, and trends.
Additionally, it makes the data simple to comprehend.

in
In other words, it can be said that data visualisation is the act of representing data
in a graphical format so that consumers can easily understand how the process of

nl
trends in the data works.

●● Understanding and improving sales: OLAP can assist in identifying the best
products and the most well-known channels for businesses that profit from a wide

O
range of channels for marketing their products. It could be possible to identify the
most lucrative users using several techniques.
For example, If a company wants to analyse the sales of products for every hour

ty
of the day (24 hours), the difference between weekdays and weekends (2 values), and
split regions to which calls are made into 50 region, there is a large amount of record
when considering the telecommunication industry and considering only one product,
communication minutes.

si
●● Understanding and decreasing costs of doing business: One way to improve a firm
is to increase sales. Another way is to assess costs and reduce them as much

r
as possible without impacting sales. OLAP can help with the analysis of sales-
related expenses. It may be possible to find expenses that have a good return on
ve
investment using some strategies as well (ROI).
For example, It may be expensive to hire a good salesperson, but the income they
provide can make the expense worthwhile.
ni

Basic Analytical Operations of OLAP


Four types of analytical OLAP operations are:
U

1. Roll-up

2. Drill-down

3. Slice and dice


ity

4. Pivot (rotate)

2.4 Data Extraction, Clean-up and Transformation


m

You must prepare the data for storage in the data warehouse once you have
extracted it from various operational systems and external sources. It is necessary to
modify, transform, and prepare the extracted data from several unrelated sources in a
)A

format that can be saved for querying and analysis.

For the data to be ready, three primary tasks must be completed. The data must
first be extracted, transformed, and loaded into the data warehouse storage. A staging
area is where these three crucial processes—extracting, transforming, and getting
ready to load—take place. A workbench is included in the data staging component
(c

for these tasks. Data staging offers a location and a collection of tools with which to

Amity Directorate of Distance & Online Education


72 Data Warehousing and Mining

organise, integrate, transform, deduplicate, and ready source data for use and storage
Notes

e
in the data warehouse.

Why is data preparation done using a different location or component? Can’t you

in
prepare the data first and then transport the data from the various sources into the data
warehouse storage? When we put an operational system into place, it’s likely that we’ll
gather data from various sources, transfer it to the new operational system database,

nl
and then do data conversions. Why is it that a data warehouse cannot employ this
technique? The key distinction is that in a data warehouse, information is gathered from
numerous operating systems as sources. Do not forget that data in a data warehouse
is operational application- and subject-agnostic. Therefore, preparing data for the data

O
warehouse requires a distinct staging space.

2.4.1 Data Extraction, Clean-up and Transformation

ty
Data Extraction: There are several data sources that this function must handle.
For each data source, you must use the right approach. Source data may come in a
variety of data types from various source machines. Relational database systems may

si
contain a portion of the original data. Some data might be stored in older hierarchical
and network data models. There may still be a lot of flat file data sources. Data from
spreadsheets and regional departmental data sets might be included. Data extraction
r
could become very difficult.
ve
On the market, there exist tools for data extraction. You might wish to think about
employing external tools made for particular data sources. You might want to create
internal programmes to perform the data extraction for the other data sources. It could
be expensive initially to buy tools from other sources. On the other hand, internal
ni

programmes could incur continuing costs for development and upkeep.

Where do you save the data after you’ve extracted it so you may prepare it
further? If your framework allows it, you can conduct the extraction function directly
U

on the old platform. Data warehouse implementation teams usually extract the source
into a different physical environment so that getting the data into the data warehouse
would be simpler from there. You can extract the source data into a collection of flat
files, a relational database for data staging, or a combination of both in the separate
ity

environment.

ETL, which stands for Extraction, Transformation, and Loading, is the term used to
describe the process of taking data from source systems and bringing it into the data
warehouse.
m

The ETL process is technically difficult and demands active participation from many
stakeholders, including developers, analysts, testers, and top executives.
)A

Data warehouse technique must evolve along with business changes if it is


to continue serving as a valuable tool for decision-makers. ETL must be flexible,
automated, and thoroughly documented because it is a regular procedure (daily,
weekly, or monthly) of a data warehouse system.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 73

Notes

e
in
nl
ETL consists of three separate phases:

O
ty
r si
ve
Extraction
●● Extraction involves taking data out of a source system so that it can be used in a
data warehouse environment. The ETL process begins with this step.
●● One of the ETL’s most time-consuming processes is frequently the extraction
ni

procedure.
●● Determining which data has to be extracted might be challenging because the
source systems may be complex and poorly documented.
U

●● To update the warehouse and provide it with all modified data, the data must be
extracted multiple times on a regular basis.
ity

Cleansing
A data warehouse technique must include a cleansing stage because it is designed
to enhance data quality. Rectification and homogenization are the main data cleansing
functions present in ETL technologies. In order to correct spelling errors and identify
m

synonyms, they use specialised dictionaries. They also use rule-based cleansing to
impose domain-specific rules and define the proper relationships between variables.

The following examples show why data cleansing is essential:


)A

●● An organisation must have access to a complete, accurate, and current list of


contact addresses, email addresses, and phone numbers if it wants to get in touch
with its customers or its suppliers.
●● The staff member fielding the call should instantly be able to locate the caller in the
(c

enterprise database, however this requires that the caller’s name or the name of
his or her company be listed in the database.
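As a small illustration of rectification and homogenization during cleansing (the staging table, the corrected values, and the rule are invented for the example):

-- Homogenize spelling variants and synonyms of a city name
UPDATE stg_customer
SET city = 'New Delhi'
WHERE city IN ('NEW DELHI', 'Delhi NCR', 'Nw Delhi');

-- Rule-based check: flag records whose phone number is missing or malformed
SELECT cust_id, phone
FROM stg_customer
WHERE phone IS NULL OR LENGTH(phone) < 10;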

Amity Directorate of Distance & Online Education


74 Data Warehousing and Mining

●● It becomes challenging to update the customer’s information if a user appears


Notes

e
in the databases with two or more slightly different names or different account
numbers.

in
Transformation
Data conversion is a crucial task in every system implementation. For instance,
you must initially load your database with information from the previous system

nl
records when you construct an operational system, such as a magazine subscription
application. It’s possible that you’re switching from a manual system. You can also
be switching from a file-oriented system to one that uses relational database tables

O
for support. You will convert the data from the earlier systems in any scenario. What,
therefore, makes a data warehouse so unique? In what ways is data transformation
more complex for a data warehouse than for an operational system?

ty
As you are aware, data for a data warehouse is gathered from a variety of
unrelated sources. Data transformation creates even larger hurdles than data extraction
for a data warehouse. The fact that the data flow is more than just an initial load is
another aspect of the data warehouse. You will need to keep acquiring the most recent

si
updates from the source systems. Any transformation tasks you create for the first load
will also be modified for the subsequent updates.

r
As part of data transformation, you carry out a variety of distinct actions. You must
first purge the data you have taken from each source. Cleaning can involve removing
ve
duplicates when bringing in the same data from different source systems, or it can
involve resolving conflicts between state codes and zip codes in the source data. It can
also involve giving default values for missing data components.
ni

A significant portion of data transformation involves standardised data pieces. For


identical data components that were collected from diverse sources, you standardise
the data types and field lengths. Another significant job is semantic standardisation. You
U

eliminate homonyms and synonyms. You resolve synonyms when two or more terms
from several source systems have the same meaning. You resolve the homonym when
a single term has several meanings across separate source systems.

Combining data from many sources includes many different types of data
ity

transformation. Data from a single source record or associated data pieces from several
source records are combined. On the other side, data transformation also entails
breaking out source records into new combinations and removing source data that is
not helpful. In the data staging area, data is sorted and combined on a big scale.
m

The operational systems’ keys are frequently field values with predefined
meanings. For instance, the product key value could be a string of characters that
includes information on the product category, the warehouse where it is kept, and
)A

the production batch. The data warehouse’s primary keys are not allowed to have
predefined meanings. The assignment of substitute keys derived from the major system
keys of the source system is another aspect of data transformation.

The unit sales and revenue numbers for each transactions are recorded by a
(c

supermarket chain’s point-of-sale operational system at the checkout counter in each


shop. However, it might not be required to maintain the data at this degree of granularity
in the data warehouse. You could want to preserve the summary totals of the sale units

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 75

and revenue in the data warehouse storage and combine the totals by product at each
Notes

e
store for a certain day. In these circumstances, an appropriate summarization would be
part of the data transformation process.

in
You have a set of integrated, cleaned, standardised, and summarised data after
the data transformation function is finished. All of the data sets in your data warehouse
can now be loaded with data.

nl
The reconciliation phase’s fundamental component is transformation. It changes
the format of the records from their operational source into a specific data warehouse
format. Our reconciled data layer is produced by this stage if we use a three-tier

O
architecture.

The following points must be rectified in this phase:

◌◌ Unstructured text may conceal important information. For instance, XYZ PVT

ty
Ltd makes no mention of being a Limited Partnership corporation.
◌◌ Individual data can be used in a variety of formats. Data can be saved in
many formats, such as a string or three integers.

si
Following are the main transformation processes aimed at populating the
reconciled data layer:

◌◌
r
Data normalisation and conversion processes that apply to both storage
formats and units of measurement.
ve
◌◌ Matching, which links equivalent fields from many sources.
◌◌ A choice that minimises the number of source records and fields.
In ETL tools, cleaning and transformation procedures are frequently intertwined.
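A simplified sketch of the kind of transformation applied while populating the reconciled layer (the table names, the surrogate-key expression, and the unit conversion are illustrative assumptions; real loads usually assign surrogate keys from a sequence or identity column):

INSERT INTO dw_product_dim (product_key, source_product_code, product_name, weight_kg)
SELECT
    ROW_NUMBER() OVER (ORDER BY s.prod_code) + 1000,  -- simple surrogate key for the sketch
    s.prod_code,                                      -- operational key kept for traceability
    UPPER(TRIM(s.prod_name)),                         -- standardise the textual attribute
    s.weight_lb * 0.4536                              -- convert the unit of measurement to kg
FROM stg_products s;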
ni
U
ity
m
)A
(c

Amity Directorate of Distance & Online Education


76 Data Warehousing and Mining

Loading
Notes

e
The function of data loading is composed of two different groups of jobs. The
initial loading of the data into the data warehouse storage is done when the design and

in
construction of the data warehouse are finished and it goes online for the first time.
Large amounts of data are moved initially, which takes a long time. As soon as the
data warehouse is operational, you continue to extract the source data’s modifications,

nl
transform the data revisions, and continuously feed the incremental data modifications.

The process of writing data into the target database is known as the load. Make
sure the load is carried out accurately and with the least amount of resources feasible

O
during the load step.

There are two ways to carry out a load:

1. Refresh: Data in the data warehouse is entirely rewritten; the older file has thus

ty
been overwritten. Typically, refresh is combined with static extraction to first
populate a data warehouse.
2. Update: The Data Warehouse only receives the updates that were made

si
to the original data. Normally, an update is performed without erasing or
changing previously stored data. This technique is used to update data
warehouses on a regular basis in conjunction with incremental extraction.
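A sketch of the two load modes in SQL (the table names and the load_date column are illustrative; many warehouses use bulk-load utilities or MERGE statements instead):

-- Refresh: completely rewrite the target table from the staging area
DELETE FROM dw_sales_fact;
INSERT INTO dw_sales_fact SELECT * FROM stg_sales;

-- Update: append only the changes captured since the last load
INSERT INTO dw_sales_fact
SELECT *
FROM stg_sales s
WHERE s.load_date > (SELECT MAX(load_date) FROM dw_sales_fact);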

Selecting an ETL Tool


r
ve
1. An important choice that must be considered when determining the significance
of an ODS or data warehousing application is the choice of a suitable ETL tool. In
order to extract pertinent data from several data sources, the ETL tools must enable
ni

coordinated access to those sources. An ETL tool often includes capabilities for
automatic data loading into the object database, data cleansing, reorganisation,
transformations, aggregation, calculation, and aggregation.
U

2. An ETL solution should offer a straightforward user interface that enables point-and-
click specification of data cleansing and data transformation rules. The data extract/
transformation/load procedures, which are normally executed in batch mode, should
be automatically generated by the ETL tool once all mappings and transformations
ity

have been set.

2.4.2 Concept of Schemas, Star Schemas for Multidimensional


Databases
m

Star Schema
One of the simplest data warehouse schemas is the star schema. Because of the
)A

way its points expand outward from a central point, it is known as a star. Figure: The
star schema in which the fact table is in the centre and the dimension tables are at the
nodes is represented by the representation of fact and dimension tables above.

A one-dimensional table and a collection of attributes make up the dimension


table for each dimension in a star schema. Comparatively speaking, dimension tables
(c

have fewer entries than fact tables, but each record may have many attributes that

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 77

characterise the fact data. Foreign keys to dimensions data are typically included in fact
Notes

e
tables along with numerical facts.

In a star schema, fact tables are often in third normal form (3NF), whereas

in
dimensional tables are typically in de-normalized form. Although one of the simplest
forms, the star schema is still widely used today and is advised by Oracle. A star
schema is depicted visually in Figure: Graphical illustration of Star Schema.

nl
O
ty
r si
ve
ni
U
ity

Figure: Graphical representation of Star schema

Figure: Star schema for analysis of sales illustrates a company’s star schema
for sales analysis. In this structure, the four foreign keys that are connected to each
m

of the four matching dimension tables form a star with the sales data in the middle.
The units sold and rupees sold measurements are also included in the sales fact
table. The fact that the dimensions are de-normalized in this case must be noted. For
instance, the location dimension table, which includes attribute sets like {location_id,
)A

street_name, city, state_id, country_code} may include redundant data as a result of


its de-normalization. For instance, the Indian state of Punjab is home to the towns of
Patiala and Amritsar. The records for these cities could lead to data duplication for the
characteristics state and country.
(c

Amity Directorate of Distance & Online Education


78 Data Warehousing and Mining

Notes

e
in
nl
O
ty
si
Figure: Star schema for analysis of sales
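Because the dimensions are de-normalized, a typical report against this star schema needs only simple joins from the fact table to each dimension. For example (the column names follow the figure; the query itself, and the year column in the time dimension, are illustrative assumptions):

SELECT l.city, t.year,
       SUM(f.rupees_sold) AS total_sales,
       SUM(f.units_sold)  AS total_units
FROM sales_fact f
JOIN location_dim l ON f.location_id = l.location_id
JOIN time_dim     t ON f.time_key    = t.time_key
GROUP BY l.city, t.year;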

Main Characteristics of Star Schema


r
The following are the primary attributes of star schema:
ve
●● Because of the de-normalization of the data, it has a good query performance
because less join operations are needed.
●● It has an understandable, straightforward structure.
ni

●● De-normalization produces data redundancy, which can increase the size of the
table data and make it more time consuming to load the data into dimension
tables.
U

●● It is the data warehouse’s most widely used and simplest structure, and it is
supported by a huge variety of technologies.

Advantages of Star Schema


ity

The advantages of star schema are:

●● Due to the de-normalization of the data, it has straightforward queries as join


procedures are not necessary.
m

●● The star schema offers simple business reporting logic compared to fully
standardised schemas.
●● It performs queries quickly because there aren’t many join operations.
)A

●● Because its queries are simpler, it has quick aggregations.

Disadvantages of Star Schema


The disadvantages of star schema are:
(c

●● Star Schema’s database is de-normalized and has redundant data, so data


integrity cannot be implemented.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 79

●● Due to data redundancy, which normalised schemas are intended to prevent,


Notes

e
inserts and changes in star schemas may cause data anomalies.
●● When compared to a normalised data model, it is less adaptable to analytical

in
requirements.
●● Because it is made for a fixed analysis of data and does not support complex
analytics, it is overly specific.

nl
2.4.3 Snowflake and Galaxy Schemas for Multidimensional
Databases

O
The main distinction between the snowflake and star schemas is that while the
star schema always has de-normalized dimensions, the snowflake schema may also
contain normalised dimensions. As a result, the snowflake schema is a variation of the
star schema that supports dimension table normalisation. In the snowflake schema,

ty
some dimension tables are normalised, which separates the data into additional
tables. In a star schema, normalising the dimension tables is the procedure known as
“snowflaking.”

si
When all the dimension tables are fully normalised, the resulting structure
resembles a snowflake with a fact table at the centre. For instance, the item dimension

r
table in Figure (the Snowflake schema for analysis of sales) is normalised by being
divided into the item and supplier dimension tables.
ve
ni
U
ity
m
)A

Figure: Snowflake schema for analysis of sales

The characteristics set “item code, item name, item type, brand name, supplier id”
make up this item dimension table. Additionally, the supplier id is linked to a database
called the supplier dimension that has two attributes: supplier id and supplier type.
(c

Similar to this, the attributes set {location_id, street_name, city_id} make up the
location dimension table. Additionally, the characteristics city_id, city_name, state_id,


and country_code are all included in the city dimension table, which is related with the
city_id.
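A sketch of the snowflaked item dimension in SQL, with the supplier attributes split into their own normalised table (data types are assumed for illustration):

CREATE TABLE supplier_dim (
    supplier_id   INTEGER PRIMARY KEY,
    supplier_type VARCHAR(30)
);

CREATE TABLE item_dim (
    item_code   INTEGER PRIMARY KEY,
    item_name   VARCHAR(100),
    item_type   VARCHAR(50),
    brand_name  VARCHAR(50),
    supplier_id INTEGER REFERENCES supplier_dim(supplier_id)  -- normalised out of the item table
);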

Figure: Snowflake Schema depicts another angle on the concept. Store, city, and

in
region dimensions have been created here by normalising the shop dimension. Time
has been normalised into three dimensions: time, month, and quarter. Client and
client group dimensions have been created from the normalised client dimension.

nl
Additionally, as illustrated in Figure: Snowflake schema, the product dimension has
been standardised into product type, brand, and supplier dimensions.

O
ty
r si
ve
ni
U

Figure: Snowflake schema

It is significant to notice that normalisation reduces redundancy in the snowflake


schema.
ity

Advantages of Snowflake Schema


The benefits of snowflake schema are as follows:
m

●● The normalised properties of the snowflake schema reduce storage requirements


despite the complexity of the source query joins.
●● For snowflake schemas, some OLAP multi-dimensional database modelling tools
)A

have been developed.

Disadvantages of Snowflake Schema


The disadvantages of snowflake schema are:
(c

●● It has poor query speed due to joins needed for normaliseddata;


●● It has complex queries due to join operations since dimensions are normalised.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 81

Galaxy Schema (Fact Constellation Schema)


Notes

e
The primary distinction between the star/snowflake and fact constellation schemas
is that the former always includes one fact table while the latter always includes several.

in
Fact constellation is hence also known as the galaxy schema (multiple fact tables being
viewed as a group of stars).

Fact constellation, which is a collection of many fact tables sharing dimension

nl
tables, is a metric for online analytical processing. Comparing this to Star schema is an
improvement.

A fact constellation schema example is shown in Figure (Fact constellation schema

O
for analysis of sales), which comprises two fact tables, namely sales and shipping.

ty
r si
ve
ni
U

Figure: Fact constellation schema for analysis of sales


ity

Similar to the star schema’s sales fact table, the fact constellation schema’s does
as well. The sales fact table has two measures, such as rupees_sold and units_sold, as
well as four characteristics, including time_key, item_code, branch_code, and location_
id (Fact constellation schema for analysis of sales). Additionally, the shipping fact table
m

has two measures like rupees_cost and units_sold along with five attributes like item_
code, time_key, shipper_id, from_location, and to_location. Dimension tables can be
shared across fact tables in constellation schema.
)A

For instance, the fact tables shipping and sales share the dimension tables, namely
location, item, and time, as illustrated in Figure (Fact constellation schema for analysis
of sales).
(c

Advantages of Fact Constellation Schema


●● This schema’s main benefit is that it offers greater end-user assistance because it
has several fact tables.
Amity Directorate of Distance & Online Education
82 Data Warehousing and Mining

Disadvantages of Fact Constellation Schema


Notes

e
●● This schema’s primary drawback is its intricate design, which was necessitated by
taking several aggregation versions into account.

in
Comparison among Star, Snowflake and Galaxy (Fact Constellation) Schema

In brief: the star schema has a single fact table with de-normalized dimension tables, giving simple, fast queries at the cost of redundancy; the snowflake schema normalises the dimension tables, reducing redundancy but requiring more joins; and the galaxy (fact constellation) schema contains multiple fact tables that share dimension tables, supporting several subjects at the cost of a more complex design.
2.5 Warehouse Architecture
ni

The architecture is the framework that binds every element of a data warehouse
together. Consider the architecture of a school as an illustration. The building’s
architecture encompasses more than just its aesthetic. It consists of numerous
U

classrooms, offices, a library, hallways, gymnasiums, doors, windows, a roof, and


numerous other similar structures. When all of these elements are brought and
assembled, the architecture of the school building serves as the framework that binds
ity

them all together. If you can apply this analogy to a data warehouse, the architecture of
the data warehouse is made up of all of the different parts of the data warehouse.

Let’s assume that the builders of the school building were instructed to make
the classrooms spacious. They increased the size of the classrooms but completely
m

removed the offices, giving the school building a flawed architectural design. Where did
the architecture go wrong? For starters, not all the required elements were available.
Most likely, the positioning of the remaining elements was also incorrect. Your data
warehouse’s success depends on the right architecture. As a result, we will examine
)A

data warehouse architecture in more detail in this chapter.

The architecture of your data warehouse takes into account a number of elements.
It primarily consists of the integrated data, which serves as the focal point. Everything
required for preparing and storing the data is included in the design. However, it also
(c

provides all of the tools for getting data from your data warehouse. Rules, processes, and
functions that make your data warehouse run and satisfy business needs are also a part
of the architecture. Lastly, your data warehouse’s technology is included into the design.
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 83

2.5.1 Architecture for a Warehouse


Notes

e
A means of outlining the total architecture of data exchange, processing, and
presentation that exists for end-client computing within the organisation is the data

in
warehouse architecture. Although every data warehouse is unique, they all share
several essential elements.

Online transaction processing is built into production applications including

nl
payroll, accounts payable, product purchasing, and inventory control (OLTP). These
programmes collect thorough information on daily operations.

Applications for data warehouses are created to serve the user’s ad-hoc data

O
requirements, or what is now known as online analytical processing (OLAP). These
comprise tools including trend analysis, profiling, summary reporting, and forecasting.

Production databases are regularly updated manually or via OLTP software.

ty
In contrast, a warehouse database receives periodic updates from operating
systems, typically after hours. OLTP data is routinely retrieved, filtered, and loaded
into a dedicated warehouse server that is open to users as it builds up in production

si
databases. Tables must be de-normalized, data must be cleaned of mistakes and
duplicates, and new fields and keys must be introduced as the warehouse is filled with
data to represent the user’s demands for sorting, combining, and averaging data.

r
Figure illustrates the three main elements of every data warehouse (Data
ve
warehouse architecture). These are listed below.

◌◌ Load Manager
◌◌ Warehouse Manager
ni

◌◌ Query Manager (Data Access Manager)


U
ity
m
)A
(c

Figure: Data Warehouse Architecture

Amity Directorate of Distance & Online Education


84 Data Warehousing and Mining

Load Manager
Notes

e
The task of gathering data from operating systems falls under the purview of the
load manager. Additionally, it converts data into a format that the user can use going

in
forward. It consists of all the software and application interfaces needed for data
extraction from operational systems, preparation of the extracted data, and then loading
the data into the data warehouse itself.

nl
It ought to carry out the following duties:

◌◌ Data Identification
◌◌ Data Validation for its accuracy

O
◌◌ Data Extraction from the original source
◌◌ Data Cleansing
◌◌ Data formatting

ty
◌◌ Data standardization (i.e. bringing data into conformity with some standard
format)
◌◌ Consolidates data from multiple sources to one place

si
◌◌ Establishment of Data Integrity using Integrity Constraints

Warehouse Manager
r
The key component of the data warehousing system is the warehouse manager,
ve
which contains a vast amount of data from numerous sources. It arranges data in such
a way that anyone may easily evaluate or locate the necessary information. It serves
as the data warehouse’s main component. It retains detailed, lightly summarised, and
highly summarised information levels. Additionally, it keeps track of mete data, or data
ni

about data.

Query Manager
U

Last but not least, the query manager is the interface that enables end users to
access data warehouse-stored data through the use of specific end-user tools. These
devices are referred to as access tools for data mining. These solutions, which have
basic functionality and the option to customise new features unique to a company,
ity

are widely available today. These fall under a number of categories, including data
discovery, statistics, and query and reporting.

Types of Data Warehouse Architectures


m
)A

There are mainly three types of Data warehouse architecture.

Single-Tier Architecture
(c

In reality, single-tier architecture is not frequently employed. In order to accomplish


this, it eliminates redundant data in order to keep as little data as possible.
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 85

The source layer is the sole layer that is actually accessible, as seen in the picture.
Notes

e
Data warehouses are virtual in this approach. This indicates that the data warehouse is
actually a multidimensional representation of operational data produced by a particular
middleware, or intermediate processing layer.

in
nl
O
ty
si
Figure: Single Tier

This architecture has a vulnerability since it does not separate analytical and
r
transactional processing as required. Following the middleware’s interpretation,
ve
analysis queries are approved for operational data. This is how inquiries have an impact
on transactional workloads.

Two-Tier Architecture
ni

As seen in fig., the requirement for separation is crucial in developing the two-tier
architecture for a data warehouse system.
U
ity
m
)A
(c

Figure: Two Tier

Amity Directorate of Distance & Online Education


86 Data Warehousing and Mining

Although it is frequently referred to as a two-layer architecture to emphasise a


Notes

e
division between physically accessible sources and data warehouses, in reality, it
consists of four consecutive stages of data flow:

in
1. Source layer: A data warehouse system makes use of several data sources. The
information may originate from an information system beyond the boundaries of the
company or be initially housed in legacy databases or internal relational databases.

nl
2. Data Staging: In order to combine disparate sources into a single standard schema,
the data held in the source should be collected, cleaned to remove inconsistencies
and fill gaps, and integrated. The so-called Extraction, Transformation, and Loading

O
Tools (ETL) may extract, transform, clean, validate, filter, and load source data into
a data warehouse while combining disparate schemata.
3. Data Warehouse layer: A data warehouse serves as one logically centralised
individual repository for information storage. The data warehouses can be accessed

ty
directly, but they can also be utilised as a source for developing data marts, which are
intended for particular company departments and partially reproduce the contents
of data warehouses. Data staging, users, sources, access processes, data mart

si
schema, and other information are all stored in meta-data repositories.
4. Analysis: In this layer, integrated data is effectively and adaptably accessed to
generate reports, evaluate data in real time, and model fictitious business scenarios.
r
It should have customer-friendly GUIs, advanced query optimizers, and aggregate
ve
information navigators.

Three-Tier Architecture
The source layer, which includes several source systems, the reconciliation layer,
ni

and the data warehouse layer make up the three-tier architecture (containing both data
warehouses and data marts). In between the data warehouse and the source data is
the reconciliation layer.
U

The reconciled layer’s key benefit is that it produces a uniform reference data
model for an entire company. It also distinguishes between issues with data warehouse
filling and those with source data extraction and integration.
ity

In some instances, the reconciled layer is also used directly to improve how some
operational tasks are carried out, such as generating daily reports that cannot be
adequately prepared using corporate applications or creating data flows to periodically
feed external processes in order to gain the benefits of cleaning and integration.
m

Particularly beneficial for large, enterprise-wide systems, this architecture. The


additional file storage space required by the extra redundant reconciling layer is a
drawback of this arrangement. The analytical tools are also a little less real-time as a
)A

result.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 87

Notes

e
in
nl
O
ty
r si
ve
Figure: Three Tier

Benefits of data warehousing


ni

If properly implemented, a data warehouse has several advantages and benefits,


some of which are listed below.
U

◌◌ Potentially high ROI: Investing in data warehousing is a significant financial


commitment in and of itself, but previous reports indicate ROI growth of up to
400% with data warehousing, making it worthwhile for company.
◌◌ Unbeatable competitive advantage: Data warehousing implementation could
ity

give businesses an advantage over their rivals. Companies could learn about
patterns, untapped knowledge, and facts and figures that were previously
unavailable by adopting data warehousing. Such fresh information would
improve the calibre of choices.
m

◌◌ High Productivity in Business Intelligence and Corporate Decision Making:


Data Warehousing combines data from various sources into useful information
that managers can evaluate and refer to in order to make better decisions for
)A

the organisation.
◌◌ Cost-effective: Data warehousing allows for organisational streamlining, which
lowers overhead and lowers product costs.
◌◌ Improved customer service: Data warehousing offers crucial assistance when
dealing with clients, which helps to raise client satisfaction and keep them as
(c

clients.

Amity Directorate of Distance & Online Education


88 Data Warehousing and Mining

Problems or limitations of datawarehousing


Notes

e
The problems associated with developing and managing data warehousing are as follows.

◌◌ Underestimation of resources for data ETL: The amount of time needed to extract, clean, and load data before warehousing is frequently underestimated by the user. As a result, many other processes suffer from real-time implementation, and operations may suffer in the interim.
◌◌ Erroneous source systems: The source systems being used to feed data have many unreported issues. Years may pass before these issues are discovered, and after lying undiscovered for many years they may suddenly manifest in embarrassing ways. For instance, after entering information for a brand-new property, one discovers that some fields have allowed null values. In many of these situations, the employee is found to have entered inaccurate data.
◌◌ Required data not captured: Although data warehouses generally keep detailed information, they sometimes purposefully leave out a few seemingly inconsequential facts that could later be useful for analysis or other tasks. For instance, the date of registration may not be used in the source system while entering details of a new property, but it may be very helpful during analysis.
◌◌ Increased end-user queries or demands: Queries never stop from the end user's perspective. A series of additional questions may arise even after the initial ones have been successfully answered.
◌◌ Loss of information during data homogenisation: When combining data from various sources, information loss due to format conversion is possible.
◌◌ High demand for resources: To store the massive amount of data generated every day, resources such as large amounts of disc space are needed.
◌◌ Data ownership: When one department is asked for data, it frequently responds slowly or with reluctance, often because it is concerned about losing control or ownership of the data. As a result, the data warehousing process is delayed and interrupted.
◌◌ Necessary maintenance: A data warehouse requires high upkeep. It may be impacted by changes to business processes or the organisation of the source systems, and all of these factors add to the extraordinarily high maintenance costs.
◌◌ Long-duration projects: Data warehouses need a lot of time and money from design to actual deployment, which is why many firms are hesitant at first.
◌◌ Complexity of integration: The performance of a data warehouse can be assessed by its integration capabilities. As a result, an organisation invests a significant amount of effort in figuring out how well the various data warehousing solutions can work together (or integrate), which is a particularly challenging undertaking given the number of options available.

2.5.2 Steps for Construction of Data Warehouses, Data Marts and Metadata
A data warehouse is a single repository in which records from various sources are combined for use in online analytical processing (OLAP). This means that a data warehouse must satisfy the needs of every business stage across the board. As a result, data warehouse design is a very time-consuming, complex, and thus error-prone process. Additionally, as business analytical tasks evolve over time, so do the requirements for the systems. Consequently, the design process for OLAP and data warehousing systems is continuous.

A different approach than view materialisation is used in the architecture of data warehouses. Data warehouses are viewed as database systems with specific requirements, such as responding to management-related inquiries. The goal of the design is to determine how records should be extracted, transformed, and loaded (ETL) from various data sources in order to be arranged in a database as the data warehouse.

There are two methods:
◌◌ “top-down” approach
◌◌ “bottom-up” approach

Top-down Design Approach


A data warehouse is defined in the “Top-Down” architecture paradigm as a subject-oriented, time-variant, non-volatile, and integrated data repository for the whole organisation. Data from various sources are vetted, reorganised, and saved in a normalised (up to 3NF) database that serves as the data warehouse. The atomic, or most basic, level of data is stored in the data warehouse, from which dimensional data marts can be constructed by choosing the data needed for certain business areas or departments. The approach is data-driven because business requirements from the subjects for creating data marts are developed only after the information has been obtained and integrated. This approach has the benefit of supporting a single integrated data source; consequently, data marts created from it will be consistent where they overlap.
U

approach has the benefit of supporting a single integrated data source. Consequently,
when they overlap, data marts created from it will have consistency..

Advantages of top-down design


ity

◌◌ The data warehouses are used to load data marts.


◌◌ It is quite simple to create new data marts from the data warehouse.

Disadvantages of top-down design



◌◌ This approach is not adaptable to shifting departmental needs.


◌◌ The project will be expensive to carry out.
Figure: Top-down design approach

Bottom-Up Design Approach



A data warehouse is defined as “a copy of transaction data specifically structured


for query and analysis,” also known as the star schema, under the “Bottom-Up”
strategy. In this method, a data mart is first built to provide the reporting and analytical
tools required for certain business operations (or subjects). Therefore, in contrast to
U

Inmon’s data-driven approach, it must be a business-driven method.

The lowest grain of data and, if necessary, aggregated data are both included
in data marts. To suit the data delivery needs of data warehouses, a denormalized

dimensional database is modified in place of a normalised database for the data


warehouse. In order to use the collection of data marts as the enterprise data
warehouse, it is necessary to build the data marts with conformed dimensions, which
ensures that common items are represented consistently throughout all of the data
m

marts. The data marts were joined by the conformed dimensions to create a data
warehouse, also known as a virtual data warehouse.

The benefit of the “bottom-up” design method is that it has a quick return
e
in
nl
O
ty
Advantages of bottom-up design
◌◌ Quick document generation is possible.
◌◌ New business units can be added to the data warehouse as needed.
◌◌ It simply involves creating new data marts and merging them with existing
data marts.

Disadvantages of bottom-up design


◌◌ In the bottom-up approach design, the locations of the data warehouse and

the data marts are switched.

Difference between Top-Down Design Approach and Bottom-Up Design Approach



Top-Down Design Approach | Bottom-Up Design Approach
Divides the enormous problem into smaller, related sub-problems. | Combines the solutions of basic low-level problems to build up the solution of the higher-level problem.
Inherently designed as a whole, rather than as a union of different data marts. | Incremental by nature; important data marts can be scheduled first.
Information about the content is stored in a single, central location. | Data is kept at the departmental level.
Centralised authority and regulation. | Departmental norms and regulations.
It contains redundant information. | Redundancy can be eliminated.
It can yield speedy results if implemented iteratively. | Lower chance of failure, a good return on investment, and proven approaches.




Data Marts

The phrase “data mart” refers to a tiny, localised data warehouse created for
a single function and is used to describe a department-specific data warehouse. It is

typically designed to meet the requirements of a department or a group of users inside
an organisation.

An organisation might, for instance, have a number of departments, such as

the finance and IT departments. These departments can each have their own data
warehouses, which are nothing more than their respective data marts.

In order to support a particular subset of management decisions, a data mart is

characterised as “a specialised, subject-oriented, integrated, time-variant, volatile data
repository.”

Data marts are a subset of data warehouses that fulfil the needs of a specific

department or business function, to put it simply.

A company may maintain both a data warehouse (which unifies the data from all
departments) and separate departmental data marts. Thus, as indicated in the below

graphic, the data mart can be centrally linked to the corporate data warehouse or stand-
alone (individual).

r

It is frequently observed that as a data warehouse expands and accumulates more


data, its capacity to meet the various needs of any organisation deteriorates. In these

situations, data marts are helpful since they frequently serve as a method for building a
data warehouse in a phased or sequential manner in large enterprises. An enterprise-
wide data warehouse can be made up of a number of data marts. In contrast, a data

warehouse could be seen as a collection of subsets of data marts, as shown in Figure


below (Relationship between data mart and data warehouse).

Figure: Relationship between data mart and data warehouse




Types of Data Marts



e
There are mainly two approaches to designing data marts. These approaches are

◌◌ Dependent Data Marts

◌◌ Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical or physical subset of a higher-level data warehouse. The data marts are regarded as the data warehouse's subsets in
accordance with this method. This method starts by building a data warehouse from

which other data marts can be made. These data marts rely on the data warehouse
and pull the crucial information from it. This method eliminates the need for data mart
integration because the data warehouse builds the data mart. It is also referred to as a
top-down strategy.

Independent Data Marts

Independent data marts (IDM) are the second strategy. In this case, separate
multiple data marts are first constructed, and then a data warehouse is designed using
them. This method requires the integration of data marts because each data mart is
U

independently built. As the data marts are combined to create a data warehouse, it is
also known as a bottom-up strategy.
ity
m
)A

There is a third category, known as “Hybrid Data Marts,” in addition to these two.

Hybrid Data Marts


A hybrid data mart allows us to merge data from sources other than a data warehouse. This

could be useful in a variety of circumstances, but it is especially useful when ad hoc
integrations are required, such as when a new group or product is introduced to an
organisation.




Steps in Implementing a Data Mart



Designing the schema, building the physical storage, populating the data mart with
data from source systems, accessing it to make educated decisions, and managing it

over time are the key implementation processes. These are the steps:

Designing

The data mart process starts with the design step. Initiating the request for a data
mart, acquiring information about the needs, and creating the data mart’s logical and
physical design are all covered in this step.

The tasks involved are as follows:

1. Compiling the technical and business needs


2. Recognizing the data sources

3. Choosing the right subset of data.
4. Creating the data mart’s logical and physical architecture.

Constructing
In order to enable quick and effective access to the data, this step involves building
r
the physical database and the logical structures related to the data mart.
The tasks involved are as follows:

1. Establishing the data mart’s logical structures, such as table spaces, and
physical database.

2. Creating the schema objects described in the design process, such as the
tables and indexes.
3. Deciding on the best way to put up the access structures and tables.

Populating
This stage involves obtaining data from the source, cleaning it up, transforming it
into the appropriate format and level of detail, and transferring it into the data mart.

The tasks involved are as follows:

1. Mapping data sources to target data sources


2. Extracting data

3. Cleansing and transforming the information.


4. Loading data into the data mart

5. Creating and storing metadata
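
To make the populating tasks above concrete, the following is a minimal, hypothetical Python sketch (using the standard sqlite3 module) of extracting rows from an assumed source table, cleansing and transforming them, and loading them into an assumed data mart table. The table names, columns and values are invented for illustration only.

# A minimal sketch of the populating tasks: extract, cleanse/transform, load.
# 'sales' (source) and 'mart_sales' (target) are hypothetical tables.
import sqlite3

src = sqlite3.connect(":memory:")          # stands in for the source system
src.execute("CREATE TABLE sales (region TEXT, amount TEXT)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("pb", "100.0"), ("PB", "250.5"), ("hr", None)])

mart = sqlite3.connect(":memory:")         # stands in for the data mart
mart.execute("CREATE TABLE mart_sales (region TEXT, amount REAL)")

# Extract, then cleanse/transform: standardise region codes and drop rows
# whose amount is missing, converting the text amounts to numbers.
rows = src.execute("SELECT region, amount FROM sales").fetchall()
clean = [(r.upper(), float(a)) for r, a in rows if a is not None]

# Load the cleansed rows into the data mart.
mart.executemany("INSERT INTO mart_sales VALUES (?, ?)", clean)
print(mart.execute("SELECT * FROM mart_sales").fetchall())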

Accessing
In this step, the data is put to use through querying, analysis, report creation, chart
and graph creation, and publication.

The tasks involved are as follows:




1. Create a meta layer (intermediate layer) for the front-end tool to use. This

layer converts database operations and object names into business terms so
that end users can communicate with the data mart using language that is
related to business processes.

2. Establish and maintain database architectures including summarised tables
that facilitate quick and effective query execution through front-end tools.

Managing
In this step, the data mart’s lifespan management is included. The following

management tasks are carried out in this step:

1. Providing safe access to the data.


2. Controlling the expansion of the data.

3. Improving the system’s performance through optimization.
4. Ensuring data accessibility in case of system faults.

Differences between Data Mart and Data Warehouse
The following traits set data marts and data warehouses apart from one another.

●●
r
Instead of the needs of the entire organisation, data marts often concentrate on
the data requirements of a single department.
●● Data marts do not include detailed information (unlike data warehouses).
●● In contrast to data warehouses, which handle vast amounts of data, they are
simple to move between, explore, and traverse.

Advantages of data marts


Following are the advantages of data marts.

◌◌ The user receives pertinent, focused info from data marts.


◌◌ Data marts react promptly.
◌◌ Because data marts deal with small amounts of data, data operations

including data cleansing, loading, transformation, and integration are much


simpler and less expensive. A data mart is easier to set up and implement
than a data warehouse for the entire company.
◌◌ Implementing a data mart is significantly more affordable than a data

warehouse.
◌◌ Instead of including numerous pointless individuals, it is possible to better
organise the potential users of a data mart.

◌◌ Data marts are created on the premise that serving the full company is not
necessary. As a result, the department can independently summarise, pick,
and organise the data from its own departments.
◌◌ Each department may be able to focus on a particular segment of historical

data rather than the entire data set thanks to data marts.
◌◌ Depending on their demands, departments can modify the software for their
data mart.




◌◌ Data marts are reasonably priced.



Limitations of data marts

The following are some drawbacks of data marts:

◌◌ Once in use, it becomes difficult to expand their coverage to other


departments due to intrinsic design limits.

◌◌ Data integration issues are frequently encountered.
◌◌ Scalability issues become prevalent as the data mart grows in size and dimensionality.

Meta data
For Example: Over the years, numerous application designers in each branch have
made their own choices about how an application and database should be constructed

in order to store data. Therefore, naming conventions, variable measures, encoding
structures, and physical properties of data will vary amongst source systems. Consider
a bank with numerous locations throughout numerous nations, millions of clients, with

lending and savings as its primary business lines. The process of integrating data from
source systems to target systems is demonstrated in the example that follows.

Example of source data



In the aforementioned example, the names of the attribute, column, datatype,



and values vary greatly between source systems. By integrating the data into a data
warehouse with sound standards, this data inconsistency can be avoided.

Example of Target Data (Data Warehouse)



The target system’s attribute names, column names, and datatypes are all
consistent in the target data from the aforementioned example. This is how the data
warehouse accurately integrates and stores data from diverse source systems.
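
The following is a small, hypothetical Python sketch (using pandas) of the integration idea described above: two source systems describe the same attributes with different column names, codes and datatypes, and they are mapped onto one warehouse standard before being stored. All names, codes and figures here are invented for illustration.

# Hypothetical sketch: harmonising two differently-coded sources into one target.
import pandas as pd

source_a = pd.DataFrame({"cust_sex": ["M", "F"], "bal": ["1000", "2500"]})
source_b = pd.DataFrame({"gender": [1, 0], "balance": [700.0, 50.0]})

# Rename columns, recode values and align datatypes to the warehouse standard.
std_a = source_a.rename(columns={"cust_sex": "gender", "bal": "balance"})
std_a["balance"] = std_a["balance"].astype(float)

std_b = source_b.copy()
std_b["gender"] = std_b["gender"].map({1: "M", 0: "F"})

target = pd.concat([std_a, std_b], ignore_index=True)
print(target)  # consistent attribute names, codes and datatypes in the target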

Literally, “data about data” is what meta data is. It describes the many types of

information in the warehouse, where they are kept, how they relate to one another,
where they originate, and how they relate to the company. This project aims to address




the issue of standardising meta data across diverse products and using a systems

e
engineering approach to this process in order to improve data warehouse architecture.

A data warehouse consolidates both recent and old data from various operational

in
systems (transactional databases), where the data is cleaned and reorganised to
assist data analysis. By developing a metadata system, we hope to construct a design
procedure for building a data warehouse. The basis for organising and designing

nl
the data warehouse will be provided by the meta data system we plan to build. This
procedure will be used initially to assist in defining the requirements for the data
warehouse and will then be used iteratively throughout the life of the data warehouse to
update and integrate additional dimensions.

O
The data warehouse user must have access to correct and current meta data in
order to be productive. Without a reliable source of meta data to base operations on,
the analyst’s task is substantially more challenging and the amount of labour necessary

ty
for analysis is significantly increased.

A successful design and deployment of a data warehouse and its meta data
management depend on an understanding of the enterprise data model’s relationship to

si
the data warehouse. Some particular difficulties that entail a data warehouse’s physical
architecture trade-offs include:

◌◌
◌◌
r
Granularity of data - refers to the level of detail held in the unit of data
Partitioning of data - refers to the breakup of data into separate physical units
ve
that can be handled independently.
◌◌ Performance issues
◌◌ Data structures inside the warehouse
ni

◌◌ Migration
As a byproduct of transaction processing and other computer-related activities,
data has been accumulating in the business sector for years. Most of the time, the data
U

that has accumulated has simply been used to fulfil the initial set of needs. The data
warehouse offers a data processing solution that reflects the integrated information
needs of the business. The entire corporate organisation can use it to facilitate
analytical processing. Analytical processing examines different parts of the business or
ity

the entire organisation to find trends and patterns that would not otherwise be visible.
The management’s vision for leading the organisation depends utterly on these trends
and patterns.

Arguably, the most important component of efficient data management is meta


m

data itself. To correctly store and utilise the meta data produced by the various systems,
efficient tools must be used.
)A

●● Data Transformation and Load


In order to characterise the source data and any necessary modifications, meta
data may be used during transformation and loading. The complexity of the data
transformations that will be carried out on the data as it moves from the source to
the data warehouse will determine whether or not you need to retain any meta data
(c

for this operation. The quantity and kind of the source systems will also play a role.

Amity Directorate of Distance & Online Education


98 Data Warehousing and Mining

The following data must be provided for each source data field:
Notes

e
Source field

in
Unique identifier, name, type, location (system, object).

A unique identifier is required to avoid any confusion occurring between two fields
of the same name from different sources. Name, type, location describe where the data

nl
comes from. Name is its local field name. Type is the storage type of data. Location is
both the system comes from and the object contains it.

The destination field needs to be described in a similar way to the source.

O
Destination

Unique identifier, name, type, table name.

ty
In order to assist prevent any ambiguity between frequently used field names,
a distinctive identification is required. Name is the identifier that the field in the data
warehouse’s base data will have. To identify between the original data load destination

si
and any copies, base data is employed. The database data type, such as clear, varchar,
or numeric, is known as the destination type. The name of the table field is what is used
to identify it.
r
To transform the source data into the destination data, transformation must be
ve
used.

Transformation(s)
ni

Name, language (module name, syntax).

Name serves as a distinctive identifier to set this transformation apart from others
that are analogous. The language name used in the transformation is contained in the
U

Language attribute. Module name and syntax are the additional attributes.

●● Data Management
To define the data as it is stored in the data warehouse, meta data is necessary.
ity

The warehouse manager needs this to be able to monitor and manage all data
migration. It is necessary to describe each object in the database. All of the following
need the use of meta data:

Table: Columns (name, type)


m

Indexes: Columns (name, type)

Views: Columns (name, type)


)A

Constraints: Name, type, table (columns)

To achieve this, each field will need to store the following meta data.

Field: Unique identifier, field name, description.


(c

The following data needs to be saved for each table.

Table: Table name, columns (column_name, reference identifier)

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 99

As a specific case of tables, aggregates need to store the following meta


Notes

e
information.

Aggregation: Aggregation name, columns (column_name, reference identifier,

in
aggregation).

Examples of aggregation functions are the following functions: min, max, average,
and sum

nl
Partitions are subsets of a table, but they also require knowledge of the key for the
partition and the data range it contains.

O
Partition: Table name, partition key (reference identifier), and table name (range
permitted, range contained)

Types of Metadata

ty
Metadata in a data warehouse fall into three major parts:

◌◌ Operational Metadata

si
◌◌ Extraction and Transformation Metadata
◌◌ End-User Metadata

Operational Metadata
r
ve
As is well known, the enterprise’s numerous operating systems provide the data
for the data warehouse. Different data structures are present in these source systems.
The data items chosen for the data warehouse contain a range of data kinds and field
lengths.
ni

We separate records, combine factors of documents from several source files, deal
with various coding schemes and field lengths when choosing data from the source
systems for the data warehouses. We must be able to link the information we provide
U

to end consumers back to the original data sets. All of this information regarding the
operational data sources is contained in operational metadata.

Extraction and Transformation Metadata


ity

Data about the removal of data from the source systems, such as data extraction
frequencies, methods, and business rules, are included in the metadata for data
extraction and transformation. Additionally, the data transformations that occur in the
data staging area are all described in this type of metadata.
m

End-User Metadata
The navigational map of the data warehouses is the end-user metadata. It makes
)A

it possible for end users to locate data in data warehouses. The end-user metadata
enables the end-users to search for information using terms that are familiar to them
from the business world.

2.6 OLAP Operations



OLAP stands for Online Analytical Processing. Using this software
technology, users can simultaneously study data from many database systems. The




user can query on multi-dimensional data because it is built on a multidimensional



e
data model.

The question of whether OLAP is simply data warehousing in a pretty package

comes up rather frequently. Can you not think of online analytical processing as nothing
more than a method of information delivery? Is this layer in the data warehouse not
another layer that serves as an interface between the users and the data? OLAP

functions somewhat as a data warehouse’s information distribution mechanism. OLAP,
however, is much more than that. Data is stored and more easily accessible through
a data warehouse. An OLAP system enhances the data warehouse by expanding the
capacity for information distribution.

O
OLAP is implemented with data cubes, on which the following operations may be
applied:

◌◌ Roll-up
◌◌ Drill-down
◌◌ Slice and dice

◌◌ Pivot (rotate)
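
Before looking at the individual operations, the following minimal Python sketch (using pandas) shows how a small cube can be built from a flat fact table. The city, item and quarter values and the sales figures are invented for illustration, and the later operations can be pictured against a cube of this kind.

# A minimal, hypothetical cube: one row per (city, item, quarter) with a sales measure.
import pandas as pd

fact = pd.DataFrame({
    "city":    ["Amritsar", "Amritsar", "Patiala", "Patiala", "Amritsar", "Patiala"],
    "item":    ["Mobile",   "Modem",    "Mobile",  "Modem",   "Mobile",   "Mobile"],
    "quarter": ["Q1",       "Q1",       "Q1",      "Q2",      "Q2",       "Q2"],
    "sales":   [605,        825,        14,        400,       680,        31],
})

# Cube view: cities on the rows, (quarter, item) on the columns, sales summed in the cells.
cube = fact.pivot_table(index="city", columns=["quarter", "item"],
                        values="sales", aggfunc="sum", fill_value=0)
print(cube)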

Roll-up

r
Rolling up is like enlarging the data cube’s view. Roll-up is a tool for giving users
details at an abstract level. It accomplishes further data aggregation either by moving
up a concept hierarchy for a dimension or by reducing the dimension. The functioning of
the roll-up operation is shown in Figure (Working of the Roll-up operation).

Figure: Working of the Roll-up operation




Cities are aggregated in this case to reach the state level for dimension reduction.

Additionally, the aggregate can be carried out on Time (Quarter), Year, etc., or on
specific objects to group, such as mobile, modem, etc.
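
A minimal Python sketch of such a roll-up, assuming an invented city-to-state mapping and invented sales figures, is shown below.

# Roll-up sketch: aggregate from city level up to state level.
import pandas as pd

fact = pd.DataFrame({
    "city":  ["Amritsar", "Patiala", "Gurgaon", "Rohtak"],
    "item":  ["Mobile", "Mobile", "Modem", "Modem"],
    "sales": [605, 14, 825, 400],
})
city_to_state = {"Amritsar": "Punjab", "Patiala": "Punjab",
                 "Gurgaon": "Haryana", "Rohtak": "Haryana"}

# Climb the concept hierarchy (city -> state), then aggregate the measure.
fact["state"] = fact["city"].map(city_to_state)
rolled_up = fact.groupby(["state", "item"], as_index=False)["sales"].sum()
print(rolled_up)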

Drill-down
Drill-down is the opposite of roll-up. It is used to give the user detailed info
and is similar to zooming in on the data. By adding a new dimension or descending a

concept hierarchy for an existing dimension, it gives extensive information. It switches
between less and more precise data. The working of the drill-down operation is shown
in Figure (Working of the Drill down operation).


Figure: Working of the Drill down operation



In this example, the time dimension runs down from the level of quarter to the level
of month on moving down.
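
A minimal Python sketch of drilling down from quarter-level to month-level figures, using invented monthly data, is shown below.

# Drill-down sketch: move from the quarter summary back down to the months behind it.
import pandas as pd

monthly = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "quarter": ["Q1",  "Q1",  "Q1",  "Q2",  "Q2",  "Q2"],
    "sales":   [200,   180,   225,   150,   230,   300],
})

# The summarised (quarter-level) view the user starts from ...
by_quarter = monthly.groupby("quarter", as_index=False)["sales"].sum()
# ... and the drill-down: descend the time hierarchy to see the months inside Q1.
q1_detail = monthly[monthly["quarter"] == "Q1"]
print(by_quarter)
print(q1_detail)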

Slice and Dice



Slice and dice illustrate how information may be viewed from several angles. By
choosing a specific dimension from a predetermined cube, the slice operation creates
a new sub-cube. Consequently, a slice is a subset of the cube that represents a single
)A

value for one or more dimension members. It causes a reduction in size. In order to
choose one dimension from a three-dimensional cube and create a two-dimensional
slice, the user performs a slice operation.

The working of the slice operation is depicted in Figure (Working of slice


(c

operation). The criteria time ‘Q1’ for the dimension ‘time’ is used in this figure’s slice
operation. It provides a new sub-cube after choosing one or more dimensions.

Amity Directorate of Distance & Online Education


102 Data Warehousing and Mining

Notes

e
in
nl
O
ty
r si
ve
Figure: Working of Slice operation

Let’s look at an additional instance of a university database system where we


ni

have a count of the number of students from each state in prior semesters in a certain
course, as shown in Figure (The slice operation).
U
ity
m
)A
(c

Figure: The slice operation




As can be seen in the table below (Result of the slice operation for degree = BE), the
Notes

e
information returned is consequently more like a two-dimensional rectangle than a cube.

in
nl
O
ty
Dice
Without a reduction in the number of dimensions, the dice operation is comparable

si
to the slice procedure. By choosing two or more dimensions from a predetermined
cube, the Dice operation creates a new sub-cube. The action of the dice is depicted in
Figure (Working of the Dice operation).

r
ve
ni
U
ity
m
)A
(c

Figure: Working of the Dice operation




Here, the dice operation is carried out on the cube based on the three selection
Notes

e
criteria listed below.

◌◌ (location = ‘Amritsar’ or ‘Patiala’)

in
◌◌ (time = ‘Q1’ or ‘Q2’)
◌◌ (item = ‘Mobile’ or ‘Modem’)
A dice is obtained by doing selection on two or more dimensions, as was previously

nl
mentioned.
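
A minimal Python sketch of this dice selection, using invented figures, is shown below.

# Dice sketch: select on two or more dimensions without dropping any of them.
import pandas as pd

fact = pd.DataFrame({
    "location": ["Amritsar", "Patiala", "Ludhiana", "Amritsar"],
    "time":     ["Q1",       "Q2",      "Q1",       "Q3"],
    "item":     ["Mobile",   "Modem",   "Mobile",   "Modem"],
    "sales":    [605,        31,        120,        90],
})

# The selection mirrors the three criteria above; the result is a smaller sub-cube
# that still carries all three dimensions.
dice = fact[fact["location"].isin(["Amritsar", "Patiala"])
            & fact["time"].isin(["Q1", "Q2"])
            & fact["item"].isin(["Mobile", "Modem"])]
print(dice)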

A dice operation can be used to determine the number of students enrolled in both

O
BE and BCom degrees from the states of Punjab, Haryana, and Maharashtra in the
university database system, as illustrated in Figure (Dice operation).

ty
r si
ve
ni
U

Figure: Dice operation


ity

Pivot
Rotation is another name for the pivot action. To provide a different data
presentation, this action rotates the data axes in view. It might entail switching the
columns and rows. Figure below shows the pivot operation in action.
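
A minimal Python sketch of a pivot, using invented figures, is shown below; the same numbers are simply presented with the row and column dimensions exchanged.

# Pivot (rotate) sketch: swap which dimensions sit on the rows and the columns.
import pandas as pd

fact = pd.DataFrame({
    "city":  ["Amritsar", "Amritsar", "Patiala", "Patiala"],
    "item":  ["Mobile",   "Modem",    "Mobile",  "Modem"],
    "sales": [605,        825,        14,        400],
})

view1 = fact.pivot_table(index="city", columns="item", values="sales", aggfunc="sum")
view2 = view1.T   # the pivoted view: items as rows, cities as columns
print(view1)
print(view2)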
m
)A
(c




Notes

e
in
nl
O
ty
r
Figure: Working of the Pivot operation

si
ve
OLAP Server
A group of software tools known as Online Analytical Processing (OLAP) are used
to analyse data and make business choices. OLAP offers a platform for simultaneously
ni

accessing databases pulled from several database systems to gain insights. It is


built on a multidimensional data model, allowing users to extract and view data from
a variety of angles. The OLAP data is kept in a multidimensional database. OLAP
U

technology is used extensively in Business Intelligence (BI) applications.

Type of OLAP servers:


The three major types of OLAP servers are as follows:
ity

◌◌ ROLAP
◌◌ MOLAP
◌◌ HOLAP
m

2.6.1 OLAP Server - ROLAP


Data that is saved in a relational database, where both the base data and

dimension tables are maintained as relational tables, is typically used for relational on-
line analytical processing (ROLAP). ROLAP servers are used to fill the gap between the
client’s front-end tools and the relational back-end server. OLAP middleware fills in the
gaps left by ROLAP servers’ use of RDBMS to store and manipulate warehouse data.
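
The following is a minimal, hypothetical Python sketch (using the standard sqlite3 module) of the ROLAP idea: the warehouse data lives in ordinary relational tables, and an analytical request is answered by an aggregate SQL query. The table names, columns and figures are invented for illustration.

# ROLAP-style sketch: the "cube" is just relational fact and dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE fact_sales (city_id INTEGER, item TEXT, quarter TEXT, sales REAL);
    INSERT INTO dim_city VALUES (1, 'Amritsar', 'Punjab'), (2, 'Gurgaon', 'Haryana');
    INSERT INTO fact_sales VALUES (1, 'Mobile', 'Q1', 605), (2, 'Modem', 'Q1', 825);
""")

# A roll-up to state level expressed as ordinary SQL against the relational store.
rows = conn.execute("""
    SELECT d.state, f.quarter, SUM(f.sales)
    FROM fact_sales f JOIN dim_city d ON f.city_id = d.city_id
    GROUP BY d.state, f.quarter
""").fetchall()
print(rows)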

Benefits:
●● Both data warehouses and OLTP systems are compatible with it.




●● The underlying RDBMS determines the ROLAP technology’s data size restriction.
Notes

e
As a result, ROLAP does not place a storage cap on the volume of data that can
be kept.

in
Limitations:
●● SQL has limited functionality.

nl
●● Updating aggregate tables is challenging.

O
ty
2.6.2 OLAP Server - MOLAP
r si
ve
Multidimensional On-Line Analytical Processing (MOLAP) offers multidimensional
representations of data through array-based multidimensional storage engines. If the
data set is sparse, storage utilisation in multidimensional data repositories may be low.
ni

MOLAP uses a customised multidimensional array structure to store data on discs.


It is employed for OLAP, which relies on the arrays’ capacity for random access. The
data or measured value associated with each cell is normally kept in the matching array
U

element, which is determined by the dimension instances. A linear allocation based on


layered traversal of the axes in a predetermined order is commonly used to store the
multidimensional array in MOLAP.
ity

All array elements are defined in MOLAP, as opposed to ROLAP, which only
saves records with non-zero facts; as a result, the arrays tend to be sparse, with empty
components taking up a larger proportion of them. Due to the fact that both storage and
retrieval costs must be taken into account when assessing online performance, MOLAP
systems frequently contain features like sophisticated indexing and hashing to locate
m

data while running queries for managing sparse arrays. MOLAP cubes can handle
intricate calculations and are perfect for slicing and chopping data. All calculations are
already done when the cube is produced.
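
The following is a minimal, hypothetical Python sketch (using NumPy) of the MOLAP idea of holding the cube as a multidimensional array; the dimension members and figures are invented for illustration.

# MOLAP-style sketch: one array cell per (city, item, quarter) combination.
import numpy as np

cities   = ["Amritsar", "Patiala"]
items    = ["Mobile", "Modem"]
quarters = ["Q1", "Q2"]

# Pre-allocated (and often sparse) array; each axis position is a dimension member.
cube = np.zeros((len(cities), len(items), len(quarters)))
cube[cities.index("Amritsar"), items.index("Mobile"), quarters.index("Q1")] = 605
cube[cities.index("Patiala"),  items.index("Modem"),  quarters.index("Q2")] = 400

# Aggregations become array operations, e.g. total sales per city over items and quarters.
print(cube.sum(axis=(1, 2)))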
)A

Benefits:
●● Suitable for activities involving cutting and dicing.
●● ROLAP is outperformed when the data is dense.
(c

●● Capable of executing sophisticated math.




Limitations:
Notes

e
●● Changing the proportions without re-aggregating is challenging.
●● Large amounts of data cannot be kept in the cube itself since all calculations must

in
be done at the time the cube is constructed.

nl
O
ty
2.6.3 OLAP Server - HOLAP

si
Hybrid On-Line Analytical Processing combines ROLAP and MOLAP. Compared
to ROLAP and MOLAP, HOLAP is more scalable and performs computations more
r
quickly. ROLAP and MOLAP are combined to create HOLAP. HOLAP servers have the
capacity to store a lot of specific data. On the one hand, the higher scalability of ROLAP
ve
improves HOLAP. HOLAP, on the other hand, utilises cube technology for information
that is summarised quickly. Cubes are smaller than MOLAP because detailed data is
kept in a relational database.
ni

Benefits:
●● The advantages of MOLAP and ROLAP are combined in HOLAP, which also offers
rapid access at all aggregation levels.
U

Limitations:
●● HOLAP architecture is exceedingly complex because it allows both MOLAP and
ity

ROLAP servers, and there is a higher possibility of overlap, especially in their


capabilities.
m
)A
(c




Case Study
Notes

e
Designing a dimensional model for a cargo shipper

in
To examine the process of developing a data warehouse schema, we will use a
case study of a XYZ shipping corporation as our example. The entire data warehouse is
logically described by a schema or dimensional model. The most basic data warehouse
structure that we’ll take into consideration is a star schema. The entity-relationship

nl
diagram of this schema has points radiating out from a central table, giving it the
appearance of a star, which is why it is known as a star schema. It is distinguished by
one or more large fact tables that hold the majority of the data warehouse’s information

O
and a number of considerably smaller dimension tables that each contain details on the
entries for a specific attribute in the fact table. We’ll employ methods for dimensional
modelling.

ty
Gather Business Requirements
We must comprehend the needs of the Shipping Company and the realities of the
underlying source data before beginning a dimensional modelling project. Sessions with

si
business representatives can help us grasp the demands of the organisation and its key
performance indicators, compelling business problems, decision-making procedures,
and supporting analytical requirements.
r
Because we don’t fully grasp the business, we can’t develop the dimensional
ve
model alone. Together with subject matter experts and firm data governance officials,
the necessary dimensional model should be created.

Understanding the case after
gathering data realities
ni

We can clearly define the criteria after gathering them in order to comprehend the
situation. In this instance, the shipping company only provides services to consumers
who have registered with the business. Bulk products are transported between ports
in containers for customers. There may be several intermediate port stops during the
U

journey.

The following details can be found on a shipment invoice: invoice number, date of
pick-up, date of delivery, ship from customer, ship to customer, ship from warehouse,
ity

ship from city, ship to city, shipment mode (such as air, sea, truck, or train), shipment
class (codes like 1, 2, and 3, which translate to express, expedited, and standard),
contract ID, total shipping charges, deal ID (refers to a discount deal in effect),
discount amount, taxes, total billed amount, total ship The product ID, product name,
m

quantity, ship weight, ship volume, and charge are also included on each invoice line.
Additionally, a single invoice will not have several lines for the same product.

Customer size (small, medium, or large), customer ship frequency (low, medium, or
)A

high), and credit status are all included in the customer table (excellent, good, average).
These sometimes go hand in hand; for instance, big clients typically buy frequently and
have great or decent credit. Multiple contracts may be held by a single consumer, but
the total number of contracts is minimal.

There aren’t many specials, and some shipments might not qualify for a discount.
(c

The business might need to run a query on data like total revenue broken down by deal.
Additionally, offers that are available for a specific product, source city, destination city,




and date may need to be identified. Even though it is not mentioned on the invoice,
Notes

e
we may nevertheless find an Estimated Date of Delivery for each cargo from operating
systems. Between shipment mode and shipment class, there is some link. While cities
and warehouses roll up to subregion, region, and territory, products roll up to brand and

in
category.

The following queries may be of interest to the business: client type, product

nl
category, region, year, shipment mode, discount contract terms, ship weight, volume,
delivery delay, and income.

Modeling

O
Dimensional modelling is done in four steps using the Kimball method. It acts as a
guide and looks like this:

Step 1: Select the Business Processes

ty
Step 2: Declare the Grain

Step 3: Identify the Dimensions

si
Step 4: Identify the Facts

In this instance, shipment invoicing is the business process. The dimensional


r
model’s highest level of detail is called granularity. We have greater freedom when we
ve
learn more things since we have more information about the fact measurement, i.e., the
smaller the atomic granularity. In our example, there is one row for each line item on the
invoice.

Identification of the dimensions is the third phase. Below, we’ll show the
ni

dimensions as a table:
U
ity
m
)A

Figure: The Dimension Tables

Finding the facts is the fourth phase in the Kimball process. The following illustrates
(c

the fact table:




Notes

e
in
nl
O
ty
r si
Figure: Shipment Invoice Fact Table
ve
Further notes about the dimensional model:
The role-playing dimensions are presented as views in the categories of City, Date,
ni

Customer Type, and Customer. The corresponding tables include descriptions of the
roles. These role-playing dimensions act as the model’s legal outriggers.

Shipment mode and Shipment Class have a low attribute cardinality and are
U

connected. Therefore, the technique for the shipment dimension is Type 4 Mini-
Dimension.

The Deal dimension, a causal dimension, contains the descriptions of the


ity

discounts. When a deal isn’t in place, we’ll put a specific record in the dimension table
with a descriptive string like “No Deal Applicable” to prevent null keys. Per line item on
the invoice, discounts are distributed.

Bridge crucial composite dimension: The measurement fact table is used in


m

conjunction with a bridge table with dual keys to represent the many-to-many link
between customers and contracts. The primary key of the bridge table is a composite
key made up of the contract ID and customer ID.
)A

A join between a fact table and numerous dimension tables is known as a star
query. A primary key to foreign key join is used to connect each dimension table to
the fact table; however, the dimension tables are not connected to one another. Star
queries are identified by the cost-based optimizer, which also produces effective
execution plans for them.
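
The following is a minimal, hypothetical Python sketch (using the standard sqlite3 module) of such a star query against a heavily simplified version of the shipment model; the table and column names are assumptions for illustration, not the full design described above.

# Star query sketch: each dimension joins to the fact table on its key;
# dimension tables never join each other.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_ship_mode (mode_key INTEGER PRIMARY KEY, mode TEXT);
    CREATE TABLE fact_invoice_line (date_key INTEGER, mode_key INTEGER,
                                    quantity INTEGER, charge REAL);
    INSERT INTO dim_date VALUES (20230105, 2023), (20240110, 2024);
    INSERT INTO dim_ship_mode VALUES (1, 'Air'), (2, 'Sea');
    INSERT INTO fact_invoice_line VALUES (20230105, 1, 10, 500.0),
                                         (20240110, 2, 40, 900.0);
""")

rows = conn.execute("""
    SELECT d.year, m.mode, SUM(f.charge) AS revenue
    FROM fact_invoice_line f
    JOIN dim_date d      ON f.date_key = d.date_key
    JOIN dim_ship_mode m ON f.mode_key = m.mode_key
    GROUP BY d.year, m.mode
""").fetchall()
print(rows)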
(c




Notes

e
in
nl
O
ty
r si
ve
ni

We must keep in mind that the invoice line item is our fact grain. As a result, the
Fact Table won’t show us the agreement that was in effect on a particular day if the
products weren’t sold. To enable this functionality for Deals, we must add attributes.
Additionally, we can use a type 2 SCD technique because this will be a slowly changing
U

dimension (SCD) with shifting deals. We must add the Valid-From and Valid-To dates
as characteristics for deals. It will also be easier to keep track of the necessary deal
information for any given date if attributes are included for the product on which the deal
ity

is applicable and the source-destination city where the deal is applicable.

This situation emphasises how crucial it is to comprehend business requirements


and the four-step dimensional modelling methodology. This results in a model that not
only meets the needs of the business but also secures business support.
m

Summary
●● Dimensional modelling uses a cube operation to represent data, making OLAP
)A

data management more suitable for logical data representation. Ralph Kimball
created the fact and dimension tables that make up the notion of “dimensional
modelling.”
●● A dimensional model is a tool used in data warehouses to read, summarise, and
analyse numerical data such as values, balances, counts, weights, etc. Relational
(c

models, on the other hand, are designed for the insertion, updating, and deletion
of data in an online transaction system that is live.

Amity Directorate of Distance & Online Education


112 Data Warehousing and Mining

●● Both columns that hold numerical facts (commonly referred to as measurements)


Notes

e
and columns that are foreign keys to dimension tables are common in fact tables.
A fact table either includes information at the aggregate level or at the detail level.

in
●● A dimension is a structure that classifies data and is frequently made up of one
or more hierarchies. The dimensional value is better explained by dimensional
qualities.

nl
●● A dimension typically contains hierarchical data. The necessity for data to be
grouped and summarised into meaningful information drives the creation of
hierarchies.

O
●● A schema is a logical definition of a database that logically connects fact and
dimension tables. Star, Snowflake, and Fact Constellation schema are used to
maintain Data Warehouse.

ty
●● In data warehousing, a “dimension” is a group of references to data concerning a
quantifiable event. These incidents are known as facts and are kept in a fact table.
The entities that an organisation wants to keep data for are often the dimensions.

si
●● A bundle of related data pieces is referred to as a “fact table.” It consists of
dimensions and measure values. It follows that the specified dimension and
measure can be used to define a fact table.
●● r
On-line analytical processing is known as OLAP. Through quick, consistent,
ve
interactive access to a variety of views of data that have been transformed from
raw data to reflect the true dimensionality of the enterprise as understood by the
clients, analysts, managers, and executives can gain insight into information using
OLAP, a category of software technology.
ni

●● Characteristics of OLAP: a) Fast, b) Analysis, c) Shared, d) Multidimensional, e)


Information.
●● There are several data sources that this function must handle. For each data
U

source, you must use the right approach. Source data may come in a variety
of data types from various source machines. Relational database systems
may contain a portion of the original data. Some data might be stored in older
hierarchical and network data models.
ity

●● A data warehouse technique must include a cleansing stage because it is


designed to enhance data quality. Rectification and homogenization are the main
data cleansing functions present in ETL technologies.
m

●● As part of data transformation, you carry out a variety of distinct actions. You
must first purge the data you have taken from each source. Cleaning can involve
removing duplicates when bringing in the same data from different source
systems, or it can involve resolving conflicts between state codes and zip codes
)A

in the source data. It can also involve giving default values for missing data
components.
●● The function of data loading is composed of two different groups of jobs. The initial
loading of the data into the data warehouse storage is done when the design and
(c

construction of the data warehouse are finished and it goes online for the first time.
●● The process of writing data into the target database is known as the load. Make

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 113

sure the load is carried out accurately and with the least amount of resources
Notes

e
feasible during the load step.
●● One of the simplest data warehouse schemas is the star schema. Because of the

in
way its points expand outward from a central point, it is known as a star.
●● The main distinction between the snowflake and star schemas is that while the
star schema always has de-normalized dimensions, the snowflake schema may

nl
also contain normalised dimensions. As a result, the snowflake schema is a
variation of the star schema that supports dimension table normalisation.
●● The architecture is the framework that binds every element of a data warehouse

O
together. Consider the architecture of a school as an illustration. The building’s
architecture encompasses more than just its aesthetic. It consists of numerous
classrooms, offices, a library, hallways, gymnasiums, doors, windows, a roof, and
numerous other similar structures.

ty
●● A means of outlining the total architecture of data exchange, processing, and
presentation that exists for end-client computing within the organisation is the
data warehouse architecture. Online transaction processing is built into production

si
applications including payroll, accounts payable, product purchasing, and
inventory control (OLTP). These programmes collect thorough information on daily
operations.
●●
r
Types of data warehouse architectures: a) Single-tier architecture, b) Two-tier
ve
architecture, c) Three-tier architecture.
●● A data warehouse is a single repository for data where records from various
sources are combined for use in online business analytics (OLAP). This suggests
that a data warehouse must satisfy the needs of every business stage across the
ni

board. As a result, data warehouse design is a very time-consuming, complex, and


thus prone to error process.
U

●● Two methods of data warehouse: a) Top-down design approach, b) Bottom-up


design approach.
●● The phrase “data mart” refers to a tiny, localised data warehouse created for a
single function and is used to describe a department-specific data warehouse. It
ity

is typically designed to meet the requirements of a department or a group of users


inside an organisation.
●● Data marts are a subset of data warehouses that fulfil the needs of a specific
department or business function, to put it simply.
m

●● Types of datamarts: a) Dependent data marts, b) Independent data marts.


●● A data warehouse consolidates both recent and old data from various operational
)A

systems (transactional databases), where the data is cleaned and reorganised to


assist data analysis. By developing a metadata system, we hope to construct a
design procedure for building a data warehouse.
●● Types of metadata: a) Operational metadata, b) Extraction and transformation
metadata, c) End-user metadata.
(c

●● Online Analytical Processing Server is what OLAP stands for. Using this
software technology, users can simultaneously study data from many database

Amity Directorate of Distance & Online Education


114 Data Warehousing and Mining

systems. The user can query on multi-dimensional data because it is built on a


Notes

e
multidimensional data model.
●● Types of OLAP servers: a) Relational OLAP, b) Multidimensional OLAP, c) Hybrid

in
OLAP.

Glossary

nl
●● DM: Dimensional Modeling.
●● UML: Unified Modeling Language.
●● BPMN: Business Process Modeling Notation.

O
●● OLAP: On-line analytical processing.
●● FASMI: Fast, Analysis, Shared, Multidimensional, Information.

ty
●● ROI: return on investment.
●● SQL: Structured Query Language.
●● OLTP: Online Transaction Processing.

si
●● ETL: Extraction, Transformation, and Loading.
●● 3NF: Third Normal Form.
●● r
IDM: Independent data marts.
ve
●● ROLAP: Relational Online Analytical Processing.
●● MOLAP: Multidimensional Online Analytical Processing.
●● HOLAP: Hybrid Online Analytical Processing.
ni

●● RDBMS: Relational Database Management System.

Check Your Understanding


U

1. A_ _ _ _is a tool used in data warehouses to read, summarise, and analyse numerical
data such as values, balances, counts, weights, etc.
a. Dimensional model
ity

b. Data structure
c. Data warehouse
d. Data mining
m

2. A_ _ _ _ is a logical definition of a database that logically connects fact and dimension


tables.
a. Table
)A

b. Schema
c. Query
d. None of the mentioned
(c

3. In data warehousing, a_ _ _ _is a group of references to data concerning a


quantifiable event. These incidents are known as facts and are kept in a fact table.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 115

a. Schema
Notes

e
b. Tables
c. Dimension

in
d. Normalisation
4. A bundle of related data pieces is referred to as a_ _ _ _ .

nl
a. Dimension
b. Schema

O
c. Data mining
d. Fact table
5. The term_ _ _ _ _refers to values that depend on dimensions.

ty
a. Measures
b. Fact table

si
c. Data warehouse
d. None of the mentioned
6. _ _ _. _is the term used to describe the process of taking data from source systems
and bringing it into the data warehouse. r
ve
a. OLAP
b. ETL
c. FASMI
ni

d. OLTP
7. In a star schema, normalising the dimension tables is the procedure known as_
_________ _ __.
U

a. Normalisation
b. Denormalisation
ity

c. Snowflaking
d. Clustering
8. The phrase_ _ _ _ refers to a tiny, localised data warehouse created for a single
function and is used to describe a department-specific data warehouse.
m

a. Meta data
b. Media data
)A

c. Natural language processing


d. Data marts
9. Initiating the request for a data mart, acquiring information about the needs, and
creating the data mart’s logical and physical design are all covered in_ _ _ _.
(c

a. Designing
b. Constructing

Amity Directorate of Distance & Online Education


116 Data Warehousing and Mining

c. Populating
Notes

e
d. Accessing
10. _ _ _ _ _ involves obtaining data from the source, cleaning it up, transforming it into

in
the appropriate format and level of detail, and transferring it into the data mart.
a. Designing

nl
b. Populating
c. Accessing
d. Constructing

O
11. _ _ _ _ is the step, in which the data is put to use through querying, analysis, report
creation, chart and graph creation, and publication.
a. Constructing

ty
b. Populating
c. Accessing

si
d. None of the mentioned
12. Data about data is termed as_ _ _ ?
a. Media format data r
ve
b. Media feature data
c. Media data
d. Meta data
ni

13. A group of software tools known as_ _ _ _ are used to analyse data and make
business choices
a. Online Analytical Processing
U

b. Online Transaction Processing


c. Business Intelligence
d. ETL
ity

14. What is the full form of ROLAP?


e. Relative Online Analytical Processing
a. Relational Online Analytical Processing
m

b. ResearchOnline Analytical Processing


c. Rational Online Analytical Processing
)A

15. Rotation is another name for the_ _ _ _ action.


a. Slice
b. Dice
c. Pivot
(c

d. Roll up

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 117

16. Fact tables are which of the following?


Notes

e
a. Completely normalized
b. Partially normalized

in
c. Completely denormalized
d. Partially denormalized

nl
17. Data transformation includes which of the following?
a. A process to change data from a summary level to detailed level

O
b. A process to change data from a detailed level to summary level
c. Joining data from one source into various sources of data
d. Separating data from one source into various sources of data

ty
18. Reconciled data is which of the following?
a. Data stored in the various operational systems throughout the organization.

si
b. Current data intended to be the single source for all decision support systems.
c. Data stored in one operational system in the organization.
d. Data that has been selected and formatted for end-user support applications.
19. The load and index is which of the following?
r
ve
a. A process to reject the data in the data warehouse and to create the
necessary indexes
b. A process to upgrade the quality of data after it is moved into a data
ni

warehouse
c. A process to load the data in the data warehouse and to create the necessary
indexes
U

d. A process to upgrade the quality of data before it is moved into a data


warehouse
20. The extract process is which of the following?
ity

a. Capturing all of the data contained in various operational system


b. Capturing a subset of the data contained in various decision support system
c. Capturing all of the data contained in various decision support system
m

d. Capturing a subset of the data contained in various operational system

Exercise
)A

1. Define facts and dimensions, design fact tables and design dimension table.
2. What do you mean by data warehouse schemas?
3. Define OLAP.
4. What are the various features and benefits of OLAP?
(c

5. Define the term:

Amity Directorate of Distance & Online Education


118 Data Warehousing and Mining

a. Data extraction
Notes

e
b. Clean-up
c. Transformation

in
6. What do you mean by Star, snowflake and galaxy schemas for multidimensional
databases?

nl
7. Define architecture for a warehouse.
8. Define the various steps for construction of data warehouses, data marts and
metadata.

O
9. What do you mean by the OLAP server - ROLAP, MOLAP and HOLAP.

Learning Activities

ty
1. As a senior analyst on the project team of a publishing company exploring the
options for a data warehouse, make a case for OLAP. Describe the merits of OLAP
and how it will be essential in your environment.

si
Check Your Understanding - Answers
1 a 2 b
3 c r
4 d
ve
5 a 6 b
7 c 8 d
9 a 10 b
ni

11 c 12 d
13 a 14 b
15 c 16 a
U

17 b 18 b
19 c 20 d
ity

Further Readings and Bibliography:


1. Bill Williams, 2012, The Economics of Cloud Computing, Cisco Systems, Inc.
2. Borko Furht & Armando Escalante, 2010, Handbook of Cloud Computing,
m

Springer.
3. Kevin Jackson, 2012, OpenStack Cloud Computing Cookbook, Packt
Publishing.
)A

4. Rajkumar Buyya, James Broberg & Andrzej Goscinski, 2011, Cloud


Computing: Principles and Paradigms, John Wiley & Sons.
5. Sarna, David E.Y., 2010, Implementing and Developing Cloud Computing
Applications, Taylor & Francis.
(c

6. Williams, Mark I., 2010, A Quick Start Guide to Cloud Computing: Moving Your
Business into the Cloud, Kogan Page Ltd.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 119

Module - III: Data Mining


Notes

e
Learning Objectives:

in
At the end of this topic, you will be able to understand:

●● Understanding Concepts of Data Warehousing

nl
●● Advancements to Data Mining
●● Data Mining on Databases

O
●● Data Mining Functionalities
●● Objectives of Data Mining and the Business Context for Data Mining
●● Data Mining Process Improvement

ty
●● Data Mining in Marketing
●● Data Mining in CRM

si
●● Tools of Data Mining

Introduction
r
Although the phrase “we are living in the information age” is common, we are
ve
actually living in the data age. Every day, terabytes or petabytes of data from business,
society, science and engineering, medicine, and nearly every other element of daily life
flood our computer networks, the World Wide Web (WWW), and numerous data storage
devices. The rapid development of effective data gathering and storage methods, as
well as the computerization of our society, are to blame for this tremendous expansion
ni

in the volume of data that is currently available. Globally, businesses produce


enormous amounts of data, such as stock trading records, product descriptions, sales
promotions, corporate profiles, and consumer feedback. For instance, huge retailers
U

with thousands of locations around the world, like Wal-Mart, conduct hundreds of
millions of transactions per week. High orders of petabytes of data are continuously
produced by scientific and engineering procedures, including remote sensing, process
measurement, scientific experiments, system performance, engineering observations,
ity

and environment surveillance.

Data mining converts a sizable data collection into knowledge. Every day,
hundreds of millions of searches are made using search engines like Google. Every
query can be seen as a transaction in which the user expresses a demand for
m

information. What innovative and helpful information can a search engine get from
such a vast database of user queries gathered over time? It’s interesting how some
user search query patterns can reveal priceless information that cannot be learned from
)A

examining individual data items alone.

Google’s Flu Trends, for instance, uses particular search phrases as a gauge of flu
activity. It discovered a strong correlation between the number of persons who actually
have flu symptoms and those who look for information relevant to the illness. When
(c

all of the flu-related search queries are combined, a pattern becomes apparent. Flu
Trends can predict flu activity up to two weeks sooner than conventional methods by
using aggregated Google search data. This illustration demonstrates how data mining
may transform a sizable data set into knowledge that can assist in resolving a current
global concern.

The market has been overrun with different products by hundreds of suppliers

in
during the last five years. Data modelling, data gathering, data quality, data analysis,
metadata, and other aspects of data warehousing are all covered by vendor solutions
and products. No fewer than 105 top items are highlighted in the Data Warehousing

nl
Institute’s buyer’s guide. The market is already enormous and is still expanding.

You have almost surely heard of data mining. Most of you are aware that technology
plays a role in knowledge discovery. It’s possible that some of you are aware of the use

O
of data mining in fields like marketing, sales, credit analysis, and fraud detection. You’re
all aware, at least in part, that data mining and data warehousing are related. Almost all
corporate sectors, including sales and marketing, new product development, inventory
management, and human resources, use data mining.

ty
The definition of data mining has possibly as many variations as there are
supporters and suppliers. Some experts include a wide variety of tools and procedures
in the definition, ranging from straightforward query protocols to statistical analysis.

si
Others limit the definition to methods of knowledge discovery. Although not necessary, a
functional data warehouse will offer the data mining process a useful boost.

r
3.1 Understanding Data Mining
ve
Let’s try to comprehend the technology in a business context before offering
some formal definitions of data mining. Data mining delivers information, much like all
other decision support systems. Please refer to the figure below, which illustrates how
decision support has evolved. Take note of the first strategy, when simple decision
ni

assistance technologies were available. Database systems followed, offering more


insightful data for decision support. Data warehouses with query and report tools that
help users find the specific forms of decision support information they require started
U

to take over as the main and most beneficial source of decision support information in
the 1990s.

OLAP tools became accessible for more complex analysis. The strategy for
ity

gathering information up until this point had been driven by the users. But because of
the sheer amount of data, no one can rely on analysis and query tools alone to find
valuable trends.

For instance, it is virtually impossible to think through all the potential linkages in
marketing analysis and to discover insights by querying and digging further into the
data warehouse. You require a solution that can forecast client behaviour by learning
from prior associations and outcomes. You need a tool that can conduct knowledge
discovery on its own. Instead of a user-driven strategy, you want a data-driven strategy.
At this point, data mining steps in and takes over from the users.
Figure: Decision support progresses to data mining.

Enterprise data is gathered by forward-thinking businesses from their operational


r
systems of origin, processed through cleansing and transformation steps, and then
ve
stored in data warehouses in a manner suited for multidimensional analysis. Data
mining advances the procedure significantly.

3.1.1 Understanding Concepts of Data Warehousing


ni

Building and using a data warehouse is known as data warehousing. Data from
several heterogeneous sources is combined to create a data warehouse, which
facilitates analytical reporting, organised and/or ad hoc searches, and decision-making.
U

Data consolidation, data integration, and data cleaning are all components of data
warehousing.

Data warehouses (DWHs) are repositories where organisations store data


ity

electronically by separating it from operational systems and making it accessible for ad-
hoc searches and scheduled reporting. Creating a data warehouse, in turn, requires
designing a data model that can produce insights quickly.

When compared to data found in the operational environment, DWH data is


m

different. To make daily operations, analysis, and reporting easier, it is set up such that
pertinent data is grouped together. This aids in identifying long-term trends and enables
users to make plans based on that knowledge. Thus, it is important for firms to use data
)A

warehouses.
Figure: Data Warehouse Architecture
(Source: https://www.astera.com/type/blog/what-is-data-warehousing/)

si
In the early stages, four significant factors drove many companies to move into
data warehousing:

◌◌ Fierce competition
◌◌ Government deregulation
◌◌ Need to revamp internal processes
◌◌ Imperative for customized marketing
The first industries to use data warehousing were telecommunications, finance,
and retail. Government deregulation in banking and telecoms was largely responsible
for that. Retail businesses moved to data warehousing because of the increased
competition, and utility firms joined the group as that industry was deregulated.

Enterprises in the financial services, healthcare, insurance, manufacturing,


pharmaceuticals, transportation, and distribution sectors made up the second wave of
companies to enter the data warehousing market.
ity

Today, investment on data warehouses is still dominated by the banking and


telecoms sectors. Data warehousing accounts for up to 15% of these sectors’
technological budgets. Businesses in these sectors gather a lot of transactional data.
Such massive amounts of data can be converted through data warehousing into
strategic information valuable for decision making.
m

(Source: https://anuradhasrinivas.files.wordpress.com/2013/03/data-warehousing-fundamentals-by-paulraj-ponniah.pdf)
)A

In its early phases, data warehousing was used almost exclusively by large global
firms. Building a data warehouse was expensive, and the available tools were not
quite sufficient; only big businesses had the money to invest in the new paradigm.
Now that smaller and medium-sized businesses can afford the cost of constructing
data warehouses or purchasing turnkey data marts, we are starting to notice a
strong presence of data warehousing in these businesses. Examine the database
management systems (DBMSs) you have previously employed. The database vendors
have now included tools to help you create data warehouses using these DBMSs,
as you will discover. The cost of packaged solutions has decreased, and operating
systems are now reliable enough to support data warehousing operations.

in
Although earlier data warehouses focused on preserving summary data for high-
level analysis, different businesses are increasingly building larger and larger data

nl
warehouses. Companies may now collect, purify, maintain, and exploit the enormous
amounts of data produced by their economic activities. The amount of data stored in
data warehouses is now in the terabyte scale. In the telecommunications and retail
industries, data warehouses housing many terabytes of data are not unusual.

O
Take the telecommunications sector, for instance. In a single year, a telecoms
operator produces hundreds of millions of call-detail transactions. The corporation must
examine these specific transactions in order to market the right goods and services.

ty
The company’s data warehouse must keep data at the most basic degree of detail.

Consider a chain of stores with hundreds of locations in a similar way. Each store
produces thousands of point-of-sale transactions each day. Another example is a

si
business in the pharmaceutical sector that manages thousands of tests and measures
to obtain government product approvals. In these sectors, data warehouses are
frequently very vast.
r
ve
Multiple Data Types
If you’re just starting to develop your data warehouse, you might only include
numeric data. You will quickly understand that simply including structured numerical
data is insufficient. Be ready to take into account more data kinds.
ni

Companies have historically included structured data, primarily quantitative data,


into their data warehouses. From this vantage point, decision support systems might be
separated into two groups: knowledge management dealt with unstructured data, while
U

data warehousing dealt with organised data. This line between them is getting fuzzier.
For instance, the majority of marketing data is structured data presented as numerical
numbers.
ity

Unstructured data in the form of pictures is also present in marketing data. Let’s
imagine that a decision-maker is conducting research to determine the most popular
product categories. During the course of the analysis, the decision maker settles on
a particular product type. In order to make additional judgments, he or she would
now like to see pictures of the products in that type. How is it possible to do this?
m

Companies are learning that their data warehouses need to integrate both structured
and unstructured data.
)A

What kinds of data fall under the category of unstructured data? The data
warehouse must incorporate a variety of data kinds to better support decision-making,
as seen in the figure below.
Figure: Data warehouse: multiple data types.
Adding Unstructured Data: The incorporation of unstructured data, particularly text
and images, is being addressed by some suppliers by treating such multimedia data as
merely another data type. These are classified as relational data and are kept as binary
large objects (BLOBs) with a maximum size of 2 GB. They are defined as user-defined
types (UDTs) and manipulated through user-defined functions (UDFs).

It is not always possible to store BLOBs as just another relational data type. A
U

server that can send several streams of video at a set rate and synchronise them with
the audio component, for instance, is needed for a video clip. Specialized servers are
being made available for this purpose.
ity

Searching Unstructured Data: Your data warehouse has been improved by the
addition of unstructured data. Do you have any other tasks to complete? Of course,
integrating such data is largely useless without the capacity to search it. In order to help
users find the information they need from unstructured data, vendors are increasingly
offering new search engines. An illustration of an image search technique is querying by
m

image content.

Preindexing photographs based on forms, colours, and textures is possible with


this product. The chosen images are shown one after the other when more than one
)A

image matches the search criteria.

Retrieval engines preindex the textual documents for free-form text data in order
to support word, character, phrase, wild card, proximity, and Boolean searches. Some
search engines are capable of matching words with their equivalents; when you search
for the word mouse, for example, documents containing the word mice will also turn up.

Directly searching audio and video data is still a research topic. Such data is typically
described with free-form text, which is then searched using the existing textual
search techniques.

nl
Spatial Data: Imagine one of your key users—perhaps the Marketing Director—
is online and accessing your data warehouse to conduct an analysis. Show me the
sales for the first two quarters for all products compared to last year in shop XYZ,

O
requests the Marketing Director. He or she considers two additional questions after
going over the findings. What is the typical income of those who reside in the store’s
neighbourhood? How far did those people drive to get to the store on average? These
queries can only be addressed if spatial data is included in your data warehouse.

ty
Your data warehouse’s value will increase significantly with the addition of spatial
data. Examples of spatial data include an address, a street block, a city quarter, a
county, a state, and a zone. Vendors have started addressing the requirement for

si
spatial data. In order to combine spatial and business data, some database providers
offer spatial extenders to their products via SQL extensions.

r
Data mining is a technique used with the Internet of Things (IoT) and artificial
intelligence (AI) to locate relevant, possibly useful, and intelligible information, identify
ve
patterns, create knowledge graphs, find anomalies, and establish links in massive data.
This procedure is crucial for expanding our understanding of a variety of topics that deal
with unprocessed data from the web, text, numbers, media, or financial transactions.
ni

By fusing several data mining techniques for usage in financial technology and
cryptocurrencies, the blockchain, data sciences, sentiment analysis, and recommender
systems, its application domain has grown. Additionally, data mining benefits a variety
of real-world industries, including biology, data security, smart grids, smart cities,
U

and maintaining the privacy of health data analysis and mining. Investigating new
developments in data mining that incorporate machine learning methods and artificial
neural networks is also required.
ity

Machine learning and deep learning are undoubtedly two of the areas of
artificial intelligence that have received the most research in recent years. Due to the
development of deep learning, which has provided data mining with previously unheard-
of theoretical and application-based capabilities, there has been a significant change
m

over the past few decades. Research in this area examines both theoretical
and practical applications of knowledge discovery and extraction, image analysis,
classification and clustering, as well as FinTech and cryptocurrencies, the blockchain
and data security, and privacy-preserving data mining, among many other topics.

According to Fayyad, Piatetsky-Shapiro and Smyth (1996), data mining (DM)
refers to a collection of particular techniques and algorithms created specifically for
the purpose of identifying patterns in unprocessed data. The massive amount of data
that must be managed more easily in sectors like commerce, the medical industry,
astronomy, genetics, or banking has given rise to the DM process. Additionally, the
exceptional success of hardware technologies resulted in a large amount of storage
capacity on hard drives, which posed a challenge to the emergence of numerous issues

in
with handling enormous amounts of data. Of course, the Internet’s rapid expansion is
the most significant factor in this situation.

nl
The core of the DM process is the application of methods and algorithms to find
and extract patterns from stored data, although data must first be pre-processed before
this stage. It is well known that DM algorithms alone do not yield satisfactory results.
Finding useful knowledge from raw data thus requires the sequential application of

O
the following steps: developing an understanding of the application domain, creating a
target data set based on an intelligent way of selecting data by focusing on a subset
of variables or data samples, data cleaning and pre-processing, data reduction and

ty
projection, choosing the data mining task, choosing the data mining algorithm, the data
mining step, and interpreting the results.

Regression is learning a function that maps a data item to a real-valued prediction

si
variable. Clustering is the division of a data set into subsets (clusters). Association rules
determine implication rules for a subset of record attributes. Summarization involves
methods for finding a compact description for a subset. Classification is learning a

r
function that maps a data item into one of several predefined classes .
ve
The DM covers a wide range of academic disciplines, including artificial
intelligence, machine learning, databases, statistics, pattern identification in data, and
data visualisation. Finding trends in the data and presenting crucial information in a way
that the average person can understand it are the main objectives here. For ease of
ni

usage, it is advised that the information acquired be simple to understand. Getting high-
level data from low-level data is the overall goal of the process.

The DM method can be applied in a wide range of scientific fields, including


U

biology, medicine, genetics, astronomy, high-energy physics, banking, and business,


among many others. DM algorithms and approaches can be used on a variety of
information, including plain text and multimedia formats.
ity

3.1.2 Advancements to Data Mining

History of Data Mining


Gregory Piatesky-Shapiro, a researcher, first coined the phrase “data mining”
m

(DM) in 1989. At the time, there weren’t many data mining tools available for
completing a single problem. The C4.5 decision tree technique, the SNNS neural
network, and parallel coordinate visualisation are among examples (Quinlan, 1986).
)A

(Inselberg, 1985). These techniques required significant data preparation and were
challenging to use.

Suites, or second-generation data mining systems, were created by manufacturers


beginning in 1995. These tools considered the fact that the DM process necessitates
several kinds of data analysis, with the bulk of the work occurring during the data
(c

cleaning and preprocessing stages. Users may carry out a number of discovery
activities (often classification, clustering, and visualisation) using programmes like

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 127

SPSS Clementine, SGI Mineset, IBM Intelligent Miner, or SAS Enterprise Miner. These
Notes

e
programmes also offered data processing and visualisation. A GUI (Graphical User
Interface), invented by Clementine, was one of the most significant developments since
it allowed users to construct their information discovery process visually.

in
In 1999, more than 200 tools were available for handling various tasks, but even
the best of them only handled a portion of the whole DM architecture. Preprocessing

nl
and data cleaning were still required. Data-mining-based “vertical solutions” have
emerged as a result of the growth of these types of applications in industries like direct
marketing, telecommunications, and fraud detection. The systems HNC Falcon for
credit card fraud detection, IBM Advanced Scout for sports analysis, and NASD DM

O
Detection system are the best examples of such applications.

In recent years, parallel and distributed computers were used as approaches


to the DM process. These instructions caused Parallel DM and Distributed DM to

ty
appear. Data sets are distributed to high performance multi-computer machines
for analysis in Parallel DM. All algorithms that were employed on single-processor
units must be scaled in order to run on parallel computers, despite the fact that these

si
types of machines are becoming more widely available. The Parallel DM technique is
appropriate for transaction data, telecom data, and scientific simulation. Distributed DM
must offer local data analysis solutions as well as global methods for recombining local

r
results from each computing unit without requiring a significant amount of data to be
transferred to a central server. Grid technologies combine distributed parallel computing
ve
and DM.

Data Mining and Neural Networks


The core of the DM process is the data mining step. Neural networks are utilised to
ni

complete several DM jobs. Since the human brain serves as the computing model for
these networks, neural networks should be able to learn from experience and change
in reaction to new information. Neural networks can find previously unidentified links
U

and learn complicated non-linear patterns in the data when subjected to a collection of
training or test data.

Our brain’s capacity to differentiate between two items is one of its most crucial
ity

abilities. This capability of differentiating between what is friendly and what is dangerous
for particular species has allowed numerous species to evolve successfully.

Regression is a learning process that converts a specific piece of data into a


genuine variable that can be predicted. This is comparable to what individuals actually
m

do: after seeing a few examples, they can learn to extrapolate from them in order to
apply the information to unrelated issues or situations. One of the benefits of adopting
neural network technology is its capacity for generalisation.
)A

The Hebbian learning theory, which holds that information is stored by associations
with relevant memories, was validated by neurophysiologists. A intricate network
of concepts with semantically similar meanings exists in human brain. According to
the hypothesis, when two neurons in the brain are engaged simultaneously, their
connection develops stronger and the physical properties of their synapse are altered.
(c

The neural network field’s associative memory branch focuses on developing models
that capture associative activity.

Amity Directorate of Distance & Online Education


128 Data Warehousing and Mining

Associative memories with a restricted capacity have been demonstrated in


Notes

e
neural networks like Binary Adaptive Memories and Hopfield networks. Different neural
models, including back propagation networks, recurrent back propagation networks,
self-organizing maps, radial basis function networks, adaptive resonance theory

in
networks (ART), probabilistic neural networks, and fuzzy perceptions, are used to
resolve these DM procedures. Certain DM procedures are available to any structure.

nl
Businesses who were late to implement data mining are rapidly catching up to
the others. Making crucial business decisions frequently involves using data mining
to extract key information. Data mining is predicted to become as commonplace as
some of the current technologies. Among the major trends anticipated to influence the

O
direction of data mining are:

Multimedia Data Mining

ty
This is one of the most recent techniques that is gaining acceptance due to its
increasing capacity to accurately capture important data. Data extraction from several
types of multimedia sources, including audio, text, hypertext, video, pictures, and more,

si
is a part of this process. Afterwards, the extracted data is transformed into a numerical
representation in a variety of formats. Using this technique, you may create categories
and clusters, run similarity tests, and find relationships.

Ubiquitous Data Mining r


ve
With this technique, data from mobile devices is mined to obtain personal
information about people. Despite having a number of drawbacks of this kind, including
complexity, privacy, expense, and more, this approach is poised to develop and find use
in a variety of industries, particularly when researching human-computer interactions.
ni

Distributed Data Mining


Since it includes the mining of a sizable amount of data held across numerous
U

corporate sites or within multiple companies, this type of data mining is becoming more
and more popular. To collect data from various sources and produce improved insights
and reports based on it, highly complex algorithms are used.
ity

Spatial and Geographic Data Mining


This is a newly popular sort of data mining that involves information extraction from
environmental, astronomical, and geographic data, including photographs captured
from space. This kind of data mining shows a variety of factors, mostly employed in
m

geographic information systems and other navigation applications, such as distance


and topology.
)A

Time Series and Sequence Data Mining


The analysis of cyclical and seasonal trends is the main application of this kind of
data mining. Even random incidents that happen outside of the usual course of events
can be studied with the use of this approach. Retail businesses mostly employ this
strategy to analyse consumer behaviour and purchase habits.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 129

3.2 Motivation and Knowledge Discovery Process


Notes

e
KDD stands for Knowledge Discovery in Databases; the term is often used
interchangeably with data mining.

in
It is essentially a group of procedures that aid in the gathering of intelligence.
Simply expressed, it makes it easy to make rational decisions that will help you reach
your goals.

nl
When you make decisions using data science, this method is really helpful. Simply
filter through a big volume of datasets. Once finished, select a few relational, business-
related patterns to filter. Finding such informative patterns through analysis is possible.

O
They serve as a foundation for analysis and better decision-making. What trend is most
likely to catch on is simple to forecast.

Data is continuously accumulating thanks to sensors, pictures, social media,

ty
movies, etc. We must carefully comprehend what it says. The knowledge discovery
technique works best for this objective. In fact, it facilitates decision-making.

This procedure aids in the efficient analysis of the data gathered. This analysis’s

si
output is referred to as business intelligence. Undoubtedly, real-time analytics apps and
in-depth examination of previous data are necessary to arrive at workable answers. You
can simply hear the voice of the data with their assistance.
r
In a word, it aids in corporate strategy development and operational management.
ve
Marketing, advertising, sales, and customer service are all involved in these two
objectives. It can also be used to modify the functions of manufacturing, supply chain
management, finance, and human resources. In other words, you can use it to refine
various industries and their processes.
ni

In fact, it aids in identifying fraud trends, dangers, cybersecurity holes, and


consumer behaviour. Once known, making important judgments appears to be simple.
U

A significant amount of data is needed for the process of knowledge discovery, and
that data must first be in a reliable state in order to be used in data mining. The ideal
source of data for knowledge discovery is the aggregation of enterprise data in a data
warehouse that has been adequately vetted, cleaned, and integrated. The warehouse
ity

is likely to include not only the depth of data required for this step in the BI process, but
it also likely has the necessary historical data. Having the previous data available for
testing and evaluating hypotheses makes the warehouse even more beneficial, because
a lot of data mining relies on using one set of data for training a process that can then
m

be evaluated on another set of data.

The availability of high-performance analytical platforms, the capability to stream


many data sources into an analytical platform, and the democratisation of data mining
)A

and analytics algorithms through direct integration with existing platforms may have
produced the “perfect storm” for the development of large-scale “big data” analytics
applications.

In essence, systems like Hadoop’s accessibility, usability, and scalability have


(c

fostered the development of sophisticated data mining and analytics algorithms.


Programming environments like the one offered by Hadoop and MapReduce offer a
simple framework for the creation of parallel and distributed applications. When these

Amity Directorate of Distance & Online Education


130 Data Warehousing and Mining

applications reach a certain level of maturity, they will be able to manage large-scale
Notes

e
applications that access both structured and unstructured data from several sources
while streaming it at varying rates.

in
3.2.1 Data Mining on Databases

Steps of Knowledge Discovery Database Process

nl
Step 1. Data Collection
You require some data for any research. The research team locates pertinent

O
tools to extract, convert, and process data from various websites or big data settings
like data lakes, servers, or warehouses. Both structured and unstructured files could be
present. Data migration issues and the database’s inability to demonstrate file format

ty
compliance might occasionally create errors.

To avoid incorrect or erroneous data entry in this case, the errors should be
eliminated at the point of entry.

si
Step 2: Preparing Datasets
Investigate datasets after collection. As part of pre-processing, create their profiles.
r
It is the transformation step, in fact. All records are cleaned up, kept consistent, and
abnormalities are eliminated here. Resources are provided in this manner for efficient
ve
data analysis. The final phase would then be the error-fixing, which is done in the next
step. The method is used by Eminenture’s data mining specialists.

Step 3: Cleansing Data


ni

Here, cleaning refers to eliminating noise and duplication from the collected data.
Noise in the database refers to duplicates, corrupt, incomplete, and other anomalous
records. The following are the subsets of a tidy and clean database (a small
illustrative sketch follows the list):

◌◌ De-duplication
◌◌ Data appending
ity

◌◌ Normalization
◌◌ Typos removal
◌◌ Standardization
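The cleansing subsets listed above can be illustrated with a few lines of pandas. The
sketch below is minimal and hypothetical: the table, the column names (cust_name,
city, revenue) and the typo map are invented for illustration and are not part of any
specific tool discussed in this unit.

```python
import pandas as pd

raw = pd.DataFrame({
    "cust_name": ["Asha Rao", "asha rao", "Vikram  Mehta", "Nina Paul"],
    "city":      ["Delhi", "Delhi", "Mumbay", "Chennai"],   # "Mumbay" is a deliberate typo
    "revenue":   [1200.0, 1200.0, 850.0, None],
})

# De-duplication: standardise case and spacing first so near-duplicates collapse.
raw["cust_name"] = (raw["cust_name"].str.strip()
                                    .str.title()
                                    .str.replace(r"\s+", " ", regex=True))
cleaned = raw.drop_duplicates(subset=["cust_name", "city"]).copy()

# Typos removal: map known misspellings to canonical values.
cleaned["city"] = cleaned["city"].replace({"Mumbay": "Mumbai"})

# Data appending: fill a missing value from another source (here, the column mean).
cleaned["revenue"] = cleaned["revenue"].fillna(cleaned["revenue"].mean())

# Normalization: rescale the numeric column to the 0-1 range.
rev = cleaned["revenue"]
cleaned["revenue_norm"] = (rev - rev.min()) / (rev.max() - rev.min())

print(cleaned)
```

In practice the same operations are expressed in whatever data-quality tool the
organisation already uses; the point is only that each cleansing subset corresponds
to a concrete, repeatable operation.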

Step 4: Data Integration


m

This procedure necessitates synthesising information from several sources.


A variety of apps and technologies for migration and synchronisation are used in the
)A

knowledge discovery (KDD) process.

This procedure mainly entails extracting, transforming and loading datasets, and is
therefore also known as the ETL procedure (a short sketch follows the list below).
Information needed for study must be:

◌◌ Extracted
(c

◌◌ Transformed
◌◌ Loaded for deep analysis
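As a rough illustration of the extract-transform-load idea, the following Python sketch
reads a flat file, applies a simple transformation, and loads the result into a SQLite
table. The file name orders.csv, its columns and the table name fact_orders are
assumptions made purely for illustration.

```python
import sqlite3
import pandas as pd

# Extract: read the raw source (a flat file standing in for a source system).
orders = pd.read_csv("orders.csv")          # assumed columns: order_id, amount, order_date

# Transform: enforce types, drop unusable rows, derive an analysis-friendly column.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date"])
orders["year_month"] = orders["order_date"].dt.strftime("%Y-%m")

# Load: write the conformed data into the target store used for deep analysis.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="replace", index=False)
```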

Step 5: Data Analysis
Analysis is the process of delving deeply into the data for insights. Datasets are
processed here by filtering and are carefully scrutinised to see whether they are helpful
and an ideal fit for processes such as the following (a brief example follows the list):

◌◌ Neural network
◌◌ Decision trees

nl
◌◌ Naïve Bayes
◌◌ Clustering

O
◌◌ Association
◌◌ Regression, etc.
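As a small, hypothetical example of the analysis step, the sketch below fits a simple
linear regression to a handful of invented monthly sales figures using NumPy; the
numbers are not taken from any real dataset.

```python
import numpy as np

# Invented monthly sales figures (units sold in months 1..6).
months = np.array([1, 2, 3, 4, 5, 6], dtype=float)
sales  = np.array([110, 118, 131, 140, 152, 161], dtype=float)

# Fit a straight line (degree-1 polynomial): sales ~ slope * month + intercept.
slope, intercept = np.polyfit(months, sales, deg=1)
print(f"trend: {slope:.1f} extra units per month, baseline {intercept:.1f}")

# Use the fitted line to project the next month.
print("forecast for month 7:", round(slope * 7 + intercept))
```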

Step 6: Data Transformation

ty
In this process, data is transformed into a format and structure that will be exactly
the same for further data mining. Sometimes, PDFs or data in many formats are
available. It is essential to digitise them. As a result, the following actions are taken
(a brief sketch follows the list):

◌◌ Using data mapping, elements are assigned from sources to destinations.
◌◌ OCR conversion for scanning, recognition, and translation of data.
◌◌ Coding to convert all of the data to the same format.
ve
Step 7: Modeling or Mining
This is the most important step, which uses scripts and apps to extract pertinent
patterns to support goals.
ni

◌◌ Data transformation into models or patterns is covered.


◌◌ Verification is used to check the veracity of facts.
U

Step 8: Validating Models


This KDD process stage, also known as evaluation, includes validating found
patterns. The data scientist creates a few predictive models (or projections) to evaluate
whether they can deliver the desired outcomes. By employing classification and
ity

characterisation techniques or the best-fit method, validation demonstrates that the


models accurately fit the predicted models.

◌◌ Finding each model’s interestingness score or rating is a step in this approach.


m

◌◌ An overview of the Pan database.


◌◌ visualisation for simple comprehension and analysis.

Step 9: Knowledge Presentation


)A

As the name implies, you must show your discoveries in extensive datasets
throughout this stage of the data mining process. It can be made simpler by using
visualisation tools like Data Studio.

◌◌ Making reports
(c

◌◌ Establish classification, characterisation, and other types of discriminatory


norms.

Amity Directorate of Distance & Online Education


132 Data Warehousing and Mining

Step 10: Execution


Notes

e
Finally, numerous applications or machine learning are used with the driven
decisions or findings. If done correctly, it can be used to automate operations using

in
artificial intelligence (AI).

3.2.2 Data Mining Functionalities

nl
Six Basic Data Mining Activities
It’s critical to distinguish between the activities involved in data mining, the

O
processes by which they are carried out, and the methodologies that make use of these
activities to find business prospects. The most important techniques can essentially be
reduced to six tasks:

ty
Clustering and Segmentation
The process of clustering entails breaking up a huge collection of items into
smaller groups that share specific characteristics (Figure below). The distinction

si
between classification and clustering is that, in the clustering task, the classes are not
predefined. Instead, the decision or definition of that class is based on the evaluation of
the classes that occurs once clustering is complete.
r
ve
ni
U
ity
m
)A

Figure: Example of Clustering

When you need to separate data but are unsure exactly what you are looking
for, clustering can be helpful. To check if any relationships can be deduced through
the clustering process, you might, for instance, want to assess health data based on
(c

particular diseases together with other variables. Clustering can be used to identify a
business problem area that needs more investigation in conjunction with other data

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 133

mining operations. As an illustration, segmenting the market based on product sales


Notes

e
before examining why sales are low within a certain market segment.

The clustering method arranges data instances so that each group stands out from

in
the others and that its members are easily distinguishable from one another. Since
there are no established criteria for classification, the algorithms essentially “choose”
the features (or “variables”) that are used to gauge similarity. The records in the

nl
collection are arranged according to how similar they are. The results may need to be
interpreted by someone with knowledge of the business context to ascertain whether
the clustering has any particular meaning. In some cases, this may lead to the culling
out of variables that do not carry meaning or may not be relevant, in which case the

O
clustering can be repeated in the absence of the culled variables.

Classification

ty
People who categorise the world into two groups and those who do not are divided
into two groups. But truly, it’s human nature to classify objects into groups based on a
shared set of traits. For instance, we classify products into product classes, or divide

si
consumer groups by demographic and/or psychographic profiles (e.g., marketing to the
profitable 18 to 34-year-olds).

The results of identified dependent variables can be used for classification when
r
segmentation is complete. The process of classifying data into preset groups is called
ve
classification (Figure below). These classifications might either be based on the output
of a clustering model or could be described using the analyst’s chosen qualities. The
programme is given the class definitions and a training set of previously classified
objects during a classification process, and it then makes an effort to create a model
that can be used to correctly categorise new records. For instance, a classification
ni

model can be used to categorise customers into specific market sectors, assign meta-
tags to news articles depending on their content, and classify public firms into good,
medium, and poor investments.
U
ity
m
)A

Figure: Classification of records based on defined characteristics.
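A minimal sketch of training a classifier on previously classified records and applying
it to new ones, assuming scikit-learn; the features (annual income, outstanding debt)
and the class labels are invented for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented training set: [annual income (k), outstanding debt (k)] -> class label.
X = [[30, 20], [45, 35], [60, 10], [80, 5], [25, 30], [90, 2], [50, 25], [70, 8]]
y = ["poor", "poor", "good", "good", "poor", "good", "medium", "good"]

# Hold out part of the data so the model can be checked on records it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on held-out records:", model.score(X_test, y_test))

# Classify a brand-new record.
print("new applicant:", model.predict([[55, 12]])[0])
```

Holding out a test set is also the practical answer to the caution raised under the
Prediction task below: a biased training set simply reproduces its bias, so models
should always be checked on data they were not trained on.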

Estimation
(c

Estimation is the process of giving an object a continuously varying numerical


value. For instance, determining a person’s credit risk is not always a yes-or-no
decision; it is more likely to involve some form of scoring that determines a person’s

Amity Directorate of Distance & Online Education


134 Data Warehousing and Mining

propensity to default on a loan. In the categorization process, estimation can be


Notes

e
employed (for example, in a market segmentation process, to estimate a person’s
annual pay).

in
Because a value is being given to a continuous variable, one benefit of estimate
is that the resulting assignments can be sorted according to score. Consequently, a
ranking method may, for instance, assign a value to the variable “probability of buying a

nl
time-share vacation package” and then order the applicants according to that projected
score, making those individuals the most likely. Estimation is usually used to establish
a fair guess at an unknowable value or to infer some likelihood to execute an action.
Customer lifetime value is one instance where we tried to create a model that reflected

O
the future worth of the relationship with a customer.

Prediction

ty
Prediction is an attempt to categorise items based on some anticipated future
behaviour, which is a slight distinction from the preceding two jobs. Using historical data
where the classification is already known, classification and estimation can be utilised

si
to make predictions by creating a model (this is called training). Then, using fresh data,
the model can be used to forecast future behaviour.

When utilising training sets for prediction, you must be cautious. The data may
r
have an inherent bias, which could cause you to make inferences or conclusions that
ve
are pertinent to the bias. Utilize various data sets for testing, testing, and more testing!

Affinity Grouping
The technique of analysing associations or correlations between data items that
ni

show some sort of affinity between objects is known as affinity grouping. For instance,
affinity grouping could be used to assess whether customers of one product are likely
to be open to trying another. When doing marketing campaigns to cross-sell or up-sell
a consumer on more or better products, this type of analysis is helpful. The creation
U

of product packages that appeal to broad market segments can also be done in this
way. For instance, fast-food chains may choose particular product ingredients to go into
meals packed for a specific demographic (such as the “kid’s meal”) and aimed at the
ity

people who are most likely to buy those packages (e.g., children between the ages of 9
and 14).

Description
The final duty is description, which is attempting to characterise what has been
m

found or to provide an explanation for the outcomes of the data mining process. Another
step toward a successful intelligence programme that can locate knowledge, express
it, and then assess possible courses of action is the ability to characterise a behaviour
)A

or a business rule. In fact, we may claim that the metadata linked to that data collection
can include the description of newly acquired knowledge.

3.3 Data Mining Basics


(c

In general, “mining” refers to the process of extracting a valuable resource from the
earth, such as coal or diamonds. Knowledge mining from data, knowledge extraction,

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 135

data/pattern analysis, data archaeology, and data dredging are all terms used in
Notes

e
computer science to describe data mining. It is essentially the technique of extracting
valuable information from a large amount of data or data warehouses. It’s clear that the
term itself is a little perplexing. The product of the extraction process in coal or diamond

in
mining is coal or diamond. The result of the extraction process in Data Mining, however,
is not data!! Instead, the patterns and insights we get at the end of the extraction
process are what we call data mining outcomes. In that respect, Data Mining can be

nl
considered a step in the Knowledge Discovery or Knowledge Extraction process.

The phrase “Knowledge Discovery in Databases” was coined by Gregory


Piatetsky-Shapiro in 1989. The term ‘data mining,’ on the other hand, grew increasingly

O
prevalent in the business and media circles. Data mining and knowledge discovery are
terms that are being used interchangeably.

ty
3.3.1 Objectives of Data Mining and the Business Context for Data
Mining
Data mining is now employed practically everywhere where enormous amounts of

si
data are stored and processed. Banks, for example, frequently employ ‘data mining’ to
identify potential consumers who could be interested in credit cards, personal loans,
or insurance. Because banks have transaction details and complete profiles on their
r
clients, they examine this data and look for patterns that can help them forecast which
customers might be interested in personal loans or other financial products.
ve
Companies employ data mining as a method to transform unstructured data into
information that is useful. Businesses can learn more about their customers to create
more successful marketing campaigns, boost sales, and cut expenses by employing
ni

software to seek for patterns in massive volumes of data. Effective data collection,
warehousing, and computer processing are prerequisites for data mining.

How Data Mining Works


U

Data mining is the process of examining and analysing huge chunks of data to
discover significant patterns and trends. Numerous applications exist for it, including
database marketing, credit risk management, fraud detection, spam email screening,
ity

and even user sentiment analysis.

There are five steps in the data mining process. Data is first gathered by
organisations and loaded into data warehouses. The data is then kept and managed,
either on internal servers or on the cloud. The data is accessed by business analysts,
m

management groups, and information technology specialists, who then decide how
to organise it. The data is next sorted by application software according to the user’s
findings, and ultimately the end-user presents the data in a manner that is simple to
)A

communicate, such a graph or table.

Data Warehousing and Mining Software


Based on user requests, data mining tools examine relationships and patterns
in data. A business might employ data mining software to produce information
(c

classifications, for instance. As an example, consider a restaurant that wishes to use


data mining to figure out when to run specific specials. It examines the data it has

Amity Directorate of Distance & Online Education


136 Data Warehousing and Mining

gathered and establishes classes according to the frequency of client visits and the
Notes

e
items they purchase.

Other times, data miners hunt for information clusters based on logical

in
connections, or they analyse associations and sequential patterns to infer trends in
customer behaviour.

Data mining includes warehousing as a crucial component. Companies who

nl
warehouse their data into a single database or application. An organisation can
isolate specific data segments for analysis and use by particular users using a data
warehouse. In other instances, analysts could start with the data they need and build a

O
data warehouse from scratch using those specifications.

Main Purpose of Data Mining

ty
r si
ve
ni
U

In general, data mining has been combined with a variety of other techniques from
other domains, such as statistics, machine learning, pattern recognition, database
ity

and data warehouse systems, information retrieval, visualisation, and so on, to gather
more information about the data and to help predict hidden patterns, future trends, and
behaviours, allowing businesses to make decisions.

Data mining, in technical terms, is the computer process of examining data from
m

many viewpoints, dimensions, and angles, and categorizing/summarizing it into useful


information.

Data Mining can be used on any sort of data, including data from Data
)A

Warehouses, Transactional Databases, Relational Databases, Multimedia Databases,


Spatial Databases, Time-series Databases, and the World Wide Web.

Data Mining as a Whole Process


(c

The data mining process can be broken down into these four primary stages:

Data gathering: Data that is pertinent to an analytics application is gathered and

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 137

identified. A data lake or data warehouse, which are becoming more and more popular
Notes

e
repositories in big data contexts and include a mixture of structured and unstructured
data, may be where the data is located. Another option is to use other data sources.
Wherever the data originates, a data scientist frequently transfers it to a data lake for

in
the process’ remaining steps.

Data preparation: A series of procedures are included in this stage to prepare the

nl
data for mining. Data exploration, profiling, and pre-processing come first, then work
to clean up mistakes and other problems with data quality. Unless a data scientist is
attempting to evaluate unfiltered raw data for a specific application, data transformation
is also done to make data sets consistent.

O
Mining the data: A data scientist selects the best data mining technique after the
data is ready, and then uses one or more algorithms to perform the mining. Before
being applied to the entire set of data in machine learning applications, the algorithms

ty
are often trained on sample data sets to look for the desired information.

Data analysis and interpretation: Analytical models are developed using the data
mining findings to guide decision-making and other business activities. Additionally, the

si
data scientist or another member of the data science team must convey the results to
users and business executives, frequently by using data storytelling approaches and
data visualisation.
r
ve
Applications of Data Mining
●● Financial Analysis
●● Biological Analysis
ni

●● Scientific Analysis
●● Intrusion Detection
●● Fraud Detection
U

●● Research Analysis

Real-life examples of Data Mining


ity

Market Basket Analysis (MBA) is a technique for analysing the purchases made
by a client in a supermarket. The idea is to use the concept to identify the things that a
buyer buys together. What are the chances that if a person buys bread, he or she will
also buy butter? This study aids in the promotion of company offers and discounts. Data
m

mining is used to do the same thing.

Protein Folding is a technology that examines biological cells in detail and


predicts protein connections and functions within them. This research could lead to the
)A

discovery of causes and potential therapies for Alzheimer’s, Parkinson’s, and cancer
diseases caused by protein misfolding.

Fraud Detection: In today’s world of cell phones, we can utilise data mining to
compare suspicious phone activity by analysing cell phone activities. This may aid in
the detection of cloned phone calls. Similarly, when it comes to credit cards, comparing
(c

recent transactions to previous purchases can help uncover fraudulent behaviour.

Amity Directorate of Distance & Online Education


138 Data Warehousing and Mining

Business intelligence, Web search, bioinformatics, health informatics, finance,


Notes

e
digital libraries, and digital governments are just a few of the successful applications of
data mining.

in
Data Mining Techniques
Algorithms and a variety of techniques are used in data mining to transform
massive data sets into useable output. The most often used kinds of data mining

nl
methods are as follows:

●● Market basket analysis and association rules both look for connections between

O
different variables. As it attempts to connect different bits of data, this relationship
in and of itself adds value to the data collection. For instance, association rules
would look up a business’s sales data to see which products were most frequently
bought together; with this knowledge, businesses may plan, advertise, and

ty
anticipate appropriately.
●● To assign classes to items, classification is used. These categories describe
the qualities of the things or show what the data points have in common. The

si
underlying data can be more precisely categorised and summed up across related
attributes or product lines thanks to this data mining technique.
●● Clustering and categorization go hand in hand. Clustering, on the other hand,
r
found similarities between objects before classifying them according to how they
ve
differ from one another. While clustering might reveal groupings like “dental health”
and “hair care,” categorization can produce groups like “shampoo,” “conditioner,”
“soap,” and “toothpaste.”
●● Decision trees are employed to categorise or forecast a result based on a
ni

predetermined set of standards or choices. A cascading series of questions that


rank the dataset based on responses are asked for input using a decision tree.
A decision tree allows for particular direction and user input when digging deeper
U

into the data and is occasionally represented visually as a tree.


●● An method called K-Nearest Neighbor (KNN) classifies data according to how
closely it is related to other data. KNN is based on the idea that data points near
to one another have a higher degree of similarity than other types of data. This
ity

supervised, non-parametric method forecasts group characteristics from a set of


individual data points.
●● The nodes of neural networks are used to process data. These nodes have an
output, weights, and inputs. Through supervised learning, data is mapped (similar
m

to how the human brain is interconnected). This model can be fitted to provide
threshold values that show how accurate a model is.
●● In order to forecast future results, predictive analysis aims to use historical data
)A

to create graphical or mathematical models. This data mining technique, which


overlaps with regression analysis, seeks to support an unknown figure in the future
based on already available data.
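Following up on the market basket analysis and association rules item in the list
above, the sketch below computes support and confidence for item pairs using plain
Python; the baskets are invented and the thresholds are arbitrary.

```python
from itertools import combinations
from collections import Counter

# Invented point-of-sale baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.most_common(3):
    support = count / n                      # how often the pair occurs at all
    confidence = count / item_counts[a]      # P(b in basket | a in basket)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```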

Applications of Data Mining


(c

Data mining appears to be useful in practically every department, industry, sector,

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 139

and business in the digital era. As long as there is a set of data to analyse, data mining
Notes

e
is a broad process with a variety of applications.

Sales

in
A company’s primary objective is to maximise profits, and data mining promotes
more intelligent and effective capital allocation to boost sales. Think about the cashier
at your preferred neighbourhood coffee shop. The coffee shop records the time of each

nl
purchase, the products that were purchased at the same time, and the most popular
baked goods. The store can strategically design its product range using this knowledge.

O
Marketing
It’s time to put the modifications into effect once the coffee shop mentioned above
determines its optimum lineup. The store can utilise data mining to better identify where

ty
its customers view ads, which demographics to target, where to place digital ads, and
what marketing tactics resonate with them in order to increase the effectiveness of its
marketing campaigns. This entails adapting marketing strategies, advertising offerings,
cross-sell opportunities, and programmes to data mining discoveries.

si
Manufacturing

r
Data mining is essential for organisations that manufacture their own items in
determining the cost of each raw material, which materials are used most effectively,
ve
how much time is spent throughout the manufacturing process, and which bottlenecks
have a negative impact on the process. The continual and least expensive flow of
commodities is ensured with the use of data mining.
ni

Fraud Detection
Finding patterns, trends, and correlations between data points is at the core of
data mining. Data mining can therefore be used by a business to find anomalies or
U

relationships that shouldn’t exist. For instance, a business might examine its cash flow
and discover a recurring transaction to an unidentified account. If this is unexpected,
the business might want to look into it in case money was possibly mishandled.
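As a hedged illustration of spotting a transaction that "shouldn't exist", the sketch
below flags amounts that deviate strongly from the historical mean using a simple
z-score. The figures are invented, and real fraud detection systems use far richer
features and models than this.

```python
import statistics

# Invented history of monthly payments to a supplier, plus one suspicious outlier.
payments = [1020, 990, 1005, 1010, 980, 995, 1000, 4750]

mean = statistics.mean(payments)
stdev = statistics.stdev(payments)

for amount in payments:
    z = (amount - mean) / stdev
    if abs(z) > 2:                      # more than two standard deviations away
        print(f"flag for review: {amount} (z-score {z:.1f})")
```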
ity

Human Resources
A variety of data, including information on retention, promotions, salary ranges,
business perks and usage of those benefits, and employee satisfaction surveys, are
frequently available for processing in human resources. This data can be correlated
m

through data mining to better understand why employees depart and what draws new
hires in.
)A

Customer Service
Numerous factors can either create or undermine customer satisfaction. Consider
a business that ships things. Customers may become dissatisfied with communication
over shipment expectations, shipping quality, or delivery delay. The same customer can
grow impatient with lengthy hold times on the phone or sluggish email replies. Data
(c

mining analyses operational information about client interactions, summarises findings,


and identifies the company’s strong points and areas for improvement.

Amity Directorate of Distance & Online Education


140 Data Warehousing and Mining

Why is data mining crucial for firms, then? Businesses that use data mining can
Notes

e
gain a competitive edge, have a deeper understanding of their customers, have more
control over daily operations, increase customer acquisition rates, and discover new
business prospects.

in
Data analytics will aid many industries in different ways. The best approaches to
attract new clients are sought after by some industries, while fresh marketing strategies

nl
and system upgrades are sought after by others. Businesses have the ability and
understanding to make decisions, examine their data, and proceed thanks to the data
mining process.

O
Data mining models have only recently been used in marketing. Data mining
is a “new ground” for many marketers who rely solely on their “intuition” and domain
knowledge, despite its rapid expansion. Their marketing campaign lists and
segmentation plans are developed using business criteria based on their understanding

ty
of the industry.

Data mining models are not “dangerous” since they cannot take the place of
domain experts and their valuable business expertise. Despite their strength, these

si
models are ineffective without the active participation of business professionals. On
the contrary, they can only produce outcomes that are truly significant when business
experience is added to data mining capabilities. For instance, incorporating meaningful
r
inputs with predictive potential proposed by experts in the industry can significantly
ve
improve the predictive capability of a data mining model.

Additionally, information from current business rules and scores can be included
into a data mining model to help create a more reliable and effective outcome.
Additionally, in order to reduce the danger of producing irrelevant or ambiguous
ni

outcomes, model results should always be verified by business specialists with regard
to their relevance prior to the actual deployment. Therefore, business domain expertise
can significantly improve and enhance data mining findings.
U

However, data mining programmeshave the ability to spot trends that even
the most seasoned businesspeople would have overlooked. They can assist
in fine-tuning the current business standards as well as enrich, automate, and
ity

standardisejudgemental working methods that are based on subjective perceptions


and viewpoints. They consist of a data-driven, objective methodology that reduces
subjective judgement and streamlines laborious procedures.

3.3.2 Data Mining Process Improvement



Data’s importance in production has long been downplayed or ignored. The


new uses for data and data analytics have changed how businesses approach
quality improvement. The field specialists indicate a significant change away from
a reliance solely on retrospective analysis and post-manufacturing inspection work


toward the prediction and early identification of problem areas and maintenance
requirements. Traditional product checks are being elevated by new data sources, like
call centre discussions and sensors. These technologies are gradually enhancing the
manufacturing sector by changing how quality and safety are managed in asset-based
businesses.


Technology is transformed by data, and this is just the start of significant changes.
Numerous technological innovations, like real-time data from GPS and connected
vehicle sensors, text extracted from warranty reports, and voice-to-text translations
in call centres, to mention a few, contributed to the quality and safety revolution in
businesses. However, the data has been consolidated into a repository that enables
analysis across many data formats. This is the precise situation in which machine

learning techniques are useful. They are tasked with seeing patterns in the data and
making forecasts.

Businesses utilise data mining to make inferences and find solutions to certain

issues. The fact that data mining is essentially relevant to every process and increases
the adaptability and effectiveness of operations is one of its main advantages. As
a result, using data in manufacturing helps with waste reduction, capacity modelling,
monitoring automation, and schedule adherence. By establishing comprehensive data

transparency, departments are entirely transformed, and factories become smarter.

How manufacturing businesses take advantage of data mining

Process mining is now being used by ABB, a significant worldwide manufacturer,
for its procurement to payment and production procedures. In the past, staff from the
ABB facility in Hanau, Germany, would transfer evaluations from their SAP systems

into Excel, evaluate them using complicated formulas, and comprehend operations.
Each morning at ABB, an email detailing the production variants, throughput times, and
rejection rates from the previous day is sent to the relevant production and assembly
team leaders. Process mining makes the entire ecosystem of quality improvement
procedures at the plant instantly visible. As more data is sent into the algorithm, pattern
recognition only improves. Operational processes deliver immediate results as opposed
to relying on intricate manual examination of processes.

The industry that produces vehicles has also undergone significant development.
The products in this market are fairly pricey, with high-end producers emphasising
customer service and product quality. They point out that the commercial advantages
associated with the adoption of data-driven innovations have the potential to
accelerate the detection and correction of quality issues as well as reduce warranty
spending, which represents between 2-6% of the total sales in the automotive sector.
Early detection and preventive maintenance frequently lead to higher uptime for the
customers and users of these vehicles and equipment. For instance, in one incident
involving an automaker, the discovery of a fault prior to the release of the product
prevented the recall of 28,000 automobiles.

The majority of marketers have come to the realisation that gathering client data
and extracting useful information from it is crucial for business growth. Business
organisations are frantically seeking expertise and insightful information from these
customer data in order to stay competitive globally and boost their bottom line.

The marketing executives want to learn different kinds of information from the data
they get, but it isn’t always possible because operational computer systems can’t give
them information on daily transactions. Marketers can position their goods in response
to the demands of their specific clientele. Finding the specific clients that are truly
interested from a vast customer database is very challenging. Data mining methods and
technology now play a part.


Businesses can sift through layers of data that at first glance appear to be
unrelated in search of important links where they can predict, rather than just respond
to, client wants with the aid of data mining techniques and tools. Data mining is just
one phase in a much longer process that occurs between a business and its clients.

Data mining must be pertinent to the underlying business process in order to have an
impact on a firm. The business process, not the data mining method, determines how
data mining affects a business.

Knowledge discovery is a process; it has stages, a context, and operates under
presumptions and limitations. The KDD process is depicted in the figure below along
with typical data sets, tasks, and stages. Its main purpose is to serve as a database

with a lot of data that can be searched for potential patterns. A representative target
data set is used for KDD, and it is constructed from the big database. In most real-world
scenarios, both the database and the target data set contain noise, which includes

incorrect, imprecise, contradictory, exceptional, missing, and null values.
One obtains the collection of preprocessed data by removing this noise from the target
data set. The transformed data set derived from the preprocessed data is then used directly for DM.
In general, DM produces a collection of patterns, some of which may represent newly

learned information.

Figure: The KDD process.

The data is cleansed and transformed through a series of stages
in the process. The target data set can be generated from the database using the
selection technique. Its main responsibility is to choose representative data from the
database in order to maximise the representativeness of the target data collection.
Preprocessing is the subsequent step, which produces specified data sequences in the
set of preprocessed data and removes noise from the target data set. Some DM jobs
require such sequences. The preprocessed data must now be transformed into a format
that will allow it to be used to carry out the intended DM task.

DM tasks are certain actions taken over a set of altered data in order to look for
patterns (useful information), directed by the kind of knowledge that should be found.
Depending on the DM task, many types of preprocessed data transformations may be


used. To map the original data onto a data space more suited for the DM job that will be
carried out in the following phase, the DM, additional alterations and combinations of
the remaining data record fields can be made.

Although not all of the patterns are helpful, a process is used in the DM phase to
extract patterns from altered data. Keeping only the patterns that are interesting and
beneficial to the user and discarding the rest is the aim of understanding and evaluating
all the patterns found. The information has been uncovered in those patterns that have
persisted.
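
The flow of these stages can be made concrete with a small sketch. The following Python fragment is only an illustrative outline built on an invented, in-memory "database" with made-up field names; it is not a full KDD implementation. It walks through selection, preprocessing, transformation, a simple mining step based on co-occurrence counting, and a final evaluation that keeps only the frequent patterns.

from itertools import combinations
from collections import Counter

# A tiny illustrative "database": each record is (customer_id, age, purchased_items).
# Records with missing or inconsistent values represent the "noise" mentioned above.
database = [
    (1, 34, ["bread", "milk"]),
    (2, None, ["milk", "butter"]),          # noisy: missing age
    (3, 41, ["bread", "milk", "butter"]),
    (4, 29, ["beer", "nappies"]),
    (5, -7, ["beer", "nappies", "crisps"]), # noisy: impossible age
    (6, 52, ["bread", "butter"]),
]

# 1. Selection: build a representative target data set (here, all records with an age).
target = [rec for rec in database if rec[1] is not None]

# 2. Preprocessing: remove noisy records (e.g., implausible ages).
preprocessed = [rec for rec in target if 0 < rec[1] < 120]

# 3. Transformation: keep only the fields needed for the mining task (the baskets).
transformed = [set(items) for _, _, items in preprocessed]

# 4. Data mining: count how often each pair of items occurs together.
pair_counts = Counter()
for basket in transformed:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# 5. Evaluation: keep only patterns that are frequent enough to be interesting.
min_support = 2
patterns = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(patterns)   # e.g. {('bread', 'milk'): 2, ('bread', 'butter'): 2}

In a real project each of these steps would operate on a far larger database and use more sophisticated cleaning rules and mining algorithms, but the sequence of selection, preprocessing, transformation, mining and evaluation remains the same.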


The manner that businesses are run has undergone a paradigm shift. The
business process has become more complex as a result of the shifting customer
behaviour patterns and the methods used by competitors. Shopkeepers today face
an increased number of customers, items, and competitors, as well as a decrease in

reaction time. This suggests that it is now much more difficult to comprehend one’s
customers. Customer connections are becoming more complex as a result of a variety
of factors coming together.

●● Client loyalty is a thing of the past due to the substantial decline in customer
attention span. A successful business must always emphasise the value it offers to
its clients. The amount of time between a new want and the moment when it must

be satisfied is likewise getting shorter. The customer will locate someone else who
will respond swiftly if one does not.
●● Everything is more expensive: printing, postage, and special discounts (if one

doesn’t provide the special discount, his rivals will).
●● Customers choose products that precisely fulfil their requirements over those that
only sort of do. This indicates that there are now much more things available and

more ways to purchase them.
●● The best clients make one’s rivals appear good. They’ll concentrate on a few
select, lucrative areas of your business and work to maintain the finest for
themselves.
It has become difficult to maintain customer relationships in such a dynamic
climate. The continuation of a customer’s business is no longer assured. Customers
that one has today may no longer exist tomorrow because the market does not wait
for anyone. It’s also more difficult now to interact with customers than it formerly was.
Customers and potential customers like dealing with businesses on their terms. In other
words, when deciding how to proceed, it is necessary to consider a variety of factors.
Companies must automate The Right Offer, To The Right Person, At The Right Time,
and Through The Right Channel in order to make it possible in an efficient manner.

Manage the many encounters with clients to make sure the business has given the
best offer. One must rank the most significant offers first and the least relevant ones last
among the many available. The appropriate individual makes the implication that not
all clients are created equal. It is necessary to transition one’s interactions with them
to highly targeted, individual-focused marketing initiatives. The fact that interactions
with clients now take place continuously makes the correct timing possible. In the past,
quarterly mailings were considered to be the most innovative kind of marketing. Finally,
using the appropriate channel allows for a range of consumer interactions (direct mail,
email, telemarketing, etc.). Here, it’s crucial to pick the media that will facilitate a certain
interaction the best.

The enormous amount of data collected about clients and everyday purchase
activities has caused business databases to grow rapidly. Because of this, the
strategies and technologies used in knowledge management and data mining (DM)
have become crucial for making marketing decisions. In order to support marketing
decisions, DM can be utilised to gather relevant data on hidden buying trends.
Additionally, DM can assist with market-wide analysis.


3.3.3 Data Mining in Marketing


Researching, creating, and making a product available to the public are all
part of the discipline of marketing. The idea of marketing has been around for a

while, yet it keeps evolving in response to customer wants and buying patterns. As a
result, marketing today is quite different from what it was a few decades ago, largely
because of the world economy’s rapid transformation and technological advancements,

which together allowed for the free and quick sharing of knowledge. As a result of
globalization’s reduction in the cost and complexity of operating abroad and the
resulting increase in competitiveness, MNCs were exposed to local markets.

The manner and rate of knowledge dissemination have significantly changed as
markets have become more deregulated. Every action we take, including taking a
lengthy walk in the woods, leaves behind a few trace amounts of electronic data. Data
is punched every time the Internet or a mobile device is used, and a massive industry

is right there suckling up all that data and utilising it to determine how to sell you
something.

Thanks to globalisation and technological advancements, every activity—from the

sale of toothpaste to the purchase of life insurance policies—generates some amount
of data, which, when properly analysed, can give an organisation a competitive edge
in the global market by revealing implicit patterns and blatant connections among vast
data sets. Data mining is one such method that may be used to analyse such a massive
amount of data.

Finding correlations or patterns between dozens of fields in sizable databases is


a process known as data mining. Data mining’s main objective is to draw information
from an existing data set and organise it in a way that is understandable to humans
for later use.

Data mining is the process of extracting patterns or models from observable data.
It is a non-trivial method for finding valid, innovative, possibly helpful, and ultimately
understood patterns in data.

To uncover the significant links between variables in historical data, to forecast


or analyse the data, or to find any relevant relationships within the data recorded,
statistical techniques and mathematical equations are utilised in the technology known
as “data mining.”

Intelligent techniques are used in data mining, a crucial procedure for extracting
patterns from data.

The examination and analysis of massive amounts of data in order to find


significant patterns and rules is known as data mining.

These days, massive amounts of data are being gathered. Every nine months, the
amount of data gathered is said to roughly treble. One of the most desirable features
of data mining is the ability to extract knowledge from large amounts of data. There is
typically a wide gap between the stored data and the knowledge that can be derived from
it. Because this transformation does not happen naturally, data mining is important.
Many marketers like this technology because it helps them understand their clients
better and make wise marketing choices.


The industry of “data mining,” as it is known, expands by 10% a year as data


production soars. Data mining information can help to raise return on investment (ROI),
enhance customer relationship management (CRM) and market analysis, lower the cost
of marketing campaigns, and make it easier to detect fraud and keep customers.

The definition of marketing is placing the ideal product in the ideal environment
at the ideal time and price. It is a process of organising and carrying out the creation,

pricing, promotion, and distribution of concepts, products, and services to generate
exchanges that meet both individual and organisational goals. The process of
determining what customers want and where they purchase requires a lot of effort.
Then, it must be determined how to make the item at a cost that is valuable to them and

coordinate everything at the crucial moment.

However, one incorrect factor might cause a catastrophe, such as promoting a car
with incredible fuel efficiency in a nation where fuel is relatively inexpensive, releasing

a textbook after the start of a new school year, or pricing an item too high or too low to
draw in the target market. The marketing mix is a useful place to start when thinking
through ideas for a product or service so as to avoid any blunders. It is a generic term

used to describe the various decisions businesses must make throughout the process
of bringing a good or service to market. The 4 Ps, which were first stated by E J
McCarthy in 1960, are one of the greatest ways to describe the marketing mix.

The 4Ps are: r


◌◌ Product (or Service)
◌◌ Price
◌◌ Place
◌◌ Promotion
Data mining is used in a variety of marketing-related applications. One of them
is referred to as market segmentation, and it identifies typical client behaviours.
Customers who appear to buy the same things at the same time are examined for
patterns. Customer churn is another application of data mining that aids in determining
the kinds of customers that are likely to stop using a product or service and go to a rival.
A business can also utilise data mining to identify the transactions that are most likely
to be fraudulent. A retail store, for instance, would be able to identify which products are
most frequently stolen using data mining so that appropriate security measures can be
adopted.

Even though direct mail marketing is an outdated strategy, businesses can achieve
excellent results by combining it with data mining. Data mining, for instance, can be
utilised to determine which clients will be responsive to a direct mail marketing plan.
Additionally, it establishes how effective interactive marketing is. There is a need to
identify clients who are more inclined to buy things online than in-person since some
customers are more likely to do so.

While many marketers use data mining to boost their profitability, it can also be
utilised to launch new businesses and sectors. Every industry built on data mining is
predicated on the automatic identification of both patterns and behaviours. For instance,
data mining uses automatic prediction to examine prior marketing tactics. Which one
was most effective? What made it work so well? Who were the clients who responded


to it the best? These questions can be answered via data mining, which aids in avoiding
errors committed in earlier marketing campaigns. Thus, data mining for marketing
aids businesses in gaining a competitive edge over rivals and surviving in a global
marketplace.

Implementation
Customer Identification: Target customer analysis and customer segmentation are

components for identifying customers. While customer segmentation entails the division
of an entire customer base into smaller customer groups or segments, consisting of
customers who are relatively similar within each specific segment, target customer

analysis entails seeking the profitable segments of customers through analysis of
customers’ underlying characteristics.

Customer Attraction: Organizations can focus their efforts and resources on luring

the target consumer segments once they have identified the possible client segments.
Direct marketing is one strategy for attracting customers. Direct marketing is a form of
advertising that encourages consumers to make purchases through multiple channels.

Distributions of coupons or direct mail are classic instances of direct marketing.

Customer Retention: The management of complaints, loyalty programmes, and


one-to-one marketing are all components of client retention. One-to-one marketing
describes targeted advertising strategies that are aided by tracking, spotting, and
forecasting shifts in consumer behaviour.

Customer Development: Customer lifetime value analysis, up- and cross-selling,


and market basket analysis are all components of customer development. Customer
lifetime value analysis is the estimation of the entire revenue a business can receive
from a single customer. Up/Cross selling describes marketing initiatives that increase
the amount of related or complimentary services a consumer uses across a company.
The goal of market basket analysis is to increase the volume and value of client
transactions by identifying patterns in consumer purchasing behaviour.

Clustering in Marketing
When clients haven’t yet been segmented for advertising purposes, clustering
may be used. The clusters may be investigated for characteristics based on which
ad campaigns may be targeted at the client base after performing a cluster analysis.
Following segmentation, product positioning, product repositioning, and product
creation may be done based on the traits of the clusters to better the product’s fit
with the intended customers. It is also possible to choose test markets using
cluster analysis. Additionally, customer relationship management (CRM) may make use of clustering.

Client clustering would track purchasing behaviour and develop strategic business
efforts using data from customer purchase transactions. High-profit, high-value, and
low-risk consumers are what businesses want to keep. This cluster often symbolises
the 10–20% of consumers that account for 50%–80% of a business’s profits. A business
would not want to lose these clients, hence retention is the segment’s strategic aim.
A low-profit, high-value, and low-risk customer category is likewise desirable, and


raising profitability for this group would seem to be the natural objective. The preferred


marketing strategies for this market category include cross-selling new products and
up-selling more of what customers already purchase.

When a marketer has very little information about the customer, pattern association

can be widely employed to forecast customer preferences. Tools for pattern association
would assist a marketer in predicting which product or advertisement the customer
may be interested in based solely on the customer’s current purchasing behaviour and

comparing it with the purchasing behaviour of similar customers (who bought similar
products), even when no information about the customer is available.

3.3.4 Data Mining in CRM
Customers are an organization’s most valuable resource. Without happy clients
who remain devoted and strengthen their relationship with the company, there cannot
be any business opportunities. Because of this, a business should develop and

implement a clear customer service strategy. Building, maintaining, and enhancing
devoted and enduring client connections is the goal of CRM (Customer Relationship
Management). Based on customer insight, CRM should be a customer-centric

approach. Its focus should be on providing consumers with “personalised” service by
recognising and appreciating their unique needs, preferences, and behaviours.

Let’s use the following real-world example of two apparel companies with
various selling strategies to clarify the goals and advantages of CRM. The first store’s
employees make an effort to sell everything to everyone. The second store’s staff
makes an effort to determine each customer’s needs and interests before making the
suitable recommendations. Which retailer’s reputation for dependability will ultimately
win over customers? The second option certainly seems more reliable for a long-term
connection because it seeks for customer happiness while taking into account the
unique needs of the consumer.

CRM has two main objectives:


◌◌ Customer retention through customer satisfaction.


◌◌ Customer development through customer insight.

It is clear how important the first goal is. Getting new customers is challenging,
especially in developed countries. Replacing current clients with new ones from the
competition is never easy. There is no such thing as an average customer, which is the
second CRM goal of customer development. The client base consists of a variety of
people, each with their own wants, tendencies, and potentials that should be managed
appropriately.

To track and effectively arrange inbound and outbound customer contacts,


including the management of marketing campaigns and call centres, a number of
CRM software solutions are available. By automating contacts and interactions with
the customers, these systems, also known as operational CRM systems, often support
front-line procedures in sales, marketing, and customer care. They keep track of
consumer information and contact histories. Additionally, they make sure that at every
customer “touch” point (interaction point), a consistent image of the customer’s
connection with the company is available.


These solutions should only be utilised as tools to complement the approach of


managing clients effectively, though. Organizations must use data analysis to develop
understanding of consumers, their needs, and their wants in order to use CRM
successfully and achieve the aforementioned goals. The analytical CRM comes into

play here. Analytical CRM focuses on evaluating customer data to effectively address
CRM goals and communicate with the appropriate customer. In order to evaluate the
worth of the clients, comprehend their behaviour, and predict it, data mining algorithms

are used. It involves examining data trends in order to derive information for improving
customer connections.

Data mining, for instance, can aid in customer retention since it makes it possible

to quickly identify important clients who are more likely to quit, giving time for tailored
retention programmes. By matching products with customers and improving the
targeting of product promotion activities, it can assist customer development.

Additionally, it may help to identify different consumer segments, supporting the creation
of new, tailored products and product offerings that better reflect the interests and
objectives of the target market.

To better manage all client contacts on a more informed and “personalised” basis,
the outcomes of the analytical CRM operations should be imported and incorporated
into the operational CRM front-line systems. The topic of this book is analytical CRM. Its

goal is to demonstrate how data mining techniques can be applied to the CRM framework,
with a particular emphasis on customer segmentation.
What can data mining do in CRM?

Data mining uses complex modelling approaches to analyse vast volumes of data
in order to derive knowledge and insight. Data are transformed into knowledge and
useful information.

The data that has to be examined may be extracted from numerous unstructured
data sources or may already be present in well-organized data marts and warehouses.

There are several steps in a data mining process. Prior to the use of a statistical or
machine learning technique and the creation of a suitable model, it frequently includes
substantial data management. Data mining tools are specialised software packages
that may support the entire data mining process.

The goal of data mining is to find patterns in data that may be used to understand
and anticipate behaviour. Data mining models are made up of a series of rules,
equations, or complicated “transfer functions.” According to their objectives, they can be
divided into two primary classes, typically described as predictive (supervised) and descriptive (unsupervised) models.

Data Mining in the CRM Framework



The below figure outlines the CRM-data mining structure that has been presented.
In data mining, the first step in solving any problem is to acquire an understanding of
the business goals and requirements associated with the problem area. An in-depth
analysis and careful management of customer connections and the dynamics between
those ties will assist to find, attract, and keep valuable consumers in the domain. The
subsequent step of data preparation, also known as preprocessing, assists in preparing


the data for the further stages of model construction and evaluation by performing
operations such as cleaning, attribute selection, and data transformation, amongst


others. The building of an effective model that satisfies the business needs is a major
stage in the CRM framework, and model construction is one of those major steps. The
behaviour of the customers can be more accurately predicted with the use of these
models. Evaluation and visualisation of the model are used to determine how effective

the model is at improving the performance of the organisation.


Figure: CRM–data mining framework.



Customer insight, which is essential for developing a successful CRM strategy, can
be obtained through data mining. Through data analysis, it can result in individualised
interactions with customers, increasing happiness and fostering profitable client
relationships. It can assist “individualised” and optimised customer management
throughout all stages of the customer lifecycle, from client acquisition and relationship
building to attrition prevention and client retention. Marketers aim to increase both their
market share and their consumer base. Simply put, they are in charge of acquiring,
retaining, and growing the clientele. As seen in the following Figure, data mining models
can assist with all of these tasks:
Figure: Data mining and customer lifecycle management

The following themes are more specifically included in the marketing activities that
can be aided by data mining.
Customer Segmentation
It is possible to create distinctive marketing strategies by segmenting the consumer
base into distinct, internally homogeneous groups based on their attributes. Based on
the specific criteria or attributes utilised for segmentation, there are numerous different
segmentation kinds.

Customers are categorised by behavioural and usage traits in behavioural


U

segmentation. Although business rules can be used to define behavioural segments,


this method has inherent drawbacks. Its objectivity is in doubt because it is reliant on
the subjective opinions of a business expert, and it can only handle a limited number of
ity

segmentation fields effectively. On the other hand, data mining can produce behavioural
segments that are driven by data. Clustering algorithms can examine behavioural
data, spot consumer natural groupings, and offer a solution based on seen patterns in
the data. If data mining models are constructed correctly, they can reveal groups with
unique profiles and attributes and produce rich segmentation schemes with commercial
m

significance and value.

Additionally, data mining can be utilised to create segmentation plans based on


the existing, anticipated, or forecasted value of the customers. These categories are
essential for prioritising customer service and marketing efforts based on the value of
each customer.
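
As a simple, hypothetical illustration of value-based segmentation, the following Python sketch ranks customers by annual revenue and assigns them to invented value tiers; in practice the value figures would come from a proper current or lifetime value model rather than from a hard-coded dictionary.

# Hypothetical customer revenues (customer_id -> annual revenue).
revenues = {"C01": 120, "C02": 4500, "C03": 800, "C04": 9800,
            "C05": 300, "C06": 2100, "C07": 6700, "C08": 150}

# Rank customers by revenue and split them into three value tiers
# (top 20% = "high value", next 30% = "medium value", rest = "low value").
ranked = sorted(revenues, key=revenues.get, reverse=True)
n = len(ranked)
high_cut = max(1, round(0.2 * n))
medium_cut = high_cut + max(1, round(0.3 * n))

segments = {}
for position, customer in enumerate(ranked):
    if position < high_cut:
        segments[customer] = "high value"
    elif position < medium_cut:
        segments[customer] = "medium value"
    else:
        segments[customer] = "low value"

for customer in ranked:
    print(customer, revenues[customer], segments[customer])

Segments produced in this way can then be used to prioritise service levels and marketing spend, as described above.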

Direct Marketing Campaigns



Direct marketing campaigns are used by marketers to reach out to customers


directly by mail, the Internet, e-mail, telemarketing (phone), and other direct channels


in an effort to reduce customer attrition, increase client acquisition, and encourage the
purchase of supplementary items. Acquisition campaigns are more explicitly designed
to lure potential clients away from other businesses. Cross-, deep-, and up-selling
strategies are used to persuade current consumers to buy related products,
more of the same product, or a different but more lucrative product. Finally, retention
campaigns are designed to keep loyal consumers from severing ties with the business.

nl
These efforts, although having the potential to be successful, can also squander
a significant amount of money and time by bombarding customers with unsolicited
messages. Marketing campaigns that are specifically aimed at a target audience can
be developed with the use of data mining and classification (propensity) models. They

O
look at client traits and identify the target customers’ profiles. Then, fresh instances with
comparable features are found, given a high propensity score, and added to the target
lists. To make the ensuing marketing campaigns as effective as possible, the following

ty
classification models are used:

Acquisition models: These can be used to identify potentially profitable potential


clients by locating “clones” of valuable present clients in external contact lists.

si
Cross-/deep-/up-selling models: These can show whether current clients have the
potential to make purchases.

r
Voluntary attrition or voluntary churn models: These locate clients who are more
likely to depart freely and identify early churn indications.
ve
When constructed appropriately, these models can help create campaign lists with
higher target customer densities and frequencies by identifying the best customers to
reach out to. They exceed predictions made using business principles and personal
ni

intuition as well as random selections. The lift in predictive modelling refers to the metric
that contrasts a model’s propensity to forecast with chance. It indicates how much more
effectively a classification data mining model outperforms a random selection. Figure
following, which contrasts the outcomes of a data mining churn model to random
U

selection, provides an illustration of the “lift” idea.


ity
m
)A

Figure: The increase in predictive ability resulting from the use of a data mining
churn model.
(c

In this fictitious example, 10% of the actual “churners” are included in the sample
that was chosen at random. However, a list of the same size produced by a data


mining algorithm is far more productive because it comprises around 60% of real
churners. Data mining outperformed randomness in terms of prediction six times over.
Although entirely speculative, these outcomes are not too far off from reality. In real-
world scenarios that were successfully handled by well-designed propensity models, lift

in
values higher than 4, 5, or even 6 are rather typical, demonstrating the potential for
improvement provided by data mining.
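
The lift figures quoted above can be reproduced with a small calculation. In the hypothetical Python sketch below, every customer carries a churn propensity score produced by some model together with an actual churn flag, and lift is simply the churn rate in the top-scored slice of the list divided by the overall churn rate. All numbers are invented for illustration.

import random

random.seed(1)

# Hypothetical scored customer base: (propensity score, actually churned?).
# Higher scores are meant to indicate a higher risk of churning.
customers = [(random.random(), random.random() < 0.10) for _ in range(10_000)]

def lift_at(customers, fraction):
    """Churn rate in the top `fraction` of scores divided by the overall churn rate."""
    ranked = sorted(customers, key=lambda c: c[0], reverse=True)
    top = ranked[: int(len(ranked) * fraction)]
    top_rate = sum(churned for _, churned in top) / len(top)
    overall_rate = sum(churned for _, churned in customers) / len(customers)
    return top_rate / overall_rate

# With random scores the lift stays close to 1 (no better than random selection);
# a useful churn model concentrates churners at the top and yields lifts of 4 to 6.
print(round(lift_at(customers, 0.10), 2))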

nl
The following diagram and explanation show the steps of direct marketing
campaigns:

1. Compiling and combining the required data from various data sources.

O
2. Customer segmentation and analysis into different customer groups.
3. Using propensity models to develop targeted marketing strategies that choose the
correct customers.

ty
4. Execution of the campaign by selecting the ideal channel, ideal moment, and ideal
offer for every campaign.
5. Evaluation of the campaign using test and control groups. The population is divided

si
into test and control groups for the evaluation, and the positive responses are
compared.
6.
r
Analysis of campaign results to help with targeting, timing, offers, products,
communication, and other aspects of the campaign for the following round.
ve
Data mining can play a significant role in all these stages, particularly in identifying
the right customers to be contacted.
ni
U
ity
m
)A
(c

Figure: The stages of direct marketing campaigns


Market Basket and Sequence Analysis


For example, association models and data mining can be used to find related
products that are frequently bought together. These models can be used to analyse

in
market baskets and identify sets of goods or services that can be marketed as a unit.

Sequence models can recognise sequences of occurrences by accounting for the


order of activities and purchases.

nl
The Next Best Activity Strategy and ‘‘Individualized’’ Customer Management
To improve customer management, data mining models should be developed and

O
implemented in an organization’s daily business activities. The construction of a next
best activity (NBA) strategy can benefit from the knowledge acquired by data mining.
More specifically, the establishment of “personalised” marketing goals can be made
possible by the customer understanding gleaned via data mining. The company can

ty
choose the following “individualised” strategy and make a more informed decision on
the next appropriate marketing activity for each client:

1. An offer to stop attrition, primarily for high-value, vulnerable clients.

si
2. an advertisement for the ideal add-on item and a targeted cross-, up-, or
deep-selling offer for clients with expansion potential.
3. r
placing use limits and limitations on clients with problematic payment histories
ve
and credit risk scores.
4. the creation of a new product or service that is specifically suited to the traits
of a targeted market niche, etc.
The following figure illustrates the key elements that should be considered when
ni

designing the NBA strategy. As follows:

◌◌ The profitability and value of the current and anticipated/estimated customers.


U

◌◌ Through data analysis and segmentation, the type of customer, distinctive


behavioural and demographic traits, and recognised needs and attitudes are
made known.
◌◌ The possibility for growth as indicated by pertinent cross-, up-, and deep-
ity

selling models and tendencies.


◌◌ A voluntary churn model’s estimation of the defection risk/churn propensity.
◌◌ The customer’s credit history and payment history.
m
)A
(c


in
nl
O
ty
r si
ve
Figure: The next best activity components.

3.3.5 Tools of Data Mining


ni

Although there are numerous ways for data mining, this section lists some of the
most popular ones and provides some instances of each technique’s use.
U

Market Basket Analysis


Typically, the first thing you do when you go to the store is grab a shopping cart.
You’ll pick up various things and add them to your shopping cart as you move up and
ity

down the aisles. While some of them might have been spontaneously chosen, the
majority of these might match a pre-made shopping list. Let’s assume that when you
pay at the register, the items in your (and every other shopper’s) cart are recorded
because the store wants to identify any patterns in the items that different customers
tend to purchase. The term for this is market basket analysis.
m

A procedure called market basket analysis looks for links between items that “go
together” in a business context. Despite its name, market basket analysis goes beyond
)A

the typical supermarket setting. Any group of items can be analysed using the market
basket method to find affinities that can be used in some way. Following are some
instances of how market basket analysis has been used:

Physical shelf arrangement: To encourage customers to browse the store to


locate what they’re searching for and possibly raise the likelihood of additional impulse
(c

purchases, it is also possible to physically separate items that are frequently bought
together in a store.


Up-Sell, Cross-Sell, and Bundling Opportunities: Businesses may utilise the affinity
grouping of several products as a sign that customers could be inclined to purchase
the grouped products simultaneously. This makes it possible to showcase products for
cross-selling or may imply that buyers could be likely to purchase additional products

in
when specific products are grouped together.

Customer Retention: A corporate representative may utilise market basket

nl
research to decide what incentives to provide when clients contact a company to end a
relationship in order to keep their business.
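
A minimal sketch of the counting behind such an analysis is shown below. The baskets are invented, and the simple affinity measure (how often two items co-occur, and how often a purchase of one is accompanied by the other) stands in for the fuller support and confidence measures discussed later in this section.

# Hypothetical shopping baskets recorded at the register.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def affinity(item_a, item_b):
    """How often two items are bought together, and how strongly a purchase of
    item_a suggests a purchase of item_b (a simple confidence measure)."""
    with_a = [b for b in baskets if item_a in b]
    together = [b for b in with_a if item_b in b]
    return len(together), len(together) / len(with_a) if with_a else 0.0

count, confidence = affinity("bread", "milk")
print(count, round(confidence, 2))   # bread buyers who also buy milk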

Memory-Based Reasoning

O
A typical sick visit to the doctor comprises the patient explaining a number of
symptoms, the doctor reviewing the patient’s medical history to match the symptoms to
known illnesses, making a diagnosis, and then suggesting a course of treatment. This

ty
is a practical illustration of memory-based reasoning (MBR), also known as case-based
reasoning or instance-based reasoning, which uses known circumstances to create a
model for analysis. The closest matches between new circumstances and the model

si
are then examined to help with classification or prediction judgments.

Memory-based reasoning is a technique for using a single data set to build a


model from which assumptions or predictions about recently introduced objects can
r
be derived. Two steps are taken to implement this. First, a training set of data and
ve
specified results are used to develop the model. In order to classify an entity or to
recommend actions depending on the value of the new data instance, the model is then
used for comparison with new data instances. The primary idea behind the method is to
compare the similarity of two items, both during training and afterwards while attempting
to match the new object to its nearest neighbour within the classified set.
ni

An MBR approach has two fundamental parts. The first is the similarity function,
often known as distance, which calculates how similar any two things are to one
U

another. The second option is the combination function, which combines the outcomes
from the neighbours’ set to make a choice.
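
These two parts can be made concrete with a small nearest-neighbour sketch in Python. The training cases, the Euclidean distance used as the similarity function and the majority vote used as the combination function are all illustrative choices rather than the only possible ones.

from collections import Counter
from math import dist

# Training set: (features, known outcome). The feature values are illustrative.
training = [
    ((1.0, 2.0), "low risk"),
    ((1.5, 1.8), "low risk"),
    ((6.0, 7.5), "high risk"),
    ((6.5, 8.0), "high risk"),
    ((5.8, 7.0), "high risk"),
]

def classify(new_case, k=3):
    # Similarity function: Euclidean distance between the new case and each known case.
    neighbours = sorted(training, key=lambda case: dist(case[0], new_case))[:k]
    # Combination function: majority vote over the k nearest neighbours' outcomes.
    votes = Counter(outcome for _, outcome in neighbours)
    return votes.most_common(1)[0][0]

print(classify((6.1, 7.2)))   # expected output: high risk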

Memory-based reasoning can be applied to classification, with the understanding


that an existing data set will serve as the foundation for identifying classes (perhaps by
ity

clustering), and that new objects will then be classified using the findings. The same
approach used in the matching process to get the closest match can also be used for
prediction. The outcome of the new object can be predicted using the behaviour that
results from that matched object.
m

Think about a database that records cancer symptoms, diagnoses, and therapies
as an illustration. After doing a clustering of the cases based on symptoms, MBR can
be used to identify the instances that are most similar to a recently discovered one in
)A

order to make an educated guess at the diagnosis and suggest a course of action.

Cluster Detection
One common data mining challenge is to split up a big set of heterogeneous
(c

objects into several smaller, more homogeneous groupings. Applications for automated
clustering are utilised to carry out this grouping. We need to understand a function that
calculates the distance between any two points based on an element’s characteristics,


just like in memory-based reasoning. When segmenting visitors to an e-commerce


website to learn more about the types of people who are using the site, clustering might
be effective.

in
To cluster, there are two methods. The first method works on the presumption
that the data already contains a particular number of clusters, and the objective is to
separate the data into those clusters. The K-Means clustering methodology is a widely

nl
used method that first identifies clusters and then iteratively applies the following steps:
Determine the precise centre of each cluster, calculate the distance between each
object and that precise middle, assign each object to the cluster that owns that precise
middle, and then redrew the boundaries of the clusters. Up until the cluster borders stop

O
shifting, this is repeated.

The alternative method, known as agglomerative clustering, begins with each


item in its own cluster and seeks to merge clusters iteratively using a similarity-based

ty
process, rather than assuming the presence of a set number of clusters. However,
in this scenario, we need a method to measure similarity between clusters (instead
of points within an n-dimensional space), which is a little more challenging. The

si
agglomerative method’s end outcome is the composition of all clusters into a single
cluster. However, each iteration’s history is documented, allowing the data analyst to
select the level of clustering that is most suitable for the specific business need at hand.

Link Analysis
r
ve
The process of finding and forming relationships between items in a data set, as
well as quantifying the weight attached to every such link, is known as link analysis.
Examples include examining links created when a connection is made at one phone
number to another, evaluating whether two people are related via a social network, or
ni

examining the frequency with which travellers with similar travel preferences choose
to board particular flights. In addition to creating a relationship between two entities,
this link can also be described by other variables or attributes. Examples of telephone
U

connectivity include the volume of calls, the length of calls, or the times when the calls
are placed.

For analytical applications that rely on graph theory for inference, link analysis is
ity

helpful. Finding groupings of people who are connected to one another is one example.
Are there groups of people connected to one another such that the connection between
any two individuals inside a group is as strong as the connection between any other
pair? By providing an answer, you can learn more about drug trafficking organisations
or a group of individuals with a lot of sway over one another.
m

Process optimization is another analytical field where link analysis is helpful. An


illustration would be assessing how pilots and aircraft (who are trained to fly particular
)A

types of aircraft) are distributed among the many routes that an airline takes. The goal
of lowering lag time and extra travel time required for a flight crew, as well as any
external regulations related with the amount of time a crew may be in the air, are the
driving forces behind the assignment of pilots to aircraft. Each trip represents a link
within a big graph.
(c

The evaluation of a person’s potential for spreading information on social


networking sites is a third application. One participant might not necessarily be directly

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 157

responsible for a sizable portion of product purchases, but several people in her
Notes

e
“sphere of influence” might.

Rule Induction Using Decision Trees

in
The process of locating business (or other types of) rules that are embedded in
data is a part of knowledge discovery. For this discovery procedure, rule induction
techniques are applied.

nl
Similar to the game “Twenty Questions,” many situations can be solved by
responding to a series of questions that gradually eliminate potential solutions. Using

O
decision trees is one method of discovering rules. A decision tree is a decision-support
model that organises the questions and potential answers, directs the analyst to the
right solution, and can be applied to both operational and classification procedures.

When a decision tree is complete, each node in the tree represents a question, and

ty
the decision of which path to follow from that node depends on the issue’s resolution.
One internal node of a binary decision tree, for instance, could inquire whether the
employee’s compensation exceeds $50,000. The left-hand road is taken if the response

si
is yes, and the right-hand path if the response is no.

An evaluation of the frequency and distribution of values across a set of variables


using decision tree analysis results in the construction of a decision model in the shape
r
of a tree. In this tree, each node represents a question, and each branch denotes a
ve
potential response by pointing to a different node at the following level. The set of
records that correspond to the responses along the way gets less at each stage of the
journey from the tree’s root to its leaves.

In essence, each node in the tree represents a collection of records that are
ni

consistent with the responses to the queries encountered on the way to that particular
node. Every path from the root node to any other node is different, and each of these
queries divides the view into two smaller halves. Every node in the tree serves as a
U

representation of a rule, and we may assess the number of records that adhere to a
rule as well as the size of that record set at any given node.

As the decision support process traverses the tree, the analyst utilises the model
ity

to seek a desired outcome; the traversal ends when it reaches a tree leaf. The “thinking
process” employed by decision tree models is transparent, and it is obvious to the
analyst how the model arrived at a specific conclusion.

Rule Induction Using Association Rules


m

The search for association rules is another method of rule induction. According
to association rules, certain relationships between traits are more common than
would be predicted if the attributes were independent. A relationship between sets of
)A

values that happens frequently enough to indicate an intriguing pattern is described


by an association rule. A rule often has the following syntax: “If X then Y,” where
X is a set of requirements for a set of variables that affect the values of set Y. One
illustration is the well-known data mining tale about how frequently people buy beer
(c

and diapers together. The apparent codependent variable(s) must have a degree of
confidence suggesting that of the times the X values exist, so do the Y values, and
the cooccurrence of the variable values must have support, which indicates that those

Amity Directorate of Distance & Online Education


158 Data Warehousing and Mining

values occur together with a fair frequency. An association rule basically says that,
Notes

e
with some degree of certainty and some amount of support, the values of one set of
characteristics determine the values of another set of attributes.

in
The “source attribute value set” and “target attribute value set” are the
specifications for an association rule. For obvious reasons, the target attribute set is
also referred to as the right-hand side and the source attribute set as the left-hand side.

nl
The probability that the association rule will apply is its confidence level. The confidence
of the rule “item1: Buys network hub” / “item2: Buys NIC” is 85%, for instance, if a client
purchases a network hub 85% of the time, she also purchases a network interface card
(NIC). The percentage of all records where the left- and right-side attributes have the

O
prescribed values is the support of a rule.

In this scenario, 6% of all records must have the values “item1: Buys network hub”
and “item2: Buys NIC” set in order for the rule to be supported.

ty
Neural Networks
A data mining model used for prediction is a neural network. The algorithms for

si
creating neural networks incorporate statistical artefacts of the training data to construct
a “black box” process that accepts a certain amount of inputs and provides some
predictable output. The neural network model is trained using data examples and
r
desired outputs. Neural network models are based on statistics and probability and,
ve
once trained, are excellent at solving prediction problems. They were initially designed
as a technique to simulate human reasoning. However, the training set’s information
is not transparently absorbed into the neural network as it grows. The neural network
model is excellent at making predictions, but it is unable to explain how it arrived at a
particular conclusion.
ni

In essence, a neural network is a collection of statistical operations that are done


to all of the inputs to a neuron to compute a single output value, which is subsequently
U

transmitted to other neurons in the network. The input values are ultimately connected
as the initial inputs, and the output(s) that result represent a choice made by the neural
network. For categorization, estimation, and prediction, this method works well.

There are instances when changing the way data is represented is necessary to
ity

obtain the kinds of values needed for accurate value calculation. For instance, when
looking for a discrete yes/no response, continuous value values may need to be
rounded to 0 or 1, and historical data supplied as dates may need to be converted
into elapsed days.Data mining is a collection of techniques for analysing data from a
m

variety of angles and views using algorithms, statical analysis, artificial intelligence, and
database systems.

Data mining techniques are used to uncover patterns, trends, and groups in large
)A

amounts of data and to translate that data into more comprehensive information.

It’s a framework, similar toRstudio or Tableau, that allows you to perform various
data mining analyses.

We may use your data collection to perform a variety of algorithms, such as


(c

clustering or classification, and visualise the results. It’s a framework that aids in the

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 159

comprehension of our data and the phenomena it depicts. A framework like this is a
Notes

e
data mining tool.

According to ReortLinker’s latest forecast, sales of data mining tools would exceed

in
$1 billion by 2023, up from $ 591 million in 2018.

The most extensively used data mining tools are as follows:

nl
Orange Data Mining:
The Orange software bundle is ideal for machine learning and data mining. It
is a software that facilitates visualisation and is based on components written in the

O
Python programming language and created at the bioinformatics laboratory at Ljubljana
University’s faculty of computer and information technology.

Orange’s components are referred to as “widgets” because it is a component-

ty
based software. Preprocessing and data visualisation are only a few of the widgets
available, as are algorithm evaluation and predictive modelling.

Widgets have a lot of useful features, such as:

si
◌◌ Displaying data table and allowing to select features
◌◌ Data reading
◌◌ r
Training predictors and comparison of learning algorithms
ve
◌◌ Data element visualization, etc.

SAS Data Mining:


The abbreviation for Statistical Analysis System is Statistical Analysis System.
ni

It is an SAS Institute tool for data management and analytics. SAS can mine, modify,
and manage data from a variety of sources, as well as do statistical analysis. For non-
technical users, it provides a graphical user interface (GUI).
U

SAS data miners enable users to analyse vast volumes of data and provide
accurate information for speedy decision-making. SAS has a distributed memory
processing architecture that is very scalable. It can be used for text mining, data mining,
and optimization.
ity

DataMelt Data Mining:


DataMelt is a computing and visualisation environment that provides an interactive
data analysis and visualisation structure. It was created with students, engineers, and
m

scientists in mind. DMelt is another name for it.

DMelt is a JAVA-based multi-platform utility. It will run on any operating system that
is JVM compatible (Java Virtual Machine). It is made up of science and math libraries.
)A

●● Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.

●● Mathematical libraries:
(c

Random number generation, algorithms, curve fitting, and other mathematical


tasks require the usage of mathematical libraries.

Amity Directorate of Distance & Online Education


160 Data Warehousing and Mining

DMelt can be used for large-scale data analysis, data mining, and statistical
Notes

e
analysis. It’s widely used in fields including natural sciences, finance, and engineering.

Rattle:

in
Ratte is a data mining tool with a graphical user interface. R, a statistical
programming language, was used to create it. By providing sophisticated data mining
features, Rattle exhibits R’s statical strength. While the user interface of rattling is well-

nl
designed, it also contains an incorporated log code tab that generates duplicate code
for all GUI operations.

O
The data collected by Rattle can be viewed and edited. Others can assess the
code, use it for a number of purposes, and enhance it without restriction thanks to
Rattle.

ty
Rapid Miner:
Rapid Miner, developed by the Rapid Miner corporation, is one of the most
extensively used predictive analysis programmes. The JAVA programming language

si
was used to construct it. Text mining, deep learning, machine learning, and predictive
analytic environments are all included.

The instrument can be used for a variety of purposes, including business


r
applications, commercial applications, research, education, training, application
ve
development, and machine learning.

Rapid Miner’s server can be hosted on-premises or in a public or private cloud. It’s
built on a client/server architecture. A rapid miner uses template-based frameworks to
deliver data quickly and with little errors (which are commonly expected in the manual
ni

coding writing process).




Case Study
A Knowledge Discovery Case Study for the Intelligent Workplace

Introduction
The Robert L. Preger Intelligent Workplace (IW) serves as a “living laboratory”
for testing innovative construction methods and materials in a real-world setting for

research and instruction. On the campus of Carnegie Mellon University, the IW may be
found on the top floor of Margaret Morrison Carnegie Hall. The IW has been equipped
with a collection of sensors that measure HVAC performance, ambient conditions, and

energy usage in order to assess the effectiveness of its revolutionary HVAC system.
Although the Intelligent Workplace researchers have daily access to the sensor records,
they would like to develop automated methods to condense the data into a manageable
form that illustrates trends in the energy performance. There are few choices for

automated analysis of big observational data sets like the IW sensor data; one of them
is a collection of methods known as “data mining.” In this case study, we use knowledge
discovery to apply data mining techniques to the IW data.

(Source: https://www.researchgate.net/publication/269159907_A_Knowledge_Discovery_Case_Study_for_the_Intelligent_Workplace)

Background
The use of data mining tools is merely one aspect of the situation. The non-trivial
process of finding true, original, possibly helpful, and ultimately intelligible patterns in
data is known as knowledge discovery in databases (KDD). The KDD method seeks to
assist in ensuring that any results generated are both accurate and reproducible.

A hierarchical representation of the KDD process is provided by the CRoss-


Industry Standard Process Model for Data Mining (CRISP-DM). Domain knowledge,
data understanding, data preparation, modelling, results evaluation, and results

deployment are its six high-level phases.

The six main categories that data mining techniques themselves fall under are
classification, regression, clustering, summarization, dependency modelling, and

deviation detection. Rough sets analysis is a classification method that was employed
in this case study. In general, mapping a data item into a preset class is done using
classification procedures. Each data point is treated as a “object” in rough sets analysis.
Based on the values of each object’s separate properties, the approach seeks to
differentiate them from one another. The analysis excludes any attributes that cannot

be used to differentiate between several objects. Decision rules are generated using the
remaining attributes. The analyst is then given these generalised principles in the form
of conjunctive if-then rules.

Intelligent Workplace Case Study


This case study’s main objective was to assess the IW’s performance in terms
of building operation. The two main goals of this case study were to test the HVAC

system’s control mechanisms and assess the impact of environmental conditions on


energy usage. For each rule, the rough sets algorithm determines two assessment
measures: accuracy and coverage. When predicting the output attribute, accuracy




indicates how reliable the predictor attributes are. Calculating coverage involves
determining how much variation in the output attribute the rule accounts for. Although
it turns out that there is typically a trade-off between accuracy and coverage, it is highly
desirable to have both high accuracy and coverage. In the IW case study, accuracy

and coverage were used to gauge each rule’s technical excellence. The IW researchers
themselves assessed each rule’s actual usefulness.
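To make these two measures concrete, the following minimal sketch (not part of the original study; the column names and values are illustrative assumptions) shows how the accuracy and coverage of a single if-then rule could be computed with pandas.

```python
import pandas as pd

# Illustrative sensor records (hypothetical values, not IW data)
df = pd.DataFrame({
    "oa_t": [55, 28, 62, 70, 12, 35, 65, 58],            # outside air temperature
    "hxl_hwst": [120, 140, 90, 90, 140, 130, 95, 110],   # hot water supply temperature
})

# Rule under test: IF oa_t >= 60 THEN hxl_hwst == 90
antecedent = df["oa_t"] >= 60
consequent = df["hxl_hwst"] == 90

# Accuracy: how often the consequent holds when the antecedent holds
accuracy = (antecedent & consequent).sum() / antecedent.sum()

# Coverage: how much of the output class the rule accounts for
coverage = (antecedent & consequent).sum() / consequent.sum()

print(f"accuracy={accuracy:.2f}, coverage={coverage:.2f}")
```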

In light of the list of requirements put forth by Fayyad et al. 2000, the KDD
approach was a fantastic solution for the IW data analysis challenge. Due to the volume
of data, the intricate relationships between the sensor systems, and the observational
nature of the data, there were no viable options for the analysis. Over a year’s worth of

data had already been gathered by the researchers, which was more than enough for
data mining. The analysis was able to overlook any incorrect or missing data despite
the fact that the data set had issues with missing values and noise. We gathered

attributes that were pertinent to the modelling goals, such as the weather station’s eight
separate environmental element monitors and the HVAC system’s instrumentation on
each part. Last but not least, the IW researchers were quite supportive of the project.

Data Understanding
Nearly all of the attributes in the IW data are numerical since all of them were
collected by sensors. The data collection also includes a few nominal attributes, such
as the condition of the valves. There are around 400 distinct attributes in total in the data
set. Approximately 600MB of flat text files are needed to store one year’s worth of data.

At intervals of five minutes, the weather station sensors measure the following
variables: temperature, relative humidity, solar radiation, barometric pressure, dew
point, wind speed, wind direction, and rainfall. The make-up air unit, the secondary

water mullion system (for heating and cooling), individual water mullions, the hot
water system, the chilled water system, and the secondary water mullion system all
contribute temperature, flow, and status data to the HVAC sensor system at 30-minute

intervals. Finally, every outlet in every electrical panel has a sensor system that
records the number of kilowatt-hours utilised every thirty minutes. To identify the broad
characteristics of each attribute, a series of summary statistical tests were performed on
the data for each of the sensor attributes.

Data Preparation
During the KDD process’s data preparation phase, a lot of problems need to
be solved. Data selection, data cleaning, data creation, data integration, and data

formatting are the smaller phases that make up the data preparation phase.

The first step of the data preparation phase in the CRISP-DM model is to choose
the data that will be modelled. First, we had to decide how much data each model

would require. We decided to create annual models because we wanted to represent


the variance in the processes that the data describes. It could have been better to
create several distinct models based on smaller, more homogeneous data sets if our
data mining goal had been prediction. A year’s worth of data, which is collected at thirty-

minute intervals, is roughly 17,000 recordings. This is a medium-sized data set by data
mining standards, so most algorithms ought to be able to handle it without any difficulty.




Attribute selection is also a part of data selection. Although most data mining
algorithms are built to handle enormous numbers of records smoothly, they deteriorate
dramatically with each additional feature added to the data set. It is best to
employ just pertinent attributes when computing. However, part of knowledge discovery

is seeing new and unexpected patterns in the data set; eliminating features might
stop these unexpected outcomes. The number of potential features in this case study
was enormous. However, given that our primary objective was to validate current domain
knowledge rather than uncover unique patterns, it made sense to consult the IW’s
domain experts to determine which elements each model needed.

Data cleansing comes next in the data preparation process. How to handle missing

values is the first problem with data cleansing. The IW data collection had numerous
intervals with missing data values, most of which were relatively long (e.g., the sensors
stopped recording for a weekend). One option was to attempt to forecast the missing

values, however most prediction techniques struggle when used to make predictions
based on data that has already been predicted or predictions made several intervals in
advance. Prediction would most likely do well in a different data set where the missing
values are dispersed in tiny groups. We ignored the missing values since there was

still enough data to accurately describe the underlying processes that the data set
measures even if the missing data only made up around 10% of the overall data.

Noise in the data, often known as poor data, is the second problem with data
cleansing. How to distinguish between variance that is necessary for creating a
complete model of a problem and noise that is actually noise is one of the more
challenging decisions to be made in the KDD process. Despite the well-known noise in
sensor data, we chose not to delete any values from the data set. Out-of-range values
were crucial since we were trying to characterise the data rather than make predictions,

therefore it was crucial to know what variables were related to ineffective control tactics
or anomalous energy use.

The construction of the data is the third step in the data preparation phase. The
process of building the data includes altering existing attributes and deriving new ones.
Rough sets analysis functions better with nominal than with numerical attributes
because it is a classification technique. Since almost all of the IW data properties are

numerical, we had to determine how to discretize (or divide into discrete ranges) the
data. There is no theory on the ideal method for discretizing an attribute; instead, most
analysts employ domain expertise or trial-and-error to do so. To select ranges that
would “make sense” to the IW researchers, we applied our domain knowledge.
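As a sketch of what such a discretization step might look like in practice (the bin edges and labels below are illustrative assumptions, not the ranges actually chosen by the IW researchers), a numeric sensor attribute can be binned into named ranges with pandas:

```python
import pandas as pd

# Hypothetical outside-air-temperature readings
oa_t = pd.Series([12.0, 34.5, 55.1, 61.0, 72.3, 28.9])

# Discretize into domain-motivated ranges; the bin edges are assumptions
bins = [-40, 0, 30, 60, 120]
labels = ["very-cold", "cold", "mild", "warm"]
oa_t_binned = pd.cut(oa_t, bins=bins, labels=labels)

print(oa_t_binned.value_counts())
```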

Integrating data from many sources is the following step in the data preparation
process. How to deal with the different sensor granularities—environmental data
were collected in five-minute intervals, while all other data were recorded in thirty—
was the first problem that surfaced. The choice was simple in this instance. Because

the environmental data were quite continuous, averaging them over intervals of thirty
minutes produced results that did not materially deviate from the original data. We
performed the statistical tests on the averaging of the environmental data to make sure
that there had been no modifications. Between the original and averaged data sets,
we did not detect any appreciable statistical differences. The time stamps that were

recorded by each sensor as it recorded a value made it simple to connect the data once
it was all at the same degree of granularity.
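A minimal sketch of that integration step, assuming the five-minute environmental readings carry a timestamp index (the column and variable names are hypothetical): pandas can average them up to thirty-minute intervals and join the result to the HVAC data on the shared timestamps.

```python
import pandas as pd
import numpy as np

# Hypothetical five-minute environmental readings
idx5 = pd.date_range("2023-01-01 00:00", periods=12, freq="5min")
env = pd.DataFrame({"solar_radiation": np.random.rand(12)}, index=idx5)

# Average up to thirty-minute granularity to match the HVAC sensors
env_30min = env.resample("30min").mean()

# Hypothetical thirty-minute HVAC readings, joined on the shared timestamps
idx30 = pd.date_range("2023-01-01 00:00", periods=2, freq="30min")
hvac = pd.DataFrame({"hxl_hwst": [120.0, 118.5]}, index=idx30)
combined = hvac.join(env_30min)

print(combined)
```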




The formatting of the data for the particular modelling tool is the last stage in the
data preparation procedure. This stage is one of the most iterative ones in the KDD
process, and the data will probably need to be reorganised almost every time a new
algorithm or tool (even one from the same set of techniques) is used. In this case study,

before beginning the processes of data comprehension and preparation, all of the
sensor files were structured as comma-separated value files. This first step made a lot
of the subsequent data manipulation simpler.

Data Modeling
Choosing a modelling technique is the first step in the KDD process’ data modelling

phase. The data set that was prepared in the previous phase is used to apply this
modelling technique. The technical soundness of the produced data model is next
evaluated. Multiple repetitions of this phase may be necessary, either by choosing a
different modelling technique completely or by changing model parameters within the

same modelling technique.

The selection of an appropriate data modelling approach is frequently challenging;

similarly to discretization, no dependable theories exist to guide users in making this
decision. Chapman et al. (2000), however, define classes of data mining techniques
that are frequently employed to address challenges of each type and describe a set of
six general data mining problem categories. A more detailed mapping of data mining
problem types to data mining techniques will be provided by the framework discussed
in Section 4. The IW case study unmistakably belongs to the category of data mining
tasks called “concept description,” which “aims at a comprehensible description of ideas
or classes.” Typically, rule induction and conceptual clustering techniques are used to
model these issues.

The technical criteria for this problem are satisfied via conceptual clustering and
rule induction. However, the majority of conceptual clustering algorithms frequently
provide data that is challenging for inexperienced analysts to interpret. If-then rules are

produced via rule induction algorithms, which are simple to comprehend. We believed
that a rule induction method would produce the best model, in terms of technical
correctness and usability, given that the IW researchers wanted to be able to
continue analysing their data on their own.

There are numerous varieties of rule induction methods. Machine learning
approaches have given rise to many of them, and these typically perform best when
given complete data. Given that the IW data is imperfect and rather noisy, rough sets
analysis was used instead; this method can assess the strength of the rules it creates
and performs well with sparse data.

We created two models for the IW researchers in this case study. These models,
together with instructions on how to construct more, will serve as templates for future
model building. The first model explains how a heat exchanger’s hot water supply
temperature (hxl-hwst) is impacted by the outside air temperature (oa-t). The second
model explains how the amount of energy used to heat water (btu-mw) is influenced by
the outside air temperature, solar radiation, and wind speed.

The initial model (oa-t ⇒ hxl-hwst) was created to test the efficiency of the heat
exchanger. The heat exchanger’s control rules are as follows: if oa-t ≤ 0 then hxl-hwst =
140; if 0 < oa-t < 60 then 90 < hxl-hwst < 140; and if oa-t ≥ 60 then hxl-hwst = 90. The
model discovered
that the heat exchanger occasionally did not operate in accordance with the control
criteria, most frequently when the temperature was between 30 and 60 degrees. There
was no discernible pattern to identify when the heat exchanger was not operating in

in
accordance with the control rules given only the temperature of the outside air. Further
investigation found a strong correlation between the heat exchanger’s activity and the
time of day. We created a new model (oa-t AND time =v hxl-hwst) to identify the hot

nl
water supply temperature using the time of day and the outside air temperature. The
new model demonstrated that from early in the morning to mid-afternoon, it was most
likely that the heat exchanger would run outside of the control rules.

O
The second model (oa-t AND sr AND ws = btu-mw) was created to describe how
various environmental factors impact the amount of energy used to heat water. Notably,
just as it is erroneous to assume that numerical correlation indicates causation, it is

typically incorrect to infer a causal relationship between the left and right hand sides of
a rule created through rough sets analysis. In this instance, we did presume a causal
relationship based on the topic expertise provided by the IW researchers.

si
We looked at the rules generated for this model and found that they did not
significantly differentiate between different values for energy consumption but
did differentiate between different values for solar radiation and wind speed. This

r
information should be discovered by the rough sets method, which will then remove sun
radiation and wind speed from the rules. The rough sets technique was unable to do
ve
so due to the volume of noise in this data set; as a result, the analyst should carefully
review the resultant rules to ensure that they make sense. To categorise energy usage,
a new model was created using simply the outside air temperature (oa-t =v btu-mw).
Five of the eight rules we produced had an accuracy value over 77%, and all of the
ni

rules had an accuracy value over 55%, making the new model considerably clearer.

Results Evaluation and Deployment


U

Results from the data modelling phase are assessed based on their technical
merit. The modelling outputs are assessed for utility, innovation, and understandability
in the final two stages of the KDD process, after which the outputs are put to use.
The case study’s findings and the method utilised to produce them primarily serve as
ity

a guide for the IW researchers’ future research. Teaching the IW researchers how to
use the data preparation and modelling tools was part of the case study’s deployment
phase. They will continue to utilise these tools to comprehend the patterns in their data,
enhance their control techniques, keep track of how well those tactics are working, and
m

keep an eye on how well their sensors are working.

Summary

●● Data warehousing is altering how individuals conduct business research and


make strategic decisions in every industry, from retail chain shops to financial
institutions, from industrial organisations to government agencies, from airlines to
utility corporations.
●● The definition of data mining has possibly as many iterations as there are

supporters and suppliers. Some experts include a wide variety of tools and




procedures in the definition, ranging from straightforward query protocols to


statistical analysis.
●● Data mining delivers information, much like all other decision support systems.

in
Please refer to the figure below, which illustrates how decision support has
evolved.
●● Building and using a data warehouse is known as data warehousing. Data from

nl
several heterogeneous sources is combined to create a data warehouse, which
facilitates analytical reporting, organised and/or ad hoc searches, and decision-
making.

O
●● The first industries to use data warehousing were telecommunications, finance,
and retail. Government deregulation in banking and telecoms was primarily to
blame for that. Because of the increased competitiveness, retail businesses have
shifted to data warehousing.

ty
●● Today, investment on data warehouses is still dominated by the banking and
telecoms sectors. Data warehousing accounts for up to 15% of these sectors’
technological budgets.

si
●● Data mining is a technique used with the Internet of Things (IoT) and artificial
intelligence (AI) to locate relevant, possibly useful, and intelligible information,

r
identify patterns, create knowledge graphs, find anomalies, and establish links in
massive data.
ve
●● By fusing several data mining techniques for usage in financial technology
and cryptocurrencies, the blockchain, data sciences, sentiment analysis, and
recommender systems, its application domain has grown.
ni

●● Machine learning and deep learning are undoubtedly two of the areas of artificial
intelligence that have received the most research in recent years. Due to the
development of deep learning, which has provided data mining with previously
U

unheard-of theoretical and application-based capabilities, there has been a


significant change over the past few decades.
●● The DM covers a wide range of academic disciplines, including artificial
intelligence, machine learning, databases, statistics, pattern identification in
ity

data, and data visualisation. Finding trends in the data and presenting crucial
information in a way that the average person can understand it are the main
objectives here.
●● Neural networks are utilised to complete several DM jobs. Since the human brain
m

serves as the computing model for these networks, neural networks should be
able to learn from experience and change in reaction to new information.
●● Regression is a learning process that converts a specific piece of data into a
)A

genuine variable that can be predicted. This is comparable to what individuals


actually do: after seeing a few examples, they can learn to extrapolate from them
in order to apply the information to unrelated issues or situations.
●● KDD is essentially a group of procedures that aid in the gathering of intelligence.
(c

Simply expressed, it makes it easy to make rational decisions that will help you
reach your goals.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 167

●● A significant amount of data is needed for the process of knowledge discovery,


Notes

e
and that data must first be in a reliable state in order to be used in data mining.
The ideal source of data for knowledge discovery is the aggregation of enterprise
data in a data warehouse that has been adequately vetted, cleaned, and

in
integrated.
●● Steps of Knowledge Discovery Database Process: 1) Data collection, 2)

nl
Preparing datasets, 3) Cleansing data,4) Data integration, 5) Data analysis, 6)
Data transformation, 7) Modeling or mining, 8) Validating models, 9) Knowledge
presentation, 10) Execution.

O
●● Estimation is the process of giving an object a continuously varying numerical
value. For instance, determining a person’s credit risk is not always a yes-or-
no decision; it is more likely to involve some form of scoring that determines a
person’s propensity to default on a loan.

ty
●● Prediction is an attempt to categorise items based on some anticipated future
behaviour, which is a slight distinction from the preceding two jobs. Using historical
data where the classification is already known, classification and estimation can

si
be utilised to make predictions by creating a model (this is called training). Then,
using fresh data, the model can be used to forecast future behaviour.
●● The technique of analysing associations or correlations between data items that
r
show some sort of affinity between objects is known as affinity grouping. For
ve
instance, affinity grouping could be used to assess whether customers of one
product are likely to be open to trying another.
●● The phrase “Knowledge Discovery in Databases” was coined by Gregory
Piatetsky-Shapiro in 1989. The term ‘data mining,’ on the other hand, grew
ni

increasingly prevalent in the business and media circles. Data mining and
knowledge discovery are terms that are being used interchangeably.
●● There are five steps in the data mining process. Data is first gathered by
U

organisations and loaded into data warehouses. The data is then kept and
managed, either on internal servers or on the cloud. The data is accessed by
business analysts, management groups, and information technology specialists,
who then decide how to organise it. The data is next sorted by application software
ity

according to the user’s findings, and ultimately the end-user presents the data in a
manner that is simple to communicate, such a graph or table.
●● Data mining, in technical terms, is the computer process of examining data from
many viewpoints, dimensions, and angles, and categorizing/summarizing it into
m

useful information.
●● Data Mining can be used on any sort of data, including data from Data
Warehouses, Transactional Databases, Relational Databases, Multimedia
)A

Databases, Spatial Databases, Time-series Databases, and the World Wide Web.
●● Applications of data mining: a) Sales, b) Marketing, c) Manufacturing, d) Fraud
detection, e) Human resources, f) Customer service.
●● Researching, creating, and making a product available to the public are all part of
(c

the discipline of marketing. The idea of marketing has been around for a while, yet
it keeps evolving in response to customer wants and buying patterns.

Amity Directorate of Distance & Online Education


168 Data Warehousing and Mining

●● Customers are an organization’s most valuable resource. Without happy clients


Notes

e
who remain devoted and strengthen their relationship with the company, there
cannot be any business opportunities. Because of this, a business should develop
and implement a clear customer service strategy.

in
●● CRM has two main objectives: a) Customer retention through customer
satisfaction, b) Customer development through customer insight.

nl
●● Direct marketing campaigns are used by marketers to reach out to customers
directly by mail, the Internet, e-mail, telemarketing (phone), and other direct
channels in an effort to reduce customer attrition, increase client acquisition, and

O
encourage the purchase of supplementary items.
●● Tools of data mining: a) Market based analysis, b) Memory based reasoning, c)
Cluster detection, d) Link analysis.

ty
●● A data mining model used for prediction is a neural network. The algorithms for
creating neural networks incorporate statistical artefacts of the training data to
construct a “black box” process that accepts a certain amount of inputs and
provides some predictable output. The neural network model is trained using data

si
examples and desired outputs.
●● Orange’s components are referred to as “widgets” because it is a component-

r
based software. Preprocessing and data visualisation are only a few of the widgets
available, as are algorithm evaluation and predictive modelling.
ve
●● DataMelt is a computing and visualisation environment that provides an interactive
data analysis and visualisation structure. It was created with students, engineers,
and scientists in mind. DMelt is another name for it. DMelt is a JAVA-based multi-
platform utility.
ni

Glossary
●● DW: Data Warehouse.
U

●● Data warehousing: Building and using a data warehouse is known as data


warehousing.
●● BLOBs: Binary Large Objects.
ity

●● UDFs: User-Defined Functions.


●● UDTs: User-Defined Types.
●● IoT: Internet of Things.
m

●● AI: Artificial Intelligence.


●● DM: Data Mining, refers to a collection of particular techniques and algorithms
)A

created specifically for the purpose of identifying patterns in unprocessed data.


●● Regression: Regression is learning a function that maps a data item to a real-
valued prediction variable.
●● Clustering: Clustering is the division of a data set into subsets (clusters).
(c

●● Association: Association rules determine implication rules for a subset of record


attributes.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 169

●● Summarization: Summarization involves methods for finding a compact description


Notes

e
for a subset.
●● Classification: Classification is learning a function that maps a data item into one

in
of several predefined classes.
●● GUI: Graphical User Interface.
●● ART: Adaptive Resonance Theory.

nl
●● KDD: Knowledge Discovery in Databases.
●● Estimation: Estimation is the process of giving an object a continuously varying

O
numerical value.
●● Prediction: Prediction is an attempt to categorise items based on some anticipated
future behaviour, which is a slight distinction from the preceding two jobs.

ty
●● Affinity grouping: The technique of analysing associations or correlations between
data items that show some sort of affinity between objects is known as affinity
grouping.

si
●● MBA: Market Basket Analysis (MBA) is a technique for analysing the purchases
made by a client in a supermarket.
●● KNN: K-Nearest Neighbor.
●● MNCs: Multi-National Companies.
r
ve
●● CRM: Customer Relationship Management.
●● SAS: Statistical Analysis System.
●● Rattle: Rattle is a data mining tool with a graphical user interface. R, a statistical
ni

programming language, was used to create it. By providing sophisticated data


mining features, Rattle exhibits R’s statical strength. While the user interface
of rattling is well-designed, it also contains an incorporated log code tab that
U

generates duplicate code for all GUI operations.

Check Your Understanding


1. In Data warehouse DSS stands for?
ity

a. Decision Support System


b. Decision Support Stream
c. Decision Study Stream
m

d. None of the mentioned


2. The data contained in the data warehouse is known as_ _ _ _.
)A

a. Relational data
b. Meta data
c. Operational data
d. None of the mentioned
(c

3. _ _ _ _ _is the specialized data warehouse database.

Amity Directorate of Distance & Online Education


170 Data Warehousing and Mining

a. AWS
Notes

e
b. Informix
c. Redbrick

in
d. Oracle
4. The test is used in an online transactional processing environment is known as_ _

nl
_ _test.
a. MICRO
b. Blackbox

O
c. Whitebox
d. ACID

ty
5. The method of incremental conceptual clustering is known as_ _ _ _.
a. COBWEB
b. STING

si
c. OLAP
d. None of the mentioned
6. r
Multidimensional database is also known as_ _ _ _.
ve
a. Extended DBMS
b. Extended RDBMS
c. RDBMS
ni

d. DBMS
7. __ _ _ _ is a data transformation process.
U

a. Projection
b. Selection
c. Filtering
ity

d. None of the mentioned


8. GUI stands for_ _ _ _.
a. Graphics Used Interface
m

b. Graphed Used Interface


c. Geographical User Interface
)A

d. Graphical User Interface


9. _ _ _ _is learning a function that maps a data item to a real-valued prediction variable
a. Regression
b. Clustering
(c

c. Association
d. Summarisation

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 171

10. _ _ _ _is the division of a data set into subsets.


Notes

e
a. Summarization
b. Clustering

in
c. Association
d. Regression

nl
11. __ _ _ _is learning a function that maps a data item into one of several predefined
classes.
a. Association

O
b. Populating
c. Classification

ty
d. Clustering
12. _ _ _. _is the process of giving an object a continuously varying numerical value.
a. Association

si
b. Regression
c. Classification
d. Estimation r
ve
13. _ _ _ _ _involves methods for finding a compact description for a subset.
a. Summarization
b. Estimation
ni

c. Clustering
d. Association
U

14. The term data mining was coined by_ _ _ in_ _ _ _year?
a. James Gosling, 1986
b. Gregory Piatesky-Shapiro, 1989
ity

c. Bill Inmon, 1987


d. GuidoVan Rossum, 1988
15. What does KDD stand for?
m

a. Knowledge Develop Database


b. Knowledge Divided in Databases
)A

c. Knowledge Discovery in Databases


d. None of the mentioned
16. Transient data is which of the following?
a. Data in which changes to the existing records cause the previous version of
(c

the records to be eliminated

Amity Directorate of Distance & Online Education


172 Data Warehousing and Mining

b. Data in which changes to the existing recordsdo not cause the previous
Notes

e
version of the records to be eliminated
c. Data that are never altered or deleted once they have been added

in
d. Data that are never deleted once they have been added
17. A multifield transformation does which of the following?

nl
a. Converts data from one field to multiple field
b. Converts data from multiple field to one field
c. Converts data from multiple field to multiple field

O
d. All of the above
18. The time horizon in data warehouse is usually_ _ _ _.

ty
a. 3-4 years
b. 5-6 years
c. 5-10 years

si
d. None of the above
19. Which of the following predicts the future trends&behaviour, allowing business
r
managers to make proactive, knowledge-driven decisions?
ve
a. Meta data
b. Data mining
c. Data marts
ni

d. Data warehouse
20. _ _ _ _ is the heart of the warehouse.
a. Data warehouse database servers
U

b. Data marts database servers


c. Relational database servers
ity

d. All of the above

Exercise
1. What do you mean by data warehousing?
m

2. What are the advancements to data mining?


3. What do you mean by data mining on databases?
)A

4. Define data mining functionalities.


5. What are the objectives of data mining?
6. Define the business context for data mining.
7. What do you mean by data mining process improvement?
(c

8. Define the role of data mining marketing and crm


9. Define the tools of data mining.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 173

Learning Activities
Notes

e
1. You are the data analyst on the project team building a data warehouse for an
insurance company. List the possible data sources from which you will bring the

in
data into your data warehouse. State your assumptions.

Check Your Understanding - Answers

nl
1 a 2 b
3 c 4 d
5 a 6 b

O
7 c 8 d
9 a 10 b

ty
11 c 12 d
13 a 14 b
15 c 16 a

si
17 d 18 c
19 b 20 a

Further Readings and Bibliography:


r
ve
1. Rajkumar Buyya, James Broberg & Andrzej Goscinski, 2011, Cloud
Computing: Principles and Paradigms, John Wiley & Sons.
2. Sarna, David E. Y., 2010, Implementing and Developing Cloud Computing
ni

Applications, Taylor & Francis.


3. Williams, Mark I., 2010, A Quick Start Guide to Cloud Computing: Moving Your
Business into the Cloud, Kogan Page Ltd.
U

4. Data Mining and Data Warehousing: Principles and Practical Techniques,


Prateek Bhatia, Cambridge University Press
5. https://www.researchgate.net/publication/319355511_Data_Mining_in_CRM
ity
m
)A
(c

Amity Directorate of Distance & Online Education


174 Data Warehousing and Mining

Module - IV: Data Mining Functionalities


Learning Objectives:

At the end of this topic, you will be able to understand:

●● Statistical Techniques

●● Data Mining Characterisation
●● Data Mining Discrimination

●● Mining Patterns
●● Mining Associations
●● Mining Correlations

●● Classification
●● Prediction

●● Cluster Analysis
●● Outlier Analysis

Introduction
We have seen a range of databases and information repositories that can be used
for data mining. Now let’s look at the types of data patterns that can be extracted.

The kind of patterns that will be found in data mining jobs are specified using
data mining functionalities. Data mining jobs can often be divided into two groups:

descriptive and predictive. The general characteristics of the data in the database are
described by descriptive mining tasks. To produce predictions, predictive mining jobs
perform inference on the most recent data.

Users might search for a variety of patterns simultaneously because they are
unsure of what kinds of patterns in their data would be intriguing. Therefore, it’s crucial
to have a data mining system that can extract a variety of patterns in order to satisfy
ity

various user expectations or applications. Data mining systems should also be able to
find patterns at different granularities (i.e., different levels of abstraction). Users should
be able to specify indications in data mining systems so that the search for intriguing
patterns is guided or narrowed down. Each pattern found typically has a level of
assurance or “trustworthiness” attached to it because some patterns might not hold for
m

all of the data in the database.

4.1 Data Preparation and Data Mining Techniques



It is designed for the efficient handling of significant amounts of data that are
typically multidimensional and potentially of many complex forms in statistical data
mining approaches.
(c

Numerous known statistical techniques exist for the study of data, particularly
quantitative data. Numerous scientific records, including those from experiments in

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 175

physics, engineering, manufacturing, psychology, and medicine, as well as data from


Notes

e
the social and economic sciences, have been subjected to these techniques.

The intersection of statistics and machine learning gives rise to the interdisciplinary

in
topic of data mining (DM) (artificial intelligence). It offers a technique that aids in the
analysis and comprehension of the data found in databases, and it has been used to
a wide range of industries or applications. In particular, the analogy between searching

nl
databases for useful information and extracting valuable minerals from mountains
is where the term “data mining” (DM) originates. The concept is that the data to be
analysed serves as the raw material, and we use a collection of learning algorithms to
act as sleuths, looking for gold nuggets of knowledge.

O
In order to offer a didactic viewpoint on the data analysis process of these
techniques, we present an applied vision of DM techniques. In order to find knowledge
models that reveal the patterns and regularities underlying the analysed data, we

ty
employ machine learning algorithms and statistical approaches, comparing and
analysing the findings. To put it another way, some authors have noted that DM is “the
analysis of (often large) observational datasets to find unsuspected relationships and to

si
summarise the data in novel ways that are both understandable and useful to the data
owner,” or, to put it another way, “the search for valuable information in large volumes
of data,” or “the discovery of interesting, unexpected or valuable structures in large

r
databases.” According to some authors, data mining (DM) is the “search and analysis of
enormous amounts of data to uncover relevant patterns and laws.”
ve
4.1.1 Statistical Techniques
The table below provides a classification of some DM approaches based on the
ni

type of data studied. In this regard, we outline the strategies that are possible based
on the types of predictor and output variables. Unsupervised learning models are used
when there is no output variable, however supervised learning models are used when
U

the output variable is continuous or categorical.


ity
m
)A

Table: Data mining techniques according to the nature of the data

Artificial Neural Networks


Artificial neural networks (ANNs) are data processing systems that borrow features
(c

from biological neural networks in their design and operation. ANN were created on the
following principles:

Amity Directorate of Distance & Online Education


176 Data Warehousing and Mining

In basic building blocks known as neurons, information processing takes place.


Notes

e
The connections between the neurons allow for the transmission of signals.

Each link (communication) has a corresponding weight.

in
To the total entry of connected neurons received (sum of entries weighted
according to the connection weights), each neuron applies an activation function (often

nl
non linear), resulting in an output value that will serve as the entry value that will be
broadcast to the rest of the network.

Parallel processing, distributed memory, and environmental adaptability are the

O
core traits of ANN.

The artificial neuron serves as the processing unit, collecting input from nearby
neurons and calculating an output value to be relayed to the rest of the neurons.

ty
We can find networks with continuous input and output data, networks with discrete
or binary input and output data, and networks with continuous input data and discrete
output data when it comes to the representation of input and output information.

si
Input nodes, output nodes, and intermediate nodes (hidden layer) are the three
primary types of nodes or layers that make up an ANN (Figure below). The starting
values of the data from each case must be obtained by the input nodes before they can
r
be sent to the network. As input is received, the output nodes compute the output value.
ve
ni
U
ity

Figure: Generic working of an artificial neuron and its output mathematical


representation
m

The activation function and the collection of nodes the ANN uses enable it to
readily express non-linear relationships, which are the most challenging to depict using
multivariate techniques.
)A

The step function, identity function, sigmoid or logistic function, and hyperbolic
tangent are the most used activation functions.

There are many different ANN models available. An ANN model is defined by
(c

its topology (number of neurons and hidden layers and their connections), learning
paradigm, and learning method.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 177

An ANN can be considered to have three benefits that make it particularly


Notes

e
appealing for managing data: adaptive learning through examples, robustness in
handling redundant and erroneous information, and significant parallelism.

in
The multilayer perceptron, which was popularised by Rumelhart et al., is the
technique most frequently employed in actual ANN implementations.

An input layer with each node or neuron corresponding to a predictor variable

nl
forms the foundation of a multilayer perceptron type of ANN. Each of the neurons that
make up the hidden layer are connected to these input neurons. In turn, the neurons in
one hidden layer are connected to the nodes in the first. One (binary prediction) or more

O
output neurons make up the output layer. Information is always sent from the input layer
to the output layer in this type of architecture.

The multilayer perceptron’s appeal is largely a result of its ability to serve as

ty
a universal function approximator. More technically, any function or continuous
relationship between a collection of input variables (discrete and/or continuous) and an
output variable can be learned by a “backpropagation” network with at least one hidden
layer and sufficient non-linear units (discrete or continuous). Multilayer perceptron

si
networks are versatile, adaptable, and non-linear tools as a result of this characteristic.
Rumelhart et al. provide a thorough explanation of the mathematical underpinnings of
the multilayer perceptron architecture’s backpropagation algorithm’s training stage and
functional stage. r
ve
The multilayer perceptron’s value stems from its capacity to recognise almost any
correlation between a set of input and output variables. However, methods derived
from classical statistics, such as linear discriminant analysis, lack the ability to compute
non-linear functions and, as a result, perform less well than multilayer perceptrons in
ni

classification tasks involving intricate non-linear interactions.

The output layer of a network used for classification typically includes as many
nodes as there are classes, and the node of the output layer with the greatest value
U

provides the network’s best guess as to the class for a given input. A node is frequently
present in the output layer in the special case of two classes, and the node value is
used to carry out the classification between the two classes by applying a cut off point.
ity

Decision Trees
Sequential data divisions called decision trees (DT) maximise the differences of
a dependent variable (response or output variable). They provide a clear definition for
groups whose characteristics are constant but whose dependent variable varies.
m

Nodes (input variables), branches (groups of input variable entries), and leaves or
leaf nodes make up DT (values of the output variable). The building of a DT is based
)A

on the “divide and conquer” principle: consecutive divides of the multivariable space
are carried out by a supervised learning algorithm in order to maximise the distance
between groups in each division (that is, carry out partitions that discriminate). When
all of a branch’s entries have the same value in the output variable (pure leaf node),
the division process is complete and the full model is produced (maximum specified).
(c

The importance of the input factors in the output categorization decreases as they move
deeper down the tree in the tree diagram (and the less generalisation they allow, due to
the decrease in the number of inputs in the descending branches).

Amity Directorate of Distance & Online Education


178 Data Warehousing and Mining

The tree can be pruned by removing the branches with few or hardly meaningful
Notes

e
entries in order to prevent overfitting the model. As a result, if we begin with the
whole model, the tree pruning will increase the model’s capacity for generalisation
(as measured by test data), but at the expense of decreasing the level of purity of its

in
leaves.

To create DT models, various learning techniques are available. The following

nl
elements are determined by the learning algorithm:

specific compatibility with the type of variables, including the input and output
variables’ characteristics.

O
Division criteria are used to measure the distance between groups in each division.

The number of branches that each node can be divided into may be constrained.

ty
Pre- and post-pruning pruning settings include: the minimal number of entries
per node or branch, the critical value of the division, and the performance difference
between the extended and reduced tree. While post-pruning applies the pruning
parameters to the entire tree, pre-pruning entails utilising halting criteria during the

si
tree’s formation.

The most popular algorithms are C4.5/C5.0, QUEST (Quick, Unbiased, Efficient

r
Statistical Tree), CHAID (Chi-Squared Automatic Interaction Detection), and
Classification and Regression Trees (CART).
ve
ni
U
ity

Table: Comparative between learning algorithms for decision trees

Breiman et al. (1984) created the CART algorithm, which creates binary decision
m

trees with each node precisely divided into two branches. This groups many categories
in one branch if the input variable is nominal and has more than two categories. Even if
the input variable is continuous or nominal, it still creates two branches, assigning a set
)A

of values to each one that is “less than or equal to” or “greater” than a specific value.
The model can accept nominal, ordinal, and continuous input data thanks to the CART
algorithm. The model’s output variable can also be nominal, ordinal, or continuous.

Initial versions of the CHAID algorithm (Kass, 1980) were only intended to handle
categorical variables. However, it is now able to handle continuous variables, nominal
(c

and ordinal categorical output data, and more. In order to establish clusters of similar
values (statistically homogeneous) with respect to the output variable and to keep all

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 179

the values that ultimately turn out to be heterogeneous (distinct), the tree construction
Notes

e
process is based on the calculation of the significance of a statistical contrast as a
criterion. Similar values are merged into one category and make up a portion of one
tree branch. The statistical test that is applied depends on how accurately the output

in
variable is measured. The F test is applied if the aforementioned variable is continuous.
The Chi-square test is utilised if the output variable is categorical.

nl
The CHAID algorithm differs from the CART algorithm in that it permits the partition
of each node into more than one branch, leading to the creation of significantly wider
trees than those produced by binary development techniques.

O
In the event that the result is nominal-categorical, the QUEST algorithm (Loh&
Shih, 1997) may be applied (allows the creation of classification trees). Calculating the
significance of a statistical contrast is the foundation of the tree construction process.
If an input variable is nominal categorical, the critical level of a Pearson Chi-square

ty
independence contrast between the input variable and the output variable is calculated
for each input variable. It employs the F test if the input variable is ordinal or continuous.

The only output variables accepted by the C5.0 algorithm (Quinlan, 1997) are

si
categorical ones. Input variables might be continuous or categorical. This algorithm
evolved from algorithm C4.5 (Quinlan, 1993), which was created by the same author
and has the ID3 version as its core (Quinlan, 1986). The ID3 algorithm bases its choice
of the best attribute on the idea of knowledge gain. r
ve
The descriptive nature of DT is one of its most notable benefits since it gives us
access to the rules that the model used to make its predictions, making it simple for
us to comprehend and analyse the judgments made by the model (an aspect that is
not taken into consideration in other machine learning techniques, such as ANN). In
ni

order to provide a clear, pleasant explanation of the outcomes, DT allows the graphic
representation of a set of rules pertaining to the choice that must be made in the
assignment of an output value for a specific entry.
U

The decision rules offered by a tree model, on the other hand, have a predictive
value (rather than just being descriptive) at the point at which their correctness is
evaluated from separate data (test data) to those used in the model’s construction
ity

(training data).
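A brief sketch of these ideas, assuming scikit-learn is available: its DecisionTreeClassifier builds a CART-style binary tree (not CHAID or C5.0), the max_depth setting plays the role of simple pre-pruning, and export_text prints the fitted tree as the kind of if-then decision rules discussed above. The dataset and parameters are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# CART-style binary splits; max_depth acts as a simple pre-pruning parameter
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree can be read as a set of conjunctive if-then decision rules
print(export_text(tree, feature_names=load_iris().feature_names))
```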

k-Nearest Neighbor
Humans rely on memories of previous encounters with situations similar to the
current one when confronted with them. The k-Nearest Neighbor (kNN) method
m

is founded on this. In other words, the k-NN method is founded on the idea of
similarity. Additionally, this method creates a classification method without assuming
anything about the structure of the function connecting the dependent variable to the
)A

independent variables. The goal is to dynamically find k training observations that are
similar to a fresh observation that needs to be classified. In this method, the observation
is classified explicitly in a class using k related (neighbouring) observations (see Figure
below).
(c




Figure: Graphical representation of k-NN classification

In more detail, k-NN searches the training data for observations that are

O
comparable to or around the observation that needs to be categorised based on
the values of the independent variables (attributes). Then it allocates a class to
the observation it wants to classify based on the classes of these neighbouring
observations, using the majority vote of the neighbours to decide the class. In other

ty
words, it determines which class the majority of the new case belongs to by counting
the number of cases in each class.

Despite the method’s “naive” appearance, it may compete with other, more

si
advanced categorization techniques. Therefore, k-NN is very adaptable in situations
where the linear model is stiff. This method’s performance for data of a given size
depends on k as well as the measurement used to identify the nearest observations.
r
As a result, when using the technique, we must examine the number of neighbours
ve
to be taken into consideration (k value), how to measure the distance, how to combine
the data from several observations, and whether or not each neighbour should be given
the same weight.
ni

As was previously stated, the k-NN technique uses an unknown example’s k


nearby neighbours to classify it in the most prevalent class. It is presumptively true
that each example is a point in an n-dimensional space. If a neighbour is closest to
another in the n-dimensional space of qualities, it is said to be close by (An, 2006). The
U

unknown example is categorised in the training data in the class of its closest neighbour
if we set k=1.

The k number is not predetermined, however it is important to keep in mind that


ity

if we choose a tiny k value, the categorization may be too influenced by outlier values
or uncommon observations. On the other hand, picking a k value that is not extremely
small will tend to tamp down any peculiar behaviour that was picked up from the
training set. However, if we pick a k value that is too high, we will miss some locally
m

interesting behaviour.

By using a cross validation process, you can use the data to assist in solving this
issue. That is, we can test a variety of k values using various training sets selected at
)A

random, then select the k value that has the lowest classification error.

The most used distance function, in terms of how it calculates distances, is


the Euclidean distance (1), d(x, y) = √(Σ (xi − yi)²), where x and y stand for the m values of two cases’
characteristics.
(c

Alternatively, the Manhattan distance, d(x, y) = Σ |xi − yi|, may also be used.


Amity Directorate of Distance & Online Education
Data Warehousing and Mining 181

Notes

e
There are three issues with the Euclidean distance:

in
◌◌ The distance is determined by the measurement units used for the variables.
◌◌ The fluctuation of the various factors is not taken into account.
◌◌ The relationship between the variables is disregarded.

nl
Utilizing a gauge known as statistical distance is one option (or Mahalanobis
distance). We may list a few benefits of the k-NN technique, including the fact that the
training set is entirely stored as a description of this distribution rather than simplifying

O
the distribution of objects in space into a set of understandable features.

The k-NN approach is also simple to understand, simple to use, and practical.
For each new case to be classified, it can build a different approximation to the goal

ty
function, which is helpful when the target function is extremely complex but can be
described by a collection of less complex local approximations. Finally, without making
any assumptions about the structure of the function connecting the dependent variable

si
(the classification variable) with the independent variables, this strategy constructs a
classification method (attributes).

The most significant drawback of k-NN, however, is that it is extremely sensitive to


r
the existence of irrelevant factors. Other drawbacks include the time
ve
◌◌ it may take to locate neighbours in a big training set.
◌◌ With more dimensions, the quantity of observations required in the training
data rises exponentially (variables).
ni

Naive Bayes
The Bayes rule or formula, which is based on Bayes’ theorem, is used in Bayesian
approaches to combine data from the sample with expert opinion (prior probability) in
U

order to get an updated expert opinion (posterior probability)

In particular, the Naive Bayes technique (NB) is one of the most popular
classification techniques and one of the most powerful due to its straightforward
ity

computing procedure (Hand & Yu, 2001). It is able to forecast the likelihood that a given
example will belong to a particular class because it is based on the Bayes theorem.
It is referred to as a “naive” classifier in this respect because the assumption known
as class conditional independence, which holds that the impact of one attribute’s value
on a given class is independent of the values of other attributes, is what makes its
m

computations simple.

Because class Ci has the highest X conditioned posterior probability, this classifier
)A

predicts that case A will be a member of that class (set of attributes of the case in the
predictor variables). We can define the Bayes’ formula (3) that this posterior probability
offers thanks to Bayes’ theorem, and since P(X) is constant for all classes, the
classification procedure just requires us to maximise P(X|Ci)P(Ci).
(c

Amity Directorate of Distance & Online Education


182 Data Warehousing and Mining

The number of instances of each class Ci in a set of training data, P(Ci), can be
Notes

e
calculated. The classifier makes the “naive” assumption that the attributes used to
define X are conditionally independent from one another given class Ci in order to
reduce the computing cost of calculating P(X|Ci) for all potential xk (predictor variables).

in
This conditional independence is expressed as the product P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xm|Ci), where the m value denotes the number of predictor variables involved in the classification.

nl
When the characteristics are conditionally independent given the class, as is the

O
case in many studies comparing classification methods (e.g. Michie et al., 1994), NB
performs on par with and sometimes even better than ANN and DT. Recent theoretical
studies have demonstrated why NB is so resilient.

The simplicity, computational efficiency, and effective classification performance of

ty
NB classifiers are what make them appealing. Additionally, NB can deal with unknown
or missing values with ease. However, it has three significant flaws. The technique
assumes that a new case with this category in the predictor has zero probability if it is

si
not present in the training data, which is the first of several drawbacks. Finally, even
though we achieve a good performance if the aim is to classify or order cases according
to their probability of belonging to a certain class, this method offers very biassed
r
results when the aim is to estimate th
ve
Beyond these limitations, the NB technique is simple to apply, it adapts to the data,
and it is simple to interpret. Additionally, just one investigation of the data is necessary.
Its popularity has grown significantly, notably in the literature on machine learning,
thanks to its simplicity, parsimony, and interpretability.
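A compact sketch of the idea, under the assumption of Gaussian class-conditional densities and with toy data: Bayes’ rule gives P(Ci|X) = P(X|Ci)P(Ci)/P(X), and under the naive independence assumption P(X|Ci) factorises into a product over the individual attributes, which is what scikit-learn’s GaussianNB computes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data: two numeric attributes, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.5], [3.2, 3.7]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

# Posterior probabilities P(Ci|X) for a new case; the predicted class is the
# one that maximises P(X|Ci) * P(Ci), as described in the text.
x_new = np.array([[1.1, 2.0]])
print(nb.predict_proba(x_new), nb.predict(x_new))
```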

Logistic Regression
The link between a continuous response variable and a group of predictor variables

is investigated using linear regression. However, linear regression is inappropriate when


the response variable is categorical.

Logistic regression (LR) is a generalised linear model (GLM). Predicting binary
variables (with values such as yes/no or 0/1) is its principal use. Thus, using LR
approaches, a new observation with an unknown group can be assigned to one of the
groups depending on the values of the predictor variables.

The classification depends on a linear combination of the attributes, just as in linear
regression. This linear combination is mapped into the interval [0, 1] by the logistic
function (Ye, 2003). Consequently, to apply LR, the dependent variable is transformed
into a continuous value that is a function of the probability that the event will occur.

In LR, there are two steps: first, we estimate the likelihood that each instance will
belong to each group; then, in the second phase, we use a cut-off point in conjunction
with these probabilities to place each example in one of the groups.


Through a series of rounds, the model’s parameters are calculated using the maximum
likelihood technique. Consult Larose for a more thorough explanation of the LR method.

Finally, it is important to note that LR can generate reliable results with a small amount
of data. On the other hand, classical regression is even more appealing because it is
so widely established, simple to use, and universally understood. Additionally, LR
exhibits behaviour similar to that of a diagnostic test.
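A minimal sketch of the two-step LR procedure described above, assuming scikit-learn
is available (the toy data and the 0.5 cut-off are illustrative choices):

from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy data: one predictor variable and a binary (0/1) response.
X = np.array([[2.0], [3.5], [5.0], [6.5], [8.0], [9.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)                          # parameters estimated by maximum likelihood

# Step 1: estimate the probability that a new example belongs to group 1.
probs = model.predict_proba(np.array([[5.5]]))[:, 1]

# Step 2: apply a cut-off point (here 0.5) to assign the example to a group.
labels = (probs >= 0.5).astype(int)
print(probs, labels)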

Data mining regression techniques such as linear regression use a straight line
to establish a connection between the target variable and one or more independent
variables. The provided equation serves as a representation of the linear regression

equation.

Y = a + b*X + e

Where:

◌◌ a represents the intercept;
◌◌ b represents the slope of the regression line;
◌◌ e stands for the error term;
◌◌ X and Y represent the predictor and target variables, respectively.

When X is made up of more than one variable, the model becomes a multiple linear
regression.
Linear regression determines the best-fitting line using the least squares method,
which minimises the total sum of the squared deviations from each data point to the
regression line. Because all deviations are squared, the positive and negative
deviations do not cancel each other out.
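A small numeric sketch of the least squares fit just described (the data values are made
up for illustration):

import numpy as np

# Toy data: predictor X and target Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form least squares estimates of the slope b and the intercept a.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

Y_pred = a + b * X
errors = Y - Y_pred          # the error term e for each observation
print(a, b)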

4.2 Characterisation of Data Mining



There are two types of data mining: descriptive data mining and predictive data
mining. In a succinct and summarizing manner, descriptive data mining displays
the data set’s intriguing general qualities. In order to build one or more models
and make predictions about the behavior of fresh data sets, predictive data mining

analyzes the data.

Large volumes of data are frequently stored in great detail in databases. However,
users frequently prefer to view collections of summarized facts in clear, evocative
language. Such data descriptions may give a broad overview of a data class or set it
m

apart from related classes. Users also like how easily and adaptably data sets may be
described at various granularities and from various perspectives. Concept description is
a type of descriptive data mining that is a crucial part of data mining.

Data entries may be connected to concepts or classes. For instance, computer and
printer classes are available for purchase in the AllElectronics store, and big spenders
and budget spenders are concepts for customers. Individual classes and concepts can
be explained in succinct, precise, but brief ways.

Class/concept descriptions are used to describe a class or a concept. These
descriptions can be derived via:


●● Data characterization involves a generalized summary of the data for the class
being studied (often referred to as the target class),
●● data discrimination involves contrasting the target class with one or more

comparison classes (often referred to as the contrasting classes),
●● both data characterization and discrimination.

4.2.1 Data Mining Characterisation
Concept description is the most basic form of descriptive data mining. Typically, a
concept describes a grouping of facts, such as frequent_buyers, graduate_students,

and so forth. Concept description is not just an enumeration of the data when it comes
to data mining tasks. Concept description, on the other hand, produces descriptions for
characterising and contrasting the facts.

When the notion to be expressed pertains to a class of objects, it is sometimes
referred to as a class description. While concept or class comparison (also known
as discrimination) provides discriminations between two or more collections of data,

characterization provides a brief and succinct summary of the given collection of facts.

The general characteristics or attributes of a target class of data are summarised in


data characterisation. Typically, a query is used to gather the data related to the user-
specified class. For instance, data for such items can be gathered by running a SQL
query on the sales database in order to investigate the features of software products
with sales that increased by 10% in the previous year.

Example: Data characterization.



The following data mining task can be ordered by an AllElectronics customer


relationship manager: Summarize the traits of AllElectronics customers who make
annual purchases of at least $5000. A basic profile of these clients emerges as a

result, including information about their age range of 40 to 50, employment status, and
credit standing. The customer relationship manager should be able to drill down on
any dimension, such as occupation, to view these customers according to their sort of
employment, using the data mining system.
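A small sketch of how such a characterization and drill-down might look in pandas (the
data frame and its column names are invented for illustration and are not part of the
AllElectronics example):

import pandas as pd

# Hypothetical customer table; the column names are assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [42, 47, 29, 51],
    "occupation": ["engineer", "teacher", "engineer", "manager"],
    "annual_purchases": [5200, 7400, 1200, 6100],
})

# Target class: customers whose annual purchases are at least $5000.
target = customers[customers["annual_purchases"] >= 5000]

# Generalized summary (characterization) of the target class ...
print(target["age"].describe())

# ... and a drill-down on the occupation dimension.
print(target.groupby("occupation")["annual_purchases"].agg(["count", "mean"]))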

Data discrimination is the process of comparing the general characteristics of data


items from the target class to those of objects from one or more opposing classes. A
user can specify the target and opposing classes, and database queries can be used to
retrieve the associated data objects.

For instance, a user might want to contrast the general features of software items
whose sales rose by 10% last year with those of products whose sales fell by at least
30% during the same time period. The techniques used for data characterization and

data discrimination are comparable.

“How are descriptions of discrimination produced?” Although discrimination


descriptions should include comparative measures that aid to discriminate between the
target and contrasting classes, the formats of output presentation are similar to those

for characteristic descriptions. Discriminant rules are descriptions of discrimination that


take the form of rules.


4.2.2 Data Mining Discrimination


Customers who frequently (e.g., more than twice a month) and infrequently (e.g., less
than three times a year) purchase computer products may be compared by an
AllElectronics customer relationship manager. The
resulting description offers a general comparative profile of these customers, showing,
for example, that 60% of those who infrequently buy computer products are either

seniors or youths and have no university degree, compared to 80% of those who
frequently buy computer products who are between the ages of 20 and 40 and have a
university education. Finding even more distinguishing characteristics between the two
classes may be aided by expanding or narrowing a dimension, such as occupation, or

by including a new dimension, such as income level.

Concept description and data generalisation go hand in hand. Given the volume of
data contained in databases, an effective way to aid users in examining the overall
behaviour of the data is to express concepts in brief and succinct terms at broad levels
of abstraction. With the ABCompany database, for instance, sales managers may
choose to view the data generalised to higher levels, such as aggregated by customer
groups according to geographic regions, frequency of purchases per group, and
customer income, rather than looking at individual customer transactions. Such
multidimensional, multilevel data generalisation is similar to multidimensional data
analysis in data warehouses. The following are the key distinctions between online
analytical processing and concept description in large databases.
Complex Data Types and Aggregation:
The foundation of data warehouses and OLAP technologies is a multidimensional
data model, which presents data as a data cube made up of dimensions (or

attributes) and measurements (aggregate functions). For the majority of commercial


implementations of these systems, the dimensions and measures’ potential data types
are constrained. Many of today’s OLAP systems limit dimensions to non-numeric data,

and measures (such as count, total, and average) only apply to numeric data. In
contrast, the database properties for concept description can be of
several data types, such as numeric, nonnumeric, spatial, text, or image.

Additionally, the gathering of attributes in a database may comprise complex


data kinds, such as the grouping of object pointers, the combining of spatial regions,
the composition of images, and the collection of non-numerical data. OLAP, with its
limitations on the types of dimension and measure that are feasible, thereby constitutes
a streamlined paradigm for data analysis. Databases may manage complicated data

kinds of characteristics and their aggregations as needed for concept description.

User-control Versus Automation:



Data warehouses’ online analytical processing is an entirely user-controlled


procedure. Although the control in most OLAP systems is relatively user-friendly, users
do need a clear understanding of the purpose of each dimension. The users direct and
control the selection of dimensions and the implementation of OLAP operations, such

as drill-down, roll-up, slicing, and dicing. Users may also need to describe a lengthy list
of OLAP operations in order to find a good description of the data. Concept description
in data mining, in contrast, aims for a more automated approach that aids in deciding


which dimensions (or attributes) should be included in the analysis and the extent to

which the supplied data set should be generalised in order to produce an engaging
summary of the data.

Data warehousing and OLAP technologies have recently advanced to handle more
complicated data kinds and incorporate additional knowledge discovery processes.
Additional descriptive data mining elements are predicted to be incorporated into OLAP

systems in the future as this technology develops.

The following list of concept description techniques includes multilevel generalisation,


summarization, characterisation, and comparison. These techniques laid the

groundwork for the multiple-level characterisation and comparison functional modules,
which make up the bulk of data mining. You will also look at methods for presenting
concept and description in various formats, such as tables, charts, graphs, and rules.

Data Generalization and Summarization-Based Characterization
Databases frequently have detailed information at the level of simple concepts in
their data and objects. For instance, properties indicating low-level item information

si
like item _ID, name, brand, category, supplier, place_made, and price may be present
in the item relation in the sales database. Being able to condense a vast collection of
material and convey it at a high conceptual level is helpful. For instance, providing a
r
basic summary of such data by summarising a huge number of elements related to
ve
Christmas season sales can be quite beneficial for sales and marketing managers.
Data generalisation, a crucial component of data mining, is necessary here.

A technique called data generalisation abstracts a sizable collection of task-


relevant data from a database’s low to higher conceptual levels. Two approaches can

be used to group methods for the effective and flexible generalisation of huge data sets:

◌◌ the data cube (or OLAP) approach and


◌◌ the attribute-oriented induction approach.

Attribute-Oriented Induction
A few years before the data cube approach was introduced, in 1989, the attribute-

oriented induction (AOI) approach to data generalisation and summarization-based


characterization was initially put forth. The data cube approach can be thought
of as a materialized-view, data warehouse-based, pre-computational strategy.
Before submitting an OLAP or data mining query for processing, it performs off-line
aggregation. On the other hand, the attribute-oriented induction approach is a relational
m

database query-oriented, generalization-based, on-line data analysis tool, at least in its


initial proposal.

The two approaches can’t be separated based on online aggregation versus


)A

off-line pre computation, though, because of an intrinsic barrier. While off-line pre-
computation of multidimensional space can speed up attribute-oriented induction as
well, some aggregations in the data cube can be computed online.
(c

4.3 Association and Market Basket Analysis


A common data mining method is called market basket analysis (MB), which


is an association analysis. It falls under the category of knowledge discovery in data



e
(KDD), a technique that has applications across many industries. Here, I’ll utilise data
from a retail transaction to demonstrate how to give businesses access to information
containing the customer’s purchasing patterns. This may also be included in a system

in
of support for decisions.

4.3.1 Mining Patterns

nl
Patterns that commonly show up in a data set include item sets, subsequences,
and substructures. A frequent itemset is, for instance, a group of items that regularly

O
appear together in a transaction data set, like milk and bread.

A subsequence is a (frequent) sequential pattern if it regularly appears in a


database of shoppers’ purchases, such as when a PC, digital camera, and memory
card are purchased in that order. Different structural forms, such as subgraphs,

ty
subtrees, or sublattices, which may be combined with item sets or subsequences, might
be referred to as substructures. A substructure is referred to as being frequent in an
organised pattern. In order to mine connections, correlations, and a variety of other

si
fascinating relationships between data, it is crucial to identify common patterns.

Market Basket Analysis: A Motivating Example

r
Market basket analysis is an illustration of frequent itemset mining in action.
ve
By identifying relationships between the many goods that customers place in their
“shopping baskets,” this technique analyses the purchasing behaviours of those
customers (Figure: Market basket analysis). The identification of these relationships
can assist merchants in creating marketing plans by providing information on the
ni

products that customers typically buy in tandem. How likely is it that consumers who are
purchasing milk will also purchase bread (and what kind of bread) during the same trip
to the store? By assisting merchants with targeted marketing and shelf space planning,
this information can result in an increase in sales.
U
ity
m
)A
(c

Figure: Market basket analysis


Which categories or combinations of goods are customers most likely to buy during

e
a particular store visit? The retail data of consumer transactions at your store may be
used for market basket analysis to address your query. The results can then be used to
develop marketing or advertising plans or to design a new catalogue.

in
Association rules are a sort of pattern representation. The following association
rule, for instance, illustrates the fact that consumers who buy laptops also frequently

nl
purchase antivirus software at the same time:

Two metrics for rule interestingness are rule support and rule confidence. They

O
demonstrate the utility and certainty of established rules, respectively. A support of 2%
for the aforementioned rule indicates that antivirus software and computers are bought
jointly in 2% of the transactions being examined. A 60% confidence level indicates
that 60% of customers who bought a machine also purchased the software. Usually,

ty
association rules fulfil both a minimum support threshold and a minimum confidence
requirement in order to be deemed interesting. Users or subject matter experts may
establish these thresholds.

si
Frequent Itemsets, Closed Itemsets, and Association Rules

r
ve
Strong rules are those that fulfil both a minimum support threshold and a minimum
confidence criterion.

An itemset is a collection of items. A k-itemset is an itemset that includes k items.


Computer and antivirus software make up the 2-item set. The quantity of transactions
ni

including an itemset determines how often it occurs. The frequency, support count, or
count of the itemset are other names for this.
U

Association rule mining is typically thought of as a two-step process:


ity

1. Find all frequent itemsets: Each of these itemsets must happen at least as
frequently as a specific minimum support count, min sup, to be considered by
definition.
2. Generate strong association rules from the frequent itemsets: These
regulations must by definition meet minimum support and confidence
m

requirements.
If there is no suitable super-itemset Y with the same support count as X in D, then
)A

itemset X in data set D is closed. If an item set X is both closed and frequent in set D,
then X is a closed frequent item set in D. If an itemset X is frequent and there is no
super-itemset Y such that X_Y and Y is frequent in D, then X is a maximal frequent
itemset (or max-itemset) in the data set D.

Example: Closed and maximal frequent itemsets.


(c

Assume that there are just two transactions in a transaction database: “{a1, a2, : : :


, a100}; {a1, a2, : : : , a50}.” Let min sup = 1 be the minimal support count criterion. We

e
discover two closed frequent item sets with their corresponding support counts, namely
C={{a1, a2, : : : , a100} : 1; {a1, a2, : : : , a50} : 2}. The only maximally common item
collection is this: M={{a1, a2,::: , a100}: 1}. Because it has a frequent superset, “{a1,

in
a2, : : : , a100},” we are unable to include “{a1, a2, : : : , a50}” as a maximal frequent
itemset. Compare this to the last analysis, when we found that there are too many 2100
-1 frequent itemsets to list!

nl
Complete details about the frequent itemsets are contained in the set of closed
frequent itemsets. For instance, from C, we can deduce (1) {a2, a45 : 2} because {a2,
a45} is a subitemset of the itemset a1, a2, … , a50 : 2}; and (2) {a8, a55 : 1} because

O
{a8, a55} is a subitemset of the itemset {a1, a2, : : : , a100 : 1} rather than the previous
itemset. However, based just on the most common itemset, we are only able to state
that the two itemsets ({a2, a45} and {a8, a55}) are frequent but not their precise support

ty
counts.

Frequent Itemset Mining Methods

si
A priori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

R. Agrawal and R. Srikant introduced the groundbreaking algorithm apriori in


1994 [AS94b] for mining frequent itemsets for Boolean association rules. The name
r
of the algorithm is based on the fact that, as we shall see later, it makes use of prior
ve
knowledge of common itemset attributes. A level-wise search is an iterative method
used by Apriori to explore (k +1)-itemsets using k-itemsets. By searching the database,
adding up each item’s count, and gathering the items that meet the minimal support,
the set of frequent 1-itemsets is first discovered. L1 represents the resulting set. When
no more frequent k-itemsets can be located, L1 is then used to find L2, the collection of
ni

frequent 2-itemsets, which is then used to find L3, and so on. Each Lk must be located
via a complete database scan.
U

An important attribute known as the Apriori property is employed to condense


the search space in order to increase the effectiveness of the level-wise production of
frequent itemsets.

Apriori property: A frequent itemset must also have all nonempty subsets.
ity

How is the algorithm using the Apriori property?

It is done in two steps, using join and prune operations.

1. The join step: By combining Lk-1 with itself, a set of potential k-itemsets is created
m

in order to find Lk. This group of contenders is referred to as Ck. Let Lk-1’s itemsets
l1 and l2 be examples. The jth item in li is indicated by the notation li[j] (e.g., l1[k
-2] refers to the second to the last item in l1). Apriori assumes that the items
)A

within a transaction or itemset are arranged in lexicographic order for effective


implementation. This means that the items are sorted in the order li[1]< li[2]< ......
<li[k -1] for the (k -1)-itemset, li. Members of Lk-1 are joinable if their first (k -2) items
are shared, and this join is Lk-1 Lk-1. That is, if (l1[1] = l2[1])< (l1[2] = l2[2])... (l1[k -2]
= l2[k -2]) (l1[k -1]< l2[k -1]), then members l1 and l2 of Lk-1 are connected. Simply
(c

put, the l1[k -1] < l2[k -1] condition checks to make sure no duplicates are produced.


By joining l1 and l2, an itemset is created that looks like this: l1[1], l1[2],...., l1[k -2],

e
l1[k -1], l2[k -1].
2. The prune step: Since Ck is a superset of Lk, all frequent k-itemsets are present

in
even though its members may or may not be common. Lk would be determined by
conducting a database search to count each candidate in Ck (i.e., all candidates
having a count no less than the minimum support count are frequent by definition,

nl
and therefore belong to Lk). However, Ck can be very large, thus this could need a
lot of work. The Apriori property is utilised in the following way to decrease the size
of Ck. Any . An uncommon (k-1)-itemset cannot be a subset of a common (k-1)-
itemset. Therefore, the candidate cannot be frequent either and can be eliminated

O
from Ck if any (k -1)-subset of a candidate k-itemset is not in Lk-1. By keeping a
hash tree of all frequently occurring itemsets, this subset testing can be completed
quickly.

ty
Transactional Data for an AllElectronics Branch

TID List of item_IDs

si
T100 11, 12, 15
T200 12, 14
T300 12, 13
r
T400 11, 12, 14
ve
T500 11, 13
T600 12, 13
T700 11, 13
ni

T800 11, 12, 13, 15


T900 11, 12, 13

Table T1: Transaction Data for All Electronics


U

Example: Apriori.
Let’s examine a specific illustration based on Table T1 of the AllElectronics
ity

transaction database, D. This database has nine transactions, therefore |D| = 9. We


demonstrate the Apriori approach for locating common itemsets.

1. Each item in the algorithm’s initial iteration is a part of the C1 set of candidate
1-itemsets. To count the instances of each item, the programme simply scans all of
m

the transactions.
2. Assume that the necessary minimum support count is 2, or min sup = 2. (Since
we are utilising a support count, we are talking about absolute support here. 2/9
)A

= 22% is the corresponding relative support.) Then, L1, the collection of common
1-itemsets, can be found. The candidate 1-itemsets that meet the required level of
support make up this list. All of the candidates in C1 in our example meet the criteria
for little support.
3. The algorithm uses the join L1 L1 to create a candidate set of 2-itemsets, C2, in
(c

order to find the set of frequent 2-itemsets, L2. (|L1|C2) 2-itemsets make up C2. As


each subset of the candidates is also frequent, no candidates are eliminated from

e
C2 during the prune stage.
4. Then, as shown in the middle table of the second row in Figure 6.2, the transactions

in
in D are scanned, and the support count of each potential 2-itemset in C2 is tallied.
5. The set of frequent 2-itemsets, L2, is then established, consisting of the potential
2-itemsets in C2 with the least amount of support.

nl
6. The generation of the set of the candidate 3-itemsets, C3, is detailed in Figure 6.3.
From the join step, we first get C3 =L2 L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},{I2, I3,
I4}, {I2, I3, I5}, {I2, I4, I5}}.

O
(a) Join: C3 = L2 L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} {{I1, I2},
{I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},{I2,
I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

ty
(b) (b) Pruning based on the Apriori characteristic A frequent itemset must also
have all nonempty subsets. Do any of the contenders have an uncommon
subset?

si
Generation and pruning of Candidate 3-itemsets, C3, from L2 using the apriori
property.
{I1, I2, I3}, and {I2, I3} are the 2-item subsets of “{I1, I2, I3}.” L2 includes all
r
2-item subgroups of “{I1, I2, I3}.” Keep “{I1, I2, I3}” in C3 as a result.
ve
{I1, I2}, {I1, I5}, and {I2, I5} are the 2-item subsets of “{I1, I2, I5}”. L2 includes
all 2-item subgroups of “{I1, I2, I5}”. Keep “{I1, I2, I5}” in C3 as a result.
{I1, I3, I5}, and {I3, I5} are the 2-item subsets of “{I1, I3, I5}”. Since “{I3, I5}” is
ni

not a member of “L2,” it is uncommon. So, take off I1, I3, and I5 from C3.
{I2, I3}, {I2, I4}, and {I3, I4} are the 2-item subsets of “{I2, I3, I4}”. Since “{I3,
I4}” is not a part of L2, it is not frequently encountered. So, take off I2, I3, and
U

I4 from C3.
{I2, I3, I5}, and {I3, I5} are the 2-item subsets of “{I2, I3, I5}”. Since “{I3, I5}” is
not a member of “L2,” it is uncommon. So, take out I2, I3, and I5 from C3.
ity

{I2, I4}, {I2, I5}, and {I4, I5} are the 2-item subsets of “{I2, I4, I5}”. Since “{I4,
I5}” is not a part of L2, it is not frequently encountered. As a result, delete “I2,
I4, I5” from C3.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
m
)A
(c



e
in
nl
O
ty
Figure: Generation of the candidate itemsets

si
7. To find L3, which consists of those candidate 3-itemsets in C3 with minimum support,
the transactions in D are examined (Figure above).
8. r
The algorithm creates a candidate set of four item sets, C4, using L3 1 L3. Despite
ve
the fact that the join produces the itemset “{{I1, I2, I3, I5}}”, this itemset is pruned
since its subset “{I2, I3, I5}” is seldom. As a result, C4 =, and the algorithm stops
after discovering all of the common itemsets.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach
ni

based on candidate generation.

Input:
U

◌◌ D, a database of transactions;
◌◌ min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
ity
m
)A
(c



e
in
nl
O
ty
Figure F2: Apriori algorithm
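The pseudocode in the figure is not reproduced here; as a stand-in, the following is a
minimal Python sketch of the level-wise generate-and-test procedure described above
(the function and variable names are our own, and the transactions are those of Table T1):

from itertools import combinations

def apriori(transactions, min_sup):
    # L1: count single items and keep those meeting the minimum support count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join step: unite pairs of frequent (k-1)-itemsets that share k-2 items.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must itself be frequent.
                    if all(frozenset(sub) in frequent
                           for sub in combinations(union, k - 1)):
                        candidates.add(union)
        # One database scan to count the surviving candidate k-itemsets.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The nine transactions of Table T1, with min_sup = 2 as in the worked example.
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
     ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
     ["I1", "I2", "I3"]]
print(apriori(D, 2))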

si
4.3.2 Mining Associations

Generating Association Rules from Frequent Itemsets


r
ve
Finding the common itemsets from transactions in a database D makes it simple to
create powerful association rules from them (where strong association rules satisfy both
minimum support and minimum confidence). The following Eq1 equation for confidence
can be used to accomplish this, and we’ll repeat it here for completeness:
ni

confidence(A⇒B) = P(B|A) = support(A∪B)/support(A) = support_count(A∪B)/


support_count(A)

confidence(A ⇒ B) = P(B|A) = support_count(A∪B)/support_count(A)


U

The number of transactions that contain the itemsets A and B, as well as the
number of transactions that contain the itemset A, are used to describe the conditional
probability in terms of itemset support count (also known as support count(A)). This
ity

equation can be used to generate the following association rules:

◌◌ For each frequent itemset l, generate all nonempty subsets of l.


◌◌ For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_
count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum
m

confidence threshold.
Each rule automatically satisfies the minimal support because it is created from
frequent itemsets. Hash tables can be used to store frequently used itemsets and their
)A

counts in advance for easy access.
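A minimal sketch of this rule-generation step in Python (the function and the small
support table, which is consistent with Table T1, are illustrative assumptions):

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozenset itemsets to their support counts.
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = sup / frequent[lhs]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

# Support counts consistent with Table T1, with min_conf = 0.6 (60%).
frequent = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
            frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
            frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2}
for lhs, rhs, conf in generate_rules(frequent, 0.6):
    print(lhs, "=>", rhs, round(conf, 2))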

Improving the Efficiency of Apriori


“How may the effectiveness of apriori-based mining be increased? The Apriori
(c

algorithm has undergone numerous modifications, all of which aim to increase its
effectiveness. Here is a summary of a few of these variations:



e
in
nl
Figure: H2 hash table for potential 2-item sets. This hash table was created by
looking up transactions in Table (Transaction Data for All Electronics) and figuring out
L1. The itemsets in buckets 0, 1, 3, and 4 cannot be frequent, hence they should not be

O
included in C2 if the minimal support count is, say, 3.

Hash-based technique (hashing itemsets into corresponding buckets):

ty
For k > 1, a hash-based method can be utilised to minimise the size of the
candidate k-itemsets, Ck. We can create all the 2-itemsets for each transaction, hash
(i.e., map) them into the various buckets of a hash table structure, and then increase

si
the associated bucket counts when scanning each transaction in the database to
produce the frequent 1-itemsets, L1, for instance (Figure above). A 2-itemset that has
a bucket count in the hash table for the associated item that is less than the support

r
criteria cannot be frequent and ought to be eliminated from the candidate set. A
hash-based method like this one could significantly minimise the amount of potential
ve
k-itemsets that need to be investigated (particularly when k = 2).

Transaction reduction (reducing the number of transactions scanned in future


iterations): Any frequent (k + 1)-itemsets cannot be present in a transaction that doesn’t
contain any frequent k-itemsets. Therefore, since it won’t need to be taken into account
ni

in later database searches for j-itemsets where j > k, such a transaction can be flagged
or excluded from further consideration.
U

Partitioning (partitioning the data to find candidate itemsets): To mine the common
itemsets, a partitioning strategy can be utilised that only needs two database scans
(Figure Mining by Partitioning the data). There are two phases to it. The algorithm
separates the transactions in D into n nonoverlapping partitions in phase I. The
ity

minimum support count for a partition is min sup the number of transactions in that
partition if the minimum relative support threshold for transactions in D is min sup. All of
the local frequent item sets, or the frequent item sets within the partition, are located for
each partition.
m
)A
(c

Figure: Mining by Partitioning the data



A local frequent itemset could or might not be frequent across the board, D. To be

e
a frequent itemset, however, a set of items must be potentially frequent with regard to
D in at least one of the partitions. In light of D, all local frequent itemsets are candidate
itemsets. The global candidate itemsetswith regard to D are comprised of the frequent

in
item sets from each partition. In phase II, a second scan of D is performed to ascertain
the worldwide frequent itemsets by evaluating the actual support of each contender.
The number of partitions and their sizes are chosen so that each may fit in main

nl
memory and only need to be read once during each phase.

Sampling (mining on a subset of the given data): The sampling approach’s


primary premise is to select a random sample S from the given data D and then look

O
for frequent item sets in S rather than D. By doing this, we compromise some of our
accuracy for some of our efficiency. Only one scan of the transactions in S is necessary
in total since the S sample size allows for the search for frequently occurring itemsets in

ty
S to be performed in main memory. We might overlook some of the worldwide frequent
itemsets because we are looking for frequent itemsets in S rather than D.

To lessen the likelihood of this happening, we discover the frequent itemsets local

si
to S with a support threshold that is lower than the minimal support (denoted LS ).
The actual frequencies of each itemset in LS are then calculated using the remaining
data in the database. The existence of all global frequent itemsets in LS is checked

r
using a technique. One scan of D is necessary if LS truly contains all of the frequent
item sets in D. If not, a second pass can be made to locate the common itemsets that
ve
were overlooked during the first. When efficiency is crucial, as it is in computationally
intensive programmes that must be executed frequently, the sampling strategy is
extremely advantageous.
ni

Dynamic itemset counting (adding candidate itemsets at different points during a


scan): A dynamic itemset counting method that partitions the database into blocks with
start points has been presented. In contrast to Apriori, which decides new candidate
itemsets only just before each complete database scan, this variation allows for the
U

addition of new candidate itemsets at any stage in the analysis process. As the lower
bound of the actual count, the approach uses the count-so-far. The itemset is added to
the frequent itemset collection and can be used to produce lengthier candidates if the
ity

count-to-date exceeds the minimal support. Compared to using Apriori, this results in
fewer database scans to discover all the frequently occurring itemsets.

A Pattern-Growth Approach for Mining Frequent Itemsets


As we have seen, the Apriori candidate generate-and-test method frequently
m

results in a good performance boost by drastically reducing the size of candidate sets.
However, it can be hampered by two significant expenses:
)A

◌◌ Quite a few candidate sets may still need to be generated. The Apriori
algorithm will need to produce more than 107 candidate 2-itemsets, for
instance, if there are 104 frequent 1-itemsets.
◌◌ It could be necessary to repeatedly scan the entire database and use pattern
matching to check a huge number of candidates. Analyzing every database
(c

transaction to see whether the candidate itemsets are supported is expensive.


Is it possible to create a system that mines every common itemset without using


such an expensive candidate generating procedure? Frequent pattern growth, often



e
known as FP-growth, is an intriguing approach in this endeavour that uses the following
divide-and-conquer tactic. First, it creates a frequent pattern tree, or FP-tree, using
the database of often occurring items while preserving the itemset association data.

in
The compressed database is then divided into a number of conditional databases
(a particular type of projected database), each linked to a single frequent item or
“pattern fragment,” and each database is mined independently. Only the data sets

nl
that are related with each “pattern fragment” need to be looked at. As a result, this
strategy may significantly minimise the amount of the data sets to be searched and the
“development” of the patterns under investigation.

O
Mining Frequent Itemsets Using the Vertical Data Format
The TID-itemset structure (i.e., “TID: itemset”), where TID is a transaction ID and
itemset is a set of items purchased in transaction TID, is used by both the Apriori and

ty
FP-growth methods to identify recurring patterns in a collection of transactions. The
horizontal data format is what is used in this. Alternative data presentation formats
include item-TID sets.

si
Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment
growth. Input:

◌◌ r
D, a transaction database;
ve
◌◌ Min_sup, the minimum support count threshold.
Output: The complete set of frequent patterns.

Method:
ni

1. The FP-tree is constructed in the following steps:


a. once scan transaction database D. Gather F, the group of frequent things,
together with the support numbers for each. As L, the list of frequently
U

occurring items, sort F in support count decreasing order.


b. Make a root for an FP tree and give it the label “null.” Do the following for each
transaction Trans in D.
ity

Choose and arrange the goods in Trans that are used frequently in the
order of L. Let [p|P], where p is the initial member and P is the rest of the
list, represent the sorted frequent item list in Trans. Call insert tree([p|P], T),
and the following action is taken. If T has a child N such that N.item-name =
m

p.item-name, then N’s count should be increased by 1. If not, then a new node
N should be created with a count of 1, a parent link to T, and node links to
other nodes that have the same item-name. Recursively call insert tree(P, N) if
)A

P is not empty.
2. The FP-tree is mined by calling FP growth(FP tree, null), which is implemented as
follows.
procedure FP growth(Tree, α)
(c

(1) if Tree contains a single path P then


(2) for each combination (denoted as β) of the nodes in the path P



e
(3) generate pattern β ∪α with support count = minimum support count of nodes
in β;

in
(4) else for each ai in the header of Tree {
(5) generate pattern β = ai ∪α with support count = ai .support count;

nl
(6) construct β’s conditional pattern base and then β’s conditional FP tree Treeβ ;
(7) if Treeβ 6= ∅ then
(8) call FP growth(Treeβ , β); }

O
(i.e., {item : TID set}), item is the name of the item, and TID set is the collection of
transaction identifiers that contain it. The vertical data format is what is used for this.

ty
Which Patterns Are Interesting?—Pattern Evaluation Methods
A support-confidence framework is used by the majority of association rule mining
algorithms. Many of the created rules are still not attractive to the users, despite

si
minimum support and confidence criteria’ assistance in weeding out or excluding
the investigation of a significant number of uninteresting rules. Unfortunately, this is
particularly true when mining for extended patterns or at low support thresholds. This
r
has been a significant barrier to the successful use of association rule mining.
ve
Strong Rules Are Not Necessarily Interesting
A rule’s interestingness might be determined subjectively or objectively. In the
end, only the user can determine whether a specific rule is interesting, and since this
judgement is subjective, it may vary from user to user. To eliminate boring rules that
ni

would otherwise be displayed to the user, objective interestingness measurements that


are based on the statistics “behind” the data might be employed.
U

Example:
A misleading “strong” association rule. Let’s say our research involves looking at
AllElectronics purchases of video games and computer software. Let’s use the terms
ity

“game” and “video” to denote transactions that contain computer games and videos,
respectively. Data from the investigated 10,000 transactions reveals that 6000 of the
client purchases included video games, 7500 featured audiovisual content, and 2000
included both. Assume that the data is ran through a data mining tool to find association
rules, with a minimum support of, say, 30% and a minimum confidence of 60%. It is
m

found that the following association rule exists:

buys(X, “computer games”) ⇒ buys(X, “videos”)


)A

[support = 40%, confidence = 66%]. (Rule R1)

Since the above rule (R1) meets the minimal support and minimum confidence
standards with a support value of 4000/10,000 = 40% and a confidence value of
4000/6000 = 66%, respectively, it would be reported as a strong association rule. Rule
(c

(R1) is misleading, though, because 75% rather than 66% of people are likely to buy
videos. In fact, there is a bad correlation between video games and computer games
because buying one of these products actually makes it less likely that you’ll buy


the other. We could easily base bad business judgments on Rule R1 if we don’t fully

e
comprehend this issue.

The aforementioned example also shows how a rule’s apparent confidence A

in
⇒B can be misleading. It does not reflect the actual strength (or lack thereof) of the
relationship between A and B in terms of correlation and implication. To mine intriguing
data correlations, alternatives to the support-confidence paradigm can be helpful.

nl
4.3.3 Mining Correlations
The support and confidence measurements are ineffective for removing

O
uninteresting association rules, as we have thus far shown. The support-confidence
framework for association rules can be supplemented with a correlation measure to
address this problem. As a result, the following correlation rules result:

ty
A ⇒ B [support, confidence, correlation]. (Rule R2)

In other words, the strength and confidence of a correlation rule are evaluated
along with the correlation of Itemsets A and B. There are a variety of correlation

si
measurements available. To identify which correlation measure would be best for
mining massive data sets, we examine a number of correlation measures in this
subsection.
r
The lift is a straightforward correlation metric and is provided as follows. If P(A B) =
ve
P(A)P(B), then the occurrence of itemset A is independent of the occurrence of itemset
B; otherwise, the events of itemsets A and B are dependent and connected. More than
two itemets can be included in this definition with ease. By calculating the values below
Equation, one may determine the lift between the incidence of A and B. (Eq1)
ni

(Equation Eq1)
U

The presence of A is negatively linked with the presence of B, which means that
the occurrence of one is likely to result in the absence of the other, if the resulting value
of the preceding equation (Eq1) is less than 1. A and B are positively associated if the
calculated value is higher than 1, which denotes that the occurrence of one precludes
ity

the occurrence of the other. A and B are independent and there is no correlation
between them if the outcome value equals 1.

The lift of the association (or correlation) rule A ⇒ B is also known as Eq (Eq1),
and it is identical to P(B|A)/P(B), or conf(A B)/sup(B). It evaluates how much the
m

incidence of one “lifts” the occurrence of the other, to put it another way. According
to the current state of the market, the sale of games is considered to “lift” or raise the
likelihood of the sale of videos by a factor of the value returned by Eq, for instance,
)A

if A corresponds to the sale of computer games and B corresponds to the sale of


videos. (Eq1).

Example: Correlation analysis using lift.

According to the table, the likelihood of buying a computer game is P.(game) =


(c

0.60, the likelihood of buying a video is P(video) = 0.75, and the likelihood of buying
both is P(game, video) = 0.40. The lift of the provided Association rule according to the
equation is P(game, video)/(P(game)*P.(video)) = 0.40/(0.60*0.75) = 0.89. There is a


negative link between the occurrence of “game” and “video” because this number is

e
smaller than 1.

Table T2: 2 x 2 Contingency Table Summarizing the Transactions with Respect to

in
Game and Video Purchases

Game Game Σrow

nl
Video 4000 3500 7500
Video 2000 500 2500
Σcol 6000 4000 10,000

O
The second correlation measure that we study is the x2 measure.

Table T3: Table T2 contingency table, now with expected values

ty
Game Game Σrow
Video 4000 (4500) 3500 (3000) 7500
Video 2000 (1500) 500 (1000) 2500

si
Σcol 6000 4000 10,000

Example: Correlation analysis using x2 .


r
ve
The observed value and expected value (given in parenthesis) for each slot of the
contingency table are required in order to calculate the correlation using the x2 analysis
for nominal data, as shown in Table T3. We can determine the value of x2 by using the
information in the table.
ni
U

The observed value of the slot (game, video) = 4000, which is less than the
expected value of 4500, and the x2 value being more than 1 cause buying game and
ity

buying video to be negatively linked.
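Both calculations can be reproduced in a few lines of Python from the counts in Tables
T2 and T3 (the variable names below are ours):

# Observed counts from Table T2 (game/video purchases among 10,000 transactions).
n = 10_000
game, video, both = 6000, 7500, 4000

# Lift = P(game AND video) / (P(game) * P(video)).
lift = (both / n) / ((game / n) * (video / n))
print(round(lift, 2))                      # 0.89 -> negative correlation

# Chi-square over the 2 x 2 contingency table (observed vs. expected counts).
observed = [both, game - both, video - both, n - game - video + both]   # 4000, 2000, 3500, 500
expected = [game * video / n, game * (n - video) / n,
            (n - game) * video / n, (n - game) * (n - video) / n]       # 4500, 1500, 3000, 1000
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 1))                # about 555.6, well above 1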

A Comparison of Pattern Evaluation Measures


How effective are the so far-discussed above measures? Should we also take into
m

account other options?

Yes, there are four such measures: cosine, Kulczynski, maximum confidence, and
)A

total confidence.

all confidence: The all confidence measure of two item sets A and B is defined as
(c

where max {sup(A), sup(B)} is the itemsets A and B’s maximum support.


max confidence: Given two itemsets, A and B, the max confidence measure of A

e
and B is defined as

in
The highest confidence of the two association rules is represented by the max conf
measure.

nl
“A ⇒ B” and “B ⇒ A.”

The Kulczynski measure of A and B (abbreviated as Kulc) is defined as follows


given two item sets A and B:

O
ty
It can be thought of as the mean of two confidence intervals. The likelihood of
itemset B given itemset A and the probability of itemset A given itemset B, respectively,
are the two conditional probabilities that make up this average.

si
Cosine: Last but not least, the cosine measure of two itemsets, A and B, is defined
as

r
ve
One might think of the cosine measure as a harmonised lift measure: The two
formulas are similar, with the exception that the square root of the product of the
ni

probabilities of A and B is used for cosine. But because the square root is used, the
cosine value is only affected by the supports of A, B, and A B, not by the overall number
of transactions. This is a significant distinction.
U

Each of these four defined measurements possesses the following characteristic:


Its value is not affected by the overall number of transactions, but rather solely by the
supports of A, B, and A B, or more precisely, by the conditional probabilities of P(A|B)
and P(B|A). Each measure has a range from 0 to 1, and the higher the value, the tighter
ity

the link between A and B is. This is another characteristic that all measures have.

Now that lift and 2 have been added, we have a total of six pattern evaluation
metrics. Which method is ideal for evaluating the identified pattern correlations, you
might wonder. We investigate their performance on various common data sets to
m

provide an answer to this topic.

Table T4: 2 X 2 Contingency Table for Two Items


)A
(c


Table T5: Comparison of Six Pattern Evaluation Measures Using Contingency


Tables for a Variety of Data Sets

e
in
nl
O
Example: Six pattern evaluation measures are compared on common data sets.

The purchase histories of two items, milk and coffee, can be summarised in

ty
Table T4, a 2 2 contingency table, where an entry such as mc denotes the number
of transactions including both milk and coffee. This allows for the examination of the
correlations between the purchases of the two commodities.

si
The transactional data sets, relevant contingency tables, and associated values
for each of the six assessment measures are displayed in Table T5. The first four data
sets, D1 through D4, should be examined. According to the table, m and c have positive
r
associations in D1 and D2, negative associations in D3, and no associations in D4. Due
ve
to the fact that mc (10,000) is significantly larger than mc (1000) and mc (1000), m and
c are positively correlated for D1 and D2 (1000). It makes intuitive sense to assume that
customers who purchased milk (m = 10,000 + 1000 = 11,000) also purchased coffee
(mc/m = 10/11 = 91%) and vice versa.
ni

By generating a measure value of 0.91, the four newly proposed measures’


results demonstrate that m and c are substantially positively correlated in both data
sets. However, because of their sensitivity to mc, lift and χ2 produce radically different
U

measure values for D1 and D2. In truth, mc is frequently enormous and unstable in real-
world situations. A market basket database, for instance, might have a daily fluctuation
in the overall number of transactions that much outnumber those including any given
itemset. In order to avoid producing unstable findings, as shown in D1 and D2, a good
ity

interestingness measure should not be influenced by transactions that do not contain


the itemsets of interest.

Similar to D2, the four additional measures correctly demonstrate that m and c
have a substantial negative correlation because 100/1100 = 9.1% is the ratio of m to
m

c to m. However, lift and χ2 both incorrectly contradict this: Between D1 and D3, their
values for D2 fall.

Because the ratio of mc to mc is equal to the ratio of mc to mc, which is 1, the


)A

others imply a “neutral” correlation while lift and 2 indicate a very positive association
between m and c for data set D4. In other words, the likelihood that a consumer will
also buy milk or coffee is exactly 50% if they purchase coffee.

Why, in the preceding transactional data sets, are lift and χ2 so ineffective at
(c

identifying pattern association relationships? We must think about the null-transactions in


order to respond to this. A transaction that doesn’t contain any of the itemsets under


examination is said to be null. In our illustration, mc stands for the quantity of null-

e
transactions. Because mc greatly influences both lift and χ2, they have trouble
differentiating important pattern association associations. Typically, more people
may make nulltransactions than actual purchases since, for instance, many people

in
might decide not to buy either milk nor coffee. The other four measures, however, are
excellent predictors of intriguing pattern connections since their definitions eliminate the
impact of mc (i.e., they are not influenced by the number of null-transactions).

nl
This conversation demonstrates how important it is to have a measure whose
value is unaffected by the quantity of null transactions. If null-transactions have no
impact on a measure’s value, that measure is said to be null-invariant. For measuring

O
association patterns in sizable transaction databases, null-invariance is a key
characteristic. The only two measures that are not null-invariant among the six that are
addressed in this paragraph are lift and χ2.

ty
Which of the Kulczynski, cosine, all confidence, and maximum confidence
measures is most effective in highlighting intriguing pattern relationships?

The imbalance ratio (IR), which evaluates the imbalance of two itemsets, A and

si
B, in rule implications, is introduced in order to provide a response to this query. It’s
described as

r
ve
where the denominator is the number of transactions containing A or B and the
numerator is the absolute value of the difference between the support of itemsets A and
B. IR(A,B) will be zero if the two directional implications between A and B are identical.
ni

In contrast, the imbalance ratio increases with the size of the difference between the
two. This ratio is independent of both the total number of transactions and the number
of null transactions.
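Because the formula images are not reproduced in this text, the following sketch records
the standard definitions of the four null-invariant measures and of IR as Python functions
(sup denotes support, on either a relative or an absolute scale; the function names are ours):

from math import sqrt

# sup_a = sup(A), sup_b = sup(B), sup_ab = sup(A U B), all on the same scale.

def all_confidence(sup_a, sup_b, sup_ab):
    return sup_ab / max(sup_a, sup_b)

def max_confidence(sup_a, sup_b, sup_ab):
    return max(sup_ab / sup_a, sup_ab / sup_b)        # max of P(B|A) and P(A|B)

def kulczynski(sup_a, sup_b, sup_ab):
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)    # average of the two confidences

def cosine(sup_a, sup_b, sup_ab):
    return sup_ab / sqrt(sup_a * sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Data set D1 as described in the text: mc = 10,000 and 1000 transactions each way.
sup_m, sup_c, sup_mc = 11_000, 11_000, 10_000
print(round(kulczynski(sup_m, sup_c, sup_mc), 2))      # about 0.91, as reported
print(round(imbalance_ratio(sup_m, sup_c, sup_mc), 2)) # 0.0 -> perfectly balanced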
U

4.4 Data Mining Classification


Data analysis that uses classification extracts models describing significant
data classes. These models, referred to as classifiers, forecast discrete, unordered
ity

categorical class labels. For instance, we can create a classification model to divide
bank loan applications into safe and dangerous categories. We may gain a better
comprehension of the data as a whole via such study. Researchers in statistics, pattern
recognition, and machine learning have proposed a wide variety of classification
m

techniques. The majority of algorithms are memory-resident and typically assume


a small amount of data. On top of this work, more recent data mining research has
created scalable categorization and prediction methods that can manage massive
)A

volumes of disk-resident data. Numerous industries, including as fraud detection,


target marketing, performance prediction, manufacturing, and medical diagnostics, use
classification.
(c



e
in
nl
O
ty
4.4.1 Classification
To extract models representing significant classes or forecast upcoming data
trends, two types of data analysis can be performed. Here are these two forms:

si
1. Classification
2. Prediction
r
To create a model that represents the various data classes and forecasts future
ve
data trends, classification and prediction are used. With the aid of prediction models,
classification forecasts the category labels of data. We have the clearest knowledge of
the data at a broad scale thanks to this analysis.

Prediction models predict continuous-valued functions, whereas classification


ni

models predict categorical class labels. For instance, based on a person’s income and
line of work, we can create a classification model to classify bank loan applications as
safe or risky, or a prediction model to estimate how much money a potential consumer
U

will spend on computer equipment.


ity
m
)A
(c


What is Classification?

e
A fresh observation is classified by determining its category or class label. As
training data, a set of data is first employed. The algorithm receives a set of input data

in
along with the associated outputs. The input data and their corresponding class labels
are thus included in the training data set. The algorithm creates a classifier or model
using the training dataset. A decision tree, a mathematical formula, or a neural network

nl
can all be used as the resulting model. When given unlabeled data for classification, the
model should be able to identify the class to which it belongs. The test data set is the
fresh information given to the model.

O
The act of classifying a record is known as classification. To classify something as
simple as whether it is raining or not. Either yes or no can be given as a response.
There are so a certain amount of options. There may occasionally be more than two
classes to categorise. Multiclass classification is what that is.

ty
The bank must determine whether or not it is dangerous to lend money to a
specific consumer. A classification model may be created, for instance, that predicts
credit risk based on observable data for various loan borrowers. The information may

si
include employment history, house ownership or rental history, years of residence,
number, kind, and history of deposits, as well as credit scores. The data would
represent a case for each consumer, the goal would be credit ranking, and the
r
predictors would be the other features. In this illustration, a model is built to identify the
ve
categorical label. Both risky and safe labelling are included.

How does Classification Work?


As was indicated earlier, classification operates with the help of a bank loan
ni

application. The creation of the classifier or model and the classification classifier are
the two phases of the data classification system.
U
ity
m
)A

1. Developing the classifier or model creation: This level represents the learning phase or process. In this phase, the classifier is built by the classification algorithm. A training set of database records and their matching class labels is used to build the classifier. Each record making up the training set is described by a category or class; these records may also be referred to as examples, objects, or data points.
2. Applying the classifier for classification: At this level, classification is done using the classifier. Here, the test data are used to gauge how accurate the classification model is. The classification rules may be applied to additional data records if the accuracy is judged to be sufficient. Typical applications include:
◌◌ Sentiment Analysis: Effective social media monitoring makes extensive use of sentiment analysis, which we can use to gain insights from social media content. With cutting-edge machine learning algorithms, we can create sentiment analysis models that can even read and interpret misspelt words, and well-trained models deliver consistently correct results in a small amount of time.
◌◌ Document Classification: We can use document classification to arrange documents into sections based on their content. Text classification applies to documents: the full text of a document can be categorised, and this can be carried out automatically with the aid of machine learning classification algorithms.
◌◌ Image Classification: Images can likewise be grouped into categories based on their visual content, and machine learning classification algorithms allow this to be carried out automatically.
◌◌ Machine Learning Classification: It performs analytical tasks that would take humans hundreds of additional hours to complete, using algorithmic rules that are statistically verifiable.
3. Data Classification Process: There are five steps in the data classification process:
◌◌ Establish the architecture, strategy, and goals for the data classification process.
◌◌ Sort the stored private information.
◌◌ Use data labelling to assign labels.
◌◌ Use the results to increase protection and compliance.
◌◌ Recognise that data is complicated and that classification is a continual process.

What is the Data Classification Lifecycle?

The data classification life cycle provides a useful framework for managing the flow of data into an organisation. Businesses must take compliance and data security into account at each stage. With data classification, we are able to do this at every stage, from creation to deletion. The data life cycle is made up of the following stages:
1. Origin: Sensitive data is produced in a variety of formats, including emails, Word and Excel files, Google documents, social media, and websites.
2. Role-based practice: All sensitive data is tagged with role-based security restrictions based on internal protection policies and agreement guidelines.
3. Storage: The collected data is stored here, protected with encryption and access restrictions.
4. Sharing: Agents, customers, and coworkers receive data continuously across a variety of platforms and devices.
5. Archive: Here, data eventually moves into the organisation's archival storage systems.
6. Publication: Publishing the data allows it to reach customers; dashboards are then available for viewing and downloading.

4.4.2 Prediction
Prediction is another step in the data analysis process. It is employed to discover a numerical result. The training dataset includes the inputs and the matching numerical output values, much like in classification. Using the training dataset, the algorithm creates a model or predictor. When new data is provided, the model should produce a numerical result. In contrast to classification, this approach lacks a class label; the model predicts a continuous-valued function or an ordered value. In most cases, regression is utilised for prediction. One example of prediction is estimating the worth of a home based on information like the number of rooms, total square footage, etc.
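As a rough sketch of this house-price example (the feature values and prices below are invented, and scikit-learn is assumed to be available), a linear regression model can be fitted to known sales and then asked for a numeric estimate on an unseen house:

```python
# A minimal sketch of numeric prediction (regression) for the house-price example.
# Assumes scikit-learn; all values below are made up for illustration.
from sklearn.linear_model import LinearRegression

# Each row: [number_of_rooms, total_square_footage]
X_train = [[2, 800], [3, 1200], [3, 1500], [4, 2000], [5, 2600]]
y_train = [120000, 180000, 210000, 280000, 350000]   # known selling prices

predictor = LinearRegression()
predictor.fit(X_train, y_train)          # learn a continuous-valued function

new_house = [[4, 1800]]                  # unseen record with no price attached
print(predictor.predict(new_house))      # prints the estimated selling price
```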

Consider a scenario where a marketing manager must estimate how much a specific consumer will spend during a sale. In this situation, we are asked to predict a numeric value, so this data-processing task is an example of numerical prediction. Here, a model or predictor will be created that makes predictions about an ordered or continuous-valued function.

Classification and Prediction Issues



Preparing the data for classification and prediction is the main issue. The following tasks are involved in data preparation:

●● Data Cleaning: Data cleaning entails treating missing values and eliminating noise. Smoothing techniques are used to reduce noise, and the issue of missing values is resolved by replacing a missing value with the value that occurs most frequently for that attribute.
●● Relevance Analysis: Unnecessary attributes may also be present in a database. Correlation analysis is employed to determine whether any two given attributes are related.
●● Data Transformation and Reduction: The data can be transformed by any of the following techniques.
◌◌ Normalization: Normalization is a process used to transform the data by scaling all values of a specific attribute so that they fall inside a small, narrow range. Normalisation is performed when neural networks or other approaches involving distance measurements are used in the learning phase (a small sketch follows this list).
◌◌ Generalization: The data can also be transformed by generalising it to a higher conceptual level. Concept hierarchies are useful for this.
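As referenced under Normalization above, here is a minimal, self-contained sketch of min-max normalization; the attribute values are invented for illustration:

```python
# A minimal sketch of min-max normalization: rescale an attribute so that
# all of its values fall inside the narrow range [0, 1].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

incomes = [12000, 35000, 58000, 73000, 98000]   # made-up attribute values
print(min_max_normalize(incomes))
# [0.0, 0.267..., 0.534..., 0.709..., 1.0]
```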

Comparison of Classification and Prediction Methods


Here are the criteria for comparing the methods of classification and prediction:
●● Accuracy: The accuracy of a classifier refers to how correctly it predicts the class label, while predictor accuracy describes how well a given predictor can estimate the value of the predicted attribute for fresh data (a small sketch follows this list).
●● Speed: This is the computational cost incurred to build and to use the classifier or predictor.
●● Robustness: It describes a classifier's or predictor's capacity to make accurate predictions from noisy data.
●● Scalability: Scalability is the capacity to build a classifier or predictor efficiently in the presence of a vast amount of data.
●● Interpretability: It refers to the level of understanding provided by the classifier or predictor.
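As referenced under Accuracy above, a minimal sketch of measuring classifier accuracy on held-back test data (scikit-learn assumed; the tiny dataset is invented):

```python
# A minimal sketch of measuring classifier accuracy on a held-back test set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[i, i % 3] for i in range(30)]          # toy feature rows
y = [0 if i < 15 else 1 for i in range(30)]  # toy class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```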
4.5 Data Mining Analysis
Data science is a new discipline that incorporates automated ways to examine patterns and models for all types of data, with applications ranging from scientific discovery to business intelligence and analytics. Its fundamental algorithms in data mining and analysis serve as its foundation.

4.5.1 Cluster Analysis


The division of a set of data items (or observations) into subsets is known as cluster analysis or simply clustering. Every subset is a cluster, and each cluster contains items that are similar to one another but different from those in other clusters. A clustering is the collection of clusters that emerges from a cluster analysis; different clustering techniques may produce different clusterings on the same data set. The clustering algorithm, not people, performs the partitioning. Therefore, clustering is advantageous in that it can result in the identification of previously unidentified groupings in the data.

Numerous applications, including business intelligence, visual pattern recognition, web search, biology, and security, have extensively exploited cluster analysis. In business intelligence, clustering can be used to organise a large number of customers into groups, where the customers within each group have a great deal in common. This makes it easier to create business plans for better customer relationship management. Additionally, consider a consulting firm with many active projects. Clustering can divide the projects into groups based on similarity to enhance project management, allowing for efficient project auditing and diagnosis (to improve project delivery and outcomes).


In handwritten character recognition systems, clustering can be used to find clusters or “subclasses” in images. Let's say we have a collection of handwritten numerals with the labels 1, 2, 3, and so on. Be aware that there can be significant differences in how people write the same digit. Consider the number 2, for instance. Some people might write it with a tiny circle at the bottom left corner, while others might not. Clustering can be used to identify subclasses of “2”, each of which stands for a different way that 2 can be written. The accuracy of overall recognition can be increased by combining several models based on the subclasses.

Web search has benefited greatly from clustering. Due to the extraordinarily high number of online pages, for instance, a keyword search may frequently produce a very large number of hits (i.e., pages relevant to the search). Clustering can be used to arrange the search results into categories and present them in a clear, understandable manner. Additionally, strategies for grouping documents into subjects, which are frequently employed in information retrieval practice, have been developed.

Cluster analysis is a data mining function that may be used independently to study the properties of each cluster, gain insight into the distribution of the data, and concentrate on a specific collection of clusters for further study. Alternatively, it might act as a preprocessing stage for other algorithms, such as characterization, attribute subset selection, and classification, which would later work with the identified clusters and the chosen attributes or features.
A cluster of data objects can be regarded as an implicit class, since a cluster is a group of data objects that are similar to one another within the cluster and distinct from the objects in other clusters. This is why clustering is occasionally referred to as automatic classification. Another important point is that clustering can identify the groupings automatically; cluster analysis has the advantage of being able to do this.

Because clustering divides enormous data sets into groups based on their resemblance, it is also known as data segmentation in some applications. In circumstances where outliers (values that are “far away” from any cluster) are more interesting than common cases, clustering can also be employed to find them. The use of outlier detection in electronic commerce and the identification of credit card fraud are two examples of its applications. For instance, unusual credit card transactions, such as very expensive and infrequent purchases, may be of interest as potential fraud schemes.

Data clustering is actively being developed. Data mining, statistics, machine learning, geographic database technology, information retrieval, Web search, biology, marketing, and many more application areas are among the research fields that have contributed. Cluster analysis has lately emerged as a very active area of research in data mining due to the enormous volumes of data produced in databases.

Cluster analysis has received a great deal of attention as a statistical subfield, with a primary focus on distance-based cluster analysis. Many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS, also include cluster analysis tools based on k-means, k-medoids, and various other techniques. In machine learning, classification is known as supervised learning because the learning algorithm is supervised in the sense that it is informed of the class membership of each training tuple, i.e., the class label information is provided.


Because clustering lacks the class label information, it is referred to as unsupervised learning. Because of this, clustering is an example of learning through observation as opposed to learning through examples. Finding techniques for efficient and effective cluster analysis in huge databases has been the focus of data mining. Currently active research areas include the efficacy of methods for clustering complex shapes (such as nonconvex shapes) and types of data (such as text, graphs, and images), high-dimensional clustering techniques (such as clustering objects with thousands of features), and methods for clustering mixed numerical and nominal data in large databases.

Properties of Clustering:

●● Scalability: A huge database may contain millions or even billions of objects, especially in Web search applications, yet many clustering methods only perform effectively on small data sets with fewer than a few hundred data objects. Clustering on only a sample of a given huge data collection could produce skewed findings. As a result, we want highly scalable clustering techniques.
●● Ability to deal with different types of attributes: Algorithms for clustering numerical (interval-based) data are numerous. However, applications may call for clustering binary, nominal (categorical), ordinal, or combinations of these data types. In recent years, complex data types like graphs, sequences, images, and documents have increased the requirement for clustering algorithms in a growing number of applications.
●● Discovery of clusters with arbitrary shape: Numerous clustering methods identify clusters using Manhattan or Euclidean distance metrics. Such distance estimates tend to lead to the discovery of spherical clusters that are comparable in size and density. A cluster, however, could take on any shape. Take sensors, for instance, which are frequently used to monitor the surroundings. Sensor readings can be subjected to cluster analysis to find interesting patterns. To locate the border of a running forest fire, which is frequently not spherical, we would want to apply clustering. It is crucial to create algorithms that can find clusters of any shape.
●● Requirements for domain knowledge to determine input parameters: Users are often required to supply domain knowledge in the form of input parameters for clustering algorithms, such as the desired number of clusters. As a result, the clustering results might be affected by these settings. Parameters are frequently difficult to determine, particularly for high-dimensional data sets and when users have not fully grasped their data. In addition to burdening users, requiring the specification of domain knowledge makes it challenging to control the clustering quality.
●● Ability to deal with noisy data: Outliers and/or missing, unknown, or incorrect data are common components of real-world data sets. For instance, sensor readings are frequently noisy; some readings may be inaccurate because of the sensing mechanisms, and some may be erroneous because of interference from nearby transient objects. Because of this noise, clustering algorithms may create poor-quality clusters. We therefore require noise-resistant clustering techniques.
●● Incremental clustering and insensitivity to input order: In many applications, incremental updates (representing new data) may arrive at any time. Some clustering algorithms cannot incorporate incremental updates into existing clustering structures and must recompute a new clustering from scratch. Clustering algorithms may also be sensitive to the order of the incoming data; in other words, depending on the order in which the objects are presented, clustering algorithms may produce drastically different clusterings of the same collection of data objects. Incremental clustering algorithms and algorithms that are insensitive to the input order are required.
●● Capability of clustering high-dimensional data: A data set may contain many dimensions or attributes. When clustering documents, for instance, each keyword can be viewed as a dimension, and there are frequently thousands of keywords. The majority of clustering algorithms are adept at working with low-dimensional data, such as sets with only two or three dimensions. Given that high-dimensional data can be extremely sparse and highly skewed, it can be difficult to locate groups of data objects in a high-dimensional space.
●● Constraint-based clustering: Real-world applications may require clustering under a variety of constraints. Imagine you are responsible for selecting the locations of a certain number of new ATMs to be installed in a city. To decide on this, you can cluster households while taking into account constraints such as the city's waterways and transportation networks, as well as the kinds and numbers of customers per cluster. Finding data groups that satisfy the specified constraints and exhibit good clustering behaviour is a difficult undertaking.
●● Interpretability and usability: Users want clustering results that are interpretable, comprehensible, and usable. In other words, clustering may need to be tied to specific semantic interpretations and applications. The impact of an application goal on the choice of clustering features and clustering techniques must be studied.
The following are orthogonal aspects with which clustering methods can be compared:
U

●● The partitioning criteria: Some techniques partition all the objects so that there is no hierarchy between the clusters; in other words, conceptually, all the clusters are on the same level. Such a technique is helpful, for instance, when dividing customers into groups and assigning a manager to each group. As an alternative, some techniques partition data objects hierarchically, allowing clusters to be formed at various semantic levels. In text mining, for example, a corpus of documents may be organised into basic subjects such as “politics” and “sports,” each of which may have subtopics. “Sports,” for instance, can include subtopics like “football,” “basketball,” “baseball,” and “hockey”; these four topics sit below “sports” in the hierarchy.

●● Separation of clusters: Some techniques group data objects into clusters that are mutually exclusive. When customers are grouped so that each group is managed by a different manager, each customer may belong to only one group. In other cases, the clusters may not be mutually exclusive, meaning that a data object may be a member of more than one cluster. For instance, when grouping documents into topics, a document may relate to many topics, so the clusters of topics may not be mutually exclusive.


●● Similarity measure: Some techniques determine the similarity between two objects by the distance between them. This distance can be defined in any space, including Euclidean space, a road network, a vector space, and others. In other approaches, similarity may be defined by connectivity based on density or contiguity rather than by the exact distance between two objects. Similarity measures play a fundamental role in the design of clustering techniques. Distance-based methods can often take advantage of optimization techniques, whereas density- and continuity-based approaches can often locate clusters of arbitrary shape.
●● Clustering space: Many clustering techniques look for clusters throughout the entire given data space. These techniques work well with low-dimensional data sets. However, high-dimensional data may contain many irrelevant attributes, making similarity measurements unreliable, so clusters discovered in the full space are frequently meaningless. Instead, it is often preferable to look for clusters within different subspaces of the same data set. Subspace clustering identifies both the clusters and the subspaces (typically of low dimensionality) in which the objects are similar.
Finally, clustering algorithms have a number of requirements, including scalability, the capacity to handle various attribute types, noisy data, incremental updates, clusters of arbitrary shape, and constraints. Usability and interpretability are also crucial. Further aspects that may vary among clustering techniques are the level of partitioning, whether or not clusters are mutually exclusive, the similarity measures applied, and whether or not subspace clustering is carried out.
further factors that might vary amongst clustering techniques.

Clustering Methods:
The clustering methods can be classified into the following categories:
●● Partitioning Method
●● Hierarchical Method
●● Density-based Method
●● Grid-Based Method
●● Model-Based Method
●● Constraint-based Method

Partitioning Method:
It is used to partition the data in order to create clusters. Each partition is represented by a cluster, and n < p when “n” partitions are made on “p” database objects. For this partitioning clustering method to work, two requirements must be met:
◌◌ Each object must belong to exactly one group.
◌◌ No group should exist without at least one object.
Iterative relocation is a technique used in the partitioning method that involves moving an object from one group to another in order to optimise the partitioning.
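The classic k-means algorithm is one widely used partitioning (iterative relocation) method. A minimal sketch using scikit-learn on a tiny, invented two-dimensional data set:

```python
# A minimal sketch of a partitioning method: k-means with k = 2 clusters.
# Assumes scikit-learn; the 2-D points below are invented for illustration.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],      # one group of nearby points
          [8, 8], [9, 10], [8, 9]]     # a second, well-separated group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                  # cluster index assigned to each point
print(kmeans.cluster_centers_)         # the two cluster centroids
```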

Hierarchical Method:
This method creates a hierarchical decomposition of the given data objects. Hierarchical approaches can be categorised on the basis of how the hierarchical decomposition is formed. There are two different ways of creating the hierarchical decomposition, as follows:

in
◌◌ Agglomerative Approach: The agglomerative strategy is also known as the bottom-up approach. It initially places the given data objects in separate groups and then keeps merging groups of objects that are close to one another and share characteristics. This merging process continues until the termination condition is met.
◌◌ Divisive Approach: The divisive strategy is also known as the top-down approach. In this method, we begin with all the data objects in a single cluster. By continuous iteration, the cluster is split into smaller clusters. The iteration continues until the termination condition is met or each cluster contains a single object.

This is a rigid procedure and is not very flexible: once a group is divided or merged, the step cannot be undone. The two methods that can be utilised to improve the quality of hierarchical clustering in data mining are as follows:

◌◌ At each stage of hierarchical clustering, the links between objects should be examined thoroughly.
◌◌ A hierarchical agglomerative algorithm can be used together with iterative relocation. In this method, the objects are first placed in small groups: data objects are grouped into microclusters, and macro clustering is then performed on the microclusters.
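A minimal sketch of the agglomerative (bottom-up) approach, using scikit-learn's AgglomerativeClustering on a tiny, invented data set:

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering.
# Assumes scikit-learn; the points below are invented for illustration.
from sklearn.cluster import AgglomerativeClustering

points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

# Start from singleton clusters and repeatedly merge the closest groups
# until only two clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(points)
print(agg.labels_)     # e.g. [0 0 0 1 1 1] (label numbering may differ)
```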

Density-Based Method:

●● The majority of object clustering techniques use the distance between objects to group them together. Such algorithms struggle to find clusters of arbitrary shape and can only locate spherical-shaped clusters.
●● Other clustering techniques have been developed on the basis of the notion of density. Their basic premise is that a given cluster should keep growing as long as the density in its neighbourhood exceeds a certain threshold; in other words, each data point within a given cluster must have at least a certain number of neighbours within a given radius. A technique like this can be used to remove noise (outliers) and find clusters of arbitrary shape.
●● The typical density-based approaches DBSCAN and its extension OPTICS build clusters according to a density-based connectivity analysis. The DENCLUE approach clusters objects by analysing the value distributions of density functions.
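A minimal sketch of a density-based method using scikit-learn's DBSCAN; the eps and min_samples values and the points are invented for illustration:

```python
# A minimal sketch of density-based clustering with DBSCAN.
# Points that do not reach the density threshold are labelled -1 (noise).
from sklearn.cluster import DBSCAN

points = [[1, 2], [2, 2], [2, 3],          # a dense region
          [8, 7], [8, 8], [7, 8],          # another dense region
          [25, 80]]                        # an isolated point (noise/outlier)

db = DBSCAN(eps=2, min_samples=2).fit(points)
print(db.labels_)      # e.g. [0 0 0 1 1 1 -1]
```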

Grid-Based Method:
●● Grid-based approaches divide the object space into a fixed number of grid-like
cells.
●● All clustering operations are carried out on the grid structure, i.e., on the quantized space. The key benefit of this strategy is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.


●● A common illustration of a grid-based approach is STING. WaveCluster uses both grid-based and density-based approaches, applying a wavelet transformation for clustering analysis.

Model-Based Method:

in
To find the data that is most suitable for the model, all of the clusters are
hypothesised in the model-based procedure. To find the clusters for a specific model,
use the clustering of the density function. In addition to reflecting the spatial distribution

nl
of data points, it offers a method for automatically calculating the number of clusters
using conventional statistics while accounting for outlier or noise. As a result, it
produces reliable clustering techniques.

O
Constraint-Based Method:
Application- or user-oriented restrictions are used in the constraint-based clustering

ty
approach. The user expectation or the characteristics of the expected clustering results
are examples of constraints. With the use of constraints, we may engage with the
clustering process. Constraints might be specified by the user or an application need.

si
Tasks in Data Mining:
●● Clustering High-Dimensional Data:
o r
Because many applications call for the study of objects with a lot of attributes
or dimensions, it is a task that is especially crucial in cluster analysis.
ve
o For instance, written documents may contain tens of thousands of terms
or keywords, and DNA microarray data may reveal details on the levels
of expression of tens of thousands of genes under hundreds of different
ni

circumstances.
o The curse of dimensionality makes it difficult to cluster high-dimensional data.
o Some dimensions might not be important. The data becomes increasingly
U

sparse as the number of dimensions rises, making it meaningless to quantify


the distance between two points and making it likely that the average density
of points throughout the data would be low. As a result, a new clustering
technique must be created for high-dimensional data.
ity

o Influential subspace clustering techniques like CLIQUE and PROCLUS look


for clusters within certain data subspaces rather than throughout the full data
space.
m

o Another clustering technique called frequent pattern-based clustering captures


distinctive frequent patterns from often occurring subsets of dimensions.
It makes advantage of these patterns to organise items into meaningful
)A

groupings.
●● Constraint-Based Clustering:
o It is a clustering method that employs application- or user-specified
constraints to produce clustering.
(c

o A constraint is an excellent tool for expressing to the clustering process the


user’s expectations or describing the characteristics of the desired clustering
results.

Amity Directorate of Distance & Online Education


214 Data Warehousing and Mining

o Different types of limitations can be provided, either by the user or in


Notes

e
accordance with the needs of the programme.
o With impediments present and clustering under user-specified limitations,

in
spatial clustering is used. Additionally, pairwise restrictions are used in semi-
supervised clustering to enhance the quality of the final grouping.

Hierarchical Clustering Methods:

nl
Data objects are organised into groups in a tree structure using the hierarchical
clustering approach. A pure hierarchical clustering method’s performance is hindered

O
by its inability to modify once a merge or split decision has been made. This means that
the approach cannot go back and change a particular merge or split decision if it later
turns out to have been a bad one. Depending on whether the hierarchical breakdown
is generated top-down or bottom-up, hierarchical clustering approaches can be further

ty
categorised as either agglomerative or divisive.

Constraint-Based Cluster Analysis:

si
Finding clusters that adhere to user-specified preferences or requirements is
known as constraint-based clustering. Constraint-based clustering may use a variety of
strategies depending on the type of constraints. There are various types of restrictions.

●● r
Constraints on individual objects: On the objects to be clustered, restrictions
ve
can be specified. For instance, in a real estate application, one would want to
geographically cluster just those expensive homes that cost more than a million
dollars. The set of objects that can be clustered is limited by this constraint.
Preprocessing makes it simple to manage, and the issue is thus reduced to a case
ni

of unrestricted clustering.
●● Constraints on the selection of clustering parameters: Each clustering parameter
can have a desired range chosen by the user. Typically, clustering parameters
U

are quite particular to the chosen clustering algorithm. Examples of parameters


include e, the radius, and the minimum number of points in the DBSCAN method,
or k, the required number of clusters in a k-means algorithm. These user-specified
parameters typically only affect the algorithm itself, even though they may have a
ity

significant impact on the clustering results. As a result, their processing and fine
tuning are typically not regarded as a type of constraint-based clustering.
●● Constraints on distance or similarity functions: For particular features of the items
to be clustered, we can provide various distance or similarity functions, as well as
m

various distance measurements for particular pairs of objects. We might utilise


various weighting algorithms for height, body weight, age, and skill level when
clustering athletes, for instance. This may not directly affect the clustering process,
)A

but it will probably influence the mining results. However, in some circumstances,
particularly when it is closely related to the clustering procedure, such alterations
may make the assessment of the distance function nontrivial.
●● User-specified constraints on the properties of individual clusters: The user’s
preference for the intended properties of the generated clusters may have a
(c

significant impact on the clustering process.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 215

●● Semi-supervised clustering based on partial supervision: Some mild kind of


Notes

e
supervision can be used to significantly enhance the quality of unsupervised
clustering. Pair-wise constraints may be used to accomplish this (i.e., pairs of
objects labelled as belonging to the same or different cluster). Semi-supervised

in
clustering is the name given to such a limited clustering procedure.

Applications of Cluster Analysis:

nl
●● Market research, pattern recognition, data analysis, and image processing are just
a few of the many fields in which cluster analysis has been successfully applied.
●● Clustering in business can assist marketers in identifying unique consumer groups

O
and describing customer groupings based on purchase trends.
●● In biology, it can be used to build taxonomies for plants and animals, classify
genes with similar functions, and learn more about population structures.

ty
●● The identification of areas of similar land use in an earth observation database,
the grouping of homes in a city based on house type, value, and location, and the
identification of car insurance policyholders with high average claim costs can all

si
benefit from clustering.
●● Because clustering divides enormous data sets into groups based on their
resemblance, it is also known as data segmentation in some applications.
●●
r
Outlier detection applications include the identification of credit card fraud and
ve
the monitoring of criminal activity in electronic commerce. Clustering can also be
utilised for outlier detection.

4.5.2 Outlier Analysis


ni

Imagine working for a credit card corporation as a transaction auditor. You pay
close attention to card usages that differ significantly from ordinary occurrences in
U

order to safeguard your consumers from credit card fraud. For instance, a purchase
is suspicious if it is significantly larger than usual for the cardholder and takes place
outside of the city where they normally reside. As soon as such transactions take place,
you should identify them and get in touch with the cardholder for confirmation. In a lot of
ity

credit card firms, this is standard procedure. What methods of data mining can be used
to identify suspicious transactions?

Credit card transactions are typically routine. However, when a credit card is stolen,
the typical transaction pattern is drastically altered; the places and things bought are
m

frequently considerably different from those of the legitimate card owner and other
customers. Detecting transactions that are noticeably out of the ordinary is a key
component of credit card fraud detection.
)A

Finding data items with behaviour that deviates significantly from expectations is
the process of outlier detection, sometimes referred to as anomaly detection. These
things are referred to as anomalies or outliers. In addition to fraud detection, outlier
identification is crucial for a variety of other applications, including image processing,
sensor/video network monitoring, medical care, public safety, and security, industry
(c

damage detection, and intrusion detection.

Amity Directorate of Distance & Online Education


216 Data Warehousing and Mining

Clustering analysis and outlier detection are two tasks that are closely related.
Notes

e
In contrast to outlier identification, which seeks to identify unusual instances that
significantly vary from the majority patterns, clustering identifies the predominant
patterns in a data set and organises the data accordingly. Clustering analysis and

in
outlier identification have various uses.

What are outliers?

nl
When discussing data analysis, the word “outliers” frequently comes to mind.
Outliers are data points that are not consistent with expectations, as the name implies.
What you do with the outliers is the most important aspect of them. You will always

O
make certain assumptions depending on how the data was produced if you are going
to evaluate any task to analyse data sets. These data points are outliers if you discover
them, and depending on the situation, you may wish to correct them if they are likely
to contain some sort of inaccuracy. The analysis and prediction of data that the data

ty
contains are two steps in the data mining process. Grubbs provided the initial definition
of outliers in 1969.

An outlier can neither be considered noise nor an error. Instead, it is thought that

si
they weren’t created using the same process as the other data objects.

There are three categories of outliers:

◌◌ r
Global (or Point) Outliers
ve
◌◌ Collective Outliers
◌◌ Contextual (or Conditional) Outliers
ni
U
ity

Global Outliers
A data object in a particular data collection is considered a global outlier if
it considerably differs from the other data objects in the set. The most basic kind of
outliers are global outliers, also known as point anomalies. Finding global outliers is the
m

main objective of most outlier detection techniques.

Finding an adequate measurement of deviation with respect to the application


in question is a crucial step in the detection of global outliers. Different metrics are
)A

suggested, and outlier identification techniques are divided into various groups as a
result. Later, we shall discuss this topic in further detail.

The ability to recognise global outliers is crucial in many applications. Take, for
instance, intrusion detection in computer networks. When a computer’s communication
(c

behaviour deviates significantly from typical patterns (for example, when a huge
number of packages are broadcast in a brief period of time), this behaviour may be
regarded as a global outlier, and the accompanying machine is presumed to have
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 217

been hacked. As another illustration, transactions that do not adhere to the rules are
Notes

e
regarded as global outliers and should be stored for additional analysis in trading
transaction auditing systems.

in
For instance, in an intrusion detection system, if several packages are broadcast
in a short period of time, this may be seen as a global outlier and we can infer that the
system in question has possibly been compromised.

nl
O
ty
Figure: The red data point is a global outlier
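A minimal sketch of flagging such a global (point) outlier with a simple z-score rule; the transaction amounts and the 2.5 threshold are invented for illustration:

```python
# A minimal sketch of global (point) outlier detection using z-scores:
# flag any value lying more than 2.5 standard deviations from the mean.
import statistics

amounts = [42, 55, 48, 60, 51, 47, 53, 49, 950]   # one suspicious transaction
mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)

outliers = [a for a in amounts if abs(a - mean) / std > 2.5]
print(outliers)                                    # [950]
```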

si
Collective Outliers
Let’s say you handle the supply chain for All Electronics. Each day, you manage
tens of thousands of shipments and orders. Given that delays are statistically common,
r
if an order’s shipping is delayed, it may not be viewed as an oddity. However, if 100
ve
orders are delayed in a single day, you need to pay attention. Even if each of those
100 orders might not be viewed as an outlier if taken individually, taken as a total, they
constitute an anomaly. To comprehend the shipment issue, you might need to take a
thorough look at those orders taken as a whole.
ni

If a subset of data objects inside a given data collection considerably deviates from
the total data set, the subset becomes a collective outlier. The individual data objects
might not all be outliers, which is significant.
U

Because their density is significantly greater than that of the other objects in the
data set, the black objects in Figure (The black objects form a collective outlier) as a
whole constitute a collective outlier. However, when compared to the entire dataset, no
one black object is an outlier.
ity
m
)A

Figure:The black objects form a collective outlier

Numerous crucial applications exist for collective outlier detection. For instance, a
denial-of-service package sent from one computer to another is not at all unusual and
(c

is regarded standard in intrusion detection. The group of computers should be viewed


as an outlier if they consistently send each other denial-of-service packets. It’s possible
to suspect that an attack has compromised the involved machines. Another illustration
Amity Directorate of Distance & Online Education
218 Data Warehousing and Mining

of a typical transaction is a stock exchange between two parties. A significant number


Notes

e
of identical stock transactions within a small group in a brief period, however, are
collective outliers because they could be proof of market manipulation.

in
In contrast to global or contextual outlier detection, collective outlier detection
requires that we take into account both the behaviour of groups of objects as well
as that of individual objects. As a result, prior knowledge of the relationships among

nl
data items, such as assessments of object distance or similarity, is required in order to
identify collective outliers.

In conclusion, several kinds of outliers can exist in a data collection. Additionally, an

O
object may fall under more than one category of outlier. Different outliers can be used
in business for a variety of applications or goals. The easiest type of outlier detection
is global. Contextual attributes and contexts must be determined in order to perform
context outlier identification. In order to describe the link between items and discover

ty
groups of outliers, collective outlier identification requires background knowledge.

Contextual Outliers

si
“Today’s temperature is 28 C. Is it extraordinary (an outlier)? For instance, it
depends on the moment and place! Yes, it is an anomaly if it happens to be winter in
Toronto. It is typical if it is a summer day in Toronto. Contrary to global outlier detection,
r
in this instance, the context—the date, the location, and maybe other factors—
ve
determines whether or not the current temperature number is an outlier.

A data object in a given data set is considered a contextual outlier if it significantly


differs from one of the object’s contexts. Because they depend on the chosen
environment, contextual outliers are often referred to as conditional outliers. As a result,
ni

the context must be stated as part of the problem formulation in order to do contextual
outlier detection. The properties of the data objects in question are often separated into
two sections for contextual outlier detection:
U

●● Contextual attributes: The context of a data object is determined by its contextual


properties. Contextual attributes in the temperature example could include time
and place.
ity

●● Behavioral attributes: In determining if the object is an outlier in the context to


which it belongs, certain properties of the object are defined. The temperature,
humidity, and pressure in the temperature example might all be considered
behavioural traits.
m

Contrary to global outlier detection, contextual outlier detection considers both


behavioural and contextual variables when determining whether a data object is an
outlier. In one context—for example, a Toronto winter—a configuration of behavioural
attribute values may be seen as an outlier, but not in another—for instance, a Toronto
)A

summer—in which case 28°C is not an outlier.

Local outliers, a concept presented in density-based outlier analysis


methodologies, are a generalisation of contextual outliers. A local outlier is an object in
a data set whose density considerably differs from the neighbourhood where it is found.
(c

When the set of contextual attributes is empty, global outlier identification can
be thought of as a specific instance of contextual outlier detection. In other words,

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 219

the entire data set serves as the background for global outlier detection. Users have
Notes

e
flexibility thanks to contextual outlier analysis since they may study outliers in various
situations, which is highly desirable in many applications.

in
An analyst may take into account outliers in several situations in addition to global
outliers while detecting credit card fraud, for instance. Think about clients who exceed
90% of their credit limit. Such behaviour might not be regarded as an outlier if one

nl
of these consumers is perceived to be a member of a group of clients with low credit
limitations. However, if a high-income customer consistently exceeds their credit limit,
that customer’s same conduct can be viewed as an exception. Raising credit limits for
such customers can generate additional money. Such outliers may present economic

O
opportunities.

The usefulness of the contextual attributes, along with the assessment of an


object’s departure from the norm in the space of behavioural attributes, determine the

ty
accuracy of contextual outlier detection in an application. Experts in the relevant fields
should, more often than not, decide on the contextual qualities; this can be viewed as
input background knowledge. In many applications, it is difficult to gather high-quality

si
contextual attribute data or gain enough information to infer contextual attributes.

How can we create contexts that are meaningful for contextual outlier detection?
The contextual characteristics’ group-bys are used as contexts in a simple approach.
r
But many group-bys might have insufficient data or noise, so this might not work. The
ve
proximity of data objects in the space of contextual attributes is used in a more generic
approach.

Outliers Analysis
ni

When data mining is used, outliers are frequently deleted. However, it is still utilised
in numerous applications, including medical and fraud detection. It usually happens
because uncommon events have a lot greater capacity to store important information
U

than more frequent happenings.

The following list includes more uses for outlier detection.

Outlier analysis in data mining can be used to examine any uncommon response
ity

that develops as a result of medical therapy.

●● Fraud detection in the telecom industry


●● In market analysis, outlier analysis enables marketers to identify the customer’s
behaviors.
m

●● In the Medical analysis field.


●● Fraud detection in banking and finance such as credit cards, insurance sector, etc.
)A

Outlier analysis is the process of identifying the behaviour of the outliers in


a dataset. The procedure, commonly referred to as “outlier mining,” is a crucial data
mining work.
(c

Amity Directorate of Distance & Online Education


220 Data Warehousing and Mining

Case Study
Notes

e
Data analysis Companies employ data mining as a method to transform
unstructured data into information that is useful. Businesses may learn more about

in
their customers, create more successful marketing campaigns, boost sales, and cut
expenses by employing software to find patterns in massive volumes of data. Effective
data collection, warehousing, and computer processing are all necessary for data

nl
mining. examine a customer’s psychological perspective, translate it into statistical
form, and determine whether there is any technical format that allows us to assess his
purchasing behaviour.

O
Breaking down ‘data mining in marketing’
The well-known application of data mining techniques in grocery stores.
Customers can obtain discounted pricing not available to non-members by using the

ty
free loyalty cards that are frequently given out by supermarkets. The cards make it
simple for retailers to keep tabs on who is purchasing what, when, and for what price.
After evaluating the data, the retailers can use it for a variety of things, like providing
consumers with coupons based on their purchasing patterns and choosing when to

si
sell items on sale or at full price. When only a small portion of information that is not
representative of the entire sample group is utilised to support a certain premise, data
mining can be problematic.
r
ve
Power of hidden information in data:
Every day, businesses produce terabytes of data that are kept in databases, data
warehouses, or other types of data repositories. The majority of useful information
might be concealed within such data; yet, the enormous amount of data makes it
ni

difficult for humans to access them without the aid of strong tools and approaches.
Information was only available on papers and only at certain times at the beginning of
the previous decade. Information is now readily available because content producers,
U

content locators, and strong search engines have made it possible to quickly access
vast amounts of information.

The same information can now be accessed through natural language processing
ity

in a language that is comfortable for the user. We have arrived at a period where
everything happens in a matter of clicks, unlike earlier times when people had to
wait in lengthy lines to pay bills, taxes, or purchase cinema tickets and other forms of
entertainment. The firms that are heavily reliant on consumer behaviour and basket
trends have been greatly impacted by all these altering elements and trends. To be
m

profitable, any organisation needs to be extremely scalable and able to anticipate client
behaviour in the future.

The Market Analysis was prompted by these very needs. Earlier methods for
)A

projecting a product’s future included surveys, questioners, test marketing, historical


sales data, and leading indications. Although all of these methods were successful, they
were quite labor-intensive and time-consuming. The situation has been fundamentally
altered by data mining techniques. Information that used to require miles of travel is
now accessible in a matter of seconds.
(c

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 221

Data Mining Tasks


Notes

e
Classification:

in
Finding a model that adequately describes the various data classes or concepts
is the process of classification. The goal is to be able to forecast the class of objects
whose class label is unknown using this model. This model was developed by the
examination of training data sets.

nl
Eg: 1.putting voters into groups that are well-known to political parties.

2. Adding new clients to a group of established ones.

O
Regression:
Regression analysis is a statistical method for determining the relationships

ty
between variables in statistical modelling. When the emphasis is on the link between
a dependent variable and one or more independent variables, it encompasses
numerous approaches for modelling and evaluating multiple variables. Imagining the
unemployment rate for the following year. estimating the cost of insurance.

si
Finding anomalies: Finding anomalies, also known as detecting outliers, entails
locating data points, events, or observations that deviate from a dataset’s norm.
r
Example: Credit card fraud transaction detection. The discovery of objects, events, or
observations that deviate from an expected pattern or other objects in a collection is
ve
known as anomaly detection (also known as outlier detection). For instance, credit card
fraud detection.

Time series
ni

A time series is a collection of data points that have been enumerated, graphed,
or otherwise organised chronologically. A time series is most frequently a sequence
captured at a series of equally spaced moments in time. As a result, it is a collection of
U

discrete-time data.

Eg: Forecasting for sales, manufacturing, or essentially any growth event that
requires extrapolation
ity

Clustering
The task of clustering involves organising a collection of objects into groups so that
objects within a cluster are more similar (in some manner or another) to one another
m

than to those in other groupings (clusters).

for instance, identifying consumer segments in a business using data from


transactions, websites, and customer calls.
)A

Association analysis
Data mining function called association determines the likelihood of elements in a
collection occurring together. Association rules are used to express the links between
(c

co-occurring items.

Amity Directorate of Distance & Online Education


222 Data Warehousing and Mining

For instance, research cross-selling prospects for a retailer using transactional


Notes

e
purchase data.

Investigating and Growing Your Business A business technique known as “data

in
mining” is used to examine vast amounts of data in order to find significant patterns and
rules. Data mining is a tool that businesses can use to enhance their operations and
outperform rivals. The following figure lists the most significant business sectors where

nl
data mining is successfully used.

O
ty
r si
ve
ni

A kind of marketing that puts the customer first is digital marketing. Digital
information is not only simpler to integrate, organise, and disseminate, but it also
speeds up interactions between service providers and customers. In the past,
U

marketing analysis and efficacy typically took a very long time. Today, marketing
promotion can have a higher synergistic effect thanks to digital marketing.

Because of the accelerating changes in the business environment, technological


ity

advancements, and digital transmission, firms should modify their marketing strategies
quickly. The market should switch from the Red Sea Strategy to the Blue Ocean
Strategy in a similar manner. Market space keeps growing as the environment changes,
but the market environment also gets more competitive. The development of Internet
m

marketing, including sale and purchase through the Website, keyword marketing, blog
marketing, and so forth, along with the development of wireless networks, will bring
the world closer together from traditional store sales, telephone marketing, and face-
to-face marketing. It gives rise to the impression that developing digital marketing
)A

communications is crucial. The company is expanding its options for client engagement
and connection because it is no longer constrained by conventional factors like time
and geography.

Apriori Algorithm
(c

Agrawal and Srikant introduced the Apriori algorithm in 1994 [3]. The algorithm
used to learn association rules is a well-known one. Apriori is made to work with

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 223

databases that contain transactions (for example, collections of items bought by


Notes

e
customers, or details of commerce website frequentation). Given a set of itemsets
(for example, sets of retail transactions, each showing specific items purchased), the
method seeks to identify subsets that are shared by at least a portion of the itemsets,

in
as is typical in association rule mining. As part of its “bottom up” methodology,
Apriori extends frequent subsets one item at a time (a process known as candidate
generation), and then compares groups of candidates to the data. When no more

nl
successful extensions are detected, the algorithm ends.

The Apriori Algorithm is used to identify relationships between items in data sets; this analysis is also known as “Market Basket Analysis.” Each set, which contains several items, is referred to as a transaction. Apriori produces rules that specify how frequently items appear together in collections of data.
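A minimal, library-free sketch of the core Apriori idea, counting itemset support over transactions and deriving the confidence of a rule; the transactions below are invented purely for illustration:

```python
# A minimal sketch of the Apriori idea: count itemset support over a set of
# market-basket transactions and report the confidence of a candidate rule.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 2-itemsets (support >= 0.4), built "bottom up" from single items
items = sorted({i for t in transactions for i in t})
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= 0.4]
print("frequent pairs:", frequent_pairs)

# Rule {bread} -> {milk}: confidence = support({bread, milk}) / support({bread})
conf = support({"bread", "milk"}) / support({"bread"})
print("confidence of bread -> milk:", round(conf, 2))
```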

ty
Predicting the Customer Behaviour
In enterprise company, predicting client behaviour is the most crucial action. All of
the aforementioned techniques provide businesses with a wealth of insightful data. The

si
scenarios that have used the aforementioned techniques are shown in the next section.

Case study 1: Application of Association Rule mining in Recommender systems

r
These days, recommender systems are widely used across many industries.
Movies, music, literature, academic articles, search terms, social media tags, etc. are
ve
just a few examples. In order to forecast client behaviour, these systems combine
the concepts of intelligent systems, machine learning, and information retrieval. In
recommender systems, there are two methods: collaborative filtering and content-based
filtering.
ni

Collaborative filtering techniques gather and examine a substantial quantity of data


on users’ actions, interests, or preferences in order to forecast what users would find
appealing based on their shared characteristics with other users. Utilizing the Apriori
U

algorithm is one strategy.

In this case study, association rules are extracted from user profiles using the
Apriori method. The PVT system is presented as an example. A recommender
ity

programme called PVT system makes TV channel recommendations to consumers


based on their viewing preferences. This arrangement keeps both well- and poorly-
rated TV networks. The Apriori technique can be used to construct a set of rules and
related confidence levels between programmes by treating user profiles as transactions
m

and the programme ratings contained inside as itemsets.

A matrix of programme similarity is filled out using the confidence values as


similarity scores. The process is as follows: beyond simple overlap, the relationship
)A

between the programmes is determined. Like, for instance, a viewer of reality


programmes like Big Boss and Rodies could not be interested in programmes like
KBC and Indian Idol. However, if a connection between Rodies and Indian Idol can be
found, it can serve as a foundation for pattern recognition. By locating the support and
confidence values, the relationship can be determined. The user is advised to utilise the
(c

confidence values in this case study as the similarity scores. We can create rules using
direct programme similarities, and we can create new results by chaining these rules
together.
Amity Directorate of Distance & Online Education
224 Data Warehousing and Mining

Case Study 2: Target selection classification in direct marketing

A predictive response model was created using data mining techniques and past purchase data to determine the likelihood that a client of Ebedi Microfinance Bank (Nigeria) will take advantage of an offer or promotion.

A data warehouse was used to store the data so that management could use it as a decision support system. The response model was created using data on previous consumer purchases and demographics. The following buying-behaviour variables were used in the model’s construction. Recency is the number of months since the most recent and the first purchases; it is frequently the most effective of the three factors for forecasting a customer’s reaction to a subsequent offer. This makes intuitive sense: you are more likely to make another purchase from a firm if you recently made one than if you have not. Frequency is the total number of purchases; it may cover purchases made during a certain period of time or comprise all purchases. In terms of predicting a response, this variable is second only to recency, and again the connection to future purchases is obvious. Monetary value is the total amount spent; like frequency, it might be limited to a certain time range or encompass all purchases. This variable has the weakest predictive ability of the three when it comes to predicting response, but used together with the others it can broaden our understanding in yet another way. Customers’ individual traits and information, such as age, sex, residence and profession, are included in the demographic data. The classifier system was built using the Naive Bayes technique.

In order to choose the model’s inputs, filter and wrapper feature selection approaches were also used. According to the results, Ebedi Microfinance Bank can effectively sell their goods and services by obtaining a report on the state of their clients. This will help management significantly reduce the amount of money that would have been wasted on ineffective advertising efforts.
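A minimal sketch of this kind of response model is shown below, assuming scikit-learn is available. The recency/frequency/monetary values and the response labels are invented for illustration; they are not the bank’s actual data, and the feature selection steps mentioned above are omitted.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Columns: recency (months since last purchase), frequency (number of purchases),
# monetary value (total amount spent). Label 1 = responded to a past offer.
X = np.array([[1, 12, 900], [2, 8, 450], [10, 2, 80],
              [14, 1, 30], [3, 6, 300], [18, 1, 20]])
y = np.array([1, 1, 0, 0, 1, 0])

model = GaussianNB().fit(X, y)
new_customer = np.array([[2, 5, 250]])
print(model.predict_proba(new_customer))  # [P(no response), P(response)]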

Conclusion

Getting the attention of a customer is crucial in today’s business climate. Every company creates a variety of goods, so differentiating themselves in a crowded market is a major challenge that companies must overcome. Among the most important success factors in running a business are knowing how to combine marketing techniques with products, spreading the word about the product, and increasing customers’ purchase intentions and interests. Businesses have the advantage of having innovative products, services, brands, quality, and so on. Overall, the analysis of the marketing data for the product, customer history, and purchasing patterns will lead digital marketing to advise an appropriate manner of connecting with consumers.

The product information covers, among other things, the type, cost, location, and advertising of the product. History records provide prior marketing tactics, approaches, and consumer responses for estimation, consultation, and exploration of prospective or unidentified marketing influence factors. Feedback on customer behaviour can be provided, and suggestions can be made for the products that consumers are interested in. The marketing approach should be tailored to the many consumer attributes, including age, sex, occupation, income, and lifestyle. In other words, the business’s ability to develop a marketing strategy for various items will be influenced by product information, historical records, and consumer buying behaviour in terms of product associations.
Summary
●● The intersection of statistics and machine learning (artificial intelligence) gives rise to the interdisciplinary topic of data mining (DM). It offers a technique that aids in the analysis and comprehension of the data found in databases, and it has been applied to a wide range of industries and applications.
●● In order to offer a didactic viewpoint on the data analysis process of these techniques, we present an applied vision of DM techniques. In order to find knowledge models that reveal the patterns and regularities underlying the analysed data, we employ machine learning algorithms and statistical approaches, comparing and analysing the findings.
●● Artificial neural networks (ANNs) are data processing systems that borrow features from biological neural networks in their design and operation.
●● In basic building blocks known as neurons, information processing takes place. The connections between the neurons allow for the transmission of signals. Each link (connection) has a corresponding weight.
●● The artificial neuron serves as the processing unit, collecting input from nearby neurons and calculating an output value to be relayed to the rest of the neurons.
●● Sequential data divisions called decision trees (DT) maximise the differences of a dependent variable (response or output variable). They provide a clear definition for groups whose characteristics are constant but whose dependent variable varies.
●● Humans rely on memories of previous encounters with situations similar to the current one when confronted with them. The k-Nearest Neighbor (kNN) method is founded on this; in other words, the k-NN method is founded on the idea of similarity.
●● Despite the method’s “naive” appearance, it may compete with other, more advanced categorization techniques. Therefore, k-NN is very adaptable in situations where the linear model is too rigid.
●● The value of k is not predetermined; however, it is important to keep in mind that if we choose a small k value, the categorization may be too influenced by outlier values or uncommon observations.
●● The Bayes rule or formula, which is based on Bayes’ theorem, is used in Bayesian approaches to combine data from the sample with expert opinion (prior probability) in order to get an updated expert opinion (posterior probability).
●● Logistic regression (LR) is a generalised linear model (GLM). Predicting binary variables (with values such as yes/no or 0/1) is its principal use. Thus, using LR approaches, a new observation with an unknown group can be assigned to one of the groups depending on the values of the predictor variables.
●● There are two types of data mining: descriptive data mining and predictive data mining. In a succinct and summarizing manner, descriptive data mining displays the data set’s intriguing general qualities. In order to build one or more models and make predictions about the behavior of fresh data sets, predictive data mining analyzes the data.
●● Data discrimination is the process of comparing the general characteristics of data items from the target class to those of objects from one or more opposing classes. A user can specify the target and opposing classes, and database queries can be used to retrieve the associated data objects.
●● The data cube approach can be thought of as a materialized-view, data warehouse-based, pre-computational strategy. Before submitting an OLAP or data mining query for processing, it performs off-line aggregation.
●● Data warehousing and OLAP technologies have recently advanced to handle more complicated data kinds and incorporate additional knowledge discovery processes. Additional descriptive data mining elements are expected to be incorporated into OLAP systems in the future as this technology develops.
●● Patterns that commonly show up in a data set include itemsets, subsequences, and substructures. A frequent itemset is, for instance, a group of items that regularly appear together in a transaction data set, like milk and bread.
●● Data analysis that uses classification extracts models describing significant data classes. These models, referred to as classifiers, forecast discrete, unordered categorical class labels. For instance, we can create a classification model to divide bank loan applications into safe and risky categories. We may gain a better comprehension of the data as a whole via such study.
●● Prediction models predict continuous-valued functions, whereas classification models predict categorical class labels. For instance, based on a person’s income and line of work, we can create a classification model to classify bank loan applications as safe or risky, or a prediction model to estimate how much money a potential consumer will spend on computer equipment.
●● A fresh observation is classified by determining its category or class label. As training data, a set of data is first employed. The algorithm receives a set of input data along with the associated outputs. The input data and their corresponding class labels are thus included in the training data set. The algorithm creates a classifier or model using the training dataset.
●● Lifecycle of data classification: a) Origin, b) Role-based practice, c) Storage, d) Sharing, e) Archive, f) Publication.
●● Prediction is another step in the data analysis process. It is employed to discover a numerical result. The training dataset includes the inputs and matching numerical output values, much like in classification. Using the training dataset, the algorithm creates a model or predictor.
●● The division of a set of data items (or observations) into subsets is known as cluster analysis or simply clustering. Every subset is a cluster, and each cluster contains items that are similar to one another but different from those in other clusters.
●● A clustering is the collection of clusters that emerges from a cluster analysis. In this situation, various clustering techniques may produce different clusterings on the same data set. The clustering algorithm, not people, performs the partitioning.
●● Properties of clustering: a) Scalability, b) Ability to deal with different types of attributes, c) Discovery of clusters with arbitrary shape, d) Requirements for domain knowledge to determine input parameters, e) Ability to deal with noisy data, f) Incremental clustering and insensitivity to input order, g) Capability of clustering high-dimensionality data, h) Constraint-based clustering, i) Interpretability and usability.
●● Clustering methods: a) Partitioning method, b) Hierarchical method, c) Density-based method, d) Grid-based method, e) Model-based method, f) Constraint-based method.
●● Outliers are data points that are not consistent with expectations, as the name implies. An outlier is neither noise nor an error; instead, it is thought that outliers were not created by the same process as the other data objects.
●● Three categories of outliers: a) Global outliers, b) Collective outliers, c) Contextual outliers.
●● When data mining is used, outliers are frequently deleted. However, outlier analysis is still utilised in numerous applications, including medicine and fraud detection, usually because uncommon events can carry far more important information than more frequent happenings.

Glossary
●● DM: Data Mining.
●● ANN: Artificial Neural Networks.
●● AI: Artificial Intelligence.
●● DT: Decision Trees.
●● KNN: K-Nearest Neighbor.
●● QUEST: Quick, Unbiased, Efficient Statistical Tree.
●● CHAID: Chi-Squared Automatic Interaction Detection.
●● CART: Classification and Regression Trees.
●● NB: Naive Bayes.
●● GLM: Generalised Linear Model.
●● LR: Logistic Regression.
●● OLAP: Online Analytical Processing.
●● AOI: Attribute-Oriented Induction.
●● MBA: Market Basket Analysis.
●● KDD: Knowledge Discovery in Data.
●● DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
●● OPTICS: Ordering Points To Identify Cluster Structure.
●● DENCLUE: Density-based Clustering.
●● STING: Statistical Information Grid Clustering Algorithm.
●● PROCLUS: Projected Clustering.

Check Your Understanding
1. Which of the following refers to the problem of finding abstracted data in the unlabeled data?
a. Unsupervised learning
b. Supervised learning
c. Hybrid learning
d. Reinforcement learning
2. Which of the following refers to querying the unstructured textual data?
a. Information access
b. Information retrieval
c. Information update
d. Information manipulation
3. Which of the following can be considered as the correct process of data mining?
a. Exploitation, Interpretation, Analysis, Exploration, Infrastructure
b. Exploration, Exploitation, Analysis, Infrastructure, Interpretation
c. Infrastructure, Exploration, Analysis, Interpretation, Exploitation
d. Interpretation, Exploitation, Exploration, Analysis, Infrastructure
4. Which of the following is an essential process in which the intelligent methods are applied to extract data patterns?
a. Warehousing
b. Text selection
c. Text mining
d. Data mining
5. What is KDD in data mining?
a. Knowledge Discovery Database
b. Knowledge Discovery Data
c. Knowledge Definition Data
d. Knowledge Data House
6. Which of the following techniques needs the merging approach?
a. Partitioned
b. Hierarchical
c. Naïve Bayes
d. None of the mentioned
7. In data mining, how many categories of functions are included?
a. 4
b. 3
c. 2
d. 5
8. What are the functions of data mining?
a. Association and correlation analysis classification
b. Prediction and Characterization
c. Cluster analysis and Evolution analysis
d. All of the above
9. _ _ _ _ module communicates between the data mining system and the user.
a. Graphical User Interface
b. Knowledge Base
c. Pattern Evolution Module
d. None of the mentioned
10. The step of the knowledge discovery database process in which several data sources are combined refers to _ _ _.
a. Data transformation
b. Data integration
c. Data cleaning
d. None of the above
11. _ _ _ _ are the data mining applications?
a. Market Basket Analysis
b. Fraud detection
c. Both a and b
d. None of the mentioned
12. The classification of the data mining system involves:
a. Database Technology
b. Information Science
c. Machine Learning
d. All of the above
13. The KDD process consists of _ _ _ steps.
a. 9
b. 4
c. 7
d. 6
14. Which among the following is a data mining algorithm?
a. K-mean algorithm
b. Apriori algorithm
c. Naïve Bayes algorithm
d. All of the above
15. _ _ _ _ _ is a data mining algorithm used to create decision trees.
a. C4.5 algorithm
b. PageRank algorithm
c. K-mean algorithm
d. Adaboost algorithm
16. Which of the following defines the structure of the data held in operational databases and used by operational applications?
a. User-level metadata
b. Operational metadata
c. Relational metadata
d. Data warehouse metadata
17. Which of the following consists of information in the enterprise that is not in classical form?
a. Mushy metadata
b. Data mining
c. Differential metadata
d. None of the above
18. Data can be updated in the _ _ _ _ environment.
a. Data mining
b. Informational
c. Operational
d. None of the above
19. The source of all the data warehouse data is the _ _ _ _.
a. Operational environment
b. Formal environment
c. Informal environment
d. None of the above
20. Data warehouse contains _ _ _ _ data that is never found in the operational environment.
a. Normalized
b. Denormalized
c. Informational
d. Summary

Exercise

1. Define statistical techniques in data mining.
2. Explain data mining characterisation and data mining discrimination.
3. Define the terms

a. Mining patterns
b. Associations

c. Correlations
4. What do you mean by classification and prediction?
5. Explain cluster analysis.
6. Define the term outlier analysis.
Learning Activities
1. In your project you are responsible for analyzing the requirements and selecting a
toolset for data mining. Make a list of the criteria you will use for the toolset selection.

Briefly explain why each criterion is necessary.

Check Your Understanding - Answers
1. a    2. b
3. c    4. d
5. a    6. b
7. c    8. d
9. a    10. b
11. c   12. d
13. a   14. d
15. a   16. b
17. a   18. c
19. a   20. d

Further Readings and Bibliography:
1. Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki
2. Data Mining: Concepts and Techniques, Jiawei Han
3. Handbook of Statistical Analysis and Data Mining Applications, Gary Miner, John Elder, and Robert Nisbet
4. Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Mohammed J. Zaki and Wagner Meira
5. Introduction to Data Mining, Michael Steinbach, Pang-Ning Tan, and Vipin Kumar
6. Data Mining: Practical Machine Learning Tools and Techniques, Eibe Frank, Ian H. Witten, and Mark A. Hall


Module - V: Data Mining Applications


Learning Objectives:
At the end of this topic, you will be able to understand:
●● Text Mining
●● Spatial Databases
●● Web Mining
●● Multidimensional Analysis of Multimedia Data
●● Applications in Telecommunications Industry
●● Applications in Retail Marketing
●● Applications in Target Marketing
●● Mining in Fraud Protection
●● Mining in Healthcare
●● Mining in Science
●● Mining in E-commerce
●● Mining in Finance
Introduction
Applications for data mining may be general or specialised. A general application must be an intelligent system capable of making choices on its own, including those regarding data selection, data mining method, presentation, and result interpretation. Some general data mining programmes cannot make these choices on their own but instead assist users in choosing the data, the data mining technique, and the interpretation of the results.

In various corporate domains, data mining is important for discovering patterns, forecasting, and gaining knowledge. Applications for data mining use a range of data types, from text to photos, and store them in a range of databases and data structures. From these various datasets, patterns and knowledge are extracted using various data mining techniques. The work of choosing the data and methods for data mining is crucial to this process and calls for domain knowledge.

For data mining, a wide range of data should be gathered in the particular problem domain, specific data should be chosen, the data should be cleaned and transformed, patterns should be extracted for knowledge development, and the knowledge should then be interpreted. Data mining is utilised in the medical field, to do basket analyses, uncover patterns, forecast sales, and detect malicious executables. There are still numerous unresolved challenges with data mining, such as security and social issues, user interface problems, and performance problems, before it becomes a common, established, and trustworthy discipline.

5.1 Text Mining, Spatial Databases and Web Mining


There is a vast amount of knowledge saved in text documents nowadays, either
within businesses or generally accessible. Due to the enormous amount of information

available in electronic form, including electronic periodicals, digital libraries, e-mail,
and the World Wide Web, text databases are expanding quickly. The majority of text
databases store semi-structured data, and specialised text mining algorithms have

been created for extracting new knowledge from big textual data sets.

“Text Mining is the procedure of synthesizing information, by analyzing relations, patterns, and rules among textual data.”

5.1.1 Text Mining
Online text mining is often enabled by two main methods. The first is the ability

to search the Internet, and the second is the text-analysis method. There has been
internet search for a while. With the explosion in the number of websites over the last ten years, a large number of search engines intended to assist consumers in finding material have appeared almost overnight. The earliest search engines were Yahoo,

AltaVista, and Excite, whereas Google and Bing have gained the most traction in recent
years. Search engines work by indexing a certain Web site’s content and letting users
search these indices. Users of the latest generation of Internet search engines can now
find pertinent information by sifting through fewer links, pages, and indexes.
The study of text analysis is older than the study of internet search. It has been
a part of efforts to teach computers to understand natural languages, and artificial
intelligence experts frequently view it as a challenge. Anywhere there is a lot of text
that has to be studied, text analysis can be applied. Although the depth of analysis that

a human can bring to the task is not possible with automatic processing of documents
using various techniques, it can be used to extract important textual information,
classify documents, and produce summaries when manual analysis is impossible due

to a large number of documents.

You can either look for keywords in text documents or try to classify the semantic
content of the text itself to comprehend the specifics. When establishing specific

information or elements within text documents that can be used to demonstrate links or
linkages with other documents, you are looking for keywords in those text documents.
Documents have usually been modelled in vector space in the IR domain. White-space
delimiters in English are used as simple syntactic rules for tokenization, and tokens are
changed to their canonical forms (e.g., “reading” becomes “read,” “is,” “was,” and “are”

becomes “be”). A Euclidean axis is represented by each canonical token. Vectors in this
n-dimensional space are documents. The tth coordinate of a document d is simply n(d,t), the number of times the term t appears in d. Using the L1, L2, or L∞ norm, one may decide to normalise the document’s length to 1, so that the tth coordinate becomes n(d,t)/||d||, where ||d|| is, for example, the L1 norm ||d||1 (the total number of term occurrences in d). Here n(d,t) denotes the number of times a term t appears in a given document d.
These representations fail to convey the fact that some terms, such as “algorithm,” are more significant in determining document content than others, such as “the” and “is.” If t appears in nt documents out of N, then N/nt conveys the term’s rarity, and hence its significance. The inverse document frequency,

IDF(t) = 1 + log(N/nt),

is used to stretch the axes of the vector space differentially. Thus, in the weighted vector space model, the tth coordinate of document d may be represented by the value (n(d,t)/||d||1) · IDF(t). Despite being incredibly basic and unable to capture any

part of language or semantics, this model frequently works effectively for the task in
hand. Despite some small differences, all of these textual models treat documents as
collections of terms rather than paying attention to the order in which the terms are

used. As a result, they are referred to as bag-of-words models as a whole. The results
of these keyword approaches are frequently expressed as relational data sets, which
may then be examined using one of the common data-mining methods.
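As a small illustration of the weighted bag-of-words representation described above, the Python sketch below builds vectors using the L1-normalised term counts and the 1 + log(N/nt) weighting. The toy documents and the simple white-space tokenization are assumptions made purely for illustration.

import math
from collections import Counter

def weighted_vectors(docs):
    # Tokenize on white space and lower-case; no stemming in this sketch.
    tokenized = [doc.lower().split() for doc in docs]
    N = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))   # nt: documents containing t
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)                                # n(d, t)
        length = sum(counts.values())                        # ||d||1
        vectors.append({t: (c / length) * (1 + math.log(N / df[t]))
                        for t, c in counts.items()})
    return vectors

docs = ["data mining finds useful patterns",
        "text mining analyses text data",
        "the web stores huge volumes of text"]
for vec in weighted_vectors(docs):
    print({t: round(w, 3) for t, w in vec.items()})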

Hypertext documents are a specific category of text-based documents that also
include hyperlinks. They are typically shown as basic Web components. Depending
on the application, different levels of detail are modelled for them. Hypertext can be
conceptualised as a directed graph (D, L) in the most basic paradigm, where D is the

collection of nodes that represent documents or Web pages and L is the collection
of links. When the focus is on documents’ linkages, crude models might not need to
incorporate the text models at the node level. A combined distribution between a node’s
term distribution and those around it in the graph’s document will be characterised by
more complex models.
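A hypertext collection viewed as a directed graph (D, L) can be represented very simply in code. The sketch below uses an invented set of pages and links and reports, for each node, how many pages link to it (its in-degree), which is the kind of structural information that link-based models build on.

from collections import defaultdict

# D: documents (nodes); L: hyperlinks (directed edges) -- toy data for illustration.
links = [("home.html", "about.html"), ("home.html", "products.html"),
         ("about.html", "products.html"), ("products.html", "home.html")]

out_edges = defaultdict(list)
in_degree = defaultdict(int)
for src, dst in links:
    out_edges[src].append(dst)
    in_degree[dst] += 1

for page in sorted(set(p for edge in links for p in edge)):
    print(page, "links out to", out_edges[page], "| linked from", in_degree[page], "page(s)")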

The problem of content-based document analysis and partitioning is more


challenging. There has been some progress made in this direction, and new text-mining
algorithms have been identified, but no standards or shared theoretical foundation have

been developed in the field.

Text categorization is typically thought of as comparing one document to another or


to a predetermined list of categories or definitions. The outcomes of these comparisons

can be represented visually inside a semantic landscape wherein dissimilar documents


are positioned farther away and similar documents are grouped together in the
semantic space.
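One common way to place documents in such a landscape is to measure pairwise similarity between their term vectors. The sketch below computes the cosine similarity of two bag-of-words vectors; this is a standard, deliberately simplified choice made for illustration, not the method of any specific commercial system.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words vectors from raw term counts; values closer to 1 mean more similar.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("the car engine failed", "the auto engine was repaired"))
print(cosine_similarity("the car engine failed", "stock prices rose sharply"))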

Indirect evidence, for instance, frequently enables us to draw semantic links


between works that might not even use the same terms. For instance, the co-
occurrence of the phrases “vehicle” and “auto” in a group of papers may suggest that
these terms are connected. This might make it easier for us to compare documents that

use these terms. The generated topographic map can show the degree of similarities
between documents in terms of Euclidean distance, depending on the specific algorithm
employed to build the landscape. The method used to create Kohonen feature maps
is comparable to this concept. You can then extrapolate concepts represented by

documents based on the semantic environment.

The automatic analysis of text data can be applied to a variety of general objectives, including:
◌◌ In order to efficiently arrange the contents of a huge document collection and present an overview of them,
◌◌ To find hidden relationships between documents or collections of documents,
◌◌ To make a search process more efficient and effective in order to find related or comparable information, and
◌◌ To search an archive for duplicate data or documents.

A new set of functionalities called text mining is based mostly on text analysis
technologies. The most typical medium for the official exchange of information is text.
Even if attempts to automatically extract, organise, and utilise information from it only

partially succeed, the urge to do so is strong. While conventional, commercial text-
retrieval systems rely on inverted text indices made up of statistics like the number of
times a word occurs in a document, text mining needs to offer benefits beyond simply
retrieving text indices like keywords. Text mining can be defined as the process of
analysing text to extract interesting, nontrivial information that is valuable for specific
objectives. Text mining is about looking for semantic patterns in text.
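Since the passage contrasts text mining with conventional retrieval based on inverted text indices, a minimal inverted index is sketched below. The documents are invented and the tokenization is deliberately naive; a real retrieval system would add stemming, stop-word removal and ranking.

from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to a dictionary of {document id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = ["text mining finds semantic patterns",
        "keyword search uses an inverted index",
        "patterns in text support business decisions"]
index = build_inverted_index(docs)
print(index["patterns"])   # which documents mention the term, and how often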

Text mining is seen to have an even greater commercial potential than traditional

data mining with structured data because text is the most natural way to store
information. In fact, according to recent studies, text documents contain 80% of a
company’s information. But because it deals with naturally confusing, unstructured

text data, text mining is also a considerably more difficult process than traditional data
mining.

Text mining is a multidisciplinary field that includes IR, text analysis, information
extraction, natural language processing, clustering, categorization, visualisation,
machine learning, and other methodologies that are already on the data-mining
“menu.” This field may also include some additional particular techniques that have
recently been developed and used on semi-structured data. Only a few of the potential
uses for text mining include market research, business intelligence collecting, email

management, claim analysis, electronic procurement, and automated help desk. The
two stages of the text-mining process are graphically shown in Figure A:

Text refinement is the process of taking free-form text documents and turning

them into a selected intermediate form (IF), and knowledge distillation is the process of
extracting patterns or knowledge from an intermediate form
Figure A: Text mining framework


An IF can be structured, like the relational data representation, or semi-


structured, like the conceptual graph representation. For various mining objectives,
intermediate versions with differing degrees of complexity are suitable. They can

be divided into two categories: document-based and concept-based, where each entity represents an item or concept of interest in a certain domain. Document-based IF mining finds patterns and connections between documents; examples include categorization, visualisation, and grouping of documents.

It is important to do a semantic analysis and derive a sufficiently rich representation
to capture the link between objects or concepts provided in the document for a

fine-grained, domain-specific knowledge-discovery activity. Mining a concept-based IF finds patterns and connections between objects and concepts. Making
these computationally expensive semantic analysis approaches more effective and
scalable for extremely big text corpora is a difficult task. This group includes text-mining

activities including association discovery and predictive modelling. By realigning or
extracting the pertinent data in accordance with the objects of interest in a particular
domain, a document-based IF can be converted into a concept-based IF. As a result, a

concept-based representation is typically domain-dependent while a document-based
IF is typically domain independent.

The classification of various text-mining tools and their associated methodologies

is based on the functions of text-refining and knowledge-distillation as well as the
intermediate form used. Document organisation, visualisation, and navigation are
the subject of one set of methodologies and recently released commercial tools. The

functions of text analysis, IR, classification, and summarization are the focus of a
different group.
These text-mining methods and techniques fall within a significant and broad
subclass that is based on document visualisation. The general strategy is to combine
or cluster papers together based on their similarity and then display the groups or

clusters as 2D or 3D graphics. The most complete commercial text-mining solutions are


arguably IBM’s Intelligent Miner and SAS Enterprise Miner. They provide a collection
of text analysis tools that incorporate a text search engine along with tools for feature
extraction, clustering, summarization, and categorization.

Domain knowledge may be crucial to the text-mining process but is not currently
used or examined by any text-mining tools. In particular, domain expertise can be
applied as early as the text-refining step to boost parsing effectiveness and provide a

shorter intermediate form. In order to increase the effectiveness of learning, domain


knowledge may also be used in knowledge distillation. We anticipate that the future
generation of text-mining techniques and technologies will enhance the quality of
information and knowledge discovery from text as all these concepts are still in their

infancy.

Application Area of Text Mining


Digital Library

To extract patterns and trends from journals and proceedings that are maintained in text database repositories, several text mining methodologies and technologies are applied. These informational sources are beneficial to the field of study. Libraries are a fantastic place to find digitised text data. Text mining provides a cutting-edge method for

obtaining valuable information in a way that makes it possible to access millions of


records online. A flexible approach for extracting documents in multiple formats, such as Microsoft Word, PDF, PostScript, HTML, scripting languages, and email, is provided by the Greenstone international digital library, which supports several languages


and multilingual interfaces. Along with text documents, it also supports the extraction
of audiovisual and image formats. Text mining techniques carry out a variety of tasks,
including document collecting, analysis, improvement, data removal, handling, and

in
summarization. There are various forms of text mining software for digital libraries,
including GATE, Net Owl, and Aylien.

nl
Academic and Research Field
Different text mining techniques and technologies are used in the field of education
to look at instructional patterns in a certain area or study area. The primary goal of

O
text mining use in the research sector is to assist in the discovery and organisation of
research papers and pertinent content from multiple fields on one platform. We employ
k-Means clustering for this, and other techniques help to differentiate the characteristics
of meaningful data. Additionally, information on student performance in other topics

ty
is accessible, and this mining can assess how different factors affect the choice of
disciplines.

si
Life Science
The life science and healthcare sectors generate a vast amount of textual and
mathematical data about patient records, illnesses, medications, disease symptoms,
r
and disease therapies, among other topics. Filtering information and pertinent text from
ve
a biological data store in order to make decisions is a significant problem. The clinical
records include fluctuating, unpredictably long data. Such data can be managed with
the use of text mining. Text mining is also used in the pharmacy sector, the disclosure of
biomarkers, clinical research, and competitive intelligence for patents.
ni

Social-Media
To monitor and look at online content such plain text from blogs, emails, web
U

journals, and other online sources, text mining is a tool that can be used. The number
of posts, likes, and followers on the online social media network can be distinguished
and investigated with the aid of text mining tools. This type of study demonstrates
how people react to various articles and news stories as well as how it spreads. It
ity

demonstrates how members of a certain age group behave and the range of reactions
to the same post in terms of likes and views.

Business Intelligence
m

Business intelligence relies heavily on text mining to study customers and rivals
so that various organisations and businesses may make better judgments. It provides
a thorough understanding of business and information on how to raise customer
)A

satisfaction and get an advantage in the marketplace. IBM Text Analytics is one of the
text mining tools.

GATE assists in choosing the company that sends alerts regarding good and bad
performance, a market transition that aids in taking the essential actions. The telecom
industry, business, and customer chain management systems can all make advantage
(c

of this mining.

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 239

Issues in Text Mining


Notes

e
Throughout the text mining process, a number of problems arise:

●● The effectiveness and efficiency of decisions.

in
●● A text mining stage at which the unclear problem can appear is the intermediate
stage. Different criteria and principles are described during the pre-processing
step to standardise the content and facilitate text mining. Before performing

nl
pattern analysis on the document, unstructured data must be converted into a
reasonable structure.

O
●● Occasionally, adjustment might cause the original message or meaning to shift.
●● Many algorithms and techniques for text mining support text in multiple languages,
which is another problem. It could lead to unclear text meaning. This issue could
result in falsely favourable findings.

ty
●● Synonym, polysemy, and antonym usage in the document text causes problems
for text mining methods that use both in the same context. It is challenging to
classify these texts and words.

si
5.1.2 Spatial Databases

Introduction to Spatial Databases r


ve
Functionality that supports databases that track items in a multidimensional space
is incorporated into spatial databases. Examples include the two-dimensional spatial
descriptions of its objects, which range from countries and states to rivers, cities, roads,
seas, and other geographic features, in cartographic data-bases that store maps.
ni

Geographical Information Systems (GIS) are the systems that control geographic data
and related applications; they are utilised in fields including environmental applications,
transportation systems, emergency response systems, and combat management.
U

Since temperatures and other meteorological data are tied to three-dimensional


spatial positions, other databases, like meteorological databases for weather
information, are three-dimensional. Objects that have spatial features that describe
them and that have spatial relationships among them are often stored in a spatial
ity

database. The spatial links between the objects are significant, as they are frequently
required while making database queries. Although, in general, a spatial database can
refer to any n-dimensional space, we will restrict our discussion to two dimensions for
simplicity.
m

A spatial database is designed to store and retrieve information on spatial


objects, such as points, lines, and polygons. Spatial data frequently includes satellite
imagery. Spatial queries are those that are made on these spatial data and use spatial
)A

parameters as predicates for selection. A spatial query would ask, for instance, “What
are the names of all bookstores within five miles of the College of Computing building
at Georgia Tech?” While standard databases analyse numeric and character data,
processing geographical data types requires additional capabilities.
(c

An external geographic database that maps the company headquarters and each
customer to a 2-D map based on their address may need to be consulted in order to
process spatial data types that are typically outside the scope of relational algebra

Amity Directorate of Distance & Online Education


240 Data Warehousing and Mining

in order to process a query like “List all the customers located within twenty miles of
Notes

e
company headquarters.” Each consumer will actually be connected to a latitude and
longitude position. Since conventional indexes are unable to organise multidimensional
coordinate data, this query cannot be processed using a conventional B+-tree index

in
based on customer zip codes or other nonspatial properties. As a result, databases
specifically designed to handle spatial data and spatial queries are required.

nl
The common analytical steps used to process geographic or spatial data are
shown in the table below.

O
ty
si
Measurement operations are used to determine the area, relative size of an
object’s components, compactness, or symmetry of single objects as well as the

r
relative position of other objects in terms of distance and direction. To find spatial
correlations within and among mapped data layers, spatial analysis operations—
ve
often utilising statistical methods—are performed. A prediction map, for instance, may
be made to show where clients for specific goods are most likely to be found based
on past sales and demographic data. The shortest path between two points and the
connectivity between nodes or regions in a graph can both be found with the use of flow
ni

analysis methods. Finding out whether a given set of points and lines falls into a given
polygon is the goal of location analysis (location). The procedure entails creating a
buffer around current geographic features, after which features are identified or chosen
U

according to whether they are inside or outside the buffer’s boundary. The topography
of a particular location can be represented with an x, y, and z data model known as a
Digital Terrain (or Elevation) Model (DTM/DEM) using digital terrain analysis, which is
used to create three-dimensional models.
ity

A DTM’s x, y, and z dimensions stand in for the horizontal plane and the
corresponding x, y coordinates’ spot heights, respectively. These models can be
applied to the analysis of environmental data or to the planning of engineering projects
that need for knowledge of the topography. A user can look for objects inside a certain
m

spatial area using spatial search. Thematic search, for instance, enables us to look for
items associated with a specific subject or class, such as “Find all water bodies within
25 miles of Atlanta” when the category for the search is water.
)A

Among spatial objects, there are also topological relationships. To choose


items based on their spatial relationships, these are frequently employed in Boolean
predicates. For instance, a condition like “Find all freeways that travel through
Arlington, Texas” would require an intersects operation to identify which freeways (lines)
intersect the city limit if a city boundary is represented as a polygon and freeways are
(c

represented as multilines (polygon).

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 241

Spatial Data Types and Models


Notes

e
The common data types and models for storing spatial data are briefly described in
this section. There are three main types of spatial data. Due to their widespread use in

in
commercial systems, these forms have become the de facto industry standard.

Map Data covers different geographic or spatial characteristics of map objects,


such as an object’s shape and its location. Points, lines, and polygons are the three

nl
fundamental types of features (or areas). In the scale of a specific application, points
are used to describe the spatial characteristics of objects whose locations correspond
to a single pair of 2-d coordinates (x, y, or longitude/latitude). Buildings, cell towers, and

O
stationary vehicles are some instances of point objects, depending on the scale.

A series of point locations that change over time can be used to depict moving
cars and other moving things. A series of connected lines can approximate the spatial

ty
features of things with length, such as highways or rivers. When representing the
spatial properties of an item with a boundary, such as a nation, a state, a lake, or a city,
a polygon is utilised. Keep in mind that depending on the level of detail, some objects,
such cities or buildings, might be depicted as either points or polygons.

si
The descriptive information that GIS systems attach to map features is known as
attribute data. Imagine, for instance, that a map has features that represent the counties

r
of a US state (such as Texas or Oregon). Each county feature (object) may have the
following attributes: population, major city or town, square miles, etc. States, cities,
ve
congressional districts, census divisions, and other aspects of the map could all have
additional attribute data.

Images made by cameras, such as satellite images and aerial photos, are
examples of image data. These photos can have interesting objects, such highways
ni

and buildings, layered on them. Map features can also have images as properties.
Other map features can have photos added to them so that when a user clicks on the
feature, the image is displayed. Raster data frequently takes the form of aerial and
U

satellite photos.

There are two main categories into which spatial information models can be
categorised: field and object. Depending on the needs and the customary model
ity

selection for the application, either a field-based model or an object-based model is


used to model a spatial application (such as remote sensing or highway traffic control).
While object models have historically been used for applications such as transportation
networks, land parcels, buildings, and other objects that possess both spatial and non-
spatial attributes, field models are frequently used to model continuous spatial data,
m

such as terrain elevation, temperature data, and soil variation characteristics.

Spatial Operators
)A

When performing spatial analysis, spatial operators are used to record all the
pertinent geometric details of the objects that are physically embedded in the space as
well as the relationships between them. Operators are divided into three major groups.

Topological operators: When topological transformations are used, topological


(c

features remain unchanged. After transformations like rotation, translation, or scaling,


their attributes remain unchanged. The base level of topological operators allows users

Amity Directorate of Distance & Online Education


242 Data Warehousing and Mining

to check for intricate topological relationships between regions with a large border, while
Notes

e
the higher levels provide more abstract operators that let users query ambiguous spatial
data without relying on the underlying geometric data model. Examples include open
(region), close (region), and inside (point, loop).

in
Projective operators: To express predicates regarding the concavity/convexity of
objects as well as other spatial relations, projective operators such as convex hull are

nl
utilised (for example, being inside the concavity of a given object).

Metric operators: Metric operators give a more detailed account of the geometry
of the object. In addition to measuring the relative positions of various objects in terms

O
of distance and direction, they are also used to measure several global features of
single objects (such as the area, relative size of an object’s sections, compactness, and
symmetry). Examples include length (arc) and distance (point, point).

ty
Dynamic Spatial Operators: The operations carried out by the aforementioned
operators are static in the sense that their application has no impact on the operands.
For instance, the curve itself is unaffected by the length of the curve. The objects on
which dynamic operations act are changed. Create, destroy, and update are the three

si
basic dynamic operations. Updating a spatial object that can be divided into translate
(shift position), rotate (change orientation), scale up or down, reflect (produce a mirror
image), and shear (deform).
r
Spatial Queries: Requests for spatial information made using spatial operations are
ve
known as spatial queries. Three common forms of spatial searches are illustrated by
the following categories:

Range query: identifies the objects of a specific type that are present within a
ni

specified spatial region or at a specified distance from a specified place. (For instance,
look for all hospitals in the Metropolitan Atlanta region or all ambulances five miles away
from an accident.)
U

Nearest neighbor query: locates the closest instance of a specific type of object at
a specified location. (For instance, locate the police vehicle that is most convenient to
the scene of the crime.)
ity

Spatial joins or overlays: usually connects objects of two different types depending
on some spatial condition, such as the objects’ spatial intersection, overlap, or
proximity. (For instance, find all houses that are two miles or less from a lake, or all
townships on a main highway connecting two cities.)
m

Spatial Data Indexing


In order to make it simple to locate things in a specific spatial area, objects are
arranged into a series of buckets (which correspond to pages of secondary memory)
)A

using a spatial index. A bucket region, or area of space holding all the objects kept in a
bucket, exists for each bucket. The bucket regions, which are often rectangles, divide
the space so that each point belongs to exactly one bucket for point data structures. A
spatial index can be provided in essentially two different ways.
(c

The database system has specialised indexing structures that enable effective
search for data objects based on spatial search operations. These indexing structures
would serve a similar purpose to typical database systems’ B+-tree indexes. Grid files

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 243

and R-trees are two examples of these indexing structures. Spatial join processes can
Notes

e
be sped up by using specialised spatial indexes, also known as spatial join indexes.

The two-dimensional (2-d) spatial data is converted to single-dimensional (1-d)

in
data so that conventional indexing techniques (B+-tree) can be employed, as opposed
to developing brand-new indexing structures. Space filling curves are the techniques for
transforming from 2-d to 1-d. We won’t go into great depth about these techniques (see

nl
the Selected Bibliography for further references).

Next, we provide an overview of a few spatial indexing methods.

Grid Files. For indexing spatial data in two dimensions and greater n dimensions,

O
gird files are utilised. The fixed-grid approach creates equal-sized buckets out of
an n-dimensional hyperspace. The fixed grid is implemented using an n-dimensional
array as the data structure. To handle overflows, the items whose spatial locations are

ty
entirely or partially within a cell can be stored in a dynamic structure. For data that is
evenly dispersed, like satellite imagery, this structure is helpful. The fixed-grid structure,
however, is rigid, and its directory may be both sparse and extensive.

si
R-Trees. An extension of the B+-tree for k-dimensions, where k > 1, the R-tree
is a height-balanced tree. The minimum bounding rectangle (MBR), which is the
smallest rectangle with sides parallel to the coordinate system’s (x and y) axis, is

r
used to represent spatial objects in the R-tree for two-dimensional (2-d) space. The
following characteristics of R-trees are similar to those of B+-trees but are tailored to
ve
2-dimensional spatial objects.

Each index record (or entry) in a leaf node has the structure (I, object-identifier),
where I is the MBR for the spatial object whose identifier is object-identifier.
ni

Except for the root node, every node must be at least halfway full. Since M/2 <=
m <= M, a leaf node that is not the root should have m entries (I, object-identifier).
Similarly, a non-leaf node that is not the root must have m entries (I, child-pointer),
U

where M/2 <= m <= M and I is the MBR that has the union of all the rectangles in the
node that child-pointer is pointing at.

The root node should contain at least two pointers unless it is a leaf node, and all
ity

leaf nodes are at the same level.

Each and every MBR has a side that is perpendicular to the axes of the world
coordinate system.

The quadtree and its derivatives are among other spatial storage structures. For
m

the purpose of identifying the locations of various items, quadtrees often divide each
space or subspace into sections of equal size. This field of study is still active despite
the recent proposals for numerous newer spatial access architectures.
)A

Spatial Join Index. An index structure for a spatial join precomputes a spatial join
operation and stores the pointers to the related objects. The performance of recurrent
join queries over tables with slow update rates is enhanced by join indexes. To respond
to requests like “Create a list of highway-river combinations that intersect,” spatial join
conditions are utilised. These item pairs that satisfy the cross spatial relationship are
(c

found and retrieved using the spatial join. The results of spatial relationships can be
computed once and stored in a table with the pairs of object identifiers (or tuple ids)

Amity Directorate of Distance & Online Education


244 Data Warehousing and Mining

that satisfy the spatial relationship, which is essentially the join index. This is because
Notes

e
computing the results of spatial relationships typically takes a lot of time.

A bipartite graph G = (V1,V2,E) can be used to define a join index, where V1

in
holds the tuple ids from relation R and V2 contains the tuple ids from relation S. If a
tuple corresponding to (vr,vs) in the join index exists, the edge set contains an edge
(vr,vs) for vr in R and vs in S. All of the associated tuples are represented as connected

nl
vertices in the bipartite networks. Operations (see Section 26.3.3) that involve
computing associations among spatial objects employ spatial join indexes.

Spatial Data Mining

O
Spatial data frequently exhibits strong correlation. For instance, individuals with
comparable traits, professions, and origins frequently congregate in the same areas.

Spatial classification, spatial association, and spatial grouping are the three main

ty
spatial data mining approaches.

Spatial classification. Estimating the value of a characteristic of a relation based

si
on the value of the relation’s other attributes is the aim of categorization. Identifying
the locations of nests in a wetland based on the importance of other qualities (such
as vegetation durability and water depth) is an example of the spatial classification
problem, also known as the location prediction problem. Similar to location prediction,
r
predicting hotspots for criminal activity is a challenge.
ve
Spatial association. In contrast to items, spatial predicates are used to establish
spatial association rules. If at least one of the Pis or Qjs is a spatial predicate, then
the rule is said to be of the type P1 P2 ^... ^Pn Q1^ Q2^ ... ^Qm. A good example
of an association rule with some support and confidence is the rule is_a(x, country) ^
ni

touches(x, Mediterranean) ⇒is_a (x, wine-exporter) (that is, a country that is adjacent
to the Mediterranean Sea is typically a wine exporter) is an example of an association
rule, which will have a certain support s and confidence c.
U

In an effort to point to collecting data sets that are spatially indexed, spatial
colocation rules try to generalise association rules. Between spatial and nonspatial
linkages, there are a number of significant distinctions, including:
ity

Given that data is embedded in continuous space, the concept of a transaction


is absent in spatial contexts. Partitioning space into transactions would cause interest
metrics, such support or confidence, to be overestimated or underestimated.

The size of item sets in spatial databases is minimal, meaning that in a spatial
m

scenario there are significantly less items in the item set than in a non-spatial situation.

Spatial elements are often the discrete form of continuous variables. For instance,
)A

in the United States, income regions can be categorised as areas where the mean
annual income falls within a given range, like $40,000 or less, $40,000 to $100,000, or
$100,000 or more.

The goal of spatial clustering is to organise database objects into clusters where
the most similar things are located and clusters where the least similar objects are
(c

located. The grouping of seismic occurrences in order to identify earthquake faults is


one use for spatial clustering. Density-based clustering, which seeks to identify clusters

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 245

based on the density of data points in a region, is an illustration of a spatial clustering


Notes

e
technique. In the data space, these algorithms treat clusters as dense regions of items.

Density-based spatial clustering of applications with noise (DBSCAN) and density-

in
based clustering are two variants of these methods (DENCLUE). Because it starts
by identifying clusters based on the estimated density distribution of related nodes,
DBSCAN is a density-based clustering algorithm.

nl
Applications of Spatial Data
Numerous fields, including geography, remote sensing, urban planning, and

O
natural resource management, can benefit from spatial data management. The solving
of complex scientific issues like global climate change and genomics is greatly aided
by spatial database management. GIS and spatial database management systems
have a significant role to play in the field of bioinformatics because of the geographical

ty
character of genetic data.

Pattern recognition, the building of genome browsers, and map visualisation are
some of the main applications. For instance, one might check to see if the topology

si
of a specific gene in the genome is present in any other sequence feature map in the
database. The detection of spatial outliers is one of the key applications of spatial data
mining. A spatially referenced object is considered an outlier if its nonspatial attribute
values considerably deviate from those of other spatially referenced objects in its
immediate vicinity. According to the nonspatial feature “house age,” for instance, if a
neighbourhood of older homes has just one brand-new home, that home would be an
outlier.

Many applications of geographic information systems and spatial databases make
use of the ability to detect spatial outliers. Transportation, ecology, public safety, public
health, climatology, and location-based services are only a few of these application
domains.

5.1.3 Web Mining


The number of websites has increased exponentially since Berners-Lee, the
creator of the World Wide Web, established the first webpage in 1991. There were 1.8
billion websites in existence as of 2018. The amount of data that is available and the
need to organise it in order to extract usable information from it have both increased
exponentially along with this expansion.

Online directories were developed in the beginning in an effort to link together
related web pages and manage such data. These directories frequently used manual
keyword tagging and reviews of the web pages they contained. As time went on,
search engines started to be used, and they used a variety of methods to retrieve the
necessary data from the online sites. These methods are referred to as web mining.
Web mining is the process of using machine learning and data mining techniques to
extract information from web pages’ data.

The figure below divides web mining into three categories: usage mining, structure mining, and web content mining (Categories of Web mining).

Figure: Categories of Web mining

Web Content Mining

Web content mining is the process of obtaining pertinent information from a web
page’s content. When content mining, we completely disregard how users interact with
a particular web page or how other web pages link to it. A simple method of web content
mining relies on the placement and usage of keywords. However, this leads to two
issues: the first is the issue of scarcity, and the second is the issue of abundance. The
issue of scarcity arises with queries that either produce a small number of results or
none at all. The issue of abundance arises when search queries produce an excessive
amount of results. The nature of the data on the web is the underlying cause of both
issues. The material is typically dispersed across several web pages and is typically
presented as semi-structured HTML data.

●● Web document clustering: The management of several documents based on keyword clustering on the web. Instead of returning a ranked list of web pages, the main idea is to group meaningfully related web pages together. To do this, cluster analysis methods such as K-means and agglomerative clustering can be applied. For the purposes of using clustering algorithms, a vector of words and their frequencies on a particular web page serves as the input attribute set. However, these grouping methods alone do not produce good enough outcomes.
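To illustrate the idea, here is a minimal Python sketch (not from the text) that builds word-frequency vectors for a few made-up pages and groups them with K-means; it assumes the scikit-learn library is available:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Hypothetical page texts
pages = [
    "machine learning and data mining tutorial",
    "data mining with machine learning models",
    "football world cup match results",
    "latest football match highlights",
]

# Word-frequency vectors serve as the input attribute set
X = CountVectorizer().fit_transform(pages)

# Group the pages into two clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g., [0 0 1 1]: meaningfully related pages share a cluster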
Another method of online document clustering is based on Suffix Tree Clustering.

●● Suffix Tree Clustering (STC): STC uses phrase clustering rather than keyword
frequency clustering. STC functions as follows:
Step 1: Download the text from the website. Remove frequent words and
punctuation from each sentence in the text. The remaining words should be changed to
their base form.

Step 2: Create a tree using the list of words you got in step 1 as a base.

Step 3: Compare the trees you got from different documents. Words in a tree with
the same root to leaf node sequence are clustered together in a single cluster.

As can be seen in the Figure below, STC tries to cluster the document in a more
relevant way by taking the material’s phrase order into account.

Figure: STC example

●● Resemblance and containment: Additionally, it is necessary to delete duplicate web pages, or pages that are nearly identical, from the search results in order to improve query results. To do this, the ideas of resemblance and containment are useful. The term “resemblance” describes how similar two documents are to one another. Its value ranges from 0 to 1 (both inclusive). A value of 1 represents identical documents, a value close to 1 denotes nearly identical documents, and a value close to 0 denotes entirely different documents. One document’s presence inside another is referred to as containment. It also takes a value between 0 and 1, with a value of 1 signifying that the first document is contained inside the second document and a value of 0 indicating the opposite.
The idea of shingles is used to quantitatively describe resemblance and containment. The text is separated into sets of continuous L-word sequences; these patterns are referred to as shingles. The definitions of resemblance R(X, Y) and containment C(X, Y) for two given texts, X and Y, are as follows:

R(X, Y) = {S(X) ∩ S(Y)} / {S(X) ∪ S(Y)}

C(X, Y) = {S(X) ∩ S(Y)} / {S(X)}

where S(X) and S(Y) are the sets of shingles for documents X and Y, respectively.

Resemblance is defined as the total number of shingles that are shared by two
documents X and Y, divided by the total number of shingles in both documents,
according to the formulas.

The amount of containment is determined by dividing the number of shingles shared by documents X and Y by the total number of shingles in the original document X.


Finding related websites is possible in a number of ways. One method is to
compare each document pair using the Linux diff programme. A different strategy is
based on the idea of fingerprinting.

●● Fingerprinting: A document is broken down into a continuous series of words
(shingles) of every conceivable length in order to perform fingerprinting. Consider
two documents with the respective content listed below as an example.

Document 1: I love machine learning.

Document 2: I love artificial intelligence.

Let’s take a look at each sequence for the shingle length two for the two texts
mentioned above.

Table: Sequences of length two

Document 1: “I love”, “love machine”, “machine learning”
Document 2: “I love”, “love artificial”, “artificial intelligence”
We can easily observe that only one out of three sequences matches using the
sequences listed in Table (Sequences of length two). This is used to compare two
documents for similarities. Even though this method is quite accurate, it is rarely
employed because it is ineffective for manuscripts with a lot of words.
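The shingle-based measures can be made concrete with a short Python sketch (illustrative only, using the two example documents above); it computes the 2-word shingle sets and the resemblance and containment values defined earlier:

def shingles(text, L=2):
    # Build the set of continuous L-word sequences of a text
    words = text.lower().replace(".", "").split()
    return {" ".join(words[i:i + L]) for i in range(len(words) - L + 1)}

doc1 = "I love machine learning."
doc2 = "I love artificial intelligence."

s1, s2 = shingles(doc1), shingles(doc2)
resemblance = len(s1 & s2) / len(s1 | s2)   # shared shingles / all shingles
containment = len(s1 & s2) / len(s1)        # shared shingles / shingles of doc1

print(sorted(s1), sorted(s2))
print(resemblance, containment)   # 1 shared shingle out of 5 -> 0.2; containment = 1/3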

Web Usage Mining



Web usage mining is the process of obtaining valuable information from log data
about how users interact with websites. Web usage mining’s primary goal is to foresee
user behaviour when they engage with a website in order to improve consumer
focus, monetization, and commercial strategies. For instance, it would be far more
advantageous to invest in Facebook ads than it would be to invest in Twitter ads if the
bulk of visits to a particular web page were coming from Facebook pages rather than
Twitter. Information like outlined in the table below (Important parameters for web usage
mining) is typically gathered in order to execute web usage mining.

In order to uncover hidden knowledge, the data gathered above is processed
using association mining or clustering. Discovering the relationships between web
pages on a website via association mining is possible. Log data may be transformed
into transactions similar to those used in market basket analysis, where each node
visited is treated as an item being purchased. However, difficulties arise because it is
common for users to return to a node while web traversing, whereas in market basket,
no equivalent structure is possible.

Such analysis can unearth knowledge that is buried. For instance, data analysis
may reveal that, with a 75% or higher certainty, every time a visitor views page A, they
also visit page B. These kinds of associations are helpful because it may be possible to
rearrange pages A and B so that the data the user is looking for on page B is actually
made available on page A, making page A more customer-centric.
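As an illustrative sketch of this kind of analysis (hypothetical log data, not the text’s own example), the following Python snippet turns per-session page visits into transactions and computes the support and confidence of the rule “page A ⇒ page B”:

# Hypothetical sessions: each set lists the pages visited in one session
sessions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"A", "B", "D"},
    {"B", "D"},
]

n = len(sessions)
count_A = sum("A" in s for s in sessions)
count_AB = sum("A" in s and "B" in s for s in sessions)

support = count_AB / n           # fraction of sessions containing both A and B
confidence = count_AB / count_A  # of the sessions with A, how many also contain B

print(support, confidence)       # 0.6 and 0.75: visitors of A also visit B 75% of the time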

Table: Important parameters for web usage mining


Web Structure Mining

The goal of web structure mining is to extract information from the web’s
hyperlinked structure. Ranking web sites and recognising the pages that operate as
authorities on particular sorts of information are both significantly influenced by the
web structure. Web structure is used to find hubs, or the websites that link to multiple
authority websites, in addition to locating authority sites.

The HITS algorithm, which analyses web structure to identify hubs and authorities, is presented next. The PageRank algorithm, which also employs web structure to rank web pages, will be discussed later.

●● Hyperlink Induced Topic Search (HITS) algorithm: The HITS algorithm, sometimes
referred to as hubs and authorities, examines the web’s hyperlinked structure to
rank web pages. In 1999, Jon Kleinberg created it. It was utilised when web page
directories were commonly used and the Internet was just getting started.
The hub and authority concepts form the foundation of HITS. HITS is predicated
on the idea that websites acting as directories are not inherently experts on any subject
matter, but rather serve as a hub for directing users to other websites that may be more
qualified to provide the needed information.

Let’s use an illustration to better grasp the distinction between these two concepts.
A hyperlinked web structure is shown in Figure (Hubs and Authority), where H1, H2,
and H3 represent search directory pages, and A, B, C, and D represent web pages to
which the directory pages have outbound links. In this case, pages like H1, H2, and H3
serve as information hubs. They don’t actually retain any information, but instead they
direct users to other pages that, in their opinion, contain the necessary data. In other
words, the outbound hyperlinks from hubs vouch for the legitimacy of the page they are
directing users to. A good authority is a page that is linked by many distinct hubs, and
a good hub is a page that points to many other pages. Formally, page H1 gives page A
some authority if it links to page A on the internet.

Figure: Hubs and Authority

Let’s look at an illustration to better grasp how the HITS algorithm functions. The
web page structure is depicted in the Figure (Example of Web page structure) below,
where each node stands in for a different web page and where arrows indicate linkages
between the vertices.

Figure: Example of Web page structure

Step 1: For future calculations, present the supplied web structure as an adjacency
matrix. Assume that the necessary adjacency matrix is A, as displayed in Figure B.



Figure B: Adjacency matrix representing web structure shown in (Fig A)



Step 2: Prepare matrix A’s transposition. Figure C includes it.



Figure C: Transpose Matrix of A



Step 3: Assuming the initial hub weight vector is a vector of all 1s, the authority weight vector is obtained by multiplying the transposed matrix A by the initial hub weight vector, as shown in Figure D below.

Figure D: Obtaining the Authority Weight Matrix

Step 4: Multiply the authority weight matrix produced in step 3 by the adjacency
matrix A to determine the updated hub weight vector.

Figure E: Updated Hub Weight Vector
The graph in Figure (Example of web page structure) can be updated with hub and
authority weights in accordance with the calculations made above, and can then be
displayed as shown in Figure F below.

Figure F: Web page structure with Hub and Authority Weights

The HITS algorithm has now gone through one iteration. Repeat steps 3 and 4 to
get updated authority weight vectors and updated hub vector values for considerably
more precise results.

We can sort hubs and authorities based on the aforementioned computations,
and we can display authorities in search results in decreasing order of authority weight
value. For instance, according to Figure F, page N4 has the highest authority for a
certain term because it is related to the majority of highly ranked hub pages.
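The iteration described above can be written compactly in Python with NumPy (an illustrative sketch; the adjacency matrix below is a made-up example, not the one in the figures). Authority scores are updated as a = A^T h and hub scores as h = A a, normalising at each step:

import numpy as np

# Hypothetical adjacency matrix: A[i, j] = 1 if page i links to page j
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

hub = np.ones(A.shape[0])          # start with all hub weights equal to 1
for _ in range(20):                # repeat steps 3 and 4 for more precise results
    authority = A.T @ hub          # authority weight vector
    hub = A @ authority            # updated hub weight vector
    authority /= np.linalg.norm(authority)   # normalise to keep values bounded
    hub /= np.linalg.norm(hub)

print(np.round(authority, 3))      # pages with the largest values act as authorities
print(np.round(hub, 3))            # pages with the largest values act as hubs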

Applications of Web Mining:



●● By categorising web content and locating web pages, web mining increases the
power of web search engines.

●● It is utilised for vertical search, such as FatLens, Become, and web search, such
as Google, Yahoo, etc.
●● User behaviour is predicted using web mining.

●● For a specific website and e-service, such as landing page optimization, web
mining is quite helpful.

5.2 Multimedia Web Mining
Transmitting and storing vast volumes of multi/rich media data (such as text,
photos, music, video, and their combination) is now more practical and economical

than ever thanks to recent advancements in digital media technologies. Modern
methods for processing, mining, and managing those rich media are still in their infancy.
Rapid advancement in multimedia gathering and storage technologies has resulted

in an extraordinary amount of data being kept in databases, which is now rising at an
incredible rate.

These multimedia files can be evaluated to disclose information that will be useful

to users. Extraction of implicit knowledge, links between multimedia data, or other
patterns that aren’t expressly stored in multimedia files is the subject of multimedia
mining. The system’s overall effectiveness depends on the efficient information fusion of

the many modalities used in multimedia data retrieval, indexing, and classification.
The World Wide Web is a popular and interactive tool for information selection
nowadays. Due to the web’s size, diversity, and dynamic nature, challenges with
scalability, multimedia data, and timing arise. Almost all industries in business,
research, and engineering require the ability to comprehend huge, complicated,
information-rich data sets. In today’s competitive environment, the capacity to extract


usable knowledge from this data and to act on that knowledge is becoming increasingly
crucial.
Web mining is the full process of using a computer-based technology to find and
extract knowledge from web resources. Web multimedia mining naturally sits at the
nexus of various multi-discipline study areas, including computer vision, multimedia
processing, multimedia retrieval, data mining, machine learning, databases, and
artificial intelligence.

A multimedia database management system is the framework that controls how


various kinds of multimedia data are provided, saved, and used. Multimedia databases
can be divided into three categories: dimensional, dynamic, and static. The Multimedia
Database management system’s content is as follows:

◌◌ Media Data: Actual data used to depict an object is called media data.

◌◌ Media Format Data: Data concerning the format of the media after it has
undergone the acquisition, processing, and encoding stages, including
sampling rate, resolution, encoding method, etc.
◌◌ Media keyword data: Keywords that describe how data is produced. It also
goes by the name of “content descriptive data.” Example: the recording’s date,
time, and location.


◌◌ Media Feature Data: Information that depends on the content, such as the distribution of colours, types of textures, and shapes.

5.2.1 Multidimensional Analysis of Multimedia Data

Multimedia data cubes can be generated and built similarly to traditional data
cubes using relational data in order to enable the multidimensional analysis of big

multimedia datasets. Additional dimensions and measurements for multimedia
information, such as colour, texture, and shape, can be included in a multimedia data
cube.

Consider MultiMediaMiner, a prototype multimedia data mining system that extends the DBMiner system to handle multimedia data. The following describes how the example database used to test the MultiMediaMiner system was built. A feature
descriptor and a layout descriptor are both present in each image. Only the image’s

descriptors are kept in the database; the original image is not. The image file name,
image URL, image type (such as gif, jpeg, bmp, avi, or mpeg), a list of all known Web
pages relating to the picture (i.e., parent URLs), a list of keywords, and a thumbnail

used by the user interface for browsing images and videos are all included in the
description information.

Each visual attribute is represented by a set of vectors in the feature descriptor.


The essential vectors are an MFC (Most Frequent Color) vector, an MFO (Most
Frequent Orientation) vector, and a colour vector with the colour histogram quantized
to 512 colours (8 x 8 x 8 for R x G x B). The five most frequent colours and edge
orientations are represented by five colour centroids and five edge orientation centroids
in the MFC and MFO, respectively. The edge orientations utilised are 0°, 22.5°, 45°,
67.5°, 90°, and so on. A colour layout vector and an edge layout vector are both present
in the layout description. No matter how big they were originally, each image is given an
8 X 8 grid. The edge layout vector stores the number of edges for each orientation in
each of the 64 cells, whereas the colour layout vector stores the most prevalent colour
U

for each of the 64 cells. It is simple to generate grids in other sizes, such as 4 × 4, 2 x 2,
and 1 x 1.
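As a rough illustration of such feature descriptors (an illustrative sketch, not the MultiMediaMiner code), the following Python snippet quantises an RGB image to 512 colours (8 x 8 x 8) and derives the five most frequent colours (an MFC-style vector) plus an 8 x 8 colour-layout grid using NumPy:

import numpy as np

# Hypothetical image: height x width x RGB, values 0-255
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Quantise each channel to 3 bits -> 8 x 8 x 8 = 512 colour bins
q = img // 32                                   # values 0..7 per channel
codes = q[..., 0] * 64 + q[..., 1] * 8 + q[..., 2]

# MFC-style vector: the five most frequent quantised colours
counts = np.bincount(codes.ravel(), minlength=512)
mfc = np.argsort(counts)[::-1][:5]

# Colour layout: most frequent colour in each cell of an 8 x 8 grid
h, w = codes.shape
layout = np.zeros((8, 8), dtype=int)
for i in range(8):
    for j in range(8):
        cell = codes[i*h//8:(i+1)*h//8, j*w//8:(j+1)*w//8]
        layout[i, j] = np.bincount(cell.ravel(), minlength=512).argmax()

print(mfc)      # indices of the five most frequent colours
print(layout)   # 8 x 8 grid of dominant colour codes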

Similar to HTML tags in Web pages, the Image Excavator part of MultiMediaMiner
leverages contextual information about images to derive keywords. It is feasible to build


hierarchies of keywords mapped onto the directories where the image was discovered
by navigating online directory structures like the Yahoo! directory. In the multimedia data
cube, these graphs serve as concept hierarchies for the dimension keyword.

Multimedia data cubes can have many dimensions. These are a few instances: the Internet domain of the image or video, the Internet domain of pages referencing the image or video (parent URL), the keywords, a colour dimension, an edge-orientation
dimension, and so forth. The size of the image or video in bytes, the width and height
of the frames (or picture), constituting two dimensions, the date the image or video was
created (or last modified), the format type of the image or video, the frame sequence
duration in seconds, and so forth. Concept hierarchies can be automatically defined
for numerous numerical dimensions. Predefined hierarchies may be utilised for various
dimensions, like Internet domains or colour.


The creation of a multimedia data cube will make it easier to summarise, compare,
categorise, associate, and cluster knowledge as well as perform multiple-dimensional
analysis of multimedia data that are predominantly based on visual content.

An intriguing model for the multidimensional analysis of multimedia data appears
to be the multimedia data cube. With so many dimensions, it is important to keep in
mind that implementing a data cube effectively might be challenging. In the case of

multimedia data cubes, this dimensionality plague is very severe. Color, orientation,
texture, keywords, and other characteristics may be represented as several dimensions
in a multimedia data cube. Many of these traits, however, are set-oriented rather than
single-valued.

One image, for instance, might be associated with a list of keywords. It might
include a collection of things, each connected to a spectrum of colours. In the design
of the data cube, there will be a tremendous number of dimensions if we utilise each

term as a dimension or each finely detailed colour as a dimension. However, failing to
do so could result in the modelling of a picture at a relatively crude, constrained, and
inaccurate scale. A multimedia data cube’s potential to establish a balance between

effectiveness and representational strength requires further study.

A substantial part in separating valuable data from a mass of huge data is played
by data mining. We can determine the relationship between different data sets by
examining the patterns and idiosyncrasies. When unprocessed raw data is transformed
into meaningful information, it can be used to advance a variety of industries on which
we rely every day.

5.2.2 Applications in Telecommunications Industry



Role of Data Mining in Telecommunication Industries


The telecommunications business plays a vital role in processing enormous data
sets of customers, network, and call data in the continuously dynamic and competitive
environment. The telecommunications industry needs to develop a simple method of
handling data if it is to succeed in such a setting. In this area, data mining is chosen
to advance operations and address issues. Identification of scam calls and identifying
network flaws to isolate the problems are two of the main functions. Additionally, data
mining helps improve efficient marketing strategies. In any case, this industry faces
difficulties when it comes to handling the logical and time aspects of data mining, which
necessitates the requirement to anticipate rarity in telecommunication data in order to
discover network errors or buyer frauds in real-time.

Call Detail Data:



The specifics of every call that originates in the telecommunications network are logged: the date and time it occurs, the length of the call, and the time it concludes. Since a call’s entire data set is gathered in real-time, data mining techniques
can be applied to process it. But rather than separating data by isolated phone call
levels, we should do so by client level. Thus, the consumer calling pattern can be
discovered by effective data extraction.


Some of the data that can be used to identify patterns include


◌◌ average call time duration
◌◌ Whether the call occurred throughout the day or night

◌◌ The usual volume of calls during the week
◌◌ calls made using different area codes

◌◌ daily number of calls, etc.
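A minimal sketch of deriving such customer-level features from raw call records (hypothetical column names, assuming the pandas library is available) might look like this:

import pandas as pd

# Hypothetical call detail records: one row per call
calls = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "start_hour":  [10, 14, 11, 22, 23],      # hour of day the call began
    "duration_min": [5.0, 12.0, 3.0, 30.0, 25.0],
    "area_code":   ["020", "011", "020", "020", "020"],
})

features = calls.groupby("customer_id").agg(
    avg_duration=("duration_min", "mean"),                        # average call time duration
    daytime_share=("start_hour", lambda h: h.between(9, 18).mean()),  # day vs night calling
    distinct_area_codes=("area_code", "nunique"),                 # calls to different area codes
    total_calls=("start_hour", "size"),                           # call volume
)
print(features)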
One can advance the expansion of their firm by picking up on the right consumer
call information. A consumer can be identified as a member of a commercial firm

if they place more calls during daytime business hours. If there are a lot of calls at
night, only domestic or residential uses are permitted. One can distinguish between
commercial and residential calls using the area code’s frequent variation because calls
for residential purposes occasionally span many area codes. However, the information

gathered in the evening cannot provide a precise indication of whether the customer is
a member of a commercial or residential enterprise.

Data of Customers:

There would be a sizable number of clients in the telecommunications sector. This
client database is kept around for any additional data mining queries. For instance,

these client details would assist in the identification of the individual with the details in
the customer database, such as name and address of the person, when a customer
fraud case is discovered. Finding them and resolving the problem would be simple.
This dataset can also be derived from outside sources because the information is
typically widespread. Additionally, it contains the subscription plan selected and a
complete payment history. We can accelerate the expansion of the telecommunications
businesses by using this dataset.

Network Data:

Every component of the system has the potential to produce problems and alerts
since telecommunication networks use highly developed, complicated equipment. As a
result, a significant volume of network data must be processed. In the event that the
system isolates a network issue, this data must be divided, sorted, and stored in the
correct sequence. This guarantees that the technical specialist will receive any error or
status messages from any component of the network system. They could therefore fix
it. Since the database is so huge, it becomes challenging to manually fix issues when a
lot of status or error messages are created. In order to ease the burden, some sets of
errors and notifications can be automated. A methodical data mining methodology may
effectively manage the network system, which can improve functionality.

Preparing and Clustering Data:



Although data mining processes raw data, the data must first be well organised and logically structured before it can be processed. This is a crucial requirement in the telecommunications sector, which deals with enormous databases. To avoid inconsistency, it is first necessary to identify conflicting and contradictory facts and to eliminate unnecessary data fields that waste space. To prevent duplication, the data needs to be mapped and structured by identifying the connections across databases.

Algorithms in the field of data mining can cluster or group related data. Analysis
of patterns such as call patterns or consumer behaviour patterns may be aided by it. Groups are formed by examining the similarities among data items. This makes the data simple to comprehend, which facilitates its processing and usage.

Customer Profiling:
The telecommunications sector handles a vast array of client information. It begins

by looking for patterns in call data about the consumer to profile them and forecast
future trends. The corporation can choose the customer promotion strategies by
understanding the customer trend. For example, if a call falls within a specific area code, a group of clients could be attracted by a promotion offered in that area. This can effectively
monetize the promotion strategies and save the business from investing in a single
subscriber, but with the correct strategy, it can draw a crowd. When a customer’s call
history or other information is tracked, privacy concerns arise.

Customer turnover is one of the major issues the telecommunications sector has to deal with. This is also known as customer churn, in which a client leaves a business. In this instance, the client resigns and chooses a different telecommunications provider.
A company will suffer a significant loss of income and profit if its customer turnover rate
is high, which would slow down its growth. By using data mining tools to profile and
collect client trends, this problem can be resolved. Companies’ incentive offerings draw
loyal customers away from their competitors. By profiling the data, the customer churn
can be accurately predicted by their actions, such as their history of subscriptions, the
plan they choose, and so on. In addition to collecting data from paying customers, there
are some restrictions on collecting data from recipients or non-customers.
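As a hedged sketch of how such churn profiling could be modelled (illustrative feature names and data, assuming the scikit-learn library), a simple classifier can be trained on past customer behaviour and used to score current subscribers:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer profiles: [months subscribed, monthly bill, complaints]
X = [[24, 30, 0], [3, 60, 4], [36, 25, 1], [2, 80, 5], [18, 40, 0], [4, 70, 3]]
y = [0, 1, 0, 1, 0, 1]   # 1 = customer churned, 0 = customer stayed

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Score a current subscriber to estimate churn risk
print(model.predict([[5, 75, 2]]))          # predicted class
print(model.predict_proba([[5, 75, 2]]))    # estimated churn probability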

By using data mining tools to gather client data and model behaviour such as call details, as previously stated, fraudulent actions can also be reduced. The
frauds are simple to spot when data detection is done in real-time. The report of the
suspected call behaviour can also be compared to the generic fraud characteristics
to accomplish this. They can be identified if the call pattern resembles that of generic
frauds. This fraud detection procedure can be improved by gathering data from the
customer level rather than data at the individual user level. When frauds are incorrectly
classified, the company may suffer losses. They must therefore be aware of the relative
cost of dropping a phoney call and blocking a person suspected of engaging in fraud
using a legitimate account. Accurately addressing this issue would be made possible
with the proper application of data mining.

5.2.3 Applications in Retail Marketing

Role of Data Mining in Retail Industries



The consumption of goods in the dynamic and quickly expanding retail sector rises
daily, increasing the amount of data that is gathered and utilised. The selling of products
to consumers by retailers is included in the retail industry. It ranges from a little stand on
the street to huge malls in urban areas. For instance, the proprietor of a grocery store
in a specific area might be aware of their customers’ after-sales information for a few
months.

It would be simple to increase sales if he takes note of the needs of his customers.


The major retail industries experience the same thing. Customers’ opinions of a
product, their location, their time zone, their shopping cart history, etc. are all collected.
The corporation can produce customised advertisements based on customer
preferences to boost sales and profits.

Knowing the Customers:
If a retailer doesn’t know who their consumers are, what good are sales? There

is unquestionably a requirement to comprehend their customers. It begins by running
them through a number of analyses. By identifying the source via which the buyer
learns about that retailing platform, retailers can improve their advertising and draw in

an entirely other demographic. Identifying the days on which they often shop can aid in
special promotions or boosts during festival days.

We may improve growth by using the amount of time customers spend purchasing

for each order. The retailer can divide the customer base into groups of high-paid
orders, medium-paid orders, and low-paid orders based on the amount of money spent
on the order. According to pricing, this will either help introduce personalised packages

or enhance the number of targeted clients. Retailers may please clients by offering the
necessary services by understanding the language and preferred payment methods
of their customers. Maintaining positive customer interactions can increase trust and
loyalty, which can result in quick financial gains for the shop. Customer loyalty will assist
their business survive competition from other, similarly sized businesses.
RFM Value:
RFM stands for Recency, Frequency, and Monetary Value. Recency refers to how recently the consumer made a transaction, frequency
is the frequency of the purchase, and monetary value is the sum of the consumers’
purchases. By retaining current and new clients by providing them with outcomes they
are delighted with, RFM may increase its revenue. Additionally, it can aid in bringing
back trailing clients who frequently make smaller purchases. The rise of sales is
inversely correlated with RFM score. RFM also helps to apply new marketing strategies
to low-ordering clients and avoids requests from being sent to customers who are
already engaged. RFM aids in the discovery of original solutions.
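A small illustrative sketch (hypothetical column names, assuming the pandas library) shows how R, F, and M values can be computed per customer and combined into a simple score:

import pandas as pd

# Hypothetical transaction history
tx = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "C"],
    "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2023-11-20",
                            "2024-02-10", "2024-02-25", "2024-03-05"]),
    "amount": [50, 20, 200, 30, 45, 25],
})
today = pd.Timestamp("2024-03-10")

rfm = tx.groupby("customer").agg(
    recency=("date", lambda d: (today - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                          # number of purchases
    monetary=("amount", "sum"),                           # total amount spent
)

# Rank each component (higher rank = better) and add the ranks into one score
rfm["score"] = (
    rfm["recency"].rank(ascending=False)   # a more recent purchase scores higher
    + rfm["frequency"].rank()
    + rfm["monetary"].rank()
)
print(rfm)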

Market-based Analysis:
Market-based (market basket) analysis is a method for examining and analysing a customer’s shopping behaviour in order to boost income and sales. This is accomplished by
studying datasets of a certain client to learn about their purchasing patterns, commonly
purchased things, and combinations of items.

The loyalty card that customers receive from the shop is a prime example. The
card is required from the perspective of the client to keep track of future discounts,
information regarding incentive criteria, and transaction history. However, if we look at
this loyalty card from the retailer’s perspective, market-based analysis programmes will
be layered within to gather information about the transaction.
Various algorithms or data science techniques can be used to perform this


analysis. Even without technological know-how, this is possible. To assess consumer
purchases, commonly purchased or grouped items, Microsoft Excel platform is


employed. The IDs assigned to various transactions can be used to arrange the
spreadsheets. This analysis aids in recommending to the customer goods that would
go well with their present purchase, resulting in cross-selling and increased earnings.
Monitoring the rate of purchases every month or year is also helpful. It indicates when

the shop should make the required offers in order to draw in the proper clients for the
desired goods.
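As a minimal sketch of this kind of analysis in code rather than a spreadsheet (hypothetical transactions, plain Python), co-occurrence counts across transaction IDs already reveal which items tend to be bought together:

from itertools import combinations
from collections import Counter

# Hypothetical transactions keyed by transaction ID
transactions = {
    101: {"bread", "butter", "milk"},
    102: {"bread", "butter"},
    103: {"milk", "eggs"},
    104: {"bread", "butter", "eggs"},
}

pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Pairs bought together most often are candidates for cross-selling offers
print(pair_counts.most_common(3))   # e.g., ('bread', 'butter') appears in 3 of 4 baskets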

Potent Sales Campaign:
Nowadays, advertising is a must, because product advertising spreads awareness of a product’s availability, benefits, and characteristics. It transports the item out of the
warehouse and into the world. Data must be examined if it is to draw in the correct
clients. The retailers’ call to action or marketing campaign is the proper one. If the
marketing campaigns are not launched with the proper planning, excessive spending on
untargeted advertisements could result in a company’s demise. The sales campaign is

based on the customer’s preference, time, and location.

The campaign’s platform is another important factor in attracting the proper

customers. It necessitates routine study of the sales and related data that occur on a
specific platform at a specific period. The popularity of the promoted product will be
determined by the traffic on social or network platforms. With the help of the previous
statistics, the shop can alter the campaign in a way that quickly boosts sales profit and
discourages excess. Understanding customer and business revenues helps improve
the use of marketing. The shop can decide whether or not to invest in a campaign based on the volume of sales from that campaign. The effective management of data can transform a trial-and-error process into a well-informed method. A multi-channel
sale campaign also boosts sales analysis and increases revenue, profit, and customer
count.

5.2.4 Applications in Target Marketing


U

Target marketing is a strategy for luring clients who are thought to be propensity
buyers. Today’s world of escalating demands and fierce competition has made target
marketing essential. Making marketing strategies more customer-focused is crucial for
ity

product development. Finding high-profit, low-risk clients, keeping current customers,


and bringing in new ones are essential for a successful firm.

In order to develop fresh campaigns for their current clients, businesses must first
conduct a thorough customer analysis. Businesses can cluster or group specific clients
m

who share characteristics. This could help businesses develop more effective marketing
plans for particular customer segments.

For the organisation to attract new clients, increasing leads is a crucial duty. Lead
)A

generation is the process by which a business obtains prospective clients’ contact


information. Following lead generation, sales and marketing collaborate to turn leads
into clients.

We suggest a method for finding trends and identifying client traits using data
mining techniques in order to improve customer satisfaction and create marketing
(c

plans that will boost revenue. Since there are many various types of clients, each with
a unique set of wants and preferences, market segmentation is essential: divide the

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 259

overall market, select the best segments, and develop profitable business plans that
Notes

e
serve the selected group better than the competition does. In order to help businesses
sort through layers of data that at first glance appear to be unrelated in search of
significant relationships, data mining technologies and procedures can be used to

in
identify and track trends within the data.

5.3 Applications in Industry

nl
Data is a collection of distinct, objective facts about a process or an event that,
by themselves, are not very useful and must be transformed into information. From

O
straightforward numerical measures and text documents to more intricate information
like location data, multimedia channels, and hypertext texts, we have been gathering a
wide range of data.

These days, massive amounts of data are being gathered. According to reports,

ty
the amount of data gathered nearly doubles each year. Data mining techniques
are used to extract data or seek insight from this huge amount of data. Nearly every
location that stores and processes a lot of data uses data mining. For instance, banks

si
frequently employ “data mining” to identify potential clients who might also be interested
in insurance, personal loans, or credit cards. Banks examine all of this data and look
for patterns that can help them forecast that specific consumers could be interested
r
in personal loans, etc. as they have complete profiles of their clients and transactional
information.
ve
The urge to uncover relevant information in data to support better decision-making
or a better knowledge of the world around us is the fundamental driving force behind
data mining, whether it is for commercial or scientific purposes.
ni

Technically speaking, data mining is the computer process of examining data from
many angles, dimensions, and perspectives before classifying or summarising it to
produce useful information. Any form of data, such as that found in data warehouses,
U

relational databases, multimedia databases, spatial databases, time-series databases,


and the World Wide Web, can be mined using data mining techniques.

In the information economy, data mining gives competitive advantages. This is


ity

accomplished by giving users the most information possible to quickly come to informed
business decisions despite the vast amount of data at their disposal.

Data mining has produced numerous quantifiable advantages in a variety of


application fields.
m
)A
(c

Amity Directorate of Distance & Online Education


260 Data Warehousing and Mining

5.3.1 Mining in Fraud Protection


Notes

e
Fraud Detection:

in
Fraud is a serious issue for the telecommunications sector since it results in lost
income and deteriorates consumer relations. Subscription fraud and superimposed
scams are two of the main types of fraud involved. The subscription scam involves
gathering client information, such as name, address, and ID proof information, primarily

nl
via KYC (Know Your Customer) documentation. While there is no purpose to pay for
the service by utilising the account, these details are required to register for telecom
services with authenticating approval. In addition to continuing to use services illegally,

O
some offenders also engage in bypass fraud by switching voice traffic from local to
international protocols, which costs the telecommunications provider money.

Superimposed frauds begin with a legitimate account and lawful activity, but they

ty
later result in overlapped or imposed behaviour by someone else who is utilising the
services illegally rather than the account holder. However, by observing the account
holder’s behaviour patterns, if a suspect is discovered engaging in any fraudulent

si
activities, it will prompt immediate actions such as barring or deleting the account user. This will
stop the company from suffering more harm.

Any unapproved operation on a digital network is referred to as a network intrusion.


r
Theft of priceless network resources is a common aspect of network invasions. The
ve
process of data mining is essential for looking for abnormalities, network attacks, and
intrusions. These methods assist in choosing and enhancing pertinent and usable
facts from enormous data collections. The classification of pertinent data for intrusion
detection systems is aided by data mining techniques. Network traffic is alerted by the
intrusion detection system to external intrusions in the system. For instance:
ni

◌◌ Detect security violations


◌◌ Misuse Detection
U

◌◌ Anomaly Detection
ity
m
)A
(c

The following is a list of possible applications for data mining technology in


intrusion detection:

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 261

◌◌ creation of an algorithm for data mining to detect intrusions.


Notes

e
◌◌ To choose and create discriminating qualities, association and correlation
analysis, and aggregation are used.

in
◌◌ data analysis for streams.
◌◌ spread-out data mining.
◌◌ tools for querying and visualisation.

nl
5.3.2 Mining in Healthcare
Many industries successfully employ data mining. It enables the retail and banking

O
sectors to monitor consumer behaviour and calculate customer profitability. Numerous
linked industries, including as manufacturing, telecom, healthcare, the auto industry,
education, and many more, are served by the services it offers.

ty
There are many opportunities for data-mining applications due to the volume of
data and issues in the healthcare industry, not to mention the pharmaceutical market
and biological research, and considerable financial advantages may be realised.

si
Thanks to the electronic preservation of patient records and improvements in medical
information systems, a lot of clinical data is now accessible online. In order to help
physicians improve patient care and make better judgments, data-mining tools are

r
helpful in spotting patterns, trends, and unexpected events.
ve
A patient’s health is regularly evaluated by clinicians. The analysis of a sizable
amount of time-stamped data will provide clinicians with knowledge on the course of
the disease. Therefore, systems capable of temporal abstraction and reasoning are
crucial in this scenario. Even though the use of temporal-reasoning methods requires a
ni

significant knowledge-acquisition effort, data mining has been used in many successful
medical applications, including data validation in intensive care, the monitoring of
children’s growth, the analysis of diabetic patients’ data, the monitoring of heart-
transplant patients, and intelligent anaesthesia monitoring.
U

Data mining is widely used in the medical industry. Data visualisation and artificial
neural networks are two particularly important data mining applications for the medical
sector. For instance, NeuroMedical Systems used neural networks to provide a pap
ity

smear diagnostic assistance. The Vysis Company uses neural networks to analyse
proteins in order to create novel medications. The Oxford Transplant Center and the
University of Rochester Cancer Center use Knowledge Seeker, a decision-tree-based
system, to help in their oncology research.
m

Over the past ten years, biomedical research has expanded significantly, from the
identification and study of the human genome to the development of new drugs and
advancements in cancer therapies. The goal of researching diseases’ genetic causes
)A

is to better understand their molecular causes so that targeted medical interventions


can be developed for disease diagnosis, therapy, and prevention. The development
of novel pharmaceutical products that can be used to treat a range of ailments, from
various cancers to degenerative disorders like Alzheimer’s disease, makes up a sizable
percentage of the effort.
(c

Biomedical research has placed a lot of emphasis on DNA data processing,


and the results have helped pinpoint the hereditary causes of many diseases and

Amity Directorate of Distance & Online Education


262 Data Warehousing and Mining

disabilities. DNA sequences are a crucial topic of study in genome research since
Notes

e
they constitute the fundamental components of all living creatures’ genetic blueprints.
What is DNA? Deoxyribonucleic acid (DNA) is the basis of all living things.

in
The main technique through which we can transmit our genes to future generations
is through DNA. The instructions that tell cells how to behave are encoded in DNA.
DNA is made up of sequences that make up our genetic blueprints and are crucial for

nl
understanding how our genes work. The components that each gene is made up of
are called nucleotides. These nucleotides combine to form long, twisted, linked DNA
sequences or chains. Unraveling these sequences has been more challenging since
the 1950s, when the DNA molecule’s structure was first established. Theoretically, by

O
understanding DNA sequences, we will be able to identify and anticipate defects, weak
points, or other features of our genes that could affect our lives. It may be possible to
create cures for diseases like cancer, birth defects, and other harmful processes if DNA

ty
sequences are better understood. Data-mining tools are simply one tool in a toolbox
for understanding different types of data, and the employment of classification and
visualisation techniques is crucial to these processes.

si
About 100,000 genes, each of which contains DNA encoding a unique protein with
a particular function or set of functions, are thought to exist in humans. Several genes
that control the creation of haemoglobin, the regulation of insulin, and vulnerability to

r
Huntington’s chorea have recently been identified. To construct distinctive genes,
nucleotides can be ordered and sequenced in an apparent limitless number of
ve
different ways. A single gene’s sequence may have hundreds of thousands of unique
nucleotides arranged in a particular order.

Additionally, the DNA sequencing method used to extract genetic information from
ni

cells and tissues frequently results in the creation of just gene fragments. It has proven
difficult to ascertain how these fragments fit into the overall complete sequence from
which they are derived using conventional methods. It is a difficult effort for genetic
experts to decipher these sequences and establish theories about the genes they might
U

belong to and the disease processes they might govern. One may equate the task
of choosing good candidate gene sequences for more research and development to
finding a needle in a haystack. There could be hundreds of possibilities for each ailment
ity

being studied.

In order to continue developing, businesses must decide which sequences to


concentrate on. What criteria do they use to determine which might be good therapeutic
targets? This approach has traditionally depended primarily on trial and error. For each
m

lead that finally leads to a pharmaceutical intervention that is effective in clinical settings
and yields the required results, there are dozens of leads. To improve the efficacy of
these analytical techniques, this area of study urgently needs a breakthrough. Data
mining has become a powerful platform for further research and discovery in DNA
)A

sequences thanks to the development of pattern analysis, data visualisation, and


similarity search algorithms.

A different approach to mine the data in healthcare:

The threesystem approach is the finest method for expanding data mining beyond
(c

the bounds of academic study. The best method to make a real-world improvement with
any healthcare analytics endeavour is to implement all three systems. Regrettably, only

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 263

a small number of healthcare institutions use all three of these platforms.


Notes

e
The following three systems are listed:

in
nl
O
ty
si
The Analytics System:
The analytics system includes the tools and knowledge needed to gather data,
analyse it, and standardise metrics. The system’s core is built upon the aggregation of
r
clinical, patient satisfaction, financial, and other data into an enterprise data warehouse
ve
(EDW).

The Content System:


The content system has a knowledge work standardisation component. It
ni

integrates care delivery with best practises supported by evidence. Every year,
substantial advances in clinical best practise are made by scientists, but as was already
said, it sometimes takes a while for these advancements to be used in actual clinical
U

settings. Organizations can swiftly implement the newest medical standards thanks to a
robust content system.

The Deployment System:


ity

Driving change management over new hierarchical structures is part of the


deployment system. Implementing group structures that facilitate the consistent,
enterprise-wide adoption of best practises is particularly important. To encourage the
adoption of best practises throughout a company, a true hierarchical transformation is
m

necessary.

Application of Data Mining in Healthcare:


)A

Numerous industries have made extensive and frequent use of data mining. Data
mining is becoming more and more common in the healthcare industry. Applications
for data mining can be quite helpful for all parties involved in the healthcare sector.
Data mining, for instance, can benefit the healthcare sector by assisting with customer
relationship management, excellent patient care, best practises, and cost-effective
(c

healthcare services. Healthcare transactions create enormous amounts of data that are
too complex and vast for traditional processing and analysis techniques.

Amity Directorate of Distance & Online Education


264 Data Warehousing and Mining

The framework and methods for turning these data into meaningful information for
Notes

e
making data-driven decisions are provided by data mining.

in
nl
O
ty
si
Treatment Effectiveness:
Applications for data mining can be used to evaluate the efficacy of medical
interventions. By contrasting and contrasting causes, symptoms, and courses of
r
treatment, data mining can provide analysis of which course of action displays
ve
effectiveness.

Healthcare Management:
Applications for data mining can be used to identify and monitor patients in
ni

incentive care units, reduce hospital admissions, and support healthcare management.
Massive data sets and statistics are analysed using data mining to look for patterns that
could indicate a bioterrorist attack.
U

Customer Relationship Management:


Any organisation must interact with its management and customers in order
to accomplish its objectives. The main strategy for managing contacts between
ity

commercial organizations—typically banks and retail sectors—and their clients is


customer relationship management. In the context of healthcare, it is crucial. Call
centres, billing divisions, and ambulatory care facilities may all be the venues for
customer interactions.
m

Fraud and Abuse:


Applications for data mining fraud and abuse can concentrate on fraudulent
)A

insurance and medical claims, as well as improper or incorrect prescriptions.

Advantages of Data Mining in Healthcare:


The workflow of healthcare organisations is made simpler and more automated
by the data framework. Healthcare organisations can save time and effort on decision-
(c

making by integrating data mining into their data frameworks. The best informational
support and expertise are provided to healthcare professionals through predictive

Amity Directorate of Distance & Online Education


Data Warehousing and Mining 265

models. The goal of predictive data mining in medicine is to develop a predictive model
Notes

e
that is understandable, yields trustworthy predictions, and aids physicians in improving
their processes for diagnosing patients and formulating treatment plans. When there is
a lack of understanding about the relationship between different subsystems and when

in
conventional analysis methods are ineffective, as is frequently the case with nonlinear
associations, biomedical signal processing communicated by internal guidelines and
reactions to improve the condition is a crucial application of data mining.

nl
Challenges in Healthcare Data Mining:
The fact that the raw medical data is so large and diverse is one of the main

O
problems with data mining in healthcare. These facts can be gathered from various
sources.

For instance, from patient interviews, physician evaluations, and laboratory

ty
findings. All of these factors may have a big impact on how a patient is diagnosed and
treated. Data mining success is significantly hampered by missing, erroneous, and
inconsistent data, such as information stored in multiple formats from several data

si
sources.

Another issue is that practically all medical diagnoses and therapies are unreliable
and prone to errors. In this case, the study of specificity and sensitivity is taken into
r
account while measuring these errors. There are two main obstacles in the area of
ve
knowledge integrity evaluation:

How can we design algorithms that effectively distinguish the content of the before
and after versions?

For the evaluation of knowledge integrity in the data set, it presents a challenge
ni

and calls for the improvement of efficient algorithms and data structures.

How can algorithms be developed to assess the effects of certain data changes
U

on the statistical significance of individual patterns that are gathered with the aid of
fundamental types of data mining methods?

Though it is challenging to develop universal metrics for all data mining methods,
algorithms that quantify the influence that changes in data values have on the statistical
ity

significance of patterns are currently being developed.

5.3.3 Mining in Science


m

Large amounts of data kept in repositories can be automatically searched for


implicit patterns, correlations, anomalies, and statistical information using a method
called data mining. By using a hypothesis or theory to understand this data, forecasts
can be made. It is an interdisciplinary field that draws inspiration from a variety of
)A

mathematical and computer disciplines, such as statistics, machine learning, database


retrieval, optimization, and visualisation techniques. Relationships and trend-related
insights that cannot be found using simple query and reporting techniques can be
found using data mining. KDD, or knowledge data discovery, is frequently used
interchangeably with the word “data mining,” but it actually refers to a broader process
(c

of which mining is a part.

Amity Directorate of Distance & Online Education


266 Data Warehousing and Mining

Nowadays, science is becoming increasingly data-intensive. The Fourth Paradigm


Notes

e
is the term used to describe the revolutionary power that data science has given to
science.

in
Data is becoming more and more abundant, and its volume, velocity, and
authenticity are all increasing exponentially. Data mining has become an essential tool
for scientific research projects across a wide range of domains, from astronomy and

nl
bioinformatics to finance and social sciences, as a result of the proliferation of data that
is now available. This data is now too large in size and dimensionality to be directly
analysed by humans. The enormous amount of normally unintelligible scientific data
that is produced and stored every day can be exploited in data mining to draw relevant

O
findings and forecasts.

Data mining applications in science and engineering:

ty
◌◌ Data reduction: Scientific equipment, such as satellites and microscopes,
may quickly and efficiently collect terabytes of data and collect millions of data
points. The observations can be made simpler with a methodical, automated

si
approach without compromising the accuracy of the data. Using data mining
techniques, large databases can be effectively accessed by scientists.
◌◌ Research: The practise of extracting informative and user-requested
r
information from inconsistent and unstructured internet data is made simpler
by web data mining. Text data mining is the process of extracting structured
ve
data from text by employing techniques like natural language processing
(NLP). With the use of these tools, researchers can more quickly and
precisely locate current scientific data in literature databases.
◌◌ Pattern recognition: Due to high dimensionality, intelligent algorithms can identify patterns in datasets that people cannot. This can aid in finding anomalies (a short illustrative sketch is given after this list).
◌◌ Remote sensing: Aerial remote sensing photography can be used with data

mining techniques to automatically classify the land cover, and nighttime


remote sensing can be utilised to study many socioeconomic areas.
◌◌ Opinion mining: Opinion mining is a subfield of text mining, information retrieval, and natural language processing that involves extracting human thoughts and perceptions from unstructured texts in order to assess the feelings of social media users.
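The following minimal sketch, using made-up sensor readings and an arbitrarily chosen threshold, shows the flavour of the automated anomaly detection described above; real scientific pipelines use far more sophisticated, high-dimensional methods.

# Illustrative sketch (not from the text): flag anomalous observations in a
# stream of measurements using a simple z-score style rule.
import statistics

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.7, 10.1, 9.7]   # hypothetical values
mean = statistics.mean(readings)
std = statistics.stdev(readings)

# 2.5 standard deviations is a tunable choice, not a universal constant
anomalies = [(i, x) for i, x in enumerate(readings) if abs(x - mean) > 2.5 * std]
print(anomalies)   # [(6, 25.7)] -> the out-of-pattern observation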

Application area of Data Mining Techniques:



●● High Energy Physics: The Large Hadron Collider experiments that simulate
collisions in accelerators and detectors produce petabytes of data that must be
stored, calibrated, and reconstructed before analysis. Data reduction strategies

are used by the Worldwide LHC Computing Grid to deal with the volume. An open-
source data mining tool known as ROOT is specialised high-performance software
that makes it easier to conduct scientific investigations and visualise massive
amounts of data.

●● Astronomy: Data mining methods are utilised for star-galaxy separation,


galaxy morphology, and other types of classifications, and are used to classify


cosmological objects with completeness and efficiency. The template approach


or the empirical set training method are used to estimate redshifts for galaxies
and quasars from photometric data. Aside from these uses, data mining has
also been applied to astronomical simulations, the analysis of cosmic microwave

backgrounds, and the prediction of solar flares.
●● Bioinformatics: A science at the nexus of biology and information technology is

called bioinformatics. It is possible to mine the data produced by genomics and
proteomics research to identify patterns in sequences, predict protein structures,
annotate the genome, analyse gene and protein expression, model biological
systems, and investigate genetic pathways to better understand disease.

●● Healthcare: Useful data on patient demographics, treatment plans, cost, and
insurance coverage are produced by the healthcare sector. Studies already
conducted have documented the use of data mining in clinical medicine, the

detection of adverse drug reaction signals, and a focus on diabetes and skin
conditions. Regression, classification, sequential pattern mining, association,
clustering, and data warehousing are the most widely utilised mining methods in
this area.

●● Geo-Spatial Analysis: To reduce the effects of storm dust in arid regions, spatial
models of the locations that are sensitive to gully erosion, which leads to land
degradation, have been created utilising data mining methods, GIS, and R
programming.
5.3.4 Mining in E-commerce
With the growth of economic globalisation and trade liberalisation, computer networks have slowly seeped into all facets of everyday life, and e-commerce has emerged as a business platform. As a new business model, e-commerce has altered people's perceptions of traditional trade, business, and payment methods. It has also given the current business community new life and given the conventional business model a technological revolution. E-commerce has a pressing demand for data mining, a technology that analyses and refines data. Data mining lets businesses engaged in regular e-commerce handle and analyse a significant volume of online information more efficiently, and it offers decision-makers more accurate technological and information support for their business models and marketing strategies.

Data mining, the process of extracting hidden predictive information from sizable

databases, is a formidable new technology that has the potential to greatly assist
businesses in concentrating on the most crucial data in their data warehouses. Making
proactive, knowledge-driven decisions is possible for businesses thanks to data mining

tools that forecast upcoming trends and behaviours. Beyond the assessments of past
events provided by retrospective tools typical of decision support systems, data mining
offers automated, prospective studies. Business questions that were previously time-
consuming to resolve can now be answered by data mining techniques. They search
databases for hidden patterns and uncover predicted data that experts might miss

because it deviates from what they expect. The majority of businesses now gather and
process enormous amounts of data. In order to increase the value of currently available
information resources, data mining techniques can be quickly implemented on existing


software and hardware platforms. They can also be integrated with new products and
systems as they go online.

Data must first be arranged using database tools and data warehouses, after which a knowledge discovery tool is required. The process of retrieving non-obvious, practical information from sizable databases is known as data mining. With regard to helping businesses concentrate on using their data, this developing field offers a number of potent strategies. From very vast databases, data mining tools produce new information for decision-makers. Generating this new information relies on numerous techniques, including characterizations, summaries, aggregates, and abstractions of data. These forms are the outcome of the use of advanced modelling methods from various disciplines including statistics, artificial intelligence, database management, and computer graphics.

The majority of business tasks in highly competitive firms have altered due to e-commerce. The interface operations between customers and retailers, retailers and distributors, distributors and factories, and factories and their numerous suppliers are now effortlessly automated thanks to internet technologies. Online transactions have generally been made possible by e-commerce and e-business. Additionally, producing enormous amounts of real-time data has never been simpler. It only makes sense to use data mining to make (business) sense of these data sets because data relevant to diverse aspects of business transactions is easily available.
The following elements contribute significantly to a DM exercise's success:

◌◌ Availability of rich data descriptions: it is impossible to extract hidden patterns and links between different attributes unless the relations stored in the database are of a high degree.
◌◌ Availability of large amounts of data: the statistical significance of the discovered rules holds only when enough data is available. Without, say, at least 100,000 transactions, the usefulness of the rules produced from the transactional database will presumably decrease.
◌◌ Quality of data: even though a particular terabyte database may have hundreds of attributes for each relation, the DM algorithms applied to this dataset may fail if the data was entered manually and erroneously or had improper default values specified.
◌◌ Ease of calculating the return on investment (ROI) in DM: although the first three factors may be favourable, investment in DM activities would not be feasible unless a solid business case can be developed with ease. In other words, the domain of application must be quantified in order to determine the utility of the DM exercise.
◌◌ Ease of interface with legacy systems: large businesses frequently use a number of legacy systems that produce enormous amounts of data. System integration shouldn't be burdened by additional overheads from a DM activity, which is typically preceded by other exercises like extraction, transformation, and loading (ETL), data filtering, etc.


Data-Mining in E-Commerce
Data mining in e-commerce is a crucial method for repositioning the e-commerce
organisation to provide the enterprise with the necessary business information.

Most businesses now use e-commerce and have enormous data stored in their
data repositories. This data must be mined in order to enable business intelligence
or to improve decision-making, which is the only way to make the most of it. Before

becoming knowledge or an application in e-commerce data mining, data must go
through three key processes.

Benefits of Data Mining in E-Commerce

The term “application of data mining in e-commerce” refers to potential
e-commerce sectors where data mining could be applied to improve business
operations. It is commonly known that when customers visit an online site to make

purchases, they typically leave behind some information that businesses can store in
their database. These facts are examples of organised or unstructured data that can
be mined to provide the business a competitive edge. Data mining can be used in the
world of e-commerce for the advantage of businesses in the following areas:

Customer Profiling: In e-commerce, this is often referred to as a customer-oriented strategy. By mining client data for business intelligence, it enables businesses to plan their business activities and operations and to generate fresh research on products or services for successful e-commerce. Companies can save sales costs by grouping the clients with the greatest purchasing potential from the visiting data. Businesses can utilise user surfing information to determine whether a person is shopping on purpose, just browsing, or purchasing something they are accustomed to or something new. This aids businesses in planning and enhancing their infrastructure.
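As a rough illustration of grouping clients by purchasing potential, the sketch below scores hypothetical customers on recency, frequency, and monetary value (RFM); the customer records and cut-off values are invented for the example.

# Illustrative sketch: rank customers by a simple RFM score built from
# hypothetical visit data; thresholds are arbitrary illustration values.
from datetime import date

customers = {
    "C1": {"last_visit": date(2023, 3, 28), "orders": 12, "spend": 540.0},
    "C2": {"last_visit": date(2023, 1, 10), "orders": 2,  "spend": 60.0},
    "C3": {"last_visit": date(2023, 3, 30), "orders": 7,  "spend": 310.0},
}
today = date(2023, 4, 1)

def rfm_score(c):
    recency = (today - c["last_visit"]).days            # smaller is better
    r = 3 if recency <= 7 else (2 if recency <= 30 else 1)
    f = 3 if c["orders"] >= 10 else (2 if c["orders"] >= 5 else 1)
    m = 3 if c["spend"] >= 500 else (2 if c["spend"] >= 200 else 1)
    return r + f + m

ranked = sorted(customers, key=lambda k: rfm_score(customers[k]), reverse=True)
print(ranked)   # customers with the greatest purchasing potential first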

Personalization of Service: The act of providing content and services that are tailored to individuals based on knowledge of their needs and behaviour is known as personalization. Recommender systems and related topics like collaborative filtering have received the majority of attention in data mining research linked to personalisation. In-depth research into recommender systems has been done by the data mining community. These systems can be grouped into three categories: collaborative filtering, social data mining, and content-based systems. They typically maintain a user profile that is built up and refined from the explicit or implicit feedback of users. Social data mining can be a valuable source of information for businesses when evaluating the data that a group of people produce as part of their regular activities. On the other hand, collaborative filtering can be used to personalise content by matching people based on shared interests and, in a similar vein, using these users' preferences to generate suggestions.
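A minimal sketch of the collaborative filtering idea is given below, assuming a tiny, invented ratings matrix; production recommender systems use far larger data and more refined similarity and ranking schemes.

# Illustrative sketch: user-based collaborative filtering on a hypothetical
# ratings matrix; items liked by similar users are suggested to the target.
from math import sqrt

ratings = {
    "alice": {"camera": 5, "tripod": 4, "memory_card": 5},
    "bob":   {"camera": 5, "tripod": 5, "lens": 4},
    "carol": {"laptop": 4, "mouse": 5},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den

def recommend(target, k=2):
    sims = {other: cosine(ratings[target], r)
            for other, r in ratings.items() if other != target}
    scores = {}
    for other, sim in sims.items():
        for item, rating in ratings[other].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))   # e.g. suggests the lens bought by the most similar user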

Basket Analysis: Market Basket Analysis (MBA), a popular retail, analytical, and business intelligence tool, aids retailers in getting to know their customers better. Every shopper's basket has a tale to tell. There are several strategies to maximise the benefits of market basket analysis, including the following (a small worked example of the support and confidence measures behind basket analysis is given after this list):

◌◌ Identification of product affinities: The true issue in retail is tracking less obvious product affinities and capitalising on them. Customers who buy Barbie dolls at Wal-Mart seem to have a preference for one of three candy bars. Advanced market basket analytics can reveal oblique connections like this to help create more successful marketing campaigns.
◌◌ Cross-sell and up-sell marketing highlight related items so that customers who buy printers may be convinced to purchase premium paper or cartridges.
◌◌ Planograms and product combos are used to focus on products that sell well together and improve inventory control based on product affinities, combo offers, and user-friendly planogram design.
◌◌ Shopper profiles are created by using data mining to analyse market baskets over time in order to acquire a better understanding of who your customers truly are. This includes learning about their ages, income levels, purchasing patterns, likes and dislikes, and purchase preferences.
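The worked example promised above computes the two basic market basket measures, support and confidence, for one candidate affinity over a handful of invented baskets.

# Illustrative sketch: support and confidence for the affinity "doll" -> "candy_bar"
# computed over hypothetical point-of-sale baskets.
baskets = [
    {"doll", "candy_bar", "milk"},
    {"doll", "candy_bar"},
    {"doll", "batteries"},
    {"bread", "milk"},
    {"candy_bar", "soda"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"doll", "candy_bar"} <= b)
doll = sum(1 for b in baskets if "doll" in b)

support = both / n          # how often the pair appears together at all
confidence = both / doll    # how often doll buyers also take the candy bar
print(f"support={support:.2f}, confidence={confidence:.2f}")   # 0.40, 0.67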
The traditional trade is undergoing a significant transformation in the age of
information and knowledge economy due to the quick advancement of network

technology and the improvement of social information level, and e-commerce exhibits
significant market value and development potential. A recent business strategy in the
realm of commerce is e-commerce. It is a contemporary business model that uses the

internet as a platform and contemporary information technology as a tool, with a focus
on economic effectiveness.

The networking, automation, and intelligentization of business activities are


its ultimate objectives. The rise of e-commerce has significantly altered company
philosophy, management techniques, and payment methods, as well as the many
spheres of society. When e-commerce is used in the workplace, the enterprise
information system will produce a large amount of data, and as a result, workers will
struggle with the issue of “rich data but lack of expertise.”

With the emergence of the “data explosion but lack of knowledge” problem, how can one avoid being overwhelmed by the huge ocean of information and obtain relevant information and knowledge from it in a timely manner? A new generation of technology and tools is required to conduct reasonable, higher-level analysis, to use inductive reasoning, to explore potential patterns, and to extract useful knowledge that assists e-commerce decision-makers in adjusting market strategy, creating business forecasts, and taking the appropriate actions to improve information utilisation, lower risk, and generate profits for the company. The data processing technology that best meets these development needs is data mining.

Application of Data Mining in Electronic Commerce



●● Optimize enterprise resources


The secret to corporate profitability is cost cutting. By analysing historical financial
data, inventory data, and transaction data, you can quickly, thoroughly, and accurately

grasp enterprise resource information based on data mining technology. You can
then use this information to make decisions about how best to allocate resources,
such as by reducing inventory, increasing inventory turnover, and improving capital
utilisation.

●● Manage customer data


Analyzing, comprehending, and directing client wants have grown in importance


as a result of the “customer-centric” corporate concept. Businesses will make the


best use of customer resources, analyse and forecast consumer behaviour, and
categorise customers based on data mining technology. To increase customer
satisfaction and loyalty, it is beneficial to monitor customer profitability, identify

future valuable customers, and provide individualised services. By utilising online
resources, businesses can better understand their clients’ purchasing patterns and
interests, which helps them create more user-friendly websites with tailored content

for each client.
●● Assess business credit

Poor credit standing is a significant issue affecting the business environment and
has generated global concern. Due to the growing severity of the enterprise finance
“fake” problem brought on by online fraud, credit crunch has emerged as a significant
barrier to the growth of e-commerce.

Data mining tools are utilized to keep an eye on business operations, assess profitability, evaluate assets, and estimate future growth opportunities. These technologies are also used to implement online monitoring, secure online transactions, and oversee online payment security. Mining the transaction history data with a data mining credit evaluation model to identify the characteristics of a customer's transaction data and establish a customer credibility level improves the enterprise's capacity to screen credit, manage risk, and effectively prevent and resolve credit risk.
●● Determine the abnormal events
Unusual occurrences, like customer churn, bank credit card fraud, and mobile fee defaults, have considerable economic value in several company sectors. We can rapidly and precisely pinpoint these aberrant events using data mining's outlier analysis, which may then serve as the foundation for business decision-making and cut down on avoidable losses.

5.3.5 Mining in Finance


In order to determine whether a business is steady and profitable enough to get

capital investment, financial analysis of data is crucial. The balance sheet, cash flow
statement, and income statement are where financial analysts concentrate their
analysis.

A wide range of banking services, including checking, savings, business and



individual customer transactions, investment services, credits, and loans, are offered
by the majority of banks and financial institutions. Financial data is frequently quite
comprehensive, trustworthy, and of good quality when collected in the banking and
financial sector, which makes systematic data analysis and data mining to boost a

company’s competitiveness easier.

Data mining is widely utilised in the banking sector for modelling and predicting
credit fraud, risk assessment, trend analysis, profitability analysis, and aiding direct
marketing operations. Neural networks have been used in the financial markets to

anticipate stock prices, options trading, bond ratings, portfolio management, commodity
price prediction, mergers and acquisitions analysis, and financial calamities.


Among the financial firms that employ neural-network technology for data mining are Daiwa Securities, NEC Corporation, Carl & Associates, LBS Capital Management, Walkrich Investment Advisors, and O'Sullivan Brothers Investments. There have been reports of a wide variety of successful business applications, although it is sometimes difficult to retrieve technical information. While there are many more banks and investing firms that mine data than those on the list above, you won't often find them wanting to be cited. Typically, they have rules prohibiting talking about it. Therefore, unless you look at the SEC reports of some of the data-mining companies that offer their tools and services, it is difficult to discover articles about banking companies that utilise data mining. Customers like Bank of America, First USA Bank, Wells Fargo Bank, and U.S. Bancorp can be found there.

It has not gone unnoticed that data mining is widely used in the banking industry. It is not unexpected that methods have been designed to counteract fraudulent activities in sectors like credit card, stock market, and other financial transactions, given that fraud costs industries billions of dollars. The issue of fraud is a very important one for credit card issuers. For instance, fraud cost Visa and MasterCard more than $700 million in a single year. Capital One's losses from credit card fraud have been reduced by more than 50% thanks to the implementation of a neural network-based solution. This section describes some effective data-mining techniques to illustrate the value of this technology for financial organisations.
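The sketch below is purely illustrative and is not the neural-network system referred to above; it uses a hand-tuned logistic score over a few invented transaction features to show how a fraud model turns raw attributes into an accept-or-review decision.

# Illustrative sketch only: a toy logistic score standing in for the kind of
# learned model (often a neural network) used to flag suspicious transactions.
from math import exp

def fraud_score(amount, foreign, night, txns_last_hour):
    # weights chosen by hand for illustration, not learned from real data
    z = -4.0 + 0.002 * amount + 1.5 * foreign + 1.0 * night + 0.8 * txns_last_hour
    return 1 / (1 + exp(-z))

tx = {"amount": 950.0, "foreign": 1, "night": 1, "txns_last_hour": 3}
score = fraud_score(**tx)
print(f"fraud probability ~ {score:.2f}", "-> review" if score > 0.5 else "-> accept")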
Just five years ago, the concept of a “robo-advisor” was basically unknown, but
today it is a staple of the financial industry. The phrase is somewhat deceptive because
it has nothing to do with robots. Instead, robo-advisors are sophisticated algorithms
created by firms like Betterment and Wealthfront to tailor a financial portfolio to the
objectives and risk tolerance of each individual customer. Users enter their objectives

together with their age, income, and current financial assets, for instance, retiring
at age 65 with $250,000 in savings. In order to accomplish the user’s objectives, the
intelligent advisor algorithm then distributes investments among various asset classes

and financial instruments. Aiming to always identify the optimum fit for the user’s initial
aims, the system calibrates to changes in the user’s goals and to real-time changes
in the market. With millennial customers who do not require a physical advisor to feel
comfortable investing and who are less able to justify the costs paid to human advisors,

robo-advisors have seen substantial growth.
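A highly simplified sketch of the allocation step is shown below; the heuristic, thresholds, and figures are invented for illustration and do not reflect how Betterment, Wealthfront, or any specific robo-advisor actually works.

# Illustrative sketch: a rule-of-thumb split between equities and bonds of the
# kind a robo-advisor automates from a user's age and risk tolerance.
def allocate(age, risk_tolerance, balance):
    # classic "100 minus age" heuristic, nudged by a 1-10 risk tolerance score
    equity_pct = max(0, min(100, (100 - age) + (risk_tolerance - 5) * 4))
    bond_pct = 100 - equity_pct
    return {"equities": balance * equity_pct / 100,
            "bonds": balance * bond_pct / 100}

print(allocate(age=30, risk_tolerance=7, balance=50_000))
# {'equities': 39000.0, 'bonds': 11000.0}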

Blockchain technology is a further trend in big data applications that originated


in the finance sector and is now permeating many other industries. A distributed
database of all completed transactions or digital events that have been shared among

participating parties makes up a blockchain. A majority of the users in the system agree
to verify each transaction in the public database. Information cannot be deleted once it
has been entered. Every transaction ever made is contained in a particular, verifiable

record on the blockchain. The most well-known application of blockchain technology


is Bitcoin, a decentralised peer-to-peer digital currency. Bitcoin is a very contentious
digital currency, but the underlying blockchain technology has operated flawlessly and
has a wide range of uses in both the financial and non-financial sectors.

The primary assumption is that the blockchain creates a means for establishing

distributed consensus in the virtual online environment. This establishes an


unquestionable record in a public ledger and enables involved organisations to be


assured that a digital event occurred. It makes it possible to transition from a centralised
digital economy to one that is democratic, open, and scalable. This disruptive
technology offers a wide range of application possibilities, and the revolution in this field

has just begun.
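The tamper-evidence property described above can be illustrated with a minimal hash chain; this sketch omits consensus, networking, and proof-of-work entirely and is only meant to show why altering an earlier record invalidates the chain.

# Illustrative sketch: a minimal hash chain in which changing a past record
# breaks every subsequent link.
import hashlib, json

def block(data, prev_hash):
    payload = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    return {"data": data, "prev": prev_hash,
            "hash": hashlib.sha256(payload.encode()).hexdigest()}

chain = [block("genesis", "0" * 64)]
for tx in ["A pays B 5", "B pays C 2"]:
    chain.append(block(tx, chain[-1]["hash"]))

def valid(chain):
    for i in range(1, len(chain)):
        recomputed = block(chain[i]["data"], chain[i]["prev"])["hash"]
        if chain[i]["hash"] != recomputed or chain[i]["prev"] != chain[i - 1]["hash"]:
            return False
    return True

print(valid(chain))                  # True
chain[1]["data"] = "A pays B 500"    # tamper with history
print(valid(chain))                  # False: the stored hash no longer matches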

Blockchain technologies are becoming more and more relevant as social responsibility and security play a larger role online. Since it is practically impossible to counterfeit digital transactions in a system using blockchain, the legitimacy of such systems will undoubtedly increase. We will observe many more possible use cases for the government, healthcare, manufacturing, and other sectors when the early blockchain boom in the financial services business slows.



Case Study
Amazon EC2 Setup

We use the Amazon Elastic Compute Cloud (EC2) service as a case study for business cloud settings. Although there were other publicly accessible cloud services that competed at the time, Amazon's service is the biggest and offers the most highly adjustable virtual machines. The nodes assigned by EC2 run an operating system or kernel that is configured by Amazon, but all software above this level is configured by the user. Many cloud services provided by other providers impose restrictions on the APIs or programming languages that can be used. We use Amazon as a case study to demonstrate the utilisation of current highly efficient dense linear algebra libraries.

Instances are nodes allocated through EC2. The distribution of instances from Amazon's data centres is based on unreleased scheduling algorithms. The initial allocation cap is 20 instances in total, but this cap can be removed upon request. The smallest logical geographic unit for allocation used by Amazon is an availability zone, which is made up of several data centres. These zones are further grouped into regions, which at the time included only the US and Europe.

Each instance automatically loads a user-specified image with the appropriate
operating system (in this case, Linux) and user software after allocation (described
below). The Xen virtual machine (VM) is used by Amazon services to automatically load
images onto one or more virtualized CPUs (Barham et al., 2003). Each CPU has many
cores, giving the instances we reserved a total of 2 to 8 virtual cores. Amazon’s terms
of service do not contain any definite performance guarantees. They do not restrict
Amazon’s capacity to offer multi-tenancy, or the co-location of VMs from several clients,
which is crucial for our study. Below, we go over the performance aspects of several

instances.

The ability to release unwanted nodes, produce and destroy images to be loaded

onto allocated instances, and allocate additional nodes on demand are all provided
through tools written to Amazon’s public APIs. We created images with the most recent
compilers offered by the hardware CPU vendors AMD and Intel using these tools plus
ones we developed ourselves. With the use of our tools, we are able to automatically

allocate and configure variable-size clusters in EC2, including support for MPI
applications.

There are other publicly available tools for running scientific applications on cloud platforms (including EC2), even though we created tools to automatically manage and configure EC2 nodes for our applications. In addition, as the cloud computing platform develops, we anticipate seeing a lot more work done on specialised applications like high-performance computing, which will lessen or completely do away with the steep learning curve associated with implementing scientific applications on cloud platforms. For instance, there are already publicly accessible images on EC2 that support MPICH.

From November 2008 to January 2010, we conducted our case study using several
instance types on the Amazon Elastic Compute Cloud (EC2) service.

The key distinctions between the instance types are shown in Table below,
including the number of cores per instance, installed memory, theoretical peak
performance, and instance cost per hour. Although Amazon offers a smaller 32-bit


m1.small instance, we only tested instances with 64-bit processors and hence refer
to the m1.large as the smallest instance. Costs per node range from $0.34 for the smallest to $2.40 for nodes with a lot of installed RAM, a factor of seven difference. The c1.xlarge instance stands out in that cost grows more closely with installed RAM than with peak CPU performance. Peak performance is determined utilizing processor-specific capabilities. For instance, the c1.xlarge instance type has two Intel Xeon quad-core CPUs running at 2.33 GHz and 7 GB of total RAM. Theoretically, each core may do four floating-point operations each clock cycle, giving each node a peak performance of 74.56 GFLOP/sec. We do not include the expenses associated with long-term data storage and bandwidth utilised to enter and exit Amazon's network in our calculations because they pale in comparison to the costs associated with the computation itself in our studies.

[Table: EC2 instance types compared by cores per instance, installed memory, theoretical peak performance, and cost per hour.]
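The quoted peak figure can be checked with a one-line calculation, assuming the 2.33 GHz clock of the Xeon E5345 and four floating-point operations per core per cycle:

# Worked check of the quoted per-node peak for the c1.xlarge instance.
cores = 8                 # two quad-core Xeon E5345 sockets
flops_per_cycle = 4       # floating-point operations per core per cycle
clock_ghz = 2.33
peak_gflops = cores * flops_per_cycle * clock_ghz
print(peak_gflops)        # 74.56 GFLOP/sec per node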
Extensive testing revealed that the multi-core processors’ multithreaded parallelism
worked best when the Goto BLAS library was configured to use as many threads as
there were cores per socket—four for the Xeon and two for the Opteron, respectively.
In the section that follows, we give peak attained efficiency together with the number
of threads that were used to reach particular outcomes. We achieved 76% and 68%

of theoretical peak performance (measured in GFLOP/sec) on the Xeon E5345 and


Opteron 270, respectively, for single node performance on xlarge instances, utilising
these settings and the platform-specific libraries and compilers. So, for the purposes

of our evaluation, we think that the configuration and execution of LINPACK in HPL on
the high-CPU and standard instances is effective enough to serve as an example of a
compute-intensive application.

The 2.6.21 Linux kernel is used to run the RedHat Fedora Core 8 operating system on all instance types (with Intel or AMD CPUs). The autotuning of buffer sizes for high-performance networking is supported by the 2.6 line of Linux kernels and is turned on by default. Multiple instances may even share a single physical network card, according to Services (2010), which states that the specific connection used by Amazon is unknown. As a result, a certain instance might not have access to the total throughput. We conduct all trials with cluster nodes assigned to the same availability zone in an effort to minimise the number of hops between nodes as much as possible.

Summary
●● Extraction of interesting information or patterns from data in large databases is
known as data mining.

●● Text mining is the process of removing valuable data and complex patterns
from massive text datasets. For the purpose of creating predictions and making
decisions, there are numerous methods and tools for text mining.




●● Text mining is the procedure of synthesizing information, by analyzing relations,



e
patterns, and rules among textual data. These days, information is electronically maintained
by all institutions, businesses, and other organisations. a vast amount of
information that is stored online in digital archives, databases, and other text-

in
based sources like websites, blogs, social media networks, and emails.
●● Procedures of analyzing text mining: a) Text summarization, b) Text categorization,

nl
c) Text clustering.
●● Application area of text mining: a) Digital library, b) Academic and research field, c)
Life science, d) Social media, e) Business intelligence.

O
●● A spatial database is designed to store and retrieve information on spatial objects,
such as points, lines, and polygons. Spatial data frequently includes satellite
imagery. Spatial queries are those that are made on these spatial data and use
spatial parameters as predicates for selection.

ty
●● Web mining, also known as data mining, is the process of using algorithms and
data mining techniques to extract valuable information from the web, including
Web pages and services, linkages, web content, and server logs.

si
●● Three categories of data mining: a) Web content mining, b) Web usage mining, c)
Web structure mining.
●● A multimedia database management system is the framework that controls how
various kinds of multimedia data are provided, saved, and used. Multimedia
databases can be divided into three categories: dimensional, dynamic, and static.
●● Multimedia data cubes can be generated and built similarly to traditional data
cubes using relational data in order to enable the multidimensional analysis of big
ni

multimedia datasets.
●● Application area of data mining: a) Research, b) Education sector, c)
Transportation, d) Market basket analysis, e) Business transactions, f) Intrusion
U

detection, g) Scientific analysis, h) Finance and banking sector, i) Insurance and


healthcare.
●● The market-based analysis is a method for examining and analysing a customer’s
ity

shopping behaviour in order to boost income and sales.


●● Despite the fact that data mining processes raw data, the data must first be well-
organized and logically organised in order to be processed. It’s a crucial requirement
in the telecommunications sector, which is dealing with the enormous database.
m

●● Fraud is a serious issue for the telecommunications sector since it results in lost
income and deteriorates consumer relations. Subscription fraud and superimposed
scams are two of the main types of fraud involved. The subscription scam involves
)A

gathering client information, such as name, address, and ID proof information,


primarily via KYC (Know Your Customer) documentation.
●● Any unapproved operation on a digital network is referred to as a network
intrusion. Theft of priceless network resources is a common aspect of network
invasions. The process of data mining is essential for looking for abnormalities,
(c

network attacks, and intrusions. These methods assist in choosing and enhancing
pertinent and usable facts from enormous data collections.




●● Application of data mining in healthcare: a) Treatment effectiveness, b) Fraud and



e
abuse, c) Customer relationship management, d) Healthcare management.
●● Data mining applications in science and engineering: a) Data reduction, b)

in
Research, c) Pattern recognition, d) Remote sensing, e) Opinion mining.
●● The networking, automation, and intelligentization of business activities are its
ultimate objectives. The rise of e-commerce has significantly altered company

nl
philosophy, management techniques, and payment methods, as well as the many
spheres of society.
●● Application of data mining in e-commerce: a) Optimize enterprise resources, b)

O
Manage customer data, c) Assess business credit, d) Determine the abnormal
events.
●● In order to determine whether a business is steady and profitable enough to get

ty
capital investment, financial analysis of data is crucial. The balance sheet, cash
flow statement, and income statement are where financial analysts concentrate
their analysis.

si
●● In order to uncover hidden patterns and forecast upcoming trends and behaviours
in the financial markets, data mining techniques have been applied. For mining
such data, especially the high-frequency financial data, advanced statistical,

r
mathematical, and artificial intelligence approaches are often needed.
ve
Glossary
●● GIS: Geographical Information Systems.
●● Text Summarization: To automatically extract its entire content from its partial
ni

content reflection.
●● Text Categorization: To classify the text into one of the user-defined categories.
●● Text Clustering: To divide texts into various clusters based on their significant
U

relevance.
●● Information extraction: The technique of information extraction involves sifting
through materials to find meaningful terms.
ity

●● Information retrieval:It is the process of identifying patterns that are pertinent and
related to a group of words or text documents.
●● Natural Language Processing: The automatic processing and analysis of
unstructured text data is known as “natural language processing.”
m

●● Clustering:It is an unsupervised learning method that groups texts together based


on their shared traits.
)A

●● Text Summarization: To mechanically derive its entire content from its partial
reflection.
●● DTM: Digital Terrain Model.
●● DBSCAN: Density-based spatial clustering of applications with noise.
(c

●● Web Content Mining: Web content mining is the process of extracting relevant
information from web-based documents, data, and materials. The internet used to
merely contain various services and data resources.

●● Web Structured Mining:In Web Structure Mining, we are primarily interested in



e
the interdocument structure—or the organisation of hyperlinks—within the web
itself. It has a tight connection to online usage mining. Web structure mining is
fundamentally connected to pattern recognition and graph mining.

in
●● Web Usage Mining:Web use mining manipulates clickstream data. Web usage
data includes web server access logs, proxy server logs, browser logs, user

nl
profiles, registration information, user sessions, transactional information, cookies,
user queries, bookmark data, mouse clicks and scrolling, and other interaction
data.

O
●● Media Data: Actual data used to depict an object is called media data.
●● Media Format Data: Data concerning the format of the media after it has
undergone the acquisition, processing, and encoding stages, including sampling
rate, resolution, encoding method, etc.

ty
●● Media keyword data: Keywords that describe how data is produced. It also goes
by the name of “content descriptive data.” Example: the recording’s date, time, and
location.

si
●● Media Feature Data: Information that is depending on the content, such as the
distribution of colours, types of textures, and shapes.
●● r
MFC: Most Frequent Color.
ve
●● MFO:Most Frequent Orientation.
●● URL: Uniform Resource Locator.
●● RFM:Recency, Frequency, and Monetary.
ni

●● KYC: Know Your Customer.


●● Network intrusion: Any unapproved operation on a digital network is referred to as
a network intrusion.
U

●● EDW: Enterprise Data Warehouse.


●● KDD: Knowledge Data Discovery.
ity

Check Your Understanding


1. The process of extraction of interesting information or patterns from data in large
databases, is known as_ _ _ __ _.
a. Data analysis
m

b. Data collection
c. Data repository
)A

d. Data mining
2. _ _ _ _ is the procedure of synthesizing information, by analyzing relations, patterns,
and rules among textual data.
a. Text editing
(c

b. Text wrapping
c. Text mining

d. None of the mentioned



e
3. To automatically extract its entire content from its partial content reflection, is termed
as?

in
a. Text summarization
b. Text categorization

nl
c. Text clustering
d. Text mining
4. To classify the text into one of the user-defined categories, is termed as:

O
a. Text summarization
b. Text categorization

ty
c. Text clustering
d. Text mining
5. To divide texts into various clusters based on their significant relevance, is termed

si
as:
a. Text summarization
b. Text categorization
ve
c. Text clustering
d. None of the mentioned
6. The technique of_ _ _ _ _ _ involves sifting through materials to find meaningful
terms.
ni

a. Information retrieval
b. Clustering
U

c. Data repository
d. Information extraction
7. _ _ _ _ _is the process of identifying patterns that are pertinent and related to a
ity

group of words or text documents.


a. Information extraction
b. Information retrieval
m

c. Text summarization
d. Clustering
)A

8. The automatic processing and analysis of unstructured text data is known as__ _ _
_ __ .
a. Information extraction
b. Information retrieval
(c

c. Natural language processing


d. Clustering




9. A_ _ _ _database is designed to store and retrieve information on spatial objects,



e
such as points, lines, and polygons.
a. External geographic

in
b. Meteorological
c. Topological

nl
d. Spatial
10. _ _ _ _ _ is the process of using algorithms and datamining techniques to extract
valuable information from the web, including Web pages and services, linkages,

O
web content, and server logs.
a. Web mining
b. Text mining

ty
c. Text wrapping
d. Data wrapping

si
11. _ _ _ _ is the process of extracting relevant information from web-based documents,
data, and materials.
a. Web usage mining
b. Web content mining
ve
c. Web structure mining
d. None of the mentioned
12. Actual data used to depict an object is called_ _ _ _ .
ni

a. Media format data


b. Media feature data
U

c. Media data
d. All of the above
13. Any unapproved operation on a digital network is referred to as a_ _ _ _.
ity

a. Phishing
b. Hacking
c. Data mining
m

d. Network intrusion
14. The_ _ _ _ _is a method for examining and analysing a customer’s shopping
behaviour in order to boost income and sales.
)A

a. Market-based analysis
b. Business transactions
c. Research
(c

d. Education sector




15. Data concerning the format of the media after it has undergone the acquisition,

e
processing, and encoding stages, including sampling rate, resolution, encoding
method, etc, is termed as?

in
a. Media data
b. Media format data
c. Media keyword data

nl
d. Media feature data
16. Bill Inmon has estimated_ _ _ _of the time required to build a data warehouse, is

O
consumed in the conversion process.
a. 80%
b. 60%

ty
c. 40%
d. 20%

si
17. The generic two-level data warehouse architecture includes_ _ _ _ .
a. Far real-times updates
b. Near real-times updates
c. At least one data mart
r
ve
d. None of the above
18. The most common source of change data in refreshing a data warehouse is_
________ _ _ .
ni

a. Logged change data


b. Snapshot change data
U

c. Queryable change data


d. None of the above
19. _ _ _ _ _are responsible for running queries and report against data warehouse
ity

tables.
a. Software
b. Hardware
m

c. Middleware
d. End user
20. Data warehouse architecture is based on_ _ _ _.
)A

a. RDBMS
b. Sybase
c. DBMS
(c

d. SQL server




Exercise

e
1. What do you mean by Text Mining?
2. Define Spatial Databases.

in
3. Explain the term Web Mining.
4. What do you mean by Multidimensional Analysis of Multimedia Data?

nl
5. Define the Applications of data mining in following industry:
◌◌ Telecommunications Industry and

O
◌◌ Retail Marketing
◌◌ Fraud Protection,
◌◌ Healthcare and Science

ty
◌◌ E-commerce
◌◌ Finance

Learning Activities

si
1. You are employed as a data mining consultant by a sizable commercial bank that
offers a variety of financial services. A data warehouse that the bank launched two
years ago already exists. The management is looking for the current clients who will
r
be most receptive to a marketing effort promoting new services. Explain the steps
ve
involved in knowledge discovery, including their phases and associated activities.

Check Your Understanding - Answers


1 d 2 c 3 a 4 b
ni

5 c 6 d 7 b 8 c
9 d 10 a 11 b 12 c
U

13 d 14 a 15 b 16 a
17 b 18 c 19 d 20 a

Further Readings and Bibliography:


ity

1. Data Mining and Machine Learning Applications, Sandeep Kumar, Kapil Kumar
Nagwanshi, K. Ramya Laxmi, Rohit Raja, John Wiley & Sons
2. Data Mining Applications with R, Yanchang Zhao and Yonghua Cen, Elsevier
m

Science
3. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Bing Liu,
Springer Berlin Heidelberg
)A

4. Practical Applications of Data Mining, Sang C. Suh, Jones & Bartlett Learning
(c

