Data Warehousing and Mining (F-CSIT341)
Programs Offered

Post Graduate Programmes (PG)
• Master of Business Administration
• Master of Computer Applications
• Master of Commerce (Financial Management / Financial Technology)
• Master of Arts (Journalism and Mass Communication)
• Master of Arts (Economics)
• Master of Arts (Public Policy and Governance)
• Master of Social Work
• Master of Arts (English)
• Master of Science (Information Technology) (ODL)
• Master of Science (Environmental Science) (ODL)

Diploma Programmes
• Post Graduate Diploma (Management)
• Post Graduate Diploma (Logistics)
• Post Graduate Diploma (Machine Learning and Artificial Intelligence)
• Post Graduate Diploma (Data Science)

Undergraduate Programmes (UG)
• Bachelor of Business Administration
• Bachelor of Computer Applications
• Bachelor of Commerce
• Bachelor of Arts (Journalism and Mass Communication)
• Bachelor of Social Work
• Bachelor of Science (Information Technology) (ODL)
• Bachelor of Arts (General / Political Science / Economics / English / Sociology)
DIRECTORATE OF DISTANCE & ONLINE EDUCATION
Amity Helpline: 1800-102-3434 (Toll-free), 0120-4614200
© Amity University Press
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior permission of the publisher.

SLM & Learning Resources Committee
Chairman : Prof. Abhinash Kumar
Members : Dr. Divya Bansal
          Dr. Coral J Barboza
          Dr. Monica Rose
          Dr. Apurva Chauhan
          Dr. Winnie Sharma

Published by Amity University Press for exclusive use of Amity Directorate of Distance and Online Education, Amity University, Noida-201313
Contents

Page No.
Module - I: Data Warehouse Fundamentals 01
1.1 Defining the Cloud for the Enterprise
1.1.1 Database as a Service
1.1.2 Governance/Management as a Service
1.1.3 Testing as a Service
1.1.4 Storage as a Service
1.2 Cloud Service Development
1.2.1 Cloud Service Development
1.2.2 Cloud Computing Challenges
1.3 Cloud Computing Layers
1.3.1 Understand Layers of Cloud Computing
1.4 Cloud Computing Types
1.4.1 Types of Cloud Computing and Features
1.5 Cloud Computing Security Requirements, Pros and Cons, and Benefits
1.5.1 Cloud Computing Security Requirements
1.5.2 Cloud Computing - Pros, Cons and Benefits
Case Study
2.6.1 OLAP Server - ROLAP
2.6.2 OLAP Server - MOLAP
2.6.3 OLAP Server - HOLAP
Case Study
Module - III: Data Mining 119
3.1 Understanding Data Mining
3.1.1 Understanding Concepts of Data Warehousing
3.1.2 Advancements to Data Mining
3.2 Motivation and Knowledge Discovery Process
3.2.1 Data Mining on Databases
3.2.2 Data Mining Functionalities
3.3 Data Mining Basics
3.3.1 Objectives of Data Mining and the Business Context for Data Mining
3.3.2 Data Mining Process Improvement
3.3.3 Data Mining in Marketing
3.3.4 Data Mining in CRM
3.3.5 Tools of Data Mining
Case Study
Module - V: Data Mining Applications 233
5.1 Text Mining, Spatial Databases and Web Mining
5.1.1 Text Mining
5.1.2 Spatial Databases
5.1.3 Web Mining
5.2 Multimedia Web Mining
5.2.1 Multidimensional Analysis of Multimedia Data
5.2.2 Applications in Telecommunications Industry
5.2.3 Applications in Retail Marketing
5.2.4 Applications in Target Marketing
5.3 Applications in Industry
5.3.1 Mining in Fraud Protection
5.3.2 Mining in Healthcare
5.3.3 Mining in Science
5.3.4 Mining in E-commerce
5.3.5 Mining in Finance
Case Study
Module - I: Data Warehouse Fundamentals

Learning Objectives:
At the end of this topic, you will be able to understand:
●● Database as a Service
●● Governance/Management as a Service
●● Testing as a Service
●● Storage as a Service
●● Cloud Service Development
●● Cloud Computing Challenges
●● Understand Layers of Cloud Computing
●● Types of Cloud Computing and Features
●● Cloud Computing Security Requirements
●● Cloud Computing - Pros, Cons and Benefits
Introduction
The field of data warehousing has expanded dramatically during the past ten years. To aid in business decision-making, many firms are either actively investigating this technology or are already using one or more data marts or warehouses. In the current economic climate, competitive advantage frequently results from the proactive use of the data that businesses have been accumulating in their operational systems, and firms are becoming aware of the enormous potential this information holds for their company. Through the data warehouse, users have access to these huge quantities of integrated, nonvolatile, time-variant data, which may be utilised to monitor commercial trends, simplify forecasting, and enhance strategic choices.

A growing trend in IT, cloud computing aims to make the Internet the ultimate repository for all computing resources, including storage, computation and other services. Cloud computing gives enterprise firms the opportunity to outsource computing infrastructure so that they can concentrate on their core competencies more efficiently, which is very advantageous to them. Virtualization, which allows physical resources to be pooled and shared, provides the foundation for cloud computing.

Cloud services appeal to clients because they do away with the need for technicians to support and manage some of the most coveted new IT technologies, such as highly scalable, variably provisioned systems. Computing resources, including virtual servers, data storage and network capabilities, are all load-balanced and automatically extensible, which is obviously advantageous to clients. A solid, dependable service is produced when resources are assigned as necessary and loads can be moved automatically to better locations.

The concept is not complex: a large business with access to abundant computing resources, such as sizable data centres, comes to an arrangement with clients. Utilising the capabilities of the provider, customers can execute their applications, store their data, host virtual machines, and more. Customers have the option of ending their contracts, saving money on setup and maintenance fees, and gaining access to the provider's resource allocation flexibility.
Definition of Cloud Computing
According to the National Institute of Standards and Technology (NIST) definition, "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." Three service models, four deployment models, and five essential characteristics make up this cloud model.
1.1 Defining the Cloud for the Enterprise
Applications and services made available through the Internet are referred to as cloud computing. These services are offered via data centres located all over the world, collectively known as the "cloud". The metaphor illustrates how universal and yet intangible the Internet is: the numerous network connections and computer systems required for internet services are made simpler by the "cloud" abstraction.

In fact, the Internet is frequently depicted as a cloud in network diagrams. This highlights the Internet's enormous reach while also hiding its complexity. The cloud and the services it offers are accessible to anybody with an Internet connection. Users can exchange information with other users and between systems thanks to the regular connections between these services.

The major enabling features of cloud computing are elasticity, pay-per-use pricing and low upfront investment, which make it a ubiquitous paradigm for deploying novel applications. This has led many SMBs (Small and Medium Businesses) and SMEs (Small and Medium Enterprises) to deploy applications online that were not economically feasible in traditional enterprise infrastructure settings.

Cloud computing is transforming the way data is stored, retrieved and served. Computing resources such as servers, storage, networks and applications (including databases) are hosted and made available as cloud services for a price. Cloud platforms have evolved to offer many IT needs as online services, without the need to invest in expensive data centres and worry about the hassle of managing them.
Cloud platforms virtually eliminate the need for organisations to maintain their own expensive data centres. For database environments, the PaaS cloud model provides better IT services than the IaaS model, because it supplies enough resources in the cloud database for users to build the applications they need. Scalable and distributed data management has been an active research topic for more than three decades.

With the enterprise cloud computing paradigm, businesses pay as they go for access to virtualized IT resources from public or private cloud service providers. Servers, computing power (CPU cores), data storage, virtualization tools and network capabilities are typical examples of these resources. Enterprise cloud computing gives companies new ways to cut expenses while boosting their adaptability, network security and resilience.
Processing power, computer memory and data storage are three different types of computing resources that enterprises undergoing digital transformation need flexible and scalable access to. In the past, these companies were responsible for paying for the setup and upkeep of their own networks and data centres. Now that they have partnered with public and private enterprise cloud service providers, businesses can use these resources at a reasonable cost.

The enterprise cloud as we know it today has only been around for a few years. It began with the introduction of Amazon Web Services (AWS), the first commercial cloud storage provider, in March 2006. Enterprise cloud technology has grown quickly and widely since the introduction of AWS. According to a 2019 survey of 786 technical professionals from various firms (the RightScale State of the Cloud report), 91 percent of respondents had adopted public cloud solutions and 72 percent had adopted private cloud solutions.
The main benefits of enterprise cloud computing include the following:
1. Cost savings: Using pay-as-you-go pricing, a typical enterprise cloud solution allows firms to pay only for the services they actually use (a small worked example of this pricing model follows this list). Additionally, companies that switch to the cloud can avoid most or all of the upfront expenditure associated with building comparable capabilities internally. There is no requirement to purchase servers, lease a data centre, or manage any physical computing infrastructure. As a result, IT costs for businesses that use the cloud are frequently lower, simpler to estimate and more predictable.
2. Security: Cybercriminals who want to steal or expose data commonly target enterprise firms. Data breaches can harm your reputation and business relationships and are very expensive to fix. Thanks to enterprise clouds, businesses can use security technologies like system-wide identity/access management and cloud security monitoring, and can quickly deploy identification and access controls across the entire network. In both public and private installations, cloud service providers enable data security in a variety of ways.
3. Business resilience and disaster recovery: Business resilience in the case of a service outage, a natural disaster, or a cyber attack is at risk without a reliable disaster recovery solution. Possible repercussions include lost sales, declining customer trust, and even bankruptcy. Faction's Hybrid Disaster Recovery as a Service (HDRaaS™), for example, is a proprietary, fully managed enterprise disaster recovery solution offering reliable backup data storage, non-disruptive recovery testing, and process management for disaster declaration and failover. Disaster recovery services can help businesses like Delta Air Lines recover from service interruptions more quickly and keep their income flowing.
4. Flexibility and innovation: Businesses can dynamically scale their resources up or down as their needs change, making it easier to experiment and innovate.
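As a quick worked illustration of the pay-as-you-go pricing mentioned in the cost-savings point above, the sketch below compares metered usage with an upfront purchase. Every number is an invented assumption for illustration, not any provider's real pricing.

```python
# Toy comparison of pay-as-you-go cloud pricing with an upfront purchase.
# All rates and prices below are invented assumptions, not real provider pricing.

HOURLY_RATE = 0.12          # assumed cost of one virtual server per hour
UPFRONT_SERVER_COST = 2500  # assumed purchase price of a comparable physical server

def pay_as_you_go_cost(hours_used: float) -> float:
    """Pay only for the hours actually used."""
    return hours_used * HOURLY_RATE

# A test server needed for about 10 hours a week over a year (~520 hours):
cloud_cost = pay_as_you_go_cost(520)
print(f"cloud: {cloud_cost:.2f}, upfront purchase: {UPFRONT_SERVER_COST}")
# cloud: 62.40, upfront purchase: 2500
```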
1.1.1 Database as a Service
With Database as a Service, database functionality is delivered to consumers over an Internet connection. The service provider selects, installs, launches and manages the database management software.

In any event, all the consumer is getting is a piece of software. DBaaS is especially well suited for small- to medium-sized organisations that depend on databases but find the installation and maintenance costs prohibitive. Because consumers do not have to find, hire, train and pay people to maintain a database, the service is much more valuable.

In a DBaaS model the database is administered by a cloud operator (public or private), freeing the application team from having to handle routine database management tasks. Application developers should not be required to be database professionals, or to pay a database administrator (DBA) to maintain the database, if a DBaaS is available. When database services are merely invoked by application developers and the database is otherwise taken care of, true database as a service (DBaaS) has been attained.

This implies that the database will scale without issues, will be maintained, upgraded and backed up, and will be able to handle server failure without any negative effect on the developer. Database as a service is an intriguing offering, but one loaded with challenging security issues.
The DBaaS service mentioned above is one that cloud providers seek to provide. The cloud providers require a high level of automation in order to offer a complete DBaaS solution to numerous customers. Backups are an example of an operation that can be planned and batch processed. Numerous other processes, including elastic scale-out, can be automated based on certain business rules. For instance, in order to meet the service level agreement's (SLA) requirements for quality of service (QoS), databases may need to be restricted in terms of connections or peak CPU usage. The DBaaS may automatically add a new database instance to share the load if this requirement is surpassed. Additionally, the cloud service provider must be able to automatically create and set up database instances. This method can automate a large portion of database administration, but the database management system that powers DBaaS must provide these features through an application programming interface in order to accomplish this level of automation.
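The rule-based scale-out just described can be sketched in a few lines of Python. The ProvisioningAPI class, its method names and the thresholds below are hypothetical stand-ins for a provider's automation API, not a real library.

```python
# Minimal sketch of rule-based elastic scale-out for a DBaaS control plane.
# ProvisioningAPI and the thresholds are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class InstanceMetrics:
    connections: int         # current client connections
    peak_cpu_percent: float  # recent peak CPU usage

class ProvisioningAPI:
    """Stand-in for the provider's automation API."""
    def add_database_instance(self, cluster_id: str) -> str:
        print(f"Provisioning a new instance for cluster {cluster_id}")
        return "instance-new"

# Business rules derived from the SLA/QoS limits described above (assumed values).
MAX_CONNECTIONS = 500
MAX_PEAK_CPU = 80.0

def enforce_sla(cluster_id: str, metrics: InstanceMetrics, api: ProvisioningAPI) -> None:
    """Add a database instance to share the load when either limit is exceeded."""
    if metrics.connections > MAX_CONNECTIONS or metrics.peak_cpu_percent > MAX_PEAK_CPU:
        api.add_database_instance(cluster_id)

enforce_sla("orders-db", InstanceMetrics(connections=620, peak_cpu_percent=72.5), ProvisioningAPI())
```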
The number of databases that cloud operators must work on concurrently can be in the tens of thousands or possibly hundreds of thousands, so automation is required. The DBaaS solution must give the cloud operator an API in order to automate these tasks in a customisable way. The main objective of a DBaaS is to free the customer from having to worry about the database.

Today's cloud users simply operate without having to consider server instances, storage, or networking. Cloud computing is made possible through virtualization, which pools physical resources and lowers costs. The fact that DBaaS aims to simplify things is crucial as well as obvious.

Data processing efficiency is a fundamental and critical issue for nearly any scientific, academic, or business institution. As a result, enterprises install, operate and maintain database management systems to meet various data processing demands. Although it is possible to purchase the necessary hardware, deploy database products, establish network connectivity and hire professional system administrators as a traditional solution, this approach has become increasingly expensive and impractical as database systems and problems have grown larger and more complex. The traditional solution carries various costs.
While hardware and software costs decrease over time, the costs of people do not, and people costs will most likely dominate computing solution costs in the future. Database backup, database restore and database reorganisation are also required to free up space or restore a preferred data structure. Migration from one database version to the next without affecting solution availability is a skill that is still in its early stages; during a version change, parts of a database system, if not the entire solution, typically become inaccessible.

Database management solutions have long been run in businesses' own data centres. Initially, developers were left to install, manage and use their preferred cloud database instance, with the developer bearing the burden of all database administration tasks. The benefit is that you may choose your own database and have complete control over how the data is managed. Many PaaS companies have since begun to offer cloud database services in order to reduce the strain on customers of their cloud products.

All physical database administration chores, such as backup, recovery and managing the logs, are handled by the cloud provider. The developer is in charge of the database's logical administration, which includes table tuning and query optimisation. If a database service provider is efficient, it has the opportunity to perform these jobs and create a value proposition. With DBaaS, an organisation no longer needs to install, maintain and update software, or administer the system; instead, for its database needs, the company simply uses the ready-made system managed by the service provider.
Deployment of DBaaS
Database as a Service (DBaaS) is a technical and operational method that enables IT companies to provide database functionality as a service to one or more customers. There are two use-case situations in which cloud database products meet an organisation's database demands:
●● A single huge corporation with several individual databases that can be transferred to the organisation's private cloud.
●● Outsourcing the data management needs of small and medium-sized organisations to a public cloud provider that serves a large number of small and medium-sized businesses.
In a true sense, a DBaaS offering should meet the following requirements:
◌◌ Relieving the end developer/user of database administration, tuning and maintenance tasks, while providing high performance, availability and fault tolerance, as well as advanced features such as snapshots, analytics and time travel.
◌◌ Elasticity, or the capacity to adapt to changing workloads. Elasticity is essential to meet user SLAs (Service Level Agreements) while lowering infrastructure, power and administration costs for the cloud provider.
◌◌ Guarantees of security and privacy, as well as a pay-per-usage pricing plan.
Because it allows IT organisations to consolidate servers, storage and database workloads into a shared hardware and software architecture, a private cloud is an efficient approach to supplying database services. Databases hosted on a private cloud provide significant cost, quality-of-service and agility advantages by providing self-service, elastically scalable and metered access to database services.

For a variety of reasons, private clouds are preferable to public clouds. There are significant data security hazards with public clouds, which often provide little or no availability or performance service-level agreements. Private clouds, on the other hand, give IT departments entire control over the performance and availability service levels they provide, as well as the capacity to efficiently implement data governance standards.

Challenges of DBaaS
User data must reside on the database service provider's premises in the database service provider model. Most businesses regard their data as a highly valuable asset. The service provider must provide adequate security measures to protect data privacy.

At the same time, cloud databases have several negatives, such as security and privacy concerns, as well as the potential loss of, or inability to access, vital data in the event of a disaster or bankruptcy of the cloud database service provider.
1.1.2 Governance/Management as a Service
Systems management offerings range from point products to full enterprise frameworks. Existing technology products rarely solve the complete problem, since they are expensive to purchase, complex to integrate, and do not include the full range of features and functions required to fulfil today's system administration requirements.

With low-cost data centre appliances on the rise, the only way to truly solve today's heterogeneous IT dilemma is to offer systems management capabilities at the service level. IT professionals should be able to use their management experience and knowledge to improve management performance.

Management as a Service (MaaS) is the emerging model of IT service management that makes this possible. MaaS enables IT service management teams to create network management appliances, services and knowledge repositories that are highly adaptable and capable of growing in response to changing client needs. MaaS deployments combine several system management features and solutions into a unified management environment that covers all kinds of system management.

A MaaS offering can typically be deployed at lower cost than any in-house enterprise architecture. MaaS enables enterprises to fully leverage the benefits of their converged network by recognising and resolving problems faster, more precisely, less expensively and with greater visibility than siloed tools alone.
1.1.3 Testing as a Service
IT has recently become a utility. Furthermore, service-oriented architecture and software-as-a-service models have had a significant impact on software-based companies and on the nature of software systems. Every software company strives to provide software that is simple to use, error-free and of excellent quality. To create high-quality, adaptable and error-free software, it must be tested against certain criteria in a specific environment. It is extremely tough, and often nearly impossible and highly expensive, for a company to sustain that environment in-house.

So, to test software, it is simple and cost-effective to contract the services of a third party who has the necessary tools, hardware, simulators or devices. Automated regression testing, performance testing, functional testing, ERP software testing, cloud-based application testing and other services are supplied under the Software Testing as a Service model.

Testing should be automated in order to achieve the best results. However, because real-world environment tools are costly, automation is prohibitively expensive for most enterprises. As a result, testing operations were restricted and everything was done manually, a process that was incredibly slow, took a long time and introduced risk.

Under the Testing as a Service model, third-party providers supply these capabilities on demand. This means that businesses can employ automation tools and qualified workers as needed, rather than purchasing the tools outright, at a cost that is affordable for organisations large and small.
Software testing is a comprehensive and continuous technique for checking and validating that software meets the customer's technical and business requirements. Other quality attributes, such as integrity, dependability, interoperability, usability, efficiency, maintainability, security and portability, are also assessed through testing. Software testing methods vary depending on the level of testing and the goal of the testing. Testing should be done effectively and efficiently within the available financial and scheduling constraints.

Because of the vast number of testing limitations, it is nearly impossible to ensure that testing has eradicated every error. Applying previously developed concepts for testing software can make testing easier and more successful. Testing is a crucial quality filter, and it should be organised around clearly identified principles, aims and restrictions.

Software testing is the process of verifying a programme or system and checking it for defects. It not only checks for mistakes, but also ensures that the software operates in accordance with the user's specifications. The important thing to remember is that quality should never be compromised at any cost. Today, with the help of innovative methods of development and testing, firms can secure the highest quality of software at the lowest possible cost.
In Software Testing as a Service (STaaS), test activities are outsourced to a third party that focuses on reproducing real-life test environments based on the needs of the purchaser. Simply said, STaaS is a means by which businesses request providers to deliver application testing services as needed.

Many testing activities require costly tools and application models. Rather than purchasing these types of tools, STaaS allows businesses to pay only for what they need, lowering costs while achieving the best potential results. STaaS is sometimes referred to as on-demand testing.

STaaS is a model of software testing in which an application is tested as a service delivered to customers via the internet. It offers routine operation, upkeep and testing support via web browsers, testing frameworks and hosts.
This design helps to sustain a demand-led software testing market by allowing enterprises to supply and purchase testing products and services as needed. There are numerous advantages to online software testing. First, customers do not have to invest significantly in installing and maintaining test environments; this dramatically decreases testing costs while providing customers with a flexible way to acquire testing products and services as needed, from anywhere in the world. Second, online distribution of software testing opens up a bigger market for both testing companies and users.

Third, customers have access to international testing professionals. Furthermore, it has been claimed that software testing as an online service can be delivered in as little as 10 working days, which contributes to quicker turnaround times and allows customers to reach the ideal time to market quickly. Finally, when testing services are published on the web, the web service APIs used can hide the complexity of the underlying test infrastructure, encouraging developers and testers to use it more frequently.
Cloud computing has received a lot of attention recently because it alters the way computation is delivered and the way providers interact with customers. For example, it changes how processing resources, including CPUs, databases and storage devices, are supplied and managed. Small and medium-sized organisations want higher speed, security and scalability in their product infrastructure to meet their business requirements.

However, such firms often cannot build this type of infrastructure on their own, and organisations are now generally focused on improving efficiency as well as reducing cost. The choice to move to the cloud nevertheless raises many issues that must be addressed in terms of stability, consistency and data preservation, so a company must conduct thorough testing.

Testing as a Service (TaaS) is an outsourcing model in which testing activities associated with an organisation's business are conducted by a service provider rather than in-house workers. TaaS may entail hiring consultants to assist and advise personnel, or it may simply entail outsourcing a portion of testing to a service provider. Typically, a corporation will still conduct some testing in-house.
TaaS is best suited for specific testing efforts that do not necessitate extensive understanding of the architecture or system. Automated regression testing, performance testing, security testing, application testing, testing of key ERP (enterprise resource planning) software, and monitoring/testing of cloud-based apps are all good candidates for the TaaS approach.

In the TaaS model, testing is performed by a third-party service provider rather than by the organisation's employees. TaaS testing is performed by a service provider who specialises in mimicking real-world testing environments and locating defects in software products.
TaaS is employed when an organisation wants to:
◌◌ Compensate for a lack of the skills and resources needed to conduct internal testing.
◌◌ Keep in-house developers from having a say in the testing process (which they could if testing were done internally).
◌◌ Save money.
◌◌ Accelerate test execution and shorten software development time.
TaaS, whatever shape it takes, entails a provider taking on some of the organisation's testing obligations.

TaaS can be utilised for automated testing activities that would take in-house workers longer to accomplish manually. It can also be employed when the customer organisation lacks the resources to conduct testing themselves; the resource could be time, money, people or technology. TaaS may not be the best option for testing efforts that require in-depth knowledge of the enterprise's infrastructure.
There are numerous varieties of TaaS, each with its own set of procedures, but in general TaaS works as follows (a minimal code sketch of this flow appears after the list):
◌◌ To conduct the test, a scenario and environment are built. In software testing, this is known as a user scenario.
◌◌ A test is created to assess the company's response to that scenario.
◌◌ The test is carried out in the vendor's secure test environment.
◌◌ The vendor monitors performance and assesses the company's ability to satisfy the test design goals.
◌◌ To improve future performance and results, the vendor and the company collaborate to improve the system or product being tested.
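The sketch below walks through that workflow programmatically. TaaSClient and its methods are hypothetical stand-ins for a vendor's API, used only to make the sequence of steps concrete.

```python
# Minimal sketch of the TaaS workflow listed above.
# TaaSClient and its methods are hypothetical stand-ins for a vendor's API.

class TaaSClient:
    def create_scenario(self, name: str, steps: list) -> dict:
        # Step 1: build the user scenario and test environment.
        return {"scenario": name, "steps": steps}

    def run_test(self, scenario: dict) -> dict:
        # Steps 2-3: create the test and run it in the vendor's secure environment.
        return {"scenario": scenario["scenario"], "passed": 17, "failed": 1}

    def report(self, results: dict) -> str:
        # Step 4: assess the results against the test design goals.
        return f"{results['scenario']}: {results['passed']} passed, {results['failed']} failed"

client = TaaSClient()
scenario = client.create_scenario("checkout flow", ["add item", "pay", "confirm order"])
print(client.report(client.run_test(scenario)))
# Step 5: vendor and company then iterate together on the system under test.
```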
The primary advantages of testing as a service are the same as those of employing any outsourced service. They revolve around the fact that the company paying for the service is not required to host or maintain the testing processes and technology.
Reduced costs: Companies are not required to host infrastructure or pay staff; there are no licencing or staffing fees.
Pay-as-you-go pricing: Companies pay only for what they use.
Less rote maintenance: In-house IT personnel will do less rote maintenance.
High flexibility: Companies' service plans can be easily adjusted as their needs change.
Less-biased testers: The test is carried out by a third party with little prior knowledge of the product or firm, so it is not influenced by internal personnel.
Data integrity: The vendor cleans test data and runs tests in controlled conditions.
Scalability: TaaS products can be tailored to the size of the organisation.
1.1.4 Storage as a Service
Storage as a service (STaaS) is a managed service in which the provider gives the consumer access to a data storage platform. The service can be offered on-premises using infrastructure dedicated to a single customer, or it can be delivered from the public cloud as a shared service acquired on a subscription basis and invoiced based on one or more consumption metrics.

Individual storage services are accessed by STaaS customers via regular system interface protocols or application programming interfaces (APIs). Bare-metal storage capacity, raw storage volumes, network file systems, storage objects, and storage programmes that provide file sharing and backup lifecycle management are typical offerings.

Storage as a service was first viewed as a cost-effective solution for small and medium-sized organisations without the technical personnel and capital funding to construct and operate their own storage infrastructure; it is now adopted far more widely.
Uses of STaaS
Storage as a service can be used for data transfers, redundant storage and data restoration from corrupted or missing files. CIOs may want to use STaaS to quickly deploy resources or to replace some existing storage, freeing up space for on-premises storage gear. CIOs may also value the option to customise storage capacity and performance based on workload.

Instead of maintaining a huge tape library and arranging for tape vaulting (storage) elsewhere, a network administrator who utilises STaaS for backups could designate what data on the network should be backed up and how frequently it should be backed up. Their company would sign a service-level agreement (SLA) in which the STaaS provider agrees to rent storage space on a cost-per-gigabyte-stored and cost-per-data-transfer basis, and the company's data would then be automatically transferred at the specified time over the storage provider's proprietary WAN or the internet. If the company's data becomes corrupted or lost, the network administrator can contact the STaaS provider and request a copy of the data.
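To make the SLA's pricing structure concrete, here is a tiny worked example of a cost-per-gigabyte-stored plus cost-per-data-transfer bill; the unit rates are invented for illustration and are not any provider's actual prices.

```python
# Worked example of the SLA billing model described above.
# The rates are invented assumptions, not real provider pricing.

STORAGE_RATE_PER_GB = 0.02    # cost per GB stored per month (assumed)
TRANSFER_RATE_PER_GB = 0.05   # cost per GB transferred (assumed)

def monthly_staas_bill(gb_stored: float, gb_transferred: float) -> float:
    """Combine the cost-per-GB-stored and cost-per-GB-transferred components."""
    return gb_stored * STORAGE_RATE_PER_GB + gb_transferred * TRANSFER_RATE_PER_GB

# 500 GB kept in backup storage and 120 GB uploaded during the month:
print(monthly_staas_bill(500, 120))  # 0.02*500 + 0.05*120 = 16.0
```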
Organisations that employ STaaS will often use a public cloud for storage and backup purposes rather than storing data on-premises. Different storage strategies may be used for STaaS in public cloud storage: backup and restore, disaster recovery, block storage, SSD storage, object storage and bulk data transfer are all examples of storage technologies. Backup and restore refers to the process of backing up data to the cloud in order to protect it in the event of data loss. Disaster recovery refers to protecting and replicating data from virtual machines (VMs). Customers can use block storage to provision block storage volumes for lower-latency I/O. SSD storage is another type of storage that is commonly utilised for demanding read/write and I/O operations. Object storage systems, which have higher latency, are utilised in data analytics, disaster recovery and cloud applications. Cold storage is used for data that must be retained but is rarely accessed. Bulk data transfers move data using discs and other hardware.
Advantages of STaaS
●● Storage costs: Expenses for personnel, hardware and physical storage space are decreased.
●● Disaster recovery: Having numerous copies of data kept in separate locations can help disaster recovery strategies work better.
●● Scalability: Users of most public cloud services pay only for the resources they use.
●● Syncing: Files can be synced automatically across multiple devices.

Disadvantages of STaaS
●● Security: Users may wind up sending business-sensitive or mission-critical data to the cloud, making it crucial to select a dependable service provider.
●● Potential storage costs: If bandwidth limits are exceeded, this could be costly.
●● Potential downtimes: Vendors may experience periods of downtime during which the service is unavailable, which can be problematic for mission-critical data.

Well-known STaaS providers include:
◌◌ Amazon Web Services (AWS)
◌◌ Microsoft Azure
◌◌ Google Cloud
◌◌ Oracle Cloud
◌◌ Box
◌◌ Arcserve
1.2 Cloud Service Development
With the rapid rise of cloud computing, many companies are shifting their computing activities to the cloud. Cloud computing refers to the delivery of computer or IT infrastructure over the Internet, that is, the provisioning of shared resources, software, apps and services via the internet to meet the customer's elastic demand with minimal effort or interaction with the service provider.

It is a network access model that enables ubiquitous, convenient, on-demand network access to a shared pool of computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or interaction from service providers.
Cloud-based high-performance computing attempts to address needs that emerged over the following phases in the evolution of computing:

Phase: Description
Decentralized Computing: Rise in demand for personal computers to meet users' needs; birth of IT services.
Network Computing (1990s): PCs, laptops and servers were connected together through local networks to share resources and increase performance.
Internet Computing (2001): Local networks were connected to other local networks, forming a global network such as the internet, to utilise remote applications and resources.
Grid Computing (beyond 2010): Solving large problems with parallel computing through distributed computing.
Cloud Computing (beyond 2010): Cloud computing is the provision of computer or IT infrastructure through the Internet, that is, the provisioning of shared resources, software, applications and services over the internet to meet the demand of the customer with minimum effort or interaction with the service provider.
Historically, computing power was a precious and expensive resource. With the advent of cloud computing, it is now abundant and inexpensive, resulting in a significant paradigm shift from scarcity computing to abundance computing. This computing revolution hastens the commoditisation of products, services and business models, while also disrupting the current information and communications technology (ICT) industry. Computing is delivered in the same way as utilities such as water, electricity, gas and telephony: cloud computing provides on-demand computing, storage, software and other IT services with metered billing based on usage.
With cloud computing, IT resources are converted into a set of utilities that can be supplied and assembled in hours rather than days, allowing systems to be deployed without incurring heavy maintenance costs. The long-term objective of cloud computing is that IT services be traded as utilities on an open market, without technological or regulatory barriers.

We can hope that in the near future a worldwide digital market for cloud computing services will let us identify solutions that clearly meet our needs and plug them into our applications. Such a market would enable the automation of the discovery and integration processes with existing software platforms. A digital cloud trading platform would also allow service providers to increase their earnings; a provider may even consume a competitor's cloud service in order to meet its own promises to consumers.

Cloud platforms will keep evolving, allowing us to access and connect at an ever larger scale. The security and stability of cloud computing will continue to develop, making it even safer in a number of ways. In time we will focus on the services and applications that the cloud enables rather than on the "cloud" as a technology in itself. The combination of wearables and bring your own device (BYOD) with cloud technology and the Internet of Things (IoT) will become so widespread in personal and professional life that cloud technology will fade into the background as an enabler.
Historical Developments
Cloud computing is not a cutting-edge technology. Cloud computing has gone through several stages of development, including grid computing, utility computing, application service provision, software as a service, etc. However, the idea of providing computing resources across a worldwide network was first introduced in the 1960s. The market for cloud computing is anticipated to reach $241 billion by 2020 (Forrester Research). But how we got there, and where it all began, is explained by the history of cloud computing.

The first commercial and consumer cloud computing websites were built in 1999 (Salesforce.com and Google), hence cloud computing has a recent history. As cloud computing is the solution to the issue of how the Internet may enhance business technology, it is intimately related to both the development of the Internet and the advancement of corporate technology. Business technology has a rich and fascinating history, nearly as long as that of businesses themselves. However, the evolution that has most directly influenced cloud computing starts with the introduction of computers as providers of practical business solutions.
History of Cloud Computing
Cloud computing is a cutting-edge technology today. A brief history of cloud computing follows.

EARLY 1960s
John McCarthy, a computer scientist, developed the idea of time-sharing, which enabled multiple users to use a pricey mainframe concurrently. This idea is hailed as a significant early contribution to what later became cloud computing.

IN 1969
The Advanced Research Projects Agency Network (ARPANET) was developed under J.C.R. Licklider, whose goal was to connect everyone on the planet and enable universal access to data and applications.

IN 1970
Use of virtualization products like VMware: it became possible to run many operating systems concurrently in isolated environments, operating a completely separate computer (a virtual machine) under a different operating system.

IN 1997
The first known definition of "cloud computing", "a paradigm in which computing boundaries are set purely by economic rather than technical restrictions alone", appears to have been provided by Prof. Ramnath Chellappa in Dallas in 1997.

IN 1999
In 1999, Salesforce.com was introduced as the first company to offer client applications through a straightforward website. The service provider was able to offer software applications over the Internet to both niche and mainstream software providers.
IN 2003
Xen, a hypervisor also known as a Virtual Machine Monitor (VMM), is a software system that enables many virtual guest operating systems to be run concurrently on a single machine. Its first public release came in 2003.

IN 2006
The Amazon cloud service was launched in 2006. First, its Elastic Compute Cloud (EC2) enabled users to access machines and run their own cloud apps. Simple Storage Service (S3) was then released. These services used the pay-as-you-go model, which has since evolved into the accepted practice for both users and the sector at large.

IN 2013
In 2012, the global market for public cloud services expanded by 18.5% to £78 billion, with IaaS being one of the services with the greatest market growth.

IN 2014

Figure: The evolution of distributed computing technologies, 1950s-2010s.
As per TechTarget, the term "distributed computing" refers to the use of several computer systems to tackle a single task. In distributed computing, a single task is divided into several parts, with separate machines handling each part. Because the computers are connected, they can communicate with one another to address the issue. If done correctly, the computers operate as a single unit.
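The divide-and-combine idea can be illustrated with a short, self-contained sketch. A process pool on one machine stands in for the separate machines of a real distributed system, purely to show the pattern of splitting a task and merging the partial results.

```python
# Single-machine sketch of the divide-and-distribute pattern described above:
# a task (summing a large range) is split into parts handled by separate workers,
# and the partial results are combined. Real distributed systems would run the
# workers on separate, networked machines; a process pool stands in for them here.

from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    start, end = bounds
    return sum(range(start, end))

def distributed_sum(n, parts=4):
    # Split the single task [0, n) into `parts` roughly equal chunks.
    step = n // parts
    chunks = [(i * step, (i + 1) * step if i < parts - 1 else n) for i in range(parts)]
    with ProcessPoolExecutor(max_workers=parts) as pool:
        # Each worker handles one chunk; the results are merged into one answer.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(distributed_sum(1_000_000))  # same result as sum(range(1_000_000))
```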
Distributed computing aims at establishing affordable, open and safe links between users and IT resources. Additionally, it provides fault tolerance and offers access to resources in the event that one component fails.

The way resources are distributed in computer networks is really not all that unusual. This was first accomplished with the use of mainframe terminals, progressed to minicomputers, and is currently feasible with personal computers and client-server architectures.

A distributed computer architecture consists of a number of very light client computers together with one or more dedicated management servers. Client agents typically detect when a machine is idle and inform the management server that the machine is free to use. The agent then requests an application package. When the client receives this package from the management server, it runs the application software whenever it has free CPU cycles and sends the results back to the management server. When the user logs back in, the management server returns the machine's resources, having used them to complete various tasks while the user was away.
Mainframe Computing: The mainframe acts as the central data store in an organisation's IT infrastructure. It communicates with users through less capable hardware, such as workstations or terminals. By consolidating data into a single mainframe repository, it is simpler to manage, update and safeguard data integrity. In contrast to smaller machines, mainframes are typically employed for large-scale procedures that demand higher levels of availability and security. Large enterprises utilise mainframe computers largely for processing mass data for things like censuses, industry and consumer statistics, enterprise resource planning and transaction processing. In the late 1950s, mainframes had a simple interactive interface and transmitted data and programmes using punched cards, paper tape or magnetic tape.

To handle back-office activities like payroll and customer billing, they functioned in batch mode, mostly using repetitive tape and merging operations followed by a line print to continuous stationery using pre-printed ink. Digital user interfaces are now almost exclusively employed to run applications (like airline reservations) rather than to create the software. Although mainly replaced by keypads, typewriter and teletype machines were the usual network operators' control consoles in the early 1970s.
Cluster Computing: In the computer clustering approach, a fast local area network (LAN) is used to connect a set of computer nodes (personal computers employed as servers). The "clustering middleware", a software layer placed in front of the nodes that enables users to access the cluster as a whole through a single-system-image concept, coordinates the computing nodes' activities.

Typically, clusters are utilised to provide more computational power than a single computer can, in order to support high availability, higher reliability or high-performance computing. Since it uses off-the-shelf hardware and software components, as opposed to mainframe computers, which use custom-built hardware and software, the cluster technique is more cost, power and processing-speed efficient compared with other technologies.
Grid Computing: Grid computing combines computer resources from various domains to accomplish a main goal. It allows networked computers to collaborate on a job and act as a single supercomputer. A grid generally works on numerous networked jobs, but it can also handle particular applications. It is designed to handle many small problems as well as problems that are too big for a single supercomputer. A multi-user network that supports discontinuous information processing is a feature of computing grids.

A grid typically connects a number of computer clusters, and a cluster's size might range from one tiny network to many. The technology is employed in a wide range of applications, including mathematics, research and instructional tasks, through a number of computing resources. It is frequently used in web services like ATM banking, back-office infrastructure, scientific and marketing research, as well as structural analysis. In grid computing, applications run in a parallel networking environment to address computational issues: each PC is connected, and information is combined into a computational application.
Virtualization
Cloud computing is based on virtualization, a method that improves the utilisation of actual computer hardware. Through the use of software, virtualization may divide the hardware components of a single computer, such as processors, memory, storage and more, into several virtual computers, commonly referred to as VMs. Despite only using a piece of the underlying computer hardware, each VM runs its own OS and functions like a standalone machine.

As a result, virtualization enables a considerably more efficient use of physical computer hardware, enabling an organisation to get a higher return on its hardware investment.

Today, virtualization is a standard approach in business IT architecture. The technology is also what powers the cloud computing industry. Because of virtualization, cloud service providers can serve customers using their own physical computing hardware, and cloud customers can buy only the computer resources they require at the time they require them and expand them affordably as their workloads grow.
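The partitioning idea behind virtualization can be sketched as a simple capacity check: a host's cores and memory are carved up among VMs, and a new VM is admitted only if enough capacity remains. The Host class and its numbers are illustrative, not any real hypervisor's API.

```python
# Minimal sketch of partitioning one host's resources among virtual machines.
# The Host class and figures are illustrative, not a real hypervisor API.

from dataclasses import dataclass, field

@dataclass
class Host:
    total_cores: int
    total_mem_gb: int
    vms: list = field(default_factory=list)  # list of (cores, mem_gb) per VM

    def create_vm(self, cores: int, mem_gb: int) -> bool:
        """Admit the VM only if the host still has spare cores and memory."""
        used_cores = sum(c for c, _ in self.vms)
        used_mem = sum(m for _, m in self.vms)
        if used_cores + cores <= self.total_cores and used_mem + mem_gb <= self.total_mem_gb:
            self.vms.append((cores, mem_gb))
            return True
        return False

host = Host(total_cores=16, total_mem_gb=64)
print(host.create_vm(4, 16))   # True: fits on the host
print(host.create_vm(8, 32))   # True: still fits
print(host.create_vm(8, 32))   # False: would exceed the 16-core host
```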
Web 2.0
Web 2.0 refers to websites that stress user-generated content, user-friendliness and participatory website design and use, without imposing technical restrictions on designers.

The phrase "Web 2.0" is used to describe a variety of websites and applications that enable anyone to produce or share content online. The ability for individuals to create, exchange and communicate is one of the major features of the technology. Web 2.0 differs from earlier types of websites in that it allows anybody to easily create, publish and communicate their work to the world without the need for any prior knowledge of web design or publishing. The format makes it simple and popular to share knowledge with a small community or a much larger audience. A university, for example, can use these tools to communicate with its employees, students and other members, and students and coworkers may be able to engage and communicate well through them.
The web apps that allow interactive data sharing, user-centred design and global collaboration symbolise the progress of the World Wide Web in this context. The term "Web 2.0" refers to a broad category of web-based technologies, including blogs and wikis, social networks, podcasts, social bookmarking websites and Really Simple Syndication (RSS) feeds. The basic idea behind Web 2.0 is to improve the interconnection of web applications and make it possible for consumers to access the Web quickly and effectively. Web applications that offer computing capabilities on demand over the Internet are exactly what cloud computing services are.

Cloud computing can be viewed as providing a key component of the Web 2.0 infrastructure, and it benefits from the Web 2.0 framework. A number of web technologies sit beneath Web 2.0, including Rich Internet Applications (RIAs) that have recently debuted or moved to a new production stage. The most well-known and widely accepted technology on the Web is AJAX (Asynchronous JavaScript and XML). Further technologies include widgets (plug-in modular components), RSS (Really Simple Syndication) and web services (e.g. SOAP, REST).
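As a small illustration of the REST-style web services mentioned above, the sketch below issues an HTTP GET request and decodes a JSON response using only the Python standard library; the public https://httpbin.org/get endpoint is used purely as a stand-in service.

```python
# Minimal sketch of consuming a REST-style web service; https://httpbin.org/get
# is used purely as a stand-in endpoint for demonstration.

import json
import urllib.request

def fetch_json(url: str) -> dict:
    """Issue an HTTP GET request and decode the JSON body of the response."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

if __name__ == "__main__":
    data = fetch_json("https://httpbin.org/get?module=data-warehousing")
    print(data["args"])  # {'module': 'data-warehousing'}
```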
1.2.2 Cloud Computing Challenges
Companies generate and store massive amounts of data and, as a result, confront numerous security and management challenges. Businesses therefore look for ways to streamline and optimise their processes and to improve cloud computing administration. The main challenges of cloud computing are discussed below.
1. Security and Privacy of Cloud: Cloud data storage must be secure and private. Clients are completely reliant on the cloud provider; in other words, the cloud provider must implement appropriate security procedures to protect customer data. Security is also partly the customer's responsibility, since they must use a strong password, not share the password with others, and update it regularly. Certain issues may arise if data is stored outside the firewall, which the cloud provider can resolve. Hacking and viruses are also major issues because they might harm a large number of clients, and data loss or disruptions to the service may occur.
2. Remote access: A related hurdle is remote access, since the cloud must remain accessible from anywhere without weakening security.
3. Reliability and Flexibility: Reliability and flexibility are demanding requirements for cloud consumers; meeting them helps avoid data leakage and supplies customer trustworthiness. To address this issue, third-party services should be monitored, as should the performance, robustness and dependability of the business.
4. Cost: Cloud computing is inexpensive, but adapting the cloud to meet customer demand can be costly at times. Such changes can be detrimental to small businesses, because extra demand can often be expensive to serve. Furthermore, transporting data from the cloud to the premises can be pricey.
5. Downtime: Downtime is the most common cloud computing difficulty, as no cloud provider guarantees a platform free of downtime. The Internet connection is also vital: it can become a problem if a corporation has an unreliable internet connection that experiences outages.
6. Lack of resources: Many organisations are attempting to overcome a shortage of resources and skills in the cloud market by hiring new, more experienced staff. These personnel will not only assist in resolving business difficulties, but will also train existing staff to benefit the organisation. Currently, many IT staff are working to improve their cloud computing skills, which is challenging for the CEO when employees are inexperienced. It follows that individuals who are exposed to the latest developments and associated technology will be more valuable in the workplace.
7. Dealing with Multi-Cloud Environments: Today, full-fledged businesses rarely run on a single cloud. According to the RightScale survey, nearly 84 percent of organisations use a multi-cloud approach, and 58 percent use hybrid cloud systems that combine public and private clouds. Furthermore, corporations employ around five separate public and private clouds.
Long-term predictions concerning the future of cloud computing technology are therefore more challenging for IT infrastructure teams to make, although professionals have proposed ways of managing this complexity.
8. Migration: While launching a new application in the cloud is relatively straightforward, it is more difficult to migrate an existing programme to a cloud computing environment. According to a survey conducted by Velostrata, 62% of respondents said their cloud migration projects were more difficult than expected. Furthermore, 64% of migration initiatives took longer than projected, and 55% went over budget. Organisations that migrated their applications to the cloud reported, in particular, migration downtime (37%), data synchronisation issues before cutover (40%), trouble getting migration tooling to work well (40%), slow data migration (44%), security configuration issues (40%) and time-consuming troubleshooting (47%). To address these issues, over 42% of IT specialists stated they wanted to see their budget grow, nearly 45% wanted to work with an in-house professional, 50% wanted to extend the project, and 56% wanted more pre-migration tests.
9. Vendor lock-in: The issue with vendor lock-in in cloud computing is that clients become dependent (i.e. locked in) on a single cloud provider's deployment and cannot migrate to another vendor in the future without incurring large costs, regulatory constraints or technological incompatibilities. The lock-in condition can be seen in programmes built for specific cloud platforms, such as Amazon EC2 and Microsoft Azure, which are not readily transferred to any other cloud platform and which expose users to modifications made by their providers. In fact, the issue of lock-in arises when a company decides to change cloud providers (or perhaps integrate services from different providers) but is unable to move applications or data across different cloud services because the semantics of the cloud providers' resources and services do not correspond. Because of the heterogeneity of cloud semantics and cloud APIs, technological incompatibility arises and locks in clients.
10. Privacy and Legal issues: The biggest issue with cloud privacy and data security appears to be 'data leaks'. A data breach is broadly defined as the loss of electronically held personal information, and it can result in a slew of losses for both the provider and the customer, including identity theft, debit/credit card fraud for the customer, loss of credibility, future prosecutions, and so on. In the event of a data breach, US law requires that affected individuals be notified; almost every state in the United States now requires notification of affected individuals when a data breach occurs. Problems occur when data is subject to multiple jurisdictions whose data-privacy laws differ. For example, the European Union's Data Privacy Directive clearly stipulates that "data can only leave the EU if it goes to a 'higher degree of security' country." This rule, while simple to state, restricts data transfer and so reduces data capacity. The EU's regulations are enforceable.
e
What distinguishes cloud computing so significantly from conventional data storage? Cloud computing lets businesses and private individuals use databases, storage, and computing power to manage their data without having to own and operate physical data centres and servers. These technologies are delivered as cloud computing layers.
The following figure illustrates a tiered cloud computing architecture in which cloud computing is conceptualised as a collection of services:
Figure: Layered Cloud Computing Architecture
The lower layers supply the computing infrastructure and development platform, in addition to providing the foundation for enabling client applications. The uppermost layer is Software as a Service (SaaS), which delivers a whole application as a service on demand.
The SaaS vendor handles hosting, protection, and support: it is responsible for deploying and managing the IT infrastructure (servers, operating system software, databases, data centre space, network access, power and cooling, and so on) and the processes (infrastructure patches/updates, application patches/upgrades, backups, and so on) required to run and manage the entire solution.
SaaS emphasises a complete application provided as a service on demand. SaaS includes a Divided Cloud and Convergence Intelligibility component, and each data item carries either a Read Lock or a Write Lock. SaaS makes use of two types of servers: the Main Consistence Server (MCS) and the Domain Consistence Server (DCS). Cooperation between the MCS and the DCS achieves cache coherence [6]. If the MCS is damaged or compromised, control over the cloud status is lost, so securing the MCS will be very important in the future. Salesforce.com and Google Apps are two examples of SaaS.
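The Read Lock/Write Lock idea mentioned above can be illustrated with a short sketch. The following is a minimal, hypothetical Python example (it is not the actual MCS/DCS protocol): it simply shows how a data item can admit many concurrent readers while allowing only one writer at a time.

```python
import threading

class DataItem:
    """A toy data item guarded by read/write locks (illustrative only)."""

    def __init__(self, value):
        self._value = value
        self._readers = 0
        self._readers_lock = threading.Lock()  # protects the reader counter
        self._write_lock = threading.Lock()    # held by a writer, or by the group of readers

    def read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:             # first reader blocks writers
                self._write_lock.acquire()
        try:
            return self._value
        finally:
            with self._readers_lock:
                self._readers -= 1
                if self._readers == 0:         # last reader lets writers in again
                    self._write_lock.release()

    def write(self, value):
        with self._write_lock:                 # writers get exclusive access
            self._value = value

item = DataItem("initial")
item.write("updated")
print(item.read())  # -> updated
```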
SaaS, the actual software, is the third cloud tier (Software as a Service). This layer offers a complete software solution: an organisation rents an application, and consumers connect to it online, typically through a web browser.
SaaS is the layer where the user accesses the service provider’s offering in a cloud
environment. It must be available on the web, accessible from anywhere, and ideally
compatible with any device. The hardware and software are managed by the service
provider.
Web-based email services like Outlook, Gmail, and Hotmail are examples of SaaS: the email client and your communications reside on the service provider's network.
This allows companies to invest in high-end enterprise apps without having to set
them up or manage them. The service provider then makes hardware, middleware, and
software purchases and updates them. All businesses can now afford it because they
just have to pay for the services they actually utilise.
Additionally, users can access information and data at any time and from any place.
PaaS provides a high degree of integration for executing and testing cloud applications. The client does not manage the underlying infrastructure (including the network, servers, operating systems, and storage), but does control the deployed applications and, possibly, their configurations. PaaS examples include Force.com, Google App Engine, and Microsoft Azure.
The platform, or PaaS, is the second layer of the cloud (Platform as a service). This
layer offers the tools necessary to create apps in a cloud development and deployment
environment.
PaaS involves infrastructure, like IaaS, but it also includes business intelligence, middleware, database management systems, development tools, and more. It is intended to handle the complete lifecycle of a web application, from development and testing to deployment, administration, and updating.
PaaS thus offers the capability to create, test, execute, and host applications when combined with Infrastructure as a Service. It also makes it possible to create applications for several platforms, and it reduces the amount of coding required because the platform already provides pre-coded application components.
IaaS enables the sharing of hardware resources for executing services using virtualization technology. Its primary goal is to make resources such as servers, networks, and storage more readily accessible to applications and operating systems. It therefore provides basic infrastructure as an on-demand service and uses Application Programming Interfaces (APIs) for interaction with hosts, switches, and routers, as well as the ability to add new equipment in a simple and transparent manner. In general, the customer does not manage the underlying hardware in the cloud infrastructure, but does have control over the operating systems, storage, and deployed applications. The service provider owns the hardware and is responsible for housing, running, and maintaining it. The customer is typically charged on a per-use basis. Amazon Elastic Compute Cloud (EC2), Amazon S3, and GoGrid are examples of IaaS.
Pay-as-you-go pricing is used here, so you only pay for what you use.
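Because IaaS is consumed through APIs, provisioning can be scripted rather than performed manually. The sketch below is a hedged illustration using the AWS SDK for Python (boto3); the AMI ID, instance type, and bucket name are placeholders, and it assumes that AWS credentials have already been configured.

```python
import boto3  # AWS SDK for Python; assumes credentials are configured locally

# Launch a single virtual server (EC2 instance) on demand.
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI identifier
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print("Launched instance:", instances[0].id)

# Store an object in S3 -- storage rented on the same pay-per-use basis.
s3 = boto3.client("s3")
s3.put_object(Bucket="example-bucket", Key="report.txt", Body=b"hello cloud")
```

Billing then follows actual usage: the instance is charged while it runs and the object is charged for the storage it occupies.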
With IaaS, companies and organisations can run their own platforms and applications on infrastructure supplied by a service provider.
Businesses can swiftly set up and operate test and development environments thanks to IaaS, so they may be able to launch new application solutions more quickly. Additionally, it enables firms to analyse huge volumes of data and to manage storage requirements better as the business expands. Finally, it makes backup system management simpler.
◌◌ Reduced recurring costs as well as lower initial setup and management costs
for on-site data centres.
1.4.1 Types of Cloud Computing and Features
As shown in the figure below, in the cloud deployment model, networking, platform, storage, and software infrastructure are supplied as services that scale up or down depending on demand. The cloud computing model comprises four core deployment paradigms, as follows:
Figure: Cloud Computing Deployment Models
Private Cloud: "Private cloud" is a term some vendors have recently used to describe offerings that emulate cloud computing on private networks. It is installed within an organisation's internal enterprise datacenter. In the private cloud, scalable resources and virtual applications provided by the cloud vendor are pooled together and made available for cloud clients to share and use. It differs from the public cloud in that all cloud resources and applications, such as Intranet functionality, are managed by the organisation itself. Because of its internal exposure, private cloud usage can be significantly more secure than public cloud usage: only the organisation and its designated partners may access and operate on a specific private cloud. Eucalyptus Systems is one of the best examples of a private cloud.
Public Cloud: The term "public cloud" describes cloud computing in the traditional mainstream sense, in which resources are dynamically provisioned over the Internet and the provider's elastic infrastructure is well suited to handling spikes in demand. Public clouds are less secure than the other cloud models because they place an added burden on ensuring that all applications and data accessed on the public cloud are protected from malicious attacks. Microsoft Azure and Google App Engine are examples of public clouds.
Hybrid Cloud: A hybrid cloud is a private cloud linked to one or more external cloud services, centrally managed, provisioned as a single unit, and protected by a secure network. It provides virtual IT configurations by combining public and private clouds. A hybrid cloud offers more secure control of data and applications while also allowing various parties to access information over the Internet. It also features an open architecture that allows integration with other management frameworks. A hybrid cloud can describe the pairing of a local device, for example a Plug computer, with cloud services. It can also describe configurations that combine virtual and physical collocated resources, for example a mostly virtualized environment that still requires physical servers, routers, or other hardware such as a network appliance acting as a firewall or spam filter. Amazon Web Services (AWS) is an example of a hybrid cloud.
Community Cloud: A community cloud shares infrastructure among several organisations under a given cloud model. These organisations are typically bound by an agreement between similar business associations, for example banking or educational institutions. This concept indicates that a cloud environment may exist locally or remotely. Facebook is an example of a community cloud.
Furthermore, as technology advances, derivative cloud deployment models are emerging from varying client demands and requirements. One such architecture is the virtual private cloud, in which a public cloud is used privately and is linked to the internal resources of the client's data centre. With the development of high-end network access technologies such as 2G, 3G, Wi-Fi and WiMAX, together with feature-rich phones, another derivative of cloud computing has emerged, commonly referred to as Mobile Cloud Computing (MCC).
Enterprises are moving fast to give their employees the ability to access office applications from a mobile phone anywhere. Recent technological advancements, such as the rise of HTML5 and other application development tools, have only increased the demand for mobile cloud computing, and the trend toward feature phones continues to grow.
1.5 Cloud Computing Security Requirements, Pros and Cons, and Benefits
1.5.1 Cloud Computing Security Requirements
Many businesses (such as major enterprises) are likely to find that the infrastructure and data security offered by public cloud computing are less robust than their own current capabilities. As a result of this presumably less secure, higher-risk security posture, there is also a higher chance that privacy may be violated. However, it should be emphasised that many small and medium-sized businesses (SMBs) have constrained IT and dedicated information security resources, which causes them to give this area little attention. For these companies, a public cloud service provider's (CSP) level of security may actually be higher.
Even a seemingly minor data breach can have significant financial repercussions (e.g., the cost of incident response and potential forensic investigation, compensation for identity theft victims, punitive penalties), as well as long-term effects such as bad press and a loss of customer confidence. Despite the overused headlines, privacy concerns frequently do not match the degree of intrinsic risk.
What Is Privacy?
The idea of privacy varies greatly between (and occasionally even within) nations, cultures, and legal systems. A succinct definition is difficult, if not impossible, because it is shaped by legal interpretations and societal expectations. Privacy rights and obligations cover the collection, use, disclosure, storage, and destruction of personal data (or personally identifiable information, PII). Ultimately, privacy is about the accountability of organisations to data subjects and the transparency of their practices around personal information.
Similarly, there is no universally accepted definition of what personal information is. For the purposes of this discussion, the definition adopted by the Organization for Economic Cooperation and Development (OECD) is used: any information relating to an identified or identifiable individual (data subject).
Another definition that is gaining popularity is the one given in the Generally Accepted Privacy Principles (GAPP) standard by the American Institute of Certified Public Accountants (AICPA) and the Canadian Institute of Chartered Accountants (CICA): "the rights and obligations of individuals and organisations with respect to the collection, use, retention, and disclosure of personal information."
Though cloud services enable clients to minimise start-up costs, lower running costs, and boost their agility by quickly obtaining services and infrastructure resources when needed, they also raise serious security and privacy problems. Due to the dynamic nature of the cloud computing architecture, and because the hardware and software components of a single service in this model span many trust domains, the move to this model exacerbates security and privacy concerns.
Cloud computing settings are multi-domain environments in which each domain may use a different set of security, privacy, and trust standards, and possibly different mechanisms, interfaces, and semantics. These domains might represent individually enabled services or other infrastructure or application components. Addressing security in such multi-domain settings requires drawing on existing research on multi-domain policy integration and secure service composition.
Clients that subscribe to the cloud cede some control to a third party in exchange for the chance to cut capital expenses, relieve themselves of infrastructure administration, and concentrate on core competencies. The agility provided by on-demand compute provisioning, and the ability to align information technology more easily with business plans and demands, make the cloud even more alluring to clients. At the same time, users are worried about the security dangers of cloud computing and about losing direct control over systems for which they are still responsible.
Cloud computing (CC) is a computing concept built on parallel computing, distributed computing, and grid computing rather than a single specific technology. The CC business model implies two primary actors: the cloud service provider (CSP) and the cloud service user (CSU). Corporate software and data are kept on remote servers, while the CSPs deliver applications over the Internet that CSUs access through web browsers, desktop applications, and mobile apps. Three types of cloud are distinguishable based on the service level offered: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). Security has been identified as the biggest challenge for all three.
Cloud services have gradually become a key feature of modern computers, smartphones, and tablets. Each leading company in the industry offers consumers its own services in this segment, which significantly intensifies the competition. Cloud technologies are a flexible, highly efficient, and proven platform for providing IT services over the Internet. However, their increasing use requires that the provision of cloud services be designed in a way that interoperates with the rest of the IT landscape. The figure below shows the scope of systems engineering management activities.
Figure: Systems Engineering Management Activities
The security architecture adopted should meet the needs and objectives of the organisation. It should promote stability, allow for ongoing innovation, support corporate objectives, and assist in cost-cutting.
Several frameworks guide the design of a security architecture. Many new models are based on the System Security Engineering Capability Maturity Model (SSE-CMM), which was developed in the early 2000s and highlights the value of practising security engineering. These kinds of models can be used as reference models for cloud computing, security engineering, security architecture, and security operations. A few models worth mentioning are:
architecture, and security operations. A few models worth mentioning are:
●● The International Standards for Security are ISO 27001 and ISO 27006. They
cover management, ideal procedures, specifications, and methods.
●● The European cyber security agency’s suggestions for security risks when
ni
users give their personal data to a third party.” He has concerns about software
vendors forcing people to utiliseparticular platforms and systems. Stallman stated
e
cloud services will be challenging to implement in sectors like defence, government
institutions, e-services, etc.
in
Data security is another issue that comes up when using cloud-based IT services. It entails growing reliance on the supplier and on the security features it provides for transferring and storing data. Questions arise about the likelihood of information leaks to rival businesses or malicious persons, how best to handle a potential change of provider, what would happen in the event of a system breakdown, and other issues related to future hosting conditions.
A service portfolio provides a commercial value assessment of the cloud service
provider’s offerings. It is a dynamic strategy for controlling financial value-based
investments in service management throughout the entire organisation. Managers can
evaluate the expenses and associated quality requirements using Service Portfolio
Management (SPM).
Access Control
This reflects how strictly the system restricts unauthorised parties (e.g. human users, programs, processes, devices) from accessing its resources. Access control security requirements therefore address the need to identify the parties who want to interact with the system, confirm that those parties are who they claim to be, and grant them access only to the resources they are authorised to use.
Almulla and Yeun (2013) outline the many issues and the present authentication and authorisation practices for CSUs' cloud access, while Sato et al. make use of the newer Identity and Access Management (IAM) protocols and standards.
IAM comprises strategies that impose rules and policies on CSUs using a variety of tactics, such as enforcing login passwords, granting CSUs access rights, and provisioning CSU accounts, in order to offer an acceptable level of protection for company resources and data. A model-driven approach can be useful for designing security needs because it converts security objectives into enforceable security regulations. Zhou et al. explain this method, which enables CSUs to declare security intentions in system models that allow a straightforward characterisation of security needs.
A 3D model might be used to capture some of the specifications, with the value of the data first being determined and the data then being placed appropriately in nested "protective rings." A different model, the Security Access Control Service (SACS), combines Access Authorization (for CSUs who want to request cloud services), Cloud Connection Security, and the Security API (which ensures safe use of the services after access to the cloud has been gained). Authentication and capability-based access control are other methods for threat prevention. When it comes to personal cloud computing, for instance, prominent cloud services such as Amazon EC2 and Azure address security threats and requirements by regulating access to platform services based on permissions encoded in cryptographic capability tokens.
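To make the capability-token idea concrete, the sketch below shows a minimal, hypothetical token scheme based on an HMAC signature. It is not the actual EC2 or Azure mechanism; the secret key, subject, and permission names are invented for the example.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-secret"  # placeholder; a real service would manage keys securely

def issue_token(subject: str, permissions: list) -> dict:
    """Issue a capability token whose permissions are bound by an HMAC signature."""
    payload = json.dumps({"sub": subject, "perms": sorted(permissions)}, sort_keys=True)
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def is_allowed(token: dict, action: str) -> bool:
    """Verify the signature, then check whether the action was granted."""
    expected = hmac.new(SECRET_KEY, token["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["signature"]):
        return False  # token was tampered with or was not issued by this service
    return action in json.loads(token["payload"])["perms"]

token = issue_token("alice", ["storage:read"])
print(is_allowed(token, "storage:read"))    # True
print(is_allowed(token, "storage:delete"))  # False
```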
◌◌ The architecture of the cloud.
◌◌ The features of the multi-tenant environment
1. CSUs typically access cloud environments through a web application, which is frequently regarded as CC's weakest link. Because browsers cannot generate XML-based security tokens on their own, current browser-based authentication protocols for the cloud are insecure. Technical solutions have been put forward to get around these challenges, such as encrypting data while it is stored in the care of a cloud service provider or while it is in transit to a CSU.
2. Virtual machine (VM) interconnection is one of the greatest problems in the cloud's architecture. Isolation, which ensures that one VM cannot harm another VM running on the same host, is a major issue in virtualization. When numerous VMs are present on the same hardware (which is typical for clouds), one VM may be unlawfully accessed through another VM. The Virtual Network Framework, which consists of three layers (a routing layer, a firewall, and a shared network) and aims to regulate intercommunication among VMs installed on physical computers with improved security, is one way to prevent this.
3. To prevent potential issues caused by role-name conflicts, cross-level management, and the composition of tenants' access control, requirements should be aligned to the specific context of multi-tenant environments. The SaaS Role Based Access Control (S-RBAC) model and reference architectures incorporating the idea of "interoperable security" are solutions that satisfy these needs. These tools make it easier to distinguish between "home clouds" and "foreign clouds" when a CSP cannot meet demand with its own available resources.
Attack/Harm Detection
This addresses the requirements for detecting, recording, and raising warnings whenever an attack is launched and/or succeeds. Four categories of solutions have been put forward. Cloud firewalls form the first category and serve as the attack-prevention filtering method. Their key strength is their dynamic and intelligent technology, which fully utilises the cloud to collect and distribute threat information in real time. Unfortunately, the cloud security standards used by cloud firewalls are still not harmonised, with providers competing with one another and adopting different standards.
The second category concerns frameworks for measuring security that are appropriate for SaaS. One such framework, for instance, seeks to ascertain the status of user-level applications in guest VMs that have been running for some time.
The third group of solutions includes cloud community watch services, which
constantly analyse data from millions of CSUs to find newly inserted malware threats.
Community services receive greater web traffic than individual CSPs and can therefore
make use of more defences.
The fourth and final category includes multi-technology strategies, such as the cloud-security-based intelligent Network-based Intrusion Prevention System (NIPS), which combines four key technologies to block visits based on real-time analysis: active defence technology, linkage with the firewall, a synthesis detecting method, and hardware acceleration.
Integrity
This relates to how well the various components are guarded against unauthorised and deliberate corruption. Data integrity, hardware integrity, personal integrity, and software integrity are the four components of integrity. Different models are deployed by multi-model techniques in response to security-related features: in one approach, a data tunnel and a cryptography model team up to provide data security during storage and transmission, whereas in another, five models deal separately with separation, availability, migration, data tunnelling, and cryptography.
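To make the storage-and-transmission cryptography idea concrete, here is a minimal, hedged sketch using the third-party cryptography package (Fernet symmetric encryption). The record and key handling are purely illustrative, not a production design.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# The key stays with the data owner, not with the cloud provider, so data that is
# stored or transmitted in encrypted form remains confidential.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b"customer-id=42; card=**** **** **** 1234"
ciphertext = fernet.encrypt(record)     # what the CSP stores / what travels on the wire
plaintext = fernet.decrypt(ciphertext)  # only a key holder can recover the record

assert plaintext == record
print(len(ciphertext), "bytes held in cloud storage")
```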
Last but not least, VM-focused techniques ensure integrity by raising the standards for VM security. In systems where many VMs are co-located on the same physical server, a malicious user in control of one VM may attempt to take over the resources of other VMs, exhaust all system resources, target other VM users by depriving them of resources, or steal server-based data. Jasti et al. (2010) investigate how such co-existing VMs can be used to access other CSUs' data or disrupt service, and they provide useful security measures that can be used to prevent such attacks.
Integrity issues may also occur when SaaS applications in the cloud need to contact on-premises business applications for data interchange and on-premises services. Implementing a proxy-based firewall/NAT traversal solution that maintains the security of on-premises applications while enabling SaaS applications to interface with them, without requiring firewall reconfiguration, can solve these problems. A way for the CSU or the CSP to check whether a record has been updated or tampered with in cloud storage is also needed.
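One simple way to support such a check, sketched below under the assumption that the client keeps a local digest, is to compare a SHA-256 hash of the stored object against the hash recorded when it was uploaded.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return a SHA-256 digest that the client records at upload time."""
    return hashlib.sha256(data).hexdigest()

uploaded = b"quarterly-report-v1"
recorded_digest = digest(uploaded)   # kept by the CSU, outside the cloud

# Later, the object is fetched back from cloud storage and re-checked.
fetched = b"quarterly-report-v1"     # placeholder for the downloaded bytes
if digest(fetched) == recorded_digest:
    print("Record unchanged since upload")
else:
    print("Record was modified in cloud storage")
```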
Security Auditing
By examining security-related events, security staff are able to assess the status and operation of security systems. Typically, this is done to ensure accountability and control as well as compliance with laws and regulations. Approaches to security auditability needs appear to be founded on security configuration management and vulnerability assessment. For instance, one method evaluates the vulnerabilities of each virtual machine (VM) in an infrastructure and then combines these results, using attack graphs, into an overall evaluation of the multi-tier infrastructure's vulnerabilities. For security audits, a suitable logging structure, such as the one suggested in the literature, can be very helpful.
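A tamper-evident log is one common ingredient of such a logging structure. The sketch below is a generic, illustrative hash-chained log rather than any specific published scheme: each entry carries the hash of the previous entry, so a later alteration breaks the chain and is detected during verification.

```python
import hashlib
import json
import time

class AuditLog:
    """A toy append-only audit log; each entry is chained to the previous one by a hash."""

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, actor: str, action: str) -> None:
        entry = {"ts": time.time(), "actor": actor, "action": action, "prev": self._last_hash}
        self._last_hash = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)

    def verify(self) -> bool:
        prev = "0" * 64
        for entry in self.entries:
            if entry["prev"] != prev:
                return False  # the chain is broken, so the log was tampered with
            prev = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        return True

log = AuditLog()
log.record("alice", "login")
log.record("alice", "download:report.txt")
print(log.verify())  # True; editing or deleting an earlier entry would make this False
```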
An alternative strategy is a data protection scheme with public auditing, which provides a means of enabling data encryption in the cloud without compromising accessibility or functionality for authorised parties. Research also covers the auditability needs of IaaS and PaaS in addition to SaaS security auditability. To help with security evaluation and improvement at the IaaS layer, a Security Model for IaaS (SMI) has been proposed. At the PaaS layer, an infrastructure has been presented that combines dynamic web service policy frameworks with semantic security risk management tools to facilitate the mitigation of security threats. The platform takes care of the requirements for dynamically provisioning and configuring security services, modelling security requirements, and connecting operational security events to vulnerability and impact analyses at the business level.
Privacy
Privacy requirements exist in order to protect sensitive information from being obtained by unauthorised parties. Privacy incorporates two features: anonymity and secrecy. It is crucial to identify relevant, understandable, and worthwhile privacy metrics so that privacy needs can be clearly specified. At present, ways to improve privacy include separating the data from the software used by CSUs and using cloud-based virus scanners to obscure the relationship between data elements.
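As a small illustration of separating identifying data from the payload, the sketch below pseudonymises a user identifier with a keyed hash before the record leaves the client. The key and record fields are invented for the example.

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"local-secret"  # stays with the data owner, never sent to the cloud

def pseudonymise(user_id: str) -> str:
    """Replace a direct identifier with a keyed hash so the CSP cannot link records to people."""
    return hmac.new(PSEUDONYM_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

record_for_cloud = {
    "user": pseudonymise("alice@example.com"),  # pseudonym instead of the real identifier
    "purchases": 3,
    "region": "EU",
}
print(record_for_cloud)
```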
Non-repudiation
This subarea contains requirements for preventing one of the parties to a cloud interaction from denying that the interaction took place. To locate the origin of CSUs, one technique makes it possible to capture visitor information and makes it exceedingly difficult for CSUs to misrepresent visitors' identity information. The multi-party non-repudiation (MPNR) protocol is another option, because it offers a fair non-repudiation storage cloud and guards against roll-back attacks.
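Digital signatures are the usual building block for non-repudiation: a request signed with a party's private key can later be verified by anyone holding the matching public key, so the signer cannot plausibly deny having issued it. Below is a small, generic sketch using the cryptography package (Ed25519); the request text is a placeholder.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The CSU signs a request with its private key before sending it to the CSP.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

request = b"DELETE /buckets/reports/2008-q4"
signature = private_key.sign(request)

# The CSP (or an auditor) verifies the signature with the CSU's public key,
# so the CSU cannot later repudiate having issued the request.
try:
    public_key.verify(signature, request)
    print("Signature valid: the request is attributable to the key holder")
except InvalidSignature:
    print("Signature invalid")
```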
1.5.2 Cloud Computing - Pros, Cons and Benefits
Today, cloud computing has largely taken over the IT industry. Regardless of their size, businesses are migrating their current IT infrastructure to the public cloud, to their own private cloud, or to a hybrid cloud that combines the finest aspects of both.
However, a small minority of sceptics continue to question the benefits and drawbacks of cloud computing and whether they should migrate to the cloud at all. The benefits listed below are the key reason why big businesses started using cloud computing early on and why more industries will do so in the near future.
1. Time saving: With cloud computing, we can get work done with just a few clicks; a PC with an Internet connection lets us reach our data from any location.
2. Inferior computer cost: Applications that run in the cloud do not require a powerful machine. Cloud apps do not need the computing power, hard disc space, or even the DVD drive that typical desktop software does.
3. Enhanced performance: A computer performs better when fewer apps are consuming its memory at the same time. Because they have fewer processes and programs loaded into memory, desktop PCs that make use of cloud-based services may boot and operate more quickly.
4. Abridged software costs: Cloud computing programs are frequently available free of charge, as opposed to having to be purchased. The consumer version of Google Docs is one illustration.
5. Immediate software updates: A cloud-based application's updates usually happen automatically and are available as soon as you log in. The most recent version of a web-based application is typically accessible immediately, without the need for an upgrade.
6. Disaster recovery: Companies will not require elaborate disaster recovery strategies once they begin utilising cloud-based services, since the majority of problems are handled quickly and efficiently by cloud computing companies.
7. Enhanced document format compatibility: Any other user who accesses the web-based program has access to all documents created with it. When all users share documents and apps in the same cloud, there are fewer format mismatches.
8. Limitless storage capacity: Cloud computing offers nearly infinite storage capacity. The hard disc space on a PC today is insignificant in comparison to the cloud's storage capacity. However, keep in mind that large-scale storage is typically not free, even in a cloud environment.
9. Better data reliability: A computer failure in the cloud should not affect the storage of data, unlike desktop computing, where a hard disc crash might obliterate personal data. This is because cloud services normally offer numerous layers of security.
10. Universal document access: Documents remain in the cloud and are accessible from any location that has an Internet connection and an Internet-capable device. Documents are instantly accessible from anywhere, eliminating the need to carry them with you.
11. Newest version availability: The version of a document you open from the cloud is always the latest one that was saved.
12. Easier cluster cooperation: Users can easily collaborate with one another on documents and projects. Because the documents are stored in the cloud rather than on individual PCs, an Internet connection is all that is required to collaborate.
13. Domestic device independence: Applications and documents remain usable even when you switch computers or devices.
14. Shrink expenses on technology infrastructure: Maintain simple access to your data with little up-front cost, and pay as you go (weekly, quarterly, or annually) as necessary.
15. Developing accessibility: Access at any time and from any place makes our lives much easier.
This is a new age in which we are moving from the age of iron and stone to the age of technology, and we participate on the biggest platform of all: the Internet. The cloud is not a figure of speech, not something floating above our heads, but a real technology that brings data together and opens doors for future generations. Together we upload pictures, make transactions, learn, chat, post blogs, and even stay socially attached to our colleagues.
This feature of the Internet has been a boon for Internet users, providing them with clouds of their own. With the advancement of technology and growing user needs, it is no big surprise that cloud computing has become the talk of the town. Even for a person with little knowledge of the Internet, the cloud is enormously useful, and people are now grasping just how effective and beneficial the cloud can be for them.
Pros in Cloud Computing
1. Dumping the costly systems: Cloud hosting lets business owners spend the least possible amount on system management. Since everything can be done on the cloud, local systems are barely needed, which saves the money that would have been spent on expensive equipment.
2. Providing various options to access: The ability to access the cloud for many purposes without relying solely on a computer makes it the most widely used technology today. The cloud is accessible outside the office via mobile devices like iPods and tablets, making work simple and effective for users. In addition to improving efficiency, it also improves the services offered to customers, who can reach the desired files and documents with a single touch.
3. No software maintenance expense: The design of cloud computing eliminates the need for additional software, saving users from having to invest in pricey software systems for their companies. All the practical software is already hosted on cloud servers, so the user can relax. It removes the barrier for customers who cannot afford pricey software and its licence costs. Periodic software upgrades are another popular benefit that helps businesses save time and money.
4. Pre-processed platform: Adding a new individual incurs no cost for configuring, organising, or installing a new device. There is no need to modify the platform in order to add a new user or application, because cloud applications can be used without modification.
5. Say no to servers: Using the cloud for business eliminates the significant server cost burden; up to a point, the additional cost associated with server maintenance is eliminated.
6. Centralisation of data: One of the most useful features of the cloud is that it centralises all the data from various projects and enables one-click access to that data from faraway locations.
7. No data is lost and it can be recovered easily: Since all data is automatically backed up in the cloud, cloud computing allows quick data recovery. On personal business servers, data recovery is very expensive or impossible, wasting a lot of time and money.
8. Capable of sharing: Document accessibility also brings easy sharing. You can send documents to a friend or colleague via email whenever you want.
9. Secure storage and accessibility: Cloud services store users' data in a secure place and keep it accessible, with little risk of data loss or corruption, so users can store data without worrying about it.
Cons in Cloud Computing
1. No Internet, no cloud: An Internet connection is required to access the cloud, which is a significant barrier to the expansion of cloud usage globally.
2. Need good bandwidth: With insufficient Internet connectivity we cannot take full advantage of the cloud. Even with a high-capacity satellite connection, high latency can cause poor performance, so achieving the simultaneous availability of all these things is quite challenging.
3. Accessing multiple things simultaneously can affect the quality of cloud access: A user must give up some of the benefits of the cloud if they want to use numerous Internet connections at once. Good Internet speed is required for cloud accessibility, so if a user is watching video or listening to audio while using the cloud, the experience may be subpar.
4. Security is a must: Your data is secure when stored in the cloud, but for the best experience, support from an IT business is required to maintain comprehensive security. If a person uses the cloud without the right support and information, their data may be exposed to risk.
6. Cost comparison with traditional utilities: Compared with installing software at home the traditional way, software available in the cloud may appear to be the more cost-effective solution. However, cloud software may incur additional fees that do not apply when it is installed on-site or used at home. Because you are charged for extra features, offering a lot of features at once forces customers to weigh their actual requirements against what is freely available in the cloud.
7. Problems for programmers with no hard drive: If there were no need for storage in the system, keeping a hard drive would be a waste of money; programmers, on the other hand, must have a hard drive for their own use.
8. Lack of full support: The dependability of cloud-based services is not perfect. When users experience a difficulty while using a cloud service, they are not assisted in accordance with the fees they are charged; instead, they are forced to rely on FAQs and user manuals, which are inadequate for the issues they are experiencing. Because there is a lack of transparency in the service, users may be hesitant to rely only on cloud services.
9. Incompatibility of software: The incompatibility of programs with cloud applications occasionally causes issues. Some programs, devices, and software are only compatible with personal computers, and when such computers are connected to the Internet they do not fully support that feature.
10. Lack of insight into your network: It is true that cloud computing companies give you access to CPU, application, and disc usage, but this limits the user's understanding to their own network. It is therefore impossible to correct a bug in the code, a hardware issue, or anything else without first identifying the issue.
11. Minimum flexibility: The applications and services run on a remote server. Because of this, businesses employing cloud computing have little control over how the hardware and software work, and the remote software prevents the applications from ever being run locally.
Case Study
Cloud as Infrastructure for an Internet Data Center (IDC)
Internet portals invested a significant amount of money in the 1990s to draw
users. Their market value was dependent on the number of unique “hits,” or visitors,
as opposed to earnings and losses. This approach worked effectively as these portals
started to provide paid services to users as well as advertising options aimed at their
installed user base, raising revenue per capita in a theoretically unlimited growth curve.
Similar to this, Internet Data Centers (IDC) have developed into a strategic plan for
Cloud service providers to draw clients. An IDC would transform into a portal drawing
more applications and more users in a virtuous loop if it reached a critical mass of
people using computer resources and applications.
Two crucial elements will determine how the next generation of IDC develops. The first is the expansion of the Internet. For instance, China had 253 million Internet users at the end of June 2008, with a 56.2% annual growth rate. As a result, Internet carriers need more storage and servers to accommodate consumers' growing demands for traffic and storage capacity on the network. The second is the growth of mobile communication: the total number of mobile phone subscribers in China had passed 600 million by the end of 2008. Server-based computing and storage are driven by the growth of mobile communication, giving customers Internet access to the data and computing services they require.
How can we create a new IDC with core competency in this era of tremendous Internet and mobile communication expansion?
In light of full-service integration of fixed and mobile networks, cloud services provide only a small portion of the revenue in almost all IDCs, while basic collocation services account for the majority of it. For instance, a telecom operator's IDC reports that hosting services account for 90% of its income while value-added services account for only 10%. As a result, customers' demands for load balancing, disaster recovery, data flow analysis, resource utilisation analysis, and other services cannot be met.
Moreover, in many IDCs power costs account for over 50% of operational expenses, and adding additional servers results in a sharp spike in the related power consumption. IDC firms will have to deal with significant growth in power usage as their businesses expand due to the rise in Internet users and enterprise IT transformation. If prompt action is not taken to find appropriate solutions, the high costs will jeopardise the long-term growth of these businesses.
Additionally, as Web 2.0 sites and online games gain popularity, all forms of
content, including audio, video, photos, and games, will require a significant amount
of storage and the appropriate infrastructure to facilitate transmission. As a result, the
demand for IDC services from businesses will steadily climb, and service levels and
resource utilisation efficiency in data centres will be held to higher standards.
The market competitiveness intensifies under the full service operating model that
evolved with the restructuring of telecom operators. Higher standards are placed on
telecom IDC operators as a result of the merging of fixed network and mobile services
since they must quickly roll out new services to keep up with consumer demand.
O
Cloud Computing Provides IDC with a New Infrastructure Solution
Cloud computing gives IDC a solution that takes into account both current development needs and future development objectives. With the cloud, a resource service management system can be created in which physical resources are used as input and virtual resources are produced at the appropriate time, in the right quantity, and with the correct quality. Through cloud computing and virtualization technologies, the resources of IDC centres, including servers, storage, and networks, are combined into a vast resource pool.
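A toy sketch of such a resource pool is given below. The capacities and tenant names are invented, and a real management platform schedules many more dimensions (CPU, memory, storage, network) than this single counter.

```python
class ResourcePool:
    """A toy pool that hands out and reclaims virtual capacity for IDC tenants."""

    def __init__(self, total_units: int):
        self.total_units = total_units
        self.allocations = {}  # tenant -> units currently rented

    def allocate(self, tenant: str, units: int) -> bool:
        free = self.total_units - sum(self.allocations.values())
        if units > free:
            return False  # not enough pooled capacity left
        self.allocations[tenant] = self.allocations.get(tenant, 0) + units
        return True

    def release(self, tenant: str) -> None:
        self.allocations.pop(tenant, None)  # returned capacity becomes available to others

pool = ResourcePool(total_units=100)
print(pool.allocate("web-portal", 40))   # True
print(pool.allocate("online-game", 70))  # False: only 60 units remain
pool.release("web-portal")
print(pool.allocate("online-game", 70))  # True after the release
```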
Using a cloud computing management platform, administrators can dynamically monitor, schedule, and deploy all of the resources in the pool and make them available to users over the network. A single resource management platform improves IDC operation, scheduling effectiveness, and resource utilisation in the centre, and decreases management complexity. Automatic resource deployment and software installation ensure the timely release of new services and can shorten time-to-market. Customers who rent from data centres can use the pooled resources as if they were running on their own dedicated equipment.
Additionally, companies can modify the resources they rent in a timely manner, as required by business development needs, and pay fees based on resource utilisation. Variable charging modes like this make IDC more appealing. IDC growth also benefits from management through a single platform: the existing cloud computing management platform can be expanded to accommodate additional resources when an IDC operator requires them, allowing them to be managed and deployed consistently.
Thanks to cloud computing, software upgrades and the addition of new features and services become a continuous process that can be carried out through the management platform itself.
The Long Tail theory suggests the following: cloud computing creates infrastructure sized for the head of the market and offers plug-and-play technological infrastructure with near-zero marginal management expense in the tail of the market. With flexible costs, it is able to satisfy a wide range of needs. By utilising cutting-edge IT technology in this way, the Long Tail's effect of maintaining low-volume production of varied goods is realised, creating a competitive market economy model.
1. Thanks to cloud computing technology, IDC is versatile and scalable and can exploit the Long Tail effect at a reasonable cost. The cloud computing platform enables the development and introduction of new goods at a low marginal cost of management. As a result, initial expenses for new businesses can be practically eliminated, and the resources are not constrained to a specific class of goods or services. To best exploit the Long Tail, operators can considerably expand their product lines within a given investment scope and provide a variety of services by allocating resources automatically.
2. When business demands are at their peak, the dynamic architecture of cloud computing can deploy resources flexibly. For instance, during the Olympics a huge number of people visit the websites dedicated to the events. To handle the demand on resources during peak hours, the cloud computing platform temporarily deploys additional idle resources. The USOC used the cloud computing tools offered by AT&T to enable competition viewing throughout the Olympics. Other peak periods for resource demand include holidays, exam application and enquiry days, and surges in SMS and phone calls.
3. Cloud computing increases the return on investment for IDC service providers. By increasing the usage and management efficiency of resources, cloud computing technology can lower the cost of computing resources, electricity consumption, and human resources. Additionally, it can shorten the time to market for a new service, helping IDC service providers gain market share.
Additionally, cloud computing offers a cutting-edge pricing method. IDC service providers bill clients depending on the terms of resource rental, and clients only have to pay for what they really utilise. This increases the transparency of charges and may draw in additional clients.
Table: Value comparison on co-location, physical server renting and IaaS for
providers
1. Start-up costs, ongoing expenses, and risks can all be cut. IDC customers are not required to purchase pricey software licences or expensive hardware up front. Instead, they only need to rent the hardware and software resources they actually need, and they pay only usage-related fees. In the age of business informationisation, more and more subject-matter specialists are starting to create their own websites and information systems. With the aid of cloud computing, these businesses can achieve digital transformation with relatively little expenditure and fewer IT staff.
2. A platform for automatic, streamlined, and unified service management can help customers quickly meet their growing resource needs and obtain the resources they need on time. In this way customers can increase their responsiveness to market demands and promote business innovation.
3. IDC users have quicker requirement response times and access to more value-added services. The IDC cloud computing unified service delivery platform enables users to submit customised requests and take advantage of numerous value-added services, and their requests receive a prompt answer.
Table: Value comparison on co-location, physical server renting and IaaS for
users
This IDC places a high value on cloud computing technologies, with the goal of creating a data centre that is adaptable, demand-driven, and quick to respond. It has decided to create a number of cloud centres throughout Europe using cloud computing technology. Virtual SAN and the newest MPLS technologies connect the first five data centres. Additionally, the centre complies with the ISO 27001 security standard, and other security activities required by banks and government institutions, such as auditing functions offered by certified partners, are also implemented.
Figure: IDC cloud
Customers at the IDC's sister sites are served by the main data centre. Thanks to the new cloud computing facility, customers of this IDC will be able to pay for fixed or usage-based variable services via a credit card bill. The management of this hosting facility will eventually extend to even more European data centres.
Summary
●● SaaS cloud computing applications have a lower total cost of ownership. SaaS programmes do not require significant capital expenditure on licences or support infrastructure. The price of cloud computing is constantly changing, and it is becoming difficult for a business to start or grow without using cloud computing. The issues with cloud computing can be reduced with proper study and safety measures. It is undeniably true that cloud computing is one of the rapidly evolving technologies that has astounded the corporate world. The advantages of cloud computing outweigh the drawbacks: even with so many drawbacks, cloud computing continues to gain popularity due to its outstanding benefits, which include low costs, simple access, data backup, data centralisation, sharing capabilities, security, free storage, and speedy testing. The improved dependability and adaptability make the case even stronger.
●● The field of data warehousing has expanded dramatically during the past
ten years. To aid in business decision-making, many firms are either actively
investigating this technology or are using one or more data marts or warehouses.
●● Cloud services are tremendously advantageous for business people engaged in data warehousing, since they do away with the need for technicians to support and manage some of the most coveted new IT technologies, such as highly scalable, variably provisioned systems.
●● Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be quickly provisioned and released with little management effort or service provider interaction.
●● Applications and services made available through the Internet are referred to as
cloud computing. These services are offered via data centres located all over the
world, collectively known as the “cloud”.
●● The most recent development in Internet-based computing is called cloud
computing, and it is a very successful paradigm of service-oriented computing.
An approach to providing simple, on-demand network access to a shared pool
of reconfigurable computer resources or shared services is known as cloud
computing (e.g., networks, servers, storage, applications, and IT services).
●● Benefits of business cloud computing: a) Cost savings, b) Security, c) Business
resilience and disaster recovery, d) Flexibility and innovation.
●● Database-as-a-Service (DBaaS) is a service that supports applications and
is administered by a cloud operator (public or private), freeing the application
team from having to handle routine database management tasks. Application
developers shouldn’t be required to be database professionals or to pay a
database administrator (DBA) to maintain the database if there is a DBaaS
available.
●● Today’s cloud users simply operate without having to consider server instances,
storage, or networking. Cloud computing is made possible through virtualization,
which automates many of the labor-intensive processes associated with
the database’s logical administration, which includes table tuning and query
optimization.
●● Database as a Service (DBaaS) is a technical and operational approach to delivering database functionality at lower cost than traditional enterprise architectures.
●● MaaS enables enterprises to fully leverage the benefits of their converged network by recognising and resolving problems faster, more precisely, less expensively, and with greater visibility than silos alone.
●● Testing is now valuable for software organisations in order to decrease costs and improve services based on customer needs. Testing is critical for increasing user happiness while minimising costs. As a result, firms must invest in people, the environment, and tools while reducing a specific percentage of their budget. The most important thing to remember is that quality should never be compromised at any cost.
●● In this model, test actions are outsourced to a third party that focuses on reproducing real-life test environments based on the needs of the purchaser.
●● TaaS, or Testing as a Service, is an outsourcing model in which software testing is
performed by a third-party service provider rather than by organisation employees.
TaaS testing is performed by a service provider who specialises in mimicking real-
world testing environments and locating defects in software products.
●● Benefits of TaaS: a) Reduced costs, b) Pay-as-you-go pricing, c) Less rote maintenance, d) High availability, e) High flexibility, f) Less-biased testers, g) Data integrity, h) Scalability.
●● Storage as a service (STaaS) is a managed service in which the provider provides
access to a data storage platform to the consumer. The service can be offered on-
premises using infrastructure dedicated to a single customer, or it can be delivered
from the public cloud as a shared service acquired on a subscription basis and
invoiced based on one or more consumption indicators.
●● Advantages of STaaS: a) Storage costs, b) Disaster recovery, c) Scalability, d)
Syncing, e) Security.
●● A mainframe is a robust computer that frequently acts as the primary data store
for an organization’s IT infrastructure. It communicates with users through less
capable hardware, such as workstations or terminals. By consolidating data into a
single mainframe repository, it is simpler to manage, update, and safeguard data
integrity.
●● A fast local area network (LAN) is used to connect a few computer nodes (personal computers employed as servers) that cooperate in computing activities.
●● A cluster is a sort of parallel or distributed computer system that consists of a group of connected, independent computers acting as a single, highly integrated computing resource that combines separate machines, networking, and software in one system.
●● Grid computing is a processing architecture that combines computing resources from various domains to accomplish a main goal. Grid computing allows networked computers to work together on the same task.
●● Privacy rights and obligations cover the gathering, use, disclosure, storage, and destruction of personal data (or personally identifiable information, PII). The accountability of corporations to data subjects and the transparency of their practices around personal information are ultimately what privacy is all about.
●● Despite its benefits, cloud computing raises serious security and privacy problems. Due to the dynamic nature of the cloud computing architecture and the fact that hardware and software components of a single service in this model transcend many trust domains, the move to this model exacerbates security and privacy concerns.
●● Cloud services have gradually become a key feature of modern computers,
smartphones and tablets. Each leading company in the industry offers consumers
its own offer in this segment which significantly intensifies the competition in it.
Cloud technologies are a flexible, highly efficient and proven platform for providing
IT services over the Internet.
●● Benefits of cloud computing: a) Time saving, b) Inferior computer cost, c) Enhanced performance, d) Abridged software costs, e) Immediate software updates, f) Disaster recovery, g) Enhanced document format compatibility, h) Limitless storage capacity, i) Better data reliability, j) Universal document access, k) Newest version availability, l) Easier cluster cooperation, m) Domestic device independence, n) Shrink expenses on technology infrastructure, o) Developing accessibility.
Glossary
●● HDRaaSTM: Hybrid-Disaster-Recovery-as-a-Service.
●● DBaaS: Database-as-a-Service.
●● DBA: Database Administrator.
●● SaaS: Software as a Service.
●● MCS: Main Consistence Server.
●● DCS: Domain Consistence Server.
●● CRM: Customer Relationship Management.
●● PaaS: Platform as a Service.
●● IaaS: Infrastructure as a Service.
●● EC2: Elastic Compute Cloud.
●● MCC: Mobile Cloud Computing.
●● SMBs: Small and Medium-sized Businesses.
●● CSP: Cloud Service Provider.
●● OECD: Organization for Economic Cooperation and Development.
●● GAPP: Generally Accepted Privacy Principles.
●● AICPA: American Institute of Certified Public Accountants.
●● CICA: Canadian Institute of Chartered Accountants.
●● SSE-CMM: System Security Engineering Capability Maturity Model.
●● ENISA: European Network and Information Security Agency.
●● ITIL: Information Technology Infrastructure Library.
●● COBIT: Control Objectives for Information and Related Technology.
●● NIST: National Institute of Standards and Technology.
Check Your Understanding
3. _ _ _ _ _is a term that applies to applications and data storage that are delivered
over the internet or via wireless technology.
a. Web hosting
b. Cloud computing
c. Web searching
d. Cloud enterprise
4. Applications and services are offered via data centres located all over the world,
collectively known as the_ _ _ _ _.
a. Storage
b. Database
c. Data mining
d. Cloud
5. A data warehouse is which of the following?
a. Organised around important subject areas.
b. Can be updated by end users
c. Contains numerous naming convention and formats
d. Contains only current data
6. The following technology is not well-suited for data mining?
c. Data visualization
d. Parallel architecture
7. Which is true about multidimensional model?
b. Internet on Thinks
c. Internet over Things
d. Internet of Things
9. _ _ _ _is a service that supports applications and is administered by a cloud operator (public or private), freeing the application team from having to handle routine database management tasks.
a. DBaaS
b. SaaS
c. PaaS
d. IaaS
10. _ _ _ _is an outsourcing model in which software testing is performed by a third-
party service provider rather than by organisation employees.
a. Database as a Service
b. Testing as a Service
c. Software as a Service
d. Platform as a Service
11. A_ _ _ _is a robust computer that frequently acts as the primary data store for an
organization’s IT infrastructure.
a. LAN
b. WAN
c. Mainframe
d. Cluster computing
12. What is the abbreviation of API?
c. Mainframe
d. None of the mentioned
14. The process of creating a virtual platform, including virtual computer networks, virtual storage devices, and virtual computer hardware, is known as _____.
a. Cloud computing
b. Virtualization
c. Data warehousing
d. None of the mentioned
15. _____ is the delivery of a computing platform and solution stack as a service, without software downloads or installation for developers, IT administrators, or end users.
a. Software as a Service
b. Database as a Service
c. Platform as a Service
d. Infrastructure as a Service
16. Data scrubbing is which of the following?
a. A process to upgrade the quality of data before it is moved into a data warehouse
b. A process to upgrade the quality of data after it is moved into a data warehouse
c. A process to reject data from the data warehouse and to create the necessary indexes
d. A process to load data from the data warehouse and to create the necessary indexes
17. The @active data warehouse architecture includes which of the following?
a. At least one data mart
b. Data that can be extracted from numerous internal and external sources
c. Near real-time updates
d. All of the above
Exercise
1. What are Database as a Service and Governance/Management as a Service?
2. Define the following terms:
a. Testing as a Service
b. Storage as a Service
3. What do you mean by cloud service development?
4. What are the challenges in cloud computing?
5. Define the different layers of cloud computing.
6. Explain the different types of cloud computing and their features.
7. What are the cloud computing security requirements?
8. Describe the pros, cons and benefits of cloud computing.

Learning Activities
1. Do you agree that a typical retail store collects huge volumes of data through its operational systems? Name three types of transaction data likely to be collected by a retail store in large volumes during its daily operations.
Check Your Understanding - Answers
1. c    2. a
3. b    4. d
5. a    6. b
7. c    8. d
9. a    10. b
11. c   12. d
13. a   14. b
15. c   16. a
17. d   18. b
19. c   20. d
Science and Engineering, Bharati M.
7. https://mu.ac.in/wp-content/uploads/2021/01/Cloud-Computing.pdf
8. www.ibm.com/cloud
9. blog.irvingwb.com/blog/2008/07/what-is-cloud-c.html
10. CCIDConsulting, 2008–2009 China IDC Market Research Annual Report
11. https://vmblog.com/archive/2017/11/15/research-shows-nearly-half-of-all-cloud-migration-projects-over-budget-and-behind-schedule.aspx#.YzC-l3ZBzIU
Learning Objectives:
At the end of this topic, you will be able to understand:
●● Design Fact Tables and Design Dimension Tables
●● Data Warehouse Schemas
●● Concept of OLAP, OLAP Features, Benefits
●● OLAP Operations
●● Data Extraction, Clean-up and Transformation
●● Concept of Schemas, Star Schemas for Multidimensional Databases
●● Snowflake and Galaxy Schemas for Multidimensional Databases
●● Architecture for a Warehouse
●● Steps for Construction of Data Warehouses, Data Marts and Metadata
●● OLAP Server - ROLAP
●● OLAP Server - MOLAP
●● OLAP Server - HOLAP
Introduction
Data warehouses are challenging to build. Their design calls for a mindset that is almost the opposite of the way conventional computer systems are created. Their creation necessitates the radical restructuring of vast amounts of data that are frequently of dubious or inconsistent quality and are drawn from many diverse sources. Businesses can only employ a data warehouse effectively if it provides answers to their questions, and to capture those questions one must understand the experts who ask them. The data warehouse design therefore connects corporate technology with business expertise.

The design of the data warehouse will mark the difference between success and failure. Deep knowledge of the business is necessary for a data warehousing design. Data warehousing frequently employs a formal design technique known as dimensional modelling (DM). It differs from, and looks distinct from, the entity-relationship models used for operational systems; it is a technique for enhancing summary-report query efficiency without compromising data integrity. However, the extra storage capacity required to achieve that performance has a price. In general, a dimensional database needs considerably more room than its relational cousin. That expense is becoming less significant, though, as storage costs continue to fall.
Another name for a dimensional model is a star schema. Because it can deliver significantly higher query performance than an E/R model, especially for very big queries, this type of model is very common in data warehousing. Its main advantage, though, is that it is simpler to understand. It typically consists of a sizeable table of facts (referred to as a fact table) and other tables surrounding it that offer descriptive information, known as dimensions. The term comes from the fact that, when it is drawn, it has a star-like appearance.

Dimensional modelling uses a cube operation to represent data, making OLAP data management more suitable for logical data representation. Ralph Kimball created the fact and dimension tables that make up the notion of "dimensional modelling."

The transaction record is separated into "facts," which are often numerical transaction data, and "dimensions," which are the reference data that places the facts in context. For instance, a sale transaction can be broken down into facts like the quantity of products ordered and the cost of those products, and into dimensions like the date of the order, the customer's name, the product number, the ship-to and bill-to addresses, and the salesperson who took the order.
Objectives of Dimensional Modeling
Dimensional modelling serves the following purposes:
●● To create a database design that is simple for end users to comprehend and to write queries against.
●● To make queries as efficient as possible. It accomplishes these objectives by reducing the number of tables and the relationships between them.
Dimensional modelling is a data-structure technique optimised for data warehouse storage. It is used to organise databases for quicker data retrieval. The "fact" and "dimension" tables that make up the dimensional modelling idea were created by Ralph Kimball.
These dimensional and relational models each offer a special method of storing data that has certain benefits.
For instance, normalisation and E/R models in the relational approach eliminate data redundancy. On the other hand, a data warehouse's dimensional model organises data to make it simpler to obtain information and produce reports.
In dimensional data warehouse schemas, the following object types are frequently employed:
The huge tables in your warehouse schema known as "fact tables" are used to record business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables hold the data to be analysed, which is typically additive and quantitative. Examples include revenue, expenses, and earnings.

Figure: A star schema for sales analysis

The relatively static data in the warehouse is stored in dimension tables, which are also referred to as lookup or reference tables. The information that you often use to constrain queries is stored in dimension tables. You can use dimension tables, which are often textual and descriptive, as the result set's row headers. Examples include customers, time periods, suppliers, and products.
Fact Tables
Both columns that hold numerical facts (commonly referred to as measurements) and columns that are foreign keys to dimension tables are common in fact tables. A fact table either includes information at the aggregate level or at the detail level. Summarized fact tables are frequently referred to as summary tables. Facts in a fact table often have the same level of aggregation. Although most facts are additive, some are semi-additive or non-additive. Simple arithmetic addition can be used to combine additive facts; sales are a frequent instance of this. Facts that are non-additive cannot be added at all; averages are one illustration of this. Semi-additive facts can be aggregated along some of the dimensions but not others. Inventory levels are one instance of this, where it is impossible to determine what a level means just by adding it up.

For each star schema, a fact table needs to be defined. From a modelling perspective, the fact table's primary key is typically a composite key made up of all of its foreign keys.

Fact tables provide information on business events for analysis. Fact tables are frequently very large, requiring hundreds of gigabytes or even terabytes of storage and having hundreds of millions of entries. The fact table can be reduced to columns for dimension foreign keys and numeric fact values since dimension tables include records
that explain facts. Normally, the fact table does not store text, BLOBs, or denormalized data. The following (abridged) definition illustrates such a "sales" fact table:

    CREATE TABLE sales (
        ...
        time_id   DATE          CONSTRAINT sales_time_nn NOT NULL,
        ...
        cost      NUMBER(10,2)  CONSTRAINT sales_cost_nn NOT NULL );
Data warehouses that cover several business functions, such as sales, inventory, and finance, require a number of fact tables. There should be a fact table specific to each business function, and there will presumably be some unique dimension tables as well. As was previously covered in "Dimension Tables," all dimensions that are shared by all business processes must express the dimension information uniformly. Typically, each business function has its own schema, which includes a fact table, a number of conforming dimension tables, and a few dimension tables that are specific to that business function. Such function-specific schemas could be deployed as data marts or as a component of the central data warehouse.

Physical partitioning of very large fact tables may be used for implementation and maintenance purposes. Because most data warehouse data is historical, the partition divisions are nearly always along a single dimension, with time being the most frequently used one. For ease of maintenance, OLAP cubes are typically partitioned to match the partitioned fact table portions if fact tables are partitioned. As long as the total number of tables involved does not exceed the maximum for a single query, partitioned fact tables can be treated as a single table with a SQL UNION query.
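As a minimal sketch of this idea, and assuming illustrative quarterly partition tables with the same columns as the logical fact table (the table and column names here are assumptions, not taken from the text), the partitions can be presented to queries as one table:

    -- Hypothetical quarterly partitions of the sales fact table
    CREATE VIEW sales_all AS
        SELECT time_key, item_code, location_id, units_sold, rupees_sold
          FROM sales_2023_q1
        UNION ALL
        SELECT time_key, item_code, location_id, units_sold, rupees_sold
          FROM sales_2023_q2;

    -- Queries then see a single logical fact table
    SELECT location_id, SUM(rupees_sold) AS total_sales
      FROM sales_all
     GROUP BY location_id;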
Dimension tables contain the attributes that describe the facts. Typically, they are textual, descriptive values. You can answer business questions by combining a number of different dimensions with facts. Customers, items, and time are frequently used dimensions. Dimension data is usually collected at the lowest possible level of detail and then combined to create higher-level totals that are better for analysis. These natural rollups or aggregations within dimension tables are called hierarchies.

If the data warehouse has numerous fact tables or provides data to data marts, a dimension table could be used in more than one place. For instance, a product dimension may be employed in the data warehouse, in one or more departmental data marts, and in both sales fact tables and inventory fact tables. If all copies of a dimension, such as customer, time, or product, used across several schemas are identical, the dimension is said to be conformed.
If separate schemas make use of different versions of a dimension table, the data and reports for summarisation will not match. Successful data warehouse architecture depends on the use of conforming dimensions. These are the (abridged) definitions for the "customer" dimension table:

    CREATE TABLE customer (
        cust_id              NUMBER,
        cust_first_name      VARCHAR2(20)  CONSTRAINT customer_fname_nn NOT NULL,
        cust_last_name       VARCHAR2(40)  CONSTRAINT customer_lname_nn NOT NULL,
        cust_sex             VARCHAR2(20),
        cust_street_address  VARCHAR2(40),
        cust_postal_code     VARCHAR2(10)  CONSTRAINT customer_pcode_nn NOT NULL,
        cust_city            VARCHAR2(30),
        cust_state_province  CHAR(2),
        ... );

A roll-up hierarchy can also be declared over a dimension, for example a product hierarchy:

    HIERARCHY prod_rollup (
        product     CHILD OF
        subcategory CHILD OF
        category
    )
A dimension table's records have a many-to-one relationship with the fact table. For instance, several sales of the same product, or multiple sales to the same customer, may occur. The dimension record's attributes comprise extensive, user-oriented textual information, such as the name of the product or the customer's name and address. Attributes provide report labels and query constraints. Coded attributes from an OLTP database need to be translated into descriptions.

For instance, the product category field in the OLTP database might be only a simple integer, but the dimension table ought to hold the text for the category. If it is necessary for maintenance, the code can also be stored in the dimension table. This denormalisation streamlines queries, increases their efficiency, and makes user query tools simpler. However, if a dimension attribute changes regularly, creating a snowflake dimension by assigning the attribute to its own table may make maintenance simpler.
Hierarchies:
A dimension typically contains hierarchical data. The need for data to be grouped and summarised into meaningful information drives the creation of hierarchies. For instance, the hierarchy levels (all time), Year, Quarter, Month, Day or Week are frequently found in time dimensions. Multiple hierarchies may exist within a single dimension; for example, a time dimension frequently has both fiscal-year and calendar-year hierarchies. Geography is typically a hierarchy that puts a structure on sales points, customers, or other geographically spread dimensions rather than being a dimension in and of itself. The following geography hierarchy for sales points is an example: (all), nation, area, state or region, city, store.

The top-to-bottom ordering of levels, from the root (the most general) to the leaf (the most detailed), is specified by level relationships. In a hierarchy, they specify the parent-child relationship between the levels. Hierarchies are also crucial components for enabling more sophisticated query rewrites. When the dimensional dependency between the quarter and the year is understood, for instance, the database can rewrite a query posed at the year level to use data that has already been summarised at the quarter level.
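The HIERARCHY ... CHILD OF syntax shown earlier belongs to Oracle's CREATE DIMENSION statement. As a hedged sketch (the times table and its column names are assumptions, not taken from the text), a calendar hierarchy over a time dimension could be declared as follows, which is what allows the database to rewrite year-level queries against quarter-level summaries:

    -- Assumes a dimension table "times" with one column per hierarchy level
    CREATE DIMENSION times_dim
        LEVEL day      IS times.time_id
        LEVEL month    IS times.calendar_month
        LEVEL quarter  IS times.calendar_quarter
        LEVEL year     IS times.calendar_year
        HIERARCHY cal_rollup (
            day CHILD OF month CHILD OF quarter CHILD OF year
        );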
The fact table is surrounded by dimensions in the dimensional model, which has a star-like structure. The schema is constructed using the following design steps.

Identify the Business Process
The accuracy and reliability of the model are ensured by the dimensional modelling process, which is built on a four-step design methodology. The fundamentals of the design are based on the real business process that the data warehouse should support. As a result, the model's initial step is to describe the business process on which it is based. For instance, this may be a sales scenario in a store. One has the option of describing the business process in plain text, using the Unified Modeling Language (UML) or simple Business Process Modeling Notation (BPMN).

Declare the Grain
Declaring the grain of the model comes after specifying the business process in the design. The grain precisely captures what the dimensional model should be concentrating on. One example of this is "a single line item on a customer slip from a retail establishment." Pick the central process and succinctly describe it in one sentence to help the reader understand what the grain implies. Additionally, it is from the grain (sentence) that you will construct the dimensions and fact table. Due to new knowledge about what your model is meant to deliver, you might find that you need to return to this phase and change the grain.
Identify the Dimensions
Specifying the model's dimensions is the third step in the design process. The dimensions must be established within the grain declared in the second step of the four-step method. The dimensions are the base of the fact table, and they are where the descriptive information is gathered. A dimension is typically a noun, such as date, store, inventory, etc. All the descriptive data is kept in these dimensions; for instance, the date dimension can include information on the year, month, and weekday.

Identify the Facts
Making keys for the fact table is the next stage in the process after establishing the dimensions. Finding the numerical facts that will fill each row of the fact table follows. This stage directly affects the business users of the system, because it is where they access the data that is kept in the data warehouse. As a result, the majority of the columns in the fact table hold numerical, additive values such as quantity or cost per unit.
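To make the four steps concrete, here is a minimal, hypothetical sketch of the star schema that could result from the retail-sales grain described above; all table and column names are assumptions for illustration only:

    -- Step 3: dimensions derived from the grain
    CREATE TABLE date_dim    (date_id    NUMBER PRIMARY KEY, calendar_date DATE, month_name VARCHAR2(10), year_no NUMBER);
    CREATE TABLE store_dim   (store_id   NUMBER PRIMARY KEY, store_name VARCHAR2(40), city VARCHAR2(30));
    CREATE TABLE product_dim (product_id NUMBER PRIMARY KEY, product_name VARCHAR2(40), brand VARCHAR2(30));

    -- Step 4: facts at the grain "one line item on a customer slip"
    CREATE TABLE sales_fact (
        date_id     NUMBER NOT NULL REFERENCES date_dim(date_id),
        store_id    NUMBER NOT NULL REFERENCES store_dim(store_id),
        product_id  NUMBER NOT NULL REFERENCES product_dim(product_id),
        slip_number NUMBER NOT NULL,      -- assumed degenerate transaction number
        quantity    NUMBER(10,2),         -- additive fact
        amount      NUMBER(10,2),         -- additive fact
        PRIMARY KEY (date_id, store_id, product_id, slip_number)
    );

The fact table's key combines its foreign keys, here extended with an assumed slip number so that individual line items remain distinguishable at the declared grain.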
Benefits of Dimensional Modelling
●● The history of the dimensional information is stored in dimension tables.
●● It enables the addition of a completely new dimension without substantially disrupting the existing fact table and schema.
●● The business can easily understand the dimensional model. This model is built on business terminology, so the business knows what each fact, dimension, or attribute means.
●● Denormalised and streamlined dimensional models allow quick data retrieval from an optimised schema. The result is fewer joins and less data redundancy.
●● The performance of queries is also improved by the dimensional model. Because it is more denormalised, querying is optimised.
●● Dimensional models can easily adapt to change. More columns can be added to dimension tables without affecting the business intelligence applications that currently use these tables.
Multi-use dimensions:
Data warehouse design can occasionally be simplified by aggregating a number of small, unconnected dimensions into a single physical dimension, frequently referred to as a junk (or garbage) dimension. By lowering the number of foreign keys in fact table records, this can significantly reduce the size of the fact table. The combined dimension will frequently be prepopulated with the cartesian product of all the dimension values. If the number of discrete values would result in a very big table of all potential value combinations, the table can instead be filled with value combinations as they are encountered during the load or update process.

A typical example of a multi-use dimension is a dimension of customer demographics selected for reporting standardisation. Collecting these attributes into a single dimension removes a sparse text field from the fact table and replaces it with a compact foreign key. Another multi-use dimension might contain helpful textual comments that only occasionally appear in the source data records.
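A minimal sketch of such a junk dimension, prepopulated with the cartesian product of a few hypothetical low-cardinality attributes (none of these names or values come from the text):

    CREATE TABLE order_flags_dim (
        flags_key      NUMBER PRIMARY KEY,
        payment_type   VARCHAR2(10),
        order_channel  VARCHAR2(10),
        gift_wrap_flag CHAR(1)
    );

    -- Prepopulate with the cartesian product of all attribute values
    INSERT INTO order_flags_dim (flags_key, payment_type, order_channel, gift_wrap_flag)
    SELECT ROWNUM, p.val, c.val, g.val
      FROM (SELECT 'CASH' AS val FROM dual UNION ALL SELECT 'CARD'  FROM dual) p
     CROSS JOIN (SELECT 'WEB'  AS val FROM dual UNION ALL SELECT 'STORE' FROM dual) c
     CROSS JOIN (SELECT 'Y'    AS val FROM dual UNION ALL SELECT 'N'     FROM dual) g;

The fact table then carries one compact foreign key (flags_key) instead of several sparse flag columns.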
A data warehouse schema is the overall design blueprint. It explains how the data are organised and how the relationships among them are connected. The names and descriptions of records, together with any associated data items and aggregates, are included in the data warehouse schema. A data warehouse uses the Star, Snowflake, and Fact Constellation types of schema, while a database uses relational models.

The following definitions of the basic terms used in this approach will help you comprehend these schemas:

Dimension
In data warehousing, a "dimension" is a group of references to data concerning a quantifiable event. These events are known as facts and are kept in a fact table. The dimensions are often the entities for which an organisation wants to keep data. A location dimension, for example, may hold the street name, state, and country. Each record (row) of a dimension table has a primary key column that uniquely identifies it. A dimension is a classification framework made up of one or more hierarchies. Dimensions are typically de-normalized tables with potentially redundant data.

The terms normalisation and de-normalization will be used throughout this chapter, so let us quickly review them. A larger table is split up into smaller tables during the normalisation process to remove any potential insertion, update, or deletion anomalies. Data redundancy is eliminated in normalised tables. These tables typically have to be joined to obtain complete information.

De-normalization involves combining smaller tables to create larger ones in order to minimise joining operations. De-normalization is used especially when retrieval is crucial and insert, update, and delete operations are few, as is the case when dealing with historical data in a data warehouse. The data in these de-normalized tables will be redundant. As an illustration, in the case of the EMP-DEPT database, two normalised tables would be EMP (eno, ename, job, sal, deptno) and DEPT (deptno, dname), whereas in the de-normalized case we would have a single table called EMP_DEPT with the attributes eno, ename, job, sal, deptno and dname.

Let us look at the location and item dimensions as indicated in the figure: (a) the location dimension, where location_id is the primary key and its attributes are street_name, city, state_id and country_code; (b) the item dimension, which has item_code as the primary key and the attributes item_name, item_type, brand_name and supplier_id.
The location dimension may be produced from the location detail (location_id, street_name, state_id) and state detail (state_id, state_name, country_code) tables, as illustrated in the figure titled "Normalized view". It is vital to keep in mind that these dimensions may be de-normalized tables.
Measure
The term "measure" refers to values that depend on dimensions, for instance units sold or sales revenue.

Fact Table
A bundle of related data items is referred to as a "fact table." It consists of dimensions and measure values. It follows that the specified dimensions and measures can be used to define a fact table. Foreign keys and measure columns are the two common sorts of columns found in fact tables. As demonstrated in the figure "Representation of fact and dimension tables", foreign keys are connected to dimension tables, and measures are made up of numerical facts. Dimension tables are often smaller in size than fact tables.

The figure shows a sales fact table with item_code and location_id as foreign keys to dimension tables; FK stands for foreign key in this context.

There are many different dimensions to data. Hierarchies are frequently used in dimensions to depict parent-child relationships.
2.3 OLAP Operations
The question of whether OLAP is simply data warehousing in a pretty package comes up rather frequently. Can you not think of online analytical processing as nothing more than a method of information delivery? Is this layer in the data warehouse not just another layer that serves as an interface between the users and the data? OLAP does function somewhat as a data warehouse's information delivery mechanism. OLAP, however, is much more than that. A data warehouse stores data and makes it more easily accessible. An OLAP system enhances the data warehouse by expanding the capacity for information delivery.

2.3.1 Concept of OLAP, OLAP Features, Benefits
OLAP stands for on-line analytical processing. OLAP is a category of software technology through which analysts, managers and executives can gain insight into information by quick, consistent, interactive access to a variety of views of data that have been transformed from raw data to reflect the true dimensionality of the enterprise as understood by the clients.

Business information is analysed multidimensionally using OLAP, which also supports sophisticated data modelling, complex calculations, and trend analysis. OLAP provides the vital framework for intelligent solutions, which include business performance management, planning, budgeting, forecasting, financial reporting, analysis, and similar decision-support applications. Typical application areas include:

Finance
◌◌ Activity-based costing
◌◌ Financial performance analysis
◌◌ Financial modelling

Marketing and Sales
◌◌ Promotion analysis
◌◌ Customer analysis

Production
◌◌ Production planning
◌◌ Defect analysis
There are two primary uses for OLAP cubes. First, a data model that is more understandable to business users than a tabular model must be made available to them. This is known as a dimensional model.

The second goal is to provide quick query responses, which are typically challenging to achieve with tabular models. To achieve this, expensive queries, such as aggregation, joining, and grouping, which are notoriously challenging to run over tabular databases, are pre-calculated. These queries are calculated as part of the OLAP cube's "building" or "processing" operation. By the time end users arrive at work in the morning, the data will have been refreshed as a result of this process.
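As a rough relational analogue of this pre-calculation (assuming a sales_fact table with product_id, region_id and amount columns, which are not defined in the text), standard SQL can materialise the aggregates for every combination of two dimensions ahead of query time; an OLAP server builds comparable summaries when it processes a cube:

    -- Pre-compute aggregates for every (product, region) grouping,
    -- including subtotals and the grand total
    CREATE MATERIALIZED VIEW sales_summary AS
    SELECT product_id,
           region_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS row_count
      FROM sales_fact
     GROUP BY CUBE (product_id, region_id);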
OLAP Guidelines
Dr. E.F. Codd, the "founder" of the relational model, created a list of 12 guidelines and requirements that serve as the basis for selecting OLAP systems:
1) Multidimensional Conceptual View: The OLAP tool should offer users a multidimensional model that corresponds to the business view of data and is intuitive to analyse.
2) Transparency: The OLAP tool's technology, the underlying data repository, computing operations, and the disparate nature of source data should be transparent to users. The users' productivity and efficiency are enhanced by this transparency.
3) Accessibility: The OLAP tool should give access only to the data that is actually necessary to carry out the specific analysis, and give the users a single, coherent, and consistent view. The OLAP system must apply any necessary transformations and map its own logical schema to the various physical data stores. The OLAP operations ought to sit between an OLAP front-end and the data sources (such as data warehouses).
4) Consistent Reporting Performance: As the number of dimensions or the size of the database rises, the users should not notice any appreciable reduction in reporting performance. That is, OLAP performance should not deteriorate as the number of dimensions grows. Every time a specific query is run, users must see a consistent run time, response time, or machine utilisation.
5) Client/Server Architecture: The OLAP tool's server component should be sufficiently intelligent that different clients can be connected with minimal effort and integration code. The server ought to be able to map and combine data from many databases.
6) Generic Dimensionality: Each dimension should be treated equally by an OLAP method in terms of its operational capability and organisational structure. Additional operational capabilities may be granted to selected dimensions, but it should be possible to grant them to any dimension.
7) Dynamic Sparse Matrix Handling: The OLAP tool should adapt the physical schema to best handle sparse matrices in the particular analytical model being built and loaded. To achieve and keep up a constant level of performance when faced with a sparse matrix, the system must be able to dynamically adapt to the distribution of the data and adjust storage and access accordingly.
8) Multiuser Support: Concurrent data access, data integrity, and access security are requirements for OLAP tools.
9) Unrestricted Cross-dimensional Operations: The tool should recognise dimensional hierarchies and automatically perform roll-up and drill-down operations within or across the dimensions.
10) Intuitive Data Manipulation: Fundamental data manipulation includes drill-down and roll-up operations, as well as other manipulations that can be carried out naturally and precisely using point-and-click and drag-and-drop techniques on the cells of the analytical model. It should not require the use of a menu or repeated trips to the user interface.
11) Flexible Reporting: Columns, rows, and cells should be arranged in a way that lets business users easily manipulate, analyse, and synthesise the data.
12) Unlimited Dimensions and Aggregation Levels: There should be no limit on the number of data dimensions. Within any given consolidation path, each of these common dimensions must allow a practically unlimited number of user-defined aggregation levels.
Characteristics of OLAP
The acronym FASMI, used to characterise OLAP methods, is formed from the initial letters of the characteristics, as follows:

Fast
The system is designed to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

Analysis
The system can handle any business logic and statistical analysis relevant to the application and the user, while keeping it easy enough for the target user. Users must be able to define new ad-hoc calculations as part of the analysis and to report on the data in any desired way, without having to program.

Share
Not all functions require the user to write data back, but an increasing number do, so the system should be able to manage multiple updates in a timely, secure manner. It means the system implements all the security requirements for confidentiality and, if multiple write connections are needed, concurrent update locking at an appropriate level.

Multidimensional
The system must provide a multidimensional conceptual view of the data, including full support for hierarchies, because this is unquestionably the most logical way to analyse businesses and organisations.

Information
The system should be able to hold all the data needed by the applications. Data sparsity should be handled efficiently.
Benefits of OLAP
OLAP holds several benefits for businesses:
1. OLAP helps managers in decision-making through the efficient provision of multidimensional record views, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility provided to organised databases.
3. It facilitates the simulation of business models and problems through careful management of analysis capabilities.
4. When used in conjunction with a data warehouse, OLAP can help to speed up data retrieval, reduce query drag, and reduce the backlog of applications.
On-line analytical processing (OLAP) is a category of software technology that allows analysts, managers, and executives to access data quickly, consistently, and interactively, in a range of alternative views that reflect the true dimensionality of the company as seen by the client.

OLAP servers provide business users with multidimensional information from data warehouses or data marts without the users having to worry about how or where the data are stored. Data storage concerns do, however, have to be taken into account in the OLAP servers' physical design and operation.

These various views are materialised via a number of OLAP data cube operations, allowing interactive querying and analysis of the available data. The principal operations are described below.

Slice − The slice operation describes a subcube obtained in order to look at more specific information. You produce it by selecting a single dimension.

Roll-up − The roll-up operation gives the user the ability to summarise data at a more general level of the hierarchy. By climbing the location hierarchy from the level of the city to the level of the country, the roll-up operation illustrated aggregates the data; in other words, the resulting cube groups the data by country rather than by city.

When roll-up is carried out using dimension reduction, one or more dimensions are removed from the given cube. Consider a sales data cube that has just the two dimensions location and time. By eliminating the time dimension, roll-up can be used to aggregate total sales by location alone rather than by location and by time.

Drill-down − Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data, for example by stepping down the time hierarchy from the level of the quarter to the level of the month, so that the cube that is produced analyses the total sales for each month.
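A sketch of these operations in SQL against the sales star schema used in this unit (the column names follow the figures; the time_dim table and its month and quarter columns are assumptions):

    -- City-level detail
    SELECT l.city, SUM(s.rupees_sold) AS total_sales
      FROM sales s JOIN location l ON s.location_id = l.location_id
     GROUP BY l.city;

    -- Roll-up: climb the location hierarchy from city to country
    SELECT l.country_code, SUM(s.rupees_sold) AS total_sales
      FROM sales s JOIN location l ON s.location_id = l.location_id
     GROUP BY l.country_code;

    -- Drill-down on time: from quarter down to month
    SELECT t.quarter, t.month, SUM(s.rupees_sold) AS total_sales
      FROM sales s JOIN time_dim t ON s.time_key = t.time_key
     GROUP BY t.quarter, t.month;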
e
detailed graphs, pictures, lists, charts, and other visual elements. It enables users to
quickly comprehend the data and extract pertinent information, patterns, and trends.
Additionally, it makes the data simple to comprehend.
in
In other words, it can be said that data visualisation is the act of representing data
in a graphical format so that consumers can easily understand how the process of
nl
trends in the data works.
●● Understanding and improving sales: OLAP can assist in identifying the best
products and the most well-known channels for businesses that profit from a wide
O
range of channels for marketing their products. It could be possible to identify the
most lucrative users using several techniques.
For example, If a company wants to analyse the sales of products for every hour
ty
of the day (24 hours), the difference between weekdays and weekends (2 values), and
split regions to which calls are made into 50 region, there is a large amount of record
when considering the telecommunication industry and considering only one product,
communication minutes.
si
●● Understanding and decreasing costs of doing business: One way to improve a firm
is to increase sales. Another way is to assess costs and reduce them as much
r
as possible without impacting sales. OLAP can help with the analysis of sales-
related expenses. It may be possible to find expenses that have a good return on
ve
investment using some strategies as well (ROI).
For example, It may be expensive to hire a good salesperson, but the income they
provide can make the expense worthwhile.
The main OLAP operations are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)

You must prepare the data for storage in the data warehouse once you have extracted it from various operational systems and external sources. The extracted data from several unrelated sources has to be modified, transformed, and made ready in a staging area before it can be loaded.

For the data to be ready, three primary tasks must be completed. The data must be extracted, transformed, and then loaded into the data warehouse storage. A staging area is where these three crucial processes—extracting, transforming, and getting ready to load—take place. A workbench for these tasks is included in the data staging component. Data staging offers a location and a collection of tools with which to organise, integrate, transform, deduplicate, and ready source data for use and storage in the data warehouse.
in
prepare the data first and then transport the data from the various sources into the data
warehouse storage? When we put an operational system into place, it’s likely that we’ll
gather data from various sources, transfer it to the new operational system database,
nl
and then do data conversions. Why is it that a data warehouse cannot employ this
technique? The key distinction is that in a data warehouse, information is gathered from
numerous operating systems as sources. Do not forget that data in a data warehouse
is operational application- and subject-agnostic. Therefore, preparing data for the data
O
warehouse requires a distinct staging space.
Data Extraction: This function must handle several data sources. For each data source, you must use the right approach. Source data may come in a variety of data types from various source machines. A portion of the source data may be in relational database systems. Some data might be stored in older hierarchical and network data models. There may still be many flat-file data sources. Data from spreadsheets and local departmental data sets might also be included. Data extraction can therefore become quite complex.
Tools for data extraction are available on the market. You might wish to consider employing external tools made for particular data sources. You might want to create in-house programmes to perform the data extraction for the other data sources. Buying tools from outside vendors could be expensive initially; in-house programmes, on the other hand, carry their own ongoing development and maintenance costs.

Where do you keep the data after you have extracted it so that you can prepare it further? If your framework allows it, you can conduct the extraction function directly on the legacy platform. Data warehouse implementation teams usually extract the source data into a separate physical environment, from which moving it into the data warehouse is simpler. In the separate environment, you can extract the source data into a collection of flat files, a relational database for data staging, or a combination of both.

ETL, which stands for Extraction, Transformation, and Loading, is the term used to describe the process of taking data from source systems and bringing it into the data warehouse.

The ETL process is technically demanding and requires active participation from many stakeholders, including developers, analysts, testers, and top executives.
ETL consists of three separate phases:

Extraction
●● Extraction involves taking data out of a source system so that it can be used in a data warehouse environment. The ETL process begins with this step.
●● The extraction procedure is frequently one of the most time-consuming parts of ETL.
●● Determining which data has to be extracted can be challenging because the source systems may be complex and poorly documented.
●● To keep the warehouse up to date with all modified data, the data must be extracted repeatedly on a regular basis.
Cleansing
The cleansing stage is a crucial part of a data warehouse technique because it is designed to enhance data quality. Rectification and homogenisation are the main data cleansing functions present in ETL technologies. They use specialised dictionaries to correct spelling errors and identify synonyms, and they use rule-based cleansing to enforce domain-specific rules and define the proper relationships between values. Typical problems are inconsistent and duplicate records: for example, a caller may need to be matched against the enterprise database, but this requires that the caller's name or the name of his or her company be listed in the database; similarly, the same customer may appear in the databases with two or more slightly different names or different account numbers.
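A minimal sketch of dictionary-based rectification and a rule-based check, using hypothetical staging tables and values (none of these names or rules are prescribed by the text):

    -- Dictionary of non-standard names and their standard form
    CREATE TABLE city_synonyms (raw_name VARCHAR2(40), standard_name VARCHAR2(40));
    INSERT INTO city_synonyms VALUES ('Bombay',   'Mumbai');
    INSERT INTO city_synonyms VALUES ('Calcutta', 'Kolkata');

    -- Rectification: rewrite non-standard names in the staging table
    UPDATE customer_stage c
       SET c.city = (SELECT s.standard_name FROM city_synonyms s WHERE s.raw_name = c.city)
     WHERE c.city IN (SELECT raw_name FROM city_synonyms);

    -- Rule-based cleansing: flag rows violating a domain rule (6-digit postal code)
    SELECT *
      FROM customer_stage
     WHERE NOT REGEXP_LIKE(postal_code, '^[0-9]{6}$');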
Transformation
Data conversion is a crucial task in every system implementation. For instance, when you construct an operational system such as a magazine subscription application, you must initially load your database with information from the previous system's records. You might be switching from a manual system, or from a file-oriented system to one that uses relational database tables. In any scenario, you will convert the data from the earlier systems. What, then, makes a data warehouse so special? In what ways is data transformation more complex for a data warehouse than for an operational system?

As you are aware, data for a data warehouse is gathered from a variety of unrelated sources. Data transformation therefore presents even larger hurdles than data extraction for a data warehouse. Another aspect of the data warehouse is that the data flow is more than just an initial load: you will need to keep acquiring the most recent updates from the source systems, and any transformation tasks you create for the first load will also have to be adapted for the subsequent updates.

As part of data transformation, you carry out a variety of distinct actions. You must first cleanse the data you have taken from each source. Cleansing can involve removing duplicates when the same data is brought in from different source systems, resolving conflicts between state codes and zip codes in the source data, or supplying default values for missing data elements.

Standardisation is another large part of data transformation: you standardise data types and field lengths for the same data elements retrieved from different sources, and you eliminate homonyms and synonyms. You resolve synonyms when two or more terms from several source systems have the same meaning; you resolve homonyms when a single term has different meanings across separate source systems.

Combining data from many sources is another type of data transformation. Data from a single source record, or related data elements from several source records, are combined. On the other hand, data transformation also entails purging source data that is not useful and separating out source records into new combinations. In the data staging area, data is sorted and merged on a large scale.
The keys in operational systems are frequently field values with built-in meanings. For instance, the product key value could be a string of characters that encodes the product category, the warehouse where it is kept, and the production batch. Primary keys in the data warehouse should not have such built-in meanings. Another aspect of data transformation is therefore the assignment of surrogate keys derived from the primary keys of the source systems.
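One common way to assign such surrogate keys during transformation is with a database sequence; the staging and dimension table names below are assumptions used only for illustration:

    CREATE SEQUENCE product_key_seq;

    -- Insert only products not yet present, assigning a meaning-free surrogate key
    INSERT INTO product_dim (product_key, source_product_code, product_name)
    SELECT product_key_seq.NEXTVAL,   -- surrogate key used inside the warehouse
           s.product_code,            -- operational ("natural") key kept for traceability
           s.product_name
      FROM staging_products s
     WHERE NOT EXISTS (SELECT 1
                         FROM product_dim d
                        WHERE d.source_product_code = s.product_code);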
A point-of-sale system, for instance, records the unit sales and revenue numbers for every transaction, but you may not need data at that level of detail; you may only want to keep unit sales and revenue in the data warehouse storage and combine the totals by product at each store for a certain day. In these circumstances, an appropriate summarisation would be part of the data transformation process.

After the data transformation function is finished, you have a set of integrated, cleansed, standardised, and summarised data. You are now ready to load the data into each of the data sets in your data warehouse.
Transformation is the fundamental component of the reconciliation phase. It converts records from their operational source format into a specific data warehouse format. If we use a three-tier architecture, this stage produces our reconciled data layer. Typical problems that must be dealt with include:
◌◌ Unstructured text may conceal important information. For instance, "XYZ PVT Ltd" makes no mention of being a Limited Partnership company.
◌◌ Individual data items can be recorded in a variety of formats. A date, for example, can be saved as a string or as three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
◌◌ Conversion and normalisation processes that apply to both storage formats and units of measurement.
◌◌ Matching, which links equivalent fields from different sources.
◌◌ Selection, which reduces the number of source records and fields.
In ETL tools, cleansing and transformation procedures are frequently intertwined.
Loading
The data loading function is composed of two different groups of tasks. The initial loading of the data into the data warehouse storage is done when the design and construction of the data warehouse are finished and it goes online for the first time. This initial move involves large amounts of data and takes a long time. Once the data warehouse is operational, you continue to extract the changes to the source data, transform the data revisions, and feed the incremental data modifications on an ongoing basis.

The process of writing data into the target database is known as the load. During the load step, make sure the load is carried out accurately and with the least amount of resources feasible. There are two loading modes, illustrated in the sketch that follows this list:
1. Refresh: The data in the data warehouse is entirely rewritten, so the older file is overwritten. Refresh is typically combined with static extraction to populate a data warehouse initially.
2. Update: Only the changes made to the source data are applied to the data warehouse. An update is normally performed without erasing or changing previously stored data. This technique, in conjunction with incremental extraction, is used to update data warehouses on a regular basis.
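As a hedged sketch of the two modes (the table names, the MERGE key, and the cust_key_seq sequence are assumptions for illustration only):

    -- Refresh: rewrite the target completely from the staging area
    TRUNCATE TABLE sales_fact;
    INSERT INTO sales_fact SELECT * FROM staging_sales;

    -- Update: apply only the incremental changes without erasing existing data
    MERGE INTO customer_dim d
    USING staging_customers s
       ON (d.source_cust_id = s.cust_id)
     WHEN MATCHED THEN
          UPDATE SET d.cust_city = s.cust_city
     WHEN NOT MATCHED THEN
          INSERT (cust_key, source_cust_id, cust_city)
          VALUES (cust_key_seq.NEXTVAL, s.cust_id, s.cust_city);  -- assumes cust_key_seq exists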
When selecting an ETL tool, the following criteria are useful:
1. An ETL tool should provide coordinated access to multiple data sources. It often includes capabilities for data cleansing, reorganisation, transformation, aggregation, calculation, and automatic loading of the data into the target database.
2. An ETL solution should offer a straightforward user interface that enables point-and-click specification of data cleansing and data transformation rules. Once all mappings and transformations have been specified, the ETL tool should automatically generate the data extract/transform/load procedures, which are normally executed in batch mode.
Star Schema
One of the simplest data warehouse schemas is the star schema. It is known as a star because of the way its points expand outward from a central point. The star schema, in which the fact table is in the centre and the dimension tables are at the points, was represented in the earlier figure of fact and dimension tables. Dimension tables have fewer entries than fact tables, but each record may have many attributes that characterise the fact data. Fact tables typically contain foreign keys to the dimension data along with numerical facts.

In a star schema, fact tables are often in third normal form (3NF), whereas dimension tables are typically in de-normalized form. Although one of the simplest forms, the star schema is still widely used today and is recommended by Oracle. A star schema is depicted visually in Figure: Graphical illustration of Star Schema.
Figure: Star schema for analysis of sales illustrates a company's star schema for sales analysis. In this structure, the four foreign keys that are connected to each of the four corresponding dimension tables form a star with the sales data in the middle. The units_sold and rupees_sold measures are also included in the sales fact table. It must be noted that the dimensions are de-normalized in this case. For instance, the location dimension table, which includes the attribute set {location_id, street_name, city, state_id, country_code}, is kept as a single de-normalized table.

Figure: Star schema for analysis of sales
The star schema has the following characteristics:
●● De-normalization produces data redundancy, which can increase the size of the dimension tables and make loading data into them more time consuming.
●● It is the data warehouse's most widely used and simplest structure, and it is supported by a huge variety of technologies.
●● The star schema offers simpler business reporting logic compared with fully normalised schemas.
●● It performs queries quickly because there are few join operations.

Disadvantages of the star schema:
●● Because the dimension tables are de-normalized, inserts and updates in star schemas may cause data anomalies.
●● When compared with a normalised data model, it is less adaptable to analytical requirements.
●● Because it is designed for a fixed analysis of the data and does not support complex analytics, it is overly specific.
2.4.3 Snowflake and Galaxy Schemas for Multidimensional Databases
The main distinction between the snowflake and star schemas is that while the star schema always has de-normalized dimensions, the snowflake schema may also contain normalised dimensions. The snowflake schema is thus a variation of the star schema that supports normalisation of the dimension tables. In the snowflake schema, some dimension tables are normalised, which splits the data into additional tables. Normalising the dimension tables in a star schema is the procedure known as "snowflaking."

When all the dimension tables are fully normalised, the resulting structure resembles a snowflake with a fact table at the centre. For instance, the item dimension table in Figure (the snowflake schema for analysis of sales) is normalised by being divided into the item and supplier dimension tables.
Figure: Snowflake schema for analysis of sales
The attribute set {item_code, item_name, item_type, brand_name, supplier_id} makes up this item dimension table. Additionally, the supplier_id links to a table called the supplier dimension, which has two attributes: supplier_id and supplier_type.

Similarly, the attribute set {location_id, street_name, city_id} makes up the location dimension table, and the attributes city_id, city_name, state_id and country_code are all included in the city dimension table, which is related through the city_id.

Figure: Snowflake Schema depicts another angle on the concept. Store, city, and region dimensions have been created here by normalising the store dimension. Time has been normalised into three dimensions: time, month, and quarter. Client and client-group dimensions have been created by normalising the client dimension. Additionally, as illustrated in Figure: Snowflake schema, the product dimension has been normalised into product-type, brand, and supplier dimensions.
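A minimal DDL sketch of the snowflaked item and supplier dimensions described above, followed by the extra join this normalisation costs at query time (the data types are assumptions):

    CREATE TABLE supplier (
        supplier_id   NUMBER PRIMARY KEY,
        supplier_type VARCHAR2(30)
    );

    CREATE TABLE item (
        item_code   NUMBER PRIMARY KEY,
        item_name   VARCHAR2(40),
        item_type   VARCHAR2(30),
        brand_name  VARCHAR2(30),
        supplier_id NUMBER REFERENCES supplier(supplier_id)  -- normalised out of item
    );

    -- Querying supplier attributes now needs one more join than in the star schema
    SELECT i.item_name, sp.supplier_type, SUM(s.rupees_sold) AS total_sales
      FROM sales s
      JOIN item     i  ON s.item_code   = i.item_code
      JOIN supplier sp ON i.supplier_id = sp.supplier_id
     GROUP BY i.item_name, sp.supplier_type;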
Figure: Snowflake Schema
The primary distinction between the star/snowflake and fact constellation schemas is that the former always include one fact table while the latter always includes several. The fact constellation is hence also known as the galaxy schema (multiple fact tables being viewed as a group of stars).

A fact constellation schema, in which several fact tables share dimension tables, supports online analytical processing and can be seen as an improvement on the star schema. Consider, for example, the schema shown in Figure (Fact constellation schema for analysis of sales), which comprises two fact tables, namely sales and shipping.

Figure: Fact constellation schema for analysis of sales
The sales fact table of the fact constellation schema is the same as that of the star schema. It has two measures, rupees_sold and units_sold, as well as four attributes: time_key, item_code, branch_code, and location_id (Fact constellation schema for analysis of sales). Additionally, the shipping fact table has two measures, rupees_cost and units_sold, along with five attributes: item_code, time_key, shipper_id, from_location, and to_location. Dimension tables can be shared across fact tables in a constellation schema.

For instance, the fact tables shipping and sales share the dimension tables location, item, and time, as illustrated in Figure (Fact constellation schema for analysis of sales).
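A sketch of a "drill-across" query that combines measures from the two fact tables through the shared location dimension (the column names follow the schema described above; the aggregation level is chosen only for illustration):

    SELECT l.location_id,
           s.total_rupees_sold,
           sh.total_shipping_cost
      FROM (SELECT location_id, SUM(rupees_sold) AS total_rupees_sold
              FROM sales
             GROUP BY location_id) s
      JOIN (SELECT to_location AS location_id, SUM(rupees_cost) AS total_shipping_cost
              FROM shipping
             GROUP BY to_location) sh
        ON sh.location_id = s.location_id
      JOIN location l
        ON l.location_id = s.location_id;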
●● This schema's primary drawback is its intricate design, which results from having to take several aggregation variants into account.

Comparison among Star, Snowflake and Galaxy (Fact Constellation) Schema
2.5 Warehouse Architecture
The architecture is the framework that binds every element of a data warehouse together. Consider the architecture of a school building as an illustration. The building's architecture encompasses more than just its aesthetics; it consists of numerous elements and the arrangement that binds them all together. If you apply this analogy to a data warehouse, the architecture of the data warehouse is made up of all of the different parts of the data warehouse.

Let us assume that the builders of the school building were instructed to make the classrooms spacious. They increased the size of the classrooms but completely removed the offices, giving the school building a flawed architectural design. Where did the architecture go wrong? For a start, not all the required elements were present. Most likely, the positioning of the remaining elements was also incorrect. Your data warehouse's success depends on the right architecture. As a result, we will examine the architecture of the data warehouse and its components.
The architecture of your data warehouse takes into account a number of elements. It primarily consists of the integrated data, which serves as the focal point. Everything required for preparing and storing the data is included in the design. However, it also provides all of the tools for getting data out of your data warehouse. Rules, procedures, and functions that make your data warehouse run and satisfy business needs are also a part of the architecture. Lastly, your data warehouse's technology is included in the design.
The data warehouse architecture is a means of outlining the overall structure of data exchange, processing, and presentation that exists for end-user computing within the organisation. Although every data warehouse is unique, they all share several essential elements.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). These programmes collect detailed information from daily operations.

Applications for data warehouses are created to serve the user's ad-hoc data requirements, or what is now known as online analytical processing (OLAP). These comprise tools for trend analysis, profiling, summary reporting, and forecasting.

In contrast, a warehouse database receives periodic updates from the operational systems, typically after hours. As OLTP data builds up in production databases, it is routinely extracted, filtered, and loaded into a dedicated warehouse server that is open to users. As the warehouse is filled with data, tables must be de-normalized, data must be cleaned of mistakes and duplicates, and new fields and keys must be introduced to reflect the user's needs for sorting, combining, and averaging data.

The figure (Data warehouse architecture) illustrates the three main elements of every data warehouse. These are listed below.
◌◌ Load Manager
◌◌ Warehouse Manager
◌◌ Query Manager
Load Manager
The task of gathering data from the operational systems falls under the purview of the load manager. It also converts the data into a format that can be used going forward. It consists of all the software and application interfaces needed for extracting data from the operational systems, preparing the extracted data, and then loading the data into the data warehouse itself.

It ought to carry out the following duties:
◌◌ Data identification
◌◌ Data validation for accuracy
◌◌ Data extraction from the original source
◌◌ Data cleansing
◌◌ Data formatting
◌◌ Data standardisation (i.e. bringing data into conformity with some standard format)
◌◌ Consolidation of data from multiple sources into one place
◌◌ Establishment of data integrity using integrity constraints
Warehouse Manager
The warehouse manager is the key component of the data warehousing system; it holds a vast amount of data from numerous sources. It arranges the data in such a way that anyone may easily analyse it or locate the necessary information. It serves as the data warehouse's main component, retaining detailed, lightly summarised, and highly summarised levels of information. Additionally, it keeps track of metadata, that is, data about data.

Query Manager
Last but not least, the query manager is the interface that enables end users to access data stored in the data warehouse through the use of specific end-user tools. These tools are referred to as data access or data mining tools. Such solutions, which offer basic functionality and the option to add new features specific to a company, are widely available today. They fall under a number of categories, including data discovery, statistics, and query and reporting.
Single-Tier Architecture
As seen in the figure, the source layer is the only layer that is physically available. Data warehouses are virtual in this approach. This means that the data warehouse is actually a multidimensional representation of operational data produced by particular middleware, or an intermediate processing layer.

Figure: Single Tier

This architecture has a weakness, since it does not separate analytical and transactional processing as required. Analysis queries are submitted to the operational data after the middleware interprets them. This is how queries affect the transactional workload.
Two-Tier Architecture
As seen in the figure, the requirement for separation is crucial in developing the two-tier architecture for a data warehouse system.

Figure: Two Tier

Although it is called a two-layer architecture to highlight the division between the physically available sources and the data warehouse, it in fact consists of four consecutive stages of data flow:
1. Source layer: A data warehouse system makes use of several data sources. The information may originate from an information system beyond the boundaries of the company or be initially housed in legacy databases or internal relational databases.
2. Data staging: The data held in the sources should be extracted, cleaned to remove inconsistencies and fill gaps, and integrated in order to combine disparate sources into a single standard schema. The so-called Extraction, Transformation, and Loading tools (ETL) can extract, transform, clean, validate, filter, and load source data into a data warehouse while merging disparate schemata.
3. Data warehouse layer: The data warehouse serves as one logically centralised repository for information storage. The data warehouse can be accessed directly, but it can also be used as a source for building data marts, which are intended for particular company departments and partially replicate the contents of the data warehouse. Meta-data repositories store information on data staging, users, sources, access procedures, data mart schemata, and so on.
4. Analysis: In this layer, integrated data is accessed efficiently and flexibly to generate reports, analyse data in real time, and model hypothetical business scenarios. It should offer user-friendly GUIs, advanced query optimisers, and aggregate information navigators.
Three-Tier Architecture
The source layer, which includes several source systems, the reconciliation layer,
ni
and the data warehouse layer make up the three-tier architecture (containing both data
warehouses and data marts). In between the data warehouse and the source data is
the reconciliation layer.
U
The reconciled layer’s key benefit is that it produces a uniform reference data
model for an entire company. It also distinguishes between issues with data warehouse
filling and those with source data extraction and integration.
In some instances, the reconciled layer is also used directly to improve how some operational tasks are carried out, such as generating daily reports that cannot be adequately prepared using corporate applications, or creating data flows that periodically feed external processes in order to gain the benefits of cleaning and integration.
Figure: Three Tier
Data warehousing offers the following advantages to an organisation:
◌◌ Competitive advantage: Data warehousing can give businesses an advantage over their rivals. By adopting data warehousing, companies could learn about patterns, untapped knowledge, and facts and figures that were previously unavailable. Such fresh information would improve the calibre of decisions made across the organisation.
◌◌ Cost-effective: Data warehousing allows for organisational streamlining, which lowers overhead and lowers product costs.
◌◌ Improved customer service: Data warehousing offers crucial assistance when dealing with clients, which helps to raise client satisfaction and keep them as clients.
The problems associated with developing and managing data warehousing are as follows:
◌◌ Underestimation of resources for data ETL: The amount of time needed to extract, clean, and load data before warehousing is frequently underestimated by the user. As a result, the schedules of many other processes suffer during implementation, and operations may suffer in the interim.
◌◌ Erroneous source systems: The source systems that feed data into the warehouse often have many unreported issues. Years may pass before these issues are discovered, and after lying undiscovered for many years they may suddenly manifest in embarrassing ways. For instance, after entering information for a brand-new property, one discovers that some fields have allowed null values. In many of these situations, the employee is found to have entered inaccurate data.
◌◌ Required data not captured: Data warehouses usually hold detailed information, yet source systems often omit a few seemingly inconsequential facts that could later be useful for analysis or other tasks. For instance, the date of registration may not be captured in the source system when details of a new property are entered, but it may be very helpful during analysis.
◌◌ Increased end-user queries or demands: From the end user’s perspective, queries never stop. A series of additional questions may arise even after the initial ones have been successfully answered.
◌◌ Loss of information during data homogenization: When combining data from various sources, information loss due to format conversion is possible.
◌◌ Long implementation time: It can take a long time to move from initial design to real deployment, which is why many firms are at first hesitant.
◌◌ Complexity of integration: The performance of a data warehouse can be assessed through its integration capabilities. As a result, an organisation invests a significant amount of effort in figuring out how well the various data warehousing solutions can work together (or integrate), which is a particularly challenging undertaking given the number of options available.
A data warehouse is a single repository for data in which records from various sources are combined for use in online analytical processing (OLAP). This implies that a data warehouse must satisfy the needs of every business stage across the entire organisation. As a result, data warehouse design is a very time-consuming, complex, and error-prone process. Additionally, as business analytical tasks evolve over time, so do the requirements for the systems, so the design process for OLAP and data warehousing systems is continuous.
A different approach than view materialisation is used in the architecture of data warehouses. Data warehouses are viewed as database systems with specific requirements, such as answering management-related queries. The goal of the design is to determine how records should be extracted, transformed, and loaded (ETL) from various data sources in order to be organised in a database acting as the data warehouse.
There are two main approaches to building a data warehouse:
◌◌ the “top-down” approach
◌◌ the “bottom-up” approach
In the “top-down” approach, an enterprise-wide data warehouse is built first, and data marts are then constructed by choosing the data needed for certain business areas or departments. The approach is data-driven because business requirements from the subjects for creating data marts are developed after the information has been obtained and integrated. This approach has the benefit of supporting a single integrated data source; consequently, data marts created from it will be consistent where they overlap.
Top-down design approach
In the “bottom-up” approach, data marts include the lowest grain of data and, if necessary, aggregated data. To suit the data delivery needs of data warehouses, a denormalised model is used for the data marts. The data marts are then joined through conformed dimensions to create a data warehouse, also known as a virtual data warehouse.
The benefit of the “bottom-up” design method is that it has a quick return on investment (ROI), as it requires far less time and effort to create a data mart for a single subject than it does to create an enterprise-wide data warehouse. The chance of failure is also lower. This approach is incremental by definition and enables the project team to develop and learn.
Advantages of bottom-up design
◌◌ Quick document generation is possible.
◌◌ New business units can be added to the data warehouse as needed.
◌◌ It simply involves creating new data marts and merging them with existing
data marts.
The two design approaches can be contrasted as follows:
◌◌ Top-down design: centralised rules and regulation; it may contain superfluous (redundant) information; if implemented iteratively, it can yield speedy results.
◌◌ Bottom-up design: norms and regulations are set by each department; redundant data can be eliminated; there is less chance of failure, a good return on investment, and the approaches used are proven.
Data Marts
The phrase “data mart” refers to a small, localised data warehouse created for a single function; it is used to describe a department-specific data warehouse. It is typically designed to meet the requirements of a department or a group of users inside an organisation.
For example, an organisation may have separate departments such as the finance and IT departments. These departments can each have their own data warehouses, which are nothing more than their respective data marts.
A data mart can be characterised as “a specialised, subject-oriented, integrated, time-variant, volatile data repository.”
Data marts are, put simply, a subset of data warehouses that fulfil the needs of a specific department or business function.
A company may maintain both a data warehouse (which unifies the data from all departments) and separate departmental data marts. Thus, as indicated in the graphic below, a data mart can either be centrally linked to the corporate data warehouse or stand alone (individual).
In many situations, data marts are helpful since they frequently serve as a method for building a data warehouse in a phased or sequential manner in large enterprises. An enterprise-wide data warehouse can be made up of a number of data marts. In contrast, a data mart contains only a subset of the enterprise data, focused on a single department or function.
There are mainly two approaches to designing data marts. These approaches are:
◌◌ Dependent Data Marts
◌◌ Independent Data Marts
Dependent Data Marts
A dependent data mart is a logical subset of a physical subset of a higher data warehouse. In this method, the data marts are regarded as subsets of a data warehouse. The method starts by building a data warehouse, from which further data marts can be made. These data marts depend on the data warehouse and pull the crucial information from it. Because the data warehouse creates the data marts, this method eliminates the need for data mart integration. It is also referred to as a top-down strategy.
Independent Data Marts
The second strategy is independent data marts (IDM). In this case, separate multiple data marts are first constructed, and a data warehouse is then designed using them. This method requires the integration of data marts because each data mart is independently built. As the data marts are combined to create a data warehouse, it is also known as a bottom-up strategy.
There is a third category, known as “Hybrid Data Marts,” in addition to these two. This approach could be useful in a variety of circumstances, but it is especially useful when ad hoc integrations are required, such as when a new group or product is introduced to an organisation.
Designing the schema, building the physical storage, populating the data mart with data from source systems, accessing it to make educated decisions, and managing it over time are the key implementation steps. These steps are described below.
Designing
The data mart process starts with the design step. Initiating the request for a data mart, gathering information about the requirements, and creating the data mart’s logical and physical design are all covered in this step.
The tasks involved are as follows:
1. Initiating the request for the data mart.
2. Gathering information about the requirements.
3. Choosing the right subset of data.
4. Creating the data mart’s logical and physical architecture.
Constructing
In order to enable quick and effective access to the data, this step involves building the physical database and the logical structures related to the data mart.
The tasks involved are as follows:
1. Establishing the physical database and the data mart’s logical structures, such as tablespaces.
2. Creating the schema objects described in the design process, such as tables and indexes.
3. Deciding on the best way to set up the tables and the access structures.
Populating
This step involves obtaining data from the source, cleaning it up, transforming it into the appropriate format and level of detail, and transferring it into the data mart.
Accessing
In this step, the data is put to use through querying, analysis, report creation, chart and graph creation, and publication. The tasks involved are as follows:
1. Create a meta layer (intermediate layer) for the front-end tool to use. This layer converts database operations and object names into business terms so that end users can interact with the data mart using language related to business processes.
2. Establish and maintain database structures, including summarised tables, that facilitate quick and effective query execution through front-end tools.
Managing
This step covers the management of the data mart over its lifespan. The following management tasks are carried out in this step:
3. Improving the system’s performance through optimization.
4. Ensuring data accessibility in case of system faults.
Differences between Data Mart and Data Warehouse
The following traits set data marts and data warehouses apart from one another:
●● Data marts concentrate on the data requirements of a single department rather than the needs of the entire organisation.
●● Data marts do not include detailed information (unlike data warehouses).
●● In contrast to data warehouses, which handle vast amounts of data, data marts are simple to move between, explore, and traverse.
●● Data marts are typically built from a subset of the data held in a corporate data warehouse.
Data marts offer the following advantages:
◌◌ Instead of including numerous irrelevant users, the potential users of a data mart can be organised more effectively.
◌◌ Data marts are created on the premise that they do not need to serve the full company. As a result, a department can independently summarise, select, and organise its own data.
◌◌ Thanks to data marts, each department can focus on a particular segment of historical data rather than the entire data set.
◌◌ Departments can customise the software for their data mart according to their needs.
Limitations of data marts
The following are some drawbacks of data marts:
◌◌ Data integration issues are frequently encountered.
◌◌ Scalability becomes an issue as the data mart grows in several dimensions.
Meta data
For example, over the years, numerous application designers in each branch have made their own choices about how an application and database should be constructed in order to store data. Therefore, naming conventions, variable measures, encoding structures, and physical properties of data will vary amongst source systems. Consider a bank with numerous branches in many countries, millions of clients, and lending and savings as its primary business lines. In such a case, the attribute names, column names, datatypes and values vary greatly between the source systems. By integrating the data into a data warehouse with sound standards, this data inconsistency can be avoided.
In the target data from the aforementioned example, the attribute names, column names, and datatypes are all consistent. This is how the data warehouse accurately integrates and stores data from diverse source systems.
Literally, meta data is “data about data”. It describes the many types of information in the warehouse, where they are kept, how they relate to one another, where they originate, and how they relate to the company. This project aims to address the issue of standardising meta data across diverse products and to use a systems engineering approach to this process in order to improve data warehouse architecture.
A data warehouse consolidates both recent and old data from various operational systems (transactional databases), where the data is cleaned and reorganised to assist data analysis. By developing a metadata system, we hope to construct a design procedure for building a data warehouse. The meta data system we plan to build will provide the basis for organising and designing the data warehouse. This procedure will be used initially to assist in defining the requirements for the data warehouse and will then be used iteratively throughout the life of the data warehouse to update and integrate additional dimensions.
The data warehouse user must have access to correct and current meta data in order to be productive. Without a reliable source of meta data on which to base operations, the analyst’s task is substantially more challenging and the amount of labour necessary for analysis is significantly increased.
A successful design and deployment of a data warehouse and its meta data management depend on an understanding of the enterprise data model’s relationship to the data warehouse. Some particular difficulties that involve trade-offs in a data warehouse’s physical architecture include:
◌◌ Granularity of data - the level of detail held in the unit of data.
◌◌ Partitioning of data - the breakup of data into separate physical units that can be handled independently.
◌◌ Performance issues
◌◌ Data structures inside the warehouse
◌◌ Migration
As a byproduct of transaction processing and other computer-related activities, data has been accumulating in the business sector for years. Most of the time, the accumulated data has simply been used to fulfil the initial set of needs. The data warehouse offers a data processing solution that reflects the integrated information needs of the business. The entire corporate organisation can use it to facilitate analytical processing. Analytical processing examines different parts of the business, or the entire organisation, to find trends and patterns that would not otherwise be visible. The management’s vision for leading the organisation depends heavily on these trends and patterns.
Meta data is as important to the data warehouse as the data itself. To correctly store and utilise the meta data produced by the various systems, efficient tools must be used. The choice of tools for this operation depends on the environment; the quantity and kind of the source systems will also play a role.
The following data must be provided for each source data field:
Source field
Unique identifier, name, type, location (system, object).
A unique identifier is required to avoid any confusion between two fields of the same name from different sources. Name, type and location describe where the data comes from: name is its local field name, type is the storage type of the data, and location identifies both the system the data comes from and the object that contains it.
Destination
A distinctive identifier is required to help prevent any ambiguity between frequently used field names. Name is the identifier that the field will have in the data warehouse’s base data; base data is used to distinguish the original data load destination from any copies. The destination type is the database data type, such as char, varchar, or numeric. The name of the table field is what is used to identify it.
Transformations must then be applied to convert the source data into the destination data.
Transformation(s)
Name serves as a distinctive identifier to set this transformation apart from other, similar ones. The Language attribute contains the name of the language used in the transformation. Module name and syntax are the additional attributes.
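The source, destination, and transformation attributes above can be captured in simple record structures. The following Python sketch is illustrative only; the class and field names are assumptions, not part of any standard meta data tool.

# A minimal sketch of ETL meta data records, assuming the attributes
# described above (identifiers, names, types, locations, transformations).
from dataclasses import dataclass

@dataclass
class SourceField:
    unique_id: str      # avoids confusion between same-named fields from different sources
    name: str           # local field name in the source system
    type: str           # storage type in the source
    system: str         # system the data comes from
    obj: str            # object (table or file) containing the field

@dataclass
class DestinationField:
    unique_id: str
    name: str           # field name in the warehouse base data
    type: str           # database type, e.g. char, varchar, numeric
    table: str          # warehouse table holding the field

@dataclass
class Transformation:
    name: str           # distinguishes this transformation from similar ones
    language: str       # language the transformation is written in
    module: str
    syntax: str         # the transformation expression itself

# Example: meta data describing how a hypothetical source field is loaded.
src = SourceField("CRM.CUST.STATE", "state_cd", "CHAR(2)", "CRM", "CUSTOMERS")
dst = DestinationField("DW.DIM_CUSTOMER.STATE", "state", "varchar", "dim_customer")
tfm = Transformation("upper_state", "SQL", "customer_load", "UPPER(TRIM(state_cd))")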
●● Data Management
To define the data as it is stored in the data warehouse, meta data is necessary. The warehouse manager needs this to be able to monitor and manage all data migration, and it is necessary to describe each object in the database using meta data. To achieve this, each field will need to store its identifying information and, for derived fields, the aggregation applied. Examples of aggregation functions are min, max, average, and sum.
Partitions are subsets of a table; their meta data also requires the key for the partition and the data range it contains.
Partition: table name, partition key (reference identifier), range (range permitted, range contained).
Types of Metadata
Metadata in a data warehouse fall into three major categories:
◌◌ Operational Metadata
◌◌ Extraction and Transformation Metadata
◌◌ End-User Metadata
Operational Metadata
As is well known, the enterprise’s numerous operational systems provide the data for the data warehouse. These source systems use different data structures, and the data items chosen for the data warehouse come in a range of data types and field lengths.
When choosing data from the source systems for the data warehouse, we split records, combine parts of records from several source files, and deal with various coding schemes and field lengths. We must be able to link the information we provide to end users back to the original data sets. All of this information about the operational data sources is contained in operational metadata.
Extraction and Transformation Metadata
Data about the extraction of data from the source systems, such as extraction frequencies, methods, and business rules, are included in the metadata for data extraction and transformation. Additionally, this type of metadata describes all the data transformations that occur in the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouse. It makes it possible for end users to locate data in the data warehouse, and it enables them to search for information using terms that are familiar to them from the business world.
OLAP stands for Online Analytical Processing. Using this software technology, users can simultaneously study data from many database systems. It is based on the multidimensional data model.
One question comes up rather frequently: can you not think of online analytical processing as nothing more than a method of information delivery? Is this layer in the data warehouse not just another layer that serves as an interface between the users and the data? OLAP does function somewhat as a data warehouse’s information distribution mechanism; however, OLAP is much more than that. A data warehouse stores data and makes it more easily accessible; an OLAP system enhances the data warehouse by expanding its capacity for information distribution.
OLAP is implemented with data cubes, on which the following operations may be applied:
◌◌ Roll-up
◌◌ Drill-down
◌◌ Slice and dice
◌◌ Pivot (rotate)
Roll-up
Rolling up is like zooming out on the data cube. Roll-up is used to give users details at a more abstract level. It performs further data aggregation either by climbing up a concept hierarchy for a dimension or by reducing the number of dimensions. The functioning of the roll-up operation is shown in Figure (Working of the Roll-up operation).
Cities are aggregated in this case up to the state level, reducing the dimension. Additionally, the aggregation can be carried out on Time (quarter, year, etc.) or on specific items to group, such as mobile, modem, etc.
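As a rough illustration of roll-up, the following pandas sketch aggregates city-level sales up to the state level. The column names and sample values are hypothetical and only mirror the kind of cube described above.

# A minimal roll-up sketch: aggregate sales from the city level up to the
# state level (climbing the Location concept hierarchy). Data is made up.
import pandas as pd

sales = pd.DataFrame({
    "state":   ["Punjab", "Punjab", "Haryana", "Haryana"],
    "city":    ["Ludhiana", "Amritsar", "Gurugram", "Ambala"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":  [120, 80, 150, 60],
})

# Roll-up: drop the city level and aggregate to state x quarter x item.
rolled_up = sales.groupby(["state", "quarter", "item"], as_index=False)["amount"].sum()
print(rolled_up)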
Drill-down
Drill-down is the opposite of roll-up. It is used to give the user detailed information and is similar to zooming in on the data. By adding a new dimension or descending a concept hierarchy for an existing dimension, it provides more detailed information, switching from less precise to more precise data. The working of the drill-down operation is shown in Figure (Working of the Drill-down operation).
In this example, the time dimension moves down from the quarter level to the month level.
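Continuing the hypothetical sales data from the roll-up sketch above, drill-down can be approximated by descending the time hierarchy from quarter to month; the month column and values are again illustrative assumptions.

# A minimal drill-down sketch: descend the Time hierarchy from quarter to month
# by grouping on the finer-grained column. Data is made up.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Mar", "Jan"],
    "item":    ["Mobile", "Mobile", "Mobile", "Modem"],
    "amount":  [40, 35, 45, 80],
})

# Summary at the quarter level (the less detailed view).
by_quarter = sales.groupby(["quarter", "item"], as_index=False)["amount"].sum()

# Drill-down: include the month level to obtain the more detailed view.
by_month = sales.groupby(["quarter", "month", "item"], as_index=False)["amount"].sum()
print(by_quarter, by_month, sep="\n\n")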
Slice
Slice and dice illustrate how information may be viewed from several angles. The slice operation creates a new sub-cube by choosing a specific dimension member from a given cube. Consequently, a slice is a subset of the cube that represents a single value for one or more dimension members, and it causes a reduction in dimensionality. To choose one dimension from a three-dimensional cube and create a two-dimensional slice, the user performs a slice operation, as shown in Figure (Working of the Slice operation). In this figure, the slice operation uses the criterion time = ‘Q1’ for the dimension ‘time’ and produces a new sub-cube.
Figure: Working of Slice operation
For example, in a university database system a slice operation can be used to obtain a count of the number of students from each state in prior semesters for a certain course, as shown in Figure (The slice operation).
As can be seen in the table below (Result of the slice operation for degree = BE), the information returned is consequently more like a two-dimensional rectangle than a cube.
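A slice can be expressed in pandas by fixing a single member of one dimension, for example time = ‘Q1’, in the hypothetical sales data used in the earlier sketches.

# A minimal slice sketch: fix one dimension member (quarter = 'Q1') to obtain a
# lower-dimensional view of the cube. Data is made up.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "state":   ["Punjab", "Haryana", "Punjab", "Haryana"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "amount":  [120, 60, 130, 70],
})

# Slice on quarter = 'Q1': the time dimension collapses to a single value,
# leaving a two-dimensional state x item view.
slice_q1 = sales[sales["quarter"] == "Q1"].pivot_table(
    index="state", columns="item", values="amount", aggfunc="sum")
print(slice_q1)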
Dice
The dice operation is comparable to the slice operation, but without a reduction in the number of dimensions. The dice operation creates a new sub-cube by choosing two or more dimensions from a given cube. The action of the dice operation is depicted in Figure (Working of the Dice operation).
Here, the dice operation is carried out on the cube based on the three selection criteria listed below:
◌◌ (time = ‘Q1’ or ‘Q2’)
◌◌ (item = ‘Mobile’ or ‘Modem’)
A dice is obtained by performing a selection on two or more dimensions, as mentioned previously.
A dice operation can be used to determine the number of students enrolled in both BE and BCom degrees from the states of Punjab, Haryana, and Maharashtra in the university database system, as illustrated in Figure (Dice operation).
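A dice keeps all dimensions but restricts each selected one to a subset of members; the sketch below applies the two criteria listed above to the same made-up sales data.

# A minimal dice sketch: select sub-ranges on two or more dimensions
# without reducing the number of dimensions. Data is made up.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Phone", "Modem"],
    "state":   ["Punjab", "Haryana", "Punjab", "Haryana", "Maharashtra"],
    "amount":  [120, 60, 90, 40, 70],
})

# Dice on (quarter in Q1, Q2) and (item in Mobile, Modem): all three
# dimensions remain, but each selected one is restricted to chosen members.
dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["item"].isin(["Mobile", "Modem"])]
print(dice)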
Pivot
Rotation is another name for the pivot operation. To provide a different presentation of the data, this operation rotates the data axes in the view; it might entail switching the columns and rows. The figure below shows the pivot operation in action.
Figure: Working of the Pivot operation
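Rotation can be mimicked by transposing a cross-tabulated result, as in this small pandas sketch built on the same hypothetical data used above.

# A minimal pivot (rotate) sketch: swap the row and column axes of a
# cross-tabulated view. Data is made up.
import pandas as pd

sales = pd.DataFrame({
    "state":  ["Punjab", "Punjab", "Haryana", "Haryana"],
    "item":   ["Mobile", "Modem", "Mobile", "Modem"],
    "amount": [120, 80, 150, 60],
})

view = sales.pivot_table(index="state", columns="item", values="amount", aggfunc="sum")
rotated = view.T   # pivot: states become columns, items become rows
print(view, rotated, sep="\n\n")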
OLAP Server
Online Analytical Processing (OLAP) servers are a group of software tools used to analyse data and make business decisions. OLAP offers a platform for simultaneously analysing data from multiple points of view. There are three main types of OLAP server:
◌◌ ROLAP
◌◌ MOLAP
◌◌ HOLAP
2.6.1 OLAP Server - ROLAP
An extended relational database system, in which the fact and dimension tables are maintained as relational tables, is typically used for relational on-line analytical processing (ROLAP). ROLAP servers are used to fill the gap between the client’s front-end tools and the relational back-end server: they use an RDBMS to store and manipulate warehouse data, with OLAP middleware supplying the missing analytical capabilities.
Benefits:
●● It is compatible with both data warehouses and OLTP systems.
●● The underlying RDBMS determines the ROLAP technology’s data size limit; as a result, ROLAP does not place a storage cap on the volume of data that can be kept.
Limitations:
●● SQL has limited functionality.
●● Updating aggregate tables is challenging.
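Since ROLAP works directly against relational tables, a typical interaction is an aggregate SQL query generated for the RDBMS. The table and column names below are hypothetical and follow a star-schema style; the local "warehouse.db" database is likewise an assumption for the sketch.

# A minimal ROLAP-style sketch: the OLAP layer translates a roll-up request
# into an aggregate SQL query against relational fact/dimension tables.
# Table and column names are assumptions for illustration only.
import sqlite3

query = """
SELECT d.state, t.quarter, SUM(f.amount) AS total_sales
FROM   fact_sales f
JOIN   dim_location d ON f.location_key = d.location_key
JOIN   dim_time     t ON f.time_key     = t.time_key
GROUP BY d.state, t.quarter
"""

with sqlite3.connect("warehouse.db") as conn:   # assumed warehouse database
    for state, quarter, total in conn.execute(query):
        print(state, quarter, total)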
2.6.2 OLAP Server - MOLAP
Multidimensional On-Line Analytical Processing (MOLAP) offers multidimensional representations of data through array-based multidimensional storage engines. If the data set is sparse, storage utilisation in multidimensional data repositories may be low.
All array elements are defined in MOLAP, as opposed to ROLAP, which only stores records with non-zero facts; as a result, the arrays tend to be sparse, with empty cells taking up a large proportion of them. Because both storage and retrieval costs must be taken into account when assessing online performance, MOLAP systems frequently include features such as sophisticated indexing and hashing to locate data while running queries and to manage sparse arrays. MOLAP cubes can handle intricate calculations and are ideal for slicing and dicing data, since all calculations are already done when the cube is produced.
Benefits:
●● Well suited to slicing and dicing operations.
●● It outperforms ROLAP when the data is dense.
Limitations:
●● Changing the dimensions without re-aggregating is challenging.
●● Large amounts of data cannot be kept in the cube itself, since all calculations must be done at the time the cube is constructed.
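The array-based storage idea can be sketched with a small NumPy cube: every cell of the dense array exists whether or not a fact was recorded, which is exactly the sparsity trade-off described above. The dimensions and values are illustrative assumptions.

# A minimal MOLAP-style sketch: a dense 3-D array indexed by
# (state, quarter, item). Every cell is materialised, so empty cells
# (zeros) illustrate the sparsity issue discussed above. Data is made up.
import numpy as np

states   = ["Punjab", "Haryana"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
items    = ["Mobile", "Modem"]

cube = np.zeros((len(states), len(quarters), len(items)))
cube[0, 0, 0] = 120      # Punjab, Q1, Mobile
cube[1, 1, 1] = 60       # Haryana, Q2, Modem

# Aggregations become simple array reductions, e.g. roll-up over items:
sales_by_state_quarter = cube.sum(axis=2)
sparsity = 1 - np.count_nonzero(cube) / cube.size
print(sales_by_state_quarter, f"empty cells: {sparsity:.0%}", sep="\n")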
2.6.3 OLAP Server - HOLAP
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP. Compared to pure ROLAP and MOLAP, HOLAP is more scalable and performs computations more quickly. HOLAP servers have the capacity to store large amounts of detailed data. On the one hand, HOLAP benefits from the higher scalability of ROLAP; on the other hand, it uses cube technology for summarised information, which is fast to access. Because detailed data is kept in a relational database, the cubes are smaller than in MOLAP.
Benefits:
●● HOLAP combines the advantages of MOLAP and ROLAP and offers rapid access at all aggregation levels.
Limitations:
●● The HOLAP architecture is exceedingly complex because it supports both MOLAP and ROLAP servers.
Case Study
Designing a dimensional model for a cargo shipper
To examine the process of developing a data warehouse schema, we will use a case study of the XYZ shipping corporation as our example. The entire data warehouse is logically described by a schema or dimensional model. The most basic data warehouse structure that we will consider is a star schema. The entity-relationship diagram of this schema has points radiating out from a central table, giving it the appearance of a star, which is why it is known as a star schema. It is distinguished by one or more large fact tables that hold the majority of the data warehouse’s information and a number of considerably smaller dimension tables, each of which contains details about the entries for a specific attribute in the fact table. We will employ dimensional modelling methods.
Gather Business Requirements
We must understand the needs of the shipping company and the realities of the underlying source data before beginning a dimensional modelling project. Sessions with business representatives help us grasp the demands of the organisation and its key performance indicators, compelling business problems, decision-making procedures, and supporting analytical requirements.
Because we do not fully grasp the business, we cannot develop the dimensional model alone. The necessary dimensional model should be created together with subject matter experts and the firm’s data governance officials.
Understanding the case after gathering data realities
We can clearly define the criteria after gathering them in order to understand the situation. In this instance, the shipping company only provides services to customers who have registered with the business. Bulk products are transported between ports in containers for customers. There may be several intermediate port stops during the journey.
The following details can be found on a shipment invoice: invoice number, date of pick-up, date of delivery, ship-from customer, ship-to customer, ship-from warehouse, ship-from city, ship-to city, shipment mode (such as air, sea, truck, or train), shipment class (codes like 1, 2, and 3, which translate to express, expedited, and standard), contract ID, total shipping charges, deal ID (referring to a discount deal in effect), discount amount, taxes, and total billed amount. The product ID, product name, quantity, ship weight, ship volume, and charge are also included on each invoice line. Additionally, a single invoice will not have several lines for the same product.
Customer size (small, medium, or large), customer ship frequency (low, medium, or high), and credit status (excellent, good, average) are all included in the customer table. These attributes sometimes go hand in hand; for instance, big clients typically ship frequently and have excellent or good credit. Multiple contracts may be held by a single customer, but the total number of contracts is small.
There are not many deals, and some shipments might not qualify for a discount. The business might need to run a query on data such as total revenue broken down by deal. Additionally, deals that are available for a specific product, source city, destination city, and date may need to be identified. Even though it is not mentioned on the invoice, we may nevertheless obtain an estimated date of delivery for each cargo from the operational systems. There is some link between shipment mode and shipment class. Cities and warehouses roll up to subregion, region, and territory, while products roll up to brand and category.
The following queries may be of interest to the business: client type, product category, region, year, shipment mode, discount contract terms, ship weight, volume, delivery delay, and income.
Modeling
Dimensional modelling is done in four steps using the Kimball method, which acts as a guide and looks like this:
Step 1: Select the Business Process
Step 2: Declare the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts
Identification of the dimensions is the third step. The dimensions are shown in the accompanying table.
Finding the facts is the fourth step in the Kimball process. The following figure illustrates the shipment invoice fact table:
Figure: Shipment Invoice Fact Table
Further notes about the dimensional model:
The role-playing dimensions are presented as views in the categories of City, Date, Customer Type, and Customer. The corresponding tables include descriptions of the roles. These role-playing dimensions act as the model’s legal outriggers.
Shipment Mode and Shipment Class have low attribute cardinality and are correlated; therefore, the technique used for the shipment dimension is a Type 4 mini-dimension.
A Deal dimension handles discounts. When a deal is not in place, we will put a specific record in the dimension table with a descriptive string like “No Deal Applicable” to prevent null keys. Discounts are allocated per line item on the invoice.
A Contract dimension is used in conjunction with a bridge table with dual keys to represent the many-to-many link between customers and contracts. The primary key of the bridge table is a composite key made up of the contract ID and customer ID.
A join between a fact table and numerous dimension tables is known as a star
query. A primary key to foreign key join is used to connect each dimension table to
the fact table; however, the dimension tables are not connected to one another. Star
queries are identified by the cost-based optimizer, which also produces effective
execution plans for them.
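The shape of such a star query can be sketched as follows; the table and column names echo the shipment model above but are assumptions, since the case study does not list the physical DDL.

# A minimal star-query sketch for the shipment model: the fact table joins to
# each dimension on a foreign key, and the dimensions are never joined to one
# another. Table and column names are illustrative assumptions.
star_query = """
SELECT c.customer_type,
       p.category,
       d.year,
       SUM(f.total_billed_amount) AS revenue
FROM   shipment_invoice_fact f
JOIN   customer_dim c ON f.ship_to_customer_key = c.customer_key
JOIN   product_dim  p ON f.product_key          = p.product_key
JOIN   date_dim     d ON f.pickup_date_key      = d.date_key
GROUP BY c.customer_type, p.category, d.year
"""
print(star_query)  # would be submitted to the warehouse RDBMS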
We must keep in mind that the invoice line item is our fact grain. As a result, the fact table will not show us the deal that was in effect on a particular day if no products were sold. To enable this functionality for deals, we must add attributes to the Deal dimension. Additionally, because this will be a slowly changing dimension (SCD) with shifting deals, we can use a Type 2 SCD technique: we add Valid-From and Valid-To dates as attributes for deals. Including attributes for the product on which the deal applies also makes it easier to keep track of the necessary deal information for any given date.
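The Valid-From/Valid-To idea for the slowly changing Deal dimension can be illustrated with a tiny lookup routine; the rows, keys, and dates below are invented purely for the sketch.

# A minimal Type 2 SCD sketch for the Deal dimension: each change creates a
# new row with Valid-From/Valid-To dates, so the deal in effect on any date
# can be recovered. Rows and dates are made up for illustration.
from datetime import date

deal_dim = [
    # (deal_key, deal_id, description, valid_from, valid_to)
    (1, "D100", "5% off containers", date(2023, 1, 1), date(2023, 6, 30)),
    (2, "D100", "8% off containers", date(2023, 7, 1), date(9999, 12, 31)),
    (3, "NA",   "No Deal Applicable", date(1900, 1, 1), date(9999, 12, 31)),
]

def deal_in_effect(deal_id, on):
    """Return the dimension row for deal_id that was valid on the given date."""
    for row in deal_dim:
        if row[1] == deal_id and row[3] <= on <= row[4]:
            return row
    return None

print(deal_in_effect("D100", date(2023, 8, 15)))   # picks the 8% version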
Summary
●● Dimensional modelling uses a cube operation to represent data, making OLAP data management more suitable for logical data representation. Ralph Kimball created the concept of “dimensional modelling”, which is made up of fact and dimension tables.
●● A dimensional model is a tool used in data warehouses to read, summarise, and analyse numerical data such as values, balances, counts, and weights. Relational models, on the other hand, are designed for the insertion, updating, and deletion of data in a live online transaction system.
●● Columns holding numerical measures and columns that are foreign keys to dimension tables are common in fact tables. A fact table includes information either at the aggregate level or at the detail level.
●● A dimension is a structure that classifies data and is frequently made up of one or more hierarchies. Dimensional attributes help to explain the dimensional value.
●● A dimension typically contains hierarchical data. The need for data to be grouped and summarised into meaningful information drives the creation of hierarchies.
●● A schema is a logical definition of a database that logically connects fact and dimension tables. Star, snowflake, and fact constellation schemas are used to maintain a data warehouse.
●● In data warehousing, a “dimension” is a group of references to data concerning a quantifiable event. These events are known as facts and are kept in a fact table. The dimensions are often the entities for which an organisation wants to keep data.
●● A bundle of related data pieces is referred to as a “fact table”. It consists of dimensions and measure values; it follows that a fact table can be defined by the specified dimensions and measures.
●● OLAP stands for on-line analytical processing. It is a category of software technology through which analysts, managers, and executives can gain insight into information via quick, consistent, interactive access to a variety of views of data that have been transformed from raw data to reflect the true dimensionality of the enterprise as understood by the clients.
●● To extract data from each source, you must use the right approach. Source data may come in a variety of data types from various source machines. Relational database systems may contain a portion of the original data, and some data might be stored in older hierarchical and network data models.
●● As part of data transformation, you carry out a variety of distinct actions. You must first purge the data you have taken from each source. Cleaning can involve removing duplicates when bringing in the same data from different source systems, resolving conflicts between state codes and zip codes in the source data, or supplying default values for missing data components.
●● The function of data loading is composed of two different groups of jobs. The initial loading of the data into data warehouse storage is done when the design and construction of the data warehouse are finished and it goes online for the first time.
●● The process of writing data into the target database is known as the load. Make sure the load is carried out accurately and with the least amount of resources feasible during the load step.
●● One of the simplest data warehouse schemas is the star schema. It is known as a star because of the way its points radiate outward from a central point.
●● The main distinction between the snowflake and star schemas is that while the star schema always has de-normalised dimensions, the snowflake schema may also contain normalised dimensions. As a result, the snowflake schema is a variation of the star schema that supports dimension table normalisation.
●● The architecture is the framework that binds every element of a data warehouse together. Consider the architecture of a school building as an illustration: the building’s architecture encompasses more than just its aesthetic, consisting of numerous classrooms, offices, a library, hallways, gymnasiums, doors, windows, a roof, and other similar structures.
●● The data warehouse architecture is a means of outlining the total architecture of data exchange, processing, and presentation that exists for end-client computing within the organisation. Online transaction processing (OLTP) is built into production applications such as payroll, accounts payable, product purchasing, and inventory control. These programmes collect thorough information on daily operations.
●● Types of data warehouse architectures: a) single-tier architecture, b) two-tier architecture, c) three-tier architecture.
●● A data warehouse is a single repository for data where records from various sources are combined for use in online analytical processing (OLAP). This suggests that a data warehouse must satisfy the needs of every business stage across the board.
●● OLAP stands for Online Analytical Processing. Using this software technology, users can simultaneously study data from many database systems. It is based on the multidimensional data model.
●● Types of OLAP servers: a) relational OLAP, b) multidimensional OLAP, c) hybrid OLAP.
Glossary
●● DM: Dimensional Modeling.
●● UML: Unified Modeling Language.
●● BPMN: Business Process Modeling Notation.
●● OLAP: On-line Analytical Processing.
●● FASMI: Fast, Analysis, Shared, Multidimensional, Information.
●● ROI: Return on Investment.
●● SQL: Structured Query Language.
●● OLTP: Online Transaction Processing.
●● ETL: Extraction, Transformation, and Loading.
●● 3NF: Third Normal Form.
●● IDM: Independent Data Marts.
●● ROLAP: Relational Online Analytical Processing.
●● MOLAP: Multidimensional Online Analytical Processing.
●● HOLAP: Hybrid Online Analytical Processing.
Check Your Understanding
1. A _ _ _ _ is a tool used in data warehouses to read, summarise, and analyse numerical data such as values, balances, counts, weights, etc.
a. Dimensional model
b. Data structure
c. Data warehouse
d. Data mining
b. Schema
c. Query
d. None of the mentioned
a. Schema
b. Tables
c. Dimension
d. Normalisation
4. A bundle of related data pieces is referred to as a _ _ _ _.
a. Dimension
b. Schema
c. Data mining
d. Fact table
5. The term _ _ _ _ refers to values that depend on dimensions.
a. Measures
b. Fact table
c. Data warehouse
d. None of the mentioned
6. _ _ _ _ is the term used to describe the process of taking data from source systems and bringing it into the data warehouse.
a. OLAP
b. ETL
c. FASMI
d. OLTP
7. In a star schema, normalising the dimension tables is the procedure known as _ _ _ _.
a. Normalisation
b. Denormalisation
c. Snowflaking
d. Clustering
8. The phrase _ _ _ _ refers to a tiny, localised data warehouse created for a single function and is used to describe a department-specific data warehouse.
a. Meta data
b. Media data
a. Designing
b. Constructing
c. Populating
d. Accessing
10. _ _ _ _ involves obtaining data from the source, cleaning it up, transforming it into the appropriate format and level of detail, and transferring it into the data mart.
a. Designing
b. Populating
c. Accessing
d. Constructing
11. _ _ _ _ is the step in which the data is put to use through querying, analysis, report creation, chart and graph creation, and publication.
a. Constructing
b. Populating
c. Accessing
d. None of the mentioned
12. Data about data is termed as _ _ _ ?
a. Media format data
b. Media feature data
c. Media data
d. Meta data
13. A group of software tools known as _ _ _ _ are used to analyse data and make business choices.
a. Online Analytical Processing
d. Roll up
a. Completely normalized
b. Partially normalized
c. Completely denormalized
d. Partially denormalized
17. Data transformation includes which of the following?
a. A process to change data from a summary level to detailed level
b. A process to change data from a detailed level to summary level
c. Joining data from one source into various sources of data
d. Separating data from one source into various sources of data
18. Reconciled data is which of the following?
a. Data stored in the various operational systems throughout the organization.
b. Current data intended to be the single source for all decision support systems.
c. Data stored in one operational system in the organization.
d. Data that has been selected and formatted for end-user support applications.
19. The load and index is which of the following?
a. A process to reject the data in the data warehouse and to create the necessary indexes
b. A process to upgrade the quality of data after it is moved into a data warehouse
c. A process to load the data in the data warehouse and to create the necessary indexes
Exercise
1. Define facts and dimensions; design fact tables and dimension tables.
2. What do you mean by data warehouse schemas?
3. Define OLAP.
4. What are the various features and benefits of OLAP?
5. Explain the following:
a. Data extraction
b. Clean-up
c. Transformation
6. What do you mean by star, snowflake and galaxy schemas for multidimensional databases?
7. Define the architecture for a warehouse.
8. Define the various steps for the construction of data warehouses, data marts and metadata.
9. What do you mean by the OLAP servers - ROLAP, MOLAP and HOLAP?
Learning Activities
1. As a senior analyst on the project team of a publishing company exploring the options for a data warehouse, make a case for OLAP. Describe the merits of OLAP and how it will be essential in your environment.
Check Your Understanding - Answers
1 a 2 b
3 c 4 d
5 a 6 b
7 c 8 d
9 a 10 b
11 c 12 d
13 a 14 b
15 c 16 a
17 b 18 b
19 c 20 d
Learning Objectives:
At the end of this topic, you will be able to understand:
●● Advancements to Data Mining
●● Data Mining on Databases
●● Data Mining Functionalities
●● Objectives of Data Mining and the Business Context for Data Mining
●● Data Mining Process Improvement
●● Data Mining in Marketing
●● Data Mining in CRM
●● Tools of Data Mining
Introduction
Although the phrase “we are living in the information age” is common, we are actually living in the data age. Every day, terabytes or petabytes of data from business, society, science and engineering, medicine, and nearly every other element of daily life flood our computer networks, the World Wide Web (WWW), and numerous data storage devices. The rapid development of effective data gathering and storage methods, as well as the computerization of our society, are behind this tremendous expansion of data. Businesses with thousands of locations around the world, like Wal-Mart, conduct hundreds of millions of transactions per week. Petabytes of data are continuously produced by scientific and engineering procedures, including remote sensing, process measurement, scientific experiments, system performance, and engineering observations.
Data mining converts a sizable data collection into knowledge. Every day, hundreds of millions of searches are made using search engines like Google. Every query can be seen as a transaction in which the user expresses a demand for information. What innovative and helpful information can a search engine get from such a vast database of user queries gathered over time? Interestingly, some user search query patterns can reveal priceless information that cannot be learned from individual queries alone.
Google’s Flu Trends, for instance, uses particular search phrases as a gauge of flu activity. It discovered a strong correlation between the number of people who actually have flu symptoms and those who look for information relevant to the illness. When all of the flu-related searches are combined, a pattern becomes apparent. Flu Trends can predict flu activity up to two weeks sooner than conventional methods by using aggregated Google search data. This illustration demonstrates how data mining may transform a sizable data set into knowledge that can assist in resolving a current global concern.
The market has been overrun with different products from hundreds of suppliers during the last five years. Data modelling, data gathering, data quality, data analysis, metadata, and other aspects of data warehousing are all covered by vendor solutions and products. No fewer than 105 top products are highlighted in the Data Warehousing Institute’s buyer’s guide. The market is already enormous and is still expanding.
You have almost surely heard of data mining. Most of you are aware that the technology plays a role in knowledge discovery, and some of you may be aware of the use of data mining in fields like marketing, sales, credit analysis, and fraud detection. You are all aware, at least in part, that data mining and data warehousing are related. Almost all corporate sectors, including sales and marketing, new product development, inventory management, and human resources, use data mining.
The definition of data mining has possibly as many variations as it has supporters and suppliers. Some experts include a wide variety of tools and procedures in the definition, ranging from straightforward query protocols to statistical analysis. Others limit the definition to methods of knowledge discovery. Although not strictly necessary, a functional data warehouse will give the data mining process a useful boost.
3.1 Understanding Data Mining
Let’s try to comprehend the technology in a business context before offering some formal definitions of data mining. Data mining delivers information, much like all other decision support systems. Please refer to the figure below, which illustrates how decision support has evolved. Take note of the first strategy, when simple decision support reports were produced from operational systems; data warehouses then began to take over as the main and most beneficial source of decision support information in the 1990s.
OLAP tools became accessible for more complex analysis. Up until this point, the strategy for gathering information had been driven by the users. But no one can use analysis and query tools alone to find valuable trends because of the sheer amount of data.
For instance, it is nearly impossible to think through all the potential linkages in marketing analysis and to discover insights by querying and digging further into the data warehouse. You require a solution that can forecast client behaviour by learning from prior associations and outcomes. You need a tool that can conduct knowledge discovery on its own. Instead of a user-driven strategy, you want a data-driven strategy. At this point, data mining steps in and takes over from the users.
Figure: Decision support progresses to data mining.
Building and using a data warehouse is known as data warehousing. Data from several heterogeneous sources is combined to create a data warehouse, which facilitates analytical reporting, structured and/or ad hoc queries, and decision-making. Data consolidation, data integration, and data cleaning are all components of data warehousing.
Data warehousing makes historical data available electronically by separating it from operational systems and making it accessible for ad hoc queries and scheduled reporting. In contrast, creating a data warehouse requires creating a data model that can produce insights quickly.
A data warehouse is organised differently from an operational database: it is set up such that pertinent data is grouped together to make daily operations, analysis, and reporting easier. This aids in identifying long-term trends and enables users to make plans based on that knowledge. Thus, it is important for firms to use data warehouses.
Figure: Data Warehouse Architecture
https://www.astera.com/type/blog/what-is-data-warehousing/
In the early stages, four significant factors drove many companies to move into data warehousing:
◌◌ Fierce competition
◌◌ Government deregulation
◌◌ Need to revamp internal processes
◌◌ Imperative for customized marketing
The first industries to use data warehousing were telecommunications, finance, and retail, primarily because of government deregulation in banking and telecoms. Because of the increased competitiveness, retail businesses also shifted to data warehousing, and utility firms joined the group as that industry was deregulated.
(Source: https://anuradhasrinivas.files.wordpress.com/2013/03/data-warehousing-fundamentals-by-paulraj-ponniah.pdf)
In its early phases, only the largest global firms used data warehousing. Building a data warehouse was expensive, and the available tools were not quite sufficient; only big businesses had the money to invest in the new paradigm. Now that smaller and medium-sized businesses can afford the cost of constructing data warehouses or purchasing turnkey data marts, we are starting to notice a change. Data warehouses can now be built with the database management systems (DBMSs) you have previously employed. The database vendors have now included tools to help you create data warehouses using these DBMSs, as you will discover. The cost of packaged solutions has decreased, and operating systems are now reliable enough to support data warehousing operations.
Although earlier data warehouses focused on preserving summary data for high-level analysis, businesses are increasingly building larger and larger data warehouses. Companies now collect, purify, maintain, and exploit the enormous amounts of data produced by their economic activities. The amount of data stored in data warehouses is now on the terabyte scale; in the telecommunications and retail industries, data warehouses housing many terabytes of data are not unusual.
Take the telecommunications sector, for instance. In a single year, a telecoms operator produces hundreds of millions of call-detail transactions. The corporation must examine these specific transactions in order to market the right goods and services, so the company’s data warehouse must keep data at the most basic level of detail. Consider, in a similar way, a chain of stores with hundreds of locations: each store produces thousands of point-of-sale transactions each day. Another example is a business in the pharmaceutical sector that manages thousands of tests and measures to obtain government product approvals. In these sectors, data warehouses are frequently very large.
Multiple Data Types
If you’re just starting to develop your data warehouse, you might only include numeric data. However, you will quickly understand that simply including structured numerical data is insufficient; be ready to take into account more data types.
Traditionally, data warehousing dealt with structured data, but the line between structured and unstructured data is getting blurrier. For instance, the majority of marketing data is structured data presented as numerical values.
Unstructured data in the form of pictures is also present in marketing data. Let’s imagine that a decision-maker is conducting research to determine the most popular product categories. During the course of the analysis, the decision maker settles on a particular product type. In order to make additional judgements, he or she would now like to see pictures of the products of that type. How is this possible? Companies are learning that their data warehouses need to integrate both structured and unstructured data.
What kinds of data fall under the category of unstructured data? As seen in the figure below, the data warehouse must incorporate a variety of data types to better support decision-making.
Figure: Data warehouse: multiple data types.
Adding Unstructured Data: Some suppliers are addressing the incorporation of unstructured data, particularly text and images, by treating such multimedia data as merely another data type. These are classified as relational data and are kept as binary large objects (BLOBs) with a maximum size of 2 GB, and they are defined as user-defined types (UDTs) manipulated through user-defined functions (UDFs).
It is not always practical to store BLOBs as just another relational data type. A video clip, for instance, needs a server that can send several streams of video at a set rate and synchronise them with the audio component. Specialised servers are being made available for this purpose.
Searching Unstructured Data: Your data warehouse has been improved by the addition of unstructured data. Is there anything else to do? Of course: integrating such data is largely useless without the capacity to search it. Vendors are therefore increasingly offering new search engines to help users find the information they need in unstructured data. Querying by image content is an example of an image search technique.
For free-form text data, retrieval engines pre-index the textual documents in order to support word, character, phrase, wild card, proximity, and Boolean searches. Some search engines are also capable of searching for equivalent words; a search for the word mouse, for example, will also turn up documents containing the word mice.
Searching audio and video data directly is still at the research stage. Usually, these are described with free-form text and then searched using the textual search methods that are currently available.
Spatial Data: Imagine one of your key users, perhaps the Marketing Director, is online and accessing your data warehouse to conduct an analysis. “Show me the sales for the first two quarters for all products compared to last year in store XYZ,” requests the Marketing Director. After going over the findings, he or she considers two additional questions. What is the typical income of the people who live in the store’s neighbourhood? How far, on average, did those people drive to get to the store? These queries can only be addressed if spatial data is included in your data warehouse.
Adding spatial data will significantly increase the value of your data warehouse. Examples of spatial data include an address, a street block, a city quarter, a county, a state, and a zone. Vendors have started addressing the requirement for spatial data; in order to combine spatial and business data, some database providers offer spatial extenders to their products via SQL extensions.
Data mining is a technique used with the Internet of Things (IoT) and artificial intelligence (AI) to locate relevant, potentially useful, and intelligible information, identify patterns, create knowledge graphs, find anomalies, and establish links in massive data. This procedure is crucial for expanding our understanding of a variety of topics that deal with unprocessed data from the web, text, numbers, media, or financial transactions.
By fusing several data mining techniques for use in financial technology and cryptocurrencies, the blockchain, data science, sentiment analysis, and recommender systems, its application domain has grown. Additionally, data mining benefits a variety of real-world industries, including biology, data security, smart grids, and smart cities, and it helps maintain the privacy of health data analysis and mining. Investigating new developments in data mining that incorporate machine learning methods and artificial neural networks is also required.
Machine learning and deep learning are undoubtedly two of the areas of artificial intelligence that have received the most research in recent years. Due to the development of deep learning, which has provided data mining with previously unheard-of theoretical and application-based capabilities, there has been a significant change over the past few decades. This topic examines both theoretical and practical applications of knowledge discovery and extraction, image analysis, classification, and clustering, as well as FinTech and cryptocurrencies, the blockchain and data security, privacy-preserving data mining, and many other areas.
The DM process arose from the enormous amount of data that must be managed more easily in sectors like commerce, the medical industry, astronomy, genetics, or banking. Additionally, the exceptional progress of hardware technologies resulted in a large amount of storage capacity on hard drives, which in turn gave rise to numerous issues with handling enormous amounts of data. Of course, the Internet’s rapid expansion is the most significant factor in this situation.
nl
The core of the DM process is the application of methods and algorithms to find and extract patterns from stored data, although the data must first be pre-processed before this stage. It is well known that DM algorithms alone do not yield satisfactory results. Finding useful knowledge in raw data therefore requires the sequential application of the following steps: developing an understanding of the application domain; creating a target data set by intelligently selecting data, focusing on a subset of variables or data samples; data cleaning and pre-processing; data reduction and projection; choosing the data mining task; choosing the data mining algorithm; the data mining step itself; and interpreting the results.
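As a concrete illustration of this sequence, the following Python sketch walks a toy data set through selection, cleaning, transformation and a simple mining step. The file name, column names, cluster count and choice of clustering as the mining task are all hypothetical assumptions, not something prescribed by the text; pandas and scikit-learn are assumed to be available.

# Hypothetical walk-through of the KDD steps on a small customer table.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Target data set: focus on a subset of variables (selection).
raw = pd.read_csv("customers.csv")                      # hypothetical file
target = raw[["age", "income", "visits_per_month"]]

# 2. Cleaning / pre-processing: drop incomplete records, remove duplicates.
clean = target.dropna().drop_duplicates()

# 3. Reduction and projection: scale features to a comparable range.
X = StandardScaler().fit_transform(clean)

# 4. Data mining step: here the chosen task is clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 5. Interpretation: profile each discovered group for the business user.
print(clean.assign(cluster=labels).groupby("cluster").mean())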
Regression is learning a function that maps a data item to a real-valued prediction variable. Clustering is the division of a data set into subsets (clusters). Association rules determine implication rules for a subset of record attributes. Summarization involves methods for finding a compact description for a subset. Classification is learning a function that maps a data item into one of several predefined classes.
DM covers a wide range of academic disciplines, including artificial intelligence, machine learning, databases, statistics, pattern identification in data, and data visualisation. Finding trends in the data and presenting crucial information in a way that the average person can understand are the main objectives here. For ease of use, it is advised that the information acquired be simple to understand. Deriving high-level knowledge from low-level data is the overall goal of the process.

Data mining (DM) began to emerge as a distinct field around 1989. At the time, there weren't many data mining tools available, and each addressed only a single problem; the C4.5 decision tree technique (Quinlan, 1986), the SNNS neural network, and parallel coordinate visualisation (Inselberg, 1985) are among the examples. These techniques required significant data preparation and were challenging to use.
The next generation of tools added support for the data cleaning and preprocessing stages. Users could carry out a number of discovery activities (often classification, clustering, and visualisation) using programmes like SPSS Clementine, SGI MineSet, IBM Intelligent Miner, or SAS Enterprise Miner. These programmes also offered data processing and visualisation. A GUI (Graphical User Interface), pioneered by Clementine, was one of the most significant developments, since it allowed users to construct their knowledge discovery process visually.

In 1999, more than 200 tools were available for handling various tasks, but even the best of them only handled a portion of the whole DM architecture. Preprocessing and data cleaning were still required. Data-mining-based "vertical solutions" have emerged as a result of the growth of these types of applications in industries like direct marketing, telecommunications, and fraud detection. The systems HNC Falcon for credit card fraud detection, IBM Advanced Scout for sports analysis, and the NASD DM Detection system are the best examples of such applications.
More recently, parallel and distributed approaches to DM have begun to appear. In parallel DM, data sets are distributed to high-performance multi-computer machines for analysis. Although these types of machines are becoming more widely available, all algorithms that were designed for single-processor units must be scaled in order to run on parallel computers. The parallel DM technique is appropriate for transaction data, telecom data, and scientific simulation. Distributed DM must offer local data analysis solutions as well as global methods for recombining the local results from each computing unit without requiring a significant amount of data to be transferred to a central server. Grid technologies combine distributed parallel computing and DM.
Neural networks can be used to complete several DM jobs. Since the human brain serves as the computing model for these networks, neural networks should be able to learn from experience and change in reaction to new information. Neural networks can find previously unidentified links and learn complicated non-linear patterns in the data when exposed to a collection of training or test data.

Our brain's capacity to differentiate between two items is one of its most crucial abilities; this capability of differentiating between what is friendly and what is dangerous has allowed numerous species to evolve successfully. Neural networks can generalise in much the same way: after seeing a few examples, they can learn to extrapolate from them in order to apply the information to unrelated issues or situations. This capacity for generalisation is one of the benefits of adopting neural network technology.
The Hebbian learning theory, which holds that information is stored through associations with relevant memories, was validated by neurophysiologists. An intricate network of concepts with semantically similar meanings exists in the human brain. According to the hypothesis, when two neurons in the brain are engaged simultaneously, their connection grows stronger and the physical properties of their synapse are altered. The associative-memory branch of the neural network field focuses on developing models that capture this associative activity, for example neural networks like Binary Adaptive Memories and Hopfield networks. Different neural models, including back propagation networks, recurrent back propagation networks, self-organizing maps, radial basis function networks, adaptive resonance theory networks (ART), probabilistic neural networks, and fuzzy perceptrons, are used to carry out these DM procedures. Certain DM procedures are available to any of these structures.
Businesses that were late to implement data mining are rapidly catching up with the others. Making crucial business decisions frequently involves using data mining to extract key information, and data mining is predicted to become as commonplace as some of today's everyday technologies. Among the major trends anticipated to influence the direction of data mining are the following.

Multimedia data mining is one of the most recent techniques gaining acceptance, owing to its increasing capacity to accurately capture important data. It involves extracting data from several types of multimedia sources, including audio, text, hypertext, video, pictures, and more; the extracted data is then transformed into a numerical representation in a variety of formats. Using this technique, you can create categories and clusters, run similarity tests, and find relationships.

Distributed data mining, in which data is spread across corporate sites or within multiple companies, is also becoming more and more popular. Highly complex algorithms are used to collect data from various sources and produce improved insights and reports based on it.
KDD stands for Knowledge Discovery in Databases; it is another term for data mining.

It is essentially a group of procedures that aid in the gathering of intelligence. Simply put, it makes it easier to take rational decisions that will help you reach your goals.

This method is really helpful when you make decisions using data science. You filter through a large volume of datasets and then select a few relational, business-related patterns. Such informative patterns can be found through analysis; they serve as a foundation for further analysis and better decision-making, and they make it simple to forecast which trend is most likely to catch on.
Data now arrives in many forms, such as text, audio, movies, etc. We must carefully comprehend what it says, and the knowledge discovery technique works best for this objective; in fact, it facilitates decision-making.

This procedure aids in the efficient analysis of the data gathered. The output of this analysis is referred to as business intelligence. Undoubtedly, real-time analytics applications and in-depth examination of previous data are necessary to arrive at workable answers. With their assistance, you can simply hear the voice of the data.

In a word, it aids in corporate strategy development and operational management. Marketing, advertising, sales, and customer service are all involved in these two objectives. It can also be used to refine the functions of manufacturing, supply chain management, finance, and human resources. In other words, you can use it to improve various industries and their processes.
A significant amount of data is needed for the process of knowledge discovery, and that data must first be in a reliable state in order to be used in data mining. The ideal source of data for knowledge discovery is the aggregation of enterprise data in a data warehouse that has been adequately vetted, cleaned, and integrated. The warehouse is likely to include not only the depth of data required for this step in the BI process, but also the necessary historical data. Having the previous data available for testing and evaluating hypotheses makes the warehouse all the more beneficial, because a lot of data mining relies on using one set of data to train a process that can then be applied to another. The combination of this wealth of data and analytics algorithms, through direct integration with existing platforms, may have produced the "perfect storm" for the development of large-scale "big data" analytics applications. As these applications reach a certain level of maturity, they will be able to manage large-scale applications that access both structured and unstructured data from several sources while streaming it at varying rates.
3.2.1 Data Mining on Databases
Step 1: Data Collection

You require some data for any research. The research team locates pertinent tools to extract, convert, and process data from various websites or big data settings like data lakes, servers, or warehouses. Both structured and unstructured files may be present. Data migration issues and a database's inability to demonstrate file format compliance might occasionally create errors. To avoid incorrect or erroneous data entry, such errors should be eliminated at the point of entry.
Step 2: Preparing Datasets

Investigate the datasets after collection and, as part of pre-processing, create their profiles. This is, in fact, the transformation step: all records are cleaned up, kept consistent, and anomalies are eliminated here. In this manner, resources are provided for efficient data analysis; the remaining error-fixing is then done in the next step. This is the method used by Eminenture's data mining specialists.

Here, cleaning refers to eliminating noise and duplication from the collected data. Noise in the database refers to duplicates, corrupt or incomplete records, and other anomalies. Typical cleaning operations (sketched in code below) include:
◌◌ De-duplication
◌◌ Data appending
◌◌ Normalization
◌◌ Typo removal
◌◌ Standardization
After cleaning, the data is then:
◌◌ Extracted
◌◌ Transformed
◌◌ Loaded for deep analysis
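The cleaning operations listed above map naturally onto a few pandas calls. The sketch below is only illustrative; the input file, column names and the specific typo corrections are hypothetical, and real preparation work is usually far more involved.

# Illustrative data-preparation step: de-duplication, standardisation and noise removal.
import pandas as pd

df = pd.read_csv("collected_records.csv")                     # hypothetical input

df = df.drop_duplicates()                                     # de-duplication
df["name"] = df["name"].str.strip().str.title()               # standardisation of text fields
df["country"] = df["country"].replace(
    {"Indai": "India", "U.S.": "USA"}                         # simple typo / variant corrections
)
df["income"] = pd.to_numeric(df["income"], errors="coerce")   # coerce noisy values to NaN
df = df.dropna(subset=["income"])                             # drop records that remain incomplete

df.to_csv("prepared_records.csv", index=False)                # loaded for deeper analysis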
Analysis is the process of delving deeply into insights. Datasets are processed here by filtering; they are carefully scrutinised to see whether they are helpful and an ideal fit for processes such as the following (two of which are compared in the sketch below):
◌◌ Neural networks
◌◌ Decision trees
◌◌ Naïve Bayes
◌◌ Clustering
◌◌ Association
◌◌ Regression, etc.
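To give an idea of how a data set might be checked for its fit with these techniques, the sketch below compares two of them, a decision tree and Naïve Bayes, with simple cross-validation on scikit-learn's built-in Iris data. It is only an illustrative sketch; in practice the candidate methods and the evaluation metric depend on the task at hand.

# Quick comparison of two candidate mining methods via 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

for name, model in [("Decision tree", DecisionTreeClassifier(random_state=0)),
                    ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")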
In this process, data is transformed into a format and structure that will be exactly the same for further data mining. Sometimes data is available as PDFs or in many other formats, and it is essential to digitise it. As a result, the following actions are taken:
◌◌ Data mapping, through which elements are assigned from sources to destinations
◌◌ OCR conversion for scanning, recognition, and translation of data
◌◌ Coding to convert all of the data to the same format
Step 7: Modeling or Mining

This is the most important step, which uses scripts and applications to extract pertinent patterns that support the goals.

In the next stage of the data mining process, as the name implies, you must present the discoveries made in these extensive datasets. This can be made simpler by using visualisation tools such as Data Studio and by making reports.

Finally, the resulting decisions or findings are put to use in numerous applications or in machine learning. If done correctly, they can be used to automate operations using artificial intelligence (AI).
Six Basic Data Mining Activities

It's critical to distinguish between the activities involved in data mining, the processes by which they are carried out, and the methodologies that make use of these activities to find business prospects. The most important techniques can essentially be reduced to six tasks:

Clustering and Segmentation

The process of clustering entails breaking up a huge collection of items into smaller groups that share specific characteristics (Figure below). The distinction between classification and clustering is that, in the clustering task, the classes are not predefined. Instead, the decision or definition of each class is based on an evaluation of the classes that occurs once clustering is complete.
Clustering can be helpful when you need to separate data but are unsure exactly what you are looking for. For instance, you might want to assess health data for particular diseases together with other variables to check whether any relationships can be deduced through the clustering process. Clustering can also be used to identify a business problem area that needs further investigation in conjunction with other data, for example before examining why sales are low within a certain market segment.

The clustering method arranges data instances so that each group stands out from the others while its members are closely similar to one another. Since there are no established criteria for classification, the algorithms essentially "choose" the features (or "variables") that are used to gauge similarity. The records in the collection are arranged according to how similar they are. The results may need to be interpreted by someone with knowledge of the business context to ascertain whether the clustering has any particular meaning. In some cases, this may lead to the culling of variables that do not carry meaning or may not be relevant, in which case the clustering can be repeated in the absence of the culled variables.
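A minimal clustering sketch along these lines is shown below, using k-means from scikit-learn on a small, invented set of customer measurements. The number of clusters and the feature values are arbitrary choices for illustration; deciding whether the resulting groups mean anything is still left to someone who knows the business.

# Clustering without predefined classes: group similar records, then inspect the groups.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical records: [annual spend (thousands), store visits per month]
customers = np.array([[2.0, 1], [2.5, 2], [3.0, 1],      # low spend, infrequent
                      [9.5, 8], [10.0, 9], [11.0, 7],    # high spend, frequent
                      [5.0, 4], [5.5, 5]])               # somewhere in between

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("Cluster assigned to each record:", model.labels_)
print("Cluster centres (spend, visits):")
print(model.cluster_centers_.round(2))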
Classification

It is often said that the world is divided into two groups of people: those who categorise everything into two groups and those who do not. But truly, it is human nature to classify objects into groups based on a shared set of traits. For instance, we classify products into product classes, or divide consumer groups by demographic and/or psychographic profiles (e.g., marketing to the profitable 18 to 34-year-olds).

Once segmentation is complete, the results of identified dependent variables can be used for classification. The process of classifying data into preset groups is called classification (Figure below). These classifications might either be based on the output of a clustering model or could be described using qualities chosen by the analyst. During a classification process, the programme is given the class definitions and a training set of previously classified objects, and it then attempts to create a model that can be used to correctly categorise new records. For instance, a classification model can be used to categorise customers into specific market sectors, assign meta-tags to news articles depending on their content, and classify public firms into good, medium, and poor investments.
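The sketch below illustrates the classification task described above in miniature: a training set of previously classified customers is used to build a model that then assigns a class to a new record. The features, labels and the choice of logistic regression are illustrative assumptions, not something prescribed by the text.

# Classification: learn from pre-labelled examples, then categorise new records.
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: [age, annual income (thousands)] -> market sector label
X_train = [[22, 18], [25, 22], [30, 35], [45, 60], [50, 75], [60, 90]]
y_train = ["budget", "budget", "mid", "mid", "premium", "premium"]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_customer = [[40, 55]]
print("Predicted sector:", clf.predict(new_customer)[0])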
Estimation

Estimation assigns a value to a continuous (numeric) variable, and estimation models are widely employed (for example, in a market segmentation process, to estimate a person's annual pay).

Because a value is being given to a continuous variable, one benefit of estimation is that the resulting assignments can be sorted according to score. Consequently, a ranking method may, for instance, assign a value to the variable "probability of buying a time-share vacation package" and then order the candidates according to that projected score, identifying the individuals most likely to buy. Estimation is usually used to establish a fair guess at an unknowable value or to infer the likelihood of some action being taken. Customer lifetime value is one instance where we try to create a model that reflects the future worth of the relationship with a customer.
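A tiny estimation example in the same spirit is sketched below: a regression model assigns a continuous score to each prospect, and the prospects are then ranked by that score. All names and numbers are invented for illustration.

# Estimation: assign a continuous score to each record, then sort by the score.
from sklearn.linear_model import LinearRegression

# Hypothetical history: [age, past purchases] -> observed spend on vacation packages
X = [[25, 0], [35, 2], [45, 5], [55, 3], [65, 1]]
y = [0.0, 1.2, 4.8, 3.1, 0.9]        # thousands spent

est = LinearRegression().fit(X, y)

prospects = {"Anita": [30, 1], "Bram": [50, 4], "Chen": [60, 0]}
scored = {name: est.predict([row])[0] for name, row in prospects.items()}
for name, score in sorted(scored.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: estimated spend = {score:.2f}")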
Prediction

Prediction is an attempt to categorise items based on some anticipated future behaviour, which distinguishes it slightly from the preceding two tasks. Using historical data where the classification is already known, classification and estimation can be utilised to make predictions by creating a model (this is called training). Then, using fresh data, the model can be used to forecast future behaviour.

When utilising training sets for prediction, you must be cautious. The data may have an inherent bias, which could cause you to draw inferences or conclusions that merely reflect that bias. Utilise separate data sets for testing, and then test, test, and test again!
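One common way to act on that warning is to hold out data the model has never seen, ideally from a later period, and check the model there before trusting its forecasts. The sketch below assumes a hypothetical table with a 'year' column and a binary 'churned' outcome; it trains on the earlier period and evaluates on the later one.

# Prediction with a temporal holdout: train on history, evaluate on a later period.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("customer_history.csv")     # hypothetical: features + 'year' + 'churned'

train = data[data["year"] < 2022]
test = data[data["year"] >= 2022]              # "fresh" data the model never saw

features = [c for c in data.columns if c not in ("year", "churned")]
model = RandomForestClassifier(random_state=0).fit(train[features], train["churned"])

preds = model.predict(test[features])
print("Accuracy on the later period:", round(accuracy_score(test["churned"], preds), 3))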
Affinity Grouping

The technique of analysing associations or correlations between data items that show some sort of affinity between objects is known as affinity grouping. For instance, affinity grouping could be used to assess whether customers of one product are likely to be open to trying another. This type of analysis is helpful when running marketing campaigns to cross-sell or up-sell a consumer on more or better products. It can also be used to create product packages that appeal to broad market segments. For instance, fast-food chains may choose particular product ingredients to go into meals packaged for a specific demographic (such as the "kid's meal") and aimed at the people who are most likely to buy those packages (e.g., children between the ages of 9 and 14).
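Affinity grouping is often reduced to counting how often items appear together and turning the counts into support and confidence figures. The short, self-contained sketch below does exactly that for a handful of invented shopping baskets; dedicated libraries exist for larger data, but the arithmetic is the same.

# Affinity grouping by hand: support and confidence for "bread -> butter".
from itertools import combinations
from collections import Counter

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"bread", "jam"},
           {"butter", "milk"},
           {"bread", "butter", "jam"}]        # invented transactions

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

pair = frozenset({"bread", "butter"})
support = pair_counts[pair] / len(baskets)                 # how common the pair is overall
confidence = pair_counts[pair] / item_counts["bread"]      # roughly P(butter | bread)
print(f"support(bread, butter) = {support:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")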
Description

The final task is description: attempting to characterise what has been found or to provide an explanation for the outcomes of the data mining process. The ability to characterise a behaviour or a business rule is another step toward a successful intelligence programme that can locate knowledge, express it, and then assess possible courses of action. In fact, we may say that the description of newly acquired knowledge can be included in the metadata linked to that data collection.
In general, "mining" refers to the process of extracting a valuable resource from the earth, such as coal or diamonds. Knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging are all terms used in computer science to describe data mining. It is essentially the technique of extracting valuable information from a large amount of data or from data warehouses. The term itself is clearly a little perplexing: in coal or diamond mining, the product of the extraction process is coal or diamond, but the result of the extraction process in data mining is not data! Instead, the patterns and insights we obtain at the end of the extraction process are what we call data mining outcomes. In that respect, data mining can be considered a step in the Knowledge Discovery or Knowledge Extraction process.

The term has nevertheless become prevalent in business and media circles, and data mining and knowledge discovery are now used interchangeably.
3.3.1 Objectives of Data Mining and the Business Context for Data Mining

Data mining is now employed practically everywhere that enormous amounts of data are stored and processed. Banks, for example, frequently employ data mining to identify potential consumers who could be interested in credit cards, personal loans, or insurance. Because banks hold transaction details and complete profiles of their clients, they examine this data and look for patterns that can help them forecast which customers might be interested in personal loans or other financial products.

Companies employ data mining as a method to transform unstructured data into information that is useful. By using software to look for patterns in massive volumes of data, businesses can learn more about their customers to create more successful marketing campaigns, boost sales, and cut expenses. Effective data collection, warehousing, and computer processing are prerequisites for data mining.

Data mining is the process of examining and analysing huge chunks of data to discover significant patterns and trends. Numerous applications exist for it, including database marketing, credit risk management, fraud detection, and spam email screening.
There are five steps in the data mining process. Data is first gathered by organisations and loaded into data warehouses. The data is then stored and managed, either on internal servers or in the cloud. The data is accessed by business analysts, management groups, and information technology specialists, who then decide how to organise it. Application software next sorts the data according to the user's results, and ultimately the end-user presents the data in a manner that is simple to share, such as a graph or table.

A retailer, for example, might use the data it has gathered to establish classes according to the frequency of client visits and the items customers purchase. Other times, data miners hunt for information clusters based on logical connections, or they analyse associations and sequential patterns to infer trends in customer behaviour.

Data warehousing is involved when companies centralise their data into a single database or application. An organisation can isolate specific data segments for analysis and use by particular users using a data warehouse. In other instances, analysts could start with the data they need and build a data warehouse from scratch using those specifications.
In general, data mining has been combined with a variety of other techniques from other domains, such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualisation, and so on, to gather more information about the data and to help predict hidden patterns, future trends, and behaviours, allowing businesses to make better decisions.

Data mining, in technical terms, is the computational process of examining data from different perspectives and summarising it into useful information. Data mining can be used on any sort of data, including data from data warehouses and other large repositories.

The data mining process can be broken down into these four primary stages:

Data gathering: Relevant data for an analytics application is identified. A data lake or data warehouse, which are becoming more and more popular repositories in big data contexts and include a mixture of structured and unstructured data, may be where the data is located. Another option is to use other data sources. Wherever the data originates, a data scientist frequently transfers it to a data lake for the process's remaining steps.
Data preparation: This stage includes a series of procedures to prepare the data for mining. Data exploration, profiling, and pre-processing come first, followed by work to clean up mistakes and other data quality problems. Unless a data scientist is attempting to evaluate unfiltered raw data for a specific application, data transformation is also done to make data sets consistent.

Mining the data: Once the data is ready, a data scientist selects the most appropriate data mining technique and then uses one or more algorithms to perform the mining. In machine learning applications, the algorithms are often trained on sample data sets to look for the desired information before being applied to the entire set of data.

Data analysis and interpretation: Analytical models are developed using the data mining findings to guide decision-making and other business activities. Additionally, the data scientist or another member of the data science team must convey the results to users and business executives, frequently by using data storytelling approaches and data visualisation.
Applications of Data Mining
●● Financial Analysis
●● Biological Analysis
●● Scientific Analysis
●● Intrusion Detection
●● Fraud Detection
●● Research Analysis

Market Basket Analysis (MBA) is a technique for analysing the purchases made by a client in a supermarket. The idea is to identify the items that a buyer tends to buy together. What are the chances that a person who buys bread will also buy butter? This kind of study aids in the promotion of company offers and discounts. Data mining also supports biological research, for example the discovery of causes and potential therapies for Alzheimer's, Parkinson's, and cancer, diseases linked to protein misfolding.

Fraud Detection: In today's world of cell phones, we can utilise data mining to flag suspicious phone activity by analysing cell phone usage. This may aid in the detection of cloned phone calls. Similarly, when it comes to credit cards, comparing unusual transactions against a customer's normal spending pattern can help flag potential fraud. Digital libraries and digital governments are just a few of the other successful applications of data mining.
Data Mining Techniques

Algorithms and a variety of techniques are used in data mining to transform massive data sets into usable output. The most commonly used data mining methods are as follows:
●● Market basket analysis and association rules both look for connections between different variables. As it attempts to connect different bits of data, this relationship in and of itself adds value to the data collection. For instance, association rules would examine a business's sales data to see which products are most frequently bought together; with this knowledge, businesses can plan, advertise, and forecast accordingly.
●● Classification is used to assign classes to items. These categories describe the qualities of the items or show what the data points have in common. Thanks to this data mining technique, the underlying data can be more precisely categorised and summarised across related attributes or product lines.
●● Clustering and categorisation go hand in hand. Clustering, however, finds similarities between objects before grouping them according to how they differ from one another. While clustering might reveal groupings like "dental health" and "hair care," categorisation can produce groups like "shampoo," "conditioner," "soap," and "toothpaste."
●● Decision trees are employed to categorise or forecast a result based on a set list of criteria or decisions.
●● Neural networks process data through layers of interconnected nodes (similar to how the human brain is interconnected). The model can be fitted to provide threshold values that show how accurate it is.
●● In order to forecast future results, predictive analysis aims to use historical data to build mathematical models of likely outcomes.

Data mining is used throughout industry and business in the digital era. As long as there is a set of data to analyse, data mining is a broad process with a variety of applications.
Sales
A company's primary objective is to maximise profits, and data mining promotes more intelligent and effective capital allocation to boost sales. Think about the cashier at your preferred neighbourhood coffee shop. The coffee shop records the time of each purchase, the products that were purchased together, and the most popular baked goods. The store can then strategically design its product range using this knowledge.
Marketing

Once the coffee shop mentioned above determines its optimum lineup, it is time to put the changes into effect. To increase the effectiveness of its marketing campaigns, the store can utilise data mining to better identify where its customers view ads, which demographics to target, where to place digital ads, and which marketing tactics resonate with them. This entails adapting marketing strategies, advertising offerings, cross-sell opportunities, and programmes to data mining discoveries.
Manufacturing
r
Data mining is essential for organisations that manufacture their own items in
determining the cost of each raw material, which materials are used most effectively,
ve
how much time is spent throughout the manufacturing process, and which bottlenecks
have a negative impact on the process. The continual and least expensive flow of
commodities is ensured with the use of data mining.
ni
Fraud Detection

Finding patterns, trends, and correlations between data points is at the core of data mining. A business can therefore use data mining to find anomalies or relationships that should not exist. For instance, a business might examine its cash flow and discover a recurring transaction to an unidentified account. If this is unexpected, the business might want to investigate in case funds have been mishandled.
Human Resources

Human resources departments typically have a variety of data available for processing, including information on retention, promotions, salary ranges, business benefits and usage of those benefits, and employee satisfaction surveys. Data mining can correlate this data to better understand why employees leave and what attracts new hires.
Customer Service

Numerous factors can either create or undermine customer satisfaction. Consider a business that ships goods. Customers may become dissatisfied with communication over shipment expectations, shipping quality, or delivery delays. The same customer may grow impatient with lengthy hold times on the phone or sluggish email replies. Data mining can help pinpoint where such problems arise so that service can be improved.

Why, then, is data mining crucial for firms? Businesses that use data mining can gain a competitive edge, develop a deeper understanding of their customers, gain more control over daily operations, increase customer acquisition rates, and discover new business prospects.
Data analytics will aid many industries in different ways. Some industries look for the best approaches to attract new clients, while others look for fresh marketing strategies and system upgrades. The data mining process gives businesses the ability and understanding to make decisions, examine their data, and proceed accordingly.

Data mining models have only recently been used in marketing. Despite its rapid expansion, data mining is "new ground" for many marketers who rely solely on their intuition and domain knowledge. Their marketing campaign lists and segmentation plans are developed using business criteria based on their understanding of the industry.
Data mining models are not "dangerous," since they cannot take the place of domain experts and their valuable business expertise. Despite their strength, these models are ineffective without the active participation of business professionals. On the contrary, they can only produce outcomes that are truly significant when business experience is added to data mining capabilities. For instance, incorporating meaningful inputs with predictive potential, proposed by experts in the industry, can significantly improve the predictive capability of a data mining model.

Additionally, information from current business rules and scores can be included in a data mining model to help create a more reliable and effective outcome. In order to reduce the danger of producing irrelevant or ambiguous outcomes, model results should always be checked by business specialists for relevance prior to actual deployment. Therefore, business domain expertise can significantly improve and enhance data mining findings.
However, data mining programmes have the ability to spot trends that even the most seasoned businesspeople would have overlooked. They can assist in fine-tuning current business standards as well as enriching, automating, and streamlining decision-making. Data mining is also transforming the manufacturing sector by changing how quality and safety are managed in asset-based businesses.

Technology is transformed by data, and this is just the start of significant changes. Numerous technological innovations, like real-time data from GPS and connected vehicle sensors, text extracted from warranty reports, and voice-to-text translations in call centres, to mention a few, contributed to the quality and safety revolution in businesses. Crucially, the data has been consolidated into a repository that enables analysis across many data formats. This is precisely the situation in which machine learning techniques are useful: they are tasked with spotting patterns in the data and making forecasts.
Businesses utilise data mining to make inferences and find solutions to specific issues. One of its main advantages is that data mining is relevant to essentially every process and increases the adaptability and effectiveness of operations. As a result, using data in manufacturing helps with waste reduction, capacity modelling, monitoring automation, and schedule adherence. By establishing comprehensive data transparency, departments are entirely transformed and factories become smarter.
Process mining is now being used by ABB, a significant worldwide manufacturer, for its procurement-to-payment and production procedures. In the past, staff at the ABB facility in Hanau, Germany, would transfer evaluations from their SAP systems into Excel and evaluate them using complicated formulas in order to understand operations. Each morning at ABB, an email detailing the previous day's production variants, throughput times, and rejection rates is sent to the relevant production and assembly team leaders. Process mining makes the entire ecosystem of quality improvement procedures at the plant instantly visible. As more data is fed into the algorithm, pattern recognition only improves, and operational processes deliver immediate results as opposed to the delayed reports of manual analysis.

The vehicle manufacturing industry has also undergone significant development. The products in this market are fairly pricey, with high-end producers emphasising customer service and product quality. They point out that the commercial advantages associated with the adoption of data-driven innovations have the potential to accelerate the detection and correction of quality issues as well as reduce warranty spending, which represents between 2-6% of total sales in the automotive sector. Early detection and preventive maintenance frequently lead to higher uptime for the customers and users of these vehicles and equipment. For instance, in one incident involving an automaker, the discovery of a fault prior to the release of the product prevented the recall of 28,000 automobiles.
The majority of marketers have come to the realisation that gathering client data and extracting useful information from it is crucial for business growth. Business organisations are eagerly seeking expertise and insight from this customer data in order to stay globally competitive and boost their bottom line.

Marketing executives want to learn many kinds of information from the data they collect, but this is not always possible, because operational computer systems cannot give them more than information on daily transactions. Marketers want to position their goods in response to the demands of their specific clientele, yet finding the specific clients who are truly interested within a vast customer database is very challenging. This is where data mining methods and technology come into play.
With the aid of data mining techniques and tools, businesses can sift through layers of data that at first glance appear to be unrelated in search of important links, allowing them to predict, rather than merely respond to, client wants. Data mining is just one phase in a much longer process that occurs between a business and its clients. Data mining must be pertinent to the underlying business process in order to have an impact on a firm. The business process, not the data mining method, determines how data mining affects a business.

Knowledge discovery is a process; it has stages, a context, and operates under assumptions and constraints. The KDD process is depicted in the figure below along with typical data sets, tasks, and stages. Its starting point is a database with a lot of data that can be searched for potential patterns. A representative target data set is used for KDD, and it is constructed from the big database. In most real-world scenarios, both the database and the target data set contain noise, which includes incorrect, imprecise, contradictory, exceptional, and missing values. By removing this noise from the target data set, one obtains the collection of preprocessed data. The transformed version of the preprocessed data set is used directly for DM. In general, DM produces a collection of patterns, some of which may represent newly learned information.
The data is cleansed through a variety of stages in the process. The target data set is generated from the database using the selection technique, whose main responsibility is to choose representative data from the database in order to maximise the representativeness of the target data collection. Preprocessing is the subsequent step; it removes noise from the target data set and produces the specified data sequences in the set of preprocessed data. Some DM tasks require such sequences. The preprocessed data must then be transformed into a format that allows it to be used to carry out the intended DM task.

DM tasks are specific actions taken on the set of transformed data in order to look for patterns (useful information), guided by the kind of knowledge that should be found. Although not all of the patterns are helpful, a process is used in the DM phase to extract patterns from the transformed data. The aim of understanding and evaluating all the patterns found is to keep only the patterns that are interesting and beneficial to the user and to discard the rest. The knowledge that has been uncovered resides in the patterns that remain.
The manner in which businesses are run has undergone a paradigm shift. The business process has become more complex as a result of shifting customer behaviour patterns and the methods used by competitors. Shopkeepers today face an increased number of customers, items, and competitors, as well as a decrease in reaction time. This means that it is now much more difficult to understand one's customers. Customer connections are becoming more complex as a result of a variety of factors coming together:
●● Client loyalty is a thing of the past due to the substantial decline in customer attention span. A successful business must always emphasise the value it offers to its clients. The amount of time between a new want and the moment when it must be satisfied is likewise getting shorter. If one does not respond swiftly, the customer will find someone else who will.
●● Everything is more expensive: printing, postage, and special discounts (if one doesn't provide the special discount, one's rivals will).
●● Customers choose products that precisely fulfil their requirements over those that only partially do. This means that there are now many more things available and more ways to purchase them.
●● One's best clients are what make one's rivals look good. Rivals will concentrate on a few select, lucrative areas of your business and work to keep the finest customers for themselves.
In such a dynamic climate it has become difficult to maintain customer relationships. The continuation of a customer's business is no longer assured; customers one has today may no longer be there tomorrow, because the market does not wait for anyone. It is also more difficult now to interact with customers than it formerly was. Customers and potential customers like dealing with businesses on their own terms. In other words, when deciding how to proceed, it is necessary to consider a variety of factors. Companies must automate The Right Offer, To The Right Person, At The Right Time, and Through The Right Channel in order to make this possible in an efficient manner.

The right offer means managing the many encounters with clients to make sure the business has made its best offer, ranking the most significant offers first and the least relevant ones last among the many available. The right person implies that not all clients are created equal; interactions with them must shift to highly targeted, individual-focused marketing initiatives. The right time is made possible by the fact that interactions with clients now take place continuously; in the past, quarterly mailings were considered the most innovative kind of marketing. Finally, the right channel allows for a range of consumer interactions (direct mail, email, telemarketing, etc.); here, it is crucial to pick the medium that will best facilitate a particular interaction.
The enormous amount of data collected about clients and everyday purchase activities has caused business databases to grow rapidly. Because of this, the strategies and technologies used in knowledge management and data mining (DM) have become crucial for making marketing decisions. DM can be utilised to uncover relevant, hidden buying trends in order to support marketing decisions.
Researching, creating, and making a product available to the public are all part of the discipline of marketing. The idea of marketing has been around for a while, yet it keeps evolving in response to customer wants and buying patterns. As a result, marketing today is quite different from what it was a few decades ago, largely because of the world economy's rapid transformation and technological advancements, which together allow for the free and quick sharing of knowledge. Globalisation has reduced the cost and complexity of operating abroad and increased competitiveness, exposing multinational corporations (MNCs) to local markets.
The manner and rate of knowledge dissemination have changed significantly as markets have become more deregulated. Every action we take, even a long walk in the woods, leaves behind a few traces of electronic data. Data is recorded every time the Internet or a mobile device is used, and a massive industry is right there soaking up all that data and using it to determine how to sell you something.

Every transaction, from the sale of toothpaste to the purchase of a life insurance policy, generates some amount of data, which, when properly analysed, can give an organisation a competitive edge in the global market by revealing implicit patterns and latent connections among vast data sets. Data mining is one such method that may be used to analyse such a massive amount of data.
Data mining is the process of extracting patterns or models from observable data. It is a non-trivial method for finding valid, innovative, possibly helpful, and ultimately understandable patterns in data. Various statistical techniques and mathematical equations are utilised in the technology known as "data mining."

Intelligent techniques are used in data mining, a crucial procedure for extracting patterns from data.
These days, massive amounts of data are being gathered; the amount of data collected is said to roughly treble every nine months. One of the most desirable features of data mining is its ability to extract knowledge from large amounts of data, because there is typically a wide gap between stored data and the knowledge that can be derived from it. This transformation does not happen naturally, which is why data mining is important. Many marketers like this technology because it helps them understand their clients better and make wise marketing choices.
Used well, data mining can make production soar. Data mining information can help raise return on investment (ROI), enhance customer relationship management (CRM) and market analysis, lower the cost of marketing campaigns, and make it easier to detect fraud and retain customers.

Marketing can be defined as placing the ideal product in the ideal environment at the ideal time and price. It is a process of planning and executing the creation, pricing, promotion, and distribution of concepts, products, and services to generate exchanges that meet both individual and organisational goals. Determining what customers want and where they purchase requires a lot of effort. Then, it must be determined how to make the item at a cost that represents value to them, and everything must be coordinated at the crucial moment.

However, one incorrect factor can cause a catastrophe, such as promoting a car with incredible fuel efficiency in a country where fuel is relatively inexpensive, releasing a textbook after the start of a new school year, or pricing an item too high or too low for the target market. The marketing mix is a useful place to start when thinking through ideas for a product or service so as to avoid such blunders. It is a generic term used to describe the various decisions businesses must make throughout the process of bringing a good or service to market. The 4 Ps, first stated by E J McCarthy in 1960, are one of the best ways to describe the marketing mix:
◌◌ Product
◌◌ Price
◌◌ Place
◌◌ Promotion
Data mining is used in a variety of marketing-related applications. One of them is market segmentation, which identifies typical client behaviours: customers who appear to buy the same things at the same time are examined for patterns. Customer churn is another application of data mining; it aids in determining the kinds of customers who are likely to stop using a product or service and go to a rival. A business can also utilise data mining to identify the transactions that are most likely to be fraudulent. A retail store, for instance, could identify which products are most frequently stolen so that appropriate security measures can be adopted.
Even though direct mail marketing is an older strategy, businesses can achieve excellent results by combining it with data mining. Data mining can, for instance, be used to determine which clients will be responsive to a direct mail campaign. It also establishes how effective interactive marketing is: since some customers are more likely to buy online than in person, there is a need to identify them.

While many marketers use data mining to boost their profitability, it can also be used to launch new businesses and sectors. Every industry built on data mining relies on the automatic identification of patterns and behaviours. For instance, data mining uses automatic prediction to examine prior marketing tactics. Which one was most effective? What made it work so well? Which clients responded to it best? These questions can be answered via data mining, which helps avoid the errors committed in earlier marketing campaigns. Thus, data mining for marketing aids businesses in gaining a competitive edge over rivals and surviving in a global marketplace.
Implementation

Customer Identification: Target customer analysis and customer segmentation are the components of identifying customers. Customer segmentation entails the division of an entire customer base into smaller customer groups or segments, consisting of customers who are relatively similar within each specific segment, while target customer analysis entails seeking out the profitable segments of customers through analysis of customers' underlying characteristics.

Customer Attraction: Once organisations have identified the potential client segments, they can focus their efforts and resources on attracting the target consumer segments. Direct marketing is one strategy for attracting customers; it is a form of promotion that encourages consumers to make purchases through multiple channels. Distribution of coupons or direct mail are classic instances of direct marketing.

Customer Development: The aim here is to increase the value obtained from a single customer. Up-/cross-selling describes marketing initiatives that increase the number of related or complementary services a consumer uses across a company, while market basket analysis aims to increase the volume and value of client purchases.
Clustering in Marketing

Clustering may be used when clients haven't yet been segmented for advertising purposes. After performing a cluster analysis, the clusters may be investigated for characteristics on which ad campaigns can be targeted at the client base. Following segmentation, product positioning, product repositioning, and product creation may be done based on the traits of the clusters to better fit the product to the intended customers. It is also possible to choose test markets using cluster analysis. Additionally, clustering may be used in customer relationship management (CRM).

Client clustering tracks purchasing behaviour and develops strategic business initiatives using data from customer purchase transactions. Businesses want to keep high-profit, high-value, and low-risk consumers. This cluster often represents the 10-20% of consumers who account for 50-80% of a business's profits. A business would not want to lose these clients, hence retention is the strategic aim for this segment.
Typical marketing strategies for this market category include cross-selling new products and up-selling more of what customers already purchase.
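A common way to obtain such purchase-behaviour clusters is to first roll raw transactions up into per-customer features (for example recency, frequency and monetary value) and then cluster those. The sketch below assumes a hypothetical transactions file; the feature choices and the number of clusters are illustrative assumptions only.

# Behavioural customer clustering from raw transactions (RFM-style features).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

tx = pd.read_csv("transactions.csv", parse_dates=["date"])   # hypothetical: customer_id, date, amount

snapshot = tx["date"].max()
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile the segments, e.g. to find the high-value, high-frequency group to retain.
print(rfm.groupby("segment").agg(["mean", "count"]))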
When a marketer has very little information about a customer, pattern association can be widely employed to forecast the customer's preferences. Pattern association tools help a marketer predict which product or advertisement a customer may be interested in based solely on the customer's current purchasing behaviour, by comparing it with the purchasing behaviour of similar customers (who bought similar products), even when no other information about the customer is available.
3.3.4 Data Mining in CRM

Customers are an organisation's most valuable resource. Without happy clients who remain loyal and strengthen their relationship with the company, there can be no business. Because of this, a business should develop and implement a clear customer service strategy. Building, maintaining, and enhancing loyal and enduring client relationships is the goal of CRM (Customer Relationship Management). CRM should be a customer-centric approach based on customer insight. Its focus should be on providing consumers with "personalised" service by recognising and appreciating their unique needs, preferences, and behaviours.
Let's use the following real-world example of two apparel stores with different selling strategies to clarify the goals and advantages of CRM. The first store's employees make an effort to sell everything to everyone. The second store's staff make an effort to determine each customer's needs and interests before making suitable recommendations. Which retailer will ultimately win customers over with a reputation for dependability? The second certainly seems more reliable for a long-term relationship, because it seeks customer happiness while taking into account the unique needs of the consumer.
The importance of the first CRM goal, customer retention, is clear: getting new customers is challenging, especially in developed markets, and replacing current clients with new ones won from the competition is never easy. The second CRM goal, customer development, rests on the fact that there is no such thing as an average customer. The client base consists of a variety of people, each with their own wants, tendencies, and potential, and each should be managed appropriately.
CRM software solutions are available. By automating contacts and interactions with customers, these systems, also known as operational CRM systems, typically support front-line processes in sales, marketing, and customer care. They keep track of consumer information and contact histories, and they make sure that a consistent image of the customer is presented at every customer "touch point" (interaction point). Operational CRM systems alone are not enough for managing clients effectively, though. Organisations must use data analysis to develop an understanding of consumers, their needs, and their wants in order to use CRM successfully and achieve the aforementioned goals. This is where analytical CRM comes into play. Analytical CRM focuses on evaluating customer data to effectively address CRM goals and communicate with the appropriate customer. Data mining algorithms are used to evaluate the worth of clients and to understand and predict their behaviour. It involves examining data trends in order to derive information for improving customer relationships.
Data mining, for instance, can aid customer retention, since it makes it possible to quickly identify important clients who are more likely to leave, giving time for tailored retention programmes. By matching products with customers and improving the targeting of product promotion activities, it can assist customer development. Additionally, it may help to identify different consumer segments, supporting the creation of new, tailored products and product offerings that better reflect the interests and objectives of the target market.

To manage all client contacts on a more informed and "personalised" basis, the outcomes of the analytical CRM operations should be imported and incorporated into the operational CRM front-line systems. Analytical CRM is the focus here; the goal is to demonstrate how data mining techniques can be applied within the CRM framework, with a particular emphasis on customer segmentation.
What can data mining do in CRM?

Data mining uses sophisticated modelling approaches to analyse vast volumes of data in order to derive knowledge and insight; data are transformed into knowledge and useful information.

The data to be examined may be extracted from numerous unstructured data sources or may already be present in well-organised data marts and warehouses. There are several steps in a data mining process. Prior to the use of a statistical or machine learning technique and the creation of a suitable model, it frequently includes substantial data management. Data mining tools are specialised software packages that support this work.

The goal of data mining is to find patterns in data that may be used to understand and anticipate behaviour. Data mining models are made up of a series of rules, equations, or complicated "transfer functions." According to their objectives, they can be grouped into supervised (predictive) and unsupervised models.
The figure below outlines the CRM-data mining framework that has been presented. In data mining, the first step in solving any problem is to acquire an understanding of the business goals and requirements associated with the problem area. An in-depth analysis and careful management of customer connections, and of the dynamics between those ties, will help to find, attract, and keep valuable consumers in the domain. The building of an effective model that satisfies the business needs is one of the major stages in the CRM framework. With these models, the behaviour of customers can be predicted more accurately. Evaluation and visualisation of the model are used to determine how effective the model is at improving the performance of the organisation.
Customer insight, which is essential for developing a successful CRM strategy, can be obtained through data mining. Through data analysis, it can result in individualised interactions with customers, increasing satisfaction and fostering profitable client relationships. It can support "individualised" and optimised customer management throughout all stages of the customer lifecycle, from client acquisition and relationship building to attrition prevention and client retention. Marketers aim to increase both their market share and their consumer base; simply put, they are in charge of acquiring, retaining, and growing the clientele. As shown in the following figure, data mining models can assist with all of these tasks:
Figure: Data mining and customer lifecycle management
More specifically, the marketing activities that can be aided by data mining include the following themes.
Customer Segmentation

It is possible to create distinctive marketing strategies by segmenting the consumer base into distinct, internally homogeneous groups based on their attributes. Based on the specific criteria or attributes utilised for segmentation, there are numerous different segmentation types. Simple business rules can handle only a few segmentation fields effectively. Data mining, on the other hand, can produce behavioural segments that are driven by the data itself. Clustering algorithms can examine behavioural data, spot natural groupings of consumers, and offer a solution based on observed patterns in the data. If data mining models are constructed correctly, they can reveal groups with unique profiles and attributes and produce rich segmentation schemes with commercial meaning and value, which is essential for prioritising customer service and marketing efforts based on the value of each customer.
in an effort to reduce customer attrition, increase client acquisition, and encourage the
Notes
e
purchase of supplementary items. Acquisition campaigns are more explicitly designed
to lure potential clients away from other businesses. Cross-, deep-, and up-selling
strategies are used to persuade current consumers to buy more of the same product,
in
more of the same product, or a different but more lucrative product. Finally, retention
campaigns are designed to keep loyal consumers from severing ties with the business.
These efforts, although having the potential to be successful, can also squander a significant amount of money and time by bombarding customers with unsolicited messages. Marketing campaigns that are specifically aimed at a target audience can be developed with the use of data mining and classification (propensity) models. They look at client traits and identify the target customers’ profiles. Then, fresh instances with comparable features are found, given a high propensity score, and added to the target lists. To make the ensuing marketing campaigns as effective as possible, the following classification models are used:
●● Cross-/deep-/up-selling models: These can show whether current clients have the potential to make further purchases.
●● Voluntary attrition or voluntary churn models: These locate clients who are more likely to leave voluntarily and identify early churn indications.
When constructed appropriately, these models can help create campaign lists with higher densities and frequencies of target customers by identifying the best customers to reach out to. They outperform selections made using business rules and personal intuition, as well as random selections. Lift in predictive modelling refers to the metric that contrasts a model’s predictive ability with chance. It indicates how much more effectively a classification data mining model performs than a random selection. This is illustrated in the following figure, which contrasts the outcomes of a data mining churn model with a random selection.
Figure: The increase in predictive ability resulting from the use of a data mining
churn model.
In this fictitious example, 10% of the actual “churners” are included in the sample that was chosen at random. However, a list of the same size produced by a data mining algorithm is far more productive, because it comprises around 60% of the real churners. Data mining outperformed randomness in terms of prediction six times over. Although hypothetical, these outcomes are not too far off from reality. In real-world scenarios that were successfully handled by well-designed propensity models, lift values higher than 4, 5, or even 6 are rather typical, demonstrating the potential for improvement provided by data mining.
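To make the arithmetic concrete, here is a minimal Python sketch of the lift calculation just described, using the hypothetical 10% and 60% figures from the churn example (the numbers are illustrative, not real campaign results):

# Lift = hit rate achieved by the model / hit rate of a random selection.
def lift(model_hit_rate, baseline_hit_rate):
    return model_hit_rate / baseline_hit_rate

random_list_churners = 0.10   # 10% of a randomly chosen list are real churners
model_list_churners = 0.60    # ~60% of the model-generated list are real churners

print(lift(model_list_churners, random_list_churners))  # -> 6.0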
The following diagram and explanation show the steps of direct marketing campaigns:
1. Compiling and combining the required data from various data sources.
2. Customer segmentation and analysis into different customer groups.
3. Using propensity models to develop targeted marketing strategies that choose the correct customers.
4. Execution of the campaign by selecting the ideal channel, ideal moment and ideal offer for every campaign.
5. Evaluation of the campaign using test and control groups. The population is divided into test and control groups for the evaluation, and the positive responses are compared.
6. Analysis of campaign results to help with targeting, timing, offers, products, communication and other aspects of the campaign for the following round.
Data mining can play a significant role in all these stages, particularly in identifying the right customers to be contacted.
Figure: The stages of a direct marketing campaign
For example, association models and data mining can be used to find related products that are frequently bought together. These models can be used to analyse market baskets and identify sets of goods or services that can be marketed as a unit.
The Next Best Activity Strategy and ‘‘Individualized’’ Customer Management
To improve customer management, data mining models should be developed and implemented in an organisation’s daily business activities. The construction of a next best activity (NBA) strategy can benefit from the knowledge acquired by data mining. More specifically, the establishment of “personalised” marketing goals can be made possible by the customer understanding gleaned via data mining. The company can choose the following “individualised” strategies and make a more informed decision on the next appropriate marketing activity for each client:
2. an advertisement for the ideal add-on item and a targeted cross-, up- or deep-selling offer for clients with expansion potential.
3. placing usage limits and restrictions on clients with problematic payment histories and credit risk scores.
4. the creation of a new product or service that is specifically suited to the traits of a targeted market niche, etc.
The following figure illustrates the key elements that should be considered when designing the next best activity strategy.
Figure: The next best activity components.
Although there are numerous data mining techniques, this section lists some of the most popular ones and provides some instances of each technique’s use.

Market Basket Analysis

Think of a trip to the supermarket: you push a cart and pick up items as you move down the aisles. While some of them might have been spontaneously chosen, the majority of these might match a pre-made shopping list. Let’s assume that when you pay at the register, the items in your (and every other shopper’s) cart are recorded, because the store wants to identify any patterns in the items that different customers tend to purchase. The term for this is market basket analysis.
A procedure called market basket analysis looks for links between items that “go together” in a business context. Despite its name, market basket analysis goes beyond the typical supermarket setting. Any group of items can be analysed using the market basket method to find affinities that can be used in some way. Following are some instances of how market basket analysis has been used:
Product Placement: Besides suggesting which items to place together to encourage joint purchases, it is also possible to physically separate items that are frequently bought together in a store.
Up-Sell, Cross-Sell, and Bundling Opportunities: Businesses may utilise the affinity grouping of several products as a sign that customers could be inclined to purchase the grouped products simultaneously. This makes it possible to showcase products for cross-selling, or may imply that buyers could be likely to purchase additional products when specific products are grouped together.

Customer Retention: Businesses may use market basket research to decide what incentives to provide when clients contact a company to end a relationship, in order to keep their business.
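As a rough illustration of how affinity grouping can be computed, the following Python sketch counts how often pairs of items appear together across a set of invented baskets; in practice a dedicated association-rule tool would be used, but the idea is the same:

# Count how often pairs of items occur together in the same (made-up) basket.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "butter", "beer"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen most often are candidates for bundling or co-placement.
print(pair_counts.most_common(3))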
Memory-Based Reasoning
A typical sick visit to the doctor comprises the patient explaining a number of symptoms, the doctor reviewing the patient’s medical history to match the symptoms to known illnesses, making a diagnosis, and then suggesting a course of treatment. This is a practical illustration of memory-based reasoning (MBR), also known as case-based reasoning or instance-based reasoning, which uses known circumstances to create a model for analysis. The closest matches between new circumstances and the model are then examined to help with classification or prediction judgments.

An MBR approach has two fundamental parts. The first is the similarity function, often known as distance, which calculates how similar any two things are to one another. The second is the combination function, which combines the outcomes from the set of neighbours to make a choice. Typically the known cases are first organised into groups (for example, through clustering), and new objects are then classified using the findings. The same approach used in the matching process to get the closest match can also be used for prediction: the outcome of the new object can be predicted using the behaviour of the matched object.
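The following Python sketch illustrates the two parts of an MBR approach under simple assumptions: Euclidean distance as the similarity function and a majority vote over the k nearest known cases as the combination function. The cases and labels are invented for illustration only:

# Memory-based reasoning: distance function + combination function (majority vote).
from collections import Counter

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(new_case, known_cases, k=3):
    # find the k known cases most similar to the new one
    neighbours = sorted(known_cases, key=lambda c: distance(new_case, c[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

known_cases = [((1.0, 2.0), "A"), ((1.2, 1.8), "A"), ((6.0, 7.0), "B"), ((5.5, 6.5), "B")]
print(classify((1.1, 2.1), known_cases))  # -> "A"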
Think about a database that records cancer symptoms, diagnoses and therapies as an illustration. After doing a clustering of the cases based on symptoms, MBR can be used to identify the instances that are most similar to a recently discovered one, in order to make an educated guess at the diagnosis and suggest a course of action.
Cluster Detection
One common data mining challenge is to split up a big set of heterogeneous objects into several smaller, more homogeneous groupings. Applications for automated clustering are utilised to carry out this grouping. We need a function that calculates the distance between any two points based on an element’s characteristics. For example, if we wanted to analyse the visitors to a website to learn more about the types of people who are using the site, clustering might be effective.
There are two methods of clustering. The first method works on the presumption that the data already contains a particular number of clusters, and the objective is to separate the data into those clusters. K-Means is a widely used method of this kind: it first chooses initial cluster centres and then iteratively applies the following steps: calculate the distance between each object and each cluster centre, assign each object to the cluster whose centre is nearest, and then recompute the centre of each cluster. This is repeated until the cluster boundaries stop shifting.
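A bare-bones version of this K-Means loop can be sketched in Python as follows; the one-dimensional data and the naive initialisation are purely illustrative:

# Minimal K-Means: assign each point to the nearest centre, recompute centres,
# repeat until assignments stop changing. Toy 1-D data for illustration.
def kmeans(points, k, max_iter=100):
    centres = points[:k]                      # naive initialisation
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: abs(p - centres[c])) for p in points]
        if new_assignment == assignment:      # cluster boundaries stopped shifting
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = sum(members) / len(members)
    return centres, assignment

print(kmeans([1.0, 1.5, 2.0, 9.0, 9.5, 10.0], k=2))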
The second method lets the clusters emerge from the clustering process itself, rather than assuming the presence of a set number of clusters. However, in this scenario, we need a method to measure similarity between clusters (instead of between points within an n-dimensional space), which is a little more challenging. The agglomerative method’s end outcome is the composition of all clusters into a single cluster. However, each iteration’s history is documented, allowing the data analyst to select the level of clustering that is most suitable for the specific business need at hand.
Link Analysis
The process of finding and forming relationships between items in a data set, as well as quantifying the weight attached to every such link, is known as link analysis. Examples include examining links created when a connection is made at one phone number to another, evaluating whether two people are related via a social network, or examining the frequency with which travellers with similar travel preferences choose to board particular flights. In addition to creating a relationship between two entities, this link can also be described by other variables or attributes. Examples of telephone connectivity include the volume of calls, the length of calls, or the times when the calls are placed.
For analytical applications that rely on graph theory for inference, link analysis is helpful. Finding groupings of people who are connected to one another is one example. Are there groups of people connected to one another such that the connection between any two individuals inside a group is as strong as the connection between any other pair? By providing an answer, you can learn more about drug trafficking organisations or a group of individuals with a lot of sway over one another.

Another example is the way that flight crews (qualified for particular types of aircraft) are distributed among the many routes that an airline takes. The goal of lowering lag time and extra travel time required for a flight crew, as well as any external regulations related to the amount of time a crew may be in the air, are the driving forces behind the assignment of pilots to aircraft. Each trip represents a link within a big graph.
A single customer might not be responsible for a sizable portion of product purchases, but several people in her “sphere of influence” might.

Rule Induction

The process of locating business (or other types of) rules that are embedded in data is a part of knowledge discovery. For this discovery procedure, rule induction techniques are applied.
Similar to the game “Twenty Questions,” many situations can be solved by responding to a series of questions that gradually eliminate potential solutions. Using decision trees is one method of discovering rules. A decision tree is a decision-support model that organises the questions and potential answers, directs the analyst to the right solution, and can be applied to both operational and classification procedures. When a decision tree is complete, each node in the tree represents a question, and the decision of which path to follow from that node depends on the answer to that question. One internal node of a binary decision tree, for instance, could ask whether the employee’s compensation exceeds $50,000. The left-hand path is taken if the response is yes, and the right-hand path if the response is no.

In essence, each node in the tree represents a collection of records that are consistent with the responses to the queries encountered on the way to that particular node. Every path from the root node to any other node is different, and each of these queries divides the record set into two smaller parts. Every node in the tree serves as a representation of a rule, and we may assess the number of records that adhere to that rule, that is, the size of the record set at any given node.

As the decision support process traverses the tree, the analyst utilises the model to seek a desired outcome; the traversal ends when it reaches a tree leaf. The “thinking process” employed by decision tree models is transparent, and it is obvious to the analyst how the model arrived at a specific conclusion.
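As a small illustration, the salary question above can be written as a hand-built tree of if-then tests in Python; the thresholds, attribute names and class labels below are invented for the example:

# A tiny hand-built decision tree in the spirit of the example above.
def classify(record):
    if record["salary"] > 50_000:              # root node question
        if record["years_as_customer"] > 3:    # second-level question
            return "low credit risk"
        return "medium credit risk"
    return "high credit risk"

print(classify({"salary": 62_000, "years_as_customer": 5}))   # low credit risk
print(classify({"salary": 30_000, "years_as_customer": 1}))   # high credit risk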
The search for association rules is another method of rule induction. According to association rules, certain relationships between traits are more common than would be predicted if the attributes were independent. A well-known example of a relationship between sets of attribute values is the (possibly apocryphal) discovery that shoppers tend to buy beer and diapers together. The apparently codependent variable(s) must have a degree of confidence, suggesting that when the X values exist, so do the Y values, and the co-occurrence of the variable values must have support, which indicates that those values occur together with a fair frequency. An association rule basically says that, with some degree of certainty and some amount of support, the values of one set of attributes determine the values of another set of attributes.

The “source attribute value set” and “target attribute value set” are the specifications for an association rule. For obvious reasons, the target attribute set is also referred to as the right-hand side and the source attribute set as the left-hand side. The probability that the association rule will apply is its confidence level. For instance, if a client who purchases a network hub also purchases a network interface card (NIC) 85% of the time, the confidence of the rule “item1: Buys network hub” ⇒ “item2: Buys NIC” is 85%. The support of a rule is the percentage of all records where the left- and right-side attributes have the prescribed values. In this scenario, 6% of all records must have the values “item1: Buys network hub” and “item2: Buys NIC” set in order for the rule to have support of 6%.
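The support and confidence measures can be computed directly, as in the following Python sketch over a handful of invented transactions that mirror the network hub and NIC example:

# Support and confidence for a rule X -> Y over a small list of made-up transactions.
transactions = [
    {"network hub", "NIC"},
    {"network hub", "NIC", "cable"},
    {"network hub"},
    {"NIC"},
    {"cable"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # of the transactions containing the left-hand side, how many also contain the right-hand side
    return support(lhs | rhs, transactions) / support(lhs, transactions)

rule_lhs, rule_rhs = {"network hub"}, {"NIC"}
print(support(rule_lhs | rule_rhs, transactions))   # 0.4  (2 of 5 transactions)
print(confidence(rule_lhs, rule_rhs, transactions)) # ~0.67 (2 of the 3 hub buyers)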
Neural Networks
A data mining model used for prediction is a neural network. The algorithms for creating neural networks incorporate statistical artefacts of the training data to construct a “black box” process that accepts a certain number of inputs and provides some predictable output. The neural network model is trained using data examples and desired outputs. Neural network models are based on statistics and probability and, once trained, are excellent at solving prediction problems. They were initially designed as a technique to simulate human reasoning. However, the training set’s information is not transparently absorbed into the neural network as it grows. The neural network model is excellent at making predictions, but it is unable to explain how it arrived at a particular conclusion.

In a neural network, each node combines its weighted inputs into an output value that is transmitted to other neurons in the network. The values of the input attributes serve as the initial inputs, and the output(s) that result represent a choice made by the neural network. For categorisation, estimation and prediction, this method works well.
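The following Python sketch shows the kind of calculation a single neuron performs: a weighted sum of its inputs followed by a (sigmoid) activation. The weights and bias are arbitrary here; a real network would learn them from training examples and desired outputs:

# One neuron: weighted sum of inputs + sigmoid activation.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

# Two inputs feeding one output neuron; the weights are illustrative, not trained.
print(neuron([0.8, 0.3], weights=[1.5, -2.0], bias=0.1))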
There are instances when changing the way data is represented is necessary to obtain the kinds of values needed for accurate value calculation. For instance, when looking for a discrete yes/no response, continuous values may need to be rounded to 0 or 1, and historical data supplied as dates may need to be converted into elapsed days.

Data mining is a collection of techniques for analysing data from a variety of angles and views using algorithms, statistical analysis, artificial intelligence and database systems.
Data mining techniques are used to uncover patterns, trends and groups in large amounts of data and to translate that data into more comprehensive information.

A data mining tool is a framework, similar to RStudio or Tableau, that allows you to perform various data mining analyses, such as clustering or classification, and visualise the results. It is a framework that aids in the comprehension of our data and the phenomena it depicts.

According to ReportLinker’s forecast, sales of data mining tools were expected to exceed $1 billion by 2023, up from $591 million in 2018.
Orange Data Mining:
The Orange software bundle is ideal for machine learning and data mining. It is a software package that facilitates visualisation and is based on components written in the Python programming language, created at the bioinformatics laboratory of Ljubljana University’s faculty of computer and information science.

Orange’s components are referred to as “widgets” because it is a component-based software. Preprocessing and data visualisation are only a few of the widgets available, as are algorithm evaluation and predictive modelling. Widgets offer functionalities such as:
◌◌ Displaying a data table and allowing selection of features
◌◌ Data reading
◌◌ Training predictors and comparison of learning algorithms
◌◌ Data element visualisation, etc.
SAS:

SAS is an SAS Institute tool for data management and analytics. SAS can mine, modify and manage data from a variety of sources, as well as do statistical analysis. For non-technical users, it provides a graphical user interface (GUI).

SAS data mining tools enable users to analyse vast volumes of data and provide accurate information for speedy decision-making. SAS has a distributed memory processing architecture that is highly scalable. It can be used for text mining, data mining and optimisation.
DataMelt (DMelt):

DMelt is a Java-based multi-platform utility. It will run on any operating system that is compatible with the JVM (Java Virtual Machine). It is made up of science and math libraries.
●● Scientific libraries: These are used for drawing 2D/3D plots.
●● Mathematical libraries
DMelt can be used for large-scale data analysis, data mining and statistical analysis. It is widely used in fields including natural sciences, finance and engineering.
Rattle:
Rattle is a data mining tool with a graphical user interface. It was created in R, a statistical programming language. By providing sophisticated data mining features, Rattle exhibits R’s statistical strength. While the user interface of Rattle is well-designed, it also contains an integrated log code tab that generates duplicate code for all GUI operations.

The data collected by Rattle can be viewed and edited. Others can assess the code, use it for a number of purposes, and enhance it without restriction, thanks to Rattle.
Rapid Miner:

Rapid Miner, developed by the Rapid Miner corporation, is one of the most extensively used predictive analysis programmes. It was constructed in the Java programming language. Text mining, deep learning, machine learning and predictive analytics environments are all included.

Rapid Miner’s server can be hosted on-premises or in a public or private cloud. It is built on a client/server architecture. Rapid Miner uses template-based frameworks to deliver results quickly and with few errors (which are commonly expected in manual coding).
Case Study
A Knowledge Discovery Case Study for the Intelligent Workplace
Introduction
The Robert L. Preger Intelligent Workplace (IW) serves as a “living laboratory” for testing innovative construction methods and materials in a real-world setting for research and instruction. On the campus of Carnegie Mellon University, the IW may be found on the top floor of Margaret Morrison Carnegie Hall. The IW has been equipped with a collection of sensors that measure HVAC performance, ambient conditions and energy usage in order to assess the effectiveness of its revolutionary HVAC system. Although the Intelligent Workplace researchers have daily access to the sensor records, they would like to develop automated methods to condense the data into a manageable form that illustrates trends in the energy performance. There are few choices for automated analysis of big observational data sets like the IW sensor data; one of them is a collection of methods known as “data mining.” In this case study, we use knowledge discovery to apply data mining techniques to the IW data.
(Source: https://www.researchgate.net/publication/269159907_A_Knowledge_
Discovery_Case_Study_for_the_Intelligent_Workplace)
Background

The use of data mining tools is merely one aspect of the process. The non-trivial process of finding true, original, possibly helpful, and ultimately intelligible patterns in data is known as knowledge discovery in databases (KDD). The KDD method seeks to assist in ensuring that any results generated are both accurate and reproducible.

The six main categories that data mining techniques themselves fall under are classification, regression, clustering, summarisation, dependency modelling and deviation detection. Rough sets analysis is a classification method that was employed in this case study. In general, mapping a data item into a preset class is done using classification procedures. Each data point is treated as an “object” in rough sets analysis. Based on the values of each object’s separate properties, the approach seeks to differentiate them from one another. The analysis excludes any attributes that cannot be used to differentiate between several objects. Decision rules are generated using the remaining attributes. The analyst is then given these generalised principles in the form of conjunctive if-then rules.
Each rule produced by rough sets analysis is characterised by its accuracy and its coverage. Accuracy indicates how reliable the predictor qualities are. Calculating coverage involves determining how much variation in the output attribute the rule accounts for. Although it turns out that there is typically a trade-off between accuracy and coverage, it is highly desirable to have both high accuracy and coverage. In the IW case study, accuracy and coverage were used to gauge each rule’s technical excellence. The IW researchers themselves assessed each rule’s actual usefulness.
In light of the list of requirements put forth by Fayyad et al. (2000), the KDD approach was a fantastic solution for the IW data analysis challenge. Due to the volume of data, the intricate relationships between the sensor systems, and the observational nature of the data, there were no viable options for the analysis. Over a year’s worth of data had already been gathered by the researchers, which was more than enough for data mining. The analysis was able to overlook any incorrect or missing data, despite the fact that the data set had issues with missing values and noise. We gathered attributes that were pertinent to the modelling goals, such as the weather station’s eight separate environmental element monitors and the HVAC system’s instrumentation on each part. Last but not least, the IW researchers were quite supportive of the project.
Data Understanding
Nearly all of the attributes in the IW data are numerical, since all of them were collected by sensors. The data collection also includes a few nominal attributes, such as the condition of the valves. There are around 400 distinct attributes in total in the data set. Approximately 600MB of flat text files are needed to store one year’s worth of data.

At intervals of five minutes, the weather station sensors measure the following variables: temperature, relative humidity, solar radiation, barometric pressure, dew point, wind speed, wind direction and rainfall. The make-up air unit, the secondary water mullion system (for heating and cooling), the individual water mullions, the hot water system and the chilled water system all contribute temperature, flow and status data to the HVAC sensor system at 30-minute intervals. Finally, every outlet in every electrical panel has a sensor system that records the number of kilowatt-hours utilised every thirty minutes. To identify the broad characteristics of each attribute, a series of summary statistical tests were performed on the data for each of the sensor attributes.
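Such summary statistics can be produced, for example, with pandas; the file name and column names in this sketch are hypothetical stand-ins for the IW sensor export:

# Per-attribute summary statistics (count, mean, std, min, max, quartiles).
import pandas as pd

sensors = pd.read_csv("iw_sensor_data.csv")   # hypothetical flat text export
print(sensors[["temperature", "relative_humidity", "wind_speed"]].describe())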
Data Preparation

During the KDD process’s data preparation phase, a lot of problems need to be solved. Data selection, data cleaning, data creation, data integration and data formatting are the smaller phases that make up the data preparation phase.

The first step of the data preparation phase in the CRISP-DM model is to choose the data that will be modelled. First, we had to decide how much data each model should include. One year of data, recorded at thirty-minute intervals, is roughly 17,000 recordings. This is a medium-sized data set by data mining standards, so most algorithms ought to be able to handle it without any difficulty.
Attribute selection is also a part of data selection. Although most data mining algorithms are built to handle enormous numbers of records smoothly, the majority deteriorate dramatically with each additional feature added to the data set. It is therefore best to employ just the pertinent attributes when computing. However, part of knowledge discovery is seeing new and unexpected patterns in the data set; eliminating features might prevent these unexpected outcomes. The number of potential features in this case study was enormous. However, given that our primary objective was to validate current domain knowledge rather than uncover novel patterns, it made sense to consult the IW’s domain experts to determine which elements each model needed.
Data cleansing comes next in the data preparation process. How to handle missing values is the first problem with data cleansing. The IW data collection had numerous intervals with missing data values, most of which were relatively long (e.g., the sensors stopped recording for a weekend). One option was to attempt to forecast the missing values; however, most prediction techniques struggle when used to make predictions based on data that has already been predicted, or predictions made several intervals in advance. Prediction would most likely do well in a different data set where the missing values are dispersed in tiny groups. We ignored the missing values: since the missing data only made up around 10% of the overall data, there was still enough data to accurately describe the underlying processes that the data set measures.
Noise in the data, often known as poor data, is the second problem with data cleansing. One of the more challenging decisions in the KDD process is how to distinguish between variance that is necessary for creating a complete model of a problem and what is actually noise. Despite the well-known noise in sensor data, we chose not to delete any values from the data set. Because we were trying to characterise the data rather than make predictions, out-of-range values were important: it was crucial to know what variables were related to ineffective control tactics or anomalous energy use.
The construction of the data is the third step in the data preparation phase. The process of constructing the data includes altering existing attributes and deriving new ones from them. Rough sets analysis works better with nominal than with numerical attributes because it is a classification technique. Since almost all of the IW data properties are numerical, we had to determine how to discretize (divide into discrete ranges) the data. There is no theory on the ideal method for discretizing an attribute; instead, most analysts employ domain expertise or trial and error to do so. To select ranges that would “make sense” to the IW researchers, we applied our domain knowledge.
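As an illustration of this discretization step, pandas.cut can be used to map a numerical attribute onto nominal ranges; the cut points and labels below are invented examples of ranges that might “make sense” to a domain expert:

# Discretize a numerical attribute into labelled ranges.
import pandas as pd

outside_air_temp = pd.Series([12.0, 34.5, 58.0, 71.2, 88.9])
bins = [-50, 32, 60, 80, 120]                  # hypothetical cut points
labels = ["freezing", "cold", "mild", "hot"]
print(pd.cut(outside_air_temp, bins=bins, labels=labels))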
Integrating data from many sources is the following step in the data preparation process. The first problem that surfaced was how to deal with the different sensor granularities: environmental data were collected at five-minute intervals, while all other data were recorded at thirty-minute intervals. The choice was simple in this instance. Because the environmental data were quite continuous, averaging them over intervals of thirty minutes produced results that did not materially deviate from the original data. We repeated the statistical tests on the averaged environmental data to make sure that there had been no material changes; between the original and averaged data sets, we did not detect any appreciable statistical differences. The time stamps that were recorded by each sensor as it recorded a value made it simple to connect the data once it was all at the same degree of granularity.
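One way to perform this kind of averaging is with pandas resampling, as in the sketch below; the file name and timestamp column are hypothetical:

# Average five-minute readings into thirty-minute intervals.
import pandas as pd

weather = pd.read_csv("weather_station.csv", parse_dates=["timestamp"])
weather = weather.set_index("timestamp")
weather_30min = weather.resample("30min").mean()   # average each half-hour block
print(weather_30min.head())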
The formatting of the data for the particular modelling tool is the last stage in the data preparation procedure. This stage is one of the most iterative ones in the KDD process, and the data will probably need to be reorganised almost every time a new algorithm or tool (even one from the same set of techniques) is used. In this case study, before beginning the processes of data comprehension and preparation, all of the sensor files were structured as comma-separated value files. This first step made a lot of the subsequent data manipulation simpler.
Data Modeling

Choosing a modelling technique is the first step in the KDD process’s data modelling phase. The data set that was prepared in the previous phase is used to apply this modelling technique. The technical soundness of the produced data model is next evaluated. Multiple repetitions of this phase may be necessary, either by choosing a different modelling technique completely or by changing model parameters within the same modelling technique.
As with discretization, no dependable theories exist to guide users in choosing a modelling technique. Chapman et al. (2000), however, describe a set of six general data mining problem categories and define classes of data mining techniques that are frequently employed to address challenges of each type. A more detailed mapping of data mining problem types to data mining techniques will be provided by the framework discussed in Section 4. The IW case study unmistakably belongs to the category of data mining tasks called “concept description,” which “aims at a comprehensible description of ideas or classes.” Typically, rule induction and conceptual clustering techniques are used to model these issues.
The technical criteria for this problem are satisfied via conceptual clustering and rule induction. However, the majority of conceptual clustering algorithms frequently provide output that is challenging for inexperienced analysts to interpret. If-then rules are produced via rule induction algorithms, which are simple to comprehend. We believed that a rule induction method would produce the best model, in terms of technical correctness and usability, given that the IW researchers wanted to be able to continue analysing their data on their own.
We created two models for the IW researchers in this case study. These models, along with instructions on how to construct more models, will serve as a template for future model production. The first model explains how a heat exchanger’s hot water supply temperature (hxl-hwst) is impacted by the outside air temperature (oa-t). The second model explains how the amount of energy used to heat water (btu-mw) is influenced by the outside air temperature, solar radiation and wind speed.
The initial model (oa-t ⇒ hxl-hwst) was created to test the efficiency of the heat exchanger. The heat exchanger’s control rules are as follows: if oa-t is at or below 0, then hxl-hwst = 140; if oa-t is between 0 and 60, then hxl-hwst is between 90 and 140; and if oa-t is at or above 60, then hxl-hwst = 90. The model discovered that the heat exchanger occasionally did not operate in accordance with the control criteria, most frequently when the temperature was between 30 and 60 degrees. There was no discernible pattern to identify when the heat exchanger was not operating in accordance with the control rules given only the temperature of the outside air. Further investigation found a strong correlation between the heat exchanger’s activity and the time of day. We created a new model (oa-t AND time ⇒ hxl-hwst) to identify the hot water supply temperature using the time of day and the outside air temperature. The new model demonstrated that from early in the morning to mid-afternoon, it was most likely that the heat exchanger would run outside of the control rules.
The second model (oa-t AND sr AND ws ⇒ btu-mw) was created to describe how various environmental factors impact the amount of energy used to heat water. Notably, just as it is erroneous to assume that numerical correlation indicates causation, it is typically incorrect to infer a causal relationship between the left- and right-hand sides of a rule created through rough sets analysis. In this instance, we did presume a causal relationship based on the subject expertise provided by the IW researchers.

We looked at the rules generated for this model and found that they did not significantly differentiate between different values for energy consumption, but did differentiate between different values for solar radiation and wind speed. This should normally be discovered by the rough sets method, which would then remove solar radiation and wind speed from the rules. The rough sets technique was unable to do so due to the volume of noise in this data set; as a result, the analyst should carefully review the resulting rules to ensure that they make sense. To categorise energy usage, a new model was created using simply the outside air temperature (oa-t ⇒ btu-mw). Five of the eight rules we produced had an accuracy value over 77%, and all of the rules had an accuracy value over 55%, making the new model considerably clearer.
Results from the data modelling phase are assessed based on their technical merit. The modelling outputs are assessed for utility, innovation and understandability in the final two stages of the KDD process, after which the outputs are put to use. The case study’s findings, and the method utilised to produce them, primarily serve as a guide for the IW researchers’ future research. Teaching the IW researchers how to use the data preparation and modelling tools was part of the case study’s deployment phase. They will continue to utilise these tools to comprehend the patterns in their data, enhance their control techniques, and keep track of how well those tactics are working.
Summary
●● Data mining is defined in different ways by its supporters and suppliers; some experts include a wide variety of tools and techniques, up to and including statistical analysis.
●● Data mining delivers information, much like all other decision support systems, and is part of the evolution of decision support.
●● Building and using a data warehouse is known as data warehousing. Data from
several heterogeneous sources is combined to create a data warehouse, which
facilitates analytical reporting, organised and/or ad hoc searches, and decision-
making.
●● The first industries to use data warehousing were telecommunications, finance,
and retail. Government deregulation in banking and telecoms was primarily to
blame for that. Because of the increased competitiveness, retail businesses have
shifted to data warehousing.
●● Today, investment on data warehouses is still dominated by the banking and
telecoms sectors. Data warehousing accounts for up to 15% of these sectors’
technological budgets.
●● Data mining is a technique used with the Internet of Things (IoT) and artificial
intelligence (AI) to locate relevant, possibly useful, and intelligible information,
identify patterns, create knowledge graphs, find anomalies, and establish links in
massive data.
●● By fusing several data mining techniques for usage in financial technology
and cryptocurrencies, the blockchain, data sciences, sentiment analysis, and
recommender systems, its application domain has grown.
●● Machine learning and deep learning are undoubtedly two of the areas of artificial intelligence that have received the most research in recent years, and the development of deep learning has opened new possibilities for data mining and data visualisation. Finding trends in the data and presenting crucial information in a way that the average person can understand are the main objectives here.
●● Neural networks are utilised to complete several DM jobs. Since the human brain
serves as the computing model for these networks, neural networks should be
able to learn from experience and change in reaction to new information.
●● Regression is a learning process that maps a specific piece of data to a real-valued prediction variable. Simply expressed, it makes it easier to make rational decisions that will help you reach your goals.
●● Data mining depends on data, and that data must first be in a reliable state in order to be used in data mining. The ideal source of data for knowledge discovery is the aggregation of enterprise data in a data warehouse that has been adequately vetted, cleaned and integrated.
●● Steps of Knowledge Discovery Database Process: 1) Data collection, 2)
Preparing datasets, 3) Cleansing data,4) Data integration, 5) Data analysis, 6)
Data transformation, 7) Modeling or mining, 8) Validating models, 9) Knowledge
presentation, 10) Execution.
●● Estimation is the process of giving an object a continuously varying numerical
value. For instance, determining a person’s credit risk is not always a yes-or-
no decision; it is more likely to involve some form of scoring that determines a
person’s propensity to default on a loan.
●● Prediction is an attempt to categorise items based on some anticipated future
behaviour, which is a slight distinction from the preceding two jobs. Using historical
data where the classification is already known, classification and estimation can
be utilised to make predictions by creating a model (this is called training). Then,
using fresh data, the model can be used to forecast future behaviour.
●● The technique of analysing associations or correlations between data items that
show some sort of affinity between objects is known as affinity grouping. For
instance, affinity grouping could be used to assess whether customers of one
product are likely to be open to trying another.
●● The phrase “Knowledge Discovery in Databases” was coined by Gregory
Piatetsky-Shapiro in 1989. The term ‘data mining,’ on the other hand, grew
increasingly prevalent in the business and media circles. Data mining and
knowledge discovery are terms that are being used interchangeably.
●● There are five steps in the data mining process. Data is first gathered by
organisations and loaded into data warehouses. The data is then kept and
managed, either on internal servers or on the cloud. The data is accessed by
business analysts, management groups, and information technology specialists,
who then decide how to organise it. The data is next sorted by application software
according to the user’s findings, and ultimately the end-user presents the data in a
manner that is simple to communicate, such a graph or table.
●● Data mining, in technical terms, is the computer process of examining data from
many viewpoints, dimensions, and angles, and categorizing/summarizing it into
useful information.
●● Data Mining can be used on any sort of data, including data from Data
Warehouses, Transactional Databases, Relational Databases, Multimedia
Databases, Spatial Databases, Time-series Databases, and the World Wide Web.
●● Applications of data mining: a) Sales, b) Marketing, c) Manufacturing, d) Fraud
detection, e) Human resources, f) Customer service.
●● Researching, creating, and making a product available to the public are all part of
the discipline of marketing. The idea of marketing has been around for a while, yet
it keeps evolving in response to customer wants and buying patterns.
●● Without customers who remain devoted and strengthen their relationship with the company, there cannot be any business opportunities. Because of this, a business should develop and implement a clear customer service strategy.
●● CRM has two main objectives: a) Customer retention through customer
satisfaction, b) Customer development through customer insight.
●● Direct marketing campaigns are used by marketers to reach out to customers
directly by mail, the Internet, e-mail, telemarketing (phone), and other direct
channels in an effort to reduce customer attrition, increase client acquisition, and
encourage the purchase of supplementary items.
●● Tools of data mining: a) Market based analysis, b) Memory based reasoning, c)
Cluster detection, d) Link analysis.
●● A data mining model used for prediction is a neural network. The algorithms for
creating neural networks incorporate statistical artefacts of the training data to
construct a “black box” process that accepts a certain amount of inputs and
provides some predictable output. The neural network model is trained using data
examples and desired outputs.
●● Orange’s components are referred to as “widgets” because it is a component-
based software. Preprocessing and data visualisation are only a few of the widgets
available, as are algorithm evaluation and predictive modelling.
●● DataMelt is a computing and visualisation environment that provides an interactive
data analysis and visualisation structure. It was created with students, engineers,
and scientists in mind. DMelt is another name for it. DMelt is a JAVA-based multi-
platform utility.
Glossary
●● DW: Data Warehouse.
●● Summarization: Summarization involves methods for finding a compact description for a subset.
●● Classification: Classification is learning a function that maps a data item into one
of several predefined classes.
●● GUI: Graphical User Interface.
●● ART: Adaptive Resonance Theory.
●● KDD: Knowledge Discovery in Databases.
●● Estimation: Estimation is the process of giving an object a continuously varying
numerical value.
●● Prediction: Prediction is an attempt to categorise items based on some anticipated
future behaviour, which is a slight distinction from the preceding two jobs.
●● Affinity grouping: The technique of analysing associations or correlations between
data items that show some sort of affinity between objects is known as affinity
grouping.
●● MBA: Market Basket Analysis (MBA) is a technique for analysing the purchases
made by a client in a supermarket.
●● KNN: K-Nearest Neighbor.
●● MNCs: Multi-National Companies.
●● CRM: Customer Relationship Management.
●● SAS: Statistical Analysis System.
●● Rattle: Rattle is a data mining tool with a graphical user interface, created in R, a statistical programming language.
a. Relational data
b. Meta data
c. Operational data
d. None of the mentioned
a. AWS
b. Informix
c. Redbrick
d. Oracle
4. The test used in an online transactional processing environment is known as the _ _ _ _ test.
a. MICRO
b. Blackbox
c. Whitebox
d. ACID
5. The method of incremental conceptual clustering is known as_ _ _ _.
a. COBWEB
b. STING
c. OLAP
d. None of the mentioned
6. r
Multidimensional database is also known as_ _ _ _.
a. Extended DBMS
b. Extended RDBMS
c. RDBMS
d. DBMS
7. __ _ _ _ is a data transformation process.
a. Projection
b. Selection
c. Filtering
c. Association
d. Summarisation
a. Summarization
b. Clustering
c. Association
d. Regression
11. __ _ _ _is learning a function that maps a data item into one of several predefined
classes.
a. Association
b. Populating
c. Classification
d. Clustering
12. _ _ _ _ is the process of giving an object a continuously varying numerical value.
a. Association
b. Regression
c. Classification
d. Estimation
13. _ _ _ _ _involves methods for finding a compact description for a subset.
a. Summarization
b. Estimation
c. Clustering
d. Association
14. The term data mining was coined by_ _ _ in_ _ _ _year?
a. James Gosling, 1986
b. Gregory Piatesky-Shapiro, 1989
b. Data in which changes to the existing records do not cause the previous version of the records to be eliminated
c. Data that are never altered or deleted once they have been added
d. Data that are never deleted once they have been added
17. A multifield transformation does which of the following?
a. Converts data from one field to multiple fields
b. Converts data from multiple fields to one field
c. Converts data from multiple fields to multiple fields
d. All of the above
18. The time horizon in data warehouse is usually_ _ _ _.
a. 3-4 years
b. 5-6 years
c. 5-10 years
d. None of the above
19. Which of the following predicts future trends and behaviour, allowing business managers to make proactive, knowledge-driven decisions?
a. Meta data
b. Data mining
c. Data marts
d. Data warehouse
20. _ _ _ _ is the heart of the warehouse.
a. Data warehouse database servers
Exercise
1. What do you mean by data warehousing?
Learning Activities
1. You are the data analyst on the project team building a data warehouse for an
insurance company. List the possible data sources from which you will bring the
in
data into your data warehouse. State your assumptions.
Answers:
1 a 2 b
3 c 4 d
5 a 6 b
7 c 8 d
9 a 10 b
11 c 12 d
13 a 14 b
15 c 16 a
17 d 18 c
19 b 20 a
Learning Objectives:
At the end of this topic, you will be able to understand:
●● Statistical Techniques
●● Data Mining Characterisation
●● Data Mining Discrimination
●● Mining Patterns
●● Mining Associations
●● Mining Correlations
●● Classification
●● Prediction
●● Cluster Analysis
●● Outlier Analysis
Introduction
We have seen a range of databases and information repositories that can be used
for data mining. Now let’s look at the types of data patterns that can be extracted.
The kind of patterns that will be found in data mining jobs are specified using
data mining functionalities. Data mining jobs can often be divided into two groups:
descriptive and predictive. The general characteristics of the data in the database are
described by descriptive mining tasks. To produce predictions, predictive mining jobs
perform inference on the most recent data.
Users might search for a variety of patterns simultaneously because they are
unsure of what kinds of patterns in their data would be intriguing. Therefore, it’s crucial
to have a data mining system that can extract a variety of patterns in order to satisfy
various user expectations or applications. Data mining systems should also be able to
find patterns at different granularities (i.e., different levels of abstraction). Users should
be able to specify indications in data mining systems so that the search for intriguing
patterns is guided or narrowed down. Each pattern found typically has a level of assurance or “trustworthiness” attached to it, because some patterns might not hold for all of the data in the database.
Statistical data mining approaches are designed for the efficient handling of significant amounts of data that are typically multidimensional and potentially of many complex forms.

Numerous known statistical techniques exist for the study of data, particularly quantitative data. These techniques have been applied to numerous scientific records, including those from experiments in the social and economic sciences.
The intersection of statistics and machine learning (artificial intelligence) gives rise to the interdisciplinary topic of data mining (DM). It offers a technique that aids in the analysis and comprehension of the data found in databases, and it has been applied to a wide range of industries and applications. In particular, the term “data mining” (DM) originates from the analogy between searching databases for useful information and extracting valuable minerals from mountains. The concept is that the data to be analysed serves as the raw material, and we use a collection of learning algorithms to act as sleuths, looking for gold nuggets of knowledge.
In order to offer a didactic viewpoint on the data analysis process of these techniques, we present an applied vision of DM techniques. In order to find knowledge models that reveal the patterns and regularities underlying the analysed data, we employ machine learning algorithms and statistical approaches, comparing and analysing the findings. To put it another way, some authors have noted that DM is “the analysis of (often large) observational datasets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner,” or “the search for valuable information in large volumes of data,” or “the discovery of interesting, unexpected or valuable structures in large databases.” According to some authors, data mining (DM) is the “search and analysis of enormous amounts of data to uncover relevant patterns and laws.”
4.1.1 Statistical Techniques
The table below provides a classification of some DM approaches based on the type of data studied. In this regard, we outline the strategies that are possible based on the types of predictor and output variables. Unsupervised learning models are used when there is no output variable, whereas supervised learning models are used when there is an output variable to be predicted.

Artificial Neural Networks

Artificial neural networks (ANN) take their inspiration from biological neural networks in their design and operation. ANN were created on the following principles:
●● The connections between the neurons allow for the transmission of signals.
●● Each neuron applies an activation function (often non-linear) to the total input received from the connected neurons (the sum of the entries weighted according to the connection weights), resulting in an output value that serves as the entry value broadcast to the rest of the network.
The following are the core traits of ANN. The artificial neuron serves as the processing unit, collecting input from nearby neurons and calculating an output value to be relayed to the rest of the neurons. Regarding the representation of input and output information, we can find networks with continuous input and output data, networks with discrete or binary input and output data, and networks with continuous input data and discrete output data.

Input nodes, output nodes and intermediate nodes (the hidden layer) are the three primary types of nodes or layers that make up an ANN (see the figure below). The input nodes obtain the starting values of the data from each case before they are sent to the network. As input is received, the output nodes compute the output value.

Figure: The layers of an artificial neural network

The activation function and the collection of nodes the ANN uses enable it to readily express non-linear relationships, which are the most challenging to depict using multivariate techniques. The step function, identity function, sigmoid or logistic function, and hyperbolic tangent are the most used activation functions.
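Written out in Python, these common activation functions look as follows (a simple sketch for reference):

# The activation functions mentioned above, as plain Python functions.
import math

def step(x):       return 1.0 if x >= 0 else 0.0
def identity(x):   return x
def sigmoid(x):    return 1.0 / (1.0 + math.exp(-x))
def tanh(x):       return math.tanh(x)

for f in (step, identity, sigmoid, tanh):
    print(f.__name__, f(0.5))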
There are many different ANN models available. An ANN model is defined by its topology (the number of neurons and hidden layers and their connections), its learning paradigm and its learning method. Several characteristics make ANN appealing for managing data: adaptive learning through examples, robustness in handling redundant and erroneous information, and significant parallelism.

The multilayer perceptron, which was popularised by Rumelhart et al., is the technique most frequently employed in actual ANN implementations. A layer of input neurons forms the foundation of a multilayer perceptron type of ANN. Each of the neurons that make up the first hidden layer is connected to these input neurons. In turn, the neurons in each hidden layer are connected to the nodes in the following layer. One (for binary prediction) or more output neurons make up the output layer. Information is always sent from the input layer to the output layer in this type of architecture.
The multilayer perceptron is a universal function approximator. More technically, any function or continuous relationship between a collection of input variables (discrete and/or continuous) and an output variable can be learned by a “backpropagation” network with at least one hidden layer and sufficient non-linear units (discrete or continuous). Multilayer perceptron networks are versatile, adaptable and non-linear tools as a result of this characteristic. Rumelhart et al. provide a thorough explanation of the mathematical underpinnings of the multilayer perceptron architecture’s backpropagation algorithm, in both its training stage and its functional stage.
The multilayer perceptron’s value stems from its capacity to recognise almost any correlation between a set of input and output variables. However, methods derived from classical statistics, such as linear discriminant analysis, lack the ability to compute non-linear functions and, as a result, perform less well than multilayer perceptrons in such problems.

The output layer of a network used for classification typically includes as many nodes as there are classes, and the node of the output layer with the greatest value provides the network’s best guess as to the class for a given input. In the special case of two classes, a single node is frequently present in the output layer, and the node value is used to carry out the classification between the two classes by applying a cut-off point.
Decision Trees
Sequential data divisions called decision trees (DT) maximise the differences of a dependent variable (the response or output variable). They provide a clear definition for groups whose characteristics are constant but whose dependent variable varies. Nodes (input variables), branches (groups of input variable entries) and leaves or leaf nodes (values of the output variable) make up a DT. The building of a DT is based on the “divide and conquer” principle: consecutive divisions of the multivariable space are carried out by a supervised learning algorithm in order to maximise the distance between groups in each division (that is, to carry out partitions that discriminate). When all of a branch’s entries have the same value in the output variable (a pure leaf node), the division process is complete and the full, maximally specified model is produced. The importance of the input factors in the output categorisation decreases as they move deeper down the tree diagram (and the less generalisation they allow, due to the decrease in the number of entries in the descending branches).
The tree can be pruned by removing the branches with few or hardly meaningful entries in order to prevent overfitting the model. As a result, if we begin with the whole model, pruning the tree will increase the model’s capacity for generalisation (as measured on test data), but at the expense of decreasing the level of purity of its leaves.
The following elements are determined by the learning algorithm:
●● Compatibility with the type of variables, that is, the characteristics of the input and output variables.
●● The division criteria used to measure the distance between groups in each division.
●● The number of branches that each node can be divided into, which may be constrained.
●● The pre-pruning and post-pruning settings, which include the minimal number of entries per node or branch, the critical value of the division, and the performance difference between the extended and reduced tree. While post-pruning applies the pruning parameters to the entire tree, pre-pruning entails utilising halting criteria during the tree’s formation.
The most popular algorithms are C4.5/C5.0, QUEST (Quick, Unbiased, Efficient Statistical Tree), CHAID (Chi-Squared Automatic Interaction Detection), and Classification and Regression Trees (CART).
Breiman et al. (1984) created the CART algorithm, which creates binary decision trees with each node precisely divided into two branches. This groups many categories in one branch if the input variable is nominal and has more than two categories. Even if the input variable is continuous or nominal, it still creates two branches, assigning a set of values to each one that is “less than or equal to” or “greater” than a specific value. The model can accept nominal, ordinal and continuous input data thanks to the CART algorithm. The model’s output variable can also be nominal, ordinal or continuous.
Initial versions of the CHAID algorithm (Kass, 1980) were only intended to handle categorical variables; it is now able to handle continuous variables as well as nominal and ordinal categorical output data. The tree construction process is based on the calculation of the significance of a statistical contrast as a criterion, in order to establish clusters of similar (statistically homogeneous) values with respect to the output variable and to keep separate all the values that ultimately turn out to be heterogeneous (distinct). Similar values are merged into one category and make up a portion of one tree branch. The statistical test that is applied depends on the measurement level of the output variable: the F test is applied if the output variable is continuous, and the Chi-square test is used if it is categorical.
The CHAID algorithm differs from the CART algorithm in that it permits the partition
of each node into more than one branch, leading to the creation of significantly wider
trees than those produced by binary development techniques.
The QUEST algorithm (Loh & Shih, 1997) may be applied when the output is nominal-categorical (it allows the creation of classification trees). The tree construction process is founded on calculating the significance of a statistical contrast. If an input variable is nominal categorical, the critical level of a Pearson Chi-square independence contrast between the input variable and the output variable is calculated for each input variable. The F test is employed if the input variable is ordinal or continuous.
The only output variables accepted by the C5.0 algorithm (Quinlan, 1997) are categorical ones. Input variables may be continuous or categorical. This algorithm evolved from algorithm C4.5 (Quinlan, 1993), created by the same author, which in turn has the ID3 algorithm at its core (Quinlan, 1986). The ID3 algorithm bases its choice of the best attribute on the idea of information gain.
The descriptive nature of DT is one of its most notable benefits, since it gives us access to the rules that the model used to make its predictions, making it simple to comprehend and analyse the judgments made by the model (an aspect that is not available in other machine learning techniques, such as ANN). DT allows the graphic representation of a set of rules pertaining to the choice that must be made in assigning an output value to a specific entry, providing a clear, accessible explanation of the outcomes.
The decision rules offered by a tree model also have predictive value (rather than being merely descriptive) once their correctness has been evaluated on data (test data) separate from those used in the model's construction (training data).
k-Nearest Neighbor
Humans rely on memories of previous encounters with similar situations when confronted with a new one. The k-Nearest Neighbor (kNN) method is founded on this idea of similarity. The method builds a classifier without assuming anything about the structure of the function connecting the dependent variable to the independent variables. The goal is to dynamically find the k training observations that are most similar to a fresh observation that needs to be classified; the observation is then classified into a class using these k related (neighbouring) observations (see the figure below).
Figure: Graphical representation of k-NN classification
In more detail, k-NN searches the training data for observations that are comparable to or near the observation that needs to be categorised, based on the values of the independent variables (attributes). It then allocates a class to the new observation based on the classes of these neighbouring observations, using the majority vote of the neighbours: it counts the number of neighbours in each class and assigns the new case to the class of the majority.
Despite the method's "naive" appearance, it can compete with other, more advanced classification techniques, and k-NN is very adaptable in situations where a linear model is too rigid. The method's performance for data of a given size depends on k as well as on the measurement used to identify the nearest observations. As a result, when using the technique, we must decide how many neighbours to take into consideration (the k value), how to measure the distance, how to combine the information from the neighbouring observations, and whether or not each neighbour should be given the same weight.
ni
unknown example is categorised in the training data in the class of its closest neighbour
if we set k=1.
if we choose a tiny k value, the categorization may be too influenced by outlier values
or uncommon observations. On the other hand, picking a k value that is not extremely
small will tend to tamp down any peculiar behaviour that was picked up from the
training set. However, if we pick a k value that is too high, we will miss some locally
m
interesting behaviour.
A cross-validation process can help solve this issue: we can test a variety of k values on various training sets selected at random, then select the k value that yields the lowest classification error.
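The selection of k by cross-validation can be sketched as follows; this is only an illustrative example, and the synthetic data set and grid of k values are our own choices.

# A minimal sketch of choosing k for k-NN by cross-validation (scikit-learn).
# The synthetic data below is generated purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Standardising first addresses the "measurement units" issue of Euclidean distance
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # the k with the lowest cross-validated error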
There are three issues with the Euclidean distance:
◌◌ The distance depends on the measurement units used for the variables.
◌◌ The variability of the different variables is not taken into account.
◌◌ The relationship (correlation) between the variables is disregarded.
One option is to use a measure known as the statistical (or Mahalanobis) distance. Among the benefits of the k-NN technique is the fact that the training set is stored in its entirety as a description of the distribution of objects in space, rather than simplifying that distribution into a set of summary features.
The k-NN approach is also simple to understand, simple to use, and practical. For each new case to be classified, it can build a different approximation to the target function, which is helpful when the target function is extremely complex but can be described by a collection of less complex local approximations. Finally, this strategy constructs a classifier without making any assumptions about the structure of the function connecting the dependent variable (the classification variable) with the independent variables (attributes).
Naive Bayes
The Bayes rule or formula, which is based on Bayes' theorem, is used in Bayesian approaches to combine data from the sample with expert opinion (prior probability) in order to obtain a posterior probability.
In particular, the Naive Bayes technique (NB) is one of the most popular classification techniques and one of the most powerful due to its straightforward computing procedure (Hand & Yu, 2001). Because it is based on Bayes' theorem, it is able to forecast the likelihood that a given example will belong to a particular class. It is referred to as a "naive" classifier because of the assumption known as class conditional independence, which holds that the effect of one attribute's value on a given class is independent of the values of the other attributes; this is what keeps its computations simple.
This classifier predicts that a case with attribute vector X (the set of values of the case in the predictor variables) belongs to the class Ci that has the highest posterior probability conditioned on X. By Bayes' theorem, this posterior probability is P(Ci|X) = P(X|Ci)P(Ci)/P(X); since P(X) is constant for all classes, the classification procedure only requires us to maximise P(X|Ci)P(Ci).
P(Ci) can be estimated from the number of instances of each class Ci in the training data. To reduce the computational cost of calculating P(X|Ci) for all possible attribute values xk (predictor variables), the classifier makes the "naive" assumption that the attributes used to define X are conditionally independent of one another given class Ci. Under this assumption, P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xm|Ci), where m denotes the number of predictor variables involved in the classification.
When the attributes are conditionally independent given the class, as is the case in many studies comparing classification methods (e.g. Michie et al., 1994), NB performs on par with, and sometimes even better than, ANN and DT. Recent theoretical studies have demonstrated why NB is so resilient.
These properties are what make NB classifiers appealing. Additionally, NB can deal with unknown or missing values with ease. However, it has some significant flaws. The first is that the technique assigns zero probability to a new case whose category in a predictor variable does not appear in the training data. And even though we achieve good performance when the aim is to classify or rank cases according to their probability of belonging to a certain class, the method offers very biased results when the aim is to estimate that probability itself.
Beyond these limitations, the NB technique is simple to apply, it adapts to the data, and it is simple to interpret; moreover, just one pass over the data is necessary. Its popularity has grown significantly, notably in the machine learning literature, thanks to its simplicity, parsimony, and interpretability.
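The posterior probabilities P(Ci|X) described above can be computed with an off-the-shelf Naive Bayes implementation; the sketch below uses scikit-learn's GaussianNB on an invented toy data set and is only an illustration of the idea, not the textbook's own procedure.

# A minimal sketch of Naive Bayes classification with scikit-learn (illustrative only).
from sklearn.naive_bayes import GaussianNB

X = [[25, 30], [47, 85], [35, 60], [52, 120], [23, 20], [40, 95]]  # invented data
y = ["budget", "big", "budget", "big", "budget", "big"]

nb = GaussianNB()
nb.fit(X, y)
# predict_proba returns P(Ci | X) for each class, i.e. the posterior probabilities
print(nb.classes_)
print(nb.predict_proba([[30, 70]]))
print(nb.predict([[30, 70]]))  # the class Ci maximising P(X|Ci)P(Ci)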
Logistic Regression
While linear regression models the link between a continuous response variable and a group of predictor variables, logistic regression (LR) is principally used when the response variable is categorical (with values such as yes/no or 0/1). Thus, using LR approaches, a new observation whose group is unknown can be assigned to one of the groups depending on the values of the predictor variables.
As with linear regression, a linear combination of the predictor variables is formed; the linear combination is then converted into the interval [0, 1] via the logistic function (Ye, 2003). Consequently, the dependent variable is transformed into a continuous value that is a function of the likelihood that the event of interest will occur.
In LR there are two steps: first, we estimate the likelihood that each instance will belong to each group; in the second phase, we use a cut-off point in conjunction with these probabilities to place each example in one of the groups. The model's parameters are calculated iteratively using the maximum likelihood technique. Consult Larose for a more thorough explanation of the LR method.
Finally, it is important to note that LR can generate reliable results with a small amount of data. On the other hand, classical linear regression remains appealing because it is so widely established, simple to use, and universally understood. Additionally, LR exhibits behaviour similar to that of a diagnostic test.
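The two-step procedure (estimate probabilities, then apply a cut-off) can be illustrated with scikit-learn's LogisticRegression; the tiny data set below is invented for the sketch.

# A minimal sketch of logistic regression as a two-step classifier (illustrative only).
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # invented predictor values
y = [0, 0, 0, 1, 1, 1]                            # binary outcome (e.g. no/yes)

lr = LogisticRegression()
lr.fit(X, y)

# Step 1: estimate the probability of belonging to each group
probs = lr.predict_proba([[3.5]])
# Step 2: apply a cut-off (0.5 by default) to assign the case to a group
label = lr.predict([[3.5]])
print(probs, label)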
Data mining regression techniques such as linear regression use a straight line to establish a connection between the target variable and one or more independent variables. The linear regression equation can be written as:
Y = a + b*X + e
Where:
a represents the intercept.
b represents the slope (regression coefficient).
e stands for the error term.
When X is made up of more than one variable, the model becomes a multiple linear regression.
Linear regression determines the best-fitting line using the least squares method, which minimises the sum of the squares of the deviations from each data point to the regression line. Because all deviations are squared, the positive and negative deviations do not cancel each other out.
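A least-squares fit of Y = a + b*X + e can be carried out in a few lines with NumPy; the numbers below are invented for the sketch.

# A minimal sketch of least-squares fitting for Y = a + b*X + e using NumPy (illustrative).
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)       # invented data
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit minimises the sum of squared deviations from the fitted line
b, a = np.polyfit(X, Y, deg=1)                   # slope b and intercept a
residuals = Y - (a + b * X)                      # the error term e for each point
print(a, b, residuals)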
There are two types of data mining: descriptive data mining and predictive data mining. Descriptive data mining summarises the data set's interesting general qualities in a succinct way. Predictive data mining analyses the existing data in order to build one or more models and make predictions about the behaviour of fresh data sets.
Large volumes of data are frequently stored in great detail in databases. However, users often prefer to view collections of summarised facts expressed in clear, evocative language. Such data descriptions may give a broad overview of a data class or set it apart from related classes. Users also value being able to describe data sets easily and flexibly at various granularities and from various perspectives. Concept description, a type of descriptive data mining, is a crucial part of data mining.
Data entries may be connected to concepts or classes. For instance, classes of items for sale at the AllElectronics store include computers and printers, and concepts of customers include big spenders and budget spenders. Individual classes and concepts can be described in succinct yet precise ways. Such class/concept descriptions can be derived using:
●● Data characterization, which involves a generalised summary of the data for the class being studied (often referred to as the target class);
●● Data discrimination, which involves contrasting the target class with one or more comparison classes (often referred to as the contrasting classes); or
●● Both data characterization and discrimination.
4.2.2 Data Mining Characterisation
Concept description is the most basic form of descriptive data mining. A concept typically describes a grouping of data, such as frequent_buyers or graduate_students. As a data mining task, concept description is not just an enumeration of the data; rather, it produces descriptions for characterising and comparing the data. When the concept to be described refers to a class of objects, it is sometimes called a class description. Characterization provides a brief and succinct summary of the given collection of data, while concept or class comparison (also known as discrimination) provides descriptions that discriminate between two or more collections of data.
For example, a characterization of a store's high-spending customers might be produced as a result, including information such as their age range of 40 to 50, employment status, and credit standing. The customer relationship manager should be able to drill down on any dimension, such as occupation, to view these customers according to their type of employment, using the data mining system.
For instance, a user might want to contrast the general features of software items whose sales rose by 10% last year with those of products whose sales fell by at least 30% during the same period. The techniques used for data characterization and discrimination are similar.
Customers who frequently (e.g., more than twice a month) and infrequently (e.g., less than three times a year) purchase computer products may be compared by an AllElectronics customer relationship manager. The resulting description offers a general comparative profile of these customers, showing, for example, that 60% of those who infrequently buy computer products are either seniors or youths with no university degree, whereas 80% of those who frequently buy computer products are between the ages of 20 and 40 and have a university education. Expanding or narrowing a dimension, such as occupation, or including a new dimension, such as income level, may help uncover even more distinguishing characteristics between the two classes.
Given the volume of data contained in databases, it is useful to be able to describe data in brief and succinct terms at broad levels of abstraction. With the ABCompany database, for instance, sales managers may prefer to view the data generalised to higher levels, such as aggregated by customer groups according to geographic regions, frequency of purchases per group, and customer income, rather than looking at individual customer transactions. Concept description is therefore similar to multidimensional, multilevel data generalisation in data warehouses. The following are the key distinctions between online analytical processing (OLAP) and concept description in huge databases.
Complex Data Types and Aggregation: The foundation of data warehouses and OLAP technologies is a multidimensional data model, which presents data as a data cube made up of dimensions (or attributes) and measures. However, the aggregate measures of today's OLAP systems (such as count, total, and average) only apply to numeric data. In contrast, the database attributes used for concept description can be of several data types, such as numeric, nonnumeric, spatial, text, or image.
User Control versus Automation: OLAP is largely a user-controlled process, relying on operations such as drill-down, roll-up, slicing, and dicing; users may need to specify a lengthy sequence of OLAP operations in order to find a good description of the data. Concept description in data mining, in contrast, aims for a more automated approach that helps decide which dimensions (or attributes) should be included in the analysis and the extent to which the supplied data set should be generalised in order to produce an engaging summary of the data.
Data warehousing and OLAP technologies have recently advanced to handle more complicated data types and to incorporate additional knowledge discovery processes. Additional descriptive data mining elements are expected to be incorporated into OLAP systems as this technology develops.
The concepts discussed here lay the groundwork for the multiple-level characterisation and comparison functional modules, which make up the bulk of data mining. You will also look at methods for presenting concept descriptions in various formats, such as tables, charts, graphs, and rules.
Data Generalization and Summarization-Based Characterization
Databases frequently store detailed information about data and objects at primitive concept levels. For instance, the item relation in a sales database may contain attributes describing low-level item information such as item_ID, name, brand, category, supplier, place_made, and price. It is helpful to be able to condense a vast collection of data and present it at a high conceptual level. For instance, summarising a huge number of items related to Christmas season sales into a basic overview can be quite beneficial for sales and marketing managers. This requires data generalisation, a crucial component of data mining. The following approaches can be used for the effective and flexible generalisation of huge data sets:
Attribute-Oriented Induction
The attribute-oriented induction approach was first proposed in 1989, a few years before the data cube approach was introduced. The data cube approach is essentially based on off-line pre-computation; there is, though, no intrinsic barrier between the two approaches: while off-line pre-computation of the multidimensional space can speed up attribute-oriented induction as well, some aggregations in the data cube can be computed online.
Mining frequent patterns is a central task in knowledge discovery in databases (KDD), a technique that has applications across many industries. Here, we use data from retail transactions to demonstrate how to give businesses access to information about customers' purchasing patterns. This may also be incorporated into a decision support system.
Patterns that commonly show up in a data set include itemsets, subsequences, and substructures. A frequent itemset is, for instance, a group of items that regularly appear together in a transaction data set, such as milk and bread.
Structural forms such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences, are referred to as substructures. A substructure that occurs frequently is called a frequent structured pattern. Identifying common patterns is crucial for mining associations, correlations, and a variety of other interesting relationships within data.
Market basket analysis is an illustration of frequent itemset mining in action. By identifying relationships between the many goods that customers place in their "shopping baskets", this technique analyses customers' purchasing behaviour (Figure: Market basket analysis). Identifying these relationships can assist merchants in creating marketing plans by revealing which products customers typically buy in tandem. How likely is it that consumers who are purchasing milk will also purchase bread (and what kind of bread) during the same trip to the store? By assisting merchants with targeted marketing and shelf space planning, this information can result in an increase in sales.
Which categories or combinations of goods are customers most likely to buy during a particular store visit? The retail data of consumer transactions at your store may be used for market basket analysis to address this question. The results can then be used to develop marketing or advertising plans or to design a new catalogue.
Association rules are one way to represent such patterns. The following association rule, for instance, illustrates the fact that consumers who buy laptops also frequently purchase antivirus software at the same time:
buys(X, "laptop") ⇒ buys(X, "antivirus software") [support = 2%, confidence = 60%]
Rule support and rule confidence are two measures of rule interestingness; they reflect the usefulness and the certainty of discovered rules, respectively. A support of 2% for the above rule indicates that laptops and antivirus software are bought together in 2% of the transactions being examined. A confidence of 60% indicates that 60% of the customers who bought a laptop also purchased the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or subject matter experts may set these thresholds.
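Support and confidence are simple to compute directly from raw transactions; the sketch below, with invented transactions and helper names of our own, shows the idea.

# A small sketch of computing rule support and confidence from raw transactions.
# The transactions and items are invented for illustration.
transactions = [
    {"laptop", "antivirus", "mouse"},
    {"laptop", "mouse"},
    {"laptop", "antivirus"},
    {"bread", "milk"},
    {"laptop"},
]

def support(itemset, db):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule: laptop => antivirus
print(support({"laptop", "antivirus"}, transactions))      # rule support
print(confidence({"laptop"}, {"antivirus"}, transactions))  # rule confidence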
Frequent Itemsets, Closed Itemsets, and Association Rules
Strong rules are those that satisfy both a minimum support threshold and a minimum confidence threshold. The occurrence frequency of an itemset is the number of transactions that include the itemset; this is also known as the frequency, support count, or count of the itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets must occur at least as frequently as a specified minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy the minimum support and minimum confidence requirements.
An itemset X is closed in a data set D if there exists no proper super-itemset Y with the same support count as X in D. If an itemset X is both closed and frequent in D, then X is a closed frequent itemset in D. An itemset X is a maximal frequent itemset (or max-itemset) in D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
Assume that there are just two transactions in a transaction database: {a1, a2, ..., a100} and {a1, a2, ..., a50}. Let the minimum support count threshold be min_sup = 1. We find two closed frequent itemsets with their corresponding support counts, namely C = {{a1, a2, ..., a100} : 1; {a1, a2, ..., a50} : 2}. The only maximal frequent itemset is M = {{a1, a2, ..., a100} : 1}. We cannot include {a1, a2, ..., a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, ..., a100}. Compare this with the full enumeration, where there would be far too many frequent itemsets to list: 2^100 − 1 of them.
The set of closed frequent itemsets contains complete information about the frequent itemsets. For instance, from C we can deduce (1) {a2, a45 : 2}, because {a2, a45} is a sub-itemset of the itemset {a1, a2, ..., a50 : 2}; and (2) {a8, a55 : 1}, because {a8, a55} is a sub-itemset not of the previous itemset but of the itemset {a1, a2, ..., a100 : 1}. However, based only on the maximal frequent itemset, we can only state that the two itemsets ({a2, a45} and {a8, a55}) are frequent, but not their precise support counts.
Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
Apriori employs an iterative, level-wise search in which frequent k-itemsets are used to explore frequent (k+1)-itemsets: the set of frequent 1-itemsets, L1, is found first; L1 is used to find L2, the set of frequent 2-itemsets, which is then used to find L3, and so on. Each Lk requires one full scan of the database.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
1. The join step: To find Lk, a set of candidate k-itemsets is created by joining Lk-1 with itself. This set of candidates is referred to as Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second-to-last item in l1). Apriori assumes that the items within a transaction or itemset are sorted in lexicographic order; members l1 and l2 of Lk-1 are joined if their first k-2 items are identical. Simply put, the condition l1[k-1] < l2[k-1] ensures that no duplicates are produced. Joining l1 and l2 creates the itemset {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}.
2. The prune step: Ck is a superset of Lk, so all frequent k-itemsets are present in Ck, even though its members may or may not be frequent. Lk could be determined by scanning the database to count each candidate in Ck (all candidates with a count no less than the minimum support count are frequent by definition and therefore belong to Lk). However, Ck can be very large, so this could require a lot of work. The Apriori property is used as follows to reduce the size of Ck: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Therefore, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and can be removed from Ck. By keeping a hash tree of all frequent itemsets, this subset testing can be done quickly.
Transactional Data for an AllElectronics Branch
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Example: Apriori.
Let's examine a specific illustration based on the AllElectronics transaction data shown in the table above, which contains nine transactions (D).
1. In the algorithm's first iteration, each item is a member of the set of candidate 1-itemsets, C1. To count the occurrences of each item, the algorithm simply scans all of the transactions.
2. Assume that the required minimum support count is 2, i.e. min_sup = 2. (Since we are using a support count, we are talking about absolute support here; the corresponding relative support is 2/9 = 22%.) Then L1, the set of frequent 1-itemsets, can be found. It consists of the candidate 1-itemsets that meet minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To find the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to create a candidate set of 2-itemsets, C2. C2 consists of (|L1| choose 2) 2-itemsets. As every subset of the candidates is also frequent, no candidates are eliminated from C2 during the prune step.
4. Then, as shown in the middle table of the second row in Figure 6.2, the transactions in D are scanned and the support count of each candidate 2-itemset in C2 is tallied.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of the candidate 2-itemsets in C2 that have minimum support.
6. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 6.3. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
(a) Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
(b) Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Do any of the candidates have a subset that is not frequent?
(Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property.)
The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of {I1, I2, I3} are members of L2, so keep {I1, I2, I3} in C3.
The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of {I1, I2, I5} are members of L2, so keep {I1, I2, I5} in C3.
The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}. Since {I3, I5} is not a member of L2, it is not frequent, so remove {I1, I3, I5} from C3.
The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}. Since {I3, I4} is not a member of L2, it is not frequent, so remove {I2, I3, I4} from C3.
The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}. Since {I3, I5} is not a member of L2, it is not frequent, so remove {I2, I3, I5} from C3.
The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}. Since {I4, I5} is not a member of L2, it is not frequent, so remove {I2, I4, I5} from C3.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
Figure: Generation of the candidate itemsets
7. To find L3, which consists of the candidate 3-itemsets in C3 with minimum support, the transactions in D are examined (see the figure above).
8. The algorithm generates a candidate set of 4-itemsets, C4, using L3 ⋈ L3. Although the join produces the itemset {I1, I2, I3, I5}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. As a result, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach.
Input:
◌◌ D, a database of transactions;
◌◌ min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Figure F2: Apriori algorithm
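As an illustrative sketch (not the textbook's own code), the level-wise join-prune-scan procedure outlined above can be written in Python as follows; the function and variable names are our own, and the example transactions are patterned on the AllElectronics table above.

# A compact sketch of the level-wise Apriori procedure described above (illustrative only).
from itertools import combinations

def apriori(transactions, min_sup):
    # L1: count individual items and keep those meeting the minimum support count
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset of a candidate must be frequent
                    if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan the database once to count the surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            tset = set(t)
            for c in candidates:
                if c <= tset:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Transactions patterned on the AllElectronics example above, min_sup = 2
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)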
4.3.2 Mining Associations
Once the frequent itemsets have been found, strong association rules can be generated from them directly. The confidence of a rule A ⇒ B can be expressed in terms of itemset support counts as confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A), where support_count(A ∪ B) is the number of transactions containing both itemsets A and B, and support_count(A) is the number of transactions containing itemset A. A rule is output if it satisfies the minimum confidence threshold. Each such rule automatically satisfies minimum support because it is generated from frequent itemsets. Frequent itemsets and their counts can be stored in hash tables so that they can be accessed quickly. The Apriori algorithm has undergone numerous modifications, all of which aim to increase its effectiveness. Here is a summary of a few of these variations:
Figure: H2 hash table for candidate 2-itemsets. This hash table was created by scanning the transactions of the table (Transactional Data for an AllElectronics Branch) while determining L1. If the minimum support count is, say, 3, then the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and should not be included in C2.
Hash-based technique (hashing itemsets into buckets): For k > 1, a hash-based method can be used to reduce the size of the candidate k-itemsets, Ck. For instance, when scanning each transaction in the database to produce the frequent 1-itemsets, L1, we can also generate all the 2-itemsets of each transaction, hash (i.e., map) them into the various buckets of a hash table structure, and increase the corresponding bucket counts (see the figure above). A 2-itemset whose corresponding bucket count in the hash table is less than the support threshold cannot be frequent and should be eliminated from the candidate set. Such a hash-based method can significantly reduce the number of candidate k-itemsets that need to be examined (particularly when k = 2).
Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, in later database scans for j-itemsets, where j > k, such a transaction can be flagged or excluded from further consideration.
Partitioning (partitioning the data to find candidate itemsets): To mine the frequent itemsets, a partitioning strategy can be used that needs only two database scans (Figure: Mining by partitioning the data). It has two phases. In phase I, the algorithm divides the transactions of D into n non-overlapping partitions. If the minimum relative support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition. For each partition, all of the local frequent itemsets (the frequent itemsets within the partition) are found.
A local frequent itemset may or may not be frequent with respect to the entire database, D. However, any itemset that is frequent in D must be frequent with respect to at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D; together, the frequent itemsets from all partitions form the global candidate itemsets with respect to D. In phase II, a second scan of D is performed, in which the actual support of each candidate is evaluated to determine the global frequent itemsets. The number of partitions and their sizes are chosen so that each partition fits into main memory and is read only once during each phase.
Sampling (mining on a subset of the given data): The basic idea is to pick a random sample S of the given data D and then search for frequent itemsets in S rather than D. In doing so, we trade some accuracy for efficiency. Since the sample S can fit in main memory, only one scan of the transactions in S is needed in total. Because we are searching for frequent itemsets in S rather than D, we might miss some of the global frequent itemsets. To lessen the likelihood of this happening, we find the frequent itemsets local to S using a support threshold lower than the minimum support (denoted LS). The rest of the database is then used to compute the actual frequencies of each itemset in LS, and a technique is used to check whether all global frequent itemsets are contained in LS. If LS truly contains all of the frequent itemsets in D, only one scan of D is necessary; if not, a second pass can be made to locate the frequent itemsets that were missed in the first. The sampling strategy is especially advantageous when efficiency is crucial, as in computationally intensive applications that must be run frequently.
Dynamic itemset counting (adding candidate itemsets at different points during a scan): This technique allows the addition of new candidate itemsets at any point in the analysis process. The approach uses the count-so-far as a lower bound of the actual count: if the count-so-far already exceeds the minimum support, the itemset is added to the frequent itemset collection and can be used to produce longer candidates. Compared with Apriori, this results in fewer database scans to discover all the frequent itemsets.
In many cases, the Apriori candidate generate-and-test method achieves a good performance gain by drastically reducing the size of the candidate sets. However, it can suffer from two significant costs:
◌◌ It may still need to generate a huge number of candidate sets. For instance, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to produce more than 10^7 candidate 2-itemsets.
◌◌ It may need to repeatedly scan the entire database and check a large number of candidates by pattern matching; scanning the whole database again and again is costly.
A frequent-pattern growth method, known as FP-growth, is an interesting approach in this regard; it uses the following divide-and-conquer tactic. First, it compresses the database of frequent items into a frequent pattern tree, or FP-tree, while preserving the itemset association information. The compressed database is then divided into a set of conditional databases (a special kind of projected database), each associated with one frequent item or "pattern fragment", and each such database is mined separately. For each "pattern fragment", only its associated data sets need to be examined. As a result, this strategy may substantially reduce the size of the data sets to be searched, along with the "growth" of the patterns under investigation.
Mining Frequent Itemsets Using the Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions represented in the TID-itemset format (i.e., "TID : itemset"), where TID is a transaction ID and itemset is the set of items purchased in transaction TID. This is known as the horizontal data format. Alternatively, data can be presented in the item-TID_set format.
Algorithm: FP-growth. Mine frequent itemsets using an FP-tree by pattern fragment growth.
Input:
◌◌ D, a transaction database;
◌◌ min_sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed as follows. Scan D once to collect the set of frequent items and their support counts, and sort them in descending order of support count as the list L. Create the root of the FP-tree, labelled "null". Then, for each transaction Trans in D: select the frequent items in Trans and sort them according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which acts as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; otherwise, create a new node N with a count of 1, its parent link set to T, and its node-link set to other nodes with the same item-name. If P is not empty, call insert_tree(P, N) recursively.
2. The FP-tree is mined by calling FP growth(FP tree, null), which is implemented as
follows.
procedure FP growth(Tree, α)
(1) if Tree contains a single path P then
(2) for each combination (denoted as β) of the nodes in the path P
(3) generate pattern β ∪ α with support_count = minimum support count of nodes in β;
(4) else for each ai in the header of Tree {
(5) generate pattern β = ai ∪ α with support_count = ai.support_count;
(6) construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
(7) if Treeβ ≠ ∅ then
(8) call FP_growth(Treeβ, β); }
In the vertical data format, data are represented as {item : TID_set}, where item is the name of an item and TID_set is the set of identifiers of the transactions that contain the item.
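A small sketch of the idea (with invented transaction IDs) shows how horizontal data can be converted to the vertical format and how intersecting TID sets gives itemset support counts:

# Converting horizontal transactions to the vertical {item: TID_set} format (illustrative only).
from collections import defaultdict

horizontal = {"T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"}}

vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# Support count of {I1, I2} is the size of the intersection of their TID sets
print(vertical["I1"] & vertical["I2"])          # {'T100'}
print(len(vertical["I1"] & vertical["I2"]))     # 1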
A support-confidence framework is used by the majority of association rule mining algorithms. Although minimum support and confidence thresholds help to weed out or exclude a significant number of uninteresting rules, many of the rules generated are still not interesting to users. Unfortunately, this is particularly true when mining at low support thresholds or mining for long patterns. This has been a significant barrier to the successful application of association rule mining.
A rule's interestingness can be assessed subjectively or objectively. Ultimately, only the user can determine whether a specific rule is interesting, and since this judgement is subjective, it may vary from user to user. Objective interestingness measures, however, based on the statistics behind the data, can be used to weed out uninteresting rules that would otherwise be presented to the user.
Example:
A misleading "strong" association rule. Suppose our research involves analysing AllElectronics purchases of computer games and videos. Let "game" and "video" denote transactions that contain computer games and videos, respectively. Of the 10,000 transactions investigated, 6000 of the customer purchases included computer games, 7500 included videos, and 4000 included both. Assume that the data are run through a data mining tool to find association rules, with a minimum support of, say, 30% and a minimum confidence of 60%. The following association rule is discovered:
buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%] (Rule R1)
Since Rule R1 meets the minimum support and minimum confidence requirements, with a support of 4000/10,000 = 40% and a confidence of 4000/6000 = 66%, it would be reported as a strong association rule. Rule R1 is misleading, though, because the overall probability of buying videos is 75%, which is higher than 66%. In fact, computer games and videos are negatively correlated: buying one of these products actually decreases the likelihood of buying the other. Without fully understanding this issue, we could easily make bad business decisions based on Rule R1.
This example illustrates that the confidence of a rule A ⇒ B can be misleading: it does not reflect the actual strength (or lack thereof) of the correlation and implication between A and B. Alternatives to the support-confidence framework can therefore be helpful in mining interesting data correlations.
4.3.3 Mining Correlations
As we have seen so far, the support and confidence measures are insufficient for filtering out uninteresting association rules. To address this problem, the support-confidence framework for association rules can be supplemented with a correlation measure. This leads to correlation rules of the form:
A ⇒ B [support, confidence, correlation]. (Rule R2)
In other words, a correlation rule is evaluated not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures to choose from. In this subsection, we examine several correlation measures to determine which would be best for mining large data sets.
Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated. This definition can easily be extended to more than two itemsets. The lift between the occurrences of A and B is computed as:
lift(A, B) = P(A ∪ B) / (P(A)P(B))    (Equation Eq1)
If the resulting value of Equation Eq1 is less than 1, the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one is likely to lead to the absence of the other. If the value is greater than 1, A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the value equals 1, A and B are independent and there is no correlation between them.
Equation Eq1 is also known as the lift of the association (or correlation) rule A ⇒ B, and it is equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B). In other words, it assesses the degree to which the occurrence of one "lifts" the occurrence of the other. Returning to the market data above, the probability of buying a game is P(game) = 0.60, the probability of buying a video is P(video) = 0.75, and the probability of buying both is P(game, video) = 0.40. By the equation, the lift of the association rule is P(game, video)/(P(game) × P(video)) = 0.40/(0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of "game" and "video".
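The same calculation, written out in a few lines of Python for illustration:

# A tiny sketch reproducing the lift calculation from the game/video example.
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))   # 0.89 -> less than 1, so the items are negatively correlated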
Game and Video Purchases (2 × 2 contingency table)
            game      no game     Σrow
video       4000      3500        7500
no video    2000      500         2500
Σcol        6000      4000        10,000
The second correlation measure that we study is the χ2 (chi-square) measure.
Game and Video Purchases with Expected Values (in parentheses)
            game            no game         Σrow
video       4000 (4500)     3500 (3000)     7500
no video    2000 (1500)     500 (1000)      2500
Σcol        6000            4000            10,000
Because the observed value of the slot (game, video), 4000, is less than its expected value of 4500, and the χ2 statistic indicates a significant deviation from independence, buying a game and buying a video are shown to be negatively correlated. Are there measures, other than lift and χ2, that can disclose interesting pattern relationships? Yes, there are four such measures: all confidence, max confidence, Kulczynski, and cosine.
All confidence: The all_confidence measure of two itemsets A and B is defined as
all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)},
where max{sup(A), sup(B)} is the maximum support of the itemsets A and B.
Max confidence: Given two itemsets A and B, the max_confidence measure of A and B is defined as
max_conf(A, B) = max{P(A|B), P(B|A)}.
The max_conf measure is the higher of the confidences of the two association rules "A ⇒ B" and "B ⇒ A".
Kulczynski: Given two itemsets A and B, the Kulczynski measure of A and B is defined as
Kulc(A, B) = (1/2)(P(A|B) + P(B|A)).
It can be thought of as the average of two confidence measures: the two conditional probabilities that make up this average are the probability of itemset B given itemset A and the probability of itemset A given itemset B.
Cosine: Finally, the cosine measure of two itemsets A and B is defined as
cosine(A, B) = P(A ∪ B) / sqrt(P(A) × P(B)) = sup(A ∪ B) / sqrt(sup(A) × sup(B)).
The cosine measure can be viewed as a harmonised lift measure: the two formulas are similar, except that cosine takes the square root of the product of the probabilities of A and B. Because of the square root, the cosine value is influenced only by the supports of A, B, and A ∪ B, and not by the total number of transactions. This is a significant distinction.
Each of these four measures ranges from 0 to 1, and the higher the value, the closer the link between A and B is. This is a characteristic that all four measures share.
Now that lift and χ2 have been added, we have a total of six pattern evaluation measures. You might wonder which is best for evaluating the discovered pattern relationships. To find out, we examine their performance on several typical data sets.
Example: Six pattern evaluation measures are compared on typical data sets.
The purchase histories of two items, milk and coffee, can be summarised in Table T4, a 2 × 2 contingency table, where an entry such as mc denotes the number of transactions containing both milk and coffee. This allows the correlations between the purchases of the two commodities to be examined.
Table T5 displays the transactional data sets, the relevant contingency tables, and the corresponding values of each of the six evaluation measures. Consider the first four data sets, D1 through D4. According to the table, m and c are positively associated in D1 and D2, negatively associated in D3, and have no association in D4. For D1 and D2, m and c are positively correlated because mc (10,000) is considerably larger than the number of transactions containing only one of the two items (1000 each). Intuitively, of the customers who purchased milk (m = 10,000 + 1000 = 11,000), most also purchased coffee (mc/m = 10/11 = 91%), and vice versa.
Lift and χ2, however, produce dramatically different measure values for D1 and D2. In reality, the number of null transactions (transactions containing neither of the items under examination) is frequently enormous and unstable. In a market basket database, for instance, the total number of transactions may fluctuate daily and greatly outnumber those containing any particular itemset. A good measure should therefore not be influenced by the null transactions, otherwise it will produce unstable results, as lift and χ2 do for D1 and D2.
For D3, the four new measures correctly show that m and c are strongly negatively correlated, because the ratio of mc to m (and to c) is 100/1100 = 9.1%. However, lift and χ2 both incorrectly contradict this: their values for D2 fall between those for D1 and D3.
For data set D4, lift and χ2 indicate a highly positive association between m and c, while the other measures imply a "neutral" correlation: the likelihood that a consumer who buys coffee will also buy milk is exactly 50%, and vice versa.
Why are lift and χ2 so poor at distinguishing pattern associations in the preceding transactional data sets? To answer this, we must consider null transactions. A transaction that contains none of the itemsets under examination is said to be null; in our illustration, the count of transactions containing neither milk nor coffee gives the number of null transactions. Because this count greatly influences both lift and χ2, they have trouble distinguishing interesting pattern associations. Typically, the number of null transactions can far exceed the number of actual purchases, since many people may buy neither milk nor coffee. The other four measures, in contrast, are good indicators of interesting pattern associations because their definitions remove the influence of the null-transaction count (i.e., they are not influenced by the number of null transactions).
This discussion demonstrates how important it is to have a measure whose value is unaffected by the number of null transactions. A measure is said to be null-invariant if its value is not influenced by null transactions. Null-invariance is a key property for measuring association patterns in large transaction databases. Among the six measures discussed in this subsection, only lift and χ2 are not null-invariant.
Among the Kulczynski, cosine, all confidence, and max confidence measures, which is most effective at highlighting interesting pattern relationships?
To help answer this question, the imbalance ratio (IR), which assesses the imbalance of two itemsets A and B in rule implications, is introduced. It is defined as
IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)),
where the numerator is the absolute value of the difference between the supports of itemsets A and B, and the denominator is the number of transactions containing A or B. IR(A, B) will be zero if the two directional implications between A and B are identical.
In contrast, the imbalance ratio increases with the size of the difference between the
two. This ratio is independent of both the total number of transactions and the number
of null transactions.
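For illustration, the four null-invariant measures and the imbalance ratio can be computed directly from itemset supports; the function names below are our own, and the example numbers are taken from data set D1 of the milk/coffee discussion.

# A sketch of the four null-invariant measures and the imbalance ratio (illustrative only).
import math

def all_confidence(s_a, s_b, s_ab):
    return s_ab / max(s_a, s_b)

def max_confidence(s_a, s_b, s_ab):
    return max(s_ab / s_a, s_ab / s_b)

def kulczynski(s_a, s_b, s_ab):
    return 0.5 * (s_ab / s_a + s_ab / s_b)

def cosine(s_a, s_b, s_ab):
    return s_ab / math.sqrt(s_a * s_b)

def imbalance_ratio(s_a, s_b, s_ab):
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

# Data set D1: sup(m) = 11,000, sup(c) = 11,000, sup(m ∪ c) = 10,000
s_m, s_c, s_mc = 11_000, 11_000, 10_000
print(kulczynski(s_m, s_c, s_mc), imbalance_ratio(s_m, s_c, s_mc))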
Classification is a form of data analysis that extracts models describing important data classes; such models predict categorical class labels. For instance, we can build a classification model to divide bank loan applications into safe and risky categories. Such analysis gives us a better overall comprehension of the data. Researchers in statistics, pattern recognition, and machine learning have proposed a wide variety of classification methods.
4.4.1 Classification
To extract models representing significant classes or to forecast upcoming data trends, two types of data analysis can be performed:
1. Classification
2. Prediction
Classification and prediction are used to create models that represent the various data classes and forecast future data trends. Classification predicts the categorical class labels of data, while prediction estimates continuous-valued functions. This kind of analysis gives us a clear broad-scale understanding of the data. Classification models predict categorical class labels: for instance, based on a person's income and line of work, we can create a classification model that labels bank loan applications as safe or risky, or a prediction model that estimates how much money a potential customer will spend.
What is Classification?
A fresh observation is classified by determining its category or class label. A set of data is first employed as training data: the algorithm receives a set of input data along with the associated outputs, so the training data set includes the input data and their corresponding class labels. The algorithm creates a classifier or model from the training data set; the resulting model may be a decision tree, a mathematical formula, or a neural network. When given unlabelled data, the model should be able to identify the class to which each case belongs. The new data given to the model is the test data set.
Classification is the act of assigning a record to a class. It can be as simple as deciding whether or not it is raining: the answer is either yes or no, so there is a fixed number of options. There may occasionally be more than two classes to categorise; that is known as multiclass classification.
A bank must determine whether or not it is risky to lend money to a specific customer. A classification model may be created, for instance, that predicts credit risk based on observed data for various loan borrowers. The information may include employment history, home ownership or rental history, years of residence, and the number, kind, and history of deposits, as well as credit scores. The data would contain one case per customer, the target would be the credit ranking, and the predictors would be the other features. In this illustration, a model is built to identify a categorical label; the labels are "risky" and "safe".
The data classification process has two phases: the creation of the classifier or model, and the application of the classifier for classification.
1. Developing the classifier (model creation): This level represents the learning phase or process, in which the classifier is built by the classification algorithm. A classifier is built from a training set of database records together with their matching class labels. Each record of the training set belongs to a category or class; these records may also be referred to as examples, objects, or data points.
2. Applying the classifier for classification: At this level, classification is done using the classifier. Here, the test data are used to gauge how accurate the classification system is. If the accuracy is judged sufficient, the classification rules may be applied to new data records. Applications of classification include:
◌◌ Sentiment Analysis: Sentiment analysis is used extensively in effective social media monitoring; it allows us to gain insights from social media. With cutting-edge machine learning algorithms, sentiment analysis models can even read and interpret misspelt words, and well-trained models deliver consistently correct results in a small amount of time.
◌◌ Document Classification: Document classification can be used to arrange documents into sections based on their content. Text classification applies to documents, so the full text of a document can be categorised, and this can be carried out automatically with the aid of machine learning classification algorithms.
◌◌ Image Classification: Images can be assigned automatically to categories based on their visual content, again with the aid of machine learning classification algorithms.
◌◌ Machine Learning Classification: It performs analytical tasks that would take humans hundreds of additional hours to complete, using algorithmic rules that are statistically verifiable.
3. Data Classification Process: The data classification process involves steps such as:
◌◌ Establishing the architecture, strategy, and goals for the data classification process.
◌◌ Sorting the stored private information.
◌◌ Using data labelling to assign marks.
The data classification life cycle provides a useful framework for managing the flow of data into an organisation. Businesses must take compliance and data security into account at every level. With the use of data classification, this can be done at every stage, from creation to deletion. The data life cycle includes the following stages:
1. Creation: Data is produced in many formats and from many sources, such as Excel, Google documents, social media, and websites.
2. Role-based practice: All sensitive data is tagged with role-based security restrictions based on internal protection policies and agreement guidelines.
3. Storage: The collected data is stored, protected with encryption and access restrictions.
4. Sharing: Agents, customers, and coworkers receive data continuously from a variety
of platforms and devices.
5. Archive: Eventually, data moves into an organisation's archival storage systems.
6. Publication: Through publication, data can reach customers; dashboards are then available for viewing and downloading.
4.4.2 Prediction
Prediction is another step in the data analysis process. It is used to find a numerical outcome. As in classification, the training data set contains the inputs and their matching numerical output values. Using the training data set, the algorithm derives a model or predictor. When new data are provided, the model should produce a numerical output. In contrast to classification, this approach does not have a class label; the model predicts a continuous-valued function or an ordered value.
In most cases, regression is utilised for prediction. One example of prediction is
estimating the worth of a home based on information like the number of rooms, total
square footage, etc.
Consider a scenario where a marketing manager must estimate how much a specific customer will spend during a sale. In this situation we need to predict a numeric value, so this data mining task is an example of numerical prediction. A model or predictor will be created that predicts a continuous-valued or ordered value.
Preparing the data for classification and prediction is the main issue. Data preparation involves the following tasks:
●● Data Cleaning: Data cleaning entails treating missing values and removing noise. Smoothing techniques are used to reduce noise, and the issue of missing values is resolved by replacing a missing value with the most commonly occurring value for that attribute.
●● Relevance Analysis: The data may also contain irrelevant or redundant attributes. Correlation analysis can be used to determine whether two attributes are related, so that redundant attributes can be removed.
●● Data Transformation: The data can be transformed by normalisation, particularly when neural networks or methods involving distance measurements are used in the learning phase.
◌◌ Generalization: The data can also be transformed by generalising it to higher-level concepts. Concept hierarchies are useful for this.
The following criteria are used to compare classification and prediction methods:
●● Accuracy: Accuracy refers to the ability of the classifier to correctly predict the class label of new data; predictor accuracy describes how well a given predictor can estimate the value of the predicted attribute for fresh data.
●● Speed: This refers to the computational cost involved in creating and using the classifier or predictor.
●● Robustness: It describes a classifier's or predictor's capacity to provide accurate predictions from noisy data.
●● Scalability: Scalability is the capacity to build the classifier or predictor efficiently given a vast amount of data.
●● Interpretability: It refers to the level of understanding that the classifier or predictor provides.
4.5 Data Mining Analysis
Data science is a new discipline that incorporates automated ways to examine patterns and models for all types of data, with applications ranging from scientific discovery to business intelligence. One such technique is cluster analysis, which partitions a set of data objects into groups, or clusters, such that each cluster contains items that are similar to one another but different from those in other clusters. A clustering is the collection of clusters that emerges from a cluster analysis; different clustering techniques may produce different clusterings of the same data set. The partitioning is performed by the clustering algorithm, not by people. Clustering is therefore advantageous in that it can lead to the identification of previously unknown groups within the data.
Many application domains, such as business intelligence, image pattern recognition, web search, biology, and security, have extensively exploited cluster analysis. In business intelligence, clustering can be used to organise a large number of customers into groups, where the customers within each group share strongly similar characteristics. This facilitates the development of business strategies for improved customer relationship management. Consider also a consulting firm with many active projects: clustering can divide the projects into groups based on similarities so as to enhance project management, allowing for effective project auditing and diagnosis (to improve project delivery and outcomes).
e
clusters or “subclasses” in images. Let’s say we have a collection of handwritten
numerals with the labels 1, 2, 3, and so on. Be aware that there can be significant
differences in how people write the same digit. Consider the number 2, for instance.
in
Some people might choose to write it with a tiny circle at the bottom left corner, while
others might choose not to. Clustering can be used to identify “2” subclasses, each
of which stands for a different way that 2 can be expressed. The accuracy of overall
nl
recognition can be increased by combining several models based on the subclasses.
Web search has benefited greatly from clustering. Due to the extraordinarily high
number of online pages, for instance, a keyword search may frequently produce a very
O
large number of hits (i.e., pages relevant to the search). Clustering can be used to
arrange the search results into categories and present them in a clear, understandable
manner. Additionally, strategies for grouping documents into subjects, which are
ty
frequently employed in information retrieval practise, have been developed.
si
concentrate on a specific collection of clusters for future research. Alternately, it might
act as a preprocessing stage for other algorithms, such characterization, attribute
subset selection, and classification, which would later work with the identified clusters
r
and the chosen attributes or features.
ve
A cluster of data items can be regarded as an implicit class since a cluster is a
group of data objects that are distinct from one another and similar to one another
inside the cluster. This is why clustering is occasionally referred to as automatic
classification. Another important distinction is that clustering can automatically identify
ni
the groupings in this case. Cluster analysis has the advantage of being able to do this.
Because clustering divides enormous data sets into groups based on their
resemblance, it is also known as data segmentation in some applications. In
U
circumstances when outliers (values that are “far away” from any cluster) are more
intriguing than common cases, clustering can also be employed to find them. The use
of outlier detection in electronic commerce and the identification of credit card fraud
are two examples of its applications. For instance, unusual credit card transactions,
ity
including very pricey and rarely purchases, may be of interest as potential fraud
schemes.
marketing, and many more application areas are among the research fields that have
contributed. Cluster analysis has lately emerged as a very active area of research in
data mining due to the enormous volumes of data produced in databases.
)A
Cluster analysis has received a great deal of attention as a statistical subfield, with
a primary focus on distance-based cluster analysis. Many statistical analysis software
packages or systems, such as S-Plus, SPSS, and SAS, also include cluster analysis
tools based on k-means, k-medoids, and various other techniques. The learning
algorithm is supervised in that it is informed of the class membership of each training
(c
e
unsupervised learning. Because of this, clustering is an example of learning through
observation as opposed to learning through examples. Finding techniques for efficient
and effective cluster analysis in huge databases has been the focus of data mining. The
in
efficacy of methods for clustering complex shapes (such as nonconvex) and types of
data (such as text, graphs, and images), high-dimensional clustering techniques (such
as clustering objects with thousands of features), and methods for clustering mixed
nl
numerical and nominal data in large databases are currently active research areas.
Properties of Clustering:
O
●● Scalability: A huge database may contain millions or even billions of objects,
especially in Web search applications. However, many clustering methods perform
effectively on small data sets with fewer than a few hundred data objects. Only
using a sample from a given huge data collection while clustering could produce
ty
skewed findings. As a result, we want extremely scalable clustering techniques.
●● Ability to deal with different types of attributes: Algorithms for clustering numerical
(interval-based) data are numerous. However, applications could call for
si
clustering binary, nominal (categorical), ordinal, or combinations of these data
types. In recent years, complex data types like graphs, sequences, pictures, and
documents have increased the requirement for clustering algorithms in a growing
number of applications. r
ve
●● Discovery of clusters with arbitrary shape: Numerous clustering methods
identify clusters using Manhattan or Euclidean distance metrics. Such distance
estimates have the tendency to lead to the discovery of spherical clusters that
are comparable in size and density. A cluster, however, could take on any shape.
ni
Take sensors, for instance, which are frequently used to monitor the surroundings.
Sensor readings can be subjected to cluster analysis to find intriguing patterns.
To locate the border of a running forest fire, which is frequently not spherical,
U
we would want to apply clustering. It’s crucial to create algorithms that can find
clusters of any shape.
●● Requirements for domain knowledge to determine input parameters: Users
ity
are often required to supply domain knowledge in the form of input parameters
for clustering algorithms, such as the desired number of clusters. As a result,
the clustering results might be affected by these settings. Particularly for high-
dimensional data sets and when users haven’t fully grasped their data, parameters
are frequently difficult to calculate. In addition to burdening users, requiring the
m
data are common components of real-world data sets. For instance, sensor
readings are frequently noisy; some readings may be incorrect as a result of
the detecting mechanisms, and some readings may be incorrect as a result of
interference from nearby transitory objects. Because of this noise, clustering
algorithms may create subpar clusters. We therefore require noise-resistant
(c
clustering techniques.
●● Incremental clustering and insensitivity to input order: Incremental updates
(indicating updated data) may appear at any time in many programmes. Some
Notes
e
clustering algorithms must recompute a new clustering from start since they are
unable to incorporate incremental updates into pre-existing clustering structures.
The arrangement of the incoming data may also affect how sensitive clustering
in
algorithms are. In other words, depending on the order in which the objects are
presented, clustering algorithms may produce drastically different clusterings
given a collection of data objects. Algorithms for incremental clustering and those
nl
indifferent to the input order are required.
●● Capability of clustering high-dimensionality data: There may be several
dimensions or attributes in a data set. Each term can be viewed as a dimension
O
when clustering papers, for instance, and there are frequently thousands of
keywords. The majority of clustering algorithms are adept at working with low-
dimensional data, such as sets with only two or three dimensions. Given that such
ty
data might be extremely sparse and highly skewed, it can be difficult to locate
groups of data objects in a highdimensional space.
●● Constraint-based clustering: Clustering may be required in real-world applications
under a variety of restrictions. Imagine you have the responsibility of selecting
si
the locations for a certain number of new ATMs to be installed in a city. You can
cluster homes while taking into account restrictions like the city’s waterways and
transportation networks, as well as the kinds and numbers of clients each cluster,
r
to decide on this. Finding data groups that satisfy the required requirements and
ve
exhibit effective clustering behaviour is a difficult undertaking.
●● Interpretability and usability: Users want interpretable, understandable, and
useable clustering findings. In other words, clustering might need to be connected
to certain semantic applications and interpretations. The impact of an application
ni
aim on the choice of clustering features and clustering techniques must be studied.
The following are orthogonal aspects with which clustering methods can be
compared:
U
●● The partitioning criteria: Some techniques divide up all the objects so that there
is no hierarchy between the clusters. In other words, conceptually, all the clusters
are on the same level. Such a technique is helpful, for instance, when dividing
ity
subtopics, is one example of how text mining might be used. Sports can include
subtopics like “football,” “basketball,” “baseball,” and “hockey,” for instance. The
last four themes are listed below “sports” in the hierarchy.
)A
●● Separation of clusters: Some techniques group data objects into clusters that are
mutually exclusive. Each client may only be a part of one group when customers
are grouped together and each group is managed by a different manager. Other
times, the clusters might not be mutually exclusive, meaning that a data object
might be a member of more than one cluster. A document may be related to many
(c
subjects, for instance, when grouping documents into themes. The clusters of
themes thus might not be mutually exclusive.
●● Similarity measure: Some techniques use the separation between the two things
Notes
e
to compare how similar they are. Any space, including Euclidean space, a road
network, a vector space, and others, can be used to define this distance. In other
approaches, connectedness based on density or contiguity may be used to identify
in
similarity rather than the exact separation of two items. The design of clustering
techniques relies heavily on similarity measures. Density- and continuity-based
approaches may frequently locate clusters of any shape, whereas distance-based
nl
methods can frequently benefit from optimization techniques.
●● Clustering space: Numerous clustering techniques look for clusters throughout the
full specified data space. These techniques work well with low-dimensional data
O
sets. However, highdimensional data may contain a large number of unimportant
features, making similarity measurements suspect. As a result, clusters discovered
throughout the entire space are frequently meaningless. Instead, it is frequently
ty
preferable to look for clusters inside various subspaces of the same data set.
Subspace clustering identifies clusters and subspaces that exhibit object similarity
(typically of low dimensionality).
Finally, clustering algorithms have a number of specifications. Scalability, the
si
capacity to handle various attribute kinds, noisy data, incremental updates, clusters of
variable shape, and limitations are some of these elements. Usability and interpretability
are also crucial. The level of partitioning, whether or not clusters are mutually exclusive,
r
the similarity metrics applied, and whether or not subspace clustering is carried out are
ve
further factors that might vary amongst clustering techniques.
Clustering Methods:
The clustering methods can be classified into the following categories:
ni
●● Partitioning Method
●● Hierarchical Method
U
●● Density-based Method
●● Grid-Based Method
●● Model-Based Method
ity
●● Constraint-based Method
Partitioning Method:
In order to create clusters, it is utilised to partition the data. Each partition is
m
represented by a cluster and n < p if “n” partitions are made on “p” database items. For
this Partitioning Clustering Method to work, two requirements must be met:
)A
Hierarchical Method:
This method creates a hierarchical breakdown of the supplied data items. On the
e
approaches and determine the classification’s purpose. There are two different methods
for creating hierarchical decomposition, and they are as follows:
in
◌◌ Agglomerative Approach: The bottom-up strategy is another name for the
agglomerative strategy. The initial division of the provided data is into groups
of objects. After then, it continues to combine groups of items that are close
nl
to one another and share characteristics. Up until the termination condition is
met, this merging process continues.
◌◌ Divisive Approach: The top-down strategy is another name for the polarising
O
strategy. We would begin with the data objects that are in the same cluster
in this method. By continuously iterating, the collection of distinct clusters is
split into smaller clusters. The iteration keeps going until either the termination
condition is satisfied or each cluster has an object in it.
ty
As it is a hard procedure and is not very flexible, once the group is divided or
merged, it cannot be undone. The two methods that can be utilised to raise the quality
of hierarchical clustering in data mining are as follows:
si
◌◌ At each stage of hierarchical clustering, the links of the item should be
thoroughly examined.
◌◌ For the integration of hierarchical agglomeration, one can utilise a hierarchical
r
agglomerative algorithm. In this method, the objects are initially put into little
ve
clusters. Data objects are grouped into microclusters, and the microcluster is
then subjected to macro clustering.
Density-Based Method:
ni
●● The majority of object clustering techniques use object distance to group things
together. Such algorithms struggle to find clusters of any shape and can only
locate spherical-shaped clusters.
U
●● On the basis of the idea of density, other clustering techniques have been
created. Their basic premise is that a given cluster should develop as long as its
neighbourhood density exceeds a certain threshold; in other words, each data
point within a given cluster must have at least a certain number of neighbours
ity
within a given radius. A technique like this can be used to remove noise (outliers)
and find clusters of any shape.
●● The usual density-based approaches DBSCAN and its extension OPTICS build
clusters in accordance with a density-based connectivity analysis. By analysing
m
Grid-Based Method:
●● Grid-based approaches divide the object space into a fixed number of grid-like
cells.
●● On the grid structure, or on the quantized space, all clustering operations are
(c
carried out. The key benefit of this strategy is its quick processing time, which is
usually unaffected by the quantity of data items and only depends on the number
of cells in each dimension of the quantized space.
e
grid-based and density-based wavelet transformation for clustering analysis.
Model-Based Method:
in
To find the data that is most suitable for the model, all of the clusters are
hypothesised in the model-based procedure. To find the clusters for a specific model,
use the clustering of the density function. In addition to reflecting the spatial distribution
nl
of data points, it offers a method for automatically calculating the number of clusters
using conventional statistics while accounting for outlier or noise. As a result, it
produces reliable clustering techniques.
O
Constraint-Based Method:
Application- or user-oriented restrictions are used in the constraint-based clustering
ty
approach. The user expectation or the characteristics of the expected clustering results
are examples of constraints. With the use of constraints, we may engage with the
clustering process. Constraints might be specified by the user or an application need.
si
Tasks in Data Mining:
●● Clustering High-Dimensional Data:
o r
Because many applications call for the study of objects with a lot of attributes
or dimensions, it is a task that is especially crucial in cluster analysis.
ve
o For instance, written documents may contain tens of thousands of terms
or keywords, and DNA microarray data may reveal details on the levels
of expression of tens of thousands of genes under hundreds of different
ni
circumstances.
o The curse of dimensionality makes it difficult to cluster high-dimensional data.
o Some dimensions might not be important. The data becomes increasingly
U
groupings.
●● Constraint-Based Clustering:
o It is a clustering method that employs application- or user-specified
constraints to produce clustering.
(c
e
accordance with the needs of the programme.
o With impediments present and clustering under user-specified limitations,
in
spatial clustering is used. Additionally, pairwise restrictions are used in semi-
supervised clustering to enhance the quality of the final grouping.
nl
Data objects are organised into groups in a tree structure using the hierarchical
clustering approach. A pure hierarchical clustering method’s performance is hindered
O
by its inability to modify once a merge or split decision has been made. This means that
the approach cannot go back and change a particular merge or split decision if it later
turns out to have been a bad one. Depending on whether the hierarchical breakdown
is generated top-down or bottom-up, hierarchical clustering approaches can be further
ty
categorised as either agglomerative or divisive.
si
Finding clusters that adhere to user-specified preferences or requirements is
known as constraint-based clustering. Constraint-based clustering may use a variety of
strategies depending on the type of constraints. There are various types of restrictions.
●● r
Constraints on individual objects: On the objects to be clustered, restrictions
ve
can be specified. For instance, in a real estate application, one would want to
geographically cluster just those expensive homes that cost more than a million
dollars. The set of objects that can be clustered is limited by this constraint.
Preprocessing makes it simple to manage, and the issue is thus reduced to a case
ni
of unrestricted clustering.
●● Constraints on the selection of clustering parameters: Each clustering parameter
can have a desired range chosen by the user. Typically, clustering parameters
U
significant impact on the clustering results. As a result, their processing and fine
tuning are typically not regarded as a type of constraint-based clustering.
●● Constraints on distance or similarity functions: For particular features of the items
to be clustered, we can provide various distance or similarity functions, as well as
m
but it will probably influence the mining results. However, in some circumstances,
particularly when it is closely related to the clustering procedure, such alterations
may make the assessment of the distance function nontrivial.
●● User-specified constraints on the properties of individual clusters: The user’s
preference for the intended properties of the generated clusters may have a
(c
e
supervision can be used to significantly enhance the quality of unsupervised
clustering. Pair-wise constraints may be used to accomplish this (i.e., pairs of
objects labelled as belonging to the same or different cluster). Semi-supervised
in
clustering is the name given to such a limited clustering procedure.
nl
●● Market research, pattern recognition, data analysis, and image processing are just
a few of the many fields in which cluster analysis has been successfully applied.
●● Clustering in business can assist marketers in identifying unique consumer groups
O
and describing customer groupings based on purchase trends.
●● In biology, it can be used to build taxonomies for plants and animals, classify
genes with similar functions, and learn more about population structures.
ty
●● The identification of areas of similar land use in an earth observation database,
the grouping of homes in a city based on house type, value, and location, and the
identification of car insurance policyholders with high average claim costs can all
si
benefit from clustering.
●● Because clustering divides enormous data sets into groups based on their
resemblance, it is also known as data segmentation in some applications.
●●
r
Outlier detection applications include the identification of credit card fraud and
ve
the monitoring of criminal activity in electronic commerce. Clustering can also be
utilised for outlier detection.
Imagine working for a credit card corporation as a transaction auditor. You pay
close attention to card usages that differ significantly from ordinary occurrences in
U
order to safeguard your consumers from credit card fraud. For instance, a purchase
is suspicious if it is significantly larger than usual for the cardholder and takes place
outside of the city where they normally reside. As soon as such transactions take place,
you should identify them and get in touch with the cardholder for confirmation. In a lot of
ity
credit card firms, this is standard procedure. What methods of data mining can be used
to identify suspicious transactions?
Credit card transactions are typically routine. However, when a credit card is stolen,
the typical transaction pattern is drastically altered; the places and things bought are
m
frequently considerably different from those of the legitimate card owner and other
customers. Detecting transactions that are noticeably out of the ordinary is a key
component of credit card fraud detection.
)A
Finding data items with behaviour that deviates significantly from expectations is
the process of outlier detection, sometimes referred to as anomaly detection. These
things are referred to as anomalies or outliers. In addition to fraud detection, outlier
identification is crucial for a variety of other applications, including image processing,
sensor/video network monitoring, medical care, public safety, and security, industry
(c
Clustering analysis and outlier detection are two tasks that are closely related.
Notes
e
In contrast to outlier identification, which seeks to identify unusual instances that
significantly vary from the majority patterns, clustering identifies the predominant
patterns in a data set and organises the data accordingly. Clustering analysis and
in
outlier identification have various uses.
nl
When discussing data analysis, the word “outliers” frequently comes to mind.
Outliers are data points that are not consistent with expectations, as the name implies.
What you do with the outliers is the most important aspect of them. You will always
O
make certain assumptions depending on how the data was produced if you are going
to evaluate any task to analyse data sets. These data points are outliers if you discover
them, and depending on the situation, you may wish to correct them if they are likely
to contain some sort of inaccuracy. The analysis and prediction of data that the data
ty
contains are two steps in the data mining process. Grubbs provided the initial definition
of outliers in 1969.
An outlier can neither be considered noise nor an error. Instead, it is thought that
si
they weren’t created using the same process as the other data objects.
◌◌ r
Global (or Point) Outliers
ve
◌◌ Collective Outliers
◌◌ Contextual (or Conditional) Outliers
ni
U
ity
Global Outliers
A data object in a particular data collection is considered a global outlier if
it considerably differs from the other data objects in the set. The most basic kind of
outliers are global outliers, also known as point anomalies. Finding global outliers is the
m
suggested, and outlier identification techniques are divided into various groups as a
result. Later, we shall discuss this topic in further detail.
The ability to recognise global outliers is crucial in many applications. Take, for
instance, intrusion detection in computer networks. When a computer’s communication
(c
behaviour deviates significantly from typical patterns (for example, when a huge
number of packages are broadcast in a brief period of time), this behaviour may be
regarded as a global outlier, and the accompanying machine is presumed to have
Amity Directorate of Distance & Online Education
Data Warehousing and Mining 217
been hacked. As another illustration, transactions that do not adhere to the rules are
Notes
e
regarded as global outliers and should be stored for additional analysis in trading
transaction auditing systems.
in
For instance, in an intrusion detection system, if several packages are broadcast
in a short period of time, this may be seen as a global outlier and we can infer that the
system in question has possibly been compromised.
nl
O
ty
The red data point is global outlier
si
Collective Outliers
Let’s say you handle the supply chain for All Electronics. Each day, you manage
tens of thousands of shipments and orders. Given that delays are statistically common,
r
if an order’s shipping is delayed, it may not be viewed as an oddity. However, if 100
ve
orders are delayed in a single day, you need to pay attention. Even if each of those
100 orders might not be viewed as an outlier if taken individually, taken as a total, they
constitute an anomaly. To comprehend the shipment issue, you might need to take a
thorough look at those orders taken as a whole.
ni
If a subset of data objects inside a given data collection considerably deviates from
the total data set, the subset becomes a collective outlier. The individual data objects
might not all be outliers, which is significant.
U
Because their density is significantly greater than that of the other objects in the
data set, the black objects in Figure (The black objects form a collective outlier) as a
whole constitute a collective outlier. However, when compared to the entire dataset, no
one black object is an outlier.
ity
m
)A
Numerous crucial applications exist for collective outlier detection. For instance, a
denial-of-service package sent from one computer to another is not at all unusual and
(c
e
of identical stock transactions within a small group in a brief period, however, are
collective outliers because they could be proof of market manipulation.
in
In contrast to global or contextual outlier detection, collective outlier detection
requires that we take into account both the behaviour of groups of objects as well
as that of individual objects. As a result, prior knowledge of the relationships among
nl
data items, such as assessments of object distance or similarity, is required in order to
identify collective outliers.
O
object may fall under more than one category of outlier. Different outliers can be used
in business for a variety of applications or goals. The easiest type of outlier detection
is global. Contextual attributes and contexts must be determined in order to perform
context outlier identification. In order to describe the link between items and discover
ty
groups of outliers, collective outlier identification requires background knowledge.
Contextual Outliers
si
“Today’s temperature is 28 C. Is it extraordinary (an outlier)? For instance, it
depends on the moment and place! Yes, it is an anomaly if it happens to be winter in
Toronto. It is typical if it is a summer day in Toronto. Contrary to global outlier detection,
r
in this instance, the context—the date, the location, and maybe other factors—
ve
determines whether or not the current temperature number is an outlier.
the context must be stated as part of the problem formulation in order to do contextual
outlier detection. The properties of the data objects in question are often separated into
two sections for contextual outlier detection:
U
When the set of contextual attributes is empty, global outlier identification can
be thought of as a specific instance of contextual outlier detection. In other words,
the entire data set serves as the background for global outlier detection. Users have
Notes
e
flexibility thanks to contextual outlier analysis since they may study outliers in various
situations, which is highly desirable in many applications.
in
An analyst may take into account outliers in several situations in addition to global
outliers while detecting credit card fraud, for instance. Think about clients who exceed
90% of their credit limit. Such behaviour might not be regarded as an outlier if one
nl
of these consumers is perceived to be a member of a group of clients with low credit
limitations. However, if a high-income customer consistently exceeds their credit limit,
that customer’s same conduct can be viewed as an exception. Raising credit limits for
such customers can generate additional money. Such outliers may present economic
O
opportunities.
ty
accuracy of contextual outlier detection in an application. Experts in the relevant fields
should, more often than not, decide on the contextual qualities; this can be viewed as
input background knowledge. In many applications, it is difficult to gather high-quality
si
contextual attribute data or gain enough information to infer contextual attributes.
How can we create contexts that are meaningful for contextual outlier detection?
The contextual characteristics’ group-bys are used as contexts in a simple approach.
r
But many group-bys might have insufficient data or noise, so this might not work. The
ve
proximity of data objects in the space of contextual attributes is used in a more generic
approach.
Outliers Analysis
ni
When data mining is used, outliers are frequently deleted. However, it is still utilised
in numerous applications, including medical and fraud detection. It usually happens
because uncommon events have a lot greater capacity to store important information
U
Outlier analysis in data mining can be used to examine any uncommon response
ity
Case Study
Notes
e
Data analysis Companies employ data mining as a method to transform
unstructured data into information that is useful. Businesses may learn more about
in
their customers, create more successful marketing campaigns, boost sales, and cut
expenses by employing software to find patterns in massive volumes of data. Effective
data collection, warehousing, and computer processing are all necessary for data
nl
mining. examine a customer’s psychological perspective, translate it into statistical
form, and determine whether there is any technical format that allows us to assess his
purchasing behaviour.
O
Breaking down ‘data mining in marketing’
The well-known application of data mining techniques in grocery stores.
Customers can obtain discounted pricing not available to non-members by using the
ty
free loyalty cards that are frequently given out by supermarkets. The cards make it
simple for retailers to keep tabs on who is purchasing what, when, and for what price.
After evaluating the data, the retailers can use it for a variety of things, like providing
consumers with coupons based on their purchasing patterns and choosing when to
si
sell items on sale or at full price. When only a small portion of information that is not
representative of the entire sample group is utilised to support a certain premise, data
mining can be problematic.
r
ve
Power of hidden information in data:
Every day, businesses produce terabytes of data that are kept in databases, data
warehouses, or other types of data repositories. The majority of useful information
might be concealed within such data; yet, the enormous amount of data makes it
ni
difficult for humans to access them without the aid of strong tools and approaches.
Information was only available on papers and only at certain times at the beginning of
the previous decade. Information is now readily available because content producers,
U
content locators, and strong search engines have made it possible to quickly access
vast amounts of information.
The same information can now be accessed through natural language processing
ity
in a language that is comfortable for the user. We have arrived at a period where
everything happens in a matter of clicks, unlike earlier times when people had to
wait in lengthy lines to pay bills, taxes, or purchase cinema tickets and other forms of
entertainment. The firms that are heavily reliant on consumer behaviour and basket
trends have been greatly impacted by all these altering elements and trends. To be
m
profitable, any organisation needs to be extremely scalable and able to anticipate client
behaviour in the future.
The Market Analysis was prompted by these very needs. Earlier methods for
)A
e
Classification:
in
Finding a model that adequately describes the various data classes or concepts
is the process of classification. The goal is to be able to forecast the class of objects
whose class label is unknown using this model. This model was developed by the
examination of training data sets.
nl
Eg: 1.putting voters into groups that are well-known to political parties.
O
Regression:
Regression analysis is a statistical method for determining the relationships
ty
between variables in statistical modelling. When the emphasis is on the link between
a dependent variable and one or more independent variables, it encompasses
numerous approaches for modelling and evaluating multiple variables. Imagining the
unemployment rate for the following year. estimating the cost of insurance.
si
Finding anomalies: Finding anomalies, also known as detecting outliers, entails
locating data points, events, or observations that deviate from a dataset’s norm.
r
Example: Credit card fraud transaction detection. The discovery of objects, events, or
observations that deviate from an expected pattern or other objects in a collection is
ve
known as anomaly detection (also known as outlier detection). For instance, credit card
fraud detection.
Time series
ni
A time series is a collection of data points that have been enumerated, graphed,
or otherwise organised chronologically. A time series is most frequently a sequence
captured at a series of equally spaced moments in time. As a result, it is a collection of
U
discrete-time data.
Eg: Forecasting for sales, manufacturing, or essentially any growth event that
requires extrapolation
ity
Clustering
The task of clustering involves organising a collection of objects into groups so that
objects within a cluster are more similar (in some manner or another) to one another
m
Association analysis
Data mining function called association determines the likelihood of elements in a
collection occurring together. Association rules are used to express the links between
(c
co-occurring items.
e
purchase data.
in
mining” is used to examine vast amounts of data in order to find significant patterns and
rules. Data mining is a tool that businesses can use to enhance their operations and
outperform rivals. The following figure lists the most significant business sectors where
nl
data mining is successfully used.
O
ty
r si
ve
ni
A kind of marketing that puts the customer first is digital marketing. Digital
information is not only simpler to integrate, organise, and disseminate, but it also
speeds up interactions between service providers and customers. In the past,
U
marketing analysis and efficacy typically took a very long time. Today, marketing
promotion can have a higher synergistic effect thanks to digital marketing.
advancements, and digital transmission, firms should modify their marketing strategies
quickly. The market should switch from the Red Sea Strategy to the Blue Ocean
Strategy in a similar manner. Market space keeps growing as the environment changes,
but the market environment also gets more competitive. The development of Internet
m
marketing, including sale and purchase through the Website, keyword marketing, blog
marketing, and so forth, along with the development of wireless networks, will bring
the world closer together from traditional store sales, telephone marketing, and face-
to-face marketing. It gives rise to the impression that developing digital marketing
)A
communications is crucial. The company is expanding its options for client engagement
and connection because it is no longer constrained by conventional factors like time
and geography.
Apriori Algorithm
(c
Agrawal and Srikant introduced the Apriori algorithm in 1994 [3]. The algorithm
used to learn association rules is a well-known one. Apriori is made to work with
e
customers, or details of commerce website frequentation). Given a set of itemsets
(for example, sets of retail transactions, each showing specific items purchased), the
method seeks to identify subsets that are shared by at least a portion of the itemsets,
in
as is typical in association rule mining. As part of its “bottom up” methodology,
Apriori extends frequent subsets one item at a time (a process known as candidate
generation), and then compares groups of candidates to the data. When no more
nl
successful extensions are detected, the algorithm ends.
The Apriori Algorithm is used to identify relationships between various data sets.
It is also known as “Market Basket Analysis.” Each set, which has several things, is
O
referred to as a transaction. Apriori produces rules that specify how frequently items
appear in collections of data.
ty
Predicting the Customer Behaviour
In enterprise company, predicting client behaviour is the most crucial action. All of
the aforementioned techniques provide businesses with a wealth of insightful data. The
si
scenarios that have used the aforementioned techniques are shown in the next section.
r
These days, recommender systems are widely used across many industries.
Movies, music, literature, academic articles, search terms, social media tags, etc. are
ve
just a few examples. In order to forecast client behaviour, these systems combine
the concepts of intelligent systems, machine learning, and information retrieval. In
recommender systems, there are two methods: collaborative filtering and content-based
filtering.
ni
In this case study, association rules are extracted from user profiles using the
Apriori method. The PVT system is presented as an example. A recommender
ity
confidence values in this case study as the similarity scores. We can create rules using
direct programme similarities, and we can create new results by chaining these rules
together.
Amity Directorate of Distance & Online Education
224 Data Warehousing and Mining
e
response model using data mining techniques was created using past purchase data
to determine the likelihood that a client at Ebedi Microfinance Bank (Nigeria) will take
advantage of an offer or promotion. Data mining techniques were used to create
in
a predictive response model for this goal utilising past purchase information from
customers.
nl
A data warehouse was used to store the data so that management could use it as
a decision support system. The response model was created using data on previous
consumer purchases and demographics. The following are the buying behaviour
variables that were used in the model’s construction. The amount of months since the
O
most recent and initial purchases is known as the recency. It is frequently the most
effective of the three factors for forecasting a customer’s reaction to a following offer.
This makes a lot of sense. According to this statement, you are more likely to
ty
make another purchase from a firm if you recently made one than if you haven’t. This
is the total number of purchases. It may represent the sum of purchases made during
a certain period of time or comprise all purchases. In terms of predicting a response,
si
this trait is only second to recency. Once more, the connection to future purchases is
extremely obvious. Value in money: This is the whole sum. It might be within a certain
time range or encompass all purchases, similar to frequency. This trait has the weakest
r
predictive ability of the three when it comes to predicting response. However, when
used in concert, it can broaden our understanding in still another way. Customers’
ve
individual traits and information, such as age, sex, residence, profession, etc., are
included in the demographic data. exact Bayesian algorithm The classifier system was
built using the Naive Bayes technique.
ni
In order to choose the model’s inputs, filter and wrapper feature selection
approaches were also used. According to the results, Ebedi Microfinance Bank can
effectively sell their goods and services by obtaining a report on the state of their
clients. This will help management significantly reduce the amount of money that would
U
Conclusion
ity
purchase intentions and interests. Businesses have the advantage of having innovative
products, services, brands, quality, etc. Overall, the analysis of the marketing data for
the product, customer history, and purchasing patterns will lead digital marketing to
)A
The type, cost, location, and advertising of the product are only a few of those
covered by the product information. History records provide prior marketing tactics,
approaches, and consumer responses for estimation, consultation, and exploration
(c
consumer attributes, including age, sex, occupation, income, and lifestyle. In other
Notes
e
words, the business’s ability to develop a marketing strategy for various items will be
influenced by product information, historical records, and consumer buying behaviour in
terms of products association.
in
Summary
●● The intersection of statistics and machine learning gives rise to the
nl
interdisciplinary topic of data mining (DM) (artificial intelligence). It offers a
technique that aids in the analysis and comprehension of the data found in
databases, and it has been used to a wide range of industries or applications.
O
●● In order to offer a didactic viewpoint on the data analysis process of these
techniques, we present an applied vision of DM techniques. In order to find
knowledge models that reveal the patterns and regularities underlying the
ty
analysed data, we employ machine learning algorithms and statistical approaches,
comparing and analysing the findings.
●● Artificial neural networks (ANNs) are data processing systems that borrow features
si
from biological neural networks in their design and operation.
●● In basic building blocks known as neurons, information processing takes place.
The connections between the neurons allow for the transmission of signals.Each
r
link (communication) has a corresponding weight.
ve
●● The artificial neuron serves as the processing unit, collecting input from nearby
neurons and calculating an output value to be relayed to the rest of the neurons.
●● Sequential data divisions called decision trees (DT) maximise the differences of a
dependent variable (response or output variable). They provide a clear definition
ni
for groups whose characteristics are constant but whose dependent variable
varies.
U
●● Despite the method’s “naive” appearance, it may compete with other, more
advanced categorization techniques. Therefore, k-NN is very adaptable in
situations where the linear model is stiff.
●● The k number is not predetermined, however it is important to keep in mind that
m
approaches to combine data from the sample with expert opinion (prior probability)
in order to get an updated expert opinion (posterior probability).
●● A generalised linear model (GLM) is a logistic regression (LR). Predicting binary
variables (with values like as yes/no or 0/1) is its principal use. Thus, using LR
(c
●● There are two types of data mining: descriptive data mining and predictive data
Notes
e
mining. In a succinct and summarizing manner, descriptive data mining displays
the data set’s intriguing general qualities. In order to build one or more models
and make predictions about the behavior of fresh data sets, predictive data mining
in
analyzes the data.
●● Data discrimination is the process of comparing the general characteristics of data
nl
items from the target class to those of objects from one or more opposing classes.
A user can specify the target and opposing classes, and database queries can be
used to retrieve the associated data objects.
O
●● The data cube approach can be thought of as a materialized-view, data
warehouse-based, pre-computational strategy. Before submitting an OLAP or data
mining query for processing, it performs off-line aggregation.
●● Data warehousing and OLAP technologies have recently advanced to handle
ty
more complicated data kinds and incorporate additional knowledge discovery
processes. Additional descriptive data mining elements are predicted to be
incorporated into OLAP systems in the future as this technology develops.
si
●● Patterns that commonly show up in a data set include itemsets, subsequences,
and substructures. A frequent itemset is, for instance, a group of items that
regularly appear together in a transaction data set, like milk and bread.
●●
r
Data analysis that uses classification extracts models describing significant data
ve
classes. These models, referred to as classifiers, forecast discrete, unordered
categorical class labels. For instance, we can create a classification model to
divide bank loan applications into safe and dangerous categories. We may gain a
better comprehension of the data as a whole via such study.
ni
training data, a set of data is first employed. The algorithm receives a set of input
data along with the associated outputs. The input data and their corresponding
class labels are thus included in the training data set. The algorithm creates a
classifier or model using the training dataset.
m
numerical result. The training dataset includes the inputs and matching numerical
output values, much like in classification. Using the training dataset, the algorithm
creates a model or prediction.
●● The division of a set of data items (or observations) into subsets is known as
cluster analysis or simply clustering. Every subset is a cluster, and each cluster
(c
contains items that are similar to one another but different from those in other
clusters.
e
this situation, various clustering techniques may provide various clustering on the
same data set. The clustering algorithm, not people, performs the partitioning.
in
●● Properties of clustering: a) Scalability, b) Ability to deal with different types of
attributes, c) Discovery of clusters with arbitrary shape, d) Requirements for
domain knowledge to determine input parameters, e) Ability to deal with noisy
nl
data, f) Incremental clustering and insensitivity to input order, g) Capability
of clustering high-dimensionality data, h) Constraint-based clustering, i)
Interpretability and usability.
O
●● Clustering methods: a) Partitioning method, b) Hierarchical method, c) Density-
based method, d) Grid-based method, e) Model-based method, f) Constraint-
based method.
●● Outliers are data points that are not consistent with expectations, as the name
ty
implies.An outlier can neither be considered noise nor an error. Instead, it is
thought that they weren’t created using the same process as the other data
objects.
si
●● Three categories of outliers: a) Global outliers, b) Collective outliers, c) Contextual
outliers.
●●
r
When data mining is used, outliers are frequently deleted. However, it is still
utilised in numerous applications, including medical and fraud detection. It usually
ve
happens because uncommon events have a lot greater capacity to store important
information than more frequent happenings.
Glossary
ni
e
●● DENCLUE:Density Clustering.
●● STING: Statistical Information Grid Clustering Algorithm.
in
●● PROCLUS: Projected Clustering.
nl
1. Which of the following refers to the problem of finding abstracted data in the
unlabeled data?
O
a. Unsupervised learning
b. Supervised learning
c. Hybrid learning
ty
d. Reinforcement learning
2. Which of the following refers to the querying the unstructured textual data?
si
a. Information access
b. Information retrieval
c. Information update
d.
r
Information manipulation
ve
3. Which of the following can be considered as the correct process of data mining?
a. Exploitation, Interpretation, Analysis, Exploration, Infrastructure
b. Exploration, Exploitation, Analysis, Infrastructure, Interpretation
ni
4. Which of the following is an essential process in which the intelligent methods are
applied to extract data patterns?
a. Warehousing
ity
b. Text selection
c. Text mining
d. Data mining
m
a. Partitioned
b. Hierarchical
c. Naïve Bayes
Notes
e
d. None of the mentioned
7. In data mining how many categories of function are included?
in
a. 4
b. 3
nl
c. 2
d. 5
O
8. What are the functions of data mining?
a. Association and correlation analysis classification
b. Prediction and Characterization
ty
c. Cluster analysis and Evolution analysis
d. All of the above
si
9. _ _ _ _module communicates between the data mining system and the user.
a. Graphical User Interface
b. Knowledge Base
c. Pattern Evolution Module
r
ve
d. None of the mentioned
10. The steps of the knowledge discovery database process, in which several data
sources are combined refers to_ _ _.
ni
a. Data transformation
b. Data integration
U
c. Data cleaning
d. None of the above
11. _ _ _ _ are the data mining application?
ity
a. Database Technology
b. Information Science
c. Machine Learning
d. All of the above
(c
b. 4
Notes
e
c. 7
d. 6
in
14. Which among the following is the data mining algorithm?
a. K-mean algorithm
nl
b. Apriori algorithm
c. Naïve Bayes algorithm
O
d. All of the above
15. _ _ _ _ _is a data mining algorithms used to create decision tree.
a. C4 .5 algorithm
ty
b. PageRank algorithm
c. K-mean algorithm
si
d. Adaboost algorithm
16. Which of the following defines the structure of the data held in operational databases
and used by operational applications?
a. User-level metadatar
ve
b. Operational metadata
c. Relational metadata
d. Data warehouse metadata
ni
17. Which of the following consists of information in the enterprise that is not in classical
form?
a. Mushy metadata
U
b. Data mining
c. Differential metadata
ity
b. Informational
c. Operational
d. None of the above
)A
c. Informal environment
d. None of the above
20. Data warehouse contains_ _ _ _ data that is never found in the operational
Notes
e
environment.
a. Normalized
in
b. Denormalized
c. Informational
nl
d. Summary
Exercise
O
1. Define statistical techniques in data mining.
2. Explain data mining characterisation and data mining discrimination.
3. Define the terms
ty
a. Mining patterns
b. Associations
si
c. Correlations
4. What do you mean by classification and prediction?
5. Explain cluster analysis.
6. Define the term outlier analysis.
r
ve
Learning Activities
1. In your project you are responsible for analyzing the requirements and selecting a
toolset for data mining. Make a list of the criteria you will use for the toolset selection.
ni
1 a 2 b
3 c 4 d
5 a 6 b
ity
7 c 8 d
9 a 10 b
11 c 12 d
m
13 a 14 d
15 a 16 b
)A
17 a 18 c
19 a 20 d
e
3. Handbook of Statistical Analysis and Data Mining Applications, Gary Miner,
John Elder, and Robert Nisbet
in
4. Data Mining and Machine Learning: Fundamental Concepts and Algorithms,
Mohammed J. Zaki and Wagner Meira
5. Introduction to Data Mining, Michael Steinbach, Pang-Ning Tan, and Vipin
nl
Kumar
6. Data Mining: Practical Machine Learning Tools and Techniques, Eibe Frank,
O
Ian H. Witten, and Mark A. Hall
ty
r si
ve
ni
U
ity
m
)A
(c
e
Learning Objectives:
in
At the end of this topic, you will be able to understand:
●● Text Mining
nl
●● Spatial Databases
●● Web Mining
O
●● Multidimensional Analysis of Multimedia Data
●● Applications in Telecommunications Industry
●● Applications in Retail Marketing
ty
●● Applications in Target Marketing
●● Mining in Fraud Protection
si
●● Mining in Healthcare
●● Mining in Science
●● Mining in E-commerce
●● Mining in Finance
r
ve
Introduction
Applications for data mining may be general or specialised. The general application
ni
must be an intelligent system capable of making choices on its own, including those
regarding data selection, data mining method, presentation, and result interpretation.
Some general data mining programmes cannot make these choices on their own
but instead assist users in choosing the data, the data mining technique, and the
U
text to photos, and store them in a range of databases and data structures. From these
various datasets, patterns and knowledge are extracted using various data mining
techniques. The work of choosing the data and methods for data mining is crucial to this
process and calls for domain knowledge.
m
For data mining, a wide range of data should be gathered in the particular problem
domain, specific data should be chosen, the data should be cleaned and transformed,
patterns should be extracted for knowledge development, and knowledge should
)A
then be interpreted. Data mining is utilised in the medical field to do basket analyses,
uncover patterns, forecast sales, and detect malicious executables. There are still
numerous unresolved challenges with data mining, such as security and social issues,
user interface problems, performance problems, etc., before it becomes a common,
established, and trustworthy discipline.
(c
e
There is a vast amount of knowledge saved in text documents nowadays, either
within businesses or generally accessible. Due to the enormous amount of information
in
available in electronic form, including electronic periodicals, digital libraries, e-mail,
and the World Wide Web, text databases are expanding quickly. The majority of text
databases store semi-structured data, and specialised text mining algorithms have
nl
been created for extracting new knowledge from big textual data sets.
O
5.1.1 Text Mining
Online text mining is often enabled by two main methods. The first is the ability
ty
to search the Internet, and the second is the text-analysis method. There has been
internet search for a while. Since there have been so many websites during the last
ten years, a tonne of search engines that are intended to assist consumers in finding
material have nearly developed overnight. The earliest search engines were Yahoo,
si
AltaVista, and Excite, whereas Google and Bing have gained the most traction in recent
years. Search engines work by indexing a certain Web site’s content and letting users
search these indices. Users of the latest generation of Internet search engines can now
r
find pertinent information by sifting through fewer links, pages, and indexes.
ve
The study of text analysis is older than the study of internet search. It has been
a part of efforts to teach computers to understand natural languages, and artificial
intelligence experts frequently view it as a challenge. Anywhere there is a lot of text
that has to be studied, text analysis can be applied. Although the depth of analysis that
ni
a human can bring to the task is not possible with automatic processing of documents
using various techniques, it can be used to extract important textual information,
classify documents, and produce summaries when manual analysis is impossible due
U
You can either look for keywords in text documents or try to classify the semantic
content of the text itself to comprehend the specifics. When establishing specific
ity
information or elements within text documents that can be used to demonstrate links or
linkages with other documents, you are looking for keywords in those text documents.
Documents have usually been modelled in vector space in the IR domain. White-space
delimiters in English are used as simple syntactic rules for tokenization, and tokens are
changed to their canonical forms (e.g., “reading” becomes “read,” “is,” “was,” and “are”
m
becomes “be”). A Euclidean axis is represented by each canonical token. Vectors in this
n-dimensional space are documents. The tth coordinate of a document d is just n if a
token t called term appears n times in it. Using the L1, L2, or L norms, one may decide
)A
Where n(d,t) denotes the quantity of times a term appears in a given text (d).
These illustrations fail to convey the reality that some keywords, such as “algorithm,”
(c
are more significant in determining document content than others, such as “the” and
“is.” If t appears in n papers out of n, n/N conveys the term’s rarity and significance.
Notes
e
document frequency inversion
IDF = 1 + log(nt/N)
in
is used to differentially extend the vector space’s axes. Thus, in the weighted
vector space model, the tth coordinate of document d may be represented by the
value (n(d,t)/||d1||) IDF(t)). Despite being incredibly basic and unable to capture any
nl
part of language or semantics, this model frequently works effectively for the task in
hand. Despite some small differences, all of these textual models treat documents as
collections of terms rather than paying attention to the order in which the terms are
O
used. As a result, they are referred to as bag-of-words models as a whole. The results
of these keyword approaches are frequently expressed as relational data sets, which
may then be examined using one of the common data-mining methods.
Hypertext documents are a specific category of text-based documents that also include hyperlinks; they are the basic building blocks of the Web. Depending on the application, they are modelled at different levels of detail. In the most basic paradigm, hypertext can be conceptualised as a directed graph (D, L), where D is the set of nodes representing documents or Web pages and L is the set of links. When the focus is on the linkage between documents, crude models need not incorporate the text models at the node level, whereas more complex models characterise a joint distribution between the term distribution of a node and those of its neighbours in the graph.
The generated topographic map can show the degree of similarity between documents in terms of Euclidean distance, depending on the specific algorithm employed to build the landscape; the approach is comparable to the one used to create Kohonen feature maps. Concepts represented by groups of similar documents can then be extrapolated. Document-similarity measures of this kind are typically used:

◌◌ To make a search process more efficient and effective in finding related or comparable information, and
◌◌ To search an archive for duplicate data or documents.
A new set of functionalities called text mining is based mostly on text-analysis technologies. Text is the most common medium for the formal exchange of information, so even if attempts to automatically extract, organise, and use the information it contains succeed only partially, the motivation to do so is strong. While conventional, commercial text-retrieval systems rely on inverted indices built from statistics such as the number of times a word occurs in a document, text mining must offer benefits beyond simply retrieving text by keywords. Text mining can be defined as the process of analysing text to extract interesting, nontrivial information that is valuable for specific objectives; in essence, it is the search for semantic patterns in text.
Text mining is seen as having even greater commercial potential than traditional data mining on structured data, because text is the most natural way to store information; studies suggest that text documents hold roughly 80% of a company's information. However, because it deals with inherently fuzzy, unstructured text data, text mining is also a considerably more difficult process than traditional data mining.
Text mining is a multidisciplinary field that draws on IR, text analysis, information extraction, natural language processing, clustering, categorization, visualisation, machine learning, and other methodologies already on the data-mining "menu," along with some newer techniques developed for semi-structured data. Market research, business-intelligence gathering, e-mail management, claim analysis, electronic procurement, and automated help desks are only a few of its potential uses. The two stages of the text-mining process are shown graphically in Figure A: text refining, which turns free-form text documents into a chosen intermediate form (IF), and knowledge distillation, which extracts patterns or knowledge from the intermediate form.
Figure A: The two stages of text mining (text refining and knowledge distillation)
An intermediate form can be divided into two categories: document-based, where each entity represents a document, and concept-based, where each entity represents an object or concept of interest in a certain domain. Mining a document-based IF uncovers patterns and relationships among documents; examples include document categorization, visualisation, and clustering.
For a fine-grained, domain-specific knowledge-discovery task, it is important to perform semantic analysis and derive a representation rich enough to capture the relationships between the objects or concepts described in the documents. Mining a concept-based IF uncovers patterns and relationships among those objects and concepts; this group includes text-mining operations such as association discovery and predictive modelling. Making such computationally expensive semantic-analysis approaches efficient and scalable for very large text corpora remains a difficult task. A document-based IF can be converted into a concept-based IF by realigning or extracting the relevant data according to the objects of interest in a particular domain. As a result, a concept-based representation is typically domain-dependent, while a document-based IF is typically domain-independent.
The classification of text-mining approaches and tools is based on the text-refining and knowledge-distillation functions they perform, as well as the intermediate form they adopt. One group of methodologies, and of recently released commercial tools, focuses on document organisation, visualisation, and navigation; a different group focuses on text analysis, IR, classification, and summarization. A significant and broad subclass of these text-mining methods is based on document visualisation: the general strategy is to group or cluster documents by similarity and then display the groups or clusters visually.
Domain knowledge may be crucial to the text-mining process, but it is not currently used or exploited by most text-mining tools. In particular, domain expertise can be applied as early as the text-refining stage to improve parsing effectiveness and produce a richer, more compact intermediate form; research on exploiting domain knowledge in text mining is, however, still in its infancy.
Digital Libraries
Several text-mining methodologies and technologies are applied to extract patterns and trends from journals and conference proceedings maintained in text database repositories; such sources are valuable for research. Digital libraries are an excellent source of digitised text data, and they provide advanced access facilities, including multilingual interfaces. Along with text documents, they also support the extraction of information from audiovisual and image formats. Text-mining techniques carry out a variety of tasks here, including document collection, analysis, enrichment, data extraction, handling, and summarization. Various text-mining tools are available for digital libraries, including GATE, NetOwl, and Aylien.
Academic and Research Field
Different text-mining techniques and technologies are used in education to examine instructional patterns in a particular area of study. In the research sector, the primary goal of text mining is to support the discovery and organisation of research papers and relevant content from multiple fields on a single platform. k-Means clustering is commonly employed for this, as illustrated in the sketch below, and other techniques help to distinguish the characteristics of meaningful data. Information on student performance in different subjects is also available, and mining it can assess how various factors affect the choice of disciplines.
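The following minimal sketch groups research-paper abstracts with k-means, assuming scikit-learn is installed; the abstracts and the number of clusters are illustrative assumptions, not from the original text.

```python
# A minimal sketch: cluster paper abstracts by topic using TF-IDF + k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "deep learning for image classification",
    "convolutional networks and computer vision",
    "association rules for market basket analysis",
    "frequent itemset mining in retail transactions",
]

# Represent each abstract as a TF-IDF vector (bag-of-words model).
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Group the documents into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for doc, label in zip(abstracts, km.labels_):
    print(label, doc)
```

Papers with similar vocabulary end up with the same cluster label, which is the basis for organising related research content on one platform.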
Life Science
The life science and healthcare sectors generate vast amounts of textual and numerical data about patient records, illnesses, medications, disease symptoms, and therapies, among other topics. Filtering relevant information and text out of a biological data store in order to support decisions is a significant problem, and clinical records contain data of fluctuating, unpredictable length. Text mining helps manage such data; it is also used in the pharmaceutical sector, biomarker discovery, clinical research, and competitive intelligence for patents.
Social Media
Text mining can be used to monitor and examine online content such as plain text from blogs, e-mails, web journals, and other online sources. Text-mining tools help distinguish and investigate the number of posts, likes, and followers on social-media networks. This type of study shows how people react to various articles and news stories and how that content spreads, how members of a certain age group behave, and the range of reactions to the same post in terms of likes and views.
Business Intelligence
Business intelligence relies heavily on text mining to study customers and competitors so that organisations can make better decisions. It provides a thorough understanding of the business and indicates how to raise customer satisfaction and gain an advantage in the marketplace; IBM Text Analytics is one such text-mining tool. GATE helps companies generate alerts about good and bad performance and about market transitions, which aids in taking the necessary actions. The telecom industry, business in general, and customer and supply-chain management systems can all take advantage of this kind of mining.
A number of problems arise throughout the text-mining process:

●● Ambiguity can appear at the intermediate stage. During pre-processing, various criteria and rules are applied to standardise the content and facilitate text mining; unstructured data must be converted into a reasonable structure before pattern analysis can be performed on the document.
●● Occasionally, this adjustment can cause the original message or meaning to shift.
●● Another problem is that many text-mining algorithms and techniques must support text in multiple languages, which can lead to ambiguous meaning and falsely positive findings.
●● The use of synonyms, polysemy, and antonyms in document text causes problems for text-mining methods when they occur in the same context, making such texts and words difficult to classify.
5.1.2 Spatial Databases
Geographic Information Systems (GIS) manage geographic data and related applications; they are used in fields such as environmental monitoring, transportation systems, emergency response, and battle management. Such applications store objects with spatial characteristics in a spatial database. The spatial relationships among the objects are significant, as they are frequently needed when querying the database. Although, in general, a spatial database can refer to any n-dimensional space, we will restrict our discussion to two dimensions for simplicity.
Spatial queries use spatial parameters as predicates for selection. A spatial query would ask, for instance, "What are the names of all bookstores within five miles of the College of Computing building at Georgia Tech?" While standard databases handle numeric and character data, processing spatial data types requires additional capabilities.

To process a query such as "List all the customers located within twenty miles of company headquarters," spatial data types that are typically outside the scope of relational algebra must be handled; an external geographic database that maps the company headquarters and each customer onto a 2-D map based on their addresses may need to be consulted. Each customer is then associated with a latitude and longitude position. Since conventional indexes cannot organise multidimensional coordinate data, this query cannot be processed using a conventional B+-tree index based on customer zip codes or other nonspatial attributes. As a result, databases specifically designed to handle spatial data and spatial queries are required.
The common analytical operations used to process geographic or spatial data include measurement, spatial analysis, flow analysis, location analysis, digital terrain analysis, and spatial search; each is described below.
Measurement operations determine the area, relative size of an object's components, compactness, or symmetry of single objects, as well as the relative position of objects in terms of distance and direction. Spatial analysis operations, often using statistical methods, find spatial correlations within and among mapped data layers; for instance, a prediction map may show where customers for specific goods are most likely to be found, based on past sales and demographic data. Flow analysis methods find the shortest path between two points and the connectivity between nodes or regions in a graph. Location analysis determines whether a given set of points and lines falls within a given polygon; the procedure entails creating a buffer around existing geographic features and then identifying or selecting features according to whether they lie inside or outside the buffer's boundary. Digital terrain analysis represents the topography of a particular location with an x, y, z data model known as a Digital Terrain (or Elevation) Model (DTM/DEM), which is used to build three-dimensional models.

In a DTM, the x and y dimensions represent the horizontal plane, and z represents the spot height at the corresponding x, y coordinates. These models can be applied to the analysis of environmental data or to the planning of engineering projects that require knowledge of the terrain. Spatial search lets a user look for objects within a particular spatial area; thematic search, for instance, lets us look for objects associated with a specific theme or class, such as "Find all water bodies within 25 miles of Atlanta," where the category for the search is water.
)A
e
The common data types and models for storing spatial data are briefly described in
this section. There are three main types of spatial data. Due to their widespread use in
in
commercial systems, these forms have become the de facto industry standard.
nl
fundamental types of features (or areas). In the scale of a specific application, points
are used to describe the spatial characteristics of objects whose locations correspond
to a single pair of 2-d coordinates (x, y, or longitude/latitude). Buildings, cell towers, and
O
stationary vehicles are some instances of point objects, depending on the scale.
A series of point locations that change over time can be used to depict moving
cars and other moving things. A series of connected lines can approximate the spatial
ty
features of things with length, such as highways or rivers. When representing the
spatial properties of an item with a boundary, such as a nation, a state, a lake, or a city,
a polygon is utilised. Keep in mind that depending on the level of detail, some objects,
such cities or buildings, might be depicted as either points or polygons.
Attribute data is the descriptive information that GIS systems attach to map features. Imagine, for instance, a map whose features represent the counties of a US state (such as Texas or Oregon). Each county feature (object) may have attributes such as population, major city or town, and area in square miles. States, cities, congressional districts, census divisions, and other map features could all carry additional attribute data.

Image data consists of images captured by cameras, such as satellite images and aerial photographs. Objects of interest, such as highways and buildings, can be overlaid on these images. Map features can also have images as attributes, so that when a user clicks on the feature the image is displayed. Aerial and satellite photographs are typical forms of raster data.
Spatial information models fall into two main categories: field and object. Depending on the needs of the application and the customary model of the discipline, one or the other is adopted.
Spatial Operators

Spatial operators are used in spatial analysis to capture all the relevant geometric properties of objects embedded in the space, as well as the relationships between them. They are divided into three major groups.

Topological operators: Topological properties are invariant under transformations such as translation, rotation, and scaling. These operators are structured in several levels: the base level offers operators that check for detailed topological relationships between regions with a broad boundary, while the higher levels provide more abstract operators that let users query uncertain spatial data without relying on the underlying geometric data model. Examples include open (region), close (region), and inside (point, loop).

Projective operators: Projective operators, such as the convex hull, are used to express predicates about the concavity/convexity of objects as well as other spatial relations (for example, being inside the concavity of a given object).
Metric operators: Metric operators give a more detailed account of an object's geometry. They measure the relative positions of objects in terms of distance and direction, and they also measure global properties of single objects (such as area, the relative size of an object's parts, compactness, and symmetry). Examples include length (arc) and distance (point, point).

Dynamic spatial operators: The operators above are static in the sense that applying them has no effect on their operands; for instance, computing the length of a curve leaves the curve itself unchanged. Dynamic operators, by contrast, alter the objects on which they act. The three basic dynamic operations are create, destroy, and update; updating a spatial object can be broken down into translate (shift position), rotate (change orientation), scale up or down, reflect (produce a mirror image), and shear (deform).
Spatial Queries: Requests for spatial information that use spatial operations are known as spatial queries. Three common forms of spatial query are illustrated by the following categories; a small sketch of the first two appears after this list.

Range query: finds the objects of a particular type that lie within a given spatial region or within a given distance of a specified location. (For instance, find all hospitals within the Metropolitan Atlanta area, or all ambulances within five miles of an accident site.)

Nearest neighbour query: finds the object of a particular type that is closest to a given location. (For instance, find the police car nearest to the scene of a crime.)

Spatial joins or overlays: typically join objects of two different types based on some spatial condition, such as the objects intersecting, overlapping, or being within a specified distance of one another. (For instance, find all houses within two miles of a lake, or all townships on a major highway connecting two cities.)
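The following minimal Python sketch shows a brute-force range query and nearest-neighbour query over point objects; the coordinates and object names are illustrative assumptions, and no spatial index is used.

```python
# A minimal sketch of a range query and a nearest-neighbour query on points.
import math

hospitals = {
    "North Clinic": (33.90, -84.35),
    "Midtown Hospital": (33.78, -84.38),
    "South Medical Center": (33.65, -84.40),
}

def distance(p, q):
    # Plain Euclidean distance; real GIS systems would use geodetic distance.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(objects, center, radius):
    # All objects within `radius` of `center`.
    return [name for name, loc in objects.items() if distance(loc, center) <= radius]

def nearest_neighbor(objects, point):
    # The single object closest to `point`.
    return min(objects, key=lambda name: distance(objects[name], point))

accident = (33.77, -84.39)
print(range_query(hospitals, accident, 0.10))   # range query
print(nearest_neighbor(hospitals, accident))    # nearest-neighbour query
```

A production system would answer the same queries through a spatial index such as a grid file or R-tree, which is exactly what the next subsection introduces.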
A spatial index organises objects into a set of buckets so that objects in a particular spatial region can be located efficiently. Each bucket has a bucket region, the part of space containing all the objects stored in that bucket. For point data structures, the bucket regions, which are often rectangles, partition the space so that each point belongs to exactly one bucket. A spatial index can be provided in essentially two different ways.

First, the database system can offer specialised indexing structures that enable efficient search for data objects based on spatial search operations; these structures serve the same purpose as the B+-tree indexes of conventional database systems. Grid files and R-trees are two examples. Specialised spatial indexes, known as spatial join indexes, can also speed up spatial join operations.
Second, instead of developing brand-new indexing structures, two-dimensional spatial data can be mapped onto one-dimensional data so that conventional indexing techniques (such as the B+-tree) can be employed. Space-filling curves are the standard technique for this 2-d to 1-d transformation; we will not go into great depth about them here (see the Selected Bibliography for further references).
Grid Files. Grid files are used for indexing spatial data in two and higher n dimensions. The fixed-grid approach divides an n-dimensional hyperspace into equal-sized buckets and is implemented with an n-dimensional array as the data structure. Objects whose spatial locations lie wholly or partially within a cell can be stored in a dynamic structure to handle overflows. This structure is useful for uniformly distributed data, such as satellite imagery; the fixed-grid structure is rigid, however, and its directory can be both sparse and large.
R-Trees. The R-tree is a height-balanced tree that extends the B+-tree to k dimensions, where k > 1. In two-dimensional (2-d) space, spatial objects are represented in the R-tree by their minimum bounding rectangle (MBR), the smallest rectangle with sides parallel to the coordinate system's (x and y) axes that encloses the object. R-trees have the following characteristics, which are similar to those of B+-trees but tailored to 2-d spatial objects.

Each index record (entry) in a leaf node has the structure (I, object-identifier), where I is the MBR of the spatial object whose identifier is object-identifier.

Except for the root, every node must be at least half full. A leaf node that is not the root holds m entries (I, object-identifier), where M/2 <= m <= M. Similarly, a non-leaf node that is not the root holds m entries (I, child-pointer), where M/2 <= m <= M and I is the MBR that contains the union of all the rectangles in the node that child-pointer points to.

Unless it is a leaf, the root node contains at least two pointers, and all leaf nodes appear at the same level.

Every MBR is axis-aligned; that is, each of its sides is perpendicular to one of the axes of the global coordinate system.
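The minimal sketch below shows how the MBRs used in R-tree entries can be computed; the sample polygons and function names are illustrative assumptions.

```python
# A minimal sketch of minimum bounding rectangles (MBRs) for R-tree entries.
def mbr(points):
    """Return the axis-aligned MBR as ((x_min, y_min), (x_max, y_max))."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def mbr_union(a, b):
    # The MBR stored in a non-leaf R-tree entry encloses all the rectangles
    # of the child node: it is the union of the children's MBRs.
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return (min(ax1, bx1), min(ay1, by1)), (max(ax2, bx2), max(ay2, by2))

river = [(1.0, 2.0), (2.5, 3.5), (4.0, 3.0), (5.5, 4.5)]
lake = [(3.0, 0.5), (3.5, 1.5), (4.5, 1.0)]
print(mbr(river))                        # leaf-level MBR for one object
print(mbr_union(mbr(river), mbr(lake)))  # MBR stored one level up the tree
```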
Other spatial storage structures include the quadtree and its derivatives. Quadtrees typically divide each space or subspace into equal-sized quadrants in order to locate objects. Although many newer spatial access structures have been proposed recently, this remains an active field of study.

Spatial Join Index. A spatial join index precomputes a spatial join operation and stores pointers to the related objects. Join indexes improve the performance of recurring join queries over tables whose contents are updated infrequently. Spatial join conditions are used to answer requests such as "Create a list of highway-river pairs that intersect"; the spatial join finds and retrieves the pairs of objects that satisfy the spatial relationship. Because computing spatial relationships is typically time-consuming, the results can be computed once and stored in a table containing the pairs of object identifiers (or tuple ids) that satisfy the relationship; this table is essentially the join index.
A join index can be described by a bipartite graph G = (V1, V2, E), where V1 holds the tuple ids from relation R and V2 holds the tuple ids from relation S. The edge set contains an edge (vr, vs), for vr in R and vs in S, if there is a tuple corresponding to (vr, vs) in the join index; all associated tuples are thus represented as connected vertices in the bipartite graph. Spatial join indexes are employed by operations that compute associations among spatial objects.
Spatial data frequently exhibits strong correlation; for instance, people with similar characteristics, occupations, and backgrounds tend to cluster in the same neighbourhoods. The three main spatial data mining techniques are spatial classification, spatial association, and spatial clustering.

Spatial classification. The goal of classification is to estimate the value of an attribute of a relation based on the values of the relation's other attributes. An example of the spatial classification problem, also known as the location prediction problem, is identifying the locations of nests in a wetland based on the values of other attributes (such as vegetation durability and water depth); predicting hotspots of criminal activity is a similar location-prediction challenge.
Spatial association. Spatial association rules are defined in terms of spatial predicates rather than items. A rule of the form P1 ^ P2 ^ ... ^ Pn ⇒ Q1 ^ Q2 ^ ... ^ Qm is a spatial association rule if at least one of the Pi or Qj is a spatial predicate. For example, the rule

is_a(x, country) ^ touches(x, Mediterranean) ⇒ is_a(x, wine-exporter)

(that is, a country that borders the Mediterranean Sea is typically a wine exporter) is an association rule that will have a certain support s and confidence c.

Spatial co-location rules attempt to generalise association rules to data sets that are indexed by space. There are several important differences between spatial and nonspatial associations, including the following:
The size of item sets in spatial databases is small; that is, in the spatial setting there are far fewer items in an item set than in a nonspatial setting.

Spatial items are often discretised versions of continuous variables. For instance, in the United States, income regions may be categorised as areas where the mean annual income falls within a given range, such as below $40,000, $40,000 to $100,000, or above $100,000.
The goal of spatial clustering is to group database objects so that the most similar objects fall in the same cluster and the least similar objects fall in different clusters. One common approach is density-based clustering, in which algorithms regard clusters as dense regions of objects in the data space; DBSCAN and DENCLUE are two variants of these methods. DBSCAN is a density-based clustering algorithm because it identifies clusters from the estimated density distribution of the corresponding nodes. A small sketch appears below.
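The following minimal sketch applies density-based clustering to 2-d point locations, assuming scikit-learn is installed; the coordinates and the eps/min_samples values are illustrative assumptions.

```python
# A minimal sketch of density-based spatial clustering with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

# Point locations (e.g., nest sites or incident reports) in planar units.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense region A
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # dense region B
    [4.5, 4.5],                            # isolated point -> noise
])

# eps is the neighbourhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise
```

Dense regions become clusters, while isolated points are labelled as noise, which matches the idea of treating clusters as dense regions of objects in the data space.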
Applications of Spatial Data
Spatial data management benefits numerous fields, including geography, remote sensing, urban planning, and natural resource management, and it greatly aids the solution of complex scientific problems such as global climate change and genomics. Because genetic data has a spatial character, GIS and spatial database management systems also have a significant role to play in bioinformatics.

Pattern recognition, the building of genome browsers, and map visualisation are some of the main applications there; for instance, one might check whether the topology of a specific gene in the genome is present in any other sequence feature map in the database. The detection of spatial outliers is another key application of spatial data mining. A spatially referenced object is considered an outlier if its nonspatial attribute values deviate considerably from those of the other spatially referenced objects in its neighbourhood. For instance, if a neighbourhood of older homes contains just one brand-new home, that home would be an outlier with respect to the nonspatial attribute "house age." Many application domains can make use of the ability to detect spatial outliers; transportation, ecology, public safety, public health, climatology, and location-based services are only a few of them.
5.1.3 Web Mining
Tim Berners-Lee, the creator of the World Wide Web, published the first webpage in 1991; as of 2018 there were 1.8 billion websites in existence. Along with this expansion, the amount of data that is available, and the need to organise it in order to extract usable information from it, have both grown exponentially. Special techniques are therefore required to extract the necessary data from web sites; these techniques are referred to as web mining. Web mining is the process of using machine learning and data mining techniques to extract information from the data on web pages.
The figure below divides web mining into three categories: usage mining, structure mining, and content mining.

Figure: Categories of Web mining
Web content mining is the process of obtaining pertinent information from a web page's content. In content mining we disregard entirely how users interact with a particular web page and how other web pages link to it. A simple method of web content mining relies on the placement and frequency of keywords. However, this leads to two issues: scarcity and abundance. The issue of scarcity arises with queries that produce few results or none at all; the issue of abundance arises when queries produce an excessive number of results. The underlying cause of both issues is the nature of the data on the web: the material is typically dispersed across several web pages and is typically presented as semi-structured HTML.
●● Keyword-based clustering: Instead of returning a list of web pages ranked in order, the main idea is to group meaningfully related web pages together. Cluster-analysis methods such as k-means and agglomerative clustering can be applied for this; a vector of words and their frequencies on a given web page serves as the input attribute set for the clustering algorithm. These grouping methods, however, do not produce good enough results. Another method of web document clustering is based on Suffix Tree Clustering.
●● Suffix Tree Clustering (STC): STC uses phrase clustering rather than keyword-frequency clustering. STC works as follows:
Step 1: Download the text from the website. Remove frequent (stop) words and punctuation from each sentence, and reduce the remaining words to their base form.
Step 2: Build a suffix tree from the list of words obtained in Step 1.
Step 3: Compare the trees obtained from different documents. Documents whose trees share the same root-to-leaf sequences of words are grouped into a single cluster.
As the figure below suggests, STC tries to cluster documents in a more meaningful, phrase-oriented way.

Figure: STC example
●● Resemblance and containment: It is also necessary to remove duplicate or nearly identical web pages from the search results in order to improve query results. The ideas of resemblance and containment are useful for this. Resemblance describes how similar two documents are to one another; its value ranges from 0 to 1 (both inclusive), where 1 represents identical documents and a value close to 1 denotes documents that are nearly identical. Containment measures the presence of the first document inside the second, with a value of 1 indicating that the first document is fully contained in the second and a value of 0 indicating the opposite.

The idea of shingles is used to quantify resemblance and containment. The text is broken into sets of contiguous L-word sequences, and these patterns are referred to as shingles. For two documents X and Y with shingle sets S(X) and S(Y), resemblance R(X, Y) and containment C(X, Y) are defined as:

R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|

That is, resemblance is the number of shingles shared by documents X and Y divided by the total number of distinct shingles in the two documents, and containment is the number of shared shingles divided by the number of shingles in document X.
A naive approach is to compare each document pair using the Linux diff programme. A different strategy is based on the idea of fingerprinting.

●● Fingerprinting: To perform fingerprinting, a document is broken down into contiguous sequences of words (shingles) of a chosen length. Consider two short documents as an example, the first of which reads:

Document 1: I love machine learning.

Listing each sequence of shingle length two for the two texts gives the table below.

Table: Sequences of length two
From the sequences listed in the table (sequences of length two), we can easily observe that only one of the three sequences matches; this is how two documents are compared for similarity. Even though this method is quite accurate, it is rarely employed because it is inefficient for documents with a large number of words.
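The following minimal sketch implements shingling and the resemblance and containment measures defined above; the two sample documents are illustrative assumptions.

```python
# A minimal sketch of shingling plus resemblance/containment for two documents.
def shingles(text, L=2):
    """Return the set of contiguous L-word sequences (shingles) in `text`."""
    words = text.lower().replace(".", "").split()
    return {tuple(words[i:i + L]) for i in range(len(words) - L + 1)}

def resemblance(x, y):
    # R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
    sx, sy = shingles(x), shingles(y)
    return len(sx & sy) / len(sx | sy)

def containment(x, y):
    # C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|
    sx, sy = shingles(x), shingles(y)
    return len(sx & sy) / len(sx)

doc1 = "I love machine learning."
doc2 = "I love data mining and machine learning."
print(resemblance(doc1, doc2))   # shared shingles over the union
print(containment(doc1, doc2))   # how much of doc1 appears inside doc2
```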
Web usage mining is the process of obtaining valuable information from log data about how users interact with websites. Its primary goal is to predict user behaviour when users engage with a website, in order to improve customer focus, monetization, and commercial strategy. For instance, if the bulk of visits to a particular web page come from Facebook pages rather than Twitter, it would be far more advantageous to invest in Facebook ads than in Twitter ads. Information like that outlined in the table below (important parameters for web usage mining) is collected from server logs and can be grouped into transactions similar to those used in market basket analysis, where each node (page) visited is treated as an item being purchased. Difficulties arise, however, because users commonly return to a node while traversing the web, whereas in a market basket an item typically appears only once in a transaction.

Such analysis can unearth hidden knowledge. For instance, the analysis may reveal that, with 75% or higher confidence, every time a visitor views page A they also visit page B. Associations of this kind are helpful because it may be possible to rearrange pages A and B so that the information the user is looking for on page B is made available directly on page A, making page A more customer-centric.
Table: Important parameters for web usage mining
Web Structure Mining

The goal of web structure mining is to extract information from the web's hyperlinked structure. Web structure plays a significant role both in ranking web pages and in recognising the pages that act as authorities on particular kinds of information. In addition to locating authority sites, web structure is used to find hubs, the websites that link to multiple authority websites.

The HITS algorithm, which analyses web structure to identify hubs and authorities, is presented next; the PageRank algorithm, which also uses web structure to rank web pages, is discussed later.
●● Hyperlink-Induced Topic Search (HITS) algorithm: The HITS algorithm, sometimes referred to as hubs and authorities, examines the web's hyperlinked structure to rank web pages. It was created by Jon Kleinberg in 1999, when the Internet was young and web-page directories were widely used.

The hub and authority concepts form the foundation of HITS. The algorithm is based on the idea that websites acting as directories are not themselves authorities on any subject, but rather serve as hubs that direct users to other websites better qualified to provide the needed information.
Let's use an illustration to better grasp the distinction between these two concepts. A hyperlinked web structure is shown in the figure (Hubs and Authority), where H1, H2, and H3 represent search directory pages, and A, B, C, and D represent web pages to which the directory pages have outbound links. Pages such as H1, H2, and H3 serve as information hubs: they do not themselves hold the information, but they direct users to other pages that, in their view, contain the required data. In other words, the outbound hyperlinks from hubs vouch for the legitimacy of the pages they point to. A good authority is a page that is linked to by many distinct hubs, and a good hub is a page that points to many other pages. Formally, page H1 confers some authority on page A if it links to page A.

Figure: Hubs and Authority
Let's look at an illustration to better grasp how the HITS algorithm works. The web page structure is depicted in the figure (Example of Web page structure) below, where each node stands for a different web page and arrows indicate links between the vertices.

Figure: Example of Web page structure

Step 1: For the calculations that follow, represent the given web structure as an adjacency matrix A.

Step 2: Obtain the transpose of the adjacency matrix.

Step 3: Assuming the initial hub weight vector is all ones, the authority weight vector is obtained by multiplying the initial hub weight vector by the transposed matrix, as shown in Figure D below.
Figure D: Obtaining the Authority Weight Vector

Step 4: Multiply the authority weight vector produced in Step 3 by the adjacency matrix A to obtain the updated hub weight vector.

Figure E: Updated Hub Weight Vector
The graph in the figure (Example of web page structure) can be annotated with the hub and authority weights computed above and displayed as shown in Figure F below.

Figure F: Web page structure with hub and authority weights

The HITS algorithm has now completed one iteration. Steps 3 and 4 are repeated to obtain updated authority and hub weight vectors and considerably more precise results. Once the weights stabilise, authorities can be displayed in search results in decreasing order of authority weight; for instance, according to Figure F, page N4 has the highest authority for a given term because it is linked from the majority of highly ranked hub pages.
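The following minimal sketch implements the hub/authority iteration of Steps 1-4 with numpy; the adjacency matrix is an illustrative assumption, not the structure shown in the figures.

```python
# A minimal sketch of the HITS hub/authority iteration.
import numpy as np

# Step 1: adjacency matrix A, where A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

hub = np.ones(A.shape[0])          # initial hub weight vector of ones

for _ in range(20):                # repeat Steps 3 and 4
    authority = A.T @ hub          # Step 3: authority = A^T . hub
    hub = A @ authority            # Step 4: hub = A . authority
    authority /= np.linalg.norm(authority)   # normalise to keep values bounded
    hub /= np.linalg.norm(hub)

print("authority weights:", authority.round(3))
print("hub weights:", hub.round(3))
```

After a few iterations the weights converge, and pages can be listed in decreasing order of authority weight, exactly as described above.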
Some of the uses of web mining are:

●● Web mining increases the power of web search engines by categorising web content and locating web pages.
●● It is used for vertical search engines, such as FatLens and Become, and for general web search engines, such as Google and Yahoo.
●● Web mining is used to predict user behaviour.
●● It is very helpful for specific websites and e-services, for example in landing-page optimization.
5.2 Multimedia Web Mining
Transmitting and storing vast volumes of multi/rich media data (such as text,
photos, music, video, and their combination) is now more practical and economical
O
than ever thanks to recent advancements in digital media technologies. Modern
methods for processing, mining, and managing those rich media are still in their infancy.
Rapid advancement in multimedia gathering and storage technologies has resulted
ty
in an extraordinary amount of data being kept in databases, which is now rising at an
incredible rate.
These multimedia files can be evaluated to disclose information that will be useful
si
to users. Extraction of implicit knowledge, links between multimedia data, or other
patterns that aren’t expressly stored in multimedia files is the subject of multimedia
mining. The system’s overall effectiveness depends on the efficient information fusion of
r
the many modalities used in multimedia data retrieval, indexing, and classification.
The World Wide Web is today a popular and interactive medium for finding information. The web's size, diversity, and dynamic nature raise challenges of scalability, multimedia data handling, and timeliness, and almost all sectors of business, research, and engineering need the ability to make sense of huge, complicated, and heterogeneous data.

Web mining is the overall process of using computer-based technology to discover and extract knowledge from web resources. Web multimedia mining naturally sits at the intersection of several disciplines, including computer vision, multimedia processing, multimedia retrieval, data mining, machine learning, databases, and artificial intelligence.
A multimedia system typically manages the following kinds of data:
◌◌ Media data: the actual data used to represent an object.
◌◌ Media format data: data concerning the format of the media after it has gone through the acquisition, processing, and encoding stages, such as sampling rate, resolution, and encoding scheme.
◌◌ Media keyword data: keywords describing how the data was produced, also known as content-descriptive data; for example, the date, time, and place of a recording.
◌◌ Media feature data: content-dependent information, such as the distribution of colours, the kinds of textures, and shapes.
Multimedia data cubes can be designed and constructed in a manner similar to traditional data cubes built from relational data, in order to enable multidimensional analysis of large multimedia datasets. A multimedia data cube can include additional dimensions and measures for multimedia information, such as colour, texture, and shape.

Consider MultiMediaMiner, a prototype multimedia data mining system that extends the DBMiner system by handling multimedia data. The example database used to test the MultiMediaMiner system was constructed as follows. Each image contains a feature descriptor and a layout descriptor; only the image's descriptors are stored in the database, not the original image. The descriptive information includes the image file name, the image URL, the image type (such as gif, jpeg, bmp, avi, or mpeg), a list of all known Web pages referring to the image (i.e., parent URLs), a list of keywords, and a thumbnail used by the user interface for browsing images and videos.

The layout descriptor contains a colour layout vector and an edge layout vector. Regardless of its original size, each image is mapped onto an 8 x 8 grid; the colour layout vector stores the most prevalent colour in each of the 64 cells, while the edge layout vector stores the number of edges of each orientation in each cell. Grids of other sizes, such as 4 x 4, 2 x 2, and 1 x 1, are easy to generate.
The Image Excavator component of MultiMediaMiner uses contextual information from Web pages, such as HTML tags, to derive keywords for an image.

A multimedia data cube can have many dimensions. A few examples are: the Internet domain of the image or video; the Internet domain of the pages referencing it (the parent URL); the keywords; a colour dimension; an edge-orientation dimension; the size of the image or video in bytes; the width and height of the frames (or picture), which constitute two dimensions; the date the image or video was created (or last modified); its format type; the duration of the frame sequence in seconds; and so forth. Concept hierarchies can be defined automatically for many numerical dimensions, while predefined hierarchies may be used for other dimensions.
The construction of a multimedia data cube facilitates multidimensional analysis of multimedia data based primarily on visual content, as well as the mining of knowledge through summarisation, comparison, classification, association, and clustering.

The multimedia data cube appears to be an interesting model for the multidimensional analysis of multimedia data, but it is important to note that a data cube with many dimensions can be difficult to implement efficiently. This curse of dimensionality is especially severe for multimedia data cubes, where colour, orientation, texture, keywords, and other characteristics may each be represented as dimensions, and where many of these attributes are set-valued rather than single-valued. One image, for instance, may be associated with a list of keywords, or may contain a set of objects, each associated with a spectrum of colours. If each keyword, or each finely distinguished colour, is used as a dimension, the data cube design will have an enormous number of dimensions; yet failing to do so models a picture only at a rather rough, limited, and imprecise level. Striking a balance between efficiency and representational power in a multimedia data cube requires further study.
Data mining plays a substantial part in separating valuable information from masses of huge data. By examining patterns and idiosyncrasies, we can determine the relationships between different data sets; when unprocessed raw data is transformed into meaningful information, it can be used to advance a variety of industries on which we rely every day.

In a continuously dynamic and competitive environment, the telecommunications industry generates vast data sets of customer, network, and call data, and it needs a straightforward way of handling this data if it is to succeed. Data mining is used in this area to improve operations and address problems: identifying scam calls and identifying network faults in order to isolate problems are two of its main functions, and it also helps build efficient marketing strategies. Nevertheless, this industry faces difficulties in handling the logical and temporal aspects of data mining, and it must be able to anticipate rare events in telecommunication data if the mining is to be effective.
m
The specifics of every call that begins in the telecommunications network are
logged. The time and date that it occurs, the length of the call, and the time that it
concludes. Since a call’s entire data set is gathered in real-time, data mining techniques
can be applied to process it. But rather than separating data by isolated phone call
levels, we should do so by client level. Thus, the consumer calling pattern can be
(c
e
◌◌ average call time duration
◌◌ Whether the call occurred throughout the day or night
in
◌◌ The usual volume of calls during the week
◌◌ calls made using different area codes
nl
◌◌ daily number of calls, etc.
Picking up on the right consumer call information can advance the growth of the firm. A customer who places most calls during daytime business hours can be identified as a business customer, whereas a large number of calls at night suggests domestic or residential use. Frequent variation in the area codes called can also help distinguish commercial from residential customers, because residential calls only occasionally span many area codes. However, information gathered in the evening alone cannot indicate precisely whether the customer belongs to a commercial or residential enterprise. A sketch of this kind of customer-level aggregation appears below.
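The following minimal sketch aggregates raw call records to the customer level, assuming pandas is installed; the call-detail records are illustrative assumptions, not real telecom data.

```python
# A minimal sketch of building customer-level calling-pattern features.
import pandas as pd

calls = pd.DataFrame({
    "customer_id": [101, 101, 101, 202, 202],
    "duration_min": [12.0, 3.5, 7.2, 1.0, 2.3],
    "hour_of_day": [10, 14, 11, 22, 23],      # 24-hour clock
    "area_code": ["404", "212", "404", "404", "404"],
})

profile = calls.groupby("customer_id").agg(
    avg_duration=("duration_min", "mean"),
    daytime_share=("hour_of_day", lambda h: ((h >= 8) & (h <= 18)).mean()),
    distinct_area_codes=("area_code", "nunique"),
    calls_total=("duration_min", "size"),
)
print(profile)   # customer 101 looks commercial; 202 looks residential
```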
Data of Customers:
The telecommunications sector serves a very large number of customers, and the customer database is retained for further data-mining queries. For instance, when a customer fraud case is discovered, these customer details, such as the person's name and address, help identify the individual, making it simple to find them and resolve the problem. This dataset can also be enriched from outside sources, because such information is widely available, and it additionally contains the subscription plan selected and a complete payment history. Proper use of this customer data can accelerate the growth of the telecommunications business.
Network Data:
Because telecommunication networks use highly developed, complicated equipment, every component of the system can generate faults and alerts, so a significant volume of network data must be processed. If the system is to isolate a network issue, this data must be divided, sorted, and stored in the correct sequence; this guarantees that the technical specialist receives the error and status messages from every component of the network system and can therefore fix it. Because the database is so huge, manually fixing issues becomes challenging when many status or error messages are created, so some sets of errors and notifications can be handled automatically to ease the burden. A methodical data-mining approach can manage the network system effectively and improve its functionality.

Although data mining works on raw data, the data must first be well organised and logically structured before it can be processed. This is a crucial requirement in the telecommunications sector, which deals with enormous databases. To avoid inconsistency, it is first necessary to identify conflicting and contradictory facts, ensuring that the data is consistent before mining begins.
Data-mining algorithms can cluster or group related data, which aids the analysis of patterns such as call patterns or consumer behaviour patterns. Groups are formed by looking at similarities, which makes the data simpler to understand and therefore easier to process and use.
Customer Profiling:
The telecommunications sector handles a vast amount of customer information. Profiling begins by looking for patterns in a customer's call data in order to forecast future trends; by understanding those trends, the company can choose its customer-promotion strategies. If many calls fall within a specific area code, for example, a promotion offered in that area will attract a group of customers. This makes promotion strategies pay off and saves the business from investing in a single subscriber, since with the right strategy it can draw a crowd. Privacy concerns do arise, however, whenever a customer's call history or other information is tracked.
Customer churn is one of the major issues the telecommunications sector has to deal with. Also known as customer turnover, it occurs when a client leaves a business, in this case cancelling and choosing a different telecommunications provider. A company with a high churn rate suffers a significant loss of income and profit, which slows its growth. Using data-mining tools to profile customers and track their trends helps address this problem, because companies' incentive offerings draw loyal customers away from their competitors. By profiling the data, customer churn can be predicted accurately from customer behaviour, such as subscription history, the plan chosen, and so on; a small sketch follows. Note that while data can be collected from paying customers, there are restrictions on collecting data from call recipients or non-customers.
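The following minimal sketch predicts churn from profiled customer behaviour, assuming scikit-learn is installed; the features and labels are illustrative assumptions, not real subscriber data.

```python
# A minimal sketch of churn prediction from customer-profile features.
from sklearn.linear_model import LogisticRegression

# Each row: [months_subscribed, monthly_spend, complaints_last_quarter]
X = [
    [24, 40.0, 0],
    [3,  15.0, 2],
    [36, 55.0, 0],
    [5,  18.0, 3],
    [18, 35.0, 1],
    [2,  12.0, 4],
]
y = [0, 1, 0, 1, 0, 1]   # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Estimate the churn probability for a new customer profile.
new_customer = [[6, 20.0, 2]]
print(model.predict_proba(new_customer)[0][1])   # probability of churn
```

In practice the profile features would come from the call, customer, and subscription data described earlier, and the model would be validated on held-out customers.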
As stated previously, fraudulent activity can be reduced by using data-mining tools to gather customer data and model behaviour such as call details. Frauds are easy to spot when detection is done in real time; this is accomplished by comparing the suspected call behaviour to generic fraud characteristics, so that a call pattern resembling generic fraud can be flagged. The fraud-detection procedure is improved by gathering data at the customer level rather than at the level of individual calls. When frauds are classified incorrectly, the company may suffer losses, so it must weigh the relative cost of dropping a fraudulent call against that of blocking a legitimate account holder wrongly suspected of fraud. The proper application of data mining makes it possible to address this trade-off accurately.
In the dynamic and quickly expanding retail sector, the consumption of goods rises daily, increasing the amount of data that is gathered and used. The retail industry covers the selling of products to consumers, ranging from a small street stand to huge malls in urban areas. The proprietor of a grocery store in a particular area, for instance, may know their customers' after-sales information for a few months; if he takes note of his customers' needs, it becomes simple to increase sales. The major retail chains experience the same thing: customers' opinions of a product, their location, their time zone, their shopping-cart history, and so on are all collected, and the company can produce customised advertisements based on customer preferences to boost sales and profits.
Knowing the Customers:
What good are sales if a retailer does not know who its customers are? There is unquestionably a need to understand them, and this begins by running a number of analyses. By identifying the channel through which a buyer learns about the retailing platform, retailers can improve their advertising and draw in an entirely different demographic; identifying the days on which customers typically shop helps with special promotions or boosts during festival days.

Growth can also be improved by using the amount of time and money customers spend on each order. The retailer can divide the customer base into groups of high-value, medium-value, and low-value orders based on the amount spent, which helps either to introduce personalised packages or to enlarge the pool of targeted clients. By understanding their customers' language and preferred payment methods, retailers can please them with the required services. Maintaining positive customer interactions increases trust and loyalty, which can translate into quick financial gains for the shop, and customer loyalty helps the business survive competition from other, similarly sized businesses.
RFM Value:
RFM stands for Recency, Frequency, and Monetary value. Recency refers to how recently the customer made a transaction, frequency to how often they purchase, and monetary value to the total amount of their purchases. RFM can increase revenue by retaining current and new clients and giving them outcomes they are delighted with, and it can help win back lapsed clients who tend to make smaller purchases. Sales growth is closely tied to RFM scores. RFM also helps apply new marketing strategies to low-ordering clients and avoids sending offers to customers who are already engaged; it thus aids in the discovery of original solutions. A small scoring sketch follows.
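The following minimal sketch computes RFM values and simple quantile-based scores from a transaction table, assuming pandas is installed; the transactions and the 1-3 scoring scheme are illustrative assumptions.

```python
# A minimal sketch of RFM (Recency, Frequency, Monetary) scoring.
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-05",
    ]),
    "amount": [40.0, 25.0, 10.0, 60.0, 30.0, 45.0],
})

today = pd.Timestamp("2024-03-10")
rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (today - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                          # number of purchases
    monetary=("amount", "sum"),                           # total spend
)

# Score each dimension 1-3 (3 = best); recency is negated so recent = better.
rfm["R"] = pd.qcut(-rfm["recency"], 3, labels=[1, 2, 3]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 3, labels=[1, 2, 3]).astype(int)
print(rfm)
```

Customers with high combined scores are the engaged buyers, while low scores identify lapsed or low-ordering clients for targeted campaigns.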
Market-based Analysis:
Market-based (market-basket) analysis is a method for examining a customer's shopping behaviour in order to boost income and sales. It is done by studying the datasets of a given customer to learn their purchasing patterns, commonly purchased items, and combinations of items.

The loyalty card that customers receive from a shop is a prime example. From the customer's perspective, the card is needed to keep track of future discounts, incentive criteria, and transaction history; from the retailer's perspective, however, market-based analysis programmes are layered behind it to gather information about each transaction.
Various data-mining techniques can be employed for this analysis, and the IDs assigned to transactions can be used to organise the spreadsheets. The analysis helps recommend to the customer goods that go well with their current purchase, resulting in cross-selling and increased earnings. Monitoring the rate of purchases each month or year is also helpful: it indicates when the shop should make the required offers in order to draw the right clients to the desired goods. A minimal association-rule sketch appears below.
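The following minimal sketch computes the support and confidence of a single association rule from transaction data; the transactions and the rule {bread} -> {butter} are illustrative assumptions.

```python
# A minimal sketch of market-basket analysis: support and confidence of a rule.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A ∪ C) / support(A)
    return support(antecedent | consequent) / support(antecedent)

rule_a, rule_c = {"bread"}, {"butter"}
print("support:", support(rule_a | rule_c))      # 3/5 = 0.6
print("confidence:", confidence(rule_a, rule_c)) # 3/4 = 0.75
```

A rule with high confidence, such as this 75% example, is exactly the kind of pattern used to recommend complementary goods at checkout.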
Potent Sales Campaign:
Advertising is now a must, because product advertising spreads awareness of a product's availability, benefits, and characteristics and moves the item out of the warehouse and into the world. Data must be examined if a retailer's call to action or marketing campaign is to draw in the right clients; if campaigns are launched without proper planning, excessive spending on untargeted advertisements can lead to a company's demise. A sales campaign should be based on the customer's preferences, time, and location.
Running a successful campaign means reaching the right customers. It requires routine study of the sales and related data generated on a specific platform in a specific period; the popularity of the promoted product is indicated by the traffic on social or network platforms. With the help of previous statistics, the shop can adjust the campaign in a way that quickly boosts sales profit and avoids waste. Understanding customer and business revenues improves the use of marketing: the shop can decide whether or not to spend on a campaign based on the volume of sales it generates. Effective management of data can turn a trial-and-error process into a well-informed method, and a multi-channel sales campaign further improves sales analysis and increases revenue, profit, and customer count.
Target marketing is a strategy for attracting clients who are judged likely to buy. In today's world of escalating demands and fierce competition, target marketing has become essential, and making marketing strategies more customer-focused is crucial for success.

To develop fresh campaigns for their current clients, businesses must first conduct a thorough customer analysis. They can cluster or group clients who share characteristics, which helps them develop more effective marketing plans for particular customer segments. Increasing leads is also a crucial task if the organisation is to attract new clients, and lead generation likewise benefits from mining customer data.
Data-mining techniques can be used to find trends and identify customer traits in order to improve customer satisfaction and create marketing plans that boost revenue. Since there are many different types of clients, each with a unique set of wants and preferences, market segmentation is essential: divide the overall market, select the best segments, and develop profitable business plans that serve the selected group better than the competition does. Data-mining technologies and procedures help businesses sort through layers of data that at first glance appear unrelated, searching for significant relationships and identifying and tracking trends within the data.
Data is a collection of discrete, objective facts about a process or an event that, by themselves, are not very useful and must be transformed into information. We gather a wide range of data, from straightforward numerical measures and text documents to more intricate information such as location data, multimedia channels, and hypertext documents.

Massive amounts of data are now being gathered; according to some reports, the amount collected nearly doubles each year. Data-mining techniques are used to extract knowledge and seek insight from this huge volume of data, and nearly every organisation that stores and processes a lot of data uses data mining. Banks, for instance, frequently employ data mining to identify existing clients who might also be interested in insurance, personal loans, or credit cards: because they hold complete profiles of their clients along with transactional information, they can look for patterns that help them forecast which customers are likely to be interested in a personal loan and so on.

The fundamental driving force behind data mining, whether for commercial or scientific purposes, is the urge to uncover relevant information in data to support better decision-making or a better understanding of the world around us.
Technically speaking, data mining is the computer process of examining data from
many angles, dimensions, and perspectives before classifying or summarising it to
produce useful information. Any form of data, such as that found in data warehouses, databases, or other repositories, can be mined. This is accomplished by giving users the most information possible so that they can quickly come to informed business decisions despite the vast amount of data at their disposal.
Fraud Detection:
Fraud is a serious issue for the telecommunications sector since it results in lost
income and deteriorates consumer relations. Subscription fraud and superimposed
scams are two of the main types of fraud involved. The subscription scam involves
gathering client information, such as name, address, and ID proof information, primarily
via KYC (Know Your Customer) documentation. These details are used to register for telecom services with authenticated approval, while there is no intention of paying for the services used through the account. In addition to continuing to use services illegally,
some offenders also engage in bypass fraud by switching voice traffic from local to
international protocols, which costs the telecommunications provider money.
Superimposed frauds begin with a legitimate account and lawful activity, but they later result in overlapped or imposed behaviour by someone else who is utilising the services illegally rather than the account holder. By observing the account holder's behaviour patterns, any suspected fraudulent activity can trigger swift actions such as barring or deleting the account user. This will stop the company from suffering more harm.
◌◌ Anomaly Detection
◌◌ Association and correlation analysis, together with aggregation, are used to choose and create discriminating attributes.
◌◌ Stream data analysis.
◌◌ Distributed data mining.
◌◌ Querying and visualisation tools.
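As a rough illustration of how the behaviour-pattern monitoring described above can flag superimposed fraud, the sketch below applies an isolation forest to made-up per-subscriber usage features; the feature names and data values are assumptions, not part of any real telecom system.

# A minimal anomaly-detection sketch on call-usage records, assuming
# per-subscriber features such as calls_per_day, avg_call_minutes and
# international_share (all hypothetical field names).
import numpy as np
from sklearn.ensemble import IsolationForest

# Rows: [calls_per_day, avg_call_minutes, international_share]
usage = np.array([
    [12, 3.1, 0.02], [15, 2.8, 0.01], [11, 3.5, 0.03],
    [14, 2.9, 0.02], [180, 9.7, 0.85],   # last row: suspiciously heavy use
])

model = IsolationForest(contamination=0.2, random_state=0).fit(usage)
flags = model.predict(usage)             # -1 marks a potential superimposed-fraud pattern
for row, flag in zip(usage, flags):
    status = "suspect" if flag == -1 else "normal"
    print(row.tolist(), status)

Accounts flagged as suspect would then be reviewed before any barring or deletion takes place.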
5.3.2 Mining in Healthcare
Many industries successfully employ data mining. It enables the retail and banking
sectors to monitor consumer behaviour and calculate customer profitability. Numerous
linked industries, including manufacturing, telecom, healthcare, the auto industry,
education, and many more, are served by the services it offers.
There are many opportunities for data-mining applications due to the volume of
data and issues in the healthcare industry, not to mention the pharmaceutical market
and biological research, and considerable financial advantages may be realised.
Thanks to the electronic preservation of patient records and improvements in medical
information systems, a lot of clinical data is now accessible online. In order to help
physicians improve patient care and make better judgments, data-mining tools are
helpful in spotting patterns, trends, and unexpected events.
A patient’s health is regularly evaluated by clinicians. The analysis of a sizable
amount of time-stamped data will provide clinicians with knowledge on the course of
the disease. Therefore, systems capable of temporal abstraction and reasoning are
crucial in this scenario. Even though the use of temporal-reasoning methods requires a
significant knowledge-acquisition effort, data mining has been used in many successful
medical applications, including data validation in intensive care, the monitoring of
children’s growth, the analysis of diabetic patients’ data, the monitoring of heart-
transplant patients, and intelligent anaesthesia monitoring.
Data mining is widely used in the medical industry. Data visualisation and artificial
neural networks are two particularly important data mining applications for the medical
sector. For instance, NeuroMedical Systems used neural networks to provide a pap
smear diagnostic assistance. The Vysis Company uses neural networks to analyse
proteins in order to create novel medications. The Oxford Transplant Center and the
University of Rochester Cancer Center use Knowledge Seeker, a decision-tree-based
system, to help in their oncology research.
Over the past ten years, biomedical research has expanded significantly, from the
identification and study of the human genome to the development of new drugs and
advancements in cancer therapies. The goal of researching diseases' genetic causes is to help prevent and treat genetic disorders and disabilities. DNA sequences are a crucial topic of study in genome research since they constitute the fundamental components of all living creatures' genetic blueprints.
What is DNA? The basis of all living things is deoxyribonucleic acid (DNA).
The main technique through which we can transmit our genes to future generations
is through DNA. The instructions that tell cells how to behave are encoded in DNA.
DNA is made up of sequences that make up our genetic blueprints and are crucial for
understanding how our genes work. The components that each gene is made up of
are called nucleotides. These nucleotides combine to form long, twisted, linked DNA
sequences or chains. Unraveling these sequences has been a challenge since the 1950s, when the DNA molecule's structure was first established. Theoretically, by
understanding DNA sequences, we will be able to identify and anticipate defects, weak
points, or other features of our genes that could affect our lives. It may be possible to
create cures for diseases like cancer, birth defects, and other harmful processes if DNA
sequences are better understood. Data-mining tools are simply one tool in a toolbox
for understanding different types of data, and the employment of classification and
visualisation techniques is crucial to these processes.
About 100,000 genes, each of which contains DNA encoding a unique protein with
a particular function or set of functions, are thought to exist in humans. Several genes
that control the creation of haemoglobin, the regulation of insulin, and vulnerability to
Huntington’s chorea have recently been identified. To construct distinctive genes, nucleotides can be ordered and sequenced in an apparently limitless number of
different ways. A single gene’s sequence may have hundreds of thousands of unique
nucleotides arranged in a particular order.
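The sketch below gives a small, self-contained illustration of pattern search over a nucleotide string by counting k-mer frequencies; the sequence is invented and the approach is only a toy stand-in for real genome-analysis pipelines.

# Counting the frequency of every k-nucleotide subsequence (k-mer) in a
# made-up DNA string; frequent k-mers are one simple kind of sequence pattern.
from collections import Counter

def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a nucleotide string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

dna = "ATGCGATGACCTGATGCGATG"          # invented example sequence
for kmer, count in kmer_counts(dna, k=3).most_common(5):
    print(kmer, count)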
Additionally, the DNA sequencing method used to extract genetic information from
cells and tissues frequently results in the creation of just gene fragments. It has proven
difficult to ascertain how these fragments fit into the overall complete sequence from
which they are derived using conventional methods. It is a difficult effort for genetic
experts to decipher these sequences and establish theories about the genes they might
belong to and the disease processes they might govern. One may equate the task
of choosing good candidate gene sequences for more research and development to
finding a needle in a haystack. There could be hundreds of possibilities for each ailment
being studied.
For every lead that finally results in a pharmaceutical intervention that is effective in clinical settings and yields the required results, there are dozens of leads that do not. To improve the efficacy of these analytical techniques, this area of study urgently needs a breakthrough. Data mining has become a powerful platform for further research and discovery in DNA analysis.
The three-system approach is the best method for expanding data mining beyond the bounds of academic study. The best way to make a real-world improvement with any healthcare analytics endeavour is to implement all three systems; regrettably, only a minority of organisations do so.
The three systems are listed below:
The Analytics System:
The analytics system includes the tools and knowledge needed to gather data,
analyse it, and standardise metrics. The system’s core is built upon the aggregation of
clinical, patient satisfaction, financial, and other data into an enterprise data warehouse
(EDW).
The Content System:
The content system integrates care delivery with best practices supported by evidence. Every year, substantial advances in clinical best practice are made by scientists, but as was already said, it sometimes takes a while for these advancements to be used in actual clinical settings. Organizations can swiftly implement the newest medical standards thanks to a robust content system.
Numerous industries have made extensive and frequent use of data mining. Data
mining is becoming more and more common in the healthcare industry. Applications
for data mining can be quite helpful for all parties involved in the healthcare sector.
Data mining, for instance, can benefit the healthcare sector by assisting with customer
relationship management, excellent patient care, best practises, and cost-effective
healthcare services. Healthcare transactions create enormous amounts of data that are
too complex and vast for traditional processing and analysis techniques.
The framework and methods for turning these data into meaningful information for
making data-driven decisions are provided by data mining.
Treatment Effectiveness:
Applications for data mining can be used to evaluate the efficacy of medical interventions. By comparing and contrasting causes, symptoms, and courses of treatment, data mining can provide analysis of which course of action displays effectiveness.
Healthcare Management:
Applications for data mining can be used to identify and monitor patients in intensive care units, reduce hospital admissions, and support healthcare management.
Massive data sets and statistics are analysed using data mining to look for patterns that
could indicate a bioterrorist attack.
Healthcare organisations can improve their decision-making by integrating data mining into their data frameworks. The best informational
support and expertise are provided to healthcare professionals through predictive
models. The goal of predictive data mining in medicine is to develop a predictive model
that is understandable, yields trustworthy predictions, and aids physicians in improving
their processes for diagnosing patients and formulating treatment plans. When there is
a lack of understanding about the relationship between different subsystems and when
conventional analysis methods are ineffective, as is frequently the case with nonlinear associations, biomedical signal processing is a crucial application of data mining.
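A minimal sketch of such a predictive model is given below, using a shallow decision tree on a tiny, made-up set of patient measurements; the feature names, values and outcome labels are assumptions for illustration only, not clinical data.

# A small decision-tree sketch for diagnosis support on invented data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical measurements: [age, systolic_bp, glucose]; 1 = condition present.
X = [[45, 130, 5.2], [60, 160, 7.9], [38, 118, 4.8],
     [70, 155, 8.4], [52, 140, 6.1], [29, 110, 4.5]]
y = [0, 1, 0, 1, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules are printable, which helps clinicians inspect the model.
print(export_text(clf, feature_names=["age", "systolic_bp", "glucose"]))
print(clf.predict([[58, 150, 7.0]]))          # predicted class for a new patient

The value of a tree here is that its rules remain readable, which matches the requirement above that predictive models in medicine be understandable to physicians.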
Challenges in Healthcare Data Mining:
The fact that the raw medical data is so large and diverse is one of the main
problems with data mining in healthcare. These facts can be gathered from various sources, such as clinical records, physician notes, and laboratory findings. All of these factors may have a big impact on how a patient is diagnosed and
treated. Data mining success is significantly hampered by missing, erroneous, and
inconsistent data, such as information stored in multiple formats from several data
sources.
Another issue is that practically all medical diagnoses and therapies are unreliable
and prone to errors. In this case, the study of specificity and sensitivity is taken into
account while measuring these errors. There are two main obstacles in the area of
knowledge integrity evaluation:
First, how can we design algorithms that effectively compare the content of the before and after versions of a data set? Evaluating knowledge integrity in the data set presents a challenge and calls for the development of efficient algorithms and data structures.
Second, how can algorithms be developed to assess the effects of certain data changes on the statistical significance of individual patterns that are gathered with the aid of fundamental types of data mining methods?
Though it is challenging to develop universal metrics for all data mining methods, algorithms that quantify the influence that changes in data values have on the statistical significance of discovered patterns still need to be developed.
5.3.3 Mining in Science and Engineering
“The fourth paradigm” is the term often used to describe the revolutionary power that data science has given to science.
Data is becoming more and more abundant, and its volume, velocity, and
authenticity are all increasing exponentially. Data mining has become an essential tool
for scientific research projects across a wide range of domains, from astronomy and
bioinformatics to finance and social sciences, as a result of the proliferation of data that
is now available. This data is now too large in size and dimensionality to be directly
analysed by humans. The enormous amount of normally unintelligible scientific data
that is produced and stored every day can be exploited in data mining to draw relevant
findings and forecasts.
◌◌ Data reduction: Scientific equipment, such as satellites and microscopes, can quickly and efficiently collect terabytes of data comprising millions of data points. The observations can be made simpler with a methodical, automated approach without compromising the accuracy of the data; a small illustration follows this list. Using data mining techniques, large databases can be effectively accessed by scientists.
◌◌ Research: The practice of extracting informative and user-requested information from inconsistent and unstructured internet data is made simpler by web data mining. Text data mining is the process of extracting structured
data from text by employing techniques like natural language processing
(NLP). With the use of these tools, researchers can more quickly and
precisely locate current scientific data in literature databases.
◌◌ Pattern recognition: In high-dimensional datasets, intelligent algorithms can identify patterns that people cannot. This can aid in finding anomalies.
◌◌ Remote sensing: Aerial remote sensing photography can be analysed with data mining techniques.
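The data-reduction illustration promised above is sketched here: principal component analysis (PCA) compresses many correlated instrument channels into a few components. The synthetic "readings" matrix is an assumption standing in for real sensor output.

# Data-reduction sketch: project 50 correlated channels onto 3 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))                 # 3 underlying signals
readings = base @ rng.normal(size=(3, 50))       # 50 correlated channels
readings += 0.01 * rng.normal(size=readings.shape)

pca = PCA(n_components=3).fit(readings)
reduced = pca.transform(readings)                # 500 x 3 instead of 500 x 50
print(reduced.shape, pca.explained_variance_ratio_.sum())

Almost all of the variance survives the projection, which is the sense in which the observations are simplified without compromising accuracy.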
●● High Energy Physics: The Large Hadron Collider experiments that simulate
collisions in accelerators and detectors produce petabytes of data that must be
stored, calibrated, and reconstructed before analysis. Data reduction strategies
are used by the Worldwide LHC Computing Grid to deal with the volume. An open-
source data mining tool known as ROOT is specialised high-performance software
that makes it easier to conduct scientific investigations and visualise massive
amounts of data.
●● Astronomy: Machine learning approaches, including the empirical training-set method, are used to estimate redshifts for galaxies
and quasars from photometric data. Aside from these uses, data mining has
also been applied to astronomical simulations, the analysis of cosmic microwave
backgrounds, and the prediction of solar flares.
●● Bioinformatics: A science at the nexus of biology and information technology is
called bioinformatics. It is possible to mine the data produced by genomics and
proteomics research to identify patterns in sequences, predict protein structures,
annotate the genome, analyse gene and protein expression, model biological
systems, and investigate genetic pathways to better understand disease.
●● Healthcare: Useful data on patient demographics, treatment plans, cost, and
insurance coverage are produced by the healthcare sector. Studies already
conducted have documented the use of data mining in clinical medicine, the
detection of adverse drug reaction signals, and a focus on diabetes and skin
conditions. Regression, classification, sequential pattern mining, association,
clustering, and data warehousing are the most widely utilised mining methods in
this area.
●● Geo-Spatial Analysis: To reduce the effects of storm dust in arid regions, spatial
models of the locations that are sensitive to gully erosion, which leads to land
degradation, have been created utilising data mining methods, GIS, and R
programming.
5.3.4 Mining in E-commerce
With the growth of economic globalisation and trade liberalisation, computer networks have slowly seeped into all facets of everyone's life, and the e-commerce business has been created as a platform. E-commerce has altered people's
perceptions of traditional trade, business, and payment methods as a form of new
business model. It has also given the current business community new life and given
the conventional business model a technological revolution. E-commerce has a
pressing demand for data mining, a technology that analyses and processes data. Data mining will more efficiently handle and analyse a significant volume of online information for businesses.
Data mining, the process of extracting hidden predictive information from sizable
databases, is a formidable new technology that has the potential to greatly assist
businesses in concentrating on the most crucial data in their data warehouses. Making
proactive, knowledge-driven decisions is possible for businesses thanks to data mining
tools that forecast upcoming trends and behaviours. Beyond the assessments of past
events provided by retrospective tools typical of decision support systems, data mining
offers automated, prospective studies. Business questions that were previously time-
consuming to resolve can now be answered by data mining techniques. They search
databases for hidden patterns and uncover predicted data that experts might miss
because it deviates from what they expect. The majority of businesses now gather and
process enormous amounts of data. In order to increase the value of currently available
information resources, data mining techniques can be quickly implemented on existing
software and hardware platforms. They can also be integrated with new products and
systems as they go online.
Data must first be arranged using database tools and data warehouses, after
which it requires a knowledge discovery tool. The process of retrieving non-obvious,
practical information from sizable databases is known as data mining. With regard
to helping businesses concentrate on using their data, this developing field offers
a number of potent strategies. From very vast databases, data mining tools produce
new information for decision-makers. Generating this information involves numerous techniques, including characterizations, summaries, aggregates, and abstractions of data. These forms
are the outcome of the use of advanced modelling methods from various disciplines
including statistics, artificial intelligence, database management, and computer
graphics.
The majority of business tasks in highly competitive firms have altered due to
e-commerce. The interface operations between customers and retailers, retailers and
distributors, distributors and factories, and factories and their numerous suppliers are
now effortlessly automated thanks to internet technologies. Online transactions have
generally been made possible by e-commerce and e-business. Additionally, producing
enormous amounts of real-time data has never been simple. It only makes sense to
use data mining to make (business) sense of these data sets because data relevant to
diverse aspects of business transactions are easily available.
The following elements contribute significantly to a DM exercise’s success:
◌◌ Large amounts of data must be available in order for the rules’ statistical
significance to hold. The use of the rules produced by the transactional
database will presumably decrease in the absence of, say, at least 100,000
transactions.
◌◌ Despite the fact that a particular terabyte database may have hundreds of
attributes for each relation, the DM algorithms applied to this dataset may fail
if the data was generated manually and erroneously and had the improper
default values specified.
◌◌ Ease of calculating the return on investment (ROI) in DM: Although the other factors may be advantageous, investments in further DM activities would not be feasible unless a solid business case can be developed with a clear estimate of the expected returns.
Data-Mining in E-Commerce
Data mining in e-commerce is a crucial method for repositioning the e-commerce
organisation to provide the enterprise with the necessary business information.
Most businesses now use e-commerce and have enormous data stored in their
data repositories. This data must be mined in order to enable business intelligence
or to improve decision-making, which is the only way to make the most of it. Before
becoming knowledge or an application in e-commerce data mining, data must go
through three key processes.
The term “application of data mining in e-commerce” refers to potential
e-commerce sectors where data mining could be applied to improve business
operations. It is commonly known that when customers visit an online site to make
purchases, they typically leave behind some information that businesses can store in
their database. These facts are examples of organised or unstructured data that can
be mined to provide the business a competitive edge. Data mining can be used in the
world of e-commerce for the advantage of businesses in the following areas:
Customer Profiling: In e-commerce, this is often referred to as a customer-oriented
strategy. This enables businesses to plan their business activities and operations and to
generate fresh research on products or services for successful ecommerce by mining
client data for business intelligence. Companies can save sales costs by grouping the
clients with the greatest purchasing potential from the visiting data. Businesses can
utilise user surfing information to determine whether a person is shopping on purpose,
just browsing, purchasing something they are accustomed to or something new. This
aids businesses in planning and enhancing their infrastructure.
Personalization of Service: The act of providing content and services that are tailored to individuals based on knowledge of their needs and behaviour is known as personalisation. Common approaches include collaborative filtering, social data mining, and content-based systems. These systems are typically
represented as the user profile and are nurtured and learnt from the explicit or implicit
feedback of users. When evaluating the sources of data that a group of people
produce as part of their regular activities, social data mining can be a valuable source
of information for businesses. On the other hand, collaborative filtering can be used to recommend items to a user based on the preferences of similar users, as illustrated in the sketch below.
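A minimal sketch of user-based collaborative filtering follows; the small rating matrix and the similarity-weighted prediction are illustrative assumptions rather than a production recommender.

# Recommend an unrated item to a user based on the ratings of similar users.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],    # user 0 (0 means "not yet rated")
    [4, 5, 1, 0],    # user 1
    [1, 0, 5, 4],    # user 2
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

target = 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0                               # ignore self-similarity

# Predicted score per item = similarity-weighted average of other users' ratings.
pred = sims @ ratings / (sims.sum() + 1e-9)
unseen = np.where(ratings[target] == 0)[0]
print("recommend item", unseen[np.argmax(pred[unseen])])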
Basket Analysis: Market Basket Analysis (MBA), a popular retail analytics and business intelligence technique, aids retailers in getting to know their customers better. Every shopper's basket has a tale to tell. There are several strategies to maximise the benefits of market basket analysis (a small worked example follows this list), including:
◌◌ Identifying non-obvious product affinities and capitalising on them. Customers who buy Barbie dolls at Wal-Mart seem to have a preference for one of three candy bars. Advanced market basket analytics can reveal oblique connections like this to help create more successful marketing campaigns.
◌◌ Cross-sell and up-sell marketing highlight related items so that customers who
buy printers may be convinced to purchase premium paper or cartridges.
◌◌ Planograms and product combos are used to focus on products that sell well together and improve inventory control based on product affinities, combo offers, and user-friendly planogram design.
◌◌ Shopper profiles are created by using data mining to analyse market baskets
over time in order to acquire a better understanding of who your customers
truly are. This includes learning about their ages, income levels, purchasing
patterns, likes and dislikes, and purchase preferences.
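The worked example referred to above computes support and confidence for one candidate rule over a handful of made-up baskets; real market basket analysis would mine many such rules (for example with the Apriori algorithm) over millions of transactions.

# Support and confidence for the rule {printer} -> {premium paper}.
transactions = [
    {"printer", "premium paper", "cartridge"},
    {"printer", "premium paper"},
    {"printer", "cartridge"},
    {"doll", "candy bar"},
    {"doll", "candy bar", "premium paper"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

antecedent, consequent = {"printer"}, {"premium paper"}
supp_rule = support(antecedent | consequent)
confidence = supp_rule / support(antecedent)
print(f"support={supp_rule:.2f} confidence={confidence:.2f}")

A high-confidence rule of this kind is exactly what drives the cross-sell suggestion that printer buyers be offered premium paper.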
The traditional trade is undergoing a significant transformation in the age of
information and knowledge economy due to the quick advancement of network
technology and the improvement of social information level, and e-commerce exhibits
significant market value and development potential. A recent business strategy in the
realm of commerce is e-commerce. It is a contemporary business model that uses the
internet as a platform and contemporary information technology as a tool, with a focus
on economic effectiveness.
With the emergence of the “data explosion but the lack of knowledge” problem, how can one avoid being overwhelmed by the huge ocean of information and obtain relevant information and knowledge from it in a timely manner? Therefore, a new generation
of technology and tools is required to conduct reasonable and higher level analysis, to
use inductive reasoning, to explore potential modes, and to extract useful knowledge to
assist e-commerce business decision-makers in adjusting the market strategy, creating
business forecasts, and taking the appropriate actions to improve information utilisation,
lower risk, and generate enormous profits for the company. The data processing
technology that will best meet these development needs is data mining.
●● Optimize enterprise resources
Enterprises can grasp enterprise resource information based on data mining technology. You can
then use this information to make decisions about how best to allocate resources,
such as by reducing inventory, increasing inventory turnover, and improving capital
utilisation.
●● Manage customer data
Enterprises can make the best use of customer resources, analyse and forecast consumer behaviour, and
categorise customers based on data mining technology. To increase customer
satisfaction and loyalty, it is beneficial to monitor customer profitability, identify
future valuable customers, and provide individualised services. By utilising online
resources, businesses can better understand their clients’ purchasing patterns and
interests, which helps them create more user-friendly websites with tailored content
for each client.
●● Assess business credit
Poor credit standing is a significant issue affecting the business environment and
has generated global concern. Due to the growing severity of the enterprise finance
“fake” problem brought on by online fraud, credit crunch has emerged as a significant
barrier to the growth of e-commerce.
Data mining tools can be utilised to keep an eye on business operations, assess profitability, evaluate assets, and estimate future growth opportunities. These technologies are also used to implement online monitoring, secure online transactions, and oversee
online payment security. The level and capacity of the enterprise to screen credit
and manage risk are improved by mining the transaction history data based on
the data mining credit evaluation model to identify the customer’s transaction data
characteristics, establish customer credibility level, and effectively prevent and
resolve credit risk.
●● Determine the abnormal events
Unusual occurrences, like customer churn, bank credit card fraud, and mobile fee default, have considerable economic value in several company sectors. We can
rapidly and precisely pinpoint these aberrant events using data mining’s unique point
analysis, which may then serve as the foundation for business decision-making and
cut down on avoidable losses.
5.3.5 Mining in Finance
In order to determine whether a business is steady and profitable enough to get capital investment, financial analysis of data is crucial. The balance sheet, cash flow
statement, and income statement are where financial analysts concentrate their
analysis.
Services such as individual customer transactions, investment services, credits, and loans are offered
by the majority of banks and financial institutions. Financial data is frequently quite
comprehensive, trustworthy, and of good quality when collected in the banking and financial sector, which facilitates systematic data analysis and data mining to boost business performance.
Data mining is widely utilised in the banking sector for modelling and predicting
credit fraud, risk assessment, trend analysis, profitability analysis, and aiding direct
marketing operations. Neural networks have been used in the financial markets to
anticipate stock prices, options trading, bond ratings, portfolio management, commodity
price prediction, mergers and acquisitions analysis, and financial calamities.
Among the financial firms that employ neural-network technology for data mining
are Daiwa Securities, NEC Corporation, Carl & Associates, LBS Capital Management,
Walkrich Investment Advisors, and O’Sullivan Brothers Investments. There have been
reports of a wide variety of successful business applications, albeit it is sometimes
difficult to retrieve technical information. Although there are many more banks and
investing firms that mine data than those listed above, you won’t often
find them wanting to be cited. Typically, they have rules prohibiting talking about it.
Therefore, unless you look at the SEC reports of some of the data-mining companies
that offer their tools and services, it is difficult to discover articles about banking
companies that utilise data mining. Customers like Bank of America, First USA Bank,
Wells Fargo Bank, and U.S. Bancorp can be found there.
It has not gone unnoticed that data mining is widely used in the banking industry. It
is not unexpected that methods have been designed to counteract fraudulent activities
in sectors like credit card, stock market, and other financial transactions given that fraud
costs industries billions of dollars. The issue of fraud is a very important one for credit
card issuers. For instance, fraud cost Visa and MasterCard more than $700 million in
a single year. Capital One’s losses from credit card fraud have been reduced by more
than 50% thanks to the implementation of a neural network-based solution. This article
explains some effective data-mining techniques to illustrate the value of this technology
for financial organisations.
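The text above mentions neural-network fraud models; the sketch below swaps in a simpler logistic-regression scorer on made-up transaction features (amount, hour of day, foreign-merchant flag) to show the same scoring idea, not any bank's actual method.

# Score transactions and flag the riskiest ones for manual review.
from sklearn.linear_model import LogisticRegression

# Hypothetical transactions: [amount, hour_of_day, foreign_merchant]
X = [[25, 14, 0], [3200, 3, 1], [40, 11, 0],
     [2900, 2, 1], [15, 19, 0], [2500, 4, 1]]
y = [0, 1, 0, 1, 0, 1]                      # 1 = confirmed fraud

model = LogisticRegression(max_iter=1000).fit(X, y)
new_txns = [[30, 13, 0], [2700, 3, 1]]
for txn, p in zip(new_txns, model.predict_proba(new_txns)[:, 1]):
    print(txn, f"fraud probability ~ {p:.2f}")

In practice the issuer would tune the decision threshold so that the cost of blocking legitimate transactions is balanced against the losses from missed fraud.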
Just five years ago, the concept of a “robo-advisor” was basically unknown, but
today it is a staple of the financial industry. The phrase is somewhat deceptive because
it has nothing to do with robots. Instead, robo-advisors are sophisticated algorithms
created by firms like Betterment and Wealthfront to tailor a financial portfolio to the
objectives and risk tolerance of each individual customer. Users enter their objectives
together with their age, income, and current financial assets, for instance, retiring
at age 65 with $250,000 in savings. In order to accomplish the user’s objectives, the
intelligent advisor algorithm then distributes investments among various asset classes
and financial instruments. Aiming to always identify the optimum fit for the user’s initial
aims, the system calibrates to changes in the user’s goals and to real-time changes
in the market. With millennial customers who do not require a physical advisor to feel
comfortable investing and who are less able to justify the costs paid to human advisors, robo-advisors have found a ready market.
A shared, distributed database of transactions among participating parties makes up a blockchain. A majority of the users in the system agree
to verify each transaction in the public database. Information cannot be deleted once it
has been entered. Every transaction ever made is contained in a particular, verifiable record.
The primary assumption is that the blockchain creates a means for establishing distributed consensus, so that participating parties can be assured that a digital event occurred. It makes it possible to transition from a centralised
digital economy to one that is democratic, open, and scalable. This disruptive
technology offers a wide range of application possibilities, and the revolution in this field
has just begun.
The blockchain technologies are becoming more and more relevant as social
responsibility and security play a larger role online. Since it is practically impossible to counterfeit digital transactions in a system using blockchain, the legitimacy of such
systems will undoubtedly increase. We will observe many more possible use cases
for the government, healthcare, manufacturing, and other sectors when the early
blockchain boom in the financial services business slows.
Case Study
Amazon EC2 Setup
We use the Amazon Elastic Compute Cloud (EC2) service as a case study for
business cloud settings. Although there were other publicly accessible cloud services
that competed at the time, Amazon’s service is the biggest and offers the most highly
adjustable virtual machines. All software above this level is configured by the user, but
the nodes assigned by EC2 run an operating system or kernel that is configured by
Amazon. Many other cloud services provided by other providers impose restrictions on
the APIs or programming languages that can be used. We use Amazon as a case study
to demonstrate the utilisation of current highly efficient dense linear algebra libraries.
Instances are nodes allocated through EC2. The distribution of instances from
Amazon’s data centres is based on unpublished scheduling algorithms. The initial allocation cap is 20 instances in total, but this cap can be removed upon request. The
smallest logical geographic unit for allocation used by Amazon is an availability zone,
which is made up of several data centres. These zones are further divided into regions,
which at the time only include the US and Europe.
Each instance automatically loads a user-specified image with the appropriate
operating system (in this case, Linux) and user software after allocation (described
below). The Xen virtual machine (VM) is used by Amazon services to automatically load
images onto one or more virtualized CPUs (Barham et al., 2003). Each CPU has many
cores, giving the instances we reserved a total of 2 to 8 virtual cores. Amazon’s terms
of service do not contain any definite performance guarantees. They do not restrict
Amazon’s capacity to offer multi-tenancy, or the co-location of VMs from several clients,
which is crucial for our study. Below, we go over the performance aspects of several
instances.
The ability to release unwanted nodes, produce and destroy images to be loaded
onto allocated instances, and allocate additional nodes on demand are all provided
through tools written to Amazon’s public APIs. We created images with the most recent
compilers offered by the hardware CPU vendors AMD and Intel using these tools plus
ones we developed ourselves. With the use of our tools, we are able to automatically
allocate and configure variable-size clusters in EC2, including support for MPI
applications.
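The authors' own provisioning tools are not reproduced here; as a rough modern-day equivalent, the sketch below uses the boto3 library to allocate and later release a small, fixed-size cluster in one availability zone. The AMI id, key-pair name and instance type are placeholders, not values from the case study.

# Allocate a small cluster of identical EC2 instances, then release them.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder image with MPI + BLAS preinstalled
    InstanceType="c5.xlarge",             # placeholder; the study used m1/c1 generations
    MinCount=4,
    MaxCount=4,
    KeyName="my-keypair",                 # placeholder key pair name
    Placement={"AvailabilityZone": "us-east-1a"},   # keep all nodes in one zone
)
ids = [i["InstanceId"] for i in response["Instances"]]
print("allocated:", ids)

# Release the nodes when the computation is finished.
ec2.terminate_instances(InstanceIds=ids)

Pinning the whole cluster to one availability zone mirrors the study's approach of minimising network hops between nodes.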
There are other publicly available tools for running scientific applications on cloud
platforms (including EC2), even though we created tools to automatically manage and configure EC2 nodes for our applications. In addition, as the cloud computing platform
develops, we anticipate seeing a lot more work done on specialised applications like
high-performance computing, which will lessen or completely do away with the steep
learning curve associated with implementing scientific applications on cloud platforms.
For instance, there are already publicly accessible images on EC2 that support MPICH.
From November 2008 to January 2010, we conducted our case study using several
instance types on the Amazon Elastic Compute Cloud (EC2) service.
The key distinctions between the instance types are shown in the table below,
including the number of cores per instance, installed memory, theoretical peak
performance, and instance cost per hour. Although Amazon offers a smaller 32-bit
m1.small instance, we only tested instances with 64-bit processors and hence refer
to the m1.large as the smallest instance. Costs per node range from $0.34 for the
smallest to $2.40 for nodes with a lot of installed RAM, a factor of seven difference. The
c1.xlarge instance stands out in that cost grows more closely with installed RAM than
with peak CPU performance. Utilizing processor-specific capabilities, peak performance
is determined. For instance, the c1.xlarge instance type has two Intel Xeon quad-core
CPUs running at 2.3 GHz and 7 GB of total RAM. Theoretically, each core may do four
floating-point operations each clock cycle, giving each node a peak performance of
74.56 GFLOP. We do not include the expenses associated with long-term data storage
and bandwidth utilised to enter and exit Amazon’s network in our calculations because
they pale in comparison to the costs associated with the computation itself in our
studies.
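The peak figure quoted above can be reproduced with a one-line calculation; note that 74.56 GFLOP/s implies a 2.33 GHz clock, which is assumed here since the text states 2.3 GHz.

# Theoretical peak for the c1.xlarge instance described above.
sockets = 2            # two quad-core Xeon CPUs
cores_per_socket = 4
flops_per_cycle = 4    # floating-point operations per core per clock cycle
clock_ghz = 2.33       # assumed clock rate (see note above)

peak_gflops = sockets * cores_per_socket * flops_per_cycle * clock_ghz
print(f"theoretical peak ~ {peak_gflops:.2f} GFLOP/s")   # ~74.56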
Extensive testing revealed that the multi-core processors’ multithreaded parallelism
worked best when the Goto BLAS library was configured to use as many threads as
there were cores per socket—four for the Xeon and two for the Opteron, respectively.
In the section that follows, we give peak attained efficiency together with the number
of threads that were used to reach particular outcomes. We achieved 76% and 68% of theoretical peak performance on the high-CPU and standard instances, respectively. On the basis of our evaluation, we think that the configuration and execution of LINPACK in HPL on
the high-CPU and standard instances is effective enough to serve as an example of a
compute-intensive application.
The 2.6.21 Linux kernel is used to run the RedHat Fedora Core 8 operating system
on all instance types (with Intel or AMD CPUs). The autotuning of buffer sizes for high-
performance networking is supported by the 2.6 line of Linux kernels and is turned on
by default. Multiple instances may even share a single physical network card, according
to Services (2010), which states that the specific connection used by Amazon is
unknown. As a result, a given instance may not have access to the total throughput. We conduct all trials with cluster nodes assigned to the same availability zone in an effort to minimise the number of hops between nodes as much as possible.
Summary
●● Extraction of interesting information or patterns from data in large databases is
known as data mining.
●● Text mining is the process of removing valuable data and complex patterns
from massive text datasets. For the purpose of creating predictions and making
decisions, there are numerous methods and tools for text mining.
●● Text mining is the procedure of synthesizing information by analyzing relations, patterns, and rules among textual data. These days, information is electronically maintained by all institutions, businesses, and other organisations: a vast amount of information is stored online in digital archives, databases, and other text-based sources like websites, blogs, social media networks, and emails.
●● Procedures of analyzing text mining: a) Text summarization, b) Text categorization,
c) Text clustering.
●● Application area of text mining: a) Digital library, b) Academic and research field, c)
Life science, d) Social media, e) Business intelligence.
●● A spatial database is designed to store and retrieve information on spatial objects,
such as points, lines, and polygons. Spatial data frequently includes satellite
imagery. Spatial queries are those that are made on these spatial data and use
spatial parameters as predicates for selection.
●● Web mining is the process of using algorithms and
data mining techniques to extract valuable information from the web, including
Web pages and services, linkages, web content, and server logs.
●● Three categories of web mining: a) Web content mining, b) Web usage mining, c)
Web structure mining.
●● A multimedia database management system is the framework that controls how
various kinds of multimedia data are provided, saved, and used. Multimedia
databases can be divided into three categories: dimensional, dynamic, and static.
●● Multimedia data cubes can be generated and built similarly to traditional data
cubes using relational data in order to enable the multidimensional analysis of big
multimedia datasets.
●● Application area of data mining: a) Research, b) Education sector, c)
Transportation, d) Market basket analysis, e) Business transactions, f) Intrusion detection.
●● Fraud is a serious issue for the telecommunications sector since it results in lost income and deteriorates consumer relations. Subscription fraud and superimposed scams are two of the main types of fraud involved. The subscription scam involves gathering client information, primarily via KYC documentation, to register for telecom services with no intention of paying for them.
●● Data mining techniques help detect anomalies, network attacks, and intrusions. These methods assist in choosing and enhancing
pertinent and usable facts from enormous data collections.
●● Data mining applications in healthcare: a) Treatment effectiveness, b) Fraud and abuse, c) Customer relationship management, d) Healthcare management.
●● Data mining applications in science and engineering: a) Data reduction, b)
Research, c) Pattern recognition, d) Remote sensing, e) Opinion mining.
●● The networking, automation, and intelligent management of business activities are the ultimate objectives of e-commerce. The rise of e-commerce has significantly altered company
philosophy, management techniques, and payment methods, as well as the many
spheres of society.
●● Application of data mining in e-commerce: a) Optimize enterprise resources, b)
Manage customer data, c) Assess business credit, d) Determine the abnormal
events.
●● In order to determine whether a business is steady and profitable enough to get
capital investment, financial analysis of data is crucial. The balance sheet, cash
flow statement, and income statement are where financial analysts concentrate
their analysis.
●● In order to uncover hidden patterns and forecast upcoming trends and behaviours
in the financial markets, data mining techniques have been applied. For mining
such data, especially the high-frequency financial data, advanced statistical,
mathematical, and artificial intelligence approaches are often needed.
Glossary
●● GIS: Geographical Information Systems.
●● Text Summarization: To automatically extract its entire content from its partial
content reflection.
●● Text Categorization: To classify the text into one of the user-defined categories.
●● Text Clustering: To divide texts into various clusters based on their significant
relevance.
●● Information extraction: The technique of information extraction involves sifting
through materials to find meaningful terms.
●● Information retrieval: It is the process of identifying patterns that are pertinent and
related to a group of words or text documents.
●● Natural Language Processing: The automatic processing and analysis of
unstructured text data is known as “natural language processing.”
●● DTM: Digital Terrain Model.
●● DBSCAN: Density-based spatial clustering of applications with noise.
●● Web Content Mining: Web content mining is the process of extracting relevant
information from web-based documents, data, and materials. The internet used to
merely contain various services and data resources.
●● Web Structure Mining: The process of analysing the interdocument structure—or the organisation of hyperlinks—within the web
itself. It has a tight connection to online usage mining. Web structure mining is
fundamentally connected to pattern recognition and graph mining.
●● Web Usage Mining: Web usage mining manipulates clickstream data. Web usage
data includes web server access logs, proxy server logs, browser logs, user
profiles, registration information, user sessions, transactional information, cookies,
user queries, bookmark data, mouse clicks and scrolling, and other interaction
data.
●● Media Data: Actual data used to depict an object is called media data.
●● Media Format Data: Data concerning the format of the media after it has
undergone the acquisition, processing, and encoding stages, including sampling
rate, resolution, encoding method, etc.
●● Media keyword data: Keywords that describe how data is produced. It also goes
by the name of “content descriptive data.” Example: the recording’s date, time, and
location.
●● Media Feature Data: Information that is depending on the content, such as the
distribution of colours, types of textures, and shapes.
●● MFC: Most Frequent Color.
●● MFO: Most Frequent Orientation.
●● URL: Uniform Resource Locator.
●● RFM: Recency, Frequency, and Monetary.
b. Data collection
c. Data repository
d. Data mining
2. _ _ _ _ is the procedure of synthesizing information, by analyzing relations, patterns,
and rules among textual data.
a. Text editing
b. Text wrapping
c. Text mining
3. To automatically extract its entire content from its partial content reflection, is termed
as?
a. Text summarization
b. Text categorization
c. Text clustering
d. Text mining
4. To classify the text into one of the user-defined categories, is termed as:
a. Text summarization
b. Text categorization
c. Text clustering
d. Text mining
5. To divide texts into various clusters based on their significant relevance, is termed
as:
a. Text summarization
b. Text categorization
c. Text clustering
d. None of the mentioned
6. The technique of_ _ _ _ _ _ involves sifting through materials to find meaningful
terms.
a. Information retrieval
b. Clustering
c. Data repository
d. Information extraction
7. _ _ _ _ _ is the process of identifying patterns that are pertinent and related to a group of words or text documents.
a. Information extraction
b. Information retrieval
c. Text summarization
d. Clustering
8. The automatic processing and analysis of unstructured text data is known as__ _ _
_ __ .
a. Information extraction
b. Information retrieval
c. Natural language processing
d. Text mining
9. A _ _ _ _ database is designed to store and retrieve information on spatial objects, such as points, lines, and polygons.
a. External geographic
b. Meteorological
c. Topological
d. Spatial
10. _ _ _ _ _ is the process of using algorithms and data mining techniques to extract
valuable information from the web, including Web pages and services, linkages,
web content, and server logs.
a. Web mining
b. Text mining
c. Text wrapping
d. Data wrapping
11. _ _ _ _ is the process of extracting relevant information from web-based documents,
data, and materials.
a. Web usage mining
b. Web content mining
c. Web structure mining
d. None of the mentioned
12. Actual data used to depict an object is called_ _ _ _ .
a. Media format data
b. Media keyword data
c. Media data
d. All of the above
13. Any unapproved operation on a digital network is referred to as a_ _ _ _.
a. Phishing
b. Hacking
c. Data mining
d. Network intrusion
14. The_ _ _ _ _is a method for examining and analysing a customer’s shopping
behaviour in order to boost income and sales.
a. Market-based analysis
b. Business transactions
c. Research
d. Education sector
15. Data concerning the format of the media after it has undergone the acquisition,
processing, and encoding stages, including sampling rate, resolution, encoding
method, etc, is termed as?
a. Media data
b. Media format data
c. Media keyword data
d. Media feature data
16. Bill Inmon has estimated_ _ _ _of the time required to build a data warehouse, is
consumed in the conversion process.
a. 80%
b. 60%
c. 40%
d. 20%
17. The generic two-level data warehouse architecture includes_ _ _ _ .
a. Far real-times updates
b. Near real-times updates
c. At least one data mart
d. None of the above
18. The most common source of change data in refreshing a data warehouse is_
________ _ _ .
tables.
a. Software
b. Hardware
c. Middleware
d. End user
20. Data warehouse architecture is based on_ _ _ _.
a. RDBMS
b. Sybase
c. DBMS
d. SQL server
Exercise
1. What do you mean by Text Mining?
2. Define Spatial Databases.
3. Explain the term Web Mining.
4. What do you mean by Multidimensional Analysis of Multimedia Data?
5. Define the Applications of data mining in following industry:
◌◌ Telecommunications Industry and
◌◌ Retail Marketing
◌◌ Fraud Protection,
◌◌ Healthcare and Science
◌◌ E-commerce
◌◌ Finance
Learning Activities
1. You are employed as a data mining consultant by a sizable commercial bank that
offers a variety of financial services. A data warehouse that the bank launched two
years ago already exists. The management is looking for the current clients who will
be most receptive to a marketing effort promoting new services. Explain the steps
involved in knowledge discovery, including their phases and associated activities.
5. c   6. d   7. b   8. c
9. d   10. a   11. b   12. c
13. d   14. a   15. b   16. a
17. b   18. c   19. d   20. a
1. Data Mining and Machine Learning Applications, Sandeep Kumar, Kapil Kumar
Nagwanshi, K. Ramya Laxmi, Rohit Raja, John Wiley & Sons
2. Data Mining Applications with R, Yanchang Zhao and Yonghua Cen, Elsevier
Science
3. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Bing Liu,
Springer Berlin Heidelberg
4. Practical Applications of Data Mining, Sang C. Suh, Jones & Bartlett Learning