Data Warehousing and Data Mining

The document provides information about data warehousing and data mining. It defines data mining as extracting knowledge from large amounts of data through computational methods. The key properties of data mining are automatic discovery of patterns, prediction of outcomes, creating actionable information, and focusing on large datasets. Data mining involves tasks like anomaly detection, association rule learning, clustering, classification, regression, and summarization. A typical data mining system has components like a knowledge base, data mining engine, pattern evaluation module, and user interface. Data warehousing is defined as organizing and compiling data into one database for mining purposes, while data mining deals with extracting important patterns from the stored data. Common data warehouse architectures include basic, with a staging area, and with

Class Notes (date-wise)

Date: From 12/01/2024 to continue

Data Warehousing and Data Mining:

1.1 What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer: it would have been more appropriately named knowledge
mining, which emphasizes that knowledge is what is mined from large amounts of data. It is
the computational process of discovering patterns in large data sets involving methods
at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.

The key properties of data mining are:

1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
4. Focus on large datasets and databases

1.2 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable
business information in a large database (for example, finding linked products in
gigabytes of store scanner data) and mining a mountain for a vein of valuable ore.
Both processes require either sifting through an immense amount of material or
intelligently probing it to find exactly where the value resides. Given databases of
sufficient size and quality, data mining technology can generate new business
opportunities.

1.3 Tasks of Data Mining

Data mining involves six common classes of tasks:

● Anomaly detection (Outlier/change/deviation detection) – The identification of
unusual data records that might be interesting, or of data errors that require further
investigation.
● Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can determine which
products are frequently bought together and use this information for marketing purposes.
This is sometimes referred to as market basket analysis.
● Clustering – The task of discovering groups and structures in the data that are in
some way or another "similar", without using known structures in the data.
● Classification – The task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
● Regression – Attempts to find a function that models the data with the least error.
● Summarization – Providing a more compact representation of the data set, including
visualization and report generation.
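
The market basket analysis mentioned under association rule learning can be sketched in a few lines of Python. This is only an illustration: the baskets and the function name pair_rules are made up for the example, and real tools also handle itemsets larger than pairs. For a rule a -> b, support is the fraction of baskets containing both items, and confidence is the fraction of baskets containing a that also contain b.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_rules(baskets):
    """Return {(a, b): (support, confidence)} for every pairwise rule a -> b."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


rules = pair_rules(transactions)
support, confidence = rules[("butter", "bread")]
print(f"butter -> bread: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket with butter also contains bread, so the rule butter -> bread has the highest possible confidence, while bread -> butter is weaker.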

1.4 Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of
abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's
interestingness based on its unexpectedness, may also be included. Other examples of
domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine: This is essential to the data mining system and ideally consists
of a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.

3. Pattern Evaluation Module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to filter out discovered
patterns. Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used. For efficient
data mining, it is highly recommended to push the evaluation of pattern interest as deep
as possible into the mining process so as to confine the search to only the interesting
patterns.

4. User Interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query
or task, providing information to help focus the search, and performing exploratory data
mining based on the intermediate data mining results. In addition, this component
allows the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns in different forms.
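
The advice under the pattern evaluation module, to push the interestingness check deep into the mining process, can be illustrated with a small level-wise frequent itemset search. This is a simplified sketch, not a full Apriori implementation: the function name frequent_itemsets, the sample baskets, and the choice of minimum support as the only interestingness threshold are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise search that applies the threshold at every level."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        # Candidates are built only from itemsets that already passed the
        # threshold, so uninteresting branches are never expanded.
        candidates = {
            a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1
        }
        level = [c for c in candidates if support(c) >= min_support]
    return result


# Hypothetical baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
for itemset, value in frequent.items():
    print(sorted(itemset), value)
```

Because "eggs" fails the threshold at the first level, no larger itemset containing it is ever generated, which is the sense in which evaluation is integrated with mining rather than applied as a filter afterwards.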

1.5 What is Data Warehousing and Data Mining? Explain

Data warehousing is a method of organizing and compiling data into one database,
whereas data mining deals with extracting important patterns from that data. Data mining
depends on the data compiled in the data warehouse to uncover meaningful
patterns.

DATA WAREHOUSE:

A data warehouse is where data is collected for mining purposes, usually with large
storage capacity. Data from an organization's various systems is stored in the data
warehouse, from which it can be fetched as needed.

Data warehouse process: 1. Source, 2. Extract, 3. Transform, 4. Load, 5. Target.
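
The five stages of the data warehouse process can be sketched with Python's standard library. Everything specific here is an assumption for illustration: the CSV text stands in for an operational source, the in-memory SQLite database stands in for the warehouse target, and the cleaning rules are arbitrary.

```python
import csv
import io
import sqlite3

# Source: a hypothetical CSV export from an operational system.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,n/a
3,carol,7.25
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert types, skip rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine this row
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # Target: stand-in for the warehouse
load(conn, transform(extract(SOURCE_CSV)))
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping extract, transform, and load as separate functions mirrors the separate stages of the process, so each can be changed or tested on its own.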

Data warehouses consolidate data from several sources and ensure data accuracy,
quality, and consistency. System performance improves because analytical processing is
kept separate from the transactional databases. In a data warehouse, data is organized
into a consistent format by type and as needed. The data is then examined with query
tools.

Data warehouses store historical data and handle analytical requests faster, supporting
online analytical processing (OLAP), whereas an operational database stores the current
transactions of a business process, which is called online transaction processing (OLTP).

FEATURES OF DATA WAREHOUSES:

● Subject Oriented:

It provides you with important data about a specific subject like suppliers, products,
promotion, customers, etc. Data warehousing usually handles the analysis and
modeling of data that assist any organization to make data-driven decisions.

● Integrated:

Different heterogeneous sources, such as flat files or relational databases, are put
together to build a data warehouse.

● Time-Variant:

The data collected in a data warehouse is identified with a specific period.

● Nonvolatile:

This means the earlier data is not deleted when new data is added to the data
warehouse. The operational database and data warehouse are kept separate and thus
continuous changes in the operational database are not shown in the data warehouse.
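
The time-variant and nonvolatile features can be sketched together as an append-only table in which every row carries the date it was loaded. The table name price_history, the column names, and the sample values are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_history (product TEXT, price REAL, load_date TEXT)")


def load_snapshot(conn, load_date, prices):
    """Append a dated snapshot; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        [(product, price, load_date) for product, price in prices.items()],
    )


# Two periodic loads from a hypothetical operational system.
load_snapshot(conn, "2024-01-12", {"widget": 9.99})
load_snapshot(conn, "2024-02-12", {"widget": 10.49})

history = conn.execute(
    "SELECT load_date, price FROM price_history WHERE product = ? ORDER BY load_date",
    ("widget",),
).fetchall()
print(history)
```

The operational system would hold only the current price, while the warehouse keeps both values, each tied to its period.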

APPLICATIONS OF DATA WAREHOUSES:


Data warehouses help analysts or senior executives analyze, organize, and use data for
decision making.

They are used in the following fields:

● Consumer goods
● Banking services
● Financial services
● Manufacturing
● Retail sectors

ADVANTAGES OF DATA WAREHOUSING:


● Cost-efficient and provides quality of data
● Performance and productivity are improved
● Accurate data access and consistency

1.6 Explain in detail the Architecture of a Data Warehouse.


Ans.:

A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-user computing
within the enterprise. Each data warehouse is different, but all are characterized by
standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and
inventory control are designed for online transaction processing (OLTP). Such
applications gather detailed data from day-to-day operations.

Data warehouse applications are designed to support users' ad-hoc data
requirements, an activity known as online analytical processing (OLAP). These
include applications such as forecasting, profiling, summary reporting, and trend
analysis.

Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational systems
periodically, usually during off-hours. As OLTP data accumulates in production
databases, it is regularly extracted, filtered, and then loaded into a dedicated
warehouse server that is accessible to users. As the warehouse is populated, it must
be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and
new fields and keys are added to reflect the needs of the user for sorting, combining, and
summarizing data.
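
The restructuring described above (de-normalizing tables, removing redundancies, and adding fields) can be sketched in plain Python. The record layouts and the function name denormalize are hypothetical, and a real warehouse would normally do this in SQL or an ETL tool.

```python
# Hypothetical normalized operational data: customers and orders kept apart.
customers = {
    1: {"name": "Alice", "region": "North"},
    2: {"name": "Bob", "region": "South"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 10, "customer_id": 1, "amount": 20.0},  # redundant duplicate
    {"order_id": 11, "customer_id": 2, "amount": 5.0},
]


def denormalize(orders, customers):
    """Build one wide row per order, skipping duplicate order ids."""
    seen = set()
    wide_rows = []
    for order in orders:
        if order["order_id"] in seen:
            continue  # cleanse the redundancy
        seen.add(order["order_id"])
        customer = customers[order["customer_id"]]
        # Copy customer attributes onto the order so that analytical queries
        # can sort and summarize without a join.
        wide_rows.append(
            {**order, "customer_name": customer["name"], "region": customer["region"]}
        )
    return wide_rows


wide_orders = denormalize(orders, customers)
for row in wide_orders:
    print(row)
```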

Data warehouses and their architectures vary depending upon the specifics of an
organization's situation.

Three common architectures are:

○ Data Warehouse Architecture: Basic
○ Data Warehouse Architecture: With Staging Area
○ Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

An operational system is a term used in data warehousing to refer to a system that
is used to process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.

Meta Data

Metadata is a set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

● Metadata summarizes necessary information about data, which makes finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.

● Metadata is used to direct a query to the most appropriate data source.
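
As a rough illustration of metadata directing a query to the most appropriate data source, the sketch below keeps a small catalog describing two tables and picks the smallest one that contains every requested column. The catalog layout, table names, and row counts are all invented for the example; real warehouse products keep far richer metadata.

```python
# Hypothetical metadata catalog describing the available tables.
CATALOG = [
    {
        "table": "sales_detail",
        "columns": {"date", "product", "store", "amount"},
        "rows": 10_000_000,
    },
    {
        "table": "sales_by_month",
        "columns": {"month", "product", "amount"},
        "rows": 1_200,
    },
]


def choose_source(needed_columns):
    """Pick the smallest table whose metadata lists every needed column."""
    candidates = [entry for entry in CATALOG if needed_columns <= entry["columns"]]
    if not candidates:
        raise LookupError(f"no table provides {sorted(needed_columns)}")
    return min(candidates, key=lambda entry: entry["rows"])["table"]


print(choose_source({"product", "amount"}))  # answerable from the summary table
print(choose_source({"store", "amount"}))    # needs the detailed table
```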

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The
summarized data is updated continuously as new information is loaded into the
warehouse.
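
A summarized table can be sketched as a precomputed aggregate that is rebuilt after each load. This is an assumption-laden illustration: the table names sales and sales_by_month are hypothetical, SQLite stands in for the warehouse, and a full rebuild is used for simplicity where a production system would more likely update the summary incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-05", "widget", 10.0),
        ("2024-01-20", "widget", 15.0),
        ("2024-02-03", "widget", 8.0),
    ],
)


def refresh_monthly_summary(conn):
    """Rebuild the aggregated table from the detailed rows."""
    conn.execute("DROP TABLE IF EXISTS sales_by_month")
    conn.execute(
        """
        CREATE TABLE sales_by_month AS
        SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total
        FROM sales
        GROUP BY substr(sale_date, 1, 7), product
        """
    )


refresh_monthly_summary(conn)
summary = conn.execute(
    "SELECT month, total FROM sales_by_month ORDER BY month"
).fetchall()
print(summary)
```

Queries for monthly totals can then read the few rows of sales_by_month instead of scanning every detailed sale.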

End-User access Tools

The principal purpose of a data warehouse is to provide information to business
managers for strategic decision-making. These users interact with the warehouse
using end-user access tools.

Examples of end-user access tools include:

○ Reporting and Query Tools
○ Application Development Tools
○ Executive Information Systems Tools
○ Online Analytical Processing Tools
○ Data Mining Tools


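Data Warehouse Architecture: With Staging Area

Operational data must be cleaned and processed before it is put into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a
place where data is processed before entering the warehouse).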

A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses where
all relevant data of an enterprise is consolidated.
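
A staging area can be pictured as a common layout into which records from different source systems are copied, cleaned, and consolidated before loading. The sketch below assumes two hypothetical sources (a CRM and a billing system) with different field names; the functions stage and consolidate are illustrative only.

```python
def stage(source_name, records, key_field, name_field):
    """Copy raw records into a common staging layout, tagged with their source."""
    return [
        {
            "source": source_name,
            "customer_id": str(record[key_field]).strip(),
            "name": record[name_field].strip().title(),
        }
        for record in records
    ]


def consolidate(staged_rows):
    """Keep the first cleaned row seen for each customer_id."""
    merged = {}
    for row in staged_rows:
        merged.setdefault(row["customer_id"], row)
    return list(merged.values())


# Hypothetical extracts from two source systems with different conventions.
crm = [{"id": 1, "full_name": "alice smith"}]
billing = [
    {"cust_no": " 1 ", "cust": "ALICE SMITH"},
    {"cust_no": "2", "cust": "bob jones"},
]

staging_area = stage("crm", crm, "id", "full_name") + stage(
    "billing", billing, "cust_no", "cust"
)
warehouse_rows = consolidate(staging_area)
for row in warehouse_rows:
    print(row)
```

Only the consolidated rows would be loaded into the warehouse; the staging rows can be discarded or kept for auditing.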

1.7 Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse
that can provide information for reporting and analysis on a section, unit, department or
operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In
this example, a financial analyst wants to analyze historical data for purchases and
sales or mine historical information to make predictions about customer behavior.
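
One simple way to picture a data mart is as a subject-specific view over the warehouse. The sketch below is an assumption for illustration, not how any particular product implements data marts: the table warehouse_facts, the subject names, and the function create_data_mart are hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_facts (subject TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "widget", 100.0),
        ("purchasing", "steel", 40.0),
        ("sales", "gadget", 60.0),
    ],
)

# Subjects are restricted to a fixed list because the name is placed directly
# into the SQL text (SQLite does not allow bound parameters inside a view).
ALLOWED_SUBJECTS = {"sales", "purchasing", "stocks"}


def create_data_mart(conn, subject):
    """Create a view exposing only one subject area of the warehouse."""
    if subject not in ALLOWED_SUBJECTS:
        raise ValueError(f"unknown subject: {subject}")
    view_name = f"{subject}_mart"
    conn.execute(
        f"CREATE VIEW {view_name} AS "
        f"SELECT item, amount FROM warehouse_facts WHERE subject = '{subject}'"
    )
    return view_name


sales_mart = create_data_mart(conn, "sales")
sales_rows = conn.execute(
    f"SELECT item, amount FROM {sales_mart} ORDER BY item"
).fetchall()
print(sales_rows)
```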

1.8 Explain Data Warehouse Delivery Methods.

Ans.: A data warehouse is never static; it evolves as the business expands. As the
business evolves, its requirements keep changing, and therefore a data warehouse must
be designed to accommodate these changes. Hence a data warehouse system needs to be
flexible.

Ideally, there should be a delivery process to deliver a data warehouse. However, data
warehouse projects normally suffer from various issues that make it difficult to
complete tasks and deliverables in the strict and ordered fashion demanded by the
waterfall method. Most of the time, the requirements are not understood completely.
The architectures, designs, and build components can be completed only after
gathering and studying all the requirements.

Delivery Method
The delivery method is a variant of the joint application development approach adopted
for the delivery of a data warehouse. We have staged the data warehouse delivery
process to minimize risks. The approach that we will discuss here does not reduce the
overall delivery time-scales but ensures the business benefits are delivered
incrementally through the development process.

Note − The delivery process is broken into phases to reduce the project and delivery
risk.

The following diagram explains the stages in the delivery process −



IT Strategy
Data warehouses are strategic investments that require a business process
to generate benefits. IT Strategy is required to procure and retain funding
for the project.

Business Case
The objective of the business case is to estimate business benefits that
should be derived from using a data warehouse. These benefits may not be
quantifiable but the projected benefits need to be clearly stated. If a data
warehouse does not have a clear business case, then the business tends to
suffer from credibility problems at some stage during the delivery process.
Therefore in data warehouse projects, we need to understand the business
case for investment.

Education and Prototyping


Organizations experiment with the concept of data analysis and educate
themselves on the value of having a data warehouse before settling for a
solution. This is addressed by prototyping. It helps in understanding the
feasibility and benefits of a data warehouse. A small-scale prototyping
activity can promote the educational process as long as:

● The prototype addresses a defined technical objective.
● The prototype can be thrown away after the feasibility concept has
been shown.
● The activity addresses a small subset of the eventual data content of the
data warehouse.
● The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and
deliver business benefits:

● Identify the architecture that is capable of evolving.
● Focus on business requirements and technical blueprint phases.
● Limit the scope of the first build phase to the minimum that delivers
business benefits.
● Understand the short-term and medium-term requirements of the data
warehouse.

Business Requirements
To provide quality deliverables, we should make sure the overall
requirements are understood. If we understand the business requirements
for both short-term and medium-term, then we can design a solution to
fulfill short-term requirements. The short-term solution can then be grown to
a full solution.

The following aspects are determined in this stage:

● The business rules to be applied to the data.
● The logical model for information within the data warehouse.
● The query profiles for the immediate requirement.
● The source systems that provide this data.

Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term
requirements. This phase also delivers the components that must be
implemented in the short term to derive any business benefit. The blueprint
needs to identify the following:

● The overall system architecture.
● The data retention policy.
● The backup and recovery strategy.
● The server and data mart architecture.
● The capacity plan for hardware and infrastructure.
● The components of database design.

1.9 Data warehouse uses and some tools.

Ans: Data warehouses are central repositories that store data from one or
more heterogeneous sources. Data warehouses are analytical tools built to
support decision-making for reporting users across many departments. A data
warehouse works to create a single, unified system of truth for an entire
organization. A data warehouse is a data management system that is used for
storing, reporting, and data analysis. It is the primary component of business
intelligence and is also known as an enterprise data warehouse. Data warehouses
organize and store historical data about a business or organization so that it
can be analyzed and insights can be extracted from it.

The tools that accurately source data contents and formats from operational and external
data stores into the data warehouse have to perform several essential tasks, including:

○ Data consolidation and integration.

○ Data transformation from one form to another form.

○ Data transformation and calculation based on the function of business rules that
force transformation.

○ Metadata synchronization and management, which includes storing or updating
metadata about source files, transformation actions, loading formats, and
events.

There are several selection criteria which should be considered while implementing a
data warehouse:

1. The ability to identify the data in the data source environment that can be read by
the tool is necessary.

2. Support for flat files, indexed files, and legacy DBMSs is critical.

3. The capability to merge records from multiple data stores is required in many
installations.

4. The specification interface to indicate the information to be extracted and its
conversion is essential.

5. The ability to read information from repository products or data dictionaries is
desired.

6. The code generated by the tool should be completely maintainable.

7. Selective data extraction of both data items and records enables users to extract
only the required data.

8. A field-level data examination for the transformation of data into information is
needed.

9. The ability to perform data type and character-set translation is a requirement
when moving data between incompatible systems.

10. The ability to create aggregation, summarization, and derivation fields and
records is necessary.

11. Vendor stability and support for the products are components that must be
evaluated carefully.
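
Criterion 9, data type and character-set translation, can be illustrated with a tiny example. The record layout (pipe-delimited fields), the Latin-1 source encoding, and the function name translate_record are assumptions chosen for the sketch; a real legacy system could use a different encoding and layout.

```python
def translate_record(raw, source_encoding="latin-1"):
    """Decode a legacy record and convert its text fields to typed values."""
    text = raw.decode(source_encoding)  # character-set translation
    customer_id, name, balance = text.split("|")
    # Data type translation: text fields become an int and a float.
    return {"customer_id": int(customer_id), "name": name, "balance": float(balance)}


# A hypothetical record as it might arrive from a legacy system in Latin-1.
legacy_bytes = "42|Zo\u00eb|19.90".encode("latin-1")

record = translate_record(legacy_bytes)
utf8_name = record["name"].encode("utf-8")  # re-encode for a UTF-8 target system
print(record, utf8_name)
```

The same name occupies three bytes in Latin-1 and four bytes in UTF-8, which is why the bytes cannot simply be copied between systems that use different character sets.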

Data Warehouse Software Components

A warehousing team will require different types of tools during a warehouse project.
These software products usually fall into one or more of the categories shown in the
figure.

The warehouse team needs tools that can extract, transform, integrate, clean, and load
information from a source system into one or more data warehouse databases.
Middleware and gateway products may be needed for warehouses that extract a record
from a host-based source system.

Warehouse Storage

Software products are also needed to store warehouse data and their accompanying
metadata. Relational database management systems are well suited to large and
growing warehouses.

Data access and retrieval

Different types of software are needed to access, retrieve, distribute, and present
warehouse data to its end users.
