
Data Warehousing

A data warehouse is a centralized system used for storing and managing large volumes of data
from various sources. It is designed to help businesses analyze historical data and make
informed decisions. Data from different operational systems is collected, cleaned, and stored in
a structured way, enabling efficient querying and reporting.

• The goal is to produce statistical results that can help in decision-making.

• It ensures fast data retrieval even with vast datasets.

Data Warehouse Architecture

Need for Data Warehousing

1. Handling Large Volumes of Data: Traditional databases can only store a limited amount of
data (MBs to GBs), whereas a data warehouse is designed to handle much larger datasets (TBs),
allowing businesses to store and manage massive amounts of historical data.
2. Enhanced Analytics: Transactional databases are not optimized for analytical purposes. A
data warehouse is built specifically for data analysis, enabling businesses to perform complex
queries and gain insights from historical data.

3. Centralized Data Storage: A data warehouse acts as a central repository for all organizational
data, helping businesses to integrate data from multiple sources and have a unified view of their
operations for better decision-making.

4. Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze
trends over time, enabling them to make strategic decisions based on past performance and
predict future outcomes.

5. Support for Business Intelligence: Data warehouses support business intelligence tools and
reporting systems, providing decision-makers with easy access to critical information, which
enhances operational efficiency and supports data-driven strategies.

Components of Data Warehouse

The main components of a data warehouse include:

• Data Sources: These are the various operational systems, databases, and external data
feeds that provide raw data to be stored in the warehouse.

• ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting
data from different sources, transforming it into a suitable format, and loading it into the
data warehouse (a short sketch follows this list).

• Data Warehouse Database: This is the central repository where cleaned and
transformed data is stored. It is typically organized in a multidimensional format for
efficient querying and reporting.

• Metadata: Metadata describes the structure, source, and usage of data within the
warehouse, making it easier for users and systems to understand and work with the
data.

• Data Marts: These are smaller, more focused data repositories derived from the data
warehouse, designed to meet the needs of specific business departments or functions.

• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in
multiple dimensions, providing deeper insights and supporting complex analytical
queries.
• End-User Access Tools: These are reporting and analysis tools, such as dashboards
or Business Intelligence (BI) tools, that enable business users to query the data
warehouse and generate reports.
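
Below is a minimal sketch of the three ETL stages, using Python's built-in sqlite3 module to stand in for both an operational source and the warehouse. The table and column names are illustrative assumptions, not a prescribed design.

```python
import sqlite3

# --- Extract: pull raw rows from a source system (an in-memory SQLite
# database standing in for an operational database) ---
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT, region TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "100.50", " North "), (2, "75.00", "SOUTH"), (3, None, "north")])
rows = source.execute("SELECT id, amount, region FROM orders").fetchall()

# --- Transform: clean and standardize (drop rows with missing amounts,
# cast amounts to float, normalize region names) ---
clean = [(i, float(a), r.strip().lower())
         for i, a, r in rows if a is not None and r is not None]

# --- Load: write the cleaned rows into a warehouse fact table ---
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, region TEXT)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)
warehouse.commit()

print(warehouse.execute(
    "SELECT region, SUM(amount) FROM fact_orders GROUP BY region").fetchall())
# [('north', 100.5), ('south', 75.0)]
```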

Characteristics of Data Warehousing

Data warehousing is essential for modern data management, providing a strong foundation for
organizations to consolidate and analyze data strategically. Its distinguishing features empower
businesses with the tools to make informed decisions and extract valuable insights from their
data.

• Centralized Data Repository: Data warehousing provides a centralized repository for all
enterprise data from various sources, such as transactional databases, operational
systems, and external sources. This enables organizations to have a comprehensive view
of their data, which can help in making informed business decisions.

• Data Integration: Data warehousing integrates data from different sources into a single,
unified view, which can help in eliminating data silos and reducing data inconsistencies.

• Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns and
anomalies in the data, which can be used to improve business performance.

• Query and Analysis: Data warehousing provides powerful query and analysis capabilities
that enable users to explore and analyze data in different ways. This can help in
identifying patterns and trends, and can also help in making informed business
decisions.

• Data Transformation: Data warehousing includes a process of data transformation,
which involves cleaning, filtering, and formatting data from various sources to make it
consistent and usable. This can help in improving data quality and reducing data
inconsistencies.

• Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can help in
identifying new opportunities, predicting future trends, and mitigating risks.

• Data Security: Data warehousing provides robust data security features, such as access
controls, data encryption, and data backups, which ensure that the data is secure and
protected from unauthorized access.

Types of Data Warehouses

The different types of Data Warehouses are:


1. Enterprise Data Warehouse (EDW): A centralized warehouse that stores data from
across the organization for analysis and reporting.

2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day
operations, not for deep analytics.

3. Data Mart: A subset of a data warehouse, focusing on a specific business area or
department.

4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and
flexibility.

5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured
data for big data analysis.

6. Virtual Data Warehouse: Provides access to data from multiple sources without
physically storing it.

7. Hybrid Data Warehouse: Combines on-premises and cloud-based storage to offer
flexibility.

8. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis
for immediate insights.

Data Warehouse vs DBMS

• Processing: A common database is based on operational or transactional processing,
where each operation is an indivisible transaction. A data warehouse is based on
analytical processing.

• Data currency: A database generally stores current, up-to-date data used for daily
operations. A data warehouse maintains historical data over time; historical data is
kept over years and can be used for trend analysis, future predictions, and decision
support.

• Scope: A database is generally application-specific. A data warehouse is integrated,
generally at the organization level, by combining data from different databases.

• Example: A database stores related data, such as the student details in a school. A
data warehouse integrates the data from one or more databases, so that analysis can
be done to get results such as the best-performing school in a city.

• Cost: Constructing a database is not so expensive. Constructing a data warehouse can
be expensive.

Issues That Occur While Building the Warehouse

When and how to gather data?

In a source-driven architecture for gathering data, the data sources transmit new information,
either continually (as transaction processing takes place) or periodically (nightly, for example).
In a destination-driven architecture, the data warehouse periodically sends requests for new
data to the sources. Unless updates at the sources are replicated at the warehouse via two-phase
commit, the warehouse will never be quite up to date with the sources. Two-phase
commit is usually far too expensive to be an option, so data warehouses typically have slightly
out-of-date data. That, however, is usually not a problem for decision-support systems.

What schema to use?

Data sources that have been constructed independently are likely to have different schemas. In
fact, they may even use different data models. Part of the task of a warehouse is to perform
schema integration, and to convert data to the integrated schema before they are stored. As a
result, the data stored in the warehouse are not just a copy of the data at the sources. Instead,
they can be thought of as a materialized view of the data at the sources.

Data transformation and cleansing?

The task of correcting and preprocessing data is called data cleansing. Data sources often deliver
data with numerous minor inconsistencies, which can be corrected. For example, names are
often misspelled, and addresses may have street, area, or city names misspelled, or postal
codes entered incorrectly. These can be corrected to a reasonable extent by consulting a
database of street names and postal codes in each city. The approximate matching of data
required for this task is referred to as fuzzy lookup.
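
As a rough illustration, Python's standard-library difflib can perform this kind of approximate matching against a reference list. The street names and the similarity cutoff below are hypothetical choices.

```python
from difflib import get_close_matches

# Hypothetical reference list of correct street names (the "database of
# street names" mentioned above).
known_streets = ["Baker Street", "Church Road", "Station Avenue"]

def fuzzy_lookup(raw_name, candidates, cutoff=0.7):
    """Return the best approximate match for a possibly misspelled name, or None."""
    matches = get_close_matches(raw_name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_lookup("Bakr Stret", known_streets))   # -> Baker Street
print(fuzzy_lookup("Chrch Road", known_streets))   # -> Church Road
print(fuzzy_lookup("Elm Street", known_streets))   # -> None (no close match)
```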

How to propagate updates?

Updates on relations at the data sources must be propagated to the data warehouse. If the
relations at the data warehouse are exactly the same as those at the data source, the
propagation is straightforward. If they are not, the problem of propagating updates is basically
the view-maintenance problem.
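
A toy sketch of the idea: instead of recomputing the warehouse relation from scratch, source-side changes arrive as a delta stream and are applied incrementally to a materialized aggregate. The function and column names are illustrative.

```python
from collections import defaultdict

# Materialized view at the warehouse: total sales per region.
sales_by_region = defaultdict(float)

def propagate_insert(region, amount):
    """Apply a source-side INSERT to the warehouse view."""
    sales_by_region[region] += amount

def propagate_delete(region, amount):
    """Apply a source-side DELETE to the warehouse view."""
    sales_by_region[region] -= amount

# Updates at the source are propagated as they happen:
propagate_insert("north", 100.0)
propagate_insert("north", 50.0)
propagate_delete("north", 25.0)
print(dict(sales_by_region))   # {'north': 125.0}
```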

Example Applications of Data Warehousing

Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.

• Social Media Websites: Social networking websites such as Facebook, Twitter, and
LinkedIn are based on analyzing large data sets. These sites gather data related to
members, groups, locations, and so on, and store it in a single central repository. Given
the sheer volume of data, a data warehouse is needed to implement this.

• Banking: Most banks use warehouses to analyze the spending patterns of account and
card holders, and use these insights to provide special offers, deals, and so on.

• Government: Governments use data warehouses to store and analyze tax payments,
which can be used to detect tax evasion.

Advantages of Data Warehousing

• Intelligent Decision-Making: With centralized data in warehouses, decisions may be
made more quickly and intelligently.

• Business Intelligence: Provides strong operational insights through business intelligence.

• Data Quality: Guarantees data quality and consistency for trustworthy reporting.

• Scalability: Capable of managing massive data volumes and expanding to meet changing
requirements.

• Effective Queries: Fast and effective data retrieval is made possible by an optimized
structure.

• Cost Reductions: Data warehousing can result in cost savings over time by streamlining
data management procedures and increasing overall efficiency, even though there are
initial setup costs.
• Data security: Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to certain
data.

• Faster Queries: The data warehouse is designed to handle large queries, which is why it
runs queries faster than a typical operational database.

• Historical Insight: The warehouse stores all your historical data which contains details
about the business so that one can analyze it at any time and extract insights from it.

Disadvantages of Data Warehousing

• Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.

• Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.

• Time-consuming: Building a data warehouse can take a significant amount of time,
requiring businesses to be patient and committed to the process.

• Data integration challenges: Data from different sources can be challenging to integrate,
requiring significant effort to ensure consistency and accuracy.

• Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.

1.9.3 Data Warehouse Models:


There are three data warehouse models.

1. Enterprise warehouse:
• An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
• It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
• It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional mainframes,
computer superservers, or parallel architecture platforms. It requires extensive business
modeling and may take years to design and build.
2. Data mart:
• A data mart contains a subset of corporate-wide data that is of value to a specific group
of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained in
data marts tend to be summarized.
• Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely
to be measured in weeks rather than months or years. However, it may involve complex
integration in the long run if its design and planning were not enterprise-wide.
• Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.

3. Virtual warehouse:
• A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
• A virtual warehouse is easy to build but requires excess capacity on operational database
servers.

1.9.4 Meta Data Repository:


Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted
data, the source of the extracted data, and missing fields that have been added by data cleaning
or integration processes.

A metadata repository should contain the following:

• A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
• Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
• The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
• Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and scheduling
of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

1.10 OLAP (Online analytical Processing):

• OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.


• OLAP is part of the broader category of business intelligence, which also encompasses
relational databases, report writing, and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
OLAP consists of three basic analytical operations:
Consolidation (Roll-Up)
Drill-Down
Slicing And Dicing

• Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.
• The drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region’s sales.
• Slicing and dicing is a feature whereby users can take out (slice) a specific set of data
from the OLAP cube and view (dice) the slices from different viewpoints, as illustrated
below.
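
The three operations can be mimicked on a toy sales "cube" with pandas (assumed available; the dimensions and figures are made up).

```python
import pandas as pd

# Toy sales data with three dimensions: office, product, quarter.
sales = pd.DataFrame({
    "office":  ["NY", "NY", "LA", "LA", "NY", "LA"],
    "product": ["A",  "B",  "A",  "B",  "A",  "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [100,  150,  80,   120,  90,   60],
})

# Roll-up (consolidation): aggregate individual offices up to quarterly totals.
rollup = sales.groupby("quarter")["amount"].sum()

# Drill-down: break the quarterly totals back out by product.
drilldown = sales.groupby(["quarter", "product"])["amount"].sum()

# Slice: fix one dimension (quarter = Q1) ...
q1_slice = sales[sales["quarter"] == "Q1"]

# ... and dice: view that slice from another perspective (by office).
dice = q1_slice.groupby("office")["amount"].sum()

print(rollup, drilldown, dice, sep="\n\n")
```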

1.10.1 Types of OLAP:


1. Relational OLAP (ROLAP):
• ROLAP works directly with relational databases. The base data and the dimension tables
are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement (see the sketch after this list).
• ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
• ROLAP tools can answer any question, because the methodology is not limited to the
contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail
in the database.
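
A small sqlite3 sketch of this point: the slice below is literally a WHERE clause over an ordinary relational table, and the dice is a GROUP BY on the remaining dimension. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "A", 100.0), ("north", "B", 150.0), ("south", "A", 80.0),
])

# Slicing on region = 'north' is just a WHERE clause; dicing by product
# is a GROUP BY over the rows that remain.
query = """
    SELECT product, SUM(amount)
    FROM sales
    WHERE region = 'north'
    GROUP BY product
"""
print(conn.execute(query).fetchall())   # [('A', 100.0), ('B', 150.0)]
```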

2. Multidimensional OLAP (MOLAP):


• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores data in optimized multi-dimensional array storage, rather than in a
relational database. It therefore requires the pre-computation and storage of
information in the cube, an operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The
data cube contains all the possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back data
into the data set.

3. Hybrid OLAP (HOLAP):


• There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the
larger quantities of detailed data, and use specialized storage for at least some aspects
of the smaller quantities of more-aggregate or less-detailed data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities
of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources.
4.1 Cluster Analysis:
• The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
• A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
• A cluster of data objects can be treated collectively as one group and so may be
considered as a form of data compression.
• Cluster analysis tools based on k-means, k-medoids, and several other methods have
also been built into many statistical analysis software packages or systems, such as
S-Plus, SPSS, and SAS.

4.1.1 Applications:
• Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
• In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
• In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
• Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
• Clustering can also be used for outlier detection; Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce.

4.1.2 Typical Requirements of Clustering in Data Mining:


1.Scalability:
Many clustering algorithms work well on small data sets containing fewer than several hundred
data objects; however, a large database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly scalable clustering algorithms are
needed.

2.Ability to deal with different types of attributes:


Many algorithms are designed to cluster interval-based (numerical) data. However, applications
may require clustering other types of data, such as binary, categorical (nominal), and ordinal
data, or mixtures of these data types.

3.Discovery of clusters with arbitrary shape:


Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.

However, a cluster could be of any shape. It is important to develop algorithms that can detect
clusters of arbitrary shape.

4.Minimal requirements for domain knowledge to determine input parameters:


Many clustering algorithms require users to input certain parameters in cluster analysis (such as
the number of desired clusters). The clustering results can be quite sensitive to input
parameters. Parameters are often difficult to determine, especially for data sets containing high-
dimensional objects. This not only burdens users, but it also makes the quality of clustering
difficult to control.

5.Ability to deal with noisy data:


Most real-world databases contain outliers or missing, unknown, or erroneous data. Some
clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

6.Incremental clustering and insensitivity to the order of input records:


Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into
existing clustering structures and, instead, must determine a new clustering from scratch. Some
clustering algorithms are sensitive to the order of input data. That is, given a set of data objects,
such an algorithm may return dramatically different clusterings depending on the order of
presentation of the input objects.

It is important to develop incremental clustering algorithms and algorithms that are insensitive
to the order of input.
7.High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering
algorithms are good at handling low-dimensional data, involving only two to three dimensions.
Human eyes are good at judging the qualities of clustering for up to three dimensions. Finding
clusters of data objects in high dimension space is challenging, especially considering that such
data can be sparse and highly skewed.

8.Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic banking
machines (ATMs) in a city. To decide upon this, you may cluster households while considering
constraints such as the city’s rivers and highway networks, and the type and number of
customers per cluster. A challenging task is to find groups of data with good clustering behavior
that satisfy specified constraints.

9.Interpretability and usability:


Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering features
and methods.

4.2 Major Clustering Methods:


• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
• Each group must contain at least one object, and
• Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
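
To make the iterative relocation idea concrete, here is a minimal pure-Python sketch of k-means, one classic partitioning method. The points, the value of k, and the iteration count are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: iteratively relocate 2-D points between k groups."""
    random.seed(seed)
    centers = random.sample(points, k)       # initial partitioning
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each object joins the group with the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                  (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: recompute each center as the mean of its group.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)   # final cluster centers
```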

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how
the hierarchical decomposition is formed.

• The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
• The divisive approach, also called the top-down approach, starts with all of the objects
in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until a termination condition
holds.
• Hierarchical methods suffer from the fact that once a step (merge or split) is done, it
can never be undone. This rigidity is useful in that it leads to smaller computation
costs by not having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

1. Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in
Chameleon, or

2. Integrate hierarchical agglomeration and other approaches by first using a hierarchical
agglomerative algorithm to group objects into microclusters, and then performing macro
clustering on the microclusters using another clustering method such as iterative relocation.
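
A short agglomerative run using SciPy (assumed available): linkage builds the bottom-up merge tree, and fcluster cuts it into a chosen number of flat clusters. The points are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separate groups.
X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [9, 9], [8.5, 9.5]])

# Agglomerative (bottom-up): successively merge the two closest groups
# until everything sits in a single cluster.
Z = linkage(X, method="average")

# Cut the merge tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```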

4.2.3 Density-based methods:


1. Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.

2. Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points. Such a
method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

3. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters
according to a density-based connectivity analysis. DENCLUE is a method that clusters objects
based on the analysis of the value distributions of density functions.
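
As a minimal density-based example, scikit-learn's DBSCAN (assumed available) grows clusters wherever the eps-neighborhood of a point contains at least min_samples points, and labels isolated points as noise (-1). The data and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [50.0, 50.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [ 0  0  0  1  1  1 -1]  (-1 marks the outlier)
```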

4.2.4 Grid-Based Methods:


1. Grid-based methods quantize the object space into a finite number of cells that form a grid
structure.

2. All of the clustering operations are performed on the grid structure, i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.

3. STING is a typical example of a grid-based method. WaveCluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


1. Model-based methods hypothesize a model for each of the clusters and find the best fit of
the data to the given model.

2. A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.

3. It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust clustering
methods.

4.3 Tasks in Data Mining:


• Clustering High-Dimensional Data
• Constraint-Based Clustering

4.3.1 Clustering High-Dimensional Data:


• It is a particularly important task in cluster analysis because many applications require
the analysis of objects containing a large number of features or dimensions.
• For example, text documents may contain thousands of terms or keywords as features,
and DNA microarray data may provide information on the expression levels of
thousands of genes under hundreds of conditions.
• Clustering high-dimensional data is challenging due to the curse of dimensionality.
• Many dimensions may not be relevant. As the number of dimensions increases, the
data become increasingly sparse, so that the distance measurement between pairs of
points becomes meaningless and the average density of points anywhere in the data is
likely to be low. Therefore, a different clustering methodology needs to be developed
for high-dimensional data.
• CLIQUE and PROCLUS are two influential subspace clustering methods, which search
for clusters in subspaces of the data, rather than over the entire data space.
• Frequent pattern–based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses such
patterns to group objects and generate meaningful clusters.

4.3.2 Constraint-Based Clustering:


It is a clustering approach that performs clustering by incorporating user-specified or
application-oriented constraints.

A constraint expresses a user's expectation or describes properties of the desired clustering
results, and provides an effective means for communicating with the clustering process.

Various kinds of constraints can be specified, either by a user or as per application
requirements.

Spatial clustering deals with clustering in the presence of obstacles and clustering under
user-specified constraints. In addition, semi-supervised clustering employs pairwise
constraints in order to improve the quality of the resulting clustering.
1. Data Marts
• A Data Mart is a small, focused version of a Data Warehouse.

• It is designed for a specific department or a particular group of users (e.g., marketing,
sales, finance).

• Faster access because it handles a smaller volume of data compared to a full data
warehouse.

Example:
A Sales Data Mart might store only sales-related data, helping the sales team analyze
performance.

2. Data Warehouse Cost-Benefit Analysis / Return on Investment (ROI)

• Cost-Benefit Analysis (CBA) measures whether building a Data Warehouse is worth the
investment.

• Costs include:

o Hardware and software expenses

o Hiring technical staff

o Maintenance and updates

• Benefits include:

o Faster decision-making

o Better customer insights

o Increased revenue due to improved strategies

Return on Investment (ROI):

ROI = ((Total Benefits - Total Costs) / Total Costs) × 100

Goal: Ensure that the Data Warehouse brings more value than it costs to build and maintain.
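
For instance, a one-line calculation with hypothetical figures:

```python
def roi(total_benefits, total_costs):
    """ROI as a percentage, per the formula above."""
    return (total_benefits - total_costs) / total_costs * 100

# Hypothetical figures: $500k of benefits against $200k of costs.
print(roi(500_000, 200_000))   # 150.0 -> the warehouse returns 150% ROI
```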

3. OLAP Technology for Data Mining


OLAP = Online Analytical Processing
It helps in analyzing huge volumes of data quickly and interactively.

Features of OLAP:

• Multidimensional Analysis: View data in dimensions (e.g., time, product, region).

• Fast Query Performance: Quick retrieval of summarized data.

• Drill-Down and Roll-Up: Zoom into details or summarize data easily.

• Slice and Dice: Look at the data from different perspectives.

Example:
Analyze quarterly sales data by product type and region over several years.

Importance for Data Mining:

• OLAP tools prepare the data efficiently for mining.

• They help discover trends, patterns, and anomalies quickly.

1. Data Visualization Principles

Data visualization helps to represent data graphically (charts, graphs, maps) to make it easier
to understand patterns, trends, and outliers.

Principles:

• Clarity: Keep visuals simple and clean.

• Accuracy: Represent data truthfully without distortion.

• Efficiency: Use charts that are easy to interpret quickly.

• Consistency: Use consistent colors, symbols, and scales.

• Focus: Highlight the most important insights.

Common Visualization Tools:

• Bar charts

• Line graphs

• Scatter plots

• Heatmaps
• Pie charts

Example:
A scatter plot showing the relationship between advertising spend and sales growth.
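
A minimal matplotlib sketch of that example; the spend and growth figures are made up, and the axis labels follow the clarity principle above.

```python
import matplotlib.pyplot as plt

# Hypothetical data: advertising spend (in $1000s) vs. sales growth (%).
ad_spend     = [10, 20, 30, 40, 50, 60]
sales_growth = [2.1, 3.0, 4.2, 4.8, 6.1, 6.5]

plt.scatter(ad_spend, sales_growth)
plt.xlabel("Advertising spend ($1000s)")
plt.ylabel("Sales growth (%)")
plt.title("Advertising spend vs. sales growth")
plt.show()
```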

2. Data Mining Functionalities

Data mining discovers interesting patterns, relationships, and knowledge from large amounts
of data.

Main Functionalities:

• Classification: Assign items into predefined categories (e.g., spam or not spam; see the
sketch after this list).

• Clustering: Group similar items without predefined labels (e.g., customer
segmentation).

• Association Rule Mining: Discover relationships (e.g., "People who buy bread often buy
butter").

• Prediction: Predict future values (e.g., predicting house prices).

• Outlier Detection: Identify unusual data records (e.g., fraud detection).

• Summarization: Provide a compact description (e.g., average, max, min).
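
As a small illustration of the first functionality above, here is a decision-tree classifier from scikit-learn (assumed available) trained on made-up email features; the choice of features and labels is purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: [word_count, link_count] per email,
# labeled spam (1) or not spam (0).
X_train = [[120, 0], [80, 1], [30, 8], [40, 10], [200, 1], [25, 7]]
y_train = [0, 0, 1, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classification assigns a new item to one of the predefined categories.
print(clf.predict([[35, 9]]))   # -> [1] (likely spam)
```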

3. Major Issues in Data Mining

When mining data, certain challenges arise:

• Data Quality: Incomplete, noisy, or inconsistent data can affect the results.

• Scalability: Mining very large datasets is computationally expensive.

• High Dimensionality: Too many attributes/features can confuse models.

• Privacy and Security: Sensitive data must be protected.

• Interpretability: Complex models (like deep learning) can be hard to understand.

• Integration: Combining data from different sources can be tricky.


• Dynamic Data: Data keeps changing, so models need frequent updates.
