Data Warehousing
A data warehouse is a centralized system used for storing and managing large volumes of data
from various sources. It is designed to help businesses analyze historical data and make
informed decisions. Data from different operational systems is collected, cleaned, and stored in
a structured way, enabling efficient querying and reporting.
1. Handling Large Volumes of Data: Traditional operational databases typically manage data on the order of megabytes to gigabytes, whereas a data warehouse is designed to handle much larger datasets (terabytes and beyond), allowing businesses to store and manage massive amounts of historical data.
2. Enhanced Analytics: Transactional databases are not optimized for analytical purposes. A
data warehouse is built specifically for data analysis, enabling businesses to perform complex
queries and gain insights from historical data.
3. Centralized Data Storage: A data warehouse acts as a central repository for all organizational
data, helping businesses to integrate data from multiple sources and have a unified view of their
operations for better decision-making.
4. Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze
trends over time, enabling them to make strategic decisions based on past performance and
predict future outcomes.
5. Support for Business Intelligence: Data warehouses support business intelligence tools and
reporting systems, providing decision-makers with easy access to critical information, which
enhances operational efficiency and supports data-driven strategies.
• Data Sources: These are the various operational systems, databases, and external data
feeds that provide raw data to be stored in the warehouse.
• ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting
data from different sources, transforming it into a suitable format, and loading it into the
data warehouse.
• Data Warehouse Database: This is the central repository where cleaned and
transformed data is stored. It is typically organized in a multidimensional format for
efficient querying and reporting.
• Metadata: Metadata describes the structure, source, and usage of data within the
warehouse, making it easier for users and systems to understand and work with the
data.
• Data Marts: These are smaller, more focused data repositories derived from the data
warehouse, designed to meet the needs of specific business departments or functions.
• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in
multiple dimensions, providing deeper insights and supporting complex analytical
queries.
• End-User Access Tools: These are reporting and analysis tools, such as dashboards
or Business Intelligence (BI) tools, that enable business users to query the data
warehouse and generate reports.
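The ETL flow described above can be sketched in a few lines of Python. The records, field names, and in-memory "warehouse" dictionary are all illustrative stand-ins for real operational sources and a real warehouse load step:

```python
# Minimal ETL sketch. The source rows and the dict-based "warehouse" are
# hypothetical; real pipelines extract from live systems and load into an
# actual warehouse database.

def extract():
    # Pretend these rows come from two different operational systems.
    crm = [{"id": 1, "name": "alice ", "city": "Pune"}]
    web = [{"id": 2, "name": "Bob", "city": " pune"}]
    return crm + web

def transform(rows):
    # Clean and standardize values before loading.
    return [
        {"id": r["id"], "name": r["name"].strip().title(),
         "city": r["city"].strip().title()}
        for r in rows
    ]

def load(rows, warehouse):
    for r in rows:
        warehouse[r["id"]] = r

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse[1]["name"])  # Alice
print(warehouse[2]["city"])  # Pune
```

The point of the transform step is that data from different sources arrives in inconsistent formats and must be unified before it reaches the warehouse.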
Data warehousing is essential for modern data management, providing a strong foundation for
organizations to consolidate and analyze data strategically. Its distinguishing features empower
businesses with the tools to make informed decisions and extract valuable insights from their
data.
• Centralized Data Repository: Data warehousing provides a centralized repository for all
enterprise data from various sources, such as transactional databases, operational
systems, and external sources. This enables organizations to have a comprehensive view
of their data, which can help in making informed business decisions.
• Data Integration: Data warehousing integrates data from different sources into a single,
unified view, which can help in eliminating data silos and reducing data inconsistencies.
• Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns and
anomalies in the data, which can be used to improve business performance.
• Query and Analysis: Data warehousing provides powerful query and analysis capabilities
that enable users to explore and analyze data in different ways. This can help in
identifying patterns and trends, and can also help in making informed business
decisions.
• Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can help in
identifying new opportunities, predicting future trends, and mitigating risks.
• Data Security: Data warehousing provides robust data security features, such as access
controls, data encryption, and data backups, which ensure that the data is secure and
protected from unauthorized access.
1. Enterprise Data Warehouse (EDW): A centralized warehouse that serves the entire organization, integrating data across departments.
2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day operations, not for deep analytics.
3. Data Mart: A smaller, focused subset of the warehouse serving a specific department or business function.
4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and flexibility.
5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured data for big data analysis.
6. Virtual Data Warehouse: Provides access to data from multiple sources without physically storing it.
7. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis for immediate insights.
Example – A database stores related data, such as the student details in a school. A data warehouse, by contrast, integrates the data from one or more databases so that analysis can be done to get results, such as the best-performing school in a city.
In a source-driven architecture for gathering data, the data sources transmit new information,
either continually (as transaction processing takes place), or periodically (nightly, for example).
In a destination-driven architecture, the data warehouse periodically sends requests for new data to the sources. Unless updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will never be quite up-to-date with the sources. Two-phase commit is usually far too expensive to be an option, so data warehouses typically have slightly out-of-date data. That, however, is usually not a problem for decision-support systems.
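A destination-driven (pull) refresh can be sketched as follows; the source rows, timestamps, and field names are hypothetical:

```python
# Sketch of a destination-driven (pull) refresh: the warehouse asks the
# source for rows changed since the last extract. All data is illustrative.

source = [
    {"id": 1, "updated_at": 10, "amount": 100},
    {"id": 2, "updated_at": 25, "amount": 250},
]
warehouse_rows = {}
last_pull = 0

def pull_new(since):
    # The warehouse requests only rows newer than its last sync point.
    return [r for r in source if r["updated_at"] > since]

for row in pull_new(last_pull):
    warehouse_rows[row["id"]] = row
last_pull = max(r["updated_at"] for r in source)

# A row updated at the source after the pull is not yet in the warehouse,
# illustrating why pull-based warehouses are slightly out-of-date:
source.append({"id": 3, "updated_at": 30, "amount": 75})
print(3 in warehouse_rows)  # False
```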
Data sources that have been constructed independently are likely to have different schemas. In
fact, they may even use different data models. Part of the task of a warehouse is to perform
schema integration, and to convert data to the integrated schema before they are stored. As a
result, the data stored in the warehouse are not just a copy of the data at the sources. Instead,
they can be thought of as a materialized view of the data at the sources.
The task of correcting and preprocessing data is called data cleansing. Data sources often deliver
data with numerous minor inconsistencies, which can be corrected. For example, names are
often misspelled, and addresses may have street, area, or city names misspelled, or postal
codes entered incorrectly. These can be corrected to a reasonable extent by consulting a
database of street names and postal codes in each city. The approximate matching of data
required for this task is referred to as fuzzy lookup.
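As a rough illustration of fuzzy lookup, Python's standard difflib can match misspelled values against a reference list. Real cleansing tools use more sophisticated matching (edit distance, phonetic codes); the city list and cutoff here are illustrative:

```python
import difflib

# Reference list of valid values (illustrative).
known_cities = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Kolkata"]

def cleanse_city(raw):
    # Approximate (fuzzy) match against the reference list.
    match = difflib.get_close_matches(raw.title(), known_cities, n=1, cutoff=0.7)
    return match[0] if match else raw  # keep the original if nothing is close

print(cleanse_city("mumbay"))  # Mumbai
print(cleanse_city("Chenai"))  # Chennai
```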
Updates on relations at the data sources must be propagated to the data warehouse. If the
relations at the data warehouse are exactly the same as those at the data source, the
propagation is straightforward. If they are not, the problem of propagating updates is basically
the view-maintenance problem.
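A minimal sketch of incremental view maintenance, assuming the warehouse keeps a sum-per-region materialized view and receives insert/delete deltas from the source rather than recomputing from scratch:

```python
# The materialized view: region -> total sales. The schema and deltas are
# illustrative; real systems derive deltas from source transaction logs.
view = {}

def apply_insert(row):
    view[row["region"]] = view.get(row["region"], 0) + row["amount"]

def apply_delete(row):
    view[row["region"]] -= row["amount"]

apply_insert({"region": "North", "amount": 100})
apply_insert({"region": "North", "amount": 50})
apply_insert({"region": "South", "amount": 70})
apply_delete({"region": "North", "amount": 50})
print(view)  # {'North': 100, 'South': 70}
```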
Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.
• Social Media Websites: Social networking sites such as Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, and so on, and store it in a single central repository. Given the large volume of data, a data warehouse is needed to implement this.
• Banking: Most banks use warehouses to analyze the spending patterns of account/cardholders and use this information to provide them with special offers, deals, etc.
• Government: Governments use data warehouses to store and analyze tax payments, which are used to detect tax evasion.
• Data Quality: Guarantees data quality and consistency for trustworthy reporting.
• Scalability: Capable of managing massive data volumes and expanding to meet changing
requirements.
• Effective Queries: Fast and effective data retrieval is made possible by an optimized
structure.
• Cost Reductions: Data warehousing can result in cost savings over time by streamlining data management procedures and increasing overall efficiency, even though there are initial setup costs.
• Data security: Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to certain
data.
• Faster Queries: A data warehouse is designed to handle large analytical queries, so it typically runs them faster than an operational database.
• Historical Insight: The warehouse stores all your historical data which contains details
about the business so that one can analyze it at any time and extract insights from it.
• Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
• Data integration challenges: Data from different sources can be challenging to integrate,
requiring significant effort to ensure consistency and accuracy.
• Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
1. Enterprise warehouse:
• An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
• It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
• It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
2. Data mart:
• A data mart contains a subset of corporate-wide data that is of value to a specific group
of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained in
data marts tend to be summarized.
• Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely
to be measured in weeks rather than months or years. However, it may involve complex
integration in the long run if its design and planning were not enterprise-wide.
• Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
• A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
• A virtual warehouse is easy to build but requires excess capacity on operational database servers.
A metadata repository typically contains:
• A description of the structure of the data warehouse, which includes the warehouse schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
• Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
• The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership information, and charging policies.
• Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.
• The drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region’s sales.
• Slicing and dicing is a feature whereby users can take out (slice) a specific set of data from the OLAP cube and view (dice) the slices from different viewpoints.
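These operations can be illustrated on a tiny in-memory fact table. The offices, divisions, and figures are made up; real OLAP engines operate on multidimensional cubes:

```python
# Toy fact table (all values illustrative).
facts = [
    {"office": "A1", "division": "East", "sales": 10},
    {"office": "A2", "division": "East", "sales": 15},
    {"office": "B1", "division": "West", "sales": 20},
]

def roll_up(rows, dim):
    # Consolidation: aggregate sales up to the given dimension.
    totals = {}
    for r in rows:
        totals[r[dim]] = totals.get(r[dim], 0) + r["sales"]
    return totals

def slice_(rows, dim, value):
    # Slicing: take out the subset of the data for one dimension value.
    return [r for r in rows if r[dim] == value]

print(roll_up(facts, "division"))         # {'East': 25, 'West': 20}
print(slice_(facts, "division", "East"))  # the two East-division offices
```

Drill-down is the reverse of the roll-up shown here: starting from division totals, the user navigates back to the individual office rows.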
4.1.1 Applications:
• Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing.
• In business, clustering can help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
• In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
• Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
• Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
• Clustering can also be used for outlier detection; applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
It is important to develop incremental clustering algorithms and algorithms that are insensitive
to the order of input.
7. High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering
algorithms are good at handling low-dimensional data, involving only two to three dimensions.
Human eyes are good at judging the qualities of clustering for up to three dimensions. Finding
clusters of data objects in high dimension space is challenging, especially considering that such
data can be sparse and highly skewed.
8. Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic banking
machines (ATMs) in a city. To decide upon this, you may cluster households while considering
constraints such as the city’s rivers and highway networks, and the type and number of
customers per cluster. A challenging task is to find groups of data with good clustering behavior
that satisfy specified constraints.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
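A minimal k-means sketch illustrates this partitioning criterion: points are assigned to the nearest center, and centers are recomputed until the groups stabilize. The points, initial centers, and iteration count are illustrative:

```python
import math

# Two visually separated groups of 2-D points (made-up data).
points = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]

def kmeans(pts, centers, iters=10):
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in pts:
            i = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(sorted(len(c) for c in clusters))  # [3, 3]
```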
• The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
• The divisive approach, also called the top-down approach, starts with all of the objects
in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until a termination condition
holds.
• Hierarchical methods suffer from the fact that once a step (merge or split) is done, it
can never be undone. This rigidity is useful in that it leads to smaller computation
costs by not having to worry about a combinatorial number of different choices.
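A sketch of the agglomerative (bottom-up) approach on one-dimensional values, using single-link distance and merging the closest pair of clusters until k groups remain; the data and k are illustrative:

```python
# Agglomerative clustering sketch: start with singletons, repeatedly merge
# the two closest clusters (single link = smallest pointwise gap).
def agglomerative(values, k):
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > k:
        # Find the closest pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y)
                               for x in clusters[ab[0]]
                               for y in clusters[ab[1]]),
        )
        # Merge them; note the merge can never be undone later.
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

print(agglomerative([1, 2, 3, 10, 11, 12], k=2))  # [[1, 2, 3], [10, 11, 12]]
```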
2. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
3. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
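A compact DBSCAN-style sketch, assuming illustrative values for eps and min_pts; points whose neighborhood is too sparse end up labeled as noise (-1):

```python
import math

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
        if len(nbrs) < min_pts:
            labels[i] = -1          # neighborhood not dense enough: noise
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:                # grow the cluster through dense points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = [k for k in range(len(points))
                      if math.dist(points[j], points[k]) <= eps]
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (20, 20) has no dense neighborhood, so it is filtered out as noise rather than forced into a cluster.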
2. All of the clustering operations are performed on the grid structure, i.e., on the quantized space. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.
3. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.
Spatial clustering deals with the existence of obstacles and with clustering under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.
1. Data Marts
• A Data Mart is a small, focused version of a Data Warehouse.
• Faster access because it handles a smaller volume of data compared to a full data
warehouse.
Example:
A Sales Data Mart might store only sales-related data, helping the sales team analyze
performance.
• Cost-Benefit Analysis (CBA) measures whether building a Data Warehouse is worth the
investment.
• Costs include:
o Hardware, software, and storage
o ETL development and data integration effort
o Ongoing maintenance and specialized staff
• Benefits include:
o Faster decision-making
o Improved data quality and consistency
o Reduced reporting and analysis effort
Goal: Ensure that the Data Warehouse brings more value than it costs to build and maintain.
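A toy cost-benefit comparison with hypothetical figures (real analyses use projected cash flows and discounting):

```python
# All figures are made up for illustration.
setup_cost = 500_000           # one-time build cost
annual_running_cost = 100_000  # maintenance, staff, infrastructure
annual_benefit = 300_000       # e.g., analyst time saved, better decisions
years = 5

total_cost = setup_cost + annual_running_cost * years
total_benefit = annual_benefit * years
print(total_benefit - total_cost)  # 500000 -> positive, so worthwhile here
```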
Features of OLAP:
• Multidimensional view of data
• Support for consolidation (roll-up), drill-down, and slice-and-dice operations
• Fast, interactive responses to complex analytical queries
Example:
Analyze quarterly sales data by product type and region over several years.
Data visualization helps to represent data graphically (charts, graphs, maps) to make it easier
to understand patterns, trends, and outliers.
Common chart types:
• Bar charts
• Line graphs
• Scatter plots
• Heatmaps
• Pie charts
Example:
A scatter plot showing the relationship between advertising spend and sales growth.
Data mining discovers interesting patterns, relationships, and knowledge from large amounts
of data.
Main Functionalities:
Functionality            Purpose
Classification           Assign items into predefined categories (e.g., spam or not spam).
Association Rule Mining  Discover relationships (e.g., "People who buy bread often buy butter").
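The association-rule idea can be illustrated by computing support and confidence for one rule (bread → butter) over made-up transactions:

```python
# Toy transaction data (illustrative).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n         # fraction of all transactions with both items
confidence = both / bread  # of the bread buyers, how many also buy butter
print(round(support, 2), round(confidence, 2))  # 0.5 0.67
```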
• Data Quality: Incomplete, noisy, or inconsistent data can affect the results.