Bda U2
UNIT 2
Data Warehouse
• Data Warehouse:
✓A data warehouse is a collection of data gathered from various
heterogeneous sources.
✓It is the main component of a business intelligence system, where
data is managed and analyzed to improve decision making.
✓It involves the process of extraction, transformation, and loading (ETL)
to provide the data for analysis.
✓Data warehouses are also used to perform queries on a large
amount of data.
✓It uses data from various relational databases and application log
files.
Big Data vs Data Warehouse
How does data warehousing relate to big data?
• The two are quite different. Big data, as a technology, is a means
to store and manage large volumes of data.
• A data warehouse, on the other hand, is a set of software and
techniques that facilitate data collection and integration into a
centralized database.
Key Characteristics of Data Warehouse
• Integrated
➢A data warehouse is developed by integrating data from varied
sources into a consistent format.
➢The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and
coding.
➢This facilitates effective data analysis.
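As a minimal sketch of integration, the Python below maps records from two sources onto one consistent naming, format, and coding scheme. The source systems (`crm_rows`, `pos_rows`), their field names, and the target schema are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical records from two source systems with different conventions.
crm_rows = [{"cust_name": "Ada Lovelace", "gender": "F", "joined": "2023-01-15"}]
pos_rows = [{"CustomerName": "alan turing", "sex": "male", "signup": "15/02/2023"}]

# One agreed coding for gender, whatever the source uses.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}

def from_crm(row):
    # CRM dates are assumed to be ISO formatted (YYYY-MM-DD).
    return {
        "customer_name": row["cust_name"].title(),
        "gender": GENDER_CODES[row["gender"].lower()],
        "signup_date": datetime.strptime(row["joined"], "%Y-%m-%d").date().isoformat(),
    }

def from_pos(row):
    # Point-of-sale dates are assumed to be DD/MM/YYYY.
    return {
        "customer_name": row["CustomerName"].title(),
        "gender": GENDER_CODES[row["sex"].lower()],
        "signup_date": datetime.strptime(row["signup"], "%d/%m/%Y").date().isoformat(),
    }

# Every record now shares the same field names, date format, and codes.
integrated = [from_crm(r) for r in crm_rows] + [from_pos(r) for r in pos_rows]
```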
• Non-Volatile
➢Data once entered into a data warehouse must remain
unchanged.
➢All data is read-only. Previous data is not erased when current
data is entered.
➢This helps you to analyze what has happened and when.
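One way to picture non-volatility is an append-only table: new facts are added as new rows and nothing is overwritten. The class below is a hypothetical in-memory stand-in, not a real warehouse API.

```python
from datetime import date

class AppendOnlyTable:
    """Hypothetical in-memory table that allows inserts and reads only."""

    def __init__(self):
        self._rows = []

    def append(self, **row):
        # Store a copy so later changes by the caller cannot alter history.
        self._rows.append(dict(row))

    def rows(self):
        # Hand out copies; the stored rows are never exposed for mutation.
        return [dict(r) for r in self._rows]

prices = AppendOnlyTable()
prices.append(product="widget", price=9.99, loaded_on=date(2024, 1, 1))
# A price change arrives as a new row; the January row is left untouched.
prices.append(product="widget", price=10.49, loaded_on=date(2024, 2, 1))

price_history = [(r["loaded_on"].isoformat(), r["price"]) for r in prices.rows()]
```

Because there is deliberately no update or delete method, both the old and the new price remain available for analysis.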
• Time-Variant
➢The data stored in a data warehouse is documented with an element of
time, either explicitly or implicitly.
➢For example, the primary key of a warehouse record carries an element
of time, such as the day, week, or month.
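A small sketch of a time-variant key, using Python's built-in `sqlite3` module as a stand-in for a warehouse. The table `product_price` and its columns are assumptions made for this example; the point is only that the time element is part of the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_price (
        product_id     INTEGER NOT NULL,
        snapshot_month TEXT    NOT NULL,  -- time element, e.g. '2024-01'
        price          REAL    NOT NULL,
        PRIMARY KEY (product_id, snapshot_month)
    )
""")
conn.executemany(
    "INSERT INTO product_price VALUES (?, ?, ?)",
    [(1, "2024-01", 9.99), (1, "2024-02", 10.49), (2, "2024-01", 4.00)],
)

# The same product appears once per month, so its history can be queried.
history = conn.execute(
    "SELECT snapshot_month, price FROM product_price "
    "WHERE product_id = ? ORDER BY snapshot_month",
    (1,),
).fetchall()
```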
Data Warehouse Architecture
Middle Tier
The middle tier represents an OLAP server that can be implemented in
two ways.
▪ The ROLAP or Relational OLAP model is an extended relational
database management system that maps multidimensional data
processes to standard relational processes.
▪ The MOLAP or multidimensional OLAP directly acts on
multidimensional data and operations.
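The difference between the two approaches can be sketched roughly as follows. The ROLAP part answers a multidimensional question with an ordinary relational `GROUP BY`, while the MOLAP part keeps pre-aggregated cells in a cube-like structure and reads the answer directly. The sample data and the dictionary-based cube are illustrative assumptions, not how any particular OLAP server stores data.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales facts: (region, quarter, amount).
sales = [
    ("North", "Q1", 100.0), ("North", "Q2", 150.0),
    ("South", "Q1", 80.0), ("South", "Q2", 120.0),
]

# ROLAP style: the aggregation is translated into a relational query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales)
rolap_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

# MOLAP style: totals are precomputed per cell, including "ALL" roll-ups.
cube = defaultdict(float)
for region, quarter, amount in sales:
    cube[(region, quarter)] += amount
    cube[(region, "ALL")] += amount
    cube[("ALL", quarter)] += amount
    cube[("ALL", "ALL")] += amount

# Reading a roll-up is now a direct lookup in the cube.
molap_by_region = {r: cube[(r, "ALL")] for r in ("North", "South")}
```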
Top Tier
▪ This is the front-end client interface that gets data out from the data
warehouse.
▪ It holds various tools like query tools, analysis tools, reporting tools,
and data mining tools.
How Data Warehouse Works
• Data Warehousing integrates data and information collected from various
sources into one comprehensive database.
• For example, a data warehouse might combine customer information from
an organization’s point-of-sale systems, its mailing lists, website, and
comment cards.
• It might also incorporate confidential information about employees, salary
information, etc.
• Businesses use such components of data warehouses to analyze
customers.
• Data mining is one of the features of a data warehouse that involves
looking for meaningful data patterns in vast volumes of data and devising
innovative strategies for increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouses.
Enterprise Data Warehouse (EDW)
• This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise.
• The advantage of this type of warehouse is that it provides access to
cross-organizational information, offers a unified approach to data
representation, and allows running complex queries.
Operational Data Store (ODS)
• This type of data warehouse refreshes in real time.
• It is often preferred for routine activities like storing employee records.
• It is required when data warehouse systems do not support reporting needs of
the business.
Data Mart
• A data mart is a subset of a data warehouse built to maintain a particular
department, region, or business unit.
• Every department of a business has a central repository or data mart to store
data.
• The data from the data mart is stored in the ODS (operational data store)
periodically. The ODS then sends the data to the EDW, where it is stored and
used.
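Taking the "subset of a data warehouse" definition literally, a data mart can be sketched as a filtered view over a warehouse table. The table `warehouse_sales`, the view `marketing_mart`, and the sample rows are hypothetical; real data marts are often separate physical stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "EU", 100.0), ("finance", "EU", 50.0), ("marketing", "US", 70.0)],
)

# The mart exposes only the rows that belong to one department.
conn.execute("""
    CREATE VIEW marketing_mart AS
    SELECT region, amount FROM warehouse_sales WHERE department = 'marketing'
""")

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
```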
Data Warehousing Tools
These tools help to collect, read, write, and transfer data from
various sources.
They are designed to support operations like data sorting,
filtering, merging, etc.
• Query and reporting tools
• Application Development tools
• Data mining tools
• OLAP tools
Benefits of Data Warehouse
There are several benefits of a data warehouse:
ETL Process
ETL (extract, transform, load) is an iterative process that is repeated as new data is
added to the warehouse. The process is important because it keeps the data in the
warehouse accurate, complete, and up to date. It also helps to ensure that the data
is in the format required for data mining and reporting.
How ETL works
Extract
During data extraction, raw data is copied or exported from source locations
to a staging area. Data management teams can extract data from a variety of
data sources, which can be structured or unstructured. Those sources
include but are not limited to:
• SQL or NoSQL servers
• CRM and ERP systems
• Flat files
• Email
• Web pages
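A minimal sketch of the extract step, assuming two hypothetical sources: a flat file (simulated with `io.StringIO`) and a SQL table (an in-memory SQLite database). Rows are copied unchanged into a `staging` list, with a `source` tag added so their origin stays traceable.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source, simulated in memory.
flat_file = io.StringIO("id,name,country\n1,Ada,U.S.A\n2,Alan,\n")

# Hypothetical SQL source.
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source_db.execute("INSERT INTO customers VALUES (3, 'Grace', 'United States')")

def extract_csv(handle):
    # Raw values are kept as-is (all strings); cleaning happens later.
    return [dict(row, source="flat_file") for row in csv.DictReader(handle)]

def extract_sql(conn):
    cur = conn.execute("SELECT id, name, country FROM customers")
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row), source="sql") for row in cur]

# The staging area holds raw copies from every source.
staging = extract_csv(flat_file) + extract_sql(source_db)
```

Nothing is cleaned at this stage: the empty country and the differing `id` types are left for the transform step.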
Transform
The second step of the ETL process is transformation. In this step, a set of
rules or functions is applied to the extracted data to convert it into a single
standard format. It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling NULL values with default values, mapping variants such as
U.S.A, United States, and America to USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally the key
attribute).
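The transformation tasks listed above can be sketched in one small function. The staged rows, field names, and the `UNKNOWN` default are assumptions for illustration, and the address is assumed to always have the form "city, region".

```python
# Cleaning rule: map known spellings of a country to one code.
COUNTRY_ALIASES = {"u.s.a": "USA", "united states": "USA", "america": "USA", "usa": "USA"}

# Hypothetical rows as they might sit in the staging area.
staged = [
    {"id": 2, "first": "Alan", "last": "Turing", "country": "America",
     "address": "London, UK", "notes": "internal"},
    {"id": 1, "first": "Ada", "last": "Lovelace", "country": None,
     "address": "Paris, FR", "notes": "internal"},
]

def transform(rows):
    out = []
    for row in rows:
        # Cleaning: replace NULL with a default, then normalize aliases.
        country = row["country"] or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country.lower(), country)
        # Splitting: one address attribute becomes city and region.
        city, region = [part.strip() for part in row["address"].split(",", 1)]
        out.append({
            "id": row["id"],
            # Joining: first and last name become one attribute.
            "full_name": f'{row["first"]} {row["last"]}',
            "country": country,
            "city": city,
            "region": region,
            # Filtering: "notes" is not copied, so it never reaches the warehouse.
        })
    # Sorting: order tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])

transformed = transform(staged)
```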
Load
• The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse.
• In some systems the warehouse is updated very frequently; in others, loads
run at longer but regular intervals.
• The rate and period of loading depend solely on the requirements and vary from
system to system.
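A minimal sketch of the load step, again using an in-memory SQLite database as a stand-in for the warehouse. The table `dim_customer` and the `load` function are hypothetical. Each call represents one load run; how often it is called (every few minutes, nightly, weekly) is the scheduling decision described above.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, full_name TEXT, country TEXT)"
)

def load(conn, rows):
    # "with conn" commits the batch on success and rolls it back on error.
    with conn:
        conn.executemany(
            "INSERT INTO dim_customer (id, full_name, country) "
            "VALUES (:id, :full_name, :country)",
            rows,
        )
    # Return the total row count after this load run.
    return conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]

# Two separate load runs, each adding the rows transformed since the last one.
first_count = load(warehouse, [{"id": 1, "full_name": "Ada Lovelace", "country": "UNKNOWN"}])
second_count = load(warehouse, [{"id": 2, "full_name": "Alan Turing", "country": "USA"}])
```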
ETL Tools
The most commonly used ETL tools are:
1. Hevo
2. Sybase
3. Oracle Warehouse Builder
4. CloverETL
5. MarkLogic
Advantages of the ETL process in data warehousing:
1. Improved data quality: The ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
2. Better data integration: The ETL process helps to integrate data from
multiple sources and systems, making it more accessible and usable.
3. Increased data security: The ETL process can help to improve data
security by controlling access to the data warehouse and ensuring that
only authorized users can access the data.
4. Improved scalability: The ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and
simplify the ETL process, reducing the time and effort required to load
and update data in the warehouse.
Disadvantages of the ETL process in data warehousing:
1. High cost: The ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
2. Complexity: The ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or
resources.
3. Limited flexibility: The ETL process can be limited in terms of flexibility,
as it may not be able to handle unstructured data or real-time data
streams.
4. Limited scalability: The ETL process can be limited in terms of scalability, as
it may not be able to handle very large amounts of data.
5. Data privacy concerns: The ETL process can raise concerns about data
privacy, as large amounts of data are collected, stored, and analyzed.