
1.1 Basic Concepts & Architecture

Uploaded by

hareeeee14

SRI KRISHNA COLLEGE OF ENGINEERING AND TECHNOLOGY

M.Tech. Computer Science and Engineering


21CSI501 DATA WAREHOUSING AND MINING

MODULE 1

1.1 BASIC CONCEPTS & ARCHITECTURE

Faculty - Dr. D. Prabha
[Figure: Data and Information]
Data Warehouse?
Single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
Data warehouse - Database which stores analytical data for business decisions
Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” [W. H. Inmon]
• A decision support database that is maintained separately from the organization’s operational database
• Supports information processing by providing a solid platform of consolidated, historical data for analysis
• Data warehousing - The process of constructing and using data warehouses
Data Warehouse: Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  • relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., Hotel price: currency, tax
  • When data is moved to the warehouse, it is converted.
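
The hotel price example can be sketched in a few lines of Python. This is only an illustration: the exchange rates, the 10% tax rate, and the names `HotelPrice` and `to_warehouse_price` are assumptions made up for the sketch, not part of any real system.

```python
from dataclasses import dataclass

# Hypothetical fixed exchange rates to USD, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}


@dataclass
class HotelPrice:
    hotel: str
    amount: float        # price as reported by the source
    currency: str        # currency code used by the source
    tax_included: bool   # whether the source price already includes tax


def to_warehouse_price(p: HotelPrice, tax_rate: float = 0.10) -> float:
    """Convert a source price to one convention: USD with tax included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + tax_rate   # assumed flat tax rate
    return round(usd, 2)


# Two sources report the same room with different conventions.
sources = [
    HotelPrice("Hotel A", 100.0, "USD", tax_included=False),
    HotelPrice("Hotel A", 100.0, "EUR", tax_included=True),
]
unified = [to_warehouse_price(p) for p in sources]
```

After conversion both records use the same currency and tax convention, so they can be compared directly in the warehouse.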
Data Warehouse: Time-Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  • Operational database: current value data
  • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  • Contains an element of time (explicitly or implicitly)
  • But the key of operational data may or may not contain a “time element”
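
A minimal sketch of the difference in key structure, assuming a hypothetical product price feed. The dictionaries `operational` and `warehouse` and the function `record_price` are illustrative names only.

```python
from datetime import date

# Operational store: one current value per key; an update overwrites the old value.
operational = {}
# Warehouse store: the key carries a time element, so earlier values are kept.
warehouse = {}


def record_price(product_id: str, price: float, as_of: date) -> None:
    operational[product_id] = price            # key: product only
    warehouse[(product_id, as_of)] = price     # key: product + date


record_price("P1", 9.99, date(2023, 1, 1))
record_price("P1", 12.49, date(2024, 1, 1))

# Historical view of one product, ordered by date.
history = sorted((d, p) for (pid, d), p in warehouse.items() if pid == "P1")
```

The operational dictionary ends up holding only the latest price, while the warehouse dictionary retains one entry per date.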
Data Warehouse: Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  • Does not require transaction processing, recovery, and concurrency control mechanisms
  • Requires only two operations in data accessing: initial loading of data and access of data
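
The two operations can be pictured as a store that offers loading and reading and nothing else. The class below is a hypothetical sketch (the name `NonvolatileStore` and its methods are assumptions), not how any particular warehouse product is built.

```python
class NonvolatileStore:
    """Hypothetical store exposing only loading and read access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Bulk load: rows are appended as tuples and never updated in place.
        self._rows.extend(tuple(r) for r in rows)

    def query(self, predicate):
        # Read-only access: returns the matching rows as a new list.
        return [r for r in self._rows if predicate(r)]


store = NonvolatileStore()
store.load([("P1", 2023, 9.99), ("P2", 2023, 4.50)])
result = store.query(lambda r: r[0] == "P1")
```

Because there is no update or delete method, nothing in this sketch needs the locking or recovery logic that an operational database requires.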
Data Warehousing tools
1. Amazon Redshift
2. Amazon S3
3. Microsoft Azure
4. Google BigQuery
5. Snowflake
6. Teradata
7. Informatica PowerCenter
8. IBM InfoSphere
Data Warehouse Applications
● Retail Industry
✔ Forecasting, Market research, Merchandising etc.
● Manufacturing and distribution
✔ Sales history/trends, Market demand projections etc.
● Banks
✔ Spot market trends, Marketing, Credit cards etc.
● Insurance Companies
✔ Property and casualty fraud etc.
● Health Care Providers
✔ Fraud detection, Patient matching etc.
DW Applications [cont…]
● Government Agencies
✔ Auditing tax records, information sharing across different agencies etc.
● Internet Companies
✔ Analyzing shopping behavior, CRM etc.
● Telecommunications
✔ Telemarketing, Product development etc.
● Sports
✔ Analyzing strategies, Winning player combinations etc.
Data Warehouse Sizes (usage)
Construction of a data warehouse requires cleaning, integration, and consolidation.

Data Source → Data Extraction → Data Warehouse → OLAP Tools → Data Mining

DATA WAREHOUSING AND DATA MINING


Why a Separate Data Warehouse?
• DBMS - tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse - tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
• Different functions and different data:
  • missing data: decision support requires historical data which operational DBs do not typically maintain
  • data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
OLTP Vs OLAP
Feature                     OLTP                                  OLAP
Characteristic              operational processing                informational processing
Orientation                 transaction                           analysis
User                        clerk, DBA, database professional     knowledge worker (e.g., manager, executive, analyst)
Function                    day-to-day operations                 long-term informational requirements, decision support
DB design                   ER-based, application-oriented        star/snowflake, subject-oriented
Data                        current, guaranteed up-to-date        historic, accuracy maintained over time
Summarization               primitive, highly detailed            summarized, consolidated
View                        detailed, flat relational             summarized, multidimensional
Unit of work                short, simple transaction             complex query
Access                      read/write                            mostly read
Focus                       data in                               information out
Operations                  index/hash on primary key             lots of scans
Number of records accessed  tens                                  millions
Number of users             thousands                             hundreds
DB size                     GB to high-order GB                   >= TB
Priority                    high performance, high availability   high flexibility, end-user autonomy
Metric                      transaction throughput                query throughput, response time
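
The contrast in "unit of work" and "access" can be illustrated with Python's built-in `sqlite3` module. This is a small sketch under stated assumptions: the `sales` table and its rows are invented, and a single in-memory database plays both roles here, whereas in practice the two workloads run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 20.0), ("north", 30.0)],
)

# OLTP-style unit of work: a short read/write transaction on one row, found by primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 5 WHERE id = ?", (1,))
one_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (1,)).fetchone()

# OLAP-style query: a read-only scan that summarizes many rows.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

The first statement touches a single record through the key, while the second reads every row and returns a consolidated result.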
Data Warehouse Architecture
There are three approaches:
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
A two-tier architecture physically separates the available data sources from the data warehouse. This architecture is not expandable and does not support a large number of end users. It also has connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used data warehouse architecture.
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture diagram.
Data Sources: operational DBs and other sources, fed in through Extract, Transform, Load, Refresh.
Bottom tier (Data Storage): data warehouse, data marts, metadata, monitor & integrator.
Middle tier (OLAP Engine): OLAP server, which serves the top tier.
Top tier (Front-End Tools): analysis, query, reports, data mining.]
Data Warehouse Architecture
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP (Relational OLAP) or the MOLAP (Multidimensional OLAP) model.
• ROLAP - an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
• MOLAP - a special-purpose server that directly implements multidimensional data and operations.
For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end user and the database.
Top Tier: The front-end client layer. It consists of the tools and APIs used to connect to the data warehouse and get data out of it.
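
To make the ROLAP idea concrete, the sketch below maps one multidimensional operation (a roll-up over chosen dimensions) to a standard relational `GROUP BY` query, using Python's built-in `sqlite3`. The `sales_fact` table, its rows, and the `roll_up` helper are assumptions for illustration; a real ROLAP server is far more elaborate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (city TEXT, country TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("Chennai", "India", 2023, 100.0),
        ("Mumbai", "India", 2023, 150.0),
        ("Paris", "France", 2023, 80.0),
    ],
)

ALLOWED_DIMENSIONS = {"city", "country", "year"}


def roll_up(conn, dimensions):
    """Translate a roll-up over the given dimensions into a GROUP BY query."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales_fact GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


# Roll up from city level to country level.
by_country = roll_up(conn, ["country", "year"])
```

Dropping the `city` dimension from the grouping is what turns city-level facts into country-level totals; the user never writes the SQL directly.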
Data Warehouse Architecture
• There are mainly 5 components of Data Warehouse Architecture: 1) Database 2) ETL Tools 3) Metadata 4) Query Tools 5) Data Marts
• There are four main categories of query tools: 1. Query and reporting tools 2. Application development tools 3. Data mining tools 4. OLAP tools
• The data sourcing, transformation, and migration tools are used for performing all the conversions and summarizations.
• In the Data Warehouse Architecture, metadata plays an important role as it specifies the source, usage, values, and features of data warehouse data.
Architecture point of view - Three Data Warehouse Models
• Enterprise warehouse
  • collects all of the information about subjects spanning the entire organization
• Data Mart
  • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  • Independent vs. dependent (sourced directly from the warehouse) data mart
• Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be materialized
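
A virtual warehouse can be pictured with a plain SQL view. The sketch below uses Python's built-in `sqlite3`; the `orders` table and view names are invented for the example. SQLite has no materialized views, so the materialized summary is emulated here as a table built from the view, which is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders (customer, amount) VALUES ('alice', 40.0), ('bob', 25.0), ('alice', 10.0);

-- Virtual warehouse: a view over the operational table, no separate copy of the data.
CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Materialized summary, emulated as a table populated from the view.
CREATE TABLE customer_totals_mat AS SELECT * FROM customer_totals;
""")

# A new operational order arrives after the summary was materialized.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 5.0)")

live = dict(conn.execute("SELECT customer, total FROM customer_totals"))
snapshot = dict(conn.execute("SELECT customer, total FROM customer_totals_mat"))
```

The view always reflects the current operational data but is recomputed on every query, while the materialized copy is cheap to read and goes stale until it is refreshed.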
Extraction, Transformation, and Loading (ETL)
• Data extraction
  • get data from multiple, heterogeneous, and external sources
• Data cleaning
  • detect errors in the data and rectify them when possible
• Data transformation
  • convert data from legacy or host format to warehouse format
• Load
  • sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
  • propagate the updates from the data sources to the warehouse
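
The steps above can be sketched as a tiny pipeline using only the Python standard library. Everything specific here is an assumption for illustration: the CSV layout, the `customer_sales` table, and the cleaning rule, which simply drops rows whose amount cannot be parsed rather than trying to rectify them.

```python
import csv
import io
import sqlite3

# Hypothetical source data in a legacy CSV format.
RAW = """id,name,amount
1, Alice ,10.5
2,Bob,not_a_number
3,carol,7
"""


def extract(text):
    # Extraction: read source records into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    # Cleaning: detect rows with an unparseable amount and drop them.
    good = []
    for r in rows:
        try:
            float(r["amount"])
        except ValueError:
            continue
        good.append(r)
    return good


def transform(rows):
    # Transformation: convert to the warehouse format (typed values, tidy names).
    return [(int(r["id"]), r["name"].strip().title(), float(r["amount"])) for r in rows]


def load(conn, rows):
    # Load: write rows and build an index; REPLACE lets a later refresh overwrite by key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_customer_name ON customer_sales(name)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(RAW))))
loaded = conn.execute("SELECT id, name, amount FROM customer_sales ORDER BY id").fetchall()
```

Running the same pipeline again on updated source data would act as a simple refresh, since rows with an existing `id` are replaced rather than duplicated.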
Metadata (data about data) Repository
• Metadata is the data that define warehouse objects. It contains:
  • Description of the structure of the data warehouse - schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents
  • Operational metadata - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  • The algorithms used for summarization
  • The mapping from operational environment to the data warehouse
  • Data related to system performance
  • Business data - business terms and definitions, ownership of data, charging policies
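
As a rough illustration, one entry in a metadata repository could be modeled as a record like the one below. The class `TableMetadata`, its fields, and the sample values are hypothetical; real repositories hold far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    name: str
    columns: dict                                  # structure: column name -> type
    source: str                                    # mapping from the operational environment
    lineage: list = field(default_factory=list)    # transformation path
    currency: str = "active"                       # active, archived, or purged
    owner: str = ""                                # business metadata: ownership of data


sales_meta = TableMetadata(
    name="sales_fact",
    columns={"city": "TEXT", "amount": "REAL"},
    source="orders_db.orders",
    lineage=["extract", "clean", "sum by city"],
    owner="marketing",
)
```

Each field corresponds to one of the metadata categories listed above: structure, mapping, operational metadata, and business data.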
