Module 1-1 Basic Concepts

The document provides an overview of data warehousing and data mining, including definitions, components, architecture, and key features of data warehouses. It distinguishes between operational databases and data warehouses, highlighting their purposes and access patterns. Additionally, it discusses the importance of data warehousing in decision-making and lists various applications and tools used in the industry.

Uploaded by

wasimrajaa

DATA WAREHOUSING AND MINING
21AD402 3/0/0/3

Introduction to Data Warehousing and Data Mining

Data Warehousing Components – Building a Data Warehouse – Data Warehouse Architecture, OLAP vs OLTP, OLAP operations - Data Warehouse vs Data Mining, Data Mining Process, Data Mining Functionalities, Data Pre-processing – Descriptive Data Summarization, Data Cleaning, Integration and Transformation, Reduction.

Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, Third Edition, Elsevier, 2012.
“DATA WAREHOUSING CONCEPTS”

Prerequisites – Database Management System


DATA AND INFORMATION

What is Data?
● A collection of facts in raw or unorganized form, such as letters, numbers, and symbols.

● Data, information, knowledge and wisdom are closely related concepts, but each has its own role in relation to the others, and each term has its own meaning.

When does data become information?
● According to a common view, data is collected and analyzed; it becomes information suitable for making decisions only once it has been analyzed in some fashion.

● The Latin word "data" is the plural of "datum".


How Much Data Does The World Generate Every Minute?
Ninety percent of the data in the world today has been created in the last two years alone. Our current output of data is roughly 2.5 quintillion bytes a day.
[Figure: data size chart]
What is a Data Warehouse?

● Statement 1: A data warehouse is an electronic system that gathers data from a wide range of sources within a company and uses the data to support management decision-making.

● Statement 2: A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad-hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

● Statement 3: A data warehouse is a system that pulls together data from many different sources within an organization for reporting and analysis. The reports created from complex queries within a data warehouse are used to make business decisions.

● Statement 4: A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant at some moment of time (Bill Inmon, 1991).
What is Data Warehousing?
● The process of constructing and using data warehouses.

● A data warehouse refers to a place where data can be stored for useful mining.
Figure: Examples of heterogeneous data

So, Data Warehouse vs. Data Warehousing vs. Database
● A database is a dump of data.
● Data warehousing is a methodology to extract the significant data that helps in making business decisions.
● A data warehouse is the database that stores analytical data for business decisions.
● A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical.
A Simple “Definition”

“A decision support database that is maintained separately from the organization’s operational database which supports information processing by providing a solid platform of consolidated, historical data for analysis”.


Key Features / Characteristics

● Subject-Oriented – A data warehouse provides a simple and concise view of a particular subject by including only the data that are useful in decision making.
● Integrated – A data warehouse is usually constructed by integrating multiple heterogeneous sources such as relational databases, flat files and online transaction records.
● Time-Variant – Data are stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, a time element.
● Non-Volatile – A data warehouse is a store of data kept separate from the operational database. It requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse—Subject-Oriented

● Organized around major subjects, such as customer, product, sales.
● Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
● Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Integrated

● Constructed by integrating multiple, heterogeneous data sources
○ relational databases, flat files, on-line transaction records
● Data cleaning and data integration techniques are applied.
○ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
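
The hotel-price case can be sketched in Python. This is only an illustration of the idea, not a real integration tool: the two source layouts, the field names, the exchange rate and the tax rate are all made-up assumptions.

```python
from dataclasses import dataclass

# Hypothetical conversion rate and tax rate, chosen only for illustration.
EUR_TO_USD = 1.10
TAX_RATE = 0.10


@dataclass(frozen=True)
class HotelPrice:
    """Unified warehouse record: price in USD, tax included, breakfast flag."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(row: dict) -> HotelPrice:
    # Source A (assumed): price in USD before tax, breakfast coded as "Y"/"N".
    return HotelPrice(
        hotel=row["hotel_name"].strip().title(),
        price_usd=round(row["price"] * (1 + TAX_RATE), 2),
        breakfast_included=row["bfast"] == "Y",
    )


def from_source_b(row: dict) -> HotelPrice:
    # Source B (assumed): price in EUR with tax included, breakfast as a bool.
    return HotelPrice(
        hotel=row["name"].strip().title(),
        price_usd=round(row["eur_gross"] * EUR_TO_USD, 2),
        breakfast_included=bool(row["breakfast"]),
    )


source_a = [{"hotel_name": "grand plaza ", "price": 100.0, "bfast": "Y"}]
source_b = [{"name": "Sea View", "eur_gross": 90.0, "breakfast": False}]

# After conversion, both sources share one naming, currency and tax convention.
warehouse = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
for record in warehouse:
    print(record)
```

Each source gets its own small converter, so adding a third source would mean adding one more function rather than changing the unified record.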
Data Warehouse—Time Variant

● The time horizon for the data warehouse is significantly longer than that of operational systems.
○ Operational database: current value data.
○ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
● Every key structure in the data warehouse
○ Contains an element of time, explicitly or implicitly
○ But the key of operational data may or may not contain a time element.
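
A small Python sketch may make the difference in key structure concrete. The account name, dates and balances are invented for illustration; the point is only that the warehouse-style key carries a time element while the operational key does not.

```python
from datetime import date

# Operational view: one current value per account, overwritten on update.
operational = {}

# Warehouse view: the key includes a time element, so history accumulates.
warehouse = {}


def record_balance(account: str, day: date, balance: float) -> None:
    operational[account] = balance          # the previous value is lost
    warehouse[(account, day)] = balance     # one entry per (account, day)


record_balance("ACC-1", date(2023, 1, 31), 500.0)
record_balance("ACC-1", date(2023, 2, 28), 650.0)
record_balance("ACC-1", date(2023, 3, 31), 420.0)

print(operational["ACC-1"])                 # only the latest balance survives

# The warehouse can still answer questions about the past.
history = sorted(
    (day, bal) for (acct, day), bal in warehouse.items() if acct == "ACC-1"
)
for day, bal in history:
    print(day.isoformat(), bal)
```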


Data Warehouse—Non-Volatile

● A physically separate store of data transformed from the operational environment.
● Operational update of data does not occur in the data warehouse environment.
○ Does not require transaction processing, recovery, and concurrency control mechanisms
○ Requires only two operations in data accessing:
■ initial loading of data and access of data.
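
The two permitted operations can be sketched as a toy Python class. The class name `NonVolatileStore` and its methods are hypothetical, introduced here only to illustrate a store that offers loading and reading but no update or delete.

```python
class NonVolatileStore:
    """Toy store that supports only loading and reading, never update or delete."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends a batch; existing rows are never modified.
        self._rows.extend(dict(r) for r in rows)

    def read(self, predicate=lambda row: True):
        # Return copies so callers cannot change stored rows in place.
        return [dict(r) for r in self._rows if predicate(r)]


store = NonVolatileStore()
store.load([{"sale_id": 1, "amount": 20.0}, {"sale_id": 2, "amount": 35.5}])

result = store.read(lambda r: r["amount"] > 25)
result[0]["amount"] = 0.0            # changes only the returned copy
print(store.read())                  # stored rows are unchanged
```

Because nothing is ever updated in place, such a store has no need for the locking and recovery machinery that an operational database requires.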


Data warehouse system is also
known by the following names:
Why do we use a Data Warehouse?
Some of the most important reasons for using a data warehouse are:

● It integrates many sources of data and helps to decrease stress on production systems.
● It optimizes data for read access and consecutive disk scans.
● It helps to protect data from source system upgrades.
● It improves data quality in source systems.
List of tools for Data Warehouse:

1. Amazon Redshift
2. Teradata
3. Oracle 12c
4. Informatica
5. IBM InfoSphere
Data Warehouse Applications
● Retail Industry
✔ Forecasting, Market research, Merchandising etc.
● Manufacturing and distribution
✔ Sales history/trends, Market demand projections etc.
● Banks
✔ Spot market trends, Marketing, Credit cards etc.
● Insurance Companies
✔ Property and casualty fraud etc.
● Health Care Providers
✔ Fraud detection, Patient matching etc.
DW Applications [cont…]
● Government Agencies
✔ Auditing tax records, information sharing across
different agencies etc.
● Internet Companies
✔ Analyzing shopping behavior, CRM etc.
● Telecommunications
✔ Telemarketing, Product development etc.
● Sports
✔ Analyzing strategies, Winning player combinations etc.
Data Warehouse Sizes
● Terabyte (10^12 bytes) - Walmart (24 TB)
● Petabyte (10^15 bytes) - Geographic Information Systems
● Exabyte (10^18 bytes) - National Medical Association
● Zettabyte (10^21 bytes) - Weather Images
● Yottabyte (10^24 bytes) - Intelligence Agency (Video)
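
The unit ladder above is simple arithmetic in powers of 1000, which a short Python helper can make explicit. The function name `human_size` is an assumption made for this sketch; it uses integer comparisons to avoid floating-point rounding at the unit boundaries.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    exponent = 0
    # Climb one unit at a time while the value still fits the next unit.
    while exponent < len(UNITS) - 1 and num_bytes >= 1000 ** (exponent + 1):
        exponent += 1
    return f"{num_bytes / 1000 ** exponent:g} {UNITS[exponent]}"


walmart = 24 * 10**12                          # the 24 TB example
daily_output = 2_500_000_000_000_000_000       # 2.5 quintillion bytes a day
per_minute = daily_output // (24 * 60)         # same volume spread over one day

print(human_size(walmart))
print(human_size(daily_output))
print(human_size(per_minute))
print(human_size(10**24))
```

Read this way, 2.5 quintillion bytes a day is 2.5 exabytes a day, or somewhat under 2 petabytes a minute.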


OLAP and OLTP
● A data warehouse is built to store a huge amount of historical data and enables fast queries over all the data, typically using Online Analytical Processing (OLAP).
● A database is made to store current transactions and allow quick access to specific transactions for ongoing business processes, commonly known as Online Transaction Processing (OLTP).
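
The two access styles can be contrasted with Python's built-in `sqlite3` module. This is a sketch under simplifying assumptions: the `orders` table and its rows are invented, and a single in-memory table stands in for what would normally be two separate systems, so only the shape of the queries is being illustrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("North", 2022, 120.0), ("North", 2023, 80.0),
     ("South", 2022, 200.0), ("South", 2023, 50.0)],
)

# OLTP-style access: a short transaction that touches one specific row.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (2,))
one_order = conn.execute(
    "SELECT region, year, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: a read-only scan that summarizes all of the history.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(totals)
conn.close()
```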
Operational Database vs. Data Warehouse

● Users and System Orientation
○ Operational DB / OLTP: Customer oriented; used to perform online transaction and query processing.
○ Data Warehouse / OLAP: Market oriented; used for data analysis and decision making by knowledge workers.
● Data Contents
○ Operational DB / OLTP: Manages current data.
○ Data Warehouse / OLAP: Manages large amounts of historical data, provides facilities for summarization, and is easier to use for decision making.
● Database Design
○ Operational DB / OLTP: Adopts an entity-relationship model and an application-oriented database design.
○ Data Warehouse / OLAP: Adopts either a star or snowflake model and a subject-oriented database design.
● View
○ Operational DB / OLTP: Focuses on current data within an organization without referring to historical data. Detailed, flat relational.
○ Data Warehouse / OLAP: Spans multiple versions of a database schema and deals with information from different organizations. Summarized, multidimensional.
● Access Patterns
○ Operational DB / OLTP: Short, atomic transactions. Requires concurrency control and recovery mechanisms. Read / write.
○ Data Warehouse / OLAP: Mostly read-only operations.
● Priority Metric
○ Operational DB / OLTP: High performance and high availability.
○ Data Warehouse / OLAP: High flexibility, end-user autonomy, query throughput, response time.
● Database Size
○ Operational DB / OLTP: High-order gigabytes (GB).
○ Data Warehouse / OLAP: Greater than terabytes (TB).
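
The star model mentioned under Database Design can be sketched with `sqlite3`: one fact table of measures surrounded by dimension tables. The table names (`fact_sales`, `dim_product`, `dim_date`) and all rows are hypothetical, chosen only to show a summarized, multidimensional query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Grocery'), (2, 'Electronics');
INSERT INTO dim_date    VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 10, 50.0), (1, 2, 20, 100.0),
                               (2, 1, 1, 300.0), (2, 2, 2, 600.0);
""")

# Summarized, multidimensional view: revenue by category and quarter.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Dropping `d.quarter` from the `SELECT` and `GROUP BY` clauses would aggregate to category level only, which is the kind of summarization the OLAP column describes.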
Data Mining
● Data mining aims to enable business organizations to view business behaviours, trends, and relationships that allow the business to make data-driven decisions.
● It is also known as Knowledge Discovery in Databases (KDD).
● Data mining tools utilize AI, statistics, databases, and machine learning systems to discover relationships in the data.
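
As one very small example of discovering a relationship in data, the Python sketch below counts which pairs of items occur together in shopping baskets. The baskets are made up, and pair counting is just one simple technique picked for illustration; it is not presented as the method these slides have in mind.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so each pair is counted under one canonical ordering.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of baskets that contain both items of the pair.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}

best_pair, best_support = max(support.items(), key=lambda kv: kv[1])
print(best_pair, best_support)
```

Here bread and milk appear together in three of the four baskets, the kind of pattern a retailer might act on.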
DW and DM
● Data warehousing refers to the process of compiling and organizing data into one common database, whereas data mining refers to the process of extracting useful information from that data.
