Data Warehousing and Mining Unit 1
DATA WAREHOUSING AND ITS KEY FEATURES:
Data warehousing is a centralized repository for storing and managing large amounts of data from
various sources for analysis and reporting. It involves transforming and integrating data into a unified,
organized, and consistent format. The key features of data warehousing include:
• Subject-Oriented: Data warehouses focus on specific themes like sales, marketing, or distribution,
providing information about a particular subject rather than overall operations.
• Integrated: Data warehouses integrate data from different sources into a reliable format, ensuring consistency in naming conventions, format, and codes.
• Time-Variant: Data in a warehouse is maintained over different time intervals, allowing for historical analysis and comparisons over time.
• Non-Volatile: Data in a warehouse is permanent and does not change, preserving historical information for analysis and decision-making.
COMPONENTS OF DATA WAREHOUSE
1. Operational Source: Data source for the warehouse, such as operational databases or external data.
2. Load Manager: Responsible for extracting and loading data into the warehouse, including data transformation.
3. Warehouse Manager: Manages warehouse processes such as data analysis, aggregation, backup, and denormalization.
4. Query Manager: Manages user queries within the data warehouse system.
5. ETL Tools: Extract data from various sources, transform it to fit the warehouse schema, and load it into the warehouse (a minimal sketch follows this list).
6. Central Database: Stores all business data in the warehouse, making it easier for reporting and analysis.
7. Access Tools: Enable users to access and interact with the data stored in the warehouse for analysis and reporting.
8. Metadata: Data about the data stored in the warehouse, used for extraction, loading processes, warehouse management, and query management.
9. Data Staging Area: Prepares extracted data for storage in the warehouse by cleaning, transforming, and standardizing it.
10. Information Delivery Component: Enables users to access and subscribe to data from the warehouse for analysis and reporting.
11. Data Marts: Subsets of corporate-wide data tailored for specific user groups or subjects, providing focused data for analysis.
12. Management and Control Component: Coordinates services within the data warehouse, controlling data transformation, transfer, and delivery to clients.
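The extract-transform-load (ETL) flow handled by the load manager and ETL tools can be pictured with a minimal Python sketch. The source rows, column names, and the target "sales" table below are hypothetical assumptions for illustration; a real warehouse would extract from operational databases and load into the central database.

    import sqlite3

    # Hypothetical operational source: rows as they might arrive from an OLTP system.
    source_rows = [
        {"cust": "Asha", "item": "Laptop", "qty": 2, "unit_price": 55000.0},
        {"cust": "Ravi", "item": "Mouse",  "qty": 5, "unit_price": 450.0},
    ]

    def extract():
        """Extract: pull raw records from the operational source."""
        return list(source_rows)

    def transform(rows):
        """Transform: standardize names and derive a total_amount measure."""
        return [
            {
                "customer_name": r["cust"].upper(),
                "product_name": r["item"].upper(),
                "quantity": r["qty"],
                "total_amount": r["qty"] * r["unit_price"],
            }
            for r in rows
        ]

    def load(rows, conn):
        """Load: write the cleaned rows into the warehouse's central database."""
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales (customer_name TEXT, product_name TEXT, quantity INTEGER, total_amount REAL)"
        )
        conn.executemany(
            "INSERT INTO sales VALUES (:customer_name, :product_name, :quantity, :total_amount)",
            rows,
        )

    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    print(conn.execute("SELECT * FROM sales").fetchall())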
CONCEPT OF BUILDING A DATA WAREHOUSE
1. Business Needs: Identify the organization's needs and goals.
2. Data Sources: Identify data sources and their structure.
3. Storage: Choose physical or cloud-based servers.
4. Software: Data warehousing software processes and manages the data.
5. Labor: Backend developers, architects, analysts, and managers.
6. Implementation: Use the data warehouse for data analytics.
7. Benefits:
• Save time.
• Boost confidence in data.
• Increase insights.
• Improve security.
STEPS INVOLVED IN MAPPING THE DATA WAREHOUSE TO A MULTIPROCESSOR ARCHITECTURE
1. Relational database technology for the data warehouse: This involves understanding the basics of relational databases and their role in data warehousing.
2. Types of parallelism: There are two types of parallelism - inter-query parallelism and intra-query parallelism. Inter-query parallelism handles multiple requests at the same time, while intra-query parallelism decomposes a serial SQL query into lower-level operations and executes them concurrently.
3. Data partitioning: This is a key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently, with options such as random data striping, round-robin partitioning, hash partitioning, key-range partitioning, schema partitioning, and user-defined partitioning (a short sketch follows this list).
4. Database architectures for parallel processing: There are three DBMS software architecture styles for parallel processing - shared-memory (shared-everything) architecture, shared-disk architecture, and shared-nothing architecture.
5. Parallel DBMS features: These include optimizer implementation, application transparency, parallel environment, and DBMS management tools.
6. Alternative technologies: These include advanced database indexing products, multidimensional databases, and specialized RDBMS.
7. Parallel DBMS vendors: These include Oracle, Informix, IBM, and SYBASE, each with their own architecture, data partitioning, and parallel operations.
8. Specialized database products: These include Red Brick Systems, White Cross System, and other RDBMS products.
• The mapping process involves choosing the appropriate architecture style, partitioning the data effectively, and implementing parallel processing features to improve performance and scalability.
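As referenced in step 3 above, here is a minimal sketch of round-robin and hash partitioning of rows across parallel partitions. The sample rows, the partitioning key, and the partition count are hypothetical, chosen only to show how each strategy assigns rows.

    # Minimal sketch of two data partitioning strategies used for parallel execution.
    rows = [
        {"order_id": 101, "region": "North", "amount": 250.0},
        {"order_id": 102, "region": "South", "amount": 90.0},
        {"order_id": 103, "region": "North", "amount": 430.0},
        {"order_id": 104, "region": "East",  "amount": 120.0},
    ]

    def round_robin_partition(rows, n_partitions):
        """Assign row i to partition i mod n, spreading rows evenly."""
        partitions = [[] for _ in range(n_partitions)]
        for i, row in enumerate(rows):
            partitions[i % n_partitions].append(row)
        return partitions

    def hash_partition(rows, n_partitions, key):
        """Assign each row to a partition based on a hash of its key column."""
        partitions = [[] for _ in range(n_partitions)]
        for row in rows:
            partitions[hash(row[key]) % n_partitions].append(row)
        return partitions

    print(round_robin_partition(rows, 2))
    print(hash_partition(rows, 2, key="region"))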
STRATEGIES TO CONSIDER WHILE DESIGNING A WAREHOUSE
1. Analyze current layout and processes
2. Identify opportunities for improvement
3. Select appropriate design type
4. Test and refine the layout
5. Consider workforce
6. Optimize storage space
7. Ensure smooth traffic flow
8. Implement safety protocols
9. Utilize warehouse management software
10. Incorporate automation
11. Maintain workflow and access
DIFFERENCE BETWEEN DATA WAREHOUSE AND DATABASE
Database:
• Stores real-time information for specific applications.
• Handles daily transactions efficiently.
• Uses Online Transactional Processing (OLTP) for quick CRUD operations.
• Structured with normalized data for maximum efficiency.
• Contains current, up-to-date information.
Data warehouse:
• Gathers historical data from various sources for analysis.
• Supports complex queries for strategic decision-making.
• Uses Online Analytical Processing (OLAP) for in-depth analysis.
• Denormalized structure for faster retrieval of data.
• Stores historical data for business insights and reporting.
DIFFERENCE BETWEEN OLAP AND OLTP
OLTP (Online Transaction Processing):
• Designed for managing day-to-day transactions.
• Optimized for inserting, updating, and deleting small amounts of data quickly and efficiently.
• Normalized structure for efficient data processing.
• Typically stores current operational data.
OLAP (Online Analytical Processing):
• Designed for complex data analysis.
• Optimized for handling complex queries that involve large data sets.
• Denormalized structure for faster retrieval of data.
• Stores historical data from various databases.
• Enables in-depth data analysis across multiple dimensions.
• Supports decision-making and problem-solving.
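The contrast between the two workloads can be shown on a small, hypothetical sales table (the table name, columns, and values below are assumptions for the example): OLTP issues many small inserts and updates on individual rows, while OLAP runs one aggregate query over the accumulated history.

    import sqlite3

    # Hypothetical sales table used to contrast the two workloads.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")

    # OLTP-style work: many small, fast writes and updates of individual records.
    conn.execute("INSERT INTO sales VALUES ('2024-01-15', 'North', 250.0)")
    conn.execute("INSERT INTO sales VALUES ('2024-01-16', 'South', 90.0)")
    conn.execute("UPDATE sales SET amount = 275.0 WHERE sale_date = '2024-01-15' AND region = 'North'")

    # OLAP-style work: one analytical query aggregating across the stored history.
    for row in conn.execute(
        "SELECT region, strftime('%Y-%m', sale_date) AS month, SUM(amount) "
        "FROM sales GROUP BY region, month"
    ):
        print(row)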
8
WHAT IS METADATA
Metadata is data about data, providing information about the content, context, and structure of data. It is important for data
management, organization, and analysis, and can include descriptive, structural, and administrative metadata. Metadata
can pose privacy and security risks if left unchecked, and should be managed carefully to ensure data safety and privacy.
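One minimal way to picture the descriptive, structural, and administrative kinds of metadata mentioned above is a record attached to a single warehouse table; the table name and field values below are hypothetical.

    # Hypothetical metadata record for one warehouse table, grouped by metadata type.
    table_metadata = {
        "descriptive": {            # what the data is about
            "title": "Monthly sales facts",
            "keywords": ["sales", "revenue", "monthly"],
        },
        "structural": {             # how the data is organized
            "columns": {"month": "TEXT", "region": "TEXT", "amount": "REAL"},
            "grain": "one row per region per month",
        },
        "administrative": {         # how the data is managed
            "source_system": "orders_db",
            "last_loaded": "2024-06-01",
            "owner": "BI team",
        },
    }

    print(table_metadata["structural"]["columns"])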
IMPORTANCE OF METADATA
• Provides context and makes data unique.
• Encourages reuse and ensures interoperability.
• Supports effective data governance and improves data quality.
• Enables data discoverability and better decision-making.
• Ensures compliance and interoperability between datasets.
• Facilitates collaboration and time/efficiency savings.
WHAT IS MULTIDIMENSIONAL DATA
• Definition: Data organized into dimensions for complex analysis.
• Usage: Common in data warehousing and business intelligence.
• Features: Enables analysis across multiple dimensions.
• Supports complex queries and insights.
• Includes descriptive and structural metadata.
• Facilitates drilling down into data.
• Enhances user-friendly data analysis interfaces.
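A multidimensional view can be sketched as a measure keyed by dimension values; the dimensions (product, region, month) and the numbers below are assumptions made for the illustration.

    from collections import defaultdict

    # Hypothetical fact rows: a sales measure described by three dimensions.
    facts = [
        ("Laptop", "North", "2024-01", 110000.0),
        ("Laptop", "South", "2024-01", 55000.0),
        ("Mouse",  "North", "2024-02", 2250.0),
    ]

    # Build a simple cube: total sales for every (product, region, month) cell.
    cube = defaultdict(float)
    for product, region, month, amount in facts:
        cube[(product, region, month)] += amount

    # Analyze across a chosen dimension, e.g. roll sales up by region only.
    by_region = defaultdict(float)
    for (product, region, month), amount in cube.items():
        by_region[region] += amount

    print(dict(by_region))   # {'North': 112250.0, 'South': 55000.0}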
CONCEPT HIERARCHY
• Definition: Organizing concepts into a hierarchical structure.
• Purpose: Helps categorize and classify concepts from broader to narrower levels.
• Example: In history, organizing historical periods from ancient to modern.
• Importance: Facilitates understanding, analysis, and research by placing concepts in a structured
context.
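In a data warehouse a concept hierarchy usually appears as a dimension hierarchy; the location levels and values below are a hypothetical sketch of rolling a detailed value up to broader levels.

    # Hypothetical concept hierarchy for a location dimension: city -> state -> country.
    city_to_state = {"Lucknow": "Uttar Pradesh", "Mumbai": "Maharashtra"}
    state_to_country = {"Uttar Pradesh": "India", "Maharashtra": "India"}

    def roll_up(city):
        """Map a low-level concept (city) to its broader concepts (state, country)."""
        state = city_to_state[city]
        return {"city": city, "state": state, "country": state_to_country[state]}

    print(roll_up("Lucknow"))  # {'city': 'Lucknow', 'state': 'Uttar Pradesh', 'country': 'India'}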
WHAT IS GRANULARITY
Definition: Level of detail or resolution at which data is stored and analyzed.
Importance: Affects volume of data, efficiency of data shipping, and types of analysis.
Example: Time-series data can have granularity based on years, months, weeks, days, or hours.
Optimal Level: Usually somewhere in the middle, detailed enough for meaningful segmentation without being so fine-grained that the data volume becomes unmanageable.
Purpose: Enables detailed segmentation and targeting in marketing and software.
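The effect of granularity can be shown by summarizing the same hypothetical time-series at two levels of detail; the dates and values below are assumptions for the example.

    from collections import defaultdict

    # Hypothetical daily measurements (fine granularity).
    daily = {
        "2024-01-01": 12.0, "2024-01-02": 15.0, "2024-01-31": 9.0,
        "2024-02-01": 20.0, "2024-02-15": 5.0,
    }

    # Roll the same data up to monthly granularity (coarser, fewer rows).
    monthly = defaultdict(float)
    for date, value in daily.items():
        monthly[date[:7]] += value   # '2024-01-01' -> '2024-01'

    print(len(daily), "daily rows ->", len(monthly), "monthly rows")
    print(dict(monthly))             # {'2024-01': 36.0, '2024-02': 25.0}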
WHAT IS PARTITIONING
Definition: Dividing a large database or table into smaller, more manageable pieces called partitions.
Purpose: Improving performance, manageability, availability, or load balancing.
Types: Horizontal and vertical partitioning.
Methods: Range, list, composite, and hash partitioning.
Key Feature: Scalability, manageability, and performance in data warehousing.
Transparency: Partitioning can be kept transparent to applications, so SQL statements do not need to reference individual partitions.
Benefits: Improves availability and security of distributed database management systems, allowing local transactions to be performed on individual partitions.
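To complement the hash and round-robin sketches given earlier, here is a minimal sketch of horizontal range partitioning; the sample rows, the date key, and the partition boundaries are hypothetical.

    import bisect

    # Hypothetical rows of a large sales table, to be split horizontally by date.
    rows = [
        {"sale_date": "2023-05-10", "amount": 120.0},
        {"sale_date": "2024-02-01", "amount": 300.0},
        {"sale_date": "2024-11-20", "amount": 75.0},
    ]

    # Range partitioning: each partition holds rows whose key falls in one range.
    boundaries = ["2024-01-01", "2025-01-01"]   # partition 0: before 2024, 1: year 2024, 2: later
    partitions = [[] for _ in range(len(boundaries) + 1)]

    for row in rows:
        idx = bisect.bisect_right(boundaries, row["sale_date"])
        partitions[idx].append(row)

    for i, part in enumerate(partitions):
        print("partition", i, part)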
WHAT IS STAR SCHEMA
• Definition: A multi-dimensional data model used to organize data in a database for easy understanding
and analysis.
• Components: Fact table (central) and dimension tables (connected to fact table).
• Purpose: Optimized for querying large data sets, improving analytical query performance, and
simplifying queries.
• Benefits: Improved query performance, fast aggregations, simplified business reporting logic, and
feeding cubes.
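A minimal star schema can be sketched with one central fact table and two dimension tables; the table and column names below are hypothetical examples, not a fixed standard.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Dimension tables: descriptive attributes, denormalized, keyed by surrogate keys.
    conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER)")
    conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")

    # Central fact table: numeric measures plus foreign keys to each dimension.
    conn.execute("""
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            amount REAL
        )
    """)

    conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-15', 'January', 2024)")
    conn.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics')")
    conn.execute("INSERT INTO fact_sales VALUES (1, 1, 2, 110000.0)")

    # A typical analytical query: join the fact table to its dimensions and aggregate.
    print(conn.execute("""
        SELECT d.year, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.year, p.category
    """).fetchall())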
SNOWFLAKE SCHEMA
• Definition: A logical arrangement of tables in a multidimensional database where dimensions are
normalized into multiple related tables.
• Structure: Centralized fact tables connected to multiple dimensions, with dimensions normalized into
multiple related tables.
• Purpose: Normalization of dimension tables by removing low cardinality attributes and forming a
hierarchical structure.
• Comparison to Star Schema: Dimensions are normalized into multiple related tables, creating a snowflake structure, unlike the denormalized dimensions in a star schema.
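Continuing the hypothetical star schema sketch above, a snowflake version normalizes one dimension into related tables; the table and column names are again assumptions for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # In a snowflake schema the product dimension is normalized: the low-cardinality
    # category attribute moves into its own table referenced by dim_product.
    conn.execute("CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT)")
    conn.execute("""
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            product_name TEXT,
            category_key INTEGER REFERENCES dim_category(category_key)
        )
    """)
    conn.execute("CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product(product_key), amount REAL)")

    conn.execute("INSERT INTO dim_category VALUES (1, 'Electronics')")
    conn.execute("INSERT INTO dim_product VALUES (1, 'Laptop', 1)")
    conn.execute("INSERT INTO fact_sales VALUES (1, 110000.0)")

    # Queries now need one extra join to reach the normalized category attribute.
    print(conn.execute("""
        SELECT c.category_name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON f.product_key = p.product_key
        JOIN dim_category c ON p.category_key = c.category_key
        GROUP BY c.category_name
    """).fetchall())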
WHAT IS FACT CONSTELLATION
• Definition: Data warehouse schema with multiple fact tables sharing dimensions.
• Also Known As: Galaxy schema.
• Structure: Multiple fact tables sharing one or more dimensions.
• Purpose: Allows for complex relationships between data.
• Challenges: More difficult to implement and maintain due to complexity.
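A fact constellation can be sketched as two fact tables sharing the same dimension table; the table names below are hypothetical and extend the earlier star schema sketch.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # One shared dimension...
    conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER)")

    # ...referenced by two separate fact tables, forming a galaxy/constellation.
    conn.execute("CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date(date_key), amount REAL)")
    conn.execute("CREATE TABLE fact_shipping (date_key INTEGER REFERENCES dim_date(date_key), shipping_cost REAL)")

    conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-15', 2024)")
    conn.execute("INSERT INTO fact_sales VALUES (1, 110000.0)")
    conn.execute("INSERT INTO fact_shipping VALUES (1, 750.0)")

    # Both business processes can be analyzed over the same shared dimension.
    print(conn.execute(
        "SELECT d.year, SUM(s.amount) FROM fact_sales s "
        "JOIN dim_date d ON s.date_key = d.date_key GROUP BY d.year"
    ).fetchall())
    print(conn.execute(
        "SELECT d.year, SUM(sh.shipping_cost) FROM fact_shipping sh "
        "JOIN dim_date d ON sh.date_key = d.date_key GROUP BY d.year"
    ).fetchall())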