Introduction To Data Warehousing and Data Mining
Basic Definition
A data warehouse is a central storage system that brings together data from many different source
systems so it can be used for making business decisions. Unlike operational databases that handle
day-to-day transactions, data warehouses are built for analyzing information.
Combined data (integrated): Brings together data from different sources in a consistent format
Includes history (time-variant): Keeps data from many years, not just current information
Stable data (non-volatile): Data is added and read but rarely changed or deleted
2. Main Parts:
Data sources: Company databases, outside information, and flat files
ETL tools: Programs that Extract data, Transform it into a consistent format, and Load it into the warehouse
Metadata (data about the data): Information that describes what's in the warehouse
Analysis tools: Software for exploring the data and making reports
Types of OLAP: Some work with cubes of data (MOLAP), some with regular relational databases (ROLAP), some use both (HOLAP)
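The ETL step listed under Main Parts can be sketched in a few lines. This is only an illustration, not a real ETL tool: the two source formats, the field names, and the `sales` table are all made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records in two different formats (assumed for illustration).
crm_rows = [{"customer": "Ann", "amount_usd": "12.50", "date": "2024-01-05"}]
shop_rows = [("BOB", 800, "05/01/2024")]  # name, amount in cents, DD/MM/YYYY


def transform_crm(row):
    # Extracted CRM rows already use ISO dates; normalize name and amount type.
    return (row["customer"].lower(), float(row["amount_usd"]), row["date"])


def transform_shop(row):
    # Convert cents to dollars and DD/MM/YYYY to YYYY-MM-DD.
    name, cents, date = row
    day, month, year = date.split("/")
    return (name.lower(), cents / 100, f"{year}-{month}-{day}")


def run_etl(conn):
    # Load: write all transformed rows into one table with one format.
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
    rows = [transform_crm(r) for r in crm_rows] + [transform_shop(r) for r in shop_rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
loaded = run_etl(conn)
```

After the load, both sources share the same name casing, currency unit, and date format, which is what "matching formats" means in practice.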
Basic Ideas
Item group (itemset): A collection of one or more things (like products)
Strength (lift): How much more often things appear together than would be expected by chance
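A small worked example may help with the "strength" idea. The sketch below assumes that strength here means the measure usually called lift: the share of transactions containing both groups, divided by what that share would be if the two groups were independent. The shopping baskets are invented for the example.

```python
# Made-up transactions, each a set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    # Fraction of transactions that contain every item in the item group.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def lift(a, b, transactions):
    # Above 1: together more often than chance. Below 1: less often.
    both = support(set(a) | set(b), transactions)
    return both / (support(a, transactions) * support(b, transactions))


bread_milk_lift = lift({"bread"}, {"milk"}, transactions)
```

Here bread and milk each appear in 3 of 4 baskets and together in 2 of 4, so the lift is 0.5 / (0.75 × 0.75), a little under 1.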
How it works:
Find common single items
Combine them to find possible pairs
Problems: Needs many passes over the data and creates a large number of candidate groups
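The level-by-level steps above match the approach commonly known as Apriori. The following is a minimal sketch under that assumption, not a tuned implementation: the function name and the use of an absolute count for `min_support` are choices made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: count} for every item group with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: find common single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Combine common (k-1)-groups into candidate k-groups; keep a candidate
        # only if all of its (k-1)-subsets were common (the pruning step).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One more full pass over the data per level: this is the costly part.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_groups = apriori(baskets, min_support=3)
```

Each loop iteration rescans every transaction, which is where the "many passes" problem comes from.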
Use the tree to find patterns without checking the whole database again
3. Vertical Methods
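Eclat: Stores, for each item, the list of transactions that contain it, and intersects these lists to count groups
Diffset: Saves space by storing only the differences between transaction lists instead of the full lists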
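The vertical idea can be sketched as follows. This is a simplified illustration: the function names are made up, `min_support` is an absolute count, and the diffset is shown only as a single calculation rather than as a full diffset-based miner.

```python
def vertical_format(transactions):
    # Map each item to the set of transaction ids (tids) that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, items, min_support, out):
    """items: list of (item, tidset) pairs that can extend prefix."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_support:
            continue
        new_prefix = prefix + (item,)
        out[new_prefix] = len(tids)  # support is just the tidset size
        # Intersect tidsets to get the support of longer groups.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(new_prefix, suffix, min_support, out)
    return out


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
tidsets = vertical_format(baskets)
frequent_groups = eclat((), sorted(tidsets.items()), 3, {})

# Diffset idea: keep only the tids lost when extending {a} to {a, b}.
diffset_ab = tidsets["a"] - (tidsets["a"] & tidsets["b"])
support_ab = len(tidsets["a"]) - len(diffset_ab)
```

No transaction is rescanned after the vertical format is built; all counting happens through set intersections, and the diffset gives the same support from a smaller set.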
Finding Correlations
Different ways to measure:
Correlation coefficient: Shows whether things tend to happen together (positive) or in opposition (negative)
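As an illustration, the correlation measure can be computed on 0/1 columns that say whether each item is in each transaction. The sketch assumes the Pearson coefficient is the measure meant here; the tea and coffee columns are invented data.

```python
from math import sqrt


def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# 1 means the item is in that transaction, 0 means it is not.
tea = [1, 1, 0, 0]
coffee_same = [1, 1, 0, 0]      # always bought with tea
coffee_opposite = [0, 0, 1, 1]  # never bought with tea

together = pearson(tea, coffee_same)
opposite = pearson(tea, coffee_opposite)
```

Values near +1 mean the items tend to appear together, values near -1 mean one tends to appear when the other does not, and values near 0 mean no clear relationship.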
Basic Ideas
Sequence: A list of things that happened in order
Pattern: A shorter sequence that shows up inside longer ones
How common (support): The percentage of all sequences that contain the pattern
2. Finding subsequences: Looking for smaller ordered parts within larger sequences
3. Counting: How many full sequences contain the pattern we're looking for
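The subsequence check and the counting step can be sketched like this. It is a simplified version that assumes each sequence is a plain list of single items; the function names and the sample sequences are made up for the example.

```python
def is_subsequence(pattern, sequence):
    # True if the pattern's items appear in the sequence in the same order,
    # possibly with other items in between.
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def sequence_support(pattern, database):
    # Fraction of full sequences that contain the pattern.
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a"],
    ["c", "a", "b", "c"],
]

support_ac = sequence_support(["a", "c"], database)
support_ba = sequence_support(["b", "a"], database)
```

Order matters: "a then c" is found in three of the four sequences, while "b then a" is found in only one.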
2. PrefixSpan Method
Approach: Grows patterns by adding to what's already been found
How it works:
Find common single-item sequences
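The pattern-growth idea can be sketched as below. This is a reduced version of PrefixSpan that assumes every sequence is a list of single items (no item groups inside a step) and uses an absolute count for `min_support`; the function names are chosen for this example.

```python
def prefixspan(database, min_support):
    """Return {pattern tuple: count of sequences containing it}."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Projection: keep only what follows the first occurrence of item.
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            mine(pattern, new_projected)  # grow the pattern further

    mine((), [list(s) for s in database])
    return results


database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a"],
    ["c", "a", "b", "c"],
]
patterns = prefixspan(database, min_support=2)
```

Each recursive call works on a smaller projected database, so longer patterns are found without generating and testing candidates against the full data.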
3. SPADE Method
Approach: Organizes data vertically to make searching easier
How it works:
Change the database to a vertical format that lists, for each item, the sequences and positions where it appears
Join these lists to find longer patterns
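The two steps above can be sketched with simple id-lists. This is only the core join idea, assuming single-item steps; full SPADE also handles item groups within a step and organizes the search more carefully. All function names here are made up for the example.

```python
def build_id_lists(database):
    # Vertical format: item -> list of (sequence id, position) pairs.
    id_lists = {}
    for sid, seq in enumerate(database):
        for pos, item in enumerate(seq):
            id_lists.setdefault(item, []).append((sid, pos))
    return id_lists


def temporal_join(left, right):
    # Id-list for "left pattern, followed later by the right item":
    # same sequence id, and the right item comes at a later position.
    joined = set()
    for sid_l, pos_l in left:
        for sid_r, pos_r in right:
            if sid_l == sid_r and pos_r > pos_l:
                joined.add((sid_r, pos_r))
    return sorted(joined)


def support(id_list):
    # Number of distinct sequences in which the pattern occurs.
    return len({sid for sid, _ in id_list})


database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a"],
    ["c", "a", "b", "c"],
]
ids = build_id_lists(database)
a_then_c = temporal_join(ids["a"], ids["c"])
a_then_b_then_c = temporal_join(temporal_join(ids["a"], ids["b"]), ids["c"])
```

Once the id-lists exist, the original sequences are never read again; every longer pattern comes from joining shorter ones.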
2. Memory Management
Divide-and-conquer: Break large problems into smaller ones
3. Parallel Processing
Data splitting: Divide data across many computers
4. Approximate Methods
Sampling: Check only part of the data to get quick results
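Sampling can be sketched as estimating how common an item group is from a random subset of the transactions. The function name, the fixed seed, and the synthetic data are assumptions for this example; the result is an estimate, not the exact value.

```python
import random


def estimate_support(itemset, transactions, sample_size, seed=0):
    # Draw a random sample without replacement and measure support on it only.
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(transactions, min(sample_size, len(transactions)))
    itemset = set(itemset)
    return sum(itemset <= t for t in sample) / len(sample)


# Synthetic data: 60% of transactions contain both "a" and "b".
transactions = [{"a", "b"}] * 600 + [{"a"}] * 400

quick_estimate = estimate_support({"a", "b"}, transactions, sample_size=100)
exact_value = estimate_support({"a", "b"}, transactions, sample_size=len(transactions))
```

The smaller the sample, the faster the check and the larger the possible error, so the sample size is a trade-off between speed and accuracy.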
Discovering relationships
3. Working together:
Mining results can help improve warehouse design
New mining needs can guide warehouse updates
This combination helps businesses turn raw data into useful insights through collecting, organizing, and
analyzing information.