Introduction To Data Warehousing and Data Mining

The document provides an overview of data warehousing and data mining, highlighting the definitions, features, and processes involved in each. Data warehouses are designed for analysis and long-term storage of data, while data mining focuses on discovering patterns and relationships within large datasets. Together, they enable businesses to derive valuable insights from organized data, guiding decision-making and strategy.

Uploaded by

Pankaj Kumar

Introduction to Data Warehousing and Data Mining

What is Data Warehousing?

Basic Definition
A data warehouse is a central storage system that collects data from many different source systems so it
can be analyzed to support business decisions. Unlike operational databases, which handle day-to-day
transactions, data warehouses are built for querying and analyzing information.

Main Features of Data Warehouses


Topic-focused (subject-oriented): Organized around main business areas like customers, products, and sales
Combined data (integrated): Brings together data from different sources and converts it to consistent formats
Includes history (time-variant): Keeps data from many years, not just current information
Stable data (non-volatile): Data is added and read but rarely changed or deleted

How Data Warehouses Are Built


1. Three-layer design:
Bottom layer: Database servers that store all the warehouse data
Middle layer: OLAP servers that organize data for analysis
Top layer: Tools that help users query the data and create reports

2. Main Parts:
Data sources: Company databases, outside information, and flat files
ETL tools: Programs that Extract data from the sources, Transform it into a consistent format, and Load it into the warehouse
Metadata (data about the data): Information that describes what's in the warehouse and where it came from
Analysis tools: Software for exploring the data and making reports
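
The Extract, Transform, Load steps can be sketched in a few lines. This is only an illustration under stated assumptions: the two source exports, their column names, the date formats, and the `sales` table are all made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two source systems. They describe the same kind
# of record but use different column names, casing, and date formats.
CRM_CSV = "customer,amount,date\nalice,10.50,2024-01-05\nbob,7.00,2024-01-06\n"
SHOP_CSV = "Name,Total,Day\nALICE,3.25,05/01/2024\n"

def extract(text):
    """Extract: read raw rows from a CSV export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform_crm(row):
    """Transform: CRM rows already use ISO dates, so only fix the types."""
    return (row["customer"].lower(), float(row["amount"]), row["date"])

def transform_shop(row):
    """Transform: lower-case the name and convert DD/MM/YYYY to ISO."""
    day, month, year = row["Day"].split("/")
    return (row["Name"].lower(), float(row["Total"]), f"{year}-{month}-{day}")

def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
load(conn, [transform_crm(r) for r in extract(CRM_CSV)])
load(conn, [transform_shop(r) for r in extract(SHOP_CSV)])

# After loading, both sources can be analyzed together.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

Because the transform step maps both exports to one format, the final query can total each customer's purchases across both systems.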

Ways to Organize Data in Warehouses


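1. Star Pattern (star schema): One central fact table holding the measurements, linked to several smaller dimension tables that describe them
2. Snowflake Pattern (snowflake schema): Like a star, but the dimension tables are split further into related sub-tables
3. Multiple Stars (fact constellation): Several fact tables sharing the same dimension tables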
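
As a rough illustration of the star pattern, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database. The table names, columns, and rows are assumptions chosen for the example, not part of the original notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes (hypothetical columns).
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);

-- Fact table: one row per sale, pointing at the dimensions.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_product  VALUES (1, 'bread', 'bakery'), (2, 'milk', 'dairy');
INSERT INTO dim_customer VALUES (1, 'alice', 'north'), (2, 'bob', 'south');
INSERT INTO fact_sales   VALUES (1, 1, 2, 4.0), (2, 1, 1, 1.5), (2, 2, 3, 4.5);
""")

# A typical analysis query: join the fact table to a dimension and aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a snowflake variant, `category` would move out of `dim_product` into its own table that `dim_product` references.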

OLAP (Online Analytical Processing)


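What users can do: Move up to more general views (roll-up), go down to details (drill-down), look at
specific parts of the data (slice and dice), and rearrange the view (pivot)

Types of OLAP: Some store data in multidimensional cubes (MOLAP), some work on regular relational databases (ROLAP), and some combine both (HOLAP)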
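
The user operations can be imitated on a tiny in-memory dataset. This is a minimal sketch in plain Python rather than a real OLAP server; the `SALES` rows and the helper names `aggregate`, `slice_rows`, and `pivot` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, revenue).
SALES = [
    (2023, "Q1", "north", "bread", 100.0),
    (2023, "Q1", "south", "bread", 80.0),
    (2023, "Q2", "north", "milk", 50.0),
    (2024, "Q1", "north", "milk", 70.0),
    (2024, "Q1", "south", "bread", 90.0),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}

def aggregate(rows, dims):
    """Sum revenue grouped by the given dimension names."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)

def slice_rows(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [r for r in rows if r[DIMS[dim]] == value]

def pivot(rows, row_dim, col_dim):
    """Pivot: arrange totals as a table of row_dim by col_dim."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(rows, [row_dim, col_dim]).items():
        table[r][c] = total
    return dict(table)

by_quarter = aggregate(SALES, ["year", "quarter"])  # detailed view
by_year = aggregate(SALES, ["year"])                # roll-up: drop the quarter level
north_by_year = aggregate(slice_rows(SALES, "region", "north"), ["year"])  # slice
region_by_product = pivot(SALES, "region", "product")                      # pivot
```

Going from `by_year` back to `by_quarter` is the drill-down direction: the same totals, broken out by one more level of detail.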

What is Data Mining?


Simple Definition
Data mining is finding useful patterns and information in large amounts of data using a mix of machine
learning, statistics, and database methods.

The Data Mining Process


1. Getting data ready: Cleaning it, bringing it together, picking what's important, and changing it into a
format suitable for mining
2. Finding patterns: Applying mining methods to spot useful information
3. Checking patterns: Making sure what's found is actually useful
4. Showing results: Creating charts and reports to explain what was discovered

Finding Patterns That Repeat Often

Basic Ideas
Item group (itemset): A collection of one or more things (like products)
How common (support): How often a group appears in the data
Common group (frequent itemset): A group whose support reaches a chosen minimum threshold
Complete group (closed itemset): A group where no larger group containing it appears just as often
Biggest common group (maximal frequent itemset): A common group that is not contained in any larger common group
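
The difference between common, complete, and biggest common groups is easier to see on a small example. The sketch below uses brute force over four made-up transactions and an assumed minimum count of 2; it is meant to illustrate the definitions, not to be an efficient miner.

```python
from itertools import combinations

# Hypothetical shopping baskets.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread"},
]
MIN_COUNT = 2  # assumed threshold for "common"

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the group."""
    return sum(1 for t in transactions if itemset <= t)

# Common groups: try every combination of items (fine for tiny data only).
items = sorted(set().union(*TRANSACTIONS))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        count = support_count(frozenset(combo), TRANSACTIONS)
        if count >= MIN_COUNT:
            frequent[frozenset(combo)] = count

# Complete (closed) groups: no larger group has the same count.
closed = {s for s, c in frequent.items()
          if not any(s < t and c == frequent[t] for t in frequent)}

# Biggest common (maximal) groups: no larger group is common at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}
```

Here all seven possible groups are common, three of them are complete ({bread}, {bread, milk}, {bread, milk, eggs}), and only {bread, milk, eggs} is a biggest common group.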

Finding Connections Between Things


What it is: Discovering when things happen together
Rule format: If you see X, you'll likely see Y (written X → Y)

Ways to measure connections:

How common (support): Chance of seeing both things together
How reliable (confidence): Chance of seeing Y when you see X
Strength (lift): How much more often things appear together than they would by chance; values above 1 mean they attract each other
Certainty (conviction): How much less often X appears without Y than it would if X and Y were unrelated
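
The four measures can be computed directly from counts. The function below is a minimal sketch using the usual textbook formulas; the name `rule_measures` and the sample baskets are assumptions, and the code assumes X occurs at least once and Y does not occur in every transaction.

```python
def rule_measures(transactions, x, y):
    """Measures for the rule X -> Y, where x and y are sets of items."""
    n = len(transactions)
    p_x = sum(1 for t in transactions if x <= t) / n
    p_y = sum(1 for t in transactions if y <= t) / n
    p_xy = sum(1 for t in transactions if (x | y) <= t) / n

    confidence = p_xy / p_x  # chance of Y given X
    lift = confidence / p_y  # compared with chance of Y overall
    # Conviction is infinite when the rule never fails.
    conviction = float("inf") if confidence == 1 else (1 - p_y) / (1 - confidence)
    return {
        "support": p_xy,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }

# Hypothetical baskets for the rule bread -> milk.
BASKETS = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"eggs"}]
measures = rule_measures(BASKETS, {"bread"}, {"milk"})
```

For these baskets, bread appears in 3 of 4 transactions and bread with milk in 2 of 4, so the rule has support 0.5, confidence 2/3, lift 4/3, and conviction 1.5.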

Methods for Finding Common Patterns


1. The Apriori Method
Main idea: If a group is common, all smaller groups within it must also be common, so any group that contains an uncommon group can be skipped

How it works:
Find common single items
Combine them to find possible pairs
Check which pairs are common
Combine common pairs to make possible triplets
Continue until no more common groups are found

Problems: Needs to scan the data once for every group size, creates many possible groups
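
The level-by-level steps above can be sketched as follows. This is a compact, unoptimized version under simple assumptions: transactions are sets of hashable items, `min_count` is an absolute count, and the function name `apriori` is chosen for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all groups appearing at least min_count times."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine common (k-1)-groups into candidate k-groups.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Prune step: every (k-1)-subset must itself be common.
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One scan of the data per level to count the candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical baskets.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread"},
]
result = apriori(TRANSACTIONS, min_count=3)
```

With a minimum count of 3, eggs is dropped at level 1, so no candidate containing eggs is ever generated or counted.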

2. The FP-Growth Method


Special tool: Uses a compact prefix tree (the FP-tree) to store the transactions efficiently
How it works:
Scan the data once to find common single items
Sort the items in each transaction from most to least common
Build the tree by scanning the data one more time
Use the tree to find patterns without scanning the whole database again

Benefits: Usually faster than Apriori, needs only two scans of the data, and does not generate candidate groups
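
A simplified version of the tree-based approach is sketched below. It is an illustration under assumptions, not a tuned implementation: the names `Node`, `build_tree`, and `fp_growth` are chosen for the example, items are assumed to be sortable (strings here), and header links are kept as plain lists.

```python
from collections import defaultdict

class Node:
    """One node of the prefix tree: an item, a count, and links."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build a tree from (items, weight) pairs; return (header, item_counts)."""
    item_counts = defaultdict(int)
    for items, weight in weighted_transactions:      # first pass: count items
        for item in set(items):
            item_counts[item] += weight
    item_counts = {i: c for i, c in item_counts.items() if c >= min_count}

    root = Node(None, None)
    header = defaultdict(list)                       # item -> its tree nodes
    for items, weight in weighted_transactions:      # second pass: insert paths
        ordered = sorted((i for i in set(items) if i in item_counts),
                         key=lambda i: (-item_counts[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, item_counts

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {itemset: count} by recursively mining conditional trees."""
    header, item_counts = build_tree(weighted_transactions, min_count)
    patterns = {}
    for item, count in item_counts.items():
        pattern = suffix | {item}
        patterns[pattern] = count
        # Conditional pattern base: the path above each node holding `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        patterns.update(fp_growth(conditional, min_count, pattern))
    return patterns

# Hypothetical baskets, each with weight 1.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread"},
]
result = fp_growth([(t, 1) for t in TRANSACTIONS], min_count=2)
```

The original data is read only inside the top-level `build_tree` call; every recursive call works on paths taken from the tree, which is where the saving over Apriori comes from.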

3. Vertical Methods
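Eclat: Stores, for each item, the list of transactions that contain it, and finds larger groups by intersecting these lists
Diffset (dEclat): Saves space by storing only the difference between a group's transaction list and that of the smaller group it was built from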
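
Both vertical ideas fit in a short sketch. The helper names `vertical` and `eclat` and the sample baskets are assumptions for illustration; the last lines show the diffset idea on a single pair rather than a full dEclat miner.

```python
def vertical(transactions):
    """Vertical format: map each item to the set of transaction ids holding it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def eclat(prefix, items, min_count, out):
    """Depth-first search; `items` is a list of (item, tidset) pairs, all common."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix | {item}
        out[itemset] = len(tids)  # support = size of the id set
        extensions = []
        for other, other_tids in items[i + 1:]:
            shared = tids & other_tids  # transactions containing both
            if len(shared) >= min_count:
                extensions.append((other, shared))
        eclat(itemset, extensions, min_count, out)

# Hypothetical baskets; transaction ids are the list positions 0..3.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread"},
]
MIN_COUNT = 2
tids = vertical(TRANSACTIONS)
start = sorted((i, t) for i, t in tids.items() if len(t) >= MIN_COUNT)
result = {}
eclat(frozenset(), start, MIN_COUNT, result)

# Diffset idea: store what is lost instead of what is kept.
diffset = tids["bread"] - tids["milk"]  # bread but not milk
support_bread_milk = len(tids["bread"]) - len(diffset)
```

The diffset for {bread, milk} holds one transaction id, while the intersection holds three; on dense data this difference is what saves memory.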

Finding Correlations
Different ways to measure:
Correlation number: Shows if things tend to happen together (positive) or tend to avoid each other (negative)
Chi-square test: Checks if things are connected or if their co-occurrence is what chance alone would produce
Similarity: Measures how alike two groups are
Confidence: How strongly items are connected
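
The notes do not name specific formulas, so the sketch below assumes four commonly used ones: lift for the correlation number, the chi-square statistic on a 2x2 table, cosine for similarity, and all-confidence for the confidence-based measure. The function name and the sample counts are also assumptions, and the code requires 0 < n_a < n and 0 < n_b < n.

```python
import math

def correlation_measures(n, n_a, n_b, n_ab):
    """n transactions in total; n_a contain A, n_b contain B, n_ab contain both."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    lift = p_ab / (p_a * p_b)  # above 1: together, below 1: apart

    # Observed counts for the four cells (A present?, B present?).
    observed = {
        (1, 1): n_ab,
        (1, 0): n_a - n_ab,
        (0, 1): n_b - n_ab,
        (0, 0): n - n_a - n_b + n_ab,
    }
    chi2 = 0.0
    for (a, b), obs in observed.items():
        # Expected count if A and B were independent.
        expected = (n_a if a else n - n_a) * (n_b if b else n - n_b) / n
        chi2 += (obs - expected) ** 2 / expected

    cosine = n_ab / math.sqrt(n_a * n_b)
    all_confidence = n_ab / max(n_a, n_b)
    return {"lift": lift, "chi2": chi2, "cosine": cosine,
            "all_confidence": all_confidence}

# Hypothetical counts: 100 baskets, 50 with A, 40 with B, 30 with both.
scores = correlation_measures(n=100, n_a=50, n_b=40, n_ab=30)
```

With these counts, 20 joint occurrences would be expected under independence but 30 are observed, so lift is 1.5 and the chi-square statistic is 50/3.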

Finding Patterns in Sequences

Basic Ideas
Sequence: A list of things that happened in order
Pattern (subsequence): A shorter sequence whose items show up in the same order inside longer ones, not necessarily next to each other
How common (support): The percentage of all sequences that contain the pattern

Working with Sequences


1. How to write sequences: <(bread,milk)(eggs)(cheese,butter)> where each pair of parentheses is one
shopping trip
2. Finding subsequences: Looking for smaller ordered parts within larger sequences
3. Counting: How many full sequences contain the pattern we're looking for, counting each sequence once
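
The containment test and the counting step can be sketched like this. Sequences are represented as lists of sets, one set per shopping trip; the function names and the three sample sequences are assumptions for the example.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Each element of the pattern must be a subset of a different element of
    the sequence, and the matches must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily take the earliest trip that covers this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(database, pattern):
    """Share of sequences containing the pattern; each sequence counts once."""
    return sum(1 for s in database if contains(s, pattern)) / len(database)

# <(bread,milk)(eggs)(cheese,butter)> and two more hypothetical customers.
S1 = [{"bread", "milk"}, {"eggs"}, {"cheese", "butter"}]
S2 = [{"bread"}, {"cheese"}]
S3 = [{"milk"}, {"bread"}, {"eggs"}]
DATABASE = [S1, S2, S3]

bread_then_cheese = support(DATABASE, [{"bread"}, {"cheese"}])
```

Here `<(bread)(cheese)>` is contained in the first two sequences but not the third, so its support is 2/3. Order matters: `<(bread)(milk)>` is not contained in S3, because milk is bought before bread there.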

Rules for Finding Sequence Patterns


Time rules: How close or far apart events may be
Item rules: Which items can or can't be included
Length rules: How long patterns can be
Pattern rules: Patterns must follow a given format (template)

Methods for Finding Sequence Patterns


1. GSP Method
Approach: Similar to Apriori, builds up patterns level by level
How it works:
Find common single-item sequences
Join them to create possible 2-item sequences
Scan the data to check which are common
Join common 2-item sequences to create possible 3-item sequences
Continue until no more common sequences are found

Problems: Slow with large datasets because it scans the data once per level, creates many possible sequences

2. PrefixSpan Method
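Approach: Grows patterns by extending prefixes that have already been found to be common

How it works:
Find common single-item sequences
Split the search into smaller parts, one for each common first item (prefix)
For each prefix, keep only the parts of sequences that come after it (the projected database)
Grow patterns by looking for common items only in that projected database

Benefits: More efficient, especially with long sequences, because no candidate sequences are generated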
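
The prefix-growing idea can be sketched for the simplified case where every event is a single item (no multi-item shopping trips). The function name `prefixspan`, the `min_count` parameter, and the sample sequences are assumptions for the example.

```python
def prefixspan(database, min_count):
    """Return {pattern_tuple: count} for sequences of single items."""
    results = {}

    def grow(prefix, projected):
        # Count each item once per projected sequence.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1

        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep what follows the first occurrence of the item.
            new_projected = []
            for suffix in projected:
                if item in suffix:
                    rest = suffix[suffix.index(item) + 1:]
                    if rest:
                        new_projected.append(rest)
            grow(pattern, new_projected)

    grow((), [list(s) for s in database])
    return results

# Hypothetical event sequences.
SEQUENCES = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
patterns = prefixspan(SEQUENCES, min_count=2)
```

Each recursive call only looks at the suffixes left after the current prefix, so the data being searched shrinks as patterns get longer.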

3. SPADE Method
Approach: Organizes data vertically to make searching easier

How it works:
Change the database format so each item has a list of the sequences and positions where it appears
Join these lists to find longer patterns
Search either breadth-first or depth-first

Benefits: Faster searching, fewer database scans
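
The vertical format and the list join can be sketched for single-item events. This shows only the core join, not the full search; the names `id_lists`, `temporal_join`, and `support_count` and the sample sequences are assumptions for the example.

```python
def id_lists(database):
    """Vertical format: item -> list of (sequence id, position) pairs."""
    lists = {}
    for sid, seq in enumerate(database):
        for pos, item in enumerate(seq):
            lists.setdefault(item, []).append((sid, pos))
    return lists

def temporal_join(left, right):
    """Id-list of "pattern, then later item".

    `left` holds the end positions of a pattern and `right` the positions of
    an item. An entry of `right` is kept if the same sequence has an earlier
    entry in `left`.
    """
    first_pos = {}
    for sid, pos in left:
        first_pos[sid] = min(pos, first_pos.get(sid, pos))
    return [(sid, pos) for sid, pos in right
            if sid in first_pos and pos > first_pos[sid]]

def support_count(id_list):
    """Number of distinct sequences in an id-list."""
    return len({sid for sid, _ in id_list})

# Hypothetical event sequences.
SEQUENCES = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
lists = id_lists(SEQUENCES)
a_then_b = temporal_join(lists["a"], lists["b"])
a_then_b_then_c = temporal_join(a_then_b, lists["c"])
```

After the one pass that builds `lists`, supports of longer patterns come from joining lists, without going back to the original sequences.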

4. CloSpan and BIDE


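Purpose: Find only closed sequence patterns, those not contained in a longer pattern that is just as common
Approach: Cut out patterns that don't add new information
Benefits: Creates fewer patterns but keeps all important information, since the left-out patterns and their counts can be recovered from the closed ones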
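
What counts as a pattern that adds no new information can be shown with a simple filter. Note the assumption: this sketch filters an already mined pattern set afterwards, whereas CloSpan and BIDE prune during the search. The pattern counts and function names are made up for the example.

```python
def is_subsequence(small, large):
    """True if the items of `small` appear in `large` in the same order."""
    remaining = iter(large)
    return all(item in remaining for item in small)

def closed_patterns(patterns):
    """Keep patterns with no longer pattern that contains them at equal count."""
    return {
        p: c for p, c in patterns.items()
        if not any(c == c2 and len(q) > len(p) and is_subsequence(p, q)
                   for q, c2 in patterns.items())
    }

# Hypothetical mining result: pattern -> number of sequences containing it.
PATTERNS = {
    ("a",): 4,
    ("b",): 3,
    ("c",): 3,
    ("a", "b"): 2,
    ("a", "c"): 3,
    ("b", "c"): 2,
}
closed = closed_patterns(PATTERNS)
```

Only `("c",)` is dropped here: every sequence that contains c also contains a before it, so `("a", "c")` with the same count already carries that information.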


Making Sequence Mining Work with Big Data

1. Database Projection Techniques


Splitting the database: For each pattern being grown, focus only on the part of the data that can extend it

Benefits: Uses less memory, works faster

2. Memory Management
Divide-and-conquer: Break large problems into smaller ones

Disk-based methods: Handle data too big to fit in memory

3. Parallel Processing
Data splitting: Divide data across many computers

Task splitting: Divide mining work across many computers

4. Approximate Methods
Sampling: Check only part of the data to get quick results

Top-k patterns: Find only the k most interesting patterns

Real-World Uses of Sequence Mining


Website analysis: Understanding how people browse websites

Shopping behavior: Predicting what customers will buy next

Biology: Finding patterns in DNA sequences

Computer security: Spotting unusual system behavior

Healthcare: Tracking how diseases develop over time

How Warehousing and Mining Work Together


Data warehousing and data mining work together to create a powerful system for understanding data:

1. Data warehouses provide:


Clean, organized data

Fast access to large amounts of information

Data structured for easy analysis

2. Data mining uses this data for:


Finding hidden patterns

Discovering relationships

Predicting future trends

Finding unusual events

3. Working together:
Mining results can help improve warehouse design
New mining needs can guide warehouse updates

This combination helps businesses turn raw data into useful insights by collecting, organizing, and
analyzing information.
