Bid M Course
Data Mining
Part 1
Business Intelligence
DATA → (produce) → Knowledge → (used for) → Decision Making → (trigger) → Actions → (generate) → DATA
What is BI?
Business Intelligence (BI) is
1. a broad category of TECHNOLOGIES that allow gathering,
storing, accessing and analyzing data to help business users
make better decisions and analyze business performance
through data-driven insight: understanding the past and
predicting the future.
• Data Warehousing (DW)
– On-Line Analytical Processing (OLAP)
– Data Mining (DM)
– Data Visualization (VIS)
– Decision Analysis (what-if)
– Customer Relationship Management (CRM)
– Benchmarking
– Text Mining
– Predictive Analysis (e.g., Linear Regression)
2. a broad category of APPLICATIONS, which include the activities of
• decision support systems
• query and reporting
• online analytical processing (OLAP)
• statistical analysis, forecasting, and data mining.
=> Regular database models and systems are not suitable for this
type of query. Where such queries are possible at all, they are VERY COSTLY!
Why is BI Important?
Data Presentation
(Visualisation Techniques)
|
Data Mining (KDD?)
(Knowledge Discovery)
|
Data Exploration
(OLAP, Statistical Analysis, Querying, Reporting…)
|
Data Warehouse/Data Marts (BDD)
|
Data Sources
(DBs, Files, Web, Clients, Suppliers, Documents, OLTP, etc.)
BI Layers
Reporting Analysis
Warehouse Layer
DW
Transaction DB = Operational DB
OLTP (On-Line Transactional Processing)
M.Benkhalifa
Relational Models
(Reminder)
– Examples
➲ Inventory Management (Sales )
➲ Human resources Management
➲ Library
➲ Billing…
Operational DBs: Limitations
Decision Support DBs
From 1990 onwards: a need for analysis support systems:
● Reporting: report generation: What happened? What is happening? Why did it happen? What
will happen?
● Navigation: navigating through data for analytical purposes
● Knowledge extraction: Data Mining (patterns)
Decision support queries
➲ Complex
➲ Decision makers (few users)
➲ Data aggregation levels:
• Product: Product->Type->Category
• Store: Store->Area->City->County
• Time: Day->Month->Quarter->Year
➲ Fewer, but "bigger" queries
➲ Frequent reads, infrequent updates (daily)
➲ 2-phase operation: either reading or updating
➲ Larger data volumes (collection of historical data)
➲ Simple data model (multidimensional/de-normalized)
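The aggregation levels above can be illustrated with a roll-up along the Time hierarchy. This is a minimal sketch, assuming a hypothetical `sales` table with made-up rows (table name, column names and values are not from the slides); it uses Python's built-in `sqlite3` module to run the SQL.

```python
import sqlite3

# Hypothetical sales table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2003-01-15", "Bread", 10.0),
        ("2003-01-20", "Milk", 5.0),
        ("2003-02-03", "Bread", 7.0),
        ("2004-01-09", "Milk", 4.0),
    ],
)

def total_by(level):
    """Aggregate sales at one level of the Time hierarchy (month or year)."""
    # '%Y-%m' groups days into months, '%Y' groups them into years.
    fmt = {"month": "%Y-%m", "year": "%Y"}[level]
    query = (
        "SELECT strftime(?, day) AS period, SUM(amount) "
        "FROM sales GROUP BY period ORDER BY period"
    )
    return conn.execute(query, (fmt,)).fetchall()

monthly = total_by("month")  # Day -> Month
yearly = total_by("year")    # Day -> Year
```

The same query shape serves every level of the hierarchy; only the grouping expression changes, which is why decision support queries are few but scan large volumes of data.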
Examples of decision support queries
M.Benkhalifa Advanced IT
DATA WAREHOUSE?
Solution: a new analysis environment (a data
warehouse) where data is
– Integrated (logically and physically)
– Subject oriented (versus function oriented)
– Supporting management decisions (different
organization)
– Non Volatile: Stable (data not deleted, several
versions)
– Time variant (data can always be related to time)
Subject oriented
• Organized around major subjects, such as customer, product, sales.
• Data are arranged to provide answers to questions coming from diverse functional
areas within a company (sales, marketing, finance…), as opposed to a functional or process orientation.
• Data are summarized by topic (sales, marketing, …); for each topic the DW contains specific subjects of interest: products, clients, departments,
regions…
Time Variant
Once data are periodically uploaded to the DW, all time-dependent aggregations are
recomputed (e.g., when weekly sales are updated, monthly sales are also updated).
Non volatile
When data enter the DW, they are never removed. Requires only two
Query-Driven Data Integration
[Figure: data warehouse architecture. Data sources (Operational DBs, other sources) → Extract, Transform, Load, Refresh → Data Warehouse and Data Marts, with Monitor & Integrator and Metadata → OLAP Server (Serve) → Analysis, Query, Reports, Data mining]
Facts
Definition:
• A fact is a business metric (i.e., a numerical
measurement) of the enterprise activity (e.g., sales,
profit, transactions).
• Facts should be numeric, have a value, and be additive.
Example: sales: each record represents the total sales of a
product, by store, by day.
Facts represent the subject of the desired analysis.
Fact table: relates many dimension tables.
Fact types
Event Fact (transaction)
– A fact for a business event (sales)
Snapshot Fact
– A fact for combinations of dimensions during an
interval of time (Current inventory status/period,
store, region).
• Granularity of a fact:
– What does a single fact mean?
– Level of detail: is a sale a customer transaction or an
individual item purchase?
Dimensions
Definition:
A dimension represents a single set of objects or events in the real world. Each
dimension that you identify for the data model gets implemented as a dimension table.
Dimensions are the qualifiers that make the measures of the fact table meaningful,
because they answer the what, when, and where aspects of a question :
Example 1: Date (DateKey, Date, DayOfWeek, CalMonth, CalYear, Holiday)
Measures
DW Design
Star Schema
Example of Star Schema
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, state_or_province, country)
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
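A star schema is queried by joining the fact table to the dimension tables it references. The following is a minimal sketch, assuming only two of the dimensions (time and item), a reduced set of attributes, and made-up rows; it uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (subset of the attributes, for brevity).
CREATE TABLE "time" (time_key INTEGER PRIMARY KEY, day INTEGER,
                     month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                   brand TEXT, type TEXT);
-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE sales (time_key INTEGER REFERENCES "time"(time_key),
                    item_key INTEGER REFERENCES item(item_key),
                    units_sold INTEGER, dollars_sold REAL);
-- Illustrative rows only.
INSERT INTO "time" VALUES (1, 15, 1, 'Q1', 2003), (2, 10, 4, 'Q2', 2003);
INSERT INTO item VALUES (1, 'Baguette', 'BrandA', 'Bread'),
                        (2, 'Whole milk', 'BrandB', 'Milk');
INSERT INTO sales VALUES (1, 1, 3, 6.0), (1, 2, 2, 3.0), (2, 1, 5, 10.0);
""")

# Measures aggregated per quarter and item type: one join per dimension used.
rows = conn.execute("""
SELECT t.quarter, i.type, SUM(s.units_sold), SUM(s.dollars_sold)
FROM sales s
JOIN "time" t ON s.time_key = t.time_key
JOIN item i ON s.item_key = i.item_key
GROUP BY t.quarter, i.type
ORDER BY t.quarter, i.type
""").fetchall()
```

Each dimension is exactly one join away from the fact table, which is what gives the schema its star shape.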
Snowflake Schema
Refinement of the Star Schema with normalized dimension tables.
Sales → Product → Supplier
Product (IDprod, description, color, size, ID-sup)
Supplier (ID-sup, description, type, Address)
Advantages
– Avoid redundancy
– Leads to constellations (many fact tables sharing the same dimensions)
Example of Snowflake Schema
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, state_or_province, country)
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
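In a snowflake schema, reaching an attribute of a normalized dimension takes one extra join per normalization level. A minimal sketch, assuming only the location and city tables, a single measure, and made-up rows (all values are hypothetical), using Python's built-in `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: city details live in their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT,
                   state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales (location_key INTEGER REFERENCES location(location_key),
                    dollars_sold REAL);
-- Illustrative rows only.
INSERT INTO city VALUES (1, 'CityA', 'StateA', 'CountryX'),
                        (2, 'CityB', 'StateB', 'CountryX'),
                        (3, 'CityC', 'StateC', 'CountryY');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 High St', 2),
                            (12, '3 Low St', 3);
INSERT INTO sales VALUES (10, 100.0), (11, 50.0), (12, 30.0), (10, 20.0);
""")

# Sales per country: fact -> location -> city (two joins instead of one).
by_country = conn.execute("""
SELECT c.country, SUM(s.dollars_sold)
FROM sales s
JOIN location l ON s.location_key = l.location_key
JOIN city c ON l.city_key = c.city_key
GROUP BY c.country
ORDER BY c.country
""").fetchall()
```

The country of each city is stored once instead of once per location, at the price of a longer join path.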
Example of Fact Constellation
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
Sales Fact Table (time_key, item_key, branch_key, …)
Shipping Fact Table (time_key, item_key, shipper_key, from_location, …)
Questions
1) Identify facts, dimensions and measures
2) For each fact:
– produce the attributes
– design the star or snowflake schema and write the following
SQL queries:
• Find the quantity, the total income and the discount for
each city, type of furniture and month
• Find the average quantity, income and discount for
each country, furniture material and year
• Determine the 5 best-selling furniture items during the month of May
Exercise 2
Insurance company
Exercise 3
Consider the following relational database schema of an international airport:
FLIGHT (IDF, Company, DepAirport, ArrAirport, DepTime, ArrTime)
FLYING (IDFlight, FlightDate)
AIRPORT (IDAirport, AirName, City, State)
TICKET (Number, IDFlight, FlightDate, Seat, Rate, Name, Surname, Sex)
CHECK-IN (Number, CheckInTime, LuggageNr)
Design the Data Warehouse for the analysis of the flights of the
airport:
1) Suggest facts, measures and dimensions
2) Define the fact schema:
Multidimensional Analysis
OLAP
Example: sales of supermarkets
• Facts and measures
Each sales record is a fact, and its sales value is a measure
• Dimensions
– Group correlated attributes into the same dimension
– easier for analysis tasks
• Each sales record is associated with its values of Product, Store, Time
Store Product Time Sales
Aalborg Bread 2000 57
Aalborg Milk 2000 56
Copenhagen Bread 2000 123
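The three facts above can be aggregated along any subset of the dimensions. A minimal sketch in plain Python; the `aggregate` helper and the `DIMENSIONS` tuple are illustrative names, not part of the course material.

```python
from collections import defaultdict

# The three facts from the table above: (Store, Product, Time, Sales).
facts = [
    ("Aalborg", "Bread", 2000, 57),
    ("Aalborg", "Milk", 2000, 56),
    ("Copenhagen", "Bread", 2000, 123),
]
DIMENSIONS = ("Store", "Product", "Time")

def aggregate(facts, *dims):
    """Sum the Sales measure, grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]  # row[3] is the Sales measure
    return dict(totals)

by_store = aggregate(facts, "Store")
by_product = aggregate(facts, "Product")
grand_total = aggregate(facts)  # no dimensions: a single overall total
```

Grouping by fewer dimensions gives coarser totals; grouping by none collapses the whole table into one number.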
Multidimensional OLAP
(MOLAP)
• Data stored in special multidimensional data structures
• E.g., multidimensional array on hard disk
– MOLAP data cube
– Pros
• Less storage use (“foreign keys” not stored)
• Faster query response times
– Cons
• Less scalable (so far)
• Less flexible, e.g., cube must be re-computed when design changes
• Does not reuse an existing investment (but often bundled with RDBMS)
d2 /d1 1 2 3
1 0 4 9
2 3 2 0
3 2 1 4
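The array above shows why MOLAP does not need to store keys: a cell's position is computed from its coordinates. A minimal sketch in plain Python, assuming that rows correspond to d2 and columns to d1 (one reading of the header) and that the array is laid out row by row on disk; `cell` and `offset` are illustrative names.

```python
# Rows are d2 = 1..3, columns are d1 = 1..3 (assumed orientation).
cube = [
    [0, 4, 9],
    [3, 2, 0],
    [2, 1, 4],
]

def cell(d1, d2):
    """Look up a cell by its 1-based coordinates; no keys are stored."""
    return cube[d2 - 1][d1 - 1]

def offset(d1, d2, n_d1=3):
    """Position of the cell in a flat, row-major layout."""
    return (d2 - 1) * n_d1 + (d1 - 1)

# The same data as one flat sequence, as it could be written to disk.
flat = [value for row in cube for value in row]

# Aggregating along d2 for a fixed d1 is a simple walk over the array.
total_d1_1 = sum(cell(1, d2) for d2 in (1, 2, 3))
```

Because the address is computed rather than looked up, access is fast, but changing the dimensions changes every offset, which is why the cube must be recomputed when the design changes.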
Hybrid OLAP
(HOLAP)
Other operations
• drill across:
– Accesses more than one fact table that is linked by common dimensions.
– Combines cubes that share one or more dimensions.
• drill through:
– Drills through the bottom level of a data cube down to its back-end relational tables.
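Drill across can be sketched as two separate aggregations, one per fact table, joined on a shared dimension key. This is a minimal sketch, assuming hypothetical `sales` and `shipping` fact tables that share `item_key`; the `units_shipped` column and all rows are made up for illustration. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two fact tables sharing the item dimension (illustrative rows only).
CREATE TABLE sales (item_key INTEGER, units_sold INTEGER);
CREATE TABLE shipping (item_key INTEGER, units_shipped INTEGER);
INSERT INTO sales VALUES (1, 3), (1, 2), (2, 4);
INSERT INTO shipping VALUES (1, 5), (2, 1), (2, 2);
""")

# Aggregate each fact table first, then join on the shared dimension key.
# Joining the raw fact rows directly would multiply rows and inflate the sums.
rows = conn.execute("""
SELECT s.item_key, s.sold, h.shipped
FROM (SELECT item_key, SUM(units_sold) AS sold
      FROM sales GROUP BY item_key) s
JOIN (SELECT item_key, SUM(units_shipped) AS shipped
      FROM shipping GROUP BY item_key) h
  ON s.item_key = h.item_key
ORDER BY s.item_key
""").fetchall()
```

The result places measures from both cubes side by side for each item.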
Slicing
Example: Slicing
Slicing: Example
Attendance fact table
Jill slice from the attendance table
SQL Slicing and Dicing
Exercise 1: SQL query to build the "Jack" slice?
Exercise 1: solution
select course, avg(grade)
from attendance
where student = 'Jack'
group by course, discipline, faculty
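The slice query can be tried on a small table. A minimal sketch, assuming a hypothetical `attendance` table with columns student, course, discipline, faculty and grade and made-up rows (the real attendance table is not reproduced here); an ORDER BY is added only to make the output order predictable. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attendance "
    "(student TEXT, course TEXT, discipline TEXT, faculty TEXT, grade REAL)"
)
# Illustrative rows only.
conn.executemany(
    "INSERT INTO attendance VALUES (?, ?, ?, ?, ?)",
    [
        ("Jack", "CMP370", "CS", "Science", 80),
        ("Jack", "CMP370", "CS", "Science", 90),
        ("Jack", "CMP537", "CS", "Science", 70),
        ("Jill", "CMP370", "CS", "Science", 95),
    ],
)

# Slicing fixes one dimension (student = 'Jack') and keeps the others.
jack_slice = conn.execute(
    """
    SELECT course, AVG(grade)
    FROM attendance
    WHERE student = 'Jack'
    GROUP BY course, discipline, faculty
    ORDER BY course
    """
).fetchall()
```

Jill's row is excluded by the WHERE clause, so the slice contains only Jack's averages per course.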
Dicing
Dicing refers to range selection in multiple dimensions.
(Example: select range 2-3 for dimensions 1 and 2, and range 1-2 for dimension 3.)
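The range selection can be sketched on a small three-dimensional array. This is a minimal illustration in plain Python, assuming a hypothetical 3×3×3 cube whose cell values simply encode their own coordinates (100·i + 10·j + k); `dice` is an illustrative name.

```python
def dice(cube, r1, r2, r3):
    """Keep the cells whose 1-based coordinates fall in the inclusive ranges."""
    (a1, b1), (a2, b2), (a3, b3) = r1, r2, r3
    return [
        [row[a3 - 1:b3] for row in plane[a2 - 1:b2]]
        for plane in cube[a1 - 1:b1]
    ]

# Hypothetical cube: the value at (i, j, k) is 100*i + 10*j + k.
cube = [
    [[100 * i + 10 * j + k for k in range(1, 4)] for j in range(1, 4)]
    for i in range(1, 4)
]

# Range 2-3 for dimensions 1 and 2, range 1-2 for dimension 3.
sub_cube = dice(cube, (2, 3), (2, 3), (1, 2))
```

The result is a smaller 2×2×2 cube; unlike slicing, no dimension is removed.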
Example: Dice
Dicing in SQL
Example
Exercise 1: select the courses between DB and AMIS and the students whose names start with a letter
between A and C?
Solution:
select * from attendance where course between 'CMP370' and 'CMP537' and student_name >= 'A'
and student_name < 'D'
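A range on the first letter of a name needs some care in SQL: a closed upper bound of 'C' compares whole strings, so a name such as 'Carl' sorts after 'C' and is left out. A minimal sketch, assuming a hypothetical `attendance` table with made-up rows and a half-open range (>= 'A' and < 'D') as one way to keep every name starting with A, B or C. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attendance (student_name TEXT, course TEXT)")
# Illustrative rows only.
conn.executemany(
    "INSERT INTO attendance VALUES (?, ?)",
    [
        ("Al", "CMP370"),
        ("Bob", "CMP600"),   # course outside the range
        ("Carl", "CMP537"),
        ("Cy", "CMP400"),
        ("Dana", "CMP370"),  # name outside the range
    ],
)

# Dice: a range on the course dimension AND a range on the student dimension.
diced = conn.execute(
    """
    SELECT student_name, course FROM attendance
    WHERE course BETWEEN 'CMP370' AND 'CMP537'
      AND student_name >= 'A' AND student_name < 'D'
    ORDER BY student_name
    """
).fetchall()

# For comparison: a closed range ending at 'C' drops 'Carl' and 'Cy'.
closed_range = conn.execute(
    """
    SELECT student_name, course FROM attendance
    WHERE course BETWEEN 'CMP370' AND 'CMP537'
      AND student_name BETWEEN 'A' AND 'C'
    ORDER BY student_name
    """
).fetchall()
```

The comparison is case-sensitive under SQLite's default collation, so the sketch assumes capitalized names.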
Drill Down
Slice and Dice
Attendance Table
Pivot and CrossTabs
Pivot
Example
Registrar cube:
Session × Student × Forcredit → Grade
Pivot choosing Student for x and Session for y
Jill Jack Al
Spring’03 42 76
Summer’03 87 89 20
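A pivot only changes which dimension is laid out on which axis; the cells themselves are unchanged. A minimal sketch in plain Python, assuming the Forcredit dimension is ignored and, for Spring'03, that the two grades belong to Jill and Jack (the table layout does not show which student has no grade, so that placement is an assumption); `pivot` is an illustrative name.

```python
# Cube cells as (session, student) -> grade.
cells = {
    ("Spring'03", "Jill"): 42,  # placement assumed
    ("Spring'03", "Jack"): 76,  # placement assumed
    ("Summer'03", "Jill"): 87,
    ("Summer'03", "Jack"): 89,
    ("Summer'03", "Al"): 20,
}

def pivot(cells, x_values, y_values):
    """Return a cross-tab: one row per y value, one column per x value."""
    return [[cells.get((y, x)) for x in x_values] for y in y_values]

students = ["Jill", "Jack", "Al"]
sessions = ["Spring'03", "Summer'03"]

# Student on x, Session on y, as in the example above.
table = pivot(cells, students, sessions)

# Pivoting again swaps the axes: Session on x, Student on y.
swapped_cells = {(student, session): g for (session, student), g in cells.items()}
swapped = pivot(swapped_cells, sessions, students)
```

Missing combinations appear as `None`, the equivalent of an empty cell in the cross-tab.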
Pivot, showing > 2 dimensions