Chapter 5-Business Intelligence
Business Intelligence
Study questions
Spreadsheets
combine
• Storage
• Logic
• Processing
• Display
Spreadsheets
They also keep track of things.
They are mostly associated with single-user applications. As soon as
data must be shared, the possibility of error increases.
Spreadsheets - Problems
Key
A column or group of columns that identifies a unique row in a table.
Student Number is the key of the Student table.
Every table must have a key.
Sometimes more than one column is needed to form a unique
identifier. In a table called City, for example, the key would consist of
the combination of columns (City, State).
Relationship Special Terms
Foreign keys
These are keys of a different (foreign) table than the table in which they
reside.
Relational databases
Relationships among tables are created by using foreign keys.
Relation
Formal name for a table
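The key and foreign-key terms above can be tried out in a small sketch. It uses Python's built-in sqlite3 module; the Adviser table, the column names other than Student Number, City, and State, and all sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE City (
    City  TEXT NOT NULL,
    State TEXT NOT NULL,
    PRIMARY KEY (City, State)          -- key made of two columns
);
CREATE TABLE Adviser (                 -- hypothetical table for illustration
    AdviserID   INTEGER PRIMARY KEY,
    AdviserName TEXT NOT NULL
);
CREATE TABLE Student (
    StudentNumber INTEGER PRIMARY KEY, -- single-column key
    StudentName   TEXT NOT NULL,
    AdviserID     INTEGER REFERENCES Adviser(AdviserID)  -- foreign key
);
""")


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Same city name in two states is fine: the key is the (City, State) pair.
first_city_ok = try_insert("INSERT INTO City VALUES ('Springfield', 'IL')")
second_city_ok = try_insert("INSERT INTO City VALUES ('Springfield', 'MO')")
# Repeating an existing (City, State) pair violates the key.
duplicate_city_ok = try_insert("INSERT INTO City VALUES ('Springfield', 'IL')")

conn.execute("INSERT INTO Adviser VALUES (1, 'Jones')")
# A student may point at an existing adviser...
valid_fk_ok = try_insert("INSERT INTO Student VALUES (100, 'Lee', 1)")
# ...but not at an adviser that does not exist.
dangling_fk_ok = try_insert("INSERT INTO Student VALUES (101, 'Kim', 99)")
```

The two rejected inserts show the two ideas at work: a key guarantees a unique row, and a foreign key ties a row to a row in another table.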
What Is a Database Management System (DBMS)?
Source: textbook [1], pg 142
Elements of Database Applications
Elements Functions
Source: textbook [1], pg 143
Example of a Student Report
Crow’s Feet
Source: textbook [1], pg 149
1:N: One department can have many advisers, but an adviser may be in only one department.
N:M: An adviser may have many students, and one student may have many advisers.
Sample of Relationships─Version 2
“Crow’s Foot”
Source: textbook [1], pg 149
N:M: A department has many advisors, and an advisor may advise for more than one department.
1:N: A student has only one advisor, but an advisor may advise many students.
Crow’s-Foot Diagram Version
Source: textbook [1], pg 150
Minimum cardinality: the minimum number of entities in a relationship.
A small oval means the entity is optional; the relationship need not have
an entity of that type.
How Is a Data Model Transformed into a Database Design?
• Normalization
Converting poorly structured tables into two or more well-structured tables.
• Goal
Construct tables with data about a single theme or entity.
• Purpose
To minimize data integrity problems.
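As a rough illustration of normalization, the sketch below splits one poorly structured table into two single-theme tables. The employee rows, column names, and the `normalize` helper are hypothetical and exist only for this example.

```python
# Hypothetical, poorly structured table: department data repeats on every row.
employee_rows = [
    {"EmpID": 1, "Name": "Ann", "Dept": "Accounting", "DeptPhone": "555-0100"},
    {"EmpID": 2, "Name": "Bob", "Dept": "Accounting", "DeptPhone": "555-0100"},
    {"EmpID": 3, "Name": "Cy", "Dept": "Sales", "DeptPhone": "555-0200"},
]


def normalize(rows):
    """Split rows into an Employee table and a Department table (one theme each)."""
    employees = []
    departments = {}
    for row in rows:
        # Employee theme: keep only employee facts plus the department name,
        # which acts as a foreign key into the Department table.
        employees.append(
            {"EmpID": row["EmpID"], "Name": row["Name"], "Dept": row["Dept"]}
        )
        # Department theme: one entry per department (a later row overwrites
        # an earlier one, so inconsistent duplicates collapse to the last value).
        departments[row["Dept"]] = {"Dept": row["Dept"], "DeptPhone": row["DeptPhone"]}
    return employees, list(departments.values())


employees, departments = normalize(employee_rows)
```

After the split, a department phone number is stored once, so updating it cannot leave two rows disagreeing with each other.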
Data Integrity Problems
• Data integrity problems produce incorrect and inconsistent
information, users lose confidence in information, and the system
gets a poor reputation.
• Can only occur if data are duplicated.
Poorly Designed Employee Table Causes Data Integrity Problem
Single
Themes
BI used to be everything related to the use of data for managerial decision support.
Now, it is a part of Business Analytics.
BI = Descriptive Analytics

Business Analytics
            Descriptive           Predictive            Prescriptive
Questions   What happened?        What will happen?     What should I do?
            What is happening?    Why will it happen?   Why should I do it?
Enablers    Business reporting    Data mining           Optimization
            Dashboards            Text mining           Simulation
            Scorecards            Web/media mining      Decision modeling
            Data warehousing      Forecasting           Expert systems
Outcomes
Subject oriented
Integrated
Time-variant (time series)
Nonvolatile
Summarized
Not normalized
Metadata
Web based, relational/multi-dimensional
Client/server, real-time/right-time/active...
Data Mart
Data Warehouse
One management and analytics platform
for product configuration, warranty, and
diagnostic readout data
Data warehouse framework (figure):
• Data Sources: ERP, Legacy, POS, Other OLTP/Web, External Data
• ETL Process: Select, Extract, Transform, Integrate, Load
• Enterprise Data warehouse, with Metadata and Replication
• Data Marts: Marketing, Operations, Finance, (...); also a "No data marts" option
• API / Middleware
• Applications (Visualization): Routine Business Reporting, Data/text mining, OLAP, Dashboard, Web, Custom built applications
DW Architecture
• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and
analyze data from the warehouse
• Two-tier architecture
– First two tiers in three-tier architecture are combined into one
… sometimes there is only one tier?
DW Architectures
3-tier architecture
Tier 1: Client workstation
Tier 2: Application server
Tier 3: Database server
2-tier architecture
Tier 1: Client workstation
Tier 2: Application & database server
1-tier architecture?
Data Warehousing Architectures
Components shown in the figure: Client (Web browser), Internet/Intranet/Extranet, Web Server, Web pages, Application Server, Data warehouse
Alternative DW Architectures (1 of 2)
Each alternative shows Source Systems, ETL, a Staging Area, a data store, and End user access and applications. The data store differs:
• Independent data marts (atomic/summarized data)
• Dimensionalized data marts linked by conformed dimensions (atomic/summarized data)
• Normalized relational warehouse (atomic data)
• Normalized relational warehouse (atomic/some summarized data)
ETL (Extract, Transform, Load)
Figure elements: Packaged application, Transient data source, Other internal applications, Data warehouse, Data marts
Benefits:
Requires minimal investment in infrastructure
Frees up capacity on in-house systems
Frees up cash flow
Makes powerful solutions affordable
Enables solutions that provide for growth
Offers better quality equipment and software
Provides faster connections
… more in the book
Representation of Data in DW
Dimensional Modeling
A retrieval-based system that supports high-volume query access
Star schema
The most commonly used and the simplest style of dimensional
modeling
Contains a fact table surrounded by and connected to several
dimension tables
Snowflake schema
An extension of star schema where the diagram resembles a
snowflake in shape
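A star schema can be sketched with one fact table and two dimension tables. The example uses Python's built-in sqlite3 module; the table names (FactSales, DimProduct, DimRegion), columns, and rows are assumptions chosen for illustration, not taken from the textbook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/where" of each fact.
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE DimRegion  (RegionKey  INTEGER PRIMARY KEY, RegionName  TEXT);

-- The fact table sits in the middle and points at every dimension.
CREATE TABLE FactSales (
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    RegionKey  INTEGER REFERENCES DimRegion(RegionKey),
    Units      INTEGER
);

INSERT INTO DimProduct VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO DimRegion  VALUES (1, 'East'), (2, 'West');
INSERT INTO FactSales  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (1, 1, 3);
""")

# A typical query joins the fact table to its dimensions and aggregates.
rows = conn.execute("""
    SELECT p.ProductName, r.RegionName, SUM(f.Units)
    FROM FactSales AS f
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    JOIN DimRegion  AS r ON r.RegionKey  = f.RegionKey
    GROUP BY p.ProductName, r.RegionName
    ORDER BY p.ProductName, r.RegionName
""").fetchall()
```

In a snowflake schema, a dimension such as DimRegion would itself be split into further related tables (for example, region and country).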
Multidimensionality
A 3-dimensional OLAP cube with slicing operations
Dimensions: Time, Product, Geography
• Sales volumes of a specific Product on variable Time and Region
• Sales volumes of a specific Time on variable Region and Products
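The slicing operations above can be sketched without any OLAP tool. The toy cube below is a plain dictionary keyed by (product, time, region); the sales figures and the `slice_cube` helper are made up for illustration.

```python
# Toy cube: (product, time, region) -> sales volume. All values are invented.
cube = {
    ("Widget", "Q1", "East"): 10,
    ("Widget", "Q1", "West"): 4,
    ("Widget", "Q2", "East"): 6,
    ("Gadget", "Q1", "East"): 7,
    ("Gadget", "Q2", "West"): 9,
}

DIMENSIONS = ("product", "time", "region")


def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and keep the other two dimensions."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: volume
        for key, volume in cube.items()
        if key[axis] == value
    }


# Sales volumes of a specific product on variable time and region.
widget_slice = slice_cube(cube, "product", "Widget")
# Sales volumes of a specific time on variable product and region.
q1_slice = slice_cube(cube, "time", "Q1")
```

Each slice is a 2-dimensional table cut out of the 3-dimensional cube.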
Successful DW Implementation Things to Avoid
• Sourcing…
– Web, social media, and Big Data
– Open source software
– SaaS (software as a service)
– Cloud computing
– Data lakes
• Infrastructure…
– Columnar
– Real-time DW
– Data warehouse appliances
– Data management practices/technologies
– In-database & In-memory processing
– New DBMS, Advanced analytics, …
Data Lakes
Table 3.6 A Simple Comparison between a Data Warehouse and a Data Lake
• Process Steps
1. Strategize
2. Plan
3. Monitor/analyze
4. Act/adjust
Each with its own sub-process steps
1 - Strategize: Where Do We Want to Go?
• Strategic planning
– Common tasks for the strategic planning process:
1. Conduct a current situation analysis
2. Determine the planning horizon
3. Conduct an environment scan
4. Identify critical success factors
5. Complete a gap analysis
6. Create a strategic vision
7. Develop a business strategy
8. Identify strategic objectives and goals
2 - Plan: How Do We Get There?
Operational planning
Operational plan: plan that translates an organization’s
strategic objectives and goals into a set of well-defined
tactics and initiatives, resource requirements, and expected
results for some future time period (usually a year).
Operational planning can be
Tactic-centric (operationally focused)
Budget-centric plan (financially focused)
3 - Monitor/Analyze: How Are We Doing?
Balanced Scorecard (HBR, 1992)
Four perspectives arranged around VISION & STRATEGY:
• Financial Perspective
• Customer Perspective
• Internal Business Process Perspective
• Learning and Growth Perspective
Six Sigma as a Performance Measurement System
(1 of 2)
Six Sigma
A performance management methodology aimed at reducing
the number of defects in a business process to as close to zero
defects per million opportunities (DPMO) as possible
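DPMO is usually computed as the number of defects divided by the total number of defect opportunities, scaled to one million. A minimal sketch, with an invented function name and invented sample figures:

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities.

    defects: number of defects observed
    units: number of units (or transactions) inspected
    opportunities_per_unit: ways each unit could be defective
    """
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects * 1_000_000 / total_opportunities


# Hypothetical process: 15 defects in 1,000 units with 5 opportunities each.
example_dpmo = dpmo(defects=15, units=1_000, opportunities_per_unit=5)
```

Here 15 defects over 5,000 opportunities works out to 3,000 DPMO; the Six Sigma goal is to push that number as close to zero as possible.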
Six Sigma as a Performance Measurement System
(2 of 2)
Discussion Questions
1. Why do law enforcement agencies and departments like Miami-Dade Police Department embrace advanced analytics and data mining?
2. What are the top challenges for law enforcement agencies and departments like Miami-Dade Police Department? Can you think of other challenges (not mentioned in this case) that can benefit from data mining?
Opening Vignette (3 of 3)
Data mining tasks, algorithms, and learning type (figure):
• Prediction (Supervised)
– Linear/Nonlinear Regression, ANN, Regression Trees, SVM, kNN, GA
– Time Series
• Association
– Market-basket (Unsupervised): Apriori, OneR, ZeroR, Eclat, GA
– Link analysis (Unsupervised): Expectation Maximization, Apriori Algorithm, Graph-based Matching
• Segmentation
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
– Covered in Chapter 3
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
Six-step data mining process, arranged in a cycle around Data (figure):
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
Data Mining Process: SEMMA
• Explore (Visualization and basic description of the data)
• Modify (Select variables, transform variable representations)
• Model (Use variety of statistical and machine learning models)
• Assess (Evaluate the accuracy and usefulness of the models)
• Feedback
Data Mining Process: KDD
Sources for Raw Data
1. Data Selection → Target Data
2. Data Cleaning → Preprocessed Data
3. Data Transformation → Transformed Data
4. Data Mining → Extracted Patterns
5. Knowledge (“Actionable Insight”)
Feedback
Which Data Mining Process is the Best?
My own
SEMMA
KDD Process
My organization's
Domain-specific
methodology
None
Figure elements (cancer research case): Research; Combined Cancer DB; Assess; Testing the model; Testing the variable in contributing to medical …; Model Testing Results (Accuracy, Sensitivity and Specificity); Relative Variable Importance Results
• Predictive accuracy
– Hit rate
• Speed
– Model building versus predicting/usage speed
• Robustness
• Scalability
Accuracy of Classification Models

                             True/Observed Class
                             Positive                     Negative
Predicted Class   Positive   True Positive Count (TP)     False Positive Count (FP)
                  Negative   False Negative Count (FN)    True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
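The measures above follow directly from the four confusion-matrix counts. A small sketch, where the function name and the sample counts are invented for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical results for 100 test cases.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

Note that the true positive rate and recall are the same formula under two names. The sketch assumes every denominator is nonzero; a model that never predicts the positive class, for example, would need a guard on precision.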
Estimation Methodologies for Classification: Single/Simple Split
• Leave-one-out
– Similar to k-fold where k = number of samples
• Bootstrapping
– Random sampling with replacement
• Jackknifing
– Similar to leave-one-out
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristics (a term borrowed from radar image processing)
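Two of the resampling schemes listed above, leave-one-out and bootstrapping, can be sketched as functions that return training and test indices. The function names and the choice to test a bootstrap model on the samples that were never drawn are assumptions for this illustration.

```python
import random


def leave_one_out(n_samples):
    """Yield (train_indices, test_indices); each sample is the test set once."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples training indices with replacement.

    Samples that were never drawn are returned as the test set.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


loo_splits = list(leave_one_out(4))
boot_train, boot_test = bootstrap_sample(10, seed=1)
```

Leave-one-out builds as many models as there are samples, which is why it is described as k-fold with k equal to the number of samples.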
Area Under the ROC Curve (AUC) (1 of 2)
• Produces values from 0 to 1
• Random chance is 0.5; perfect classification is 1.0
• Produces a good … skewed class …
Figure: ROC curve with both axes running from 0 to 1; area under the curve (AUC) A = 0.84
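One way to compute AUC without plotting the curve uses an equivalent definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the function name and sample data are invented.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


perfect_auc = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
chance_auc = auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
mixed_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

A model that separates the classes perfectly scores 1.0, and a model that gives every case the same score lands at 0.5, matching the random-chance value noted above.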
• Analysis methods
– Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on.
– Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
– Fuzzy logic (e.g., fuzzy c-means algorithm)
– Genetic algorithms
• How many clusters?
Cluster Analysis for Data Mining (4 of 4)
• Apriori Algorithm
– Finds subsets that are common to at least a minimum number of the itemsets
– Uses a bottom-up approach
▪ frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and
▪ groups of candidates at each level are tested against the data for minimum support (see the figure)
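The bottom-up idea described above can be sketched in a few lines. This is a simplified illustration, not an optimized implementation; the function name, the use of an absolute support count, and the sample baskets are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, grown one item at a time."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    # Level 1: frequent one-item subsets.
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        size += 1
        # Extend: join frequent subsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: every smaller subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        # Test the surviving candidates against the data for minimum support.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

With a minimum support of 3 baskets, all four single items are frequent, but only the pair {bread, milk} survives at the two-item level, so the search stops there.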
Association Rule Mining Apriori Algorithm
• Commercial
– IBM SPSS Modeler (formerly Clementine)
– SAS Enterprise Miner
– Statistica - Dell/Statsoft
– … many more
– RapidMiner

Tool usage counts (figure):
R                                         1,419
Python                                    1,325
SQL                                       1,029
Excel                                     972
Tableau                                   536
KNIME                                     521
SciKit-Learn                              497
Java                                      487
Anaconda                                  462
Hive                                      359
Mllib                                     337
Weka                                      315
Microsoft SQL Server                      314
Unix shell/awk/gawk                       301
MATLAB                                    263
IBM SPSS Statistics                       242
Dataiku                                   227
SAS base                                  225
IBM SPSS Modeler                          222
SQL on Hadoop tools                       211
C/C++                                     210
Other free analytics/data mining tools    198
Other programming and data languages      197
H2O                                       193
Scala                                     180
SAS Enterprise Miner                      162
Microsoft Power BI                        161
Hbase                                     158
QlikView                                  153
Microsoft Azure Machine Learning          147
(tool names missing)                      141, 132, 121
Legend: [Orange] Free/Open Source tools; [Green] Commercial tools; [Blue] Hadoop/Big Data tools
Application Case 4.6 (1 of 5)
Dependent Variable
Class No.   Range (in $Millions)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)
Application Case 4.6 (3 of 5)
Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                  High, Medium, Low
Sequel                 2                  Yes, No
Number of screens      1                  Positive integer
Application Case 4.6 (4 of 5)
Model Development process
Model Assessment process
Application Case 4.6 (5 of 5)
Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
Myth: Data mining is not yet viable for mainstream business applications.
Reality: The current state of the art is ready to go for almost any business type and/or size.
Myth: Data mining requires a separate, dedicated database.
Reality: Because of the advances in database technology, a dedicated database is not required.
Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
Myth: Data mining is only for large firms that have lots of customer data.