
Chapter 5-Business Intelligence

Chapter 5 discusses the fundamentals of Business Intelligence (BI) and databases, emphasizing the importance of understanding database processing, data warehousing, and data mining for business professionals. It outlines the structure and purpose of databases, the role of Database Management Systems (DBMS), and the significance of data warehouses in supporting decision-making. Additionally, it covers data integration processes, including Extract, Transform, Load (ETL), and various data warehousing architectures.

Chapter 5: Business Intelligence

Study questions

Q1 What do business professionals need to know about database & database processing?
Q2 What is a Data Warehouse?
Q3 What is Data Mining?
Q4 How to use Data for Logistics and Decision Making?
Q1 What do business professionals need to know about database & database processing?
What Is the Purpose of a Database?

 It keeps track of things.
 Spreadsheets can do this too, but they have disadvantages.

Spreadsheets combine
• Storage
• Logic
• Processing
• Display
Spreadsheets
 They also keep track of things.
 They are mostly associated with single-user applications; as soon as
you need to share data, the possibility of error increases.
Spreadsheets - Problems

Spreadsheet Used for Assignment of Sheet Music

Source: textbook [1], pg 164

Data redundancy: A lot of data is duplicated.
Inefficiency: Searching records, changing
Inconsistency: Different values in same field
Integrity: Data can disappear.
Data Shown from a Database

Student Data has multiple themes:
• student grades
• student emails
• student office visits

Source: textbook [1], pg 133

General Rule
 Lists of data involving a single theme can be stored in a spreadsheet.
 Lists that involve data with multiple themes require a database.

The purpose of a database is to keep track of things that involve more than one theme.
What is a Database?
 Database:
 A self-describing collection of integrated records
 In databases, bytes are grouped into columns, such as Student Number
and Student Name. Columns are also called fields. Columns or fields, in
turn, are grouped into rows, which are also called records.
Characters, Fields, and Records

Source: textbook [1], pg 133

Hierarchy of Data Elements

Source: textbook [1], pg 134

Structure of a Database

Source: textbook [1], pg 134

Metadata describes the structure of the database: what values are allowed, who can access it, and what can be deleted.
Relationships Among Rows

Source: textbook [1], pg 135


Relationship Special Terms

 Key
 A column or group of columns that identifies a unique row in a table.
 Student Number is the key of the Student table.
 Every table must have a key.
 Sometimes more than one column is needed to form a unique
identifier. In a table called City, for example, the key would consist of
a combination of columns (City, State).
Relationship Special Terms
 Foreign keys
 These are keys of a different (foreign) table than the table in which they
reside.
 Relational databases
 Relationships among tables are created by using foreign keys.
 Relation
 Formal name for a table
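The terms above can be tried out in a small sketch. It uses Python's built-in sqlite3 module purely for illustration; the OfficeVisit table and every column name other than Student Number, Student Name, City, and State are assumptions, not part of the textbook example.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE Student (
        StudentNumber INTEGER PRIMARY KEY,   -- key: identifies a unique row
        StudentName   TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE City (
        City  TEXT,
        State TEXT,
        PRIMARY KEY (City, State)            -- key made of a combination of columns
    )
""")
conn.execute("""
    CREATE TABLE OfficeVisit (
        VisitID       INTEGER PRIMARY KEY,
        StudentNumber INTEGER NOT NULL
                      REFERENCES Student(StudentNumber),  -- foreign key
        Notes         TEXT
    )
""")

conn.execute("INSERT INTO Student VALUES (1000, 'Franklin, Benjamin')")
conn.execute("INSERT INTO OfficeVisit VALUES (1, 1000, 'Asked about HW1')")

# A visit that refers to a student who does not exist is rejected.
try:
    conn.execute("INSERT INTO OfficeVisit VALUES (2, 9999, 'No such student')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The same city name may appear in two states, but the same
# (City, State) pair cannot be stored twice.
conn.execute("INSERT INTO City VALUES ('Springfield', 'IL')")
conn.execute("INSERT INTO City VALUES ('Springfield', 'MA')")
try:
    conn.execute("INSERT INTO City VALUES ('Springfield', 'IL')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Both rejected inserts show the DBMS, rather than the user, protecting the relationships among rows.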
What Is a Database Management System (DBMS)?

 Program used to create, process, and administer a database.
 Licensed from vendors such as IBM (DB2), Microsoft
(Access and SQL Server), Oracle (Oracle
Database), and others.
 MySQL - open source.
Processing the Database

Four DBMS operations


1. Read
2. Insert
3. Modify
4. Delete data
Processing the Database
• Structured Query Language - SQL (see-quell)
– International standard
– Used by most popular DBMS

INSERT INTO Student
([Student Number], [Student Name], HW1, HW2, MidTerm)
VALUES (1000, 'Franklin, Benjamin', 90, 95, 100);
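As a minimal sketch of the four operations (read, insert, modify, delete), the snippet below runs the corresponding SQL statements against an in-memory SQLite database from Python. SQLite is an assumption chosen only because it ships with Python; the column names follow the INSERT statement on the slide, written without spaces so they need no brackets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student (
        StudentNumber INTEGER PRIMARY KEY,
        StudentName   TEXT,
        HW1 INTEGER, HW2 INTEGER, MidTerm INTEGER
    )
""")

# Insert: add a new row.
conn.execute(
    "INSERT INTO Student (StudentNumber, StudentName, HW1, HW2, MidTerm) "
    "VALUES (?, ?, ?, ?, ?)",
    (1000, "Franklin, Benjamin", 90, 95, 100),
)

# Read: fetch data without changing it.
row = conn.execute(
    "SELECT StudentName, MidTerm FROM Student WHERE StudentNumber = ?", (1000,)
).fetchone()

# Modify: change a value in an existing row.
conn.execute("UPDATE Student SET HW1 = ? WHERE StudentNumber = ?", (92, 1000))
hw1_after_update = conn.execute(
    "SELECT HW1 FROM Student WHERE StudentNumber = 1000"
).fetchone()[0]

# Delete: remove the row.
conn.execute("DELETE FROM Student WHERE StudentNumber = ?", (1000,))
remaining = conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0]
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid quoting problems such as the comma inside the student name.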
Administering the Database
 The DBMS is used to set up a security system involving user accounts,
passwords, permissions, and limits for processing.
 Permissions can be limited in very specific ways.
 Other tasks: backing up database data, adding structures to improve
performance of database applications, removing unwanted data.
 Most organizations dedicate one or more employees to the role of
database administration.
Database Administration Tasks

Source: textbook [1], pg 142
Elements of Database Applications

Elements and their functions:

Forms: View data; insert new, update existing, and delete existing data
Reports: Structured presentation of data using sorting, grouping, filtering, and other operations
Queries: Search based upon data values provided by the user
Application Programs: Provide security, data consistency, and special purpose processing, e.g., handle out-of-stock situations
How do applications make databases more useful?

Source: textbook [1], pg 143

Example of a Student Report

Sample Query Form Used to Enter Phrase for Search

Sample Query Results of Query Operation
How Are Data Models Used for Database Development?

Components of the Entity-Relationship Data Model

Entities
• Something users want to track
• Order, customer, salesperson, item, volunteer, donation
Attributes
• Describe characteristics of an entity
• OrderNumber, CustomerNumber, VolunteerName, PhoneNumber
Identifier
• Uniquely identifies one entity instance from other instances
• Student_ID_Number
Student Data Model Entities

Source: textbook [1], pg 148


Example of Department, Adviser, and
Student Entities and Relationships

Source: textbook [1], pg 148


Sample of Relationships: Version 1

Crow's Feet

1:N: One department can have many advisers, but an adviser may be in only one department.
N:M: An adviser may have many students, and one student may have many advisers.

Source: textbook [1], pg 149

Sample of Relationships: Version 2

"Crow's Foot"

N:M: A department has many advisers, and an adviser may advise for more than one department.
1:N: A student has only one adviser, but an adviser may advise many students.

Source: textbook [1], pg 149
Crow's-Foot Diagram Version

Maximum cardinality: maximum number of entities involved in a relationship.

Minimum cardinality: minimum number of entities in a relationship. Vertical bar on
a line means that at least one entity is required. Small oval means entity
is optional; relationship need not have an entity of that type.

Source: textbook [1], pg 150
How Is a Data Model Transformed into a Database
Design?

• Normalization
 Converting poorly structured tables into two or more well-structured tables.
• Goal
 Construct tables with data about a single theme or entity.
• Purpose
 To minimize data integrity problems.
Data Integrity Problems
• Data integrity problems produce incorrect and inconsistent
information, users lose confidence in information, and the system
gets a poor reputation.
• Can only occur if data are duplicated.
Poorly Designed Employee Table Causes Data Integrity
Problem

Source: textbook [1], pg 151


Two Normalized Tables

Single Themes

Source: textbook [1], pg 151
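A rough illustration of why duplicated data invites integrity problems, and how splitting a table by theme avoids them. The table layout, names, and e-mail addresses below are made up for the sketch and are not the textbook's figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Poorly structured: the department name is repeated on every employee row.
conn.execute("CREATE TABLE EmployeeFlat (Name TEXT, Email TEXT, DeptName TEXT)")
conn.executemany(
    "INSERT INTO EmployeeFlat VALUES (?, ?, ?)",
    [
        ("Jones", "jones@example.com", "Accounting"),
        ("Smith", "smith@example.com", "Accounting"),
        ("Chau", "chau@example.com", "Info Systems"),
    ],
)
# A rename that misses one row leaves two spellings for the same department.
conn.execute(
    "UPDATE EmployeeFlat SET DeptName = 'Accounting and Finance' "
    "WHERE Name = 'Jones'"
)
flat_names = conn.execute(
    "SELECT COUNT(DISTINCT DeptName) FROM EmployeeFlat "
    "WHERE Name IN ('Jones', 'Smith')"
).fetchone()[0]

# Normalized: one table per theme, linked by a foreign key (DeptNo).
conn.execute("CREATE TABLE Department (DeptNo INTEGER PRIMARY KEY, DeptName TEXT)")
conn.execute(
    "CREATE TABLE Employee (Name TEXT, Email TEXT, "
    "DeptNo INTEGER REFERENCES Department(DeptNo))"
)
conn.executemany(
    "INSERT INTO Department VALUES (?, ?)",
    [(100, "Accounting"), (200, "Info Systems")],
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        ("Jones", "jones@example.com", 100),
        ("Smith", "smith@example.com", 100),
        ("Chau", "chau@example.com", 200),
    ],
)
# The rename now touches exactly one row, so all employees see the same value.
conn.execute(
    "UPDATE Department SET DeptName = 'Accounting and Finance' WHERE DeptNo = 100"
)
normalized_names = conn.execute(
    "SELECT COUNT(DISTINCT d.DeptName) FROM Employee e "
    "JOIN Department d ON d.DeptNo = e.DeptNo "
    "WHERE e.Name IN ('Jones', 'Smith')"
).fetchone()[0]
```

In the flat table, two employees of the same department end up with different department names; in the normalized pair that cannot happen because the name is stored once.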


Summary of Normalization

Source: textbook [1], pg 152


Representing 1:N Relationships

Source: textbook [1], pg 153


Representing an N:M Relationship: Strategy for Foreign Keys

Source: textbook [1], pg 154
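One common way to represent an N:M relationship is a third table that holds a pair of foreign keys, one row per pairing. The sketch below assumes an Adviser/Student relationship like the one in the earlier diagrams; the table name AdviserStudent and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Adviser (AdviserID INTEGER PRIMARY KEY, AdviserName TEXT);
    CREATE TABLE Student (StudentNumber INTEGER PRIMARY KEY, StudentName TEXT);
    -- Intersection table: each row pairs one adviser with one student.
    CREATE TABLE AdviserStudent (
        AdviserID     INTEGER REFERENCES Adviser(AdviserID),
        StudentNumber INTEGER REFERENCES Student(StudentNumber),
        PRIMARY KEY (AdviserID, StudentNumber)
    );
    INSERT INTO Adviser VALUES (1, 'Jones'), (2, 'Smith');
    INSERT INTO Student VALUES (100, 'Lisa'), (200, 'Jennie');
    INSERT INTO AdviserStudent VALUES (1, 100), (1, 200), (2, 100);
""")


def students_of(adviser_name):
    """Return the names of all students advised by the given adviser."""
    rows = conn.execute(
        "SELECT s.StudentName FROM Student s "
        "JOIN AdviserStudent x ON x.StudentNumber = s.StudentNumber "
        "JOIN Adviser a ON a.AdviserID = x.AdviserID "
        "WHERE a.AdviserName = ? ORDER BY s.StudentName",
        (adviser_name,),
    ).fetchall()
    return [r[0] for r in rows]


def advisers_of(student_name):
    """Return the names of all advisers of the given student."""
    rows = conn.execute(
        "SELECT a.AdviserName FROM Adviser a "
        "JOIN AdviserStudent x ON x.AdviserID = a.AdviserID "
        "JOIN Student s ON s.StudentNumber = x.StudentNumber "
        "WHERE s.StudentName = ? ORDER BY a.AdviserName",
        (student_name,),
    ).fetchall()
    return [r[0] for r in rows]
```

Here one adviser has many students and one student has many advisers, yet neither the Adviser nor the Student table repeats any data.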


What Is the Users' Role in the Development of Databases?

 Users are the final judges of:
 What data the database should contain.
 How tables should be related.
 Users review the data model to ensure it accurately reflects users'
view of the business.
 Mistakes will come back to haunt them.
Q2 What is a Data Warehouse?
Business Intelligence and Data Warehousing

 BI used to be everything related to use of data for managerial decision support
 Now, it is a part of Business Analytics
 BI = Descriptive Analytics

Business Analytics: Descriptive, Predictive, Prescriptive

Questions
Descriptive: What happened? What is happening?
Predictive: What will happen? Why will it happen?
Prescriptive: What should I do? Why should I do it?

Enablers
Descriptive: Business reporting, Dashboards, Scorecards, Data warehousing
Predictive: Data mining, Text mining, Web/media mining, Forecasting
Prescriptive: Optimization, Simulation, Decision modeling, Expert systems

Outcomes
Descriptive: Well defined business problems and opportunities
Predictive: Accurate projections of future events and outcomes
Prescriptive: Best possible business decisions and actions

Business Intelligence covers the descriptive column; Advanced Analytics covers the predictive and prescriptive columns.


What is a Data Warehouse?

 A physical repository where relational data are specially
organized to provide enterprise-wide, cleansed data in a
standardized format
 A relational database? (so what is the difference?)
 “The data warehouse is a collection of integrated, subject-oriented
databases designed to support DSS functions, where
each unit of data is non-volatile and relevant to some
moment in time”
A Historical Perspective to Data Warehousing
Characteristics of DWs

 Subject oriented
 Integrated
 Time-variant (time series)
 Nonvolatile
 Summarized
 Not normalized
 Metadata
 Web based, relational/multi-dimensional
 Client/server, real-time/right-time/active...
Data Mart

A departmental small-scale “DW” that stores only limited/relevant data
Dependent data mart
A subset that is created directly from a data warehouse
Independent data mart
A small data warehouse designed for a strategic business unit or a department
Other DW Components

 Operational data stores (ODS)


 A type of database often used as an interim area for a data
warehouse
 Oper marts
 An operational data mart
 Enterprise data warehouse (EDW)
 A data warehouse for the enterprise
 Metadata – “data about data”
 In DW metadata describe the contents of a data warehouse
and its acquisition and use
Application Case 3.1

A Better Data Plan: Well-Established TELCOs Leverage Data Warehousing and Analytics to Stay on Top in a Competitive Industry
Questions for Discussion
1. What are the main challenges for TELCOs?
2. How can data warehousing and data analytics help TELCOs in
overcoming their challenges?
3. Why do you think TELCOs are well suited to take full
advantage of data analytics?
DW for Data-Driven Decision Making

• An example of a DW supporting data-driven decision making in the automotive industry

Data Warehouse: One management and analytics platform for product configuration, warranty, and diagnostic readout data

Reduced Infrastructure Expenses: 2/3 cost reduction through data mart consolidation
Reduced Warranty Expenses: Improved reimbursement accuracy through improved claim data quality
Improved Cost of Quality: Faster identification, prioritization, and resolution of quality issues
Accurate Environmental Performance Reporting
IT Architecture Standardization: One strategic platform for business intelligence and compliance reporting
A Generic DW Framework

Data Sources: ERP, Legacy, POS, Other OLTP/Web, External Data
ETL Process: Select, Extract, Transform, Integrate, Load
Enterprise Data Warehouse: with Metadata and Replication
Data Marts: Marketing, Operations, Finance, ... (or the no data marts option)
API / Middleware
Applications (Visualization): Routine Business Reporting, Data/text mining, OLAP, Dashboard, Web, Custom built applications
DW Architecture

• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and
analyze data from the warehouse

• Two-tier architecture
– First two tiers in three-tier architecture are combined into one
… sometimes there is only one tier?
DW Architectures

3-tier architecture
Tier 1: Client workstation
Tier 2: Application server
Tier 3: Database server

2-tier architecture
Tier 1: Client workstation
Tier 2: Application & database server

1-tier architecture?
Data Warehousing Architectures

• Issues to consider when deciding which architecture to use:


– Which database management system (DBMS) should be
used?
– Will parallel processing and/or partitioning be used?
– Will data migration tools be used to load the data
warehouse?
– What tools will be used to support data retrieval and
analysis?
A Web-based DW Architecture

Client (Web browser)
Internet/Intranet/Extranet
Web Server (Web pages)
Application Server
Data warehouse
Alternative DW Architectures (1 of 2)

(a) Independent Data Marts Architecture
Source Systems
ETL, Staging Area
Independent data marts (atomic/summarized data)
End user access and applications

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts
Source Systems
ETL, Staging Area
Dimensionalized data marts linked by conformed dimensions (atomic/summarized data)
End user access and applications

(c) Hub and Spoke Architecture (Corporate Information Factory)
Source Systems
ETL, Staging Area
Normalized relational warehouse (atomic data)
Dependent data marts (summarized/some atomic data)
End user access and applications
Alternative DW Architectures (2 of 2)

(d) Centralized Data Warehouse Architecture
Source Systems
ETL, Staging Area
Normalized relational warehouse (atomic/some summarized data)
End user access and applications

(e) Federated Architecture
Existing data warehouses, data marts, and legacy systems
Data mapping / metadata: Logical/physical integration of common data elements
End user access and applications

• Each architecture has advantages and disadvantages!
• Which architecture is the best?
Ten Factors that Potentially Affect the Architecture Selection Decision

1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Data Integration and the Extraction, Transformation, and Load Process (1 of 2)

 ETL = Extract Transform Load
 Data integration
 Integration that comprises three major processes: data access,
data federation, and change capture.
 Enterprise application integration (EAI)
 A technology that provides a vehicle for pushing data from source
systems into a data warehouse
 Enterprise information integration (EII)
 An evolving tool space that promises real-time data integration
from a variety of sources, such as relational or multidimensional
databases, Web services, etc.
Data Integration and the Extraction, Transformation, and Load Process (2 of 2)

Sources: Packaged application, Legacy system, Other internal applications, Transient data source
ETL steps feed the targets
Targets: Data warehouse, Data marts
ETL (Extract, Transform, Load)

 Issues affecting the purchase of an ETL tool
 Data transformation tools are expensive
 Data transformation tools may have a long learning curve
 Important criteria in selecting an ETL tool
 Ability to read from and write to an unlimited number of
data sources/architectures
 Automatic capturing and delivery of metadata
 A history of conforming to open standards
 An easy-to-use interface for the developer and the
functional user
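To make the three ETL steps concrete, here is a deliberately small sketch in plain Python. The CSV content, the `sales` table, and the cleaning rules (trim and capitalize the region, drop rows without an amount) are all assumptions for illustration; a commercial ETL tool would do far more.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an ERP,
# POS, or legacy system.
SOURCE_CSV = """order_id,region,amount,order_date
1, north ,100.50,2024-01-05
2,SOUTH,200.00,2024-01-06
3,north,,2024-01-07
"""


def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize formats and drop records that fail a basic check."""
    clean = []
    for r in rows:
        if not r["amount"].strip():
            continue  # missing amount: skip the record
        clean.append(
            (
                int(r["order_id"]),
                r["region"].strip().title(),  # " north " and "SOUTH" become "North", "South"
                float(r["amount"]),
                r["order_date"],
            )
        )
    return clean


def load(conn, rows):
    """Load: write the cleaned records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER, region TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Keeping the three steps as separate functions mirrors the process on the slides and makes each step testable on its own.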
Application Case 3.2

BP Lubricants Achieves BIGS Success


Questions for Discussion
1. What is BIGS?
2. What were the challenges, the proposed solution, and the
obtained results with BIGS?
Data Warehouse Development
 Data warehouse development approaches
 Inmon Model: EDW approach (top-down)
 Kimball Model: Data mart approach (bottom-up)
 Which model is best?
 Table 3.3 provides a comparative analysis between the EDW and
Data Mart approaches
 Another alternative is the hosted data warehouse
Comparing EDW and Data Mart (1 of 2)

Table 3.3 Contrasts between the DM and EDW Development Approaches

Effort | DM Approach | EDW Approach
Scope | One subject area | Several subject areas
Development time | Months | Years
Development cost | $10,000 to $100,000+ | $1,000,000+
Development difficulty | Low to medium | High
Data prerequisite for sharing | Common (within business area) | Common (across enterprise)
Sources | Only some operational and external systems | Many operational and external systems
Size | Megabytes to several gigabytes | Gigabytes to petabytes
Time horizon | Near-current and historical data | Historical data
Data transformations | Low to medium | High
Comparing EDW and Data Mart (2 of 2)

Table 3.3 [continued]

Effort | DM Approach | EDW Approach
Update frequency | Hourly, daily, weekly | Weekly, monthly
Technology
Hardware | Workstations and departmental servers | Enterprise servers and mainframe computers
Operating system | Windows and Linux | Unix, Z/OS, OS/390
Databases | Workgroup or standard database servers | Enterprise database servers
Usage
Number of simultaneous users | 10s | 100s to 1,000s
User types | Business area analysts and managers | Enterprise analysts and senior executives
Business spotlight | Optimizing activities within the business area | Cross-functional optimization and decision making
Application Case 3.3
Use of Teradata Analytics for SAP Solutions Accelerates Big Data Delivery
Questions for Discussion
1. What were the challenges faced by the large Dutch retailer?
2. What was the proposed multivendor solution? What were the
implementation challenges?
3. What were the lessons learned?
Additional DW Considerations: Hosted Data Warehouses

 Benefits:
 Requires minimal investment in infrastructure
 Frees up capacity on in-house systems
 Frees up cash flow
 Makes powerful solutions affordable
 Enables solutions that provide for growth
 Offers better quality equipment and software
 Provides faster connections
 … more in the book
Representation of Data in DW
 Dimensional Modeling
 A retrieval-based system that supports high-volume query access
 Star schema
 The most commonly used and the simplest style of dimensional
modeling
 Contains a fact table surrounded by and connected to several
dimension tables
 Snowflake schema
 An extension of star schema where the diagram resembles a
snowflake in shape
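A star schema can be sketched with one fact table and a few dimension tables. The table names (`fact_sales`, `dim_product`, `dim_region`, `dim_date`) and all values are invented for this example; SQLite is used only because it is bundled with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes used to group and filter.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region_name  TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- Fact table: numeric measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 10, 100.0),
        (1, 2, 1,  5,  50.0),
        (2, 1, 2,  2,  80.0),
        (1, 1, 2,  4,  40.0);
""")

# A typical query joins the fact table to a dimension and aggregates a measure.
revenue_by_region = dict(conn.execute("""
    SELECT r.region_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_region r ON r.region_id = f.region_id
    GROUP BY r.region_name
"""))

units_by_product_quarter = {
    (name, quarter): units
    for name, quarter, units in conn.execute("""
        SELECT p.product_name, d.quarter, SUM(f.units)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d    ON d.date_id    = f.date_id
        GROUP BY p.product_name, d.quarter
    """)
}
```

A snowflake schema would go one step further and split a dimension (for example, region into region and country) into additional linked tables.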
Multidimensionality

The ability to organize, present, and analyze data by several
dimensions, such as sales by region, by product, by salesperson,
and by time (four dimensions)
 Multidimensional presentation
 Dimensions: products, salespeople, market segments,
business units, geographical locations, distribution channels,
country, or industry
 Measures: money, sales volume, head count, inventory
profit, actual versus forecast
 Time: daily, weekly, monthly, quarterly, or yearly
Star Schema versus Snowflake Schema
Analysis of Data in DW
 OLTP vs. OLAP…
 OLTP (Online Transaction Processing)
 Capturing and storing data from ERP, CRM, POS, …
 The main focus is on efficiency of routine tasks
 OLAP (Online Analytical Processing)
 Converting data into information for decision support
 Data cubes, drill-down / rollup, slice & dice, …
 Requesting ad hoc reports
 Conducting statistical and other analyses
 Developing multimedia-based applications
 …more in the book
OLAP vs. OLTP

Table 3.5 A Comparison between OLTP and OLAP

Criteria | OLTP | OLAP
Purpose | To carry out day-to-day business functions | To support decision making and provide answers to business and management queries
Data source | Transaction database (a normalized data repository primarily focused on efficiency and consistency) | Data warehouse or DM (a nonnormalized data repository primarily focused on accuracy and completeness)
Reporting | Routine, periodic, narrowly focused reports | Ad hoc, multidimensional, broadly focused reports and queries
Resource requirements | Ordinary relational databases | Multiprocessor, large-capacity, specialized databases
Execution speed | Fast (recording of business transactions and routine reports) | Slow (resource intensive, complex, large-scale queries)
OLAP Operations
 Slice - a subset of a multidimensional array
 Dice - a slice on more than two dimensions
 Drill Down/Up - navigating among levels of data ranging from
the most summarized (up) to the most detailed (down)
 Roll Up - computing all of the data relationships for one or
more dimensions
 Pivot - used to change the dimensional orientation of a report
or an ad hoc query-page display
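The operations above can be imitated on a toy cube held in a Python dictionary. This is only a sketch under simplifying assumptions: the cube has three dimensions (product, region, quarter), the sales figures are made up, and real OLAP servers implement these operations very differently.

```python
from collections import defaultdict

# Toy cube: (product, region, quarter) -> sales volume. All values are made up.
CUBE = {
    ("Widget", "East", "Q1"): 10,
    ("Widget", "East", "Q2"): 4,
    ("Widget", "West", "Q1"): 5,
    ("Gadget", "East", "Q2"): 2,
    ("Gadget", "West", "Q2"): 7,
}
DIMS = ("product", "region", "quarter")


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets on several dimensions."""
    idx = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {k: v for k, v in cube.items() if all(k[i] in s for i, s in idx.items())}


def roll_up(cube, keep):
    """Roll up: sum away every dimension not listed in keep."""
    keep_idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for k, v in cube.items():
        out[tuple(k[i] for i in keep_idx)] += v
    return dict(out)


def pivot(table):
    """Pivot: swap the row and column keys of a two-dimensional summary."""
    return {(b, a): v for (a, b), v in table.items()}


east = slice_cube(CUBE, "region", "East")
east_q2 = dice_cube(CUBE, region={"East"}, quarter={"Q2"})
by_product = roll_up(CUBE, ["product"])
by_region_quarter = roll_up(CUBE, ["region", "quarter"])
by_quarter_region = pivot(by_region_quarter)
```

Drilling down is the reverse of rolling up: going from `by_product` back to the more detailed `by_region_quarter` or to the full cube.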
OLAP

• Slicing Operations on a Simple Three-Dimensional Data Cube

A 3-dimensional OLAP cube with slicing operations
Axes: Time, Product, Geography
Cells are filled with numbers representing sales volumes

Slices shown:
Sales volumes of a specific Product on variable Time and Region
Sales volumes of a specific Region on variable Time and Products
Sales volumes of a specific Time on variable Region and Products
Successful DW Implementation: Things to Avoid

 Starting with the wrong sponsorship chain
 Setting expectations that you cannot meet
 Engaging in politically naive behavior
 Loading the data warehouse with information just because it is
available
 Believing that data warehousing database design is the same as
transactional database design
 … more in the book
Massive DW and Scalability
 Scalability
 The main issues pertaining to scalability:
o The amount of data in the warehouse
o How quickly the warehouse is expected to grow
o The number of concurrent users
o The complexity of user queries
 Good scalability means that queries and other data-access
functions will grow linearly with the size of the warehouse
Application Case 3.4

EDW Helps Connect State Agencies in Michigan


Questions for Discussion
1. Why would a state invest in a large and expensive IT
infrastructure (such as an EDW)?
2. What is the size and complexity of the EDW used by state
agencies in Michigan?
3. What were the challenges, the proposed solution, and the
obtained results of the EDW?
DW Administration and Security

 Data warehouse administrator (DWA)


 DWA should…
o have the knowledge of high-performance software, hardware, and
networking technologies
o possess solid business knowledge and insight
o be familiar with the decision-making processes so as to suitably
design/maintain the data warehouse structure
o possess excellent communications skills
 Security and privacy are pressing issues in DW
 Safeguarding the most valuable assets
 Government regulations (HIPAA, etc.)
 Must be explicitly planned and executed
The Future of DW

• Sourcing…
– Web, social media, and Big Data
– Open source software
– SaaS (software as a service)
– Cloud computing
– Data lakes

• Infrastructure…
– Columnar
– Real-time DW
– Data warehouse appliances
– Data management practices/technologies
– In-database & In-memory processing
– New DBMS, Advanced analytics, …
Data Lakes

• Unstructured data storage technology for Big Data
• Data Lake versus Data Warehouse

Table 3.6 A Simple Comparison between a Data Warehouse and a Data Lake

Dimension | Data Warehouse | Data Lake
The nature of data | Structured, processed | Any data in raw/native format
Processing | Schema-on-write (SQL) | Schema-on-read (NoSQL)
Retrieval speed | Very fast | Slow
Cost | Expensive for large data volumes | Designed for low-cost storage
Agility | Less agile, fixed configuration | Highly agile, flexible configuration
Novelty/newness | Not new/matured | Very new/maturing
Security | Well-secured | Not yet well-secured
Users | Business professionals | Data scientists
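The "Processing" row of the table can be illustrated with a rough sketch. With schema-on-write, records are checked against a fixed structure before they are stored; with schema-on-read, raw records are stored untouched and each reader decides how to interpret them. The event records, the `purchases` table, and the helper functions are hypothetical.

```python
import json
import sqlite3

RAW_EVENTS = [
    '{"customer": "ann", "amount": 12.5}',
    '{"customer": "bob", "amount": "n/a", "note": "refund pending"}',
    '{"customer": "cy", "clicks": 3}',
]

# Schema-on-write (warehouse style): structure is enforced when data is stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)
rejected = 0
for line in RAW_EVENTS:
    rec = json.loads(line)
    if isinstance(rec.get("amount"), (int, float)):
        warehouse.execute(
            "INSERT INTO purchases VALUES (?, ?)", (rec["customer"], rec["amount"])
        )
    else:
        rejected += 1  # does not fit the schema, so it never enters the warehouse
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]

# Schema-on-read (lake style): everything is kept in raw form.
lake = list(RAW_EVENTS)


def total_amount(raw_lines):
    """One reader's view: sum the amounts that happen to be numeric."""
    total = 0.0
    for line in raw_lines:
        value = json.loads(line).get("amount")
        if isinstance(value, (int, float)):
            total += value
    return total


def customers_seen(raw_lines):
    """Another reader's view: every customer, whatever else the record contains."""
    return sorted(json.loads(line)["customer"] for line in raw_lines)
```

The warehouse answers the purchase question quickly but has lost two records; the lake keeps all three, at the cost of parsing and interpreting them on every read.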
Business Performance Management (1 of 2)

 Business Performance Management (BPM) is…
A real-time system that alerts managers to potential
opportunities, impending problems, and threats, and then
empowers them to react through models and collaboration
 Also called corporate performance management (CPM by
Gartner Group), enterprise performance management (EPM by
Oracle), strategic enterprise management (SEM by SAP)
Business Performance Management (2 of 2)

 BPM refers to the business processes, methodologies, metrics,


and technologies used by enterprises to measure, monitor, and
manage business performance.
 BPM encompasses three key components
 A set of integrated, closed-loop management and analytic
processes, supported by technology …
 Tools for businesses to define strategic goals and then
measure/manage performance against them
 Methods and tools for monitoring key performance
indicators (KPIs), linked to organizational strategy
A Closed-Loop Process to Optimize Business
Performance

• Process Steps

1. Strategize
2. Plan
3. Monitor/analyze
4. Act/adjust
Each with its own subprocess steps
1 - Strategize: Where Do We Want to Go?

• Strategic planning
– Common tasks for the strategic planning process:
1. Conduct a current situation analysis
2. Determine the planning horizon
3. Conduct an environment scan
4. Identify critical success factors
5. Complete a gap analysis
6. Create a strategic vision
7. Develop a business strategy
8. Identify strategic objectives and goals
2 - Plan: How Do We Get There?
 Operational planning
 Operational plan: plan that translates an organization’s
strategic objectives and goals into a set of well-defined
tactics and initiatives, resource requirements, and expected
results for some future time period (usually a year).
 Operational planning can be
 Tactic-centric (operationally focused)
 Budget-centric plan (financially focused)
3 - Monitor/Analyze: How Are We Doing?

 A comprehensive framework for monitoring performance


should address two key issues:
 What to monitor?
o Critical success factors
o Strategic goals and targets
 How to monitor?
4 - Act and Adjust: What Do We Need to Do
Differently?

 Success (or mere survival) depends on new projects: creating


new products, entering new markets, acquiring new customers
(or businesses), or streamlining some process.
 Many new projects and ventures fail!
 What is the chance of failure?
 60% of Hollywood movies fail
 70% of large IT projects fail, …
Application Case 3.5

AARP Transforms Its BI Infrastructure and Achieves a 347% ROI in Three Years
Questions for Discussion
1. What were the challenges AARP was facing?
2. What was the approach for a potential solution?
3. What were the results obtained in the short term, and what
were the future plans?
Performance Measurement
 Performance measurement system
A system that assists managers in tracking the implementations
of business strategy by comparing actual results against
strategic goals and objectives
 Comprises systematic comparative methods that indicate
progress (or lack thereof) against goals
KPIs and Operational Metrics

• Key performance indicator (KPI)
A KPI represents a strategic objective and metrics that
measure performance against a goal
• Distinguishing features of KPIs
– Strategy
– Targets
– Ranges
– Encodings
– Time frames
– Benchmarks
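Several of these features (targets, ranges, encodings, time frames) can be shown in a short sketch. The `KPI` class, its field names, the 80% warning threshold, and the traffic-light colours are assumptions made for illustration, and the sketch assumes a KPI where higher values are better.

```python
from dataclasses import dataclass


@dataclass
class KPI:
    name: str
    target: float      # the goal to reach
    warn_ratio: float  # range boundary: below this share of the target is "red"
    time_frame: str    # period over which the KPI is measured

    def status(self, actual: float) -> str:
        """Encode actual performance against the target as a traffic-light colour."""
        ratio = actual / self.target
        if ratio >= 1.0:
            return "green"
        if ratio >= self.warn_ratio:
            return "yellow"
        return "red"


# Hypothetical driver KPI with made-up numbers.
sales_leads = KPI(
    name="Qualified sales leads", target=200, warn_ratio=0.8, time_frame="monthly"
)
statuses = [sales_leads.status(actual) for actual in (210, 170, 120)]
```

The encoding step is what turns a raw number into something a dashboard or scorecard can display at a glance.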
Performance Measurement 2

• Key performance indicator (KPI)
– Outcome KPIs (lagging indicators, e.g., revenues)
– Driver KPIs (leading indicators, e.g., sales leads)

• Operational areas covered by driver KPIs
– Customer performance
– Service performance
– Sales operations
– Sales plan/forecast
Performance Measurement System

• Balanced Scorecard (BSC)
A performance measurement and management methodology
that helps translate an organization’s financial, customer,
internal process, and learning and growth objectives and
targets into a set of actionable initiatives
“The Balanced Scorecard: Measures That Drive Performance” (HBR, 1992)
Balanced Scorecard

The meaning of “balance”?

Four perspectives arranged around VISION & STRATEGY:
Financial Perspective
Customer Perspective
Internal Business Process Perspective
Learning and Growth Perspective
Six Sigma as a Performance Measurement System
(1 of 2)

 Six Sigma
A performance management methodology aimed at reducing
the number of defects in a business process to as close to zero
defects per million opportunities (DPMO) as possible
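DPMO is commonly computed as the number of defects divided by the total number of opportunities for a defect, scaled to one million. A small sketch follows; the function name, its parameters, and the sample numbers are illustrative and not taken from the textbook.

```python
def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities.

    defects: number of defects observed
    units: number of units (or transactions) inspected
    opportunities_per_unit: ways in which each unit could be defective
    """
    if units <= 0 or opportunities_per_unit <= 0:
        raise ValueError("units and opportunities_per_unit must be positive")
    return defects * 1_000_000 / (units * opportunities_per_unit)


# Made-up example: 17 defects found in 500 invoices,
# each invoice having 10 fields that could be wrong.
invoice_dpmo = dpmo(defects=17, units=500, opportunities_per_unit=10)
```

For comparison, the performance level usually associated with Six Sigma is 3.4 DPMO, so the process in this made-up example would have considerable room for improvement.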
Six Sigma as a Performance Measurement System
(2 of 2)

 The DMAIC performance model


A closed-loop business improvement model that encompasses
the steps of defining, measuring, analyzing, improving, and
controlling a process
 Lean Six Sigma
 Lean manufacturing / lean production
 Lean production versus six sigma?
Comparison of BSC and Six Sigma (1 of 2)

Table 3.7 Comparison of the Balanced Scorecard and Six Sigma

Balanced Scorecard | Six Sigma
Strategic management system | Performance measurement system
Relates to the longer-term view of the business | Provides snapshot of business’s performance and identifies measures that drive performance toward profitability
Designed to develop a balanced set of measures | Designed to identify a set of measurements that impact profitability
Identifies measurements around vision and values | Establishes accountability for leadership for wellness and profitability
Critical management processes are to clarify vision/strategy, communicate, plan, set targets, align strategic initiatives, and enhance feedback | Includes all business processes: management and operational
Comparison of BSC and Six Sigma (2 of 2)

Table 3.7 [continued]

Balanced Scorecard | Six Sigma
Balances customer and internal operations without a clearly defined leadership role | Balances management and employees’ roles; balances costs and revenue of heavy processes
Emphasizes targets for each measurement | Emphasizes aggressive rate of improvement for each measurement, irrespective of target
Emphasizes learning of executives based on feedback | Emphasizes learning and innovation at all levels based on process feedback; enlists all employees’ participation
Focuses on growth | Focuses on maximizing profitability
Heavy on strategic content | Heavy on execution for profitability
Management system consisting of measures | Measurement system based on process management
Effective Performance Measurement Should

 Measures should focus on key factors.


 Measures should be a mix of past, present, and future.
 Measures should balance the needs of shareholders,
employees, partners, suppliers, and other stakeholders.
 Measures should start at the top and flow down to the bottom.
 Measures need to have targets that are based on research and
reality rather than arbitrary.
Application Case 3.6

Expedia.com’s Customer Satisfaction Scorecard


Questions for Discussion
1. Who are the customers for Expedia.com? Why is customer
satisfaction a very important part of their business?
2. How did Expedia.com improve customer satisfaction with
scorecards?
3. What were the challenges, the proposed solution, and the
obtained results?
Q3 What is Data Mining?
Opening Vignette (1 of 3)

Miami-Dade Police Department Is Using Predictive Analytics to Foresee and Fight Crime
• Predictive analytics in
law enforcement
– Policing with less
– New thinking on
cold cases
– The big picture
starts small
– Success brings
credibility
– Just for the facts
Opening Vignette (2 of 3)

Discussion Questions
1. Why do law enforcement agencies and
departments like Miami-Dade Police Department
embrace advanced analytics and data mining?
2. What are the top challenges for law
enforcement agencies and departments like
Miami-Dade Police Department? Can you
think of other challenges (not mentioned in
this case) that can benefit from data mining?
Opening Vignette (3 of 3)

3. What are the sources of data that law
enforcement agencies and departments like
Miami-Dade Police Department use for their
predictive modeling and data mining projects?
4. What type of analytics do law enforcement
agencies and departments like Miami-Dade
Police Department use to fight crime?
5. What does “the big picture starts small”
mean in this case? Explain.
Data Mining Concepts and Definitions
Why Data Mining?
• More intense competition at the global scale.
• Recognition of the value in data sources.
• Availability of quality data on customers, vendors,
transactions, Web, etc.
• Consolidation and integration of data repositories
into data warehouses.
• The exponential increase in data processing and
storage capabilities; and decrease in cost.
• Movement toward conversion of information
resources into nonphysical form.
Definition of Data Mining
• The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases.
– Fayyad et al. (1996)
• Keywords in this definition: Process,
nontrivial, valid, novel, potentially useful,
understandable.
• Data mining: a misnomer?
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data
dredging,…
Figure 4.1 Data Mining is a Blend of Multiple
Disciplines
Application Case 4.1
Visa Is Enhancing the Customer Experience While
Reducing Fraud with Predictive Analytics and Data
Mining
Questions for Discussion
1. What challenges were Visa and the rest of the
credit card industry facing?
2. How did Visa improve customer service
while also improving retention of fraud?
3. What is in-memory analytics, and why was
it necessary?
Data Mining Characteristics & Objectives
• Source of data for DM is often a
consolidated data warehouse (not always!).
• DM environment is usually a client-server or
a Web-based information systems
architecture.
• Data is the most critical ingredient for DM
which may include soft/unstructured data.
• The miner is often an end user.
• Striking it rich requires creative thinking.
• Data mining tools’ capabilities and ease of
use are essential (Web, Parallel processing,
How Data Mining Works
• DM extracts patterns from data
– Pattern? A mathematical (numeric and/or
symbolic) relationship among data items
• Types of patterns
– Association
– Prediction
– Cluster (segmentation)
– Sequential (or time series) relationships
Application Case 4.2
Dell Is Staying Agile and Effective with Analytics in the 21st
Century
Questions for Discussion
1. What was the challenge Dell was facing that led
to their analytics journey?
2. What solution did Dell develop and implement?
What were the results?
3. As an analytics company itself, Dell has used its
service offerings for its own business. Do you
think it is easier or harder for a company to taste
its own medicine? Explain.
A Taxonomy for Data Mining
• Figure 4.2 A Simple Taxonomy for Data Mining Tasks, Methods, and Algorithms
Data Mining Tasks & Methods | Data Mining Algorithms | Learning Type
Prediction
– Classification | Decision Trees, Neural Networks, Support Vector Machines, kNN, Naïve Bayes, GA | Supervised
– Regression | Linear/Nonlinear Regression, ANN, Regression Trees, SVM, kNN, GA | Supervised
– Time Series | Autoregressive Methods, Averaging Methods, Exponential Smoothing, ARIMA | Supervised
Association
– Market-basket | Apriori, OneR, ZeroR, Eclat, GA | Unsupervised
– Link analysis | Expectation Maximization, Apriori Algorithm, Graph-based Matching | Unsupervised
– Sequence analysis | Apriori Algorithm, FP-Growth, Graph-based Matching | Unsupervised
Segmentation
– Clustering | K-means, Expectation Maximization (EM) | Unsupervised
– Outlier | K-means, Expectation Maximization (EM) | Unsupervised
Other Data Mining Patterns/Tasks
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
– Covered in Chapter 3
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
• Banking & Other Financial
– Automate the loan application process
– Detect fraudulent transactions
Data Mining Applications (2 of 4)
• Retailing and Logistics
– Optimize inventory levels at different locations
– Improve the store layout and sales promotions
– Optimize logistics by predicting seasonal effects
– Minimize losses due to limited shelf life
• Manufacturing and Maintenance
– Predict/prevent machinery failures
– Identify anomalies in production systems to
optimize the use of manufacturing capacity
– Discover novel patterns to improve product
quality
Data Mining Applications (3 of 4)
• Brokerage and Securities Trading
– Predict changes in certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market
movements
– Identify and prevent fraudulent activities in
trading
• Insurance
– Forecast claim costs for better business
planning
– Determine optimal rate plans
– Optimize marketing to specific customers
Data Mining Applications (4 of 4)
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel, entertainment, sports
• Healthcare and medicine
• Sports, … virtually everywhere …
Application Case 4.3
Predictive Analytics and Data Mining Help Stop Terrorist
Funding
Questions for Discussion
1. How can data mining be used to fight
terrorism? Comment on what else can be
done beyond what is covered in this short
application case.
2. Do you think data mining, although essential for
fighting terrorist cells, also jeopardizes
individuals’ rights of privacy?
Data Mining Process
• A manifestation of the best practices
• A systematic way to conduct DM projects
• Moving from art to science for DM projects
• Everybody has a different version
• Most common standard processes:
– CRISP-DM (Cross-Industry Standard Process for
Data Mining)
– SEMMA (Sample, Explore, Modify, Model, and
Assess)
– KDD (Knowledge Discovery in Databases)
Data Mining Process: CRISP-DM (1 of 2)
• Cross-Industry Standard Process for Data Mining
• Proposed in 1990s by a European consortium
• Composed of six consecutive phases
– Step 1: Business Understanding
– Step 2: Data Understanding
– Step 3: Data Preparation
– Step 4: Model Building
– Step 5: Testing and Evaluation
– Step 6: Deployment
• Slide callout beside the early steps: accounts for ~85% of total project time
Data Mining Process: CRISP-DM (2 of 2)
• Figure 4.3 The Six-Step CRISP-DM Data Mining Process
• The process is highly repetitive and experimental (DM: art versus science?)
[Figure: a cycle around the data with (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Model Building, (5) Testing and Evaluation, (6) Deployment]
Data Mining Process: SEMMA
• Figure 4.5 SEMMA Data Mining Process
• Developed by SAS Institute
– Sample (Generate a representative sample of the data)
– Explore (Visualization and basic description of the data)
– Modify (Select variables, transform variable representations)
– Model (Use variety of statistical and machine learning models)
– Assess (Evaluate the accuracy and usefulness of the models)
– Feedback links the steps
Data Mining Process: KDD
• Figure 4.6 KDD (Knowledge Discovery in Databases) Process
[Figure: Sources for Raw Data → Data Selection → Target Data → Data Cleaning → Preprocessed Data → Data Transformation → Transformed Data → Data Mining → Extracted Patterns → Internalization → Knowledge (“Actionable Insight”), with a Feedback loop]
Which Data Mining Process is the Best?
• Figure 4.7 Ranking of Data Mining Methodologies/Processes.
[Figure: bar chart, horizontal axis from 0 to 70; bars in displayed order: CRISP-DM, My own, SEMMA, KDD Process, My organization's, Domain-specific methodology, None, Other methodology (not domain specific)]
Source: Used with permission from KDnuggets.com.
Application Case 4.4
Data Mining Helps in Cancer Research
Questions for Discussion
1. How can data mining be used for ultimately curing
illnesses like cancer?
2. What do you think are the promises and major
challenges for data miners in contributing to medical
and biological research endeavors?
[Figure: Cancer DB 1, Cancer DB 2, …, Cancer DB n → Combined Cancer DB → Data Preprocessing (Cleaning, Selecting, Transforming) → Partitioned data (training & testing) → three models: Artificial Neural Networks (ANN), Logistic Regression (LR), Random Forest (RF) → Training and calibrating the model → Testing the model → Assess variable importance → Tabulated Model Testing Results (Accuracy, Sensitivity and Specificity) and Tabulated Relative Variable Importance Results]
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employs supervised learning
• Learns from past data, classifies new data
• The output variable is categorical (nominal or
ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Assessment Methods for Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building versus predicting/usage speed
• Robustness
• Scalability
Accuracy of Classification Models
• In classification problems, the primary source for
accuracy estimation is the confusion matrix

Confusion matrix (rows: Predicted Class; columns: True/Observed Class)
Predicted \ True | Positive | Negative
Positive | True Positive Count (TP) | False Positive Count (FP)
Negative | False Negative Count (FN) | True Negative Count (TN)

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{True Positive Rate} = \frac{TP}{TP + FN}$
$\text{True Negative Rate} = \frac{TN}{TN + FP}$
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \frac{TP}{TP + FN}$
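The measures on this slide can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the counts are made up for illustration and do not come from the chapter.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's five measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the true positive rate and recall share one formula, so they always agree.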
Estimation Methodologies for Classification: Single/Simple Split
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets:
training (~70%) and testing (30%)
[Figure: Preprocessed Data is split into Training Data (2/3) → Model Development → Trained Classifier, and Testing Data (1/3) → Model Assessment (scoring) → Prediction Accuracy (TP, FP, FN, TN)]
– For Neural Networks, the data is split into three
subsets (training [~60%], validation [~20%],
testing [~20%])
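A simple split can be sketched in a few lines. This is an illustrative version only: the helper name `simple_split`, the fixed seed, and the stand-in records are assumptions, not part of the chapter.

```python
import random


def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 preprocessed data rows
train, test = simple_split(records)
print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the source data from leaking into one of the two sets.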
Estimation Methodologies for Classification: k-Fold Cross-Validation (rotation estimation)
• Data is split into k mutually exclusive subsets and k
training/testing experiments are conducted
• Figure 4.10 A Graphical Depiction of k-Fold Cross-Validation
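The rotation can be sketched as an index generator in which each fold serves as the test set exactly once. The helper name `k_fold_splits` is hypothetical, and the sketch leaves the records in their original order; in practice they would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_splits(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

The overall accuracy estimate would then be the average of the k per-fold accuracies.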
Additional Estimation Methodologies for
Classification

• Leave-one-out
– Similar to k-fold where k = number of
samples
• Bootstrapping
– Random sampling with replacement
• Jackknifing
– Similar to leave-one-out
• Area Under the ROC Curve (AUC)
– ROC: receiver operating characteristics
(a term borrowed from radar image
processing)
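Two of the methods listed above can be sketched with index lists. The function names are hypothetical, and using the never-drawn ("out-of-bag") records as the bootstrap test set is one common convention assumed here, not something the slide specifies.

```python
import random


def leave_one_out(n_samples):
    """Yield (train_indices, test_index): each sample is held out exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], i


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples indices with replacement; report the indices never drawn."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(drawn))
    return drawn, out_of_bag


loo = list(leave_one_out(4))
drawn, out_of_bag = bootstrap_sample(10)
print(loo[0])
print(drawn, out_of_bag)
```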
Area Under the ROC Curve (AUC) (1 of 2)
• Works with binary classification
• Figure 4.11 A Sample ROC Curve
Area Under the ROC Curve (AUC) (2 of 2)
• Produces values from 0 to 1.0
• Random chance is 0.5 and perfect classification is 1.0
• Produces a good assessment for skewed class distributions too!
[Figure: ROC curve with both axes running from 0 to 1; horizontal axis: False Alarms (1 - Specificity); Area Under the ROC Curve (AUC) A = 0.84]
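One way to compute the AUC without drawing the curve uses a known equivalence: the area equals the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below relies on that equivalence; the function name and the scores are invented for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for eight cases (1 = positive class).
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
print(auc(labels, scores))
```

A classifier that gives every case the same score lands at 0.5, matching the random-chance value on the slide.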


Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets
Decision Trees (1 of 2)
• Employs a divide-and-conquer method
• Recursively divides a training set until each division
consists of examples from one class
• A general algorithm (steps) for building a decision tree:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split.
Split the data into mutually exclusive subsets along the
lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf node until
the stopping criteria are reached.
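The four steps can be sketched as a short recursive function. This is not any particular named algorithm: it assumes categorical attributes, uses Gini impurity as the splitting criterion (the slide does not name one), and stops when a node is pure or no attributes remain. The toy movie data is hypothetical.

```python
from collections import Counter


def gini(rows, target):
    """Gini impurity of the class labels in rows (0 means a pure node)."""
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())


def best_attribute(rows, attributes, target):
    """Step 2: pick the attribute whose split gives the lowest weighted impurity."""
    def weighted(attr):
        groups = Counter(r[attr] for r in rows)
        return sum(
            n / len(rows) * gini([r for r in rows if r[attr] == v], target)
            for v, n in groups.items()
        )
    return min(attributes, key=weighted)


def build_tree(rows, attributes, target):
    """Steps 1-4: returns a class label (leaf) or (attribute, {value: subtree})."""
    labels = Counter(r[target] for r in rows)
    if len(labels) == 1 or not attributes:  # stopping criteria
        return labels.most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({r[attr] for r in rows}):  # step 3: one branch per value
        subset = [r for r in rows if r[attr] == value]
        branches[value] = build_tree(subset, remaining, target)  # step 4
    return (attr, branches)


# Hypothetical toy data, for illustration only.
rows = [
    {"sequel": "yes", "star": "high", "hit": "yes"},
    {"sequel": "yes", "star": "low", "hit": "yes"},
    {"sequel": "yes", "star": "low", "hit": "yes"},
    {"sequel": "no", "star": "high", "hit": "yes"},
    {"sequel": "no", "star": "low", "hit": "no"},
    {"sequel": "no", "star": "low", "hit": "no"},
]
tree = build_tree(rows, ["sequel", "star"], "hit")
print(tree)
```

On this data the first split is on `sequel`, and only the `no` branch needs a further split on `star`.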
Decision Trees (2 of 2)
• DT algorithms mainly differ on
1. Splitting criteria
▪ Which variable, what value, etc.
2. Stopping criteria
▪ When to stop building the tree
3. Pruning (generalization method)
▪ Pre-pruning versus post-pruning
• Most popular DT algorithms
Ensemble Models for Predictive Analytics
• Produces more robust and reliable prediction models
• Figure 4.12 Graphical Illustration of a Heterogeneous Ensemble
Application Case 4.5
Influence Health Uses Advanced Predictive Analytics
to Focus on the Factors That Really Influence People’s
Healthcare Decisions
Questions for Discussion
1. What did Influence Health do?
2. What were the challenges, the proposed
solutions, and the obtained results?
3. How can data mining help companies in the
healthcare industry (in ways other than the
ones mentioned in this case)?
Cluster Analysis for Data Mining (1 of 4)
• Used for automatic identification of natural
groupings of things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then
assigns new instances
• There is no output/target variable
• In marketing, it is also known as segmentation
Cluster Analysis for Data Mining (2 of 4)
Clustering results may be used to
• Identify natural groupings of customers
• Identify rules for assigning new cases to classes
for targeting/diagnostic purposes
• Provide characterization, definition, labeling
of populations
• Decrease the size and complexity of problems
for other data mining methods
• Identify outliers in a specific domain (e.g.,
rare-event detection)
Cluster Analysis for Data Mining (3 of 4)
• Analysis methods
– Statistical methods (including both
hierarchical and nonhierarchical), such as
k-means, k-modes, and so on.
– Neural networks (adaptive resonance theory
[ART], self-organizing map [SOM])
– Fuzzy logic (e.g., fuzzy c-means algorithm)
– Genetic algorithms
• How many clusters?
Cluster Analysis for Data Mining (4 of 4)
• k-Means Clustering Algorithm
– k: pre-determined number of clusters
– Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k random points as
initial cluster centers.
Step 2: Assign each point to the nearest cluster
center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until
some convergence criterion is met (usually
that the assignment of points to clusters
becomes stable).
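The k-means steps can be sketched in plain Python for two-dimensional points. Two choices here are assumptions rather than part of the slide: the initial centers are sampled from the data points themselves (a common variant of Step 1), and an empty cluster keeps its previous center. The points are made up.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on 2-D points; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1 (variant): k data points as initial centers
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:  # convergence: assignments are stable
            break
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assignments


# Hypothetical points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = k_means(points, k=2)
print(centers, assignments)
```

Which cluster gets index 0 depends on the random start, so only the grouping itself is meaningful.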
Cluster Analysis for Data Mining - k-Means Clustering Algorithm
• Figure 4.13 A Graphical Illustration of the Steps in the k-Means Algorithm
[Figure panels: Step 1, Step 2, Step 3]
Association Rule Mining (1 of 6)
• A very popular DM method in business
• Finds interesting relationships (affinities) between
variables (items or events)
• Part of the machine-learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis
• Often used as an example to describe DM to ordinary
people, such as the famous “relationship between
diapers and beers!”
Association Rule Mining (2 of 6)
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service
plan 70 percent of the time.”
• How do you use such a pattern/knowledge?
– Put the items next to each other
– Promote the items as a package
– Place items far apart from each other!
Association Rule Mining (3 of 6)
• Representative applications of association rule
mining include
– In business: cross-marketing, cross-selling, store
design, catalog design, e-commerce site
design, optimization of online advertising,
product pricing, and sales/promotion
configuration
– In medicine: relationships between symptoms
and illnesses; diagnosis and patient
characteristics and treatments (to be used in
medical DSS); and genes and their functions
(to be used in genomics projects)
– …
Association Rule Mining (4 of 6)
• Are all association rules interesting and useful?
A Generic Rule: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software}
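Support and confidence can be computed by counting transactions. The sketch below uses the standard definitions, which are assumed to match the slide's informal wording: support is the share of all transactions that contain both X and Y, and confidence is the share of transactions containing X that also contain Y. The baskets are made up and do not reproduce any figures from the chapter.

```python
def support_and_confidence(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs as fractions."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Made-up baskets, for illustration only.
transactions = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "service plan"},
    {"laptop"},
    {"printer"},
]
s, c = support_and_confidence(transactions, {"laptop", "antivirus"}, {"service plan"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```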



Association Rule Mining (5 of 6)
• Several algorithms have been developed for
discovering (identifying) association rules
– Apriori
– Eclat
– FP-Growth
– + Derivatives and hybrids of the three
• The algorithms help identify the frequent itemsets,
which are then converted to association rules
Association Rule Mining (6 of 6)
• Apriori Algorithm
– Finds subsets that are common to at least a
minimum number of the itemsets
– Uses a bottom-up approach
▪ frequent subsets are extended one item at
a time (the size of frequent subsets
increases from one-item subsets to
two-item subsets, then three-item
subsets, and so on), and
▪ groups of candidates at each level are
tested against the data for minimum
support
(see the figure)
Association Rule Mining Apriori Algorithm
• Figure: Identification of Frequent Itemsets in the Apriori Algorithm

Raw Transaction Data
Transaction No | SKUs (Item No)
1001234 | 1, 2, 3, 4
1001235 | 2, 3, 4
1001236 | 2, 3
1001237 | 1, 2, 4
1001238 | 1, 2, 3, 4
1001239 | 2, 4

One-item Itemsets
Itemset (SKUs) | Support
1 | 3
2 | 6
3 | 4
4 | 5

Two-item Itemsets
Itemset (SKUs) | Support
1, 2 | 3
1, 3 | 2
1, 4 | 3
2, 3 | 4
2, 4 | 5
3, 4 | 3

Three-item Itemsets
Itemset (SKUs) | Support
1, 2, 4 | 3
2, 3, 4 | 3
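The bottom-up search can be sketched on the six transactions from the figure. The minimum support count of 3 is an assumption made for this sketch (the slide does not state the threshold); it is consistent with the two three-item itemsets shown. The function name is hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    frequent = {}
    while candidates:
        # Count each candidate against the data and keep those meeting min_support.
        level = {}
        for cand in candidates:
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_support:
                level[cand] = count
        frequent.update(level)
        # Extend the frequent itemsets by one item; prune any candidate that
        # has an infrequent subset one item smaller.
        size = len(candidates[0]) + 1
        pool = sorted({i for c in level for i in c})
        candidates = [
            c for c in combinations(pool, size)
            if all(sub in level for sub in combinations(c, size - 1))
        ]
    return frequent


# The six transactions (SKUs) from the figure.
transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
frequent = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

With this threshold the pair (1, 3), whose support is 2, is dropped, so no three-item candidate containing both items is ever counted.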
Data Mining Software Tools
• Commercial
– IBM SPSS Modeler (formerly Clementine)
– SAS Enterprise Miner
– Statistica - Dell/Statsoft
– … many more
• Free and/or Open Source
– KNIME
– RapidMiner
[Figure: bar chart of tools with counts, horizontal axis from 0 to 1,600, led by R (1,419), Python (1,325), SQL (1,029), and Excel (972). Other tools charted: RapidMiner, Hadoop, Spark, Tableau, KNIME, SciKit-Learn, Java, Anaconda, Hive, Mllib, Weka, Microsoft SQL Server, Unix shell/awk/gawk, MATLAB, IBM SPSS Statistics, Dataiku, SAS base, IBM SPSS Modeler, SQL on Hadoop tools, C/C++, Other free analytics/data mining tools, Other programming and data languages, H2O, Scala, SAS Enterprise Miner, Microsoft Power BI, Hbase, QlikView, Microsoft Azure Machine Learning, Other Hadoop/HDFS-based tools, Apache Pig, IBM Watson, Rattle, Salford SPM/CART/RF/MARS/TreeNet, Gnu Octave, Orange. Legend: Free/Open Source tools (orange), Commercial tools (green), Hadoop/Big Data tools (blue)]
Application Case 4.6 (1 of 5)
Data Mining Goes to Hollywood: Predicting Financial
Success of Movies
• Goal: Predicting financial success of Hollywood
movies before the start of their production
process
• How: Use of advanced predictive analytics
methods
Application Case 4.6 (2 of 5)
A Typical Classification Problem
Dependent Variable
Class No. | Range (in $Millions)
1 | < 1 (Flop)
2 | > 1, < 10
3 | > 10, < 20
4 | > 20, < 40
5 | > 40, < 65
6 | > 65, < 100
7 | > 100, < 150
8 | > 150, < 200
9 | > 200 (Blockbuster)
Application Case 4.6 (3 of 5)
Independent Variables
Independent Variable | Number of Values | Possible Values
MPAA Rating | 5 | G, PG, PG-13, R, NR
Competition | 3 | High, Medium, Low
Star value | 3 | High, Medium, Low
Genre | 10 | Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects | 3 | High, Medium, Low
Sequel | 2 | Yes, No
Number of screens | 1 | Positive integer
Application Case 4.6 (4 of 5)
The DM Process Map in IBM SPSS Modeler
[Figure: process map with a Model Development process and a Model Assessment process]
Application Case 4.6 (5 of 5)
*Training set: 1998–2005 movies; test set: 2006 movies
Table 4.6 Data Mining Myths
Myth | Reality
Data mining provides instant, crystal-ball-like predictions. | Data mining is a multistep process that requires deliberate, proactive design and use.
Data mining is not yet viable for mainstream business applications. | The current state of the art is ready to go for almost any business type and/or size.
Data mining requires a separate, dedicated database. | Because of the advances in database technology, a dedicated database is not required.
Only those with advanced degrees can do data mining. | Newer Web-based tools enable managers of all educational levels to do data mining.
Data mining is only for large firms that have lots of customer data. | If the data accurately reflect the business or its customers, any company can use data mining.
Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data
mining is and what it really can/cannot do
3. Beginning without the end in mind
4. Not leaving sufficient time for data acquisition,
selection, and preparation
5. Looking only at aggregated results and not at
individual records/predictions
6. … 10 more mistakes… in your book
