Business Intelligence and Databases - Kopie
DW development approaches
o Inmon Model:
EDW approach
Top-down
Starts with an ERD
o Kimball Model:
Data Mart approach
Bottom-up
“Plan big, build small” with one Data Mart built at a time
Representation of Data in DW
o Dimensional modelling to support high-volume query access
o Star schema: the most commonly used and the simplest style of dimensional
modeling
Contains a fact table surrounded by and connected to several
dimension tables
Fact table contains the data we want to include in reports, aggregated
based on values from dimension tables
Dimension tables contain classification and aggregation information
about the values in the fact table
o Snowflake schema: an extension of star schema where the diagram
resembles a snowflake in shape
Dimension tables branch out to other dimension tables
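The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (fact_sales, dim_product, dim_date), the columns and the rows are all made up and do not come from the lecture.

```python
import sqlite3

# Hypothetical star schema: one fact table (fact_sales) connected to two
# dimension tables (dim_product, dim_date). All names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Office Suite', 'Software');
INSERT INTO dim_date VALUES (1, 2023, 12), (2, 2024, 1);
INSERT INTO fact_sales VALUES (1, 1, 1000.0), (2, 1, 20.0), (3, 2, 150.0), (1, 2, 1200.0);
""")

# The measure in the fact table is aggregated by attributes that live in
# the dimension tables (category and year).
sales_by_category_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(sales_by_category_year)
```

In a snowflake variant of this sketch, category would move out of dim_product into its own dimension table that dim_product references.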
Analysis of data in DW
o Online analytical processing (OLAP)
Designed for effective and efficient ad-hoc analysis
OLAP Operations:
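Drill-down: opposite of roll-up; descending the hierarchy or adding
dimensions, e.g. from medals per country to medals per city
Roll-up: climbing the hierarchy or reducing the dimensions,
e.g. from medals per city to medals per country
Slice: selection of a single value on one dimension of the cube
Dice: selection of values on two or more dimensions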
Pivot: Rotates the data axis to view it from different
perspectives
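The OLAP operations above can be tried on a tiny cube held in plain Python. This is a rough sketch, not how an OLAP engine works internally; the countries, cities, years and medal counts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells: (country, city, year, medals). Values are invented.
cells = [
    ("Netherlands", "Amsterdam", 2020, 5),
    ("Netherlands", "Utrecht", 2020, 3),
    ("Netherlands", "Amsterdam", 2024, 4),
    ("Germany", "Berlin", 2020, 6),
    ("Germany", "Munich", 2024, 2),
]

def aggregate(rows, key):
    """Sum the medals of all rows that share the same key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-down level: medals per city (lower level of the location hierarchy).
per_city = aggregate(cells, key=lambda r: r[1])

# Roll-up: climb the hierarchy from city to country.
per_country = aggregate(cells, key=lambda r: r[0])

# Slice: fix a single value on one dimension (year = 2020).
slice_2020 = [r for r in cells if r[2] == 2020]

# Dice: restrict two dimensions at once (country and year).
dice = [r for r in cells if r[0] == "Netherlands" and r[2] == 2020]

# Pivot: put years on the rows and countries on the columns.
pivot = defaultdict(dict)
for country, _city, year, medals in cells:
    pivot[year][country] = pivot[year].get(country, 0) + medals

print(per_city, per_country, dict(pivot), sep="\n")
```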
Lecture 3: Introduction to SQL
SQL: Structured Query Language
In a relational database, data is stored in a set of tables that can be connected with
each other.
Every SQL statement ends with a semicolon. Don’t forget it.
SQL keywords are case-insensitive (whether table and column names are depends on the
DBMS). We usually capitalize keywords of the language for better readability, but
“cReaTe DaTabAsE comPanY;” would work just as well.
We use the same color coding as most editors to highlight keywords, but your DBMS
does not care about colors.
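The remark on case-insensitive keywords can be checked quickly with Python's built-in sqlite3 module. Two assumptions to keep in mind: SQLite has no CREATE DATABASE statement, so CREATE TABLE stands in for the example in the notes, and SQLite happens to treat table names case-insensitively too, which other DBMS may not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Mixed-case keywords are accepted just like upper-case ones.
conn.execute("cReaTe TaBlE comPanY (id INTEGER, name TEXT);")
conn.execute("insert INTO company VALUES (1, 'Acme');")

# SQLite also matches the table name regardless of case.
rows = conn.execute("SeLeCt name FrOm COMPANY;").fetchall()
print(rows)
```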
Data types and their names may differ between RDBMS
However, the most important basic types work across all systems, some of them are:
o Numbers:
INT: Integer [-2,147,483,648 to 2,147,483,647]
TINYINT: Integer [-128 to 127]
FLOAT: Floating-point [-3.40282347 x 10^38 to 3.40282347 x 10^38]
o Text:
VARCHAR(N): String with maximum length N [0-65,535]
CHAR(N): String with exact length N [0-255]
o Dates:
DATE: YYYY-MM-DD
DATETIME: YYYY-MM-DD hh:mm:ss
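The numeric ranges listed above follow from the storage size: an N-bit signed integer covers -2^(N-1) to 2^(N-1) - 1. A small check, assuming INT takes 4 bytes, TINYINT 1 byte and FLOAT is an IEEE 754 single-precision number (sizes can differ between RDBMS):

```python
import struct

def signed_range(bits):
    """Smallest and largest value of a signed integer with the given bit width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

int_range = signed_range(32)     # assumption: INT is stored in 4 bytes
tinyint_range = signed_range(8)  # assumption: TINYINT is stored in 1 byte

# Largest finite single-precision float has the bit pattern 0x7F7FFFFF.
float_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

print(int_range, tinyint_range, float_max)
```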
Declaring a column as a primary key puts a constraint on this column: it limits which
data can be inserted and how it can be manipulated. Specifically:
o it cannot be null and
o must be unique.
A foreign key prevents you from (accidentally) changing a primary key or deleting an
entry with a primary key that is referenced as a foreign key in your database.
There are a number of additional constraints in SQL:
o NOT NULL
NULL is a universal (w.r.t. data type) marker for an “empty” field;
NOT NULL forbids it in the column
o UNIQUE
o DEFAULT value
o CHECK (condition)
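The constraints above can be seen in action with Python's built-in sqlite3 module. This is a sketch under assumptions: the department and employee tables are invented, and SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, which other DBMS do not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL DEFAULT 0 CHECK (salary >= 0),
    dept_id INTEGER REFERENCES department(dept_id)
);
INSERT INTO department VALUES (1, 'Sales');
INSERT INTO employee (emp_id, name, dept_id) VALUES (10, 'Ada', 1);
""")

def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

results = {
    "duplicate primary key": violates("INSERT INTO employee (emp_id, name) VALUES (10, 'Bob')"),
    "NOT NULL": violates("INSERT INTO employee (emp_id, name) VALUES (11, NULL)"),
    "UNIQUE": violates("INSERT INTO department VALUES (2, 'Sales')"),
    "CHECK": violates("INSERT INTO employee (emp_id, name, salary) VALUES (12, 'Cy', -5)"),
    "foreign key": violates("DELETE FROM department WHERE dept_id = 1"),
}

# DEFAULT: Ada was inserted without a salary, so the default value is stored.
default_salary = conn.execute("SELECT salary FROM employee WHERE emp_id = 10").fetchone()[0]
print(results, default_salary)
```

Every entry in results should be True, because each statement breaks exactly one constraint and is rejected.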
Lecture 4: Data Warehousing
BI systems rely on a DW as the information source for creating insight to support
managerial decisions
DW is a collection of integrated, subject-oriented databases designed to support
decision-making functions, where each unit of data is cleansed, in a standardized
format, non-volatile and relevant to some moment in time
Characteristics DW:
o Subject-oriented: organized by subject (e.g.: sales, products) and contains
only relevant information for decision-making
o Integrated: places information from different sources in a consistent format
while dealing with naming conflicts and discrepancies
o Time variant (time series): maintains historical data which can be used for
forecasting and comparisons (must contain date/time)
o Non-volatile: once data is entered in a DW it cannot be changed (any change
is recorded as new data)
o Web based
o Relational/Multidimensional
o Client/Server architectures
needs an internet connection and a web browser
used to manage the inflow and outflow of data between client and
server
o Real-time
o Include Metadata
A Data Cube allows data to be viewed in multiple dimensions
Metadata = data about data
o in a data warehouse, metadata describe the contents of a data warehouse
and the manner of its acquisition and use
o types of metadata
Descriptive metadata
Adds information about who created a resource
Describes what the resource is about and what it includes
Structural metadata
Includes additional data about the way data elements are
organized
Their relationships and the structure they exist in
Administrative metadata
Provides information about the origin of resources
Their types and access rights
Types of DW
o Enterprise Data Warehouse
Large-scale DW used across the organization for decision-support
Integrates data from multiple sources into a standard format
o Data Mart
Small and stores only relevant information for a specific subject or
department
Dependent data mart: subset of data directly from the EDW
Independent data mart: small warehouse with data not from the EDW
o Operational Data Store
Intermediary staging area for a DW
Can be updated throughout the course of business operations, unlike
the static nature of EDW
Used for short-term decision-making since it stores only very recent
data
Data warehouse framework:
o Data sources: independent systems or external providers
o ETL: process to extract, transform, and load data into a DW
o API/Middleware tools: enable access to the DW for SQL queries, analysis,
dashboarding and reporting
Which architecture is the best?
o Which Database Management System should be used?
o Will parallel processing/partitioning be needed (scalability/speed)?
o Will migration tools be used to load the DW?
o What tools will be used to support data retrieval and analysis?
Extract Transform Load (ETL) process
o Extraction: reading data from one or more databases
o Transformation: converting extracted data into the form it needs to have in the
DW
o Load: putting the data into the DW
o It is key for integrating data from multiple sources into the DW
o In case of low-quality data (incomplete, not relevant, inconsistent, etc.), data
preprocessing, such as formatting, fixing, filtering (cleansing) is required
o Criteria for selecting ETL tools
Ability to read from and write to an unlimited number of data
sources/architectures
Automatic capturing and delivery of metadata
A history of conforming to open standards
An easy-to-use interface for the developer and the functional user
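The extract, transform and load steps described above can be sketched in a few lines of Python. Everything here is a made-up illustration: the two source systems, their field formats and the sales table are assumptions, and a real ETL tool does far more (metadata capture, scheduling, error handling).

```python
import sqlite3

# Extract: rows as they might arrive from two hypothetical source systems,
# with inconsistent formats and one incomplete record.
source_a = [{"customer": " Alice ", "amount": "100.50", "date": "2024-01-05"}]
source_b = [
    {"customer": "BOB", "amount": "20", "date": "05/01/2024"},  # DD/MM/YYYY
    {"customer": "", "amount": "7", "date": "06/01/2024"},      # missing name
]

def transform(row, day_first):
    """Cleanse one row into the warehouse format, or return None to filter it out."""
    name = row["customer"].strip().title()
    if not name:
        return None  # incomplete record: dropped during cleansing
    date = row["date"]
    if day_first:
        day, month, year = date.split("/")
        date = f"{year}-{month}-{day}"  # standardize to YYYY-MM-DD
    return (name, float(row["amount"]), date)

# Transform: apply the format rules of each source, then filter rejected rows.
cleaned = [transform(r, day_first=False) for r in source_a]
cleaned += [transform(r, day_first=True) for r in source_b]
cleaned = [r for r in cleaned if r is not None]

# Load: put the standardized rows into the (in-memory) warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
loaded = dw.execute(
    "SELECT customer, amount, sale_date FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```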
Lecture 5: BPM
Business Performance Management (BPM) = An integrated set of processes,
methodologies, metrics and applications designed to drive the overall financial and
operational performance of an organization
BPM helps organizations with:
o Translating strategies and objectives into plans
o Monitoring performance against strategic plans of the company
o Analyzing variations between actual and planned results
o Adjusting objectives and actions in response to the performed analysis
BPM components:
o A set of integrated management and analytic processes, supported by
technology
o Tools for businesses to define strategic goals and the associated key
performance indicators (KPIs), e.g. Balanced Scorecard
o Performance Measurement System including methods and tools for
monitoring KPIs, e.g. BI dashboards
Closed loop process = Links strategy to execution to optimize business performance
o Step 1: strategize – where do we want to go?
Strategic plan: Is a map that details a course of action for moving the
organization from its current state to its future vision
o Step 2: plan – How do we get there?
Operational plan: Translates the strategic objectives and goals into a
set of well-defined tactics, resource requirements, and expected
results for a future time period (e.g. for a year interval)
o Step 3: monitor – How are we doing?
A comprehensive framework for monitoring performance should
address two key issues:
What to monitor (Goals, KPI, etc.)
How to monitor
o Step 4: act/adjust – what do we need to do differently?
Success (or mere survival) depends on reacting to the findings:
Creating new products
Entering new markets
Acquiring new customers/businesses
Streamlining processes
How to act/adjust
Find facts about the problems/bottlenecks using performance
measurement techniques
Analyze the causes of bottlenecks
o Assignment of resources,
o Completion times,
o Resource utilization, etc.
Set priorities and assign a problem owner or adjust the
strategy
Key performance indicator (KPI): a metric that measures performance
against a goal
Outcome Metrics: KPIs focused on financial performance
Operational Metrics: KPIs focused on measuring operational activities and
performance
Characteristics of KPIs
o Embody a strategic objective
o Measure performance against a target
o Targets have performance ranges
o Ranges are encoded in software for visual display (green, red, yellow, etc.)
o Targets are assigned a completion time frame
o Targets are measured against a baseline or benchmark
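The idea of targets with performance ranges that are encoded for visual display can be sketched as a small function. The 90% warning threshold and the higher-is-better assumption are hypothetical choices for illustration, not values from the lecture.

```python
def kpi_status(actual, target, warning_ratio=0.9):
    """Classify a higher-is-better KPI against its target.

    warning_ratio is a hypothetical threshold: results between
    warning_ratio * target and target are displayed as yellow.
    """
    if actual >= target:
        return "green"
    if actual >= warning_ratio * target:
        return "yellow"
    return "red"

def variance_pct(actual, target):
    """Deviation of the actual result from the target, in percent of the target."""
    return (actual - target) / target * 100

# Example: a customer retention target of 100 (index points, invented numbers).
for actual in (105, 95, 80):
    print(actual, kpi_status(actual, 100), variance_pct(actual, 100))
```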
KPI typology
o Outcome KPIs: measure performance in outputs
o Driver KPIs (also called leading KPIs): measure activities that have significant
impact on outcome KPIs
o Operational KPIs: focus on operational areas dealing with operational
activities and performance
Operational metric KPI examples:
o Customer performance
e.g. customer satisfaction, customer retention, speed of issue
resolution, response time
o Service performance
service renewal rate, return rates, response time
o Process performance
completion time, throughput, defect/number of products
o Sales plan/forecast
order-to-fulfilment ratio, total closed contracts, etc.
Good KPIs should:
o Be focused on key factors
o Balance the needs of all stakeholders (shareholders, employees, partners,
suppliers)
o Have realistic targets
o Be measurable and time-framed
How to derive/formulate KPIs: SMART
o Specific – KPIs should measure the areas that have the greatest impact on
your business performance
o Measurable – Ensure that your KPIs can be identified and tracked
o Achievable – The KPIs should be realistic
o Relevant – KPIs should link to overall strategic goals and objectives of a
business
o Time-framed – KPIs should have relevant data (e.g. timestamps) and/or be
captured systematically, so that they can be measured for specific time
intervals relevant for a business goal.
BPM methodologies
o Balanced Scorecard (BSC): Performance measurement and management
methodology that helps translate an organization’s financial, customer,
internal process, and learning and growth objectives and targets into a set of
actionable initiatives
o Six Sigma: Performance management methodology aimed at reducing the
number of defects in a business process to as close to zero defects per million
opportunities (DPMO) as possible
o DMAIC performance model
Define the project goals and customer (internal and external)
deliverables
Measure the process to determine current performance
Analyze and determine the root cause(s) of the defects
Improve the process by eliminating defects
Control future process performance
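The DPMO measure used by Six Sigma is commonly computed as defects divided by the total number of defect opportunities, scaled to one million. A minimal sketch; the function name and the example numbers are illustrative only.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities.

    DPMO = defects * 1,000,000 / (units * opportunities per unit)
    """
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("units and opportunities_per_unit must be positive")
    return defects * 1_000_000 / total_opportunities

# Invented example: 15 defects found in 1,000 orders, each order having
# 5 opportunities for a defect (5,000 opportunities in total).
example_dpmo = dpmo(defects=15, units=1000, opportunities_per_unit=5)
print(example_dpmo)
```

In DMAIC terms, a number like this belongs to the Measure step (current performance) and is tracked again in the Control step.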
Visualization per aim
o Trend: Column or Line
o Comparison: Area, Bar, Bullet, Column, Line, or Scatter
o Relationship: Line or Scatter
o Distribution: Bar, Boxplot, or Column
o Composition: Donut, Pie, Stacked Bar, or Stacked Column
o Process / sequence: process discovery maps, Line / Dotted chart
Dashboard vs. Scorecard
o Dashboard:
Monitor operational performance
Free form (any measures)
o Scorecard:
Chart progress against strategic and tactical goals and targets
Predetermined measures
Lecture 6: Data Mining
Queries are precise requests that search the relational database, fetch information,
and display it
Processing of data: Collect, process, store, retrieve and distribute information
Data mining = Discovering or “mining” knowledge from large amounts of data
o Data mining seeks to identify four major patterns:
Predictions: future occurrences of certain events
Classification learns patterns from past data in order to place
new instances (with unknown labels) into their respective
group
Decision trees recursively divide a training set until each
division consists entirely or primarily of examples from one
class
Clusters: grouping of things based on known features
Association: things that commonly co-occur
Association rule mining aims to find interesting relationships
between variables (items) in large databases
Apriori algorithm finds subsets that are common to at least a
minimum number of item sets
Sequential relationships: time-ordered events
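The Apriori idea mentioned under association rule mining can be sketched in plain Python: count itemsets level by level and only extend itemsets that already meet the minimum support. This is a simplified teaching version, not an optimized implementation, and the shopping baskets are invented.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count in how many transactions each candidate itemset occurs.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent = apriori(baskets, min_support=3)

# Confidence of the rule diapers -> beer: support(both) / support(diapers).
confidence = frequent[frozenset({"diapers", "beer"})] / frequent[frozenset({"diapers"})]
print(len(frequent), confidence)
```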
Supervised learning: algorithms require a training data set that includes both
independent and dependent variables
Unsupervised learning: algorithms only require independent variables
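The difference between the two settings can be shown on a toy dataset. The supervised part below is a one-level decision tree (a single split chosen with the help of the labels); the unsupervised part is a two-centroid clustering that never sees the labels. The data, the function names and the choice of these two simple techniques are assumptions made for illustration.

```python
# Hypothetical data: one independent variable (age) and one dependent
# variable (bought: 1 = yes, 0 = no).
ages = [22, 25, 27, 45, 52, 58]
bought = [0, 0, 0, 1, 1, 1]

def best_threshold(xs, ys):
    """Supervised: find the split 'x > t means class 1' with the fewest errors."""
    best = None
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

def two_means(xs, iterations=10):
    """Unsupervised: group values around two centroids, using no labels.

    Assumes xs contains at least two distinct values.
    """
    c1, c2 = min(xs), max(xs)
    for _ in range(iterations):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

threshold, errors = best_threshold(ages, bought)  # needs ages AND labels
young, old = two_means(ages)                      # needs ages only
print(threshold, errors, young, old)
```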
Hypothesis-driven data mining: starts with a proposition made by the user, who seeks
to validate its truthfulness
Discovery-driven data mining: finds patterns, associations and other relationships
that are hidden in the dataset