Data Mining and
Business Intelligence
1st Lecture
Iraklis Varlamis
1
About the course
•Lectures: Monday 5:00-7:30
(+visits to the lab at the 2nd floor)
•Office: 5.1
•Email:[email protected]
•Eclass
• http://eclass.hua.gr/courses/DIT161/
2
Reading
• Tan, Steinbach & Kumar. Introduction to data
mining, Addison-Wesley.
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
• Jiawei Han, Micheline Kamber, and Jian Pei.
Data mining – Concepts and techniques (3rd
edition)
These presentations are based on the book and
on the presentations of Jiawei Han
http://www.cs.uiuc.edu/~hanj/
3
Course Outline
Organize and Manage Data
• 1st: Basic concepts of DM & BI. Solutions and Architectures. The
stages of DM process. Application examples.
• 2nd: The process of Data Mining. Classification techniques and
algorithms. Evaluation techniques and metrics.
• 3rd: Data preparation. Dimensionality reduction. Regression.
Core Data Mining Techniques
• 4th: Classification techniques. Practical application of classification
techniques (Lab)
...
• 9th: Clustering techniques and algorithms. Evaluation metrics.
Cluster description.
• 10th: Association rules extraction. Techniques and algorithms.
• 11th: Data warehouses. Data quality. Cubes and multidimensional
data analysis. Concept hierarchies and data projection in
dimensions.
4
Course Outline
Case studies
• 5th: Introduction to Graph/Network Mining
• 6th: Measuring networks and the random graph model.
• 7th: A graph processing library
• 8th: Social Recommender systems
...
• 12th: Presentation of assignments
5
Grading system
• What is graded:
– Final written exam: 60%
– Group assignment: 40% [compulsory]
• Interim report
• Final presentation and documentation
• Written exams: with lecture notes and open
books
6
Definitions
7
Definition and concepts
• Business Intelligence (BI) refers to applications and
technologies accessing the appropriate data and
information in order to make the correct business decision
at the correct moment.
• Two types of BI Systems:
– Those that provide data analysis tools
• Multidimensional data analysis (or online analytical
processing)
• Data mining
• Decision support systems
– Those that provide information in structured format
• Dashboards
8
9
Multidimensional Data Analysis
• Multidimensional analysis provides users with an
excellent view of what is happening or what has
happened.
• Allows users to analyze data in such a way that
they can quickly answer business questions
• To accomplish this multidimensional analysis
tools allow users to “slice and dice” the data in
any desired way.
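A minimal sketch of what "slice and dice" means in practice, assuming a tiny hypothetical sales table with three dimensions (year, region, product). The data, the dimension names and the helper functions are illustrative assumptions, not part of any specific OLAP tool.

```python
from collections import defaultdict

# Hypothetical fact table: (year, region, product, sales amount)
SALES = [
    (2023, "North", "Laptop", 120),
    (2023, "North", "Phone", 80),
    (2023, "South", "Laptop", 60),
    (2024, "North", "Laptop", 150),
    (2024, "South", "Phone", 90),
    (2024, "South", "Laptop", 70),
]
# Position of each dimension inside a row
DIMENSIONS = {"year": 0, "region": 1, "product": 2}


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    idx = DIMENSIONS[dimension]
    return [r for r in rows if r[idx] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose values fall in the allowed set of each dimension."""
    return [
        r for r in rows
        if all(r[DIMENSIONS[d]] in values for d, values in allowed.items())
    ]


def roll_up(rows, *dims):
    """Aggregate the sales amount over the requested dimensions."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[DIMENSIONS[d]] for d in dims)
        totals[key] += r[3]
    return dict(totals)


# "How much did each region sell in 2024?"
by_region_2024 = roll_up(slice_cube(SALES, "year", 2024), "region")

# "How did laptop sales in the North evolve per year?"
north_laptops = roll_up(
    dice_cube(SALES, region={"North"}, product={"Laptop"}), "year"
)
print(by_region_2024)
print(north_laptops)
```

Each business question becomes a combination of a filter (slice or dice) and an aggregation (roll-up) over the remaining dimensions.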
10
Data mining
• Searching for valuable business information in a
large database or data warehouse
• Data mining performs two basic operations:
– Predicting trends and behaviors
– Identifying previously unknown patterns and
relationships
• Data mining: The process that combines
techniques from statistics, artificial intelligence
and machine learning in order to process data
and extract implicit, non-obvious, interesting and
potentially useful knowledge that can support
decision making
11
Decision support systems
• Decision support systems
• DSS capabilities
– Sensitivity analysis
– What-if analysis
– Goal-seeking analysis
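The three DSS capabilities can be illustrated on a toy profit model. This is only a sketch: the profit formula, the cost figures and all function names are assumptions made for the example.

```python
def profit(price, units, unit_cost=6.0, fixed_cost=1000.0):
    """Hypothetical business model: margin per unit times units, minus fixed costs."""
    return (price - unit_cost) * units - fixed_cost


def what_if(base, **changes):
    """What-if analysis: recompute the outcome after changing some inputs."""
    scenario = {**base, **changes}
    return profit(**scenario)


def sensitivity(base, name, step=0.10):
    """Sensitivity analysis: relative change in profit when one input grows by `step`."""
    base_profit = profit(**base)
    bumped = what_if(base, **{name: base[name] * (1 + step)})
    return (bumped - base_profit) / abs(base_profit)


def goal_seek(base, name, target, low, high, tol=1e-6):
    """Goal-seeking analysis: find the input value that reaches a target profit.

    Uses bisection and assumes profit increases with `name` on [low, high].
    """
    while high - low > tol:
        mid = (low + high) / 2
        if what_if(base, **{name: mid}) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


base = {"price": 10.0, "units": 500}
print("Base profit:", profit(**base))
print("What if the price is 11?", what_if(base, price=11.0))
print("Profit change for +10% price:", sensitivity(base, "price"))
print("Units needed for a profit of 2000:", goal_seek(base, "units", 2000.0, 0, 10_000))
```

What-if analysis changes inputs and observes the output, sensitivity analysis measures how strongly the output reacts to one input, and goal-seeking works backwards from a desired output to the required input.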
12
Digital Dashboards
• Dashboards:
– Provide rapid access to timely information.
– Provide direct access to management reports.
– Are very user-friendly and supported by graphics.
13
The management cockpit
• A strategic management room that enables top-level
decision makers to pilot their businesses better
• The environment encourages more efficient management
meetings and boosts team performance via effective
communication
• Key performance indicators and information relating to
critical success factors are displayed graphically on the
walls of the meeting room
• External information can be easily imported to the room to
allow competitive analysis
14
From Data to Knowledge
15
Evolution of sciences
• Before 1600, empirical science
• 1600-1950, theoretical science
– Dominated by theoretical models, which often motivate
experiments for better understanding the world
• 1950-1990, computational science
– We try to understand complex mathematical models through
simulation
• 1990-now, data science
– Abundance of data
– Ability to process huge data sets
– Data resources interconnection (through internet)
– The need for collection, management, querying and
visualization of data increases with the volume of data
16
The Data Gap
• Usually information is hidden in data. This
information is not obvious
• Analysts need weeks to locate it through
hypothesis testing
• Most of the data are never analysed
• We have the data! Now what?
…and we don’t know what to do with
them.
17
Why do we need data analysts
• Explosive data growth (from terabytes to
petabytes)
• Automated data collection
• Abundant sources
– Businesses : web, e-transactions, stock market
– Sciences: sensors, bioinformatics, scientific
experiments
– Society: news, digital cameras, social networks
“We are drowning in Data but starving for
Knowledge”. John Naisbitt in “Megatrends”.
• We need automated analysis of large data sets
18
What else is data mining
• Knowledge Discovery in Databases
• Knowledge extraction
• Data and models analysis
• Data surveying over time and in varying degrees
of detail
• Collection of information and creation of business
intelligence
• NOT data search
• NOT query processing
• NOT a smart system that reacts to the rule base
19
Business intelligence (BI)
Source: http://decision-quality.com/
20
Stages of BI creation
Collection Storage Analysis Delivery
“Business Intelligence-The Missing Link.” http://www.ittoolbox.com/peer/bi.pdf, Viewed 4/12/2006.
21
It is not simple
Layers of BI, from the top (highest potential to support business decisions) to the bottom:
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
22
What is Business Intelligence?
1. Tables and charts
– Tables (scorecards) present the key performance indicators
(e.g. ROI)
– Charts (dashboards) present the performance in a condensed
and simple “dials and gauges” format
http://www.microstrategy.com/Solutions/5Styles/
2. Company reports
– Extended reports, adapted to the needs of each user
group
3. Data analytics (OLAP: On-line Analytical Processing)
– Associates data subsets (e.g. temporal data,
customer data, income data) in a multidimensional
analysis (“cube” analysis)
– Selectively provides access to the initial (raw) data
4. Composite analysis and prediction
– Training and evaluation of data mining techniques on
past data.
– Application to new data and prognostics
(i.e. predictions on how data will evolve, what-if
scenarios, etc.)
5. Alerts
– Automatic generation of reports and notifications about
problems and opportunities
27
Market interest
• Income from BI software development
• Companies that
develop BI software
• http://apandre.wordpress.com/market/
28
Scientific interest – Big Data
• Volume
– Scalable algorithms
• Variety
– Multidimensional data, e.g. DNA microarray data
contain tens of thousands of features
– Spatial, spatiotemporal data, time-series data
– Web data, multimedia
– Graphs and hypergraphs in social networks
• Velocity
– Data streams, sensor data
• Veracity
– We are not always sure about data accuracy (e.g. GPS
data)
VALUE
29
Data mining for BI
30
Data mining steps
Databases → Data Cleaning and Data Integration → Data Warehouse →
Selection → Task-relevant Data → Data Mining → Pattern Evaluation
31
Main principles
• Select useful data
• Determine the format of the extracted
knowledge (rules, predicted values,
groupings etc)
• Capture the domain knowledge (e.g. in
conceptual hierarchies or ontologies)
• Determine metrics for the evaluation of
found patterns (simplicity, certainty, utility,
innovation)
• Visualize the results (interactive, abstract; choose
the visualization depending on the knowledge format)
32
Example - Bank
• Business objective: Give residential loans that
can be paid back
• Existing knowledge:
– Clients with children studying use these loans to pay
tuition fees
– Customers with variable income use these loans to
offset their income
• Lots of data:
– Large stores that continuously collect data
from multiple active sources (data warehouse)
33
Sampling
• Choose a subset of the customers who have
been granted a loan in the past
– Some paid it back
– Others not
34
Pattern mining
• Find the rules that predict whether a customer will
be able to pay the loan back
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
• Group customers and describe each group with
its predominant characteristics
– Among the many groups that have no special meaning,
we find a group of customers that take a loan using their
payroll or savings account
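The IF/THEN rule above can be written as a small predicate and checked against a sample of past loans. This is a sketch only: the customer records are invented, and "coverage" and "confidence" are used here as simple illustrative quality measures for a rule.

```python
def rule_predicts_repayment(customer):
    """IF salary < 40k AND numChildren > 0 AND 18 < ageChild1 < 22 THEN YES."""
    age = customer.get("ageChild1")
    return (
        customer["salary"] < 40_000
        and customer["numChildren"] > 0
        and age is not None
        and 18 < age < 22
    )


# Hypothetical sample of customers who were granted a loan in the past
PAST_LOANS = [
    {"salary": 32_000, "numChildren": 1, "ageChild1": 19, "repaid": True},
    {"salary": 38_000, "numChildren": 2, "ageChild1": 21, "repaid": True},
    {"salary": 35_000, "numChildren": 1, "ageChild1": 20, "repaid": False},
    {"salary": 55_000, "numChildren": 1, "ageChild1": 19, "repaid": True},
    {"salary": 30_000, "numChildren": 0, "ageChild1": None, "repaid": False},
]


def rule_quality(loans):
    """Return (coverage, confidence) of the rule on a labelled sample.

    coverage   = share of customers the rule applies to
    confidence = share of covered customers who actually repaid
    """
    covered = [c for c in loans if rule_predicts_repayment(c)]
    coverage = len(covered) / len(loans)
    confidence = sum(c["repaid"] for c in covered) / len(covered) if covered else 0.0
    return coverage, confidence


coverage, confidence = rule_quality(PAST_LOANS)
print(f"coverage={coverage:.2f}, confidence={confidence:.2f}")
```

A rule is only useful if it covers enough customers and is right often enough on them, which is why discovered patterns are always evaluated before they support a decision.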
35
Common architecture for DW and DM
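Four layers, from top to bottom:
• Layer 4, User Interface: User GUI API (receives the mining query, returns the mining result)
• Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API
• Layer 2, MDDB: multidimensional database and Meta Data, built through Filtering & Integration on top of the Database API
• Layer 1, Data Repository: Databases and Data Warehouse, fed by data cleaning and data integration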
36
Discussion
• Briefly describe the following, emphasizing
the data they contain, their differences
and their possible applications:
– object-relational databases,
– spatial databases,
– text databases,
– multimedia databases,
– World Wide Web
37
Discussion
• Describe the following data mining
concepts
– association and correlation analysis,
classification, prediction, clustering, data
evolution analysis, characterization,
discrimination.
• What are the scientific challenges of
data mining in:
– Data streams, spatio-temporal data,
bioinformatics?
38
Data mining tasks (1)
• Characterization and Discrimination
– Generalize, summarize, compare and contrast features of the
data, e.g. given data collected from meteorological sensors, how
can I distinguish dry from wet areas of the country?
• Association, correlation vs causation
– Correlation does not imply causation. When ice cream sales increase,
the number of drownings increases too
• Classification and prediction
– Models (functions) that describe and distinguish classes or concepts
for future prediction,
e.g. prediction of unknown or missing values
– Classification of countries based on climate
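The ice cream and drowning example can be reproduced with made-up numbers. In this sketch both series are generated from a third variable (temperature), so they correlate strongly although neither causes the other. All figures are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented monthly average temperatures (the hidden common cause)
temperature = [5, 8, 12, 18, 24, 30, 33, 32, 26, 18, 11, 6]

# Both series depend on temperature, not on each other
ice_cream_sales = [20 + 3 * t for t in temperature]
drownings = [1 + t // 6 for t in temperature]

r = pearson(ice_cream_sales, drownings)
print(f"correlation(ice cream, drownings) = {r:.2f}")
```

The correlation is high, yet banning ice cream would not prevent drownings: a model that finds an association says nothing by itself about the direction or existence of a causal link.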
39
Data mining tasks (2)
• Cluster analysis
– Group samples into new, previously unknown groups, e.g. group homes
that are for sale and study the characteristics of the groups
– Aim to maximize the similarity within groups and the dissimilarity
between groups
• Outlier analysis
– Outliers are samples that behave completely differently from
all other samples
– Is it noise, an error or an exception?
• Trend analysis
– Trends and deviations: e.g. regression analysis (find a
function that describes the data, then find the data that deviate far from
it)
– Mining sequential patterns: e.g. searching for cameras →
searching for a memory card
– periodicity analysis
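Regression analysis as described under trend analysis can be sketched in a few lines: fit a straight line with ordinary least squares, then flag the points that lie far from it. The monthly sales values and the threshold of two standard deviations are assumptions chosen for the example.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def far_from_trend(xs, ys, k=2.0):
    """Return the points whose residual exceeds k times the residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = sqrt(sum(r * r for r in residuals) / len(residuals))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * spread]


# Invented monthly sales with a steady upward trend and one unusual month
months = list(range(1, 11))
sales = [10, 12, 14, 16, 18, 45, 22, 24, 26, 28]

slope, intercept = fit_line(months, sales)
print(f"trend: sales = {slope:.2f} * month + {intercept:.2f}")
print("deviating points:", far_from_trend(months, sales))
```

Whether the flagged month is noise, an error or a genuine exception (e.g. a promotion) cannot be decided by the algorithm; that judgement needs domain knowledge.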
40
Discussion
• Give an example of data mining
usefulness in a business (or a sector in
general) that you are familiar with
• Describe the data, the knowledge to be
produced, the domain knowledge, the evaluation
measures and the visualizations
41
Market analysis
• Where the data come from:
– transactions with credit cards, coupons, customer
complaints, market research
• Targeted advertising
– find groups of customers with common characteristics:
interests, income, buying habits
– Identify buyers' patterns (over time)
• Cross market analysis: correlated products
(diapers, beer), bundle sales
• Analysis of customer needs
– What are the best products for each group of customers
– What factors attract new customers
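The "diapers, beer" case of cross market analysis is usually quantified with support, confidence and lift. The sketch below computes them on a handful of invented shopping baskets; the basket contents are assumptions for illustration only.

```python
# Hypothetical market baskets (one set of products per transaction)
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall (>1 means positive association)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


diapers, beer = {"diapers"}, {"beer"}
print("support(diapers, beer)     =", support(diapers | beer, BASKETS))
print("confidence(diapers → beer) =", confidence(diapers, beer, BASKETS))
print("lift(diapers → beer)       =", lift(diapers, beer, BASKETS))
```

A lift above 1 suggests that the two products are bought together more often than chance would predict, which is the kind of evidence used to design bundle sales.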
42