Data Mining & Warehousing 01
(INSY3041)
Chapter 1
Introduction to Data Mining and Warehousing
(DMW)
Compiled by: Mulat S. 6/1/20
What is data mining?
Data is growing at a phenomenal rate. At the same time, users expect more
sophisticated information.
A marketing manager is no longer satisfied with a simple listing of
marketing contacts, but wants detailed information about customers' past
purchasing behavior and predictions of future purchases.
Data mining steps in to meet such needs.
How? Data mining uncovers hidden patterns in a database.
Output of an ordinary database query: precise, and a subset of the database
Output of data mining: not a subset of the database; the output is some hidden (new, interesting, useful) patterns and knowledge in the database
Data + Interestingness criteria = Hidden patterns
Data Mining
Find all credit applicants who have no credit risks. (classification)
Identify customers with similar buying habits. (Clustering)
Find all items which are frequently purchased with Bread. (association rules)
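The bread example can be sketched in a few lines of Python. This is a minimal illustration, not a full association rule algorithm: the `transactions` data, the `frequent_with` helper, and the 0.5 confidence threshold are all assumptions made up for this sketch. For each item bought together with bread, it computes the share of bread baskets that also contain that item (the confidence of the rule bread -> item).

```python
from collections import Counter

# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

def frequent_with(item, baskets, min_confidence=0.5):
    """Return {other_item: confidence} for rules item -> other_item."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    # Count how often every other item shares a basket with `item`.
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }

rules = frequent_with("bread", transactions)
print(rules)
```

With this toy data, butter appears in 3 of the 4 bread baskets and jam in 2 of 4, so both pass the threshold, while milk (1 of 4) does not.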
Data warehouse
Enterprise + Memory
Contains more refined data and acts like ROM (we can add new data but cannot delete or update it)
An RDBMS
Responsible for the collection and storage of data to support management
decision making and problem solving
Data mart
A subset of a data warehouse for small and medium-size businesses or departments
within larger companies
Non-volatile: once data enter the data warehouse, they are never
removed, because the data in the warehouse represent the company's
entire history.
Because data are added all the time, the warehouse keeps growing.
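The non-volatile property can be illustrated with a toy append-only store. This is only a sketch under stated assumptions: the `AppendOnlyWarehouse` class and its `load` and `rows` methods are hypothetical names invented here, and a real warehouse enforces this through its loading process rather than through a Python class.

```python
from datetime import date

class AppendOnlyWarehouse:
    """Toy fact store: rows can be added and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, row):
        # Store a copy so later changes to the caller's dict do not alter history.
        self._rows.append(dict(row))

    def rows(self):
        # Hand out copies; the stored history itself stays untouched.
        return [dict(r) for r in self._rows]

    def __len__(self):
        return len(self._rows)

warehouse = AppendOnlyWarehouse()
warehouse.load({"day": date(2020, 1, 1), "product": "bread", "units": 12})
warehouse.load({"day": date(2020, 1, 2), "product": "bread", "units": 9})

snapshot = warehouse.rows()
snapshot[0]["units"] = 0  # changes only the copy, not the warehouse
print(len(warehouse), warehouse.rows()[0]["units"])  # prints: 2 12
```

The class deliberately offers no update or delete method, so the only way its size changes is by growing.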
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Modeling using algorithms
Cleansing, reduction, integration
Data mining process: phases, tasks, and outputs (CRISP-DM)

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
Layers, from top to bottom:
• Visualization Techniques
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
DM Process Example: Web Log
Web log: web access data recorded from either the client or the server perspective (found on the client PC,
browser, proxy…)
Selection:
Select log data (dates and locations) and the web access properties to use
Preprocessing: (select the target data, remove what is unnecessary)
Remove identifying URLs
Remove error logs
Transformation: (bring the data into a common format)
Sessionize (sort and group) the entries by time period
Data Mining: (find useful/interesting patterns in the data)
Identify and count patterns, then apply them to new web log data
Construct a data structure
Interpretation/Evaluation: (display, visualize)
Identify and display frequently accessed sequences.
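The web log steps above can be sketched end to end in Python. Everything concrete here is an assumption made for illustration: the `LOG` records, the record layout (client, timestamp, URL, HTTP status), the 30-minute idle gap used to split sessions, and the choice of consecutive page pairs as the "pattern". The step that removes identifying URLs is left out because the slides do not say how such URLs are recognized.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical server log records: (client, timestamp, url, HTTP status).
LOG = [
    ("10.0.0.1", "2020-01-06 09:00", "/home", 200),
    ("10.0.0.1", "2020-01-06 09:02", "/products", 200),
    ("10.0.0.1", "2020-01-06 09:03", "/missing", 404),
    ("10.0.0.1", "2020-01-06 09:05", "/cart", 200),
    ("10.0.0.2", "2020-01-06 10:00", "/home", 200),
    ("10.0.0.2", "2020-01-06 10:01", "/products", 200),
    ("10.0.0.1", "2020-01-06 15:00", "/home", 200),
    ("10.0.0.1", "2020-01-06 15:04", "/products", 200),
    ("10.0.0.3", "2019-12-31 23:00", "/home", 200),
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def select(records, start, end):
    """Selection: keep records whose timestamp falls in [start, end)."""
    return [r for r in records if start <= parse(r[1]) < end]

def preprocess(records):
    """Preprocessing: drop error entries (HTTP status 400 and above)."""
    return [r for r in records if r[3] < 400]

def sessionize(records, gap=timedelta(minutes=30)):
    """Transformation: sort by client and time, split on long idle gaps."""
    sessions = []
    last_client, last_time = None, None
    for client, ts, url, _ in sorted(records, key=lambda r: (r[0], parse(r[1]))):
        t = parse(ts)
        # Start a new session for a new client or after a long pause.
        if client != last_client or t - last_time > gap:
            sessions.append([])
        sessions[-1].append(url)
        last_client, last_time = client, t
    return sessions

def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    return Counter(pair for s in sessions for pair in zip(s, s[1:]))

selected = select(LOG, datetime(2020, 1, 1), datetime(2020, 2, 1))
sessions = sessionize(preprocess(selected))
pair_counts = count_pairs(sessions)

# Interpretation/Evaluation: display the most frequently accessed sequences.
for pair, n in pair_counts.most_common(2):
    print(" -> ".join(pair), n)
```

On this toy log the sequence /home -> /products occurs in all three sessions, which is the kind of frequently accessed sequence the last step is meant to surface.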
[Figure: roots of data mining. Recoverable labels:]
Hardware (sensors, storage, computation)
Relational Databases
AI, Pattern Recognition, Machine Learning
"Pencil and Paper", EDA, "Data Dredging", "Flexible Models"
Data Mining: Text Mining, Web Mining, Knowledge Mining, Spatial Mining
DATA MINING
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
HIGH PERFORMANCE
DM: Intersection of Many Fields
• Data mining overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
Machine Learning (ML)
Statistics (stats)
Data structure & algorithm analysis
Accuracy in classification
Analyze true positives, false positives, and false negatives to calculate the recall and precision of the system
Measure the percentage of correct classifications and misclassifications.
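The accuracy measures above can be computed directly from the confusion counts. The sketch below assumes binary labels with 1 as the positive class; the `classification_metrics` helper and the sample `actual` and `predicted` lists are hypothetical and exist only for illustration. It uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), accuracy = (TP + TN) / total.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("need at least one labelled example")
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted/actually positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here TP = 2, FP = 1, FN = 2, and TN = 5, so accuracy is 7/10, precision is 2/3, and recall is 2/4. The misclassification rate is simply 1 minus the accuracy.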
Space/Time complexity
Running time: how fast the algorithm runs
Storage or memory space requirement
Data Sources: Operational DBs
Extract, Transform, Load, Refresh
Data Storage: Data Warehouse, Data Marts
OLAP Engine: Server
Front-End Tools: Analysis, Query, Reports, Data mining
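The flow from data sources through extract, transform, and load into data storage can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the `operational_rows` data, the `sales` table, and the `extract`, `transform`, and `load` helper names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational database.
operational_rows = [
    {"order_id": 1, "customer": " Abebe ", "amount": "120.50", "day": "2020-01-05"},
    {"order_id": 2, "customer": "Sara", "amount": "80", "day": "2020-01-05"},
    {"order_id": 3, "customer": "abebe", "amount": "19.50", "day": "2020-01-06"},
]

def extract():
    """Extract: read rows from the source system (here, an in-memory list)."""
    return list(operational_rows)

def transform(rows):
    """Transform: normalise customer names and convert amounts to numbers."""
    return [
        (r["order_id"], r["customer"].strip().title(), float(r["amount"]), r["day"])
        for r in rows
    ]

def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# A front-end style query: total sales per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('Abebe', 140.0), ('Sara', 80.0)]
```

The transform step is what lets the two differently written "Abebe" entries be summed as one customer, which is the kind of cleaning a warehouse load is expected to do before front-end tools query the data.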
DM Tasks and Models
End!!!
Question ???