Unit 1.2 Layered Framework
LAYERED FRAMEWORK
Aditi S. Chikhalikar
Outline
Definition of Data Science Framework
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Homogeneous Ontology for Recursive Uniform Schema (HORUS)
The Top Layers of a Layered Framework
Layered Framework for High-Level Data Science and
Engineering
Definition of Data Science Framework
Data science is a series of discoveries.
You work toward an overall business strategy of
converting raw unstructured data from the data lake into
actionable business data.
This process is a cycle of discovering and evolving your
understanding of the data you are working with to supply
the metadata that you need.
You build a basic framework that you use for your data
processing.
This will enable you to construct a data science solution
and then easily transfer it to your data engineering
environments.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM was conceived in 1996 and, by 1997, was
extended via a European Union project under the ESPRIT
(European Strategic Programme for Research and
Development in Information Technology) funding initiative.
It had the majority support base among data scientists until
mid-2015. The website that was driving the Special
Interest Group disappeared on June 30, 2015, and has
since reopened. Since then, however, the methodology has been
losing ground to custom modeling methodologies.
The basic concept behind the process is still valid, but you
will find that most companies do not use it as is in their
projects; they employ some form of modification as an
internal standard.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Goals:
Encourage interoperable tools across the entire data mining process
Take the mystery/high-priced expertise out of simple data
mining tasks
CRISP-DM is a comprehensive data mining methodology
and process model that provides anyone—from novices
to data mining experts—with a complete blueprint for
conducting a data mining project.
CRISP-DM breaks down the life cycle of a data mining
project into six phases.
CRISP-DM: Phases
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for continuous mining of the data
[Figure: the six CRISP-DM phases and the iterative flow between them]
Phase 1--Business Understanding
Determine business objectives
Key persons and their roles? Internal sponsor (financial, domain expert).
Business units impacted by the project (sales, finance,...) ?
Business success criteria and who assesses it?
Users’ needs and expectations.
Describe problem in general terms.
Business questions, Expected benefits.
Assess situation
Are they already using data mining?
Identify hardware and software available.
Identify data sources and their types (online, experts, written documentation).
Determine data mining goals
Produce project plan
Define initial process plan; discuss its feasibility with involved personnel.
Estimate effort and resources needed; Identify critical steps.
Phase 2--Data Understanding:
Collect data
Describe data
Check data volume and examine its gross properties.
Accessibility and availability of attributes; attribute types, ranges, correlations,
identities.
Understand the meaning of each attribute and attribute value in business terms.
For each attribute, compute basic statistics (e.g., distribution, average, max, min,
standard deviation, variance, mode, skewness)
Explore data
Analyze properties of interesting attributes in detail
Distribution, relations between pairs or small numbers of attributes, properties of
significant sub-populations, simple statistical analyses.
Verify data quality
Does it cover all the cases required? Does it contain errors and how common are
they?
Identify missing attributes and blank fields. Meaning of missing data.
Do the meanings of attributes and contained values fit together?
Check spelling of values (e.g., the same value sometimes beginning with a lowercase
letter, sometimes with an uppercase letter).
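As an illustration of the "describe data" and "verify data quality" tasks above, the minimal pandas sketch below computes basic statistics, counts missing values, and checks for case-inconsistent spellings; the sample DataFrame and its columns are hypothetical, not taken from the text.

import pandas as pd

# Hypothetical sample standing in for a collected dataset.
df = pd.DataFrame({
    "age":  [34, 41, None, 29, 41, 150],           # None = blank field, 150 = suspect value
    "city": ["Pune", "pune", "Mumbai", "Mumbai", None, "Delhi"],
})

# Describe data: basic statistics per numeric attribute.
print(df["age"].describe())            # count, mean, std, min, max, quartiles
print("variance:", df["age"].var())
print("skewness:", df["age"].skew())
print("mode:", df["age"].mode().tolist())

# Verify data quality: missing attributes and blank fields.
print(df.isna().sum())

# Spelling/case consistency: the same value written with different capitalization.
print(df["city"].str.lower().value_counts())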
Phase 3--Data Preparation:
Select data
Reconsider data selection criteria.
Decide which dataset will be used.
Collect appropriate additional data (internal or external).
Consider use of sampling techniques.
Explain why certain data was included or excluded.
Clean data
Correct, remove or ignore noise.
Decide how to deal with special values and their meaning.
Aggregation level, missing values, etc.
Outliers?
Construct data
Derived attributes.
Background knowledge.
How can missing attributes be constructed or imputed?
https://sci2s.ugr.es/noisydata
Data Preparation:
Integrate data
Integrate sources and store result (new tables and
records).
Format Data
Rearranging attributes
Reordering records
Reformatting within-value
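The minimal pandas sketch below walks through select, clean, construct, integrate, and format on a hypothetical extract; the column names, values, and the join table are illustrative assumptions only.

import pandas as pd

# Hypothetical raw extract from the data lake.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income":      [52000, None, 61000, 1_000_000],   # a missing value and a possible outlier
    "birth_year":  [1990, 1985, None, 1978],
})

# Select data: keep only the attributes relevant to the mining goal.
data = raw[["customer_id", "income", "birth_year"]].copy()

# Clean data: impute the missing value and cap the outlier.
data["income"] = data["income"].fillna(data["income"].median())
data["income"] = data["income"].clip(upper=data["income"].quantile(0.99))

# Construct data: a derived attribute built from background knowledge.
data["age"] = 2024 - data["birth_year"]

# Integrate data: merge with another (hypothetical) internal source.
segments = pd.DataFrame({"customer_id": [1, 2, 3, 4], "segment": ["A", "B", "A", "C"]})
data = data.merge(segments, on="customer_id", how="left")

# Format data: reorder attributes and records for the modeling tool.
data = data[["customer_id", "segment", "age", "income"]].sort_values("customer_id")
print(data)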
Phase 4--Modeling:
Select the modeling technique
based upon the data mining objective
Generate test design
Procedure to test model quality and validity
Build model
Parameter settings
Assess model
Rank the models
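A minimal scikit-learn sketch of these four tasks; the synthetic dataset, the choice of a decision tree, and the parameter settings are illustrative assumptions, not prescribed by CRISP-DM.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Generate test design: hold out part of the data to test model quality and validity.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Select the technique and build the model; parameter settings are explicit and recorded.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Assess the model: a score that can be used to rank competing models.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))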
Phase 5 – Evaluation:
Evaluate results
Understand data mining result.
Check impact for data mining goal.
Check result against knowledge base to see if it is novel and useful.
Evaluate and assess result with respect to business success criteria
Review of process
Summarize the process review (activities that were missed or should be
repeated).
Overview data mining process.
Is there any overlooked factor or task? (Did we correctly build the model?
Did we only use attributes that we are allowed to use and that are
available for future analyses?)
Identify failures, misleading steps, possible alternative actions,
unexpected paths
Review data mining results with respect to business success
Phase 5 – Evaluation:
Determine next steps
Analyze potential for deployment of each result.
Estimate potential for improvement of current process.
Check remaining resources to determine if they allow
additional process iterations (or whether additional
resources can be made available).
Recommend alternative continuations. Refine process
plan.
Decision
At the end of this phase, a decision on the use of the data
mining results should be reached.
Phase 6 – Deployment:
Plan deployment
How will the knowledge or information be propagated to users?
How will the use of the result be monitored or its benefits measured?
How will the model or software result be deployed within the
organization’s systems?
How will its use be monitored and its benefits measured (where
applicable)?
Identify possible problems when deploying the data mining results.
Plan monitoring and maintenance
What could change in the environment?
How will accuracy be monitored?
When should the data mining model no longer be used? What
should happen if it can no longer be used? (Update the model, or start a
new data mining project.)
Will the business objectives of the use of the model change over
time?
Phase 6 – Deployment:
Produce a final report
Identify reports needed (slide presentation, management summary,
detailed findings, explanation of models, etc.).
Assess how well the initial data mining goals have been met.
Identify target groups for reports.
Outline structure and contents of reports.
Select findings to be included in the reports. Write a report.
Review project
Interview people involved in project; Interview end users.
What could have been done better? Do they need additional support?
Summarize feedback and write the experience documentation
Analyze the process (what went right or wrong, what was done well,
and what needs to be improved).
Document the specific data mining process
Abstract from details to make the experience useful for future
projects.
Also read from the link below:
https://www.zentut.com/data-mining/data-mining-processes/
Homogeneous Ontology for Recursive
Uniform Schema
The Homogeneous Ontology for Recursive Uniform
Schema (HORUS) is used as an internal data format
structure that enables the framework to reduce the
permutations of transformations required by the framework.
The use of HORUS methodology results in a hub-and-spoke
data transformation approach.
External data formats are converted to the HORUS format, and
the HORUS format is then transformed into any other required
external format.
The basic concept is to take native raw data and then
transform it first to a single format. That means that there is
only one format for text files, one format for JSON or XML,
one format for images and video.
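A minimal sketch of the hub-and-spoke idea, assuming the internal (HORUS-style) structure is simply a list of flat dictionaries; the converter functions and the canonical structure are illustrative assumptions, not the actual HORUS implementation.

import csv, io, json

# Spoke -> hub: convert external formats to the single internal structure.
def csv_to_horus(text):
    return list(csv.DictReader(io.StringIO(text)))

def json_to_horus(text):
    return json.loads(text)

# Hub -> spoke: emit the internal structure as any required external format.
def horus_to_json(records):
    return json.dumps(records, indent=2)

# With N external formats, only 2*N converters are needed (to and from the hub),
# instead of N*(N-1) direct format-to-format transformations.
records = csv_to_horus("id,name\n1,Ada\n2,Linus\n")
print(horus_to_json(records))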
[Figure: overview of the top layers of the layered framework]
The Top Layers of a Layered Framework
Business Layer :-
The business layer is the principal source of the requirements
and business information needed by data scientists and
engineers for processing.
Material you would expect in the business layer includes the
following:
Up-to-date organizational structure chart
Business description of the business processes you are investigating
List of your subject matter experts
Project plans
Budgets
Functional requirements
Nonfunctional requirements
Standards for data items
Utility Layer :-
The utility layer is a common area in which you store all your
utilities.
Collect your utilities (including source code) in one
central location. Keep detailed records of every version.
Operational Management Layer :-
Operations management is an area of the ecosystem concerned
with designing and controlling the process chains of a production
environment and redesigning business procedures.
This layer stores what you intend to process.
It is where you plan your data science processing pipelines.
The operations management layer is where you record
Processing-stream definition and management
Parameters
Scheduling
Monitoring
Communication
Alerting
The operations management layer is the common location where
you store any of the processing chains you have created for your
data science.
Operations management is how you turn data science
experiments into long-term business gains.
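A minimal sketch of what recording one processing-stream definition could look like; the dictionary keys mirror the items listed above, but the schema itself is an assumption, not a prescribed format.

# Hypothetical registry entry for a single processing stream.
pipeline = {
    "name": "customer_churn_daily",
    "steps": ["retrieve", "assess", "process", "transform", "organize", "report"],
    "parameters": {"input_path": "datalake/customers/", "model_version": "v3"},
    "schedule": {"cron": "0 2 * * *"},            # run daily at 02:00
    "monitoring": {"max_runtime_minutes": 90},
    "alerting": {"on_failure": ["data-team@example.com"]},
}

def validate(definition):
    # Minimal check that a stream definition records every required aspect.
    required = {"name", "steps", "parameters", "schedule", "monitoring", "alerting"}
    missing = required - definition.keys()
    if missing:
        raise ValueError(f"incomplete pipeline definition, missing: {missing}")
    return True

print(validate(pipeline))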
Audit, Balance, and Control Layer :-
The audit, balance, and control layer is the area from which
you can observe what is currently running within your data
science environment.
It records
Process-execution statistics
Balancing and controls
Rejects and error-handling
Codes management
The three subareas are utilized in the following manner:
Audit :-
The audit sublayer records any process that runs within the
environment.
This information is used by data scientists and engineers to
understand and plan improvements to the processing.
Make sure your algorithms and processing generate a good and
complete audit trail.
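A minimal sketch of generating such an audit trail with Python's standard logging module; the decorator and the example step are illustrative assumptions.

import functools, logging, time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
audit_log = logging.getLogger("audit")

def audited(func):
    # Record the start, end, duration, and outcome of every audited process step.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        audit_log.info("START %s", func.__name__)
        try:
            result = func(*args, **kwargs)
            audit_log.info("END %s ok duration=%.3fs", func.__name__, time.time() - start)
            return result
        except Exception:
            audit_log.exception("END %s failed duration=%.3fs", func.__name__, time.time() - start)
            raise
    return wrapper

@audited
def assess_quality():
    return "ok"          # stand-in for a real processing step

assess_quality()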
Balance :-
The balance sublayer ensures that the ecosystem is balanced
across the available processing capability, or has the ability
to top up capacity during periods of extreme processing.
Using the audit trail, it is possible to adapt to changing
requirements and forecast what you will require to
complete the schedule of work you submitted to the
ecosystem.
In the always-on and top-up ecosystem you can build, you can
balance your processing requirements by removing or adding
resources dynamically as you move through the processing
pipe.
An example: during end-of-month processing, you increase your
processing capacity to sixty nodes to handle the extra demand of
the end-of-month run.
The rest of the month, you run at twenty nodes during
business hours.
During weekends and other slow times, you only run with five
nodes. Massive savings can be generated in this manner.
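A minimal sketch that encodes exactly the node-count example above; the calendar boundaries (last two days of the month, 08:00-18:00 business hours) are simplifying assumptions.

import calendar
from datetime import datetime

def required_nodes(now: datetime) -> int:
    # Pick a cluster size following the end-of-month / business-hours example.
    last_day = calendar.monthrange(now.year, now.month)[1]
    end_of_month_run = now.day >= last_day - 1                 # last two days of the month
    business_hours = now.weekday() < 5 and 8 <= now.hour < 18

    if end_of_month_run:
        return 60   # extra demand of the end-of-month run
    if business_hours:
        return 20   # normal daytime processing
    return 5        # weekends and other slow times

print(required_nodes(datetime(2024, 1, 31, 10)))   # end of month -> 60
print(required_nodes(datetime(2024, 1, 10, 11)))   # weekday, business hours -> 20
print(required_nodes(datetime(2024, 1, 13, 23)))   # weekend night -> 5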
Control :-
The control sublayer controls the execution of the current
active data science processes in a production ecosystem.
The control elements are a combination of the control
element within the Data Science Technology Stack’s
individual tools plus a custom interface to control the
primary workflow.
The control also ensures that when processing
experiences an error, it can attempt a recovery, as per
your requirements, or schedule a clean-up utility to undo
the error.
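A minimal sketch of this retry-and-clean-up control logic; the step and the clean-up utility are hypothetical stand-ins for tools in the Data Science Technology Stack.

import time

def run_with_recovery(step, cleanup, retries=3, delay_seconds=5):
    # Run a processing step; on error, attempt recovery, otherwise run the clean-up utility.
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt < retries:
                time.sleep(delay_seconds)   # back off, then attempt recovery
            else:
                cleanup()                   # undo the effects of the failed run
                raise

# Hypothetical step and clean-up utility.
def load_data_vault():
    raise RuntimeError("target table locked")

def rollback_partial_load():
    print("rolling back partial load")

try:
    run_with_recovery(load_data_vault, rollback_partial_load, retries=2, delay_seconds=0)
except RuntimeError:
    pass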
Functional Layer :-
The functional layer of the data science ecosystem is the
main layer of programming required.
The functional layer is the part of the ecosystem that
executes the comprehensive data science.
It consists of several structures:
Data models
Processing algorithms
Provisioning of infrastructure
Processing algorithms
The processing algorithm is spread across six supersteps of
processing as follows:
1. Retrieve: This superstep contains all the processing chains for
retrieving data from the raw data lake into a more structured
format.
2. Assess: This superstep contains all the processing chains for quality
assurance and additional data enhancements.
3. Process: This superstep contains all the processing chains for
building the data vault.
4. Transform: This superstep contains all the processing chains for
building the data warehouse.
5. Organize: This superstep contains all the processing chains for
building the data marts.
6. Report: This superstep contains all the processing chains for
building virtualization and reporting the actionable knowledge.
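A minimal sketch that chains the six supersteps in order; the function bodies are placeholders standing in for the real processing chains.

# Placeholder implementations; each superstep would contain its real processing chains.
def retrieve(lake):   return {"raw": lake}                       # raw data lake -> structured format
def assess(data):     return {**data, "quality_checked": True}   # quality assurance and enhancement
def process(data):    return {**data, "data_vault": True}        # build the data vault
def transform(data):  return {**data, "warehouse": True}         # build the data warehouse
def organize(data):   return {**data, "data_marts": True}        # build the data marts
def report(data):     return {**data, "reports": True}           # virtualization and reporting

def functional_layer(lake):
    # Run the supersteps in order: Retrieve -> Assess -> Process -> Transform -> Organize -> Report.
    result = lake
    for superstep in (retrieve, assess, process, transform, organize, report):
        result = superstep(result)
    return result

print(functional_layer("datalake/"))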