Unit 1.2 Layered Framework
LAYERED FRAMEWORK
Aditi S. Chikhalikar
Outline
Definition of Data Science Framework
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Homogeneous Ontology for Recursive Uniform Schema (HORUS)
The Top Layers of a Layered Framework
Layered Framework for High-Level Data Science and
Engineering
Definition of Data Science Framework
Data science is a series of discoveries.
You work toward an overall business strategy of
converting raw unstructured data from the data lake into
actionable business data.
This process is a cycle of discovering and evolving your
understanding of the data you are working with to supply
the metadata that you need.
You build a basic framework that you use for your data
processing.
This will enable you to construct a data science solution
and then easily transfer it to your data engineering
environments.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM was conceived in 1996 and, by 1997, was
extended via a European Union project under the ESPRIT
(European Strategic Programme for Research and
Development in Information Technology) funding initiative.
It had the majority support base among data scientists until
mid-2015. The website that was driving the Special
Interest Group disappeared on June 30, 2015, and has
since reopened. Since then, however, the methodology has been
losing ground to custom modeling methodologies.
The basic concept behind the process is still valid, but you
will find that most companies do not use it as is in their
projects; they employ some form of modification as an
internal standard.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Goals:
Encourage interoperable tools across the entire data mining process
Take the mystery/high-priced expertise out of simple data
mining tasks
CRISP-DM is a comprehensive data mining methodology
and process model that provides anyone—from novices
to data mining experts—with a complete blueprint for
conducting a data mining project.
CRISP-DM breaks down the life cycle of a data mining
project into six phases.
CRISP-DM: Phases
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for continuous mining of the data
[Figure: the six CRISP-DM phases and the iterative flow between them]
Phase 1--Business Understanding
Determine business objectives
Key persons and their roles? Internal sponsor (financial, domain expert).
Business units impacted by the project (sales, finance,...) ?
Business success criteria and who assesses it?
Users’ needs and expectations.
Describe problem in general terms.
Business questions, Expected benefits.
Assess situation
Are they already using data mining?
Identify hardware and software available.
Identify data sources and their types (online, experts, written documentation).
Determine data mining goals
Produce project plan
Define initial process plan; discuss its feasibility with involved personnel.
Estimate effort and resources needed; Identify critical steps.
Phase 2--Data Understanding:
Collect data
Describe data
Check data volume and examine its gross properties.
Accessibility and availability of attributes; attribute types, ranges, correlations,
identities.
Understand the meaning of each attribute and attribute value in business terms.
For each attribute, compute basic statistics (e.g., distribution, average, max, min,
standard deviation, variance, mode, skewness)
Explore data
Analyze properties of interesting attributes in detail
Distribution, relations between pairs or small numbers of attributes, properties of
significant sub-populations, simple statistical analyses.
Verify data quality
Does it cover all the cases required? Does it contain errors and how common are
they?
Identify missing attributes and blank fields. Meaning of missing data.
Do the meanings of attributes and contained values fit together?
Check spelling of values (e.g., the same value sometimes beginning with a lowercase
letter, sometimes with an uppercase letter).
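As an illustration of the "describe data" and "verify data quality" tasks above, the minimal pandas sketch below computes basic statistics, counts missing values, and checks for case-inconsistent spellings; the sample DataFrame and its columns are hypothetical, not taken from the text.

import pandas as pd

# Hypothetical sample standing in for a collected dataset.
df = pd.DataFrame({
    "age":  [34, 41, None, 29, 41, 150],           # None = blank field, 150 = suspect value
    "city": ["Pune", "pune", "Mumbai", "Mumbai", None, "Delhi"],
})

# Describe data: basic statistics per numeric attribute.
print(df["age"].describe())            # count, mean, std, min, max, quartiles
print("variance:", df["age"].var())
print("skewness:", df["age"].skew())
print("mode:", df["age"].mode().tolist())

# Verify data quality: missing attributes and blank fields.
print(df.isna().sum())

# Spelling/case consistency: the same value written with different capitalization.
print(df["city"].str.lower().value_counts())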
Phase 3--Data Preparation:
Select data
Reconsider data selection criteria.
Decide which dataset will be used.
Collect appropriate additional data (internal or external).
Consider use of sampling techniques.
Explain why certain data was included or excluded.
Clean data
Correct, remove or ignore noise.
Decide how to deal with special values and their meaning.
Aggregation level, missing values, etc.
Outliers?
Construct data
Derived attributes.
Background knowledge.
How can missing attributes be constructed or imputed?
https://sci2s.ugr.es/noisydata
Data Preparation:
Integrate data
Integrate sources and store result (new tables and
records).
Format Data
Rearranging attributes
Reordering records
Reformatting within-value
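The minimal pandas sketch below walks through select, clean, construct, integrate, and format on a hypothetical extract; the column names, values, and the join table are illustrative assumptions only.

import pandas as pd

# Hypothetical raw extract from the data lake.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income":      [52000, None, 61000, 1_000_000],   # a missing value and a possible outlier
    "birth_year":  [1990, 1985, None, 1978],
})

# Select data: keep only the attributes relevant to the mining goal.
data = raw[["customer_id", "income", "birth_year"]].copy()

# Clean data: impute the missing value and cap the outlier.
data["income"] = data["income"].fillna(data["income"].median())
data["income"] = data["income"].clip(upper=data["income"].quantile(0.99))

# Construct data: a derived attribute built from background knowledge.
data["age"] = 2024 - data["birth_year"]

# Integrate data: merge with another (hypothetical) internal source.
segments = pd.DataFrame({"customer_id": [1, 2, 3, 4], "segment": ["A", "B", "A", "C"]})
data = data.merge(segments, on="customer_id", how="left")

# Format data: reorder attributes and records for the modeling tool.
data = data[["customer_id", "segment", "age", "income"]].sort_values("customer_id")
print(data)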
Phase 4--Modeling:
Select the modeling technique
based upon the data mining objective
Generate test design
Procedure to test model quality and validity
Build model
Parameter settings
Assess model
Rank the models
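A minimal scikit-learn sketch of these four tasks; the synthetic dataset, the choice of a decision tree, and the parameter settings are illustrative assumptions, not prescribed by CRISP-DM.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Generate test design: hold out part of the data to test model quality and validity.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Select the technique and build the model; parameter settings are explicit and recorded.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Assess the model: a score that can be used to rank competing models.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))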
Phase 5 – Evaluation:
Evaluate results
Understand data mining result.
Check impact for data mining goal.
Check result against knowledge base to see if it is novel and useful.
Evaluate and assess result with respect to business success criteria
Review of process
Summarize the process review (activities that were missed or should be
repeated).
Overview data mining process.
Is there any overlooked factor or task? (Did we correctly build the model?
Did we only use attributes that we are allowed to use and that are
available for future analyses?)
Identify failures, misleading steps, possible alternative actions,
unexpected paths
Review data mining results with respect to business success
Phase 5 – Evaluation:
Determine next steps
Analyze potential for deployment of each result.
Estimate potential for improvement of current process.
Check remaining resources to determine if they allow
additional process iterations (or whether additional
resources can be made available).
Recommend alternative continuations. Refine process
plan.
Decision
At the end of this phase, a decision on the use of the data
mining results should be reached.
Phase 6 – Deployment:
Plan deployment
How will the knowledge or information be propagated to users?
How will the use of the result be monitored or its benefits measured?
How will the model or software result be deployed within the
organization’s systems?
How will its use be monitored and its benefits measured (where
applicable)?
Identify possible problems when deploying the data mining results.
Plan monitoring and maintenance
What could change in the environment?
How will accuracy be monitored?
When should the data mining model no longer be used? What
should happen if it can no longer be used? (Update the model, or start a
new data mining project.)
Will the business objectives of the use of the model change over
time?
Phase 6 – Deployment:
Produce a final report
Identify reports needed (slide presentation, management summary,
detailed findings, explanation of models, etc.).
Assess how well the initial data mining goals have been met.
Identify target groups for reports.
Outline structure and contents of reports.
Select findings to be included in the reports. Write a report.
Review project
Interview people involved in project; Interview end users.
What could have been done better? Do they need additional support?
Summarize feedback and write the experience documentation
Analyze the process (what went right or wrong, what was done well,
and what needs to be improved).
Document the specific data mining process
Abstract from details to make the experience useful for future
projects.
Also read from the link below:
https://www.zentut.com/data-mining/data-mining-processes/
Homogeneous Ontology for Recursive
Uniform Schema
The Homogeneous Ontology for Recursive Uniform
Schema (HORUS) is used as an internal data format
structure that enables the framework to reduce the
permutations of transformations required by the framework.
The use of HORUS methodology results in a hub-and-spoke
data transformation approach.
External data formats are converted to the HORUS format, and
the HORUS format is then transformed into any other required
external format.
The basic concept is to take native raw data and then
transform it first to a single format. That means that there is
only one format for text files, one format for JSON or XML,
one format for images and video.
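A minimal sketch of the hub-and-spoke idea, assuming the internal (HORUS-style) structure is simply a list of flat dictionaries; the converter functions and the canonical structure are illustrative assumptions, not the actual HORUS implementation.

import csv, io, json

# Spoke -> hub: convert external formats to the single internal structure.
def csv_to_horus(text):
    return list(csv.DictReader(io.StringIO(text)))

def json_to_horus(text):
    return json.loads(text)

# Hub -> spoke: emit the internal structure as any required external format.
def horus_to_json(records):
    return json.dumps(records, indent=2)

# With N external formats, only 2*N converters are needed (to and from the hub),
# instead of N*(N-1) direct format-to-format transformations.
records = csv_to_horus("id,name\n1,Ada\n2,Linus\n")
print(horus_to_json(records))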
[Figure: overview of the top layers of the layered framework]
The Top Layers of a Layered Framework
Business Layer :-
The business layer is the principal source of the requirements
and business information needed by data scientists and
engineers for processing.
Material you would expect in the business layer includes the
following:
Up-to-date organizational structure chart
Business description of the business processes you are investigating
List of your subject matter experts
Project plans
Budgets
Functional requirements
Nonfunctional requirements
Standards for data items
Utility Layer :-
The utility layer is a common area in which you store all your
utilities.
Collect your utilities (including source code) in one
central location. Keep detailed records of every version.
Operational Management Layer :-
Operations management is an area of the ecosystem concerned
with designing and controlling the process chains of a production
environment and redesigning business procedures.
This layer stores what you intend to process.
It is where you plan your data science processing pipelines.
The operations management layer is where you record
Processing-stream definition and management
Parameters
Scheduling
Monitoring
Communication
Alerting
The operations management layer is the common location where
you store any of the processing chains you have created for your
data science.
Operations management is how you turn data science
experiments into long-term business gains.
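A minimal sketch of what recording one processing-stream definition could look like; the dictionary keys mirror the items listed above, but the schema itself is an assumption, not a prescribed format.

# Hypothetical registry entry for a single processing stream.
pipeline = {
    "name": "customer_churn_daily",
    "steps": ["retrieve", "assess", "process", "transform", "organize", "report"],
    "parameters": {"input_path": "datalake/customers/", "model_version": "v3"},
    "schedule": {"cron": "0 2 * * *"},            # run daily at 02:00
    "monitoring": {"max_runtime_minutes": 90},
    "alerting": {"on_failure": ["data-team@example.com"]},
}

def validate(definition):
    # Minimal check that a stream definition records every required aspect.
    required = {"name", "steps", "parameters", "schedule", "monitoring", "alerting"}
    missing = required - definition.keys()
    if missing:
        raise ValueError(f"incomplete pipeline definition, missing: {missing}")
    return True

print(validate(pipeline))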
Audit, Balance, and Control Layer :-
The audit, balance, and control layer is the area from which
you can observe what is currently running within your data
science environment.
It records
Process-execution statistics
Balancing and controls
Rejects and error-handling
Codes management
The three subareas are utilized in the following manner:
Audit :-
The audit sublayer records any process that runs within the
environment.
This information is used by data scientists and engineers to
understand and plan improvements to the processing.
Make sure your algorithms and processing generate a good and
complete audit trail.
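A minimal sketch of generating such an audit trail with Python's standard logging module; the decorator and the example step are illustrative assumptions.

import functools, logging, time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
audit_log = logging.getLogger("audit")

def audited(func):
    # Record the start, end, duration, and outcome of every audited process step.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        audit_log.info("START %s", func.__name__)
        try:
            result = func(*args, **kwargs)
            audit_log.info("END %s ok duration=%.3fs", func.__name__, time.time() - start)
            return result
        except Exception:
            audit_log.exception("END %s failed duration=%.3fs", func.__name__, time.time() - start)
            raise
    return wrapper

@audited
def assess_quality():
    return "ok"          # stand-in for a real processing step

assess_quality()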
Balance :-
The balance sublayer ensures that the ecosystem is balanced
across the available processing capability, or has the ability
to top up capacity during periods of extreme processing.
Using the audit trail, it is possible to adapt to changing
requirements and forecast what you will require to
complete the schedule of work you submitted to the
ecosystem.
In the always-on and top-up ecosystem you can build, you can
balance your processing requirements by removing or adding
resources dynamically as you move through the processing
pipe.
An example: during end-of-month processing, you increase your
processing capacity to sixty nodes to handle the extra demand of
the end-of-month run.
The rest of the month, you run at twenty nodes during
business hours.
During weekends and other slow times, you only run with five
nodes. Massive savings can be generated in this manner.
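A minimal sketch that encodes exactly the node-count example above; the calendar boundaries (last two days of the month, 08:00-18:00 business hours) are simplifying assumptions.

import calendar
from datetime import datetime

def required_nodes(now: datetime) -> int:
    # Pick a cluster size following the end-of-month / business-hours example.
    last_day = calendar.monthrange(now.year, now.month)[1]
    end_of_month_run = now.day >= last_day - 1                 # last two days of the month
    business_hours = now.weekday() < 5 and 8 <= now.hour < 18

    if end_of_month_run:
        return 60   # extra demand of the end-of-month run
    if business_hours:
        return 20   # normal daytime processing
    return 5        # weekends and other slow times

print(required_nodes(datetime(2024, 1, 31, 10)))   # end of month -> 60
print(required_nodes(datetime(2024, 1, 10, 11)))   # weekday, business hours -> 20
print(required_nodes(datetime(2024, 1, 13, 23)))   # weekend night -> 5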
Control :-
The control sublayer controls the execution of the current
active data science processes in a production ecosystem.
The control elements are a combination of the control
element within the Data Science Technology Stack’s
individual tools plus a custom interface to control the
primary workflow.
The control also ensures that when processing
experiences an error, it can attempt a recovery, as per
your requirements, or schedule a clean-up utility to undo
the error.
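A minimal sketch of this retry-and-clean-up control logic; the step and the clean-up utility are hypothetical stand-ins for tools in the Data Science Technology Stack.

import time

def run_with_recovery(step, cleanup, retries=3, delay_seconds=5):
    # Run a processing step; on error, attempt recovery, otherwise run the clean-up utility.
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt < retries:
                time.sleep(delay_seconds)   # back off, then attempt recovery
            else:
                cleanup()                   # undo the effects of the failed run
                raise

# Hypothetical step and clean-up utility.
def load_data_vault():
    raise RuntimeError("target table locked")

def rollback_partial_load():
    print("rolling back partial load")

try:
    run_with_recovery(load_data_vault, rollback_partial_load, retries=2, delay_seconds=0)
except RuntimeError:
    pass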
Functional Layer :-
The functional layer of the data science ecosystem is the
main layer of programming required.
The functional layer is the part of the ecosystem that
executes the comprehensive data science.
It consists of several structures:
Data models
Processing algorithms
Provisioning of infrastructure
Processing algorithms
The processing algorithm is spread across six supersteps of
processing as follows:
1. Retrieve: This superstep contains all the processing chains for
retrieving data from the raw data lake into a more structured
format.
2. Assess: This superstep contains all the processing chains for quality
assurance and additional data enhancements.
3. Process: This superstep contains all the processing chains for
building the data vault.
4. Transform: This superstep contains all the processing chains for
building the data warehouse.
5. Organize: This superstep contains all the processing chains for
building the data marts.
6. Report: This superstep contains all the processing chains for
building virtualization and reporting the actionable knowledge.
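A minimal sketch that chains the six supersteps in order; the function bodies are placeholders standing in for the real processing chains.

# Placeholder implementations; each superstep would contain its real processing chains.
def retrieve(lake):   return {"raw": lake}                       # raw data lake -> structured format
def assess(data):     return {**data, "quality_checked": True}   # quality assurance and enhancement
def process(data):    return {**data, "data_vault": True}        # build the data vault
def transform(data):  return {**data, "warehouse": True}         # build the data warehouse
def organize(data):   return {**data, "data_marts": True}        # build the data marts
def report(data):     return {**data, "reports": True}           # virtualization and reporting

def functional_layer(lake):
    # Run the supersteps in order: Retrieve -> Assess -> Process -> Transform -> Organize -> Report.
    result = lake
    for superstep in (retrieve, assess, process, transform, organize, report):
        result = superstep(result)
    return result

print(functional_layer("datalake/"))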