
Unit 1.2
LAYERED FRAMEWORK
Aditi S. Chikhalikar
Outline
 Definition of Data Science Framework
 Cross-Industry Standard Process for Data Mining (CRISP-DM)
 Homogeneous Ontology for Recursive Uniform Schema (HORUS)
 The Top Layers of a Layered Framework
 Layered Framework for High-Level Data Science and Engineering
Definition of Data Science Framework
 Data science is a series of discoveries.
 You work toward an overall business strategy of converting raw unstructured data from the data lake into actionable business data.
 This process is a cycle of discovering and evolving your understanding of the data you are working with, to supply you with the metadata that you need.
 You build a basic framework that you use for your data processing.
 This framework enables you to construct a data science solution and then easily transfer it to your data engineering environments.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
 CRISP-DM was generated in 1996 and, by 1997, it was extended via a European Union project under the ESPRIT (European Strategic Programme for Research and Development in Information Technology) funding initiative. It had the majority support base among data scientists until mid-2015. The web site that was driving the Special Interest Group disappeared on June 30, 2015, and has since reopened. Since then, however, CRISP-DM has lost ground against other custom modeling methodologies.
 The basic concept behind the process is still valid, but you will find that most companies do not use it as is in any projects; they have some form of modification that they employ as an internal standard.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
 Goals:
 Encourage interoperable tools across the entire data mining process.
 Take the mystery and high-priced expertise out of simple data mining tasks.
 CRISP-DM is a comprehensive data mining methodology and process model that provides anyone—from novices to data mining experts—with a complete blueprint for conducting a data mining project.
 CRISP-DM breaks down the life cycle of a data mining project into six phases.
CRISP-DM: Phases
 Business Understanding
 Understanding project objectives and requirements
 Data mining problem definition
 Data Understanding
 Initial data collection and familiarization
 Identify data quality issues
 Initial, obvious results
 Data Preparation
 Record and attribute selection
 Data cleansing
 Modeling
 Run the data mining tools
 Evaluation
 Determine if results meet business objectives
 Identify business issues that should have been addressed earlier
 Deployment
 Put the resulting models into practice
 Set up for continuous mining of the data
 Phase 1 – Business Understanding
 Determine business objectives
 Key persons and their roles? Internal sponsor (financial, domain expert)?
 Business units impacted by the project (sales, finance, ...)?
 Business success criteria and who assesses them?
 Users’ needs and expectations.
 Describe the problem in general terms.
 Business questions, expected benefits.
 Assess situation
 Are they already using data mining?
 Identify hardware and software available.
 Identify data sources and their types (online, experts, written documentation).
 Determine data mining goals
 Produce project plan
 Define the initial process plan; discuss its feasibility with involved personnel.
 Estimate effort and resources needed; identify critical steps.
 Phase 2 – Data Understanding
 Collect data
 Describe data
 Check data volume and examine its gross properties.
 Accessibility and availability of attributes; attribute types, ranges, correlations, identities.
 Understand the meaning of each attribute and attribute value in business terms.
 For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness); see the sketch after this list.
 Explore data
 Analyze properties of interesting attributes in detail.
 Distribution, relations between pairs or small numbers of attributes, properties of significant sub-populations, simple statistical analyses.
 Verify data quality
 Does it cover all the cases required? Does it contain errors, and how common are they?
 Identify missing attributes and blank fields. What is the meaning of missing data?
 Do the meanings of attributes and contained values fit together?
 Check spelling of values (e.g., the same value sometimes beginning with a lowercase letter, sometimes with an uppercase letter).
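As a minimal sketch of this per-attribute profiling step, assuming pandas is available and using a small illustrative DataFrame rather than real project data:

```python
import pandas as pd

# Illustrative data only; in practice this would come from the data lake.
df = pd.DataFrame({
    "age": [23, 35, 31, 46, 35, 29],
    "income": [32000, 54000, 48000, 61000, 54000, 39000],
})

# Compute the basic statistics mentioned above for each attribute.
for col in df.columns:
    s = df[col]
    print(
        f"{col}: min={s.min()}, max={s.max()}, mean={s.mean():.2f}, "
        f"std={s.std():.2f}, var={s.var():.2f}, "
        f"mode={s.mode().iloc[0]}, skew={s.skew():.2f}"
    )
```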
 Phase 3 – Data Preparation
 Select data
 Reconsider data selection criteria.
 Decide which dataset will be used.
 Collect appropriate additional data (internal or external).
 Consider the use of sampling techniques.
 Explain why certain data was included or excluded.
 Clean data (see the sketch after this list)
 Correct, remove, or ignore noise.
 Decide how to deal with special values and their meaning.
 Aggregation level, missing values, etc.
 Outliers?
 Construct data
 Derived attributes.
 Background knowledge.
 How can missing attributes be constructed or imputed?
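A minimal cleaning sketch, assuming pandas and an illustrative DataFrame; the imputation rule and the outlier threshold are assumptions for demonstration, not prescribed choices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [120.0, 95.0, np.nan, 110.0, 4800.0, 105.0],   # one missing value, one suspect value
    "region": ["north", "North", "south", "south", "north", "north"],
})

# Deal with missing values: impute the median for the numeric field.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Standardize spelling variants of the same value (lowercase vs. uppercase).
df["region"] = df["region"].str.lower()

# Flag potential outliers with a simple z-score rule (the 2-sigma cut-off is illustrative).
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df["amount_outlier"] = z.abs() > 2

print(df)
```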
 https://sci2s.ugr.es/noisydata
 Phase 3 – Data Preparation (continued)
 Integrate data
 Integrate sources and store the result (new tables and records).
 Format data
 Rearranging attributes.
 Reordering records.
 Within-value reformatting (purely syntactic changes to values).
 Phase 4 – Modeling
 Select the modeling technique
 Based on the data mining objective.
 Generate test design
 Procedure to test model quality and validity.
 Build model
 Parameter settings.
 Assess model
 Rank the models.
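A minimal modeling sketch, assuming scikit-learn; the bundled dataset and the two candidate techniques are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Test design: hold out part of the data to measure model quality and validity.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build models with explicit parameter settings.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

# Assess and rank the models on the held-out test set.
scores = {name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
          for name, model in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: accuracy={score:.3f}")
```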
 Phase 5 – Evaluation
 Evaluate results
 Understand the data mining results.
 Check their impact on the data mining goal.
 Check the results against the knowledge base to see if they are novel and useful.
 Evaluate and assess the results with respect to the business success criteria.
 Review of process
 Summarize the process review (activities that were missed or should be repeated).
 Give an overview of the data mining process.
 Is there any overlooked factor or task? (Did we correctly build the model? Did we only use attributes that we are allowed to use and that are available for future analyses?)
 Identify failures, misleading steps, possible alternative actions, unexpected paths.
 Review the data mining results with respect to business success criteria.
 Phase 5 – Evaluation (continued)
 Determine next steps
 Analyze the potential for deployment of each result.
 Estimate the potential for improvement of the current process.
 Check remaining resources to determine if they allow additional process iterations (or whether additional resources can be made available).
 Recommend alternative continuations. Refine the process plan.
 Decision
 At the end of this phase, a decision on the use of the data mining results should be reached.
 Phase 6 – Deployment
 Plan deployment
 How will the knowledge or information be propagated to users?
 How will the model or software result be deployed within the organization’s systems?
 How will its use be monitored and its benefits measured (where applicable)?
 Identify possible problems when deploying the data mining results.
 Plan monitoring and maintenance
 What could change in the environment?
 How will accuracy be monitored?
 When should the data mining model no longer be used? What should happen if the model can no longer be used? (Update the model, or start a new data mining project.)
 Will the business objectives of the use of the model change over time?
 Phase 6 – Deployment (continued)
 Produce a final report
 Identify the reports needed (slide presentation, management summary, detailed findings, explanation of models, etc.).
 Assess how well the initial data mining goals have been met.
 Identify target groups for the reports.
 Outline the structure and contents of the reports.
 Select the findings to be included in the reports. Write the report.
 Review project
 Interview people involved in the project; interview end users.
 What could have been done better? Do they need additional support?
 Summarize feedback and write the experience documentation.
 Analyze the process (what went right or wrong, what was done well, and what needs to be improved).
 Document the specific data mining process.
 Abstract from details to make the experience useful for future projects.
Also read from the link below:
 https://www.zentut.com/data-mining/data-mining-processes/
Homogeneous Ontology for Recursive Uniform Schema (HORUS)
 The Homogeneous Ontology for Recursive Uniform Schema (HORUS) is used as an internal data format structure that enables the framework to reduce the permutations of transformations required by the framework.
 The use of the HORUS methodology results in a hub-and-spoke data transformation approach.
 External data formats are converted to the HORUS format, and the HORUS format is then transformed into any other external format.
 The basic concept is to take native raw data and transform it first to a single format. That means there is only one format for text files, one format for JSON or XML, one format for images and video.
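A minimal hub-and-spoke sketch of this idea, assuming pandas and using a DataFrame as a stand-in for the internal HORUS format (that stand-in is an assumption for illustration, not the actual HORUS structure):

```python
import io

import pandas as pd

# Two external sources in different formats.
csv_source = io.StringIO("id,name\n1,alpha\n2,beta\n")
json_source = io.StringIO('[{"id": 3, "name": "gamma"}]')

# Spokes in: every external format is converted to the single internal format first.
internal = pd.concat(
    [pd.read_csv(csv_source), pd.read_json(json_source)],
    ignore_index=True,
)

# Spokes out: the internal format is converted to whichever external format is needed.
print(internal.to_json(orient="records"))
print(internal.to_csv(index=False))
```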
The Top Layers of a Layered Framework
 The top layers support a long-term strategy of creating a Center of Excellence for your data science work.
 These layers enable you to keep track of all the processing and findings you achieve.
 The framework will enable you to turn your small project into a big success, without having to do major restructuring along the route to production.
The Top Layers of a Layered Framework
 Business Layer:
 The business layer is the principal source of the requirements and business information needed by data scientists and engineers for processing.
 Material you would expect in the business layer includes the following:
 Up-to-date organizational structure chart
 Business description of the business processes you are investigating
 List of your subject matter experts
 Project plans
 Budgets
 Functional requirements
 Nonfunctional requirements
 Standards for data items
 Utility Layer:
 The utility layer is a common area in which you store all your utilities.
 Collect your utilities (including source code) in one central location. Keep detailed records of every version.
 Operational Management Layer:
 Operations management is an area of the ecosystem concerned with designing and controlling the process chains of a production environment and redesigning business procedures.
 This layer stores what you intend to process.
 It is where you plan your data science processing pipelines.
 The operations management layer is where you record
 Processing-stream definition and management
 Parameters
 Scheduling
 Monitoring
 Communication
 Alerting
 The operations management layer is the common location where you store any of the processing chains you have created for your data science.
 Operations management is how you turn data science experiments into long-term business gains.
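A minimal sketch of what a recorded processing-stream definition might look like; the field names and values here are illustrative assumptions rather than a prescribed schema:

```python
# An illustrative processing-stream record covering definition, parameters,
# scheduling, monitoring, communication, and alerting.
processing_stream = {
    "name": "customer_sales_daily",
    "steps": ["retrieve", "assess", "process", "transform", "organize", "report"],
    "parameters": {"source": "data_lake/sales", "target": "data_warehouse/sales"},
    "schedule": {"cron": "0 2 * * *"},                 # run daily at 02:00
    "monitoring": {"max_runtime_minutes": 90},
    "alerting": {"on_failure": ["data-ops@example.com"]},
}

def run_stream(stream):
    """Walk the defined steps in order; a real run would invoke the pipeline tools."""
    for step in stream["steps"]:
        print(f"[{stream['name']}] running step: {step}")

run_stream(processing_stream)
```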
 Audit, Balance, and Control Layer:
 The audit, balance, and control layer is the area from which you can observe what is currently running within your data science environment.
 It records
 Process-execution statistics
 Balancing and controls
 Rejects and error handling
 Codes management
 The three subareas are utilized in the following manner:
 Audit:
 The audit sublayer records any process that runs within the environment.
 This information is used by data scientists and engineers to understand and plan improvements to the processing.
 Make sure your algorithms and processing generate a good and complete audit trail.
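A minimal audit-trail sketch: a decorator that records which process ran, when, how long it took, and whether it succeeded. The logging setup is an assumption for illustration; a production environment would write to a central audit store:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s AUDIT %(message)s")

def audited(func):
    """Record an audit entry for every execution of the wrapped process."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = func(*args, **kwargs)
            logging.info("process=%s status=success seconds=%.3f",
                         func.__name__, time.time() - start)
            return result
        except Exception:
            logging.info("process=%s status=failed seconds=%.3f",
                         func.__name__, time.time() - start)
            raise
    return wrapper

@audited
def load_sales_data():
    time.sleep(0.1)  # stand-in for real processing
    return 42

load_sales_data()
```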
 Balance:
 The balance sublayer ensures that the ecosystem is balanced across the available processing capability, or can top up capability during periods of extreme processing.
 Using the audit trail, it is possible to adapt to changing requirements and forecast what you will require to complete the schedule of work you submitted to the ecosystem.
 In the always-on and top-up ecosystem you can build, you can balance your processing requirements by removing or adding resources dynamically as you move through the processing pipe.
 For example, during end-of-month processing you increase your processing capacity to sixty nodes, to handle the extra demand of the end-of-month run.
 The rest of the month, you run at twenty nodes during business hours.
 During weekends and other slow times, you only run with five nodes. Massive savings can be generated in this manner.
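A back-of-the-envelope sketch of that capacity example; the hour counts assumed below are rough illustrative figures, used only to compare the tiered schedule against running sixty nodes around the clock:

```python
HOURS_IN_MONTH = 30 * 24            # roughly 720 hours in a month
end_of_month_hours = 2 * 24         # assume a two-day end-of-month run at 60 nodes
business_hours = 20 * 10            # assume 20 business days of 10 hours at 20 nodes
slow_hours = HOURS_IN_MONTH - end_of_month_hours - business_hours  # nights/weekends at 5 nodes

tiered_node_hours = 60 * end_of_month_hours + 20 * business_hours + 5 * slow_hours
always_on_node_hours = 60 * HOURS_IN_MONTH

print(f"tiered schedule : {tiered_node_hours} node-hours")
print(f"always-on at 60 : {always_on_node_hours} node-hours")
print(f"saving          : {1 - tiered_node_hours / always_on_node_hours:.0%}")
```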
 Control:
 The control sublayer controls the execution of the currently active data science processes in a production ecosystem.
 The control elements are a combination of the control elements within the Data Science Technology Stack’s individual tools plus a custom interface to control the primary workflow.
 The control sublayer also ensures that when processing experiences an error, it can attempt a recovery, as per your requirements, or schedule a clean-up utility to undo the error.
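A minimal control sketch of that recovery behaviour: retry a failing process a few times and, if it still fails, run a clean-up step to undo the partial work. The retry count, delay, and clean-up action are assumptions for illustration:

```python
import time

def run_with_recovery(process, cleanup, retries=3, delay_seconds=1.0):
    """Attempt recovery by retrying; on final failure, undo the error via cleanup."""
    for attempt in range(1, retries + 1):
        try:
            return process()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            time.sleep(delay_seconds)
    cleanup()
    raise RuntimeError("process failed after retries; clean-up executed")

def flaky_load():
    raise IOError("source not reachable")   # stand-in for a failing process

def undo_partial_load():
    print("removing partially loaded records")

try:
    run_with_recovery(flaky_load, undo_partial_load, retries=2, delay_seconds=0.1)
except RuntimeError as err:
    print(err)
```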
 Functional Layer:
 The functional layer of the data science ecosystem is the main layer of programming required.
 The functional layer is the part of the ecosystem that executes the comprehensive data science.
 It consists of several structures:
 Data models
 Processing algorithms
 Provisioning of infrastructure
Processing algorithms
 The processing algorithm is spread across six supersteps of processing, as follows:
1. Retrieve: This superstep contains all the processing chains for retrieving data from the raw data lake into a more structured format.
2. Assess: This superstep contains all the processing chains for quality assurance and additional data enhancements.
3. Process: This superstep contains all the processing chains for building the data vault.
4. Transform: This superstep contains all the processing chains for building the data warehouse.
5. Organize: This superstep contains all the processing chains for building the data marts.
6. Report: This superstep contains all the processing chains for building virtualization and reporting the actionable knowledge.
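A minimal sketch of the six supersteps chained together; each function is a placeholder for the processing chains described above, and the in-memory values stand in for the data lake, data vault, data warehouse, and data marts:

```python
def retrieve():
    return [{"id": 1, "amount": 120}, {"id": 2, "amount": None}]   # raw data lake extract

def assess(rows):
    return [r for r in rows if r["amount"] is not None]            # quality assurance

def process(rows):
    return {"sales_vault": rows}                                   # build the data vault

def transform(vault):
    return {"total_sales": sum(r["amount"] for r in vault["sales_vault"])}  # data warehouse

def organize(warehouse):
    return {"finance_mart": warehouse}                             # data marts

def report(marts):
    print("actionable knowledge:", marts)                          # reporting/virtualization

report(organize(transform(process(assess(retrieve())))))
```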