METHODOLOGY Analitis Data Raya Sektor Awam (DRSA)
1. INTRODUCTION
1.1. PURPOSE
1.2. SCOPE
2. INTRODUCTION TO KEY CONCEPTS
2.1. BIG DATA
2.2. DATA SCIENCE, DATA ENGINEERING AND DATA SCIENCE PROCESS
2.3. CROSS INDUSTRY STANDARD PROCESS FOR DATA MINING
3. NIST BIG DATA INTEROPERABILITY FRAMEWORK
4. DRSA BIG DATA METHODOLOGY
DRSA METHODOLOGY
1. Introduction
1.1. Purpose
1.2. Scope
This document provides a high-level understanding of key concepts in Big Data and
iterative development, and of processes that can be followed in future projects. It
covers the following topics:
ii. Stages of a Big Data project and high-level implementation tasks
iv. Documents and deliverables that shall be created throughout the project
This section provides a high-level understanding of key concepts in the Big Data
ecosystem, to explain the base concepts, methodologies and standards used in
developing this methodology, which are:
i. Harvard CS109: Data Science Process – Provides a high-level process flow
for executing explorative analytics and data science, and is the
foundation of this methodology.
ii. Cross Industry Standard Process for Data Mining (CRISP-DM) – Defines
a high-level process flow for delivering machine learning and data
mining projects. CRISP-DM provides the majority of core processes
and tasks for this methodology.
1 http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
- Techopedia
At the time this document was written, there was no rigorous definition of
Big Data. Different organizations define it differently, but in general a
Big Data implementation consists of analyzing data that has one or more of
the following attributes:
ii. Velocity – the real-time analysis of streaming data that comes at a very fast
rate for faster decision making.
iii. Variety – the analysis of data coming from various sources; structured,
unstructured, data coming from internal sources of an organization and data
coming from external sources.
iv. Veracity – the analysis of highly inconsistent and poor-quality datasets.
2 http://www.techopedia.com/definition/27745/big-data
Besides the attributes of the datasets, Big Data practices also bring to the table
data analysis techniques that were previously inaccessible to mass-market
consumers because of technological limitations. Big Data Analytics may have the
following properties:
To analyze the collected datasets and provide the stated analytics, various Big
Data tools and techniques are used, such as:
ii. Machine learning – an analytic process that involves designing pattern
recognition algorithms that can learn from, and make predictions on, data.
iii. Data mining – an analytic process that applies machine learning and other
techniques to collect, store, analyze, and extract value from mined datasets.
iv. Data Warehouse / Data Lake – a repository where data is stored for the
purpose of reporting and analysis. A data warehouse has a different
architecture from a data lake, with different pros and cons, but both serve
the same purpose.
The end goal of analyzing Big Data is to harness information in novel ways to
produce useful insights and create new forms of value for the business.
Data science is a field of practice that focuses on the extraction of knowledge from
large volumes of data that are structured or unstructured. It is a continuation of the
field of data mining and predictive analytics, also known as Knowledge Discovery
and Data Mining (KDD)3. Data science involves principles, processes and
techniques for understanding phenomena via automated analysis of data.
Data science enables data-driven decision making (DDD), which is the practice of
3 https://en.wikipedia.org/wiki/Data_science
basing decisions on facts extracted from analyzed data, rather than purely on
intuition4.
To prepare and provide the data needed for data science activities, data
engineers develop the processes, workflows and programs that prepare
datasets, such as scheduled loading, anonymization, cleansing, and integration.
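The preparation tasks named above can be illustrated with a minimal sketch. It is not part of the DRSA methodology itself: the record layout, the field names (`citizen_id`, `district`, `amount`), the salt value and all function names are assumptions made for illustration, and scheduled loading is left out because it depends on the scheduler in use.

```python
import hashlib
from typing import Optional

# Hypothetical raw extracts from two source systems (illustrative data only).
REGISTRATIONS = [
    {"citizen_id": " A001 ", "district": "north"},
    {"citizen_id": "A002", "district": "South "},
    {"citizen_id": "", "district": "east"},  # missing key, will be dropped
]
PAYMENTS = [
    {"citizen_id": "A001", "amount": "120.50"},
    {"citizen_id": "A002", "amount": "n/a"},  # unparseable amount
]


def anonymize(identifier: str, salt: str = "demo-salt") -> str:
    """Pseudonymize an identifier with a salted SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:12]


def cleanse(record: dict) -> Optional[dict]:
    """Strip whitespace and normalize case; return None if the key is missing."""
    cleaned = {key: str(value).strip() for key, value in record.items()}
    if not cleaned.get("citizen_id"):
        return None
    if "district" in cleaned:
        cleaned["district"] = cleaned["district"].lower()
    return cleaned


def parse_amount(text: str) -> Optional[float]:
    """Convert an amount string to float, or None when it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


def prepare(registrations: list, payments: list) -> list:
    """Cleanse both sources, integrate them on citizen_id, then anonymize."""
    amounts = {}
    for row in payments:
        cleaned = cleanse(row)
        if cleaned is not None:
            amounts[cleaned["citizen_id"]] = parse_amount(cleaned["amount"])

    prepared = []
    for row in registrations:
        cleaned = cleanse(row)
        if cleaned is None:
            continue
        prepared.append({
            "citizen_key": anonymize(cleaned["citizen_id"]),
            "district": cleaned["district"],
            "amount": amounts.get(cleaned["citizen_id"]),
        })
    return prepared


if __name__ == "__main__":
    for row in prepare(REGISTRATIONS, PAYMENTS):
        print(row)
```

In this sketch the integration step runs before anonymization so that the two sources can still be joined on the raw identifier; only the pseudonymized key leaves the function.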
Once a business question has been identified, data collection activities are then
executed to gather the data needed to answer it.
The data may come from sources such as:
iv. Publicly available datasets from internet, social media and Open Data
The next stage in the process is exploring all gathered datasets to get to know
the data, identify patterns and any anomalies in the data. The information
gathered at this stage will be used for development of any models to answer the
business question in the subsequent stage.
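As a rough illustration of this exploration stage, the sketch below summarizes a numeric column, flags anomalies and counts category frequencies using only the Python standard library. The sample values, the function names and the choice of a modified z-score (based on the median absolute deviation, with the commonly used 3.5 cut-off) are assumptions for illustration; the methodology does not prescribe a particular technique.

```python
import statistics
from collections import Counter

# Hypothetical sample of daily transaction counts (illustrative data only).
DAILY_COUNTS = [102, 98, 105, 97, 101, 100, 99, 350, 103, 96]
CATEGORIES = ["permit", "licence", "permit", "tax", "permit", "tax"]


def summarize(values):
    """Return basic descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def find_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not mask itself.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median, nothing to flag
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


if __name__ == "__main__":
    print(summarize(DAILY_COUNTS))
    print("anomalies:", find_anomalies(DAILY_COUNTS))
    print("category frequencies:", Counter(CATEGORIES).most_common())
```

A median-based score is used here because a plain mean and standard deviation are themselves distorted by the outlier being searched for.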
Once a sufficient understanding of the data has been reached, and there is
enough data to work with, the data scientist will then develop
analytical models using various data analytics techniques and tools such as
statistical analysis, machine learning and data mining.
Finally, the results of the analysis are then communicated and presented
through visualization that can be understood by business stakeholders for
assisting in decision making for the organization.
Cross Industry Standard Process for Data Mining, commonly known by its
acronym CRISP-DM, is a data mining process model that describes the approaches
that data mining experts commonly use to tackle problems. CRISP-DM was
conceived in late 1996 by three “veterans” of the young and immature data mining
market. DaimlerChrysler (then Daimler-Benz) was already ahead of most
industrial and commercial organizations in applying data mining in its business
operations.
Many Big Data development activities, such as predictive analytics and
prescriptive analytics, are essentially data mining activities. CRISP-DM
has therefore been suggested as a good reference model for implementing
Big Data projects5. However, CRISP-DM alone is not enough to serve as a methodology
for delivering a full end-to-end Big Data project. It covers only the development of
analytical models, which sits in the data layer, and does not define any
phase for developing the mechanisms that consume the results, such
as analytics dashboards or data applications.
CRISP-DM breaks the process of data mining into six major phases.6
The sequence of the phases is not strict and moving back and forth between
different phases is always required. The arrows in the process diagram indicate
the most important and frequent dependencies between phases. The outer circle
in the diagram symbolizes the cyclic nature of data mining itself. A data mining
process continues after a solution has been deployed. The lessons learned during
the process can trigger new, often more focused business questions and
subsequent data mining processes will benefit from the experiences of previous
ones.
Phase 2: Data Understanding
The data understanding phase starts with an initial data collection and proceeds
with activities in order to get familiar with the data, to identify data quality
problems, to discover first insights into the data, or to detect interesting subsets to
form hypotheses for hidden information.
Phase 3: Data Preparation
The data preparation phase covers all activities to construct the final dataset (data
that will be fed into the modeling tool(s)) from the initial raw data. Data preparation
tasks are likely to be performed multiple times, and not in any prescribed order.
Tasks include table, record, and attribute selection as well as transformation and
cleaning of data for modeling tools.
Phase 4: Modeling
In this phase, various modeling techniques are selected and applied, and their
parameters are calibrated to optimal values. Typically, there are several
techniques for the same data mining problem type. Some techniques have
specific requirements on the form of data. Therefore, stepping back to the data
preparation phase is often needed.
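The idea of applying several techniques and calibrating their parameters can be sketched as a small search over candidate models. Everything in the sketch is an assumption for illustration: the toy dataset, the two techniques (a mean baseline and k-nearest-neighbour regression), the parameter grid and the use of leave-one-out mean absolute error as the selection criterion. CRISP-DM itself does not prescribe any of them.

```python
from statistics import mean

# Hypothetical training data (illustrative only): y is roughly 2 * x.
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]


def predict_mean(train_x, train_y, x, _param=None):
    """Baseline technique: always predict the mean of the training targets."""
    return mean(train_y)


def predict_knn(train_x, train_y, x, k):
    """k-nearest-neighbour regression: average the targets of the k closest points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return mean([target for _, target in nearest])


def loo_error(predict, param):
    """Leave-one-out mean absolute error for one technique/parameter pair."""
    errors = []
    for i in range(len(X)):
        train_x = X[:i] + X[i + 1:]
        train_y = Y[:i] + Y[i + 1:]
        errors.append(abs(predict(train_x, train_y, X[i], param) - Y[i]))
    return mean(errors)


# Each candidate technique is paired with the parameter values to try.
CANDIDATES = {
    "mean": (predict_mean, [None]),
    "knn": (predict_knn, [1, 2, 3]),
}


def calibrate():
    """Evaluate every technique/parameter pair and return (error, name, param) tuples."""
    results = []
    for name, (predict, grid) in CANDIDATES.items():
        for param in grid:
            results.append((loo_error(predict, param), name, param))
    return results


if __name__ == "__main__":
    results = calibrate()
    for error, name, param in sorted(results, key=lambda r: r[0]):
        print(f"{name:<5} param={param!s:<5} error={error:.3f}")
    best = min(results, key=lambda r: r[0])
    print("selected:", best[1], "with parameter", best[2])
```

If a technique needed the data in a different form (for example scaled or encoded features), that requirement would surface here and send the work back to data preparation, as the paragraph above notes.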
Phase 5: Evaluation
At this stage in the project you have built a model (or models) that appears to
have high quality, from a data analysis perspective. Before proceeding to final
deployment of the model, it is important to more thoroughly evaluate the model,
and review the steps executed to construct the model, to be certain it properly
achieves the business objectives. A key objective is to determine if there is some
important business issue that has not been sufficiently considered. At the end of
this phase, a decision on the use of the data mining results should be reached.
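One way to make this check concrete is to compare hold-out metrics against targets agreed with the business. The sketch below assumes a binary classification model; the labels, predictions, target values and function names are all hypothetical, and a real evaluation would also review the modeling steps themselves, which code cannot capture.

```python
# Hypothetical hold-out labels and model predictions (1 = case of interest).
ACTUAL = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
PREDICTED = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Hypothetical business targets agreed with stakeholders.
BUSINESS_TARGETS = {"precision": 0.75, "recall": 0.85}


def confusion_counts(actual, predicted):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, fn


def evaluate(actual, predicted, targets):
    """Compare model metrics with business targets and report whether all are met."""
    tp, fp, fn = confusion_counts(actual, predicted)
    metrics = {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
    report = {
        name: {
            "value": metrics[name],
            "target": target,
            "met": metrics[name] >= target,
        }
        for name, target in targets.items()
    }
    ready = all(entry["met"] for entry in report.values())
    return report, ready


if __name__ == "__main__":
    report, ready = evaluate(ACTUAL, PREDICTED, BUSINESS_TARGETS)
    for name, entry in report.items():
        print(name, entry)
    print("proceed to deployment" if ready else "revisit earlier phases")
```

With these sample values the precision target is met but the recall target is not, so the sketch reports that earlier phases should be revisited before deployment.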
Phase 6: Deployment
Creation of the model is generally not the end of the project. Even if the purpose
of the model is to increase knowledge of the data, the knowledge gained will need
to be organized and presented in a way that is useful to the customer. Depending
on the requirements, the deployment phase can be as simple as generating a
report or as complex as implementing a repeatable data scoring (e.g. segment
allocation) or data mining process. In many cases it will be the customer, not the
data analyst, who will carry out the deployment steps. Even if the analyst deploys
the model it is important for the customer to understand up front the actions which
will need to be carried out in order to actually make use of the created models.
As part of this initiative, NIST produced a working draft of the Big Data Interoperability
Framework (NIST-BDIF)7, which created the following draft standards to help
define and design Big Data implementations:
7 http://bigdatawg.nist.gov/V1_output_docs.php
8 http://bigdatawg.nist.gov/_uploadfiles/M0394_v1_4746659136.pdf
4.1. Overview
The DRSA Methodology consists of seven (7) main stages, which are further broken
down into sub-steps and activities. The seven (7) stages are summarized below:
a. Business Profile
b. Business objectives
c. Problem Statement
a. Inventory of Resources
d. Terminology
This stage focuses on defining and documenting the scope of work, business
requirements, user requirements and system requirements of the project.
a. Project plan
a. Business questions
b. Analytic goals
This stage focuses on acquiring and exploring available data to gain a better
understanding of it, identifying data cleansing needs, identifying opportunities for
data enrichment, and identifying analyses that can be done with the available
data.
a. Data catalog
a. Sample data
b. Data description
c. Data schema
a. Dataset
This stage focuses on the development of data models and analysis algorithms that
process data to produce the results needed by the business.
a. Model selections
b. Modeling assumptions
a. Models
Depending on the results of the model testing and evaluation, development may
go back to Stage 3: Data Acquisition and Exploration to acquire better datasets
to improve the model.
i. Design product
a. Data product
After the data product has been fully developed and tested, it is evaluated
against the business requirements and then rolled out into the production
environment with access to more data. Scheduling and automation of analysis
processes on production data are also configured during this phase.
a. Deployment plan
Stage 7: Monitoring
The project is then monitored for its effectiveness, stability and capacity with
regard to business requirements. Any opportunities for further improvement and
enhancement are recorded for planning the next cycle of improvement.
a. Final report
b. Final presentation