Data Science
Session 2. Introduction to Various DS Methods (Introduction to Data Science)
COLORING THE GLOBAL FUTURE http://www.gunadarma.ac.id
Types of Methodology
● Methodology for technical activities
● Methodology for business (and technical) activities
Types of Development
● Each type of system requires a different methodology
● The architect's first task is to understand which SIM model is needed
Strategy P1
● The program is written from scratch
● It starts from algorithms and data structures
● The algorithms and data structures are implemented in the chosen programming language
● Programming uses a plain editor or a simple IDE
Strategy P2
● Uses ready-made components that were developed earlier (by others or by yourself)
● Ready-made components:
– Subroutines or functions
– Libraries
– Interpreter (embedded DSL)
Strategy P3
● Uses small ready-made units or programs that can be composed into one (glue)
● Widely applied in Unix environments (one small program that has one function)
● Makes use of "pipe" and "redirect"
● Example: cat Fileku.dat | sort | uniq
Strategy P4
● Makes use of services available through an Application Program Interface (API): Google API, Twitter API, Facebook API, etc.
● There is no need to understand the internals; what matters is the semantics of calling the services (REST, non-REST)
● Know the data structure of the service results (JSON, BSON, XML, others)
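The pipeline `cat Fileku.dat | sort | uniq` can be mirrored in Python as a chain of small single-purpose functions. This is only a sketch of the "glue" idea: the function names and the sample lines are made up for illustration, and Python sorts by code point, which may differ from the locale-aware order of the shell `sort`.

```python
from typing import Iterable, Iterator


def sort_lines(lines: Iterable[str]) -> Iterator[str]:
    """Like `sort`: yield the lines in ascending order."""
    yield from sorted(lines)


def uniq(lines: Iterable[str]) -> Iterator[str]:
    """Like `uniq`: drop a line when it repeats the previous one."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def pipeline(lines: Iterable[str]) -> list[str]:
    """Glue the small steps together, as the shell pipe does."""
    return list(uniq(sort_lines(lines)))


# Hypothetical file content, one entry per line.
sample = ["pear", "apple", "pear", "fig", "apple"]
result = pipeline(sample)
print(result)
```

Each function does one job and knows nothing about the others, so steps can be swapped or reordered without rewriting the rest.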
Development Methodology
An iterative method for solving problems with data and data science through a defined sequence of steps
SEMMA Methodology
● Sample: Take a sample of the data. This stage is optional.
● Explore: Explore the data for unexpected patterns and anomalies in order to gain understanding and ideas.
● Modify: Modify the data by creating, selecting, and transforming variables to focus the model selection process.
● Model: Model the data by letting software automatically search for a combination of data that reliably predicts a desired outcome.
● Assess: Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and how well it performs.
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
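The five SEMMA stages can be traced in a few lines of plain Python. The sketch below is not SAS Enterprise Miner code: the data is randomly generated, the variable names are hypothetical, and the "model" is just a single threshold chosen for illustration.

```python
import random
import statistics

# Hypothetical toy data: (monthly_spend, churned) pairs, invented for illustration.
random.seed(0)
records = [(random.gauss(30, 5), 1) for _ in range(200)] + \
          [(random.gauss(50, 5), 0) for _ in range(200)]

# Sample: work on a random subset of the data (an optional step).
sample = random.sample(records, 100)

# Explore: compute simple summaries to spot patterns or oddities.
spend = [x for x, _ in sample]
mean, stdev = statistics.mean(spend), statistics.stdev(spend)

# Modify: transform the variable (z-score standardization).
modified = [((x - mean) / stdev, y) for x, y in sample]


def accuracy(threshold, data):
    """Share of rows where 'spend below threshold' matches the churn label."""
    hits = sum((x < threshold) == bool(y) for x, y in data)
    return hits / len(data)


# Model: search the training rows for the threshold that separates the classes best.
train, test = modified[:70], modified[70:]
best = max((x for x, _ in train), key=lambda t: accuracy(t, train))

# Assess: evaluate the chosen threshold on rows the search never saw.
test_accuracy = accuracy(best, test)
print(f"threshold={best:.2f} test accuracy={test_accuracy:.2f}")
```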
● TDSP (Team Data Science Process) cycle:
– Business understanding
– Data acquisition and understanding
– Modeling
– Deployment
– Customer acceptance
● Roles in a project:
– Solution architect
– Project manager
– Data engineer
– Data scientist
– Application developer
– Project lead
https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
AI Project Cycle
● Problem Scoping – Understand the problem by identifying the factors that affect it and the goals of the project. This activity tries to define:
– Who – who is affected by the problem, directly or indirectly; these are the stakeholders.
– What – the nature of the problem, together with evidence that the problem you have selected actually exists.
– Where – the situation and location in which the problem arises.
– Why – why the problem is worth solving.
● Data Acquisition – The process of collecting accurate and reliable data so that it can be processed. Data may be text, video, images, audio, or other forms, collected from sources such as the internet, newspapers, and social media.
● Data Exploration – Organize the data so that it can be processed well. Data can be arranged as tables, plots, or databases.
● Modelling – Build a model from the data by trying several models on the visualized data, weighing the advantages and disadvantages of each.
● Evaluation – Evaluate the project by examining the output the system produces when data is fed to the model and comparing it with the actual output.
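The Evaluation step, comparing model output with actual output, can be sketched as below. The function name and the labels are hypothetical; accuracy is used only as the simplest possible comparison.

```python
def evaluate(predicted, actual):
    """Return accuracy and the positions where the model was wrong."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equally long")
    errors = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
    accuracy = (len(actual) - len(errors)) / len(actual)
    return accuracy, errors


# Hypothetical model output versus the true labels.
predicted = ["cat", "dog", "dog", "cat", "dog"]
actual = ["cat", "dog", "cat", "cat", "dog"]
accuracy, errors = evaluate(predicted, actual)
print(accuracy, errors)
```

Returning the error positions as well as the score makes it easier to go back to the data and inspect what the model got wrong.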
What about Indonesia?
National Work Competency Standards (Standar Kompetensi Kerja Nasional):
Minister of Manpower Decree (KepMen Ketenagakerjaan) No. 299 of 2020
Example UG
As more companies make the transition to selling over the Web, an established
computer/electronics e-retailer is facing increasing competition from newer sites. Faced with the
reality that Web stores are cropping up as fast (or faster!) than customers are migrating to the Web,
the company must find ways to remain profitable despite the rising costs of customer acquisition.
One proposed solution is to cultivate existing customer relationships in order to maximize the value
of each of the company’s current customers.
Thus, a study is commissioned with the following objectives:
Improve cross-sales by making better recommendations.
Tentatively, the study will be judged a success if:
Cross-sales increase by 10%.
Customers spend more time and see more pages on the site per visit.
The study finishes on time and under budget.
Business Understanding UG
Your first task is to try to gain as much insight as possible into the business goals
for data mining. This may not be as easy as it seems, but you can minimize later
risk by clarifying problems, goals, and resources.
Task List
Start gathering background information about the current business situation.
Document specific business objectives decided upon by key decision makers
Agree upon criteria used to determine data mining success from a business perspective.
Business Understanding
● Inventory of Resources
● Requirements, Assumptions, Constraints
● Risk and Contingency
● Terminology
● Cost and Benefit
● Project Plan
● Initial Assessment of Tools and Techniques
You must start with a clear understanding of:
● A problem that your management wants to address
● The business goals
● Constraints (limitations on what you may do, the kinds of solutions that can be used, when the work must be completed, and so on)
● Impact (how the problem and possible solutions fit in with the business)
Deliverables
Background: Explain the business situation that drives the project. This item, like many that follow, amounts
only to a few paragraphs.
Business goals: Define what your organization intends to accomplish with the project. This is usually a
broader goal than you, as a data miner, can accomplish independently. For example, the business goal might
be to increase sales from a holiday ad campaign by 10 percent year over year.
Business success criteria: Define how the results will be measured. Try to get clearly defined quantitative
success criteria. If you must use subjective criteria (hint: terms like gain insight or get a handle on imply
subjective criteria), at least get agreement on exactly who will judge whether or not those criteria have been
fulfilled.
Business Background UG
Understanding your organization’s business situation helps you know what you’re working with in terms of:
● Available resources (personnel and material)
● Problems
● Goals
Task 1—Determine Organizational Structure
● Develop organizational charts to illustrate corporate divisions, departments, and project groups. Be sure to include managers’ names and responsibilities.
● Identify key individuals in the organization.
● Identify an internal sponsor who will provide financial support and/or domain expertise.
● Determine whether there is a steering committee and procure a list of members.
● Identify business units that will be affected by the data mining project.
Task 2—Describe Problem Area
● Identify the problem area, such as marketing, customer care, or business development.
● Describe the problem in general terms.
● Clarify the prerequisites of the project. What are the motivations behind the project? Does the business already use data mining?
● Check on the status of the data mining project within the business group. Has the effort been approved, or does data mining need to be “advertised” as a key technology for the business group?
● If necessary, prepare informational presentations on data mining to your organization.
Task 3—Describe Current Solution
● Describe any solutions currently used to address the business problem.
● Describe the advantages and disadvantages of the current solution. Also, address the level of acceptance this solution has had within the organization.
Business Objective UG
This is where things get specific. As a result of your research and meetings, you
should construct a concrete primary objective agreed upon by the project
sponsors and other business units affected by the results. This goal will
eventually be translated from something as nebulous as “reducing customer
churn” to specific data mining objectives that will guide your analytics.
Task
Describe the problem you want to solve using data mining.
Specify all business questions as precisely as possible.
Determine any other business requirements (such as not losing any existing customers while increasing
cross-sell opportunities).
Specify expected benefits in business terms (such as reducing churn among high-value customers by
10%).
Business success criteria fall into two categories:
Objective. These criteria can be as simple as a specific increase in the accuracy of audits or an agreed-upon
reduction in churn.
Subjective. Subjective criteria such as “discover clusters of effective treatments” are more difficult to pin
down, but you can agree upon who makes the final decision.
Task List
As precisely as possible, document the success criteria for this project.
Make sure each business objective has a correlative criterion for success.
Align the arbiters of the subjective measurements of success. If possible, take notes on their expectations.
Inventory of resources: A list of all resources available for the project. These may include
people (not just data miners, but also those with expert knowledge of the business problem,
data managers, technical support, and others), data, hardware, and software.
Requirements, assumptions, and constraints: Requirements will include a schedule for
completion, legal and security obligations, and requirements for acceptable finished work. This
is the point to verify that you’ll have access to appropriate data!
Risks and contingencies: Identify causes that could delay completion of the project, and
prepare a contingency plan for each of them. For example, if an Internet outage in your office
could pose a problem, perhaps your contingency could be to work at another office until the
outage has ended.
Terminology: Create a list of business terms and data-mining terms that are relevant to your
project and write them down in a glossary with definitions (and perhaps examples), so that
everyone involved in the project can have a common understanding of those terms.
Costs and benefits: Prepare a cost-benefit analysis for the project. Try to state all costs and
benefits in dollar (euro, pound, yen, and so on) terms. If the benefits don’t significantly exceed
the costs, stop and reconsider this analysis and your project.
What sort of data Example
Resource inventory UG
Task 1—Research Hardware Resources
What hardware do you need to support?
Task 2—Identify Data Sources and Knowledge Stores
Which data sources are available for data mining? Take note of data types and formats.
How are the data stored? Do you have live access to data warehouses or operational databases?
Do you plan to purchase external data, such as demographic information?
Are there any security issues preventing access to required data?
Task 3—Identify Personnel Resources
Do you have access to business and data experts?
Have you identified database administrators and other support staff that may be needed?
Once you have asked these questions, include a list of contacts and resources for the phase report.
Task 1—Determine Requirements. The fundamental requirement is the business goal
discussed earlier, but consider the following:
Are there security and legal restrictions on the data or project results?
Is everyone aligned on the project scheduling requirements?
Are there requirements on results deployment (for example, publishing to the Web or reading scores into a database)?
Task 2—Clarify Assumptions
Are there economic factors that might affect the project (for example, consulting fees or competitive products)?
Are there data quality assumptions?
How does the project sponsor/management team expect to view the results? In other words, do they want to
understand the model itself or simply view the results?
Task 3—Verify Constraints
Do you have all passwords required for data access?
Have you verified all legal constraints on data usage?
Are all financial constraints covered in the project budget?
Types of risks include:
Scheduling (What if the project takes longer than anticipated?)
Financial (What if the project sponsor encounters budgetary problems?)
Data (What if the data are of poor quality or coverage?)
Results (What if the initial results are less dramatic than expected?)
Task List
Document each possible risk.
Document a contingency plan for each risk.
Terminology
To ensure that business and data mining teams are “speaking the same
language,” you should consider compiling a glossary of technical terms and
buzzwords that need clarification. For example, if “churn” for your business has
a particular and unique meaning, it is worth explicitly stating that for the benefit
of the whole team. Likewise, the team may benefit from clarification of the
usage of a gains chart.
Task List
Keep a list of terms or jargon confusing to team members. Include both business and data mining
terminology.
Consider publishing the list on the intranet or in other project documentation.
Cost/Benefit Analysis
This step answers the question, What is your bottom line? As part of the final
assessment, it’s critical to compare the costs of the project with the potential
benefits of success.
Task List
Include in your analysis estimated costs for:
Data collection and any external data used
Results deployment
Operating costs
Then, take into account the benefits of:
The primary objective being met
Additional insights generated from data exploration
Possible benefits from better data understanding
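The comparison of costs and benefits comes down to simple arithmetic. In the sketch below the categories follow the task list above, while every amount is a hypothetical placeholder in a single currency unit.

```python
# All figures are hypothetical placeholders, expressed in one currency unit.
costs = {
    "data collection and external data": 20_000,
    "results deployment": 15_000,
    "operating costs": 10_000,
}
benefits = {
    "primary objective met": 60_000,
    "additional insights from data exploration": 5_000,
    "better data understanding": 3_000,
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())
net_benefit = total_benefit - total_cost
benefit_cost_ratio = total_benefit / total_cost

print(f"net benefit: {net_benefit}, benefit/cost ratio: {benefit_cost_ratio:.2f}")
```

A ratio only slightly above 1 would suggest revisiting the analysis, since the estimates on both sides are uncertain.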
Data-mining goals: Define data-mining deliverables, such as models, reports,
presentations, and processed datasets.
Data-mining success criteria: Define the data-mining technical criteria
necessary to support the business success criteria. Try to define these in
quantitative terms (such as model accuracy or predictive improvement
compared to an existing method). If the criteria must be qualitative, identify the
person who makes the assessment.
Classification, Regression / Estimation, Clustering, Association
These data mining goals, if met, can then be used by the business to reduce churn among
the most valuable customers. As you can see, business and technology must work hand-
in-hand for effective data mining. Read on for specific tips on how to determine data
mining goals.
Task List – Data mining goals
Describe the type of data mining problem, such as clustering, prediction, or classification.
Document technical goals using specific units of time, such as predictions with a three-month validity.
If possible, provide actual numbers for desired outcomes, such as producing churn scores for 80% of existing customers.
With the help of its data mining consultant, the e-retailer has been
able to translate the company’s business objectives into data mining
terms. The goals for the initial study to be completed this quarter are:
Use historical information about previous purchases to generate a model
that links “related” items. When users look at an item description, provide
links to other items in the related group (market basket analysis).
Use Web logs to determine what different customers are trying to find, and
then redesign the site to highlight these items. Each different customer
“type” will see a different main page for the site (profiling).
Use Web logs to try to predict where a person is going next, given where he
or she came from and has been on your site (sequence analysis).
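The first goal, linking "related" items from purchase history, can be illustrated by counting how often two items appear in the same order. This is a simplified co-occurrence count rather than a full market basket algorithm, and the orders, item names, and the `min_count` cutoff are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each inner list is one order.
orders = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["monitor", "hdmi cable"],
    ["laptop", "laptop bag"],
    ["monitor", "hdmi cable", "mouse"],
]

pair_counts = Counter()
for order in orders:
    # Count each unordered pair of distinct items once per order.
    for a, b in combinations(sorted(set(order)), 2):
        pair_counts[(a, b)] += 1


def related_items(item, min_count=2):
    """Items bought together with `item` in at least `min_count` orders."""
    related = []
    for (a, b), count in pair_counts.items():
        if count >= min_count and item in (a, b):
            related.append(b if a == item else a)
    return sorted(related)


print(related_items("laptop"))
print(related_items("monitor"))
```

The result of `related_items` is what an item description page could use to show links to other items in the related group.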
Success must also be defined in technical terms to keep your data mining efforts
on track. Use the data mining goal determined earlier to formulate benchmarks
for success.
Task List
Describe the methods for model assessment (for example, accuracy, performance, etc.).
Define benchmarks for evaluating success. Provide specific numbers.
Define subjective measurements as best you can and determine the arbiter of success.
Consider whether the successful deployment of model results is part of data mining success.
Start planning now for deployment.
Project plan: Outline your step-by-step action plan for the project. Expand the
outline with a schedule for completion of each step, required resources, inputs
(such as data or a meeting with a subject matter expert), and outputs (such as
cleaned data, a model, or a report) for each step, and dependencies (steps that
can’t begin until this step is completed). Explicitly state that certain steps must
be repeated (for example, modeling and evaluation usually call for several back-
and-forth repetitions).
Initial assessment of tools and techniques: Identify the required capabilities for
meeting your data-mining goals and assess the tools and resources that you
have. If something is missing, you have to address that concern very early in the
process.
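The dependencies in a project plan (steps that cannot begin until another step is completed) can be written down as a small graph and ordered automatically. The sketch uses the CRISP-DM phase names as steps and assumes a purely linear plan; a real plan would also record schedules, resources, and the repeated modeling and evaluation loops, which this example leaves out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each step maps to the steps that must finish first.
dependencies = {
    "business understanding": set(),
    "data understanding": {"business understanding"},
    "data preparation": {"data understanding"},
    "modeling": {"data preparation"},
    "evaluation": {"modeling"},
    "deployment": {"evaluation"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```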
Project Plan UG
The project plan is the master document for all of your data mining work. If done well,
it can inform everyone associated with the project of the goals, resources, risks, and
schedule for all phases of data mining. You may want to publish the plan, as well as
documentation gathered throughout this phase, to your company’s intranet.
Task List. When creating the plan, be sure you’ve answered the following questions:
Have you discussed the project tasks and proposed plan with everyone involved?
Are time estimates included for all phases or tasks?
Have you included the effort and resources needed to deploy the results or business solution?
Are decision points and review requests highlighted in the plan?
Have you marked phases where multiple iterations typically occur, such as modeling?
From a data mining perspective:
How specifically can data mining help you meet your business goals?
Do you have an idea about which data mining techniques might produce the best results?
How will you know when your results are accurate or effective enough? (Have we set a measurement of data mining
success?)
How will the modeling results be deployed? Have you considered deployment in your project plan?
Does the project plan include all phases of CRISP-DM?
Are risks and dependencies called out in the plan?
Current situation
– Inventory of resources
– Requirements, assumptions, and constraints
– Risks and contingencies
– Terminology
– Costs and benefits
Data-mining goals
– Data-mining goals
– Data-mining success criteria
Project plan
– Project plan
– Initial assessment of tools and techniques
Data Science Project Documentation
APPLICATION EXAMPLE
Case: Credit Default
● Problem: How to reduce a bank's non-performing loans (NPL)
● Question: How to improve the credit score calculation
● Measurable outcomes: % reduction in defaulted loans
● Analytic task: Classification
● Performance metrics: F1-Score
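The F1-Score named as the performance metric is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of the calculation follows; the labels are hypothetical and the function is written out by hand rather than taken from a library.

```python
def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive prediction, so F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = loan defaulted, 0 = loan repaid.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
score = f1_score(actual, predicted)
print(f"F1 = {score:.3f}")
```

F1 suits credit default data because defaults are usually the minority class, where plain accuracy can look good even when most defaults are missed.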
C. Clustering: Grouping cases based on similarity
• Segmenting bank customers
• Grouping patients with similar cases
F. Sequence Mining: Predicting what will happen next from the current state
• Predicting whether a customer will cancel a subscription
• Determining the flow of e-commerce transactions
Data Understanding
Getting to know and exploring the data at hand
Why it is necessary to get to know the data at hand
• The United States armed forces faced a dilemma during the war,
because returning bomber planes were riddled with bullet holes and
they needed better ways to protect them
• “Where should they put it?”
• When they plotted out the damage
these planes were incurring, it was
spread out, but largely concentrated
around the tail, body and wings.
• Should they upgrade these sections?
Data Understanding
Validating data: assessing whether the quality of the data fits the problem to be solved
Data Preparation
Improving data quality for modeling
03 Constructing data: adding features and transforming data
– Additional features (feature engineering)
– Data transformation (standardization, transformation)
04 Integrating data: combining data
– Combined data
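The two preparation steps, constructing data and integrating data, can be sketched with plain dictionaries. The customer ids, column names, and values below are hypothetical and only illustrate standardization, one engineered feature, and a join on a shared key.

```python
import statistics

# Hypothetical source tables, both keyed by customer id.
incomes = {"c1": 4.0, "c2": 6.0, "c3": 8.0}
late_payments = {"c1": 0, "c2": 3, "c3": 1}

# Transformation: z-score standardization of income (population std. dev.).
values = list(incomes.values())
mean = statistics.mean(values)
stdev = statistics.pstdev(values)
income_z = {cid: (value - mean) / stdev for cid, value in incomes.items()}

# Feature engineering: derive a new flag from an existing column.
ever_late = {cid: int(count > 0) for cid, count in late_payments.items()}

# Integration: join the sources on the customer ids they share.
combined = {
    cid: {
        "income_z": income_z[cid],
        "late_payments": late_payments[cid],
        "ever_late": ever_late[cid],
    }
    for cid in sorted(incomes.keys() & late_payments.keys())
}
print(combined)
```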
Modelling
Developing the Model (Knowledge)
02 Building the model: developing a model with ML techniques
– Algorithm execution
– Parameter tuning
– Measuring performance metrics
[Diagram: Data → Split → Training data and Test data]
Building the Model - 1
[Diagram: Training data + ML technique → Model]
Modelling - 2
[Diagram: Test data + Model → Decision]
Evaluating the Model
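The flow of splitting the data, building a model on the training data, and applying it to the test data can be sketched end to end. The nearest-centroid classifier here merely stands in for "an ML technique"; the data, function names, and split fraction are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_centroids(train):
    """'ML technique': learn the mean feature value for each label."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """'Decision': pick the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


# Hypothetical data: (feature, label) pairs.
data = [(1.0, "low"), (1.2, "low"), (0.8, "low"), (1.1, "low"),
        (3.0, "high"), (3.2, "high"), (2.9, "high"), (3.1, "high")]

train, test = train_test_split(data)
model = fit_centroids(train)
correct = sum(predict(model, x) == label for x, label in test)
print(f"test accuracy: {correct / len(test):.2f}")
```

Keeping the test rows out of `fit_centroids` is the point of the split: the model is judged only on data it did not see during training.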
Data Scientist
Requires an understanding of theory and programming
Also soft skills such as communication and entrepreneurship
So it is not only a technical matter
02 Data Scientist: develops the best model from the data to answer business problems
03 Data Engineer: prepares (big) data to be processed or modeled
04 Data Analyst: analyzes data and finds insights in it (and presents them in dashboards)
● A Data Engineer is more than just a Database Administrator
● A Data Analyst understands visualization and how to draw conclusions from it
● They are supported by an IT team that understands technical matters (programming, deployment)
NATIONAL OCCUPATIONS (OKUPASI NASIONAL)
Certifications: EITC/AI/AIF Artificial Intelligence Fundamentals [v1r2]; IABAC (International Association of Business Analytics Certification)
Areas: Data Science, AI/ML
Level 7: Data Engineer, Data Scientist, AI/ML Applied Research
Level 6: Associate Data Engineer, Associate Data Scientist, Associate AI/ML Engineer
THANK YOU
EdgeAI