MMABA1 - Data Lake Part 3

This document discusses big data, data types, data collection in big data, and data governance. It provides an overview of Hadoop and its ecosystem for storing and processing large datasets. It also discusses challenges in governing data lakes compared to traditional data warehouses and some common data governance pillars like data stewardship, data quality, and master data management.

Uploaded by

Alfred Wijaya
MMABA1 – BIG DATA – W07

DATA TYPES AND DATA COLLECTION IN BIG DATA AND DATA GOVERNANCE
Lecturer: Sindhu Wardhana
Email: [email protected]
This material belongs to Universitas Prasetiya Mulya
Do not upload and share this material to public domain. For private use only!
Quick Summary

https://drive.google.com/drive/folders/1XM4gGa7X0YXPjTYM-IeJEu3bg2XS_GHW?usp=sharing
Data Types

• Can't fit into a traditional RDBMS
• Needs Hadoop for creating data lakes
• Many tools are available to store it
How Hadoop Works

Core Hadoop Ecosystem
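[Figure: diagram of the core Hadoop ecosystem. The component names did not survive extraction; the annotations on the diagram read:]
• Monitoring the Hadoop ecosystem, cluster, resources, etc.
• SQL query platform
• Scripting
• NoSQL
• Connecting an RDBMS (SQL DB) and Hadoop
• Coordinating clusters (nodes up, down, etc.)
• MapReduce alternative
• YARN alternative
• Transporting web logs (large)
• Collecting data from any source and broadcasting it to Hadoop
• Scheduling jobs on clusters
• Processing streaming data in real time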
Hadoop Ecosystem

[Figure: Hadoop ecosystem diagram with two groups, "External Data Storage" and "Query Engines". Surviving annotations: "Write query for NoSQL DB", "Visual SQL DB", "Query NoSQL DB", "Like OLTP", "NoSQL DB", "Notebook style".]
Hadoop Usage at Kemenkeu (Indonesian Ministry of Finance)

Big Data Tools

Intermezzo: NoSQL and GraphDB

Data Governance Challenges Facing Controllers (researchgate.net)


What is Data Governance

Starting a Data Lake
• Hadoop (storage) + MapReduce (processing) + Spark (in-memory)
• Preventing the proliferation of data puddles (data silos)
• The promise of data science
• Strategies to start:
  1. Offload existing functionality to a big data platform
  2. Data lakes for new projects
  3. Central point of governance
Organizing a Data Lake into Zones
• Landing or Raw/Staging Zone
  - Separated into folders
  - Accessed by the technical team
• Gold/Prod Zone
  - Data is normalized and harmonized
  - Accessible to non-developers via SQL
• Work/Dev/Project Zone
  - Where analytics happens
• Sensitive Zone
  - Access on an as-needed basis
  - Encrypted/redacted data (de-identification)
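
One way to picture the zones is as a small lookup table that maps each zone to a root folder and its usual audience. The sketch below is only an illustration: the `/lake/...` paths, the `Zone` class and the `dataset_path` helper are hypothetical names, and the audience of the work zone is an assumption, not something stated on the slide.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Zone:
    name: str
    root: PurePosixPath      # hypothetical root folder of the zone
    audience: str            # who normally reads from this zone
    encrypted: bool = False  # True when data is stored encrypted or redacted


# Hypothetical layout; the folder names are assumptions for illustration.
ZONES = {
    "raw": Zone("raw", PurePosixPath("/lake/raw"), "technical team"),
    "gold": Zone("gold", PurePosixPath("/lake/gold"), "non-developers via SQL"),
    # Audience assumed: the slide only says analytics happens here.
    "work": Zone("work", PurePosixPath("/lake/work"), "analysts"),
    "sensitive": Zone("sensitive", PurePosixPath("/lake/sensitive"),
                      "as-needed basis", encrypted=True),
}


def dataset_path(zone: str, source: str, dataset: str) -> str:
    """Build the folder of a dataset inside a zone, one folder per source."""
    return str(ZONES[zone].root / source / dataset)


print(dataset_path("raw", "crm", "customers"))
```

Keeping the zone definitions in one place makes it easier to attach different access rules to each zone later.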
Data Governance

https://dzone.com/articles/data-lake-governance-best-practices
Governing Data Lakes vs Traditional Data Warehouses
• Load: The numbers of data sets, users, and changes are extremely high.
• Frictionless ingestion: Because a data lake stores data for future, yet-to-be-determined analytics, it usually ingests the data with minimal, if any, processing.
• Encryption: There are often government or internal regulations that require sensitive or personal information to be protected, yet that data is needed for analysis.
• Exploratory nature of work: Data scientists often do not know what's available in the huge and diverse data store. If analysts cannot find data that they don't have access to, they can't ask for access to it.
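
The last point suggests separating discovery from access: analysts search metadata for every data set, and only the data itself is protected. The sketch below is a minimal illustration under that assumption; `CatalogEntry`, `search`, `access_status` and the sample data set are all hypothetical names, not part of any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)
    readers: set = field(default_factory=set)  # users allowed to read the data


def search(catalog, keyword):
    """Return every entry whose metadata mentions the keyword.

    Access rights are deliberately ignored here, so that analysts can
    discover data sets they are not yet allowed to read.
    """
    kw = keyword.lower()
    return [e for e in catalog
            if kw in e.name.lower() or kw in e.description.lower()]


def access_status(entry, user):
    """Tell the user whether they can read the data or must request access."""
    return "granted" if user in entry.readers else "request access"


# Hypothetical catalog content for illustration only.
catalog = [
    CatalogEntry("tax_returns_2022", "Annual tax return filings",
                 tags={"pii"}, readers={"ani"}),
]

for entry in search(catalog, "tax"):
    print(entry.name, "->", access_status(entry, "budi"))
```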

Common Data Governance Challenges

Common Data Governance Challenges – TDAN.com
Data Governance Pillars
• Data stewardship. Data stewards are accountable for a portion of an organization's data, with job duties in areas such as data quality, security and usage.
• Data quality. Data quality improvement is one of the biggest driving forces behind data governance activities. Data accuracy, completeness and consistency across systems are crucial hallmarks of successful governance initiatives.
• Master data management. MDM initiatives establish a master set of data on customers, products and other business entities to help ensure that the data is consistent in different systems across an organization.
• Data governance use cases. Effective data governance is at the heart of managing the data used in operational systems and the BI and analytics applications fed by data warehouses, data marts and data lakes.
What Is Data Governance and Why Does It Matter? (techtarget.com)
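
Two of the data quality properties named above, completeness and consistency across systems, can be measured with very little code. The sketch below is one possible way to do it, not a standard method; the function names, the field names and the two sample systems are assumptions made for illustration.

```python
def completeness(records, required):
    """Fraction of records with a non-empty value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required) for r in records
    )
    return complete / len(records)


def inconsistent_keys(system_a, system_b, field):
    """Keys present in both systems whose value for `field` differs."""
    shared = system_a.keys() & system_b.keys()
    return sorted(k for k in shared
                  if system_a[k].get(field) != system_b[k].get(field))


# Hypothetical customer data held in two systems.
crm = {1: {"city": "Jakarta"}, 2: {"city": "Bandung"}}
billing = {1: {"city": "Jakarta"}, 2: {"city": "Surabaya"}, 3: {"city": "Medan"}}

print(completeness([{"id": 1, "email": "a@x.com"}, {"id": 2, "email": ""}],
                   ["id", "email"]))
print(inconsistent_keys(crm, billing, "city"))
```

A master data management initiative would aim to bring the list of inconsistent keys down to empty by agreeing on one master record per customer.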

Authorization or Access Control Challenges
To streamline authorization, security admins usually create roles (collections of permissions) and assign those roles to groups of analysts, and they can use single sign-on. However, challenges persist:

• It is hard to predict what data analysts will need for their projects.
• Analysts can analyze data only once they have it, but they cannot tell in advance what data they need.
• Maintaining authorization is costly: new employees, changes of role or project, new data arriving, sensitive data.

Proactive approaches:
• Tag-based data access policy
• De-identifying sensitive data
• Implementing self-service access management
The complete material can be accessed at: 9. Governing Data Access (ebookreading.net)
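
The role-based setup described above (roles as collections of permissions, assigned to groups of analysts) can be sketched as three lookup tables and one check. Everything named here is hypothetical: the users, groups, roles and paths are made up, and real systems would load them from a directory service rather than hard-code them.

```python
# Role -> set of (action, path prefix) permissions. All names are made up.
ROLE_PERMISSIONS = {
    "marketing_analyst": {("read", "/lake/gold/sales")},
    "data_engineer": {("read", "/lake/raw"), ("write", "/lake/raw")},
}
# Group -> roles assigned to the group.
GROUP_ROLES = {
    "analysts": {"marketing_analyst"},
    "platform": {"data_engineer"},
}
# User -> groups the user belongs to.
USER_GROUPS = {"ani": {"analysts"}, "budi": {"platform"}}


def is_allowed(user, action, path):
    """True if any role held through any of the user's groups permits the action."""
    for group in USER_GROUPS.get(user, ()):
        for role in GROUP_ROLES.get(group, ()):
            for allowed_action, prefix in ROLE_PERMISSIONS.get(role, ()):
                under_prefix = path == prefix or path.startswith(prefix + "/")
                if action == allowed_action and under_prefix:
                    return True
    return False


print(is_allowed("ani", "read", "/lake/gold/sales/2023.parquet"))
print(is_allowed("ani", "write", "/lake/gold/sales/2023.parquet"))
```

The maintenance cost mentioned above shows up directly in these tables: every new employee, project change or new data set means editing at least one of them by hand.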

Tag-Based Data Access Policies
• Specify which users and groups of users can have what access to a specific file or folder
• Hadoop tag-based security: Cloudera Navigator, Apache Ranger
• Tag automation: Informatica, Waterline Data, Dataguise

The Problems:
• Huge number of files
• Complex permission schemes
• Determining and setting permissions for every file
• Detecting and addressing schema changes
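
The idea behind tag-based policies is that permissions are written once per tag instead of once per file, so a newly arrived file only needs to be tagged. The sketch below illustrates that idea in plain Python. It does not use or imitate the API of any of the tools named above; the tags, groups, paths and the deny-by-default rule for untagged files are all assumptions.

```python
# File -> tags. In practice a tagging tool would fill this in automatically.
FILE_TAGS = {
    "/lake/raw/crm/customers.csv": {"pii", "customer"},
    "/lake/raw/web/clicks.json": {"clickstream"},
}
# Tag -> groups allowed to read data carrying that tag.
TAG_POLICY = {
    "pii": {"privacy_officers"},
    "customer": {"analysts", "privacy_officers"},
    "clickstream": {"analysts"},
}


def can_read(user_groups, path):
    """Allow a read only if every tag on the file permits one of the user's groups."""
    tags = FILE_TAGS.get(path)
    if tags is None:
        return False  # assumption: untagged files are denied by default
    return all(bool(user_groups & TAG_POLICY.get(tag, set())) for tag in tags)


print(can_read({"analysts"}, "/lake/raw/web/clicks.json"))
print(can_read({"analysts"}, "/lake/raw/crm/customers.csv"))
```

With this design the policy table stays small even when the number of files is huge, which addresses the first three problems in the list.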

De-identification of Sensitive Data

• Process of replacing actual sensitive data with similar made-up data in a way that retains the properties of the original data
• Techniques: cohort/bucketing, tokenization (hash), etc.
• Challenges: mapping to maintain consistency; vulnerable and fragile
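
The two techniques named above can be sketched with the Python standard library. Note one deliberate substitution: the slide mentions a hash, while this sketch uses a keyed hash (HMAC-SHA256), because a plain hash of a low-entropy value such as a phone number can be reversed by hashing every candidate. The key, the 16-character token length and the 10-year bucket width are assumptions for illustration.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a sensitive value with a stable token.

    The same input always yields the same token, so joins across
    tables still work after de-identification.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability (assumption)


def bucket_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the range (cohort) it falls into."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


record = {"national_id": "3174012345678901", "age": 37}
safe = {"national_id": tokenize(record["national_id"]),
        "age": bucket_age(record["age"])}
print(safe)
```

The consistency challenge is visible here: if the key changes, every token changes, and previously de-identified data sets no longer join with new ones.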
Data Sovereignty and Regulatory Compliance

Implementing Self-Service Access Management

Provisioning

7 foundations for successfully governing data and analytics applications (Gartner 2019)
• A focus on business value and organizational outcomes;
• Internal agreement on data accountability and decision rights;
• A trust-based governance model that relies on data lineage and curation;
• Transparent decision-making that hews to a set of ethical principles;
• Risk management and data security included as core governance components;
• Ongoing education and training, with mechanisms to monitor their effectiveness; and
• A collaborative culture and governance process that encourages broad participation.

Summary of Data Sharing Best Practice in Australia

Best Practice Guide to Applying Data Sharing Principles
July 2019
Australia established the Office of the National Data Commissioner (ONDC) in July 2018. The ONDC is responsible for implementing a data sharing framework that improves access to and reuse of public data while still safeguarding privacy and security.

It produced guidance to help data holders share data safely and effectively using the Five Data Sharing Principles (FDSP).

The FDSP are adopted from principles developed by the UK Office for National Statistics. The guidance includes illustrations of how the FDSP are applied.

Its principles can be adopted.


Things to consider in data exchange

1. Data sharing request: (a) is the data available and suitable? (b) can it be shared legally? (c) is there any sensitive data?
2. Data exchange agreement: (a) made between the data custodian and another organization, (b) states that the purpose test has met the requirements, (c) specifies sanctions for violations
3. Considering what is best for the user's interests: (a) the data is shared, or (b) access to the data is given in a dedicated secure environment.
4. Capability and culture: (a) assess internal skills and capabilities, (b) change the culture so that risk is managed
5. Ensuring clear responsibility for the dataset: whether or not there is joint responsibility for the dataset
6. Data exchange governance: (a) governance is needed so that data custodians are confident, (b) good governance requires transparency, (c) a streamlined process can be created for greater efficiency
7. Cost: if costs are charged to users, this needs to be communicated and documented.

[Figure: Data exchange process]
The Five Data Sharing Principles
1. Projects: Data is shared for an appropriate purpose that serves the public interest.
2. People: Users have the appropriate authorization to access the data.
3. Settings: The data sharing environment minimizes the risk of unauthorized use or disclosure.
4. Data: Appropriate and proportionate protections are applied to the data.
5. Output: Outputs of data sharing are appropriately safeguarded before or after publication.

The main concept of the data sharing principles and framework is not to restrict, but to balance the benefits of using data against risk management controls and treatments, so as to maximize the benefits obtained.

[Figure: Application of the FDSP]
Recap

• Data Types
• Hadoop Ecosystem
• NoSQL and GraphDB
• Data Governance

Next Week – Mid Exam (20% score)

UTS (Mid-Semester Exam) – July 6th (18:30 – 21:00)
• Class 1-7 materials
• Questions will be given through the WhatsApp group at the beginning of the exam.
• Essay, open book, using Zoom with video ON.
• Remarks:
  - You must answer ALL questions.
  - If you have any queries about a question or believe there is an error in the question while the exam is in session, briefly explain your understanding of and assumptions about that question before attempting it.
  - Include the following particulars in your submission: Course Code and Full Name, and name your submission file "CourseCode_FullName".
  - All answers must be typed in a Word document using Times New Roman, font size 12, single spacing.
  - For answers that cannot be typed and must be handwritten, you may either scan or photograph your work and insert the images in the Word document, in order. All uploaded handwritten answers must be clear, readable and complete. Marks will not be awarded for unreadable or incomplete images.
  - Please submit only ONE (1) file (<50 MB) in PDF or Word format, by email to [email protected], within the time limit.
  - To prevent plagiarism and collusion, your submission will be reviewed thoroughly by Turnitin. The Turnitin report will only be made available to the markers. The university takes a tough stance against plagiarism and collusion.
END OF SESSION
Lecturer: Sindhu Wardhana M.Com
Email: [email protected]
WA: +62815 399 29 499
