Hadoop Training #1: Thinking at Scale
You know your data is big – you found Hadoop. What implications must you consider when working at this scale? This lecture addresses common challenges and general best practices for scaling with your data.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
• Max data per computer: 12 TB
• Data processed by Google every month: 400 PB … in 2007
• Average job size: 180 GB
• Time it would take to read that sequentially off a single drive: 45 minutes
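A quick sanity check on that last figure, assuming a sequential read rate of roughly 65 MB/s for a 2007-era drive (an illustrative number, not from the slides): 180 GB ÷ 65 MB/s ≈ 2,770 s, or about 46 minutes.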
Sharing is Slow
• Grid computing: not new – MPI, PVM, Condor…
• Grid focus: distribute the workload
  – NetApp filer or other SAN drives many compute nodes
• Modern focus: distribute the data
  – Reading 100 GB off a single filer would leave nodes starved – just store data locally
Sharing is Tricky
• Exchanging data requires synchronization
  – Deadlock becomes a problem
• Finite bandwidth is available
  – Distributed systems can “drown themselves”
  – Failovers can cause cascading failure
• Temporal dependencies are complicated
  – Difficult to reason about partial restarts
Reliability Demands
• Scalability
  – Adding load to a system should cause a graceful decline in performance, not outright failure
  – Increasing resources should support a proportional increase in load capacity
A Radical Way Out…
• Nodes talk to each other as little as possible – maybe never
  – “Shared nothing” architecture
• Programmers are not allowed to explicitly communicate between nodes
• Data is spread throughout the machines in advance; computation happens where the data is stored
Motivations for MapReduce
• Data processing: > 1 TB
• Massively parallel (hundreds or thousands of CPUs)
• Must be easy to use
  – High-level applications written in MapReduce
  – Programmers don’t worry about socket(), etc.
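To make the “easy to use” point concrete, here is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API; it mirrors the standard example from the Hadoop documentation, and API details (e.g. Job.getInstance) vary slightly across Hadoop versions:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The programmer writes only map() and reduce(); partitioning, shuffling, and all inter-node communication are handled by the framework, which is exactly the socket()-free experience the slide is describing.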
Locality
• Master program divvies up tasks based on the location of the data: it tries to run map tasks on the same machine as the physical file data, or at least on the same rack
• Map task inputs are divided into 64–128 MB blocks: the same size as filesystem chunks (see the sketch below)
  – Process components of a single file in parallel
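The split sizing mentioned above can be bounded per job; a minimal sketch, assuming the org.apache.hadoop.mapreduce API (the 64–128 MB figures simply echo the slide, and the actual default tracks the HDFS block size):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "locality demo");

    // Input splits normally line up with HDFS block boundaries, so each map
    // task can read its block from a local disk. These calls only bound the
    // split size; by default it follows the file's block size.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
  }
}
```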
Fault Tolerance
• Tasks designed for independence
• Master detects worker failures
• Master re-executes tasks that fail while in progress
• Restarting one task does not require communication with other tasks
• Data is replicated to increase availability, durability
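Replication is handled by the filesystem rather than by individual jobs; a minimal sketch of requesting it for one file through the HDFS client API (the path and replication factor are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Ask HDFS to keep 3 copies of this (hypothetical) file; the namenode
    // re-replicates its blocks if a datanode holding one of them fails.
    fs.setReplication(new Path("/data/input/logs.txt"), (short) 3);
  }
}
```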
Optimizations
• No reduce can start until the map phase is complete:
  – A single slow disk controller can rate-limit the whole process
• Master redundantly executes “slow-moving” map tasks; uses the results of whichever copy finishes first
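This redundant execution of stragglers is Hadoop’s speculative execution, and it can be toggled per job; a sketch assuming Hadoop 2.x-style property names (older releases used mapred.map.tasks.speculative.execution and its reduce counterpart):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Allow the framework to launch backup copies of straggling tasks;
    // whichever attempt finishes first wins, and the others are killed.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);

    Job job = Job.getInstance(conf, "speculation demo");
  }
}
```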
Conclusions
• Computing with big datasets is a fundamentally different challenge than doing “big compute” over a small dataset
• New ways of thinking about problems needed
  – New tools provide means to capture this
  – MapReduce, HDFS, etc. can help