554_cheatsheet
Hadoop: scalable, reliable (redundant storage, failover across nodes and racks), flexible (store all data types in any format; apply schema on analysis and sharing of the data), economical (uses commodity hardware). HDFS is write-once: appends are permitted, no random writes. Hadoop lets you take advantage of clustered servers: distribute the computation among all servers and collect the results. Spark: in-memory computation increases the speed of data processing over MR (supports graph algorithms and ML); excellent for low latency and large amounts of data.
HDFS Concepts: name nodes (supervisors) and data nodes (workers). Name nodes manage file system metadata (the file system directory tree, access permissions, and block location data); metadata is kept in memory for fast access and stored on disk; the name node communicates with the data nodes. YARN: resource management for the cluster.
HiveQL: works on HDFS; batch: queries have overhead due to MapReduce and work best on large datasets. Hive DOESN'T support schema on write: you can only write once a table is defined. Hive is schema on read: the schema can be defined after the data file it describes already exists (see the schema-on-read sketch below). HBase integrates with Hive to support record-level operations and online analytical processing. Partitioning is very common for external tables. Pig: faster to write, test, and deploy than MR; the better choice for most analysis and processing tasks. Pig Latin is a dataflow language (it is a DAG). Pig has join and order-by operators that handle the case of one reducer getting 10 or more times the data of the other reducers, and it can do early error checking and optimizations, unlike MapReduce.
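A minimal sketch of the schema-on-read idea in Python (the raw file exists first; a schema is applied only when the data is read; the file contents and field names here are assumptions for illustration, not Hive internals):

    import io

    # the data file exists before any table/schema is defined (schema on read)
    raw = io.StringIO("1,alice,34\n2,bob,17\n")

    # "CREATE EXTERNAL TABLE"-style schema, declared after the fact (assumed fields)
    schema = [("id", int), ("name", str), ("age", int)]

    rows = []
    for line in raw:
        values = line.strip().split(",")
        rows.append({col: typ(v) for (col, typ), v in zip(schema, values)})
    print(rows)   # schema is applied at query time, not at write time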
Distributed Systems: a collection of independent computers that appears as a single coherent system. UNSHARED: CPU, RAM, disk, peripherals, accelerators, network cards, clock. Racks are composed of machines; communication is faster among adjacent machines, slower across racks. | Coherency (Transparency): how much can I see and access from a machine without feeling I'm on a different machine (access, location, concurrency, replication, failure, mobility, performance, scaling). Distributed Apps -> Middleware Services (Hadoop). Pros: price/performance, redundancy (if a server fails, you can rely on others), large-scale parallelization, extensibility (old code -> distributed version where parallelization takes place). Cons: a reliable system built from unreliable components, administration/maintenance, data/code locality, coherency and coordination (shared-state overhead: inserting nodes into a table must ensure state is updated), debugging is hard, non-parallelizable tasks.
SCALABILITY: Characteristics [increased usage/utilization, increased resource requirements (storage/compute), maintenance and extensibility]. Types: [vertical and horizontal]. *Pros vertical: Moore's law, multicore + accelerators. *Cons vertical: backplane latency and bandwidth, storage (mem/disk) latency and bandwidth. *Pros horizontal: cost, software maturity. *Cons horizontal: additional complexity, consistency (coherency), errors/fault-tolerance. STORAGE: Shared Everything (HPC): PFS (parallel file system); NFS and Lustre are examples. Shared Nothing (Big Data): DFS; HDFS is an example.
Fault Tolerance: RELIABILITY: Fault: causes a failure. Error: may lead to a failure, which can be recoverable or not recoverable; severity. Failure: deviation from specification [RTO: recovery time objective, RPO: recovery point objective (how long the service is down and how much data we lost)]. Faults can be transient (one time), intermittent (occasional/random, associated with complexity), or persistent (remain until repaired). CONCERNS: Availability, Reliability, Safety, Maintainability. CAVEATS: Fault -> Error -> Failure does not always happen; behavior outside of specification is not always a failure. FAILURE MODES: Crash (complete failure/inactive state), Omission (I/O, failure to send/receive comms), Timing (response/processing outside of thresholds, i.e. timeout), Response (incorrect output/result), Arbitrary (catch-all: inconsistent/unknown results or behavior). RECOVERY MECHANISMS (COMPENSATION): Monitoring/Alerting, Redundant/Backup Components, Checkpoints/Logs.
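A minimal sketch of how RTO and RPO translate into numbers (the incident values are assumptions, not from the course material): the RPO is bounded by the age of the last checkpoint, the RTO by the time to detect plus the time to recover.

    # hypothetical illustration of RTO/RPO; all values are assumed
    CHECKPOINT_INTERVAL_MIN = 15   # checkpoint/log flush every 15 minutes
    DETECT_MIN = 5                 # time to detect the failure (monitoring/alerting)
    RESTORE_MIN = 20               # time to restore from the redundant/backup component

    worst_case_data_loss = CHECKPOINT_INTERVAL_MIN       # bounds the RPO
    worst_case_downtime = DETECT_MIN + RESTORE_MIN       # bounds the RTO

    print(f"RPO bound: {worst_case_data_loss} min of lost writes")
    print(f"RTO bound: {worst_case_downtime} min of service downtime")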
Hadoop Core: CONCEPTUALLY: Compute (if a task is too large, run it distributed across a system), Storage (if the data is too large, store it distributed across a system). Hadoop Core and the Hadoop Ecosystem are two different things. CHARACTERISTICS: Scalable, Reliable, Flexible, and Economical.
Hadoop Ecosystem: additional tools that work with the Hadoop framework (Pig, Hive, Spark, and HBase). DATA PROCESSING: characterized by data parallelism across multiple nodes/machines (a cluster), resulting in distributed storage and distributed compute. *Fault-tolerance -> K-safety. *Performance -> storage redundancy and compute locality (efficiency). *Hadoop cluster: DFS -> read() blocks -> map -> write() DFS -> shuffle/sort -> read() blocks -> reduce -> write() HDFS.
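A minimal sketch of that read -> map -> shuffle/sort -> reduce dataflow in Python (an in-memory stand-in for HDFS blocks; the data and names are assumptions, not Hadoop APIs):

    from itertools import groupby
    from operator import itemgetter

    blocks = [["a b a"], ["b a"]]            # stand-in for HDFS blocks of text records

    # map: each record -> intermediate key/value pairs
    kvs = [(w, 1) for block in blocks for rec in block for w in rec.split()]

    # shuffle/sort: group the intermediate pairs by key
    kvs.sort(key=itemgetter(0))

    # reduce: aggregate each key's values, then "write" the result
    counts = {k: sum(v for _, v in grp) for k, grp in groupby(kvs, key=itemgetter(0))}
    print(counts)                            # {'a': 3, 'b': 2}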
HDFS Architecture: a DFS allows storage partitioned across multiple machines. CAP (Consistency: a read() following a write() sees it; Availability: nodes are available for read/write access; Partitioning: node/network splits and connectivity). HDFS is a filesystem abstraction; not suitable for low latency. NODES: name node (metadata), data node (data). USE CASES: large files, fault tolerance, WORM, high throughput (not low latency), OLAP. DFS: blocks, block-level operations (read/write, copy/replicate). NAME NODE: directory tree, permissions, block locations, checksums (cached and persisted); Client (block locations/data node allocations, alternative locations/failover). DATA NODE: files consist of blocks; blocks are unaware of file-level membership; block ops (new blocks, replications, notifications (deletes)); heartbeats are sent to the name node. *WRITE NEW FILE: 1) add block to name node, 2) write block to local data node, 3) background block operations -> replication, orchestrated by the name node with the local data node as source.
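A minimal sketch of that write path (a toy name node/data node model; the classes, node names, and replication factor are assumptions for illustration, not the HDFS API):

    import random

    class NameNode:
        def __init__(self, data_nodes, replication=3):
            self.block_locations = {}        # metadata: block id -> data nodes
            self.data_nodes = data_nodes
            self.replication = replication

        def add_block(self, block_id, local_node):
            # steps 1-2: register the block; first replica is the writer's local node
            self.block_locations[block_id] = [local_node]

        def replicate(self, block_id):
            # step 3: background replication; name node picks targets, local node is src
            placed = self.block_locations[block_id]
            targets = [n for n in self.data_nodes if n not in placed]
            for node in random.sample(targets, self.replication - 1):
                placed.append(node)          # src data node streams the block to node

    nn = NameNode(data_nodes=["dn1", "dn2", "dn3", "dn4"])
    nn.add_block("blk_0001", "dn2")
    nn.replicate("blk_0001")
    print(nn.block_locations["blk_0001"])    # e.g. ['dn2', 'dn4', 'dn1']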
Yarn Architecture: ARCHITECTURE: minimize I/O cost and transit when not needed (overhead). CLUSTER: master nodes (control/client connectivity), worker nodes (tasks, vmem, vcores). MANAGER INSTANCES: Resource Manager (per cluster), Node Manager (per node). CONTAINERS: requested resource requirements; allocation takes place by finding nodes/hosts to hold a container. APPLICATION: client driver (how to run); application master: allows for resource usage on YARN nodes.
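A minimal sketch of container allocation, finding a node/host with enough vcores/vmem (a toy first-fit placement; the capacities, node names, and policy are assumptions, not the YARN scheduler):

    # toy cluster state: per-node free capacity [vcores, vmem in GB] - assumed values
    nodes = {"worker1": [8, 32], "worker2": [2, 8], "worker3": [16, 64]}

    def allocate(request_vcores, request_vmem):
        # find a node/host that can hold the container (first fit)
        for name, (vcores, vmem) in nodes.items():
            if vcores >= request_vcores and vmem >= request_vmem:
                nodes[name][0] -= request_vcores    # reserve the resources
                nodes[name][1] -= request_vmem
                return name
        return None                                  # no host can hold the container

    print(allocate(4, 16))    # -> 'worker1'
    print(allocate(12, 40))   # -> 'worker3'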
Yarn Scheduling: SCHEDULING: resource requests (static (upfront), dynamic (min/max)). SCHEDULING POLICIES: Fair Scheduler (pools of applications/tasks -> tags (gpu) / properties (userid, unix group)). Intra-pool: fair sharing of equal resources (space/time); within a pool, ordering is FIFO. Inter-pool: weighting/ranking determined by the scheduler.
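A minimal sketch of the inter-pool weighting plus intra-pool fair-share idea (toy numbers; the pool names and weights are assumptions):

    # inter-pool: cluster capacity is divided by pool weight (assumed weights)
    CLUSTER_VCORES = 100
    pools = {"prod": 3, "research": 1}                # pool -> weight
    total_weight = sum(pools.values())
    pool_share = {p: CLUSTER_VCORES * w / total_weight for p, w in pools.items()}
    print(pool_share)                                 # {'prod': 75.0, 'research': 25.0}

    # intra-pool: equal (fair) split among running apps; queueing order is FIFO
    apps_in_prod = ["app1", "app2", "app3"]
    fair_share = pool_share["prod"] / len(apps_in_prod)
    print({app: fair_share for app in apps_in_prod})  # 25.0 vcores each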
Map-Reduce Concepts: RECORDS/BLOCKS: splits are groups of records on which parallel tasks can be run. Map-Reduce Framework: the mapper has a memory buffer; on overflow it spills to disk, then runs the combiner to merge the spill files. *PARTITIONER: intermediate KVPs are split/sharded to have one shard per reducer (hash objects); the keyspace of intermediate KVPs must be distributed evenly across reducer nodes, and the mapping must send identical keys to identical reducer nodes.
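A minimal sketch of a hash partitioner in Python (it mirrors the logic of hash partitioning, identical keys to identical reducers, even spread over the keyspace; the key set is an assumption):

    NUM_REDUCERS = 4

    def partition(key):
        # hash the key, mask the sign bit, mod by reducer count; identical keys
        # map to the identical reducer (within one run - Python salts str hashes
        # across runs, unlike Hadoop's stable hashCode())
        return (hash(key) & 0x7FFFFFFF) % NUM_REDUCERS

    intermediate = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
    shards = {r: [] for r in range(NUM_REDUCERS)}
    for key, value in intermediate:
        shards[partition(key)].append((key, value))   # one shard per reducer
    print(shards)                                      # both ('cat', 1) pairs share a shard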
Hive Architecture: HQL queries let us run many MR jobs. *Alternatives: manual MR jobs, Athena (S3), Impala/Presto, Spark SQL. *Relationships: online transaction processing (online; Hive isn't well suited), online analytical processing (batch). Hive Queries: Database (namespace, security, storage), Table (set of records with a schema), Partition (grouping of records in a table by a column), Bucket (grouping of partition data into sets of records by a column). *Predicate pushdown -> partition pruning.
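A minimal sketch of predicate pushdown / partition pruning (a toy table laid out as one directory per partition-column value; the paths and predicate are assumptions, not Hive internals):

    # partitioned table layout: partition column value -> files (assumed toy data)
    partitions = {
        "ds=2024-01-01": ["/warehouse/t/ds=2024-01-01/part-0"],
        "ds=2024-01-02": ["/warehouse/t/ds=2024-01-02/part-0"],
        "ds=2024-01-03": ["/warehouse/t/ds=2024-01-03/part-0"],
    }

    # the predicate WHERE ds = '2024-01-02' is pushed down: whole partitions are
    # pruned before any file is read, instead of scanning the full table
    predicate = lambda ds: ds == "ds=2024-01-02"
    files_to_scan = [f for ds, fs in partitions.items() if predicate(ds) for f in fs]
    print(files_to_scan)   # only the 2024-01-02 partition's files are scanned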
Pig Concepts: restructuring big data isn't just querying (Hive); data transformations are important, and they let us create pipelines. Pig is an imperative language that manipulates datasets from HDFS; as its abstraction, Pig groups records/datasets into relations. *Relation: fields -> columns, tuples -> rows, bags -> tables; the outer bag is the highest level. Flow: Pig script -> Pig (parse, compile, optimize, plan) -> MR jobs -> YARN and HDFS. Pig Dataflow: a Pig script does not support control logic; it allows arbitrary transformations without a general-purpose programming language (with some limitations on the allowed transformations). OPERATORS: LOAD, STORE, DUMP (trigger execution/evaluation, I/O); FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT (transformations, bag -> bag).
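A minimal sketch of that dataflow idea in Python (transformations only build a plan; a DUMP-style action triggers evaluation; the operator names mimic Pig, but the Relation class and data are assumptions):

    class Relation:
        # a bag of tuples plus a deferred plan of transformations
        def __init__(self, tuples, plan=()):
            self.tuples, self.plan = tuples, plan

        def filter(self, pred):                 # transformation: bag -> bag, lazy
            return Relation(self.tuples, self.plan + (("FILTER", pred),))

        def foreach(self, fn):                  # transformation: bag -> bag, lazy
            return Relation(self.tuples, self.plan + (("FOREACH", fn),))

        def dump(self):                         # action: triggers evaluation / I/O
            rows = self.tuples
            for op, fn in self.plan:
                rows = [fn(t) for t in rows] if op == "FOREACH" else \
                       [t for t in rows if fn(t)]
            print(rows)

    raw = Relation([("alice", 34), ("bob", 17), ("carol", 52)])   # LOAD stand-in
    adults = raw.filter(lambda t: t[1] >= 18).foreach(lambda t: t[0])
    adults.dump()    # nothing ran until here -> ['alice', 'carol']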