554_cheatsheet

Apache Hadoop is a scalable, reliable, flexible, and economical framework designed for distributed storage and processing of large datasets. It consists of a core architecture with components like HDFS for storage and YARN for resource management, and is complemented by an ecosystem of tools such as Pig, Hive, and Spark for data processing. Key features include fault tolerance, data parallelism, and the ability to handle various data types and formats efficiently.


Apache Hadoop Characteristics: scalable (efficiently store and process data; linear scaling driven by adding processing and storage), reliable (redundant storage, failover across nodes and racks), flexible (store all data types in any format; apply schema on analysis and sharing of the data), economical (uses commodity hardware). HDFS is write-once: appends are permitted, no random writes. Lets you take advantage of clustered servers: distribute computation among all servers and collect the computations carried out. Spark: in-memory computation increases the speed of data processing over MR (supports graph algorithms and ML); excellent for low latency and large amounts of data. HDFS Concepts: name nodes (supervisors) and data nodes (workers); name nodes manage file system metadata (directory tree, access permissions, block location data; metadata is kept in memory for fast access and stored on disk; communicates with data nodes). YARN: resource management among clusters. HiveQL: works on HDFS, batch; queries have overhead due to MapReduce; works best on large datasets (Hive DOESN'T support schema on write: can only write once a table is defined) (Hive is schema on read: the schema can be defined after the data file it describes already exists) (HBase is integrated with Hive to support record-level operations and online analytical processing) (partitioning is very common for external tables). Pig (faster to write, test, and deploy than MR; a better choice for most analysis and processing tasks) (Pig Latin is a dataflow language; a script is a DAG) (Pig has join and order-by operators that handle the case of one reducer getting 10 or more times the data of the other reducers, and can do early error checking and optimizations, unlike MapReduce).

Distributed Systems (a collection of independent computers that appears as a single coherent system). UNSHARED: CPU, RAM, disk, peripherals, accelerators, network cards, clock. Racks are composed of machines; communication is faster among adjacent machines, slower across racks. | Coherency (Transparency): how much I can see and access from a machine without feeling I'm on a different machine (access, location, concurrency, replication, failure, mobility, performance, scaling). Distributed Apps -> Middleware Services (Hadoop). Pros: price/performance, redundancy (if a server fails, rely on the others), large-scale parallelization, extensibility (old code -> distributed version where parallelization takes place). Cons: a reliable system built from unreliable components, administration/maintenance, data/code locality, coherency and coordination (shared-state overhead; inserts of nodes into a table must ensure state is updated), debugging is hard, non-parallelizable tasks. | SCALABILITY: Characteristics [increased usage/utilization, increased resource requirements (storage/compute), maintenance and extensibility]. Types: [Vertical and Horizontal. *Pros Vertical: Moore's law, multicore + accelerators. *Cons Vertical: backplane latency and bandwidth, storage (mem/disk) latency and bandwidth. *Pros Horizontal: cost, software maturity. *Cons Horizontal: additional complexity, consistency (coherency), errors/fault tolerance]. STORAGE: Shared Everything (HPC): PFS (parallel file system); NFS and Lustre are examples. Shared Nothing (Big Data): DFS; HDFS is an example.

HDFS Nodes: NAME NODE: directory tree, permissions, block locations, checksums (cached and persisted). CLIENT: block locations/data node allocations, alternative locations/failover. DATA NODE: files consist of blocks; blocks are unaware of file-level membership; block ops: new blocks, replications, notifications (deletes); heartbeats are sent to the name node. *WRITE NEW FILE: 1) add the block to the name node, 2) write the block to the local data node, 3) background block operations -> replication, orchestrated by the name node with the local data node as source.

Yarn Architecture: ARCHITECTURE: minimize I/O cost and transit when not needed (overhead). CLUSTER: Master Nodes (control/client connectivity), Worker Nodes (tasks, vmem, vcores). MANAGER INSTANCES: Resource Manager (per cluster), Node Manager (per node). CONTAINERS: requested resource requirements; allocation takes place by finding nodes/hosts to hold a container. APPLICATION: client driver (how to run); Application Master: allows for resource usage on YARN nodes.

Yarn Scheduling: SCHEDULING: resource requests (static (upfront), dynamic (min/max)). SCHEDULING POLICIES: Fair Scheduler (pools of applications/tasks -> tags (gpu)/properties (userid, unix group)); Intra-pool: fair sharing of equal resources (space/time); Inter-pool: weighting/ranking determined by the scheduler. Intra-pool is FIFO.

Map-Reduce Concepts: RECORDS/BLOCKS: splits are groups of records on which parallel tasks can be run.

Map-Reduce Framework: the mapper has a memory buffer; on overflow it spills to disk, and a combiner runs to merge the spill files. *PARTITIONER: intermediate KVPs are split/sharded to give one shard per reducer (hash objects); the keyspace of intermediate KVPs is distributed evenly across reducer nodes; we need a mapping that sends identical keys to identical reducer nodes.

Hive Architecture: HQL queries let us run many MR jobs. *Alternatives: manual MR jobs, Athena (S3), Impala/Presto, Spark SQL. *Relationships: Online Transaction Processing (online; Hive isn't well suited), Online Analytical Processing (batch).

Hive Queries: Database (namespace, security, storage), Table (set of records with a schema), Partition (grouping of records in a table by a column), Bucket (grouping of partition data into sets of records by a column). *Predicate Pushdown -> Partition Pruning.

Pig Concepts: restructuring big data isn't just querying (Hive); data transformations matter, and they let us create pipelines. An imperative language that manipulates datasets from HDFS; as its abstraction, Pig groups records/datasets into relations. *Relation: Fields -> columns, Tuples -> rows, Bags -> tables. Flow: Pig script -> Pig (parse, compile, optimize, plan) -> MR jobs -> YARN and HDFS. The outer bag is the highest level.

Pig Dataflow: a Pig script does not support control logic. Allows arbitrary transformations without a general-purpose programming language (some limitations on the allowed transformations). OPERATORS: LOAD, STORE, DUMP (trigger execution/evaluation, I/O); FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT (transformations Bag -> Bag).
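The Pig relation operators can be imitated in plain Python by modeling a relation as a bag (list) of tuples. This is a toy illustration of the dataflow style, not Pig's actual implementation; the relation, field positions, and data are invented for the example:

```python
# Toy model of Pig-style dataflow: a relation is a bag (list) of tuples.
# Field names and data are made up for the example.

def FILTER(bag, pred):
    """Bag -> Bag: keep tuples satisfying the predicate."""
    return [t for t in bag if pred(t)]

def FOREACH(bag, fn):
    """Bag -> Bag: project/transform each tuple."""
    return [fn(t) for t in bag]

def GROUP(bag, key_index):
    """Bag -> Bag of (key, inner bag) pairs, like Pig's GROUP BY."""
    groups = {}
    for t in bag:
        groups.setdefault(t[key_index], []).append(t)
    return list(groups.items())

# A small relation: (user, page, seconds)
visits = [("ana", "home", 3), ("bob", "home", 7), ("ana", "cart", 2)]

long_visits = FILTER(visits, lambda t: t[2] > 2)       # like FILTER ... BY seconds > 2
by_user = GROUP(long_visits, 0)                        # like GROUP ... BY user
counts = FOREACH(by_user, lambda g: (g[0], len(g[1]))) # like FOREACH ... GENERATE group, COUNT(...)
print(counts)  # [('ana', 1), ('bob', 1)]
```

Note how each step consumes a bag and produces a bag, which is why a Pig script forms a DAG rather than a program with control logic.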
Fault Tolerance: RELIABILITY: Fault: causes a failure. Error: may lead to a failure, which can be recoverable or not recoverable; severity varies. Failure: deviation from the specification [RTO: recovery time objective; RPO: recovery point objective (how long the service is down and how much data we lost)]. Faults can be transient (one-time), intermittent (occasional/random, associated with complexity), or persistent (remain until repaired). CONCERNS: Availability, Reliability, Safety, Maintainability. CAVEATS: Fault -> Error -> Failure does not always happen; behavior outside of the specification is not always a failure. FAILURE MODES: Crash (complete failure/inactive state), Omission (I/O, failure to send/receive comms), Timing (response/processing outside of thresholds, i.e. a timeout), Response (incorrect output/result), Arbitrary (catch-all; inconsistent/unknown results or behavior). RECOVERY MECHANISMS (COMPENSATION): Monitoring/Alerting, Redundant/Backup Components, Checkpoints/Logs.
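The fault classes above call for different compensation: a transient fault can simply be retried, while one that survives every retry surfaces as a failure (persistent until repaired). A minimal sketch of that distinction, with invented names and retry policy:

```python
# Sketch: retry compensates for transient faults; a fault that exhausts
# the retries is treated as persistent and surfaces as a failure.
# The exception class and retry count are invented for illustration.

class TransientFault(Exception):
    pass

def run_with_retry(task, max_attempts=3):
    """Retry a task that may hit transient faults; re-raise if they persist."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except TransientFault:
            if attempt == max_attempts:
                raise  # persistent: remains until repaired

# A task that fails twice (transient), then succeeds.
def flaky(attempt):
    if attempt < 3:
        raise TransientFault("temporary glitch")
    return "ok"

print(run_with_retry(flaky))  # ok
```

Real systems layer the other mechanisms on top of this: monitoring detects the retries, and checkpoints/logs bound the RPO when a retry is not enough.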

Hadoop Core: CONCEPTUALLY: Compute (if a task is too large, run it distributed across a system), Storage (if the data is too large, store it distributed across a system). Hadoop Core and the Hadoop Ecosystem are two different things. CHARACTERISTICS: Scalable, Reliable, Flexible, and Economical.

Hadoop Ecosystem: additional tools that work with the Hadoop framework (Pig, Hive, Spark, and HBase). DATA PROCESSING: characterized by data parallelism across multiple nodes/machines (a cluster), resulting in distributed storage and distributed compute. *Fault tolerance -> K-safety. *Performance -> storage redundancy and compute locality (efficiency). *Hadoop cluster: DFS -> read() blocks -> map -> write() DFS -> shuffle/sort -> read() blocks -> reduce -> write() HDFS.
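The read -> map -> shuffle/sort -> reduce flow can be imitated in a few lines of Python. This is a single-process word-count sketch, not real Hadoop code; the toy hash partitioner stands in for the sharding that sends identical keys to identical reducers:

```python
# Single-process sketch of the MapReduce flow: map -> partition -> shuffle/sort -> reduce.
# Word count; the 2-"reducer" split mimics sharding by key. Illustrative only.

NUM_REDUCERS = 2

def map_phase(lines):
    """Emit intermediate (key, value) pairs: one (word, 1) per word."""
    return [(word, 1) for line in lines for word in line.split()]

def partition(key):
    """Map a key to a reducer shard; identical keys -> identical reducer."""
    return sum(map(ord, key)) % NUM_REDUCERS  # stable toy hash

def shuffle_sort(pairs):
    """Group values by key within each reducer's shard."""
    shards = [{} for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        shards[partition(key)].setdefault(key, []).append(value)
    return shards

def reduce_phase(shards):
    """Sum the grouped values for each key."""
    counts = {}
    for shard in shards:
        for key, values in shard.items():
            counts[key] = sum(values)
    return counts

lines = ["big data big cluster", "big data"]
result = reduce_phase(shuffle_sort(map_phase(lines)))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster each shard lives on a different reducer node, and the map and reduce phases read and write HDFS blocks instead of in-memory lists.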

HDFS Architecture: a DFS allows storage partitioned across multiple machines. CAP (Consistency: a read() after a write sees the write; Availability: nodes are allowed read/write access; Partitioning: node/network splits and connectivity). HDFS: a filesystem abstraction; not suitable for low latency. NODES: Name Node (metadata), Data Node (data). *Write new file: add the block to the name node, write the block to the local data node, background block operations -> replication, which is orchestrated by the name node with the local data node as the source. USE CASES: large files, fault tolerance, WORM, high throughput (not low latency), OLAP. DFS: blocks, block-level operations (read/write, copy/replicate). NAME NODE: directory tree, permissions, block locations, checksums (cached and persisted).
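The name node's bookkeeping (file -> blocks, block -> data-node replicas) can be pictured as a small in-memory table. This sketches the shape of the metadata only, not HDFS's real data structures; the block size, node names, placement policy, and replication factor are invented for the example:

```python
# Toy name-node metadata: file -> block list, block -> data-node replicas.
# Block size, node names, and placement policy are invented for the sketch.

BLOCK_SIZE = 4          # bytes per block (absurdly small, for the example)
REPLICATION = 3
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(file_size):
    """Number of blocks a file of file_size bytes needs."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(block_id):
    """Pick REPLICATION distinct data nodes for a block (round-robin toy policy)."""
    return [DATA_NODES[(block_id + i) % len(DATA_NODES)] for i in range(REPLICATION)]

def add_file(namespace, path, file_size):
    """'Write new file': register its blocks and their replica locations."""
    start = sum(len(blocks) for blocks in namespace.values())  # next free block id
    namespace[path] = {start + b: place_replicas(start + b)
                       for b in range(split_into_blocks(file_size))}

namespace = {}
add_file(namespace, "/logs/a.txt", 10)   # 3 blocks
add_file(namespace, "/logs/b.txt", 4)    # 1 block
print(namespace["/logs/a.txt"])
# {0: ['dn1', 'dn2', 'dn3'], 1: ['dn2', 'dn3', 'dn4'], 2: ['dn3', 'dn4', 'dn1']}
```

The data nodes hold only the blocks; as the notes say, a block knows nothing about which file it belongs to — that mapping lives entirely in the name node's table.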
