554_cheatsheet
Hadoop: scalable, reliable (redundant storage, failover across nodes and racks), flexible (store all data types in any format; apply schema on analysis and sharing of the data), economical (uses commodity hardware). HDFS is write-once: appends are permitted, no random writes. Hadoop lets you take advantage of clustered servers: distribute the computation among all servers and collect the results. Spark: in-memory computation increases the speed of data processing over MR (supports graph algorithms and ML); excellent for low latency and large amounts of data.
HDFS Concepts: name nodes (supervisors) and data nodes (workers). Name nodes manage file system metadata (the file system directory tree, access permissions, and block location data); metadata is kept in memory for fast access and stored on disk; the name node communicates with the data nodes. YARN: resource management for the cluster.
HiveQL: works on HDFS; batch: queries have overhead due to MapReduce and work best on large datasets. Hive DOESN'T support schema on write: you can only write once a table is defined. Hive is schema on read: the schema can be defined after the data file it describes already exists (see the schema-on-read sketch below). HBase integrates with Hive to support record-level operations and online analytical processing. Partitioning is very common for external tables. Pig: faster to write, test, and deploy than MR; the better choice for most analysis and processing tasks. Pig Latin is a dataflow language (it is a DAG). Pig has join and order-by operators that handle the case of one reducer getting 10 or more times the data of the other reducers, and it can do early error checking and optimizations, unlike MapReduce.
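A minimal sketch of the schema-on-read idea in Python (the raw file exists first; a schema is applied only when the data is read; the file contents and field names here are assumptions for illustration, not Hive internals):

    import io

    # the data file exists before any table/schema is defined (schema on read)
    raw = io.StringIO("1,alice,34\n2,bob,17\n")

    # "CREATE EXTERNAL TABLE"-style schema, declared after the fact (assumed fields)
    schema = [("id", int), ("name", str), ("age", int)]

    rows = []
    for line in raw:
        values = line.strip().split(",")
        rows.append({col: typ(v) for (col, typ), v in zip(schema, values)})
    print(rows)   # schema is applied at query time, not at write time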
Distributed Systems: a collection of independent computers that appears as a single coherent system. UNSHARED: CPU, RAM, disk, peripherals, accelerators, network cards, clock. Racks are composed of machines; communication is faster among adjacent machines, slower across racks. | Coherency (Transparency): how much can I see and access from a machine without feeling I'm on a different machine (access, location, concurrency, replication, failure, mobility, performance, scaling). Distributed Apps -> Middleware Services (Hadoop). Pros: price/performance, redundancy (if a server fails, you can rely on others), large-scale parallelization, extensibility (old code -> distributed version where parallelization takes place). Cons: a reliable system built from unreliable components, administration/maintenance, data/code locality, coherency and coordination (shared-state overhead: inserting nodes into a table must ensure state is updated), debugging is hard, non-parallelizable tasks.
SCALABILITY: Characteristics [increased usage/utilization, increased resource requirements (storage/compute), maintenance and extensibility]. Types: [vertical and horizontal]. *Pros vertical: Moore's law, multicore + accelerators. *Cons vertical: backplane latency and bandwidth, storage (mem/disk) latency and bandwidth. *Pros horizontal: cost, software maturity. *Cons horizontal: additional complexity, consistency (coherency), errors/fault-tolerance. STORAGE: Shared Everything (HPC): PFS (parallel file system); NFS and Lustre are examples. Shared Nothing (Big Data): DFS; HDFS is an example.
Fault Tolerance: RELIABILITY: Fault: causes a failure. Error: may lead to a failure, which can be recoverable or not recoverable; severity. Failure: deviation from specification [RTO: recovery time objective, RPO: recovery point objective (how long the service is down and how much data we lost)]. Faults can be transient (one time), intermittent (occasional/random, associated with complexity), or persistent (remain until repaired). CONCERNS: Availability, Reliability, Safety, Maintainability. CAVEATS: Fault -> Error -> Failure does not always happen; behavior outside of specification is not always a failure. FAILURE MODES: Crash (complete failure/inactive state), Omission (I/O, failure to send/receive comms), Timing (response/processing outside of thresholds, i.e. timeout), Response (incorrect output/result), Arbitrary (catch-all: inconsistent/unknown results or behavior). RECOVERY MECHANISMS (COMPENSATION): Monitoring/Alerting, Redundant/Backup Components, Checkpoints/Logs.
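A minimal sketch of how RTO and RPO translate into numbers (the incident values are assumptions, not from the course material): the RPO is bounded by the age of the last checkpoint, the RTO by the time to detect plus the time to recover.

    # hypothetical illustration of RTO/RPO; all values are assumed
    CHECKPOINT_INTERVAL_MIN = 15   # checkpoint/log flush every 15 minutes
    DETECT_MIN = 5                 # time to detect the failure (monitoring/alerting)
    RESTORE_MIN = 20               # time to restore from the redundant/backup component

    worst_case_data_loss = CHECKPOINT_INTERVAL_MIN       # bounds the RPO
    worst_case_downtime = DETECT_MIN + RESTORE_MIN       # bounds the RTO

    print(f"RPO bound: {worst_case_data_loss} min of lost writes")
    print(f"RTO bound: {worst_case_downtime} min of service downtime")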
Hadoop Core: CONCEPTUALLY: Compute (if a task is too large, run it distributed across a system), Storage (if the data is too large, store it distributed across a system). Hadoop Core and the Hadoop Ecosystem are two different things. CHARACTERISTICS: Scalable, Reliable, Flexible, and Economical.
Hadoop Ecosystem: additional tools that work with the Hadoop framework (Pig, Hive, Spark, and HBase). DATA PROCESSING: characterized by data parallelism across multiple nodes/machines (a cluster), resulting in distributed storage and distributed compute. *Fault-tolerance -> K-safety. *Performance -> storage redundancy and compute locality (efficiency). *Hadoop cluster: DFS -> read() blocks -> map -> write() DFS -> shuffle/sort -> read() blocks -> reduce -> write() HDFS.
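A minimal sketch of that read -> map -> shuffle/sort -> reduce dataflow in Python (an in-memory stand-in for HDFS blocks; the data and names are assumptions, not Hadoop APIs):

    from itertools import groupby
    from operator import itemgetter

    blocks = [["a b a"], ["b a"]]            # stand-in for HDFS blocks of text records

    # map: each record -> intermediate key/value pairs
    kvs = [(w, 1) for block in blocks for rec in block for w in rec.split()]

    # shuffle/sort: group the intermediate pairs by key
    kvs.sort(key=itemgetter(0))

    # reduce: aggregate each key's values, then "write" the result
    counts = {k: sum(v for _, v in grp) for k, grp in groupby(kvs, key=itemgetter(0))}
    print(counts)                            # {'a': 3, 'b': 2}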
HDFS Architecture: a DFS allows storage partitioned across multiple machines. CAP (Consistency: a read() following a write() sees it; Availability: nodes are available for read/write access; Partitioning: node/network splits and connectivity). HDFS is a filesystem abstraction; not suitable for low latency. NODES: name node (metadata), data node (data). USE CASES: large files, fault tolerance, WORM, high throughput (not low latency), OLAP. DFS: blocks, block-level operations (read/write, copy/replicate). NAME NODE: directory tree, permissions, block locations, checksums (cached and persisted); Client (block locations/data node allocations, alternative locations/failover). DATA NODE: files consist of blocks; blocks are unaware of file-level membership; block ops (new blocks, replications, notifications (deletes)); heartbeats are sent to the name node. *WRITE NEW FILE: 1) add block to name node, 2) write block to local data node, 3) background block operations -> replication, orchestrated by the name node with the local data node as source.
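A minimal sketch of that write path (a toy name node/data node model; the classes, node names, and replication factor are assumptions for illustration, not the HDFS API):

    import random

    class NameNode:
        def __init__(self, data_nodes, replication=3):
            self.block_locations = {}        # metadata: block id -> data nodes
            self.data_nodes = data_nodes
            self.replication = replication

        def add_block(self, block_id, local_node):
            # steps 1-2: register the block; first replica is the writer's local node
            self.block_locations[block_id] = [local_node]

        def replicate(self, block_id):
            # step 3: background replication; name node picks targets, local node is src
            placed = self.block_locations[block_id]
            targets = [n for n in self.data_nodes if n not in placed]
            for node in random.sample(targets, self.replication - 1):
                placed.append(node)          # src data node streams the block to node

    nn = NameNode(data_nodes=["dn1", "dn2", "dn3", "dn4"])
    nn.add_block("blk_0001", "dn2")
    nn.replicate("blk_0001")
    print(nn.block_locations["blk_0001"])    # e.g. ['dn2', 'dn4', 'dn1']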
Yarn Architecture: ARCHITECTURE: minimize I/O cost and transit when not needed (overhead). CLUSTER: master nodes (control/client connectivity), worker nodes (tasks, vmem, vcores). MANAGER INSTANCES: Resource Manager (per cluster), Node Manager (per node). CONTAINERS: requested resource requirements; allocation takes place by finding nodes/hosts to hold a container. APPLICATION: client driver (how to run); application master: allows for resource usage on YARN nodes.
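A minimal sketch of container allocation, finding a node/host with enough vcores/vmem (a toy first-fit placement; the capacities, node names, and policy are assumptions, not the YARN scheduler):

    # toy cluster state: per-node free capacity [vcores, vmem in GB] - assumed values
    nodes = {"worker1": [8, 32], "worker2": [2, 8], "worker3": [16, 64]}

    def allocate(request_vcores, request_vmem):
        # find a node/host that can hold the container (first fit)
        for name, (vcores, vmem) in nodes.items():
            if vcores >= request_vcores and vmem >= request_vmem:
                nodes[name][0] -= request_vcores    # reserve the resources
                nodes[name][1] -= request_vmem
                return name
        return None                                  # no host can hold the container

    print(allocate(4, 16))    # -> 'worker1'
    print(allocate(12, 40))   # -> 'worker3'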
Yarn Scheduling: SCHEDULING: resource requests (static (upfront), dynamic (min/max)). SCHEDULING POLICIES: Fair Scheduler (pools of applications/tasks -> tags (gpu) / properties (userid, unix group)). Intra-pool: fair sharing of equal resources (space/time); within a pool, ordering is FIFO. Inter-pool: weighting/ranking determined by the scheduler.
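A minimal sketch of the inter-pool weighting plus intra-pool fair-share idea (toy numbers; the pool names and weights are assumptions):

    # inter-pool: cluster capacity is divided by pool weight (assumed weights)
    CLUSTER_VCORES = 100
    pools = {"prod": 3, "research": 1}                # pool -> weight
    total_weight = sum(pools.values())
    pool_share = {p: CLUSTER_VCORES * w / total_weight for p, w in pools.items()}
    print(pool_share)                                 # {'prod': 75.0, 'research': 25.0}

    # intra-pool: equal (fair) split among running apps; queueing order is FIFO
    apps_in_prod = ["app1", "app2", "app3"]
    fair_share = pool_share["prod"] / len(apps_in_prod)
    print({app: fair_share for app in apps_in_prod})  # 25.0 vcores each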
Map-Reduce Concepts: RECORDS/BLOCKS: splits are groups of records on which parallel tasks can be run. Map-Reduce Framework: the mapper has a memory buffer; on overflow it spills to disk, then runs the combiner to merge the spill files. *PARTITIONER: intermediate KVPs are split/sharded to have one shard per reducer (hash objects); the keyspace of intermediate KVPs must be distributed evenly across reducer nodes, and the mapping must send identical keys to identical reducer nodes.
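A minimal sketch of a hash partitioner in Python (it mirrors the logic of hash partitioning, identical keys to identical reducers, even spread over the keyspace; the key set is an assumption):

    NUM_REDUCERS = 4

    def partition(key):
        # hash the key, mask the sign bit, mod by reducer count; identical keys
        # map to the identical reducer (within one run - Python salts str hashes
        # across runs, unlike Hadoop's stable hashCode())
        return (hash(key) & 0x7FFFFFFF) % NUM_REDUCERS

    intermediate = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
    shards = {r: [] for r in range(NUM_REDUCERS)}
    for key, value in intermediate:
        shards[partition(key)].append((key, value))   # one shard per reducer
    print(shards)                                      # both ('cat', 1) pairs share a shard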
Hive Architecture: HQL queries let us run many MR jobs. *Alternatives: manual MR jobs, Athena (S3), Impala/Presto, Spark SQL. *Relationships: online transaction processing (online; Hive isn't well suited), online analytical processing (batch). Hive Queries: Database (namespace, security, storage), Table (set of records with a schema), Partition (grouping of records in a table by a column), Bucket (grouping of partition data into sets of records by a column). *Predicate pushdown -> partition pruning.
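A minimal sketch of predicate pushdown / partition pruning (a toy table laid out as one directory per partition-column value; the paths and predicate are assumptions, not Hive internals):

    # partitioned table layout: partition column value -> files (assumed toy data)
    partitions = {
        "ds=2024-01-01": ["/warehouse/t/ds=2024-01-01/part-0"],
        "ds=2024-01-02": ["/warehouse/t/ds=2024-01-02/part-0"],
        "ds=2024-01-03": ["/warehouse/t/ds=2024-01-03/part-0"],
    }

    # the predicate WHERE ds = '2024-01-02' is pushed down: whole partitions are
    # pruned before any file is read, instead of scanning the full table
    predicate = lambda ds: ds == "ds=2024-01-02"
    files_to_scan = [f for ds, fs in partitions.items() if predicate(ds) for f in fs]
    print(files_to_scan)   # only the 2024-01-02 partition's files are scanned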
Pig Concepts: restructuring big data isn't just querying (Hive); data transformations are important, and they let us create pipelines. Pig is an imperative language that manipulates datasets from HDFS; as its abstraction, Pig groups records/datasets into relations. *Relation: fields -> columns, tuples -> rows, bags -> tables; the outer bag is the highest level. Flow: Pig script -> Pig (parse, compile, optimize, plan) -> MR jobs -> YARN and HDFS. Pig Dataflow: a Pig script does not support control logic; it allows arbitrary transformations without a general-purpose programming language (with some limitations on the allowed transformations). OPERATORS: LOAD, STORE, DUMP (trigger execution/evaluation, I/O); FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT (transformations, bag -> bag).
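A minimal sketch of that dataflow idea in Python (transformations only build a plan; a DUMP-style action triggers evaluation; the operator names mimic Pig, but the Relation class and data are assumptions):

    class Relation:
        # a bag of tuples plus a deferred plan of transformations
        def __init__(self, tuples, plan=()):
            self.tuples, self.plan = tuples, plan

        def filter(self, pred):                 # transformation: bag -> bag, lazy
            return Relation(self.tuples, self.plan + (("FILTER", pred),))

        def foreach(self, fn):                  # transformation: bag -> bag, lazy
            return Relation(self.tuples, self.plan + (("FOREACH", fn),))

        def dump(self):                         # action: triggers evaluation / I/O
            rows = self.tuples
            for op, fn in self.plan:
                rows = [fn(t) for t in rows] if op == "FOREACH" else \
                       [t for t in rows if fn(t)]
            print(rows)

    raw = Relation([("alice", 34), ("bob", 17), ("carol", 52)])   # LOAD stand-in
    adults = raw.filter(lambda t: t[1] >= 18).foreach(lambda t: t[0])
    adults.dump()    # nothing ran until here -> ['alice', 'carol']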