CT2 BDTT

Here is a summary of all the topics covered in the provided PDF files to help you prepare for the exam:

Unit 2:
- MapReduce programming model and its implementation in Hadoop
- Unix tools for data analysis (awk, grep etc.)
- Hadoop YARN architecture and job scheduling
- Scaling out data processing with Hadoop
- Hadoop Streaming for running MapReduce jobs with any language
- HDFS architecture and its components
- Data integrity and compression in Hadoop
- File formats like SequenceFile and MapFile
- Developing MapReduce applications

Unit 3:
- Setting up a Hadoop cluster
- Pig - high-level data flow language to analyze large datasets
- Hive - data warehouse infrastructure on top of Hadoop
- Zookeeper - Coordination service for distributed applications
- Flume - Data ingestion system for streaming data into Hadoop
- Sqoop - Tool to transfer data between Hadoop and RDBMS

Unit 4:
- Oozie - Workflow scheduler system to manage Hadoop jobs
- Apache Spark architecture and components
- Overcoming limitations of Hadoop with Spark and Flink
- Batch analytics using Flink
- Introduction to NoSQL databases like HBase, MongoDB, Cassandra
- Architecture and data models of HBase, MongoDB, Cassandra

To summarize, the key topics cover Hadoop ecosystem components like MapReduce, HDFS, YARN, Pig, Hive, Oozie, Flume, and Sqoop, as well as NoSQL databases. There is also
coverage of big data processing paradigms like MapReduce, data analysis with Unix tools,
developing Hadoop applications, and an introduction to Spark and Flink. This should help
provide a broad overview to prepare for an exam related to big data tools and techniques.

Here are summaries of each of the 3 topics in around 200 words:

Data Integrity and Compression in Hadoop:

Hadoop ensures data integrity by computing checksums when data enters the system and verifying them whenever data is transmitted. In HDFS, datanodes verify the data they receive before storing it and keep a persistent log of checksum verifications. Clients also verify checksums when reading from datanodes. The LocalFileSystem performs client-side checksumming by creating hidden .crc files containing checksums for each chunk of a file. The ChecksumFileSystem class is a wrapper that adds checksumming to an underlying (raw) filesystem.
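
As a rough illustration, here is a minimal Java sketch, assuming the Hadoop client libraries are on the classpath and using a hypothetical /tmp path, that shows the checksummed LocalFileSystem writing a hidden .crc file and checksum verification being switched off before a read:

    // Sketch only: client-side checksumming with Hadoop's LocalFileSystem.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumDemo {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // LocalFileSystem is a ChecksumFileSystem: writes create hidden .crc files.
        LocalFileSystem fs = FileSystem.getLocal(conf);
        Path file = new Path("/tmp/data.txt");           // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
          out.writeUTF("hello hadoop");                  // also updates /tmp/.data.txt.crc
        }
        System.out.println("crc exists: " + fs.exists(new Path("/tmp/.data.txt.crc")));

        // Disable verification (e.g. to salvage a corrupt file) before reading.
        fs.setVerifyChecksum(false);
        fs.open(file).close();
      }
    }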

File compression in Hadoop reduces storage needs and speeds up data transfer. Hadoop supports codecs such as gzip, bzip2, and LZO. The choice of codec depends on whether minimizing storage space or maximizing speed is the priority, and on whether the format is splittable so that large files can be processed in parallel. Different compression formats trade off compression ratio against compression and decompression speed. Benchmarking on representative data is recommended to determine the ideal approach.
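
A minimal sketch of codec usage, again assuming the Hadoop client libraries and a hypothetical output file name, compressing a stream with GzipCodec:

    // Sketch only: wrapping an output stream with a Hadoop compression codec.
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // createOutputStream wraps any OutputStream with on-the-fly compression.
        try (OutputStream out = codec.createOutputStream(new FileOutputStream("data.gz"))) {
          out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }
      }
    }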

SequenceFile and MapFile Formats:

SequenceFile is a flat file of binary key-value pairs. It provides persistent storage and works well as a container for packing many small files into a single file so that MapReduce jobs can process them efficiently.

MapFile is a sorted SequenceFile with an index over the keys, enabling efficient lookups and retrievals. MapFiles suit applications requiring ordered, indexed data storage and fast random-access reads. Both SequenceFile and MapFile integrate well with MapReduce processing.
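
A short sketch, assuming the Hadoop client libraries and a hypothetical file name, that writes and then reads back a SequenceFile of Text/IntWritable pairs:

    // Sketch only: writing and reading a SequenceFile of key-value records.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("numbers.seq");             // hypothetical path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class))) {
          for (int i = 0; i < 100; i++) {
            writer.append(new Text("key-" + i), new IntWritable(i));  // binary key-value pairs
          }
        }

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
          Text key = new Text();
          IntWritable value = new IntWritable();
          while (reader.next(key, value)) {              // iterate records in file order
            System.out.println(key + "\t" + value);
          }
        }
      }
    }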

Developing MapReduce Applications:

Key steps include configuring resources such as JARs, files, and environment variables using XML configuration files or code. Applications parse input data and define Map and Reduce functions to implement the desired data-flow logic. Data (de)serialization is handled transparently by Hadoop's Writable types.

Applications are tested locally first before packaging code and resources into a job.jar file. The
job is then submitted to a cluster's JobTracker/ResourceManager to run MapReduce tasks
across cluster nodes. Monitoring, troubleshooting and retrieving results are also part of the
development lifecycle.

GenericOptionsParser, Tool, and ToolRunner classes simplify command-line argument handling.


Mapper and Reducer classes are implemented by users. Unit tests validate logic on small
datasets locally before scaling out.
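
For illustration, here is a minimal word-count driver following that pattern, using Tool/ToolRunner and the org.apache.hadoop.mapreduce API (class names and input/output paths are hypothetical):

    // Sketch only: a complete MapReduce application with Tool/ToolRunner.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCount extends Configured implements Tool {

      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);          // emit (word, 1) per token
            }
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();                       // total count for this word
          }
          context.write(key, new IntWritable(sum));
        }
      }

      @Override
      public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner wires in GenericOptionsParser, so -D, -files, -libjars etc. work for free.
        System.exit(ToolRunner.run(new WordCount(), args));
      }
    }
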
Here are summaries of each of those 6 topics in around 200 words:

Setting up a Hadoop Cluster:

A Hadoop cluster consists of a NameNode (master) and multiple DataNodes (workers) running on commodity hardware. Nodes are typically arranged in a multi-rack topology connected via switching gear, and configuring rack awareness lets Hadoop optimize for the network topology. Key steps include installing Java, creating a dedicated Hadoop user, installing the Hadoop package, setting up SSH for password-less login, and configuring directories and settings in Hadoop's XML files such as core-site.xml and hdfs-site.xml. For YARN, the NodeManager and ResourceManager are configured via yarn-site.xml. Finally, the Hadoop daemons are started and the installation is tested. Multi-node clusters require careful configuration of node roles, replication policies, and rack topology scripts.
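
These settings normally live in the XML files named above; purely for illustration, the sketch below sets equivalent keys programmatically through Hadoop's Configuration API (hostnames, ports, and values are assumptions):

    // Sketch only: the kind of properties placed in core-site.xml, hdfs-site.xml and yarn-site.xml.
    import org.apache.hadoop.conf.Configuration;

    public class ClusterConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");    // core-site.xml
        conf.setInt("dfs.replication", 3);                               // hdfs-site.xml
        conf.set("yarn.resourcemanager.hostname", "rm.example.com");     // yarn-site.xml
        System.out.println(conf.get("fs.defaultFS"));
      }
    }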

Pig - High-Level Language for Data Analysis:

Pig is a high-level data flow language (Pig Latin) and execution framework for analyzing large
datasets. Pig Latin abstracts away low-level MapReduce details through a simple scripting
language. Programs consist of a sequence of operations (transformations) applying filtering,
joining, grouping etc. Pig is extensible via custom functions. At runtime, the logical plan is
mapped to a physical plan of MapReduce jobs. Pig Latin statements can be executed
interactively or in batch. Pig suits ad-hoc data exploration and ETL processing of
structured/semi-structured data.
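
A minimal sketch of embedding Pig Latin in Java through the PigServer API in local mode (the input path and schema are assumptions):

    // Sketch only: building and running a small Pig Latin data flow from Java.
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigSketch {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);    // local mode; use MAPREDUCE on a cluster
        // Each registerQuery call adds one statement to the logical plan.
        pig.registerQuery("records = LOAD 'input/sample.tsv' AS (year:int, temp:int);");
        pig.registerQuery("valid = FILTER records BY temp IS NOT NULL;");
        pig.registerQuery("by_year = GROUP valid BY year;");
        pig.registerQuery("max_temp = FOREACH by_year GENERATE group, MAX(valid.temp);");
        pig.store("max_temp", "output/max_temp");         // compiles the plan and runs the jobs
      }
    }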

Hive - Data Warehousing on Hadoop:

Hive is a data warehouse infrastructure that facilitates querying and analyzing large datasets on Hadoop using a SQL-like language called HiveQL/HQL. It abstracts away the programming complexity of MapReduce. Hive supports complex column types, table partitioning, and bucketing for efficient queries. Its architecture includes a metadata store (the metastore), a query compiler, and a driver that coordinates execution of compiler-generated plans on Hadoop MapReduce or Tez. Hive suits batch analytics over large, immutable datasets using familiar SQL semantics. It integrates with Hadoop ecosystem components such as YARN, HDFS, and Oozie.
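
A minimal sketch of issuing HiveQL from Java over JDBC against HiveServer2 (the host, port, credentials, and table are assumptions, and the hive-jdbc driver is assumed to be on the classpath):

    // Sketch only: querying Hive through its JDBC interface.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the Hive JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING) "
              + "PARTITIONED BY (dt STRING)");
          try (ResultSet rs = stmt.executeQuery(
                   "SELECT level, COUNT(*) FROM logs WHERE dt = '2024-01-01' GROUP BY level")) {
            while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
          }
        }
      }
    }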

ZooKeeper - Coordination Service:

ZooKeeper is a highly available, fault-tolerant coordination service for distributed applications. It maintains an in-memory data tree of znodes (data nodes) organized hierarchically like a standard file system. Clients can read/write znodes and set notifications (watches) for changes.
ZooKeeper ensures consistent views to clients through leader election and atomic broadcast
protocols. Its key use cases include shared configuration management, distributed locks,
naming services, etc. ZooKeeper ensembles replicate znodes across a quorum of servers using the ZAB (ZooKeeper Atomic Broadcast) protocol for consistency. Clients connect via sessions with timeout/expiry mechanisms.
ZooKeeper simplifies building reliable distributed systems.
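
A minimal sketch using the ZooKeeper Java client to create a znode, read it back, and register a watch (the ensemble address and znode path are assumptions):

    // Sketch only: basic znode operations with the ZooKeeper client API.
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSketch {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000,
            event -> System.out.println("event: " + event));   // watcher receives session/znode events

        String path = "/config";                               // hypothetical znode
        if (zk.exists(path, false) == null) {
          zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, true, null);            // true = watch for future changes
        System.out.println(new String(data));
        zk.close();
      }
    }
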
Flume - Data Ingestion to Hadoop:

Flume is a distributed, reliable service for ingesting high-volume, streaming data into Hadoop
storage like HDFS or HBase. Its goal is to collect, aggregate and move large amounts of log
data from many sources to a centralized data store. The core components are sources (data
generators), channels (data buffers), and sinks (writers to destinations). An agent is a JVM process running sources and sinks connected through channels. Events flow from external sources through Flume agents, and optionally through intermediate collectors, into HDFS/HBase. Flume can be configured for fault tolerance and automatic failover, and interceptors allow events to be inspected or transformed as they flow through an agent.
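
A minimal sketch of pushing an event to an agent's Avro source with Flume's Java RPC client SDK (the host, port, and event body are assumptions):

    // Sketch only: sending one event to a running Flume agent.
    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientSketch {
      public static void main(String[] args) {
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
          Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
          client.append(event);   // source -> channel buffer -> sink drains to HDFS/HBase
        } catch (EventDeliveryException e) {
          e.printStackTrace();    // event could not be delivered (agent down, channel full, ...)
        } finally {
          client.close();
        }
      }
    }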

Sqoop - Data Transfer Between Hadoop and RDBMSs:

Sqoop is a command-line tool that facilitates efficient bulk transfer of data between Hadoop
(HDFS/Hive) and structured relational databases like MySQL, Oracle, Postgres etc. It can
import tables from an RDBMS to HDFS/Hive, and export from HDFS to the RDBMS. Sqoop
uses database connectors (JDBC) to connect to the databases. It parallelizes transfers, works with Kerberos authentication, and can compress transferred data. Incremental import/export and null handling are supported features. Sqoop provides consistent coding and configuration
options across different databases. It integrates with the Hadoop MapReduce engine and file
formats like SequenceFiles.

Here are summaries of each of those 3 topics in around 200 words:

Oozie - Workflow Scheduler for Hadoop Jobs:

Apache Oozie is a workflow scheduling system to manage and orchestrate Hadoop jobs in a
distributed environment. It allows combining multiple jobs into a logically connected sequence of
actions, coordinating their execution across Hadoop components like HDFS, MapReduce, Pig,
Hive etc.

Oozie supports two main job types - Workflows (sequence of actions) and Coordinators (trigger
workflows based on time/data events). Workflows are defined using a control-flow XML notation.
Coordinators use time/data triggers and can schedule multiple Workflow instances. Oozie
provides job traceability, recovery, and notification mechanisms.

It integrates with the YARN resource manager and can schedule jobs across clusters. An Oozie client submits jobs, which are managed by the Oozie server. Oozie leverages ZooKeeper and the Hadoop JobTracker/ResourceManager services. Workflows can be re-run, suspended, or killed on demand. Oozie simplifies building complex Hadoop processing pipelines.
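
A minimal sketch of submitting and inspecting a workflow with the Oozie Java client (the Oozie URL, HDFS application path, and cluster addresses are assumptions):

    // Sketch only: starting an Oozie workflow job from Java.
    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSketch {
      public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/workflows/etl");
        props.setProperty("nameNode", "hdfs://namenode:8020");          // referenced by workflow.xml
        props.setProperty("resourceManager", "rm.example.com:8032");    // property name is illustrative

        String jobId = oozie.run(props);                 // submit and start the workflow
        WorkflowJob job = oozie.getJobInfo(jobId);       // poll status, actions, logs
        System.out.println(jobId + " -> " + job.getStatus());
      }
    }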

Apache Spark Architecture:

Spark is a fast, general-purpose distributed data processing engine suited for batch, interactive,
streaming and machine learning use cases. Its core is the Spark driver program that connects to
a cluster manager like YARN and coordinates with distributed executors running on worker
nodes.

Spark's fundamental abstraction is the Resilient Distributed Dataset (RDD) - a fault-tolerant collection of elements that can be operated on in parallel. Spark provides APIs in Java, Scala, Python and R for distributed datasets and parallel operations like map, filter, join etc.

Key components include the Spark Core engine, Spark SQL for structured data querying, Spark Streaming for stream processing, MLlib for machine learning, and GraphX for graph processing. Spark can access data from HDFS, HBase, Cassandra, etc. It leverages in-memory computing and optimizes execution plans.
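
A minimal sketch of the RDD model with Spark's Java API running in local mode (the input data is inlined for illustration):

    // Sketch only: parallel transformations and an action on an RDD.
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkRddSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.parallelize(
              Arrays.asList("error disk full", "info ok", "error timeout"));
          // Transformations (filter, map) are lazy; collect() triggers the distributed job.
          List<String> errors = lines.filter(l -> l.startsWith("error"))
                                     .map(String::toUpperCase)
                                     .collect();
          errors.forEach(System.out::println);
        }
      }
    }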

Overcoming Hadoop Limitations with Spark and Flink:

While Hadoop made big data processing accessible, it has several limitations: a batch-only processing model, high latency, weak support for iterative computation, problems with large numbers of small files, and relatively slow processing speeds.

Apache Spark overcomes some of these limitations through its in-memory computing model,
support for iterative algorithms, stream processing capabilities, and faster performance
compared to Hadoop MapReduce. Its generalized execution engine suits a wide range of
data-intensive workloads.

Apache Flink takes it further by providing a true streaming engine with low event latency and
high throughput, along with unified stream and batch processing capabilities through its
DataStream and DataSet APIs. Its pipelined execution model avoids MapReduce's multiple
sort/shuffle stages. Native iteration operators accelerate iterative processing further.

Both Spark and Flink leverage in-memory computing, scaling out to process very large datasets
efficiently. Their adoption is growing rapidly to complement and enhance Hadoop environments.
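
To make the contrast concrete, here is a minimal sketch of Flink's DataStream API over a small inline stream (the flink-streaming-java dependency is assumed; element values are illustrative):

    // Sketch only: record-at-a-time, pipelined processing with Flink's DataStream API.
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkStreamSketch {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("spark", "flink", "hadoop")
           .map(new MapFunction<String, String>() {
             @Override
             public String map(String value) {
               return value.toUpperCase();       // applied per record as it streams through
             }
           })
           .print();

        env.execute("flink-stream-sketch");      // the lazily built job graph runs only on execute()
      }
    }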
