KCS061 Unit 2
Big Data (KCS-061)
Hadoop
History of Hadoop, Apache Hadoop, the Hadoop Distributed File
System, components of Hadoop, data format, analyzing data with
Hadoop, scaling out, Hadoop Streaming, Hadoop Pipes, Hadoop
Ecosystem
2023/3/27 SHEAT CSE/Big Data/KCS061 1
Hadoop: Definition
Massive amounts of human data and machine data (logs, IoT
devices, etc.) have been collected and stored in quantities far exceeding
traditional business data. A huge technology gap exists between the
massive amounts of data and human capabilities, which has spawned
various big data technologies. In this context, what we call the era of big
data has come into being.
Data Explosion
Hadoop splits files into large blocks and distributes them across
nodes in a cluster. It then transfers packaged code into nodes to
process the data in parallel.
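
The splitting and distribution step can be sketched in a few lines of Python. This is a toy illustration only: the function names, the node names, and the round-robin assignment are assumptions made for the example, and real HDFS uses a replication-aware placement policy instead. The 128 MB figure is the default HDFS block size in Hadoop 2 and later.

```python
from collections import defaultdict

# Default HDFS block size in Hadoop 2 and later (configurable per cluster).
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin.

    Assumption for illustration: real HDFS placement also handles
    replication and rack awareness, which this sketch ignores.
    """
    placement = defaultdict(list)
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return dict(placement)


if __name__ == "__main__":
    # A hypothetical 300 MB file spread over two hypothetical nodes.
    blocks = split_into_blocks(300 * 1024 * 1024)
    for node, held in assign_blocks(blocks, ["node1", "node2"]).items():
        print(node, held)
```

Because each node holds its own blocks, the processing code can be shipped to the node and run against local data, which is what allows the blocks to be processed in parallel.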
• The Hive and Pig projects are popular choices that provide SQL-like
and procedural data flow-like languages, respectively.
• HBase is also a popular way to store and analyze data in HDFS. It is a
column-oriented database and, unlike MapReduce, it provides random
read and write access to data with low latency.
• MapReduce jobs can read and write data in HBase’s table format, but
data processing is often done via HBase’s own client API.
MapReduce
MapReduce framework and basics, how MapReduce works,
developing a MapReduce application, unit tests with MRUnit, test
data and local tests, anatomy of a MapReduce job run, failures, job
scheduling, shuffle and sort, task execution, MapReduce types, input
formats, output formats, MapReduce features, real-world
MapReduce
MapReduce
• Payload − Applications implement the Map and Reduce functions, which form the core of the
job.
• Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where the data is stored in advance, before any processing takes place.
• MasterNode − Node where the JobTracker runs and which accepts job requests from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTrackers.
• TaskTracker − Tracks the tasks and reports status to the JobTracker.
• Job − An execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
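
To make the Mapper, Reducer, Job, and Task terms above concrete, here is a minimal single-process word count in plain Python. It is a sketch of the programming model only, not Hadoop's API: `mapper`, `reducer`, and `run_job` are hypothetical names, and the in-memory grouping stands in for the shuffle and sort that Hadoop performs between the two phases.

```python
from collections import defaultdict


def mapper(_key, line):
    """Map one input pair (offset, line) to intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce all values that share a key to a single (word, total) pair."""
    yield word, sum(counts)


def run_job(lines):
    """Run the mapper over every record, group by key, then run the reducer."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)  # stands in for the shuffle
    results = []
    for key in sorted(groups):  # stands in for the sort
        results.extend(reducer(key, groups[key]))
    return results


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

In Hadoop terms, `run_job` plays the role of one Job, and each call to `mapper` or `reducer` corresponds to the work a Task does on its slice of the data. On a real cluster those calls run on different SlaveNodes instead of in one loop.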
Unit Tests with MRUnit
MRUnit is a unit test library designed to facilitate easy integration
between your MapReduce development process and standard
development and testing tools such as JUnit. With MRUnit, there are
no test files to create, no configuration parameters to change, and
generally less test code.
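
MRUnit itself is a Java library, so the following is not its API. It is a rough analogy in plain Python, using the standard `unittest` module, for the idea MRUnit supports: feed a mapper or reducer a few in-memory records and assert on the output, with no input files or cluster configuration. The `mapper` and `reducer` functions are hypothetical word-count examples.

```python
import unittest


def mapper(_key, line):
    """Hypothetical word-count mapper: emit (word, 1) for each word."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Hypothetical word-count reducer: sum the counts for one word."""
    yield word, sum(counts)


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        output = list(mapper(0, "Big data big"))
        self.assertEqual(output, [("big", 1), ("data", 1), ("big", 1)])

    def test_reducer_sums_counts(self):
        output = list(reducer("big", [1, 1]))
        self.assertEqual(output, [("big", 2)])
```

Saved to a file, the tests can be run with `python -m unittest` followed by the file name. Each test exercises one function in isolation, which is the same separation MRUnit encourages between testing the map logic and testing the reduce logic.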