Unit 4
Hadoop Ecosystem, HDFS, MapReduce, Python and Hadoop Streaming, Spark Basics, PySpark
Introduction: The Hadoop Ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the other tools and solutions are used to supplement or support these major elements. All of these tools work together to provide services such as ingestion, analysis, storage, and maintenance of data.
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
HDFS consists of two core components:
Name Node and Data Node
The Name Node is the prime node; it holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective.
HDFS maintains all the coordination between the clusters and the hardware, thus working at the heart of the system.
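As a quick illustration of how data lands in HDFS, the sketch below drives the standard `hdfs dfs` shell commands from Python; the file and directory paths are placeholders and assume a configured Hadoop client on the PATH.

```python
import subprocess

# Illustrative sketch: paths are placeholders; assumes the `hdfs` client is installed.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)        # create an HDFS directory
subprocess.run(["hdfs", "dfs", "-put", "data.txt", "/user/demo/"], check=True)   # copy a local file into HDFS
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)                 # list the directory contents
```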
YARN:
Yet Another Resource Negotiator, as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
Resource Manager, Node Manager, Application Master
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and afterwards acknowledge the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates key-value pairs as its result, which are later processed by the Reduce() method.
Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
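As an illustration, the classic word-count job can be sketched in plain Python to show how Map() emits key-value pairs and Reduce() aggregates them; this is a single-machine sketch of the idea, not actual Hadoop code.

```python
from collections import defaultdict

def map_phase(lines):
    # Map(): split each line into words and emit (word, 1) key-value pairs
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce(): group the pairs by key and sum the values for each word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data is big", "hadoop handles big data"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```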
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop
Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it supports both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
The JDBC and ODBC drivers establish the connection and the data-storage permissions, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
Mahout brings machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environment interaction, or algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows algorithms to be invoked as per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
It uses in-memory resources, and is hence faster than MapReduce in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing; hence both are used in most companies, each for its own kind of workload.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides capabilities similar to Google's BigTable and is therefore able to work on big data sets effectively.
When we need to search for or retrieve a small number of occurrences in a huge database, the request must be processed within a short span of time. In such cases HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.
**Python and Hadoop Streaming:**
Hadoop Streaming is a utility that comes with the Hadoop distribution, allowing users to create and run MapReduce jobs with any executable or script as the mapper and/or reducer. Python is a popular choice for writing these scripts due to its ease of use and the availability of libraries like `mrjob` that simplify the process of writing Hadoop Streaming jobs in Python.
To use Python with Hadoop Streaming, you would typically write a mapper and reducer script in Python,
make them executable (`chmod +x <script_name>`), and then use them with the `hadoop jar` command to
run the Hadoop Streaming job.
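For illustration, below is a minimal word-count pair of such scripts; the file names `mapper.py` and `reducer.py` are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- read lines from standard input and emit "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts the mapper output by key, so counts for the
# same word arrive on consecutive lines and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

After `chmod +x mapper.py reducer.py`, the job might be launched with something like `hadoop jar <path-to-hadoop-streaming.jar> -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs-input-dir> -output <hdfs-output-dir>`, where the streaming jar path and the HDFS input/output directories depend on your installation.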
**PySpark:**
PySpark is the Python API for Spark, allowing you to use Spark with Python. PySpark provides an easy-to-use
programming interface and allows you to leverage the power of Spark for big data processing using Python
code.
PySpark provides a `SparkContext` object (`sc`) to interact with Spark, and you can use Python's familiar
syntax to work with RDDs and perform transformations and actions. For example:
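```python
from pyspark import SparkContext

# A minimal sketch; the app name and local master are illustrative.
sc = SparkContext("local", "SquareExample")

numbers = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a list of numbers
squares = numbers.map(lambda x: x * x)      # transformation: square each element
print(squares.collect())                    # action: collect results to the driver
# [1, 4, 9, 16, 25]
```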
This example creates an RDD from a list of numbers, squares each element using a lambda function, and
collects the results back to the driver program.
PySpark also provides the DataFrame API, which allows you to work with structured data using the Spark SQL module. This provides a more familiar and optimized way to work with data compared to RDDs.
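As a brief sketch, a DataFrame can be created and queried through a `SparkSession`; the column names and rows below are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a small DataFrame; the data and schema are illustrative.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()   # Spark SQL plans and optimizes this query before execution
```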
Overall, PySpark is a powerful tool for working with big data using Python, allowing you to leverage the
scalability and performance of Spark for your data processing tasks.