Unit 4
Hadoop Ecosystem, HDFS, MapReduce, Python and Hadoop Streaming, Spark Basics, PySpark
Introduction: The Hadoop Ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the other tools and solutions are used to supplement or support these major elements. All of these tools work together to provide services such as ingestion, analysis, storage, and maintenance of data.
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
HDFS consists of two core components:
Name Node and Data Node
The Name Node is the prime node; it holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective.
HDFS maintains all the coordination between the clusters and the hardware, thus working at the heart of the system.
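As a quick illustration of how data lands in HDFS, the sketch below drives the standard `hdfs dfs` shell commands from Python; the file and directory paths are placeholders and assume a configured Hadoop client on the PATH.

```python
import subprocess

# Illustrative sketch: paths are placeholders; assumes the `hdfs` client is installed.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)        # create an HDFS directory
subprocess.run(["hdfs", "dfs", "-put", "data.txt", "/user/demo/"], check=True)   # copy a local file into HDFS
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)                 # list the directory contents
```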
YARN:
Yet Another Resource Negotiator, as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
Resource Manager, Node Manager, Application Master
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and afterwards acknowledge the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates key-value pairs as its result, which are later processed by the Reduce() method.
Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
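As an illustration, the classic word-count job can be sketched in plain Python to show how Map() emits key-value pairs and Reduce() aggregates them; this is a single-machine sketch of the idea, not actual Hadoop code.

```python
from collections import defaultdict

def map_phase(lines):
    # Map(): split each line into words and emit (word, 1) key-value pairs
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce(): group the pairs by key and sum the values for each word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data is big", "hadoop handles big data"]
print(reduce_phase(map_phase(lines)))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```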
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop
Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it supports both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
The JDBC and ODBC drivers establish the connection and the data-storage permissions, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
Mahout brings machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environment interaction, or algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows algorithms to be invoked as per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
It uses in-memory resources, and is hence faster than MapReduce in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing; hence both are used in most companies, each for its own kind of workload.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides capabilities similar to Google's BigTable and is therefore able to work on big data sets effectively.
When we need to search for or retrieve a small number of occurrences in a huge database, the request must be processed within a short span of time. In such cases HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.
**Python and Hadoop Streaming:**
Hadoop Streaming is a utility that comes with the Hadoop distribution, allowing users to create and run MapReduce jobs with any executable or script as the mapper and/or reducer. Python is a popular choice for writing these scripts due to its ease of use and the availability of libraries like `mrjob` that simplify the process of writing Hadoop Streaming jobs in Python.
To use Python with Hadoop Streaming, you would typically write a mapper and reducer script in Python,
make them executable (`chmod +x <script_name>`), and then use them with the `hadoop jar` command to
run the Hadoop Streaming job.
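For illustration, below is a minimal word-count pair of such scripts; the file names `mapper.py` and `reducer.py` are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- read lines from standard input and emit "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts the mapper output by key, so counts for the
# same word arrive on consecutive lines and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

After `chmod +x mapper.py reducer.py`, the job might be launched with something like `hadoop jar <path-to-hadoop-streaming.jar> -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs-input-dir> -output <hdfs-output-dir>`, where the streaming jar path and the HDFS input/output directories depend on your installation.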
**PySpark:**
PySpark is the Python API for Spark, allowing you to use Spark with Python. PySpark provides an easy-to-use
programming interface and allows you to leverage the power of Spark for big data processing using Python
code.
PySpark provides a `SparkContext` object (`sc`) to interact with Spark, and you can use Python's familiar
syntax to work with RDDs and perform transformations and actions. For example:
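```python
from pyspark import SparkContext

# A minimal sketch; the app name and local master are illustrative.
sc = SparkContext("local", "SquareExample")

numbers = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a list of numbers
squares = numbers.map(lambda x: x * x)      # transformation: square each element
print(squares.collect())                    # action: collect results to the driver
# [1, 4, 9, 16, 25]
```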
This example creates an RDD from a list of numbers, squares each element using a lambda function, and
collects the results back to the driver program.
PySpark also provides the DataFrame API, which allows you to work with structured data using the Spark SQL module. This provides a more familiar and optimized way to work with data compared to RDDs.
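As a brief sketch, a DataFrame can be created and queried through a `SparkSession`; the column names and rows below are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a small DataFrame; the data and schema are illustrative.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()   # Spark SQL plans and optimizes this query before execution
```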
Overall, PySpark is a powerful tool for working with big data using Python, allowing you to leverage the
scalability and performance of Spark for your data processing tasks.