APEX INSTITUTE OF TECHNOLOGY.
AIT-IBM CSE
CHANDIGARH UNIVERSITY, MOHALI
Big Data Technologies (Spark & Scala)
(22CSH-391)
Lecture-1 (CO1)
By
Dr Geeta Rani (E15227)
Associate Professor (Chandigarh University)
Course Outcomes
CO1 • Understand the components of the Hadoop Ecosystem and Data Science methodology
CO2 • Understand the constructs of Scala
CO3 • Understand Apache Spark and its components
CO4 • Design the applications using Scala
CO5 • Develop the Applications using Spark and its available Libraries
MapReduce AND YARN
MapReduce
• It is the processing layer of Hadoop.
• MapReduce is a programming model designed for processing large
volumes of data in parallel by dividing the work into a set of
independent chunks.
• It has two processes: the Mapper and the Reducer.
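The idea of dividing work into chunks and processing them in parallel can be sketched in plain Python. This is only an illustration, not Hadoop code: the function names, the chunk size and the use of a thread pool are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide the input into fixed-size chunks (loosely analogous to input splits)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one chunk: here, a partial sum."""
    return sum(chunk)


def parallel_total(records, chunk_size=4, workers=3):
    chunks = split_into_chunks(records, chunk_size)
    # Each chunk is handed to a worker; chunks do not depend on each other.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


total = parallel_total(list(range(1, 11)))
print(total)  # 55
```

Because no chunk depends on another, adding workers (or machines, in a real cluster) lets more chunks be processed at the same time.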
MapReduce
• Map phase: the first phase of data processing. In
this phase, we specify all the complex logic, business
rules and costly code.
• Reduce phase: the second phase of processing.
In this phase, we specify light-weight processing such as
aggregation or summation.
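A minimal word-count sketch shows how the two phases fit together. It simulates the model in a single Python process and does not use the Hadoop API. The function names are hypothetical, and the grouping step between the phases (usually called the shuffle) is written out by hand here, whereas a real framework performs it automatically.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: light-weight aggregation, here a summation."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data moves fast"])
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'moves': 1, 'fast': 1}
```

The mapper handles each line independently, so lines can be spread over many machines. The reducer only sums the values already grouped under one key.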
Hadoop version 2.0
• YARN was introduced in Hadoop version 2.0 in the year 2012
by Yahoo and Hortonworks. The basic idea behind YARN is to
relieve MapReduce by taking over the responsibility of
resource management and job scheduling.
• YARN allows different data processing methods, such as graph
processing, interactive processing, stream processing and
batch processing, to run on data stored in HDFS.
YARN therefore opens up Hadoop to other types of distributed
applications beyond MapReduce.
• YARN lets users choose the tool that fits the requirement,
such as Spark for real-time processing, Hive for SQL,
HBase for NoSQL and others.
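The separation of resource management from the processing engine can be pictured with a toy model. This is a sketch under stated assumptions, not the YARN API: the classes are hypothetical stand-ins named after the YARN components, memory is the only resource tracked, and the first-fit placement rule is a simplification of what real YARN schedulers do.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for one worker node that hosts containers."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_name, memory_mb):
        self.free_memory_mb -= memory_mb
        self.containers.append((app_name, memory_mb))


class ResourceManager:
    """Toy stand-in for the cluster-wide resource allocator."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_name, memory_mb):
        # First-fit placement (an assumption made for this sketch).
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.launch(app_name, memory_mb)
                return node.name
        return None  # no capacity left; a real scheduler would queue the request


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
placements = {
    "mapreduce-batch": rm.allocate("mapreduce-batch", 2048),
    "spark-streaming": rm.allocate("spark-streaming", 2048),
    "graph-job": rm.allocate("graph-job", 2048),
    "hive-query": rm.allocate("hive-query", 1024),
}
print(placements)
```

The allocator never looks at what kind of application asks for a container, which is why batch, streaming and graph workloads can share one cluster.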
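Limitations of Hadoop version 1.0
• Scalability: a single JobTracker handled both resource
management and job scheduling, which became a bottleneck
on large clusters.
• The utilization of computational resources was inefficient.
• The Hadoop framework was limited to the
MapReduce processing paradigm.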
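The inefficient use of resources can be illustrated with a small calculation. It assumes the Hadoop 1.x design in which each node had a fixed number of map slots and reduce slots, and compares it with a pool of general-purpose containers. The function names and the task counts are made up for the example.

```python
def run_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Map tasks may use only map slots, reduce tasks only reduce slots."""
    running = min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)
    utilization = running / (map_slots + reduce_slots)
    return running, utilization


def run_with_shared_containers(map_tasks, reduce_tasks, containers):
    """Any task may use any free container."""
    running = min(map_tasks + reduce_tasks, containers)
    utilization = running / containers
    return running, utilization


# Six map tasks are waiting and no reduce task is ready yet.
fixed = run_with_fixed_slots(map_tasks=6, reduce_tasks=0, map_slots=4, reduce_slots=2)
shared = run_with_shared_containers(map_tasks=6, reduce_tasks=0, containers=6)

print(fixed[0], round(fixed[1], 2))  # 4 0.67
print(shared)                        # (6, 1.0)
```

With fixed slots, two map tasks wait while two reduce slots sit idle. With shared containers, all six tasks run at once.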
Q/A
• What are the two main components of YARN?
• a) ResourceManager and NameNode
• b) ResourceManager and NodeManager
• c) TaskTracker and JobTracker
• d) DataNode and SecondaryNameNode
Ans: b
Q/A
• Which programming framework is commonly used with YARN for
distributed data processing?
• a) Hadoop MapReduce
• b) Apache Spark
• c) Apache Hive
• d) Apache Kafka
Ans: b) Apache Spark
Q/A
• What is the role of the NodeManager in YARN?
• a) Launching containers and monitoring resource usage on its node
• b) Managing and storing metadata information
• c) Coordinating and scheduling MapReduce jobs
• d) Storing and managing data blocks
Ans: a
References:
✔ https://www.edureka.co/blog/big-data-tutorial
✔ https://www.coursera.org/learn/big-data-introduction?specialization=big-data2
✔ https://www.coursera.org/learn/fundamentals-of-big-data
✔ Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔ Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔ Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi,
Wiley publication
8/8/2021
THANK YOU