Big Data NOTES and QB
Skills:
Upon completion of this course, students will be able to do the following:
o Students will be able to build and maintain reliable, scalable, distributed systems with
Apache Hadoop.
o Students will be able to write Map-Reduce based Applications
o Students will be able to design and build Big Data applications using Hive and Pig
o Students will learn tips and tricks for Big Data use cases and solutions
Activities:
Install Hadoop and develop applications on Hadoop
Develop Map Reduce applications
Develop applications using Hive/Pig/Spark
Unit-I
Introduction to big data: Data, Characteristics of data and Types of digital data, Sources of
data, Working with unstructured data, Evolution and Definition of big data, Characteristics
and Need of big data, Challenges of big data
Big data analytics: Overview of business intelligence, Data science and Analytics, Meaning
and Characteristics of big data analytics, Need of big data analytics, Classification of analytics,
Challenges to big data analytics, Importance of big data analytics, Basic terminologies in big
data environment
Unit-II
Introduction to Hadoop : Introducing Hadoop, need of Hadoop, limitations of RDBMS,
RDBMS versus Hadoop, Distributed Computing Challenges, History of Hadoop , Hadoop
Overview, Use Case of Hadoop, Hadoop Distributors, HDFS (Hadoop Distributed File
System) , Processing Data with Hadoop, Managing Resources and Applications with Hadoop
YARN (Yet another Resource Negotiator), Interacting with Hadoop Ecosystem
Unit-III
Introduction to MAPREDUCE Programming: Introduction , Mapper, Reducer, Combiner,
Partitioner , Searching, Sorting , Compression, Real time applications using MapReduce, Data
serialization and Working with common serialization formats, Big data serialization formats
Unit-IV
Introduction to Hive: Introduction to Hive, Hive Architecture , Hive Data Types, Hive File
Format, Hive Query Language (HQL), User-Defined Function (UDF) in Hive.
Introduction to Pig
Introduction to Pig, The Anatomy of Pig , Pig on Hadoop , Pig Philosophy , Use Case for Pig:
ETL Processing , Pig Latin Overview , Data Types in Pig , Running Pig , Execution Modes of
Pig, HDFS Commands, Relational Operators, Piggy Bank , Word Count Example using Pig ,
Pig at Yahoo!, Pig versus Hive
Unit-V
Spark: Introduction to data analytics with Spark, Programming with RDDs, Working with
key/value pairs, Advanced Spark programming
Text Books
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley
2. Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy
Konwinski, Patrick Wendell, Matei Zaharia, O'Reilly Media, Inc.
Reference Books:
1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions", Wiley,
ISBN: 9788126551071, 2015.
UNIT – I
There are three types of data we need to consider: structured, unstructured, and semi-
structured. Of these, the last two are new in Big Data.
Structured Data: Your current data warehouse contains structured data and only structured
data. It’s structured because when you placed it in your relational database system a structure
was enforced on it, so we know where it is, what it means, and how it relates to other pieces of
data in there. It may be text (a person’s name) or numerical (their age) but we know that the
age value goes with a specific person, hence structured.
Unstructured Data: Essentially everything else that has not been specifically structured is
considered unstructured. The list of truly unstructured data includes free text such as
documents produced in your company, images and videos, audio files, and some types of social
media. If the object to be stored carries no tags (metadata about the data) and has no
established schema, ontology, glossary, or consistent organization it is unstructured. However,
in the same category as unstructured data there are many types of data that do have at least
some organization.
Semi-Structured Data: The line between unstructured data and semi-structured is a little
fuzzy. If the data has any organizational structure (a known schema) or carries a tag (like
XML extensible markup language used for documents on the web) then it is somewhat easier
to organize and analyze, and because it is more accessible for analysis may make it more
valuable. Some types of data that appear to be unstructured but are actually semi-structured
include:
Text: XML, email or electronic data interchange (EDI) messages. These lack formal
structure but do contain tags or a known structure that separates semantic elements. Most
social media sources, a hot topic for analysis today, fall in this category. Facebook,
Twitter, and others offer data access through an application programming interface (API).
Web Server Logs and Search Patterns: An individual’s journey through a web site,
whether searching, consuming content, or shopping is recorded in detail in electronic web
server logs.
Sensor Data: There is a huge explosion in the number of sensors producing streams of
data all around us. Once we thought of sensors as only being found in industrial control
systems or major transportation systems. Now this includes RFIDs, infrared and wireless
technology, and GPS location signals among others. In addition to monitoring mechanical
systems, sensors increasingly monitor consumer behavior. Your cell phone puts out a
constant stream of signals that are being captured for location-based marketing. In-store
sensors are monitoring consumer shopping behavior. Your car monitors its systems and
constantly records data that can be used to evaluate mechanical failure or accidents.
There is huge growth in the popularity of ‘the quantified self’ in which we voluntarily
wear devices like the FitBit or a Nike Fuel Band that record our activity and in some
cases even heart rate, velocity, location, and calorie burn. While a great deal of attention
is being paid to new types of analysis for social media, in the next two or three years at
most we will reach a crossover point where the volume of data available from sensors
will exceed new social media postings, and sensor data volumes are likely to grow 10 or
20 times faster than social media sources.
We have been refining our use of structured data for the past 10 or 20 years. Opportunity lies
in understanding how adding unstructured and semi-structured data to the mix creates
competitive advantage. Here are just a few thought starters for your consideration:
Marketing and Sales Campaigns: Consumers now actively share their likes and dislikes
about companies, campaigns, and products through social media. Through text-
based sentiment analysis of social media messages companies are learning quickly what
pleases and displeases their customers and prospects.
Ecommerce: Web server logs and search engine summaries are being analyzed in detail to
discover how to make the customer’s journey through your web site easier for them and more
profitable for you.
Brick and Mortar Retail: Retailers using electronic, RFID, video, and infrared technologies
can now track customers as groups and as individuals through their physical stores to enhance
the shopping experience. Some grocery chains are now using video technology to count the
number of shoppers and predict the number of checkout lanes needed to keep wait times at
acceptable levels. Customer reward cards can gather even more information matching
customer detail to specific product purchases.
Supply Chain: Both the consumers and providers of global logistical services have combined
data sources from traditional internal ERP systems with semi-structured data from GPS
location trackers, EDI messages, RFID and bar scans of shipped and in-transit merchandise,
and even social media sources to speed goods along at lower cost.
Finance: All types of financial institutions including banks, credit card companies, and the
internal finance activities of companies are rapidly embracing new data types to reduce fraud,
reduce revenue leakage (under billing), and ensure compliance with the multitude of financial
laws and regulations.
Healthcare: The government’s initiative to require electronic health records is making new
and vast semi-structured data sources available to enhance treatment outcomes and contain
cost.
Business executives need to understand the new opportunities available in Big Data from
unstructured and semi-structured data, and how to blend these newly available data types into
their data-driven competitive strategies.
The term big data describes the large volume of data, both structured and unstructured, that
flows through the day-to-day business environment. What matters is not the amount of data
itself but what organizations do with the data.
Big data can be analyzed in depth for insights that lead to better decisions and strategic moves
for the development of the organization.
The Evolution of Big Data
While the term "big data" is relatively new, the act of gathering and storing huge
amounts of information for eventual analysis is ages old. The concept gained momentum in
the early 2000s when industry analyst Doug Laney articulated the definition of big data around
three dimensions, as follows:
Volume: Organizations collect data from a variety of sources, including business
transactions, social media and information from sensor or machine-to-machine data. In the past,
storage was a big issue, but the advancement of new technologies (such as Hadoop) has
reduced the burden.
Velocity: Data streams in at unprecedented speed and must be dealt with in a timely manner.
RFID tags, sensors and smart metering are driving the need to deal with torrents of data in
near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases
to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Vendors such as SAS also consider additional dimensions, such as variability and veracity, with
respect to big data.
What are the categories which come under Big Data?
Big data works on the data produced by various devices and their applications. Below are some
of the fields that come under the umbrella of Big Data.
Black Box Data: This is generated by the flight recorders of aircraft, which store a large amount
of information, including conversations between crew members and any other communications
(alert messages or orders passed) with the technical ground staff.
Social Media Data: Social networking sites such as Facebook and Twitter contain the
information and the views posted by millions of people across the globe.
Stock Exchange Data: It holds complete information about the 'buy' and 'sell' decisions made by
customers on the shares of different companies.
Power Grid Data: The power grid data mainly holds information about the power consumed by
a particular node with respect to a base station.
Transport Data: It includes data from various transport sectors, such as the model, capacity,
distance and availability of a vehicle.
Search Engine Data: Search engines retrieve large amounts of data from many different
databases.
Education
Big data helps educational institutions analyze and track student progress, and can support an
improved system for the evaluation and support of teachers and principals in their teaching.
Health Care
In health care, patient records, treatment plans, prescription information, etc. all need to be
handled quickly and accurately, and in some aspects with enough transparency to satisfy
stringent industry regulations. Effective management of big data helps uncover hidden insights
that improve patient care.
Manufacturing
Manufacturers can improve their quality and output while minimizing waste, and well-understood
processes are a key factor in today's highly competitive market. Several manufacturers
are working on analytics so that they can solve problems faster and make more agile business
decisions.
Retail
Maintaining customer relationships is the biggest challenge in the retail industry, and the best
way to manage it is to manage big data. Retailers must have unique marketing ideas to sell their
products to customers, the most effective ways to handle transactions, and improvised tactics
that use Big Data innovatively to improve their business.
Brief explanation of how exactly businesses are utilizing Big Data
Big Data is converted into nuggets of information, and it then becomes very
straightforward for most business enterprises to know what their customers want, which
products are moving fast, what end users expect from customer service, how to speed up
time to market, how to reduce costs, and how to build economies of scale in a highly
efficient manner. Hence Big Data leads to big benefits for organizations, and hence there
is a strong demand for it in the IT world.
Big Data Technologies
Accurate analysis based on big data helps increase and optimize operational efficiency,
enables cost reductions, and reduces risk for business operations.
In order to capitalize on big data, one requires infrastructure that can manage and
process huge volumes of structured and unstructured data in real time and can ensure data
privacy and security.
Many technologies are available in the market from different vendors, including
Amazon, IBM, Microsoft, etc., to handle big data. To pick a particular technology one
must examine the two classes of big data systems, which are as follows:
Operational Big Data
It includes systems such as MongoDB that provide operational capabilities for
interactive and real-time workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to capitalize on new cloud
computing architectures, allowing massive computations to be run inexpensively and
efficiently. This makes operational big data workloads much easier to manage, and
cheaper and faster to implement.
Analytical Big Data
It includes systems such as Massively Parallel Processing (MPP) database systems and
MapReduce, which provide analytical capabilities for retrospective and complex analysis.
MapReduce provides a new method of analyzing data that complements the capabilities
provided by SQL, and it can be scaled out from single servers to thousands of high-end
and low-end machines.
Barriers
Barriers that are imposed on big data are as follows:
Capture data
Storage Capacity
Searching
Sharing
Transfer
Analysis
Presentation
Enterprises use a variety of measures to overcome the barriers mentioned above.
Differentiation between Operational vs. Analytical Systems
Operational: interactive, real-time workloads (e.g., MongoDB and other NoSQL systems) where data is captured and stored as it arrives.
Analytical: retrospective, complex analysis over large portions of the data (e.g., MPP database systems and MapReduce).
In order to understand 'Big Data', we first need to know what 'data' is. The Oxford dictionary
defines 'data' as the quantities, characters, or symbols on which operations are performed by a
computer.
So, 'Big Data' is also data, but of huge size. 'Big Data' is a term used to describe a
collection of data that is huge in size and yet growing exponentially with time. In short, such
data is so large and complex that none of the traditional data management tools are able to store
it or process it efficiently.
Any data that can be stored, accessed and processed in the form of a fixed format is termed
'structured' data. Over a period of time, talent in computer science has achieved great
success in developing techniques for working with such data (where the format is well
known in advance) and also in deriving value out of it. However, nowadays we are foreseeing
issues when the size of such data grows to a huge extent, with typical sizes in the range of
multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.
Looking at these figures one can easily understand why the name 'Big Data' is given and
imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
'structured' data.
Examples Of Structured Data
An 'Employee' table in a database is an example of Structured Data
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, e.g., a table definition as in a relational
DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over years
Please note that web application data, which is unstructured, consists of log files, transaction
history files etc. OLTP systems are built to work with structured data wherein data is stored in
relations (tables).
Characteristics Of 'Big Data'
(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data
plays a very crucial role in determining the value of data. Also, whether particular data can
actually be considered Big Data or not depends upon the volume of data.
Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big
Data'.
(ii)Variety – The next aspect of 'Big Data' is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data
considered by most applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This
variety of unstructured data poses certain issues for storage, mining and analysing data.
(iii)Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, Mobile devices, etc. The
flow of data is massive and continuous.
(iv)Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Benefits of Big Data Processing
Ability to process 'Big Data' brings in multiple benefits, such as-
• Businesses can utilize outside intelligence while taking decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with 'Big
Data' technologies. In these new systems, Big Data and natural language processing
technologies are being used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
'Big Data' technologies can be used for creating a staging area or landing zone for new data
before identifying what data should be moved to the data warehouse. In addition, such
integration of 'Big Data' technologies and the data warehouse helps an organization offload
infrequently accessed data.
1. What do you know about the term “Big Data”?
Answer: Big Data is a term associated with complex and large datasets. A relational database
cannot handle big data, and that’s why special tools and methods are used to perform
operations on a vast collection of data. Big data enables companies to understand their business
better and helps them derive meaningful information from the unstructured and raw data
collected on a regular basis. Big data also allows the companies to take better business
decisions backed by data.
2. What are the five V’s of Big Data?
Answer: The five V's of Big Data are as follows:
Volume – Volume represents the volume i.e. amount of data that is growing at a high
rate i.e. data volume in Petabytes
Velocity – Velocity is the rate at which data grows. Social media contributes a major
role in the velocity of growing data.
Variety – Variety refers to the different data types i.e. various data formats like text,
audios, videos, etc.
Veracity – Veracity refers to the uncertainty of available data. Veracity arises due to
the high volume of data that brings incompleteness and inconsistency.
Value –Value refers to turning data into value. By turning accessed big data into
values, businesses may generate revenue.
Many organizations are implementing big data analytics. Some popular companies that use big
data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of
America, etc.
5. Explain the steps to be followed to deploy a Big Data solution.
Answer: Following are the three steps that are followed to deploy a Big Data solution –
1. Data Ingestion
The first step for deploying a big data solution is the data ingestion i.e. extraction of data from
various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning
System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds
etc. The data can be ingested either through batch jobs or real-time streaming. The extracted
data is then stored in HDFS.
Steps of Deploying a Big Data Solution
2. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in
HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access,
whereas HBase works well for random read/write access.
3. Data Processing
The final step in deploying a big data solution is the data processing. The data is processed
through one of the processing frameworks like Spark, MapReduce, Pig, etc.
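To make the processing step concrete, here is a minimal, hypothetical MapReduce driver sketch in Java; the input/output paths and job name are placeholders, and the identity Mapper and Reducer simply pass records through (a real deployment would plug in its own classes).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "process ingested data");
    job.setJarByClass(ProcessingDriver.class);
    job.setMapperClass(Mapper.class);      // identity mapper: passes records through unchanged
    job.setReducerClass(Reducer.class);    // identity reducer
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/ingested"));     // data landed by ingestion
    FileOutputFormat.setOutputPath(job, new Path("/data/processed"));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}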
6. Do you have any Big Data experience? If so, please share it with us.
How to Approach: There is no specific answer to the question as it is a subjective question
and the answer depends on your previous experience. Asking this question during a big data
interview, the interviewer wants to understand your previous experience and is also trying to
evaluate if you are fit for the project requirement.
So, how will you approach the question? If you have previous experience, start with your
duties in your past position and slowly add details to the conversation. Tell them about your
contributions that made the project successful. This question is generally the 2nd or 3rd question
asked in an interview. The later questions are based on this question, so answer it carefully.
You should also take care not to go overboard with a single aspect of your previous job. Keep
it simple and to the point.
7. Do you prefer good data or good models? Why?
How to Approach: This is a tricky question but generally asked in the big data interview. It
asks you to choose between good data or good models. As a candidate, you should try to
answer it from your experience. Many companies want to follow a strict process of evaluating
data, which means they have already selected data models. In this case, having good data can be
game-changing. The other way around also works, as a model is chosen based on good data.
As we already mentioned, answer it from your experience. However, don’t say that having both
good data and good models is important as it is hard to have both in real life projects.
8. Will you optimize algorithms or code to make them run faster?
How to Approach: The answer to this question should always be “Yes.” Real world
performance matters and it doesn’t depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in
code or algorithm optimization. For a beginner, it obviously depends on which projects he
worked on in the past. Experienced candidates can share their experience accordingly as well.
However, be honest about your work, and it is fine if you haven’t optimized code in the past.
Just let the interviewer know your real experience and you will be able to crack the big data
interview.
9. How do you approach data preparation?
How to Approach: Data preparation is one of the crucial steps in big data projects. A big data
interview may involve at least one question based on data preparation. When the interviewer
asks you this question, he wants to know what steps or precautions you take during data
preparation.
As you already know, data preparation is required to get necessary data which can then further
be used for modeling purposes. You should convey this message to the interviewer. You
should also emphasize the type of model you are going to use and reasons behind choosing that
particular model. Last, but not the least, you should also discuss important data preparation
terms such as transforming variables, outlier values, unstructured data, identifying gaps, and
others.
10. How would you transform unstructured data into structured data?
How to Approach: Unstructured data is very common in big data. The unstructured data
should be transformed into structured data to ensure proper data analysis. You can start
answering the question by briefly differentiating between the two. Once done, you can now
discuss the methods you use to transform one form to another. You might also share the real-
world situation where you did it. If you have recently graduated, then you can share
information related to your academic projects.
By answering this question correctly, you are signaling that you understand the types of data,
both structured and unstructured, and also have the practical experience to work with these. If
you give an answer to this question specifically, you will definitely be able to crack the big
data interview.
UNIT – II
Introduction to Hadoop
Apache Hadoop was born to enhance the usage of big data and solve its major issues. The web
was generating loads of information on a daily basis, and it was becoming very difficult
to manage the data of around one billion pages of content. To address this, Google invented a
new methodology of processing data, popularly known as MapReduce. A year later Google
published a white paper on the MapReduce framework; Doug Cutting and
Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to
an open-source software framework which supported the Nutch search engine project.
Considering the original case study, Hadoop was designed with much simpler storage
infrastructure facilities.
Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest
strength is scalability: it upgrades from working on a single node to thousands of nodes
seamlessly, without any issue.
Big Data spans many domains; we are able to manage data from videos,
text, transactional data, sensor information, statistical data, social media conversations,
search engine queries, ecommerce data, financial information, weather data, news updates,
forum discussions, executive reports, and so on.
Doug Cutting and his team developed an open-source project known as HADOOP,
which allows you to handle very large amounts of data. Hadoop runs
applications on the basis of MapReduce, where the data is processed in parallel to
accomplish the entire statistical analysis on large amounts of data.
The Apache Hadoop Module
Hadoop Common: Includes the common utilities which support the other Hadoop modules.
HDFS: The Hadoop Distributed File System provides high-throughput access to
application data.
Hadoop YARN: This technology is used for job scheduling and efficient
management of cluster resources.
MapReduce: This is a highly efficient methodology for parallel processing of huge volumes of
data.
Then there are other projects included in the Hadoop module which are less used:
Apache Ambari: It is a tool for managing, monitoring and provisioning of the Hadoop
clusters. Apache Ambari supports the HDFS and MapReduce programs. Major highlights of
Ambari are:
Management of the Hadoop framework is highly efficient, secure and consistent.
Management of cluster operations with an intuitive web UI and a robust API.
The installation and configuration of a Hadoop cluster are simplified effectively.
It supports automation, smart configuration and recommendations.
An advanced cluster security set-up comes as an additional feature of this tool kit.
The entire cluster can be monitored and controlled using metrics, heat maps, analysis and
troubleshooting tools.
Increased levels of customization and extension make this more valuable.
Cassandra: It is a distributed system to handle extremely huge amounts of data stored
across several commodity servers. The database management system (DBMS) is highly
available with no single point of failure.
HBase: it is a non-relational, distributed database management system that works efficiently on
sparse data sets and it is highly scalable.
Apache Spark: This is a highly agile, scalable and secure Big Data compute engine,
versatile enough to work on a wide variety of applications like real-time processing,
machine learning, ETL and so on.
Hive: It is a data warehouse tool used for analyzing, querying and summarizing
data on top of the Hadoop framework.
Pig: Pig is a high-level framework which lets us work in coordination with either
Apache Spark or MapReduce to analyze the data. The language used to code for these
frameworks is known as Pig Latin.
Sqoop: This framework is used for transferring the data to Hadoop from relational databases.
This application is based on a command-line interface.
Oozie: This is a scheduling system for workflow management, executing workflow routes for
successful completion of tasks in Hadoop.
Zookeeper: Open source centralized service which is used to provide coordination between
distributed applications of Hadoop. It offers the registry and synchronization service on a high
level.
Hadoop MapReduce (Processing/Computation layer) – MapReduce is a parallel
programming model, devised at Google, used for writing distributed applications that
efficiently process large amounts of data on large clusters.
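As an illustration of this model, here is a minimal word-count sketch in Java using the Hadoop MapReduce API; the class names are illustrative and not taken from the notes above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits (word, 1) for every word in an input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}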
Hadoop HDFS (Storage layer) – The Hadoop Distributed File System, or HDFS, is based on the
Google File System (GFS) and provides a distributed file system that is especially
designed to run on commodity hardware. It tolerates faults and errors and helps incorporate
low-cost hardware. It gives high-throughput access to application data and is
suitable for applications with large datasets.
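For illustration, the sketch below reads a file from HDFS through Hadoop's Java FileSystem API; the path is a placeholder reused from the example paths later in these notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured file system
    Path path = new Path("/user/saurzcode/dir1/abc.txt");   // placeholder path
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);               // print each line of the HDFS file
      }
    }
  }
}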
Hadoop YARN –Hadoop YARN is a framework used for job scheduling and cluster
resource management.
Hadoop Common – This includes the Java libraries and utilities which provide the Java files
essential to start Hadoop.
Task Tracker – It is a node which accepts tasks such as map, reduce and shuffle
from the Job Tracker.
Job Tracker – It is a service which runs MapReduce jobs on the cluster.
Name Node – It is the node where Hadoop stores all file location information (where data is
stored) in the Hadoop distributed file system.
Data Node – It stores data in the Hadoop distributed file system.
The Intended Audience and Prerequisites
Big Data and analytics are among the most interesting domains in which to build your profile in
the IT world, and there is wide scope for Big Data and Hadoop professionals. This is intended
for individuals who are awed by the sheer might of Big Data, which increasingly commands
attention in corporate boardrooms, and who are keen to take up a career in Big Data and Hadoop.
It suits anyone who aspires to become a Big Data and Hadoop Developer, Administrator,
Architect, Analyst, Scientist or Tester, or who holds a corporate designation such as Chief
Technology Officer, Chief Information Officer, or Technical Manager of an enterprise.
Apache Hive and Pig are high-level tools, so there is no compulsory need for Java or Linux.
Hadoop also allows creating your own MapReduce programs in programming languages like
Ruby, Python, Perl and even C. Hence the only requirement is an understanding of computer
programming logic and deduction; the rest is an add-on and can be easily picked up in a short
duration of time.
How does Hadoop Work?
Hadoop executes large amounts of processing by letting the user connect multiple
commodity computers together as a single functional distributed system; the
clustered machines read the dataset in parallel, produce intermediate results, and after
integration give the desired output.
Hadoop runs code across a cluster of computers and performs the following tasks:
Data are initially divided into files and directories. Files are divided into uniformly sized
blocks (typically 64 MB or 128 MB).
Then the files are distributed across various cluster nodes for further processing of data.
The Job Tracker starts its scheduling programs on individual nodes.
Once all the nodes are done with processing, the output is returned.
The Ultimate Goal
Apache Hadoop framework
Hadoop Distributed File System
Visualizing data using MS Excel, Zoomdata or Zeppelin
Apache MapReduce program
Apache Spark ecosystem
Ambari administration management
Deploying Apache Hive and Pig, and Sqoop
Knowledge of the Hadoop 2.x Architecture
Data analytics based on Hadoop YARN
Deployment of MapReduce and HBase integration
Setup of Hadoop Cluster
Proficiency in Development of Hadoop
Working with Spark RDD
Job scheduling using Oozie
The above roadmap guides you to become a Big Data and Hadoop professional,
ensuring enough skills to work in an industrial environment, solve real-world problems and
arrive at solutions for better progress.
The Challenges facing Data at Scale and the Scope of Hadoop
Big Data is categorized into:
Structured – stores the data in rows and columns, like relational data sets
Unstructured – data that cannot be stored in rows and columns, like video, images, etc.
Semi-structured – data in formats such as XML that are readable by both machines and humans
There is a standardized methodology that Big Data follows, highlighting the usage of ETL.
ETL stands for Extract, Transform, and Load.
Extract – fetching the data from multiple sources
Transform – converting the existing data to fit the analytical needs
Load – loading the transformed data into the right systems to derive value from it.
File permissions and authentication are provided.
Replication is used to handle disk failures. Each block of a file is stored on several nodes
inside the cluster, and the HDFS NameNode continuously monitors the reports sent by every
DataNode to ensure that no block has gone below the desired replication factor due to failures.
If this happens, it schedules the addition of another copy within the cluster.
Why does HDFS work very well with Big Data?
HDFS supports the MapReduce model, in which computation moves to the data, making access
to data very fast.
It follows a data coherency model that is simple to implement yet highly robust and scalable.
It is compatible with any kind of commodity hardware and operating system.
Economy is achieved by distributing data and processing on clusters of parallel nodes.
Data is always safe, as it is automatically replicated in multiple locations.
It provides Java APIs as well as a C language wrapper.
It is easily accessible using a web browser, making it highly utilitarian.
HDFS Architecture
It mainly uses a master-slave architecture and contains the following elements:
NameNode and DataNode
The NameNode runs on commodity hardware that contains the GNU/Linux operating system,
its libraries and the NameNode software. The system containing the NameNode acts as the
master server and carries out the following tasks:
Manages the file system namespace.
Provides clients access to files.
Executes file system operations such as renaming, opening and closing files and directories.
There are a number of DataNodes, usually one per node in the cluster, which manage the
storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Basically a file is split into one or more blocks, and these blocks are stored in a set of
DataNodes.
The NameNode executes file system namespace operations such as opening, closing, and
renaming files and directories, and determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system's clients.
The DataNodes also perform functions such as block creation, deletion, and replication upon
instruction from the NameNode.
1. HDFS is the file system of Hadoop.
2. MR (MapReduce) is the job which runs on the file system.
3. An MR job lets the user ask questions of the files stored in HDFS.
4. Pig and Hive are two projects built so that you can avoid coding MapReduce directly.
5. The Pig and Hive interpreters convert the scripts and SQL queries into MR jobs.
6. For querying data on HDFS without writing MapReduce code, the main options are Impala
and Hive.
7. Impala is optimized for low-latency queries, i.e. real-time and interactive applications.
8. Hive is optimized for batch-processing jobs.
9. Sqoop: can put data from a relational DB into the Hadoop ecosystem.
10. Flume sends data generated by external systems towards HDFS, adapted for high-volume
logging.
11. Hue: a graphical front end to the cluster.
12. Oozie: a workflow management tool.
13. Mahout: a machine learning library.
14. When a 150 MB file is stored, the Hadoop ecosystem breaks it into multiple parts to achieve
parallelism.
15. It breaks the file into smaller units, where the default unit (block) size is 64 MB.
16. The DataNode is the daemon which takes care of everything happening on an individual node.
17. The NameNode keeps track of where each piece of data is stored, when and where it is
required, and how to collect the pieces back together.
Typically, one machine in the cluster is designated as the Name Node and another is associated
with the Resource Manager; these are the masters. The other services, like the MapReduce Job
History server and the Web App Proxy Server, are usually hosted on specific machines or even
on shared resources, loaded as per the requirement of the task. The rest of the nodes in the
cluster have a dual nature of both Node Manager and Data Node. These are collectively termed
the slave nodes.
Hadoop in non-secure mode
The Java configuration of Hadoop has two types of important files:
Read-only default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and
mapred-default.xml.
Site-specific configuration files: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml,
etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
It is also possible to control the Hadoop scripts in the bin/ directory of the distribution by setting
site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh files.
For Hadoop cluster configuration you first create the environment in which the Hadoop
daemons execute, and then set the configuration parameters for the daemons.
The HDFS daemons are the NameNode, the Secondary NameNode and the DataNode. The
YARN daemons are the Resource Manager, the Node Manager and the WebApp Proxy.
The Hadoop Daemons configuration environment
To give the Hadoop daemons site-specific customization, administrators should use the
etc/hadoop/hadoop-env.sh and, optionally, the etc/hadoop/mapred-env.sh and
etc/hadoop/yarn-env.sh scripts. At the very least, JAVA_HOME should be specified correctly
on every remote node.
Configuration of the individual daemons
The list of Daemons with their relevant environment variable
NameNode –HADOOP_NAMENODE_OPTS
DataNode – HADOOP_DATANODE_OPTS
Secondary NameNode – HADOOP_SECONDARYNAMENODE_OPTS
Resource Manager – YARN_RESOURCEMANAGER_OPTS
Node Manager – YARN_NODEMANAGER_OPTS
WebAppProxy – YARN_PROXYSERVER_OPTS
Map Reduce Job History Server – HADOOP_JOB_HISTORYSERVER_OPTS
Other related important Customization configuration parameters:
HADOOP_PID_DIR – the process ID files of the daemons is contained in this directory.
HADOOP_LOG_DIR – the log files of the daemons are stored in this directory.
HADOOP_HEAPSIZE / YARN_HEAPSIZE – the heap size, measured in MB; if the
variable is set to 1000, the heap is set to 1000 MB. By default it is set to 1000.
The HDFS Shell Commands
The important operations of the Hadoop Distributed File System can be carried out with the
shell commands below, used for file management in the cluster.
1. Directory creation in HDFS for a given path.
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
2. Listing of the directory contents.
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode
3. HDFS file upload/download.
Upload:
hadoop fs -put:
Copies a single source file, or multiple source files, from the local file system to the Hadoop
file system.
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download:
hadoop fs -get:
Copies or downloads files to the local file system.
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
4. Viewing of file content
Same as the Unix cat command:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt
5. File copying from source to destination
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
6. Copying a file to HDFS from the local file system and vice versa
Copy from the local host:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Copy to the local host:
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
7. File moving from source to destination.
But remember, you cannot move files across file systems.
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
8. File or directory removal in HDFS.
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
Recursive version of delete:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
9. Showing the file's final few lines.
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
10. Showing the aggregate length of a file.
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
2. What are real-time industry applications of Hadoop?
Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and
distributed computing on large volumes of data. It provides rapid, high-performance and cost-
effective analysis of structured and unstructured data generated on digital platforms and within
the enterprise. It is used in almost all departments and sectors today. Some of the instances
where Hadoop is used:
Managing traffic on streets.
Streaming processing.
Content Management and Archiving Emails.
Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
Fraud detection and Prevention.
Advertisements Targeting Platforms are using Hadoop to capture and analyze click
stream, transaction, video and social media data.
Managing content, posts, images and videos on social media platforms.
Analyzing customer data in real-time for improving business performance.
Public sector fields such as intelligence, defense, cyber security and scientific research.
Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns,
identify rogue traders, more precisely target their marketing campaigns based on
customer segmentation, and improve customer satisfaction.
Getting access to unstructured data like output from medical devices, doctor’s notes,
lab results, imaging reports, medical correspondence, clinical data, and financial data.
3. How is Hadoop different from other parallel computing systems?
Hadoop is a distributed file system which lets you store and handle massive amounts of data on
a cloud of machines, handling data redundancy.
The primary benefit is that since data is stored on several nodes, it is better to process it in a
distributed manner. Each node can process the data stored on it instead of spending time moving
it over the network.
On the contrary, in a relational database computing system, you can query data in real time, but
it is not efficient to store data in tables, records and columns when the data is huge.
Hadoop also provides a scheme to build a column database with Hadoop HBase for runtime
queries on rows.
4. What all modes Hadoop can be run in?
Hadoop can run in three modes:
Standalone Mode: The default mode of Hadoop, it uses the local file system for input and
output operations. This mode is mainly used for debugging purposes, and it does not
support the use of HDFS. Further, in this mode, there is no custom configuration
required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. It is much faster
when compared to other modes.
Pseudo-Distributed Mode (Single Node Cluster): In this case, you need
configuration for all the three files mentioned above. In this case, all daemons are
running on one node and thus, both Master and Slave node are the same.
Fully Distributed Mode (Multiple Cluster Node): This is the production phase of
Hadoop (what Hadoop is known for) where data is used and distributed across several
nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slave.
5. Explain the major difference between HDFS block and InputSplit.
In simple terms, a block is the physical representation of data while a split is the logical
representation of the data present in a block. The split acts as an intermediary between the
block and the mapper.
Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
Now, considering the map, it will read the first block from ii till ll, but it does not know how to
process the second block at the same time. Here Split comes into play: it forms a
logical group of Block 1 and Block 2 as a single block.
It then forms a key-value pair using the input format and record reader and sends the map for
further processing. With InputSplit, if you have limited resources, you can increase the split size
to limit the number of maps. For instance, if there are 10 blocks totalling 640 MB (64 MB each)
and there are limited resources, you can set the 'split size' to 128 MB. This will form a logical
group of 128 MB, with only 5 maps executing at a time.
However, if the file is not splittable (i.e. the input format's isSplitable() check returns false), the
whole file will form one InputSplit and be processed by a single map, consuming more time
when the file is big.
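As a hedged illustration of controlling the split size from a driver, the sketch below uses the FileInputFormat helper methods of the mapreduce API; the 128 MB figure matches the example above and the job name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static Job createJob() throws Exception {
    Job job = Job.getInstance(new Configuration(), "larger splits, fewer maps");
    long splitSize = 128L * 1024 * 1024;                  // 128 MB
    FileInputFormat.setMinInputSplitSize(job, splitSize); // lower bound on split size
    FileInputFormat.setMaxInputSplitSize(job, splitSize); // upper bound on split size
    return job;
  }
}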
6. What is distributed cache and what are its benefits?
Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files
when needed. Once a file is cached for a specific job,
Hadoop will make it available on each data node, both on disk and in memory, where map and
reduce tasks are executing. Later, you can easily access and read the cache file and populate any
collection (like an array or hashmap) in your code.
Benefits of using distributed cache are:
It distributes simple, read only text/data files and/or complex types like jars, archives
and others. These archives are then un-archived at the slave node.
Distributed cache tracks the modification timestamps of cache files, which ensures that
the cached files are not modified while a job is executing.
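A minimal sketch of this usage with the Job/context API is shown below; the lookup file path and the "#lookup" symlink name are assumptions made for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
      // The cached file is symlinked into the task's working directory as "lookup".
      try (BufferedReader r = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = r.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) lookup.put(parts[0], parts[1]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Enrich each record with a value from the cached lookup table.
      String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
      context.write(value, new Text(enriched));
    }
  }

  public static void addCache(Job job) throws Exception {
    // "#lookup" creates a symlink named "lookup" on every node running the task.
    job.addCacheFile(new URI("/user/hadoop/lookup.txt#lookup"));
  }
}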
7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
NameNode is the core of HDFS that manages the metadata – the information of what
file maps to what block locations and what blocks are stored on what datanode. In
simple terms, it’s the data about the data being stored. NameNode supports a directory
tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It
uses following files for namespace:
fsimage file- It keeps track of the latest checkpoint of the namespace.
edits file-It is a log of changes that have been made to the namespace since checkpoint.
Checkpoint NameNode has the same directory structure as the NameNode, and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits
files and merging them within its local directory. The new image after merging is
then uploaded to the NameNode.
There is a similar node, commonly known as the Secondary NameNode, but it
does not support the 'upload to NameNode' functionality.
Backup Node provides similar functionality as the Checkpoint node, enforcing synchronization
with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace
and does not need to download the changes at regular intervals. The backup node only needs to
save its current in-memory state to an image file to create a new checkpoint.
8. What are the most common Input Formats in Hadoop?
There are three most common input formats in Hadoop:
Text Input Format: Default input format in Hadoop.
Key Value Input Format: used for plain text files where the files are broken into lines
Sequence File Input Format: used for reading files in sequence
9. Define DataNode and how does NameNode tackle DataNode failures?
DataNode stores data in HDFS; it is the node where the actual data resides in the file system.
Each DataNode sends a heartbeat message to notify that it is alive. If the NameNode does not
receive a message from a DataNode for 10 minutes, it considers it to be dead or out of service,
and starts replication of the blocks that were hosted on that DataNode so that they are hosted on
some other DataNode. A BlockReport contains the list of all blocks on a DataNode. The system
then starts to replicate the blocks that were stored on the dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. In this
process, the replicated data transfers directly between DataNodes such that the data never
passes through the NameNode.
10. What are the core methods of a Reducer?
The three core methods of a Reducer are:
1. setup(): this method is used for configuring various parameters like input data size and
distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(Key key, Iterable<Value> values, Context context)
3. cleanup(): this method is called to clean up temporary files, only once at the end of the
task.
protected void cleanup(Context context)
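Putting the three methods together, here is an illustrative Reducer that sums integer counts per key; the bookkeeping in setup() and cleanup() is only an example, not a required pattern.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private long keysSeen;                         // simple per-task counter

  @Override
  protected void setup(Context context) {
    keysSeen = 0;                                // one-time initialization per reduce task
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get(); // called once per key
    context.write(key, new IntWritable(sum));
    keysSeen++;
  }

  @Override
  protected void cleanup(Context context) {
    System.out.println("Distinct keys reduced: " + keysSeen);  // one-time teardown
  }
}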
11. What is SequenceFile in Hadoop?
Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary
key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader,
Writer and Sorter classes. The three SequenceFile formats are:
1. Uncompressed key/value records.
2. Record compressed key/value records – only ‘values’ are compressed here.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’
separately and compressed. The size of the ‘block’ is configurable.
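A small sketch of writing a block-compressed SequenceFile with the Writer class follows; the output path and record contents are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/example.seq");   // placeholder output path
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            // block compression: keys and values are gathered in blocks and compressed
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      for (int i = 0; i < 5; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    }
  }
}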
12. What is Job Tracker role in Hadoop?
Job Tracker's primary functions are resource management (managing the Task Trackers),
tracking resource availability, and task life cycle management (tracking task progress and fault
tolerance).
It is a process that runs on a separate node, often not on a DataNode.
The Job Tracker communicates with the NameNode to identify data locations.
It finds the best Task Tracker nodes to execute tasks on given nodes.
It monitors individual Task Trackers and submits the overall job back to the client.
It tracks the execution of MapReduce workloads.
13. What is the use of RecordReader in Hadoop?
Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a
single record. For instance, if our input data is split like:
Row1: Welcome to
Row2: Intellipaat
It will be read as “Welcome to Intellipaat” using RecordReader.
14. What is Speculative Execution in Hadoop?
One limitation of Hadoop is that, by distributing the tasks across several nodes, a few slow nodes
can limit the rest of the program. There are various reasons for tasks to be slow, which are
sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop
tries to detect when a task runs slower than expected and then launches an equivalent task as a
backup. This backup mechanism in Hadoop is Speculative Execution.
It creates a duplicate task on another node, so the same input can be processed multiple times in
parallel. When most tasks in a job come to completion, the speculative execution mechanism
schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently
free. When these tasks finish, the JobTracker is informed. If the other copies were executing
speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.
Speculative execution is by default true in Hadoop. To disable, set
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
JobConf options to false.
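In Hadoop 2.x the corresponding properties are mapreduce.map.speculative and mapreduce.reduce.speculative; a hedged driver fragment setting them (job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculation {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);     // disable for map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false);  // disable for reduce tasks
    return Job.getInstance(conf, "job without speculative execution");
  }
}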
15. What happens if you try to run a Hadoop job with an output directory that is already
present?
It will throw an exception saying that the output directory already exists.
To run a MapReduce job, you need to ensure that the output directory does not already exist
in HDFS.
To delete the directory before running the job, you can use the shell: hadoop fs -rmr
/path/to/your/output/ ; or the Java API: FileSystem.get(conf).delete(outputDir, true);
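A small Java sketch of checking for and deleting the output directory before submitting a job (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
  public static void deleteIfExists(Configuration conf, String dir) throws Exception {
    Path outputDir = new Path(dir);
    FileSystem fs = FileSystem.get(conf);   // file system named in the configuration
    if (fs.exists(outputDir)) {
      fs.delete(outputDir, true);           // 'true' deletes recursively
    }
  }

  public static void main(String[] args) throws Exception {
    deleteIfExists(new Configuration(), "/path/to/your/output");
  }
}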
16. How can you debug Hadoop code?
First, check the list of MapReduce jobs currently running. Next, check that there are no
orphaned jobs running; if there are, you need to determine the location of the RM logs.
1. Run: ps -ef | grep -i ResourceManager
and look for the log directory in the displayed result. Find the job-id from the displayed
list and check if there is any error message associated with that job.
2. On the basis of the RM logs, identify the worker node that was involved in execution of
the task.
3. Now, log in to that node and run: ps -ef | grep -i NodeManager
4. Examine the Node Manager log. The majority of errors come from the user-level logs for
each map-reduce job.
17. How to configure Replication Factor in HDFS?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in
hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Conversely, you can also change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir
18. How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
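In a driver, these settings would typically be applied to the Configuration before the Job is created; a minimal sketch (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCompression {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);               // compress mapper output
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", false); // leave job output uncompressed
    return Job.getInstance(conf, "compress map output only");
  }
}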
19. What is the difference between Map Side join and Reduce Side Join?
A map-side join is performed when the data reaches the map; it requires a strict structure for
defining the join. On the other hand, a reduce-side join (repartitioned join) is simpler than a
map-side join since the input datasets need not be structured. However, it is less efficient as it
has to go through the sort and shuffle phases, which come with network overheads.
20. How can you transfer data from Hive to HDFS?
By writing the query:
hive> insert overwrite directory '/' select * from emp;
You can write your query for the data you want to export from Hive to HDFS. The output you
receive will be stored in part files in the specified HDFS path.
21. What companies use Hadoop, any idea?
Yahoo! (one of the biggest contributors to the creation of Hadoop) uses Hadoop in its search engine;
Facebook developed Hive for analysis on top of it; other well-known users include Amazon, Netflix,
Adobe, eBay, Spotify, and Twitter.
12. What are the common input formats in Hadoop?
Answer: Below are the common input formats in Hadoop –
Text Input Format – The default input format defined in Hadoop is the Text Input
Format.
Sequence File Input Format – To read files in a sequence, Sequence File Input
Format is used.
Key Value Input Format – The Key Value Input Format is used for plain text files in which each
line is split into a key and a value by a separator (a tab by default).
13. Explain some important features of Hadoop.
Answer: Hadoop supports the storage and processing of big data. It is the best solution for
handling big data challenges. Some important features of Hadoop are –
Open Source – Hadoop is an open source framework which means it is available free
of cost. Also, the users are allowed to change the source code as per their
requirements.
Distributed Processing – Hadoop supports distributed processing of data i.e. faster
processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block at different nodes, by default. This number can be changed according to the
requirement. So, we can recover the data from another node if one node fails. The
detection of node failure and recovery of data is done automatically.
Reliability – Hadoop stores data on the cluster in a reliable manner that is independent
of machine. So, the data stored in Hadoop environment is not affected by the failure
of the machine.
Scalability – Another important feature of Hadoop is scalability. It is compatible with
commodity hardware, and we can easily add new hardware (nodes) to the cluster.
High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from another
path.
14. Explain the different modes in which Hadoop runs.
Answer: Apache Hadoop runs in the following three modes –
Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-
distributed, single node. This mode uses the local file system to perform input and
output operation. This mode does not support the use of HDFS, so it is used for
debugging. No custom configuration is needed for configuration files in this mode.
Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single
node just like the Standalone mode. In this mode, each daemon runs in a separate
Java process. As all the daemons run on a single node, the same node acts as both
master and slave.
Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on
separate individual nodes and thus form a multi-node cluster. Master and slave
daemons run on different nodes.
15. Explain the core components of Hadoop.
Answer: Hadoop is an open source framework that is meant for storage and processing of big
data in a distributed manner. The core components of Hadoop are –
HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of
Hadoop. The large data files running on a cluster of commodity hardware are stored
in HDFS. It can store data in a reliable manner even when hardware fails.
Core Components of Hadoop
Hadoop MapReduce – MapReduce is the Hadoop layer that is responsible for data
processing. Applications written against MapReduce process the unstructured and structured
data stored in HDFS. MapReduce is responsible for the parallel processing of high volumes of
data by dividing the work into independent tasks. The processing is done in two phases, Map and
Reduce: Map is the first phase, where the bulk of the processing logic is specified, and Reduce is
the second phase, which performs lighter-weight aggregation and summarization.
YARN – YARN is the processing framework in Hadoop. It handles resource
management and allows multiple data processing engines, such as real-time streaming and
batch processing, to run on the same cluster.
16. What are the different configuration files in Hadoop?
Answer: The different configuration files in Hadoop are –
core-site.xml – This configuration file contains the Hadoop core configuration settings, for
example, I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.
mapred-site.xml – This configuration file specifies a framework name for MapReduce by
setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also
specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager
and NodeManager.
17. What are the differences between Hadoop 2 and Hadoop 3?
Answer: The main differences between Hadoop 2 and Hadoop 3 are: Hadoop 2 requires Java 7 or
later while Hadoop 3 requires Java 8; Hadoop 2 handles fault tolerance only through replication
(high storage overhead) while Hadoop 3 also supports erasure coding; and Hadoop 2 allows only a
single standby NameNode while Hadoop 3 supports multiple standby NameNodes.
18. How can you achieve security in Hadoop?
Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access a service
while using Kerberos. Each step involves a message exchange with a server.
1. Authentication – The first step involves authentication of the client by the
authentication server, which then provides a time-stamped TGT (Ticket-Granting
Ticket) to the client.
2. Authorization – In this step, the client uses the received TGT to request a service ticket
from the TGS (Ticket-Granting Server).
3. Service Request – This is the final step: the client uses the service ticket to
authenticate itself to the server.
19. What is commodity hardware?
Answer: Commodity hardware is low-cost, readily available hardware that is not of particularly
high quality or availability. Commodity hardware still needs adequate RAM, because the daemons
running on it perform a number of services that require RAM for their execution. One doesn't
require high-end hardware or supercomputers to run Hadoop; it can be run on any commodity hardware.
20. How is NFS different from HDFS?
Answer: There are a number of distributed file systems that work in their own way. NFS
(Network File System) is one of the oldest and most popular distributed file storage systems, whereas
HDFS (Hadoop Distributed File System) is the more recent one designed to handle big data.
The main differences are that NFS stores data on a single dedicated machine, with no built-in
replication or fault tolerance, and is suited to relatively small data sets, whereas HDFS distributes
data in blocks across a cluster of commodity machines, replicates each block, and is designed for
very large data sets.
Hadoop Developer Interview Questions for Experienced
The interviewer has more expectations from an experienced Hadoop developer, and thus the
questions are one level up. So, if you have gained some experience, don't forget to cover
command-based, scenario-based and real-experience-based questions. Here we bring some sample
interview questions for experienced Hadoop developers.
21. How to restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop
directory contains an sbin directory that stores the script files to stop and start daemons in
Hadoop.
Use the command /sbin/stop-all.sh to stop all the daemons and then use the /sbin/start-all.sh
command to start all the daemons again.
22. What is the use of jps command in Hadoop?
Answer: The jps command is used to check if the Hadoop daemons are running properly or
not. This command shows all the daemons running on a machine i.e. Datanode, Namenode,
NodeManager, ResourceManager etc.
23. Explain the process that overwrites the replication factors in HDFS.
Answer: There are two methods to overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The
command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, test_file is the file whose replication factor will be set to 2.
Method 2: On Directory Basis
In this method, the replication factor is changed on directory basis i.e. the replication factor for
all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the name of the directory; the replication factor for the directory and all the
files in it will be set to 5.
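If the same change needs to be made programmatically, the HDFS Java API exposes FileSystem.setReplication; a minimal sketch follows (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Change the replication factor of a single (hypothetical) file to 2.
        fs.setReplication(new Path("/my/test_file"), (short) 2);
    }
}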
24. What will happen with a NameNode that doesn’t have any data?
Answer: A NameNode without any data does not exist in Hadoop. If a NameNode exists, it will
contain metadata about the files stored in HDFS; the actual data blocks reside on the DataNodes.
25. Explain NameNode recovery process.
Answer: The NameNode recovery process involves the following steps to keep the
Hadoop cluster running:
In the first step, a new NameNode is started using the file system metadata replica (FsImage).
The next step is to configure the DataNodes and clients so that they acknowledge the new
NameNode.
In the final step, the new NameNode starts serving clients once it has finished loading the last
checkpoint FsImage and has received block reports from the DataNodes.
UNIT – III
MapReduceProgramming and Yarn
MapReduce is mainly the data processing component of Hadoop. It is a programming model for
processing large data sets. It breaks a data processing job into smaller tasks and distributes those
tasks across the nodes. It consists of two phases:
Map
Reduce
Map converts an input dataset into another set of data in which individual elements are broken down
into key/value pairs.
The Reduce task takes the output of the map as its input and combines those data tuples into a
smaller set of tuples. It is always executed after the map phase is done.
Features of Mapreduce system
Features of MapReduce are as follows:
A framework is provided for MapReduce execution.
It abstracts the developer from the complexity of distributed programming.
Partial failure of the processing cluster is expected and tolerated.
In-built redundancy and fault tolerance are available.
The MapReduce programming model is language independent.
Automatic parallelization and distribution are handled by the framework.
Fault tolerance.
Enables data-local processing.
Shared-nothing architectural model.
Manages all inter-process communication.
Manages the distributed servers running the various tasks in parallel.
Manages all communications and data transfers between the various parts of the system.
Provides redundancy and failure handling for the whole process.
MapReduce follows these simple steps:
1. Executes the map function on each input record received.
2. The map function emits key/value pairs.
3. The outputs are shuffled, sorted and grouped by key.
4. Executes the reduce function on each group.
5. Emits the output results on a per-group basis.
Map Function
Mainly operates on each key/value pair of data and then transforms the data based on the
transformation logic provided in the map function. Map function always produces a key/value
pair as output result.
Map (key1, value1) ->List (key2, value2)
Reduce Function
It takes the list of values for each key and transforms the data based on the (aggregation) logic
provided in the reduce function.
Reduce (key2, List (value2)) ->List (key3, value3)
Map Function for Word Count
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); }
}
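For completeness, a matching reduce function for the word count, written against the same org.apache.hadoop.mapreduce API, would look roughly like the sketch below (not part of the original notes):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // emits (word, total count)
    }
}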
MapReduce is the framework used for processing large amounts of data on commodity
hardware across a cluster. It is a powerful way of processing data when there is a large
number of nodes connected in the cluster. The two important tasks of the MapReduce
model are Map and Reduce.
The purpose of the Map task is to take a large set of data and convert it into another set of
data, broken down into tuples (rows) or key/value pairs. The Reduce task then takes the output of
the Map task as its input and combines those tuples into a much smaller set of tuples. The Reduce
task always follows the Map task.
The biggest strength of the MapReduce framework is its scalability. Once a MapReduce
program is written, it can easily be scaled to run over a cluster with hundreds
or even thousands of nodes. In this framework, the computation is sent to where
the data resides.
Hadoop Map Reduce – Key Features & Highlights
Terminology
PayLoad – The applications that implement the Map and Reduce functions.
Mapper – Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode – The node that manages HDFS.
DataNode – The node where the data resides before any processing takes place.
MasterNode – The node where the JobTracker runs and which receives job requests from
clients.
SlaveNode – The node where the Map and Reduce programs run.
JobTracker – Schedules jobs, assigns them to TaskTrackers and tracks their progress.
TaskTracker – Runs the tasks and reports their status to the JobTracker.
Job – An execution of a Mapper and Reducer across a dataset.
Task – An execution of a Mapper or a Reducer on a slice of data.
Task Attempt – A particular attempt to execute a task on a SlaveNode.
Hadoop YARN Technology
YARN stands for Yet Another Resource Negotiator. It is an open-source cluster management
technology for distributed processing frameworks. The main objective of YARN is to provide a
framework on Hadoop that allows cluster resources to be allocated to arbitrary applications, with
MapReduce treated as just one of these applications.
It separates the distinct responsibilities of the JobTracker into separate entities. The JobTracker used
to handle both job scheduling (matching tasks with TaskTrackers) and task progress monitoring
(taking care of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining
counter totals).
YARN divides these two roles into two independent daemons: a resource manager, which manages
the use of resources across the cluster, and an application master, which manages the lifecycle of an
application running on the cluster.
The application master negotiates with the resource manager for cluster resources, expressed in
terms of a number of containers, each with a certain memory limit, and then runs
application-specific processes in those containers.
The containers are overseen by node managers running on the cluster nodes, which ensure that an
application does not use more resources than it has been allocated.
YARN is a very efficient technology for managing a Hadoop cluster. It is part of Hadoop 2
under the aegis of the Apache Software Foundation.
YARN introduced a completely new way of processing data and is now rightly at the center of the
Hadoop architecture. Using this technology it is possible to stream data in real time, use
interactive SQL, process data with multiple engines, and manage data using batch processing,
all on a single platform.
Map Reduce on YARN
MapReduce on YARN includes more entities than classic MapReduce. They are:
Client – Submits the MapReduce job.
YARN resource manager – Manages the allocation of compute resources on the cluster.
YARN node managers – Launch and monitor the compute containers on machines in the
cluster.
MapReduce application master – Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.
Distributed file system (normally HDFS) – Used for sharing job files between the other
entities.
How the YARN technology works?
YARN lets Hadoop provide enterprise-level solutions, helping organizations
achieve better resource management. It is the main platform for delivering consistent operations,
a high level of security and data governance across the complete spectrum of the Hadoop
cluster.
Various technologies that reside within the data center can also benefit from YARN. It makes it
possible to process data and scale storage linearly in a very cost-effective way. Using YARN helps
build applications that can access data and run in a Hadoop ecosystem on a consistent framework.
Some of the features of YARN
High degree of compatibility: applications created using the MapReduce framework can run on
YARN without modification.
Better cluster utilization: YARN allocates cluster resources in an efficient and dynamic manner,
which leads to much better utilization than in previous versions of Hadoop.
Utmost scalability: as the number of nodes in the Hadoop cluster expands, the YARN
ResourceManager ensures that user requirements are met and that the processing power of the
data center does not become a bottleneck.
Multi-tenancy: various engines that access data on the Hadoop cluster can work efficiently side
by side, thanks to YARN being a highly versatile technology.
Key components of YARN
YARN came into existence because there was an urgent need to separate the two distinct kinds of
work that go on in a Hadoop cluster, previously handled together by the JobTracker and
TaskTracker entities. The key components of the YARN technology are:
Global Resource Manager
Application Master (one per application)
Node Manager (one per slave node)
Container (per application, running on a Node Manager)
The Node Manager and the Resource Manager are the foundation on which the new
distributed applications work. Resources are allocated to the running applications by the
Resource Manager. The Application Master works along with the Node Manager, within its
specific framework, to obtain resources from the Resource Manager and to manage the various
task components.
A scheduler works within the Resource Manager (RM) to allocate resources correctly while
ensuring that constraints such as user limits and queue capacities are respected at all times. The
scheduler provides the right resources according to the requirements of each application.
The Application Master works in coordination with the scheduler in order to obtain the
right resource containers, keeps an eye on their status, and tracks the progress of the
process.
The Node Manager manages the application containers: it launches them when required,
tracks their use of resources such as memory, processor, network and disk, and gives
detailed reports to the Resource Manager.
1. Compare MapReduce and Spark
Ease of use: MapReduce needs extensive Java programs, whereas Spark provides APIs for Python, Java and Scala.
Versatility: MapReduce is not optimized for real-time and machine learning applications, whereas Spark handles such workloads well.
2. What is MapReduce?
Referred to as the core of Hadoop, MapReduce is a programming framework to process large sets
of data, or big data, across thousands of servers in a Hadoop cluster. The concept of MapReduce
is similar to other cluster scale-out data processing systems. The term MapReduce refers to the two
important processes a Hadoop program performs.
First is the map() job, which converts one set of data into another, breaking down individual
elements into key/value pairs (tuples). Then the reduce() job comes into play, wherein the output
from the map, i.e. the tuples, serves as the input and is combined into a smaller set of tuples. As
the name suggests, the map job always occurs before the reduce job.
3. Illustrate a simple example of the working of MapReduce.
Let's take a simple example to understand the functioning of MapReduce. In real-world projects
and applications this will be far more elaborate and complex, as the data we deal with in Hadoop
and MapReduce is extensive and massive.
Assume you have five files, and each file consists of two columns, i.e. key/value pairs: a city name
and a temperature recorded in that city. Here, the name of the city is the key and the
temperature is the value.
San Francisco, 22
Los Angeles, 15
Vancouver, 30
London, 25
Los Angeles, 16
Vancouver, 28
London,12
It is important to note that each file may contain data for the same city multiple times. Now,
out of this data, we need to calculate the maximum temperature for each city across the five
files. As explained, the MapReduce framework will divide the work into five map tasks; each map
task processes one of the five files and returns the maximum temperature for each city in that file,
for example:
(San Francisco, 22)(Los Angeles, 16)(Vancouver, 30)(London, 25)
Similarly, the other mappers process the remaining four files and produce intermediate results, for
instance:
(San Francisco, 32)(Los Angeles, 2)(Vancouver, 8)(London, 27)
(San Francisco, 29)(Los Angeles, 19)(Vancouver, 28)(London, 12)
(San Francisco, 18)(Los Angeles, 24)(Vancouver, 36)(London, 10)
(San Francisco, 30)(Los Angeles, 11)(Vancouver, 12)(London, 5)
These intermediate results are then passed to the reduce job, where the values from all files are
combined to output a single value per city. The final result here would be:
(San Francisco, 32)(Los Angeles, 24)(Vancouver, 36)(London, 27)
These calculations are performed in parallel and are extremely efficient for computing outputs over
a large dataset.
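As an illustration only (the notes do not include code for this example), a minimal Hadoop MapReduce implementation of the per-city maximum temperature computation might look like the sketch below; class names and input format assumptions ("City, temperature" lines) are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Parses lines of the form "City, temperature" and emits (city, temperature).
    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim()),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    // Keeps the maximum temperature seen for each city.
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per city");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setCombinerClass(MaxTempReducer.class);  // max is associative, so the reducer doubles as a combiner
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the driver also registers the reducer as a combiner, which computes per-mapper maxima locally and reduces the amount of data shuffled across the network.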
4. What are the main components of MapReduce Job?
Main Driver Class: provides the job configuration parameters.
Mapper Class: must extend the org.apache.hadoop.mapreduce.Mapper class and implements the
map() method.
Reducer Class: must extend the org.apache.hadoop.mapreduce.Reducer class and implements the reduce() method.
5. What is Shuffling and Sorting in MapReduce?
Shuffling and Sorting are two major processes operating simultaneously during the working of
mapper and reducer.
The process of transferring data from the mappers to the reducers is shuffling. It is a mandatory
operation for reducers to proceed with their jobs, as the shuffled data serves as the input for
the reduce tasks.
In MapReduce, the output key/value pairs between the map and reduce phases (after the
mapper) are automatically sorted by key before moving to the reducer. This feature is helpful in
programs where you need sorting at some stage. It also saves the programmer's overall time.
6. What is Partitioner and its usage?
The Partitioner is yet another important phase; it controls the partitioning of the intermediate
map output keys, by default using a hash function. The partitioning determines to which reducer a
key/value pair (of the map output) is sent. The number of partitions is equal to the total number of
reduce tasks for the job.
HashPartitioner is the default partitioner class in Hadoop, and it implements the following function:
int getPartition(K key, V value, int numReduceTasks)
The function returns the partition number for a given key; numReduceTasks is the (fixed) number
of reducers.
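As an illustration, a custom Partitioner can replace the default hash-based one; the sketch below (class name and partitioning rule are hypothetical, not from the notes) spreads keys over reducers by the hash of their first character:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: routes each key by its first character.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;  // nothing to partition when there are no reducers
        }
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : k.charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// In the driver: job.setPartitionerClass(FirstCharPartitioner.class);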
7. What is Identity Mapper and Chain Mapper?
Identity Mapper is the default Mapper class provided by Hadoop. When no other Mapper class
is defined, IdentityMapper is executed. It only writes the input data to the output and does not
perform any computations or calculations on the input data.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Chain Mapper allows a chain of simple Mapper classes to be run within a single map task. In this
chain, the output of the first mapper becomes the input of the second mapper, the second
mapper's output becomes the input of the third mapper, and so on until the last mapper.
The class name is org.apache.hadoop.mapreduce.lib.ChainMapper.
8. What main configuration parameters are specified in MapReduce?
The MapReduce programmer needs to specify the following configuration parameters to perform
the map and reduce jobs:
The input location of the job in HDFS.
The output location of the job in HDFS.
The input and output formats.
The classes containing the map and reduce functions, respectively.
The .jar file containing the mapper, reducer and driver classes.
9. Name Job control options specified by MapReduce.
Since this framework supports chained operations, wherein the output of one map job serves as
the input for another, there is a need for job controls to govern these complex operations.
The various job control options are:
Job.submit(): submits the job to the cluster and returns immediately.
Job.waitForCompletion(boolean): submits the job to the cluster and waits for its completion.
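A rough sketch of how these two calls are typically combined to chain dependent jobs (job names are placeholders and the real mapper/reducer/path setup is omitted, so this is illustrative rather than a complete driver):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job cleanse = Job.getInstance(conf, "cleanse-input");   // mapper/reducer/paths omitted
        Job aggregate = Job.getInstance(conf, "aggregate");     // consumes the first job's output

        if (cleanse.waitForCompletion(true)) {   // block until the first job finishes
            aggregate.submit();                  // submit the second job and return immediately
            // The driver could poll aggregate.isComplete() or call waitForCompletion as needed.
        }
    }
}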
10. What is InputFormat in Hadoop?
Another important feature in MapReduce programming, InputFormat defines the input
specifications for a job. It performs the following functions:
Validates the input specification of the job.
Splits the input file(s) into logical instances called InputSplits. Each split is then
assigned to an individual Mapper.
Provides the RecordReader implementation used to extract input records from the splits
for processing by the Mapper.
11. What is the difference between HDFS block and InputSplit?
An HDFS block is a physical division of the data, while an InputSplit in MapReduce is a logical
division of the input files.
The InputSplit is used to control the number of mappers, and the split size is user defined. By
contrast, the HDFS block size is fixed (64 MB by default in older releases, 128 MB in Hadoop 2.x
and later); for 1 GB of data with a 64 MB block size there will be 1 GB / 64 MB = 16 blocks.
However, if the input split size is not defined by the user, it takes the HDFS block size by default.
12. What is Text Input Format?
TextInputFormat is the default InputFormat for plain text files in a job (it also handles compressed
input files, such as those with a .gz extension, although such files are not splittable). In
TextInputFormat, files are broken into lines; the key is the byte offset of the line within the file and
the value is the line of text. Programmers can also write their own InputFormat.
The hierarchy is:
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
13. What is JobTracker?
JobTracker is a Hadoop service used for the processing of MapReduce jobs in the cluster. It
submits jobs to, and tracks them on, specific nodes having the data. Only one JobTracker runs on a
Hadoop cluster, in its own JVM process. If the JobTracker goes down, all running jobs halt.
14. Explain job scheduling through JobTracker.
The JobTracker communicates with the NameNode to identify data locations and submits the work to a
TaskTracker node. The TaskTracker plays a major role: it notifies the JobTracker of any task
failure and sends heartbeat messages reassuring the JobTracker that it is still alive. The
JobTracker is then responsible for the next actions: it may resubmit the job elsewhere, mark a
specific record as unreliable, or blacklist the TaskTracker.
15. What is SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files (compressed binary key/value
files); it extends FileInputFormat. Sequence files are typically used to pass data between the output
of one MapReduce job and the input of another.
16. How to set mappers and reducers for Hadoop jobs?
Users can configure the JobConf variable to set the number of mappers and reducers:
job.setNumMapTasks(int)
job.setNumReduceTasks(int)
17. Explain JobConf in MapReduce.
It is the primary interface to define a MapReduce job in Hadoop for job execution. JobConf
specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat
implementations, and other advanced job facets like Comparators.
18. What is a MapReduce Combiner?
Also known as a semi-reducer, a Combiner is an optional class that combines the map output
records sharing the same key. The main function of a combiner is to accept the outputs of the Map
class and pass reduced key/value pairs on to the Reducer class, cutting down the data transferred
over the network.
19. What is RecordReader in a Map Reduce?
RecordReader is used to read key/value pairs from an InputSplit, converting the byte-oriented
view of the input into a record-oriented view for the Mapper.
20. Define Writable data types in MapReduce.
Hadoop reads and writes data in a serialized form defined by the Writable interface. The Writable
interface has several implementation classes such as Text (for string data), IntWritable,
LongWritable, FloatWritable and BooleanWritable. Users are free to define their own Writable
classes as well.
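A user-defined Writable is just a class that implements the Writable interface; a minimal sketch (the class and field names are hypothetical) is shown below. To be usable as a key it would additionally need to implement WritableComparable.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal custom Writable holding a city name and a temperature reading.
public class CityTempWritable implements Writable {
    private String city = "";
    private int temperature;

    public void set(String city, int temperature) {
        this.city = city;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(city);          // serialize fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        city = in.readUTF();         // deserialize in exactly the same order
        temperature = in.readInt();
    }
}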
21. What is OutputCommitter?
OutputCommitter describes the commit of MapReduce task output. FileOutputCommitter is the
default OutputCommitter class available in MapReduce. It performs the following
operations:
Creates a temporary output directory for the job during initialization.
Cleans up the job, i.e. removes the temporary output directory after job completion.
Sets up the task's temporary output.
Identifies whether a task needs a commit and applies the commit if required.
JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.
22. What is a “map” in Hadoop?
In Hadoop, a map is a phase of a MapReduce job. A map reads data from an input location
and outputs key/value pairs according to the input type.
23. What is a “reducer” in Hadoop?
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a
final output of its own.
24. What are the parameters of mappers and reducers?
The four parameters for mappers are:
LongWritable (input key)
Text (input value)
Text (intermediate output key)
IntWritable (intermediate output value)
The four parameters for reducers are:
Text (intermediate output key)
IntWritable (intermediate output value)
Text (final output key)
IntWritable (final output value)
25. What are the key differences between Pig vs MapReduce?
Pig is a data flow language; the key focus of Pig is managing the flow of data from the input source
to the output store. As part of managing this data flow it moves data along, feeding it to
process 1, taking the output and feeding it to process 2, and so on. Its core features are preventing
execution of subsequent stages if a previous stage fails, managing temporary storage of data and,
most importantly, compressing and rearranging processing steps for faster processing. While this
could be done for any kind of processing task, Pig is written specifically for managing the data flow
of MapReduce-type jobs; most, if not all, jobs in Pig are MapReduce jobs or data-movement jobs.
Pig also allows custom functions to be added for processing; some default ones are ordering,
grouping, distinct, count, etc.
MapReduce, on the other hand, is a data processing paradigm: it is a framework for application
developers to write code in so that it is easily scaled to petabytes of data, and it creates a separation
between the developer that writes the application and the developer that scales it.
Not all applications can be migrated to MapReduce, but a good few can be, ranging from complex
ones like k-means to simple ones like counting uniques in a dataset.
26. What is partitioning?
Partitioning is the process of identifying the reducer instance that will receive a given piece of the
mapper output. Before the mapper emits a (key, value) pair to the reducers, the partitioner
identifies the reducer that will act as the recipient. All values for the same key, no matter which
mapper generated them, must go to the same reducer.
27. How to set which framework would be used to run mapreduce program?
Set the mapreduce.framework.name property. It can be:
1. local
2. classic
3. yarn
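The property is normally set in mapred-site.xml, but as a small illustrative sketch it can also be set programmatically in the driver (class name is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnFrameworkDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting the property in mapred-site.xml; "local" and "classic" are the other options.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "runs-on-yarn");
        // ... remaining job setup as usual.
    }
}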
28. What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version is good for Hadoop, preferably from Sun/Oracle. Linux and Windows
are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known
to work.
29. Can MapReduce program be written in any language other than Java?
Yes, MapReduce can be written in many programming languages: Java, R, C++ and scripting
languages (Python, PHP). Any language that can read from stdin, write to stdout, and parse
tab and newline characters will work. Hadoop Streaming (a Hadoop utility) allows you to
create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
UNIT – IV
Apache Pig
Pig raises the level of abstraction for processing large datasets. It is a platform for analyzing large
data sets that consists of a high-level language for expressing data analysis programs. It is an
open-source platform originally developed by Yahoo!.
Advantages of Pig
Reusing the code
Faster development
Fewer lines of code
Schema and type checking, etc.
Pig is made up of two pieces:
First is the language used to express data flows, known as Pig Latin.
Second is the execution environment used to run Pig Latin programs. There are presently two
environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is a series of operations, or transformations, which are applied to the input
data to produce output. These operations describe a data flow that the Pig execution environment
translates into an executable representation and then runs.
What makes Pig Hadoop popular?
It is easy to learn, read, write and implement if you know SQL.
It implements a multi-query approach.
It provides a large number of nested data types such as maps, tuples and bags, which are not
readily available in MapReduce, along with data operations like filters, ordering and joins.
It is used by many different user groups: for instance, up to 90% of Yahoo!'s MapReduce is done
by Pig and up to 80% of Twitter's MapReduce is also done by Pig, and various other companies
such as Salesforce, LinkedIn and Nokia use Pig extensively.
Apache Pig is a platform for managing large sets of data which provides high-level
programming constructs to analyze the data as required. Pig mainly provides the
infrastructure to evaluate these programs. The advantage of Pig programming
is that it can easily handle parallel processing of very large amounts of
data. The programming on this platform is done using the textual language Pig Latin.
Pig Latin comes with the following features:
Simple programming: it is easy to code, execute and manage the program.
Better optimization: the system can automatically optimize the execution as per the requirement.
Extensibility: it can be extended to achieve highly specific processing tasks.
Pig can be used for following purposes:
ETL data pipeline
Research on raw data
Iterative processing.
The scalar data types in Pig are int, long, float, double, chararray and bytearray.
The complex data types in Pig are the map, tuple and bag.
Map: a set of key/value pairs, where the key is of type chararray and the value can be any Pig data
type, including complex types.
Example: ['city'#'bang', 'pin'#560001]
Here city and pin are the keys mapped to the values 'bang' and 560001.
Tuple: an ordered collection of fields of any data type; it has a defined, fixed length.
Bag: a collection of tuples; it is an unordered collection, and the tuples in the bag are
separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
LOAD function: the LOAD function loads data from the file system. It is known as a
relational operator. In the first step of a data-flow program you must specify the input,
which is done using the keyword 'load'.
The LOAD syntax is:
LOAD 'mydata' [USING function] [AS schema];
Example: A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
The relational operators in Pig are as follows:
foreach, order by, filter, group, distinct, join, limit.
foreach: takes a set of expressions and applies them to every record in the data
pipeline, passing the result on to the next operator.
A = LOAD 'input' AS (emp_name: chararray, emp_id: long, emp_add: chararray, phone:
chararray, preferences: map[]);
B = FOREACH A GENERATE emp_name, emp_id;
filter: contains a predicate and lets us select which records will be retained in the data
pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias indicates the name of the relation, BY is a required keyword, and the
expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
•Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the
commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e
option to run a script specified as a string on the command line.
•Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run and the -e option is not used. It is also possible to run
Pig scripts from within Grunt using run and exec.
•Embedded
You can run Pig programs from Java using the PigServer class, much as you can use JDBC to run
SQL programs from Java.
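To make the embedded mode concrete, the sketch below (file names are hypothetical) uses Pig's Java API, PigServer, to register and run Pig Latin statements from a Java program:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java; ExecType.MAPREDUCE would target a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("Lines = LOAD 'input/hadoop.log' AS (line:chararray);");
        pig.registerQuery("Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("Groups = GROUP Words BY word;");
        pig.registerQuery("Counts = FOREACH Groups GENERATE group, COUNT(Words);");
        pig.store("Counts", "output/wordcounts");   // writes the result using the default PigStorage
    }
}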
Example: Word count in Pig
Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words);
Results = ORDER Counts BY $1 DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Apache Hive
Pig and Hive are open-source platforms used for much the same purpose. These tools ease the
complexity of writing complex Java-based MapReduce programs. Hive is like a
data warehouse that uses MapReduce to analyze data stored on HDFS. It
provides a query language called HiveQL that is similar to the Structured Query Language
(SQL) standard. Hive was created at Facebook so that analysts with strong SQL skills but few
Java programming skills could run queries on the large volumes of data that Facebook stored in
HDFS. Apache Pig and Hive are two projects that sit on top of Hadoop and provide a higher-level
language for using Hadoop's MapReduce library.
Why hive?
Hive provides a query language based on standard SQL instead of requiring the development
of map and reduce tasks. Hive takes HiveQL statements and automatically
transforms each query into one or more MapReduce jobs. It then runs the overall
MapReduce program and returns the output to the user. Whereas Hadoop Streaming reduces
the required code/compile/submit cycle, Hive removes it completely and instead requires
only the composition of HiveQL statements.
This interface to Hadoop not only accelerates the time required to produce results from data
analysis, it also significantly expands the set of people for whom Hadoop and MapReduce are useful.
What makes Hive Hadoop popular?
Users are provided with strong and powerful statistical functions.
It is similar to SQL and hence very easy to understand.
It can be combined with HBase for querying the data stored in HBase. This kind of feature is
not directly available in Pig; in Pig, the function HBaseStorage() is used for loading data
from HBase.
It is supported by Hue.
Various well-known user groups use Hive, such as CNET, Last.fm, Facebook and Digg.
Difference between hive and pig
Hive                                       Pig
Used for data analysis                     Used for data and programs
Works with structured data                 Works with semi-structured data
Uses HiveQL                                Uses Pig Latin
Used for creating reports                  Used for programming
Works on the server side                   Works on the client side
Does not support Avro directly             Supports Avro
hive> select * from employee;
hive> describe employee;
Apache Hive is data warehouse software which allows you to read, write and
manage huge volumes of data stored in a distributed environment using SQL. It
is possible to project structure onto data that is already in storage. Users can connect to
Hive using a JDBC driver or a command-line tool.
Hive is an open-source platform. Use Hive for analyzing and querying large
datasets stored in Hadoop files. It is similar to SQL programming. The
version of Hive referred to in these notes is 0.13.1.
Hive supports ACID transactions: Atomicity, Consistency, Isolation and Durability. ACID
transactions are provided at the row level, through the Insert, Update and Delete operations.
Hive is not a complete database. The design rules and limitations of Hadoop
and HDFS place restrictions on what Hive can do.
Hive is most suitable for data warehouse applications that:
analyze relatively static data,
do not require fast response times, and
do not have rapidly changing datasets.
Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing);
it is best suited to data warehouse applications over large data sets.
The two types of tables in Hive
1. Managed table
2. External table
We can change settings within a Hive session using the SET command. It is used
to change Hive job settings so that a query produces the desired results.
Example: the following command makes Hive enforce bucketing according to the table
definition:
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property
name. SET on its own lists all the properties whose values have been set by Hive:
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list does not include the Hadoop defaults; to include those as well, use SET -v.
Que 1. Define Apache Pig
Ans. To analyze large data sets, representing them as data flows, we use Apache Pig. Basically,
Apache Pig is designed to provide an abstraction over MapReduce, reducing the complexity of
writing a MapReduce program in Java. Moreover, using Apache Pig, we can
perform data manipulation operations very easily in Hadoop.
Que 2. Why Do We Need Apache Pig?
Ans. At times, while performing MapReduce tasks, programmers who are not so good at
Java struggle to work with Hadoop. Hence, Pig is a boon for all such
programmers. The reasons are:
Using Pig Latin, programmers can perform MapReduce tasks easily, without having to
type complex codes in Java.
Since Pig uses multi-query approach, it also helps in reducing the length of codes.
It is easy to learn Pig when you are familiar with SQL. It is because Pig Latin is SQL-like
language.
In order to support data operations, it offers many built-in operators like joins, filters,
ordering, and many more. And, it offers nested data types that are missing from
MapReduce, for example, tuples, bags, and maps.
Que 3. What is the difference between Pig and SQL?
Ans. Here, are the list of major differences between Apache Pig and SQL.
Pig
It is a procedural language.
SQL
While it is a declarative language.
Pig
Here, the schema is optional. We can store data without designing a schema; fields are then
referred to positionally as $0, $1, etc.
SQL
In SQL, Schema is mandatory.
Pig
In Pig, data model is nested relational.
SQL
In SQL, data model used is flat relational.
Pig
Here, we have limited opportunity for query optimization.
SQL
While here we have more opportunity for query optimization.
Que 4. Explain the architecture of Hadoop Pig.
Ans. Below is the image, which shows the architecture of Apache Pig.
Now, we can see, several components in the Hadoop Pig framework. The major components
are:
1. Parser
At first, the Parser handles all the Pig scripts. Basically, the Parser checks the syntax of the script,
does type checking and other miscellaneous checks. The Parser's output is a DAG
(directed acyclic graph) that represents the Pig Latin statements and logical operators.
In the DAG (the logical plan), the logical operators of the script are represented as nodes and the
data flows are represented as edges.
2. Optimizer
Further, DAG is passed to the logical optimizer. That carries out the logical optimizations, like
projection and push down.
3. Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
4. Execution engine
At last, these MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop,
producing the desired results.
Que 5. What is the difference between Apache Pig and Hive?
Ans. Basically, we use both Pig and Hive to create MapReduce jobs, and at times Hive operates on
HDFS in much the same way Pig does. So, here we list a few significant points that set Apache Pig
apart from Hive.
Hadoop Pig
Pig Latin is a language, Apache Pig uses. Originally, it was created at Yahoo.
Hive
HiveQL is a language, Hive uses. It was originally created at Facebook.
Pig
It is a data flow language.
Hive
Whereas, it is a query processing language.
Pig
Moreover, it is a procedural language which fits in pipeline paradigm.
Hive
It is a declarative language.
Apache Pig
Also, can handle structured, unstructured, and semi-structured data.
Hive
Whereas, it is mostly for structured data.
Que 6. What is the difference between Pig and MapReduce?
Ans. Some major differences between Hadoop Pig and MapReduce, are:
Apache Pig
It is a data flow language.
MapReduce
However, it is a data processing paradigm.
Hadoop Pig
Pig is a high-level language.
MapReduce
Well, it is a low level and rigid.
Pig
In Apache Pig, performing a join operation is pretty simple.
MapReduce
But, in MapReduce, it is quite difficult to perform a join operation between datasets.
Que 7. Explain Features of Pig.
Ans. There are several features of Pig, such as:
To write an evaluate UDF, we will have to extend the EvalFunc class. EvalFunc is parameterized
and must provide the return type as well.
Que 10. What are the different UDF’s in Pig?
Ans. UDFs can be classified on the basis of the number of records they process at a time. They are of two types:
UDFs that take one record at a time, for example Filter and Eval functions.
UDFs that take multiple records at a time, for example aggregate functions like AVG and SUM.
Also, Pig gives you the facility to write your own UDFs to load/store data.
Que 11. What are the Optimizations a developer can use during joins?
Ans. We use a replicated join to join a small dataset with a large dataset. In the replicated join,
the small dataset is copied to all the machines where the mappers are running, while the large
dataset is divided across all the nodes. This gives us the advantage of a map-side join.
If your dataset is skewed, i.e. a particular key is repeated many times, then even if you use a
reduce-side join, that particular reducer will be overloaded and will take a lot of time. For this case
Pig provides the skewed join, identifying the skewed keys itself.
And if you have datasets where the records are sorted on the same field, you can go for a merge
(sorted) join; this also happens in the map phase and is very efficient and fast.
Que 12. What is a skewed join?
Ans. A join with a skewed dataset, i.e. one in which a particular value is repeated many times, is a
skewed join.
Que 13. What is Flatten?
Ans. Flatten is an operator in Pig that removes a level of nesting. Sometimes we have data
in a bag or a tuple and we want to remove the level of nesting so that the data structure becomes
flat; for this we use Flatten.
In addition, Flatten produces a cross product of every record in the bag with all of the
other expressions in the enclosing statement.
Que 14. What are the complex data types in pig?
Ans. The complex data types in Pig are the tuple, the bag, and the map (described earlier in this unit).
Also, we can use an ODBC driver application, since Hive supports ODBC connections to the
Hive server.
Que 4. Can we change the data type of a column in a hive table?
Ans. By using the REPLACE COLUMNS option we can change the data type of a column in a Hive table:
ALTER TABLE table_name REPLACE COLUMNS ……
Que 5. How to add a partition to an existing table that was not created as a partitioned table?
Ans. Basically, we cannot add/create a partition in an existing table that was not partitioned when
it was created.
However, if the table was created with a "PARTITIONED BY" clause, then by using the ALTER
TABLE command you are allowed to add a partition.
So, here are the create and alter commands:
CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);
ALTER TABLE tab02 ADD PARTITION (mon='10') LOCATION '/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv5.txt';
Que 6. How Hive organize the data?
Ans. Basically, there are 3 ways possible in which Hive organizes data. Such as:
1. Tables
2. Partitions
3. Buckets
Que 7. Explain Clustering in Hive?
Ans. Basically, Clustering in Hive means decomposing table data sets into more manageable parts.
To be more specific, the table is divided into a number of partitions, and these partitions can
be further subdivided into more manageable parts known as buckets/clusters. In addition, the
"CLUSTERED BY" clause is used to divide the table into buckets.
Que 8. Explain bucketing in Hive?
Ans. To decompose table data sets into more manageable parts, Apache Hive offers another
technique. That technique is what we call Bucketing in Hive.
Que 9. How is HCatalog different from Hive?
Ans. So, let’s learn the difference.
HCatalog –
Basically, it is a table storage management tool for Hadoop that exposes the tabular
data of the Hive metastore to other Hadoop applications. It enables users with different data
processing tools to easily write data onto a grid. Moreover, it ensures that users don't have to
worry about where or in what format their data is stored.
Hive –
Whereas Hive is an open-source data warehouse. We use it for analysis and querying of
datasets. It is developed on top of Hadoop as a data warehouse framework for querying and
analysis of data stored in HDFS.
In addition, it is useful for performing several operations, such as data encapsulation, ad-hoc
queries and analysis of huge datasets. Hive's design reflects its targeted use as a system for
managing and querying structured data.
Que 10. What is the difference between CREATE TABLE AND CREATE EXTERNAL
TABLE?
Ans. Although, we can create two types of tables in Hive. Such as:
– Internal Table
– External Table
Hence, to create the Internal table we use the command ‘CREATE TABLE’ whereas to create
the External table we use the command ‘CREATE EXTERNAL TABLE’.
Que 11. Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Ans. The above error may occur because of the following reasons:
1. When we use the Derby metastore, a lock file will be left behind in the case of an abnormal exit.
Hence, remove the lock file:
rm metastore_db/*.lck
2. Moreover, run Hive in debug mode to see the underlying cause:
hive -hiveconf hive.root.logger=DEBUG,console
Que 12. How many types of Tables in Hive?
Ans. Hive has two types of tables. Such as:
Managed table
External table
Que 13. Explain Hive Thrift server?
Ans. There is an optional component in Hive that we call HiveServer or HiveThrift.
Basically, it allows access to Hive over a single port. Thrift is a software framework for scalable
cross-language services development. It allows clients written in languages including Java, C++,
Ruby and many others to programmatically access Hive remotely.
Que 14. How to Write a UDF function in Hive?
Ans. Basically, the steps are:
1. Create a Java class for the user-defined function which extends
org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods.
Put in your desired logic and you are almost there.
2. Package your Java class into a JAR file.
3. Go to the Hive CLI, add your JAR, and verify that your JAR is in the Hive CLI classpath.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class.
5. Then use it in Hive SQL.
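A minimal sketch of such a UDF follows (the class name, function name and JAR path are hypothetical); the Hive commands from steps 3-5 are shown as comments:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that lower-cases a string column.
public class LowerCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;              // Hive passes NULLs through to the UDF
        }
        return new Text(input.toString().toLowerCase());
    }
}

// In the Hive CLI (after packaging the class into lowercase-udf.jar):
//   ADD JAR /path/to/lowercase-udf.jar;
//   CREATE TEMPORARY FUNCTION my_lower AS 'LowerCaseUDF';
//   SELECT my_lower(name) FROM employee;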
Que 16. What is the difference between Internal Table and External Table in Hive?
Ans. Hive Managed Tables-
It is also known as an internal table. When we create a table in Hive, it by default manages the
data. This means that Hive moves the data into its warehouse directory.
Usage:
We want Hive to completely manage the lifecycle of the data and the table.
The data is temporary.
Hive External Tables-
We can also create an external table. It tells Hive to refer to the data that is at an existing
location outside the warehouse directory.
Usage:
Data is used outside of Hive. For example, the data files are read and processed by an
existing program that does not lock the files.
We are not creating a table based on the existing table.
Que 17.Difference between order by and sort by in Hive?
Ans. So, the difference is:
Sort by
hive> SELECT E.EMP_ID FROM Employee E SORT BY E.empid;
1. It may use multiple reducers for the final output.
2. It only guarantees the ordering of rows within each reducer.
3. It therefore gives a partially ordered result.
Order by
hive> SELECT E.EMP_ID FROM Employee E ORDER BY E.empid;
1. Basically, it uses a single reducer to guarantee total order in the output.
2. Also, LIMIT can be used to minimize sort time.
Que 18. What are different modes of metastore deployment in Hive?
Ans. There are three modes for metastore deployment which Hive offers.
1. Embedded metastore
Here, by using embedded Derby Database both metastore service and hive service runs in the
same JVM.
2. Local Metastore
However, here, Hive metastore service runs in the same process as the main Hive Server
process, but the metastore database runs in a separate process.
3. Remote Metastore
Here, metastore runs on its own separate JVM, not in the Hive service JVM.
Que 19. Difference between HBase vs Hive
Ans. Following points are feature wise comparison of HBase vs Hive.
1.Database type
Apache Hive
Basically, Apache Hive is not a database.
HBase
HBase does support NoSQL database.
2. Type of processing
Apache Hive
Hive does support Batch processing. That is OLAP.
HBase
HBase does support real-time data streaming. That is OLTP.
3. Data Schema
Apache Hive
Basically, it supports to have schema model
HBase
However, it is schema-free
Que 20. What is the relation between MapReduce and Hive?
Ans. Hive offers no execution capabilities beyond MapReduce: HiveQL programs are executed as
MapReduce jobs via the interpreter. The interpreter runs on the client machine, transforms the
HiveQL queries into MapReduce jobs and submits them to the cluster.
Que 21. What is the importance of driver in Hive?
Ans. The driver manages the lifecycle of HiveQL queries. It receives queries from the UI and the
JDBC interfaces, and creates a separate session to handle each query.
Que 22. How can you configure remote metastore mode in Hive?
Ans. To use this remote metastore, you should configure Hive service by setting
hive.metastore.uris to the metastore server URI(s). Metastore server URIs are of the form
thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when
starting the metastore server.
Que 23. Can we LOAD data into a view?
Ans. No.
Que 24. What types of costs are associated with creating the index on hive tables?
Ans. Basically, there is a processing cost in arranging the values of the column on which the index
is created, and indexes occupy additional disk space that must be maintained as the table changes.
Que 25. Give the command to see the indexes on a table.
Ans. SHOW INDEX ON table_name
Basically, in the table table_name, this will list all the indexes created on any of the columns.
Que 26. How do you specify the table creator name when creating a table in Hive?
Ans. The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES(‘creator’= ‘Joan’)
Que 27.Difference between Hive and Impala?
Ans. Following are the feature wise comparison between Impala vs Hive:
1. Query Process
Hive
Basically, in Hive every query has the common problem of a "cold start".
Impala
Impala avoids the startup overheads that are very frequently and commonly observed in
MapReduce-based jobs, because it is a native query engine: the Impala daemon processes are
started at boot time itself and are always ready to process a query.
2. Intermediate Results
Hive
Basically, Hive materializes all intermediate results. Hence, it enables enabling better
scalability and fault tolerance. However, that has an adverse effect on slowing down the data
processing.
Impala
However, it’s streaming intermediate results between executors. Although, that trades off
scalability as such.
3. During the Runtime
Hive
At Compile time, Hive generates query expressions.
Impala
During the Runtime, Impala generates code for “big loops”.
Que 28. What are types of Hive Built-In Functions?
Ans. So, its types are:
1. Collection Functions
2. Hive Date Functions
3. Mathematical Functions
4. Conditional Functions
5. Hive String Functions
Que 29.Types of Hive DDL Commands.
Ans. There are several types of Hive DDL commands that we commonly use, such as:
1. Create Database Statement
2. Hive Show Database
3. Drop database
4. Creating Hive Tables
5. Browse the table
6. Altering and Dropping Tables
7. Hive Select Data from Table
8. Hive Load Data
Que 30. What are Hive Operators and its Types?
Ans. Hive operators are used to perform operations on operands, and they return a specific value as per the logic applied.
Types of Hive Built-in Operators
Relational Operators
Arithmetic Operators
Logical Operators
String Operators
Operators on Complex Types
1) What is the difference between Pig and Hive ?
Pig vs Hive:
Type of data – Apache Pig is usually used for semi-structured data, whereas Hive is used for structured data.
General usage – Pig is usually used on the client side of the Hadoop cluster, whereas Hive is usually used on the server side of the Hadoop cluster.
Coding style – Pig Latin is verbose, whereas HiveQL is more like SQL.
For a detailed answer on the difference between Pig and Hive, refer to this link -
https://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-
hadoop-ecosystem/79
2) What is the difference between HBase and Hive ?
HBase vs Hive: HBase is a NoSQL database, whereas Hive is a data warehouse framework.
2) I do not need the index created in the first question anymore. How can I delete the
above index named index_bonuspay?
DROP INDEX index_bonuspay ON employee;
Test Your Practical Hadoop Knowledge
Which companies use Hive extensively? This could be one of the possible Hive Interview
Questions asked at your next Hadoop Job interview.
javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value
jdbc:derby:;databaseName=metastore_db;create=true.
The value implies that embedded Derby will be used as the Hive metastore and that the location of the metastore is metastore_db, which will be created only if it does not already exist. Because metastore_db is a relative location, it gets created in whichever directory you run Hive queries from. This property can be altered in the hive-site.xml file to an absolute path so that a single metastore location is used instead of creating a metastore_db subdirectory every time Hive is launched from a different directory.
11) How will you read and write HDFS files in Hive?
i) TextInputFormat- This class is used to read data in plain text file format.
ii) HiveIgnoreKeyTextOutputFormat- This class is used to write data in plain text file format.
iii) SequenceFileInputFormat- This class is used to read data in hadoop SequenceFile format.
iv) SequenceFileOutputFormat- This class is used to write data in hadoop SequenceFile format.
12) What are the components of a Hive query processor?
The query processor in Apache Hive converts SQL into a graph of MapReduce jobs, along with the execution-time framework needed to run those jobs in the order of their dependencies. The various components of the query processor are-
Parser
Semantic Analyser
Type Checking
Logical Plan Generation
Optimizer
Physical Plan Generation
Execution Engine
Operators
UDFs and UDAFs.
CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of the data, which is then sorted within each respective reducer.
18) Difference between HBase and Hive.
HBase is a NoSQL database whereas Hive is a data warehouse framework to process Hadoop
jobs.
HBase runs on top of HDFS whereas Hive runs on top of Hadoop MapReduce.
19) Write a hive query to view all the databases whose name begins with “db”
SHOW DATABASES LIKE ‘db.*’
20) How can you prevent a large job from running for a long time?
This can be achieved by setting MapReduce jobs to execute in strict mode:
set hive.mapred.mode=strict;
The strict mode ensures that queries on partitioned tables cannot execute without defining a WHERE clause.
UNIT – V
Spark Tutorial
In this free Apache Spark Tutorial you will be introduced to Spark analytics, Spark streaming,
RDD, Spark on cluster, Spark shell and actions. You will learn about Spark practical use cases.
Apache Spark continues to gain momentum in today’s big data analytics landscape. Although a relatively new entry to the realm, Apache Spark has earned immense popularity among enterprises and data analysts within a short period. Apache Spark is one of the most active open source big data projects. The reason behind this is its versatility and diversity of use.
Some of the key features that make Spark a strong big data engine are:
Equipped with MLlib library for machine learning algorithms
Good for Java and Scala developers as Spark imitates Scala’s collection API and functional
style
Single library can perform SQL, graph analytics and streaming.
Spark is admired by developers and analysts for its ability to quickly query, analyze and transform data at scale. In simple words, you can call Spark a competent alternative to Hadoop, with its own characteristics, strengths and limitations. Spark runs in-memory to process data with more speed and sophistication than complementary approaches like Hadoop MapReduce. It can handle several terabytes of data at a time and process them efficiently.
Spark versus Hadoop MapReduce
Despite having similar functionality, there are many differences between these two technologies. Let’s have a quick look at this comparative analysis:
Latency: lower in Spark, higher in Hadoop MapReduce.
One of the excellent benefits of using Spark is that it can use Hadoop’s data storage model, i.e. HDFS, and it integrates well with other big data frameworks like HBase, MongoDB and Cassandra. It is one of the best big data choices for learning and applying machine learning algorithms in real time, since it has the ability to run repeated queries on large datasets efficiently.
Given the excellent future growth and rapid adoption of Apache Spark in today’s business world, this Spark tutorial is designed to educate programmers on this interactive and expeditious framework. The tutorial aims at training you on the beginner concepts of using Spark as well as giving you insights into its advanced modules.
It includes detailed elucidation of Spark and Hadoop Distributed File System. The major topics
include Spark Components, Common Spark Algorithms-Iterative Algorithms, Graph
Analysis, Machine Learning, and Running Spark on a Cluster. Further, you will be able to write algorithms yourself by learning to develop Spark applications using Python, Java or Scala, and the RDD API with its operations. Since Spark can run on diverse platforms using various languages, it is important to gain insight into developing applications with each of the mentioned programming languages.
This learning package also covers Spark, Hadoop and the Enterprise Data Centre, Common Spark Algorithms, and Spark Streaming, which is yet another important feature of Spark. Many application developers use this data streaming capability to keep a check on fraudulent financial transactions.
Recommended Audience
Big Data Analysts and Architects
Software Professionals, ETL Developers and Data Engineers
Data Scientists and Analytics Professionals
Beginner and advanced-level programmers in Java, C++, Python
Graduates aiming to learn the latest and most efficient programming languages to process Big Data in a faster and easier manner.
Prerequisites
Before getting started with this tutorial, have a good understanding of Java basics and concepts of programming. Knowledge of other programming languages like C, C++ and Python, and of big data analytics, will also help you decipher the topics better.
Spark Features
Developed in the AMPLab of the University of California, Berkeley, Apache Spark was designed for higher speed, ease of use and more in-depth analysis. Though it was built to be installed on top of a Hadoop cluster, its ability to do parallel processing allows it to run independently as well. Let’s take a closer look at the features of Apache Spark –
Fast processing – The most important feature of Apache Spark, and the reason the big data world chooses this technology over others, is its speed. Big data is characterized by volume, variety, velocity and veracity, and needs to be processed at high speed. Spark’s Resilient Distributed Dataset (RDD) saves time on reading and writing operations, so Spark runs almost ten to a hundred times faster than Hadoop MapReduce.
Flexibility – Apache Spark supports multiple languages and allows developers to write applications in Java, Scala, R or Python. Equipped with over 80 high-level operators, it is quite rich in this respect.
In-memory computing – Spark stores data in the RAM of the servers, which allows it to be accessed quickly and in turn accelerates the speed of analytics.
Real-time processing – Spark is able to process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark can process data as it arrives and hence produce instant outcomes.
Better analytics – In contrast to MapReduce, which provides only Map and Reduce functions, Spark includes much more: a rich set of SQL queries, machine learning algorithms, complex analytics, etc. With all these functionalities, analytics can be performed in a better fashion with the help of Spark.
Compatible with Hadoop – Spark is not only able to work independently, it can also work on top of Hadoop. It is compatible with both versions of the Hadoop ecosystem.
Apache Spark Architecture
In order to understand the way Spark runs, it is very important to know the architecture of Spark. The following discussion will give you a clearer view of it.
There are three ways Apache Spark can run :
Standalone – The Hadoop cluster can be equipped with all the resources statically and Spark
can run with MapReduce in parallel. This is the simplest deployment.
On Hadoop YARN – Spark can be executed on top of YARN without any pre-installation.
This deployment utilizes the maximum strength of Spark and other components.
Spark In MapReduce (SIMR) – If you don’t have YARN, you can also use Spark along with
MapReduce. This reduces the burden of deployments.
Whichever way Spark is deployed, the configuration allocates resources to it. The moment Spark is connected, it obtains executors on the nodes. These executors are processes that run computations and store the data. The application code is then sent to the executors, after which SparkContext sends tasks to the executors to run.
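As a minimal sketch (the application name and master URL below are illustrative assumptions, not part of these notes), creating a SparkContext that connects Spark to a cluster manager looks like this:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")   // illustrative application name
  .setMaster("local[*]")            // use "yarn" instead when running on a YARN cluster
val sc = new SparkContext(conf)
// sc now asks the cluster manager for executors and later sends them tasks to run.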
Some important terms to illustrate the architecture are –
Spark Driver – the program that runs the main() function of the Spark application.
Since the time of its inception in 2009 and its conversion to an open source technology, Apache
Spark has taken the big data world by storm. It became one of the largest open source
communities that includes over 200 contributors. The prime reason behind its success was its
ability to process heavy data faster than ever before.
Spark is a widely-used technology adopted by most industries. Let us look at some of the prominent applications of Apache Spark –
Machine Learning – Apache Spark is equipped with a scalable machine learning library called MLlib that can perform advanced analytics such as clustering, classification, dimensionality reduction, etc. Prominent analytics jobs like predictive analysis, customer segmentation and sentiment analysis make Spark an intelligent technology.
Fog computing – With the influx of big data concepts, IoT has acquired a prominent space for the invention of more advanced technologies. Based on the idea of connecting digital devices with the help of small sensors, this technology deals with a humongous amount of data emanating from numerous sources. This requires parallel processing, which is not feasible on cloud computing alone. Therefore Fog computing, which decentralizes data and storage, uses Spark Streaming as a solution to this problem.
Event detection – Spark Streaming allows organizations to keep track of rare and unusual behaviors in order to protect their systems. Financial institutions, security organizations and health organizations use such triggers to detect potential risks.
Interactive analysis – Among the most notable features of Apache Spark is its ability to
support interactive analysis. Unlike MapReduce that supports batch processing, Apache
Spark processes data faster because of which it can process exploratory queries without
sampling.
Some of the most popular companies that are using Apache Spark are –
Uber – Uses Kafka, Spark Streaming, and HDFS for building a continuous ETL pipeline.
Pinterest – Uses Spark Streaming in order to gain deep insight into customer engagement
details.
Conviva – The pinnacle video company Conviva deploys Spark for optimizing the videos
and handling live traffic.
Components of Spark
The following gives a clear picture of the different components of Spark.
Apache Spark Core
Spark Core is the general execution engine of the Spark platform; all other functionality is built on top of it. It provides in-memory computing and the ability to reference datasets stored in external storage systems.
Spark allows developers to write code quickly with the help of a rich set of operators. A job that takes many lines of code in MapReduce takes only a few lines in Spark with Scala. The following word count program will help you understand the way programming is done with Spark:
sparkContext.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.saveAsTextFile("hdfs://...")
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for both structured and semi-structured data.
Below is an example of a Hive compatible query:
// sc is an existing SparkContext.
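// A sketch of such a Hive-compatible query using the Spark 1.x HiveContext API
// (the table name "src" and the input file are illustrative assumptions, not from these notes).
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveCtx.sql("LOAD DATA LOCAL INPATH 'kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL.
hiveCtx.sql("SELECT key, value FROM src").collect().foreach(println)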
The filter() operation does not mutate the existing input RDD. Instead, it returns a pointer to an entirely new RDD.
Actions : These are operations that return a final value to the driver program or write data to an external storage system. Calling an action forces the evaluation of the transformations required for the RDD it was called on, since the action needs to actually produce output.
Python error count records using actions
print "Input had " + str(badLinesRDD.count()) + " concerning lines"
print "Here are 10 examples:"
for line in badLinesRDD.take(10):
    print line
Scala error count records using actions
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
Java error count records using actions
System.out.println("Input had " + badLinesRDD.count() + " concerning lines");
System.out.println("Here are 10 examples:");
for (String line: badLinesRDD.take(10)) {
System.out.println(line);
}
In the above lines of code, take() is used to retrieve a small number of elements of the RDD at the driver program, which are then iterated over to write output at the driver. RDDs also have a collect() function that fetches the entire RDD to the driver. Each time a new action is called, the entire RDD must be computed "from scratch". To avoid this inefficiency, users can persist intermediate results.
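For instance, a sketch of persisting the badLinesRDD from the examples above so that repeated actions do not recompute it (the storage level chosen here is just an illustration):
import org.apache.spark.storage.StorageLevel
badLinesRDD.persist(StorageLevel.MEMORY_ONLY)  // keep this RDD in memory after it is first computed
println(badLinesRDD.count())                   // the first action computes and caches the RDD
badLinesRDD.take(10).foreach(println)          // subsequent actions reuse the cached result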
Lazy Evaluation
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not performed immediately. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data, built up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are: when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times.
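As a small illustrative sketch (the file path is a placeholder):
val lines = sc.textFile("hdfs://...")                      // nothing is read yet
val errors = lines.filter(line => line.contains("error"))  // still nothing is computed
println(errors.count())                                    // count() is an action, so only now is the file read and filtered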
Passing Functions to Spark
Python : In Python, we have three main choices for passing functions into Spark. First, we can pass functions as lambda expressions. Second, we can pass a function that is already a member of an object, or one that references fields within an object. Third, we can simply extract the fields we require from the object into a local variable and pass that in.
Scala : In Scala we can pass in functions defined inline, references to methods, or static functions, as we do for Scala’s other functional APIs (Application Programming Interfaces); a short sketch follows below.
Java : In Java, functions are specified as objects that implement one of Spark’s function interfaces from the org.apache.spark.api.java.function package.
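As the sketch referred to above, here is one way of passing functions in Scala, assuming someRDD is an RDD of strings (both the RDD and the function names are illustrative):
// Passing a function defined inline
val upper = someRDD.map(line => line.toUpperCase)
// Passing a reference to a defined function
def containsError(line: String): Boolean = line.contains("error")
val errors = someRDD.filter(containsError)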
Common Transformations and Actions
The two most common transformations you will be using are map() and filter(). The map() transformation takes in a function and applies it to each element in the RDD, with the result of the function being the new value of each element in the resulting RDD. The filter() transformation takes in a function and returns an RDD that only has the elements that pass the filter() function.
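A short sketch of both transformations on a small RDD of numbers (the values are illustrative):
val input = sc.parallelize(List(1, 2, 3, 4))
val squares = input.map(x => x * x)           // transformation: 1, 4, 9, 16
val evens = squares.filter(x => x % 2 == 0)   // transformation: 4, 16
println(evens.collect().mkString(", "))       // collect() is an action that returns the results to the driver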
Data caching: Spark caches data in-memory, whereas Hadoop MapReduce relies on the hard disk.
Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced
execution engine supporting cyclic data flow and in-memory computing. Spark can run on
Hadoop, standalone or in the cloud and is capable of accessing diverse data sources including
HDFS, HBase, Cassandra and others.
3. Explain key features of Spark.
4. Define RDD?
RDD (Resilient Distributed Dataset) is Spark’s fundamental data structure: a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways:
1. Parallelized Collections : created by parallelizing an existing collection in the driver program, so that its elements run in parallel with one another.
2. Hadoop datasets : created from files in HDFS or another storage system, so that a function can be performed on each file record.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Every RDD in Spark is partitioned.
7. What operations does an RDD support?
Transformations.
Actions
An action brings data back from the RDD to the local machine (the driver). An action’s execution is the result of all previously created transformations. For example, reduce() is an action that applies the passed function repeatedly until only one value is left, and take(n) is an action that brings the first n values from the RDD to the local node.
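A small sketch of both actions (the values are illustrative):
val nums = sc.parallelize(List(1, 2, 3, 4))
val sum = nums.reduce((a, b) => a + b)  // combines elements repeatedly until one value is left: 10
val firstTwo = nums.take(2)             // brings the first two elements to the driver: Array(1, 2)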
10. Define functions of SparkCore?
Serving as the base engine, SparkCore performs various important functions like memory
management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage
systems.
11. What is RDD Lineage?
Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions; the best part is that an RDD always remembers how it was built from other datasets.
12. What is Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares
transformations and actions on data RDDs. In simple terms, driver in Spark creates
SparkContext, connected to a given Spark Master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
13. What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution can be configured to run on Spark by setting hive.execution.engine=spark.
Spark uses GraphX for graph processing to build and transform interactive graphs. The
GraphX component enables programmers to reason about structured data at scale.
17. What does MLlib do?
MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
18. What is Spark SQL?
Spark SQL, formerly known as Shark, is a novel module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a different kind of RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in a row. It is similar to a table in a relational database.
19. What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it one of the best big data analytics formats so far.
20. What file systems does Spark support?
Spark can access data in HDFS and the local file system, as well as in other data sources such as HBase and Cassandra (as noted above).
Due to the availability of in-memory processing, Spark executes processing around 10-100x faster than Hadoop MapReduce, which makes use of persistent storage for its data processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, whereas Hadoop implements no iterative computing.
Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is
extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like
Pig and Hive convert their queries into MapReduce phases to optimize them better.
PageRank is an algorithm in GraphX that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank high on the platform.
29. Do you need to install Spark on all nodes of a YARN cluster while running Spark on YARN?
No. Spark runs on top of YARN, so it does not need to be installed on every node of the cluster.
That said, since Spark utilizes more memory and storage space compared to plain Hadoop MapReduce, certain problems may arise. Developers need to be careful while running their applications in Spark; instead of running everything on a single node, the work must be distributed over multiple nodes.
31. How to create RDD?
Spark provides two methods to create an RDD:
• By parallelizing a collection in your driver program. This makes use of SparkContext’s ‘parallelize’ method:
val IntellipaatData = Array(2,4,6,8,10)
val distIntellipaatData = sc.parallelize(IntellipaatData)
• By loading an external dataset from external storage like HDFS or a shared file system.
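For the second method, loading an external dataset in Scala might look like this (the path is a placeholder):
val linesRDD = sc.textFile("hdfs://.../data.txt")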
Assignment - 1
This set of Multiple Choice Questions & Answers (MCQs) focuses on “Big-Data”.
1. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
2. Point out the correct statement :
a) Hadoop do need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided in to lines or records
d) None of the mentioned
3. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop ?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
4. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
5. Point out the wrong statement :
a) Hadoop’s processing capabilities are huge and its real advantage lies in the ability to process terabytes & petabytes of data
b) Hadoop uses a programming model called “MapReduce”; all programs should conform to this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned
6. What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop’s development
7. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
8. __________ can best be described as a programming model used to develop Hadoop-based
applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
9. __________ has the world’s largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
10. Facebook Tackles Big Data With _______ based on Hadoop.
a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
BDA UNIT – I Question Bank
1. List various types of digital data?
2. Why an email placed in the Unstructured category?
3. What category will you place a CCTV footage into?
4. You have just got a book issued from the library. What are the details about the book
that can be placed in an RDBMS table.
5. Which category would you place the consumer complaints and feedback?
6. Which category (structured, semi-structured or Unstructured) will you place a web page
in?
7. Which category (structured, semi-structured or Unstructured) will you place a Power
point presentation in?
8. Which category (structured, semi-structured or Unstructured) will you place a word
document in?
9. __________, a Gartner analyst, coined the term Big Data
10. ____________is the characteristic of data dealing with its retention.
11. ____________is a large data repository that stores data in its native format until it is
needed.
12. _________ is the characteristic of data that explains the spikes in data.
13. Near real time processing or real time processing deals with ___________characteristic
of data.
14. ____________technology helps query data that resides in a computer’s random access
memory (RAM) rather than data stored on Physical disks.
15. Eventual consistency is consistency model used in distributed computing to achieve
high ______
16. A coordinated processing of program by multiple processors, each working on different
parts of the program and using its own operating system and memory called
_________.
17. A collection of independent computers that appear to its users as a single coherent
system is __________.
18. CAP Theorem is also called as ________________
19. System will continue to function even when network partition occurs is called________
20. Every read fetches the most recent write is called _____________
21. A non failing node will return a reasonable response within a reasonable amount of
time is called_______
22. Where BASE is used?
23. What is replica convergence?
24. What is data science?
25. What is WPS?
26. What are the advantages of a shared nothing architecture?
27. What are the guarantees provided by the CAP Theorem?
28. What is SMP?
29. What is MPP?
30.
2 Mark Questions
1. Match the following:
Column A Column B
JSON SOAP
Mongo DB REST
XML JSON
Flexible structure Couch DB
JSON XML
2. What, according to you, are the challenges with unstructured data?
3. State few examples of human generated and machine generated data.
4. What are the characteristics of data?
5. Big Data (Hadoop) will replace the traditional RDBMS and data warehouse. Comment.
6. Mention few top analytics tools.
7. Mention few open source analytics tools
8. Big data analytics is about a tight handshaking between three communities:_________ ,
___________and _________
9. List the three common types of architecture for Multi processor high transaction rate
systems.
10. What are the responsibilities of a Data scientist.
5 Mark questions
1. Match the following:
Column A Column B
NLP Content analytics
Text Analytics Text messages
UIMA Chats
Noisy Unstructured data Text mining
Data mining Comprehend human or natural language input
Noisy Unstructured data Uses methods at the intersection of statistics,
Artificial Intelligence, machine learning & DBs
IBM UIMA
2. Place the following in suitable basket:
i. Email ii. MS Access iii. Images iv. Database
v. Chat conversations vi. Relations / Tables vii. Facebook
viii. Videos ix. MS Excel x. XML
Structured Unstructured Semi structured
UNIT – III
1 Mark questions
1. Partitioner phase belongs to ______ task
2. Combiner is also known as ________
3. What is RecordReader in a Map Reduce?
4. MapReduce sorts the intermediate value based on _________
5. In Map reduce programming, the reduce function is applied ______group at a
time.
6. Explain JobConf in MapReduce.
7. What is a MapReduce Combiner?
8. Define Writable data types in MapReduce.
9. What is OutputCommitter?
10. What is a “map” in Hadoop?
11. What is a “reducer” in Hadoop?
12. What are the parameters of mappers and reducers?
13. What are the key differences between Pig vs MapReduce?
14. What is partitioning?
15. How to set which framework would be used to run mapreduce program?
16. What platform and Java version is required to run Hadoop?
17. Can MapReduce program be written in any language other than Java?
2 Mark questions
1. What is the difference between HDFS block and InputSplit?
2. What is Text Input Format?
3. What is SequenceFileInputFormat?
4. How to set mappers and reducers for Hadoop jobs?
5 Mark questions
1. What are the main components of MapReduce Job?
2. What is Shuffling and Sorting in MapReduce?
3. What is Partitioner and its usage?
4. What is Identity Mapper and Chain Mapper?
5. Name Job control options specified by MapReduce.
10 mark questions
1. Illustrate with a simple example about the working of MapReduce.
2. Write a MapReduce program to find unitwise salary.
3. Write a MapReduce program to arrange the data on user-id, then within the user id sort
them in increasing order of the page count.
4. Explain about the Map task and Reducer task in detail.
UNIT – IV
1 Mark questions:
1. The metastore consists of ______ and a ____________
2. The most commonly used interface to interact with Hive is _________
3. The default metastore for Hive is _________
4. Metastore contains _________of Hive tables.
5. _________is responsible for compilation, optimization and execution of Hive queries.
6. PIG is ______language
7. In Pig, _________ is used to specify data flow.
8. Pig provides an ________to execute data flow
9. _________ and __________ are execution modes of Pig.
10. The interactive mode of Pig is _______________.
11. __________,__________and _________are complex data types of Pig.
12. Pig is used in ___________process.
13. PigStorage() function is case sensitive
14. Local mode is the default mode of Pig.
15. DISTINCT key word removes duplicate fields
16. LIMIT keyword is used to display limited number of tuples in Pig.
17. ORDERBY is used for sorting.
2 Marks questions:
1. Match the following:
Column A Column B
HQL Hive Query Language
Database Namespace
Complex data types Struct, Map
Hive Application Weblogs
Table Set of records
2. Match the following :
Column A Column B
Map Hadoop cluster
Bag An ordered collection of Fields
Local Mode Collection of tuples
Tuple Key/Value pair
Map Reduce Mode Local file system
5 Mark questions
1. Explain in detail how Hive is different from Pig.
2. Perform the following operations using Hive Query language
a) Create a database named “STUDENTS” with comments and database properties,
b) Display a list of databases
c) Describe a database
d) To make the databases current working database
e) To delete or remove a database
3. Write a Pig script for word count. Why is Hive relevant in the Hadoop ecosystem?
4. Explain the Architecture of Pig with a neat sketch.
5.
10 Mark questions:
1. Create a data file for below schemas
Order: custid, itemid, orderdate, deliverydate.
Customer: customerid, Customername, Address, City, state, country.
a. Create a table for Order data and customer data
b. Write a HiveQL to find number of items bought by each customer.
2. Create a data file for below schemas
Order: custid, itemid, orderdate, deliverydate.
Customer: customerid, Customername, Address, City, state, country.
a. Create a table for Order data and customer data
b. Write a Pig Latin script to determine number of items bought by each customer.
3. Explain HIVE architecture in detail.
4. Discuss various data types in Pig.
5. Write a word count program in Pig to count the occurrence of similar words in a file.
UNIT – V
10 Mark Questions
2. Explain the spark components in detail. Also list the features of Spark.
4. What is spark? State the advantages of using Apache spark over Hadoop MapReduce for Big
data processing with example.
UNIT - III
Understand MapReduce and its characteristics, and learn advanced MapReduce concepts
Ingest data using Sqoop and Flume
UNIT – IV
UNIT - V
Do functional programming in Spark, and create and execute Spark applications
Understand resilient distributed datasets (RDD) in detail
Get a thorough understanding of parallel processing in Spark and Spark RDD optimization techniques
Understand the typical use cases of Spark and its various built-in algorithms
Learn Spark SQL: creating, transforming, and querying data frames
http://nptel.ac.in/courses/106104135/48
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/tutorial/