Big Data NOTES and QB
Skills:
Upon completion of this course, students will be able to do the following:
o Students will be able to build and maintain reliable, scalable, distributed systems with
Apache Hadoop.
o Students will be able to write Map-Reduce based Applications
o Students will be able to design and build Big Data applications using Hive and Pig
o Students will learn tips and tricks for Big Data use cases and solutions
Activities:
Install Hadoop and develop applications on Hadoop
Develop Map Reduce applications
Develop applications using Hive/Pig/Spark
Unit-I
Introduction to big data: Data, Characteristics of data and Types of digital data, Sources of
data, Working with unstructured data, Evolution and Definition of big data, Characteristics
and Need of big data, Challenges of big data
Big data analytics: Overview of business intelligence, Data science and Analytics, Meaning
and Characteristics of big data analytics, Need of big data analytics, Classification of analytics,
Challenges to big data analytics, Importance of big data analytics, Basic terminologies in big
data environment
Unit-II
Introduction to Hadoop : Introducing Hadoop, need of Hadoop, limitations of RDBMS,
RDBMS versus Hadoop, Distributed Computing Challenges, History of Hadoop , Hadoop
Overview, Use Case of Hadoop, Hadoop Distributors, HDFS (Hadoop Distributed File
System) , Processing Data with Hadoop, Managing Resources and Applications with Hadoop
YARN (Yet another Resource Negotiator), Interacting with Hadoop Ecosystem
Unit-III
Introduction to MAPREDUCE Programming: Introduction , Mapper, Reducer, Combiner,
Partitioner , Searching, Sorting , Compression, Real time applications using MapReduce, Data
serialization and Working with common serialization formats, Big data serialization formats
Unit-IV
Introduction to Hive: Introduction to Hive, Hive Architecture , Hive Data Types, Hive File
Format, Hive Query Language (HQL), User-Defined Function (UDF) in Hive.
Introduction to Pig
Introduction to Pig, The Anatomy of Pig , Pig on Hadoop , Pig Philosophy , Use Case for Pig:
ETL Processing , Pig Latin Overview , Data Types in Pig , Running Pig , Execution Modes of
Pig, HDFS Commands, Relational Operators, Piggy Bank , Word Count Example using Pig ,
Pig at Yahoo!, Pig versus Hive
Unit-V
Spark: Introduction to data analytics with Spark, Programming with RDDs, Working with
key/value pairs, Advanced Spark programming
Text Books
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley
2. Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy
Konwinski, Patrick Wendell, Matei Zaharia, O'Reilly Media, Inc.
Reference Books:
1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions", Wiley,
ISBN: 9788126551071, 2015.
UNIT – I
There are three types of data we need to consider: structured, unstructured, and semi-
structured. Of these, the last two are new in Big Data.
Structured Data: Your current data warehouse contains structured data and only structured
data. It’s structured because when you placed it in your relational database system a structure
was enforced on it, so we know where it is, what it means, and how it relates to other pieces of
data in there. It may be text (a person’s name) or numerical (their age) but we know that the
age value goes with a specific person, hence structured.
Unstructured Data: Essentially everything else that has not been specifically structured is
considered unstructured. The list of truly unstructured data includes free text such as
documents produced in your company, images and videos, audio files, and some types of social
media. If the object to be stored carries no tags (metadata about the data) and has no
established schema, ontology, glossary, or consistent organization it is unstructured. However,
in the same category as unstructured data there are many types of data that do have at least
some organization.
Semi-Structured Data: The line between unstructured data and semi-structured is a little
fuzzy. If the data has any organizational structure (a known schema) or carries a tag (like
XML extensible markup language used for documents on the web) then it is somewhat easier
to organize and analyze, and because it is more accessible for analysis may make it more
valuable. Some types of data that appear to be unstructured but are actually semi-structured
include:
Text: XML, email or electronic data interchange (EDI) messages. These lack formal
structure but do contain tags or a known structure that separates semantic elements. Most
social media sources, a hot topic for analysis today, fall in this category. Facebook,
Twitter, and others offer data access through an application programming interface (API).
Web Server Logs and Search Patterns: An individual’s journey through a web site,
whether searching, consuming content, or shopping is recorded in detail in electronic web
server logs.
Sensor Data: There is a huge explosion in the number of sensors producing streams of
data all around us. Once we thought of sensors as only being found in industrial control
systems or major transportation systems. Now this includes RFIDs, infrared and wireless
technology, and GPS location signals among others. In addition to monitoring mechanical
systems, sensors increasingly monitor consumer behavior. Your cell phone puts out a
constant stream of signals that are being captured for location-based marketing. In-store
sensors are monitoring consumer shopping behavior. Your car monitors its systems and
constantly records data that can be used to evaluate mechanical failure or accidents.
There is huge growth in the popularity of ‘the quantified self’ in which we voluntarily
wear devices like the FitBit or a Nike Fuel Band that record our activity and in some
cases even heart rate, velocity, location, and calorie burn. While a great deal of attention
is being paid to new types of analysis for social media, in the next two or three years at
most we will reach a crossover point where the volume of data available from sensors
will exceed new social media postings, and sensor data volumes are likely to grow 10 or
20 times faster than social media sources.
We have been refining our use of structured data for the past 10 or 20 years. Opportunity lies
in understanding how adding unstructured and semi-structured data to the mix creates
competitive advantage. Here are just a few thought starters for your consideration:
Marketing and Sales Campaigns: Consumers now actively share their likes and dislikes
about companies, campaigns, and products through social media. Through text-
based sentiment analysis of social media messages companies are learning quickly what
pleases and displeases their customers and prospects.
Ecommerce: Web server logs and search engine summaries are being analyzed in detail to
discover how to make the customer’s journey through your web site easier for them and more
profitable for you.
Brick and Mortar Retail: Retailers using electronic, RFID, video, and infrared technologies
can now track customers as groups and as individuals through their physical stores to enhance
the shopping experience. Some grocery chains are now using video technology to count the
number of shoppers and predict the number of checkout lanes needed to keep wait times at
acceptable levels. Customer reward cards can gather even more information matching
customer detail to specific product purchases.
Supply Chain: Both the consumers and providers of global logistical services have combined
data sources from traditional internal ERP systems with semi-structured data from GPS
location trackers, EDI messages, RFID and bar scans of shipped and in-transit merchandise,
and even social media sources to speed goods along at lower cost.
Finance: All types of financial institutions including banks, credit card companies, and the
internal finance activities of companies are rapidly embracing new data types to reduce fraud,
reduce revenue leakage (under billing), and ensure compliance with the multitude of financial
laws and regulations.
Healthcare: The government’s initiative to require electronic health records is making new
and vast semi-structured data sources available to enhance treatment outcomes and contain
cost.
Business executives need to understand the new opportunities available in Big Data from
unstructured and semi-structured data, and how to blend these newly available data types into
their data-driven competitive strategies.
The term big data describes the large volume of data, both structured and unstructured, that
flows through the day-to-day business environment. What matters is not the amount of data
itself but what organizations do with the data.
Big data can be analyzed in depth for insights that lead to better decisions and strategic moves
for the development of the organization.
The Evolution of Big Data
While the term "big data" is relatively new, the act of gathering and storing huge
amounts of information for eventual analysis is ages old. The concept gained momentum in
the early 2000s when industry analyst Doug Laney articulated the definition of big data around
three dimensions, as follows:
Volume: Organizations collect data from a variety of sources, including business
transactions, social media and information from sensor or machine-to-machine data. In the past,
storage was a big issue, but the advancement of new technologies (such as Hadoop) has
reduced the burden.
Velocity: Data streams in at unprecedented speed and must be dealt with in a timely manner.
RFID tags, sensors and smart metering are driving the need to deal with torrents of data in
near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases
to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Vendors such as SAS also consider additional dimensions, such as variability and veracity, with
respect to big data.
What are the categories which come under Big Data?
Big data works on the data produced by various devices and their applications. Below are some
of the fields that come under the umbrella of Big Data.
Black Box Data: This is generated by the flight recorders of aircraft, which store a large amount
of information, including conversations between crew members and any other communications
(alert messages or orders passed) with the technical ground staff.
Social Media Data: Social networking sites such as Facebook and Twitter contain the
information and the views posted by millions of people across the globe.
Stock Exchange Data: It holds complete information about the 'buy' and 'sell' decisions made by
customers on the shares of different companies.
Power Grid Data: The power grid data mainly holds information about the power consumed by
a particular node with respect to a base station.
Transport Data: It includes data from various transport sectors, such as the model, capacity,
distance and availability of a vehicle.
Search Engine Data: Search engines retrieve large amounts of data from many different
databases.
Education
Big data helps educational institutions analyze and track student progress, and can support an
improved system for the evaluation and support of teachers and principals in their teaching.
Health Care
In health care, patient records, treatment plans, prescription information, etc. all need to be
handled quickly and accurately, and in some aspects with enough transparency to satisfy
stringent industry regulations. Effective management of big data helps uncover hidden insights
that improve patient care.
Manufacturing
Manufacturers can improve their quality and output while minimizing waste, and well-understood
processes are a key factor in today's highly competitive market. Several manufacturers
are working on analytics so that they can solve problems faster and make more agile business
decisions.
Retail
Maintaining customer relationships is the biggest challenge in the retail industry, and the best
way to manage it is to manage big data. Retailers must have unique marketing ideas to sell their
products to customers, the most effective ways to handle transactions, and improvised tactics
that use Big Data innovatively to improve their business.
Brief explanation of how exactly businesses are utilizing Big Data
Big Data is converted into nuggets of information, and it then becomes very
straightforward for most business enterprises to know what their customers want, which
products are moving fast, what end users expect from customer service, how to speed up
time to market, how to reduce costs, and how to build economies of scale in a highly
efficient manner. Hence Big Data leads to big benefits for organizations, and hence there
is a strong demand for it in the IT world.
Big Data Technologies
Accurate analysis based on big data helps increase and optimize operational efficiency,
enables cost reductions, and reduces risk for business operations.
In order to capitalize on big data, one requires infrastructure that can manage and
process huge volumes of structured and unstructured data in real time and can ensure data
privacy and security.
Many technologies are available in the market from different vendors, including
Amazon, IBM, Microsoft, etc., to handle big data. To pick a particular technology one
must examine the two classes of big data systems, which are as follows:
Operational Big Data
It includes systems such as MongoDB that provide operational capabilities for
interactive and real-time workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to capitalize on new cloud
computing architectures, allowing massive computations to be run inexpensively and
efficiently. This makes operational big data workloads much easier to manage, and
cheaper and faster to implement.
Analytical Big Data
It includes systems such as Massively Parallel Processing (MPP) database systems and
MapReduce, which provide analytical capabilities for retrospective and complex analysis.
MapReduce provides a new method of analyzing data that complements the capabilities
provided by SQL, and it can be scaled out from single servers to thousands of high-end
and low-end machines.
Barriers
Barriers that are imposed on big data are as follows:
Capture data
Storage Capacity
Searching
Sharing
Transfer
Analysis
Presentation
Enterprises use a variety of measures to overcome the barriers mentioned above.
Differentiation between Operational vs. Analytical Systems
Operational: interactive, real-time workloads (e.g., MongoDB and other NoSQL systems) where data is captured and stored as it arrives.
Analytical: retrospective, complex analysis over large portions of the data (e.g., MPP database systems and MapReduce).
In order to understand 'Big Data', we first need to know what 'data' is. The Oxford dictionary
defines 'data' as the quantities, characters, or symbols on which operations are performed by a
computer.
So, 'Big Data' is also data, but of huge size. 'Big Data' is a term used to describe a
collection of data that is huge in size and yet growing exponentially with time. In short, such
data is so large and complex that none of the traditional data management tools are able to store
it or process it efficiently.
Any data that can be stored, accessed and processed in the form of a fixed format is termed
'structured' data. Over a period of time, talent in computer science has achieved great
success in developing techniques for working with such data (where the format is well
known in advance) and also in deriving value out of it. However, nowadays we are foreseeing
issues when the size of such data grows to a huge extent, with typical sizes in the range of
multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.
Looking at these figures one can easily understand why the name 'Big Data' is given and
imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
'structured' data.
Examples Of Structured Data
An 'Employee' table in a database is an example of Structured Data
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, e.g., a table definition as in a relational
DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over years
Please note that web application data, which is unstructured, consists of log files, transaction
history files etc. OLTP systems are built to work with structured data wherein data is stored in
relations (tables).
Characteristics Of 'Big Data'
(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data
plays a very crucial role in determining the value of data. Also, whether particular data can
actually be considered Big Data or not depends upon the volume of data.
Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big
Data'.
(ii)Variety – The next aspect of 'Big Data' is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data
considered by most applications. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This
variety of unstructured data poses certain issues for storage, mining and analysing data.
(iii)Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, Mobile devices, etc. The
flow of data is massive and continuous.
(iv)Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Benefits of Big Data Processing
Ability to process 'Big Data' brings in multiple benefits, such as-
• Businesses can utilize outside intelligence while taking decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with 'Big
Data' technologies. In these new systems, Big Data and natural language processing
technologies are being used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
'Big Data' technologies can be used for creating a staging area or landing zone for new data
before identifying what data should be moved to the data warehouse. In addition, such
integration of 'Big Data' technologies and the data warehouse helps an organization offload
infrequently accessed data.
1. What do you know about the term “Big Data”?
Answer: Big Data is a term associated with complex and large datasets. A relational database
cannot handle big data, and that’s why special tools and methods are used to perform
operations on a vast collection of data. Big data enables companies to understand their business
better and helps them derive meaningful information from the unstructured and raw data
collected on a regular basis. Big data also allows the companies to take better business
decisions backed by data.
2. What are the five V’s of Big Data?
Answer: The five V's of Big Data are as follows:
Volume – Volume represents the volume i.e. amount of data that is growing at a high
rate i.e. data volume in Petabytes
Velocity – Velocity is the rate at which data grows. Social media contributes a major
role in the velocity of growing data.
Variety – Variety refers to the different data types i.e. various data formats like text,
audios, videos, etc.
Veracity – Veracity refers to the uncertainty of available data. Veracity arises due to
the high volume of data that brings incompleteness and inconsistency.
Value –Value refers to turning data into value. By turning accessed big data into
values, businesses may generate revenue.
Many organizations are implementing big data analytics. Some popular companies that use big
data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of
America, etc.
5. Explain the steps to be followed to deploy a Big Data solution.
Answer: Following are the three steps that are followed to deploy a Big Data solution –
1. Data Ingestion
The first step for deploying a big data solution is the data ingestion i.e. extraction of data from
various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning
System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds
etc. The data can be ingested either through batch jobs or real-time streaming. The extracted
data is then stored in HDFS.
Steps of Deploying a Big Data Solution
2. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in
HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access,
whereas HBase works well for random read/write access.
3. Data Processing
The final step in deploying a big data solution is the data processing. The data is processed
through one of the processing frameworks like Spark, MapReduce, Pig, etc.
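To make the processing step concrete, here is a minimal, hypothetical MapReduce driver sketch in Java; the input/output paths and job name are placeholders, and the identity Mapper and Reducer simply pass records through (a real deployment would plug in its own classes).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "process ingested data");
    job.setJarByClass(ProcessingDriver.class);
    job.setMapperClass(Mapper.class);      // identity mapper: passes records through unchanged
    job.setReducerClass(Reducer.class);    // identity reducer
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/ingested"));     // data landed by ingestion
    FileOutputFormat.setOutputPath(job, new Path("/data/processed"));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}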
6. Do you have any Big Data experience? If so, please share it with us.
How to Approach: There is no specific answer to the question as it is a subjective question
and the answer depends on your previous experience. Asking this question during a big data
interview, the interviewer wants to understand your previous experience and is also trying to
evaluate if you are fit for the project requirement.
So, how will you approach the question? If you have previous experience, start with your
duties in your past position and slowly add details to the conversation. Tell them about your
contributions that made the project successful. This question is generally the 2nd or 3rd question
asked in an interview. The later questions are based on this question, so answer it carefully.
You should also take care not to go overboard with a single aspect of your previous job. Keep
it simple and to the point.
7. Do you prefer good data or good models? Why?
How to Approach: This is a tricky question but generally asked in the big data interview. It
asks you to choose between good data or good models. As a candidate, you should try to
answer it from your experience. Many companies want to follow a strict process of evaluating
data, which means they have already selected data models. In this case, having good data can be
game-changing. The other way around also works, as a model is chosen based on good data.
As we already mentioned, answer it from your experience. However, don’t say that having both
good data and good models is important as it is hard to have both in real life projects.
8. Will you optimize algorithms or code to make them run faster?
How to Approach: The answer to this question should always be “Yes.” Real world
performance matters and it doesn’t depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in
code or algorithm optimization. For a beginner, it obviously depends on which projects he
worked on in the past. Experienced candidates can share their experience accordingly as well.
However, be honest about your work, and it is fine if you haven’t optimized code in the past.
Just let the interviewer know your real experience and you will be able to crack the big data
interview.
9. How do you approach data preparation?
How to Approach: Data preparation is one of the crucial steps in big data projects. A big data
interview may involve at least one question based on data preparation. When the interviewer
asks you this question, he wants to know what steps or precautions you take during data
preparation.
As you already know, data preparation is required to get necessary data which can then further
be used for modeling purposes. You should convey this message to the interviewer. You
should also emphasize the type of model you are going to use and reasons behind choosing that
particular model. Last, but not the least, you should also discuss important data preparation
terms such as transforming variables, outlier values, unstructured data, identifying gaps, and
others.
10. How would you transform unstructured data into structured data?
How to Approach: Unstructured data is very common in big data. The unstructured data
should be transformed into structured data to ensure proper data analysis. You can start
answering the question by briefly differentiating between the two. Once done, you can now
discuss the methods you use to transform one form to another. You might also share the real-
world situation where you did it. If you have recently graduated, then you can share
information related to your academic projects.
By answering this question correctly, you are signaling that you understand the types of data,
both structured and unstructured, and also have the practical experience to work with these. If
you give an answer to this question specifically, you will definitely be able to crack the big
data interview.
UNIT – II
Introduction to Hadoop
Apache Hadoop was born to enhance the usage of big data and solve its major issues. The web
was generating loads of information on a daily basis, and it was becoming very difficult
to manage the data of around one billion pages of content. To address this, Google invented a
new methodology of processing data, popularly known as MapReduce. A year later Google
published a white paper on the MapReduce framework; Doug Cutting and
Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to
an open-source software framework which supported the Nutch search engine project.
Considering the original case study, Hadoop was designed with much simpler storage
infrastructure facilities.
Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest
strength is scalability: it upgrades from working on a single node to thousands of nodes
seamlessly, without any issue.
Big Data spans many domains; we are able to manage data from videos,
text, transactional data, sensor information, statistical data, social media conversations,
search engine queries, ecommerce data, financial information, weather data, news updates,
forum discussions, executive reports, and so on.
Doug Cutting and his team developed an open-source project known as HADOOP,
which allows you to handle very large amounts of data. Hadoop runs
applications on the basis of MapReduce, where the data is processed in parallel to
accomplish the entire statistical analysis on large amounts of data.
The Apache Hadoop Module
Hadoop Common: Includes the common utilities which support the other Hadoop modules.
HDFS: The Hadoop Distributed File System provides high-throughput access to
application data.
Hadoop YARN: This technology is used for job scheduling and efficient
management of cluster resources.
MapReduce: This is a highly efficient methodology for parallel processing of huge volumes of
data.
Then there are other projects included in the Hadoop module which are less used:
Apache Ambari: It is a tool for managing, monitoring and provisioning of the Hadoop
clusters. Apache Ambari supports the HDFS and MapReduce programs. Major highlights of
Ambari are:
Management of the Hadoop framework is highly efficient, secure and consistent.
Management of cluster operations with an intuitive web UI and a robust API.
The installation and configuration of a Hadoop cluster are simplified effectively.
It supports automation, smart configuration and recommendations.
An advanced cluster security set-up comes as an additional feature of this tool kit.
The entire cluster can be monitored and controlled using metrics, heat maps, analysis and
troubleshooting tools.
Increased levels of customization and extension make this more valuable.
Cassandra: It is a distributed system to handle extremely huge amounts of data stored
across several commodity servers. The database management system (DBMS) is highly
available with no single point of failure.
HBase: it is a non-relational, distributed database management system that works efficiently on
sparse data sets and it is highly scalable.
Apache Spark: This is a highly agile, scalable and secure Big Data compute engine,
versatile enough to work on a wide variety of applications like real-time processing,
machine learning, ETL and so on.
Hive: It is a data warehouse tool used for analyzing, querying and summarizing
data on top of the Hadoop framework.
Pig: Pig is a high-level framework which lets us work in coordination with either
Apache Spark or MapReduce to analyze the data. The language used to code for these
frameworks is known as Pig Latin.
Sqoop: This framework is used for transferring the data to Hadoop from relational databases.
This application is based on a command-line interface.
Oozie: This is a scheduling system for workflow management, executing workflow routes for
successful completion of tasks in Hadoop.
Zookeeper: Open source centralized service which is used to provide coordination between
distributed applications of Hadoop. It offers the registry and synchronization service on a high
level.
Hadoop MapReduce (Processing/Computation layer) – MapReduce is a parallel
programming model, devised at Google, used for writing distributed applications that
efficiently process large amounts of data on large clusters.
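As an illustration of this model, here is a minimal word-count sketch in Java using the Hadoop MapReduce API; the class names are illustrative and not taken from the notes above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits (word, 1) for every word in an input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}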
Hadoop HDFS (Storage layer) – The Hadoop Distributed File System, or HDFS, is based on the
Google File System (GFS) and provides a distributed file system that is especially
designed to run on commodity hardware. It tolerates faults and errors and helps incorporate
low-cost hardware. It gives high-throughput access to application data and is
suitable for applications with large datasets.
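For illustration, the sketch below reads a file from HDFS through Hadoop's Java FileSystem API; the path is a placeholder reused from the example paths later in these notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // handle to the configured file system
    Path path = new Path("/user/saurzcode/dir1/abc.txt");   // placeholder path
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);               // print each line of the HDFS file
      }
    }
  }
}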
Hadoop YARN –Hadoop YARN is a framework used for job scheduling and cluster
resource management.
Hadoop Common – This includes the Java libraries and utilities which provide the Java files
essential to start Hadoop.
Task Tracker – It is a node which accepts tasks such as map, reduce and shuffle
from the Job Tracker.
Job Tracker – It is a service which runs MapReduce jobs on the cluster.
Name Node – It is the node where Hadoop stores all file location information (where data is
stored) in the Hadoop distributed file system.
Data Node – It stores data in the Hadoop distributed file system.
The Intended Audience and Prerequisites
Big Data and analytics are among the most interesting domains in which to build your profile in
the IT world, and there is wide scope for Big Data and Hadoop professionals. This is intended
for individuals who are awed by the sheer might of Big Data, which increasingly commands
attention in corporate boardrooms, and who are keen to take up a career in Big Data and Hadoop.
It suits anyone who aspires to become a Big Data and Hadoop Developer, Administrator,
Architect, Analyst, Scientist or Tester, or who holds a corporate designation such as Chief
Technology Officer, Chief Information Officer, or Technical Manager of an enterprise.
Apache Hive and Pig are high-level tools, so there is no compulsory need for Java or Linux.
Hadoop also allows creating your own MapReduce programs in programming languages like
Ruby, Python, Perl and even C. Hence the only requirement is an understanding of computer
programming logic and deduction; the rest is an add-on and can be easily picked up in a short
duration of time.
How does Hadoop Work?
Hadoop executes large amounts of processing by letting the user connect multiple
commodity computers together as a single functional distributed system; the
clustered machines read the dataset in parallel, produce intermediate results, and after
integration give the desired output.
Hadoop runs code across a cluster of computers and performs the following tasks:
Data are initially divided into files and directories. Files are divided into uniformly sized
blocks (typically 64 MB or 128 MB).
Then the files are distributed across various cluster nodes for further processing of data.
The Job Tracker starts its scheduling programs on individual nodes.
Once all the nodes are done with processing, the output is returned.
The Ultimate Goal
Apache Hadoop framework
Hadoop Distributed File System
Visualizing data using MS Excel, Zoomdata or Zeppelin
Apache MapReduce program
Apache Spark ecosystem
Ambari administration management
Deploying Apache Hive and Pig, and Sqoop
Knowledge of the Hadoop 2.x Architecture
Data analytics based on Hadoop YARN
Deployment of MapReduce and HBase integration
Setup of Hadoop Cluster
Proficiency in Development of Hadoop
Working with Spark RDD
Job scheduling using Oozie
The above roadmap guides you to become a Big Data and Hadoop professional,
ensuring enough skills to work in an industrial environment, solve real-world problems and
arrive at solutions for better progress.
The Challenges facing Data at Scale and the Scope of Hadoop
Big Data is categorized into:
Structured – stores the data in rows and columns, like relational data sets
Unstructured – data that cannot be stored in rows and columns, like video, images, etc.
Semi-structured – data in formats such as XML that are readable by both machines and humans
There is a standardized methodology that Big Data follows, highlighting the usage of ETL.
ETL stands for Extract, Transform, and Load.
Extract – fetching the data from multiple sources
Transform – converting the existing data to fit the analytical needs
Load – loading the transformed data into the right systems to derive value from it.
File permissions and authentication are provided.
Replication is used to handle disk failures. Each block of a file is stored on several nodes
inside the cluster, and the HDFS NameNode continuously monitors the reports sent by every
DataNode to ensure that no block has gone below the desired replication factor due to failures.
If this happens, it schedules the addition of another copy within the cluster.
Why does HDFS work very well with Big Data?
HDFS supports the MapReduce model, in which computation moves to the data, making access
to data very fast.
It follows a data coherency model that is simple to implement yet highly robust and scalable.
It is compatible with any kind of commodity hardware and operating system.
Economy is achieved by distributing data and processing on clusters of parallel nodes.
Data is always safe, as it is automatically replicated in multiple locations.
It provides Java APIs as well as a C language wrapper.
It is easily accessible using a web browser, making it highly utilitarian.
HDFS Architecture
It mainly uses a master-slave architecture and contains the following elements:
NameNode and DataNode
The NameNode runs on commodity hardware that contains the GNU/Linux operating system,
its libraries and the NameNode software. The system containing the NameNode acts as the
master server and carries out the following tasks:
Manages the file system namespace.
Provides clients access to files.
Executes file system operations such as renaming, opening and closing files and directories.
There are a number of DataNodes, usually one per node in the cluster, which manage the
storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Basically a file is split into one or more blocks, and these blocks are stored in a set of
DataNodes.
The NameNode executes file system namespace operations such as opening, closing, and
renaming files and directories, and determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system's clients.
The DataNodes also perform functions such as block creation, deletion, and replication upon
instruction from the NameNode.
1. HDFS is the file system of Hadoop.
2. MR (MapReduce) is the job which runs on the file system.
3. An MR job lets the user ask questions of the files stored in HDFS.
4. Pig and Hive are two projects built so that you can avoid coding MapReduce directly.
5. The Pig and Hive interpreters convert the scripts and SQL queries into MR jobs.
6. For querying data on HDFS without writing MapReduce code, the main options are Impala
and Hive.
7. Impala is optimized for low-latency queries, i.e. real-time and interactive applications.
8. Hive is optimized for batch-processing jobs.
9. Sqoop: can put data from a relational DB into the Hadoop ecosystem.
10. Flume sends data generated by external systems towards HDFS, adapted for high-volume
logging.
11. Hue: a graphical front end to the cluster.
12. Oozie: a workflow management tool.
13. Mahout: a machine learning library.
14. When a 150 MB file is stored, the Hadoop ecosystem breaks it into multiple parts to achieve
parallelism.
15. It breaks the file into smaller units, where the default unit (block) size is 64 MB.
16. The DataNode is the daemon which takes care of everything happening on an individual node.
17. The NameNode keeps track of where each piece of data is stored, when and where it is
required, and how to collect the pieces back together.
Typically, one machine in the cluster is designated as the Name Node and another is associated
with the Resource Manager; these are the masters. The other services, like the MapReduce Job
History server and the Web App Proxy Server, are usually hosted on specific machines or even
on shared resources, loaded as per the requirement of the task. The rest of the nodes in the
cluster have a dual nature of both Node Manager and Data Node. These are collectively termed
the slave nodes.
Hadoop in non-secure mode
The Java configuration of Hadoop has two types of important files:
Read-only default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and
mapred-default.xml.
Site-specific configuration files: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml,
etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
It is also possible to control the Hadoop scripts in the bin/ directory of the distribution by setting
site-specific values via the etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh files.
For Hadoop cluster configuration you first create the environment in which the Hadoop
daemons execute, and then set the configuration parameters for the daemons.
The HDFS daemons are the NameNode, the Secondary NameNode and the DataNode. The
YARN daemons are the Resource Manager, the Node Manager and the WebApp Proxy.
The Hadoop Daemons configuration environment
To give the Hadoop daemons site-specific customization, administrators should use the
etc/hadoop/hadoop-env.sh and, optionally, the etc/hadoop/mapred-env.sh and
etc/hadoop/yarn-env.sh scripts. At the very least, JAVA_HOME should be specified correctly
on every remote node.
Configuration of the individual daemons
The list of Daemons with their relevant environment variable
NameNode –HADOOP_NAMENODE_OPTS
DataNode – HADOOP_DATANODE_OPTS
Secondary NameNode – HADOOP_SECONDARYNAMENODE_OPTS
Resource Manager – YARN_RESOURCEMANAGER_OPTS
Node Manager – YARN_NODEMANAGER_OPTS
WebAppProxy – YARN_PROXYSERVER_OPTS
Map Reduce Job History Server – HADOOP_JOB_HISTORYSERVER_OPTS
Other related important Customization configuration parameters:
HADOOP_PID_DIR – the process ID files of the daemons is contained in this directory.
HADOOP_LOG_DIR – the log files of the daemons are stored in this directory.
HADOOP_HEAPSIZE / YARN_HEAPSIZE – the heap size, measured in MB; if the
variable is set to 1000, the heap is set to 1000 MB. By default it is set to 1000.
The HDFS Shell Commands
The important operations of the Hadoop Distributed File System can be carried out with the
shell commands below, used for file management in the cluster.
1. Directory creation in HDFS for a given path.
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
2. Listing of the directory contents.
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode
3. HDFS file upload/download.
Upload:
hadoop fs -put:
Copies a single source file, or multiple source files, from the local file system to the Hadoop
file system.
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download:
hadoop fs -get:
Copies or downloads files to the local file system.
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
4. Viewing of file content
Same as the Unix cat command:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt
5. File copying from source to destination
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
6. Copying a file to HDFS from the local file system and vice versa
Copy from the local host:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Copy to the local host:
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
7. File moving from source to destination.
But remember, you cannot move files across file systems.
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
8. File or directory removal in HDFS.
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
Recursive version of delete:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
9. Showing the file's final few lines.
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
10. Showing the aggregate length of a file.
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
2. What are real-time industry applications of Hadoop?
Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and
distributed computing on large volumes of data. It provides rapid, high-performance and cost-
effective analysis of structured and unstructured data generated on digital platforms and within
the enterprise. It is used in almost all departments and sectors today. Some of the instances
where Hadoop is used:
Managing traffic on streets.
Streaming processing.
Content Management and Archiving Emails.
Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
Fraud detection and Prevention.
Advertisements Targeting Platforms are using Hadoop to capture and analyze click
stream, transaction, video and social media data.
Managing content, posts, images and videos on social media platforms.
Analyzing customer data in real-time for improving business performance.
Public sector fields such as intelligence, defense, cyber security and scientific research.
Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns,
identify rogue traders, more precisely target their marketing campaigns based on
customer segmentation, and improve customer satisfaction.
Getting access to unstructured data like output from medical devices, doctor’s notes,
lab results, imaging reports, medical correspondence, clinical data, and financial data.
3. How is Hadoop different from other parallel computing systems?
Hadoop is a distributed file system which lets you store and handle massive amounts of data on
a cloud of machines, handling data redundancy.
The primary benefit is that since data is stored on several nodes, it is better to process it in a
distributed manner. Each node can process the data stored on it instead of spending time moving
it over the network.
On the contrary, in a relational database computing system, you can query data in real time, but
it is not efficient to store data in tables, records and columns when the data is huge.
Hadoop also provides a scheme to build a column database with Hadoop HBase for runtime
queries on rows.
4. What all modes Hadoop can be run in?
Hadoop can run in three modes:
Standalone Mode: The default mode of Hadoop, it uses the local file system for input and
output operations. This mode is mainly used for debugging purposes, and it does not
support the use of HDFS. Further, in this mode, there is no custom configuration
required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. It is much faster
when compared to other modes.
Pseudo-Distributed Mode (Single Node Cluster): In this case, you need
configuration for all the three files mentioned above. In this case, all daemons are
running on one node and thus, both Master and Slave node are the same.
Fully Distributed Mode (Multiple Cluster Node): This is the production phase of
Hadoop (what Hadoop is known for) where data is used and distributed across several
nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slave.
5. Explain the major difference between HDFS block and InputSplit.
In simple terms, a block is the physical representation of data while a split is the logical
representation of the data present in a block. The split acts as an intermediary between the
block and the mapper.
Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
Now, considering the map, it will read the first block from ii till ll, but it does not know how to
process the second block at the same time. Here Split comes into play: it forms a
logical group of Block 1 and Block 2 as a single block.
It then forms a key-value pair using the input format and record reader and sends the map for
further processing. With InputSplit, if you have limited resources, you can increase the split size
to limit the number of maps. For instance, if there are 10 blocks totalling 640 MB (64 MB each)
and there are limited resources, you can set the 'split size' to 128 MB. This will form a logical
group of 128 MB, with only 5 maps executing at a time.
However, if the file is not splittable (i.e. the input format's isSplitable() check returns false), the
whole file will form one InputSplit and be processed by a single map, consuming more time
when the file is big.
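As a hedged illustration of controlling the split size from a driver, the sketch below uses the FileInputFormat helper methods of the mapreduce API; the 128 MB figure matches the example above and the job name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static Job createJob() throws Exception {
    Job job = Job.getInstance(new Configuration(), "larger splits, fewer maps");
    long splitSize = 128L * 1024 * 1024;                  // 128 MB
    FileInputFormat.setMinInputSplitSize(job, splitSize); // lower bound on split size
    FileInputFormat.setMaxInputSplitSize(job, splitSize); // upper bound on split size
    return job;
  }
}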
6. What is distributed cache and what are its benefits?
Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files
when needed. Once a file is cached for a specific job,
Hadoop will make it available on each data node, both on disk and in memory, where map and
reduce tasks are executing. Later, you can easily access and read the cache file and populate any
collection (like an array or hashmap) in your code.
Benefits of using distributed cache are:
It distributes simple, read only text/data files and/or complex types like jars, archives
and others. These archives are then un-archived at the slave node.
Distributed cache tracks the modification timestamps of cache files, which ensures that
the cached files are not modified while a job is executing.
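A minimal sketch of this usage with the Job/context API is shown below; the lookup file path and the "#lookup" symlink name are assumptions made for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
      // The cached file is symlinked into the task's working directory as "lookup".
      try (BufferedReader r = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = r.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) lookup.put(parts[0], parts[1]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Enrich each record with a value from the cached lookup table.
      String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
      context.write(value, new Text(enriched));
    }
  }

  public static void addCache(Job job) throws Exception {
    // "#lookup" creates a symlink named "lookup" on every node running the task.
    job.addCacheFile(new URI("/user/hadoop/lookup.txt#lookup"));
  }
}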
7. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
NameNode is the core of HDFS that manages the metadata – the information of what
file maps to what block locations and what blocks are stored on what datanode. In
simple terms, it’s the data about the data being stored. NameNode supports a directory
tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It
uses following files for namespace:
fsimage file- It keeps track of the latest checkpoint of the namespace.
edits file-It is a log of changes that have been made to the namespace since checkpoint.
Checkpoint NameNode has the same directory structure as the NameNode, and creates
checkpoints for the namespace at regular intervals by downloading the fsimage and edits
files and merging them within its local directory. The new image after merging is
then uploaded to the NameNode.
There is a similar node, commonly known as the Secondary NameNode, but it
does not support the 'upload to NameNode' functionality.
Backup Node provides similar functionality as the Checkpoint node, enforcing synchronization
with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace
and does not need to download the changes at regular intervals. The backup node only needs to
save its current in-memory state to an image file to create a new checkpoint.
8. What are the most common Input Formats in Hadoop?
There are three most common input formats in Hadoop:
Text Input Format: Default input format in Hadoop.
Key Value Input Format: used for plain text files where the files are broken into lines
Sequence File Input Format: used for reading files in sequence
9. Define DataNode and how does NameNode tackle DataNode failures?
DataNode stores data in HDFS; it is the node where the actual data resides in the file system.
Each DataNode sends a heartbeat message to notify that it is alive. If the NameNode does not
receive a message from a DataNode for 10 minutes, it considers it to be dead or out of service,
and starts replication of the blocks that were hosted on that DataNode so that they are hosted on
some other DataNode. A BlockReport contains the list of all blocks on a DataNode. The system
then starts to replicate the blocks that were stored on the dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. In this
process, the replicated data transfers directly between DataNodes such that the data never
passes through the NameNode.
10. What are the core methods of a Reducer?
The three core methods of a Reducer are:
1. setup(): this method is used for configuring various parameters like input data size and
distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(Key key, Iterable<Value> values, Context context)
3. cleanup(): this method is called to clean up temporary files, only once at the end of the
task.
protected void cleanup(Context context)
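Putting the three methods together, here is an illustrative Reducer that sums integer counts per key; the bookkeeping in setup() and cleanup() is only an example, not a required pattern.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private long keysSeen;                         // simple per-task counter

  @Override
  protected void setup(Context context) {
    keysSeen = 0;                                // one-time initialization per reduce task
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get(); // called once per key
    context.write(key, new IntWritable(sum));
    keysSeen++;
  }

  @Override
  protected void cleanup(Context context) {
    System.out.println("Distinct keys reduced: " + keysSeen);  // one-time teardown
  }
}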
11. What is SequenceFile in Hadoop?
Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary
key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader,
Writer and Sorter classes. The three SequenceFile formats are:
1. Uncompressed key/value records.
2. Record compressed key/value records – only ‘values’ are compressed here.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’
separately and compressed. The size of the ‘block’ is configurable.
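A small sketch of writing a block-compressed SequenceFile with the Writer class follows; the output path and record contents are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/example.seq");   // placeholder output path
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            // block compression: keys and values are gathered in blocks and compressed
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      for (int i = 0; i < 5; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    }
  }
}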
12. What is Job Tracker role in Hadoop?
Job Tracker's primary functions are resource management (managing the Task Trackers),
tracking resource availability, and task life cycle management (tracking task progress and fault
tolerance).
It is a process that runs on a separate node, often not on a DataNode.
The Job Tracker communicates with the NameNode to identify data locations.
It finds the best Task Tracker nodes to execute tasks on given nodes.
It monitors individual Task Trackers and submits the overall job back to the client.
It tracks the execution of MapReduce workloads.
13. What is the use of RecordReader in Hadoop?
Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a
single record. For instance, if our input data is split like:
Row1: Welcome to
Row2: Intellipaat
It will be read as “Welcome to Intellipaat” using RecordReader.
14. What is Speculative Execution in Hadoop?
One limitation of Hadoop is that, by distributing the tasks across several nodes, a few slow nodes
can limit the rest of the program. There are various reasons for tasks to be slow, which are
sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop
tries to detect when a task runs slower than expected and then launches an equivalent task as a
backup. This backup mechanism in Hadoop is Speculative Execution.
It creates a duplicate task on another node, so the same input can be processed multiple times in
parallel. When most tasks in a job come to completion, the speculative execution mechanism
schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently
free. When these tasks finish, the JobTracker is informed. If the other copies were executing
speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.
Speculative execution is by default true in Hadoop. To disable, set
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
JobConf options to false.
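In Hadoop 2.x the corresponding properties are mapreduce.map.speculative and mapreduce.reduce.speculative; a hedged driver fragment setting them (job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculation {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);     // disable for map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false);  // disable for reduce tasks
    return Job.getInstance(conf, "job without speculative execution");
  }
}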
15. What happens if you try to run a Hadoop job with an output directory that is already
present?
It will throw an exception saying that the output directory already exists.
To run a MapReduce job, you need to ensure that the output directory does not already exist
in HDFS.
To delete the directory before running the job, you can use the shell: hadoop fs -rmr
/path/to/your/output/ ; or the Java API: FileSystem.get(conf).delete(outputDir, true);
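A small Java sketch of checking for and deleting the output directory before submitting a job (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
  public static void deleteIfExists(Configuration conf, String dir) throws Exception {
    Path outputDir = new Path(dir);
    FileSystem fs = FileSystem.get(conf);   // file system named in the configuration
    if (fs.exists(outputDir)) {
      fs.delete(outputDir, true);           // 'true' deletes recursively
    }
  }

  public static void main(String[] args) throws Exception {
    deleteIfExists(new Configuration(), "/path/to/your/output");
  }
}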
16. How can you debug Hadoop code?
First, check the list of MapReduce jobs currently running. Next, check that there are no
orphaned jobs running; if there are, you need to determine the location of the RM logs.
1. Run: ps -ef | grep -i ResourceManager
and look for the log directory in the displayed result. Find the job-id from the displayed
list and check if there is any error message associated with that job.
2. On the basis of the RM logs, identify the worker node that was involved in execution of
the task.
3. Now, log in to that node and run: ps -ef | grep -i NodeManager
4. Examine the Node Manager log. The majority of errors come from the user-level logs for
each map-reduce job.
17. How to configure Replication Factor in HDFS?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in
hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Conversely, you can also change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir
18. How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
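In a driver, these settings would typically be applied to the Configuration before the Job is created; a minimal sketch (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCompression {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);               // compress mapper output
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", false); // leave job output uncompressed
    return Job.getInstance(conf, "compress map output only");
  }
}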
19. What is the difference between Map Side join and Reduce Side Join?
A map-side join is performed when the data reaches the map; it requires a strict structure for
defining the join. On the other hand, a reduce-side join (repartitioned join) is simpler than a
map-side join since the input datasets need not be structured. However, it is less efficient as it
has to go through the sort and shuffle phases, which come with network overheads.
20. How can you transfer data from Hive to HDFS?
By writing the query:
hive> insert overwrite directory '/' select * from emp;
You can write your query for the data you want to export from Hive to HDFS. The output you
receive will be stored in part files in the specified HDFS path.
21. What companies use Hadoop, any idea?
Yahoo! (one of the biggest contributors to the creation of Hadoop) uses Hadoop in its search engine;
Facebook developed Hive for analysis on top of it; other well-known users include Amazon, Netflix,
Adobe, eBay, Spotify, and Twitter.
12. What are the common input formats in Hadoop?
Answer: Below are the common input formats in Hadoop –
Text Input Format – The default input format defined in Hadoop is the Text Input
Format.
Sequence File Input Format – To read files in a sequence, Sequence File Input
Format is used.
Key Value Input Format – The Key Value Input Format is used for plain text files in which each
line is split into a key and a value by a separator (a tab by default).
13. Explain some important features of Hadoop.
Answer: Hadoop supports the storage and processing of big data. It is the best solution for
handling big data challenges. Some important features of Hadoop are –
Open Source – Hadoop is an open source framework which means it is available free
of cost. Also, the users are allowed to change the source code as per their
requirements.
Distributed Processing – Hadoop supports distributed processing of data i.e. faster
processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block at different nodes, by default. This number can be changed according to the
requirement. So, we can recover the data from another node if one node fails. The
detection of node failure and recovery of data is done automatically.
Reliability – Hadoop stores data on the cluster in a reliable manner that is independent
of machine. So, the data stored in Hadoop environment is not affected by the failure
of the machine.
Scalability – Another important feature of Hadoop is scalability. It is compatible with
commodity hardware, and we can easily add new hardware (nodes) to the cluster.
High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from another
path.
14. Explain the different modes in which Hadoop runs.
Answer: Apache Hadoop runs in the following three modes –
Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-
distributed, single node. This mode uses the local file system to perform input and
output operation. This mode does not support the use of HDFS, so it is used for
debugging. No custom configuration is needed for configuration files in this mode.
Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single
node just like the Standalone mode. In this mode, each daemon runs in a separate
Java process. As all the daemons run on a single node, the same node acts as both
master and slave.
Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on
separate individual nodes and thus form a multi-node cluster. Master and slave
daemons run on different nodes.
15. Explain the core components of Hadoop.
Answer: Hadoop is an open source framework that is meant for storage and processing of big
data in a distributed manner. The core components of Hadoop are –
HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of
Hadoop. The large data files running on a cluster of commodity hardware are stored
in HDFS. It can store data in a reliable manner even when hardware fails.
Core Components of Hadoop
Hadoop MapReduce – MapReduce is the Hadoop layer that is responsible for data
processing. Applications written against MapReduce process the unstructured and structured
data stored in HDFS. MapReduce is responsible for the parallel processing of high volumes of
data by dividing the work into independent tasks. The processing is done in two phases, Map and
Reduce: Map is the first phase, where the bulk of the processing logic is specified, and Reduce is
the second phase, which performs lighter-weight aggregation and summarization.
YARN – YARN is the processing framework in Hadoop. It handles resource
management and allows multiple data processing engines, such as real-time streaming and
batch processing, to run on the same cluster.
16. What are the different configuration files in Hadoop?
Answer: The different configuration files in Hadoop are –
core-site.xml – This configuration file contains the Hadoop core configuration settings, for
example, I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.
mapred-site.xml – This configuration file specifies a framework name for MapReduce by
setting mapreduce.framework.name
hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also
specifies default block permission and replication checking on HDFS.
yarn-site.xml – This configuration file specifies configuration settings for ResourceManager
and NodeManager.
17. What are the differences between Hadoop 2 and Hadoop 3?
Answer: The main differences between Hadoop 2 and Hadoop 3 are: Hadoop 2 requires Java 7 or
later while Hadoop 3 requires Java 8; Hadoop 2 handles fault tolerance only through replication
(high storage overhead) while Hadoop 3 also supports erasure coding; and Hadoop 2 allows only a
single standby NameNode while Hadoop 3 supports multiple standby NameNodes.
18. How can you achieve security in Hadoop?
Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access a service
while using Kerberos. Each step involves a message exchange with a server.
1. Authentication – The first step involves authentication of the client by the
authentication server, which then provides a time-stamped TGT (Ticket-Granting
Ticket) to the client.
2. Authorization – In this step, the client uses the received TGT to request a service ticket
from the TGS (Ticket-Granting Server).
3. Service Request – This is the final step: the client uses the service ticket to
authenticate itself to the server.
19. What is commodity hardware?
Answer: Commodity hardware is low-cost, readily available hardware that is not of particularly
high quality or availability. Commodity hardware still needs adequate RAM, because the daemons
running on it perform a number of services that require RAM for their execution. One doesn't
require high-end hardware or supercomputers to run Hadoop; it can be run on any commodity hardware.
20. How is NFS different from HDFS?
Answer: There are a number of distributed file systems that work in their own way. NFS
(Network File System) is one of the oldest and most popular distributed file storage systems, whereas
HDFS (Hadoop Distributed File System) is the more recent one designed to handle big data.
The main differences are that NFS stores data on a single dedicated machine, with no built-in
replication or fault tolerance, and is suited to relatively small data sets, whereas HDFS distributes
data in blocks across a cluster of commodity machines, replicates each block, and is designed for
very large data sets.
Hadoop Developer Interview Questions for Experienced
The interviewer has more expectations from an experienced Hadoop developer, and thus the
questions are one level up. So, if you have gained some experience, don't forget to cover
command-based, scenario-based and real-experience-based questions. Here we bring some sample
interview questions for experienced Hadoop developers.
21. How to restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to stop all the daemons first. The Hadoop
directory contains an sbin directory that stores the script files to stop and start daemons in
Hadoop.
Use the command /sbin/stop-all.sh to stop all the daemons and then use the /sbin/start-all.sh
command to start all the daemons again.
22. What is the use of jps command in Hadoop?
Answer: The jps command is used to check if the Hadoop daemons are running properly or
not. This command shows all the daemons running on a machine i.e. Datanode, Namenode,
NodeManager, ResourceManager etc.
23. Explain the process that overwrites the replication factors in HDFS.
Answer: There are two methods to overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The
command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, test_file is the file whose replication factor will be set to 2.
Method 2: On Directory Basis
In this method, the replication factor is changed on directory basis i.e. the replication factor for
all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the name of the directory; the replication factor for the directory and all the
files in it will be set to 5.
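If the same change needs to be made programmatically, the HDFS Java API exposes FileSystem.setReplication; a minimal sketch follows (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Change the replication factor of a single (hypothetical) file to 2.
        fs.setReplication(new Path("/my/test_file"), (short) 2);
    }
}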
24. What will happen with a NameNode that doesn’t have any data?
Answer: A NameNode without any data does not exist in Hadoop. If a NameNode exists, it will
contain metadata about the files stored in HDFS; the actual data blocks reside on the DataNodes.
25. Explain NameNode recovery process.
Answer: The NameNode recovery process involves the following steps to keep the
Hadoop cluster running:
In the first step, a new NameNode is started using the file system metadata replica (FsImage).
The next step is to configure the DataNodes and clients so that they acknowledge the new
NameNode.
In the final step, the new NameNode starts serving clients once it has finished loading the last
checkpoint FsImage and has received block reports from the DataNodes.
UNIT – III
MapReduceProgramming and Yarn
MapReduce is mainly the data processing component of Hadoop. It is a programming model for
processing large data sets. It breaks a data processing job into smaller tasks and distributes those
tasks across the nodes. It consists of two phases:
Map
Reduce
Map converts an input dataset into another set of data in which individual elements are broken down
into key/value pairs.
The Reduce task takes the output of the map as its input and combines those data tuples into a
smaller set of tuples. It is always executed after the map phase is done.
Features of Mapreduce system
Features of MapReduce are as follows:
A framework is provided for MapReduce execution.
It abstracts the developer from the complexity of distributed programming.
Partial failure of the processing cluster is expected and tolerated.
In-built redundancy and fault tolerance are available.
The MapReduce programming model is language independent.
Automatic parallelization and distribution are handled by the framework.
Fault tolerance.
Enables data-local processing.
Shared-nothing architectural model.
Manages all inter-process communication.
Manages the distributed servers running the various tasks in parallel.
Manages all communications and data transfers between the various parts of the system.
Provides redundancy and failure handling for the whole process.
MapReduce follows these simple steps:
1. Executes the map function on each input record received.
2. The map function emits key/value pairs.
3. The outputs are shuffled, sorted and grouped by key.
4. Executes the reduce function on each group.
5. Emits the output results on a per-group basis.
Map Function
Mainly operates on each key/value pair of data and then transforms the data based on the
transformation logic provided in the map function. Map function always produces a key/value
pair as output result.
Map (key1, value1) ->List (key2, value2)
Reduce Function
It takes the list of values for each key and transforms the data based on the (aggregation) logic
provided in the reduce function.
Reduce (key2, List (value2)) ->List (key3, value3)
Map Function for Word Count
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); }
}
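For completeness, a matching reduce function for the word count, written against the same org.apache.hadoop.mapreduce API, would look roughly like the sketch below (not part of the original notes):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // emits (word, total count)
    }
}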
MapReduce is the framework used for processing large amounts of data on commodity
hardware across a cluster. It is a powerful way of processing data when there is a large
number of nodes connected in the cluster. The two important tasks of the MapReduce
model are Map and Reduce.
The purpose of the Map task is to take a large set of data and convert it into another set of
data, broken down into tuples (rows) or key/value pairs. The Reduce task then takes the output of
the Map task as its input and combines those tuples into a much smaller set of tuples. The Reduce
task always follows the Map task.
The biggest strength of the MapReduce framework is its scalability. Once a MapReduce
program is written, it can easily be scaled to run over a cluster with hundreds
or even thousands of nodes. In this framework, the computation is sent to where
the data resides.
Hadoop Map Reduce – Key Features & Highlights
Terminology
PayLoad – The applications that implement the Map and Reduce functions.
Mapper – Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode – The node that manages HDFS.
DataNode – The node where the data resides before any processing takes place.
MasterNode – The node where the JobTracker runs and which receives job requests from
clients.
SlaveNode – The node where the Map and Reduce programs run.
JobTracker – Schedules jobs, assigns them to TaskTrackers and tracks their progress.
TaskTracker – Runs the tasks and reports their status to the JobTracker.
Job – An execution of a Mapper and Reducer across a dataset.
Task – An execution of a Mapper or a Reducer on a slice of data.
Task Attempt – A particular attempt to execute a task on a SlaveNode.
Hadoop YARN Technology
YARN stands for Yet Another Resource Negotiator. It is an open-source cluster management
technology for distributed processing frameworks. The main objective of YARN is to provide a
framework on Hadoop that allows cluster resources to be allocated to arbitrary applications, with
MapReduce treated as just one of these applications.
It separates the distinct responsibilities of the JobTracker into separate entities. The JobTracker used
to handle both job scheduling (matching tasks with TaskTrackers) and task progress monitoring
(taking care of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining
counter totals).
YARN divides these two roles into two independent daemons: a resource manager, which manages
the use of resources across the cluster, and an application master, which manages the lifecycle of an
application running on the cluster.
The application master negotiates with the resource manager for cluster resources, expressed in
terms of a number of containers, each with a certain memory limit, and then runs
application-specific processes in those containers.
The containers are overseen by node managers running on the cluster nodes, which ensure that an
application does not use more resources than it has been allocated.
YARN is a very efficient technology for managing a Hadoop cluster. It is part of Hadoop 2
under the aegis of the Apache Software Foundation.
YARN introduced a completely new way of processing data and is now rightly at the center of the
Hadoop architecture. Using this technology it is possible to stream data in real time, use
interactive SQL, process data with multiple engines, and manage data using batch processing,
all on a single platform.
Map Reduce on YARN
MapReduce on YARN includes more entities than classic MapReduce. They are:
Client – Submits the MapReduce job.
YARN resource manager – Manages the allocation of compute resources on the cluster.
YARN node managers – Launch and monitor the compute containers on machines in the
cluster.
MapReduce application master – Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.
Distributed file system (normally HDFS) – Used for sharing job files between the other
entities.
How the YARN technology works?
YARN lets Hadoop provide enterprise-level solutions, helping organizations
achieve better resource management. It is the main platform for delivering consistent operations,
a high level of security and data governance across the complete spectrum of the Hadoop
cluster.
Various technologies that reside within the data center can also benefit from YARN. It makes it
possible to process data and scale storage linearly in a very cost-effective way. Using YARN helps
build applications that can access data and run in a Hadoop ecosystem on a consistent framework.
Some of the features of YARN
High degree of compatibility: applications created using the MapReduce framework can run on
YARN without modification.
Better cluster utilization: YARN allocates cluster resources in an efficient and dynamic manner,
which leads to much better utilization than in previous versions of Hadoop.
Utmost scalability: as the number of nodes in the Hadoop cluster expands, the YARN
ResourceManager ensures that user requirements are met and that the processing power of the
data center does not become a bottleneck.
Multi-tenancy: various engines that access data on the Hadoop cluster can work efficiently side
by side, thanks to YARN being a highly versatile technology.
Key components of YARN
YARN came into existence because there was an urgent need to separate the two distinct kinds of
work that go on in a Hadoop cluster, previously handled together by the JobTracker and
TaskTracker entities. The key components of the YARN technology are:
Global Resource Manager
Application Master (one per application)
Node Manager (one per slave node)
Container (per application, running on a Node Manager)
The Node Manager and the Resource Manager are the foundation on which the new
distributed applications work. Resources are allocated to the running applications by the
Resource Manager. The Application Master works along with the Node Manager, within its
specific framework, to obtain resources from the Resource Manager and to manage the various
task components.
A scheduler works within the Resource Manager (RM) to allocate resources correctly while
ensuring that constraints such as user limits and queue capacities are respected at all times. The
scheduler provides the right resources according to the requirements of each application.
The Application Master works in coordination with the scheduler in order to obtain the
right resource containers, keeps an eye on their status, and tracks the progress of the
process.
The Node Manager manages the application containers: it launches them when required,
tracks their use of resources such as memory, processor, network and disk, and gives
detailed reports to the Resource Manager.
1. Compare MapReduce and Spark
Ease of use: MapReduce needs extensive Java programs, whereas Spark provides APIs for Python, Java and Scala.
Versatility: MapReduce is not optimized for real-time and machine learning applications, whereas Spark handles such workloads well.
2. What is MapReduce?
Referred to as the core of Hadoop, MapReduce is a programming framework to process large sets
of data, or big data, across thousands of servers in a Hadoop cluster. The concept of MapReduce
is similar to other cluster scale-out data processing systems. The term MapReduce refers to the two
important processes a Hadoop program performs.
First is the map() job, which converts one set of data into another, breaking down individual
elements into key/value pairs (tuples). Then the reduce() job comes into play, wherein the output
from the map, i.e. the tuples, serves as the input and is combined into a smaller set of tuples. As
the name suggests, the map job always occurs before the reduce job.
3. Illustrate a simple example of the working of MapReduce.
Let's take a simple example to understand the functioning of MapReduce. In real-world projects
and applications this will be far more elaborate and complex, as the data we deal with in Hadoop
and MapReduce is extensive and massive.
Assume you have five files, and each file consists of two columns, i.e. key/value pairs: a city name
and a temperature recorded in that city. Here, the name of the city is the key and the
temperature is the value.
San Francisco, 22
Los Angeles, 15
Vancouver, 30
London, 25
Los Angeles, 16
Vancouver, 28
London,12
It is important to note that each file may contain data for the same city multiple times. Now,
out of this data, we need to calculate the maximum temperature for each city across the five
files. As explained, the MapReduce framework will divide the work into five map tasks; each map
task processes one of the five files and returns the maximum temperature for each city in that file,
for example:
(San Francisco, 22)(Los Angeles, 16)(Vancouver, 30)(London, 25)
Similarly, the other mappers process the remaining four files and produce intermediate results, for
instance:
(San Francisco, 32)(Los Angeles, 2)(Vancouver, 8)(London, 27)
(San Francisco, 29)(Los Angeles, 19)(Vancouver, 28)(London, 12)
(San Francisco, 18)(Los Angeles, 24)(Vancouver, 36)(London, 10)
(San Francisco, 30)(Los Angeles, 11)(Vancouver, 12)(London, 5)
These intermediate results are then passed to the reduce job, where the values from all files are
combined to output a single value per city. The final result here would be:
(San Francisco, 32)(Los Angeles, 24)(Vancouver, 36)(London, 27)
These calculations are performed in parallel and are extremely efficient for computing outputs over
a large dataset.
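As an illustration only (the notes do not include code for this example), a minimal Hadoop MapReduce implementation of the per-city maximum temperature computation might look like the sketch below; class names and input format assumptions ("City, temperature" lines) are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Parses lines of the form "City, temperature" and emits (city, temperature).
    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim()),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    // Keeps the maximum temperature seen for each city.
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per city");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setCombinerClass(MaxTempReducer.class);  // max is associative, so the reducer doubles as a combiner
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the driver also registers the reducer as a combiner, which computes per-mapper maxima locally and reduces the amount of data shuffled across the network.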
4. What are the main components of MapReduce Job?
Main Driver Class: provides the job configuration parameters.
Mapper Class: must extend the org.apache.hadoop.mapreduce.Mapper class and implements the
map() method.
Reducer Class: must extend the org.apache.hadoop.mapreduce.Reducer class and implements the reduce() method.
5. What is Shuffling and Sorting in MapReduce?
Shuffling and Sorting are two major processes operating simultaneously during the working of
mapper and reducer.
The process of transferring data from the mappers to the reducers is shuffling. It is a mandatory
operation for reducers to proceed with their jobs, as the shuffled data serves as the input for
the reduce tasks.
In MapReduce, the output key/value pairs between the map and reduce phases (after the
mapper) are automatically sorted by key before moving to the reducer. This feature is helpful in
programs where you need sorting at some stage. It also saves the programmer's overall time.
6. What is Partitioner and its usage?
The Partitioner is yet another important phase; it controls the partitioning of the intermediate
map output keys, by default using a hash function. The partitioning determines to which reducer a
key/value pair (of the map output) is sent. The number of partitions is equal to the total number of
reduce tasks for the job.
HashPartitioner is the default partitioner class in Hadoop, and it implements the following function:
int getPartition(K key, V value, int numReduceTasks)
The function returns the partition number for a given key; numReduceTasks is the (fixed) number
of reducers.
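As an illustration, a custom Partitioner can replace the default hash-based one; the sketch below (class name and partitioning rule are hypothetical, not from the notes) spreads keys over reducers by the hash of their first character:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: routes each key by its first character.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;  // nothing to partition when there are no reducers
        }
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : k.charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// In the driver: job.setPartitionerClass(FirstCharPartitioner.class);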
7. What is Identity Mapper and Chain Mapper?
Identity Mapper is the default Mapper class provided by Hadoop. When no other Mapper class
is defined, IdentityMapper is executed. It only writes the input data to the output and does not
perform any computations or calculations on the input data.
The class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Chain Mapper allows a chain of simple Mapper classes to be run within a single map task. In this
chain, the output of the first mapper becomes the input of the second mapper, the second
mapper's output becomes the input of the third mapper, and so on until the last mapper.
The class name is org.apache.hadoop.mapreduce.lib.ChainMapper.
8. What main configuration parameters are specified in MapReduce?
The MapReduce programmer needs to specify the following configuration parameters to perform
the map and reduce jobs:
The input location of the job in HDFS.
The output location of the job in HDFS.
The input and output formats.
The classes containing the map and reduce functions, respectively.
The .jar file containing the mapper, reducer and driver classes.
9. Name Job control options specified by MapReduce.
Since this framework supports chained operations, wherein the output of one map job serves as
the input for another, there is a need for job controls to govern these complex operations.
The various job control options are:
Job.submit(): submits the job to the cluster and returns immediately.
Job.waitForCompletion(boolean): submits the job to the cluster and waits for its completion.
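A rough sketch of how these two calls are typically combined to chain dependent jobs (job names are placeholders and the real mapper/reducer/path setup is omitted, so this is illustrative rather than a complete driver):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job cleanse = Job.getInstance(conf, "cleanse-input");   // mapper/reducer/paths omitted
        Job aggregate = Job.getInstance(conf, "aggregate");     // consumes the first job's output

        if (cleanse.waitForCompletion(true)) {   // block until the first job finishes
            aggregate.submit();                  // submit the second job and return immediately
            // The driver could poll aggregate.isComplete() or call waitForCompletion as needed.
        }
    }
}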
10. What is InputFormat in Hadoop?
Another important feature in MapReduce programming, InputFormat defines the input
specifications for a job. It performs the following functions:
Validates the input specification of the job.
Splits the input file(s) into logical instances called InputSplits. Each split is then
assigned to an individual Mapper.
Provides the RecordReader implementation used to extract input records from the splits
for processing by the Mapper.
11. What is the difference between HDFS block and InputSplit?
An HDFS block is a physical division of the data, while an InputSplit in MapReduce is a logical
division of the input files.
The InputSplit is used to control the number of mappers, and the split size is user defined. By
contrast, the HDFS block size is fixed (64 MB by default in older releases, 128 MB in Hadoop 2.x
and later); for 1 GB of data with a 64 MB block size there will be 1 GB / 64 MB = 16 blocks.
However, if the input split size is not defined by the user, it takes the HDFS block size by default.
12. What is Text Input Format?
TextInputFormat is the default InputFormat for plain text files in a job (it also handles compressed
input files, such as those with a .gz extension, although such files are not splittable). In
TextInputFormat, files are broken into lines; the key is the byte offset of the line within the file and
the value is the line of text. Programmers can also write their own InputFormat.
The hierarchy is:
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
13. What is JobTracker?
JobTracker is a Hadoop service used for the processing of MapReduce jobs in the cluster. It
submits jobs to, and tracks them on, specific nodes having the data. Only one JobTracker runs on a
Hadoop cluster, in its own JVM process. If the JobTracker goes down, all running jobs halt.
14. Explain job scheduling through JobTracker.
The JobTracker communicates with the NameNode to identify data locations and submits the work to a
TaskTracker node. The TaskTracker plays a major role: it notifies the JobTracker of any task
failure and sends heartbeat messages reassuring the JobTracker that it is still alive. The
JobTracker is then responsible for the next actions: it may resubmit the job elsewhere, mark a
specific record as unreliable, or blacklist the TaskTracker.
15. What is SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files (compressed binary key/value
files); it extends FileInputFormat. Sequence files are typically used to pass data between the output
of one MapReduce job and the input of another.
16. How to set mappers and reducers for Hadoop jobs?
Users can configure the JobConf variable to set the number of mappers and reducers:
job.setNumMapTasks(int)
job.setNumReduceTasks(int)
17. Explain JobConf in MapReduce.
It is the primary interface to define a MapReduce job in Hadoop for job execution. JobConf
specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat
implementations, and other advanced job facets like Comparators.
18. What is a MapReduce Combiner?
Also known as a semi-reducer, a Combiner is an optional class that combines the map output
records sharing the same key. The main function of a combiner is to accept the outputs of the Map
class and pass reduced key/value pairs on to the Reducer class, cutting down the data transferred
over the network.
19. What is RecordReader in a Map Reduce?
RecordReader is used to read key/value pairs from an InputSplit, converting the byte-oriented
view of the input into a record-oriented view for the Mapper.
20. Define Writable data types in MapReduce.
Hadoop reads and writes data in a serialized form defined by the Writable interface. The Writable
interface has several implementation classes such as Text (for string data), IntWritable,
LongWritable, FloatWritable and BooleanWritable. Users are free to define their own Writable
classes as well.
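A user-defined Writable is just a class that implements the Writable interface; a minimal sketch (the class and field names are hypothetical) is shown below. To be usable as a key it would additionally need to implement WritableComparable.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal custom Writable holding a city name and a temperature reading.
public class CityTempWritable implements Writable {
    private String city = "";
    private int temperature;

    public void set(String city, int temperature) {
        this.city = city;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(city);          // serialize fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        city = in.readUTF();         // deserialize in exactly the same order
        temperature = in.readInt();
    }
}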
21. What is OutputCommitter?
OutputCommitter describes the commit of MapReduce task output. FileOutputCommitter is the
default OutputCommitter class available in MapReduce. It performs the following
operations:
Creates a temporary output directory for the job during initialization.
Cleans up the job, i.e. removes the temporary output directory after job completion.
Sets up the task's temporary output.
Identifies whether a task needs a commit and applies the commit if required.
JobSetup, JobCleanup and TaskCleanup are important tasks during output commit.
22. What is a “map” in Hadoop?
In Hadoop, a map is a phase of a MapReduce job. A map reads data from an input location
and outputs key/value pairs according to the input type.
23. What is a “reducer” in Hadoop?
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a
final output of its own.
24. What are the parameters of mappers and reducers?
The four parameters for mappers are:
LongWritable (input key)
Text (input value)
Text (intermediate output key)
IntWritable (intermediate output value)
The four parameters for reducers are:
Text (intermediate output key)
IntWritable (intermediate output value)
Text (final output key)
IntWritable (final output value)
25. What are the key differences between Pig vs MapReduce?
Pig is a data flow language; the key focus of Pig is managing the flow of data from the input source
to the output store. As part of managing this data flow it moves data along, feeding it to
process 1, taking the output and feeding it to process 2, and so on. Its core features are preventing
execution of subsequent stages if a previous stage fails, managing temporary storage of data and,
most importantly, compressing and rearranging processing steps for faster processing. While this
could be done for any kind of processing task, Pig is written specifically for managing the data flow
of MapReduce-type jobs; most, if not all, jobs in Pig are MapReduce jobs or data-movement jobs.
Pig also allows custom functions to be added for processing; some default ones are ordering,
grouping, distinct, count, etc.
MapReduce, on the other hand, is a data processing paradigm: it is a framework for application
developers to write code in so that it is easily scaled to petabytes of data, and it creates a separation
between the developer that writes the application and the developer that scales it.
Not all applications can be migrated to MapReduce, but a good few can be, ranging from complex
ones like k-means to simple ones like counting uniques in a dataset.
26. What is partitioning?
Partitioning is the process of identifying the reducer instance that will receive a given piece of the
mapper output. Before the mapper emits a (key, value) pair to the reducers, the partitioner
identifies the reducer that will act as the recipient. All values for the same key, no matter which
mapper generated them, must go to the same reducer.
27. How to set which framework would be used to run mapreduce program?
Set the mapreduce.framework.name property. It can be:
1. local
2. classic
3. yarn
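The property is normally set in mapred-site.xml, but as a small illustrative sketch it can also be set programmatically in the driver (class name is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnFrameworkDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting the property in mapred-site.xml; "local" and "classic" are the other options.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = Job.getInstance(conf, "runs-on-yarn");
        // ... remaining job setup as usual.
    }
}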
28. What platform and Java version is required to run Hadoop?
Java 1.6.x or a higher version is good for Hadoop, preferably from Sun/Oracle. Linux and Windows
are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known
to work.
29. Can MapReduce program be written in any language other than Java?
Yes, MapReduce can be written in many programming languages: Java, R, C++ and scripting
languages (Python, PHP). Any language that can read from stdin, write to stdout, and parse
tab and newline characters will work. Hadoop Streaming (a Hadoop utility) allows you to
create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
UNIT – IV
Apache Pig
Pig raises the level of abstraction for processing large datasets. It is a platform for analyzing large
data sets that consists of a high-level language for expressing data analysis programs. It is an
open-source platform originally developed by Yahoo!.
Advantages of Pig
Reusing the code
Faster development
Fewer lines of code
Schema and type checking, etc.
Pig is made up of two pieces:
First is the language used to express data flows, known as Pig Latin.
Second is the execution environment used to run Pig Latin programs. There are presently two
environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is a series of operations, or transformations, which are applied to the input
data to produce output. These operations describe a data flow that the Pig execution environment
translates into an executable representation and then runs.
What makes Pig Hadoop popular?
It is easy to learn, read, write and implement if you know SQL.
It implements a multi-query approach.
It provides a large number of nested data types such as maps, tuples and bags, which are not
readily available in MapReduce, along with data operations like filters, ordering and joins.
It is used by many different user groups: for instance, up to 90% of Yahoo!'s MapReduce is done
by Pig and up to 80% of Twitter's MapReduce is also done by Pig, and various other companies
such as Salesforce, LinkedIn and Nokia use Pig extensively.
Apache Pig is a platform for managing large sets of data which provides high-level
programming constructs to analyze the data as required. Pig mainly provides the
infrastructure to evaluate these programs. The advantage of Pig programming
is that it can easily handle parallel processing of very large amounts of
data. The programming on this platform is done using the textual language Pig Latin.
Pig Latin comes with the following features:
Simple programming: it is easy to code, execute and manage the program.
Better optimization: the system can automatically optimize the execution as per the requirement.
Extensibility: it can be extended to achieve highly specific processing tasks.
Pig can be used for following purposes:
ETL data pipeline
Research on raw data
Iterative processing.
The scalar data types in Pig are int, long, float, double, chararray and bytearray.
The complex data types in Pig are the map, tuple and bag.
Map: a set of key/value pairs, where the key is of type chararray and the value can be any Pig data
type, including complex types.
Example: ['city'#'bang', 'pin'#560001]
Here city and pin are the keys mapped to the values 'bang' and 560001.
Tuple: an ordered collection of fields of any data type; it has a defined, fixed length.
Bag: a collection of tuples; it is an unordered collection, and the tuples in the bag are
separated by commas.
Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
LOAD function: the LOAD function loads data from the file system. It is known as a
relational operator. In the first step of a data-flow program you must specify the input,
which is done using the keyword 'load'.
The LOAD syntax is:
LOAD 'mydata' [USING function] [AS schema];
Example: A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
The relational operators in Pig are as follows:
foreach, order by, filter, group, distinct, join, limit.
foreach: takes a set of expressions and applies them to every record in the data
pipeline, passing the result on to the next operator.
A = LOAD 'input' AS (emp_name: chararray, emp_id: long, emp_add: chararray, phone:
chararray, preferences: map[]);
B = FOREACH A GENERATE emp_name, emp_id;
filter: contains a predicate and lets us select which records will be retained in the data
pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias indicates the name of the relation, BY is a required keyword, and the
expression is a Boolean condition.
Example: M = FILTER N BY F5 == 4;
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
•Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the
commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e
option to run a script specified as a string on the command line.
•Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run and the -e option is not used. It is also possible to run
Pig scripts from within Grunt using run and exec.
•Embedded
You can run Pig programs from Java using the PigServer class, much as you can use JDBC to run
SQL programs from Java.
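To make the embedded mode concrete, the sketch below (file names are hypothetical) uses Pig's Java API, PigServer, to register and run Pig Latin statements from a Java program:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java; ExecType.MAPREDUCE would target a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("Lines = LOAD 'input/hadoop.log' AS (line:chararray);");
        pig.registerQuery("Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("Groups = GROUP Words BY word;");
        pig.registerQuery("Counts = FOREACH Groups GENERATE group, COUNT(Words);");
        pig.store("Counts", "output/wordcounts");   // writes the result using the default PigStorage
    }
}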
Example: Word count in Pig
Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words);
Results = ORDER Counts BY $1 DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Apache Hive
Pig and Hive are open-source platforms used for much the same purpose. These tools ease the
complexity of writing complex Java-based MapReduce programs. Hive is like a
data warehouse that uses MapReduce to analyze data stored on HDFS. It
provides a query language called HiveQL that is similar to the Structured Query Language
(SQL) standard. Hive was created at Facebook so that analysts with strong SQL skills but few
Java programming skills could run queries on the large volumes of data that Facebook stored in
HDFS. Apache Pig and Hive are two projects that sit on top of Hadoop and provide a higher-level
language for using Hadoop's MapReduce library.
Why hive?
Hive provides a query language based on standard SQL instead of requiring the development
of map and reduce tasks. Hive takes HiveQL statements and automatically
transforms each query into one or more MapReduce jobs. It then runs the overall
MapReduce program and returns the output to the user. Whereas Hadoop Streaming reduces
the required code/compile/submit cycle, Hive removes it completely and instead requires
only the composition of HiveQL statements.
This interface to Hadoop not only accelerates the time required to produce results from data
analysis, it also significantly expands the set of people for whom Hadoop and MapReduce are useful.
What makes Hive Hadoop popular?
Users are provided with strong and powerful statistical functions.
It is similar to SQL and hence very easy to understand.
It can be combined with HBase for querying the data stored in HBase. This kind of feature is
not directly available in Pig; in Pig, the function HBaseStorage() is used for loading data
from HBase.
It is supported by Hue.
Various well-known user groups use Hive, such as CNET, Last.fm, Facebook and Digg.
Difference between hive and pig
Hive                                       Pig
Used for data analysis                     Used for data and programs
Works with structured data                 Works with semi-structured data
Uses HiveQL                                Uses Pig Latin
Used for creating reports                  Used for programming
Works on the server side                   Works on the client side
Does not support Avro directly             Supports Avro
hive> select * from employee;
hive> describe employee;
Apache Hive is data warehouse software which allows you to read, write and
manage huge volumes of data stored in a distributed environment using SQL. It
is possible to project structure onto data that is already in storage. Users can connect to
Hive using a JDBC driver or a command-line tool.
Hive is an open-source platform. Use Hive for analyzing and querying large
datasets stored in Hadoop files. It is similar to SQL programming. The
version of Hive referred to in these notes is 0.13.1.
Hive supports ACID transactions: Atomicity, Consistency, Isolation and Durability. ACID
transactions are provided at the row level, through the Insert, Update and Delete operations.
Hive is not a complete database. The design rules and limitations of Hadoop
and HDFS place restrictions on what Hive can do.
Hive is most suitable for data warehouse applications that:
analyze relatively static data,
do not require fast response times, and
do not have rapidly changing datasets.
Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing);
it is best suited to data warehouse applications over large data sets.
The two types of tables in Hive
1. Managed table
2. External table
We can change settings within a Hive session using the SET command. It is used
to change Hive job settings so that a query produces the desired results.
Example: the following command makes Hive enforce bucketing according to the table
definition:
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property
name. SET on its own lists all the properties whose values have been set by Hive:
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list does not include the Hadoop defaults; to include those as well, use SET -v.
Que 1. Define Apache Pig
Ans. To analyze large data sets, representing them as data flows, we use Apache Pig. Basically,
Apache Pig is designed to provide an abstraction over MapReduce, reducing the complexity of
writing a MapReduce program in Java. Moreover, using Apache Pig, we can
perform data manipulation operations very easily in Hadoop.
Que 2. Why Do We Need Apache Pig?
Ans. At times, while performing MapReduce tasks, programmers who are not so good at
Java struggle to work with Hadoop. Hence, Pig is a boon for all such
programmers. The reasons are:
Using Pig Latin, programmers can perform MapReduce tasks easily, without having to
type complex codes in Java.
Since Pig uses multi-query approach, it also helps in reducing the length of codes.
It is easy to learn Pig when you are familiar with SQL. It is because Pig Latin is SQL-like
language.
In order to support data operations, it offers many built-in operators like joins, filters,
ordering, and many more. And, it offers nested data types that are missing from
MapReduce, for example, tuples, bags, and maps.
Que 3. What is the difference between Pig and SQL?
Ans. Here, are the list of major differences between Apache Pig and SQL.
Pig
It is a procedural language.
SQL
While it is a declarative language.
Pig
Here, the schema is optional. We can store data without designing a schema; fields are then
referred to positionally as $0, $1, etc.
SQL
In SQL, Schema is mandatory.
Pig
In Pig, data model is nested relational.
SQL
In SQL, data model used is flat relational.
Pig
Here, we have limited opportunity for query optimization.
SQL
While here we have more opportunity for query optimization.
Que 4. Explain the architecture of Hadoop Pig.
Ans. Below is the image, which shows the architecture of Apache Pig.
Now, we can see, several components in the Hadoop Pig framework. The major components
are:
1. Parser
At first, the Parser handles all the Pig scripts. Basically, the Parser checks the syntax of the script,
does type checking and other miscellaneous checks. The Parser's output is a DAG
(directed acyclic graph) that represents the Pig Latin statements and logical operators.
In the DAG (the logical plan), the logical operators of the script are represented as nodes and the
data flows are represented as edges.
2. Optimizer
Further, DAG is passed to the logical optimizer. That carries out the logical optimizations, like
projection and push down.
3. Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
4. Execution engine
At last, these MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop,
producing the desired results.
Que 5. What is the difference between Apache Pig and Hive?
Ans. Basically, we use both Pig and Hive to create MapReduce jobs, and at times Hive operates on
HDFS in much the same way Pig does. So, here we list a few significant points that set Apache Pig
apart from Hive.
Hadoop Pig
Pig Latin is a language, Apache Pig uses. Originally, it was created at Yahoo.
Hive
HiveQL is a language, Hive uses. It was originally created at Facebook.
Pig
It is a data flow language.
Hive
Whereas, it is a query processing language.
Pig
Moreover, it is a procedural language which fits in pipeline paradigm.
Hive
It is a declarative language.
Apache Pig
Also, can handle structured, unstructured, and semi-structured data.
Hive
Whereas, it is mostly for structured data.
Que 6. What is the difference between Pig and MapReduce?
Ans. Some major differences between Hadoop Pig and MapReduce, are:
Apache Pig
It is a data flow language.
MapReduce
However, it is a data processing paradigm.
Hadoop Pig
Pig is a high-level language.
MapReduce
Well, it is a low level and rigid.
Pig
In Apache Pig, performing a join operation is pretty simple.
MapReduce
But, in MapReduce, it is quite difficult to perform a join operation between datasets.
Que 7. Explain Features of Pig.
Ans. There are several features of Pig, such as:
To write an evaluate UDF, we will have to extend the EvalFunc class. EvalFunc is parameterized
and must provide the return type as well.
Que 10. What are the different UDF’s in Pig?
Ans. UDFs can be classified on the basis of the number of records they process at a time. They are of two types:
UDFs that take one record at a time, for example Filter and Eval functions.
UDFs that take multiple records at a time, for example aggregate functions like AVG and SUM.
Also, Pig gives you the facility to write your own UDFs to load/store data.
Que 11. What are the Optimizations a developer can use during joins?
Ans. We use a replicated join to join a small dataset with a large dataset. In the replicated join,
the small dataset is copied to all the machines where the mappers are running, while the large
dataset is divided across all the nodes. This gives us the advantage of a map-side join.
If your dataset is skewed, i.e. a particular key is repeated many times, then even if you use a
reduce-side join, that particular reducer will be overloaded and will take a lot of time. For this case
Pig provides the skewed join, identifying the skewed keys itself.
And if you have datasets where the records are sorted on the same field, you can go for a merge
(sorted) join; this also happens in the map phase and is very efficient and fast.
Que 12. What is a skewed join?
Ans. A join with a skewed dataset, i.e. one in which a particular value is repeated many times, is a
skewed join.
Que 13. What is Flatten?
Ans. Flatten is an operator in Pig that removes a level of nesting. Sometimes we have data
in a bag or a tuple and we want to remove the level of nesting so that the data structure becomes
flat; for this we use Flatten.
In addition, Flatten produces a cross product of every record in the bag with all of the
other expressions in the enclosing statement.
Que 14. What are the complex data types in pig?
Ans. The complex data types in Pig are the tuple, the bag, and the map (described earlier in this unit).
Also, we can use an ODBC driver application, since Hive supports ODBC connections to the
Hive server.
Que 4. Can we change the data type of a column in a hive table?
Ans. By using the REPLACE COLUMNS option we can change the data type of a column in a Hive table:
ALTER TABLE table_name REPLACE COLUMNS ……
Que 5. How to add a partition to an existing table that was not created as a partitioned table?
Ans. Basically, we cannot add/create a partition in an existing table that was not partitioned when
it was created.
However, if the table was created with a "PARTITIONED BY" clause, then by using the ALTER
TABLE command you are allowed to add a partition.
So, here are the create and alter commands:
CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);
ALTER TABLE tab02 ADD PARTITION (mon='10') LOCATION '/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv5.txt';
Que 6. How Hive organize the data?
Ans. Basically, there are 3 ways possible in which Hive organizes data. Such as:
1. Tables
2. Partitions
3. Buckets
Que 7. Explain Clustering in Hive?
Ans. Basically, Clustering in Hive means decomposing table data sets into more manageable parts.
To be more specific, the table is divided into a number of partitions, and these partitions can
be further subdivided into more manageable parts known as buckets/clusters. In addition, the
"CLUSTERED BY" clause is used to divide the table into buckets.
Que 8. Explain bucketing in Hive?
Ans. To decompose table data sets into more manageable parts, Apache Hive offers another
technique. That technique is what we call Bucketing in Hive.
Que 9. How is HCatalog different from Hive?
Ans. So, let’s learn the difference.
HCatalog –
Basically, it is a table storage management tool for Hadoop that exposes the tabular
data of the Hive metastore to other Hadoop applications. It enables users with different data
processing tools to easily write data onto a grid. Moreover, it ensures that users don't have to
worry about where or in what format their data is stored.
Hive –
Whereas Hive is an open-source data warehouse. We use it for analysis and querying of
datasets. It is developed on top of Hadoop as a data warehouse framework for querying and
analysis of data stored in HDFS.
In addition, it is useful for performing several operations, such as data encapsulation, ad-hoc
queries and analysis of huge datasets. Hive's design reflects its targeted use as a system for
managing and querying structured data.
Que 10. What is the difference between CREATE TABLE AND CREATE EXTERNAL
TABLE?
Ans. Although, we can create two types of tables in Hive. Such as:
– Internal Table
– External Table
Hence, to create the Internal table we use the command ‘CREATE TABLE’ whereas to create
the External table we use the command ‘CREATE EXTERNAL TABLE’.
Que 11. Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Ans. The above error may occur because of the following reasons:
1. When we use the Derby metastore, a lock file will be left behind in the case of an abnormal exit.
Hence, remove the lock file:
rm metastore_db/*.lck
2. Moreover, run Hive in debug mode to see the underlying cause:
hive -hiveconf hive.root.logger=DEBUG,console
Que 12. How many types of Tables in Hive?
Ans. Hive has two types of tables. Such as:
Managed table
External table
Que 13. Explain Hive Thrift server?
Ans. There is an optional component in Hive that we call HiveServer or HiveThrift.
Basically, it allows access to Hive over a single port. Thrift is a software framework for scalable
cross-language services development. It allows clients written in languages including Java, C++,
Ruby and many others to programmatically access Hive remotely.
Que 14. How to Write a UDF function in Hive?
Ans. Basically, the steps are:
1. Create a Java class for the user-defined function which extends
org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods.
Put in your desired logic and you are almost there.
2. Package your Java class into a JAR file.
3. Go to the Hive CLI, add your JAR, and verify that your JAR is in the Hive CLI classpath.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class.
5. Then use it in Hive SQL.
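A minimal sketch of such a UDF follows (the class name, function name and JAR path are hypothetical); the Hive commands from steps 3-5 are shown as comments:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that lower-cases a string column.
public class LowerCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;              // Hive passes NULLs through to the UDF
        }
        return new Text(input.toString().toLowerCase());
    }
}

// In the Hive CLI (after packaging the class into lowercase-udf.jar):
//   ADD JAR /path/to/lowercase-udf.jar;
//   CREATE TEMPORARY FUNCTION my_lower AS 'LowerCaseUDF';
//   SELECT my_lower(name) FROM employee;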
Que 16. What is the difference between Internal Table and External Table in Hive?
Ans. Hive Managed Tables-
It is also known as an internal table. When we create a table in Hive, it by default manages the
data. This means that Hive moves the data into its warehouse directory.
Usage:
We want Hive to completely manage the lifecycle of the data and the table.
The data is temporary.
Hive External Tables-
We can also create an external table. It tells Hive to refer to the data that is at an existing
location outside the warehouse directory.
Usage:
Data is used outside of Hive. For example, the data files are read and processed by an
existing program that does not lock the files.
We are not creating a table based on the existing table.
Que 17.Difference between order by and sort by in Hive?
Ans. So, the difference is:
Sort by
hive> SELECT E.EMP_ID FROM Employee E SORT BY E.empid;
1. It may use multiple reducers for the final output.
2. It only guarantees the ordering of rows within each reducer.
3. It therefore gives a partially ordered result.
Order by
hive> SELECT E.EMP_ID FROM Employee E ORDER BY E.empid;
1. Basically, it uses a single reducer to guarantee total order in the output.
2. Also, LIMIT can be used to minimize sort time.
Que 18. What are different modes of metastore deployment in Hive?
Ans. There are three modes for metastore deployment which Hive offers.
1. Embedded metastore
Here, by using embedded Derby Database both metastore service and hive service runs in the
same JVM.
2. Local Metastore
However, here, Hive metastore service runs in the same process as the main Hive Server
process, but the metastore database runs in a separate process.
3. Remote Metastore
Here, metastore runs on its own separate JVM, not in the Hive service JVM.
Que 19. Difference between HBase vs Hive
Ans. Following points are feature wise comparison of HBase vs Hive.
1.Database type
Apache Hive
Basically, Apache Hive is not a database.
HBase
HBase does support NoSQL database.
2. Type of processing
Apache Hive
Hive does support Batch processing. That is OLAP.
HBase
HBase does support real-time data streaming. That is OLTP.
3. Data Schema
Apache Hive
Basically, it supports to have schema model
HBase
However, it is schema-free
Que 20. What is the relation between MapReduce and Hive?
Ans. Hive offers no execution capabilities beyond MapReduce: HiveQL programs are executed as
MapReduce jobs via the interpreter. The interpreter runs on the client machine, transforms the
HiveQL queries into MapReduce jobs and submits them to the cluster.
Que 21. What is the importance of driver in Hive?
Ans. The driver manages the lifecycle of HiveQL queries. It receives queries from the UI and the
JDBC interfaces, and creates a separate session to handle each query.
Que 22. How can you configure remote metastore mode in Hive?
Ans. To use this remote metastore, you should configure Hive service by setting
hive.metastore.uris to the metastore server URI(s). Metastore server URIs are of the form
thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when
starting the metastore server.
Que 23. Can we LOAD data into a view?
Ans. No.
Que 24. What types of costs are associated with creating the index on hive tables?
Ans. Basically, there is a processing cost in arranging the values of the column on which the index
is created, and indexes occupy additional disk space that must be maintained as the table changes.
Que 25. Give the command to see the indexes on a table.
Ans. SHOW INDEX ON table_name
Basically, in the table table_name, this will list all the indexes created on any of the columns.
Que 26. How do you specify the table creator name when creating a table in Hive?
Ans. The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES(‘creator’= ‘Joan’)
Que 27.Difference between Hive and Impala?
Ans. Following are the feature wise comparison between Impala vs Hive:
1. Query Process
Hive
Basically, in Hive every query has the common problem of a "cold start".
Impala
Impala avoids the startup overheads that are very frequently and commonly observed in
MapReduce-based jobs, because it is a native query engine: the Impala daemon processes are
started at boot time itself and are always ready to process a query.
2. Intermediate Results
Hive
Basically, Hive materializes all intermediate results. Hence, it enables enabling better
scalability and fault tolerance. However, that has an adverse effect on slowing down the data
processing.
Impala
However, it’s streaming intermediate results between executors. Although, that trades off
scalability as such.
3. During the Runtime
Hive
At Compile time, Hive generates query expressions.
Impala
During the Runtime, Impala generates code for “big loops”.
Que 28. What are types of Hive Built-In Functions?
Ans. So, its types are:
1. Collection Functions
2. Hive Date Functions
3. Mathematical Functions
4. Conditional Functions
5. Hive String Functions
Que 29.Types of Hive DDL Commands.
Ans. There are several types of Hive DDL commands that we commonly use, such as:
1. Create Database Statement
2. Hive Show Database
3. Drop database
4. Creating Hive Tables
5. Browse the table
6. Altering and Dropping Tables
7. Hive Select Data from Table
8. Hive Load Data
Que 30. What are Hive Operators and its Types?
Ans. Hive operators are used to perform operations on operands, and they return a specific value as per the logic applied.
Types of Hive Built-in Operators
Relational Operators
Arithmetic Operators
Logical Operators
String Operators
Operators on Complex Types
1) What is the difference between Pig and Hive ?
Pig vs Hive:
Type of data – Apache Pig is usually used for semi-structured data, whereas Hive is used for structured data.
General usage – Pig is usually used on the client side of the Hadoop cluster, whereas Hive is usually used on the server side of the Hadoop cluster.
Coding style – Pig Latin is verbose, whereas HiveQL is more like SQL.
For a detailed answer on the difference between Pig and Hive, refer to this link -
https://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-
hadoop-ecosystem/79
2) What is the difference between HBase and Hive ?
HBase vs Hive: HBase is a NoSQL database, whereas Hive is a data warehouse framework.
2) I do not need the index created in the first question anymore. How can I delete the
above index named index_bonuspay?
DROP INDEX index_bonuspay ON employee;
Test Your Practical Hadoop Knowledge
Which companies use Hive extensively? This could be one of the possible Hive Interview
Questions asked at your next Hadoop Job interview.
javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value
jdbc:derby:;databaseName=metastore_db;create=true.
The value implies that embedded Derby will be used as the Hive metastore and that the location of the metastore is metastore_db, which will be created only if it does not already exist. Because metastore_db is a relative location, it gets created in whichever directory you run Hive queries from. This property can be altered in the hive-site.xml file to an absolute path so that a single metastore location is used instead of creating a metastore_db subdirectory every time Hive is launched from a different directory.
11) How will you read and write HDFS files in Hive?
i) TextInputFormat- This class is used to read data in plain text file format.
ii) HiveIgnoreKeyTextOutputFormat- This class is used to write data in plain text file format.
iii) SequenceFileInputFormat- This class is used to read data in hadoop SequenceFile format.
iv) SequenceFileOutputFormat- This class is used to write data in hadoop SequenceFile format.
12) What are the components of a Hive query processor?
The query processor in Apache Hive converts SQL into a graph of MapReduce jobs, along with the execution-time framework needed to run those jobs in the order of their dependencies. The various components of the query processor are-
Parser
Semantic Analyser
Type Checking
Logical Plan Generation
Optimizer
Physical Plan Generation
Execution Engine
Operators
UDFs and UDAFs.
CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of the data, which is then sorted within each respective reducer.
18) Difference between HBase and Hive.
HBase is a NoSQL database whereas Hive is a data warehouse framework to process Hadoop
jobs.
HBase runs on top of HDFS whereas Hive runs on top of Hadoop MapReduce.
19) Write a hive query to view all the databases whose name begins with “db”
SHOW DATABASES LIKE ‘db.*’
20) How can you prevent a large job from running for a long time?
This can be achieved by setting MapReduce jobs to execute in strict mode:
set hive.mapred.mode=strict;
The strict mode ensures that queries on partitioned tables cannot execute without defining a WHERE clause.
UNIT – V
Spark Tutorial
In this free Apache Spark Tutorial you will be introduced to Spark analytics, Spark streaming,
RDD, Spark on cluster, Spark shell and actions. You will learn about Spark practical use cases.
Apache Spark continues to gain momentum in today’s big data analytics landscape. Although a relatively new entry to the realm, Apache Spark has earned immense popularity among enterprises and data analysts within a short period. Apache Spark is one of the most active open source big data projects. The reason behind this is its versatility and diversity of use.
Some of the key features that make Spark a strong big data engine are:
Equipped with MLlib library for machine learning algorithms
Good for Java and Scala developers as Spark imitates Scala’s collection API and functional
style
Single library can perform SQL, graph analytics and streaming.
Spark is admired by developers and analysts for its ability to quickly query, analyze and transform data at scale. In simple words, you can call Spark a competent alternative to Hadoop, with its own characteristics, strengths and limitations. Spark runs in-memory to process data with more speed and sophistication than complementary approaches like Hadoop MapReduce. It can handle several terabytes of data at a time and process them efficiently.
Spark versus Hadoop MapReduce
Despite having similar functionality, there are many differences between these two technologies. Let’s have a quick look at this comparative analysis:
Latency: lower in Spark, higher in Hadoop MapReduce.
One of the excellent benefits of using Spark is that it can use Hadoop’s data storage model, i.e. HDFS, and it integrates well with other big data frameworks like HBase, MongoDB and Cassandra. It is one of the best big data choices for learning and applying machine learning algorithms in real time, since it has the ability to run repeated queries on large datasets efficiently.
Given the excellent future growth and rapid adoption of Apache Spark in today’s business world, this Spark tutorial is designed to educate programmers on this interactive and expeditious framework. The tutorial aims at training you on the beginner concepts of using Spark as well as giving you insights into its advanced modules.
It includes detailed elucidation of Spark and Hadoop Distributed File System. The major topics
include Spark Components, Common Spark Algorithms-Iterative Algorithms, Graph
Analysis, Machine Learning, and Running Spark on a Cluster. Further, you will be able to write algorithms yourself by learning to develop Spark applications using Python, Java or Scala, and the RDD API with its operations. Since Spark can run on diverse platforms using various languages, it is important to gain insight into developing applications with each of the mentioned programming languages.
This learning package also covers Spark, Hadoop and the Enterprise Data Centre, Common Spark Algorithms, and Spark Streaming, which is yet another important feature of Spark. Many application developers use this data streaming capability to keep a check on fraudulent financial transactions.
Recommended Audience
Big Data Analysts and Architects
Software Professionals, ETL Developers and Data Engineers
Data Scientists and Analytics Professionals
Beginner and advanced-level programmers in Java, C++, Python
Graduates aiming to learn the latest and most efficient programming languages to process Big Data in a faster and easier manner.
Prerequisites
Before getting started with this tutorial, have a good understanding of Java basics and concepts of programming. Knowledge of other programming languages like C, C++ and Python, and of big data analytics, will also help you decipher the topics better.
Spark Features
Developed in the AMPLab of the University of California, Berkeley, Apache Spark was designed for higher speed, ease of use and more in-depth analysis. Though it was built to be installed on top of a Hadoop cluster, its ability to do parallel processing allows it to run independently as well. Let’s take a closer look at the features of Apache Spark –
Fast processing – The most important feature of Apache Spark, and the reason the big data world chooses this technology over others, is its speed. Big data is characterized by volume, variety, velocity and veracity, and needs to be processed at high speed. Spark’s Resilient Distributed Dataset (RDD) saves time on reading and writing operations, so Spark runs almost ten to a hundred times faster than Hadoop MapReduce.
Flexibility – Apache Spark supports multiple languages and allows developers to write applications in Java, Scala, R or Python. Equipped with over 80 high-level operators, it is quite rich in this respect.
In-memory computing – Spark stores data in the RAM of the servers, which allows it to be accessed quickly and in turn accelerates the speed of analytics.
Real-time processing – Spark is able to process real-time streaming data. Unlike MapReduce, which processes only stored data, Spark can process data as it arrives and hence produce instant outcomes.
Better analytics – In contrast to MapReduce, which provides only Map and Reduce functions, Spark includes much more: a rich set of SQL queries, machine learning algorithms, complex analytics, etc. With all these functionalities, analytics can be performed in a better fashion with the help of Spark.
Compatible with Hadoop – Spark is not only able to work independently, it can also work on top of Hadoop. It is compatible with both versions of the Hadoop ecosystem.
Apache Spark Architecture
In order to understand the way Spark runs, it is very important to know the architecture of Spark. The following discussion will give you a clearer view of it.
There are three ways Apache Spark can run :
Standalone – The Hadoop cluster can be equipped with all the resources statically and Spark
can run with MapReduce in parallel. This is the simplest deployment.
On Hadoop YARN – Spark can be executed on top of YARN without any pre-installation.
This deployment utilizes the maximum strength of Spark and other components.
Spark In MapReduce (SIMR) – If you don’t have YARN, you can also use Spark along with
MapReduce. This reduces the burden of deployments.
Whichever way Spark is deployed, the configuration allocates resources to it. The moment Spark is connected, it obtains executors on the nodes. These executors are processes that run computations and store the data. The application code is then sent to the executors, after which SparkContext sends tasks to the executors to run.
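As a minimal sketch (the application name and master URL below are illustrative assumptions, not part of these notes), creating a SparkContext that connects Spark to a cluster manager looks like this:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")   // illustrative application name
  .setMaster("local[*]")            // use "yarn" instead when running on a YARN cluster
val sc = new SparkContext(conf)
// sc now asks the cluster manager for executors and later sends them tasks to run.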
Some important terms to illustrate the architecture are –
Spark Driver – the program that runs the main() function of the Spark application.
Since the time of its inception in 2009 and its conversion to an open source technology, Apache
Spark has taken the big data world by storm. It became one of the largest open source
communities that includes over 200 contributors. The prime reason behind its success was its
ability to process heavy data faster than ever before.
Spark is a widely-used technology adopted by most industries. Let us look at some of the prominent applications of Apache Spark –
Machine Learning – Apache Spark is equipped with a scalable machine learning library called MLlib that can perform advanced analytics such as clustering, classification, dimensionality reduction, etc. Prominent analytics jobs like predictive analysis, customer segmentation and sentiment analysis make Spark an intelligent technology.
Fog computing – With the influx of big data concepts, IoT has acquired a prominent space for the invention of more advanced technologies. Based on the idea of connecting digital devices with the help of small sensors, this technology deals with a humongous amount of data emanating from numerous sources. This requires parallel processing, which is not feasible on cloud computing alone. Therefore Fog computing, which decentralizes data and storage, uses Spark Streaming as a solution to this problem.
Event detection – Spark Streaming allows organizations to keep track of rare and unusual behaviors in order to protect their systems. Financial institutions, security organizations and health organizations use such triggers to detect potential risks.
Interactive analysis – Among the most notable features of Apache Spark is its ability to
support interactive analysis. Unlike MapReduce that supports batch processing, Apache
Spark processes data faster because of which it can process exploratory queries without
sampling.
Some of the most popular companies that are using Apache Spark are –
Uber – Uses Kafka, Spark Streaming, and HDFS for building a continuous ETL pipeline.
Pinterest – Uses Spark Streaming in order to gain deep insight into customer engagement
details.
Conviva – The pinnacle video company Conviva deploys Spark for optimizing the videos
and handling live traffic.
Components of Spark
The following gives a clear picture of the different components of Spark.
Apache Spark Core
Spark Core is the general execution engine of the Spark platform; all other functionality is built on top of it. It provides in-memory computing and the ability to reference datasets stored in external storage systems.
Spark allows developers to write code quickly with the help of a rich set of operators. A job that takes many lines of code in MapReduce takes only a few lines in Spark with Scala. The following word count program will help you understand the way programming is done with Spark:
sparkContext.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.saveAsTextFile("hdfs://...")
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for both structured and semi-structured data.
Below is an example of a Hive compatible query:
// sc is an existing SparkContext.
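// A sketch of such a Hive-compatible query using the Spark 1.x HiveContext API
// (the table name "src" and the input file are illustrative assumptions, not from these notes).
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveCtx.sql("LOAD DATA LOCAL INPATH 'kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL.
hiveCtx.sql("SELECT key, value FROM src").collect().foreach(println)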
The filter() operation does not mutate the existing input RDD. Instead, it returns a pointer to an entirely new RDD.
Actions : These are operations that return a final value to the driver program or write data to an external storage system. Calling an action forces the evaluation of the transformations required for the RDD it was called on, since the action needs to actually produce output.
Python error count records using actions
print "Input had " + str(badLinesRDD.count()) + " concerning lines"
print "Here are 10 examples:"
for line in badLinesRDD.take(10):
    print line
Scala error count records using actions
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
Java error count records using actions
System.out.println("Input had " + badLinesRDD.count() + " concerning lines");
System.out.println("Here are 10 examples:");
for (String line: badLinesRDD.take(10)) {
System.out.println(line);
}
In the above lines of code, take() is used to retrieve a small number of elements of the RDD at the driver program, which are then iterated over to write output at the driver. RDDs also have a collect() function that fetches the entire RDD to the driver. Each time a new action is called, the entire RDD must be computed "from scratch". To avoid this inefficiency, users can persist intermediate results.
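For instance, a sketch of persisting the badLinesRDD from the examples above so that repeated actions do not recompute it (the storage level chosen here is just an illustration):
import org.apache.spark.storage.StorageLevel
badLinesRDD.persist(StorageLevel.MEMORY_ONLY)  // keep this RDD in memory after it is first computed
println(badLinesRDD.count())                   // the first action computes and caches the RDD
badLinesRDD.take(10).foreach(println)          // subsequent actions reuse the cached result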
Lazy Evaluation
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not performed immediately. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data, built up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are: when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times.
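As a small illustrative sketch (the file path is a placeholder):
val lines = sc.textFile("hdfs://...")                      // nothing is read yet
val errors = lines.filter(line => line.contains("error"))  // still nothing is computed
println(errors.count())                                    // count() is an action, so only now is the file read and filtered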
Passing Functions to Spark
Python : In Python, we have three main choices for passing functions into Spark. First, we can pass functions as lambda expressions. Second, we can pass a function that is already a member of an object, or one that references fields within an object. Third, we can simply extract the fields we require from the object into a local variable and pass that in.
Scala : In Scala we can pass in functions defined inline, references to methods, or static functions, as we do for Scala’s other functional APIs (Application Programming Interfaces); a short sketch follows below.
Java : In Java, functions are specified as objects that implement one of Spark’s function interfaces from the org.apache.spark.api.java.function package.
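As the sketch referred to above, here is one way of passing functions in Scala, assuming someRDD is an RDD of strings (both the RDD and the function names are illustrative):
// Passing a function defined inline
val upper = someRDD.map(line => line.toUpperCase)
// Passing a reference to a defined function
def containsError(line: String): Boolean = line.contains("error")
val errors = someRDD.filter(containsError)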
Common Transformations and Actions
The two most common transformations you will be using are map() and filter(). The map() transformation takes in a function and applies it to each element in the RDD, with the result of the function being the new value of each element in the resulting RDD. The filter() transformation takes in a function and returns an RDD that only has the elements that pass the filter() function.
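A short sketch of both transformations on a small RDD of numbers (the values are illustrative):
val input = sc.parallelize(List(1, 2, 3, 4))
val squares = input.map(x => x * x)           // transformation: 1, 4, 9, 16
val evens = squares.filter(x => x % 2 == 0)   // transformation: 4, 16
println(evens.collect().mkString(", "))       // collect() is an action that returns the results to the driver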
Data caching: Spark caches data in-memory, whereas Hadoop MapReduce relies on the hard disk.
Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced
execution engine supporting cyclic data flow and in-memory computing. Spark can run on
Hadoop, standalone or in the cloud and is capable of accessing diverse data sources including
HDFS, HBase, Cassandra and others.
3. Explain key features of Spark.
4. Define RDD?
RDD (Resilient Distributed Dataset) is Spark’s fundamental data structure: a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways:
1. Parallelized Collections : created by parallelizing an existing collection in the driver program, so that its elements run in parallel with one another.
2. Hadoop datasets : created from files in HDFS or another storage system, so that a function can be performed on each file record.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Every RDD in Spark is partitioned.
7. What operations does an RDD support?
Transformations.
Actions
An action brings data back from the RDD to the local machine (the driver). An action’s execution is the result of all previously created transformations. For example, reduce() is an action that applies the passed function repeatedly until only one value is left, and take(n) is an action that brings the first n values from the RDD to the local node.
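A small sketch of both actions (the values are illustrative):
val nums = sc.parallelize(List(1, 2, 3, 4))
val sum = nums.reduce((a, b) => a + b)  // combines elements repeatedly until one value is left: 10
val firstTwo = nums.take(2)             // brings the first two elements to the driver: Array(1, 2)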
10. Define functions of SparkCore?
Serving as the base engine, SparkCore performs various important functions like memory
management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage
systems.
11. What is RDD Lineage?
Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions; the best part is that an RDD always remembers how it was built from other datasets.
12. What is Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares
transformations and actions on data RDDs. In simple terms, driver in Spark creates
SparkContext, connected to a given Spark Master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
13. What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution can be configured to run on Spark by setting hive.execution.engine=spark.
Spark uses GraphX for graph processing to build and transform interactive graphs. The
GraphX component enables programmers to reason about structured data at scale.
17. What does MLlib do?
MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
18. What is Spark SQL?
Spark SQL, formerly known as Shark, is a novel module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a different kind of RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in a row. It is similar to a table in a relational database.
19. What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it one of the best big data analytics formats so far.
20. What file systems does Spark support?
Spark can access data in HDFS and the local file system, as well as in other data sources such as HBase and Cassandra (as noted above).
Due to the availability of in-memory processing, Spark executes processing around 10-100x faster than Hadoop MapReduce, which makes use of persistent storage for its data processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, whereas Hadoop implements no iterative computing.
Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is
extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like
Pig and Hive convert their queries into MapReduce phases to optimize them better.
PageRank is an algorithm in GraphX that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank high on the platform.
29. Do you need to install Spark on all nodes of a YARN cluster while running Spark on YARN?
No. Spark runs on top of YARN, so it does not need to be installed on every node of the cluster.
That said, since Spark utilizes more memory and storage space compared to plain Hadoop MapReduce, certain problems may arise. Developers need to be careful while running their applications in Spark; instead of running everything on a single node, the work must be distributed over multiple nodes.
31. How to create RDD?
Spark provides two methods to create an RDD:
• By parallelizing a collection in your driver program. This makes use of SparkContext’s ‘parallelize’ method:
val IntellipaatData = Array(2,4,6,8,10)
val distIntellipaatData = sc.parallelize(IntellipaatData)
• By loading an external dataset from external storage like HDFS or a shared file system.
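For the second method, loading an external dataset in Scala might look like this (the path is a placeholder):
val linesRDD = sc.textFile("hdfs://.../data.txt")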
Assignment - 1
This set of Multiple Choice Questions & Answers (MCQs) focuses on “Big-Data”.
1. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
2. Point out the correct statement :
a) Hadoop do need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided in to lines or records
d) None of the mentioned
3. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop ?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
4. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
5. Point out the wrong statement :
a) Hadoop’s processing capabilities are huge and its real advantage lies in the ability to process terabytes & petabytes of data
b) Hadoop uses a programming model called “MapReduce”; all programs should conform to this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned
6. What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop’s development
7. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
8. __________ can best be described as a programming model used to develop Hadoop-based
applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
9. __________ has the world’s largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
10. Facebook Tackles Big Data With _______ based on Hadoop.
a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
BDA UNIT – I Question Bank
1. List various types of digital data?
2. Why an email placed in the Unstructured category?
3. What category will you place a CCTV footage into?
4. You have just got a book issued from the library. What are the details about the book
that can be placed in an RDBMS table.
5. Which category would you place the consumer complaints and feedback?
6. Which category (structured, semi-structured or Unstructured) will you place a web page
in?
7. Which category (structured, semi-structured or Unstructured) will you place a Power
point presentation in?
8. Which category (structured, semi-structured or Unstructured) will you place a word
document in?
9. __________, a Gartner analyst, coined the term Big Data
10. ____________is the characteristic of data dealing with its retention.
11. ____________is a large data repository that stores data in its native format until it is
needed.
12. _________ is the characteristic of data that explains the spikes in data.
13. Near real time processing or real time processing deals with ___________characteristic
of data.
14. ____________technology helps query data that resides in a computer’s random access
memory (RAM) rather than data stored on Physical disks.
15. Eventual consistency is consistency model used in distributed computing to achieve
high ______
16. A coordinated processing of program by multiple processors, each working on different
parts of the program and using its own operating system and memory called
_________.
17. A collection of independent computers that appear to its users as a single coherent
system is __________.
18. CAP Theorem is also called as ________________
19. System will continue to function even when network partition occurs is called________
20. Every read fetches the most recent write is called _____________
21. A non failing node will return a reasonable response within a reasonable amount of
time is called_______
22. Where BASE is used?
23. What is replica convergence?
24. What is data science?
25. What is WPS?
26. What are the advantages of a shared nothing architecture?
27. What are the guarantees provided by the CAP Theorem?
28. What is SMP?
29. What is MPP?
30.
2 Mark Questions
1. Match the following:
Column A Column B
JSON SOAP
Mongo DB REST
XML JSON
Flexible structure Couch DB
JSON XML
2. What, according to you, are the challenges with unstructured data?
3. State few examples of human generated and machine generated data.
4. What are the characteristics of data?
5. Big Data (Hadoop) will replace the traditional RDBMS and data warehouse. Comment.
6. Mention few top analytics tools.
7. Mention few open source analytics tools
8. Big data analytics is about a tight handshaking between three communities:_________ ,
___________and _________
9. List the three common types of architecture for Multi processor high transaction rate
systems.
10. What are the responsibilities of a Data scientist.
5 Mark questions
1. Match the following:
Column A Column B
NLP Content analytics
Text Analytics Text messages
UIMA Chats
Noisy Unstructured data Text mining
Data mining Comprehend human or natural language input
Noisy Unstructured data Uses methods at the intersection of statistics,
Artificial Intelligence, machine learning & DBs
IBM UIMA
2. Place the following in suitable basket:
i. Email ii. MS Access iii. Images iv. Database
v. Chat conversations vi. Relations / Tables vii. Facebook
viii. Videos ix. MS Excel x. XML
Structured Unstructured Semi structured
UNIT – III
1 Mark questions
1. Partitioner phase belongs to ______ task
2. Combiner is also known as ________
3. What is RecordReader in a Map Reduce?
4. MapReduce sorts the intermediate value based on _________
5. In Map reduce programming, the reduce function is applied ______group at a
time.
6. Explain JobConf in MapReduce.
7. What is a MapReduce Combiner?
8. Define Writable data types in MapReduce.
9. What is OutputCommitter?
10. What is a “map” in Hadoop?
11. What is a “reducer” in Hadoop?
12. What are the parameters of mappers and reducers?
13. What are the key differences between Pig vs MapReduce?
14. What is partitioning?
15. How to set which framework would be used to run mapreduce program?
16. What platform and Java version is required to run Hadoop?
17. Can MapReduce program be written in any language other than Java?
2 Mark questions
1. What is the difference between HDFS block and InputSplit?
2. What is Text Input Format?
3. What is SequenceFileInputFormat?
4. How to set mappers and reducers for Hadoop jobs?
5 Mark questions
1. What are the main components of MapReduce Job?
2. What is Shuffling and Sorting in MapReduce?
3. What is Partitioner and its usage?
4. What is Identity Mapper and Chain Mapper?
5. Name Job control options specified by MapReduce.
10 mark questions
1. Illustrate with a simple example about the working of MapReduce.
2. Write a MapReduce program to find unitwise salary.
3. Write a MapReduce program to arrange the data on user-id, then within the user id sort
them in increasing order of the page count.
4. Explain about the Map task and Reducer task in detail.
UNIT – IV
1 Mark questions:
1. The metastore consists of ______ and a ____________
2. The most commonly used interface to interact with Hive is _________
3. The default metastore for Hive is _________
4. Metastore contains _________of Hive tables.
5. _________is responsible for compilation, optimization and execution of Hive queries.
6. PIG is ______language
7. In Pig, _________ is used to specify data flow.
8. Pig provides an ________to execute data flow
9. _________ and __________ are execution modes of Pig.
10. The interactive mode of Pig is _______________.
11. __________,__________and _________are complex data types of Pig.
12. Pig is used in ___________process.
13. PigStorage() function is case sensitive
14. Local mode is the default mode of Pig.
15. DISTINCT key word removes duplicate fields
16. LIMIT keyword is used to display limited number of tuples in Pig.
17. ORDERBY is used for sorting.
2 Marks questions:
1. Match the following:
Column A Column B
HQL Hive Query Language
Database Namespace
Complex data types Struct, Map
Hive Application Weblogs
Table Set of records
2. Match the following :
Column A Column B
Map Hadoop cluster
Bag An ordered collection of Fields
Local Mode Collection of tuples
Tuple Key/Value pair
Map Reduce Mode Local file system
5 Mark questions
1. Explain in detail how Hive is different from Pig.
2. Perform the following operations using Hive Query language
a) Create a database named “STUDENTS” with comments and database properties,
b) Display a list of databases
c) Describe a database
d) To make the databases current working database
e) To delete or remove a database
3. Write a Pig script for word count. Why is Hive relevant in the Hadoop ecosystem?
4. Explain the Architecture of Pig with a neat sketch.
5.
10 Mark questions:
1. Create a data file for below schemas
Order: custid, itemid, orderdate, deliverydate.
Customer: customerid, Customername, Address, City, state, country.
a. Create a table for Order data and customer data
b. Write a HiveQL to find number of items bought by each customer.
2. Create a data file for below schemas
Order: custid, itemid, orderdate, deliverydate.
Customer: customerid, Customername, Address, City, state, country.
a. Create a table for Order data and customer data
b. Write a Pig Latin script to determine number of items bought by each customer.
3. Explain HIVE architecture in detail.
4. Discuss various data types in Pig.
5. Write a word count program in Pig to count the occurrence of similar words in a file.
UNIT – V
10 Mark Questions
2. Explain the spark components in detail. Also list the features of Spark.
4. What is spark? State the advantages of using Apache spark over Hadoop MapReduce for Big
data processing with example.
UNIT - III
Understand MapReduce and its characteristics, and learn advanced MapReduce concepts
Ingest data using Sqoop and Flume
UNIT – IV
UNIT - V
Do functional programming in Spark, and create and execute Spark applications
Understand resilient distributed datasets (RDD) in detail
Get a thorough understanding of parallel processing in Spark and Spark RDD optimization techniques
Understand the typical use cases of Spark and its various built-in algorithms
Learn Spark SQL: creating, transforming, and querying data frames
http://nptel.ac.in/courses/106104135/48
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/tutorial/