MCA - BigData Notes
UNIT - I
5 Marks
15 marks
UNIT - II
5 Marks
3. Write steps to Find the most popular elements using decaying windows.
15 marks
UNIT - III
5 Marks
15 marks
a. Mapper class
b. Reducer class
c. Scaling out
2. Explain the map reduce data flow with single reduce and multiple reduce.
3. Define HDFS. Describe namenode, datanode and block. Explain HDFS operations in
detail.
4. Write in detail the concept of developing the Map Reduce Application.
UNIT - IV
5 Marks
15 marks
UNIT - V
5 Marks
15 marks
UNIT V – FRAMEWORKS
Applications on Big Data Using Pig and Hive – Data processing operators in
A big data platform acts as an organized storage medium for large amounts of data.
Big data platforms utilize a combination of data management hardware and software tools to
store aggregated data sets, usually onto the cloud.
One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.
Fundamental challenges
● How to store the data.
● More importantly, how to understand the data and turn it into a competitive advantage.
Big data has revolutionized the way businesses operate, but it has also presented a number of
challenges for conventional systems. Here are some of the challenges faced by conventional
systems in handling big data:
Big data is a term used to describe the large amount of data that can be stored and analyzed
by computers. Big data is often used in business, science and government. Big Data has been
around for several years now, but it's only recently that people have started realizing how
important it is for businesses to use this technology in order to improve their operations and
provide better services to customers. A lot of companies have already started using big data
analytics tools because they realize how much potential there is in utilizing these systems
effectively!
However, while there are many benefits associated with using such systems - including faster
processing times as well as increased accuracy - there are also some challenges involved with
implementing them correctly.
Challenges of Conventional System in big data
● Scalability
● Speed
● Storage
● Data Integration
● Security
Scalability
A common problem with conventional systems is that they can't scale. As the amount of data
increases, so does the time it takes to process and store it. This can cause bottlenecks and
system crashes, which are not ideal for businesses looking to make quick decisions based on
their data.
Conventional systems also lack flexibility in how they handle new types of information. For
example, you often cannot add another column (columns are like fields) or row (rows are like
records) without having to rewrite all your code from scratch.
Speed
Speed is a critical component of any data processing system. Speed is important because it
allows you to:
● Process and analyze your data faster, which means you can make better-informed
decisions about how to proceed with your business.
● Make more accurate predictions about future events based on past performance.
Storage
The amount of data being created and stored is growing exponentially; one widely cited
estimate projected that it would reach 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add more data.
This leads to huge amounts of wasted storage space and lost information due to corruption or
security breaches.
Data Integration
The challenges of conventional systems in big data are numerous. Data integration is one of
the biggest challenges, as it requires a lot of time and effort to combine different sources into
a single database. This is especially true when you're trying to integrate data from multiple
sources with different schemas and formats.
Another challenge is errors and inaccuracies in analysis due to lack of understanding of what
exactly happened during an event or transaction. For example, if there was an error while
transferring money from one bank account to another, there would be no way for us to know
what actually happened unless someone told us about it later on (which may not happen).
Security
Security is a major challenge for enterprises that depend on conventional systems to process
and store their data. Traditional databases are designed to be accessed by trusted users within
an organization, but this makes it difficult to ensure that only authorized people have access
to sensitive information.
Security measures such as firewalls, passwords and encryption help protect against
unauthorized access and attacks by hackers who want to steal data or disrupt operations. But
these security measures have limitations: They're expensive; they require constant monitoring
and maintenance; they can slow down performance if implemented too extensively; and they
often don't prevent breaches altogether because there's always some way around them (such
as through phishing emails).
Conventional systems are not equipped for big data. They were designed for a different era,
when the volume of information was much smaller and more manageable. Now that we're
dealing with huge amounts of data, conventional systems are struggling to keep up.
Conventional systems are also expensive and time-consuming to maintain; they require
constant maintenance and upgrades in order to meet new demands from users who want
faster access speeds and more features than ever before.
Because of the 5V's of Big Data, big data and analytics technologies enable your
organisation to become more competitive and to keep growing. Combined with specialised
solutions for its analysis, such as an Intelligent Data Lake, this adds a great deal of value to a
corporation. Let's get started:
The Five Vs of Big Data are widely used to describe its characteristics; a problem is
generally treated as a big data problem if it meets these five criteria:
● Volume
● Value
● Velocity
● Veracity
● Variety
Volume capacity
One of the defining characteristics of big data is its enormous volume. As described above, it
is "data that cannot be controlled by existing general technology," and many people take this
to mean data ranging in size from several terabytes to several petabytes.
The volume of data refers to the size of the data sets that must be examined and managed,
which are now commonly in the terabyte and petabyte ranges. The sheer volume of data
necessitates processing methods that are separate and distinct from standard storage and
processing capabilities. In other words, the data sets in Big Data are too vast to be processed
by a standard laptop or desktop CPU. A high-volume data set would include all credit card
transactions in Europe on a given day.
Value
The most important "V" from a financial perspective, the value of big data typically stems
from insight exploration and information processing, which leads to more efficient
functioning, bigger and more powerful client relationships, and other clear and quantifiable
financial gains.
This refers to the value that big data can deliver, and it is closely related to what enterprises
can do with the data they collect. The ability to extract value from big data is required, as the
value of big data increases considerably based on the insights that can be gleaned from it.
Companies can obtain and analyze the data using the same big data techniques, but how they
derive value from that data should be unique to them.
Variety type
A defining characteristic of Big Data is its diversity. Big Data originates from a wide range of
sources and is often classified as one of three types: structured, semi-structured, or
unstructured data. The multiplicity of data kinds usually necessitates specialised processing
skills and algorithms. CCTV audio and video recordings generated at many points around a
city are an example of a high variety data set.
Big data may not always refer to structured data that is typically managed in a company's
core system. Unstructured data includes text, sound, video, log files, location information,
sensor information, and so on. Of course, some of this unstructured data has been around for a
while. The emphasis going forward is on analysing this information and extracting usable
knowledge from it, rather than merely accumulating it.
Veracity
The quality of the data being studied is referred to as its veracity. High-quality data contains a
large number of records that are useful for analysis and contribute significantly to the total
findings. Data of low veracity, on the other hand, comprises a significant percentage of
useless data. Noise refers to the non-valuable records in these data sets. Data from a medical
experiment or trial is an example of a high veracity data set.
Efforts to collect big data are pointless if they do not result in business value. Big data can and
will be utilised in a broad range of circumstances in the future. To turn big data efforts into
high-value initiatives and consistently capture the value that businesses seek, it is not enough
to introduce new tools and services; operations and services must also be rebuilt around
strategic measures.
To reveal meaningful information, high volume, high velocity, and high variety data must be
processed using advanced tools (analytics and algorithms). Because of these data properties,
the knowledge area concerned with the storage, processing, and analysis of huge data
collections has been dubbed Big Data.
Unstructured data analysis has gained popularity in recent years as a form of big data
analysis. However, some forms of unstructured data are well suited to analysis while others
are not. Here we discuss unstructured data with and without regularity, as well as the
relationship between structured and unstructured data.
A data set typically consists of both structured and unstructured data, of which unstructured
data is stored in its native format. Although nothing is processed until the data is used, this
gives it the advantage of being highly flexible and versatile, because it can be processed
relatively freely at the time of use. It is also easy for humans to recognize and understand as it
is.
Structured data
Structured data is data that is prepared and processed and is saved in business management
system programmes such as SFA, CRM, and ERP, as well as in RDB, as opposed to
unstructured data that is not formed and processed. The information is structured by
"columns" and "rows," similar to spreadsheet tools such as Excel. The data is also saved in a
preset state rather than its natural form, allowing anybody to operate with it.
However, organised data is difficult for people to grasp as it is, and computers can analyse
and calculate it more easily. As a result, in order to use structured data, specialist processing
is required, and the individual handling the data must have some specialised knowledge.
Structured data has the benefit of being easy to manage since it is preset, that is, processed,
and it is also excellent for use in machine learning, for example. Another significant aspect is
that it is interoperable with a wide range of IT tools. Furthermore, structured data is saved in
a schema-on-write database that is designed for specific data consumption, rather than in a
schema-on-read database that keeps the data as is.
RDBs such as Oracle, PostgreSQL, and MySQL can be said to be databases for storing
structured data.
Semi-structured data
Semi-structured data is data that falls between structured and unstructured categories. When
categorised loosely, it is classed as unstructured data, but it is distinguished by the ability to
be handled as structured data as soon as it is processed since the structure of the information
that specifies certain qualities is defined.
It's not clearly structured with columns and rows, yet it's a manageable piece of data because
it's layered and includes regular elements. Examples include .csv and .tsv files. A .csv file is
referred to as a CSV file; because its elements are divided and organised by comma
separation, it occupies an intermediate position and can be viewed as close to structured data.
Semi-structured data, on the other hand, lacks a set format like structured data and maintains
data through the combination of data and tags.
Another distinguishing aspect is that data structures are nested. Semi-structured data formats
include the XML and JSON formats.
XML data is one of the best examples of semi-structured data.
Google Cloud Platform offers NoSQL databases such as Cloud Firestore and Cloud Bigtable
for working with semi-structured data.
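As a small illustration, the hypothetical snippet below shows a nested, tag-like JSON record being handled in Python; the field names are invented for the example:

import json

# A semi-structured record: no fixed table schema, data kept as nested tag/value pairs.
record = '{"user": {"id": 42, "name": "Asha"}, "tags": ["bigdata", "nosql"]}'

parsed = json.loads(record)      # parse the JSON text into Python objects
print(parsed["user"]["name"])    # navigate the nested structure: Asha
print(len(parsed["tags"]))       # 2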
Unstructured data
Unstructured data is more diversified and vast than structured data, and includes email and
social media postings, audio, photos, invoices, logs, and other sensor data. The specifics on
how to utilise each are provided below.
● Image data
Image data includes digital camera photographs, scanned images, 3D images, and so on.
Image data, which is employed in a variety of contexts, is a common format among
unstructured data. In recent years, in addition to being used as material for human judgement,
applications such as face recognition, identification of objects placed at cash registers, and
digitisation of documents by character recognition have been discussed. Image data in the
broad sense also includes video.
● Voice/audio data
Audio data has been around for a long time, having become popular with the introduction of
CDs. However, with the advancement of speech recognition technology and the proliferation
of voice speakers in recent years, voice input has become ubiquitous, and the effective use of
voice data has drawn attention.
Call centres, for example, not only record their replies but also automatically convert them to
text (voice to text) to increase the efficiency of recording and analysis. Audio data is also used
to estimate the emotions of the other party from the tone of voice, and to analyse the sound
emitted by a machine to determine whether an irregularity has occurred.
● Sensor data
With the advancement of IoT, big data analysis, the OT field, and sensor technology, as well
as networking, it is now feasible to collect a broad variety of information, such as
manufacturing process data in factories and indoor temperature, humidity, and density.
Sensor data may be utilised for a variety of purposes, including detecting irregularities on the
production line that result in low yield, rectifying mistakes, and anticipating the timing of
equipment breakdown.
It is also employed in medicine, where initiatives such as forecasting stress and sickness by
monitoring heart rate have become common.
Sensor data of this type is also commonly employed in autonomous driving. To distinguish it
from files such as pictures and Microsoft Office documents, it is sometimes referred to as
semi-structured data.
● Text data
The text data format accounts for a vast volume of unstructured data on the Internet, ranging
from long texts such as books to short posts such as those on Twitter.
It is commonly used for researching brand image from word-of-mouth and SNS postings,
detecting consumer complaints, automatically preparing documents such as minutes using
summary generation technology, and automatically translating languages by scanning text
data.
In this section, we will discuss the benefits (advantages) and drawbacks (disadvantages) of
big data use, based on the characteristics of big data.
Advantages of big data:
One of the components of big data, real-time, provides you an advantage over your
competition. Real-time performance entails the rapid processing of enormous amounts of data
as well as the quick analysis of data that is continually flowing.
Big data contains a component called Veracity (accuracy), and it is distinguished by the
availability of real-time data. Real-time skills allow us to discover market demands rapidly
and use them in marketing and management strategies to build accurate enterprises.
Disadvantages of big data:
The handling of personal data raises privacy concerns. This is a drawback borne by customers
rather than by the firms attempting to increase the accuracy of marketing by utilising big data,
and if these issues grow and legal constraints become stricter, the scope of use may be
limited. Companies that use big data must be prepared to handle data responsibly in
compliance with the Personal Information Protection Act and other regulatory standards.
NATURE OF DATA
To understand the nature of data, we must recall, what are data? And what are the functions
that data should perform on the basis of its classification?
The first point in this is that data should have specific items (values or facts), which must be
identified.
Secondly, specific items of data must be organised into a meaningful form.
Thirdly, data should have the functions to perform.
Furthermore, the nature of data can be understood on the basis of the class to which it
belongs.
We have seen that in the sciences there are six basic types within which there exist fifteen
different classes of data. However, these are not mutually exclusive.
There is a large measure of cross-classification, e.g., all quantitative data are numerical
data, and most data are quantitative data.
Descriptive data: Sciences are not known for descriptive data. However, qualitative data in
sciences are expressed in terms of definitive statements concerning objects. These may be
viewed as descriptive data. Here, the nature of data is descriptive.
Graphic and symbolic data: Graphic and symbolic data are modes of presentation. They
enable users to grasp data by visual perception. The nature of data, in these cases, is graphic.
Likewise, it is possible to determine the nature of data in social sciences also.
Enumerative data: Most data in social sciences are enumerative in nature. However, they
are refined with the help of statistical techniques to make them more meaningful. They are
known as statistical data. This explains the use of different scales of measurement whereby
they are graded.
Descriptive data: All qualitative data in sciences can be descriptive in nature. These can be
in the form of definitive statements. All cataloguing and indexing data are bibliographic,
whereas all management data such as books acquired, books lent, visitors served and
photocopies supplied are non-bibliographic.
Having seen the nature of data, let us now examine the properties, which the data should
ideally possess.
Sampling Distributions
Sampling distribution refers to studying the randomly chosen samples to understand the
variations in the outcome expected to be derived.
Sampling distribution of the mean, sampling distribution of proportion, and T-distribution are
three major types of finite-sample distribution.
Re-Sampling
Resampling is the method that consists of drawing repeated samples from the original data
samples. The method of Resampling is a nonparametric method of statistical inference. In
other words, the method of resampling does not involve the utilization of the generic
distribution tables (for example, normal distribution tables) in order to compute approximate
probability (p) values.
Resampling involves the selection of randomized cases with replacement from the original
data sample in such a manner that each number of the sample drawn has a number of cases
that are similar to the original data sample. Due to replacement, the drawn number of samples
that are used by the method of resampling consists of repetitive cases.
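A minimal sketch of resampling with replacement (bootstrap style) using NumPy; the sample values and the number of resamples are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.9, 6.0, 5.5, 4.8])   # the original data sample

# Draw repeated samples (with replacement) of the same size as the original,
# and compute the statistic of interest (here, the mean) for each resample.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(1000)]

print(np.mean(boot_means), np.std(boot_means))   # bootstrap estimate and its spread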
Statistical Inference
Statistical Inference is defined as the procedure of analyzing the result and making
conclusions from data based on random variation. The two applications of statistical
inference are hypothesis testing and confidence interval. Statistical inference is the technique
of making decisions about the parameters of a population that relies on random sampling. It
enables us to assess the relationship between dependent and independent variables. The idea
of statistical inference is to estimate the uncertainty or sample to sample variation. It enables
us to deliver a range of value for the true value of something in the population. The
components used for making the statistical inference are:
● Sample Size
The two most important types of statistical inference that are primarily used are:
● Confidence Interval
● Hypothesis testing
Statistical inference is applied in areas such as:
● Business Analysis
● Artificial Intelligence
● Financial Analysis
● Fraud Detection
● Machine Learning
● Pharmaceutical Sector
● Share market.
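As a hedged illustration of the confidence-interval idea above, the sketch below computes an approximate 95% interval for a population mean from a small invented sample (a t critical value would be more precise for so few observations):

import math

data = [12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9]
n = len(data)
mean = sum(data) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Approximate 95% confidence interval using the normal critical value 1.96.
margin = 1.96 * sd / math.sqrt(n)
print(f"95% CI for the mean: ({mean - margin:.2f}, {mean + margin:.2f})")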
Prediction error
In statistics, prediction error refers to the difference between the predicted values made by
some model and the actual values.
1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known
as RMSE, which stands for root mean squared error.
It is calculated as:
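RMSE = sqrt( (1/n) * Σ (ŷi − yi)² ), where ŷi is the predicted value, yi is the observed value, and n is the number of observations.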
2. Logistic Regression: Used to predict the value of some binary response variable.
One common way to measure the prediction error of a logistic regression model is with a
metric known as the total misclassification rate.
It is calculated as:
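Total misclassification rate = (number of incorrect classifications) / (total number of classifications).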
The lower the value for the misclassification rate, the better the model is able to predict the
outcomes of the response variable.
Stream Processing
There are several popular stream processing frameworks, including Apache Flink,
Apache Kafka, Apache Storm, and Apache Spark Streaming. These frameworks
provide tools for building and deploying stream processing pipelines, and they can
handle large volumes of data with low latency and high throughput.
Mining data streams refers to the process of extracting useful insights and
patterns from continuous and rapidly changing data streams in real-time. Data streams
are typically high-volume and high-velocity, making it challenging to analyze them
using traditional data mining techniques.
Mining data streams requires specialized algorithms that can handle the dynamic nature
of data streams, as well as the need for real-time processing. These algorithms
typically use techniques such as sliding windows, online learning, and incremental
processing to adapt to changing data patterns over time.
Mining data streams also requires careful consideration of the computational resources
required to process the data in real-time. As a result, many mining data stream
algorithms are designed to work with limited memory and processing power, making
them well-suited for deployment on edge devices or in cloud-based architectures.
Streams can be thought of as a flow of data that can be processed in real-time, rather
than being stored and processed at a later time. This allows for more efficient
processing of large volumes of data and enables applications that require real-time
processing and analysis.
1. Data Source: A stream's data source is the place where the data is generated or received.
This can include sensors, databases, network connections, or other sources.
2. Data Sink: A stream's data sink is the place where the data is consumed or stored.
This can include databases, data lakes, visualization tools, or other destinations.
3. Streaming Data Processing: This refers to the process of continuously processing
data as it arrives in a stream. This can involve filtering, aggregation, transformation, or
analysis of the data.
4. Stream Processing Frameworks: These are software tools that provide an
environment for building and deploying stream processing applications. Popular stream
processing frameworks include Apache Flink, Apache Kafka, and Apache Spark
Streaming.
5. Real-time Data Processing: This refers to the ability to process data as soon as it is
generated or received. Real-time data processing is often used in applications that
require immediate action, such as fraud detection or monitoring of critical systems.
Overall, streams are a powerful tool for processing and analyzing large volumes of data
in real-time, enabling a wide range of applications in fields such as finance, healthcare,
and the Internet of Things.
Stream data model is a data model used to represent the continuous flow of data in a
stream processing system. The stream data model typically consists of a series of
events, which are individual pieces of data that are generated by a data source and
processed by a stream processing system.
1. Data sources: The data sources are the components that generate the events that
make up the stream. These can include sensors, log files, databases, and other data
sources.
2. Stream processing engines: The stream processing engines are the components
responsible for processing the data in real-time. These engines typically use a variety of
algorithms and techniques to filter, transform, aggregate, and analyze the stream of
events.
3. Data sinks: The data sinks are the components that receive the output of the stream
processing engines. These can include databases, data lakes, visualization tools, and
other data destinations.
Some popular stream processing frameworks and architectures include Apache Flink,
Apache Kafka, and Lambda Architecture. These frameworks provide tools and
components for building scalable and fault-tolerant stream processing systems, and can
be used in a wide range of applications, from real-time analytics to internet of things
(IoT) data processing.
Stream Computing
Stream computing is the process of computing and analyzing data streams in real-time.
It involves continuously processing data as it is generated, rather than processing it in
batches. Stream computing is particularly useful for scenarios where data is generated
rapidly and needs to be analyzed quickly.
Stream computing involves a set of techniques and tools for processing and analyzing
data streams, including:
Sampling data in a stream refers to the process of selecting a subset of data points from
a continuous and rapidly changing data stream for analysis. Sampling is a useful
technique for processing data streams when it is not feasible or necessary to process all
data points in real-time.
There are various sampling techniques that can be used for stream data, including:
1. Random sampling: This involves selecting data points from the stream at random
intervals. Random sampling can be used to obtain a representative sample of the
entire stream.
Sampling data in a stream can be used in various applications, such as monitoring and
quality control, statistical analysis, and machine learning. By reducing the amount of
data that needs to be processed in real-time, sampling can help improve the efficiency
and scalability of stream processing systems.
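One standard way to implement the random sampling described above is reservoir sampling, which keeps a fixed-size uniform sample of a stream whose length is not known in advance. A minimal sketch (the reservoir size is an arbitrary choice for illustration):

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # replace an existing item with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))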
Filtering Streams
Filtering streams refers to the process of selecting a subset of data from a data stream
based on certain criteria. This process is often used in stream processing systems to
reduce the amount of data that needs to be processed and to focus on the relevant data.
There are various filtering techniques that can be used for stream data, including:
1. Simple filtering: This involves selecting data points from the stream that meet a specific
condition, such as a range of values, a specific text string, or a certain timestamp.
2. Complex filtering: This involves selecting data points from the stream based on multiple
criteria or complex logic. Complex filtering can involve combining multiple conditions using
Boolean operators such as AND, OR, and NOT.
3. Machine learning-based filtering: This involves using machine learning algorithms
to automatically classify data points in the stream based on past observations. This can
be useful in applications such as anomaly detection or predictive maintenance.
When filtering streams, it is important to consider the trade-off between the amount of
data being filtered and the accuracy of the filtering process. Too much filtering can
result in valuable data being discarded, while too little filtering can result in a large
volume of irrelevant data being processed.
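A minimal sketch of simple and complex filtering over a stream of records in Python (the record fields sensor, value, and ts are invented for the example):

# Each record in the stream is a small dict; the field names are illustrative.
stream = [
    {"sensor": "A", "value": 17.2, "ts": 1},
    {"sensor": "B", "value": 99.5, "ts": 2},
    {"sensor": "A", "value": 55.0, "ts": 3},
]

# Simple filtering: keep records whose value falls inside a range.
simple = [r for r in stream if 10 <= r["value"] <= 60]

# Complex filtering: combine several conditions with Boolean operators.
complex_match = [r for r in stream if r["sensor"] == "A" and not r["value"] > 50]

print(simple)
print(complex_match)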
Counting distinct elements in a stream refers to the process of counting the number of
unique items in a continuous and rapidly changing data stream. This is an important
operation in stream processing because it can help detect anomalies, identify trends,
and provide insights into the data stream.
There are various techniques for counting distinct elements in a stream, including:
1. Exact counting: This involves storing all the distinct elements seen so far in a data
structure such as a hash table or a bloom filter. When a new element is encountered, it
is checked against the data structure to determine if it is a new distinct element.
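A minimal sketch of this exact-counting approach using a Python set as the in-memory data structure (a Bloom filter would trade exactness for lower memory use):

def count_distinct(stream):
    seen = set()              # every distinct element observed so far
    for element in stream:
        seen.add(element)     # adding an existing element has no effect
    return len(seen)

print(count_distinct(["a", "b", "a", "c", "b", "a"]))   # 3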
Estimating Moments
In statistics, moments are numerical measures that describe the shape, central
tendency, and variability of a probability distribution. They are calculated as functions
of the random variables of the distribution, and they can provide useful insights into the
underlying properties of the data.
There are different types of moments, but two of the most commonly used are the mean
(the first moment) and the variance (the second moment). The mean represents the
central tendency of the data, while the variance measures its spread or variability.
To estimate the moments of a distribution from a sample of data, you can use the
following formulas:
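Sample mean (first moment): x̄ = (1/n) Σ xi
Sample variance (second moment about the mean): s² = (1/(n − 1)) Σ (xi − x̄)²
where n is the sample size and the xi are the observed values.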
These formulas provide estimates of the population moments based on the sample data.
The larger the sample size, the more accurate the estimates will be. However, it's
important to note that these formulas only work for certain types of distributions (e.g.,
normal distribution), and for other types of distributions, different formulas may be
required.
6. If the count of the number that just left the window is 1, decrement the count variable.
7. If the count of the number that just entered the window is 1, increment the count variable.
8. Repeat steps 5-7 until you reach the end of the sequence.
Here's some Python code that implements this approach:
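A hedged sketch of steps 6-8 (the earlier steps are not reproduced in these notes); the function name and the example sequence are invented for illustration:

from collections import defaultdict

def distinct_in_windows(sequence, window_size):
    counts = defaultdict(int)   # occurrences of each value inside the current window
    count = 0                   # the running count of distinct values (the "count variable")
    results = []
    for i, value in enumerate(sequence):
        counts[value] += 1
        if counts[value] == 1:              # step 7: the value that just entered is new
            count += 1
        if i >= window_size:                # a value falls out of the window
            old = sequence[i - window_size]
            if counts[old] == 1:            # step 6: its last copy is leaving
                count -= 1
            counts[old] -= 1
        if i >= window_size - 1:
            results.append(count)           # distinct count for the current window
    return results

# Windows of size 3 over a small sequence.
print(distinct_in_windows([1, 2, 2, 3, 1, 1], 3))   # [2, 2, 3, 2]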
Decaying Window
Here's one way you could implement a decaying window in Python using an
exponentially weighted moving average (EWMA):
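A minimal sketch of such a function (the name decaying_window, the example series, and the exact normalisation are assumptions; the weighting follows the description below):

import numpy as np
import pandas as pd

def decaying_window(data, window_size, decay_rate):
    # Weights of the form decay_rate ** (window_size - i): positions later in the window
    # (more recent observations) get exponents closer to zero, hence more weight
    # when decay_rate is between 0 and 1.
    weights = np.array([decay_rate ** (window_size - i)
                        for i in range(1, window_size + 1)])
    # Normalise the weights so they sum to one (a proper weighted average).
    weights = weights / weights.sum()
    # Weighted average over each rolling window of the series.
    return data.rolling(window_size).apply(lambda x: np.dot(x, weights), raw=True)

# Example usage on a small series.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(decaying_window(series, window_size=3, decay_rate=0.9))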
This function takes in a Pandas Series data, a window size window_size, and a decay
rate decay_rate. The decay rate determines how much weight is given to recent
observations relative to older observations. A larger decay rate means that more weight
is given to recent observations.
The function first creates a series of weights using the decay rate and the window size.
The weights are calculated using the formula decay_rate^(window_size - i) where i is
the index of the weight in the series. This gives more weight to recent observations and
less weight to older observations.
Next, the function normalizes the weights so that they sum to one. This ensures that the
weighted average is a proper average.
Finally, the function applies the rolling function to the data using the window size and
a custom lambda function that calculates the weighted average of the window using the
weights.
Note that this implementation uses Pandas' built-in rolling and apply functions, which
are optimized for efficiency. If you're working with large datasets, this implementation
should be quite fast. If you're working with smaller datasets or need more control over
the implementation, you could implement a decaying window using a custom function
that calculates the weighted average directly.
Real time Analytics Platform (RTAP) Applications
3. Supply chain optimization: RTAPs can help companies optimize their supply chain by
monitoring inventory levels, shipment tracking, and demand forecasting. By analyzing this
data in real-time, companies can make better decisions about when to restock inventory,
when to reroute shipments, and how to allocate resources.
Overall, RTAPs can be applied in various industries and domains where real-time
monitoring and analysis of data is critical to achieving business objectives. By
providing insights into streaming data as it happens, RTAPs can help businesses make
faster and more informed decisions.
Case Studies - Real Time Sentiment Analysis
Real-time sentiment analysis is a powerful tool for businesses that want to monitor and
respond to customer feedback in real-time. Here are some case studies of companies
that have successfully implemented real-time sentiment analysis:
2. Coca-Cola: Coca-Cola uses real-time sentiment analysis to monitor social media for
mentions of the brand and to track sentiment over time. The company's marketing team
uses this data to identify trends and to create more targeted marketing campaigns. By
analyzing real-time sentiment data, Coca-Cola can quickly respond to changes in
consumer sentiment and adjust its marketing strategy accordingly.
3. Ford: Ford uses real-time sentiment analysis to monitor customer feedback on social
media and review sites. The company's customer service team uses this data to identify
issues and to respond to complaints in real-time. By analyzing real-time sentiment data,
Ford can quickly identify and address customer concerns, improving the overall
customer experience.
5. Twitter: Twitter uses real-time sentiment analysis to identify trending topics and to
monitor sentiment across the platform. The company's sentiment analysis tool allows
users to track sentiment across various topics and to identify emerging trends. By
analyzing real-time sentiment data, Twitter can quickly identify issues and respond to
changes in user sentiment.
Overall, real-time sentiment analysis is a powerful tool for businesses that want to
monitor and respond to customer feedback in real-time. By analyzing real-time
sentiment data, businesses can quickly identify issues and respond to changes in
customer sentiment, improving the overall customer experience.
Predicting stock market performance is a challenging task, but there have been several
successful case studies of companies using machine learning and artificial intelligence
to make accurate predictions. Here are some examples of successful stock market
prediction case studies:
Overall, these case studies demonstrate the potential of machine learning and artificial
intelligence to make accurate predictions in the stock market. By analyzing large
volumes of data and identifying patterns, these systems can generate investment
strategies that outperform traditional methods. However, it is important to note that the
stock market is inherently unpredictable, and past performance is not necessarily
indicative of future results.
Unit III-Hadoop
History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text
search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug
Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good
at generating such. Googol is a kid’s term.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to
their function, often with an elephant or other animal theme (“Pig,” for example). Smaller
components are given more descriptive (and therefore more mundane) names. This is a good
principle, as it means you can generally work out what something does from its name. For
example, the jobtracker keeps track of MapReduce jobs.
With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network all the
complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with streaming data access pattern and it runs on commodity hardware. Let’s
elaborate the terms:
● Extremely large files: Here we are talking about the data in range of petabytes(1000
TB).
● Streaming Data Access Pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be processed any
number of times.
● Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features which especially distinguishes HDFS from other file
systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. NameNode(MasterNode):
○ Manages all the slave nodes and assign work to them.
○ It executes filesystem namespace operations like opening, closing, renaming
files and directories.
○ It should be deployed on reliable hardware with a high-end configuration, not on
commodity hardware.
2. DataNode(SlaveNode):
○ Actual worker nodes, who do the actual work like reading, writing, processing
etc.
○ They also perform creation, deletion, and replication upon instruction from the
master.
○ They can be deployed on commodity hardware.
● Namenodes:
○ Run on the master node.
○ Store metadata (data about data) such as file paths, the number of blocks, block
IDs, etc.
○ Require a high amount of RAM.
○ Store metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
● DataNodes:
○ Run on slave nodes.
○ Require high memory as data is actually stored here.
Data storage in HDFS: Now let’s see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The master node (namenode) will first divide the
file into blocks of, say, 10 TB each (the default block size is 128 MB in Hadoop 2.x and
above). These blocks are then stored across different datanodes (slave nodes). The datanodes
replicate the blocks among themselves, and information about which blocks they contain is
sent to the master. The default replication factor is 3, meaning for each block 3 replicas are
created (including itself). In hdfs-site.xml we can increase or decrease the replication factor,
i.e. we can edit this configuration there.
Note: MasterNode has the record of everything, it knows the location and info of each and
every single data nodes and the blocks they contain, i.e. nothing is done without the
permission of masternode.
Why divide the file into blocks at all? Let's assume that we don't divide; now it's very
difficult to store a 100 TB file on a single machine. Even if we could store it, each read and
write operation on that whole file would take a very high seek time. But if we have multiple
blocks of size 128 MB, then it becomes easy to perform various read and write operations on
them compared to doing it on the whole file at once. So we divide the file to get faster data
access, i.e. to reduce seek time.
Note: No two replicas of the same block are present on the same datanode.
Features:
Limitations: Though HDFS provides many features there are some areas where it doesn’t
work well.
● Low latency data access: Applications that require low-latency access to data i.e in the
range of milliseconds will not work well with HDFS, because HDFS is designed
keeping in mind that we need high-throughput of data even at the cost of latency.
● Small file problem: Having lots of small files results in lots of seeks and lots of
movement from one datanode to another to retrieve each small file; this whole process is a
very inefficient data access pattern.
Components of Hadoop
Hadoop is a framework that uses distributed storage and parallel processing to store
and manage Big Data. It is the most commonly used software to handle Big Data. There are
three components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of
Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.
Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS - name
node and data node. While there is only one name node, there can be multiple data nodes.
HDFS is specially designed for storing huge datasets in commodity hardware. An enterprise
version of a server costs roughly $10,000 per terabyte for the full processor. In case you need
to buy 100 of these enterprise version servers, it will go up to a million dollars.
Hadoop enables you to use commodity machines as your data nodes. This way, you don’t
have to spend millions of dollars just on your data nodes. However, the name node is always
an enterprise server.
Features of HDFS
Master and slave nodes form the HDFS cluster. The name node is called the master, and the
data nodes are called the slaves.
The name node is responsible for the workings of the data nodes. It also stores the metadata.
The data nodes read, write, process, and replicate the data. They also send signals, known as
heartbeats, to the name node. These heartbeats show the status of the data node.
Consider that 30 TB of data is submitted to the name node. The name node distributes it
across the data nodes, and this data is replicated among the data nodes. In the accompanying
figure, the blue, grey, and red blocks are replicated among the three data nodes.
Replication of the data is performed three times by default. It is done this way, so if a
commodity machine fails, you can replace it with a new machine that has the same data.
Let us focus on Hadoop MapReduce in the following section of the What is Hadoop article.
2.Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the
processing is done at the slave nodes, and the final result is sent to the master node.
Rather than moving the data, code is sent to process the data where it resides. This code is
usually very small in comparison to the data itself; you only need to send a few kilobytes
worth of code to perform a heavy-duty process on computers.
The input dataset is first split into chunks of data. In this example, the input has three lines of
text with three separate entities - “bus car train,” “ship ship train,” “bus ship car.” The dataset
is then split into three chunks, based on these lines, and processed in parallel.
In the map phase, the data is assigned a key and a value of 1. In this case, we have one bus,
one car, one ship, and one train.
These key-value pairs are then shuffled and sorted together based on their keys. At the reduce
phase, the aggregation takes place, and the final output is obtained.
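To make this key-value flow concrete, here is a small Python sketch that simulates the map, shuffle/sort, and reduce phases for the three-line example above (a simulation of the idea, not actual Hadoop code):

from itertools import groupby

chunks = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle and sort: group the pairs by key.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: aggregate the values for each key.
for word, pairs in groupby(mapped, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in pairs))
# bus 2, car 2, ship 3, train 2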
Hadoop YARN is the next concept we shall focus on in the What is Hadoop article.
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management
unit of Hadoop and is available as a component of Hadoop version 2.
● Hadoop YARN acts like an OS for Hadoop. It is not a file system; it is a resource
management layer that works on top of HDFS.
● It is responsible for managing cluster resources to make sure you don't overload one
machine.
● It performs job scheduling to make sure that the jobs are scheduled in the right place.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job
request goes to the resource manager (Hadoop Yarn), which is responsible for resource
allocation and management.
In the node section, each of the nodes has its node managers. These node managers manage
the nodes and monitor the resource usage in the node. The containers contain a collection of
physical resources, which could be RAM, CPU, or hard drives. Whenever a job request
comes in, the app master requests the container from the node manager. Once the node
manager gets the resource, it goes back to the Resource Manager.
To analyze data with Hadoop, you first need to store your data in HDFS. This can be done by
using the Hadoop command line interface or through a web-based graphical interface like
Apache Ambari or Cloudera Manager.
Hadoop also provides a number of other tools for analyzing data, including Apache Hive,
Apache Pig, and Apache Spark. These tools provide higher-level abstractions that simplify
the process of data analysis.
Apache Hive provides a SQL-like interface for querying data stored in HDFS. It translates
SQL queries into MapReduce jobs, making it easier for analysts who are familiar with SQL
to work with Hadoop.
Apache Pig is a high-level scripting language that enables users to write data processing
pipelines that are translated into MapReduce jobs. Pig provides a simpler syntax than
MapReduce, making it easier to write and maintain data processing code.
Apache Spark is a distributed computing framework that provides a fast and flexible way to
process large amounts of data. It provides an API for working with data in various formats,
including SQL, machine learning, and graph processing.
In summary, Hadoop provides a powerful framework for analyzing large amounts of data. By
storing data in HDFS and using MapReduce or other tools like Apache Hive, Apache Pig, or
Apache Spark, you can perform distributed data processing and gain insights from your data
that would be difficult or impossible to obtain using traditional data analysis tools.
Once your data is stored in HDFS, you can use MapReduce to perform distributed data
processing. MapReduce breaks down the data processing into two phases: the map phase and
the reduce phase.
In the map phase, the input data is divided into smaller chunks and processed independently
by multiple mapper nodes in parallel. The output of the map phase is a set of key-value pairs.
In the reduce phase, the key-value pairs produced by the map phase are aggregated and
processed by multiple reducer nodes in parallel. The output of the reduce phase is typically a
summary of the input data, such as a count or an average.
Scaling Out
You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view
of the system and look at the data flow for large inputs. For simplicity, the examples so far
have used files on the local filesystem. However, to scale out, we need to store the data in a
distributed filesystem, typically HDFS (which you’ll learn about in the next chapter), to allow
Hadoop to move the MapReduce computation to each machine hosting a part of the data.
Let’s see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and configuration
information. Hadoop runs the job by dividing it into tasks, of which there are two types:
map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a
number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to
the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the
jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user- defined
map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time
to process the whole input. So if we are processing the splits in parallel, the processing is
better load-balanced when the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine. Even if the
machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become more
fine-grained.
On the other hand, if splits are too small, the overhead of managing the splits and of map task
creation begins to dominate the total job execution time. For most jobs, a good split size tends
to be the size of an HDFS block, 64 MB by default, although this can be changed for the
cluster (for all newly created files) or specified when each file is created.
Hadoop does its best to run the map task on a node where the input data resides in HDFS.
This is called the data locality optimization because it doesn’t use valuable cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas for a map task’s input
split are running other map tasks, so the job scheduler will look for a free map slot on a node
in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-
rack node is used, which results in an inter-rack network transfer. The three possibilities are
illustrated in Fig.
It should now be clear why the optimal split size is the same as the block size: it is the largest
size of input that can be guaranteed to be stored on a single node. If the split spanned two
blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split
would have to be transferred across the network to the node running the map task, which is
clearly less efficient than running the whole map task using local Data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is
intermediate output: it’s processed by reduce tasks to produce the final output, and once the
job is complete, the map output can be thrown away. So storing it in HDFS with replication
would be overkill. If the node running the map task fails before the map output has been
consumed by the reduce task, then Hadoop will automatically
rerun the map task on another node to re-create the map output.
Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is
normally the output from all mappers. In the present example, we have a single reduce task
that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred
across the network to the node where the reduce task is running, where they are merged and
then passed to the user-defined reduce function. The output of the reduce is normally stored
in HDFS for reliability. As explained for each HDFS block of the reduce output, the first
replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus,
writing the reduce output does consume network bandwidth, but only as much as a normal
HDFS write pipeline consumes.
The whole data flow with a single reduce task is illustrated in the below Figure. The dotted
boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows
show data transfers between nodes.
Fig .MapReduce data flow with a single reduce task
The number of reduce tasks is not governed by the size of the input, but instead is specified
independently. In “The Default MapReduce Job” on page 227, you will see how to choose
the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in each
partition, but the records for any given key are all in a single partition. The partitioning can
be controlled by a user-defined partitioning function, but normally the default partitioner—
which buckets keys using a hash function—works very well.
The data flow for the general case of multiple reduce tasks is illustrated in below image. This
diagram makes it clear why the data flow between map and reduce tasks is colloquially
known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more
complicated than this diagram suggests, and tuning it can have a big impact on job execution
time.
MapReduce data flow with multiple reduce tasks
Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t
need the shuffle because the processing can be carried out entirely in parallel . In this case,
the only off-node data transfer is when the map tasks write to HDFS (see Figure)
Hadoop Streaming
It is a utility or feature that comes with a Hadoop distribution that allows developers
or programmers to write the Map-Reduce program using different programming languages
like Ruby, Perl, Python, C++, etc. We can use any language that can read from the standard
input(STDIN) like keyboard input and all and write using standard output(STDOUT). We all
know the Hadoop Framework is completely written in Java, but programs for Hadoop do not
necessarily need to be coded in the Java programming language. The Hadoop Streaming
feature has been available since Hadoop version 0.14.1.
In the above example image, we can see that the flow shown in a dotted block is a basic
MapReduce job. In that, we have an Input Reader which is responsible for reading the input
data and produces the list of key-value pairs. We can read data in .csv format, in delimiter
format, from a database table, image data(.jpg, .png), audio data etc. The only requirement to
read all these types of data is that we have to create a particular input format for that data
with these input readers. The input reader contains the complete logic about the data it is
reading. Suppose we want to read an image then we have to specify the logic in the input
reader so that it can read that image data and finally it will generate key-value pairs for that
image data.
If we are reading an image data then we can generate key-value pair for each pixel where the
key will be the location of the pixel and the value will be its color value from (0-255) for a
colored image. Now this list of key-value pairs is fed to the Map phase and Mapper will work
on each of these key-value pair of each pixel and generate some intermediate key-value pairs
which are then fed to the Reducer after doing shuffling and sorting then the final output
produced by the reducer will be written to the HDFS. These are how a simple Map-Reduce
job works.
Now let’s see how we can use different languages like Python, C++, Ruby with Hadoop for
execution. We can run this arbitrary language by running them as a separate process. For that,
we will create our external mapper and run it as an external separate process. These external
map processes are not part of the basic MapReduce flow. This external mapper will take
input from STDIN and produce output to STDOUT. As the key-value pairs are passed to the
internal mapper, the internal mapper process will send these key-value pairs to the external
mapper, where we have written our code in some other language such as Python, with the
help of STDIN. These external mappers then process the key-value pairs, generate
intermediate key-value pairs, and send them back to the internal mappers with the help of
STDOUT.
Similarly, the Reducer does the same thing. Once the intermediate key-value pairs are
processed through the shuffle and sort process, they are fed to the internal reducer, which
sends these pairs to the external reducer process (running separately) through STDIN, gathers
the output generated by the external reducers through STDOUT, and finally stores the output
to HDFS.
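As an illustration of such external mappers and reducers, here is a minimal word-count pair in Python that reads from STDIN and writes to STDOUT, the contract Hadoop Streaming expects. The file names mapper.py and reducer.py are assumptions, and the exact path of the streaming jar differs between Hadoop distributions.

#!/usr/bin/env python3
# mapper.py - emit a tab-separated (word, 1) pair for every word read from STDIN.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - STDIN arrives sorted by key, so counts for the same word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Locally these can be tested with a shell pipeline such as cat input.txt | python3 mapper.py | sort | python3 reducer.py before submitting them to the cluster with the hadoop-streaming jar's -mapper and -reducer options.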
This is how Hadoop Streaming works; it is available by default in Hadoop. We are just
utilizing this feature by supplying our own external mappers and reducers. Now we can see
how powerful a feature Hadoop Streaming is: anyone can write code in the language of their
own choice.
Design of HDFS
Java interfaces to HDFS
In this section, we dig into the Hadoop’s FileSystem class: the API for interacting with one of
Hadoop’s filesystems. Although we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the FileSystem
abstract class, to retain portability across filesystems. This is very useful when testing your
program, for example, because you can rapidly run tests using data stored on the local
filesystemIn this section, we dig into the Hadoop’s FileSystem class: the API for interacting
with one of Hadoop’s filesystems. Although we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the FileSystem
abstract class, to retain portability across filesystems. This is very useful when testing your
program, for example, because you can rapidly run tests using data stored on the local
filesystem
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme.
This is achieved by calling the setURLStreamHandlerFactory method on URL with an
instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it
is typically executed in a static block. This limitation means that if some other part of your
program (perhaps a third-party component outside your control) sets a
URLStreamHandlerFactory, you won't be able to use this approach for reading data from
Hadoop.
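Put together as a complete program, the idiom above looks roughly like the following sketch
(the class name URLCat is illustrative):

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Reads a file from a Hadoop filesystem URL and copies it to standard output.
public class URLCat {

    static {
        // setURLStreamHandlerFactory may only be called once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

With the Hadoop JARs on the classpath, it can be run as, for example,
hadoop URLCat hdfs://localhost/user/tom/quangle.txt.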
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File
object, since its semantics are too closely tied to the local filesystem). You can think of a Path
as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the
filesystem we want to use—HDFS in this case. There are several static factory methods for
getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
A Configuration object encapsulates a client or server’s configuration, which is set using
configuration files read from the classpath, such as conf/core-site.xml. The first method
returns the default filesystem (as specified in the file conf/core-site.xml, or the default local
filesystem if not specified there). The second uses the given URI’s scheme and authority to
determine the filesystem to use, falling back to the default filesystem if no scheme is
specified in the given URI. The third retrieves the filesystem as the given user.
In some cases, you may want to retrieve a local filesystem instance, in which case you can
use the convenience method getLocal():
public static LocalFileSystem getLocal(Configuration conf) throws IOException
Displaying files from a Hadoop filesystem on standard output using the FileSystem directly
follows the same pattern: get a FileSystem instance, open the file as a stream, copy it to
System.out, and close the stream in a finally block:
in = fs.open(new Path(url));
} finally {
    IOUtils.closeStream(in);
}
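Filled out as a complete program, the same logic looks roughly like the following sketch (the
class name FileSystemCat is illustrative):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output, using FileSystem directly.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}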
FSDataInputStream
The open() method on FileSystem actually returns an FSDataInputStream rather than a
standard java.io class. This class, which lives in the org.apache.hadoop.fs package, is a
specialization of java.io.DataInputStream with support for random access, so you can read
from any part of the stream.
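Because of this random-access support (the Seekable interface's seek() method), a program can
re-read any part of a file. A minimal sketch in the spirit of the examples above (the class name
is illustrative):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Prints a file twice, using seek() to rewind to the start between the two reads.
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}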
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
Like FSDataInputStream, the FSDataOutputStream class lives in the org.apache.hadoop.fs
package.
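A minimal sketch of writing data by copying a local file to a Hadoop filesystem with create();
the file paths are supplied on the command line and the progress callback simply prints a dot:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// Copies a local file to a Hadoop filesystem, printing a dot each time progress is reported.
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true); // true closes both streams when done
    }
}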
● Run the job on the full dataset and, if it fails, debug it using Hadoop's debugging tools.
The first stage in developing a MapReduce application is the Mapper class. Here, the
RecordReader processes each input record and generates the corresponding key-value pair.
The intermediate data produced by the mapper is stored on the local disk.
The intermediate output generated by the mapper is fed to the reducer, which processes it
and generates the final output, which is then saved in HDFS. A minimal mapper and reducer
sketch is shown below.
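As an illustration, a minimal word-count Mapper and Reducer might look like the following
sketch (the class names are illustrative and not prescribed by these notes):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts for each word and writes the total.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}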
Driver code
The major component in a MapReduce job is the Driver class. It is responsible for setting up a
MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes,
along with the data types and their respective job names.
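A minimal driver sketch, assuming the illustrative WordCountMapper and WordCountReducer
classes from the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the word-count job to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job and polls its progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}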
Debugging a MapReduce Application
Log files are essential for debugging. Log files can be found on the local filesystem of
each TaskTracker, and if JVM reuse is enabled, each log accumulates the entire JVM run.
Anything written to standard output or standard error is directed to the relevant log file.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
● The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
● The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map task.
Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a
part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is running.
The individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.
Advantages of MapReduce
● Scalability: work is distributed across many commodity nodes.
● Fault tolerance: failed tasks are automatically re-executed.
● Data locality and parallel processing keep large jobs fast.
These are described in more detail under Features of MapReduce below.
Limitations Of MapReduce
● MapReduce cannot cache intermediate data in memory for later use, which diminishes
Hadoop's performance.
● It is only suitable for batch processing of huge amounts of data.
Job Submission :
● The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
● Having submitted the job, waitForCompletion polls the job’s progress once per
second and reports the progress to the console if it has changed since the last report.
● When the job completes successfully, the job counters are displayed. Otherwise, the
error that caused the job to fail is logged to the console.
● Asks the resource manager for a new application ID, used for the MapReduce job ID.
● Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to
the MapReduce program.
● Computes the input splits for the job. If the splits cannot be computed (because the
input paths don't exist, for example), the job is not submitted and an error is thrown to
the MapReduce program.
● Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the shared filesystem in a
directory named after the job ID.
● Submits the job by calling submitApplication() on the resource manager.
Job Initialization :
● When the resource manager receives a call to its submitApplication() method, it hands
off the request to the YARN scheduler.
● The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management.
● The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster .
● It initializes the job by creating a number of bookkeeping objects to keep track of the
job’s progress, as it will receive progress and completion reports from the tasks.
● It retrieves the input splits computed in the client from the shared filesystem.
● It then creates a map task object for each split, as well as a number of reduce task
objects determined by the mapreduce.job.reduces property (set by the
setNumReduceTasks() method on Job).
Task Assignment:
● If the job does not qualify for running as an uber task (a small job that the application
master runs sequentially in its own JVM), then the application master requests
containers for all the map and reduce tasks in the job from the resource manager.
● Requests for map tasks are made first and with a higher priority than those for reduce
tasks, since all the map tasks must complete before the sort phase of the reduce can
start.
● Requests for reduce tasks are not made until 5% of map tasks have completed.
Job Scheduling
Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in
order of submission, using a FIFO scheduler. Typically, each job would use the whole
cluster, so jobs had to wait their turn. Although a shared cluster offers great potential for
offering large resources to many users, the problem of sharing resources fairly between users
requires a better scheduler. Production jobs need to complete in a timely manner, while
allowing users who are making smaller ad hoc queries to get results back in a reasonable time.
Later on, the ability to set a job’s priority was added, via the mapred.job.priority property or
the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH,
HIGH, NORMAL, LOW, or VERY_LOW). When the job scheduler is choosing the next job
to run, it selects one with the highest priority. However, with the FIFO scheduler, priorities
do not support preemption, so a high-priority job can still be blocked by a long-running, low-
priority job that started before the high-priority job was scheduled.
MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce is the
original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair
Scheduler and the Capacity Scheduler.
Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity
Scheduler allows multiple tenants to share a large Hadoop cluster. For each job queue, we
provide some slots or cluster resources for performing job operations, and each job queue has
its own slots to perform its tasks. If there are tasks to perform in only one queue, the tasks of
that queue can also use the slots of the other queues while those slots are free; when a new
task arrives for one of those other queues, the borrowed slots are handed back so that the
queue can run its own jobs.
The Capacity Scheduler also provides a level of abstraction for seeing which tenant is using
more of the cluster resources or slots, so that a single user or application does not take a
disproportionate or unnecessary share of slots in the cluster. The Capacity Scheduler mainly
contains three types of queues, root, parent, and leaf, which are used to represent the cluster,
an organization or any subgroup, and the point of application submission, respectively.
Advantage:
● Best for working with Multiple clients or priority jobs in a Hadoop cluster
● Maximizes throughput in the Hadoop cluster
Disadvantage:
● More complex
● Not easy to configure for everyone
Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler, and the priority of each job is
taken into consideration. With the help of the Fair Scheduler, YARN applications can share
the resources of a large Hadoop cluster, and these resources are maintained dynamically, so
there is no need to reserve capacity in advance. The resources are distributed in such a manner
that all applications within a cluster get an essentially equal share over time. The Fair
Scheduler makes scheduling decisions on the basis of memory, but it can be configured to
schedule with CPU as well.
As noted, it is similar to the Capacity Scheduler, but the major difference is that in the Fair
Scheduler, whenever a high-priority job arrives in the same queue, it is processed in parallel
by taking over some portion of the already dedicated slots.
Task Execution:
● Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by
contacting the node manager.
● The task is executed by a Java application whose main class is YarnChild. Before it
can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
● Finally, it runs the map or reduce task.
Streaming:
● Streaming runs special map and reduce tasks for the purpose of launching the user
supplied executable and communicating with it.
● The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
● During execution of the task, the Java process passes input key-value pairs to the
external process, which runs them through the user-defined map or reduce function
and passes the output key-value pairs back to the Java process.
● From the node manager’s point of view, it is as if the child ran the map or reduce code
itself.
● MapReduce jobs are long running batch jobs, taking anything from tens of seconds to
hours to run.
● A job and each of its tasks have a status, which includes such things as the state of the
job or task (e.g., running, successfully completed, failed), the progress of maps and
reduces, the values of the job's counters, and a status message or description (which
may be set by user code).
● When a task is running, it keeps track of its progress (i.e., the proportion of the task
completed).
● For map tasks, this is the proportion of the input that has been processed.
● For reduce tasks, it’s a little more complex, but the system can still estimate the
proportion of the reduce input processed.
It does this by dividing the total progress into three parts, corresponding to the three phases of
the shuffle.
● As the map or reduce task runs, the child process communicates with its parent
application master through the umbilical interface.
● The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the
umbilical interface.
How status updates are propagated through the MapReduce System
● The resource manager web UI displays all the running applications with links to the
web UIs of their respective application masters,each of which displays further details
on the MapReduce job, including its progress.
● During the course of the job, the client receives the latest status by polling the
application master every second (the interval is set via
mapreduce.client.progressmonitor.pollinterval).
Job Completion:
● When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to Successful.
● Then, when the Job polls for status, it learns that the job has completed successfully,
so it prints a message to tell the user and then returns from the waitForCompletion() .
● Finally, on job completion, the application master and the task containers clean up
their working state and the Output Committer’s commitJob () method is called.
● Job information is archived by the job history server to enable later interrogation by
users if desired.
Task execution
Once the resource manager's scheduler assigns resources for a container on a particular node,
the application master starts the container by contacting the node manager. The task is
executed by a Java application whose main class is YarnChild.
Before it can run the task, it localizes the resources that the task needs, including the job
configuration, the JAR file, and any files from the distributed cache. Finally, it runs the map
or reduce task. Any bugs in the user-defined map and reduce functions (or even in YarnChild)
do not affect the node manager, because YarnChild runs in a dedicated JVM, so the node
manager cannot be affected by a crash or hang of the task.
Each task can perform setup and commit actions, which run in the same JVM as the task
itself; these are determined by the OutputCommitter for the job. For file-based jobs, the
commit action moves the task output from its initial position to its final location. When
speculative execution is enabled, the commit protocol ensures that only one of the duplicate
tasks is committed and the other one is aborted.
What does Streaming mean?
Streaming runs special map and reduce tasks for the purpose of launching the user-supplied
executable and communicating with it. It communicates with the process using standard input
and output streams. During execution of the task, the Java process passes input key-value
pairs to the external process, which runs them through the user-defined map or reduce
function and passes the output key-value pairs back to the Java process.
From the node manager's point of view, it is as if the child process ran the map or reduce code
itself. MapReduce jobs can take anywhere from tens of seconds to hours to run, which is why
they are long-running batch jobs. Because this can be a significant length of time, it is
important for the user to get feedback on how the job is progressing. Each job and each of its
tasks has a status, including the state of the job or task, the values of the job's counters, the
progress of maps and reduces, and a description or status message. These statuses change over
the course of the job.
When a task is running, it keeps track of its progress (i.e., the proportion of the task
completed). For map tasks, this is the proportion of the input that has been processed. For
reduce tasks, it is a little more complex, but the system can still estimate the proportion of the
reduce input processed.
Process involved
In Hadoop, there are various MapReduce types for InputFormat that are used for various
purposes. Let us now look at the MapReduce types of InputFormat:
FileInputFormat
It serves as the foundation for all file-based InputFormats. FileInputFormat also provides the
input directory, which contains the location of the data files. When we start a MapReduce
task, FileInputFormat returns a path with files to read. This Input Format will read all files.
Then it divides these files into one or more InputSplits.
TextInputFormat
It is the default InputFormat. Each line of each input file is treated as a separate record by
this InputFormat, and it does not parse anything. TextInputFormat is suitable for raw data or
line-based records, such as log files. Hence:
● Key: the byte offset of the beginning of the line within the file (not within the split). As
a result, when paired with the file name, it is unique.
● Value: the contents of the line, excluding any line terminators.
KeyValueTextInputFormat
It is similar to TextInputFormat, but it splits each line into a key and a value at a separator
character (a tab by default). Hence:
● Key: the part of the line up to (but not including) the tab character.
● Value: the remaining part of the line after the tab character.
SequenceFileInputFormat
It's an input format for reading sequence files. Binary files are sequence files. These files also
store binary key-value pair sequences. These are block-compressed and support direct
serialization and deserialization of a variety of data types. Hence Key & Value are both user-
defined.
SequenceFileAsTextInputFormat
It is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to
Text objects so that they can be used with text-based tools.
NLineInputFormat
It is a variant of TextInputFormat in which, as with TextInputFormat, the keys are the line's
byte offset and the values are the line's contents. The difference is that with TextInputFormat
and KeyValueTextInputFormat each mapper receives a variable number of lines of input,
determined by the size of the split and the length of the lines, whereas with NLineInputFormat
each mapper receives a fixed, configurable number of lines. So, if we want our mapper to
receive a specific number of lines of input, we use NLineInputFormat.
Assuming N=2, each split has two lines. As a result, the first two Key-Value pairs are
distributed to one mapper. The second two key-value pairs are given to another mapper.
DBInputFormat
Using JDBC, this InputFormat reads data from a relational Database. It also loads small
datasets, which might be used to connect with huge datasets from HDFS using multiple
inputs. Hence:
● Key: LongWritables
● Value: DBWritables.
The output format classes work in the opposite direction as their corresponding input format
classes. The TextOutputFormat, for example, is the default output format that outputs records
as plain text files, although key values can be of any type and are converted to strings by
using the toString() method. The tab character separates the key and the value by default, but
this can be changed by modifying the separator attribute of the text output format.
DBOutputFormat handles the output formats for relational databases and HBase. It saves the
compressed output to a SQL table.
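As an illustration of how these formats are selected in code, the following hedged Java sketch
sets the input and output formats on a Job; the property names shown are the commonly
documented ones, and the job itself is illustrative rather than complete:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // KeyValueTextInputFormat: split each line into key and value at a comma
        // instead of the default tab character.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        // TextOutputFormat writes key<separator>value; change the default tab to a pipe.
        conf.set("mapreduce.output.textoutputformat.separator", "|");

        Job job = Job.getInstance(conf, "format configuration sketch");

        // TextInputFormat is the default; here we choose KeyValueTextInputFormat instead.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Alternatively, NLineInputFormat with a fixed number of lines per mapper:
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 2);

        job.setOutputFormatClass(TextOutputFormat.class);
        // Mapper, reducer, and input/output paths would be configured here as usual.
    }
}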
Features of MapReduce
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for
Big Data applications.
Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as
needed.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it is
stored, minimizing data movement across the network and improving overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather than
low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.
Parallel Programming
The work is divided into independent tasks so that they can be executed simultaneously.
Programs therefore run faster thanks to parallel processing, because the distributed tasks can
be carried out by multiple processors at the same time.
UNIT IV
HADOOP ENVIRONMENT
A Hadoop cluster is a combined group of commodity hardware units connected to a dedicated
server that is used as the sole data-organizing source; this server works as the centralized unit
throughout processing. In simple terms, it is a cluster set up for computational tasks, and it
helps distribute the workload of analyzing data. The workload over a Hadoop cluster is
distributed among several nodes, which work together to process data. It can be explained by
considering the following terms:
1. Distributed Data Processing: In distributed data processing, MapReduce processes and
analyzes large amounts of data. A JobTracker is assigned to coordinate all the
functionality, and apart from the JobTracker there are DataNodes and TaskTrackers.
All of these play a large role in processing the data.
2. Distributed Data Storage: It allows storing huge amounts of data via the NameNode
and the Secondary NameNode, which work together with the DataNodes and
TaskTrackers.
How does a Hadoop Cluster make working so easy?
It plays an important role in collecting and analyzing data in a proper way. It is useful in
performing a number of tasks, which brings ease to any task.
● Add nodes: It is easy to add nodes to the cluster to help in other functional areas.
Without the nodes, it is not possible to scrutinize the data from unstructured units.
● Data Analysis: This special type of cluster is compatible with parallel computation for
analyzing the data.
● Fault tolerance: The data stored on any single node can become unavailable, so the
cluster keeps copies of the data on other nodes.
While working with a Hadoop cluster it is important to understand its architecture, as follows:
● Master Nodes: The master node plays a great role in collecting a huge amount of data
in the Hadoop Distributed File System (HDFS). Apart from that, it coordinates the
storage of data and parallel computation by applying MapReduce.
● Slave nodes: These are responsible for storing the data and carrying out the actual
computation; during any computation, the slave nodes are responsible for the status
and the result.
● Client nodes: Hadoop is installed on them along with the cluster configuration
settings. When the Hadoop cluster needs data to be loaded, it is the client node that is
responsible for this task.
Advantages: a Hadoop cluster is scalable (nodes can be added easily), cost-effective (it runs on
commodity hardware), resilient to failure through data replication, and flexible about the kinds
of data it can analyze.
Setting Up a Hadoop Cluster
This section describes how to install and configure a basic Hadoop cluster from scratch using
the Apache Hadoop distribution on a Unix operating system. It provides background
information on the things you need to think about when setting up Hadoop. For a production
installation, most users and operators should consider one of the Hadoop cluster management
tools.
Installing Java
Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed.
For a production installation, you should select a combination of operating system, Java, and
Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There
is also a page on the Hadoop wiki that lists combinations that community members have run
with success.
It’s good practice to create dedicated Unix user accounts to separate the Hadoop processes
from each other, and from other services running on the same machine. The HDFS,
MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and
yarn, respectively. They all belong to the same hadoop group.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the
distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that
Hadoop should not be installed in a user’s home directory, as that may be an NFS-mounted
directory):
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
You also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
It's convenient to put the Hadoop binaries on the shell path too:
% export HADOOP_HOME=/usr/local/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Configuring SSH
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
Operations. For example, there is a script for stopping and starting all the daemons in the
cluster. Note that the control scripts are optional—cluster-wide operations can be performed
by other mechanisms, too, such as a distributed shell or dedicated Hadoop management
applications. To work seamlessly, SSH needs to be set up to allow passwordless login for the
hdfs and yarn users from machines in the cluster. The simplest way to achieve this is to
generate a public/private key pair and place it in an NFS location that is shared across the
cluster.
First, generate an RSA key pair by typing the following. You need to do this twice, once as
the hdfs user and once as the yarn user:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
Even though we want passwordless logins, keys without passphrases are not considered good
practice (it’s OK to have an empty passphrase when running a local pseudo distributed
cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We
use ssh-agent to avoid the need to enter a password for each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is
stored in a file with the same name but with .pub appended, ~/.ssh/id_rsa.pub.
Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the
machines in the cluster that we want to connect to. If the users' home directories are stored on
an NFS filesystem, the keys can be shared across the cluster by typing the following (first as
hdfs and then as yarn):
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, the public keys will need to be shared by some
other means (such as ssh-copy-id).Test that you can SSH from the master to a worker
machine by making sure ssh-agent is running, and then run ssh-add to store your passphrase.
You should be able to SSH to a worker without entering the passphrase again.
To set up and install Hadoop in pseudo-distributed mode on Windows, use the steps given
below. Let's discuss them one by one.
Step 1: Download Binary Package :
http://hadoop.apache.org/releases.html
For reference, save the downloaded file to a folder such as:
C:\BigData
Open Git Bash, change directory (cd) to the folder where you saved the binary package, and
then unzip it as follows.
$ cd C:\BigData
MINGW64: C:\BigData
Next, go to this GitHub repo and download the bin folder as a zip, as shown below. Extract
the zip and copy all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin,
replacing the existing files.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
If you don't have Java 1.8 installed, you'll need to download and install it first. If the
JAVA_HOME environment variable is already set, check whether the path has any spaces in
it (for example, C:\Program Files\Java\… ). Spaces in the JAVA_HOME path will cause
problems; a trick to get around this is to replace 'Program Files' with 'Progra~1' in the variable
value. Ensure that the version of Java is 1.8 and that JAVA_HOME points to JDK 1.8.
Now we have set the environment variables, we need to validate them. Open a new Windows
Command prompt and run an echo command on each variable to confirm they are assigned
the desired values.
If the variables are not initialized yet, it is probably because you are testing them in an old
session. Make sure you have opened a new command prompt to test them.
Once environment variables are set up, we need to configure Hadoop by editing the following
configuration files.
After editing core-site.xml, you need to set the replication factor and the locations of the
namenode and datanodes. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add
the settings within the <configuration> </configuration> tags.
Step 7: Edit core-site.xml
Finally, let's configure the properties for the MapReduce framework. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the settings within the
<configuration> </configuration> tags. If you don't see mapred-site.xml, open the
mapred-site.xml.template file and rename it to mapred-site.xml.
Check if C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present, if it’s not then created
one and add localhost in it and save it.
To format the NameNode, open another Windows Command Prompt and run the command
below (hdfs namenode -format). It may give you a few warnings; ignore them. Then start the
following daemons:
● namenode
● datanode
● node manager
● resource manager
Don’t close these windows, minimize them. Closing the windows will terminate the
daemons. You can run them in the background if you don’t like to see these windows.
Finally, let's monitor how the Hadoop daemons are doing. You can also use the web UI for all
kinds of administrative and monitoring purposes. Open your browser and get started.
HDFS is capable of handling large volumes of data with high velocity and variety, which
makes Hadoop work more efficiently and reliably, with easy access to all its components.
HDFS stores data in the form of blocks, where the size of each data block is 128 MB by
default; this is configurable, meaning you can change it according to your requirements in the
hdfs-site.xml file in your Hadoop directory. HDFS works with two main types of nodes:
1. NameNode(Master)
2. DataNode(Slave)
1. NameNode: NameNode works as a Master in a Hadoop cluster that Guides the
Datanode(Slaves). Namenode is mainly used for storing the Metadata i.e. nothing but the data
about the data. Meta Data can be the transaction logs that keep track of the user’s activity in a
Hadoop cluster.
Meta Data can also be the name of the file, size, and the information about the location(Block
number, Block ids) of Datanode that Namenode stores to find the closest DataNode for Faster
Communication. Namenode instructs the DataNodes with the operation like delete, create,
Replicate, etc.
As our NameNode is working as a Master it should have a high RAM or Processing power in
order to Maintain or Guide all the slaves in a Hadoop cluster. Namenode receives heartbeat
signals and block reports from all the slaves i.e. DataNodes.
2. DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in
a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more, and the
more DataNodes the cluster has, the more data it can store. It is therefore advised that
DataNodes have a high storage capacity to hold a large number of file blocks. DataNodes
perform operations such as block creation and deletion according to the instructions received
from the NameNode. A sketch of querying this block metadata from a client is shown below.
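To make the NameNode's metadata concrete, a client can ask HDFS for a file's length, block
size, and the DataNodes holding each block. A minimal hedged sketch (the file path is
illustrative):

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Asks the NameNode for file metadata and for the DataNodes holding each block.
public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost/user/hadoop/sample.txt"; // illustrative path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus status = fs.getFileStatus(new Path(uri));
        System.out.println("File length : " + status.getLen());
        System.out.println("Block size  : " + status.getBlockSize()); // typically 128 MB

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block is replicated on several DataNodes; getHosts() lists them.
            System.out.println("Offset " + block.getOffset() + " -> "
                    + Arrays.toString(block.getHosts()));
        }
    }
}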
Objectives and Assumptions Of HDFS
1. System Failure: Since a Hadoop cluster consists of lots of nodes built from commodity
hardware, node failure is possible, so a fundamental goal of HDFS is to detect such failures
and recover from them.
2. Maintaining Large Datasets: As HDFS handles files ranging in size from GB to PB, it has
to be able to manage these very large datasets on a single cluster.
3. Moving Data is Costlier than Moving the Computation: If the computation is performed
near the location where the data resides, it is much faster, and the overall throughput of the
system is increased while network congestion is minimized; HDFS therefore assumes it is
better to move the computation to the data.
4. Portable Across Various Platforms: HDFS possesses portability, which allows it to run
across diverse hardware and software platforms.
5. Simple Coherency Model: The Hadoop Distributed File System uses a write-once,
read-many access model for files. A file, once written and closed, should not be changed;
data can only be appended. This assumption helps to minimize data-coherency issues, and
MapReduce fits perfectly with this kind of file model.
6. Scalability: HDFS is designed to be scalable as the data storage requirements increase
over time. It can easily scale up or down by adding or removing nodes to the cluster. This
helps to ensure that the system can handle large amounts of data without compromising
performance.
7. Security: HDFS provides several security mechanisms to protect data stored on the
cluster. It supports authentication and authorization mechanisms to control
access to data, encryption of data in transit and at rest, and data integrity checks to detect any
tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data resides rather than
moving the data to the computation. This approach minimizes network traffic and enhances
performance by processing data on local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes it a cost-
effective solution for large-scale data processing. Additionally, the ability to scale up or down
as required means that organizations can start small and expand over time, reducing upfront
costs.
10. Support for Various File Formats: HDFS is designed to support a wide range of file
formats, including structured, semi-structured, and unstructured data. This makes it easier to
store and process different types of data using a single system, simplifying data management
and reducing costs.
HDFS administration:
HDFS administration and MapReduce administration both come under Hadoop
administration.
● HDFS administration: It includes monitoring the HDFS file structure, file locations,
and updated files.
● MapReduce administration: It includes monitoring the list of applications, the
configuration of nodes, and application status.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal setup cost.
Benchmarks are packaged in the tests JAR file, and you can get a list of them, with
descriptions, by invoking the JAR file with no arguments:
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar
Most of the benchmarks show usage instructions when invoked with no arguments. For
example:
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
TestDFSIO
TestDFSIO.1.7
Missing arguments.
Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input.
It is very useful for benchmarking HDFS and MapReduce together, as the full input dataset is
transferred through the shuffle. The three steps are: generate some random data, perform the
sort, then validate the results.
First, we generate some random data using teragen (found in the examples JAR file, not the
tests one). It runs a map-only job that generates a specified number of rows of binary data.
Each row is 100 bytes long, so to generate one terabyte of data using 1,000 maps, run the
following (10t is short for 10 trillion):
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teragen -Dmapreduce.job.maps=1000 10t random-data
Next, run terasort on the generated data:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
terasort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it’s instructive to
watch the job’s progress via the web UI (http://resource-manager-host:8088/), where you can
get a feel for how long each phase of the job takes.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teravalidate sorted-data report
This command runs a short MapReduce job that performs a series of checks on the sorted
data to verify that the sort is accurate. Any errors can be found in the report/ output directory.
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
• TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a
convenient way to read or write files in parallel.
• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good
counterpoint to TeraSort, as it checks whether small job runs are responsive.
• SWIM, or the Statistical Workload Injector for MapReduce, is a repository of real life
MapReduce workloads that you can use to generate representative test workloads for your
system.
Hadoop on AWS
Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process and
analyze large datasets using the latest versions of big data processing frameworks such as
Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.
● Ability to launch Amazon EMR clusters in minutes, with no need to manage node
configuration, cluster setup, Hadoop configuration or cluster tuning.
● Simple and predictable pricing— flat hourly rate for every instance-hour, with the
ability to leverage low-cost spot Instances.
● Ability to provision one, hundreds, or thousands of compute instances to process data
at any scale.
● Amazon provides the EMR File System (EMRFS) to run clusters on demand based on
persistent HDFS data in Amazon S3. When the job is done, users can terminate the
cluster and store the data in Amazon S3, paying only for the actual time the cluster
was running.
Hadoop on Azure
Azure HDInsight is a managed, open-source analytics service in the cloud. HDInsight allows
users to leverage open-source frameworks such as Hadoop, Apache Spark, Apache Hive,
LLAP, Apache Kafka, and more, running them in the Azure cloud environment.
Azure HDInsight is a cloud distribution of Hadoop components. It makes it easy and cost-
effective to process massive amounts of data in a customizable environment. HDInsights
supports a broad range of scenarios such as extract, transform, and load (ETL), data
warehousing, machine learning, and IoT.
● Read and write data stored in Azure Blob Storage and configure several Blob Storage
accounts.
● Implement the standard Hadoop FileSystem interface for a hierarchical view.
● Choose between block blobs to support common use cases like MapReduce and page
blobs for continuous write use cases like HBase write-ahead log.
● Use wasb scheme-based URLs to reference file system paths, with or without SSL
encrypted access.
● Set up HDInsight as a data source in a MapReduce job or a sink.
Hadoop on Google Cloud
Google Dataproc is a fully managed cloud service for running Apache Hadoop and Spark
clusters. It provides enterprise-grade security, governance, and support, and can be used for
general-purpose data processing, analytics, and machine learning.
Dataproc uses Cloud Storage (GCS) data for processing and stores it in GCS, Bigtable, or
BigQuery. You can use this data for analysis in your notebook and send logs to Cloud
Monitoring and Logging.
****************
UNIT V – FRAMEWORKS
Applications on Big Data Using Pig and Hive – Data processing operators in Pig
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has
a component known as Pig Engine that accepts the Pig Latin scripts as input and converts
those scripts into MapReduce jobs.
Features of Pig
Apache Pig comes with the following features −
Listed below are the major differences between Apache Pig and MapReduce.
As shown in the figure, there are various components in the Apache Pig framework. Let us
take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop producing the desired results.
After downloading the Apache Pig software, install it in your Linux environment by
following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig
directory in the user named Hadoop).
Step 2
Extract the downloaded Apache Pig tar file (pig-0.15.0-src.tar.gz).
Step 3
Move the content of the pig-0.15.0-src.tar.gz file to the Pig directory created earlier as shown
below.
Configure Apache Pig
.bashrc file
In the .bashrc file, set the PIG_HOME variable to the Pig installation directory, add
$PIG_HOME/bin to the PATH, and set PIG_CLASSPATH to the etc/hadoop folder of your
Hadoop installation (the folder containing the configuration files).
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you
can set various parameters as given below.
Verify the installation of Apache Pig by typing the version command (pig -version). If the
installation is successful, you will get the version of Apache Pig as shown below.
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDF’s.
As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the
outermost structure of the Pig Latin data model. It is a bag, where:
● A bag is a collection of tuples.
● A tuple is an ordered set of fields.
● A field is a piece of data.
While processing data using Pig Latin, statements are the basic constructs.
● These statements work with relations. They include expressions and schemas.
● Every statement ends with a semicolon (;).
● We will perform various operations using operators provided by Pig Latin, through
statements.
● Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
● As soon as you enter a Load statement in the Grunt shell, its semantic checking will
be carried out. To see the contents of the schema, you need to use the Dump operator.
Only after performing the dump operation, the MapReduce job for loading the data
into the file system will be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Simple Types
● int – Example: 8
● long – Example: 5L
● float – Example: 5.5F
● double – Example: 10.5
● datetime – Example: 1970-01-01T00:00:00.000+00:00
● bigdecimal – Example: 185.98376256272893883
Complex Types
● bag – Example: {(raju,30),(Mohhammad,45)}
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does.
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and
b = 20. The comparison operators include:
● > (Greater than): checks whether the value of the left operand is greater than the value
of the right operand; if yes, the condition becomes true. Example: (a > b) is not true.
● < (Less than): checks whether the value of the left operand is less than the value of the
right operand; if yes, the condition becomes true. Example: (a < b) is true.
● <= (Less than or equal to): checks whether the value of the left operand is less than or
equal to the value of the right operand; if yes, the condition becomes true.
Example: (a <= b) is true.
The following table describes the Type construction operators of Pig Latin.
Pig Latin's relational operations include the following groups:
● Loading and Storing: LOAD loads data from the file system (local/HDFS) into a
relation.
● Filtering
● Sorting
● Diagnostic Operators
Hive :
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Benefits :
○ Ease of use
○ Accelerated initial insertion of data
○ Superior scalability, flexibility, and cost-efficiency
○ Streamlined security
○ Low overhead
○ Exceptional working capacity
HBase :
HBase is a column-oriented non-relational database management system that runs on
top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many
big data use cases
HBase does support writing applications in Apache Avro, REST and Thrift.
Application :
○ Medical
○ Sports
○ Web
○ Oil and petroleum
○ E-commerce
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-
● Thrift Server - It is a cross-language service provider platform that serves requests
from all programming languages that support Thrift.
● JDBC Driver - It is used to establish a connection between Hive and Java applications.
The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver
(a usage sketch follows this list).
● ODBC Driver - It allows the applications that support the ODBC protocol to connect
to Hive.
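As a hedged sketch of how a Java application might use the JDBC driver mentioned above:
the host, port, credentials, and the employee table are illustrative, and recent Hive releases
ship the HiveServer2 driver org.apache.hive.jdbc.HiveDriver with a jdbc:hive2:// URL, while
the class named in these notes is the older HiveServer1 driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to HiveServer2 over JDBC and runs a simple query.
public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 driver; the older HiveServer1 driver is
        // org.apache.hadoop.hive.jdbc.HiveDriver with a jdbc:hive:// URL.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://localhost:10000/default"; // illustrative host/port/database
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, salary FROM employee")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2) + "\t" + rs.getDouble(3));
            }
        }
    }
}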
Hive Services
● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
● Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
● Hive MetaStore - It is a central repository that stores all the structure information of
the various tables and partitions in the warehouse. It also includes metadata about each
column and its type, the serializers and deserializers used to read and write data, and
the corresponding HDFS files where the data is stored.
● Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
● Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
● Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
● Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.
HiveQL
Hive’s SQL dialect, called HiveQL, is a mixture of SQL-92, MySQL, and Oracle’s SQL
dialect. The level of SQL-92 support has improved over time, and will likely continue to get
better. HiveQL also provides features from later SQL standards, such as window functions
(also known as analytic functions) from SQL:2003. Some of Hive’s non-standard extensions
to SQL were inspired by MapReduce, such as multi table inserts and the TRANSFORM,
MAP, and REDUCE clauses .
Data Types
Hive supports both primitive and complex data types. Primitives include numeric, Boolean,
string, and timestamp types.
Integer Types
Decimal Type
Date/Time Types
TIMESTAMP
DATES
The Date value is used to specify a particular year, month, and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type
lies between 0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. It values can be enclosed within single quotes (') or
double quotes (").
Varchar
The varchar is a variable-length type whose length lies between 1 and 65535, which specifies
the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type; its maximum length is fixed at 255.
Complex Types
● Map - It contains key-value tuples, where the fields are accessed using array notation.
Example: map('first','James','last','Roy').
In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.
● Hive also allows assigning properties with the database in the form of key-value pair.
HiveQL - Operators
The HiveQL operators facilitate to perform various arithmetic and relational operations.
Here, we are going to execute such type of operations on the records of the below table:
Example of Operators in Hive
Let's create a table and load the data into it by using the following steps: -
In Hive, the arithmetic operator accepts any numeric type. The commonly used arithmetic
operators are: -
Operator and description:
● A/B - divides A by B and returns the quotient of the operands.
● Let's see an example to find out the 10% salary of each employee.
1. hive> select id, name, (salary * 10) /100 from employee;
Relational Operators in Hive
In Hive, the relational operators are generally used with clauses like Join and Having to
compare the existing records. The commonly used relational operators are: -
Operator and description:
● A <> B, A != B - returns NULL if A or B is NULL; true if A is not equal to B;
otherwise false.
● Let's see an example to fetch the details of the employee having salary<25000.
1. hive> select * from employee where salary < 25000;
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google’s big table designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
HBase and HDFS
HDFS vs HBase:
● HDFS does not support fast individual record lookups; HBase provides fast lookups
for larger tables.
● HDFS provides high-latency batch processing, with no concept of record-level
processing; HBase provides low-latency access to single rows from billions of records
(random access).
● HDFS provides only sequential access to data; HBase internally uses hash tables and
provides random access, and it stores the data in indexed HDFS files for faster
lookups.
HBase is a column-oriented database, and the tables in it are sorted by row. The table schema
defines only column families, which are the key-value pairs. A table can have multiple
column families, and each column family can have any number of columns. Subsequent
column values are stored contiguously on disk, and each cell value of the table has a
timestamp. In short, in HBase:
● A table is a collection of rows.
● A row is a collection of column families.
● A column family is a collection of columns.
● A column is a collection of key-value pairs.
Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. Shortly, they will have column families.
● Row-oriented databases are designed for a small number of rows and columns;
column-oriented databases are designed for huge tables.
● Row-oriented databases are thin, built for small tables, and hard to scale;
column-oriented databases such as HBase are built for wide tables and are
horizontally scalable.
Features of HBase
● Apache HBase is used to have random, real-time read/write access to Big Data.
● It hosts very large tables on top of clusters of commodity hardware.
● Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts up on Google File System, likewise Apache HBase works on top of Hadoop and
HDFS.
Applications of HBase
● It is used whenever there is a need to write heavy applications.
● HBase is used whenever we need to provide fast random access to available data.
● Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.
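A hedged sketch of how the client library is typically used from Java; the table name
employee and the column family personal are illustrative and assumed to have been created
already (for example from the HBase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one row to an HBase table and reads it back by row key.
public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Put: row key "row1", column family "personal", column "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Raju"));
            table.put(put);

            // Get: random read of the same row by its key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}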
MasterServer
● Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
● Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
● Maintains the state of the cluster by negotiating the load balancing.
● Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
When we take a deeper look into the region server, it contain regions and stores as shown
below:
The store contains memory store and HFiles. Memstore is just like a cache memory.
Anything that is entered into the HBase is stored here initially. Later, the data is transferred
and saved in Hfiles as blocks and the memstore is flushed.
Zookeeper
Architecture of ZooKeeper
Part and description:
● Client: Clients, the nodes in our distributed application cluster, access information
from the server. At a particular time interval, every client sends a message to the
server to let the server know that the client is alive.
● Server: Servers, the nodes in our ZooKeeper ensemble, provide all the services to
clients. A server gives an acknowledgement to the client to inform it that the server is
alive.
● Ensemble: A group of ZooKeeper servers. The minimum number of nodes required to
form an ensemble is 3.
● Leader: The server node which performs automatic recovery if any of the connected
nodes fails. Leaders are elected on service startup.
Hierarchical Namespace
The following diagram depicts the tree structure of ZooKeeper file system used for memory
representation. ZooKeeper node is referred as znode. Every znode is identified by a name and
separated by a sequence of path (/).
● In the diagram, first you have a root znode separated by “/”. Under root, you have two
logical namespaces config and workers.
● The config namespace is used for centralized configuration management and the
workers namespace is used for naming.
● Under config namespace, each znode can store upto 1MB of data. This is similar to
UNIX file system except that the parent znode can store data as well. The main
purpose of this structure is to store synchronized data and describe the metadata of the
znode. This structure is called as ZooKeeper Data Model.
Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides
the metadata of a znode. It consists of Version number, Action control list (ACL),
Timestamp, and Data length.
Types of Znodes
Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are
executed in FIFO order. Once a client connects to a server, the session will be established and
a session id is assigned to the client.
The client sends heartbeats at a particular time interval to keep the session valid. If the
ZooKeeper ensemble does not receive heartbeats from a client for more than the period
(session timeout) specified at the starting of the service, it decides that the client died.
Session timeouts are usually represented in milliseconds. When a session ends for any reason,
the ephemeral znodes created during that session also get deleted.
Watches
Watches are a simple mechanism for the client to get notifications about the changes in the
ZooKeeper ensemble. Clients can set watches while reading a particular znode. Watches send
a notification to the registered client for any of the znode (on which client registers) changes.
Znode changes are modification of data associated with the znode or changes in the znode’s
children. Watches are triggered only once. If a client wants a notification again, it must be
done through another read operation. When a connection session expires, the client will be
disconnected from the server and the associated watches are also removed.
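A hedged Java sketch of a ZooKeeper client that opens a session, creates a znode if it does not
exist, and registers watches; the connection string and the znode path are illustrative:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Opens a ZooKeeper session, creates a znode if needed, and registers a one-time watch on it.
public class ZooKeeperSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // 5000 ms is the session timeout; heartbeats keep the session alive.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
                System.out.println("Watch fired: " + event);
            }
        });
        connected.await();

        String path = "/mca_demo"; // illustrative znode
        Stat stat = zk.exists(path, true); // true = register a watch on this znode
        if (stat == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, true, null); // watches fire once; re-register on each read
        System.out.println("data = " + new String(data));

        zk.close(); // ends the session; ephemeral znodes created in it would be deleted
    }
}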
SQOOP
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as
relational databases (MS SQL Server, MySQL). To process data using Hadoop, the data first
needs to be loaded into Hadoop clusters from several sources.
However, it turned out that the process of loading data from several heterogeneous sources
was extremely challenging. The problems administrators encountered included:
● Maintaining data consistency
● Ensuring efficient utilization of resources
● Loading bulk data to Hadoop was not possible
● Loading data using scripts was slow
The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of
the traditional approach and it could load bulk data from RDBMS to Hadoop with ease.
Now that we've understood about Sqoop and the need for Sqoop, as the next topic in this
Sqoop tutorial, let's learn the features of Sqoop
Sqoop has several features, which makes it helpful in the Big Data world:
1. Parallel Import/Export: Sqoop uses the YARN framework to import and export data.
This provides fault tolerance on top of parallelism.
2. Import results of an SQL query: Sqoop enables us to import the results returned from
an SQL query into HDFS.
3. Connectors for all major RDBMSs: Sqoop provides connectors for multiple RDBMSs,
such as MySQL and Microsoft SQL Server.
4. Full load: Sqoop can load the entire table or parts of the table with a single command.
After going through the features of Sqoop as a part of this Sqoop tutorial, let us understand
the Sqoop architecture.
Sqoop Architecture
Now, let’s dive deep into the architecture of Sqoop, step by step:
1. The client submits the import/ export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse,
document-based systems, and a relational database. We have a connector for each of these;
connectors help to work with a range of accessible databases.
3. Multiple mappers perform map tasks to load the data into HDFS.
4. Similarly, numerous map tasks will export the data from HDFS on to the RDBMS using
the Sqoop export command.
Sqoop Import
The import command first introspects the database to gather the necessary metadata for the
data to be imported. It then submits a map-only job: Sqoop divides the input dataset into
splits and uses individual map tasks to push the splits to HDFS.
Sqoop Export
1. Sqoop first introspects the database for the metadata of the target table.
2. Sqoop then divides the input dataset into splits and uses individual map tasks to push the
splits to the RDBMS.
A few of the arguments commonly used in Sqoop export are --connect, --username and
--password, --table (the target table), --export-dir (the HDFS directory to export), and -m (the
number of mappers).
After understanding the Sqoop import and export, the next section in this Sqoop tutorial is the
processing that takes place in Sqoop.
Sqoop Processing
Processing takes place step by step, as shown below:
1. Sqoop runs in the Hadoop cluster.
2. It imports data from the RDBMS or NoSQL database to HDFS.
3. It uses mappers to slice the incoming data into multiple formats and loads the data into
HDFS.
4. It exports data back into the RDBMS while ensuring that the schema of the data in the
database is maintained.
********************