
Question Bank

UNIT - I

5 Marks

1. Write about intelligent data analysis.


2. Describe any five characteristics of Big Data.

3. What is Intelligent Data Analytics?

4. Write about Analysis Vs Reporting.

15 marks

1. Write short notes,


a. Conventional challenges in big data.
b. Nature of Data

2. Define the different inferences in big data analytics.

3. Describe the Modern Data Analytic tools

4. What are sampling and sampling distributions? Give a detailed analysis.

UNIT - II

5 Marks

1. What is a data stream? Write types of data streams in detail.

2. Write about Counting distinct elements in a stream.

3. Write steps to Find the most popular elements using decaying windows.

4. What is Real Time Analytics? Discuss their technologies in detail

15 marks

1. Discuss the 14 insights of InfoSphere in data streams.


2. Explain the different applications of data streams in detail.
3. Explain the stream model and Data stream management system architecture.
4. What is Real Time Analytics? Discuss their technologies in detail.
5. Explain the Prediction methodologies.
UNIT - III

5 Marks

1. What is Hadoop? Explain its components.

2. How do you analyze the data in Hadoop?

3. List the failures in MapReduce.

4. Discuss the various MapReduce types and formats.

15 marks

1. Explain the following

a. Mapper class

b. Reducer class

c. Scaling out

2. Explain the MapReduce data flow with a single reduce task and with multiple reduce tasks.
3. Define HDFS. Describe namenode, datanode and block. Explain HDFS operations in
detail.
4. Write in detail about the concept of developing a MapReduce application.

UNIT - IV

5 Marks

1. What are the different types of Hadoop configuration files? Discuss.

2. What is benchmarking, and how does it work in Hadoop?

3. Write the steps for upgrading HDFS.

15 marks

1. What is Cluster? Explain the setting up of a Hadoop cluster.


2. What are the additional configuration properties to set for HDFS?
3. Discuss administering Hadoop with its checkpointing process diagram.
4. How is security implemented in Hadoop? Justify.

UNIT - V
5 Marks

1. What is Pig? Explain its installation process.

2. How will you query the data in HIVE?

3. Give a detailed note on HBase.

15 marks

1. Explain the two execution types (modes) in Pig.


2. Explain the process of installing Hive and the features of Hive.
3. What is ZooKeeper? Explain its features and applications.
4. What is HiveQL? Explain its features.

MCAD2232- BIG DATA AND ITS APPLICATIONS

UNIT I - INTRODUCTION TO BIG DATA

Introduction to Big Data Platform – Challenges of Conventional Systems -

Intelligent data analysis Nature of Data - Analytic Processes and Tools -

Analysis vs Reporting - Modern Data Analytic Tools - Statistical Concepts:

Sampling Distributions - Re-Sampling - Statistical Inference - Prediction

Error.

UNIT II - MINING DATA STREAMS

Introduction to Streams Concepts – Stream Data Model and Architecture –

Stream Computing - Sampling Data in a Stream – Filtering Streams –

Counting Distinct Elements in a Stream – Estimating Moments – Counting

Oneness in a Window – Decaying Window - Real time Analytics

Platform (RTAP) Applications - Case Studies - Real Time Sentiment

Analysis, Stock Market Predictions.

UNIT III – HADOOP


History of Hadoop- The Hadoop Distributed File System – Components of

Hadoop-Analyzing the Data with Hadoop- Scaling Out- Hadoop Streaming-

Design of HDFS - Java interfaces to HDFS - Basics - Developing a Map Reduce

Application-How Map Reduce Works-Anatomy of a Map Reduce Job run-

Failures-Job Scheduling-Shuffle and Sort – Task execution - Map Reduce

Types and Formats- Map Reduce Features

UNIT IV - HADOOP ENVIRONMENT

Setting up a Hadoop Cluster - Cluster specification - Cluster Setup and

Installation - Hadoop Configuration-Security in Hadoop - Administering

Hadoop – HDFS - Monitoring-Maintenance-Hadoop benchmarks- Hadoop in the cloud

UNIT V – FRAMEWORKS

Applications on Big Data Using Pig and Hive – Data processing operators in

Pig –Hive services – HiveQL – Querying Data in Hive - fundamentals of

HBase and ZooKeeper –SQOOP

TEXT BOOKS

1. Michael Berthold, David J. Hand, “Intelligent Data Analysis”, Springer,

2007 (Unit 1).

2. Tom White, “Hadoop: The Definitive Guide”, Third Edition, O’Reilly

Media, 2012 (Units 3, 4 & 5).

3. Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive

Datasets”, Cambridge Press, 2012. (Unit 2).


UNIT I - INTRODUCTION TO BIG DATA

Introduction to Big Data Platform

A big data platform acts as an organized storage medium for large amounts of data. It
works to wrangle this information, storing it in a manner that is organized and understandable
enough to extract useful insights. Big data platforms utilize a combination of data management
hardware and software tools to aggregate data on a massive scale, usually in the cloud.

What is big data?


Big data is a term used to describe data of great variety, huge volumes, and even more
velocity. Apart from the significant volume, big data is also complex such that none of the
conventional data management tools can effectively store or process it. The data can be
structured or unstructured.

Challenges of Conventional Systems

One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.

Fundamental challenges

● How to store the data

● How to work with voluminous data sizes

● And, more importantly, how to understand the data and turn it into a competitive
advantage

Big data has revolutionized the way businesses operate, but it has also presented a number of
challenges for conventional systems. Here are some of the challenges faced by conventional
systems in handling big data:

Big data is a term used to describe the large amount of data that can be stored and analyzed
by computers. Big data is often used in business, science and government. Big Data has been
around for several years now, but it's only recently that people have started realizing how
important it is for businesses to use this technology in order to improve their operations and
provide better services to customers. A lot of companies have already started using big data
analytics tools because they realize how much potential there is in utilizing these systems
effectively!

However, while there are many benefits associated with using such systems - including faster
processing times as well as increased accuracy - there are also some challenges involved with
implementing them correctly.
Challenges of Conventional System in big data

● Scalability
● Speed
● Storage
● Data Integration
● Security

Scalability
A common problem with conventional systems is that they can't scale. As the amount of data
increases, so does the time it takes to process and store it. This can cause bottlenecks and
system crashes, which are not ideal for businesses looking to make quick decisions based on
their data.
Conventional systems also lack flexibility in how they handle new types of
information: for example, you cannot easily add another column (columns are like fields) or row
(rows are like records) without having to rewrite all your code from scratch.

Speed
Speed is a critical component of any data processing system. Speed is important because it
allows you to:

● Process and analyze your data faster, which means you can make better-informed
decisions about how to proceed with your business.
● Make more accurate predictions about future events based on past performance.

Storage
The amount of data being created and stored is growing exponentially, with estimates that it
will reach 44 zettabytes by 2020. That's a lot of storage space!
The problem with conventional systems is that they don't scale well as you add more data.
This leads to huge amounts of wasted storage space and lost information due to corruption or
security breaches.

Data Integration
The challenges of conventional systems in big data are numerous. Data integration is one of
the biggest challenges, as it requires a lot of time and effort to combine different sources into
a single database. This is especially true when you're trying to integrate data from multiple
sources with different schemas and formats.
Another challenge is errors and inaccuracies in analysis due to a lack of understanding of what
exactly happened during an event or transaction. For example, if there was an error while
transferring money from one bank account to another, there would be no way for us to
know what actually happened unless someone tells us about it later (which may not
happen).

Security
Security is a major challenge for enterprises that depend on conventional systems to process
and store their data. Traditional databases are designed to be accessed by trusted users within
an organization, but this makes it difficult to ensure that only authorized people have access
to sensitive information.
Security measures such as firewalls, passwords and encryption help protect against
unauthorized access and attacks by hackers who want to steal data or disrupt operations. But
these security measures have limitations: They're expensive; they require constant monitoring
and maintenance; they can slow down performance if implemented too extensively; and they
often don't prevent breaches altogether because there's always some way around them (such
as through phishing emails).

Conventional systems are not equipped for big data. They were designed for a different era,
when the volume of information was much smaller and more manageable. Now that we're
dealing with huge amounts of data, conventional systems are struggling to keep up.
Conventional systems are also expensive and time-consuming to maintain; they require
constant maintenance and upgrades in order to meet new demands from users who want
faster access speeds and more features than ever before.

Because of the five V's of Big Data, big data and analytics technologies enable your
organisation to become more competitive and to grow. This, when combined with
specialised solutions for its analysis, such as an Intelligent Data Lake, adds a great deal of
value to a corporation. Let's get started:

The five V's of Big Data are widely used to describe its characteristics: a problem is
considered a Big Data problem if it meets these five criteria.

The five V's of big data are:

● Volume
● Value
● Velocity
● Veracity
● Variety

These are the five V characteristics of Big Data.

Volume capacity
One of the characteristics of big data is its enormous capacity. According to the above
description, it is "data that cannot be controlled by existing general technology," although it
appears that many people believe the amount of data ranges from several terabytes to several
petabytes.
The volume of data refers to the size of the data sets that must be examined and managed,
which are now commonly in the terabyte and petabyte ranges. The sheer volume of data
necessitates processing methods that are separate and distinct from standard storage and
processing capabilities. In other words, the data sets in Big Data are too vast to be processed
by a standard laptop or desktop CPU. A high-volume data set would include all credit card
transactions in Europe on a given day.

Value
The most important "V" from a financial perspective, the value of big data typically stems
from insight exploration and information processing, which leads to more efficient
functioning, bigger and more powerful client relationships, and other clear and quantifiable
financial gains.
This refers to the value that big data can deliver, and it is closely related to what enterprises
can do with the data they collect. The ability to extract value from big data is required, as the
value of big data increases considerably based on the insights that can be gleaned from it.
Companies can obtain and analyze the data using the same big data techniques, but how they
derive value from that data should be unique to them.

Variety type
Big Data is very massive due to its diversity. Big Data originates from a wide range of
sources and is often classified as one of three types: structured, semi-structured, or
unstructured data. The multiplicity of data kinds usually necessitates specialised processing
skills and algorithms. CCTV audio and video recordings generated at many points around a
city are an example of a high variety data set.
Big data may not always refer to structured data that is typically managed in a company's
core system. Unstructured data includes text, sound, video, log files, location information,
sensor information, and so on. Of course, some of this unstructured data has been around for a
while. Going forward, efforts are being made to analyse this information and extract usable
knowledge from it, rather than merely accumulating it.

Velocity Frequency / Speed


The pace at which data is created is referred to as its velocity. High velocity data is created at
such a rapid rate that it necessitates the use of unique (distributed) processing techniques.
Twitter tweets or Facebook postings are examples of data that is created at a high rate.
Other examples include POS (point of sale) data created 24 hours a day at convenience stores
across the country and boarding-history data generated from transportation IC cards; in
today's fast-changing market environment, such data must be responded to in real time.

Veracity
The quality of the data being studied is referred to as its veracity. High-quality data contains a
large number of records that are useful for analysis and contribute significantly to the total
findings. Data of low veracity, on the other hand, comprises a significant percentage of
useless data; the non-valuable portion of such data sets is referred to as noise. Data from a medical
experiment or trial is an example of a high veracity data set.
Efforts around big data are pointless if they do not result in business value. Big data can and
will be utilised in a broad range of circumstances in the future. To turn big data efforts into
high-value initiatives and consistently capture the value that businesses seek, it is not enough
to introduce new tools and services; operations and services themselves must be rebuilt
around strategic measures.

To reveal meaningful information, high volume, high velocity, and high variety data must be
processed using advanced tools (analytics and algorithms). Because of these data properties,
the knowledge area concerned with the storage, processing, and analysis of huge data
collections has been dubbed Big Data.

Unstructured data analysis has gained popularity in recent years as a form of big data
analysis. Some forms of unstructured data, however, are suited to data analysis while others
are not. This section discusses data with and without the regularity of unstructured data, as
well as the link between structured and unstructured data.

A data set generally consists of structured and unstructured data, of which unstructured data
is stored in its native format. Because nothing is processed until the data is actually used,
unstructured data has the advantage of being highly flexible and versatile: it can be
processed relatively freely at the time of use, and it is also easy for humans to recognize and
understand as it is.

Structured data

Structured data is data that is prepared and processed and is saved in business management
system programmes such as SFA, CRM, and ERP, as well as in RDB, as opposed to
unstructured data that is not formed and processed. The information is structured by
"columns" and "rows," similar to spreadsheet tools such as Excel. The data is also saved in a
preset state rather than its natural form, allowing anybody to operate with it.

However, organised data is difficult for people to grasp as it is, and computers can analyse
and calculate it more easily. As a result, in order to use structured data, specialist processing
is required, and the individual handling the data must have some specialised knowledge.

Structured data has the benefit of being easy to manage since it is preset, that is, processed,
and it is also excellent for use in machine learning, for example. Another significant aspect is
that it is interoperable with a wide range of IT tools. Furthermore, structured data is saved in
a schema-on-write database that is meant for specific data consumption, rather than in a
schema-on-read database that keeps the data as is.

RDBs such as Oracle, PostgreSQL, and MySQL can be said to be databases for storing
structured data.

Data with the following extensions or formats is structured data:

● .csv

● RDBMS

Semi-structured data

Semi-structured data is data that falls between structured and unstructured categories. When
categorised loosely, it is classed as unstructured data, but it is distinguished by the ability to
be handled as structured data as soon as it is processed since the structure of the information
that specifies certain qualities is defined.

It's not strictly structured with columns and rows, yet it's a manageable piece of data because
it's layered and includes regular elements. Examples include .csv and .tsv. While a .csv is
referred to as a CSV file, the fact that its elements are divided and organised by comma
separation places it at an intermediate point that may be viewed as structured data.

Semi-structured data, on the other hand, lacks a set format like structured data and maintains
data through the combination of data and tags.

Another distinguishing aspect is that data structures are nested. Semi-structured data formats
include the XML and JSON formats.
XML data is the best example of semi-structured data.

Google Cloud Platform offers NoSQL databases such as Cloud Firestore and Cloud Bigtable
for working with semi-structured data.

Examples of structured data


ID, NAME, DATE
1, hoge, 2020/08/01 00:00
2, foo, 2020/08/02 00:00
3, bar, 2020/08/03 00:00

Data with the following extensions is semi-structured data:

● JSON


● Avro
● ORC
● Parquet
● XML

Unstructured data

Unstructured data is more diversified and vast than structured data, and includes email and
social media postings, audio, photos, invoices, logs, and other sensor data. The specifics on
how to utilise each are provided below.

Data with the following extensions is unstructured data:

● text


● audio
● image

Examples of unstructured data


<6> Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test

● Image data
Image data includes digital camera photographs, scanned images, 3D images, and so on.
Image data, which is employed in a variety of contexts, is a common format among
unstructured data. In recent years, face recognition, identification of objects placed at cash
registers, digitalization of documents by character recognition, and other applications have
been discussed, in addition to its use as material for human judgement. This category of
data also includes video.

● Voice/audio data
Audio data has been around for a long time, having become popular with the introduction
of CDs. However, with the advancement of speech recognition technology and the
proliferation of voice speakers in recent years, voice input has become ubiquitous, and the
effective use of voice data has drawn attention. Call centres, for example, not only record
their calls but also automatically convert them to text (voice to text) to increase the
efficiency of recording and analysis. Audio is also used to estimate the emotions of the
other party from the tone of voice, and to analyse the sound output by a machine to
determine whether an irregularity has occurred.

● Sensor data
With the advancement of IoT, big data analysis, the OT field, and sensor technology, as
well as networking, it is now feasible to collect a broad variety of information, such as
manufacturing-process data in factories and indoor temperature, humidity, and density.
Sensor data may be used for a variety of purposes, including detecting irregularities on the
production line that result in low yield, rectifying mistakes, and anticipating the timing of
equipment breakdown. It is also employed in medicine, where initiatives such as
forecasting stress and sickness by monitoring heart rate have become frequent. Sensor data
of this type is also commonly employed in autonomous driving. To distinguish it from files
such as pictures and Microsoft Office documents, it is often referred to as semi-structured
data.

● Text data
The text data format accounts for a vast volume of unstructured data on the Internet,
ranging from long texts such as books to short posts such as tweets on Twitter. It is
commonly used for researching brand image from word-of-mouth and SNS postings,
detecting consumer complaints, automatically preparing documents such as minutes using
summary-generation technology, and automatically translating languages by scanning text
data.
In this section, we will discuss the benefits (advantages) and drawbacks (disadvantages)
of big data use, based on big data characteristics.
Advantages / benefits of big data:

● High real-time performance


● Discover new businesses
● Highly accurate effect measurement (verification) is possible
● Reduction in the cost of collecting information

Let us explain each of the benefits of big data one by one.

High real-time performance


By treating as big data the massive amount of data that had been spread in multiple forms
across each generation/acquisition site and each department, the data can be combined and
rapidly processed, even down to a single piece of data.
With conventional techniques, which required analysis from scratch each time, real-time
performance was low and integrating the different data sets took more time and effort.

One of the components of big data, real-time, provides you an advantage over your
competition. Real-time performance entails the rapid processing of enormous amounts of data
as well as the quick analysis of data that is continually flowing.

Big data has velocity as one of its components, and it is distinguished by the
availability of real-time data. Real-time capabilities allow us to discover market demands rapidly
and use them in marketing and management strategies to build accurate enterprises.

It is an advanced technology for faster processing of data.

Immediate responsiveness to ever-changing markets gives you a competitive advantage over


your competitors.

Discover new businesses


It is expected that, by performing data mining with BI tools and similar software to identify
relevant information in massive amounts of data, links between the data will be identified and
unexpected ideas will be obtained.
You will be able to solve difficulties and uncover new businesses, techniques, measures, and
so on, all of which lead to actionable hints.
Highly accurate effect measurement (verification) is possible
If data mining provides us with recommendations, we will develop additional measures based
on them.
Following the implementation of this measure, it is required to assess (check) the effect,
which may also be accomplished through the analysis of big data. In other words, using big
data allows for both analysis to test hypotheses and data mining to uncover ideas from
hypotheses.

Reduction in the cost of collecting information


Big data, which is a collection of high-quality data, lowers the cost of information collecting.
In the past, gathering information through interviews and questionnaires, for example,
imposed time limits and labour expenses on the target population.
However, big data allows for the collection of a vast quantity of information on the Internet
in a short period of time, as well as the reduction of the target person's constraint time and
labour expenses. Big data helps organisations to obtain information at low cost and to invest
the savings in critical activities such as development and marketing.

Disadvantages / drawbacks of big data:

● Individuals can be identified even from anonymous data

Individuals can be identified even from anonymous data


On the other hand, there are drawbacks as well as benefits to exploiting big data. The concern
is that by matching pertinent information from a massive quantity of data, you may identify an
individual from anonymous data.
For example, it is impossible to determine who a 26-year-old lady in NYC is, but if this
information is combined with data such as "I had a cecal operation at the age of 14," it is
possible to identify her.
Of course, the more information you integrate, the more accurate the identification becomes.
According to a paper released by research teams in the United Kingdom and Belgium,
individuals could be re-identified by comparing anonymous data with public information.

This is a disadvantage for customers rather than firms attempting to increase the accuracy of
marketing, etc. by utilising big data, but if these issues grow and legal constraints get
stronger, the area of use may be limited. Companies that use big data must be prepared to
handle data responsibly in compliance with the Personal Information Protection Act and other
regulatory standards.

Intelligent Data Analysis (IDA)


Intelligent Data Analysis (IDA) is one of the most important approaches in the field of data
mining. Based on the basic principles of IDA and the features of datasets that IDA handles,
the development of IDA is briefly summarized from three aspects:
1. Algorithm principle
2. The scale
3. Type of the dataset
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and
information science. Intelligent data analysis discloses hidden facts that were not previously known
and provides potentially important information or facts from large quantities of data.
It also helps in making a decision. Based on machine learning, artificial intelligence,
recognition of pattern, and records and visualization technology, IDA helps to obtain useful
information, necessary data and interesting models from a lot of data available online in order
to make the right choices.
IDA includes three stages:
(1) Preparation of data
(2) Data mining
(3) Data validation and Explanation

NATURE OF DATA
To understand the nature of data, we must recall, what are data? And what are the functions
that data should perform on the basis of its classification?
The first point in this is that data should have specific items (values or facts), which must be
identified.
Secondly, specific items of data must be organised into a meaningful form.
Thirdly, data should have the functions to perform.
Furthermore, the nature of data can be understood on the basis of the class to which it
belongs.
We have seen that in sciences there are six basic types within which there exist fifteen
different classes of data. However, these are not mutually exclusive.
There is a large measure of cross-classification, e.g., all quantitative data are numerical
data, and most data are quantitative data.

With reference to the types of data; their nature is as follows:


Numerical data: All data in sciences are derived by measurement and stated in numerical
values. Most of the time their nature is numerical. Even in semi-quantitative data, affirmative
and negative answers are coded as ‘1’ and ‘0’ for obtaining numerical data. Thus, except in
the three cases of qualitative, graphic and symbolic data, the remaining twelve classes yield
numerical data.

Descriptive data: Sciences are not known for descriptive data. However, qualitative data in
sciences are expressed in terms of definitive statements concerning objects. These may be
viewed as descriptive data. Here, the nature of data is descriptive.

Graphic and symbolic data: Graphic and symbolic data are modes of presentation. They
enable users to grasp data by visual perception. The nature of data, in these cases, is graphic.
Likewise, it is possible to determine the nature of data in social sciences also.

Enumerative data: Most data in social sciences are enumerative in nature. However, they
are refined with the help of statistical techniques to make them more meaningful. They are
known as statistical data. This explains the use of different scales of measurement whereby
they are graded.

Descriptive data: All qualitative data in sciences can be descriptive in nature. These can be
in the form of definitive statements. All cataloguing and indexing data are bibliographic,
whereas all management data such as books acquired, books lent, visitors served and
photocopies supplied are non-bibliographic.
Having seen the nature of data, let us now examine the properties, which the data should
ideally possess.

Analytical Processing Of Big Data (by Steps)


Let us now understand how Big Data is processed. The following are the steps involved:
1. Identification of a suitable storage for Big Data
2. Data ingestion (Adoption)
3. Data cleaning and processing (Exploratory data analysis)
4. Visualization of the data
5. Apply the machine learning algorithms (If required)
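As a minimal, hedged sketch of how steps 2-5 might look in Python (the file name sensor_data.csv and the column names temperature and energy_usage are assumptions made only for illustration):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Step 2: data ingestion - read the raw data (hypothetical file and columns)
df = pd.read_csv("sensor_data.csv")          # assumed columns: temperature, energy_usage

# Step 3: data cleaning and exploratory data analysis
df = df.dropna()                             # drop rows with missing values
print(df.describe())                         # basic summary statistics

# Step 4: visualization of the data
df.plot.scatter(x="temperature", y="energy_usage")
plt.savefig("scatter.png")

# Step 5: apply a machine learning algorithm (if required)
model = LinearRegression()
model.fit(df[["temperature"]], df["energy_usage"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)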
Analysis vs Reporting
Reporting:
∙ Once data is collected, it will be organized using tools such as
graphs and tables.
∙ The process of organizing this data is called reporting.
∙ Reporting translates raw data into information.
∙ Reporting helps companies to monitor their online business and
be alerted when data falls outside of expected ranges.
∙ Good reporting should raise questions about the business from its
end users.
Analysis:
∙ Analytics is the process of taking the organized data and
analysing it.
∙ This helps users to gain valuable insights into how businesses can
improve their performance.
∙ Analysis transforms data and information into insights.
∙ The goal of the analysis is to answer questions by interpreting
the data at a deeper level and providing actionable
recommendations.
Conclusion:
∙ Reporting shows us “what is happening”.
∙ The analysis focuses on explaining “why it is happening” and “what we can
do about it”.

Modern Data Analytic Tools:-


∙ These days, organizations are realizing the value they get out of
big data analytics and hence they are deploying big data tools and
processes to bring more efficiency to their work environment.
∙ Many big data tools and processes are being utilized by companies
these days in the processes of discovering insights and supporting
decision making.
∙ Data Analytics tools are types of application software that
retrieve data from one or more systems and combine it in a
repository, such as a data warehouse, to be reviewed and analyzed.
∙ Most organizations use more than one analytics tool including
spreadsheets with statistical functions, statistical software
packages, data mining tools, and predictive modelling tools.
∙ Together, these Data Analytics Tools give the organization a
complete overview of the company to provide key insights and
understanding of the market/business so smarter decisions may be
made.
∙ Data analytics tools not only report the results of the data but
also explain why the results occurred to help identify weaknesses,
fix potential problem areas, alert decision- makers to unforeseen
events and even forecast future results based on decisions the
company might make.

Below is a list of some data analytics tools:


1. R Programming (a leading analytics tool in the industry)
2. Python
3. Excel
4. SAS
5. Apache Spark
6. Splunk
7. RapidMiner
8. Tableau Public
9. KNIME

Sampling Distributions

Sampling distribution refers to studying the randomly chosen samples to understand the
variations in the outcome expected to be derived.

Sampling distribution in statistics represents the probability of varied outcomes when a


study is conducted. It is also known as finite-sample distribution. In the process, users collect
samples randomly but from one chosen population. In statistics, a population is a group of
people or items sharing the same attribute, from which random samples are collected.

Sampling distribution of the mean, sampling distribution of proportion, and T-distribution are
three major types of finite-sample distribution.
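As a small, hedged illustration of this idea (the population, sample size, and number of samples below are arbitrary assumptions), the following Python snippet repeatedly draws random samples from one population and records each sample mean; the spread of those recorded means is the sampling distribution of the mean:

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=10.0, size=100_000)   # one chosen population

sample_means = []
for _ in range(5_000):                                   # draw many random samples
    sample = rng.choice(population, size=50, replace=False)
    sample_means.append(sample.mean())                   # record the statistic of interest

# The distribution of these 5,000 means is the sampling distribution of the mean.
print("mean of sample means:", np.mean(sample_means))
print("spread of the sampling distribution (standard error):", np.std(sample_means))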

Re-Sampling

Resampling is a method that consists of drawing repeated samples from the original data
sample. Resampling is a nonparametric method of statistical inference. In
other words, resampling does not involve the use of generic
distribution tables (for example, normal distribution tables) in order to compute approximate
probability (p) values.

Resampling involves the selection of randomized cases, with replacement, from the original
data sample, in such a manner that each sample drawn has the same number of cases as the
original data sample. Due to the replacement, the samples drawn by the resampling method
may contain repeated cases.
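A minimal sketch of one common resampling method (the bootstrap) under the description above; the data values are assumptions used only for illustration. Samples of the same size as the original data are drawn with replacement, and the statistic of interest is recomputed on each resample:

import numpy as np

rng = np.random.default_rng(0)
data = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 8.7, 10.9, 11.8])   # original sample

boot_means = []
for _ in range(10_000):
    # Draw a resample with replacement, the same size as the original data.
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means.append(resample.mean())

# A nonparametric 95% interval for the mean, with no distribution tables involved.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("bootstrap 95% interval for the mean:", lower, upper)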

Statistical Inference

Statistical Inference is defined as the procedure of analyzing the result and making
conclusions from data based on random variation. The two applications of statistical
inference are hypothesis testing and confidence interval. Statistical inference is the technique
of making decisions about the parameters of a population that relies on random sampling. It
enables us to assess the relationship between dependent and independent variables. The idea
of statistical inference is to estimate the uncertainty or sample to sample variation. It enables
us to deliver a range of value for the true value of something in the population. The
components used for making the statistical inference are:

● Sample Size

● Variability in the sample

● Size of the observed difference

Types of statistical inference


There are different types of statistical inference that are used to draw conclusions, such as
Pearson correlation, bivariate regression, multivariate regression, ANOVA or t-tests, and the
Chi-square statistic with contingency tables.

However, the two most important types of statistical inference, which are primarily used, are:

● Confidence Interval

● Hypothesis testing
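As a hedged sketch of these two types of inference, the snippet below computes a confidence interval and a one-sample t-test using SciPy; the sample values and the hypothesised population mean of 10 are assumptions used only for illustration:

import numpy as np
from scipy import stats

sample = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.8, 10.1, 10.6])

# Confidence interval: a range of plausible values for the population mean.
mean = sample.mean()
sem = stats.sem(sample)                       # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print("95% confidence interval:", ci_low, ci_high)

# Hypothesis test: is the population mean different from the assumed value 10?
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print("t statistic:", t_stat, "p value:", p_value)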

Importance of Statistical Inference

Statistical inference is significant for examining data properly. To arrive at an effective

solution, accurate data analysis is important for interpreting the results of research. Inferential
statistics is used in the future prediction for varied observations in different fields. It enables
us to make inferences about the data. It also helps us to deliver a probable range of values for
the true value of something in the population.

Statistical inference is used in different fields such as:

● Business Analysis

● Artificial Intelligence

● Financial Analysis

● Fraud Detection

● Machine Learning
● Pharmaceutical Sector

● Share market.

Prediction error

In statistics, prediction error refers to the difference between the predicted values made by
some model and the actual values.

Prediction error is often used in two settings:

1. Linear regression: Used to predict the value of some continuous response variable.

We typically measure the prediction error of a linear regression model with a metric known
as RMSE, which stands for root mean squared error.

It is calculated as:

RMSE = √( Σ(ŷi – yi)² / n )

where:

● Σ is a symbol that means “sum”

● ŷi is the predicted value for the ith observation

● yi is the observed value for the ith observation

● n is the sample size

2. Logistic Regression: Used to predict the value of some binary response variable.

One common way to measure the prediction error of a logistic regression model is with a
metric known as the total misclassification rate.

It is calculated as:

Total misclassification rate = (# incorrect predictions / # total predictions)

The lower the value for the misclassification rate, the better the model is able to predict the
outcomes of the response variable.
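A short, hedged illustration of both prediction-error metrics defined above; the predicted and observed values are made-up numbers used only to show the arithmetic:

import numpy as np

# Linear regression: RMSE = sqrt( sum((y_hat - y)^2) / n )
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print("RMSE:", rmse)

# Logistic regression: total misclassification rate = # incorrect / # total
labels = np.array([1, 0, 1, 1, 0, 1])
predicted = np.array([1, 0, 0, 1, 1, 1])
misclassification_rate = np.mean(labels != predicted)
print("Total misclassification rate:", misclassification_rate)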

UNIT II - MINING DATA STREAMS


Introduction to Streams Concepts – Stream Data Model and Architecture – Stream
Computing - Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in
a Stream – Estimating Moments – Counting Oneness in a Window – Decaying Window -
Real time Analytics Platform (RTAP) Applications - Case Studies - Real Time Sentiment
Analysis, Stock Market Predictions.

Stream Processing

Stream processing is a method of data processing that involves continuously


processing data in real-time as it is generated, rather than processing it in batches. In
stream processing, data is processed incrementally and in small chunks as it arrives,
making it possible to analyze and act on data in real-time.

Stream processing is particularly useful in scenarios where data is generated rapidly,


such as in the case of IoT devices or financial markets, where it is important to detect
anomalies or patterns in data quickly. Stream processing can also be used for real-time
data analytics, machine learning, and other applications where real-time data
processing is required.

There are several popular stream processing frameworks, including Apache Flink,
Apache Kafka, Apache Storm, and Apache Spark Streaming. These frameworks
provide tools for building and deploying stream processing pipelines, and they can
handle large volumes of data with low latency and high throughput.

Mining data streams

Mining data streams refers to the process of extracting useful insights and
patterns from continuous and rapidly changing data streams in real-time. Data streams
are typically high- volume and high-velocity, making it challenging to analyze them
using traditional data mining techniques.

Mining data streams requires specialized algorithms that can handle the dynamic nature
of data streams, as well as the need for real-time processing. These algorithms
typically use techniques such as sliding windows, online learning, and incremental
processing to adapt to changing data patterns over time.

Applications of mining data streams include fraud detection, network intrusion


detection, predictive maintenance, and real-time recommendation systems. Some
popular algorithms for mining data streams include Frequent Pattern Mining (FPM),
clustering, decision trees, and neural networks.

Mining data streams also requires careful consideration of the computational resources
required to process the data in real-time. As a result, many mining data stream
algorithms are designed to work with limited memory and processing power, making
them well-suited for deployment on edge devices or in cloud-based architectures.

Introduction to Streams Concepts

In computer science, a stream refers to a sequence of data elements that are


continuously generated or received over time. Streams can be used to represent a wide
range of data, including audio and video feeds, sensor data, and network packets.

Streams can be thought of as a flow of data that can be processed in real-time, rather
than being stored and processed at a later time. This allows for more efficient
processing of large volumes of data and enables applications that require real-time
processing and analysis.

Some important concepts related to streams include:

1. Data Source:A stream's data source is the place where the data is generated or received.
This can include sensors, databases, network connections, or other sources.
2. Data Sink:A stream's data sink is the place where the data is consumed or stored.
This can include databases, data lakes, visualization tools, or other destinations.
3. Streaming Data Processing:This refers to the process of continuously processing
data as it arrives in a stream. This can involve filtering, aggregation, transformation, or
analysis of the data.
4. Stream Processing Frameworks:These are software tools that provide an
environment for building and deploying stream processing applications. Popular stream
processing frameworks include Apache Flink, Apache Kafka, and Apache Spark
Streaming.
5. Real-time Data Processing:This refers to the ability to process data as soon as it is
generated or received. Real-time data processing is often used in applications that
require immediate action, such as fraud detection or monitoring of critical systems.

Overall, streams are a powerful tool for processing and analyzing large volumes of data
in real-time, enabling a wide range of applications in fields such as finance, healthcare,
and the Internet of Things.

Stream Data Model and Architecture

Stream data model is a data model used to represent the continuous flow of data in a
stream processing system. The stream data model typically consists of a series of
events, which are individual pieces of data that are generated by a data source and
processed by a stream processing system.

The architecture of a stream processing system typically involves three main


components: data sources, stream processing engines, and data sinks.

1. Data sources:The data sources are the components that generate the events that
make up the stream. These can include sensors, log files, databases, and other data
sources.
2. Stream processing engines:The stream processing engines are the components
responsible for processing the data in real-time. These engines typically use a variety of
algorithms and techniques to filter, transform, aggregate, and analyze the stream of
events.

3. Data sinks:The data sinks are the components that receive the output of the stream
processing engines. These can include databases, data lakes, visualization tools, and
other data destinations.

The architecture of a stream processing system can be distributed or centralized,


depending on the requirements of the application. In a distributed architecture, the
stream processing engines are distributed across multiple nodes, allowing for increased
scalability and fault tolerance. In a centralized architecture, the stream processing
engines are run on a single node, which can simplify deployment and management.

Some popular stream processing frameworks and architectures include Apache Flink,
Apache Kafka, and Lambda Architecture. These frameworks provide tools and
components for building scalable and fault-tolerant stream processing systems, and can
be used in a wide range of applications, from real-time analytics to internet of things
(IoT) data processing.

Stream Computing

Stream computing is the process of computing and analyzing data streams in real-time.
It involves continuously processing data as it is generated, rather than processing it in
batches. Stream computing is particularly useful for scenarios where data is generated
rapidly and needs to be analyzed quickly.

Stream computing involves a set of techniques and tools for processing and analyzing
data streams, including:

1. Stream processing frameworks:These are software tools that provide an


environment for building and deploying stream processing applications. Popular stream
processing frameworks include Apache Flink, Apache Kafka, and Apache Storm.
2. Stream processing algorithms:These are specialized algorithms that are designed to
handle the dynamic and rapidly changing nature of data streams. These algorithms use
techniques such as sliding windows, online learning, and incremental processing to
adapt to changing data patterns over time.
3. Real-time data analytics:This involves using stream computing techniques to
perform real-time analysis of data streams, such as detecting anomalies, predicting
future trends, and identifying patterns.
4. Machine learning:Machine learning algorithms can also be used in stream
computing to continuously learn from the data stream and make predictions in real-
time.

Stream computing is becoming increasingly important in fields such as finance,


healthcare, and the Internet of Things (IoT), where large volumes of data are generated
and need to be processed and analyzed in real-time. It enables businesses and
organizations to make more informed decisions based on real-time insights, leading to
better operational efficiency and improved customer experiences.

Sampling Data in a Stream

Sampling data in a stream refers to the process of selecting a subset of data points from
a continuous and rapidly changing data stream for analysis. Sampling is a useful
technique for processing data streams when it is not feasible or necessary to process all
data points in real- time.

There are various sampling techniques that can be used for stream data, including:
1.Random sampling:This involves selecting data points from the stream at random
intervals. Random sampling can be used to obtain a representative sample of the
entire stream.

2. Systematic sampling:This involves selecting data points at regular intervals, such as


every tenth or hundredth data point. Systematic sampling can be useful when the
stream has a regular pattern or periodicity.
3. Cluster sampling:This involves dividing the stream into clusters and selecting data
points from each cluster. Cluster sampling can be useful when there are multiple sub-
groups within the stream.

4. Stratified sampling:This involves dividing the stream into strata or sub-groups


based on some characteristic, such as location or time of day. Stratified sampling can
be useful when there are significant differences between the sub-groups.

When sampling data in a stream, it is important to ensure that the sample is


representative of the entire stream. This can be achieved by selecting a sample size that
is large enough to capture the variability of the stream and by using appropriate
sampling techniques.

Sampling data in a stream can be used in various applications, such as monitoring and
quality control, statistical analysis, and machine learning. By reducing the amount of
data that needs to be processed in real-time, sampling can help improve the efficiency
and scalability of stream processing systems.
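One common way to implement random sampling over a stream whose length is not known in advance is reservoir sampling, which keeps a fixed-size, uniformly random sample as elements arrive. The sketch below is a minimal illustration and is not taken from any particular framework:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            # Replace an existing item with decreasing probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 values from a stream of 10,000 readings.
print(reservoir_sample(range(10_000), k=5))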

Filtering Streams

Filtering streams refers to the process of selecting a subset of data from a data stream
based on certain criteria. This process is often used in stream processing systems to
reduce the amount of data that needs to be processed and to focus on the relevant data.

There are various filtering techniques that can be used for stream data, including:

1. Simple filtering:This involves selecting data points from the stream that meet a specific
condition, such as a range of values, a specific text string, or a certain timestamp.

2. Complex filtering:This involves selecting data points from the stream based on multiple
criteria or complex logic. Complex filtering can involve combining multiple conditions using
Boolean operators such as AND, OR, and NOT.
3. Machine learning-based filtering:This involves using machine learning algorithms
to automatically classify data points in the stream based on past observations. This can
be useful in applications such as anomaly detection or predictive maintenance.
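A minimal sketch of simple and complex filtering (items 1 and 2 above) written as a Python generator; the event fields value and source used here are assumptions made only for illustration:

def filter_stream(events, threshold=50, allowed_sources=("sensor-a", "sensor-b")):
    """Yield only the events that satisfy the filtering conditions."""
    for event in events:
        # Simple filter: keep events within a range of values.
        in_range = event["value"] >= threshold
        # Complex filter: combine conditions with Boolean operators (AND / OR / NOT).
        relevant = in_range and event["source"] in allowed_sources
        if relevant:
            yield event

stream = [
    {"source": "sensor-a", "value": 72},
    {"source": "sensor-c", "value": 95},
    {"source": "sensor-b", "value": 12},
]
print(list(filter_stream(stream)))   # only the first event passes both conditions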

When filtering streams, it is important to consider the trade-off between the amount of
data being filtered and the accuracy of the filtering process. Too much filtering can
result in valuable data being discarded, while too little filtering can result in a large
volume of irrelevant data being processed.

Filtering streams can be useful in various applications, such as monitoring and


surveillance, real-time analytics, and Internet of Things (IoT) data processing. By
reducing the amount of data that needs to be processed and analyzed in real-time,
filtering can help improve the efficiency and scalability of stream processing systems.

Counting Distinct Elements in a Stream

Counting distinct elements in a stream refers to the process of counting the number of
unique items in a continuous and rapidly changing data stream. This is an important
operation in stream processing because it can help detect anomalies, identify trends,
and provide insights into the data stream.

There are various techniques for counting distinct elements in a stream, including:

1. Exact counting:This involves storing all the distinct elements seen so far in a data
structure such as a hash table or a bloom filter. When a new element is encountered, it
is checked against the data structure to determine if it is a new distinct element.

2. Approximate counting:This involves using probabilistic algorithms such as the


Flajolet-Martin algorithm or the HyperLogLog algorithm to estimate the number of
distinct elements in a data stream. These algorithms use a small amount of memory to
provide an approximate count with a known level of accuracy.
3. Sampling:This involves selecting a subset of the data stream and counting the distinct
elements in the sample. This can be useful when the data stream is too large to be processed
in real-time or when exact or approximate counting techniques are not feasible.
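As a hedged, simplified sketch of the approximate counting idea (item 2 above), the snippet below implements the core of the Flajolet-Martin estimate: hash each element, track the maximum number of trailing zero bits seen in the hashes, and estimate the distinct count as 2 raised to that maximum. Real implementations combine many hash functions for accuracy; this single-hash version is only illustrative:

import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in n (0 is treated as having 32 zeros)."""
    if n == 0:
        return 32
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def estimate_distinct(stream):
    max_zeros = 0
    for item in stream:
        # Hash the item to a pseudo-random integer.
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros        # rough estimate of the number of distinct elements

# Example: a stream with many repeats but roughly 1,000 distinct values.
stream = [i % 1000 for i in range(100_000)]
print("estimated distinct elements:", estimate_distinct(stream))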

Counting distinct elements in a stream can be useful in various applications, such as


social media analytics, fraud detection, and network traffic monitoring. By providing
real-time insights into the data stream, counting distinct elements can help businesses
and organizations make more informed decisions and improve operational efficiency.

Estimating Moments

In statistics, moments are numerical measures that describe the shape, central
tendency, and variability of a probability distribution. They are calculated as functions
of the random variables of the distribution, and they can provide useful insights into the
underlying properties of the data.

There are different types of moments, but two of the most commonly used are the mean
(the first moment) and the variance (the second moment). The mean represents the
central tendency of the data, while the variance measures its spread or variability.

To estimate the moments of a distribution from a sample of data, you can use the
following formulas:

Sample mean (first moment):

x̄ = (1/n) Σ x_i

where n is the sample size, and x_i are the individual observations.

Sample variance (second moment):

s^2 = Σ(x_i – x̄)² / (n – 1)

where n is the sample size, x_i are the individual observations, x̄ is the sample mean, and s^2
is the sample variance.

These formulas provide estimates of the population moments based on the sample data.
The larger the sample size, the more accurate the estimates will be. However, it's
important to note that how informative these summary moments are depends on the type of
distribution (e.g., they describe a normal distribution well), and for other types of
distributions, additional or different measures may be required.
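A small illustration of the two estimates above (the data values are arbitrary); ddof=1 gives the n - 1 denominator commonly used for the sample variance:

import numpy as np

x = np.array([4.0, 7.0, 13.0, 16.0])

sample_mean = x.mean()                 # first moment: (1/n) * sum(x_i)
sample_variance = x.var(ddof=1)        # second moment: sum((x_i - mean)^2) / (n - 1)

print("sample mean:", sample_mean)          # 10.0
print("sample variance:", sample_variance)  # 30.0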

Counting Oneness in a Window

Counting the number of times a number appears exactly once (oneness) in a


window of a given size in a sequence is a common problem in computer science
and data analysis. Here's one way you could approach this problem:

1. Initialize a dictionary to store the counts of each number in the window.
2. Initialize a count variable to zero.
3. Iterate through the first window and update the counts in the dictionary.
4. If a count in the dictionary is 1, increment the count variable.
5. For the remaining windows, slide the window by one element to the right and update the
counts in the dictionary accordingly.
6. If the count of the number that just left the window is 1, decrement the count variable.
7. If the count of the number that just entered the window is 1, increment the count variable.
8. Repeat steps 5-7 until you reach the end of the sequence.
Here's some Python code that implements this approach:
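(The original code listing appears to be missing here, so the following is a hedged reconstruction of the approach described in steps 1-8 above; the function name count_oneness_per_window is an assumption.)

from collections import defaultdict

def count_oneness_per_window(sequence, window_size):
    """Return, for each window, how many numbers appear exactly once in it."""
    counts = defaultdict(int)
    results = []

    # Steps 1-4: build counts for the first window and count the unique numbers.
    for x in sequence[:window_size]:
        counts[x] += 1
    oneness = sum(1 for c in counts.values() if c == 1)
    results.append(oneness)

    # Steps 5-8: slide the window one element at a time and update incrementally.
    for i in range(window_size, len(sequence)):
        leaving, entering = sequence[i - window_size], sequence[i]

        # The element leaving the window.
        if counts[leaving] == 1:
            oneness -= 1          # it was unique; it no longer counts
        counts[leaving] -= 1
        if counts[leaving] == 1:
            oneness += 1          # its remaining copy is now unique

        # The element entering the window.
        if counts[entering] == 1:
            oneness -= 1          # the existing copy is no longer unique
        counts[entering] += 1
        if counts[entering] == 1:
            oneness += 1          # it is unique in this window

        results.append(oneness)
    return results

print(count_oneness_per_window([1, 2, 2, 3, 1, 4], window_size=3))   # [1, 1, 3, 3]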
Decaying Window

A decaying window is a common technique used in time-series analysis and


signal processing to give more weight to recent observations while gradually reducing
the importance of older observations. This can be useful when the underlying data
generating process is changing over time, and more recent observations are more
relevant for predicting future values.

Here's one way you could implement a decaying window in Python using an
exponentially weighted moving average (EWMA):
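(The original code listing appears to be missing here as well; the sketch below is a hedged reconstruction based only on the description that follows, and the function name decaying_window_average is an assumption.)

import numpy as np
import pandas as pd

def decaying_window_average(data, window_size, decay_rate):
    """Weighted moving average over a rolling window of a Pandas Series."""
    # Weights follow decay_rate^(window_size - i); with decay_rate < 1 the most
    # recent point in each window gets the largest weight.
    weights = np.array([decay_rate ** (window_size - i) for i in range(1, window_size + 1)])
    # Normalize the weights so that they sum to one (a proper weighted average).
    weights = weights / weights.sum()
    # Apply the weights to each rolling window of the series.
    return data.rolling(window_size).apply(lambda w: np.dot(w, weights), raw=True)

# Example usage on a short series.
series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0])
print(decaying_window_average(series, window_size=3, decay_rate=0.5))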
This function takes in a Pandas Series data, a window size window_size, and a decay
rate decay_rate. The decay rate determines how much weight is given to recent
observations relative to older observations. A larger decay rate means that more weight
is given to recent observations.

The function first creates a series of weights using the decay rate and the window size.
The weights are calculated using the formula decay_rate^(window_size - i) where i is
the index of the weight in the series. This gives more weight to recent observations and
less weight to older observations.

Next, the function normalizes the weights so that they sum to one. This ensures that the
weighted average is a proper average.

Finally, the function applies the rolling function to the data using the window size and
a custom lambda function that calculates the weighted average of the window using the
weights.

Note that this implementation uses Pandas' built-in rolling and apply functions, which
are optimized for efficiency. If you're working with large datasets, this implementation
should be quite fast. If you're working with smaller datasets or need more control over
the implementation, you could implement a decaying window using a custom function
that calculates the weighted average directly.
Real time Analytics Platform (RTAP) Applications

Real-time analytics platforms (RTAPs) are becoming increasingly popular as


businesses strive to gain insights from streaming data and respond quickly to changing
conditions. Here are some examples of RTAP applications:

1. Fraud detection:Financial institutions and e-commerce companies use RTAPs to


detect fraud in real-time. By analyzing transactional data as it occurs, these companies
can quickly identify and prevent fraudulent activity.
2. Predictive maintenance:RTAPs can be used to monitor the performance of
machines and equipment in real-time. By analyzing data such as temperature, pressure,
and vibration, these platforms can predict when equipment is likely to fail and alert
maintenance teams to take action.

3. Supply chain optimization:RTAPs can help companies optimize their supply chain by
monitoring inventory levels, shipment tracking, and demand forecasting. By analyzing this
data in real-time, companies can make better decisions about when to restock inventory,
when to reroute shipments, and how to allocate resources.

4. Customer experience management:RTAPs can help companies monitor customer


feedback in real-time, enabling them to respond quickly to complaints and improve the
customer experience. By analyzing customer data from various sources, such as social media,
email, and chat logs, companies can gain insights into customer behavior and preferences.

5. Cybersecurity:RTAPs can help companies detect and prevent cyberattacks in real-


time. By analyzing network traffic, log files, and other data sources, these platforms
can quickly identify suspicious activity and alert security teams to take action.

Overall, RTAPs can be applied in various industries and domains where real-time
monitoring and analysis of data is critical to achieving business objectives. By
providing insights into streaming data as it happens, RTAPs can help businesses make
faster and more informed decisions.
Case Studies - Real Time Sentiment Analysis

Real-time sentiment analysis is a powerful tool for businesses that want to monitor and
respond to customer feedback in real-time. Here are some case studies of companies
that have successfully implemented real-time sentiment analysis:

1. Airbnb: The popular home-sharing platform uses real-time sentiment analysis to


monitor customer feedback and respond to complaints. Airbnb's customer service team uses
the platform to monitor social media and review sites for mentions of the brand, and to track
sentiment over time. By analyzing this data in real-time, Airbnb can quickly respond to
complaints and improve the customer experience.

2. Coca-Cola: Coca-Cola uses real-time sentiment analysis to monitor social media for
mentions of the brand and to track sentiment over time. The company's marketing team
uses this data to identify trends and to create more targeted marketing campaigns. By
analyzing real-time sentiment data, Coca-Cola can quickly respond to changes in
consumer sentiment and adjust its marketing strategy accordingly.

3. Ford: Ford uses real-time sentiment analysis to monitor customer feedback on social
media and review sites. The company's customer service team uses this data to identify
issues and to respond to complaints in real-time. By analyzing real-time sentiment data,
Ford can quickly identify and address customer concerns, improving the overall
customer experience.

4. Hootsuite: Social media management platform Hootsuite uses real-time sentiment analysis to help businesses monitor and respond to customer feedback. Hootsuite's sentiment analysis tool allows businesses to monitor sentiment across social media
sentiment analysis tool allows businesses to monitor sentiment across social media
channels, track sentiment over time, and identify trends. By analyzing real-time
sentiment data, businesses can quickly respond to customer feedback and improve the
overall customer experience.

5. Twitter: Twitter uses real-time sentiment analysis to identify trending topics and to
monitor sentiment across the platform. The company's sentiment analysis tool allows
users to track sentiment across various topics and to identify emerging trends. By
analyzing real-time sentiment data, Twitter can quickly identify issues and respond to
changes in user sentiment.
Overall, real-time sentiment analysis is a powerful tool for businesses that want to
monitor and respond to customer feedback in real-time. By analyzing real-time
sentiment data, businesses can quickly identify issues and respond to changes in
customer sentiment, improving the overall customer experience.

Case Studies - Stock Market Predictions

Predicting stock market performance is a challenging task, but there have been several
successful case studies of companies using machine learning and artificial intelligence
to make accurate predictions. Here are some examples of successful stock market
prediction case studies:

1. Kavout: Kavout is a Seattle-based fintech company that uses artificial intelligence and machine learning to predict stock performance. The company's system uses a combination of fundamental and technical analysis to generate buy and sell recommendations for individual stocks. Kavout's AI algorithms have outperformed traditional investment strategies and consistently outperformed the S&P 500 index.

2. Sentient Technologies: Sentient Technologies is a San Francisco-based AI startup that uses deep learning to predict stock market performance. The company's system uses a combination of natural language processing, image recognition, and genetic algorithms to analyze market data and generate investment strategies. Sentient's AI algorithms have consistently outperformed the S&P 500 index and other traditional investment strategies.

3. Quantiacs: Quantiacs is a California-based investment firm that uses machine learning to develop trading algorithms. The company's system uses machine learning algorithms to analyze market data and generate trading strategies. Quantiacs' trading algorithms have consistently outperformed traditional investment strategies and have delivered returns that are significantly higher than the S&P 500 index.

4. Kensho Technologies: Kensho Technologies is a Massachusetts-based fintech company that uses artificial intelligence to predict stock market performance. The company's system uses natural language processing and machine learning algorithms to analyze news articles, social media feeds, and other data sources to identify patterns and generate investment recommendations. Kensho's AI algorithms have consistently outperformed the S&P 500 index and other traditional investment strategies.

5. AlphaSense: AlphaSense is a New York-based fintech company that uses natural language processing and machine learning to analyze financial data. The company's system uses machine learning algorithms to identify patterns in financial data and generate investment recommendations. AlphaSense's AI algorithms have consistently outperformed traditional investment strategies and have delivered returns that are significantly higher than the S&P 500 index.

Overall, these case studies demonstrate the potential of machine learning and artificial
intelligence to make accurate predictions in the stock market. By analyzing large
volumes of data and identifying patterns, these systems can generate investment
strategies that outperform traditional methods. However, it is important to note that the
stock market is inherently unpredictable, and past performance is not necessarily
indicative of future results.

Unit III-Hadoop

History of Hadoop- The Hadoop Distributed File System – Components of

Hadoop-Analyzing the Data with Hadoop- Scaling Out- Hadoop Streaming-

Design of HDFS - Java interfaces to HDFS - Basics - Developing a Map Reduce

Application-How Map Reduce Works-Anatomy of a Map Reduce Job run-

Failures-Job Scheduling-Shuffle and Sort – Task execution - Map Reduce

Types and Formats- Map Reduce Features

History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text
search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug
Cutting, explains how the name came about:

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.

Subprojects and "contrib" modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme ("Pig," for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.

The Hadoop Distributed File System

With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network all the
complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with streaming data access pattern and it runs on commodity hardware. Let’s
elaborate the terms:

● Extremely large files: Here we are talking about the data in range of petabytes(1000
TB).
● Streaming Data Access Pattern: HDFS is designed on principle of write-once and
read-many-times. Once data is written large portions of dataset can be processed any
number times.
● Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that distinguishes HDFS from other filesystems.
Nodes: Master and slave nodes typically form the HDFS cluster.

1. NameNode(MasterNode):
○ Manages all the slave nodes and assigns work to them.
○ It executes filesystem namespace operations like opening, closing, and renaming files and directories.
○ It should be deployed on reliable, high-end hardware, not on commodity hardware.
2. DataNode(SlaveNode):
○ Actual worker nodes, who do the actual work like reading, writing, processing
etc.
○ They also perform creation, deletion, and replication upon instruction from the
master.
○ They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in background.

● Namenode:
○ Runs on the master node.
○ Stores metadata (data about data) like file paths, the number of blocks, block IDs, etc.
○ Requires a large amount of RAM.
○ Stores metadata in RAM for fast retrieval, i.e. to reduce seek time, though a persistent copy of it is kept on disk.
● DataNodes:
○ Run on slave nodes.
○ Require large storage capacity, as the actual data is stored here.

Data storage in HDFS: Now let's see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The masternode (namenode) will first divide the file into blocks (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different datanodes (slavenodes). The datanodes (slavenodes) replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, which means three replicas are created for each block (including the original). In hdfs-site.xml we can increase or decrease the replication factor, i.e. we can edit this configuration there.

Note: The MasterNode has a record of everything: it knows the location and information of each and every data node and the blocks it contains, i.e. nothing is done without the permission of the masternode.
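For illustration, a minimal hdfs-site.xml entry for the replication factor might look like the following (the property name dfs.replication is the standard one; the value shown is just an example):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- 3 is the default; lower it to save space, raise it for more fault tolerance -->
  </property>
</configuration>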

Why divide the file into blocks?

Answer: Let's assume that we don't divide the file; it is very difficult to store a 100 TB file on a single machine. Even if we could store it, each read and write operation on that whole file would take a very high seek time. But if we have multiple blocks of size 128 MB, it becomes easy to perform various read and write operations on them compared to doing it on the whole file at once. So we divide the file to achieve faster data access, i.e. reduced seek time.

Why replicate the blocks in data nodes while storing?


Answer: Let's assume we don't replicate and only one copy of a block is present on datanode D1. Now if datanode D1 crashes, we will lose that block, which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve fault-tolerance.

Terms related to HDFS:

● HeartBeat: It is the signal that a datanode continuously sends to the namenode. If the namenode doesn't receive a heartbeat from a datanode, it will consider it dead.
● Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the remaining blocks. Here the master node (namenode) will signal the datanodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
● Replication: It is done by the datanodes.

Note: No two replicas of the same block are present on the same datanode.

Features:

● Distributed data storage.


● Blocks reduce seek time.
● The data is highly available as the same block is present at multiple datanodes.
● Even if multiple datanodes are down we can still do our work, thus making it highly
reliable.
● High fault tolerance.

Limitations: Though HDFS provides many features there are some areas where it doesn’t
work well.

● Low latency data access: Applications that require low-latency access to data i.e in the
range of milliseconds will not work well with HDFS, because HDFS is designed
keeping in mind that we need high-throughput of data even at the cost of latency.
● Small file problem: Having lots of small files will result in lots of seeks and lots of
movement from one datanode to another datanode to retrieve each small file, this
whole process is a very inefficient data access pattern.
Components of Hadoop

Hadoop is a framework that uses distributed storage and parallel processing to store
and manage Big Data. It is the most commonly used software to handle Big Data. There are
three components of Hadoop.

1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of
Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.

Hadoop HDFS

Data is stored in a distributed manner in HDFS. There are two components of HDFS - name
node and data node. While there is only one name node, there can be multiple data nodes.

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-class server costs roughly $10,000 per terabyte of storage. If you needed to buy 100 of these enterprise servers, the cost would go up to a million dollars.

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t
have to spend millions of dollars just on your data nodes. However, the name node is always
an enterprise server.

Features of HDFS

● Provides distributed storage


● Can be implemented on commodity hardware
● Provides data security
● Highly fault-tolerant - If one machine goes down, the data from that machine goes to
the next machine

Master and Slave Nodes

Master and slave nodes form the HDFS cluster. The name node is called the master, and the
data nodes are called the slaves.
The name node is responsible for the workings of the data nodes. It also stores the metadata.

The data nodes read, write, process, and replicate the data. They also send signals, known as
heartbeats, to the name node. These heartbeats show the status of the data node.

Consider that 30 TB of data is loaded through the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes so that each block ends up on more than one machine.

Replication of the data is performed three times by default. It is done this way, so if a
commodity machine fails, you can replace it with a new machine that has the same data.

Let us focus on Hadoop MapReduce in the following section.

2.Hadoop MapReduce

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the
processing is done at the slave nodes, and the final result is sent to the master node.
Instead of moving the data to the computation, a small program is sent to process the entire data. This code is usually very small in comparison to the data itself. You only need to send a few kilobytes worth of code to perform a heavy-duty process on the computers holding the data.

The input dataset is first split into chunks of data. In this example, the input has three lines of text - "bus car train," "ship ship train," "bus ship car." The dataset is then split into three chunks, one per line, and processed in parallel.

In the map phase, each word is assigned as a key with a value of 1. In this case, we have pairs such as (bus, 1), (car, 1), (ship, 1), and (train, 1).

These key-value pairs are then shuffled and sorted together based on their keys. At the reduce
phase, the aggregation takes place, and the final output is obtained.

Hadoop YARN is the next concept we shall focus on.

Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management
unit of Hadoop and is available as a component of Hadoop version 2.

● Hadoop YARN acts like an OS to Hadoop. It is a resource management layer that sits on top of HDFS.
● It is responsible for managing cluster resources to make sure you don't overload one machine.
● It performs job scheduling to make sure that jobs are scheduled in the right place.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job
request goes to the resource manager (Hadoop Yarn), which is responsible for resource
allocation and management.

In the node section, each of the nodes has its node managers. These node managers manage
the nodes and monitor the resource usage in the node. The containers contain a collection of
physical resources, which could be RAM, CPU, or hard drives. Whenever a job request
comes in, the app master requests the container from the node manager. Once the node
manager gets the resource, it goes back to the Resource Manager.

Analyze data with Hadoop

Hadoop is an open-source framework that provides distributed storage and processing of large data sets. It consists of two main components: Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed file system that allows data to be stored across multiple machines, while MapReduce is a programming model that enables large-scale distributed data processing.

To analyze data with Hadoop, you first need to store your data in HDFS. This can be done by
using the Hadoop command line interface or through a web-based graphical interface like
Apache Ambari or Cloudera Manager.

Hadoop also provides a number of other tools for analyzing data, including Apache Hive,
Apache Pig, and Apache Spark. These tools provide higher-level abstractions that simplify
the process of data analysis.
Apache Hive provides a SQL-like interface for querying data stored in HDFS. It translates
SQL queries into MapReduce jobs, making it easier for analysts who are familiar with SQL
to work with Hadoop.

Apache Pig is a high-level scripting language that enables users to write data processing
pipelines that are translated into MapReduce jobs. Pig provides a simpler syntax than
MapReduce, making it easier to write and maintain data processing code.

Apache Spark is a distributed computing framework that provides a fast and flexible way to
process large amounts of data. It provides an API for working with data in various formats,
including SQL, machine learning, and graph processing.

In summary, Hadoop provides a powerful framework for analyzing large amounts of data. By
storing data in HDFS and using MapReduce or other tools like Apache Hive, Apache Pig, or
Apache Spark, you can perform distributed data processing and gain insights from your data
that would be difficult or impossible to obtain using traditional data analysis tools.

Once your data is stored in HDFS, you can use MapReduce to perform distributed data
processing. MapReduce breaks down the data processing into two phases: the map phase and
the reduce phase.
In the map phase, the input data is divided into smaller chunks and processed independently
by multiple mapper nodes in parallel. The output of the map phase is a set of key-value pairs.

In the reduce phase, the key-value pairs produced by the map phase are aggregated and
processed by multiple reducer nodes in parallel. The output of the reduce phase is typically a
summary of the input data, such as a count or an average.

Scaling Out
You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view
of the system and look at the data flow for large inputs. For simplicity, the examples so far
have used files on the local filesystem. However, to scale out, we need to store the data in a
distributed filesystem, typically HDFS (which you’ll learn about in the next chapter), to allow
Hadoop to move the MapReduce computation to each machine hosting a part of the data.
Let’s see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and configuration
information. Hadoop runs the job by dividing it into tasks, of which there are two types:
map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a
number of tasktrackers. The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to
the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the
jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time
to process the whole input. So if we are processing the splits in parallel, the processing is
better load-balanced when the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine. Even if the
machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become more
fine-grained.

On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block (128 MB by default in Hadoop 2.x and later, 64 MB in earlier releases), although this can be changed for the cluster (for all newly created files) or specified when each file is created.

Hadoop does its best to run the map task on a node where the input data resides in HDFS.
This is called the data locality optimization because it doesn’t use valuable cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas for a map task’s input
split are running other map tasks, so the job scheduler will look for a free map slot on a node
in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. The three possibilities are illustrated in the figure below.

Figure: Data-local (a), rack-local (b), and off-rack (c) map tasks

It should now be clear why the optimal split size is the same as the block size: it is the largest
size of input that can be guaranteed to be stored on a single node. If the split spanned two
blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split
would have to be transferred across the network to the node running the map task, which is
clearly less efficient than running the whole map task using local Data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is
intermediate output: it’s processed by reduce tasks to produce the final output, and once the
job is complete, the map output can be thrown away. So storing it in HDFS with replication
would be overkill. If the node running the map task fails before the map output has been
consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is
normally the output from all mappers. In the present example, we have a single reduce task
that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred
across the network to the node where the reduce task is running, where they are merged and
then passed to the user-defined reduce function. The output of the reduce is normally stored
in HDFS for reliability. As explained for each HDFS block of the reduce output, the first
replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus,
writing the reduce output does consume network bandwidth, but only as much as a normal
HDFS write pipeline consumes.

The whole data flow with a single reduce task is illustrated in the below Figure. The dotted
boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows
show data transfers between nodes.
Figure: MapReduce data flow with a single reduce task

The number of reduce tasks is not governed by the size of the input, but instead is specified independently. Later, in the discussion of the default MapReduce job, you will see how to choose the number of reduce tasks for a given job.

When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in each
partition, but the records for any given key are all in a single partition. The partitioning can
be controlled by a user-defined partitioning function, but normally the default partitioner—
which buckets keys using a hash function—works very well.

The data flow for the general case of multiple reduce tasks is illustrated in the figure below. This
diagram makes it clear why the data flow between map and reduce tasks is colloquially
known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more
complicated than this diagram suggests, and tuning it can have a big impact on job execution
time.
Figure: MapReduce data flow with multiple reduce tasks

Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle because the processing can be carried out entirely in parallel. In this case, the only off-node data transfer is when the map tasks write to HDFS (see figure).

Hadoop Streaming

It is a utility or feature that comes with a Hadoop distribution that allows developers
or programmers to write the Map-Reduce program using different programming languages
like Ruby, Perl, Python, C++, etc. We can use any language that can read from the standard
input (STDIN), such as keyboard input, and write using standard output (STDOUT). The Hadoop framework is completely written in Java, but programs for Hadoop do not necessarily need to be coded in the Java programming language. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.
In a basic MapReduce job, we have an Input Reader which is responsible for reading the input data and producing the list of key-value pairs. We can read data in .csv format, in delimited
format, from a database table, image data(.jpg, .png), audio data etc. The only requirement to
read all these types of data is that we have to create a particular input format for that data
with these input readers. The input reader contains the complete logic about the data it is
reading. Suppose we want to read an image then we have to specify the logic in the input
reader so that it can read that image data and finally it will generate key-value pairs for that
image data.

If we are reading an image data then we can generate key-value pair for each pixel where the
key will be the location of the pixel and the value will be its color value from (0-255) for a
colored image. Now this list of key-value pairs is fed to the Map phase and Mapper will work
on each of these key-value pair of each pixel and generate some intermediate key-value pairs
which are then fed to the Reducer after doing shuffling and sorting then the final output
produced by the reducer will be written to the HDFS. These are how a simple Map-Reduce
job works.

Now let’s see how we can use different languages like Python, C++, Ruby with Hadoop for
execution. We can run this arbitrary language by running them as a separate process. For that,
we will create our external mapper and run it as an external separate process. These external
map processes are not part of the basic MapReduce flow. This external mapper will take
input from STDIN and produce output to STDOUT. As the key-value pairs are passed to the
internal mapper the internal mapper process will send these key-value pairs to the external
mapper where we have written our code in some other language like with python with help of
STDIN. Now, these external mappers process these key-value pairs and generate intermediate
key-value pairs with help of STDOUT and send it to the internal mappers.

Similarly, Reducer does the same thing. Once the intermediate key-value pairs are processed
through the shuffle and sorting process they are fed to the internal reducer which will send
these pairs to external reducer process that are working separately through the help of STDIN
and gathers the output generated by external reducers with help of STDOUT and finally the
output is stored to our HDFS.

This is how Hadoop Streaming works; it is available by default in Hadoop. We are just utilizing this feature by writing our own external mappers and reducers. Now we can see how powerful a feature Hadoop Streaming is: anyone can write code in any language of their choice.
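As a rough sketch of how such a streaming job is launched (the jar location varies with the Hadoop version, and the input/output paths here are only placeholders), standard Unix utilities can serve as the mapper and reducer:

% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Here the mapper simply echoes its input lines and the reducer counts them; any executable that reads from STDIN and writes to STDOUT could be substituted.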
Design of HDFS - Java interfaces to HDFS

In this section, we dig into Hadoop's FileSystem class: the API for interacting with one of Hadoop's filesystems. Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local filesystem.

Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}

There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop.
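A minimal sketch of this idiom, along the lines of the classic URLCat example (error handling kept to a minimum):

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output, like Unix cat.
public class URLCat {

  static {
    // Can be called only once per JVM, hence the static block.
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream(); // e.g. hdfs://host/path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}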

Reading Data Using the FileSystem API

As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.

A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the
filesystem we want to use—HDFS in this case. There are several static factory methods for
getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException

public static FileSystem get(URI uri, Configuration conf) throws IOException

public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
A Configuration object encapsulates a client or server’s configuration, which is set using
configuration files read from the classpath, such as conf/core-site.xml. The first method
returns the default filesystem (as specified in the file conf/core-site.xml, or the default local
filesystem if not specified there). The second uses the given URI’s scheme and authority to
determine the filesystem to use, falling back to the default filesystem if no scheme is
specified in the given URI. The third retrieves the filesystem as the given user.

In some cases, you may want to retrieve a local filesystem instance, in which case you can
use the convenience method, getLocal():

public static LocalFileSystem getLocal(Configuration conf) throws IOException

Displaying files from a Hadoop filesystem on standard output by using the FileSystem
directly

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

FSDataInputStream
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
  // implementation elided
}
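Because FSDataInputStream is seekable, a file can be re-read after seeking back to the start. A small sketch (the HDFS URI is passed as the first command-line argument):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Prints a file twice by seeking back to the beginning between the two reads.
public class FileSystemDoubleCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}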

Writing Data

The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:

public FSDataOutputStream create(Path f) throws IOException

FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {
  public long getPos() throws IOException {
    // implementation elided
  }
}
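As an illustrative sketch, the following program copies a local file to a Hadoop filesystem using create(); the source and destination paths are taken from the command line, and a Progressable callback could additionally be passed to create() to report write progress:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Copies a local file to a Hadoop filesystem path given on the command line.
public class FileCopy {

  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst)); // returns an FSDataOutputStream

    IOUtils.copyBytes(in, out, 4096, true); // true = close both streams when done
  }
}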

Developing a Map Reduce Application

● Write the map, reduce, and driver functions.

● Test with a small subset of the dataset.

● If it fails, use an IDE's debugger to identify and solve the problem.

● Run on the full dataset, and if it fails, debug it using Hadoop's debugging tools.

● Do profiling to tune the performance of the program.


Mapper Phase Code

The first stage in the development of a MapReduce application is the Mapper class. Here, the RecordReader processes each input record and generates the respective key-value pair.

Hadoop stores this intermediate map output on the local disk.

Reducer Phase Code

The Intermediate output generated from the mapper is fed to the reducer which processes it
and generates the final output which is then saved in the HDFS.

Driver code

The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
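The classic word count program shows how these three pieces fit together. The sketch below uses the newer org.apache.hadoop.mapreduce API; the class names and the input/output paths passed on the command line are only illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input line.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the job together and submits it.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}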
Debugging a Mapreduce Application

For the process of debugging, log files are essential. Log files can be found on the local filesystem of each TaskTracker, and if JVM reuse is enabled, each log accumulates the entire JVM run. Anything written to standard output or standard error is directed to the relevant logfile.

How does MapReduce Work?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

● The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
● The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.

Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.

Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.

Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.

Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a
part of the main MapReduce algorithm; it is optional.

Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is running.
The individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer task.

Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.

Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.

Advantage of MapReduce

Fault tolerance: It can handle failures without downtime.


Speed: It splits, shuffles, and reduces the unstructured data in a short time.
Cost-effective: Hadoop MapReduce has a scale-out feature that enables users to process or
store the data in a cost-effective manner.
Scalability: It provides a highly scalable framework. MapReduce allows users to run
applications from many nodes.
Parallel Processing: Here multiple job-parts of the same dataset can be processed in a parallel manner. This can reduce the time taken to complete a task.

Limitations Of MapReduce

● MapReduce cannot cache the intermediate data in memory for further use, which diminishes the performance of Hadoop.
● It is only suitable for batch processing of huge amounts of data.

Anatomy of a Map Reduce Job run


There are five independent entities:

● The client, which submits the MapReduce job.


● The YARN resource manager, which coordinates the allocation of compute resources
on the cluster.
● The YARN node managers, which launch and monitor the compute containers on
machines in the cluster.
● The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
● The distributed file system, which is used for sharing job files between
the other entities.

Job Submission :

● The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
● Having submitted the job, waitForCompletion polls the job’s progress once per
second and reports the progress to the console if it has changed since the last report.
● When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.

The job submission process implemented by JobSubmitter does the following:

● Asks the resource manager for a new application ID, used for the MapReduce job ID.
● Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
● Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
● Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the shared filesystem in a
directory named after the job ID.
● Submits the job by calling submitApplication() on the resource manager.

Job Initialization :

● When the resource manager receives a call to its submitApplication() method, it hands
off the request to the YARN scheduler.
● The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management.
● The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster .
● It initializes the job by creating a number of bookkeeping objects to keep track of the
job’s progress, as it will receive progress and completion reports from the tasks.
● It retrieves the input splits computed in the client from the shared filesystem.
● It then creates a map task object for each split, as well as a number of reduce task
objects determined by the mapreduce.job.reduces property (set by the
setNumReduceTasks() method on Job).

Task Assignment:

● If the job does not qualify for running as an uber task, then the application master
requests containers for all the map and reduce tasks in the job from the resource
manager .
● Requests for map tasks are made first and with a higher priority than those for reduce
tasks, since all the map tasks must complete before the sort phase of the reduce can
start.
● Requests for reduce tasks are not made until 5% of map tasks have completed.

Job Scheduling

Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in
order of submission, using a FIFO scheduler. Typically, each job would use the whole
cluster, so jobs had to wait their turn. Although a shared cluster offers great potential for
offering large resources to many users, the problem of sharing resources fairly between users
requires a better scheduler. Production jobs need to complete in a timely manner, while
allowing users who are making smaller ad hoc queries to get results back in a reasonable time.

Later on, the ability to set a job’s priority was added, via the mapred.job.priority property or
the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH,
HIGH, NORMAL, LOW, or VERY_LOW). When the job scheduler is choosing the next job
to run, it selects one with the highest priority. However, with the FIFO scheduler, priorities
do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.
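For illustration only, with the old (mapred) API the priority can be raised by setting the property named above on the job configuration before submitting the job; this is just a sketch:

import org.apache.hadoop.mapred.JobConf;

public class PrioritySketch {

  public static JobConf highPriorityConf() {
    JobConf conf = new JobConf();
    // One of VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
    conf.set("mapred.job.priority", "HIGH");
    return conf;
  }
}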

MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce is the
original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair
Scheduler and the Capacity Scheduler.
Capacity Scheduler

In Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity
Scheduler allows multiple occupants to share a large size Hadoop cluster. In Capacity
Scheduler corresponding for each job queue, we provide some slots or cluster resources for
performing job operations. Each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, those tasks can also use the free slots of other queues; when new tasks enter another queue, the slots borrowed from it are handed back to its own jobs.

The Capacity Scheduler also provides a level of abstraction to know which occupant is utilizing more cluster resources or slots, so that a single user or application doesn't take a disproportionate or unnecessary number of slots in the cluster. The Capacity Scheduler mainly contains three types of queues, namely root, parent, and leaf, which are used to represent the cluster, an organization or subgroup, and application submission, respectively.

Advantage:

● Best for working with Multiple clients or priority jobs in a Hadoop cluster
● Maximizes throughput in the Hadoop cluster

Disadvantage:

● More complex
● Not easy to configure for everyone
Fair Scheduler

The Fair Scheduler is very much similar to that of the capacity scheduler. The priority of the
job is kept in consideration. With the help of Fair Scheduler, the YARN applications can
share the resources in the large Hadoop Cluster and these resources are maintained
dynamically so no need for prior capacity. The resources are distributed in such a manner that
all applications within a cluster get an equal amount of time. Fair Scheduler takes Scheduling
decisions on the basis of memory, we can configure it to work with CPU also.

As mentioned, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arrives in the same queue, the task is processed in parallel by taking over some portion of the already dedicated slots.

Advantages:

● Resources assigned to each application depend upon its priority.


● It can limit the number of concurrently running tasks in a particular pool or queue.

Disadvantages: The configuration is required.


Task Execution:

● Once a task has been assigned resources for a container on a particular node by the
resource manager’s scheduler, the application master starts the container by
contacting the node manager.
● The task is executed by a Java application whose main class is YarnChild. Before it
can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache.
● Finally, it runs the map or reduce task.
Streaming:

● Streaming runs special map and reduce tasks for the purpose of launching the user
supplied executable and communicating with it.
● The Streaming task communicates with the process (which may be written in any
language) using standard input and output streams.
● During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
● From the node manager’s point of view, it is as if the child ran the map or reduce code
itself.

Progress and status updates :

● MapReduce jobs are long running batch jobs, taking anything from tens of seconds to
hours to run.
● A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g. running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
● When a task is running, it keeps track of its progress (i.e. the proportion of the task completed).
● For map tasks, this is the proportion of the input that has been processed.
● For reduce tasks, it’s a little more complex, but the system can still estimate the
proportion of the reduce input processed.

It does this by dividing the total progress into three parts, corresponding to the three phases of
the shuffle.

● As the map or reduce task runs, the child process communicates with its parent
application master through the umbilical interface.
● The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job, every three seconds over the
umbilical interface.
How status updates are propagated through the MapReduce System

● The resource manager web UI displays all the running applications with links to the
web UIs of their respective application masters,each of which displays further details
on the MapReduce job, including its progress.
● During the course of the job, the client receives the latest status by polling the
application master every second (the interval is set via
mapreduce.client.progressmonitor.pollinterval).

Job Completion:

● When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to Successful.
● Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from waitForCompletion().
● Finally, on job completion, the application master and the task containers clean up their working state and the OutputCommitter's commitJob() method is called.
● Job information is archived by the job history server to enable later interrogation by
users if desired.

Task execution

Once the resource manager's scheduler assigns resources to the task for a container on a particular node, the container is started up by the application master by contacting the node manager. The task, whose main class is YarnChild, is executed by a Java application.

It localizes the resources that the task needed before it can run the task. It includes the job
configuration, any files from the distributed cache and JAR file. It finally runs the map or the
reduce task. Any kind of bugs in the user-defined map and reduce functions (or even in
YarnChild) don’t affect the node manager as YarnChild runs in a dedicated JVM. So it can’t
be affected by a crash or hang.

Each task can perform setup and commit actions, which run in the same JVM as the task itself. These are determined by the OutputCommitter for the job. For file-based jobs, the commit action moves the task output from its initial position to its final location. When speculative execution is enabled, the commit protocol ensures that only one of the duplicate tasks is committed and the other one is aborted.
What does Streaming mean?

Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it. It communicates with the process using standard input and output streams. During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.

From the node manager's point of view, it is as if the child process ran the map or reduce code itself. MapReduce jobs can take anything from tens of seconds to hours to run, which is why they are long-running batch jobs. Because this can be a significant length of time, it's important for the user to get feedback on how the job is progressing. Each job, including each of its tasks, has a status, including the state of the job or task, the values of the job's counters, the progress of maps and reduces, and a description or status message. These statuses change over the course of the job.
When a task is running, it keeps track of its progress (i.e., the proportion of the task completed). For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed.

Process involved

● Read an input record in a mapper or reducer.


● Write an output record in a mapper or reducer.
● Set the status description.
● Increment a counter using Reporter’s incrCounter() method or Counter’s increment()
method.
● Call Reporter’s or TaskAttemptContext’s progress() method.

Types of InputFormat in MapReduce

In Hadoop, there are various MapReduce types for InputFormat that are used for various
purposes. Let us now look at the MapReduce types of InputFormat:

FileInputFormat

It serves as the foundation for all file-based InputFormats. FileInputFormat also provides the
input directory, which contains the location of the data files. When we start a MapReduce
task, FileInputFormat returns a path with files to read. This Input Format will read all files.
Then it divides these files into one or more InputSplits.

TextInputFormat
It is the standard InputFormat. Each line of each input file is treated as a separate record by this InputFormat. It does not parse anything. TextInputFormat is suitable for raw data or line-based records, such as log files. Hence:

● Key: It is the byte offset of the first line within the file (not the entire file split). As a
result, when paired with the file name, it will be unique.

● Value: It is the line's substance. It does not include line terminators.

KeyValueTextInputFormat

It is comparable to TextInputFormat. Each line of input is also treated as a separate record by this InputFormat. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat divides the line into key and value by a tab character ('\t'). Hence:

● Key: Everything up to and including the tab character.

● Value: It is the remaining part of the line after the tab character.

SequenceFileInputFormat

It's an input format for reading sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and support direct serialization and deserialization of a variety of data types. Hence the key and value are both user-defined.

SequenceFileAsTextInputFormat

It is a subtype of SequenceFileInputFormat. The sequence file keys and values are converted to Text objects using this format. It performs the conversion by calling 'toString()' on the keys and values. As a result, SequenceFileAsTextInputFormat converts sequence files into text-based input for streaming.

NlineInputFormat
It is a variant of TextInputFormat in which the keys are the line's byte offset. And values are
the line's contents. As a result, each mapper receives a configurable number of lines of
TextInputFormat and KeyValueTextInputFormat input. The number is determined by the
magnitude of the split. It is also dependent on the length of the lines. So, if we want our
mapper to accept a specific amount of lines of input, we use NLineInputFormat.

N- It is the number of lines of input received by each mapper.

Each mapper receives exactly one line of input by default (N=1).

Assuming N=2, each split has two lines. As a result, the first two Key-Value pairs are
distributed to one mapper. The second two key-value pairs are given to another mapper.

DBInputFormat

Using JDBC, this InputFormat reads data from a relational Database. It also loads small
datasets, which might be used to connect with huge datasets from HDFS using multiple
inputs. Hence:

● Key: LongWritables

● Value: DBWritables.
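In the driver, the input format is selected with job.setInputFormatClass(). A hedged sketch of choosing between two of the formats described above (both classes live in org.apache.hadoop.mapreduce.lib.input):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatConfig {

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input format demo");

    // Tab-separated key/value records, one record per line.
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively (this call overrides the previous one), give each mapper
    // a fixed number of input lines.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 2); // N = 2 lines per mapper
  }
}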

Output Format in MapReduce

The output format classes work in the opposite direction to their corresponding input format classes. The TextOutputFormat, for example, is the default output format that writes records as plain text files, although keys and values can be of any type and are converted to strings by calling the toString() method. A tab character separates the key and the value, but this can be changed by modifying the separator attribute of the text output format.

SequenceFileOutputFormat is used to write a sequence of binary output to a file. Binary outputs are especially valuable if they are used as input to another MapReduce process.

DBOutputFormat handles the output formats for relational databases and HBase. It sends the reduce output to a SQL table.
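Correspondingly, the output format is chosen in the driver, and the TextOutputFormat separator can be changed through a configuration property. The sketch below is illustrative; the property name shown is the one used by the newer API and should be treated as an assumption for your particular version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatConfig {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Change the key-value separator written by TextOutputFormat (a tab by default).
    conf.set("mapreduce.output.textoutputformat.separator", ",");

    Job job = Job.getInstance(conf, "output format demo");
    job.setOutputFormatClass(TextOutputFormat.class);
    // For binary output feeding another MapReduce job, SequenceFileOutputFormat
    // (org.apache.hadoop.mapreduce.lib.output) could be used instead.
  }
}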
Features of MapReduce

There are some key features of MapReduce below:

Scalability

MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for
Big Data applications.

Fault Tolerance

MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as
needed.

Data Locality

MapReduce takes advantage of data locality by processing data on the same node where it is
stored, minimizing data movement across the network and improving overall performance.

Simplicity

The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather than
low-level details.
Cost-Effective Solution

Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.

Parallel Programming

Tasks are divided by the programming model to allow for the simultaneous execution of independent operations. As a result, programs run faster due to parallel processing, making it easier to handle each job. Thanks to parallel processing, these distributed tasks can be performed by multiple processors, so the overall workload completes faster.

UNIT IV

HADOOP ENVIRONMENT

Setting up a Hadoop Cluster

A Hadoop cluster is a combined group of inexpensive commodity machines. These units are connected to a dedicated server which is used as the sole data organizing source and works as a centralized unit throughout the working process. In simple terms, it is a common type of cluster used for computational tasks. This cluster is helpful in distributing the workload for analyzing data. The workload over a Hadoop cluster is distributed among several nodes, which work together to process data. It can be explained by considering the following terms:

1. Distributed Data Processing: In distributed data processing, a large amount of data is mapped, reduced, and scrutinized. A job tracker is assigned for all of these functionalities; apart from the job tracker, there are data nodes and task trackers. All of these play a huge role in processing the data.
2. Distributed Data Storage: It allows storing a huge amount of data in terms of a name node and a secondary name node. Both of these nodes work with data nodes and task trackers.
How does Hadoop Cluster Makes Working so Easy?

It plays an important role in collecting and analyzing data in a proper way, and it is useful in performing a number of tasks, which brings ease to any task.

● Add nodes: It is easy to add nodes to the cluster to help in other functional areas. Without enough nodes, it is not possible to scrutinize the data from unstructured sources.
● Data Analysis: This special type of cluster is compatible with parallel computation to analyze the data.
● Fault tolerance: The data stored on any single node is not reliable on its own, so a copy of the data is kept on other nodes.

Uses of Hadoop Cluster:

● It is extremely helpful in storing different types of data sets.
● Compatible with the storage of a huge amount of diverse data.
● A Hadoop cluster fits best in situations requiring parallel computation for processing
the data.
● It is also helpful for data cleaning processes.

Major Tasks of Hadoop Cluster:

1. It is suitable for performing data processing activities.
2. It is a great tool for collecting bulk amounts of data.
3. It also adds great value in the data serialization process.

Working with Hadoop Cluster:

While working with a Hadoop cluster it is important to understand its architecture, as follows:

● Master nodes: The master node plays a great role in collecting a huge amount of data in
the Hadoop Distributed File System (HDFS). Apart from that, it coordinates storing
data and parallel computation by applying MapReduce.
● Slave nodes: These are responsible for the collection of data. While performing any
computation, the slave node is held responsible for any situation or result.
● Client nodes: Hadoop is installed on these along with the configuration settings. When
the Hadoop cluster needs to load data, it is the client node that is held responsible for
this task.

Advantages:

1. Cost-effective: It offers a cost-effective solution for data storage and analysis.
2. Quick process: The storage system in a Hadoop cluster runs fast and provides speedy
results, which is especially helpful when a huge amount of data is involved.
3. Easy accessibility: It helps to access new sources of data easily and can be used to
collect both structured and unstructured data.

Architecture of Hadoop Cluster

Typical two-level network architecture for a Hadoop cluster

Cluster Setup and Installation

This section describes how to install and configure a basic Hadoop cluster from scratch using
the Apache Hadoop distribution on a Unix operating system. It provides background
information on the things you need to think about when setting up Hadoop. For a production
installation, most users and operators should consider one of the Hadoop cluster management
tools.

Installing Java

Hadoop runs on both Unix and Windows operating systems, and requires Java to be installed.
For a production installation, you should select a combination of operating system, Java, and
Hadoop that has been certified by the vendor of the Hadoop distribution you are using. There
is also a page on the Hadoop wiki that lists combinations that community members have run
with success.

Creating Unix User Accounts

It’s good practice to create dedicated Unix user accounts to separate the Hadoop processes
from each other, and from other services running on the same machine. The HDFS,
MapReduce, and YARN services are usually run as separate users, named hdfs, mapred, and
yarn, respectively. They all belong to the same hadoop group.

Installing Hadoop

Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the
distribution in a sensible location, such as /usr/local (/opt is another standard choice; note that
Hadoop should not be installed in a user’s home directory, as that may be an NFS-mounted
directory):

% cd /usr/local

% sudo tar xzf hadoop-x.y.z.tar.gz

You also need to change the owner of the Hadoop files to be the hadoop user and group:

% sudo chown -R hadoop:hadoop hadoop-x.y.z

It’s convenient to put the Hadoop binaries on the shell path too:

% export HADOOP_HOME=/usr/local/hadoop-x.y.z

% export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Configuring SSH

The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
operations. For example, there is a script for stopping and starting all the daemons in the
cluster. Note that the control scripts are optional; cluster-wide operations can be performed
by other mechanisms, too, such as a distributed shell or dedicated Hadoop management
applications. To work seamlessly, SSH needs to be set up to allow passwordless login for the
hdfs and yarn users from machines in the cluster. The simplest way to achieve this is to
generate a public/private key pair and place it in an NFS location that is shared across the
cluster.

First, generate an RSA key pair by typing the following. You need to do this twice, once as
the hdfs user and once as the yarn user:

% ssh-keygen -t rsa -f ~/.ssh/id_rsa

Even though we want passwordless logins, keys without passphrases are not considered good
practice (it’s OK to have an empty passphrase when running a local pseudo distributed
cluster, as described in Appendix A), so we specify a passphrase when prompted for one. We
use ssh-agent to avoid the need to enter a password for each connection.

The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is
stored in a file with the same name but with .pub appended, ~/.ssh/id_rsa.pub.

Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the
machines in the cluster that we want to connect to. If the users’ home directories are stored on
an NFS filesystem, the keys can be shared across the cluster by typing the following (first as
hdfs and then as yarn):

% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If the home directory is not shared using NFS, the public keys will need to be shared by some
other means (such as ssh-copy-id). Test that you can SSH from the master to a worker
machine by making sure ssh-agent is running, and then run ssh-add to store your passphrase.
You should be able to SSH to a worker without entering the passphrase again.

Installing and Setting Up Hadoop in Pseudo-Distributed Mode

To set up and install Hadoop in pseudo-distributed mode, use the steps given below. Let's
discuss them one by one.
Step 1: Download Binary Package :

Download the latest binary from the following site:

http://hadoop.apache.org/releases.html

For reference, save the downloaded file to a folder such as:

C:\BigData

Step 2: Unzip the binary package

Open Git Bash, and change directory (cd) to the folder where you save the binary package
and then unzip as follows.

$ cd C:\BigData

MINGW64: C:\BigData

$ tar -xvzf hadoop-3.1.2.tar.gz

In this case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.

Next, go to this GitHub repo and download the bin folder as a zip as shown below. Extract
the zip and copy all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin.
Replace the existing files as well.

Step 3: Create folders for datanode and namenode :

● Go to C:\BigData\hadoop-3.1.2 and create a folder 'data'. Inside the 'data' folder
create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under
the datanode folder.
● Set Hadoop Environment Variables
● Hadoop requires the following environment variables to be set.

HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
JAVA_HOME=<Root of your JDK installation>

● To set these variables, navigate to My Computer or This PC.

If you don't have Java 1.8 installed, then you'll have to download and install it first. If the
JAVA_HOME environment variable is already set, then check whether the path has any
spaces in it (e.g., C:\Program Files\Java\...). Spaces in the JAVA_HOME path will lead you to
issues. There is a trick to get around it: replace 'Program Files' with 'Progra~1' in the variable
value. Ensure that the version of Java is 1.8 and JAVA_HOME is pointing to JDK 1.8.

Step 4: Make a short name for the Java Home path

Now that we have set the environment variables, we need to validate them. Open a new
Windows Command Prompt and run an echo command on each variable to confirm they are
assigned the desired values.

If the variables are not initialized yet, it is likely because you are testing them in an old
session. Make sure you have opened a new command prompt to test them.

Step 5: Configure Hadoop

Once environment variables are set up, we need to configure Hadoop by editing the following
configuration files.

First, let's configure the Hadoop environment file. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\hadoop-env.cmd and add the content below at the
bottom of the file.
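The referenced content is not reproduced in these notes. As a minimal sketch (the JDK path below is a placeholder; use your own installation directory, applying the Progra~1 trick mentioned earlier if the path contains spaces), the essential line is the Java home setting:

set JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_201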

Step 6: Edit hdfs-site.xml

Next, you need to set the replication factor and the locations of the namenode and datanode
directories. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add the content
below within the <configuration> </configuration> tags.
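The exact entries are not shown in the notes; a minimal sketch of what is typically placed here (assuming a replication factor of 1 for a single-node setup and the data\namenode and data\datanode folders created in Step 3):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>C:\BigData\hadoop-3.1.2\data\namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>C:\BigData\hadoop-3.1.2\data\datanode</value>
</property>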
Step 7: Edit core-site.xml

Now, configure Hadoop Core's settings. Open C:\BigData\hadoop-3.1.2\etc\hadoop\core-site.xml
and add the content below within the <configuration> </configuration> tags.
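A minimal sketch of the core-site.xml entry typically used (the port 9000 is a common default, not something specified in these notes):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>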

Step 8: YARN configurations


Edit file yarn-site.xml

Make sure the following entries exist within the <configuration> </configuration> tags.
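A minimal sketch of the yarn-site.xml entries commonly used for a single-node setup (assuming the standard MapReduce shuffle handler):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>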

Step 9: Edit mapred-site.xml

Finally, let's configure properties for the MapReduce framework. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the content below within the
<configuration> </configuration> tags. If you don't see mapred-site.xml, open the
mapred-site.xml.template file and rename it to mapred-site.xml.
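A minimal sketch of the property typically placed in mapred-site.xml so that MapReduce jobs run on YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>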
Check if the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present; if it's not, create one,
add localhost in it, and save it.

Step 10: Format Name Node :

To format the Name Node, open a new Windows Command Prompt and run the command
below. It might give you a few warnings; ignore them.

● hadoop namenode -format

Step 11: Launch Hadoop :

Open another Windows Command Prompt, and make sure to run it as Administrator to avoid
permission errors. Once opened, execute the start-all.cmd command. Since we have added
%HADOOP_HOME%\sbin to the PATH variable, you can run this command from any
folder. If you haven't done so, go to the %HADOOP_HOME%\sbin folder and run the
command. You can check the screenshot below for reference; 4 new windows with cmd
terminals will open for 4 daemon processes, as follows.

● namenode
● datanode
● node manager
● resource manager

Don’t close these windows, minimize them. Closing the windows will terminate the
daemons. You can run them in the background if you don’t like to see these windows.

Step 12: Hadoop Web UI

Finally, let's monitor how the Hadoop daemons are doing. You can also use the Web UI for a
wide range of administrative and monitoring purposes. Open your browser and begin.

Step 13: Resource Manager

Open localhost:8088 to open Resource Manager

Step 14: Node Manager

Open localhost:8042 to open Node Manager

Step 15: Name Node :

Open localhost:9870 to check out the health of Name Node

Step 16: Data Node :

Open localhost:9864 to check out Data Node


HDFS

HDFS (Hadoop Distributed File System) is used for storage in a Hadoop cluster. It is mainly
designed to work on commodity hardware (devices that are inexpensive), using a distributed
file system design. HDFS is designed in such a way that it believes more in storing the data in
large blocks rather than storing many small data blocks. HDFS in Hadoop provides fault
tolerance and high availability to the storage layer and the other devices present in that
Hadoop cluster.

HDFS is capable of handling large data with high volume, velocity, and variety, which makes
Hadoop work more efficiently and reliably with easy access to all its components. HDFS
stores the data in the form of blocks, where the size of each data block is 128 MB; this is
configurable, meaning you can change it according to your requirement in the hdfs-site.xml
file in your Hadoop directory.
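For example (an illustrative sketch, not part of the original notes), the block size could be raised to 256 MB by adding the following property to hdfs-site.xml:

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>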

Some Important Features of HDFS(Hadoop Distributed File System)

● It’s easy to access the files stored in HDFS.


● HDFS also provides high availability and fault tolerance.
● Provides scalability to scaleup or scaledown nodes as per our requirement.
● Data is stored in a distributed manner, i.e. various DataNodes are responsible for
storing the data.
● HDFS provides Replication because of which no fear of Data Loss.
● HDFS Provides High Reliability as it can store data in a large range of Petabytes.
● HDFS has in-built servers in Name node and Data Node that helps them to easily
retrieve the cluster information.
● Provides high throughput.

HDFS Storage Daemons

As we all know, Hadoop works on the MapReduce algorithm, which follows a master-slave
architecture, and HDFS has a NameNode and DataNodes that work in a similar pattern.

1. NameNode(Master)
2. DataNode(Slave)
1. NameNode: NameNode works as a Master in a Hadoop cluster that Guides the
Datanode(Slaves). Namenode is mainly used for storing the Metadata i.e. nothing but the data
about the data. Meta Data can be the transaction logs that keep track of the user’s activity in a
Hadoop cluster.

Meta Data can also be the name of the file, size, and the information about the location(Block
number, Block ids) of Datanode that Namenode stores to find the closest DataNode for Faster
Communication. Namenode instructs the DataNodes with the operation like delete, create,
Replicate, etc.

As our NameNode is working as a Master it should have a high RAM or Processing power in
order to Maintain or Guide all the slaves in a Hadoop cluster. Namenode receives heartbeat
signals and block reports from all the slaves i.e. DataNodes.

2. DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data
in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even more. The more
DataNodes your Hadoop cluster has, the more data can be stored, so it is advised that each
DataNode should have high storage capacity to store a large number of file blocks.
DataNodes perform operations like creation, deletion, etc. according to the instructions
provided by the NameNode.
Objectives and Assumptions Of HDFS

1. System Failure: As a Hadoop cluster consists of lots of nodes built from commodity
hardware, node failure is possible, so a fundamental goal of HDFS is to detect such failures
and recover from them.

2. Maintaining Large Datasets: As HDFS handles files of sizes ranging from GB to PB,
HDFS has to be robust enough to deal with these very large data sets on a single cluster.

3. Moving Data is Costlier than Moving the Computation: If the computational operation
is performed near the location where the data is present, then it is quite a bit faster, and the
overall throughput of the system can be increased along with minimizing the network
congestion, which is a good assumption.

4. Portable Across Various Platforms: HDFS possesses portability, which allows it to
switch across diverse hardware and software platforms.

5. Simple Coherency Model: A Hadoop Distributed File System needs a write-once,
read-many access model for files. A file that has been written and then closed should not be
changed; only data can be appended. This assumption helps us to minimize data coherency
issues. MapReduce fits perfectly with such a file model.
6. Scalability: HDFS is designed to be scalable as the data storage requirements increase
over time. It can easily scale up or down by adding or removing nodes to the cluster. This
helps to ensure that the system can handle large amounts of data without compromising
performance.

7. Security: HDFS provides several security mechanisms to protect data stored on the
cluster. It supports authentication and authorization mechanisms to control

access to data, encryption of data in transit and at rest, and data integrity checks to detect any
tampering or corruption.

8. Data Locality: HDFS aims to move the computation to where the data resides rather than
moving the data to the computation. This approach minimizes network traffic and enhances
performance by processing data on local nodes.

9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes it a cost-
effective solution for large-scale data processing. Additionally, the ability to scale up or down
as required means that organizations can start small and expand over time, reducing upfront
costs.

10. Support for Various File Formats: HDFS is designed to support a wide range of file
formats, including structured, semi-structured, and unstructured data. This makes it easier to
store and process different types of data using a single system, simplifying data management
and reducing costs.

Hdfs administration:

Hdfs administration and MapReduce administration, both concepts come under Hadoop
administration.

● Hdfs administration: It includes monitoring the HDFS file structure, location and
updated files.
● MapReduce administration: It includes monitoring the list of applications, the
configuration of nodes, and application status.
Hadoop Benchmarks

Hadoop comes with several benchmarks that you can run very easily with minimal setup cost.
Benchmarks are packaged in the tests JAR file, and you can get a list of them, with
descriptions, by invoking the JAR file with no arguments:

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar

Most of the benchmarks show usage instructions when invoked with no arguments. For
example:
% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \

TestDFSIO

TestDFSIO.1.7

Missing arguments.

Usage: TestDFSIO [genericOptions] -read [-random | -backward |

-skip [-skipSize Size]] | -write | -append | -clean [-compression codecClassName]

[-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName]

[-bufferSize Bytes] [-rootDir]
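For example, based on the usage shown above (the file count and size here are arbitrary illustrations, not values from the original notes), a typical write-then-read-then-clean benchmark run looks like:

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
TestDFSIO -write -nrFiles 10 -size 1GB

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
TestDFSIO -read -nrFiles 10 -size 1GB

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-*-tests.jar \
TestDFSIO -clean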

Benchmarking MapReduce with TeraSort

Hadoop comes with a MapReduce program called TeraSort that does a total sort of its input.
It is very useful for benchmarking HDFS and MapReduce together, as the full input dataset is
transferred through the shuffle. The three steps are: generate some random data, perform the
sort, then validate the results.

First, we generate some random data using teragen (found in the examples JAR file, not the
tests one). It runs a map-only job that generates a specified number of rows of binary data.
Each row is 100 bytes long, so to generate one terabyte of data using 1,000 maps, run the
following (10t is short for 10 trillion):

% hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \

teragen -Dmapreduce.job.maps=1000 10t random-data

Next, run terasort:

% hadoop jar \

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \

terasort random-data sorted-data

The overall execution time of the sort is the metric we are interested in, but it’s instructive to
watch the job’s progress via the web UI (http://resource-manager-host:8088/), where you can
get a feel for how long each phase of the job takes.

As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:

% hadoop jar \

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \

teravalidate sorted-data report

This command runs a short MapReduce job that performs a series of checks on the

sorted data to check whether the sort is accurate. Any errors can be found in the report/

part-r-00000 output file.

Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

• TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a
convenient way to read or write files in parallel.

• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good
counterpoint to TeraSort, as it checks whether small job runs are responsive.

• NNBench (invoked with nnbench) is useful for load-testing namenode hardware.


• Gridmix is a suite of benchmarks designed to model a realistic cluster workload by
mimicking a variety of data-access patterns seen in practice. See the documentation in the
distribution for how to run Gridmix.

• SWIM, or the Statistical Workload Injector for MapReduce, is a repository of real life
MapReduce workloads that you can use to generate representative test workloads for your
system.

• TPCx-HS is a standardized benchmark based on TeraSort from the Transaction Processing


Performance Council

Hadoop in the cloud

Hadoop on AWS

Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process and
analyze large datasets using the latest versions of big data processing frameworks such as
Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.

Key features include:

● Ability to launch Amazon EMR clusters in minutes, with no need to manage node
configuration, cluster setup, Hadoop configuration or cluster tuning.
● Simple and predictable pricing: a flat hourly rate for every instance-hour, with the
ability to leverage low-cost Spot Instances.
● Ability to provision one, hundreds, or thousands of compute instances to process data
at any scale.
● Amazon provides the EMR File System (EMRFS) to run clusters on demand based on
persistent HDFS data in Amazon S3. When the job is done, users can terminate the
cluster and store the data in Amazon S3, paying only for the actual time the cluster
was running.

Hadoop on Azure
Azure HDInsight is a managed, open-source analytics service in the cloud. HDInsight allows
users to leverage open-source frameworks such as Hadoop, Apache Spark, Apache Hive,
LLAP, Apache Kafka, and more, running them in the Azure cloud environment.

Azure HDInsight is a cloud distribution of Hadoop components. It makes it easy and
cost-effective to process massive amounts of data in a customizable environment. HDInsight
supports a broad range of scenarios such as extract, transform, and load (ETL), data
warehousing, machine learning, and IoT.

Here are notable features of Azure HDInsight:

● Read and write data stored in Azure Blob Storage and configure several Blob Storage
accounts.
● Implement the standard Hadoop FileSystem interface for a hierarchical view.
● Choose between block blobs to support common use cases like MapReduce and page
blobs for continuous write use cases like HBase write-ahead log.
● Use wasb scheme-based URLs to reference file system paths, with or without SSL
encrypted access.
● Set up HDInsight as a data source in a MapReduce job or a sink.

HDInsight was tested at scale and tested on Linux as well as Windows.

Hadoop on Google Cloud

Google Dataproc is a fully-managed cloud service for running Apache Hadoop and Spark
clusters. It provides enterprise-grade security, governance, and support, and can be used for
general purpose data processing, analytics, and machine learning.

Dataproc uses Cloud Storage (GCS) data for processing and stores it in GCS, Bigtable, or
BigQuery. You can use this data for analysis in your notebook and send logs to Cloud
Monitoring and Logging.

Here are notable features of Dataproc:

● Supports open source tools, such as Spark and Hadoop.


● Lets you customize virtual machines (VMs) that can scale up and down to meet
changing needs.
● Provides on-demand ephemeral clusters to help you reduce costs.
● Integrates tightly with Google Cloud services.

****************

UNIT V – FRAMEWORKS

Applications on Big Data Using Pig and Hive – Data processing operators in

Pig –Hive services – HiveQL – Querying Data in Hive - fundamentals of

HBase and ZooKeeper –SQOOP

Applications on Big Data Using Pig and Hive

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze


larger sets of data representing them as data flows. Pig is generally used with Hadoop; we can
perform all the data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has
a component known as Pig Engine that accepts the Pig Latin scripts as input and converts
those scripts into MapReduce jobs.

Features of Pig
Apache Pig comes with the following features −

● Rich set of operators − It provides many operators to perform
operations like join, sort, filter, etc.
● Ease of programming − Pig Latin is similar to SQL and it is easy
to write a Pig script if you are good at SQL.
● Optimization opportunities − The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus
only on semantics of the language.
● Extensibility − Using the existing operators, users can develop
their own functions to read, process, and write data.
● UDF’s − Pig provides the facility to create User-defined
Functions in other programming languages such as Java and
invoke or embed them in Pig Scripts.
● Handles all kinds of data − Apache Pig analyzes all kinds of data,
both structured as well as unstructured. It stores the results
in HDFS.

Apache Pig Vs MapReduce

Listed below are the major differences between Apache Pig and MapReduce.

● Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
● Apache Pig is a high-level language, whereas MapReduce is low level and rigid.
● Performing a Join operation in Apache Pig is pretty simple, whereas it is quite difficult in
MapReduce to perform a Join operation between datasets.
● Any novice programmer with a basic knowledge of SQL can work conveniently with
Apache Pig, whereas exposure to Java is a must to work with MapReduce.
● Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great
extent, whereas MapReduce will require almost 20 times more lines of code to perform the
same task.
● There is no need for compilation in Apache Pig; on execution, every Apache Pig operator
is converted internally into a MapReduce job, whereas MapReduce jobs have a long
compilation process.

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing


tasks involving ad-hoc processing and quick prototyping. Apache
Pig is used −

● To process huge data sources such as web logs.


● To perform data processing for search platforms.
● To process time sensitive data loads.

Apache Pig - Architecture


Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us
take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.
Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop producing the desired results.

install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by
following the steps given below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories
of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig
directory in the user named Hadoop).

Step 2

Extract the downloaded tar files as shown below.

Step 3

Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as shown
below.
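A minimal sketch of the commands implied by Steps 2 and 3 (assuming the Pig 0.15.0 source tarball was downloaded to the current directory and the Pig directory created in Step 1 is /home/Hadoop/Pig):

$ tar -xzf pig-0.15.0-src.tar.gz

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/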
Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure,


we need to edit two files − bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables −

● PIG_HOME folder to the Apache Pig’s installation folder,


● PATH environment variable to the bin folder, and
● PIG_CLASSPATH environment variable to the etc (configuration) folder of your
Hadoop installations (the directory that contains the core-site.xml, hdfs-site.xml and
mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig

export PATH=$PATH:/home/Hadoop/Pig/bin

export PIG_CLASSPATH=$HADOOP_HOME/conf

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you
can set various parameters as given below.

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is
successful, you will get the version of Apache Pig as shown below.
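For instance (a brief sketch; the exact output depends on your installation), the check can be run from the shell:

$ pig -version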
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDF’s.

Pig Latin – Data Model

As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the
outermost structure of the Pig Latin data model. And it is a bag where −

● A bag is a collection of tuples.


● A tuple is an ordered set of fields.
● A field is a piece of data.

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs.

● These statements work with relations. They include expressions and schemas.
● Every statement ends with a semicolon (;).
● We will perform various operations using operators provided by Pig Latin, through
statements.
● Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
● As soon as you enter a Load statement in the Grunt shell, its semantic checking will
be carried out. To see the contents of the schema, you need to use the Dump operator.
Only after performing the dump operation, the MapReduce job for loading the data
into the file system will be carried out.

Example

Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
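To see the contents of the relation loaded above (as noted, the MapReduce job is only triggered by the Dump operator), you can then run:

grunt> Dump Student_data;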

Pig Latin – Data types

Given below table describes the Pig Latin data types.

S.N. Data Type Description & Example

1 int Represents a signed 32-bit integer.

Example : 8

2 long Represents a signed 64-bit integer.

Example : 5L

3 float Represents a signed 32-bit floating point.

Example : 5.5F

4 double Represents a 64-bit floating point.

Example : 10.5

5 chararray Represents a character array (string) in Unicode UTF-8 format.

Example : ‘tutorials point’

6 Bytearray Represents a Byte array (blob).

7 Boolean Represents a Boolean value.

Example : true/ false.

8 Datetime Represents a date-time.

Example : 1970-01-01T00:00:00.000+00:00

9 Biginteger Represents a Java BigInteger.


Example : 60708090709

10 Bigdecimal Represents a Java BigDecimal

Example : 185.98376256272893883

Complex Types

11 Tuple A tuple is an ordered set of fields.

Example : (raja, 30)

12 Bag A bag is a collection of tuples.

Example : {(raju,30),(Mohhammad,45)}

13 Map A Map is a set of key-value pairs.

Example : [ ‘name’#’Raju’, ‘age’#30]

Null Values

Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does.

A null can be an unknown value or a non-existent value. It is used as a placeholder for


optional values. These nulls can occur naturally or can be the result of an operation.

Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b =
20.

● + (Addition) − Adds values on either side of the operator. Example: a + b will give 30
● − (Subtraction) − Subtracts the right-hand operand from the left-hand operand. Example:
a − b will give −10
● * (Multiplication) − Multiplies values on either side of the operator. Example: a * b will
give 200
● / (Division) − Divides the left-hand operand by the right-hand operand. Example: b / a will
give 2
● % (Modulus) − Divides the left-hand operand by the right-hand operand and returns the
remainder. Example: b % a will give 0
● ?: (Bincond) − Evaluates the Boolean condition. It has three operands, as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
● CASE WHEN THEN ELSE END (Case) − The case operator is equivalent to the nested
bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END

Pig Latin – Comparison Operators

The following table describes the comparison operators of Pig Latin.

● == (Equal) − Checks if the values of two operands are equal or not; if yes, then the
condition becomes true. Example: (a = b) is not true.
● != (Not Equal) − Checks if the values of two operands are equal or not. If the values are
not equal, then the condition becomes true. Example: (a != b) is true.
● > (Greater than) − Checks if the value of the left operand is greater than the value of the
right operand. If yes, then the condition becomes true. Example: (a > b) is not true.
● < (Less than) − Checks if the value of the left operand is less than the value of the right
operand. If yes, then the condition becomes true. Example: (a < b) is true.
● >= (Greater than or equal to) − Checks if the value of the left operand is greater than or
equal to the value of the right operand. If yes, then the condition becomes true. Example:
(a >= b) is not true.
● <= (Less than or equal to) − Checks if the value of the left operand is less than or equal to
the value of the right operand. If yes, then the condition becomes true. Example: (a <= b) is
true.
● matches (Pattern matching) − Checks whether the string on the left-hand side matches the
constant on the right-hand side. Example: f1 matches '.*tutorial.*'

Pig Latin – Type Construction Operators

The following table describes the Type construction operators of Pig Latin.

● () (Tuple constructor operator) − This operator is used to construct a tuple. Example:
(Raju, 30)
● {} (Bag constructor operator) − This operator is used to construct a bag. Example:
{(Raju, 30), (Mohammad, 45)}
● [] (Map constructor operator) − This operator is used to construct a map. Example:
[name#Raja, age#30]

Pig Latin – Relational Operations

The following table describes the relational operators of Pig Latin.

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a
relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.


CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in a sorted order based on one or more


fields (ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.

EXPLAIN To view the logical, physical, or MapReduce execution plans to


compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.
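As a small illustration of these operators (a hedged sketch reusing the Student_data relation loaded earlier; the city value 'Chennai' is made up for the example):

grunt> chennai_students = FILTER Student_data BY city == 'Chennai';
grunt> grouped_by_city = GROUP Student_data BY city;
grunt> DUMP grouped_by_city;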

Hive :
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic
MapReduce.
Benefits :

○ Ease of use
○ Accelerated initial insertion of data
○ Superior scalability, flexibility, and cost-efficiency
○ Streamlined security
○ Low overhead
○ Exceptional working capacity

HBase :
HBase is a column-oriented non-relational database management system that runs on
top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many
big data use cases
HBase does support writing applications in Apache Avro, REST and Thrift.
Application :

○ Medical
○ Sports
○ Web
○ Oil and petroleum
○ E-commerce

Hive Architecture

The following architecture explains the flow of submission of query into Hive.
Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-

● Thrift Server - It is a cross-language service provider platform that serves the request
from all those programming languages that supports Thrift.
● JDBC Driver - It is used to establish a connection between hive and Java applications.
The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
● ODBC Driver - It allows the applications that support the ODBC protocol to connect
to Hive.

Hive Services

The following are the services provided by Hive:-

● Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
● Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
● Hive MetaStore - It is a central repository that stores all the structure information of
various tables and partitions in the warehouse. It also includes metadata of columns
and their type information, the serializers and deserializers which are used to read and
write data, and the corresponding HDFS files where the data is stored.
● Hive Server - It is referred to as Apache Thrift Server. It accepts the request from
different clients and provides it to Hive Driver.
● Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
● Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
● Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of
map-reduce tasks and HDFS tasks. In the end, the execution engine executes the
incoming tasks in the order of their dependencies.

HiveQL
Hive’s SQL dialect, called HiveQL, is a mixture of SQL-92, MySQL, and Oracle’s SQL
dialect. The level of SQL-92 support has improved over time, and will likely continue to get
better. HiveQL also provides features from later SQL standards, such as window functions
(also known as analytic functions) from SQL:2003. Some of Hive’s non-standard extensions
to SQL were inspired by MapReduce, such as multi table inserts and the TRANSFORM,
MAP, and REDUCE clauses .

Data Types

Hive supports both primitive and complex data types. Primitives include numeric, Boolean,
string, and timestamp types.

A list of Hive data types is given below.

Integer Types

● TINYINT − 1-byte signed integer. Range: -128 to 127
● SMALLINT − 2-byte signed integer. Range: -32,768 to 32,767
● INT − 4-byte signed integer. Range: -2,147,483,648 to 2,147,483,647
● BIGINT − 8-byte signed integer. Range: -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807

Decimal Type

● FLOAT − 4-byte single precision floating point number
● DOUBLE − 8-byte double precision floating point number

Date/Time Types

TIMESTAMP

● It supports traditional UNIX timestamp with optional nanosecond precision.


● As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
● As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with
decimal precision.
● As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)

DATES

The Date value is used to specify a particular year, month, and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies
between 0000-01-01 and 9999-12-31.

String Types

STRING

A string is a sequence of characters. Its values can be enclosed within single quotes (') or
double quotes (").

Varchar

The varchar is a variable length type whose range lies between 1 and 65535, which specifies
that the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.


Complex Types

● Struct − It is similar to a C struct or an object where fields are accessed using the "dot"
notation. Example: struct('James','Roy')
● Map − It contains key-value tuples where the fields are accessed using array notation.
Example: map('first','James','last','Roy')
● Array − It is a collection of values of a similar type that are indexable using zero-based
integers. Example: array('James','Roy')

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.

create a new database by using the following command: -

hive> create database demo;

So, a new database is created.

● Let's check the existence of a newly created database.


1. hive> show databases;
● Each database must contain a unique name. If we create two databases with the same
name, the following error generates: -
● If we want to suppress the warning generated by Hive on creating the database with
the same name, follow the below command: -

1. hive> create database if not exists demo;

● Hive also allows assigning properties with the database in the form of key-value pair.

1. hive> create database demo
2. > WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');

● Let's retrieve the information associated with the database.


1. hive> describe database extended demo;

HiveQL - Operators

The HiveQL operators facilitate to perform various arithmetic and relational operations.
Here, we are going to execute such type of operations on the records of the below table:
Example of Operators in Hive

Let's create a table and load the data into it by using the following steps: -

● Select the database in which we want to create a table.


1. hive> use hql;
● Create a hive table using the following command: -
1. hive> create table employee (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;
● Now, load the data into the table.
1. hive> load data local inpath '/home/codegyani/hive/emp_data' into table employee;
● Let's fetch the loaded data by using the following command: -
1. hive> select * from employee;
Now, we discuss arithmetic and relational operators with the corresponding examples.

Arithmetic Operators in Hive

In Hive, the arithmetic operator accepts any numeric type. The commonly used arithmetic
operators are: -

● A + B − This is used to add A and B.
● A - B − This is used to subtract B from A.
● A * B − This is used to multiply A and B.
● A / B − This is used to divide A and B and returns the quotient of the operands.
● A % B − This returns the remainder of A / B.
● A | B − This is used to determine the bitwise OR of A and B.
● A & B − This is used to determine the bitwise AND of A and B.
● A ^ B − This is used to determine the bitwise XOR of A and B.
● ~A − This is used to determine the bitwise NOT of A.

Examples of Arithmetic Operator in Hive


● Let's see an example to increase the salary of each employee by 50.
1. hive> select id, name, salary + 50 from employee;

● Let's see an example to decrease the salary of each employee by 50.


1. hive> select id, name, salary - 50 from employee;

● Let's see an example to find out the 10% salary of each employee.
1. hive> select id, name, (salary * 10) /100 from employee;
Relational Operators in Hive

In Hive, the relational operators are generally used with clauses like Join and Having to
compare the existing records. The commonly used relational operators are: -

● A = B − It returns true if A equals B, otherwise false.
● A <> B, A != B − It returns null if A or B is null; true if A is not equal to B, otherwise false.
● A < B − It returns null if A or B is null; true if A is less than B, otherwise false.
● A > B − It returns null if A or B is null; true if A is greater than B, otherwise false.
● A <= B − It returns null if A or B is null; true if A is less than or equal to B, otherwise
false.
● A >= B − It returns null if A or B is null; true if A is greater than or equal to B, otherwise
false.
● A IS NULL − It returns true if A evaluates to null, otherwise false.
● A IS NOT NULL − It returns false if A evaluates to null, otherwise true.

Examples of Relational Operator in Hive


● Let's see an example to fetch the details of the employee having salary>=25000.
1. hive> select * from employee where salary >= 25000;

● Let's see an example to fetch the details of the employee having salary<25000.
1. hive> select * from employee where salary < 25000;

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.

HBase is a data model that is similar to Google’s big table designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
HBase and HDFS

● HDFS is a distributed file system suitable for storing large files, whereas HBase is a
database built on top of HDFS.
● HDFS does not support fast individual record lookups, whereas HBase provides fast
lookups for larger tables.
● HDFS provides high-latency batch processing, whereas HBase, which has no concept of
batch processing, provides low-latency access to single rows from billions of records
(random access).
● HDFS provides only sequential access to data, whereas HBase internally uses hash tables,
provides random access, and stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have multiple
column families and each column family can have any number of columns. Subsequent
column values are stored contiguously on the disk. Each cell value of the table has a
timestamp. In short, in HBase:

● Table is a collection of rows.


● Row is a collection of column families.
● Column family is a collection of columns.
● Column is a collection of key value pairs.
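As an illustration of this model (a hedged sketch; the table name 'employee' and the column families 'personal' and 'professional' are made up for the example), the following can be run from the HBase shell:

create 'employee', 'personal', 'professional'
put 'employee', 'row1', 'personal:name', 'Raju'
put 'employee', 'row1', 'professional:role', 'Manager'
get 'employee', 'row1'
scan 'employee'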

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. Shortly, they will have column families.

● A row-oriented database is suitable for Online Transaction Processing (OLTP), whereas a
column-oriented database is suitable for Online Analytical Processing (OLAP).
● Row-oriented databases are designed for a small number of rows and columns, whereas
column-oriented databases are designed for huge tables.

The following image shows column families in a column-oriented database:

HBase and RDBMS


● HBase is schema-less; it doesn't have the concept of a fixed column schema and defines
only column families. An RDBMS is governed by its schema, which describes the whole
structure of the tables.
● HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for
small tables, and is hard to scale.
● There are no transactions in HBase, whereas an RDBMS is transactional.
● HBase has de-normalized data, whereas an RDBMS will have normalized data.
● HBase is good for semi-structured as well as structured data, whereas an RDBMS is good
for structured data.

Features of HBase

● HBase is linearly scalable.
● It has automatic failure support.
● It provides consistent reads and writes.
● It integrates with Hadoop, both as a source and a destination.
● It has an easy Java API for clients.
● It provides data replication across clusters.

Where to Use HBase

● Apache HBase is used to have random, real-time read/write access to Big Data.
● It hosts very large tables on top of clusters of commodity hardware.
● Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts up on Google File System, likewise Apache HBase works on top of Hadoop and
HDFS.

Applications of HBase
● It is used whenever there is a need to write heavy applications.
● HBase is used whenever we need to provide fast random access to available data.
● Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase - Architecture

In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.

Note: The term ‘store’ is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.

MasterServer

The master server -

● Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
● Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
● Maintains the state of the cluster by negotiating the load balancing.
● Is responsible for schema changes and other metadata operations such as creation of
tables and column families.

Regions

Regions are nothing but tables that are split up and spread across the region servers.

Region server

The region servers have regions that -

● Communicate with the client and handle data-related operations.


● Handle read and write requests for all the regions under it.
● Decide the size of the region by following the region size thresholds.

When we take a deeper look into the region server, it contains regions and stores as shown
below:

The store contains memory store and HFiles. Memstore is just like a cache memory.
Anything that is entered into the HBase is stored here initially. Later, the data is transferred
and saved in Hfiles as blocks and the memstore is flushed.
Zookeeper

● Zookeeper is an open-source project that provides services like maintaining


configuration information, naming, providing distributed synchronization, etc.
● Zookeeper has ephemeral nodes representing different region servers. Master servers
use these nodes to discover available servers.
● In addition to availability, the nodes are also used to track server failures or network
partitions.
● Clients communicate with region servers via zookeeper.
● In pseudo and standalone modes, HBase itself will take care of zookeeper.

Architecture of ZooKeeper

Take a look at the following diagram. It depicts the “Client-Server Architecture” of


ZooKeeper.
Each one of the components that is a part of the ZooKeeper architecture has been explained
in the following table.

Client − Clients, one of the nodes in our distributed application cluster, access information
from the server. For a particular time interval, every client sends a message to the server to let
the server know that the client is alive. Similarly, the server sends an acknowledgement when
a client connects. If there is no response from the connected server, the client automatically
redirects the message to another server.

Server − A server, one of the nodes in our ZooKeeper ensemble, provides all the services to
clients. It gives an acknowledgement to the client to inform it that the server is alive.

Ensemble − A group of ZooKeeper servers. The minimum number of nodes that is required to
form an ensemble is 3.

Leader − The server node which performs automatic recovery if any of the connected nodes
fails. Leaders are elected on service startup.

Follower − A server node which follows the leader's instructions.

Hierarchical Namespace

The following diagram depicts the tree structure of ZooKeeper file system used for memory
representation. ZooKeeper node is referred as znode. Every znode is identified by a name and
separated by a sequence of path (/).

● In the diagram, first you have a root znode separated by “/”. Under root, you have two
logical namespaces config and workers.
● The config namespace is used for centralized configuration management and the
workers namespace is used for naming.
● Under config namespace, each znode can store upto 1MB of data. This is similar to
UNIX file system except that the parent znode can store data as well. The main
purpose of this structure is to store synchronized data and describe the metadata of the
znode. This structure is called as ZooKeeper Data Model.

Every znode in the ZooKeeper data model maintains a stat structure. A stat simply provides
the metadata of a znode. It consists of Version number, Action control list (ACL),
Timestamp, and Data length.

● Version number − Every znode has a version number, which means
every time the data associated with the znode changes, its
corresponding version number also increases. The use of the
version number is important when multiple ZooKeeper clients
are trying to perform operations over the same znode.
● Action Control List (ACL) − ACL is basically an authentication
mechanism for accessing the znode. It governs all the znode
read and write operations.
● Timestamp − Timestamp represents time elapsed from znode
creation and modification. It is usually represented in
milliseconds. ZooKeeper identifies every change to the
znodes from “Transaction ID” (zxid). Zxid is unique and maintains time
for each transaction so that you can easily identify the time elapsed from one request
to another request.
● Data length − Total amount of the data stored in a znode is the
data length. You can store a maximum of 1MB of data.

Types of Znodes

Znodes are categorized as persistence, sequential, and ephemeral.

● Persistence znode − Persistence znode is alive even after the


client, which created that particular znode, is disconnected.
By default, all znodes are persistent unless otherwise
specified.
● Ephemeral znode − Ephemeral znodes are active until the client
is alive. When a client gets disconnected from the ZooKeeper
ensemble, then the ephemeral znodes get deleted
automatically. For this reason, only ephemeral znodes are not
allowed to have children further. If an ephemeral znode is
deleted, then the next suitable node will fill its position.
Ephemeral znodes play an important role in Leader election.
● Sequential znode − Sequential znodes can be either persistent or
ephemeral. When a new znode is created as a sequential znode,
then ZooKeeper sets the path of the znode by attaching a 10
digit sequence number to the original name. For example, if a
znode with path /myapp is created as a sequential znode, ZooKeeper will
change the path to /myapp0000000001 and set the next sequence number as
0000000002. If two sequential znodes are created concurrently, then ZooKeeper never
uses the same number for each znode. Sequential znodes play an important role in
Locking and Synchronization.
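The three znode types can be illustrated with the ZooKeeper command-line client, zkCli (a hedged sketch; the paths and data values are made up for the example):

create /myapp "appdata"            creates a persistent znode
create -e /myapp/worker1 "alive"   creates an ephemeral znode, deleted when the session ends
create -s /myapp/task "job"        creates a sequential znode, e.g. /myapp/task0000000001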

Sessions
Sessions are very important for the operation of ZooKeeper. Requests in a session are
executed in FIFO order. Once a client connects to a server, the session will be established and
a session id is assigned to the client.

The client sends heartbeats at a particular time interval to keep the session valid. If the
ZooKeeper ensemble does not receive heartbeats from a client for more than the period
(session timeout) specified at the starting of the service, it decides that the client died.

Session timeouts are usually represented in milliseconds. When a session ends for any reason,
the ephemeral znodes created during that session also get deleted.

Watches

Watches are a simple mechanism by which a client gets notifications about changes in the
ZooKeeper ensemble. A client can set a watch while reading a particular znode. The watch sends
a notification to the registered client whenever the znode on which it was set changes.

Znode changes are modifications of the data associated with the znode or changes in the znode’s
children. Watches are triggered only once. If a client wants to be notified again, it must set
a new watch through another read operation. When a connection session expires, the client is
disconnected from the server and its associated watches are removed.
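
A small sketch of a one-time watch from zkCli.sh; the -w flag is the watch option in newer
ZooKeeper clients (older versions pass the watch as an extra argument), and /myapp is again a
made-up path:

    get -w /myapp            (reads the data and registers a one-time watch)
    set /myapp "new-value"   (run from any client; triggers a NodeDataChanged notification)
    get -w /myapp            (the watch must be re-registered to receive further notifications)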

SQOOP

Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as
relational databases (MS SQL Server, MySQL). To process data using Hadoop, the data first
needs to be loaded into Hadoop clusters from several sources.

However, it turned out that the process of loading data from several heterogeneous sources
was extremely challenging. The problems administrators encountered included:
● Maintaining data consistency
● Ensuring efficient utilization of resources
● Loading bulk data into Hadoop was not feasible
● Loading data using scripts was slow

The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of
the traditional approach and it could load bulk data from RDBMS to Hadoop with ease.

Having understood what Sqoop is and why it is needed, the next topic is the features of Sqoop.

Sqoop has several features, which makes it helpful in the Big Data world:

1.Parallel Import/Export

Sqoop uses the YARN framework to import and export data. This provides fault
tolerance on top of parallelism.

2.Import Results of an SQL Query

Sqoop enables us to import the results returned from an SQL query into HDFS.

3.Connectors For All Major RDBMS Databases

Sqoop provides connectors for multiple RDBMSs, such as the MySQL and Microsoft
SQL servers.

4.Kerberos Security Integration

Sqoop supports the Kerberos computer network authentication protocol, which allows nodes
communicating over an insecure network to prove their identity to one another in a secure
manner.

5.Provides Full and Incremental Load

Sqoop can load an entire table, or only the rows added since the last import (an incremental
load), with a single command, as sketched below.
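
A minimal sketch of the two load styles using standard sqoop import options; the connection
string, table, and column values are made-up examples:

    sqoop import --connect jdbc:mysql://dbhost/sales --username analyst -P \
        --table orders --target-dir /data/orders        # full load of the orders table

    sqoop import --connect jdbc:mysql://dbhost/sales --username analyst -P \
        --table orders --target-dir /data/orders \
        --incremental append --check-column order_id --last-value 1000
        # incremental load: only rows whose order_id is greater than 1000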

With the features covered, the next section looks at the Sqoop architecture.
Sqoop Architecture

Now, let’s dive deep into the architecture of Sqoop, step by step:

1. The client submits the import/ export command to import or export data.

2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse,
document-based systems, and a relational database. We have a connector for each of these;
connectors help to work with a range of accessible databases.

3. Multiple mappers perform map tasks to load the data onto HDFS.

4. Similarly, multiple map tasks export the data from HDFS back to the RDBMS when the Sqoop
export command is used.
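
The degree of parallelism of these map tasks can be controlled on the command line; -m (number
of mappers) and --split-by are standard sqoop import options, while the connection details
below are made up:

    sqoop import --connect jdbc:mysql://dbhost/sales --username analyst -P \
        --table orders --target-dir /data/orders \
        -m 8 --split-by order_id
        # 8 parallel map tasks, each importing one slice of the order_id range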

Sqoop Import

The Sqoop import mechanism works as follows.


In this example, a company’s data is present in an RDBMS. Sqoop first performs an
introspection of the database to gather the metadata it needs for the import (for example,
primary key information).

It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual
map tasks to push the splits to HDFS.

A few of the commonly used Sqoop import arguments are --connect, --username, --table,
--target-dir, --columns, --where, and -m (the number of mappers); an illustrative command is
sketched below.
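
An illustrative import command combining these arguments (the database, table, and column
names are made up):

    sqoop import --connect jdbc:mysql://dbhost/sales --username analyst -P \
        --table customers --columns "id,name,city" \
        --where "city = 'Chennai'" \
        --target-dir /data/customers -m 4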

Sqoop Export

Let us understand the Sqoop export mechanism stepwise:

1.The first step is to gather the metadata through introspection.

2.Sqoop then divides the input dataset into splits and uses individual map tasks to push the
splits to RDBMS.

A few of the commonly used Sqoop export arguments are --connect, --username, --table, and
--export-dir; an illustrative command is sketched below.
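
An illustrative export command (the database and paths are made up; --export-dir points to the
HDFS directory whose contents are written back to the RDBMS):

    sqoop export --connect jdbc:mysql://dbhost/sales --username analyst -P \
        --table order_summary --export-dir /data/order_summary \
        --input-fields-terminated-by ','
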
After understanding Sqoop import and export, the next section covers the processing that takes
place in Sqoop.

Sqoop Processing

Processing takes place step by step, as shown below:

1.Sqoop runs in the Hadoop cluster.

2.It imports data from the RDBMS or NoSQL database to HDFS.

3.It uses mappers to slice the incoming data into multiple splits and loads the data into
HDFS.

4.It exports data back into the RDBMS while ensuring that the schema of the data in the
database is preserved.

********************
