Bigdata Notes

UNIT 1

Types of digital data

DIGITAL DATA

Digital data is information stored on a computer system as a series of 0’s and 1’s
in a

binary language. Digital data jumps from one value to the next in a step by step
sequence.

Example: Whenever we send an email, read a social media post, or take pictures
with our digital camera, we are working with digital data.

Digital data can be classified into three forms:

a. Unstructured Data: The data which does not conform to a data model, or is not
in a form that can be used easily by a computer program, is categorized as
unstructured data. About 80-90% of an organization's data is in this format.

Example: Memos, chat rooms, PowerPoint presentations, images, videos, letters,
research papers, white papers, the body of an email, etc.

b. Semi-Structured Data: The data which does not conform to a data model but
has some structure is categorized as semi-structured data. However, it is not in a
form that can be used easily by a computer program.

Example : Emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.

c. Structured Data: The data which is in an organized form (i.e., in rows and
columns) and can be easily used by a computer program is categorized as
structured data. Relationships exist between entities of data, such as
classes and their objects.

Example: Data stored in databases.

HISTORY OF BIG DATA


The 21st century is characterized by the rapid advancement in the field of
information technology.

IT has become an integral part of daily life as well as of various industries like
health, education, entertainment, science and technology, genetics, and business
operations. These industries generate a lot of data, which is collectively referred
to as Big Data.

Big Data consists of large datasets that cannot be managed efficiently by the
common database management systems.

These datasets range from terabytes to exabytes.

Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and
social networking platforms create huge amounts of data that may reside
unutilized at unknown servers for many years.

And with the evolution of Big Data, this data can be accessed and analyzed on a
regular basis to generate useful information.

“Big Data” is a relative term depending on who is discussing it. For Example, Big
Data to Amazon or Google is very different from Big Data to a medium-sized
insurance organization.

Introduction to Big Data platform

A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single solution,
which is then used for managing as well as analyzing Big Data.

It focuses on providing its users with efficient analytics tools for massive
datasets.

The users of such platforms can custom-build applications according to their use
case, for example to calculate customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.

Example: Some of the most commonly used Big Data Platforms are :

● Hadoop Delta Lake Migration Platform


● Data Catalog Platform
● Data Ingestion Platform
● IoT Analytics Platform

Drivers for Big Data

Big Data has quickly risen to become one of the most desired topics in the
industry.

The main business drivers for such rising demand for Big Data Analytics are :

1. The digitization of society

2. The drop in technology costs

3. Connectivity through cloud computing

4. Increased knowledge about data science

5. Social media applications

6. The rise of Internet-of-Things(IoT)

Example: A number of companies that have Big Data at the core of their strategy,
such as Apple, Amazon, Facebook and Netflix, became very successful at the
beginning of the 21st century.

Big Data Architecture :

Big data architecture is designed to handle the ingestion, processing, and
analysis of data that is too large or complex for traditional database systems.

The big data architectures include the following components:

Data sources: All big data solutions start with one or more data sources.

Example,

o Application data stores, such as relational databases.

o Static files produced by applications, such as web server log files.

o Real-time data sources, such as IoT devices.

Data storage: Data for batch processing operations is stored in a distributed file
store that can hold high volumes of large files in various formats (also called data
lake).

Example,

Azure Data Lake Store or blob containers in Azure Storage.

Batch processing: Since the data sets are so large, a big data solution must
process data files using long-running batch jobs to filter, aggregate, and
prepare the data for analysis.
Real-time message ingestion: If a solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing.

Stream processing: After capturing real-time messages, the solution must
process them by filtering, aggregating, and preparing the data for analysis. The
processed stream data is then written to an output sink. We can use open-source
Apache streaming technologies like Storm and Spark Streaming for this.

Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. Example: Azure Synapse Analytics provides a managed service
for large-scale, cloud-based data warehousing.

Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modelling layer. Analysis and reporting
can also take the form of interactive data exploration by data scientists or data
analysts.

Orchestration: Most big data solutions consist of repeated data processing
operations that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results
straight to a report. To automate these workflows, we can use an orchestration
technology such as Azure Data Factory.

Big Data Characteristics :

Big data can be described by the following characteristics:

● Volume
● Variety
● Velocity
5 Vs of Big Data, Big Data technology components

5 Vs of Big Data :

1. Volume :

Big Data refers to the vast volumes of data generated from many sources daily, such as
business processes, machines, social media platforms, networks, human
interactions, and so on.

Example: Facebook generates approximately a billion messages, records 4.5 billion
clicks of the “Like” button, and receives more than 350 million new posts
each day.

Big data technologies can handle large amounts of data.

2. Variety :

Big Data can be structured, unstructured, or semi-structured, collected from
different sources.
In the past, data was only collected from databases and spreadsheets, but these days
data comes in an array of forms, i.e., PDFs, emails, audio, social media
posts, photos, videos, etc.

3. Velocity :

Velocity refers to the speed with which data is generated in real-time.

Among these characteristics, velocity plays an especially important role.

It covers the speed at which incoming data sets arrive, their rate of change, and
bursts of activity.

A primary aspect of Big Data is to provide data rapidly on demand.

Example of data that is generated with high velocity - Twitter messages or
Facebook posts.

4. Veracity :

Veracity refers to the quality and trustworthiness of the data that is being analyzed.

It also covers the ability to handle and manage such uncertain data efficiently.

Example: Facebook posts with hashtags.

5. Value :

Value is an essential characteristic of big data.

It is not the sheer amount of data that we process or store that matters, but the
valuable and reliable data that we store, process and analyze.

Big Data Technology Components :


1. Ingestion :

The ingestion layer is the very first step of pulling in raw data.

It comes from internal sources, relational databases, non-relational databases,
social media, emails, phone calls, etc.

There are two kinds of ingestion :

Batch, in which large groups of data are gathered and delivered together.

Streaming, which is a continuous flow of data. This is necessary for real-time
data analytics.

2. Storage :

Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.

The data lake/warehouse is the most essential component of a big data
ecosystem.

It needs to contain only thorough, relevant data to make insights as valuable as
possible.

It must be efficient with as little redundancy as possible to allow for quicker
processing.

3. Analysis :

In the analysis layer, data gets passed through several tools, shaping it into
actionable insights.

There are four types of analytics on big data :

● Diagnostic: Explains why a problem is happening.
● Descriptive: Describes the current state of a business through historical
data.
● Predictive: Projects future results based on historical data.
● Prescriptive: Takes predictive analytics a step further by recommending the
best future actions.

4. Consumption :

The final big data component is presenting the information in a format digestible
to the end-user.

This can be in the form of tables, advanced visualizations and even single
numbers if requested.

The most important thing in this layer is making sure the intent and meaning of
the output is understandable.

Big Data importance and applications

Big Data Importance :

The importance of Big Data doesn’t revolve around the amount of data a company has,
but lies in how the company utilizes the gathered data.
Every company uses its collected data in its own way. The more effectively a
company uses its data, the more rapidly it grows.

By analysing big data pools effectively, companies can obtain benefits such as :

Cost Savings :

o Some tools of Big Data like Hadoop can bring cost advantages to business
when large amounts of data are to be stored.

o These tools help in identifying more efficient ways of doing business.

Time Reductions :

o The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately.

o This helps us to make quick decisions based on the learnings.

Understand the market conditions :

o By analyzing big data we can get a better understanding of current market
conditions.

o For example: By analyzing customers’ purchasing behaviours, a company can
find out the products that are sold the most and produce products according to
this trend. By this, it can get ahead of its competitors.

Control online reputation :

o Big data tools can do sentiment analysis.

o Therefore, you can get feedback about who is saying what about your
company.

o If you want to monitor and improve the online presence of your business, then
big data tools can help in all this.

Using Big Data Analytics to Boost Customer Acquisition (purchase) and Retention :

o The customer is the most important asset any business depends on.

o No single business can claim success without first having to establish a solid
customer base.

o If a business is slow to learn what customers are looking for, then it is very
likely to deliver poor quality products.

o The use of big data allows businesses to observe various customer-related
patterns and trends.

Using Big Data Analytics to Solve Advertisers’ Problems and Offer Marketing
Insights :

o Big data analytics can help change all business operations, like the ability to
match customer expectations, change the company’s product line, and ensure
that the marketing campaigns are powerful.

Big Data Applications :

In today’s world big data have several applications, some of them are listed
below :

Tracking Customer Spending Habits and Shopping Behaviour :

In big retail stores, the management team keeps data on customers’ spending
habits, shopping behaviour, most liked products, and which products are being
searched/sold the most; based on that data, the production/collection rate of each
product gets fixed.

Recommendation :

By tracking customer spending habits and shopping behaviour, big retail stores
provide recommendations to the customers.

Smart Traffic System :


Data about the traffic conditions on different roads is collected through
cameras and GPS devices placed in vehicles.

All such data is analyzed, and jam-free or less congested, less time-taking routes
are recommended.

One more benefit is that fuel consumption can be reduced.

Secure Air Traffic System :

Sensors are present at various places in the aircraft.

These sensors capture data like the speed of the flight, moisture, temperature, and
other environmental conditions.

Based on such data analysis, the environmental parameters within the flight are set
up and varied.

By analyzing the flight’s machine-generated data, it can be estimated how long the
machine can operate flawlessly and when it should be replaced/repaired.

Auto Driving Car :

At various spots on the car, cameras and sensors are placed that gather data like
the size of the surrounding cars, obstacles, the distance from those, etc.

These data are analyzed, then various calculations are carried out.

These calculations help the car to take action automatically.

Virtual Personal Assistant Tool :

Big data analysis helps virtual personal assistant tools like Siri, Cortana and
Google Assistant to provide the answer to the various questions asked by users.

These tools track the location of the user, their local time, the season, other data
related to the question asked, etc.

Analyzing all such data provides an answer.

Example: Suppose a user asks “Do I need to take an umbrella?” The tool collects
data like the location of the user and the season and weather conditions at that
location, then analyzes these data to conclude whether there is a chance of rain,
and provides the answer.

IoT :

Manufacturing companies install IoT sensors into machines to collect operational
data.

By analyzing such data, it can be predicted how long a machine will work without
any problem and when it will require repair.

Thus, the cost of replacing the whole machine can be saved.

Education Sector :

Organizations conducting online educational courses utilize big data to search for
candidates interested in those courses.

If someone searches for a YouTube tutorial video on a subject, then an online or
offline course provider organization on that subject sends that person an ad online
about their course.

Media and Entertainment Sector :

Media and entertainment service providing companies like Netflix, Amazon Prime
and Spotify perform analysis on the data collected from their users.

Data like what type of videos or music users are watching or listening to the most,
how long users are spending on the site, etc., are collected and analyzed to set
the next business strategy.

Big Data features –security, compliance, auditing and protection

BIG DATA SECURITY :


Big data security is the collective term for all the measures and
tools used to guard both the data and analytics processes from
attacks, theft, or other malicious activities that could harm or
negatively affect them.
For companies that operate on the cloud, big data security
challenges are multi-faceted.
When customers give their personal information to companies, they trust
them with personal data which can be used against them if it falls into the
wrong hands.

BIG DATA COMPLIANCE :


Data compliance is the practice of ensuring that sensitive data is
organized and managed in such a way as to enable organizations to meet
enterprise business rules along with legal and governmental regulations.
Organizations that don’t implement these regulations can be fined up to
tens of millions of dollars and even receive a 20-year penalty.
BIG DATA AUDITING :

Auditors can use big data to expand the scope of their projects and draw
comparisons over larger populations of data.

Big data also helps financial auditors to streamline the reporting process and
detect fraud.

These professionals can identify business risks in time and conduct more
relevant and accurate audits.

BIG DATA PROTECTION :


Big data security is the collective term for all the measures and tools used to
guard both the data and analytics processes from attacks, theft, or other
malicious activities that could harm or negatively affect them.

That’s why data privacy is there to protect those customers but also
companies and their employees from security breaches.

When customers give their personal information to companies, they trust
them with personal data which can be used against them if it falls into the
wrong hands.

Data protection is also important, as organizations that don’t implement these
regulations can be fined up to tens of millions of dollars and even receive a
20-year penalty.

Big Data privacy and ethics

Most data is collected through surveys, interviews, or observation.

When customers give their personal information to companies, they trust
them with personal data which can be used against them if it falls into the
wrong hands.

That’s why data privacy exists: to protect not only those customers but also
companies and their employees from security breaches.

One of the main reasons why companies comply with data privacy
regulations is to avoid fines.
Organizations that don’t implement these regulations can be fined up to tens
of millions of dollars and even receive a 20-year penalty.

Reasons why we need to take data privacy seriously :

Data breaches could hurt your business.

Protecting your customers’ privacy

Maintaining and improving brand value

It gives you a competitive advantage

It supports the code of ethics

Big Data Analytics

Big data analytics is a complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer
preferences.

This can help organizations make informed business decisions.

Data Analytics technologies and techniques give organizations a way to analyze
data sets and gather new information.

Big Data Analytics enables enterprises to analyze their data in full context quickly
and some also offer real-time analysis.

Importance of Big Data Analytics :

Organizations use big data analytics systems and software to make data-driven
decisions that can improve business-related outcomes.

The benefits include more effective marketing, new revenue opportunities,
customer personalization and improved operational efficiency.

With an effective strategy, these benefits can provide competitive advantages over
rivals.

Big Data Analytics tools also help businesses save time and money and aid in
gaining insights to inform data-driven decisions.

Big Data Analytics enables enterprises to narrow their Big Data to the most
relevant information and analyze it to inform critical business decisions.

Challenges of conventional systems

● Big data is the storage and analysis of large data sets.
● These are complex data sets that can be both structured or unstructured.
● They are so large that it is not possible to work on them with traditional
analytical tools.
● One of the major challenges of conventional systems was the uncertainty
of the Data Management Landscape.
● Big data is continuously expanding, and new companies and
technologies are being developed every day.
● A big challenge for companies is to find out which technology works best
for them without the introduction of new risks and problems.
● These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to
bring more efficiency to their work environment.


Intelligent data analysis, nature of data

Intelligent Data Analysis (IDA) is one of the most important approaches in the
field of data mining.
Based on the basic principles of IDA and the features of datasets that IDA
handles, the development of IDA is briefly summarized from three aspects :

● Algorithm principle
● The scale
● Type of the dataset

Intelligent Data Analysis (IDA) is one of the major issues in artificial intelligence
and information.

Intelligent data analysis discloses hidden facts that were not previously known and
provides potentially important information or facts from large quantities of data.

It also helps in making decisions.

Based on machine learning, artificial intelligence, pattern recognition, and
records and visualization technology, IDA helps to obtain useful information,
necessary data and interesting models from the large amounts of data available
online in order to make the right choices.

IDA includes three stages:

(1) Preparation of data

(2) Data mining

(3) Data validation and Explanation

Analytic processes and tools

Big Data Analytics is the process of collecting large chunks of
structured/unstructured data, segregating and analyzing it, and discovering the
patterns and other useful business insights from it.

These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency in their work environment.
Many big data tools and processes are being utilised by companies these days in
the processes of discovering insights and supporting decision making.

Big data processing is a set of techniques or programming models to access
large-scale data to extract useful information for supporting and providing
decisions.

Below is the list of some of the data analytics tools used most in the industry :

● R Programming (Leading Analytics Tool in the industry)
● Python
● Excel
● SAS
● Apache Spark
● Splunk
● RapidMiner
● Tableau Public
● KNIME
Analysis vs reporting

Reporting :

● Once data is collected, it will be organized using tools such as graphs and
tables.
● The process of organizing this data is called reporting.
● Reporting translates raw data into information.
● Reporting helps companies to monitor their online business and be alerted
when data falls outside of expected ranges.
● Good reporting should raise questions about the business from its end
users.

Analysis :
● Analytics is the process of taking the organized data and analyzing it.
● This helps users to gain valuable insights on how businesses can improve
their performance.
● Analysis transforms data and information into insights.
● The goal of the analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.

Conclusion :

● Reporting shows us “what is happening”.
● The analysis focuses on explaining “why it is happening” and “what we
can do about it”.

Modern data analytic tools

● These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to
bring more efficiency to their work environment.
● Many big data tools and processes are being utilised by companies these
days in the processes of discovering insights and supporting decision
making.
● Data Analytics tools are types of application software that retrieve data
from one or more systems and combine it in a repository, such as a data
warehouse, to be reviewed and analysed.
● Most organizations use more than one analytics tool including
spreadsheets with statistical functions, statistical software packages, data
mining tools, and predictive modelling tools.
● Together, these Data Analytics Tools give the organization a complete
overview of the company to provide key insights and understanding of the
market/business so smarter decisions may be made.
● Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
● Below is a list of some data analytics tools :
● R Programming (Leading Analytics Tool in the industry)
● Python
● Excel
● SAS
● Apache Spark
● Splunk
● RapidMiner
● Tableau Public
● KNIME

UNIT 2

HADOOP

History of Hadoop :

Hadoop is an open-source software framework for storing and processing large
datasets ranging in size from gigabytes to petabytes.

Hadoop was developed at the Apache Software Foundation in 2005.

It is written in Java.

The traditional approach like RDBMS is not sufficient due to the heterogeneity of
the data.

So Hadoop comes as the solution to the problem of big data i.e. storing and
processing the big data with some extra capabilities.
Its co-founder Doug Cutting named it after his son’s toy elephant.

There are mainly two components of Hadoop which are :

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator(YARN).

In April 2006 Hadoop 0.1.0 was released.

Apache Hadoop :

Hadoop is an open-source software framework for storing and processing large
datasets ranging in size from gigabytes to petabytes.

Hadoop was developed at the Apache Software Foundation in 2005.

It is written in Java.

Hadoop is designed to scale up from a single server to thousands of machines,
each offering local computation and storage.

Applications built using HADOOP are run on large data sets distributed across
clusters of commodity computers.

Commodity computers are cheap and widely available, these are useful for
achieving greater computational power at a low cost.

In Hadoop, data resides in a distributed file system which is called the Hadoop
Distributed File System.

Hadoop Distributed File System :

In Hadoop, data resides in a distributed file system which is called the Hadoop
Distributed File System.

HDFS splits files into blocks and sends them across various nodes in form of
large clusters.
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware.

Commodity hardware is cheap and widely available, these are useful for
achieving greater computational power at a low cost.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It provides high throughput access to application data and is suitable for
applications having large datasets.

Hadoop framework includes the following two modules :

Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules.

Hadoop YARN: This is a framework for job scheduling and cluster resource
management.

Hadoop Ecosystem And Components:

There are three components of Hadoop :

Hadoop HDFS -

Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

HDFS splits files into blocks and sends them across various nodes in the form of
large clusters.

It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

It provides high throughput access to application data and is suitable for
applications having large datasets.

Hadoop MapReduce -

Hadoop MapReduce is the processing unit of Hadoop.

MapReduce is a computational model and software framework for writing
applications that are run on Hadoop.

These MapReduce programs are capable of processing enormous data in
parallel on large clusters of computation nodes.

Hadoop YARN –

Hadoop YARN is a resource management unit of Hadoop.

This is a framework for job scheduling and cluster resource management.

YARN helps to open up Hadoop by allowing data stored in HDFS to be processed
and run through batch processing, stream processing, interactive processing and
graph processing.

It helps to run different types of distributed applications other than MapReduce.


DATA FORMAT :

A data/file format defines how information is stored in HDFS.

Hadoop does not have a default file format; the choice of a format depends
on its use.

A big factor in the performance of applications that use HDFS is the
information search time and the writing time.

Managing the processing and storage of large volumes of information is very
complex, which is why a certain data format is required.

The choice of an appropriate file format can produce the following benefits:

● Optimum writing time
● Optimum reading time
● File divisibility
● Adaptive scheme and compression support

Some of the most commonly used formats of the Hadoop ecosystem are :

● Text/CSV: A plain text file or CSV is the most common format both outside and
within the Hadoop ecosystem.

● SequenceFile: The SequenceFile format stores the data in binary format, this
format accepts compression but does not store metadata.

● Avro: Avro is a row-based storage format. This format includes the definition of
the scheme of your data in JSON format. Avro allows block compression along
with its divisibility, making it a good choice for most cases when using Hadoop.

● Parquet: Parquet is a column-based binary storage format that can store
nested data structures. This format is very efficient in terms of disk input/output
operations when the necessary columns to be used are specified.

● RCFile (Record Columnar File): RCFile is a columnar format that divides data
into groups of rows, and inside it, data is stored in columns.
● ORC (Optimized Row Columnar): ORC is considered an evolution of the
RCFile format and has all its benefits alongside some improvements such as
better compression, allowing faster queries.
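As a rough sketch of how a format choice is typically expressed, the following job driver (written against the newer org.apache.hadoop.mapreduce API; the input and output paths are made-up examples) reads plain text input and writes its output as a SequenceFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format demo");
        job.setJarByClass(FormatConfigSketch.class);

        // Read plain text/CSV input; each line becomes one record.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));     // hypothetical input path

        // Write binary key-value pairs as a SequenceFile.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));  // hypothetical output path

        // A real job would also set the mapper, reducer and output key/value classes here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}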

Analysing Data with Hadoop :

While the MapReduce programming model is at the heart of Hadoop, it is
low-level and as such becomes an unproductive way for developers to write
complex analysis jobs.

To increase developer productivity, several higher-level languages and APIs have
been created that abstract away the low-level details of the MapReduce
programming model.

There are several choices available for writing data analysis jobs.

The Hive and Pig projects are popular choices that provide SQL-like and
procedural data flow-like languages, respectively.

HBase is also a popular way to store and analyze data in HDFS. It is a
column-oriented database, and unlike MapReduce, provides random read and
write access to data with low latency.

MapReduce jobs can read and write data in HBase’s table format, but data
processing is often done via HBase’s own client API.

Scaling In Vs Scaling Out :

Once a decision has been made for data scaling, the specific scaling approach
must be chosen.

There are two commonly used types of data scaling :

1. Up
2. Out
Scaling up, or vertical scaling :

It involves obtaining a faster server with more powerful processors and more
memory.

This solution uses less network hardware and consumes less power; but
ultimately, for many platforms, it may only provide a short-term fix, especially if
continued growth is expected.

Scaling out, or horizontal scaling :

It involves adding servers for parallel computing.

The scale-out technique is a long-term solution, as more and more servers may
be added when needed.

But going from one monolithic system to this type of cluster may be a difficult,
although extremely effective, solution.

Hadoop Streaming :

It is a feature that comes with a Hadoop distribution that allows developers or
programmers to write the Map-Reduce program using different programming
languages like Ruby, Perl, Python, C++, etc.

We can use any language that can read from standard input (STDIN), like
keyboard input, and write using standard output (STDOUT).

Although the Hadoop framework is completely written in Java, programs for Hadoop
do not necessarily need to be coded in the Java programming language.
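As an illustrative sketch, a streaming job might be launched like this from the command line (the jar location and the mapper/reducer script names are assumptions that vary by installation):

hadoop jar /path/to/hadoop-streaming.jar \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

Here mapper.py and reducer.py are ordinary scripts that read records from STDIN and write key-value pairs to STDOUT.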

The flow of a streaming job is as follows :

We have an Input Reader which is responsible for reading the input data and
producing the list of key-value pairs. We can read data in .csv format, in delimited
format, from a database table, image data (.jpg, .png), audio data, etc.

This list of key-value pairs is fed to the Map phase, and the Mapper will work on each
of these key-value pairs and generate some intermediate key-value pairs.

After shuffling and sorting, the intermediate key-value pairs are fed to the
Reducer; the final output produced by the Reducer is then written to HDFS.
This is how a simple Map-Reduce job works.

Hadoop Pipes :

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.

Unlike Streaming, which uses standard input and output to communicate with the
map and reduce code, Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the C++ map or reduce
function.
Map Reduce Framework and Basics :

MapReduce is a software framework for processing data sets in a distributed
fashion over several machines.

Prior to Hadoop 2.0, MapReduce was the only way to process data in Hadoop.

A MapReduce job usually splits the input data set into independent chunks,
which are processed by the map tasks in a completely parallel manner.

The core idea behind MapReduce is mapping your data set into a collection of
<key, value> pairs, and then reducing over all pairs with the same key.

The framework sorts the outputs of the maps, which are then input to the
reduce tasks.

Both the input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitors them, and re-executes
the failed tasks.

The overall concept is simple :

1. Almost all data can be mapped into <key, value> pairs somehow, and

2. Your keys and values may be of any type: strings, integers, dummy types and,
of course, pairs themselves.
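To make this concrete, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred interfaces (the same style of Mapper and Reducer interface shown later in this unit); it is an illustration of the model, not a complete job with a driver:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: (file offset, line of text) -> list of (word, 1)
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);      // emit (word, 1) for every word seen
        }
    }
}

// Reduce: (word, list of counts) -> (word, total count)
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();     // add up the 1s emitted for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}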

How Map Reduce Works :

Map Reduce contains two core components :

1. Mapper component
2. Reducer component

Uses master-slave architecture.

Storing data in HDFS is low cost, fault-tolerant, and easily scalable.

MapReduce integrates with HDFS to provide the exact same benefits for parallel
data processing.

Sends computations where the data is stored on local disks.

Programming model or framework for distributed computing.

It hides complex “housekeeping” tasks from you as a developer.

Developing A Map-Reduce Application :

Writing a program in MapReduce follows a certain pattern.

You start by writing your map and reduce functions, ideally with unit tests to make
sure they do what you expect.

Then you write a driver program to run a job, which can run from your IDE using
a small subset of the data to check that it is working.

If it fails, you can use your IDE’s debugger to find the source of the problem.

When the program runs as expected against the small dataset, you are ready to
unleash it on a cluster.

Running against the full dataset is likely to expose some more issues, which you
can fix by expanding your tests and altering your mapper or reducer to handle
the new cases.

After the program is working, you may wish to do some tuning :

● First, by running through some standard checks for making MapReduce
programs faster.
● Second, by doing task profiling.

Profiling distributed programs is not easy, but Hadoop has hooks to aid in the
process.

Before we start writing a MapReduce program, we need to set up and configure
the development environment.

Components in Hadoop are configured using Hadoop’s own configuration API.

An instance of the Configuration class represents a collection of configuration
properties and their values.
Each property is named by a String, and the type of a value may be one of
several, including Java primitives such as boolean, int, long, and float and other
useful types such as String, Class, and java.io.File; and collections of Strings.
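A small sketch of how the Configuration API is typically used (the resource file name and the property names are made up for illustration):

import org.apache.hadoop.conf.Configuration;

public class ConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load additional properties from an XML resource on the classpath (hypothetical file).
        conf.addResource("configuration-1.xml");

        // Read typed values, supplying a default if the property is absent.
        String color = conf.get("color", "none");          // String property
        int size = conf.getInt("size", 0);                  // int property
        boolean debug = conf.getBoolean("debug", false);    // boolean property

        System.out.println(color + " " + size + " " + debug);
    }
}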

Unit Tests With MR Unit :

Hadoop MapReduce jobs have a unique code architecture that follows a specific
template with specific constructs.

This architecture raises interesting issues when doing test-driven development
(TDD) and writing unit tests.

With MRUnit, you can craft test input, push it through your mapper and/or
reducer, and verify its output all in a JUnit test.

As do other JUnit tests, this allows you to debug your code using the JUnit test
as a driver.

A map/reduce pair can be tested using MRUnit’s MapReduceDriver; a
combiner can be tested using MapReduceDriver as well.

A PipelineMapReduceDriver allows you to test a workflow of map/reduce jobs.

Currently, partitioners do not have a test driver under MRUnit.

MRUnit allows you to do TDD (Test Driven Development) and write lightweight
unit tests which accommodate Hadoop’s specific architecture and constructs.

Example: We’re processing road surface data used to create maps. The input
contains both linear surfaces and intersections. The mapper takes a collection of
these mixed surfaces as input, discards anything that isn’t a linear road surface,
i.e., intersections, and then processes each road surface and writes it out to
HDFS. We can keep count and eventually print out how many non-road
surfaces are input. For debugging purposes, we can additionally print out how
many road surfaces were processed.
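As a sketch of what such a test looks like, the following JUnit test pushes one record through the word-count mapper sketched earlier in this unit using MRUnit's MapDriver (the old-API driver from org.apache.hadoop.mrunit is assumed here; package and class names may differ slightly between MRUnit versions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
    @Test
    public void mapperEmitsOneCountPerWord() throws Exception {
        // Feed one input record through the mapper and assert on the emitted pairs.
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }
}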
Anatomy of a Map-Reduce Job Run :

The Hadoop framework comprises two main components :

● Hadoop Distributed File System (HDFS) for Data Storage
● MapReduce for Data Processing.

A typical Hadoop MapReduce job is divided into a set of Map and Reduce tasks
that execute on a Hadoop cluster.

The execution flow occurs as follows:

● Input data is split into small subsets of data.
● Map tasks work on these data splits.
● The intermediate input data from Map tasks are then submitted to Reduce
task after an intermediate process called ‘shuffle’.
● The Reduce task(s) works on this intermediate data to generate the result
of a MapReduce Job.

Job Scheduling :

Early versions of Hadoop had a very simple approach to scheduling users’ jobs:
they ran in order of submission, using a FIFO scheduler.

Typically, each job would use the whole cluster, so jobs had to wait their turn.

Although a shared cluster offers great potential for offering large resources to
many users, the problem of sharing resources fairly between users requires a
better scheduler.

Production jobs need to complete in a timely manner while allowing users who
are making smaller ad hoc queries to get results back in a reasonable time.

The ability to set a job’s priority was added, via the mapred.job.priority property
or the setJobPriority() method on JobClient.
When the job scheduler is choosing the next job to run, it selects the one with the
highest priority.

However, with the FIFO scheduler, priorities do not support preemption, so a
high-priority job can still be blocked by a long-running low-priority job that started
before the high-priority job was scheduled.

MapReduce in Hadoop comes with a choice of schedulers.

The default is the original FIFO queue-based scheduler, and there are also
multiuser schedulers called :

● The Fair Scheduler
● The Capacity Scheduler.

The Fair Scheduler :

The Fair Scheduler aims to give every user a fair share of the cluster capacity
over time.

If a single job is running, it gets all of the cluster.

As more jobs are submitted, free task slots are given to the jobs in such a way as
to give each user a fair share of the cluster.

A short job belonging to one user will complete in a reasonable time even while
another user’s long job is running, and the long job will still make progress.

Jobs are placed in pools, and by default, each user gets their own pool.

It is also possible to define custom pools with guaranteed minimum capacities
defined in terms of the number of map and reduce slots, and to set weightings
for each pool.

The Fair Scheduler supports preemption, so if a pool has not received its fair
share for a certain period of time, then the scheduler will kill tasks in pools
running over capacity in order to give the slots to the pool running under capacity.
The Capacity Scheduler :

The Capacity Scheduler takes a slightly different approach to multiuser
scheduling.

A cluster is made up of a number of queues (like the Fair Scheduler’s pools),
which may be hierarchical (so a queue may be the child of another queue), and
each queue has an allocated capacity.

This is like the Fair Scheduler, except that within each queue, jobs are scheduled
using FIFO scheduling (with priorities).

The Capacity Scheduler allows users or organizations to simulate a separate
MapReduce cluster with FIFO scheduling for each user or organization.

The Fair Scheduler, by contrast, enforces fair sharing within each pool, so
running jobs share the pool’s resources.

Task Execution :

After the tasktracker has been assigned a task, the next step is for it to run the task.

First, it localizes the job JAR by copying it from the shared filesystem to the
tasktracker’s filesystem.

It also copies any files needed from the distributed cache by the application to
the local disk.

Second, it creates a local working directory for the task and un-jars the contents
of the JAR into this directory.

Third, it creates an instance of TaskRunner to run the task.

TaskRunner launches a new Java Virtual Machine to run each task so that any
bugs in the user-defined map and reduce functions don’t affect the task tracker
(by causing it to crash or hang, for example).

It is, however, possible to reuse the JVM between tasks.

The child process communicates with its parent through the umbilical interface.

This way it informs the parent of the task’s progress every few seconds until the
task is complete.

Map Reduce Types :

Hadoop uses the MapReduce programming model for data processing, in which the
inputs and outputs of the map and reduce functions are represented as key-value
pairs.

They are subject to the parallel execution of datasets situated in a wide array of
machines in a distributed architecture.

The programming paradigm is essentially functional in nature, combining the
techniques of map and reduce.

Map Reduce Types :

Mapping is the core technique of processing a list of data elements that come in
pairs of keys and values.

The map function applies to individual elements defined as key-value pairs of a
list and produces a new list.

The general idea of the map and reduce functions of Hadoop can be illustrated
as follows:

map: (K1, V1) → list (K2, V2)

reduce: (K2, list(V2)) → list (K3, V3)

The input parameters of the key and value pair, represented by K1 and V1
respectively, are different from the output pair type: K2 and V2.

The reduce function accepts the same format output by the map, but the output
types of the reduce operation are again different: K3 and V3.

The Java API for this is as follows:


public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
      Reporter reporter) throws IOException;
}

The OutputCollector is the generalized interface of the Map-Reduce framework
to facilitate the collection of data output either by the Mapper or the Reducer.

These outputs are nothing but the intermediate output of the job.

Therefore, they must be parameterized with their types.

The Reporter facilitates the Map-Reduce application to report progress and
update counters and status information.

If the combine function is used, it has the same form as the reduce function and
the output is fed to the reduce function.

This may be illustrated as follows:

map: (K1, V1) → list (K2, V2)

combine: (K2, list(V2)) → list (K2, V2)

reduce: (K2, list(V2)) → list (K3, V3)

Note that the combine and reduce functions use the same types, except in the
variable names, where K3 is K2 and V3 is V2.

The partition function operates on the intermediate key-value types.

It controls the partitioning of the keys of the intermediate map outputs.

The key derives the partition using a typical hash function.

The total number of partitions is the same as the number of reduce tasks for the
job.
The partition is determined only by the key ignoring the value.

public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numberOfPartition);
}
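For illustration, a hash-based partitioner implementing this interface might look like the following sketch, which mirrors the behaviour of Hadoop's default hash partitioning:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashKeyPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public void configure(JobConf job) {
        // No configuration is needed for this simple example.
    }

    public int getPartition(K2 key, V2 value, int numberOfPartition) {
        // Mask off the sign bit so the result is non-negative, then map the
        // key's hash onto one of the reduce partitions; the value is ignored.
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfPartition;
    }
}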

Input Format :

Hadoop has to accept and process a variety of formats, from text files to
databases.

A chunk of input, called input split, is processed by a single map.

Each split is further divided into logical records given to the map to process in
key-value pair.

In the context of a database, the split means reading a range of tuples from an
SQL table, as done by the DBInputFormat and producing LongWritables
containing record numbers as keys and DBWritables as values.

The Java API for input splits is as follows:

public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}

The InputSplit represents the data to be processed by a Mapper.

It returns the length in bytes and has a reference to the input data.

It is the responsibility of the InputFormat to create the input splits and divide them
into records.

public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException;
}
The JobClient invokes the getSplits() method with an appropriate number of split
arguments.

Once the split is calculated it is sent to the jobtracker.

The jobtracker schedules map tasks for the tasktracker using storage location.

The tasktracker then passes the split to the getRecordReader() method
on the InputFormat to get a RecordReader for the split.

The FileInputFormat is the base class for the file data source.

It has the responsibility to identify the files that are to be included as the job input
and the definition for generating the split.

Hadoop also includes the processing of unstructured data that often comes in
textual format; the TextInputFormat is the default InputFormat for such data.

The SequenceFileInputFormat takes up binary inputs and stores sequences of
binary key-value pairs.

DBInputFormat provides the capability to read data from a relational database
using JDBC.

Output Format :

The output format classes are similar to their corresponding input format classes
and work in the reverse direction.

For example :

The TextOutputFormat is the default output format that writes
records as plain text files, where keys and values may be of any type,
transforming them into strings by invoking the toString() method.
The key and value are separated by a tab character, although
this can be customized by manipulating the separator property of the
text output format.

For binary output, there is SequenceFileOutputFormat to write a
sequence of binary output to a file. Binary outputs are particularly
useful if the output becomes an input to a further MapReduce job.

The output formats for relational databases and for HBase are
handled by DBOutputFormat. It sends the reduce output to a SQL
table. Likewise, HBase’s TableOutputFormat enables the MapReduce
program to work on the data stored in an HBase table and uses it
for writing outputs to the HBase table.
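For example, the separator can be changed through the job configuration, roughly as sketched below (the exact property name varies between Hadoop versions; the older name was mapred.textoutputformat.separator):

import org.apache.hadoop.conf.Configuration;

public class SeparatorSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Use a comma instead of the default tab between each key and value.
        conf.set("mapreduce.output.textoutputformat.separator", ",");
    }
}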

Map Reduce Features :

Features of MapReduce are as follows :

Scalability: Apache Hadoop is a highly scalable framework. This is because of its
ability to store and distribute huge data across plenty of servers.

Flexibility: MapReduce programming enables companies to access new sources
of data. It enables companies to operate on different types of data.

Security and Authentication: The MapReduce programming model uses the HBase
and HDFS security platforms, which allow only authenticated users to
operate on the data.

Cost-effective solution: Hadoop’s scalable architecture with the MapReduce
programming framework allows the storage and processing of large data sets in
a very affordable manner.

Fast: Even if we are dealing with large volumes of unstructured data, Hadoop
MapReduce just takes minutes to process terabytes of data. It can process
petabytes of data in just an hour.

A simple model of programming: One of the most important features is that it is
based on a simple programming model.
Parallel Programming: It divides the tasks in a manner that allows their
execution in parallel. Parallel processing allows multiple processors to execute
these divided tasks.

Availability: If any particular node suffers from a failure, then there are always
other copies present on other nodes that can still be accessed whenever needed.

Resilient nature: One of the major features offered by Apache Hadoop is its fault
tolerance. The Hadoop MapReduce framework has the ability to quickly
recognize faults that occur.

UNIT 3

Design of HDFS :

● HDFS is a filesystem designed for storing very large files with streaming
data access patterns, running on clusters of commodity hardware.
● There are Hadoop clusters running today that store petabytes of data.
● HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
● A dataset is typically generated or copied from a source, then various
analyses are performed on that dataset over time.

● It’s designed to run on clusters of commodity hardware (commonly
available hardware from multiple vendors) for which the chance
of node failure across the cluster is high, at least for large clusters.
● HDFS is designed to carry on working without a noticeable interruption to
the user in the face of such failure.
● Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on
the namenode.
● Files in HDFS may be written by a single writer.
● Writes are always made at the end of the file.
● There is no support for multiple writers, or for modifications at arbitrary
offsets in the file.

HDFS Concepts :

● Blocks :
● HDFS has the concept of a block, but it is a much larger unit—64 MB by
default.
● Files in HDFS are broken into block-sized chunks, which are stored as
independent units.
● Having a block abstraction for a distributed filesystem brings several
benefits. :
1. A file can be larger than any single disk in the network. Nothing requires
the blocks from a file to be stored on the same disk, so they can take
advantage of any of the disks in the cluster.
2. Making the unit of abstraction a block rather than a file simplifies the
storage subsystem. It simplifies storage management (since blocks are
a fixed size, it is easy to calculate how many can be stored on a given disk)
and eliminates metadata concerns.
3. Blocks fit well with replication for providing fault tolerance and availability.
To insure against corrupted blocks and disk and machine failure, each
block is replicated to a small number of physically separate machines.
● HDFS blocks are large compared to disk blocks, and the reason is to
minimize the cost of seeks.
● Namenodes and Datanodes :
● An HDFS cluster has two types of nodes operating in a master-worker
pattern:
1. A Namenode (the master) and
2. A number of datanodes (workers).
● The namenode manages the filesystem namespace.
● It maintains the filesystem tree and the metadata for all the files and
directories in the tree.
● This information is stored persistently on the local disk in the form of two
files:
● The namespace image
● The edit log.
● The namenode also knows the datanodes on which all the blocks for a
given file are located.

Benefits and Challenges :

Benefits of HDFS:

● HDFS can store a large amount of information.
● HDFS has a simple and robust coherency model.
● HDFS is scalable and has fast access to required information.
● HDFS can also serve a substantial number of clients by adding more machines
to the cluster.
● HDFS provides streaming read access.
● HDFS can be used to read data stored multiple times but the data will be
written to the HDFS once.
● The recovery techniques will be applied very quickly.
● Portability across heterogeneous commodity hardware and operating
systems.
● High Economy by distributing data and processing across clusters of
commodity personal computers.
● High Efficiency by distributing data, logic on parallel nodes to process it
from where data is located.
● High Reliability by automatically maintaining multiple copies of data and
automatically redeploying processing logic in the event of failures.

Challenges for HDFS :

● HDFS does not give any reliability if that machine goes down.
● An enormous number of clients must be handled if all the clients need the
data stored on a single machine.
● Clients need to copy the data to their local machines before they can
operate it.
● Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
● Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on
the namenode.
● Files in HDFS may be written by a single writer. Writes are always made
at the end of the file.
● There is no support for multiple writers, or for modifications at arbitrary
offsets in the file.

File Size :

1.

● You can use the Hadoop fs -ls command to list files in the current directory
as well as their details.
● The 5th column in the command output contains file size in bytes.
● For example, the command hadoop fs -ls input gives the following output :

Found 1 item
-rw-r--r-- 1 hduser supergroup 45956 2020-07-8 20:57 /user/hduser/input/sou

The size of the file sou is 45956 bytes.

2.

● You can also find file size using hadoop fs -dus <path>.
● For example, if a directory on HDFS named "/user/frylock/input" contains
100 files and you need the total size for all of those files you could run:

hadoop fs -dus /user/frylock/input

● And you would get back the total size (in bytes) of all of the files in the
"/user/frylock/input" directory.

3.

● You can also use the following function to find the file size :
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetflStatus
{
    public long getflSize(String args) throws IOException, FileNotFoundException
    {
        Configuration config = new Configuration();
        Path path = new Path(args);
        FileSystem hdfs = path.getFileSystem(config);
        ContentSummary cSummary = hdfs.getContentSummary(path);
        long length = cSummary.getLength();
        return length;
    }
}

HDFS Block abstraction :

● HDFS block size is usually 64MB-128MB and, unlike other filesystems, a
file smaller than the block size does not occupy the complete block size’s
worth of storage.
● The block size is kept so large so that less time is spent doing disk seeks
as compared to the time spent transferring the data.
● Why do we need block abstraction :
● Files can be bigger than individual disks.
● Filesystem metadata does not need to be associated with each and every
block.
● Simplifies storage management - Easy to figure out the number of blocks
which can be stored on each disk.
● Fault tolerance and storage replication can be easily done on a per-block
basis.
Data Replication :

● Replication ensures the availability of the data.
● Replication means making a copy of something, and the number of times
you make a copy of that particular thing can be expressed as its Replication
Factor.

● As HDFS stores the data in the form of various blocks at the same time
Hadoop is also configured to make a copy of those file blocks.
● By default, the Replication Factor for Hadoop is set to 3 which can be
configured.
● We need this replication for our file blocks because for running Hadoop we
are using commodity hardware (inexpensive system hardware) which can
be crashed at any time.
● We are not using a supercomputer for our Hadoop setup.
● That is why we need such a feature in HDFS that can make copies of that
file blocks for backup purposes, this is known as fault tolerance.
● For the big brand organization, the data is very much important than the
storage, so nobody cares about this extra storage.
● You can configure the Replication factor in your hdfs-site.xml file.
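
As a small, hedged illustration (not part of the notes above), the replication factor of an individual file can also be inspected and changed through the same FileSystem API; the cluster-wide default is the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ReplicationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    val path = new Path("/user/hduser/input/sou")   // example file path, adjust as needed

    // Replication factor currently recorded for this file
    println(s"Replication: ${fs.getFileStatus(path).getReplication}")

    // Raise the replication factor of this one file to 4;
    // the cluster-wide default comes from dfs.replication in hdfs-site.xml (3 by default).
    fs.setReplication(path, 4.toShort)

    fs.close()
  }
}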
How Does Hadoop Store, Read, and Write Files :

1. Read Files :

Step 1: The client opens the file he/she wishes to read by calling open() on
the File System Object.

Step 2: Distributed File System(DFS) calls the name node, to determine


the locations of the first few blocks in the file. For each block, the name
node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read
data from.
Step 3: The client then calls read() on the stream. DFSInputStream, which
has stored the data node addresses for the first few blocks in the
file, then connects to the first (closest) data node for the first block
in the file.

Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close
the connection to the data node, then finds the best data node for the next
block.

Step 6: When the client has finished reading the file, a function is called,
close() on the FSDataInputStream.

2. Write Files :
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in
the file system’s namespace, with no blocks associated with it. The name
node performs various checks to make sure the file doesn’t already exist
and that the client has the right permissions to create the file. If these
checks pass, the name node prepares a record of the new file; otherwise,
the file can’t be created. The DFS returns an FSDataOutputStream for the
client to start out writing data to the file.

Step 3: As the client writes data, the DFSOutputStream splits it into
packets, which it writes to an internal queue called the data queue. The data
queue is consumed by the DataStreamer, which is responsible for asking the
name node to allocate new blocks by picking a list of suitable data
nodes to store the replicas. The list of data nodes forms a pipeline. The
DataStreamer streams the packets to the first data node in the
pipeline, which stores each packet and forwards it to the second data node
in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream sustains an internal queue of packets that


are waiting to be acknowledged by data nodes, called an “ack queue”.

Step 6: When the client has finished writing data, it calls close() on the
stream. This flushes all the remaining packets to the data node pipeline and
waits for acknowledgements before contacting the name node to signal that
the file is complete.

3. Store Files :
● HDFS divides files into blocks and stores each block on a DataNode.
● Multiple DataNodes are linked to the master node in the cluster, the
NameNode.
● The master node distributes replicas of these data blocks across the
cluster.
● It also instructs the user where to locate wanted information.
● Before the NameNode can help you store and manage the data, it first
needs to partition the file into smaller, manageable data blocks.
● This process is called data block splitting.
Java Interfaces to HDFS :

● Java code for writing file in HDFS :


// Requires the org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.* and java.io.* imports.
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);

// Check if the file already exists

Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}

// Create a new file and write data to it.
// "source" holds the path of the local file to copy; it is assumed to be supplied by the caller.

FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}

// Close all the file descriptors

in.close();
out.close();
fileSystem.close();

● Java code for reading file in HDFS :


// Requires the org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.* and java.io.* imports.
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);

Path path = new Path("/path/to/file.ext");

if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    // code to manipulate the data which is read
    System.out.write(b, 0, numBytes);
}
in.close();
fileSystem.close();

Command Line Interface :

● The HDFS can be manipulated through a Java API or through a


command-line interface.
● The File System (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as well as
other file systems that Hadoop supports.
● Below are the commands supported :
● appendToFile: Append the content of the text file in the HDFS.
● cat: Copies source paths to stdout.
● checksum: Returns the checksum information of a file.
● chgrp : Change group association of files. The user must be the owner of
files, or else a super-user.
● chmod : Change the permissions of files. The user must be the owner of
the file, or else a super-user.
● chown: Change the owner of files. The user must be a super-user.
● copyFromLocal: This command copies all the files inside the test folder in
the edge node to the test folder in the HDFS.
● copyToLocal : This command copies all the files inside the test folder in the
HDFS to the test folder in the edge node.
● count: Count the number of directories, files and bytes under the paths that
match the specified file pattern.
● cp: Copy files from source to destination. This command allows multiple
sources as well in which case the destination must be a directory.
● createSnapshot: HDFS Snapshots are read-only point-in-time copies of the
file system. Snapshots can be taken on a subtree of the file system or the
entire file system. Some common use cases of snapshots are data backup,
protection against user errors and disaster recovery.
● deleteSnapshot: Delete a snapshot from a snapshot table directory. This
operation requires the owner privilege of the snapshottable directory.
● df: Displays free space
● du: Displays sizes of files and directories contained in the given directory,
or the length of a file in case it's just a file.
● expunge: Empty the Trash.
● find: Finds all files that match the specified expression and applies
selected actions to them. If no path is specified then defaults to the current
working directory. If no expression is specified then defaults to -print.
● get: Copy files to the local file system.
● getfacl: Displays the Access Control Lists (ACLs) of files and directories. If
a directory has a default ACL, then getfacl also displays the default ACL.
● getfattr: Displays the extended attribute names and values for a file or
directory.
● getmerge : Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
● help: Return usage output.
● ls: list files
● lsr: Recursive version of ls.
● mkdir: Takes path URI’s as argument and creates directories.
● moveFromLocal: Similar to put command, except that the source localsrc
is deleted after it’s copied.
● moveToLocal: Displays a “Not implemented yet” message.
● mv: Moves files from source to destination. This command allows multiple
sources as well in which case the destination needs to be a directory.
● put : Copy single src, or multiple srcs from local file system to the
destination file system. Also reads input from stdin and writes to
destination file system.
● renameSnapshot : Rename a snapshot. This operation requires the owner
privilege of the snapshottable directory.
● rm : Delete files specified as args.
● rmdir : Delete a directory.
● rmr : Recursive version of delete.
● setfacl : Sets Access Control Lists (ACLs) of files and directories.
● setfattr : Sets an extended attribute name and value for a file or directory.
● setrep: Changes the replication factor of a file. If the path is a directory
then the command recursively changes the replication factor of all files
under the directory tree rooted at the path.
● stat : Print statistics about the file/directory at <path> in the specified
format.
● tail: Displays the last kilobyte of the file to stdout.
● test : Hadoop fs -test -[defsz] URI.
● text: Takes a source file and outputs the file in text format. The allowed
formats are zip and TextRecordInputStream.
● touchz: Create a file of zero length.
● truncate: Truncate all files that match the specified file pattern to the
specified length.
● usage: Return the help for an individual command.

HDFS Interfaces :

Features of HDFS interfaces are :

1. Create new file


2. Upload files/folder
3. Set Permission
4. Copy
5. Move
6. Rename
7. Delete
8. Drag and Drop
9. HDFS File viewer

Data Flow :

● MapReduce is used to compute a huge amount of data.


● To handle the upcoming data in a parallel and distributed form, the data
has to flow from various phases :

● Input Reader :
● The input reader reads the upcoming data and splits it into the data blocks
of the appropriate size (64 MB to 128 MB).
● Once input reads the data, it generates the corresponding key-value pairs.
● The input files reside in HDFS.
● Map Function :
● The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs.
● The map input and output types may be different from each other.
● Partition Function :
● The partition function assigns the output of each Map function to the
appropriate reducer.
● It is given the key and value produced by the map function.
● It returns the index of the reducer.
● Shuffling and Sorting :
● The data are shuffled between nodes so that they move out of the map phase
and are ready to be processed by the reduce function.
● A sorting operation is performed on the input data for the Reduce function.
● Reduce Function :
● The Reduce function is assigned to each unique key.
● These keys are already arranged in sorted order.
● The Reduce function iterates over the values associated with each key and
generates the corresponding output.
● Output Writer :
● Once the data has flowed through all the above phases, the Output Writer
executes.
● The role of the Output Writer is to write the Reduce output to stable
storage. A conceptual sketch of this flow is shown below.
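
The sketch below mimics this flow on a plain Scala collection purely as a conceptual illustration of map → partition → shuffle/sort → reduce → output for a word count; it does not use the Hadoop MapReduce API, and the sample lines and reducer count are made up.

object DataFlowSketch {
  def main(args: Array[String]): Unit = {
    // Input reader: pretend each line is one input record
    val lines = Seq("big data is big", "hadoop processes big data")

    // Map function: emit (word, 1) pairs
    val mapped: Seq[(String, Int)] = lines.flatMap(_.split(" ").map(word => (word, 1)))

    // Partition function: decide which "reducer" receives each key (2 reducers here)
    val numReducers = 2
    val partitioned = mapped.groupBy { case (word, _) => math.abs(word.hashCode) % numReducers }

    // Shuffle and sort: within each partition, group the values by key and sort the keys.
    // Reduce function: sum the counts for each key.
    val reduced = partitioned.map { case (part, pairs) =>
      part -> pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
    }

    // Output writer: print each reducer's output
    reduced.toSeq.sortBy(_._1).foreach { case (part, output) =>
      println(s"reducer $part -> ${output.mkString(", ")}")
    }
  }
}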

Data Ingestion :

● Hadoop Data ingestion is the beginning of your data pipeline in a data


lake.
● It means taking data from various silo databases and files and putting it
into Hadoop.
● For many companies, it does turn out to be an intricate task.
● That is why they take more than a year to ingest all their data into the
Hadoop data lake.
● The reason is, as Hadoop is open-source; there are a variety of ways you
can ingest data into Hadoop.
● It gives every developer the choice of using her/his favourite tool or
language to ingest data into Hadoop.
● Developers, while choosing a tool/technology, stress on performance, but
this variety makes governance very complicated.
● Sqoop :
● Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone who is
experiencing difficulties in moving data from the data warehouse into the
Hadoop environment.
● Apache Sqoop is an effective Hadoop tool used for importing data from
RDBMS’s like MySQL, Oracle, etc. into HBase, Hive or HDFS.
● Sqoop Hadoop can also be used for exporting data from HDFS into
RDBMS.
● Apache Sqoop is a command-line interpreter i.e. the Sqoop commands are
executed one at a time by the interpreter.
● Flume :
● Apache Flume is a service designed for streaming logs into the Hadoop
environment.
● Flume is a distributed and reliable service for collecting and aggregating
huge amounts of log data.
● With a simple and easy to use architecture based on streaming data flows,
it also has tunable reliability mechanisms and several recoveries and
failover mechanisms.

Hadoop Archives :

● Hadoop Archive (HAR) is a facility that packs small files into larger, compact
archive files stored in HDFS, to avoid wasting namenode memory.
● Name node stores the metadata information of the HDFS data.
● If a 1 GB file is broken into 1000 pieces, then the namenode has to store
metadata about all of those 1000 small files.
● In that manner, namenode memory is wasted in storing and managing a lot
of metadata.
● A HAR is created from a collection of files, and the archiving tool runs a
MapReduce job.
● This MapReduce job processes the input files in parallel to create the
archive file.
● Hadoop is built to deal with large files, so a huge number of small files is
problematic and has to be handled efficiently.
● When a large input is split into a number of small files and stored across
all the data nodes, the records for all of these files have to be stored in
the name node, which makes the name node inefficient.
● To handle this problem, Hadoop Archive has been created which packs the
HDFS files into archives and we can directly use these files as input to the
MR jobs.
● It always comes with *.har extension.
● HAR Syntax :
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

Example :

hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

I/O Compression :

● In the Hadoop framework, where large data sets are stored and processed,
you will need storage for large files.
● These files are divided into blocks and those blocks are stored in different
nodes across the cluster so lots of I/O and network data transfer is also
involved.
● In order to reduce the storage requirements and to reduce the time spent
in-network transfer, you can have a look at data compression in the
Hadoop framework.
● Using data compression in Hadoop you can compress files at various
steps, at all of these steps it will help to reduce storage and quantity of
data transferred.
● You can compress the input file itself.
● That will help you reduce storage space in HDFS.
● You can also configure the output of a MapReduce job to be compressed
in Hadoop.
● That helps in reducing storage space if you are archiving the output or sending
it to some other application for further processing; a short sketch follows below.
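
As a hedged sketch of these two options, the standard MapReduce job configuration exposes compression of both the intermediate map output and the final job output; everything apart from the compression-related calls is assumed boilerplate.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object CompressionConfig {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Compress the intermediate map output that is shuffled across the network
    conf.setBoolean("mapreduce.map.output.compress", true)

    val job = Job.getInstance(conf, "compressed output example")   // rest of the job setup omitted

    // Compress the final job output written to HDFS (Gzip here; Snappy/Bzip2 are common alternatives)
    FileOutputFormat.setCompressOutput(job, true)
    FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])
  }
}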

I/O Serialization :

● Serialization refers to the conversion of structured objects into byte


streams for transmission over the network or permanent storage on a disk.
● Deserialization refers to the conversion of byte streams back to structured
objects.
● Serialization is mainly used in two areas of distributed data processing :
● Interprocess communication
● Permanent storage
● We require I/O serialization because :
● It helps process records faster (time-bound processing).
● A proper data format has to be maintained when transmitting data to an
endpoint that has no schema support.
● If data without a proper structure or format has to be processed later,
complex errors may occur.
● Serialization offers data validation over transmission.
● To maintain the proper format of data serialization, the system must have
the following four properties -
● Compact - helps in the best use of network bandwidth
● Fast - reduces the performance overhead
● Extensible - can match new requirements
● Inter-operable - not language-specific
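
Hadoop's Writable interface is one concrete serialization mechanism with these properties. The sketch below (Scala, using only classes from hadoop-common) round-trips an IntWritable and a Text value through a byte stream to show serialization and deserialization.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.hadoop.io.{IntWritable, Text}

object WritableRoundTrip {
  def main(args: Array[String]): Unit = {
    // Serialization: structured objects -> byte stream
    val bytesOut = new ByteArrayOutputStream()
    val dataOut  = new DataOutputStream(bytesOut)
    new IntWritable(42).write(dataOut)
    new Text("hello hdfs").write(dataOut)
    dataOut.close()

    // Deserialization: byte stream -> structured objects
    val dataIn = new DataInputStream(new ByteArrayInputStream(bytesOut.toByteArray))
    val num    = new IntWritable()
    val text   = new Text()
    num.readFields(dataIn)
    text.readFields(dataIn)
    println(s"Read back: ${num.get()} / ${text.toString}")
  }
}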

Avro :

● Apache Avro is a language-neutral data serialization system.


● Since Hadoop writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by
multiple languages.
● Avro is a preferred tool to serialize data in Hadoop.
● Avro has a schema-based system.
● A language-independent schema is associated with its read and write
operations.
● Avro serializes the data which has a built-in schema.
● Avro serializes the data into a compact binary format, which can be
deserialized by any application.
● Avro uses JSON format to declare the data structures.
● Presently, it supports languages such as Java, C, C++, C#, Python, and
Ruby.
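
A minimal sketch of this schema-based approach, assuming only the Avro library: the schema is declared as JSON, a generic record is built against it, and the record is serialized into Avro's compact binary format. The User schema and field values are made up for illustration.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroSketch {
  def main(args: Array[String]): Unit = {
    // Avro schemas are declared in JSON
    val schemaJson =
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}
        |]}""".stripMargin
    val schema = new Schema.Parser().parse(schemaJson)

    // Build a record that conforms to the schema
    val user: GenericRecord = new GenericData.Record(schema)
    user.put("name", "alice")
    user.put("age", 21)

    // Serialize it into Avro's compact binary format
    val out     = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    val writer  = new GenericDatumWriter[GenericRecord](schema)
    writer.write(user, encoder)
    encoder.flush()
    println(s"Serialized ${out.toByteArray.length} bytes")
  }
}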

Security in Hadoop :

● Apache Hadoop achieves security by using Kerberos.


● At a high level, there are three steps that a client must take to access a
service when using Kerberos.
● Each of these steps involves a message exchange with a server.
● Authentication – The client authenticates itself to the authentication server.
Then, receives a timestamped Ticket-Granting Ticket (TGT).
● Authorization – The client uses the TGT to request a service ticket from the
Ticket Granting Server.
● Service Request – The client uses the service ticket to authenticate itself to
the server.

Administering Hadoop :

● The person who administers Hadoop is called HADOOP


ADMINISTRATOR.
● Some of the common administering tasks in Hadoop are :
● Monitor health of a cluster
● Add new data nodes as needed
● Optionally turn on security
● Optionally turn on encryption
● Recommended, but optional, to turn on high availability
● Optional to turn on MapReduce Job History Tracking Server
● Fix corrupt data blocks when necessary
● Tune performance

UNIT 4

Hadoop Eco System and YARN:

Hadoop Ecosystem components


Hadoop Ecosystem is a platform or a suite that provides various services to solve
big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop i.e. HDFS, MapReduce,
YARN, and Hadoop Common. Most of the tools or solutions are used to
supplement or support these major elements. All these tools work collectively to
provide services such as absorption, analysis, storage and maintenance of data
etc.

Following are the components that collectively form a Hadoop ecosystem:

● HDFS: Hadoop Distributed File System


● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query-based processing of data services
● HBase: NoSQL Database
● Mahout, Spark MLLib: Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
● Zookeeper: Managing cluster
● Oozie: Job Scheduling

Hadoop-Schedulers
1. FIFO Scheduler

As the name FIFO (First In First Out) suggests, the tasks or applications that
come first are served first. This is the default scheduler used in Hadoop. The
tasks are placed in a queue and are performed in their submission order. In this
method, once the work is scheduled, no intervention is allowed, so sometimes a
high-priority process has to wait a long time because the priority of the task
does not matter in this method.

2. Capacity Scheduler

In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The
Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. For
every job queue, the Capacity Scheduler provides some slots or cluster resources
for performing job operations, so each job queue has its own slots to perform its
tasks. In case there are tasks to perform in only one queue, the tasks of that
queue can also access the free slots of the other queues; when a new task enters
another queue, the jobs running in its borrowed slots are replaced by that queue's
own jobs.

The Capacity Scheduler also provides a level of abstraction to show which tenant
is utilizing more cluster resources or slots, so that a single user or application
does not take a disproportionate or unnecessary number of slots in the cluster.
The Capacity Scheduler mainly contains three sorts of queues, namely root, parent,
and leaf, which are used to represent the cluster, an organization or any
subgroup, and application submission, respectively.

3. Fair Scheduler

The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the
job is taken into consideration. With the help of the Fair Scheduler, YARN
applications can share the resources of a large Hadoop cluster, and these
resources are maintained dynamically, so there is no need to plan capacity in
advance. The resources are distributed in such a manner that all applications
within a cluster get an equal share of time. The Fair Scheduler makes scheduling
decisions based on memory, but it can also be configured to work with CPU.

As mentioned, it is similar to the Capacity Scheduler, but the major thing to
notice is that in the Fair Scheduler, whenever a high-priority job arrives in the
same queue, the task is processed in parallel by taking over some portion of the
already dedicated slots.

Hadoop 2.0 New Features-NameNode high availability

High Availability was a new feature added to Hadoop 2.x to solve the Single point
of failure problem in the older versions of Hadoop.

As the Hadoop HDFS follows the master-slave architecture where the


NameNode is the master node and maintains the filesystem tree. So HDFS
cannot be used without NameNode. This NameNode becomes a bottleneck.
HDFS high availability feature addresses this issue.

Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an
HDFS cluster. Each cluster had a single NameNode, and if that NameNode failed,
the cluster as a whole would be out of service. The cluster would remain
unavailable until the NameNode was restarted or brought up on a separate machine.

Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes.
The HDFS NameNode High Availability architecture provides the option of running
two redundant NameNodes in the same cluster in an active/passive configuration
with a hot standby.

● Active NameNode – It handles all client operations in the cluster.


● Passive NameNode – It is a standby namenode, which has similar data as
active NameNode. It acts as a slave, maintains enough state to provide a
fast failover, if necessary.

If Active NameNode fails, then passive NameNode takes all the responsibility of
active node and the cluster continues to work.

Issues in maintaining consistency in the HDFS High Availability cluster are as


follows:

● Active and Standby NameNode should always be in sync with each other,
i.e. they should have the same metadata. This permits restoring the
Hadoop cluster to the same namespace state at which it crashed and
provides fast failover.
● There should be only one NameNode active at a time. Otherwise, two
active NameNodes will lead to corruption of the data. We call this scenario a
“Split-Brain Scenario”, where the cluster gets divided into smaller clusters,
each believing that it is the only active cluster. “Fencing” avoids such
scenarios. Fencing is the process of ensuring that only one NameNode
remains active at a particular time.
HDFS Federation

HDFS Federation improves the existing HDFS architecture through a clear


separation of namespace and storage, enabling a generic block storage layer. It
enables support for multiple namespaces in the cluster to improve scalability and
isolation. Federation also opens up the architecture, expanding the applicability
of HDFS clusters to new implementations and use cases.

To scale the name service horizontally, the federation uses multiple independent
namenodes/namespaces. The namenodes are federated, that is, the namenodes
are independent and don’t require coordination with each other. The datanodes
are used as common storage for blocks by all the namenodes. Each datanode
registers with all the namenodes in the cluster. Datanodes send periodic
heartbeats and block reports and handle commands from the namenodes.

A Block Pool is a set of blocks that belong to a single namespace. Datanodes


store blocks for all the block pools in the cluster.
It is managed independently of other block pools. This allows a namespace to
generate Block IDs for new blocks without the need for coordination with the
other namespaces. The failure of a namenode does not prevent the datanode
from serving other namenodes in the cluster.

A Namespace and its block pool together are called Namespace Volume. It is a
self-contained unit of management. When a namenode/namespace is deleted,
the corresponding block pool at the datanodes is deleted. Each namespace
volume is upgraded as a unit, during cluster upgrade.

MRv2

MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager
for each cluster, and each data node runs a Node Manager. For each job, one
slave node will act as the Application Master, monitoring resources/tasks, etc.
The MapReduce framework in the Hadoop 1.x version is also known as MRv1.
The MRv1 framework includes client communication, job execution and
management, resource scheduling and resource management. The Hadoop
daemons associated with MRv1 are JobTracker and TaskTracker as shown in the
following figure:

YARN
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop
2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0.
YARN was described as a “Redesigned Resource Manager” at the time of its
launching, but it has now evolved to be known as a large-scale distributed
operating system used for Big Data processing. YARN also allows different data
processing engines like graph processing, interactive processing, stream
processing as well as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System), thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources
and schedule the application processing. For large volume data processing, it is
quite necessary to manage the available resources properly so that every
application can leverage them.

Running MRv1 in YARN.


YARN uses the ResourceManager web interface for monitoring applications
running on a YARN cluster. The ResourceManager UI shows the basic cluster
metrics, list of applications, and nodes associated with the cluster. In this section,
we'll discuss the monitoring of MRv1 applications over YARN.

The Resource Manager is the core component of YARN – Yet Another Resource
Negotiator. In analogy, it occupies the place of JobTracker of MRV1. Hadoop
YARN is designed to provide a generic and flexible framework to administer the
computing resources in the Hadoop cluster.

In this direction, the YARN Resource Manager Service (RM) is the central
controlling authority for resource management and makes allocation decisions.
The ResourceManager has two main components: Scheduler and
ApplicationsManager.

NoSQL Databases:

Introduction to NoSQL
A term for any type of database that does not use SQL for the primary retrieval of
data from the database. NoSQL databases have limited traditional functionality
and are designed for scalability and high-performance retrieval and appends.
Typically, NoSQL databases store data as key-value pairs, which works well for
data that is unrelated.

NoSQL databases can store relationship data—they just store it differently than
relational databases do. When compared with SQL databases, many find
modeling relationship data in NoSQL databases to be easier than in SQL
databases, because related data doesn’t have to be split between tables.

NoSQL data models allow related data to be nested within a single data
structure.

Here are the four main types of NoSQL databases:

● Document databases

A document database stores data in JSON, BSON, or XML documents (not Word
documents or Google docs, of course). In a document database, documents can
be nested. Particular elements can be indexed for faster querying.

Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications, which means less translation is required to use the
data in an application. SQL data must often be assembled and disassembled
when moving back and forth between applications and storage.

● Key-value stores

Key-value databases are a simpler type of database where each item contains
keys and values. A value can typically only be retrieved by referencing its key, so
learning how to query for a specific key-value pair is typically simple. Key-value
databases are great for use cases where you need to store large amounts of
data but you don’t need to perform complex queries to retrieve it. Common use
cases include storing user preferences or caching. Redis and DynamoDB are
popular key-value databases.

● Column-oriented databases

Wide-column stores store data in tables, rows, and dynamic columns.


Wide-column stores provide a lot of flexibility over relational databases because
each row is not required to have the same columns. Many consider wide-column
stores to be two-dimensional key-value databases. Wide-column stores are great
for when you need to store large amounts of data and you can predict what your
query patterns will be. Wide-column stores are commonly used for storing
Internet of Things data and user profile data. Cassandra and HBase are two of
the most popular wide-column stores.

● Graph databases

Graph databases store data in nodes and edges. Nodes typically store
information about people, places, and things while edges store information about
the relationships between the nodes. Graph databases excel in use cases where
you need to traverse relationships to look for patterns such as social networks,
fraud detection, and recommendation engines. Neo4j and JanusGraph are
examples of graph databases.

MongoDB:

Introduction:
MongoDB is an open-source document database and leading NoSQL database.
MongoDB is written in C++. This tutorial will give you a great understanding of
MongoDB concepts needed to create and deploy a highly scalable and
performance-oriented database.

Data Types:
MongoDB supports many data types. Some of them are −

● String − This is the most commonly used datatype to store the data. String
in MongoDB must be UTF-8 valid.
● Integer − This type is used to store a numerical value. Integer can be 32 bit
or 64 bit depending upon your server.
● Boolean − This type is used to store a boolean (true/ false) value.
● Double − This type is used to store floating-point values.
● Min/ Max keys − This type is used to compare a value against the lowest
and highest BSON elements.
● Arrays − This type is used to store arrays or lists or multiple values into
one key.
● Timestamp − This type is used to store a timestamp. This can be handy for
recording when a document has been modified or added.
● Object − This data type is used for embedded documents.
● Null − This type is used to store a Null value.
● Symbol − This datatype is used identically to a string; however, it's
generally reserved for languages that use a specific symbol type.
● Date − This data type is used to store the current date or time in UNIX time
format. You can specify your own date time by creating an object of Date
and passing day, month, a year into it.
● Object ID − This data type is used to store the document’s ID.
● Binary data − This data type is used to store binary data.
● Code − This data type is used to store JavaScript code into the document.
● Regular expression − This data type is used to store regular expression.

Creating Document:
Insert a Single Document

db.collection.insertOne() inserts a single document into a collection.

Insert Multiple Document

db.collection.insertMany() can insert multiple documents into a collection. Pass


an array of documents to the method.
Updating Document
db.collection.updateOne(<filter>, <update>, <options>)

Updates at most a single document that matches a specified filter even though
multiple documents may match the specified filter.

db.collection.updateMany(<filter>, <update>, <options>)

Update all documents that match a specified filter.

db.collection.replaceOne(<filter>, <update>, <options>)

Replaces at most a single document that matches a specified filter even though
multiple documents may match the specified filter.

db.collection.update()

Either updates or replaces a single document that matches a specified filter or


updates all documents that match a specified filter.

By default, the db.collection.update() method updates a single document. To


update multiple documents, use the multi option.

Deleting Documents
db.collection.deleteMany()

Delete all documents that match a specified filter.

db.collection.deleteOne()

Delete at most a single document that matches a specified filter even though
multiple documents may match the specified filter.

db.collection.remove()

Delete a single document or all documents that match a specified filter.

db.collection.findOneAndDelete()
findOneAndDelete() provides a sort option. The option allows for the deletion of
the first document sorted by the specified order.

db.collection.findAndModify()

db.collection.findAndModify() either updates or removes a single document and
provides a sort option. The option allows the modification or removal of the
first document sorted by the specified order.

db.collection.bulkWrite()

Performs multiple write operations in bulk, which can include delete operations.

Querying:

find() Method
To query data from MongoDB collection, you need to use MongoDB's find()
method.

Syntax

The basic syntax of find() method is as follows −

>db.COLLECTION_NAME.find()

find() method will display all the documents in a non-structured way.

pretty() Method
To display the results in a formatted way, you can use pretty() method.

Syntax

>db.COLLECTION_NAME.find().pretty()

findOne() method
Apart from the find() method, there is findOne() method, that returns only one
document.
Syntax

>db.COLLECTIONNAME.findOne()

AND in MongoDB
Syntax

To query documents based on the AND condition, you need to use $and
keyword. Following is the basic syntax of AND −

>db.mycol.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })

OR in MongoDB
Syntax

To query documents based on the OR condition, you need to use $or keyword.
Following is the basic syntax of OR −

>db.mycol.find(
{
$or: [
{key1: value1}, {key2:value2}
]
}
).pretty()

NOR in MongoDB
Syntax

To query documents based on the NOR condition, you need to use the $nor
keyword. Following is the basic syntax of NOR −
>db.COLLECTION_NAME.find(
{
$nor: [
{key1: value1}, {key2: value2}
]
}
)

NOT in MongoDB
Syntax

To query documents based on the NOT condition, you need to use the $not
keyword. $not is applied to the condition on a field. Following is the basic
syntax of NOT −

>db.COLLECTION_NAME.find(
{
key1: { $not: { $gt: value1 } }
}
).pretty()
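
For comparison, the same AND/OR style of query can be issued from application code. The sketch below is a rough illustration using the MongoDB Java sync driver from Scala; the connection string, database name, collection name, and key/value pairs are placeholders.

import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters

object MongoQuerySketch {
  def main(args: Array[String]): Unit = {
    val client     = MongoClients.create("mongodb://localhost:27017")   // placeholder connection string
    val collection = client.getDatabase("test").getCollection("mycol")  // placeholder db/collection names

    // Equivalent of: db.mycol.find({ $and: [ {key1: "value1"}, {key2: "value2"} ] })
    val andCursor = collection.find(
      Filters.and(Filters.eq("key1", "value1"), Filters.eq("key2", "value2"))).iterator()
    while (andCursor.hasNext) println(andCursor.next().toJson)

    // Equivalent of: db.mycol.find({ $or: [ {key1: "value1"}, {key2: "value2"} ] })
    val orCursor = collection.find(
      Filters.or(Filters.eq("key1", "value1"), Filters.eq("key2", "value2"))).iterator()
    while (orCursor.hasNext) println(orCursor.next().toJson)

    client.close()
  }
}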

Introduction to indexing

Indexes support the efficient execution of queries in MongoDB. Without indexes,


MongoDB must perform a collection scan, i.e. scan every document in a
collection, to select those documents that match the query statement. If an
appropriate index exists for a query, MongoDB can use the index to limit the
number of documents it must inspect.

Indexes are special data structures that store a small portion of the collection's
data set in an easy to traverse form. The index stores the value of a specific field
or set of fields, ordered by the value of the field. The ordering of the index entries
supports efficient equality matches and range-based query operations. In
addition, MongoDB can return sorted results by using the ordering in the index.

MongoDB supports indexes

– At the collection level

– Similar to indexes on RDBMS

Can be used for

– More efficient filtering

– More efficient sorting

– Index-only queries (covering index)

Types of Indexes

→ Default _id Index

MongoDB creates a unique index on the _id field during the creation of a
collection.

The _id index prevents clients from inserting two documents with the same value
for the

_id field.

You cannot drop this index on the _id field.

→ Create an Index

To create an index, use

db.collection.createIndex()

db.collection.createIndex( <key and index type specification>, <options> )

The db.collection.createIndex() method only creates an index if an index of the


same

specification does not already exist.


MongoDB indexes use a B-tree data structure.

MongoDB provides several different index types to support specific types of data
and

queries.

→ Single Field

In addition to the MongoDB-defined _id index, MongoDB supports the creation of


user-defined

ascending/descending indexes on a single field of a document.

The following example creates an ascending index on the field orderDate.

db.collection.createIndex( { orderDate: 1 } )

→ Compound Index

MongoDB also supports user-defined indexes on multiple fields, i.e. compound


indexes.

The order of fields listed in a compound index has significance. For instance, if a
compound index consists of { userid: 1, score: -1 }, the index sorts first by userid
and then, within each userid value, sorts by score.

The following example creates a compound index on the orderDate field (in
ascending order) and the zip code field (in descending order.)

db.collection.createIndex( { orderDate: 1, zipcode: -1 } )
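
The same single-field and compound indexes can also be created from application code. A rough sketch with the MongoDB Java driver's Indexes helper follows; the connection string and collection name are placeholders.

import com.mongodb.client.MongoClients
import com.mongodb.client.model.Indexes

object IndexSketch {
  def main(args: Array[String]): Unit = {
    val client     = MongoClients.create("mongodb://localhost:27017")
    val collection = client.getDatabase("test").getCollection("orders")   // placeholder names

    // Equivalent of: db.collection.createIndex( { orderDate: 1 } )
    collection.createIndex(Indexes.ascending("orderDate"))

    // Equivalent of: db.collection.createIndex( { orderDate: 1, zipcode: -1 } )
    collection.createIndex(
      Indexes.compoundIndex(Indexes.ascending("orderDate"), Indexes.descending("zipcode")))

    client.close()
  }
}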

→ Multikey Index

MongoDB uses multikey indexes to index the content stored in arrays.

If you index a field that holds an array value, MongoDB creates separate index
entries for every element of the array.
These multikey indexes allow queries to select documents that contain arrays by
matching on elements or elements of the arrays.

MongoDB automatically determines whether to create a multikey index if the


indexed field contains an array value; you do not need to explicitly specify the
multikey type.

→ Index Use

Indexes can improve the efficiency of reading operations.

Covered Queries

When the query criteria and the projection of a query includes only the indexed
fields,

MongoDB will return results directly from the index without scanning any
documents or

bringing documents into memory. These covered queries can be very efficient.

Index Intersection(New in version 2.6.)

MongoDB can use the intersection of indexes to fulfil queries.

For queries that specify the compound query conditions, if one index can fulfil a
part of a query condition, and another index can fulfil another part of the query
condition, then MongoDB can

use the intersection of the two indexes to fulfil the query.

To illustrate index intersection, consider collection orders that have the following

indexes:

{ qty: 1 }

{ item: 1 }

MongoDB can use the intersection of the two

indexes to support the following query:


db.orders.find( { item: "abc123", qty: { $gt: 15} } )

→ Remove Indexes

You can use the following methods to remove indexes:

db.collection.dropIndex() method

db.accounts.dropIndex( { "tax-id": 1 } )

The above operation removes an ascending index on the tax-id field in the
accounts collection.

db.collection.dropIndexes()

To remove all indexes barring the _id index from a collection, use the operation
above.

→ Modify Indexes

To modify an index, first, drop the index and then recreate it.

Drop Index: Execute the query given below to return a document showing the
operation status.

db.orders.dropIndex({ "cust_id" : 1, "ord_date" :-1, "items" : 1 })

Recreate the Index: Execute the query given below to return a document
showing the status of the results.

db.orders.createIndex({ "cust_id" : 1, "ord_date" : -1, "items" : -1 })

→ Rebuild Indexes

In addition to modifying indexes, you can also rebuild them.

To rebuild all indexes of a collection, use the db.collection.reIndex() method.

This will drop all indexes including _id and rebuild all indexes in a single
operation.
Capped Collections:

Capped collections are fixed-size collections that support high-throughput


operations that insert and retrieve documents based on insertion order. Capped
collections work in a way similar to circular buffers: once a collection fills its
allocated space, it makes room for new documents by overwriting the oldest
documents in the collection.

Procedures
→ Create a Capped Collection

You must create capped collections explicitly using the db.createCollection()


method, which is a helper in the mongo shell for the create command. When
creating a capped collection you must specify the maximum size of the collection
in bytes, which MongoDB will pre-allocate for the collection. The size of the
capped collection includes a small amount of space for internal overhead.
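
A hedged sketch of creating a capped collection from application code with the MongoDB Java driver follows; the collection name, size, and document limit are placeholders.

import com.mongodb.client.MongoClients
import com.mongodb.client.model.CreateCollectionOptions

object CappedSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val db     = client.getDatabase("test")   // placeholder database name

    // Equivalent of: db.createCollection("log", { capped: true, size: 100000, max: 5000 })
    db.createCollection("log",
      new CreateCollectionOptions()
        .capped(true)
        .sizeInBytes(100000)   // maximum size of the collection in bytes (pre-allocated)
        .maxDocuments(5000))   // optional cap on the number of documents

    client.close()
  }
}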

→ Query a Capped Collection

If you perform a find() on a capped collection with no ordering specified,


MongoDB guarantees that the ordering of results is the same as the insertion
order.

To retrieve documents in reverse insertion order, issue find() along with the sort()
method with the $natural parameter set to -1, as shown in the following example:

db.cappedCollection.find().sort( { $natural: -1 } )

→ Check if a Collection is Capped

Use the isCapped() method to determine if a collection is capped, as follows:

db.collection.isCapped()

→ Convert a Collection to Capped


You can convert a non-capped collection to a capped collection with the
convertToCapped command:

db.runCommand({"convertToCapped": "mycoll", size: 100000});

The size parameter specifies the size of the capped collection in bytes.

Spark:

Installing spark
1. Choose a Spark release: 3.1.2 (Jun 01 2021)3.0.3 (Jun 23 2021)
2. Choose a package type: Pre-built for Apache Hadoop 3.2 and later
Pre-built for Apache Hadoop 2.7 Pre-built with user-provided Apache
Hadoop Source Code
3. Download Spark: spark-3.1.2-bin-hadoop3.2.tgz
4. Verify this release using the 3.1.2 signatures, checksums and project
release KEYS.

Note that, Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is
pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.

Spark Applications:
1. Processing Streaming Data

The most wonderful aspect of Apache Spark is its ability to process streaming
data. Every second, an unprecedented amount of data is generated globally. This
pushes companies and businesses to process data in large bulks and analyze it
in real-time. The Spark Streaming feature can efficiently handle this function. By
unifying disparate data processing capabilities, Spark Streaming allows
developers to use a single framework to accommodate all their processing
requirements. Some of the best features of Spark Streaming are:

Streaming ETL – Spark’s Streaming ETL continually cleans and aggregates the
data before pushing it into data repositories, unlike the complicated process of
conventional ETL (extract, transform, load) tools used for batch processing in
data warehouse environments – they first read the data, then convert it to a
database compatible format, and finally, write it to the target database.

Data enrichment – This feature helps to enrich the quality of data by combining it
with static data, thus, promoting real-time data analysis. Online marketers use
data enrichment capabilities to combine historical customer data with live
customer behaviour data for delivering personalized and targeted ads to
customers in real-time.

Trigger event detection – The trigger event detection feature allows you to
promptly detect and respond to unusual behaviours or “trigger events” that could
compromise the system or create a serious problem within it.

While financial institutions leverage this capability to detect fraudulent


transactions, healthcare providers use it to identify potentially dangerous health
changes in the vital signs of a patient and automatically send alerts to the
caregivers so that they can take the appropriate actions.

Complex session analysis – Spark Streaming allows you to group live sessions
and events ( for example, user activity after logging into a website/application)
together and also analyze them. Moreover, this information can be used to
update ML models continually. Netflix uses this feature to obtain real-time
customer behavior insights on the platform and to create more targeted show
recommendations for the users.

2. Machine Learning

Spark has commendable Machine Learning abilities. It is equipped with an


integrated framework for performing advanced analytics that allows you to run
repeated queries on datasets. This, in essence, is the processing of Machine
learning algorithms. Machine Learning Library (MLlib) is one of Spark’s most
potent ML components.

This library can perform clustering, classification, dimensionality reduction, and


much more. With MLlib, Spark can be used for many Big Data functions such as
sentiment analysis, predictive intelligence, customer segmentation, and
recommendation engines, among other things.

Another mention-worthy application of Spark is network security. By leveraging


the diverse components of the Spark stack, security providers/companies can
inspect data packets in real-time inspections for detecting any traces of malicious
activity. Spark Streaming enables them to check any known threats before
passing the packets on to the repository.

When the packets arrive in the repository, they are further analyzed by other
Spark components (for instance, MLlib). In this way, Spark helps security
providers to identify and detect threats as they emerge, thereby enabling them to
solidify client security.

3. Fog Computing

Fog Computing decentralizes data processing and storage. However, certain


complexities accompany Fog Computing – it requires low latency, massively
parallel processing of ML, and incredibly complex graph analytics algorithms.
Thanks to vital stack components like Spark Streaming, MLlib, and GraphX (a
graph analysis engine), Spark performs excellently as a capable Fog Computing
solution.

Spark jobs, stages and tasks

Job - A parallel computation consisting of multiple tasks that get spawned in


response to a Spark action (e.g., save(), collect()). During interactive sessions
with Spark shells, the driver converts your Spark application into one or more
Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s
execution plan, where each node within a DAG could be single or multiple Spark
stages.

Stage - Each job gets divided into smaller sets of tasks called stages that depend
on each other. As part of the DAG nodes, stages are created based on what
operations can be performed serially or in parallel. Not all Spark operations can
happen in a single stage, so they may be divided into multiple stages. Often
stages are delineated on the operator’s computation boundaries, where they
dictate data transfer among Spark executors.

Task - A single unit of work or execution that will be sent to a Spark executor.
Each stage is comprised of Spark tasks (a unit of execution), which are then
federated across each Spark executor; each task maps to a single core and
works on a single partition of data. As such, an executor with 16 cores can have
16 or more tasks working on 16 or more partitions in parallel, making the
execution of Spark’s tasks exceedingly parallel!


Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It


is an immutable distributed collection of objects. Each dataset in RDD is divided
into logical partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be


created through deterministic operations on either data on stable storage or
other RDDs. RDD is a fault-tolerant collection of elements that can be operated
on in parallel.

There are two ways to create RDDs − parallelizing an existing collection in your
driver program or referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a Hadoop Input
Format.
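
A minimal sketch of both creation routes, assuming a local SparkContext (the HDFS path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // 1. Parallelize an existing collection in the driver program
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    println(numbers.map(_ * 2).reduce(_ + _))          // transformations are lazy; reduce triggers the job

    // 2. Reference a dataset in external storage (HDFS path is a placeholder)
    val lines = sc.textFile("hdfs:///user/hduser/input/sou")
    println(lines.filter(_.nonEmpty).count())

    sc.stop()
  }
}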

Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take
place and why they are not so efficient.

Anatomy of a Spark job run

A Spark application contains several components, all of which exist whether you
are running Spark on a single machine or across a cluster of hundreds or
thousands of nodes.

The components of the spark application are Driver, the Master, the Cluster
Manager and the Executors.

All of the spark components including the driver, master, executor processes run
in java virtual machines(JVMs). A JVM is a cross-platform runtime engine that
executes the instructions compiled into java bytecode. Scala, which spark is
written in, compiles into bytecode and runs on JVMs.
Spark on YARN:

Apache Spark is an in-memory distributed data processing engine and YARN is a


cluster management technology.

When running Spark on YARN, each Spark executor runs as a YARN container.
Where MapReduce schedules a container and fires up a JVM for each task,
Spark hosts multiple tasks within the same container. This approach enables
several orders of magnitude faster task startup time.

Spark supports two modes for running on YARN, “yarn-cluster” mode and
“yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs,
while yarn-client mode makes sense for interactive and debugging uses where
you want to see your application’s output immediately.

Understanding the difference requires an understanding of YARN’s Application


Master concept. In YARN, each application instance has an Application Master
process, which is the first container started for that application. The Application
Master is responsible for requesting resources from the ResourceManager, and, when
allocated them, telling NodeManagers to start containers on its behalf.
Application Masters obviate the need for an active client — the process starting
the application can go away and coordination continues from a process managed
by YARN running on the cluster.

In yarn-cluster mode, the driver runs in the Application Master. This means that
the same process is responsible for both driving the application and requesting
resources from YARN, and this process runs inside a YARN container. The client
that starts the app doesn’t need to stick around for its entire lifetime.

yarn cluster mode

The yarn-cluster mode, however, is not well suited to using Spark interactively.
Spark applications that require user input, like spark-shell and PySpark, need the
Spark driver to run inside the client process that initiates the Spark application. In
yarn-client mode, the Application Master is merely present to request executor
containers from YARN. The client communicates with those containers to
schedule work after they start:

Yarn Client Mode

Different Deployment Modes across the cluster

In yarn-cluster mode, the Spark client submits the Spark application to YARN, and
both the Spark Driver and the Spark Executors run under the supervision of YARN.
In yarn-client mode, only the Spark Executors are under the supervision of YARN;
the YARN ApplicationMaster requests resources for just the Spark Executors. The
driver program runs in the client process, which has nothing to do with YARN.

SCALA:

Introduction
Scala is a modern multi-paradigm programming language designed to express
common programming patterns in a concise, elegant, and type-safe way. It
seamlessly integrates features of object-oriented and functional languages.

Classes and Objects


A class is a blueprint for objects. Once you define a class, you can create objects
from the class blueprint with the keyword new. Through the object, you can use
all functionalities of the defined class.

Class
Following is a simple syntax to define a basic class in Scala. This class defines
two variables x and y and a method: move, which does not return a value. Class
variables are called fields of the class, and methods are called class methods.

The class name works as a class constructor which can take several parameters.
The above code defines two constructor arguments, xc and yc; they are both
visible in the whole body of the class.

Syntax
class Point(xc: Int, yc: Int) {
  var x: Int = xc
  var y: Int = yc

  def move(dx: Int, dy: Int) {
    x = x + dx
    y = y + dy
    println("Point x location : " + x)
    println("Point y location : " + y)
  }
}

You can create objects using the keyword new and then access the class
fields and methods, as in the sketch below.
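
For example, a small usage sketch of the Point class defined above (the demo object and values are made up):

object PointDemo {
  def main(args: Array[String]): Unit = {
    val pt = new Point(10, 20)   // calls the primary constructor with xc = 10, yc = 20
    pt.move(5, -5)               // prints the new x and y locations
    println(pt.x + ", " + pt.y)  // fields are accessed through the object
  }
}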

Extending a Class
You can extend a base Scala class and you can design an inherited class in the
same way you do it in Java (use extends keyword), but there are two restrictions:
method overriding requires the override keyword, and only the primary
constructor can pass parameters to the base constructor.

Implicit Classes
Implicit classes allow implicit conversions with the class’s primary constructor
when the class is in scope. An implicit class is a class marked with the ‘implicit’
keyword. This feature was introduced in Scala 2.10.

Syntax − The following is the syntax for implicit classes. Here implicit class is
always in the object scope where all method definitions are allowed because the
implicit class cannot be a top-level class.

Syntax
object <object name> {
implicit class <class name>(<Variable>: Data type) {
def <method>(): Unit =
}
}
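
A small concrete example following this syntax; the IntTimes helper below is an assumed illustration, not something defined in these notes:

object Helpers {
  // While in scope, any Int gains a times method that repeats an action n times
  implicit class IntTimes(n: Int) {
    def times(action: => Unit): Unit = {
      var i = 0
      while (i < n) { action; i += 1 }
    }
  }
}

object ImplicitDemo {
  def main(args: Array[String]): Unit = {
    import Helpers._
    3.times(println("Hello, Scala"))   // IntTimes(3).times(...) is applied implicitly
  }
}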

Singleton Objects

Scala is more object-oriented than Java because, in Scala, we cannot have static
members. Instead, Scala has singleton objects. A singleton is a class that can
have only one instance, i.e., Object. You create a singleton using the keyword
object instead of the class keyword. Since you can't instantiate a singleton object,
you can't pass parameters to the primary constructor.
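
A short sketch of a singleton object and how its members are accessed:

object Logger {
  private var count = 0                 // state shared by every caller, since only one instance exists

  def log(message: String): Unit = {
    count += 1
    println(s"[$count] $message")
  }
}

object SingletonDemo {
  def main(args: Array[String]): Unit = {
    Logger.log("application started")   // members are accessed on the object itself, no `new`
    Logger.log("application finished")
  }
}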

Basic Types and Operators

Basic Data Types:

Sr. No.  Data Type  Description

1    Byte      8 bit signed value. Range from -128 to 127
2    Short     16 bit signed value. Range -32768 to 32767
3    Int       32 bit signed value. Range -2147483648 to 2147483647
4    Long      64 bit signed value. Range -9223372036854775808 to 9223372036854775807
5    Float     32 bit IEEE 754 single-precision float
6    Double    64 bit IEEE 754 double-precision float
7    Char      16 bit unsigned Unicode character. Range from U+0000 to U+FFFF
8    String    A sequence of Chars
9    Boolean   Either the literal true or the literal false
10   Unit      Corresponds to no value
11   Null      null or empty reference
12   Nothing   The subtype of every other type; includes no values
13   Any       The supertype of any type; any object is of type Any
14   AnyRef    The supertype of any reference type
Operators:

An operator is a symbol that tells the compiler to perform specific mathematical


or logical manipulations. Scala is rich in built-in operators and provides the
following types of operators −

● Arithmetic Operators
● Relational Operators
● Logical Operators
● Bitwise Operators
● Assignment Operators

→ Arithmetic Operators
The following arithmetic operators are supported by the Scala language.

Operator Description

+ Adds two operands

- Subtracts second operand from the first

* Multiplies both operands

/ Divides numerator by de-numerator

% Modulus operator finds the remainder after division of one number by


another
→ Relational Operators
The following relational operators are supported by the Scala language

Operator Description

== Checks if the values of two operands are equal or not, if yes then the
condition becomes true.

!= Checks if the values of two operands are equal or not, if values are not
equal then the condition becomes true.

> Checks if the value of the left operand is greater than the value of the
right operand, if yes then the condition becomes true.

< Checks if the value of the left operand is less than the value of the right
operand, if yes then the condition becomes true.

>= Checks if the value of the left operand is greater than or equal to the
value of the right operand, if yes then the condition becomes true.

<= Checks if the value of the left operand is less than or equal to the value
of the right operand, if yes then the condition becomes true.

→ Logical Operators
The following logical operators are supported by the Scala language.

Operator Description

&& It is called Logical AND operator. If both the operands are non zero then
the condition becomes true.

|| It is called Logical OR Operator. If any of the two operands is non zero


then the condition becomes true.

! It is called Logical NOT Operator. Use to reverses the logical state of its
operand. If a condition is true then the Logical NOT operator will make it
false.

→ Bitwise Operators
Bitwise operator works on bits and performs bit by bit operation. The truth tables
for &, |, and ^ are as follows −

p q p&q p|q p^q

0 0 0 0 0

0 1 0 1 1

1 1 1 1 0

1 0 0 1 1

Operator Description

& Binary AND Operator copies a bit to the result if it exists in both
operands.

| Binary OR Operator copies a bit if it exists in either operand.

^ Binary XOR Operator copies the bit if it is set in one operand but not
both.

~ Binary Ones Complement Operator is unary and has the effect of


'flipping' bits.

<< Binary Left Shift Operator. The bit positions of the value of the left
operand are moved left by the number of bits specified by the right
operand.

>> Binary Right Shift Operator. The bit positions of the left operand value
are moved right by the number of bits specified by the right operand.

>>> Shift right zero-fill operator. The left operand's value is moved right by the number of bits specified by the right operand, and the vacated positions are filled with zeros.
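
For example, a small sketch using the arbitrary values 60 and 13:

val a = 60        // 0011 1100 in binary
val b = 13        // 0000 1101 in binary
println(a & b)    // 12   (0000 1100)
println(a | b)    // 61   (0011 1101)
println(a ^ b)    // 49   (0011 0001)
println(~a)       // -61  (ones complement)
println(a << 2)   // 240
println(a >> 2)   // 15
println(a >>> 2)  // 15   (same as >> here because a is positive)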

→ Assignment Operators
The following assignment operators are supported by the Scala language.

Operator Description

= Simple assignment operator. Assigns the value of the right-side operand to the left-side operand.

+= Add AND assignment operator. Adds the right operand to the left operand and assigns the result to the left operand.

-= Subtract AND assignment operator. Subtracts the right operand from the left operand and assigns the result to the left operand.

*= Multiply AND assignment operator. Multiplies the right operand with the left operand and assigns the result to the left operand.

/= Divide AND assignment operator. Divides the left operand by the right operand and assigns the result to the left operand.

%= Modulus AND assignment operator. Takes the modulus using the two operands and assigns the result to the left operand.

<<= Left shift AND assignment operator.

>>= Right shift AND assignment operator.

&= Bitwise AND assignment operator.

^= Bitwise exclusive OR and assignment operator.

|= Bitwise inclusive OR and assignment operator.
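
For example:

var c = 10
c += 5            // c is now 15
c -= 3            // 12
c *= 2            // 24
c /= 4            // 6
c %= 4            // 2
println(c)        // 2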

Built-in control structures:

Scala has only a handful of built-in control structures. The only control structures
are if, while, for, try, match, and function calls. The reason Scala has so few is
that it has included function literals since its inception. Instead of accumulating
one higher-level control structure after another in the base syntax, Scala
accumulates them in libraries.

1. If expressions
Scala's if works just like in many other languages. It tests a condition and then
executes one of two code branches depending on whether the condition holds
true. Here is a common example, written in an imperative style:

var filename = "default.txt"


if (!args.isEmpty)
filename = args(0)

This code declares a variable, filename, and initializes it to a default value. It then uses an if expression to check whether any arguments were supplied to the program. If so, it changes the variable to hold the value specified in the argument list. If no arguments were supplied, it leaves the variable set to the default value.
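
Because an if in Scala is itself an expression that returns a value, the same logic can also be written without a var, as in this small sketch (assuming args is in scope, as in the script above):

val filename =
  if (!args.isEmpty) args(0)
  else "default.txt"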

2. While loops

Scala's while loop behaves as in other languages. It has a condition and a body,
and the body is executed over and over as long as the condition holds true.

example:

def gcdLoop(x: Long, y: Long): Long = {
  var a = x
  var b = y
  while (a != 0) {
    val temp = a
    a = b % a
    b = temp
  }
  b
}

Scala also has a do-while loop. This works like the while loop except that it tests
the condition after the loop body instead of before.

Below shows a Scala script that uses a do-while to echo lines read from the
standard input until an empty line is entered:
var line = ""
do {
line = readLine()
println("Read: "+ line)
} while (line != "")

3. For expressions

Scala's for expression is a Swiss army knife of iteration. It lets you combine a few
simple ingredients in different ways to express a wide variety of iterations. Simple
uses enable common tasks such as iterating through a sequence of integers.
More advanced expressions can iterate over multiple collections of different
kinds, can filter out elements based on arbitrary conditions, and can produce new
collections.
Iteration through collections

The simplest thing you can do is to iterate through all the elements of a
collection.

For example, below shows some code that prints out all files in the current
directory. The I/O is performed using the Java API. First, we create a java.io.File
on the current directory, ".", and call its listFiles method. This method returns an
array of File objects, one per directory and file contained in the current directory.
We store the resulting array in the filesHere variable.

val filesHere = (new java.io.File(".")).listFiles


for (file <- filesHere)
println(file)

Filtering

Sometimes you do not want to iterate through a collection in its entirety. You want
to filter it down to some subset. You can do this with a for expression by adding a
filter: an if clause inside the for's parentheses.
For example, the code shown below lists only those files in the current directory
whose names end with ".scala":

val filesHere = (new java.io.File(".")).listFiles

for (file <- filesHere if file.getName.endsWith(".scala"))


println(file)

Nested iteration

If you add multiple <- clauses, you will get nested "loops." For example, the for
expression shown below has two nested loops. The outer loop iterates through
filesHere, and the inner loop iterates through fileLines(file) for any file that ends
with .scala.

def fileLines(file: java.io.File) =


scala.io.Source.fromFile(file).getLines.toList

def grep(pattern: String) =


for (
file <- filesHere
if file.getName.endsWith(".scala");
line <- fileLines(file)
if line.trim.matches(pattern)
) println(file +": "+ line.trim)

grep(".*gcd.*")

Mid-stream variable bindings

Note that the previous code repeats the expression line.trim. This is a non-trivial
computation, so you might want to only compute it once. You can do this by
binding the result to a new variable using an equals sign (=). The bound variable
is introduced and used just like a val, only with the val keyword left out.
Below shows an example.

def grep(pattern: String) =


for {
file <- filesHere
if file.getName.endsWith(".scala")
line <- fileLines(file)
trimmed = line.trim
if trimmed.matches(pattern)
} println(file +": "+ trimmed)

grep(".*gcd.*")

Producing a new collection

While all of the examples so far have operated on the iterated values and then
forgotten them, you can also generate a value to remember for each iteration. To
do so, you prefix the body of the for expression by the keyword yield. For
example, here is a function that identifies the .scala files and stores them in an
array:

def scalaFiles =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
  } yield file

Each time the body of the for expression executes it produces one value, in this
case simply file. When the for expression completes, the result will include all of
the yielded values contained in a single collection. The type of the resulting
collection is based on the kind of collections processed in the iteration clauses. In
this case, the result is an Array[File], because filesHere is an array and the type
of the yielded expression is File.

4. Exception handling with try expressions


Scala's exceptions behave just like in many other languages. Instead of returning
a value in the normal way, a method can terminate by throwing an exception. The
method's caller can either catch and handle that exception, or it can itself simply
terminate, in which case the exception propagates to the caller's caller. The
exception propagates in this way, unwinding the call stack until a method handles
it or there are no more methods left.
Throwing exceptions

Throwing an exception looks the same as in Java. You create an exception


object and then you throw it with the throw keyword:

throw new IllegalArgumentException

Catching exceptions

You catch exceptions using the syntax shown below. The syntax for catch clauses was chosen for its consistency with pattern matching, an important and powerful part of Scala.

import java.io.FileReader
import java.io.FileNotFoundException
import java.io.IOException

try {
val f = new FileReader("input.txt") // Use and close file
} catch {
case ex: FileNotFoundException => // Handle missing file
case ex: IOException => // Handle other I/O error
}

The finally clause

You can wrap an expression with a finally clause if you want to cause some code to execute no matter how the expression terminates. For example, you might want to be sure an open file gets closed even if a method exits by throwing an exception. Below is an example.
import java.io.FileReader

val file = new FileReader("input.txt")


try {
// Use the file
} finally {
file.close() // Be sure to close the file
}

Yielding a value

As with most other Scala control structures, try-catch-finally results in a value.


For example, below shows how you can try to parse a URL but use a default
value if the URL is badly formed. The result is that of the try clause if no
exception is thrown, or the relevant catch clause if an exception is thrown and
caught. If an exception is thrown but not caught, the expression has no result at
all. The value computed in the finally clause, if there is one, is dropped. Usually,
finally clauses do some kind of clean up such as closing a file; they should not
normally change the value computed in the main body or a catch clause of the
try.

import java.net.URL
import java.net.MalformedURLException

def urlFor(path: String) =


try {
new URL(path)
} catch {
case e: MalformedURLException =>
new URL("http://www.scala-lang.org")
}

5. Match expressions
Scala's match expression lets you select from several alternatives, just like
switch statements in other languages. In general, a match expression lets you
select using arbitrary patterns. For now, just consider using match to select
among several alternatives.

As an example, the script below reads a food name from the argument list and
prints a companion to that food. This match expression examines firstArg, which
has been set to the first argument out of the argument list. If it is the string "salt",
it prints "pepper", while if it is the string "chips", it prints "salsa", and so on. The
default case is specified with an underscore (_), a wildcard symbol frequently
used in Scala as a placeholder for a completely unknown value.

val firstArg = if (args.length > 0) args(0) else ""


firstArg match {
case "salt" => println("pepper")
case "chips" => println("salsa")
case "eggs" => println("bacon")
case _ => println("huh?")
}

Functions

Scala has both functions and methods, and the two terms are often used interchangeably, with a minor difference: a Scala method is a part of a class that has a name, a signature, optionally some annotations, and some bytecode, whereas a function in Scala is a complete object which can be assigned to a variable. In other words, a function that is defined as a member of some object is called a method.

Function Declarations
A Scala function declaration has the following form −

def functionName ([list of parameters]) : [return type]

A method is implicitly declared abstract if you omit the equals sign and the method body.

Function Definitions
A Scala function definition has the following form −

Syntax

def functionName ([list of parameters]) : [return type] = {


function body
return [expr]
}

Here, the return type could be any valid Scala data type and the list of
parameters will be a list of variables separated by a comma and the list of
parameters and return type are optional.

Calling Functions
Scala provides several syntactic variations for invoking methods. Following is the
standard way to call a method −

functionName( list of parameters )

If a function is being called using an instance of the object, then we would use
dot notation similar to Java as follows −

[instance.]functionName( list of parameters )
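
A minimal sketch putting the declaration, definition, and calling syntax together (the function name addInt and its parameters are purely illustrative):

object FunctionDemo {
  def addInt(a: Int, b: Int): Int = {
    val sum = a + b
    return sum
  }

  def main(args: Array[String]): Unit = {
    println(addInt(5, 7))   // standard call: prints 12
  }
}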

Closures
Scala closures are functions whose return value depends on one or more free variables. A free variable is any variable that is not defined within the function and is not passed as a parameter of the function; it is declared outside the closure and is captured from the enclosing scope. This free variable is what distinguishes a closure from a normal function.
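
A minimal sketch of a closure, where factor is the free variable defined outside the function literal:

var factor = 3
val multiplier = (i: Int) => i * factor   // factor is not a parameter; it is captured from outside

println(multiplier(10))   // 30
factor = 5
println(multiplier(10))   // 50, because the closure sees the current value of factor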

Inheritance.

Inheritance is an important pillar of OOP(Object Oriented Programming). It is the


mechanism in Scala by which one class is allowed to inherit the features(fields
and methods) of another class.

Important terminology:

● Super Class: The class whose features are inherited is known as


superclass(or a base class or a parent class).
● Sub Class: The class that inherits the other class is known as a
subclass(or a derived class, extended class, or child class). The subclass
can add its own fields and methods in addition to the superclass fields and
methods.
● Reusability: Inheritance supports the concept of “reusability”, i.e. when we
want to create a new class and there is already a class that includes some
of the code that we want, we can derive our new class from the existing
class. By doing this, we are reusing the fields and methods of the existing
class.

The keyword used for inheritance is extends.

Syntax:
class child_class_name extends parent_class_name {
  // Methods and fields
}
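
A minimal sketch of inheritance with extends (the class names Vehicle and Car are illustrative):

class Vehicle {
  val wheels: Int = 4
  def describe(): String = s"A vehicle with $wheels wheels"
}

class Car extends Vehicle {       // Car inherits wheels and describe() from Vehicle
  def horn(): String = "Beep!"
}

val car = new Car
println(car.describe())   // reused from the superclass
println(car.horn())       // added by the subclass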

Type of inheritance

Below are the different types of inheritance which are supported by Scala.

Single Inheritance: In single inheritance, a derived class inherits the features of one base class. For example, class A serves as a base class for the derived class B.

Multilevel Inheritance: In multilevel inheritance, a derived class inherits a base class, and that derived class in turn acts as the base class of another class. For example, class A serves as a base class for the derived class B, which in turn serves as a base class for the derived class C.

Hierarchical Inheritance: In hierarchical inheritance, one class serves as a superclass (base class) for more than one subclass. For example, class A serves as a base class for the derived classes B, C, and D.
Multiple Inheritance: In Multiple inheritance, one class can have more than one
superclass and inherit features from all parent classes. Scala does not support
multiple inheritance with classes, but it can be achieved by traits.

Hybrid Inheritance: It is a mix of two or more of the above types of inheritance.


Since Scala doesn’t support multiple inheritance with classes, hybrid inheritance
is also not possible with classes. In Scala, we can achieve hybrid inheritance
only through traits.
UNIT 5
Application of Big Data using :

1. Pig :

Pig is a high-level platform or tool which is used to process large datasets.

It provides a high level of abstraction for processing over MapReduce.

It provides a high-level scripting language, known as Pig Latin which is used to


develop the data analysis codes.

Applications :

1. For exploring large datasets Pig Scripting is used.


2. Provides supports across large data sets for Ad-hoc queries.
3. In the prototyping of large data-sets processing algorithms.
4. Required to process the time-sensitive data loads.
5. For collecting large amounts of datasets in form of search logs and web
crawls.
6. Used where the analytical insights are needed using the sampling.

2. Hive :

Hive is a data warehouse infrastructure tool to process structured data in


Hadoop.

It resides on top of Hadoop to summarize Big Data and makes querying and
analyzing easy.

It is used by different companies. For example, Amazon uses it in Amazon


Elastic MapReduce.

Benefits :

1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity

3. HBase :

HBase is a column-oriented non-relational database management system that


runs on top of the Hadoop Distributed File System (HDFS).

HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases

HBase does support writing applications in Apache Avro, REST and Thrift.

Application :

1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce

PIG

Introduction to PIG :
Pig is a high-level platform or tool which is used to process large datasets.

It provides a high level of abstraction for processing over MapReduce.

It provides a high-level scripting language, known as Pig Latin which is used to


develop the data analysis codes.

Pig Latin and Pig Engine are the two main components of the Apache Pig tool.

The result of Pig is always stored in the HDFS.

One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.

Apache Pig reduces the time of development using the multi-query approach.

Pig is beneficial for programmers who are not from Java backgrounds.

200 lines of Java code can be written in only 10 lines using the Pig Latin
language.

Programmers who have SQL knowledge needed less effort to learn Pig Latin.

Execution Modes of Pig :

Apache Pig scripts can be executed in three ways :

Interactive Mode (Grunt shell) :


You can run Apache Pig in interactive mode using the Grunt shell.

In this shell, you can enter the Pig Latin statements and get the output (using the
Dump operator).

Batch Mode (Script) :

You can run Apache Pig in Batch mode by writing the Pig Latin script in a single
file with the .pig extension.

Embedded Mode (UDF) :

Apache Pig provides the provision of defining our own functions (User Defined
Functions) in programming languages such as Java and using them in our script.

Comparison of Pig with Databases :

· Pig Latin is a procedural language, whereas SQL is a declarative language.

· In Apache Pig, the schema is optional; data can be stored without designing a schema (values are referenced positionally as $0, $1, and so on). In SQL, the schema is mandatory.

· The data model in Apache Pig is nested relational, while the data model used in SQL is flat relational.

· Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.

Grunt :

Grunt shell is a shell command.

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
A Pig script can be executed in the Grunt shell, which is a native shell provided by Apache Pig to execute Pig queries.

We can invoke shell commands using sh and fs.

Syntax of sh command :

grunt> sh ls

Syntax of fs command :

grunt>fs -ls

Pig Latin :

The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop.

It is a textual language that abstracts the programming from the Java


MapReduce idiom into a notation.

The Pig Latin statements are used to process the data.

It is an operator that accepts a relation as an input and generates another


relation as an output.

· It can span multiple lines.

· Each statement must end with a semicolon.

· It may include expression and schemas.

· By default, these statements are processed using multi-query execution

User-Defined Functions :

Apache Pig provides extensive support for User Defined Functions(UDF’s).

Using these UDF’s, we can define our own functions and use them.
The UDF support is provided in six programming languages:

· Java

· Jython

· Python

· JavaScript

· Ruby

· Groovy

For writing UDF’s, complete support is provided in Java and limited support is
provided in all the remaining languages.

Using Java, you can write UDF’s involving all parts of the processing like data
load/store, column transformation, and aggregation.

Since Apache Pig has been written in Java, the UDF’s written using Java
language work efficiently compared to other languages.

Types of UDF’s in Java :

Filter Functions :

● The filter functions are used as conditions in filter statements.


● These functions accept a Pig value as input and return a Boolean value.

Eval Functions :

● The Eval functions are used in FOREACH-GENERATE statements.


● These functions accept a Pig value as input and return a Pig result.

Algebraic Functions :

● The Algebraic functions act on inner bags in a FOREACH...GENERATE statement.
● These functions are used to perform full MapReduce operations on an
inner bag.

Data Processing Operators :

The Apache Pig Operators is a high-level procedural language for querying large
data sets using Hadoop and the Map-Reduce Platform.

A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output.

These operators are the main tools for Pig Latin provides to operate on the data.

They allow you to transform it by sorting, grouping, joining, projecting, and


filtering.

The Apache Pig operators can be classified as :

Relational Operators :

Relational operators are the main tools Pig Latin provides to operate on the data.

Some of the Relational Operators are :

LOAD: The LOAD operator is used to load data from the file system or HDFS
storage into a Pig relation.

FOREACH: This operator generates data transformations based on columns of


data. It is used to add or remove fields from a relation.

FILTER: This operator selects tuples from a relation based on a condition.

JOIN: JOIN operator is used to performing an inner, equijoin join of two or more
relations based on common field values

ORDER BY: Order By is used to sort a relation based on one or more fields in
either ascending or descending order using ASC and DESC keywords.

GROUP: The GROUP operator groups together the tuples with the same group
key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
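
A minimal Pig Latin sketch combining several of these relational operators; the input file student_data.txt, its location, and its schema are hypothetical:

-- LOAD a hypothetical tab-delimited file from HDFS
students = LOAD '/pig_data/student_data.txt'
           USING PigStorage('\t')
           AS (id:int, name:chararray, city:chararray, marks:int);

-- FILTER: keep only students with marks above 60
passed = FILTER students BY marks > 60;

-- GROUP: group the filtered tuples by city
by_city = GROUP passed BY city;

-- FOREACH ... GENERATE: compute the average marks per city
avg_marks = FOREACH by_city GENERATE group AS city, AVG(passed.marks) AS avg_mark;

-- ORDER BY: sort the result, then display it with DUMP
sorted = ORDER avg_marks BY city ASC;
DUMP sorted;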

Diagnostic Operator :

The load statement will simply load the data into the specified relation in Apache
Pig.

To verify the execution of the Load statement, you have to use the Diagnostic
Operators.

Some Diagnostic Operators are :

DUMP: The DUMP operator is used to run Pig Latin statements and display the
results on the screen.

DESCRIBE: Use the DESCRIBE operator to review the schema of a particular


relation. The DESCRIBE operator is best used for debugging a script.

ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed


through a sequence of Pig Latin statements. ILLUSTRATE command is your best
friend when it comes to debugging a script.

EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.

Hive

Apache Hive Architecture :


The above figure shows the architecture of Apache Hive and its major
components.

The major components of Apache Hive are :

1. Hive Client

2. Hive Services

3. Processing and Resource Management

4. Distributed Storage

HIVE CLIENT :

Hive supports applications written in any language like Python, Java, C++, Ruby,
etc using JDBC, ODBC, and Thrift drivers, for performing queries on the Hive.
Hence, one can easily write a hive client application in any language of its own
choice.

Hive clients are categorized into three types :

1. Thrift Clients : The Hive server is based on Apache Thrift so that it can
serve the request from a thrift client.

2. JDBC client : Hive allows for the Java applications to connect to it using the
JDBC driver. JDBC driver uses Thrift to communicate with the Hive Server.

3. ODBC client : Hive ODBC driver allows applications based on the ODBC
protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses
Thrift to communicate with the Hive Server.

HIVE SERVICE :

To perform all queries, Hive provides various services like the Hive server2,
Beeline, etc.

The various services offered by Hive are :

1. Beeline

2. Hive Server 2

3. Hive Driver

4. Hive Compiler

5. Optimizer

6. Execution Engine

7. Metastore

8. HCatalog

9. WebHCat

PROCESSING AND RESOURCE MANAGEMENT :


Hive internally uses a MapReduce framework as a de facto engine for executing the queries.

MapReduce is a software framework for writing those applications that process a


massive amount of data in parallel on the large clusters of commodity hardware.

MapReduce job works by splitting data into chunks, which are processed by
map-reduce tasks.

DISTRIBUTED STORAGE :

Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File
System for the distributed storage.

Hive Shell :

Hive shell is a primary way to interact with hive.

It is a default service in the hive.

It is also called the CLI (command line interface).

Hive shell is similar to MySQL Shell.

Hive users can run HQL queries in the hive shell.

In hive shell up and down arrow keys are used to scroll previous commands.

HiveQL is case-insensitive (except for string comparisons).

The tab key will autocomplete (provides suggestions while you type into the field)
Hive keywords and functions.

Hive Shell can run in two modes :

Non-Interactive mode :

In non-interactive mode, the Hive shell executes the commands contained in a script file instead of reading them from the terminal.

Hive Shell can run in the non-interactive mode with the -f option.

Example:

$hive -f script.q, where script.q is a file containing HiveQL statements.

Interactive mode :

The hive can work in interactive mode by directly typing the command “hive” in
the terminal.

Example:

$hive

Hive> show databases;



Hive Services :

The following are the services provided by Hive :

· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.

· Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.

· Hive metastore: It is a central repository that stores all the structure


information of various tables and partitions in the warehouse. It also includes
metadata of column and its type information, the serializers and deserializers
which is used to read and write data and the corresponding HDFS files where the
data is stored.

· Hive Server: It is referred to as Apache Thrift Server. It accepts the request


from different clients and provides it to Hive Driver.
· Hive Driver: It receives queries from different sources like web UI, CLI,
Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.

· Hive Compiler: The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and expressions. It
converts HiveQL statements into MapReduce jobs.

· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.

MetaStore :

Hive metastore (HMS) is a service that stores Apache Hive and other metadata
in a backend RDBMS, such as MySQL or PostgreSQL.

Impala, Spark, Hive, and other services share the metastore.

The connections to and from HMS include HiveServer, Ranger, and the
NameNode, which represents HDFS.

Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or
JDBC to HiveServer.

The HiveServer instance reads/writes data to HMS.

By default, redundant HMS operate in active/active mode.

The physical data resides in a backend RDBMS, one for HMS.

All connections are routed to a single RDBMS service at any given time.

HMS talks to the NameNode over thrift and functions as a client to HDFS.

HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer.
One or more HMS instances on the backend can talk to other services, such as
Ranger.

Comparison with Traditional Database :

· RDBMS is used to maintain a database, whereas Hive is used to maintain a data warehouse.

· RDBMS uses SQL (Structured Query Language), whereas Hive uses HQL (Hive Query Language).

· The schema is fixed in RDBMS, whereas the schema varies in Hive.

· RDBMS stores normalized data, whereas Hive stores both normalized and de-normalized data.

· Tables in RDBMS are sparse, whereas tables in Hive are dense.

· RDBMS doesn't support partitioning, whereas Hive supports automatic partitioning.

· RDBMS uses no partitioning method, whereas Hive uses the sharding method for partitioning.
HiveQL :

Even though based on SQL, HiveQL does not strictly follow the full SQL-92
standard.

HiveQL offers extensions not in SQL, including multitable inserts and create table
as select.

HiveQL historically lacked support for transactions and materialized views, and offered only limited subquery support.

Support for insert, update, and delete with full ACID functionality was made
available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.

Example :

DROP TABLE IF EXISTS docs;


CREATE TABLE docs (line STRING);

Checks if table docs exist and drop it if it does. Creates a new table called docs
with a single column of type STRING called line.

LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (In this case “input_file”) into the table.

OVERWRITE specifies that the target table to which the data is being loaded is
to be re-written; Otherwise, the data would be appended.

CREATE TABLE word_counts AS


SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count


creates a table called word_counts with two columns: word and count.

This query draws its input from the inner query (SELECT explode(split(line, '\s'))
AS word FROM docs) temp".

This query serves to split the input words into different rows of a temporary table
aliased as temp.

The GROUP BY word clause groups the results based on their keys.

This results in the count column holding the number of occurrences for each
word of the word column.

The ORDER BY word clause sorts the words alphabetically.


Tables :

Here are the types of tables in Apache Hive:


Managed Tables :

In a managed table, both the table data and the table schema are managed by
Hive.

The data will be located in a folder named after the table within the Hive data
warehouse, which is essentially just a file location in HDFS.

By managed or controlled we mean that if you drop (delete) a managed table,


then Hive will delete both the Schema (the description of the table) and the data
files associated with the table.

Default location is /user/hive/warehouse.

The syntax for Managed Tables :

CREATE TABLE IF NOT EXISTS stocks (exchange STRING,


symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :

An external table is one where only the table schema is controlled by Hive.

In most cases, the user will set up the folder location within HDFS and copy the
data file(s) there.

This location is included as part of the table definition statement.

When an external table is deleted, Hive will only delete the schema associated
with the table.
The data files are not affected.

Syntax for External Tables :

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (exchange


STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Querying Data :

A query is a request for data or information from a database table or a


combination of tables.

This data may be generated as results returned by Structured Query Language


(SQL) or as pictorials, graphs or complex results, e.g., trend analyses from
data-mining tools.

One of several different query languages may be used to perform a range of


simple to complex database queries.

SQL, the most well-known and widely-used query language, is familiar to most
database administrators (DBAs)

User-Defined Functions :

In Hive, the users can define their own functions to meet certain client
requirements.

These are known as UDFs in Hive.

User-Defined Functions are written in Java for specific modules.

Some UDFs are specifically designed for the reusability of code in application frameworks.

The developer will develop these functions in Java and integrate those UDFs
with the Hive.

During the Query execution, the developer can directly use the code, and UDFs
will return outputs according to the user-defined tasks.

It will provide high performance in terms of coding and execution.

The general type of UDF will accept a single input value and produce a single
output value.

We can use two different interfaces for writing Apache Hive User-Defined
Functions :

1. Simple API

2. Complex API

Sorting And Aggregating :

Sorting data in Hive can be achieved by use of a standard ORDER BY clause,


but there is a catch.

ORDER BY produces a result that is totally sorted, as expected, but to do so it


sets the number of reducers to one, making it very inefficient for large datasets.

When a globally sorted result is not required (and in many cases it isn't), you can use Hive's nonstandard extension, SORT BY, instead.

SORT BY produces a sorted file per reducer.

If you want to control which reducer a particular row goes to (typically so that you can perform some subsequent aggregation), you can use Hive's DISTRIBUTE BY clause.


Example :

· To sort the weather dataset by year and temperature, in such a way to


ensure that all the rows for a given year end up in the same reducer partition :

Hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;

· Output :

1949 111

1949 78

1950 22

1950 0

1950 -11

MapReduce Scripts in Hive / Hive Scripts :

Similar to any other scripting language, Hive scripts are used to execute a set of
Hive commands collectively.

Hive scripting helps us to reduce the time and effort invested in writing and
executing the individual commands manually.

Hive scripting is supported in Hive 0.10.0 or higher versions of Hive.

Joins and SubQueries :

JOINS :

Join queries can perform on two tables present in Hive.

Joins are of 4 types, these are :


· Inner join: The Records common to both tables will be retrieved by this
Inner Join.

· Left outer Join: Returns all the rows from the left table even though there
are no matches in the right table.

· Right Outer Join: Returns all the rows from the Right table even though
there are no matches in the left table.

· Full Outer Join: It combines records of both the tables based on the JOIN
Condition given in the query. It returns all the records from both tables and fills in
NULL Values for the columns missing values matched on either side.
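
A minimal HiveQL sketch of these joins, assuming two hypothetical tables customers(id, name) and orders(order_id, customer_id, amount):

-- Inner join: only customers that have at least one matching order
SELECT c.name, o.amount
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- Left outer join: every customer, with NULL amount where there is no order
SELECT c.name, o.amount
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);

-- Full outer join: all rows from both tables, with NULLs for missing matches
SELECT c.name, o.amount
FROM customers c FULL OUTER JOIN orders o ON (c.id = o.customer_id);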

SUBQUERIES :

A Query present within a Query is known as a subquery.

The main query will depend on the values returned by the subqueries.

Subqueries can be classified into two types :

· Subqueries in FROM clause

· Subqueries in WHERE clause

When to use :

· To get a particular value combined from two column values from different
tables.

· Dependency of one table values on other tables.

· Comparative checking of one column values from other tables.

Syntax :

Subquery in FROM clause :
SELECT <column names 1, 2...n> FROM (SubQuery) <TableName_Main>;

Subquery in WHERE clause :
SELECT <column names 1, 2...n> FROM <TableName_Main> WHERE col1 IN (SubQuery);
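
For example, using the same hypothetical customers and orders tables as above:

-- Subquery in the FROM clause: total order amount per customer
SELECT t.customer_id, t.total
FROM (SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id) t;

-- Subquery in the WHERE clause: customers that placed at least one order
SELECT name
FROM customers
WHERE id IN (SELECT customer_id FROM orders);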

HBASE

HBase Concepts :

HBase is a distributed column-oriented database built on top of the Hadoop file


system.

It is an open-source project and is horizontally scalable.

HBase is a data model that is similar to Google’s big table designed to provide
quick random access to huge amounts of structured data.

It leverages the fault tolerance provided by the Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write


access to data in the Hadoop File System.

One can store the data in HDFS either directly or through HBase.

Data consumer reads/accesses the data in HDFS randomly using HBase.

HBase sits on top of the Hadoop File System and provides read and write
access.

HBase Vs RDBMS :

· RDBMS requires SQL (Structured Query Language), whereas HBase is a NoSQL database.

· RDBMS has a fixed schema, whereas HBase has no fixed schema.

· RDBMS is row-oriented, whereas HBase is column-oriented.

· RDBMS is not scalable, whereas HBase is scalable.

· RDBMS is static in nature, whereas HBase is dynamic in nature.

· RDBMS has slower retrieval of data, whereas HBase has faster retrieval of data.

· RDBMS follows the ACID (Atomicity, Consistency, Isolation and Durability) properties, whereas HBase follows the CAP (Consistency, Availability, Partition-tolerance) theorem.

· RDBMS can handle structured data, whereas HBase can handle structured, unstructured, as well as semi-structured data.

· RDBMS cannot handle sparse data, whereas HBase can handle sparse data.

Schema Design :

HBase table can scale to billions of rows and any number of columns based on
your requirements.

This table allows you to store terabytes of data in it.

The HBase table supports the high read and writes throughput at low latency.

A single value in each row is indexed; this value is known as the row key.

The HBase schema design is very different compared to the relational database
schema design.

Some of the general concepts that should be followed while designing schema in
Hbase:

· Row key: Each table in the HBase table is indexed on the row key. There
are no secondary indices available on the HBase table.

· Atomicity: Avoid designing a table that requires atomicity across all rows. All operations on HBase rows are atomic at the row level.

· Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows to increase read efficiency.

Zookeeper :

ZooKeeper is a distributed coordination service that also helps to manage a large


set of hosts.

Managing and coordinating a service especially in a distributed environment is a


complicated process, so ZooKeeper solves this problem due to its simple
architecture as well as API.

ZooKeeper allows developers to focus on core application logic.

For instance, to track the status of distributed data, Apache HBase uses
ZooKeeper.

They can also support a large Hadoop cluster easily.

To retrieve information, each client machine communicates with one of the


servers.

It keeps an eye on the synchronization as well as coordination across the cluster

Some of the best Apache ZooKeeper features are :

· Simplicity: It coordinates with the help of a shared hierarchical namespace.

· Reliability: The system keeps performing, even if more than one node fails.

· Speed: It runs fastest in read-dominant workloads, where reads outnumber writes at a ratio of around 10:1.

· Scalability: By deploying more machines, the performance can be


enhanced.
IBM Big Data Strategy :

IBM, a US-based computer hardware and software manufacturer, had


implemented a Big Data strategy.

The company offered solutions to store, manage, and analyze the huge amounts of data generated daily, and equipped large and small companies to make informed business decisions.

The company believed that its Big Data and analytics products and services
would help its clients become more competitive and drive growth.

Issues :

· Understand the concept of Big Data and its importance to large, medium,
and small companies in the current industry scenario.

· Understand the need for implementing a Big Data strategy and the various
issues and challenges associated with this.

· Analyze the Big Data strategy of IBM.

· Explore ways in which IBM’s Big Data strategy could be improved further.

Introduction to InfoSphere :

InfoSphere Information Server provides a single platform for data integration and
governance.

The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume
requirements.

You can use the suite to deliver business results faster while maintaining data
quality and integrity throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate
to understand the meaning, structure, and content of information across a wide
variety of sources.

By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.

BigInsights :

BigInsights is a software platform for discovering, analyzing, and visualizing data


from disparate sources.

The flexible platform is built on an Apache Hadoop open-source framework that


runs in parallel on commonly available, low-cost hardware.

Big Sheets :

BigSheets is a browser-based analytic tool included in the InfoSphere


BigInsights Console that you use to break large amounts of unstructured data
into consumable, situation-specific business contexts.

These deep insights help you to filter and manipulate data from sheets even
further.

Intro to Big SQL :


IBM Big SQL is a high performance massively parallel processing (MPP) SQL
engine for Hadoop that makes querying enterprise data from across the
organization an easy and secure experience.

A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single
database connection or single query for best-in-class analytic capabilities.

Big SQL provides tools to help you manage your system and your databases,
and you can use popular analytic tools to visualize your data.

Big SQL's robust engine executes complex queries for relational data and
Hadoop data.

Big SQL provides an advanced SQL compiler and a cost-based optimizer for
efficient query execution.

Combining these with the massively parallel processing (MPP) engine helps distribute query execution across nodes in a cluster.
